Netflix: A Lot More than “House of Cards”

on April 22, 2015

Netflix has become a big star by moving from its successful business of lending DVDs to streaming, and then to the even greater success of its own productions. What’s less known is that Netflix has become a technological leader, teaching the rest of the industry how to run a vast system in the cloud, and that this expertise has helped make it a winner.

Last year, the company had revenues of $5.5 billion, up 25.8% from 2013. Profits were $267 million. Streaming delivered 88.9% of the business, even though as recently as 2007 Netflix’s business consisted entirely of shipping DVDs and Blu-rays.

Netflix also stands alone in another respect: it is a huge, extremely clever, and successful user of Amazon Web Services (AWS), to which it moved its operations in 2010. Every Netflix service a customer touches is handled on AWS, from signing up for a subscription, to making the service available on just about any device capable of displaying the content, to watching an episode of “Orange Is the New Black,” to data and streaming.

Big distribution, big failures. A massive distribution system built on the cloud seems to be asking for big failures. In 2011, AWS suffered a large outage that cut off service for a good number of cloud customers, including Reddit, Quora, and Foursquare. Netflix kept its service intact with almost no loss. And it wasn’t luck; Netflix has gone on to suffer virtually no losses during subsequent network problems.

I’m not sure how many operators of cloud-based services (even private clouds) use the Netflix-developed techniques. Netflix is so pleased with its processes that it is happy to share what it has learned with anyone who wants the information. Not only that, the critical software of its diagnostic and control service is freely available through GitHub.

From the beginning, Netflix was open about its intention to use AWS. John Ciancutti, then a Netflix vice president and now chief product officer at Coursera, wrote a blog post (( If you are at all interested in the Netflix effort, follow the blog. It contains a wealth of information on how to run a big cloud-based service. )) on why AWS was chosen. The four top reasons: flexibility in the architecture of a growing service, letting Netflix engineers focus on the business while Amazon provided the data center, the ability to easily adjust the size of the system even if the company itself isn’t good at predicting its needs, and a commitment to the cloud as the future.

The Simian Army. The design of Netflix operations on AWS, and the means for testing them, owe much to the warning from Amazon CTO Werner Vogels: “Everything fails. All the time.” A major accomplishment for Netflix has been a large package of routines with the quaint name of the Simian Army. It was getting organized just around the time of the 2011 AWS outage and has been expanding steadily in recent years with additional functions such as security tests. It is notable that IBM recommends Netflix’s inventions to its own cloud customers.

The oldest and one of the best-known members of the army is Chaos Monkey, a tool available since 2012 to anyone who wants to use it. It is designed to improve the system by injecting failure, teaching engineers how to make the network survive daily threats with minimal loss.

“The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables — all the while we continue serving our customers without interruption,” Yury Izrailevsky and Ariel Tseitlin wrote in the blog. “By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won’t even notice.”
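The idea is easier to see in miniature. What follows is only a toy sketch in Python, not Netflix’s tool and not its API: the names `Instance`, `ServiceGroup`, `unleash_monkey`, and `run_drill` are invented for illustration, and the “instances” are plain objects rather than real servers. It assumes a group of redundant instances, a monkey that kills one at random each round, and a recovery step that launches a replacement.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Instance:
    """A stand-in for one server in a service group (hypothetical)."""
    name: str
    healthy: bool = True


@dataclass
class ServiceGroup:
    """A toy group of redundant instances behind one service."""
    desired_size: int
    instances: list = field(default_factory=list)
    launched: int = 0

    def __post_init__(self):
        self.heal()  # launch the initial instances

    def heal(self):
        """Automatic recovery: drop dead instances, launch replacements."""
        self.instances = [i for i in self.instances if i.healthy]
        while len(self.instances) < self.desired_size:
            self.launched += 1
            self.instances.append(Instance(f"i-{self.launched:04d}"))

    def serve(self):
        """A request succeeds if at least one instance is still healthy."""
        return any(i.healthy for i in self.instances)


def unleash_monkey(group, rng):
    """Failure injection: terminate one randomly chosen healthy instance."""
    healthy = [i for i in group.instances if i.healthy]
    if not healthy:
        return None
    victim = rng.choice(healthy)
    victim.healthy = False
    return victim.name


def run_drill(rounds=20, size=3, seed=0):
    """Kill one instance per round; count the requests that still succeed."""
    rng = random.Random(seed)
    group = ServiceGroup(desired_size=size)
    served = 0
    for _ in range(rounds):
        unleash_monkey(group, rng)
        served += group.serve()  # traffic arrives while the victim is down
        group.heal()             # recovery replaces the victim
    return served, group.launched


if __name__ == "__main__":
    print(run_drill(size=3))  # redundant group
    print(run_drill(size=1))  # single point of failure
```

In this sketch a group of three instances serves every request despite losing one instance per round, while a group of one fails every request. That is the kind of weakness a drill run during business hours is meant to expose before it shows up at 3 am on a Sunday.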

I’m convinced that the Simian Army is one of the reasons behind Netflix’s excellence (of course, there is also the willingness to spend a lot of money on content and internet capacity). For whatever reason, there’s no indication Amazon itself uses the Simian Army, though that may be just part of Amazon’s secrecy. In any event, I regularly watch both Netflix and Amazon Prime, and the quality of Netflix streaming is better. The difference may come down to the quality of the network testing.