Netflix cloud migration targets four 9s uptime
February 12, 2016
By Colin Mann
Netflix has confirmed it has finally completed its cloud migration and shut down the last remaining data centre bits used by its streaming service, having begun the process in August 2008, when it experienced a major database corruption and for three days could not ship DVDs to its members. It suggests the development moves it nearer to its desired goal of four nines of service uptime.
According to Yury Izrailevsky, Vice President, Cloud and Platform Engineering, August 2008 was when Netflix realised that it had to move away from vertically scaled single points of failure, such as relational databases in its datacentre, towards highly reliable, horizontally scalable, distributed systems in the cloud.
Writing on the company blog, Izrailevsky says that Netflix chose Amazon Web Services (AWS) as its cloud provider because it provided the greatest scale and the broadest set of services and features. “The majority of our systems, including all customer-facing services, had been migrated to the cloud prior to 2015. Since then, we’ve been taking the time necessary to figure out a secure and durable cloud path for our billing infrastructure as well as all aspects of our customer and employee data management.”
“Moving to the cloud has brought Netflix a number of benefits,” advises Izrailevsky. “We have eight times as many streaming members than we did in 2008, and they are much more engaged, with overall viewing growing by three orders of magnitude in eight years.”
“The Netflix product itself has continued to evolve rapidly, incorporating many new resource-hungry features and relying on ever-growing volumes of data. Supporting such rapid growth would have been extremely difficult out of our own data centres; we simply could not have racked the servers fast enough,” he admits.
“Elasticity of the cloud allows us to add thousands of virtual servers and petabytes of storage within minutes, making such an expansion possible. On January 6, 2016, Netflix expanded its service to over 130 new countries, becoming a truly global Internet TV network. Leveraging multiple AWS cloud regions, spread all over the world, enables us to dynamically shift around and expand our global infrastructure capacity, creating a better and more enjoyable streaming experience for Netflix members wherever they are,” he says.
“We rely on the cloud for all of our scalable computing and storage needs — our business logic, distributed databases and big data processing/analytics, recommendations, transcoding, and hundreds of other functions that make up the Netflix application. Video is delivered through Netflix Open Connect, our content delivery network that is distributed globally to efficiently deliver our bits to members’ devices,” he advises.
Izrailevsky notes that the cloud also allowed Netflix to increase its service availability significantly. “There were a number of outages in our data centres, and while we have hit some inevitable rough patches in the cloud, especially in the earlier days of cloud migration, we saw a steady increase in our overall availability, nearing our desired goal of four nines of service uptime. Failures are unavoidable in any large scale distributed system, including a cloud-based one. However, the cloud allows one to build highly reliable services out of fundamentally unreliable but redundant components. By incorporating the principles of redundancy and graceful degradation in our architecture, and being disciplined about regular production drills using Simian Army, it is possible to survive failures in the cloud infrastructure and within our own systems without impacting the member experience,” he reports.
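To put the availability figures in perspective, the arithmetic behind "four nines" and behind building reliable services from redundant components can be sketched in a few lines of Python. This is an illustrative calculation only, not anything Netflix has published: the function names are hypothetical, the 99 per cent component figure is an arbitrary example, and the redundancy formula assumes that replicas fail independently and that the service stays up as long as at least one replica is up.

```python
# Illustrative availability arithmetic; names and example figures are
# assumptions for this sketch, not Netflix data.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Yearly downtime budget implied by an availability fraction (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def redundant_availability(component_availability: float, replicas: int) -> float:
    """Availability of a service that is up if ANY replica is up.

    Assumes replica failures are statistically independent, which real
    systems only approximate.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - component_availability) ** replicas


if __name__ == "__main__":
    # Four nines (99.99%) leaves under an hour of downtime per year.
    print(f"{downtime_minutes_per_year(0.9999):.2f} minutes/year")
    # Two independent 99% components together reach roughly four nines.
    print(f"{redundant_availability(0.99, 2):.4%}")
```

Under these assumptions, four nines corresponds to a downtime budget of about 52.6 minutes a year, and two independent components that are each available 99 per cent of the time combine to roughly 99.99 per cent. Correlated failures, such as a whole region going down, break the independence assumption, which is one reason regular failure drills matter.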
According to Izrailevsky, cost reduction was not the main reason Netflix decided to move to the cloud. “However, our cloud costs per streaming start ended up being a fraction of those in the data centre — a welcome side benefit. This is possible due to the elasticity of the cloud, enabling us to continuously optimise instance type mix and to grow and shrink our footprint near-instantaneously without the need to maintain large capacity buffers. We can also benefit from the economies of scale that are only possible in a large cloud ecosystem,” he acknowledges.
“Given the obvious benefits of the cloud, why did it take us a full seven years to complete the migration?” he asks. “The truth is, moving to the cloud was a lot of hard work, and we had to make a number of difficult choices along the way. Arguably, the easiest way to move to the cloud is to forklift all of the systems, unchanged, out of the data centre and drop them in AWS. But in doing so, you end up moving all the problems and limitations of the data centre along with it. Instead, we chose the cloud-native approach, rebuilding virtually all of our technology and fundamentally changing the way we operate the company. Architecturally, we migrated from a monolithic app to hundreds of micro-services, and denormalised our data model, using NoSQL databases. Budget approvals, centralised release coordination and multi-week hardware provisioning cycles made way to continuous delivery, engineering teams making independent decisions using self-service tools in a loosely coupled DevOps environment, helping accelerate innovation. Many new systems had to be built, and new skills learned. It took time and effort to transform Netflix into a cloud-native company, but it put us in a much better position to continue to grow and become a global TV network,” he suggests.
“Netflix streaming technology has come a long way over the past few years, and it feels great to finally not be constrained by the limitations we’ve previously faced. As the cloud is still quite new to many of us in the industry, there are many questions to answer and problems to solve. Through initiatives such as Netflix Open Source, we hope to continue collaborating with great technology minds out there and together address all of these challenges,” he concludes.