For the past several years Netflix developers have been using self-service tools to build and deploy hundreds of applications and services to the Amazon cloud. One of those tools is Asgard, a web interface for application deployments and cloud management.
Asgard is named for the home of the Norse god of thunder and lightning, because Asgard is where Netflix developers go to control the clouds. I’m happy to announce that Asgard has now been open sourced on GitHub and is available for download and use by anyone. All you’ll need is an Amazon Web Services account. Like other open source Netflix projects, Asgard is released under the Apache License, Version 2.0. Please feel free to fork the project and make improvements to it.
Some of the information in this blog post is also published in the following presentations. Note that Asgard was originally named the Netflix Application Console, or NAC.
- Asgard, the Grails App that Deploys Netflix to the Cloud (Slides 2012)
- Building Cloud Tools for Netflix (Slides 2011)
- Building Cloud Tools for Netflix (Video 2011)
Visual Language for the Cloud
To help people identify various types of cloud entities, Asgard uses the Tango open source icon set, with a few additions. These icons establish a visual language that helps people understand what they are looking at as they navigate. Tango icons look familiar because they are also used by Jenkins, Ubuntu, MediaWiki, FileZilla, and GIMP. Here is a sampling of Asgard's cloud icons.
Cloud Model
The Netflix cloud model includes concepts that AWS does not support directly: Applications and Clusters.
Application
Below is a diagram of some of the Amazon objects required to run a single front-end application such as Netflix’s autocomplete service.
Here’s a quick summary of the relationships of these cloud objects.
- An Auto Scaling Group (ASG) can attach zero or more Elastic Load Balancers (ELBs) to new instances.
- An ELB can send user traffic to instances.
- An ASG can launch and terminate instances.
- For each instance launch, an ASG uses a Launch Configuration.
- The Launch Configuration specifies which Amazon Machine Image (AMI) and which Security Groups to use when launching an instance.
- The AMI contains all the bits that will be on each instance, including the operating system, common infrastructure such as Apache and Tomcat, and a specific version of a specific Application.
- Security Groups can restrict the traffic sources and ports to the instances.
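The relationships above can be sketched as a small data model. This is only an illustration of how the objects refer to one another, not Asgard's code or the AWS API; the class names, field names, and the IDs in the usage note are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LaunchConfiguration:
    ami_id: str                 # which Amazon Machine Image to boot
    security_groups: List[str]  # Security Groups applied to each new instance


@dataclass
class AutoScalingGroup:
    name: str
    launch_configuration: LaunchConfiguration
    load_balancers: List[str] = field(default_factory=list)  # zero or more ELBs
    instances: List[str] = field(default_factory=list)

    def launch_instance(self, instance_id: str) -> dict:
        """Record a new instance and describe how it would be launched."""
        self.instances.append(instance_id)
        return {
            "id": instance_id,
            # The AMI and Security Groups come from the Launch Configuration.
            "ami": self.launch_configuration.ami_id,
            "security_groups": list(self.launch_configuration.security_groups),
            # The ASG attaches its ELBs to every instance it launches.
            "load_balancers": list(self.load_balancers),
        }

    def terminate_instance(self, instance_id: str) -> None:
        """Remove an instance from the group."""
        self.instances.remove(instance_id)
```

For example, an `AutoScalingGroup` built with a `LaunchConfiguration` whose `ami_id` is a made-up value like `"ami-12345678"` would report that same AMI for every instance returned by `launch_instance`.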
When there are large numbers of those cloud objects in a service-oriented architecture (such as the one at Netflix), it’s important for a user to be able to find all the relevant objects for their particular application. Asgard uses an application registry in SimpleDB and naming conventions to associate multiple cloud objects with a single application. Each application has an owner and an email address to establish who is responsible for the existence and state of the application's associated cloud objects.
Asgard limits the set of permitted characters in the application name so that the names of other cloud objects can be parsed to determine their association with an application.
Here is a screenshot of Asgard showing a filtered subset of the applications running in our production account in the Amazon cloud in the us-east-1 region:
Cluster
On top of the Auto Scaling Group construct supplied by Amazon, Asgard infers an object called a Cluster which contains one or more ASGs. The ASGs are associated by naming convention. When a new ASG is created within a cluster, an incremented version number is appended to the cluster's "base name" to form the name of the new ASG. The Cluster provides Asgard users with the ability to perform a deployment that can be rolled back quickly.
Example: During a deployment, cluster obiwan contains ASGs obiwan-v063 and obiwan-v064. Here is a screenshot of a cluster in mid-deployment.
The old ASG is “disabled”, meaning it is not taking traffic but remains available in case a problem occurs with the new ASG. Traffic comes from ELBs and/or from Discovery, an internal Netflix service that is not yet open sourced.
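The naming convention for ASGs in a cluster can be sketched as follows. This is a minimal illustration, not Asgard's actual parser: the function names are hypothetical, the three-digit `-vNNN` suffix is inferred from the obiwan-v063 example, and starting at v000 when no versioned ASG exists is an assumption. Wrapping around after v999 is not handled.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed format: "<cluster base name>-v<three digits>", e.g. "obiwan-v063".
_VERSIONED = re.compile(r"^(?P<base>.+)-v(?P<version>\d{3})$")


def parse_asg_name(asg_name: str) -> Tuple[str, Optional[int]]:
    """Split an ASG name into (cluster base name, version or None)."""
    match = _VERSIONED.match(asg_name)
    if match is None:
        return asg_name, None  # unversioned ASG: the name is the cluster name
    return match.group("base"), int(match.group("version"))


def next_asg_name(existing_names: Iterable[str], cluster: str) -> str:
    """Name the next ASG in a cluster by incrementing the highest version."""
    versions = [
        version
        for base, version in map(parse_asg_name, existing_names)
        if base == cluster and version is not None
    ]
    next_version = max(versions) + 1 if versions else 0
    return f"{cluster}-v{next_version:03d}"


print(next_asg_name(["obiwan-v063"], "obiwan"))  # prints obiwan-v064
```

Because the cluster name is recovered purely from the ASG name, this kind of grouping needs no extra registry entry per ASG.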
Deployment Methods
Fast Rollback
One of the primary features of Asgard is the ability to use the cluster screen shown above to deploy a new version of an application in a way that can be reversed at the first sign of trouble. This method requires more instances to be in use during deployment, but it can greatly reduce the duration of service outages caused by bad deployments.
This animated diagram shows a simplified process of using the Cluster interface to try out a deployment and roll it back quickly when there is a problem:
The animation illustrates the following deployment use case:
- Create the new ASG obiwan-v064
- Enable traffic to obiwan-v064
- Disable traffic on obiwan-v063
- Monitor results and notice that things are going badly
- Re-enable traffic on obiwan-v063
- Disable traffic on obiwan-v064
- Analyze logs on bad servers to diagnose problems
- Delete obiwan-v064
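The steps above can be sketched as a small in-memory simulation. It does not call AWS or Asgard; the `Cluster` class, its methods, and the `is_healthy` callback are hypothetical stand-ins for the cluster screen and for whatever monitoring you use. Following the list above, the new ASG is modeled as created without traffic and enabled in a separate step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Cluster:
    name: str
    # Maps ASG name -> whether that ASG currently receives traffic.
    traffic_enabled: Dict[str, bool] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def create_asg(self, asg: str) -> None:
        self.traffic_enabled[asg] = False
        self.log.append(f"create {asg}")

    def enable(self, asg: str) -> None:
        self.traffic_enabled[asg] = True
        self.log.append(f"enable {asg}")

    def disable(self, asg: str) -> None:
        self.traffic_enabled[asg] = False
        self.log.append(f"disable {asg}")

    def delete(self, asg: str) -> None:
        del self.traffic_enabled[asg]
        self.log.append(f"delete {asg}")

    def serving(self) -> List[str]:
        """Names of the ASGs that currently take traffic."""
        return sorted(a for a, on in self.traffic_enabled.items() if on)


def deploy_with_fast_rollback(
    cluster: Cluster, old_asg: str, new_asg: str,
    is_healthy: Callable[[str], bool],
) -> bool:
    """Shift traffic to new_asg; shift it back if monitoring reports trouble."""
    cluster.create_asg(new_asg)
    cluster.enable(new_asg)
    cluster.disable(old_asg)   # old ASG stays around, just without traffic
    if is_healthy(new_asg):
        return True
    cluster.enable(old_asg)    # roll back: the old instances are still running
    cluster.disable(new_asg)
    # Log analysis on the bad servers would happen here, before deletion.
    cluster.delete(new_asg)
    return False
```

The rollback is fast because it only flips traffic back to instances that never stopped running, which is also why this method needs more instances during a deployment.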
Rolling Push
Asgard also provides an alternative deployment system called a rolling push. This is similar to a conventional data center deployment of a cluster of application servers. Only one ASG is needed. Old instances get gracefully deleted and replaced by new instances one or two at a time until all the instances in the ASG have been replaced. Rolling pushes are useful:
- If an ASG's instances are sharded so each instance has a distinct purpose that should not be duplicated by another instance.
- If the clustering mechanisms of the application (such as Cassandra) cannot support sudden increases in instance count for the cluster.
Downsides to a rolling push:
- Replacing instances in small batches can take a long time.
- Reversing a bad deployment can take a long time.
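The batch-by-batch replacement in a rolling push can be sketched like this. It is a simplified illustration rather than Asgard's implementation: `rolling_push` and the `launch_replacement` callback are hypothetical names, and real replacements would wait for each new instance to become healthy before moving on.

```python
from typing import Callable, Iterator, List


def rolling_push(
    instances: List[str],
    launch_replacement: Callable[[str], str],
    batch_size: int = 1,
) -> Iterator[List[str]]:
    """Replace instances in small batches, yielding the fleet after each batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    fleet = list(instances)  # work on a copy; the caller's list is untouched
    for start in range(0, len(fleet), batch_size):
        stop = min(start + batch_size, len(fleet))
        for i in range(start, stop):
            # Terminate the old instance and put its replacement in the same slot.
            fleet[i] = launch_replacement(fleet[i])
        yield list(fleet)


steps = list(
    rolling_push(
        ["old-1", "old-2", "old-3"],
        lambda old: old.replace("old", "new"),
        batch_size=2,
    )
)
# steps == [["new-1", "new-2", "old-3"], ["new-1", "new-2", "new-3"]]
```

The number of batches grows with the size of the ASG, which is where both downsides come from: pushing forward and pushing back each take one full pass over the instances.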
Task Automation
Several common tasks are built into Asgard to automate the deployment process. Here is an animation showing a time-compressed view of a 14-minute automated rolling push in action:
Auto Scaling
Netflix focuses on the ASG as the primary unit of deployment, so Asgard also provides a variety of graphical controls for modifying an ASG and setting up metrics-driven auto scaling when desired.
CloudWatch metrics can be selected from the defaults provided by Amazon, such as CPUUtilization, or can be custom metrics published by your application using a library like Servo for Java.
Why not the AWS Management Console?
The AWS Management Console has its uses for someone with your Amazon account password who needs to configure something Asgard does not provide. However, for everyday large-scale operations, the AWS Management Console has not yet met the needs of the Netflix cloud usage model, so we built Asgard instead. Here are some of the reasons.
Hide the Amazon keys
Netflix grants its employees a lot of freedom and responsibility, including the rights and duties of enhancing and repairing production systems. Most of those systems run in the Amazon cloud. Although we want to enable hundreds of engineers to manage their own cloud apps, we prefer not to give all of them the secret keys to access the company’s Amazon accounts directly. Providing an internal console allows us to grant Asgard users access to our Amazon accounts without telling too many employees the shared cloud passwords. This strategy also saves us from needing to assign and revoke hundreds of Identity and Access Management (IAM) cloud accounts for employees.
Auto Scaling Groups
As of this writing the AWS Management Console lacks support for Auto Scaling Groups (ASGs). Netflix relies on ASGs as the basic unit of deployment and management for instances of our applications. One of our goals in open sourcing Asgard is to help other Amazon customers make greater use of Amazon’s sophisticated auto scaling features. ASGs are a big part of the Netflix formula to provide reliability, redundancy, cost savings, clustering, discoverability, ease of deployment, and the ability to roll back a bad deployment quickly.
Enforce Conventions
Like any growing collection of things users are allowed to create, the cloud can easily become a confusing place full of expensive, unlabeled clutter. Part of the Netflix Cloud Architecture is the use of registered services associated with cloud objects by naming convention. Asgard enforces these naming conventions in order to keep the cloud a saner place that is possible to audit and clean up regularly as things get stale, messy, or forgotten.
Logging
So far the AWS console does not expose a log of recent user actions on an account. This makes it difficult to determine whom to call when a problem starts, and what recent changes might relate to the problem. Lack of logging is also a non-starter for any sensitive subsystems that legally require auditability.
Integrate Systems
Having our own console empowers us to decide when we want to add integration points with our other engineering systems such as Jenkins and our internal Discovery service.
Costs
When using cloud services, it’s important to keep a lid on your costs. As of June 5, 2012, Amazon provides a way to track your account’s charges frequently. This data is not exposed through Asgard as of this writing, but someone in your company should keep track of your cloud costs regularly. See http://aws.typepad.com/aws/2012/06/new-programmatic-access-to-aws-billing-data.html
Starting up Asgard does not initially cause you to incur any Amazon charges, because Amazon has a free tier for SimpleDB usage and no charges for creating Security Groups, Launch Configurations, or empty Auto Scaling Groups. However, as soon as you increase the size of an ASG above zero Amazon will begin charging you for instance usage, depending on your status for Amazon’s Free Usage Tier. Creating ELBs, RDS instances, and other cloud objects can also cause you to incur charges. Become familiar with the costs before creating too many things in the cloud, and remember to delete your experiments as soon as you no longer need them. Your Amazon costs are your own responsibility, so run your cloud operations wisely.
Cost references: http://aws.amazon.com/ec2/pricing/
Feature Films
By extraordinary coincidence, Thor and Thor: Tales of Asgard are now available to watch on Netflix streaming.
Conclusion
Asgard has been one of the primary tools for application deployment and cloud management at Netflix for years. By releasing Asgard to the open source community we hope more people will find the Amazon cloud and Auto Scaling easier to work with, even at large scale like Netflix. More Asgard features will be released regularly, and we welcome participation by users on GitHub.
Follow the Netflix Tech Blog and the @NetflixOSS Twitter feed for more open source components of the Netflix Cloud Platform.
If you're interested in working with us to solve more of these interesting problems, have a look at the Netflix jobs page to see if something might suit you. We're hiring!
Related Resources
Asgard
Netflix Cloud Platform
- Netflix Open Source Projects
- Auto Scaling in the Amazon Cloud
- Servo (Publish application metrics for auto scaling)
- Netflix Cloud Architecture Slides
- Netflix Operations: Part I, Going Distributed
- @NetflixOSS Twitter Feed