Saturday, November 28, 2015

Cloud AWS Infrastructure Vs. Physical Infrastructure



The Framework
Cloud
There are several kinds of Cloud offering, and I will stick to the AWS kind, which is infrastructure-oriented, rather than Google-type services (GAE, Google App Engine, to mention just one), which offer a runtime environment for web applications developed against the provided APIs (much like a framework). With the latter, there is no infrastructure management on the client's side (the client being the one holding the credit card): you upload your application through the provided APIs and leave the entire infrastructure to the service provider. That does not make it any less Cloud computing; it is simply a Cloud service that is PaaS-oriented rather than infrastructure-oriented.
XaaS
Several abstraction layers: each vendor targets its service at one or more of them.
Physical infrastructure
As far as physical infrastructure is concerned, I will consider both self-hosted infrastructure and infrastructure supported by a hosting provider, and likewise infrastructures running directly on hardware as well as those running on virtualized environments. Cloud computing is also based on virtualization, but what interests us here is not the technology itself so much as the way it is delivered to clients (you). If you have an ESX (VMware) host, for example, you can start up instances from a console just as you do with EC2, but that involves "only" a hypervisor partitioning physical servers into several virtual machines: you still have to buy the equipment (blades, etc.), configure the network, and so on. We will come back to these details later.
Cloud computing = Systems administrators marked down?
Yes, the sales are on! Looking for a sweater, a jacket, … a systems administrator? I have often come across people who think that the Cloud (AWS, in this case) will let them do without an experienced systems administrator and build an infrastructure with lesser skills. The answer is obvious: WRONG!
Perhaps a clever sales pitch can convince you that the various services are so user-friendly you can do it all yourself, and that prepackaged AMIs (Amazon Machine Images) will make life easy. But it goes without saying that once you have started your EC2 instances, you connect to the machines (SSH / port 22 for Linux, RDP / port 3389 for Windows), and then you still have to configure them, fine-tune them, and so on.
What applies to systems administrators facing AWS applies equally to systems architects facing Cloud services that operate at higher layers (PaaS such as Google App Engine). You still need someone experienced in the field who can design the required infrastructure: the tool may change, but the skills must still be there. Note, however, that if you use GAE you no longer need a systems administrator for the application. When a Cloud vendor offers a service at a given layer (HaaS, IaaS, PaaS, etc.), there is no longer any need for people to deal with the layers below it; in exchange, you accept the framework the vendor imposes.
The systems administrator cannot be done away with, but the role is changing: he is becoming more and more of a developer. Being able to bring up resources on the fly means infrastructure management can be scheduled and automated through scripts that call the APIs Amazon provides for its web services. Everything at Amazon is a web service: EC2, EBS, SQS, S3, SimpleDB. The only operations that are not SOAP or REST are when you connect directly to EC2 instances that you brought up via web service, or when EC2 instances talk to EBS volumes that you brought up via … I'll let you guess.
So rather than going into the computer room to add a disk or plug in a server (as with a physical architecture), or picking up the phone to ask the host to do it (fetch a coffee… call again… take a Xanax or a Prozac…), the administrator requests resources via a script in Ruby or Python. You can then push the automation of a Cloud infrastructure much, much further with a set of scripts and tools.
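To give a concrete idea, here is what such a request boils down to with today's AWS command-line interface (a minimal sketch; the AMI ID, key-pair name and instance ID are placeholders):
# bring up an instance entirely from the command line, no computer room involved
aws ec2 run-instances --image-id <ami-id> --instance-type m1.small --key-name <key-pair-name> --count 1
# release it when you are done: you stop paying once it is terminated
aws ec2 terminate-instances --instance-ids <instance-id>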
Puppet (Reductive Labs) and Capistrano logos
The systems administrator’s craft is therefore evolving between a physical infrastructure and an AWS-type Cloud infrastructure: he is becoming more and more of a developer. But the systems administrator remains essential nevertheless.
Elastic is fantastic!
As I mentioned earlier, one of the crucial differences between the two types of infrastructure is the flexibility and dynamism the Cloud solution provides compared to a traditional physical architecture (virtualized or not). That means eliminating the logistics lead time: purchasing equipment, installing the OS, connecting to the network (the physical cabling and the configuration of the interfaces), and so on. Likewise, when you no longer need a resource (EC2 instance, EBS volume, S3 object, etc.), you return it to the pool: it is reinitialized so that none of your data can be retrieved, and made available again for the next web service call.
You also have full control over certain elements such as the security groups (firewall) set for each instance… and that is very useful. It is very practical, particularly compared to a traditional hosting provider: do you remember how long it took the last time you had to change the firewall rules?
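For example (a minimal sketch with placeholder names), opening a port becomes one API call rather than one support ticket:
# allow HTTPS traffic into every instance covered by this security group
aws ec2 authorize-security-group-ingress --group-name <group-name> --protocol tcp --port 443 --cidr 0.0.0.0/0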
But it's not simply about the pros and cons of purchasing servers versus running instances. AWS is backed by datacenters that are already organized and tested on an industrial scale. All the standards that must be met (fire protection, computer cooling, redundant electrical supply, physical security against break-ins, spreading the hardware across two or more physical datacenters for disaster recovery, etc.) entail a colossal initial investment, and even once everything is installed, you still won't be able to recreate the same quality within your company (99% of the time, in any case). You can, however, get all of that, or part of it, from a traditional hosting provider.
Then there are the more software-level services: data durability management (redundancy/replication, as on EBS and S3), accessibility and high availability, hardware monitoring (to be alerted when physical components show signs of weakening), failure procedures, etc. I'll let you read The 9 Principles of S3 (French version) to appreciate just how many concepts are involved. You won't get all that from a traditional hosting provider (and forget about having it at home). The quality of the S3 service is a huge advantage, especially at current pricing… Speaking of which, let's talk about prices!
The cost
There are no fixed rules. With the Cloud, you pay for a resource by the hour, and when you stop using it, you stop paying. Instances (Linux only at the time I'm writing this article, but Windows will follow) can also be reserved on AWS for one year or three years: these are known as Reserved Instances. You pay a one-time fee up front and then a discounted hourly usage rate, which gives you a break-even point once your usage exceeds a certain percentage of the year (or three years). For more information, click here. To estimate what your infrastructure will cost, take advantage of the new calculator provided by Amazon: the Simple Monthly Calculator. In all cases, one part of the bill is tied to hourly usage and another to your traffic and stored volume.
Simple Monthly Calculator
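To make the break-even point concrete, take some illustrative figures (not current prices): on-demand at $0.085/hour versus a one-year reservation at $227.50 up front plus $0.03/hour. The reservation wins once the hourly discount has amortized the up-front fee:
# hours needed to amortize the fee: 227.50 / (0.085 - 0.03), converted to days
echo "scale=0; 227.50 / 0.055 / 24" | bc    # about 172 days of uptime over the year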
You can compare this with the cost of your local infrastructure or your hosting provider's. Cloud pricing is particularly attractive in the following cases:
  • It is unbeatable for POCs, demonstrations and presentations, or load/validation tests of an architecture.
  • It is very attractive for applications or APIs built on a SaaS-type economic model, where you only spend money on resources when clients are paying to use those APIs.
  • It is a good fit for social applications, such as those found on Facebook, which can take off overnight thanks to the social media phenomenon and experience a boom (or a drop) in traffic.
  • It is also cost-effective when launching a new company, or a specific project within a larger entity, when you do not wish to invest heavily in logistics right at the start.
For all the other specific cases, you will have to do your own calculations.
Slacking off? No, of course not …
Usually, whatever type of infrastructure you have, the same components and mechanisms should be put in place. It must be acknowledged, however, that the cosy comfort of an infrastructure hosted "at home" often leads to a certain lack of rigor. Because Amazon's EC2 instances are dynamic and volatile by nature, AWS compels you to install mechanisms that should be standard anyway: taking failure and disaster recovery plans seriously, and identifying the important data so as to ensure its durability (EBS volumes, S3 backups, etc.).

Network and shared resources
The network is an important feature… in more than one way! The Cloud network configuration is already prepared for you, which is convenient. But it also means you don't control it, so you cannot monitor it yourself and therefore cannot diagnose, for example, the causes of a slow-down. There is a similar lack of transparency around shared resources in the Cloud: at our level it is impossible to estimate the impact of the other tenants of a resource we share (the physical machine hosting our EC2 instance, the physical device behind our network-attached EBS volume, the network bandwidth, etc.). The only network monitoring possible is limited to the input/output of the instance itself (an EBS volume is a network-attached device, so there is no way to check the state of the connection: you will have to make do with disk I/O). Whatever monitoring you can do is concentrated on the EC2 virtual instance itself (the instance is managed by a hypervisor, based on Xen virtualization). Total visibility over our infrastructure is therefore not possible in the Cloud. This must be taken into account… and accepted, if you wish to implement your architecture there.
AWS: Isolation of the EC2 Instance (from aws.amazon.com)
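Concretely, about all you can observe of an EBS volume from inside the instance is its block-device statistics, for example with iostat from the sysstat package:
# extended per-device I/O statistics, refreshed every 5 seconds
iostat -x 5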
Likewise, multicast is not possible as a communication protocol: bear that in mind for certain clusters. The constraint is understandable, given the far-reaching impact a mismanaged multicast can have.
That is inherent to the way the Cloud operates: its conveniences mask a certain number of elements you no longer control.
On-call support, monitoring, BCP (Business Continuity Planning) and penalties
One question I am frequently asked is: "Does Amazon provide on-call support for the application and infrastructure you run on top of AWS?" The answer is no. AWS must be seen as a set of tools offered by Amazon; Amazon ensures that those tools are always up and working properly, and maintains them and develops their features. But you are responsible for what you build with the tools (they do not hold the private keys to your EC2 instances anyway…), so there is no monitoring / on-call support / BCP (Business Continuity Planning) package.
Unlike the specific contracts you may sign with a hosting provider, you must provide for these elements yourself; on-call support, for example, can be outsourced to a facilities management company. Ditto for monitoring: Amazon offers Amazon CloudWatch, but the information (CPU percentage, bytes read/written to disk, bytes in/out on the network) is too sparse for genuine monitoring of the kind provided by Centreon/Nagios, Cacti, Zabbix or Munin. CloudWatch data feeds the Auto Scaling function, but it does not replace true monitoring. Some traditional hosting providers bundle monitoring with their services.
AWS CloudWatch
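To give an idea of that granularity, here is how you would pull the CPU metric with today's AWS CLI (the instance ID and dates are placeholders):
# average CPU utilization of one instance, in 5-minute periods
aws cloudwatch get-metric-statistics --namespace AWS/EC2 --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=<instance-id> \
    --start-time 2015-11-28T00:00:00 --end-time 2015-11-28T12:00:00 \
    --period 300 --statistics Average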
As far as BCP and penalties are concerned, it amounts to hosting things internally: you are responsible for your resources and you manage failure and disaster recovery within the limits of what the tool (AWS) allows. This is where understanding the overall architecture of Amazon's services matters: if you don't understand how the tool works, you will not be able to implement an effective BCP. As for penalties, there is nothing unusual: it simply means a smaller bill for the month in which you fall into the "service unavailable" category as defined by Amazon's criteria. It has nothing to do with penalties indexed on the revenue lost during the downtime.
It is imperative to treat Amazon Web Services as a tool. Even with paid AWS support (which you can call for tool-level questions and issues), you will not get the full contractual coverage you would have with a more traditional hosting provider; broadly speaking, you remain responsible for your architecture at every level (including the security of your instances: don't lose your keys!).
Security
Security… a frequently taboo topic as soon as Cloud computing comes up. I don't mean the integrity of stored data, or even access management on the virtual instances we are responsible for; I mean the confidentiality of the data stored on the various services (S3, EBS, EC2, SQS, etc.) and of the data in transit between those services.
The first key point is that the level of security in Amazon's datacenters, not only physical but, just as importantly, programmatic, will still be streets ahead of the average corporate computer room, and even of the smallest hosting providers' datacenters. Firstly because this is Amazon's business: a security problem revealed in their infrastructure would have immediate consequences in terms of user reaction (and thus in terms of business). It is an essential point, especially as Amazon has to prove itself in this sensitive area and is therefore obliged to do its best to win customers over. Furthermore, the sheer size of the operation lets Amazon pool its security investments and make them pay for themselves, which is inconceivable for smaller companies or those that do not specialize in the field. Amazon therefore has both the means and the obligation to ensure security.
What provokes my scepticism is that the Cloud is not easily audited: you have to have faith. It is no riskier than placing your trust in a traditional hosting provider or in your own internal teams… but it's brand new, so be careful! A normal reaction. Perhaps it is also an opportunity to work on security at our own level, something often neglected through over-confidence or lack of interest. The first task is to encrypt the information, both at rest and in transit; remember to account for the CPU overhead of encryption and decryption (see the sketch just after the list below). The second is to fully understand the various security mechanisms of Amazon's services:
AWS Multi-Factor Authentication
  • Access Credentials: Access Keys, X.509 Certificates & Key Pairs
  • Sign-In Credentials: E-mail Address, Password & AWS Multi-Factor Authentication Device
  • Account Identifiers: AWS Account ID & Canonical User ID
Next you must select the people who’ll be authorized to access the different security keys.
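As a minimal sketch of encryption at rest (bucket and file names are placeholders), a recent AWS CLI can ask S3 to encrypt an object server-side at upload time:
# upload a file and have S3 store it encrypted with AES-256
aws s3 cp <local-file> s3://<bucket-name>/<key> --sse AES256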
Conclusion
This first part clearly shows how infrastructure management duties are evolving: physical resources are handled through APIs, while the underlying mechanisms that ensure data durability and service availability, right down to server power supply and the physical security of datacenters, are all taken care of transparently. The end result: all you should "see" is an API talking to a remote server. That is the difference from a physical infrastructure. The virtualization underlying AWS is something we have known and used for some time, so this is not so much a technical revolution (even though I don't deny the complexity of implementing and supporting it) as a service offering, which is where the real added value lies, paired with a new pay-as-you-use "aaS" (as a Service) economic model. This has enabled the emergence of applications (such as the games found on social networks) that only a few years ago would have been compromised by the initial investment required.
The facilities provided by Cloud computing inevitably come with some loss of control and visibility over parts of the infrastructure, especially the network. That is the price you pay; whether it is negligible or truly problematic depends entirely on your requirements.
AWS should be viewed as a complete tool, but one that does not excuse you from applying best practices or running all the standard components of an infrastructure: log server, monitoring, BCP, configuration manager, etc. All these elements are, and will remain, your responsibility. Don't be naïve about it: since AWS offers HaaS and IaaS, you will still need a competent systems administrator, and in particular one who fully understands the AWS architecture (otherwise you may be disappointed); if you switch to GAE (Google App Engine), you will still need an architect who fully understands the GAE architecture, and so on. The craft is constantly evolving.
As for AWS security, I am reasonably confident. It must be said, first, that your information and data are probably less secure inside your own company than entrusted to Amazon (in most cases, anyway; I shouldn't generalize). AWS's exposure on the Net and Amazon's stake in this business mean that Amazon takes security very seriously. Furthermore, a large part of that security is in your hands (key management, etc.), and believe me, that is surely the riskiest part. When transferring and storing data, think "encryption".


10 Things You Should Know About AWS

1) Query AWS resource metadata
 
Can't remember the EBS-optimized I/O throughput of a c1.xlarge instance? How about the size limit of an S3 object on a single PUT? awsnow.info is the answer to all of your AWS resource-metadata questions. Interested in integrating awsnow.info with your application? You're in luck: there's now a REST API as well!
Note:  These are default soft limits and will vary by account.
2) Tame your S3 buckets
 
Delete an entire S3 bucket with a single CLI command:  
aws s3 rb s3://<bucket-name> --force
Recursively copy a local directory to S3:
aws s3 cp <local-dir-name> s3://<bucket-name> --region <region-name> --recursive
3) Understand AWS cross-region dependencies
 
A PagerDuty-outage blog post from earlier this summer revealed an interesting cross-region dependency that violates conventional thinking regarding region isolation in AWS.
Summary: degradation of a common peering point in Northern California between the us-west-1 and us-west-2 regions led to the simultaneous loss of two AZs, one in each region. Anyone relying on that deployment configuration for things like HA, quorum (ZooKeeper, in PagerDuty's example), or load balancing was hosed for those 86 minutes on April 13, 2013.
4) Use ZFS with EBS
 
Two great tastes come together.
Per the AWS documentation, EBS volumes can expect an annual failure rate (AFR) of between 0.1% and 0.5%, compared with commodity hard disks at around 4%, where failure means complete loss of the volume.
To protect yourself against this (albeit small) risk, you can pool multiple EBS volumes with ZFS in a RAIDZ configuration. The topic is relatively undocumented, but Chip Schweiss has a great blog post, along with some useful scripts, covering it.
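A minimal sketch, assuming three EBS volumes are already attached to the instance (device names depend on how you attached them):
# pool three EBS volumes into a single RAIDZ vdev, then carve out a filesystem
sudo zpool create datapool raidz xvdf xvdg xvdh
sudo zfs create datapool/data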
5) Stripe your RDS disks for better performance
 
There's an easy trick for maximizing the performance of your MySQL and Oracle RDS instances: initially provision the instance at the minimum size possible, then slowly grow the storage in 5GB increments. Each increment creates an additional disk stripe, which increases I/O and reduces seek time.
This, and many other optimization tips, can be found in the ex-AWS-engineer IAMA.
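Sketched with the AWS CLI (the identifier and sizes are placeholders; each step waits for the previous resize to finish):
# grow storage in 5GB steps; each increment may add another stripe
for size in 105 110 115; do
  aws rds modify-db-instance --db-instance-identifier <db-id> --allocated-storage $size --apply-immediately
  aws rds wait db-instance-available --db-instance-identifier <db-id>
done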
6) Avoid noisy neighbors
 
In any multi-tenant, non-dedicated, virtualized environment like AWS, you will certainly experience the noisy-neighbor effect, measured by CPU steal. CPU steal is the percentage of time your virtualized guest is involuntarily waiting for physical host CPU held by another guest. Great explanations of CPU steal can be found on Stackdriver's blog here and here.
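You can watch for it from inside the instance; with vmstat, for example, the last column ("st") is the share of CPU time stolen by other tenants:
# sample CPU counters every 5 seconds, three times; watch the "st" column
vmstat 5 3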
When using EBS-backed EC2 instances, you can flee a noisy environment by simply stopping and starting the instance.  Of course, you’ll need to make sure your healthcheck policies don’t terminate the instance before it restarts.  For S3-backed EC2 instances, the only option is to terminate/launch a new instance.
You can also avoid noisy neighbors by preferring larger EC2 instance types: larger virtualized guests require more physical host resources, so the host can support fewer tenants overall (tenants are always of equal instance type). t1.micros are extremely noisy and bursty, much like my old college-dorm neighbors.
The most effective way to avoid noisy neighbors, however, is to pay extra for dedicated EC2 instances. Can't remember how much a dedicated instance costs? Our buddies at awsnow.info have your answer.
7) Embrace CloudFormation
 
AWS has developed a tool called CloudFormer that allows you to create CloudFormation templates from your existing AWS resources.  This tool is as slick and mysterious as CloudFormation itself.
Maintenance of CloudFormation templates is not easy.  However, you can use #include-like macros on your template sources to cobble together per-resource snippets into a complete template.  This facilitates template reuse and snippet comments.  
The new unified AWS CLI has support for CloudFormation templates per a recent post by AWS’s Movember-friendly Evan Brown.
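For example (the template file and stack name are placeholders):
# sanity-check a template, then create a stack from it
aws cloudformation validate-template --template-body file://<template-file>
aws cloudformation create-stack --stack-name <stack-name> --template-body file://<template-file>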
Another project along the same lines, troposphere, is gaining traction. Peter Sankauskas from Answers For AWS describes the project here.
It’s worth noting that CloudFormation recently added support for Virtual Private Cloud (VPC). 
8) Avoid underperforming CPU architectures for your workload
 
Different CPU architectures enable better performance for certain workloads depending on the NUMA characteristics, cache sizes, number of cores, number of hardware threads, GPU, etc.  
Virtualized environments like AWS do not guarantee a particular CPU architecture. Physical hardware is refreshed throughout AWS data centers on a daily basis. One of AWS's oldest regions, us-east-1, contains several different types of CPU, including AMD Opteron, Intel Xeon, and Intel Sandy Bridge. My British colleague Adrian Cockcroft from Netflix loves Sandy Bridge as much as mayonnaise on his fries.
If your workload is sensitive to the physical CPU architecture, it's best to detect it at startup (cat /proc/cpuinfo on Linux) and restart or terminate the instance if it's the wrong one. If you're sensitive to cost, you can run the instance for 55 minutes and then terminate it: with AWS, you're paying for the whole hour either way.
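A sketch of that startup check; the model string to match (here a Sandy Bridge Xeon, purely as an example) is whatever your workload prefers:
# shut down unless we landed on the CPU family we want
# (with instance-initiated shutdown behavior set to terminate, this releases the instance)
if ! grep -m1 'model name' /proc/cpuinfo | grep -q 'E5-2670'; then
  sudo shutdown -h now
fi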
9) Use Virtual Private Cloud (VPC) from the start
 
VPC is now the default configuration for new accounts in most regions. Retrofitting a non-VPC environment to use VPC is a very time-consuming task, particularly in large deployments with lots of security groups.
There is currently no migration path between non-VPC security groups (ingress only) and VPC security groups (ingress and egress). In fact, Netflix is still running non-VPC because of this missing security-group migration.
Another benefit is that VPC enables ELBs as a mid-tier load balancer behind a private subnet. Without VPC, ELBs are public-facing.
A common configuration is to pair an ELB with HAProxy to get finer-grained control over the load-balancing algorithms, since ELBs are black boxes and can be stickier than expected.
It’s worth noting that Network Address Translation (NAT) instances in a VPC are actually EC2 instances performing Port Address Translation (PAT) versus dedicated hardware devices with single-purpose ASIC chips.
10) Pre-warm and autoscale-out your AWS resources before expected spikes
ELBs - along with other AWS blackbox resources such as NAT, S3, DynamoDB, Glacier, and Beanstalk - are made up of EC2 instances just like your custom application.  
These instances need to autoscale up and down just like your application - unless you provision them ahead of time.  There is currently no way to do this without contacting AWS directly.
By default, Amazon will do its best to autoscale based on traffic, but it's always best to notify them ahead of an expected spike so they can pre-warm and provision.
To demonstrate the difference provisioning makes, here is a dig query against a high-traffic, pre-warmed and provisioned ELB at Netflix:
$ dig cbp-us.nccp.netflix.com
;; ANSWER SECTION:
cbp-us.nccp.netflix.com. 291 IN CNAME dualstack.nccp-cbp-us-frontend-61636918.us-east-1.elb.amazonaws.com.
dualstack.nccp-cbp-us-frontend-61636918.us-east-1.elb.amazonaws.com. 18 IN A 107.21.243.94
dualstack.nccp-cbp-us-frontend-61636918.us-east-1.elb.amazonaws.com. 18 IN A 107.20.155.88

Top 10 Mobile Security Threats

The key differentiator of this vulnerability, Insecure Data Transmission, is that it concerns unencrypted or improperly encrypted data being stolen in transit: not on the device, but over the airwaves.
Insecure Data Transmission should not be confused with other entries in the top 10, such as:
M3: Insufficient Transport Layer Protection – frankly, I do not see any difference.
M4: Unintended Data Leakage – unintended data leakage is a consequence of insecure data transmission: once stolen data has been interpreted, it may contain information that is valuable to attackers (leaked).
M6: Broken Cryptography – while broken cryptography does relate to improperly encrypted data, and may be a cause of insecure data transmission, it focuses on the encryption process/technique itself on the device transmitting the data.


Asgard: Web-based Cloud Management and Deployment


For the past several years Netflix developers have been using self-service tools to build and deploy hundreds of applications and services to the Amazon cloud. One of those tools is Asgard, a web interface for application deployments and cloud management. 
Asgard is named for the home of the Norse god of thunder and lightning, because Asgard is where Netflix developers go to control the clouds. I'm happy to announce that Asgard has now been open sourced on GitHub and is available for download and use by anyone. All you'll need is an Amazon Web Services account. Like other open source Netflix projects, Asgard is released under the Apache License, Version 2.0. Please feel free to fork the project and make improvements to it.
Some of the information in this blog post is also published in the following presentations. Note that Asgard was originally named the Netflix Application Console, or NAC.

Visual Language for the Cloud

To help people identify various types of cloud entities, Asgard uses the Tango open source icon set, with a few additions. These icons help establish a visual language to help people understand what they are looking at as they navigate. Tango icons look familiar because they are also used by Jenkins, Ubuntu, Mediawiki, Filezilla, and Gimp. Here is a sampling of Asgard's cloud icons. 

Cloud Model

The Netflix cloud model includes concepts that AWS does not support directly: Applications and Clusters.

Application

Below is a diagram of some of the Amazon objects required to run a single front-end application such as Netflix’s autocomplete service. 
Here’s a quick summary of the relationships of these cloud objects.
  • An Auto Scaling Group (ASG) can attach zero or more Elastic Load Balancers (ELBs) to new instances.
  • An ELB can send user traffic to instances.
  • An ASG can launch and terminate instances.
  • For each instance launch, an ASG uses a Launch Configuration.
  • The Launch Configuration specifies which Amazon Machine Image (AMI) and which Security Groups to use when launching an instance.
  • The AMI contains all the bits that will be on each instance, including the operating system, common infrastructure such as Apache and Tomcat, and a specific version of a specific Application.
  • Security Groups can restrict the traffic sources and ports to the instances.
That’s a lot of stuff to keep track of for one application.
When there are large numbers of those cloud objects in a service-oriented architecture (like Netflix has), it’s important for a user to be able to find all the relevant objects for their particular application. Asgard uses an application registry in SimpleDB and naming conventions to associate multiple cloud objects with a single application. Each application has an owner and an email address to establish who is responsible for the existence and state of the application's associated cloud objects.
Asgard limits the set of permitted characters in the application name so that the names of other cloud objects can be parsed to determine their association with an application.
Here is a screenshot of Asgard showing a filtered subset of the applications running in our production account in the Amazon cloud in the us-east-1 region: 
Screenshot of a detail screen for a single application, with links to related cloud objects:

Cluster

On top of the Auto Scaling Group construct supplied by Amazon, Asgard infers an object called a Cluster which contains one or more ASGs. The ASGs are associated by naming convention. When a new ASG is created within a cluster, an incremented version number is appended to the cluster's "base name" to form the name of the new ASG. The Cluster provides Asgard users with the ability to perform a deployment that can be rolled back quickly.
Example: During a deployment, cluster obiwan contains ASGs obiwan-v063 and obiwan-v064. Here is a screenshot of a cluster in mid-deployment.
The old ASG is “disabled” meaning it is not taking traffic but remains available in case a problem occurs with the new ASG. Traffic comes from ELBs and/or from Discovery, an internal Netflix service that is not yet open sourced.

Deployment Methods

Fast Rollback

One of the primary features of Asgard is the ability to use the cluster screen shown above to deploy a new version of an application in a way that can be reversed at the first sign of trouble. This method requires more instances to be in use during deployment, but it can greatly reduce the duration of service outages caused by bad deployments.
This animated diagram shows a simplified process of using the Cluster interface to try out a deployment and roll it back quickly when there is a problem: 
The animation illustrates the following deployment use case (a hand-driven AWS CLI equivalent is sketched after the list):
  1. Create the new ASG obiwan-v064
  2. Enable traffic to obiwan-v064
  3. Disable traffic on obiwan-v063
  4. Monitor results and notice that things are going badly
  5. Re-enable traffic on obiwan-v063
  6. Disable traffic on obiwan-v064
  7. Analyze logs on bad servers to diagnose problems
  8. Delete obiwan-v064
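Roughly what those steps look like if you drive Auto Scaling by hand with the AWS CLI (Asgard automates all of this; the ASG names come from the example above, the rest are placeholders):
# 1. create the new ASG alongside the old one, attached to the same ELB
aws autoscaling create-auto-scaling-group --auto-scaling-group-name obiwan-v064 \
    --launch-configuration-name <launch-config> --min-size 3 --max-size 3 \
    --availability-zones us-east-1a --load-balancer-names <elb-name>
# 3. disable traffic on the old ASG by detaching it from the ELB
aws autoscaling detach-load-balancers --auto-scaling-group-name obiwan-v063 --load-balancer-names <elb-name>
# 5. roll back: re-attach the old ASG...
aws autoscaling attach-load-balancers --auto-scaling-group-name obiwan-v063 --load-balancer-names <elb-name>
# 8. ...and delete the bad one, instances included
aws autoscaling delete-auto-scaling-group --auto-scaling-group-name obiwan-v064 --force-delete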

Rolling Push

Asgard also provides an alternative deployment system called a rolling push. This resembles a conventional data-center deployment to a cluster of application servers: only one ASG is needed, and old instances are gracefully deleted and replaced by new instances, one or two at a time, until all the instances in the ASG have been replaced. Rolling pushes are useful:
  1. If an ASG's instances are sharded so each instance has a distinct purpose that should not be duplicated by another instance.
  2. If the clustering mechanisms of the application (such as Cassandra) cannot support sudden increases in instance count for the cluster.
Downsides to a rolling push:
  1. Replacing instances in small batches can take a long time.
  2. Reversing a bad deployment can take a long time.

Task Automation

Several common tasks are built into Asgard to automate the deployment process. Here is an animation showing a time-compressed view of a 14-minute automated rolling push in action: 

Auto Scaling

Netflix focuses on the ASG as the primary unit of deployment, so Asgard also provides a variety of graphical controls for modifying an ASG and setting up metrics-driven auto scaling when desired. 
CloudWatch metrics can be selected from the defaults provided by Amazon, such as CPUUtilization, or can be custom metrics published by your application using a library like Servo for Java.
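As a sketch (the namespace and metric name are made up), publishing a custom data point that a scaling policy could then track:
# push one application-level data point to CloudWatch
aws cloudwatch put-metric-data --namespace MyApp --metric-name RequestLatency --unit Milliseconds --value 42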

Why not the AWS Management Console?

The AWS Management Console has its uses for someone with your Amazon account password who needs to configure something Asgard does not provide. However, for everyday large-scale operations, the AWS Management Console has not yet met the needs of the Netflix cloud usage model, so we built Asgard instead. Here are some of the reasons.
  • Hide the Amazon keys

    Netflix grants its employees a lot of freedom and responsibility, including the rights and duties of enhancing and repairing production systems. Most of those systems run in the Amazon cloud. Although we want to enable hundreds of engineers to manage their own cloud apps, we prefer not to give all of them the secret keys to access the company’s Amazon accounts directly. Providing an internal console allows us to grant Asgard users access to our Amazon accounts without telling too many employees the shared cloud passwords. This strategy also saves us from needing to assign and revoke hundreds of Identity and Access Management (IAM) cloud accounts for employees.
  • Auto Scaling Groups

    As of this writing the AWS Management Console lacks support for Auto Scaling Groups (ASGs). Netflix relies on ASGs as the basic unit of deployment and management for instances of our applications. One of our goals in open sourcing Asgard is to help other Amazon customers make greater use of Amazon’s sophisticated auto scaling features. ASGs are a big part of the Netflix formula to provide reliability, redundancy, cost savings, clustering, discoverability, ease of deployment, and the ability to roll back a bad deployment quickly.
  • Enforce Conventions

Like any growing collection of things users are allowed to create, the cloud can easily become a confusing place full of expensive, unlabeled clutter. Part of the Netflix Cloud Architecture is the use of registered services associated with cloud objects by naming convention. Asgard enforces these naming conventions in order to keep the cloud a saner place, one that can be audited and cleaned up regularly as things get stale, messy, or forgotten.
  • Logging

    So far the AWS console does not expose a log of recent user actions on an account. This makes it difficult to determine whom to call when a problem starts, and what recent changes might relate to the problem. Lack of logging is also a non-starter for any sensitive subsystems that legally require auditability.
  • Integrate Systems

    Having our own console empowers us to decide when we want to add integration points with our other engineering systems such as Jenkins and our internal Discovery service.
  • Automate Workflow

    Multiple steps go into a safe, intelligent deployment process. By knowing certain use cases in advance Asgard can perform all the necessary steps for a deployment based on one form submission.
  • Simplify REST API

    For common operations that other systems need to perform, we can expose and publish our own REST API to do exactly what we want in a way that hides some of the complex steps from the user.

Costs

When using cloud services, it's important to keep a lid on your costs. Since June 5, 2012, Amazon provides programmatic access to your account's billing data, so you can track charges frequently. This data is not exposed through Asgard as of this writing, but someone in your company should keep track of your cloud costs regularly. See http://aws.typepad.com/aws/2012/06/new-programmatic-access-to-aws-billing-data.html
Starting up Asgard does not initially cause you to incur any Amazon charges, because Amazon has a free tier for SimpleDB usage and no charges for creating Security Groups, Launch Configurations, or empty Auto Scaling Groups. However, as soon as you increase the size of an ASG above zero Amazon will begin charging you for instance usage, depending on your status for Amazon’s Free Usage Tier. Creating ELBs, RDS instances, and other cloud objects can also cause you to incur charges. Become familiar with the costs before creating too many things in the cloud, and remember to delete your experiments as soon as you no longer need them. Your Amazon costs are your own responsibility, so run your cloud operations wisely.

Feature Films

By extraordinary coincidence, Thor and Thor: Tales of Asgard are now available to watch on Netflix streaming.

Conclusion

Asgard has been one of the primary tools for application deployment and cloud management at Netflix for years. By releasing Asgard to the open source community we hope more people will find the Amazon cloud and Auto Scaling easier to work with, even at large scale like Netflix. More Asgard features will be released regularly, and we welcome participation by users on GitHub.
Follow the Netflix Tech Blog and the @NetflixOSS twitter feed for more open source components of the Netflix Cloud Platform.
If you're interested in working with us to solve more of these interesting problems, have a look at the Netflix jobs page to see if something might suit you. We're hiring!

Related Resources

Asgard

Netflix Cloud Platform

Amazon Web Services