What is AWS Well-Architected
AWS Well-Architected is a collection of best practices that serves as a reference when designing and operating workloads on AWS.
Structure of AWS Well-Architected
AWS Well-Architected is composed of the following "Six Pillars":
- Operational Excellence
- Security
- Reliability
- Performance Efficiency
- Cost Optimization
- Sustainability
Each pillar defines design principles and best practices, along with the perspectives to consider when applying them.
The Six Pillars
By referring to the Six Pillars of AWS Well-Architected, you can build effective and stable systems. Here, I will only introduce the design principles.
Operational Excellence
The first pillar is Operational Excellence. To efficiently execute workloads and bring business value, this pillar focuses on the execution and monitoring of systems, as well as continuous improvement.
Design Principles
The following are the design principles for operational excellence in the cloud:
- Perform operations as code
  In the cloud, you can apply the same engineering discipline that you use for application code to your entire environment. You can define your entire workload (applications, infrastructure, etc.) as code and update it with code. You can script your operations procedures and automate their process by launching them in response to events. By performing operations as code, you limit human error and create consistent responses to events.
- Make frequent, small, reversible changes
  Design workloads to allow components to be updated regularly to increase the flow of beneficial changes into your workload. Make changes in small increments that can be reversed if they fail, to aid in the identification and resolution of issues introduced to your environment (without affecting customers when possible).
- Refine operations procedures frequently
  As you use operations procedures, look for opportunities to improve them. As you evolve your workload, evolve your procedures appropriately. Set up regular game days to review and validate that all procedures are effective and that teams are familiar with them.
- Anticipate failure
  Perform “pre-mortem” exercises to identify potential sources of failure so that they can be removed or mitigated. Test your failure scenarios and validate your understanding of their impact. Test your response procedures to ensure they are effective and that teams are familiar with their process. Set up regular game days to test workload and team responses to simulated events.
- Learn from all operational failures
  Drive improvement through lessons learned from all operational events and failures. Share what is learned across teams and through the entire organization.
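As a rough illustration of "perform operations as code", the sketch below registers operations procedures as plain functions and runs the matching one when an event arrives. The event types, the `runbook` decorator, and the handler names are all hypothetical. In a real workload the handlers would call your automation tooling rather than return strings.

```python
from typing import Callable, Dict

# Registry mapping an event type to the procedure that handles it.
# All names here are hypothetical, chosen only for illustration.
Runbook = Callable[[dict], str]
RUNBOOKS: Dict[str, Runbook] = {}


def runbook(event_type: str) -> Callable[[Runbook], Runbook]:
    """Register a function as the procedure for one event type."""
    def register(func: Runbook) -> Runbook:
        RUNBOOKS[event_type] = func
        return func
    return register


@runbook("disk_almost_full")
def clean_temp_files(event: dict) -> str:
    # Placeholder action: a real procedure would call automation tooling.
    return f"cleaned temp files on {event['host']}"


@runbook("instance_unhealthy")
def replace_instance(event: dict) -> str:
    # Placeholder action: a real procedure would replace the instance.
    return f"replaced {event['host']}"


def handle(event: dict) -> str:
    """Run the registered procedure, or escalate if none exists."""
    procedure = RUNBOOKS.get(event["type"])
    if procedure is None:
        return f"no runbook for {event['type']}; escalate to on-call"
    return procedure(event)
```

Because every procedure is a function in version control, the same event always gets the same response, and an event without a procedure is surfaced explicitly instead of being handled ad hoc.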
Security
The second pillar is Security. This pillar addresses how to establish a secure cloud environment, including protecting data and systems, keeping data confidential, managing users, and detecting incidents.
Design Principles
In the cloud, there are a number of principles that can help you strengthen your workload security:
- Implement a strong identity foundation
  Implement the principle of least privilege and enforce separation of duties with appropriate authorization for each interaction with your AWS resources. Centralize identity management, and aim to eliminate reliance on long-term static credentials.
- Maintain traceability
  Monitor, alert, and audit actions and changes to your environment in real time. Integrate log and metric collection with systems to automatically investigate and take action.
- Apply security at all layers
  Apply a defense in depth approach with multiple security controls. Apply to all layers (for example, edge of network, VPC, load balancing, every instance and compute service, operating system, application, and code).
- Automate security best practices
  Automated software-based security mechanisms improve your ability to securely scale more rapidly and cost-effectively. Create secure architectures, including the implementation of controls that are defined and managed as code in version-controlled templates.
- Protect data in transit and at rest
  Classify your data into sensitivity levels and use mechanisms, such as encryption, tokenization, and access control where appropriate.
- Keep people away from data
  Use mechanisms and tools to reduce or eliminate the need for direct access or manual processing of data. This reduces the risk of mishandling or modification and human error when handling sensitive data.
- Prepare for security events
  Prepare for an incident by having incident management and investigation policy and processes that align to your organizational requirements. Run incident response simulations and use tools with automation to increase your speed for detection, investigation, and recovery.
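To make the least-privilege idea more concrete, here is a minimal sketch of a deny-by-default permission check against a policy written in the shape of an IAM JSON policy document. The bucket name and the `is_allowed` helper are assumptions for illustration. The evaluation logic is deliberately simplified (no conditions, principals, or `NotAction`) and is not how AWS evaluates policies internally.

```python
from fnmatch import fnmatchcase

# A policy in the shape of an IAM JSON policy document.
# The bucket name is a made-up example.
READ_ONLY_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-bucket/*",
        }
    ],
}


def _as_list(value):
    """Policy fields may hold a single string or a list of strings."""
    return [value] if isinstance(value, str) else list(value)


def _matches(patterns, value):
    """True if value matches any wildcard pattern (simplified matching)."""
    return any(fnmatchcase(value, pattern) for pattern in _as_list(patterns))


def is_allowed(policy, action, resource):
    """Deny by default; an explicit Deny always wins over an Allow."""
    allowed = False
    for statement in policy["Statement"]:
        if _matches(statement["Action"], action) and _matches(
            statement["Resource"], resource
        ):
            if statement["Effect"] == "Deny":
                return False
            allowed = True
    return allowed
```

With this policy, reading an object in the example bucket is permitted, while deleting it or touching any other bucket is rejected because nothing grants it.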
Reliability
The third pillar is Reliability. This pillar focuses on workloads that perform as expected and on rapidly recovering in cases where expectations are not met.
Design Principles
In the cloud, there are a number of principles that can help you increase reliability.
- Automatically recover from failure
  By monitoring a workload for key performance indicators (KPIs), you can run automation when a threshold is breached. These KPIs should be a measure of business value, not of the technical aspects of the operation of the service. This allows for automatic notification and tracking of failures, and for automated recovery processes that work around or repair the failure. With more sophisticated automation, it’s possible to anticipate and remediate failures before they occur.
- Test recovery procedures
  In an on-premises environment, testing is often conducted to prove that the workload works in a particular scenario. Testing is not typically used to validate recovery strategies. In the cloud, you can test how your workload fails, and you can validate your recovery procedures. You can use automation to simulate different failures or to recreate scenarios that led to failures before. This approach exposes failure pathways that you can test and fix before a real failure scenario occurs, thus reducing risk.
- Scale horizontally to increase aggregate workload availability
  Replace one large resource with multiple small resources to reduce the impact of a single failure on the overall workload. Distribute requests across multiple, smaller resources to ensure that they don’t share a common point of failure.
- Stop guessing capacity
  A common cause of failure in on-premises workloads is resource saturation, when the demands placed on a workload exceed the capacity of that workload (this is often the objective of denial of service attacks). In the cloud, you can monitor demand and workload utilization, and automate the addition or removal of resources to maintain the optimal level to satisfy demand without over- or under-provisioning. There are still limits, but some quotas can be controlled and others can be managed.
- Manage change through automation
  Changes to your infrastructure should be made using automation. The changes that need to be managed include changes to the automation, which then can be tracked and reviewed.
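The "automatically recover from failure" principle can be sketched as a small monitoring loop: watch a business-level KPI and run a recovery action once it stays below a threshold for several consecutive samples. The KPI (orders per minute), the threshold, and the `monitor` function are assumptions made for this example, not part of any AWS API.

```python
from typing import Callable, Iterable, List


def monitor(
    samples: Iterable[float],
    threshold: float,
    breaches_to_act: int,
    recover: Callable[[], None],
) -> List[int]:
    """Call recover() each time the KPI is below threshold for
    `breaches_to_act` consecutive samples. Returns the sample indexes
    at which recovery was triggered."""
    consecutive = 0
    triggered_at: List[int] = []
    for index, value in enumerate(samples):
        # Count consecutive breaches; any healthy sample resets the count.
        consecutive = consecutive + 1 if value < threshold else 0
        if consecutive == breaches_to_act:
            recover()
            triggered_at.append(index)
            consecutive = 0  # start counting again after recovery
    return triggered_at


# Hypothetical KPI readings: orders per minute.
orders_per_minute = [120, 95, 90, 130, 80, 70, 60]
actions: List[str] = []
triggered = monitor(
    orders_per_minute,
    threshold=100,
    breaches_to_act=2,
    recover=lambda: actions.append("restart checkout service"),
)
```

Requiring more than one consecutive breach is a design choice that avoids reacting to a single noisy sample. A managed alarm service would normally play this role in practice.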
Performance Efficiency
The fourth pillar is Performance Efficiency. This pillar emphasizes the optimization of computing resources for long-term and sustained performance.
Design Principles
The following design principles can help you achieve and maintain efficient workloads in the cloud.
- Democratize advanced technologies
  Make advanced technology implementation easier for your team by delegating complex tasks to your cloud vendor. Rather than asking your IT team to learn about hosting and running a new technology, consider consuming the technology as a service. For example, NoSQL databases, media transcoding, and machine learning are all technologies that require specialized expertise. In the cloud, these technologies become services that your team can consume, allowing your team to focus on product development rather than resource provisioning and management.
- Go global in minutes
  Deploying your workload in multiple AWS Regions around the world allows you to provide lower latency and a better experience for your customers at minimal cost.
- Use serverless architectures
  Serverless architectures remove the need for you to run and maintain physical servers for traditional compute activities. For example, serverless storage services can act as static websites (removing the need for web servers) and event services can host code. This removes the operational burden of managing physical servers, and can lower transactional costs because managed services operate at cloud scale.
- Experiment more often
  With virtual and automatable resources, you can quickly carry out comparative testing using different types of instances, storage, or configurations.
- Consider mechanical sympathy
  Use the technology approach that aligns best with your goals. For example, consider data access patterns when you select database or storage approaches.
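For "experiment more often", a comparative test can be as small as timing the same task under each candidate configuration and comparing medians. The sketch below is one possible shape for such a harness. The `measure` and `pick_fastest` helpers and the configuration names are hypothetical, and real experiments would run the task on the actual instance types or storage options being compared.

```python
import time
from statistics import median
from typing import Callable, Dict, List


def measure(task: Callable[[], object], runs: int = 5) -> List[float]:
    """Time `task` several times and return the durations in seconds."""
    timings: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        timings.append(time.perf_counter() - start)
    return timings


def pick_fastest(results: Dict[str, List[float]]) -> str:
    """Return the configuration with the lowest median duration."""
    return min(results, key=lambda name: median(results[name]))


# Hypothetical timings (seconds) collected from two configurations.
example_results = {
    "config-a": [0.30, 0.20, 0.90],
    "config-b": [0.25, 0.26, 0.24],
}
best = pick_fastest(example_results)
```

The median is used instead of the mean so that one slow outlier, like the 0.90 s run above, does not decide the comparison on its own.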
Cost Optimization
The fifth pillar is Cost Optimization. This pillar explains how to build modern architectures while keeping costs under continuous control.
Design Principles
Consider the following design principles for cost optimization:
- Implement cloud financial management
  To achieve financial success and accelerate business value realization in the cloud, you must invest in Cloud Financial Management. Your organization must dedicate the necessary time and resources for building capability in this new domain of technology and usage management. Similar to your Security or Operations capability, you need to build capability through knowledge building, programs, resources, and processes to help you become a cost efficient organization.
- Adopt a consumption model
  Pay only for the computing resources you consume, and increase or decrease usage depending on business requirements. For example, development and test environments are typically only used for eight hours a day during the work week. You can stop these resources when they’re not in use for a potential cost savings of about 75% (40 hours versus 168 hours).
- Measure overall efficiency
  Measure the business output of the workload and the costs associated with delivery. Use this data to understand the gains you make from increasing output, increasing functionality, and reducing cost.
- Stop spending money on undifferentiated heavy lifting
  AWS does the heavy lifting of data center operations like racking, stacking, and powering servers. It also removes the operational burden of managing operating systems and applications with managed services. This allows you to focus on your customers and business projects rather than on IT infrastructure.
- Analyze and attribute expenditure
  The cloud makes it easier to accurately identify the cost and usage of workloads, which then allows transparent attribution of IT costs to revenue streams and individual workload owners. This helps measure return on investment (ROI) and gives workload owners an opportunity to optimize their resources and reduce costs.
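The savings figure quoted under "Adopt a consumption model" can be checked with a few lines of arithmetic. This sketch assumes the bill is proportional to running hours and that the environment runs 8 hours a day, 5 days a week. The function name is only illustrative.

```python
HOURS_PER_WEEK = 24 * 7  # 168 hours if resources are never stopped


def savings_fraction(hours_used: float, hours_available: float = HOURS_PER_WEEK) -> float:
    """Fraction of cost avoided by paying only for the hours actually used."""
    return 1 - hours_used / hours_available


work_week_hours = 8 * 5  # 40 hours
savings = savings_fraction(work_week_hours)
print(f"Potential savings: {savings:.1%}")
```

Running only 40 of 168 hours avoids roughly 76% of the always-on cost, which is consistent with the rounded 75% figure in the text.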
Sustainability
The final pillar is Sustainability. This pillar focuses on environmental impact, especially energy consumption and efficiency, and gives architects approaches for reducing resource consumption.
Design Principles
Apply these design principles when architecting your cloud workloads to maximize sustainability and minimize impact.
- Understand your impact
  Measure the impact of your cloud workload and model the future impact of your workload. Include all sources of impact, including impacts resulting from customer use of your products, and impacts resulting from their eventual decommissioning and retirement. Compare the productive output with the total impact of your cloud workloads by reviewing the resources and emissions required per unit of work. Use this data to establish key performance indicators (KPIs), evaluate ways to improve productivity while reducing impact, and estimate the impact of proposed changes over time.
- Establish sustainability goals
  For each cloud workload, establish long-term sustainability goals such as reducing the compute and storage resources required per transaction. Model the return on investment of sustainability improvements for existing workloads, and give owners the resources they need to invest in sustainability goals. Plan for growth, and architect your workloads so that growth results in reduced impact intensity measured against an appropriate unit, such as per user or per transaction. Goals help you support the wider sustainability goals of your business or organization, identify regressions, and prioritize areas of potential improvement.
- Maximize utilization
  Right-size workloads and implement efficient design to ensure high utilization and maximize the energy efficiency of the underlying hardware. Two hosts running at 30% utilization are less efficient than one host running at 60% due to baseline power consumption per host. At the same time, eliminate or minimize idle resources, processing, and storage to reduce the total energy required to power your workload.
- Anticipate and adopt new, more efficient hardware and software offerings
  Support the upstream improvements your partners and suppliers make to help you reduce the impact of your cloud workloads. Continually monitor and evaluate new, more efficient hardware and software offerings. Design for flexibility to allow for the rapid adoption of new efficient technologies.
- Use managed services
  Sharing services across a broad customer base helps maximize resource utilization, which reduces the amount of infrastructure needed to support cloud workloads. For example, customers can share the impact of common data center components like power and networking by migrating workloads to the AWS Cloud and adopting managed services, such as AWS Fargate for serverless containers, where AWS operates at scale and is responsible for their efficient operation. Use managed services that can help minimize your impact, such as automatically moving infrequently accessed data to cold storage with Amazon S3 Lifecycle configurations or Amazon EC2 Auto Scaling to adjust capacity to meet demand.
- Reduce the downstream impact of your cloud workloads
  Reduce the amount of energy or resources required to use your services. Reduce or eliminate the need for customers to upgrade their devices to use your services. Test using device farms to understand expected impact and test with customers to understand the actual impact from using your services.
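The claim under "Maximize utilization" that two hosts at 30% are less efficient than one host at 60% follows from baseline power draw. The sketch below uses a simple linear power model with made-up numbers (100 W idle, 200 W at full load). These wattages and the `host_power_watts` function are assumptions for illustration, not measurements of any real hardware.

```python
IDLE_WATTS = 100.0  # assumed baseline draw of a powered-on host
PEAK_WATTS = 200.0  # assumed draw at 100% utilization


def host_power_watts(utilization: float) -> float:
    """Linear model: baseline power plus a share of the dynamic range."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return IDLE_WATTS + (PEAK_WATTS - IDLE_WATTS) * utilization


# Same total work either way: 2 x 30% or 1 x 60%.
two_hosts = 2 * host_power_watts(0.30)
one_host = host_power_watts(0.60)
print(f"two hosts at 30%: {two_hosts:.0f} W, one host at 60%: {one_host:.0f} W")
```

Under these assumptions the two half-idle hosts pay the baseline cost twice, so consolidating the same work onto one host draws noticeably less power.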
References