AWS is the leading cloud provider today, and it enables us to build and deliver products faster. AWS services provide several things by default, such as elasticity and auto scaling. AWS offers services in almost every area of software development, from infrastructure as a service (EC2) to managed services such as AWS Lambda and DynamoDB. These tools make AWS a default choice for building software.
There are several points we should consider before building on AWS. If we are not mindful of them, they can haunt our system later on. Some of these points are hidden gotchas that we can learn from the experiences of others. We can summarize them as follows:
- Availability of services (region wise)
- Availability of services (distribution wise)
- Cost & Pricing implications
- Dealing with compliance
- Limits or quotas
- IAM capabilities of services
We'll look at these points one by one and try to identify the things we should be careful about.
Availability of services (region wise)
We should confirm that a service is present in our target region before we commit to the architecture, because this cannot be changed easily later on. Newly established regions don't have all the services, and a new service may not be available in all regions.
Sometimes a service is available in a region but certain features of that service are not. It is therefore a good habit to read a service's documentation thoroughly before deciding to use it.
Hence the things we should check are:
- Our targeted service is available in our desired region
- All the features which we intend to use are available in the desired region
Although the AWS documentation covers this to some extent, this table from trek10 has comparatively more granular details.
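The two checks above can be turned into a small pre-flight script. The following is a minimal sketch, not an official tool: the service names and the `AVAILABILITY` data are hypothetical placeholders, and real data would have to come from the AWS documentation or a regional services list.

```python
from typing import Dict, List, Set

# Hypothetical availability data, for illustration only.
# Real values must be taken from the AWS documentation.
AVAILABILITY: Dict[str, Set[str]] = {
    "service-a": {"us-east-1", "eu-central-1"},
    "service-b": {"us-east-1"},
}


def missing_in_region(required: List[str], region: str,
                      availability: Dict[str, Set[str]]) -> List[str]:
    """Return the required services that are not listed for the region."""
    return [s for s in required if region not in availability.get(s, set())]


# "service-b" is not listed for eu-central-1 in the sample data.
print(missing_in_region(["service-a", "service-b"], "eu-central-1", AVAILABILITY))
```

The same idea works for features: treat each feature we depend on as its own entry in the mapping, so a service that exists in the region but lacks the feature still shows up as missing.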
Availability of services (distribution wise)
In AWS, services also differ in the scope at which they are distributed. Some services create or provision resources globally, some at the regional level, and some at the availability zone level. We can categorize them as follows:
- Global services: which include Route 53, DynamoDB global tables, IAM etc.
- Regional services: which include S3 buckets, SNS, SQS etc.
- Zonal services: which include EC2 instances, RDS instances, EBS volumes etc.
If a service is distributed at the zone level and our system requires regional availability, our architecture needs to take care of that replication. Similarly, if a service is distributed at the region level and we need multi-region (partially global) replication, our architecture has to take care of that too. This applies when the service itself doesn't provide replication. Some services do: RDS, for example, can replicate into a different AZ.
We must weigh our service of choice against the requirement and add components to our architecture that take care of such replication. For example, suppose we plan to use EC2 instances, which are distributed at the availability zone level. We can use auto scaling groups (ASGs) for regional availability, because ASGs spread EC2 instances across multiple AZs.
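To make the idea of spreading across AZs concrete, here is a minimal sketch of an even split and of the capacity that survives the loss of one zone. It only illustrates the balancing idea; it is not how ASGs are implemented, and the function names and zone names are assumptions made for this example.

```python
from typing import Dict, List


def spread_across_azs(instance_count: int, azs: List[str]) -> Dict[str, int]:
    """Split instance_count as evenly as possible across the given AZs."""
    if not azs:
        raise ValueError("at least one availability zone is required")
    base, extra = divmod(instance_count, len(azs))
    # The first `extra` zones receive one additional instance.
    return {az: base + (1 if i < extra else 0) for i, az in enumerate(azs)}


def surviving_capacity(placement: Dict[str, int], failed_az: str) -> int:
    """Number of instances still running if one AZ becomes unavailable."""
    return sum(n for az, n in placement.items() if az != failed_az)


placement = spread_across_azs(7, ["eu-central-1a", "eu-central-1b", "eu-central-1c"])
print(placement)  # {'eu-central-1a': 3, 'eu-central-1b': 2, 'eu-central-1c': 2}
print(surviving_capacity(placement, "eu-central-1a"))  # 4
```

A calculation like this helps decide how many instances we need in total so that the system still carries its load when a single AZ fails.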
The AWS docs include an example of instance distribution across AZs using ASGs.
Cost & pricing (cost models)
Different AWS services have different pricing models. It would be a mistake to assume that the pricing model of one service applies to another. In the cloud, we pay for what we use. For example:
- EC2 is billed for the time an instance runs, by the hour or by the second. We are billed for a running EC2 instance even if it is not doing anything.
- Fargate containers and Lambda functions are billed only while they run and are in use.
Hence, whenever we design a system in AWS, we must ask our client or product team:
- What are the cost calculations and budget for the system to be built?
- What are the traffic estimates for the system to be built?
Both of these questions will allow us to make appropriate decisions regarding the services we use.
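As a rough illustration of why the traffic estimate matters, the sketch below compares an always-on pricing model with a pay-per-use model. All rates are hypothetical placeholders chosen for the example, not real AWS prices, and the model ignores free tiers, storage and data transfer.

```python
HOURS_PER_MONTH = 730  # common approximation for monthly estimates


def always_on_monthly_cost(hourly_rate: float, instances: int = 1) -> float:
    """Cost of capacity that is billed for every hour, busy or idle."""
    return hourly_rate * HOURS_PER_MONTH * instances


def per_use_monthly_cost(requests: int, avg_seconds: float,
                         price_per_request: float,
                         price_per_second: float) -> float:
    """Cost of capacity that is billed only per request and per run time."""
    return requests * (price_per_request + avg_seconds * price_per_second)


# Hypothetical rates, for illustration only.
HOURLY_RATE = 0.05
PRICE_PER_REQUEST = 0.0000002
PRICE_PER_SECOND = 0.00002

for monthly_requests in (100_000, 10_000_000):
    fixed = always_on_monthly_cost(HOURLY_RATE)
    usage = per_use_monthly_cost(monthly_requests, 0.2,
                                 PRICE_PER_REQUEST, PRICE_PER_SECOND)
    cheaper = "pay-per-use" if usage < fixed else "always-on"
    print(f"{monthly_requests:>10} requests/month: "
          f"always-on ${fixed:.2f}, pay-per-use ${usage:.2f} -> {cheaper}")
```

With these assumed rates the pay-per-use model wins at low traffic and loses at high traffic, which is why the budget and the traffic estimate have to be known before the services are chosen.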
Whenever we design in AWS, network traffic should always be considered. The golden rule is:
Don't forget the traffic and cost drivers
For example, in EC2 we are billed whenever traffic moves from one region to another or from one availability zone to another. Another example of a cost model is the NAT gateway. Not only do we pay for the NAT gateway itself by the hour, we also pay for the data it processes and for the data transfer out to the internet through the internet gateway.
Hence we should always try to identify the cost drivers in our architecture. In some architectures, heavy data load is the cost driver. For example, in a webhook architecture we should be mindful of the following:
- Event size
- Number of events
The cost driver here is outbound traffic. As the number of client webhook endpoints increases, we have to dispatch more events from our system to the internet, which increases the outbound traffic.
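A back-of-the-envelope estimate of this cost driver could look like the sketch below. It assumes that every event is delivered to every endpoint, ignores retries and protocol overhead, and uses a hypothetical price per gigabyte rather than a real AWS rate.

```python
def monthly_egress_gb(event_size_bytes: int, events_per_month: int,
                      endpoints: int) -> float:
    """Outbound volume in decimal GB, assuming each event goes to every endpoint."""
    total_bytes = event_size_bytes * events_per_month * endpoints
    return total_bytes / 1e9


def monthly_egress_cost(gb: float, price_per_gb: float) -> float:
    """Outbound traffic cost for the given volume and price per GB."""
    return gb * price_per_gb


PRICE_PER_GB = 0.09  # hypothetical rate, for illustration only

# 2 KB events, 5 million events per month, 20 client endpoints.
gb = monthly_egress_gb(2_000, 5_000_000, 20)
print(f"{gb:.1f} GB outbound, about ${monthly_egress_cost(gb, PRICE_PER_GB):.2f} per month")
```

Because the three inputs multiply, doubling the number of endpoints doubles the bill even when the event size and event count stay the same.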
Dealing with compliance
Whenever we design a system, we should also be careful about the compliance requirements. Sometimes client companies require certain compliance, and sometimes governments require certain standards to be followed, for example PCI DSS or HIPAA/HITECH.
The AWS descriptions and documentation state that AWS is compliant with certain standards. In reality, this is only half the answer. Rather than the whole AWS cloud, only certain AWS services, regions and edge locations are compliant with a given standard or framework.
Hence, when we choose a region or service, we need to check whether that region or service is compliant with our target framework or standard.
For example, ElastiCache or Memcached are not in scope in the China region. Some edge locations are not in scope or covered by compliance. Hence, if we use AWS CloudFront, we cannot be sure which edge locations will be used and whether those edge locations are in scope for the compliance framework.
These are the compliance points we must consider before finalizing the architecture or presenting the audit report to the audit authority.
Limits or quotas
A limit is a constraint that protects us from excessive use of a service, and so from huge bills. Most quotas can be increased by a request to AWS support, but some limits can't be increased. Common upper limits are easy to find in the AWS documentation. If a limit is unclear, we can reach out to the AWS support team.
All AWS services are limited in some way. For example:
- We can't spin up 100 EC2 instances at once.
- By default, outbound SMTP traffic on port 25 is blocked for EC2. This block is not visible in the security group either. However, we can ask AWS to remove this restriction.
- SNS has different throughput limits in different regions. They are higher in us-east-1 than in the Frankfurt region. Hence some services behave differently depending on the region.
Some limits are documented in a very subtle way, and it is very easy to overlook them. Hence we must do the following:
- Read the docs carefully regarding limits
- Read the service FAQs carefully and identify if there are any limits.
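Once the limits have been collected from the docs and FAQs, comparing them with the planned usage is easy to automate. The following is a minimal sketch under stated assumptions: the quota names and values are hypothetical, and the 80% headroom threshold is an arbitrary choice for the example.

```python
from typing import Dict, List


def quota_violations(planned: Dict[str, float], quotas: Dict[str, float],
                     headroom: float = 0.8) -> List[str]:
    """List planned usages that exceed headroom * quota or have no known quota."""
    problems = []
    for name, amount in planned.items():
        if name not in quotas:
            problems.append(f"{name}: no quota on record, check the docs")
        elif amount > quotas[name] * headroom:
            problems.append(
                f"{name}: planned {amount} exceeds {headroom:.0%} of quota {quotas[name]}"
            )
    return problems


# Hypothetical numbers, for illustration only.
planned_usage = {"running-instances": 100, "topics": 50}
known_quotas = {"running-instances": 64, "topics": 1000}

for problem in quota_violations(planned_usage, known_quotas):
    print(problem)
```

Every reported line is either a quota increase to request from AWS support well before launch, or a reason to rethink that part of the design.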
IAM capabilities of services
All AWS services use IAM policies to control access. Very often we rely on IAM to achieve certain architectural goals, for example tenant isolation using a separate VPC per client.
We must consider that:
- Different services can have different IAM features. For example, some offer permissions only at the resource level, such as starting or stopping EC2 instances.
- Some AWS services allow access on the basis of tags and policies, for example to all resources that have a particular tag.
- Some other services don't allow condition-based access or policies, or don't allow resource-level access.
Hence we shouldn't take resource-based or policy-based access for granted, and shouldn't assume that it is available for all services in the same way. We should always check whether the access capability we want is available in our target service. The AWS IAM documentation is great for this purpose.
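As an illustration of tag-based access, the sketch below builds an IAM policy document that allows starting and stopping only those EC2 instances that carry a given tag. The tag key `Tenant` and value `acme` are hypothetical, and whether a similar condition works for another service has to be verified in that service's IAM documentation.

```python
import json


def tag_scoped_ec2_policy(tag_key: str, tag_value: str) -> dict:
    """Policy allowing start/stop only on EC2 instances with the given tag."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["ec2:StartInstances", "ec2:StopInstances"],
                "Resource": "arn:aws:ec2:*:*:instance/*",
                # The condition restricts the statement to tagged instances.
                "Condition": {
                    "StringEquals": {f"aws:ResourceTag/{tag_key}": tag_value}
                },
            }
        ],
    }


# Hypothetical tag used for tenant isolation.
print(json.dumps(tag_scoped_ec2_policy("Tenant", "acme"), indent=2))
```

If the target service supports neither resource-level permissions nor this kind of condition, the isolation has to be achieved by other means, for example separate accounts.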
Now that we have discussed all the gotchas in detail, whenever we develop an architecture in AWS we should:
- Gather knowledge of the services at a broad level to get an idea of what is possible.
- Once the services are selected, go into the deeper details of each service to find its constraints and limitations.
- Go through the checklist discussed in this article to avoid common pitfalls.
Note: I drew on my experience and existing resources for this article. Any mistake or factual inaccuracy can be reported in the comments. I will try my best to maintain and update this article over time.