Introduction

Lemonade's infrastructure is defined in code instead of manually provisioned through a graphical user interface. This practice is called infrastructure as code and brings several benefits: automated deployment and recovery processes; easy recovery and rollbacks; change tracking via source version control; and most importantly, a single place where the infrastructure is defined in a consistent, simple, and efficient manner. Our infrastructure is focused on fail safety, performance, scalability, and security, and is defined using the AWS Cloud Development Kit (AWS CDK).

AWS CDK is an open-source software development framework for defining cloud infrastructure in code and provisioning it through AWS CloudFormation. The CDK integrates fully with AWS services and offers a higher-level, object-oriented abstraction to define AWS resources imperatively.
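
As a minimal sketch of what this looks like in TypeScript (the stack and bucket names are illustrative, not our actual resources):

    import * as cdk from 'aws-cdk-lib';
    import * as s3 from 'aws-cdk-lib/aws-s3';

    // Resources are defined imperatively as objects; `cdk deploy` synthesizes
    // this definition into a CloudFormation template and provisions it.
    class ExampleStack extends cdk.Stack {
      constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
        super(scope, id, props);
        new s3.Bucket(this, 'AssetsBucket', { versioned: true });
      }
    }

    const app = new cdk.App();
    new ExampleStack(app, 'ExampleStack');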

Environments

Lemonade has two environments: production and staging. The resources of each environment are spread over two data centers, and the environments are strictly separated from each other, avoiding accidental actions against the wrong environment. The production environment is used by our end users, while the staging environment is mostly used by Lemonade staff.

Production is provisioned against a stable infrastructure definition and runs stable versions of our applications, while staging may be provisioned against the next version of the infrastructure definition and run applications that are being prepared for release.

Databases

A database allows persistent storage and efficient querying of data. Our databases run as clusters to provide fail safety, high performance, and scalability through redundancy and distribution. The production and staging databases are managed using MongoDB Atlas, which achieves these cluster characteristics while offering convenient features such as automatic backups, access control, and virtual private network peering.

MongoDB Atlas allows us to choose where the database cluster is provisioned and how many resources it receives. Our clusters are located in the same data centers as the compute cluster for fast data transfers. Another valuable feature is peering: linking the private networks of the database and compute clusters together. As a result, the database does not need to be exposed to the internet but instead appears as a local machine to the compute cluster. This is the safest method: even if authorization credentials become known, an attacker cannot reach the database cluster.
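
The AWS side of the peering (the routes toward Atlas) can still live in the infrastructure definition. A hedged sketch; the Atlas CIDR and peering connection ID are placeholders, and Atlas initiates the peering connection from its side:

    import * as cdk from 'aws-cdk-lib';
    import * as ec2 from 'aws-cdk-lib/aws-ec2';

    class AtlasPeeringRoutes extends cdk.Stack {
      constructor(scope: cdk.App, id: string, vpc: ec2.IVpc) {
        super(scope, id);

        // Route the Atlas CIDR through the peering connection from every
        // private subnet, so the database appears as a local machine.
        vpc.privateSubnets.forEach((subnet, i) => {
          new ec2.CfnRoute(this, `AtlasRoute${i}`, {
            routeTableId: subnet.routeTable.routeTableId,
            destinationCidrBlock: '192.168.248.0/21',        // placeholder Atlas CIDR
            vpcPeeringConnectionId: 'pcx-0123456789abcdef0', // placeholder ID
          });
        });
      }
    }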

The database cluster is the only component that is not managed by the infrastructure definition. If it were, we would have to use AWS's managed database service, which is extremely costly and runs an older database version. It is favorable to use the managed database service of the creators who build the database themselves. Besides, managing the database via the infrastructure definition comes with the risk of accidentally reprovisioning the database, with data loss as a result. Therefore it is best to consider the database as a separate entity.

MongoDB Atlas delivers the world’s leading database for modern applications as a fully automated cloud service engineered and run by the same team that builds the database. Proven operational and security practices are built in, automating time-consuming administration tasks such as infrastructure provisioning, database setup, ensuring availability, global distribution, backups, and more.

Stacks

AWS CDK has the concept of stacks: logical units of defined resources that are deployable by AWS CloudFormation. AWS CloudFormation is a tool that describes and provisions AWS resources in a specific text format. AWS CDK is a wrapper around AWS CloudFormation, making its functionality available through code.

AWS CloudFormation provides a common language for you to describe and provision all the infrastructure resources in your cloud environment. CloudFormation allows you to use a simple text file to model and provision, in an automated and secure manner, all the resources needed for your applications across all regions and accounts. This file serves as the single source of truth for your cloud environment.
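
Concretely, our CDK app instantiates one class per stack. A sketch with illustrative names mirroring the stacks described below:

    import * as cdk from 'aws-cdk-lib';

    // Illustrative stack names mirroring the sections below; each stack is a
    // logical unit that CloudFormation deploys (and rolls back) as a whole.
    class EcsStack extends cdk.Stack {}
    class CicdStack extends cdk.Stack {}

    const app = new cdk.App();
    new EcsStack(app, 'EcsStack');
    new CicdStack(app, 'CicdStack');

    // `cdk synth` renders each stack as a CloudFormation template;
    // `cdk deploy EcsStack` provisions a single unit independently.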

ECS Stack

Cluster

A cluster is a group of machines that are virtually or geographically separated and that work together to provide the same service or application to clients.

Our servers, also referred to as machines, are computers that are always on and geographically located in places with fast internet connections, running on Amazon Elastic Compute Cloud (Amazon EC2). Our servers work together so that, in many respects, they can be viewed as a single system. They form a cluster and span two geographically separated data centers. This provides fail safety, as network errors in one location, or failure of one machine, will not interfere with providing service to our end users. The main advantages of a cluster are fail safety through redundancy, resource sharing, and scalability through distribution.

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers. Amazon EC2’s simple web service interface allows you to obtain and configure capacity with minimal friction. It provides you with complete control of your computing resources and lets you run on Amazon’s proven computing environment. Amazon EC2 reduces the time required to obtain and boot new server instances to minutes, allowing you to quickly scale capacity, both up and down, as your computing requirements change. Amazon EC2 changes the economics of computing by allowing you to pay only for capacity that you actually use. Amazon EC2 provides developers the tools to build failure resilient applications and isolate them from common failure scenarios.

Each machine is monitored via AWS Systems Manager. The advantage of AWS Systems Manager is that it provides methods to access and control the machines without exposing that control over the internet.

AWS Systems Manager gives you visibility and control of your infrastructure on AWS. Systems Manager provides a unified user interface so you can view operational data from multiple AWS services and allows you to automate operational tasks across your AWS resources.

Our cluster is dynamic and automatically scales capacity based on the number of running applications. More precisely, it scales such that the memory reservation stays between $x$ and $y$ percent while guaranteeing between $v$ and $w$ machines. This ensures that resources, and thus our expenses, are always put to good use, while still allowing memory peaks in our applications without the risk of memory saturation, which can cause failures. The minimum is two machines, so that we always operate in two geographically separated data centers, providing fail safety.
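
Expressed in CDK, such a cluster could look roughly as follows; the instance type, capacity bounds, and target value are placeholders standing in for the $x$, $y$, $v$, and $w$ above:

    import * as cdk from 'aws-cdk-lib';
    import * as ec2 from 'aws-cdk-lib/aws-ec2';
    import * as ecs from 'aws-cdk-lib/aws-ecs';

    class EcsStack extends cdk.Stack {
      constructor(scope: cdk.App, id: string) {
        super(scope, id);

        // maxAzs: 2 spreads the cluster over two data centers (availability zones).
        const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2 });
        const cluster = new ecs.Cluster(this, 'Cluster', { vpc });

        const asg = cluster.addCapacity('Capacity', {
          instanceType: new ec2.InstanceType('t3.medium'), // placeholder size
          minCapacity: 2, // the v above: one machine per data center
          maxCapacity: 8, // the w above: a placeholder upper bound
        });

        // Keep the cluster's memory reservation around a target inside the
        // x..y band (the 50 percent target is a placeholder).
        asg.scaleToTrackMetric('MemoryReservation', {
          metric: cluster.metricMemoryReservation(),
          targetValue: 50,
        });
      }
    }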

Amazon ECS allows us to easily run Docker containers on our EC2 servers. Docker is a virtualization technique that abstracts the infrastructure away from applications: each application runs in isolation, invisible to other applications and appearing as the only running application on the machine. This brings advantages such as resource sharing and thus reduced costs; standardization through repeatable development, build, test, and production environments; deployable units (called containers); simplicity; fast deployments; isolation; and security.

Amazon Elastic Container Service (Amazon ECS) is a highly scalable, high-performance container orchestration service that supports Docker containers and allows you to easily run and scale containerized applications on AWS.

Our applications are packaged as images, an encapsulation of the application and the operating system it runs on. An image is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably. Images are stored in Amazon Elastic Container Registry (ECR).

Amazon Elastic Container Registry (ECR) is a fully-managed Docker container registry that makes it easy for developers to store, manage, and deploy Docker container images. Amazon ECR is integrated with Amazon Elastic Container Service (ECS), simplifying your development to production workflow.

The cluster runs Docker containers, which are mutable instances of Docker images.
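
Running an image on the cluster is expressed as a task definition plus a service. A sketch, assuming a hypothetical repository name, tag, and sizes, and reusing the cluster from the previous sketch:

    import * as ecs from 'aws-cdk-lib/aws-ecs';
    import * as ecr from 'aws-cdk-lib/aws-ecr';
    import { Construct } from 'constructs';

    // Runs a hypothetical 'my-app' image from ECR on the cluster defined above.
    function addAppService(scope: Construct, cluster: ecs.Cluster): ecs.Ec2Service {
      const repo = ecr.Repository.fromRepositoryName(scope, 'AppRepo', 'my-app');

      const taskDef = new ecs.Ec2TaskDefinition(scope, 'AppTask');
      taskDef.addContainer('app', {
        image: ecs.ContainerImage.fromEcrRepository(repo, 'stable'),
        memoryReservationMiB: 256, // placeholder reservation
        // The container listens on port 80; no host port is given, so Docker
        // maps it to a dynamic server port (see the load balancer below).
        portMappings: [{ containerPort: 80 }],
      });

      // The service keeps the desired number of containers running and
      // reprovisions containers that stop or become unhealthy.
      return new ecs.Ec2Service(scope, 'AppService', {
        cluster,
        taskDefinition: taskDef,
        desiredCount: 2,
      });
    }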

Load Balancer

The application load balancer accepts incoming connections and forwards them to the Docker containers on the EC2 cluster. It is the central access point into the cluster.

Applications often accept incoming connections. Connections over the internet are made to an IP address and a port number, where the port number is often standardized for the respective service. Websites, for example, accept incoming connections via HTTP on port 80 and HTTPS on port 443.

When an infrastructure is dynamic, the servers that are part of it can be provisioned and torn down based on the needed capacity. This makes it easy to scale resources up and down based on key performance metrics. A disadvantage is that the IP addresses of the servers can change as well. This complicates matters, as clients no longer have a single static IP address to connect to. Moreover, since our cluster forms a uniform platform to run any application, two websites may end up running on the same machine; both operate via HTTP on port 80 and HTTPS on port 443, while only one application can listen on these ports.

Recall that with Docker, applications are deployed as containers, and each container has its own network abstraction, meaning that both can bind to these ports inside their own container. Since the container network is abstracted and not directly accessible from the outside, the container ports need to be mapped to two different server ports. These server ports are typically in the higher ranges; the important point is that these mappings are dynamic, so not only the IP address through which an application is reached can change, but also the port number.

Therefore we need a method that accepts incoming connections via a single static address and forwards each connection to the dynamically provisioned IP address and mapped port number. This is the responsibility of an application load balancer, which is part of Amazon EC2. It accepts incoming connections for all our applications and forwards each connection to one of the servers running the respective application, on a server port that is mapped to the container port, so that the connection reaches the application running in the deployed container.

Application Load Balancer operates at the request level (layer 7), routing traffic to targets – EC2 instances, containers, IP addresses and Lambda functions based on the content of the request. Ideal for advanced load balancing of HTTP and HTTPS traffic, Application Load Balancer provides advanced request routing targeted at delivery of modern application architectures, including microservices and container-based applications. Application Load Balancer simplifies and improves the security of your application, by ensuring that the latest SSL/TLS ciphers and protocols are used at all times.

Using an application load balancer has the advantage that we can balance the load over dynamically provisioned machines and services, as the application load balancer is context-aware, while having a single place to connect to, gather statistics, and provide SSL encryption. The latter is very important, as managing SSL encryption in Docker containers is a challenge. Instead, encryption is managed and terminated at the load balancer, so that communication behind it, within the cluster, can be left unencrypted, avoiding the hassle of provisioning certificates in each container. SSL encryption is offered using AWS Certificate Manager.

AWS Certificate Manager is a service that lets you easily provision, manage, and deploy public and private Secure Sockets Layer/Transport Layer Security (SSL/TLS) certificates for use with AWS services and your internal connected resources. SSL/TLS certificates are used to secure network communications and establish the identity of websites over the Internet as well as resources on private networks. AWS Certificate Manager removes the time-consuming manual process of purchasing, uploading, and renewing SSL/TLS certificates.

Another major advantage of a load balancer is health checks. The load balancer is context-aware, meaning it knows exactly what is running where. This information is used to perform health checks: periodic checks to see whether a container is running correctly. If so, the load balancer forwards connections to it; if not, the container stops receiving traffic and is reprovisioned. This gives us fail safety: if one application crashes, this is detected and another machine takes over until the container has been recovered by means of reprovisioning.
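
Tying the load balancer pieces together, a sketch of how this could be expressed in the definition; the certificate ARN and health check path are placeholders:

    import * as ec2 from 'aws-cdk-lib/aws-ec2';
    import * as ecs from 'aws-cdk-lib/aws-ecs';
    import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';
    import * as acm from 'aws-cdk-lib/aws-certificatemanager';
    import { Construct } from 'constructs';

    function addLoadBalancer(scope: Construct, vpc: ec2.IVpc, service: ecs.Ec2Service) {
      // The single, static entry point into the cluster.
      const alb = new elbv2.ApplicationLoadBalancer(scope, 'Alb', {
        vpc,
        internetFacing: true,
      });

      // TLS terminates here; traffic behind the load balancer stays unencrypted.
      const certificate = acm.Certificate.fromCertificateArn(
        scope, 'Cert', 'arn:aws:acm:eu-west-1:111111111111:certificate/placeholder');
      const listener = alb.addListener('Https', {
        port: 443,
        certificates: [certificate],
      });

      // Forward to the dynamically mapped container ports and periodically
      // health-check each one; unhealthy containers stop receiving traffic
      // until they have been reprovisioned.
      listener.addTargets('App', {
        port: 80,
        targets: [service],
        healthCheck: { path: '/health' }, // hypothetical health endpoint
      });
    }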

Continuous Deployment and Integration

Our applications consist of code that runs in a certain environment / operating system. The code is stored in a central place, a version control system, together with definition files that describe how to build the operating system and the application, and what to do when the operating system starts; in short, how to run the application. We practice continuous integration and deployment using AWS CodeBuild and AWS CodePipeline.

Continuous integration is the process of merging new code from developers into the common code base. The developer's changes are validated by creating a build and running automated tests against it. This avoids the integration hell that usually happens when people wait for release day to merge their changes into the common code base.

AWS CodeBuild is a fully managed continuous integration service that compiles source code, runs tests, and produces software packages that are ready to deploy. With CodeBuild, you don’t need to provision, manage, and scale your own build servers.

Continuous deployment is an extension of continuous integration that makes sure you can release new changes to your customers quickly in a sustainable way. This means that on top of having automated your testing, you have also automated your release process, and you can deploy your application at any point in time. With this practice, every change that passes all stages of the production pipeline is released to customers. There is no human intervention, and only a failed test will prevent a new change from being deployed to production. This brings the following advantages: we can develop faster, as there is no need to pause development for releases; deployment pipelines are triggered automatically for every change; releases are less risky and easier to fix in case of problems, as small batches of changes are deployed; and customers see a continuous stream of improvements, with quality increasing every day instead of every month, quarter, or year.

AWS CodePipeline is a continuous delivery service you can use to model, visualize, and automate the steps required to release your software. You can quickly model and configure the different stages of a software release process. CodePipeline automates the steps required to release your software changes continuously.
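
As an illustration, a build project could be defined along these lines; the repository URI variable and commands are placeholders, not our actual buildspec:

    import * as cdk from 'aws-cdk-lib';
    import * as codebuild from 'aws-cdk-lib/aws-codebuild';

    class BuildStack extends cdk.Stack {
      constructor(scope: cdk.App, id: string) {
        super(scope, id);

        // Builds the application's Docker image and pushes it to ECR, where a
        // later pipeline stage deploys it to the ECS cluster.
        new codebuild.PipelineProject(this, 'AppBuild', {
          environment: { privileged: true }, // needed to run Docker inside the build
          buildSpec: codebuild.BuildSpec.fromObject({
            version: '0.2',
            phases: {
              build: {
                commands: [
                  'docker build -t $REPO_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION .',
                  'docker push $REPO_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION',
                ],
              },
            },
          }),
        });
      }
    }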

CI/CD Stack

So our applications are integrated and deployed automatically. Now suppose we change the infrastructure definition: the developer who changes it needs to update the infrastructure according to the definition by running an AWS CDK command. This is error prone, as someone can release a change without integrating the definition code into the common code base, meaning there is no trace of the change. As a solution, our infrastructure definition also has continuous deployment, meaning we change the infrastructure by integrating a code change into the common code base. There is some inception here, as the infrastructure definition defines the resources that update the definition itself!
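
CDK supports this pattern directly through its pipelines module. A sketch of such a self-mutating pipeline, with a placeholder repository and branch:

    import * as cdk from 'aws-cdk-lib';
    import { CodePipeline, CodePipelineSource, ShellStep } from 'aws-cdk-lib/pipelines';

    class CicdStack extends cdk.Stack {
      constructor(scope: cdk.App, id: string) {
        super(scope, id);

        // A self-mutating pipeline: merging a change to the infrastructure
        // definition first updates the pipeline itself, then deploys the rest
        // of the infrastructure.
        new CodePipeline(this, 'InfraPipeline', {
          synth: new ShellStep('Synth', {
            input: CodePipelineSource.gitHub('lemonade/infrastructure', 'main'),
            commands: ['npm ci', 'npx cdk synth'],
          }),
        });
      }
    }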