Spot.IM is one of the earliest clients on our portfolio. The very successful startup based in NYC and TLV is backing some of the busiest media providers’ conversation platforms on the internet.
When giants like FoxNews, Diply, AOL, PlayBoy, HuffPost and many more are relying on your production to back their conversation, it needs to be stable, resilient, scalable and perform each with perfection.
Conversations are the second major interaction method users have with media websites, making it one of the two major revenue producers. As such, every outage is immediately translated to 💰 loss and lots of it.
The architecture design in the company was always striving for microservices and was deployed in containers from very early stages. Yet, throughout a long period of time, they were deployed “hard” on dedicated hosts, making them inflexible, slowly scaleable and inefficient.
We came to an understanding that a re-architecture is required and that orchestrating the services is the natural way to go. Keeping in mind the implications any mistake can project, it had to be planned thoroughly.
Migrating a live production that is currently serving customers to a new platform requires designing, planning and zooming into specific details. Even after doing these, the process has to involve short feedback cycles as the designs and plans are never final in their first draft.
To this end we utilized CloudFormation for templating the infrastructure from VPC level, all the way down to ECS TaskDefinition attributes.
After a multilevel iteration phase that included additional networking and service encapsulation, we turned to implement the entire monitoring and alerting system’s as templates as well. For this task we implemented CloudWatch Metrics, Alerts, Logs and Dashboards all deployed as CloudFormation stacks.
Once the templates were ready to work with, we started our POC.
In order to achieve a POC, we decided to implement the architecture, deploying a specific core service of the company in a staging environment. Had all went well, we would have proceeded to a full staging cluster.
The first deploy went well, and the service was successfully serving its purpose from its new location. However, during the initial load test, we identified that ECS optimized AMI’s are limiting the available open files on the host by default. Since the hosted containers inherit the maximum values from their underlying machine, we ended up allowing ~1000 open files to our service, which in our specific use case was unacceptable.
Tweaking the AMI to allow inheritance of larger open files max values, improved the tests results by a factor.
After a successful POC, passing load tests and yielding good monitoring metrics values, we went on to a full staging cluster deployment. This stage means we will be utilizing a full ECS cluster for the first time, hosting ~60% of the companies backend services. The deployment was successful with all services but that was no surprise. We started by deploying the core service; a cache server that handles 100% of the incoming traffic, which is the most complex of them all.
What we did encounter for the first time was the projection of scale activities in two different planes:
Once a full-grown cluster was in place, monitored, handling scale activities and looking good on load tests results it was time for the last preproduction test; Chaos.
Chaos Engineering is the discipline of experimenting on a distributed systemin order to build confidence in the system’s capability to withstand turbulent conditions in production.
Experimenting with chaos doesn’t have to be a cluster of application(s) that are continuously running such as Netflix’s Simian Army. Chaos is talking about “creating chaos” in different areas of the infrastructure to test their durability in times of mass failures, major outages that are out of our control, etc.
Testing chaos was done in 4 scenarios:
At this point, we had a fully functional ECS staging cluster, load tested, chaos experimented on, running with continuous integration and delivery processes. Next stop → PRODUCTION
Building a production cluster was a rather simple task. After experimenting with heavy load and “accidental” failures, we had a pretty good idea of how real traffic would be handled within the new system. The process involved a single service deployment of four core application services in a cluster of 15 containers. One by one the services were migrated from the old environment to ECS, carefully letting them live in cycles of 24 hours while monitoring them as much as possible. After a week of slow migrations, the first production cluster was live and healthy. Over the course of the 72 hours that followed, the mission was to bring scalability to perfection, by fine-tuning each service with its relevant metrics’ levels and thresholds, upon which they would scale up or down. Doing that, including a manual tuning of CloudWatch metrics and alerts, and later on translating the final values to the CloudFormation templates used to backup and deploy the services.
Running a production cluster that included a few heavy-duty-core services provided the confidence in the system and architecture. This was the starting pistol to the migration project of the entire production platform. Over a course of 3 months, different clusters were created and services were migrated into them. We decided to logically separate clusters into 6 groups of applications, and so they were divided and deployed. Each cluster went through the same cycle; deployment, migration, monitoring, fine-tuning, monitoring for results and when we were happy about them, the final step was updating the templates.
Major things I’ve learned in the process of building and migrating to ECS:
The production described above has been running and growing ever since 14 months and counting. Services were added, as well as some of the biggest content providers in the world such as FoxNews, Diply, AOL, PlayBoy, HuffPost and thousands of others enjoy it’s resiliency and scalability every day.
After considering the success of the first iteration of ECS deployment and striving to perfection with speed of deployments, usage of infrastructure-as-code concepts, rollback mechanisms and automatic updates; we’ve conceived a plan we call “ECS V2”. The path to V2, new concepts and improvements will be detailed in my next story, stay tuned.