Twist Bioscience Case Study

Implementing automatic monitoring for all your environments allows you to deploy better systems at lower risk!

A happy customer is one whose bugs are fixed before they even realize there were bugs to fix!
You may not realize it, but getting error information as it happens is not magic.

Highlights

Challenges:

  • More than 15 microservices running on the client's 5 different Kubernetes clusters.
  • Many AWS services that need to be monitored.
  • An on-premise environment with several VPN tunnel connections to their AWS accounts.
  • Problematic code versions deployed to several AWS Lambda functions.
  • A broken third-party library that caused many problematic symptoms.
  • Non-scalable in-house code that had been deployed.

Solutions:

  • Added various alerts using the Prometheus Alertmanager system.
  • Added an alert that sends a notification immediately each time a human error occurs.
  • Implemented Sentry, providing immediate notifications on integration and code issues.
  • Grafana graphs helped us understand the problems and continually refine that understanding.
  • Grafana shows all the resources consumed by the services running in Kubernetes clusters.

Results: 

  • Saved more than 25% per month on the cost of EC2 instances. 
  • Errors are reported quickly, so a solution can be implemented before the customer experiences any issue.
  • Less time spent finding and fixing bugs and problems.

Full Story

These days, as many companies journey toward microservices, containers, the cloud, and so on, we expose ourselves to many different systems that can potentially break and create downtime.


For our customers, downtime equals losing money, and losing money is unacceptable to us.


Downtime can be prevented in many ways, but two essential factors are alerting and monitoring.

By implementing the correct methods and tools for our customers, we reduce downtime and prevent the loss of money.

  • When using alerting, you can be notified as soon as you have a problem with your systems.
  • Monitoring helps you predict potential problems and gives you insight into their root causes.


In the next several paragraphs, I’ll walk through a case study of work we did for a client of ours, Twist Bioscience, showing how we helped them reduce downtime and prevent potential problems using diverse tools such as Prometheus, Grafana, Sentry, and New Relic.

These tools tackle every part of the environment: cloud resources, infrastructure, dependencies, and applications.


So let's get started with a description of the infrastructure and services we were dealing with:

Approximately 15 different microservices on 5 different Kubernetes clusters (dev, qa, staging, production, and tools) across 3 different AWS accounts, several AWS Lambda functions with API Gateways in front of them, more than 10 AWS RDS databases, an Elasticsearch cluster, Redis instances, and more.

The tools are just a means to an end; here we will focus on the improvements that allow the client to feel safe and secure in their systems. At the bottom of this article, I have provided an overview of the tools from their official websites.

Next, I’ll show several examples of problems we encountered, and how the alerting system notified us about them instantly.


  1. Twist Bioscience has an on-premise environment, and we created several VPN tunnel connections between it and their AWS accounts. A number of services in the cloud connect to on-premise services via these tunnels, and we want to know instantly if, for some reason, one of the VPN tunnels goes down. So, we added an alert using the Prometheus Alertmanager system that sends a notification to specific Slack channels each time this happens (an AWS-side alert).

We can see here a Production VPN tunnel that is down.
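The production alert itself is an Alertmanager rule evaluated inside Prometheus; purely as an illustration of the underlying check, here is a minimal Python sketch that runs an equivalent query against Prometheus's HTTP API and posts to a Slack webhook. The metric name aws_vpn_tunnel_state, its tunnel_ip label, the Prometheus address, and the webhook URL are all assumptions, not taken from Twist Bioscience's setup.

    # Illustrative sketch only -- the real alert is a Prometheus Alertmanager rule.
    # Assumes CloudWatch VPN metrics are scraped into Prometheus (e.g. via a
    # CloudWatch exporter) under the hypothetical name "aws_vpn_tunnel_state".
    import requests

    PROMETHEUS_URL = "http://prometheus.example.com"        # assumed address
    SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX"  # hypothetical webhook

    def check_vpn_tunnels():
        # Instant query: a tunnel state of 0 means the tunnel is down.
        resp = requests.get(
            f"{PROMETHEUS_URL}/api/v1/query",
            params={"query": "aws_vpn_tunnel_state == 0"},
            timeout=10,
        )
        resp.raise_for_status()
        for series in resp.json()["data"]["result"]:
            tunnel = series["metric"].get("tunnel_ip", "unknown")  # assumed label
            requests.post(
                SLACK_WEBHOOK,
                json={"text": f"VPN tunnel {tunnel} is DOWN"},
                timeout=10,
            )

    if __name__ == "__main__":
        check_vpn_tunnels()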


  2. The K8s clusters we created for them in specific environments are accessible to some developers, and human errors can happen: a developer can delete a deployment (a K8s resource) by mistake, and we want to know immediately when this occurs.
    Once this happens, an alert is sent to an #alert Slack channel that the developers are monitoring.

We can see here an example of the mes-clu-celery deployment that is down in the Staging cluster.
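This alert, too, comes from Prometheus Alertmanager (such checks are typically driven by kube-state-metrics). Purely as an illustration of the same idea, here is a sketch using the official kubernetes Python client to flag deployments with no available replicas; the namespace is an assumption.

    # Illustrative sketch, not the production alert (which is an Alertmanager rule).
    # Uses the official "kubernetes" Python client to flag deployments whose
    # available replica count has dropped to zero.
    from kubernetes import client, config

    def find_down_deployments(namespace="default"):  # namespace is an assumption
        config.load_kube_config()  # use load_incluster_config() inside a pod
        apps = client.AppsV1Api()
        down = []
        for dep in apps.list_namespaced_deployment(namespace).items:
            available = dep.status.available_replicas or 0
            desired = dep.spec.replicas or 0
            if desired > 0 and available == 0:
                down.append(dep.metadata.name)
        return down

    if __name__ == "__main__":
        for name in find_down_deployments():
            print(f"Deployment {name} has no available replicas!")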


  3. Unfortunately, bugs are unavoidable because developers are human; all we can do is try our best to prevent them. Here, we can see an error in one of the third-party libraries they use, which caused a connectivity error from the local service to a SaaS service. They are notified immediately by Sentry when the exception happens, and they can then understand and fix the error by downgrading the third-party library version.
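Wiring Sentry into a Python service takes only a few lines. Here is a minimal sketch; the DSN is a placeholder and the environment name is an assumption.

    # Minimal Sentry setup for a Python service -- the DSN is a placeholder.
    import sentry_sdk

    sentry_sdk.init(
        dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
        environment="staging",   # assumed environment name
        traces_sample_rate=0.1,  # sample 10% of transactions for performance data
    )

    # From here on, any unhandled exception is reported to Sentry automatically,
    # together with its stack trace and environment context.
    def connect_to_saas():
        raise ConnectionError("simulated third-party connectivity failure")

    try:
        connect_to_saas()
    except ConnectionError:
        # Handled exceptions can be reported explicitly as well.
        sentry_sdk.capture_exception()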

           

            


“Our Beta testing customers were reporting problems they experienced in the system before we knew about them.  Since Prodops implemented their solution for our full scale e-commerce launch, we now know about, and can fix, a problem before a customer experiences it.”




Now, I will show some graphs from Grafana that helped us understand the problems, and how this saved Twist Bioscience money.


  1. Twist Bioscience has AWS Lambda functions in several different AWS accounts; the functions are triggered by their services via API Gateway requests. Once, when a problematic function version was deployed, it caused the execution time of the function to exceed 30 seconds - and AWS API Gateway has a timeout of 30 seconds!

So the service that called this function got a timeout from the API Gateway.

Using the correct graph in Grafana, we can see the timeouts:

Here we can see API Gateway Latency hit 30 seconds.
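These Grafana panels are fed from CloudWatch, so the same latency data can also be pulled directly. Here is a rough boto3 sketch; the region, API name, and time window are assumptions.

    # Sketch: pull API Gateway latency from CloudWatch with boto3.
    # The region, API name, and time window are assumptions for illustration.
    from datetime import datetime, timedelta, timezone
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # assumed region

    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ApiGateway",
        MetricName="Latency",
        Dimensions=[{"Name": "ApiName", "Value": "my-api"}],  # hypothetical API name
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=60,  # one datapoint per minute
        Statistics=["Maximum"],
        Unit="Milliseconds",
    )

    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        # Latency near 30,000 ms means the request hit the API Gateway timeout.
        flag = "  <-- timeout!" if point["Maximum"] >= 30_000 else ""
        print(point["Timestamp"], point["Maximum"], flag)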


  2. Another example of an AWS Lambda function problem occurred when a problematic function version was deployed. The graph below showed the number of errors the function produced.
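The same AWS/Lambda Errors metric that feeds this graph can also drive a CloudWatch alarm directly. Here is a boto3 sketch; the region, function name, and SNS topic ARN are hypothetical.

    # Sketch: create a CloudWatch alarm on a Lambda function's Errors metric.
    # The region, function name, and SNS topic ARN are hypothetical.
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # assumed region

    cloudwatch.put_metric_alarm(
        AlarmName="my-function-errors",
        Namespace="AWS/Lambda",
        MetricName="Errors",
        Dimensions=[{"Name": "FunctionName", "Value": "my-function"}],
        Statistic="Sum",
        Period=60,        # evaluate error counts per minute
        EvaluationPeriods=1,
        Threshold=0,
        ComparisonOperator="GreaterThanThreshold",  # any error fires the alarm
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],  # hypothetical
    )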

            



  3. More than 15 microservices were running on their Kubernetes clusters. Following Kubernetes best practices, we gave the pods compute resource requests and limits. Initially, Twist Bioscience decided what the request and limit values would be based on the requirements of the frameworks they used. In Grafana, we have a graph that shows all the resources consumed by the services running in the Kubernetes clusters. When we looked at some of the services' metrics in this graph, we saw a number of services with far more resources allocated than they needed. After updating the services' resources to the correct numbers, they saved more than 25% per month on the cost of EC2 instances.

Take a look at the graph below regarding memory usage:

We can see here that the configured limit is much higher than the actual usage, meaning the values we had set as pod resources were far larger than necessary.
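This comparison can also be scripted. As an illustration, here is a sketch that queries Prometheus for containers whose memory request dwarfs their actual usage; the metric names follow kube-state-metrics and cAdvisor conventions, and the Prometheus address and the 3x threshold are assumptions.

    # Sketch: compare memory requests against actual usage via Prometheus.
    # Metric names follow kube-state-metrics / cAdvisor conventions; the
    # Prometheus address and the 3x threshold are assumptions.
    import requests

    PROMETHEUS_URL = "http://prometheus.example.com"  # assumed address

    # Ratio of requested memory to actual working-set memory, per container.
    # A ratio far above 1 means the request is oversized and wastes node capacity.
    OVERSIZED = """
    sum by (namespace, container) (
      kube_pod_container_resource_requests{resource="memory"}
    )
    /
    sum by (namespace, container) (
      container_memory_working_set_bytes{container!=""}
    )
    > 3
    """

    def query(promql):
        resp = requests.get(
            f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql}, timeout=10
        )
        resp.raise_for_status()
        return resp.json()["data"]["result"]

    for series in query(OVERSIZED):
        labels = series["metric"]
        ratio = float(series["value"][1])
        print(f"{labels['namespace']}/{labels['container']}: "
              f"request is {ratio:.1f}x actual usage")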


Results:

When I asked Roy Nevo, Director of Product Development at Twist Bioscience, if he could measure the time it took to identify a problem in his systems before and after implementing the alerting and monitoring systems, he said: 

  1. “Our Beta testing customers were reporting problems they experienced in the system before we knew about them.  Since Prodops implemented their solution for our full scale e-commerce launch, we now know about, and can fix, a problem before a customer experiences it.” 
  2. “In regards to the time it takes for us to find out about a problem; it moved from days to minutes!!!”


By implementing monitoring and alerting, Twist Bioscience’s team sees many long-term benefits, including improved efficiency across the entire company. This includes the streamlining of several crucial processes and the elimination of the needless overhead that was wasting human resources.
Twist Bioscience’s team is left not only with a more efficient system, but also the confidence that they’re meeting their requirements, and that their new features will withstand future company changes and growth.


Tools Overview

  • Prometheus - is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Since its inception in 2012, many companies and organizations have adopted Prometheus, and the project has a very active developer and user community. 

We use Prometheus to collect metrics from our K8s clusters, AWS CloudWatch, our CI server, and more, check thresholds, and send alerts.


  • Grafana - is an open source metric analytics & visualization suite. It is most commonly used for visualizing time series data for infrastructure and applications. We use Grafana to build informative dashboards that give us a better in-depth view of what the hell is happening in our environments. Grafana uses AWS and Prometheus as data sources.


  • Sentry - provides open source exception tracking: it tracks every exception in your applications as it happens and sends the stack trace and environment information needed to prioritize, identify, reproduce, and fix each issue.

For us, Sentry is a must: it is very easy to implement, and we use it to track errors in all our applications. It allows us to find bugs in the development stage and fix them before they move to a higher environment.


  • New Relic - gives you deep performance analytics for every part of your software environment.

With New Relic, we can optimize our services in terms of memory, CPU, number of workers, and more.


“We knew we needed to improve our working production environment and after much research and based on outstanding recommendations, we chose to work with the experienced and trusted Prodops.”

Beneficial Monitoring and Alerting doesn’t have to be a pain. 

Find out how we can help you Prevent Downtime and Save Money by leveraging our expertise to provide the right solutions, so you can achieve the desired results. 


Consult with us at

ProdOps
