Reducing Downtime with Alerting and Monitoring

How ProdOps helped Twist Bioscience reduce downtime and prevent potential problems by using diverse tools such as Prometheus, Grafana, Sentry and New Relic.

Implementing automatic monitoring across all your environments lets you deploy better systems at lower risk!

A happy customer is one whose bugs are fixed before they realize there were bugs to fix!

You may not realize it, but getting error information as it happens is not magic.

These days, with so many companies moving to microservices, containers, and the cloud, we expose ourselves to many different systems that can break and cause downtime.

For our customers, downtime means lost money, and losing money is unacceptable to us.

Downtime can be prevented in many ways, but two essential factors are alerting and monitoring.

By implementing the right methods and tools for our customers, we reduce downtime and prevent the loss of money.

  • Alerting notifies you as soon as a problem occurs in your systems.
  • Monitoring helps you predict potential problems and gives you an inside look at the underlying issues.
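
The difference between the two can be sketched in a few lines of Python. This is only an illustration, not the setup used in the case study: the disk-usage samples, the 90% threshold, and the function names are all hypothetical. Alerting checks the latest value against a threshold, while monitoring looks at the trend and estimates when the threshold will be reached.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value in percent)


def should_alert(samples: List[Sample], threshold: float) -> bool:
    """Alerting: is the most recent sample already at or over the threshold?"""
    return bool(samples) and samples[-1][1] >= threshold


def seconds_until_threshold(samples: List[Sample], threshold: float) -> Optional[float]:
    """Monitoring: fit a least-squares line through the samples and estimate
    how long after the last sample the value will reach the threshold.

    Returns None when there are too few samples or the trend is flat or falling.
    """
    if len(samples) < 2:
        return None
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    crossing_time = (threshold - intercept) / slope
    return max(0.0, crossing_time - samples[-1][0])


# Hypothetical disk usage, sampled once per hour
disk_usage = [(0.0, 60.0), (3600.0, 65.0), (7200.0, 70.0)]
print("alert now:", should_alert(disk_usage, 90.0))
eta = seconds_until_threshold(disk_usage, 90.0)
if eta is not None:
    print("hours until 90% full:", round(eta / 3600, 1))
```

With these sample values no alert fires yet, but the trend suggests the disk would reach 90% in about four hours, which leaves time to act before anything breaks.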

In this case study, I’ll show and explain how we helped a client of ours, Twist Bioscience, reduce downtime and prevent potential problems by using diverse tools such as Prometheus, Grafana, Sentry and New Relic.

These tools cover every part of the environment: cloud resources, infrastructure, dependencies, and applications.

Highlights:

Challenges:

  • More than 15 microservices running on the client’s 5 different Kubernetes clusters.
  • Many AWS services that need to be monitored.
  • An on-premises environment connected to their AWS accounts through several VPN tunnels.
  • Problematic code versions deployed to several AWS Lambda functions.
  • A broken third-party library that caused many problematic symptoms.
  • Non-scalable in-house code that had been deployed.

Solutions:

  • Added various alerts using Prometheus Alertmanager.
  • Added an alert that sends a notification immediately each time a human error occurs.
  • Implemented Sentry, providing immediate notifications on integration and code issues.
  • Grafana graphs helped us understand the problems and keep refining that understanding.
  • Grafana shows all the resources consumed by the services running in Kubernetes clusters.
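
With many services spread over several clusters, one incident can trigger a flood of alerts. Alertmanager deals with this by grouping alerts that share label values into a single notification. The Python below is a rough sketch of that grouping idea only, not the Alertmanager configuration from this project: the label names, cluster names, service names, and helper functions are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Dict[str, str]  # label name -> label value


def group_alerts(alerts: List[Alert], group_by: Tuple[str, ...]) -> Dict[Tuple[str, ...], List[Alert]]:
    """Bucket alerts that share the same values for the group_by labels."""
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def summarize(groups: Dict[Tuple[str, ...], List[Alert]]) -> List[str]:
    """Build one notification line per group instead of one per alert."""
    lines = []
    for key, members in sorted(groups.items()):
        services = sorted({a.get("service", "unknown") for a in members})
        lines.append(
            f"{'/'.join(key)}: {len(members)} alert(s) affecting {', '.join(services)}"
        )
    return lines


# Hypothetical firing alerts
firing = [
    {"alertname": "HighErrorRate", "cluster": "cluster-a", "service": "orders"},
    {"alertname": "HighErrorRate", "cluster": "cluster-a", "service": "billing"},
    {"alertname": "PodRestarting", "cluster": "cluster-b", "service": "orders"},
]
notifications = summarize(group_alerts(firing, ("cluster", "alertname")))
for line in notifications:
    print(line)
```

Here three firing alerts collapse into two notifications, one per cluster and alert name, so the on-call engineer sees the scope of each problem at a glance.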

Results:

  • Saved more than 25% per month on the cost of EC2 instances.
  • Errors are reported quickly, so a fix can be implemented before customers experience any issue.
  • Time saved by finding and fixing bugs and problems quickly.

Twist Bioscience’s team sees many long-term benefits, including improved efficiency across the entire company. This includes the streamlining of several crucial processes and the elimination of the needless overhead that was wasting human resources.

For more details on how we helped Twist Bioscience prevent downtime and save money by implementing monitoring and alerting, read the full case study.

Thank you for reading!

A special thanks to Twist Bioscience.

By Ziv Rechnitzer
Ziv Rechnitzer
Senior Operations Architect
With more than 10 years of experience helping companies solve their operations and delivery problems, he focuses on providing the best experience he can. His technical expertise stretches across systems management, managed services and a deep understanding of lifecycle management. He tries to spread the DevOps spirit wherever and whenever possible. He is loyal to his customers, reliable, and aspires to make an impact through his work. In his spare time, he is a sports fan and enjoys playing board games and watching movies and TV shows. He is passionate about executing the most effective solutions that work and focusing on what really makes an impact.