The CTO of 90min explains the part DevOps played in their growth from a small start-up to a company with a user-base of more than 50 million football fans. He discusses what he considers the definition of a good DevOps engineer, and how 90min plans to occupy other markets around the globe (India, Asia) despite some infrastructure challenges.
“DevOps cannot be thought of as a one-time project. Since things are constantly changing, the company must have someone to be in charge of the DevOps field. It is not an off-the-shelf product that you install and can forget about. DevOps is something that lives and breathes on a daily basis. It is a huge territory and sometimes a painful one, but it is worth it,” says Sharon Weiss, CTO of 90min, who led his company to adopt best practices of DevOps technologies.
90min is a global media technology company focused on digital content generation for football matches. The company provides an interactive self-publishing platform for football fans by turning their generated media into global sport journalistic content. In addition, the platform supplies match predictions, videos, slideshows, team line-ups, player rankings and more.
CTO Weiss started his career at Walla (the largest portal in Israel) and at Seeking Alpha. A couple of years ago he joined 90min. His department includes 25 employees with “one and a half” DevOps engineers, as he puts it: “a DevOps engineer and another person that came from the back-end team.”
“We have reached a level where we suddenly grew to millions of users, backed by leading investors. This is exactly the point where you turn from a start-up into a company, and you need to prepare for the next step,” he explains. He spoke about the needs that led his company to search for new technologies and when exactly DevOps came into the picture. “Our users, mostly young people, don’t have much tolerance. Load time should be at its best, otherwise they will leave for somewhere else.”
Can you describe the decision-making process back when you considered implementing DevOps projects?
“We wanted a continuous deployment process: getting to the point where we can deliver and deploy 24/7 without damaging user experience or going down. We have users who consume our content all over the globe in different time zones. This process requires the DevOps engineer to be responsible for organizing the whole deployment process: taking the development code, deploying it on some kind of environment, meanwhile getting it into the queue for code review, moving to QA and then releasing to production. He connects the developers and the production engineers and is responsible for both sides. He checks that the code is well-written and can scale, and he checks that the cloud deployment works well.”
The idea for the project came from Netflix
A few months ago Weiss and his team performed an auto-scale project, which included building automatic integration and deployment processes for their back-end infrastructure and designing an efficient financial model for cloud server usage.
“It came from one of our back-end engineers who later became our DevOps person.” He explains how it all began. “We started to realize that we were not scaling under loads. We used to buy more servers to prepare ourselves for the peak, but we didn’t scale back down — the peak doesn’t last for 24 hours, so you end up paying for resources you are not using. He [the engineer] told me about something he read about Netflix and what they have done, and we took a few months to think about it and learn. The team included five people: him, three guys from the back-end team, and me.”
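The cost argument Weiss makes here can be illustrated with a little arithmetic. The sketch below is not 90min's financial model; the hourly demand figures are hypothetical and chosen only to show why capacity pinned at the peak costs more than capacity that follows the load.

```python
def fixed_instance_hours(hourly_need):
    """Instance-hours when capacity is pinned at the daily peak all day."""
    return max(hourly_need) * len(hourly_need)


def autoscaled_instance_hours(hourly_need):
    """Instance-hours when capacity follows the need hour by hour."""
    return sum(hourly_need)


def saving_fraction(hourly_need):
    """Share of instance-hours saved by scaling back down after the peak."""
    fixed = fixed_instance_hours(hourly_need)
    return (fixed - autoscaled_instance_hours(hourly_need)) / fixed


# Hypothetical day: 20 quiet hours needing 4 instances, 4 peak hours needing 10.
day = [4] * 20 + [10] * 4
print(fixed_instance_hours(day))       # 240
print(autoscaled_instance_hours(day))  # 120
print(saving_fraction(day))            # 0.5
```

The shorter the peak is relative to the day, the larger the share of paid-for capacity that sits idle under fixed provisioning.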
The upper management was not involved?
“No, they weren’t involved. The business requirement is that the website and its services should always be live and performing at their best. As CTO, I’m the one who decides what is needed, including the budget. When we started, our purpose was to get to the point where, after three months, we would respond quickly and successfully to traffic peaks without the touch of a human hand. We buy more server instances only when and where we need them, and we free up servers at times and in places where we don’t need them.
“One of the business targets was to decrease servers’ workload by 35%. The second target was to react successfully to peaks, and that was easy since we knew the peak patterns — for instance, the football matches take place on Saturday and Sunday nights, so we knew to expect peaks at that time.
“Part of this project was to tune and define all the patterns, to know what to measure and to understand when to scale server instances up or down accordingly. You can measure the number of people that are turning to you, the latency of the servers and so on. There are many parameters. The process is first to develop and then to tune the parameters. There’s a set of seven or eight parameters that you can play with to help you decide how to scale successfully.”
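A threshold-based scaling decision of the kind Weiss describes might look roughly like the sketch below. It is an illustration only: the two metrics, every threshold value and all the names (`Metrics`, `Policy`, `desired_instances`) are assumptions made for the example, not 90min's actual parameters or code.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    requests_per_instance: float  # average requests per second per instance
    p95_latency_ms: float         # 95th-percentile response latency


@dataclass
class Policy:
    # Every value here is hypothetical and would need tuning against real traffic.
    high_rps: float = 400.0
    low_rps: float = 150.0
    high_latency_ms: float = 800.0
    min_instances: int = 2
    max_instances: int = 40
    step: int = 2


DEFAULT_POLICY = Policy()


def desired_instances(current: int, m: Metrics, p: Policy = DEFAULT_POLICY) -> int:
    """Return the instance count to aim for, given the latest measurements."""
    if m.requests_per_instance > p.high_rps or m.p95_latency_ms > p.high_latency_ms:
        target = current + p.step   # under pressure: scale up
    elif m.requests_per_instance < p.low_rps:
        target = current - p.step   # quiet and latency is fine: scale down
    else:
        target = current            # inside the comfortable band: hold
    # Never go below the floor or above the ceiling.
    return max(p.min_instances, min(p.max_instances, target))


print(desired_instances(10, Metrics(500, 300)))  # 12
print(desired_instances(10, Metrics(100, 300)))  # 8
```

Keeping the thresholds in one `Policy` object reflects the "develop first, then tune" order described above: the decision logic stays fixed while the parameters are adjusted.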
You must have had some failures during the process — how did you deal with them?
“We knew from the beginning that there could be runtime failures in the first few weeks after the development process ended. During that time, you need to check manually. It was not completely automatic until we brought it to the stage where it worked perfectly well.”
How did you manage and plan the project financially?
“We spent some money in advance on our outsourced DevOps consultants; that was the cost of hiring an engineer for six months. My DevOps engineer was dedicated to working on only that project for three to four months. But it was worth it, since we managed to decrease our monthly cloud invoices by 35%. That’s tens of thousands of dollars a year per server. It’s a huge saving! But it isn’t just money. It’s our brand reputation. We didn’t fail. If there was a goal in a football match, for instance, or an interesting sports news event, we sent a push notification to hundreds of thousands of users, and it worked fantastically. If we had had to be prepared for that huge peak in advance, it would have been very expensive. The investment is nothing compared to what we achieved afterward.”
Development vs. Operations — who would be a better DevOps engineer?
The auto-scale project involved deployment and a large amount of code writing. It combined Chef, Jenkins and Amazon tools. “We are doing 10–15 deployments per day, always coding and changing,” Weiss says enthusiastically. “Before the auto-scale took place, when we wanted to update the code we needed to check that we were on the right version and that there were no instances of old code. We had to wrap the code, clone it to the server, and QA it. We had to schedule the process correctly and stop the update when there was new code. Almost all of it was manual.”
Was there a point in the auto-scale project when there were difficulties that made you wonder, ‘Why am I doing this?’ Did you ever think of dropping it?
“No. The alternative was so bad that we were desperate for a change. There were challenges, decisions to be made and priorities to be set. For instance, there was a trade-off between keeping every server up to date with the latest code and reducing the time it took to add servers during peaks. You have to decide what is more important. For us, back then, handling the peak properly was more important.
“Another example: After all the development took place, we reached a point where the servers’ deployment took too much time — over two minutes. It was not satisfactory. We decided to write the code again from scratch. That meant we had to delay other important things, but we did it and improved the time by 50%. We are still working to improve it.”
How would you define a good DevOps engineer for your team?
“A DevOps engineer must have a strong understanding of web development; we are a web company after all. A job description for a DevOps engineer in our company calls for a person who has knowledge beyond the operation of Jenkins, Chef and AWS. He must know the web. He doesn’t have to be a developer, but he needs to understand development and be familiar with back-end languages and databases.
“Personally, I prefer someone who comes from automation or development and moved to operations, rather than the other way around. From my perspective, the operations part is something you can learn pretty fast, whereas a global understanding of web ecosystems is not easy to acquire. This ecosystem starts at the back-end code and moves to the front end; it involves cache servers, DNS and more. It is something that would be more intuitive for developers who have grown into it than for a person who organized machines, connected them to a network, installed operating systems and checked that the machines were live and that there were no failures. This is especially true for the cloud world.”
Paying attention to security
Nowadays, Weiss’ team is focusing on a VPC (virtual private cloud) project and is planning to start the CDN (content delivery network) transformation project to supply each market with the right CDN.
Can you describe your process on those projects?
“It always starts with discussions about what we would like to achieve. In the CDN project, we wanted to understand why we need it at all and what the transformation means in terms of capabilities, advantages and disadvantages. In the VPC project, the advantages are pretty clear: to be in an environment that is harder to hack and to place the company where it belongs in terms of reputation and performance. We can’t allow a situation where someone takes over our servers, takes them offline or does something that will harm our brand reputation, availability or performance.”
Why did you choose to do it now? Did you have something that caused it to be urgent now?
“When we talk about security, it is not urgency that drives it. It is more like insurance than a reaction to something that is happening.”
But why has it come up just now and not one year ago or maybe in a few months?
“As a young start-up, it is less bad to be down for a few hours or to have other performance issues. Nothing happens: there are not many users, and no brand or partners to consider. When you grow, you start to think about security and performance. When you want to have a working and evolving product, most of the time, unfortunately, you will move security down on your priority list. I started to think about it a year ago. Usually, it happens during the transformation from a start-up to a big company. For us, it comes now since we managed to line it up with the most urgent stuff, mostly high-availability matters and the auto-scale.
“When we make plans, we use a method called OKR (objectives and key results). We define quarterly objectives for the whole company and for each department. We try to measure those objectives’ key results, and we grade ourselves. If something happens during the quarter that demands a delay in something else, we delay it. The VPC appeared in the last three quarters, but we didn’t get to it because there were other business opportunities that took the highest priority to deliver.”
Knocking on technology’s door
The DevOps market includes many technologies and tools that are constantly updated and changed.
Let’s say there’s a new tool on the market. How would you decide whether to use it or not?
“The technology doesn’t knock on your door. Usually, you will be the one to search for the technology that will provide an answer to a need. For instance, let’s take the new CDN. It comes from a painful performance situation in a few countries around the globe. Part of it relates to the local platforms. In some markets, for instance, the internet is pretty weak, the data packages are very expensive, people use very old devices, and the browsers are very old. We still want to provide our best performance. That’s a need. All those needs force me to find the right CDN instead of working with a general global one.”