F.A.Q. What is Site Reliability Engineering?

February 1, 2019
Gregory Burmistrov, System Analyst, Site Reliability Engineer at DataArt
Site Reliability Engineering (SRE) is booming but has not yet become mainstream. To many, this abbreviation still seems mysterious. The SRE position is often confused with that of a system administrator or a DevOps engineer.

We collected frequently asked questions about SRE and tried to explain who is a good fit for the new profession and what awaits those who decide to try their hand at it.

1. SRE — is it from the book that Google published?

Generally, yes. The concept of Site Reliability Engineering appeared at Google back in 2003. Since then, many companies have formed their own SRE teams, above all those whose business success depends directly on the smooth operation of computer systems (Apple, Microsoft, Facebook, Twitter, Dropbox, Oracle, etc.).

SRE began to spread widely 4-5 years ago. Over the past 2-3 years, the list of companies that have introduced a dedicated SRE role on their projects has expanded considerably. After all, who today does not depend on internal IT systems, their reliability, performance, and integration with external services? The tasks of Site Reliability engineers vary from company to company and depend on the type of business. In this sense, SRE, as a relatively new approach, resembles Agile, which, as you have probably noticed, is different for everyone. However, roughly 80% of the knowledge and skills an SRE needs will be the same in any case.


2. Is 'ensuring reliability' just a fashionable name for technical support?

Definitely not. The SRE concept assumes that developers do not just write code but also monitor how it works in production. In this sense, the boundary between development and operations is erased. One of the tasks of an SRE team is to keep a release from turning into a ping-pong match between developers and DevOps engineers, where each side claims that the problem comes from the other.

A Site Reliability engineer constantly looks for opportunities to automate and has quite broad powers. For them, any problem is first of all a reason for analysis. If it recurs or carries high risk, the SRE may decide to fix something in the application itself or write (on their own or with the help of colleagues) a tool that can eliminate the trouble without human intervention. Thanks to SRE, we understand whether there are bugs in the application, how to fix them, and how to keep improving the reliability of the system in the future.

An SRE turns into a support employee only if the process is poorly organized: when the number of incidents grows like an avalanche and the engineer is so busy with urgent tasks that no time is left for their main work.

A very important difference between SRE and support is the amount of communication. How much an SRE has to communicate comes as a surprise to many system analysts (especially in small companies). This is definitely not work on a narrow task in complete solitude. In our projects, it means constant contact with business representatives and independent groups of developers.


3. SRE — developer or DevOps?

SRE is an attempt to reconcile these two disciplines. Engineers who work here understand the system well, know how to get deep “under the hood”, and are ready to rewrite bad code. But there is also a hint of DevOps in this role: an SRE needs to understand how the servers on which the system is deployed work, how the system is scaled, how the load is distributed, etc.

SREs are needed primarily in large enterprise projects with complex, high-load applications. They are the ones who know how the system behaves in real conditions, especially if something goes wrong, e.g. the network connection or the database fails. This knowledge is needed not only to stabilize the application quickly but also to make the necessary changes to the source code.


4. What is 'reliability'? Are there clear criteria for measuring reliability?

The first thing an SRE does is surround the system with metrics, which may vary from project to project. It is important not to overdo it and not to measure what we have no interest in measuring. For example, the amount of free disk space on the server and the processor load do affect the work, but on their own they do not answer any of the questions that matter to us. SREs are interested not in technical indicators but in Service Level Indicators (SLI), i.e. business metrics. The system serves customers better not when the processor is less loaded but when it can consistently handle more requests without sacrificing quality.
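To make the idea of an SLI more concrete, here is a minimal sketch in Python. It assumes that each request is recorded with an HTTP status code and a latency, and that "good" means "no server error" or "served within a threshold". The `Request` type, the function names, and the 300 ms threshold are illustrative assumptions, not something prescribed by any particular project.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record format)."""
    status: int        # HTTP status code returned to the client
    latency_ms: float  # time taken to serve the request


def availability_sli(requests):
    """Share of requests answered without a server error (5xx)."""
    if not requests:
        return 1.0  # assumption: no traffic counts as fully available
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests, threshold_ms=300.0):
    """Share of requests served within the latency threshold."""
    if not requests:
        return 1.0
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 450.0),  # successful but slow
    Request(503, 80.0),   # fast but failed
    Request(200, 95.0),
]
print(availability_sli(sample))  # 0.75
print(latency_sli(sample))       # 0.75
```

Both numbers describe what the customer experiences, which is why they are more useful for a conversation with the business than CPU load or disk usage.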

Only by learning to measure business-critical indicators can we begin the process of increasing reliability. Clearly, the costs of developing, supporting, and maintaining the system grow at the same time. Moreover, they grow exponentially, especially if we are talking about a system operating in different regions, where the question of a universal line arises (and more often than not, SRE deals with such complex stories). Here an SRE turns out to be a key figure in negotiations with the business, since they can use quantitative indicators to explain how reliable the system is, what risks its bottlenecks carry, and how much it will cost to eliminate each of them. It is together with business representatives that SREs set Service Level Objectives (SLO, another important abbreviation), i.e. service level targets that define acceptable reliability.
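One way to present an SLO to the business in plain numbers is to translate it into the downtime it permits. The sketch below assumes an availability SLO measured over a 30-day window; the function name and the window length are illustrative assumptions.

```python
def allowed_downtime_minutes(slo, window_days=30):
    """Minutes of full outage that an availability SLO tolerates per window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


# Each additional "nine" cuts the tolerated downtime by a factor of ten.
for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} -> {allowed_downtime_minutes(slo):.1f} min per 30 days")
```

Under these assumptions, 99% allows about 432 minutes per 30 days, 99.9% about 43 minutes, and 99.99% about 4 minutes. Each step is far harder to achieve than the previous one, which is the quantitative side of the cost argument above.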


5. What education and experience does an SRE need?

The concept is still new, and there are practically no ready-made specialists on the market. Therefore, for these roles, we consider both developers (it’s good when an SRE is not afraid of Python or Java) and DevOps engineers who are ready to dive deeper into the code. Fortunately, the range of tasks is very wide, from monitoring and alerting (typical DevOps tasks) to complex troubleshooting that only experienced developers can do.

Classic tasks: the logs show that the server keeps running out of memory; or the thread pool is exhausted because some threads are never returned; or one of the three servers behind the load balancer is constantly overloaded, although the other two are working normally. These are non-trivial technical problems whose solution requires a deep understanding of what’s “under the hood”, i.e. how systems are scaled in the cloud, how the load is distributed, and how the server handles it. Most likely, a senior-level developer should investigate them. There are also configuration tasks, as well as small, less complex ones.

Newcomers are not expected to know every process and tool. SRE allows you to learn by working on specific tasks alongside experienced engineers. Therefore, there are prospects here not only for seniors but also for junior developers and DevOps engineers.


6. Is SRE the opposite of feature development?

SRE can limit the overly fast development of new features, acting as a stabilizer. But to associate it with an abstract evil in the form of non-development is a big mistake. SRE does not oppose feature developers; rather, it balances the business side, which constantly requires expansion of the features of any application.

New features, especially ones designed in a hurry, always destabilize the system. If the system risks failing in production, SRE can appeal to the error budget. If the error budget is exhausted or approaching a critical point, it is SRE that sounds the alarm and points out the need for stabilization. Everyone intuitively understands: if the system is stable, it can be slightly destabilized by adding new features. If not, you cannot take risks; you need to eliminate the threats and postpone the development of new features. But the SRE concept allows you to talk about this in understandable terms, backed by specific quantitative information. In addition, the SRE role means responsibility for this balance and gives the engineer the appropriate authority.
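The error budget argument can be sketched in a few lines of Python. The sketch assumes a request-based availability SLO, so the budget is the number of requests allowed to fail in the measurement window. The function names and the 10% freeze threshold are illustrative assumptions; each team agrees on its own policy.

```python
def error_budget_remaining(slo, total_requests, failed_requests):
    """Unspent fraction of the error budget (negative when overspent)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests <= 0:
        return 1.0  # assumption: no traffic means nothing has been spent
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


def can_ship_features(remaining, freeze_threshold=0.1):
    """Hypothetical policy: freeze features when under 10% of the budget is left."""
    return remaining > freeze_threshold


# With a 99.9% SLO and 1,000,000 requests, about 1,000 may fail.
healthy = error_budget_remaining(0.999, 1_000_000, 400)  # about 60% left
at_risk = error_budget_remaining(0.999, 1_000_000, 950)  # about 5% left
print(can_ship_features(healthy))  # True
print(can_ship_features(at_risk))  # False
```

A check like this turns "the system feels unstable" into a number that both developers and business representatives can discuss.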


7. Is SRE work routine?

No. On the whole, it is difficult to call it routine; we are not talking about an infinite set of repetitive operations. There really are tasks of supporting the system: one day, with a high degree of probability, a server may go down, and you will have to deal with it. Most likely, it will go down in the evening, when the client begins to process orders.

However, in our projects, no one expects the team to be at their desks around the clock. On-call duty usually lasts one week every two months and is paid even if no requests come in.

The work of an SRE can be divided into two parts. You have to admit that putting out a fire is actually quite fun: you run around with a fire extinguisher, acting deftly and bravely, and conquer the fire not without pleasure, though you scold those who got you into this adventure. It seems that after this everyone should go to sleep with a sense of accomplishment. For an SRE, however, the work is just beginning: you need to understand what caused the incident, evaluate it, and decide how to prevent it from happening again. That said, such an investigation may itself prove engaging, and its successful completion is a good enough reason to be satisfied.


8. Does an SRE work together with developers or as part of a separate team?

We use both approaches. In the first case, the SR engineer is introduced into the team, along with the developer and the QA engineer. They may be in a state of productive creative conflict that prevents them from making dangerous compromises.

With the other approach, a whole SRE team works on the project. We use this approach especially often in projects that have passed the stage of active development, where the system is fairly stable. Such a system may be in active use by the client, so improving processes and interaction and providing automatic recovery can be especially important. The team adds metrics to the system, studies how it is built, and looks for problem areas. In some cases, SRE may request a revision from the development team or make changes to the code independently, provided they are limited in size and fit into the error budget.


9. What can you learn by working as an SRE?

You can learn the complexity of the system, which in the future can help bring the system to production. Nowadays, almost no one can afford to just write code without thinking about its future. All participants of any major project have to monitor the load and safety. Working with existing systems as an SRE lets you take a shortcut and immerse yourself in this process right away, using a large business as an example.

This work allows developers to go beyond 30% of the time that is actually spent on the system to benefit real people. They have the opportunity to see and try modern ways of working in production, such as monitoring and alerting. Moreover, the tools intended for this are, as a rule, open source programs, so the experience gained can easily be transferred to other projects and companies.

For DevOps engineers, SRE is a great opportunity to better understand how systems are written. Such work lets you dive into the code at a level that, in 2–3 years, will allow you to develop further as a programmer if you wish.


10. Is the SRE concept going to last a long time? Will this experience be in demand?

It is here to stay, and this experience may become indispensable in the future. SRE can become a stable source of income for any company targeting large and complex enterprise projects. Systems keep getting more complicated, overheads increase, and it’s almost impossible to keep all the details of deploying 200 microservices in your head.

Therefore, in the coming years the SRE role may become as ordinary as QA Automation became 10 years ago and DevOps five years ago. To manage projects with hundreds of developers, you will definitely need people who can withstand the oncoming chaos. Otherwise, applications will begin to collapse under their own weight.

In addition, SRE experience will be useful even to those who, after some time, want to return to pure development (or jump into it for the first time). The ability to foresee how your code will behave in production may become mandatory in the coming years. And who wants to be cursed by colleagues and ordinary users for whom the uninterrupted operation of the system is critical?