How to Become a Site Reliability Engineer

  • What’s site reliability engineering (SRE) all about?
  • Who is a site reliability engineer (also called an SRE for short)?
  • What’s the difference between DevOps and SRE?
  • What skills must you have to become an SRE?

What Is Site Reliability Engineering?

SRE combines software engineering practices with IT engineering practices to create highly reliable systems. Site reliability engineers are responsible for the reliability of the full stack, from the front-end, customer-facing applications to the back-end database and hardware infrastructure. Their responsibilities include the following:

  • General system uptime
  • System performance
  • Latency
  • Incident and outage management
  • Systems and application monitoring
  • Change management
  • Capacity planning

What Is SRE?

“SRE is what happens when a software engineer is tasked with what used to be called operations.”
— Benjamin Treynor Sloss, founder of the SRE team at Google

How Things Have Changed

Historically, software engineers would write code and then hand it over to IT operations, who would deploy it, maintain it, and respond to any incidents involving it.

How Can SREs Achieve Success?

How can an SRE ensure they are delivering software faster than before while maintaining system uptime and performance? Site reliability engineers must monitor all components of the product or system. Since this could be a very large task, SREs identify and measure a set of key reliability metrics to help them pinpoint the most important tasks and identify weak areas. These metrics include service-level indicators (SLIs), service-level objectives (SLOs), and service-level agreements (SLAs).
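As a rough illustration of how a service-level indicator relates to a service-level objective, here is a minimal Python sketch. It assumes availability is measured as the fraction of successful requests, and it introduces the idea of an "error budget" (the failures the objective still allows), which this article does not cover. The function name, the request counts, and the 99.9% target are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AvailabilityReport:
    sli: float               # measured fraction of successful requests
    slo: float               # target fraction the team has committed to
    budget_remaining: float  # fraction of the error budget still unspent


def availability_report(total_requests: int, failed_requests: int, slo: float) -> AvailabilityReport:
    """Compare a measured availability SLI against an SLO target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    sli = (total_requests - failed_requests) / total_requests
    # The SLO tolerates this many failed requests over the same window.
    allowed_failures = (1 - slo) * total_requests
    budget_remaining = 1 - failed_requests / allowed_failures
    return AvailabilityReport(sli, slo, budget_remaining)


# Hypothetical numbers for one measurement window.
report = availability_report(total_requests=1_000_000, failed_requests=400, slo=0.999)
print(f"SLI: {report.sli:.4%}, SLO met: {report.sli >= report.slo}")
print(f"Error budget remaining: {report.budget_remaining:.0%}")
```

A negative `budget_remaining` would mean the service has already failed more often than the objective allows for that window.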

DevOps vs. SRE

After reading all this, you must be thinking that SRE is similar to DevOps. It is, but only to an extent.

Skills Required to Become a Site Reliability Engineer

Well, everyone’s path is a little different. But there are some common things that just about all successful site reliability engineers need to know.

Skill 1: Knowing How to Code

Because of the nature of the SRE role, understanding development and coding can go a long way.

Skill 2: Understanding Operating Systems

Working with servers at a large scale can be a bit stressful. Having a thorough knowledge of your organization’s operating system (usually Linux or Windows) is necessary. As an SRE, you’ll be working with these operating systems regularly.

Skill 3: CI/CD

The SRE role differs from the DevOps role in that it focuses on implementing DevOps practices, but both roles have things in common. Continuous integration/continuous deployment (CI/CD) is one of them. To be a top-notch SRE, you need to be able to build a CI/CD pipeline from scratch for any application.
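The core idea of a pipeline, running stages in order and stopping at the first failure, can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the stage names are hypothetical, and each stage is a stand-in function, whereas a real pipeline would call a build tool, a test runner, and a deployment script, usually through a dedicated CI/CD service.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a function that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first one that fails.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, step in stages:
        if not step():
            print(f"stage '{name}' failed; stopping pipeline")
            break
        completed.append(name)
    return completed


# Hypothetical stages used purely for illustration.
stages = [
    ("build", lambda: True),
    ("test", lambda: False),   # simulate a failing test suite
    ("deploy", lambda: True),  # never reached because "test" fails
]
completed = run_pipeline(stages)
print(completed)
```

The point of the ordering is that a change which fails its tests never reaches the deploy stage.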

Skill 4: Using Version Control Tools

Because you'll be working with code, you'll be using Git or some other kind of version control tool, so it makes sense to learn how these tools work. A practical way to start is to learn Git and GitHub.

Skill 5: Using Monitoring Tools

Monitoring tools make your life easier when you’re an SRE. They give you a quick view of your system’s performance and of the issues it is dealing with. Implementing these tools and acting on the insights they provide is a primary part of SRE work, because the goal is for the system to experience as little downtime as possible.
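To give a feel for the kind of check a monitoring tool performs, here is a small Python sketch that computes a latency percentile and compares it with an alert threshold. It assumes the nearest-rank percentile method; the sample latencies, the 300 ms threshold, and the function names are hypothetical and not taken from any particular tool.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of the samples, for 0 < pct <= 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def should_alert(latencies_ms: Sequence[float], threshold_ms: float, pct: float = 95.0) -> bool:
    """Alert when the chosen latency percentile exceeds the threshold."""
    return percentile(latencies_ms, pct) > threshold_ms


# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [120, 95, 110, 480, 105, 99, 130, 101, 97, 115]
p95 = percentile(latencies_ms, 95)
alert = should_alert(latencies_ms, threshold_ms=300)
print(f"p95 latency: {p95} ms, alert: {alert}")
```

Percentiles are often preferred over averages for latency because a handful of very slow requests can hide behind a healthy-looking mean.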

Skill 6: Gain a Deep Understanding of Databases

Learn about so-called NoSQL databases. There are many types, and each has fairly specific use cases where it excels. Compare and contrast them with relational databases like MySQL. This is an excellent time to dive into understanding what a data model is, why data models are necessary, and how the data model should inform your choice of database and your service architecture.
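The difference between a relational data model and a document-style one can be sketched with nothing but the Python standard library. This example is an assumption-laden illustration: it uses SQLite in place of MySQL, a plain dictionary in place of a document database, and invented table names, field names, and data.

```python
import json
import sqlite3

# Relational model: data is normalized into tables and joined at query time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE services (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE incidents ("
    "id INTEGER PRIMARY KEY, "
    "service_id INTEGER NOT NULL REFERENCES services(id), "
    "summary TEXT NOT NULL)"
)
conn.execute("INSERT INTO services (id, name) VALUES (1, 'checkout')")
conn.executemany(
    "INSERT INTO incidents (service_id, summary) VALUES (?, ?)",
    [(1, "elevated latency"), (1, "failed deploy")],
)
rows = conn.execute(
    "SELECT s.name, i.summary FROM services s "
    "JOIN incidents i ON i.service_id = s.id ORDER BY i.id"
).fetchall()
print(rows)

# Document model: related data is nested inside one self-contained record,
# so reading a service and its incidents needs no join.
document = {
    "name": "checkout",
    "incidents": [
        {"summary": "elevated latency"},
        {"summary": "failed deploy"},
    ],
}
print(json.dumps(document, indent=2))
```

Which shape fits better depends on how the data is read and updated, which is why the data model should be settled before the database is chosen.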

Skill 7: Make Your Life Easier With Cloud Native Applications

Knowing cloud native applications is another way to make your life easier in this line of work. You don’t have to know them in depth, but here are some knowledge areas that can help your organization and you as you get on the road to becoming a successful SRE.

Skill 8: Master Distributed Computing

Knowing how distributed computing works and understanding the concept of microservices are both significant advantages for an SRE. You’ll be handling large, distributed systems, so having some experience with these topics can really help you get ahead in this career.
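One practical consequence of working with distributed systems is that calls between services can fail temporarily. A common way to cope is to retry with exponential backoff, sketched below in Python. This technique is not described in the article; the function names, the simulated service, and the delay values are hypothetical, and the `sleep` parameter exists only so the example can run without actually waiting.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying on ConnectionError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # wait 0.1s, 0.2s, 0.4s, ...
    raise AssertionError("unreachable")


# Simulated remote call that fails twice before succeeding.
calls = {"count": 0}


def flaky_service() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


delays: List[float] = []  # record the delays instead of really sleeping
result = call_with_retries(flaky_service, attempts=5, sleep=delays.append)
print(result, delays)
```

Doubling the delay after each failure gives a struggling service room to recover instead of hitting it with immediate repeat requests.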

Skill 9: Improve Your Communication

As an SRE, you’ll often be on call with the chief executive officer, chief technical officer, or with your manager, depending on the size of your company. You’ll need to report critical incidents that affect applications. Even when you aren’t on call, you’ll be working with software engineers and others. In all these situations, having effective, well-developed communication skills makes life much easier. For example, you can make sure there are no miscommunications while reporting incidents.

Where Can You Learn These Skills?

So, we’ve covered a long list of things an SRE needs to know. Let’s see where you can learn many of these skills.

Cprime

An Alten Company, Cprime is a global consulting firm helping transforming businesses get in sync.