SREcon21

SREcon21, a gathering of engineers who care deeply about site reliability, systems engineering, and working with complex distributed systems at scale, will be held as a virtual event for the global SREcon community on October 12–14, 2021.

https://www.usenix.org/conference/srecon21

Table of Contents

10 lessons learned in 10 years of SRE

https://www.usenix.org/conference/srecon21/presentation/spadaccini
Andrea Spadaccini, Microsoft Azure

In this talk we’ll discuss some key principles and lessons learned that I’ve developed and refined in more than 10 years of experience as a Site Reliability Engineer across several teams within Google and Microsoft.

These are topics that often come up as I discuss Site Reliability Engineering with Microsoft customers that are at different stages of their own SRE journey, and that they—hopefully!—find insightful. They broadly belong to the areas of “Starting SRE” and “Steady-state SRE.”

Please join us if you want to discuss fundamental principles of adopting SRE, want to listen to my mistakes (so you can avoid making them!), and want to compare notes on different ways of doing SRE.

  • SRE must serve business goals.
  • Anti-pattern: SRE roadmap drifts from product dev roadmap.
  • Anti-pattern: SRE-owned services with no dedicated staffing.

Let the Chaos Begin—SRE Chaos Engineering Meets Cybersecurity

https://www.usenix.org/conference/srecon21/presentation/sbaraglia
Francesco Sbaraglia and Adriana Petrich, Accenture

Security Chaos Engineering is built around observability and cyber resiliency practices, aiming to uncover the “unknown unknowns” and build confidence in the system. Engineering teams will progressively work to close gaps in their understanding of security concerns within complex infrastructure and distributed systems.

This session will enable you to formulate a simple, valuable hypothesis that can be verified through security chaos experimentation.

  • Kinda neat demonstration, but didn’t leave me with any actionable insights or takeaways.

What To Do When SRE is Just a New Job Title?

https://www.usenix.org/conference/srecon21/presentation/buetikofer
Benjamin Bütikofer, Ricardo.ch

When the SRE Book was published in 2016 the job title of SRE was not widely used outside Google. Fast-forward five years and it seems like every company is hiring SREs. Did the System Administrator and Operations jobs disappear or have their job titles simply changed?

At the end of the talk, you will know one way of transforming a disjoint team of engineers into a high-performing SRE team. If you are a manager of a team or you are interested in team building this talk is for you. This is not a technical talk; I will focus solely on how to set up a team for success.

  • Benjamin was hired at Ricardo just after the company had shifted from managing its own hardware to the cloud. The remaining infrastructure people were turned into an SRE team.
  • Shield your team from interruptions.
    • Suggestion: assign one person to deal with these interruptions so the rest of the team can focus on other work.
  • Create opportunities for spontaneous events.
  • Speaker sounded like he was reading a pre-written speech (common with pre-recorded talks), which detracted from the quality of the talk in my opinion.

Capacity Management for fun and profit

https://www.usenix.org/conference/srecon21/presentation/fulton
Aly Fulton, Elastic

Are you looking to move past the “throw unoptimized infrastructure at a problem and worry about waste later” stage? Join me as I talk about my journey greenfielding all things infrastructure capacity for Elastic’s growing multi-cloud-based SaaS. This talk is NOT about cost savings or cost optimization directly in the traditional sense, but you will discover that proper capacity management and planning do lead to increasing profit margins!

Overall a pretty nice story of the journey from legacy practices and technical debt, to investments in tooling and automation to make the situation better. Some random notes:

  • Capacity management was new at Elastic when Aly joined.
  • Built tooling (written in Go) to manage scaling operations.
  • Capacity management falls under the Cloud SRE Infrastructure team, where she’s currently a lone wolf.
  • VMs are stateful.
  • Old default cluster resizing method was “grow and shrink” of VM scalesets.
  • YAML files to manage autoscaling groups were hand-managed. Changes to autoscale settings required making PRs.
  • Lots of other pain around lack of automation, or automation that wasn’t smart enough and needed manual actions.
  • It’s probably a good idea to have alerting on quotas that are approaching their limits (a rough sketch follows this list).
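
The quota point is easy to operationalize. Below is a minimal, hypothetical sketch in Go (the talk mentions Go tooling, but this is not Elastic’s code); the resource names and the 85% threshold are illustrative assumptions.

```go
// Hypothetical quota-alerting sketch: flag any quota whose usage is
// approaching its provider-imposed limit. Names and thresholds are made up.
package main

import "fmt"

// Quota holds current usage and the limit for one resource.
type Quota struct {
	Resource string
	Used     float64
	Limit    float64
}

// nearingLimit reports whether usage exceeds the given fraction of the limit.
func nearingLimit(q Quota, threshold float64) bool {
	if q.Limit == 0 {
		return false
	}
	return q.Used/q.Limit >= threshold
}

func main() {
	quotas := []Quota{
		{"vcpus", 920, 1000},
		{"public-ips", 40, 128},
	}
	for _, q := range quotas {
		if nearingLimit(q, 0.85) { // alert at 85% of the limit
			fmt.Printf("ALERT: %s at %.0f%% of quota\n", q.Resource, 100*q.Used/q.Limit)
		}
	}
}
```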

A Political Scientist’s View on Site Reliability

https://www.usenix.org/conference/srecon21/presentation/krax
Dr. Michael Krax, Google

Political science can provide novel and fresh insights into software engineering problems:

  • Empirical research on social change is helpful to understand team dynamics and how to evolve teams.
  • Analyzing political systems as self-organizing systems provides insight on how to simplify modern production environments.

If you are interested in a different look at your everyday questions, join us. No prior political science training or knowledge expected.

Structured talk, conceptual session

  • One of the opening slides defined:
    • Politics: interactions
    • Policy: rules (e.g., foreign policy)
    • Polity: group (e.g., the state)
  • Political science is a part of social science.
    • Analysis of power in social interactions
  • Political science is a relatively new discipline, which largely originated after the Second World War.
  • Explains the concepts of Realism, Liberalism and Constructivism.
  • Pushing change top-down from a position of power is likely to cause frustration or meet resistance.
  • Getting consensus by having everyone agree on new rules is also costly. “Too many meetings” symptom.
  • Suggestion/experiment: use social change
    • Start changing your behavior and convince others to follow.
    • Encourage a growth mindset. Update rules later, if they work.
  • References quotes from Niklas Luhmann
  • Personally didn’t get much out of this. I think you’ll get more mileage out of reading the works of the Heath brothers, such as Switch: How To Change Things When Change Is Hard.

Panel: Engineering Onboarding

https://www.usenix.org/conference/srecon21/presentation/panel-engineering-onboarding
Moderator: Daria Barteneva, Microsoft
Panelists: Jennifer Petoff, Google; Anne Hamilton, Microsoft; Sandi Friend, LinkedIn; Ilse White, Learnovate

In this panel on Engineering Onboarding, we will discuss with a few industry experts their thoughts on the big questions and challenges in this field. What have been the significant changes in the past few years? And, finally, what next?

  • Ilse White thinks organizations are starting to recognize the value of proper onboarding of employees.
    • 83% of companies feel their onboarding process is good, but only 13% of employees actually feel that way.
    • Research shows it can take as long as a year for new employees to be completely up to speed and onboarded (unlike the first 90 days most companies focus on).
    • No source/attribution on these numbers, but I got the impression this was original research by Learnovate.
    • Sandi Friend agrees they are seeing these things.
      • She also remarks that this corresponds to retaining talent. Better onboarding experiences increase the chances that people stay with the company. At LinkedIn, they could directly correlate retention to the quality of onboarding processes.
      • Emphasizes the importance of belonging. Especially noticeable as a result of COVID-19.
  • Jennifer Petoff says effective onboarding isn’t about spraying people with new information (and hoping some of it sticks), but about building confidence. Avoiding Impostor syndrome and making sure people feel confident in their role and the systems they’ll be expected to work with.
  • Anne Hamilton explains how a lot of their onboarding programs were structured around in-person interactions to give employees a sense of community and belonging. With COVID-19 this became more “transactional” so they really needed to focus on how to keep this aspect alive (she didn’t give any examples of how they achieved this though).
    • Both Anne Hamilton and Jennifer Petoff mention using cohort-based learning sessions. Anne also mentioned they offered onboarding sessions in two variants: Live training, and pre-recorded videos people could watch at their own time. This accommodates people with different learning styles (some people prefer self-study where they can set their own pace, for example).
    • Sandi Friend mentioned these different learning styles are also influenced by age distribution. In her experience, on average, younger people are more video/tech-savvy and more accustomed to being able to watch videos and self-study at their own pace, as opposed to older generations that may be more used to class-based learning. Offering the two modalities helps address that as well.
  • Ilse White mentioned their research showed that you really need to provide high-quality material.
  • Jennifer Petoff: Main difference between general onboarding and SRE-specific onboarding is that SRE isn’t taught in school.
  • Sandi Friend remarks that a lot of companies expect new hires to join on-call rotations within the first couple of weeks. But how can new people learn all the context and big picture in such a short amount of time?
    • In her experience, there also tends to be a bit more tribal knowledge (and hero culture) in SRE compared to other disciplines.
  • Ilse White mentions onboarding in remote settings involves a lot of scheduled, pre-planned meetings. This reduces the chances of serendipitous conversations taking place, which can make it more difficult for new hires to build context, ask questions, etc.

DevOps Ten Years After: Review of a Failure with John Allspaw and Paul Hammond

https://www.usenix.org/conference/srecon21/presentation/depierre-devops
Thomas Depierre, Liveware Problems; John Allspaw, Adaptive Capacity Labs; Paul Hammond

Missed part of this. Need to rewatch.

How We Built Out Our SRE Department to Support over 100 Million Users for the World’s 3rd Biggest Mobile Marketplace

https://www.usenix.org/conference/srecon21/presentation/oreilly
Sinéad O’Reilly, Aspiegel SE

March 2020 was a strange month for everyone—our work and employee interactions changed fundamentally, and perhaps permanently, as the entire office-bound workforce shifted to working from home. Here in Aspiegel, it wasn’t the only challenge that came our way. We combined an increased role in Huawei service management with retiring our managed services SRE team. This meant that over the year we would need to hire aggressively to replace the team, and also to support our new growth. Working through this onboarding over the course of the year would cause some hiccups along the way, but ultimately it would force us to change into a leaner and more professional SRE Department. Join us as we talk about what we did, what we learned, and how we can help others get there too!

  • When planning for growth/headcount, also look at types of work. Is future work going to be more of the same, or will there be new work-streams with potential demand for new skillsets?
  • Think about the different hiring stages: which stages will there be, who has to sign off on what, who will be involved in each stage, etc.
  • Make sure to have an onboarding plan that is accessible to new hires.
  • Establish a support group for new starters to ask questions.
    • Not sure how I feel about this. I think common questions should be covered by existing documentation, and people should feel empowered to ask questions in public channels and team channels without needing a special support group for it.
  • Create support channels and FAQs for widely-used tools.

You’ve Lost That Process Feeling: Some Lessons from Resilience Engineering

https://www.usenix.org/conference/srecon21/presentation/woods
David Woods, Ohio State University and Adaptive Capacity Labs; Laura Nolan, Slack

Software systems are brittle in various ways, and prone to failures. We can sometimes improve the robustness of our software systems, but true resilience always requires human involvement: people are the only agents that can detect, analyze, and fix novel problems.

But this is not easy in practice. Woods’ Theorem states that as the complexity of a system increases, the accuracy of any single agent’s own model of that system—their ‘process feel’—decreases rapidly. This matters, because we work in teams, and a sustainable on-call rotation requires several people.

This talk brings a researcher and a practitioner together to discuss some Resilience Engineering concepts as they apply to SRE, with a particular focus on how teams can systematically approach sharing experiences about anomalies in their systems and create ongoing learning from ‘weak signals’ as well as major incidents.

  • Note to self: Read ‘Above the Line, Below the Line’ by Richard Cook in ACM Queue Jan 2019 again.
  • Three factors that help in handling anomalies
    • Process Feel
    • High-Signal Alerts
    • Graceful Extensibility
  • Alerts are not a panacea
    • A ‘dark board’ (no alarms going off) does not necessarily mean everything is okay.
    • Complexity means there are lots of dependencies. This causes “effects at a distance”: a change in one part of the system can affect or trigger something in a very different area of the system.
    • Alert overload is a common occurrence.
  • Graceful Extensibility and Overload
    • What parts of a system are approaching saturation?
    • How does the way systems respond to increasing load/overload contribute to the spread of overload throughout the system as a whole?
  • Current systems tend towards:
    • Late responses
    • Large responses
    • Responses that dump overload elsewhere
    • Automatic scaling mechanisms with low sensitivity
  • Theory of graceful extensibility provides guidelines
    • All parts of a system have limits: you must have a strategy to manage those challenges
    • Distributed systems are connected: you need to manage behavior in overload across interconnected units, in a dynamic way
    • Individual pieces only have a partial view of the system as a whole, but they should signal to their neighbors to adapt to conditions
  • In interconnected systems, behavior under load tends to be brittle by default. By paying attention to behavior when approaching saturation, and thinking about systems as a connected whole, this brittleness can be reduced. (A toy load-shedding sketch follows this list.)
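
To make “responses that dump overload elsewhere” concrete, here is a minimal, hypothetical Go sketch of the opposite behavior: shedding load locally at a bounded queue so overload is signalled back to callers rather than silently spreading downstream. This is not from the talk; all names are invented.

```go
// Rough load-shedding sketch: reject new work when a bounded queue is full,
// so overload is signalled to the caller instead of being passed downstream.
package main

import (
	"errors"
	"fmt"
)

var errOverloaded = errors.New("shedding load: queue near saturation")

type worker struct {
	queue chan string
}

// submit admits work only while the queue has headroom.
func (w *worker) submit(job string) error {
	select {
	case w.queue <- job:
		return nil
	default:
		return errOverloaded
	}
}

func main() {
	w := &worker{queue: make(chan string, 2)} // deliberately tiny bound
	for i := 0; i < 4; i++ {
		if err := w.submit(fmt.Sprintf("job-%d", i)); err != nil {
			fmt.Println(err) // the last submissions are shed, not queued
		}
	}
}
```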

Scaling for a Pandemic: How We Keep Ahead of Demand for Google Meet during COVID-19

https://www.usenix.org/conference/srecon21/presentation/schaevitz
Samantha Schaevitz, Google

Many teams will have practiced and refined their Incident Management skills and practices over time, but no one had a playbook ready to go to manage the dramatic Coronavirus-driven usage growth of Google Meet without a user-facing incident. The response resembled a temporary reorganization of more than 100 people more than it did your typical page, notwithstanding the fact that there was no user-facing outage (yet).

This talk will cover what this incident response structure looked like, and what made it successful.

  • Multi-month effort to scale Google Meet capacity.
  • While there was no user-facing impact (for most of the duration), it was declared as an incident throughout this time.
  • Communications was a full-time job.
  • Active management of people to prevent burn-out
    • Assigned standbys for all key people; the standbys shadowed those people’s meetings so they would have the same context and could step in if a key person became ill, needed time off, etc.
  • Assigned different working groups/work streams
    • A team specifically working with downstream service dependencies to make sure they could handle the load
    • A team looking into technical bottlenecks in Meet itself.
    • A team adding knobs to tune, such as the ability to force all streams from HD to SD, to allow more graceful-degradation options (a toy sketch follows this list).
  • They were reporting key health metrics to higher management every day. Scaled down the incident response once the situation was no longer changing daily, but only weekly.
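
As an illustration of the “knobs” idea (a hypothetical sketch, not Google Meet’s implementation), a degradation control can be as simple as a runtime setting that caps stream quality:

```go
// Illustrative graceful-degradation knob: an operator-tunable setting that
// caps video quality so quality can be traded for capacity. Names are made up.
package main

import "fmt"

type Quality int

const (
	SD Quality = iota
	HD
)

// knobs holds operator-tunable degradation settings.
type knobs struct {
	MaxQuality Quality // force all streams down to at most this quality
}

// effectiveQuality clamps a client's requested quality to the current knob.
func (k knobs) effectiveQuality(requested Quality) Quality {
	if requested > k.MaxQuality {
		return k.MaxQuality
	}
	return requested
}

func main() {
	k := knobs{MaxQuality: SD}                 // flipped under capacity pressure
	fmt.Println(k.effectiveQuality(HD) == SD)  // true: HD requests degrade to SD
}
```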

Ceci N’est Pas un CPU Load

https://www.usenix.org/conference/srecon21/presentation/depierre-cpu
Thomas Depierre, Liveware Problems

No notes.

What If the Promise of AIOps Was True?

https://www.usenix.org/conference/srecon21/presentation/murphy-aiops
Niall Murphy, RelyAbility

Many SREs treat the idea of AIOps as a joke, and the community has good reasons for this. But what if it were actually true? What if our jobs were in danger? What if AI—which can play chess, Go, and Breakout more fluidly than any human being—was poised to conquer the world of production engineering as well? What could we, or should we, do about it?

Join me in this talk as we examine the current state of affairs, and the future, in the light of the promises made by AIOps companies. Or, in short, to ask the question, what if AIOps were true?

  • Definitions:
    • MLOps: Running Machine Learning (ML) infrastructure
    • AIOps: Running production infrastructure using AI/ML

Didn’t make any other notes, but this was a pretty nice overview of where AIOps might help us, and what SRE work is fundamentally unsuitable to being replaced by AIOps.

Nine Questions to Build Great Infrastructure Automation Pipelines

https://www.usenix.org/conference/srecon21/presentation/hirschfeld
Rob Hirschfeld, RackN

Sure we love Infrastructure as Code, but it’s not the end of the story. This talk steps back to see how different automation types and components can be connected together to create Infrastructure Pipelines. We’ll review the nine essential questions that turn great automation into modular, portable, and continuous infrastructure delivery pipelines.

  • Why doesn’t my CI/CD pipeline understand infra?
    • CI/CD pipelines flow artifacts, not resources.
    • Infra deals with environmental constraints, resource lifecycle, and cluster join/drain operations.
  • CI/CD pipelines deal with mostly linear flow. Orchestration of infra tends to be more dynamic, environmental, subject to more variations.
  • Why are provisioning and configuration so different?
    • Provisioning and configuration mix different types of automation with different operational contexts.
  • Why can’t I share state between tools?
    • Ops tools are built bottom-up and tend not to be designed for sharing state.
    • Pipelines on the other hand tend to require sharing of state and synchronization points.
  • Infrastructure as Code (IaC) is not just “configuration lives in git”. IaC requires code-like patterns like modules and abstractions to set standards and allow reuse.
    • This requires clear separation between standard, reusable pieces and site/service-specific pieces.
    • They found it helps to have clear, separate pre- and post-steps in deployment workflows (a rough sketch follows this list).
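
To illustrate the shared-state and pre/post-step themes, here is a minimal, hypothetical Go sketch of a pipeline whose stages pass a shared state map through explicit pre-, main-, and post-steps. It is not RackN’s model; all names are invented.

```go
// Minimal infrastructure-pipeline sketch: explicit steps that read and enrich
// a shared state passed from stage to stage. Purely illustrative.
package main

import "fmt"

// State is the shared context handed from one pipeline step to the next.
type State map[string]string

// Step is one stage of the pipeline.
type Step func(State) error

// run executes the steps in order, stopping at the first failure.
func run(pipeline []Step, s State) error {
	for _, step := range pipeline {
		if err := step(s); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	pipeline := []Step{
		func(s State) error { s["machine"] = "node-01"; return nil },                      // pre: provision
		func(s State) error { s["config"] = "applied"; return nil },                       // main: configure
		func(s State) error { fmt.Println("joined cluster:", s["machine"]); return nil },  // post: join
	}
	if err := run(pipeline, State{}); err != nil {
		fmt.Println("pipeline failed:", err)
	}
}
```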

Hard Problems We Handle in Incidents

…but aren’t often recognized

https://www.usenix.org/conference/srecon21/presentation/allspaw
John Allspaw, Adaptive Capacity Labs

If we know how and where to look closely, we can find a number of dynamics, dilemmas, and sacrifices that people make in handling incidents. In this talk, we’ll highlight a few of these often-missed aspects of incidents, what makes them sometimes difficult to notice, and give some descriptive vocabulary that we can use when we do notice them in the future.

  • 4 types of activities in incidents
    • Diagnostic activities
      • What is happening?
      • How is it happening?
      • How did it get like this?
      • What will it do next?
      • What tools
        • Could I use?
        • Are others already using?
      • What observations..
        • Should I share?
        • Do I need to explain? How much detail?
    • Therapeutic activities
      • What can we do/are we able to do to…
        • ..lessen the impact or prevent it from getting worse?
        • ..halt or revert systems, sacrificing potential data, to restore service or prevent further damage?
        • ..resolve the issue entirely?
    • Recruiting activities
      • What expertise does the group have?
      • What expertise does the group need?
      • What authority does the group have?
      • What authority does the group need?
      • Who do I know that has that expertise/authority?
      • How do I reach them?
    • Status/reporting activities
      • There’s responding to an incident
      • There’s reporting on an incident and keeping people informed.
      • Who needs to be informed about the current status of the response?
      • Who needs to be informed about potential downstream impact/effects?
      • How often do people need to be informed?
  • Costs of Coordination
    • See Costs of coordination.
    • As time in an incident goes on, more and more people might be pulled into the incident.
    • More people could mean more skills/knowledge/expertise. It also increases demand for coordination/communication, taking attention away from further diagnosis and repair.
    • This leads to the question: should people stay focused on solving the incident, or devote some of their time to bringing others up to speed so they can assist?
    • “Divide and conquer” (where you divide work across different work streams) also has costs
    • It only makes sense to assign tasks that are:
      • Well bounded
      • Can be accomplished by an individual, and
      • For which a suitable person is both available and not already working on a higher priority task.
    • Shifts in the pattern of failure may make some of those above tasks unnecessary or even dangerous.
  • Sacrifice Decisions
    • Achieving important/high-level goals may require abandoning less important goals. For example:
      • Forcing a network partition to allow recovery
      • Killing slow database queries
      • Reducing quality of service to improve throughput/let systems catch up
      • Shut down systems to prevent data leakage
    • Even if you make the “right” decision, it’s still lose-lose. People will always criticize these decisions.
    • This practically encourages Hindsight Bias.
  • Parallel Incidents Dilemma
    • If two incident responses are related, combining efforts and observations could help and be productive.
    • If two incident responses are not related, investigating whether they were related could be seen as a waste of time.
    • How can you discover if another incident is happening at the same time as yours?
    • If you do find one, how do you tell whether the time and effort spent determining whether they are related is worth the investment?
  • Despite all of these challenges and hard problems above, as a community we are good at this.
  • People do this work. It’s all the people that really make complex systems resilient, not the technical tools.
  • Having vocabulary for these phenomena is important. When we’ve got words for them, we should use those in our stories when we share our experiences.

Experiments for SRE

https://www.usenix.org/conference/srecon21/presentation/ma
Debbie Ma, Google LLC

Incident management for complex services can be overwhelming. SREs can use experiments to attribute and mitigate production changes that contribute to an outage. With experiments to guard production changes, SREs can also reduce a (potential) outage’s impact by preventing further experiment ramp up if the production change is associated with unhealthy metrics. Beyond incident management, SREs can use experiments to ensure that reliable changes are introduced to production.

  • SRE best practices
    • Gradual rollouts: ramp up experimental features.
    • Change attribution: find changes associated with experiments.
    • Controlled mitigation: roll back experiments (a rough sketch follows this list).
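
A hypothetical sketch of how gradual rollout and controlled mitigation could combine: ramp an experiment up only while its health metrics look good, otherwise roll it back. The field names and error-rate threshold are assumptions, not from the talk.

```go
// Illustrative experiment ramp guard: decide whether to keep ramping an
// experiment, hold, or roll it back based on an observed health metric.
package main

import "fmt"

type Experiment struct {
	Name        string
	RampPercent int     // fraction of traffic currently exposed
	ErrorRate   float64 // observed error rate for the experiment arm
}

const maxErrorRate = 0.01 // assumed health threshold

// nextStep decides whether to ramp up, hold, or roll back.
func nextStep(e Experiment) string {
	switch {
	case e.ErrorRate > maxErrorRate:
		return "rollback" // unhealthy: controlled mitigation
	case e.RampPercent < 100:
		return "ramp up" // healthy: gradual rollout continues
	default:
		return "hold"
	}
}

func main() {
	fmt.Println(nextStep(Experiment{"new-cache", 5, 0.002})) // ramp up
	fmt.Println(nextStep(Experiment{"new-cache", 50, 0.03})) // rollback
}
```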

When Systems Flatline—Enhancing Incident Response with Learnings from the Medical Field

https://www.usenix.org/conference/srecon21/presentation/butt
Sarah Butt, Salesforce

In many ways, incident management is the “emergency room” for technical systems. As technology has evolved, it has progressed from auxiliary systems, to essential business systems of record, to critical systems of engagement across multiple industries. As these systems become increasingly critical, SRE’s role in incident management and resolution has become vital for any essential technical system.

This talk focuses on how various strategies used in the medical field can be applied to incident response. From looking at algorithm-guided decisions (and learning a bit about what “code blue” really means) to discussing approaches to triage and stabilization based on the ATLS protocol, to considering the role of response standardization such as surgical checklists in reducing cognitive overhead (especially when PagerDuty goes off at 2 a.m.!), this talk aims to take key learnings from the medical field and apply them in practical ways to incident management and response. This talk is largely conceptual in nature, with takeaways for attendees from a wide variety of backgrounds and technical experience levels.

  • Disclaimer: this talk assumes you are already familiar with incident command systems such as ICS, and have this implemented in your organization already.
  • Concept 1: Algorithm Guided Decisions
    • Think of the ABCDE protocol and CPR protocols used for first aid.
  • Concept 2: Rapid Stabilization
    • The medical profession puts a lot of emphasis on stabilizing patients first.
    • SRE take-away: shift from a “figuring out the why” mindset to a “minimizing the impact” mindset.
    • “Notice who’s yelling vs. who’s quiet”. Often the quiet people are the people needing the most immediate care.
  • Concept 3: Standardization and checklists.

SRE “Power Words”—the Lexicon of SRE as an Industry

https://www.usenix.org/conference/srecon21/presentation/oconnor
Dave O’Connor, Elastic

As the SRE Industry develops, we’ve come to rely on certain words, phrases, and mnemonics as part of our conversations with ourselves and our stakeholders. Words and naming have power, and the collective definition and use of words like ’toil’ as a shorthand can help with any SRE practice. This talk will set out the premise and some examples and includes a call to action around thinking how naming and words can strengthen SRE’s position as the function continues to develop.

No notes.

How Our SREs Safeguard Nanosecond Performance—at Scale—in an Environment Built to Fail

https://www.usenix.org/conference/srecon21/presentation/hawker
Jillian Hawker, Optiver

The core principles of SRE—automation, error budgets, risk tolerance—are well described, but how can we apply these to a tightly regulated high-frequency trading environment in an increasingly competitive market? How do you maintain sufficient control of your environment while not blocking the innovation cycle? How do you balance efficiency with an environment where misconfigured components can result in huge losses, monetary or otherwise?

Find out about our production environment at Optiver, how we deal with these challenges, and how we have applied (some of) the SRE principles to different areas of our systems.

  • How often do you think about failure, or what happens when your actions may break something?
  • At Optiver, virtual and physical kill-switches are in place to stop trading if necessary (a toy sketch follows this list).
  • Story of Knight Capital losing $440m in 2012 is mentioned.
  • Jillian Hawker talks about failing hard and stopping trades in the face of failure/unexpected behavior. This reminds me of Erlang’s “Let it crash” philosophy.
  • Retaining control
    • Explicit changes
    • In-house applications, written under the principle of minimal complexity
    • Simplified trading stack
  • “Culture beats strategy every time” - Betsy Beyer
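
As a toy illustration of the kill-switch and fail-hard idea (assumed names, not Optiver’s system), a trading path might refuse to act whenever a global switch is engaged:

```go
// Illustrative kill-switch check before any trading action: if the switch is
// tripped, fail hard and stop rather than continue in an unexpected state.
package main

import (
	"fmt"
	"sync/atomic"
)

var killSwitch atomic.Bool // flipped by operators or automated safety checks

// sendOrder refuses to act while the kill switch is engaged.
func sendOrder(symbol string, qty int) error {
	if killSwitch.Load() {
		return fmt.Errorf("kill switch engaged: refusing to send order for %s", symbol)
	}
	fmt.Printf("order sent: %d x %s\n", qty, symbol)
	return nil
}

func main() {
	_ = sendOrder("ABC", 100) // normal path
	killSwitch.Store(true)    // unexpected behavior detected: stop trading
	if err := sendOrder("ABC", 100); err != nil {
		fmt.Println(err)
	}
}
```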

Panel: Unsolved Problems in SRE

https://www.usenix.org/conference/srecon21/presentation/panel-unsolved-problems-sre
Moderator: Kurt Andersen, Blameless
Panelists: Niall Murphy, RelyAbility; Narayan Desai, Google; Laura Nolan, Slack; Xiao Li, JP Morgan Chase; Sandhya Ramu, LinkedIn

No notes.

Rethinking the SDLC

https://www.usenix.org/conference/srecon21/presentation/freeman
Emily Freeman, AWS

The software (or systems) development lifecycle has been in use since the 1960s. And it’s remained more or less the same since before color television and the touchtone phone. While it’s been looped into circles and infinity loops and designed with trendy color palettes, the stages of the SDLC remain almost identical to its original layout.

Yet the ecosystem in which we develop software is radically different. We work in systems that are distributed, decoupled, complex and can no longer be captured in an archaic model. It’s time to think different. It’s time for a revolution.

The Revolution model of the SDLC captures the multi-threaded, nonsequential nature of modern software development. It embodies the roles engineers take on and the considerations they encounter along the way. It builds on Agile and DevOps to capture the concerns of DevOps derivatives like DevSecOps and AIOps. And it, well, revolves to embrace the iterative nature of continuous innovation. This talk introduces this new model and discusses the need for how we talk about software to match the experience of development.

  • The traditional SDLC was designed in a different era, in a world where we still had physical servers, dedicated ops teams, monolithic systems, etc. It hasn’t significantly changed since it came into existence in the 1960s.
  • Emily believes DevOps practices are not good enough anymore for modern businesses and software systems.
  • This is a good talk, best watched directly as second-hand notes don’t do this sort of thing justice.

Elephant in the Blameless War Room—Accountability

https://www.usenix.org/conference/srecon21/presentation/tan
Christina Tan and Emily Arnott, Blameless

This is effectively the blog post Elephant in the Blameless War Room: Accountability turned into a talk.

How do you reconcile the ideal of blamelessness with the demand for blame? When is it constructive to hold someone accountable, and how? To change a blameful culture, we must empathize with those that point the finger and see how their goals align with our own. We’ll show you how to communicate that their goals can be achieved blamelessly. Lastly, we’ll share how to hold true accountability well.

If you’re already familiar with the works of, say, Sidney Dekker, for example through his book The Field Guide to Understanding Human Error, then there’s not a lot to be learned here. But if “blameless” concepts are still novel, this talk may be worth watching.