October 2nd - 4th, 2019 – Dublin, Ireland

Full program:

Venue photos

IMG_20191001_192135.jpg IMG_20191001_192151.jpg IMG_20191002_154929.jpg

My personal favorites

In order of appeal.

All talks attended

In order of appearance.

The SRE I aspire to be

Yaniv Aknin, Google Cloud

  • Engineering –> Using scientific methods. Requires measurements.
  • SRE: Measuraby optimize revenue versus cost.
  • Techniques in our toolbox:
    • Redundant resources – trade cost
    • Degraded results – trade quality
    • Retry transient failures – trade latency
  • The error budget steers reliability versus innovation.
  • A really good SRE understands the business and can communicate with business leaders in their language.


A Systems Approach to Safety and Cybersecurity

Nancy Leveson, MIT

  • Current safety tooling is 50-60 years old. Assumes accidents are caused by individual component failures.

  • The problem (with safety) is complexity, mostly from interactions between different components.

  • Human error is inevitable. Human behavior is always affected by the context it’s in.

  • Nancy describes the old model of modeling reliability, and how it fails in modern (software) systems.

  • Nancy considers systems theory a possible way forward.

    • Focuses on systems as a whole.
    • Treat safety as a control problem (rather than a failure problem)
  • STAMP: Systems Theoretic Accident Modeling Process.

  • STPA: Systems Theoretic Process Analysis.

  • STPA is gaining industry adoption.

  • All of Nancy’s evaluations show STPA is better at identifying more critical requirements or design flaws and orders of magnitude cheaper.

IMG_20191002_101553.jpg IMG_20191002_102509.jpg IMG_20191002_102749.jpg

A Tale of Two Rotations: Building a Humane & Effective On-Call

Nick Lee, Uber

  • Triage aggressively.
  • Constantly refine alerts and thresholds
  • Lack of mitigation resulted in a bad on-call experience.
  • Quantify on-call.
  • Tools need to be trustworthy and feel safe to use.

Latency SLOs Done Right

Heinrich Hartmann, Circonus

  • Whole talk boils down to percentile metrics not being suited for SLOs because they cannot be aggregated (or have other math used on them)
  • Possible solutions:
    • Log data
    • Counter metrics
    • (HDR) histograms
  • Personal note: With Prometheus this is clearly documented, with plenty of resources on how to use histograms for this type of data. Also, the speaker is quite a curious character.

Building a Scalable Monitoring System

Molly Struve, Kenna Security

  • Good intro story from Molly about their history/failures and the road to where they are now.
  • Monitoring must-haves:
    • Centralize alerts in 1 place.
    • Make all alerts actionable.
    • Make alerts mutable/silenceable.
    • Track alert history.
  • With a good system, developers began to contribute to and improve the monitoring system.

Being Reasonable about SRE

Vítek Urbanec, Unity Technologies

  • Rant against buzzwords and hype adoption.
  • You probably already do parts of SRE.
  • Shifting from Ops to true SRE takes time and effort.
  • More problems happen on the dev side than on the infra side – so join them to learn about their issues.

From nothing to SRE: Practical Guidance on Implementing SRE in Smaller Organisations

Matthew Huxtable, Sparx

  • Complexity often creeps up on us.
  • How do you navigate risk?
  • “Embedded SRE” model doesn’t affect expected change.
  • Dedicated team also doesn’t work. Gets out of touch or becomes a tooling team.
  • Instead, make developers have control over and own their own stuff.
  • Paper: the compliance budget.


My Life as a Solo SRE

Brian Murphy, G-Research

No notes.

All of Our ML Ideas Are Bad (and We Should Feel Bad)

Todd Underwood, Google

No notes.

Fast, Available, Catastrophically Failing? Safely Avoiding Behavioral Incidents in Complex Production Systems

Ramin Keene,

  • Forego correctness, embrace safety.
  • Bugs may be incidents too.


Advanced Napkin Math: Estimating System Performance from First Principles

Simon Eskildsen, Shopify

  • Effectively a talk about, but highly recommended.
    • “The goal of this project is to collect software, numbers, and techniques to quickly estimate the expected performance of systems from first-principles. For example, how quickly can you read 1 GB of memory? By composing these resources you should be able to answer interesting questions like: how much storage cost should you expect to pay for a cloud application with 100,000 RPS?”
  • Want a monthly napkin challenge?

The Map Is Not the Territory: How SLOs Lead Us Astray, and What We Can Do about It

Narayan Desai, Google

  • Bad things happen when you over-simpliy
  • SLOs are not static. Review them and update them as situations change.
  • Document rationale for changes to SLOs.
  • Important operations often happen at low QPS. Math/statistics doesn’t work well there.
  • Set per-customer SLOs.
  • Clearly communicate effects of service behavior changes.

Load Balancing Building Blocks

Kyle Lexmond, Facebook

  • DNS LBs
  • Round-robin
  • Anycast DNS
  • Geo-aware
  • Network-aware
  • Latency-aware
  • Caching problematic with DNS LBs.
  • Anycast routing not always optimal.
  • Facebook POPs deliberately use near-similar LBs to actual DCs.
  • Maglev LBs (termed by Google)

What Happens When You Type

Effie Mouzeli and Alexandros Kosiaris, Wikimedia Foundation

  • Wikimedia uses 2 primary DCs and 3 POPs at the moment.
  • Kubernetes
    • Calico
    • Helm
    • 2 clusters + 1 staging
  • Message queues
    • Kafka for everything
  • Fun fact: Ping offload servers for all the people checking their connectivity

IMG_20191003_115301.jpg IMG_20191003_120450.jpg IMG_20191003_120842.jpg

Refining Systems Data without Losing Fidelity

Liz Fong-Jones,

  • Two types of metrics
    • Host metrics
    • Per user/behavior metrics
  • We need context with data
  • Reducing costs
    1. Store less data
      • Don’t store “read never” data
      • One event per transaction
      • Use tracing for linked events
    2. Sample your data.
      • Sample traces together, not independently
    3. Aggregate data
      • Destroys cardinality
      • Cheap to answer known questions
      • Inflexible/unsuitable for new/unknown questions
  • When sampling:
    • Adjust sampling dynamically
    • Normalize sampling per-key
    • Retain errors & slow queries


One on One SRE

Amy Tobey

  • Trauma: Extreme stress which overwhelms the ability to cope.
  • Example from GitHub is cited where personal breaks, relief, etc were stressed.
  • One-on-one incident debrief advocate.
    • Stresses informed consent. Let people know they can talk safely.
    • Questions/agenda:
      • Your role in the incident?
      • What surprised you?
      • How long did you work on the incident?
      • Did you get the support you needed?
      • Do you feel it was preventable?
      • What actions do you feel good about?
      • What could have gone better?
      • What did you learn from this?
      • What could we do to prevent re-occurrence?
      • Did our tools and documentation help you?
      • Did you practice self-care?
      • Can you think of anyone else for me to talk to?
  • Talking to people individually can build powerful, under the radar shadow networks.
  • Empathy powerful way to affect organisational change (according to Hardvard Business Review).

Prioritizing Trust While Creating Applications

Jennifer Davis, Microsoft

  • It’s easy to postpone security until the end.
  • Foundations: Defense in Depth.
  • Threat modeling: Cheap and easy to do during early design.
  • Architectural trade-offs.
  • Linting/static code analysis.
  • Secure Code Reviews.
  • Plan for security violations.
  • Don’t forget to talk to your vendors when spotting issues with them.

SRE & Product Management: How to Level up Your Team (and Career!) by Thinking like a Product Manager

Jen Wohlner, Livepeer

  1. Know your users and talk to them.
    • Do user interviews.
    • Ensure diversity in roles, titles, etc.
  2. Ask non-leading questions.
  3. Hold prototyping sprints.
  4. Add user-centric goals to roadmaps.
    • Idea: Monthly readmap meetings.
    • The roadmap is an internal tool. Don’t just share it “as-is”, use bi-weekly emails/status updates for outside messaging.
  5. Follow-up with users regularly.
    • Uses, painpoints and needs change over time.


Building Resilience: How to Learn More from Incidents

Nick Stenning, Microsoft

  • Why learn from incidents?
  • Human error is a symptom, not a cause.
  • Avoid counterfactual reasoning during investigation.
  • Avoid normative judgment.
  • Avoid mechanistic reasoning.
  • Instead:
    • Run a facilitated post-incident review.
      • Not just the on-call responder.
      • Have a neutral facilitator.
      • Prepare with 1:1 interviews.
      • Lots of incidents? Don’t do it each time.
    • Language in questions matters.
      • Ask how over why. “How did…”
      • Ask after different viewpoints.
      • Ask about what normally happens in a similar situation when there is no incident.
      • See also Etsy’s Debriefing Facilitation Guide.
    • Ask how things went right.
      • How did we recover?
      • What insights/tools/people were involved?
    • Keep the review and todo planning separate.
      • Keep mitigation talk out of the review. Plan a separate session for that.
      • Helps keep focus on analysing what happened.
      • Allows subconscious to work out mitigations in the background in between the two sessions.


How Stripe Invests in Technical Infrastructure

Will Larson, Stripe

  • Dig out of firefighting trenches.
    • Reduce concurrent work.
    • Finish something useful.
    • Automate.
    • Eliminate entire catagories of problems (with creative solutions).
  • If that doesn’t work? Hire!
  • Also, don’t fall in love with firefighting.
  • Listen to your users, especially when innovating.
  • Benchmark against other similiarly-sized companies.
  • Do surveys of your users.
  • Prioritization:
    • Order by Return on Investment (RoI).
    • Don’t try without users in the room.
    • Have a long-term vision.
  • Avoiding the wrong solution to a problem:
    • Validate with (potential) users first.
    • Try the hard cases early/first.
  • Investment strategy:
    • 40% user asks.
    • 30% platform quality.
    • 30% “key initiatives”.
    • These are somewhat arbitrary, adjust to your own needs/constraints.

Pushing through Friction

Dan Na, Squarespace

  • Why does friction occur?
    • Company growth.
  • Example from aviation safety.
    • 5 mandatory checklists – ignored.
    • Alarm – ignored
    • Normalization of deviance. Deviance becomes the norm.
    • Friction inevitable with growth, but made worse by normalization of deviance.
  • Solutions:
    • Document single sources of truth.
    • Update docs with acceptance criteria for work.
    • Adopt processes to vet tech choices.
    • Solicit the “What the fuck!?” of new hires.
  • Long-term cultural fixes:
  • Individuals:
    • Develop your own sense of agency (see Drive by Daniel Pink).
    • Strategies:
      • Have important discussions face to face.
      • Get to know people in other teams/departments.
      • New idea? Try it once.
  • See also

Autopsy of a MySQL Automation Disaster

Jean-François Gagné, MessageBird (formerly

  • References MySQL High Availability at GitHub.
  • Messagebird uses Orchestrator + ProxySQL.
  • Solutions for split-brain:
    • Kill on DB, lose data.
    • Replay writes.
      • AUTO_INCREMENT gets in the way.
      • UUIDs are one possible solution.
      • If doing UUIDs, consider monotonically increasing IDs instead, optimized for indexing.

Evolution of Observability Tools at Pinterest

Naoman Abbas, Pinterest

  • Challenge #1: Usage growth.
  • Avoid tool fragmentation.
  • Observability is expensive, but good Return on Investment.
  • At Pinterest everything going through kafka.

Fault Tree Analysis Applied to Apache Kafka

Andrey Falko, Lyft

Applicable and Achievable Formal Verification

Heidy Khlaaf, Adelard LLP

  • Coq isn’t very usable.
  • IEC 61508 the “Golden boy” safety standard.
  • Safety justification triangle. (?)
  • TLA+.