SREcon19 EMEA

October 2nd - 4th, 2019 – Dublin, Ireland

Full program: https://www.usenix.org/conference/srecon19emea/program

Venue photos

IMG_20191001_192135.jpg IMG_20191001_192151.jpg IMG_20191002_154929.jpg

My personal favorites

In order of appeal.

All talks attended

In order of appearance.

The SRE I aspire to be

Yaniv Aknin, Google Cloud

https://www.usenix.org/conference/srecon19emea/presentation/aknin

  • Engineering means using scientific methods, which requires measurements.
  • SRE: Measurably optimize revenue versus cost.
  • Techniques in our toolbox:
    • Redundant resources – trade cost
    • Degraded results – trade quality
    • Retry transient failures – trade latency
  • The error budget steers reliability versus innovation (see the quick calculation below).
  • A really good SRE understands the business and can communicate with business leaders in their language.
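
To make the error budget concrete, here is a quick back-of-the-envelope calculation in Python; the 99.9% target is an illustrative assumption of mine, not a number from the talk.

    # Error budget: the fraction of requests (or time) allowed to fail before
    # innovation has to slow down. The 99.9% target is an illustrative assumption.
    SLO = 0.999                            # availability target
    error_budget = 1 - SLO                 # 0.1% may fail

    minutes_per_month = 30 * 24 * 60
    allowed_downtime = minutes_per_month * error_budget

    print(f"Error budget: {error_budget:.4%}")
    print(f"Allowed downtime: {allowed_downtime:.1f} minutes per 30-day month")  # ~43.2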

IMG_20191002_091537.jpg

A Systems Approach to Safety and Cybersecurity

Nancy Leveson, MIT

https://www.usenix.org/conference/srecon19emea/presentation/leveson

  • Current safety tooling is 50-60 years old. Assumes accidents are caused by individual component failures.

  • The problem (with safety) is complexity, mostly from interactions between different components.

  • Human error is inevitable. Human behavior is always affected by the context it’s in.

  • Nancy describes the traditional approach to modeling reliability, and how it breaks down in modern (software) systems.

  • Nancy considers systems theory a possible way forward.

    • Focuses on systems as a whole.
    • Treat safety as a control problem (rather than a failure problem).
  • STAMP: Systems-Theoretic Accident Model and Processes.

  • STPA: Systems-Theoretic Process Analysis.

  • STPA is gaining industry adoption.

  • All of Nancy’s evaluations show that STPA identifies more critical requirements and design flaws, and is orders of magnitude cheaper.

IMG_20191002_101553.jpg IMG_20191002_102509.jpg IMG_20191002_102749.jpg

A Tale of Two Rotations: Building a Humane & Effective On-Call

Nick Lee, Uber

https://www.usenix.org/conference/srecon19emea/presentation/lee

  • Triage aggressively.
  • Constantly refine alerts and thresholds.
  • Lack of mitigation resulted in a bad on-call experience.
  • Quantify on-call.
  • Tools need to be trustworthy and feel safe to use.

Latency SLOs Done Right

Heinrich Hartmann, Circonus

https://www.usenix.org/conference/srecon19emea/presentation/hartmann-latency

  • The whole talk boils down to this: percentile metrics are not suited for SLOs because they cannot be aggregated (or have other math applied to them).
  • Possible solutions:
    • Log data
    • Counter metrics
    • (HDR) histograms
  • Personal note: With Prometheus this is clearly documented, with plenty of resources on how to use histograms for this type of data. Also, the speaker is quite a curious character.
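
To make the histogram point concrete, here is a minimal sketch using the Python prometheus_client library; the metric name, buckets and PromQL query are my own illustration, not from the talk.

    # Record request latency in a Prometheus histogram so SLO math can be done
    # by aggregating bucket counters across instances. Metric name, buckets and
    # the example query are illustrative assumptions.
    import random
    import time

    from prometheus_client import Histogram, start_http_server

    REQUEST_LATENCY = Histogram(
        "http_request_duration_seconds",
        "HTTP request latency",
        buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.5],  # include your SLO threshold (e.g. 0.25s)
    )

    def handle_request():
        with REQUEST_LATENCY.time():               # observes elapsed time into the histogram
            time.sleep(random.uniform(0.01, 0.3))  # stand-in for real work

    if __name__ == "__main__":
        start_http_server(8000)                    # exposes /metrics for scraping
        while True:
            handle_request()

    # Example SLO query (PromQL), aggregatable across all instances:
    #   sum(rate(http_request_duration_seconds_bucket{le="0.25"}[5m]))
    #     / sum(rate(http_request_duration_seconds_count[5m]))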

Building a Scalable Monitoring System

Molly Struve, Kenna Security

https://www.usenix.org/conference/srecon19emea/presentation/struve

  • Good intro story from Molly about their history/failures and the road to where they are now.
  • Monitoring must-haves:
    • Centralize alerts in 1 place.
    • Make all alerts actionable.
    • Make alerts mutable/silenceable.
    • Track alert history.
  • With a good system, developers began to contribute to and improve the monitoring system.

Being Reasonable about SRE

Vítek Urbanec, Unity Technologies

https://www.usenix.org/conference/srecon19emea/presentation/urbanec

  • A rant against buzzwords and hype-driven adoption.
  • You probably already do parts of SRE.
  • Shifting from Ops to true SRE takes time and effort.
  • More problems happen on the dev side than on the infra side – so join them to learn about their issues.

From nothing to SRE: Practical Guidance on Implementing SRE in Smaller Organisations

Matthew Huxtable, Sparx

https://www.usenix.org/conference/srecon19emea/presentation/huxtable

  • Complexity often creeps up on us.
  • How do you navigate risk?
  • The “embedded SRE” model doesn’t bring about the expected change.
  • A dedicated team also doesn’t work: it gets out of touch or becomes a tooling team.
  • Instead, give developers control over, and ownership of, their own systems.
  • Paper: The Compliance Budget.

IMG_20191002_144929.jpg

My Life as a Solo SRE

Brian Murphy, G-Research

https://www.usenix.org/conference/srecon19emea/presentation/murphy

No notes.

All of Our ML Ideas Are Bad (and We Should Feel Bad)

Todd Underwood, Google

https://www.usenix.org/conference/srecon19emea/presentation/underwood

No notes.

Fast, Available, Catastrophically Failing? Safely Avoiding Behavioral Incidents in Complex Production Systems

Ramin Keene, fuzzbox.io

https://www.usenix.org/conference/srecon19emea/presentation/keene

  • Forego correctness, embrace safety.
  • Bugs may be incidents too.

IMG_20191002_165512.jpg

Advanced Napkin Math: Estimating System Performance from First Principles

Simon Eskildsen, Shopify

https://www.usenix.org/conference/srecon19emea/presentation/eskildsen

  • Effectively a talk about https://github.com/sirupsen/napkin-math, but highly recommended.
    • “The goal of this project is to collect software, numbers, and techniques to quickly estimate the expected performance of systems from first-principles. For example, how quickly can you read 1 GB of memory? By composing these resources you should be able to answer interesting questions like: how much storage cost should you expect to pay for a cloud application with 100,000 RPS?”
  • Want a monthly napkin challenge? https://sirupsen.com/napkin/
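
For a flavor of the technique, a tiny back-of-the-envelope calculation; the bandwidth number is an assumed ballpark of my own, not a figure from the talk or the repo.

    # Napkin math: how long does it take to read 1 GB of memory sequentially?
    # Assumes ~10 GB/s sequential read bandwidth per core, a rough ballpark
    # of my own; adjust for your hardware.
    MEM_READ_GB_PER_S = 10    # assumed sequential memory read bandwidth
    data_gb = 1

    seconds = data_gb / MEM_READ_GB_PER_S
    print(f"~{seconds * 1000:.0f} ms to read {data_gb} GB of memory")  # ~100 ms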

The Map Is Not the Territory: How SLOs Lead Us Astray, and What We Can Do about It

Narayan Desai, Google

https://www.usenix.org/conference/srecon19emea/presentation/desai

  • Bad things happen when you over-simplify.
  • SLOs are not static. Review them and update them as situations change.
  • Document rationale for changes to SLOs.
  • Important operations often happen at low QPS, and math/statistics doesn’t work well there (see the quick calculation below).
  • Set per-customer SLOs.
  • Clearly communicate effects of service behavior changes.
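
A quick illustration, with my own made-up numbers, of why SLO math gets shaky at low QPS:

    # At low request rates a single failure swings the measured success ratio
    # enormously, so short-window SLO calculations become meaningless.
    # The numbers below are illustrative, not from the talk.
    slo_target = 0.999        # 99.9% success objective
    window_requests = 20      # a low-QPS operation over a short window

    for failures in range(3):
        ratio = (window_requests - failures) / window_requests
        verdict = "meets" if ratio >= slo_target else "violates"
        print(f"{failures} failure(s): success ratio {ratio:.3f} ({verdict} the target)")
    # One failure in 20 requests already reads as 95% success, far below 99.9%,
    # even if nothing is systematically wrong.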

Load Balancing Building Blocks

Kyle Lexmond, Facebook

https://www.usenix.org/conference/srecon19emea/presentation/lexmond

  • DNS LBs
  • Round-robin (toy sketch after this list)
  • Anycast DNS
  • Geo-aware
  • Network-aware
  • Latency-aware
  • Caching is problematic with DNS LBs.
  • Anycast routing is not always optimal.
  • Facebook POPs deliberately use LB setups very similar to those in the actual DCs.
  • Maglev LBs (a term coined by Google).
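
As a toy illustration of the round-robin building block above (my own sketch, not Facebook’s implementation):

    # Toy round-robin picker over a static set of backend IPs, roughly what a
    # round-robin DNS answer gives you. Purely illustrative; real DNS LBs also
    # weigh geography, network topology and latency.
    import itertools

    class RoundRobinPicker:
        def __init__(self, backends):
            self._cycle = itertools.cycle(backends)

        def pick(self):
            return next(self._cycle)

    picker = RoundRobinPicker(["203.0.113.10", "203.0.113.11", "203.0.113.12"])
    for _ in range(5):
        print(picker.pick())

    # The caching problem: once a resolver caches one answer, clients behind it
    # keep hitting the same backend until the TTL expires, skewing the spread.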

What Happens When You Type en.wikipedia.org?

Effie Mouzeli and Alexandros Kosiaris, Wikimedia Foundation

https://www.usenix.org/conference/srecon19emea/presentation/mouzeli

  • Wikimedia uses 2 primary DCs and 3 POPs at the moment.
  • Kubernetes
    • Calico
    • Helm
    • 2 clusters + 1 staging
  • Message queues
    • Kafka for everything
  • Fun fact: they run ping offload servers for all the people checking their connectivity.

IMG_20191003_115301.jpg IMG_20191003_120450.jpg IMG_20191003_120842.jpg

Refining Systems Data without Losing Fidelity

Liz Fong-Jones, honeycomb.io

https://www.usenix.org/conference/srecon19emea/presentation/fong-jones

  • Two types of metrics
    • Host metrics
    • Per user/behavior metrics
  • We need context with data
  • Reducing costs
    1. Store less data
      • Don’t store “read never” data
      • One event per transaction
      • Use tracing for linked events
    2. Sample your data.
      • Sample traces together, not independently
    3. Aggregate data
      • Destroys cardinality
      • Cheap to answer known questions
      • Inflexible/unsuitable for new/unknown questions
  • When sampling:
    • Adjust sampling dynamically
    • Normalize sampling per-key
    • Retain errors & slow queries
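
A minimal sketch of the “retain errors & slow queries, sample the rest” idea; thresholds and rates are my own assumptions, not Honeycomb defaults.

    # Head-sampling sketch: always keep errors and slow events, keep only a
    # fraction of routine traffic, and record the sample rate so counts can be
    # re-weighted later. Thresholds and rates are illustrative assumptions.
    import random

    SLOW_MS = 500        # assumed "slow" threshold
    BASE_RATE = 0.01     # keep 1% of routine events

    def sample(event):
        """Return (keep, sample_rate) for one event dict."""
        if event.get("error") or event.get("duration_ms", 0) >= SLOW_MS:
            return True, 1.0                   # always retain errors & slow queries
        return random.random() < BASE_RATE, BASE_RATE

    keep, rate = sample({"duration_ms": 42, "error": False})
    # Dynamic / per-key sampling would adjust BASE_RATE per endpoint or customer
    # instead of using one global constant.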

IMG_20191003_141517.jpg

One on One SRE

Amy Tobey

https://www.usenix.org/conference/srecon19emea/presentation/tobey

  • Trauma: Extreme stress which overwhelms the ability to cope.
  • An example from GitHub is cited, where personal breaks, relief, etc. were emphasized.
  • Amy advocates one-on-one incident debriefs.
    • Stresses informed consent. Let people know they can talk safely.
    • Questions/agenda:
      • Your role in the incident?
      • What surprised you?
      • How long did you work on the incident?
      • Did you get the support you needed?
      • Do you feel it was preventable?
      • What actions do you feel good about?
      • What could have gone better?
      • What did you learn from this?
      • What could we do to prevent re-occurrence?
      • Did our tools and documentation help you?
      • Did you practice self-care?
      • Can you think of anyone else for me to talk to?
  • Talking to people individually can build powerful, under-the-radar shadow networks.
  • Empathy is a powerful way to effect organisational change (according to Harvard Business Review).

Prioritizing Trust While Creating Applications

Jennifer Davis, Microsoft

https://www.usenix.org/conference/srecon19emea/presentation/davis

  • It’s easy to postpone security until the end.
  • Foundations: Defense in Depth.
  • Threat modeling: Cheap and easy to do during early design.
  • Architectural trade-offs.
  • Linting/static code analysis.
  • Secure Code Reviews.
  • Plan for security violations.
  • Don’t forget to talk to your vendors when you spot issues in their products.

SRE & Product Management: How to Level up Your Team (and Career!) by Thinking like a Product Manager

Jen Wohlner, Livepeer

https://www.usenix.org/conference/srecon19emea/presentation/wohlner

  1. Know your users and talk to them.
    • Do user interviews.
    • Ensure diversity in roles, titles, etc.
  2. Ask non-leading questions.
  3. Hold prototyping sprints.
  4. Add user-centric goals to roadmaps.
    • Idea: Monthly roadmap meetings.
    • The roadmap is an internal tool. Don’t just share it “as-is”, use bi-weekly emails/status updates for outside messaging.
  5. Follow-up with users regularly.
    • Uses, pain points, and needs change over time.

IMG_20191003_165253.jpg

Building Resilience: How to Learn More from Incidents

Nick Stenning, Microsoft

https://www.usenix.org/conference/srecon19emea/presentation/stenning

  • Why learn from incidents?
  • Human error is a symptom, not a cause.
  • Avoid counterfactual reasoning during investigation.
  • Avoid normative judgment.
  • Avoid mechanistic reasoning.
  • Instead:
    • Run a facilitated post-incident review.
      • Not just the on-call responder.
      • Have a neutral facilitator.
      • Prepare with 1:1 interviews.
      • Lots of incidents? Don’t do it each time.
    • Language in questions matters.
      • Ask how over why. “How did…”
      • Ask after different viewpoints.
      • Ask about what normally happens in a similar situation when there is no incident.
      • See also Etsy’s Debriefing Facilitation Guide.
    • Ask how things went right.
      • How did we recover?
      • What insights/tools/people were involved?
    • Keep the review and todo planning separate.
      • Keep mitigation talk out of the review. Plan a separate session for that.
      • Helps keep focus on analysing what happened.
      • Allows the subconscious to work out mitigations in the background between the two sessions.

IMG_20191004_093351.jpg

How Stripe Invests in Technical Infrastructure

Will Larson, Stripe

https://www.usenix.org/conference/srecon19emea/presentation/larson

  • Dig out of firefighting trenches.
    • Reduce concurrent work.
    • Finish something useful.
    • Automate.
    • Eliminate entire categories of problems (with creative solutions).
  • If that doesn’t work? Hire!
  • Also, don’t fall in love with firefighting.
  • Listen to your users, especially when innovating.
  • Benchmark against other similarly-sized companies.
  • Do surveys of your users.
  • Prioritization:
    • Order by Return on Investment (RoI).
    • Don’t try without users in the room.
    • Have a long-term vision.
  • Avoiding the wrong solution to a problem:
    • Validate with (potential) users first.
    • Try the hard cases early/first.
  • Investment strategy:
    • 40% user asks.
    • 30% platform quality.
    • 30% “key initiatives”.
    • These are somewhat arbitrary, adjust to your own needs/constraints.

Pushing through Friction

Dan Na, Squarespace

https://www.usenix.org/conference/srecon19emea/presentation/na

  • Why does friction occur?
    • Company growth.
  • Example from aviation safety.
    • 5 mandatory checklists – ignored.
    • Alarm – ignored.
    • Normalization of deviance. Deviance becomes the norm.
    • Friction inevitable with growth, but made worse by normalization of deviance.
  • Solutions:
    • Document single sources of truth.
    • Update docs with acceptance criteria for work.
    • Adopt processes to vet tech choices.
    • Solicit the “What the fuck!?” of new hires.
  • Long-term cultural fixes for individuals:
    • Develop your own sense of agency (see Drive by Daniel Pink).
    • Strategies:
      • Have important discussions face to face.
      • Get to know people in other teams/departments.
      • New idea? Try it once.
  • See also https://talks.danielna.com.

Autopsy of a MySQL Automation Disaster

Jean-François Gagné, MessageBird (formerly booking.com)

https://www.usenix.org/conference/srecon19emea/presentation/gagne

  • References MySQL High Availability at GitHub.
  • MessageBird uses Orchestrator + ProxySQL.
  • Solutions for split-brain:
    • Kill one DB, lose data.
    • Replay writes.
      • AUTO_INCREMENT gets in the way.
      • UUIDs are one possible solution.
      • If going the UUID route, consider monotonically increasing IDs instead, since they are better optimized for indexing (rough sketch below).
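
A rough sketch of what monotonically increasing IDs could look like (time-prefixed random IDs, in the spirit of UUIDv7); this is my own illustration, not MessageBird’s scheme.

    # Time-prefixed IDs: the leading bits are a millisecond timestamp, so new
    # IDs sort roughly in insertion order and are friendlier to B-tree indexes
    # than fully random UUIDs. The exact layout here is an assumption.
    import os
    import time

    def monotonic_id() -> int:
        ts_ms = int(time.time() * 1000)               # millisecond timestamp
        rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits for uniqueness
        return (ts_ms << 80) | rand

    a, b = monotonic_id(), monotonic_id()
    assert a < b or (a >> 80) == (b >> 80)            # ordered across milliseconds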

Evolution of Observability Tools at Pinterest

Naoman Abbas, Pinterest

https://www.usenix.org/conference/srecon19emea/presentation/abbas

  • Challenge #1: Usage growth.
  • Avoid tool fragmentation.
  • Observability is expensive, but has a good return on investment.
  • At Pinterest everything goes through Kafka.

Fault Tree Analysis Applied to Apache Kafka

Andrey Falko, Lyft

https://www.usenix.org/conference/srecon19emea/presentation/falko

Applicable and Achievable Formal Verification

Heidy Khlaaf, Adelard LLP

https://www.usenix.org/conference/srecon19emea/presentation/khlaaf

  • Coq isn’t very usable.
  • IEC 61508 is the “golden boy” safety standard.
  • Safety justification triangle. (?)
  • TLA+.