October 2nd - 4th, 2019 – Dublin, Ireland
Full program: https://www.usenix.org/conference/srecon19emea/program
My personal favorites
In order of appeal.
- A Systems Approach to Safety and Cybersecurity
- Pushing through Friction
- One on One SRE
- Advanced Napkin Math: Estimating System Performance from First Principles
- Other close contenders:
All talks attended
In order of appearance.
The SRE I aspire to be
Yaniv Aknin, Google Cloud
- Engineering –> Using scientific methods. Requires measurements.
- SRE: Measurably optimize revenue versus cost.
- Techniques in our toolbox:
- Redundant resources – trade cost
- Degraded results – trade quality
- Retry transient failures – trade latency
- The error budget steers reliability versus innovation.
- A really good SRE understands the business and can communicate with business leaders in their language.
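To make the error budget idea concrete, here is a minimal sketch of the arithmetic. The function name, the 99.9% SLO and the request counts are my own illustrative assumptions, not figures from the talk.

```python
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Return allowed failures, failures still available, and the budget fraction left."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the share of requests the SLO allows to fail.
    allowed = (1.0 - slo) * total_requests
    remaining = allowed - failed_requests
    return {
        "allowed_failures": allowed,
        "remaining_failures": remaining,
        "budget_left": remaining / allowed,
    }


# Hypothetical month: 1M requests against a 99.9% SLO, 250 failures so far.
budget = error_budget(slo=0.999, total_requests=1_000_000, failed_requests=250)
print(f"budget left: {budget['budget_left']:.0%}")
```

A budget that is mostly intact argues for shipping faster; a budget near zero argues for spending time on reliability instead.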
A Systems Approach to Safety and Cybersecurity
Nancy Leveson, MIT
Current safety tooling is 50-60 years old and assumes that accidents are caused by individual component failures.
The problem (with safety) is complexity, mostly from interactions between different components.
Human error is inevitable. Human behavior is always affected by the context it’s in.
Nancy describes the old model of modeling reliability, and how it fails in modern (software) systems.
Nancy considers systems theory a possible way forward.
- Focuses on systems as a whole.
- Treat safety as a control problem (rather than a failure problem)
STAMP: System-Theoretic Accident Model and Processes.
STPA: System-Theoretic Process Analysis.
STPA is gaining industry adoption.
All of Nancy’s evaluations show STPA is better at identifying more critical requirements or design flaws and orders of magnitude cheaper.
A Tale of Two Rotations: Building a Humane & Effective On-Call
Nick Lee, Uber
- Triage aggressively.
- Constantly refine alerts and thresholds
- Lack of mitigation resulted in a bad on-call experience.
- Quantify on-call.
- Tools need to be trustworthy and feel safe to use.
Latency SLOs Done Right
Heinrich Hartmann, Circonus
- The whole talk boils down to this: percentile metrics are not suited for SLOs because they cannot be aggregated (or otherwise combined mathematically).
- Possible solutions:
- Log data
- Counter metrics
- (HDR) histograms
- Personal note: With Prometheus this is clearly documented, with plenty of resources on how to use histograms for this type of data. Also, the speaker is quite a curious character.
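A small sketch of why percentiles do not aggregate while histograms do. The bucket bounds, host data and helper names are made up for illustration; this is not code from the talk or from any monitoring product.

```python
from bisect import bisect_left
from collections import Counter

BOUNDS_MS = [10, 50, 100, 500, 1000]  # hypothetical bucket upper bounds


def rank(p, n):
    """Nearest-rank position (1-based) of the p-th percentile among n samples."""
    return max(1, (p * n + 99) // 100)


def percentile(values, p):
    """Exact nearest-rank percentile from raw samples."""
    return sorted(values)[rank(p, len(values)) - 1]


def to_histogram(values):
    """Count samples per bucket; index len(BOUNDS_MS) is the overflow bucket."""
    return Counter(bisect_left(BOUNDS_MS, v) for v in values)


def histogram_percentile(hist, p):
    """Upper bound of the bucket holding the p-th percentile (inf for overflow)."""
    target = rank(p, sum(hist.values()))
    seen = 0
    for bucket in sorted(hist):
        seen += hist[bucket]
        if seen >= target:
            return BOUNDS_MS[bucket] if bucket < len(BOUNDS_MS) else float("inf")


# Hypothetical data: a busy, fast host and a quiet, partly slow host.
host_a = [20] * 1000
host_b = [20] * 5 + [800] * 5

averaged_p99 = (percentile(host_a, 99) + percentile(host_b, 99)) / 2
true_p99 = percentile(host_a + host_b, 99)

# Histograms merge by adding bucket counts, so no information is lost.
merged = to_histogram(host_a) + to_histogram(host_b)
merged_p99_bound = histogram_percentile(merged, 99)

print(averaged_p99, true_p99, merged_p99_bound)
```

Averaging the two per-host p99 values gives equal weight to a host that served only ten requests, so the result lands far from the p99 of the combined traffic. The merged histogram only yields a bucket bound rather than an exact value, but that bound is consistent with the true p99.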
Building a Scalable Monitoring System
Molly Struve, Kenna Security
- Good intro story from Molly about their history/failures and the road to where they are now.
- Monitoring must-haves:
- Centralize alerts in 1 place.
- Make all alerts actionable.
- Make alerts mutable/silenceable.
- Track alert history.
- With a good system, developers began to contribute to and improve the monitoring system.
Being Reasonable about SRE
Vítek Urbanec, Unity Technologies
- Rant against buzzwords and hype adoption.
- You probably already do parts of SRE.
- Shifting from Ops to true SRE takes time and effort.
- More problems happen on the dev side than on the infra side – so join them to learn about their issues.
From nothing to SRE: Practical Guidance on Implementing SRE in Smaller Organisations
Matthew Huxtable, Sparx
- Complexity often creeps up on us.
- How do you navigate risk?
- The “Embedded SRE” model doesn’t bring about the expected change.
- Dedicated team also doesn’t work. Gets out of touch or becomes a tooling team.
- Instead, make developers have control over and own their own stuff.
- Paper: the compliance budget.
My Life as a Solo SRE
Brian Murphy, G-Research
All of Our ML Ideas Are Bad (and We Should Feel Bad)
Todd Underwood, Google
Fast, Available, Catastrophically Failing? Safely Avoiding Behavioral Incidents in Complex Production Systems
Ramin Keene, fuzzbox.io
- Forego correctness, embrace safety.
- Bugs may be incidents too.
Advanced Napkin Math: Estimating System Performance from First Principles
Simon Eskildsen, Shopify
- Effectively a talk about https://github.com/sirupsen/napkin-math, but highly recommended.
- “The goal of this project is to collect software, numbers, and techniques to quickly estimate the expected performance of systems from first-principles. For example, how quickly can you read 1 GB of memory? By composing these resources you should be able to answer interesting questions like: how much storage cost should you expect to pay for a cloud application with 100,000 RPS?”
- Want a monthly napkin challenge? https://sirupsen.com/napkin/
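A rough sketch of the kind of estimate the talk is about. The throughput and per-request size below are placeholder assumptions of mine, not numbers taken from the napkin-math project; swap in measured figures before trusting the result.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Placeholder figures for illustration only; replace with measured numbers.
ASSUMED_MEMORY_READ_BYTES_PER_SEC = 10 * 10**9  # assumption: 10 GB/s sequential read
ASSUMED_BYTES_STORED_PER_REQUEST = 1_000        # assumption: 1 kB stored per request


def seconds_to_read(size_bytes, bytes_per_sec):
    """Time to stream size_bytes at a given sustained throughput."""
    return size_bytes / bytes_per_sec


def bytes_stored_per_day(requests_per_sec, bytes_per_request):
    """Data volume produced in a day at a constant request rate."""
    return requests_per_sec * bytes_per_request * SECONDS_PER_DAY


read_time = seconds_to_read(10**9, ASSUMED_MEMORY_READ_BYTES_PER_SEC)
daily_bytes = bytes_stored_per_day(100_000, ASSUMED_BYTES_STORED_PER_REQUEST)

print(f"1 GB read: ~{read_time * 1000:.0f} ms")
print(f"100,000 RPS at 1 kB each: ~{daily_bytes / 10**12:.2f} TB/day")
```

Multiplying the daily volume by a storage price per TB would turn this into the cost estimate the project description mentions.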
The Map Is Not the Territory: How SLOs Lead Us Astray, and What We Can Do about It
Narayan Desai, Google
- Bad things happen when you over-simplify.
- SLOs are not static. Review them and update them as situations change.
- Document rationale for changes to SLOs.
- Important operations often happen at low QPS. Math/statistics doesn’t work well there.
- Set per-customer SLOs.
- Clearly communicate effects of service behavior changes.
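A quick illustration of the low-QPS point, using made-up request counts and a hypothetical 99.9% target: with little traffic, a single failure is enough to miss the SLO.

```python
from fractions import Fraction
import math


def availability(total: int, failed: int) -> Fraction:
    """Fraction of successful requests, kept exact to avoid float noise."""
    return Fraction(total - failed, total)


def min_requests_to_absorb_one_failure(slo: str) -> int:
    """Smallest request count at which one failure still meets the SLO.

    Derived from (n - 1) / n >= slo, which gives n >= 1 / (1 - slo).
    """
    target = Fraction(slo)
    return math.ceil(1 / (1 - target))


low_traffic = availability(total=20, failed=1)
high_traffic = availability(total=20_000, failed=1)
needed = min_requests_to_absorb_one_failure("0.999")

print(float(low_traffic), float(high_traffic), needed)
```

Below that request count the measured availability can only be perfect or a miss, which is one reason ratio-based SLOs behave poorly for rare but important operations.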
Load Balancing Building Blocks
Kyle Lexmond, Facebook
- DNS LBs
- Anycast DNS
- Caching problematic with DNS LBs.
- Anycast routing not always optimal.
- Facebook POPs deliberately use LBs that are nearly identical to those in the actual DCs.
- Maglev LBs (a term coined by Google)
What Happens When You Type en.wikipedia.org?
Effie Mouzeli and Alexandros Kosiaris, Wikimedia Foundation
- Wikimedia uses 2 primary DCs and 3 POPs at the moment.
- 2 clusters + 1 staging
- Message queues
- Kafka for everything
- Fun fact: Ping offload servers for all the people checking their connectivity
Refining Systems Data without Losing Fidelity
Liz Fong-Jones, honeycomb.io
- Two types of metrics
- Host metrics
- Per user/behavior metrics
- We need context with data
- Reducing costs
- Store less data
- Don’t store “read never” data
- One event per transaction
- Use tracing for linked events
- Sample your data.
- Sample traces together, not independently
- Aggregate data
- Destroys cardinality
- Cheap to answer known questions
- Inflexible/unsuitable for new/unknown questions
- When sampling:
- Adjust sampling dynamically
- Normalize sampling per-key
- Retain errors & slow queries
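The sampling advice can be sketched roughly as follows. The event fields, rates and threshold are hypothetical, and the rates are static here; adjusting them dynamically would mean recomputing the per-key table from recent traffic, which is not shown.

```python
import random

SLOW_MS = 500                     # hypothetical threshold for a "slow" request
RATE_PER_KEY = {"/health": 100}   # hypothetical: keep 1 in 100 health checks
DEFAULT_RATE = 10                 # hypothetical: keep 1 in 10 of everything else


def sample(events, rng):
    """Keep every error and slow request; sample the rest per key."""
    kept = []
    for event in events:
        if event["error"] or event["duration_ms"] >= SLOW_MS:
            rate = 1
        else:
            rate = RATE_PER_KEY.get(event["path"], DEFAULT_RATE)
        # randrange(rate) == 0 is true with probability 1/rate.
        if rng.randrange(rate) == 0:
            # Record the rate so counts can be re-weighted later.
            kept.append({**event, "sample_rate": rate})
    return kept


def estimated_count(kept):
    """Each kept event stands in for sample_rate original events."""
    return sum(event["sample_rate"] for event in kept)


events = (
    [{"path": "/health", "duration_ms": 5, "error": False}] * 10_000
    + [{"path": "/checkout", "duration_ms": 40, "error": False}] * 1_000
    + [{"path": "/checkout", "duration_ms": 900, "error": False}] * 5
    + [{"path": "/checkout", "duration_ms": 40, "error": True}] * 3
)
kept = sample(events, random.Random(42))
print(len(events), len(kept), estimated_count(kept))
```

Storing the sample rate on every kept event is what lets aggregate questions still be answered approximately after most of the data has been dropped.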
One on One SRE
- Trauma: Extreme stress which overwhelms the ability to cope.
- Cites an example from GitHub where personal breaks, relief, etc. were stressed.
- One-on-one incident debrief advocate.
- Stresses informed consent. Let people know they can talk safely.
- Your role in the incident?
- What surprised you?
- How long did you work on the incident?
- Did you get the support you needed?
- Do you feel it was preventable?
- What actions do you feel good about?
- What could have gone better?
- What did you learn from this?
- What could we do to prevent re-occurrence?
- Did our tools and documentation help you?
- Did you practice self-care?
- Can you think of anyone else for me to talk to?
- Talking to people individually can build powerful, under-the-radar shadow networks.
- Empathy is a powerful way to effect organisational change (according to Harvard Business Review).
Prioritizing Trust While Creating Applications
Jennifer Davis, Microsoft
- It’s easy to postpone security until the end.
- Foundations: Defense in Depth.
- Threat modeling: Cheap and easy to do during early design.
- Architectural trade-offs.
- Linting/static code analysis.
- Secure Code Reviews.
- Plan for security violations.
- Don’t forget to talk to your vendors when spotting issues with them.
SRE & Product Management: How to Level up Your Team (and Career!) by Thinking like a Product Manager
Jen Wohlner, Livepeer
- Know your users and talk to them.
- Do user interviews.
- Ensure diversity in roles, titles, etc.
- Ask non-leading questions.
- Hold prototyping sprints.
- Add user-centric goals to roadmaps.
- Idea: Monthly roadmap meetings.
- The roadmap is an internal tool. Don’t just share it “as-is”, use bi-weekly emails/status updates for outside messaging.
- Follow-up with users regularly.
- Uses, pain points and needs change over time.
Building Resilience: How to Learn More from Incidents
Nick Stenning, Microsoft
- Why learn from incidents?
- Prevention is the common answer, but..? Read How Complex Systems Fail
- Human error is a symptom, not a cause.
- Avoid counterfactual reasoning during investigation.
- Avoid normative judgment.
- Avoid mechanistic reasoning.
- Run a facilitated post-incident review.
- Not just the on-call responder.
- Have a neutral facilitator.
- Prepare with 1:1 interviews.
- Lots of incidents? Don’t do it each time.
- Language in questions matters.
- Ask how over why. “How did…”
- Ask after different viewpoints.
- Ask about what normally happens in a similar situation when there is no incident.
- See also Etsy’s Debriefing Facilitation Guide.
- Ask how things went right.
- How did we recover?
- What insights/tools/people were involved?
- Keep the review and todo planning separate.
- Keep mitigation talk out of the review. Plan a separate session for that.
- Helps keep focus on analysing what happened.
- Allows subconscious to work out mitigations in the background in between the two sessions.
How Stripe Invests in Technical Infrastructure
Will Larson, Stripe
- Dig out of firefighting trenches.
- Reduce concurrent work.
- Finish something useful.
- Eliminate entire categories of problems (with creative solutions).
- If that doesn’t work? Hire!
- Also, don’t fall in love with firefighting.
- Listen to your users, especially when innovating.
- Benchmark against other similarly-sized companies.
- Do surveys of your users.
- Order by Return on Investment (RoI).
- Don’t try without users in the room.
- Have a long-term vision.
- Avoiding the wrong solution to a problem:
- Validate with (potential) users first.
- Try the hard cases early/first.
- Investment strategy:
- 40% user asks.
- 30% platform quality.
- 30% “key initiatives”.
- These are somewhat arbitrary, adjust to your own needs/constraints.
Pushing through Friction
Dan Na, Squarespace
- Why does friction occur?
- Company growth.
- Example from aviation safety.
- 5 mandatory checklists – ignored.
- Alarm – ignored
- Normalization of deviance. Deviance becomes the norm.
- Friction inevitable with growth, but made worse by normalization of deviance.
- Document single sources of truth.
- Update docs with acceptance criteria for work.
- Adopt processes to vet tech choices.
- Solicit the “What the fuck!?” of new hires.
- Long-term cultural fixes:
- Develop your own sense of agency (see Drive by Daniel Pink).
- Have important discussions face to face.
- Get to know people in other teams/departments.
- New idea? Try it once.
- See also https://talks.danielna.com.
Autopsy of a MySQL Automation Disaster
Jean-François Gagné, MessageBird (formerly booking.com)
- References MySQL High Availability at GitHub.
- MessageBird uses Orchestrator + ProxySQL.
- Solutions for split-brain:
- Kill on DB, lose data.
- Replay writes.
- AUTO_INCREMENT gets in the way.
- UUIDs are one possible solution.
- If doing UUIDs, consider monotonically increasing IDs instead, optimized for indexing.
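One way to get index-friendly IDs is to put a timestamp in front of the random part. The sketch below is my own illustration of that idea, not the scheme from the talk: a 48-bit millisecond timestamp followed by 80 random bits. IDs created in the same millisecond are not ordered relative to each other, so this is time-ordered rather than strictly monotonic.

```python
import os
import time


def time_ordered_id(now_ms=None):
    """Return a 32-char hex ID: 48-bit millisecond timestamp + 80 random bits."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    # Big-endian bytes make lexicographic order follow time order.
    timestamp = (now_ms & ((1 << 48) - 1)).to_bytes(6, "big")
    return (timestamp + os.urandom(10)).hex()


earlier = time_ordered_id(now_ms=1_000)
later = time_ordered_id(now_ms=1_001)
print(earlier, later, earlier < later)
```

Because new IDs land near the end of the index instead of at random positions, inserts touch fewer index pages than fully random UUIDs would.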
Evolution of Observability Tools at Pinterest
Naoman Abbas, Pinterest
- Challenge #1: Usage growth.
- Avoid tool fragmentation.
- Observability is expensive, but has a good Return on Investment.
- At Pinterest, everything goes through Kafka.
Fault Tree Analysis Applied to Apache Kafka
Andrey Falko, Lyft
- Good intro talk for anyone unfamiliar with FTA.
- Resources at github.com/afalko/fta-kafka.
- SCRAM software used for the modeling.
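For anyone new to FTA, the core arithmetic fits in a few lines. This sketch assumes independent basic events and uses a made-up tree with invented probabilities; it is not the model from the talk or from the linked repository.

```python
from math import prod


def and_gate(*probabilities):
    """All inputs must fail: multiply the probabilities (assumes independence)."""
    return prod(probabilities)


def or_gate(*probabilities):
    """Any input failing is enough: one minus the chance that none fail."""
    return 1 - prod(1 - p for p in probabilities)


# Hypothetical basic events for a partition with three replicas.
broker_down = 0.01
network_partition = 0.001

all_replicas_down = and_gate(broker_down, broker_down, broker_down)
partition_unavailable = or_gate(all_replicas_down, network_partition)

print(f"{all_replicas_down:.2e} {partition_unavailable:.6f}")
```

In this toy tree the network partition dominates the top event, which is the kind of insight a fault tree is meant to surface: adding a fourth replica would barely move the result.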
Applicable and Achievable Formal Verification
Heidy Khlaaf, Adelard LLP
- Coq isn’t very usable.
- IEC 61508 is the “Golden boy” safety standard.
- Safety justification triangle. (?)