Nick Groenen

Top-10 talks of SREcon18 Europe


It’s been a month since I attended SREcon18 Europe and the majority of talks are now available online. In this article I look at the ten talks which stuck with me the most in the days and weeks following the conference.

Summaries are provided to highlight key points and to help you decide whether you should invest time into watching the videos yourself as well.

Ranking method

Before going into the talks themselves, it’s important to talk about how everything was scored and ranked. Put simply, there was nothing scientific about the process:

  • If I personally find the topic interesting, it gets points.
  • If the topic is novel or discusses items that aren’t commonly talked about in SRE and/or engineering circles, it gets points.
  • If the topic is presented exceptionally well, it gets points.
  • If the topic discusses direct/personal experiences in applying SRE principles, it gets points (bonus points for emphasising failures/lessons learned over “we did X and it worked out great!”).

Given the above, it’s important to realize the top 10 I present is biased towards my personal knowledge, interests and various other non-quantifiable factors. It also reflects the top 10 of talks I personally attended and is not a top 10 of talks from SREcon18 Europe overall.

Finally, it is also important to note there is no order within this ranking. The first presentation on the list is no better or worse than the last.

Now let’s dig in.

The top 10

Your System Has Recovered from an Incident, but Have Your Developers?

Why it’s worth watching

As engineers, we focus heavily on the technical aspects of failure in our systems, but we rarely spend much time talking about the emotional impact of dealing with these failures. This talk makes you more aware of these aspects and offers ways in which we might lessen the impact.

Key points

  • We focus a lot on the technology and technical aspects of failure in systems, but not so much on the human side.
  • Straw poll among 40 DigitalOcean production engineers:
    • Almost half said they were stressed or very stressed after an incident.
    • (No surprised faces in the audience at this)
    • Over half of respondents stated decreased ability to fall asleep after an incident, decreased ability to concentrate, decreased mood and desire to be social.
  • Looking at medical practitioners, physicians reported increased anxiety about errors, loss of confidence, sleeping difficulties, reduced job satisfaction and harm to their reputation after making errors.
    • Majority also said that peer support and counseling could help.
    • But many don’t out of fear.
  • Lessons from stand-up comedy.
    • During the incident:
      • Acknowledge something is wrong.
      • Embrace the moment as a learning opportunity.
    • After the incident:
      • Put things into perspective.
      • Figure out where things went wrong and how to avoid them next time.
      • Understand how to mentally get back to a good place.
  • Practice mindfulness. It doesn’t have to cost a lot of time to benefit you.
  • Looking at Olympic athletes, actively practicing self-compassion led to long-term improvements in well-being.

Building a Fellowship Program to Mentor and Grow Your SRE Team

Why it’s worth watching

Good insights here into how to build effective mentoring programs, based on lessons learned from practice at DigitalOcean.

Key points

  • DigitalOcean created a 2-week fellowship program pairing any developer interested in learning more about infrastructure with a senior engineer.
  • Program is custom-tailored to individual interests and goals.
  • Slack has been both a blessing and a curse.
  • Individuals learned new skills and experiences, some even found their passion and joined the infrastructure team in the end.
  • At the same time, infrastructure teams (which were offering this fellowship) learned more about other teams and what kind of challenges they face.
  • The fellowship program strengthened relationships across teams.
  • Having a mix of different experiences and background improves teams and their performance.

The Math behind Project Scheduling, Bug Tracking, and Triage

Why it’s worth watching

Project management and scheduling is a very complex and complicated topic. There are interesting lessons to be learned here about why we often miss deadlines and what we might do about it.

Key points

  • Goal setting doesn’t work.
  • Project management often suffers from student syndrome.
  • Humans are bad at estimating total units of work, but generally quite good at calculating ratios or estimations relative to other units.
  • When estimating, don’t try to relate to time. Use relative estimates instead.
  • Stories should be large, slow, infrequent tickets. Bugs should be constant and generally fast.
  • Estimate large user stories, only break them down later if needed.
  • Do proper bug triage. Label bugs per feature area.
  • Declaring “bug bankruptcy” and closing all (old) bugs doesn’t solve anything.
    • Pisses off your reporters.
    • Doesn’t solve the actual issue, resulting in a new bug with potentially less context.
    • To really change anything you have to reverse the trend. You do this by spending more time on actually fixing bugs rather than building new features which introduce their own new bugs.
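As an aside (my own illustration, not from the talk), the relative-estimation idea can be shown in a small sketch: stories are sized only relative to one another in abstract points, and calendar time enters solely through the team’s measured velocity. The function name and numbers below are hypothetical.

```python
# Illustration (mine, not from the talk): relative estimation.
# Stories are sized relative to each other in abstract points;
# time only enters via the team's measured velocity per sprint.

def sprints_needed(story_points, velocity_per_sprint):
    """Estimate how many sprints a backlog will take."""
    total = sum(story_points)
    # Round up with negated floor division: a partially
    # filled sprint still consumes a whole sprint.
    return -(-total // velocity_per_sprint)

backlog = [8, 5, 13, 3, 5]   # relative sizes, not hours
velocity = 10                # points completed per sprint, measured
print(sprints_needed(backlog, velocity))  # → 4
```

The point of the sketch is that no one ever estimated hours: only ratios between stories, which the talk argues humans are much better at producing.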

Ethics in Computing

Why it’s worth watching

Computing, as an industry, has an ethics problem. This talk highlights how, and offers some pointers on how to get started changing the situation for the better.

Key points

  • We have an ethics problem.
    • Volkswagen emissions scandal.
    • Uber scandals (multiple examples).
    • Gave an example of immigration software that always issued a negative recommendation due to a software bug.
    • Gave another example of soap dispensers in public washrooms that were racially biased; the sensors only responded to people with white skin.
  • Ethics isn’t taught in schools nor part of computer science curriculum.
  • Ethics change over time. What was acceptable 25 years ago isn’t always acceptable today.
  • Computer science is in its infancy.
  • You can help make a positive impact.

I’m SRE and You Can Too!—A Fine Manual for Migrating Your Organization to the New Hotness

Why it’s worth watching

Good advice here for organisations which are just getting started with introducing the concepts of SRE.

Key points

  • Reliability should have full company buy-in, not just from the IT team(s) alone.
  • You need to have SLOs for your services.
  • SRE builds tools and infrastructure to empower development.
  • When adopting SRE, perform readiness reviews of all services, not only new ones going forward.
  • SRE can work without dedicated SRE team(s), but it takes more effort and dedication to stay on track.
  • Sharing of knowledge is important. Do ad-hoc classes.
  • Wrap black boxes that are hard to modify so you can still add telemetry, load shedding, etc.
  • Start small and iterate. Let growing popularity drive greater adoption.
  • Reliable and predictable CI/CD pipelines reduce risk.
  • Bureaucratic and elaborate business structures don’t improve your service and uptime, most likely the opposite. Rapid iteration and early feedback improves service.
  • Build trust slowly. Don’t over-promise, be humble and over-deliver instead.

What Medicine Can Teach Us about Being On-Call

Why it’s worth watching

Medicine deals with a number of problems which have similarities to SRE/ops. As an older and well-established profession, it has developed a number of techniques which we could adapt and apply to tech as well.

Key points

  • Medical practice typically involves various specialized teams all working together towards a common goal (the patient). This is often similar to SRE/Ops, at least in larger companies.
  • Medical improvement comes through scientific/quantifiable validation.
  • Reducing critical incidents
    • Hospitals have rapid response teams. Decreased ICU deaths by 12%.
      • Anyone can page the rapid response team.
      • Goal: identify potential problems before they become a serious issue.
    • Reducing self-inflicted incidents
      • Checklists eliminate human error. 5-step checklist for central line procedure dropped infection rate from 11% to 0%.
      • Checklists in runbooks could prevent errors on maintenance procedures we do.
      • I can personally highly recommend The Checklist Manifesto by Atul Gawande for a lot of great insights into the power of checklists.
    • On-call hand-offs
      • Key points of information communicated in a structured format when handing over patients.
  • Different doctors/nurses make patients repeat symptoms even when it’s already in the notes. This catches errors and invalid assumptions.
  • Work/life balance
    • Post-call days: Acknowledge rest is needed after being on-call.
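The checklist idea translates naturally to runbooks. As a minimal sketch (my own illustration; the function and step names are hypothetical, not from the talk), a maintenance procedure could force the operator to confirm each step before the next one runs, mirroring the central-line checklist described above:

```python
# Illustration (mine, not from the talk): a runbook step checklist.
# Each step must be explicitly confirmed before continuing, so a
# step cannot be silently skipped during a maintenance procedure.

CHECKLIST = [
    "Announce maintenance window in the incident channel",
    "Verify a recent backup exists and is restorable",
    "Drain traffic from the target node",
    "Perform the maintenance task",
    "Re-enable traffic and verify health checks",
]

def run_checklist(steps, confirm):
    """Walk through steps; `confirm(step)` returns True once done."""
    completed = []
    for step in steps:
        if not confirm(step):
            # Abort immediately: skipping a step defeats the checklist.
            raise RuntimeError(f"Step not confirmed: {step}")
        completed.append(step)
    return completed

# Non-interactive usage example; in practice `confirm` might prompt
# the operator via input() or a chatops bot.
done = run_checklist(CHECKLIST, confirm=lambda step: True)
assert done == CHECKLIST
```

In a real setup the confirmation would come from a human or an automated health check rather than a lambda, but the structure stays the same: an explicit, ordered list that refuses to proceed past an unconfirmed step.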

Junior Engineers Are Features, Not Bugs

Why it’s worth watching

It has hand-drawn graphics. What’s not to love?

More seriously though, I think mentoring of more junior people is a very important topic which doesn’t receive quite enough attention in the industry at the moment.

Key points

  • Not hiring juniors now and giving them opportunities to grow will harm our industry in the long run.
  • Be open-minded about your hiring pool - coding bootcamps, STEM-related degrees not strictly about computer science, QA and support engineers, etc.
  • Juniors bring a fresh perspective
    • Curiosity
    • Love of the game, full of fresh enthusiasm
    • “Grunt work” may not be grunt work to them.
    • Diverse teams work better together - this includes junior vs. senior diversity.
  • Don’t hire juniors when…
    • Extreme development velocity is required.
    • There are no opportunities to support growth.
    • Role or product is not well-defined.
  • Be sure to reward mentorship (including public recognition and salary increases).
  • Hire for potential.
  • Make sure to properly manage expectations (for all parties involved).

The 7 Deadly Sins of Documentation

Why it’s worth watching

Everybody thinks their documentation isn’t up to snuff. Lots of useful advice here to improve the situation.

Key points

  • Many people think their documentation is bad. GitHub survey: 93% of open source projects think their docs are insufficient.
  • Documentation often isn’t given priority, considered a chore/drag.
  • Make docs a deliverable in your scrum process (or whichever methodology you adhere to).
  • Create a document setting out standards and style guide so all docs are consistent.
  • Make sure people know what type of documentation goes where.
  • Do reviews for docs just like code reviews.
  • Make sure everybody pitches in.
  • Different situations call for different types of docs.
    • Runbooks to troubleshoot immediate problems.
    • Architecture writeups to understand a system as a whole or its relationship to other systems.
    • Make the goal of a document clear.
  • Having documents scattered all over the place means they’re hard to find, search, index, etc.
  • Have portals with pointers into critical or top-level docs.
  • Prune old docs aggressively - outdated/conflicting/misleading docs are worse than no docs.
  • Code is not documentation. Even very good code doesn’t show design decisions and trade-offs.
  • Avoid jargon overuse.

SRE Theory vs. Practice: A Song of Ice and TireFire

Why it’s worth watching

This talk deserves a spot in my top-10 not for content but for form. Put simply, it provided some good laughs for the audience which was a nice way to get the conference started.

Dealing with Dark Debt: Lessons Learnt at Goldman Sachs

Why it’s worth watching

As software engineers we are familiar with the concept of technical debt, but it comes in different forms. We rarely talk about the more complex form known as dark debt, or what we can do about it. It’s also very well presented by Vanessa.

Key points

  • Dark debt is a form of technical debt that is invisible until it causes failures.
  • Goldman Sachs worked on building a sustainable software development ecosystem to combat this.
  • Focused on linting/beaconing.
  • Built their own metrics agent/framework which was automatically included in all their projects so they could have metrics on everything with little effort.
  • Chaos engineering & (automatic) fault injection used to uncover dark debt.
  • Don’t play whack-a-mole with every bug - be more systematic.
  • Be transparent about everything.
  • Use dedicated sprints for larger refactoring.

Notable others

These didn’t make the final cut to be included in the top 10, but they were close contenders: