Some thoughts on how you could evaluate the state of the systems your team owns.
One way to use this:
- Put some of these criteria on the Y axis.
- Put the names of the components you own on the X axis.
- Give everything a score from 0 to 1.
- Either average the scores or sum them to figure out which components need the most love.
If a criterion scores the same for everything you own, it might make sense to skip it. For example, if you own 10 services, but they all use a common build pipeline that you don’t maintain, it might make sense to skip that criterion.
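As a rough illustration of the scoring grid described above, here is a minimal Python sketch. The component names, criteria, and scores are all made up, and it assumes 0 is the worst score and 1 the best. The `skip` argument drops a criterion that is identical for every component.

```python
from statistics import mean

# Hypothetical components and criteria; 0 is assumed worst, 1 best.
scores = {
    "checkout-api": {"build": 1.0, "on_call": 0.5, "tests": 0.25},
    "search-worker": {"build": 1.0, "on_call": 0.0, "tests": 0.5},
    "billing-cron": {"build": 1.0, "on_call": 0.5, "tests": 1.0},
}

def rank_components(scores, skip=()):
    """Return (component, average score) pairs, neediest component first."""
    averages = {
        name: mean(v for criterion, v in by_criterion.items() if criterion not in skip)
        for name, by_criterion in scores.items()
    }
    return sorted(averages.items(), key=lambda item: item[1])

# "build" is the same everywhere (a shared pipeline), so it is skipped.
ranking = rank_components(scores, skip={"build"})
# ranking[0] is ("search-worker", 0.25): the component that needs the most love
```

Averaging keeps components comparable when some criteria are skipped; summing works just as well if every component is scored on the same set of criteria.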
Consistent Build from Dev -> Prod
- The same image/code used in each step, not built at each step
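One way to check this is to compare the artifact digest running in each environment. The sketch below assumes you can already collect those digests somehow; the environment names and digest values are hypothetical.

```python
def find_drift(digests_by_env, source_env="dev"):
    """Return environments whose artifact digest differs from the one in `source_env`."""
    expected = digests_by_env[source_env]
    return sorted(env for env, digest in digests_by_env.items() if digest != expected)

deployed = {
    "dev": "sha256:aaa111",      # hypothetical digests
    "staging": "sha256:aaa111",  # promoted from dev, so it matches
    "prod": "sha256:bbb222",     # rebuilt separately, so it no longer matches
}
drift = find_drift(deployed)
# drift is ["prod"]
```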
Blocking tests at each phase
- Unit is a good start, acceptance is better
Canary and/or staged releases to production
- Having an “alpha” or “canary” production environment can save you a good deal of heartache
Easy, well understood deployment process
- Can you deploy and roll back in 1 step?
- Is it fast? Both the overall process and each individual step?
Code Ownership & Quality
What is the level of comfort your team has with the code?
- Has your team built the codebase?
- Have they maintained it in any meaningful way?
- Do they own it without knowing it?
Well Factored Code
- If an ax-wielding maniac who knew where you lived was the next person to maintain the code you’re working on, would you be worried?
Health Quality Score
- Does your company have a way to measure code health?
- If not, could you use something off the shelf?
- How many bugs per component exist? Is that number increasing or decreasing?
Fully Owned Code
- Are you in a codebase where you share dependencies or entire sections of your code?
Well Documented Code
- Not commenting per se, but diagrams/drawings/something to help folks understand and dive in
- Circuit Breakers
- Rate Limiting
- If the services you rely on fail, can you still return a useful response?
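To make the circuit breaker and fallback questions concrete, here is a minimal sketch. It is not a production implementation: the class, its thresholds, and the injectable `clock` parameter are assumptions for illustration, and real services would more likely use an existing resilience library.

```python
import time

class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; retries after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        # While open, skip the dependency entirely until the cool-down has passed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result

def fetch_recommendations():
    # Hypothetical dependency that is currently down.
    raise TimeoutError("recommendation service is down")

breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
# Every call still returns a useful (empty) response instead of an error.
responses = [breaker.call(fetch_recommendations, lambda: []) for _ in range(5)]
```

After the second failure the breaker stops calling the dependency at all, which protects both your latency and the struggling service.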
On-Call / Triage
Everyone on the team is on-call
Do you have a process for handling bugs / requests / questions?
- Is there someone who reliably triages questions and concerns?
- Is it one person, or a rotation of people?
- Can you actually fix problems from your alerts?
- Do they cover most of the common errors your systems experience?
- Do the alerts identify the issue and point towards a resolution, or the tools to resolve it?
- When your alerts fire, does that cause an action, or do they frequently get ignored/silenced?
- Are your on-calls dreading their shifts because of pages day & night?
Do you have a formal incident policy?
- When do you work into the night vs work 9-5 until it’s over?
- Do you have a formal incident review process once the incident is over?
- Do you have a process to make sure incident remediation gets completed?
Low Incident Rate
Service Level Objective (SLO) for Services
- What is the expected response time? What’s your TP50? TP95? TP99?
- What is your 200 (success) rate? Is four nines enough?
- Do you alert when your service doesn’t perform as expected?
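The SLO questions above can be sketched in a few lines. This example uses the nearest-rank definition of a percentile (other definitions interpolate), and the latency data and SLO targets are invented for illustration.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def slo_report(latencies_ms, status_codes, slo):
    """Compare observed latency percentiles and success rate with the targets in `slo`."""
    observed = {
        "tp50_ms": percentile(latencies_ms, 50),
        "tp95_ms": percentile(latencies_ms, 95),
        "tp99_ms": percentile(latencies_ms, 99),
        "success_rate": sum(200 <= code < 300 for code in status_codes) / len(status_codes),
    }
    # Latency breaches when it is above target; success rate when it is below.
    breaches = [
        name for name, value in observed.items()
        if (value < slo[name] if name == "success_rate" else value > slo[name])
    ]
    return observed, breaches

latencies_ms = list(range(1, 101))   # made-up data: 1 ms .. 100 ms
status_codes = [200] * 99 + [500]    # made-up data: one failure in a hundred
slo = {"tp50_ms": 60, "tp95_ms": 90, "tp99_ms": 150, "success_rate": 0.9999}

observed, breaches = slo_report(latencies_ms, status_codes, slo)
# breaches is ["tp95_ms", "success_rate"]; anything in this list should raise an alert
```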
Code is easy to debug
- Easy to plug in debugger?
- Error messages that make sense?
- Can you trace calls from start to finish through your systems?
- Can you time calls from start to finish through your system?
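For the tracing and timing questions, the core idea is a shared trace id plus a timer around each step. The sketch below keeps spans in an in-memory list as a stand-in for a real tracing backend; the `span` helper and the step names are hypothetical.

```python
import time
import uuid
from contextlib import contextmanager

spans = []  # in-memory stand-in for a real tracing backend

@contextmanager
def span(name, trace_id):
    """Record how long the wrapped block took, tagged with a shared trace id."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })

def handle_request():
    trace_id = uuid.uuid4().hex  # generated once at the edge, passed to every hop
    with span("handle_request", trace_id):
        with span("load_user", trace_id):
            time.sleep(0.01)   # placeholder for a database call
        with span("render", trace_id):
            time.sleep(0.005)  # placeholder for template rendering
    return trace_id

trace_id = handle_request()
# Filtering on the trace id reconstructs one request from start to finish.
request_spans = [s for s in spans if s["trace_id"] == trace_id]
```

In a system with multiple services, the same trace id would also be forwarded on outgoing calls (for example in a request header) so downstream spans can be stitched together.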
Testing / Tooling
Load Testing Tooling in Place
- Can you determine the maximum number of callers you can serve while still meeting your SLOs?
- Can you test your services end to end?
- Do you have tests that bridge layers of your codebase?
High Test Coverage
- What percentage of your code has unit tests?
- Do you have tests that fail inconsistently? Fix or delete them.
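One simple way to spot tests that fail inconsistently is to rerun each test many times against unchanged code and see whether the outcome varies. The sketch below is an assumption about how you might do that by hand; test runners often have plugins for the same job.

```python
def classify(test_func, runs=20):
    """Run a test repeatedly against unchanged code and label its behaviour."""
    outcomes = set()
    for _ in range(runs):
        try:
            test_func()
            outcomes.add("pass")
        except AssertionError:
            outcomes.add("fail")
    if outcomes == {"pass"}:
        return "stable"
    if outcomes == {"fail"}:
        return "broken"
    return "flaky"

def test_addition():
    assert 1 + 1 == 2

def make_state_dependent_test():
    # Hypothetical test that leaks state between runs and fails every third run.
    calls = {"n": 0}
    def test():
        calls["n"] += 1
        assert calls["n"] % 3 != 0
    return test

labels = {
    "test_addition": classify(test_addition),
    "state_dependent": classify(make_state_dependent_test()),
}
# labels is {"test_addition": "stable", "state_dependent": "flaky"}
```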
- Do you track the big metrics?
- Response/Run time
- 200’s or successful operations
- 500’s or failed operations
- Beyond performance, can you track the rate at which relevant events happen or fail to happen?
- Do you use all the information you’re logging?
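Tracking events that stop happening is easy to overlook, because nothing fails loudly. A minimal sketch of a sliding-window event rate follows; the class name, window size, and "order placed" events are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import deque

class EventRate:
    """Counts events in a sliding window so a silent stream can be noticed."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def record(self, now):
        self.timestamps.append(now)

    def per_minute(self, now):
        # Drop events that have fallen out of the window, then scale to a per-minute rate.
        while self.timestamps and now - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) * 60.0 / self.window_seconds

    def is_silent(self, now, min_per_minute):
        return self.per_minute(now) < min_per_minute

orders = EventRate(window_seconds=60.0)
for second in range(0, 30, 3):        # ten hypothetical "order placed" events
    orders.record(float(second))

rate_now = orders.per_minute(30.0)                        # 10.0 events per minute
silent_later = orders.is_silent(120.0, min_per_minute=1)  # True: no events in the last minute
```

An alert on `is_silent` catches the case where the service looks healthy but the business event it exists for has quietly stopped.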
Is there a technical roadmap for each component?
Is there a product roadmap for each component?