LeadTime#19 - Recovery After a Failed Launch
How do you diagnose system failures while rebuilding team confidence? What questions help balance process improvements with delivery momentum after a high-profile disaster?
Hi, this newsletter is a weekly challenge for engineers thinking about management. I'm Péter Szász, writing about Engineering Leadership and training aspiring and first-time managers on this path.
In this newsletter, I pose a weekly EM challenge and leave it as a puzzle for you to think about before the next issue, where I share my thoughts on it.
Last Week’s Challenge
In the previous issue, I described a situation right after a failed product launch: a production incident, followed by a weekend of struggling with hotfixes. Read all the details here if you missed it. I'll approach this situation with my usual structure:
1. Goals to Achieve
Show leadership: Build back trust by showing both your team and stakeholders that you've got things under control. Avoid decision paralysis; start taking action while keeping everyone informed of your progress.
Rebuild team morale: Multiple developers working through the weekend is a massive red flag. Your team is demoralized and defensive - they need to see that you're committed to preventing this from happening again, not just promising to "be more careful next time."
Diagnose root causes systematically: This isn't just about process gaps - examine your technical setup, testing coverage, and deployment procedures. Look at both what failed and what systems should have caught these failures before they reached production. Run a blameless postmortem focused on system failures, not human mistakes.
Assess individual performance: Objectively determine whether performance issues contributed to this failure. Support team members who need development, but don't ignore systemic patterns if they exist.
Keep the roadmap moving: You can't halt delivery to fix everything. Find ways to pair small process improvements with feature development so you can test your fixes and get feedback on them in real life, with the next deployment.
2. Risks to Avoid
Going to extremes: There are two bad paths here: either you halt all delivery for a massive process overhaul, or you chalk this up to human error and make vague promises about "being more careful next time." Neither extreme works out well in reality; you need to find the right balance between them.
Further demoralizing the team: Those who worked the weekend might resent colleagues who didn't sacrifice their time, who, in turn, might be feeling some mix of guilt and missing out. Address this fragmentation directly during your retrospective or in individual conversations. Make it clear that weekend work is the (sometimes necessary) exception.
Losing more stakeholder confidence: Executives are asking pointed questions. Give them honest assessments and regular updates on your improvement plan. It's OK if you can't commit to everything yet, but whatever you do commit to, keep. Don't promise unrealistic timelines. You can often buy time by sharing your plan for gathering the necessary information, committing to a specific date for follow-up, or providing regular progress updates as needed. Remember, the goal is to regain confidence that you've got this under control — not to dump every piece of raw information on your manager's table for her to make sense of it all.
Overcorrection leading to risk aversion: Teams that get burned can become paralyzed by fear of another failure. Maintain a culture where calculated risks are still acceptable. (See my earlier article on Celebrating Failure.)
Inefficient postmortem: This can be a stressful and costly meeting. You need to balance psychological safety with transparency. Including stakeholders might provide valuable context and can help circulate information — but having leadership in the room could also prevent honest discussion if your culture isn't mature enough for it.
Taking on too much without asking for help: This could be a significant product, engineering, and organizational failure. If you feel overwhelmed, don't try to solve everything yourself. Involve your manager and peers in both the analysis and solution design as needed.
3. Five Questions
What really happened? Understanding the actual business damage helps prioritize fixes and sets realistic expectations with stakeholders. Was this a minor inconvenience, or did it seriously impact customer trust and revenue? Was this a surprise in a usually uneventful history of deployments, or are failures a regular occurrence? Why didn't we just roll back to the previous version and fix the bug during normal working hours? Was it due to technical limitations or external commitments? Who made the decisions, and how did we communicate?
How should the Postmortem be organized? The key is to build a group problem-solving mindset, not a culprit-finding one. You're all on the same team, trying to solve a puzzle together. Resist the temptation to be satisfied with an "XY made a mistake" answer; focus on the conditions that made the mistake easy to make. Why didn't testing catch these bugs? Why was the emergency deployment so difficult? You can use the 5 Whys technique to dig into root causes:
Why didn't the user sign-up modal work? Because there was a bug in the signup system.
➡️ Why? Because we didn't alter the table that stores new data.
➡️ Why? Because the person deploying forgot they needed to alter the database.
➡️ Why? Because there was no documented process for deployments with database structure impacts.
➡️ Why? Because we have outdated documentation.
➡️ Great, let's capture that as an action item to review and update all deployment-related documentation.

This is a great tool to identify action items that can prevent similar issues from happening in the future, by ensuring you address the root cause and not just a symptom. Make sure you document the postmortem's key findings: timeline of events, observations, action items, and next steps; and ensure these are shared transparently within the organization.
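One way to turn that last "Why" into a systemic fix, beyond updating documentation, is to version database changes as migration files that the deployment pipeline applies automatically, so a forgotten ALTER TABLE step becomes structurally impossible. Here's a minimal sketch of that idea in Python, using SQLite and made-up file names purely for illustration - your team's actual stack and migration tooling will differ:

```python
import sqlite3
from pathlib import Path

# Hypothetical layout: SQL files in ./migrations, applied in filename order,
# e.g. 001_create_users.sql, 002_add_signup_source_column.sql
MIGRATIONS_DIR = Path("migrations")


def apply_pending_migrations(db_path: str = "app.db") -> None:
    """Apply every migration file that hasn't been recorded as applied yet."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (filename TEXT PRIMARY KEY)"
    )
    applied = {
        row[0] for row in conn.execute("SELECT filename FROM schema_migrations")
    }

    for migration in sorted(MIGRATIONS_DIR.glob("*.sql")):
        if migration.name in applied:
            continue  # already ran against this database
        conn.executescript(migration.read_text())  # the ALTER TABLE lives here
        conn.execute(
            "INSERT INTO schema_migrations (filename) VALUES (?)", (migration.name,)
        )
        conn.commit()

    conn.close()


if __name__ == "__main__":
    apply_pending_migrations()
```

Running something like this as part of every deployment turns "the person deploying forgot to alter the database" from a human-memory problem into a property of the system, which is exactly the kind of outcome a good 5 Whys exercise should aim for.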
How do we balance process improvements with delivery momentum? You need quick and concrete wins to rebuild confidence, but you can't stop shipping features. Which process fixes can be implemented alongside the next feature release? What's the "minimum viable improvement" that reduces risk while maintaining velocity? Can we alter or descope the next features in the pipeline to find the development tasks that best pair with process improvements?
How can we better separate code deployments from feature releases? The two don't need to go hand in hand. Feature flags, staged rollouts, and similar techniques give you more safety and more options to course-correct once new code reaches production. Moving toward a more robust deployment process helps both in decreasing production incidents and in evaluating and iterating on new features as close to their final environment as possible.
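To make this concrete, here's a minimal sketch of a percentage-based feature flag check in Python. The flag name, percentage, and user ID are made up for illustration; in practice you'd more likely reach for an established feature-flag library or service, with the configuration living outside the codebase so it can change without a deploy:

```python
import hashlib

# Hypothetical flag configuration: feature name -> percentage of users who see it.
# In a real setup this would live in a config service, changeable without a deployment.
ROLLOUT_PERCENTAGES = {
    "new_signup_modal": 10,  # start by exposing the new modal to ~10% of users
}


def is_enabled(feature: str, user_id: str) -> bool:
    """Deterministically bucket a user into 0-99 and compare against the rollout percentage."""
    percentage = ROLLOUT_PERCENTAGES.get(feature, 0)
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percentage


def render_signup(user_id: str) -> str:
    if is_enabled("new_signup_modal", user_id):
        return "new sign-up modal"  # new code path: deployed, but gated
    return "old sign-up flow"       # known-good fallback


print(render_signup("user-42"))
```

With a gate like this in place, recovering from a broken sign-up modal could mean dialing a flag down to 0% instead of an emergency weekend deployment, and a staged rollout would have limited the blast radius to a small slice of users in the first place.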
Do I need help? This might be bigger than something a single team or EM can solve alone. Are there organizational, resource, or knowledge constraints that require escalation? What expertise or authority do you need that you don't currently have? Are there decisions you’re not sure of, and if so, who could help increase your confidence or challenge your assumptions?
The ultimate task is to build a resilient culture with iterative, small deployments decoupled from feature releases, where failures are rare and quickly recovered from without catastrophic impact on team morale or trust.
Did I miss any important considerations? How have you handled recovery after major production failures? Let me know in the comments!
This Week’s Challenge
Your company is creating an internal platform team, consisting mostly of your current staff and including you as the team's Engineering Manager. Three of your current developers and two more from other teams will make up the new team. You will need to take ownership of part of the tech stack and provide shared services and tooling for the entire engineering organization.
Your director wants this transition completed within three months and expects the platform team to operate with the same discipline as if it were serving external customers.
You need to figure out how to transform this group's approach from shipping user features to building and maintaining internal tools that other engineering teams will adopt and find valuable.
How do you approach this new role?
Think about what your goals would be and what risks you'd like to avoid in this situation. I'll share my thoughts next week. If you don't want to miss it, sign up here to receive those and similar weekly brain-teasers as soon as they are published:
Until then, here's a small piece of inspiration slightly related to this week’s challenge:
See you next week,
Péter