LeadTime #23 - How Many Engineers Do You Need for This New Platform?
How do you approach resource planning for platform migrations when the answer isn't a simple headcount? What questions should you ask, and how do you structure your timeline when replacing critical infrastructure?
Hi, this newsletter is a weekly challenge for engineers thinking about management. I'm Péter Szász, writing about Engineering Leadership and training aspiring and first-time managers on this path.
In this newsletter, I pose a weekly EM challenge and leave it as a puzzle for you to think about before the next issue, where I share my thoughts on it.
Last Week’s Challenge
Last week, I described a simple question from a director to a Platform Engineering Manager: How many people do you need to support a new observability platform? Read the details here if you missed it.
And thank you, Rodrigo, for inspiring this challenge. If you have a case that you’re curious how I’d tackle, reach out here on Substack or via hello@peterszasz.com.
I'll approach this question with my usual Goals - Risks - Questions structure:
Goals to Achieve
Reframe the conversation: The question "How many engineers?" is a tricky one because it misses the time dimension. I can't answer with a static number like "2.5 engineers." I need to help my director understand that this is about project planning and phases, not headcount allocation. The real question is: what's the migration timeline, what are the phases, and how does this fit into our existing platform goals?
Understand the "why" and "why now": I need to be clear about why we're doing this migration and why now, because I'll have to defend this decision credibly in front of my team, other developers, and stakeholders. Is this driven by scalability issues, operational burden, cost optimization, or feature gaps? Without a clear narrative, I won’t be equipped to answer challenging questions or make implementation decisions autonomously.
Plan for organizational impact: 80+ developers will be affected by this change. We should come up with a phased approach that lets teams switch over gradually, on a timeline that best matches their own goals.
Risks to Avoid
Cost explosion: These platforms make it really easy to turn on shiny features like detailed Application Performance Monitoring that can shoot costs through the roof. The $240K quote is an estimate based on current usage patterns, but our needs are changing, and it's hard to predict actual usage without doing the migration and measuring.
Vendor "self-service" promises: The vendor promises self-service, but the reality might involve significant hand-holding, custom configuration, and ongoing relationship management. This could become a bigger operational burden than our current setup that developers are already familiar with.
Historical data loss: We might lose access to some historical metrics that are important beyond typical log retention periods. Traffic metrics can be crucial for sales or marketing analysis years later. We shouldn't migrate all historical data, but we need to identify what's truly valuable long-term, or decide to keep the old system alive on minimal infrastructure as a read-only reporting tool.
Organizational disruption during migration: Without careful planning, we risk productivity hits across all engineering teams during the transition. The support burden during migration could overwhelm our platform team if we don’t phase carefully.
Migration complexity underestimation: Depending on how much tech debt exists in our current observability integrations across the organization, this could be a simple wrapper switch or a massive refactoring effort touching every service (see the sketch after this list for what the wrapper approach could look like). In the latter case, we should treat this as an opportunity to pair the introduction of the new service with a refactoring of our current observability implementation within the product's code.
No rollback strategy: What happens if the new platform doesn't work as promised? We need a clear path back to our current setup without losing operational visibility.
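To make the "wrapper switch vs. massive refactoring" distinction above more concrete, here's a minimal sketch of what a centralized observability facade could look like. This is illustrative Python with made-up names (MetricsBackend, PrometheusBackend, NewVendorBackend), not any vendor's real SDK; the point is that if product code only talks to a thin internal interface, swapping the platform is a change in one module instead of across 80+ developers' codebases.

```python
# Hypothetical sketch of an internal observability facade.
# None of these classes map to a real SDK; they only illustrate the structure.
from typing import Protocol


class MetricsBackend(Protocol):
    """Anything that can record a counter increment and a timing."""

    def increment(self, name: str, value: int = 1, **labels: str) -> None: ...
    def observe_ms(self, name: str, duration_ms: float, **labels: str) -> None: ...


class PrometheusBackend:
    """Stand-in for the current Prometheus/Grafana-based implementation."""

    def increment(self, name: str, value: int = 1, **labels: str) -> None:
        print(f"[prometheus] {name} {labels} += {value}")

    def observe_ms(self, name: str, duration_ms: float, **labels: str) -> None:
        print(f"[prometheus] {name} {labels} = {duration_ms}ms")


class NewVendorBackend:
    """Stand-in for the new platform's SDK, hidden behind the same interface."""

    def increment(self, name: str, value: int = 1, **labels: str) -> None:
        print(f"[new-vendor] {name} {labels} += {value}")

    def observe_ms(self, name: str, duration_ms: float, **labels: str) -> None:
        print(f"[new-vendor] {name} {labels} = {duration_ms}ms")


# Product teams import only this module-level client. Migrating the platform
# is then a one-line change (or a feature flag) here, not a sweep of every service.
metrics: MetricsBackend = PrometheusBackend()


def record_checkout(duration_ms: float) -> None:
    metrics.increment("checkout.completed", service="shop")
    metrics.observe_ms("checkout.duration", duration_ms, service="shop")
```

If, on the other hand, services call the vendor SDK directly all over the codebase, every one of those call sites becomes migration work, which is exactly what the code-health question below is meant to surface.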
5 Questions
What's driving this migration decision, and why now? I need to understand the specific problems we're solving - scalability bottlenecks, operational overhead, missing features, or cost optimization. This narrative will be crucial when communicating with the team, defending resource allocation, or making implementation decisions. Is this reactive (current system is failing) or proactive (positioning for future growth)?
What's our current code health around observability integrations? How many places in our codebase directly call our monitoring stack versus using a centralized library? This determines whether we're looking at a straightforward wrapper migration or a massive refactoring effort across 80+ developers' codebases. Do we have the autonomy to update client code in product teams, or do we need their collaboration?
How will we measure the success or failure of the project? What specific KPIs will prove the expected "reduced operational overhead"? Maybe it's reducing our on-call volume by X%, decreasing mean time to resolution, improving developer satisfaction, freeing up Y hours per week of platform team capacity, or simply overall cost saving. We need concrete metrics to evaluate the investment.
What's our rollback strategy if this doesn't work out? Can we run both systems in parallel indefinitely if needed (a possible parallel-run setup is sketched after this list)? How long can we maintain the old stack on minimal infrastructure for historical data access? What's our decision timeline for fully committing to the new platform versus reverting?
Which team members should lead different aspects of this migration? Who has experience with similar observability platforms? How can we distribute the learning opportunities? This could be significant professional development for engineers wanting to expand beyond our current Prometheus/Grafana expertise.
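On the rollback question above: if the integrations already go through a facade like the one sketched earlier, a parallel run can be as simple as a fan-out backend that shadows every write to the new platform while the old one stays authoritative. Again a minimal, hypothetical Python sketch reusing the made-up names from the earlier example; rolling back means removing one backend from the constructor, nothing more.

```python
# Hypothetical sketch of a dual-write backend for the parallel-run phase.
# Reuses the illustrative MetricsBackend/PrometheusBackend/NewVendorBackend
# names from the earlier facade sketch.
class DualWriteBackend:
    """Sends every metric to the trusted primary and shadows it to secondaries."""

    def __init__(self, primary: MetricsBackend, *secondaries: MetricsBackend) -> None:
        self.primary = primary
        self.secondaries = secondaries

    def increment(self, name: str, value: int = 1, **labels: str) -> None:
        self.primary.increment(name, value, **labels)
        for backend in self.secondaries:
            try:
                backend.increment(name, value, **labels)
            except Exception:
                pass  # the experimental platform must never break the trusted one

    def observe_ms(self, name: str, duration_ms: float, **labels: str) -> None:
        self.primary.observe_ms(name, duration_ms, **labels)
        for backend in self.secondaries:
            try:
                backend.observe_ms(name, duration_ms, **labels)
            except Exception:
                pass


# During migration: the old platform stays authoritative, the new one shadows it.
metrics = DualWriteBackend(PrometheusBackend(), NewVendorBackend())
```

The same setup also helps with the cost-explosion risk: shadowing real traffic to the new platform for a few weeks gives actual usage numbers to compare against the $240K quote before fully committing.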
This Week’s Challenge
Moving on to the next one, let’s make it a bit lighter:
You're the recently promoted Engineering Manager of a small product engineering team at a remote-first company. After the promotion, you find yourself with surprisingly little to do day-to-day. You handle emails in the morning, approve everyone's daily plans, answer a few Slack messages during the day, and then... nothing. Apart from Monday's weekly all-team call, around which you're swamped with preparation and small follow-ups, you're working maybe 20 hours a week at most. Your team is highly independent and rarely reaches out for help, even though you've made it clear they should.
You've identified several process improvements and strategic initiatives that could benefit the team, but all your proposals are sitting on your boss's desk waiting for approval. He's the company owner who wants to be involved in every decision, and when you do get meetings with him, he says he "needs to think about" your suggestions. You're starting to worry — is this normal for a new manager, or are you missing something fundamental about the role? Your team seems to be performing well without much intervention, but you can't shake the feeling that you should be doing more to justify your position and grow your impact.
What do you do?
Think about what your goals would be and what risks you'd like to avoid in this situation. I'll share my thoughts next week. If you don't want to miss it, sign up here to receive those and similar weekly managerial challenges as soon as they are published:
Until then, here's a small piece of inspiration about platform engineering:
See you next week,
Péter