A staged playbook for migrating an EdTech platform from monolith to microservices without downtime, big-bang rewrites, or product velocity loss.

From Monolith to Microservices: An EdTech Migration Playbook
2026-04-19 | EdTech, Architecture, Scale

Introduction

Most EdTech platforms start as monoliths. That is the right call: early-stage product velocity matters more than architectural purity, and monoliths optimize for the former. But there is a transition point, usually around 100,000 to 200,000 daily active users or when the engineering team passes 15 to 20 people, where the monolith starts costing more than it saves. Deploys become risky. Onboarding new engineers takes months. A change to one feature ripples through the rest of the codebase. This article is the playbook we use for migrating EdTech monoliths to microservices in production, without big-bang rewrites, without multi-hour maintenance windows, and without the product velocity drop that sinks most migrations halfway through. The pattern is staged: extract one service at a time, prove the pattern works, then accelerate.

When a monolith is actually the problem

Let's be clear about something first. A monolith is not a bug. Plenty of large, healthy products run on one well-organized codebase, and the modular-monolith crowd is right that you can get a long way before you need separate services. So the question is never "are we a monolith?" The question is whether the monolith has stopped paying its way. We look at behavior, not at the codebase. Three signals tell us the architecture (not the code) is now the bottleneck.
The first is deploy fear. When a team starts batching changes into big releases because nobody trusts a small one, that is fear talking. A healthy system makes the safe thing (a tiny, frequent deploy) the easy thing. When every push risks taking down billing because billing shares a process with the lesson player, people stop pushing. Releases get bigger, rarer, and scarier, which is exactly backwards.
The second is the blast radius of one slow query. This one is brutal on EdTech platforms specifically. A teacher opens a gradebook for a class of 400 and runs a report that was fine when classes were 30. That single query saturates the connection pool, and because the pool is shared, the login page, the video player, and the payment callback all start timing out too. One person's heavy read just took down the whole product. We have watched a platform fall over at 9am on a Monday for exactly this reason, and the fix in the moment is always ugly.
The third is teams blocking each other. You hired a payments team and a content team, and now the payments change cannot merge until someone on content reviews it, because they both edit the same files and migrations. The org chart says two teams. The codebase says one. When that mismatch shows up in your sprint velocity (and it always does, as a quiet tax nobody put on the board), the architecture is fighting the way you have decided to work. If you see all three signals together, it is time. If you see none of them, leave the monolith alone and go ship features.

The strangler-fig pattern adapted for EdTech

The strangler fig is a plant that grows around a host tree, slowly, until one day the host is gone and the fig is holding its own shape. Martin Fowler borrowed the name for software, and it is the pattern we reach for every time. Instead of building a replacement system in the dark and flipping to it on launch day, you put a thin routing layer in front of the monolith and start peeling off one capability at a time. The monolith keeps serving everything you have not extracted yet. The new service takes over the part you have extracted. Traffic decides which one answers.
Mechanically, the host needs a front door. We put a routing proxy (an API gateway, or just nginx with rules to start) in front of everything. On day one it forwards 100% of requests straight to the monolith and changes nothing. That is on purpose. The proxy earns its keep later, because once it exists, moving a route from the monolith to a new service is a config change, not an architectural event. /api/auth/* goes to the auth service. Everything else still goes to the monolith. Nobody using the platform can tell.
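The routing rule above can be sketched in a few lines. This is a minimal illustration, not a gateway implementation: the backend URLs, the `Route` type, and `pick_backend` are hypothetical names, and a real deployment would express the same rule in the gateway or nginx configuration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Route:
    prefix: str   # path prefix owned by an extracted service
    backend: str  # where matching requests are forwarded


# Hypothetical internal addresses, for illustration only.
MONOLITH = "http://monolith.internal"
ROUTES = [Route("/api/auth/", "http://auth.internal")]


def pick_backend(path: str, routes: Sequence[Route] = ROUTES,
                 default: str = MONOLITH) -> str:
    """Return the backend for a request path; the longest matching prefix wins."""
    matches = [r for r in routes if path.startswith(r.prefix)]
    if not matches:
        # Anything not extracted yet still goes to the monolith.
        return default
    return max(matches, key=lambda r: len(r.prefix)).backend


print(pick_backend("/api/auth/login"))  # handled by the auth service
print(pick_backend("/api/grades/42"))   # still handled by the monolith
```

With an empty route list the function forwards everything to the monolith, which is the day-one state described above. Moving a route later means adding one entry to the list.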
EdTech adds constraints that a generic strangler write-up skips, and they are the ones that hurt if you ignore them. The calendar is the big one. You do not run a write-path cutover the week of final exams, full stop. We have built EdTech platforms where a single exam window pushes past 10 million requests a minute, and that is not the moment to be proving a new service in production. The second constraint is the read-heavy shape of the traffic. Students mostly consume (videos, lessons, past papers), so the heavy paths are reads, and reads are far more forgiving to extract than writes because a stale read for a few seconds rarely breaks anything while a lost grade does.
The third EdTech wrinkle is the long-lived session. A student in a two-hour exam is mid-flight the entire time, so any extraction touching exam state has to keep in-progress sessions valid across the cutover. We design these splits so an already-issued session token keeps working against both the old and new path until it expires naturally. You drain the old path, you do not yank it. That single habit (drain, never yank) is what separates a migration students never notice from an outage with a support queue.
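One conservative way to apply the drain rule is to pin every session issued before the cutover to the old path until it expires, and to retire the old path only once no such session is still live. The sketch below assumes that approach. The `Session` fields, `cutover_at`, and both function names are hypothetical, and timestamps are plain numbers to keep the example short.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Session:
    session_id: str
    issued_at: float   # when the token was issued
    expires_at: float  # natural expiry of the token


def path_for_session(session: Session, cutover_at: float, now: float) -> str:
    """Decide which path serves a session during the drain period."""
    if now >= session.expires_at:
        return "expired"
    # Sessions that started before the cutover stay where they started.
    if session.issued_at < cutover_at:
        return "old"
    return "new"


def old_path_drained(sessions: Iterable[Session], cutover_at: float,
                     now: float) -> bool:
    """True once no pre-cutover session is still live on the old path."""
    return not any(
        s.issued_at < cutover_at and now < s.expires_at for s in sessions
    )


exam = Session("exam-1", issued_at=100, expires_at=200)
print(path_for_session(exam, cutover_at=150, now=170))  # old
print(old_path_drained([exam], cutover_at=150, now=170))  # False
```

The old path is removed only after `old_path_drained` returns true, so a student who is mid-exam at the cutover finishes on the path they started on.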

Which service to extract first

Pick the first extraction for what it teaches you, not for how much pain it removes. The first one is a rehearsal. You are building the proxy, the cross-service token validation, a new deploy pipeline, and the monitoring to watch two systems at once. You want to learn all of that on the service with the cleanest boundary and the least chance of corrupting data. For almost every EdTech platform we have worked on, that service is authentication.
Auth wins on three counts. Its boundary is obvious (login, signup, tokens, password reset), and nobody argues about where the line is. Nearly everything else depends on it, which means once it is its own service you have already solved the hard cross-cutting problem of "how does service B trust a user that service A logged in?" And it is overwhelmingly read traffic. Validating a token is a read. That makes auth forgiving to extract, because the scariest failure mode of a migration (writing to the wrong place and losing data) barely applies. We design token validation to add no more than a few milliseconds per request, so students never feel the seam.
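One common way to keep cross-service validation cheap is a signed token that any service can verify locally, with no network call back to the auth service. The article does not say which token scheme is used, so the sketch below is an assumption: an HMAC-signed payload with an expiry, built from the Python standard library. The token layout, the shared secret, and the function names are all hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode("ascii")


def issue_token(user_id: str, secret: bytes, ttl_seconds: int,
                now: Optional[float] = None) -> str:
    """Auth service side: sign a small payload carrying user and expiry."""
    now = time.time() if now is None else now
    payload = json.dumps({"sub": user_id, "exp": int(now) + ttl_seconds}).encode()
    signature = hmac.new(secret, payload, hashlib.sha256).digest()
    return _b64(payload) + "." + _b64(signature)


def validate_token(token: str, secret: bytes,
                   now: Optional[float] = None) -> Optional[str]:
    """Any service side: return the user id, or None if invalid or expired."""
    now = time.time() if now is None else now
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:  # malformed token or bad base64
        return None
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    # Constant-time comparison, and only trust the payload after it passes.
    if not hmac.compare_digest(signature, expected):
        return None
    claims = json.loads(payload)
    if claims["exp"] <= now:
        return None
    return claims["sub"]


token = issue_token("student-1", b"shared-secret", ttl_seconds=60, now=1000)
print(validate_token(token, b"shared-secret", now=1010))  # student-1
```

A shared secret means every validating service could also mint tokens. An asymmetric signature avoids that, at the cost of pulling in a cryptography library, which is why it is left out of this sketch.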
After auth, stop following a template and go where the platform actually bleeds. On a read-heavy EdTech product the next extractions are usually the heavy read paths, in this rough order.
| Order | Service | Why this one |
|---|---|---|
| 1 | Authentication | Clean boundary, everyone depends on it, mostly reads. The rehearsal. |
| 2 | Content / video delivery | The heaviest read path. Students stream lessons all day. Cache-friendly, and offloading it takes the biggest load off the monolith. |
| 3 | Grading / gradebook | The query that takes the whole site down lives here. Isolating it means a slow report no longer starves login. |
| 4 | Notifications / email | Async, fire-and-forget, no user waiting on the response. Easy win once the pattern is proven. |
| 5 | Payments / billing | Last, deliberately. Money is the one place a dual-write bug is unforgivable, so you extract it once your team is fluent. |
Notice billing is last, not first, even though it feels self-contained. That is intentional. The cost of getting a payment migration subtly wrong is a customer charged twice or a subscription that silently lapses, and you do not want to be discovering your dual-write tooling has a bug on the money path. Save the highest-stakes extraction for after you have done four lower-stakes ones and trust the machinery. If you want the deeper version of why the content and gradebook paths fail first under load, we wrote about the exam-day failure modes in the EdTech scaling walls and surviving a login storm.

Dual-write and gradual cutover

Here is the move most teams get backwards. They try to split the database first. Don't. Split the code first and keep one database honest as long as you possibly can. While the new auth service and the old monolith both read and write the same users table, consistency is not a problem you have to solve, because there is exactly one source of truth. You get to extract and ship the service, prove it serves traffic correctly, and defer the genuinely scary part (moving the data) until you have nothing else to worry about.
When you finally do move storage, you do it with dual-write and a verification window, never a single overnight copy. The sequence we use looks like this. First, the service writes to both the old store and the new one, while reads still come only from the old store, so the new store is just shadowing reality with zero risk. Second, a background job runs continuously and compares the two row by row, and it alarms on any mismatch instead of trusting the copy. Third, only once that mismatch count has sat at exactly zero for days (not minutes, days) do you start serving reads from the new store. Fourth, you flip writes. Fifth, after a safe soak, you delete the old write path and the comparison job. Skipping the verification window is how teams discover a foreign-key edge case in production with real student grades on the line.
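The first three steps of that sequence can be sketched with two in-memory dictionaries standing in for the old and new stores. This is a toy model under stated assumptions, not migration tooling: `DualWriteStore` and `find_mismatches` are hypothetical names, and real stores would also need handling for partial write failures, which is omitted here.

```python
from typing import Any, Dict, List, Optional

_MISSING = object()  # distinguishes "row absent" from "row is None"


class DualWriteStore:
    """Writes go to both stores; reads come from one, chosen by a flag."""

    def __init__(self, old: Dict[str, Any], new: Dict[str, Any],
                 read_from_new: bool = False) -> None:
        self.old = old
        self.new = new
        self.read_from_new = read_from_new  # flipped only after verification

    def write(self, key: str, row: Dict[str, Any]) -> None:
        # Old store first: it remains the source of truth during shadowing.
        self.old[key] = dict(row)
        self.new[key] = dict(row)

    def read(self, key: str) -> Optional[Dict[str, Any]]:
        source = self.new if self.read_from_new else self.old
        return source.get(key)


def find_mismatches(old: Dict[str, Any], new: Dict[str, Any]) -> List[str]:
    """Comparison job: keys whose rows differ or exist on only one side."""
    keys = set(old) | set(new)
    return sorted(
        k for k in keys if old.get(k, _MISSING) != new.get(k, _MISSING)
    )


old_db: Dict[str, Any] = {}
new_db: Dict[str, Any] = {}
store = DualWriteStore(old_db, new_db)
store.write("user-1", {"email": "a@example.com"})
print(find_mismatches(old_db, new_db))  # []
```

In this model, `read_from_new` stays false until the comparison has returned an empty list for the whole verification window. A non-empty result is the condition to alarm on.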
The tables that hurt are the ones whose rows reference each other across the new boundary. A grade points at a user, and now grades and users are heading for different services. You have a few honest options, and the table below is roughly how we choose.
| Situation | Approach | Trade-off |
|---|---|---|
| Reference is read-only and rarely changes | Denormalize the needed field onto the new service | Slight duplication, but no cross-service call on the hot path |
| Reference must stay live | Service exposes an API, the other service calls it | A network hop and a new failure mode to handle |
| Hard transactional link (money, enrollment) | Keep both tables in one service until the boundary is proven | Slower decomposition, but you never lose a write |
Now the cutover itself, which is a dial and not a switch. We route a sliver of traffic to the new service first, usually internal accounts plus around one percent of real users, and watch error rate and P99 latency side by side against the monolith for at least a day. If the new path holds, we raise the percentage in steps (5, 25, 50, 100), pausing at each one. And the rollback has to be a config change that takes effect in seconds, because under real load nobody has ten minutes to ship a hotfix. As long as the new service has not yet started owning writes the monolith cannot read, flipping traffic back to the monolith is completely safe. The one genuinely irreversible moment is the write cutover, so we schedule that for a quiet window and, on EdTech, never the night before an exam.
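A percentage dial is often implemented by hashing a stable user identifier into a fixed number of buckets, so the same user always lands on the same side and raising the percentage only ever adds users. The sketch below assumes that approach. `bucket_for`, `route_user`, and the backend labels are hypothetical, and in practice `rollout_percent` would come from runtime configuration rather than a function argument.

```python
import hashlib
from typing import AbstractSet


def bucket_for(user_id: str) -> int:
    """Map a user id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route_user(user_id: str, rollout_percent: int,
               internal_users: AbstractSet[str] = frozenset()) -> str:
    """Pick the backend for a user at the current dial position."""
    if rollout_percent <= 0:
        # Rollback position: everyone, internal accounts included, goes back.
        return "monolith"
    if user_id in internal_users:
        return "new-service"
    if bucket_for(user_id) < rollout_percent:
        return "new-service"
    return "monolith"


for percent in (1, 5, 25, 50, 100):
    users = [f"user-{i}" for i in range(1000)]
    moved = sum(route_user(u, percent) == "new-service" for u in users)
    print(percent, moved)
```

Because the bucket depends only on the user id, a user who was on the new service at 5 percent is still there at 25 percent. Setting the value to zero sends all traffic back to the monolith without a deploy.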

Operating during the migration

The reason we use the strangler approach at all is that it lets the business keep moving. The monolith stays fully alive the whole time, so the product roadmap does not freeze. That only holds, though, if you run the migration with some discipline, because a half-finished decomposition is more complex than either a clean monolith or a clean set of services. The goal is to spend as little time as possible in that messy middle.
We split the team rather than the calendar. Most engineers keep shipping features on the monolith, and a smaller group runs extractions. The single rule that protects everyone is this: never start a second extraction until the first one is fully cut over and its dual-write scaffolding has been deleted. Migrations die when a team has four services half-extracted at once, each with a comparison job running and a foot in both databases, because now every feature change has to reason about which world it lives in. One at a time is slower on paper and far faster in practice.
Observability stops being optional the moment you have two systems serving one product. You need request tracing that follows a call from the proxy into whichever service answered, and dashboards that put the new service's error rate and latency right next to the monolith's so a regression is obvious in seconds. Before we move any real traffic, the new service ships with its alerts already wired, not bolted on after the first incident. Flying a partial migration without this is how a small auth regression turns into an hour of "is it us or is it them?"
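The side-by-side comparison can be reduced to a small check that an alert could run over a recent window of requests. This sketch makes several assumptions: the nearest-rank P99 definition, counting only 5xx responses as errors, and the two slack thresholds are illustrative choices, not values from this playbook.

```python
from typing import Sequence


def p99(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty sample."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def error_rate(status_codes: Sequence[int]) -> float:
    """Share of responses that were server errors (5xx)."""
    if not status_codes:
        raise ValueError("need at least one response")
    return sum(1 for code in status_codes if code >= 500) / len(status_codes)


def is_regression(new_latencies: Sequence[float], old_latencies: Sequence[float],
                  new_codes: Sequence[int], old_codes: Sequence[int],
                  latency_slack: float = 1.2, error_slack: float = 0.01) -> bool:
    """True if the new service is clearly slower or flakier than the monolith."""
    slower = p99(new_latencies) > p99(old_latencies) * latency_slack
    flakier = error_rate(new_codes) > error_rate(old_codes) + error_slack
    return slower or flakier


monolith_ms = list(range(1, 101))
service_ms = [2 * ms for ms in monolith_ms]
print(is_regression(service_ms, monolith_ms, [200] * 100, [200] * 100))  # True
```

Comparing against the monolith over the same window, rather than against a fixed threshold, means a platform-wide slowdown does not get blamed on the new service.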
Two organizational notes, because the hard parts of this are rarely the code. Feature flags and the routing percentage have to be controllable without a deploy, so on-call can dial traffic back at 2am from a config panel instead of a release pipeline. And someone has to own the deadline for ripping out the temporary scaffolding. Dual-write code and comparison jobs are meant to be disposable, but if nobody is accountable for deleting them, they quietly become permanent and you are left maintaining the worst of both worlds forever. We design these splits with a removal date attached from the start.

When to stop extracting services

Microservices are not a finish line you cross. They are a tool, and like any tool they have a cost you keep paying. Every service you carve out is one more deploy pipeline, one more database to back up and patch, one more thing that pages someone at 2am, and one more network hop that can fail. So the right question at every step is not "can we extract this?" but "does extracting this remove more pain than it adds?" The day the answer turns to no, you stop.
Tie the number of services to your team shape, not to an architecture diagram you saw at a conference. A workable rule is roughly one service per team that can own it end to end, including its on-call. For a platform serving hundreds of thousands of daily active users, that usually lands at a handful of well-chosen services, not the fifty-box microservices poster. If a service does not have a clear owning team, you have not built a service, you have built a shared dependency with extra latency.
You are done when the original signals are gone. Deploys are small and routine and nobody dreads them. One teacher's heavy gradebook query can no longer take down login, because they no longer share a process or a connection pool. And the payments team ships without waiting on the content team. That is the whole reason you started. Chasing further splits past that point trades a code problem your team fully understands for a distributed-systems problem (eventual consistency, partial failure, debugging across five hops) that is genuinely harder, and you do it for no remaining benefit.
If you are weighing this decision on a live EdTech platform, the honest version is that the sequencing and the cutover discipline matter more than the destination. We are a product development partner, so this is the kind of work we build alongside your team rather than hand over a spec and walk away, and we have run these migrations on platforms with 250K+ daily active users. If you want a second set of eyes on where your monolith actually hurts, that is exactly the conversation we would rather have early. You can read how we approach a phased platform rebuild in our EdTech rewrite guide, see the scale story in the 250K-user case study, or look at where this fits in our EdTech engineering work. When you are ready to talk specifics, start here.
YK
Written by

CEO and co-founder of Geminate Solutions, a software and product development partner. He has led teams shipping custom web apps, mobile apps, SaaS platforms, and AI products that serve over 250,000 daily active users.

Frequently asked questions

How do we know it is time to break up our EdTech monolith?
Watch behavior, not lines of code. Three signals matter most. People are scared to deploy, so changes pile up into big risky releases instead of small safe ones. A single slow query, often a teacher pulling a giant gradebook, can drag the whole platform down because everything shares one process and one connection pool. And teams block each other, where a payments change cannot ship until a content change is reviewed because they live in the same codebase. If you have all three, the architecture is the bottleneck, not the developers. A clean monolith with none of those problems should be left alone.
Should we do a big-bang rewrite or a gradual migration?
Gradual, almost always. Big-bang rewrites mean freezing the old system, building the new one in parallel for a year or more, and flipping a switch on a date that always slips. The whole time you ship no features, and the cutover is a single terrifying event with no graceful rollback. The strangler-fig approach routes traffic through a thin proxy and extracts one service at a time behind it, so the monolith keeps running and the product keeps shipping. Each extraction is small, reversible, and proven in production before the next one starts. Joel Spolsky famously called the full rewrite the single worst strategic mistake a software company can make, and on a live EdTech platform with exam seasons, we agree.
Which service should we extract from the monolith first?
Authentication, in most cases. Auth has a clean, well-understood boundary, almost every other service depends on it, and it is mostly reads, which makes it forgiving to extract. Pulling it out first forces you to build the shared muscles you need for everything after it: the routing proxy, token validation across services, deploy pipelines, and monitoring. After auth, go where the pain is. On EdTech platforms that is usually the heavy read paths, content and video delivery and the gradebook, because that is what melts during a login storm or an exam window. We design auth token validation to add no more than a few milliseconds to a request so the split is invisible to students.
How do we keep the database consistent while a service is being split out?
Move the code before you move the data. While both the monolith and the new service still read and write the same tables, consistency is free, because there is one source of truth. When you finally split storage, use dual-write with a verification window. The service writes to both the old and new stores, a background job compares them continuously, and you only cut reads over to the new store after the mismatch rate sits at zero for days, not minutes. Tables that cross the new boundary, like a foreign key from grades to users, are the ones that bite. We design verification to alarm on any divergence rather than trusting the migration blindly.
What does the cutover and rollback plan look like?
Cutover is a dial, not a switch. We route a small slice of traffic to the new service first, often internal accounts and one percent of users, watch error rate and latency against the monolith for at least a day, then raise the percentage in steps. Rollback has to be a config change that takes effect in seconds, not a redeploy, because under real load nobody has ten minutes to ship a fix. As long as the new service has not started owning writes that the monolith cannot read, flipping traffic back is safe. The riskiest moment is the write cutover, so we schedule that one for a low-traffic window and never the night before an exam.
Can we keep shipping features while the migration is happening?
Yes, and that is the whole point of doing it gradually. The strangler proxy means the monolith stays fully alive, so product work continues on it the entire time. We typically split the team so that most engineers keep shipping the roadmap and a smaller group runs the extractions. The rule that protects you is to never start a second extraction until the first is fully cut over and the dual-write scaffolding is removed. Migrations that try to extract five services at once stall, because half-finished splits multiply complexity instead of reducing it.
How many microservices should an EdTech platform end up with?
Fewer than most teams expect. Microservices are a means, not a goal, and every service you add is another deploy pipeline, another database, another thing to page someone about at 2am. A good rule is roughly one service per team that can own it end to end, which for a platform serving hundreds of thousands of daily users often means a handful of well-chosen services, not fifty. Stop extracting when the original pain is gone: deploys are safe, one slow query no longer takes down the site, and teams ship without waiting on each other. Splitting past that point trades a code problem you understand for a distributed-systems problem that is harder.