Architecture-level detail on how our multi-tenant EdTech platform scaled from 20,000 to 250,000 daily users over 3 years, with zero downtime through three major migrations.

How an EdTech Platform Scaled 25x Over 3 Years (With Architecture Detail)
2026-04-19 | EdTech, Architecture, Scale

Introduction

This is the engineering version of our signature platform case study. Where the case study covers the business arc, this article covers the architecture decisions in technical depth: what was actually shipped at each phase, what broke, what got fixed, and the patterns that other EdTech teams can apply to their own scaling work. Our multi-tenant EdTech platform serves 250,000+ daily active users today, handles 10 million requests per minute at peak, and powers white-label brands including Your CA Buddy and Youth Pathshala. It started at 20,000 daily users on a monolith. The journey from there to here ran through three distinct architectural phases. None of the phases was a rewrite; all of them were staged migrations. This article documents what each phase actually shipped, in the order the phases happened.

Phase 1: Stabilization (months 1-3)

When we plugged in, the product was not slow on average. The averages looked fine. The problem lived entirely in the peaks. Every class-start morning and every exam window, the same handful of endpoints fell over, the team patched through the night, and the next cycle it happened again. So the first job was not to scale anything. It was to stop the bleeding. We came in alongside the in-house team, not instead of them, and took the scaling work off their plate so they could keep the product moving while we went after the failure modes.
Authentication was the first fire. On an EdTech platform the login is not spread evenly through the day. A whole cohort signs in within the same five minutes when class starts, and that login storm hammered a path that did a synchronous database lookup, a password hash, and a session write on every single request. We rewrote it. Sessions moved to a stateless token validated in memory, the user record got cached so the hot path never touched the primary database, and the password hash work moved off the request thread. The endpoint that used to buckle at the 9am rush started absorbing it. (We wrote up that specific battle in more detail in our piece on the login storm architecture, because it is the single most common wall we see EdTech teams hit.)
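To make the stateless-token idea concrete, here is a minimal sketch, assuming an HMAC-signed payload. The token format, the `SECRET_KEY` value, and the function names are illustrative assumptions, not the format the platform actually uses. The point is only that validation is pure computation: no database read and no session write on the hot path.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

# Hypothetical key for illustration; a real deployment would load it from a secret store.
SECRET_KEY = b"replace-me"


def _encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(user_id: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Sign a small payload once at login so later requests need no session lookup."""
    issued_at = time.time() if now is None else now
    payload = json.dumps({"sub": user_id, "exp": issued_at + ttl_seconds}).encode()
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return _encode(payload) + "." + _encode(signature)


def validate_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the user id if signature and expiry check out, else None. No I/O involved."""
    try:
        payload_part, signature_part = token.split(".")
        payload, signature = _decode(payload_part), _decode(signature_part)
    except ValueError:  # wrong shape or invalid base64
        return None
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):
        return None
    claims = json.loads(payload)
    current = time.time() if now is None else now
    return claims["sub"] if current < claims["exp"] else None
```

The expensive work (the password hash) happens once when the token is issued, so the 9am rush only pays for a signature check per request.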
The second piece was the submission queue. Timed assessments create a brutal pattern. Thousands of students hit submit in the final ninety seconds, and the original design wrote each submission straight to the database in the request, so the table locked and submissions started timing out at the exact moment a student could least afford it. We put a durable queue in front of it. The request now accepts the submission, writes it to the queue, returns instantly, and a pool of workers drains the queue into the database at a rate the database can actually sustain. Nothing is lost, nothing blocks, and a slow database no longer means a failed exam.
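The accept-then-drain shape can be sketched as follows. This is a simplified illustration, not the production design: an in-process `queue.Queue` stands in for the durable queue, a single worker thread stands in for the worker pool, and `write_batch` is a hypothetical callable that represents the database write.

```python
import queue
import threading
import time

submission_queue = queue.Queue()  # in-memory stand-in for a durable queue
_STOP = object()                  # sentinel telling the worker to finish


def accept_submission(student_id, answers):
    """Request path: enqueue and return at once, no database work here."""
    submission_queue.put((student_id, answers))
    return {"status": "accepted"}


def drain(write_batch, batch_size=100, pause_seconds=0.0):
    """Worker path: pull submissions in batches and hand them to the database writer."""
    finished = False
    while not finished:
        batch = []
        item = submission_queue.get()  # block until there is work
        while True:
            if item is _STOP:
                finished = True
                break
            batch.append(item)
            if len(batch) >= batch_size:
                break
            try:
                item = submission_queue.get_nowait()
            except queue.Empty:
                break
        if batch:
            write_batch(batch)
            time.sleep(pause_seconds)  # crude pacing so the database is never flooded


stored = []  # stand-in for the submissions table
worker = threading.Thread(target=drain, args=(stored.extend, 10))
worker.start()
for n in range(25):
    accept_submission(f"student-{n}", "A,C,B")
submission_queue.put(_STOP)
worker.join()
```

The request handler's cost is one enqueue regardless of how slow the writer is, and the batch size and pause are the knobs that set the rate the database sees.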
By the end of the first three months we had not added a feature. We had moved the P99 latency on the worst endpoints from seconds into the low hundreds of milliseconds, and the recurring overnight incidents stopped. That is what stabilization buys you. Room to think. You cannot rearchitect a system that is on fire every Monday, so the unglamorous work of making the peaks survivable always comes first.

Phase 2: Scale (months 3-18)

With the fires out, the constraint shifted. The monolith was no longer crashing, but it had become the thing slowing everyone down. A change to the video module meant redeploying the whole application, including the exam engine and the auth service, which made every release a high-stakes event nobody wanted to ship near an exam window. So we started pulling the monolith apart, carefully, one seam at a time.
We did not rewrite it. That distinction matters more than anything else in this article. A big-bang rewrite of a system students sit exams on is a way to lose a term and a lot of trust. Instead we used the strangler pattern. We picked the highest-pain module, stood up a new service beside the monolith, had it read the same data and shadow the old path for days while we compared outputs, then shifted a sliver of traffic to it behind a feature flag and watched. If the error rate stayed flat, we ramped. If it twitched, we flipped the flag back in seconds. Video delivery, the exam engine, and reporting each came out this way over the year. We go deeper on that approach in our monolith migration writeup.
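The shadow-then-ramp loop described above might look like the sketch below. The class and function names are hypothetical, and it assumes the shadowed path is side-effect free (a read or a pure computation), since the candidate is called in addition to the legacy handler.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Map each user to a stable bucket 0-99 so the same user always gets the same path."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent


class StranglerRouter:
    def __init__(self, legacy, candidate, percent=0):
        self.legacy = legacy
        self.candidate = candidate
        self.percent = percent  # the "flag": set back to 0 to roll back at once
        self.mismatches = []    # shadow-mode differences to review before ramping

    def handle(self, user_id, request):
        if in_rollout(user_id, self.percent):
            return self.candidate(request)
        result = self.legacy(request)
        self._shadow(request, result)
        return result

    def _shadow(self, request, legacy_result):
        """Run the new service beside the old one; never let it affect the response."""
        try:
            shadow_result = self.candidate(request)
        except Exception as exc:
            self.mismatches.append((request, legacy_result, repr(exc)))
            return
        if shadow_result != legacy_result:
            self.mismatches.append((request, legacy_result, shadow_result))


router = StranglerRouter(
    legacy=lambda req: req.upper(),
    candidate=lambda req: "oops" if req == "bug" else req.upper(),
)
```

With `percent=0` every user is served by the legacy path while mismatches accumulate for review; raising `percent` shifts a stable slice of users to the new service, and setting it back to 0 is the rollback.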
Video was its own problem. Serving lecture playback from application servers is a fast way to set money on fire and still buffer during peak. We made it CDN-aware. The video files live in object storage, the CDN caches segments at the edge close to students, and the application only ever issues a signed URL rather than streaming a single byte itself. The design target was a cache-hit rate north of 90 percent at the edge, which is what keeps both the origin load and the bandwidth bill sane when a popular lecture drops and ten thousand students press play inside the same hour.
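As a rough illustration of "issue a signed URL rather than stream bytes", the sketch below signs a path and an expiry with HMAC. Real CDNs each define their own signing scheme, so the query parameter names, `CDN_BASE`, and `SIGNING_KEY` here are assumptions for illustration only.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

CDN_BASE = "https://cdn.example.com"     # hypothetical CDN hostname
SIGNING_KEY = b"replace-with-shared-key"  # hypothetical key shared with the edge


def _signature(path: str, expires: int) -> str:
    message = f"{path}:{expires}".encode()
    return hmac.new(SIGNING_KEY, message, hashlib.sha256).hexdigest()


def sign_url(path: str, ttl_seconds: int, now=None) -> str:
    """Application side: hand the player a short-lived URL and do nothing else."""
    expires = int((time.time() if now is None else now) + ttl_seconds)
    query = urlencode({"expires": expires, "sig": _signature(path, expires)})
    return f"{CDN_BASE}{path}?{query}"


def verify_url(url: str, now=None) -> bool:
    """Edge side: accept only untampered, unexpired URLs."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        provided = params["sig"][0]
    except (KeyError, ValueError):
        return False
    expected = _signature(parsed.path, expires)
    current = time.time() if now is None else now
    return hmac.compare_digest(expected.encode(), provided.encode()) and current < expires


url = sign_url("/lectures/42/seg-001.ts", ttl_seconds=300, now=1000)
```

Because the signature covers the path, a student cannot reuse one lecture's URL to fetch another, and the application servers never sit in the data path.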
Two more changes landed in this phase. Reporting moved to a CQRS split, so the heavy analytical queries that teachers and admins run (cohort progress, score distributions, attendance trends) hit a read-optimized store instead of fighting the transactional database that students depend on during a live class. And we went multi-region, because half the user base was sitting a long way from a single-region deployment and eating that latency on every request. Putting compute and read replicas closer to where students actually are took a chunk of latency off the table that no amount of code tuning could have.
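A toy version of the CQRS split for reporting is sketched below. All names are hypothetical, and the projection is updated synchronously for brevity, whereas a real read store would typically be fed asynchronously and lag the write side slightly.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRecorded:
    cohort: str
    student_id: str
    score: int


class WriteModel:
    """Transactional side: accepts commands and emits events, runs no reports."""

    def __init__(self):
        self.events = []
        self.subscribers = []

    def record_score(self, cohort, student_id, score):
        event = ScoreRecorded(cohort, student_id, score)
        self.events.append(event)
        for notify in self.subscribers:
            notify(event)
        return event


class ReportingReadModel:
    """Read side: keeps a pre-aggregated shape so reports never scan the write store."""

    def __init__(self):
        self._bands = defaultdict(Counter)

    def apply(self, event):
        band = (event.score // 10) * 10  # 72 -> 70, 91 -> 90
        self._bands[event.cohort][band] += 1

    def score_distribution(self, cohort):
        return dict(sorted(self._bands.get(cohort, Counter()).items()))


writes = WriteModel()
reports = ReportingReadModel()
writes.subscribers.append(reports.apply)
for student, score in [("s1", 72), ("s2", 78), ("s3", 91)]:
    writes.record_score("cohort-2026", student, score)
```

A teacher's score-distribution query reads only `reports`, so a heavy report during a live class costs the transactional side nothing.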

Phase 3: Multi-tenant (months 18+)

By month 18 the platform was fast and the team could ship without holding their breath. The business question changed shape. Instead of one brand serving one audience, the goal was to run several brands on one platform, each with its own look, its own domain, its own students, fully isolated from the rest. That is the multi-tenant phase, and it is a different kind of hard. The risk is no longer downtime. It is one tenant ever seeing another tenant's data.
Isolation came first, and we treated it as non-negotiable. Every query in the system carries a tenant identifier, enforced at the data layer rather than trusted from the application, so there is no path where a missing filter quietly leaks one school's students into another's dashboard. We rolled it out tenant by tenant rather than flipping a global switch, validating isolation at each step. If you want the mechanics, we documented the row-level approach and the alternatives in our multi-tenant architecture guide. The short version is that isolation you can audit beats isolation you hope is correct.
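The principle that no query can run without a tenant filter can be illustrated with a small sketch. It uses `sqlite3` from the standard library, which has no row-level security, so the rule is approximated here by making a tenant-bound store object the only route to the table; in a database that supports row-level policies, the same rule would live in the database itself. The table and class names are hypothetical.

```python
import sqlite3


class TenantStore:
    """Every statement is bound to one tenant; callers cannot pass or omit the filter."""

    def __init__(self, conn: sqlite3.Connection, tenant_id: str):
        if not tenant_id:
            raise ValueError("tenant_id is required")
        self._conn = conn
        self._tenant_id = tenant_id

    def add_student(self, name: str) -> None:
        self._conn.execute(
            "INSERT INTO students (tenant_id, name) VALUES (?, ?)",
            (self._tenant_id, name),
        )

    def list_students(self) -> list:
        rows = self._conn.execute(
            "SELECT name FROM students WHERE tenant_id = ? ORDER BY name",
            (self._tenant_id,),
        )
        return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (tenant_id TEXT NOT NULL, name TEXT NOT NULL)")

school_a = TenantStore(conn, "school-a")
school_b = TenantStore(conn, "school-b")
school_a.add_student("Asha")
school_b.add_student("Ravi")
```

Application code never writes a `WHERE tenant_id = ...` clause by hand, which is what makes the isolation auditable: there is one place to review rather than every query in the codebase.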
Branding became configuration, not code. A new white-label brand is now a row in a config table: the logo, the palette, the domain, and the feature toggles are all read at runtime through the same single deployment that serves everyone. No fork per client, no parallel build to keep in sync. That is what made it realistic to launch white-label brands like Your CA Buddy and Youth Pathshala without standing up a separate stack for each one. The same engine, configured differently.
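A brand-as-configuration lookup could be sketched like this. The field names, domains, and values are made up for illustration, and a dictionary stands in for the config table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BrandConfig:
    tenant_id: str
    display_name: str
    logo_url: str
    primary_color: str
    features: frozenset = frozenset()


# Stand-in for the config table; every value here is hypothetical.
BRANDS_BY_DOMAIN = {
    "learn.brand-a.example": BrandConfig(
        "brand-a", "Brand A", "https://cdn.example.com/brand-a/logo.svg",
        "#0B5FFF", frozenset({"live_classes", "mock_exams"}),
    ),
    "study.brand-b.example": BrandConfig(
        "brand-b", "Brand B", "https://cdn.example.com/brand-b/logo.svg",
        "#E4572E", frozenset({"mock_exams"}),
    ),
}


def resolve_brand(host_header: str) -> BrandConfig:
    """Pick the brand from the request's Host header; one deployment serves them all."""
    domain = host_header.split(":")[0].strip().lower()
    brand = BRANDS_BY_DOMAIN.get(domain)
    if brand is None:
        raise LookupError(f"no brand configured for {domain!r}")
    return brand


def feature_enabled(brand: BrandConfig, feature: str) -> bool:
    return feature in brand.features
```

Launching another brand in this model means adding one more entry, with no new build or deployment.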
Self-service onboarding closed the loop. Once a brand is a config row and isolation is enforced at the data layer, a new tenant can be provisioned without an engineer in the loop, which is the only way the model scales past a handful of brands. The platform that started as one monolith for 20,000 users now carries 250,000+ daily active users across multiple isolated brands on a single codebase, and the marginal cost of the next brand is close to zero.

Architecture decisions we got right

Staging every migration behind a flag was the decision that saved the whole arc. Not one of the three phases involved a big cutover where the new system replaced the old one overnight. Every change shadowed the live path first, then took traffic in slices we could pull back in seconds. On a platform people sit exams on, that is the difference between a quiet Tuesday and a public incident, and we would make the same call again without hesitation.
Designing for the peak instead of the average was the second one. EdTech traffic is not a smooth curve. It is a flat line with violent spikes at class-start and exam-submit, and any capacity plan built on daily averages is a plan to fail at exactly the moments that matter most. Sizing the auth path, the submission queue, and the video delivery for the worst ninety seconds rather than the mean is what let the platform absorb 10 million requests a minute on an exam day without drama.
Keeping the data layer boring also paid off for years. We resisted the urge to reach for an exotic database every time a new query shape showed up. Caching the hot path, splitting reads from writes with CQRS, and partitioning large tables got us most of the performance with a fraction of the operational weight, and the on-call team only ever had to know one primary store well. A new database is a new thing to wake up for at 3am, and we kept that list short on purpose.

Architecture decisions we would do differently

We would build for multi-tenancy earlier. We bolted tenant isolation on in phase three because the business need arrived then, but retrofitting a tenant identifier through every query and every cache key in a live system is genuinely painful, and a fair bit of it could have been avoided by carrying the tenant concept from the start even while there was only one tenant. It costs almost nothing to design for on day one and a lot to add later.
We underinvested in observability for too long. In the early stabilization months we were diagnosing peak failures partly by feel, reading logs after the fact, because the metrics we needed were not there yet. We eventually built proper per-endpoint latency histograms and queue-depth dashboards, and the moment we had them the debugging got dramatically faster. If we ran it again, that tooling would come in week one, not month four. You cannot fix a peak you cannot see.
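For a sense of what a per-endpoint latency histogram involves, here is a minimal sketch. The bucket boundaries and names are assumptions; a real setup would normally use a metrics library and dashboarding rather than hand-rolled counters.

```python
import bisect
from collections import defaultdict

# Hypothetical bucket upper bounds in milliseconds.
BUCKETS_MS = [50, 100, 250, 500, 1000, 2500, 5000]


class LatencyHistogram:
    def __init__(self):
        self.counts = [0] * (len(BUCKETS_MS) + 1)  # last slot is the overflow bucket
        self.total = 0

    def observe(self, latency_ms: float) -> None:
        self.counts[bisect.bisect_left(BUCKETS_MS, latency_ms)] += 1
        self.total += 1

    def percentile_upper_bound(self, fraction: float):
        """Smallest bucket bound that covers the given fraction of observations."""
        if self.total == 0:
            return None
        target = fraction * self.total
        running = 0
        for index, count in enumerate(self.counts):
            running += count
            if running >= target:
                return BUCKETS_MS[index] if index < len(BUCKETS_MS) else float("inf")


histograms = defaultdict(LatencyHistogram)


def record(endpoint: str, latency_ms: float) -> None:
    histograms[endpoint].observe(latency_ms)


# 98 fast logins and 2 slow ones: the median looks healthy, the tail does not.
for _ in range(98):
    record("/login", 80)
for _ in range(2):
    record("/login", 3000)
```

Here the median bound is 100 ms while the P99 bound is 5000 ms, which is exactly the "averages looked fine, the peaks did not" situation that per-endpoint tail metrics make visible.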
And we let the monolith decomposition run a little long. Pulling services out with the strangler pattern is safe, but it is also slow, and there is a real temptation to keep extracting past the point of diminishing returns. A couple of modules that we eventually split could have happily stayed inside the monolith. Not every seam is worth a service, and a service you do not need is just more network calls, more deploys, and more surface area to monitor. We would draw that line tighter next time.

Patterns other EdTech teams can apply

The portable part is the sequence. Stabilize the peaks before you touch architecture, decompose with the strangler pattern rather than rewriting, push video to a CDN so your origin and your bandwidth bill survive a popular lecture, and bake tenant isolation in before you think you need it. None of those are exotic. They are the moves that work specifically because EdTech traffic spikes hard and predictably, and they generalize across almost any platform that lives or dies on class-start and exam windows.
What is not portable is the idea that you can copy our exact stack and inherit our results. Your traffic shape, your peak timing, your data model, and your current bottleneck are yours, and the right first move depends entirely on which wall you have actually hit. We have written about several of those walls in detail, the submission queue under exam load and real-time analytics without crushing the transactional database among them, but reading about a pattern and knowing it is the one your platform needs right now are different things.
This is where a build partner earns its place. Geminate Solutions comes in as a product and engineering partner, not staff you rent by the seat, and the first thing we do is find which wall is actually costing you, then fix it in staged migrations that keep your platform live the whole way through. We are Top Rated Plus on Upwork with a 4.9 rating and have shipped 50+ products, including the EdTech platform behind this article. If your system buckles on the same mornings every term, that is the conversation to have. You can read the fuller business arc in the 250K-user platform case study, or see how we approach the work on our EdTech software development and custom development pages, then tell us what is breaking at geminatesolutions.com/get-started and we will tell you honestly what it takes to fix.
YK
Written by

CEO and co-founder of Geminate Solutions, a software and product development partner. He has led teams shipping custom web apps, mobile apps, SaaS platforms, and AI products that serve over 250,000 daily active users.

Frequently asked questions

What does a build partner actually do for an EdTech company that an in-house team cannot?
It is rarely about skill the in-house team lacks. It is about load. Your engineers are flat out keeping the product moving, and the scaling work (a new auth path, a submission queue, a CDN strategy for video) is a separate body of work that would stall the roadmap if they took it on. A build partner picks up that work in parallel, alongside your team rather than instead of it, and brings the pattern memory of having shipped this same shape before. We have run platforms past 250,000 daily active users and exam loads past 10 million requests a minute, so the failure modes are familiar before they happen to you.
When in our growth should we bring in a build partner?
The honest signal is when the same incident keeps recurring on the same days. Class-start mornings or the first exam window of a term, where the system buckles, the team patches it overnight, and it buckles again next cycle. That pattern means the architecture has hit a wall the day-to-day team has no slack to move. Before that, you almost never need outside help. After the third repeat of the same fire, you are paying for it in churn and in burned-out engineers, and a partner pays for itself fast.
Is this staff augmentation or outsourcing?
Neither. We do not rent you a seat-warmer or take a spec over the wall and disappear. We build with your team as a product partner, which means we share the architecture decisions, push back when a request will cost you later, and own outcomes rather than tickets. The work integrates with your repo, your on-call, and your roadmap. When the engagement ends, your own engineers can run and extend everything we shipped, because they were in the room while it was built.
How do you scale an EdTech platform without taking it down?
You stage every change behind the live system instead of rewriting it. New service runs in the background reading the same data, you compare its output against the old path for days, then you shift a small slice of traffic to it behind a flag and watch the error rate. If it holds, you ramp. If it wobbles, you flip the flag back in seconds and nobody outside the team notices. We have moved a platform through three major migrations this way with no big-bang cutover, because the cost of a failed rewrite on a system students depend on during exams is not one any team should accept.
What is the hardest part of EdTech scale specifically?
The traffic is not smooth. It is spiky in a way most SaaS never sees. A class-start Monday at 9am or the opening minute of a timed exam concentrates traffic that would be a comfortable daily load into a few minutes, and everyone hits the same endpoints at once. That login storm and that submission rush are the moments that break things, and they are predictable to the minute, which is both the curse and the gift. You cannot average your way out of it. You have to design the auth path, the submission queue, and the video delivery for the peak, not the mean.
Can Geminate Solutions help scale our existing EdTech platform rather than rebuild it?
Yes, and that is usually the right call. A working platform with real users is an asset, not something to throw away. We come in, find the specific walls (the endpoint that falls over at peak, the query that locks the table, the single-region setup adding latency for half your users) and fix them in staged migrations that keep the product live throughout. Geminate Solutions has shipped 50+ products and is Top Rated Plus on Upwork with a 4.9 rating. Start at geminatesolutions.com/get-started for a free, honest assessment of what your platform actually needs.