Skip to main content
EdTech Platform

BuildingaReal-TimeAnalyticsPipelineforLearningPlatforms

How to build a real-time learning analytics pipeline that scales to millions of events per minute without slowing down the rest of your EdTech platform.

Building a Real-Time Analytics Pipeline for Learning Platforms
|2026-04-19|EdTechArchitectureScale

Introduction

Learning analytics is one of those areas where the value is obvious, knowing which students are struggling, which content is working, which interventions actually move the needle, but the architecture is consistently underestimated. Most EdTech teams start by querying analytics from the same database that runs the platform. That works fine until daily users cross some threshold (usually around 30,000 to 50,000) and analytics queries start locking tables that the live platform needs. Then teachers start complaining that dashboards take 14 seconds to load. Then admin reports time out. Then engineering has to choose between killing the analytics feature or rebuilding the data layer. This article walks through the pipeline pattern we use for real-time learning analytics, event sourcing, dedicated analytics database, materialized views, CQRS, and how to migrate to it without downtime when the inevitable wall arrives.

Why analytics queries kill EdTech platforms

Here is the trap. Your platform writes a row every time a student opens a lesson, submits an answer, finishes a video, or earns a badge. Those writes are tiny and fast. The trouble starts the day a teacher opens a class dashboard and you run SELECT ... GROUP BY student_id across the same table the live app is writing to. On Postgres that read takes a snapshot, walks a few million rows, and competes for the same buffer cache and the same I/O that the live writes need. One teacher does this and nobody notices. Forty teachers do it at 9:05am on a Monday and the platform feels like it's wading through mud.
The specific failure modes are boring and predictable. Sequential scans on tables that grew past what the planner expected. Joins across three or four hot tables (events to enrollments to users to courses) where every table is also being written to right now. Aggregations that hold a read snapshot open long enough that autovacuum can't reclaim dead tuples, so the table bloats, so the next scan is slower, so the snapshot is held even longer. It feeds on itself. We've walked into codebases where a single "engagement report" query was the root cause of a weekly latency spike nobody had connected to analytics at all.
And there's a worse version. Because the analytics live in the same database, a heavy report can hold locks or saturate connections right when enrollment traffic peaks. We wrote a whole separate piece on what happens when a school district logs in at the same minute (see the login storm architecture post), and analytics queries are exactly the kind of background load that turns a survivable spike into an outage. The short version: read traffic and write traffic want opposite things from a database, and analytics is the heaviest read traffic you have.
The fix is not a better index. You can index your way out of one slow query, sure. But the structural problem is that a transactional store optimized for "give me this one student's record now" is the wrong shape for "summarize 4 million events by week." Every team eventually learns this, usually somewhere between 30,000 and 50,000 daily active users, which is right about where the EdTech platform we helped scale to 250K+ DAU started feeling it. The rest of this post is how to separate the two concerns properly so neither one starves the other.

Event sourcing as the foundation

The first move is to stop treating learning activity as state you mutate and start treating it as a log of things that happened. A student didn't "reach 80% progress." They emitted a stream of events (lesson_started, video_watched_to_42s, quiz_answered, lesson_completed) and 80% is something you compute from that stream. This is event sourcing, and you don't need a fancy framework to do it. You need a discipline: every meaningful action writes an immutable, append-only event with a type, an actor, a timestamp, and a small JSON payload. That's it.
Why does this matter so much for analytics? Because an append-only event stream is the cheapest possible thing to write (no row updates, no lock contention on existing records, no read-before-write) and the richest possible thing to read from later. You don't know today every metric a school admin will ask for next year. With raw events captured, you can answer questions you hadn't thought of yet by replaying history. With only mutable summary tables, you can only answer the questions you designed for, and re-deriving the past is impossible because you overwrote it.
A practical schema we reach for is a single wide events table with a few indexed columns and a JSONB body. Something like this, deliberately plain:
ColumnTypeWhy it's there
event_iduuidIdempotency. The same client retry doesn't double-count.
event_typetextThe verb. Indexed, low cardinality, used in every filter.
actor_idbigintUsually the student. Indexed for per-learner rollups.
tenant_idbigintSchool or org. Non-negotiable in multi-tenant EdTech.
occurred_attimestamptzWhen it happened on the client, not when we stored it.
payloadjsonbEverything else. Course id, score, video position, device.
Two details people skip and regret. First, capture both occurred_at (client time, when the student actually did the thing) and received_at (server time). Mobile clients go offline, sync hours later, and if you only have server time your "engagement by hour" chart is a lie. Second, make ingestion idempotent on event_id so a flaky network retry doesn't inflate a student's watch time. We learned both of these the hard way on platforms where parents were looking at the numbers, and a parent who catches your dashboard double-counting their kid's homework never fully trusts it again. Get the capture layer honest and everything downstream gets easier.

The CQRS pattern for read-heavy analytics

Once events are the source of truth, the natural next step is to split writes from reads. That's CQRS, Command Query Responsibility Segregation, and the name sounds heavier than the idea. The idea is just: the thing you write to (the event log) and the thing you read dashboards from (pre-shaped summary tables) are different things, kept in sync by a background process. Teachers never query the raw events. They query a table that already has the answer.
Those read-side tables are materialized views, or plain summary tables you refresh on a schedule, whichever fits. The point is that the expensive aggregation, the GROUP BY over millions of rows, runs once on a timer instead of once per dashboard load. A class engagement view might hold one row per student per day with columns for minutes active, lessons completed, average quiz score, and last-seen. When a teacher opens the dashboard, you're reading a few dozen rows by index, not scanning the event firehose. That's how a dashboard read drops from 14 seconds to single-digit milliseconds. Same data, completely different shape.
Pre-aggregation is where you actually win the cost battle, and it's worth being specific about the layers. We typically keep three. Raw events (kept hot for a window, then rolled off to cheap storage). Per-entity daily rollups (student-day, course-day, class-day), refreshed every few minutes. And top-line counters (active learners right now, submissions today) that update on a tight window. A dashboard composes these. The trend line comes from daily rollups, the live tile comes from the counter, and almost nothing touches raw events on the read path. This is the same backbone we describe in the SaaS analytics dashboard case study, where the design target is a query back in under 200ms on millions of events.
The honest trade-off of CQRS is staleness, and you should name it out loud. A materialized view refreshed every five minutes means the dashboard is, at worst, five minutes behind reality. For "how is my class doing this week," nobody cares. For "is my live quiz working right now," five minutes is forever, and that metric needs a different path (more on cadence in the real-time versus batch section). The mistake we see is teams picking one refresh rate for everything. Match the freshness to the question and the system gets dramatically cheaper, because most questions tolerate far more lag than engineers assume.

Tools we use in production

We'll say the unfashionable thing first. Most EdTech platforms do not need a separate analytics database to start. If you already run Postgres, materialized views plus partitioning plus a Redis cache on the hot tiles will carry you a remarkably long way, and you avoid a second datastore with its own backups, its own on-call, and its own way of breaking at 2am. The cheapest analytics stack is the one your team already knows how to operate. We reach for a dedicated warehouse only when the numbers genuinely demand it.
When they do demand it, here's how we actually choose. The decision is driven by write rate, query shape, and how much operational weight your team can carry, not by what's trending.
ToolReach for it whenThe catch
Plain Postgres + matviewsYou're under roughly 10M events per account and the team already runs Postgres. Default choice.Aggregations get heavy past a point. Partition by month before you feel it.
TimescaleDBTime-series queries dominate and you want to stay inside Postgres. Continuous aggregates are the killer feature for rollups.It's still Postgres underneath, so very wide analytical scans eventually hit the same ceiling.
ClickHouseYou're past tens of millions of events and need column-store scan speed. This is the exam-platform tier, the kind of load that runs into 10M+ requests per minute.A real second system to operate. Updates and deletes are awkward by design. Worth it only at genuine scale.
dbtYour rollup logic has grown into a tangle of SQL nobody can read. dbt versions and tests the transformations.It's a discipline, not a database. It organizes the work, it doesn't make queries faster.
MetabaseAdmins and your own team need ad-hoc exploration without an engineer in the loop.Great internally, looks like Metabase. Not the thing you embed in a teacher-facing product.
A pattern that ages well: Postgres as the event store and system of record, ClickHouse (or Timescale continuous aggregates) only for the heavy analytical reads, and dbt sitting on top to keep the rollup SQL sane and tested. You don't build all of that on day one. You add the analytics database the day a specific query stops fitting in Postgres, and not a day sooner. Premature warehouse adoption is one of the most expensive mistakes we see early teams make, because it doubles your operational surface to solve a problem you don't have yet. When in doubt, a sharper index and a partition buy you another year, and that year is usually worth more than the architecture diagram.

Real-time vs near-real-time vs batch

"Real-time analytics" is the phrase everyone uses and almost nobody needs literally. True streaming, where every single event flows through a pipeline and updates a number the instant it lands, is genuinely useful for a narrow set of metrics and wildly overpriced for the rest. The skill is matching cadence to the question, metric by metric, instead of paying streaming costs across the board. We sort EdTech metrics into three buckets.
Actually real-time (sub-second to a few seconds). This is a small list. "How many students are in this live class right now." "Is anyone currently stuck on this question in a timed quiz." "Is the submission queue backing up during an exam." These drive an action a human takes in the moment, so lag defeats the purpose. We serve these with tight counters, often in Redis, updated on a short batching window rather than a literal per-event stream. A 2 to 5 second window is indistinguishable from instant to a human watching a tile, and it cuts the write amplification dramatically.
Near-real-time (one to fifteen minutes). The big middle. A teacher's class engagement view, an admin's daily-active trend, completion rates, average scores. Nobody refreshing these can tell the difference between "now" and "four minutes ago," and pretending they can is how budgets explode. Materialized views or continuous aggregates on a short refresh cycle cover this bucket beautifully. This is where the bulk of an EdTech dashboard actually lives.
Batch (hourly to nightly). The heavy, sprawling stuff. Cohort retention by signup week, term-over-term outcome comparisons, the data exports a school district pulls for compliance, the model-training feature tables. Run these on a schedule when the system is quiet. They're expensive, they're rarely watched live, and forcing them into a faster cadence buys nothing but a bigger bill. The litmus test we use is one question per metric: what does someone do differently if this number is one minute fresher? If the honest answer is "nothing," it does not need real-time, and most metrics on a learning dashboard, when you actually ask, fall into that camp.

Migrating to the pipeline without downtime

Almost nobody gets to build this on a greenfield. You arrive when the platform is already live, already serving students, and the analytics queries are already on fire. So the real engineering challenge isn't the architecture. It's getting from the old shape to the new one without a maintenance window, because a learning platform mid-term has no good time to go dark. The approach we trust is dual-write plus backfill plus a careful cutover, in that order.
Step one, start writing events alongside the old path. The app keeps doing exactly what it does, updating its current tables, but now it also emits the immutable event for every meaningful action. Nothing reads from the new store yet. This is low-risk because you're only adding writes, and if an event emit fails you log it and move on rather than breaking the user's action. Run this in production for a week or two and watch that the event volume matches reality. You want to catch a missing event type now, not after you've cut over.
Step two, backfill history. Your new event store starts empty, but you have years of data in the old tables. Reconstruct events from that history in batches (by tenant, oldest first, throttled so the backfill doesn't become the very load problem you're solving). It won't be perfect. You can usually rebuild lesson_completed from a progress table, but you can't invent the exact second a video was paused two years ago. That's fine. Backfill the metrics that matter, mark the cutover date, and be honest in the UI that fine-grained history starts there. Then build the materialized views off the combined backfilled-plus-live data and let them catch up.
Step three, shadow, then cut over. Point the dashboards at the new read model but keep the old path running in parallel, and compare. When the new numbers match the old ones (within the rounding you'd expect from event reconstruction) for long enough that you trust it, flip the dashboards over for real and stop the old analytics queries. Decommission the old path on a delay, not immediately, so you have a rollback if something surfaces in week two. We ran exactly this sequence on the EdTech platform that grew past 250K+ daily active users, with no downtime and no frozen dashboards during a school term. If your team is staring down this migration and the roadmap can't absorb it, that's a sensible moment to look at how we've done it before or build it with a development partner who has run the play. We go wider on this in the EdTech scaling walls piece, since analytics tends to be the first wall a growing platform hits.
YK
Written by

CEO and co-founder of Geminate Solutions, a software and product development partner. He has led teams shipping custom web apps, mobile apps, SaaS platforms, and AI products that serve over 250,000 daily active users.

FAQ

Frequently asked questions

Do we actually need a separate analytics database, or can Postgres handle it?
For most EdTech platforms, plain Postgres carries you a long way. Materialized views to pre-aggregate the common metrics, monthly table partitioning once an account passes roughly 10 million events, and a Redis cache on the live tiles will hold up well past where teams expect. You only add a dedicated analytics store like ClickHouse the day a specific query stops fitting in Postgres, not before. Adding a second database early just doubles your on-call surface to solve a problem you do not have yet.
What does a teacher actually need to see live versus once a day?
Very little needs to be live. The genuinely real-time list is short: how many students are in a live class right now, whether anyone is stuck in a timed quiz, whether a submission queue is backing up during an exam. Those drive an action a human takes in the moment. Almost everything else (class engagement, daily-active trends, completion rates, average scores) is fine one to fifteen minutes behind, and the heavy reports like cohort retention belong on a nightly batch. The test we use is one question per metric: what does someone do differently if this number is one minute fresher? If the answer is nothing, it does not need real-time.
Why do analytics queries slow down the rest of the EdTech platform?
Because read traffic and write traffic want opposite things from a database, and analytics is the heaviest read traffic you have. A report that groups across millions of rows holds a read snapshot open, competes with live writes for cache and I/O, and can block autovacuum so the table bloats and the next scan is even slower. One teacher running it is invisible. Forty teachers at 9am on a Monday is a latency spike nobody connected to analytics. The fix is structural separation of the read path from the write path, not a sharper index.
What is event sourcing and why does it matter for learning analytics?
Event sourcing means you stop mutating state and instead append an immutable log of things that happened: lesson_started, quiz_answered, lesson_completed, each with an actor, a tenant, a timestamp, and a small JSON payload. Progress is then computed from that stream rather than stored as a number you overwrite. It matters because an append-only log is the cheapest thing to write and the richest thing to read from later. You can answer questions you had not thought of yet by replaying history. With only mutable summary tables, you can answer only the questions you designed for, and the past is gone the moment you overwrote it.
How do you keep a learning dashboard reading in well under a second at scale?
You never let the dashboard touch raw events on the read path. Materialized views or summary tables pre-aggregate the common metrics on a short refresh cycle, so the expensive group-by runs once on a timer instead of once per page load. The live tiles read from tight counters held in Redis. And once an account passes about 10 million events you partition the tables by month so even a 12-month trend query stays fast. That combination is how a dashboard read drops from double-digit seconds to single-digit milliseconds. The design target we hold is a query back in under 200ms on millions of events.
How do you migrate a live platform to this pipeline without downtime?
Dual-write, then backfill, then a careful cutover. First the app keeps updating its current tables but also starts emitting events, with nothing reading the new store yet, so the change is low-risk. Then you reconstruct history into the event store in throttled batches, oldest tenant first, accepting that fine-grained detail only exists from the cutover date forward. Finally you point dashboards at the new read model while the old path runs in parallel, compare the numbers until you trust them, flip over, and decommission the old queries on a delay so a rollback stays possible. Done this way there is no maintenance window and no frozen dashboards during a school term.
Can Geminate Solutions build a real-time analytics pipeline for our EdTech platform?
Yes. The event-sourcing capture layer, the CQRS read model, the pre-aggregation tiers, and the zero-downtime migration are patterns we build with our team and reuse across products. We ran this exact play on an EdTech platform that grew past 250K+ daily active users. Geminate Solutions is a software and product development partner, not a staffing desk, so we build it with you rather than handing over a contractor. Start at geminatesolutions.com/get-started for a free assessment of where your analytics will hit a wall and what it takes to get ahead of it.
GET STARTED

Ready to build something like this?

Partner with Geminate Solutions to bring your product vision to life with expert engineering and design.

Related Articles