Video delivery architecture for live-class EdTech platforms: CDN strategy, adaptive bitrate, pre-warmed caches, and the patterns that hold at thousands of concurrent students.

The Video Delivery Architecture Every Live-Class EdTech Platform Needs
2026-04-19 | EdTech, Architecture, Scale

Introduction

Live class video is one of the highest-stakes features on any EdTech platform. When it works, students learn synchronously. When it fails, the teacher is on screen, the lesson is mid-flow, and 5,000 students are watching their network indicator turn red. There is no recovering from that: the moment is lost. Most live class platforms get the basics right (some kind of WebRTC, a CDN, adaptive bitrate) and then break the moment they need to scale a single class past about 1,000 viewers. This article covers the architecture patterns we use for live class video at the scale where things actually break: multi-CDN routing, regional SFU clusters, pre-warmed caches before scheduled sessions, and the specific bandwidth optimizations that take egress costs from 'we cannot afford to scale' to 'we can afford to grow'.

Live video architecture choices: WebRTC, SFU, broadcast, hybrid

Before you argue about CDNs and codecs, you have to pick how the video gets from a teacher to the students at all, and that single choice constrains everything downstream. There are really four families to choose from, and the right one depends almost entirely on one question: does anyone need to talk back? Interaction is the dividing line. A lecture where 5,000 students only watch is a fundamentally different system from a seminar where 30 of them unmute and raise hands.
Peer-to-peer WebRTC is the sub-second-latency, everyone-can-talk option, and it is wonderful right up until it is not. In a true mesh every participant sends their stream directly to every other participant, so a call of n people needs n(n-1)/2 connections, a count that grows with the square of n. That math is fine for a 4-person tutorial and catastrophic by about 8 to 10 people, because each laptop is now uploading the same camera feed to every other participant, seven to nine times over, and the weakest uplink in the room sets the ceiling for everyone. We do not use raw mesh for anything that calls itself a class. It is a tutoring-pair pattern, not a classroom one.
An SFU (Selective Forwarding Unit) is the workhorse for interactive teaching. Each participant uploads their stream once to a server, and that server forwards the streams out to everyone else. It keeps WebRTC's sub-second latency and its two-way capability, but it moves the fan-out cost off the students' laptops and onto infrastructure you control and can scale. An SFU comfortably handles classroom-sized interactive sessions, the few dozen to low hundreds where genuine back-and-forth matters. The cost is that you now run and pay for SFU servers, and a busy SFU is CPU-hungry and memory-hungry in ways that surprise teams the first time.
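To put numbers on the mesh-versus-SFU difference described above, here is a minimal sketch in Python. The function names are illustrative only, and the model deliberately ignores simulcast and other real-world refinements.

```python
def mesh_links(n: int) -> int:
    """Distinct peer-to-peer links in a full mesh of n participants."""
    return n * (n - 1) // 2


def uploads_per_participant(n: int, topology: str) -> int:
    """Copies of its own stream that one participant must upload."""
    if n < 2:
        return 0
    if topology == "mesh":
        return n - 1  # one copy per other participant
    if topology == "sfu":
        return 1      # one copy to the server, which fans it out
    raise ValueError(f"unknown topology: {topology}")


for n in (4, 10, 30):
    print(n, mesh_links(n),
          uploads_per_participant(n, "mesh"),
          uploads_per_participant(n, "sfu"))
# 4 6 3 1
# 10 45 9 1
# 30 435 29 1
```

The per-participant upload count is the number that hurts on a student laptop: it grows with class size in a mesh and stays at one behind an SFU.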
Broadcast (HLS or DASH over a CDN) is the opposite trade. You give up real-time interactivity, latency stretches to several seconds, and in exchange you can deliver one teacher to effectively unlimited viewers cheaply, because the CDN caches segments and the cost per extra viewer collapses toward zero. This is how you serve the 5,000-student lecture. The honest pattern for most platforms is hybrid: run a small interactive WebRTC or SFU stage for the teacher and the handful of students actively participating, then transcode that stage into an HLS broadcast for the silent majority who only need to watch. You pay SFU economics for the few who interact and CDN economics for the many who do not, and that split is usually the difference between a model that scales and one that bankrupts you at the first popular class.
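The hybrid split can be expressed as a small routing decision at join time. This is a sketch under assumptions, not a description of any particular product: the `"sfu"` and `"hls"` labels, the notion of a fixed number of free stage slots, and the function name are all hypothetical.

```python
def delivery_path(is_teacher: bool, wants_to_interact: bool, free_stage_slots: int) -> str:
    """Decide which path a joining participant is served from.

    "sfu" is the small interactive stage; "hls" is the CDN broadcast
    transcoded from that stage.
    """
    if is_teacher:
        return "sfu"
    if wants_to_interact and free_stage_slots > 0:
        return "sfu"
    # Watch-only viewers, and would-be speakers when the stage is full,
    # get the cached broadcast.
    return "hls"
```

A student who raises a hand later can be moved from the broadcast onto the stage when a slot frees up; how that promotion works depends on the media stack and is left out here.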

The pre-class join storm

EdTech video has the same cruel traffic shape that breaks EdTech logins. The class is at 4:00 PM. Nobody joins at 3:43 and nobody joins at 4:11. They all arrive inside the same 60 to 90 seconds, because a human said "we start at four" and the calendar enforced it. So your video infrastructure does not need to handle 5,000 concurrent viewers as much as it needs to handle 5,000 sessions establishing themselves at once, and those are very different loads. We wrote a whole companion piece on the auth version of this in the login-storm article, and the video version is its noisier cousin.
Spinning up a media session is expensive in a way that steady playback is not. Each joining client does an ICE negotiation to find a network path, a DTLS handshake to set up encryption, and then the SFU allocates buffers and forwarding state for that participant. Do that 5,000 times in a 90-second window and the SFU's CPU spikes hard at exactly the moment the lesson is supposed to begin. The failure mode is ugly and specific: students see "connecting" spinners, some time out and retry, the retries pile more handshakes onto an already-saturated server, and the teacher is live on camera watching the room fail to fill. The lesson started. The students cannot get in. You cannot rewind that moment.
The fix is to never let the join storm find a cold pool. Because EdTech load is scheduled, you know the spike is coming, and that foreknowledge is a gift most systems do not get. We pre-warm the SFU pool ahead of every scheduled session: read the class timetable, and several minutes before a 4:00 PM start, provision the SFU capacity that class will need so the servers are already running, already warm, and not cold-booting under fire. Tie that to predictive autoscaling driven by the schedule rather than by live CPU, because reactive autoscaling is always a step behind a 90-second wall. By the time CPU-triggered scaling notices the storm and boots a new node, the storm is already over and the damage is done.
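A schedule-driven pre-warm can be as simple as turning each timetable entry into a provisioning time and a node count. The sketch below assumes three made-up planning numbers (viewers per SFU node, a ten-minute lead, 30 percent headroom); none of them come from a real benchmark, so measure your own before relying on them.

```python
from datetime import datetime, timedelta

# Hypothetical planning numbers for illustration only.
VIEWERS_PER_SFU_NODE = 500
PREWARM_LEAD = timedelta(minutes=10)
HEADROOM_PERCENT = 130  # provision 30 percent above the expected crowd


def prewarm_plan(start: datetime, expected_viewers: int) -> tuple[datetime, int]:
    """Return (when to provision, how many SFU nodes) for one scheduled class."""
    needed = expected_viewers * HEADROOM_PERCENT
    # Integer ceiling division avoids floating-point rounding surprises.
    nodes = -(-needed // (100 * VIEWERS_PER_SFU_NODE))
    return start - PREWARM_LEAD, max(nodes, 1)


provision_at, nodes = prewarm_plan(datetime(2026, 4, 20, 16, 0), 5000)
print(provision_at, nodes)  # 2026-04-20 15:50:00 13
```

A scheduler would run this over the day's timetable and hand the results to whatever provisions capacity, so the pool is sized by the calendar and not by live CPU.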
A few more things hold the storm at bay. Stagger where you honestly can, opening the virtual room two or three minutes early so eager students trickle in before the bell instead of stacking onto it. Make the client back off and retry with jitter rather than hammering instantly, so a brief blip does not turn into a synchronized retry stampede. And put a connection-rate limit and a graceful waiting room in front of the SFU, so when the absolute peak does exceed capacity, students see an orderly "joining in a moment" queue instead of a hard error. A waiting room that admits people over fifteen seconds is invisible to a student and the difference between calm and chaos for the server.
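Two of those mitigations are easy to sketch: client retries with jitter, and a waiting room that admits at a fixed rate. The base delay, cap, and admission rate below are assumed values for illustration, and "full jitter" is one common backoff variant, not necessarily the one any given client library uses.

```python
import random


def retry_delay(attempt: int, base: float = 0.5, cap: float = 15.0) -> float:
    """Full-jitter exponential backoff.

    Returns a random delay in [0, min(cap, base * 2**attempt)] seconds,
    so clients that failed together do not retry together.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)


def admission_delay_s(queue_position: int, admits_per_second: int) -> float:
    """Seconds a client at a given queue position waits to be admitted."""
    return queue_position / admits_per_second


# 4,500 students queued behind a room that admits 300 per second
# are all inside within fifteen seconds.
print(admission_delay_s(4500, 300))  # 15.0
```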

Adaptive bitrate done correctly for EdTech

Adaptive bitrate is table stakes, so we will keep the basics short. You encode each video into a ladder of renditions, say 1080p, 720p, 480p, and 360p, then chop every rendition into segments of a few seconds. The player ships a manifest listing the ladder, measures the student's actual throughput, and switches renditions segment by segment. Bandwidth dips, it steps down to a smaller file. Bandwidth recovers, it climbs back up. That is the whole mechanism, and any competent video stack gives it to you. What separates an EdTech setup from a Netflix clone is what you do with the ladder, and here most teams copy the entertainment playbook and get it subtly wrong.
Your audience is not sitting on a smart TV with fibre. It is a student on a shared mobile hotspot in a town where the tower is congested at 4:00 PM, watching the exact same lecture as a student on campus broadband. So the bottom of your ladder matters far more than the top. We design the lowest rung to stay watchable on a genuinely bad connection, a low-resolution video stream that survives a 300 to 500 kbps link, because a student who can see a small fuzzy version of the slides is learning and a student staring at a buffering wheel has left. Pushing 4K to the few who can take it is a vanity feature next to keeping the bottom rung alive for the many who cannot.
Here is the rule that is specific to education and that we treat as non-negotiable: audio quality never degrades. In an entertainment context a dropped video frame is a minor annoyance and a dropped second of audio is forgivable. In a lecture it is the opposite. A student can follow along through a soft, low-res picture of the teacher and the whiteboard. The moment the audio garbles, the explanation is gone, the thread is lost, and nothing on screen recovers it. So when the connection gets tight, the player should sacrifice video resolution aggressively and protect the audio bitrate to the last. We keep the audio at a clean, intelligible rate independent of the video rung, and we would rather show a 240p teacher with crystal-clear words than a sharp picture with a stuttering voice.
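The audio-first rule can be sketched as a rendition picker that reserves the audio bitrate before it looks at the video ladder. The ladder bitrates, the 64 kbps audio rate, and the 80 percent safety factor are assumptions made up for this example; real players use more elaborate throughput estimators.

```python
# Hypothetical ladder: (label, video kbps), highest rung first.
VIDEO_LADDER = [("1080p", 4500), ("720p", 2500), ("480p", 1200),
                ("360p", 600), ("240p", 150)]
AUDIO_KBPS = 64        # reserved first and never reduced
SAFETY_PERCENT = 80    # only plan to use 80 percent of measured throughput


def pick_rendition(throughput_kbps: int) -> tuple[str, int]:
    """Return (video rung, audio kbps) for the measured throughput."""
    budget = throughput_kbps * SAFETY_PERCENT // 100 - AUDIO_KBPS
    for label, video_kbps in VIDEO_LADDER:
        if video_kbps <= budget:
            return label, AUDIO_KBPS
    # Nothing fits: stay on the bottom rung and still keep audio untouched.
    return VIDEO_LADDER[-1][0], AUDIO_KBPS


print(pick_rendition(6000))  # ('1080p', 64)
print(pick_rendition(300))   # ('240p', 64)
```

Whatever the throughput, the second element of the result never changes, which is the whole point: video gives way, audio does not.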
Two more EdTech-specific tweaks. First, prioritize the screen-share or slide track over the talking-head track. Text on a shared slide has to stay sharp to be readable, while the small camera feed of the presenter can tolerate a much lower rate, so when you have separate tracks you give the slides the bandwidth. Second, tune segment length for join speed, not just efficiency. Shorter segments mean a student who joins late or whose connection recovers starts playing sooner, which matters a lot during the join storm from the previous section. The default settings most encoders ship are tuned for movie-length binge watching, and a class is the opposite of that.

Recording and async replay pipeline

On an EdTech platform the recording is often worth more than the live session. A student who missed the 4:00 PM class, or who wants to rewatch the tricky part before an exam, drives a huge share of total watch time, and that viewing happens on the cheap on-demand path instead of the expensive live one. So recording is not a nice-to-have you bolt on later. It is core, and the pipeline behind it is where the real engineering sits. This is also the cleanest place to lean on patterns from on-demand video systems, and it is closely related to the work in our 250,000-user EdTech platform case study.
Record on the server, never in the browser. Client-side recording is a trap we see new teams fall into because it looks easy: the browser already has the media, just capture it there. But a browser recording dies if the teacher's laptop sleeps, drops Wi-Fi, or closes the tab, and you only discover the lecture is gone after the class is over and cannot be repeated. The SFU already has every participant's stream flowing through it, so that is where you record. A dedicated recording worker subscribes to the session like a silent participant and writes the composited stream straight to durable object storage. The teacher's flaky home connection cannot lose the recording, because the recording never depended on it.
Then the work goes asynchronous, and this is the part teams underestimate. The raw recording that lands in storage is a single high-bitrate file, and it is not yet something you can stream adaptively. It has to be transcoded into the full ABR ladder of renditions and segments before it is replay-ready, and transcoding a one-hour lecture is minutes of compute, not seconds. You never make a student wait on that, and you never block the live system on it. The recording lands, a job goes onto a queue, a pool of transcoding workers (managed MediaConvert, or your own FFmpeg fleet at volume) picks it up, produces the renditions, writes them back to storage, and flips the replay to "ready." Decoupling it this way means a backlog of transcoding jobs slows down how fast replays appear, never the live classes happening right now.
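The decoupling described above boils down to three pieces: an event that enqueues a job, a worker that drains the queue, and a per-recording state that flips to "ready". Here is an in-process sketch using only the standard library. In production the queue would be a durable message queue and `transcode` would call a real encoder; both are stubbed here, and every name is hypothetical.

```python
import queue

replay_state = {}             # recording_id -> "processing" | "ready"
transcode_jobs = queue.Queue()


def on_recording_landed(recording_id: str) -> None:
    """Called when a raw recording reaches storage. Never blocks the live path."""
    replay_state[recording_id] = "processing"
    transcode_jobs.put(recording_id)


def transcode(recording_id: str) -> list:
    """Stub for the real transcode step; returns the rendition playlists it would write."""
    return [f"{recording_id}/{rung}.m3u8" for rung in ("720p", "480p", "360p")]


def run_worker_once() -> bool:
    """Process one queued job if there is one. Returns False when the queue is empty."""
    try:
        recording_id = transcode_jobs.get_nowait()
    except queue.Empty:
        return False
    transcode(recording_id)
    replay_state[recording_id] = "ready"
    transcode_jobs.task_done()
    return True
```

A backlog in this design only lengthens the time a recording spends in "processing"; nothing on the live side ever waits on the queue.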
Once the renditions exist, replay is the easy, cheap part, and it is where the economics finally tilt in your favor. The transcoded segments sit in storage behind a CDN exactly like any other on-demand video. The first viewer of a given segment pulls it from origin, the CDN caches it at the edge, and every subsequent viewer in that cohort is served from cache at a fraction of the cost. Because a class's whole cohort tends to rewatch within the same few days, the cache-hit ratio on fresh recordings is excellent, and a popular recording is essentially free to serve after its first few plays. One automatic rule pays for itself many times over: every live session records by default and feeds the on-demand library without anyone clicking a button, so your most valuable, cheapest-to-serve asset accumulates as a side effect of teaching.

Bandwidth optimization for sustainable economics

Here is the uncomfortable truth most EdTech teams meet too late: video is the line item that decides whether the business model works. Storage is cheap, transcoding is a one-time cost per file, and your engineers' salaries are fixed. Egress is none of those things. You pay for every gigabyte the CDN pushes to every student, every time, and that number grows with watch time without any natural ceiling. A one-hour lecture is roughly 0.7 to 1 GB per view at a sensible bitrate. Ten thousand views of one popular lecture is seven to ten terabytes of egress, and you have hundreds of lectures. If your unit economics ignore egress, they are fiction.
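The arithmetic behind those figures is worth keeping in a small script you can rerun with your own numbers. In the sketch below the 2,000 kbps bitrate and the $0.05 per gigabyte rate are assumptions chosen for illustration, not quotes from any provider, and decimal units are used (1 GB = 1,000,000 kilobytes, 1 TB = 1,000 GB).

```python
def gb_per_view(bitrate_kbps: float, minutes: float) -> float:
    """Gigabytes delivered for one full view at a given average bitrate."""
    kilobytes = bitrate_kbps * minutes * 60 / 8
    return kilobytes / 1_000_000


def egress_tb(views: int, gb_each: float) -> float:
    """Total egress in terabytes for a number of full views."""
    return views * gb_each / 1000


def egress_cost_usd(views: int, gb_each: float, usd_per_gb: float) -> float:
    """Egress bill for those views at a flat per-gigabyte rate."""
    return views * gb_each * usd_per_gb


per_view = gb_per_view(2000, 60)   # a one-hour lecture at about 2 Mbps
print(round(per_view, 2))                                 # 0.9
print(round(egress_tb(10_000, per_view), 1))              # 9.0
print(round(egress_cost_usd(10_000, per_view, 0.05), 2))  # 450.0
```

Multiply the last line by the number of lectures in the catalog and by how often each cohort rewatches, and the shape of the bill becomes obvious.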
The first lever, and the biggest, is the CDN cache-hit ratio. Every cache miss is a double cost: the edge has to fetch the segment from your origin storage (origin egress you also pay for) and the student waits longer for it. Education traffic is naturally cache-friendly, because a cohort watches the same content inside the same window, so getting the hit ratio above 90 percent is achievable and below that is usually a misconfiguration. The two mistakes we fix most often are short cache TTLs on segments that are actually immutable, and signed-URL query strings leaking into the cache key so every student's unique URL fragments the cache into thousands of one-view copies. Put the signature in a cookie or strip it from the cache key, set long TTLs on immutable segments, and the ratio jumps.
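The cache-key mistake is easiest to see in code. In practice this normalization is configured in the CDN's cache policy, not written by hand; the sketch below only illustrates the idea, and the parameter names `token`, `expires`, and `signature` are assumptions standing in for whatever the signing scheme uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical names of the per-student signing parameters.
SIGNATURE_PARAMS = {"token", "expires", "signature"}


def cache_key(url: str) -> str:
    """Cache key for a segment URL with the signing parameters stripped."""
    parts = urlsplit(url)
    kept = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in SIGNATURE_PARAMS
    )
    query = urlencode(kept)
    return parts.path + ("?" + query if query else "")


a = cache_key("https://cdn.example.com/lec/42/720p/seg_0007.ts?token=alice&expires=1700000300")
b = cache_key("https://cdn.example.com/lec/42/720p/seg_0007.ts?token=bob&expires=1700000900")
print(a == b, a)  # True /lec/42/720p/seg_0007.ts
```

Two students with different signatures now map to one cached object. The signature still has to be validated at the edge before the cache lookup, otherwise stripping it would remove the protection along with the fragmentation.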
Then there is the egress rate itself, and this is where build-versus-buy gets real. The default public-internet egress rate on a big cloud is a few cents per gigabyte, and at scale that is the most expensive way to ship bytes you can choose. Multi-CDN routing lets you negotiate committed-volume rates across more than one provider and route each request to whichever is cheapest and healthiest for that student's region, which both cuts the per-gigabyte cost and gives you a failover when one CDN has a bad day. Regional SFU clusters do the same job for live: put the media server near the students so streams traverse less long-haul, expensive network, and keep cross-region traffic off the most costly paths. None of this matters at small scale, and all of it matters the moment your egress bill has a comma in it.
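The routing rule "cheapest and healthiest for that student's region" reduces to a small selection function once the health checks and negotiated rates are available. The provider names and prices below are invented for the example, and the list is assumed to hold the options for a single region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CdnOption:
    name: str
    usd_per_gb: float   # negotiated rate for this region (hypothetical)
    healthy: bool       # result of the most recent health check


def pick_cdn(options: list) -> CdnOption:
    """Cheapest healthy CDN; if none is healthy, fall back to the cheapest overall."""
    if not options:
        raise ValueError("no CDN configured for this region")
    healthy = [option for option in options if option.healthy]
    return min(healthy or options, key=lambda option: option.usd_per_gb)


region_options = [
    CdnOption("cdn-a", 0.020, True),
    CdnOption("cdn-b", 0.012, False),  # cheapest, but failing health checks
    CdnOption("cdn-c", 0.015, True),
]
print(pick_cdn(region_options).name)  # cdn-c
```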
The cheapest byte is the one you never send. A few structural choices quietly shrink egress across the whole platform.
| Lever | What it does | Where it bites |
| --- | --- | --- |
| Modern codec (H.265 / AV1) | 20 to 50 percent fewer bytes for the same quality | Costs more transcode compute, needs player support |
| Per-title encoding | A talking-head lecture needs far less bitrate than fast motion | Most teams encode everything at one fixed ladder and overpay |
| High cache-hit ratio | Serves repeat views from the edge, not origin | Killed by bad cache keys and short TTLs |
| Hybrid live (SFU stage plus CDN broadcast) | Pays cheap CDN economics for the watch-only majority | Pure-SFU broadcast to thousands is the classic budget killer |
| Lifecycle storage tiers | Archive old raw masters to cold storage | Forgotten raw files quietly inflate storage quarter over quarter |
The single most common own-goal we see is broadcasting a popular lecture to thousands of viewers through a pure SFU, paying live media-server economics for an audience that is only watching. That one mistake can multiply a class's cost by an order of magnitude. Route the watch-only majority onto the cached CDN path and keep the SFU for the handful who actually interact, and the bill comes back to earth.

Operating live video at scale

Shipping the architecture is half the job. Keeping it healthy through a live 4:00 PM class, every weekday, is the other half, and it lives or dies on whether you are measuring the right things. Most teams watch server CPU and call it monitoring. CPU tells you the box is busy. It does not tell you whether the student in a rural classroom can actually follow the lesson, and that gap is where bad experiences hide from the dashboard.
The telemetry that matters is the telemetry the student feels. Jitter, the variation in packet arrival timing, is what turns a stream choppy even when average bandwidth looks fine, and it is invisible on a throughput graph. Packet loss above a couple of percent is where audio starts to garble, which we already established is the thing that actually loses a class. And glass-to-glass latency, the real delay from the teacher's camera to the student's screen, is the number that decides whether interaction feels natural or like a bad satellite call. We instrument all three per session and watch them as distributions, not averages, because the average looks healthy while a tail of students has an unwatchable experience. The honest health metric for live video is not "is the server up," it is "what fraction of students are getting clean audio and acceptable latency right now," and that has to be a real-time number, not a weekly report.
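Here is a sketch of that student-centred health number and of why averages mislead. The 2 percent packet-loss threshold follows the text above; the jitter and latency thresholds are assumptions picked for the example, and the percentile uses the simple nearest-rank method.

```python
from dataclasses import dataclass


@dataclass
class SessionStats:
    packet_loss_pct: float
    jitter_ms: float
    latency_ms: float   # glass-to-glass


def healthy_fraction(sessions, max_loss_pct=2.0, max_jitter_ms=30.0, max_latency_ms=500.0):
    """Fraction of sessions currently inside all three thresholds."""
    if not sessions:
        return 1.0
    ok = sum(
        1 for s in sessions
        if s.packet_loss_pct <= max_loss_pct
        and s.jitter_ms <= max_jitter_ms
        and s.latency_ms <= max_latency_ms
    )
    return ok / len(sessions)


def percentile(values, pct: int):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, -((-pct * len(ordered)) // 100))  # integer ceiling
    return ordered[rank - 1]


sessions = [SessionStats(0.2, 8.0, 250.0)] * 9 + [SessionStats(6.0, 90.0, 2500.0)]
latencies = [s.latency_ms for s in sessions]
print(sum(latencies) / len(latencies))  # 475.0, the average looks acceptable
print(percentile(latencies, 99))        # 2500.0, the tail does not
print(healthy_fraction(sessions))       # 0.9
```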
Capacity planning here follows the same hard rule as the rest of EdTech: plan for the spike, never the average. Your average concurrent viewer count is a comforting, useless number. The number that breaks you is the synchronized peak when the popular classes all start at the top of the hour. So you size the SFU pool and the egress headroom against that scheduled peak, you pre-warm against the join storm from section two, and you keep a margin above your worst observed minute, because the day a guest lecturer draws double the usual crowd is exactly the day you cannot afford to fall over.
And you find the wall in a test, not in production. Load test the live path before every predictable big event, and shape the test like the real thing. A flat synthetic load of N viewers held steady proves nothing, because the storm is the opposite of flat. Script the actual curve: idle, then a hard ramp that drives your full peak concurrency into a 90-second join window, then a long steady plateau of watching, against a staging environment provisioned like production. Watch jitter, packet loss, glass-to-glass latency, SFU CPU, and egress through the whole shape, and push the test past expected peak on purpose so you meet the failure on a quiet afternoon instead of in front of 5,000 students. Wire a smaller version of that scenario into your regular checks too, because a well-meaning change six months from now can quietly route watch-only traffic back through the SFU and reintroduce the exact cost-and-capacity bomb you defused. If you want a second set of eyes on your video path before the next big cohort, that load test is where we start, and you can tell us where it hurts or read how we approach builds like this on the custom development page.
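The load shape described above (idle, a hard ramp into a 90-second join window, then a long plateau) can be written as a target-concurrency function that a load generator samples once per second. The idle and plateau durations are assumed defaults and the ramp is modelled as linear for simplicity; real join curves are usually front-loaded.

```python
def target_viewers(t_seconds: int, peak: int, idle_s: int = 60,
                   ramp_s: int = 90, plateau_s: int = 1800) -> int:
    """Number of viewers that should be connected t_seconds into the test."""
    if t_seconds < idle_s:
        return 0                                      # quiet before the bell
    if t_seconds < idle_s + ramp_s:
        return peak * (t_seconds - idle_s) // ramp_s  # the join storm
    if t_seconds < idle_s + ramp_s + plateau_s:
        return peak                                   # steady watching
    return 0                                          # class over


for t in (0, 105, 150, 1000):
    print(t, target_viewers(t, peak=5000))
# 0 0
# 105 2500
# 150 5000
# 1000 5000
```

To push past the expected peak on purpose, run the same curve with `peak` set above the largest class on the timetable.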
Written by YK

CEO and co-founder of Geminate Solutions, a software and product development partner. He has led teams shipping custom web apps, mobile apps, SaaS platforms, and AI products that serve over 250,000 daily active users.

Frequently asked questions

What does video actually cost on an EdTech platform?
Almost always more than the team budgeted, because the bill hides in egress. The storage of the raw and transcoded files is cheap, a few cents per gigabyte a month. The transcoding is a one-time compute cost per video. The line that grows without limit is CDN egress, the bytes you pay to push out every single time a student presses play. A one-hour lecture at a sensible bitrate is roughly 0.7 to 1 gigabyte per view, and at public-internet egress rates of a few cents a gigabyte, ten thousand views of that one lecture is real money before you have added a second course. Egress scales with watch time, not with the size of your library, and that is the number most pricing models forget to project.
Should we build our own video pipeline or use a managed service?
For most EdTech teams, start with a managed service and revisit later. Something like AWS MediaConvert plus CloudFront, or Mux, or Cloudflare Stream, gives you transcoding, adaptive bitrate packaging, signed playback, and a global CDN without anyone on your team owning an encoding farm. You pay a premium per minute and per gigabyte for that, but you ship in weeks instead of quarters. The case for building your own pipeline only appears at serious volume, when the managed per-minute and per-gigabyte fees are large enough that an in-house FFmpeg fleet on raw compute, plus a multi-CDN egress deal you negotiate yourself, actually pays back the engineering and on-call cost. That crossover is a spreadsheet question, not a pride question.
How do you stop paying students from sharing or pirating course video?
In layers, because no single control is enough. The baseline is short-lived signed URLs: a playback link that the CDN cryptographically validates and that expires in minutes, so a copied link is dead before it spreads. Above that, token authentication ties playback to a logged-in session, and you can bind it to a device or rotate keys per manifest segment. The strong tier is DRM (Widevine, FairPlay, and PlayReady), which encrypts the content and only hands decryption keys to an approved player, and that is what stops the casual screen-capture-and-download path. Be honest with yourself, though. DRM raises the cost and the support burden, and a determined person can still point a camera at a screen, so the goal is to make casual sharing not worth the effort, not to achieve the impossible.
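The baseline layer, a short-lived signed URL, can be sketched with an HMAC over the path and the expiry time. This is a generic illustration using the standard library, not any specific CDN's signing scheme; each CDN defines its own format, and the secret, parameter names, and five-minute lifetime here are assumptions.

```python
import hashlib
import hmac

SECRET = b"demo-secret-not-for-production"  # hypothetical shared key


def sign(path: str, expires_at: int, secret: bytes = SECRET) -> str:
    """HMAC-SHA256 over the path and expiry, hex encoded."""
    message = f"{path}:{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def signed_url(path: str, now: int, ttl_s: int = 300) -> str:
    """Playback URL that stops validating ttl_s seconds after now."""
    expires_at = now + ttl_s
    return f"{path}?expires={expires_at}&signature={sign(path, expires_at)}"


def is_valid(path: str, expires_at: int, signature: str, now: int) -> bool:
    """Edge-side check: not expired, and the signature matches path and expiry."""
    if now >= expires_at:
        return False
    return hmac.compare_digest(sign(path, expires_at), signature)
```

Because the expiry is part of the signed message, a student cannot extend a link by editing the `expires` value: the signature stops matching.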
What is adaptive bitrate and why does EdTech need it?
Adaptive bitrate streaming (ABR) means you encode each video into several quality renditions and slice each one into segments of a few seconds. The player measures the student's real bandwidth and switches between renditions segment by segment, dropping to a lower quality when the connection dips and climbing back when it recovers. EdTech needs this more than entertainment streaming does, because a school audience spans everything from a fibre connection in one city to a shared mobile hotspot in a rural classroom, all watching the same lecture. Without ABR the student on the weak connection just gets a spinning buffer and gives up. With it they get a slightly softer picture and keep learning, which is the whole point. The one rule we hold is that audio quality never drops, because a fuzzy slide is survivable and a dropped sentence is not.
Live streaming versus pre-recorded on-demand, which should we build first?
They are different systems with different failure modes, so do not assume one stack covers both. On-demand is forgiving. The file already exists, the CDN caches it, and if a request is slow the student waits a second and plays. Live is unforgiving. The lecture is happening once, latency and stalls are visible in real time, and there is no cache to fall back on for content that does not exist yet. On-demand is also far cheaper to operate because cached segments are reused across thousands of viewers. Our usual advice is to ship on-demand first since it carries most of the educational value at a fraction of the operational risk, then add live only when synchronous teaching is genuinely core to the product. Recording every live session into the on-demand library should be automatic from day one either way.
How much CDN cache-hit ratio should an EdTech video platform target?
Aim for a cache-hit ratio above 90 percent on your on-demand catalog, and treat anything below that as a cost leak to investigate. Every cache miss means the CDN edge has to fetch the segment from your origin storage, which adds origin egress on top of CDN egress and slows the first viewer. Education traffic is naturally cache-friendly because a popular lecture is watched by a whole cohort within the same few days, so the segments stay hot at the edge. You protect that ratio with long cache TTLs on immutable video segments, sensible cache-key design so signed-URL query strings do not fragment the cache, and for predictable surges like an exam-prep video before a deadline, pre-warming the edge before the rush. A poor cache-hit ratio is usually a configuration mistake, not a fact of life.
Can Geminate Solutions build the video delivery layer for our EdTech platform?
Yes. Designing the storage and transcoding pipeline, wiring up adaptive bitrate and a CDN, protecting paid content with signed URLs and DRM, and keeping egress costs from running away is exactly the kind of work we do as a product engineering partner. We have built and run an EdTech platform serving 250,000+ daily active users and an exam system handling 10,000,000+ requests a minute, so delivering video at the scale where the cost and the failure modes actually bite is familiar ground. We start by modelling your real watch-time and egress numbers before recommending build-versus-buy, because the right answer there is the whole budget. Start at geminatesolutions.com/get-started for a free architecture review.