Adaptive bitrate done correctly for EdTech
Adaptive bitrate is table stakes, so we will keep the basics short. You encode each video into a ladder of renditions, say 1080p, 720p, 480p, and 360p, then chop every rendition into segments a few seconds long. The player reads a manifest listing the ladder, measures the student's actual throughput, and switches renditions segment by segment. When bandwidth dips, it steps down to a smaller rendition; when bandwidth recovers, it climbs back up. That is the whole mechanism, and any competent video stack gives it to you. What separates an EdTech setup from a Netflix clone is what you do with the ladder, and this is where most teams copy the entertainment playbook and get it subtly wrong.
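To make the mechanism concrete, here is a minimal sketch of throughput-based switching in Python. The ladder bitrates, the EWMA smoothing factor (`alpha`), and the 0.8 safety margin are illustrative assumptions rather than recommended values, and `ThroughputEstimator` and `choose_rendition` are hypothetical names. Production players layer buffer-level logic on top of this, so treat it as the core idea only.

```python
from typing import Optional

# Hypothetical ladder: rendition name -> video bitrate in kbps.
LADDER_KBPS = {"1080p": 4500, "720p": 2500, "480p": 1000, "360p": 600}


class ThroughputEstimator:
    """Smooths per-segment throughput with an exponentially weighted moving average."""

    def __init__(self, alpha: float = 0.3) -> None:
        self.alpha = alpha  # weight given to the newest sample (assumed value)
        self.estimate_kbps: Optional[float] = None

    def add_sample(self, segment_bytes: int, download_seconds: float) -> float:
        sample_kbps = segment_bytes * 8 / 1000 / download_seconds
        if self.estimate_kbps is None:
            self.estimate_kbps = sample_kbps
        else:
            self.estimate_kbps = (
                self.alpha * sample_kbps + (1 - self.alpha) * self.estimate_kbps
            )
        return self.estimate_kbps


def choose_rendition(estimate_kbps: float, ladder=LADDER_KBPS, safety: float = 0.8) -> str:
    """Pick the highest rung whose bitrate fits under a safety margin of the estimate."""
    budget = estimate_kbps * safety
    fitting = [(kbps, name) for name, kbps in ladder.items() if kbps <= budget]
    if not fitting:
        # Nothing fits: fall back to the lowest rung rather than stalling.
        return min(ladder, key=ladder.get)
    return max(fitting)[1]
```

For example, a 500,000-byte segment downloaded in one second gives a 4,000 kbps estimate and selects 720p; a following 100,000-byte segment pulls the smoothed estimate down to about 3,040 kbps, and the player steps down to 480p.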
Your audience is not sitting on a smart TV with fibre. It is a student on a shared mobile hotspot in a town where the tower is congested at 4:00 PM, watching the exact same lecture as a student on campus broadband. So the bottom of your ladder matters far more than the top. We design the lowest rung to stay watchable on a genuinely bad connection: a low-resolution stream that survives a 300 to 500 kbps link. A student who can see a small, fuzzy version of the slides is still learning; a student staring at a buffering wheel has left. Pushing 4K to the few who can take it is a vanity feature next to keeping the bottom rung alive for the many who cannot.
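One way to keep the bottom rung honest is to check it mechanically whenever the ladder changes. The sketch below renders a minimal HLS master playlist and tests whether the lowest rung, video plus audio, fits a 300 kbps link with headroom. The rung bitrates, the 64 kbps audio rate, the 240p bottom rung, and the 0.8 headroom factor are assumptions chosen for illustration; `master_playlist` and `bottom_rung_fits` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rung:
    name: str
    width: int
    height: int
    video_kbps: int


AUDIO_KBPS = 64  # assumption: one audio rate shared by every rung

# Illustrative ladder; the 240p bottom rung is sized for a congested hotspot.
LADDER = [
    Rung("1080p", 1920, 1080, 4500),
    Rung("720p", 1280, 720, 2500),
    Rung("480p", 854, 480, 1000),
    Rung("360p", 640, 360, 600),
    Rung("240p", 426, 240, 150),
]


def master_playlist(ladder, audio_kbps=AUDIO_KBPS):
    """Render a minimal HLS master playlist, one variant per rung."""
    lines = ["#EXTM3U"]
    for r in ladder:
        # BANDWIDTH is in bits per second and must cover video plus audio.
        # Nominal bitrates are used here; in practice it should be the measured peak.
        bandwidth = (r.video_kbps + audio_kbps) * 1000
        lines.append(f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={r.width}x{r.height}")
        lines.append(f"{r.name}.m3u8")
    return "\n".join(lines) + "\n"


def bottom_rung_fits(ladder, link_kbps, audio_kbps=AUDIO_KBPS, headroom=0.8):
    """True if the lowest rung, audio included, fits within a fraction of the link."""
    lowest = min(ladder, key=lambda r: r.video_kbps)
    return lowest.video_kbps + audio_kbps <= link_kbps * headroom
```

With the 360p rung as the floor, the same check fails against a 300 kbps link (664 kbps needed), which is exactly the case the extra 240p rung exists for.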
Here is the rule that is specific to education and that we treat as non-negotiable: audio quality never degrades. In an entertainment context, both failures are minor: a dropped video frame is an annoyance and a dropped second of audio is forgivable. In a lecture they are not equal. A student can follow along through a soft, low-res picture of the teacher and the whiteboard. The moment the audio garbles, the explanation is gone, the thread is lost, and nothing on screen recovers it. So when the connection gets tight, the player should sacrifice video resolution aggressively and protect the audio bitrate to the last. We keep the audio at a clean, intelligible rate independent of the video rung, and we would rather show a 240p teacher with crystal-clear words than a sharp picture with a stuttering voice.
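The audio-first rule amounts to reserving the audio bitrate before any video rung is considered. A minimal sketch, assuming a 64 kbps speech floor, a hypothetical five-rung video ladder, and a 0.8 safety margin. The audio-only fallback when nothing fits is our assumption for illustration; a platform might instead keep the bottom video rung and accept some rebuffering.

```python
AUDIO_FLOOR_KBPS = 64  # assumption: an intelligible speech rate that is never reduced
VIDEO_RUNGS_KBPS = [4500, 2500, 1000, 600, 150]  # hypothetical ladder, highest first


def allocate(throughput_kbps, safety=0.8):
    """Return (audio_kbps, video_kbps); video_kbps == 0 means audio-only."""
    budget = throughput_kbps * safety
    # Audio is reserved before any video rung is considered.
    remaining_for_video = budget - AUDIO_FLOOR_KBPS
    for rung in VIDEO_RUNGS_KBPS:
        if rung <= remaining_for_video:
            return AUDIO_FLOOR_KBPS, rung
    # Not even the bottom video rung fits: keep the audio, drop the picture.
    return AUDIO_FLOOR_KBPS, 0
```

At 4,000 kbps this yields 2,500 kbps video; at 300 kbps it drops to the 150 kbps rung; at 150 kbps it goes audio-only. The audio rate is identical in every case.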
Two more EdTech-specific tweaks. First, prioritize the screen-share or slide track over the talking-head track. Text on a shared slide has to stay sharp to be readable, while the small camera feed of the presenter can tolerate a much lower rate, so when you have separate tracks, give the slides the bandwidth. Second, tune segment length for join speed, not just efficiency. Shorter segments cost some efficiency, since they mean more requests and more frequent keyframes, but a student who joins late or whose connection recovers starts playing sooner, which matters a lot during the join storm from the previous section. The default settings most encoders ship are tuned for movie-length binge watching, and a class is the opposite of that.
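Both tweaks reduce to simple arithmetic. The sketch below assumes a 300 kbps slide floor and a 500 kbps camera cap (both placeholders) and a player that must download one full segment before it starts playing; `split_video_budget` and `first_segment_wait_s` are hypothetical helpers.

```python
def split_video_budget(video_kbps, slide_floor_kbps=300, camera_cap_kbps=500):
    """Split a video budget between the slide track and the talking-head track.

    Slides get their floor first, the camera gets what is left up to a cap,
    and any surplus above the cap goes back to the slides.
    """
    slides = min(video_kbps, slide_floor_kbps)
    camera = min(video_kbps - slides, camera_cap_kbps)
    slides = video_kbps - camera  # surplus beyond the camera cap returns to the slides
    return slides, camera


def first_segment_wait_s(segment_seconds, rung_kbps, link_kbps):
    """Seconds to download one full segment: a rough lower bound on time to first
    frame for a player that buffers at least one segment before starting."""
    return segment_seconds * rung_kbps / link_kbps
```

On an 800 kbps hotspot, a 600 kbps rung cut into 6-second segments needs about 4.5 seconds just to fetch the first segment; 2-second segments bring that down to 1.5 seconds. For live HLS there is a second effect: clients are advised not to start closer than three target durations from the live edge, so segment length also sets a floor on how far behind live a joining student sits.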
Recording and async replay pipeline
On an EdTech platform the recording is often worth more than the live session. A student who missed the 4:00 PM class, or who wants to rewatch the tricky part before an exam, drives a huge share of total watch time, and that watch happens on the cheap on-demand path instead of the expensive live one. So recording is not a nice-to-have you bolt on later. It is core, and the pipeline behind it is where the real engineering sits. This is also the cleanest place to lean on patterns from on-demand video systems, and it is closely related to the work in our 250,000-user EdTech platform case study.
Record on the server, never in the browser. Client-side recording is a trap we see new teams fall into because it looks easy: the browser already has the media, just capture it there. But a browser recording dies if the teacher's laptop sleeps, drops Wi-Fi, or closes the tab, and you only discover the lecture is gone after the class is over and irreplaceable. The SFU already has every participant's stream flowing through it, so that is where you record. A dedicated recording worker subscribes to the session like a silent participant and writes the composited stream straight to durable object storage. The teacher's flaky home connection cannot lose the recording, because the recording never depended on it.
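A minimal sketch of the recording worker's core loop, assuming the SFU exposes the composited session as an iterable of encoded media chunks and that the object store accepts numbered parts (similar in spirit to multipart uploads). `record_session`, `PartStore`, `InMemoryStore`, and `put_part` are hypothetical names standing in for whatever your SFU and storage SDK actually provide. The point it illustrates is that parts are flushed to durable storage while the class runs, so nothing depends on the teacher's device and a worker crash loses only the unflushed tail.

```python
from typing import Dict, Iterable, Protocol, Tuple


class PartStore(Protocol):
    """Hypothetical durable store that accepts numbered parts of one object."""

    def put_part(self, key: str, part_number: int, data: bytes) -> None: ...


class InMemoryStore:
    """Stand-in for object storage, for local testing only."""

    def __init__(self) -> None:
        self.parts: Dict[str, Dict[int, bytes]] = {}

    def put_part(self, key: str, part_number: int, data: bytes) -> None:
        self.parts.setdefault(key, {})[part_number] = data


def record_session(session_id: str, media_chunks: Iterable[bytes], store: PartStore,
                   part_size: int = 5 * 1024 * 1024) -> Tuple[str, int]:
    """Consume the SFU's composited stream and flush fixed-size parts as they fill."""
    key = f"recordings/{session_id}/raw"
    buffer = bytearray()
    part_number = 0
    for chunk in media_chunks:
        buffer.extend(chunk)
        while len(buffer) >= part_size:
            part_number += 1
            store.put_part(key, part_number, bytes(buffer[:part_size]))
            del buffer[:part_size]
    if buffer:  # final short part when the session ends
        part_number += 1
        store.put_part(key, part_number, bytes(buffer))
    return key, part_number
```

The 5 MiB default mirrors the minimum part size S3 enforces for multipart uploads (every part except the last); adjust it to your store's rules.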
Then the work goes asynchronous, and this is the part teams underestimate. The raw recording that lands in storage is a single high-bitrate file, and it is not yet something you can stream adaptively. It has to be transcoded into the full ABR ladder of renditions and segments before it is replay-ready, and transcoding a one-hour lecture is minutes of compute, not seconds. You never make a student wait on that, and you never block the live system on it. The recording lands, a job goes onto a queue, a pool of transcoding workers (managed MediaConvert, or your own FFmpeg fleet at volume) picks it up, produces the renditions, writes them back to storage, and flips the replay to "ready." Decoupling it this way means a backlog of transcoding jobs slows down how fast replays appear, never the live classes happening right now.
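Here is a minimal sketch of the decoupled transcode path in Python, assuming an in-process `queue.Queue` and threads as stand-ins for a real message queue and worker fleet. The FFmpeg options (`-hls_time`, `-hls_playlist_type vod`, `-hls_segment_filename`) are real HLS muxer flags, but the rung list, the 4-second segment length, and the `TranscodePipeline` class are illustrative assumptions. It also assumes the output directory already exists and leaves out the upload back to object storage; a production ladder would additionally pin keyframe intervals so segment boundaries line up across rungs.

```python
import os
import queue
import subprocess
import threading

# Illustrative ladder: (name, output height, video bitrate).
RUNGS = [("720p", 720, "2500k"), ("480p", 480, "1000k"), ("240p", 240, "150k")]


def ffmpeg_commands(src, out_dir, segment_seconds=4):
    """One FFmpeg invocation per rung, each writing an HLS VOD playlist plus segments."""
    cmds = []
    for name, height, video_bitrate in RUNGS:
        cmds.append([
            "ffmpeg", "-i", src,
            "-vf", f"scale=-2:{height}",
            "-c:v", "libx264", "-b:v", video_bitrate,
            "-c:a", "aac", "-b:a", "64k",  # same audio rate on every rung
            "-f", "hls", "-hls_time", str(segment_seconds),
            "-hls_playlist_type", "vod",
            "-hls_segment_filename", os.path.join(out_dir, f"{name}_%04d.ts"),
            os.path.join(out_dir, f"{name}.m3u8"),
        ])
    return cmds


class TranscodePipeline:
    """Queue-backed worker pool; a backlog delays replays, never live sessions."""

    def __init__(self, run=lambda cmd: subprocess.run(cmd, check=True)):
        self.jobs = queue.Queue()
        self.status = {}
        self.lock = threading.Lock()
        self.run = run
        self.threads = []

    def start(self, workers=2):
        for _ in range(workers):
            t = threading.Thread(target=self._work, daemon=True)
            t.start()
            self.threads.append(t)

    def enqueue(self, recording_id, src):
        with self.lock:
            self.status[recording_id] = "processing"
        self.jobs.put((recording_id, src))

    def _work(self):
        while True:
            job = self.jobs.get()
            if job is None:  # shutdown sentinel
                self.jobs.task_done()
                return
            recording_id, src = job
            try:
                for cmd in ffmpeg_commands(src, os.path.join("renditions", recording_id)):
                    self.run(cmd)
                result = "ready"  # flip to ready only after every rung exists
            except Exception:
                result = "failed"
            with self.lock:
                self.status[recording_id] = result
            self.jobs.task_done()

    def shutdown(self):
        for _ in self.threads:
            self.jobs.put(None)
        for t in self.threads:
            t.join()
```

Because `enqueue` only marks the recording as processing and returns, the end-of-session hook never waits on FFmpeg, and a growing queue only delays when replays turn ready.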
Once the renditions exist, replay is the easy, cheap part, and it is where the economics finally tilt in your favor. The transcoded segments sit in storage behind a CDN exactly like any other on-demand video. The first request for a given segment at each edge location goes to origin, the CDN caches it there, and every later viewer served by that edge gets it from cache at a fraction of the cost. Because a class's whole cohort tends to rewatch within the same few days, the cache-hit ratio on fresh recordings is excellent, and a popular recording costs little beyond CDN delivery after its first few plays. One automatic rule pays for itself many times over: every live session records by default and feeds the on-demand library without anyone clicking a button, so your most valuable, cheapest-to-serve asset accumulates as a side effect of teaching.
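The cache economics can be estimated with a back-of-the-envelope model. This sketch assumes each edge location misses exactly once per segment, nothing is evicted, and there is no intermediate cache tier; real CDNs deviate on all three, so treat the result as an optimistic first approximation. The function names and example numbers are hypothetical.

```python
def edges_with_plays(plays_per_edge):
    """Number of edge locations that served at least one play."""
    return sum(1 for plays in plays_per_edge if plays > 0)


def cache_hit_ratio(plays_per_edge):
    """Fraction of segment requests served from edge cache.

    Assumes each edge misses once per segment (its first play) and never evicts.
    """
    total = sum(plays_per_edge)
    if total == 0:
        return 0.0
    return 1 - edges_with_plays(plays_per_edge) / total


def origin_egress_mb(segment_count, segment_mb, plays_per_edge):
    """Data pulled from origin for one recording under the same assumptions."""
    return segment_count * segment_mb * edges_with_plays(plays_per_edge)
```

A one-hour lecture cut into 900 four-second segments of about 0.5 MB each (roughly a 1 Mbps rung) and played 200 times across three edge locations pulls 1,350 MB from origin while the edges deliver 90,000 MB, a 98.5% hit ratio.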
Operating live video at scale
Shipping the architecture is half the job. Keeping it healthy through a live 4:00 PM class, every weekday, is the other half, and it lives or dies on whether you are measuring the right things. Most teams watch server CPU and call it monitoring. CPU tells you the box is busy. It does not tell you whether the student in a rural classroom can actually follow the lesson, and that gap is where bad experiences hide from the dashboard.
The telemetry that matters is the telemetry the student feels. Jitter, the variation in packet arrival timing, is what turns a stream choppy even when average bandwidth looks fine, and it is invisible on a throughput graph. Packet loss above a couple of percent is where audio starts to garble, which we already established is the thing that actually loses a class. And glass-to-glass latency, the real delay from the teacher's camera to the student's screen, is the number that decides whether interaction feels natural or like a bad satellite call. We instrument all three per session and watch them as distributions, not averages, because the average looks healthy while a tail of students has an unwatchable experience. The honest health metric for live video is not "is the server up," it is "what fraction of students are getting clean audio and acceptable latency right now," and that has to be a real-time number, not a weekly report.
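A minimal sketch of per-session telemetry in Python. The jitter estimator follows the interarrival-jitter formula from RFC 3550, using seconds instead of RTP timestamp units. The 2% loss limit comes from the "couple of percent" above; the 500 ms glass-to-glass target is a placeholder, and the session record shape (`loss`, `latency_ms`) is an assumption.

```python
import statistics

LOSS_LIMIT = 0.02        # from the text: audio degrades above a couple of percent loss
LATENCY_LIMIT_MS = 500   # placeholder glass-to-glass target; set your own


class JitterEstimator:
    """Interarrival jitter as defined in RFC 3550: J += (|D| - J) / 16.

    Times are in seconds here; RTP itself uses media-clock timestamp units.
    """

    def __init__(self):
        self.jitter = 0.0
        self._prev = None  # (send_time, arrival_time) of the previous packet

    def on_packet(self, send_time, arrival_time):
        if self._prev is not None:
            prev_send, prev_arrival = self._prev
            # D = difference in transit time between consecutive packets.
            d = (arrival_time - prev_arrival) - (send_time - prev_send)
            self.jitter += (abs(d) - self.jitter) / 16
        self._prev = (send_time, arrival_time)
        return self.jitter


def p95(values):
    """95th percentile: a tail metric instead of an average."""
    return statistics.quantiles(values, n=20)[-1]


def healthy_fraction(sessions):
    """Share of sessions with clean audio (low loss) and acceptable latency."""
    if not sessions:
        return 1.0  # no sessions, nothing unhealthy to report
    ok = sum(
        1 for s in sessions
        if s["loss"] < LOSS_LIMIT and s["latency_ms"] <= LATENCY_LIMIT_MS
    )
    return ok / len(sessions)
```

With 90 sessions at 150 ms and 10 at 2,000 ms, the mean latency is 335 ms, comfortably under the placeholder target, while the p95 is 2,000 ms: the tail of students on an unwatchable stream that the average hides.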
Capacity planning here follows the same hard rule as the rest of EdTech: plan for the spike, never the average. Your average concurrent viewer count is a comforting, useless number. The number that breaks you is the synchronized peak when the popular classes all start at the top of the hour. So you size the SFU pool and the egress headroom against that scheduled peak, you pre-warm against the join storm from section two, and you keep a margin above your worst observed minute, because the day a guest lecturer draws double the usual crowd is exactly the day you cannot afford to fall over.
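The sizing rule fits in a few lines. In this sketch, `viewers_per_node` is an assumption you would take from your own load tests, the 30% margin is a placeholder, and all example numbers are hypothetical. Integer ceiling division keeps float rounding out of node counts.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding float rounding."""
    return -(-a // b)


def plan_capacity(scheduled_peak, worst_observed_minute, viewers_per_node,
                  avg_egress_kbps, margin_pct=30):
    """Size against the larger of the scheduled peak and the worst minute seen."""
    design_peak = ceil_div(max(scheduled_peak, worst_observed_minute) * (100 + margin_pct), 100)
    nodes = ceil_div(design_peak, viewers_per_node)
    egress_gbps = design_peak * avg_egress_kbps / 1_000_000
    return design_peak, nodes, egress_gbps
```

With a 5,000-viewer scheduled peak, a 5,600-viewer worst minute, 400 viewers per node, and 1,200 kbps average egress, this plans for 7,280 viewers on 19 nodes and about 8.7 Gbps of egress.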
And you find the wall in a test, not in production.
Load test the live path before every predictable big event, and shape the test like the real thing. A flat synthetic load of N viewers held steady proves nothing, because the storm is the opposite of flat. Script the actual curve: idle, then a hard ramp that drives your full peak concurrency into a 90-second join window, then a long steady plateau of watching, against a staging environment provisioned like production. Watch jitter, packet loss, glass-to-glass latency, SFU CPU, and egress through the whole shape, and push the test past expected peak on purpose so you meet the failure on a quiet afternoon instead of in front of 5,000 students. Wire a smaller version of that scenario into your regular checks too, because a well-meaning change six months from now can quietly route watch-only traffic back through the SFU and reintroduce the exact cost-and-capacity bomb you defused. If you want a second set of eyes on your video path before the next big cohort, that load test is where we start, and you can tell us where it hurts or read how we approach builds like this on the custom development page.
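The join-storm load shape described above can be sketched as a target-concurrency schedule that a load generator follows. The 60-second idle period, 10-second step, 30-minute plateau, and 1.25 overshoot factor are assumptions; this version folds the deliberate push past expected peak into the ramp itself, where you might instead ramp to peak and step higher later. `join_storm_profile` is a hypothetical helper, not part of any load-testing tool.

```python
def join_storm_profile(expected_peak, idle_s=60, ramp_s=90, plateau_s=1800,
                       overshoot=1.25, step_s=10):
    """Return (seconds_from_start, target_concurrent_viewers) pairs.

    Shape: idle, then a linear ramp that reaches the overshoot target by the
    end of the join window, then a flat plateau at that target.
    """
    target = round(expected_peak * overshoot)
    schedule = []
    for t in range(0, idle_s + ramp_s + plateau_s, step_s):
        if t < idle_s:
            users = 0
        elif t < idle_s + ramp_s:
            # Linear ramp that hits the target on the last step of the window.
            users = min(round(target * (t - idle_s + step_s) / ramp_s), target)
        else:
            users = target
        schedule.append((t, users))
    return schedule
```

For a 5,000-viewer class this ramps from zero to 6,250 concurrent viewers between 60 and 150 seconds, then holds there for half an hour while you watch the telemetry.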