What actually breaks during the storm
Let's trace one login through a typical setup, then multiply it by a few hundred a second. A student submits credentials. The app server checks the password, then it writes a new session row to the database, then it reads the user's roles and course enrollments, then it updates a last-seen timestamp on the user record. That is one write, a couple of reads, and a second write, all on the primary, all inside one request. On a quiet afternoon nobody notices. During the storm, this is the chain that detonates.
The first thing to go is the session-state write. Every login inserts a session row, and many setups also update a column on the users table in the same request. When hundreds of these land per second, the database starts queuing writes behind row locks and, on some schemas, hits brief contention on hot indexes. Writes that took two milliseconds now wait behind a queue of other writes. The connection pool, which is a small fixed number of slots, fills with requests all blocked on the same locks, and once the pool is exhausted every new request waits for a connection before it can even start.
That is primary-DB lock contention, and it is the heart of the storm. In a typical setup the primary is the only writable node by design, because writes have to be consistent. You cannot just add more primaries the way you add app servers. So as the lock queue grows, latency on the auth path climbs from milliseconds into seconds, requests start hitting their timeouts, and the app servers begin returning 500s for logins that would have succeeded if they had only waited a little longer. Students retry, and every retry is another full login attempt, which adds load and deepens the queue. The retry storm on top of the login storm is what turns a slow minute into a full outage.
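The pool exhaustion described above can be estimated with back-of-the-envelope arithmetic. The sketch below uses Little's law (average connections held = arrival rate times the time each request holds a connection). The pool size of 20 and the rate of 300 logins per second are assumptions chosen for illustration, not measurements from any real system.

```python
def connections_needed(requests_per_sec: float, hold_time_ms: float) -> float:
    """Average connections held at once (Little's law: rate x time held)."""
    return requests_per_sec * hold_time_ms / 1000.0


POOL_SIZE = 20        # assumption: a small fixed pool per the text
LOGINS_PER_SEC = 300  # assumption: "a few hundred a second"

if __name__ == "__main__":
    # As lock waits stretch each write, the same login rate needs more slots.
    for hold_ms in (2, 50, 200, 1000):
        needed = connections_needed(LOGINS_PER_SEC, hold_ms)
        status = "ok" if needed <= POOL_SIZE else "pool exhausted"
        print(f"{hold_ms:>5} ms held -> {needed:6.1f} connections ({status})")
```

At two milliseconds per write, the same traffic occupies less than one connection on average. Once lock waits push the hold time to a few hundred milliseconds, the demand is several times the pool, with no change in the login rate at all.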
Then comes the downstream cascade, and this is the part that surprises people. Auth is not isolated. The same primary database serving session writes is also serving the dashboard query that loads a student's classes, the API that fetches today's assignments, the gradebook read. When auth saturates the connection pool and the primary's CPU, every one of those unrelated features slows down or fails too, because they are all queuing behind the same exhausted resource. A login problem becomes a whole-platform problem. The student who logged in at 7:58 and is sitting on the dashboard suddenly cannot load their first lesson, even though their own request had nothing to do with logging in. One saturated primary takes the building down with it.
Implementation pattern: 5-step rollout
You do not rebuild the auth path of a live platform with a flag day and a held breath. Auth is the front door, and if the cutover goes wrong every single user is locked out at once. So the whole rollout is built around one rule: every step is reversible, and the old path keeps working until the new one has earned the traffic. Here is the sequence we run.
Step 1, build the token path beside the old one. Add token issuance and in-memory validation as a parallel code path, behind a feature flag that is off for everyone. Do not delete the session logic. At this point nobody is using the new path in anger. You are just shipping the machinery and unit-testing the signing, the expiry, the refresh flow, and the denylist.
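As a rough illustration of what the Step 1 machinery might look like, here is a minimal sketch of signed token issuance and in-memory validation using only the Python standard library. The token format, the names `issue_token`, `validate_token`, `SECRET_KEY`, and `DENYLIST`, and the 15-minute default lifetime are all assumptions for the example. A production system would more likely use an established token library and a shared denylist store, and the refresh flow is not shown.

```python
import base64
import hashlib
import hmac
import json
import secrets
import time
from typing import Optional

SECRET_KEY = b"demo-only-secret"  # assumption: real keys live in a secret store
DENYLIST = set()                  # assumption: revoked token IDs, kept in memory


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode("ascii")


def _sign(payload_b64: str) -> bytes:
    mac = hmac.new(SECRET_KEY, payload_b64.encode("utf-8"), hashlib.sha256)
    return _b64(mac.digest()).encode("ascii")


def issue_token(user_id: str, ttl_seconds: int = 900,
                now: Optional[float] = None) -> str:
    """Return 'payload.signature' with an expiry and a unique token ID."""
    now = time.time() if now is None else now
    payload = {"sub": user_id, "exp": now + ttl_seconds,
               "jti": secrets.token_hex(8)}
    payload_b64 = _b64(json.dumps(payload).encode("utf-8"))
    return payload_b64 + "." + _sign(payload_b64).decode("ascii")


def validate_token(token: str, now: Optional[float] = None) -> Optional[dict]:
    """Check signature, expiry, and denylist without touching the database."""
    now = time.time() if now is None else now
    payload_b64, sep, signature = token.partition(".")
    # Verify the signature before trusting or decoding any of the payload.
    if not sep or not hmac.compare_digest(signature.encode("utf-8"),
                                          _sign(payload_b64)):
        return None
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    if payload["exp"] <= now or payload["jti"] in DENYLIST:
        return None
    return payload
```

Passing `now` explicitly keeps the expiry logic easy to unit-test, which is the main job of this step. Revoking a token in this sketch means adding its `jti` to `DENYLIST`.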
Step 2, dual-validate. Teach the app servers to accept either a valid session or a valid token on incoming requests. This is the keystone of a zero-downtime migration. With dual-validation live, a user holding an old session and a user holding a new token are both authenticated, so you can move people across the boundary without ever forcing a logout. Set up your read replicas in this step too, and point permission lookups at them.
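Dual-validation can be sketched as a single authentication function that accepts either credential. In the example below, `validate_session` and `validate_token` are hypothetical stand-ins backed by dictionaries so the block runs on its own. In a real system they would be the existing session lookup and the new token check.

```python
from typing import Optional

# Hypothetical stand-ins for the two real validators.
ACTIVE_SESSIONS = {"sess-abc": "student-1"}  # old path: session ID -> user
VALID_TOKENS = {"tok-xyz": "student-2"}      # new path: token -> user


def validate_session(session_id: Optional[str]) -> Optional[str]:
    return ACTIVE_SESSIONS.get(session_id) if session_id else None


def validate_token(token: Optional[str]) -> Optional[str]:
    return VALID_TOKENS.get(token) if token else None


def authenticate(session_id: Optional[str],
                 token: Optional[str]) -> Optional[str]:
    """Accept either credential; return the user ID or None."""
    # Try the token first, since it needs no database round trip.
    user = validate_token(token)
    if user is not None:
        return user
    # Fall back to the old session path so nobody is forced to log out.
    return validate_session(session_id)
```

Checking the token first is a design choice in this sketch, not a requirement: it keeps migrated users off the session table entirely, while users still holding old sessions pay the same cost they always did.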
Step 3, ramp issuance behind a flag. Start issuing tokens instead of sessions to a small slice of logins, 1 percent, then 5, then 25. Pick the slice by user ID so it is sticky, and pick your timing deliberately. Roll out into the quiet windows the school calendar hands you, evenings and weekends, never the minute before first period. Watch P99 auth latency and error rate on the new cohort against the old one at every increase. If anything looks wrong, the flag goes back to zero and the old session path catches everyone instantly.
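One common way to make the slice sticky is to hash the user ID into a fixed bucket from 0 to 99 and compare it against the current rollout percentage. The sketch below assumes string user IDs. The function names are hypothetical, and the percentage would normally come from whatever feature-flag system is already in place.

```python
import hashlib


def rollout_bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def should_issue_token(user_id: str, rollout_percent: int) -> bool:
    """True if this user's login should get a token instead of a session."""
    return rollout_bucket(user_id) < rollout_percent


if __name__ == "__main__":
    users = [f"student-{i}" for i in range(10_000)]
    for percent in (1, 5, 25):
        cohort = sum(should_issue_token(u, percent) for u in users)
        print(f"{percent:>3}% flag -> {cohort} of {len(users)} users")
```

Because the bucket depends only on the user ID, a user who is in the cohort at 5 percent is still in it at 25 percent, and setting the flag to zero sends every login back to the session path.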
Step 4, full cutover with a safety window. Once 100 percent of new logins issue tokens, leave dual-validation running. Old sessions are still out there in browsers, and you let them expire naturally rather than invalidating them and logging a building full of students out mid-lesson. Keep both paths alive until the longest old session has aged out, typically a day or two depending on your session lifetime.
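The length of the safety window follows directly from the session lifetime. The sketch below assumes sessions have a fixed maximum lifetime measured from issuance. If sessions use sliding expiry that renews on activity, this arithmetic does not hold and the old path needs a hard cap first. The function names, the one-hour margin, and the dates in the example are illustrative assumptions.

```python
from datetime import datetime, timedelta


def earliest_decommission(cutover_at: datetime,
                          max_session_lifetime: timedelta,
                          margin: timedelta = timedelta(hours=1)) -> datetime:
    """Earliest time at which no session issued before cutover can be valid.

    Assumes a fixed maximum lifetime from issuance (no sliding renewal).
    """
    return cutover_at + max_session_lifetime + margin


def old_sessions_aged_out(now: datetime, cutover_at: datetime,
                          max_session_lifetime: timedelta) -> bool:
    return now >= earliest_decommission(cutover_at, max_session_lifetime)


if __name__ == "__main__":
    cutover = datetime(2024, 1, 5, 18, 0)  # illustrative: a Friday evening
    lifetime = timedelta(hours=48)         # illustrative session lifetime
    print(earliest_decommission(cutover, lifetime))
```

This only answers when the old sessions are gone. It says nothing about whether the new path has held up under real load, which is a separate condition to check before retiring anything.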
Step 5, decommission and harden. When the last old session has expired and the metrics have held through a real Monday morning, retire the session-validation code and the session table writes. Now you tighten the new path: confirm token lifetimes are short, the refresh flow is solid, the revocation denylist works, and the read replicas are carrying the permission load. The migration is done when the primary database barely notices the 8:00 AM spike, and you proved that with traffic, not with a diagram.