What direct database writes actually break
Picture the naive path, because almost every platform ships it first and there is nothing shameful about that. A student clicks submit. The request handler validates the payload, opens a transaction, writes a row to the submissions table, maybe writes a few more rows for attachments and an audit trail, commits, and only then returns a 200. The student's browser sits blocked for that whole round trip. At ten requests a second this is invisible. At ten thousand requests a second arriving inside the same ninety seconds, every piece of it turns into a failure mode.
Lock contention is usually the first to go. A submission write is not a lonely insert. It bumps a per-assignment submission counter on the assignment row, and it touches the student's enrollment record. Under a flood, thousands of transactions all reach for the same hot rows, queue behind each other on row and index locks, and transaction time climbs from single-digit milliseconds into seconds. Connections do not free up fast enough, so the connection pool drains, and then requests start failing not because the database is full but because they cannot get a connection to it in the first place.
The next thing to saturate is IO, and it shows up in two places at once. Every insert into a hot table also writes to its indexes, and a write-heavy table under a flood produces a lot of index churn and page splits that the storage layer has to flush. Meanwhile the file uploads, the actual PDFs and code archives and images, are hammering object storage or a disk volume on a parallel path. We have seen the database look healthy on CPU while the underlying volume's IOPS were pinned and write latency had quietly gone through the roof. The query planner is fine. The disk is the bottleneck, and disks do not scale with a config flag.
Here is the failure that does the real damage, though, and it is a subtle one. When the synchronous path slows down, students see spinners. Spinners make people click submit again. Now you have retries piling onto an already saturated primary, which slows it further, which produces more spinners, which produces more retries. That feedback loop is how a database that was at eighty percent at the start of the minute is fully wedged by the end of it. The flood does not just overwhelm the system. The system's own error handling helps the flood overwhelm it faster.
The queue-based pattern
The fix is one idea, applied ruthlessly: accept fast, process slow. Split the submission into two jobs that used to be one. The first job, accepting it, has to be cheap and durable and has to happen during the flood. The second job, processing it, can be expensive and can happen whenever the consumers get to it, because the student does not need to wait for it. A durable queue sits between the two and absorbs the entire shape of the spike.
At the API, the request handler becomes a producer and does almost nothing. It validates that the request is well-formed and the student is allowed to submit, it makes sure the uploaded file is safely in object storage (a direct-to-storage presigned upload keeps the bytes off your app servers entirely), and then it writes one small message to the queue. That message is just a reference: who submitted, for which assignment, where the file lives, and an idempotency key. The broker acknowledges, the API returns a receipt, and the whole interaction is over in tens of milliseconds. No grading, no plagiarism check, no fat transaction on the primary. The accept path stays flat no matter how steep the flood gets, because it is doing a fixed, tiny amount of work per request.
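The accept path described above can be sketched in a few lines. This is a minimal illustration, not a production handler: the `SubmissionMessage` fields and the `accept_submission` function are hypothetical names, and a standard-library `Queue` stands in for a durable broker whose publish call would block until the message is acknowledged.

```python
import json
from dataclasses import dataclass, asdict
from queue import Queue


@dataclass(frozen=True)
class SubmissionMessage:
    """A reference to the submission, not the submission itself."""
    student_id: str
    assignment_id: str
    file_key: str          # where the upload already lives in object storage
    idempotency_key: str   # generated by the client, reused on every retry


def accept_submission(broker: Queue, student_id: str, assignment_id: str,
                      file_key: str, idempotency_key: str) -> dict:
    """Validate, enqueue one small message, and return a receipt."""
    if not all([student_id, assignment_id, file_key, idempotency_key]):
        raise ValueError("malformed submission")

    message = SubmissionMessage(student_id, assignment_id, file_key, idempotency_key)

    # Stand-in for a durable publish. With a real broker, return the receipt
    # only after the broker confirms the message is stored.
    broker.put(json.dumps(asdict(message)))

    return {"status": "accepted", "receipt": idempotency_key}
```

Note what is absent: no grading, no database transaction, no file bytes. The work per request is fixed, which is why the accept latency stays flat.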
The durable queue in the middle is the part that actually saves you. Durable is the word that matters: once the broker acknowledges a message, it is written to disk and replicated, so a power cut, a deploy, or every consumer crashing at once cannot lose it. Which broker depends on what you already run. Amazon SQS is the path of least operational drag if you are on AWS and want a managed queue with a built-in dead-letter queue. Kafka or NATS JetStream make sense when you want a durable, replayable log and you have the appetite to run it. We tend to start teams on the managed option, because the deadline is stressful enough without also learning to operate a new piece of stateful infrastructure under fire.
On the other side, a pool of consumers reads from the queue and does the heavy lifting at a rate the database can actually sustain. This is the quiet genius of the pattern, sometimes called load leveling: the queue might fill with forty thousand messages in two minutes, but the consumers drain it at a steady, safe rate, say a few hundred a second, and the database never sees the spike at all. It sees a smooth, predictable write rate it was sized for. If the backlog grows you add consumers, and because grading is independent per submission, throughput scales out roughly linearly until the database's own write capacity becomes the limit. The flood becomes a backlog, and a backlog is a thing you can wait on calmly instead of an outage you firefight.
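The load-leveling arithmetic is worth doing once by hand. The sketch below assumes arrivals are spread evenly across the burst and that consumers drain at a constant rate. The function name `level_load` and the drain rate of 200 per second are illustrative assumptions, chosen as one reading of "a few hundred a second".

```python
def level_load(total_messages: int, burst_seconds: float,
               drain_per_second: float) -> tuple[float, float]:
    """Return (peak backlog, seconds from burst start until the queue is empty).

    Assumes a uniform arrival rate during the burst and a constant drain rate.
    """
    arrival_rate = total_messages / burst_seconds
    if drain_per_second >= arrival_rate:
        # Consumers keep up, so no backlog forms and work ends with the burst.
        return 0.0, float(burst_seconds)

    # The backlog grows for the whole burst and peaks when arrivals stop.
    peak_backlog = total_messages - drain_per_second * burst_seconds
    # Consumers were busy from the first second, so the total time is just
    # total work divided by drain rate.
    return float(peak_backlog), total_messages / drain_per_second


peak, finished_at = level_load(40_000, 120, 200)
print(peak, finished_at)  # 16000.0 200.0
```

Under those assumptions the backlog peaks at 16,000 messages and clears 200 seconds after the burst began, 80 seconds after it ended. The database only ever saw 200 writes a second.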
| Property | Synchronous write | Queue-based pipeline |
|---|---|---|
| Accept latency at peak | Climbs into seconds, then errors | Flat, under 200ms at P99 |
| What the DB sees | The full spike, all at once | A smooth, leveled write rate |
| A lost submission | Possible whenever the path fails | Prevented once the broker durably acks |
| Recovery from a crash | In-flight writes are gone | Messages replay from the queue |
| Scaling lever | Bigger primary (a ceiling) | More consumers (horizontal) |
Guaranteed delivery and exactly-once semantics
A queue is only a real safety net if you wire the acknowledgements correctly, and this is where the subtle bugs live. The guarantee runs in two hops. First, the producer must not tell the student success until the broker has confirmed the message is durably stored. Return the receipt a beat too early and you have promised something you cannot keep. Second, a consumer must not acknowledge a message until it has finished the work and committed it. If you ack on receipt and then crash mid-grade, the broker thinks the job is done and the submission silently vanishes. So the rule is blunt: ack only after the work is durably committed, never before.
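The consumer half of that rule can be shown with a toy broker. Everything here is a hypothetical stand-in: `InMemoryBroker` mimics only the receive, ack, and redeliver behavior of a real queue, and `redeliver_unacked` plays the role of a visibility timeout expiring. The point is the ordering inside `consume_one`.

```python
class InMemoryBroker:
    """Toy broker: a delivered message stays in flight until it is acked."""

    def __init__(self):
        self._pending = {}    # message id -> body, waiting for delivery
        self._in_flight = {}  # delivered to a consumer, not yet acked
        self._next_id = 0

    def publish(self, body):
        self._next_id += 1
        self._pending[self._next_id] = body
        return self._next_id

    def receive(self):
        if not self._pending:
            return None
        msg_id = next(iter(self._pending))
        self._in_flight[msg_id] = self._pending.pop(msg_id)
        return msg_id, self._in_flight[msg_id]

    def ack(self, msg_id):
        self._in_flight.pop(msg_id, None)

    def redeliver_unacked(self):
        # Stand-in for a timeout: unacked messages become deliverable again.
        self._pending.update(self._in_flight)
        self._in_flight.clear()

    def outstanding(self):
        return len(self._pending) + len(self._in_flight)


def consume_one(broker, commit):
    """Process one message, acking only after commit() has returned."""
    delivery = broker.receive()
    if delivery is None:
        return False
    msg_id, body = delivery
    commit(body)        # if this raises, the ack below is never reached
    broker.ack(msg_id)  # ack last, once the work is durably committed
    return True
```

If `commit` raises, the message is still held by the broker and comes back after redelivery. Swap the last two lines of `consume_one` and the same crash would lose the submission.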
That rule has a price, and the price is duplicates. If a consumer does the work but dies in the half-second before it can ack, the broker times out and redelivers the message to another consumer, which does the work again. This is at-least-once delivery, and it is the honest default of every durable queue. People chase true exactly-once delivery and mostly find it is either a myth or a performance tax not worth paying. The move that actually works is to accept at-least-once delivery and make your processing idempotent, so doing the work twice produces the same result as doing it once.
Idempotency keys are how you get there, and they have to start at the client. The browser generates one unique id per submission attempt and sends it with the request and every retry of that request. The server stores that key the first time it commits the work, inside the same transaction as the write, so the two succeed or fail together. When a duplicate arrives, whether from a student mashing submit on a bad connection or from the broker redelivering after a crash, the server sees the key already exists and returns the original result instead of creating a second copy. One submission, one row, no matter how many times the message is delivered. Skip this and a flood quietly fills your database with triplicate exams that someone has to untangle by hand later.
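One way to get "key and write succeed or fail together" is to make the key a unique column on the submission row itself, so a single insert carries both. The sketch below uses an in-memory SQLite database purely for illustration. The table layout and the `record_submission` function are assumptions, not a prescribed schema.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE submissions (
            id              INTEGER PRIMARY KEY,
            idempotency_key TEXT NOT NULL UNIQUE,
            student_id      TEXT NOT NULL,
            assignment_id   TEXT NOT NULL,
            file_key        TEXT NOT NULL
        )
        """
    )
    return conn


def record_submission(conn: sqlite3.Connection, idempotency_key: str,
                      student_id: str, assignment_id: str, file_key: str) -> int:
    """Insert the submission once; return the same row id for any duplicate."""
    try:
        with conn:  # one transaction: commit on success, roll back on error
            cur = conn.execute(
                "INSERT INTO submissions "
                "(idempotency_key, student_id, assignment_id, file_key) "
                "VALUES (?, ?, ?, ?)",
                (idempotency_key, student_id, assignment_id, file_key),
            )
            return cur.lastrowid
    except sqlite3.IntegrityError:
        # The key already exists, so this is a retry or a redelivery.
        row = conn.execute(
            "SELECT id FROM submissions WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        return row[0]
```

Letting the unique constraint reject the duplicate, rather than checking first and inserting second, avoids a race between two consumers handling the same redelivered message.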
The last piece is the dead-letter queue, and it is the difference between a resilient system and one that loops forever. Some messages genuinely cannot be processed: a corrupt file, a student who was un-enrolled mid-flight, a bug that throws on one specific payload. You retry these a few times with backoff, because most failures are transient and clear on their own. But a message that fails after, say, five tries does not deserve a sixth that blocks the pipeline behind it. It goes to a dead-letter queue, a holding pen where it is preserved, not dropped, and an alert fires for a human to look. The crucial property is that nothing is ever silently lost. Every submission ends in exactly one of three states: processed, retrying, or parked in the dead-letter queue with someone notified. There is no fourth state called gone.
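The retry-then-park policy can be sketched as a small loop. This is an in-process illustration under stated assumptions: managed brokers typically count redeliveries and move the message for you, whereas here `process_with_retries`, the `dead_letters` list, and the doubling delay are hypothetical stand-ins. The limit of five attempts follows the example in the text.

```python
import time


def process_with_retries(message, handler, dead_letters,
                         max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Try handler(message) up to max_attempts times, then dead-letter it.

    Returns "processed" or "dead-lettered". The message is never dropped.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return "processed"
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                # Exponential backoff: 0.5s, 1s, 2s, 4s with the defaults.
                sleep(base_delay * 2 ** (attempt - 1))

    # Out of attempts: preserve the message and the reason for a human.
    dead_letters.append({
        "message": message,
        "error": repr(last_error),
        "attempts": max_attempts,
    })
    return "dead-lettered"
```

A real deployment would also fire an alert when something lands in `dead_letters`. The `sleep` parameter exists so the delays can be replaced in tests.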