Dual-write and gradual cutover
Here is the move most teams get backwards. They try to split the database first. Don't. Split the code first and keep one database honest as long as you possibly can. While the new auth service and the old monolith both read and write the same users table, consistency is not a problem you have to solve, because there is exactly one source of truth. You get to extract and ship the service, prove it serves traffic correctly, and defer the genuinely scary part (moving the data) until you have nothing else to worry about.
When you finally do move storage, you do it with dual-write and a verification window, never a single overnight copy. The sequence we use looks like this. First, the service writes to both the old store and the new one, while reads still come only from the old store, so the new store only shadows reality and nothing a user sees depends on it yet. Second, a background job runs continuously, compares the two stores row by row, and alarms on any mismatch rather than trusting the copy. Third, only once that mismatch count has sat at exactly zero for days (not minutes, days) do you start serving reads from the new store. Fourth, you flip writes, so the new store becomes the authoritative one while the old store still receives a copy. Fifth, after a safe soak, you delete the old write path and the comparison job. Skipping the verification window is how teams discover a foreign-key edge case in production with real student grades on the line.
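To make the sequence concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the names `Phase`, `DualWriteStore`, and `find_mismatches` are hypothetical, and two in-memory dicts stand in for the old and new stores. Because both writes always succeed in this toy, it does not model the write flip (step four) as a separate phase, and a real comparison job would run continuously against live databases rather than once over two dicts.

```python
from dataclasses import dataclass, field
from enum import Enum

_MISSING = object()  # sentinel so "row absent" never compares equal to a stored None


class Phase(Enum):
    SHADOW = 1     # steps 1-2: write both stores, read from the old one
    READ_NEW = 2   # step 3: still write both, read from the new one
    NEW_ONLY = 3   # step 5: old write path removed


@dataclass
class DualWriteStore:
    """Hypothetical wrapper; dicts stand in for the two real stores."""
    old: dict = field(default_factory=dict)
    new: dict = field(default_factory=dict)
    phase: Phase = Phase.SHADOW

    def write(self, key, row):
        # Copy the row so the two stores never share one mutable object.
        if self.phase is not Phase.NEW_ONLY:
            self.old[key] = dict(row)
        self.new[key] = dict(row)

    def read(self, key):
        source = self.old if self.phase is Phase.SHADOW else self.new
        return source.get(key)


def find_mismatches(old, new):
    """Return the keys whose rows differ or that exist in only one store."""
    keys = old.keys() | new.keys()
    return sorted(k for k in keys if old.get(k, _MISSING) != new.get(k, _MISSING))
```

The comparison job would alarm whenever `find_mismatches` returns anything, and the phase would only advance from `SHADOW` to `READ_NEW` after that list has stayed empty for the whole verification window.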
The tables that hurt are the ones whose rows reference each other across the new boundary. A grade points at a user, and now grades and users are heading for different services. You have a few honest options, and the table below is roughly how we choose.
| Situation | Approach | Trade-off |
|---|---|---|
| Reference is read-only and rarely changes | Denormalize the needed field onto the new service | Slight duplication, but no cross-service call on the hot path |
| Reference must stay live | Service exposes an API, the other service calls it | A network hop and a new failure mode to handle |
| Hard transactional link (money, enrollment) | Keep both tables in one service until the boundary is proven | Slower decomposition, but you never lose a write |
Now the cutover itself, which is a dial and not a switch. We route a sliver of traffic to the new service first, usually internal accounts plus around one percent of real users, and watch error rate and P99 latency side by side against the monolith for at least a day. If the new path holds, we raise the percentage in steps (5, 25, 50, 100), pausing at each one. And the rollback has to be a config change that takes effect in seconds, because under real load nobody has ten minutes to ship a hotfix. As long as the new service has not yet started owning writes the monolith cannot read, flipping traffic back to the monolith is completely safe. The one genuinely irreversible moment is the write cutover, so we schedule that for a quiet window and, in EdTech, never the night before an exam.
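One way to build that dial is to hash each user into a stable bucket and compare it against a single configured percentage. The sketch below is illustrative only: `bucket`, `use_new_service`, `ROLLOUT_STEPS`, and the `internal_ids` parameter are hypothetical names, and in practice the percentage would come from whatever config system can push a change in seconds.

```python
import hashlib

# Percentages from the text: about 1% first, then 5, 25, 50, 100.
ROLLOUT_STEPS = (1, 5, 25, 50, 100)


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in 0..99, identical on every host."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def use_new_service(user_id: str, percent: int, internal_ids=frozenset()) -> bool:
    """Decide whether this request goes to the new service.

    `percent` is the only dial: setting it to 0 is the rollback and
    sends everyone, internal accounts included, back to the monolith.
    """
    if percent <= 0:
        return False
    if user_id in internal_ids:
        return True  # internal accounts go first at any non-zero setting
    return bucket(user_id) < percent
```

Because a user's bucket never changes, anyone routed to the new service at 5 percent stays there at 25, which keeps the side-by-side error and latency comparison stable from one step to the next.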