Zero-Downtime Data Migration: How It Actually Works
Every migration vendor markets "zero downtime." Most mean "your users won't notice the cutover happened on Saturday at 3 AM," which is not the same thing. This is what zero-downtime migration actually means architecturally, the patterns that make it work, and the failure modes that derail it.
What "Zero Downtime" Actually Means
Three distinct definitions get conflated:
- Off-hours cutover. Migration happens at 3 AM on a holiday. Users sleep through it. Technically there is downtime, but no one experiences it. This is the bargain-basement version.
- Read-only window. Source goes read-only for an hour while final delta is loaded. Users can't write but can still log in and read. Acceptable for B2B; unacceptable for consumer products with active sessions.
- True zero downtime. Both source and destination accept writes throughout the migration. Cutover is a single atomic switch of which system is authoritative. Users experience nothing.
This article is about the third definition. It's harder, more expensive, and worth it for systems where downtime is unacceptable.
The Core Architecture: Dual-Write + CDC
Zero-downtime migration relies on two patterns running together: dual-write at the application layer, and Change Data Capture (CDC) at the data layer.
Dual-write
The application is modified to write to both source and destination simultaneously. Reads still come from the source (it's authoritative). Writes go to both. The application is responsible for maintaining consistency across the two systems during the migration window.
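A minimal sketch of the pattern in Python, assuming a hypothetical repository-style client for each system (the `source`, `destination`, and `failure_log` objects and the record shape are all illustrative, not a real library):

```python
class DualWriteRepository:
    """Writes go to both systems; reads stay on the source, which remains
    authoritative until cutover. `source` and `destination` are hypothetical
    clients exposing the same save/get interface."""

    def __init__(self, source, destination, failure_log):
        self.source = source
        self.destination = destination
        self.failure_log = failure_log  # replayed later by a repair job

    def save(self, record):
        # The source write is allowed to fail the request: it is authoritative.
        self.source.save(record)
        try:
            self.destination.save(record)
        except Exception as exc:
            # Never fail the user's request on a destination error during
            # migration; log the miss so reconciliation can repair it.
            self.failure_log.append({"record_id": record["id"], "error": str(exc)})

    def get(self, record_id):
        return self.source.get(record_id)  # reads stay on the source
```

The asymmetry is the point of the design: a source failure aborts the request, a destination failure never does, because the source stays authoritative until cutover.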
This pattern works when you control the application code. It does not work when the source system is the application (Salesforce, NetSuite, HubSpot). Those systems' UIs write directly to their own databases, and you can't intercept those writes.
Change Data Capture (CDC)
CDC reads the source system's transaction log (Postgres WAL, MySQL binlog, Salesforce Change Data Capture API, NetSuite Saved Search webhooks) and replicates every change to the destination. The replication runs continuously, with sub-second latency in well-tuned setups.
This is the only viable pattern when the source is a SaaS system, because you can't modify its application code. CDC happens at the data layer using the source system's published change feed.
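As one concrete instance, here is a sketch of a CDC consumer for a Postgres source using psycopg2's logical replication support. It assumes a replication slot backed by the wal2json output plugin already exists; `apply_to_destination` is a hypothetical stand-in for the load side:

```python
import psycopg2
import psycopg2.extras

# Connect with a replication-capable connection. The DSN is illustrative;
# the slot is assumed to exist, e.g. created once with:
#   cur.create_replication_slot("migration_slot", output_plugin="wal2json")
conn = psycopg2.connect(
    "dbname=source_db user=replicator",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()
cur.start_replication(slot_name="migration_slot", decode=True)

def apply_to_destination(payload):
    ...  # hypothetical: parse the wal2json change set and upsert it downstream

def consume(msg):
    apply_to_destination(msg.payload)
    # Acknowledge progress so the source can recycle WAL; an unacknowledged
    # slot retains WAL indefinitely and can fill the source's disk.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(consume)  # blocks, streaming every committed change
```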
The combined approach
For most enterprise migrations: dual-write where you control the application, CDC where you don't. For pure SaaS-to-SaaS migrations (Salesforce → HubSpot): CDC on both sides, with a "source of truth" flag determining which system's writes win during conflicts.
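A sketch of that conflict rule, assuming each record carries a `system` tag and a `modified_at` timestamp (the field names and config value are illustrative):

```python
SOURCE_OF_TRUTH = "salesforce"  # illustrative: the system whose writes win

def resolve_conflict(record_a, record_b):
    """Pick the winner when both systems changed the same record between
    syncs. Records are dicts with 'system' and 'modified_at' keys (assumed
    shape; real conflict detection would compare per-field, not per-record)."""
    if record_a["system"] == SOURCE_OF_TRUTH:
        return record_a
    if record_b["system"] == SOURCE_OF_TRUTH:
        return record_b
    # Neither side is authoritative for this object: newest write wins.
    return max(record_a, record_b, key=lambda r: r["modified_at"])
```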
Rehearsal Runs
Before cutover, you rehearse. The rehearsal is the migration, on a different day, with the same data, validated against the same acceptance criteria. You run it two or three times, more for complex migrations.
Why rehearse?
Three reasons:
- Discovers ordering bugs. Records loaded out of dependency order fail. Rehearsal catches these before production.
- Calibrates timing. How long does the load actually take at full data volume? Estimates are wrong. Measure twice, cut over once.
- Builds team muscle. The team running cutover should have done it before. Rehearsal builds that muscle with nothing on the line.
What a rehearsal looks like
- Day 1: Provision a destination clone. Load full historical data. Run reconciliation. Identify deltas.
- Day 2: Triage every delta. Some are real bugs (mapping errors, missed transformations). Some are timing artifacts (records modified between extract and load).
- Day 3: Re-run the migration with bug fixes. Validate clean reconciliation.
- Repeat until two consecutive rehearsals run clean.
Most migrations need at least 2 rehearsals. Complex migrations need 4-5.
Traffic Cutover
The cutover itself is anticlimactic if you've done the prep right. Roughly:
- T-30 min: Final delta load. CDC catches up to within seconds of source.
- T-10 min: Stop dual-write at the application layer. All writes now go to source only (briefly).
- T-5 min: Final CDC catch-up. Destination is now byte-equivalent to source (the lag check sketched after this list gates this step).
- T-0: Atomic switch. Reads and writes flip to destination. Source becomes read-only.
- T+5 min: First production traffic confirmed against destination. Smoke tests pass.
- T+30 min: Reconciliation report runs. Variance is zero (or known and accepted).
- T+1 hour: Cutover declared complete. Source remains read-only for the rollback window.
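Before T-0, you need proof that CDC has actually caught up. For a Postgres source with a logical replication slot, the lag is directly queryable; the threshold and slot name here are illustrative:

```python
import psycopg2

CATCHUP_THRESHOLD_BYTES = 1024 * 1024  # illustrative: allow ~1 MB of WAL lag

def replication_lag_bytes(source_dsn, slot_name="migration_slot"):
    """How far the CDC slot trails the source's current WAL position."""
    with psycopg2.connect(source_dsn) as conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
            FROM pg_replication_slots
            WHERE slot_name = %s
            """,
            (slot_name,),
        )
        return cur.fetchone()[0]

# Gate the switch: refuse to flip until the slot is effectively caught up.
assert replication_lag_bytes("dbname=source_db") < CATCHUP_THRESHOLD_BYTES
```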
The "atomic switch" is implementation-specific. For applications behind a load balancer: change the upstream backend. For DNS-based routing: lower TTLs in advance, then switch records. For SaaS migrations with API integrations: update the integration's API endpoint config.
Reconciliation
Reconciliation is the proof that migration succeeded. Without it, you have a feeling. With it, you have evidence.
Three layers of reconciliation
- Row counts. Source COUNT(*) per table = destination COUNT(*) per table. Variance must be zero.
- Column aggregates. SUM, MIN, MAX, COUNT DISTINCT per critical column. For numeric columns this is straightforward; for text columns, COUNT(*) WHERE column IS NOT NULL is the practical substitute.
- Row-level hashes. Sample 1% of rows; compute a hash over every column of each row; compare the source hash to the destination hash. Catches subtle field-level corruption that aggregates miss.
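A sketch of the third layer, assuming both sides are Postgres with identical schemas so the row-to-text rendering matches (table and key names are illustrative). The one non-obvious detail: the sample must be deterministic, keyed on the ID, so that both systems hash the same rows:

```python
import psycopg2

def row_hashes(dsn, table, sample_modulus=100):
    """~1% sample of row-level hashes. Sampling on the key (id % 100 == 0)
    is deterministic, so source and destination hash the same rows; a
    random sample would compare different rows on each side."""
    query = (
        f"SELECT id, md5(t::text) FROM {table} t "
        "WHERE id %% %s = 0 ORDER BY id"
    )
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(query, (sample_modulus,))
        return dict(cur.fetchall())

source = row_hashes("dbname=source_db", "orders")
dest = row_hashes("dbname=dest_db", "orders")
mismatched = [rid for rid, h in source.items() if dest.get(rid) != h]
match_rate = 1 - len(mismatched) / max(len(source), 1)
```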
Reconciliation thresholds
Define acceptance criteria before cutover. Some examples:
- Row count variance: must be 0.
- Numeric column SUM variance: <0.01% (rounding tolerance).
- Date column MIN/MAX variance: 0 (timestamps must match exactly).
- Row-level hash sample: 99.99%+ match rate.
If reconciliation fails the criteria, the cutover doesn't get signed off. Either fix forward (find and fix the variance) or roll back.
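One way to make the criteria executable rather than debatable is to write the thresholds above down as code before cutover (the metric names and result shape here are illustrative):

```python
# The acceptance criteria above, recorded before cutover.
CRITERIA = {
    "row_count_variance": 0,       # absolute: must be exactly zero
    "sum_variance_pct": 0.0001,    # 0.01% rounding tolerance on numeric SUMs
    "minmax_variance": 0,          # date MIN/MAX must match exactly
    "hash_match_rate": 0.9999,     # sampled row-level hashes
}

def reconciliation_signoff(results):
    """`results` maps metric name -> measured value (assumed shape)."""
    failures = [
        metric
        for metric, threshold in CRITERIA.items()
        if (results[metric] < threshold        # rates must meet the floor
            if metric == "hash_match_rate"
            else results[metric] > threshold)  # variances must stay under
    ]
    return len(failures) == 0, failures
```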
Rollback Windows
Even with everything done right, you keep a rollback window. Two reasons:
- Latent bugs. Some bugs only appear under production load patterns. A rollback window gives you time to discover and respond.
- User confidence. Knowing rollback is possible reduces the political risk of the project. Stakeholders sign off more readily on a migration with a documented rollback plan than one without.
How long?
- SaaS migrations: 30 days. Enough to catch any month-end report or invoice cycle issues.
- OLTP database migrations: 7 days. Long enough to validate steady-state, short enough that source data doesn't drift unrecoverably.
- ERP migrations: 90 days. Multiple month-ends, full quarter close cycle.
What stays alive during rollback?
The source system stays in read-only mode. CDC continues replicating destination → source for any post-cutover writes (in case rollback is needed). All historical access (audit, reporting, drill-down) works against either system.
After the rollback window expires, source is decommissioned. Pre-decommission checklist: confirm no dependent systems still query the source; archive source data to cold storage for compliance retention; revoke access; delete tenant.
Common Failure Modes
The patterns that cause zero-downtime migrations to fail:
- Skipping rehearsal. Discovering load-order bugs at 3 AM on cutover day. The most common failure.
- CDC lag. CDC pipeline can't keep up with source write rate. Cutover gets postponed because destination isn't caught up. Fix: provision more CDC capacity or rate-limit source writes during cutover.
- Sequence/identity column drift. Source and destination both auto-increment. After cutover, IDs collide. Fix: pre-allocate ID ranges or seed the destination's sequence higher than the source's max (a sketch follows this list).
- Trigger fan-out. Loading a record on the destination fires triggers that try to update related records that haven't been loaded yet. Fix: disable triggers during load; re-enable post-load with a one-time backfill.
- Connection pool exhaustion. Bulk load saturates the destination's connection pool. Application traffic gets queued. Fix: load against a separate connection pool or load during low-traffic windows.
- Reconciliation thresholds defined after the fact. Variance shows up; team debates whether it's acceptable. Define thresholds before, not after.
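For the sequence-drift fix specifically, a sketch against a Postgres destination (the table, sequence name, and headroom are illustrative):

```python
import psycopg2

ID_HEADROOM = 1_000_000  # illustrative gap above the source's current max ID

def seed_destination_sequence(source_dsn, dest_dsn,
                              table="orders", seq="orders_id_seq"):
    """Start the destination's sequence well above the source's max ID so
    post-cutover inserts can't collide with rows still arriving via CDC."""
    with psycopg2.connect(source_dsn) as src, src.cursor() as cur:
        cur.execute(f"SELECT COALESCE(MAX(id), 0) FROM {table}")
        source_max = cur.fetchone()[0]
    with psycopg2.connect(dest_dsn) as dst, dst.cursor() as cur:
        cur.execute("SELECT setval(%s, %s)", (seq, source_max + ID_HEADROOM))
```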
Most of these are avoided with disciplined rehearsal and explicit pre-cutover acceptance criteria. The complete migration checklist covers the prep work.