The 3-2-1 Backup Rule
The 3-2-1 backup rule is one of the oldest and most reliable strategies for protecting data. Originally popularized by photographer Peter Krogh in his book The DAM Book: Digital Asset Management for Photographers, it remains the foundation of nearly every modern backup policy. The rule is simple: keep 3 copies of your data, on 2 different types of media, with 1 copy stored offsite.
The strategy is endorsed by CISA (the Cybersecurity and Infrastructure Security Agency, which absorbed US-CERT) and referenced in NIST SP 800-34 Rev. 1, the Contingency Planning Guide for Federal Information Systems, as a foundational element of data protection.
The Rule Explained
| | Copy 1 | Copy 2 | Copy 3 |
|---|---|---|---|
| Role | Production | Local Backup | Offsite Backup |
| Media | SSD / NVMe | HDD / NAS / Tape | Cloud / Remote DC |
| Location | Local Site | Local Site | Remote Site |
3 Copies
Maintain at least three copies of your data: the original (production) and two backups. A single backup is not enough because both the original and the backup can fail simultaneously. Disk failures, ransomware, accidental deletion, or software bugs can corrupt data without warning. With three copies, the probability of losing all of them at once drops dramatically.
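The benefit of extra copies can be made concrete with a back-of-the-envelope calculation. The sketch below (not from the original text) assumes each copy fails independently with the same annual probability; this is optimistic, since correlated events like ransomware or a site fire take out copies together, which is exactly why the rule also demands media and location diversity.

```python
def p_total_loss(p_fail: float, copies: int) -> float:
    """Probability that every copy is lost in the same period,
    assuming independent failures (optimistic: correlated events
    such as ransomware can hit all reachable copies at once)."""
    return p_fail ** copies

# With a hypothetical 5% annual failure chance per copy:
p_total_loss(0.05, 1)  # 0.05      -> 1 in 20
p_total_loss(0.05, 2)  # 0.0025    -> 1 in 400
p_total_loss(0.05, 3)  # 0.000125  -> 1 in 8,000
```

Even under these idealized assumptions, the third copy buys several orders of magnitude; breaking the independence assumption is what the "2 media" and "1 offsite" parts guard against.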
2 Different Media Types
Store backups on at least two different types of storage media. If all copies live on the same type of hardware, a single class of failure (firmware bug, batch defect, same-model vulnerability) can destroy everything at once.
Common media type combinations:
| Primary | Secondary |
|---|---|
| SSD/NVMe | HDD |
| HDD | Tape (LTO) |
| Local NAS | Cloud Object Storage |
| SAN | USB drives |
The key is that different media types have different failure modes. An SSD and an HDD will not fail for the same physical reason at the same time.
1 Offsite Copy
At least one copy must be stored in a geographically separate location. Local disasters (fire, flood, theft, power surge) can destroy all equipment at a single site. An offsite copy survives even if the entire primary site is lost.
Offsite options include:
- Cloud storage (Amazon S3, Backblaze B2, GCS, Azure Blob)
- Remote data center or colocation facility
- Physically transported tapes or drives stored at another location
RPO and RTO
Any backup strategy needs to answer two fundamental questions: how much data can you afford to lose, and how quickly do you need to recover?
Recovery Point Objective (RPO) defines the maximum acceptable amount of data loss measured in time. An RPO of 1 hour means you can tolerate losing up to 1 hour of data. This directly determines backup frequency: if your RPO is 1 hour, you need backups at least every hour.
Recovery Time Objective (RTO) defines the maximum acceptable downtime. An RTO of 4 hours means the system must be back online within 4 hours after a failure. This determines what kind of backup infrastructure you need and how fast your restore process must be.
| RPO | Backup Frequency | Example |
|---|---|---|
| 24 hours | Daily | Non-critical archives, documentation |
| 1 hour | Hourly snapshots | Business applications, internal tools |
| Minutes | Continuous replication | E-commerce, financial transactions |
| Near zero | Synchronous replication | Payment processing, trading systems |
The 3-2-1 rule does not prescribe specific RPO/RTO values, but the choice of tools and media directly affects both. Restoring from a local NAS is fast (low RTO) but a single daily backup means up to 24 hours of potential data loss (high RPO). Cloud-based continuous replication gives near-zero RPO but may have higher RTO depending on bandwidth and data volume.
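The link between backup interval and RPO can be expressed as a simple monitoring check: the newest snapshot must never be older than the RPO allows. A minimal sketch (function name and timestamps are illustrative):

```python
from datetime import datetime, timedelta

def rpo_violated(last_snapshot: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if a failure right now would lose more data than the RPO allows."""
    return now - last_snapshot > rpo

now = datetime(2024, 6, 1, 12, 0)
rpo = timedelta(hours=1)
rpo_violated(datetime(2024, 6, 1, 11, 30), now, rpo)  # False: snapshot 30 min old
rpo_violated(datetime(2024, 6, 1, 9, 0), now, rpo)    # True: 3 hours of data at risk
```

A check like this belongs in monitoring rather than in the backup job itself, so that a job that stops running still triggers an alert.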
Cost Considerations
Backup strategy involves a tradeoff between cost and protection level. Understanding storage costs helps in choosing the right combination of media.
Storage Cost Comparison
| Media Type | Approximate Cost per TB/month | Durability | Access Speed |
|---|---|---|---|
| Local HDD (enterprise) | $1-2 (amortized over 5 years) | Moderate | Fast |
| Local NAS (RAID) | $3-5 (amortized) | High | Fast |
| LTO-9 Tape | $0.50-1 (amortized) | Very high | Slow (sequential) |
| Amazon S3 Standard | $23 | Very high (99.999999999%) | Fast |
| Amazon S3 Glacier Deep Archive | $0.99 | Very high | Hours |
| Backblaze B2 | $6 | Very high | Fast |
Hidden Costs
Raw storage price is only part of the equation:
- Egress fees: most cloud providers charge for downloading data, and during a disaster recovery you may need to download everything at once. S3 charges roughly $0.09/GB for egress, so restoring 10 TB costs about $900. Backblaze B2 offers free egress up to 3x the amount of data stored, and Wasabi charges no egress fees at all
- API request costs: frequent incremental backups generate many PUT/GET requests. S3 charges per 1,000 requests, which adds up quickly with small-file-heavy workloads
- Bandwidth: uploading backups to remote locations requires sufficient upload bandwidth. A 10 TB initial backup over a 100 Mbps link takes roughly 9 days
- Management overhead: more complex backup infrastructure requires more time to maintain, monitor, and test
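The egress and bandwidth figures above can be reproduced with a few lines of arithmetic. This sketch uses decimal units (1 TB = 1,000 GB), as cloud providers bill, and ignores protocol overhead:

```python
def egress_cost_usd(data_tb: float, price_per_gb: float) -> float:
    """Cost of downloading a full backup set at a per-GB egress rate."""
    return data_tb * 1000 * price_per_gb

def upload_days(data_tb: float, uplink_mbps: float) -> float:
    """Days needed to push an initial backup over a given uplink,
    ignoring protocol overhead and competing traffic."""
    seconds = data_tb * 1e12 * 8 / (uplink_mbps * 1e6)
    return seconds / 86400

egress_cost_usd(10, 0.09)  # ~$900 to restore 10 TB from S3
upload_days(10, 100)       # ~9.3 days for the initial 10 TB upload at 100 Mbps
```

Running these numbers before choosing a provider avoids discovering the egress bill in the middle of a disaster recovery.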
Cost-Effective 3-2-1 Strategy
For most small to medium deployments:
- Copy 1 (production): already paid for as part of infrastructure
- Copy 2 (local NAS with HDD): one-time hardware cost, low ongoing cost
- Copy 3 (B2 or S3 Glacier Deep Archive): $1-6/TB/month depending on access needs
A 5 TB dataset following this strategy costs roughly $5-30/month for offsite storage, plus the one-time cost of NAS hardware.
Backup Verification
A backup that has never been tested is a hope, not a strategy. Verification should cover both integrity checks and full restore tests.
Integrity Checks
Run automated integrity checks on every backup:
- Checksum verification: tools like restic and borgbackup store checksums for every block of data. Run `restic check` or `borg check` regularly to detect silent corruption
- Snapshot listing: verify that expected snapshots exist and cover the right time range. A backup job that silently stopped running last month is worse than no backup at all
- Size validation: compare backup sizes over time. A sudden drop may indicate missing data. A sudden spike may indicate unwanted files being backed up
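The snapshot-listing check lends itself to automation. The sketch below assumes snapshot timestamps have already been collected, for example by parsing `restic snapshots --json`; the function name and thresholds are illustrative:

```python
from datetime import datetime, timedelta

def snapshot_alerts(snapshot_times: list, now: datetime,
                    max_age: timedelta, min_count: int) -> list:
    """Return alert messages for missing or stale snapshots.
    An empty list means the repository looks healthy."""
    if not snapshot_times:
        return ["no snapshots found"]
    alerts = []
    if len(snapshot_times) < min_count:
        alerts.append(f"only {len(snapshot_times)} snapshots, expected >= {min_count}")
    age = now - max(snapshot_times)
    if age > max_age:
        alerts.append(f"newest snapshot is {age} old, limit is {max_age}")
    return alerts
```

Feeding the result into an alerting system catches the silently stopped backup job the text warns about.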
Automated Restore Testing
Schedule periodic restore tests to verify end-to-end recovery. At minimum, a restore test should restore the latest snapshot to a temporary directory, verify that critical files exist, and confirm that database dumps can be loaded without errors.
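One way to script the file-existence part of such a test, assuming a manifest of SHA-256 checksums recorded at backup time (the manifest format here is hypothetical, not a restic or borg feature):

```python
import hashlib
from pathlib import Path

def verify_restore(restore_dir: Path, manifest: dict) -> list:
    """Compare restored files against checksums recorded at backup time.
    `manifest` maps relative paths to expected SHA-256 hex digests."""
    problems = []
    for rel_path, expected in manifest.items():
        f = restore_dir / rel_path
        if not f.is_file():
            problems.append(f"missing: {rel_path}")
        elif hashlib.sha256(f.read_bytes()).hexdigest() != expected:
            problems.append(f"corrupt: {rel_path}")
    return problems  # an empty list means the restore test passed
```

A cron job can restore the latest snapshot into a scratch directory, run this check over a list of critical paths, and report the result to monitoring.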
Verification Schedule
| Check | Frequency | Automated |
|---|---|---|
| Backup job completion | Every run | Yes (monitoring alerts) |
| Checksum integrity | Weekly | Yes (cron/systemd timer) |
| Snapshot count and age | Daily | Yes (monitoring alerts) |
| Partial restore test | Monthly | Yes (scripted) |
| Full bare-metal restore | Quarterly | Manual |
The quarterly full restore is the most important test. It verifies not just data integrity but the entire recovery procedure: documentation, access credentials, network configuration, and the time it actually takes. Document each test result, including how long the restore took, and update the RTO estimate accordingly.
Common Mistakes
Not testing restores: many organizations back up religiously but never verify that restores actually work. A backup that cannot be restored is not a backup. Schedule periodic restore tests.
Same failure domain: storing the “offsite” copy on another server in the same rack or building does not satisfy the offsite requirement. A single event can destroy both.
No encryption: backups stored offsite or in the cloud must be encrypted. Losing control of unencrypted backup data is a data breach.
No retention policy: keeping only the latest backup means that if corruption goes undetected for days, all backups may contain corrupted data. Use versioned or incremental backups with a retention window.
Ignoring backup integrity: backups should include checksums or verification steps. Silent data corruption (bit rot) can make backups unreadable over time.
Asymmetric encryption without private key management: encrypting backups with a public key (PGP, RSA) is convenient because anyone can create backups without access to the decryption key. However, if the private key is lost, every backup encrypted with it becomes permanently unreadable. Unlike a symmetric passphrase that a human can memorize or write down, a private key is a large binary blob that must be carefully stored and replicated. If you use asymmetric encryption for backups, store the private key in multiple secure locations independent of the backups themselves: a hardware security module, a password manager, and a printed paper copy in a safe. Test decryption regularly to confirm the private key is still accessible. A backup encrypted with a lost private key is identical to no backup at all.
Backing up only databases, not configuration: restoring a database is useless if the application configuration, TLS certificates, cron jobs, firewall rules, and service definitions are lost. A full recovery requires the entire environment, not just the data. Include system configuration, secrets, and infrastructure-as-code definitions in your backup scope.
No monitoring or alerting: a backup job that fails silently for weeks is a common cause of data loss. Without monitoring, the failure is only discovered when a restore is needed and nothing is there. Every backup job should report success or failure to a monitoring system, and missing reports should trigger alerts just like explicit failures.
Relying on RAID as a backup: RAID protects against individual disk failure, but it does not protect against ransomware, accidental deletion, software bugs, or controller failure. RAID is a high-availability mechanism, not a backup strategy. All data on a RAID array can be destroyed by a single `rm -rf`, an encrypted filesystem attack, or a failed controller that corrupts the entire array.
Relying solely on cloud provider snapshots: provider-managed snapshots (EBS snapshots, VM snapshots) are convenient but exist within the same provider and often the same account. An account compromise, billing issue, or provider-side bug can delete snapshots along with the primary data. Snapshots are a useful first layer but must be supplemented with an independent offsite copy.
Backing up running databases without consistency: copying database files from a running instance can produce a corrupted backup. Databases must be backed up using their native dump tools (`pg_dump`, `mysqldump`, `mongodump`) or filesystem snapshots taken while writes are frozen. A backup that cannot be restored due to corruption is worse than no backup because it creates a false sense of security.
Not documenting the restore procedure: even with perfect backups, a restore can fail if no one knows the exact steps, credentials, and order of operations. Document the restore procedure, store it outside the systems being backed up, and verify that someone other than the original author can follow it successfully.
Extended Rules
The original 3-2-1 rule has evolved as threats have changed. Two notable extensions are now widely recommended.
3-2-1-1-0
This extension adds two requirements:
- 1 copy must be air-gapped or immutable (protected from ransomware that can encrypt network-accessible backups)
- 0 errors verified through regular restore testing
Immutability can be achieved through:
- Object lock on cloud storage (S3 Object Lock, B2 immutable buckets)
- WORM (Write Once Read Many) tape
- Offline/air-gapped drives that are disconnected after backup
4-3-2
For critical data, some organizations use the 4-3-2 rule:
- 4 copies of the data
- 3 different storage media types
- 2 offsite copies in different geographic locations
This provides additional redundancy against regional disasters and cloud provider outages.
Practical Implementation
A minimal 3-2-1 setup for a Linux server:
| Copy | Media | Location | Tool |
|---|---|---|---|
| Original | SSD (production server) | Primary DC | - |
| Backup 1 | NAS with HDD (ZFS/RAID) | Primary DC | restic, borgbackup |
| Backup 2 | Cloud object storage | Remote (S3/B2) | restic, rclone |
Key implementation details:
- Use different tools for each backup copy. If a bug in one tool corrupts or silently skips files, the other tool will still produce a valid backup. For example, use borgbackup for local NAS backups and restic for offsite cloud copies. Tool diversity eliminates single points of failure in backup software itself
- Automate backups with cron or systemd timers, never rely on manual execution
- Encrypt all offsite backups at rest (restic and borgbackup do this by default)
- Monitor backup jobs and alert on failures
- Retain multiple versions (daily for 7 days, weekly for 4 weeks, monthly for 12 months is a common starting point)
- Test restores regularly, at minimum once per quarter
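The retention schedule above maps directly to `restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 12`. For illustration, the bucket logic behind such a policy can be sketched in a few lines; this is a simplification of what restic and borg actually do:

```python
from datetime import date, timedelta

def snapshots_to_keep(snaps, daily=7, weekly=4, monthly=12):
    """Keep the newest snapshot in each of the most recent `daily` days,
    `weekly` ISO weeks, and `monthly` months (simplified GFS retention)."""
    limits = {"daily": daily, "weekly": weekly, "monthly": monthly}
    seen = {kind: set() for kind in limits}
    keep = set()
    for s in sorted(snaps, reverse=True):            # newest first
        buckets = {"daily": s.toordinal(),
                   "weekly": tuple(s.isocalendar()[:2]),  # (year, week)
                   "monthly": (s.year, s.month)}
        for kind, key in buckets.items():
            # A snapshot is kept if it is the newest one in any
            # not-yet-filled daily/weekly/monthly bucket.
            if key not in seen[kind] and len(seen[kind]) < limits[kind]:
                seen[kind].add(key)
                keep.add(s)
    return keep

# 60 consecutive daily snapshots collapse to a handful of keepers:
snaps = [date(2024, 1, 1) + timedelta(days=i) for i in range(60)]
kept = snapshots_to_keep(snaps)
```

In practice, let the backup tool's own `forget`/`prune` commands enforce retention; the point is that old snapshots thin out gradually instead of disappearing all at once.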
Summary
The 3-2-1 rule provides a simple framework that scales from personal workstations to enterprise infrastructure. The exact tools and storage targets will vary, but the underlying principle stays the same: diversify copies, diversify media, and keep data offsite. When in doubt, add more redundancy rather than less. Data that exists in only one place effectively does not exist.
Real-World Data Loss Incidents
These incidents demonstrate what happens when backup strategies fail or are absent entirely.
Toy Story 2 near-deletion (1998)
What happened: Someone ran `rm -rf *` on the Pixar file server hosting the Toy Story 2 production files. The backup system had been failing silently for months without anyone noticing. The film was saved only because a technical director had a copy on her home workstation that she used while working remotely during maternity leave.
Lessons: Backups that are never tested are not backups. Silent failures can persist for months if monitoring and restore tests are absent. An accidental offsite copy saved an entire film production.
Gmail account reset (2011)
What happened: A storage software update caused approximately 150,000 Gmail accounts to be wiped clean: all emails, contacts, labels, and chat history disappeared. Google restored the affected accounts from tape backups within a few days.
Lessons: Even at Google’s scale with massive online redundancy, tape-based offsite backups remain the last line of defense. Software bugs can bypass all online replication layers simultaneously. A different media type (tape) on a separate system saved the data.
GitLab database deletion (2017)
What happened: An engineer accidentally deleted a production PostgreSQL database during maintenance. Of five backup mechanisms in place, none worked correctly: LVM snapshots were never configured, regular database dumps had errors, S3 backups had never been tested, and Azure disk snapshots were not enabled. The recovery relied on a 6-hour-old copy that an engineer happened to have made. GitLab lost approximately 6 hours of production data.
Lessons: Multiple backup systems mean nothing if none of them are verified. Every backup mechanism should be tested regularly with actual restore operations. Redundancy on paper is not redundancy in practice.
VFEmail destruction (2019)
What happened: Attackers formatted every disk on every server of the U.S.-based email provider, destroying almost two decades of data including all backups. No ransom was demanded. The primary and backup systems shared the same access layer, so compromising one gave access to everything.
Lessons: Backups must be isolated from production systems. Shared authentication between primary and backup infrastructure means a single compromise destroys both. Air-gapped or immutable backups would have survived this attack.
Myspace data loss (2019)
What happened: Myspace announced that all photos, videos, and audio files uploaded before 2016 were lost during a server migration. Over 50 million songs from 14 million artists were permanently destroyed.
Lessons: Migrations are high-risk operations that require verified backup copies before execution. Irreplaceable user-generated content demands the same protection level as any critical data.
Garmin WastedLocker ransomware (2020)
What happened: The Evil Corp group encrypted Garmin’s internal systems, taking down Garmin Connect, flyGarmin, and all customer-facing services for nearly five days. Call centers, email, and online chat were all unavailable. Garmin reportedly paid a $10 million ransom to obtain the decryption key.
Lessons: Network-accessible backups are vulnerable to the same ransomware that hits production systems. Immutable or air-gapped backups would have eliminated the need to pay a ransom. The $10 million payment far exceeds the cost of proper backup infrastructure.
OVHcloud Strasbourg fire (2021)
What happened: A fire destroyed the SBG2 data center and damaged SBG1 in Strasbourg. Customers who stored both primary data and backups in the same facility lost everything permanently. Those with offsite backups or multi-region replication recovered.
Lessons: “Offsite” must mean genuinely separate geography, not just another rack or building on the same campus. A cloud provider’s local backup option does not satisfy the offsite requirement of the 3-2-1 rule.
Kyoto University supercomputer data loss (2021)
What happened: A faulty HPE software update meant to clean up old log files instead deleted 77 TB of research data (34 million files) from the supercomputer’s backup storage between December 14 and 16, 2021. The bug affected 14 research groups, and data belonging to 4 of them was irrecoverable.
Lessons: A backup stored on the same system it protects is not a separate copy. Automated maintenance scripts must be tested in isolation before running against production backup storage. Critical research data needs copies on independent systems.
Atlassian cloud data deletion (2022)
What happened: A deployment script intended to decommission one app used incorrect identifiers and deleted data for 883 sites belonging to 775 customers. Jira, Confluence, Opsgenie, and Statuspage were unavailable for affected customers for up to 14 days.
Lessons: Restore procedures must be tested at scale, not just for individual accounts. Atlassian had no automated system to restore a large subset of customers simultaneously, turning a minutes-long deletion into a two-week recovery. Disaster recovery plans must account for mass-restore scenarios.
CloudNordic ransomware (2023)
What happened: Ransomware encrypted all servers of Danish cloud provider CloudNordic, including primary and secondary backup systems. The attack happened during a data center migration when already-infected machines were connected to the internal network. The majority of customers - hundreds of Danish companies - lost all their data permanently.
Lessons: Migrations create windows of vulnerability when systems from different security zones are connected. Both backup tiers were reachable from the same network, so one compromise encrypted everything. At least one backup copy must be offline or on an isolated network segment.
UniSuper Google Cloud deletion (2024)
What happened: A misconfiguration during provisioning caused Google Cloud to delete the entire private cloud subscription of UniSuper, a $125 billion Australian pension fund. The deletion propagated across both geographic regions meant to protect against exactly this scenario.
Lessons: Multi-region replication within a single provider does not protect against provider-level errors. UniSuper recovered because they maintained backups with an independent third-party provider outside of Google Cloud. True 3-2-1 compliance requires at least one copy outside your primary provider.
South Korea national data center fire (2025)
What happened: A fire caused by thermal runaway of an expired UPS battery module at the National Information Resources Service data center in Daejeon burned for nearly 22 hours, taking down 647 government IT systems - roughly one-third of the nation’s online public services. 858 TB of government data was lost, including G-Drive, a cloud system used by 125,000 officials, which had no backups because the volume was deemed too large.
Lessons: No dataset is “too large” to back up - the cost of losing it is always higher. Physical infrastructure failures can destroy an entire site regardless of digital redundancy. Critical government services must have offsite copies on separate media, with no exceptions.
References
- NIST SP 800-34 Rev. 1 - Contingency Planning Guide for Federal Information Systems
- NIST SP 800-209 - Security Guidelines for Storage Infrastructure
- CISA - Data Backup Options
- CISA - Protecting Against Ransomware
- Backblaze - The 3-2-1 Backup Strategy
- restic - Fast, secure, efficient backup program
- BorgBackup - Deduplicating archiver with compression and encryption
- rclone - Rsync for cloud storage
- LTO Program - Linear Tape-Open technology