Accidental DBA
Posts
Frequent SQL Server Backups are Great!

Frequent SQL Server Backups are Great!

Until they aren’t.

Kevin Hill
November 22, 2024

Frequent Backups Aren't Enough – Here's the Missing Piece for SQL Reliability

Many IT shops think frequent backups are the best safety net for their SQL Servers. I understand this. I’ve been that DBA. But, database backups are not a “set it and forget it” item in your maintenance responsibilities.

The Problem with "Backup = Reliability" Mindset

When you install SQL Server, backups are not happening by default. The next step for shops that are backup-aware is to sort out how to use the internal maintenance plan process, or find a 3^rd party tool (Ola, Veeam, Commvault, Datto) to do it for them. This is great, but it doesn’t go far enough.

Backups don’t prevent hardware failures, corruption, or performance issues. A backup is useless if it’s incomplete, corrupt, or can’t be restored quickly. Also, a local backup is lost if the server dies, the VM is corrupt or worse – ransomware encrypts the data files and the backups because they are on the same server.

What Real Reliability Looks Like

Almost every conversation around backups and disaster recovery (DR) also leads to high availability (HA) discussions. This is great, but it’s important to understand the difference:

We start with two things:

RPO - Recovery Point Objective: How much data can you afford to lose?
RTO - Recovery Time Objective: How long can we afford to be down?

When I ask CxO level folks these questions they always come up with something like “None and none“ or “zero and zero” for data-loss and downtime. My auto-response – “Add some zeros to the project price tag.”

The point (for today) is that DR and HA are NOT the same thing. If Joe deletes some data in prod, a multimillion-dollar HA solution will not save that data – but it will replicate that delete VERY quickly.

In a DR setup, you are looking to be able to recover missing data, databases or even rebuild an entire server. Human errors are the single biggest cause of database restores in SQL Server. The ability to recover from these errors with minimal heartburn is the most important thing to consider.

Real reliability comes from knowing your databases have no corruption in them and can be recovered from backup as needed. You also need to KNOW (not guess) how long this will take. Management expects that.

Practical Steps to Enhance Reliability

Test Your Backups Regularly:
- None of the backup processes mentioned above take the next step, which is to do a test restore. Most of them can do an in-place restore, but you don’t do that to the same server except to recover in a failure.
- Do test restores to a non-prod server and time it. Run corruption checks (DBCC CHECKDB) and make sure the database(s) are accessible.
Monitor and Audit Your SQL Server:
- Keep an eye on backup job failures and make sure the right person/team is aware of them.
- Run regular corruption checks against ALL databases, including system dbs.
Invest in Redundancy:
- Log Shipping is old, reliable technology. Backup, copy, restore.
  - Set the restore job to run LESS often than the 15-minute default so you have time to catch Joe’s “oops” before that mistake is restored to your secondary.
- Replication (the SQL Server feature) is NOT a DR option, as not everything in the database can be replicated. Even some tables cannot.
- Multiple options exist for other DR options, some require Enterprise edition

Story Time (x3):

1 - I (not Joe) overwrote the Users table in prod with a copy of the Dev table when I was moving new code to prod and forgot to uncheck the “move dependent objects” box. 14-hour outage way back in the dot-com days. We had home-grown Log Shipping in place, with restores on a 4-hour delay. The site was up in 30 minutes, but we did manage to lose about 30 minutes of data in the process. Year: 2000.

2 - I recently helped a client recover the entire company from a 3-week old backup when ransomware got in and encrypted the SQL data and log files as well as the local backups. I had warned them about this very possibility. Luckily, they were able to replay all of the missing data from the logs that it originated in.

3 - As I was writing this newsletter, I was multi-tasking and messed up our company timesheet with some bad sorting (yes, we use Excel for our timesheet). Thanks to version history, nothing was lost.

The Bottom Line:

If you don’t test your disaster recovery plan, you only have a disaster recovery hope. – paraphrased from Geoff Hiten.

If you need help sorting out what you have and what you are missing, please reach out:

[email protected]

Get a FREE SQL Server Health Check from Dallas DBAs!

New Pocket DBA® clients get the first month FREE!

Join the Accidental SQL DBA group on LinkedIn – no fee, no spam

My Recent LinkedIn Posts:

Not Working is More Important Than Working

JOE DROPPED THE USERS TABLE!!!

Interesting Stuff I Read This Week:

Spirit Airlines files for bankruptcy protection after mounting losses – I’ve never flown Spirit…have you?

New York’s Transit Agency Approves $9 Congestion Toll - Bloomberg – hoping this helps NYC get greener! I’ve never been there.

SQL tidBITs:

CTRL-A multi-line text editing in SSMS 21 is enhanced! Cool!

Reply

or to participate.