Accidental DBA

Near-Perfect Uptime

How We Took a Disaster to a Dream

Kevin Hill
October 18, 2024

It started out like so many others do.

Our first call with a new client (an emergency engagement, 2 hours prepaid) started with “Our log file is full and we can’t do anything now.”

A full log file is a classic SQL Server problem that people have been asking about for decades (blog / video).

We did the obvious: flipped the database’s recovery model from Full to Simple and back, since there was no space to back up a 2TB log file and backing up to NUL was going to take far too long. Then we started a Full backup to reset the backup chain. While that was running, we started poking around for anything else that needed attention and could be fixed easily.
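
For readers who have not seen that sequence written down, here is a minimal Python sketch that only builds the T-SQL statements as strings; it does not connect to a server. The database name `SalesDb` and the backup path are hypothetical placeholders, and the `COMPRESSION` and `CHECKSUM` backup options are assumptions for illustration, not a record of what was run on the call.

```python
from typing import List


def quote_name(name: str) -> str:
    """Bracket-quote a SQL Server identifier, escaping any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def emergency_log_reset(database: str, backup_path: str) -> List[str]:
    """Return the T-SQL steps for the Full -> Simple -> Full flip, in order.

    Switching to SIMPLE breaks the log backup chain, so the full backup
    at the end is what starts a new chain.
    """
    db = quote_name(database)
    path = backup_path.replace("'", "''")  # escape quotes inside the string literal
    return [
        f"ALTER DATABASE {db} SET RECOVERY SIMPLE;",
        f"ALTER DATABASE {db} SET RECOVERY FULL;",
        f"BACKUP DATABASE {db} TO DISK = N'{path}' WITH COMPRESSION, CHECKSUM;",
    ]


if __name__ == "__main__":
    # Hypothetical database name and UNC path, for illustration only.
    for statement in emergency_log_reset("SalesDb", r"\\backupserver\sql\SalesDb_full.bak"):
        print(statement)
```

Under the Simple recovery model the log is truncated at a checkpoint, so a real run would likely confirm the log space had actually been released before switching back to Full. That check is left out here to keep the sketch short.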

We found these issues (among others):

· Way behind on patching (SQL 2016 Enterprise)

· Bare metal server

· All of the default SQL Server settings

· Tempdb on the same drive as the log files on a very busy server

· Missing Index report – one of the three worst I’ve ever seen. Not enough indexes and too many at the same time.

· Many years of completely unnecessary data, with no purging or archiving process.

· Stand-alone server with manual backups. No HA/DR at all.

All of this meant that the server was one hardware failure away from losing ALL the data, forever.

It gets better!

We signed the client to our Fractional DBA service and got to work. They had budget but no direction.

First things first, we stabilized the server with a proper backup process (files stored off the box) and current patching.

I spent most of my time taking care of the existing box while a teammate built a brand-new 3-node Availability Group (Primary, reporting secondary, DR secondary).

It took 6 months to get all of this in place on a 5-hour-per-week plan.

Since “go live” 3 years ago, there has been only one outage, and that was performance-related, not server downtime. I don’t actively measure in “9s”, but I’m pretty sure zero unplanned downtime = 100% uptime.
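
For anyone who does measure in “9s”, the arithmetic is simple enough to sketch. The Python below is an illustration only: the function names are made up for this example, and it assumes a 365-day year and counts only unplanned downtime.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def availability_percent(downtime_minutes: float,
                         period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Percentage of the period during which the system was up."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime_minutes must be between 0 and period_minutes")
    return 100.0 * (1.0 - downtime_minutes / period_minutes)


def downtime_budget_minutes(nines: int,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime allowed per period for a target of N nines (3 means 99.9%)."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return period_minutes * 10.0 ** (-nines)


if __name__ == "__main__":
    # Three years with zero unplanned downtime.
    print(availability_percent(0, 3 * MINUTES_PER_YEAR))
    # Yearly downtime budgets for two through five nines.
    for n in range(2, 6):
        print(n, round(downtime_budget_minutes(n), 2))
```

A “three nines” target leaves roughly 525.6 minutes of downtime per year, a little under 9 hours, while zero downtime over any period works out to 100%.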

Can I promise you that? No. That would be a lie.

I can set you up for success and minimize the risks. And that is what you really want from your DBA team, whatever they look like.

The Bottom Line:

If you are not doing the basic reliability tasks, all the cool code and nice features in the world won’t save you from a system failure. Or ransomware.

If you don’t know where you stand and you want some help, please reach out.

Fun-Sized Consulting, 4-10 hours, October only!

My Recent LinkedIn Posts

Dear Junior DBA – a classic blog post I wrote while getting new tires many years ago.

Know your tools! – all the monitoring tools won’t help if you don’t know what to look for

Near-Perfect Uptime – asking you about your reliability goals

Interesting Stuff I Read This Week

Elon Musk's Starship rocket achieves record-breaking feat (bbc.com) – I’ve been watching the “chopsticks” since their construction

Tech giants bet on nuclear power | LinkedIn – Microsoft starting up Three Mile Island, Amazon and Google getting nuclear as well. Thanks AI?

Most of Earth’s meteorites come from a few asteroid break ups – Apparently there is an asteroid belt between Mars and Jupiter. How about that?

SQL tidBITs:

If your application login is in the SQL Server sysadmin server role and it gets compromised, ALL of your data and log files are at risk when (not if) the ransomware hits. If your backup files are on the same server, they will more than likely get encrypted as well.
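
One way to check for this is sketched below, as an illustration rather than a complete audit. The query lists sysadmin members from the `sys.server_role_members` and `sys.server_principals` catalog views, and a small Python helper compares that list against the logins your applications use. The login names in the example are hypothetical, and running the query against a server is left to whatever client you already use.

```python
from typing import Iterable, List

# T-SQL that lists every member of the sysadmin fixed server role.
SYSADMIN_MEMBERS_QUERY = """
SELECT m.name AS login_name
FROM sys.server_role_members AS rm
JOIN sys.server_principals AS r ON r.principal_id = rm.role_principal_id
JOIN sys.server_principals AS m ON m.principal_id = rm.member_principal_id
WHERE r.name = N'sysadmin';
"""


def risky_application_logins(sysadmin_members: Iterable[str],
                             application_logins: Iterable[str]) -> List[str]:
    """Return the application logins that are also sysadmin members.

    Names are compared case-insensitively, which is an assumption that
    matches a typical case-insensitive server collation.
    """
    members = {name.lower() for name in sysadmin_members}
    return sorted(login for login in application_logins if login.lower() in members)


if __name__ == "__main__":
    # Hypothetical values: the first list would come from the query above.
    found = risky_application_logins(["sa", "WebApp"], ["webapp", "reporting_svc"])
    print(found)
```

Any login this returns is a candidate for a narrower set of permissions.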

If you enjoyed this, please send it to 1 person. Just one.

Maybe 2 or 3.

4 is getting silly and 5 is right out.