Options for warm/hot disaster recovery

freddy12345 · December 30, 2016, 4:13am

Hi,

We are implementing a new HA/DR solution, and I am asking for guidance on the best (not too expensive) options for having a warm or even hot standby of the entire server.

Setup: We will have two new physical servers at the same colocation facility, each running a windows 2012R2 VM with SQL server 2012 (or maybe 2014) Standard edition on it--and under OpenStack, in case that matters. There are only 2 databases to be synched with the target VM, and the largest is under 10 GB. Not a heavy load. I should also mention that this sql server instance is a publisher in replication to 3 other servers.

If VM1 fails in some way, we'd like to get VM2 up and running in less than 30 minutes--much faster would be much better.

What are my 'best' options? I'm leaning towards DB mirroring or even log shipping. Googling around I have found the following:
• VM-level replication (but our network guy feels this is not that reliable). Seems conceptually simpler.
• Mirroring the 2 databases, with auto failover feature (we are using Standard edition)
• Log shipping
• Replication
• Always on AGs? I don't think we can use these with Standard Ed.
• Other?

I realize the mirroring/log shipping solutions require many other external objects to be synched up as well (logins, credentials, sql agent jobs, you name it…)

Thanks for any guidance here!

Fred

StephenChapman · March 22, 2017, 4:20pm

Hi Fred,

The most important question to ask is what is your RTO and RPO. Essentially, how long can you afford your system to be down for and how much time's worth of data can you afford to lose. You have answered the RTO question by saying that you would hope to be back up and running within 30 minutes and it's good to know that both DB Mirroring and Log Shipping can easily achieve this for you.

The second question I think still needs answering but to summarise the two options, mirroring can potentially provide you with typically less that one minutes worth of data loss whilst log shipping could be up to 1 hour (although that depends on how you configure this).

There are two methods to set up mirroring, you can have async or sync commit. Synchronous commit is the way to go if you want to minimise the risk of data loss but this comes at a potential performance impact as the transaction is written to the secondary before it is committed to the primary and reported back to the app as a success. Asynchronous commit on the other hand just sends the transactions off to the secondary but commits them straight away to the primary regardless. This does have a lower performance impact but at the expense of a potentially higher level of data loss.

Both methods are pretty simple to set up and are both available in Standard Edition (and as you are sticking with SQL 2012 you may find you do not need to purchase extra licences for your secondary, non readable, replica). You are correct in thinking that AlwaysOn AG's are an enterprise feature.

As for replicating your logins, credentials etc, well you would normally do these manually when you set up your system as you don't replicate your system databases but this is not a big issue (you can script them).

A little tip as well, which I have used before, is set up a DNS "A" record alias pointing, initially, to your primary replica. Point your application to the Alias SQL Server name, then when you do get into a situation where you need to reconfigure your applications to point to the secondary server you can just change the DNS alias and not have to reconfigure all your connection strings.

I hope this helps

Stephen

freddy12345 · March 22, 2017, 7:33pm

Stephen,
Thank you so much for your response.
It seems that mirroring or log shipping are the way for me to go.

Do you have any experience, or knowledge about replicating an entire VM? Seems like that would be simpler—at least conceptually.

Thanks again!
Fred

StephenChapman · March 23, 2017, 8:24am

Hi Fred,

I have to say that all that side of thing gets dealt with by our platform team so I get very little exposure to the way that is set up.

What I can tell you however is the a little horror story that I discovered in a previous company that may be something to be aware of.

We were looking into performance issues on a database system that comprised of a pair of VM's running SQL Server and mirroring between each other. The app was taking several seconds to complete a simple transaction. Looking at the stats on the server all the time being taken was by SQL Server. We changed the mirroring from Sync to Async commit and we still did not get any improvement. When we looked deeper we found that the two VM's had been built from the same SAN pool of disks and as a consequence of this the mirroring was forcing reads and writes from the same disks. We disabled the mirroring and lo and behold, a massive improvement.

From my point of view, while I do not get really involved in the setting up of the underlying systems and methods I do try to keep a hand in on having an idea what has been set up...purely to ensure the model we intend to implement will actually work to the benefit.

Kristen · March 23, 2017, 8:24am

Don't remember seeing your original question back in December, sorry you didn't get any answers sooner ...

I don't know much about it (have done replication and log shipping, but not really for this type of DR application), and we do have clients that use VM replication for DR, but their hardware people take care of that (so I have no useful knowledge) and the first we hear about it is when it cocks-up and we are asked to rescue the whole lot from conventional backups (that's a drama of course, but not the norm ...)

What I do know is that we have had to reconfigure the Backups to be to-network, instead of to-local-disks, because the replication of the VM was including all that, and it was killing bandwidth ... I regard backup-to-network as a more fragile process than to-local-disk, although the performance tests we have done have only added a few percent to the elapsed backup time (YMMV of course ...)

I remain sceptical that these clever failover systems are going to save the bacon on the day that they are needed, particularly in Shops that don't have gazillions of servers and heaps of skills - but I don't have an alternative suggestion.

Either way, I would want all mission-critical DBs running in FULL recovery model and Log backups running at no more than 1 minute interval, and LOG backups immediately copied to somewhat-more-remote storage in case the server breaks / building catches fire / etc. so that I had a conventional restore-from-backup Plan-B in my back pocket.

freddy12345 · March 23, 2017, 3:13pm

Stephen,
Wow, thank you for this good pointer. Good to know and be aware of.

i think in our case, the two VMs will be on two separate physical servers and the storage will be local to each. But...i will double check.

thanks again,
Fred

freddy12345 · March 23, 2017, 3:16pm

Kristen,
Great suggestions, thank you. We do have local storage backups. But, i think we are going to veer away from the VM replication. mainly because our network guy says it will get complex trying to run Hyper V under OpenStack.
thanks again,
Fred

yosiasz · March 24, 2017, 12:32am

whatever methodology you use TEST IT once in a while. the whole stack in a segregated environment.
whole stack meaning all the moving pieces in it: web application, api server, ssis jobs etc etc

You can do a fire drill to see how long it would take to bring everything back up. Then document everything to death. and you have to do this yesterday

freddy12345 · March 24, 2017, 9:06pm

we are talking about a production server, so i think doing a test like this on a weekend, when usage is slower, is the way to go.
Thank you for this!

jason_clark · March 29, 2017, 8:54am

Please have a look on here, the whole D/R summary list: http://www.sqlmvp.org/disaster-recovery-with-high-availability-options/