As a DBA, it is advisable to collaborate with your colleagues from the operations team. Especially when you are planning to upgrade to SQL Server 2014 and they are planning to buy new hardware for your new database servers. That is exactly what I recently did, and in this blog series I am going to explain and document the entire process of upgrading our databases from SQL Server 2005 to SQL Server 2014. In this first post I am going to talk about the Recovery Point Objective and Recovery Time Objective, because that’s where a project like this all starts.
A new SAN
Last summer the selection process for the new SAN storage began. My colleague from the operations team asked me how much storage we needed for our databases, so I had to do a little capacity planning. That involved a little more than just summing up the file sizes of every database involved. I had to decide on a High Availability and Disaster Recovery (HA/DR) strategy for our new SQL Server 2014 environment, because the storage capacity needs strongly depend on it. And before choosing a HA/DR strategy, I had to know what our Recovery Point Objective (RPO) and what our Recovery Time Objective (RTO) was.
RPO/RTO and HA/DR
Before I continue, I’d like to briefly explain short what RPO/RTO and HA/DR mean, in case you don’t know. If you do, then you can skip this section.
Recovery Point Objective, or RPO, determines the maximum amount of data the business can afford to lose in case of a data loss event. This is measured in time. For example, you might decide that the business can stand to lose 1 minute or 1 hour of data. In other words: to which point in time can we recover from disaster: 1 minute before disaster struck, or 1 hour?
Recovery Time Objective, or RTO, determines the maximum amount of time it takes to get the systems online after they go down. This includes the time it takes for the on-call guy to get out of bed after being woken up by his pager, logging into his computer, analyzing the problem(s), etcetera…. until the system is online again.
High Availability, or HA, determines how the SQL Server application is configured to handle isolated failures, i.e. to failover to a secondary system in case the primary system fails because of a server failure. Failover Clustering and Database Mirroring are both HA technologies.
Disaster Recovery, or DR, determines how the SQL Server application is configured to handle widespread outages, i.e. failover to a secondary site if the primary site is struck by a disaster like flood, quake, SAN failure, etc.
Getting my manager and my CTO involved
Determining RPO/RTO is not purely technical: I had to talk to my manager and to our CTO to understand what business needs had to be met. I actually had to explain what RPO/RTO meant, and presented them with 2 tables from Brent Ozar Unlimited to help them select the appropriate RPO and RTO for HA as well as DR. I also showed them this diagram, also from Brent Ozar Unlimited, to show them the associated costs that go with their choices. By doing all this, I got them involved in the process, created awareness and set expectations. Their response was:
RPO (How much data are we willing to lose in the event of an outage?):
|None to 1 second|
RTO (How long can a system be down?)
|None to 1 second|
To be honest, I’m a lucky DBA, because I actually advised them to make this choice and they trusted my judgement. Anyways, their choices meant:
- I had to choose a HA strategy where, in the event of a failure within the primary data center (e.g. a server dies), a failover has to occur within 1 minute (RTO) with a maximum data loss of 1 minute (RPO).
- I had to choose a DR strategy where, in the event of a complete primary data center failure (e.g. a flood or earthquake), we have to get our databases online in our secondary datacenter within 1 hour (RTO) with a maximum data loss of 1 hour (RPO).
What I chose, and how these numbers translate to the HA/DR technologies I’ve chosen, will be part of the next post in this series.
In this blogpost I showed that, before doing anything else towards upgrading our SQL Server databases, I sat down with my manager and our CTO to determine the RPO and RTO for our databases. When I knew these numbers, I was better able to choose a HA/DR strategy, which I am going to talk about next. Because I involved my manager and our CTO, this exercise also had a nice little side effect, namely setting expectations and creating awareness for our business users.