Backup and Recovery
During a recent engagement, I had the privilege of getting hands on again with some backup technology, namely CommVault. I'd worked with CommVault extensively in the past as part of a multinational organisation's centralised backup and recovery strategy for the UK. It was great fun getting back to the technology, and it got me thinking about the role of backup and recovery solutions within public and private clouds; more on that later.
The deployment in question focused on protecting vSphere-hosted virtual machines using the Virtual Server Agent, with some specific client installations for SQL availability groups, file systems and Active Directory.
The solution is designed around a NetApp storage stack with SnapMirror and IntelliSnap. An additional NetApp E-Series SAN at the DR site is configured as a large magnetic library, and a small magnetic library is configured on the all-flash array at the primary site.
To oversimplify: within this design, CommVault interacts with the NetApp storage arrays, NetApp Storage Virtual Machines and OnCommand System Manager to manage cross-site replication, called via a Backup Copy job from the CommServe. When the replicated Snapshots have successfully synchronised to the DR site, CommVault initiates auxiliary copies to the DR tape library.
For those not familiar with CommVault, Storage Policies configured via the CommServe manage the lifecycle of data within the system. To back up any data, you'll need to link backup clients and schedules to a Storage Policy. With the NetApp arrays managing replication under the Storage Policy, the backup workflow for this environment is outlined below:
- Backup jobs start from the CommCell Console.
- The VSA quiesces Windows VMs via VSS calls; as this is a VMware configuration, the vStorage APIs are called to create software snapshots and enable delta file creation for each of the guests targeted as contents of the snapshot.
- The NetApp Array APIs are then called to:
- Verify the backup job contents (validating the underlying disk structure for file systems, databases, etc).
- Create a SAN snapshot or clone, retained according to the retention rules for the 'Primary Snap'.
With this design, granular backups are not maintained at the primary site, so there is no need to mount these snapshots to a proxy at this stage.
- Backup copy operations then kick in to replicate the Snapshots to the remote DR site which are then retained in accordance with the ‘Secondary Snap’ policy.
- Upon completion of the backup copy operation the replicated snapshots are mounted on the selected proxy computer at the DR site, for post-snapshot operations.
- In this case, an auxiliary copy operation that takes a granular copy of the snapshot contents to the CommVault tape library.
- Optional cloud storage copies and retention could also be defined and configured for further offsite copies of the data held in the CommVault tape library.
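The steps above can be sketched as a simple orchestration sequence. This is an illustrative model only; the function and stage names below are hypothetical and do not correspond to the actual CommVault or NetApp APIs.

```python
# Illustrative sketch of the backup workflow described above.
# All names are hypothetical -- this models the sequence of
# operations, not real CommVault or NetApp API calls.

def run_backup_cycle(vms, log=None):
    """Run one backup cycle for the given VM names, recording each stage."""
    log = [] if log is None else log

    # 1. Backup job starts from the CommCell Console.
    log.append("job started")

    # 2. Guests are quiesced (VSS) and VMware software snapshots created.
    for vm in vms:
        log.append(f"software snapshot: {vm}")

    # 3. Array APIs verify contents and create the hardware snapshot,
    #    retained under the 'Primary Snap' rules.
    log.append("primary snap created")

    # 4. Backup copy replicates the snapshot to the DR site
    #    ('Secondary Snap' retention).
    log.append("secondary snap replicated")

    # 5. The replicated snapshot is mounted on a DR proxy and an
    #    auxiliary copy writes a granular copy to the tape library.
    log.append("aux copy to tape library")
    return log

stages = run_backup_cycle(["sql-ag-01", "fileserver-01"])
```

The point of the model is the ordering: the granular (auxiliary) copy only happens after replication completes, which is why granular restores come from the DR site in this design.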
With this implementation, there is the potential to restore complete point-in-time VMs from either the primary or the secondary site. From those complete VM restorations it is possible to access individual files, and there is the additional option to conduct point-in-time granular restorations of individual files from the data held at the secondary DR site.
I’ve over simplified things a little here, as the configuration options for IntelliSnap and SnapMirror can be quite complex and worthy of several blog posts. If you are interested in learning more about CommVault they maintain an excellent documentation set.
I would consider this very much a traditional backup and recovery installation, albeit a very solid one, with clever use of virtualisation and SAN technologies to reduce client load and manage storage.
Backup and restoration configuration in a cloud world?
Where new solutions take advantage of SaaS or PaaS, with infrastructure as code and data secured, replicated and retained at the application layer, is there still a need for backup and recovery solutions?
From a configuration perspective, most virtualisation and cloud providers allow the infrastructure configuration to be captured and retained as code. Therefore, perhaps the future solution for retaining, replicating and restoring these configurations sits with more traditional software version control repositories?
Significant and growing communities already exist developing, branching and replicating infrastructure code for the Azure platform. With private repositories and code management tools available within GitHub, Team Foundation Server or even Bitbucket, could infrastructure redeployment one day replace restoration?
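The idea that redeployment could replace restoration can be sketched with a toy version store: every configuration change is committed, and "restoring" means redeploying a known-good revision. The class below is a stand-in for a real repository such as GitHub or Bitbucket; all names are illustrative assumptions.

```python
# Toy sketch: infrastructure configuration under version control,
# where "restore" = redeploy an earlier revision. The store is a
# stand-in for a real git repository; names are illustrative only.
import hashlib
import json

class ConfigStore:
    def __init__(self):
        self._revisions = {}   # content hash -> config document
        self.history = []      # ordered list of revision hashes

    def commit(self, config: dict) -> str:
        """Store a configuration revision and return its content hash."""
        blob = json.dumps(config, sort_keys=True).encode()
        rev = hashlib.sha1(blob).hexdigest()[:10]
        self._revisions[rev] = config
        self.history.append(rev)
        return rev

    def checkout(self, rev: str) -> dict:
        """Fetch a known-good configuration to redeploy from."""
        return self._revisions[rev]

store = ConfigStore()
good = store.commit({"vm_size": "Standard_D2", "count": 2})
store.commit({"vm_size": "Standard_D2", "count": 3})   # later change

# "Restoration" becomes redeploying the earlier, known-good revision:
redeploy_config = store.checkout(good)
```

In practice the store would be a real git repository and the checkout would feed an ARM template or similar deployment tooling, but the workflow is the same: the configuration history is the backup.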
Designing highly available solutions for tier 1 applications should be on the agenda during any design phase. Whether achieved via application-based replication or cloud configurations, the goal is the same: to make a system always available. Within cloud environments this can be achieved without laying out capital for traffic managers, load balancers or top-of-the-line SAN technology; it can instead be configured within the cloud environment and paid for as OpEx.
With a highly available, application-driven replication configuration, which will be faster: conducting a restoration, or rebuilding and re-adding a node? Perhaps launched from a blueprint or JSON configuration, maybe even launched automatically when certain conditions are encountered?
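That automatic rebuild idea can be sketched as a health check that replaces a failed node from a stored blueprint rather than restoring it. This is a hedged sketch under assumed names; in a real cloud the rebuild call would be a deployment API, not a local function.

```python
# Hedged sketch of "rebuild rather than restore": when a node fails a
# health check, a replacement is launched from a stored blueprint
# instead of being restored from backup. All names are hypothetical.

BLUEPRINT = {"image": "web-node-v7", "size": "Standard_D2"}

def rebuild_node(blueprint: dict) -> dict:
    """Stand-in for launching a fresh node from a JSON blueprint."""
    return {"config": dict(blueprint), "healthy": True}

def ensure_capacity(nodes: list, blueprint: dict) -> list:
    """Replace any unhealthy node automatically."""
    return [n if n["healthy"] else rebuild_node(blueprint) for n in nodes]

cluster = [{"config": BLUEPRINT, "healthy": True},
           {"config": BLUEPRINT, "healthy": False}]  # a failed node
cluster = ensure_capacity(cluster, BLUEPRINT)
```

Because the blueprint fully describes the node, the replacement carries no state to restore; state lives in the application-layer replication.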
With tier 2 application data, it is of course possible to make use of some of the features of the public/private cloud. Perhaps tier 2 solutions are configured for storage replication only, with expectations and agreements for recovery times set accordingly.
This is all great, but we must still consider the potential for data corruption within any data set, individual errors, and the management of retention.
For data corruption, individual errors or virus activity, there will remain a need for point-in-time backup and recovery solutions. Whilst many SAN technologies can be configured to retain copies of replicated data for a given number of days, weeks or months, even with deduplication and the potential for tiering storage, within a large environment this space requirement will grow quickly. There also remains the issue that the SAN will replicate whatever it is presented with; by that I mean if you corrupt the data on one side, it will replicate the corruption to the other!
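A back-of-the-envelope calculation shows how quickly that space requirement grows. The figures below (change rate, retention window, dedup ratio) are illustrative assumptions, not measurements from any real environment.

```python
# Rough estimate of extra capacity consumed by retained daily
# snapshots. All input figures are illustrative assumptions.

def snapshot_overhead_tb(dataset_tb, daily_change_rate, days, dedup_ratio):
    """Approximate snapshot overhead in TB.

    Each day's snapshot retains roughly dataset * change_rate of
    unique changed blocks; deduplication reduces that by dedup_ratio.
    """
    daily_delta_tb = dataset_tb * daily_change_rate
    return daily_delta_tb * days / dedup_ratio

# 100 TB dataset, 3% daily change, 90-day retention, 2:1 dedup:
overhead = snapshot_overhead_tb(100, 0.03, 90, 2.0)   # roughly 135 TB
```

Even with a 2:1 dedup ratio, a 90-day retention window on a 100 TB dataset consumes more capacity than the dataset itself, which is exactly why SAN-only retention gets expensive at scale.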
Data management and retention is also a big issue; many industries here in the UK have specific audit requirements around data retention. For example, one organisation I'm aware of had a requirement to retain copies of data until seven years after the end of the business relationship. Does anybody really want to be managing that requirement within SAN or cloud configurations?
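The awkwardness of that rule is that the clock starts from a business event, not a backup date. A minimal sketch of the calculation (assuming a fixed 7-year offset, leap-day handling aside):

```python
# Sketch of the retention rule described above: data must be kept
# until seven years after the business relationship ends. Names and
# the simple year arithmetic are illustrative assumptions.
from datetime import date

RETENTION_YEARS = 7

def retention_expiry(relationship_end: date) -> date:
    """Earliest date the data may be destroyed."""
    return relationship_end.replace(
        year=relationship_end.year + RETENTION_YEARS)

expiry = retention_expiry(date(2016, 3, 31))   # -> 2023-03-31
```

A backup product's retention rules count forward from the backup job, so mapping them onto "relationship end plus seven years" means per-client tracking of the trigger event; pushing that logic into SAN replication schedules or raw cloud storage lifecycle rules is even less appealing.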
Whilst operating and creating environments with a cloud provider, or with advanced private cloud features, allows the potential to minimise the footprint of backup and recovery solutions, the requirements to protect against corruption and individual errors, and to maintain regulatory compliance, mean that, for the time being at least, backup and recovery solutions remain very much vital within any cloud.