IT Today is brought to you by Auerbach Publications


Designing Backup for Recovery

By Preston de Guise

Our goal in this section is to discuss how a backup system needs to be designed to facilitate recoveries. As mentioned in our introduction, the purpose of a backup is to provide a mechanism to recover, and therefore it follows that the backup system must be designed to allow those recoveries to take place with as little effort or cost as possible.

1. Recovery Performance

In the earlier topic regarding backup performance, we briefly noted that recovery performance must be considered when planning a backup environment, and we'll now examine this principle in more detail.

In many organizations, backup environments are scoped on a backup window, regardless of whether that backup window is real or imaginary. Remember, though, our original definition of backup - a copy of any data that can be used to restore the data as/when required to its original form.

This creates what should be a self-evident rule - the recoverability requirements of the system must play a more significant role in the planning of the backup environment than the backup requirements.

Example: Consider a simple case where an organization implements a new backup system, and configures all backups with four-way multiplexing across each of the three tape drives to ensure that backups complete within a six-hour management-stated backup window. Two large fileservers consume almost the entire backup window, starting first and finishing last, therefore taking up one unit of multiplexing on a tape drive each for the entire duration of the backup.

Following the installation of this configuration, simple random file recovery tests are performed and succeed. Backup duplication is trialed and found to take too long, but because the previous backup system could not duplicate anyway, this is not seen as a loss of functionality, so the system is declared ready and put into production use.

Five months after installation, one of the key fileservers suffers a catastrophic failure at 9 a.m. on the last business day of the month when a destructive virus causes wide-scale data corruption and deletion. The most recent backups are called for, and the recovery commences.

At this point the impact of multiplexed backups comes into play. Although the backup of the system completes just within the six-hour window, a complete file system read from the tapes generated for the host takes approximately twelve hours to complete. On the final tape, a failure occurs in the tape drive, damaging the tape and resulting in the loss of several directories. These directories need to be restored from older backups. As a result, (1) the system has been unavailable for the entire last day of the month, resulting in loss of invoicing revenue for the company; and (2) some records of invoicing were lost due to the tape failure, and during the following month revenue has to be recalculated following complaints from customers due to double-billing.
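The arithmetic behind this failure can be sketched with a toy model. The slowdown factor here is an assumption: reading one client's data from a tape that interleaves several backup streams means reading (and discarding) blocks belonging to the other streams, and the real penalty depends on the drive, the interleaving, and the data.

```python
# Toy model of multiplexing's recovery penalty. The slowdown factor
# is hypothetical; real figures vary by drive and data layout.

def full_restore_hours(backup_hours, slowdown):
    """Estimated wall-clock hours to read back one host's complete
    backup, given the hours the backup took and a slowdown factor
    caused by stream interleaving on tape."""
    return backup_hours * slowdown

# A backup that just fits a six-hour window, read back at half the
# effective streaming rate, matches the twelve-hour restore above:
print(full_restore_hours(6, 2))   # -> 12
```

The point of the model is only that the restore multiplier is invisible at backup time: the six-hour window is met, and nothing in day-to-day operation reveals the penalty until a full restore is attempted.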

Clearly, this is not a desirable scenario for a company to find itself in. Although it is not necessarily easy to redesign a backup system after it has been put in place (particularly when it results in a request for additional budget), the recovery requirements must be clearly understood and catered for in the backup environment. At bare minimum, management must understand the risks the company is exposed to in relation to data protection.

As seen in chapter 6, "Performance Options, Analysis, and Tuning," a multitude of ways exist in which performance can be improved given the type of challenge to the backup environment. Even when a reliable mechanism to achieve appropriate backup performance has been found, it may be that the performance comes at the cost of rendering the recovery untenable.

Backup to Recover. When implementing an upgrade to a new backup system, a company wanted to resolve performance problems in backing up its two main file systems, which were both around 400 GB in size.

Backups for these file systems had reached the point where they crawled through at speeds not exceeding 2 MB/s to DLT-7000. Unfortunately, LTO Ultrium 2 (the chosen replacement tape format) was not going to perform any better, given how dense the file systems were.

As various backup products now support block-level agents on hosts, the block-level agent for the backup software deployed was trialed on the Windows 2000 fileserver. The backup window immediately dropped from approximately fifteen hours to four per file system. (Because only one tape drive could be allocated for block-level backups and the agent was unable to multiplex at the time, the backup window for the two file systems in fact dropped from fifteen hours to eight.) This was still a considerable improvement.

Complete file system recoveries also went through at full tape speed, resulting in a 400-GB recovery performed in three to four hours.

As the file systems were rigorously protected by virus scanning, volume mirroring and array replication, the customer had never needed to perform a complete file system restore. Therefore the key recovery test was actually an individual directory recovery. The directory initially chosen for recovery testing was approximately 40 GB, and that recovery took approximately eight hours.

Investigations showed that the file systems were both on the order of 50 percent fragmented, with the directory chosen exhibiting the worst symptoms, most of its files suffering serious fragmentation problems. This highlighted the performance problems that can arise from file-level recovery from block-level backups.

As individual file/directory recoveries were seen to be more important, the business decision was made instead to optimize the file system backup via generating multiple simultaneous backups from each file system, which provided somewhat better performance. This, of course, resulted in a situation whereby a complete disaster recovery of the file system would take considerably longer to complete due to the multiplexing involved, but this was seen as acceptable, given the reduced need for such a recovery in comparison to the day-to-day file and directory recoveries required.

In this case, after trying to design the system for backup performance, the company elected instead to optimize recovery performance.

A more significant problem can occur in WAN-based backup environments. For some companies, WAN-based backups are designed to centralize the location of backups, with an understanding that the "trickling" of a backup across the WAN is an acceptable occurrence. However, is the same "trickling" acceptable for a recovery? Unsurprisingly, the answer is frequently no. In this case, the recoverability of a remote system may depend on such options as:

  • Recovery of the remote machine's hard drive to an identical machine in the backup server's computer room, then shipping the hard drive out to the remote site
  • Shipping a tape and a stand-alone tape drive out to the remote host for recovery
  • Recovery to a spare machine in the computer room local to the backup server, and then sending the entire replacement server out to the remote site
  • Contingency plans to allow users to work with remote fileservers while a trickle-back recovery occurs

It should be noted that when such "trickle" remote backup systems are deployed, it is frequently done to eliminate the need for dedicated IT staff at the satellite offices. The lack of dedicated staff at the remote offices in turn can impact how the recovery can be performed.
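To see why a trickle restore is rarely acceptable, consider the raw arithmetic for a hypothetical remote fileserver. The 400-GB size echoes the earlier example, but the 2 Mbit/s link speed and the 80 percent efficiency figure are illustrative assumptions only.

```python
# Hypothetical "trickle restore" arithmetic: restoring 400 GB to a
# remote site over a slow WAN link. Link speed and efficiency are
# illustrative assumptions, not figures from the text.

def transfer_hours(size_gb, link_mbit_s, efficiency=0.8):
    """Hours needed to move size_gb across a WAN link, allowing for
    protocol overhead via `efficiency` (fraction of raw bandwidth
    actually achieved)."""
    size_bits = size_gb * 8 * 1024**3              # GB (GiB-based) -> bits
    usable_bits_per_s = link_mbit_s * 1_000_000 * efficiency
    return size_bits / usable_bits_per_s / 3600

hours = transfer_hours(400, 2)
print(f"~{hours:.0f} hours ({hours / 24:.0f} days) across the WAN")
# Compare with couriering a tape or replacement drive overnight:
# roughly a day in transit, plus a local restore at tape speed.
```

Even with generous assumptions, the WAN restore runs to weeks, which is why the shipping-based contingencies listed above so often win despite their apparent crudeness.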

All of these examples and situations should serve to highlight one very important rule - designing a system to meet a backup window is one thing, but ensuring that it can be used to recover systems within the required window may require some compromises to the backup speed, and very careful consideration in advance of just how recoveries will be performed.

2. Facilitation of Recovery

When designing a backup system, it is necessary to consider various key questions that deal with how recoveries will be facilitated.

2.1 How Frequently Are Recoveries Requested?

How the Frequency of Recovery Requests Impacts Backup System Design

Frequent recovery requests. If recoveries are performed frequently, the goal of the backup system should be to ensure that the hardware and configuration are such that recoveries can start with minimum intervention and delay. In a tape-only environment, this will require:
  • Sufficient slots in the tape libraries to ensure that a copy of all backups covering the most-common recovery time span can be held in the libraries for a fast restore start
  • A sufficient number of tape drives so that there will always be a drive available to perform a recovery from (or at least one tape drive set aside as read-only)
An environment featuring both backup-to-disk and tape that also sees frequent recoveries should be designed such that the backups most likely to be required for recovery reside on disk backup units rather than tape, wherever possible. When recoveries are performed frequently, procedures for the recovery will still need to be documented, but there should also be sufficient "in-memory" understanding among the users involved in the recovery as to what steps will be involved.

Infrequent recovery requests. If recoveries are only infrequently performed, then design considerations may be somewhat different. For instance, potentially spending tens of thousands of dollars on a dedicated recovery tape drive used maybe once a month may be unpalatable to a business with limited financial resources; i.e., taking a risk-versus-cost decision, the business may elect to wear the risk so as to avoid a higher capital cost.
Therefore, when recoveries are few and far between, the key design consideration will become the procedures and documentation that describe the recovery process. This is because users and administrators will be less experienced with recoveries, and will need more guidance and prompting than those in environments with frequent recoveries. (Recovery procedures should always be detailed and readily understandable by any administrator of the system they refer to, but this becomes especially important when recoveries are performed only infrequently.)
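The slot-sizing point for frequent-recovery environments can be made concrete with a back-of-the-envelope calculation. All figures here (full backup size, incremental size, retention span, cartridge capacity) are hypothetical, chosen only to illustrate the shape of the estimate.

```python
# Sketch of sizing tape-library slots so the most-common recovery
# span stays in-library. All figures are illustrative assumptions.

import math

def slots_needed(weekly_full_gb, daily_incr_gb,
                 retention_weeks, tape_capacity_gb):
    """Library slots required to keep every weekly full plus six
    daily incrementals per week in-library for the retention span."""
    full_tapes = retention_weeks * math.ceil(weekly_full_gb / tape_capacity_gb)
    incr_tapes = retention_weeks * math.ceil(6 * daily_incr_gb / tape_capacity_gb)
    return full_tapes + incr_tapes

# e.g. 800 GB fulls, 60 GB daily incrementals, five-week retention,
# 400 GB (native) cartridges:
print(slots_needed(800, 60, 5, 400))   # -> 15
```

A real estimate would also account for compression ratios, duplicate copies, and cleaning and scratch cartridges, but even this sketch shows how quickly the in-library slot count is dominated by the retention policy rather than by any single backup.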

2.2 Backup Recency Versus Recovery Frequency

Typically, a backup environment design should aim to meet an 80-percent immediate start rule for recoveries - that is, 80 percent of recoveries should be done without a human physically needing to change or load media. To achieve this, it is necessary to understand the frequency of the recovery requests compared to how recently the backup requested for recovery was performed.

For example, a company might find that its recovery requests are balanced against the backup recency as follows:

  • 40 percent of recovery requests are for data backed up within the last seven days
  • 20 percent of recovery requests are for data backed up within the last seven to fourteen days
  • 30 percent of recovery requests are for data backed up within the last two to six weeks
  • The remaining 10 percent of recovery requests come from monthly and yearly backups (i.e., are more akin to retrieval from archive)

With this in mind, it's clear that the backup system for this environment should be designed in such a way that no less than the first 14 days worth of backups are readily recoverable at minimum notice. (It would be preferable for the first six weeks to be recoverable with minimum notice as well.)

In a purely tape-based environment, this implies a tape library that is large enough to hold one copy of all the tapes generated for this critical-recovery period. For many companies, this turns out to be their "smallest" backup retention period; e.g., if daily incrementals (with weekly fulls) are kept for five weeks, this would mean having a tape library capable of holding five weeks' worth of backup tapes.

In a combined disk-and-tape backup solution, backups should be migrated from disk to tape only when the age of the backup has entered the "less frequently recovered" period.

Obviously, many companies will have different recovery frequency-to-age ratios (though the 80 percent rule is a good starting point if no data or trends are available on the subject), but the end result should be the same: ensure the highest frequency recoveries are served fastest.
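Using the example distribution above, a small helper can find the shortest backup-age span that must remain immediately recoverable to satisfy the 80-percent immediate-start rule. The bucket boundaries come from the list above; the helper itself is an illustrative sketch, not part of any backup product.

```python
# Given a recovery-request age distribution, find the shortest
# backup-age span that must stay immediately recoverable (no media
# handling) to meet an immediate-start target. Buckets are
# (max_age_days, share); None marks the open-ended archive bucket.

def immediate_start_span(buckets, target=0.80):
    """Return the max age (days) of backups that must remain on disk
    or in-library so that `target` of requests start immediately."""
    cumulative = 0.0
    for max_age_days, share in buckets:
        cumulative += share
        if cumulative >= target:
            return max_age_days
    return None  # target not reachable with these buckets

buckets = [(7, 0.40), (14, 0.20), (42, 0.30), (None, 0.10)]
print(immediate_start_span(buckets))          # -> 42
print(immediate_start_span(buckets, 0.60))    # -> 14
```

For this distribution, the 80-percent rule lands at the six-week mark, which matches the observation above that keeping the first six weeks immediately recoverable is preferable to stopping at 14 days.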

3. Who May Want to Perform Recoveries?

This is often a controversial topic within an organization, and the answer is rarely the same in two companies, even in the same line of business. Therefore, here are a few answers to this question that are commonly heard, some of which overlap, and some of which are contradictory:

  • Only system administrators should perform recoveries.
  • Only backup administrators should perform recoveries.
  • Application administrators should perform the recoveries for their own application.
  • Help desk staff and operators should be able to perform user-file recoveries.
  • Help desk staff and operators should be able to perform all recoveries, including complete system recoveries.
  • Help desk staff should not have access to the backup system.
  • End users should be able to recover their own files.
  • End users should not be able to recover their own files.

Based on this common subset of answers received to this question, it is obvious that there are no industry-standard responses.

When it comes to designing the system with recovery in mind, always remember that the further removed someone is from actually running the backup server, the less they'll understand about backup and recovery. This can be presented as shown in Figure 7.1.

Figure 7.1 Role relationship with backup systems

Where the backup administrator and the system administrator are different people, the backup administrator should usually understand more of the backup environment than the system administrator does. In turn, a system administrator will typically understand more of the backup environment than an application administrator does, and so on.

The point of this "onion skin" approach is to highlight that the further away users are from administering or interacting with a backup system, the less they'll understand about it. We introduced the idea of the amount of information a user might receive about the backup system in the training section (see chapter 5). As it is not practical to train all people who might interact with the backup system to the same level, it stands to reason that the way in which these people interact with the backup system will be different.

Therefore, for each "tier" away from the backup environment a user is, it is necessary to provide a progressively simpler mechanism for the user to interact with it, and the more necessary it is to deal with contingencies automatically on behalf of the user, perhaps without the user even being aware this is occurring. Examples of the impacts this has include, but are not limited to:

  • Ensuring that recoveries don't need to wait for media to be retrieved from off site (for end users, this sort of "unexplained wait" may result in multiple cancellations and restarts of the restore process before logging a fault with the administrator of the system).
  • Ensuring that each role has appropriate documentation for the type(s) of recoveries they may be reasonably expected to perform. This is directly related to actually deciding the extent to which the different roles can perform recoveries. For instance, many system administration teams are happy with the notion of providing recovery capabilities to operators and help desk staff for file and directory recoveries, but will not hand over responsibility for application and system recoveries. The level of documentation provided will primarily be based on the types of recoveries expected.
  • Backup software should be configured in such a way that the permissions of the user performing a recovery are reflected in what is recovered. For instance, does the backup product disallow the recovery of employee payroll data (which legally can be accessed only by human resources staff) by a help desk operator? (Additionally, watch out for anti-virus software running as a different user account than the actual user performing the recovery. When not properly configured, this can block the recovery due to not having sufficient permissions to scan the data being retrieved.)
  • Does the backup software provide some form of auditing to let administrators know that (1) a particular user is trying to recover files or (2) a particular user has recovered a particular set of data?
  • Does the backup software provide a simple monitoring mechanism that doesn't compromise system security, and is straightforward enough that users can see their recovery request in progress, and what might be impacting its ability to complete?
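A minimal sketch of the permission and auditing checks described in the list above might look like the following. The ACL map, role names, and function names are all hypothetical; real backup products implement these controls through their own security models, and a real deployment would defer to the operating system's permissions rather than a hand-built table.

```python
# Illustrative sketch (not any specific backup product's API) of two
# checks from the list above: enforce the requesting user's
# permissions, and write an audit record for every recovery attempt.

import logging

logging.basicConfig(level=logging.INFO)
audit = logging.getLogger("recovery-audit")

# Hypothetical access map: path prefix -> roles allowed to restore it
ACL = {
    "/data/payroll": {"hr"},
    "/data/shared": {"hr", "helpdesk", "user"},
}

def may_recover(user_roles, path):
    """True if any of the user's roles is permitted to restore path."""
    for prefix, allowed_roles in ACL.items():
        if path.startswith(prefix):
            return bool(user_roles & allowed_roles)
    return False  # default deny for unmapped paths

def request_recovery(user, user_roles, path):
    """Check permissions and audit the attempt, allowed or not."""
    allowed = may_recover(user_roles, path)
    audit.info("user=%s path=%s allowed=%s", user, path, allowed)
    return allowed

# A help desk operator is refused payroll data, and the refusal
# itself is logged for the administrators:
print(request_recovery("jsmith", {"helpdesk"}, "/data/payroll/2009.xls"))
```

Note that both outcomes are audited: the questions in the list above ask not just whether a recovery can be blocked, but whether administrators can later see who attempted and who performed which recoveries.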

About the Author
From Enterprise Systems Backup and Recovery: A Corporate Insurance by Preston de Guise


© Copyright 2010 Auerbach Publications