InfoSecurity India's First Magazine on Comprehensive IT Security
InfoStore May 2009
Cover Story
Business Continuity:
Effective Strategy is a Must

Protecting high-value data and delivering uninterrupted service, that is, business continuity irrespective of the distance between two information-exchanging entities, is by all counts the top objective of any IT organization today.

Meeting the goal of providing uninterrupted service, and thereby sustaining continuous business growth, presents a number of challenges: traditional data protection strategies are complex, expensive and hard to manage, and require extensive additional infrastructure. With many data protection technologies, the distance between primary and secondary sites is also limited, since primary application performance can be impacted. Protection against regional disasters and power outages, together with numerous regulations, is driving the need for extended-distance disaster protection.

In September 2002, the US based Federal Reserve, the Securities and Exchange Commission (SEC), and the Office of the Comptroller of the Currency (OCC) jointly published the "Draft Interagency White Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System," in direct response to the terrorist attacks of September 11, 2001. This outlined "preliminary conclusions with respect to the factors affecting the resilience of critical markets and activities in the U.S. financial system; sound practices to strengthen financial system resilience; and an appropriate timetable for implementing these sound practices." The agencies solicited comments on the draft white paper, and received many letters from leaders of financial firms, industry associations, technology companies, and other agencies.

Undoubtedly the most controversial aspect of this interagency white paper was the suggestion that those financial institutions that "play significant roles in critical financial markets" must have fully operational recovery sites located at least 200-300 miles away from the primary data center site. In addition to protection against regional disasters, there could be other reasons for deploying systems across multiple data centers in different geographic locations, such as providing "local access" to users spread across a wide geographic area, or taking advantage of existing IT skills and infrastructure in the company's geographically dispersed data centers.

Is Only Mainframe Data Critical for the Enterprise?

Windows server operating systems have become accepted in high-end, mission-critical applications, and as a result the requirements for disaster tolerance and business continuance for these systems are becoming more and more important. Microsoft offers a robust clustering technology (MSCS) as part of the Windows operating system; however, this is generally deployed for failure protection within a campus or data center environment.

The goal is to ensure that there is no single point of failure. In other words, the loss of a single component or complete site failure cannot cause applications to become unavailable. In extreme cases, a complete site can fail, either due to a total loss of power or through a natural or artificial disaster. More and more businesses are recognizing the value of deploying mission critical solutions across multiple geographically dispersed sites.

A new data protection approach utilizing an intelligent replication appliance coupled with Windows clustering technology can be used to create a highly resilient infrastructure across data centers that are thousands of miles apart, protecting applications automatically against all types of failures as well as local or regional disasters. In addition, this new network-based data protection architecture can provide unique features such as bandwidth optimization and support for heterogeneous storage and server environments.

Replication Requirements Unique to Clusters

Clusters are defined as two or more computer systems that together provide a highly available and highly scalable platform for hosting applications. MSCS clusters host applications that use failover to achieve high availability. The failover mechanism is automatic, and the configuration ensures that the loss of one site does not cause a loss of the application.

To make a multi-site MSCS configuration work, the replication infrastructure has to solve several specific issues:

* Making sure that multiple sites have independent copies of the same data

* Making sure that each site has its own copy of the data so that if one site is lost, the applications can continue

* Ensuring that changes to the data at one site are replicated in a consistent manner to the other sites so that in the event that the first site fails, the changes are available in the second site so that the applications will run uninterrupted

* Ensuring that the data between two sites stays consistent at all times

* Replicating data across sites in both directions to ensure failover-failback

Geographically dispersed cluster configurations should be implemented with care, especially around the storage and data replication components of the solution. The system should ensure that no failure results in data corruption and that cluster integrity is always maintained. The most difficult challenge, specifically with geographically dispersed clusters, is to distinguish between a communication failure between sites, where the other site is still alive, and a site failure, where the other site is no longer available to run applications.

The MSCS architecture handles this issue using a single quorum resource in the cluster that is used as the tie-breaker to avoid split-brain scenarios. A split-brain scenario can happen in the above case when all of the network communication links between two or more cluster nodes fail. In these cases, the cluster may be split into two or more partitions that cannot communicate with each other.
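The tie-breaker idea can be illustrated with a minimal Python sketch. It is a simplified model, not the MSCS implementation: the `QuorumResource` class, the `arbitrate` function and the site names are hypothetical and exist only for illustration. The first partition to claim the quorum resource keeps running; every other partition is refused and must stop hosting applications.

```python
import threading


class QuorumResource:
    """Hypothetical stand-in for the single quorum resource (the tie-breaker)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None

    def try_acquire(self, partition_id):
        """Grant ownership to the first partition that asks; refuse all others."""
        with self._lock:
            if self._owner is None:
                self._owner = partition_id
            return self._owner == partition_id


def arbitrate(quorum, partitions):
    """Return the set of partitions allowed to keep running applications."""
    return {p for p in partitions if quorum.try_acquire(p)}


# Two sites lose contact with each other and both try to claim the quorum.
survivors = arbitrate(QuorumResource(), ["site_a", "site_b"])
```

Because only one partition can ever own the quorum resource, at most one side of a split cluster continues to serve applications.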

Using an Intelligent Replication Appliance to Set up a Geographically Dispersed Cluster

A geographically dispersed cluster generally deploys multiple storage arrays, with a minimum of one at each site. The replication system is configured to replicate the application data in both directions so that, in the event of site failure, the application data is preserved so that the failover servers can continue to provide the services and applications. In addition, the consistency of the quorum volume should be maintained in a synchronous manner to guarantee operations of the MSCS cluster independent of any type of failures. Synchronous replication is used for the quorum volume, which means that any data written by MSCS on one node at one site will not complete until the change has been made on the other site.
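The synchronous behaviour described for the quorum volume can be sketched in a few lines of Python. This is an illustrative model under simplifying assumptions, not a vendor implementation: `Volume`, `ReplicationError` and `synchronous_write` are hypothetical names, and each volume is just an in-memory dictionary. The point it shows is that the caller receives an acknowledgement only after both sites have applied the write.

```python
class ReplicationError(Exception):
    """Raised when a write cannot be applied at both sites."""


class Volume:
    """Hypothetical in-memory stand-in for a volume at one site."""

    def __init__(self, reachable=True):
        self.reachable = reachable
        self.blocks = {}

    def write(self, block, data):
        if not self.reachable:
            raise ReplicationError("site unreachable")
        self.blocks[block] = data


def synchronous_write(local, remote, block, data):
    """Acknowledge the write only after both sites have applied it."""
    remote.write(block, data)  # raises before the local copy is touched
    local.write(block, data)
    return "ack"
```

If the remote site is unreachable, the sketch raises an error instead of acknowledging, so the two copies of the quorum volume never silently diverge.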

Asynchronous replication is used for the data volumes, which means that a change made to the data at one site is acknowledged locally and then replicated to the second site shortly afterwards. It is important, however, that the consistency of the data volumes is maintained. This means that write-order fidelity is guaranteed by the replication system, so that the remote data is always consistent. This is important since most applications can recover from crash-consistent states, but very few (if any) can recover from out-of-order I/O sequences, which may leave the application totally unusable.
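Write-order fidelity can be pictured as a first-in, first-out queue between the two sites. The following Python sketch is a simplified illustration, with the hypothetical class name `AsyncReplicator` and in-memory dictionaries standing in for the volumes: writes are acknowledged as soon as they land locally, and are applied remotely strictly in the order they were issued.

```python
from collections import deque


class AsyncReplicator:
    """Hypothetical model of asynchronous replication with ordered delivery."""

    def __init__(self):
        self.local = {}
        self.remote = {}
        self._queue = deque()  # pending writes, oldest first
        self._seq = 0

    def write(self, block, data):
        """Apply locally, queue for the remote site, acknowledge at once."""
        self.local[block] = data
        self._seq += 1
        self._queue.append((self._seq, block, data))
        return self._seq

    def drain(self, max_writes=None):
        """Ship queued writes to the remote site in their original order."""
        sent = 0
        while self._queue and (max_writes is None or sent < max_writes):
            _, block, data = self._queue.popleft()
            self.remote[block] = data
            sent += 1
        return sent

    def lag(self):
        """Number of writes the remote site is behind."""
        return len(self._queue)
```

At any moment the remote copy equals the local copy as it stood a few writes earlier, which is exactly a crash-consistent state. If the link is interrupted part-way through `drain`, the remote site is behind but never out of order.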

A Cost-Effective, Less Complex Alternative

Intelligent replication appliances, coupled with clustering configurations, effectively address the cost, complexity and management issues that have limited traditional data protection solutions, simplifying infrastructures while extending application protection efficiently over long distances. The ability to use a network-based appliance to enable low-cost connectivity, bandwidth optimization and long distance replication services allows IT organizations to deliver 24X7X365 availability of business information (with dramatic savings in operational costs) and ensures that information will be immediately available in the event of a complete or partial site failure.

Bandwidth Optimization: Using intelligent bandwidth reduction technologies, appliances can deliver unprecedented reduction in bandwidth requirements. This enables the system to dramatically reduce WAN costs, particularly over long distances.

Bi-Directional Replication Over Existing Infrastructures: The appliance can enable bi-directional use across heterogeneous server and storage platforms, with guaranteed data consistency across multiple servers and storage platforms in the event of any possible failure or disaster.

Protection for the Entire Data Center: Using an intelligent replication appliance, users can protect the transactional data, multi-tiered applications and other business-important information in a data center (including operating systems, working files and e-mail) to bring point-in-time protection of all applications for end-to-end immediate recovery in case of a failure. We have entered a time where cost-effective, intelligent replication solutions can help protect data centers located thousands of miles apart, automatically and cost effectively.

Need for Data Protection

Ideally, enterprise data protection strategies should be centered on enabling the rapid recovery of business operations in the event of a wide range of failures. The continuum of failure types spans the range from a simple hardware component failure to a widespread disaster that may take out one or more data centers.

Companies recognize that data loss represents a business risk. Three types of damage may occur because of data loss. First, data may be unrecoverable. Next, data may be recoverable but may require considerable time to restore. This scenario—the most likely—assumes that data is backed up in some other place, separate from the primary source. In some cases, not all the data may be recovered. This is a common problem with data restored from nightly backups. Finally, while data is unavailable, either permanently or temporarily, applications not directly related to lost data may fail. This is especially true of relational databases that reference other databases.

A corporation might lose important data due to disasters (both natural and man-made), security breaches, accidents or unintended user actions, and system failures. Business continuity is the ability of a business to continue to operate in the face of disaster and is of utmost importance to all enterprises across sectors. Protecting data and the access to it is a primary component of business continuity strategies. Restoring systems whose data has been destroyed is useless. IT needs to ensure that the data entrusted to it survives.

In addition to regulatory challenges, data protection challenges extend to meeting recovery time objectives (RTOs) and recovery point objectives (RPOs); meeting backup window and application impact requirements; and managing increasingly complex data protection environments built upon multiple disparate tools. Although many customers focus on their backup window, it’s actually the RTO and RPO that should really drive the design. The next challenge is the mismatch between the speed of tape drives and the other components in the environment. Believe it or not, tape drives are actually too fast for the job. This means that you can’t just solve an RTO challenge by buying more tape drives.

It’s also difficult to meet RPO requirements when backups cannot be completed within specified backup windows. In addition, running backups should not significantly impact the application they are backing up. If they are impacting it, business units may require that the backup window be shortened even further. Ideally, backups should finish quickly and have as little impact on the application as possible. Accomplishing that with typical data protection systems, however, can be quite a challenge. This, coupled with regulatory compulsions, makes data protection extremely complicated.
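How RTO and RPO drive the design can be made concrete with a rough calculation. The Python sketch below is a simplified model with a hypothetical function name and illustrative figures that do not come from any survey or product: it treats the worst-case data loss as the full interval between backups, and the recovery time as data volume divided by restore throughput.

```python
def check_objectives(backup_interval_h, data_gb, restore_gb_per_h, rpo_h, rto_h):
    """Compare a simple backup scheme against its RPO and RTO targets.

    Assumptions (simplifications for illustration):
    - worst-case data loss equals the interval between backups
    - restore time is data volume divided by restore throughput
    """
    worst_case_loss_h = backup_interval_h
    restore_h = data_gb / restore_gb_per_h
    return {
        "worst_case_loss_h": worst_case_loss_h,
        "restore_h": restore_h,
        "rpo_met": worst_case_loss_h <= rpo_h,
        "rto_met": restore_h <= rto_h,
    }


# Illustrative figures only: nightly backup of 2,000 GB restored at 500 GB/hour,
# measured against a 4-hour RPO and an 8-hour RTO.
result = check_objectives(24, 2000, 500, rpo_h=4, rto_h=8)
```

In this made-up case the restore fits comfortably inside the RTO, yet the nightly schedule cannot meet the RPO, which is why the objectives rather than the backup window should shape the solution.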

A typical reaction to the challenges mentioned above is to purchase multiple point solutions that each address a single challenge; one product for traditional backup and recovery, another for point-in-time recovery, and yet another for archiving. This results in a more complex data protection environment that requires additional resources to manage and maintain. The complex data protection environment can also increase the risk of a human error when recovering data, as individuals attempt to understand which solution should be used to recover the data and maintain the operational support of the various point solutions.

Meeting each of the challenges above while still maintaining a financial competitive advantage is yet another challenge. An organization’s ability to recover data quickly and accurately can set it apart from the competition, but only if it can do so at a lower cost. The value of the data being protected must be appropriately matched to the cost of the product(s) protecting the data. This cost is not just product cost, but also the cost of managing the data protection solution(s). The more complex the solution, the higher the management and operational cost is. All information in a company does not have the same value and the value of any given piece of information changes over time. Therefore, an ideal data protection solution should be able to provide different levels of data protection to different pieces of information over time.

Data protection was earlier equated with only backup and recovery. Now data protection encompasses not only traditional backup and recovery, but also technologies such as archiving, snapshots, replication, and security. Given the complexities of the enterprise, the best way to ensure optimum results is to combine effective use of tiered disk and tape with global storage policies. Backup and recovery can also be enhanced significantly by using full-text indexing. Local and persistent snapshots ensure even faster recoveries. A unified platform comprising modules for data backup and recovery, migration, archiving, snapshots, and replication, all managed through a single console and database, can meet the data protection needs of the enterprise and ensure business continuity.

Present Long Distance Disaster Recovery Solutions

Traditionally, there have been two approaches to this problem. Tape-based backup solutions store a point-in-time copy of operational data on inexpensive media that can be moved to a remote location and safely stored. If recovery is required, data can be restored from tape to disk at a recovery site and then used to restart critical applications.

The other approach has been to use asynchronous replication to continuously maintain a relatively up-to-date copy of operational data on disk at a remote site. Any changes to operational data at the local site are sent across a network to be applied at the remote site as well. If recovery is required, servers and applications at the remote site can be brought up using this copy. Each of these solutions introduces major operational trade-offs, however. Tape-based backup solutions suffer from three basic problems: backup windows, data loss and data integrity/recoverability.

The backup window is the amount of time an application must be offline for a backup to be performed. In today's globalized environment, 24x7 availability is a strong requirement, leaving literally no time when critical applications can be down, even for just a minute or two. Even the use of snapshot technology still requires the application to be brought down, causing operational impact.

Tape-based backups are also point-in-time copies, current to the point when the backup was taken. Given the backup window problem, backups are not taken very frequently. Most environments are backed up only once a day at the most, and many are only backed up every several days or on a weekly basis. If recovery is required, any changes to the operational data made since the last backup are lost.

Finally, tape media integrity can introduce non-deterministic recovery problems. There is some risk in the data conversion that takes place when data is backed up from disk to tape, and then restored from tape back to disk. Tapes must be managed with some manual labor at some point throughout the off-site storage process, and this introduces the possibility of human error (tapes could be misplaced or lost).

Tapes also wear out more quickly than disks, particularly if they are being used over and over for backup purposes. Research done by Gartner Group in the second half of 2003 indicates that one in four tapes has unrecoverable files. The unfortunate aspect of this is that it is not clear whether a file is unrecoverable until recovery is attempted, and at that point it is too late to do anything about it.

Box Item
Building a Better Business Continuity Plan:

  1. Consider potential site loss and area-wide disasters that impact an entire region and the resulting inaccessibility of staff.

  2. Address interdependencies, both market-based and geographic, among your business eco-system participants as well as infrastructure service providers.

  3. Establish recovery time objectives and redundancy acceptance levels for each application and required system.

  4. Accommodate mixed environments, where a variety of databases are present. Disaster recovery and high availability solutions must be heterogeneous to be complete and effective.

  5. Evaluate the protection and recovery requirements of OLTP systems versus batch systems.

Addressing The Issue

Replication does a good job of addressing these issues. Once installed, replication can run continuously in the background, effectively removing the backup window issue. Data loss is minimized relative to tape since writes are continuously being sent across the network to the remote site. In almost all cases, replication will allow recovery from much more current data than tape. And because data at the remote site is stored on disk in native disk format (not converted to a tape format for storage on tape), there is a much higher chance of data being recoverable when it is required.

Historically, replication has been significantly more expensive than tape due to the price differential between disk and tape hardware. But newer disk technologies such as ATA are closing the cost gap while at the same time providing more reliable solutions with faster recovery. The disk-to-disk data protection trend is definitely taking hold. In 2004, an Enterprise Storage Group survey showed that 83% of enterprise users and 59% of mid-tier users have either already deployed or state that they will purchase some form of disk-based data protection technology within the next 24 months. Replication is one form of disk-to-disk data protection that will benefit from this trend.

The first form of replication available was synchronous replication. In synchronous replication, widely deployed for sites within 30-50 miles of each other, writes at both sites must complete before a write acknowledgement is sent back to the critical application. The write latency between sites then becomes a constraint limiting the viable distance of synchronous replication configurations. To address the concerns over widespread disasters in this post-9/11 era, synchronous replication is not sufficient, in most cases, because of distance limitations.
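The distance constraint follows from simple physics. The Python sketch below estimates the round-trip propagation delay that every synchronous write must wait for, assuming light travels through optical fiber at roughly 200,000 km per second (about two-thirds of its speed in a vacuum). The function names are illustrative, and the model ignores equipment delay, protocol overhead and the fact that real cable routes are longer than the straight-line distance.

```python
KM_PER_MILE = 1.609344
FIBER_KM_PER_SECOND = 200_000.0  # assumed signal speed in optical fiber


def round_trip_ms(distance_miles):
    """Minimum round-trip propagation delay between two sites, in milliseconds."""
    distance_km = distance_miles * KM_PER_MILE
    return 2 * distance_km / FIBER_KM_PER_SECOND * 1000


def max_serial_writes_per_second(distance_miles):
    """Upper bound on writes per second when each write waits for the previous one."""
    return 1000 / round_trip_ms(distance_miles)


near = round_trip_ms(50)    # roughly 0.8 ms
far = round_trip_ms(3000)   # roughly 48 ms
```

At 50 miles the added delay is under a millisecond per write. At 3,000 miles each write waits about 48 milliseconds, which caps a strictly serialized stream of writes at around 20 per second before any other overhead is counted.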

By its very nature, asynchronous replication is designed for long-distance configurations. In asynchronous replication, the write to the remote site is decoupled from the write to the local site using a local queue of some sort. This means that critical application performance is unfettered by the distance between sites, opening up the ability to support configurations spanning literally thousands of miles.

One key point to understand about asynchronous replication is that the asynchronous nature of the write does mean that there can be some lag time between writes being applied at the local and remote sites, depending on network bandwidth and distance. This lag time is generally measured on the order of seconds or minutes, however, as opposed to tape where the "lag" time (the last backup point) is often measured in days.

Asynchronous replication is available in a variety of forms. Storage-based solutions enable asynchronous replication between enterprise arrays from the same vendor and can support heterogeneous servers, but generally force vendor lock-in at the array level, driving high cost. Server-based solutions support mirroring between servers running the same operating system and can support heterogeneous arrays, but can also impose vendor lock-in at the volume manager or file system level. Depending on the vendor, architectural limitations may also impose performance and scalability issues.

Appliance-based solutions support both heterogeneous servers and storage, but force new hardware purchases (appliances) and can introduce performance bottlenecks that limit scalability at the appliance level. Generally, all of these types of solutions support the ability to replicate data over IP-based networks. If long distance replication is required, it is important to understand the features and limitations of each approach to determine if it is appropriate for your environment.

Vendors such as Hitachi, HP and IBM are currently developing network device-based solutions that are slated for availability in 2006 and 2007. These are expected to be operating system-agnostic, with no requirement for any host-based components. The availability of switch-based options will offer users increased choice in how they deploy replication.

Conclusion

Business resilience is the ability to recover from adverse events such as natural disasters, technological failures, human error, or intentional harm. Enterprises must restore operations as quickly as possible, regain momentum, and recover as much information as possible. In short, disaster recovery is about reducing the time to return to operations and mitigating data loss.

These goals require a viable business continuity plan. But no plan is static. New business practices, changes in technology, and escalating concerns about vulnerability have focused even greater attention on the need for effective business continuity planning and have altered the benchmarks of an effective plan.

-By: 'InfoStore' Bureau.



© 2006-07 'InfoStore' magazine. All rights reserved.