Uptime 101

Maximizing uptime of key systems in your IT infrastructure, whether it’s a server, email, or Internet access, is of utmost concern for IT departments. Most companies strive for and most vendors tout “five nines of availability” (99.999% uptime). A few vendors are now touting 100% uptime for some services. But, as with most IT buzzwords, there’s a lot to consider where uptime is concerned.
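
To put “five nines” in perspective, it helps to convert an availability percentage into a yearly downtime budget. The short Python sketch below does that arithmetic. The function name and the 365-day year are assumptions made for illustration; a vendor's SLA may measure availability over a different period or exclude planned maintenance.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Compare common availability targets, from "two nines" to "five nines".
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

At 99.999%, the budget works out to a little over five minutes per year, which is why a single unplanned reboot can consume it.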

Planned vs. Unplanned Downtime

Maximizing uptime can also be viewed as minimizing downtime, so let’s begin by looking at the different types of downtime. Unplanned downtime is what most companies focus their energy on reducing. At the macro level, it is typically thought of as being caused by hurricanes, earthquakes, or extended power outages. At the micro level, server and operating system failures are much more common occurrences, and human error is a regular cause of downtime as well.

Planned downtime is incurred every time an OS patch is installed, a hardware component is upgraded, or a system is migrated. Typically, these types of planned downtime occur during an approved ‘Maintenance Window’ (after regular business hours) in order to minimize any potential impact on business operations.

Disaster Recovery vs. Business Continuity

Next, let’s look at protecting uptime from a higher level: disaster recovery (DR) vs. business continuity planning (BCP). These two terms are often, and incorrectly, used interchangeably. Disaster recovery refers to the ability to return to normal operations after a disaster has been declared. Normal operations are typically achieved when all IT services are back up and running at the same level of performance as before the disaster. Business continuity, on the other hand, is the ability to maintain business operations (usually at a reduced level of service) throughout and in the direct aftermath of a disaster.

Typically, business continuity involves failing services over to a contingency location, while disaster recovery involves failing back to the primary datacenter. In the ‘real world’, these two concepts can range from something as simple as using cell phones and a notebook after a disaster and restoring data from tapes once power is restored, to multiple colocation datacenters using SAN-based replication and automated failover/failback capabilities.

High Availability

High availability (HA) is any means that ensures a certain component or service remains operational for an extended period of time.

The means for HA are typically built into most systems or applications, and include:

  • Redundant components, such as dual power supplies, RAID configurations for disks, and multiple UPS systems, guard against unplanned downtime, while hot-swappable components reduce planned downtime. Other examples include maintaining multiple points of Internet access.
  • Clustering is the use of multiple computing nodes to ensure HA for a particular application. For example, traditional Microsoft Clustering utilizes shared storage and multiple servers to protect critical applications like SQL, Exchange, and even File and Print Servers.
  • Network Load Balancing, either using built-in Windows functionality or more robust hardware appliances from F5 or Cisco, provides HA for applications that are typically static in nature, such as websites. As an example of the difference between clustering and load balancing, web portals are load balanced while the back end databases they connect to are clustered.
  • Replication is the means of copying data from one location to another. The location can be a secondary device in the same datacenter or a remote site. Software, or file-level, replication of files, applications, and even virtual machines is provided by vendors such as DoubleTake, Neverfail, and Vizioncore, while hardware, or block-level, replication is usually built into SAN and NAS devices. Replication technology has been with us for a while now, although efficient failback capabilities are a more recent development. After all, what good is the ability to fail over to a secondary location (business continuity) if you’re not able to fail back (disaster recovery)?
  • Virtualization technology has been a great enabler of HA over the past decade. Building a multiple server VMware environment with shared storage, for example, greatly reduces both planned and unplanned downtime through technologies such as VMotion (where a VM is moved from one physical host to another ‘on the fly’), VMware HA (where if there is a physical server failure, all VMs are automatically restarted on any remaining physical servers), and now Fault Tolerance (where a second ‘copy’ of a VM is constantly updated and ready to go should the original VM ever go offline).
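
The effect of redundancy on availability can be estimated with some simple probability. The Python sketch below is a rough model, not a vendor formula: it assumes component failures are independent, that one working copy is enough to keep a redundant group running, and that a service is up only when every layer it depends on is up. The function names and the sample figures are hypothetical.

```python
import math


def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability of a group of identical redundant components.

    Assumes independent failures; the group is down only when every copy is down.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("at least one copy is required")
    return 1 - (1 - component_availability) ** copies


def chained_availability(layer_availabilities: list[float]) -> float:
    """Availability of a service that needs every layer (e.g. web tier and database) to be up."""
    return math.prod(layer_availabilities)


if __name__ == "__main__":
    # Hypothetical figures: each server is available 99% of the time.
    web_tier = redundant_availability(0.99, copies=2)   # load-balanced pair
    database = redundant_availability(0.99, copies=2)   # two-node cluster
    print(f"Web tier: {web_tier:.4%}")
    print(f"Database: {database:.4%}")
    print(f"End to end: {chained_availability([web_tier, database]):.4%}")
```

In practice failures are rarely fully independent (shared storage, power, or a bad patch can take out both copies), so results like these are best read as an upper bound.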

Finally, keep in mind that uptime can be architected into a server or datacenter, but also into your site by leveraging virtualization and replication to maintain a standby location should your primary site go offline. There are many different ways to achieve uptime and many of them can be achieved within a reasonable budget; it all boils down to understanding the cost of downtime to your business (for example, $5,000 per hour) and determining the likelihood of a downtime event (for example, the probability of a hurricane or other disaster in addition to server failure and human error). Once this is estimated, the justification for an investment in uptime becomes clear.
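
As a rough illustration of that estimate, the Python sketch below multiplies the likelihood of each type of downtime event by its expected duration and the hourly cost of downtime. The $5,000 per hour figure comes from the example above; the event frequencies and durations are invented purely for illustration and should be replaced with your own estimates.

```python
from dataclasses import dataclass


@dataclass
class DowntimeRisk:
    name: str
    events_per_year: float  # estimated likelihood (hypothetical values below)
    hours_per_event: float  # estimated outage length per event (hypothetical)


def expected_annual_cost(risks: list[DowntimeRisk], cost_per_hour: float) -> float:
    """Expected yearly cost of downtime: sum of frequency x duration x hourly cost."""
    return sum(r.events_per_year * r.hours_per_event * cost_per_hour for r in risks)


# Hypothetical estimates, not measured data.
EXAMPLE_RISKS = [
    DowntimeRisk("Hurricane or other disaster", events_per_year=0.05, hours_per_event=48),
    DowntimeRisk("Server failure", events_per_year=0.5, hours_per_event=8),
    DowntimeRisk("Human error", events_per_year=2, hours_per_event=1),
]

if __name__ == "__main__":
    cost = expected_annual_cost(EXAMPLE_RISKS, cost_per_hour=5_000)
    print(f"Expected annual cost of downtime: ${cost:,.0f}")
```

With these sample numbers the expected cost comes to about $42,000 per year, which gives a ceiling for what an uptime investment aimed at these risks is worth spending annually.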

Posted on February 9th, 2010. Filed under None, Technical Education.
