Systems Availability: The View from 30,000 Feet
Do you know. . .?
- The difference between resilience and reliability?
- How to calculate the increase in reliability by adding a parallel component?
- How to break up a server/network combination into physical domains for availability calculations?
- What percentage of (recorded) outages are due to hardware failure?
- What nonhardware factors cause outages? Can you name a dozen?
- How a system can be down when all components are working?
- The full breadth of what needs to be considered when designing and operating high availability services?
- Enough about high availability to mentor someone in it? To tutor your boss?
- What Lusser says about series components and reliability?
- What Amdahl's, Gunther's, and Gustafson's Laws are all about?
If your answer to all the questions is yes, read no further and go out and play golf, go fishing, or drink beer (or all three). If any answers are no, please read on (see Figure 1).
This is our starting point in the discussion of availability, its theory, design, practice, and management, and I hope you and your organization will benefit from it. The management disciplines are the ones I found missing from most literature on high availability (HA) I've seen. Unmanaged technology can be like a loose cannon on a rolling ship: dangerous.
As well as learning from it, I hope you enjoy the book. I enjoyed writing it, unlike Hilaire Belloc writing one of his books: "I am writing a book about the Crusades so dull that I can scarcely write it." Translating erudite texts so that I could understand the topic well enough to write about it has taught me a lot, for which I am grateful. I hope it helps you.
Figure 1. The curse of downtime! (From IBM Redbook: SG24-2085-00.)
Availability in Perspective
Availability seems an obvious entity to comprehend. In information technology (IT) terms, it is the presence of a working component or system, which is performing its job as specified. It has three connotations:
- Is it working or not?
- What percentage of time is it working according to specification?
- What is this specification that explains what working means?
We will see later that the last property above is the subject of an agreement between interested parties and is absolutely key to the topic of HA.
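The second connotation, availability as a percentage of time, is commonly calculated from mean time between failures (MTBF) and mean time to repair (MTTR). The sketch below is illustrative only; the MTBF and MTTR figures are invented, not taken from the text.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = uptime / (uptime + downtime)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A server failing on average every 1,000 hours and taking 4 hours
# to repair is available about 99.6% of the time.
a = availability(1000, 4)
print(f"{a:.4%}")
```

Note that the same formula applies at every level of the onion: component, server, system, or service, provided the MTBF/MTTR figures are measured at that level.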
Murphy's Law of Availability
This apparently immutable law, often called The 4th Law of Thermodynamics, states: "If anything can go wrong, it will." Because the law is probabilistic, this book will, I hope, help to minimize its impact, certainly in the case of IT services and supporting systems.
Availability Drivers in Flux: What Percentage of Business Is Critical?
Until recently, the focus of HA was on hardware and software and, unfortunately, in many organizations it still is. There is a change in the need for HA, and in the reasons for lack of it, as perceived by businesses and reported in a study by Forrester Consulting, dated February 2013 and titled "How Organizations Are Improving Business Resiliency with Continuous IT Availability."1
Table 1. Top Risks to Business Services Availability
The report indicates a shift in the types of risks, which businesses see as affecting the availability of their applications and services. These are outlined in Table 1, the result of surveying 246 global business continuity (BC) decision makers.
The outcome of these concerns is as follows:
- Upgrading BC/disaster recovery (DR) is seen as a top IT priority (61% say high/critical).
- Improving BC/DR drives the adoption of x86 virtualization (55% say very important).
- Many organizations have already adopted active–active configurations.
- Continuous availability achieves both operational and financial benefits.
- More organizations are ready for continuous availability.
- About 82% lack confidence in their (current) DR solutions.
- They believe that off-the-shelf continuous availability technology is mature.
The survey concludes by saying:
Organizational demands for higher levels of availability will only increase. It's not a question of if but how IT operations will achieve these demands cost effectively. By combining HA/DR in a single approach organizations can achieve higher levels of availability, even continuous availability, without the huge capital expenditures and costly overhead of separate solutions and idle recovery data centers.
Another Forrester survey (2010) classified the services supported by IT as approximately one-third each mission critical, business critical, and noncritical. This is a simple figure to bear in mind throughout this book when thinking "so what?", as it tells us that two-thirds of a business's activity is extremely important. Remember that two-thirds.
Historical View of Availability: The First 7 × 24 Requirements?2,3
System reliability has an interesting history with its genesis in the military. It is also notable that much of the theory of reliability was developed by and for the military and, later, by the space programs. In fact, Lusser of Lusser's Law worked with Wernher von Braun on the development of rocketry after the latter's sojourn with the German V1s and V2s in World War II. If you look at the MIL handbooks produced by the US military, you will find the logic in the drive for component reliability: the increasing reliance of military operations on electronics, since relying heavily on unreliable equipment in combat situations does not make sense. This focus on the reliability of components was taken up by commercial manufacturers as a survival mechanism in a competitive world of selling goods. Service is also a key factor in winning business in this competitive world.
In the IT arena, reliability and availability go beyond simply using quality components because IT provides a service and the service needs to be reliable and hence available for use when needed. A service is composed of components that comprise working units, like disks that make up servers that combine to make systems and so on. Hence, we have a millpond effect where this need for reliability spreads beyond the base components.
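Lusser's Law, asked about in the opening questions, says that components in series have a combined reliability equal to the product of their individual reliabilities, while parallel (redundant) components fail only if all of them fail. A minimal numerical sketch, with invented component reliabilities:

```python
from functools import reduce

def series(reliabilities):
    """Lusser's Law: a series chain is only as reliable as the
    product of its parts."""
    return reduce(lambda acc, r: acc * r, reliabilities, 1.0)

def parallel(reliabilities):
    """Parallel redundancy: the combination fails only if every
    component fails at once (independence assumed)."""
    prob_all_fail = reduce(lambda acc, r: acc * (1 - r), reliabilities, 1.0)
    return 1 - prob_all_fail

# Ten 99%-reliable components in series: reliability sags to ~90%.
print(series([0.99] * 10))     # ~0.9044
# Two 99%-reliable components in parallel: reliability climbs.
print(parallel([0.99, 0.99]))  # ~0.9999
```

This also illustrates the millpond effect numerically: the more components a service chains together, the more the base reliability of each one matters.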
The following diagram shows, schematically and not for the last time in this book, the viewpoints of server, system, and service (Figure 2). If you take this on board now, we are halfway to our goal in this book.
Figure 2. The service universe: server, system, service.
As you move outward through the onion rings in the diagram, the theory of reliability and availability becomes more tenuous and difficult to predict exactly. However, one thing can be predicted, and that is Murphy's Law: if a thing can go wrong, it will. The task of the availability person in IT is to predict how it might go wrong, assess the probability of it going wrong, and design to avoid such failures and, when they happen, to mitigate them.
In 1952, the US military was developing the SAGE system (semiautomatic ground environment) in the Cold War environment that pervaded East–West relations after World War II. It was essentially an early warning system (EWS) to monitor potential airborne attacks on the US mainland (probably the precursor to the proactive AWACS project). IBM, under Thomas J. Watson Jr., was bidding for the computer part of the SAGE business against Radio Corporation of America (RCA), Raytheon, Remington Rand, and Sylvania (where are they now?).
In his book Father, Son and Co., Watson says:
…the Air Force wanted the system to be absolutely reliable. In those days it was considered an accomplishment if someone could build a computer that would work a full eight-hour day without failing. But SAGE was supposed to operate flawlessly round the clock, year in and year out … the storage circuitry we were using worked faster than the UNIVAC [a competitor], but it also 'forgot' bits of data more often … The system even had the reliability that the Air Force wanted … solved the problem by having the Q7s [the new IBM computer] work in tandem, taking turns. One machine would juggle the radar [data] while its twin was being serviced or standing by. By that method, the average SAGE center was able to stay on alert over 97% of the time.4
We cognoscenti recognize here (or will do shortly) the need for reliability in memory and for hardware redundancy via this rudimentary cold standby node cluster. Watson doesn't say anything about how the data were shared in those pre-SAN, pre-NAS, and pre-switch days, or how fast the switchover time was, but they blazed the system availability trail nevertheless.
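As a rough illustration of how a tandem pair can reach the SAGE figure: under the idealized assumptions of instant switchover and independent failures (neither of which Watson describes), a two-node standby pair is down only when both machines are down at once. The 83% single-machine figure below is an invented number chosen to reproduce the reported 97%:

```python
def pair_availability(single: float) -> float:
    """Availability of an idealized two-node standby pair: the service
    fails only when both machines are down simultaneously (instant
    switchover and independent failures assumed)."""
    return 1 - (1 - single) ** 2

# Two machines, each up ~83% of the time, give a pair that is up
# about 97% of the time under these ideal assumptions.
print(f"{pair_availability(0.83):.2%}")
```

The real SAGE pair would have done worse than the ideal formula suggests, since switchover was manual and the data-sharing problem Watson glosses over would have added downtime of its own.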
Jim Gray had this to say about the old days:
Computers built in the late 1950s offered twelve-hour mean time to failure. A maintenance staff of a dozen full-time customer engineers could repair the machine in about eight hours. This failure-repair cycle provided 60% availability. The vacuum tube and relay components of these computers were the major source of failures: they had lifetimes of a few months. Therefore, the machines rarely operated for more than a day without interruption.
Many fault detection and fault masking techniques used today were first used on these early computers. Diagnostics tested the machine. Self-checking computational techniques detected faults while the computation progressed. The program occasionally saved (checkpointed) its state on stable media.
After a failure, the program read the most recent checkpoint, and continued the computation from that point. This checkpoint/restart technique allowed long-running computations to be performed by machines that failed every few hours.
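Gray's 60% figure checks out: 12 hours of mean time to failure out of every 12 + 8 = 20-hour failure-repair cycle gives 0.6. The checkpoint/restart technique he describes can be sketched as below; the file name, state layout, and checkpoint interval are illustrative assumptions, not details from the text.

```python
import json
import os

CHECKPOINT = "checkpoint.json"  # illustrative "stable media" location

def long_computation(total_steps: int) -> int:
    """Run a long computation, periodically saving state so that a
    restart after failure resumes from the last checkpoint rather
    than from the beginning."""
    # After a failure, read the most recent checkpoint and continue.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "acc": 0}

    while state["step"] < total_steps:
        state["acc"] += state["step"]   # the "work" of each step
        state["step"] += 1
        if state["step"] % 1000 == 0:   # checkpoint every 1,000 steps
            with open(CHECKPOINT, "w") as f:
                json.dump(state, f)
    return state["acc"]
```

If the process is killed mid-run, the next invocation picks up from the last saved step, which is exactly how Gray's machines that "failed every few hours" could still complete long-running work.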
Things have certainly changed since then but my main thesis in this document is that man cannot live by hardware alone, a view supported by the Forrester survey outlined earlier. Logical errors and general finger trouble (by liveware) will sometimes outsmart any NeverFallsOver and Semper Availabilis vendors' hardware and software products.
The ham-fisted or underskilled operator can knock almost any system over. A new up-and-coming contender for tripping up systems is malware which is now increasingly recognized as a potential menace, not only to service availability, but to an organization's data, confidentiality, finances, and reputation.
Historical Availability Scenarios
In the late 1950s, Fairchild and Texas Instruments (TI) went head to head in the race for smaller, more reliable electronic circuits. The requirements were driven partly by US military requirements, mainly the electronics in the Minuteman missile system. At that time, the move from glass tube circuits to transistors was underway but the problems of needing one transistor per function and the interconnection of many of them remained.
Fairchild was a new boy in this arena but was making progress with planar technology, which made the connection of multiple circuits on a single substrate possible and avoided the problem of multiple interconnections via wires and solder in favor of metal strips on the insulating substrate. This was a massive leap in the technology (ICs, or integrated circuits) and got round the reliability issue of earlier transistors. That issue was easily demonstrated by what was dubbed the pencil tap test, whereby a transistor could be made to malfunction in various ways simply by tapping it a few times with a pencil: not a desirable trait in military hardware delivered via a rocket or other robust means.
The names Robert Noyce and Jean Hoerni (Gordon Moore, of Moore's Law, was also part of the Fairchild setup in those days) are forever associated with this leap forward in transistor technology, which was soon given further impetus by the avid interest of the National Aeronautics and Space Administration (NASA) in fast, small, and reliable integrated circuitry. Interestingly, the intrinsic reliability of these new technologies was not the only issue. Their volume production was also of prime importance in the development and spread of computers, and planar technology helped in this goal.
It was IBM who, bidding to the military, asked Fairchild to produce suitable circuits for their purposes in this exercise. To the military, the initial cost of these chips (c. $100 each) was not an issue, but for the volume sale of computers, it was. However, history shows us that such costs are now minimal.3
The power-on self-test (POST) is a series of diagnostic tests run automatically by a device when the power is turned on. Today, it can apply to basic input/output systems (BIOS), storage area networks (SANs), mainframes, and many other devices and systems. POST normally creates a log of errors for analysis and, in most cases, will not start any software until any problems are cleared. This is sometimes called built-in self-test (BIST).
In my time at IBM, I saw many advances in the tools and techniques used by engineers for maintenance and diagnostic exercises. Originally, the engineers (called customer engineers or CEs then) used masses of A2 (42 cm×60 cm) diagrams in oversized books, several of them kept in a hangar-sized metal bookcase in the machine room. They pored over these tomes while examining the innards of a machine with electrical probes, multimeters, and screwdrivers plus a few raised digits and swear words.
This could be a time-consuming exercise, as well as a strain on the engineer's back from lifting these books and on his eyes from trying to read incredibly complex diagrams. Taking a whole day to diagnose and fix a simple failure was not the norm, but it was not unusual either. I can see those books now in my mind's eye and remain thankful that I did not have to use them in my work as a systems engineer.
These techniques did little to expose soft errors that might eventually become hard errors, possibly causing outages later. I remember meetings with customers where the IBM CE would summarize the latest hardware diagnostics and agree on a date and time for maintenance, or perhaps repair/replace activity, for components exhibiting higher than expected soft error rates, sometimes called transient errors.
In later years, these cumbersome diagnostic methods and books were replaced by maintenance devices (MDs), the equivalent of the clever diagnostic tools used in modern cars, but pre-Java. They shortened the diagnostic time and hence were a considerable boon to system availability and to the health of the engineer. Just to complete the picture, I was told by one engineer that there was an MD to diagnose a failing MD and so on! I should have guessed.
Repair could also be a time-consuming process that was eventually superseded by field replaceable units (FRUs), where failing items were replaced in situ (where possible) and the offending part taken away for repair, or to be scrapped. The part, if repairable and contracts and consumer law permitting, could then be used again on the same system or elsewhere.
FRUs installed by the customer are called CRUs (customer-replaceable units), a fairly recent innovation. It is current modular system and component designs that make a replaceable units philosophy possible.
To be cost-effective, FRUs needed to be of a size that could be easily replaced and, if necessary, discarded if they could not be repaired after removal. This necessitates a granular approach to the system design but then more components making up a system means more things to go wrong.
Later versions of the hardware and operating systems offered diagnostic recording and warning features that could be used either retrospectively (e.g., for identifying soft errors) or as an operational warning of potentially failing parts or components as work was in progress. A sophisticated level of self-test and diagnostics is implemented in hardware systems that offer fault tolerance (FT). These include Stratus and HP NonStop, a system initially marketed by Tandem before its acquisition by HP. Modern systems have these and other reliability, availability, and serviceability (RAS) features that considerably enhance availability figures and are rarely unique to any one vendor. One characteristic of in-flight diagnostics is that the errors they detect can be either logged, flagged in real time to IT operations, or bypassed using fault-tolerant recovery techniques.
We have seen the early attempts to specify what causes our computers to fall over and to address the issues in various ways. The tackling of this problem is evolutionary and made very necessary by the consequences of failure to business and other systems. I can't think of any business today that isn't totally dependent on IT to run the whole operation or dependent on someone who is, for example, a third party.
Some enterprises only address the HA and DR aspects of running their IT when they get their fingers burned and, for some, those burns are fatal. Funding HA IT is like paying for an insurance policy on your house: you hope you won't need it, but when your house burns down you're glad you took the policy out. Your CFO may say, "We've spent all this money on high availability and disaster recovery and I can't see any benefits."
This issue reminds me of an appropriate story you might respond with:
A man was walking round my home town of Warrington, UK, scattering a green powder. A second man saw this and asked the first man, "Why are you scattering that powder?" to which the first man replied, "To keep the elephants away." The second man looked puzzled and said, "But there are no elephants in Warrington." "No," said the first man, "this powder is very effective, isn't it?"
Even if an enterprise decides to spend on these aspects of IT, they may either get it wrong or overspend with overkill just to be on the safe side. That's where knowing what you are doing comes in useful!
Many years ago, UK electricity boards built their networks using components, such as transformers, with large amounts of redundancy and capacity in them, just in case and to save continually upgrading them as the load grew. Today, costs are such that networks are designed and provisioned using power systems analysis tools to install equipment with the ratings to do the job and enough spare capacity to handle projected growth. These power systems planning tools, and the effort involved in using them, are more than covered by cost savings from quality network design and component usage. I know this because I spent several years of my IBM life working with and in public utilities: gas, water, and electricity.
Planning, quality design, implementation, and operations reap their own rewards, and this applies in IT as well as in other service areas like utilities. The effects of such labors may not be obvious due to a form of IT hysteresis but they are there nonetheless. Remember: The three components of services (people, products, and processes) are like the three musketeers. All for one and one for all, in true holistic fashion. Together, they unite to provide a service to users and clients.
1. Forrester Research. How Organizations Are Improving Business Resiliency with Continuous IT Availability. EMC Corporation: 2013.
2. Jim Gray and Daniel P. Siewiorek. "High-Availability Computer Systems." IEEE Computer. September 1991.
3. Jim Gray. "Why Do Computers Stop and What Can Be Done About It?" Tandem Technical Report 85.7 PN87614. June 1985.
4. Thomas J. Watson and Peter Petre. Father, Son & Co.: My Life at IBM and Beyond. Bantam. 2000.
5. David Law. "Invention of the Planar Integrated Circuit & Other Stories from the Fairchild Notebooks." www.computerhistory.org
Borning, Alan. "Computer System Reliability and Nuclear War." Communications of the ACM. 1987.
Certain names and logos on this page and others may constitute trademarks, servicemarks, or tradenames of Taylor & Francis LLC. Copyright © 2008-2015 Taylor & Francis LLC. All rights reserved.