For more than 50 years, Auerbach Publications has been printing cutting-edge books on all topics IT.

Contact

Interested in submitting an article? Want to comment about an article?

Contact John Wyzalek, editor of IT Performance Improvement.

 

Responsibility for Defect Support

Gay Gordon-Byrne

Understanding which parts of the original equipment manufacturer (OEM) organization (see Figure 1) are responsible for which types of defects allows end users to craft more equitable and productive service agreements. The OEM's sales force is unlikely to be prepared to delve into this issue and may easily provide misleading or incomplete details. Nor are all OEMs straightforward with buyers about which types of defects are covered regardless of warranty status, like "recalls" in the auto industry, and which are covered only under a post-warranty maintenance agreement sold as a separate contract.

Figure 1. Manufacturer Organizational Chart

Software Defect Support

Experience tells us that the vast majority of calls for help from users to help desks are not hardware failures or software failures. Actual hardware problems are a tiny fraction, certainly less than 10%, of all calls for help.¹ The remaining 90% of calls includes user problems such as settings or passwords, which account for at least 40% of all calls. That leaves about 50% of all calls for software problems that are tricky to diagnose and more difficult to fix.
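To make the proportions concrete, the short sketch below applies the rough shares quoted above to a hypothetical volume of 1,000 calls. The call volume and the variable names are assumptions for illustration; the hardware share is used at its stated upper bound and the user share at its stated lower bound.

```python
# Illustrative only: the shares are the rough figures quoted in the text,
# applied to a hypothetical volume of 1,000 help desk calls.
TOTAL_CALLS = 1000  # assumed volume, not from the source

percent = {
    "hardware": 10,  # "certainly less than 10%" (used as an upper bound)
    "user": 40,      # "at least 40%" (used as a lower bound)
}
# Whatever is neither hardware nor user-related is attributed to software.
percent["software"] = 100 - sum(percent.values())

calls = {kind: TOTAL_CALLS * share // 100 for kind, share in percent.items()}

for kind, count in calls.items():
    print(f"{kind:<9}{count:>5} calls ({percent[kind]}%)")
```

Readers can substitute counts from their own help desk records, which may well show a different mix.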

One of the first, and most important, jobs of the help desk (or other support team) is to determine which type of problem is involved and then direct the call to the most appropriate entity. Many times the user needs help with how to use an application or has a setting issue, in which case the problem is not a failure issue and is handled without involving the vendor. A large number of companies subcontract for these types of how-to problems, including investments in education, Web-based learning, and so forth.

The next level of work involves distinguishing between hardware problems and software problems. Hardware problems are usually simple to identify because the part in question does not work and cannot be made to work even intermittently. Machines with broken connections do not fix themselves. There are no "fault tolerant" circuits, only functional or nonfunctional circuits. In this respect hardware failure is totally binary, meaning "on" or "off."

Software failures are typically intermittent. A wide variety of conditions may have to come together at the same time to cause a software failure, which is why the reboot function is so frequently effective. The reboot returns the machine to the original settings, which clears the error. It is axiomatic that the more applications in use at the same time and the more interfaces that are active, the greater the chances of problems of interoperability and outright conflict.

The most common way to start distinguishing a hardware failure from a software problem is to restart the machine. If the machine restarts correctly, the problem is presumed to be software. The word "failure" is more appropriate for hardware because in the world of hardware a failure means that something is broken and must be repaired, whereas a software problem can have a workaround or be ignored if the consequences are not significant. Really horrible software failures, such as those that repeatedly crash the machine or the active partition, are simpler to diagnose than others and usually get the highest level of attention (for severity) from vendors.

That being said, there are a few nuances of software use that impact hardware failures. First, if there is a failure of machine code and patching resolves the problem, this is still technically a hardware failure as the responsibility for such patching lies with the hardware OEM. These are rare events in practice. Most patching is done preemptively by the OEM in the earliest days of initial warranty and otherwise only reloaded if code is lost during a repair. Engineering change (EC) level options also fall into this category.

Second, there is some validity to the claim that overclocking a processor contributes to heat-related failures. Overclocking is usually a hobbyist and gaming technique for faster performance. In this respect, the use of the software does cause hardware failures, but the repair remains a physical one. One cannot reset the clock speed of a fried processor and return it to service.

Similarly, there are situations where high demands on read-write activity on disk drives contribute to drive failure as the excessive activity is presumed to accelerate the failure rate of actuator arms through overuse. Again, this is ultimately a hardware failure even when the causation is clearly a problem that is controllable by software. The actuator arms will not repair themselves, so the technician must still place "warm hands" on the device to make the repair.

Once hardware failures are ruled out, the team must then attempt to differentiate between operating system and application system failures. Without belaboring the obvious, in a situation where multiple applications are run, if the machine operates correctly with some applications, but not with one specific application, the source of the problem can be easily narrowed and the vendor contacted. Operating system bugs are more likely to be systemic and impact the system regardless of which applications are running.
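The triage sequence described in this section can be sketched as a small routing function. This is a simplified illustration, not a diagnostic tool: the field names and route labels are hypothetical, and the restart test is treated as decisive even though the nuances above (machine code patches, overclocking, heavy disk activity) show that real cases are less tidy.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    USER_HELP = "how-to or settings issue; handled without the vendor"
    HARDWARE = "hardware OEM or hardware maintainer"
    APPLICATION = "application vendor"
    OPERATING_SYSTEM = "operating system vendor"


@dataclass
class Ticket:
    # All field names are hypothetical; they do not come from any real help desk tool.
    is_how_to_or_setting: bool  # the user needs guidance; nothing has failed
    works_after_restart: bool   # does the machine come back after a reboot?
    failing_apps: int           # number of applications showing the problem
    total_apps: int             # number of applications in use


def triage(t: Ticket) -> Route:
    if t.is_how_to_or_setting:
        return Route.USER_HELP
    if not t.works_after_restart:
        # Hardware failure is treated as binary: a broken part stays broken.
        return Route.HARDWARE
    if t.failing_apps == 1 and t.total_apps > 1:
        # Everything else runs correctly, so the one application is suspect.
        return Route.APPLICATION
    # Problems that appear regardless of the application are more likely systemic.
    return Route.OPERATING_SYSTEM


print(triage(Ticket(False, True, 1, 5)).value)
```

Keeping the checks in this order mirrors the text: user issues are screened out first, hardware is ruled in or out next, and only then is the software problem assigned to a vendor.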

Depending on whether the application or the operating system (OS) is causing the problem, responsibility for defect support lies with the corresponding software vendor. Determining which vendor needs to fix which elements of code is a contentious area where there is a high potential for finger-pointing. There is some validity to the idea of reducing the level of finger-pointing by dealing with a single vendor, but there is still the matter of problem resolution.

Users are often attracted to the idea of "one throat to choke" in the context of problems in general. The assumption is that with a single point of contact at a single vendor, problems are resolved more quickly. Beneath the convenience of the single point of contact lies the unpleasant truth that many vendors have grown by acquisition and not by developing a truly integrated set of products. One has only to look at the acquisition history of software giant Computer Associates (CA) to see that completely disparate products are marketed together but were never built to be installed together.² Similarly, many hardware vendors have purchased both hardware and software vendors whose products were never built to be integrated and then packaged those products together.

Types of Patches

Patches to hardware and software problems are often distributed as "urgent," "security," or "essential" updates in addition to those patches that add support for new features or drivers. There are often a lot of patches, and keeping track of them (patch management) is an important task for system administrators.

Since downloading patches occasionally introduces new problems, many organizations test all patches before applying the patches more widely. The axiom "If it ain't broke, don't fix it" is widely applicable in the world of software maintenance. Many serious outages have been linked to patching without adequate testing.

The wording of "Security Patches" is particularly suspect for all application software as it is very much to the marketing advantage of vendors to create a sense of urgent need for continued maintenance contracts for their products. So long as "Cybersecurity" remains a fear factor, vendors will use wording that implies that dropping software maintenance is going to make systems more vulnerable to attack. This needs to be carefully scrutinized before being accepted. As with all types of marketing hype, the buyer needs to consider the motives of the source.

Hardware Defect Support

In the technology equipment industry, hardware break-fix and defect support for underlying machine code are often blended into an extra-charge service agreement typically sold as "maintenance" or "extended warranty." Although these agreements cover both patches to machine code and parts failure, these are actually two different types of problems resolved by two different groups within the OEM organization.

Diagnosing flaws in machine code, as distinct from manufacturing or component issues, is the result of analysis of recurring problems that resulted in service calls. The call for repair under warranty is the first indication to the OEM that there might be a hardware problem. It is not possible to determine whether the cause is a logic flaw or a manufacturing flaw until the part (or machine) is returned for examination.

Forensic analysis (autopsy) on returned parts is done to ascertain the root cause of the failure by reliability and quality assurance engineers in the back office.³ In addition to the OEM, each part manufacturer has its own staff specializing in root cause analysis. If defects in design can be addressed with a downloadable patch, an update is issued. If the defects are more profound, it is likely that the item will be replaced by a new model. The new model is commonly identified by an "engineering change," or EC level.

Defect support of logic flaws (machine code) is one of the major tipping points within service contracts, determining whether the owner of the equipment has any opportunities for self-service or independent service. If defect support of machine code is provided by the hardware OEM as part of the hardware, it is common for the end user to be able to select from multiple support options. If, on the other hand, defect support for machine code is bundled into software maintenance agreements, then it is unlikely that the owner will have any hardware support flexibility.

Logic Flaws (Hardware or Machine Code)

At the most basic level, computers are combinations of logic statements expressed in wires. It does not matter if the wires are large and visible, or extremely small. Logic is always part of the design of the computer, and logic errors cannot be corrected externally unless the means of correction is manufactured into the chip. Defect support of logic must therefore be provided by the manufacturer. Such errors are almost always corrected without cost to the buyer because the OEM would not likely be able to sell flawed products once the flaws were identified.

Logic problems exist in both physical form, as manufactured into the chip, and also in the embedded software that comes with the machine. Embedded software has many names, machine code or embedded code being the most dominant but also including microcode, firmware, BIOS (basic input/output system), PLC (programmable logic controller) code, and IOS (Cisco's Internetwork Operating System). The function of the machine code software is twofold. First, machine code provides common routines needed to perform strictly hardware functions such as moving data to and from storage. Second, machine code is a more flexible vehicle for complex logic, which can then be corrected more easily in the event of problems. As shown in Figure 2, corrections delivered dynamically, often described as a "firmware update," are only one of many media options. In many cases, patches to machine code are collected and included in subsequent versions of the hardware itself, thus completing the cycle of machine code back into a manufactured form.

Figure 2. Hardware to Software to Hardware.

Within the OEM organization, fixes to hardware errors and machine code errors are supported by the hardware designers, uniquely capable of finding and patching errors in logic. This is an entirely different set of employees from developers of operating systems or application systems that design their products based on the specifications of the hardware. Corrections from the hardware engineers may be distributed in physical form in a replacement part manufactured to incorporate the changes or in a less costly form as a distribution of a patch to machine code. The net result is identical. The part or machine patched using media is identical to the part or machine later manufactured inclusive of the microcode corrections.

OEMs strongly prefer to avoid physical recalls and if at all possible distribute "patches, fixes, updates, and so on" to be applied in the field. The method of distribution involves either some form of media or, more often, a download from the library of patches over the Internet. In the early stages of a product release, there are always more patches than later, as field use reveals patchable problems. This is why a series of patches may be recommended at the time of equipment installation even though the product is "new." Users of the Windows line of operating systems know the same concept as "service packs."

There is truth to the phrase "bleeding edge" instead of "leading edge." Significantly new products will expose new problems in the initial release. Buyers of new products should expect to invest more time in patching. Some products are more widely distributed in the marketplace than others, so models with less common distribution are less likely to have a thorough "shakeout" in a short period of time.

Eventually, the majority of problems are identified and patched. At this point, the machine code stabilizes and few new patches are needed. Most patching in later years is done to allow native mode attachments of new peripherals or to recognize new features or major model changes. There is always the potential that a functional patch might be created years into product distribution, but this is rare since the development team for the hardware is almost certainly working on newer generations of equipment. Support for older models is eventually dropped officially at the "End of Service Life (EOSL)" or "End of Life (EOL)." This should not be confused with end of functional life. Many products remain in productive use decades past their last EOL or EOSL announcement.

Keeping equipment in service for the long term requires that equipment owners have access to the full library of known patches for several reasons:

  1. Not all users want to apply patches that are not immediately needed.
  2. Machines or parts in storage will not have had all patches applied until they are deployed.
  3. Machines do not work correctly without all updated code.

Without access to the library of patches (or patch media), owners are unable to assure themselves that their equipment is operating correctly at any point in time, including years beyond the initial warranty.
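The reasons listed above amount to a simple comparison between what the OEM has published and what is actually installed on each machine. A minimal sketch follows, assuming the owner can obtain the full patch list; the patch identifiers and machine names are hypothetical and do not reflect any OEM's naming scheme.

```python
# Hypothetical patch identifiers; a real OEM library would use its own naming.
oem_library = ["FW-1.0.1", "FW-1.0.2", "FW-1.1.0", "FW-1.2.0"]

# Patches known to be applied on each machine (also hypothetical).
applied = {
    "server-in-production": {"FW-1.0.1", "FW-1.0.2", "FW-1.1.0"},
    "spare-in-storage": {"FW-1.0.1"},
}


def missing_patches(library, installed):
    """Return the library patches not yet installed, in library order."""
    return [patch for patch in library if patch not in installed]


gaps = {machine: missing_patches(oem_library, done) for machine, done in applied.items()}

for machine, todo in gaps.items():
    print(machine, "is missing", todo or "nothing")
```

Without the library itself, the left-hand side of this comparison is unknown, which is the owner's problem in a nutshell.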

Updates versus Upgrades

Manufacturers have been taking the position that they are entitled to limit access to updates to machine code because they are constantly providing new enhancements and should be paid for their work. The flaw in this argument is that the enhancements they reference are unlikely to be the types of improvements users expect when they pay for upgrades.

Upgrades are defined in the computer science realm as:

  1. A software program that provides added enhancements over an earlier version.
  2. A hardware device that provides greater performance than an earlier model.⁴

Updates to code intended to repair flaws (patches, fixes, security patches) are not improvements. It is only the confusion of end users that allows such word play to be tolerated. Updates to machine code are not upgrades. Updates are synonymous with patches and are the recalls of the technology industry.

Upgrades to functionality come with component purchases. No machine code update will change the physics of the product. If one wants a faster processor, one must buy a faster processor. If there were a way to double the speed or density of a product through a machine code upgrade, the wise OEM would charge for such value and not hide it in a patch.

Defect Support of Component Failures

Parts suffer physical breakdown in addition to logic flaws. Different batches of parts may perform differently. Different assembly teams may do a better job than others. Just as automobile buyers used to joke about never buying a vehicle built on a Monday morning (presumably the workers were recovering from hangovers), every step in the assembly process introduces variables. Shipping and handling exposes parts to extremes of heat, cold, vibration, and static charges. Given the variety of ways that delicate electronics can be abused prior to installation, it is quite an achievement that we have such low levels of immediate failure (known as "infant mortality").

Almost all infant mortality is observed in the first attempt to power on the machine. If the machine powers up and executes all self-diagnostics, the machine is no more likely to fail than any other. Loose connections may reveal themselves as machines are transferred from the staging location to the end user, but there is no need for a "burn-in" period, which is an artifact of mechanical production, or for warming up the device, which is an artifact of vacuum tube technology. A 90-day warranty is more than enough time to identify and exchange any shipping or packaging defects.

Once a machine is installed, the likelihood of a specific part failing will track the mean time between failures (MTBF) of that part. In a configured machine, the overall failure rate of the machine will be no better than that of the worst part. For example, a Cisco switch with some components rated at 300,000 hours (clock oscillator) and others at 40,000 hours (fan) will still need service at the rate of the fan regardless of the rest of the parts.
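A rough calculation shows why the weakest part dominates. The sketch below uses the two figures from the switch example and a textbook series model, which assumes that the parts fail independently at constant rates and that the failure of any one part means a service call. Those modeling assumptions are added here for illustration and are not taken from the vendor's data.

```python
# MTBF figures quoted in the text for two components of a switch.
component_mtbf_hours = {"clock oscillator": 300_000, "fan": 40_000}


def series_mtbf(mtbf_hours):
    """Combined MTBF when the failure of any one part requires service.

    Assumes independent parts with constant failure rates, so the rates add.
    """
    total_rate = sum(1.0 / hours for hours in mtbf_hours)
    return 1.0 / total_rate


machine_mtbf = series_mtbf(component_mtbf_hours.values())
worst_part = min(component_mtbf_hours.values())

print(f"worst part: {worst_part:,} h, machine: {machine_mtbf:,.0f} h")
```

Under these assumptions the combined figure comes out near 35,300 hours, slightly below the fan's 40,000 hours, so the machine as a whole is never better than its worst part.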

OEM support of parts failure is a key element of all hardware maintenance contracts. Issues of infant mortality are usually resolved with an apology and a quick shipment of additional units. All other parts issues are resolvable with parts replacement. The choice of labor for parts replacement properly belongs with the equipment owner, just as owners of automobiles and refrigerators are able to choose if they prefer to pull and replace parts themselves or hire a technician.

The supply chain for parts is large and complex. It is possible for defective parts to be knowingly sold and integrated into machines. Some parts defects may have a low potential impact on users and be integrated into machines where the impact is unlikely. Others can clearly be unscrupulously sold and enter machines where the flaws will cause problems. It is up to the quality assurance team at the OEM to carefully test and differentiate between offerings by competitive parts manufacturers.

Last, defects in electronic hardware may be problems with machine code and not just physically broken parts. Defect diagnosis and repair is an obligation of the manufacturer, and, although presented as a "benefit" to the end user, it is incorrectly positioned. The end user would strongly prefer to purchase equipment that had no defects. In most markets outside of information technology (IT), defect support follows the machine through the chain of owners and does not disappear by location. This is the case for automobiles, major appliances, hot water heaters, and televisions. It should also apply to technology products of all kinds.

Engineering Change Levels

There are occasionally hardware product problems so large that a part needs a physical adaptation to operate correctly. These types of repairs are usually identified early in the product cycle and managed by the OEM under warranty with an engineering change (EC). ECs are physically different parts. (Wherever possible, lower level parts can be updated through microcode updates to machine code that can be applied dynamically on installed equipment.) Whether updated in the field through a code update or physically replaced, the EC level is important to the correct functioning of the machine.

Keeping track of the EC level of parts is burdensome but must be done. EC changes fall short of the types of problems that would result in a wholesale product recall and may not impact all customers. Unfortunately, owners of equipment without the latest ECs on their equipment are also losing value since the secondary market buyer will almost certainly demand the highest level EC as part of the purchase. End users who have equipment with available ECs should make sure their OEMs update all their equipment as soon as possible to retain value.

ECs can only be provided by the OEM and are product specific. Because such changes are costly for the OEM, which must provide the labor for the repair, the trend in the industry is to move away from physical ECs and to design products to be updated through remote delivery of patches to machine code. Remote delivery, be it through the Internet or using physical media, is still far less costly to the OEM than manufacturing and replacing parts. Manufacturers have successfully avoided being chastised for delivering buggy and unstable products largely by describing patches and fixes as updates or, worse, upgrades. True enhancements or upgrades to equipment are almost always offered as chargeable offerings and not included for free in any EC or machine code updates.

Remote Diagnostics

Many devices in large enterprise settings are equipped with remote sensors and communications links to report failures to the OEM without the user having to make a trouble call. Many end users believe that these systems are constantly monitoring the health of the device, when in fact they can only sense failure after the fact and not prevent failure. For example, large disk arrays are assembled using massively redundant devices with sophisticated mirroring software so that individual and common failures of disk drives do not interfere with the work being done. These systems cannot detect future failures; they can only report on actual failures. The call-home feature merely reports the failure so that a replacement part and technician can be efficiently dispatched.

Why do such machines not know about impending failure? Because there is no way to know until a circuit fails. OEMs would know more than anyone about the mean time between failures (MTBF) and the mean time to failure (MTTF) of the device, yet it is entirely clear that they do not have such information or they would build and deliver much less troublesome devices. It is vastly more expensive for an OEM to dispatch a technician to replace a $40.00 disk drive than to have avoided the problem with a $0.10 difference in unit cost. This, more than any marketing claims, tells us that the OEM does not yet know enough to predict failure.

Remote hardware diagnostic routines are provided by the OEM to reduce the volume of physical service calls needed for diagnostic purposes. It is clearly efficient for the service provider to have the ability to triage a problem remotely in order to either restore the unit to service remotely (as with a reboot) or to manage the logistics of providing both spare parts and a suitably trained technician to the location. This is the reasoning behind the common disclaimer in Service Level Agreements (SLAs) that the SLA response time begins only once diagnostics have confirmed the hardware problem, not at the instant the service call was placed.

There are several problems with reliance on the remote call-home function. First, the remote service may not be linked to the owner's trouble tracking system, making the management analysis of repairs made and calls for service extremely difficult. Second, the remote call-home event does not always count as acceptance of the service request for the purpose of measuring response time under the SLA. Many times the vendor will treat the inbound call as preliminary and not start the SLA "clock" until some other validating step is performed. This gives the OEM a significant head start on arranging for technicians and parts before the user knows of the failure, with the result that the SLA can appear to be in compliance. The user can thus be exposed to concurrent failures for far longer than is known.
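The effect of where the SLA clock starts is easy to see with a single incident. In the sketch below the four-hour response window and all timestamps are invented for illustration; only the idea that the vendor may start the clock at diagnostic confirmation, not at the call-home event, comes from the text.

```python
from datetime import datetime, timedelta

SLA_RESPONSE = timedelta(hours=4)  # hypothetical contractual response window

# Hypothetical timeline for a single incident.
call_home_at = datetime(2024, 3, 1, 8, 0)         # device reports the failure
diag_confirmed_at = datetime(2024, 3, 1, 11, 30)  # vendor confirms a hardware fault
technician_at = datetime(2024, 3, 1, 15, 0)       # technician arrives with the part


def within_sla(clock_start, responded_at, window=SLA_RESPONSE):
    """True if the response arrived within the window, measured from clock_start."""
    return responded_at - clock_start <= window


vendor_view = within_sla(diag_confirmed_at, technician_at)  # clock starts at confirmation
owner_view = within_sla(call_home_at, technician_at)        # clock starts at call-home
exposure = technician_at - call_home_at                     # time spent running degraded

print(f"vendor: in SLA={vendor_view}, owner: in SLA={owner_view}, exposure={exposure}")
```

The same incident is compliant by the vendor's clock and late by the owner's, which is why the starting event is worth pinning down in the contract and why call-home events should be logged in the owner's own tracking system.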

Remote fixes are not the same thing as remote diagnostics and mostly refer to the option for a technician, usually a software specialist, to connect through a network into the computer and remotely manipulate settings. There is no such thing as an actual remote repair, since anything physically broken must be physically repaired.

There are new forms of remote diagnostics being used for service delivery in many industries, such as a telematics function in the auto industry. The telematics connections between the onboard computers of the vehicle and the service department of the auto dealership are already the subject of concern by privacy advocates. It is not yet clear if the data being exported to the dealer belongs to the user or the dealer. Privacy advocates prefer that the owner of the equipment also be the owner of whatever data is transmitted, for the obvious reason that it is truly no concern of the dealer how the vehicle is being used (unless the dealer is also the owner as in a rental agreement). Users expecting to buy equipment with telematics functions should consider for themselves if issues of data privacy are meaningful and negotiate accordingly.

There is already the potential for the OEM to block access to the telematics function by owners with expired warranty agreements. It may come to pass that telematics services are sold separately from traditional hardware break-fix, in which case the owner of the vehicle (or other device) may find themselves required to purchase a telematics license in order to have dealer service. Although this type of tying agreement sounds illegal, it is exactly the same as many current requirements in the IT industry with different names. Buyers wishing to avoid contracts that bind them in the future should reject such options, if necessary, in order to protect their future rights to service.

Summary

Obfuscation by OEMs is common when it comes to the details of defect support. Buyers should be digging deeply into all proposals to make sure their contracts correctly represent their understanding of how defect support is to be delivered and at what cost. Many OEM policies are surprisingly negotiable on the matter of defect support, and every opportunity should be taken before any purchases are made to demand the most appropriate agreements possible.

Notes

1. Analysis done by TekTrakker for clients revealed this general pattern of problem calls to the user help desk/service desk. Readers can easily cull their own records for patterns that are unique to their organizations.

2. "CA Technologies," Wikipedia, http://en.wikipedia.org/wiki/CA_Technologies. Neither the corporate Web site nor the Wikipedia entry for CA provides a comprehensive early history. Both lists are missing key early products, such as Dynam-T (Tape Library Management), CA Sort (Sorting), and Dynam-D (Disk Space Management), but the text confirms that CA grew by acquisition and not by development. The result was a series of unrelated products linked only by marketing brochures.

3. For additional information on the issues surrounding component reliability and patching, see the Web site for the IEEE Reliability Society at: http://rs.ieee.org/.

4. See definitions at "Patch (Computing)," Wikipedia, http://en.wikipedia.org/wiki/Software_update; and "Upgrade," Wikipedia, http://en.wikipedia.org/wiki/Software_upgrade.



This article is an excerpt from:

Describing how to avoid common vendor traps, Buying, Supporting, Maintaining Software and Equipment: An IT Manager's Guide to Controlling the Product Lifecycle will help readers better control the negotiation of their IT products and services and, ultimately, better manage the lifecycle of those purchases.

The book supplies an inside look at the methods and goals of vendors and their contracts—which are almost always in conflict with end-user goals. The text is set up to follow the way most people experience technology products and contracting decisions. It begins by explaining the significance of the decisions made at the time of product selection. It details what you need to focus on when negotiating service and support agreements and describes how to use purchase orders to negotiate more favorable agreements.

Illustrating the types of problems typically experienced during product use, the book describes how to better control the useful life of your equipment. It supplies tips on how to avoid excessive charges from predatory vendors and concludes by delving into issues of product end of life.

Explaining how to manage support and maintenance issues for the long term, this book provides the understanding you need to make sure you are more knowledgeable about the products and services your organization needs than the vendor teams with whom you are negotiating.