Information about Graceful Degradation
- This article contains specific implementations of fault-tolerant systems. For general theory, see fault-tolerant design.
Fault-tolerance or graceful degradation is the property that enables a system (often computer-based) to continue operating properly in the event of the failure of (or one or more faults within) some of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively-designed system in which even a small failure can cause total breakdown. Fault-tolerance is particularly sought-after in high-availability or life-critical systems.
Fault-tolerance is not just a property of individual machines; it may also characterise the rules by which they interact. For example, the Transmission Control Protocol (TCP) is designed to allow reliable two-way communication in a packet-switched network, even in the presence of communications links which are imperfect or overloaded. It does this by requiring the endpoints of the communication to expect packet loss, duplication, reordering and corruption, so that these conditions do not damage data integrity, and only reduce throughput by a proportional amount.
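The receiving side of such a protocol can be sketched in a few lines. This is a toy illustration, not TCP itself: the `Receiver` class, the `make_segment` helper, and the use of CRC32 in place of TCP's real checksum are all assumptions made for the example. It shows how sequence numbers and checksums let an endpoint discard corrupted or duplicated segments and reassemble reordered ones, so that faults cost only retransmissions rather than data integrity.

```python
import zlib


def make_segment(seq, payload):
    """Build a (seq, payload, checksum) tuple; CRC32 stands in for TCP's checksum."""
    return (seq, payload, zlib.crc32(payload))


class Receiver:
    """Hypothetical receiver that tolerates loss, duplication, reordering and corruption."""

    def __init__(self):
        self.next_seq = 0    # next sequence number expected in order
        self.buffer = {}     # out-of-order segments waiting for a gap to be filled
        self.delivered = []  # payloads handed to the application, in order

    def receive(self, segment):
        seq, payload, checksum = segment
        if zlib.crc32(payload) != checksum:
            # Corrupted segment: drop it and repeat the last acknowledgement,
            # so the sender retransmits.
            return self.next_seq
        if seq >= self.next_seq and seq not in self.buffer:
            # New data. Duplicates and already-delivered segments are ignored.
            self.buffer[seq] = payload
        # Deliver any contiguous run that is now available.
        while self.next_seq in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1
        return self.next_seq  # cumulative acknowledgement
```

A lost segment simply leaves a gap: the acknowledgement number stops advancing until the sender retransmits, which reduces throughput but never delivers wrong or misordered data.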
Data formats may also be designed to degrade gracefully. HTML, for example, is designed to be forward compatible, allowing new HTML elements to be ignored by Web browsers that do not understand them without rendering the document unusable.
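As a rough illustration of this behaviour, the sketch below uses Python's standard `html.parser` module to build a renderer that keeps the text inside markup it does not recognise instead of rejecting the document. The `KNOWN_TAGS` set, the `TolerantRenderer` class, and the `<sparkle>` tag are invented for the example and do not describe how any particular browser is implemented.

```python
from html.parser import HTMLParser

# Assumption for the example: the only tags this renderer "understands".
KNOWN_TAGS = {"p", "b", "i"}


class TolerantRenderer(HTMLParser):
    """Hypothetical renderer that skips unknown tags but keeps their text."""

    def __init__(self):
        super().__init__()
        self.output = []   # pieces of the rendered document
        self.ignored = []  # names of tags that were skipped

    def handle_starttag(self, tag, attrs):
        if tag in KNOWN_TAGS:
            self.output.append(f"<{tag}>")
        else:
            self.ignored.append(tag)  # unknown markup: note it and carry on

    def handle_endtag(self, tag):
        if tag in KNOWN_TAGS:
            self.output.append(f"</{tag}>")

    def handle_data(self, data):
        self.output.append(data)  # text is kept whatever tag encloses it


def render(document):
    renderer = TolerantRenderer()
    renderer.feed(document)
    renderer.close()
    return "".join(renderer.output), renderer.ignored
```

Calling `render("<p>Hello <sparkle>world</sparkle></p>")` keeps the paragraph and its text and reports `sparkle` as ignored, so the reader loses only the new feature, not the document.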
Recovery from errors in fault-tolerant systems can be characterised as either roll-forward or roll-back. When the system detects that it has made an error, roll-forward recovery takes the system state at that time and corrects it, to be able to move forward. Roll-back recovery reverts the system state back to some earlier, correct version, for example using checkpointing, and moves forward from there. Roll-back recovery requires that the operations between the checkpoint and the detected erroneous state can be made idempotent. Some systems make use of both roll-forward and roll-back recovery for different errors or different parts of one error.
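Roll-back recovery with checkpointing can be sketched as follows. The `CheckpointedStore` class, the `apply_with_rollback` function, and the `is_valid` predicate are hypothetical names chosen for illustration, and the sketch assumes the whole system state fits in one in-memory dictionary, which real systems rarely enjoy.

```python
import copy


class CheckpointedStore:
    """Hypothetical in-memory state with a single saved checkpoint."""

    def __init__(self):
        self.state = {}
        self._checkpoint = {}

    def checkpoint(self):
        # Save a deep copy so later mutations cannot alter the checkpoint.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Revert to the last state known to be correct.
        self.state = copy.deepcopy(self._checkpoint)


def apply_with_rollback(store, operations, is_valid):
    """Apply operations; revert to the checkpoint if an error is detected."""
    store.checkpoint()
    try:
        for op in operations:
            op(store.state)
        ok = is_valid(store.state)
    except Exception:
        ok = False  # an exception also counts as a detected error
    if not ok:
        store.rollback()
    return ok
```

After a roll-back the caller may retry the same operations from the checkpoint, which is why the text above requires that they can be made idempotent.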
Within the scope of an individual system, fault-tolerance can be achieved by anticipating exceptional conditions and building the system to cope with them, and, in general, aiming for self-stabilization so that the system converges towards an error-free state. However, if the consequences of a system failure are catastrophic, or the cost of making it sufficiently reliable is very high, a better solution may be to use some form of duplication. In any case, if the consequence of a system failure is catastrophic, the system must be able to use reversion to fall back to a safe mode. This is similar to roll-back recovery but can be a human action if humans are present in the loop.
Fault Tolerance Requirements
The basic characteristics of fault tolerance require:
- No single point of failure
- No single point of repair
- Fault isolation to the failing component
- Fault containment to prevent propagation of the failure
- Availability of reversion modes
In addition, fault-tolerant systems are characterized in terms of both planned service outages and unplanned service outages. These are usually measured at the application level and not just at a hardware level. The figure of merit is called availability and is expressed as a percentage. A "five nines" system would therefore statistically provide 99.999% availability.
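To make the percentage concrete, the sketch below converts an availability figure into the downtime it permits per year. The function name and the choice of a 365.25-day year are assumptions for the example; service agreements may define the measurement period differently.

```python
# Assumption: an average year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted by a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)
```

Under these assumptions, `allowed_downtime_minutes(99.999)` comes to about 5.26 minutes per year, while 99.9% ("three nines") allows roughly 526 minutes, or a little under nine hours.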
Fault-tolerance by replication
Spare components address the first fundamental characteristic of fault-tolerance in three ways:
- Replication: Providing multiple identical instances of the same system or subsystem, directing tasks or requests to all of them in parallel, and choosing the correct result on the basis of a quorum;
- Redundancy: Providing multiple identical instances of the same system and switching to one of the remaining instances in case of a failure (failover);
- Diversity: Providing multiple different implementations of the same specification, and using them like replicated systems to cope with errors in a specific implementation.
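The redundancy approach (failover) from the list above can be sketched as a function that tries each instance in turn. The name `call_with_failover` is hypothetical, and the sketch assumes that a failed instance signals its failure by raising an exception, which is a simplification: real systems usually also need timeouts and health checks to detect a silent failure.

```python
def call_with_failover(instances, request):
    """Send the request to the first instance that answers without failing."""
    errors = []
    for instance in instances:
        try:
            return instance(request)
        except Exception as exc:
            # This instance failed: remember why and switch to the next one.
            errors.append(exc)
    raise RuntimeError(f"all {len(instances)} instances failed: {errors}")
```

Passing different implementations of the same specification as `instances` turns the same loop into a simple form of diversity, since an error specific to one implementation is unlikely to recur in the next.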
A lockstep fault-tolerant machine uses replicated elements operating in parallel. At any time, all the replications of each element should be in the same state. The same inputs are provided to each replication, and the same outputs are expected. The outputs of the replications are compared using a voting circuit. A machine with two replications of each element is termed Dual Modular Redundant (DMR). With only two replications, the voting circuit can detect a mismatch but cannot tell which one is wrong, so recovery relies on other methods. A machine with three replications of each element is termed Triple Modular Redundant (TMR). The voting circuit can determine which replication is in error when a two-to-one vote is observed. In this case, the voting circuit can output the correct result and discard the erroneous version. After this, the internal state of the erroneous replication is assumed to be different from that of the other two, and the voting circuit can switch to a DMR mode. This model can be applied to any larger number of replications.
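The voting step can be modelled in software as a strict-majority vote. The `vote` function below is a sketch under the assumption that replica outputs are hashable values that can be compared for equality; a hardware voting circuit compares signals rather than Python objects.

```python
from collections import Counter


def vote(outputs):
    """Return (majority value, indices of dissenting replicas).

    Raises ValueError when no value has a strict majority, which is the
    DMR case of a one-to-one mismatch: detected, but not correctable.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("mismatch detected, but no majority to decide it")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty
```

With three replicas, `vote([7, 7, 9])` yields the value 7 and identifies replica 2 as the one in error. With two disagreeing replicas, as in `vote([7, 9])`, the function can only raise, mirroring the limitation of DMR.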
Lockstep fault tolerant machines are most easily made fully synchronous, with each gate of each replication making the same state transition on the same edge of the clock, and the clocks to the replications being exactly in phase. However, it is possible to build lockstep systems without this requirement.
Bringing the replications into synchrony requires making their internal stored states the same. They can be started from a fixed initial state, such as the reset state. Alternatively, the internal state of one replica can be copied to another replica.
One variant of DMR is pair-and-spare. Two replicated elements operate in lockstep as a pair, with a voting circuit that detects any mismatch between their operations and outputs a signal indicating that there is an error. Another pair operates exactly the same way. A final circuit selects the output of the pair that does not proclaim that it is in error. Pair-and-spare requires four replicas rather than the three of TMR, but has been used commercially.
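The selection logic of pair-and-spare can be sketched as below. The function names are hypothetical, and the sketch evaluates the pairs one after another for simplicity, whereas in the hardware arrangement described above both pairs run at the same time.

```python
def run_pair(replica_a, replica_b, x):
    """Run one lockstep pair; return (output, error_flag)."""
    a, b = replica_a(x), replica_b(x)
    return a, a != b  # the pair reports an error when its replicas disagree


def pair_and_spare(pairs, x):
    """Select the output of the first pair that does not report an error."""
    for replica_a, replica_b in pairs:
        output, error = run_pair(replica_a, replica_b, x)
        if not error:
            return output
    raise RuntimeError("every pair reports an error")
```

Each pair only detects a mismatch, as in plain DMR. The correction comes from having a second pair whose output can be selected instead.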
No Single Point of Repair
If a system experiences a failure, it must continue to operate without interruption during the repair process.
Fault Isolation to the Failing Component
When a failure occurs, the system must be able to isolate the failure to the offending component. This requires the addition of dedicated failure detection mechanisms that exist only for the purpose of fault isolation.
Fault Containment
Some failure mechanisms can cause a system to fail by propagating the failure to the rest of the system. An example of this kind of failure is the "rogue transmitter", which can swamp legitimate communication in a system and cause overall system failure. Mechanisms that isolate a rogue transmitter or failing component to protect the system are required.
Reversion Modes
Some failure mechanisms can endanger the survivability of the system itself, the operator, or the end result. To prevent this, mission-critical systems (e.g. weapon systems, hydraulic tools) provide a safe mode. This can be implemented through interlock mechanisms or software.
See also
- Damage tolerant design
- Byzantine fault tolerance
- Intrusion Tolerance
- Proactive Resilience
- Cluster
- Defence in depth
- Data redundancy
- Object group
- Process group
- Progressive Enhancement
- Transaction processing
- Elegant degradation
- Fail-safe
- Capillary routing
- Error detection and correction
- Separation of protection and security
Bibliography
- Brian Randell, P. A. Lee, P. C. Treleaven (June 1978). "Reliability Issues in Computing System Design". ACM Computing Surveys (CSUR) 10 (2): 123–165. ISSN 0360-0300.
- P. J. Denning (December 1976). "Fault tolerant operating systems". ACM Computing Surveys (CSUR) 8 (4): 359–389. ISSN 0360-0300.
- Theodore A. Linden (December 1976). "Operating System Structures to Support Security and Reliable Software". ACM Computing Surveys (CSUR) 8 (4): 409–445. ISSN 0360-0300.
External links
- Fault Handling and Fault Tolerance — Articles about software and hardware fault tolerance techniques.
- Article "Practical Considerations in Making CORBA Services Fault-Tolerant" by Priya Narasimhan
- Article "Experiences, Strategies and Challenges in Building Fault-Tolerant CORBA Systems" by Pascal Felber and Priya Narasimhan
- Article (an excellent starting point in the subject, read it first and then read the tutorial below) "Dependability And Its Threats: A Taxonomy" by Algirdas Avizienis, Jean-Claude Laprie, B. Randell
- Tutorial (a very good one, read it after you have read the article above) "Software Fault Tolerance: A Tutorial" by Wilfredo Torres-Pomales
- EU funded research project HPC4U addressing development of fault tolerant technologies for Grid computing environments
This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.