Revised Edition: 2016
ISBN 978-1-280-29870-7
© All rights reserved.
Published by:
Library Press
48 West 48 Street, Suite 1116,
New York, NY 10036, United States
Table of Contents
Chapter 1 - Reliability Engineering
Chapter 2 - Failure Rate
Chapter 3 - Safety Engineering
Chapter 4 - Failure Mode & Effects Analysis
Chapter 5 - Root Cause Analysis and Fault Tree Analysis
Chapter 6 - Fault-tolerant Design
Chapter 7 - Fault-tolerant System
Chapter 8 - RAID
Chapter 9 - System Engineering
Chapter 1
Reliability Engineering
Reliability engineering is an engineering field that deals with the study of reliability:
the ability of a system or component to perform its required functions under stated
conditions for a specified period of time. It is often reported as a probability.
Figure: A Reliability Block Diagram
Reliability may be defined in several ways:
The idea that something is fit for a purpose with respect to time;
The capacity of a device or system to perform as designed;
The resistance to failure of a device or system;
The ability of a device or system to perform a required function under stated
conditions for a specified period of time;
The probability that a functional unit will perform its required function for a
specified interval under stated conditions.
The ability of something to "fail well" (fail without catastrophic consequences)
Reliability engineers rely heavily on statistics, probability theory, and reliability theory.
Many engineering techniques are used in reliability engineering, such as reliability
prediction, Weibull analysis, thermal management, reliability testing and accelerated life
testing. Because of the large number of reliability techniques, their expense, and the
varying degrees of reliability required for different situations, most projects develop a
reliability program plan to specify the reliability tasks that will be performed for that
specific system.
The function of reliability engineering is to develop the reliability requirements for the
product, establish an adequate reliability program, and perform appropriate analyses and
tasks to ensure the product will meet its requirements. These tasks are managed by a
reliability engineer, who usually holds an accredited engineering degree and has
additional reliability-specific education and training. Reliability engineering is closely
associated with maintainability engineering and logistics engineering, e.g. Integrated
Logistics Support (ILS). Many problems from other fields, such as security engineering,
can also be approached using reliability engineering techniques. This chapter provides an
overview of some of the most common reliability engineering tasks.
Many types of engineering employ reliability engineers and use the tools and methodology of reliability engineering. For example:
System engineers design complex systems having a specified reliability
Mechanical engineers may have to design a machine or system with a specified reliability
Automotive engineers have reliability requirements for the automobiles (and
components) which they design
Electronics engineers must design and test their products for reliability requirements.
In software engineering and systems engineering, reliability engineering is
the subdiscipline of ensuring that a system (or a device in general) will perform its
intended function(s) when operated in a specified manner for a specified length of
time. Reliability engineering is performed throughout the entire life cycle of a
system, including development, test, production and operation.
Reliability theory
Reliability theory is the foundation of reliability engineering. For engineering purposes,
reliability is defined as:
the probability that a device will perform its intended function during a
specified period of time under stated conditions.
Mathematically, this may be expressed as
$$R(t) = \Pr\{T > t\} = \int_t^{\infty} f(x)\,dx$$
where $f(x)$ is the failure probability density function and $t$ is the length of the
period of time (which is assumed to start from time zero).
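As a minimal sketch of this definition, the following Python snippet approximates R(t) as the integral of a failure probability density from t onward and compares it with a closed form. The exponential density, the rate of 0.001 failures per hour, and the helper name `reliability_from_pdf` are illustrative assumptions, not part of the original text.

```python
import math

def reliability_from_pdf(pdf, t, upper, steps=100_000):
    """Approximate R(t) as the integral of pdf from t to `upper` (trapezoidal rule).

    `upper` stands in for infinity and must be large enough that the
    remaining tail of the density is negligible.
    """
    h = (upper - t) / steps
    total = 0.5 * (pdf(t) + pdf(upper))
    for i in range(1, steps):
        total += pdf(t + i * h)
    return total * h

# Assumed example: exponential failure density with a constant rate per hour.
RATE = 0.001

def exponential_pdf(x):
    return RATE * math.exp(-RATE * x)

t = 500.0
numeric = reliability_from_pdf(exponential_pdf, t, upper=t + 50_000.0)
closed_form = math.exp(-RATE * t)  # known result for the exponential case
print(f"numeric R({t:.0f}) = {numeric:.6f}, closed form = {closed_form:.6f}")
```

For other failure densities only `exponential_pdf` would change; the integration helper is independent of the distribution.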
Reliability engineering is concerned with four key elements of this definition:
First, reliability is a probability. This means that failure is regarded as a
random phenomenon: it is a recurring event, and we do not express any
information on individual failures, the causes of failures, or relationships
between failures, except that the likelihood for failures to occur varies
over time according to the given probability function. Reliability engineering is concerned with meeting the specified probability of success, at a
specified statistical confidence level.
Second, reliability is predicated on "intended function". Generally, this is
taken to mean operation without failure. However, even if no individual
part of the system fails, but the system as a whole does not do what was
intended, then it is still charged against the system reliability. The system
requirements specification is the criterion against which reliability is measured.
Third, reliability applies to a specified period of time. In practical terms,
this means that a system has a specified chance that it will operate without
failure before time t. Reliability engineering ensures that components and
materials will meet the requirements during the specified time. Units other
than time may sometimes be used. The automotive industry might specify
reliability in terms of miles, the military might specify reliability of a gun
for a certain number of rounds fired. A piece of mechanical equipment
may have a reliability rating value in terms of cycles of use.
Fourth, reliability is restricted to operation under stated conditions. This
constraint is necessary because it is impossible to design a system for
unlimited conditions. A Mars Rover will have different specified conditions than the family car. The operating environment must be addressed
during design and testing. Also, that same rover may be required to
operate in varying conditions requiring additional scrutiny.
Reliability program plan
Many tasks, methods, and tools can be used to achieve reliability. Every system requires
a different level of reliability. A commercial airliner must operate under a wide range of
conditions. The consequences of failure are grave, but there is a correspondingly higher
budget. A pencil sharpener may be more reliable than an airliner, but has a much
different set of operational conditions, insignificant consequences of failure, and a much
lower budget.
A reliability program plan is used to document exactly what tasks, methods, tools,
analyses, and tests are required for a particular system. For complex systems, the
reliability program plan is a separate document. For simple systems, it may be combined
with the systems engineering management plan or Integrated Logistics Support
management plan. The reliability program plan is essential for a successful reliability
program and is developed early during system development. It specifies not only what the
reliability engineer does, but also the tasks performed by others. The reliability program
plan is approved by top program management.
Reliability requirements
For any system, one of the first tasks of reliability engineering is to adequately specify
the reliability requirements. Reliability requirements address the system itself, test and
assessment requirements, and associated tasks and documentation. Reliability requirements are included in the appropriate system/subsystem requirements specifications, test
plans, and contract statements.
System reliability parameters
Requirements are specified using reliability parameters. The most common reliability
parameter is the mean time between failures (MTBF), which can also be specified as the
failure rate or the number of failures during a given period. These parameters are very
useful for systems that are operated frequently, such as most vehicles, machinery, and
electronic equipment. Reliability increases as the MTBF increases. The MTBF is usually
specified in hours, but can also be used with other units of measurement such as miles or
cycles.
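The relationship between MTBF, failure rate, and the number of failures in a period can be sketched in a few lines of Python. The sketch assumes a constant failure rate (exponential model), which the text does not require; the function names and the 1000-hour figure are illustrative only.

```python
import math

def failure_rate_from_mtbf(mtbf_hours):
    """Constant failure rate (failures per hour) equivalent to a given MTBF."""
    return 1.0 / mtbf_hours

def expected_failures(mtbf_hours, period_hours):
    """Expected number of failures of a repaired system over a period."""
    return period_hours / mtbf_hours

def mission_reliability(mtbf_hours, mission_hours):
    """Probability of no failure during the mission, assuming a constant failure rate."""
    return math.exp(-mission_hours / mtbf_hours)

mtbf = 1000.0  # assumed example value, in hours
print(failure_rate_from_mtbf(mtbf))       # failures per hour
print(expected_failures(mtbf, 8760.0))    # failures per year of continuous use
print(mission_reliability(mtbf, 100.0))   # probability of surviving a 100-hour mission
```

Under this model a mission as long as the MTBF itself has a survival probability of only about 37%, which is why an MTBF figure should not be read as a guaranteed failure-free period.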
In other cases, reliability is specified as the probability of mission success. For example,
reliability of a scheduled aircraft flight can be specified as a dimensionless probability or
a percentage, as in system safety engineering.
A special case of mission success is the single-shot device or system. These are devices
or systems that remain relatively dormant and only operate once. Examples include
automobile airbags, thermal batteries and missiles. Single-shot reliability is specified as a
probability of success, or is subsumed into a related parameter. Single-shot missile
reliability may be incorporated into a requirement for the probability of hit.
For such systems, the probability of failure on demand (PFD) is the reliability measure.
This PFD is derived from failure rate and mission time for non-repairable systems. For
repairable systems, it is obtained from failure rate and mean-time-to-repair (MTTR) and
test interval. This measure may not be unique for a given system as this measure depends
on the kind of demand. In addition to system level requirements, reliability requirements
may be specified for critical subsystems. In all cases, reliability parameters are specified
with appropriate statistical confidence intervals.
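A rough sketch of how a PFD might be computed from these quantities follows. The non-repairable formula assumes a constant failure rate; the repairable formula is a commonly used first-order approximation for a single periodically tested channel, valid only when the failure rate times the test interval is much smaller than one. Both formulas and all names are assumptions for illustration, not taken from the text.

```python
import math

def pfd_non_repairable(failure_rate, mission_time):
    """Probability that the item has already failed by the time of demand,
    assuming a constant failure rate over the mission time."""
    return 1.0 - math.exp(-failure_rate * mission_time)

def pfd_avg_periodic_test(failure_rate, test_interval, mttr):
    """Approximate average PFD of a single channel that is proof-tested
    every `test_interval` hours and repaired in `mttr` hours on average.
    First-order approximation: requires failure_rate * test_interval << 1."""
    return failure_rate * (test_interval / 2.0 + mttr)

# Assumed example values: rates per hour, times in hours.
print(pfd_non_repairable(1e-5, 1000.0))
print(pfd_avg_periodic_test(1e-6, 8760.0, 8.0))
```

The second function shows why the measure depends on the maintenance regime: shortening the test interval or the repair time lowers the PFD without changing the failure rate.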
Reliability modelling
Reliability modelling is the process of predicting or understanding the reliability of a
component or system. Two separate fields of investigation are common: the physics of
failure approach uses an understanding of the failure mechanisms involved, such as crack
propagation or chemical corrosion; the parts stress modelling approach is an empirical
method for prediction based on counting the number and type of components of the
system, and the stress they undergo during operation.
For systems with a clearly defined failure time (which is sometimes not given for systems
with a drifting parameter), the empirical distribution function of these failure times can be
determined. This is done in general in an accelerated experiment with increased stress.
These experiments can be divided into two main categories:
Early failure rate studies determine the distribution with a decreasing failure rate over the
first part of the bathtub curve. Here in general only moderate stress is necessary. The
stress is applied for a limited period of time in what is called a censored test. Therefore,
only the part of the distribution with early failures can be determined.
In so-called zero defect experiments, only limited information about the failure
distribution is acquired. Here the stress, stress time, or the sample size is so low that not a
single failure occurs. Due to the insufficient sample size, only an upper limit of the early
failure rate can be determined. At any rate, it looks good for the customer if there are no
failures.
In a study of the intrinsic failure distribution, which is often a material property, higher
stresses are necessary to get failure in a reasonable period of time. Several degrees of
stress have to be applied to determine an acceleration model. The empirical failure
distribution is often parametrised with a Weibull or a log-normal model.
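To illustrate why the Weibull model is a popular parametrisation, the sketch below evaluates the standard two-parameter Weibull reliability and hazard functions. A shape parameter below one gives a decreasing failure rate (early failures), a value of one gives a constant rate, and a value above one gives an increasing rate (wear-out). The parameter values are arbitrary assumptions.

```python
import math

def weibull_reliability(t, shape, scale):
    """R(t) = exp(-(t / scale) ** shape) for the two-parameter Weibull model."""
    return math.exp(-((t / scale) ** shape))

def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

scale = 1000.0  # assumed characteristic life, in hours
for shape in (0.5, 1.0, 3.0):
    early = weibull_hazard(10.0, shape, scale)
    late = weibull_hazard(500.0, shape, scale)
    print(f"shape={shape}: hazard at 10 h = {early:.6f}, at 500 h = {late:.6f}")
```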
It is common practice to model the early failure rate with an exponential distribution. This
less complex model for the failure distribution has only one parameter: the constant
failure rate. In such cases, the chi-square distribution can be used to find the goodness of
fit for the estimated failure rate. Compared to a model with a decreasing failure rate, this
is quite pessimistic. (Important remark: this is not the case if fewer hours or load cycles are
tested than the service life in a wear-out type of test. In that case the opposite is true, and
assuming a more constant failure rate than exists in reality can be dangerous.) Sensitivity
analysis should be conducted in this case.
Reliability test requirements
Because reliability is a probability, even highly reliable systems have some chance of
failure. However, testing reliability requirements is problematic for several reasons. A
single test is insufficient to generate enough statistical data. Multiple tests or long-duration tests are usually very expensive. Some tests are simply impractical. Reliability
engineering is used to design a realistic and affordable test program that provides enough
evidence that the system meets its requirement. Statistical confidence levels are used to
address some of these concerns. A certain parameter is expressed along with a
corresponding confidence level: for example, an MTBF of 1000 hours at 90%
confidence level. From this specification, the reliability engineer can design a test with
explicit criteria for the number of hours and number of failures until the requirement is
met or failed.
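One way such a test could be sized is sketched below for a time-terminated test with a fixed number of allowed failures. It assumes a constant failure rate, so the number of failures in the test follows a Poisson distribution, and it searches for the total test time at which observing no more than the allowed number of failures would be unlikely (probability 1 minus the confidence level) if the true MTBF only just equalled the requirement. The function names are hypothetical.

```python
import math

def poisson_cdf(k, mean):
    """P(N <= k) for a Poisson-distributed count N with the given mean."""
    term = math.exp(-mean)
    total = term
    for i in range(1, k + 1):
        term *= mean / i
        total += term
    return total

def required_test_hours(mtbf_required, confidence, allowed_failures=0):
    """Total test time needed to demonstrate `mtbf_required` at `confidence`
    when the test is passed with at most `allowed_failures` failures."""
    target = 1.0 - confidence
    lo, hi = 0.0, 1.0
    while poisson_cdf(allowed_failures, hi) > target:
        hi *= 2.0                      # widen the bracket until it contains the root
    for _ in range(200):               # bisection on the expected number of failures
        mid = (lo + hi) / 2.0
        if poisson_cdf(allowed_failures, mid) > target:
            lo = mid
        else:
            hi = mid
    return hi * mtbf_required

for failures in (0, 1, 2):
    hours = required_test_hours(1000.0, 0.90, failures)
    print(f"allowed failures = {failures}: about {hours:.0f} test hours")
```

For the example in the text (1000 hours at 90% confidence) a zero-failure plan needs roughly 2,303 total test hours, and each additional allowed failure lengthens the test. This is the trade-off between producer risk and test cost.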
The combination of reliability parameter value and confidence level greatly affects the
development cost and the risk to both the customer and producer. Care is needed to select
the best combination of requirements. Reliability testing may be performed at various
levels, such as component, subsystem, and system. Also, many factors must be addressed
during testing, such as extreme temperature and humidity, shock, vibration, and heat.
Reliability engineering determines an effective test strategy so that all parts are exercised
in relevant environments. For systems that must last many years, reliability engineering
may be used to design an accelerated life test as well.
Reliability prediction
A prediction of reliability is an important element in the process of selecting equipment
for use by telecommunications service providers and other buyers of electronic equipment. Reliability is a measure of the frequency of equipment failures as a function of
time. Reliability has a major impact on maintenance and repair costs and on the
continuity of service. Reliability predictions:
Help assess the effect of product reliability on the maintenance activity and on the
quantity of spare units required for acceptable field performance of any particular
system. For example, predictions of the frequency of unit level maintenance
actions can be obtained. Reliability prediction can be used to size spare populations.
Provide necessary input to system-level reliability models. System-level reliability models can subsequently be used to predict, for example, frequency of system
outages in steady-state, frequency of system outages during early life, expected
downtime per year, and system availability.
Provide necessary input to unit and system-level Life Cycle Cost Analyses. Life
cycle cost studies determine the cost of a product over its entire life. Therefore,
how often a unit will have to be replaced needs to be known. Inputs to this
process include unit and system failure rates. This includes how often units and
systems fail during the first year of operation as well as in later years.
Assist in deciding which product to purchase from a list of competing products.
As a result, it is essential that reliability predictions be based on a common
procedure.
Can be used to set factory test standards for products requiring a reliability test.
Reliability predictions help determine how often the system should fail.
Are needed as input to the analysis of complex systems such as switching systems
and digital cross-connect systems. It is necessary to know how often different
parts of the system are going to fail even for redundant components.
Can be used in design trade-off studies. For example, a supplier could look at a
design with many simple devices and compare it to a design with fewer devices
that are newer but more complex. The unit with fewer devices is usually more
reliable.
Can be used to set achievable in-service performance standards against which to
judge actual performance and stimulate action.
The telecommunications industry has devoted much time over the years to
developing reliability models for electronic equipment. One such tool is the Automated
Reliability Prediction Procedure (ARPP), which is an Excel-spreadsheet software tool
that automates the reliability prediction procedures in SR-332, Reliability Prediction
Procedure for Electronic Equipment. FD-ARPP-01 provides suppliers and manufacturers
with a tool for making Reliability Prediction Procedure (RPP) calculations. It also
provides a means for understanding RPP calculations through the capability of interactive
examples provided by the user.
The RPP views electronic systems as hierarchical assemblies. Systems are constructed
from units that, in turn, are constructed from devices. The methods presented predict
reliability at these three hierarchical levels:
1. Device: A basic component (or part)
2. Unit: Any assembly of devices. This may include, but is not limited to, circuit
packs, modules, plug-in units, racks, power supplies, and ancillary equipment.
Unless otherwise dictated by maintenance considerations, a unit will usually be
the lowest level of replaceable assemblies/devices. The RPP is aimed primarily at
reliability prediction of units.
3. Serial System: Any assembly of units for which the failure of any single unit will
cause a failure of the system.
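The device, unit, and serial-system hierarchy above can be sketched as a simple roll-up of failure rates. This is a deliberately simplified illustration that assumes constant, independent failure rates that add in series; it omits the quality, stress, and temperature factors that a real prediction procedure applies. The device names and FIT values (failures per 10^9 hours) are invented for the example.

```python
FIT_HOURS = 1e9  # one FIT is one failure per 10**9 device-hours

def series_failure_rate(rates):
    """Failure rate of a series assembly: the sum of its parts' failure rates."""
    return sum(rates)

def mtbf_hours_from_fit(fit):
    """MTBF in hours for a constant failure rate given in FIT."""
    return FIT_HOURS / fit

# Hypothetical units, each a list of device failure rates in FIT.
units = {
    "power supply": [50.0, 120.0, 30.0],
    "line card": [200.0, 75.0, 25.0],
}

unit_rates = {name: series_failure_rate(devices) for name, devices in units.items()}
system_rate = series_failure_rate(unit_rates.values())

for name, rate in unit_rates.items():
    print(f"{name}: {rate:.0f} FIT")
print(f"serial system: {system_rate:.0f} FIT, MTBF = {mtbf_hours_from_fit(system_rate):.0f} h")
```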
Requirements for reliability tasks
Reliability engineering must also address requirements for various reliability tasks and
documentation during system development, test, production, and operation. These
requirements are generally specified in the contract statement of work and depend on
how much leeway the customer wishes to provide to the contractor. Reliability tasks
include various analyses, planning, and failure reporting. Task selection depends on the
criticality of the system as well as cost. A critical system may require a formal failure
reporting and review process throughout development, whereas a non-critical system may
rely on final test reports. The most common reliability program tasks are documented in
reliability program standards, such as MIL-STD-785 and IEEE 1332. Failure reporting
analysis and corrective action systems are a common approach for product/process
reliability monitoring.
Design for reliability
Design For Reliability (DFR) is an emerging discipline that refers to the process of
designing reliability into products. This process encompasses several tools and practices
and describes the order of their deployment that an organization needs to have in place to
drive reliability into their products. Typically, the first step in the DFR process is to set
the system’s reliability requirements. Reliability must be "designed in" to the system.
During system design, the top-level reliability requirements are then allocated to
subsystems by design engineers and reliability engineers working together.
Reliability design begins with the development of a model. Reliability models use block
diagrams and fault trees to provide a graphical means of evaluating the relationships
between different parts of the system. These models incorporate predictions based on
parts-count failure rates taken from historical data. While the predictions are often not
accurate in an absolute sense, they are valuable to assess relative differences in design
alternatives.
Figure: A Fault Tree Diagram
One of the most important design techniques is redundancy. This means that if one part
of the system fails, there is an alternate success path, such as a backup system. An
automobile brake light might use two light bulbs. If one bulb fails, the brake light still
operates using the other bulb. Redundancy significantly increases system reliability, and
is often the only viable means of doing so. However, redundancy is difficult and
expensive, and is therefore limited to critical parts of the system. Another design
technique, physics of failure, relies on understanding the physical processes of stress,
strength and failure at a very detailed level. Then the material or component can be redesigned to reduce the probability of failure. Another common design technique is
component derating: selecting components whose tolerance significantly exceeds the
expected stress, such as using a heavier gauge wire that exceeds the normal specification for
the expected electrical current.
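The effect of redundancy in the brake light example can be quantified with the standard parallel-system formula, sketched below. It assumes the redundant parts fail independently, which real designs must verify (a shared fuse or connector would be a common-cause failure). The bulb reliability of 0.95 is an invented figure.

```python
def parallel_reliability(reliabilities):
    """Reliability of a redundant group that works if at least one part works,
    assuming independent failures: 1 minus the product of the unreliabilities."""
    all_fail = 1.0
    for r in reliabilities:
        all_fail *= (1.0 - r)
    return 1.0 - all_fail

single_bulb = 0.95  # assumed probability that one bulb survives the period
print(parallel_reliability([single_bulb]))               # one bulb
print(parallel_reliability([single_bulb, single_bulb]))  # two redundant bulbs
```

With these assumed numbers the probability of losing the brake light drops from 5% to 0.25%, which illustrates why redundancy is so effective when it can be afforded.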
Many tasks, techniques and analyses are specific to particular industries and applications.
Commonly these include:
Built-in test (BIT)
Failure mode and effects analysis (FMEA)
Reliability simulation modeling
Thermal analysis
Reliability Block Diagram analysis
Fault tree analysis
Root cause analysis
Sneak circuit analysis
Accelerated Testing
Reliability Growth analysis
Weibull analysis
Electromagnetic analysis
Statistical interference
Avoid Single Point of Failure
Results are presented during the system design reviews and logistics reviews. Reliability
is just one requirement among many system requirements. Engineering trade studies are
used to determine the optimum balance between reliability and other requirements and
constraints.
Reliability testing
Figure: A Reliability Sequential Test Plan
The purpose of reliability testing is to discover potential problems with the design as
early as possible and, ultimately, provide confidence that the system meets its reliability
requirements.
Reliability testing may be performed at several levels. Complex systems may be tested at
component, circuit board, unit, assembly, subsystem and system levels. (The test level
nomenclature varies among applications.) For example, performing environmental stress
screening tests at lower levels, such as piece parts or small assemblies, catches problems
before they cause failures at higher levels. Testing proceeds during each level of
integration through full-up system testing, developmental testing, and operational testing,
thereby reducing program risk. System reliability is calculated at each test level.
Reliability growth techniques and failure reporting, analysis and corrective action
systems (FRACAS) are often employed to improve reliability as testing progresses. The
drawbacks to such extensive testing are time and expense. Customers may choose to
accept more risk by eliminating some or all lower levels of testing.
Another type of test is the sequential probability ratio test. Such tests take both a
statistical type 1 and type 2 error, combined with a discrimination ratio, as main
inputs (together with the reliability requirement). Both risks are set independently
before the start of the test: the risk of incorrectly accepting a bad design (type 2 error)
and the risk of incorrectly rejecting a good design (type 1 error), together with the
discrimination ratio and the required minimum reliability parameter. The test is therefore
more controllable and provides more information from a quality and business point of view.
The number of test samples is not fixed, but this test is generally reported to be more
efficient (requiring fewer samples) and more informative than, for example, zero-failure
testing.
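A minimal sketch of such a sequential test, assuming exponentially distributed times between failures and using Wald's classical decision boundaries, is shown below. Here `mtbf_good` is the required MTBF, `mtbf_bad` is the unacceptable MTBF (their ratio is the discrimination ratio), `alpha` is the type 1 risk, and `beta` is the type 2 risk. All names and numbers are illustrative assumptions, and published test plans typically add truncation rules that are not modelled here.

```python
import math

def sprt_decision(failures, total_hours, mtbf_good, mtbf_bad, alpha, beta):
    """Wald sequential probability ratio test for an exponential MTBF.

    Returns "accept", "reject", or "continue" given the failures observed
    so far over the accumulated test time.
    """
    # Log likelihood ratio of the bad-MTBF hypothesis against the good one.
    llr = (failures * math.log(mtbf_good / mtbf_bad)
           - total_hours * (1.0 / mtbf_bad - 1.0 / mtbf_good))
    reject_at = math.log((1.0 - beta) / alpha)   # evidence favours the bad design
    accept_at = math.log(beta / (1.0 - alpha))   # evidence favours the good design
    if llr >= reject_at:
        return "reject"
    if llr <= accept_at:
        return "accept"
    return "continue"

# Assumed plan: required MTBF 1000 h, discrimination ratio 2, both risks 10%.
plan = dict(mtbf_good=1000.0, mtbf_bad=500.0, alpha=0.10, beta=0.10)
print(sprt_decision(failures=2, total_hours=1000.0, **plan))
print(sprt_decision(failures=0, total_hours=5000.0, **plan))
print(sprt_decision(failures=5, total_hours=500.0, **plan))
```

Because a decision is taken as soon as the evidence crosses either boundary, clearly good or clearly bad designs tend to finish the test early, which is the source of the efficiency mentioned above.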
It is not always feasible to test all system requirements. Some systems are prohibitively
expensive to test; some failure modes may take years to observe; some complex
interactions result in a huge number of possible test cases; and some tests require the use
of limited test ranges or other resources. In such cases, different approaches to testing can
be used, such as accelerated life testing, design of experiments, and simulations.
The desired level of statistical confidence also plays an important role in reliability
testing. Statistical confidence is increased by increasing either the test time or the number
of items tested. Reliability test plans are designed to achieve the specified reliability at
the specified confidence level with the minimum number of test units and test time.
Different test plans result in different levels of risk to the producer and consumer. The
desired reliability, statistical confidence, and risk levels for each side influence the
ultimate test plan. Good test requirements ensure that the customer and developer agree
in advance on how reliability requirements will be tested.
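The link between confidence level and the number of items tested can be illustrated with the widely used zero-failure ("success run") relationship: if n independent items all pass, a reliability of at least R is demonstrated at confidence C when R^n <= 1 - C. This formula is an assumption brought in for illustration; the text does not prescribe a particular test plan.

```python
import math

def success_run_sample_size(reliability, confidence):
    """Smallest number of items that must all pass, with zero failures,
    to demonstrate `reliability` at `confidence` (binomial model)."""
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))

for target in (0.90, 0.95, 0.99):
    n = success_run_sample_size(target, 0.90)
    print(f"R >= {target:.2f} at 90% confidence: {n} units, zero failures allowed")
```

Demonstrating 90% reliability at 90% confidence takes 22 failure-free units under this model, and the count grows quickly as the reliability target approaches one, which is one reason high-reliability requirements are expensive to verify by test alone.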
A key aspect of reliability testing is to define "failure". Although this may seem obvious,
there are many situations where it is not clear whether a failure is really the fault of the
system. Variations in test conditions, operator differences, weather, and unexpected
situations create differences between the customer and the system developer. One
strategy to address this issue is to use a scoring conference process. A scoring
conference includes representatives from the customer, the developer, the test
organization, the reliability organization, and sometimes independent observers. The
scoring conference process is defined in the statement of work. Each test case is
considered by the group and "scored" as a success or failure. This scoring is the official
result used by the reliability engineer.
As part of the requirements phase, the reliability engineer develops a test strategy with
the customer. The test strategy makes trade-offs between the needs of the reliability
organization, which wants as much data as possible, and constraints such as cost,
schedule, and available resources. Test plans and procedures are developed for each
reliability test, and results are documented in official reports.
Accelerated testing
The purpose of accelerated life testing is to induce field failure in the laboratory at a
much faster rate by providing a harsher, but nonetheless representative, environment. In
such a test the product is expected to fail in the lab just as it would have failed in the
field—but in much less time. The main objective of an accelerated test is either of the
following:
To discover failure modes
To predict the normal field life from the high stress lab life
An accelerated testing program can be broken down into the following steps:
Define objective and scope of the test
Collect required information about the product
Identify the stress(es)
Determine level of stress(es)
Conduct the accelerated test and analyse the accelerated data
Common models used to determine a life-stress relationship are:
Arrhenius Model
Eyring Model
Inverse Power Law Model
Temperature-Humidity Model
Temperature Non-thermal Model
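As an illustration of the first model in the list, the sketch below computes an Arrhenius acceleration factor, the ratio between life at the use temperature and life at the elevated test temperature. The activation energy of 0.7 eV and the two temperatures are assumed example values; in practice the activation energy depends on the failure mechanism and must be determined or justified for the product under test.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K

def arrhenius_acceleration_factor(activation_energy_ev, use_temp_c, stress_temp_c):
    """Acceleration factor AF = exp((Ea / k) * (1/T_use - 1/T_stress)),
    with temperatures converted from degrees Celsius to kelvin."""
    t_use = use_temp_c + 273.15
    t_stress = stress_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use - 1.0 / t_stress)
    return math.exp(exponent)

af = arrhenius_acceleration_factor(0.7, use_temp_c=55.0, stress_temp_c=125.0)
print(f"acceleration factor: {af:.1f}")
print(f"1000 test hours represent about {1000.0 * af:.0f} field hours")
```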
Software reliability
Software reliability is a special aspect of reliability engineering. System reliability, by
definition, includes all parts of the system, including hardware, software, operators and
procedures. Traditionally, reliability engineering focuses on critical hardware parts of the
system. Since the widespread use of digital integrated circuit technology, software has
become an increasingly critical part of most electronics and, hence, nearly all present day
systems. There are significant differences, however, in how software and hardware
behave. Most hardware unreliability is the result of a component or material failure that
results in the system not performing its intended function. Repairing or replacing the
hardware component restores the system to its original unfailed state. However, software
does not fail in the same sense that hardware fails. Instead, software unreliability is the
result of unanticipated results of software operations. Even relatively small software
programs can have astronomically large combinations of inputs and states that are
infeasible to exhaustively test. Restoring software to its original state only works until the
same combination of inputs and states results in the same unintended result. Software
reliability engineering must take this into account.
Despite this difference in the source of failure between software and hardware —
software does not wear out — some in the software reliability engineering community
believe statistical models used in hardware reliability are nevertheless useful as a measure
of software reliability, describing what we experience with software: the longer you run
software, the higher the probability you will eventually use it in an untested manner and
find a latent defect that results in a failure (Shooman 1987), (Musa 2005), (Denney
As with hardware, software reliability depends on good requirements, design and
implementation. Software reliability engineering relies heavily on a disciplined software
engineering process to anticipate and design against unintended consequences. There is
more overlap between software quality engineering and software reliability engineering
than between hardware quality and reliability. A good software development plan is a key
aspect of the software reliability program. The software development plan describes the
design and coding standards, peer reviews, unit tests, configuration management,
software metrics and software models to be used during software development.
A common reliability metric is the number of software faults, usually expressed as faults
per thousand lines of code. This metric, along with software execution time, is key to
most software reliability models and estimates. The theory is that the software reliability
increases as the number of faults (or fault density) goes down. Establishing a direct
connection between fault density and mean-time-between-failure is difficult, however,
because of the way software faults are distributed in the code, their severity, and the
probability of the combination of inputs necessary to encounter the fault. Nevertheless,
fault density serves as a useful indicator for the reliability engineer. Other software
metrics, such as complexity, are also used.
Testing is even more important for software than hardware. Even the best software
development process results in some software faults that are nearly undetectable until
tested. As with hardware, software is tested at several levels, starting with individual
units, through integration and full-up system testing. Unlike hardware, it is inadvisable to
skip levels of software testing. During all phases of testing, software faults are
discovered, corrected, and re-tested. Reliability estimates are updated based on the fault
density and other metrics. At system level, mean-time-between-failure data are collected
and used to estimate reliability. Unlike hardware, performing exactly the same test on
exactly the same software configuration does not provide increased statistical confidence.
Instead, software reliability uses different metrics such as test coverage.
Eventually, the software is integrated with the hardware in the top-level system, and
software reliability is subsumed by system reliability. The Software Engineering
Institute's Capability Maturity Model is a common means of assessing the overall
software development process for reliability and quality purposes. However, actual
software reliability is served through SAE standards JA1002 and JA1003.
Reliability Operational Assessment
After a system is produced, reliability engineering monitors, assesses, and corrects
deficiencies. Monitoring includes electronic and visual surveillance of critical parameters
identified during the fault tree analysis design stage. The data are constantly analyzed
using statistical techniques, such as Weibull analysis and linear regression, to ensure the
system reliability meets requirements. Reliability data and estimates are also key inputs
for system logistics. Data collection is highly dependent on the nature of the system.
Most large organizations have quality control groups that collect failure data on vehicles,
equipment, and machinery. Consumer product failures are often tracked by the number of
returns. For systems in dormant storage or on standby, it is necessary to establish a
formal surveillance program to inspect and test random samples. Any changes to the
system, such as field upgrades or recall repairs, require additional reliability testing to
ensure the reliability of the modification. Since it is not possible to anticipate all the
failure modes of a given system, especially ones with a human element, failures will
occur. The reliability program also includes a systematic root cause analysis that
identifies the causal relationships involved in the failure such that effective corrective
actions may be implemented. When possible, system failures and corrective actions are
reported to the reliability engineering organization.
One of the most common methods of applying a Reliability Operational Assessment is a
Failure Reporting, Analysis and Corrective Action System (FRACAS). This systematic
approach develops a reliability, safety and logistics assessment based on failure/incident
reporting, management, analysis and corrective/preventive actions. Organizations today
are adopting this method and use commercial systems, such as Web-based FRACAS
applications, that enable an organization to create a failure/incident data
repository from which statistics can be derived to view accurate and genuine reliability,
safety and quality performance.
Some of the common outputs from a FRACAS system include field MTBF, MTTR,
spares consumption, reliability growth, and failure/incident distribution by type, location,
part number, serial number, symptom, etc.
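As a minimal sketch of how two of these outputs might be derived from a failure/incident repository, the Python below computes field MTBF and MTTR from a short list of records. The record layout (`FailureRecord` and its field names) and the sample values are hypothetical and are not taken from any particular FRACAS product.

```python
from dataclasses import dataclass

@dataclass
class FailureRecord:
    # Hypothetical record layout; real FRACAS schemas vary by organization.
    serial_no: str
    operating_hours_before_failure: float  # uptime accumulated before this failure
    repair_hours: float                    # time taken to restore service

def field_mtbf(records):
    """Mean time between failures: total operating hours / number of failures."""
    total_uptime = sum(r.operating_hours_before_failure for r in records)
    return total_uptime / len(records)

def field_mttr(records):
    """Mean time to repair: total repair hours / number of repairs."""
    return sum(r.repair_hours for r in records) / len(records)

# Made-up sample data for illustration only.
records = [
    FailureRecord("SN-001", 1200.0, 4.0),
    FailureRecord("SN-002", 800.0, 2.0),
    FailureRecord("SN-001", 1000.0, 6.0),
]
print(field_mtbf(records))  # 1000.0
print(field_mttr(records))  # 4.0
```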
Reliability organizations
Systems of any significant complexity are developed by organizations of people, such as
a commercial company or a government agency. The reliability engineering organization
must be consistent with the company's organizational structure. For small, non-critical
systems, reliability engineering may be informal. As complexity grows, the need arises
for a formal reliability function. Because reliability is important to the customer, the
customer may even specify certain aspects of the reliability organization.
There are several common types of reliability organizations. The project manager or chief
engineer may employ one or more reliability engineers directly. In larger organizations,
there is usually a product assurance or specialty engineering organization, which may
include reliability, maintainability, quality, safety, human factors, logistics, etc. In such
cases, the reliability engineer reports to the product assurance manager or specialty
engineering manager.
In some cases, a company may wish to establish an independent reliability organization.
This is desirable to ensure that the system reliability, which is often expensive and time
consuming, is not unduly slighted due to budget and schedule pressures. In such cases,
the reliability engineer works for the project day-to-day, but is actually employed and
paid by a separate organization within the company.
Because reliability engineering is critical to early system design, it has become common
for reliability engineers, however the organization is structured, to work as part of an
integrated product team.
The American Society for Quality has a program to become a Certified Reliability
Engineer (CRE). Certification is based on education, experience, and a certification test;
periodic recertification is required. The body of knowledge for the test includes
reliability management, design evaluation, product safety, statistical tools, design and
development, modeling, reliability testing, collecting and using data, etc.
Another highly respected certification program is the CRP (Certified Reliability
Professional). To achieve certification, candidates must complete a series of courses
focused on important Reliability Engineering topics, successfully apply the learned body
of knowledge in the workplace and publicly present this expertise in an industry
conference or journal.
Reliability engineering education
Some universities offer graduate degrees in Reliability Engineering. Other reliability
engineers typically have an engineering degree, which can be in any field of engineering,
from an accredited university or college program. Many engineering programs offer
reliability courses, and some universities have entire reliability engineering programs. A
reliability engineer may be registered as a Professional Engineer by the state, but this is
not required by most employers. There are many professional conferences and industry
training programs available for reliability engineers. Several professional organizations
exist for reliability engineers, including the IEEE Reliability Society, the American
Society for Quality (ASQ), and the Society of Reliability Engineers (SRE).
Chapter- 2
Failure Rate
Failure rate is the frequency with which an engineered system or component fails,
expressed for example in failures per hour. It is often denoted by the Greek letter λ
(lambda) and is important in reliability engineering.
The failure rate of a system usually depends on time, with the rate varying over the life
cycle of the system. For example, an automobile's failure rate in its fifth year of service
may be many times greater than its failure rate during its first year of service. One does
not expect to replace an exhaust pipe, overhaul the brakes, or have major transmission
problems in a new vehicle.
In practice, the mean time between failures (MTBF, 1/λ) is often used instead of the
failure rate. The MTBF is an important system parameter in systems where failure rate
needs to be managed, in particular for safety systems. The MTBF appears frequently in
the engineering design requirements, and governs frequency of required system
maintenance and inspections. In special processes called renewal processes, where the
time to recover from failure can be neglected and the likelihood of failure remains
constant with respect to time, the failure rate is simply the multiplicative inverse of the
MTBF (1/λ).
A similar ratio used in the transport industries, especially in railways and trucking, is
'mean distance between failures', a variation which attempts to correlate actual loaded
distances to similar reliability needs and practices.
Failure rates are important factors in the insurance, finance, commerce and regulatory
industries and fundamental to the design of safe systems in a wide variety of applications.
Failure rate in the discrete sense
The failure rate can be defined as the following:
The total number of failures within an item population, divided by the total time
expended by that population, during a particular measurement interval under
stated conditions. (MacDiarmid, et al.)
Although the failure rate, λ(t), is often thought of as the probability that a failure occurs
in a specified interval given no failure before time t, it is not actually a probability
because it can exceed 1. It can be defined with the aid of the reliability function or
survival function R(t), the probability of no failure before time t, as:
$$\lambda(t) = \frac{R(t_1) - R(t_2)}{(t_2 - t_1)\,R(t_1)} = \frac{R(t) - R(t + \Delta t)}{\Delta t\,R(t)}$$
over a time interval (t2 − t1) from t1 (or t) to t2, where Δt is defined as (t2 − t1). Note that this
is a conditional probability, hence the R(t) in the denominator.
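The interval failure rate can be sketched numerically. The snippet below assumes the standard form, the drop in R over the interval divided by the interval length and by R at the start of the interval. Purely for illustration it uses an exponential reliability function with made-up rates; the names `interval_failure_rate` and `LAMBDA_0` are hypothetical.

```python
import math

def interval_failure_rate(reliability, t1, t2):
    """Failure rate over [t1, t2], conditional on survival up to t1."""
    delta_t = t2 - t1
    return (reliability(t1) - reliability(t2)) / (delta_t * reliability(t1))

# Illustrative assumption: exponential survival with a made-up rate.
LAMBDA_0 = 0.001  # failures per hour (hypothetical)

def reliability(t):
    return math.exp(-LAMBDA_0 * t)

# Over a short interval the result is close to the underlying rate.
rate = interval_failure_rate(reliability, 100.0, 101.0)

# With a fast-failing item the value exceeds 1, so it cannot be a probability.
high_rate = interval_failure_rate(lambda t: math.exp(-5.0 * t), 0.0, 0.1)
print(rate, high_rate)
```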
Failure rate in the continuous sense
Exponential failure density functions
Calculating the failure rate for ever smaller intervals of time results in the hazard
function (or hazard rate), h(t). This becomes the instantaneous failure rate as Δt tends to
zero:
$$h(t) = \lim_{\Delta t \to 0} \frac{R(t) - R(t + \Delta t)}{\Delta t\,R(t)}$$
A continuous failure rate depends on the existence of a failure distribution, F(t), which
is a cumulative distribution function that describes the probability of failure (at least) up
to and including time t,
$$F(t) = P(T \le t) = 1 - R(t), \quad t \ge 0$$
where T is the failure time. The failure distribution function is the integral of the failure
density function, f(t),
$$F(t) = \int_0^t f(\tau)\,d\tau$$
The hazard function can be defined now as
$$h(t) = \frac{f(t)}{R(t)}$$
Many probability distributions can be used to model the failure distribution. A common
model is the exponential failure distribution,
$$F(t) = \int_0^t \lambda e^{-\lambda \tau}\,d\tau = 1 - e^{-\lambda t}$$
which is based on the exponential density function. The hazard rate function for this is:
$$h(t) = \frac{f(t)}{R(t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda$$
Thus, for an exponential failure distribution, the hazard rate is a constant with respect to
time (that is, the distribution is "memoryless"). For other distributions, such as a Weibull
distribution or a log-normal distribution, the hazard function may not be constant with
respect to time. For some such as the deterministic distribution it is monotonic increasing
(analogous to "wearing out"), for others such as the Pareto distribution it is monotonic
decreasing (analogous to "burning in"), while for many it is not monotonic.
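To make the contrast concrete, the following sketch evaluates the hazard rate as the failure density divided by the reliability function, for an exponential distribution and for two Weibull distributions. The parameter values are arbitrary illustrations, and the helper names (`hazard`, `exponential`, `weibull`) are hypothetical.

```python
import math

def hazard(density, reliability, t):
    """Hazard rate h(t) = f(t) / R(t)."""
    return density(t) / reliability(t)

def exponential(lam):
    """Return (f, R) for an exponential failure distribution."""
    return (lambda t: lam * math.exp(-lam * t),
            lambda t: math.exp(-lam * t))

def weibull(shape, scale):
    """Return (f, R) for a two-parameter Weibull failure distribution."""
    def f(t):
        z = t / scale
        return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))
    def R(t):
        return math.exp(-((t / scale) ** shape))
    return f, R

times = [100.0, 500.0, 900.0]
exp_f, exp_R = exponential(lam=0.002)
wear_f, wear_R = weibull(shape=2.0, scale=1000.0)  # shape > 1: hazard rises
burn_f, burn_R = weibull(shape=0.5, scale=1000.0)  # shape < 1: hazard falls

exp_h = [hazard(exp_f, exp_R, t) for t in times]
wear_h = [hazard(wear_f, wear_R, t) for t in times]
burn_h = [hazard(burn_f, burn_R, t) for t in times]
print(exp_h, wear_h, burn_h)
```

The exponential hazard stays at the chosen rate at every time, while the Weibull hazard rises or falls depending on whether the shape parameter is above or below 1.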
Failure rate data
Failure rate data can be obtained in several ways. The most common means are:
Historical data about the device or system under consideration.
Many organizations maintain internal databases of failure information on the
devices or systems that they produce, which can be used to calculate failure rates
for those devices or systems. For new devices or systems, the historical data for
similar devices or systems can serve as a useful estimate.
Government and commercial failure rate data.
Handbooks of failure rate data for various components are available from
government and commercial sources. MIL-HDBK-217F, Reliability Prediction of
Electronic Equipment, is a military standard that provides failure rate data for
many military electronic components. Several failure rate data sources are
available commercially that focus on commercial components, including some
non-electronic components.
Testing.
The most accurate source of data is to test samples of the actual devices or
systems in order to generate failure data. This is often prohibitively expensive or
impractical, so that the previous data sources are often used instead.
Failure rates can be expressed using any measure of time, but hours is the most common
unit in practice. Other units, such as miles, revolutions, etc., can also be used in place of
"time" units.
Failure rates are often expressed in engineering notation as failures per million, or 10⁻⁶,
especially for individual components, since their failure rates are often very low.
The Failures In Time (FIT) rate of a device is the number of failures that can be
expected in one billion (10⁹) device-hours of operation. (E.g. 1000 devices for 1 million
hours, or 1 million devices for 1000 hours each, or some other combination.) This term is
used particularly by the semiconductor industry.
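A small sketch of the FIT arithmetic follows. The function names and the 2e-8 failures-per-hour figure are hypothetical; the only fact relied on is that one FIT is one expected failure per billion device-hours.

```python
import math

DEVICE_HOURS_PER_FIT_UNIT = 1e9  # FIT counts failures per billion device-hours

def fit_from_rate(failures_per_hour):
    """Convert a failure rate in failures per hour to FIT."""
    return failures_per_hour * DEVICE_HOURS_PER_FIT_UNIT

def expected_failures(fit, devices, hours_each):
    """Expected number of failures for a population run for hours_each hours."""
    return fit * devices * hours_each / DEVICE_HOURS_PER_FIT_UNIT

fit = fit_from_rate(2e-8)  # hypothetical part: about 20 FIT
# 1000 devices for 1 million hours is one billion device-hours,
# so the expected failure count equals the FIT value.
n = expected_failures(fit, devices=1000, hours_each=1_000_000)
print(fit, n)
```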
Under certain engineering assumptions (e.g. besides the above assumptions for a constant
failure rate, the assumption that the considered system has no relevant redundancies), the
failure rate for a complex system is simply the sum of the individual failure rates of its
components, as long as the units are consistent, e.g. failures per million hours. This
permits testing of individual components or subsystems, whose failure rates are then
added to obtain the total system failure rate.
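Under those assumptions (constant failure rates, no redundancy, consistent units), the summation can be sketched as below. The component names and rates are invented for illustration.

```python
import math

def system_failure_rate(component_rates):
    """Sum component failure rates; valid only for a series system
    with constant rates expressed in the same units."""
    return sum(component_rates.values())

# Hypothetical component failure rates, in failures per million hours.
rates_per_million_hours = {
    "power supply": 12.0,
    "controller": 3.5,
    "fan": 20.0,
}

total = system_failure_rate(rates_per_million_hours)
mtbf_hours = 1e6 / total  # MTBF is the inverse of a constant failure rate
print(total)  # 35.5
```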
Suppose it is desired to estimate the failure rate of a certain component. A test can be
performed to estimate its failure rate. Ten identical components are each tested until they
either fail or reach 1000 hours, at which time the test is terminated for that component.
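(The level of statistical confidence is not considered in this example.) The results are as
follows.
[Table: hours on test and outcome (failed or survived) for each of the ten components]
The estimated failure rate is the number of failures divided by the total operating time
accumulated by all ten components, which here is 0.0007998 failures per hour,
or 799.8 failures for every million hours of operation.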
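The calculation can be sketched as follows: count the failures, add up the hours each component spent on test (whether it failed or was stopped at 1000 hours), and divide. The per-component values below are illustrative assumptions chosen to be consistent with the 799.8 figure; they should not be read as the original test data.

```python
def estimate_failure_rate(observations):
    """observations: list of (hours_on_test, failed) pairs, one per component."""
    failures = sum(1 for _, failed in observations if failed)
    total_hours = sum(hours for hours, _ in observations)
    return failures / total_hours

# Illustrative values only: four components survive to the 1000-hour cutoff,
# six fail earlier.
observations = [
    (1000, False), (1000, False), (467, True), (1000, False), (630, True),
    (590, True), (1000, False), (285, True), (648, True), (882, True),
]

rate_per_hour = estimate_failure_rate(observations)
rate_per_million_hours = rate_per_hour * 1e6
print(round(rate_per_million_hours, 1))  # 799.8
```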
The Nelson–Aalen estimator can be used to estimate the cumulative hazard rate function.
Chapter- 3
Safety Engineering
Safety engineering is an applied science strongly related to systems engineering and the
subset System Safety Engineering. Safety engineering assures that a life-critical system
behaves as needed even when pieces fail.
Ideally, safety engineers take an early design of a system, analyze it to find what faults
can occur, and then propose safety requirements in design specifications up front and
changes to existing systems to make the system safer. In an early design stage, often a
fail-safe system can be made acceptably safe with a few sensors and some software to
read them. Probabilistic fault-tolerant systems can often be made by using more, but
smaller and less-expensive pieces of equipment.
Far too often, rather than actually influencing the design, safety engineers are assigned to
prove that an existing, completed design is safe. If a safety engineer then discovers
significant safety problems late in the design process, correcting them can be very expensive. This type of error has the potential to waste large sums of money.
The exception to this conventional approach is the way some large government agencies
approach safety engineering from a more proactive and proven process perspective,
known as "system safety". The system safety philosophy is applied to complex and
critical systems, such as commercial airliners, complex weapon systems, spacecraft, rail
and transportation systems, air traffic control systems and other complex and
safety-critical industrial systems. The proven system safety methods and techniques aim to
prevent, eliminate and control hazards and risks through designed influences by a
collaboration of key engineering disciplines and product teams. Software safety is a fast
growing field since modern system functionality is increasingly being put under
control of software. The whole concept of system safety and software safety, as a subset
of systems engineering, is to influence safety-critical systems designs by conducting
several types of hazard analyses to identify risks and to specify design safety features and
procedures to strategically mitigate risk to acceptable levels before the system is certified.
Additionally, failure mitigation can go beyond design recommendations, particularly in
the area of maintenance. There is an entire realm of safety and reliability engineering
known as Reliability Centered Maintenance (RCM), which is a discipline that is a direct
result of analyzing potential failures within a system and determining maintenance
actions that can mitigate the risk of failure. This methodology is used extensively on
aircraft and involves understanding the failure modes of the serviceable replaceable
assemblies in addition to the means to detect or predict an impending failure. Every
automobile owner is familiar with this concept when they take in their car to have the oil
changed or brakes checked. Even filling up one's car with fuel is a simple example of a
failure mode (failure due to fuel exhaustion), a means of detection (fuel gauge), and a
maintenance action (filling the car's fuel tank).
For large scale complex systems, hundreds if not thousands of maintenance actions can
result from the failure analysis. These maintenance actions are based on conditions (e.g.,
gauge reading or leaky valve), hard conditions (e.g., a component is known to fail after
100 hrs of operation with 95% certainty), or require inspection to determine the
maintenance action (e.g., metal fatigue). The RCM concept then analyzes each individual
maintenance item for its risk contribution to safety, mission, operational readiness, or
cost to repair if a failure does occur. Then all the maintenance actions are
bundled into maintenance intervals so that maintenance is not occurring around the clock,
but rather, at regular intervals. This bundling process introduces further complexity, as it
might stretch some maintenance cycles, thereby increasing risk, but reduce others,
thereby potentially reducing risk, with the end result being a comprehensive maintenance
schedule, purpose built to reduce operational risk and ensure acceptable levels of
operational readiness and availability.
Analysis techniques
The two most common fault modeling techniques are called failure mode and effects
analysis and fault tree analysis. These techniques are just ways of finding problems and
of making plans to cope with failures, as in probabilistic risk assessment. One of the
earliest complete studies using this technique on a commercial nuclear plant was the
WASH-1400 study, also known as the Reactor Safety Study or the Rasmussen Report.
Failure modes and effects analysis
Failure Mode and Effects Analysis (FMEA) is a bottom-up, inductive analytical method
which may be performed at either the functional or piece-part level. For functional
FMEA, failure modes are identified for each function in a system or equipment item,
usually with the help of a functional block diagram. For piece-part FMEA, failure modes
are identified for each piece-part component (such as a valve, connector, resistor, or
diode). The effects of the failure mode are described, and assigned a probability based on
the failure rate and failure mode ratio of the function or component.
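One common way to turn a failure rate and a failure mode ratio into a probability is sketched below. It assumes a constant failure rate (an exponential model), which the text does not require, and the valve failure rate, mode ratios and mission time are made-up values.

```python
import math

def mode_failure_rate(item_failure_rate, mode_ratio):
    """Share of the item's failure rate attributed to one failure mode."""
    return item_failure_rate * mode_ratio

def mode_probability(item_failure_rate, mode_ratio, mission_hours):
    """Probability the mode occurs during the mission.
    Assumes a constant failure rate (exponential model)."""
    rate = mode_failure_rate(item_failure_rate, mode_ratio)
    return 1.0 - math.exp(-rate * mission_hours)

valve_rate = 5e-6  # hypothetical: failures per hour for a valve
mode_ratios = {"fails closed": 0.6, "fails open": 0.3, "external leak": 0.1}

probabilities = {
    mode: mode_probability(valve_rate, ratio, mission_hours=1000.0)
    for mode, ratio in mode_ratios.items()
}
print(probabilities)
```

The mode ratios for one item are expected to sum to 1, so the individual mode rates add back up to the item failure rate.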
Failure modes with identical effects can be combined and summarized in a Failure Mode
Effects Summary. When combined with criticality analysis, FMEA is known as Failure
Mode, Effects, and Criticality Analysis or FMECA, pronounced "fuh-MEE-kuh".
Fault tree analysis
Fault tree analysis (FTA) is a top-down, deductive analytical method. In FTA, initiating
primary events such as component failures, human errors, and external events are traced
through Boolean logic gates to an undesired top event such as an aircraft crash or nuclear
reactor core melt. The intent is to identify ways to make top events less probable, and
verify that safety goals have been achieved.
A fault tree diagram
Fault trees are a logical inverse of success trees, and may be obtained by applying de
Morgan's theorem to success trees (which are directly related to reliability block
diagrams).
FTA may be qualitative or quantitative. When failure and event probabilities are unknown,
qualitative fault trees may be analyzed for minimal cut sets. For example, if any minimal
cut set contains a single base event, then the top event may be caused by a single failure.
Quantitative FTA is used to compute top event probability, and usually requires computer
software such as CAFTA from the Electric Power Research Institute or SAPHIRE from
the Idaho National Laboratory.
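For a very small tree the quantitative calculation can be done by hand. The sketch below is not how CAFTA or SAPHIRE work internally; it simply evaluates AND and OR gates under the assumption that basic events are statistically independent. The event names and probabilities are hypothetical.

```python
from math import isclose, prod

def and_gate(*probabilities):
    """All inputs must occur (independent events): multiply probabilities."""
    return prod(probabilities)

def or_gate(*probabilities):
    """At least one input occurs (independent events)."""
    return 1.0 - prod(1.0 - p for p in probabilities)

# Hypothetical basic-event probabilities.
p_pump_a_fails = 0.01
p_pump_b_fails = 0.02
p_power_lost = 0.001

# Top event: loss of cooling = (pump A fails AND pump B fails) OR power lost.
p_both_pumps = and_gate(p_pump_a_fails, p_pump_b_fails)
p_top = or_gate(p_both_pumps, p_power_lost)
print(p_top)
```

In this example the minimal cut sets are {pump A, pump B} and {power}. The second contains a single base event, so loss of power alone is enough to cause the top event.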
Some industries use both fault trees and event trees. An event tree starts from an
undesired initiator (loss of critical supply, component failure etc.) and follows possible
further system events through to a series of final consequences. As each new event is
considered, a new node on the tree is added with a split of probabilities of taking either
branch. The probabilities of a range of "top events" arising from the initial event can then
be seen.
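A simple event tree can be sketched as a walk along the failure path, splitting probability at each node. The structure below is a simplification in which each safety barrier either works (ending the sequence with a given consequence) or fails (passing to the next barrier). The barrier names and probabilities are invented.

```python
from math import isclose

def event_tree(initiator_probability, barriers, final_consequence):
    """Walk a simple event tree.

    barriers: ordered list of (consequence_if_barrier_works, probability_it_works).
    At each node the path splits: the barrier works (the path ends with the
    given consequence) or it fails (the path continues to the next barrier).
    """
    outcomes = {}
    p_path = initiator_probability
    for consequence, p_works in barriers:
        outcomes[consequence] = p_path * p_works
        p_path *= (1.0 - p_works)
    outcomes[final_consequence] = p_path  # every barrier failed
    return outcomes

outcomes = event_tree(
    initiator_probability=1e-3,  # hypothetical: loss of critical supply
    barriers=[("controlled shutdown", 0.99), ("contained release", 0.9)],
    final_consequence="uncontained release",
)
for consequence, probability in outcomes.items():
    print(consequence, probability)
```

The outcome probabilities always sum to the initiator probability, which is a useful sanity check on a hand-built tree.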
Safety certification
Usually a failure in safety-certified systems is acceptable if, on average, less than one life
per 10⁹ hours of continuous operation is lost to failure. Most Western nuclear reactors,
medical equipment, and commercial aircraft are certified to this level. The cost versus
loss of lives has been considered appropriate at this level (by FAA for aircraft under
Federal Aviation Regulations).
Preventing failure
A NASA graph shows the relationship between the survival of a crew of astronauts and
the amount of redundant equipment in their spacecraft (the "MM", Mission Module).
Probabilistic fault tolerance: adding redundancy to equipment and systems
Once a failure mode is identified, it can usually be prevented entirely by adding extra
equipment to the system. For example, nuclear reactors contain dangerous radiation, and
nuclear reactions can cause so much heat that no substance might contain them.
Therefore reactors have emergency core cooling systems to keep the temperature down,
shielding to contain the radiation, and engineered barriers (usually several, nested,
surmounted by a containment building) to prevent accidental leakage.
Most biological organisms have a certain amount of redundancy: multiple organs,
multiple limbs, etc.
For any given failure, a fail-over or redundancy can almost always be designed and
incorporated into a system.
When does safety stop, where does reliability begin?
Inherent fail-safe design
When adding equipment is impractical (usually because of expense), then the least
expensive form of design is often "inherently fail-safe". The typical approach is to
arrange the system so that ordinary single failures cause the mechanism to shut down in a
safe way (for nuclear power plants, this is termed a passively safe design, although more
than ordinary failures are covered).
One of the most common fail-safe systems is the overflow tube in baths and kitchen
sinks. If the valve sticks open, rather than causing an overflow and damage, the tank
spills into an overflow.
Another common example is that in an elevator the cable supporting the car keeps
spring-loaded brakes open. If the cable breaks, the brakes grab rails, and the elevator cabin does
not fall.
Inherent fail-safes are common in medical equipment, traffic and railway signals, communications equipment, and safety equipment.
Containing failure
It is also common practice to plan for the failure of safety systems through containment
and isolation methods. The use of isolating valves, also known as the block and bleed
manifold, is very common in isolating pumps, tanks, and control valves that may fail or
need routine maintenance. In addition, nearly all tanks containing oil or other hazardous
chemicals are required to have containment barriers set up around them to contain 100%
of the volume of the tank in the event of a catastrophic tank failure. Similarly, long
pipelines have remote-closing valves periodically installed in the line so that in the event
of failure, the entire pipeline is not lost. The goal of all such containment systems is to
provide means of limiting the damage done by a failure to a small localized area.
Chapter- 4
Failure Mode & Effects Analysis
A failure modes and effects analysis (FMEA) is a procedure in product development
and operations management for analysis of potential failure modes within a system for
classification by the severity and likelihood of the failures. A successful FMEA activity
helps a team to identify potential failure modes based on past experience with similar
products or processes, enabling the team to design those failures out of the system with
the minimum of effort and resource expenditure, thereby reducing development time and
costs. It is widely used in manufacturing industries in various phases of the product life
cycle and is now increasingly finding use in the service industry. Failure modes are any
errors or defects in a process, design, or item, especially those that affect the customer,
and can be potential or actual. Effects analysis refers to studying the consequences of
those failures.
Basic terms
FMEA cycle
Failure
"The loss of an intended function of a device under stated conditions."
Failure mode
"The manner by which a failure is observed; it generally describes the way the
failure occurs."
Failure effect
Immediate consequences of a failure on operation, function or functionality, or
status of some item
Indenture levels
An identifier for item complexity. Complexity increases as levels are closer to one.
Local effect
The Failure effect as it applies to the item under analysis.
Next higher level effect
The Failure effect as it applies at the next higher indenture level.
End effect
The failure effect at the highest indenture level or total system.
Failure cause
Defects in design, process, quality, or part application, which are the underlying
cause of the failure or which initiate a process which leads to failure.
Severity
"The consequences of a failure mode. Severity considers the worst potential
consequence of a failure, determined by the degree of injury, property damage, or
system damage that could ultimately occur."
Learning from each failure is both costly and time consuming, and FMEA is a more
systematic method of studying failure. As such, it is considered better to first conduct
some thought experiments.
FMEA was formally introduced in the late 1940s for military usage by the US Armed
Forces. Later it was used for aerospace/rocket development to avoid errors in small
sample sizes of costly rocket technology. An example of this is the Apollo Space
program. It was also applied to HACCP for the Apollo Space Program, and
later the food industry in general. The primary push came during the 1960s, while
developing the means to put a man on the moon and return him safely to earth. In the late
1970s the Ford Motor Company introduced FMEA to the automotive industry for safety
and regulatory consideration after the Pinto affair. They applied the same approach to
processes (PFMEA) to consider potential process-induced failures prior to launching
production.
Although initially developed by the military, FMEA methodology is now extensively
used in a variety of industries including semiconductor processing, food service, plastics,
software, and healthcare. It is integrated into the Automotive Industry Action Group's
(AIAG) Advanced Product Quality Planning (APQP) process to provide risk mitigation,
in both product and process development phases. Each potential cause must be
considered for its effect on the product or process and, based on the risk, actions are
determined and risks revisited after actions are complete. Toyota has taken this one step
further with its Design Review Based on Failure Mode (DRBFM) approach. The method
is now supported by the American Society for Quality which provides detailed guides on
applying the method.
In FMEA, failures are prioritized according to how serious their consequences are, how
frequently they occur and how easily they can be detected. An FMEA also documents
current knowledge and actions about the risks of failures for use in continuous
improvement. FMEA is used during the design stage with an aim to avoid future failures
(sometimes called DFMEA in that case). Later it is used for process control, before and
during ongoing operation of the process. Ideally, FMEA begins during the earliest
conceptual stages of design and continues throughout the life of the product or service.
The outcome of an FMEA development is actions to prevent or reduce the severity or
likelihood of failures, starting with the highest-priority ones. It may be used to evaluate
risk management priorities for mitigating known threat vulnerabilities. FMEA helps
select remedial actions that reduce cumulative impacts of life-cycle consequences (risks)
from a systems failure (fault).
It is used in many formal quality systems such as QS-9000 or ISO/TS 16949.
Using FMEA when designing
FMEA can provide an analytical approach when dealing with potential failure modes
and their associated causes. When considering possible failures in a design, in terms of
safety, cost, performance, quality and reliability, an engineer can get a lot of information
about how to alter the development/manufacturing process in order to avoid these failures.
FMEA provides an easy tool to determine which risk is of greatest concern, and
therefore where action is needed to prevent a problem before it arises. The development of
these specifications helps ensure the product will meet the defined requirements.
The pre-work
The process for conducting an FMEA is straightforward. It is developed in three main
phases, in which appropriate actions need to be defined. But before starting with an
FMEA, it is important to complete some pre-work to confirm that robustness and past
history are included in the analysis.
A robustness analysis can be obtained from interface matrices, boundary diagrams, and
parameter diagrams. A lot of failures are due to noise factors and shared interfaces with
other parts and/or systems, because engineers tend to focus on what they control directly.
To start it is necessary to describe the system and its function. A good understanding
simplifies further analysis. This way an engineer can see which uses of the system are
desirable and which are not. It is important to consider both intentional and unintentional
uses. Unintentional uses are a form of hostile environment.
Then, a block diagram of the system needs to be created. This diagram gives an overview
of the major components or process steps and how they are related. These are called
logical relations around which the FMEA can be developed. It is useful to create a coding
system to identify the different system elements. The block diagram should always be
included with the FMEA.
Before starting the actual FMEA, a worksheet needs to be created, which contains the
important information about the system, such as the revision date or the names of the
components. On this worksheet all the items or functions of the subject should be listed in
a logical manner, based on the block diagram.
Example FMEA Worksheet
[Worksheet columns: Function; Failure mode; Effects; S (severity rating); Cause(s);
O (occurrence rating); Current controls; D (detection rating); CRIT (critical
characteristic); RPN (risk priority number); Recommended actions; Completion date]
Step 1: Severity
Determine all failure modes based on the functional requirements and their effects.
Examples of failure modes are electrical short-circuiting, corrosion and deformation. A
failure mode in one component can lead to a failure mode in another component;
therefore, each failure mode should be listed in technical terms and by function. Next,
the ultimate effect of each failure mode needs to be considered. A failure effect is defined
as the result of a failure mode on the function of the system as perceived by the user. In
this way it is convenient to write these effects down in terms of what the user might see
or experience. Examples of failure effects are: degraded performance, noise or even
injury to a user. Each effect is given a severity number (S) from 1 (no danger) to 10
(critical). These numbers help an engineer to prioritize the failure modes and their effects.
If the severity of an effect has a number 9 or 10, actions are considered to change the
design by eliminating the failure mode, if possible, or protecting the user from the effect.
A severity rating of 9 or 10 is generally reserved for those effects which would cause
injury to a user or otherwise result in litigation.
Step 2: Occurrence
In this step it is necessary to look at the cause of a failure mode and how many times it
occurs. This can be done by looking at similar products or processes and the failure
modes that have been documented for them. A failure cause is looked upon as a design
weakness. All the potential causes for a failure mode should be identified and
documented. Again this should be in technical terms. Examples of causes are: erroneous
algorithms, excessive voltage or improper operating conditions. A failure mode is given
an occurrence ranking (O), again 1–10. Actions need to be determined if the occurrence
is high (meaning > 4 for non-safety failure modes and > 1 when the severity-number
from step 1 is 9 or 10). This step is called the detailed development section of the FMEA
process. Occurrence can also be defined as a percentage: for example, a non-safety issue
that occurs less than 1% of the time can be given a ranking of 1. The scale depends on the
product and the customer specification.
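The action thresholds described in this step can be written as a small rule. This is a sketch of the stated rule of thumb only; the function name is hypothetical, and real programs set their own thresholds.

```python
def occurrence_needs_action(severity, occurrence):
    """Apply the rule of thumb from the text: act when the occurrence ranking
    is above 4, or above 1 when the severity from step 1 is 9 or 10."""
    for name, value in (("severity", severity), ("occurrence", occurrence)):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} ranking must be between 1 and 10")
    if severity >= 9:
        return occurrence > 1
    return occurrence > 4

print(occurrence_needs_action(severity=5, occurrence=4))   # False
print(occurrence_needs_action(severity=5, occurrence=5))   # True
print(occurrence_needs_action(severity=9, occurrence=2))   # True
print(occurrence_needs_action(severity=10, occurrence=1))  # False
```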
Step 3: Detection
When appropriate actions are determined, it is necessary to test their efficiency. In
addition, design verification is needed. The proper inspection methods need to be chosen.
First, an engineer should look at the current controls of the system that prevent failure
modes from occurring or that detect the failure before it reaches the customer.
Next, one should identify testing, analysis, monitoring and other techniques that can
be or have been used on similar systems to detect failures. From these controls an
engineer can learn how likely it is for a failure to be identified or detected. Each
combination from the previous two steps receives a detection number (D). This ranks the
ability of planned tests and inspections to remove defects or detect failure modes in time.
The assigned detection number measures the risk that the failure will escape detection. A
high detection number indicates that the chances are high that the failure will escape
detection, or in other words, that the chances of detection are low.
After these three basic steps, risk priority numbers (RPN) are calculated.
Risk priority numbers
RPNs do not play an important part in the choice of an action against failure modes; they
serve more as threshold values in the evaluation of these actions.
After ranking the severity, occurrence and detectability the RPN can be easily calculated
by multiplying these three numbers: RPN = S × O × D
This has to be done for the entire process and/or design. Once this is done it is easy to
determine the areas of greatest concern. The failure modes that have the highest RPN
should be given the highest priority for corrective action. This means it is not always the
failure modes with the highest severity numbers that should be treated first. There could
be less severe failures that occur more often and are less detectable.
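The calculation and ranking described above can be sketched in a few lines of Python. This is only an illustration: the failure-mode names and the S, O and D values are hypothetical, and the `FailureMode` class and `rank_by_rpn` function are names chosen for this sketch rather than part of any FMEA standard or tool.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # S, ranked 1 (negligible) to 10 (most severe)
    occurrence: int  # O, ranked 1 (rare) to 10 (very frequent)
    detection: int   # D, ranked 1 (almost certain to be detected) to 10 (likely to escape)

    @property
    def rpn(self) -> int:
        # RPN = S x O x D
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(modes):
    """Return the failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical worksheet entries.
modes = [
    FailureMode("seal leak", severity=8, occurrence=2, detection=3),
    FailureMode("connector corrosion", severity=5, occurrence=6, detection=7),
    FailureMode("label misprint", severity=2, occurrence=4, detection=2),
]

ranked = rank_by_rpn(modes)
for mode in ranked:
    print(f"{mode.name}: RPN = {mode.rpn}")
```

In this made-up worksheet the connector corrosion comes first even though the seal leak has the higher severity number, because it occurs more often and is harder to detect.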
After these values are allocated, recommended actions with targets, responsibility and
dates of implementation are noted. These actions can include specific inspection, testing
or quality procedures, redesign (such as selection of new components), adding more
redundancy and limiting environmental stresses or operating range. Once the actions have
been implemented in the design/process, the new RPN should be checked, to confirm the
improvements. These tests are often put in graphs, for easy visualization. Whenever a
design or a process changes, an FMEA should be updated.
A few logical but important thoughts come to mind:
Try to eliminate the failure mode (some failures are more preventable than others)
Minimize the severity of the failure
Reduce the occurrence of the failure mode
Improve the detection
Timing of FMEA
The FMEA should be updated whenever:
A new cycle begins (new product/process)
Changes are made to the operating conditions
A change is made in the design
New regulations are instituted
Customer feedback indicates a problem
Uses of FMEA
Development of system requirements that minimize the likelihood of failures.
Development of methods to design and test systems to ensure that the failures
have been eliminated.
Evaluation of the requirements of the customer to ensure that those do not give
rise to potential failures.
Identification of certain design characteristics that contribute to failures, and
minimize or eliminate those effects.
Tracking and managing potential risks in the design. This helps avoid the same
failures in future projects.
Ensuring that any failure that could occur will not injure the customer or seriously
impact a system.
To produce world class quality products
Improve the quality, reliability and safety of a product/process
Improve company image and competitiveness
Increase user satisfaction
Reduce system development timing and cost
Collect information to reduce future failures, capture engineering knowledge
Reduce the potential for warranty concerns
Early identification and elimination of potential failure modes
Emphasize problem prevention
Minimize late changes and associated cost
Catalyst for teamwork and idea exchange between functions
Reduce the possibility of the same kind of failure in the future
Reduce the impact on the company's profit margin
Reduce possible scrap in production
Since FMEA is effectively dependent on the members of the committee which examines
product failures, it is limited by their experience of previous failures. If a failure mode
cannot be identified, then external help is needed from consultants who are aware of the
many different types of product failure. FMEA is thus part of a larger system of quality
control, where documentation is vital to implementation. General texts and detailed
publications are available in forensic engineering and failure analysis. It is a general
requirement of many specific national and international standards that FMEA is used in
evaluating product integrity. If used as a top-down tool, FMEA may only identify major
failure modes in a system. Fault tree analysis (FTA) is better suited for "top-down"
analysis. When used as a "bottom-up" tool FMEA can augment or complement FTA and
identify many more causes and failure modes resulting in top-level symptoms. It is not
able to discover complex failure modes involving multiple failures within a subsystem, or
to report expected failure intervals of particular failure modes up to the upper level
subsystem or system.
Additionally, the multiplication of the severity, occurrence and detection rankings may
result in rank reversals, where a less serious failure mode receives a higher RPN than a
more serious failure mode. The reason for this is that the rankings are ordinal scale
numbers, and multiplication is not defined for ordinal numbers. The ordinal rankings only
say that one ranking is better or worse than another, but not by how much. For instance, a
ranking of "2" may not be twice as bad as a ranking of "1," or an "8" may not be twice as
bad as a "4," but multiplication treats them as though they are.
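A small numerical sketch may make the rank-reversal problem more concrete. The two failure modes and their rankings below are hypothetical. The second half of the example relabels the same ordinal rankings with a different set of increasing numbers (powers of two, an arbitrary choice made only for this illustration) to show that the RPN ordering depends on the numeric labels and not just on the order of the rankings.

```python
def rpn(severity, occurrence, detection):
    """Risk priority number as the product of the three rankings."""
    return severity * occurrence * detection


# Hypothetical (S, O, D) rankings.
serious_mode = (10, 1, 1)  # most severe, rare, easily detected
minor_mode = (3, 2, 2)     # minor, slightly more frequent, slightly harder to detect

rpn_serious = rpn(*serious_mode)  # 10 * 1 * 1 = 10
rpn_minor = rpn(*minor_mode)      # 3 * 2 * 2 = 12


def relabel(rank):
    """An order-preserving relabelling of the ordinal scale (arbitrary choice)."""
    return 2 ** rank


relabelled_serious = rpn(*(relabel(r) for r in serious_mode))  # 2**10 * 2 * 2 = 4096
relabelled_minor = rpn(*(relabel(r) for r in minor_mode))      # 2**3 * 2**2 * 2**2 = 128

print(rpn_serious, rpn_minor)                # the minor mode gets the higher RPN
print(relabelled_serious, relabelled_minor)  # after relabelling, the order flips
```

Both labelings preserve which ranking is worse than which, yet they disagree about which failure mode deserves priority, which is the weakness of multiplying ordinal numbers.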
Most FMEAs are created as a spreadsheet. Specialized FMEA software packages exist
that offer some advantages over spreadsheets.
Types of FMEA
Process: analysis of manufacturing and assembly processes
Design: analysis of products prior to production
Concept: analysis of systems or subsystems in the early design concept stages
Equipment: analysis of machinery and equipment design before purchase
Service: analysis of service industry processes before they are released to impact
the customer
System: analysis of the global system functions
Software: analysis of the software functions
Chapter- 5
Root Cause Analysis and Fault Tree Analysis
Root Cause Analysis
Root cause analysis (RCA) is a class of problem solving methods aimed at identifying
the root causes of problems or events. The practice of RCA is predicated on the belief
that problems are best solved by attempting to address, correct or eliminate root causes,
as opposed to merely addressing the immediately obvious symptoms. By directing
corrective measures at root causes, it is more probable that problem recurrence will be
prevented. However, it is recognized that complete prevention of recurrence by one
corrective action is not always possible. Conversely, there may be several effective
measures (methods) that address the root cause of a problem. Thus, RCA is often
considered to be an iterative process, and is frequently viewed as a tool of continuous
improvement.
RCA is typically used as a reactive method of identifying the causes of events, revealing
problems and solving them. Analysis is done after an event has occurred. Insights in
RCA may make it useful as a pro-active method. In that event, RCA can be used to
forecast or predict probable events even before they occur. While one follows the other,
RCA is a completely separate process to Incident Management.
Root cause analysis is not a single, sharply defined methodology; there are many
different tools, processes, and philosophies for performing RCA. However,
several very-broadly defined approaches or "schools" can be identified by their basic
approach or field of origin: safety-based, production-based, process-based, failure-based,
and systems-based.
Safety-based RCA descends from the fields of accident analysis and occupational
safety and health.
Production-based RCA has its origins in the field of quality control for industrial
manufacturing.
Process-based RCA is basically a follow-on to production-based RCA, but with a
scope that has been expanded to include business processes.
Failure-based RCA is rooted in the practice of failure analysis as employed in
engineering and maintenance.
Systems-based RCA has emerged as an amalgamation of the preceding schools,
along with ideas taken from fields such as change management, risk management,
and systems analysis.
Despite the different approaches among the various schools of root cause analysis, there
are some common principles. It is also possible to define several general processes for
performing RCA.
General principles of root cause analysis
1. The primary aim of RCA is to identify the root cause(s) of a problem in order to
create effective corrective actions that will prevent that problem from recurring,
or otherwise address the problem with near-certain success. ("Success" is defined
as the near-certain prevention of recurrence.)
2. To be effective, RCA must be performed systematically, usually as part of an
investigation, with conclusions and root causes identified backed up by
documented evidence. Usually a team effort is required.
3. There may be more than one root cause for an event or a problem. The difficult
part is demonstrating the persistence and sustaining the effort required to develop
them.
4. The purpose of identifying all solutions to a problem is to prevent recurrence at
lowest cost in the simplest way. If there are alternatives that are equally effective,
then the simplest or lowest cost approach is preferred.
5. Root causes identified depend on the way in which the problem or event is
defined. Effective problem statements and event descriptions (as failures, for
example) are helpful, or even required.
6. To be effective the analysis should establish a sequence of events or timeline to
understand the relationships between contributory (causal) factors, root cause(s)
and the defined problem or event to prevent in the future.
7. Root cause analysis can help to transform a reactive culture (that reacts to
problems) into a forward-looking culture that solves problems before they occur
or escalate. More importantly, it reduces the frequency of problems occurring
over time within the environment where the RCA process is used.
8. RCA is a threat to many cultures and environments. Threats to cultures often meet
with resistance. There may be other forms of management support required to
achieve RCA effectiveness and success. For example, a "non-punitive" policy
towards problem identifiers may be required.
General process for performing and documenting an RCA-based
Corrective Action
Notice that RCA (in steps 3, 4 and 5) forms the most critical part of successful corrective
action, because it directs the corrective action at the true root cause of the problem. The
root cause is secondary to the goal of prevention, but without knowing the root cause, we
cannot determine what an effective corrective action for the defined problem will be.
1. Define the problem or describe the event factually
2. Gather data and evidence, classifying it along a timeline of events to the final
failure or crisis.
3. Ask "why" and identify the causes associated with each step in the sequence
towards the defined problem or event.
4. Classify causes into causal factors that relate to an event in the sequence, and root
causes that, if eliminated, can be agreed to have interrupted that step of the sequence.
5. If there are multiple root causes, which is often the case, reveal those clearly for
later optimum selection.
6. Identify corrective action(s) that will with certainty prevent
recurrence of the problem or event. These can be used to select the best corrective
action later.
7. Identify solutions that are effective, prevent recurrence with reasonable certainty with
the consensus agreement of the group, are within your control, meet your goals and
objectives and do not introduce other new, unforeseen problems.
8. Implement the recommended root cause correction(s).
9. Ensure effectiveness by observing the implemented recommended solutions.
10. Other methodologies for problem solving and problem avoidance may be useful.
Root cause analysis techniques
Barrier analysis - a technique often used in process industries. It is based on
tracing energy flows, with a focus on barriers to those flows, to identify how and
why the barriers did not prevent the energy flows from causing harm.
Bayesian inference
Causal factor tree analysis - a technique based on displaying causal factors in a
tree-structure such that cause-effect dependencies are clearly identified.
Change analysis - an investigation technique often used for problems or accidents.
It is based on comparing a situation that does not exhibit the problem to one that
does, in order to identify the changes or differences that might explain why the
problem occurred.
Current Reality Tree - A method developed by Eliahu M. Goldratt in his theory of
constraints that guides an investigator to identify and relate all root causes using a
cause-effect tree whose elements are bound by rules of logic (Categories of
Legitimate Reservation). The CRT begins with a brief list of the undesirable
things we see around us, and then guides us towards one or more root causes. This
method is particularly powerful when the system is complex, there is no obvious
link between the observed undesirable things, and a deep understanding of the
root cause(s) is desired.
Failure modes and effects analysis
Fault tree analysis
5 Whys - asking "why" repeatedly until the chain of causes is exhausted.
Ishikawa diagram, also known as the fishbone diagram or cause-and-effect
diagram. The Ishikawa diagram is a favored method among project managers for
conducting RCA. It is effective due to its simplicity, its ability to handle inexact
information, its support for group participation, and the relative complexity of
the other methods.
Pareto analysis ("80/20 rule")
RPR Problem Diagnosis - An ITIL-aligned method for diagnosing IT problems.
Kepner-Tregoe Approach
Common cause analysis (CCA) and common modes analysis (CMA) are evolving engineering
techniques for complex technical systems to determine if common root causes in
hardware, software or highly integrated systems interaction may contribute to human
error or improper operation of a system. Systems are analyzed for root causes and causal
factors to determine probability of failure modes, fault modes, or common mode software
faults due to escaped requirements. Complete testing and verification are also
used to ensure that complex systems are designed with no common causes that
cause severe hazards. Common cause analysis is sometimes required as part of the
safety engineering tasks for theme parks, commercial/military aircraft, spacecraft,
complex control systems, large electrical utility grids, nuclear power plants, automated
industrial controls, medical devices or other complex safety-critical systems.
Basic elements of root cause using Management Oversight Risk
Tree (MORT) Approach Classification
Materials
o Defective raw material
o Wrong type for job
o Lack of raw material
Man Power
o Inadequate capability
o Lack of knowledge
o Lack of skill
o Stress
o Improper motivation
Machine / Equipment
o Incorrect tool selection
o Poor maintenance or design
o Poor equipment or tool placement
o Defective equipment or tool
Environment
o Orderly workplace
o Job design or layout of work
o Surfaces poorly maintained
o Physical demands of the task
o Forces of nature
Management
o No or poor management involvement
o Inattention to task
o Task hazards not guarded properly
o Other (horseplay, inattention....)
o Stress demands
o Lack of process
o Lack of communication
Methods
o No or poor procedures
o Practices are not the same as written procedures
o Poor communication
Management system
o Training or education lacking
o Poor employee involvement
o Poor recognition of hazard
o Previously identified hazards were not eliminated
Fault Tree Analysis
Fault tree analysis (FTA) is a failure analysis in which an undesired state of a system is
analyzed using Boolean logic to combine a series of lower-level events. This analysis
method is mainly used in the field of safety engineering to quantitatively determine the
probability of a safety hazard.
Fault Tree Analysis (FTA) was originally developed in 1962 at Bell Laboratories by H.A.
Watson, under a U.S. Air Force Ballistics Systems Division contract to evaluate the
Minuteman I Intercontinental Ballistic Missile (ICBM) Launch Control System.
Following the first published use of FTA in the 1962 Minuteman I Launch Control Safety
Study, Boeing and AVCO expanded use of FTA to the entire Minuteman II system in
1963-1964. FTA received extensive coverage at a 1965 System Safety Symposium in
Seattle sponsored by Boeing and the University of Washington. Boeing began using FTA
for civil aircraft design around 1966. In 1970, the U.S. Federal Aviation Administration
(FAA) published a change to 14 CFR 25.1309 airworthiness regulations for transport
aircraft in the Federal Register at 35 FR 5665 (1970-04-08). This change adopted failure
probability criteria for aircraft systems and equipment and led to widespread use of FTA
in civil aviation.
Within the nuclear power industry, the U.S. Nuclear Regulatory Commission began using
probabilistic risk assessment (PRA) methods including FTA in 1975, and significantly
expanded PRA research following the 1979 incident at Three Mile Island. This
eventually led to the 1981 publication of the NRC Fault Tree Handbook NUREG–0492,
and mandatory use of PRA under the NRC's regulatory authority.
Fault Tree Analysis (FTA) attempts to model and analyze failure processes of engineering and biological systems. FTA is basically composed of logic diagrams that display the
state of the system and is constructed using graphical design techniques. Originally,
engineers were responsible for the development of Fault Tree Analysis, as a deep
knowledge of the system under analysis is required.
Often, FTA is defined as another part, or technique, of reliability engineering. Although
both model the same major aspect, they have arisen from two different perspectives.
Reliability engineering was, for the most part, developed by mathematicians, while FTA,
as stated above, was developed by engineers.
Fault Tree Analysis usually involves events from hardware wear out, material failure or
malfunctions or combinations of deterministic contributions to the event stemming from
assigning a hardware/system failure rate to branches or cut sets. Typically failure rates
are carefully derived from substantiated historical data such as mean time between failure
of the components, unit, subsystem or function. Predictor data may be assigned.
Assigning a software failure rate is elusive and not possible. Since software is a vital
contributor and inclusive of the system operation it is assumed the software will function
normally as intended. There is no such thing as a software fault tree unless considered in
the system context. Software is an instruction set to the hardware or overall system for
correct operation. Since basic software events do not fail in the physical sense, attempting
to predict manifestation of software faults or coding errors with any reliability or
accuracy is impossible, unless assumptions are made. Predicting and assigning human
error rates is not the primary intent of a fault tree analysis, but may be attempted to gain
some knowledge of what happens with improper human input or intervention at the
wrong time.
FTA can be used as a valuable design tool, can identify potential accidents, and can
eliminate costly design changes. It can also be used as a diagnostic tool, predicting the
most likely system failure in a system breakdown. FTA is used in safety engineering and
in all major fields of engineering.
FTA methodology is described in several industry and government standards, including
NRC NUREG–0492 for the nuclear power industry, an aerospace-oriented revision to
NUREG–0492 for use by NASA, SAE ARP4761 for civil aerospace, and MIL–HDBK–338
for military systems. IEC standard IEC 61025 is intended for cross-industry use and has
been adopted as European Norm EN 61025.
Since no system is perfect, dealing with a subsystem fault is a necessity, and any working
system eventually will have a fault in some place. However, the probability for a
complete or partial success is greater than the probability of a complete failure or partial
failure. Assembling a FTA is thus not as tedious as assembling a success tree which can
turn out to be very time consuming.
Because assembling a FTA can be a costly and cumbersome experience, the preferred
method is to consider subsystems. Dealing with smaller systems reduces the probability
of error and the amount of system analysis required. Afterward, the subsystems are
integrated to form the fully analyzed overall system.
An undesired effect is taken as the root ('top event') of a tree of logic. There should be
only one Top Event and all concerns must tree down from it. Then, each situation that
could cause that effect is added to the tree as a series of logic expressions. When fault
trees are labeled with actual numbers about failure probabilities (which are often in
practice unavailable because of the expense of testing), computer programs can calculate
failure probabilities from fault trees.
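As a minimal sketch of such a calculation, the following Python fragment evaluates a small fault tree built from AND and OR gates. It assumes that all basic events are statistically independent, which real systems often violate. The tree itself (loss of cooling caused either by a power failure or by both pumps failing) and all probabilities are hypothetical.

```python
from math import prod


def and_gate(probabilities):
    """Output event occurs only if all input events occur (independent inputs)."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output event occurs if at least one input event occurs (independent inputs)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical basic-event probabilities.
pump_a_fails = 0.01
pump_b_fails = 0.02
power_fails = 0.001

# Top event: loss of cooling = power failure OR (pump A fails AND pump B fails).
both_pumps_fail = and_gate([pump_a_fails, pump_b_fails])
top_event = or_gate([power_fails, both_pumps_fail])

print(f"P(both pumps fail) = {both_pumps_fail:.6f}")
print(f"P(top event)       = {top_event:.7f}")
```

The OR gate is computed through the complement (the probability that none of the inputs occurs) rather than by adding the input probabilities, which would overestimate the result when the inputs are not mutually exclusive.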
A fault tree diagram
The Tree is usually written out using conventional logic gate symbols. The route through
a tree between an event and an initiator in the tree is called a Cut Set. The shortest
credible way through the tree from fault to initiating event is called a Minimal Cut Set.
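In the set-based formulation commonly used by fault tree software, a cut set is a set of basic events that together cause the top event, and a cut set is minimal when no proper subset of it is also a cut set. The sketch below expands a small tree into its cut sets and then discards the non-minimal ones. The nested-tuple tree representation, the function names and the example events are assumptions made for this illustration, and the brute-force expansion is only practical for small trees.

```python
from itertools import product


def cut_sets(node):
    """Expand a fault tree into cut sets.

    A node is either a basic-event name (str) or a tuple (gate, children)
    where gate is "AND" or "OR".
    """
    if isinstance(node, str):
        return [frozenset([node])]
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any cut set of any child triggers the gate.
        return [cs for sets in child_sets for cs in sets]
    if gate == "AND":
        # One cut set from every child must occur together.
        return [frozenset().union(*combo) for combo in product(*child_sets)]
    raise ValueError(f"unknown gate: {gate}")


def minimal_cut_sets(sets):
    """Keep only cut sets that have no proper subset among the cut sets."""
    unique = set(sets)
    return [s for s in unique if not any(other < s for other in unique)]


# Hypothetical tree: top = power OR (pump_a AND pump_b) OR (power AND pump_a)
tree = ("OR", [
    "power",
    ("AND", ["pump_a", "pump_b"]),
    ("AND", ["power", "pump_a"]),
])

all_sets = cut_sets(tree)
minimal = minimal_cut_sets(all_sets)
for cs in minimal:
    print(sorted(cs))
```

Here the cut set containing both the power failure and pump A is dropped, because the power failure alone is already sufficient to cause the top event.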
Some industries use both Fault Trees and Event Trees. An Event Tree starts from an
undesired initiator (loss of critical supply, component failure etc.) and follows possible
further system events through to a series of final consequences. As each new event is
considered, a new node on the tree is added with a split of probabilities of taking either
branch. The probabilities of a range of 'top events' arising from the initial event can then
be seen.
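The branching described above can be sketched as follows. Each subsequent event splits every existing path into a "works" branch and a "fails" branch, and the probability of a final consequence is the product of the probabilities along its path. The initiator, the two protective systems and all numbers are hypothetical, and the sketch assumes the branch probabilities do not depend on earlier branches.

```python
def event_tree_outcomes(initiator_probability, systems):
    """Enumerate all paths through a binary event tree.

    systems is a list of (name, failure_probability) pairs, in the order
    in which the systems are called upon after the initiator.
    Returns a list of (path, probability) pairs.
    """
    paths = [((), initiator_probability)]
    for name, p_fail in systems:
        next_paths = []
        for history, probability in paths:
            next_paths.append((history + ((name, "works"),), probability * (1.0 - p_fail)))
            next_paths.append((history + ((name, "fails"),), probability * p_fail))
        paths = next_paths
    return paths


# Hypothetical initiator and protective systems.
outcomes = event_tree_outcomes(
    initiator_probability=0.01,  # e.g. loss of a critical supply
    systems=[("alarm", 0.05), ("backup supply", 0.1)],
)

for path, probability in outcomes:
    label = ", ".join(f"{name} {state}" for name, state in path)
    print(f"{label}: {probability:.6f}")
```

The probabilities of all final consequences add up to the probability of the initiator, which is a convenient consistency check. Practical event trees usually prune branches that cannot occur, which this sketch does not attempt.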
Classic programs include the Electric Power Research Institute's (EPRI) CAFTA
software, which is used by many of the US nuclear power plants and by a majority of US
and international aerospace manufacturers, and the Idaho National Laboratory's
SAPHIRE, which is used by the U.S. Government to evaluate the safety and reliability of
nuclear reactors, the Space Shuttle, and the International Space Station. Outside the US,
the software RiskSpectrum is a popular tool for Fault Tree and Event Tree analysis and is
licensed for use at almost half of the world's nuclear power plants for Probabilistic Safety
Assessment.
Many different approaches can be used to model a FTA, but the most common and
popular way can be summarized in a few steps. Remember that a fault tree is used to
analyze a single fault event, and that one and only one event can be analyzed during a
single fault tree. Even though the “fault” may vary dramatically, a FTA follows the same
procedure for an event, be it a delay of 0.25 msec for the generation of electrical power,
or the random, unintended launch of an ICBM.
FTA involves five steps:
1. Define the undesired event to study
o The undesired event can be very hard to define, although some
events are very easy and obvious to observe. An engineer with a
wide knowledge of the design of the system or a system analyst with an
engineering background is the best person to help define and
number the undesired events. Undesired events are then used to make the
FTA, one event for one FTA; no two events will be used to make one
FTA.
2. Obtain an understanding of the system
o Once the undesired event is selected, all causes with probabilities of
affecting the undesired event of 0 or more are studied and analyzed.
Getting exact numbers for the probabilities leading to the event is usually
impossible because it may be very costly and time consuming to
do so. Computer software is used to study probabilities; this may lead to
less costly system analysis.
o System analysts can help with understanding the overall system. System
designers have full knowledge of the system and this knowledge is very
important for not missing any cause affecting the undesired event. For the
selected event all causes are then numbered and sequenced in the order of
occurrence and then are used for the next step, which is drawing or
constructing the fault tree.
3. Construct the fault tree
o After selecting the undesired event and having analyzed the system so that
we know all the causes (and if possible their probabilities) we can
now construct the fault tree. The fault tree is based on AND and OR gates,
which define the major characteristics of the fault tree.
4. Evaluate the fault tree
o After the fault tree has been assembled for a specific undesired event, it is
evaluated and analyzed for any possible improvement; in other words,
the risk is studied and ways to improve the system are sought. This
step serves as an introduction to the final step, which is to control the
hazards identified. In short, in this step we identify all possible hazards
affecting the system in a direct or indirect way.
5. Control the hazards identified
o This step is very specific and differs largely from one system to another,
but the main point will always be that after identifying the hazards all
possible methods are pursued to decrease the probability of occurrence.
Comparison with other analytical methods
FTA is a deductive, top-down method aimed at analyzing the effects of initiating faults
and events on a complex system. This contrasts with failure mode and effects analysis
(FMEA), which is an inductive, bottom-up analysis method aimed at analyzing the
effects of single component or function failures on equipment or subsystems. FTA is
very good at showing how resistant a system is to single or multiple initiating faults. It is
not good at finding all possible initiating faults. FMEA is good at exhaustively cataloging
initiating faults, and identifying their local effects. It is not good at examining multiple
failures or their effects at a system level. FTA considers external events; FMEA does not.
In civil aerospace the usual practice is to perform both FTA and FMEA, with a failure
mode effects summary (FMES) as the interface between FMEA and FTA.
Alternatives to FTA include dependence diagram (DD), also known as reliability block
diagram (RBD) and Markov analysis. A dependence diagram is equivalent to a success
tree analysis (STA), the logical inverse of an FTA, and depicts the system using paths
instead of gates. DD and STA produce probability of success (i.e., avoiding a top event)
rather than probability of a top event.
Chapter- 6
Fault-tolerant Design
In engineering, fault-tolerant design, also known as fail-safe design, is a design that
enables a system to continue operation, possibly at a reduced level (also known as
graceful degradation), rather than failing completely, when some part of the system fails.
The term is most commonly used to describe computer-based systems designed to
continue more or less fully operational with, perhaps, a reduction in throughput or an
increase in response time in the event of some partial failure. That is, the system as a
whole is not stopped due to problems either in the hardware or the software. An example
in another field is a motor vehicle designed so it will continue to be drivable if one of the
tires is punctured. Similarly, a fault-tolerant structure is able to retain its integrity in
the presence of damage due to causes such as fatigue, corrosion, manufacturing flaws, or impact.
If each component, in turn, can continue to function when one of its subcomponents fails,
this will allow the total system to continue to operate, as well. Using a passenger vehicle
as an example, a car can have "run-flat" tires, which each contain a solid rubber core,
allowing them to be used even if a tire is punctured. The punctured "run-flat" tire may be
used for a limited time at a reduced speed.
Redundancy means having backup components which automatically "kick in" should one
component fail. For example, large cargo trucks can lose a tire without any major
consequences. They have many tires, and no one tire is critical (with the exception of the
front tires, which are used to steer).
Redundant power supply
Redundant subsystem "B"
In engineering, redundancy is the duplication of critical components of a system with
the intention of increasing reliability of the system, usually in the case of a backup or fail-safe.
In many safety-critical systems, such as fly-by-wire and hydraulic systems in aircraft,
some parts of the control system may be triplicated, which is formally termed triple
modular redundancy (TMR). An error in one component may then be out-voted by the
other two. In a triply redundant system, the system has three sub components, all three of
which must fail before the system fails. Since each one rarely fails, and the sub
components are expected to fail independently, the probability of all three failing is
calculated to be extremely small. Redundancy may also be known by the terms
"majority voting systems" or "voting logic".
A Suspension Bridge's numerous cables are a form of redundancy
Forms of redundancy
There are four major forms of redundancy:
Hardware redundancy, such as DMR and TMR
Information redundancy, such as Error detection and correction methods
Time redundancy, including transient fault detection methods such as Alternate
Software redundancy such as N-version programming
Function of redundancy
The two functions of redundancy are passive redundancy and active redundancy. Both
functions prevent performance decline from exceeding specification limits without
human intervention using extra capacity.
Passive redundancy uses excess capacity to reduce the impact of component failures. One
common form of passive redundancy is the extra strength of cabling and struts used in
bridges. This extra strength allows some structural components to fail without bridge
collapse. The extra strength used in the design is called the margin of safety.
Eyes and ears provide working examples of passive redundancy. Vision loss in one eye
does not cause blindness but depth perception is impaired. Hearing loss in one ear
does not cause deafness but directionality is impaired. Performance decline is commonly
associated with passive redundancy when a limited number of failures occur.
Active redundancy eliminates performance decline by monitoring the performance of
individual devices, and this monitoring is used in voting logic. The voting logic is linked to
switching that automatically reconfigures components. Error detection and correction and
the Global Positioning System (GPS) are two examples of active redundancy.
Electrical power distribution provides an example of active redundancy. Several power
lines connect each generation facility with customers. Each power line includes monitors
that detect overload. Each power line also includes circuit breakers. The combination of
power lines provides excess capacity. Circuit breakers disconnect a power line when
monitors detect an overload. Power is redistributed across the remaining lines.
Voting Logic
Voting logic uses performance monitoring to determine how to reconfigure individual
components so that operation continues without violating specification limitations of the
overall system. Voting logic often involves computers, but systems composed of items
other than computers may be reconfigured using voting logic. Circuit breakers are an
example of a form of non-computer voting logic.
Electrical power systems use power scheduling to reconfigure active redundancy.
Computing systems adjust the production output of each generating facility when other
generating facilities are suddenly lost. This prevents blackout conditions during major
events such as earthquakes.
The simplest voting logic in computing systems involves two components: primary and
alternate. They both run similar software, but the output from the alternate remains
inactive during normal operation. The primary monitors itself and periodically sends an
activity message to the alternate as long as everything is OK. All outputs from the
primary stop, including the activity message, when the primary detects a fault. The
alternate activates its output and takes over from the primary after a brief delay when the
activity message ceases. Errors in voting logic can cause both to have all outputs active at
the same time, can cause both to have all outputs inactive at the same time, or outputs can
flutter on and off.
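The primary/alternate arrangement can be sketched as a small simulation. The class names, the integer time steps and the timeout value are assumptions made for this illustration. A real implementation would also have to deal with lost messages, clock differences and the error cases mentioned above.

```python
class Primary:
    def __init__(self):
        self.healthy = True

    def step(self, now, alternate):
        """Send the activity message while healthy; return whether outputs are active."""
        if self.healthy:
            alternate.receive_activity_message(now)
        return self.healthy  # all outputs stop once a fault is detected


class Alternate:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_message = 0
        self.active = False  # output stays inactive during normal operation

    def receive_activity_message(self, now):
        self.last_message = now

    def step(self, now):
        """Take over once the activity message has been silent for longer than the timeout."""
        if now - self.last_message > self.timeout:
            self.active = True
        return self.active


def simulate(fail_at, timeout, steps):
    primary, alternate = Primary(), Alternate(timeout)
    log = []
    for now in range(steps):
        if now == fail_at:
            primary.healthy = False  # the primary detects a fault in itself
        primary_active = primary.step(now, alternate)
        alternate_active = alternate.step(now)
        log.append((now, primary_active, alternate_active))
    return log


log = simulate(fail_at=5, timeout=2, steps=10)
for now, primary_active, alternate_active in log:
    print(now, "primary" if primary_active else "-", "alternate" if alternate_active else "-")
```

Between the failure of the primary and the takeover by the alternate there is a short interval in which neither output is active. This is the brief delay mentioned above, and its length is set by the timeout.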
A more reliable form of voting logic involves an odd number of devices, three or more. All
perform identical functions and the outputs are compared by the voting logic. The voting
logic establishes a majority when there is a disagreement, and the majority will act to
deactivate the output from other device(s) that disagree. A single fault will not interrupt
normal operation. This technique is used with avionics systems, such as those responsible
for operation of the space shuttle.
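A majority voter of this kind can be sketched in a few lines. The function name and the choice to raise an error when no majority exists are assumptions made for this illustration. Real voters typically compare outputs within a tolerance rather than testing for exact equality.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the majority output and the indices of the devices that disagree.

    Raises ValueError if no value is produced by more than half of the devices.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise ValueError("no majority among device outputs")
    dissenters = [index for index, output in enumerate(outputs) if output != value]
    return value, dissenters


# Three devices computing the same function; the third has a fault.
value, dissenters = majority_vote([42, 42, 41])
print(value, dissenters)
```

With three devices a single faulty output is out-voted and identified, so its output can be deactivated while normal operation continues.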
Calculating the probability of system failure
Each duplicate component added to the system decreases the probability of system failure
according to the formula:
    p = \prod_{i=1}^{n} p_i
where:
n - number of components
p_i - probability of component i failing
p - the probability of all components failing (system failure)
This formula assumes independence of failure events. That means that the probability of
a component B failing given that a component A has already failed is the same as that of
B failing when A has not failed. There are situations where this is unreasonable, such as
using two power supplies connected to the same socket, whereby if the socket failed,
both supplies would fail too.
It also assumes that only one component is needed to keep the system running. If m
components out of n are needed for the system to survive, then, assuming all components
have an equal probability of failure, p, the probability of system failure is
    P_{fail} = \sum_{k=n-m+1}^{n} \binom{n}{k} p^{k} (1-p)^{n-k}
This model is probably unrealistic in that it assumes that components are not replaced in
time when they fail.
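As a rough numerical illustration of this model, the sketch below computes both cases in Python under the same assumptions (independent failures, no repair). The function names are hypothetical, and the m-out-of-n case uses the standard binomial expression for "more than n − m components fail".

```python
from math import comb, prod


def all_fail_probability(p_fail):
    """Probability that every independent component fails (only one is needed)."""
    return prod(p_fail)


def m_of_n_failure_probability(n, m, p):
    """Probability that fewer than m of n identical components survive."""
    # The system fails when at least n - m + 1 components have failed.
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n - m + 1, n + 1))


triple = all_fail_probability([0.1, 0.1, 0.1])       # about 0.001
two_of_three = m_of_n_failure_probability(3, 2, 0.1)  # about 0.028
```

Requiring two of three components to survive makes system failure far more likely than requiring only one, even though the hardware is identical.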
When to use
Providing fault-tolerant design for every component is normally not an option. In such
cases the following criteria may be used to determine which components should be
fault-tolerant:
How critical is the component? In a car, the radio is not critical, so this
component has less need for fault-tolerance.
How likely is the component to fail? Some components, like the drive shaft in a
car, are not likely to fail, so no fault-tolerance is needed.
How expensive is it to make the component fault-tolerant? Requiring a
redundant car engine, for example, would likely be too expensive both
economically and in terms of weight and space, to be considered.
An example of a component that passes all the tests is a car's occupant restraint system.
We do not normally think about it, but the primary occupant restraint system is gravity. If
the vehicle rolls over or undergoes severe g-forces, then this primary method of occupant
restraint may fail. Restraining the occupants during such an accident is absolutely critical
to safety, so we pass the first test. Accidents causing occupant ejection were quite
common before seat belts, so we pass the second test. The cost of a redundant restraint
method like seat belts is quite low, both economically and in terms of weight and space,
so we pass the third test. Therefore, adding seat belts to all vehicles is an excellent idea.
Other "supplemental restraint systems", such as airbags, are more expensive and so pass
that test by a smaller margin.
Hardware fault-tolerance sometimes requires that broken parts can be swapped out with
new ones while the system is still operational (in computing known as hot swapping).
Such a system implemented with a single backup is known as single point tolerant, and
represents the vast majority of fault-tolerant systems. In such systems the mean time
between failures should be long enough for the operators to have time to fix the broken
devices (mean time to repair) before the backup also fails. It helps if the time between
failures is as long as possible, but this is not specifically required in a fault-tolerant
system.
Fault-tolerance is notably successful in computer applications. Tandem Computers built
their entire business on such machines, which used single point tolerance to create their
NonStop systems with uptimes measured in years.
Fail-safe architectures may also encompass the computer software, for example through
process replication.
Fault-tolerant design's advantages are obvious, while many of its disadvantages are not:
Interference with fault detection in the same component. To continue the
above passenger vehicle example, it may not be obvious to the driver when a tire
has been punctured, with either of the fault-tolerant systems. This is usually
handled with a separate "automated fault detection system". In the case of the tire,
an air pressure monitor detects the loss of pressure and notifies the driver. The
alternative is a "manual fault detection system", such as manually inspecting all
tires at each stop.
Interference with fault detection in another component. Another variation of
this problem is when fault-tolerance in one component prevents fault detection in
a different component. For example, if component B performs some operation
based on the output from component A, then fault-tolerance in B can hide a
problem with A. If component B is later changed (to a less fault-tolerant design)
the system may fail suddenly, making it appear that the new component B is the
problem. Only after the system has been carefully scrutinized will it become clear
that the root problem is actually with component A.
Reduction of priority of fault correction. Even if the operator is aware of the
fault, having a fault-tolerant system is likely to reduce the importance of repairing
the fault. If the faults are not corrected, this will eventually lead to system failure,
when the fault-tolerant component fails completely or when all redundant
components have also failed.
Test difficulty. For certain critical fault-tolerant systems, such as a nuclear
reactor, there is no easy way to verify that the backup components are functional.
The most infamous example of this is Chernobyl, where operators tested the
emergency backup cooling by disabling primary and secondary cooling. The
backup failed, resulting in a core meltdown and massive release of radiation.
Cost. Both fault-tolerant components and redundant components tend to increase
cost. This can be a purely economic cost or can include other measures, such as
weight. Manned spaceships, for example, have so many redundant and fault-tolerant
components that their weight is increased dramatically over unmanned
systems, which don't require the same level of safety.
Inferior components. A fault-tolerant design may allow for the use of inferior
components, which would have otherwise made the system inoperable. While this
practice has the potential to mitigate the cost increase, use of multiple inferior
components may lower the reliability of the system to a level equal to, or even
worse than, a comparable non-fault-tolerant system.
Related terms
There is a difference between fault-tolerance and systems that rarely have problems. For
instance, the Western Electric crossbar systems had failure rates of two hours per forty
years, and therefore were highly fault resistant. But when a fault did occur they still
stopped operating completely, and therefore were not fault-tolerant.
Chapter- 7
Fault-tolerant System
Fault-tolerance or graceful degradation is the property that enables a system (often
computer-based) to continue operating properly in the event of the failure of (or one or
more faults within) some of its components. If its operating quality decreases at all, the
decrease is proportional to the severity of the failure, as compared to a naïvely-designed
system in which even a small failure can cause total breakdown. Fault-tolerance is
particularly sought-after in high-availability or life-critical systems.
Fault-tolerance is not just a property of individual machines; it may also characterise the
rules by which they interact. For example, the Transmission Control Protocol (TCP) is
designed to allow reliable two-way communication in a packet-switched network, even in
the presence of communications links which are imperfect or overloaded. It does this by
requiring the endpoints of the communication to expect packet loss, duplication,
reordering and corruption, so that these conditions do not damage data integrity, and only
reduce throughput by a proportional amount.
An example of graceful degradation by design in an image with transparency. The top
two images are each the result of viewing the composite image in a viewer that
recognises transparency. The bottom two images are the result in a viewer with no
support for transparency. Because the transparency mask (centre bottom) is discarded,
only the overlay (centre top) remains; the image on the left has been designed to degrade
gracefully, hence is still meaningful without its transparency information.
Data formats may also be designed to degrade gracefully. HTML, for example, is
designed to be forward compatible, allowing new HTML entities to be ignored by Web
browsers which do not understand them without causing the document to be unusable.
Recovery from errors in fault-tolerant systems can be characterised as either
roll-forward or roll-back. When the system detects that it has made an error, roll-forward
recovery takes the system state at that time and corrects it, to be able to move forward.
Roll-back recovery reverts the system state back to some earlier, correct version, for
example using checkpointing, and moves forward from there. Roll-back recovery
requires that the operations between the checkpoint and the detected erroneous state can
be made idempotent. Some systems make use of both roll-forward and roll-back recovery
for different errors or different parts of one error.
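Roll-back recovery with checkpointing can be illustrated with a toy example. The class `CheckpointedCounter` and its rule that negative amounts are errors are assumptions invented for this sketch; a real system would checkpoint far richer state.

```python
import copy


class CheckpointedCounter:
    """Toy state machine with roll-back recovery (hypothetical example)."""

    def __init__(self):
        self.state = {"total": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        # Save a known-good copy of the state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Revert to the last known-good state.
        self.state = copy.deepcopy(self._checkpoint)

    def apply(self, amounts):
        """Apply a batch of updates; on error, revert to the last checkpoint."""
        try:
            for amount in amounts:
                if amount < 0:
                    raise ValueError("negative amount")
                self.state["total"] += amount
        except ValueError:
            self.rollback()
            return False
        self.checkpoint()
        return True


counter = CheckpointedCounter()
ok = counter.apply([1, 2, 3])      # succeeds; checkpoint taken at total == 6
failed = counter.apply([10, -1])   # fails midway; state reverts to total == 6
```

Because the failed batch is discarded as a whole, it can safely be retried from the checkpoint, which is the property the idempotency requirement above is meant to guarantee.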
Within the scope of an individual system, fault-tolerance can be achieved by anticipating
exceptional conditions and building the system to cope with them, and, in general, aiming
for self-stabilization so that the system converges towards an error-free state. However, if
the consequences of a system failure are catastrophic, or the cost of making it sufficiently
reliable is very high, a better solution may be to use some form of duplication. In any
case, if the consequence of a system failure is catastrophic, the system must be able to use
reversion to fall back to a safe mode. This is similar to roll-back recovery but can be a
human action if humans are present in the loop.
Fault tolerance requirements
The basic characteristics of fault tolerance require:
No single point of repair
Fault isolation to the failing component
Fault containment to prevent propagation of the failure
Availability of reversion modes
In addition, fault tolerant systems are characterized in terms of both planned service
outages and unplanned service outages. These are usually measured at the application
level and not just at a hardware level. The figure of merit is called availability and is
expressed as a percentage. For example, a five nines system would statistically provide
99.999% availability.
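To make the availability figure concrete, the following sketch converts a percentage into expected downtime. The function name is hypothetical and a 365-day year is assumed.

```python
def downtime_minutes_per_year(availability_percent, hours_per_year=365 * 24):
    """Expected downtime in minutes per year for a given availability percentage."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * hours_per_year * 60


five_nines = downtime_minutes_per_year(99.999)   # roughly 5.26 minutes per year
three_nines = downtime_minutes_per_year(99.9)    # roughly 525.6 minutes per year
```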
Fault-tolerant systems are typically based on the concept of redundancy.
Fault-tolerance by replication
Spare components address the first fundamental characteristic of fault-tolerance in
three ways:
Replication: Providing multiple identical instances of the same system or
subsystem, directing tasks or requests to all of them in parallel, and choosing the
correct result on the basis of a quorum;
Redundancy: Providing multiple identical instances of the same system and
switching to one of the remaining instances in case of a failure (failover);
Diversity: Providing multiple different implementations of the same specification,
and using them like replicated systems to cope with errors in a specific implementation.
All implementations of RAID, redundant array of independent disks, except RAID 0 are
examples of a fault-tolerant storage device that uses data redundancy.
A lockstep fault-tolerant machine uses replicated elements operating in parallel. At any
time, all the replications of each element should be in the same state. The same inputs are
provided to each replication, and the same outputs are expected. The outputs of the
replications are compared using a voting circuit. A machine with two replications of each
element is termed Dual Modular Redundant (DMR). The voting circuit can then only
detect a mismatch and recovery relies on other methods. A machine with three
replications of each element is termed Triple Modular Redundant (TMR). The voting
circuit can determine which replication is in error when a two-to-one vote is observed. In
this case, the voting circuit can output the correct result, and discard the erroneous
version. After this, the internal state of the erroneous replication is assumed to be
different from that of the other two, and the voting circuit can switch to a DMR mode.
This model can be applied to any larger number of replications.
Lockstep fault tolerant machines are most easily made fully synchronous, with each gate
of each replication making the same state transition on the same edge of the clock, and
the clocks to the replications being exactly in phase. However, it is possible to build
lockstep systems without this requirement.
Bringing the replications into synchrony requires making their internal stored states the
same. They can be started from a fixed initial state, such as the reset state. Alternatively,
the internal state of one replica can be copied to another replica.
One variant of DMR is pair-and-spare. Two replicated elements operate in lockstep as a
pair, with a voting circuit that detects any mismatch between their operations and outputs
a signal indicating that there is an error. Another pair operates exactly the same way. A
final circuit selects the output of the pair that does not proclaim that it is in error.
Pair-and-spare requires four replicas rather than the three of TMR, but has been used
commercially.
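The pair-and-spare selection logic can be sketched in a few lines. The function names and the integer outputs are illustrative assumptions; in hardware the comparison would be done by a circuit on every clock cycle.

```python
def pair_output(a, b):
    """Compare two lockstep replicas; return (value, error_flag)."""
    return (a, False) if a == b else (None, True)


def pair_and_spare(pair1, pair2):
    """Select the output of the first pair that does not flag an error."""
    for a, b in (pair1, pair2):
        value, error = pair_output(a, b)
        if not error:
            return value
    raise RuntimeError("both pairs report a mismatch")


# The first pair disagrees internally, so the spare pair's output is used.
result = pair_and_spare((7, 8), (7, 7))
```

Note that a pair can only detect a mismatch, not decide which replica is wrong, which is why a second pair is needed to keep producing output.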
No single point of repair
If a system experiences a failure, it must continue to operate without interruption during
the repair process.
Fault isolation to the failing component
When a failure occurs, the system must be able to isolate the failure to the offending
component. This requires the addition of dedicated failure detection mechanisms that
exist only for the purpose of fault isolation.
Recovery from a fault condition requires classifying the fault or failing component. The
National Institute of Standards and Technology (NIST) categorizes faults based on
Locality, Cause, Duration and Effect.
Fault containment
Some failure mechanisms can cause a system to fail by propagating the failure to the rest
of the system. An example of this kind of failure is the "Rogue transmitter" which can
swamp legitimate communication in a system and cause overall system failure.
Mechanisms that isolate a rogue transmitter or failing component to protect the system
are required.
Specimen of Fault-tolerant System
Fault-tolerant computer system
A conceptual design of a segregated-component fault-tolerant computer design
Fault-tolerant computer systems are systems designed around the concepts of fault
tolerance. In essence, they have to be able to keep working to a level of satisfaction in the
presence of faults.
Types of fault tolerance
Most fault-tolerant computer systems are designed to be able to handle several possible
failures, including hardware-related faults such as hard disk failures, input or output
device failures, or other temporary or permanent failures; software bugs and errors;
interface errors between the hardware and software, including driver failures; operator
errors, such as erroneous keystrokes, bad command sequences, or installing unexpected
software; and physical damage or other flaws introduced to the system from an outside
source.
Hardware fault-tolerance is the most common application of these systems, designed to
prevent failures due to hardware components. Typically, components have multiple
backups and are separated into smaller "segments" that act to contain a fault, and extra
redundancy is built into all physical connectors, power supplies, fans, etc. There are
special software and instrumentation packages designed to detect failures, such as fault
masking, which is a way to ignore faults by seamlessly preparing a backup component to
execute something as soon as the instruction is sent, using a sort of voting protocol where
if the main and backups don't give the same results, the flawed output is ignored.
Software fault-tolerance is based more around nullifying programming errors using
real-time redundancy, or static "emergency" subprograms to fill in for programs that crash.
There are many ways to conduct such fault-regulation, depending on the application and
the available hardware.
The first known fault-tolerant computer was SAPO, built in 1951 in Czechoslovakia by
Antonin Svoboda. Its basic design was magnetic drums connected via relays, with a
voting method of memory error detection. Several other machines were developed along
this line, mostly for military use. Eventually, they separated into three distinct categories:
machines that would last a long time without any maintenance, such as the ones used on
NASA space probes and satellites; computers that were very dependable but required
constant monitoring, such as those used to monitor and control nuclear power plants or
supercollider experiments; and finally, computers with a high amount of runtime which
would be under heavy use, such as many of the supercomputers used by insurance
companies for their probability monitoring.
Most of the development in the so called LLNM (Long Life, No Maintenance) computing
was done by NASA during the 1960s, in preparation for Project Apollo and other
research aspects. NASA's first machine went into a space observatory, and their second
attempt, the JSTAR computer, was used in Voyager. This computer had a backup of
memory arrays to use memory recovery methods and thus it was called the JPL Self-
Testing-And-Repairing computer. It could detect its own errors and fix them or bring up
redundant modules as needed. The computer is still working today.
Hyper-dependable computers were pioneered mostly by aircraft manufacturers, nuclear
power companies, and the railroad industry in the USA. These needed computers with
massive amounts of uptime that would fail gracefully enough with a fault to allow
continued operation, while relying on the fact that the computer output would be
constantly monitored by humans to detect faults. Again, IBM developed the first computer of this kind for NASA for guidance of Saturn V rockets, but later on BNSF, Unisys,
and General Electric built their own.
In general, the early efforts at fault-tolerant designs were focused mainly on internal
diagnosis, where a fault would indicate something was failing and a worker could replace
it. SAPO, for instance, had a method by which faulty memory drums would emit a noise
before failure. Later efforts showed that, to be fully effective, the system had to be
self-repairing and diagnosing – isolating a fault and then implementing a redundant backup
while alerting a need for repair. This is known as N-model redundancy, where faults
cause automatic fail safes and a warning to the operator, and it is still the most common
form of level one fault-tolerant design in use today.
Voting was another initial method, as discussed above, with multiple redundant backups
operating constantly and checking each other's results, with the outcome that if, for
example, four components reported an answer of 5 and one component reported an
answer of 6, the other four would "vote" that the fifth component was faulty and have it
taken out of service. This is called M out of N majority voting.
Historically, the trend has been to move away from N-model redundancy and toward M out
of N voting, because of the growing complexity of systems and the difficulty of ensuring
that the transition from the fault-negative to the fault-positive state does not disrupt
operations.
Fault tolerance verification and validation
The most important requirement of design in a fault tolerant computer system is making
sure it actually meets its requirements for reliability. This is done by using various failure
models to simulate various failures, and analyzing how well the system reacts. These
statistical models are very complex, involving probability curves and specific fault rates,
latency curves, error rates, and the like. The most commonly used models are HARP,
SAVE, and SHARPE in the USA, and SURF or LASS in Europe.
Fault tolerance research
Research into the kinds of tolerances needed for critical systems involves a large amount
of interdisciplinary work. The more complex the system, the more carefully all possible
interactions have to be considered and prepared for. Considering the importance of
high-value systems in transport, utilities and the military, the field of topics that touch on
research is very wide: it can include such obvious subjects as software modeling and
reliability, or hardware design, to arcane elements such as stochastic models, graph
theory, formal or exclusionary logic, parallel processing, remote data transmission, and
Chapter- 8
RAID
RAID, an acronym for Redundant Array of Independent Disks (originally Redundant
Array of Inexpensive Disks), is a technology that provides increased storage functions
and reliability through redundancy. This is achieved by combining multiple disk drive
components into a logical unit, where data is distributed across the drives in one of
several ways called "RAID levels". This concept was first defined by David A. Patterson,
Garth A. Gibson, and Randy Katz at the University of California, Berkeley in 1987 as
Redundant Arrays of Inexpensive Disks. Marketers representing industry RAID
manufacturers later attempted to reinvent the term to describe a redundant array of
independent disks as a means of dissociating a low-cost expectation from RAID technology.
RAID is now used as an umbrella term for computer data storage schemes that can divide
and replicate data among multiple disk drives. The schemes or architectures are named by
the word RAID followed by a number (e.g., RAID 0, RAID 1). The various designs of
RAID systems involve two key goals: increase data reliability and increase input/output
performance. When multiple physical disks are set up to use RAID technology, they are
said to be in a RAID array. This array distributes data across multiple disks, but the array
is addressed by the operating system as one single disk. RAID can be set up to serve
several different purposes.
Standard levels
A number of standard schemes have evolved which are referred to as levels. There were
five RAID levels originally conceived, but many more variations have evolved, notably
several nested levels and many non-standard levels (mostly proprietary).
Following is a brief textual summary of the most commonly used RAID levels.
RAID 0 (block-level striping without parity or mirroring) has no (or zero)
redundancy. It provides improved performance and additional storage but no fault
tolerance. Hence simple stripe sets are normally referred to as RAID 0. Any disk
failure destroys the array, and the likelihood of failure increases with more disks
in the array (at a minimum, catastrophic data loss is twice as likely compared to
single drives without RAID). A single disk failure destroys the entire array
because when data is written to a RAID 0 volume, the data is broken into
fragments called blocks. The number of blocks is dictated by the stripe size,
which is a configuration parameter of the array. The blocks are written to their
respective disks simultaneously on the same sector. This allows smaller sections
of the entire chunk of data to be read off the drive in parallel, increasing bandwidth. RAID 0 does not implement error checking, so any error is uncorrectable.
More disks in the array means higher bandwidth, but greater risk of data loss.
In RAID 1 (mirroring without parity or striping), data is written identically to
multiple disks (a "mirrored set"). Although many implementations create sets of 2
disks, sets may contain 3 or more disks. The array provides fault tolerance from disk
errors or failures and continues to operate as long as at least one drive in the
mirrored set is functioning. With appropriate operating system support, there can
be increased read performance, and only a minimal write performance reduction.
Using RAID 1 with a separate controller for each disk is sometimes called
In RAID 2 (bit-level striping with dedicated Hamming-code parity), all disk
spindle rotation is synchronized, and data is striped such that each sequential bit is
on a different disk. Hamming-code parity is calculated across corresponding bits
on disks and stored on one or more parity disks. Extremely high data transfer rates
are possible.
In RAID 3 (byte-level striping with dedicated parity), all disk spindle rotation is
synchronized, and data is striped such that each sequential byte is on a different
disk. Parity is calculated across corresponding bytes on disks and stored on a
dedicated parity disk. Very high data transfer rates are possible.
RAID 4 (block-level striping with dedicated parity) is identical to RAID 5 (see
below), but confines all parity data to a single disk, which can create a
performance bottleneck. In this setup, files can be distributed between multiple
disks. Each disk operates independently which allows I/O requests to be
performed in parallel, though data transfer speeds can suffer due to the type of
parity. The error detection is achieved through dedicated parity and is stored in a
separate, single disk unit.
RAID 5 (block-level striping with distributed parity) distributes parity along with
the data and requires all drives but one to be present to operate; drive failure
requires replacement, but the array is not destroyed by a single drive failure. Upon
drive failure, any subsequent reads can be calculated from the distributed parity
such that the drive failure is masked from the end user. The array will have data
loss in the event of a second drive failure and is vulnerable until the data that was
on the failed drive is rebuilt onto a replacement drive. A single drive failure in the
set will result in reduced performance of the entire set until the failed drive has
been replaced and rebuilt.
RAID 6 (block-level striping with double distributed parity) provides fault
tolerance from two drive failures; the array continues to operate with up to two failed
drives. This makes larger RAID groups more practical, especially for
high-availability systems. This becomes increasingly important as large-capacity drives
lengthen the time needed to recover from the failure of a single drive.
Single-parity RAID levels are as vulnerable to data loss as a RAID 0 array until the failed
drive is replaced and its data rebuilt; the larger the drive, the longer the rebuild
will take. Double parity gives time to rebuild the array without the data being at
risk if a single additional drive fails before the rebuild is complete.
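The block-level striping used by RAID 0 can be sketched as a round-robin distribution of fixed-size blocks. This is a simplified model: the function names, the in-memory "disks" (Python lists) and the tiny block size are assumptions for illustration only.

```python
def stripe(data: bytes, n_disks: int, block_size: int):
    """Distribute data round-robin across n_disks in block_size chunks."""
    disks = [[] for _ in range(n_disks)]
    for offset in range(0, len(data), block_size):
        block_index = offset // block_size
        disks[block_index % n_disks].append(data[offset:offset + block_size])
    return disks


def unstripe(disks):
    """Reassemble the original data by reading blocks in round-robin order."""
    out = []
    longest = max(len(d) for d in disks)
    for row in range(longest):
        for d in disks:
            if row < len(d):
                out.append(d[row])
    return b"".join(out)


disks = stripe(b"ABCDEFGHIJ", n_disks=3, block_size=2)
restored = unstripe(disks)
```

Since every disk holds a share of every large file, losing any one list in `disks` makes the data unrecoverable, which mirrors the lack of fault tolerance described for RAID 0.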
The following table provides an overview of the most important parameters of standard
RAID levels. Space efficiency is given as an equation in terms of the number of drives, n,
which results in a value between 0 and 1, representing the fraction of the sum of the
drives' capacities that is available for use. For example, if three drives are arranged in
RAID 3, this gives a space efficiency of 1 − 1/3 ≈ 0.67. If their individual capacities are
250 GB each, for a total of 750 GB over the three, the usable capacity under RAID 3 for
data storage is 500 GB.
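Level   Description                                             Space efficiency       Fault tolerance
RAID 0  Block-level striping without parity or mirroring       1                      0 (none)
RAID 1  Mirroring without parity or striping                   1/n                    n−1 disks
RAID 2  Bit-level striping with dedicated Hamming-code parity  1 − 1/n ⋅ log2(n−1)    1 disk
RAID 3  Byte-level striping with dedicated parity              1 − 1/n                1 disk
RAID 4  Block-level striping with dedicated parity             1 − 1/n                1 disk
RAID 5  Block-level striping with distributed parity           1 − 1/n                1 disk
RAID 6  Block-level striping with double distributed parity    1 − 2/n                2 disks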
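The space-efficiency arithmetic can be checked with a short sketch. The function names are hypothetical, the efficiencies used are the commonly quoted ones (1 for RAID 0, 1/n for RAID 1, 1 − 1/n for RAID 3 to 5, 1 − 2/n for RAID 6), and RAID 2 is left out for simplicity.

```python
def space_efficiency(level: int, n: int) -> float:
    """Fraction of the total raw capacity usable for data with n drives."""
    if level == 0:
        return 1.0
    if level == 1:
        return 1 / n
    if level in (3, 4, 5):
        return 1 - 1 / n
    if level == 6:
        return 1 - 2 / n
    raise ValueError(f"level {level} is not covered by this sketch")


def usable_capacity_gb(level: int, n: int, drive_gb: float) -> float:
    """Usable capacity when n drives of drive_gb each are combined."""
    return space_efficiency(level, n) * n * drive_gb


raid3_three_drives = usable_capacity_gb(3, 3, 250)   # about 500 GB of 750 GB raw
```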
Nested (hybrid) RAID
In what was originally termed hybrid RAID, many storage controllers allow RAID
levels to be nested. The elements of a RAID may be either individual disks or RAIDs
themselves. Nesting more than two deep is unusual.
As there is no basic RAID level numbered larger than 9, nested RAIDs are usually
unambiguously described by attaching the numbers indicating the RAID levels,
sometimes with a "+" in between. The order of the digits in a nested RAID designation is
the order in which the nested array is built: for RAID 1+0 first pairs of drives are
combined into two or more RAID 1 arrays (mirrors), and then the resulting RAID 1
arrays are combined into a RAID 0 array (stripes). It is also possible to combine stripes
into mirrors (RAID 0+1). The final step is known as the top array. When the top array is a
RAID 0 (such as in RAID 10 and RAID 50) most vendors omit the "+", though RAID
5+0 is clearer.
RAID 0+1: striped sets in a mirrored set (minimum four disks; even number of
disks) provides fault tolerance and improved performance but increases complexity.
The key difference from RAID 1+0 is that RAID 0+1 creates a second striped set
to mirror a primary striped set. The array continues to operate with one or more
drives failed in the same mirror set, but if drives fail on both sides of the mirror
the data on the RAID system is lost.
RAID 1+0: mirrored sets in a striped set (minimum two disks but more commonly
four disks to take advantage of speed benefits; even number of disks) provides
fault tolerance and improved performance but increases complexity.
The key difference from RAID 0+1 is that RAID 1+0 creates a striped set from a
series of mirrored drives. In a failed disk situation, RAID 1+0 performs better
because all the remaining disks continue to be used. The array can sustain
multiple drive losses so long as no mirror loses all its drives.
RAID 5+1: mirrored striped set with distributed parity (some manufacturers label
this as RAID 53).
Whether an array runs as RAID 0+1 or RAID 1+0 in practice is often determined by the
evolution of the storage system. A RAID controller might support upgrading a RAID 1
array to a RAID 1+0 array on the fly, but require a lengthy offline rebuild to upgrade
from RAID 1 to RAID 0+1. With nested arrays, sometimes the path of least disruption
prevails over achieving the preferred configuration.
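The difference in failure tolerance between RAID 1+0 and RAID 0+1 can be modelled with two small predicates. The drive labels and function names are hypothetical; each array is described simply as a list of drive groups.

```python
def raid10_survives(mirrors, failed):
    """RAID 1+0: every mirrored set must keep at least one working drive."""
    return all(any(d not in failed for d in mirror) for mirror in mirrors)


def raid01_survives(stripes, failed):
    """RAID 0+1: at least one striped set must have all its drives working."""
    return any(all(d not in failed for d in stripe) for stripe in stripes)


groups = [("A", "B"), ("C", "D")]     # two mirrors, or two striped sets

# One failure in each group: RAID 1+0 survives, RAID 0+1 does not.
cross_10 = raid10_survives(groups, {"A", "D"})
cross_01 = raid01_survives(groups, {"A", "D"})

# Both failures in the same group: RAID 0+1 survives, RAID 1+0 does not.
same_10 = raid10_survives(groups, {"A", "B"})
same_01 = raid01_survives(groups, {"A", "B"})
```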
RAID Parity
Many RAID levels employ an error protection scheme called "parity". Parity calculation,
in and of itself, is a widely used method in information technology to provide fault
tolerance in a given set of data.
In Boolean logic, there is a principle called exclusive or, abbreviated "XOR", meaning
"one or the other, but not both and not neither." For example:
0 XOR 0 = 0
0 XOR 1 = 1
1 XOR 0 = 1
1 XOR 1 = 0
The XOR operator is central to how parity data is created and used within an array; it is
used both for the protection of data, as well as for the recovery of missing data.
As an example, for a simple RAID made up of 6 hard disks (4 for data, 1 for parity, and 1
for use as hot spare), where each drive is capable of holding just a single byte worth of
storage, an initial RAID configuration with random values written to each of our four
data drives would look like:
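Drive #1: 00101010 (Data)
Drive #2: 10001110 (Data)
Drive #3: 11110111 (Data)
Drive #4: 10110101 (Data)
Drive #5: -------- (Hot Spare)
Drive #6: -------- (Parity)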
Every time data is written to the data drives, a parity value is calculated in order to be
able to recover from a disk failure. To calculate the parity for this RAID, the XOR of
each drive's data is calculated. The resulting value is the parity data.
00101010 XOR 10001110 XOR 11110111 XOR 10110101 = 11100110
The parity data "11100110" is then written to the dedicated parity drive:
Drive #1: 00101010 (Data)
Drive #2: 10001110 (Data)
Drive #3: 11110111 (Data)
Drive #4: 10110101 (Data)
Drive #5: -------- (Hot Spare)
Drive #6: 11100110 (Parity)
In order to restore the contents of a failed drive, e.g. Drive #3, the same XOR calculation
is performed against all the remaining drives, substituting the parity value (11100110) in
place of the missing/dead drive:
00101010 XOR 10001110 XOR 11100110 XOR 10110101 = 11110111
With the complete contents of Drive #3 recovered, the data is written to the hot spare, and
the RAID can continue operating.
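Drive #1: 00101010 (Data)
Drive #2: 10001110 (Data)
Drive #3: --Dead-- (Data)
Drive #4: 10110101 (Data)
Drive #5: 11110111 (Hot Spare)
Drive #6: 11100110 (Parity)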
At this point the dead drive has to be replaced with a working one of the same size. When
this happens, the hot spare's contents are then automatically copied to it by the array
controller, allowing the hot spare to return to its original purpose as an emergency
standby drive. The resulting array is identical to its pre-failure state:
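Drive #1: 00101010 (Data)
Drive #2: 10001110 (Data)
Drive #3: 11110111 (Data)
Drive #4: 10110101 (Data)
Drive #5: -------- (Hot Spare)
Drive #6: 11100110 (Parity)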
This same basic XOR principle applies to parity within RAID groups regardless of
capacity or number of drives. As long as there are enough drives present to allow for an
XOR calculation to take place, parity can be used to recover data from any single drive
failure. (A minimum of three drives must be present in order for parity to be used for
fault tolerance, since the XOR operator requires two operands, and a place to store the
result.)
RAID 10 versus RAID 5 in Relational Databases
A common myth (and one which serves to illustrate the mechanics of proper RAID
implementation) is that in all deployments, RAID 10 is inherently better for relational
databases than RAID 5, due to RAID 5's need to recalculate and redistribute parity data
on a per-write basis.
While this may have been a hurdle in past RAID 5 implementations, the task of parity
recalculation and redistribution within modern Storage Area Network (SAN) appliances
is performed as a back-end process transparent to the host, not as an in-line process which
competes with existing I/O. (i.e. the RAID controller handles this as a housekeeping task
to be performed during a particular spindle's idle timeslices, so as not to disrupt any
pending I/O from the host.) The "write penalty" inherent to RAID 5 has been effectively
masked over the past ten years by a combination of improved controller design, larger
amounts of cache, and faster hard disks.
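To make the per-write parity recalculation concrete, the sketch below shows one common way a single-block update can be handled: the new parity is derived from the old parity, the old data, and the new data, rather than by re-reading the whole stripe. This read-modify-write shortcut is an assumption brought in for illustration (the text does not say which method a given controller uses), and `updated_parity` is a hypothetical name.

```python
def updated_parity(old_parity: int, old_data: int, new_data: int) -> int:
    """Return the parity after one data block changes (read-modify-write)."""
    # XOR-ing out the old data and XOR-ing in the new data updates the parity
    # without touching the other data blocks in the stripe.
    return old_parity ^ old_data ^ new_data


# One-byte blocks, reusing the values from the parity example.
blocks = [0b00101010, 0b10001110, 0b11110111, 0b10110101]
parity = blocks[0] ^ blocks[1] ^ blocks[2] ^ blocks[3]

# Overwrite the third block with an arbitrary new value.
new_value = 0b00001111
parity = updated_parity(parity, blocks[2], new_value)
blocks[2] = new_value

# The shortcut must agree with recomputing parity over the whole stripe.
full = blocks[0] ^ blocks[1] ^ blocks[2] ^ blocks[3]
print(parity == full)
```

Under this scheme one logical write costs two reads (old data, old parity) and two writes (new data, new parity), which is the extra work the "write penalty" refers to.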
In the vast majority of enterprise-level SAN hardware, any writes which are generated by
the host are simply acknowledged immediately, and destaged to disk on the back end
when the controller sees fit to do so. From the host's perspective, an individual write to a
RAID 10 volume is no faster than an individual write to a RAID 5 volume; a difference
between the two only becomes apparent when write cache at the SAN controller level is
overwhelmed, and the SAN appliance must reject or gate further write requests in order
to allow write buffers on the controller to destage to disk. While rare, this generally
indicates poor performance management on behalf of the SAN administrator, not a
shortcoming of RAID 5 or RAID 10. SAN appliances generally service multiple hosts
which compete both for controller cache and spindle time with one another. This
contention is largely masked, in that the controller is generally intelligent and adaptive
enough to maximize read cache hit ratios while also maximizing the process of destaging
data from write cache.
The choice of RAID 10 versus RAID 5 for the purposes of housing a relational database
will depend upon a number of factors (spindle availability, cost, business risk, etc.) but,
from a performance standpoint, it depends mostly on the type of I/O that database can
expect to see. For databases that are expected to be exclusively or strongly read-biased,
RAID 10 is often chosen in that it offers a slight speed improvement over RAID 5 on
sustained reads. If a database is expected to be strongly write-biased, RAID 5 becomes
the more attractive option, since RAID 5 doesn't suffer from the same write handicap
inherent in RAID 10; all spindles in a RAID 5 can be utilized to write simultaneously,
whereas only half the members of a RAID 10 can be used. However, for reasons similar
to those that have masked the "write penalty" in RAID 5, the reduced ability of a RAID 10 to
handle sustained writes has been largely masked by improvements in controller cache
efficiency and disk throughput.
What causes RAID 5 to be slightly slower than RAID 10 on sustained reads is the fact
that RAID 5 has parity data interleaved within normal data. For every read pass in RAID
5, there is a probability that a read head may need to traverse a region of parity data. The
cumulative effect of this is a slight performance drop compared to RAID 10, which does
not use parity, and therefore will never encounter a circumstance where data underneath a
head is of no use. For the vast majority of situations, however, most relational databases
housed on RAID 10 perform equally well in RAID 5. The strengths and weaknesses of
each type only become an issue in atypical deployments, or deployments on
overcommitted or outdated hardware.
Performance is not the only consideration, however. RAID 5 and other non-mirror-based
arrays offer a lower degree of resiliency than RAID 10 by virtue of RAID 10's mirroring
strategy. In a RAID 10, I/O can continue in spite of multiple drive failures, provided
that no mirrored pair loses both of its members. By comparison, in a RAID 5 array, any
simultaneous failure involving more than one drive will render the array itself unusable,
because parity recalculation becomes impossible to perform. For
many, particularly in mission-critical environments with enough capital to spend, RAID
10 becomes the favorite as it provides the lowest level of risk.
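The difference in resiliency can be illustrated by enumerating every possible two-drive failure in a small array. The sketch below assumes a four-drive array and, for RAID 10, a hypothetical pairing of drives (0, 1) and (2, 3) as mirrors; the function names are illustrative only.

```python
from itertools import combinations


def raid10_survives(failed, mirror_pairs):
    """RAID 10 keeps running unless both members of some mirror pair fail."""
    return not any(set(pair) <= set(failed) for pair in mirror_pairs)


def raid5_survives(failed):
    """Single-parity RAID 5 tolerates at most one failed drive."""
    return len(failed) <= 1


drives = range(4)
mirror_pairs = [(0, 1), (2, 3)]  # hypothetical mirror layout
double_failures = list(combinations(drives, 2))

raid10_ok = sum(raid10_survives(f, mirror_pairs) for f in double_failures)
raid5_ok = sum(raid5_survives(f) for f in double_failures)

print(f"RAID 10 survives {raid10_ok} of {len(double_failures)} double failures")
print(f"RAID 5 survives {raid5_ok} of {len(double_failures)} double failures")
```

With this layout, RAID 10 survives the four double failures that hit different mirror pairs and is lost only when both halves of one pair fail, whereas RAID 5 survives none of them.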
Additionally, the time required to rebuild data on a hot spare in a RAID 10 is
significantly less than RAID 5, in that all the remaining spindles in a RAID 5 rebuild
must participate in the process, whereas only half of all spindles need to participate in
RAID 10. In modern RAID 10 implementations, all drives generally participate in the
rebuilding process as well, but only half are required, allowing greater degraded-state
throughput over RAID 5 and overall faster rebuild times.
Again, modern SAN design largely masks any performance hit while the RAID array is
in a degraded state, by virtue of selectively being able to perform rebuild operations both
in-band or out-of-band with respect to existing I/O traffic. Given the rare nature of drive
failures in general, and the exceedingly low probability of multiple concurrent drive
failures occurring within the same RAID array, the choice of RAID 5 over RAID 10
often comes down to the preference of the storage administrator, particularly when
weighed against other factors such as cost, throughput requirements, and physical spindle
availability.
In short, the choice of RAID 5 versus RAID 10 involves a complicated mixture of
factors. There is no one-size-fits-all solution, as the choice of one over the other must be
dictated by everything from the I/O characteristics of the database, to business risk, to
worst case degraded-state throughput, to the number and type of disks present in the array
itself. Over the course of the life of a database, you may even see situations where RAID
5 is initially favored, but RAID 10 slowly becomes the better choice, and vice versa.
New RAID classification
In 1996, the RAID Advisory Board introduced an improved classification of RAID
systems. It divides RAID into three types: Failure-resistant disk systems (that protect
against data loss due to disk failure), failure-tolerant disk systems (that protect against
loss of data access due to failure of any single component), and disaster-tolerant disk
systems (that consist of two or more independent zones, either of which provides access
to stored data).
The original "Berkeley" RAID classifications are still kept as an important historical
reference point and also to recognize that RAID Levels 0-6 successfully define all known
data mapping and protection schemes for disk. Unfortunately, the original classification
caused some confusion due to the assumption that higher RAID levels imply higher
redundancy and performance. RAID system manufacturers exploited this confusion, which
gave rise to products with names such as RAID-7, RAID-10, RAID-30, and RAID-S.
The new system describes the data availability characteristics of
the RAID system rather than the details of its implementation.
The following list gives the criteria for all three classes of RAID:
- Failure-resistant disk systems (FRDS) (meets a minimum of criteria 1–6):
1. Protection against data loss and loss of access to data due to disk drive failure
2. Reconstruction of failed drive content to a replacement drive
3. Protection against data loss due to a "write hole"
4. Protection against data loss due to host and host I/O bus failure
5. Protection against data loss due to replaceable unit failure
6. Replaceable unit monitoring and failure indication
- Failure-tolerant disk systems (FTDS) (meets a minimum of criteria 7–15):
7. Disk automatic swap and hot swap
8. Protection against data loss due to cache failure
9. Protection against data loss due to external power failure
10. Protection against data loss due to a temperature out of operating range
11. Replaceable unit and environmental failure warning
12. Protection against loss of access to data due to device channel failure
13. Protection against loss of access to data due to controller module failure
14. Protection against loss of access to data due to cache failure
15. Protection against loss of access to data due to power supply failure
- Disaster-tolerant disk systems (DTDS) (meets a minimum of criteria 16–21):
16. Protection against loss of access to data due to host and host I/O bus failure
17. Protection against loss of access to data due to external power failure
18. Protection against loss of access to data due to component replacement
19. Protection against loss of data and loss of access to data due to multiple disk failure
20. Protection against loss of access to data due to zone failure
21. Long-distance protection against loss of data due to zone failure
Non-standard levels
Many configurations other than the basic numbered RAID levels are possible, and many
companies, organizations, and groups have created their own non-standard configurations,
in many cases designed to meet the specialised needs of a small niche group. Most
of these non-standard RAID levels are proprietary.
Storage Computer Corporation used to call a cached version of RAID 3 and 4,
RAID 7. Storage Computer Corporation is now defunct.
EMC Corporation used to offer RAID S as an alternative to RAID 5 on their
Symmetrix systems. Their latest generations of Symmetrix, the DMX and the VMax
series, do not support RAID S (instead they support RAID 1, RAID 5 and RAID 6).
The ZFS filesystem, available in Solaris, OpenSolaris and FreeBSD, offers RAID-Z,
which solves RAID 5's write hole problem.
Hewlett-Packard's Advanced Data Guarding (ADG) is a form of RAID 6.
NetApp's Data ONTAP uses RAID-DP (also referred to as "double", "dual", or
"diagonal" parity), a form of RAID 6 that, unlike many RAID 6 implementations,
does not use distributed parity as in RAID 5. Instead, two dedicated parity
disks with separate parity calculations are used. This is a modification of RAID 4
with an extra parity disk.
Accusys Triple Parity (RAID TP) implements three independent parities by
extending RAID 6 algorithms on its FC-SATA and SCSI-SATA RAID controllers
to tolerate three-disk failure.
Linux MD RAID10 (RAID 10) implements a general RAID driver that defaults to
a standard RAID 1 with 2 drives, and a standard RAID 1+0 with four drives, but
can have any number of drives, including odd numbers. MD RAID 10 can run
striped and mirrored, even with only two drives with the f2 layout (mirroring with
striped reads, giving the read performance of RAID 0; normal Linux software
RAID 1 does not stripe reads, but can read in parallel).
Infrant (now part of Netgear) X-RAID offers dynamic expansion of a RAID 5
volume without having to back up or restore the existing content. Larger drives
are added one at a time, letting the volume resync after each, until all drives are
installed. The resulting volume capacity is increased without user downtime. (This
is also possible in Linux using the mdadm utility, and has been possible in the EMC
Clariion and HP MSA arrays for several years.) The newer X-RAID2, found on x86
ReadyNAS models (that is, ReadyNAS with Intel CPUs), offers dynamic expansion of a
RAID 5 or RAID 6 volume without having to back up or restore the existing content
(note that X-RAID2 Dual Redundancy is not available on all x86 ReadyNAS models).
A major advantage over X-RAID is that X-RAID2 does not require replacing all the
disks to gain extra space: only two disks need to be replaced when using single
redundancy, or four disks when using dual redundancy, to get more redundant space.
BeyondRAID, created by Data Robotics and used in the Drobo series of products,
implements both mirroring and striping simultaneously or individually dependent
on disk and data context. It offers expandability without reconfiguration, the
ability to mix and match drive sizes and the ability to reorder disks. It supports
NTFS, HFS+, FAT32, and EXT3 file systems. It also uses thin provisioning to
allow for single volumes up to 16 TB depending on the host operating system.
Hewlett-Packard's EVA series arrays implement vRAID - vRAID-0, vRAID-1,
vRAID-5, and vRAID-6. The EVA allows drives to be placed in groups (called
Disk Groups) that form a pool of data blocks on top of which the RAID level is
implemented. Any Disk Group may have "virtual disks" or LUNs of any vRAID
type, including mixing vRAID types in the same Disk Group - a unique feature.
vRAID levels are more closely aligned to Nested RAID levels - vRAID-1 is
actually a RAID 1+0 (or RAID 10), vRAID-5 is actually a RAID 5+0 (or RAID
50), etc. Also, drives may be added on-the-fly to an existing Disk Group, and the
existing virtual disks data is redistributed evenly over all the drives, thereby
allowing dynamic performance and capacity growth.
IBM (among others) has implemented a RAID 1E (Level 1 Enhanced). With an
even number of disks it is similar to a RAID 10 array, but, unlike a RAID 10
array, it can also be implemented with an odd number of drives. In either case, the
total available disk space is n/2. It requires a minimum of three drives.
Hadoop has a RAID system that generates a parity file by XOR-ing a stripe of
blocks in a single HDFS file.
Data backup
A RAID system used as a main system disk is not intended as a replacement for backing
up data. In parity configurations it will provide a backup-like feature to protect from
catastrophic data loss caused by physical damage or errors on a single drive. Many other
features of backup systems cannot be provided by RAID arrays alone. The most notable
is the ability to restore an earlier version of data, which is needed to protect against
software errors causing unwanted data to be written to the disk, and to recover from user
error or malicious deletion. RAID can also be overwhelmed by catastrophic failure that
exceeds its recovery capacity and, of course, the entire array is at risk of physical damage
by fire, natural disaster, or human forces. RAID is also vulnerable to controller failure
since it is not always possible to migrate a RAID to a new controller without data loss.
RAID drives can serve as excellent backup drives when employed as removable backup
devices to main storage, and particularly when located offsite from the main systems.
However, the use of RAID as the only storage solution does not replace backups.
The distribution of data across multiple drives can be managed either by dedicated
hardware or by software. When done in software, the software may be part of the
operating system, or it may be part of the firmware and drivers supplied with the card.
Software-based RAID
Software implementations are now provided by many operating systems. A software
layer sits above the (generally block-based) disk device drivers and provides an
abstraction layer between the logical drives (RAIDs) and physical drives. The most
commonly supported levels are RAID 0 (striping across multiple drives for increased space
and performance) and RAID 1 (mirroring two drives), followed by RAID 1+0, RAID 0+1,
and RAID 5 (data striping with parity). New filesystems like btrfs may replace
traditional software RAID by providing striping and redundancy at the filesystem object
level.
Apple's Mac OS X Server and Mac OS X support RAID 0, RAID 1 and RAID
FreeBSD supports RAID 0, RAID 1, RAID 3, and RAID 5 and all layerings of the
above via GEOM modules and ccd, as well as supporting RAID 0, RAID 1,
RAID-Z, and RAID-Z2 (similar to RAID 5 and RAID 6 respectively), plus nested
combinations of those via ZFS.
Linux supports RAID 0, RAID 1, RAID 4, RAID 5, RAID 6 and all layerings of
the above, as well as "RAID10" (see above). Certain reshaping/resizing/expanding operations are also supported.
Microsoft's server operating systems support RAID 0, RAID 1, and RAID 5.
Some of the Microsoft desktop operating systems also support RAID: Windows XP
Professional, for example, supports RAID level 0 in addition to spanning
multiple disks, but only when using dynamic disks and volumes. Windows XP
can be made to support RAID 0, 1, and 5 with a simple file patch. RAID functionality
in Windows is slower than hardware RAID, but allows a RAID array to be moved to
another machine with no compatibility issues.
NetBSD supports RAID 0, RAID 1, RAID 4 and RAID 5 (and any nested
combination of those like 1+0) via its software implementation, named
OpenBSD aims to support RAID 0, RAID 1, RAID 4 and RAID 5 via its software
implementation softraid.
Solaris ZFS supports ZFS equivalents of RAID 0, RAID 1, RAID 5 (RAID Z),
RAID 6 (RAID Z2), and a triple parity version RAID Z3, and any nested
combination of those like 1+0. Note that RAID Z/Z2/Z3 solve the RAID 5/6 write
hole problem and are therefore particularly suited to software implementation
without the need for battery backed cache (or similar) support. The boot
filesystem is limited to RAID 1.
Solaris SVM supports RAID 1 for the boot filesystem, and adds RAID 0 and
RAID 5 support (and various nested combinations) for data drives.
Linux and Windows FlexRAID is a snapshot RAID implementation.
HP's OpenVMS provides a form of RAID 1 called "Volume shadowing", giving
the possibility to mirror data locally and at remote cluster systems.
Software RAID has advantages and disadvantages compared to hardware RAID. The
software must run on a host server attached to storage, and that server's processor must
dedicate processing time to run the RAID software. The additional processing capacity
required for RAID 0 and RAID 1 is low, but parity-based arrays require more complex
data processing during write or integrity-checking operations. As the rate of data
processing increases with the number of disks in the array, so does the processing
requirement. Furthermore, all the buses between the processor and the disk controller
must carry the extra data required by RAID, which may cause congestion.
Over the history of hard disk drives, the increase in speed of commodity CPUs has been
consistently greater than the increase in hard disk drive throughput. Thus, over time,
the percentage of host CPU time required to saturate a given number of hard disk drives
has been dropping. For example, the Linux software md RAID subsystem is capable of
calculating parity information at 6 GB/s (100% usage of a single core on a 2.1 GHz Intel
"Core2" CPU as of Linux v2.6.26). A three-drive RAID 5 array using hard disks capable of
sustaining a write of 100 MB/s will require parity to be calculated at the rate of
200 MB/s. This will require the resources of just over 3% of a single CPU core during
write operations (parity does not need to be calculated for read operations on a RAID 5
array, unless a drive has failed).
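The arithmetic behind the "just over 3%" figure can be checked with a short calculation. This sketch assumes that 6 GB/s means 6000 MB/s and that CPU cost scales linearly with the amount of data protected; `parity_cpu_fraction` is a hypothetical helper, not part of the Linux md subsystem.

```python
def parity_cpu_fraction(drives: int, drive_write_mb_s: float,
                        parity_mb_s: float) -> float:
    """Fraction of one core needed to compute RAID 5 parity at full write speed."""
    # In an n-drive RAID 5, one drive's worth of each stripe holds parity,
    # so user data arrives at (n - 1) times the per-drive write rate.
    data_rate = (drives - 1) * drive_write_mb_s
    return data_rate / parity_mb_s


# Figures from the text: three drives at 100 MB/s, parity computed at 6 GB/s.
fraction = parity_cpu_fraction(drives=3, drive_write_mb_s=100, parity_mb_s=6000)
print(f"{fraction:.1%} of one core")
```

Changing `drives` or `drive_write_mb_s` shows how the CPU share grows with array size and drive speed under the same linear assumption.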
Software RAID implementations may employ more sophisticated algorithms than
hardware RAID implementations (for instance with respect to disk scheduling and
command queueing), and thus may be capable of increased performance.
Another concern with operating system-based RAID is the boot process. It can be
difficult or impossible to set up the boot process such that it can fall back to another drive
if the usual boot drive fails. Such systems can require manual intervention to make the
machine bootable again after a failure. There are exceptions to this: the LILO
bootloader for Linux, the loader for FreeBSD, and some configurations of the GRUB
bootloader natively understand RAID 1 and can load a kernel. If the BIOS recognizes a
broken first disk and refers bootstrapping to the next disk, such a system will come up
without intervention, but the BIOS might or might not do that as intended. A hardware
RAID controller typically has explicit programming to decide that a disk is broken and
fall through to the next disk.
Hardware RAID controllers can also carry battery-powered cache memory. For data
safety in modern systems, the user of software RAID might need to turn off the write-back
cache on the disk (although some drives have their own battery or capacitors protecting
the write-back cache, are backed by a UPS, and/or implement atomicity in various ways).
Turning off the write cache has a performance penalty that can be significant, depending
on the workload and on how well command queuing is supported in the disk system. The
battery-backed cache on a RAID controller is one way to have a safe write-back cache.
Finally, operating system-based RAID usually uses formats specific to the operating
system in question, so it cannot generally be used for partitions that are shared between
operating systems as part of a multi-boot setup. However, this allows RAID disks to be
moved from one computer to another computer with an operating system or file system of
the same type, which can be more difficult when using hardware RAID. For example, when
one computer uses a hardware RAID controller from one manufacturer and another computer
uses a controller from a different manufacturer, drives typically cannot be interchanged.
Likewise, if the hardware controller 'dies' before the disks do, data may become
unrecoverable unless a hardware controller of the same type is obtained, unlike with
firmware-based or software-based RAID.
Most operating system-based implementations allow RAIDs to be created from partitions
rather than entire physical drives. For instance, an administrator could divide an odd
number of disks into two partitions per disk, mirror partitions across disks and stripe a
volume across the mirrored partitions to emulate IBM's RAID 1E configuration. Using
partitions in this way also allows mixing reliability levels on the same set of disks. For
example, one could have a very robust RAID 1 partition for important files, and a less
robust RAID 5 or RAID 0 partition for less important data. (Some BIOS-based
controllers offer similar features, e.g. Intel Matrix RAID.) Using two partitions on the
same drive in the same RAID is, however, dangerous. For example, having all partitions of
a RAID 1 on the same drive will make all the data inaccessible if that single drive
fails. Similarly, in a RAID 5 array composed of four drives (250 + 250 + 250 + 500 GB),
with the 500 GB drive split into two 250 GB partitions, a failure of this drive will
remove two partitions from the array, causing all of the data held on the array to be lost.
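A layout such as the four-drive RAID 5 example can be checked mechanically before the array is built. The following is a minimal sketch: the device names (`sda` to `sdd`) and the function `shared_drives` are hypothetical and are not tied to any particular RAID tool.

```python
from collections import Counter


def shared_drives(members):
    """Return physical drives that contribute more than one member to an array.

    `members` is a list of (physical_drive, partition_number) tuples.
    """
    counts = Counter(drive for drive, _partition in members)
    return sorted(drive for drive, n in counts.items() if n > 1)


# Three 250 GB drives with one partition each, plus a 500 GB drive (sdd)
# split into two 250 GB partitions that are both placed in the same array.
raid5_members = [("sda", 1), ("sdb", 1), ("sdc", 1), ("sdd", 1), ("sdd", 2)]

risky = shared_drives(raid5_members)
print(risky)
```

Any drive reported here is a single point of failure that removes more than one member at once, which is more than a single-parity array can tolerate.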
Hardware-based RAID
Hardware RAID controllers use different, proprietary disk layouts, so it is not usually
possible to span controllers from different manufacturers. They do not require processor
resources, the BIOS can boot from them, and tighter integration with the device driver
may offer better error handling.
A hardware implementation of RAID requires at least a special-purpose RAID controller.
On a desktop system this may be a PCI expansion card, PCI-e expansion card or built
into the motherboard. Controllers supporting most types of drive may be used –
IDE/ATA, SATA, SCSI, SSA, Fibre Channel, sometimes even a combination. The
controller and disks may be in a stand-alone disk enclosure, rather than inside a
computer. The enclosure may be directly attached to a computer, or connected via SAN.
The controller hardware handles the management of the drives, and performs any parity
calculations required by the chosen RAID level.
Most hardware implementations provide a read/write cache, which, depending on the I/O
workload, will improve performance. In most systems the write cache is non-volatile (i.e.
battery-protected), so pending writes are not lost on a power failure.
Hardware implementations provide guaranteed performance, add no overhead to the local
CPU complex and can support many operating systems, as the controller simply presents
a logical disk to the operating system.
Hardware implementations also typically support hot swapping, allowing failed drives to
be replaced while the system is running.
However, inexpensive hardware RAID controllers can be slower than software RAID because
the dedicated CPU on the controller card is not as fast as the CPU in the
computer/server. More expensive RAID controllers have faster CPUs capable of higher
throughput, and do not exhibit this slowness.
Firmware/driver-based RAID
Operating system-based RAID doesn't always protect the boot process and is generally
impractical on desktop versions of Windows (as described above). Hardware RAID
controllers are expensive and proprietary. To fill this gap, cheap "RAID controllers" were
introduced that do not contain a RAID controller chip, but simply a standard disk
controller chip with special firmware and drivers. During early stage bootup the RAID is
implemented by the firmware; when a protected-mode operating system kernel such as
Linux or a modern version of Microsoft Windows is loaded the drivers take over.
These controllers are described by their manufacturers as RAID controllers, and it is
rarely made clear to purchasers that the burden of RAID processing is borne by the host
computer's central processing unit, not the RAID controller itself, thus introducing the
aforementioned CPU overhead from which hardware controllers don't suffer. Firmware
controllers often can only use certain types of hard drives in their RAID arrays (e.g.
SATA for Intel Matrix RAID), as there is neither SCSI nor PATA support in modern
Intel ICH southbridges; however, motherboard makers implement RAID controllers
outside of the southbridge on some motherboards. Before their introduction, a "RAID
controller" implied that the controller did the processing, and the new type has become
known by some as "fake RAID" even though the RAID itself is implemented correctly.
Adaptec calls them "HostRAID". Various Linux distributions will refuse to work with
"fake RAID".
Network-attached storage
While not directly associated with RAID, Network-attached storage (NAS) is an
enclosure containing disk drives and the equipment necessary to make them available
over a computer network, usually Ethernet. The enclosure is basically a dedicated
computer in its own right, designed to operate over the network without screen or
keyboard. It contains one or more disk drives; multiple drives may be configured as a
RAID.
Hot spares
Both hardware and software RAIDs with redundancy may support the use of hot spare
drives. A hot spare is a drive physically installed in the array which is inactive until
an active drive fails, at which point the system automatically replaces the failed drive
with the spare, rebuilding the array with the spare drive included. This reduces the mean
time to recovery (MTTR),
though it doesn't eliminate it completely. Subsequent additional failure(s) in the same
RAID redundancy group before the array is fully rebuilt can result in loss of the data;
rebuilding can take several hours, especially on busy systems.
Rapid replacement of failed drives is important as the drives of an array will all have had
the same amount of use, and may tend to fail at about the same time rather than
randomly. RAID 6 without a spare uses the same number of drives as RAID 5 with a hot
spare and protects data against simultaneous failure of up to two drives, but requires a
more advanced RAID controller. Further, a hot spare can be shared by multiple RAID
sets.
Reliability terms
Failure rate
Two different kinds of failure rates are applicable to RAID systems. Logical
failure is defined as the loss of a single drive and its rate is equal to the sum of
individual drives' failure rates. System failure is defined as loss of data and its rate
will depend on the type of RAID. For RAID 0 this is equal to the logical failure
rate, as there is no redundancy. For other types of RAID, it will be less than the
logical failure rate, potentially approaching zero, and its exact value will depend
on the type of RAID, the number of drives employed, and the vigilance and
alacrity of its human administrators.
Mean time to data loss (MTTDL)
In this context, the average time before a loss of data in a given array. Mean time
to data loss of a given RAID may be higher or lower than that of its constituent
hard drives, depending upon what type of RAID is employed. The referenced
report assumes times to data loss are exponentially distributed. This means 63.2%
of all data loss will occur between time 0 and the MTTDL.
Mean time to recovery (MTTR)
In arrays that include redundancy for reliability, this is the time following a failure
to restore an array to its normal failure-tolerant mode of operation. This includes
time to replace a failed disk mechanism as well as time to re-build the array (i.e.
to replicate data for redundancy).
Unrecoverable bit error rate (UBE)
This is the rate at which a disk drive will be unable to recover data after
application of cyclic redundancy check (CRC) codes and multiple retries.
Write cache reliability
Some RAID systems use RAM write cache to increase performance. A power
failure can result in data loss unless this sort of disk buffer is supplemented with a
battery to ensure that the buffer has enough time to write from RAM back to disk.
Atomic write failure
Also known by various terms such as torn writes, torn pages, incomplete writes,
interrupted writes, non-transactional, etc.
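Two of the terms above lend themselves to a quick numerical check: the logical failure rate as a sum of per-drive rates, and the 63.2% figure that follows from exponentially distributed times to data loss. The sketch below uses made-up drive rates and an arbitrary MTTDL purely for illustration; the function names are hypothetical.

```python
import math


def logical_failure_rate(drive_rates):
    """Rate at which some drive in the array fails: the sum of the drive rates."""
    return sum(drive_rates)


def fraction_lost_by(t, mttdl):
    """P(data loss by time t) when time to data loss is exponential with mean mttdl."""
    return 1.0 - math.exp(-t / mttdl)


# Hypothetical array: five drives at 0.02 failures per year each.
rates = [0.02] * 5
array_rate = logical_failure_rate(rates)

# At t equal to the MTTDL the result is 1 - 1/e, whatever the MTTDL is.
at_mttdl = fraction_lost_by(t=10.0, mttdl=10.0)

print(f"logical failure rate = {array_rate:.2f} per year")
print(f"lost by MTTDL        = {at_mttdl:.1%}")
```

Because the result at t = MTTDL is always 1 - 1/e, the 63.2% figure does not depend on the particular MTTDL value chosen.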
Problems with RAID
Correlated failures
The theory behind the error correction in RAID assumes that failures of drives are
independent. Given these assumptions it is possible to calculate how often they can fail
and to arrange the array to make data loss arbitrarily improbable.
In practice, the drives are often the same age, with similar wear. Since many drive
failures are due to mechanical issues which are more likely on older drives, this violates
the independence assumption, and failures are in fact statistically correlated. The
chance of a second failure before the first has been recovered is therefore not nearly as
small as might be supposed, and data loss can occur at significant rates.
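To see what the independence assumption implies, the sketch below estimates the chance that a second drive fails while the first is still being rebuilt, assuming independent, exponentially distributed failures. All numbers are hypothetical, and the tenfold rate increase is only a crude stand-in for correlated wear-out, not a model taken from the studies mentioned in this section.

```python
import math


def p_second_failure(remaining_drives, annual_failure_rate, rebuild_hours):
    """P(at least one remaining drive fails during the rebuild window),
    assuming independent, exponentially distributed drive failures."""
    hours_per_year = 24 * 365
    rate_per_hour = remaining_drives * annual_failure_rate / hours_per_year
    return 1.0 - math.exp(-rate_per_hour * rebuild_hours)


# Hypothetical array: 7 surviving drives, 3% annual failure rate, 12 h rebuild.
baseline = p_second_failure(remaining_drives=7, annual_failure_rate=0.03,
                            rebuild_hours=12)

# Crude stand-in for correlation: same-age drives failing at 10x the rate.
aged = p_second_failure(remaining_drives=7, annual_failure_rate=0.30,
                        rebuild_hours=12)

print(f"independent, nominal rate: {baseline:.4%}")
print(f"elevated (aged) rate:      {aged:.4%}")
```

The independent-failure estimate is very small, which is why designs based on it look safe on paper; raising the effective failure rate of the surviving drives raises the risk almost proportionally.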
A common misconception is that "server-grade" drives fail less frequently than
consumer-grade drives. Two independent studies, one by Carnegie Mellon University and
the other by Google, have shown that the "grade" of the drive does not relate to failure
rate.
Atomic write failure is a little-understood and rarely mentioned failure mode for redundant storage
systems that do not utilize transactional features. Database researcher Jim Gray wrote
"Update in Place is a Poison Apple" during the early days of relational database
commercialization. However, this warning largely went unheeded and fell by the wayside
upon the advent of RAID, which many software engineers mistook as solving all data
storage integrity and reliability problems. Many software programs update a storage
object "in-place"; that is, they write a new version of the object on to the same disk
addresses as the old version of the object. While the software may also log some delta
information elsewhere, it expects the storage to present "atomic write semantics,"
meaning that the write of the data either occurred in its entirety or did not occur at all.
However, very few storage systems provide support for atomic writes, and even fewer
specify their rate of failure in providing this semantic. Note that during the act of writing
an object, a RAID storage device will usually be writing all redundant copies of the
object in parallel, although overlapped or staggered writes are more common when a
single RAID processor is responsible for multiple drives. Hence an error that occurs
during the process of writing may leave the redundant copies in different states, and
furthermore may leave the copies in neither the old nor the new state. The little known
failure mode is that delta logging relies on the original data being either in the old or the
new state so as to enable backing out the logical change, yet few storage systems provide
an atomic write semantic on a RAID disk.
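The effect of an interrupted, non-atomic write can be illustrated on a single parity-protected stripe. In this sketch (one-byte blocks and an arbitrary new value, chosen only for illustration) the data block reaches the disk but the matching parity update does not, which is the situation usually called the "write hole".

```python
def parity_of(blocks):
    """XOR parity across the data blocks of one stripe."""
    result = 0
    for block in blocks:
        result ^= block
    return result


stripe = [0b00101010, 0b10001110, 0b11110111]
parity = parity_of(stripe)
original = stripe[2]

# Interrupted update: the new data reaches disk, the matching parity does not.
stripe[0] = 0b11111111
stale_parity = parity  # never rewritten because the write was cut short

# Later the drive holding the third block fails and is rebuilt from the rest.
rebuilt = stripe[0] ^ stripe[1] ^ stale_parity

print(f"original = {original:08b}")
print(f"rebuilt  = {rebuilt:08b}")
print(rebuilt == original)
```

The rebuilt block differs from the one that was lost even though that block was never written to, so the inconsistency stays hidden until the redundancy is actually needed.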
While the battery-backed write cache may partially solve the problem, it is applicable
only to a power failure scenario.
Since transactional support is not universally present in hardware RAID, many operating
systems include transactional support to protect against data loss during an interrupted
write. Novell Netware, starting with version 3.x, included a transaction tracking system.
Microsoft introduced transaction tracking via the journaling feature in NTFS. Ext4 has
journaling with checksums; ext3 has journaling without checksums but an "append-only"
option, or ext3COW (Copy on Write). If the journal itself becomes corrupted, however,
recovery can be problematic. The NetApp WAFL file system provides
atomicity by never updating data in place, as does ZFS. An alternative to
journaling is soft updates, which are used in some BSD-derived systems' implementations
of UFS.
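The idea of never updating data in place can also be applied at the application level. The following is a minimal sketch, assuming a POSIX-like filesystem on which a rename within one directory is atomic; the function name atomic_write is hypothetical, and the durability actually achieved still depends on the filesystem and on the write cache behaviour of the hardware underneath.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the file at `path` with `data` without overwriting it in place."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new version to a temporary file in the same directory,
    # so the final rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the data to the device
        # Swap the new version into place; readers see the old or the new file.
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the old version untouched if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

An interruption before the final rename leaves the old version intact, which is the property that delta logging relies on.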
This can present as a sector read failure. Some RAID implementations protect against this
failure mode by remapping the bad sector, using the redundant data to retrieve a good
copy of the data, and rewriting that good data to the newly mapped replacement sector.
The UBE (Unrecoverable Bit Error) rate is typically specified at 1 bit in 10^15 for
enterprise class disk drives (SCSI, FC, SAS) and 1 bit in 10^14 for desktop class disk
drives (IDE/ATA/PATA, SATA). Increasing disk capacities and large RAID 5
redundancy groups have led to an increasing inability to successfully rebuild a RAID
group after a disk failure because an unrecoverable sector is found on the remaining
drives. Double protection schemes such as RAID 6 are attempting to address this issue,
but suffer from a very high write penalty.
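To see why large arrays are affected, the specified UBE rate can be turned into a rough probability of hitting at least one unrecoverable error while reading the surviving drives during a rebuild. The sketch below assumes independent bit errors occurring exactly at the specified rate, which is a simplification; the 12 TB figure is a hypothetical example, not a value from the text.

```python
import math


def p_unrecoverable_read(bytes_read: float, ube_rate: float) -> float:
    """Probability of at least one unrecoverable bit error while reading
    `bytes_read` bytes, assuming independent errors at `ube_rate` per bit."""
    bits = bytes_read * 8
    # 1 - (1 - p)^n, computed stably for tiny p via log1p/expm1.
    return -math.expm1(bits * math.log1p(-ube_rate))


TB = 1e12  # decimal terabyte
# Hypothetical rebuild that must read 12 TB from the surviving drives.
desktop = p_unrecoverable_read(12 * TB, 1e-14)
enterprise = p_unrecoverable_read(12 * TB, 1e-15)
print(f"desktop-class:    {desktop:.2f}")
print(f"enterprise-class: {enterprise:.2f}")
```

Under these assumptions the chance of an unrecoverable read grows quickly with the amount of data that has to be read back, which is the effect described above.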
Write cache reliability
The disk system can acknowledge the write operation as soon as the data is in the cache,
not waiting for the data to be physically written. This typically occurs in old, non-journaled
systems such as FAT32, or if the Linux/Unix "writeback" option is chosen
without any protections like the "soft updates" option (to promote I/O speed whilst
trading-away data reliability). A power outage or system hang such as a BSOD can mean
a significant loss of any data queued in such a cache.
Often a battery is protecting the write cache, mostly solving the problem. If a write fails
because of power failure, the controller may complete the pending writes as soon as
restarted. This solution still has potential failure cases: the battery may have worn out, the
power may be off for too long, the disks could be moved to another controller, the
controller itself could fail. Some disk systems provide the capability of testing the battery
periodically; however, this leaves the system without a fully charged battery for several
An additional concern about write cache reliability exists, specifically regarding devices
equipped with a write-back cache—a caching system which reports the data as written as
soon as it is written to cache, as opposed to the non-volatile medium. The safer cache
technique is write-through, which reports transactions as written when they are written to
the non-volatile medium.
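The difference between the two caching policies can be illustrated with a toy model. The class ToyDisk below is purely hypothetical and does not correspond to any real driver or controller interface; it only shows when the acknowledgement is issued relative to the update of the non-volatile medium.

```python
class ToyDisk:
    """Toy model of a disk with a volatile cache (not a real driver API)."""

    def __init__(self, write_back: bool):
        self.write_back = write_back
        self.cache = {}   # volatile: lost on power failure
        self.media = {}   # non-volatile medium

    def write(self, block: int, data: bytes) -> str:
        if self.write_back:
            # Acknowledge as soon as the data is in the cache.
            self.cache[block] = data
        else:
            # Write-through: acknowledge only after the medium is updated.
            self.media[block] = data
        return "ack"

    def flush(self) -> None:
        # Move cached writes to the medium.
        self.media.update(self.cache)
        self.cache.clear()

    def power_failure(self) -> None:
        # Without a battery, the cache contents simply vanish.
        self.cache.clear()
```

In this model both policies return the same acknowledgement, but after a power failure only the write-through disk is guaranteed to hold the acknowledged data.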
Equipment compatibility
The methods used to store data by various RAID controllers are not necessarily
compatible, so that it may not be possible to read a RAID array on different hardware,
with the exception of RAID 1, which is typically represented as plain identical copies of
the original data on each disk. Consequently a non-disk hardware failure may require the
use of identical hardware to recover the data, and furthermore an identical configuration
has to be reassembled without triggering a rebuild and overwriting the data. Software
RAID however, such as implemented in the Linux kernel, alleviates this concern, as the
setup is not hardware dependent, but runs on ordinary disk controllers, and allows the
reassembly of an array. Additionally, individual RAID1 disks (software, and most
hardware implementations) can be read like normal disks when removed from the array,
so no RAID system is required to retrieve the data. Inexperienced data recovery firms
typically have a difficult time recovering data from RAID drives, with the exception of
RAID1 drives with conventional data structure.
Data recovery in the event of a failed array
With larger disk capacities the odds of a disk failure during rebuild are not negligible. In
that event the difficulty of extracting data from a failed array must be considered. Only
RAID 1 stores all data on each disk. Although it may depend on the controller, some
RAID 1 disks can be read as a single conventional disk. This means a dropped RAID 1
disk, although damaged, can often be reasonably easily recovered using a software
recovery program. If the damage is more severe, data can often be recovered by
professional data recovery specialists. RAID 5 and other striped or distributed arrays
present much more formidable obstacles to data recovery in the event the array fails.
Drive error recovery algorithms
Many modern drives have internal error recovery algorithms that can take upwards of a
minute to recover and re-map data that the drive fails to easily read. Many RAID
controllers will drop a non-responsive drive in 8 seconds or so. This can cause the array
to drop a good drive because it has not been given enough time to complete its internal
error recovery procedure, leaving the rest of the array vulnerable. So-called enterprise
class drives limit the error recovery time and prevent this problem, but desktop drives can
be quite risky for this reason. A fix specific to Western Digital drives used to be known: a
utility called WDTLER.exe could limit the error recovery time of a Western Digital
desktop drive so that it would not be dropped from the array for this reason. The utility
enabled TLER (time limited error recovery) which limits the error recovery time to 7
seconds. As of October 2009 Western Digital has locked out this feature in their desktop
drives such as the Caviar Black. Western Digital enterprise class drives are shipped from
the factory with TLER enabled to prevent being dropped from RAID arrays. Similar
technologies are used by Seagate, Samsung, and Hitachi.
As of late 2010, support for ATA Error Recovery Control configuration has been added
to the Smartmontools program, so it now allows configuring many desktop class hard
drives for use on a RAID controller.
Increasing recovery time
Drive capacity has grown at a much faster rate than transfer speed, and error rates have
only fallen a little in comparison. Therefore, larger capacity drives may take hours, if not
days, to rebuild. The rebuild is slowed further if the array remains in operation at
reduced capacity while it is rebuilt. Given a RAID array with only one disk of redundancy (RAIDs 3, 4,
and 5), a second failure would cause complete failure of the array. Even though
individual drives' mean time between failures (MTBF) has increased over time, this
increase has not kept pace with the increased storage capacity of the drives. The time to
rebuild the array after a single disk failure, as well as the chance of a second failure
during a rebuild, have increased over time.
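A back-of-the-envelope calculation makes the trend concrete. The sketch below assumes that a rebuild must rewrite the whole replacement drive at a constant transfer rate, and that the surviving drives fail independently with exponentially distributed lifetimes; both are simplifying assumptions, and all numbers passed to the functions would be hypothetical inputs rather than measured values.

```python
import math


def rebuild_hours(capacity_bytes: float, rate_bytes_per_s: float) -> float:
    """Lower bound on rebuild time: the whole drive is rewritten once."""
    return capacity_bytes / rate_bytes_per_s / 3600


def p_second_failure(surviving_drives: int, mtbf_hours: float,
                     window_hours: float) -> float:
    """Chance that at least one surviving drive fails within the window,
    assuming independent, exponentially distributed failures."""
    return -math.expm1(-surviving_drives * window_hours / mtbf_hours)
```

For example, rebuild_hours(1e12, 1e8) models a 1 TB drive rebuilt at 100 MB/s; multiplying the capacity by ten at the same transfer rate multiplies the rebuild window, and with it the exposure to a second failure, by roughly ten.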
Operator skills, correct operation
In order to provide the desired protection against physical drive failure, a RAID array
must be properly set up and maintained by an operator with sufficient knowledge of the
chosen RAID configuration, array controller (hardware or software), failure detection and
recovery. Unskilled handling of the array at any stage may exacerbate the consequences
of a failure, and result in downtime and full or partial loss of data that might otherwise be
Particularly, the array must be monitored, and any failures detected and dealt with
promptly. Failure to do so will result in the array continuing to run in a degraded state,
vulnerable to further failures. Ultimately more failures may occur, until the entire array
becomes inoperable, resulting in data loss and downtime. In this case, any protection the
array may provide merely delays this.
The operator must know how to detect failures or verify healthy state of the array,
identify which drive failed, have replacement drives available, and know how to replace a
drive and initiate a rebuild of the array.
Other problems
While RAID may protect against physical drive failure, the data is still exposed to
operator, software, hardware and virus destruction. Many studies cite operator fault as the
most common source of malfunction, such as a server operator replacing the incorrect
disk in a faulty RAID array, and disabling the system (even temporarily) in the process.
Most well-designed systems include separate backup systems that hold copies of the data,
but don't allow much interaction with it. Most copy the data and remove the copy from
the computer for safe storage.
Norman Ken Ouchi at IBM was awarded a 1978 U.S. patent 4,092,732 titled "System for
recovering data stored in failed memory unit." The claims for this patent describe what
would later be termed RAID 5 with full stripe writes. This 1978 patent also mentions that
disk mirroring or duplexing (what would later be termed RAID 1) and protection with
dedicated parity (that would later be termed RAID 4) were prior art at that time.
The term RAID was first defined by David A. Patterson, Garth A. Gibson and Randy
Katz at the University of California, Berkeley, in 1987. They studied the possibility of
using two or more drives to appear as a single device to the host system and published a
paper: "A Case for Redundant Arrays of Inexpensive Disks (RAID)" in June 1988 at the
SIGMOD conference.
This specification suggested a number of prototype RAID levels, or combinations of
drives. Each had theoretical advantages and disadvantages. Over the years, different
implementations of the RAID concept have appeared. Most differ substantially from the
original idealized RAID levels, but the numbered names have remained. This can be
confusing, since one implementation of RAID 5, for example, can differ substantially
from another. RAID 3 and RAID 4 are often confused and even used interchangeably.
One of the early uses of RAID 0 and 1 was the Crosfield Electronics Studio 9500 page
layout system based on the Python workstation. The Python workstation was a Crosfield
managed international development using PERQ 3B electronics, benchMark Technology's Viper display system and Crosfield's own RAID and fibre-optic network controllers. RAID 0 was particularly important to these workstations as it dramatically sped
up image manipulation for the pre-press markets. Volume production started in
Peterborough, England in early 1987.
Vinum is a logical volume manager, also called Software RAID, allowing implementations of the RAID-0, RAID-1 and RAID-5 models, both individually and in
combination. Vinum is part of the base distribution of the FreeBSD operating system.
Versions exist for NetBSD, OpenBSD and DragonFly BSD. Vinum source code is
currently maintained in the FreeBSD source tree. Vinum supports raid levels 0, 1, 5, and
JBOD. Vinum is invoked as "gvinum" on FreeBSD version 5.4 and up.
Software RAID vs. Hardware RAID
The distribution of data across multiple disks can be managed by either dedicated
hardware or by software. Additionally, there are hybrid RAIDs that are partly software-
and partly hardware-based solutions.
With a software implementation, the operating system manages the disks of the array
through the normal drive controller (ATA, SATA, SCSI, Fibre Channel, etc.). With
present CPU speeds, software RAID can be faster than hardware RAID.
A hardware implementation of RAID requires at a minimum a special-purpose RAID
controller. On a desktop system, this may be a PCI expansion card, or might be a
capability built in to the motherboard. In larger RAIDs, the controller and disks are
usually housed in an external multi-bay enclosure. This controller handles the
management of the disks, and performs parity calculations (needed for many RAID
levels). This option tends to provide better performance, and makes operating system
support easier.
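For the single-parity RAID levels, the parity calculation is a byte-wise XOR across the blocks of a stripe. The following is a minimal sketch of that idea only; the block contents and the helper name xor_blocks are illustrative, and a real controller works on whole sectors and handles stripe layout, caching and error handling as well.

```python
from functools import reduce
from typing import Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"RAID", b"disk", b"test"]   # one stripe across three data drives
parity = xor_blocks(data)            # stored on the parity drive

# The drive holding data[1] fails: XOR of the survivors and the parity
# reproduces its contents.
recovered = xor_blocks([data[0], data[2], parity])
```

Because XOR is its own inverse, any single missing block of the stripe can be rebuilt from the remaining blocks and the parity.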
Hardware implementations also typically support hot swapping, allowing failed drives to
be replaced while the system is running. In rare cases hardware controllers have become
faulty, which can result in data loss. Hybrid RAIDs have become very popular with the
introduction of inexpensive hardware RAID controllers. The hardware is a normal disk
controller that has no RAID features, but there is a boot-time application that allows users
to set up RAIDs that are controlled via the BIOS. When any modern operating system is
used, it will need specialized RAID drivers that will make the array look like a single
block device. Since these controllers actually do all calculations in software, not
hardware, they are often called "fakeraids". Unlike software RAID, these "fakeraids"
typically cannot span multiple controllers.
Example configuration. A simple example to mirror drive enterprise to drive excelsior:
drive enterprise device /dev/da1s1d
drive excelsior device /dev/da2s1d
volume mirror
  plex org concat
    sd length 512m drive enterprise
  plex org concat
    sd length 512m drive excelsior
Non-RAID drive architectures
Non-RAID drive architectures also exist, and are often referred to, similarly to RAID, by
standard acronyms, several tongue-in-cheek. A single drive is referred to as a SLED
(Single Large Expensive Drive), by contrast with RAID, while an array of drives without
any additional control (accessed simply as independent drives) is referred to as a JBOD
(Just a Bunch Of Disks). Simple concatenation is referred to as SPAN, or sometimes as
JBOD, though this latter is proscribed in careful use, due to the alternative meaning just
Chapter- 9
System Engineering
Systems engineering techniques are used in complex projects: spacecraft design,
computer chip design, robotics, software integration, and bridge building. Systems
engineering uses a host of tools that include modeling and simulation, requirements
analysis and scheduling to manage complexity.
Systems engineering is an interdisciplinary field of engineering that focuses on how
complex engineering projects should be designed and managed over the life cycle of the
project. Issues such as logistics, the coordination of different teams, and automatic
control of machinery become more difficult when dealing with large, complex projects.
Systems engineering deals with work-processes and tools to handle such projects, and it
overlaps with both technical and human-centered disciplines such as control engineering,
industrial engineering, organizational studies, and project management.
QFD House of Quality for Enterprise Product Development Processes
The term systems engineering can be traced back to Bell Telephone Laboratories in the
1940s. The need to identify and manipulate the properties of a system as a whole, which
in complex engineering projects may greatly differ from the sum of the parts' properties,
motivated the Department of Defense, NASA, and other industries to apply the
When it was no longer possible to rely on design evolution to improve upon a system and
the existing tools were not sufficient to meet growing demands, new methods began to be
developed that addressed the complexity directly. The evolution of systems engineering,
which continues to this day, comprises the development and identification of new
methods and modeling techniques. These methods aid in better comprehension of
engineering systems as they grow more complex. Popular tools that are often used in the
systems engineering context were developed during these times, including USL, UML,
QFD, and IDEF0.
In 1990, a professional society for systems engineering, the National Council on Systems
Engineering (NCOSE), was founded by representatives from a number of U.S. corporations and organizations. NCOSE was created to address the need for improvements in
systems engineering practices and education. As a result of growing involvement from
systems engineers outside of the U.S., the name of the organization was changed to the
International Council on Systems Engineering (INCOSE) in 1995. Schools in several
countries offer graduate programs in systems engineering, and continuing education
options are also available for practicing engineers.
Systems engineering signifies both an approach and, more recently, a discipline in
engineering. The aim of education in systems engineering is simply to formalize the
approach and, in doing so, to identify new methods and research opportunities similar to the
way it occurs in other fields of engineering. As an approach, systems engineering is
holistic and interdisciplinary in flavour.
Origins and traditional scope
The traditional scope of engineering embraces the design, development, production and
operation of physical systems, and systems engineering, as originally conceived, falls
within this scope. "Systems engineering", in this sense of the term, refers to the
distinctive set of concepts, methodologies, organizational structures (and so on) that have
been developed to meet the challenges of engineering functional physical systems of
unprecedented complexity. The Apollo program is a leading example of a systems
engineering project.
The use of the term "system engineer" has evolved over time to embrace a wider, more
holistic concept of "systems" and of engineering processes. This evolution of the
definition has been a subject of ongoing controversy and the term continues to be applied
to both the narrower and broader scope.
Holistic view
Systems engineering focuses on analyzing and eliciting customer needs and required
functionality early in the development cycle, documenting requirements, then proceeding
with design synthesis and system validation while considering the complete problem, the
system lifecycle. Oliver et al. claim that the systems engineering process can be
decomposed into
a Systems Engineering Technical Process, and
a Systems Engineering Management Process.
Within Oliver's model, the goal of the Management Process is to organize the technical
effort in the lifecycle, while the Technical Process includes assessing available
information, defining effectiveness measures, creating a behavior model, creating a structure
model, performing trade-off analysis, and creating a sequential build and test plan.
Although several models are used in industry, depending on the application,
all of them aim to identify the relation between the various stages mentioned
above and incorporate feedback. Examples of such models include the Waterfall model
and the VEE model.
Interdisciplinary field
System development often requires contribution from diverse technical disciplines. By
providing a systems (holistic) view of the development effort, systems engineering helps
mold all the technical contributors into a unified team effort, forming a structured
development process that proceeds from concept to production to operation and, in some
cases, to termination and disposal.
This perspective is often replicated in educational programs in that systems engineering
courses are taught by faculty from other engineering departments which, in effect, helps
create an interdisciplinary environment.
Managing complexity
The need for systems engineering arose with the increase in complexity of systems and
projects. When speaking in this context, complexity incorporates not only engineering
systems, but also the logical human organization of data. At the same time, a system can
become more complex due to an increase in size as well as an increase in the
amount of data, variables, or the number of fields that are involved in the design. The
International Space Station is an example of such a system.
The development of smarter control algorithms, microprocessor design, and analysis of
environmental systems also come within the purview of systems engineering. Systems
engineering encourages the use of tools and methods to better comprehend and manage
complexity in systems. Some examples of these tools can be seen here:
System model, Modeling, and Simulation,
System architecture,
System dynamics,
Systems analysis,
Statistical analysis,
Reliability analysis, and
Decision making
Taking an interdisciplinary approach to engineering systems is inherently complex since
the behavior of and interaction among system components is not always immediately
well defined or understood. Defining and characterizing such systems and subsystems
and the interactions among them is one of the goals of systems engineering. In doing so,
the gap that exists between informal requirements from users, operators, marketing
organizations, and technical specifications is successfully bridged.
The scope of systems engineering activities
One way to understand the motivation behind systems engineering is to see it as a
method, or practice, to identify and improve common rules that exist within a wide
variety of systems. Keeping this in mind, the principles of systems engineering —
holism, emergent behavior, boundary, et al. — can be applied to any system, complex or
otherwise, provided systems thinking is employed at all levels. Besides defense and
aerospace, many information and technology based companies, software development
firms, and industries in the field of electronics & communications require systems
engineers as part of their team.
An analysis by the INCOSE Systems Engineering center of excellence (SECOE)
indicates that optimal effort spent on systems engineering is about 15-20% of the total
project effort. At the same time, studies have shown that systems engineering essentially
leads to reduction in costs among other benefits. However, no quantitative survey at a
larger scale encompassing a wide variety of industries has been conducted until recently.
Such studies are underway to determine the effectiveness and quantify the benefits of
systems engineering.
Systems engineering encourages the use of modeling and simulation to validate
assumptions or theories on systems and the interactions within them.
Methods that allow early detection of possible failures, as in safety engineering, are
integrated into the design process. At the same time, decisions made at the beginning of a
project whose consequences are not clearly understood can have enormous implications
later in the life of a system, and it is the task of the modern systems engineer to explore
these issues and make critical decisions. There is no method which guarantees that
decisions made today will still be valid when a system goes into service years or decades
after it is first conceived but there are techniques to support the process of systems
engineering. Examples include the use of soft systems methodology, Jay Wright
Forrester's System dynamics method and the Unified Modeling Language (UML), each
of which is currently being explored, evaluated and developed to support the
engineering decision making process.
Education in systems engineering is often seen as an extension to the regular engineering
courses, reflecting the industry attitude that engineering students need a foundational
background in one of the traditional engineering disciplines (e.g. mechanical engineering,
industrial engineering, computer engineering, electrical engineering) plus practical, real-world
experience in order to be effective as systems engineers. Undergraduate university
programs in systems engineering are rare.
INCOSE maintains a continuously updated Directory of Systems Engineering Academic
Programs worldwide. As of 2006, there were about 75 institutions in the United States that
offer 130 undergraduate and graduate programs in systems engineering. Education in
systems engineering can be taken as SE-centric or Domain-centric.
SE-centric programs treat systems engineering as a separate discipline and all the
courses are taught focusing on systems engineering practice and techniques.
Domain-centric programs offer systems engineering as an option that can be
exercised with another major field in engineering.
Both patterns aim to educate systems engineers who are able to oversee
interdisciplinary projects with the depth required of a core engineer.
Systems engineering topics
Systems engineering tools are strategies, procedures, and techniques that aid in
performing systems engineering on a project or product. The purposes of these tools vary
from database management, graphical browsing, simulation, and reasoning, to document
production, neutral import/export and more.
There are many definitions of what a system is in the field of systems engineering. Below
are a few authoritative definitions:
ANSI/EIA-632-1999: "An aggregation of end products and enabling products to
achieve a given purpose."
IEEE Std 1220-1998: "A set or arrangement of elements and processes that are
related and whose behavior satisfies customer/operational needs and provides for
life cycle sustainment of the products."
ISO/IEC 15288:2008: "A combination of interacting elements organized to
achieve one or more stated purposes."
NASA Systems Engineering Handbook: "(1) The combination of elements that
function together to produce the capability to meet a need. The elements include
all hardware, software, equipment, facilities, personnel, processes, and procedures
needed for this purpose. (2) The end product (which performs operational
functions) and enabling products (which provide life-cycle support services to the
operational end products) that make up a system."
INCOSE Systems Engineering Handbook: "homogeneous entity that exhibits
predefined behavior in the real world and is composed of heterogeneous parts that
do not individually exhibit that behavior and an integrated configuration of
components and/or subsystems."
INCOSE: "A system is a construct or collection of different elements that together
produce results not obtainable by the elements alone. The elements, or parts, can
include people, hardware, software, facilities, policies, and documents; that is, all
things required to produce systems-level results. The results include system level
qualities, properties, characteristics, functions, behavior and performance. The
value added by the system as a whole, beyond that contributed independently by
the parts, is primarily created by the relationship among the parts; that is, how
they are interconnected."
The systems engineering process
Depending on their application, tools are used for various stages of the systems
engineering process:
Using models
Models play important and diverse roles in systems engineering. A model can be defined
in several ways, including:
An abstraction of reality designed to answer specific questions about the real
An imitation, analogue, or representation of a real world process or structure; or
A conceptual, mathematical, or physical tool to assist a decision maker.
Together, these definitions are broad enough to encompass physical engineering models
used in the verification of a system design, as well as schematic models like a functional
flow block diagram and mathematical (i.e., quantitative) models used in the trade study
The main reason for using mathematical models and diagrams in trade studies is to
provide estimates of system effectiveness, performance or technical attributes, and cost
from a set of known or estimable quantities. Typically, a collection of separate models is
needed to provide all of these outcome variables. The heart of any mathematical model is
a set of meaningful quantitative relationships among its inputs and outputs. These
relationships can be as simple as adding up constituent quantities to obtain a total, or as
complex as a set of differential equations describing the trajectory of a spacecraft in a
gravitational field. Ideally, the relationships express causality, not just correlation.
Tools for graphic representations
Initially, when the primary purpose of a systems engineer is to comprehend a complex
problem, graphic representations of a system are used to communicate a system's
functional and data requirements. Common graphical representations include:
Functional Flow Block Diagram (FFBD)
Data Flow Diagram (DFD)
N2 (N-Squared) Chart
IDEF0 Diagram
UML Use case diagram
UML Sequence diagram
USL Function Maps and Type Maps.
Enterprise Architecture frameworks, like TOGAF, MODAF, Zachman Frameworks etc.
A graphical representation relates the various subsystems or parts of a system through
functions, data, or interfaces. Any of the above methods may be used in an industry,
depending on its requirements. For instance, the N2 chart may be used where interfaces
between systems are important. Part of the design phase is to create structural and
behavioral models of the system.
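As an illustration of how an N2 chart relates parts of a system through their interfaces, the sketch below builds the chart as a square grid, with the functions on the diagonal and the interface from function i to function j placed in row i, column j. The function names and interface labels are hypothetical, and the row/column convention shown is one common choice rather than the only one.

```python
from typing import Dict, List, Tuple


def n2_chart(functions: List[str],
             interfaces: Dict[Tuple[str, str], str]) -> List[List[str]]:
    """Build an N x N grid: functions on the diagonal, and the interface
    from function i to function j in row i, column j."""
    grid = []
    for src in functions:
        row = []
        for dst in functions:
            row.append(src if src == dst else interfaces.get((src, dst), ""))
        grid.append(row)
    return grid


# Hypothetical three-function control loop.
functions = ["sensor", "controller", "actuator"]
interfaces = {
    ("sensor", "controller"): "measurement",
    ("controller", "actuator"): "command",
}
chart = n2_chart(functions, interfaces)
for row in chart:
    print(" | ".join(cell.ljust(11) for cell in row))
```

Empty off-diagonal cells show pairs of functions with no direct interface, which is often as informative as the populated cells.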
Once the requirements are understood, it is now the responsibility of a systems engineer
to refine them, and to determine, along with other engineers, the best technology for a
job. At this point starting with a trade study, systems engineering encourages the use of
weighted choices to determine the best option. A decision matrix, or Pugh method, is one
way (QFD is another) to make this choice while considering all criteria that are
important. The trade study in turn informs the design which again affects the graphic
representations of the system (without changing the requirements). In an SE process, this
stage represents the iterative step that is carried out until a feasible solution is found. A
decision matrix is often populated using techniques such as statistical analysis, reliability
analysis, system dynamics (feedback control), and optimization methods.
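A weighted decision matrix of the kind mentioned above can be sketched in a few lines. This is a simplified weighted-sum form (the Pugh method proper scores each option relative to a baseline design); the criteria, weights and scores below are hypothetical values chosen only to show the mechanics.

```python
from typing import Dict


def weighted_scores(weights: Dict[str, float],
                    options: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Weighted sum of criterion scores for each option."""
    return {
        name: sum(weights[c] * scores[c] for c in weights)
        for name, scores in options.items()
    }


# Hypothetical criteria weights (summing to 1) and scores on a 1-5 scale.
weights = {"cost": 0.5, "reliability": 0.3, "schedule": 0.2}
options = {
    "design_a": {"cost": 4, "reliability": 2, "schedule": 3},
    "design_b": {"cost": 2, "reliability": 5, "schedule": 4},
}
totals = weighted_scores(weights, options)
best = max(totals, key=totals.get)
print(totals, best)
```

In practice the scores themselves would come from the analyses listed above, and the ranking should be checked for sensitivity to the chosen weights.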
At times a systems engineer must assess the existence of feasible solutions, and rarely
will customer inputs arrive at only one. Some customer requirements will produce no
feasible solution. Constraints must be traded to find one or more feasible solutions. The
customers' wants become the most valuable input to such a trade and cannot be assumed.
Those wants/desires may only be discovered by the customer once the customer finds
that they have overconstrained the problem. Most commonly, many feasible solutions can be
found, and a sufficient set of constraints must be defined to produce an optimal solution.
This situation is at times advantageous because one can present an opportunity to
improve the design towards one or many ends, such as cost or schedule. Various
modeling methods can be used to solve the problem including constraints and a cost
Systems Modeling Language (SysML), a modeling language used for systems
engineering applications, supports the specification, analysis, design, verification and
validation of a broad range of complex systems.
Universal Systems Language (USL) is a systems oriented object modeling language with
executable (computer independent) semantics for defining complex systems, including
Related Fields and Sub-fields
Many related fields may be considered tightly coupled to systems engineering. These
areas have contributed to the development of systems engineering as a distinct entity.
Cognitive systems engineering
Cognitive systems engineering (CSE) is a specific approach to the description and
analysis of human-machine systems or sociotechnical systems. The three main
themes of CSE are how humans cope with complexity, how work is accomplished
by the use of artefacts, and how human-machine systems and socio-technical
systems can be described as joint cognitive systems. CSE has since its beginning
become a recognised scientific discipline, sometimes also referred to as Cognitive
Engineering. The concept of a Joint Cognitive System (JCS) has in particular
become widely used as a way of understanding how complex socio-technical
systems can be described with varying degrees of resolution. The more than 20
years of experience with CSE has been described extensively.
Configuration Management
Like systems engineering, Configuration Management as practiced in the defence
and aerospace industry is a broad systems-level practice. The field parallels the
taskings of systems engineering; where systems engineering deals with
requirements development, allocation to development items and verification,
Configuration Management deals with requirements capture, traceability to the
development item, and audit of development item to ensure that it has achieved
the desired functionality that systems engineering and/or Test and Verification
Engineering have proven out through objective testing.
Control engineering
Control engineering and its design and implementation of control systems, used
extensively in nearly every industry, is a large sub-field of systems engineering.
The cruise control on an automobile and the guidance system for a ballistic
missile are two examples. Control systems theory is an active field of applied
mathematics involving the investigation of solution spaces and the development
of new methods for the analysis of the control process.
Industrial engineering
Industrial engineering is a branch of engineering that concerns the development,
improvement, implementation and evaluation of integrated systems of people,
money, knowledge, information, equipment, energy, material and process.
Industrial engineering draws upon the mathematical, physical and social sciences,
together with the principles and methods of engineering analysis, synthesis and
design, to specify, predict and evaluate the results to be obtained from such
systems.
Interface design
Interface design and its specification are concerned with assuring that the pieces
of a system connect and inter-operate with other parts of the system and with
external systems as necessary. Interface design also includes assuring that system
interfaces be able to accept new features, including mechanical, electrical and
logical interfaces, including reserved wires, plug-space, command codes and bits
in communication protocols. This is known as extensibility. Human-Computer
Interaction (HCI) or Human-Machine Interface (HMI) is another aspect of
interface design, and is a critical aspect of modern systems engineering. Systems
engineering principles are applied in the design of network protocols for
local-area networks and wide-area networks.
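The reserved bits mentioned above can be illustrated with a small sketch. The 8-bit header layout below is entirely hypothetical and does not belong to any real protocol. It shows one common way to leave room for extension: the sender transmits the reserved bits as zero and the receiver ignores them, so a later revision can assign them a meaning without breaking existing receivers.

```python
# Hypothetical 8-bit header layout (most significant bit first):
#   bits 7-6: protocol version
#   bits 5-2: command code
#   bits 1-0: reserved for future features
VERSION_SHIFT, VERSION_MASK = 6, 0b11
COMMAND_SHIFT, COMMAND_MASK = 2, 0b1111


def encode_header(version, command):
    """Pack the fields into one byte; reserved bits are sent as zero."""
    if not 0 <= version <= VERSION_MASK:
        raise ValueError("version does not fit in 2 bits")
    if not 0 <= command <= COMMAND_MASK:
        raise ValueError("command does not fit in 4 bits")
    return (version << VERSION_SHIFT) | (command << COMMAND_SHIFT)


def decode_header(byte):
    """Unpack the fields, ignoring whatever the reserved bits contain."""
    version = (byte >> VERSION_SHIFT) & VERSION_MASK
    command = (byte >> COMMAND_SHIFT) & COMMAND_MASK
    return version, command
```

A header built by a newer sender that sets the reserved bits, for example 0b01010111, still decodes to the same version and command on an older receiver.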
Mechatronic engineering
Mechatronic engineering, like systems engineering, is a multidisciplinary field of
engineering that uses dynamical systems modeling to express tangible constructs.
In that regard it is almost indistinguishable from systems engineering, but what
sets it apart is the focus on smaller details rather than larger generalizations and
relationships. As such, both fields are distinguished by the scope of their projects
rather than the methodology of their practice.
Operations research
Operations research supports systems engineering. The tools of operations
research are used in systems analysis, decision making, and trade studies. Several
schools teach SE courses within the operations research or industrial engineering
department, highlighting the role systems engineering plays in complex projects.
Operations research, briefly, is concerned with the optimization of a process
under multiple constraints.
Performance engineering
Performance engineering is the discipline of ensuring a system will meet the
customer's expectations for performance throughout its life. Performance is
usually defined as the speed with which a certain operation is executed or the
capability of executing a number of such operations in a unit of time. Performance
may be degraded when operations queued for execution are throttled because the
capacity of the system is limited. For example, the performance of a
packet-switched network would be characterised by the end-to-end packet transit delay or
the number of packets switched within an hour. The design of high-performance
systems makes use of analytical or simulation modeling, whereas the delivery of
high-performance implementation involves thorough performance testing.
Performance engineering relies heavily on statistics, queueing theory and
probability theory for its tools and processes.
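As an illustration of how queueing theory supports such analysis, the sketch below evaluates the standard steady-state formulas for an M/M/1 queue. Treating a network node as M/M/1 (Poisson arrivals, exponentially distributed service times, a single server) is an assumption made here for simplicity and is not a model prescribed by the text. The function name and the example rates are likewise hypothetical.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Steady-state metrics of an M/M/1 queue (rates in packets per second)."""
    if arrival_rate <= 0 or service_rate <= 0:
        raise ValueError("rates must be positive")
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    utilization = arrival_rate / service_rate
    mean_delay = 1.0 / (service_rate - arrival_rate)    # seconds spent in the system
    mean_in_system = utilization / (1.0 - utilization)  # packets in the system
    return {
        "utilization": utilization,
        "mean_delay": mean_delay,
        "mean_in_system": mean_in_system,
    }
```

For a node that can serve 1000 packets per second, the formulas give a mean delay of about 5 ms at 800 packets per second and about 20 ms at 950 packets per second, which shows how quickly delay grows as the load approaches capacity.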
Program management and project management
Program management (or programme management) has many similarities with
systems engineering, but has broader-based origins than the engineering ones of
systems engineering. Project management is also closely related to both program
management and systems engineering.
Proposal engineering
Proposal engineering is the application of scientific and mathematical principles
to design, construct, and operate a cost-effective proposal development system.
Basically, proposal engineering uses the "systems engineering process" to create a
cost-effective proposal and increase the odds of a successful proposal.
Reliability engineering
Reliability engineering is the discipline of ensuring a system will meet the
customer's expectations for reliability throughout its life; i.e. it will not fail more
frequently than expected. Reliability engineering applies to all aspects of the
system. It is closely associated with maintainability, availability and logistics
engineering. Reliability engineering is always a critical component of safety
engineering, as in failure modes and effects analysis (FMEA) and hazard fault
tree analysis, and of security engineering. Reliability engineering relies heavily
on statistics, probability theory and reliability theory for its tools and processes.
Safety engineering
The techniques of safety engineering may be applied by non-specialist engineers
in designing complex systems to minimize the probability of safety-critical
failures. The "System Safety Engineering" function helps to identify "safety
hazards" in emerging designs, and may assist with techniques to "mitigate" the
effects of (potentially) hazardous conditions that cannot be designed out of
systems.
Security engineering
Security engineering can be viewed as an interdisciplinary field that integrates the
community of practice for control systems design, reliability, safety and systems
engineering. It may involve such sub-specialties as authentication of system users,
system targets and others: people, objects and processes.
Software engineering
From its beginnings, software engineering has helped shape modern systems
engineering practice. The techniques used in handling the complexities of large
software-intensive systems have had a major effect on the shaping and reshaping of
the tools, methods and processes of SE.