RELIABILITY AND DIAGNOSTIC OF MODULAR SYSTEMS ∗

Reliability and diagnostic are in general treated as two separate problems. Yet the two problems are in fact closely related. Here, this relation is considered in the simple case of modular systems. We show how the computation of reliability and diagnostic can be carried out efficiently within the same Bayesian network, induced by the modularity of the structure function of the system.


Introduction
Reliability is concerned with predicting the correct functioning of systems. Diagnostic, on the other hand, aims at finding the cause of a system's malfunctioning. Usually the two problems are discussed separately, and each domain has its own literature. But in fact the two problems are closely related. This will be demonstrated in this paper in a simple context: we consider modular systems. It is well known that modularity helps in computing the reliability of a system. The same structure also helps with the task of diagnostic, once a malfunctioning of a modular system has been observed.
The logical dependence of the good functioning of a system on the intactness of its components can be described by a Boolean function, the structure function. This is well known; see Esary [5] and Birnbaum et al. [6]. This again shows the duality between reliability and diagnostic. We remark that modular systems are only a very simple, though convenient, setting in which this duality can be exploited. Model-based reliability and diagnostic is a much more general framework, in which this duality finds its most general expression. A first discussion of this framework is given in Provan [14]; this field has still to be explored.

Reliability of Modular Systems
Systems are made up of different components. At any given time, each component may or may not work. Depending on the states of all its components, the system itself may or may not work. This dependence of the functioning of the system on the functioning of its components is described by a structure function. Suppose that the system is composed of components i = 1, . . ., n. Then we introduce a Boolean variable a_i for each component i with the meaning

a_i = ⊤ if the component with number i works, and a_i = ⊥ otherwise. (1)

We use here the symbols ⊥ and ⊤, borrowed from logic, to denote the states of the components, in order to clearly distinguish these states from other concepts introduced in the sequel.
Clearly, in reliability theory, the symbols ⊥ and ⊤ are usually interpreted as the numerical values 0 and 1, respectively.
The states of the variables a_1, . . ., a_n can be summarized by a vector a = (a_1, . . ., a_n). This vector has 2^n different possible states. The set of all possible states is divided into two subsets: the set S_⊤ of states for which the system works, and its complement, the set S_⊥ of states for which the system is down. The state of the system is denoted by a Boolean variable a. The dependence of the system state a on the vector a is expressed by a Boolean function φ,

a = φ(a_1, . . ., a_n) = φ(a). (2)

The function φ is called the structure function of the system. It is a Boolean function which maps n Boolean variables into a Boolean variable; for the basic definitions of Boolean functions see Beichelt [4] and Kohlas [11]. In the sequel, we assume that the systems are monotone, which means that the structure function φ is nondecreasing in each variable.
There are two important general representations of any monotone Boolean function. A subset P ⊆ {1, . . ., n} of components is called a path of φ if a_i = ⊤ for all i ∈ P implies that φ(a) = ⊤, independently of the states of the other components. That is, a path is a subset of components whose functioning is sufficient for the functioning of the system. A path P is called minimal if no proper subset of P is a path. Let P be the family of minimal paths of φ.
Then

φ(a) = ∨_{P∈P} ∧_{i∈P} a_i. (3)

Here ∨ denotes disjunction (or) and ∧ conjunction (and). So the formula above simply states that the system is functioning if, and only if, all (conjunction) components of at least one (disjunction) minimal path are working. This is the disjunctive normal form. We assume in this paper that the set of all components is a path: if all components work, then the system works. If this is the only path in a system, then the system is called a series system. A component which belongs to no minimal path is irrelevant for the functioning of the system; it can be eliminated as far as system reliability is concerned. We assume hereafter that all components are relevant.
A subset of components C is called a cut if the system is down whenever all the elements of C are down (irrespective of the states of the other components). A cut C is called minimal if no proper subset of C is a cut. Let C denote the family of all minimal cuts of the system. Then

φ(a) = ∧_{C∈C} ∨_{i∈C} a_i. (4)

This says that the system is functioning if at least one (disjunction) component in every (conjunction) minimal cut is working. This is the conjunctive normal form of φ. The conjunctive and disjunctive normal forms are dual to each other. We assume that the set of all components is a cut; otherwise there would be no cut, hence no reliability problem.
If this is the only cut in a system, then it is called a parallel system.
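The two dual representations can be checked against each other in a few lines of Python. This is only a sketch: the minimal paths and cuts below belong to an illustrative 2-out-of-3 system (the system works if at least two of three components work), which is an assumption for this example and does not appear in the text.

```python
from itertools import product

# Illustrative 2-out-of-3 system: any two working components suffice,
# and any two failed components bring the system down.
MIN_PATHS = [{1, 2}, {1, 3}, {2, 3}]
MIN_CUTS = [{1, 2}, {1, 3}, {2, 3}]

def phi_from_paths(state, min_paths):
    """Disjunctive normal form (3): the system works iff all components
    of at least one minimal path work."""
    return any(all(state[i] for i in p) for p in min_paths)

def phi_from_cuts(state, min_cuts):
    """Conjunctive normal form (4): the system works iff at least one
    component of every minimal cut works."""
    return all(any(state[i] for i in c) for c in min_cuts)

# The duality: both forms define the same structure function on all
# 2^n component states.
for bits in product([False, True], repeat=3):
    state = dict(zip([1, 2, 3], bits))
    assert phi_from_paths(state, MIN_PATHS) == phi_from_cuts(state, MIN_CUTS)
```

The loop at the end verifies the duality claim exhaustively, which is feasible here because n is small.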
We assume that the probability p_i that component i is working at a given time is known for every component i. The states of the components are assumed to be mutually independent. Let p denote the vector (p_1, . . ., p_n). Then the structure function φ determines the probability p that the system itself is functioning,

p = E(φ) = P{φ = ⊤}. (5)

Clearly, this probability depends on the vector p of the component probabilities. In order to emphasize this dependence, we write

p = h_φ(p). (6)

The computation of the probability p from the structure function is in general no trivial task.
Many methods have been proposed; we refer to Beichelt [4] and Kohlas [11]. One method consists of transforming the disjunctive normal form into a disjunction of disjoint terms, whose probabilities can easily be computed and simply summed up, since the terms are disjoint; see Abraham [1] and Heidtmann [8]. This subject is discussed in the article "Disjoint Sum Forms in Reliability Theory" by Anrig & Beichelt in this journal [2].
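As a baseline for these methods, h_φ(p) can be evaluated by brute force: sum the probabilities of all component states for which φ is true, using the independence assumption. The sketch below is exponential in n and only practical for very small systems; the function names are illustrative.

```python
from itertools import product

def reliability(phi, p):
    """Brute-force evaluation of h_phi(p): sum P{a} over all component
    state vectors a with phi(a) true, assuming independent components.
    Exponential in n; shown only to make p = E(phi) concrete."""
    n = len(p)
    total = 0.0
    for bits in product([False, True], repeat=n):
        prob = 1.0
        for works, p_i in zip(bits, p):
            prob *= p_i if works else 1.0 - p_i
        if phi(bits):
            total += prob
    return total

series = lambda a: a[0] and a[1]     # both components must work
parallel = lambda a: a[0] or a[1]    # at least one component must work

print(reliability(series, [0.8, 0.6]))    # h(p) = p1 * p2 = 0.48
print(reliability(parallel, [0.8, 0.6]))  # h(p) = 1 - (1-p1)(1-p2) = 0.92
```

The two closed forms in the comments are the familiar series and parallel reliability formulas, which the enumeration reproduces.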
So, in the simplest case of a disjoint form, this means that there are terms c_i = ∧_{j∈I_i} l_j with l_j = a_j or l_j = ¬a_j, such that

φ = ∨_i c_i with c_i ∧ c_j = ⊥ for i ≠ j. (7)

Then we have

p = P{φ = ⊤} = Σ_i P{c_i = ⊤}, (8)

where each term P{c_i = ⊤} is a product of factors p_j and 1 − p_j. It is well known that the problem of computing the probability p that the system is functioning is NP-hard; cf. for example Ball [3]. The modular structure of a system may, however, help to simplify the computation. A subset of components, say for example i = 1, . . ., m (m < n), is called a module if there are Boolean functions φ' and φ'' such that

φ(a) = φ'(φ''(a_1, . . ., a_m), a_{m+1}, . . ., a_n). (9)

φ'' is the structure function of the module. Suppose that the set of components {1, . . ., n} of a system decomposes into m ≥ 2 modules M_1, . . ., M_m. Let a_i denote the vector of the Boolean variables associated with the components in module M_i, and let φ_i be the structure function of module M_i, under the restriction that the variables in a_j are disjoint from those in a_k if j ≠ k. Then there is a Boolean function ψ such that

φ(a) = ψ(φ_1(a_1), . . ., φ_m(a_m)). (10)

M_1, . . ., M_m is called a modular decomposition of the system and ψ its organizing structure.
If we have a modular decomposition of a system φ, then we obtain

p = h_φ(p) = h_ψ(h_{φ_1}(p_1), . . ., h_{φ_m}(p_m)), (11)

where p_i is the vector of probabilities corresponding to a_i. This formula explains how a modular decomposition helps to compute p: the organizing function h_ψ as well as the modular functions h_{φ_i} have fewer, possibly far fewer, arguments than the original function h_φ. This helpful property of a modular decomposition is amplified if the modules themselves possess their own modular decompositions.
This leads then to a hierarchical structure of modules over several levels. We represent this structure by a tree (see Fig. 1). The root node at level 0 corresponds to the Boolean system variable, denoted now by a^0_1. Its descendants on level 1 are the Boolean variables a^1_1 to a^1_{h_{1,0}}. In general, a variable (node) a^i_j at level i has descendants a^{i+1}_{k_{j,i}} to a^{i+1}_{h_{j,i}}, where k_{1,i} = 1 and k_{j+1,i} = h_{j,i} + 1. We denote the vector of the Boolean variables of the descendants of a^i_j by a^i_j. Associated with any node a^i_j of the tree, except the leaves, which have no descendants, there is a structure function φ^i_j such that

a^i_j = φ^i_j(a^i_j). (12)

So we have, starting at the root node,

a^0_1 = φ^0_1(a^0_1), (13)

a^1_j = φ^1_j(a^1_j), (14)

and so on. We denote the probability of a Boolean variable a^i_j by p^i_j, and the probability function associated with a structure function φ^i_j by h^i_j. Then we also have

p^i_j = h^i_j(p^i_j), (15)

where p^i_j is the vector of probabilities of the descendants of a^i_j. Thus, we may start with the given component probabilities at the leaves of the tree and compute probabilities upwards in the tree, until we obtain the system reliability at the root,

p^0_1 = h^0_1(p^0_1). (16)

This supposes, for example, that we determine the paths of each structure function φ^i_j for each non-leaf node of the tree and use an appropriate orthogonalization of the corresponding disjunctive normal form. Since, hopefully, in a modular hierarchy the nodes have only a small number of descendants (components or modules), these computations remain relatively small for each node. We shall discuss in Section 4 an alternative approach to the reliability computation in a modular hierarchy.
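The bottom-up computation (16) amounts to a small recursion over the module tree. The sketch below is a minimal illustration in Python; the tree encoding and the function names are assumptions made for this example.

```python
def tree_reliability(node):
    """Bottom-up reliability in a modular hierarchy: a leaf carries its
    component probability, an inner node carries the reliability function
    h of its structure function, applied to the children's reliabilities.
    The composition is valid because distinct modules involve disjoint,
    hence independent, sets of components."""
    if node[0] == 'leaf':
        return node[1]
    _, h, children = node
    return h([tree_reliability(c) for c in children])

series_h = lambda ps: ps[0] * ps[1]                    # h for a AND b
parallel_h = lambda ps: 1 - (1 - ps[0]) * (1 - ps[1])  # h for a OR b

# A series module (components 0.8 and 0.6) inside a parallel organizing
# structure, together with a component of probability 0.7:
tree = ('mod', parallel_h, [
    ('mod', series_h, [('leaf', 0.8), ('leaf', 0.6)]),
    ('leaf', 0.7),
])
print(tree_reliability(tree))  # 1 - (1 - 0.48) * 0.3 = 0.844
```

Each node only combines the reliabilities of its direct descendants, which is exactly why a modular hierarchy keeps the per-node computations small.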

Diagnostic of Modular Systems
The reliability computation presented in the previous section follows a more or less classical pattern. In this section, however, we consider a problem which seems so far to have hardly been considered in the framework of modular system structures: the problem of diagnostic. Assume that we observe that the system is down. Then clearly some modules and components must be down. The question is which ones. Given only the observation that the system is down, it will in general not be possible to identify unambiguously the defective module(s) or component(s) which cause the system failure. But, by computing the posterior probabilities of module or component failures, given the event that the system is down, we may be able to point out those modules or components which are most likely to cause the problem. Performing further tests on selected modules or components, we may identify the cause of the system failure with more and more certainty.
To start with, consider a system described by a structure function φ and assume that the components a_i have probabilities p_i. Assume now that we observe a = ⊥, i.e. the system is down. This event changes the prior probabilities p_i = P{a_i = ⊤} into posterior, conditional probabilities P{a_i = ⊤ | a = ⊥}. How do we compute these posterior probabilities?
In fact, we could ask the same question if we observe the system to be working, i.e. a = ⊤.
So we consider more generally the problem of computing the family of conditional distributions p(a_i | x), where x = ⊤ or x = ⊥. Of course, we use Bayes' theorem. Assume first that a = ⊤. Then

P{a_i = ⊤ | a = ⊤} = P{a_i = ⊤, a = ⊤} / P{a = ⊤}. (17)

The denominator of this formula is the result of the reliability computation, which we assume to be done. How can the numerator then be computed? Assume that for the reliability computation the structure function φ has been transformed into a disjoint disjunctive form

φ = ∨_j c_j with c_j ∧ c_k = ⊥ for j ≠ k. (18)

From this we deduce that

a_i ∧ φ = ∨_j (a_i ∧ c_j) = ∨_{j∈J_i} c'_j, (19)

where J_i is the set of indices j such that c_j does not contain the literal ¬a_i, and

c'_j = a_i ∧ c_j. (20)

Note that the terms c'_j are still disjoint. So we have

P{a_i = ⊤, a = ⊤} = Σ_{j∈J_i} P{c'_j = ⊤}. (21)

For a = ⊥, we have in the same way

P{a_i = ⊤ | a = ⊥} = P{a_i = ⊤, a = ⊥} / P{a = ⊥}. (22)

Here, the numerator is computed using the following identity:

P{a_i = ⊤, a = ⊥} = P{a_i = ⊤} − P{a_i = ⊤, a = ⊤}. (23)

The first term is the prior probability p_i, and the second one has been computed above. So there is no need to compute new probabilities in this case.
If we consider now the more complicated case of a modular hierarchy, then the computations above yield the posterior distributions of the descendants of the root node. More generally, for any node a^i_j, assume that we have the family of posteriors p(a^i_j | x). How can we compute the posteriors of its descendants a^{i+1}_k? By the formula of total probability, we have

p(a^{i+1}_k | x) = p(a^{i+1}_k | a^i_j) p(a^i_j | x) + p(a^{i+1}_k | ¬a^i_j) p(¬a^i_j | x). (24)

The conditional probabilities p(a^{i+1}_k | a^i_j) and p(a^{i+1}_k | ¬a^i_j) are computed just as above in the case of the root node, using the reliabilities p(a^i_j), which have been computed beforehand. If we work down the tree, then the probabilities p(a^i_j | x) and p(¬a^i_j | x) = 1 − p(a^i_j | x) have already been computed on a higher, hence previous, level. In this way, we may work downwards to the leaves of the tree to obtain the posteriors p'(a^i_j) of all nodes of the tree.
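One step of this downward pass can be sketched as follows. The function name and the argument convention are assumptions for this illustration; the joint probability P{child works, module works} is the kind of quantity delivered by the disjoint-sum computation of the upward pass.

```python
def child_posterior(p_child, p_parent, p_joint, post_parent):
    """One downward step of the diagnostic computation, by total
    probability:

        p(child | x) = p(child | parent) p(parent | x)
                     + p(child | ~parent) p(~parent | x)

    p_child     -- prior P{child works}
    p_parent    -- prior P{parent module works}, from the upward pass
    p_joint     -- P{child works, parent module works}
    post_parent -- already-computed posterior p(parent | x)
    """
    p_given_up = p_joint / p_parent
    p_given_down = (p_child - p_joint) / (1 - p_parent)
    return p_given_up * post_parent + p_given_down * (1 - post_parent)

# Hypothetical series module a_parent = a1 AND a2 with p1 = 0.8, p2 = 0.6,
# so P{a1 works, parent works} = p1 * p2 = 0.48. The module has been
# observed down, hence its posterior of working is 0:
print(child_posterior(0.8, 0.48, 0.48, 0.0))  # (0.8 - 0.48) / 0.52, about 0.615
```

Note that only quantities already produced by the reliability computation enter the step, which is the point made in the text.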
In summary, we work first the tree of a modular hierarchy upwards to get the reliabilities of the nodes of the tree.Then, once we observe the system state, we work the tree downwards, using the results of the reliability computation, to obtain the posterior probabilities of the states of all modules and components.This exhibits nicely the duality inherent between reliability and diagnostic.
To illustrate this theory, consider the example in Fig. 2. In this example the function φ^1_1 has two arguments, a^1_1 and a^1_2. The first argument depends on another module, a^1_1 = φ^2_1(a^2_1, a^2_2). The two Boolean functions are defined as

φ^1_1(a^1_1, a^1_2) = a^1_1 ∨ a^1_2, φ^2_1(a^2_1, a^2_2) = a^2_1 ∧ a^2_2. (25)

Assume the prior probabilities of a^1_2, a^2_1 and a^2_2 to be

p(a^1_2) = 0.7, p(a^2_1) = 0.8, p(a^2_2) = 0.6. (26)

So in a first step the prior probabilities of a^1_1 and a^0_1 are calculated as

p(a^1_1) = p(a^2_1) p(a^2_2) = 0.8 · 0.6 = 0.48, p(a^0_1) = 1 − (1 − 0.48)(1 − 0.7) = 0.844. (27)

Now, if the variable a^0_1 is observed to be ⊥, then its posterior probability changes to p'(a^0_1) = 0.
And the posterior probabilities of all underlying variables change as well. It is clear that if a parallel system is down, then all its components must be down; since the organizing structure is parallel, we obtain in the same way

p'(a^1_1) = 0, p'(a^1_2) = 0. (28)

The calculation of p'(a^2_1) is done as follows:

p'(a^2_1) = P{a^2_1 = ⊤ | a^1_1 = ⊥} = p(a^2_1)(1 − p(a^2_2)) / (1 − p(a^1_1)) = (0.8 · 0.4) / 0.52 ≈ 0.615, (29)

and similarly

p'(a^2_2) = (0.6 · 0.2) / 0.52 ≈ 0.231. (30)

We may now select some modules or components and test them. This will introduce additional information and therefore lead to revised posterior probabilities. In order to discuss these more general issues, we link modular hierarchies to a more general structure, namely Bayesian networks. This will allow us to draw on a well-developed computational theory and apply it to our particular problem.
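The numbers of this example can be checked with a few lines of Python. This is a sketch; the variable names mirror the a^i_j notation of the text.

```python
# Priors from the example: series module phi21 = a21 AND a22 inside a
# parallel organizing structure phi11 = a11 OR a12.
p21, p22, p12 = 0.8, 0.6, 0.7

# Upward pass: reliabilities of the module and of the system.
p11 = p21 * p22                    # series module: 0.48
p01 = 1 - (1 - p11) * (1 - p12)   # parallel organizing structure: 0.844

# Observation: the system is down. A parallel system that is down
# has all its arguments down.
post11 = post12 = 0.0

# Bayes for the components of the series module, given a11 = down:
post21 = p21 * (1 - p22) / (1 - p11)  # 0.32 / 0.52, about 0.615
post22 = p22 * (1 - p21) / (1 - p11)  # 0.12 / 0.52, about 0.231
print(round(p01, 3), round(post21, 3), round(post22, 3))
```

Running this prints 0.844 0.615 0.231, matching the values derived in the text.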

Modular Systems and Bayesian Networks
The tree of a modular hierarchy can be considered as a Bayesian network. If we put the system into this framework, then we can use the computational procedures developed for Bayesian networks (Jensen [9], Cowell et al. [7]). We shall study this approach in this section and compare it to the computational methods introduced in the previous sections.
In the tree of a modular hierarchy, each node corresponds either to the system (the root node), to a module, or to a component (the leaf nodes). We number the nodes of the tree from 0 (the root) to m. For any node i of the tree, we denote the set of descendants of i (the elements of the module) by D(i). For leaves i we have of course D(i) = ∅. Fig. 3 shows the modular tree of the example of Section 3. By a we denote the vector of all Boolean variables associated with the nodes (system, modules, components) of the tree. More generally, if J is a subset of nodes, then a_J denotes the vector of the Boolean variables a_j for j ∈ J. To every non-leaf node i corresponds a structure function φ_i(a_{D(i)}). In order to translate this structure into a Bayesian network, we associate with the structure function a 0-1 conditional probability matrix,

p(a_i | a_{D(i)}) = 1 if a_i = φ_i(a_{D(i)}), and 0 otherwise. (31)

The prior probability p_i = p(a_i) is defined for every leaf i. The tree, together with these prior and conditional probabilities, constitutes a Bayesian network. We refer to Cowell et al. [7] for details on the theory of Bayesian networks.
We sketch here the elements necessary to understand the application of Bayesian networks to the reliability and diagnostic of modular Boolean systems. First of all, note that we may put p(a_i) = p(a_i | a_{D(i)}) for leaves, since D(i) is empty in this case. With this convention, we can define the overall multivariate distribution of all variables in the tree by

p(a) = p(a_0, . . ., a_m) = ∏_i p(a_i | a_{D(i)}), (32)

where the product is taken over all nodes of the tree. The system reliability is then simply obtained as the marginal of this distribution with respect to the variable a_0,

p(a_0) = Σ_{a_1, . . ., a_m} p(a_0, a_1, . . ., a_m). (33)

That is, we sum out all variables in p(a) except a_0. Assume furthermore that we now observe a certain variable a_i, say a_i = ⊥. Then we look at the conditional distribution of the other variables given this event,

p(a_0, . . ., a_{i−1}, a_{i+1}, . . ., a_m | a_i = ⊥) = p(a_0, . . ., a_{i−1}, ⊥, a_{i+1}, . . ., a_m) / P{a_i = ⊥}. (34)

This shows that the conditional probability is proportional to p(a_0, . . ., a_{i−1}, ⊥, a_{i+1}, . . ., a_m), which is obtained from Equation (32) simply by setting the i-th variable a_i to the value ⊥.
In fact, the conditional probability distribution is computed from the family of probabilities p(a_0, . . ., a_{i−1}, ⊥, a_{i+1}, . . ., a_m) simply by normalizing it to 1. This is also true for any marginal of the conditional distribution: it can be obtained by normalizing the corresponding marginal of p(a_0, . . ., a_{i−1}, ⊥, a_{i+1}, . . ., a_m). Finally, we remark that

p(a_0, . . ., a_{i−1}, ⊥, a_{i+1}, . . ., a_m) = Σ_{a_i} ψ(a_i) p(a_0, . . ., a_m), (35)

where ψ is defined by ψ(⊤) = 0 and ψ(⊥) = 1. Based on these considerations, we may, for computational purposes, replace a probability distribution p in a Bayesian network by any non-negative function ψ which is proportional to the distribution. Such non-negative functions are called potentials. According to (35), observed events can also be represented by potentials. So Bayesian networks reduce to a calculus of potentials, which will be sketched in the sequel. A potential ψ always refers to some set of variables (nodes) J, i.e. ψ is a non-negative function of a_J, and we define d(ψ) = J. We remark that a potential is just a multidimensional table of non-negative numbers. If K ⊆ J, then we denote the projection of the vector a_J to the variables in K by a_J^{↓K}. Also, if ψ is a potential on J, then its marginal with respect to K ⊆ J is obtained by summing out the variables outside K,

ψ^{↓K}(a_K) = Σ_{a_{J\K}} ψ(a_J). (36)

So, given the Bayesian network related to a modular hierarchy, together with, for example, some observations (tests) on certain modules or components, the problem consists of computing marginals, relative to variables a_i, of a product of potentials. For the system reliability, for example, we compute

(∏_i p(a_i | a_{D(i)}))^{↓{0}}. (37)

Suppose we then observe a_0 = ⊥, i.e. the system is down. Define, therefore, ψ_0(a_0) by ψ_0(⊤) = 0 and ψ_0(⊥) = 1. Then we are interested in the marginals, relative to some other node i, of the product

(ψ_0(a_0) ∏_j p(a_j | a_{D(j)}))^{↓{i}}. (38)

This gives, up to normalization, the posterior probability of module or component i, given that the system is down. There may also be several observations, not only on the system as a whole, but also on other modules. This leads to similar problems of computing marginals of a product of potentials.
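A potential really is just a table, and the two operations needed in this calculus, pointwise product and marginalization (36), take only a few lines. The sketch below is a minimal Python illustration; the dict-of-tuples encoding of a table is an assumption made for this example.

```python
from itertools import product

# A potential on domain `dom` (a tuple of variable names) is a dict
# mapping each assignment, a tuple of booleans in dom order, to a
# non-negative number.
def combine(dom1, t1, dom2, t2):
    """Pointwise product of two potentials, on the union of their domains."""
    dom = tuple(dict.fromkeys(dom1 + dom2))
    out = {}
    for bits in product([False, True], repeat=len(dom)):
        a = dict(zip(dom, bits))
        out[bits] = t1[tuple(a[x] for x in dom1)] * t2[tuple(a[x] for x in dom2)]
    return dom, out

def marginalize(dom, t, keep):
    """Sum out all variables of the potential that are not in `keep`."""
    kd = tuple(x for x in dom if x in keep)
    out = {}
    for bits, v in t.items():
        key = tuple(b for x, b in zip(dom, bits) if x in keep)
        out[key] = out.get(key, 0.0) + v
    return kd, out

# Reliability of a series pair via (32) and (33): priors, a 0-1
# conditional table for a0 = a1 AND a2, product, then marginal to a0.
p1 = (('a1',), {(True,): 0.8, (False,): 0.2})
p2 = (('a2',), {(True,): 0.6, (False,): 0.4})
cond = (('a0', 'a1', 'a2'),
        {(a0, a1, a2): float(a0 == (a1 and a2))
         for a0, a1, a2 in product([False, True], repeat=3)})

dom, t = combine(*p1, *p2)
dom, t = combine(dom, t, *cond)
dom, t = marginalize(dom, t, {'a0'})
print(t[(True,)])  # system reliability: 0.8 * 0.6 = 0.48
```

The explicit product over all states is exactly what becomes infeasible for large m, which motivates the join-tree scheme presented next.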
Although it seems straightforward to compute the marginal of a product of potentials, in practice the product refers to 2^m different states, and an explicit computation of such a product, followed by summing out all the variables not in the marginal, is not feasible. Therefore Lauritzen and Spiegelhalter [13] proposed a more realistic approach, based on graphical structures like Bayesian networks. Jensen et al. [10] later introduced a more efficient variant, which we shall present next for our modular hierarchy. For this purpose, we transform the tree which represents the Bayesian network into another tree. We take the nodes i of the original tree, but add nodes, each of which represents a non-empty set D(i). We introduce edges between i and D(i), and between D(i) and every j ∈ D(i). We add a node 0' and link it to the original root node 0; similarly, we add a node j' for every leaf node j. Such a tree is called a join or junction tree. The nodes D(i), together with the added nodes 0' and j' for every leaf node j, form a set of nodes which we denote by V, whereas the other nodes i (including the original 0) form a set of nodes S. Note that there is exactly one node i ∈ S between a pair D(j) and D(i) of nodes of V with i ∈ D(j). The nodes of S are called separators. Fig. 4 shows the join tree constructed from the modular tree of Fig. 3.
Assume now, in a general setting, that on every node v = D(i) of V there is a potential ψ_v with d(ψ_v) = D(i), and on every node i of S there is a potential ψ_i with d(ψ_i) = {i}. We further make the important assumption that, if node i is the separator between nodes v and w, then

ψ_i(a_i) = 0 implies ψ_v^{↓{i}}(a_i) = ψ_w^{↓{i}}(a_i) = 0. (39)

We first direct the edges of the join tree towards the root 0'. We pass messages between nodes, starting with the leaves. Once a node v = D(i) has obtained its messages, it passes a message to its separator i, which passes its message on to w = D(j), i ∈ D(j). If ψ_v, ψ_w and ψ_i are the potentials on the corresponding nodes of the join tree before the messages are passed, and ψ*_v, ψ*_w and ψ*_i the contents of the nodes after the messages are passed, then

ψ*_v = ψ_v, ψ*_i(a_i) = ψ_v^{↓{i}}(a_i), ψ*_w(a_w) = ψ_w(a_w) ψ*_i(a_w^{↓{i}}) / ψ_i(a_w^{↓{i}}). (40)

Note that ψ_i(a_i) = ψ_i(a_w^{↓{i}}) can be zero for some value of a_i. But then, by (39), the numerator vanishes as well; in this case we fix the result arbitrarily to, say, 0. In the sequel, we will write the formulas of (40) simply as follows:

ψ*_i = ψ_v^{↓{i}}, ψ*_w = ψ_w ψ*_i / ψ_i. (41)

This message passing continues until the root is reached. This is called the collect phase.
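A single message step (41) can be sketched directly from the tables. This is an illustrative Python sketch, not the full collect phase; the encoding of potentials as dicts over boolean tuples and all function names are assumptions for this example.

```python
def marginal_to(dom, psi, var):
    """Marginal of a potential (a table over the variable tuple `dom`)
    with respect to the single variable `var`: sum out all others."""
    idx = dom.index(var)
    out = {}
    for bits, val in psi.items():
        out[bits[idx]] = out.get(bits[idx], 0.0) + val
    return out

def pass_message(dom_v, psi_v, sep, psi_sep, dom_w, psi_w):
    """One message over separator {sep}, as in (41):
    psi_sep* = psi_v marginalized to {sep};
    psi_w*   = psi_w * psi_sep* / psi_sep, with 0/0 fixed to 0."""
    new_sep = marginal_to(dom_v, psi_v, sep)
    idx = dom_w.index(sep)
    new_w = {}
    for bits, val in psi_w.items():
        old = psi_sep[bits[idx]]
        new_w[bits] = 0.0 if old == 0 else val * new_sep[bits[idx]] / old
    return new_sep, new_w

# Node v = D(1) holds p(a0 | a1, a2) * p(a1) * p(a2) for a series module
# (p1 = 0.8, p2 = 0.6); separator {a0} and node w = {a0} hold unit tables.
from itertools import product
p1_, p2_ = 0.8, 0.6
psi_v = {}
for a0, a1, a2 in product([False, True], repeat=3):
    cond = 1.0 if a0 == (a1 and a2) else 0.0   # 0-1 table of a0 = a1 AND a2
    psi_v[(a0, a1, a2)] = cond * (p1_ if a1 else 1 - p1_) * (p2_ if a2 else 1 - p2_)
unit_sep = {False: 1.0, True: 1.0}
psi_w = {(False,): 1.0, (True,): 1.0}

new_sep, new_w = pass_message(('a0', 'a1', 'a2'), psi_v, 'a0', unit_sep, ('a0',), psi_w)
print(new_w[(True,)])  # reliability of the series pair: 0.8 * 0.6 = 0.48
```

With unit tables on the separators, the first message already delivers the marginal, which is the starting situation described in the text.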
It can be proved that at the end of the collect phase we have the following marginal at the root 0' (Jensen et al. [10]):

ψ*_{0'} = (∏_{v∈V} ψ_v)^{↓{0}}. (42)

Usually we start with ψ_{D(i)} = p(a_i | a_{D(i)}) and unit tables ψ_i on the separators. Then the last result shows that at the end of the collect phase we obtain the reliability p(a_0) at the root. This clearly corresponds to the reliability computation in a modular hierarchy, as presented in Section 2; only, instead of computing with disjoint disjunctions of the structure functions, we use the tabular form of conditional probabilities and the corresponding multiplications and summations of tables. If this were all, the approach based on Bayesian networks would in most cases be computationally less efficient than the approach of Section 2. Bayesian networks become interesting, however, when diagnostic problems arise.
Assume then that an observation of the system state is made (⊤ or ⊥). We add a corresponding potential ψ_0 to our product; that is, the content ψ*_{0'} of the root node becomes

ψ*_{0'} ψ_0. (43)

Now the edges are oriented away from the root. Then, starting with the root, messages are passed outwards towards the leaves. The message-passing mechanism is exactly the same as in the collect phase, (41). This message passing stops when all leaves have received their message. This is called the distribute phase. Jensen et al. [10] show that at the end of the distribute phase every node v = D(i) ∈ V and every node i ∈ S contains the marginal ψ^{↓D(i)} or ψ^{↓{i}} respectively, where

ψ = ψ_0 ∏_{v∈V} ψ_v. (44)

In our case this product equals the joint probability distribution with a_0 instantiated to the observation, say p(⊥, a_1, . . ., a_m) if a_0 = ⊥ was observed. This shows that ψ^{↓{i}} is then, up to normalization, the posterior distribution of the state of module or component i, given the system observation.
This distribute phase clearly corresponds to the diagnostic computation discussed in Section 3. Now the Bayesian networks become interesting. We may introduce further observations of other modules or components, besides the observation of the whole system already introduced. It suffices to take the corresponding node as a new root, direct all edges of the join tree away from this new root, and add a new distribute phase based on the actual contents of the nodes of the join tree; cf. Jensen et al. [10] and Kohlas [12]. This gives the updated posterior probabilities for all other modules and components. This incremental procedure helps to identify the cause of a system breakdown: we compute posterior probabilities, select modules to test on the basis of these posterior probabilities, and update the posterior probabilities by a distribute phase. This can be repeated until the faulty elements are identified with sufficient certainty.

Conclusion
The computation of the reliability of a modular system is a classical problem of reliability theory. But there is the dual problem of diagnostic: if the failure of the system or of some of its modules is observed, what can be the possible cause of this observation? Which submodules or components are down? Or also: how does this change the reliability of the system? We show in this paper that the modular structure of the system also helps to answer these questions. In fact, it can be used to compute the posterior probabilities of the submodules and the components, given some observation. This calculation uses the results of the previous reliability computation. The duality of reliability and diagnostic becomes most clear if we consider a modular hierarchy as a particular Bayesian network. We then obtain a unifying framework in which we may compute reliability and diagnostic information using any information or observation about module states we may obtain. In fact, it is possible to use incremental procedures, where one observation at a time is added. The newly updated posterior probabilities may be used to decide which modules to test next.
The unifying framework of Bayesian networks shows that reliability and diagnostic of modular systems are two sides of the same coin. It also offers well-developed computational procedures.
The Bayesian network of a modular hierarchy has, however, a very special form. This particular structure may help to adapt the Bayesian methods and to render them more efficient in this particular case. This issue is still open, as are many other issues related to the application of Bayesian networks to the study of modular systems. For example, the time dependency of availability and reliability, related to aging, and other features could be integrated into this framework. Also, the duality between reliability and diagnostic can be explored in more general structures than modular systems. This is the subject of model-based reliability, which extends the well-known approach of model-based diagnostic.

Figure 2: A serial module in a parallel organizing structure.

Figure 3: Modular Tree of the Example.

Figure 4: Join Tree of the Example. The circles denote the separators in S.