Projects # F61775
Contract No. F61775-98-WE116 (EOARD)(1998-1999)
Summary of the Project Results
This summary gives a brief outline of the results presented in the Final Report on contract No. F61775-98-WE116 from the European Office of Aerospace Research and Development (EOARD, USA).
The Report briefly describes all accomplishments, results and conclusions of the research performed under the contract. In particular, it contains the results of research on the following tasks:
- Mathematical analysis of the existing diagnostic procedure utilizing statistical clustering and Bayes’ techniques, and further development of the mathematical basis and algorithm for diagnostic model design.
Taken as a whole, these results form a new technology for assessment and prognosis of the probability of failure of hardware such as avionics;
- Development of a software package implementing all steps of cluster analysis of statistical data and the further phases of the developed technology.
This software made it possible to verify and validate the developed technology numerically. Distinctive features of the software are (1) its interactivity, (2) the use of computer graphics to preview the clustering pattern in particular subspaces and (3) the use of modern software engineering environments such as Visual C++ 5.0 and MS Access-97;
- Application of the Algebraic Bayes’ Network approach to various classes of applications of experts’ analysis techniques to diagnostics-related problems;
- Application of classical regression analysis to the estimation of the remaining life expectancy (residual performance resource) of hardware subjected to adverse effects.
An advantage of the developed regression model is that it is based on the Dynamic Data Model (DDM), which made it possible to design the regression model using ideas from time series prognosis.
All results obtained under the contract are presented in sufficient detail in the Final Report. Nevertheless, the Interim Report [IR-98] has to be considered an integral part of the Final Report, because some aspects of the developed knowledge engineering technology for diagnostics-related problem solving are considered in more detail in the Interim Report. In addition, the Interim Report presents further numerical results that are not repeated in the Final Report.

Contents
| Preface | 2 |
| 1. Introduction | 3 |
| 2. Dynamic Data Model (DDM) of Failure Development | 6 |
| 2.1. Dynamic data model vs. static data model | 6 |
| 2.2. Properties of DDM and assumptions | 6 |
| 2.3. Numerical characteristics of DDM | 7 |
| 2.4. Generation of trajectories of failure development | 10 |
| 2.5. Simulation of DDM | 11 |
| 2.6. Interpretation of simulated data | 11 |
| 3. Regression Model for Residual Performance Resource Assessment | 17 |
| 3.1. Problem statement and traditional approach | 17 |
| 3.2. DDM-based regression model | 17 |
| 3.3. Numerical results | 18 |
| 4. Knowledge Discovery from Statistical Data Base for Health Assessment System Design | 24 |
| 4.1. Outline of the technology of knowledge discovery from statistical data base | 24 |
| 4.2. Heuristic selection of informative subspaces. Informativity criteria | 26 |
| 4.3. Visual design of arbitrary classification predicates as a step towards a new technology of classification model design | 28 |
| 4.4. Forest of decision trees as a step towards improving quality of classification | 33 |
| 4.5. Probabilistic decision making procedure | 33 |
| 4.6. Improvement of assessment of task-related probabilities | 35 |
| 4.7. Numerical results | 37 |
| 5. Algebraic Bayes' Network for Knowledge Engineering | 42 |
| 5.1. Introduction | 42 |
| 5.2. Properties of expert information | 42 |
| 5.3. Advantages of probabilistic model of expert information vs. fuzzy one | 43 |
| 5.4. Concept of knowledge piece and background probabilistic knowledge | 43 |
| 5.5. Algebraic Bayes’ Network: formal definition | 46 |
| 5.6. Consistency of Algebraic Bayes' Networks: Background knowledge | 48 |
| 5.7. Case study of experts’ information processing: Car engine diagnostics | 53 |
| 5.8. Consistent integration of statistical and expert information within diagnostic model | 59 |
| 6. Conclusion: Contribution of the Research and Perspective Future works | 60 |
| 6.1. Contribution of the research | 60 |
| 6.2. Proposals for future research | 62 |
| References | 63 |
| Appendices |
| Appendix A1. Trajectories of failure development. |
| Appendix A2. Learning and Testing Data. |
| Appendix A3. ABN consistency conditions and algorithms of experts’ knowledge processing. |

Introduction and Report Summary
Safe, reliable and efficient operation of avionics is crucial for a modern aircraft or spacecraft. While in operation, avionics components are exposed to electrical perturbations, mechanical vibrations, excessive temperatures, humidity, etc. These adverse conditions, acting individually and in combination, are known to have cumulative effects leading to avionics performance degradation and failures. Until recently, it was virtually impossible to obtain data characterizing the performance of individual units. At present, the availability of dedicated monitoring systems and similar devices allows the collection of large amounts of actual data on any particular unit of aircraft hardware. Based on this data, modern Data Mining techniques, common in the technology of Knowledge Discovery from Data (KDD), facilitate the formulation and solution of important on-line and off-line prognostics-related problems.
These new possibilities for hardware monitoring and for on-line and off-line prognostics-related problem solving determined the tasks that are the subject of the contract. According to the contract, the research presented in this Report aimed at the development of mathematical models, algorithms and software for solving the following tasks:
- accurate assessment of the probability of failure of hardware, such as avionics, on the basis of its known history of abuse by environmental and operational factors;
- prognosis of the probability of failure of hardware at a given time in the future, for example, at the end of the forthcoming sortie of the aircraft;
- accurate assessment of the residual performance resource of hardware on the basis of a regression model, its known history of abuse by environmental and operational factors, and its known cumulative time of maintenance (number of sorties).
This task statement was prompted by the modern concept of maintenance known as "service when needed" [Skormin et al-97]. Let us consider how the above task statement differs from the traditional one.
Traditionally, the reliability of any technical device (electronic, electro-mechanical, or mechanical) is defined in terms of such characteristics as the average time of normal (no-failure) operation. These reliability concepts, which refer to a statistically generic device, may be considered acceptable as long as failures are caused by factors related to manufacturing. At present, this approach is not always acceptable. Manufacturers of electronics, thanks to completely automated processes, have achieved a very high degree of reliability of their products and very little variation in properties from device to device. Manufacturing-related effects on failures of electronics are gradually becoming less significant. The main causes of failures are now traced to the individual operational and environmental conditions of particular units. Therefore, the average time of normal operation and other "traditional" reliability characteristics, defined without taking into account the actual "history of abuse" of a device, are becoming less important.
Classical reliability had a good reason for addressing a statistically generic device. At that time it was virtually impossible to obtain data characterizing the performance of individual units in various operating environments. At present, the availability of Time-Stress Measurement Devices (TSMD) [Popyack-98], smart sensors and data acquisition systems makes it possible to collect large amounts of actual data on any particular unit of aircraft hardware. Based on this data, modern Data Mining techniques, common in Knowledge Discovery from Data (KDD) technology, facilitate the formulation and solution of important on-line and off-line reliability-related problems. The most important problem is forecasting the probability of failure of flight-critical units of aircraft hardware during a forthcoming sortie. Solving such a problem implies investigating the role of various environmental factors in the development of particular failures, investigating the combined effects of several factors, reevaluating the probability of failure on the basis of known exposure to particular adverse conditions, and developing special types of mathematical models and model-based techniques.
Data Mining and KDD address specific practical needs in solving the above-mentioned problems. Data Mining provides a wide spectrum of techniques and tools for developing a KDD technology focused on the design of a mathematical model for a particular application ([Frawley et al -91], [Matheus et al 93], [Fayyad et al-95-1], [Fayyad et al 95-2], [Bradley et al-98-1]). It is well known that every particular application possesses specific properties that require either adapting existing Data Mining techniques or developing new ones in order to build an adequate and efficient technology of original data processing aimed at the development of a particular model.
According to the common understanding [Fayyad et al-95-1], the KDD process considered herein consists of a number of Data Mining procedures that, regardless of the domain and the particular task, conceptually include the following steps: (1) definition of the goal of the task; (2) collection or model-based generation of adequate statistical data and its preprocessing; (3) data reduction and transformation to find useful data patterns, their specifications and representations, including visualization if possible; (4) development of a KDD strategy that, in effect, outlines the future technology as a sequence of Data Mining steps; (5) selection, adaptation or development of Data Mining methods and algorithms intended to realize the accepted KDD technology (search for informative subsets of attributes and patterns of interest, creation of separation and decision-making rules, features, regression model development, etc.); (6) interpretation of the Data Mining results and incorporation of these results into a target model; (7) testing and validation of the resultant model.
The steps of this process are usually iterative and interactive and are common to any KDD process. Nevertheless, from the algorithmic and implementation points of view, particular KDD processes may be implemented in very different ways. It is well known that no single universal approach is best. Moreover, the wider the area of possible utilization of an algorithm or approach, the lower its efficiency. Therefore, the successful solution of a particular application problem depends on taking into account the specifics of the domain and the task, combined with experience in KDD technology and Data Mining. Following this principle within the framework of the tasks set by the contract, we developed an approach that consists of the traditional steps of the KDD process but whose application reflects the following considerations:
- peculiarities of the goal of the task (prognosis of probability of failure of avionics);
- original statistical data available for diagnostic and prognostic model design (for example, TSMD-based records of cumulative exposure to environmental factors and operational conditions);
- the need for a highly dependable model-based prognostic procedure;
- the requirement of a reliable assessment of the probabilities involved in the calculation of the probability of failure of hardware, even if the amount of statistical data is small.
The Report is organized as follows.
In the first phase of the research, reflected in the Interim Report [IR-98], we considered the "history of abuse" specified by the vector of a unit's adverse exposures to operational and environmental conditions. This data model may be called the "Static Data Model" (SDM), because it does not take into account the history of failure development. It was reasonable to use such a simplified data model in order to focus the research on the mathematical aspects of the prognostics task.
Unfortunately, the SDM is not appropriate for developing a precise regression model aimed at forecasting the residual performance resource. It will be shown below that solving this task requires a model that reflects the history of failure development for a particular device, i.e. a model that makes it possible to specify the "trajectory" of failure development for a particular unit. It is reasonable to call such a model the "Dynamic Data Model" (DDM). The developed Dynamic Data Model is presented in Section 2.
Section 3 presents the developed variants of the regression model for forecasting the residual performance resource of the hardware, together with their comparative numerical analysis. The numerical results show that a traditional regression model, which does not take into account the history of failure development of the particular device, does not provide the required precision in forecasting the residual performance resource. In contrast, the regression model designed on the basis of the DDM of failure development appears to be much more advantageous. The corresponding regression model is described in Section 3. Additionally, this section contains the results of a numerical investigation of the regression procedure parameters to which the precision of the assessed residual performance resource is sensitive.
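To illustrate why the history of failure development matters for such a forecast, the following minimal Python sketch extrapolates a single unit's own degradation trajectory to a failure threshold. It is not the DDM-based regression model of Section 3: the linear trend, the scalar damage index, the threshold value and the names `fit_line` and `residual_resource` are all assumptions made for the example.

```python
def fit_line(sorties, damage):
    """Ordinary least-squares line through (sortie, damage) points."""
    n = len(sorties)
    mean_t = sum(sorties) / n
    mean_d = sum(damage) / n
    s_tt = sum((t - mean_t) ** 2 for t in sorties)
    s_td = sum((t - mean_t) * (d - mean_d) for t, d in zip(sorties, damage))
    slope = s_td / s_tt
    return slope, mean_d - slope * mean_t


def residual_resource(sorties, damage, threshold):
    """Sorties left until the fitted trend reaches `threshold`.

    Returns None when the fitted trend is not increasing, i.e. when the
    trajectory gives no basis for extrapolating to failure.
    """
    slope, intercept = fit_line(sorties, damage)
    if slope <= 0:
        return None
    sortie_at_failure = (threshold - intercept) / slope
    return max(0.0, sortie_at_failure - sorties[-1])


# Two hypothetical units with the same damage after sortie 3 but
# different histories: a static snapshot cannot tell them apart,
# whereas the trajectories lead to different forecasts.
steady_unit = residual_resource([0, 1, 2, 3], [0.0, 1.0, 2.0, 3.0], 10.0)
sudden_unit = residual_resource([0, 1, 2, 3], [0.0, 0.0, 0.0, 3.0], 10.0)
```

The point of the sketch is only that two units with identical current state can have different forecasts once their trajectories are taken into account; a real model would replace the straight line with whatever trajectory model the data supports.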
Section 4 presents the developed technology of the model-based prognosis system for assessing the probability of failure of avionics. These results were presented in detail in the Interim Report [IR-98]. Nevertheless, they are outlined here in brief and are extended by some new numerical results obtained with newly developed software. This section contains a brief description of the heuristic informativity criteria that are used in the general case for the preliminary selection of informative subspaces of low dimension. These procedures are demonstrated numerically for the two-dimensional case.
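The informativity criteria themselves are given in the Final Report and in [IR-98] and are not reproduced in this summary. As a rough illustration of what a preliminary selection of 2-d subspaces can look like, the sketch below ranks every pair of factors by a Fisher-like separation score between two clusters. The score, the additive combination over the two factors and all names are assumptions of the example, not the Report's criteria.

```python
from itertools import combinations


def mean_and_variance(values):
    """Mean and population variance of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, variance


def factor_score(cluster_a, cluster_b, k, eps=1e-12):
    """Squared distance between cluster means along factor k,
    scaled by the sum of the within-cluster variances."""
    mean_a, var_a = mean_and_variance([row[k] for row in cluster_a])
    mean_b, var_b = mean_and_variance([row[k] for row in cluster_b])
    return (mean_a - mean_b) ** 2 / (var_a + var_b + eps)


def rank_subspaces(cluster_a, cluster_b):
    """All 2-d subspaces (pairs of factor indices), best separated first."""
    n_factors = len(cluster_a[0])
    scores = [factor_score(cluster_a, cluster_b, k) for k in range(n_factors)]
    pairs = combinations(range(n_factors), 2)
    return sorted(pairs, key=lambda p: scores[p[0]] + scores[p[1]], reverse=True)


# Hypothetical realizations: factors 0 and 1 separate the clusters,
# factor 2 carries no information about the cluster.
healthy = [(1.0, 1.0, 5.0), (1.2, 0.8, 3.0), (0.8, 1.2, 4.0)]
failed = [(3.0, 3.0, 4.0), (3.2, 2.8, 5.0), (2.8, 3.2, 3.0)]
ranking = rank_subspaces(healthy, failed)
```

Summing per-factor scores ignores any joint structure within a pair of factors, so this kind of heuristic is suitable only as a first filter before the subspaces are inspected visually.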
A notion of a classification predicate is defined and a number of approaches to obtaining such predicates are proposed. We use a visualization technique that makes it possible for a developer to draw a separation bound of arbitrary form, approximated by a linear spline, and to generate the associated classification predicate automatically. Then we describe the main principle behind the design of decision trees and the associated probabilistic spaces that form a set of decision procedures. Since the major purpose of the model developed in this section is the assessment of the probability of failure of a hardware unit, Section 4 also considers how to improve the precision and reliability of this assessment using a small amount of experimental data together with experts’ knowledge. We present the numerical results as an example of an implementation of the outlined technology for the development of a model-based prognostic procedure for a particular avionics module.
Section 5 is devoted to the application of the Algebraic Bayes’ Network approach to diagnostics-related knowledge engineering tasks. The theoretical part of the section coincides in its main aspects with the respective material given in the Interim Report [IR-98]. In contrast to the Interim Report, it contains many more numerical results, calculated with additionally developed software.
Section 6 serves as the general conclusion of the research. It outlines the main results and the research contribution, and summarizes the most promising directions for future research in the framework of prognostics and related topics of Knowledge Discovery from Data (KDD) technology. These may be considered as topics of proposals for eventual future research.

Contribution of the Research
The research presented in this report is aimed at the development of mathematical models associated with the technology for an information-based health assessment system, and at numerical verification of these models. Until a real statistical database has been accumulated, we need a mathematical model to generate an adequate database that makes it possible to verify the developed algorithms and the corresponding technology and to do further research aimed at specializing the algorithms for a new concrete type of device. Of course, each new device will require the development of an "ad hoc" model. Nevertheless, the general principles and ideas for developing a task-related model may be borrowed from the Dynamic Data Model developed in this research and presented in Section 2.
However, the major task of this research is the development of a technology for the accurate assessment of the probability of failure of hardware, such as avionics, on the basis of its known "history of abuse" by environmental and operational factors, and for the assessment of the residual performance resource. The successful solution of both problems allows us to forecast the probability of failure during a forthcoming sortie and to assess the residual performance resource, thus providing a quantitative basis for mission planning and timely maintenance as well as for preventing emergencies. This application cannot be regarded as a conventional reliability problem, because classical reliability does not view exposure to specific environmental conditions and operational factors as a main cause of failures. Nor does the problem stated herein constitute a conventional prognostic task, because the failure may not occur at all. The problem statement considered in the Report was first formulated in the paper [Skormin et al -97]. Such a problem statement is prompted by the modern concept of maintenance known as "service when needed".
It is expected that the prognostic model presented in this Report is developed on the basis of information downloaded from dedicated monitoring systems of flight-critical hardware and stored in a database. Therefore, the stated problem belongs to the area of Data Mining and KDD tasks ([Frawley et al -91], [Matheus et al 93], [Fayyad et al-95-1], [Fayyad et al 95-2], [Bradley et al-98-1]). In terms of the established topics of Data Mining, prognostic model design is a classification problem ([Fayyad et al 95-1]). Classification problems of this kind are well known and have been investigated for at least four decades ([Fukunaga-72], [Patrick-72], [Tou et al -74], [Ryin-76]).
Nevertheless, a number of principal tasks of classification are still of great interest and deserve further investigation. For example, an active area of classification is the so-called problem of feature informativity and algorithms for feature selection, as well as methods and algorithms of learning for the synthesis of classification rules [Bradley et al -98-2]. In addition, there exist a number of problems, very important from the applications point of view, that still do not have efficient solutions. One example is the development of classification models based on Data Mining and KDD for the case when databases contain columns measured on both continuous and discrete scales.
Let us summarize the new results of this research, described in [IR-98] and in this Report, that constitute its main contribution to applied classification problem solving and to the area of Data Mining and KDD.
- The basis of the classification model proposed in the Report is formed by so-called classification predicates. The idea was first proposed by V. Skormin and L. Popyack in [Skormin et al -97]. The predicates are associated with subspaces of factors of low dimension, in particular with 2-d subspaces. Classification predicates are defined over the entire factor space. Each classification predicate divides the latter into two regions according to its truth values ("true" and "false") in such a way that each region contains mostly realizations of one of two clusters of data. A classification predicate is true within a region of the factor space bounded by a set of separation functions of two arguments, which are particular components of the entire factor space. For a 2-d separation bound, an efficient procedure resulting in optimal separation functions of arbitrary shape (including the non-convex case) and associated classification predicates is investigated. This procedure is based on the visualization of cluster projections onto arbitrary 2-d subspaces, and is implemented in an interactive software tool developed in the framework of this research. The procedure gives the user an opportunity to draw any separation rule manually, approximating it by a polygon, i.e. an arbitrary linear spline. Moreover, the regions established by this procedure may be multiply connected and non-convex. The user is only required to draw a separation bound; the software tool generates the associated classification predicate automatically.
- A decision tree-like model of the classification procedure is proposed and implemented in numerically efficient software. The peculiarity of the decision tree presented in the Interim Report [IR-98] and in this Report is that it is binary and consists of a number of ranked classification predicates. Each predicate is associated with a node of the tree and with the subset of database realizations that belong to the region of the factor space where the corresponding classification predicate is "true". The above regions of the factor space and subsets of realizations are ranked. The regions and the corresponding subsets of realizations associated with the leaves of a decision tree do not overlap, and together they cover the entire factor space and the set of realizations respectively. This last property provides an opportunity to introduce, in a natural way, a set of elementary events and the corresponding probabilistic space that constitutes a model for the assessment of the probability of failure of hardware, i.e. to solve the target task. In addition, the notion of a meta-tree used in this research and described in Section 4 makes it possible to solve the classification task for an arbitrary number of clusters of data without any change in the technology of the above decision tree design.
- The research presented in this Report is aimed at designing a model for reliable assessment of the probability of failure. A purely statistical approach is not sufficient to provide the necessary accuracy and reliability of failure prognosis when the amount of training and testing data is small. Therefore, this Report suggests using a number of different decision trees to form a more accurate collective decision. Each decision tree consists of a number of ranked classification predicates associated with 2-d subspaces of factors. The most important requirement is that each decision tree has to be associated with different subspaces of factors. Information redundancy is what makes it possible to enhance the accuracy of the assessment of the probability of failure. To employ the idea of redundancy, a special procedure of joint processing of the decisions obtained by individual decision trees is investigated. It is based on the concept of so-called "Algebraic Bayes’ Networks" developed by the author ([Gorodetski-92], [Gorodetski et al -97]). In effect, the estimate is improved by utilizing background knowledge. This method is demonstrated numerically. In addition, using interval mathematics methods to calculate the posterior probability of failure on the basis of Bayes’ formula makes it possible to obtain the precise (least) upper bound of the target probability.
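The role of interval calculations in Bayes’ formula can be illustrated with a small Python sketch. It assumes the simplest possible setting, which is an assumption of the example rather than the Algebraic Bayes’ Network procedure itself: a prior probability of failure and two conditional probabilities of the observed evidence are each known only as an interval, and the three quantities vary independently. The function names are hypothetical.

```python
def posterior(prior, like_fail, like_ok):
    """Bayes' formula for P(failure | evidence) with point probabilities.

    prior     -- P(failure)
    like_fail -- P(evidence | failure)
    like_ok   -- P(evidence | no failure)
    """
    numerator = prior * like_fail
    return numerator / (numerator + (1.0 - prior) * like_ok)


def posterior_bounds(prior, like_fail, like_ok):
    """Tight lower and upper bounds of the posterior over a box of inputs.

    Each argument is an interval (low, high) with 0 < low <= high <= 1.
    The posterior increases with prior and like_fail and decreases with
    like_ok, so its extremes are reached at corners of the box.
    """
    for low, high in (prior, like_fail, like_ok):
        if not 0.0 < low <= high <= 1.0:
            raise ValueError("each interval must satisfy 0 < low <= high <= 1")
    lower = posterior(prior[0], like_fail[0], like_ok[1])
    upper = posterior(prior[1], like_fail[1], like_ok[0])
    return lower, upper


# Hypothetical interval estimates obtained from a small sample.
lower, upper = posterior_bounds((0.1, 0.2), (0.7, 0.9), (0.1, 0.3))
```

Because the bounds are attained at corner points, they are the tightest ones available under the independence assumption; additional consistency constraints between the estimates could only narrow the resulting interval.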
It should be noted that the classification rule development technique presented could be efficiently used in a wide area of applied tasks of Data Mining and KDD.
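A minimal Python sketch of the two ideas summarized above may be helpful: a classification predicate defined by a polygon drawn in a 2-d subspace, and a ranked sequence of such predicates whose leaves partition the data so that a failure frequency can be attached to each leaf. The ray-casting test, the reading of the ranked predicates as a simple chain (a degenerate binary tree) and all function names are assumptions of the sketch; the Report's software is not reproduced here.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test; `polygon` is a list of (x, y) vertices and may
    be non-convex. Points exactly on the boundary are not handled
    specially."""
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def make_predicate(i, j, polygon):
    """Predicate over the whole factor space that looks only at the
    2-d subspace spanned by factors i and j."""
    def predicate(realization):
        return point_in_polygon(realization[i], realization[j], polygon)
    return predicate


def leaf_index(realization, ranked_predicates):
    """Rank of the first predicate that is true; the last leaf collects
    everything else, so the leaves do not overlap and cover the space."""
    for rank, predicate in enumerate(ranked_predicates):
        if predicate(realization):
            return rank
    return len(ranked_predicates)


def leaf_failure_frequencies(data, failed, ranked_predicates):
    """Observed failure frequency among the realizations in each leaf."""
    counts = {}
    for realization, is_failed in zip(data, failed):
        leaf = leaf_index(realization, ranked_predicates)
        total, failures = counts.get(leaf, (0, 0))
        counts[leaf] = (total + 1, failures + (1 if is_failed else 0))
    return {leaf: f / t for leaf, (t, f) in counts.items()}


# Hypothetical L-shaped (non-convex) region drawn by a user.
l_shape = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
in_l_shape = make_predicate(0, 1, l_shape)
```

With only a few realizations per leaf, such frequencies are rough estimates, which is the situation the collective decision over several trees is meant to improve.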
- The concept of a classification predicate defined over subspaces of low dimension makes it possible to develop a totally new approach to the task of rule extraction from databases in the most complex cases. One such case is a database containing columns specified on both continuous and discrete scales. It is well known that this task is currently one of the key problems of Data Mining and KDD. The known and conventionally used approaches to creating a knowledge base by "mining" a database containing both continuous and discrete data are based on direct discretization of the continuous (real-valued) data, which results in a substantial increase in the dimension of the factor space. Consequently, such an approach leads to inefficient algorithms and cannot be regarded as satisfactory even if the discretization is made optimally.
As an alternative, an approach based on the concept of a classification predicate makes it possible to avoid artificial discretization altogether. In effect, a classification predicate defined over a subset of continuous factors (features) can itself be considered a discrete specification of continuous data, i.e. a new feature that represents the same data in a new way. There exist a number of known approaches to the task of extracting rules from a database with columns specified on discrete scales (see, for example, [Michalski et al -81], [Quinlan-83], [Michalski-90]). Hence, the concept of a classification predicate makes it possible to solve a number of difficult Data Mining and KDD problems in an efficient new way.
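The following short Python sketch shows what treating a classification predicate as a new discrete feature can look like for a record with mixed columns. The column names, the linear separation function and the helper `to_discrete_record` are hypothetical and serve only to illustrate the idea.

```python
def to_discrete_record(record, predicates, discrete_columns):
    """Replace the continuous columns of `record` by the truth values of
    the classification predicates and keep the discrete columns as-is."""
    features = {name: predicate(record) for name, predicate in predicates.items()}
    for column in discrete_columns:
        features[column] = record[column]
    return features


# One hypothetical predicate over the 2-d subspace (temperature, vibration):
# true on one side of a straight separation line.
predicates = {
    "hot_and_vibrating": lambda r: r["temperature"] + 2.0 * r["vibration"] > 10.0,
}

record = {"temperature": 7.5, "vibration": 2.0, "unit_type": "B", "repaired": True}
discrete = to_discrete_record(record, predicates, ["unit_type", "repaired"])
```

The resulting record contains only discrete values, so it can be passed to a rule extraction method designed for discrete scales, and the factor space grows by one binary feature per predicate rather than by one feature per discretization bin.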
One more original approach to the task of rule extraction from learning data was proposed by the author of this Report [Gorodetski et al -96]. It is described in brief in the Appendix of the Interim Report [IR-98].
- The Algebraic Bayes’ Network theory developed by the author and presented in the Interim Report [IR-98] and in this Report possesses a number of advantages with regard to the application that is the subject of this research and to the wider area of Knowledge Engineering. Its area of application is dealing with uncertain and sub-defined data, including experts’ knowledge.

Proposals for Future Research
This research may be considered as a step in the development of a technology for information-based health assessment system design. Of course, it leaves unsolved a number of problems associated with this very important and difficult task. However, a number of theoretical and applied problems can be pointed out that, in my opinion, have to be a subject of research in the framework of health assessment system design. They are as follows:
- Mathematical model for mining knowledge from database of multi-scale data structures.
In comparison with the model developed in this research, the proposed research aims at the development of a mathematical model of a technology that integrates (1) processing of a real-valued database, resulting in classification predicates, and (2) processing of discrete-valued data, resulting in the extraction of rules from both real-valued and discrete-valued data. This research may aim at developing a mathematical basis for an advanced data mining technology applicable to a wide area of information-based health assessment systems.
- Development of a software tool prototype aimed at supporting an interactive technology of information-based health assessment systems.
Development of such a tool could make it possible to investigate numerically the pros and cons of any mathematical basis, its advantages and deficiencies. This tool may be used to prepare future developers to implement the technology. It might be a first step towards the development of a powerful multi-purpose software tool for use in the area of information-based health assessment system design.
- Advanced statistical and logical models for extracting sensitive patterns from database.
These models aim at solving such practically important prognostics-related tasks as:
- ranking particular environmental conditions as factors responsible for general and particular types of failures,
- determination of particular groups of environmental conditions ("patterns") and assessment of their combined effects on failures in general and on particular types of failures,
- justification of the development of devices protecting from adverse environmental conditions,
- development of the recommendations on the avoidance of the combined effects of adverse conditions.
Conventionally, these tasks are solved by methods of mathematical statistics that use ideas from component and factor analyses. But the latter are not appropriate in many practical situations. In addition, these tasks involve both continuous and discrete factors, which calls for the development of a more powerful, mathematically justified approach. It should be noted that this problem is now the subject of intensive research in the Data Mining area.
Project Principal Investigator
Chief Scientist of the St. Petersburg
Institute for Informatics and Automation
Ph.D. Prof. Vladimir Gorodetsky.
