Autonomous Intelligent Systems: Agents and Data Mining Workshop
Projects # 1994P

I. General information

1. Title of the Project

Formal Methods for Information Protection Technology

2. Project Manager

Karsaev Oleg Vladislavovich
St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, 39, 14th Liniya, St. Petersburg, 199178, Russia
Telephone: +7-(812) 328-54-11;
Fax:+7-(812) 328-0685/328-44-50;
E-mail: ok@mail.iias.spb.su

3. Participating Institutions

St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, 39, 14th Liniya, St. Petersburg, 199178, Russia
Telephone: +7-(812) 328-44-50;
Fax:+7-(812) 328-0685/328-44-50;
E-mail: spiiran@mail.iias.spb.su
Name of individual empowered to make commitments on behalf of the Institute (Signature Authority):
Yusupov, Rafael Midhatovich, Professor, Director, SPIIRAS
Telephone: +7-(812) 328-44-50; Fax:+7-(812) 328-0685/328-44-50;
E-mail: yusupov@mail.iias.spb.su

4. Partner

European Office of Aerospace Research and Development, 223-231 Old Marylebone Road, London, NW1 5TH, United Kingdom
Name of individual empowered to make commitments on behalf of the Parner (Signature Authority):
Roy Phillips
Program Manager, International Research
Tel: +44-20-7514-4950 Fax: +44-20-7514-4960
E-mail:roy.phillips@eoard.af.mil
Contact Person:
Dr. Christopher E. Reuter
Tel: +44-20-7514-4474 Fax: +44-20-7514-4960
E-mail: chris.reuter@london.af.mil

II. Specific information

1. Introduction and Overview

The problem of information security is currently recognized as one of the most complex problems, and its importance is growing coherently with increasing network connectivity, size, and implementation of new information technologies. The use of open computer networks (including the Internet) as the environment for an information exchange in various distributed information systems is subjected to real threats to the information resources used in such systems. A multitude of widely used TCP/IP protocols was developed as a basis for information interaction over a quarter of a century ago. Having been available for some period of time from DARPA, the Internet actually grew into a research community and incorporated the traditions of openness of the academic world. As a result, the main concepts of the TCP/IP protocols are inconsistent with a number of the requirements of modern computer networks security. Furthermore, the typical weaknesses of the TCP/IP protocols implementation is inherited by modern network operating systems. Many leading software companies pursue the tough policy to protect their "know-how" information and utilize "double standards" to win a market competition. The latter results in hiding the important items of information about the architecture of the complex programs, the resistance of cryptography algorithms, etc. under a signature stamp of corporate secrets The complexity of software inevitably results in building software with inadvertent errors in a rigorous development cycle.

These factors reflected in network based attacks. These attacks use the vulnerability of the computer network infrastructure, basic protocols of the Internet, operating systems and applications vulnerability. Now malefactors turn out to be armed with the powerful and dangerous means to attack information resources as far as the existing vulnerabilities are not detected, published, and fixed by independent investigators. Therefore, if the security function of a network and its resources is realized solely as a function of the operating system itself, the malefactor can effectively employ various threats breaking integrity, confidentiality and availability of information resources of the network. Currently, scientific efforts are undertaken mainly in the development of the network defense mechanisms. Unfortunately, substantially less attention is paid to study of attacks and, in particular, remote distributed attacks. Neither appropriate tool for attacks simulation, nor research on formal frameworks for attacks specification are developed. That is why many studies only deal with an overview of attacks or over-simplified, or particular case models of attacks that, as a rule, are represented by a limited number of attacks cases [1-29].

The first objective of the Project is to develop the formal framework for modeling and the software tool for simulation of distributed attacks in a wide spectrum, and to explore their advantages as applied to the network assurance systems design and validation.

Attack simulation tool can be very useful at least from two points of view. On the one hand, any developed network assurance system must be validated via simulation of its performance in a real-life or artificial environment. The key component of the artificial environment is exactly the attack simulation tool. That is why the first important role of the attack simulation tool is to support the analysis of properties of any developed network assurance system. On the other hand, attack simulation tool can be a powerful aid for the network assurance system design. Indeed, according to the modern standpoint, intrusion detection system must be knowledge-based. Its knowledge base must comprise information about hundreds types of attacks. But the number of attack types grows permanently, and new attacks become more and more sophisticated. Therefore knowledge base must be easily extendible via learning new types of attacks. The last requirement is recognized as the key one in the intrusion detection systems design. Intrusion detection learning system must function based on the interpreted audit data that must play the role of the learning and testing data. Unfortunately, as a rule, any new attack is represented by a unique case or by a very limited number of cases that makes impossible any efficient learning. This leads to the necessity of using mathematical model and attack simulation tool to generate cases of a new type of attacks based on the known cases and experts' knowledge.

In view of a large variety of security objects and types of simple and distributed attacks, NSS should be adaptive and extendible. Such abilities for NSS must be provided by a powerful intelligent learning subsystem (LS). Since NSS and LS are operating in an unpredictable environment, both of them must be able to get easily adapted to the environment. According to the modern view, agent-based architecture is the most promising way to provide the required properties for NSS and LS. Mathematical foundation, agent-based architecture, and principles of implementation of the Learning Systems operating in a parallel with NSS form the second objective of the project.

Although the research in this area is being done for several last years by research teams headed by S.Stolfo, S.Forrest, etc. [30-49], it might be reasonable to admit that an elsewhere-accepted view on many aspects of such LS development, design and implementation is not developed yet. This problem is recognized now as one of the most difficult within the information security scope both in theoretical and design aspects.

Being currently developed the advanced learning technology applied to attack detection ([34,38,43,44]) considers LS as a multi-level hierarchical system that combines classification algorithms, link and sequence analyses of the audit data at the lower levels and meta-classification learning procedures at the higher levels of learning. The general scheme of the learning technology is relatively standard. First, the link and sequence analyses are used for extracting useful patterns from the preprocessed audit data. The latter results from pre-processing of the input traffic at different levels of a protocol like TCP/IP. Then on the basis of these patterns and using a rule discovery procedure a rule-based classification algorithm is designed. The latter forms the knowledge base, which together with the chosen inference mechanism constitutes a knowledge-based intrusion detection system. However, implementation of such a technology for the design of the knowledge base of NSS is a very difficult task. Indeed, audit data collected from a server on a daily basis may contain more than one million records of different lengths. Consequently, when providing both for more accuracy and for an appropriate computational complexity of the learning procedure and the resultant classifier we arrive to a formidable task.

The use of the agent-based architecture of learning component of network security system combined with modern achievements in the area of Data Mining and Knowledge Discovery should make it possible to overcome a number of drawbacks of the existing approaches to the problem under discussion.

Present state-of-the-art and tendencies in the area of information security determine the project objectives, problem statement and approaches to be used. The main objective of the project can be formulated as a reduction of the existing discrepancy between already developed and implemented approaches to NSS and modern requirements to the level of security while taking into account existing and future sophisticated types of known and unknown attacks on computer networks. The project aims to develop mathematical foundations, architecture and principles of implementation of subsystems of attack simulation and security components learning multi-level system.

2. Expected Results and their Applications

The project supposes to achieve two main results.

  1. Formal grammar-based framework, model and software tool prototype for simulation of distributed attacks on computer network;
  2. Mathematical foundations, basic algorithms, architecture, principles of implementation and software prototypes of multi-agent learning components of network security systems focused on supporting intrusions detection;

The main expected result of the first task of the project will be a formal framework, model, architecture and software tool prototype for simulation of distributed attacks on computer networks. This result will include specification of the representative set of distributed attacks; mathematical methods and techniques realizing the attack modeling; software prototype of the Attack Simulator implementing theoretical results of the research and its evaluation. In more detail, these results consist of the following:

  • scenario-based specification of the representative set of distributed attacks (description of a set of particular cases of attacks specified on the macro-level);
  • technique for case-based regenerating (inductive recovery) of the formal grammar specifying formal model of the attack of the given class;
  • stochastic model of a fragment of the attack on the micro level;
  • object-oriented project of the Attack Simulator–software tool prototype for simulation of attacks on the computer network;
  • Attack Simulator software prototype, results of exploration of the developed Attack Simulator and evaluation of its benefits in the network assurance system design and maintenance.

The main expected result of the second task should be the development of the formal model, architecture, and software prototypes of the basic components the intelligent Multi-agent Learning System. This system is intended to provide adaptability of the intrusion detection system to the unknown attacks against computer network. The following particular results will be received in the scope of the second task as follows:

  • learning task ontology,
  • allocation of learning tasks over generic learning agents and development of the architecture of their interaction within Multi-agent Learning System;
  • mathematical basis and algorithms realizing learning functionalities of the particular agents and software prototypes of the components of the Multi-agent Learning System based on theoretical results of the research;
  • simulation-based evaluation of the properties, advantages and disadvantages of the developed multi-agent model and architecture of the Multi-agent Learning system aimed to support adaptability and learnability of the Network Security System.

Although the Project does not pursue the development of any industrial technology, nevertheless, we intend to develop software using modern standard software tools like Visual C++, JAVA 2, SQL server, XML, etc. in order to provide a possibility for its further technological development.

The results to be obtained in the Project cannot be considered as invention or contain business confidential information.

Completion of the Project in the future should open a perspective of collaboration with commercial and/or industrial organizations in development of advanced technologies in the area of computer network security systems for the research team.

3. Technical Approach and Methodology

Task 1

Experts analysis of distributed attacks shows that malefactor plans each attack on the macro-level as a partially ordered set of steps. Each step aims at achieving a particular sub-goal, say, to break through a "security wall", get non-authorized access to some information, services, applications, etc. The partially ordered set of attack steps on the macro level is called a scenario of attack. In realization some steps of such a scenario may not be successful, while another ones may be successful. In principle, the total number of such steps of different purposes is not too large, and they can be realized by malefactors in diverse orders, in a repeatable mode, from different source computers, etc. They may intend to compromise different resources of the computer network. In any case, realization of a scenario is represented by a sequence of various lengths for successful and unsuccessful steps on the macro level. That is why the diversity of scenarios may be very high. To realize each particular step of the attack scenario the malefactor uses operations of low level. Thus, each such a step of a scenario may be represented as a sequence of low level commands, system calls, etc.

Following the aforementioned conceptual representation of the attack, the proposed research focuses on the two-level model of attacks. It is supposed that available learning information about attacks of different classes comprises the experts' information and limited number of cases.

The main sub-tasks of the theoretical research efforts associated with the first task of the Project are as follows:

The first sub-task concerns the macro-level of the attack modeling and simulation, i.e. modeling of the attack scenario. Availability of even a unique case of the attack scenario allows an expert to identify and anticipate the malefactor's intentions, a peculiarity of the attack implementation and to anticipate a variety of possible scenarios aimed to the same target. As a result of such an analysis, the expert can get a multitude of possible malefactor's activity scenarios that leads to the same target reaching. An expert can consider the aforementioned set of scenarios (partially ordered sequences of actions) as a set of macro level specifications ("macro level cases") of the respective attack. Each such a sequence can be considered as a "word" belonging to a formal language that, in turn, can be specified by a formal grammar. This grammar may be regenerated by formal methods on the basis of cases, and later it can serve as a formal model of such kind of attacks on the macro level. Based on experience, one can assert that the resulting formal grammar is not more complex than the stochastic (attribute) LL(1) grammar or even simpler. There exist a number of additional formal approaches to the case-based modeling of attacks, e.g., Markov’s Chains model. In the Project, the task of macro level modeling of known and unknown attacks is considered as the first sub-task that is key issue of this project.

The second sub-task concerns the micro-level of the attack modeling and specification. In terms of the lower (micro-) level, each step of the macro level scenario consists of a sequence of events (for example, system calls). Each of these sequences implements a single step of the attack scenario on the micro level. Modeling and simulation of the attack on the micro level also can be realized based on the expert analysis of the intruder's intention at each step of the attack. In some respects this task is similar to the one considered on the macro level. As a rule, intruder's actions on the micro level may slightly vary from the normal actions and be more noticeable on the macro level. This is an additional significant argument in favor of the necessity of the macro level attack modeling and simulation. Mathematically the formal model on micro level can be specified in terms of formal grammars or in terms of Markov chains. Moreover, on this level some probabilistic models must be used.

The primary advantages of the described approach are as follows:

  • Expert's information used for regenerating the two-level attack model makes it possible to fill a lack of the learning data (cases).
  • Use of the macro and micro levels of attack model simplifies the development of the intrusion detection system, because these models provide a portion of the information necessary to design a syntax analyzer to recognize (identify) a type of the intruder.
  • Multitude of the "words" given by the formal grammar regenerating procedure can be implemented later as a software tool.
  • The proposed approach implements an advanced idea of the syntactically driven data processing applied to the synthesis of the computer attack simulator.

See also Power Point presentation Task 1.

Annual Report Task 1 - 2002 (Russian, English)
Annual Report Task 1 - 2001 (Russian, English)

Task 2

From scientific point of view, the research concerning the second task plays the crucial role in network security scope. Research on this task within the Project supposes selection of adequate methods among existing ones and development of a number of specialized algorithms supporting case-based learning and aiming at a design and off-line updating of knowledge bases of the host-based intrusion detection agents and meta-agents of network security system.

In contrast with the conventional tasks of knowledge discovery from data (KDD), when a case is specified in terms of a vector of features (parameters, factors), cases specifying intrusions are mainly ordered sequences of audit data of various length specified in terms of possibly repeatable symbols. These symbols correspond to the preprocessed messages of the input traffic at a host port of the computer network. In case of the distributed attack as well as in case of the normal distributed users' activity, a set of such sequences specifies a situation. That is why the KDD task concerning intrusion detection is much more complex as compared to any conventional KDD task and is much less studied.

Thus, the input data associated with any user's activity is distributed over the network hosts and respective cases of audit data are formed on the basis of the entire network input traffic. The desirable output of a learning procedure is a distributed multi-level knowledge base specified in formal terms that makes it possible to classify reliably new messages of the input traffic to recognize normal activities as well as attacks.

The approaches we intend to explore are threefold. The first class of them is based on the specification of each input sequence (case) corresponding to a class of attacks or normal activity as a sentence of an unknown formal Context Free language. In this case the learning task may be reduced to a task of the case-based formal grammar recovery. Note that this approach overlaps with the Markov’s chain theory that can be combined within the learning procedure. The second class of approaches to be explored is based on statistical properties of the cases specifying normal and abnormal activity of the user(s) accessing the network resources. The third class of methods intends to solve the task of the rule extraction from the cases specified in terms of predicates given over higher level concepts like patterns.

This proposal concerns both architectural and mathematical issues of LS design applied to the attack detection tasks.

The first issue is a decomposition of the whole learning task into multitude of sub-tasks according to the ontology of attacks, and their allocation among specialized learning software agents. Each specialized learning agent is considered as the generic one able to extract particular class of dependencies (statistic dependencies, association rules, patterns, logical rules given over patterns, etc.) from data of a fixed format (event sequences, sets of patterns, subset of rules, etc.). Each particular agent located on a host and responsible for learning the particular attack detection is cloned from a specific class of generic agent when necessary. The last functionality is realized by the Cloning subsystem of LS. If necessary, the newly cloned agent could be tuned by a user who must be furnished with the respective software tool. Note that it is not intended to develop in any sense complete ontology of attacks. The ontology to be developed will be restricted by the needs of developing a case study to simulate attack detection learning system.

The second issue is a mathematical issue of the generic agent learning abilities. It is necessary to select or to develop a number of mathematical methods to cover the necessary and sufficient functionalities of generic agents solving learning tasks over data of various formats and aiming at extracting dependencies of needed classes. One more mathematical issue is to develop a multi-level interaction between the learning agents of various levels realizing meta-classification idea within multi-agent knowledge discovery from data architecture applied to the attack detection LS.

Conceptually, the proposed approach uses several simple ("light") interacting learners instead of one complex ('heavy") learner. Learning System is composed of a number of simple classifiers learned on the basis of their own training and testing data. Each such a classifier is called the "base classifier".. It must be computationally efficient though not very accurate. All basic classifiers are tested on a new audit data, and each testing record is submitted to all base classifiers. While performing classification of the same audit record, each base classifier makes its own decision about its category. These decisions are joined in a new record supplemented by the correct interpretation. These records form data for training the meta-classifier. Base classifiers and the meta-classifier interact during solving the intrusion detection tasks same way.

See also Power Point presentation Task 2.

Annual Report Task 2 - 2002 (Russian, English)
Annual Report Task 2 - 2001 (Russian, English)

Copyright ©2004 Intelligent Systems Laboratory, SPIIRAS, All rights reserved.
In case of any problems, please contact the webmaster.