ChAD: Change and Anomaly Detection
Discovering when a system is not behaving normally
We depend on complex systems every day. Manufacturing processes provide our material needs and support our economy. Combustion or other engines provide power and transportation. Communications networks allow us to share and obtain information. The more dependent on these systems we become, the greater the effect when there is a breakdown or failure. Think about how lost you felt the last time you lost your internet connection at work, or how difficult life is when there is even a brief water or power outage. We know nothing lasts forever, so we must emphasize preventive maintenance along with fault prediction, early fault detection, and expedited diagnosis and repair.
Frequently, simple gauges and alarms are used to indicate system performance problems. Your car’s dashboard will light up if your engine temperature rises too high or the oil pressure drops too low; computer network administrators set trap levels for certain types of traffic or events. But these systems tend to be informative only when a problem has occurred – when your car’s engine temperature light comes on, you have very limited time to pull off the highway before you incur serious damage. What is needed is a system that can track performance over time, and can spot subtle as well as dramatic changes from normal behavior and alert an operator in time to take preventive or corrective action with minimal downtown and concomitant damage.
The system to be tracked can be described by sensor readings. Various approaches have been able to use such data streams to detect problems. SHAI developed the ChAD system within the network management and security domain. In that domain we find network fault or intrusion detection approaches that require sensor data comprised of both normal and faulty examples ([Lee99], [Cabr01]), and others that attempt to discern anomalous behavior based solely on normal examples ([Eski00], [Warr99]). Much of the work in this area surrounds computer network attack, intrusion and misuse detection, and many of the virus and security software tools currently available work by comparing current observations to known "signatures" of attacks and intrusions. However, development of an effective system that requires only normal data is highly desirable, in computer network security as well as other domains, for several reasons:
SHAI’s ChAD (Change and Anomaly Detection) is such a system. ChAD uses normal data observations to model normal system operation and generate alarms when current observations fail to fit that model. We make use of a new development in data mining, the Concept-adaptive Very Fast Decision Tree learner (CVFDT) [Hult01], which has been shown to accurately detect and adapt to statistical changes in the data distribution and to operate robustly on data with noise. Laurie Spencer, who helped to develop CVFDT, led the design of ChAD, using CVFDT models to:
Moreover, we believe that, given sufficient data, our approach can also be used to assess wear-out rates, detect normal versus abnormal wear, or to predict particular kinds of faults. The following sections give an overview of CVFDT models and describe the ChAD system. We also present other components of SHAI’s network reliability monitoring and response system as it relates to the monitoring and fault prediction problems in other domains.
1. Modeling and fault predictionIn this section, we introduce CVFDT models and discuss our use of them for creating a baseline signature of "normal" for a modeled system and performing anomaly detection for fault prediction. We also outline extensions or additional applications of ChAD system.
1.1 Decision trees and CVFDT models
Table 1. Descriptions of mushrooms |
Decision trees are inductive models that are used for classifying data instances. For example, given the descriptions of mushrooms shown in Table 1, one might wish to learn characteristics distinguishing the class of mushroom, poisonous from non-poisonous. A decision tree similar to that shown in Figure 1 might be induced from the given data. In this case, the classification we wish to make is that of the attribute "poisonous" (with classes yes and no), and the decision tree splits on the other variables that tend to characterize the class. In this very small sample, one can see that all yellow mushrooms are poisonous and all white mushrooms are not; no other information is needed to reach a leaf node in the tree. Brown mushrooms cannot be classified by color alone. This tree further distinguishes the class by using the size attribute. Decision tree models can be used to classify new examples for which the class (the value of the poisonous attribute) is not yet known (although I certainly wouldn’t eat any mushrooms based on predictions made by this decision tree!). |
|
The CVFDT is a decision tree learning algorithm that was developed to address the fact that data collected over a period of time can suffer from "concept drift". Just as knowing the annual inches of rainfall in Seattle does not help you determine whether or not to pack your umbrella for a trip there next week, models that tend to average data over time may not be very informative about a system as it behaves right now. The CVFDT model is learned from data, but rather than treating it as though all the data were generated by a single concept (an unchanging system), it behaves as though the data were generated by a series of concepts. If white mushrooms in our example underwent some genetic mutation so that all white mushrooms became poisonous, a concept-adaptive model would soon "forget" that all white mushrooms are non-poisonous and would learn new ways of distinguishing the class from new examples, introducing new split-point nodes below the "color = white" node. If, over time, all white mushrooms seen are poisonous, the tree will shift to the most compact and efficient representation. |
Figure 1. Decision tree classifying mushroom by the |
In general, the CVFDT works by observing when enough new examples have been observed to determine that part of the current tree has become questionable. As this occurs, alternate subtrees are grown to replace portions of the tree that no longer appear valid. When the newly learned subtree becomes more accurate than the model’s corresponding subtree, that portion of the model is updated to reflect the change in the modeled system. The CVFDT paper by Hulten et. al. ([Hult01]) describes the algorithm in detail, including its statistical foundation and properties. In the experiments described in the paper, CVFDT models were able to successfully track concept drift and maintain model accuracy.
1.2 ChAD
This section describes how SHAI has extended CVFDT to create an anomaly detection system. Decision tree models themselves are not typically used to develop a signature of "normal" behavior of a system. They require labeled examples of the various possible classes in order to build a useful and accurate model. However, SHAI has developed a method of using the CVFDT algorithm to build this signature model in order to detect when the system deviates from normal or expected operation. Our approach can indicate not only when the modeled system is changing, but also can be quite informative as to the kind and severity of the change.
ChAD can be used to build a baseline signature model using data collected during normal operation of the system to be modeled. As the models are then used with new data to make fault predictions, they can provide useful insight into changes in the system’s operation. When the CVFDT models used by ChAD are stable, we know that the system is operating normally. When the CVFDT trees detect changes in the new data stream, ChAD can communicate:
The ChAD system can use this information to develop a "stability" score, from which might be developed an alert or alarm system. Feedback to the ChAD system should be used to tune this score.
SHAI believes that this solution provides the following benefits:
Because this is a new technology, we do not yet have any examples of its use, and we are not at liberty to discuss its implementation in more detail (see contact information at the end of this paper to request more information). Experiments reported in the CVFDT paper show the accuracy of that approach, and SHAI has a number of other successes in data mining projects (see section 2).
1.3 Other solutions
ChAD is being developed as part of an agent-based system for computer network management and security, funded by DARPA. The primary goal of that project is to ensure computer network reliability through fault prediction/recognition, diagnosis and response. Fault prediction and identification in that domain is similar to that in other domains, such as manufacturing processes, combustion engine or electric motor systems, and even health monitoring. None of these domains provides a complete set of data from which to characterize or learn faults. It would be far too costly to induce all possible faults to simply collect the data. In computer network security, it is not possible to even know all potential faults or attacks, as new vulnerabilities are constantly discovered. We must, therefore, be able to predict faults by observing normal operation and then recognizing important deviations from normal.
SHAI’s solution to the network reliability problem includes other interesting components. We intend to use ChAD for behavior pattern segmentation. Each system agent has a ChAD component that initially learns what constitutes "normal" for various network usage periods throughout the hours of the day and days of the week. ChAD itself can be used to learn this segmentation. Likewise, ChAD can be used to create and remember a snapshot of how the system behaves after maintenance and through a regular maintenance period. The sequence of snapshots can then be used to determine whether components are wearing abnormally, in order to predict wear-out that might occur prior to a regularly scheduled maintenance changeover.
Other components of interest include the use of heuristics and case-based reasoning (CBR). Eskin discusses the difficulty of ensuring that the "normal" data is truly normal – in that case, that it contains no network intrusions [Eski01]. His solution, treating all data as normal with a certain amount of noise introduced by the intrusions, was permitted because intrusions are rare compared to valid traffic. However, in a system tasked with identifying or predicting general failures or faults as well as intrusions, the same property does not hold. In fact, there may be faults and misconfigurations pre-existing in the network that might be deemed normal by the ChAD system. To address this possibility, SHAI’s network reliability solution employs a heuristics component to discern when normal, recurring behaviors are in fact detrimental to system performance. The reliability solution is designed as a response system as well as a monitoring and alarm system. A CBR component is being developed to interpret current observations and select actions to avoid or correct network problems. Any of these components, and in fact, SHAI’s network reliability system as a whole, has useful application to the monitoring and management of many complex systems.
2. SHAI Data Mining ApplicationsSHAI staff has a demonstrated interest and success in data mining applications. We have developed an interactive and user-centered data mining system that includes implementations of six data mining algorithms and "data aware" visualization techniques. Our Intelligent KnOwledge Discovery Assistant (IKODA) system's visualizations are interactive tools rather than simple information displays. . IKODA was developed for our client, AFRL/IFTD (Air Force). We have published several papers describing this work: [Goan00], [Goan99], [Goan98], [Goan97]. Using the IKODA framework, we have performed data mining tasks for two additional projects, FlexiMiner and CASAD.
FlexiMiner is an eclectic, integrated text/data mining system targeted specifically at assisting program managers, laboratory and center directors, etc. in making decisions that increase the probability that strategic technology requirements will be met in a timely fashion. FlexiMiner moves beyond the analysis of bibliographic databases to the mining of the copious free text documents available in a wide variety of forms, such as planning documents, project design specifications, technical reports, etc. FlexiMiner will enable users to construct analytic models that enable, for instance, the evaluation of potential actions (e.g., organizing a workshop, funding a particular investigator) with regard to the effect on the overall level of progress and/or maturity of desired technology. FlexiMiner uses data mining algorithms made available by IKODA, and was recently awarded an SBIR (Small Business Innovation Research) program Phase II contract for continued development. The client for FlexiMiner is US Army TACOM.
The CASAD (Clustering Activity Streams for Anomaly Detection) project uses an innovative machine learning approach to detecting anomalous Database Management System (DBMS) usage. We have devised a unique clustering-based approach to formulating data characterization models that can be used as the basis for detecting the misuse of data resources. By recognizing the deeper structure of monitored usage patterns, our CASAD system will be capable of accurately recognizing unique classes of database transactions and user behavior. Phase II funding for this project is pending. The client for CASAD is the Department of Commerce.
References
[Cabr01] João B. D. Cabrera, Lundy Lewis, Xinzhou Qin, Wenke Lee, Ravi K. Prasanth, B. Ravichandran and Raman K. Mehra. "Proactive Detection of Distributed Denial of Service Attacks using MIB Traffic Variables –A Feasibility Study," Proceedings of the 7th IFIP/IEEE International Symposium on Integrated Network Management, Seattle, WA - May 14-18, 2001.
[Eski00] Eleazar Eskin, Matthew Miller, Zhi-Da Zhong, George Yi, Wei-Ang Lee, Sal Stolfo. "Adaptive Model Generation for Intrusion Detection Systems" Workshop on Intrusion Detection and Prevention, 7th ACM Conference on Computer Security, Athens, GR: November, 2000.
[Goan00] Terrence Goan. "From Data to Actionable Knowledge: Applying Data Mining to the Problem of Intrusion Detection," International Conference on Artificial Intelligence.
[Goan99] Terrence Goan and Laurie Spencer. "Supporting Thorough Knowledge Discovery With Data-Aware Visualizations," Proc. of the International Conf. on Information Fusion, 1999.
[Goan98] Terrence Goan. "An Active and Intelligent Tool for Knowledge Discovery," Proceedings of the 6th INFORMS Computer Science Technical Section Conference on Computer Science and Operations Research: Recent Advances in the Interface.
[Goan97] Terrence Goan. "Supporting the User: Conceptual Modeling & Knowledge Discovery," Proc. of Conceptual Modeling: Historical Perspectives and Future Directions, 1997.
[Hult01] Geoff Hulten, Laurie Spencer, and Pedro Domingos. "Mining Time-changing Data Streams," Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 26-29, 2001, San Francisco. Available online at http://www.cs.washington.edu/homes/pedrod/kdd01b.ps.gz.
[Lee99] W. Lee, S. J. Stolfo, and K. Mok. "Data mining in work flow environments: Experiences in intrusion detection." In Proceedings of the 1999 Conference on Knowledge Discovery and Data Mining (KDD-99), 1999.
[Warr99] Christina Warrender, Stephanie Forrest, and Barak Pearlmutter. "Detecting Intrusions using system calls: alternative data models." In Proceedings of the 1999 IEEE Symposium on Security and Privacy, pages 133–145. IEEE Computer Society, 1999.