MASRR: Multi-Agent System for Resource Reliability

Maintaining network resource reliability using a decentralized
system of autonomous agents

Background
The rapid growth of computer networks over the past ten years has resulted in a highly complicated operating environment. Networks are susceptible to a variety of attacks and malfunctions capable of compromising reliability at all levels.  Whether network resource reliability is lost due to malicious behavior, device failure, administrative misconfigurations, or even congestion from valid traffic, the cost of this loss can be astronomical, in terms of lost productivity, the possible theft or destruction of sensitive documents, and civilian or military health, safety, and mission risks resulting from the loss of communication and control.

Computer networks are composed of an assortment of computer platforms, network devices, communication protocols, architectures and configurations. Networks are subject to a number of failures stemming from hardware and software faults, power outages, or demands that exceed capacity. The difficulty is compounded by vulnerabilities, known and unknown, that allow new attacks to be found and exploited.

In order to maintain network reliability in the face of this complexity, a software solution is needed that not only works across heterogeneous networks and inter-networks, but is also scalable to a variety of network sizes and able to continue functioning even when portions of the network are unavailable or have become compromised. SHAI introduces MASRR, a "Multi-Agent System for network Resource Reliability".


MASRR focuses on meeting the following criteria:
Scalability: The agents operate autonomously or semi-autonomously within a decentralized architecture, allowing MASRR to manage large networks with varying topologies. Additionally, agents employ message passing protocols that minimize bandwidth consumed by management operations.

Robustness: Each agent operates by observing the state of its own network element(s) and those of neighboring elements. An agent continues operating even when other parts of the network fail, adapting to failures and minimizing their effect by providing alternate routing, modifying security policies, or taking other remedial actions.

Interoperability: Agents are developed using platform-independent code. They are designed to operate within the current TCP/IP and SNMP framework. While they obtain information from their neighbors, they make decisions based on their observations and do not need to be concerned with implementation details of neighboring elements.

Pro-active management: MASRR agents use models of network element and user behavior and misbehavior to detect both known and unknown network-degrading events. Agents reason about their observations, assess the source of a fault, and select and execute actions to mitigate, contain, and recover from problems in the network. 


Innovations:
In developing MASRR, SHAI contributes to the development of these technologies: 

  • Decentralized network monitoring and maintenance. Agents respond to network-degrading events at the local level, using whatever information is available from peer MASRR agents. In addition, MASRR agents can perform tasks autonomously or semi-autonomously.
  • Reconciliation of differing local views. Agents reconcile their own views of the network with the views from peer agents, allowing each agent to utilize data from disparate sources in the selection of appropriate network maintenance and security actions.
  • Acting under uncertainty. MASRR agents will be able to select reasonable responses to network-degrading events even when the available information is incomplete. Unlike signature-based intrusion detection systems, an agent can therefore respond to situations it has never observed before.

Recent Developments:
The contract for this project has been completed. We are currently looking for potential partners who can provide domain expertise and are interested in pursuing additional research funding with us.

Summary: SHAI has developed an architecture to support distributed monitoring agents that share information and perform localized correlation and decision making. Lacking an actual network on which to deploy our agents, we have created a prototype system that uses the Scalable Simulation Framework (SSFNet) network simulator as its runtime platform. We have created a simple scenario to illustrate detection and resolution of a problem caused by an attack. The simulator also confirms the potential of an important project innovation, Change Detection modeling:

Change Detection modeling. A critical ability for MASRR agents is change detection -- noting when network traffic and application behaviors have departed from what is normal or expected. But this expectation varies throughout the day, the week, the month; moreover, the network itself changes as stations and devices are added, removed, and reconfigured. The appearance of "normal" is unique to each network element and thus differs from the perspective of individual agents. We must, therefore, build a system in which each agent can learn patterns for itself and adapt to changes, and yet detect when changes signal faults or attacks. SHAI is using recent advances in data mining to enable agents to model fluctuating, normal patterns of behavior and to classify current observations. When the agent encounters anomalous behaviors, additional diagnostics and reasoning are performed as the agent decides on a course of action.

  • Technical report on SHAI's Change and Anomaly Detection (ChAD) system [html].

  • Presentation on ChAD [PowerPoint]. The presentation includes detailed speaker notes; if you are viewing it as a PowerPoint Show, click the control in the lower left corner and use the menu to view the notes.
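The idea that "normal" varies by time of day and day of week, and that each agent must learn it for its own element, can be illustrated with a minimal sketch. This is not the ChAD method, which the text describes only as based on data mining; it substitutes a plain per-hour-of-week running mean and standard deviation. The class name SeasonalBaseline, the threshold of 3 standard deviations, and the minimum sample count are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from datetime import datetime


class SeasonalBaseline:
    """Running statistics for one metric, kept separately per hour of the week.

    Hypothetical illustration only: a z-score test, not SHAI's ChAD models.
    """

    def __init__(self, threshold=3.0, min_samples=5):
        self.threshold = threshold        # assumed cut-off, in standard deviations
        self.min_samples = min_samples    # history needed before judging a slot
        # slot -> [count, mean, sum of squared deviations]
        self.stats = defaultdict(lambda: [0, 0.0, 0.0])

    @staticmethod
    def slot(ts):
        # 168 slots: Monday 00:00 is slot 0, Sunday 23:00 is slot 167.
        return ts.weekday() * 24 + ts.hour

    def is_anomalous(self, ts, value):
        n, mean, m2 = self.stats[self.slot(ts)]
        if n < max(self.min_samples, 2):
            return False  # too little history for this slot to judge
        std = math.sqrt(m2 / (n - 1))
        if std == 0.0:
            return value != mean
        return abs(value - mean) / std > self.threshold

    def update(self, ts, value):
        # Welford's online update of the mean and squared deviations.
        s = self.stats[self.slot(ts)]
        s[0] += 1
        delta = value - s[1]
        s[1] += delta / s[0]
        s[2] += delta * (value - s[1])

    def observe(self, ts, value):
        """Classify a value, and absorb it into the baseline only if it looks normal."""
        anomalous = self.is_anomalous(ts, value)
        if not anomalous:
            self.update(ts, value)
        return anomalous
```

An agent would call observe() for each new measurement, such as packets per second on its own interface. Because anomalous values are not absorbed, a sustained attack does not quietly become the new baseline, while gradual legitimate change still shifts the statistics over time.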


    Artificial Intelligence (AI) technologies:
    Agents
    MASRR is designed to scale to networks of widely varying size, replacing centralized management (a network bottleneck) with a decentralized system of agents that observe network status and respond accordingly. Agent communication is similar to the link-state model: agents share their own views of the network without excessive or redundant polling. Agents act locally to correlate information and select appropriate actions. MASRR agents execute in secure environments; they are not mobile code.
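The link-state-style sharing of views described above might look roughly like the following sketch. The Observation record, its field names, and the "newest observation wins" rule are assumptions made for illustration; the source does not specify MASRR's message format or reconciliation rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    element: str        # network element being described (hypothetical naming)
    status: str         # e.g. "up", "down", "congested"
    observed_at: float  # time of the observation, in seconds
    reporter: str       # agent that made the observation


def merge_views(local, received):
    """Merge a peer's view into the local one.

    Both views map element name -> Observation. The newer observation wins
    (an assumed rule). Returns the merged view and the list of records that
    actually changed it, which are the only ones worth forwarding onward.
    """
    merged = dict(local)
    changed = []
    for element, obs in received.items():
        known = merged.get(element)
        if known is None or obs.observed_at > known.observed_at:
            merged[element] = obs
            changed.append(obs)
    return merged, changed
```

Forwarding only the records that changed the local view is what keeps the exchange free of redundant traffic in this sketch: a view that contains nothing new produces an empty list and is not propagated further.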

    Case Based Reasoning
    Agents use case-based reasoning to determine whether a fault or attack might be occurring and to select the best possible action to repair, mitigate, or circumvent the problem. Such an approach is useful for identifying anomalies of both known and unknown types, based on their symptoms. Because a fault can cause more than one set of symptoms, and a set of symptoms can have multiple causes, agents correlate evidence indicated by their neighbors' views of the network. Scenarios form "cases", consisting of the network status and the steps the agent should take in a similar situation.
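As a rough illustration of case retrieval, the sketch below stores each case as a set of symptoms plus the steps to take, and picks the stored case most similar to the combined local and neighbor evidence. The Case structure, the Jaccard similarity measure, the 0.5 cut-off, and every symptom and action name are assumptions; the source does not describe how MASRR represents or matches cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    name: str
    symptoms: frozenset  # network status that characterized the scenario
    actions: tuple       # steps the agent should take in a similar situation


def jaccard(a, b):
    """Overlap between two symptom sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(case_base, observed, min_similarity=0.5):
    """Return (best matching case, similarity), or (None, similarity) if nothing is close."""
    best, best_score = None, 0.0
    for case in case_base:
        score = jaccard(case.symptoms, observed)
        if score > best_score:
            best, best_score = case, score
    if best_score < min_similarity:
        return None, best_score
    return best, best_score


# Hypothetical case base with invented symptom and action names.
CASE_BASE = [
    Case("link failure",
         frozenset({"neighbor_unreachable", "route_withdrawn"}),
         ("reroute_traffic",)),
    Case("flood attack",
         frozenset({"high_packet_rate", "many_half_open_connections", "high_cpu"}),
         ("rate_limit_source", "notify_administrator")),
]
```

Passing the union of an agent's own symptoms and those reported by its neighbors to retrieve() reflects the correlation step: a single local symptom may be too ambiguous to match any case, while the combined evidence can single one out.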

    Data Mining
    SHAI utilizes data mining tools to build models of normal network usage and behavior. Our important innovation is the ability to develop this model of normal behavior using real data collected in real time. Typical systems seeking to classify normal behavior have required extensive, and expensive, data collection and cleaning to create a set of labeled examples from which normal behaviors can be distinguished from abnormal ones. Aside from the expense of creating this training data, the models may fail when confronted with abnormalities not represented in the data. In the network management and security domain, this means they would fail to detect new attacks or faults.
    MASRR agents compare their current observations to the learned models and draw conclusions about network performance. The models also serve as an explanation facility, so that system administrators can not only learn about their current network configurations, but also offer feedback to the agents. 

    Machine Learning
    Each agent keeps a history of its network observations and its actions. This information will be used for periodic, off-line learning of which actions were the most successful under which circumstances. Thus the agent can become more effective over time.
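One very simple form this off-line learning could take is a tally of how often each action succeeded in each kind of situation. The sketch below assumes the history is a list of (situation, action, succeeded) records and that success can be judged after the fact; both are assumptions, as are the function names. The source does not say which learning method MASRR uses.

```python
from collections import defaultdict


def learn_action_success(history):
    """Compute the success rate of each (situation, action) pair.

    history: iterable of (situation, action, succeeded) tuples, a
    hypothetical record format chosen for this illustration.
    """
    counts = defaultdict(lambda: [0, 0])  # (situation, action) -> [successes, trials]
    for situation, action, succeeded in history:
        entry = counts[(situation, action)]
        entry[0] += int(succeeded)
        entry[1] += 1
    return {key: wins / trials for key, (wins, trials) in counts.items()}


def best_action(success_rates, situation):
    """Return the action with the highest success rate, or None if the situation is unseen."""
    candidates = {a: r for (s, a), r in success_rates.items() if s == situation}
    return max(candidates, key=candidates.get) if candidates else None
```

Running the tally periodically over the stored history, rather than after every action, matches the off-line character of the learning described above and keeps the agent's real-time path free of extra work.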

    Time Series Prediction
    In order to respond rapidly to an attack or other failure, each agent uses correlated evidence to predict the most likely cause and head off damage. For example, agents detecting a compromised router can quickly act to reroute traffic. Doing so remedies the situation before further network degradation can occur, isolates the compromised element from the rest of the network, and contains the damage.


    Ongoing development of MASRR is funded by a DARPA contract under the Small Business Innovation Research (SBIR) program.

    For more information on MASRR, please contact the project manager, Lynn Jones.
