Lynn Jones and Tamitha Carpenter
The rapid growth of computer networks over the past ten years has resulted in a highly complicated operating environment susceptible to a variety of attacks and malfunctions capable of compromising network reliability at all levels. Whether network resource reliability is lost due to malicious behavior, device failure, or common misconfigurations, the cost of this loss can be astronomical, both in terms of lost productivity and the possible theft of sensitive documents.
Certainly networks are vulnerable to intrusions and attacks. Computer "crackers" continue to find new ways to exploit weaknesses in both hardware and software; security patches themselves have been known to introduce new bugs. Additionally, the internet permits attack and intrusion scripts, viruses, and other tools to be disseminated as rapidly as their solutions. The availability of these resources expands the number of potential attackers, as it enables those possessing only modest computer skills to wreak havoc on networks.
Even in a world with no malicious computer users, maintaining network system reliability is a difficult problem. Modern computer networks involve many different computer platforms and network elements, interacting over a variety of network protocols as well as assorted implementations of those protocols, all combinable in a virtually infinite number of configurations. These complicated environments lead to two major obstacles to maintaining network reliability: 1) ensuring that every host and every network element is properly configured at all times is a monumental task and is rarely, if ever, accomplished; and 2) any software solution for detecting and responding to network reliability issues must be sensitive to fluctuations in normal or valid usage that exist in these networks.
The task of maintaining network health is compounded by the difficulty of accurately diagnosing problems once symptoms are observed. There is no exact correspondence between network problems and their underlying causes. Faults may manifest themselves in a variety of ways, and observable symptoms may have a number of possible causes. Faults may be intermittent and difficult to consistently reproduce. Relatively minor faults can persist undetected, exacerbating and masking the causes of larger problems that might occur.
A variety of network monitoring and management tools exist. Some of these are device- or vendor-specific, although integration of management tools is improving. Most management tools perform only monitoring and reporting, presenting information to network administrators who must further diagnose and correct problems. Some management tools, especially network security tools such as intrusion detection systems, incorporate evidence correlation into monitoring. Many of these systems rely on "signatures" or descriptions of known attacks or faults and are less effective at diagnosing events that have not been previously encoded into the software. Most of these applications employ a centralized processing module that collects and analyzes data from monitoring points throughout the network.
Software tools that can automatically maintain network health are highly desirable. The supply of experienced and knowledgeable network administrators does not meet growing demand. The amount of data to be tracked is too great for the unassisted human to manage. Network service providers are contracting "service level agreements," guaranteeing not just connectivity but quality as well, requiring well-tuned networks and nearly immediate corrective action. A current research goal is not just to diagnose but to predict faults, attacks and intrusions, so that timely intervention can minimize or prevent the effects.
Some characteristics of network management practices actually stand as obstacles to automated response, however. Centralized decision making, be it automated or on the part of a human operator, cannot scale well to support large networks, both because of the size of the data to be analyzed and because the system relies on network performance to collect the data. Performance of applications that rely on "signatures" for fault or attack detection is limited to what they have been programmed to recognize, virtually ruling out their customization to unique network topologies and leaving them unable to detect changes in network behavior that indicate a problem.
Stottler Henke Associates, Inc. (SHAI) proposes a unique system for maintaining network resource reliability through a decentralized multi-agent architecture. Our approach differs from existing computer security and network management tools in a number of ways. First, our system will not rely on the presence of a central analysis unit. Instead, individual agents will monitor individual hosts and/or local portions of the network, communicating with one another when needed, and autonomously or semi-autonomously applying network maintenance measures where and when appropriate. This design is not only easily scalable; it also enables the agents to respond locally even when much of the network is compromised. Second, unlike many intrusion detection tools that have hard-coded responses to known attacks, our system will be able to respond to known attacks, recognizable network degradations (e.g., router misconfigurations), and unexplained anomalies. Third, our agents will be able to reconcile differing local views through data fusion, allowing them to respond with increasing accuracy to events that degrade network performance as information is collected and aggregated from other agents.
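As an illustration of the third point, the following Python sketch shows one simple way that differing local views could be fused. It is not MASRR's actual method: the confidence-weighted vote, the function name fuse_views, the agent names M1 and M2, and the confidence values are all assumptions made for this example.

```python
from collections import defaultdict


def fuse_views(reports):
    """Combine per-agent reachability reports into one score per target.

    `reports` maps an agent name to {target: (reachable, confidence)},
    where confidence is a weight in (0, 1]. The result maps each target
    to a score in [-1, 1]; a positive score means the weighted evidence
    favors "reachable", a negative score favors "unreachable".
    """
    totals = defaultdict(float)
    weights = defaultdict(float)
    for view in reports.values():
        for target, (reachable, confidence) in view.items():
            totals[target] += confidence if reachable else -confidence
            weights[target] += confidence
    return {t: totals[t] / weights[t] for t in totals if weights[t] > 0}


# Hypothetical local views: two agents cannot reach subnet 4, while the
# agent on the router still sees traffic arriving from it.
reports = {
    "M1": {"subnet4": (False, 0.5), "G": (True, 1.0)},
    "M2": {"subnet4": (False, 0.5), "G": (True, 1.0)},
    "MF": {"subnet4": (True, 0.5)},
}
fused = fuse_views(reports)
```

A score near zero signals conflicting evidence, which in this sketch would be a cue to gather more reports before acting rather than a verdict in itself.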
A real-world scenario
In this article, we outline a real-world network problem, its diagnosis and solution. We explain how our "Multi-Agent System for Network Resource Reliability" (MASRR) would behave in the same situation. We then contrast performance of the two approaches.
Lou Steinberg, one of the authors of the first network Management Information Base (MIB) standard, describes this difficult-to-diagnose network problem:
"The obvious problems aren’t always as they appear. One particularly tricky debug session was started without enough data. We clearly had a routing protocol problem because packets were being forwarded to systems that should not have received them. We spent a lot of time trying to fix the wrong problem.
"As luck would have it, we had t1 cards with defective memory. The cards held a cache of the routing table. Values written to the card could be immediately read back, but tended to change over time. This placed junk in the forwarding table. It wasn’t until we had more data, changes in routing behavior without any routing updates, that we began to properly debug the error."
Steinberg does not outline the symptoms of this problem, nor does he list the steps taken in debugging it. But we can imagine network users complaining of application errors and a “slow” network due to the retransmission of packets. Such reports would have inspired, among other investigations, a review of packet traces, which would have shown that packets were being incorrectly routed. This information would have prompted examination of the routing tables and routing update messages, which would have shown that the contents of one network card’s routing table cache did not correspond to the routing messages flowing through the network. At this point, the administrators could test and prove defective memory on the card. Note, however, that Steinberg indicates that the path the administrators actually took to the diagnosis was not that direct.
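The decisive evidence in Steinberg's account, a change in forwarding behavior without any routing update, amounts to a comparison between the routing table and the cache held on the card. The Python sketch below illustrates that comparison under stated assumptions: the function name, the dictionary representation of both tables, and the example prefixes and next-hop labels are hypothetical, and a real router would expose this data through device-specific means.

```python
def find_cache_drift(routing_table, card_cache, updates_since_sync):
    """Return destinations whose cached next hop disagrees with the table.

    Entries are flagged only when no routing update has arrived since the
    cache was last synchronized, because a pending update would be an
    innocent explanation for a mismatch.
    """
    if updates_since_sync > 0:
        return []
    return sorted(
        dest
        for dest, next_hop in card_cache.items()
        if routing_table.get(dest) != next_hop
    )


# Hypothetical data: the card's copy of one route has silently changed.
routing_table = {"10.0.1.0/24": "G", "10.0.4.0/24": "if4"}
card_cache = {"10.0.1.0/24": "G", "10.0.4.0/24": "if2"}
drifted = find_cache_drift(routing_table, card_cache, updates_since_sync=0)
```

In this example drifted contains only "10.0.4.0/24", the one entry that changed with no update to account for it.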
MASRR’s response to such a situation:
MASRR agents are programmed to reason about problem symptoms and their resolutions. This reasoning is represented in “cases” composed of one or more symptoms along with the appropriate actions to take for diagnosis and correction. Steinberg’s scenario, however, outlines a problem particularly suited to human deductive reasoning. Because we have this working example, we could create a case that would enable the agent to recognize and report this problem and to take intermediate corrective action by resetting the interface and forcing the routing table cache to refresh more frequently. However, we expect that our agents will be confronted with a number of faults that are unanticipated by their case libraries. We refer to these faults as "unknown anomalies", and it is important that MASRR agents respond to them in beneficial ways.
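A case of this kind can be pictured as a set of symptoms paired with a list of actions. The following Python sketch is only an illustration of that idea, not MASRR's case representation: the symptom and action labels, the second case, and the most-specific-match rule are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    name: str
    symptoms: frozenset  # symptoms that must all be observed
    actions: tuple       # diagnosis and correction steps, in order


# Hypothetical case library; every label below is illustrative.
CASE_LIBRARY = [
    Case(
        "corrupt_forwarding_cache",
        frozenset({"misrouted_packets", "no_routing_updates"}),
        ("reset_interface", "shorten_cache_refresh", "report_to_admin"),
    ),
    Case(
        "router_misconfiguration",
        frozenset({"misrouted_packets", "recent_config_change"}),
        ("report_to_admin",),
    ),
]

# Fallback used when no case in the library fits the observations.
UNKNOWN_ANOMALY = Case(
    "unknown_anomaly", frozenset(), ("collect_more_data", "report_to_admin")
)


def match_case(observed, library=CASE_LIBRARY):
    """Return the most specific case whose symptoms are all observed."""
    candidates = [case for case in library if case.symptoms <= observed]
    if not candidates:
        return UNKNOWN_ANOMALY
    return max(candidates, key=lambda case: len(case.symptoms))


known = match_case({"misrouted_packets", "no_routing_updates", "slow_network"})
unknown = match_case({"slow_network"})
```

Here known is the corrupt_forwarding_cache case, while unknown falls through to the fallback, which in this sketch gathers more data and reports instead of guessing at a correction.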

MASRR agents managing a network under the described circumstances would encounter a number of network behaviors that they might not be able to exactly identify. There are several ways that the faulty card could disrupt network traffic. The network depicted in Figure 1 has several subnets and includes a router, F, that has a faulty network card connected to the switch to subnet 4. The routing table cache on this card has random changes corrupting its contents. For this example, we make the following assumptions [1]:
Increases in the number of packets dropped due to lack of a route (increases in MIB counters "ipOutNoRoutes" and "icmpOutDestUnreachs"). The switches at subnets 3 and 4 might receive a number of frames with unknown destination addresses. Because these are inbound packets, they will be dropped (as opposed to being sent out to the default gateway). Router F will also be dropping packets for this reason, if it does not have an interface for the forwarding address given by the faulty network card.
Decrease in inbound traffic on subnet 4, as outbound file requests and connection acknowledgements fail to reach their destinations.
A decrease in network traffic at the default gateway, G, and on subnets 1 and 2. Unless the faulty cache preserves the route to the default gateway G, no packets will travel from subnet 4 through G from the time the cache is corrupted until it is refreshed. The degree to which this decrease is noticeable depends on the normal traffic volume between hosts on subnet 4 and those on subnets 1 and 2 and the "world".
Traffic between hosts on subnet 4 remains normal since these packets are forwarded by the switch itself and don't cross the faulty network card.
Normal communication between agents, except for M4.
Evidence of traffic flow across the router's interface to subnet 4 while the hosts themselves are not responding.
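The counter-based symptoms in the list above could be detected by comparing successive samples against a baseline. The Python sketch below is one possible approach, not MASRR's: the function name, the threshold factor of 3, and the sample values are assumptions, and the counter names follow MIB-II (RFC 1213). How the samples are collected, for example by SNMP polling, is left out.

```python
def counter_anomalies(previous, current, baseline_rate, interval_s, factor=3.0):
    """Flag counters that are rising much faster than their baseline rate.

    `previous` and `current` are two samples of cumulative counters taken
    `interval_s` seconds apart; `baseline_rate` gives the normal increase
    per second for each counter. Returns {counter_name: observed_rate} for
    every counter whose rate exceeds `factor` times its baseline.
    """
    flagged = {}
    for name, base in baseline_rate.items():
        delta = current[name] - previous[name]
        if delta < 0:
            # Counter wrapped or the device restarted; skip this sample.
            continue
        rate = delta / interval_s
        if rate > factor * base:
            flagged[name] = rate
    return flagged


# Hypothetical samples from router F, taken 60 seconds apart.
previous = {"ipOutNoRoutes": 100, "icmpOutDestUnreachs": 40}
current = {"ipOutNoRoutes": 700, "icmpOutDestUnreachs": 46}
baseline = {"ipOutNoRoutes": 0.5, "icmpOutDestUnreachs": 0.1}
flagged = counter_anomalies(previous, current, baseline, interval_s=60)
```

With these numbers only ipOutNoRoutes is flagged, at 10 drops per second against a baseline of 0.5. A fixed multiple of the baseline is the simplest choice; a deployed agent would more plausibly need a threshold that tolerates the normal fluctuations in usage discussed earlier.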
The first of these observations indicates that there is a problem with routing and perhaps with agent M4's host. MF performs diagnostics using inter-agent communication, SNMP monitoring, and other actions such as examining the contents of the router's routing table (which does not itself contain the faulty cache; that cache resides on the network interface card). Diagnostics show that all of subnet 4 appears unavailable, yet the monitoring agent sees that traffic from subnet 4 is still coming across the (faulty) network card. MF might try resetting that interface, which would clear the corrupted cache. Finding that action successful, MF might continue to reset the card whenever communications with M4 were lost. Of course, resetting an interface introduces its own performance problems and is not a permanent solution. MASRR therefore produces a detailed report to alert the administrator to the specific location of this network fault.
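MF's behavior in this paragraph can be summarized as a small decision rule. The Python sketch below is a hypothetical rendering of it: the function name, the action labels, and in particular the cap on resets per hour are assumptions added here to reflect the point that resets carry their own performance cost.

```python
def choose_response(m4_reachable, traffic_on_interface, resets_last_hour,
                    max_resets=4):
    """Pick MF's next actions from its local observations.

    Returns a list of action labels. The reset cap is an assumed safeguard
    so that the agent escalates instead of resetting indefinitely.
    """
    if m4_reachable:
        return []
    if not traffic_on_interface:
        # A silent interface points to a link failure, not a bad cache.
        return ["report_link_down"]
    if resets_last_hour >= max_resets:
        # Resets keep being needed; stop resetting and escalate.
        return ["report_suspected_faulty_card"]
    return ["reset_interface", "report_suspected_faulty_card"]


# Subnet 4 appears down although its traffic still crosses the card.
actions = choose_response(m4_reachable=False, traffic_on_interface=True,
                          resets_last_hour=1)
```

In this sketch the report accompanies every reset, so the administrator learns of the suspect card even while the temporary measure keeps the subnet reachable.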
MASRR agents maintain performance, assist administrator
Steinberg’s example is likely atypical, and in fact, unusual enough that the administrators investigating the symptoms were initially on the wrong track. It represents the unrecognized anomalies that make ensuring network reliability a difficult task. It also represents a number of problems that, even if perfectly diagnosed, cannot be corrected by software agents; these include hardware or physical link failures, misbehaving applications, and undesirable though valid usage patterns. When faced with situations like these, MASRR’s goal is for agents to work together to locate the source of the problem, take reasonable steps to maintain network resource availability, and enable the administrator to confirm and correct the problem as quickly and accurately as possible.
The scenario also suggests that traditional automated network management, relying on centralized correlation and decision making, could benefit from MASRR agents’ collaboration and localized reasoning. A centralized processing module would receive incomplete information, due to the loss of paths in the network. Intermittent failures would present conflicting information, making it hard to pinpoint the location of the errors. Isolating and responding to unknown anomalies such as this would be greatly simplified by whittling down the data and examining what is occurring between different points in the network. MASRR agents sending summaries to the centralized module would speed it along its diagnosis of the real problem.
MASRR is currently under research and development at SHAI. One of our initial undertakings has been to examine the need for, as well as the feasibility of, this decentralized, agent-based network management approach. We find that this scenario and others highlight the benefits of our approach as compared to current tools and methods of network management. This project is funded by DARPA through the Small Business Innovation Research Program. For more information, contact the project manager, Lynn Jones.
[1] This author, whose domain is artificial intelligence rather than network management, is not entirely sure that the events described follow the true workings of a router. However, she was running out of hairs to pull out when every step in the scenario simply brought forth more low-level questions to be answered. Thus, these assumptions are given. The scenario is meant to illustrate the ways in which MASRR would confront network faults, and we make no claim as to the accuracy of events with respect to fault resolution.
References
[Steinberg] Lou Steinberg, Troubleshooting with SNMP and Analyzing MIBs, page 51. New York: McGraw-Hill, 2000.