A Motivating Example
Lynn Jones and Tamitha Carpenter
The rapid growth of computer networks over the past ten years has resulted in a highly complicated operating environment susceptible to a variety of attacks and malfunctions capable of compromising network reliability at all levels. Whether network resource reliability is lost due to malicious behavior, device failure, or common misconfigurations, the cost of this loss can be enormous, both in terms of lost productivity and the possible theft of sensitive documents.
Certainly networks are vulnerable to intrusions and attacks. Computer "crackers" continue to find new ways to exploit weaknesses in both hardware and software; security patches themselves have been known to introduce new bugs. Additionally, the internet permits attack and intrusion scripts, viruses, and other tools to be disseminated as rapidly as their solutions. The availability of these resources expands the number of potential attackers, as it enables those possessing only modest computer skills to wreak havoc on networks.
Even in a world with no malicious computer users, maintaining network system reliability is a difficult problem. Modern computer networks involve many different computer platforms and network elements, interacting over a variety of network protocols as well as assorted implementations of those protocols, all combinable in a virtually infinite number of configurations. These complicated environments lead to two major obstacles to maintaining network reliability: 1) ensuring that every host and every network element is properly configured at all times is a monumental task and is rarely, if ever, accomplished; and 2) any software solution for detecting and responding to network reliability issues must be sensitive to fluctuations in normal or valid usage that exist in these networks.
The task of maintaining network health is compounded by the difficulty of accurately diagnosing problems once symptoms are observed. There is no exact correspondence between network problems and their underlying causes. Faults may manifest themselves in a variety of ways, and observable symptoms may have a number of possible causes. Faults may be intermittent and difficult to consistently reproduce. Relatively minor faults can persist undetected, exacerbating and masking the causes of larger problems that might occur.
A variety of network monitoring and management tools exist. Some of these are device- or vendor-specific, although integration of management tools is improving. Most management tools perform only monitoring and reporting, presenting information to network administrators who must further diagnose and correct problems. Some management tools, especially network security tools such as intrusion detection systems, incorporate evidence correlation into monitoring. Many of these systems rely on "signatures" or descriptions of known attacks or faults and are less effective at diagnosing events that have not been previously encoded into the software. Most, if not all, of these applications employ a centralized processing module that collects and analyzes data from monitoring points throughout the network.
Software tools that can automatically maintain network health are highly desirable. The supply of experienced and knowledgeable network administrators does not meet growing demand. The amount of data to be tracked is too great for the unassisted human to manage. Network service providers are contracting "service level agreements," guaranteeing not just connectivity but quality as well, requiring well-tuned networks and nearly immediate corrective action. A current research goal is not just to diagnose but to predict faults, attacks and intrusions, so that timely intervention can minimize or prevent the effects.
Some characteristics of network management practices actually stand as obstacles to automated response, however. Centralized decision making, be it automated or on the part of a human operator, cannot scale well to support large networks, both because of the size of the data to be analyzed and because the system relies on network performance to collect the data. Performance of applications that rely on "signatures" for fault or attack detection is limited to what they have been programmed to recognize, virtually ruling out their customization to unique network topologies and leaving them unable to detect changes in network behavior that indicate a problem.
Stottler Henke Associates, Inc. (SHAI) has been tasked by the Defense Advanced Research Projects Agency (DARPA) to research and develop a unique system for maintaining network resource reliability through a decentralized multi-agent architecture. Our approach differs from existing computer security and network management tools in a number of ways. First, our system will not rely on the presence of a central analysis unit. Instead, individual agents will monitor individual hosts and/or local portions of the network, communicating with one another when needed, and autonomously or semi-autonomously applying network maintenance measures where and when appropriate. This design is not only easily scalable but also enables the agents to respond locally even when much of the network is compromised. Second, unlike many intrusion detection tools, which have hard-coded responses to known attacks, our system will be able to respond to known attacks, recognizable network degradations (e.g., router misconfigurations), and unexplained anomalies. Third, our agents will be able to reconcile differing local views through data fusion, allowing them to provide increasingly accurate responses to events that degrade network performance as information is collected and aggregated from other agents.
A real-world scenario
In this article, we outline a real-world network problem, its diagnosis and solution. We explain how our "Multi-Agent System for Network Resource Reliability" (MASRR) would behave in the same situation. We then contrast performance of the two approaches.
A "broadcast storm" is defined as:
A state in which a message that has been broadcast across a network results in even more responses, and each response results in still more responses in a snowball effect. A severe broadcast storm can block all other network traffic, resulting in a network meltdown. Broadcast storms can usually be prevented by carefully configuring a network to block illegal broadcast messages. [Webopedia]
Not all broadcasts are bad – many routers and services use them to advertise availability of network resources. However, misconfigurations can allow broadcasts to "run amok," temporarily impacting network performance. Storms can subside on their own but may recur frequently, as in this example summarized from [Alderson & Haugdahl]:
In a Novell IPX network, thousands of broadcast packets were seen in short periods of time. The packets included both Routing Information Protocol (RIP) and Service Advertising Protocol (SAP) messages, coming from both routers and servers, and announced services and routes that were no longer reachable. The packets that appeared to be the beginning of the storm were from routers announcing that they had lost contact with hundreds of networks and SAP services. All the other routers and servers then propagated the announcements.
The network analysts (humans) examined the intervals at which the storms were occurring, in order to gain more evidence as to the cause. They filtered the data so they could isolate the traffic at one router and learned that the first packet of the storm was always announcing the unreachability of the same network or service that was lost during the prior storm. Moreover, by comparing the timing of the start of the storms, they found that every five minutes, a router was announcing that a particular server was no longer reachable.
The analysts compared these messages with SAPs advertising availability of the same server, which occurred over two-minute intervals and then were absent over three-minute intervals. They suspected that a router had been configured to supply RIP and SAP across a WAN link every five minutes (to save bandwidth), which they found to be true. However, the receiving router was configured to expect updated RIP and SAP every minute. So if updates weren't seen in four minutes, the router assumed the networks/services were dead and broadcast an announcement before purging the information from its tables, thus causing the broadcast storm. When the receiving router was set to expect five-minute RIP and SAP updates, the broadcast storms went away, and SAP broadcasts were then observed on a regular 60-second basis.
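The timing mismatch described above can be illustrated with a small simulation. This is a minimal sketch, not the analysts' method: the function name and its parameters are hypothetical, and the model assumes that the receiving router ages out a route exactly when no update has arrived within its timeout. The 20-minute timeout used for the corrected configuration is also an assumption (four times the five-minute interval, mirroring the four-minute timeout on a one-minute interval).

```python
def false_unreachable_times(send_interval_s, receiver_timeout_s, duration_s):
    """Return the times (in seconds) at which the receiver wrongly ages out a live route.

    Assumes an update arrives at t = 0 and then every send_interval_s seconds.
    """
    announcements = []
    last_update = 0
    expired = False
    for t in range(1, duration_s + 1):
        if t % send_interval_s == 0:
            # A fresh update arrives and the route is (re)installed.
            last_update = t
            expired = False
        elif not expired and t - last_update >= receiver_timeout_s:
            # No update within the timeout: the receiver declares the route dead.
            announcements.append(t)
            expired = True
    return announcements


# Misconfigured: updates every 5 minutes, receiver gives up after 4 minutes.
print(false_unreachable_times(300, 240, 900))   # [240, 540, 840]

# Corrected (assumed timeout of 4 x 5 minutes): no false announcements.
print(false_unreachable_times(300, 1200, 900))  # []
```

In the misconfigured case the false announcements are spaced 300 seconds apart, which is consistent with the five-minute storm interval the analysts observed.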
There are several items of note. First, these broadcast storms had been occurring undetected, and were only found when the company held a training session led by independent network management consultants. The disruption of the network was too transient to have been otherwise investigated, yet the impact could have made other problems hard to find, or could potentially have compounded other problems, inducing enough stress to overwhelm the network. Next, the diagnostics concentrated on network traces, looking at evidence of the broadcast storm. The analysts looked at the message origins and intervals, and not at whether the service was truly unreachable or not. Finally, their suspicion (and conclusion) was likely drawn from a great deal of knowledge and experience. We are not told if this was the first diagnosis that they pursued.
MASRR’s response to such a situation:
While the above scenario transpired on a network of hundreds of nodes, we illustrate the scenario with the small abstraction of a network as diagrammed in Figure 1. In fact, the scale of the problem can be reduced in this way as, rather than detecting the broadcast storm itself, MASRR agents observe the discrepancy in information between peer agents monitoring the routers. Initially, service V is available. The sending router S advertises the availability every 5 minutes. Router R announces availability every minute, and is also expecting receipt of an advertisement every minute.

Our scenario unfolds:
At time t, S and R advertise the availability of service V. A minute later, R’s path to V has not been refreshed. R assumes that service V is no longer available, and broadcasts the message. This announcement goes only to "downstream" routers 3 and 4, and the broadcast storm is confined to R’s subnet. MASRR agent MR observes the broadcast but has received no "bad news" from peer agent MS. Agent MR queries MS about the loss of service V; MS replies that V is available. The two agents work together to determine why R thinks the service is unavailable. They use "case-based retrieval" libraries of actions the agents can take, including an action to compare and reconcile the advertisement intervals configured in the routers. This case is retrieved by the description of symptoms and is adapted to correct router R’s refresh interval to 5 minutes so that it will not incorrectly announce that service V has been lost.
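The exchange between MR and MS might be sketched as follows. This is an illustration only: the RouterView record, its field names, and the diagnose and apply_reconciliation functions are hypothetical and do not describe the actual MASRR implementation. The sketch assumes each agent can read its router's configured intervals and its current belief about service V.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RouterView:
    """What a monitoring agent knows about its router (hypothetical fields)."""
    name: str
    advertise_interval_s: int  # how often this router sends advertisements
    expected_interval_s: int   # how often it expects advertisements from upstream
    service_available: bool    # the router's current belief about service V


def diagnose(upstream: RouterView, downstream: RouterView) -> Optional[str]:
    """Compare two peer views and name an action, or None if they agree."""
    if upstream.service_available == downstream.service_available:
        return None
    if upstream.service_available and not downstream.service_available:
        # Downstream reports a loss that upstream does not see.
        if downstream.expected_interval_s < upstream.advertise_interval_s:
            return "reconcile-intervals"
    return "escalate"


def apply_reconciliation(upstream: RouterView, downstream: RouterView) -> None:
    """Align the downstream router's expected interval with the upstream sender."""
    downstream.expected_interval_s = upstream.advertise_interval_s
    downstream.service_available = upstream.service_available


s = RouterView("S", advertise_interval_s=300, expected_interval_s=300, service_available=True)
r = RouterView("R", advertise_interval_s=60, expected_interval_s=60, service_available=False)

if diagnose(s, r) == "reconcile-intervals":
    apply_reconciliation(s, r)
print(r.expected_interval_s)  # 300
```

Only the two views are exchanged, which reflects the small amount of peer-to-peer data the scenario requires.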
At the same time, agents M3 and M4 go through the same process. However, their upstream peer, MR, replies that the problem lies further upstream. Thus M3 and M4 respond by accepting that the service is unavailable and deferring problem-solving to MR. Agents M3 and M4 can implement filtering of broadcasts to prevent the storms, which is the network degradation they observe. Similar events may or may not be occurring at M1 or M2, depending on the router configurations at 1 and 2.
When MR and MS resolve the discrepancy between refresh intervals, MR informs agents M3 and M4, which then remove the broadcast filters.
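The behavior of the downstream agents M3 and M4 can be pictured as a very small state machine. The class name, method name, and message strings below are hypothetical and chosen only for illustration; the sketch assumes the upstream peer sends one of two replies.

```python
class DownstreamAgent:
    """Sketch of an agent that defers diagnosis upstream and filters broadcasts meanwhile."""

    def __init__(self, name):
        self.name = name
        self.filtering = False  # whether broadcast filtering is currently in place

    def on_upstream_reply(self, reply):
        """Update the filter state from the upstream peer's reply and return it."""
        if reply == "problem-upstream":
            # Defer problem-solving to the upstream agent; suppress the storm locally.
            self.filtering = True
        elif reply == "resolved":
            # The upstream agents reconciled the intervals; remove the filter.
            self.filtering = False
        return self.filtering


m3 = DownstreamAgent("M3")
m3.on_upstream_reply("problem-upstream")  # filter installed
m3.on_upstream_reply("resolved")          # filter removed
```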
The "case-based retrieval" mechanism that all agents employ uses a case library containing a variety of actions for responding to known situations. Certainly the case library cannot contain actions for every conceivable event, but there will be sufficient breadth of representative cases that can be adapted to respond to and improve many situations, even those not previously observed. In addition, MASRR agents are equipped with learning components that adapt to changes in normal usage patterns and that learn which actions are best. The choice and application of actions are available to the network administrator in a detailed reporting of findings.
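As a rough illustration of retrieval by symptom description, the sketch below matches observed symptoms against a toy case library. The article does not specify how cases are indexed or scored, so everything here is an assumption: the case entries, the symptom labels, the action names, the use of Jaccard similarity, and the 0.5 threshold are all hypothetical.

```python
# Toy case library: each case pairs a set of symptoms with an action name.
CASE_LIBRARY = [
    {"symptoms": {"service_reported_lost", "peer_reports_available", "interval_mismatch"},
     "action": "reconcile_advertisement_intervals"},
    {"symptoms": {"broadcast_storm", "upstream_owns_problem"},
     "action": "filter_broadcasts"},
    {"symptoms": {"service_reported_lost", "peer_reports_lost"},
     "action": "accept_loss"},
]


def jaccard(a, b):
    """Similarity of two symptom sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(observed, library=CASE_LIBRARY, threshold=0.5):
    """Return (action, score) for the best-matching case, or (None, score) if too weak."""
    best = max(library, key=lambda case: jaccard(observed, case["symptoms"]))
    score = jaccard(observed, best["symptoms"])
    return (best["action"], score) if score >= threshold else (None, score)


action, score = retrieve({"service_reported_lost", "peer_reports_available"})
print(action)  # reconcile_advertisement_intervals
```

A partial match still retrieves a usable case, which is the property that lets a representative library cover situations not previously observed; adapting the retrieved action to the specific routers is a separate step not shown here.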
How do these approaches differ?
The real-world scenario exemplifies what we consider a traditional approach to network management, with limited or no localized reasoning and all data analysis occurring at a centralized location (in this case, by human analysts). We see differences along the following measures:
Our decentralized approach detects and corrects the situation sooner than the centralized approach. MASRR agents recognize the discrepancy in their observations before the broadcast storm symptoms are observed, and immediately assess that the problem is occurring between S and R (and possibly 1 and/or 2). The centralized approach starts by examining the broadcast storm packet trace in the subnets and has to deduce or infer the location of the problem.
Our decentralized approach is less expensive, in terms of processing time and/or bandwidth consumption. In order to make that deduction, the centralized processor would need to correlate packet trace data before it can begin diagnosis of the root cause. All the data must be shipped to a centralized location. MASRR agents can concentrate on identifying and correcting the problem after transmitting a small amount of data across a local communication path. Information from uninvolved stations is not considered, and relevant information is locally available.
Our decentralized approach is less reliant on network connectivity. Because the centralized approach requires the data to be sent to one location, if the connection is lost or becomes too congested, the data will not arrive. With MASRR, only communication between S and R is required; if that fails, then the service is truly unavailable anyway. In general, MASRR agents reason amongst themselves; if connections between peers are lost, they continue to monitor and act to improve or preserve network reliability locally.
Our agent-reasoning approach provides more meaningful reporting to the network administrator. In this example, detailed analysis revealed a network-degrading situation that had previously gone undetected. Even if the network administrator had used some of the available network monitoring tools, substantial analysis would have remained in order to track down the source of the broadcast storms. MASRR agents would detect this problem without supervision and either resolve it automatically or alert the administrator to the misconfigurations. In either case, the system would provide a detailed explanation of its reasoning and the root cause of the problem.
Combining agents’ work with centralized processing
Today’s network management tools tend to take a centralized approach to data analysis and decision making. One benefit of centralized processing is that the central processor has access to all available data, creating a bigger "picture" of what is happening in the network. This may assist with the elimination of redundant alarms or other duplicate events. The centralized approach also allows very lightweight components dispersed across the network to perform data collection. Network administrators may even feel more confident with centralized management because its operation seems more akin to the process a human would follow in performing diagnostics and corrections. However, fully centralized management cannot scale well to very large networks, due to both the overhead of data reporting within the network (i.e., communication bandwidth usage) and the amount of data to be processed. Moreover, the trend in network management will be to examine more data, not less, as management focuses on quality of service and application behavior and its impact on network behavior [Lee].
Hybrid systems employing localized analysis of monitored data along with centralized reasoning about the network (or between networks) can offer the benefits of the centralized as well as the decentralized approach. Agents monitoring network elements or subnets can examine data at fine granularity and collaborate with close neighbors to assess and diagnose local network behavior and to take small corrective actions. After local agents work together to quickly identify the location and cause of network troubles, a single report can then be sent providing pertinent information to the centralized processing module for further analysis. This hybrid system allows for fairly lightweight agents, distributed throughout the network. It reduces management bandwidth requirements by sending only summaries or correlated events. Faults and problems are located more quickly and directly, enabling faster prediction, detection, and response and reducing the central processing load. Such a system could also adjust the amount of work done by local agents according to network congestion or if connectivity to the central unit is lost altogether.
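The idea of sending only summaries or correlated events to the central module could look something like the following. This is a sketch under stated assumptions: the event fields ("source", "kind") and the summarize function are hypothetical, and real correlation would likely involve more than counting duplicates.

```python
from collections import Counter


def summarize(events):
    """Collapse raw local events into one summary entry per (source, kind) pair."""
    counts = Counter((event["source"], event["kind"]) for event in events)
    return [
        {"source": source, "kind": kind, "count": count}
        for (source, kind), count in sorted(counts.items())
    ]


raw_events = [
    {"source": "R", "kind": "service_lost_broadcast"},
    {"source": "R", "kind": "service_lost_broadcast"},
    {"source": "3", "kind": "service_lost_broadcast"},
]

# Three raw events become two summary entries for the central module.
report = summarize(raw_events)
print(len(report))  # 2
```

The central module then receives a report whose size depends on the number of distinct problems rather than on the number of packets observed.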
Conclusion
MASRR is currently under research and development at SHAI. One of our initial undertakings has been to examine the need for, as well as the feasibility of, this decentralized, agent-based network management approach. We find that this scenario and others highlight the benefits of our decentralized approach or a hybrid approach as compared to current tools and methods of network management. This project is funded by DARPA through the Small Business Innovation Research Program. For more information, contact the project manager, Lynn Jones.
References
[Alderson & Haugdahl] Bill Alderson and J. Scott Haugdahl, "A Broadcast Storm Becomes A Thunderstorm", Network Computing online magazine. http://www.networkcomputing.com/611/611alderson.html
[Lee] Stephen Lee, "Vendors bring network management under one roof", InfoWorld, 4/30/01. Available at http://www.itworld.com/Net/3206/IWD010430hnmanage/
[Webopedia] http://webopedia.internet.com/TERM/b/broadcast_storm.html