Over the past decade, steady progress has been made on bottom-up, low-level management instrumentation of network and computing environments. And more recently, new service management systems have come on the market that work from the top down to develop an objective fact base on the service levels that users experience. What's been missing is the middle ground: systems-level management, which encompasses fault and performance management.
Device-level management is still moving ahead, and the latest improvements involve embedding Web interfaces directly in devices. These interfaces allow secure browser access over HTTP to the functionality of device-level management agents. This certainly represents an improvement over the telnet-based command-line access to these management agents that used to be the norm. Over time, SNMP-based monitoring and measurement collection will be replaced by data exchanges between the element management agent and higher-level management systems, based on the WBEM CIM interface standard (Web-Based Enterprise Management Common Information Model - see this column in BCR, October 1998, pp. 30-32).
New directions for monitoring, measurement and trending also have emerged. Top-down, service level-oriented approaches use infrequent, application-level test transactions to determine the actual service levels being delivered to end users (see BCR, February 1999, pp. 20-21). These systems are usually combined with direct analysis of end-user transactions, as well as "synthetic" transactions. They compute availability, response time and throughput measures for network-based services, and create reports that compare real versus target service levels. The beauty of these systems is that they probe the network using the same transactions that users (or programs) generate; no special protocols need to be developed. By setting up agents at strategic points around the network that test critical services and applications, it is possible to get an accurate picture of delivered service levels, and an early warning when they degrade.
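To make the real-versus-target comparison concrete, here is a minimal Python sketch. It assumes each synthetic test transaction has already been recorded as a simple result record; the `ProbeResult` type, the `service_level_report` function and the shape of the targets table are hypothetical names chosen for illustration, not the interface of any product mentioned in this column. Throughput is omitted for brevity.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ProbeResult:
    """Outcome of one synthetic test transaction (hypothetical record)."""
    service: str
    succeeded: bool
    response_seconds: float  # not used when the transaction failed


def service_level_report(results, targets):
    """Compare measured availability and mean response time with targets.

    `targets` maps a service name to a pair:
    (minimum availability as a fraction, maximum mean response in seconds).
    """
    report = {}
    for service, (min_avail, max_resp) in targets.items():
        samples = [r for r in results if r.service == service]
        if not samples:
            continue  # no measurements for this service, nothing to report
        ok = [r for r in samples if r.succeeded]
        availability = len(ok) / len(samples)
        # Response time is averaged over successful transactions only.
        mean_resp = mean(r.response_seconds for r in ok) if ok else None
        meets = (
            availability >= min_avail
            and mean_resp is not None
            and mean_resp <= max_resp
        )
        report[service] = {
            "availability": availability,
            "mean_response_seconds": mean_resp,
            "meets_target": meets,
        }
    return report
```

A scheduler that runs the test transactions from several vantage points would feed this function; a service whose `meets_target` flag turns false is the early warning described above.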
Bottom-up and top-down meet in systems-level management, where the difficult problems of fault and performance management require the ability to integrate information from many sources and to understand relationships between devices, traffic flows, protocol layers, applications and services. Users consistently complain that they are not getting what they want from management software providers and systems vendors, and their complaints are understandable.
We're pretty good at finding and fixing a failure of a single network element. But it's much more difficult to identify, isolate and investigate problems that are the result of interactions among multiple elements or less-than-catastrophic deviations from normal operation. Troubleshooting these kinds of problems often resembles untangling a ball of string or peeling an onion.
The reality of multivendor environments only exacerbates the problem. It's not easy to separate elements that are working correctly from those that aren't; we can't easily partition our networks to test whether different parts work correctly on an independent basis. The lack of fault-isolation test points makes it difficult to determine which protocol and service layer is having problems. Good systems-level troubleshooting requires knowing how different elements function, and how they typically degrade and fail. It is much more difficult for a vendor to build that level of expertise into a management product than it is to develop good device-level measurement or configuration support. But that is exactly what users want. They need to substitute expert tools for experts, at least in some cases.
RMON2 represents one significant step toward systems-level management. RMON2 agents collect data on traffic flows, including higher-layer protocol information and source-destination addresses. Using this data, it has been possible to put together a picture of which applications are used by which users, and how different parts of the network are affected by different application flows. NetScout and 3Com have made particularly good use of RMON2 data. (Adesola)
Alarm correlation has delivered some benefits. Products from Micromuse, Tivoli, Seagate and Systems Management Arts (SMARTS) analyze aggregated alarm streams to weed out duplicate alarms and deduce primary causes of alarm avalanches. They are still limited, however, in their ability to isolate specific faults or the causes of performance problems. Vendors like VitalSigns, now INSoft, exploit details of protocol timings to isolate what portions of networks are causing performance problems. (Adesola) They combine this with knowledge of a few very popular applications (e.g., SAP) to provide insight into performance analysis.
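The two steps described above, weeding out duplicates and deducing primary causes, can be sketched in a few lines of Python. This is an illustration only, under the assumption that each element has at most one known upstream dependency; the `correlate` function and the `depends_on` map are hypothetical and do not describe how any of the products named here actually work.

```python
def correlate(alarms, depends_on):
    """Reduce an alarm stream to unique alarms and probable root causes.

    `alarms` is an iterable of (element, symptom) tuples.
    `depends_on` maps an element to the upstream element it relies on;
    elements with no known upstream are simply absent from the map.
    Returns (unique_alarms, probable_roots).
    """
    # Step 1: drop exact duplicates while keeping arrival order.
    unique = list(dict.fromkeys(alarms))
    alarmed = {element for element, _ in unique}

    # Step 2: an alarmed element is treated as a probable root cause only
    # if nothing upstream of it is also alarmed.
    roots = []
    for element in dict.fromkeys(e for e, _ in unique):
        upstream = depends_on.get(element)
        visited = set()  # guards against accidental cycles in the map
        is_root = True
        while upstream is not None and upstream not in visited:
            if upstream in alarmed:
                is_root = False
                break
            visited.add(upstream)
            upstream = depends_on.get(upstream)
        if is_root:
            roots.append(element)
    return unique, roots
```

Given a router failure that makes a switch and a server behind it unreachable, this sketch reports only the router as a probable root. It also shows the limitation noted above: it names the element at the top of the avalanche, not the specific fault inside it.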
Overall, users remain very dissatisfied. There is hope, however, that the next wave in management, powered by use of Internet and Web technologies, standards and architectures, will enable progress to be made on these difficult problems. There are at least three reasons for optimism. (Caggiano)
- First, the migration toward Internet-centric systems architectures will create a more homogeneous IT environment. It will still be multi-vendor, but it will conform to a single service and protocol architecture. That simplification will allow implementation of a more comprehensive, top-down set of test measurements and move us toward a form of fault isolation.
- Second, the use of Web browsers to access management systems has allowed vendors to tie their tools to knowledge bases located on the Internet. The knowledge bases contain up-to-date information about bugs in products, as well as solutions to commonly encountered problems. They now can be maintained at vendor-support Web sites, and the entire customer community can contribute to and extend the knowledge base. When properly indexed, these knowledge bases can be the next best thing to an expert tool. Over time, vendors will integrate the knowledge bases more tightly with monitoring, analysis and testing tools to make better recommendations for probable causes and corrective actions.
- Third, the WBEM CIM standards now being developed by the DMTF (Desktop Management Task Force) will allow for the integration of management data from multiple elements operating at different layers. If the standards are adopted widely, it will be possible to exchange configuration data, alarms and measurement data between two management systems or between a management system and a management agent.
CIM is the best chance in a decade to get a useful management integration standard that can help address the systems-level functionality lacking in today's management products. (Caggiano) SNMP is just too simple to be effective at this level of integration, and while some vendors have tried to develop proprietary APIs for this kind of integration, it's not easy for device and tool providers to work with multiple, proprietary interfaces.
CIM's open, object-oriented framework includes association classes, which are designed to capture relationships between objects. The old ISO CMIP protocols were too primitive in how they presented relationships between elements, and SNMP, which is not object oriented, cannot directly represent multi-element relationships. CIM is also being extended to incorporate a data model for network directories. The DEN (Directory Enabled Networks) standard will allow relationships between users, applications and devices to be specified in one (logical) place and used as the basis for connection establishment, authentication and quality-of-service assignment. DEN is built using the CIM framework, and together DEN and CIM will usher in an era of policy-based management of networks and services. (Basu, p.43)
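The idea behind an association class, a relationship modeled as an object in its own right, can be illustrated with a short Python sketch. The class and function names below (`ManagedObject`, `DependsOn`, `impacted_by`) are hypothetical and are not taken from the CIM schema, which is defined in its own modeling language rather than in Python; the sketch only assumes that dependencies between managed objects have been recorded somewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManagedObject:
    """Any managed thing: a device, an application or a service."""
    name: str
    kind: str


@dataclass(frozen=True)
class DependsOn:
    """Association object: `dependent` relies on `antecedent`."""
    antecedent: ManagedObject
    dependent: ManagedObject


def impacted_by(failed, associations):
    """Return every object that directly or indirectly depends on `failed`."""
    impacted = []
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for assoc in associations:
            if (
                assoc.antecedent == current
                and assoc.dependent != failed
                and assoc.dependent not in impacted
            ):
                impacted.append(assoc.dependent)
                frontier.append(assoc.dependent)
    return impacted
```

Because the relationship is itself data, a management system can walk it across layers: starting from a failed device, the traversal reaches the application that runs on it and then the service that the application supports. A flat table of per-device variables offers no direct way to express that chain.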
The various bottom-up and top-down efforts will continue, but to provide truly integrated management systems, there must be a focus on capturing relationships at many layers and levels, and implementing logical, automated fault-isolation techniques. Users will not be satisfied until management systems can quickly pinpoint faults and their impact on critical business systems. Management and network control need to converge further, so that performance degradations, such as brownouts, can be handled via high-level, resource-allocation policies. Systems need to offer "what if" scenario analysis to test proposed changes in policies or the effect of taking an element out of service for maintenance. We've seen noble solutions fail before; think artificial intelligence. Hopefully, enterprise management won't suffer a similar fate.
Bibliography
1. Adesola, B. and Roy, R. Review of Knowledge Modelling Frameworks: Lessons Learnt, Proceedings of the 3rd International Conference on the Practical Applications of Knowledge Management, (2000). p189-192.
2. Basu, A. Perspectives on operations research in data and knowledge management, European Journal of Operational Research, Vol. 111, No. 1, (16th November, 2003).
3. Caggiano, C. Low-tech smarts, Inc., Vol. 21, No. 1, January, 2004, p79-80.