[0001] § 1.1 Field of the Invention
[0002] The present invention concerns network management systems (“NMSs”). In particular, the present invention concerns combining fault and performance management.
[0003] § 1.2 Description of Related Art
[0004] The description of art in this section is not, and should not be interpreted to be, an admission that such art is prior art to the present invention.
[0005] As computer, hardware, software and networking systems, and systems combining one or more of these systems, have become more complex, it has become more difficult to monitor the “health” of these systems. For example,
[0006] Each of the servers may include components (e.g., power supplies, power supply backups, printers, interfaces, CPUs, chassis, fans, memory, disk storage, etc.) and may run applications or operating systems (e.g., Windows, Linux, Solaris, Microsoft Exchange, etc.) that may need to be monitored. The various databases (e.g., Microsoft SQL Server, Oracle Database, etc.) may also need to be monitored. Finally, the networks, as well as their components, (e.g., routers, firewalls, switches, interfaces, protocols, etc.) may need to be monitored.
[0007] Although the system
[0008] Tools have been developed to monitor these systems. Such tools have come to be known as network management systems (NMSs). (The term network management systems should not be interpreted to be limited to monitoring networks—network management systems have been used to monitor things other than networks.) Traditionally, NMSs have performed either fault management, or performance management, but not both. Fault management pertains to whether something is operating or not. Performance management pertains to a measure of how well something is working and to historical and future trends.
[0009] A fault management system generates and works with “real time” events (exceptions). It can query the state of a device and trigger an event upon a state change or threshold violation. However, fault management systems typically do not store the polled data—they only store events and alerts (including SNMP traps which are essentially events). Generally, the user interface console for a fault management system is “exception” driven. That is, if a managed element is functioning, it is typically not even displayed. Generally, higher severity fault events are displayed with more prominence (e.g., at the top of a list of faults), and less critical events are displayed with less prominence (e.g., lower in the list).
[0010] On the other hand, performance management systems generally store all polled data. This stored data can then be used to analyze trends or to generate historical reports on the numerical data collected. A major challenge in performance management systems is storing such large amounts of data. For example, just polling 20 variables every 5 minutes from 1000 devices generates about 6 million data samples per day. Assuming each data sample requires 50 bytes of storage, about 9 GB of storage will be needed per month. Consequently, performance management systems are designed to handle large volumes of data and to perform data warehousing and reporting functions.
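The storage estimate above can be checked with a short calculation. The following is only a sketch of the arithmetic; the 30-day month is an assumption, and the constant names are illustrative.

```python
# Back-of-the-envelope storage estimate using the figures quoted above.
VARIABLES_PER_DEVICE = 20
DEVICES = 1000
POLL_INTERVAL_MIN = 5
BYTES_PER_SAMPLE = 50
DAYS_PER_MONTH = 30  # assumption: a 30-day month

polls_per_day = 24 * 60 // POLL_INTERVAL_MIN  # 288 polls per day
samples_per_day = VARIABLES_PER_DEVICE * DEVICES * polls_per_day
bytes_per_month = samples_per_day * BYTES_PER_SAMPLE * DAYS_PER_MONTH
gb_per_month = bytes_per_month / 1e9  # decimal gigabytes

print(f"{samples_per_day:,} samples/day, {gb_per_month:.2f} GB/month")
# prints "5,760,000 samples/day, 8.64 GB/month"
```

The exact figures (5.76 million samples per day and 8.64 GB per month) round to the 6 million and 9 GB quoted in the text.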
[0011] Performance management systems are typically batch oriented. More specifically, generally, distributed data collectors poll data and periodically (e.g., each night) feed them to a centralized database. Since the size of the centralized database will become huge, database management is a prime concern in such products.
[0012] As can be appreciated from the foregoing, conventional fault management systems are limited in that they do not store data gathered for later use in performance analysis. Conventional performance management systems are limited in that they require huge amounts of storage. Furthermore, since data is batched and sent to a centralized location for storage, the stored data can become “stale” if enough time has elapsed since the last batch of data was stored.
[0013] Furthermore, most enterprises currently use a minimum of two, if not more, products for information technology management. It is common to find several independent products being used by various departments within an enterprise to meet the basic needs of monitoring and performance management across networks, servers and applications. Moreover, since the performance and fault monitoring systems are disjointed, correlating data from these different systems is not trivial.
[0014] Recognizing that correlation between the collective information technology (“IT”) infrastructure and business service is needed, several Manager of Manager (“MoM”) tools have appeared in the market. These products interface with the various well known commercial tools and try to present a unified view to IT managers. Unfortunately, however, such integration is complex and requires depending on yet another product which needs to be learned and supported each time an underlying tool is updated. The addition of yet another tool just adds to the operational costs rather than reducing them.
[0015] In view of the foregoing limitations of existing network management systems, there is a need to simplify the processing related to monitoring faults and performance. There is also a need to monitor end-to-end service faults and performance of a service. Such needs should be met by a technique or system that is simple to install and administer, that has real-time capabilities, and that scales well in view of the large amount of data storage that may be required by a performance management system. Finally, there is a need to provide different users with different levels of monitoring, either for purposes of security, for purposes of software licensing, or both.
[0016] The present invention discloses apparatus, data structures, and/or methods for distributing queries and combining query responses in a fault and performance monitoring system using distributed data gathering and storage.
[0032] The present invention involves methods, apparatus and/or data structures for monitoring system faults and system performance. The following description is presented to enable one skilled in the art to make and use the invention, and is provided in the context of particular applications and their requirements. Various modifications to the disclosed embodiments will be apparent to those skilled in the art, and the general principles set forth below may be applied to other embodiments and applications. Thus, the present invention is not limited to the embodiments shown and the inventor regards his invention as the following disclosed methods, apparatus and data structures and any other patentable subject matter.
[0034] DGEs
[0035] Information extraction, combination and presentation operations
[0036] Finally, an application programming interface (“API”) operation
[0037] In one embodiment of the invention, the system configuration operations
[0038] Recall that although some traditional NMS products have distributed collectors, they require consolidating all the data into a central database for reporting. Thus the architecture
[0039] This architecture pushes even the correlation and notification to the distributed DGEs so that there is no central bottleneck and the system operates as a loosely coupled but coordinated cluster. One embodiment, consistent with the principles of the present invention, uses key technology standards such as XML, JMS, JDBC, SOAP and XSLT layered on a J2EE framework.
[0041] Exemplary methods, apparatus, and data structures that may be used to effect the configuration, data gathering, and information extraction, combination and presentation operations are now described.
[0042] System configuration may include information learned or discovered from the system and/or information entered via the API operation.
[0043] Further, each of at least one data gathering operation (e.g., a DGE) is associated with one or more of the devices as indicated by block
[0044] As indicated by block
[0045] As indicated by block
[0046] The various associations may be stored in the configuration database
[0047] Referring back to
[0048] Recall from block
[0049] Although monitors may be predefined, the API operation may allow users to create “plug-ins” to define new tests (e.g., for a new device) to be performed by new monitors. In this regard, monitors are similar to device drivers in a PC operating system. More specifically, a PC operating system has drivers for many popular peripherals. However, device drivers for new peripherals or less popular peripherals may be added. Similarly, as new device types are added to the system being monitored, new monitors for testing these new device types may be added. The present invention may overprovision a DGE with monitors. In this way, even though some monitors might not be used, as devices are added, the DGE can simply activate a monitor needed to test the newly added device.
[0050] A list of at least some exemplary monitors that may be supported by the present invention is provided in § 4.3.1.1.1 below.
[0051] ICMP network monitors may be used to check the reachability of hosts on an Internet Protocol (“IP”) network using the ICMP protocol. The ICMP monitor reports on packet loss and latency for a sequence of ICMP packets. These monitors may include:
[0052] ICMP Round Trip Time—Average time of 5 packets sent at 1 second intervals of 100 bytes each. Measured in milliseconds.
[0053] ICMP Packet Loss—% of packets lost out of 5 packets sent at 1 second intervals of 100 bytes each.
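The two ICMP measurements above can be derived from one probe sequence. The sketch below assumes the round trip times have already been collected (sending ICMP packets is omitted), and uses `None` to mark a probe that received no reply; the function name and the result keys are illustrative.

```python
from typing import Dict, Optional, Sequence


def summarize_ping(rtts_ms: Sequence[Optional[float]]) -> Dict[str, Optional[float]]:
    """Summarize one probe sequence; None marks a packet that got no reply."""
    if not rtts_ms:
        raise ValueError("at least one probe is required")
    replies = [r for r in rtts_ms if r is not None]
    # Packet loss is the share of probes without a reply, as a percentage.
    loss_pct = 100.0 * (len(rtts_ms) - len(replies)) / len(rtts_ms)
    # Round trip time is averaged over the replies that did arrive.
    avg_rtt = sum(replies) / len(replies) if replies else None
    return {"packet_loss_pct": loss_pct, "avg_rtt_ms": avg_rtt}


# Five probes, one of which was lost.
result = summarize_ping([10.0, 12.0, None, 14.0, 12.0])
```

When every probe is lost, the average is reported as `None` rather than zero, so that a dead host is not mistaken for a very fast one.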
[0054] SNMP network monitors may be used for querying devices using the standard SNMP v1, v2 and v3 protocols. Certain enhancements have been made to the monitor, such as using 64-bit counters where available, accounting for rollover of 32-bit counters, polling asynchronously to avoid waiting for responses and to optimize timeout periods, sending multiple queries to the same host in the same SNMP packet, automatically sending individual queries if the multiple-query packet fails for any reason, and querying an alternate SNMP port instead of the default. In an exemplary embodiment, an external definition library has been built which defines, based on the device type, which SNMP variables need to be queried and which post processing (such as rate, delta, etc.) needs to be applied. This permits easily updating the definition library without having to edit the core product. These monitors may include:
[0055] Bandwidth Utilization by Interface—% of total network bandwidth, both incoming and outgoing, calculated by the delta bytes between each sample.
[0056] Throughput by Interface—number of packets per second.
[0057] Interface Errors—CRC error rate (per minute) calculated by the delta between sample intervals.
[0058] BGP Monitor—BGP peer state (connected or failed), route flaps (rate of routing updates).
[0059] Environment—Cisco, Foundry chassis temperature, fan status, power supply.
[0060] SNMP Traps—Customizable trap handler which assigns a severity to received traps based on a customizable configuration file and inserts into the system.
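Delta-based measurements such as bandwidth utilization depend on handling counter rollover correctly. The following is a minimal sketch, assuming at most one wrap between two polls and a known link speed; the function names and parameters are illustrative and are not taken from the described system.

```python
def counter_delta(previous: int, current: int, bits: int = 32) -> int:
    """Difference between two counter readings, allowing for one wrap."""
    modulus = 1 << bits
    # Modular subtraction yields the right delta even if the counter wrapped.
    return (current - previous) % modulus


def utilization_pct(prev_octets: int, cur_octets: int,
                    interval_s: float, link_bps: float, bits: int = 32) -> float:
    """Percentage of link capacity used in one direction over one interval."""
    delta_bits = counter_delta(prev_octets, cur_octets, bits) * 8
    return 100.0 * delta_bits / (interval_s * link_bps)


# A 32-bit counter that wrapped between two polls still yields the true delta.
wrapped = counter_delta(4294967290, 10)
```

More than one wrap within a polling interval cannot be detected this way, which is one reason to prefer 64-bit counters on fast interfaces.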
[0061] SNMP Host Resources (SNMP v1, v2, v3) monitors may include:
[0062] CPU load—Average % per minute.
[0063] Disk space—% of total disk available for each partition; does not show total size.
[0064] Physical Memory—% of physical memory used.
[0065] Virtual Memory—% of virtual memory used.
[0066] Paging/Memory Swapping—number of page swaps per unit time.
[0067] Printer MIB support—printer health, paper tray capacity, cover status, available storage.
[0068] TCP Port monitors may be used for monitoring transactions of well known Internet services such as HTTP, HTTPS, FTP, POP3, IMAP, IMAPS, SMTP and NNTP.
[0069] Exemplary port monitors may include:
[0070] HTTP—Hypertext Transfer Protocol—Monitors the availability and response time of HTTP Web servers. Checks for error responses.
[0071] HTTPS—HTTP Secure Socket Layer—This monitor supports all of the features of the HTTP monitor, but also supports SSL encapsulation, in which case the communication is encrypted using SSLv2/SSLv3 protocols for increased security. The monitor may establish the SSL session and then perform HTTP tests to ensure service availability.
[0072] SMTP—Simple Mail Transfer Protocol—Monitors the availability and response time of any mail transport application that supports the SMTP protocol (e.g., Microsoft Exchange, Sendmail, Netscape Mail).
[0073] POP3—Post Office Protocol (E-mail)—Monitors the availability and response time of POP3 email services. If a legitimate username and password are supplied, it may log in and validate the server response.
[0074] Generic Port—Any TCP port can be monitored for a response string.
[0075] IMAP4—Internet Message Access Protocol—Monitors the availability and response time of IMAP4 email services. If a legitimate username and password are supplied, it may log in and validate the server response.
[0076] IMAPS—IMAP Secure Socket Layer—This monitor may support all of the features of the IMAP monitor, but may also support SSL encapsulation, in which case the communication is encrypted using SSLv2/SSLv3 protocols for increased security. The monitor may establish the SSL session and then perform IMAP tests to ensure service availability.
[0077] FTP—File Transfer Protocol—Monitors the availability and response time of an FTP port connection. It may send a connection request, receive an OK response and then disconnect. If a legitimate username and password are supplied, it may log in and validate the server response.
[0078] NNTP—Connects to the NNTP service to check whether or not Internet newsgroups are available, receives an OK response and then disconnects. Note that for the POP3, FTP and IMAP monitors, if the user does not specify a username and password, then a successful port connection alone is deemed OK. If the user specifies a username and password, then the test is deemed OK only if an actual login succeeds; otherwise it fails.
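The pass/fail rule for the POP3, FTP and IMAP monitors can be written as a small decision function. This is a sketch only: the network exchange itself is omitted, the function name and arguments are hypothetical, and returning UNKNOWN when no connection can be made is an assumption rather than something the rule above states.

```python
from typing import Optional


def evaluate_port_test(connected: bool,
                       username: Optional[str] = None,
                       password: Optional[str] = None,
                       login_succeeded: Optional[bool] = None) -> str:
    """Apply the connection-only versus login rule to one test outcome."""
    if not connected:
        # Assumption: no response at all is reported as UNKNOWN.
        return "UNKNOWN"
    if not username or not password:
        # No credentials configured: a successful port connection is enough.
        return "OK"
    # Credentials configured: only an actual successful login counts as OK.
    return "OK" if login_succeeded else "FAIL"
```

For example, `evaluate_port_test(True)` is deemed OK on the connection alone, whereas `evaluate_port_test(True, "user", "secret", login_succeeded=False)` fails.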
[0079] The Simple Network Management Protocol (“SNMP”) is a popular protocol for network management. SNMP facilitates communication between a managed device (i.e., a device with an SNMP agent, such as a router for example) and an SNMP manager or management application (which represents a user of network management). The SNMP agent on the managed device provides access to data (managed objects) stored in the managed device. The SNMP manager or management application uses this access to monitor and control the managed device.
[0080] Communication between the managed device and the management operation is via SNMP Protocol Data Units (“PDUs”) that are typically encapsulated in UDP packets. Basically, four kinds of operations are permitted between managers and agents (managed device). The manager can perform a GET (or read) to obtain information from the agent about an attribute of a managed object. The manager can perform a GET-NEXT to do the same for the next object in the tree of objects in the managed device. The manager can perform a SET (or write) to set the value of an attribute of a managed object. Finally, the agent can send a TRAP, or asynchronous notification, to the manager telling it about some event in the managed device.
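The GET, GET-NEXT and SET operations can be illustrated with an in-memory model of an agent's object tree. The sketch below involves no network traffic and no SNMP library; the `ToyAgent` class and `walk` helper are hypothetical names. Object identifiers are modeled as integer tuples, whose natural sort order matches the ordering of objects in the tree.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

Oid = Tuple[int, ...]


class ToyAgent:
    """In-memory stand-in for the managed objects held by an agent."""

    def __init__(self, objects: Dict[Oid, object]) -> None:
        self._objects = dict(objects)
        self._sorted = sorted(self._objects)  # tuples sort in tree order

    def get(self, oid: Oid) -> Optional[object]:
        """GET: read the value of exactly this object, if it exists."""
        return self._objects.get(oid)

    def get_next(self, oid: Oid) -> Optional[Tuple[Oid, object]]:
        """GET-NEXT: return the first object that follows oid in the tree."""
        i = bisect_right(self._sorted, oid)
        if i == len(self._sorted):
            return None  # end of the tree
        nxt = self._sorted[i]
        return nxt, self._objects[nxt]

    def set(self, oid: Oid, value: object) -> None:
        """SET: write the value of an existing object."""
        if oid not in self._objects:
            raise KeyError(oid)
        self._objects[oid] = value


def walk(agent: ToyAgent, root: Oid) -> List[Tuple[Oid, object]]:
    """Repeat GET-NEXT until the result leaves the subtree under root."""
    results: List[Tuple[Oid, object]] = []
    oid = root
    while True:
        item = agent.get_next(oid)
        if item is None or item[0][:len(root)] != root:
            return results
        results.append(item)
        oid = item[0]


agent = ToyAgent({
    (1, 3, 6, 1, 2, 1, 1, 1, 0): "router",  # sysDescr.0
    (1, 3, 6, 1, 2, 1, 1, 3, 0): 12345,     # sysUpTime.0
    (1, 3, 6, 1, 2, 1, 2, 1, 0): 2,         # ifNumber.0
})
system_group = walk(agent, (1, 3, 6, 1, 2, 1, 1))
```

Repeated GET-NEXT calls are what allow a manager to enumerate objects whose exact identifiers it does not know in advance.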
[0081] SNMP agents for different types of devices provide access to objects that are specific to the type of device. To enable the SNMP manager or management application to operate intelligently on the data available in the device, the manager needs to know the names and types of objects in the managed device. This is made possible by Management Information Base (“MIB”) modules, which are specified in MIB files usually provided with managed devices. (See, e.g., the publication Request for Comments
[0082] One embodiment of the present invention may support at least some of the following SNMP MIBs:
[0083] RFC1253—OSPF Version 2
[0084] OSPF {neighbor} Status
[0085] OSPF {neighbor} Errors
[0086] OSPF External LSA
[0087] OSPF LSA Sent/Received
[0088] RFC1514—Host Resources MIB
[0089] Disk Space Utilization
[0090] Physical Memory Utilization
[0091] Swap/Virtual Memory Utilization
[0092] CPU Load
[0093] Running Application/Process Count
[0094] Logged In User Count
[0095] RFC1657—Border Gateway Protocol (BGP-4)
[0096] BGP {neighbor} Status
[0097] BGP {neighbor} Updates Sent/Received
[0099] BGP {neighbor} FSM Transitions
[0100] RFC1697—Relational Database Management
[0101] {rdbms} Status
[0102] {rdbms} Disk Space Utilization
[0103] {rdbms} Transaction Rate
[0104] {rdbms} Disk Reads/Writes
[0105] {rdbms} Page Reads/Writes
[0106] {rdbms} Out Of Space Errors
[0107] RFC1724—RIP Version 2
[0108] RIP Route Changes
[0109] RIP {interface} Updates Sent
[0110] RIP {neighbor} Bad Routes Received
[0111] RFC1759—Printer MIB
[0112] Printer Status
[0113] Printer Paper Capacity
[0114] Printer Door Status
[0115] RFC2115—Frame Relay DTE
[0116] Frame Relay {dlci} Status
[0117] Frame Relay {dlci} FECN/BECN
[0118] Frame Relay {dlci} Discards/DE
[0119] Frame Relay {dlci} Traffic In/Out
[0120] RFC2863—Interfaces Group MIB
[0121] {interface} Status
[0122] {interface} Utilization In/Out
[0123] {interface} Traffic In/Out
[0124] {interface} Packets In/Out
[0125] {interface} Discards In/Out
[0126] {interface} Errors In/Out.
[0127] One embodiment of the present invention may support at least some of the following vendor specific MIBs:
[0128] APC UPS
[0129] UPS Battery Status
[0130] UPS Battery Capacity
[0131] UPS Battery Temperature
[0132] UPS Voltage
[0133] UPS Output Status
[0134] Checkpoint FW-1
[0135] Packets Accepted
[0136] Packets Rejected
[0137] Packets Dropped
[0138] Packets Logged
[0139] CPU Utilization
[0140] Cisco
[0141] Associated Stations
[0142] Neighbor Access Point Count
[0143] Cisco Local Director
[0144] Virtual {server}:{port} Status
[0145] Virtual {server}:{port} Connections
[0146] Virtual {server}:{port} Traffic In/Out
[0147] Virtual {server}:{port} Packets In/Out
[0148] Real {server}:{port} Status
[0149] Real {server}:{port} Connections
[0150] Real {server}:{port} Traffic In/Out
[0151] Real {server}:{port} Packets In/Out
[0152] Failover Cable Status
[0153] Cisco PIX Firewall
[0154] Firewall Status
[0155] Active IP Connections
[0156] Active FTP Connections
[0157] Active HTTP Connections
[0158] Active HTTPS Connections
[0159] Active SMTP Connections
[0160] Active H.323 Connections
[0161] Active NetShow Connections
[0162] Active NFS Connections
[0163] Cisco Router/Catalyst Switch
[0164] {interface} CRC Errors
[0165] Backplane Utilization
[0166] VLAN Traffic In/Out
[0167] VLAN Error In/Out
[0168] CPU Utilization
[0169] Memory Utilization
[0170] Buffer Allocation Failure
[0171] Chassis Temperature
[0172] Fan Status
[0173] Power Supply Status
[0174] Module Status
[0175] Compaq Insight Manager
[0176] Network Interface Status
[0177] Network Interface Utilization In/Out
[0178] Network Interface Alignment Error In/Out
[0179] Network Interface FCS Error In/Out
[0180] CPU Utilization
[0181] Disk Space Utilization
[0182] RAID Controller Status
[0183] RAID Array Chassis Temperature
[0184] RAID Array Fan Status
[0185] RAID Array Power Supply Status
[0186] Foundry Network Router/Switch
[0187] CPU Utilization
[0188] Chassis Temperature
[0189] Fan Status
[0190] Power Supply Status
[0191] HP-UX
[0192] Disk Space Utilization
[0193] Physical Memory Utilization
[0194] Swap/Virtual Memory Utilization
[0195] CPU Load
[0196] Running Application/Process Count
[0197] Logged In User Count
[0198] LAN Manager (Windows Only)
[0199] Windows Login Errors
[0200] System Errors
[0201] Workstation I/O Response
[0202] Active Connections
[0203] Microsoft DHCP Server
[0204] Available Address In Scope
[0205] DISCOVER Request Received
[0206] REQUEST Request Received
[0207] RELEASE Request Received
[0208] OFFER Response Sent
[0209] ACK Request Received
[0210] NACK Request Received
[0211] Microsoft Exchange Server
[0212] Exchange Server Traffic In/Out
[0213] Exchange Server EXDS Access Violations
[0214] Exchange Server EXDS Reads
[0215] Exchange Server EXDS Writes
[0216] Exchange Server EXDS Connections
[0217] Exchange Server Address Book Connections
[0218] Exchange Server LDAP Queries
[0219] Exchange Server MTS
[0220] Exchange Server SMTP Connections
[0221] Exchange Server Failed Connections
[0222] Exchange Server Queue
[0223] Exchange Server Delivered Mails
[0224] Exchange Server Looped Mails
[0225] Exchange Server Active Users
[0226] Exchange Server Active Connections
[0227] Exchange Server Xfer Via IMAP
[0228] Exchange Server Xfer Via POP3
[0229] Exchange Server Thread Pool Usage
[0230] Exchange Server Disk Operation (delete)
[0231] Exchange Server Disk Operation (sync)
[0232] Exchange Server Disk Operation (open)
[0233] Exchange Server Disk Operation (read)
[0234] Exchange Server Disk Operation (write)
[0235] Microsoft Internet Information Server (IIS)
[0236] Incoming/Outgoing Traffic
[0237] Files Sent/Received
[0238] Active Anonymous Users
[0239] Active Authenticated Users
[0240] Active Connections
[0241] GET Requests
[0242] POST Requests
[0243] HEAD Requests
[0244] PUT Requests
[0245] CGI Requests
[0246] Throttled Requests
[0247] Rejected Requests
[0248] Not Found (404) Errors
[0249] Microsoft SQL Server (Using Network Harmoni ACM)
[0250] {database} Status
[0251] {database} Page Reads/Writes
[0252] {database} TDS Packets
[0253] {database} Network Errors
[0254] {database} CPU Utilization
[0255] {database} Threads
[0256] {database} Page Faults
[0257] {database} Users Connected
[0258] {database} Lock Timeouts
[0259] {database} Deadlocks
[0260] {database} Cache Hit Ratio
[0261] {database} Disk Space Utilization
[0262] {database} Transaction Rate
[0263] {database} Log Space Utilization
[0264] {database} Replication Rate
[0265] Oracle 8/9i Database
Oracle DB {database} Status
[0266] Oracle DB {database} Disk Utilization
[0267] Oracle DB {database} Transaction Rate
[0268] Oracle DB {database} Disk Reads/Writes
[0269] Oracle DB {database} Page Reads/Writes
[0270] Oracle DB {database} OutOfSpace Errors
[0271] Oracle DB {database} Query Rate
[0272] Oracle DB {database} Committed/Aborted Transactions
[0273] Oracle Table {table} Space Utilization
[0274] Oracle Table {table} Status
[0275] Oracle Datafile {file} Reads
[0276] Oracle Datafile {file} Writes
[0277] Oracle Replication Status
[0278] Oracle Listener Status
[0279] Oracle SID Connections
[0280] Sun Solaris
[0281] System Interrupts
[0282] Swap In/Out to Disk
[0283] CPU Load
[0284] NET-SNMP (formerly UCD-SNMP)
Disk Space Utilization
[0285] Physical Memory Utilization
[0286] Swap/Virtual Memory Utilization
[0287] CPU Load
[0288] System Interrupts
[0289] Swap In/Out to Disk
[0290] Block I/O Sent/Received
[0291] System Load Average.
[0292] One embodiment of the present invention may support at least some of the following non-SNMP tests:
[0293] Networking
[0294] Ping Packet Loss
[0295] Ping Round Trip Time
[0296] RPC Ping
[0297] Internet Services
[0298] HTTP
[0299] HTTPS
[0300] SMTP
[0301] IMAP
[0302] IMAPS
[0303] POP3
[0304] POP3S
[0305] NNTP
[0306] FTP
[0307] Applications
[0308] Radius
[0309] NTP
[0310] DNS Domain
[0311] SQL Query
[0312] LDAP Search
[0313] DHCP Request
[0314] URL/Web Transaction Test
[0315] Custom
[0316] External Data Feed
[0317] External Plug in Monitors
[0318] Advanced Port Test
[0319] Advanced SNMP Test.
[0320] Exemplary application monitors may include:
[0321] URL transaction monitor—Measures time to complete an entire multi-step URL transaction. Can fill in forms, click on hyperlinks, etc. May work with a proxy and may also support HTTPS.
[0322] Oracle system performance—Measures RDBMS size, RDBMS transaction rate, and table size.
[0323] SQL database query—Measures query response time for a SQL query from databases such as Oracle, Sybase, SQL Server, Postgres and MySQL. Required inputs may include legitimate username, password, database driver selection, database name, and proper SQL query syntax.
[0324] Poet OQL database query—Measures query response time. Required inputs may include legitimate username, password, database name, and proper OQL query syntax.
[0325] LDAP database query—Connects to any directory service supporting an LDAP interface and checks whether the directory service is available within response bounds and provides the correct lookup to a known entity. Required inputs may include base, scope and filter.
[0326] NTP—Monitors time synchronization service running on NTP servers.
[0327] RADIUS—Remote Authentication Dial-In User Service (RFC 2138 and 2139)—Performs a complete authentication test against a RADIUS service.
[0328] DNS—Domain Name Service (RFC 1035)—Uses the DNS service to look up the IP addresses of one or more hosts. It monitors the availability of the service by recording the response times and the results of each request.
[0329] DHCP Monitor—Checks if DHCP service on a host is available, whether it has IP addresses available for lease and how long it takes to answer a lease request.
[0330] RPC Portmapper—Checks if the RPC portmapper is running on a Unix host (a better alternative to ICMP ping for an availability test).
[0331] BEA WebLogic—Checks heap size and transaction rate.
SQL Server—Checks state, transaction rate, write operations performance, cache hit rate, buffers, concurrent users, available database and log space.
[0332] LAN Manager—Checks authentication failures, system errors, I/O performance, and concurrent sessions.
[0333] External data feed (“EDF”) monitors may be used to insert result values into the system using a socket interface. The inserted data is treated just as if it had been collected using internal monitors.
[0334] The present invention can provide a plug-in monitor framework so that a user can write a custom monitor in Java or any other external script or program. The monitor itself and a definition file in XML are put into a plugin directory, and treated as integrated parts of the DGE itself.
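A plug-in definition file of this kind could be loaded as sketched below. The XML layout shown (a `monitor` element with `name`, `command` and `interval` attributes and `threshold` children) is entirely hypothetical, invented here for illustration; the actual definition format is not specified in this description.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Dict

# Hypothetical definition file contents; the real schema is not specified.
EXAMPLE_DEFINITION = """
<monitor name="queue_depth" command="check_queue.sh" interval="300">
  <threshold level="warning" value="100"/>
  <threshold level="critical" value="500"/>
</monitor>
"""


@dataclass
class MonitorDefinition:
    name: str
    command: str          # external script or program to run
    interval_s: int       # polling interval in seconds
    thresholds: Dict[str, float]


def parse_definition(xml_text: str) -> MonitorDefinition:
    """Turn one plug-in definition document into a MonitorDefinition."""
    root = ET.fromstring(xml_text.strip())
    thresholds = {
        t.get("level"): float(t.get("value"))
        for t in root.findall("threshold")
    }
    return MonitorDefinition(
        name=root.get("name"),
        command=root.get("command"),
        interval_s=int(root.get("interval")),
        thresholds=thresholds,
    )


definition = parse_definition(EXAMPLE_DEFINITION)
```

Keeping the definition separate from the monitor program means a new test can be described without changing the code that schedules and runs it.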
[0335] Since IT infrastructure is typically used to deliver business services within an enterprise, it is increasingly important to correlate the different IT components of a business service. As an example, a payroll service may consist of a payroll application on one server, a backend database on another server, and a printer, all connected by a network router. Any of these underlying IT components can fail and cause the payroll service to go down.
[0336] Service views and reports can be created in the exemplary product by grouping together all the underlying components of a service into a consolidated service view. If and when any of the underlying IT components fails, the entire service is reported as down, thus allowing one to measure the impact of underlying IT components on business services.
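The roll-up rule for a service view can be sketched in a few lines. The component names follow the payroll example; the status strings and function names are illustrative assumptions.

```python
from typing import Dict, List


def service_status(component_status: Dict[str, str]) -> str:
    """A service is reported as down as soon as any component is down."""
    if any(status == "down" for status in component_status.values()):
        return "down"
    return "up"


def failed_components(component_status: Dict[str, str]) -> List[str]:
    """Names of the components responsible for an outage, sorted."""
    return sorted(n for n, s in component_status.items() if s == "down")


payroll = {
    "payroll application": "up",
    "backend database": "down",
    "printer": "up",
    "network router": "up",
}
```

Here `service_status(payroll)` reports the payroll service as down, and `failed_components(payroll)` points to the backend database as the cause.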
[0337] Most of the test discovery on a device is done by a separate task. Note that any adds/changes are made to the configuration database which essentially controls the behavior of the DGE processes as described earlier.
[0338] Tests can be provisioned using one or more of the following techniques.
[0339] Port and SNMP tests can be automatically “discovered” by querying the device to see what services are running. The system can automatically detect disk partitions, volumes and their sizes so that the usage is normalized as a percentage. This normalization may also be done for memory, disk partitions, and database tablespace.
[0340] When the auto-discovery for SNMP occurs, the target device database record may be updated with vendor and model information. If a user has checked the SNMP tests box when creating a device, the model and vendor information may be displayed on a configure tests page.
[0341] The present invention can provide a mechanism for refreshing maximum values or SNMP object identifiers (SNMP OID) when an SNMP test has changed. For example, when memory or disk capacity has changed, tests that return percentage-based values would be incorrect unless the maximum value (for determining 100%) is refreshed. Similarly, in the case of a device rebuild, it is possible that the SNMP OIDs may change, thus creating a mismatch between the current SNMP OIDs and the ones discovered during initial provisioning. If any of these situations occurs, the user need only repeat the test provisioning process in the web application for a changed device. The present invention can discover whether any material changes on the device have occurred and highlight those changes on the configure tests page, giving the user the option to also change thresholds and/or actions that apply to the test.
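The effect of a stale maximum value on a percentage-based test can be shown with a small sketch. The `PercentTest` class and its method names are hypothetical; the rediscovery of the device is represented simply by passing in the newly discovered maximum.

```python
class PercentTest:
    """A test whose raw result is normalized against a stored maximum."""

    def __init__(self, maximum: float) -> None:
        if maximum <= 0:
            raise ValueError("maximum must be positive")
        self.maximum = maximum

    def evaluate(self, used: float) -> float:
        """Return usage as a percentage of the stored maximum."""
        return 100.0 * used / self.maximum

    def refresh(self, discovered_maximum: float) -> bool:
        """Store a rediscovered maximum; return True if it changed."""
        if discovered_maximum <= 0:
            raise ValueError("maximum must be positive")
        changed = discovered_maximum != self.maximum
        self.maximum = discovered_maximum
        return changed


disk = PercentTest(maximum=100.0)   # 100 GB partition at provisioning time
before = disk.evaluate(80.0)        # 80 GB used reads as 80%
changed = disk.refresh(200.0)       # partition later grown to 200 GB
after = disk.evaluate(80.0)         # the same usage now reads as 40%
```

The return value of `refresh` is the kind of signal that could be used to highlight a changed test so that its thresholds can be reviewed.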
[0342] Default warning and critical thresholds may be set globally for each type of test. These thresholds can be overridden at the individual device level, or reset for a set of tests in a department or other group. In addition, a service level (SLA) threshold can be set separately to track levels of service or system utilization; this threshold does not trigger alarms or actions.
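Threshold lookup with overrides can be sketched as follows. The precedence used here (device override first, then group override, then the global default) is an assumption, as are the function name and the table layout.

```python
from typing import Dict, Optional, Tuple

Thresholds = Tuple[float, float]  # (warning, critical)


def resolve_thresholds(
    test_type: str,
    device: str,
    group: str,
    defaults: Dict[str, Thresholds],
    group_overrides: Optional[Dict[Tuple[str, str], Thresholds]] = None,
    device_overrides: Optional[Dict[Tuple[str, str], Thresholds]] = None,
) -> Thresholds:
    """Pick the most specific thresholds configured for one test."""
    # Assumed precedence: device level, then group level, then global default.
    lookups = (
        (device_overrides or {}, (device, test_type)),
        (group_overrides or {}, (group, test_type)),
    )
    for table, key in lookups:
        if key in table:
            return table[key]
    return defaults[test_type]


defaults = {"cpu_load": (70.0, 90.0)}
group_overrides = {("finance", "cpu_load"): (60.0, 80.0)}
device_overrides = {("db01", "cpu_load"): (85.0, 95.0)}
```

With these tables, a device named `db01` in the `finance` group resolves to its own device-level thresholds, while any other device in that group resolves to the group values.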
[0343] At this point, the system is configured. Data gathering and storage (in accordance with the configuration) is described in § 4.3.2 below. Then, information extraction, combination and presentation (in accordance with the configuration) is described in § 4.3.3 below.
[0344] To reiterate, under the present invention, data gathering may be performed by distributed data gathering operations (e.g., DGEs). Gathered data may be stored locally by each DGE. Further, DGEs may optionally perform some local data preprocessing such as calculating rate, delta, percentages, etc.
[0346] The remainder of the method
[0347] Referring to trigger (event) block
[0348] Referring back to trigger (event) block
[0349] Referring now to decision block
[0350] Referring back to decision block
[0351] Referring back to trigger (event) block
[0352] In one embodiment, if a threshold has been crossed, an event is generated and fed into a correlation processor. This thread consults a rules engine to determine the root cause of the problem (e.g., upstream devices, IP stack, etc.) and whether a notification needs to be sent or an action needs to be taken.
[0353] In an exemplary embodiment, consistent with the principles of the present invention, all data is stored in a JDBC compliant SQL database such as Oracle or MySQL. Data is collected by the DGEs and stored using JDBC in one of a set of distributed databases which may be local or remote on another server. Such distributed storage minimizes data maintenance requirements and offers parallel processing. All events (an event being a test result that crosses a threshold) may be recorded for historical reporting and archiving. Information may be stored for all events until it is expired from the database. All messages and alerts that have been received may likewise be stored by the appropriate DGE until they are expired from the database. Raw results data (polled data values) may be progressively aggregated over time. In one embodiment, a default aggregation scheme is five-minute samples for a day, 30-minute averages for a week, one-hour averages for three months and daily averages for a year.
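The default aggregation scheme can be sketched as a table of retention tiers plus a bucket-averaging step. Treating three months as 90 days and a year as 365 days is an assumption, and the function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

DAY = 86400  # seconds

# (maximum age of the data in seconds, resolution in seconds)
TIERS = [
    (1 * DAY, 300),     # five-minute samples for a day
    (7 * DAY, 1800),    # 30-minute averages for a week
    (90 * DAY, 3600),   # one-hour averages for three months (assumed 90 days)
    (365 * DAY, DAY),   # daily averages for a year
]


def resolution_for_age(age_s: int) -> Optional[int]:
    """Resolution at which data of a given age is kept, or None if expired."""
    for max_age, bucket_s in TIERS:
        if age_s <= max_age:
            return bucket_s
    return None


def aggregate(samples: Iterable[Tuple[int, float]],
              bucket_s: int) -> List[Tuple[int, float]]:
    """Average (timestamp, value) samples into buckets of bucket_s seconds."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % bucket_s].append(value)
    return [(start, sum(v) / len(v)) for start, v in sorted(buckets.items())]


# Twelve five-minute samples collapse into two 30-minute averages.
raw = [(i * 300, float(i + 1)) for i in range(12)]
half_hour = aggregate(raw, 1800)
```

Each coarser tier can be produced by running `aggregate` over data that has aged out of the finer tier, so storage grows far more slowly than the raw sample count.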
[0354] Recall from blocks
[0355] Based on these severity levels, the visual GUI indicates these severity conditions by unique icons or other means. The following severity states are supported:
[0356] OK, WARNING, CRITICAL: Typical alarming occurs when test results cross warning and critical thresholds set by the end-user or administrator, and may display yellow and red icons or bars on the various status pages. Devices and tests in a normal state may display an OK icon or green color bar.
[0357] UNKNOWN: A test result returns an “unknown” value when the monitor receives no response from the device for that particular test. Unknown results may display a question mark (?) and may also create events that are graphed on reports.
[0358] FAIL: This state occurs when a test result is received, but the value returned is invalid. For example, if a POP3 username or password is incorrect, the device may be reached by the test but the login will fail. Failed tests may be displayed and stored as CRITICAL events and graphed accordingly.
[0359] UNREACHABLE: It is desirable to differentiate between when a device is unavailable due to its own error and when it is unreachable due to the unavailability of a gateway device (e.g. router or switch).
[0360] SUSPENDED: Although not an alarm per se, suspended devices and tests may be displayed with a unique icon to indicate the state.
[0361] Events may be recorded for these state changes in order to track historical activity, or lack thereof. Tests can be ‘suppressed’ when they are in a known condition, and are hidden from view until the state changes after which the suppressed flag is automatically cleared.
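As a rough illustration of recording events on state changes, the following sketch maps numeric test results to the OK, WARNING, CRITICAL and UNKNOWN states and emits one event whenever the state differs from the previous one. It assumes that higher values are worse, that `None` stands for "no response", and that the very first result counts as a change from no prior state; the function names are hypothetical.

```python
def classify(value, warning, critical):
    """Map a numeric result to a severity, assuming higher values are worse.
    None stands for "no response from the device" (UNKNOWN)."""
    if value is None:
        return "UNKNOWN"
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"

def events_from_results(results, warning, critical):
    """Return (timestamp, state) events: one for the first result, one per state change."""
    events = []
    previous = None
    for ts, value in results:
        state = classify(value, warning, critical)
        if state != previous:
            events.append((ts, state))
            previous = state
    return events
```

With thresholds of 100 (warning) and 500 (critical), the results 20, 150 and 30 taken five minutes apart would yield three events: OK, WARNING and OK.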
[0362] An event may be recorded for a test's very first result and for every time a test result crosses a defined threshold. For example, the very first test result for an ICMP round trip time test falls into the “OK” range. Five minutes later, the same test returns a higher value that falls in the “WARNING” range. Another five minutes pass, the test is run again, and the round trip time decreases and falls back into the “OK” range. For the ten minutes that just passed,
[0363] One time text messages, or SNMP traps, or text alarms may be displayed in a separate ‘message’ window. All messages should have a severity and device associated with them, and the user can filter the messages displayed and acknowledge them to remove from the messages window. A user can match on a regular expression and assign a severity to a text message, thus triggering actions and notifications similar to events.
[0364] Recall that events and exceptions trigger actions. An action may be a notification via email or pager, or any other programmable activity such as opening a trouble ticket or restarting a server. Actions may be configured and assigned to tests in the form of a profile, with each profile preferably containing any number of individual sub-actions. Each of these sub-actions may be configured with the following information:
[0365] notification type—email, pager or external script;
[0366] message recipient—email address;
[0367] notify on state—OK, Warning, Critical, Unknown (choose one, several, or all);
[0368] delay—choose to notify immediately or after N test cycles;
[0369] repeat—if the test stays in the trigger state, either don't repeat notification or repeat it every N tests; and
[0370] time of day—the time of day that this sub-action is valid.
[0371] Actions may be assigned to tests by reference. They may be assigned en masse to multiple devices, and thus all the test configurations on each device. Updating an action may automatically update all test configurations to which the action was assigned.
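The sub-action fields listed above could be modeled as follows. This is only a sketch: the `SubAction` class, its field names, and the convention that `cycles_in_state` counts consecutive test cycles starting at 0 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import time

@dataclass(frozen=True)
class SubAction:
    notification_type: str            # "email", "pager" or "script"
    recipient: str                    # e.g. an email address
    notify_on: frozenset              # states that trigger this sub-action
    delay_cycles: int = 0             # 0 = notify immediately
    repeat_every: int = 0             # 0 = do not repeat
    valid_from: time = time(0, 0)     # time-of-day window in which the sub-action is valid
    valid_until: time = time(23, 59, 59)

    def should_notify(self, state, cycles_in_state, now):
        """Decide whether to notify for a test that has been in `state`
        for `cycles_in_state` consecutive cycles (0 = just entered)."""
        if state not in self.notify_on:
            return False
        if not (self.valid_from <= now <= self.valid_until):
            return False
        elapsed = cycles_in_state - self.delay_cycles
        if elapsed < 0:
            return False                  # still inside the configured delay
        if elapsed == 0:
            return True                   # first notification
        return self.repeat_every > 0 and elapsed % self.repeat_every == 0
```

Because the object is immutable, one such profile can be shared by reference across many tests, in the spirit of the assignment-by-reference behavior described above.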
[0372] Having described data gathering (in accordance with the configuration), information extraction, combination and presentation (in accordance with the configuration) is now described in § 4.3.3 below.
[0373] To reiterate, under the present invention, data collection and storage is distributed across various DGEs, each of which stores data locally or in a remote distributed database. Further, at least some data analysis may be distributed across various DGEs, each of which may analyze local data. Thus, a (more) centralized reporting facility is relieved of at least some data storage and analysis responsibilities.
[0374]
[0375] In response to a user query (note that a user login may imply a default query), the user should be authenticated as indicated by block
[0376] Then, the user's authorization is determined as indicated by block
[0377] Referring back to
[0378] Referring back to trigger (event) block
[0379] Although not shown, in one embodiment, the user can “drill-down” into a report to view data or information underlying a presentation result.
[0380] Recall from block
[0381] An “Availability” report may be based on event data which shows the number of threshold violations, the distribution of such violations and total downtime. This report can be generated for a device, or individual tests or a business service. Device availability may be measured by the ICMP packet loss test. Metrics are captured for the device state equal to CRITICAL or UNREACHABLE. The report shows the top n (e.g., n=10) violations by amount of “unavailability”, displaying total time unavailable and % unavailable, with graphics showing either view. Users may link to an availability distribution report/graph for either accounts or devices, depending on which view is being accessed. This histogram is a distribution of the numbers of accounts or devices falling into blocks of
[0382] A “Downtime” report is similar to the Availability report, in that it is based on device availability as measured by the ICMP packet loss test. However, the results are only for device states equal to CRITICAL, rather than CRITICAL and UNREACHABLE. This more accurately reflects the situation when a single device outage occurs, with no regard for any possible parent device outages that may cause a child device to become UNREACHABLE. Again, downtime distribution metrics and a histogram permit administrative users to see account level metrics and drill down to individual device details, whereas end users may only see the device level metrics.
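The difference between the Availability and Downtime views can be sketched by summing the time a device spends in a chosen set of states. The event format (time-ordered `(timestamp_seconds, state)` pairs) and the function name `time_in_states` are assumptions for illustration only.

```python
def time_in_states(events, period_end, states):
    """Sum seconds spent in `states` given a time-ordered list of
    (timestamp_seconds, state) events. Each state lasts until the next
    event, or until period_end for the last event."""
    total = 0
    following = events[1:] + [(period_end, None)]
    for (start, state), (end, _) in zip(events, following):
        if state in states:
            total += end - start
    return total

events = [(0, "OK"), (600, "UNREACHABLE"), (900, "CRITICAL"), (1200, "OK")]
period = 3600
# Availability view: CRITICAL and UNREACHABLE both count as unavailable.
unavailable = time_in_states(events, period, {"CRITICAL", "UNREACHABLE"})
# Downtime view: only the device's own CRITICAL state counts.
downtime = time_in_states(events, period, {"CRITICAL"})
percent_unavailable = 100.0 * unavailable / period
```

In this made-up example the device is unavailable for 600 seconds of the hour, but only 300 of those seconds are attributed to the device itself.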
[0383] A “Top N” report displays the top N (e.g., N=10) accumulations (based on number of events recorded) during the reporting period per account, per device, and per test. Users may select time frame and event severity. Administrative users can view this report at the account level and then drill down on individual devices and tests for more detail. End users running the report may only see the device and test level metrics. An exemplary “Event” report is illustrated in
[0384] A “Number of Events per Day” report displays the number of events recorded each day during the reporting period per account, per device, and per test. Users may select time frame and event severity. Administrative users can view this report at the account level and then drill down on individual devices and tests for more detail. End users running the report may only see the device and test level metrics.
[0385] A “Number of Events” report displays the total number of events recorded during the reporting period per account, per device, and per test. Users may select time frame and event severity. Administrative users can view this report at the account level and then drill down on individual devices and tests for more detail. End users running the report may only see the device and test level metrics.
[0386] An “Event Distribution” report displays the total number of events recorded during the reporting period per account, per device, and per test. Users may select time frame and event severity. Administrative users can view this report at the account level and then drill down on individual devices and tests for more detail. End users running the report may only see the device and test level metrics. The histogram is an event duration distribution of the numbers of accounts/devices/tests falling into bins of equal duration for the reporting period. That is, the reporting period may be divided into an equal number of multi-hour (e.g. 4 hour) blocks, with the number of accounts/devices/tests falling into each of those blocks.
[0387] A “Device Performance” report is a snapshot over a period (e.g., 24 hours), hour by hour, of event summaries for all tests on a single device. Raw event data is analyzed hourly and the worst test state is displayed for each test as a colored block on the grid (24 hours×list of active tests on the device). For example, if a test is CRITICAL for one minute during the hour, the entire hour may be displayed as a red box representing the CRITICAL state. The Device Performance Report only applies to target devices, not to device groups. An exemplary test status summary report is illustrated in
[0388] From the “Test Details” pages, users can view the “raw” data, showing all the individual test results for a single test. The difference between the raw data and viewing events is that events only occur when thresholds are crossed, whereas raw data shows the test results for every test interval. An exemplary test details report is illustrated in
[0389] Statistical reports calculate statistics from raw results data such as mean, 95th and 98th percentiles, max and min values.
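A minimal sketch of such a statistical summary follows. The nearest-rank percentile definition is an assumption, since the text does not say which percentile method is used, and the function names are hypothetical.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct percent of the samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(values):
    """Compute the statistics named in the text from raw result values."""
    return {
        "mean": sum(values) / len(values),
        "p95": percentile(values, 95),
        "p98": percentile(values, 98),
        "max": max(values),
        "min": min(values),
    }
```

Percentiles are often preferred over the maximum for capacity reporting because a single outlier sample does not dominate the result.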
[0390] Trend reports can use a regression algorithm for analyzing raw data and predicting the number of days to hit the specified thresholds. An exemplary service instability report is illustrated in
[0391] Users can define custom reports in which devices, tests and the type of report to generate for these devices (e.g., top
[0392] In one embodiment, the method
[0393]
[0394] The processor(s)
[0395] In one embodiment, the machine
[0396] A user may enter commands and information into the personal computer through input devices
[0397] The output device(s)
[0398] Various refinements to the present invention are now described. Various embodiments of the present invention may include some or all of these refinements.
[0399] A refined embodiment of the present invention can eliminate sending multiple notifications when a device goes down or is unavailable. Based on the inherent dependency between the ping packet loss test results and the availability of the device, if the ping packet loss test returns a CRITICAL result, then communication with the device has somehow been lost. Configured notifications for all other tests on the device are suppressed until packet loss returns to normal. Smart notification may include:
[0400] Suppressing alarms for all other device events. Smart alarming shows only actual failed tests.
[0401] Identifying relationships between devices to correlate and identify the actual point of network failure/outage and suppress alarms downstream.
[0402] Creating multi-level action profiles to handle event escalation.
[0403] A refined embodiment of the present invention supports device dependencies to suppress excessive notifications when a gateway-type device has gone down or is unavailable. Switches, routers, and other hardware are often the physical gateways that govern whether other network devices are reachable. Monitoring of many devices may be impeded if one of these critical “parent devices” becomes unavailable. To provide correlation, a parent and child hierarchy is created between monitored devices in order to distinguish the difference between a CRITICAL test on a device and an UNREACHABLE one.
[0404] In many cases, a device is considered to be “reachable”. However, if a test on a device is CRITICAL (for all thresholds), UNKNOWN, or FAIL, some additional processing is used to determine if the device is truly reachable. Such additional processing may involve the following. First, a current packet loss test is examined for the device. If such a test exists and the packet loss test result is not CRITICAL, the device is considered reachable. If no such test exists, all immediate parent devices are examined. If the device has no parents, the device is considered reachable and the result of the test is the measured value. The device is only considered unreachable if all the immediate parents have a “current” packet loss test result of 100%. “Old” packet loss tests (those that occurred prior to the state change in the child's test result (e.g., OK to CRITICAL)) and the absence of a packet loss test for a parent have no effect on the result.
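The reachability check described above might be sketched as follows. The `Device` class and its fields are hypothetical. The sketch also makes two interpretive assumptions that the text leaves open: a device whose own packet loss test is CRITICAL falls through to the parent check, and parents without a current packet loss result are simply ignored.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Device:
    name: str
    packet_loss: Optional[float] = None   # current packet loss in percent; None = no current test
    loss_is_critical: bool = False        # whether that result crossed the CRITICAL threshold
    parents: list = field(default_factory=list)   # immediate parent devices

def is_reachable(device):
    """Return False only when the parent checks indicate the device is UNREACHABLE."""
    # A current, non-CRITICAL packet loss result means the device itself answers.
    if device.packet_loss is not None and not device.loss_is_critical:
        return True
    # Without parents there is nothing upstream to blame.
    if not device.parents:
        return True
    # Assumption: parents lacking a current packet loss result have no effect.
    current = [p.packet_loss for p in device.parents if p.packet_loss is not None]
    return not (current and all(loss == 100.0 for loss in current))
```

A child behind a single router that is losing 100% of packets is therefore reported as UNREACHABLE rather than CRITICAL, while a child with at least one responsive parent keeps its measured result.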
[0405] A refined embodiment of the present invention supports a “federated user model”. End user security may be controlled by permissions granted to a “User Group”. Each end user can only belong to a single “Account”, and each Account can only belong to a single User Group. Thus, an end user belongs to one and only one User Group for ease of administration. End users of one account are isolated from all other accounts, thus allowing various departments within an enterprise to each have a fully functional “virtual” copy of the invention.
[0406] Each User Group may have a unique privilege and limits matrix as defined by an Administrative user with administrative control over the User Group. Privileges for User Groups may be defined for devices, tests & actions. Limits at the User Group level may be defined for minimum test interval, max devices, max tests, max actions and max reports.
[0407] In addition to end-users, the system permits separate administrative users who can look at multiple ‘accounts’ (which a normal end-user cannot do). This framework allows senior management or central operation centers or customer care to report on multiple departments that they are responsible for. This eliminates the need for multiple deployments of the same product, while allowing seamless reporting across services that span IT infrastructure managed by different departments in an enterprise.
[0408] Administrative user security may be controlled by permissions granted to an Administrative Group. Administrative Groups and User Groups have a many-to-many relationship, allowing the administration of User Groups by numerous administrators who have varying permissions. Privileges for Administrative Groups may be defined for accounts, users, user groups, limits, devices, tests, and actions. A separate set of privileges is defined for each relationship between an Administrative Group and a User Group. A very simple configuration could establish the organization's Superuser as the only administrative user and all end-users belonging to a single User Group. In contrast, a complex organizational model might require the establishment of Administrative Groups for Network Administration, Database Administration, and Customer Service, with User Groups for C-level executives, IT Support, Marketing, etc.
[0409] Unlike administrators, the actions of “Superusers” are not constrained by a privileges matrix—they can perform any of the actions in the matrix on any user. Superusers create Administrative Groups and User Groups, and define the privileges the former has over the latter. The ‘superuser’ accounts are used to effectively bootstrap the system.
[0410] “Privileges” are the right to create, read, update, delete, suspend, etc. Each User Group has a privileges matrix associated with it that describes what operations the members of that User Group can perform. As mentioned previously, there is a similar, but more complex privileges matrix that describes what operations a member of an Administrative Group can do to administer one or more User Groups.
[0411] “Limits” are numerical bounds associated with a User Group that define minimum test interval, maximum devices, maximum tests, maximum actions and maximum reports for end-user accounts. An end user's actions are constrained by the Limits object associated with their User Group, unless there is another Limits object that is associated with the particular user (e.g. Read-only user) that would override the limits imposed by the User Group.
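The override rule for Limits can be illustrated with a short sketch. The `Limits` class, its field names and units, and the helper functions are hypothetical; only the five bounds and the "user-specific object overrides the User Group's" rule come from the text.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Limits:
    min_test_interval: int   # assumed to be in seconds
    max_devices: int
    max_tests: int
    max_actions: int
    max_reports: int

def effective_limits(group_limits: Limits, user_limits: Optional[Limits] = None) -> Limits:
    """A Limits object attached to the user, when present, overrides the User Group's."""
    return user_limits if user_limits is not None else group_limits

def can_add_device(limits: Limits, current_devices: int) -> bool:
    """Check a proposed device creation against the maximum-devices bound."""
    return current_devices < limits.max_devices
```

For instance, a read-only user could be given a per-user `Limits` object with every maximum set to zero, which would block creation regardless of the group's bounds.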
[0412] Administrative users occasionally need to directly administer an end-user's account, by logging into that account and providing on-line support to view the account and perform operations. This capability is especially helpful when the end-user has only limited privileges to administer their own account. To get around those limited privileges, the administrative user need not use the end-user's login/password, but rather “masquerades” as the end-user, subject only to the administrative user's own privileges, which are often more extensive.
[0413] Administrators who have permission to create end users and their accounts have the option of creating users with read-only capabilities. In this way, administrators may give certain end users access to large amounts of data in the system, but without authority to change any of the characteristics of the devices, tests, actions or reports they are viewing.
[0414] When representing an end user, an administrator (if given proper create privileges) may create devices and tests for the end user in the end user's own account, via a “Represent” feature. One option the administrator has at the time of device creation is to make the device read-only. The tests on the read-only device become read-only as well. This feature was created to enable an end-user to observe the activity on a mission-critical network component, such as a switch or even a switch port, but not have the authority to modify its device or test settings.
[0415] Data may be collected from all DGEs and presented as a consolidated view to the user, primarily using a Web based interface. An end user only needs a commonly available Web browser to access the full functionality and reporting features of the product. Real-time status views are available for all accounts or devices or tests within an administrator's domain, all devices or tests within an account, or all tests on a single device or device group. Users can drill down on specific accounts, devices, and tests, and see six-hour, daily, weekly, monthly, and yearly performance information.
[0416] By using user administration pages, users can set default filters for the account and device summary pages to filter out devices in OK state, etc. For example, administrators may elect to filter out accounts and devices that are in an “OK” status. Especially for large deployments, this can dramatically cut down on the number of entries a user must scroll through to have a clear snapshot of system health. A toggle switch on the account and device summary pages may be used to quickly disable or enable the filter(s).
[0417] General administration features, including DGE location and host creation; administration of Administrative Group domains; administration of User Group thresholds, privileges and actions; account and user management; administration of devices, device groups, tests and actions; and password management, may all be supported by a graphical user interface.
[0418] Via either an “Update Device” page or during device suspension, a user can enter a comment that will display on a “Device Status Summary” page. This could be used to identify why a device is being suspended, or as general information on the current state of the device.
[0419] The present invention can export data to other systems, or can send notifications to trouble ticketing or other NOC management tools. In addition, the present invention can import data from third party systems, such as OpenView from Hewlett-Packard, to provide a single administrative and analytical interface to all performance management measurements. More specifically, the present invention can import device name, IP address, SNMP community string and topology information from the HP OpenView NNM database, thereby complementing OpenView's topology discovery with the enhanced reporting capabilities of the present invention. Devices are automatically added/removed as the nodes are added or removed from NNM. Traps can be sent between NNM and the present invention as desired.
[0420] The present invention can open trouble tickets automatically using the Remedy notification plug-in. It can also open trouble tickets automatically in RT using the RT notification plug-in.
[0421] The following exemplifies how the present invention may be deployed on a system and administered. All configuration can be done by the GUI or via the API.
[0422] Physical locations at which Data Gathering Elements (DGEs) are installed are created in the system; these locations are arbitrarily defined by the superuser. Recall that a DGE is a data collection agent assigned to a “location.” To create a new DGE, its IP address and location are provided. Since multiple DGEs can exist in one location, soft and hard limits that define DGE load balancing may be set. The present invention may use a load balancing mechanism based on configurable device limits to ensure that DGE hosts are not overloaded. In this embodiment, each device is provisioned to a DGE when it is created based on the following heuristics:
[0423] 1. Find a DGE that services the location of the device.
[0424] 2. If there are many such DGEs and the user already has devices on one of them, pick that DGE.
[0425] 3. If there are many DGEs where the user already has devices, choose the one that's the least loaded.
[0426] 4. If there aren't any DGEs on which the user already has a device, pick the least loaded DGE that does service the location of the device.
[0427] 5. Only pick a DGE that has available capacity. “Available” is defined as “below critical level” if the DGE already has devices for the user, else “below warning level”.
[0428] 6. If there's no DGE that services the device location and has available capacity, log the error.
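The six heuristics above can be condensed into a short selection routine. This is a sketch under stated assumptions: the `DGE` class and `choose_dge` function are hypothetical, load is measured by device count, and the warning and critical levels are taken to correspond to the soft and hard limits respectively.

```python
from dataclasses import dataclass, field

@dataclass
class DGE:
    name: str
    location: str
    device_count: int        # assumed measure of load
    warning_level: int       # assumed to be the soft limit
    critical_level: int      # assumed to be the hard limit
    users: set = field(default_factory=set)   # users that already have devices here

def choose_dge(dges, location, user):
    """Pick a DGE for a new device, or return None if none has capacity."""
    # Step 1: only DGEs that service the device's location.
    candidates = [d for d in dges if d.location == location]
    # Steps 2, 3, 5: prefer DGEs already hosting the user's devices;
    # these may fill up to the critical level. Pick the least loaded.
    existing = [d for d in candidates
                if user in d.users and d.device_count < d.critical_level]
    if existing:
        return min(existing, key=lambda d: d.device_count)
    # Steps 4, 5: otherwise the least loaded DGE below its warning level.
    fresh = [d for d in candidates
             if user not in d.users and d.device_count < d.warning_level]
    if fresh:
        return min(fresh, key=lambda d: d.device_count)
    # Step 6: no capacity at this location; the caller is expected to log the error.
    return None
```

Keeping a user's devices on one DGE where possible means most of that user's data lands in a single local database, which is one plausible motivation for step 2.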
[0429] After creating the DGEs in the system, user groups and accounts are created in the configuration database. After this, devices and tests are provisioned in the system, typically using an auto-discovery tool which finds all IP devices and available tests on them in the given subnets. Default thresholds and actions are used if none are provided by the user. At this stage, the system is ready to be operational. When a DGE is enabled (either a process on the same machine as the configuration database or on another machine), it connects to the configuration database, identifies itself and downloads its configuration. After downloading its configuration, the DGE starts monitoring tests as described earlier.
[0430] The fault and performance monitoring system of the present invention can be set up and installed in a stand-alone environment in a few hours. Default test settings, action profiles, and reports may be pre-loaded into the system. Lists of devices can be batch-imported automatically into the system using the API.
[0431] As can be appreciated from the foregoing disclosure, the present invention discloses apparatus, data structures and methods for combining system fault and performance monitoring. By using distributed data collection and storage of performance data, storage requirements are relaxed and real-time performance monitoring is possible. Data collection and storage elements can be easily configured via a central configuration database. The configuration database can be easily updated and changed. A federated user model allows normal end users to monitor devices relevant to the part of a service they are responsible for, while allowing administrative users to view the fault and performance of a service in an end-to-end manner across multiple accounts or departments.