Network Management
Principles & Objectives
- Business Impact
- Business problems
- Is the remote server available to complete my transaction?
- Is the remote database able to process the order information?
- Is the application down or is the network between us very slow?
- What is the cycle time for an entry to complete all transactions across all applications?
- Can you tell me that an application is having problems, before it fails completely?
- Does the application have adequate resources at peak usage times?
- What is the quickest method to inform all resolver groups when a problem occurs?
- How do we know when a complex problem is completely resolved?
- Business solutions
- We can build a monitoring/metrics environment to directly address these issues.
- No single system, process or infrastructure can be everything to everybody.
- Specific problems are a function of specific parameters, and can be addressed in a methodical manner.
- The most critical Enterprise issues can be addressed, one-at-a-time, by a system based on cooperation of resolver groups and shared learning.
- Key Network Metrics tasks
- Gather metrics beyond simple availability of the hubs, routers and switches.
- Gather metrics about hosts and applications and how they use the network.
- Measure Application network utilization.
- Are Applications efficient and using appropriate interfaces and routes?
- Are Applications and servers using network routing, or acting as their own routers?
- Should application hosts do their own routing?
- Measure Application network response time.
- How efficiently do remote applications communicate with each other?
- How does application A communicate with application B?
- Where are the bottlenecks in application traffic?
- Some hosts, such as mainframes, may not have SNMP facilities.
- Routers are the *only* source for that application traffic information.
- Without adequate router information, such application traffic cannot be measured.
- Firewall management
- Firewall violation attempts are managed by the firewall itself.
- Firewall breaches can only be measured outside the firewall.
- Correlate network metrics to all Enterprise functions:
- Operating System (OS), hardware and firmware metrics
- Database (RDBMS) metrics
- Application metrics
- Middleware metrics
- Managing Networks
- Open Systems Interconnection Reference Model (OSI-RM)
- Framework for network management.
- Specific Management Functional Areas (SMFA's)
- Fault Management
- A fault is anything that causes network systems to fail to meet their operational objectives.
- Elements involved:
- Detection of the fault.
- Isolation of the fault to a particular component.
- Correction of the fault.
- Specific examples:
- Maintenance of error logs.
- Error detection processes.
- Diagnostic testing procedures.
- Accounting Management
- Quantifying individual, group and organizational network usage.
- Provides mechanisms to:
- Identify costs.
- Inform users of costs incurred.
- Associate tariff information with resource usage.
- Configuration Management
- Detailing parameters of:
- Network configuration.
- Current topology.
- Operational status of the network.
- Association of user names with devices.
- Facility to change network configurations, as required.
- Performance Management
- High performance usually implies a low incidence of faults.
- Beyond minimizing faults:
- Gather statistics on the operation of the network.
- Maintain and analyze logs of the state of the network systems.
- Optimize network operations.
- Security Management
- Create, delete and control security services.
- Distribute security related information, as required.
- Report security related events.
- Enforce secure passwords and user access.
- Control access to networks, subnets and hosts via:
- Bridges
- Routers
- Gateways
- Switches
- Provide remote access to network elements for purposes of network diagnostics.
- Managing TCP/IP-based Internetworks
- All internetworks are based on some type of layered architecture, e.g.:
- Differences from other internetwork protocols:
- TCP/IP-based internetworks are designed to be multi-vendor systems.
- Systems built solely upon SNA or DECnet are single-vendor systems.
- Management involves different assumptions, depending on how many vendors are providing network equipment.
- Management needs to be an integral part of the internetwork wherever subsystems from multiple vendors are integrated.
- Also, consider the implementation of the network management systems and protocols available for their use.
- Simple Network Management Protocol (SNMP)
- Ubiquitous support from:
- Network Management systems vendors
- Internetworking Device manufacturers
- Common Management Information Protocol (CMIP)
- International support from organizations, e.g.,
- CMIP over TCP/IP (CMOT)
- Support is not as widespread as for SNMP.
- Desktop Management Task Force (DMTF)
- Standard set of Application Programming Interfaces (API's) that access and manage:
- Desktop systems
- Components
- Related peripherals
- Desktop Management Interface (DMI)
- Focus of DMI is on desktop and LAN management.
- Independent of:
- Hardware systems
- Software operating systems
- Network operating systems
- Designed to be integrated with all network management protocols and consoles, such as SNMP.
- Internets consist of multiple physical networks interconnected by IP routers.
- A single manager can control heterogeneous routers.
- Controlled entities may not share a common link level protocol.
- The set of machines a manager controls may lie at arbitrary points in an internet.
- The internet management protocol used with TCP/IP operates above the transport level.
- Protocols for internet management operate at the application level and communicate using TCP/IP transport level protocols.
- One set of protocols can be used for all networks.
- Designed without regard to the hardware, these protocols can be used for all managed devices.
- All routers respond to exactly the same set of commands.
- IP communications allow a manager to control routers across an entire TCP/IP internet without having direct attachment to every physical network and router.
- Architectural Model
- Each participating host or router runs a server program called a management agent.
- A manager invokes client software on the local host and specifies a particular agent.
- Once the client contacts the agent:
- It sends queries to obtain information.
- It sends commands to change conditions in the device.
- An authentication mechanism ensures that only authorized managers can access or control a specific device.
- Some protocols support multiple levels of authorization, subdividing into specific privileges a manager may assume on a given device.
- Protocol Architecture
- TCP/IP network management protocols specify separate standards for:
- Communication of information
- Defines the format and meaning of messages clients and servers exchange.
- Defines the form of names and addresses.
- Data management
- Specifies which data items a device may keep.
- Specifies the name of each data item and syntax used to name it.
- Simple Network Management Protocol (SNMP)
- SNMP Model
- Managed nodes
- Hosts
- Routers
- Bridges
- Printers
- Any device capable of communicating status information to the outside world.
- Direct SNMP management requires a resident and running process called an SNMP agent.
- Agents maintain a local database of variables that:
- Describe its state
- Describe its history
- Affect its operation
- Management station
- General purpose computers running special management software.
- Contains one or more processes that:
- Communicate with the agents over the network
- Issue commands
- Get responses
- All of the intelligence is in the management stations.
- Agents are kept as simple as possible and minimize their impact on the devices on which they run.
- Management information
- Management Information Base (MIB)
- Specifies the data items a device must keep and operations allowed on each.
- Each device maintains one or more variables that describe its state.
- In SNMP, these are called "objects".
- SNMP Objects have state.
- SNMP Objects have *no* methods, other than reading and writing their values.
- The collection of all possible objects in a network is given in a data structure called the MIB.
- Ten (10) MIB-II categories:
- System
- Seven (7) objects
- Name, location and description of the equipment
- Interfaces
- Twenty-three (23) objects
- Network interfaces and their measured traffic
- AT
- Three (3) objects
- Address translation (e.g., ARP mapping) -- deprecated
- IP
- Forty-two (42) objects
- Internet Protocol (IP) packet statistics
- ICMP
- Twenty-six (26) objects
- Statistics about Internet Control Message Protocol (ICMP) messages received
- TCP
- Nineteen (19) objects
- Transmission Control Protocol (TCP) algorithms, parameters and statistics
- UDP
- Six (6) objects
- User Datagram Protocol (UDP) traffic statistics
- EGP
- Twenty (20) objects
- Exterior Gateway Protocol (EGP) traffic statistics
- Transmission
- Zero (0) objects
- Reserved for media-specific MIB's
- SNMP
- Twenty-nine (29) objects
- SNMP traffic statistics
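As a rough illustration of how the ten groups above are laid out, the following Python sketch maps each MIB-II group to its sub-identifier under the mib-2 subtree (1.3.6.1.2.1) and to the object count given in the list. The helper names `group_oid` and `group_for_oid` are hypothetical, and the sub-identifiers come from the standard MIB-II layout rather than from this outline.

```python
MIB2_PREFIX = "1.3.6.1.2.1"  # iso.org.dod.internet.mgmt.mib-2

# group name -> (sub-identifier under mib-2, object count from the list above)
MIB2_GROUPS = {
    "system": (1, 7),
    "interfaces": (2, 23),
    "at": (3, 3),             # deprecated address translation group
    "ip": (4, 42),
    "icmp": (5, 26),
    "tcp": (6, 19),
    "udp": (7, 6),
    "egp": (8, 20),
    "transmission": (10, 0),  # placeholder for media-specific MIBs
    "snmp": (11, 29),
}


def group_oid(name):
    """Return the dotted OID of a MIB-II group, e.g. 'system'."""
    sub_id, _count = MIB2_GROUPS[name]
    return f"{MIB2_PREFIX}.{sub_id}"


def group_for_oid(oid):
    """Return the MIB-II group an OID belongs to, or None if outside mib-2."""
    parts = oid.strip(".").split(".")
    prefix = MIB2_PREFIX.split(".")
    if parts[:len(prefix)] != prefix or len(parts) <= len(prefix):
        return None
    sub_id = int(parts[len(prefix)])
    for name, (group_id, _count) in MIB2_GROUPS.items():
        if group_id == sub_id:
            return name
    return None


TOTAL_OBJECTS = sum(count for _sub_id, count in MIB2_GROUPS.values())
```

With the counts listed above, `TOTAL_OBJECTS` comes to 175, and an instance such as 1.3.6.1.2.1.1.5.0 (sysName.0) resolves to the System group.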
- Advantages:
- Vendors can include SNMP agent software in products, such as routers, with guaranteed adherence to standards, even after new MIB items are defined.
- Same network management client software can manage devices with different versions of a MIB.
- Since all devices use the same language for communication, they can all parse queries and provide requested information or reply with an error message indicating absence of the requested item.
- Each significant event must be defined in a MIB module.
- A management protocol
- Abstract Syntax Notation One (ASN.1)
- A standard object definition language, including encoding rules.
- Structure of Management Information (SMI)
- A subset of ASN.1 with SNMP-specific extensions (a "sub-super-set") that defines SNMP data structures.
- Four (4) key macros and eight (8) new data types
- Not part of ASN.1
- Heavily used throughout SNMP
- RFC 1442
- Management stations interact with the agents using the SNMP protocol.
- When an agent notices that a significant event has occurred, it immediately reports that event to all management stations in its configuration list (SNMP trap).
- SNMP casts all operations in a fetch-store paradigm.
- Rather than requiring a separate command for each operation on a data item, SNMP contains only two (2) commands:
- Allow a manager to fetch a value from a data item.
- Allow a manager to store a value into a data item.
- Advantages:
- Flexibility
- Accommodates arbitrary commands in an elegant and simple framework.
- Simplicity
- Simple to implement, understand and debug.
- Two commands avoid the complexity associated with special cases for multiple commands.
- Stability
- Its definition remains fixed.
- New data items may be added to the MIB.
- New operations are defined as side effects of storing into those items.
- Actually, SNMP uses a handful of commands at its primary interface:
- get-request
- Fetch a value from one or more variables.
- get-next-request
- Fetch the variable following this one, without knowing its exact name.
- Allow managers to iterate through tables of items.
- get-bulk-request
- set-request
- Store a value in one or more variables.
- inform-request
- Manager-to-manager message describing local MIB.
- trap
- Asynchronous reply triggered by an event -- agent-to-manager report.
- Allow managers to program servers to send information as events occur.
- get-response
- Reply to a fetch operation
- Operations must be atomic.
- If a single message specifies operation across multiple variables, the server either performs *all* operations or *none* of them.
- No assignment will be made if any of them are in error.
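The fetch-store paradigm, table iteration with get-next, and the all-or-nothing rule can be sketched with a toy in-memory agent. This is a minimal illustration, not a real SNMP implementation: the class `ToyAgent`, the `walk` helper and the notion of a "writable" set are all assumptions made for the example. The three OIDs are the standard instances of sysUpTime, sysName and sysLocation.

```python
def _oid_key(oid):
    """Turn '1.3.6.1' into (1, 3, 6, 1) so OIDs compare numerically."""
    return tuple(int(part) for part in oid.split("."))


class ToyAgent:
    """In-memory stand-in for an agent (hypothetical, for illustration only)."""

    def __init__(self, variables, writable):
        self._vars = dict(variables)
        self._writable = set(writable)

    def get(self, oids):
        """Fetch: return the current value of each requested variable."""
        return {oid: self._vars[oid] for oid in oids}

    def get_next(self, oid):
        """Return (oid, value) for the first variable after `oid`, or None."""
        later = [o for o in self._vars if _oid_key(o) > _oid_key(oid)]
        if not later:
            return None
        nxt = min(later, key=_oid_key)
        return nxt, self._vars[nxt]

    def set(self, bindings):
        """Store: validate every assignment first, then apply all or none."""
        for oid in bindings:
            if oid not in self._vars or oid not in self._writable:
                return False  # one bad binding rejects the whole request
        self._vars.update(bindings)
        return True


def walk(agent, start):
    """Visit every variable after `start` using only get_next."""
    rows = []
    step = agent.get_next(start)
    while step is not None:
        rows.append(step)
        step = agent.get_next(step[0])
    return rows


SYS_UPTIME = "1.3.6.1.2.1.1.3.0"    # read-only in this sketch
SYS_NAME = "1.3.6.1.2.1.1.5.0"
SYS_LOCATION = "1.3.6.1.2.1.1.6.0"

agent = ToyAgent(
    {SYS_UPTIME: 12345, SYS_NAME: "router1", SYS_LOCATION: "lab"},
    writable={SYS_NAME, SYS_LOCATION},
)
```

A request such as `agent.set({SYS_NAME: "core1", SYS_UPTIME: 0})` is rejected as a whole because sysUpTime is not writable here, so sysName keeps its old value.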
- Trap-Directed Polling
- Management stations poll agents regularly and at long intervals.
- This polling accelerates upon receipt of a trap.
- Since a trap may simply indicate a condition or state, the accelerated polling gathers corollary data.
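One way to picture the polling behavior described above is a scheduler that uses a long interval by default and a short one for a while after a trap arrives. The sketch below is an assumption-laden illustration: the class name, the interval values and the length of the fast-polling window are all hypothetical, and no SNMP traffic is actually sent.

```python
class TrapDirectedPoller:
    """Chooses a polling interval per agent (hypothetical sketch)."""

    def __init__(self, normal_interval=300.0, fast_interval=10.0, fast_window=120.0):
        self.normal_interval = normal_interval  # seconds between routine polls
        self.fast_interval = fast_interval      # seconds between polls after a trap
        self.fast_window = fast_window          # how long fast polling lasts
        self._fast_until = {}                   # agent name -> time fast polling ends

    def on_trap(self, agent, now):
        """Record a trap from `agent` received at time `now` (seconds)."""
        self._fast_until[agent] = now + self.fast_window

    def interval_for(self, agent, now):
        """Return the polling interval that applies to `agent` at time `now`."""
        if now < self._fast_until.get(agent, float("-inf")):
            return self.fast_interval
        return self.normal_interval


poller = TrapDirectedPoller()
```

Only the agent that sent the trap is polled faster, and it drops back to the long interval once the window expires.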
- Proxy Agent
- An agent that watches over one or more non-SNMP aware devices.
- An agent that communicates with the management station on behalf of the non-SNMP aware device.
- Possibly, the proxy communicates with the devices via some other protocol.
- Industry Support
- General categories of devices supporting SNMP agents:
- Wiring hubs
- Network servers & associated operating systems
- Network interface cards & associated hosts
- Internetworking devices, e.g., bridges & routers
- Test equipment, e.g., network monitors & analyzers
- Other devices, e.g., uninterruptible power supplies
- SNMP can remain hidden and transparent to the manager's user interface.
- RMON (MIB) functions
- Alarms
- Compare statistical samples with preset thresholds
- Generate alarms when a particular threshold is crossed
- Thresholds for any variable
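The threshold comparison above can be sketched as a small function that scans samples and reports crossings. The use of separate rising and falling thresholds, so that a value hovering near one threshold does not raise repeated alarms, is an assumption of this sketch, as is the function name `threshold_events`.

```python
def threshold_events(samples, rising, falling):
    """Return (index, kind, value) for each threshold crossing.

    A "rising" event is reported when a sample reaches `rising`; it is not
    reported again until a "falling" event has occurred, and vice versa.
    """
    if falling > rising:
        raise ValueError("falling threshold must not exceed rising threshold")
    events = []
    last = None  # kind of the most recent event, if any
    for index, value in enumerate(samples):
        if value >= rising and last != "rising":
            events.append((index, "rising", value))
            last = "rising"
        elif value <= falling and last != "falling":
            events.append((index, "falling", value))
            last = "falling"
    return events


# Hypothetical utilization samples, in percent.
utilization = [50, 95, 97, 60, 92, 20, 15, 99]
alarms = threshold_events(utilization, rising=90, falling=30)
```

For these samples the function reports a rising alarm at index 1, a falling alarm at index 5 and a second rising alarm at index 7; the samples at indexes 2 and 4 exceed the threshold but do not re-trigger.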
- Event
- Control the generation and notification of events
- Distributed logging
- May include the use of SNMP Trap messages
- Filter
- Allow packets to be matched according to a filter equation
- Filters to capture and analyze individual packets
- History
- Record periodic statistical samples over time
- Historical statistics
- Historical trend graphing
- Performance tuning
- Statistical analysis
- Host
- Maintain statistics of the hosts on the network
- Maintain media access control (MAC) addresses of active hosts
- Host table of all addresses
- Node traffic statistics
- Packets sent and received
- Broadcasts
- Multicasts
- Error packets sent
- Host time table
- Relative order in which each host was discovered by the agent
- Improves performance and reduces network traffic
- HostTopN
- Provide reports from host table statistics
- Extensive processing performed remotely at the agent
- Minimizes network traffic and load
- Sorted host statistics
- Which hosts are at the top of the list for a particular statistic
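The point of HostTopN is that the sorting happens at the agent, so only the top few rows cross the network. A minimal sketch of that ranking step follows; the table layout, the statistic names and the function `host_top_n` are hypothetical and are not taken from the RMON MIB definitions.

```python
import heapq


def host_top_n(host_table, statistic, n):
    """Return the n hosts with the largest value of `statistic`.

    host_table maps a MAC address to a dict of counters.
    The result is a list of (mac, value) pairs, largest first.
    """
    ranked = heapq.nlargest(
        n, host_table.items(), key=lambda item: item[1][statistic]
    )
    return [(mac, stats[statistic]) for mac, stats in ranked]


# Hypothetical host table as an agent might hold it.
hosts = {
    "00:00:0c:aa:01:01": {"packets_out": 1200, "errors_out": 3},
    "00:00:0c:aa:01:02": {"packets_out": 98000, "errors_out": 0},
    "00:00:0c:aa:01:03": {"packets_out": 4500, "errors_out": 41},
}
top_talkers = host_top_n(hosts, "packets_out", 2)
```

The management station then receives two rows instead of the whole host table.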
- Matrix
- Store statistics in a traffic matrix
- Regarding conversations between host pairs
- Traffic matrices for all nodes
- Amount of traffic and number of errors between pairs of nodes
- One (1) source and one (1) destination address per pair
- For each pair, maintains counters, between nodes, for
- Numbers of packets
- Numbers of bytes
- Numbers of error packets
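The per-pair counters described above amount to a table keyed by (source, destination). The following sketch shows one possible shape for such a table; the class `TrafficMatrix` and its method names are assumptions made for illustration.

```python
from collections import defaultdict


class TrafficMatrix:
    """Counters per (source, destination) address pair (hypothetical sketch)."""

    def __init__(self):
        self._pairs = defaultdict(
            lambda: {"packets": 0, "bytes": 0, "errors": 0}
        )

    def record(self, src, dst, size, error=False):
        """Account for one observed packet of `size` bytes from src to dst."""
        counters = self._pairs[(src, dst)]
        counters["packets"] += 1
        counters["bytes"] += size
        if error:
            counters["errors"] += 1

    def stats(self, src, dst):
        """Return a copy of the counters for one direction of a conversation."""
        if (src, dst) not in self._pairs:
            return {"packets": 0, "bytes": 0, "errors": 0}
        return dict(self._pairs[(src, dst)])


matrix = TrafficMatrix()
matrix.record("hostA", "hostB", 1500)
matrix.record("hostA", "hostB", 64, error=True)
matrix.record("hostB", "hostA", 40)
```

Each direction of a conversation is counted separately, since every entry has exactly one source and one destination address.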
- Point-to-point statistics
- SNMP can identify the total number of packets on a port
- RMON can identify
- Destination for packets
- Source of packets
- Packet capture
- Allow packet capture when they match a particular filter or threshold value
- Packet and protocol analysis
- Statistics
- Measures Probe-collected statistics
- Collisions on a particular segment
- Configurable statistics
- Internetwork fault diagnosis
- Network traffic statistics
- Packet error counters
- CRC/alignment errors
- Fragments
- Jabbers
- Oversized packets
- Undersized packets
- RMON requires sophisticated management tools
- RMON can be complex and gather substantial data.
- Tivoli NetView
- Designed to identify network resources and manage discrete events.
- NetView goes onto the network and discovers all nodes within its network bounds.
- A node is whatever sits at one end of a network wire: the thing to which that end of the wire is attached.
- NetView asks that device, on which the node resides, to identify itself.
- NetView speaks one (1) language: SNMP.
- If that device responds in one (1) language, SNMP, then NetView maps that node into network device context.
- If that device does *not* respond in SNMP, then NetView knows only that a node exists, and knows *nothing* about that node, other than that it exists at one end of a network wire.
- Since every accessible network port on every network device exists at one end of a network wire, NetView will discover and query, via SNMP, every accessible network port on every network device.
- If a network device has sixteen (16) ports, and that device will not speak to NetView in SNMP, each of those sixteen (16) ports will remain un-identified to NetView.
- The sixteen (16) un-identified ports appear within NetView as sixteen (16) discrete, un-identified nodes at one end of sixteen (16) separate network wires.
- In such a case, NetView will *not* be able to know anything about what goes into one port of that network device in relation to what comes out another port on the same device.
- If a known network device resides on each side of such an unknown network device, *nothing* can be known about the traffic that passes between the known devices, because that traffic, by definition, first passes through nodes that happen to reside on a network device that cannot be known.
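The difference between a device that answers SNMP and one that does not can be sketched as a grouping step. This is only an illustration of the reasoning above, not a description of how NetView is implemented; the function `build_map`, the device names and the port labels are all hypothetical.

```python
def build_map(ports, snmp_devices):
    """Group discovered ports the way the text above describes.

    ports: iterable of (port_id, device_name) pairs found on the wire.
    snmp_devices: set of device names that answered an SNMP query.
    Returns (devices, unknown_nodes): ports grouped under known devices,
    and a flat list of ports that can only be shown as isolated nodes.
    """
    devices = {}
    unknown_nodes = []
    for port_id, device_name in ports:
        if device_name in snmp_devices:
            devices.setdefault(device_name, []).append(port_id)
        else:
            # Without SNMP the manager cannot tell which device owns the port.
            unknown_nodes.append(port_id)
    return devices, unknown_nodes


discovered = [("router-a/1", "router-a"), ("router-a/2", "router-a")]
discovered += [(f"hub-x/{n}", "hub-x") for n in range(1, 17)]

devices, unknown_nodes = build_map(discovered, snmp_devices={"router-a"})
```

Here the two router ports are grouped under one device, while the sixteen ports of the silent hub remain sixteen unrelated nodes.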
- Utilizing tools, such as RMON, NetView is able to sort out and make sense of network traffic between all known and identified network devices.
- Provides a simple, summary map of the Enterprise from a network perspective.
- Coordinates various network information, consolidates same and correlates network data.
- Filters events and passes them to Tivoli Enterprise Console, as appropriate.
- NetScout Manager Plus
- Sophisticated RMON software tool recommended by Ken McNamara, Ameritech, Indianapolis, IN.
- Very specific requirements for this software to facilitate AFTT requirements:
- RMON must be enabled at the switch.
- Configuration modifications in the software require propagation to both Probe and switch.
- Cannot define a switch's role in the network without a Probe's read/write identification of that switch.
- Cannot make sense of network traffic flowing through a Probe without adequate switch identification.
- Specific Tools & Expertise
- Process
- Operations
- Presentation layer
- Group alarms and metrics based on business perspectives.
- Distribute various console views to remote resolver groups.
- Dynamically Email environment information as a function of who needs to know, what and when.
- Dynamically Page environment information as a function of who needs to know, what and when.
- Generate Trouble Ticket, as necessary.
- Business perspectives
- Application
- Database
- Middleware
- Network
- Operating system (OS)
- System
- Correlation
- Group alarms and metrics based on business perspectives.
- High level perspectives encourage analysis of relationships between events:
- How does a specific alarm impact a specific application?
- How does a specific alarm impact a specific database?
- How does a specific alarm impact middleware?
- How does a specific alarm impact the network?
- How does a specific alarm impact a specific system?
- Different alarms and events need to be known by different sets of Enterprise personnel.
- Different alarms need to be known and acted upon by different sets of resolver groups.
- Is this alarm, or event, a one-time occurrence or part of an ongoing trend?
- Who needs to know about environment trends?
- Problem management
- Manual interface to Trouble Ticket system.
- Developed and maintained an ongoing database of alarms and events.
- Analysis of this database may contribute to the automatic Trouble Ticket interface.
- Began outline of requirements for an automatic interface to an Enterprise Trouble Ticket system.
- Methods
- Availability
- Filesystem space, regardless of platform.
- Process up/down, regardless of platform.
- Host up/down, regardless of platform.
- Database functionality, regardless of database and platform.
- Application functionality, regardless of platform.
- Obviously, considerably more work needs to be done, directly involving the application owners, to make this more substantive.
- The most complex riddle, yet to be answered, quantifies necessary relationships between applications.
- Middleware functionality, regardless of platform.
- Obviously, considerably more work needs to be done, directly involving the application owners, to make this more substantive.
- The most complex riddle, yet to be answered, quantifies necessary relationships between applications and middleware.
- Network availability, functionality and responsiveness.
- Logfiles
- Follow every event handled by syslog, on *NIX platforms.
- Follow every event handled by EventLog on NT platforms.
- Follow every event entered into ASCII logfiles, regardless of platform.
- clockShift
- Connect:Direct
- MQSeries
- Orbix
- TraxWay
- Tuxedo
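Following an ASCII logfile usually means remembering how far the file has been read and picking up only the lines added since. The sketch below shows that idea with the standard library only; it is not the mechanism used by any Tivoli EventAdapter, and the function name `read_new_lines` is hypothetical. It deliberately ignores log rotation and truncation.

```python
def read_new_lines(path, offset=0):
    """Return (lines, new_offset) for complete lines added after `offset`.

    A trailing partial line (no newline yet) is left for the next call, so a
    record that is still being written is never reported in pieces.
    """
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    end = data.rfind(b"\n") + 1  # 0 when no complete line is available
    complete = data[:end]
    lines = [
        raw.decode("ascii", errors="replace").rstrip("\r")
        for raw in complete.split(b"\n")[:-1]
    ]
    return lines, offset + end
```

A caller would keep the returned offset between polls and pass it back in on the next call, forwarding each returned line as a candidate event.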
- EventAdapters
- Tivoli is a framework with an extensible API, into which we integrate:
- Tivoli tools
- Third party point products
- Simple, custom tools
- Where out-of-the-box tools do not prove adequate to the task, we have developed simple, scalable and readily maintainable tools that meet specific requirements.
- Dynamic querying
- Whereas most out-of-the-box tools are asynchronous, waiting for alarms and events to come to them, we have also implemented an infrastructure to dynamically query the state of specific attributes on a regular basis.
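Dynamic querying, as opposed to waiting for events to arrive, can be sketched as a loop that runs a set of checks and reports only the attributes whose state changed since the previous pass. The function `poll_once`, the check names and the stand-in checks below are hypothetical; real checks would query filesystems, processes or hosts.

```python
def poll_once(checks, last_state):
    """Run every check once and return the state changes.

    checks: dict mapping an attribute name to a zero-argument callable.
    last_state: dict of previously seen states; updated in place.
    Returns a list of (name, old_state, new_state) tuples.
    """
    changes = []
    for name, check in checks.items():
        new_state = check()
        old_state = last_state.get(name)
        if new_state != old_state:
            changes.append((name, old_state, new_state))
            last_state[name] = new_state
    return changes


# Stand-in checks driven by a mutable dict, so the example is deterministic.
environment = {"db_listener": "up", "fs_/var": "ok"}
checks = {
    "db_listener": lambda: environment["db_listener"],
    "fs_/var": lambda: environment["fs_/var"],
}
state = {}
first_pass = poll_once(checks, state)    # everything is new on the first pass
environment["db_listener"] = "down"
second_pass = poll_once(checks, state)   # only the changed attribute is reported
```

Run on a schedule, such a loop turns a regular query into an event stream that contains only transitions.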
- Tivoli Plus Modules
- Third party point products may be integrated into Tivoli via a Plus Module, which facilitates:
- Point product event forwarding to Tivoli TEC.
- Interface to the point product from the Tivoli desktop/console.
- Plus Modules used at, or valuable to, Ameritech:
- BMC Patrol
- Compaq Insight Manager
- Compuware EcoTools
- HP OpenView
- MQSeries
- Platinum POEMS
- Tuxedo
- Decision Support
- Tivoli Decision Support (TDS) generates reports from the data repositories gathered by various Tivoli products, including:
- Tivoli Enterprise Console (TEC)
- NetView (NV)
- Redundancy
- The scope of specific events is limited.
- The information carried by an alarm or event must be interpreted in appropriate context.
- Often, measuring a condition in multiple ways, then correlating the results, provides more adequate information.
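As a small illustration of correlating several measurements of the same condition, the sketch below combines three independent checks into one diagnosis. The three inputs, the function `diagnose` and the wording of the results are assumptions chosen for the example; a real rule base would be far richer.

```python
def diagnose(host_reachable, process_running, app_responds):
    """Combine three independent measurements into a single diagnosis."""
    if not host_reachable:
        # Nothing else can be trusted if the host cannot be reached.
        return "host or network path down"
    if not process_running:
        return "application process down"
    if not app_responds:
        # Host and process are up, yet the application gives no answer.
        return "application hung or degraded"
    return "healthy"


results = {
    "all_good": diagnose(True, True, True),
    "no_answer": diagnose(True, True, False),
    "unreachable": diagnose(False, True, True),
}
```

Any one of these measurements alone would be ambiguous; taken together they narrow down which resolver group needs to be informed.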
- Technology
- Tivoli
- Framework (FW)
- Tivoli Enterprise Console (TEC)
- NetView (NV)
- EventAdapters
- Out-of-the-box
- Customized
- Tivoli Decision Support (TDS)
- Plus Modules
- Software Distribution (SD)
- Distributed Monitoring (DM)
- Point Products
- This team has experience with these tools, on this project or elsewhere in Ameritech:
- BMC Patrol
- CA
- CheckPoint FireWall-1
- Compaq Insight Manager
- Compuware EcoTools
- HP OpenView
- Mercury Interactive
- Platinum
- Enterprise Applications
- Enterprise Middleware
- MQSeries
- Orbix
- Topcom
- TraxWay
- Tuxedo
- Enterprise Platforms
- This team has experience with these tools, on this project or elsewhere in Ameritech, on the following operating platforms:
- AIX (IBM)
- Desktops (Microsoft-based)
- HP-UX
- Microsoft NT Servers
- No other metrics/monitoring effort in Ameritech can approach the AFTT/NT level of detail!
- MVS (OS/390)
- NCR
- Sun Solaris
- Sun SunOS
- Tandem
- Network
- At Ameritech
- We developed an infrastructure that managed the AFTT environment:
- Deployed tools remotely, en masse, without direct involvement of System Administrators (SA's).
- Of course, this presupposes that certain published prerequisites are met on a system-by-system basis.
- Certain tools and specific problem resolution may require additional involvement on the part of System and Network Administrators.
- All tool and product enhancements, customizations and patches managed remotely, without local involvement.
- Automatically monitored the environment for alarms and alarm information.
- Automatically gathered to one (1) central repository.
- Automatically monitor and manage functionality of the Event Management infrastructure itself.
- Automatically compile metrics about the environment for:
- Accountability
- Capacity planning
- Service level tracking
- Schedule events for periodic, or one-time, action across the entire environment, or on a system-by-system basis.
- Move software and information to any point, or groups of points, anywhere in the environment.
- Can be scheduled for one-time or periodic updates.
- Can combine data transfer with remote, scheduled events and actions.
- This infrastructure is designed to be scalable and transferable to the Enterprise level.
- Network Impact
- SNMP may periodically send echo packets to check status of each device.
- Large numbers of echo packets *increase* base-level network traffic.
- High network traffic can adversely affect network performance.
- Typical SNMP polling of devices may consume an unacceptably large percentage of WAN bandwidth.
- RMON promotes proxy management capabilities for remote monitoring.
- RMON is especially useful when the remote network is connected over a wide area link.
- RMON Probe is shareable.
- Each management station identifies the resources it is using in the agent.
- Multiple tasks are completed concurrently and in a timely manner.
- Both RMON and SNMP skew the performance of what is measured:
- The measurement process requires resources.
- The measurement process resources are included in the network measurements.
- The measurement process is efficient, carrying a lot of information in the fewest possible packets.
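The cost of polling over a wide area link can be estimated with simple arithmetic: devices polled, times bytes exchanged per poll, times eight bits, divided by the polling interval. The sketch below does that calculation; the device count, packet sizes, interval and link speed are made-up figures for illustration, not measurements from this environment.

```python
def polling_load_bps(devices, bytes_per_poll, interval_seconds):
    """Average bits per second generated by polling every device once per interval."""
    return devices * bytes_per_poll * 8 / interval_seconds


def link_utilization(load_bps, link_bps):
    """Fraction of a link's capacity consumed by `load_bps`."""
    return load_bps / link_bps


# Assumed figures: 500 devices, 200 bytes per request/response exchange,
# polled every 60 seconds across a 56 kbit/s WAN link.
load = polling_load_bps(devices=500, bytes_per_poll=200, interval_seconds=60)
share = link_utilization(load, link_bps=56_000)
```

Under these assumed figures polling alone averages about 13.3 kbit/s, roughly 24% of the link, which is the kind of overhead that local RMON probes are meant to avoid.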