Network Management
Principles & Objectives

Business Impact
- Business problems
  - Is the remote server available to complete my transaction?
  - Is the remote database able to process the order information?
  - Is the application down or is the network between us very slow?
  - What is the cycle time for an entry to complete all transactions across all applications?
  - Can you tell me that an application is having problems, before it fails completely?
  - Does the application have adequate resources at peak usage times?
  - What is the quickest method to inform all resolver groups when a problem occurs?
  - How do we know when a complex problem is completely resolved?
- Business solutions
  - We can build a monitoring/metrics environment to directly address these issues.
  - No single system, process or infrastructure can be everything to everybody.
  - Specific problems are a function of specific parameters, and can be addressed in a methodical manner.
  - The most critical Enterprise issues can be addressed, one-at-a-time, by a system based on cooperation of resolver groups and shared learning.
- Key Network Metrics tasks
  - Gather metrics beyond simple availability of the hubs, routers and switches.
  - Gather metrics about hosts and applications and how they use the network.
    - Measure Application network utilization.
      - Are Applications efficient and using appropriate interfaces and routes?
      - Are Applications and servers using network routing, or acting as their own routers?
        
        Should application hosts do their own routing?
    - Measure Application network response time.
      - How efficiently do remote applications communicate with each other?
      - How does application A communicate with application B?
      - Where are the bottlenecks in application traffic?
    - Some hosts, such as mainframes, may not have SNMP facilities.
      - Routers are the *only* source for that application traffic information.
      - Without adequate router information, such application traffic cannot be measured.
  - Firewall management
    - Firewall violation attempts are managed by the firewall itself.
    - Firewall breaches can only be measured outside the firewall.
  - Correlate network metrics to all Enterprise functions:
    - Operating System (OS), hardware and firmware metrics
    - Database (RDBMS) metrics
    - Application metrics
    - Middleware metrics
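
The question "is the application down or is the network between us very slow?" can be approached by timing how long a TCP connection takes to open. The following is a minimal sketch, not a production probe; the function names and the 0.5 second "slow" threshold are assumptions made for illustration.

```python
import socket
import time


def tcp_connect_time(host: str, port: int, timeout: float = 2.0):
    """Return the seconds taken to open a TCP connection, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return None


def classify(elapsed, slow_after: float = 0.5) -> str:
    """Map one measurement to a coarse answer: unreachable, slow or ok."""
    if elapsed is None:
        return "unreachable"
    return "slow" if elapsed > slow_after else "ok"
```

A single connect time cannot by itself separate a slow network from a slow host, which is one reason these measurements are correlated with the other Enterprise metrics listed above.
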
Managing Networks
- Open Systems Interconnection Reference Model (OSI-RM)
  - Framework for network management.
  - Specific Management Functional Areas (SMFA's)
    - Fault Management
      - A fault is anything that causes network systems to fail to meet their operational objectives.
      - Elements involved:
        
        Detection of the fault.
        Isolation of the fault to a particular component.
        Correction of the fault.
      - Specific examples:
        
        Maintenance of error logs.
        Error detection processes.
        Diagnostic testing procedures.
    - Accounting Management
      - Quantifying individual, group and organizational network usage.
      - Provides mechanisms to:
        
        Identify costs.
        Inform users of costs incurred.
        Associate tariff information with resource usage.
    - Configuration Management
      - Detailing parameters of:
        
        Network configuration.
        Current topology.
        Operational status of the network.
        Association of user names with devices.
      - Facility to change network configurations, as required.
    - Performance Management
      - High performance usually implies a low incidence of faults.
      - Beyond minimizing faults:
        
        Gather statistics on the operation of the network.
        Maintain and analyze logs of the state of the network systems.
        Optimize network operations.
    - Security Management
      - Create, delete and control security services.
      - Distribute security related information, as required.
      - Report security related events.
      - Enforce secure passwords and user access.
      - Control access to networks, subnets and hosts via:
        
        Bridges
        Routers
        Gateways
        Switches
      - Provide remote access to network elements for purposes of network diagnostics.
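
As an illustration of the Accounting Management mechanisms listed above (associating tariff information with resource usage to identify costs), here is a small sketch. The tariff values, traffic classes and record layout are hypothetical; a real accounting system would take them from its own billing policy.

```python
from collections import defaultdict

# Hypothetical tariff: cost per megabyte, by traffic class.
TARIFF_PER_MB = {"lan": 0.00, "wan": 0.02}


def usage_costs(records, tariff=TARIFF_PER_MB):
    """Sum the cost per user from (user, traffic_class, bytes) records."""
    costs = defaultdict(float)
    for user, traffic_class, nbytes in records:
        costs[user] += (nbytes / 1_000_000) * tariff[traffic_class]
    return dict(costs)
```
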
- Managing TCP/IP-based Internetworks
  - All internetworks are based on some type of layered architecture, e.g.:
    - ARPA
    - OSI
    - SNA
  - Differences from other internetwork protocols:
    - TCP/IP-based internetworks are designed to be multi-vendor systems.
      - Systems built solely upon SNA or DECnet are single-vendor systems.
      - Management involves different assumptions, depending on how many vendors are providing network equipment.
      - Management needs to be an integral part of the internetwork wherever subsystems from multiple vendors are integrated.
    - Also, consider the implementation of network management systems and the protocols available for their use.
      - Simple Network Management Protocol (SNMP)
        
        Ubiquitous support from:
        
        Network Management systems vendors
        Internetworking Device manufacturers
      - Common Management Information Protocol (CMIP)
        
        International support from organizations, e.g.,
        
        CCITT
        IEEE
        ISO
      - CMIP over TCP/IP (CMOT)
        
        Less widely supported than SNMP.
      - Desktop Management Task Force (DMTF)
        
        Standard set of Application Programming Interfaces (API's) that access and manage:
        
        Desktop systems
        Components
        Related peripherals
        
        Desktop Management Interface (DMI)
        
        Focus of DMI is on desktop and LAN management.
        Independent of:
        
        Hardware systems
        Software operating systems
        Network operating systems
        
        Designed to be integrated with all network management protocols and consoles, such as SNMP.
- Internets consist of multiple physical networks interconnected by IP routers.
  - A single manager can control heterogeneous routers.
  - Controlled entities may not share a common link level protocol.
  - The set of machines a manager controls may lie at arbitrary points in an internet.
  - The internet management protocol used with TCP/IP operates above the transport level.
    - Protocols for internet management operate at the application level and communicate using TCP/IP transport level protocols.
    - One set of protocols can be used for all networks.
    - Designed without regard to the hardware, these protocols can be used for all managed devices.
    - All routers respond to exactly the same set of commands.
    - IP communications allow a manager to control routers across an entire TCP/IP internet without having direct attachment to every physical network and router.
- Architectural Model
  - Each participating host or router runs a server program called a management agent.
  - A manager invokes client software on the local host and specifies a particular agent.
  - Once the client contacts the agent:
    - It sends queries to obtain information.
    - It sends commands to change conditions in the device.
  - An authentication mechanism ensures that only authorized managers can access or control a specific device.
    - Some protocols support multiple levels of authorization, subdividing into specific privileges a manager may assume on a given device.
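
The Architectural Model above (an agent on each device, a manager that sends queries and commands, and an authentication check with more than one privilege level) can be sketched as follows. This is a toy model, not a real management protocol: the class name, the shared-secret scheme and the two privilege levels "read" and "write" are assumptions for illustration.

```python
class Agent:
    """Toy management agent: holds device variables and checks credentials."""

    def __init__(self, variables, credentials):
        self._vars = dict(variables)
        # Maps a shared secret to a privilege level: "read" or "write".
        self._credentials = dict(credentials)

    def _privilege(self, secret):
        if secret not in self._credentials:
            raise PermissionError("unknown manager")
        return self._credentials[secret]

    def query(self, secret, name):
        """Return a variable's value; any known manager may read."""
        self._privilege(secret)
        return self._vars[name]

    def command(self, secret, name, value):
        """Change a variable; only managers with write privilege may do so."""
        if self._privilege(secret) != "write":
            raise PermissionError("read-only manager")
        self._vars[name] = value
```

A manager holding only the read secret can call `query` but gets a `PermissionError` from `command`, which mirrors the idea of subdividing authorization into specific privileges.
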
- Protocol Architecture
  - TCP/IP network management protocols specify separate standards for:
    - Communication of information
      - Defines the format and meaning of messages clients and servers exchange.
      - Defines the form of names and addresses.
    - Data management
      - Specifies which data items a device may keep.
      - Specifies the name of each data item and syntax used to name it.
  - Simple Network Management Protocol (SNMP)
    - SNMP Model
      - Managed nodes
        
        Hosts
        Routers
        Bridges
        Printers
        Any device capable of communicating status information to the outside world.
        Direct SNMP management requires a resident and running process called an SNMP agent.
        
        Agents maintain a local database of variables that:
        
        Describe its state
        Describe its history
        Affect its operation
      - Management station
        
        General purpose computers running special management software.
        Contains one or more processes that:
        
        Communicate with the agents over the network
        Issue commands
        Get responses
        
        All of the intelligence is in the management stations.
        
        Agents are kept as simple as possible and minimize their impact on the devices on which they run.
      - Management information
        
        Management Information Base (MIB)
        
        Specifies the data items a device must keep and operations allowed on each.
        
        Each device maintains one or more variables that describe its state.
        In SNMP, these are called ``objects''.
        
        SNMP Objects have state.
        SNMP Objects have *no* methods, other than reading and writing their values.
        
        The collection of all possible objects in a network is given in a data structure called the MIB.
        
        Ten (10) MIB-II categories:
        
        - System: seven (7) objects. Name, location and description of the equipment.
        - Interfaces: twenty-three (23) objects. Network interfaces and their measured traffic.
        - AT: three (3) objects. Address translation (e.g., ARP mapping) -- deprecated.
        - IP: forty-two (42) objects. Internet Protocol (IP) packet statistics.
        - ICMP: twenty-six (26) objects. Statistics about Internet Control Message Protocol (ICMP) messages received.
        - TCP: nineteen (19) objects. Transmission Control Protocol (TCP) algorithms, parameters and statistics.
        - UDP: six (6) objects. User Datagram Protocol (UDP) traffic statistics.
        - EGP: twenty (20) objects. Exterior Gateway Protocol (EGP) traffic statistics.
        - Transmission: zero (0) objects. Reserved for media-specific MIB's.
        - SNMP: twenty-nine (29) objects. SNMP traffic statistics.
        
        Advantages:
        
        Vendors can include SNMP agent software in products, such as routers, with guaranteed adherence to standards, even after new MIB items are defined.
        Same network management client software can manage devices with different versions of a MIB.
        Since all devices use the same language for communication, they can all parse queries and provide requested information or reply with an error message indicating absence of the requested item.
        
        Each significant event must be defined in a MIB module.
      - A management protocol
        
        Abstract Syntax Notation One (ASN.1)
        
        A standard object definition language, including encoding rules.
        
        Structure of Management Information (SMI)
        
        A ``sub-super-set'' of ASN.1: a subset of ASN.1, extended with SNMP-specific definitions, that defines SNMP data structures.
        Four (4) key macros and eight (8) new data types
        
        Not part of ASN.1
        Heavily used throughout SNMP
        RFC 1442
        
        Management stations interact with the agents using the SNMP protocol.
        
        When an agent notices that a significant event has occurred, it immediately reports that event to all management stations in its configuration list (SNMP trap).
        
        SNMP casts all operations in a fetch-store paradigm.
        
        Rather than requiring a separate command for each operation on a data item, SNMP contains only two (2) commands:
        
        Allow a manager to fetch a value from a data item.
        Allow a manager to store a value into a data item.
        
        Advantages:
        
        Flexibility
        
        Accommodates arbitrary commands in an elegant and simple framework.
        
        Simplicity
        
        Simple to implement, understand and debug.
        Two commands avoid the complexity associated with special cases for multiple commands.
        
        Stability
        
        Its definition remains fixed.
        New data items may be added to the MIB.
        New operations are defined as side effects of storing into those items.
        
        Actually, SNMP uses a handful of commands at its primary interface:
        
        - get-request: Fetch a value from one or more variables.
        - get-next-request: Fetch the variable following this one, without knowing its exact name. Allows managers to iterate through tables of items.
        - get-bulk-request: Fetch a large table.
        - set-request: Store a value in one or more variables.
        - inform-request: Manager-to-manager message describing local MIB.
        - trap: Asynchronous reply triggered by an event -- agent-to-manager report. Allows managers to program servers to send information as events occur.
        - get-response: Reply to a fetch operation.
        
        Operations must be atomic.
        
        If a single message specifies operation across multiple variables, the server either performs *all* operations or *none* of them.
        No assignment will be made if any of them are in error.
        
        Trap-Directed Polling
        
        Management stations poll agents regularly and at long intervals.
        This polling accelerates upon receipt of a trap.
        Since traps may simply indicate a condition or state, this gathers corollary data.
        
        Proxy Agent
        
        An agent that watches over one or more non-SNMP aware devices.
        An agent that communicates with the management station on behalf of the non-SNMP aware device.
        Possibly, the proxy communicates with the devices via some other protocol.
    - Industry Support
      - General categories of devices supporting SNMP agents:
        
        Wiring hubs
        Network servers & associated operating systems
        Network interface cards & associated hosts
        Internetworking devices, e.g., bridges & routers
        Test equipment, e.g., network monitors & analyzers
        Other devices, e.g., uninterruptible power supplies
    - SNMP can remain hidden and transparent to the manager's user interface.
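
The fetch-store paradigm, get-next iteration and all-or-nothing set semantics described in the SNMP Model can be illustrated with an in-memory stand-in for an agent. This is a sketch only: it does not speak SNMP on the wire, and the class and function names are invented for the example. The three object identifiers are the standard MIB-II system group entries sysDescr.0, sysName.0 and sysLocation.0; treating sysDescr as the only read-only object is a simplification.

```python
SYS_DESCR = (1, 3, 6, 1, 2, 1, 1, 1, 0)
SYS_NAME = (1, 3, 6, 1, 2, 1, 1, 5, 0)
SYS_LOCATION = (1, 3, 6, 1, 2, 1, 1, 6, 0)


class ToyAgent:
    """In-memory stand-in for an agent's view of the MIB (not real SNMP)."""

    def __init__(self, objects, read_only=()):
        # Keys are OIDs as tuples of ints, so tuple order is MIB order.
        self._objects = dict(objects)
        self._read_only = set(read_only)

    def get(self, oid):
        """Fetch: return the value of one object."""
        return self._objects[oid]

    def get_next(self, oid):
        """Return (oid, value) for the first object after `oid`, or None."""
        later = sorted(k for k in self._objects if k > oid)
        return (later[0], self._objects[later[0]]) if later else None

    def set(self, bindings):
        """Store: apply every (oid, value) pair, or none if any is invalid."""
        for oid in bindings:
            if oid not in self._objects or oid in self._read_only:
                raise ValueError(f"cannot set {oid}")
        # Reached only when every binding passed validation.
        self._objects.update(bindings)


def walk(agent, start=()):
    """Iterate over all of an agent's objects using only get_next."""
    item = agent.get_next(start)
    while item is not None:
        yield item
        item = agent.get_next(item[0])
```

Because `set` validates every binding before changing anything, a request that mixes a valid and an invalid assignment leaves the agent untouched, which is the atomicity requirement stated above. `walk` shows how a manager can traverse a table without knowing any object names in advance.
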
  - RMON (MIB) functions
    - Alarms
      - Compare statistical samples with preset thresholds
        
        Generate alarms when a particular threshold is crossed
        Thresholds for any variable
    - Event
      - Control the generation and notification of events
      - Distributed logging
      - May include the use of SNMP Trap messages
    - Filter
      - Allow packets to be matched according to a filter equation
      - Filters to capture and analyze individual packets
    - History
      - Record periodic statistical samples over time
        
        Historical statistics
        Historical trend graphing
        Performance tuning
        Statistical analysis
    - Host
      - Maintain statistics of the hosts on the network
      - Maintain media access control (MAC) addresses of active hosts
      - Host table of all addresses
        
        Node traffic statistics
        
        Packets sent and received
        Broadcasts
        Multicasts
        Error packets sent
        
        Host time table
        
        Relative order in which each host was discovered by the agent
        Improves performance and reduces network traffic
    - HostTopN
      - Provide reports from host table statistics
        
        Extensive processing performed remotely at the agent
        
        Minimizes network traffic and load
        
        Sorted host statistics
        Which hosts are at the top of the list for a particular statistic
    - Matrix
      - Store statistics in a traffic matrix
        
        Regarding conversations between host pairs
        Traffic matrices for all nodes
        Amount of traffic and number of errors between pairs of nodes
        
        One (1) source and one (1) destination address per pair
        For each pair, maintains counters, between nodes, for
        
        Numbers of packets
        Numbers of bytes
        Numbers of error packets
        
        Point-to-point statistics
        
        SNMP can identify the total number of packets on a port
        
        Inbound
        Outbound
        
        RMON can identify
        
        Destination for packets
        Source of packets
    - Packet capture
      - Allow packets to be captured when they match a particular filter or threshold value
      - Packet and protocol analysis
    - Statistics
      - Measures Probe-collected statistics
        
        Collisions on a particular segment
        Configurable statistics
        Internetwork fault diagnosis
        Network traffic statistics
        Packet error counters
        
        CRC/alignment errors
        Fragments
        Jabbers
        Oversized packets
        Undersized packets
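
To make the Host, HostTopN and Matrix functions above more concrete, the sketch below keeps the same kinds of counters for a stream of observed packets. It is a toy model under stated assumptions: the class name, the `observe` interface and the choice of "packets sent" as the ranking statistic are invented for illustration and are not the RMON MIB's actual object layout.

```python
from collections import Counter


class ToyProbe:
    """Counts observed packets per host and per (source, destination) pair."""

    def __init__(self):
        self.host_packets_out = Counter()  # Host: packets sent per address
        self.matrix_packets = Counter()    # Matrix: packets per host pair
        self.matrix_bytes = Counter()      # Matrix: bytes per host pair
        self.discovery_order = []          # Host time table: order first seen

    def observe(self, src, dst, nbytes):
        """Record one packet from `src` to `dst` of `nbytes` bytes."""
        for address in (src, dst):
            if address not in self.discovery_order:
                self.discovery_order.append(address)
        self.host_packets_out[src] += 1
        self.matrix_packets[(src, dst)] += 1
        self.matrix_bytes[(src, dst)] += nbytes

    def top_n(self, n):
        """HostTopN: hosts ranked by packets sent, computed at the probe."""
        return self.host_packets_out.most_common(n)
```

Only the short result of `top_n` would need to travel to the management station, which is the sense in which processing at the agent minimizes network traffic and load.
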
  - RMON requires sophisticated management tools
    - RMON can be complex and gather substantial data.
    - Tivoli NetView
      - Designed to identify network resources and manage discrete events.
        
        NetView goes onto the network and discovers all nodes within its network bounds.
        
        A node is whatever sits at one end of a network wire: the thing to which that end of the wire is attached.
        
        NetView asks that device, on which the node resides, to identify itself.
        
        NetView speaks one (1) language: SNMP.
        If that device responds in one (1) language, SNMP, then NetView maps that node into network device context.
        If that device does *not* respond in SNMP, then NetView knows only that a node exists, and knows *nothing* about that node, other than that it exists at one end of a network wire.
        
        Since every accessible network port on every network device exists at one end of a network wire, NetView will discover and query, via SNMP, every accessible network port on every network device.
        
        If a network device has sixteen (16) ports, and that device will not speak to NetView in SNMP, each of those sixteen (16) ports will remain un-identified to NetView.
        Each of the sixteen (16) un-identified ports appears within NetView as a discrete, un-identified node at the end of its own network wire.
        In such a case, NetView will *not* be able to know anything about what goes into one port of that network device in relation to what comes out of another port on the same device.
        If a known network device resides on each side of such an unknown network device, *nothing* can be known about the traffic that passes between the known devices, because that traffic, by definition, first passes through nodes that happen to reside on a network device that cannot be known.
        
        Utilizing tools, such as RMON, NetView is able to sort out and make sense of network traffic between all known and identified network devices.
      - Provides a simple, summary map of the Enterprise from a network perspective.
      - Coordinates and consolidates network information, and correlates network data.
      - Filters events and passes them to the Tivoli Enterprise Console, as appropriate.
    - NetScout Manager Plus
      - Sophisticated RMON software tool recommended by Ken McNamara, Ameritech, Indianapolis, IN.
      - Very specific requirements for this software to facilitate AFTT requirements:
        
        RMON must be enabled at the switch.
        Configuration modifications in the software require propagation to both Probe and switch.
        Cannot define a switch's role in the network without a Probe's read/write identification of same switch.
        Cannot make sense of network traffic flowing through a Probe without adequate switch identification.
Specific Tools & Expertise
- Process
  - Operations
    - Presentation layer
      - Group alarms and metrics based on business perspectives.
      - Distribute various console views to remote resolver groups.
      - Dynamically email environment information according to who needs to know what, and when.
      - Dynamically page environment information according to who needs to know what, and when.
      - Generate a Trouble Ticket, as necessary.
    - Business perspectives
      - Application
      - Database
      - Middleware
      - Network
      - Operating system (OS)
      - System
    - Correlation
      - Group alarms and metrics based on business perspectives.
      - High level perspectives encourage analysis of relationships between events:
        
        How does a specific alarm impact a specific application?
        How does a specific alarm impact a specific database?
        How does a specific alarm impact middleware?
        How does a specific alarm impact the network?
        How does a specific alarm impact a specific system?
      - Different alarms and events need to be known by different sets of Enterprise personnel.
      - Different alarms need to be known and acted upon by different sets of resolver groups.
      - Is this alarm, or event, a one-time occurrence or part of an ongoing trend?
        
        Who needs to know about environment trends?
    - Problem management
      - Manual interface to Trouble Ticket system.
        
        Developed and maintained ongoing database of alarms and events.
        Analysis of this database may contribute to the automatic Trouble Ticket interface.
      - Began outline of requirements for an automatic interface to an Enterprise Trouble Ticket system.
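
The Correlation points above (different alarms go to different resolver groups, and a repeated alarm may indicate a trend) can be sketched as a small routing step. The group names, the alarm fields `perspective`, `source` and `condition`, and the threshold of three occurrences are all hypothetical, chosen only to make the idea concrete.

```python
# Hypothetical mapping from business perspective to resolver groups.
RESOLVERS = {
    "network": ["network-ops"],
    "database": ["dba-team"],
    "application": ["app-support", "dba-team"],
}


def route(alarm, resolvers=RESOLVERS, default=("operations",)):
    """Return the resolver groups that should be told about this alarm."""
    return list(resolvers.get(alarm["perspective"], default))


def is_trend(history, alarm, threshold=3):
    """Treat an alarm as a trend once the same source and condition recur."""
    key = (alarm["source"], alarm["condition"])
    seen = sum(1 for a in history if (a["source"], a["condition"]) == key)
    return seen + 1 >= threshold
```
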
  - Methods
    - Availability
      - Filesystem space, regardless of platform.
      - Process up/down, regardless of platform.
      - Host up/down, regardless of platform.
      - Database functionality, regardless of database and platform.
      - Application functionality, regardless of platform.
        
        Obviously, considerably more work needs to be done, directly involving the application owners, to make this more substantive.
        The most complex riddle, yet to be answered, quantifies necessary relationships between applications.
      - Middleware functionality, regardless of platform.
        
        Obviously, considerably more work needs to be done, directly involving the application owners, to make this more substantive.
        The most complex riddle, yet to be answered, quantifies necessary relationships between applications and middleware.
      - Network availability, functionality and responsiveness.
    - Logfiles
      - Follow every event handled by syslog, on *NIX platforms.
      - Follow every event handled by EventLog on NT platforms.
      - Follow every event entered into ASCII logfiles, regardless of platform.
        
        clockShift
        Connect:Direct
        MQSeries
        Orbix
        TraxWay
        Tuxedo
    - EventAdapters
      - Tivoli is a framework with an extensible API, into which we integrate:
        
        Tivoli tools
        Third party point products
        Simple, custom tools
      - Where out-of-the-box tools do not prove adequate to the task, we have developed simple, scalable and readily maintainable tools that meet specific requirements.
    - Dynamic querying
      - Whereas most out-of-the-box tools are asynchronous, waiting for alarms and events to come to them, we have also implemented an infrastructure to dynamically query the state of specific attributes on a regular basis.
    - Tivoli Plus Modules
      - Third party point products may be integrated into Tivoli via a Plus Module, which facilitates:
        
        Point product event forwarding to Tivoli TEC.
        Interface to the point product from the Tivoli desktop/console.
      - Plus Modules used at, or valuable to, Ameritech:
        
        BMC Patrol
        Compaq Insight Manager
        Compuware EcoTools
        HP OpenView
        MQSeries
        Platinum POEMS
        Tuxedo
    - Decision Support
      - Tivoli Decision Support (TDS) generates reports from the data repositories gathered by various Tivoli products, including:
        
        Tivoli Enterprise Console (TEC)
        NetView (NV)
    - Redundancy
      - The scope of specific events is limited.
      - The information carried by an alarm or event must be interpreted in appropriate context.
      - Often, measuring a condition in multiple ways, then correlating the results, provides more adequate information.
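
The Redundancy idea above, measuring one condition in several ways and correlating the results, might look like the following sketch. The check names and the verdict strings are assumptions for illustration; the point is only that one failed check among several passes yields a more specific answer than any single check could.

```python
def correlate(checks):
    """Combine independent checks of one service into a single verdict.

    `checks` maps a check name to True (passed), False (failed) or
    None (could not be measured).
    """
    results = [v for v in checks.values() if v is not None]
    if not results:
        return "unknown"
    if all(results):
        return "up"
    if not any(results):
        return "down"
    failed = sorted(name for name, v in checks.items() if v is False)
    return "degraded: " + ", ".join(failed)
```

For example, a host that answers a ping and shows its database process running, but fails a test query, would be reported as degraded with the query named as the failing check.
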
- Technology
  - Tivoli
    - Framework (FW)
    - Tivoli Enterprise Console (TEC)
    - NetView (NV)
    - EventAdapters
      - Out-of-the-box
      - Customized
    - Tivoli Decision Support (TDS)
    - Plus Modules
    - Software Distribution (SD)
    - Distributed Monitoring (DM)
  - Point Products
    - This team has experience with these tools, on this project or elsewhere in Ameritech:
      - BMC Patrol
      - CA
      - CheckPoint FireWall-1
      - Compaq Insight Manager
      - Compuware EcoTools
      - HP OpenView
      - Mercury Interactive
      - Platinum
  - Enterprise Applications
  - Enterprise Middleware
    - MQSeries
    - Orbix
    - Topcom
    - TraxWay
    - Tuxedo
  - Enterprise Platforms
    - This team has experience with these tools, on this project or elsewhere in Ameritech, on the following operating platforms:
      - AIX (IBM)
      - Desktops (Microsoft-based)
      - HP-UX
      - Microsoft NT Servers
        
        No other metrics/monitoring effort in Ameritech can approach AFTT/NT level of detail!
      - MVS (OS/390)
      - NCR
      - Sun Solaris
      - Sun SunOS
      - Tandem
  - Network
    - RMON
- At Ameritech
  - We developed an infrastructure that managed the AFTT environment:
    - Deployed tools remotely, en masse, without direct involvement of System Administrators (SA's).
      - Of course, this presupposes that certain published prerequisites are met on a system-by-system basis.
      - Certain tools and specific problem resolution may require additional involvement on the part of System and Network Administrators.
      - All tool and product enhancements, customizations and patches managed remotely, without local involvement.
    - Automatically monitored the environment for alarms and alarm information.
      - Automatically gathered to one (1) central repository.
      - Automatically monitored and managed functionality of the Event Management infrastructure itself.
    - Automatically compiled metrics about the environment for:
      - Accountability
      - Capacity planning
      - Service level tracking
    - Scheduled events for periodic, or one-time, action across the entire environment, or on a system-by-system basis.
    - Moved software and information to any point, or groups of points, anywhere in the environment.
      - Can be scheduled for one-time or periodic updates.
      - Can combine data transfer with remote, scheduled events and actions.
  - This infrastructure is designed to be scalable and transferable to the Enterprise level.
Network Impact
- SNMP may periodically send echo packets to check status of each device.
  - Large numbers of echo packets *increase* base-level network traffic.
  - High network traffic can adversely affect network performance.
  - Typical SNMP polling of devices may consume an unacceptably large percentage of WAN bandwidth.
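
How large a share of a WAN link polling consumes is simple arithmetic, sketched below. All of the figures in the usage note are assumed for illustration (200 devices, 500 bytes per poll including the reply, a 64 kbit/s link); they are not measurements from any real network.

```python
def polling_share(devices, bytes_per_poll, interval_s, link_bps):
    """Fraction of a link consumed by polling every device once per interval."""
    bits_per_second = devices * bytes_per_poll * 8 / interval_s
    return bits_per_second / link_bps
```

With the assumed figures, polling every 60 seconds gives `polling_share(200, 500, 60, 64_000)`, about 0.21, or roughly a fifth of the link. Stretching the interval to 600 seconds drops that to about 0.02, which is the motivation for polling at long intervals and speeding up only when a trap arrives.
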
- RMON promotes proxy management capabilities for remote monitoring.
  - RMON is especially useful when the remote network is connected over a wide area link.
  - RMON Probe is shareable.
    - Each management station identifies the resources it is using in the agent.
    - Multiple tasks are completed concurrently and in a timely manner.
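
One way a probe on the far side of a wide area link keeps traffic down is by comparing samples with thresholds locally (the Alarms function) and sending only the resulting events. The sketch below shows threshold-crossing detection with separate rising and falling thresholds; the function name is invented, and starting in the "watch for rising" state is an assumption of this example.

```python
def crossings(samples, rising, falling):
    """Yield (index, "rising" | "falling") events for threshold crossings.

    After a rising event, no further rising event is generated until the
    value has dropped to `falling` or below, and vice versa.
    """
    armed = "rising"  # assumption: begin by watching for a rising crossing
    for i, value in enumerate(samples):
        if armed == "rising" and value >= rising:
            yield i, "rising"
            armed = "falling"
        elif armed == "falling" and value <= falling:
            yield i, "falling"
            armed = "rising"
```

Using two thresholds keeps a value that hovers near a single limit from generating a flood of alarms across the link.
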
- Both RMON and SNMP skew the performance of what is measured:
  - The measurement process requires resources.
  - The measurement process resources are included in the network measurements.
  - The measurement process is efficient, carrying a lot of information in the fewest possible packets.