IBM Tivoli Network Manager

Network Manager

3.9 Fix Pack 4/ 4.1.1

IBM Tivoli Network Manager

Best Practices for Network Monitoring

Version 1

Rob Clark

Kimberly Corbitt

Note: Before using this information and the product it supports, read the information in “Notices” on page 55.

This edition applies to 3.9 FP4, 4.1.1 of IBM Tivoli Network Manager and to all subsequent releases and modifications until otherwise indicated in new editions.

© Copyright IBM Corporation 2014.

US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

About this Guide

Chapter 1: Poller concepts and terms to get started

Policy

Poll Definitions

Multiple domains

Chapter 2: How do the various scopes work together?

How scopes are applied

Poll Definition Classes filter

Poll Definition Interface filter

Policy level filters

Chapter 3: How to get the best out of threshold events

Event ID

Event summary description

Rules file

Event enrichment in the Event Gateway

Chapter 4: Making the most of historical data

Short-term diagnostic tool

Do I need Tivoli Data Warehouse?

Storage capacity

Use of the Data Label

Chapter 5: What is adaptive polling?

Chapter 6: How many poller instances do I need?

How many?

Tips for defining multiple pollers

Using multiple pollers

Chapter 7: Are the pollers healthy?

1) Is the historical poll data table being maintained?

2) Is the poller keeping up with the policy load at the scheduled frequencies?

3) Is the poller's memory stable?

4) Is the poller successfully storing data?

5) Do I need to add a new poller?

Chapter 8: Am I pinging all the IP addresses I want?

Generate the report

The report

Chapter 9: Poller Configuration

Poller settings

For IBM Support use

Notices

Trademarks

About this Guide

Assuring the health of your network is one of the most important functions of Network Manager. This guide helps you take advantage of the full capabilities available in Network Manager to plan your polling policies and help you enrich the monitoring events for your operational needs.

The Network Manager poller is very flexible and provides you many options to mold it to your business needs and environment. This guide explains various concepts and examples to help you understand how to get the most out of using the poller to monitor your network.

IBM Tivoli Network Manager 3.9 Fix Pack 4 and 4.1.1 introduced many improvements to the poller for scalability and manageability and this guide assumes you are using those releases or higher.

Chapter 1: Poller concepts and terms to get started covers the basic concepts and terms and provides a context that the following chapters can build on.

Chapter 2: How do the various scopes work together? explains how the scope settings on the poll definitions work with the scopes defined at the policy level. Filtering poll definitions at the class level for vendor-specific MIBs, for instance, can make it easier to define the scope at the policy level. But you should always check the class filtering at the poll definition level to avoid surprises.

Chapter 3: How to get the best out of threshold events uses examples to help you make the most of the events, in context with the poller, to provide useful contextual information for operators.

Chapter 4: Making the most of historical data covers the how and why of storing data for following trends or reporting on top offenders when diagnosing network problems.

Chapter 5: What is adaptive polling? explains how you can take advantage of event-based network view scopes to make your polling coverage more efficient.

Chapter 6: How many poller instances do I need? looks at the best practice setup for pollers to maximize their capability.

Chapter 7: Are the pollers healthy? looks at using the new Health metrics to quickly understand how the pollers are running. You can use these graphs to monitor load conditions and decide when to set up additional poller instances.

Chapter 8: Am I pinging all the IP addresses I want? looks at troubleshooting your ping polls. It uses a supplied list of IP addresses that you are responsible for monitoring for availability. It shows you how to confirm that the poller is actually polling all of them and, if it isn't, why not, so you can fix it.

Chapter 9: Poller Configuration provides insights into the best practices for configuring the poller processes.

Chapter 1: Poller concepts and terms to get started

This chapter explains the Network Manager monitoring policy concepts that you can use right away to get started, and also provides a platform for the other chapters to build on. Refer to the Knowledge Center for IBM Tivoli Network Manager for full documentation on monitoring and the poller: http://www.ibm.com/support/knowledgecenter/SSSHRK_4.1.1/com.ibm.networkmanagerip.doc_4.1.1/itnm/ip/wip/poll/task/nmip_poll_monitornw.html

Setting up the polling policies is fairly straightforward, especially if your needs are simple. But it will be useful to point out some options and best practices along the way.

Policy

A policy is a package that describes a set of devices and the data to poll for.

You assign a set of devices to a policy and then add one or more poll definitions that describe what data will be polled and its threshold condition and details of the alert. For each poll definition, you set the polling frequency and whether to store the data. Then assign the policy to a specific poller.

Availability

Start by setting up your availability assurance. Apply a policy to all devices that will test for availability using some or all of the Chassis ping, Interface ping, and End Node ping polls. Using both the Chassis and Interface Pings will allow Root Cause Analysis (RCA) to correlate a Chassis ping failure as root cause for interface ping failures.

By default, the Default Chassis Ping and Default Interface Ping polls are set up to poll all the network device classes and the End Node Ping only pings the devices classified as end nodes, that is, belonging to the classes under the EndNode superclass (AIX, Linux, NoSNMPAccess, Sun, Windows, and so on). This class-based scope is configured on the poll definition, not usually on the policy. See Chapter 2: How do the various scopes work together?

Port Link State

Create another policy for all the switches and use the SNMP Link State poll definition to test the ifOperStatus and ifAdminStatus of all ports and send an alert on state changes. Unlike all the other poll definitions, the poller will only send an alert on state changes – the event count does not increase for these events. For all other poll definitions, the poller sends an alert for each poll that breaches the threshold condition and lets Netcool/OMNIbus perform deduplication to simply increase the count on that event.
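The difference between the two alerting behaviors can be sketched in a few lines of Python. This is an illustrative model only, not poller code: the function names and the in-memory counter standing in for Netcool/OMNIbus deduplication are assumptions made for the example.

```python
from collections import Counter


def link_state_alerts(states, initial_state="up"):
    """Yield an alert only when the polled state differs from the previous one."""
    previous = initial_state
    for state in states:
        if state != previous:
            yield state
        previous = state


def threshold_alerts(values, threshold):
    """Yield an alert for every poll that breaches the threshold."""
    for value in values:
        if value > threshold:
            yield "breach"


def deduplicate(alerts):
    """Model deduplication: identical alerts collapse into one event with a count."""
    return Counter(alerts)


# A port that goes down and stays down produces a single alert.
link_events = deduplicate(link_state_alerts(["up", "up", "down", "down", "down"]))

# Three consecutive breaching polls produce three alerts, deduplicated into one event.
threshold_events = deduplicate(threshold_alerts([91, 95, 97], threshold=90))

print(link_events["down"])         # count on the link state event
print(threshold_events["breach"])  # count on the threshold event
```

In this model the link state event keeps a count of 1 however long the port stays down, while the threshold event's count grows with every breaching poll.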

Other SNMP polls

Other standard polls to consider are interface Bandwidth usage, Errors, and Discards, and also memory and CPU usage for the network devices. If the same thresholds can apply to all devices, then this task is straightforward. Otherwise, copy the poll definitions, edit the threshold conditions, and assign them to separate policies with appropriate scopes.

Poll Definitions

Poll definitions describe the data to collect, the threshold and clear conditions, the description and severity of the threshold event, and some class and interface filtering capability that is covered in the next chapter. This allows you to target a poll definition to a specific set of classes, so that, for instance, this poll definition will collect Cisco data only from Cisco class devices. Or you might use different threshold conditions for the Cisco64xx class than other Cisco devices.

Note: Don't forget to set the event severity. By default it is zero which is Clear. If left unchanged it will result in the events being removed from Netcool/OMNIbus after 2 or 3 minutes, leaving you wondering why events are not being generated!

Default event severities:

0 Clear

1 Indeterminate

2 Warning

3 Minor

4 Major

5 Critical
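As a small illustration of the note above, the severity table can be written as an enumeration with a check that flags a poll definition still left at the default. This is a hypothetical helper for reviewing your own configuration data, not part of Network Manager; the function name and message text are assumptions.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Default event severities, as listed above."""
    CLEAR = 0
    INDETERMINATE = 1
    WARNING = 2
    MINOR = 3
    MAJOR = 4
    CRITICAL = 5


def severity_warning(value):
    """Return a warning if a poll definition's severity is still the default Clear."""
    if Severity(value) is Severity.CLEAR:
        return "Severity is 0 (Clear): events will be removed shortly after they arrive."
    return None


print(severity_warning(0))
print(severity_warning(Severity.MAJOR))
```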

Here are the main differences between the poll definition types.

Chassis Ping & Interface Ping

These poll definitions use ICMP polls. You can store the up/down value and, optionally, the ping response time and the packet loss percentage. Note that the timeout and retries can be adjusted in the poll definition Ping tab, so if you need to allow for slow links to different regions, for instance, you can create separate poll definitions with different values. Use the Copy button to duplicate poll definitions, since most attributes will be the same.

Basic Threshold

This poll definition evaluates to an integer which is used in the threshold condition. The value can also be stored in the historical database.

Generic Threshold

The threshold expression defines the data to collect and the expression evaluates to a boolean. This value cannot be stored. But you can create powerful boolean logic correlating MIB values.

However, while you can include several MIB variables in the expression, they must all be from the same MIB table – since there is no way to explicitly identify table instances from other tables.

Typically you use the Basic expression builder to construct straightforward expressions. For more complex logic expressions for Basic and Generic Thresholds, use the Advanced panel to enter the expression using the OQL and eval statement syntax, which is covered in the Knowledge Center: http://www.ibm.com/support/knowledgecenter/SSSHRK_4.1.1/com.ibm.networkmanagerip.doc_4.1.1/itnm/ip/wip/poll/reference/nmip_poll_syntaxforpolldefexpressions.html

SNMP Link State

This is a fixed poll definition: you cannot change its logic, which is described in the Knowledge Center: http://www.ibm.com/support/knowledgecenter/SSSHRK_4.1.1/com.ibm.networkmanagerip.doc_4.1.1/itnm/ip/wip/poll/concept/nmip_poll_linkstate.html

One other thing to note is the initial state that the poller uses to calculate whether a change has occurred. By default, it assumes the Up state if there is no current event. This can cause a flood of events on startup for unconnected ports where the ifAdminStatus has not been set to Down. Deleting those events does not prevent another flood of events the next time the poller starts. Either set those ports to ifAdminStatus down on the devices or, if you don't want to do that, configure the poller to use the initial state from the first poll if no event exists. With the second option you risk missing ports that went down while the poller was stopped for maintenance.

To set this option, edit the file,

$NCHOME/etc/precision/NcPollerSchema.domain.cfg

and set this property to 1:

update config.properties set UseFirstPollForInitialState = 1;
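The trade-off between the two initial-state behaviors can be modeled with a short Python sketch. It covers only the case where no event exists yet for the port; the function names are hypothetical and the logic is a simplified reading of the behavior described above, not the poller's implementation.

```python
def baseline_state(first_poll_state, use_first_poll_for_initial_state):
    """Return the state that the first poll result is compared against."""
    if use_first_poll_for_initial_state:
        return first_poll_state
    return "up"  # default: assume Up when there is no current event


def event_on_first_poll(first_poll_state, use_first_poll_for_initial_state):
    """True if the first poll after startup is seen as a state change."""
    baseline = baseline_state(first_poll_state, use_first_poll_for_initial_state)
    return first_poll_state != baseline


# An unconnected port that is operationally down when the poller starts.
unconnected_default = event_on_first_poll("down", False)
unconnected_first_poll = event_on_first_poll("down", True)

print(unconnected_default)     # default behavior raises an event at startup
print(unconnected_first_poll)  # first-poll baseline stays quiet
```

The same quiet behavior is what hides a port that genuinely failed while the poller was stopped: in this model its first poll is also "down", so no change is detected.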

Multiple domains

Policies are tied to a domain, so if you want to use them in another domain, you can copy the policies (with their poll definitions) when creating a new domain, or afterwards. See the details of the domain_create.pl and get_policies.pl scripts in the Knowledge Center: http://www.ibm.com/support/knowledgecenter/SSSHRK_4.1.1/com.ibm.networkmanagerip.doc_4.1.1/itnm/ip/wip/ref/reference/nmip_ref_perlscripts.html

However, poll definitions themselves are global and can be used from any domain without copying. So use a naming convention for convenience if you design them to be domain specific.

Chapter 2: How do the various scopes work together?

When creating policies we tend to focus on the main scope definitions that are defined within the policy. However, by ignoring the Classes filter in the poll definitions, you can find that you are not polling all the devices you expected.

How scopes are applied

As you can see from the figure, for a device to be actually polled for a set of data, it must first pass through the policy filters (Network Views and the Device Filter), and then any filter defined within the poll definition for that piece of data (the Classes filter and the Interface filter).
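The order in which the filters apply can be sketched as follows. This is a simplified illustration, assuming devices are plain dictionaries and that an empty class selection lets every device through; the field names and the is_polled function are hypothetical, and the per-interface Interface filter is left out for brevity.

```python
def is_polled(device, policy_views, device_filter, selected_classes):
    """Return True if the device passes the policy filters and the Classes filter."""
    # Policy level: the device must be in a selected Network View
    # and satisfy the Device Filter.
    in_policy = bool(device["views"] & policy_views) and device_filter(device)

    # Poll definition level: no classes selected means no class filtering.
    in_poll_definition = not selected_classes or device["class"] in selected_classes

    return in_policy and in_poll_definition


router = {"name": "core-rtr-1", "class": "Cisco", "views": {"Core"}}
firewall = {"name": "edge-fw-1", "class": "Juniper", "views": {"Core"}}


def accept_all(device):
    """Device Filter that lets every device through."""
    return True


print(is_polled(router, {"Core"}, accept_all, {"Cisco"}))    # passes both levels
print(is_polled(firewall, {"Core"}, accept_all, {"Cisco"}))  # blocked by the Classes filter
print(is_polled(firewall, {"Core"}, accept_all, set()))      # no class filtering applied
```

The second call shows the surprise this chapter warns about: the device is inside the policy scope, yet it is never polled because the poll definition's Classes filter excludes it.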

Poll Definition Classes filter

Do not assume the Class filter is correctly set, even for default poll definitions. Always check that the classes you want are selected. For instance, if the Cisco parent class is unchecked, that is because one of the sub-classes is also unchecked and this is easy to miss at a quick glance.

Note that the parent class itself can also contain devices – thus if the parent class itself is not selected those devices will not get polled.

If you intend not to use the class filter and let all devices through, then simply uncheck all the classes. A quick way to ensure nothing is checked is to click the Core class, which will check everything, and then click it again to clear it.

Tip: After any new AOC classes are added to the system, you must revisit the poll definitions in use to select the new classes if you want to poll those devices. Note that new classes will not be selected in the filter by default and that can cause the parent class to become unselected.

As a best practice, set up the Classes filter in the poll definition relative to the context of the data, rather than using it as part of a policy level broader scope. That will make it less confusing as you reuse poll definitions with other policies over time. For example, if you are polling a specific Cisco MIB, then use the class filter to select Cisco devices. So now your policy scope can be created based on more general criteria, such as geography or device type.

Poll Definition Interface filter

The Interface filter is used to:

• reduce the load on devices which have very large interface MIB tables when responding to SNMP requests

• prevent unimportant interfaces from generating non-interesting alerts

• reduce poller processing resources

A poll definition interface filter is effective in reducing poller processing only when the percentage of interfaces selected is small. An interface filter that only excludes a few rows of data is more resource-intensive for the poller since it queries the SNMP table one row at a time with snmpget requests, instead of the more efficient snmpgetnext table query. So a useful tip is to avoid use of the interface filter unless it reduces the number of interfaces selected considerably.

Policy level filters

Network Manager 3.9 introduced the ability to use Network Views as the scope for polling. This allows you to establish complex, but easily verified, reusable scopes for monitoring. Using the tabular view of the Network View is useful to quickly see the membership at any time. You can also use the Monitoring reports to verify the scope based on the policy filtering; however, these reports do not take account of the poll definition filters.

Once you have selected the Network View or Views, you can refine the scope further with a simple device filter in the Device Filter tab if necessary. You can also select All Devices in the Network View tab and simply define a simple device filter if you don't want to create a Network View just for this policy.

Chapter 3: How to get the best out of threshold events

This chapter covers the various controls to enrich and manipulate the events generated by the poller in order to maximize the relevant information about the problem and the device for the operator.

This section involves a deeper level of knowledge of scripting, the Network Manager OQL language, and Netcool/OMNIbus probe languages. It gives you ideas for tailoring the events to your environment and needs.

Note: Netcool/Impact is a powerful tool to correlate and enrich any event with data from virtually any data source. It is beyond the scope of this guide, but worth considering if you are looking to enrich events from custom tables and data sources beyond what is possible with the methods described here.

Event ID

When creating new poll definitions decide whether you need an existing event ID or a new one. The event ID is used to define that event for use in event handling throughout Network Manager and Netcool/OMNIbus. It corresponds to the EventId field within the event, such as NmosPingFail, inbandwidth, and so on. When you create a new event, a new event ID will be automatically assigned, but you will need to reenter the poll definition edit window to see the new ID.

If you are simply creating a variant of the poll definition, for example with a different threshold value, then use the Copy button to duplicate the existing poll definition and keep the same event ID. This will ensure the event is treated the same in terms of event enrichment, filtering, and so on through the system.

If you are creating a new poll definition for new data or expression, then use the New button so a new event ID is created and can be distinguished from the others.

Event summary description

You can build the event description for the Basic Threshold and Generic Threshold type events as part of the poll definition. You can change the descriptions to suit your operating processes. In the description, you can embed live SNMP variables (even if they are not in the poll expression itself), information from NCIM for the device, and information from the policy and poll definition. See the details, including the syntax, in the Knowledge Center: http://www.ibm.com/support/knowledgecenter/SSSHRK_4.1.1/com.ibm.networkmanagerip.doc_4.1.1/itnm/ip/wip/poll/reference/nmip_poll_syntaxforpolldefexpressions.html

For some event types (Ping Polls, Link State, Remote Ping) the description is hard coded, but you have an opportunity to change it at the next step in the event processing, in the nco_p_ncpmonitor Probe rules file.

If you are polling a single MIB variable you can include its value in the description. While there is no way to include the value of an expression in the description in the poll definition edit window, you can add it at the next step of the event processing, as described in Example 3 in the Probe rules file section below.

Rules file

The poller sends the events to the nco_p_ncpmonitor probe which converts them into the Netcool/OMNIbus event format. The probe has a rules file, $NCHOME/probes/<arch>/nco_p_ncpmonitor.rules, which controls the conversion and import into the Netcool/OMNIbus ObjectServer. For the syntax details, see the Knowledge Center: http://www.ibm.com/support/knowledgecenter/SSSHTQ_7.4.0/com.ibm.netcool_OMNIbus.doc_7.4.0/omnibus/wip/probegtwy/reference/omn_prb_proberulesfilesyntax.html

TIP: Customizing this file is fully documented, but you should understand the logic carefully so that your changes are in the right logic path for the events you are working with. Also be careful to avoid changing other event variables that play a key role in the life of the event.

TIP: When customizing system files, create a backup of them, and then clearly separate your code from the default code. This will be very helpful later on when someone has to migrate this file.

Customizing the Probe rules file is a very powerful way to customize and standardize your polling events. You can add information to the event or filter events under certain conditions so they are discarded and never reach the ObjectServer. A full treatment of the rules file is beyond this guide, but these examples will give you some ideas.

Example 1: Standardize the identification of devices

In the event viewer, some events use the device IP address in the Node field of the event, others use the name. If you prefer to always see the name, then one way to do this is to use the EntityName for the Node field in the event. This includes the ifName for certain events such as interface Ping Fails and SNMP Link State, which provides better interface identification for the operator than just the ifIndex value.

Edit the nco_p_ncpmonitor.rules file and locate the beginning of the standard fields. Add the following line to set the Node field to the EntityName as shown here.

Note: This will only work if you use an interface filter in the SNMP Link State poll definition – otherwise the entityName will always be that of the main node.

#
# populate some standard fields
#
@Severity = $Severity
@Summary = $Description
.
.
# CUSTOM: Use the EntityName to standardize on the name
# rather than the IP address in general. Do the same for
# Link State events, but using a different variable.
# rob 4/23/2014
@Node = $EntityName
if (match($EventName, "NmosLinkState"))
{
    @Node = $ExtraInfo_ENTITYNAME
}

This is the result:

Example 2: Modify the description for Ping Fail events

This example modifies the Summary field for the event by prefixing the name of the domain. If you are using Netcool/OMNIbus to forward events, for example by email or SMS or to a ticketing system, you might want to pack the description with information for convenience. See the end of this section for a description of the details table, which contains variables that the poller passed on for that event but that might not be in an event variable yet. The event variables are in the ObjectServer's alerts.status table.

Custom Tip: You can do this in many ways within the file. Examine the logic to make sure you are placing new code in a path that will be executed for your event.

#
# CUSTOM: Prefix ping fail events with the domain name
# (rob 4/23/2014)
#
if ( match( $EventName, "NmosPingFail" ))
{
    @Summary = $Domain + ": " + @Summary
}

Example 3: Adding value calculation to event summary

This is a technique you can use to add the value of an expression to the event.

You cannot do any calculations within the event description in the poll definition itself, but you can bring the numbers to the rules file and do the calculation there.

In the SnmpInBandwidth poll definition, put the MIB values on the line for the Threshold and Clear description, as in the following example, with a space between each one:

eval(text,"&SNMP.VALUE.ifName") eval(long64,"&SNMP.DELTA.ifOutOctets") eval(long64,"&POLL.POLLINTERVAL") eval(long64,"&SNMP.VALUE.ifSpeed") Exceeded

In the Clear description use the word “Clear” instead of “Exceeded”.

Calculate the value in the rules file. Edit the nco_p_ncpmonitor.rules file and add this section in the path of the standard events:

#
# CUSTOM: Calculate value for snmpInBandwidth event
# rob 4/24/2014
#
if (match($EventName, "inbandwidth"))
{
1   [$if_name, $octets, $pollint, $ifspd, $msg] =
        scanformat(@Summary,"%s %d %d %d %s")
2   $calculated = ($octets * 800)/($pollint * $ifspd)
3   $percent = int($calculated)
4   @Summary = "Bandwidth threshold " + $msg +
        ", value is currently " + $percent +
        "% on " + $if_name
}

Line 1: scans the 5 values from the Summary into each variable.

Line 2: performs the same bandwidth utilization calculation as in the poll definition. The division forces the result to be a decimal number.

Line 3: Converts the real number back to an integer

Line 4: Rebuilds the summary string with the calculation result
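To check the arithmetic outside the probe, the same steps can be reproduced in Python. This is a sketch for verifying numbers by hand, not a replacement for the rules file; it assumes the poll interval is in seconds and ifSpeed is in bits per second, so that octets × 8 bits × 100 gives the factor of 800. The function names and the sample summary are made up for the example.

```python
def bandwidth_percent(delta_octets, poll_interval_s, if_speed_bps):
    """Utilization as a whole-number percentage of the interface speed."""
    calculated = (delta_octets * 800) / (poll_interval_s * if_speed_bps)
    return int(calculated)


def rebuild_summary(summary):
    """Split the five space-separated values and rebuild the summary text."""
    if_name, octets, pollint, ifspd, msg = summary.split()
    percent = bandwidth_percent(int(octets), int(pollint), int(ifspd))
    return f"Bandwidth threshold {msg}, value is currently {percent}% on {if_name}"


# 675,000,000 octets in 60 seconds on a 100 Mbit/s interface.
sample = "Gi0/1 675000000 60 100000000 Exceeded"
print(rebuild_summary(sample))
# prints: Bandwidth threshold Exceeded, value is currently 90% on Gi0/1
```

Note that an ifSpeed of zero would cause a division by zero in this sketch; whether and how to guard against that in the rules file depends on the devices you poll.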

Here are the events after implementing Examples 2 and 3:

Some nuances to be aware of

Note that the @ prefixes the event fields (in the ObjectServer alerts.status table) and $ prefixes the variables from the poller in the Details section. To see the fields passed to the probe from the poller, add the following line to the rules file at the end:

details($*)

This will populate the ObjectServer details table, which can be seen by looking at the Information for an event and clicking the Detail tab.

After making changes to the rules file you can either restart the nco_p_ncpmonitor process or just send a SIGHUP signal to the running process to reread the rules file:

# itnm_status ncp
Network Manager:
Domain: ITNMDEMO
ncp_ctrl RUNNING PID=1245 ITNMDEMO
.
.
nco_p_ncpmonitor RUNNING PID=1423 ITNMDEMO
.
.
# kill -HUP 1423

Check $NCHOME/log/precision/nco_p_monitor.domain.log for syntax errors.

Event enrichment in the Event Gateway

Once events have been inserted into the ObjectServer by the probe, they are acted on by other agents: ObjectServer triggers, for instance, as well as the Network Manager Event Gateway. This gateway reads events from the ObjectServer and updates them with enrichment from the various Network Manager plugins, including the Root Cause Analysis engine and the StandardEventEnrichment stitcher.

For this example, we want to include the ifAlias for interface events. We could do this in the rules file, but that would only affect events from the poller. The Event Gateway can affect events from all sources that can be matched to the discovered topology.

Step 1: Add new field to alerts.status in the ObjectServer

TIP: Keep a record of fields you add to the ObjectServer for future reminder during upgrades.

Create a new field called InterfaceAlias. Add this line to a file; let's call it customObjectServerFields.sql:

alter table alerts.status add column InterfaceAlias varchar(64);

Run this Netcool/OMNIbus command to execute the script:

nco_sql -server objectservername -user root -password password < customObjectServerFields.sql

Step 2: Edit EventGatewaySchema.cfg to act on the new field

Near the bottom of this file you will see the insert statements for the two tables:

nco2ncp (controls events read from the ObjectServer (nco) into the Gateway (ncp))

ncp2nco (controls events being written back to the ObjectServer)

Add the new InterfaceAlias field to the nco2ncp table, so it is read in from the ObjectServer:

insert into config.nco2ncp
(
    EventFilter,
    StandbyEventFilter,
    FieldFilter
) values
(
    "LocalNodeAlias <> '' and (NmosDomainName = '$DOMAIN_NAME' or NmosDomainName = '')",
    "EventId in ('ItnmHealthChk', 'ItnmDatabaseConnection')",
    [
        "Acknowledged",
        "AlertGroup",
        "EventId",
        .
        //CUSTOM: added by rob 4/24/2014
        "InterfaceAlias",
        .
        .

Step 3: Add new field to the ncp2nco table

Now add the new InterfaceAlias field to the ncp2nco table so that it will be written back to the ObjectServer:

insert into config.ncp2nco
(
    FieldFilter
)
values
(
    [
        "NmosCauseType",
        "NmosDomainName",
        "NmosEntityId",
        "NmosManagedStatus",
        "NmosObjInst",
        "NmosSerial",
        //CUSTOM: added by rob 4/24/2014
        "InterfaceAlias"
    ]
);

Step 4: Populate the new InterfaceAlias field

Edit the file,

$NCHOME/precision/eventGateway/stitchers/StandardEventEnrichment.stch

and add code to populate the new field above the line,

GwEnrichEvent( enrichedFields );

The numbers in the left margin are not part of the code; they are referenced in the notes that follow.

    //CUSTOM: Populate the new InterfaceAlias field,
1   if ( entityType == 2 )
    {
2       text ifAlias = @entity.interface.IFALIAS;
3       if ( ifAlias <> eval(text, '&InterfaceAlias') )
        {
4           @enrichedFields.InterfaceAlias = ifAlias;
        }
    }
5   GwEnrichEvent( enrichedFields );

Line 1: Only do this for interface events (entityType of 2).

Line 2: Declare and initialize the ifAlias variable with the current value from the NCIM interface table.

Line 3: Check if the current value is different from the value in the event (which will be empty the first time it is read from the ObjectServer).

Line 4: Set the enriched InterfaceAlias field to the NCIM value, if different.

Line 5: Process the newly enriched event variables.

After making changes to these two files you can either restart the ncp_g_event process to reread them or send a SIGHUP signal to the running process to reread the files:

# itnm_status ncp

Network Manager:

Domain: ITNMDEMO

ncp_ctrl RUNNING PID=1245 ITNMDEMO

.

.

ncp_g_event RUNNING PID=1569 ITNMDEMO

.

.

# kill -HUP 1569

Check

$NCHOME/log/precision/ncp_g_event.domain.log

for syntax errors.


Chapter 4: Making the most of historical data

Short-term diagnostic tool

Network Manager provides short-term historical data by storing SNMP and ICMP data collected from network devices and using Tivoli Common Reporting to view and analyze the data. Features typical of a full performance management product, such as optimized data storage, routine data gathering over the extended periods required for capacity planning, and regulatory reporting, are available with Tivoli Netcool Performance Manager, but are not a goal of the Network Manager historical reports.

Use the Network Polling configuration panels to:

Define data to collect, including setting any threshold triggers for alerts

Define the scope, and time interval for polling

Determine what data to store

Start the data collection

You can use the Tivoli Common Reporting viewer to:

View sets of defined reports detailing trends and analysis based on SNMP and ICMP short-term historical data collections for a subset of the collected data

View generic Trend and TopN graphs of ad hoc collections of stored data

You can use this for closely monitoring problematic or key network devices after a maintenance period, or an area where you suspect problems and want to get a better understanding of behavior trends with throughput, device CPU/memory resources, interface usage, errors, discards, etc. The TopN reports can help operators compare and focus on the right devices and drill down to see patterns over time. Summarization reports (with Tivoli Data Warehouse) can help extend the time period you want to compare performance over. The default reports can be used as examples to edit to meet your needs.

Do I need Tivoli Data Warehouse?

If storing and reporting on performance data is important and you are not using a dedicated performance management product, then you might want to consider using Tivoli Data Warehouse to store the historical data.

By default the data is stored and maintained in the local NCIM database. If your environment includes IBM Tivoli Monitoring and Tivoli Data Warehouse, you can take advantage of these tools for storing, summarizing, and managing the performance data collected by Network Manager.

Tivoli Data Warehouse supports summarizing data across time periods such as hourly, daily, weekly, and monthly.

There are sample Summary reports that make use of the Summarization tables in TDW (and therefore cannot be run without TDW).

The Device Summarization report and Interfaces Summarization report present data in raw, hourly, and daily graphs on the same page for the data you have stored. This allows you to view behavior over a longer period of time.

The Device Availability and Interface Availability reports present ping response time as well as graphs for availability in the last 24 hours, last 30 days, and last 3 months.

Tivoli Data Warehouse also supports advanced data pruning and archiving to other stores.

Storage capacity

When calculating how much data you can store, you need to consider not just the number of rows or data points, but also the rate of storage.

By default, the Network Manager poller applies a pruning policy that keeps the latest 5 million rows in the local database. You can raise this limit if report generation still performs satisfactorily in your environment. Reset the limit in the $NCHOME/etc/precision/NcPollerSchema.cfg file for the local cache, as described in the Knowledge center, http://www.ibm.com/support/knowledgecenter/SSSHRK_4.1.1/com.ibm.networkmanagerip.doc_4.1.1/itnm/ip/wip/admin/task/nmip_adm_increasestoragelimitforhistperfdata.html.

You can store up to 20 million rows in either the local database or Tivoli Data Warehouse. Depending on your hardware and database performance, you might see degradation in the storage and reporting performance above 20 million rows. With increasing storage rates and table size, you might also need the services of your Database Administrator to optimize, run statistics and perform transaction log maintenance regularly on the database.

The sustained rate of data storage depends on a number of factors:

Number of polled entities

Number of metrics polled

Frequency of polling

Number of policies

Number of pollers

Database performance (for large rates, a slow database will have an impact)

Insertion rates to the local ncpolldata database have been seen up to 7 million data points per day across all pollers. However, with high rates like this you need to watch the pruning to make sure it is keeping up.

A single polling policy will not store much above 1 million data points per day, but you can use multiple policies and multiple pollers within the suggested overall range. You can increase the number of pollers, as described in Chapter 6: How many poller instances do I need? To avoid excessive impact on the database, do not use more than 3 pollers for storing data. Create one ITM agent instance per Network Manager poller.

Throughput to the Tivoli Data Warehouse is up to a total of 7 million rows per day with all pollers. Exceeding this limit shortens the tolerance limit for outages and might cause loss of data. While rates higher than this can be achieved, you will want to allow for error conditions on the network link and transfer processes which include store and forward techniques. ITNM will tolerate and recover from transfer outages without loss of data for a short period; how long depends on the data rate, the available disk space, and the length of the outage. The longer the outage, the longer it takes for the system to recover.

To calculate the rate of data points you want to store, follow these steps.

For data based on the device, e.g. memory, CPU, for each SNMP poll definition,

$$\text{Datapoints per day} = \text{Number of devices} \times \frac{\text{Number of minutes in a day}}{\text{Polling frequency in minutes}}$$

For data based on the interface, e.g. bandwidth, ifInDiscards, for each SNMP poll definition,

$$\text{Datapoints per day} = \text{Number of interfaces} \times \frac{\text{Number of minutes in a day}}{\text{Polling frequency in minutes}}$$

To get a feel for the storage capacity, here is an example.

A user wants to poll 1000 devices with an average of 5 network interfaces for the following historical polled data with the following polling intervals:

Device level:
o memory utilization, 5 minute intervals

Interface level:
o ifInDiscards, 10 minute intervals
o ifInErrors, 10 minute intervals
o bandwidth, 10 minute intervals
o Pings for up/down status and response time, 5 minute intervals

Based on these device level and interface level polling requirements, a user would calculate the daily rate of database rows using the previously described guidelines.

Note: In the example, the SNMP poll specifies a count of three data points for the ifInDiscards, ifInErrors, and bandwidth historical polled data. The ICMP poll specifies a count of two data points, one for the up/down status and another for the response time.

Number of device level database rows per day (SNMP)
= 1000 devices * 60 * 24/5 polls per day
= 288,000 rows

Number of network interface level database rows per day (SNMP)
= 1000 * 5 interfaces * 3 data points * 60 * 24/10 polls per day
= 2,160,000 rows

Number of ICMP database rows per day
= 1000 devices * 5 interfaces * 2 data points * 60 * 24/5 polls per day
= 2,880,000 rows

Total database rows per day = 5,328,000

The previous example shows the total database rows per day of 5,328,000, which is within the upper guidance of 7 million database rows per day. Thus, this example shows a scenario that results in maintaining about 4 days of raw data after increasing the storage limit for historical polled data to 21 or 22 million database rows.

This example is copied from the Knowledge center: http://www.ibm.com/support/knowledgecenter/SSSHRK_4.1.1/com.ibm.networkmanagerip.doc_4.1.1/itnm/ip/wip/poll/concept/nmip_poll_storagecapacityexample.html

For ICMP data, you can choose to store the up/down value, and optionally the response time and/or packet loss. Use the above formula for the SNMP collection based on whether you are using device or interface based pings and multiply the result by 1, 2 or 3 depending on whether or not you need the response time and packet loss data in addition to the up/down data point.
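The storage arithmetic above can be reproduced with a short script, which makes it easy to try other device counts and polling intervals. This is only a sketch: the function names are illustrative and not part of Network Manager, and the 21.5 million row limit is an assumed value chosen from the "21 or 22 million" range mentioned in the example.

```python
MINUTES_PER_DAY = 24 * 60


def datapoints_per_day(entities, interval_minutes, datapoints_per_poll=1):
    """Rows stored per day for one poll: entities x data points x polls per day."""
    polls_per_day = MINUTES_PER_DAY / interval_minutes
    return int(entities * datapoints_per_poll * polls_per_day)


def retention_days(storage_limit_rows, rows_per_day):
    """Approximate days of raw data kept before the pruning limit is reached."""
    return storage_limit_rows / rows_per_day


# Figures from the worked example: 1000 devices, 5 interfaces each.
devices = 1000
interfaces = devices * 5

device_rows = datapoints_per_day(devices, 5)                           # memory
interface_rows = datapoints_per_day(interfaces, 10, 3)                 # discards, errors, bandwidth
icmp_rows = datapoints_per_day(interfaces, 5, 2)                       # up/down, response time
total_rows = device_rows + interface_rows + icmp_rows

print(device_rows, interface_rows, icmp_rows, total_rows)
# Assumed storage limit of 21.5 million rows (hypothetical value for illustration).
print(round(retention_days(21_500_000, total_rows), 1))
```

For ICMP polls, pass 1, 2 or 3 as datapoints_per_poll depending on whether response time and packet loss are stored in addition to the up/down value.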

Use of the Data Label

It is often useful to graph the results gathered from different poll definitions. For example, you might have duplicated poll definitions for ifInDiscards with different threshold values and event severities, but want to compare the values on the same reports regardless of the specific poll definition.

Most reports will use the Data Label field for grouping purposes. By default, the Data Label is set to the poll definition name, but you can set a common name in the poll definition GUI panel across the different poll definitions.

When selecting the parameters for the report, you choose the data to present in the report using the common data label. The report will present data tagged with the same data label but collected by more than one poll definition.


Chapter 5: What is adaptive polling?

You can define a network view based on events associated with the device, which makes the population of this view variable and dependent on the life cycle of these events. Adaptive polling allows you to poll devices only under certain conditions, for example when they are at risk of failure. High throughput on an interface might not be a red flag by itself, but the interface is now more at risk of discarding packets or reporting errors, and is worth monitoring more closely.

If you wanted to conserve polling on all your devices for bandwidth, discards and errors, you could set up a policy just to poll for discards and errors on devices that exceeded the bandwidth threshold.

First, define a network view based on the existence of the event; let's call it HighThroughput. Select the type Filtered and define the filter based on the activeEvent table. To find the eventId of the event, go to the poll definition for snmpInBandwidth and note the Event ID on the General tab. In this case it is inbandwidth. So the filter will be,

Using activeEvent table: eventId = 'inbandwidth'

Of course you can combine this with any other filter when defining network views to narrow down or expand the scope to suit your needs.

Now you can view all the devices with at least one interface that exceeded the bandwidth threshold. As the throughput declines and the events clear, you will see that the devices no longer appear in the view.

Now set up a policy using this new HighThroughput view to poll for ifInDiscards and ifInErrors.

Note on the policy General tab there is an edit box for Policy Throttle. By default this is zero so that the policy is not affected. But when using polling scopes that can be variable like this, it is sometimes prudent to enter a value for the maximum number of devices to poll. In this case you might feel that even in the worst case, the poller will be able to handle the load, but if you were thinking of setting up an accelerated polling scheme, as described in the examples in the Knowledge center, http://www.ibm.com/support/knowledgecenter/SSSHRK_4.1.1/com.ibm.networkmanagerip.doc_4.1.1/itnm/ip/wip/poll/task/nmip_poll_managingadaptivepolling.html, then it would be advisable to guard against event storms.

This is an advanced technique so consider starting small, and evaluate how useful it is for you.


Chapter 6: How many poller instances do I need?

When you increase the amount of polling, you might reach the point where you need to set up more poller instances. The number of devices and interfaces, the frequency of polling, the number of metrics being polled, and your network latency all contribute to the load. Use the new poller metrics described in Chapter 7: Are the pollers healthy? to monitor the pollers as you expand the polling demands across your network. This will help you determine when to set up more pollers.

How many?

We suggest, as best practice, that you set up three pollers per domain.

1. One poller to perform the administration functions for all pollers in this domain. In addition, use this poller to perform the MIB Grapher real-time polling so that this variable use does not impact the pollers handling the regular polls.

2. One poller to perform ICMP polls. These are lightweight polls and one poller is generally enough for the biggest networks with 5 minute frequency.

3. One poller to perform SNMP polls. These polls require more resources for the poller. Use the poller metrics to monitor the poller so that you can determine if you need to set up additional pollers. See Chapter 7: Are the pollers healthy?

Availability polling is your mission-critical monitoring and is relatively light weight on the poller, so use a separate poller for all the pings and verify from the poller metrics that they are healthy even during high load discovery periods. Ideally you do not want to risk overloading or destabilizing this poller over time with people adding ad-hoc policies since it is the most important.

Add additional pollers for SNMP and performance polling. Policies that store data place additional burdens on the poller and it is important to ensure that the poller will be able to sustain the loads even under mild duress, such as brief database maintenance outages.

If you are storing polled data, try to store data from as few pollers as possible, no more than 3, to avoid undue database contention when using high volumes.

Tips for defining multiple pollers

By default, all poller instances will perform the administration duties, so designate just one of them for the admin role and give it a name like “AdminPoller” to remind everyone not to assign policies to it. Explicitly designate each poller using the command line arguments -admin or -noadmin when defining the poller instances in the CtrlServices.cfg file.

Note that the default poller does not have an explicit instance name and you will see it referred to in the Network Polling GUI as DEFAULT_POLLER.

There is nothing special about the default poller and you are free to rename it explicitly for clarity.

It is a good idea when setting up the new pollers in CtrlServices.domain.cfg to modify the service name (serviceName field in the services.inTray table) to be the same as the poller name. By doing this you ensure the name is used consistently for that poller across the product.

Tip: Poller naming convention

1. Avoid spaces, since the poller name is used as part of file names, such as the logs, metrics, and cfg files (for example, NcPollerSchema.AdminPoller.NCOMS.cfg). To make life easier, choose names that are compatible with the file system.

2. ServiceName: since the poller processes run under ncp_ctrl (see itnm_status ncp), use a naming convention with the “ncp” prefix for the serviceName when you add additional pollers. This allows you to continue to use the ps -ef|grep ncp command to view all the core processes.

For example, in CtrlServices.domain.cfg, use something like,

“ncp_poller_AdminPoller” when the -name argument is “AdminPoller”

“ncp_poller_PingPoller” when the -name argument is “PingPoller”

insert into services.inTray
(
    serviceName,
    .
    argList,
    .
)
values
(
    "ncp_poller_AdminPoller",
    .
    [ "-domain" , "$PRECISION_DOMAIN" , "-latency" , "100000",
      "-debug", "0", "-messagelevel", "warn", "-admin", "-name",
      "AdminPoller" ],
    .
);

insert into services.inTray
(
    serviceName,
    .
    argList,
    .
)
values
(
    "ncp_poller_PingPoller",
    .
    [ "-domain" , "$PRECISION_DOMAIN" , "-latency" , "100000",
      "-debug", "0", "-messagelevel", "warn", "-noadmin", "-name",
      "PingPoller" ],
    .
);

MIB Grapher

Don't forget to configure the MIB Grapher to use the admin designated poller.

Edit $NCHOME/precision/profiles/TIPProfile/etc/tnm/tnm.properties.

By default, the MIB Grapher is configured to use the default poller:

tnm.graph.poller=DEFAULT_POLLER

Change it to the value of the admin poller's -name argument:

tnm.graph.poller=AdminPoller

Using multiple pollers

When dividing up your policies to specific poller instances, consider the following:

Multiple pollers are only supported on the same server. This ensures a consistent source point with the discovery for event correlation in RCA.

OQL service name. Normally you do not need to query the pollers with OQL, but it can be useful when diagnosing some issues and you want to see exactly what devices and data the poller has scheduled for polling and when it last polled each data point. When using OQL to query the pollers, use the following syntax for the unnamed default poller (if you have one),

ncp_oql -domain NCOMS -service SnmpPoller

and for named pollers, such as “PingPoller”,

ncp_oql -domain NCOMS -service SnmpPoller -poller 'PingPoller'

For full details, see the Knowledge center: http://www.ibm.com/support/knowledgecenter/SSSHRK_4.1.1/com.ibm.networkmanagerip.doc_4.1.1/itnm/ip/wip/admin/task/nmip_adm_admindistpoll.html

Chapter 7: Are the pollers healthy?

To prevent problems from occurring on a poller, it is important for administrators to monitor the health of their pollers. To assist them in doing this, the poller outputs a set of metrics that show how the poller is handling the load placed upon it. These metrics show the state of both the poller and its active policies.

| Metric Name | Description |
|---|---|
| Health | Policy Health. This is the percentage of devices that are polled during a policy cycle. If this value is 100%, the poller is working properly. If the value is below 100%, not all the devices were polled during the last polling interval. |
| Memory | The amount of system memory (in MB) that the poller is using. Memory usage increases as more devices are discovered or more policies are enabled. |
| BatchQueueSize | The number of SNMP batch requests waiting for a thread in which to complete the operation. |
| PollDataQueueSize | The number of INSERT statements that are queued to the NCPOLLDATA database. Shows whether the poller is successfully storing polling data at a rate consistent with the rate of polling. |
| PollDataRowCount | The number of rows in the ncpolldata.polldata table. This table stores the historical poll data and should not exceed the maximum set. |

Table 1: Poller Metric Data

The metrics are written to a file, one per poller:

$NCHOME/log/precision/ncp_poller.SnmpPoller.<pollername>.<domain>.metrics

and are structured such that they can be easily parsed, or manually scanned by the user:

2014-04-09T16:36:24 PollerStart
2014-04-09T16:36:30 PollStart Policy=41 PollDef=1
2014-04-09T16:36:30 Memory=724
2014-04-09T16:39:00 BatchQueueSize=1
2014-04-09T16:39:34 PollDataRowCount=1909311
2014-04-09T16:45:45 PollDataRowCount=1909169
2014-04-09T16:51:48 PollDataRowCount=1903141
2014-04-09T16:57:51 PollDataRowCount=1903141
2014-04-09T17:03:54 PollDataRowCount=1903141
2014-04-09T17:05:34 Health=100 Monitors=44 Behind=0 Policy=41 PollDef=1
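Because each record is a timestamp followed by a bare keyword or by key=value pairs, the file is also easy to read from your own scripts. The following is a minimal sketch that assumes every record follows the layout of the sample above; the function name parse_metric_line is illustrative and is not part of the product.

```python
from datetime import datetime


def parse_metric_line(line):
    """Return (timestamp, record_name, fields) for one metrics record."""
    tokens = line.split()
    timestamp = datetime.strptime(tokens[0], "%Y-%m-%dT%H:%M:%S")
    name = None
    fields = {}
    for token in tokens[1:]:
        if "=" in token:
            key, value = token.split("=", 1)
            # Numeric values become ints; anything else is kept as text.
            fields[key] = int(value) if value.isdigit() else value
        else:
            name = token  # bare keyword such as PollerStart or PollStart
    return timestamp, name, fields


# A few records taken from the sample metrics output above.
sample = """\
2014-04-09T16:36:24 PollerStart
2014-04-09T16:36:30 PollStart Policy=41 PollDef=1
2014-04-09T16:36:30 Memory=724
2014-04-09T16:39:34 PollDataRowCount=1909311
2014-04-09T17:05:34 Health=100 Monitors=44 Behind=0 Policy=41 PollDef=1
"""

records = [parse_metric_line(line) for line in sample.splitlines() if line.strip()]
row_counts = [f["PollDataRowCount"] for _, _, f in records if "PollDataRowCount" in f]
print(row_counts)
```

The parsed records can then be filtered by field name, for example to track PollDataRowCount or Memory over time.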

Use this command line tool to graph the metric data:

$NCHOME/precision/scripts/perl/scripts/itnm_poller.pl

The script scans the metrics file and presents simple charts of the data. Run this script in the location of the metric file you wish to view. For example,

ncp_perl $NCHOME/precision/scripts/perl/scripts/itnm_poller.pl -domain <name> [-poller <pollername>] -metrics -window <interval in hours>

For full information on the itnm_poller.pl utility, use the -help argument or see the Knowledge center: http://www.ibm.com/support/knowledgecenter/SSSHRK_4.1.1/com.ibm.networkmanagerip.doc_4.1.1/itnm/ip/wip/poll/task/nmip_poll_monitorpollerhealth.html

The script produces charts for each metric, lining them up on the same timeline making it easy to get a complete picture of the factors involved.

Illustration 1: Policy Health (one for each active policy)

Illustration 2: Memory Usage


Illustration 3: SNMP Batches in Queue

Illustration 4: Data Collection and Storage row count

Each of these metrics is designed to assist the administrator in answering specific questions about the state of their poller.

1. Is the historical poll data table being maintained?

2. Is the poller keeping up with the policy load at the scheduled frequencies?

3. Is the poller's memory stable?

4. Is the poller successfully storing data?

5. Do I need to add a new poller?

We will go through each of these questions and show how to use these metrics to help in answering them.

1) Is the historical poll data table being maintained?

As part of its normal operation the poller prunes old and obsolete data from the NCPOLLDATA database, keeping it trimmed to the cap set in the poller configuration file. If the poller is unable to keep up with deleting records from this database, it can run into problems when it attempts to store more data. By reviewing the metric data for the poll data row count, the administrator can assess whether this is happening. If an issue is detected, the reasons can vary; by reviewing the complete set of charts the administrator can make an assessment as to the cause.

In the chart below we can see an upward trend in the poll data row count. The cap for this instance is set at 5,000,000 and in the beginning the poller is able to maintain the level. At around 12:00 we can see the trend go upward, and the poller unable to keep the count below the desired cap.

This chart alone would not indicate the reason for the trend; we have to review the other metric data available. We first look at the Poll Data Queue size and from this metric chart we see a jump in the queue during the same time period.

The poll data queue increases, but it is not a continual upward trend; there are occasional drops shown in the chart. This type of trend would direct us to check the polling load, specifically the amount and rate at which the poll data is being collected. The next thing to check is what policies are currently enabled. From a review of the metrics file (or graph), we can see a large number of new polls started at the point the Poll Data Queue started to climb:

2014-04-09T08:11:08 PollStart Policy=11 PollDef=6
2014-04-09T08:11:09 PollStart Policy=11 PollDef=6
2014-04-11T12:01:56 PollStart Policy=91 PollDef=4 <=== new policy starts
2014-04-11T12:01:56 PollStart Policy=91 PollDef=13
2014-04-11T12:01:56 PollStart Policy=91 PollDef=20
2014-04-11T12:01:56 PollStart Policy=91 PollDef=21
2014-04-11T12:01:56 PollStart Policy=91 PollDef=22
2014-04-11T12:01:56 PollStart Policy=91 PollDef=23
2014-04-11T12:01:56 PollStart Policy=91 PollDef=27
2014-04-11T12:01:56 PollStart Policy=91 PollDef=28
2014-04-11T12:01:56 PollStart Policy=91 PollDef=29
2014-04-11T12:01:56 PollStart Policy=91 PollDef=30
2014-04-11T12:01:56 PollStart Policy=91 PollDef=31
2014-04-11T12:01:56 PollStart Policy=91 PollDef=1

At this point we would want to review the actual policy scope, polling rate, and storage settings. The poller's profiling.policy OQL table shows the target load in the policy.

ncp_oql -domain <domain> -service SnmpPoller -tabular -query "select * from profiling.policy;"

The polling rate can be seen in the ncpoller.job OQL table:

ncp_oql -domain <domain> -service SnmpPoller -tabular -query "select * from ncpoller.job;"


From these tables we see a large number of targets being polled at a rate of 18 seconds. The polling interval is much too aggressive, probably a typo on the part of the user who configured the policy, and is likely the cause of the climb in our Poll Data Queue and storage counts. At this point the administrator would want to disable the policy, allow for the poller to catch up in storing and pruning the poll data, then fix the policy polling rate and restart it.

This is just an example of how this data can be used to diagnose data storage issues. The inability of the poller to prune the data might not always be related to the storage rate. A poller can also be unable to prune if the database is experiencing issues that prevent the batches of poll data deletes from completing.

In this example we see the Poll Data Row count steadily increasing, with no drops.

Reviewing the Poll Data Queue we see that it is steady, with no increases at all:


At this point we would suspect the poller's ability to prune the poll data. If the database is fully loaded, it might not be able to handle the batches of deletes from the poller. If the batches of deletes are failing, this information would be recorded in the poller log file.

There are times when database performance causes issues with the poller's ability to perform the poll data delete in a timely manner. When this happens it can result in the pollData database table exceeding capacity.

In this next chart we see an example of the Poll Data Row count exceeding the limit and being pruned below the desired level. In this example our chosen limit is 250,000 records.

The downward steps in the chart span roughly 1 hour each, implying that it is taking the poller an hour to delete 50,000 records. Deletes that take this long are a cause for concern, and at this point the user would want to contact their database administrator. A typical cause is poor log pruning maintenance.
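The same estimate can be made from PollDataRowCount samples rather than by reading the chart. The sketch below is illustrative only: the function names are not part of the product, and the sample values are hypothetical numbers chosen to mirror the 50,000 records per hour and the 250,000 record limit described above.

```python
from datetime import datetime


def prune_rate_per_hour(samples):
    """Net rows removed per hour between the first and last (timestamp, row_count) sample."""
    (first_time, first_count), (last_time, last_count) = samples[0], samples[-1]
    hours = (last_time - first_time).total_seconds() / 3600
    if hours <= 0:
        raise ValueError("samples must span a positive amount of time")
    return (first_count - last_count) / hours


def hours_to_reach_limit(limit, current_rows, rate_per_hour):
    """Hours needed to prune back to the limit at the observed net rate."""
    excess = max(current_rows - limit, 0)
    return excess / rate_per_hour


# Hypothetical samples: 50,000 rows removed over one hour.
samples = [
    (datetime(2014, 4, 11, 9, 0, 0), 400000),
    (datetime(2014, 4, 11, 10, 0, 0), 350000),
]
rate = prune_rate_per_hour(samples)
print(rate)
print(hours_to_reach_limit(250000, 350000, rate))
```

Note that the result is a net rate: if the poller is still inserting rows while it prunes, the actual delete rate is higher than the figure shown.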

2) Is the poller keeping up with the policy load at the scheduled frequencies?

As policies are enabled on a poller it is important to monitor the policy health to determine if the poller is able to keep up with the load. Since there can be multiple poll definitions on a policy, the health is computed for each poll definition in the policy separately.


If the poller cannot keep up with the load, it writes Policy Health records to the metrics file. This health is the percentage of the devices in the poll that the poller is able to complete polling on during the polling interval.

Example 1 – increasing the number of policies

Below is a sample chart of the health of a policy/poll definition. As you can see the policy/poll definition is considered healthy up until 11:30, at which point it begins to decline. The administrator is made aware of this by a status alert that the poller sends for the poll.

Since this particular poll was once perfectly healthy we need to dig down a bit further to find out the cause of the change. Reviewing other metric charts we see that there are other active polls that belong to this poller:


Both of these polls appear to have been started around the time of the declining health of the first poll, so it is a strong indication that the load from these two additional polls is more than the poller can handle.

Example 2 – Increased number of entities

There are other factors that might cause a policy/poll definition to become unhealthy, such as increased load after a discovery. Below is another example of a Policy Health chart, but this time no other policies are enabled:

In this case the health fluctuates a little at first but then reaches 100 percent. Later on it drops off, never fully recovering. By reviewing the actual Health records in the metrics file for this policy/poll definition we can see a jump in the entity count (Monitors) at 11:10:

2014-04-11T09:52:42 Health=100 Monitors=53 Behind=0 Policy=89 PollDef=38
2014-04-11T10:00:53 Health=100 Monitors=53 Behind=0 Policy=89 PollDef=38
2014-04-11T10:09:39 Health=98 Monitors=1240 Behind=24 Policy=89 PollDef=38
2014-04-11T10:11:49 Health=31 Monitors=1240 Behind=855 Policy=89 PollDef=38
2014-04-11T10:13:57 Health=31 Monitors=1240 Behind=855 Policy=89 PollDef=38
2014-04-11T10:16:05 Health=31 Monitors=1240 Behind=855 Policy=89 PollDef=38
2014-04-11T10:18:47 Health=74 Monitors=1240 Behind=322 Policy=89 PollDef=38
2014-04-11T10:21:16 Health=100 Monitors=1240 Behind=0 Policy=89 PollDef=38
2014-04-11T11:04:00 Health=100 Monitors=1240 Behind=0 Policy=89 PollDef=38
2014-04-11T11:06:17 Health=100 Monitors=1240 Behind=0 Policy=89 PollDef=38
2014-04-11T11:08:36 Health=100 Monitors=1240 Behind=0 Policy=89 PollDef=38
2014-04-11T11:10:42 Health=92 Monitors=6278 Behind=450 Policy=89 PollDef=38
2014-04-11T11:12:55 Health=11 Monitors=6278 Behind=5547 Policy=89 PollDef=38
2014-04-11T11:15:07 Health=52 Monitors=6278 Behind=2974 Policy=89 PollDef=38
2014-04-11T11:17:16 Health=62 Monitors=6278 Behind=2372 Policy=89 PollDef=38
2014-04-11T11:19:31 Health=5 Monitors=6278 Behind=5902 Policy=89 PollDef=38
2014-04-11T11:21:39 Health=0 Monitors=6278 Behind=6278 Policy=89 PollDef=38
2014-04-11T12:04:59 Health=25 Monitors=6278 Behind=4708 Policy=89 PollDef=38
2014-04-11T12:14:59 Health=50 Monitors=6278 Behind=3139 Policy=89 PollDef=38
2014-04-11T12:24:59 Health=50 Monitors=6278 Behind=3139 Policy=89 PollDef=38

So in this example the policy health declined as a result of a discovery or scope change that caused a drastic increase in the number of entities being polled. At this point the user would need to review the policy and either reduce the scope or increase the polling interval.
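Spotting such a jump by eye is practical for a short file, but a small script can flag it automatically. The sketch below assumes Health records follow the layout shown above; the function name monitor_jumps and the doubling threshold are illustrative choices, not product behavior.

```python
import re

# Layout assumed from the sample Health records in the metrics file.
HEALTH_RE = re.compile(
    r"^(?P<ts>\S+) Health=(?P<health>\d+) Monitors=(?P<monitors>\d+) "
    r"Behind=(?P<behind>\d+) Policy=(?P<policy>\d+) PollDef=(?P<polldef>\d+)$"
)


def monitor_jumps(lines, factor=2.0):
    """Return (timestamp, previous, current) wherever Monitors grew by at least `factor`."""
    previous = {}  # last Monitors value seen per (policy, poll definition)
    jumps = []
    for line in lines:
        match = HEALTH_RE.match(line.strip())
        if not match:
            continue  # skip records that are not Health records
        key = (match["policy"], match["polldef"])
        monitors = int(match["monitors"])
        before = previous.get(key)
        if before and monitors >= before * factor:
            jumps.append((match["ts"], before, monitors))
        previous[key] = monitors
    return jumps


# A subset of the Health records listed above.
sample = [
    "2014-04-11T10:00:53 Health=100 Monitors=53 Behind=0 Policy=89 PollDef=38",
    "2014-04-11T10:09:39 Health=98 Monitors=1240 Behind=24 Policy=89 PollDef=38",
    "2014-04-11T11:08:36 Health=100 Monitors=1240 Behind=0 Policy=89 PollDef=38",
    "2014-04-11T11:10:42 Health=92 Monitors=6278 Behind=450 Policy=89 PollDef=38",
]
jumps = monitor_jumps(sample)
for timestamp, before, after in jumps:
    print(timestamp, before, "->", after)
```

On these records the script reports the increase to 6278 entities at 11:10 as well as the earlier increase from 53 to 1240 at 10:09, from which the poller recovered.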

Example 3 – loss of SNMP access

In this next example we have an SNMP based poll that was running fine and then dropped in health by a small amount.

A review of the metric file shows no change in the number of policies enabled or in the total target count being handled. Looking at the event console and the poller trace we can see a flood of SNMPTIMEOUT alerts:

Description='SNMP poll failure (SNMPTIMEOUT) for poll aFastPoll/ifOutErrors and target 172.31.23.52'

When the poller gets an SNMPTIMEOUT it will try to retest each credential that is scoped for the target. This added testing, as well as the SNMP timeouts, results in the poll taking an extended amount of time. If a large number of targets experience this issue it can result in a poll falling behind. In this example, the trace file shows a poll failure for each entity in scope, so at this point the user needs to check if someone updated the SNMP credentials incorrectly.
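To judge how widespread the timeouts are, it can help to count the alerts per poll and list the affected targets. The sketch below assumes only the Description format shown above; the function name and the second target address in the sample are invented for illustration.

```python
import re
from collections import Counter

# Matches the Description text shown above; the target pattern accepts
# IPv4 or IPv6 style characters.
FAILURE_RE = re.compile(
    r"SNMP poll failure \((?P<reason>\w+)\) for poll (?P<poll>\S+) "
    r"and target (?P<target>[0-9A-Fa-f.:]+)"
)


def count_failures(lines, reason="SNMPTIMEOUT"):
    """Return (failures per poll, set of affected targets) for one failure reason."""
    per_poll = Counter()
    targets = set()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match and match.group("reason") == reason:
            per_poll[match.group("poll")] += 1
            targets.add(match.group("target"))
    return per_poll, targets


# First line is from the example above; the second target is made up.
SAMPLE = [
    "Description='SNMP poll failure (SNMPTIMEOUT) for poll aFastPoll/ifOutErrors and target 172.31.23.52'",
    "Description='SNMP poll failure (SNMPTIMEOUT) for poll aFastPoll/ifOutErrors and target 172.31.23.53'",
]

if __name__ == "__main__":
    per_poll, targets = count_failures(SAMPLE)
    for poll, count in per_poll.items():
        print(f"{poll}: {count} timeouts across {len(targets)} targets")
```

If the set of affected targets matches the full scope of the policy, a credential change is a more likely cause than an individual device fault.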

The next two poller health questions are frequently related.

3) Is the poller's memory stable?

and

4) Is the poller successfully storing data?

During normal operation of a poller, the amount of system memory being used can fluctuate, as policies are enabled or discovery scope increases. On many systems the amount of memory an individual program can consume is limited and when a program reaches that limit it results in a failure. To help diagnose when this does happen, the poller records its memory usage in the metrics file.

If a poller experiences an out of memory condition the corresponding metric chart can show if there was a growth issue.

Example 1 – poller fails on startup

When a poller first starts up, the expected behavior is for the memory to climb as the monitors for each target in scope are started. Once all of the monitors have been started, the memory should level off and remain at constant level until more monitors are started either from new targets added to scope or more policies enabled.

In this chart we see an example of runaway memory growth by a poller:

Admittedly, this is an extreme case, but it illustrates the concept. The memory usage climbs until it reaches the system limit, then drops at the point where the poller fails because it is unable to allocate more memory. The poller is restarted by ncp_ctrl and the pattern repeats until the poller reaches its limit on restarts. This pattern indicates that the load is such that the poller cannot start all of the needed monitors before running out of memory. At this point the user needs to review the polling load and consider either reducing the load or creating additional pollers.

Example 2 – failure to store data to database

In this next example we see a poller with steady memory followed by growth that continues until the limit is reached.

At this point we want to take a look at some of the other charts, such as Policy Health. If the memory growth were the result of too much load then the policies would show signs of being unhealthy. For this example we see two policies enabled:

The health charts are showing the policies are doing fine, so the next chart we want to look at is the Poll Data Queue. The poller keeps a queue of the poll data waiting to be written out to the database, and as poll data is collected it is added to this queue. If the poller is unable to write the data to the database the queue can grow and, in extreme cases, it can grow to the point that the poller runs out of memory.

From this chart we can easily see that our data queue is growing, indicating a problem writing the data to the NCPOLLDATA database. At this point the user needs to determine the cause of this growth. First, check whether there is a database connection issue. If there is a connection problem, the poller writes a message to the log file and sends a database connection alert. If no connection issues are being reported, the user needs to review the amount of data being collected and determine whether the collection rate exceeds the rate at which data can be stored to the database. This rate limit depends upon the database. If the user suspects the collection rate, they can review what the policy is storing using the Policy Details Tivoli Common Reporting report, which provides a graphic on Insert Rate Estimate for the policy:

If the insert rate is too high, the user needs to review the reasons for the policy collections they have enabled and determine what action to take, such as increasing the interval, reducing the items collected, or adding a new poller instance.

Gracefully handle the Poll Data Queue growth

Regardless of the reason for the Poll Data Queue growth, you can set a cap on the queue. This prevents a poller from exhausting memory during prolonged database outages or excessive data collection. A configuration option in the poller's configuration file, NcPollerSchema.cfg, directs the poller to dump the queue to avoid excessive growth:

update config.properties set PollDataQueueLimit = 5000;

When the queue exceeds this number of data points waiting to be inserted into the database, the poller writes the data to a flat file instead:

ncp_poller.SnmpPoller.<domain>.data

You can use this file to import the data into the database later if the data is still desired.
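The idea of capping an in-memory queue and spilling to a file can be illustrated with a small sketch. This is not the poller's implementation: the class name, the JSON-lines file format, and the choice to spill only the overflowing data points (rather than dumping the whole queue) are assumptions made for illustration.

```python
import json
from collections import deque


class SpillingQueue:
    """Hold data points in memory up to `limit`; beyond that, append them to a file."""

    def __init__(self, limit, spill_path):
        self.limit = limit            # plays the role of PollDataQueueLimit
        self.spill_path = spill_path  # hypothetical flat-file location
        self.pending = deque()        # data points waiting for the database
        self.spilled = 0              # number of data points written to the file

    def put(self, data_point):
        """Queue a data point, or write it to the flat file when the cap is reached."""
        if len(self.pending) >= self.limit:
            with open(self.spill_path, "a", encoding="utf-8") as handle:
                handle.write(json.dumps(data_point) + "\n")
            self.spilled += 1
        else:
            self.pending.append(data_point)

    def take(self):
        """Remove and return the oldest queued data point (for the database writer)."""
        return self.pending.popleft()
```

The point of the cap is that memory use stays bounded by the limit no matter how long the database is unavailable; the cost is that spilled data must be imported separately if it is still wanted.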

5) Do I need to add a new poller?

Network Manager has the ability to run multiple pollers within a domain. This allows for the polling to scale as needed. When to add additional pollers is not always obvious. There are three metrics that can give an indication to the user that a new poller is needed: Batch Queue, Memory, and Policy Health.

If there are multiple policies enabled on a poller, you can compare the health of each. In this example we have some long running policies that have been perfectly healthy in the past. Just after the 8:40 timestamp we see that the SnmpBandWidth poll is starting to fall behind:

Looking at the other Policy Health charts we see more that are not healthy:

By reviewing these charts the user can easily see that some of the policies do not extend as far back, indicating that they were enabled during the time period in which the policies started to fall behind. These added policies resulted in the drop in policy health. If the new policies are important to keep, then a new poller is needed to handle the additional load.

Chapter 8: Am I pinging all the IP addresses I want?

After setting up your ICMP polls, verify that the poller will ping all the devices you are responsible for. Some devices might not have been discovered, some addresses might be outside the discovery scope on a discovered device or outside the scope of the policies you set up, and some might be unmanaged for some reason.

Ideally, you start with an independent list of IP addresses that you are responsible for and maintain separately. Run the check against that list. Failing that, you could extract a list of access IP addresses from the NCIM database if you are confident all the devices you are responsible for are actually discovered.

List of management IP addresses for all discovered devices:

select accessIPAddress from chassis c
inner join domainMembers dm on dm.entityId = c.entityId
inner join domainMgr d on dm.domainMgrId = d.domainMgrId
where d.domainName = 'domainname';

List of management IP addresses for all interfaces (including the device management address):

select ip.address from ipEndPoint ip
inner join domainMembers dm on dm.entityId = ip.entityId
inner join domainMgr d on dm.domainMgrId = d.domainMgrId
where d.domainName = 'domainname';
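For a quick look at which addresses on an independent list are absent from the results of either query, a plain set difference is enough. The sketch below is only an illustration of that comparison and does not replace the report described in the next section; the function names are made up, and it assumes each list has already been read into Python as strings, one address per entry.

```python
import ipaddress


def normalize(addresses):
    """Convert address strings to ip_address objects, skipping blank entries."""
    result = set()
    for raw in addresses:
        text = raw.strip()
        if text:
            result.add(ipaddress.ip_address(text))
    return result


def missing_addresses(expected, discovered):
    """Return the expected addresses that do not appear in the discovered list."""
    missing = normalize(expected) - normalize(discovered)
    # Sort by IP version first so IPv4 and IPv6 addresses are never compared directly.
    ordered = sorted(missing, key=lambda addr: (addr.version, int(addr)))
    return [str(addr) for addr in ordered]


if __name__ == "__main__":
    expected = ["172.30.23.4", "172.30.5.1", "172.31.23.112"]
    discovered = ["172.30.23.4"]
    for address in missing_addresses(expected, discovered):
        print(address)
```

Parsing the strings with the ipaddress module avoids false mismatches caused by stray whitespace or differing textual forms of the same address.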

Generate the report

Step 1

Start by registering the list from a file that contains the IP addresses, one per line. The script loads them into a table, deleting all previous entries. You can do this to check one IP address or thousands.

cd $NCHOME/precision/scripts/perl/scripts

ncp_perl ncp_upload_expected_ips.pl -domain domainname -file filename

This step only needs to be run when the list changes. The script will first remove all existing entries, so each execution replaces the table with the new list.

Step 2

Run this command each time you want to generate a new snapshot to correlate the IP addresses with the poller's list of IP addresses:

ncp_perl ncp_ping_poller_snapshot.pl -domain domainname

Step 3

Run the report using the following command:

ncp_perl ncp_polling_exceptions.pl -domain domainname
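If you run the report regularly, the three steps can be wrapped in a small script. The following sketch uses only the script names and options shown in the steps above. It assumes that ncp_perl is on the PATH, that all three scripts live in the directory given in Step 1, and that a failed script returns a non-zero exit code; the function names are made up for illustration.

```python
import os
import subprocess

# Directory from Step 1, relative to $NCHOME.
SCRIPT_DIR = os.path.join("precision", "scripts", "perl", "scripts")


def build_commands(domain, ip_file, refresh_list=True):
    """Return the argument lists for the report steps, in order.

    Step 1 is only needed when the list of expected IP addresses has changed.
    """
    commands = []
    if refresh_list:
        commands.append(["ncp_perl", "ncp_upload_expected_ips.pl",
                         "-domain", domain, "-file", ip_file])
    commands.append(["ncp_perl", "ncp_ping_poller_snapshot.pl", "-domain", domain])
    commands.append(["ncp_perl", "ncp_polling_exceptions.pl", "-domain", domain])
    return commands


def run_report(domain, ip_file, refresh_list=True):
    """Run the steps from the script directory, stopping at the first failure."""
    script_dir = os.path.join(os.environ["NCHOME"], SCRIPT_DIR)
    for argv in build_commands(domain, ip_file, refresh_list):
        subprocess.run(argv, cwd=script_dir, check=True)
```

Calling run_report with refresh_list=False skips the upload and only takes a new snapshot before generating the report.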

The report

The report checks the following categories. Each entry explains why the IP addresses listed under that category are not being polled.

Undiscovered: Check the scope and seed lists of the discovery configuration.

Out of scope: These IP addresses are missing from the policy scope for the Default Ping polls.

Unmanaged, status = 1: Devices or interfaces that have been unmanaged from the GUI or by using the UnmanagedNode.pl script are considered to be in maintenance mode and are not polled. They have a Status of 1.

Unmanaged, status = 2: These are unmanaged during discovery, usually in the TagManagedEntities.stch stitcher. Check the filter in this stitcher if it is unmanaging interfaces that it should not.

Secondary or Unpingable interfaces: Discovery selects the management address for each interface with multiple IP addresses, and only those addresses are pinged. Network Manager does not ping the secondary IP addresses.

IBM Tivoli Network Manager Monitoring Status

============================================

UNDISCOVERED

============

List of IP addresses from the reference list that are not in the management database.

+----------------------+

| IP Address |

+----------------------+

| |

+----------------------+

OUT OF SCOPE

============

+---------------+---------------------------------------+------------------+

| IP Address | Hostname | AOC Class |

+---------------+---------------------------------------+------------------+

| 172.31.23.4 | 172.30.23.4 | JuniperMSeries |

| 172.30.5.1 | 172.31.23.112 | 3ComSuperStack |

| 172.30.23.4 | 172.30.23.4 | JuniperMSeries |

+---------------+---------------------------------------+------------------+

UNMANAGED

=========

List of IP addresses from the reference list that are not being monitored because they were unmanaged in the GUI (status = 1)

+------------+----------+-----------+---------------+---------------+

| IP Address | Hostname | AOC Class | Entity Status | Device Status |

+------------+----------+-----------+---------------+---------------+

| | | | | |

+------------+----------+-----------+---------------+---------------+

List of IP addresses unmanaged from Discovery (status = 2)

Check ifDescr in TagManagedEntities.stch for the following interfaces:

+------------+----------+---------+---------------+

| IP Address | Hostname | ifDescr | Entity Status |

+------------+----------+---------+---------------+

| | | | |

+------------+----------+---------+---------------+

SECONDARY or UNPINGABLE

=======================

List of IP addresses not polled as they are considered secondary addresses

+------------+----------------+---------+-------------+

| IP Address | Hostname | ifIndex | Primary IP |

+------------+----------------+---------+-------------+

| 10.0.0.1 | 172.30.23.4 | 14 | 128.0.0.4 |

| 10.0.0.4 | 172.30.23.4 | 14 | 128.0.0.4 |

| 127.0.0.1 | 172.31.23.26 | 16 | 172.25.0.81 |

+------------+----------------+---------+-------------+

NOT POLLED IN LAST 15 MINS

==========================

List of IP addresses that have not been polled during the last 15 minutes:

+------------+----------+-----------+

| IP Address | Hostname | AOC Class |

+------------+----------+-----------+

| | | |

+------------+----------+-----------+

FALLING BEHIND

==============

List of IP addresses in policies that are falling behind by more than twice the polling interval

+------------+----------+--------+---------------+---------------+----------------------+

| IP Address | Hostname | Policy | Poll Interval | Last Poll Int | Time since Last Poll |

+------------+----------+--------+---------------+---------------+----------------------+

| | | | | | |

+------------+----------+--------+---------------+---------------+----------------------+

NO SNMP ADDRESS

===============

These devices may have other IP addresses that were not discovered, but only the management address shown here will be polled (unless unmanaged above).

+------------+----------+--------------+

| IP Address | Hostname | Node Managed |

+------------+----------+--------------+

| | | |

+------------+----------+--------------+

For full details, see http://www.ibm.com/support/knowledgecenter/SSSHRK_4.1.1/com.ibm.networkmanagerip.doc_4.1.1/itnm/ip/wip/poll/task/nmip_poll_troubleshootingnwpolling.html

Chapter 9: Poller Configuration

This chapter covers the options for configuring individual pollers and best practice settings.

By default there is one configuration file per domain for the pollers. However, you can create specific configuration files for individual pollers if necessary by following the naming convention:

NcPollerSchema.<pollername>.<domain>.cfg

Each poller will try to use the most specific configuration file, and if one is not found, will fall back to the domain-specific file. If that is not found, it defaults to NcPollerSchema.cfg.
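The fallback order can be pictured as a simple search through candidate file names. The sketch below illustrates the order described above and is not the poller's actual lookup code. The name used for the domain-specific file, NcPollerSchema.<domain>.cfg, is an assumption based on the naming convention, and the function names and example poller name are made up.

```python
import os


def candidate_config_files(poller_name, domain):
    """Return configuration file names from most specific to least specific."""
    return [
        f"NcPollerSchema.{poller_name}.{domain}.cfg",  # poller-specific
        f"NcPollerSchema.{domain}.cfg",                # domain-specific (assumed name)
        "NcPollerSchema.cfg",                          # default
    ]


def resolve_config(directory, poller_name, domain):
    """Return the path of the first candidate that exists, or None if none exist."""
    for name in candidate_config_files(poller_name, domain):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            return path
    return None


if __name__ == "__main__":
    for name in candidate_config_files("SnmpPoller", "DOGFISH"):
        print(name)
```

Because the most specific file wins, a setting changed only in the default file has no effect on a poller that has its own configuration file.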

The fields in the config.properties table can be adjusted for your environment on a per-poller basis. The following list describes each parameter to help you decide if changing it will provide benefits.

Change these with caution and keep a backup copy of the file. Unless guided by IBM Support there is no need to change the fields in the second section.

Poller settings

MAXPOLLDATAROWS (Default: 5000000)

Unlike all the other fields, this one is in the config.pruning table in the NcPollerSchema.cfg file. The poller deletes the oldest entries in the NCIM NCPOLLDATA.POLLDATA table to maintain this many rows.

PruneSleepInterval (Default: 3600)

This field was introduced in Network Manager 4.1.1 and post 3.9 FP4. It specifies the interval in minutes to prune the NCPOLLDATA.POLLDATA table. Shorter intervals mean deleting fewer rows at a time with less stress on the database.

AggregationLimit (Default: 30)

This setting determines how the Poller breaks apart multi-request polls when the PDU cannot handle either the request or the results, otherwise known as tooBig poll failures. The Poller breaks these requests into multiple PDUs down to the aggregation limit. When the limit is set to the default of 30, the Poller creates PDUs with no fewer than 30 individual requests. For large poll data, this can easily overload the results PDU even after the requests are broken apart.

If you see trace messages similar to the following, try lowering the limit to 10.

ncp_poller.SnmpPoller.DOGFISH.trace:2014-07-01 17:12:27 INFO batchitem.cc(1943): Handling tooBig response (jobid = 6, name = 41/aPolicy/aIfIndexWithAlert, addr = 172.30.135.1, reqid = 575527012, vblsize = 106, numpkts = 4, pktsize = 30)
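One plausible reading of the trace is that a list of 106 varbinds (vblsize) was split into 4 packets (numpkts) of up to 30 varbinds each (pktsize), which matches the default limit. The sketch below shows only that arithmetic. It is an interpretation of the trace values, not the Poller's actual splitting algorithm, and the function name is made up.

```python
def split_varbinds(varbinds, aggregation_limit):
    """Split a varbind list into consecutive groups of at most `aggregation_limit`."""
    if aggregation_limit < 1:
        raise ValueError("aggregation_limit must be at least 1")
    return [
        varbinds[start:start + aggregation_limit]
        for start in range(0, len(varbinds), aggregation_limit)
    ]


if __name__ == "__main__":
    varbinds = list(range(106))  # stand-ins for the 106 varbinds in the trace
    for limit in (30, 10):
        groups = split_varbinds(varbinds, limit)
        print(f"limit {limit}: {len(groups)} packets")
```

With a limit of 30, each response still has to carry the results for up to 30 requests, so a lower limit trades more packets for smaller responses.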

UseGetBulk (Default: 0)

By default, GetBulk is not used. Set this to 1 to use GetBulk requests in place of GetNext when SNMPv2/v3 is used. Consider changing this to 1 to take advantage of the efficiency realized by GetBulk.

UseFirstPollForInitialState (Default: 0)

For SNMP Link State polling, this field specifies how the Poller determines the initial state in the absence of an existing event. It can either use the first poll or assume a clear state.

The logic changed in Network Manager 3.9 Fix Pack 3 to ensure that the poller triggers events for ports that changed to down while the poller was not running. If you have many empty ports whose ifAdminStatus is not set to Down, a large number of events are created on poller startup. In this case, change the value of UseFirstPollForInitialState to 1 to tell the poller to use the first poll as the reference for future changes.
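The difference between the two settings can be shown with a small simulation. This is a simplified sketch of the behavior described above, not the poller's code: it treats the "clear" state as "up", ignores any existing events, and uses a made-up function name.

```python
def link_events(polls, use_first_poll_for_initial_state):
    """Return (poll index, new state) for each state change that would raise an event.

    polls is a sequence of "up" / "down" results for one port, oldest first.
    """
    events = []
    # 0: assume the port was clear ("up") before the first poll.
    # 1: take the first poll result as the reference and raise nothing for it.
    reference = None if use_first_poll_for_initial_state else "up"
    for index, state in enumerate(polls):
        if reference is None:
            reference = state
            continue
        if state != reference:
            events.append((index, state))
            reference = state
    return events


if __name__ == "__main__":
    empty_port = ["down", "down", "down"]
    print(link_events(empty_port, use_first_poll_for_initial_state=False))
    print(link_events(empty_port, use_first_poll_for_initial_state=True))
```

With the default setting, a port that is down at startup raises an event on the first poll; with the setting at 1, the same port stays quiet until its state changes.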

UpdateNetworkViewCache (Default: 1)

Set this field to 0 if you want to disable poller updating of network views.

If you do not use network views for scoping the policies, you can disable this. Otherwise, leave it set to ensure that the view memberships are kept up to date automatically as the topology changes.

BatchQueueThreshold (Default: 10)

Set the BatchQueueThreshold to have the poller issue an alert if the polling load results in batches getting queued up.

This proactively alerts you to possible overloading problems. Use the metrics described in Chapter 7: Are the pollers healthy? to judge if you need to adjust the threshold for the alert.

PollDataQueueLimit (Default: 5000)

When PollDataQueueLimit is set the poller will issue an alert when the queue exceeds the limit and dump the queue to a file. This queue holds the polling data until it can be written to the database.

This proactively alerts you to possible problems inserting data to the database. Use the metrics described in Chapter 7: Are the pollers healthy? to judge if you need to adjust the threshold for the alert.

PolicyUpdateInterval (Default: 30)

ncp_poller will scan the NCMONITOR policy table every 30 seconds for changes to polling configuration.

ManagedStatusUpdateInterval (Default: 30)

ncp_poller will scan the NCIM managedStatus table every 30 seconds. This can be decreased if a quicker response to configuration changes is required.

Balance this against increased overhead costs for each synchronization.

LogAccessCredentials (Default: 0)

By default, access credentials do not appear in log files. Set this to 1 to make them appear in plain text, which may be useful for debugging problems with device access.

CollectPollerMetrics (Default: 1)

These are the metrics described in Chapter 7: Are the pollers healthy? that the poller records to help you monitor the health of the pollers. They are collected by default. Setting to 0 will stop the collection.

For IBM Support use

DefaultGetBulkMaxReps (Default: 20)

The number assigned to the max-repetitions field in GetBulk requests issued by ITNM processes. This is the value used when the request contains a single varbind. If multiple varbinds are included, the value is adjusted accordingly (divide by the number of varbinds), so that responses always contain a similar number of varbinds.

Only change under guidance from IBM Support.
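As a worked example of the adjustment, a request with 4 varbinds and the default of 20 would use 20 / 4 = 5 repetitions, so the response again carries about 20 varbinds. The sketch below shows that arithmetic; the use of integer division, the lower bound of 1, and the function name are assumptions for illustration, since the exact rounding is not stated.

```python
def adjusted_max_repetitions(default_max_reps, varbind_count):
    """Divide the configured max-repetitions by the number of varbinds in the request.

    Integer division and the floor of 1 are assumptions, not documented behavior.
    """
    if varbind_count < 1:
        raise ValueError("a GetBulk request needs at least one varbind")
    return max(1, default_max_reps // varbind_count)


if __name__ == "__main__":
    for count in (1, 2, 4):
        reps = adjusted_max_repetitions(20, count)
        print(f"{count} varbinds: max-repetitions {reps}, about {reps * count} per response")
```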

CheckPollDataValueRange (Default: 1)

By default the poller checks that values to be inserted into the pollData.value field are valid 32-bit signed integers. This check is made to avoid issues when attempting to insert data into the Data Warehouse, if the Data Warehouse is being used. Setting this value to 0 bypasses the range check. Take care when doing this - the output of relevant configured polls should be understood, and the ncpolldata schema updated if appropriate.

Only change under guidance from IBM Support.
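For reference, a 32-bit signed integer ranges from -2147483648 to 2147483647. The sketch below shows a range check of that kind. It illustrates the limits only and is not the poller's implementation; the function name is made up.

```python
INT32_MIN = -2**31      # -2147483648
INT32_MAX = 2**31 - 1   # 2147483647


def fits_int32(value):
    """Return True if value is an integer within the 32-bit signed range."""
    return isinstance(value, int) and INT32_MIN <= value <= INT32_MAX


if __name__ == "__main__":
    for value in (2147483647, 2147483648, 2**40):
        print(value, fits_int32(value))
```

Values from 64-bit counters can exceed this range, which is why the output of the configured polls should be understood before the check is bypassed.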

DiscoverInitialAccess (Default: 0)

The poller can test SNMP credentials at startup if keys have changed. Set to 0 to not test, set to 1 to test.

The poller will always cycle through the available community strings if SNMP access fails on a poll, so there is little benefit from incurring the additional overhead on startup. Leave the default unless guided by IBM Support.

BatchExtraThreads (Default: 150)

Only change under guidance from IBM Support

PollerProfiling (Default: 0)

This field enables the collection of poller ICMP and SNMP statistics used by ITM. The collection of these metrics may slow down the poller's ability to process SNMP and Ping responses in large networks.

Unless guided by IBM Support for troubleshooting purposes, leave this disabled.

Notices

This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing

IBM Corporation

North Castle Drive

Armonk, NY 10504-1785

U.S.A.

For license inquiries regarding double-byte character set (DBCS) information, contact the IBM Intellectual Property Department in your country or send inquiries, in writing, to:

Intellectual Property Licensing

Legal and Intellectual Property Law

IBM Japan Ltd.

1623-14, Shimotsuruma, Yamato-shi

Kanagawa 242-8502 Japan

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact:

IBM Corporation

958/NH04

IBM Centre, St Leonards

601 Pacific Hwy

St Leonards, NSW, 2069

Australia

IBM Corporation

896471/H128B

76 Upper Ground

London SE1 9PZ

United Kingdom

IBM Corporation

JBFA/SOM1

294 Route 100

Somers, NY, 10589-0100

United States of America

Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee.

The licensed program described in this document and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any equivalent agreement between us.

Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only.

All IBM prices shown are IBM's suggested retail prices, are current and are subject to change without notice. Dealer prices may vary.

This information is for planning purposes only. The information herein is subject to change before the products described become available.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are provided "AS IS", without warranty of any kind. IBM shall not be liable for any damages arising out of your use of the sample programs.

Each copy or any portion of these sample programs or any derivative work, must include a copyright notice as follows:

© (your company name) (year). Portions of this code are derived from IBM Corp. Sample Programs. © Copyright IBM Corp. _enter the year or years_.

If you are viewing this information softcopy, the photographs and color illustrations may not appear.

Trademarks

IBM, the IBM logo, ibm.com, Netcool, Netcool/OMNIbus, and Tivoli are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.

Adobe, Acrobat, and Portable Document Format (PDF) are trademarks or registered trademarks of Adobe Systems Incorporated in the United States, other countries, or both.

Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both.
