Network and System Monitoring Primers
Recently I had a discussion with a colleague about the monitoring
of our systems and network devices. I showed him everything we measure,
and he wondered whether it was overkill. I told him that some things
maybe were, but that good monitoring is the first step to knowing
what happens in your network, and that knowing what happens in your
network is the first step to being able to isolate problems when they
arise.
So the question is: "What do you need to monitor?" The answer
is easy: "Everything". That is rather a lot. Let reality
kick in and rephrase it to: "Everything you can think of".
That is too vague and misses the final goal: "Everything you
assume to be the normal conditions for a system or service to run
properly".
That is still quite a lot, and it is different
for each service you provide or device you have on the network. It
will force you to dig into your network and servers to
find out what is going on.
This also means that you need to find out how your network and
services behave. In the beginning you will get a number of alerts
which are false positives, and you will need to tune your alert
settings. Or you will get the same alerts every day at the same
time because they are normal system behaviour. You can choose between
getting these messages every day and changing the alert settings so
that you won't get them anymore. History shows that the latter option
isn't the smartest one, because it will hide possible issues from
you.
A lot of the items described in this primer might be considered
too nitpicky, which may or may not be an issue for you at
this moment. Keep in mind that you should monitor for normal
operations! Everything that happens in your network and systems
which isn't normal is worth investigating!
This primer only describes active and realtime monitoring.
Passive monitoring (via SNMP traps, syslog messages or monitoring
agents) and historical monitoring (history graphs with applications
like Cacti) are not described.
Software
This primer tries to be generic, but is based on my experience with
Nagios. At the end there will
be a link to the scripts I use for non-standard Nagios features.
Nagios is described as "a host, service and network monitoring
program". If you are a beginner, you will find its configuration
files horrible. But once you get through that, it is easy to expand.
Nagios does all its checking by executing scripts in its
libexec/ directory. That doesn't mean it is limited
to checking network services which are running on other hosts.
For checks that have to run on the remote host itself there is a
program called NRPE, which stands for Nagios
Remote Plugin Executor. It runs as a daemon on the
remote hosts and executes the same Nagios scripts, installed locally.
(FreeBSD: net-mgmt/nagios and net-mgmt/nrpe2)
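Because Nagios only looks at the line a script prints and the exit code it returns, a check can be written in any language. The sketch below is a minimal plugin core in Python, assuming the conventional exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The names `evaluate` and `report` and the thresholds are made up for illustration; they are not part of Nagios.

```python
# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measurement onto a Nagios state; higher values are worse."""
    if value is None:
        return UNKNOWN  # the measurement itself failed
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def report(name, value, warn, crit):
    """Return (exit_code, status_line) for a single measurement."""
    state = evaluate(value, warn, crit)
    return state, "%s %s - value=%s" % (name, LABELS[state], value)
```

A real plugin would gather the value, print the status line and pass the exit code to `sys.exit()`.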
Systems monitoring
There are several components which need to be monitored on a system:
the Hardware (disks, CPU), the Operating System and the Services.
Hardware
- These days you can get a lot of information about components
of your motherboards: CPU temperature, internal temperature,
fan speeds and power voltages. Higher temperatures are bad for
your motherboard. Fan speeds which are suddenly much higher,
or lower, indicate that one of them is broken and might cause
higher temperatures. And power voltage changes indicate problems
with your power supply.
On Linux, this information can be gathered from /proc/acpi.
On FreeBSD this can be gathered via sysutils/healthd.
- IDE harddisk characteristics can be monitored via the SMART interface
(Self-Monitoring, Analysis and Reporting Technology), for example
the temperature of the disks and a handful of counters:
Reallocated Sector Count, Seek Error Rate, Spin Retry Count,
Calibration Retry Count, Reallocated Event Count, Current Pending
Sector and UDMA CRC Error Count. If these counters go up, there
might be a problem with your harddisk.
On Linux and FreeBSD this data can be gathered via the smartmontools
software (FreeBSD: sysutils/smartmontools).
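To watch whether the SMART counters go up, the check has to remember the previous values and compare. The following is a rough sketch that parses text in the attribute-table layout of `smartctl -A` (ten whitespace separated columns with the raw value last). The attribute names in `WATCHED` are how I assume smartctl spells the counters listed above; verify them against the output of your own disks.

```python
# Assumed smartctl spellings of the counters named in the text.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Seek_Error_Rate",
    "Spin_Retry_Count",
    "Calibration_Retry_Count",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "UDMA_CRC_Error_Count",
)


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from an attribute table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            attrs[fields[1]] = int(fields[9])
        except ValueError:
            continue  # raw value is not a plain integer
    return attrs


def increased(previous, current, watched=WATCHED):
    """Names of watched counters whose raw value went up since last time."""
    return sorted(
        name for name in watched
        if name in previous and name in current
        and current[name] > previous[name]
    )
```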
- RAID hardware is beautiful: a broken harddisk won't wake you
up in the middle of the night anymore (but two broken harddisks
will, so it had better be monitored). There are various ways to check
RAID arrays, and every vendor seems to have its own software. The
following works on FreeBSD:
- camcontrol for HP/Compaq RAID cards:
[~] root@freebsd>camcontrol inquiry da3
pass3: <COMPAQ RAID 1 VOLUME OK> Fixed Direct Access SCSI-0 device
- TW_CLI for 3WARE RAID cards: (sysutils/tw_cli)
[~] root@freebsd>/usr/local/bin/tw_cli info c0 unitstatus
# of units: 1
Unit 0: RAID 5 1.63 TB ( 3516478848 blocks): OK
- aaccli for Adaptec AAC Controllers (sysutils/aaccli)
[~] root@freebsd>aaccli 'open aac0 : disk list : container list'
--------------------------------------------------------------------------------
Adaptec SCSI RAID Controller Command Line Interface
Copyright 1998-2002 Adaptec, Inc. All rights reserved
--------------------------------------------------------------------------------
Executing: open "aac0"
Executing: disk list
C:ID:L Device Type Blocks Bytes/Block Usage Shared Rate
------ -------------- --------- ----------- ---------------- ------ ----
0:00:0 Disk 390721968 512 Initialized NO 100
0:03:0 Disk 390721968 512 Initialized NO 100
Executing: container list
Num Total Oth Stripe Scsi Partition
Label Type Size Ctr Size Usage C:ID:L Offset:Size
----- ------ ------ --- ------ ------- ------ -------------
0 RAID-5 745GB 64KB Open 0:00:0 64.0KB: 186GB
/dev/aacd0 raid5 0:03:0 64.0KB: 186GB
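As an illustration of turning such vendor output into a check, here is a small sketch that reads the unitstatus lines printed by tw_cli in the layout shown above and reports every unit whose status is not OK. It assumes the status is the last word after the final colon; other tw_cli versions may format this differently.

```python
import re

# "Unit 0: RAID 5 1.63 TB ( 3516478848 blocks): OK"
UNIT_RE = re.compile(r"^\s*Unit\s+(\d+):.*:\s*(\S+)\s*$")


def parse_unit_status(text):
    """Return {unit_number: status} for every 'Unit N: ...' line."""
    units = {}
    for line in text.splitlines():
        match = UNIT_RE.match(line)
        if match:
            units[int(match.group(1))] = match.group(2)
    return units


def degraded_units(units):
    """Unit numbers whose status is anything other than OK."""
    return sorted(unit for unit, status in units.items() if status != "OK")


def raid_ok(text):
    """True only if at least one unit was found and all units are OK."""
    units = parse_unit_status(text)
    return bool(units) and not degraded_units(units)
```

Treating "no units found" as a failure is deliberate: an empty answer usually means the controller or the tool is not responding.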
- Network interface monitoring consists of two items. The first one
is the number of packet errors, which can be gathered from the
output of netstat -ni:
[~] root@linux>netstat -ni
Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth1 1500 0 641851668 0 0 0 646287092 0 0 0 BMRU
eth2 1500 0 711410096 0 0 0 701868617 0 0 0 BMRU
lo 16436 0 6611086 0 0 0 6611086 0 0 0 LRU
[~] root@freebsd>netstat -ni sk0
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
sk0 1500 <Link#1> 00:0f:ea:2c:d5:18 3706970 0 3316491 0 0
sk0 1500 fe80:1::20f:e fe80:1::20f:eaff: 0 - 2 - -
sk0 1500 10.251.1.16/2 10.251.1.18 2923134 - 2536594 - -
The second one is the media status: how is the device talking
to the switch. On Linux this can be found with the output of
mii-tool, on FreeBSD this can be found on the
media line in the output of ifconfig:
[~] root@linux>mii-tool eth1
eth1: negotiated 100baseTx-FD flow-control, link ok
[~] root@freebsd>ifconfig sk0
sk0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=8<VLAN_MTU>
inet6 fe80::20f:eaff:fe2c:d518%sk0 prefixlen 64 scopeid 0x1
inet 10.251.1.18 netmask 0xfffffff0 broadcast 10.251.1.31
ether 00:0f:ea:2c:d5:18
media: Ethernet autoselect (100baseTX <full-duplex,flag0,flag1>)
status: active
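Both items can be checked from the text output. The sketch below handles only the Linux layouts shown above (the `netstat -ni` table and the one-line `mii-tool` answer). Because the error counters are cumulative, it compares them with the previous sample instead of alerting on any non-zero value. The function names are made up for this example.

```python
ERROR_COLUMNS = ("RX-ERR", "RX-DRP", "RX-OVR", "TX-ERR", "TX-DRP", "TX-OVR")


def parse_netstat_linux(text):
    """Return {interface: {error_column: count}} from Linux 'netstat -ni'."""
    header = None
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Iface":
            header = fields
            continue
        if header and len(fields) == len(header):
            row = dict(zip(header, fields))
            result[row["Iface"]] = {c: int(row[c]) for c in ERROR_COLUMNS}
    return result


def new_errors(previous, current):
    """List (interface, column, increase) for counters that went up."""
    found = []
    for iface, counters in sorted(current.items()):
        before = previous.get(iface, {})
        for name in ERROR_COLUMNS:
            delta = counters[name] - before.get(name, 0)
            if delta > 0:
                found.append((iface, name, delta))
    return found


def media_ok(mii_tool_line, expected_media):
    """True if the link is up and negotiated the expected media type."""
    return "link ok" in mii_tool_line and expected_media in mii_tool_line
```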
- If you have a UPS, see if you can get its status. Information
from APC UPSes can be gathered via apcupsd and
apcaccess:
STATUS : ONLINE
Do not only check whether ONLINE is present; check that ONLINE
is the only string, because
STATUS : ONLINE REPLACEBATT
is valid output too! (FreeBSD: sysutils/apcupsd)
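The difference between "contains ONLINE" and "is exactly ONLINE" is easy to get wrong with a simple grep. A minimal sketch, assuming the "KEY : VALUE" line layout of the apcaccess output shown above:

```python
def ups_status_flags(apcaccess_output):
    """Return the words on the STATUS line as a list, or None if absent."""
    for line in apcaccess_output.splitlines():
        key, separator, value = line.partition(":")
        if separator and key.strip() == "STATUS":
            return value.split()
    return None


def ups_is_healthy(apcaccess_output):
    """True only when ONLINE is the one and only status flag."""
    return ups_status_flags(apcaccess_output) == ["ONLINE"]
```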
The Operating System
- Disk space information per partition can be obtained from the
output of df, which gives you the free disk
space. Another important piece of information is the number of
inodes you have available (df -i), because if you don't
have any inodes free, you can't create any more files.
[~] root@freebsd>df -i /
Filesystem 1K-blocks Used Avail Capacity iused ifree %iused Mounted on
/dev/da0s1a 128990 86072 32600 73% 2952 13302 18% /
[~] root@linux>df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/VolGroup00-LogVol00
35291136 559063 34732073 2% /
/dev/cciss/c0d0p1 26104 36 26068 1% /boot
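If the check is written in Python, the same numbers are available without parsing df at all. This sketch swaps in `os.statvfs()`, which is a different technique from the df commands above but reports both block and inode usage for the filesystem holding a given path.

```python
import os


def percent_used(total, free):
    """Percentage in use; 0.0 when the filesystem reports no total."""
    if total == 0:
        return 0.0
    return 100.0 * (total - free) / total


def filesystem_usage(path):
    """Return block and inode usage percentages for the given path."""
    st = os.statvfs(path)
    return {
        # Counts root-reserved blocks as free, so this can read lower
        # than the Capacity column of df.
        "space_used_pct": percent_used(st.f_blocks, st.f_bfree),
        "inodes_used_pct": percent_used(st.f_files, st.f_ffree),
    }
```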
- A freshly installed system should have very few services running,
maybe only crond, inetd, ntpd, sshd and syslogd. They all create
their own PID files, so it is easy to get the process IDs:
[~] root@freebsd>ps wup `head /var/run/ntpd.pid `
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
ntp 3148 0.0 0.1 4044 4044 ? SLs Apr14 0:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
Of course you need to check first if the PID file exists.
If the service is a network based service, then besides checking
that the PID file exists and the process exists, you should also
check that the service works to some extent: for ssh you should
get the SSH banner, for ntpd you can check whether the service is
synced.
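The PID file part of this check can be sketched as follows. It uses signal 0, which tests for the existence of a process without actually signalling it. Note that this only proves that some process has that PID; comparing the command name, as the ps output above allows, is the stricter test.

```python
import os


def pid_from_file(path):
    """Return the PID stored in a PID file, or None if unreadable."""
    try:
        with open(path) as handle:
            return int(handle.readline().strip())
    except (OSError, ValueError):
        return None


def process_exists(pid):
    """True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # it exists, but belongs to another user
    return True


def service_running(pidfile):
    """True if the PID file is readable and its process exists."""
    pid = pid_from_file(pidfile)
    return pid is not None and pid > 0 and process_exists(pid)
```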
- If the server is supposed to transport emails (both mail servers
and application servers), then check if the mail-queue is more
or less empty:
[~] root@postfix>mailq
-Queue ID- --Size-- ----Arrival Time---- -Sender/Recipient-------
A634A5CD036 66814 Mon Apr 16 12:38:57 [email protected]
[..]
-- 343 Kbytes in 3 Requests.
[~] root@postfix>mailq
Mail queue is empty
[~] root@sendmail>mailq
/var/spool/mqueue is empty
Total requests: 0
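To turn "more or less empty" into a threshold, the number of queued messages has to be pulled out of the mailq output. A sketch that recognises only the three summary forms shown above (Postfix with and without queued mail, and Sendmail); anything else is reported as unknown rather than guessed:

```python
import re


def queued_messages(mailq_output):
    """Return the number of queued messages, or None if unrecognised."""
    if "Mail queue is empty" in mailq_output:
        return 0
    # Postfix summary line: "-- 343 Kbytes in 3 Requests."
    match = re.search(r"^-- \d+ Kbytes in (\d+) Requests?\.",
                      mailq_output, re.MULTILINE)
    if match:
        return int(match.group(1))
    # Sendmail summary line: "Total requests: 0"
    match = re.search(r"Total requests:\s*(\d+)", mailq_output)
    if match:
        return int(match.group(1))
    return None
```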
- On machines which act as a router, and especially ones which do
dynamic routing, it is important to make sure that the default
gateway is pointing to the expected interface:
[~] root@freebsd>netstat -rn -f inet | grep ^default
default 202.83.178.153 UG1 1 885506855 fxp2
[~] root@linux>netstat -rn | grep ^0.0.0.0
0.0.0.0 10.252.13.9 0.0.0.0 UG 0 0 0 eth0
If you use dynamic routing in your network, one other thing you
need to do on one or more machines is to check if you have all
your networks in the routing table. Missing one means that you
can't reach that network from that machine!
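The default gateway check can be sketched like this. It assumes the two `netstat -rn` layouts shown above, where the destination is the first column, the gateway the second and the interface the last; other systems or versions may add columns after the interface.

```python
def default_routes(netstat_rn_output):
    """Return a list of (gateway, interface) for every default route."""
    routes = []
    for line in netstat_rn_output.splitlines():
        fields = line.split()
        # FreeBSD prints "default", Linux prints "0.0.0.0".
        if len(fields) >= 3 and fields[0] in ("default", "0.0.0.0"):
            routes.append((fields[1], fields[-1]))
    return routes


def default_route_ok(netstat_rn_output, expected_interface):
    """True if there is exactly one default route on the expected interface."""
    routes = default_routes(netstat_rn_output)
    return len(routes) == 1 and routes[0][1] == expected_interface
```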
- Number of users, total processes and swap: these are easy to measure, and
a change might be an indication that there is something wrong. On
machines which are servers, the number of users logged in
shouldn't be too high: unless work is being done on them, nobody
should be logged in.
Swap should preferably not be in use.
[~] root@freebsd>uptime
4:13PM up 88 days, 4 mins, 4 users, load averages: 0.24, 0.39, 0.32
[~] root@freebsd>ps auxw | wc -l
205
[~] root@freebsd>swapinfo
Device 1K-blocks Used Avail Capacity
/dev/ad10s1b 4168496 480 4168016 0%
[~] root@linux>uptime
16:12:49 up 4 days, 3:52, 2 users, load average: 0.11, 0.11, 0.06
[~] root@linux>ps auxw | wc -l
90
[~] root@linux>cat /proc/swaps
Filename Type Size Used Priority
/dev/mapper/VolGroup00-LogVol01 partition 2031608 1576 -1
- Uptime. Don't rely on the host failing to answer pings to
determine whether the machine has been rebooted. With today's hardware
and background file system checks the machine is back before the
ping-timeout threshold has been reached. If the uptime has been
reset, then something has happened!
Note that if you monitor this via SNMP, the
system.sysUpTime OID returns the time since
snmpd was started, not the time since the machine
was booted. Restarting snmpd will
reset this counter!
SNMPv2-MIB::sysUpTime = Timeticks: (760344619) 88 days, 0:04:06.19
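Detecting a reset means remembering the previous reading and noticing that the new one is smaller. A minimal sketch that parses the Timeticks line shown above; how the previous value is stored between runs is left out. Keep in mind that Timeticks are hundredths of a second in a 32 bit counter, so the value also wraps around by itself after roughly 497 days.

```python
import re

TICKS_RE = re.compile(r"Timeticks:\s*\((\d+)\)")


def parse_timeticks(line):
    """Return the raw Timeticks value from an snmpget line, or None."""
    match = TICKS_RE.search(line)
    return int(match.group(1)) if match else None


def was_restarted(previous_ticks, current_ticks):
    """True if the counter went backwards since the previous reading."""
    return current_ticks < previous_ticks
```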
- FreeBSD Jails are great for setting up small isolated environments,
for example webservers and SMTP servers. The server which hosts
all jails should check for them, and warn if one is missing or
if an unknown one has popped up:
[~] root@freebsd>jls
JID IP Address Hostname Path
11 212.73.76.0 ns0.mavetju.org /usr/jails/ns0.mavetju.org
10 212.73.76.3 dhcp.mavetju.org /usr/jails/dhcp.mavetju.org
9 212.73.78.126 proxy2.mavetju.org /usr/jails/proxy2.mavetju.org
8 212.73.78.125 mail4.mavetju.org /usr/jails/mail4.mavetju.org
6 212.73.78.96 tftp.mavetju.org /usr/jails/tftp.mavetju.org
5 212.73.78.95 syslog.mavetju.org /usr/jails/syslog.mavetju.org
3 212.73.78.92 cvs.mavetju.org /usr/jails/cvs.mavetju.org
2 212.73.78.91 mailman.mavetju.org /usr/jails/mailman.mavetju.org
1 212.73.78.90 jabber.mavetju.org /usr/jails/jabber.mavetju.org
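Checking for "missing" and "unknown" at the same time is a comparison of two sets. The sketch below takes the hostnames from jls output in the four-column layout shown above and compares them with a list of expected jails; the same `compare` function works for any expected-versus-found check.

```python
def parse_jls_hostnames(jls_output):
    """Return the set of jail hostnames from 'jls' output."""
    names = set()
    for line in jls_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric JID; the header row does not.
        if len(fields) >= 4 and fields[0].isdigit():
            names.add(fields[2])
    return names


def compare(expected, found):
    """Return (missing, unknown) as two sorted lists."""
    expected, found = set(expected), set(found)
    return sorted(expected - found), sorted(found - expected)
```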
The Services
In theory checking of services could be very easy:
- If the service makes a PID file, check if the process is running.
[~] root@freebsd>ps wup `cat /var/run/named.pid `
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 407 0.0 1.3 46472 42104 ?? Ss Thu02PM 13:44.89 /usr/sbin/named -c /etc/namedb/named.conf -u root
- If the service doesn't make a PID file, use pgrep to
see if it is running:
[~] root@freebsd>pgrep -lf named
407 /usr/sbin/named -c /etc/namedb/named.conf -u root
- If the service is listening on the network, check if you can
set up a TCP session towards it.
It might not be ideal, but it's a good start.
Some extra checks can be made for the following services:
- DNS
See if you can get an answer back from the request for
version.bind or version.server. That will
show you if the server is actually answering requests.
[~] root@freebsd>dig @ns0.mavetju.org version.server chaos txt
; <<>> DiG 9.3.2 <<>> @ns0.mavetju.org version.server chaos txt
; (1 server found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 39096
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;version.server. CH TXT
;; ANSWER SECTION:
version.server. 0 CH TXT "Nominum ANS 2.7.0.2"
;; Query time: 161 msec
;; SERVER: 212.73.76.0#53(212.73.76.0)
;; WHEN: Wed Apr 18 16:59:37 2007
;; MSG SIZE rcvd: 64
- POP3 / IMAP
Both POP3 and IMAP services return a greeting when you connect
to them:
[~] root@freebsd>telnet pop.mavetju.org pop3
Trying 212.73.78.125...
Connected to mail4.mavetju.org.
Escape character is '^]'.
+OK DBMAIL pop3 server ready to rock <[email protected]>
[~] root@freebsd>telnet imap.mavetju.org imap
Trying 212.73.78.125...
Connected to mail4.mavetju.org.
Escape character is '^]'.
* OK dbmail imap (protocol version 4r1) server 2.0.10 ready to run
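What telnet does by hand here can be sketched in a few lines: connect, read the first line and compare it with what the protocol should say. The prefixes for POP3 and IMAP are the ones visible above; the ones for SMTP and SSH are the standard greetings of those protocols. The function names are made up for this example.

```python
import socket

# First bytes a healthy server sends right after the connection is set up.
EXPECTED_PREFIX = {"pop3": "+OK", "imap": "* OK", "smtp": "220", "ssh": "SSH-"}


def greeting_ok(protocol, first_line):
    """True if the greeting starts the way the protocol says it should."""
    return first_line.startswith(EXPECTED_PREFIX[protocol])


def read_greeting(host, port, timeout=10.0):
    """Connect and return the first line the server sends ('' if none)."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        data = conn.recv(1024)
    lines = data.decode("ascii", "replace").splitlines()
    return lines[0] if lines else ""
```

A check would combine the two, for example `greeting_ok("pop3", read_greeting(host, 110))`, and treat a connection error or timeout as a failure.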
- SMTP / spam checker / greylisting / virus scanner
SMTP servers listen for network connections on port 25 (smtp)
and port 587 (submission), so check that the SMTP server is running.
Incoming SMTP traffic might be greylisted, so check that the
greylist daemon is running. The email received then goes through
a virus scanner: make sure that the virus scanner daemon is running,
that the signatures are up to date and, if you are using a
commercial package, that your license hasn't expired:
[~] root@freebsd>/usr/local/viruscan/kav/bin/aveclient -c -p /var/run/aveserver
RECORDS 283158
UPDATED 18-04-2007
SERIAL 0367-0003F5-012E4689
EXPIRE 17-04-2008
Then the email goes through the spam checker (make sure that
the daemon is running) and then into the mail folder.
Email can come in bulk. That means that one moment your queue
is empty, and the next moment there are 500 messages in the
queue. If your users get a daily mailing like this every day
at 17:00, then you will get a daily alert about it.
- NAT gateways
Check the size of the NAT table. The expected size depends
on the policy of your network. If your network is open (no proxy
server, no restrictions on traffic), then the NAT table will
be very big.
If you have a regulated network (HTTP has to go via the proxy
server, email has to be delivered to the local SMTP servers, DNS
requests have to go to the local DNS server etc), then it
will be relatively small. A change in the size can show that there
is something wrong.
[~] root@freebsd>ipnat -l | wc -l
300
- Database replication
Not only is the consistency of the data in a database very
important, so is its replication, which should be
as close to realtime as possible. Slony, the replication service for
PostgreSQL, gives these statistics via the sl_status table:
database=# select st_origin,st_received,st_lag_time from _database.sl_status;
st_origin | st_received | st_lag_time
-----------+-------------+-----------------
4 | 1 | 00:00:01.271073
4 | 2 | 00:00:01.091502
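To alert on replication lag, the st_lag_time column has to become a number. A small sketch, assuming the interval is always printed as HH:MM:SS with optional fractions, as in the output above (lags of a day or more are printed differently and are not handled here). The 30 second threshold is an arbitrary example value.

```python
def lag_seconds(st_lag_time):
    """Convert 'HH:MM:SS[.ffffff]' into seconds as a float."""
    hours, minutes, seconds = st_lag_time.strip().split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def lagging_nodes(rows, max_lag=30.0):
    """Return the st_received ids whose lag exceeds max_lag seconds.

    rows is a list of (st_origin, st_received, st_lag_time) tuples.
    """
    return sorted(
        received for origin, received, lag in rows
        if lag_seconds(lag) > max_lag
    )
```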
- Asterisk VoIP
There are a couple of important things to be monitored in
Asterisk via the Manager interface: Status of the PRI interfaces,
status of the SIP peers, status of the IAX peers.
voip*CLI> pri show spans
PRI span 1/0: Provisioned, Up, Active
PRI span 2/0: Provisioned, Up, Active
PRI span 3/0: Provisioned, Up, Active
PRI span 4/0: Provisioned, Up, Active
voip*CLI> sip show peers
Name/username Host Dyn Nat ACL Port Status
edwin 121.44.244.57 D N 2051 Unmonitored
wen09-vega 10.197.9.12 5060 OK (7 ms)
ccm-publisher 10.252.11.130 5060 OK (1 ms)
3 sip peers [3+0 online, 0 offline, 0 unmonitored]
voip*CLI> iax2 show peers
Name/Username Host Mask Port Status
bluebox-tardis/ 202.83.176.44 (S) 255.255.255.255 4569 (T) OK (3 ms)
1 iax2 peers [1 online, 0 offline, 0 unmonitored]
With the SIP and IAX status, not only the OK status is important
but also the time for the answer.
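Both the status and the answer time can be taken from the peer listings above. This sketch parses the text as printed on the CLI; the list of status words is what I assume Asterisk can print in that column, and the 100 ms limit is an arbitrary example.

```python
import re

# Status column at the end of a peer line, e.g. "OK (7 ms)" or "Unmonitored".
STATUS_RE = re.compile(
    r"\b(OK|LAGGED|UNREACHABLE|UNKNOWN|Unmonitored)\b(?:\s*\((\d+)\s*ms\))?\s*$"
)


def parse_peers(output):
    """Return {peer_name: (status, latency_ms_or_None)}."""
    peers = {}
    for line in output.splitlines():
        fields = line.split()
        match = STATUS_RE.search(line)
        if not fields or not match:
            continue  # prompt, header or summary line
        latency = int(match.group(2)) if match.group(2) else None
        peers[fields[0]] = (match.group(1), latency)
    return peers


def slow_or_down(peers, max_ms=100):
    """Names of monitored peers that are not OK or answer too slowly."""
    problems = []
    for name, (status, latency) in sorted(peers.items()):
        if status == "Unmonitored":
            continue
        if status != "OK" or (latency is not None and latency > max_ms):
            problems.append(name)
    return problems
```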
Network device monitoring
Gathering information for network device monitoring is a little bit
trickier than for systems monitoring, because you can't run these fancy
scripts on your routers and switches. Often you can only get
information via SNMP.
- System Uptime: Embedded devices often reboot very quickly,
so they can reboot several times without you even
noticing. With the system.sysUpTime OID you can
get the uptime:
SNMPv2-MIB::sysUpTime = Timeticks: (760344619) 88 days, 0:04:06.19
- If you have a clean network, with your network devices
and user devices separated from each other, then there is
a clear border showing where the responsibility lies. It also
gives you an easy way to check whether all interfaces on your devices
are in the state you expect them to be in.
[~] root@freebsd>snmpwalk -v 1 -c secret router IF-MIB::ifDescr
RFC1213-MIB::ifDescr.1001 = STRING: "hs1-x450/14"
RFC1213-MIB::ifDescr.1002 = STRING: "hs2-ssg550/e0_2"
RFC1213-MIB::ifDescr.1003 = STRING: "hs2-ssg550/e0_0"
RFC1213-MIB::ifDescr.1000006 = STRING: "VLAN 04094 (to-internet)"
RFC1213-MIB::ifDescr.1000007 = STRING: "rtif(202.83.178.178/29)"
RFC1213-MIB::ifDescr.1000008 = STRING: "VLAN 04093 (to-sjh)"
[~] root@freebsd>snmpwalk -v 1 -c secret router IF-MIB::ifSpeed
RFC1213-MIB::ifSpeed.1001 = Gauge32: 1000000000
RFC1213-MIB::ifSpeed.1002 = Gauge32: 1000000000
RFC1213-MIB::ifSpeed.1003 = Gauge32: 1000000000
RFC1213-MIB::ifSpeed.1000006 = Gauge32: 0
RFC1213-MIB::ifSpeed.1000007 = Gauge32: 0
RFC1213-MIB::ifSpeed.1000008 = Gauge32: 0
[~] root@freebsd>snmpwalk -v 1 -c secret router IF-MIB::ifOperStatus
RFC1213-MIB::ifOperStatus.1001 = INTEGER: up(1)
RFC1213-MIB::ifOperStatus.1002 = INTEGER: up(1)
RFC1213-MIB::ifOperStatus.1003 = INTEGER: up(1)
RFC1213-MIB::ifOperStatus.1000006 = INTEGER: up(1)
RFC1213-MIB::ifOperStatus.1000007 = INTEGER: up(1)
RFC1213-MIB::ifOperStatus.1000008 = INTEGER: up(1)
If an ifSpeed is suddenly 100Mbps instead of 1Gbps, you know
that there is something wrong. If an ifOperStatus is down instead
of up, you know that there is a problem. If you have redundancy
in your network, these issues might otherwise stay hidden because
the remote subnet has never been unreachable.
Routers can "suddenly" have more or fewer interfaces, for example when
you create or delete a VLAN. So you have to monitor for the
absence of expected VLANs and the presence of unknown VLANs.
This is for a radio link:
[~] root@freebsd>snmpwalk -v 1 -c secret link-1 IF-MIB::ifDescr
IF-MIB::ifDescr.1 = STRING: Ethernet Interface
IF-MIB::ifDescr.2 = STRING: lo0
IF-MIB::ifDescr.3 = STRING: WORP Interface
[~] root@freebsd>snmpwalk -v 1 -c secret link-1 IF-MIB::ifSpeed
IF-MIB::ifSpeed.1 = Gauge32: 100000000
IF-MIB::ifSpeed.2 = Gauge32: 100000000
IF-MIB::ifSpeed.3 = Gauge32: 36000000
[~] root@freebsd>snmpwalk -v 1 -c secret link-1 IF-MIB::ifOperStatus
IF-MIB::ifOperStatus.1 = INTEGER: up(1)
IF-MIB::ifOperStatus.2 = INTEGER: up(1)
IF-MIB::ifOperStatus.3 = INTEGER: up(1)
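Comparing a walk against the state you expect can be sketched as below. It parses snmpwalk text in the "MIB::object.index = TYPE: value" layout shown above and reports wrong values, missing interfaces and unknown interfaces in one go. The expected values would come from your own configuration; nothing here talks SNMP itself.

```python
import re

# e.g. 'IF-MIB::ifSpeed.3 = Gauge32: 36000000'
WALK_RE = re.compile(r"^\S+::(\w+)\.([\d.]+) = (?:[\w-]+: )?(.*)$")


def parse_walk(text):
    """Return {index: value} for every line of one walked object."""
    values = {}
    for line in text.splitlines():
        match = WALK_RE.match(line.strip())
        if match:
            values[match.group(2)] = match.group(3).strip().strip('"')
    return values


def deviations(expected, actual):
    """Return {index: (expected_value, actual_value)} for every mismatch.

    A missing interface has actual_value None, an unknown interface
    has expected_value None.
    """
    problems = {}
    for index, want in expected.items():
        got = actual.get(index)
        if got != want:
            problems[index] = (want, got)
    for index, got in actual.items():
        if index not in expected:
            problems[index] = (None, got)
    return problems
```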
- If you are exchanging routing information with your ISP
or with other 3rd parties, then this goes via BGP.
Checking if your BGP neighbours are up can be done via SNMP:
[~] root@freebsd>snmpwalk -v 1 -c secret router.mavetju.org BGP4-MIB::bgpPeerState
BGP4-MIB::bgpPeerState.218.100.2.1 = INTEGER: established(6)
BGP4-MIB::bgpPeerState.218.100.2.62 = INTEGER: idle(1)
BGP4-MIB::bgpPeerState.221.133.215.61 = INTEGER: established(6)
The same applies here: check for the absence of expected neighbours
and the presence of unknown neighbours.
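A sketch of that three-way check on the bgpPeerState output shown above: peers that are known but not established, peers that are expected but absent, and peers that nobody told you about. The list of expected peers is an input you maintain yourself.

```python
import re

PEER_RE = re.compile(
    r"bgpPeerState\.(\d+\.\d+\.\d+\.\d+) = INTEGER: (\w+)\(\d+\)"
)


def parse_bgp_peers(text):
    """Return {peer_address: state_name} from a bgpPeerState walk."""
    return {m.group(1): m.group(2) for m in PEER_RE.finditer(text)}


def check_bgp(expected_peers, peers):
    """Classify peers as down, missing or unknown."""
    expected = set(expected_peers)
    return {
        "down": sorted(p for p in expected
                       if p in peers and peers[p] != "established"),
        "missing": sorted(expected - set(peers)),
        "unknown": sorted(set(peers) - expected),
    }
```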
- If a router supports environmental reporting (temperature,
fanspeed), measure it and report anomalies. High temperatures
are bad for hardware!
EXTREME-SYSTEM-MIB::extremeFanOperational.101 = INTEGER: true(1)
EXTREME-SYSTEM-MIB::extremeFanOperational.102 = INTEGER: true(1)
EXTREME-SYSTEM-MIB::extremeFanOperational.103 = INTEGER: true(1)
EXTREME-SYSTEM-MIB::extremeCurrentTemperature.0 = INTEGER: 27
- If a router has multiple power supplies, it is important
that you check if all of them are active. They're just like
RAID cards: You can live with one less, but not with two!
[~] root@freebsd>snmpwalk -v 1 -c secret router.mavetju.org EXTREME-SYSTEM-MIB::extremePowerSupplyStatus
EXTREME-SYSTEM-MIB::extremePowerSupplyStatus.1 = INTEGER: presentOK(2)
EXTREME-SYSTEM-MIB::extremePowerSupplyStatus.2 = INTEGER: presentOK(2)
EXTREME-SYSTEM-MIB::extremePowerSupplyStatus.3 = INTEGER: presentOK(2)
Links to software