


Network and System Monitoring Primers

Recently I had a discussion with a colleague about the monitoring of our systems and network devices. I showed him everything we measure, and he wondered whether it was overkill. I told him that some things maybe were, but that good monitoring is the first step to knowing what happens in your network, and that knowing what happens in your network is the first step to being able to isolate problems when they arise.

So the question is: "What do you need to monitor?" The answer is easy: "Everything". That's a pretty big amount. Let reality kick in and rephrase it to: "Everything you can think of". This is too vague and misses the final goal: "Everything you assume to be the normal conditions for a system or service to run properly".

That is quite a lot, and it is different for each service you provide or device you have on the network. It will force you to dig into your network and servers to find out what is going on. This also means that you need to find out how your network and services behave. In the beginning you will get a number of alerts which turn out to be false positives, and you will need to tune your alert settings. Or you will get the same alerts every day at the same time: they are normal system behaviour. You can choose between getting these messages every day and changing the alert settings so that you won't get them anymore. History shows that the last option isn't the smartest one, because it will hide possible issues from you.

A lot of the items described in this primer could be considered too nitpicky, which might or might not be an issue for you at this moment. Keep in mind that you should monitor for normal operations! Everything which happens in your network and systems which isn't normal is worth investigating!

This primer only describes active monitoring and realtime monitoring. Passive monitoring (via SNMP traps, syslog messages or monitoring agents) and historical monitoring (history graphs with applications like Cacti) are not described.

Software

This primer tries to be generic, but is based on my experience with Nagios. At the end there will be a link to the scripts I use for non-standard Nagios features.

Nagios is described as "a host, service and network monitoring program". If you are a beginner, you will find its configuration files horrible. But once you get through that, it is easy to expand. Nagios does all its checking by executing scripts in its `libexec/` directory. This doesn't mean that it is limited to checking services on other hosts over the network: for checks which have to run locally on a remote host there is the program called NRPE, which stands for Nagios Remote Plugin Executor. This program runs as a daemon on the remote hosts and runs the same, but locally installed, Nagios scripts. (FreeBSD: `net-mgmt/nagios` and `net-mgmt/nrpe2`)

Systems monitoring

There are several components which need to be monitored on a system: the hardware (disks, CPU), the operating system and the services.

Hardware

These days you can get a lot of information about the components of your motherboard: CPU temperature, internal temperature, fan speeds and power voltages. Higher temperatures are bad for your motherboard. Fan speeds which are suddenly much higher, or lower, indicate that one of them is broken and might cause higher temperatures. And power voltage changes indicate problems with your power supply. On Linux, this information can be gathered from `/proc/acpi`. On FreeBSD this can be gathered via `sysutils/healthd`.

The characteristics of IDE harddisks can be monitored via the SMART interface (Self-Monitoring, Analysis and Reporting Technology), for example the temperature of the disks and a handful of counters: Reallocated Sector Count, Seek Error Rate, Spin Retry Count, Calibration Retry Count, Reallocated Event Count, Current Pending Sector and UDMA CRC Error Count.
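
One way to watch these counters is to record their raw values once and compare later readings against that baseline. The Python sketch below is only an illustration under a few assumptions: it expects text in the attribute-table layout that `smartctl -A` from smartmontools usually prints (numeric ID first, attribute name second, raw value in the tenth column), and the function names, the sample text and the baseline are made up for the example.

```python
# Attribute names as smartctl usually spells the counters listed above.
WATCHED = {
    "Reallocated_Sector_Ct", "Seek_Error_Rate", "Spin_Retry_Count",
    "Calibration_Retry_Count", "Reallocated_Event_Count",
    "Current_Pending_Sector", "UDMA_CRC_Error_Count",
}


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for the rows of an attribute table."""
    values = {}
    for line in text.splitlines():
        fields = line.split(None, 9)
        # Attribute rows have ten columns and start with a numeric ID.
        if len(fields) == 10 and fields[0].isdigit():
            raw = fields[9].split()[0]
            if raw.isdigit():
                values[fields[1]] = int(raw)
    return values


def increased_counters(baseline, current, watched=WATCHED):
    """List the watched counters whose raw value is higher than the baseline."""
    return sorted(name for name in watched
                  if current.get(name, 0) > baseline.get(name, 0))


# Hypothetical sample in the usual smartctl layout (not taken from a real disk).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       2
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

baseline = {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 0,
            "UDMA_CRC_Error_Count": 0}
current = parse_smart_attributes(SAMPLE)
print(increased_counters(baseline, current))
```

Comparing against a stored baseline rather than against zero keeps a disk that already has a few reallocated sectors from alerting forever, while any further increase is still reported.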
If these counters go up, there might be a problem with your harddisk. On Linux and FreeBSD this data can be gathered via the smartmontools software (FreeBSD: `sysutils/smartmontools`).

RAID hardware is beautiful: a broken harddisk won't wake you up in the middle of the night anymore (but two broken harddisks will, so it had better be monitored). There are various ways to check RAID sets, and every vendor seems to have its own software. The following works on FreeBSD:

`camcontrol` for HP/Compaq RAID cards:

    [~] root@freebsd>camcontrol inquiry da3
    pass3: <COMPAQ RAID 1 VOLUME OK> Fixed Direct Access SCSI-0 device

`tw_cli` for 3ware RAID cards (`sysutils/tw_cli`):

    [~] root@freebsd>/usr/local/bin/tw_cli info c0 unitstatus
    # of units: 1
        Unit 0: RAID 5 1.63 TB ( 3516478848 blocks): OK

`aaccli` for Adaptec AAC controllers (`sysutils/aaccli`):

    [~] root@freebsd>aaccli 'open aac0 : disk list : container list'
    --------------------------------------------------------------------------------
    Adaptec SCSI RAID Controller Command Line Interface
    Copyright 1998-2002 Adaptec, Inc. All rights reserved
    --------------------------------------------------------------------------------
    Executing: open "aac0"
    Executing: disk list
    C:ID:L  Device Type     Blocks    Bytes/Block Usage            Shared Rate
    ------  --------------  --------- ----------- ---------------- ------ ----
    0:00:0  Disk            390721968 512         Initialized      NO     100
    0:03:0  Disk            390721968 512         Initialized      NO     100
    Executing: container list
    Num          Total  Oth Stripe          Scsi   Partition
    Label Type   Size   Ctr Size   Usage   C:ID:L Offset:Size
    ----- ------ ------ --- ------ ------- ------ -------------
     0    RAID-5  745GB       64KB Open    0:00:0 64.0KB: 186GB
     /dev/aacd0             raid5          0:03:0 64.0KB: 186GB

Network interface monitoring consists of two items. The first one is the number of packet errors, which can be gathered from the output of `netstat -ni`:

    [~] root@linux>netstat -ni
    Kernel Interface table
    Iface   MTU Met     RX-OK RX-ERR RX-DRP RX-OVR     TX-OK TX-ERR TX-DRP TX-OVR Flg
    eth1   1500   0 641851668      0      0      0 646287092      0      0      0 BMRU
    eth2   1500   0 711410096      0      0      0 701868617      0      0      0 BMRU
    lo    16436   0   6611086      0      0      0   6611086      0      0      0 LRU

    [~] root@freebsd>netstat -ni sk0
    Name    Mtu Network       Address              Ipkts Ierrs    Opkts Oerrs  Coll
    sk0    1500 <Link#1>      00:0f:ea:2c:d5:18  3706970     0  3316491     0     0
    sk0    1500 fe80:1::20f:e fe80:1::20f:eaff:        0     -        2     -     -
    sk0    1500 10.251.1.16/2 10.251.1.18        2923134     -  2536594     -     -

The second one is the media status: how the device is talking to the switch. On Linux this can be found in the output of `mii-tool`; on FreeBSD it is on the `media` line in the output of `ifconfig`:

    [~] root@linux>mii-tool eth1
    eth1: negotiated 100baseTx-FD flow-control, link ok

    [~] root@freebsd>ifconfig sk0
    sk0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
            options=8<VLAN_MTU>
            inet6 fe80::20f:eaff:fe2c:d518%sk0 prefixlen 64 scopeid 0x1
            inet 10.251.1.18 netmask 0xfffffff0 broadcast 10.251.1.31
            ether 00:0f:ea:2c:d5:18
            media: Ethernet autoselect (100baseTX <full-duplex,flag0,flag1>)
            status: active

If you have a UPS, see if you can get the status of it.
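
For an APC UPS handled by `apcupsd`, the `apcaccess` tool prints a `STATUS` line which can carry more than one flag at the same time, for example `ONLINE REPLACEBATT`. The Python sketch below shows one way to treat the UPS as healthy only when `ONLINE` is the single flag present. The function names are made up for the example, and the input is assumed to be the plain `KEY : value` text that `apcaccess` prints.

```python
def parse_apc_status(apcaccess_output):
    """Return the set of flags on the STATUS line, or None if there is none."""
    for line in apcaccess_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "STATUS":
            return set(value.split())
    return None


def ups_is_healthy(apcaccess_output):
    """Healthy means ONLINE and nothing else; a missing line is not healthy."""
    return parse_apc_status(apcaccess_output) == {"ONLINE"}


print(ups_is_healthy("STATUS   : ONLINE"))
print(ups_is_healthy("STATUS   : ONLINE REPLACEBATT"))
```

Comparing the whole set of flags, instead of searching for the substring `ONLINE`, is what makes the second call report a problem.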
Information from APC UPSes can be gathered via `apcupsd` and `apcaccess`:

    STATUS   : ONLINE

Do not check only for the presence of `ONLINE`; check that `ONLINE` is the only string, because this is a valid status too:

    STATUS   : ONLINE REPLACEBATT

(FreeBSD: `sysutils/apcupsd`)

The Operating System

Disk space information, per partition, can be obtained from the output of `df`, which gives you the free disk space. Another important piece of information is the number of inodes you have available (`df -i`), because if you don't have any inodes free, you can't create any more files.

    [~] root@freebsd>df -i /
    Filesystem  1K-blocks  Used Avail Capacity iused ifree %iused  Mounted on
    /dev/da0s1a    128990 86072 32600    73%    2952 13302   18%   /

    [~] root@linux>df -i
    Filesystem            Inodes   IUsed   IFree IUse% Mounted on
    /dev/mapper/VolGroup00-LogVol00
                        35291136  559063 34732073    2% /
    /dev/cciss/c0d0p1      26104      36    26068    1% /boot

A freshly installed system should have very few services running, maybe only crond, inetd, ntpd, sshd and syslogd. They all create their own PID files, so it is easy to get the process IDs:

    [~] root@freebsd>ps wup `head /var/run/ntpd.pid `
    USER   PID %CPU %MEM  VSZ  RSS TTY STAT START  TIME COMMAND
    ntp   3148  0.0  0.1 4044 4044 ?   SLs  Apr14  0:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g

Of course you need to check first if the PID file exists. If the service is a network-based service, then besides checking that the PID file and the process exist, you should also check that the service works to some extent: for sshd you should get the SSH banner, for ntpd you can check if the service is synced.

If the server is supposed to transport email (both mail servers and application servers), then check if the mail queue is more or less empty:

    [~] root@postfix>mailq
    -Queue ID- --Size-- ----Arrival Time---- -Sender/Recipient-------
    A634A5CD036   66814 Mon Apr 16 12:38:57  [email protected]
    [..]
    -- 343 Kbytes in 3 Requests.

    [~] root@postfix>mailq
    Mail queue is empty

    [~] root@sendmail>mailq
    /var/spool/mqueue is empty
                    Total requests: 0

On machines which act as a router, and especially ones which do dynamic routing, it is important to make sure that the default route is pointing to the expected interface:

    [~] root@freebsd>netstat -rn -f inet | grep ^default
    default            202.83.178.153     UG1         1 885506855   fxp2

    [~] root@linux>netstat -rn | grep ^0.0.0.0
    0.0.0.0         10.252.13.9     0.0.0.0         UG        0 0          0 eth0

If you use dynamic routing in your network, one other thing you need to do on one or more machines is to check that all your networks are in the routing table. Missing one means that you can't reach that network from that machine!

Number of users, total processes and swap: these are easy to measure, and they might be an indication that something is wrong. For machines which are servers, the number of users logged in shouldn't be too high: unless work is being done on them, nobody should be logged in. Swap is preferably not in use at all.

    [~] root@freebsd>uptime
     4:13PM  up 88 days, 4 mins, 4 users, load averages: 0.24, 0.39, 0.32
    [~] root@freebsd>ps auxw | wc -l
         205
    [~] root@freebsd>swapinfo
    Device          1K-blocks     Used    Avail Capacity
    /dev/ad10s1b      4168496      480  4168016     0%

    [~] root@linux>uptime
     16:12:49 up 4 days,  3:52,  2 users,  load average: 0.11, 0.11, 0.06
    [~] root@linux>ps auxw | wc -l
    90
    [~] root@linux>cat /proc/swaps
    Filename                                Type            Size    Used    Priority
    /dev/mapper/VolGroup00-LogVol01         partition       2031608 1576    -1

Uptime. Don't rely on the host not answering pings to determine whether the machine has been rebooted. With today's hardware and background file system checks, the machine is back before the ping-timeout threshold has been reached. If the uptime has been reset, then something has happened! Note that if you monitor this via SNMP, the `system.sysUpTime` OID returns the time since `snmpd` was started (in hundredths of a second), not the time since the machine was booted.
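
Detecting a reset comes down to remembering the previous value and noticing that the new one is smaller. The Python sketch below assumes the `Timeticks: (N)` text form which the net-snmp tools print for `sysUpTime`, where N counts hundredths of a second; the function names are made up for the example, and storing the previous value between checks is left out.

```python
import re

TIMETICKS = re.compile(r"Timeticks: \((\d+)\)")


def uptime_seconds(snmp_line):
    """Extract the uptime in seconds from a 'Timeticks: (N) ...' line."""
    match = TIMETICKS.search(snmp_line)
    if match is None:
        raise ValueError("no Timeticks value in %r" % snmp_line)
    # TimeTicks are hundredths of a second.
    return int(match.group(1)) / 100.0


def has_restarted(previous_seconds, current_seconds):
    """The counter only grows, so a smaller value than last time is a reset.

    A 32-bit TimeTicks counter also wraps after roughly 497 days, which
    looks the same as a reset from here.
    """
    return current_seconds < previous_seconds


line = "SNMPv2-MIB::sysUpTime = Timeticks: (760344619) 88 days, 0:04:06.19"
print(uptime_seconds(line))
print(has_restarted(uptime_seconds(line), 120.0))
```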
Restarting the `snmpd` will reset this counter!

    SNMPv2-MIB::sysUpTime = Timeticks: (760344619) 88 days, 0:04:06.19

FreeBSD jails are great for setting up small isolated environments, for example web servers and SMTP servers. The server which hosts the jails should check for them, and warn if one is missing or if an unknown one has popped up:

    [~] root@freebsd>jls
       JID  IP Address      Hostname                      Path
        11  212.73.76.0     ns0.mavetju.org               /usr/jails/ns0.mavetju.org
        10  212.73.76.3     dhcp.mavetju.org              /usr/jails/dhcp.mavetju.org
         9  212.73.78.126   proxy2.mavetju.org            /usr/jails/proxy2.mavetju.org
         8  212.73.78.125   mail4.mavetju.org             /usr/jails/mail4.mavetju.org
         6  212.73.78.96    tftp.mavetju.org              /usr/jails/tftp.mavetju.org
         5  212.73.78.95    syslog.mavetju.org            /usr/jails/syslog.mavetju.org
         3  212.73.78.92    cvs.mavetju.org               /usr/jails/cvs.mavetju.org
         2  212.73.78.91    mailman.mavetju.org           /usr/jails/mailman.mavetju.org
         1  212.73.78.90    jabber.mavetju.org            /usr/jails/jabber.mavetju.org

The Services

In theory, checking services can be very easy: if the service makes a PID file, check if the process is running.

    [~] root@freebsd>ps wup `cat /var/run/named.pid `
    USER PID %CPU %MEM   VSZ   RSS TT STAT STARTED     TIME COMMAND
    root 407  0.0  1.3 46472 42104 ?? Ss   Thu02PM 13:44.89 /usr/sbin/named -c /etc/namedb/named.conf -u root

If the service doesn't make a PID file, use `pgrep` to see if it is running:

    [~] root@freebsd>pgrep -lf named
    407 /usr/sbin/named -c /etc/namedb/named.conf -u root

If the service is listening on the network, check if you can set up a TCP session towards it. It might not be ideal, but it's a good start. Some extra checks can be made for the following services:

DNS

See if you can get an answer back from a request for `version.bind` or `version.server`. That will show you whether the server is actually answering requests.

    [~] root@freebsd>dig @ns0.mavetju.org version.server chaos txt
    ; <<>> DiG 9.3.2 <<>> @ns0.mavetju.org version.server chaos txt
    ; (1 server found)
    ;; global options:  printcmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 39096
    ;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
    ;; QUESTION SECTION:
    ;version.server.                        CH      TXT
    ;; ANSWER SECTION:
    version.server.         0       CH      TXT     "Nominum ANS 2.7.0.2"
    ;; Query time: 161 msec
    ;; SERVER: 212.73.76.0#53(212.73.76.0)
    ;; WHEN: Wed Apr 18 16:59:37 2007
    ;; MSG SIZE  rcvd: 64

POP3 / IMAP

Both POP3 and IMAP services return a greeting when you connect to them:

    [~] root@freebsd>telnet pop.mavetju.org pop3
    Trying 212.73.78.125...
    Connected to mail4.mavetju.org.
    Escape character is '^]'.
    +OK DBMAIL pop3 server ready to rock <[email protected]<

    [~] root@freebsd>telnet imap.mavetju.org imap
    Trying 212.73.78.125...
    Connected to mail4.mavetju.org.
    Escape character is '^]'.
    * OK dbmail imap (protocol version 4r1) server 2.0.10 ready to run

SMTP / spam checker / greylisting / virus scanner

SMTP servers (check if the SMTP server is running) listen for network connections on port 25 (smtp) and port 587 (submission). Incoming SMTP traffic might be greylisted (check if the greylist daemon is running). The email received goes through a virus scanner (if you are using a commercial package, make sure your license hasn't expired; make sure the virus scanner daemon is running; make sure that the signatures are up to date):

    [~] root@freebsd>/usr/local/viruscan/kav/bin/aveclient -c -p /var/run/aveserver
    RECORDS 283158 UPDATED 18-04-2007 SERIAL 0367-0003F5-012E4689 EXPIRE 17-04-2008

Then the email goes through the spam checker (make sure that the daemon is running) and then into the mail folder.

Email can come in bulk. That means that one moment your queue is empty, and the next moment there are 500 messages in the queue.
If your users get a mailing like this every day at 17:00, then you will get a daily alert about it.

NAT gateways

Check the size of the NAT table. The expected size depends on the policy of your network. If your network is open (no proxy server, no restrictions on traffic), then the NAT table will be very big. If you have a regulated network (HTTP has to go via the proxy server, email has to be delivered to the local SMTP servers, DNS requests have to go to the local DNS server etc), then it will be relatively small. A change in the size can show that there is something wrong.

    [~] root@freebsd>ipnat -l | wc -l
         300

Database replication

Not only is the consistency of the data in a database very important, but so is its replication, and that should be as realtime as possible. Slony, the replication service for PostgreSQL, gives these statistics via the `sl_status` table:

    database=# select st_origin,st_received,st_lag_time from _database.sl_status;
     st_origin | st_received |   st_lag_time
    -----------+-------------+-----------------
             4 |           1 | 00:00:01.271073
             4 |           2 | 00:00:01.091502

Asterisk VoIP

There are a couple of important things to be monitored in Asterisk via the Manager interface: the status of the PRI interfaces, the status of the SIP peers and the status of the IAX peers.

    voipCLI> pri show spans*
    PRI span 1/0: Provisioned, Up, Active
    PRI span 2/0: Provisioned, Up, Active
    PRI span 3/0: Provisioned, Up, Active
    PRI span 4/0: Provisioned, Up, Active

    voipCLI> sip show peers*
    Name/username              Host            Dyn Nat ACL Port     Status
    edwin                      121.44.244.57    D   N      2051     Unmonitored
    wen09-vega                 10.197.9.12                 5060     OK (7 ms)
    ccm-publisher              10.252.11.130               5060     OK (1 ms)
    3 sip peers [3+0 online, 0 offline, 0 unmonitored]

    voipCLI> iax2 show peers*
    Name/Username    Host                 Mask             Port          Status
    bluebox-tardis/  202.83.176.44   (S)  255.255.255.255  4569 (T)      OK (3 ms)
    1 iax2 peers [1 online, 0 offline, 0 unmonitored]

With the SIP and IAX status, not only the OK status is important but also the response time.
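
A check along those lines could look like the Python sketch below. It works on the text of a `sip show peers` or `iax2 show peers` listing as shown above and reports every expected peer which is missing, not `OK`, or slower than a limit. The function name, the list of expected peers and the `max_ms` limit are assumptions for the example, and getting the listing out of Asterisk is left out.

```python
import re

OK_STATUS = re.compile(r"\bOK \((\d+) ms\)")


def check_peers(peers_output, expected, max_ms=50):
    """Return {peer: problem} for expected peers that are missing, not OK or slow."""
    line_by_peer = {}
    for line in peers_output.splitlines():
        fields = line.split()
        if fields:
            # First column is "name" or "name/username"; keep the name part.
            line_by_peer[fields[0].split("/")[0]] = line
    problems = {}
    for peer in expected:
        line = line_by_peer.get(peer)
        if line is None:
            problems[peer] = "missing"
            continue
        match = OK_STATUS.search(line)
        if match is None:
            problems[peer] = "not OK"
        elif int(match.group(1)) > max_ms:
            problems[peer] = "slow (%s ms)" % match.group(1)
    return problems


SAMPLE = """\
Name/username Host Dyn Nat ACL Port Status
edwin 121.44.244.57 D N 2051 Unmonitored
wen09-vega 10.197.9.12 5060 OK (7 ms)
ccm-publisher 10.252.11.130 5060 OK (1 ms)
"""

print(check_peers(SAMPLE, ["edwin", "wen09-vega", "ccm-publisher", "gone"], max_ms=5))
```

Working from a list of expected peers, rather than from whatever the listing happens to contain, means that a peer which has disappeared completely is reported as well.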
Network device monitoring

Gathering information for network device monitoring is a little bit trickier than for systems monitoring, because you can't run these fancy scripts on your routers and switches. Often you can only get information via SNMP...

System uptime: embedded devices are often very fast with their reboots, so they can reboot several times without you noticing anything. With the `system.sysUpTime` OID you can get the uptime:

    SNMPv2-MIB::sysUpTime = Timeticks: (760344619) 88 days, 0:04:06.19

If you have a clean network, and have your network devices and user devices separated from each other, then there is a clear border showing where the responsibility lies. And it gives you an easy way to check whether all interfaces on your devices are in the state you expect them to be in.

    [~] root@freebsd>snmpwalk -v 1 -c secret router IF-MIB::ifDescr
    RFC1213-MIB::ifDescr.1001 = STRING: "hs1-x450/14"
    RFC1213-MIB::ifDescr.1002 = STRING: "hs2-ssg550/e0_2"
    RFC1213-MIB::ifDescr.1003 = STRING: "hs2-ssg550/e0_0"
    RFC1213-MIB::ifDescr.1000006 = STRING: "VLAN 04094 (to-internet)"
    RFC1213-MIB::ifDescr.1000007 = STRING: "rtif(202.83.178.178/29)"
    RFC1213-MIB::ifDescr.1000008 = STRING: "VLAN 04093 (to-sjh)"

    [~] root@freebsd>snmpwalk -v 1 -c secret router IF-MIB::ifSpeed
    RFC1213-MIB::ifSpeed.1001 = Gauge32: 1000000000
    RFC1213-MIB::ifSpeed.1002 = Gauge32: 1000000000
    RFC1213-MIB::ifSpeed.1003 = Gauge32: 1000000000
    RFC1213-MIB::ifSpeed.1000006 = Gauge32: 0
    RFC1213-MIB::ifSpeed.1000007 = Gauge32: 0
    RFC1213-MIB::ifSpeed.1000008 = Gauge32: 0

    [~] root@freebsd>snmpwalk -v 1 -c secret router IF-MIB::ifOperStatus
    RFC1213-MIB::ifOperStatus.1001 = INTEGER: up(1)
    RFC1213-MIB::ifOperStatus.1002 = INTEGER: up(1)
    RFC1213-MIB::ifOperStatus.1003 = INTEGER: up(1)
    RFC1213-MIB::ifOperStatus.1000006 = INTEGER: up(1)
    RFC1213-MIB::ifOperStatus.1000007 = INTEGER: up(1)
    RFC1213-MIB::ifOperStatus.1000008 = INTEGER: up(1)

If an ifSpeed is suddenly 100Mbps instead of 1Gbps, you know that there is something wrong.
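
Comparing a walk like the ones above with what you expect can be done with a few lines of Python. The sketch below is only an illustration: it parses `snmpwalk` text with a single trailing index per OID (as for `ifSpeed` and `ifOperStatus`), and the function names and the expected values are assumptions made up for the example.

```python
import re

WALK_LINE = re.compile(r"^\S+\.(\d+) = \w+: (.+)$")


def parse_walk(text):
    """Map the trailing index of each OID in snmpwalk output to its value."""
    result = {}
    for line in text.splitlines():
        match = WALK_LINE.match(line.strip())
        if match:
            result[int(match.group(1))] = match.group(2).strip('"')
    return result


def differences(expected, observed):
    """Return (changed, missing, unknown) between expected and observed values."""
    changed = {index: (expected[index], observed[index])
               for index in expected
               if index in observed and observed[index] != expected[index]}
    missing = sorted(set(expected) - set(observed))
    unknown = sorted(set(observed) - set(expected))
    return changed, missing, unknown


# Hypothetical walk: 1002 has dropped to 100Mbps, 1003 is gone, 1004 is new.
WALK = """\
RFC1213-MIB::ifSpeed.1001 = Gauge32: 1000000000
RFC1213-MIB::ifSpeed.1002 = Gauge32: 100000000
RFC1213-MIB::ifSpeed.1004 = Gauge32: 1000000000
"""
EXPECTED = {1001: "1000000000", 1002: "1000000000", 1003: "1000000000"}

print(differences(EXPECTED, parse_walk(WALK)))
```

The same comparison reports interfaces which have disappeared and interfaces which nobody expected, not only the ones whose value has changed.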
If an ifOperStatus is down instead of up, you know that there is a problem. If you have redundancy in your network, these issues might have stayed hidden because the remote subnet has never been unreachable.

Routers can "suddenly" have more or fewer interfaces, for example when you create or delete a VLAN. So you have to monitor for the absence of expected VLANs and the presence of unknown VLANs.

This is for a radio link:

    [~] root@freebsd>snmpwalk -v 1 -c secret link-1 IF-MIB::ifDescr
    IF-MIB::ifDescr.1 = STRING: Ethernet Interface
    IF-MIB::ifDescr.2 = STRING: lo0
    IF-MIB::ifDescr.3 = STRING: WORP Interface

    [~] root@freebsd>snmpwalk -v 1 -c secret link-1 IF-MIB::ifSpeed
    IF-MIB::ifSpeed.1 = Gauge32: 100000000
    IF-MIB::ifSpeed.2 = Gauge32: 100000000
    IF-MIB::ifSpeed.3 = Gauge32: 36000000

    [~] root@freebsd>snmpwalk -v 1 -c secret link-1 IF-MIB::ifOperStatus
    IF-MIB::ifOperStatus.1 = INTEGER: up(1)
    IF-MIB::ifOperStatus.2 = INTEGER: up(1)
    IF-MIB::ifOperStatus.3 = INTEGER: up(1)

If you are exchanging routing information with your ISP or with other third parties, then this goes via BGP. Checking if your BGP neighbours are up can be done via SNMP:

    [~] root@freebsd>snmpwalk -v 1 -c secret router.mavetju.org BGP4-MIB::bgpPeerState
    BGP4-MIB::bgpPeerState.218.100.2.1 = INTEGER: established(6)
    BGP4-MIB::bgpPeerState.218.100.2.62 = INTEGER: idle(1)
    BGP4-MIB::bgpPeerState.221.133.215.61 = INTEGER: established(6)

The same applies here: check for the absence of expected neighbours and the presence of unknown neighbours.

If a router supports environmental reporting (temperature, fan speed), measure it and report anomalies. High temperatures are bad for hardware!

    EXTREME-SYSTEM-MIB::extremeFanOperational.101 = INTEGER: true(1)
    EXTREME-SYSTEM-MIB::extremeFanOperational.102 = INTEGER: true(1)
    EXTREME-SYSTEM-MIB::extremeFanOperational.103 = INTEGER: true(1)
    EXTREME-SYSTEM-MIB::extremeCurrentTemperature.0 = INTEGER: 27

If a router has multiple power supplies, it is important that you check whether all of them are active. They're just like the disks in a RAID set: you can live with one less, but not with two!

    [~] root@freebsd>snmpwalk -v 1 -c secret router.mavetju.org EXTREME-SYSTEM-MIB::extremePowerSupplyStatus
    EXTREME-SYSTEM-MIB::extremePowerSupplyStatus.1 = INTEGER: presentOK(2)
    EXTREME-SYSTEM-MIB::extremePowerSupplyStatus.2 = INTEGER: presentOK(2)
    EXTREME-SYSTEM-MIB::extremePowerSupplyStatus.3 = INTEGER: presentOK(2)

Links to software ...

$Id: basicnetworktroubleshooting.php,v 1.7 2002/10/25 09:01:45 mavetju Exp $