Last week i made some fairly significant changes on a client's production firewall/routing cluster during our maintenance window. The next morning there were reports of file server drives not connecting correctly and of inaccessible web sites. Because all wireless-to-wired and Internet traffic goes through this cluster, the firewall changes were the obvious culprit. The logs showed that we had run out of space in the connection tracking table:
May 26 08:55:05 corella1 kernel: ip_conntrack: table full, dropping packet.
May 26 08:55:13 corella1 kernel: ip_conntrack: table full, dropping packet.
May 26 08:55:15 corella1 kernel: ip_conntrack: table full, dropping packet.
I checked the counters in /proc/sys/net/ipv4/netfilter/, upped the limit for net.ipv4.netfilter.ip_conntrack_max in /etc/sysctl.conf to 4 times its previous value, and loaded the new value into /proc.
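As a minimal sketch of that check, the following Python reads the two counters and reports how full the table is. The directory is the one mentioned above; the function names are made up for illustration, and the file names assume the older ip_conntrack naming used on this kernel.

```python
from pathlib import Path

# Location of the counters, as mentioned above (older ip_conntrack naming).
PROC_DIR = Path("/proc/sys/net/ipv4/netfilter")


def read_counter(name, proc_dir=PROC_DIR):
    """Return the integer stored in one of the netfilter /proc files."""
    return int((Path(proc_dir) / name).read_text().strip())


def conntrack_usage(proc_dir=PROC_DIR):
    """Return (count, maximum, percent used) for the connection tracking table."""
    count = read_counter("ip_conntrack_count", proc_dir)
    maximum = read_counter("ip_conntrack_max", proc_dir)
    return count, maximum, 100.0 * count / maximum
```

Calling conntrack_usage() on the firewall itself returns the live figures; passing a different directory makes it easy to try out against a couple of test files.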
Then i started to hack up a few little scripts to monitor and graph ip_conntrack_count against ip_conntrack_max using rrdtool. I've used rrdtool a little before, so i thought it would be pretty straightforward. I created my RRD file and started updating it every minute with the latest counters from netfilter. However, as soon as i tried to graph it i got this error:
ERROR: parameter 'cnt' does not represent a number in line AREA:cnt#00FF00:count\n
A Google search brought up a lot of hits which contained the same text but were not relevant - most of them were caused by specifying the variable incorrectly. However, i came across one very similar problem: https://lists.oetiker.ch/pipermail/rrd-users/2007-November/013277.html
Unfortunately, this post on the rrdtool users mailing list had no responses, so i was left to solve it myself. It took me some time to realise that the original poster and i had made exactly the same elementary mistake: forgetting to include a filename for the graph output. rrdtool's command line parser does not catch this error (at least not as of version 1.2.12 on SUSE Linux Enterprise Server), and the result is a very confusing error message.
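To make the mistake concrete, here is a hedged Python sketch that builds an rrdtool graph command line with the output filename in the first position. Presumably, when the filename is left out, rrdtool takes the first DEF as the filename, so 'cnt' is never defined by the time the AREA is parsed. The data source names (count, limit), the 'lim' variable and the colours are assumptions for illustration; only 'cnt' and the AREA element come from the error above.

```python
def graph_args(png_file, rrd_file):
    """Build an 'rrdtool graph' command; the output file must come first."""
    if ":" in png_file:
        # A DEF/AREA/LINE element in the filename slot suggests that the
        # filename was forgotten (crude check, illustration only).
        raise ValueError(f"{png_file!r} looks like a graph element, not a filename")
    return [
        "rrdtool", "graph", png_file,
        f"DEF:cnt={rrd_file}:count:AVERAGE",  # 'count' is an assumed DS name
        f"DEF:lim={rrd_file}:limit:AVERAGE",  # 'limit' is an assumed DS name
        "AREA:cnt#00FF00:count",
        "LINE1:lim#FF0000:max",
    ]
```

The resulting list can be handed to subprocess.run(); the check on the first argument is only a crude guard against repeating the same slip.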
So then i had a working rrd graph on my firewall, which seems to have settled down nicely. You can find the current (very rough) state of the scripts at https://github.com/paulgear/puppet/tree/2b5363a3fbc1e73d5d88158e93ab5d879910173b/modules/netfilter/files.
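For readers who don't want to dig through the repository, the create-and-update half of such a setup might look roughly like the Python below. This is not the code from the scripts linked above, just a minimal sketch: the data source names (count, limit), the heartbeat of two steps and the one-day archive are all assumptions.

```python
STEP = 60  # seconds between updates, matching the one-minute interval


def create_args(rrd_file, step=STEP):
    """Build an 'rrdtool create' command with two gauges and one day of samples."""
    heartbeat = step * 2  # allow one missed update before a value becomes unknown
    return [
        "rrdtool", "create", rrd_file,
        "--step", str(step),
        f"DS:count:GAUGE:{heartbeat}:0:U",     # assumed name for ip_conntrack_count
        f"DS:limit:GAUGE:{heartbeat}:0:U",     # assumed name for ip_conntrack_max
        f"RRA:AVERAGE:0.5:1:{86400 // step}",  # one row per step for 24 hours
    ]


def update_args(rrd_file, count, maximum, timestamp="N"):
    """Build an 'rrdtool update' command; 'N' tells rrdtool to use the current time."""
    return ["rrdtool", "update", rrd_file, f"{timestamp}:{count}:{maximum}"]
```

Both functions only build argument lists, so they can be inspected or logged before being passed to subprocess.run() from a once-a-minute cron job.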
At the moment i'm only graphing the connection tracking count vs. its maximum (see the graph below). Note the interesting minor variation in the graphed max value, even though the value itself isn't actually changing. This seems to be due to rrdtool's consolidation of data points - the change to a solid line was effected by truncating the date to an exact multiple of the step interval that the rrd was set up with (in this case, 60 seconds).
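The truncation itself is a one-liner. A small Python sketch, assuming "the date" here means the Unix timestamp passed to rrdtool with each update (the function name is made up):

```python
STEP = 60  # step interval the rrd was set up with, in seconds


def align_to_step(timestamp, step=STEP):
    """Round a Unix timestamp down to the nearest exact multiple of the step."""
    seconds = int(timestamp)
    return seconds - seconds % step
```

For example, align_to_step(125) gives 120, and a timestamp that already falls on a step boundary is returned unchanged.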
After getting this working, i wondered whether there were other conntrack values i should be checking (ip_conntrack_tcp_be_liberal and ip_conntrack_tcp_loose sounded particularly interesting), so i went looking for documentation on the files in /proc/sys/net/ipv4/netfilter/. Initial searches came up with very little. The best description i could find of them was at http://netfilter.linux-kernel.at/documentation/pomlist/pom-extra.html#tcp-window-tracking, but i must admit that i crave more detail. If anyone can point me to a better reference, or suggest which conntrack items really need monitoring, please drop me a line.