A Story About check_ntp_time

Poor neglected blog. I've been ignoring you for months and feeling bad about it. Well, it's Friday, I'm slightly drunk, and I've just spent an hour doing some dopey ass shit. Let's do this thing.

A Problem

At work we've been deploying a new microservice lately, and I've been getting the ops side of things ready to go. This week that meant deploying servers, provisioning new AWS VPC subnets, some light load testing, blah blah etc etc. All the little things you do when preparing a new service for production.

So the end of the day rolled around and I was basically done. Good job, self! You did it. Feel free to enjoy your weekend. The only thing left to set up was Nagios. I deployed our standardized set of basic checks, sat back, and started thinking about paying a visit to Xur.

Except... huh

[Screenshot: the "NTP Sync" Nagios check going critical]

(Yes, I recreated the failure to take this screenshot.)

NTP

Before going further, let me say a few things about NTP. NTP is a fucking pain in the ass. It's an ancient protocol written during The Before Times. ntpd is a god damn nightmare. The security model in "secure" NTP is bullshit and complicated and hard to deploy at scale. The documentation is a joke. Stratum? Truechimer? Falseticker? Drift file? I have to know about fucking leap seconds?!? Fucking what the fuck shit bullshit mother fucker arghasdlkbanoadfidfbw23abrusu.

Okay, I'm sorry.

The other thing about NTP is that it is super critical. Servers with incorrect clocks can cause all kinds of havoc in distributed systems, not to mention they make correlating logs really hard. Despite hating the living shit out of it, I take NTP very seriously.

Debugging NTP Connectivity

So, okay, back to the failed NTP check. That's weird, right? So I ssh into the server and check that NTP is working:

ntpq> pe
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+ntpserver1 216.239.34.15    3 u   73  256  377    0.661    0.075   0.051
-ntpserver2 216.239.34.15    3 u  145  256  377    0.578    0.621   0.109
*ntpserver3 216.239.38.15    3 u  213  256  377    0.223   -0.035   0.021
+ntpserver4 216.239.34.15    3 u  159  256  377    1.559    0.212   0.031

ntpq> rv
associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
version="ntpd 4.2.6p5@1.2349-o Fri Feb  6 15:24:25 UTC 2015 (1)",
processor="x86_64", system="Linux/3.13.0-44-generic", leap=00, stratum=4,
precision=-22, rootdelay=13.590, rootdisp=74.632, refid=192.168.111.96,
reftime=d8d2f923.fd6a1a66  Sat, Apr 11 2015  1:36:35.989,
clock=d8d2fd2d.2801e8af  Sat, Apr 11 2015  1:53:49.156, peer=45213, tc=8,
mintc=3, offset=0.086, frequency=8.937, sys_jitter=0.100,
clk_jitter=0.080, clk_wander=0.012
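(Quick decoder ring, since ntpq output is cryptic: the * marks the peer ntpd is actually synced to, + marks good candidates, - marks outliers, and reach 377 means the last eight polls all got answers. If you'd rather skip the interactive prompt, the same info is available non-interactively:)

ntpq -pn     # same as the interactive "pe", with numeric addresses
ntpq -c rv   # same as the interactive "rv"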

Alright, looks good. ntpd is syncing with my internal NTP servers. Now let's manually run the Nagios plugin that's failing, check_ntp_time, which is executed locally on the node via check_by_ssh:

root@myserver-ec2-us-east-1b:/usr/lib/nagios/plugins# ./check_ntp_time -H ntpserver1-ec2-us-east-1a -v
sending request to peer 0
re-sending request to peer 0
...
re-sending request to peer 0
NTP CRITICAL: No response from NTP server

"No response from NTP server" means the response packets never got back to the server. But ntpd has been setting the time. Strange. Let's try ntpdate:

root@myserver-ec2-us-east-1b:/usr/lib/nagios/plugins# ntpdate -vv ntpserver1-ec2-us-east-1a
11 Apr 02:08:19 ntpdate[12410]: ntpdate 4.2.6p5@1.2349-o Sat Dec 20 11:47:15 UTC 2014 (1)
11 Apr 02:08:25 ntpdate[12410]: adjust time server 192.168.110.40 offset -0.000461 sec
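(Side note: if you don't want ntpdate actually adjusting the clock while you poke at things, it has a query-only mode:)

ntpdate -q ntpserver1-ec2-us-east-1a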

tcpdump Solves Every Problem

So ntpdate definitely works, but check_ntp_time does not. Let's do a packet capture with tcpdump udp port 123 to compare the two:
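(The invocation on the client was something along these lines; the interface name is whatever your box actually has:)

tcpdump -ni eth0 udp port 123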

# check_ntp_time never gets a response
02:11:23.632804 IP myserver.46758 > ntpserver1.ntp: NTPv4, Client, length 48
02:11:24.033512 IP myserver.46758 > ntpserver1.ntp: NTPv4, Client, length 48
02:11:25.035296 IP myserver.46758 > ntpserver1.ntp: NTPv4, Client, length 48
...
# ntpdate can communicate just fine
02:12:15.531179 IP myserver.ntp > ntpserver1.ntp: NTPv4, Client, length 48
02:12:15.532068 IP ntpserver1.ntp > myserver.ntp: NTPv4, Server, length 48
02:12:17.531199 IP myserver.ntp > ntpserver1.ntp: NTPv4, Client, length 48
...

Now let's do a packet capture on the NTP server in question:

# ntpserver1 responding to check_ntp_time
03:16:46.430854 IP myserver.40144 > ntpserver1.ntp: NTPv4, Client, length 48
03:16:46.430934 IP ntpserver1.ntp > myserver.40144: NTPv4, Server, length 48
03:16:46.431638 IP myserver.40144 > ntpserver1.ntp: NTPv4, Client, length 48

So the NTP server does get the packet from check_ntp_time and does actually respond. We're successfully receiving responses for ntpdate but not for check_ntp_time. What's going on here?

Whoops

What's that phrase about assuming things? I had been assuming that check_ntp_time worked like ntpdate or ntpd and communicated both ways on port 123, but that was totally not the case.

Look back at the client-side captures: ntpdate talks from source port 123 (that's the myserver.ntp in the output), but check_ntp_time sends from an ephemeral high port (myserver.46758). The NTP server replies to whatever port the request came from, so check_ntp_time's response comes back to that ephemeral port, not to UDP 123. The server-side capture shows the reply leaving, so clearly something in the network is eating it on the way back.
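(If you want to see that for yourself, widen the capture filter so it isn't pinned to port 123, and note that ntpdate can be told to use an unprivileged source port too, which should make it hang exactly like the plugin does when the return path is blocked. Interface and hostname here are just for illustration:)

tcpdump -ni eth0 udp and host ntpserver1-ec2-us-east-1a
ntpdate -u -q ntpserver1-ec2-us-east-1a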

A visit to the AWS VPC control panel quickly revealed the issue in my network ACL:

[Screenshot: the VPC network ACL rules, with rule 501 set wrong]

Rule 501 is, of course, supposed to be a custom UDP rule allowing responses back in on ephemeral ports. Whoops. check_ntp_time didn't work because the response packet from the NTP server was dropped by the ACL and never made it back to the instance.
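(The underlying gotcha is that VPC network ACLs are stateless: unlike security groups, they won't automatically let return traffic back in, so you need an explicit inbound rule covering the ephemeral port range. The fix looks roughly like this with the AWS CLI; the ACL ID is a placeholder, and you'd want to scope the CIDR down to something tighter than 0.0.0.0/0:)

aws ec2 replace-network-acl-entry \
    --network-acl-id acl-0123abcd \
    --ingress \
    --rule-number 501 \
    --protocol udp \
    --port-range From=1024,To=65535 \
    --cidr-block 0.0.0.0/0 \
    --rule-action allow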

Conclusion

  • check_ntp_time sends from an ephemeral source port, so its response comes back to that port, not to UDP 123
  • Don't drink and configure firewalls.