Poor neglected blog. I've been ignoring you for months and feeling bad about it. Well, it's Friday, I'm slightly drunk, and I've just spent an hour doing some dopey ass shit. Let's do this thing.
A Problem
At work we've been deploying a new microservice lately, and I've been getting the ops side of things ready to go. This week that meant getting everything set up for the new service: deploying servers, provisioning new AWS VPC subnets, some light load testing, blah blah etc etc. All the little things you do when preparing a service for production.
So the end of the day rolled around and I was basically done. Good job, self! You did it. Feel free to enjoy your weekend. The only thing left to set up was Nagios. I deployed our standardized set of basic checks, sat back, and started thinking about paying a visit to Xur.
Except... huh
[Screenshot: Nagios showing the NTP check in CRITICAL]
(Yes, I recreated the failure to take this screenshot.)
NTP
Before going further, let me say a few things about NTP. NTP is a fucking pain in the ass. It's an ancient protocol written during The Before Times. ntpd is a god damn nightmare. The security model in "secure" NTP is bullshit and complicated and hard to deploy at scale. The documentation is a joke. Stratum? True chimer? False ticker? Drift file? I have to know about fucking leap seconds?!? Fucking what the fuck shit bullshit mother fucker arghasdlkbanoadfidfbw23abrusu.
Okay, I'm sorry.
The other thing about NTP is that it is super critical. Servers with incorrect clocks can cause all kinds of havoc in distributed systems, not to mention it makes correlating logs really hard. Despite hating the living shit out of it, I take NTP very seriously.
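To make the log-correlation point concrete, here's a toy Python example (hostnames and timestamps invented for illustration): if one box's clock is 30 seconds slow, merging logs by timestamp puts the response before the request that caused it.

import operator

# Two hosts log a request/response pair, but db-1's clock is 30 seconds slow.
logs = [
    ("api-1", "2015-04-11T01:53:49.100Z", "sent query to db-1"),
    ("db-1",  "2015-04-11T01:53:19.400Z", "answered query from api-1"),
]

# Merge by timestamp, like a log aggregator would:
for host, ts, msg in sorted(logs, key=operator.itemgetter(1)):
    print(ts, host, msg)

# Output shows the answer arriving 30 seconds *before* the question:
#   2015-04-11T01:53:19.400Z db-1 answered query from api-1
#   2015-04-11T01:53:49.100Z api-1 sent query to db-1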
Debugging NTP Connectivity
So, okay, back to the failed NTP check. That's weird, right? I ssh into the server and check that NTP is working:
ntpq> pe
remote refid st t when poll reach delay offset jitter
==============================================================================
+ntpserver1 216.239.34.15 3 u 73 256 377 0.661 0.075 0.051
-ntpserver2 216.239.34.15 3 u 145 256 377 0.578 0.621 0.109
*ntpserver3 216.239.38.15 3 u 213 256 377 0.223 -0.035 0.021
+ntpserver4 216.239.34.15 3 u 159 256 377 1.559 0.212 0.031
ntpq> rv
associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
version="ntpd 4.2.6p5@1.2349-o Fri Feb 6 15:24:25 UTC 2015 (1)",
processor="x86_64", system="Linux/3.13.0-44-generic", leap=00, stratum=4,
precision=-22, rootdelay=13.590, rootdisp=74.632, refid=192.168.111.96,
reftime=d8d2f923.fd6a1a66 Sat, Apr 11 2015 1:36:35.989,
clock=d8d2fd2d.2801e8af Sat, Apr 11 2015 1:53:49.156, peer=45213, tc=8,
mintc=3, offset=0.086, frequency=8.937, sys_jitter=0.100,
clk_jitter=0.080, clk_wander=0.012
Alright, looks good. ntpd is syncing with my internal NTP servers (the * next to ntpserver3 marks the peer it's currently synced to). Now let's manually run the Nagios plugin that's failing, check_ntp_time, which is executed locally on the node via check_by_ssh:
root@myserver-ec2-us-east-1b:/usr/lib/nagios/plugins# ./check_ntp_time -H ntpserver1-ec2-us-east-1a -v
sending request to peer 0
re-sending request to peer 0
...
re-sending request to peer 0
NTP CRITICAL: No response from NTP server
"No response from NTP server" means the response packets never got back to the server. But ntpd
has been setting the time. Strange. Let's try ntpdate
:
root@myserver-ec2-us-east-1b:/usr/lib/nagios/plugins# ntpdate -vv ntpserver1-ec2-us-east-1a
11 Apr 02:08:19 ntpdate[12410]: ntpdate 4.2.6p5@1.2349-o Sat Dec 20 11:47:15 UTC 2014 (1)
11 Apr 02:08:25 ntpdate[12410]: adjust time server 192.168.110.40 offset -0.000461 sec
tcpdump Solves Every Problem
So ntpdate definitely works, but check_ntp_time does not. Let's do a packet capture with tcpdump udp port 123 to compare the two:
# check_ntp_time never gets a response
02:11:23.632804 IP myserver.46758 > ntpserver1.ntp: NTPv4, Client, length 48
02:11:24.033512 IP myserver.46758 > ntpserver1.ntp: NTPv4, Client, length 48
02:11:25.035296 IP myserver.46758 > ntpserver1.ntp: NTPv4, Client, length 48
...
# ntpdate can communicate just fine
02:12:15.531179 IP myserver.ntp > ntpserver1.ntp: NTPv4, Client, length 48
02:12:15.532068 IP ntpserver1.ntp > myserver.ntp: NTPv4, Server, length 48
02:12:17.531199 IP myserver.ntp > ntpserver1.ntp: NTPv4, Client, length 48
...
Now let's do a packet capture on the NTP server in question:
# ntpserver1 responding to check_ntp_time
03:16:46.430854 IP myserver.40144 > ntpserver1.ntp: NTPv4, Client, length 48
03:16:46.430934 IP ntpserver1.ntp > myserver.40144: NTPv4, Server, length 48
03:16:46.431638 IP myserver.40144 > ntpserver1.ntp: NTPv4, Client, length 48
So the NTP server does get the packet from check_ntp_time and does actually respond. We're successfully receiving responses for ntpdate but not for check_ntp_time. What's going on here?
Whoops
What's that phrase about assuming things? I had been assuming that check_ntp_time worked like ntpdate or ntpd and communicated both ways on port 123, but that was totally not the case. Look at the captures again: check_ntp_time sends from an ephemeral source port (46758 above), so the NTP server addresses its response to that high port, not to UDP 123, while ntpdate, running as root, binds local port 123 and gets its replies there. tcpdump on myserver shows check_ntp_time's response never arriving at all, so something on the network must be dropping packets destined for those ephemeral high ports.
A visit to the AWS VPC control panel quickly revealed the issue in my network ACL:
[Screenshot: the VPC network ACL rule list, with rule 501 misconfigured]
Rule 501 is, of course, supposed to be a custom UDP rule allowing those ephemeral-port responses back in. Whoops. check_ntp_time didn't work because the response packet from the NTP server was dropped by the ACL and never made it back to the instance.
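The fix was a two-second edit in the console, but if you'd rather script it, the corrected inbound rule would look something like this boto3 sketch (the ACL ID and CIDR are stand-ins for mine):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_network_acl_entry(
    NetworkAclId="acl-0123456789abcdef0",   # hypothetical -- use your VPC's ACL
    RuleNumber=501,
    Protocol="17",                          # UDP
    RuleAction="allow",
    Egress=False,                           # inbound rule
    CidrBlock="192.168.0.0/16",             # your internal NTP servers' range
    PortRange={"From": 1024, "To": 65535},  # ephemeral response ports
)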
Conclusion
- check_ntp_time gets a response back on an ephemeral port, not UDP 123
- Don't drink and configure firewalls.