Poor neglected blog. I've been ignoring you for months and feeling bad about it. Well, it's Friday, I'm slightly drunk, and I've just spent an hour doing some dopey ass shit. Lets do this thing.
At work we are deploying a new microservice lately, and I've been getting the ops side of things ready to go. This week I've been getting stuff set up for the new service: deploying servers, provisioning new AWS VPC subnets, some light load testing, blah blah etc etc. All the little things you do when preparing a new service for production.
So the end of the day rolled around and I was basically done. Good job, self! You did it. Feel free to enjoy your weekend. The only thing left to set up was Nagios. I deployed our standardized set of basic checks, sat back, and started thinking about paying a visit to Xur.
(Yes, I recreated the failure to take this screenshot.)
Before going further, I will say some things now about NTP. NTP is a fucking pain in the ass. It's an ancient protocol written during The Before Times.
ntpd is a god damn nightmare. The security model in "secure" NTP is bullshit and complicated and hard to deploy at scale. The documentation is a joke. Stratum? True chimer? False ticker? Drift file? I have to know about fucking leapseconds?!? Fucking what the fuck shit bullshit mother fucker arghasdlkbanoadfidfbw23abrusu.
Okay, I'm sorry.
The other thing about NTP is that it is super critical. Servers with incorrect clocks can cause all kinds of havoc in distributed systems, not to mention it makes correlating logs really hard. Despite hating the living shit out of it, I take NTP very seriously.
Debugging NTP Connectivity
So, okay, back to the failed NTP check. That's weird right? So I ssh into the server and check that NTP is working:
ntpq> pe remote refid st t when poll reach delay offset jitter ============================================================================== +ntpserver1 188.8.131.52 3 u 73 256 377 0.661 0.075 0.051 -ntpserver2 184.108.40.206 3 u 145 256 377 0.578 0.621 0.109 *ntpserver3 220.127.116.11 3 u 213 256 377 0.223 -0.035 0.021 +ntpserver4 18.104.22.168 3 u 159 256 377 1.559 0.212 0.031 ntpq> rv associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync, version="ntpd firstname.lastname@example.org Fri Feb 6 15:24:25 UTC 2015 (1)", processor="x86_64", system="Linux/3.13.0-44-generic", leap=00, stratum=4, precision=-22, rootdelay=13.590, rootdisp=74.632, refid=192.168.111.96, reftime=d8d2f923.fd6a1a66 Sat, Apr 11 2015 1:36:35.989, clock=d8d2fd2d.2801e8af Sat, Apr 11 2015 1:53:49.156, peer=45213, tc=8, mintc=3, offset=0.086, frequency=8.937, sys_jitter=0.100, clk_jitter=0.080, clk_wander=0.012
Alright, looks good.
ntpd is syncing with my internal NTP servers. Now lets manually run the Nagios plugin that's failing,
check_ntp_time, which is executed locally on the node via
root@myserver-ec2-us-east-1b:/usr/lib/nagios/plugins# ./check_ntp_time -H ntpserver1-ec2-us-east-1a -v sending request to peer 0 re-sending request to peer 0 ... re-sending request to peer 0 NTP CRITICAL: No response from NTP server
"No response from NTP server" means the response packets never got back to the server. But
ntpd has been setting the time. Strange. Let's try
root@myserver-ec2-us-east-1b:/usr/lib/nagios/plugins# ntpdate -vv ntpserver1-ec2-us-east-1a 11 Apr 02:08:19 ntpdate: ntpdate email@example.com Sat Dec 20 11:47:15 UTC 2014 (1) 11 Apr 02:08:25 ntpdate: adjust time server 192.168.110.40 offset -0.000461 sec
tcpdump Solves Every Problem
ntpdate definitely works, but
check_ntp_time does not. Lets do a packet capture with
tcpdump udp port 123 to compare the two:
# check_ntp_time never gets a response 02:11:23.632804 IP myserver.46758 > ntpserver1.ntp: NTPv4, Client, length 48 02:11:24.033512 IP myserver.46758 > ntpserver1.ntp: NTPv4, Client, length 48 02:11:25.035296 IP myserver.46758 > ntpserver1.ntp: NTPv4, Client, length 48 ...
# ntpdate can communicate just fine 02:12:15.531179 IP myserver.ntp > ntpserver1.ntp: NTPv4, Client, length 48 02:12:15.532068 IP ntpserver1.ntp > myserver.ntp: NTPv4, Server, length 48 02:12:17.531199 IP myserver.ntp > ntpserver1.ntp: NTPv4, Client, length 48 ...
Now lets do a packet capture on the NTP server in question:
# ntpserver1 responding to check_ntp_time 03:16:46.430854 IP myserver.40144 > ntpserver1.ntp: NTPv4, Client, length 48 03:16:46.430934 IP ntpserver1.ntp > myserver.40144: NTPv4, Server, length 48 03:16:46.431638 IP myserver.40144 > ntpserver1.ntp: NTPv4, Client, length 48
So the NTP server does get the packet from
check_ntp_time and does actually respond. We're successfully recieving responses for
ntpdate but not for
check_ntp_time. What's going on here?
What's that phrase about assuming things? I had been assuming that
check_ntp_time worked like
ntpd and communicated both ways on port 123, but that was totally not the case.
tcpdump reveals that
check_ntp_time never gets anything back, so that must mean that the response packet is coming back on an ephemeral high port and not on UDP 123. Clearly there must be some networking problem getting in the way.
A visit to the AWS VPC control panel quickly revealed the issue in my network ACL:
Rule 501 is, of course, supposed to be a custom UDP rule, to allow ephemeral responses. Whoops.
check_ntp_time didn't work because the response packet from the NTP server was dropped by the ACL and never made it back to the instance.
check_ntp_timegets a response back on an ephemeral port, not UDP 123
- Don't drink and configure firewalls.