The School for Sysadmins Who Can't Timesync Good and Wanna Learn To Do Other Stuff Good Too, part 2

(Part 1 covered the background and rationale. Part 3 is about installation and configuration.)

What is NTP?

NTP (Network Time Protocol) is an Internet standard for time synchronisation covered by multiple RFCs. "NTP is [arguably] the longest running, continuously operating, ubiquitously available protocol in the Internet" [Mills]. It has been operating since 1985, several years before Tim Berners-Lee invented the WWW. The current version is NTPv4, described in RFC 5905, which also covers SNTP (Simple NTP), a more limited version designed mostly for clients.

Whilst there are multiple different implementations of NTP, I'll be focusing on the reference implementation, from the Network Time Foundation, because that's what I'm most familiar with, and because it has the most online reference material available.

How Linux keeps time

Linux and other Unix-like kernels maintain a system clock which is set at system boot time from a hardware real time clock (RTC), and is maintained by regular interrupts from a timing circuit, usually a crystal oscillator.

The kernel clock is maintained in UTC, counted as the number of seconds since midnight on 1 January 1970 UTC. Applications can read the system clock via time(2), gettimeofday(2), and clock_gettime(2); the last two offer microsecond and nanosecond resolution respectively.
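As a rough illustration, Python's standard time module wraps these calls. This is only a sketch, and it assumes a Linux or other Unix-like system with Python 3.7 or later, where time.clock_gettime_ns is available:

```python
import time

# time.time() returns seconds since the epoch (1 January 1970 UTC) as a float.
seconds = time.time()

# clock_gettime_ns() wraps clock_gettime(2) and returns integer nanoseconds.
nanoseconds = time.clock_gettime_ns(time.CLOCK_REALTIME)

print(f"seconds since epoch:     {seconds:.6f}")
print(f"nanoseconds since epoch: {nanoseconds}")
print("as UTC:", time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(seconds)))
```

Both values come from the same system clock, so they should agree to within the time it takes to make the two calls.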

System calls are available to set the time if it needs to change (called "stepping" the clock), but the more commonly-used technique is to ask the kernel to adjust the system clock gradually via the adjtime(3) library function or adjtimex(2) system call (called "slewing" the clock). Slewing ensures that the clock counter continues to increase rather than jumping suddenly (even if the clock needs to be adjusted backwards), by making slight changes in the length of seconds on the system clock. If the clock needs to go forwards, the seconds are shortened (sped up) slightly until true time is reached; if the clock needs to go backwards, the seconds are lengthened (slowed down) slightly until true time catches up. (There are other interesting timing functions supported by the Linux kernel; see the documentation for more.)
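To get a feel for how gradual slewing is, here is a back-of-the-envelope sketch. The function name is made up for illustration, and the 500 parts-per-million maximum slew rate is an assumption (it is the figure commonly quoted for ntpd and the Linux kernel, but check your own system):

```python
def slew_duration(offset_seconds: float, max_slew_ppm: float = 500.0) -> float:
    """Return the seconds of real time needed to slew away a clock offset.

    max_slew_ppm is an assumed limit on how far the length of a second
    may be stretched or shrunk, in parts per million.
    """
    rate = max_slew_ppm / 1_000_000  # seconds corrected per second of real time
    return abs(offset_seconds) / rate


for offset in (0.010, 0.128, 1.0):
    print(f"{offset:6.3f} s offset -> {slew_duration(offset):7.1f} s of slewing")
```

Under that assumption, correcting a full second by slewing takes about 2000 seconds (over half an hour), which is why large offsets are usually stepped instead.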

Because oscillators are imperfect, system time always differs from UTC by some amount. Better-quality hardware stays within a very small variance from true time (unnoticeable by humans), while cheap hardware can be out by quite significant amounts. Clock accuracy is also affected by other factors such as temperature, humidity, and even system load. NTP is designed to receive timing information from external sources and use clock slewing (or stepping, where necessary) to keep the system clock as close as possible to true UTC.

How NTP works

The notion of one true time is central to how NTP operates, and it has numerous checks and balances designed to keep your system zeroing in on that one true time. (For a more detailed and authoritative explanation, see Mills' "Notes on setting up a NTP subnet".)

Polling

NTP's primary means of determining the correct time is simply to ask for it! An NTP server polls other NTP servers (on UDP port 123) or other time sources (more on this below) for their current time, measures how long the request takes to get there and back, and analyses the results to determine which sources represent the true time. The polling process is very efficient and can support huge numbers of clients with a minimum of bandwidth.
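The "there and back" measurement can be sketched with the four timestamps of a single poll, using the standard offset and delay calculation from RFC 5905. The function name and the sample timestamps below are made up for illustration:

```python
def offset_and_delay(t1: float, t2: float, t3: float, t4: float):
    """Estimate clock offset and round-trip delay from one poll.

    t1: client sends the request    (client clock)
    t2: server receives the request (server clock)
    t3: server sends the reply      (server clock)
    t4: client receives the reply   (client clock)
    """
    delay = (t4 - t1) - (t3 - t2)         # time on the wire, both directions
    offset = ((t2 - t1) + (t3 - t4)) / 2  # how far the client is behind the server
    return offset, delay


# Hypothetical poll: the client clock is 0.150 s behind the server, each
# network leg takes 0.020 s, and the server spends 0.001 s on the reply.
offset, delay = offset_and_delay(100.000, 100.170, 100.171, 100.041)
print(f"offset = {offset:+.3f} s, delay = {delay:.3f} s")
```

The offset estimate is only exact when the outbound and return legs take the same time; any asymmetry in the network path shows up as error in the offset.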

An NTP poll happens at intervals ranging from 8 seconds to 36 hours (going up in powers of two), with 64 seconds to 1024 seconds being the default range. The NTP daemon will automatically adjust its polling interval for each source based on the previous responses it has received. On most systems with a reliable clock and reliable time sources, poll times will settle on the maximum within a few hours of the NTP daemon being started. Here's an example from one of my systems:

$ ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+172.22.254.1    172.22.254.53    2 u  255 1024  177    0.527    0.082   2.488
*172.22.254.53   .NMEA.           1 u   37   64  376    0.598    0.150   2.196
-192.189.54.17   130.95.179.80    2 u 1067 1024  377   44.964   -1.948   0.764
+192.189.54.33   130.95.179.80    2 u  101 1024  377   32.703   -1.666   8.223
+129.127.40.3    130.95.179.80    2 u  953 1024  377   55.609   -0.120   6.276
-2001:4478:fe00: 216.218.254.202  2 u   76 1024  377   35.971    4.814   1.848
-2001:67c:1560:8 17.253.34.125    2 u 1017 1024  377  376.041   -3.303   4.412
+162.213.34.249  17.253.34.253    2 u 1004 1024  377  325.680    1.469  38.157

The sixth column (poll) is the polling interval in seconds, which is 1024 for all but one of this system's peers. (More on how to interpret the output of ntpq will come in a later post.)
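If you want to pull the poll interval out of that output programmatically, a minimal sketch might look like the following. The parse_peer helper is hypothetical, and it assumes every row has the ten whitespace-separated columns shown above with a one-character tally code in front of the remote address:

```python
# Three rows copied from the ntpq output above.
SAMPLE = """\
+172.22.254.1    172.22.254.53    2 u  255 1024  177    0.527    0.082   2.488
*172.22.254.53   .NMEA.           1 u   37   64  376    0.598    0.150   2.196
-2001:4478:fe00: 216.218.254.202  2 u   76 1024  377   35.971    4.814   1.848
"""


def parse_peer(line: str) -> dict:
    """Split one peer row into its tally code, remote address and poll interval."""
    tally, fields = line[0], line[1:].split()
    return {"tally": tally, "remote": fields[0], "poll": int(fields[5])}


peers = [parse_peer(line) for line in SAMPLE.splitlines()]
for peer in peers:
    # Poll intervals are powers of two, so recover the exponent as well.
    exponent = peer["poll"].bit_length() - 1
    print(f'{peer["remote"]:<16} polls every {peer["poll"]:>4} s (2**{exponent})')
```

The exponents 6 and 10 correspond to the default 64 and 1024 second limits mentioned earlier.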

Strata

So if your system gets time from another system on the network, where does that system get its time? NTP time ultimately comes from accurate external sources like atomic clocks, some of which use the caesium atom, the ultimate basis of the standard second, as their reference. Such clocks are expensive, so other sources are used as well, such as radio clocks, stable oscillators, or (perhaps most commonly) the GPS satellite system (which itself uses atomic clocks). These sources are collectively referred to as reference clocks.

In the NTP network, a reference clock is stratum 0 - that is, an authoritative source of time. An NTP server which uses a stratum 0 clock as its time source is stratum 1. Stratum 2 servers get their time from stratum 1 servers; stratum 3 servers get their time from stratum 2 servers, and so on. In practice it's rare to see servers higher than stratum 4 or 5 on the Internet [Mills] [Minar].
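The stratum rule is simple enough to sketch in a few lines. The host names and the sources mapping below are hypothetical; a source of None marks a reference clock:

```python
def stratum(server: str, sources: dict) -> int:
    """Return the stratum of `server`: one more than the source it syncs to."""
    source = sources[server]
    if source is None:  # a reference clock is stratum 0 by definition
        return 0
    return stratum(source, sources) + 1


# Hypothetical chain: a GPS receiver feeding a small hierarchy of servers.
sources = {
    "gps-receiver": None,
    "ntp1.example.net": "gps-receiver",
    "ntp2.example.net": "ntp1.example.net",
    "desktop": "ntp2.example.net",
}

for name in sources:
    print(f"{name:<18} stratum {stratum(name, sources)}")
```

In the ntpq output earlier, the same relationship is visible in the st column: the peer whose refid is .NMEA. (a GPS reference clock) is stratum 1, and the servers that sync from other servers are stratum 2.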

Stratum 1 servers are connected to their stratum 0 sources via local hardware such as a serial port or expansion card slot. The reason we have additional strata after stratum 1 is to ensure that there are enough servers to cope with the load from all the clients. As far as possible, network delay (latency) between strata should be kept to a minimum.

Algorithms

NTP uses a number of different algorithms to ensure that the time it receives is accurate [Mills]. Knowing how these algorithms work at a basic level can help us avoid configuration mistakes later, so we'll look at them here briefly:

filtering - The poll results from each time source are filtered in order to produce the most accurate results [Mills].
selection (a.k.a. intersection) - The results from all sources are compared to determine which ones can potentially represent the true time (the truechimers), and those which cannot (called falsetickers) are discarded from further calculations [Mills].
clustering - The surviving time sources from the selection algorithm are combined using statistical techniques [Mills].
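To make the three stages a little more concrete, here is a deliberately simplified sketch. It is not ntpd's actual code: the function names and sample values are made up, the filter just keeps the lowest-delay sample from the last 8 polls, the selection step is a bare-bones interval intersection, and a plain mean stands in for the statistical combining:

```python
from statistics import mean


def clock_filter(samples, window=8):
    """Pick the (offset, delay) sample with the lowest delay among recent polls."""
    return min(samples[-window:], key=lambda s: s[1])


def select(sources):
    """Split sources into (survivors, falsetickers).

    sources maps a name to (offset, error_bound). Each source claims the true
    offset lies in [offset - error_bound, offset + error_bound]; the survivors
    are the largest group of sources whose intervals all overlap.
    """
    edges = []
    for offset, err in sources.values():
        edges.append((offset - err, -1))  # interval start
        edges.append((offset + err, +1))  # interval end
    edges.sort()  # at equal points, starts (-1) sort before ends (+1)

    best = count = 0
    lo = hi = 0.0
    for i, (point, kind) in enumerate(edges):
        count -= kind  # +1 on entering an interval, -1 on leaving one
        if count > best:  # only ever true at a start, so edges[i + 1] exists
            best, lo, hi = count, point, edges[i + 1][0]

    survivors = sorted(
        name for name, (o, e) in sources.items() if o - e <= lo and o + e >= hi
    )
    falsetickers = sorted(set(sources) - set(survivors))
    return survivors, falsetickers


# Filtering: three hypothetical polls of one source, as (offset, delay) in ms.
history = [(0.30, 5.2), (0.10, 0.6), (0.25, 3.1)]
best_sample = clock_filter(history)

# Selection: four hypothetical sources, as (offset, error_bound) in ms.
sources = {"a": (0.1, 2.0), "b": (-1.5, 3.0), "c": (1.0, 2.5), "d": (40.0, 5.0)}
survivors, falsetickers = select(sources)

# Combining: a plain mean of the surviving offsets.
combined = mean(sources[name][0] for name in survivors)

print("filtered sample:", best_sample)
print("survivors:", survivors, "falsetickers:", falsetickers)
print(f"combined offset: {combined:+.3f} ms")
```

Source d is rejected because its interval does not overlap with any of the others, which is the essence of why NTP wants several independent sources: with only one or two, there is no majority to outvote a bad clock.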

Read on in part 3 - installation and configuration, where we'll explore how to install and configure NTP on an Ubuntu Linux 16.04 system.
