Incident-response without NTP

Published: 2012-05-08
Last Updated: 2012-05-08 20:02:10 UTC
by Kevin Liston (Version: 4)
9 comment(s)


While we patiently await the arrival of this month's patches from Microsoft (and everyone else who publishes today) I have a little thought experiment for you. We all know that the internet doesn't work too efficiently if DNS isn't working or present. NTP is just as critical for your security infrastructure. Without reliable clock synchronization, piecing together what happened during an incident can become extremely difficult.

Consider a hypothetical services network and DMZ: there's an external firewall, a couple of webservers, and an inner firewall with a database server behind it. Let's also assume that something bad happened to the webservers a couple of months ago and you've been brought in as a consultant to piece together the order of events and figure out what the attacker did. The web administration team, the database team, and the firewall team have all fulfilled your request for logs, and you've got them on your system of choice.

More About NTP

For a complete background on NTP I recommend:

There are two main types of clock error that we are concerned with in this example:

  • Clock Skew, also called Accuracy, is how close a clock is to an official time reference.
  • Clock Drift is the change in accuracy over time.

Common clock hardware is not very accurate; an error of 0.001% causes a clock to be off by nearly one second per day.  We can expect most clocks to have one second of drift every 2 days.  The oscillator used in computer clocks can be influenced by changes in local temperature, and the quality of the electricity feeding the system. Update: Joanne wrote in to point out that the accuracies that I've cited in this paragraph are an order of magnitude better than what one would expect in computer hardware.  We'll see later in some example data how optimistic my values were.
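The arithmetic behind those figures is easy to check. Here is a minimal sketch; the function names are my own, and it assumes the frequency error stays constant, which real oscillators do not guarantee.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def drift_seconds_per_day(fractional_error):
    """Seconds gained or lost per day at a constant fractional frequency error."""
    return fractional_error * SECONDS_PER_DAY


def fractional_error_for(seconds, days):
    """Fractional frequency error that accumulates `seconds` of error in `days`."""
    return seconds / (days * SECONDS_PER_DAY)


# 0.001% written as a fraction is 1e-5.
per_day = drift_seconds_per_day(1e-5)
print(f"0.001% error: {per_day:.3f} s/day")

# "One second of drift every 2 days" expressed as a percentage error.
every_two_days = fractional_error_for(1, 2)
print(f"1 s per 2 days: {every_two_days * 100:.5f}% error")
```

A 0.001% error works out to 0.864 seconds per day, which is the "nearly one second" quoted above.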

Today's Challenge

How do you begin to order the events between the systems?  First I'll solicit general approaches via comments and email; later I'll summarize and provide some example data to illustrate the most popular/promising approaches.

Example Data

Let's take a look at what the web team and the firewall team sent to us.


Date        Time      Event                                          Epoch

Web 1:

1/1/1972    13:24:04  First Request from badguy                      262990.55837962962963
1/1/1972    13:24:04  2nd Request from badguy                        262990.55837962962963
1/1/1972    13:24:04  3rd Request from badguy                        262990.55837962962963
1/1/1972    13:24:05  4th Request from badguy                        262990.558391203703704
1/1/1972    13:24:09  5th Request from badguy                        262990.5584375

Web 2:

1/1/1972    13:25:37  First Request from badguy                      262990.559456018518519
1/1/1972    13:25:41  2nd Request from badguy                        262990.559502314814815
1/1/1972    13:25:57  3rd Request from badguy                        262990.5596875
1/1/1972    13:26:49  4th Request from badguy                        262990.560289351851852
1/1/1972    13:26:59  5th Request from badguy                        262990.560405092592593
1/1/1972    13:27:42  6th Request from badguy                        262990.560902777777778

Firewall:

1/1/1972    7:00:41   Accept "tcp_80" "34153" "bad_guy_ip" "web1"    262990.292141203703704
1/1/1972    7:00:43   Accept "tcp_80" "34154" "bad_guy_ip" "web1"    262990.292164351851852
1/1/1972    7:00:45   Accept "tcp_80" "34155" "bad_guy_ip" "web1"    262990.2921875
1/1/1972    7:00:49   Accept "tcp_80" "34156" "bad_guy_ip" "web1"    262990.292233796296296
1/1/1972    7:00:52   Accept "tcp_80" "34157" "bad_guy_ip" "web1"    262990.292268518518518
1/1/1972    7:02:27   Accept "tcp_80" "59498" "bad_guy_ip" "web2"    262990.293368055555556
1/1/1972    7:02:31   Accept "tcp_80" "59499" "bad_guy_ip" "web2"    262990.293414351851852
1/1/1972    7:02:47   Accept "tcp_80" "59500" "bad_guy_ip" "web2"    262990.293599537037037
1/1/1972    7:03:39   Accept "tcp_80" "59501" "bad_guy_ip" "web2"    262990.294201388888889
1/1/1972    7:03:49   Accept "tcp_80" "59502" "bad_guy_ip" "web2"    262990.29431712962963
1/1/1972    7:04:32   Accept "tcp_80" "59503" "bad_guy_ip" "web2"    262990.294814814814815


I've added the epoch column, since that will help some folks apply their favorite methods, and trimmed the logs from the three systems down to the activity of one suspicious IP address.

My Naive Approach

My initial assumption is that we should be able to account for the bias between the clocks over sufficiently small windows of time.  We are unlikely to come up with a simple formula to correct several months' worth of logs.  However, for critical periods, we should be able to knit together log events from multiple systems, identify the clock bias, and account for it in the ultimate investigative timeline.  So my approach is to pick a small time-frame of events, pick a system to be the reference point, tie events together manually a bit, and plot it out to see if there is a simple linear relationship, or if we have other issues.

Immediately we see that there's clearly a timezone difference between the web team and the firewall team, but that's not a big deal at the moment.  Initially we may feel in luck that the firewall can act as a semi-reliable observer to compare the attack against web1 and web2.  Maybe fortune will continue and we can simply shift the times a little to account for clock skew.  The event lasted only a few seconds, so the window should be small enough that drift should be undetectable, right?

No, something's not right.  Compare the elapsed time of the event for web1 and web2, using the firewall as a frame of reference.  The firewall and web2 agree that web2 was visited over 2 minutes and 5 seconds.  Web1, however, records an elapsed time of 5 seconds, while the firewall indicates 11 seconds between the first accept and the last accept.
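Those elapsed times can be reproduced directly from the first and last timestamp in each log. This is a minimal sketch with helper names of my own choosing; it assumes all events fall on the same day, as they do in the example data.

```python
def to_seconds(hms):
    """Convert an H:MM:SS log timestamp to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def elapsed(first, last):
    """Seconds between two same-day timestamps."""
    return to_seconds(last) - to_seconds(first)


# First and last timestamp for each view of the activity, from the example data.
windows = {
    "web1 (own log)": ("13:24:04", "13:24:09"),
    "web1 (firewall)": ("7:00:41", "7:00:52"),
    "web2 (own log)": ("13:25:37", "13:27:42"),
    "web2 (firewall)": ("7:02:27", "7:04:32"),
}

durations = {name: elapsed(*span) for name, span in windows.items()}
for name, secs in durations.items():
    print(f"{name:16s} {secs:4d} s")
```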

Let's plot out the times from the web servers vs. the time noted by the firewall.  Ideally we should see something of a straight line with a slope of one and a y-intercept of zero.  In this case, we're hoping for a slope near one, and a y-intercept that will help identify the timezone used by the firewall or the webservers.
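If you'd rather compute the line than eyeball it, here is a minimal sketch of a least-squares fit over the example data. It assumes that the Nth firewall accept corresponds to the Nth request logged by the web server, which is my own pairing and not something the logs prove; the function and variable names are also mine.

```python
def to_seconds(hms):
    """Convert an H:MM:SS log timestamp to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Timestamps from the example data, paired in log order (an assumption).
FW_WEB1 = ["7:00:41", "7:00:43", "7:00:45", "7:00:49", "7:00:52"]
WEB1 = ["13:24:04", "13:24:04", "13:24:04", "13:24:05", "13:24:09"]
FW_WEB2 = ["7:02:27", "7:02:31", "7:02:47", "7:03:39", "7:03:49", "7:04:32"]
WEB2 = ["13:25:37", "13:25:41", "13:25:57", "13:26:49", "13:26:59", "13:27:42"]


def fit(firewall_times, web_times):
    """Fit web-server time (y) against firewall time (x), both in seconds."""
    xs = [to_seconds(t) for t in firewall_times]
    ys = [to_seconds(t) for t in web_times]
    return fit_line(xs, ys)


slope1, intercept1 = fit(FW_WEB1, WEB1)
slope2, intercept2 = fit(FW_WEB2, WEB2)
print(f"web1: slope={slope1:.4f} intercept={intercept1:.1f} s")
print(f"web2: slope={slope2:.4f} intercept={intercept2:.1f} s")
```

Under this pairing, web2 fits a slope of 1 with an intercept of 22990 seconds (6 hours, 23 minutes, 10 seconds), while web1's slope comes out near 0.41, so a simple shift will not reconcile it with the firewall.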

How about a closer look at those two:

Web 1 recorded the first few probes as happening in the same second.  Over the whole window, though (draw a line between the first and last event), it's a bit more in agreement with web2.

A Side Note About the Comments

The comments strayed off-topic pretty quickly today, but there are some nice gems in there about deploying and monitoring NTP.  They're worth a look.



Keywords: NTP