[Planetlab-users] Timing Problem on PlanetLab (bad NTP)
nakao at cs.princeton.edu
Fri Mar 5 18:35:57 EST 2004
We thought we'd share what we've learned recently about perceived
scheduling problems on PlanetLab. We had been investigating these for
various reasons, assuming that they were responsible for some timing
anomalies we'd seen. However, after building a more accurate timer, we
found that the timing measurements themselves were at fault, and that the
interaction between NTP and timing calls like gettimeofday is the cause
of the problem. Details are attached below, and we can answer any
questions about what we've encountered.
(1) our timer library (gettimeofday_cc)
(2) time-difference between gettimeofday and gettimeofday_cc on a UCLA
Details of the problem
We've found that NTP is really causing havoc with the system clock on
quite a few PlanetLab nodes. We've written our own version of
gettimeofday using only the cycle counter of the CPU (gettimeofday_cc)
and have been watching the difference between the time values reported
by gettimeofday_cc and gettimeofday. Note that only the clock in
gettimeofday is adjusted by NTP, while the cycle counter remains
unaffected.
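
The idea of a counter-driven clock can be sketched as follows. This is
not the gettimeofday_cc code: it is a minimal Python illustration that
assumes time.monotonic_ns() as a stand-in for the CPU cycle counter, and
the class and method names (CounterClock, now, offset) are hypothetical.
The wall clock is read once at start-up; after that, only the counter
advances the reported time.

```python
import time


class CounterClock:
    """Wall-clock estimate driven by a free-running counter.

    Assumption: time.monotonic_ns() plays the role of the cycle counter.
    This is an analogue of the idea, not the gettimeofday_cc library.
    """

    def __init__(self, wall=time.time, counter_ns=time.monotonic_ns):
        self._wall = wall
        self._counter_ns = counter_ns
        # Read the wall clock exactly once, paired with a counter reading.
        self._base_wall = wall()
        self._base_count = counter_ns()

    def now(self):
        """Seconds since the epoch; later wall-clock adjustments are ignored."""
        elapsed_ns = self._counter_ns() - self._base_count
        return self._base_wall + elapsed_ns / 1e9

    def offset(self):
        """Current (wall clock - counter clock) difference in seconds."""
        return self._wall() - self.now()
```

Sampling offset() periodically yields the kind of difference series
described here: it stays near zero while the wall clock is left alone
and moves whenever the wall clock is adjusted.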
The attached plot shows the difference (gettimeofday - gettimeofday_cc)
on a UCLA node which we suspected of having bad NTP, and on our local
machine at Princeton with good NTP. As the plot shows, gettimeofday on
the UCLA node fluctuates a lot over time, but the average of this
difference over time is only 1.7 sec away from zero.
We also examined the drift of the CPU frequency on both nodes over time,
but it was less than 0.5%, which is too small to explain the
fluctuations. Therefore we've concluded that NTP is badly affecting the
system clock on several PlanetLab nodes. In fact, we conjecture that the
number of nodes with bad NTP amounts to about a hundred.
Particularly affected are experiments that try to measure latencies by
differencing two calls to gettimeofday. When NTP adjustments occur
between these calls, the adjustment is folded into the measured interval
and the result can be badly wrong.
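
To make the failure mode concrete, here is a small Python sketch of
interval measurement with a pluggable clock. The names timed and replay
are hypothetical, and the replayed readings simulate a clock being
stepped between two reads; they are not measurements from any PlanetLab
node. The default clock is time.monotonic, which is not subject to
wall-clock adjustments.

```python
import time


def timed(fn, clock=time.monotonic):
    """Return (result, elapsed_seconds) for fn(), using `clock` for both reads."""
    start = clock()
    result = fn()
    return result, clock() - start


def replay(readings):
    """Hypothetical helper: a clock that replays fixed readings, for demos."""
    it = iter(readings)
    return lambda: next(it)


# A simulated wall clock that is stepped back between the two reads
# produces a negative "latency" for a call that took almost no time.
_, stepped = timed(lambda: None, clock=replay([50.0, 48.3]))

# The same call measured with the monotonic default cannot go negative.
_, steady = timed(lambda: None)
```

Passing clock=time.time reproduces the two-call differencing pattern
described above, so the same function can be used to compare both
approaches on a given node.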
We have been running this new timer on the CoDeeN project in order to
measure scheduler delays. We sometimes used to see nodes that claimed
tens to hundreds of seconds between invocations of a 1-second timer.
Now, those results are gone, and most nodes, even under heavy load,
rarely report more than a few seconds. The continually-updated
measurements are on the CoDeeN status page at
http://codeen.cs.princeton.edu/status/ under the "Max Timer" heading.
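
For readers who want to try a similar check, the following Python
sketch records the largest gap between consecutive wake-ups of a
periodic timer. It is not the CoDeeN measurement code; the function name
max_timer_gap and its parameters are assumptions made for illustration,
and time.monotonic is used so that wall-clock adjustments do not show up
as scheduler delay.

```python
import time


def max_timer_gap(interval=1.0, rounds=5, clock=time.monotonic, sleep=time.sleep):
    """Largest observed gap (seconds) between consecutive timer wake-ups."""
    worst = 0.0
    last = clock()
    for _ in range(rounds):
        sleep(interval)          # the periodic timer
        now = clock()
        worst = max(worst, now - last)
        last = now
    return worst
```

On a lightly loaded machine, max_timer_gap(1.0, 60) should come out
close to one second; values well above the interval point to the process
not being scheduled promptly.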