[Planetlab-users] Timing Problem on PlanetLab (bad NTP)

Aki Nakao nakao at cs.princeton.edu
Sat Mar 6 10:12:07 EST 2004


Hi Scott.

I admit that there is definitely a long-term drift. This is a limitation
right now.
When I first wrote this, I assumed that we typically measure a
relatively short period of time (sub-second) by differencing two
calls to gettimeofday. We intended to use this as a "timer" in CoDeeN,
RON, and ping/traceroute, not as the system "wall clock".
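
To make the timer/wall-clock distinction concrete, here is a minimal
sketch in Python. It is not the gettimeofday_cc code. It assumes
time.monotonic as a stand-in for a clock that is never stepped (it may
still be rate-adjusted on some systems), and the helper names elapsed
and make_stepped_clock are hypothetical, introduced only for
illustration.

```python
import time


def elapsed(work, clock=time.monotonic):
    """Return the seconds that work() takes, as seen by `clock`."""
    start = clock()
    work()
    return clock() - start


def make_stepped_clock(readings):
    """Hypothetical fake wall clock that replays a fixed list of readings.

    It lets us simulate a clock that is stepped between two reads.
    """
    it = iter(readings)
    return lambda: next(it)


# A wall clock stepped back by about 10 s between the two reads:
# the "interval" comes out negative.
stepped = make_stepped_clock([100.0, 90.5])
bad = elapsed(lambda: None, clock=stepped)   # 90.5 - 100.0 = -9.5

# The same measurement against a clock that is never stepped
# cannot go negative.
good = elapsed(lambda: time.sleep(0.01))
```

The point is only that an interval timer needs a clock that never
jumps, while a wall clock needs long-term accuracy; the two
requirements can be met by different clocks.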

If we want to use gettimeofday_cc as the system wall clock, we need to
correct the long-term effect via NTP.
In fact, the original gettimeofday uses the same cycle counter, so
it also suffers from the same long-term effects
(temperature, voltage, modulation, etc.). So it is just a matter of how we
adjust the clock.
I was thinking of running my own version of NTP, a very simple
one that does not do weird things such as drifting backward for a while
and then jumping forward by tens of seconds, as seen on the UCLA node.
This should not be too hard, since I still believe that
cycle-counter-based time should not be far off from the correct
time.
But at the same time I would like to know the real reason why NTP is
so badly off on PlanetLab nodes.
I personally do not believe this problem is caused by stale
hardware; I think something is very wrong with the NTP process or its
configuration.
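
As a rough illustration of the "very simple" correction idea above,
here is a sketch in Python of a clock that is corrected only by
gradual slewing, so it never runs backward and never jumps. This is
an assumption about what such a scheme could look like, not an
existing implementation: the class SlewedClock, its methods, and the
max_slew parameter are all hypothetical, and `raw` stands in for a
free-running source such as a cycle counter.

```python
class SlewedClock:
    """Wall clock built on a free-running counter, corrected by slewing.

    raw      -- callable returning seconds from a free-running counter
                (hypothetical stand-in for a cycle-counter clock)
    max_slew -- largest correction applied per second of raw time;
                must be below 1.0 so that time never runs backward
    """

    def __init__(self, raw, max_slew=0.0005):
        if not 0.0 <= max_slew < 1.0:
            raise ValueError("max_slew must be in [0, 1)")
        self.raw = raw
        self.max_slew = max_slew
        self.last_raw = raw()
        self.value = self.last_raw
        self.pending = 0.0  # correction still to be applied, in seconds

    def set_offset(self, offset):
        """Record that the reference is `offset` seconds ahead of us
        (negative if the reference is behind)."""
        self.pending = offset

    def now(self):
        r = self.raw()
        dt = r - self.last_raw
        self.last_raw = r
        # Apply at most max_slew * dt of the pending correction.
        limit = self.max_slew * dt
        step = max(-limit, min(limit, self.pending))
        self.pending -= step
        self.value += dt + step
        return self.value
```

Even when the reference says we are many seconds fast, each call to
now() advances by at least (1 - max_slew) times the raw interval, so
differences between two readings stay non-negative and close to the
true elapsed time.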

Aki


> -----Original Message-----
> From: users-bounces at planet-lab.org 
> [mailto:users-bounces at planet-lab.org] On Behalf Of Scott Karlin
> Sent: Saturday, March 06, 2004 8:12 AM
> To: users at lists.planet-lab.org
> Subject: Re: [Planetlab-users] Timing Problem on PlanetLab (bad NTP)
> 
> 
> Hi Aki,
> 
> A couple of things to consider (I don't know if these 
> conditions exist on our PlanetLab nodes or even at all):
>    * While the crystal on the motherboard should be stable,
>      it can be affected by temperature and supply voltage.
>      This could lead to some small long term drift.
>    * It's possible that the CPU clock might be modulated
>      somewhat.  The clock on the PCI bus is often modulated
>      over a wide range to reduce the EMI emissions.  I haven't
>      heard of modulation beyond big steps (like to put the
>      processor to sleep) but you may want to make sure.
>    * When the machine reboots, it gets its time from the
>      battery backed-up clock.  If this is way out of whack,
>      it may take a while for NTP to stabilize.
> Again, I'm not sure if these effects exist (or are 
> significant) on any of our systems -- just something to think about.
> 
> Scott
> 
> On Fri, Mar 05, 2004 at 06:35:57PM -0500, Aki Nakao wrote:
> > We thought we'd share what we've learned recently about perceived
> > scheduling problems on PlanetLab. We had been investigating these for
> > various reasons, assuming that they were responsible for some timing
> > anomalies we'd seen. However, after building a more accurate timer, we
> > found that the timing measurements themselves were at fault, and the
> > interaction between NTP and timing calls like gettimeofday is the
> > cause of the problem. Details of this are attached below, and we can
> > answer any questions about what we've encountered.
> > 
> > Aki Nakao
> > KyoungSoo Park
> > 
> > =============
> > Downloads
> > 
> > (1) our timer library (gettimeofday_cc)
> > http://www.cs.princeton.edu/~nakao/gettimeofday_cc.tgz
> > (2) time difference between gettimeofday and gettimeofday_cc on a UCLA node
> > http://www.cs.princeton.edu/~nakao/diff.pdf
> > 
> > =============
> > Details of the problem
> > 
> > We've found that NTP is really causing havoc with the system clock on
> > quite a few PlanetLab nodes. We've written our version of gettimeofday
> > using only the cycle counter of the CPU (gettimeofday_cc) and have
> > been watching the difference between the two time values reported by
> > gettimeofday_cc and gettimeofday. Note that only the clock in
> > gettimeofday is adjusted by NTP, while the cycle counter remains
> > unaffected.
> > 
> > The attached plot shows the difference (gettimeofday -
> > gettimeofday_cc) on a UCLA node which we suspected has bad NTP, and on
> > our local machine at Princeton with good NTP. As you can see from the
> > plot, gettimeofday on the UCLA node fluctuates a lot over time, but the
> > average of this difference over time is only 1.7 sec away from zero.
> > 
> > We also examined the drift of CPU frequency on both nodes over time,
> > but it was less than 0.5%. Therefore we've concluded
> > that NTP is badly affecting the system clock on several PlanetLab
> > nodes. In fact, we conjecture that the number of nodes with bad NTP
> > is up to about a hundred.
> > 
> > Particularly affected are experiments that try to measure latencies by
> > differencing two calls to gettimeofday. When NTP adjustments occur
> > between these calls, the results are very problematic.
> > 
> > We have been running this new timer on the CoDeeN project in order to
> > measure scheduler delays. We used to sometimes see nodes that claimed
> > tens to hundreds of seconds between invocations of a 1-second timer.
> > Now, those results are gone, and most nodes, even under heavy load,
> > rarely report more than a few seconds. The continually-updated
> > measurements are on the CoDeeN status page at
> > http://codeen.cs.princeton.edu/status/ under the "Max Timer" heading.
> > 
> > 
> > _______________________________________________
> > Users mailing list
> > Users at lists.planet-lab.org 
> > http://lists.planet-lab.org/mailman/listinfo/users


