[Planetlab-users] Why so high load averages?
Douglas S Reeves
reeves at eos.ncsu.edu
Wed May 12 10:55:56 EDT 2004
Sounds to me like PlanetLab is succeeding.
Two suggestions:
1. Encourage existing PlanetLab users to pony up more horsepower
   (maybe the cost of entry should be 3 or 4 machines, rather than 2,
   and maybe only machines that meet a minimum clock rate / memory
   configuration should count). Adding more sites isn't going to help --
   it's the CPU power per site that sounds like it needs to be addressed.
2. Police usage -- I don't really want to subsidize use by power users.
   If site X contributes 2 machines but consistently uses 30 machines'
   worth of processing / network bandwidth, it should be deploying
   more machines to contribute to the available pool.
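
To make the policing idea concrete, here is a rough sketch of how a
usage-to-contribution ratio could be checked. The site names, the numbers,
the SiteUsage record, and the threshold of 2.0 are all made up for
illustration; none of this reflects an actual PlanetLab accounting tool.

```python
from dataclasses import dataclass


@dataclass
class SiteUsage:
    # Hypothetical record; field names are invented for this sketch.
    name: str
    machines_contributed: int
    machine_equivalents_used: float


def usage_ratio(site):
    """Machines' worth of resources consumed per machine contributed."""
    if site.machines_contributed == 0:
        return float("inf")  # uses the pool without contributing anything
    return site.machine_equivalents_used / site.machines_contributed


def over_consumers(sites, max_ratio=2.0):
    """Names of sites whose ratio exceeds the (arbitrary) threshold."""
    return [s.name for s in sites if usage_ratio(s) > max_ratio]


# Made-up figures mirroring the "2 machines in, 30 machines out" example.
sites = [
    SiteUsage("site-x", 2, 30.0),
    SiteUsage("site-y", 4, 3.0),
]
print(over_consumers(sites))  # ['site-x']
```

Where to set the threshold, and over what time window to average usage,
are policy questions this sketch does not try to answer.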
Finally, if some users look at PlanetLab as a cheap supercomputer / grid
computer for their production use, that is contrary to the intent of
PlanetLab, in my opinion. Again, some policing may be necessary.
-Doug
Vivek Pai wrote:
> Joe Armstrong wrote:
>
>>
>> This is terrible - one third less than one. This means PL is not
>> representative of machines in general - my local machine load average
>> is 0.23 and this is running loads of things.
>
>
> Depends on what you want to do. Before PlanetLab, I had about 2 places
> where I could run CoDeeN. After PlanetLab, I have about 150. I'll take
> the (in my opinion, tolerable) unpredictability for the huge gain in
> infrastructure. They are, for the most part, minor inconveniences and
> it's pretty simple to deal with them for the kinds of things we do.
>
>> The specific problem I encountered was when I tried to connect to
>> my target machine I kept getting "Connection refused" - If I
>> were queued or if it were slow I'd be much happier.
>
>
> I don't think I've seen connections refused due to load. I'd rule
> out other possibilities first.
>
>
>> Now this machine was *not listed* in
>> http://codeen.cs.princeton.edu/status/ - or if it was I couldn't find
>> it.
>
>
> CoDeeN is a service, and the status page only reflects the status of the
> nodes running our service. Note that CoDeeN is not a monitoring service,
> so it's not surprising that not every node is listed. We simply don't
> run outside of North America right now.
>
> The CoDeeN status page is simply us allowing other people to see what we
> think of the nodes, since we happen to like our own status page more
> than any of the others we'd seen in the past. Now that Ganglia's back up
> and Trumpet exists, there's ample coverage of nodes.
>
>> ganglia said the load averages were 8088%, 3081%, 1623%
>> codeen said the load average was 4.88
>>
>> The detailed graphs were confusing - since I guess the sample times
>> etc. are different.
>
>
> I believe our legend explains our measurements.
>
> "The Load value is the maximum of the load averages for the past 1, 5,
> and 15 minutes. This value indicates how many active processes are
> competing for the CPU"
>
> We plot every 5 minutes.
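
The legend quoted above amounts to taking a maximum over three numbers.
A minimal sketch, assuming a Unix-like host where Python's os.getloadavg()
is available; the function name load_value and the sample numbers are
invented for illustration and are not taken from CoDeeN's code.

```python
import os


def load_value(load_averages):
    """Largest of the 1-, 5-, and 15-minute load averages."""
    return max(load_averages)


# Made-up sample: a burst a few minutes ago dominates the 5-minute figure.
sample = (0.5, 3.0, 1.5)
print(load_value(sample))  # 3.0

# Live figures, where the platform provides them.
try:
    print(load_value(os.getloadavg()))
except (AttributeError, OSError):
    print("load averages not available on this platform")
```

Because the maximum is taken over three different windows, a monitor that
reports only one window, or that samples at a different moment, can
legitimately show a different number for the same node.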
>
>> Now I guess if there are lots of measurement processes they should all
>> get the same result -
>
>
> Not necessarily. See above.
>
>> so why have lots? - they'll all end up measuring
>> each other.
>
>
> We use our status page extensively, and it provides what we want. If you
> don't like it, you're free to not use it. I don't think our measurement
> process consumes enough resources to drive up the load averages.
>
>> What worries me is how well the OS protects one slice against other
>> slices that run wild. Most OSs are not good at this - if one process
>> starts hogging memory and CPU sometimes the other processes can't get
>> a look in.
>>
>> It happens that sometimes I get a process that runs wild, so much so
>> that I can't open a shell or run top or do anything to kill it.
>>
>> Is there any kind of built-in protection in the scheduler - like
>> throw out a process if it's used more than 50% cpu on average in the
>> last 5 minutes ?
>
>
> There's probably a PDN that explains the modified scheduler. Processes
> aren't killed, but CPU is split by slices, not processes. Memory space
> isn't protected at the moment.
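
As a back-of-the-envelope illustration of what splitting CPU by slice
rather than by process means, here is an idealized equal-share model. The
real modified scheduler may well behave differently, and the slice names
and process counts are invented.

```python
def share_by_process(slices):
    """CPU fraction each slice ends up with if every process gets an equal share."""
    total = sum(slices.values())
    if total == 0:
        return {name: 0.0 for name in slices}
    return {name: n / total for name, n in slices.items()}


def share_by_slice(slices):
    """CPU fraction each slice gets if every active slice gets an equal share."""
    active = [name for name, n in slices.items() if n > 0]
    return {
        name: (1.0 / len(active) if n > 0 else 0.0)
        for name, n in slices.items()
    }


# Made-up example: one slice runs wild with 99 processes, another runs 1.
slices = {"runaway": 99, "quiet": 1}
print(share_by_process(slices))  # {'runaway': 0.99, 'quiet': 0.01}
print(share_by_slice(slices))    # {'runaway': 0.5, 'quiet': 0.5}
```

In this toy model the quiet slice keeps half the CPU no matter how many
processes the runaway slice forks. It says nothing about memory, which
matches the caveat above that memory space isn't protected.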
>
> -Vivek
--
Douglas S. Reeves
NC State University
http://reeves.csc.ncsu.edu