[Planetlab-users] node availability benchmarks

Neil Spring nspring at cs.washington.edu
Fri Jun 18 13:36:47 EDT 2004


I'd very much like to see a number that represents "fully operational" 
-- the number of machines that have zero problems; in particular, 
machines on which DNS and ping work.  I would like sites to be 
encouraged not to put planetlab boxes behind firewalls.

I'd change:

"visible": I don't understand the value in reporting a metric based on 
being able to ssh into a non-vserver slice.  Who has one of these but 
you?

"usable": to me this should be the number of machines I should have no 
problems with.  If you keep a list of "usable" machines and I find one 
to be unusable, I should be able to file a trouble ticket with high 
confidence that it's a new issue to you.  That is, usable nodes should 
pass *all* of Mic's tests.  Ganglia running is not evidence of a node 
being usable -- Ganglia is lightweight and designed to run on pretty 
messed up machines.  This is where ping belongs.

Think lower bound.
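
To make the "lower bound" idea concrete, here is a rough sketch of how such a number could be computed. Everything in it is an assumption for illustration: the check names, the host names, and the result format are made up, and this is not how Mic's tests or the NOC tools actually report results. A node counts as fully operational only if it passes every required check, so the count can never exceed the count for any single check.

```python
from typing import Dict, List, Tuple

# Hypothetical check names; the real test suite may use different ones.
REQUIRED_CHECKS: Tuple[str, ...] = ("ssh", "dns", "ping", "slice_login")


def fully_operational(results: Dict[str, Dict[str, bool]],
                      required: Tuple[str, ...] = REQUIRED_CHECKS) -> List[str]:
    """Return the nodes that pass *all* required checks.

    A check that is missing from a node's results counts as a failure,
    which keeps the count a conservative lower bound.
    """
    return sorted(
        node for node, checks in results.items()
        if all(checks.get(name, False) for name in required)
    )


def summarize(results: Dict[str, Dict[str, bool]],
              required: Tuple[str, ...] = REQUIRED_CHECKS) -> dict:
    """Count nodes per individual check and nodes passing every check."""
    per_check = {
        name: sum(1 for checks in results.values() if checks.get(name, False))
        for name in required
    }
    return {
        "total": len(results),
        "per_check": per_check,
        "fully_operational": len(fully_operational(results, required)),
    }


# Made-up sample data: three nodes with invented host names.
sample = {
    "node1.example.org": {"ssh": True, "dns": True, "ping": True, "slice_login": True},
    "node2.example.org": {"ssh": True, "dns": True, "ping": False, "slice_login": True},
    "node3.example.org": {"ssh": True, "dns": False},  # ping, slice_login not reported
}

summary = summarize(sample)
```

With this sample, all three nodes would count as "visible" by the ssh definition, but only one is fully operational, because one filters pings and another has broken DNS.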

thanks,
-neil

On Jun 18, 2004, at 6:43 AM, Larry Peterson wrote:

> As has recently been discussed on this list, there are a lot of
> different numbers being used to describe how many nodes are on
> PlanetLab. I thought it might be helpful to provide a rough breakdown.
> I'd also be interested in any thoughts people have on my definitions.
>
> Total: This is the number that appears on the home page (currently 395)
>        and corresponds to the number of machines that have been
>        registered with the database. Some of these machines have never
>        actually booted. We all understand that this number has more PR
>        value than anything else.
>
> Visible: This corresponds to the number of nodes we can "reach" from
>        Princeton. (Others run similar experiments from other sites.)
>        Our working definition for a "visible" node is that we can ssh
>        into a non-vserver account on it. Recently, we have been seeing
>        ~325 visible nodes.
>
>        As an aside, an alternative definition of visible is that the
>        machine is pingable. This set is typically smaller because many
>        sites filter pings, so reachable via ssh seems to be a better
>        definition of visible.
>
>        Of the 70-odd nodes that have been registered but are not
>        currently visible, roughly half are in "debug" mode; i.e., have
>        known, long-term problems. It's sometimes a hardware failure,
>        sometimes a site that's gun-shy about a recent incident report,
>        sometimes a change in the local net configuration, and sometimes
>        a non-responsive contact that hasn't rebooted a hung machine.
>        Martin monitors node availability on a daily basis, and has
>        started to post his results at
>        https://www.planet-lab.org/Wiki/bin/view/Planetlab/NodeInfo
>
>        The other half are what I would characterize as having transient
>        problems: the machine crashes or hangs, we remotely reboot it or
>        send email to the site's technical contact, and within a day or
>        two, all is well. We've been seeing ~10 nodes come up and another
>        ~10 go down on any given day.
>
> Usable: This corresponds to the number of nodes that an application can
>        productively use. By its very nature, whether a node qualifies
>        as usable is application-specific. However, it seems useful to
>        define one or more generic benchmarks that we could run
>        periodically.
>
>        The lowest threshold seems to be that a service running on the
>        node is able to phone home. Ganglia and the NOC sensor do this,
>        and report availability numbers in the upper 200s (~290).
>
>        A higher threshold would be that one could ssh into the node and
>        "consume" some number of resources, something representative of
>        a minimal service.
>
>        Another possibility would be to create a new slice and see how
>        many nodes it comes up on (in what timeframe). The current
>        model/architecture does not necessarily support rapid slice
>        creation/termination since slices are intended to be fairly
>        long-lived, but this would be a useful metric nonetheless.
>
> I would be interested in hearing other suggestions for benchmarks we
> should run to evaluate node availability. In this context, it's worth
> noting that we've recently seen numbers posted to this mailing list
> ranging from around 200 to nearly 300. Why the wide range? One thing
> that's clear from Mic's data is that slices aren't reliably being
> created on all nodes the user selects. We expect that upgrades to the
> node manager being rolled out in coming weeks will improve this
> situation, but we clearly need to measure our progress on this front.
> There are also nodes with insufficient resources; again, the
> allocation/scheduling upgrades in the works should help.
>
> What I don't have a good handle on is the extent to which users are
> able to successfully work around the churn that happens on PlanetLab.
> Insights on this point would be helpful.
>
> Larry
>
>
>


