[Planetlab-users] node availability benchmarks
nspring at cs.washington.edu
Fri Jun 18 13:36:47 EDT 2004
I'd very much like to see a number that represents "fully operational"
-- the number of machines that have zero problems. In particular, that
DNS and ping work. I would like sites to be encouraged not to put
planetlab boxes behind firewalls.
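Something like this would do as a first cut for counting those (a
minimal sketch in Python; "nodes.txt" is a placeholder for whatever
list of hostnames we keep, and the ping flags are Linux-specific):

    import socket
    import subprocess

    def dns_ok(host):
        """True if the hostname resolves."""
        try:
            socket.gethostbyname(host)
            return True
        except socket.error:
            return False

    def ping_ok(host):
        """True if one ICMP echo comes back within 2 seconds."""
        return subprocess.call(
            ["ping", "-c", "1", "-W", "2", host],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

    nodes = [l.strip() for l in open("nodes.txt") if l.strip()]
    fully_up = [n for n in nodes if dns_ok(n) and ping_ok(n)]
    print("fully operational: %d of %d" % (len(fully_up), len(nodes)))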
"visible": I don't understand the value in reporting a metric based on
being able to ssh into a non-vserver slice. who has one of these but
"usable": to me should be a number of machines I should have no
problems with. If you keep a list of "usable" machines and I find one
to be unusable, I should be able to file a trouble ticket with high
confidence that it's a new issue to you. That is, usable nodes should
pass *all* of Mic's tests. Ganglia running is not evidence of a node
being usable -- ganglia is lightweight and designed to run on pretty
messed up machines. This is where ping belongs.
Think lower bound.
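If "usable" means passing every test, the trouble-ticket property
falls out almost for free: flag a node only when it passed the last
run but fails now. A rough sketch, reusing dns_ok/ping_ok from above;
the check table and state file are placeholders to be filled in with
Mic's actual tests:

    import json
    import os

    CHECKS = {"dns": dns_ok, "ping": ping_ok}  # add each of Mic's tests

    def usable_report(nodes, state_file="usable_state.json"):
        prev = json.load(open(state_file)) if os.path.exists(state_file) else {}
        usable, new_failures = [], []
        for node in nodes:
            ok = all(check(node) for check in CHECKS.values())
            if ok:
                usable.append(node)
            elif prev.get(node):
                new_failures.append(node)  # passed last run, fails now
            prev[node] = ok
        with open(state_file, "w") as f:
            json.dump(prev, f)
        return usable, new_failures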
On Jun 18, 2004, at 6:43 AM, Larry Peterson wrote:
> As has recently been discussed on this list, there are a lot of
> numbers being used to describe how many nodes are on PlanetLab. I
> thought it might be helpful to provide a rough breakdown. I'd also
> be interested in any thoughts people have on my definitions.
> Total: This is the number that appears on the home page (currently
> 395) and corresponds to the number of machines that have been
> registered in the database. Some of these machines have never
> actually booted. We all understand that this number has more PR
> value than anything else.
> Visible: This corresponds to the number of nodes we can "reach" from
> here. (Others run similar experiments from other sites.) Our working
> definition for a "visible" node is that we can ssh into a
> non-vserver account on it. Recently, we have been seeing ~325
> visible nodes.
> As an aside, an alternative definition of visible is that the
> machine is pingable. This set is typically smaller because many
> sites filter pings, so reachable via ssh seems to be a better
> definition of visible.
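A cheap approximation of that ssh probe, for anyone who wants to
reproduce the count: just test whether sshd answers on port 22 (a
real probe would authenticate into the account and run a command):

    import socket

    def ssh_visible(host, timeout=5):
        """True if something accepts a TCP connection on the ssh port."""
        try:
            socket.create_connection((host, 22), timeout).close()
            return True
        except (socket.error, socket.timeout):
            return False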
> Of the 70-odd nodes that have been registered but are not currently
> visible, roughly half are in "debug" mode; i.e., have known,
> long-term problems. It's sometimes a hardware failure, sometimes a
> site that's gun-shy about a recent incident report, sometimes a
> change in the local network, and sometimes a non-responsive contact
> that hasn't rebooted a hung machine.
> Martin monitors node availability on a daily basis, and has started
> to post his results at
> The other half are what I would characterize as having transient
> problems: the machine crashes or hangs, we remotely reboot it or
> send email to the site's technical contact, and within a day or
> two, all is well. We've been seeing ~10 nodes come up and another
> ~10 go down on any given day.
> Usable: This corresponds to the number of nodes that an application
> can productively use. By its very nature, whether a node qualifies
> as usable is application-specific. However, it seems useful to
> define one or more generic benchmarks that we could run
> periodically.
> The lowest threshold seems to be that a service running on the node
> is able to phone home. Ganglia and the NOC sensor do this, and
> report numbers in the upper 200's (~290).
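The Ganglia variant of that check is easy to reproduce: gmond serves
its metrics as XML to anyone who connects to its TCP port (8649 by
default), so a node whose gmond answers has phoned home in at least
this weak sense. A sketch, assuming the default port:

    import socket

    def ganglia_alive(host, port=8649, timeout=5):
        """True if gmond answers and starts sending its XML dump."""
        try:
            sock = socket.create_connection((host, port), timeout)
            data = sock.recv(1024)
            sock.close()
            return data.startswith(b"<?xml")
        except (socket.error, socket.timeout):
            return False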
> A higher threshold would be that one could ssh into the node and
> "consume" some number of resources, something representative of a
> minimal application.
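That threshold could be approximated by logging in non-interactively
as a slice user and doing a token unit of work, timing the round
trip. A sketch; the slice name is a placeholder and the workload is
arbitrary:

    import subprocess
    import time

    def ssh_usable(host, user="my_slice", timeout=30):
        """Seconds to log in and write/remove 1 MB, or None on failure."""
        cmd = ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10",
               "%s@%s" % (user, host),
               "dd if=/dev/zero of=/tmp/probe bs=1k count=1024"
               " && rm -f /tmp/probe"]
        start = time.time()
        try:
            rc = subprocess.call(cmd, timeout=timeout,
                                 stdout=subprocess.DEVNULL,
                                 stderr=subprocess.DEVNULL)
        except subprocess.TimeoutExpired:
            return None
        return time.time() - start if rc == 0 else None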
> Another possibility would be to create a new slice and see how many
> nodes it comes up on (and in what timeframe). The current
> model/architecture does not necessarily support rapid slice
> creation/termination since slices are intended to be fairly
> long-lived, but this would be a useful metric.
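Measuring that would look something like the following: once the
slice has been created by whatever out-of-band mechanism, poll the
selected nodes and record how long each takes to accept its first
login. ssh_usable is the probe sketched above; the rest is
hypothetical scaffolding:

    import time

    def time_slice_startup(slice_name, nodes, poll=60, deadline=3600):
        """Map each node to seconds until its first successful login."""
        start, up = time.time(), {}
        while len(up) < len(nodes) and time.time() - start < deadline:
            for node in set(nodes) - set(up):
                if ssh_usable(node, user=slice_name) is not None:
                    up[node] = time.time() - start
            time.sleep(poll)
        return up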
> I would be interested in hearing other suggestions for benchmarks we
> should run to evaluate node availability. In this context, it's
> worth noting that we've recently seen numbers posted to this mailing
> list ranging from around 200 to nearly 300. Why the wide range? One
> thing that's clear from Mic's data is that slices aren't reliably
> being created on all nodes the user selects. We expect the node
> manager upgrades being rolled out in the coming weeks to improve
> this situation, but we clearly need to measure our progress on this
> front. There are also nodes with insufficient resources; again, the
> allocation/scheduling upgrades in the works should help.
> What I don't have a good handle on is the extent to which users are
> able to successfully work around the churn that happens on
> PlanetLab. Insights on this point would be helpful.