[Planetlab-devel] BootManager / Boot states / Node states
Stephen Soltesz
soltesz at CS.Princeton.EDU
Mon Dec 3 15:35:09 EST 2007
Hello,
I've observed some ambiguity in the Boot states for our nodes. Currently we
have the following "Boot States":
* New == ??
* Install
* Reinstall
* Debug
* Boot
Right now there is no distinction between a node placed in the debug boot state
manually by an admin or tech and a node placed there automatically by
BootManager after a failed boot. While there may be errors in bm.log, this data
is lost after a subsequent reboot of the machine if the boot state is not reset
first. Uploading the logs would help only slightly, because there would still be
no solid indication of whether changes to the boot state were made manually or
automatically, which is an awkward thing to infer from boot logs alone.
Historically, Monitor has avoided nodes in debug state. In part, the above
ambiguity will prevent Monitor from working cooperatively with admins in
day-to-day operations, such as when an admin takes an otherwise "good" node into
debug mode because of a security complaint. Monitor, seeing nothing wrong and
having no secondary source of information about what the machine state should
be, would simply try to start it back up. Adding a secondary interface to notify
Monitor would only introduce additional mismatches.
To ameliorate the ambiguity, I would like to propose splitting Debug into two
new boot states and adding a third, Disabled. The full set would be:
* Install
* Reinstall
* Boot
* Stand-by (debug, automatic)
* Diagnostic (debug, manual)
* Disabled (manual)
Stand-by takes the place of how I think of 'Debug' working today. It is the
automatic state entered by BootManager as a result of problems. It's literally
'standing-by' for someone to come along and help it out.
Diagnostic is the manual version of 'debug'. If Monitor saw a node in this
state, it would ignore it, trusting that the admin who placed the node into this
state would ultimately take it back out.
Disabled is a state that is currently unrepresented by any of the others. It
is meant for nodes that are known to be down for weeks or months because of
hardware failures, hardware incompatibility, funding problems, or any other
reason not covered by the other states. Basically, there is no possibility that
the machine will be online.
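
As a rough illustration of how Monitor could act on the proposed states, here
is a minimal Python sketch. The identifiers (BootState, MANUAL_STATES,
monitor_action) and the string values are hypothetical and only mirror the list
above; they are not taken from the existing BootManager or Monitor code.

```python
from enum import Enum


class BootState(Enum):
    """Proposed boot states; the names and values are illustrative only."""
    INSTALL = "install"
    REINSTALL = "reinstall"
    BOOT = "boot"
    STANDBY = "standby"        # debug, entered automatically by BootManager
    DIAGNOSTIC = "diagnostic"  # debug, set manually by an admin or tech
    DISABLED = "disabled"      # manual, node expected to be down long-term


# States that only a person sets; Monitor should trust them and stay away.
MANUAL_STATES = {BootState.DIAGNOSTIC, BootState.DISABLED}


def monitor_action(state):
    """Return a coarse (hypothetical) action label for Monitor."""
    if state in MANUAL_STATES:
        return "ignore"        # an admin owns this node for now
    if state is BootState.STANDBY:
        return "investigate"   # BootManager gave up; the node needs help
    return "observe"           # install, reinstall, boot: normal handling
```

With this split, a node pulled aside for a security complaint would be in
Diagnostic, so monitor_action returns "ignore" and Monitor would not try to
start it back up.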
Thank you,
Stephen.