[Planetlab-devel] BootManager / Boot states / Node states

Stephen Soltesz soltesz at CS.Princeton.EDU
Mon Dec 3 15:35:09 EST 2007


Hello,

I've observed some ambiguity in the Boot states for our nodes.  Currently we 
have the following "Boot States":

  * New  == ??
  * Install
  * Reinstall
  * Debug
  * Boot

Right now there is no distinction between a node in the debug boot state as a 
result of manual intervention by an admin or tech, and one in the debug boot 
state as a result of BootManager's automatic response to a failed boot.  While 
there may be errors in bm.log, this data is lost after a subsequent reboot of 
the machine if the boot state is not reset first.  Uploading the logs would help 
only slightly, because there would still be no solid indication of whether a 
change to the boot state was made manually or automatically, which is an awkward 
thing to infer from boot logs alone.

Historically, Monitor has avoided nodes in debug state.  In part, the above 
ambiguity prevents Monitor from working cooperatively with admins in day-to-day 
operations, such as when an admin takes an otherwise "good" node into debug mode 
because of a security complaint.  Monitor, seeing nothing wrong and having no 
secondary source of information for what the machine state should be, would 
simply want to start it back up.  Adding a secondary interface to notify Monitor 
would only introduce additional mismatches.

To ameliorate the ambiguity, I would like to propose splitting Debug into two 
new boot states and adding a Disabled state.

  * Install
  * Reinstall
  * Boot
  * Stand-by   (debug, automatic)
  * Diagnostic (debug, manual)
  * Disabled   (manual)

Stand-by takes the place of how I think of 'Debug' working today.  It is the 
automatic state entered by BootManager as a result of problems.  It's literally 
'standing-by' for someone to come along and help it out.

Diagnostic is the manual version of 'debug'.  If Monitor saw a node in this 
state, it would ignore it, trusting that the admin who placed the node into this 
state would ultimately take it back out.

Disabled is a state that is currently unrepresented by any of the others.  It 
is meant for nodes that are known to be down for weeks or months because of 
hardware failures, hardware incompatibility, funding problems, or any other 
reason not covered by the other states.  Basically, there is no possibility that 
this machine will be online.
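
To make the intended Monitor behavior concrete, here is a minimal sketch of the 
proposed states and a check Monitor might apply.  The names BootState, 
MANUAL_STATES and monitor_may_act are hypothetical, the string values are 
illustrative rather than what PLC stores, and treating Disabled the same as 
Diagnostic (hands off) is an assumption that follows from both being manual 
states.

```python
from enum import Enum


class BootState(Enum):
    """Proposed boot states (illustrative values, not PLC's actual strings)."""
    INSTALL = "install"
    REINSTALL = "reinstall"
    BOOT = "boot"
    STANDBY = "stand-by"        # debug, entered automatically by BootManager
    DIAGNOSTIC = "diagnostic"   # debug, set manually by an admin or tech
    DISABLED = "disabled"       # set manually; node is down long-term


# States that only a person sets. Assumption: Monitor leaves these nodes alone.
MANUAL_STATES = frozenset({BootState.DIAGNOSTIC, BootState.DISABLED})


def monitor_may_act(state: BootState) -> bool:
    """Return True if Monitor may intervene on a node in this state."""
    return state not in MANUAL_STATES
```

With this split, a node in Stand-by is fair game for Monitor, while a node in 
Diagnostic or Disabled is left to whoever put it there, and no secondary 
notification channel is needed.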

Thank you,
Stephen.


