[Planetlab-devel] diagnosing a sneak bug in 5.0 & impact on 4.2

Thierry Parmentelat thierry.parmentelat at sophia.inria.fr
Fri May 30 07:09:07 EDT 2008


On May 29, 2008, at 7:05 PM, Faiyaz Ahmed wrote:

> Hi Thierry,
>
> This sounds fishy.
>
>> the thing is, this fails, so the actual boot-mode ssh keys don't get
>> pushed on the hard drive
>> however the node proceeds and does the kexec, resulting in an  
>> unreachable node.
>> ======
>> . when running conf_files, the command performs an xmlrpc call to  
>> the api, for obtaining getslivers, and does so with a session auth  
>> method:
>> the capture that I made shows that the postdata that gets into the  
>> http session looks like this
>> "<?xml version='1.0'?>
>> <methodCall>
>> <methodName>GetSlivers</methodName>
>> <params><param>\n<value><struct>\n<member>\n<name>session</name>
>> <value><string>eTJPKAIex5XysDpHBcZgXTVZ25OlxLH/serial8250: too much  
>> work for irq4^M
>> 7aROqL73mKM=</string>
>> </value>\n</member>\n<member>\n<name>AuthMethod</name>
>> \n<value><string>session</string></value>\n</member>\n </struct>
>> </value>\n</param>\n</params>\n</methodCall>\n"
>> where the actual session string has this suspicious-looking part  
>> about 'serial8250: too much work for irq4'
>
> I've seen this message before but it's usually the kernel of the
> machine I'm SSH'ing from or the node I'm SSH'ing to complaining  
> about bad irq settings.  The syslog is likely set to echo emerg to  
> all terminals.  Do you think this is the case?
>
> Can you call this method, with the same credentials used by the node  
> from a different machine and check the post/get results?

A quick update on this: after some googling, it looks pretty likely
that this is qemu-related.
Do you remember whether you've ever seen this on real nodes as well?
For the time being I will assume that this is a red herring, and will
try to figure out a way to get my test framework to test 5.0. It's still
puzzling how this message could have polluted the xmlrpc channel, but
your own report seems to confirm that the message can make it onto a
network connection.
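
In the meantime, a cheap sanity check on the session string before it
goes into the auth struct would at least make this kind of pollution
visible. What follows is only a sketch, not BootManager code: the
function names are made up, it is written for Python 3 with the stdlib
xmlrpc.client, and it assumes the session key is a plain base64 token
(which is what the captured value looks like).

```python
import base64
import binascii
import xmlrpc.client


def is_clean_session(session):
    """Return True if `session` looks like one unpolluted base64 token.

    Assumption: session keys are plain base64 strings.
    """
    # Whitespace or control characters (e.g. "\r" or "\n" from console
    # output) mean something else got mixed into the string.
    if not session or any(ch.isspace() or ord(ch) < 32 for ch in session):
        return False
    try:
        # validate=True rejects characters outside the base64 alphabet.
        base64.b64decode(session, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True


def build_getslivers_request(session):
    """Build the XML-RPC body for GetSlivers, refusing a polluted session."""
    if not is_clean_session(session):
        raise ValueError("session string looks corrupted: %r" % session)
    auth = {"AuthMethod": "session", "session": session}
    return xmlrpc.client.dumps((auth,), methodname="GetSlivers")
```

With something like this, a session string that picked up the
'serial8250' text would raise an error on the node instead of being
posted to the API as-is.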


>
>> PS.
>> one last thing; in this case the BootManager should basically  
>> refrain from reaching the kexec, as this means losing the node
>> on another, totally unrelated, problem, we have a node failing to  
>> perform 'chkconfig ntpd on' in the chroot, and in this case rather  
>> than trying to proceed anyway, BM gives up and the node remains in  
>> 'dbg' mode.
>> I'd rather the opposite; ntp failing does not seem like such a big  
>> deal, while in the conf_files case the node becomes unreachable to  
>> us.
>> Did anyone have a plan to review this BM logic?
>
> I agree.  It seems counterintuitive to fail on a non-issue and
> continue on a real one.  I can work out the exit codes on conf_files
> so BM will put the node back into debug if it can't contact PLC.

would be cool; thanks
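
To make the policy concrete, here is roughly what I have in mind, as a
sketch only. None of this is actual BootManager code: the step names,
the FATAL_STEPS table and both functions are hypothetical, and the only
assumption is that each step reports failure through a non-zero exit
code.

```python
import subprocess

# Hypothetical policy table: steps whose failure must keep the node in
# debug mode, because proceeding to kexec would leave it unreachable.
FATAL_STEPS = {"conf_files"}


def run_step(argv):
    """Run one step as a subprocess; return True if it exited with 0."""
    return subprocess.call(argv) == 0


def next_action(results):
    """Decide what to do from a {step_name: succeeded} mapping.

    Returns "debug" if any fatal step failed, otherwise "kexec".
    Failures of steps not listed in FATAL_STEPS are tolerated.
    """
    for name, succeeded in results.items():
        if not succeeded and name in FATAL_STEPS:
            return "debug"
    return "kexec"
```

So a failed conf_files keeps the node in debug, while a failed
'chkconfig ntpd on' alone would still let it go on to the kexec.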

>
>
>
> Faiyaz
>
>> _______________________________________________
>> Devel mailing list
>> Devel at lists.planet-lab.org
>> https://lists.planet-lab.org/mailman/listinfo/devel


