[Planetlab-devel] diagnosing a sneak bug in 5.0 & impact on 4.2

Daniel Hokka Zakrisson dhokka at CS.Princeton.EDU
Fri May 30 07:54:42 EDT 2008


Hi Thierry,

Quoting Thierry Parmentelat <thierry.parmentelat at sophia.inria.fr>:
> On May 29, 2008, at 7:05 PM, Faiyaz Ahmed wrote:
>
>> Hi Thierry,
>>
>> This sounds fishy.
>>
>>> the thing is, this fails, so the actual boot-mode ssh keys don't get
>>> pushed to the hard drive
>>> however the node proceeds and does the kexec, resulting in an
>>> unreachable node.
>>> ======
>>> . when running conf_files, the command performs an xmlrpc call to  
>>> the api, for obtaining getslivers, and does so with a session auth  
>>> method:
>>> the capture that I made shows that the postdata that gets into the  
>>> http session looks like this
>>> "<?xml version='1.0'?>
>>> <methodCall>
>>> <methodName>GetSlivers</methodName>
>>> <params><param>\n<value><struct>\n<member>\n<name>session</name>
>>> <value><string>eTJPKAIex5XysDpHBcZgXTVZ25OlxLH/serial8250: too much work for irq4^M
>>> 7aROqL73mKM=</string>
>>> </value>\n</member>\n<member>\n<name>AuthMethod</name>
>>> \n<value><string>session</string></value>\n</member>\n </struct>
>>> </value>\n</param>\n</params>\n</methodCall>\n"
>>> where the actual session string has this suspicious-looking part  
>>> about 'serial8250: too much work for irq4'
>>
>> I've seen this message before, but it's usually the kernel of the
>> machine I'm SSH'ing from or the node I'm SSH'ing to complaining
>> about bad irq settings.  The syslog is likely set to echo emerg to
>> all terminals.  Do you think this is the case?
>>
>> Can you call this method from a different machine, with the same
>> credentials used by the node, and check the post/get results?
>
> a quick update on this; after some googling it looks pretty likely
> that this is qemu-related.
> Do you remember if you've ever seen this on real nodes as well?
> For the time being I will assume that this is a red herring, and will
> try to figure out a way to get my test framework to test 5.0; it's
> still puzzling how this message can have polluted the xmlrpc channel,
> but your own report seems to confirm that the message can make it to
> a network connection.

So you got this dump on the server-side? Because if you're getting it 
on the node, it's just qemu doing too much too fast, and the kernel 
outputting that on the console.
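
One way to tell the two cases apart on the node side is to check the
session string before it goes into the request. The sketch below is only
an illustration, not BootManager code: the helper names are made up, and
it assumes the session key is a single base64 token, which is what the
capture above looks like once the console message is taken out.

```python
import base64
import binascii
import xmlrpc.client


def is_clean_session(session):
    """Return True if the session string is one unbroken base64 token.

    Hypothetical helper; assumes session keys are plain base64.
    """
    try:
        # validate=True rejects anything outside the base64 alphabet,
        # e.g. spaces, colons or CR/LF from a stray console message.
        base64.b64decode(session, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True


def build_getslivers_request(session):
    """Build the GetSlivers postdata, refusing a polluted session string."""
    if not is_clean_session(session):
        raise ValueError("session string is not clean base64: %r" % session)
    auth = {"AuthMethod": "session", "session": session}
    return xmlrpc.client.dumps((auth,), methodname="GetSlivers")


if __name__ == "__main__":
    clean = "eTJPKAIex5XysDpHBcZgXTVZ25OlxLH/7aROqL73mKM="
    polluted = ("eTJPKAIex5XysDpHBcZgXTVZ25OlxLH/"
                "serial8250: too much work for irq4\r\n7aROqL73mKM=")
    print(is_clean_session(clean))
    print(is_clean_session(polluted))
```

If the string is already polluted at this point, the problem is on the
node; if it is clean here but polluted in the server-side dump, it got
mangled somewhere after the request was built.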

Daniel

>>
>>> PS.
>>> one last thing; in this case the BootManager should basically
>>> refrain from reaching the kexec, as this means losing the node
>>> on another, totally unrelated, problem, we have a node failing to
>>> perform 'chkconfig ntpd on' in the chroot, and in this case rather
>>> than trying to proceed anyway, BM gives up and the node remains in
>>> 'dbg' mode.
>>> I'd rather have the opposite; ntp failing does not seem like such a
>>> big deal, while in the conf_files case the node becomes unreachable
>>> to us.
>>> Does anyone have a plan to review this BM logic?
>>
>> I agree.  It seems counterintuitive to fail on a non-issue and
>> continue on a real one.  I can work out the exit codes on conf_files
>> so that BM will put the node back into debug if it can't contact PLC.
>
> would be cool; thanks
>
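
The policy discussed above (stay in debug when conf_files fails, carry on
when ntpd setup fails) could be sketched roughly as follows. This is not
the actual BM step logic; the Step type, the step functions and the
"boot"/"dbg" return values are hypothetical names used for illustration.

```python
from typing import Callable, List, NamedTuple


class Step(NamedTuple):
    name: str
    run: Callable[[], int]  # returns an exit code, 0 on success
    fatal: bool             # True: a failure must keep the node in debug


def run_steps(steps: List[Step]) -> str:
    """Run steps in order; return "boot" if kexec may proceed, else "dbg"."""
    for step in steps:
        try:
            code = step.run()
        except Exception as exc:
            # Treat an unexpected exception like a non-zero exit code.
            print("%s raised %r" % (step.name, exc))
            code = 1
        if code == 0:
            continue
        if step.fatal:
            print("%s failed (exit %d), staying in debug" % (step.name, code))
            return "dbg"
        print("%s failed (exit %d), continuing anyway" % (step.name, code))
    return "boot"


if __name__ == "__main__":
    steps = [
        Step("chkconfig ntpd on", lambda: 1, fatal=False),  # annoying, not fatal
        Step("conf_files", lambda: 1, fatal=True),          # node would be lost
    ]
    print(run_steps(steps))
```

The point of the fatal flag is that the decision lives next to each step,
so a failing conf_files never reaches the kexec while a failing ntpd
setup only logs a warning.
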
>>
>>
>>
>> Faiyaz
>>
>>> _______________________________________________
>>> Devel mailing list
>>> Devel at lists.planet-lab.org
>>> https://lists.planet-lab.org/mailman/listinfo/devel




