[Planetlab-devel] diagnosing a sneak bug in 5.0 & impact on 4.2
Daniel Hokka Zakrisson
dhokka at CS.Princeton.EDU
Fri May 30 07:54:42 EDT 2008
Hi Thierry,
Quoting Thierry Parmentelat <thierry.parmentelat at sophia.inria.fr>:
> On May 29, 2008, at 7:05 PM, Faiyaz Ahmed wrote:
>
>> Hi Thierry,
>>
>> This sounds fishy.
>>
>>> the thing is, this fails, so the actual boot-mode ssh keys don't get
>>> pushed to the hard drive
>>> however the node proceeds and does the kexec, resulting in an
>>> unreachable node.
>>> ======
>>> . when running conf_files, the command performs an xmlrpc call to
>>> the api, for obtaining getslivers, and does so with a session auth
>>> method:
>>> the capture that I made shows that the postdata that gets into the
>>> http session looks like this
>>> "<?xml version='1.0'?>
>>> <methodCall>
>>> <methodName>GetSlivers</methodName>
>>> <params><param>\n<value><struct>\n<member>\n<name>session</name>
>>> <value><string>eTJPKAIex5XysDpHBcZgXTVZ25OlxLH/serial8250: too much
>>> work for irq4^M
>>> 7aROqL73mKM=</string>
>>> </value>\n</member>\n<member>\n<name>AuthMethod</name>
>>> \n<value><string>session</string></value>\n</member>\n </struct></
>>> value>\n</param>\n</params>\n</methodCall>\n"
>>> where the actual session string has this suspicious-looking part
>>> about 'serial8250: too much work for irq4'
>>
>> I've seen this message before but it's usually the kernel of the
>> machine I'm SSH'ing from or the node I'm SSH'ing to complaining
>> about bad irq settings. The syslog is likely set to echo emerg to
>> all terminals. Do you think this is the case?
>>
>> Can you call this method, with the same credentials used by the node
>> from a different machine and check the post/get results?
>
> a quick update on this; after some googling it looks pretty likely
> that this is qemu-related.
> Do you remember if you've ever seen this on real nodes as well ?
> For the time being I will assume that this is a red herring, and will
> try to figure a way to get my test framework to test 5.0; it's still
> puzzling how this message can have polluted the xmlrpc channel, but
> your own report seems to confirm that the message can make it to a
> network connection.
So you got this dump on the server-side? Because if you're getting it
on the node, it's just qemu doing too much too fast, and the kernel
outputting that on the console.
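
One way to rule out a corrupted session string on the node side is to check it before it goes into the request. Below is a minimal sketch in Python, not the actual BootManager or conf_files code. It assumes the session value is plain base64 (the sample in the capture looks like it, but that is an assumption), and the helper names is_clean_session and build_getslivers_request are made up for illustration. Only the method name GetSlivers and the session/AuthMethod members come from the capture quoted below.

```python
import base64
import binascii
import xmlrpc.client


def is_clean_session(session: str) -> bool:
    """Return True if the string is non-empty, strict base64.

    Assumption: session keys are base64-encoded. Console noise such as
    'serial8250: too much work for irq4' contains spaces and colons,
    which strict validation rejects.
    """
    if not session:
        return False
    try:
        base64.b64decode(session, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True


def build_getslivers_request(session: str) -> str:
    """Build the XML-RPC body for GetSlivers using session auth.

    Raises ValueError instead of sending a polluted session string.
    """
    if not is_clean_session(session):
        raise ValueError("session string is not clean base64; not sending")
    auth = {"AuthMethod": "session", "session": session}
    # dumps() takes a tuple of parameters; the auth struct is the only one.
    return xmlrpc.client.dumps((auth,), methodname="GetSlivers")
```

If the check fails on the node, the pollution happened before the request was built (for instance when the session was read or written locally). If it passes there but the server still sees garbage, the problem is somewhere further along.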
Daniel
>>
>>> PS.
>>> one last thing; in this case the BootManager should basically
>>> refrain from reaching the kexec, as this means losing the node.
>>> On another, totally unrelated, problem, we have a node failing to
>>> perform 'chkconfig ntpd on' in the chroot, and in this case rather
>>> than trying to proceed anyway, BM gives up and the node remains in
>>> 'dbg' mode.
>>> I'd rather the opposite; ntp failing does not seem like such a big
>>> deal, while in the conf_files case the node becomes unreachable to
>>> us.
>>> Did anyone have a plan to review this BM logic?
>>
>> I agree. It seems counterintuitive to fail on a non-issue and
>> continue on a real one. I can work out the exit codes on conf_files
>> so that BM will put the node back into debug if it can't contact PLC.
>
> would be cool; thanks
>
>>
>>
>>
>> Faiyaz
>>
>>> _______________________________________________
>>> Devel mailing list
>>> Devel at lists.planet-lab.org
>>> https://lists.planet-lab.org/mailman/listinfo/devel