[Planetlab-devel] diagnosing a sneak bug in 5.0 & impact on 4.2
Thierry Parmentelat
thierry.parmentelat at sophia.inria.fr
Fri May 30 07:09:07 EDT 2008
On May 29, 2008, at 7:05 PM, Faiyaz Ahmed wrote:
> Hi Thierry,
>
> This sounds fishy.
>
>> the thing is, this fails, so the actual boot-mode ssh keys don't get
>> pushed onto the hard drive
>> however the node proceeds and does the kexec, resulting in an
>> unreachable node.
>> ======
>> . when running conf_files, the command performs an xmlrpc call to
>> the api, for obtaining getslivers, and does so with a session auth
>> method:
>> the capture that I made shows that the postdata that gets into the
>> http session looks like this
>> "<?xml version='1.0'?>
>> <methodCall>
>> <methodName>GetSlivers</methodName>
>> <params><param>\n<value><struct>\n<member>\n<name>session</name>
>> <value><string>eTJPKAIex5XysDpHBcZgXTVZ25OlxLH/serial8250: too much
>> work for irq4^M
>> 7aROqL73mKM=</string>
>> </value>\n</member>\n<member>\n<name>AuthMethod</name>
>> \n<value><string>session</string></value>\n</member>\n </struct></
>> value>\n</param>\n</params>\n</methodCall>\n"
>> where the actual session string has this suspicious-looking part
>> about 'serial8250: too much work for irq4'
>
> I've seen this message before but it's usually the kernel of the
> machine I'm SSH'ing from or the node I'm SSH'ing to complaining
> about bad irq settings. The syslog is likely set to echo emerg to
> all terminals. Do you think this is the case?
>
> Can you call this method, with the same credentials used by the node
> from a different machine and check the post/get results?
A quick update on this: after some googling it looks pretty likely
that this is qemu-related.
Do you remember if you've ever seen this on real nodes as well?
For the time being I will assume that this is a red herring, and will
try to figure out a way to get my test framework to test 5.0. It's
still puzzling how this message can have polluted the xmlrpc channel,
but your own report seems to confirm that the message can make it onto
a network connection.
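
To make the pollution easier to spot, here is a minimal sketch of a
sanity check on the session string before it goes into the GetSlivers
call. It assumes the session is a single-line base64 token (which is
what the capture looks like, but I have not confirmed it). The names
is_clean_session and build_getslivers_request are made up for
illustration and are not part of the BootManager or conf_files code.
It uses Python 3's xmlrpc.client; the code on the nodes would use the
older xmlrpclib.

```python
import base64
import binascii
import xmlrpc.client


def is_clean_session(session):
    """Return True if session looks like a single-line base64 token.

    Assumption: session keys are plain base64 with no whitespace.
    """
    if not session or any(ch.isspace() for ch in session):
        return False
    try:
        # validate=True rejects characters outside the base64 alphabet,
        # such as the ':' in a stray kernel message.
        base64.b64decode(session, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True


def build_getslivers_request(session):
    """Build the XML-RPC body for GetSlivers, refusing a polluted session."""
    if not is_clean_session(session):
        raise ValueError("session string looks corrupted: %r" % (session,))
    auth = {"AuthMethod": "session", "session": session}
    return xmlrpc.client.dumps((auth,), methodname="GetSlivers")
```

With a check like this, the string from the capture (with the
'serial8250' text in the middle) would be rejected before the request
is sent, which at least tells us whether the corruption happens before
or after the session is read.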
>
>> PS.
>> one last thing; in this case the BootManager should basically
>> refrain from reaching the kexec, as this means losing the node.
>> On another, totally unrelated, problem, we have a node failing to
>> perform 'chkconfig ntpd on' in the chroot, and in this case rather
>> than trying to proceed anyway, BM gives up and the node remains in
>> 'dbg' mode.
>> I'd rather the opposite; ntp failing does not seem like such a big
>> deal, while in the conf_files case the node becomes unreachable to
>> us.
>> Does anyone have a plan to review this BM logic?
>
> I agree. It seems counterintuitive to fail on a non-issue and
> continue on a real one. I can work out the exit codes on conf_files
> so BM will put the node back into debug if it can't contact PLC.
would be cool; thanks
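
To be concrete about the behaviour I have in mind, here is a rough
sketch, not the actual BM code. The function decide_boot, the step
names, and the 'critical' flag are all hypothetical. The idea is only
that a failed step sends the node to 'dbg' when losing it would make
the node unreachable (conf_files), while a failure like ntpd is
recorded and the boot proceeds.

```python
def decide_boot(steps):
    """Decide the node state from a list of step results.

    steps: list of (name, exit_code, critical) tuples, where critical
    marks steps whose failure would leave the node unreachable.
    Returns ('dbg' or 'boot', list of names of failed non-critical steps).
    """
    warnings = []
    for name, exit_code, critical in steps:
        if exit_code == 0:
            continue
        if critical:
            # Do not reach the kexec: stay in debug mode instead.
            return "dbg", warnings
        # Non-critical failure: note it and carry on.
        warnings.append(name)
    return "boot", warnings
```

For example, decide_boot([("conf_files", 1, True)]) keeps the node in
'dbg', whereas a failing ("ntpd", 1, False) only shows up in the
warnings and the node still boots.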
>
>
>
> Faiyaz
>
>> _______________________________________________
>> Devel mailing list
>> Devel at lists.planet-lab.org
>> https://lists.planet-lab.org/mailman/listinfo/devel
>