[Planetlab-devel] diagnosing a sneak bug in 5.0 & impact on 4.2

Larry Peterson llp at CS.Princeton.EDU
Fri May 30 07:52:54 EDT 2008


I'd like to plan and schedule the expected features as best we can. Marc, can you
update (and own) the Trac milestone(s) for this? I agree with Thierry that
we need to discuss the roadmap on devel.

Larry

On Fri, May 30, 2008 at 7:09 AM, Thierry Parmentelat
<thierry.parmentelat at sophia.inria.fr> wrote:
>
> On May 29, 2008, at 7:05 PM, Faiyaz Ahmed wrote:
>
>> Hi Thierry,
>>
>> This sounds fishy.
>>
>>> The thing is, this fails, so the actual boot-mode ssh keys don't get
>>> pushed onto the hard drive;
>>> however, the node proceeds anyway and does the kexec, resulting in an
>>> unreachable node.
>>> ======
>>> . When running conf_files, the command performs an xmlrpc call to the
>>> API to invoke GetSlivers, using the session auth method;
>>> the capture that I made shows that the POST data that gets into the HTTP
>>> session looks like this:
>>> "<?xml version='1.0'?>
>>> <methodCall>
>>> <methodName>GetSlivers</methodName>
>>> <params><param>\n<value><struct>\n<member>\n<name>session</name>
>>> <value><string>eTJPKAIex5XysDpHBcZgXTVZ25OlxLH/serial8250: too much work
>>> for irq4^M
>>> 7aROqL73mKM=</string>
>>>
>>> </value>\n</member>\n<member>\n<name>AuthMethod</name>\n<value><string>session</string></value>\n</member>\n
>>> </struct></value>\n</param>\n</params>\n</methodCall>\n"
>>> where the actual session string has this suspicious-looking part about
>>> 'serial8250: too much work for irq4'
>>
>> I've seen this message before, but it's usually the kernel of the machine
>> I'm SSH'ing from, or of the node I'm SSH'ing to, complaining about bad IRQ
>> settings.  The syslog is likely set to echo emerg messages to all terminals.
>> Do you think this is the case?
>>
>> Can you call this method with the same credentials used by the node, from
>> a different machine, and check the POST/GET results?
>
> A quick update on this: after some googling, it looks pretty likely that this
> is qemu-related.
> Do you remember whether you've ever seen this on real nodes as well?
> For the time being I will assume that this is a red herring, and will try to
> figure out a way to get my test framework to test 5.0; it's still puzzling how
> this message could have polluted the xmlrpc channel, but your own report seems
> to confirm that the message can make it onto a network connection.
>
>
>>
>>> PS.
>>> One last thing: in this case the BootManager should refrain
>>> from reaching the kexec, as this means losing the node.
>>> On another, totally unrelated problem, we have a node failing to perform
>>> 'chkconfig ntpd on' in the chroot, and in this case, rather than trying to
>>> proceed anyway, BM gives up and the node remains in 'dbg' mode.
>>> I'd rather have the opposite: ntp failing does not seem like such a big deal,
>>> while in the conf_files case the node becomes unreachable to us.
>>> Does anyone have a plan to review this BM logic?
>>
>> I agree.  It seems counterintuitive to fail on a non-issue and continue on
>> a real one.  I can work out the exit codes on conf_files so that BM will put
>> the node back into debug if it can't contact PLC.
>
> That would be cool; thanks.
>
>>
>>
>>
>> Faiyaz
>>
>>> _______________________________________________
>>> Devel mailing list
>>> Devel at lists.planet-lab.org
>>> https://lists.planet-lab.org/mailman/listinfo/devel
>>
>
>
>
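For anyone following Faiyaz's suggestion of replaying the call from another machine, here is a minimal Python sketch of the request in Thierry's capture: a GetSlivers call whose single struct argument carries the session token and AuthMethod 'session'. It assumes that an intact PLC session token contains only base64 characters, which is what the uncorrupted parts of the captured value suggest. The helper names and the API URL passed to replay_getslivers are hypothetical and not part of BootManager or PLCAPI.

```python
import re
import xmlrpc.client

# Assumption: an intact session token uses only the base64 alphabet,
# as the uncorrupted parts of the captured value suggest.
_SESSION_RE = re.compile(r"[A-Za-z0-9+/=]+")


def build_getslivers_body(session):
    """Serialize a GetSlivers call with session auth, like the captured POST data."""
    auth = {"session": session, "AuthMethod": "session"}
    return xmlrpc.client.dumps((auth,), methodname="GetSlivers")


def session_looks_intact(session):
    """Return False if the token has characters outside the base64 alphabet,
    e.g. an injected console line such as 'serial8250: too much work for irq4'."""
    return _SESSION_RE.fullmatch(session) is not None


def replay_getslivers(api_url, session):
    """Send the same call from another machine (api_url is hypothetical)."""
    if not session_looks_intact(session):
        raise ValueError("session token looks corrupted; not sending it")
    proxy = xmlrpc.client.ServerProxy(api_url, allow_none=True)
    return proxy.GetSlivers({"session": session, "AuthMethod": "session"})
```

A token with the serial8250 line embedded in it, as in the capture, fails session_looks_intact because of the colon and spaces. A check of this kind could let the node refuse to send a polluted token instead of getting an opaque authentication failure back from the API.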
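Thierry's point about BootManager failure handling comes down to marking each step as critical or not. A failed 'chkconfig ntpd on' currently stops the boot, yet a failed conf_files still reaches the kexec. The following Python sketch shows that policy using hypothetical names (Step, run_boot_steps, exit_code_ok). It is not the actual BootManager code. It assumes each step reports success as a boolean, for example derived from a command's exit code, as Faiyaz proposes for conf_files.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    run: Callable[[], bool]  # returns True on success
    critical: bool           # True if a failure must prevent the kexec


def exit_code_ok(argv: List[str]) -> bool:
    """Map a command's exit status to success (0) or failure (anything else)."""
    return subprocess.call(argv) == 0


def run_boot_steps(steps: List[Step], log: Callable[[str], None] = print) -> str:
    """Run steps in order; return 'dbg' on a critical failure, else 'boot'.

    The caller should only go on to kexec when this returns 'boot'."""
    for step in steps:
        if step.run():
            continue
        if step.critical:
            log(f"{step.name} failed: staying in dbg instead of kexec")
            return "dbg"
        log(f"{step.name} failed: not critical, continuing")
    return "boot"
```

Suppose 'chkconfig ntpd on' is marked non-critical and conf_files is marked critical. A node whose conf_files step fails then stops in 'dbg' and stays reachable, while an ntp failure is only logged.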


