[Planetlab-devel] diagnosing a sneak bug in 5.0 & impact on 4.2

Larry Peterson llp at CS.Princeton.EDU
Thu May 29 11:39:40 EDT 2008


What's 5.0?

On Thu, May 29, 2008 at 11:32 AM, Thierry Parmentelat
<thierry.parmentelat at sophia.inria.fr> wrote:
> Hi
>
> I'm currently in the middle of a 5.0 with most of the ideas that we had
> discussed in princeton available, even though this is not usable yet.
>
> Right now this new code is bugged somewhere, but the symptom seems to be
> rather far away from the initial cause, and I'm having a really hard time
> nailing it down.
>
> So I'm taking the time to write things down, that's helpful per se, and am
> sending this to you guys in case you'd have hints - or more generally
> comments on how we could avoid having this kind of situations where the
> initial bug and its effects are so remote.
>
> Sorry this is rather long, but I hope you'll find it worth it; in particular
> I've found things that might potentially impact 4.2 as well
>
> So, the symptoms:
> ======
> . my node takes off, performs installation and all, then before kexec'ing
> the bootmanager does something like
> chroot <newimg> /etc/init.d/conf_files start --noscripts
>
> the thing is, this fails, so the actual boot-mode ssh keys dont get pushed
> on the hard drive
> however the node proceeds and does the kexec, resulting in an unreachable
> node.
>
> ======
> . when running conf_files, the command performs an xmlrpc call to the api,
> for obtaining getslivers, and does so with a session auth method:
> the capture that I made shows that the postdata that gets into the http
> session looks like this
>
> "<?xml version='1.0'?>
> <methodCall>
> <methodName>GetSlivers</methodName>
> <params><param>\n<value><struct>\n<member>\n<name>session</name>
> <value><string>eTJPKAIex5XysDpHBcZgXTVZ25OlxLH/serial8250: too much work for
> irq4^M
> 7aROqL73mKM=</string>
> </value>\n</member>\n<member>\n<name>AuthMethod</name>\n<value><string>session</string></value>\n</member>\n
> </struct></value>\n</param>\n</params>\n</methodCall>\n"
>
> where the actual session string has this suspicious-looking part about
> 'serial8250: too much work for irq4'
>
> ======
> . at that time the planetlab DB contains this
> planetlab5=> select * from sessions;
>                  session_id                  | expires
> ----------------------------------------------+---------
>  eTJPKAIex5XysDpHBcZgXTVZ25OlxLH/7aROqL73mKM= |
>
> which seems more reasonable
> more generally there's not a single string containing 'serial' in the db sql
> dump
>
> ======
> . searching for 'serial8250' in the qemu log files is interesting; I've
> found this particularly funny:
>
> Primary network as returned from PLC: {'network': '138.96.0.0', 'ip':
> '138.96.250.221', 'dns1': '138.96.0.10', 'hostname': None, 'n\
> etmask': '255.255.0.0', 'interface_id': 1, 'broadcast': '138.96.255.255',
> 'method': 'static', 'mac': '02:34:56:00:00:01', 'nodeseri\
> al8250: too much work for irq4^M
> _id': 1, 'gateway': '138.96.248.250', 'dns2': '138.96.0.11', 'is_primary':
> True, 'type': 'ipv4', 'bwlimit': None, 'interface_settin\
> g_ids': []}^M
>
> where the 'node_id' key has the same string inserted again in the middle
>
> AND ALSO
>
> a bit further down, when rebuilding the initrd
> Using modules:
>  /lib/modules/2.6.22.19-vs2.3.0.34.7.planetlab.2008.05.29/kernel/drivers/usb/host/ehci-hcd.ko
> /lib/modules/2.6.22.19\
> -vs2.3.0.34.7.planetlab.2008.05.29/kernel/drivers/usb/host/ohci-hcd.ko
> /lib/modules/2.6.22.19-vs2.3.0.34.7.planetlab.2008.05.29seri\
> al8250: too much work for irq4^M
>
> ======
> . Much less funny is the fact that I found the same kind of things happen in
> the test logs for 4.2-rc5 !!!
>
> Primary network as returned from PLC: {'broadcast': '138.96.255.255',
> 'network': '138.96.0.0', 'ip': '138.96.250.221', 'dns1': '138.96.0.10',
> 'hostname': N\
> one, 'netmask': '255.255.0.0', 'gateway': '138.96.248.250',
> 'nodenetwork_id': 1, 'method': 'static', 'mserial8250: too much work for
> irq4^M
> ac': '02:34:56:00:00:01', 'node_id': 1, 'dns2': '138.96.0.11', 'is_primary':
> True, 'type': 'ipv4', 'bwlimit': None, 'nodenetwork_setting_ids': []}^M
>
> AND
>
> Using modules:
>  /lib/modules/2.6.22.19-vs2.3.0.34.7.planetlab/kernel/drivers/usb/host/ehci-hcd.ko
> /lib/modules/2.6.22.19-vs2.3.0.34.7.planetlab/kernel/driv\
> ers/usb/host/ohci-hcd.ko
> /lib/modules/2.6.22.19-vs2.3.0.34.7.planetlab/kernel/drivers/usb/host/uhci-hcdserial8250:
> too much work for irq4^M
>
> except that in the rc5 case I do not see the pollution reach the session key
> and so everything really works fine
>
> ======
> so this is where I am now.
> my best guess is that we have screwed up when setting up qemu; ahem, in
> fact, I hope so, because otherwise..
>
> -- thierry
>
> PS.
> one last thing; in this case the BootManager should basically refrain from
> reaching the kexec, as this means losing the node
> on another, totally unrelated, problem, we have a node failing to perform
> 'chkconfig ntpd on' in the chroot, and in this case rather than trying to
> proceed anyway, BM gives up and the node remains in 'dbg' mode.
> I'd rather the opposite; ntp failing does not seem like such a big deal,
> while in the conf_files case the node becomes unreachable to us.
> Did anyone have a plan to review this BM logic. ?
>
> _______________________________________________
> Devel mailing list
> Devel at lists.planet-lab.org
> https://lists.planet-lab.org/mailman/listinfo/devel
>
>



More information about the Devel mailing list