[Planetlab-devel] diagnosing a sneak bug in 5.0 & impact on 4.2
Thierry Parmentelat
thierry.parmentelat at sophia.inria.fr
Thu May 29 11:32:05 EDT 2008
Hi
I'm currently in the middle of a 5.0 with most of the ideas that we
had discussed in princeton available, even though this is not usable
yet.
Right now this new code is bugged somewhere, but the symptom seems to
be rather far away from the initial cause, and I'm having a really
hard time nailing it down.
So I'm taking the time to write things down, that's helpful per se,
and am sending this to you guys in case you'd have hints - or more
generally comments on how we could avoid having this kind of
situations where the initial bug and its effects are so remote.
Sorry this is rather long, but I hope you'll find it worth it; in
particular I've found things that might potentially impact 4.2 as well
So, the symptoms:
======
. my node takes off, performs installation and all, then before
kexec'ing the bootmanager does something like
chroot <newimg> /etc/init.d/conf_files start --noscripts
the thing is, this fails, so the actual boot-mode ssh keys dont get
pushed on the hard drive
however the node proceeds and does the kexec, resulting in an
unreachable node.
======
. when running conf_files, the command performs an xmlrpc call to the
api, for obtaining getslivers, and does so with a session auth method:
the capture that I made shows that the postdata that gets into the
http session looks like this
"<?xml version='1.0'?>
<methodCall>
<methodName>GetSlivers</methodName>
<params><param>\n<value><struct>\n<member>\n<name>session</name>
<value><string>eTJPKAIex5XysDpHBcZgXTVZ25OlxLH/serial8250: too much
work for irq4^M
7aROqL73mKM=</string>
</value>\n</member>\n<member>\n<name>AuthMethod</name>
\n<value><string>session</string></value>\n</member>\n
</struct></value>\n</param>\n</params>\n</methodCall>\n"
where the actual session string has this suspicious-looking part about
'serial8250: too much work for irq4'
======
. at that time the planetlab DB contains this
planetlab5=> select * from sessions;
session_id | expires
----------------------------------------------+---------
eTJPKAIex5XysDpHBcZgXTVZ25OlxLH/7aROqL73mKM= |
which seems more reasonable
more generally there's not a single string containing 'serial' in the
db sql dump
======
. searching for 'serial8250' in the qemu log files is interesting;
I've found this particularly funny:
Primary network as returned from PLC: {'network': '138.96.0.0', 'ip':
'138.96.250.221', 'dns1': '138.96.0.10', 'hostname': None, 'n\
etmask': '255.255.0.0', 'interface_id': 1, 'broadcast':
'138.96.255.255', 'method': 'static', 'mac': '02:34:56:00:00:01',
'nodeseri\
al8250: too much work for irq4^M
_id': 1, 'gateway': '138.96.248.250', 'dns2': '138.96.0.11',
'is_primary': True, 'type': 'ipv4', 'bwlimit': None, 'interface_settin\
g_ids': []}^M
where the 'node_id' key has the same string inserted again in the middle
AND ALSO
a bit further down, when rebuilding the initrd
Using modules: /lib/modules/2.6.22.19-vs2.3.0.34.7.planetlab.
2008.05.29/kernel/drivers/usb/host/ehci-hcd.ko /lib/modules/2.6.22.19\
-vs2.3.0.34.7.planetlab.2008.05.29/kernel/drivers/usb/host/ohci-
hcd.ko /lib/modules/2.6.22.19-vs2.3.0.34.7.planetlab.2008.05.29seri\
al8250: too much work for irq4^M
======
. Much less funny is the fact that I found the same kind of things
happen in the test logs for 4.2-rc5 !!!
Primary network as returned from PLC: {'broadcast': '138.96.255.255',
'network': '138.96.0.0', 'ip': '138.96.250.221', 'dns1':
'138.96.0.10', 'hostname': N\
one, 'netmask': '255.255.0.0', 'gateway': '138.96.248.250',
'nodenetwork_id': 1, 'method': 'static', 'mserial8250: too much work
for irq4^M
ac': '02:34:56:00:00:01', 'node_id': 1, 'dns2': '138.96.0.11',
'is_primary': True, 'type': 'ipv4', 'bwlimit': None,
'nodenetwork_setting_ids': []}^M
AND
Using modules: /lib/modules/2.6.22.19-vs2.3.0.34.7.planetlab/kernel/
drivers/usb/host/ehci-hcd.ko /lib/modules/2.6.22.19-
vs2.3.0.34.7.planetlab/kernel/driv\
ers/usb/host/ohci-hcd.ko /lib/modules/2.6.22.19-vs2.3.0.34.7.planetlab/
kernel/drivers/usb/host/uhci-hcdserial8250: too much work for irq4^M
except that in the rc5 case I do not see the pollution reach the
session key and so everything really works fine
======
so this is where I am now.
my best guess is that we have screwed up when setting up qemu; ahem,
in fact, I hope so, because otherwise..
-- thierry
PS.
one last thing; in this case the BootManager should basically refrain
from reaching the kexec, as this means losing the node
on another, totally unrelated, problem, we have a node failing to
perform 'chkconfig ntpd on' in the chroot, and in this case rather
than trying to proceed anyway, BM gives up and the node remains in
'dbg' mode.
I'd rather the opposite; ntp failing does not seem like such a big
deal, while in the conf_files case the node becomes unreachable to us.
Did anyone have a plan to review this BM logic. ?
More information about the Devel
mailing list