[Planetlab-devel] diagnosing a sneak bug in 5.0 & impact on 4.2

Thierry Parmentelat thierry.parmentelat at sophia.inria.fr
Thu May 29 11:32:05 EDT 2008


Hi

I'm currently in the middle of a 5.0 with most of the ideas that we  
had discussed in princeton available, even though this is not usable  
yet.

Right now this new code is bugged somewhere, but the symptom seems to  
be rather far away from the initial cause, and I'm having a really  
hard time nailing it down.

So I'm taking the time to write things down, that's helpful per se,  
and am sending this to you guys in case you'd have hints - or more  
generally comments on how we could avoid having this kind of  
situations where the initial bug and its effects are so remote.

Sorry this is rather long, but I hope you'll find it worth it; in  
particular I've found things that might potentially impact 4.2 as well

So, the symptoms:
======
. my node takes off, performs installation and all, then before  
kexec'ing the bootmanager does something like
chroot <newimg> /etc/init.d/conf_files start --noscripts

the thing is, this fails, so the actual boot-mode ssh keys dont get  
pushed on the hard drive
however the node proceeds and does the kexec, resulting in an  
unreachable node.

======
. when running conf_files, the command performs an xmlrpc call to the  
api, for obtaining getslivers, and does so with a session auth method:
the capture that I made shows that the postdata that gets into the  
http session looks like this

"<?xml version='1.0'?>
<methodCall>
<methodName>GetSlivers</methodName>
<params><param>\n<value><struct>\n<member>\n<name>session</name>
<value><string>eTJPKAIex5XysDpHBcZgXTVZ25OlxLH/serial8250: too much  
work for irq4^M
7aROqL73mKM=</string>
</value>\n</member>\n<member>\n<name>AuthMethod</name> 
\n<value><string>session</string></value>\n</member>\n
</struct></value>\n</param>\n</params>\n</methodCall>\n"

where the actual session string has this suspicious-looking part about  
'serial8250: too much work for irq4'

======
. at that time the planetlab DB contains this
planetlab5=> select * from sessions;
                   session_id                  | expires
----------------------------------------------+---------
  eTJPKAIex5XysDpHBcZgXTVZ25OlxLH/7aROqL73mKM= |

which seems more reasonable
more generally there's not a single string containing 'serial' in the  
db sql dump

======
. searching for 'serial8250' in the qemu log files is interesting;  
I've found this particularly funny:

Primary network as returned from PLC: {'network': '138.96.0.0', 'ip':  
'138.96.250.221', 'dns1': '138.96.0.10', 'hostname': None, 'n\
etmask': '255.255.0.0', 'interface_id': 1, 'broadcast':  
'138.96.255.255', 'method': 'static', 'mac': '02:34:56:00:00:01',  
'nodeseri\
al8250: too much work for irq4^M
_id': 1, 'gateway': '138.96.248.250', 'dns2': '138.96.0.11',  
'is_primary': True, 'type': 'ipv4', 'bwlimit': None, 'interface_settin\
g_ids': []}^M

where the 'node_id' key has the same string inserted again in the middle

AND ALSO

a bit further down, when rebuilding the initrd
Using modules:  /lib/modules/2.6.22.19-vs2.3.0.34.7.planetlab. 
2008.05.29/kernel/drivers/usb/host/ehci-hcd.ko /lib/modules/2.6.22.19\
-vs2.3.0.34.7.planetlab.2008.05.29/kernel/drivers/usb/host/ohci- 
hcd.ko /lib/modules/2.6.22.19-vs2.3.0.34.7.planetlab.2008.05.29seri\
al8250: too much work for irq4^M

======
. Much less funny is the fact that I found the same kind of things  
happen in the test logs for 4.2-rc5 !!!

Primary network as returned from PLC: {'broadcast': '138.96.255.255',  
'network': '138.96.0.0', 'ip': '138.96.250.221', 'dns1':  
'138.96.0.10', 'hostname': N\
one, 'netmask': '255.255.0.0', 'gateway': '138.96.248.250',  
'nodenetwork_id': 1, 'method': 'static', 'mserial8250: too much work  
for irq4^M
ac': '02:34:56:00:00:01', 'node_id': 1, 'dns2': '138.96.0.11',  
'is_primary': True, 'type': 'ipv4', 'bwlimit': None,  
'nodenetwork_setting_ids': []}^M

AND

Using modules:  /lib/modules/2.6.22.19-vs2.3.0.34.7.planetlab/kernel/ 
drivers/usb/host/ehci-hcd.ko /lib/modules/2.6.22.19- 
vs2.3.0.34.7.planetlab/kernel/driv\
ers/usb/host/ohci-hcd.ko /lib/modules/2.6.22.19-vs2.3.0.34.7.planetlab/ 
kernel/drivers/usb/host/uhci-hcdserial8250: too much work for irq4^M

except that in the rc5 case I do not see the pollution reach the  
session key and so everything really works fine

======
so this is where I am now.
my best guess is that we have screwed up when setting up qemu; ahem,  
in fact, I hope so, because otherwise..

-- thierry

PS.
one last thing; in this case the BootManager should basically refrain  
from reaching the kexec, as this means losing the node
on another, totally unrelated, problem, we have a node failing to  
perform 'chkconfig ntpd on' in the chroot, and in this case rather  
than trying to proceed anyway, BM gives up and the node remains in  
'dbg' mode.
I'd rather the opposite; ntp failing does not seem like such a big  
deal, while in the conf_files case the node becomes unreachable to us.
Did anyone have a plan to review this BM logic. ?



More information about the Devel mailing list