[PL #2829] RE: [PL #2937] Re: rmdir freezing system

frankeh at watson.ibm.com via RT devel at planet-lab.org
Fri Nov 5 12:35:43 EST 2004


Email Recipients (see http://www.planet-lab.org/Support)
       Requestor: frankeh at watson.ibm.com, mlhuang at cs.princeton.edu
       Ticket Ccs: mef at cs.princeton.edu, mlhuang at cs.princeton.edu, nagar at watson.ibm.com, sekharan at us.ibm.com

==================================================



Marc E. Fiuczynski via RT wrote:
>>>However, I am still seeing a freeze for our 
>>>"/rcfs/taskclass/system" class, which is our
>>>default class. When I rmdir this one, I get an 
>>>"infinite" stack trace dumped to the console. 
>>
>>OK, that must be related to a different problem.
> 
> 
> Unfortunately, it is nontrivial for me to attach a serial console to the box.
> For this reason, I cannot get you a stack trace for it today.  I tried debugging 
> the same problem by running the same kernel as a guest on the qemu pc emulator. 
> It appears that rq_get_next_queue() in rq_get_next_task() returns a queue data 
> structure whose array (queue->array) is NULL. This causes all hell to break
> loose and the system goes into the telltale infinite stack dump.

OK, that is already more than we can ask for ...
Put a BUG_ON(queue->array == NULL) and stop the kernel right
then and there when it happens.
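
To illustrate the kind of check I mean, here is a rough userspace sketch.
The struct layouts and the checked_array() helper are made up for this
example and do not match the real scheduler code; BUG_ON() is emulated
here with abort(), whereas in the kernel the real macro comes from the
kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's BUG_ON(); the real macro is provided
 * by the kernel headers and stops with a stack trace. */
#ifndef BUG_ON
#define BUG_ON(cond)                                              \
    do {                                                          \
        if (cond) {                                               \
            fprintf(stderr, "BUG: %s at %s:%d\n",                 \
                    #cond, __FILE__, __LINE__);                   \
            abort();                                              \
        }                                                         \
    } while (0)
#endif

/* Hypothetical, heavily simplified types; the real structs differ. */
struct prio_array { int nr_active; };
struct class_queue { struct prio_array *array; };

/* Hypothetical helper: validate the queue that the lookup returned
 * before anything dereferences queue->array. */
struct prio_array *checked_array(struct class_queue *queue)
{
    BUG_ON(queue == NULL);
    BUG_ON(queue->array == NULL); /* stop right where the bad queue shows up */
    return queue->array;
}
```

With a check like this placed right after the queue lookup, the first bad
queue stops the machine at a single, readable trace instead of the
repeating dump.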

I have an idea what is happening; actually, I am certain I know.
It has to do with our optimizations on class dequeueing:
the local class runqueue is still enqueued even though it has no
task running anymore.
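
As a rough sketch of the invariant that seems to be violated (not the
actual fix, and with hypothetical names that do not match the real
scheduler code): a class runqueue should be on the per-CPU list exactly
when it holds at least one task.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of a per-class local runqueue. */
struct class_rq {
    int nr_running;  /* tasks currently queued in this class */
    bool enqueued;   /* is this class runqueue on the per-CPU list? */
};

/* Add a task; put the class runqueue on the list when it gets its
 * first task. */
void class_rq_add_task(struct class_rq *rq)
{
    if (rq->nr_running++ == 0)
        rq->enqueued = true;
}

/* Remove a task; take the class runqueue off the list once it is empty. */
void class_rq_del_task(struct class_rq *rq)
{
    assert(rq->nr_running > 0);
    if (--rq->nr_running == 0)
        rq->enqueued = false;
}

/* Invariant: enqueued exactly when there is at least one task. */
bool class_rq_consistent(const struct class_rq *rq)
{
    return rq->enqueued == (rq->nr_running > 0);
}
```

If an optimization skips the dequeue step when the last task leaves, the
invariant check fails, which would match an empty class runqueue still
being picked as the next queue.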

I am working with Haoqiang today on the share accuracy and weight
adjustment problem. I should be able to get to this particular problem
tomorrow evening and fix it. Can you survive until then?

> 
> So maybe it is possible for the queue data structure to hang around for a
> while and not be properly released.
> I'll try to track this down further by using gdb to walk around
> in the kernel running on the qemu pc emulator.
> 
> 
>>This is definitely a fix that needs to go in.
> 
> 
> Glad that the fix for the other bug definitely needs to go into your code.
> 
> 
>>I have not upgraded our x206 with your YUM information, since 
>>I can break this with your kernel (but not ours).
> 
> 
> While you are waiting for me, please do the full upgrade. This way you can play with the exact problems I see, and thereby also harden your own code for other distro releases. ;)
> 
> Cheers,
> Marc
> 
> 
> 
> 



