[PL #2829] RE: [PL #2937] Re: rmdir freezing system

Marc E. Fiuczynski via RT devel at planet-lab.org
Fri Nov 5 13:17:10 EST 2004


Email Recipients (see http://www.planet-lab.org/Support)
       Requestor: frankeh at watson.ibm.com, mlhuang at cs.princeton.edu
       Ticket Ccs: mef at cs.princeton.edu, mlhuang at cs.princeton.edu, nagar at watson.ibm.com, sekharan at us.ibm.com

==================================================

What is far more critical is a fix to the memory controller.  That is something we need ASAP.

Marc

> -----Original Message-----
> From: devel-community-bounces at planet-lab.org
> [mailto:devel-community-bounces at planet-lab.org]On Behalf Of Marc E.
> Fiuczynski via RT
> Sent: Friday, November 05, 2004 12:47 PM
> To: frankeh at watson.ibm.com; mlhuang at CS.Princeton.EDU
> Subject: RE: [PL #2829] RE: [PL #2937] Re: rmdir freezing system
> 
> 
> We can survive without this fix for now. In fact, we are already 
> deploying our first release candidate.
> 
> > -----Original Message-----
> > From: frankeh at watson.ibm.com via RT [mailto:devel at planet-lab.org]
> > Sent: Friday, November 05, 2004 12:36 PM
> > To: mlhuang at CS.Princeton.EDU
> > Subject: Re: [PL #2829] RE: [PL #2937] Re: rmdir freezing system
> > 
> > 
> > Marc E. Fiuczynski via RT wrote:
> > > 
> > >>>However, I am still seeing a freeze for our 
> > >>>"/rcfs/taskclass/system" class, which is our
> > >>>default class. When I rmdir this one, I get an 
> > >>>"infinite" stack trace dumped to the console. 
> > >>
> > >>OK, that must be related to a different problem.
> > > 
> > > 
> > > Unfortunately, it is nontrivial for me to attach a serial console to
> > > the box. For this reason, I cannot get your stack trace for it today.
> > > I tried debugging the same problem by running the same kernel as a
> > > guest on the qemu pc emulator. It appears that rq_get_next_queue() in
> > > rq_get_next_task() returns a queue data structure whose array
> > > (queue->array) is NULL. This causes all hell to break loose and the
> > > system goes into the telltale infinite stack dump.
> > 
> > OK, that is already more than we can ask for ...
> > Put a BUG_ON(queue->array == NULL) and stop the kernel right
> > then and there when it happens.
> > 
> > I have an idea what is happening, actually I am certain I know.
> > Something to do with our optimizations on class dequeueing.
> > The local class runqueue is still enqueued even though it has no
> > task running anymore.
> > 
> > Working with Haoqiang today on the share accuracy and weight adjustment
> > problem. I should be able to get to this particular problem tomorrow
> > evening and fix it. Can you survive until then?
> > 
> > > 
> > > So maybe it is possible for the queue data structure to hang around
> > > for a while and not be properly released. I'll try to track this down
> > > further by using gdb to walk around in the kernel running on the qemu
> > > pc emulator.
> > > 
> > > 
> > >>This is definitely a fix that needs to go in.
> > > 
> > > 
> > > Glad that the fix for the other bug definitely needs to go into your code.
> > > 
> > > 
> > >>I have not upgraded our x206 with your YUM information, since
> > >>I can break this with your kernel (but not ours).
> > > 
> > > 
> > > While you are waiting for me, please do the full upgrade. This way
> > > you can play with the exact problems I see, and thereby also harden
> > > your own code for other distro releases. ;)
> > > 
> > > Cheers,
> > > Marc
> > > 
> > > 
> > > 
> > > 
> > 
> 
> 
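
The check suggested in the quoted message, BUG_ON(queue->array == NULL) right after rq_get_next_queue() returns, can be illustrated with a small userspace sketch. Everything below is an assumption made for illustration: the struct layouts, SKETCH_BUG_ON, queue_is_usable() and next_task_count() are hypothetical stand-ins, not the actual CKRM scheduler code, which is not shown in this thread. The point is only that trapping at the first sign of a queue with a NULL array stops execution at the faulty site instead of letting a later dereference produce the endless stack dump.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the scheduler structures (assumed layout). */
struct prio_array { int nr_active; };
struct class_queue { struct prio_array *array; };

/* Userspace analogue of the kernel's BUG_ON(): report the failed
 * condition with its location, then stop immediately. */
#define SKETCH_BUG_ON(cond)                                    \
    do {                                                       \
        if (cond) {                                            \
            fprintf(stderr, "BUG: %s at %s:%d\n",              \
                    #cond, __FILE__, __LINE__);                \
            abort();                                           \
        }                                                      \
    } while (0)

/* Returns nonzero when the queue can be dereferenced safely. */
static int queue_is_usable(const struct class_queue *queue)
{
    return queue != NULL && queue->array != NULL;
}

/* Models the spot right after the queue lookup: trap on a stale queue
 * before touching queue->array. */
static int next_task_count(const struct class_queue *queue)
{
    SKETCH_BUG_ON(!queue_is_usable(queue));
    return queue->array->nr_active;
}
```

Calling next_task_count() on a queue whose array pointer is NULL aborts with a single message naming the failed condition, which is the behaviour the BUG_ON suggestion is after. In the kernel itself the macro would be the real BUG_ON and the check would sit inside rq_get_next_task().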