[PL #2899] CKRM memory controller hangs

Marc E. Fiuczynski via RT devel at planet-lab.org
Wed Nov 3 14:17:17 EST 2004


Email Recipients (see http://www.planet-lab.org/Support)
       Requestor: acb at cs.princeton.edu
       Ticket Ccs: frankeh at watson.ibm.com, mef at cs.princeton.edu, sekharan at us.ibm.com

==================================================

Hi Chandra & Hubertus,

The two functions shrink_class and shrink_classes are ifdef'd out.  But the underlying functions -- refill_inactive_zone and shrink_caches -- are still called by other code (shrink_zone) in vmscan.c.  The loop in shrink_zone is essentially identical to the one in your shrink_class function, and interrupts stay disabled for too long inside the inner loop.

Maybe it is sufficient to move the "redo:" label inside these functions, before the spin_lock_irq(), and then explicitly unlock right before the goto back to the redo label, thereby letting pending interrupts through. Thoughts?
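
To make the idea concrete, here is a rough user-space sketch of the lock/unlock pattern, not the actual vmscan.c code. Everything in it is an assumption for illustration: fake_spin_lock_irq() and fake_spin_unlock_irq() are hypothetical stand-ins for the real primitives, shrink_batched() and its batch parameter are made-up names, and "pages" are just a counter. It only shows that with the redo label placed before the lock, no single irq-off stretch covers more than one batch of work.

```c
#include <assert.h>

/* Hypothetical stand-ins for spin_lock_irq()/spin_unlock_irq(). They do no
 * real locking; they only track how much work is done while interrupts
 * would be disabled. */
static int irqs_off;
static int held_work;      /* pages handled since the lock was last taken */
static int max_held_work;  /* longest irq-off stretch seen, in pages */

static void fake_spin_lock_irq(void)
{
    assert(!irqs_off);
    irqs_off = 1;
    held_work = 0;
}

static void fake_spin_unlock_irq(void)
{
    assert(irqs_off);
    irqs_off = 0;
    if (held_work > max_held_work)
        max_held_work = held_work;
}

/* Process nr_pages pages, but never more than batch pages per lock hold.
 * Returns the longest irq-off stretch, measured in pages.
 * batch must be positive, otherwise the loop would never make progress. */
static int shrink_batched(int nr_pages, int batch)
{
    assert(batch > 0);
    max_held_work = 0;
redo:
    fake_spin_lock_irq();
    while (nr_pages > 0 && held_work < batch) {
        nr_pages--;           /* stand-in for moving one page between lists */
        held_work++;
    }
    fake_spin_unlock_irq();   /* a pending interrupt could be serviced here */
    if (nr_pages > 0)
        goto redo;
    return max_held_work;
}
```

In this sketch, shrink_batched(100, 32) never holds the lock for more than 32 pages, while shrink_batched(100, 100) behaves like a loop that keeps the lock for the whole scan. Whether a bounded hold time like this is enough to stop the lost ticks in the real kernel is exactly the open question.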

The test case I have assumes classes are mapped to vservers, so I need to port it over to use just standard classes. However, maybe a better approach would be for you to work directly with our kernel on your test system? Specifically, I thought that Hubertus got an x206 system (w/ RSA II adapter) like the one he is loaning me for exactly this purpose. Maybe there is a way you can use it to do the test. This way we can avoid this back and forth between different systems.

Marc


> -----Original Message-----
> From: sekharan at us.ibm.com via RT [mailto:devel at planet-lab.org]
> Sent: Wednesday, November 03, 2004 11:48 AM
> To: acb at CS.Princeton.EDU
> Subject: Re: [PL #2899] CKRM memory controller hangs
> 
> 
> Email Recipients (see http://www.planet-lab.org/Support)
>        Requestor: acb at cs.princeton.edu
>        Ticket Ccs: frankeh at watson.ibm.com, mef at cs.princeton.edu, sekharan at us.ibm.com
> 
> ==================================================
> 
> On Tue, Nov 02, 2004 at 02:32:08PM -0500, Marc E. Fiuczynski wrote:
> > This problem with the TSC losing ticks has happened before in response
> > to shrinking the buffer cache and moving pages from the active list to
> > the inactive list and eventually swapping them out.
> > 
> > Unfortunately, the 2.6.8.1 kernel (and in fact subsequent kernels) has
> > a bug in switching from TSC to PIT as a time source. Specifically, the
> > switch attempts to invoke initialization code that has been removed
> > after boot. Consequently, the switch from TSC to PIT jumps into "random"
> > code, causing the system to hang. While I have developed a patch for
> > this (see attached file), it only masks the higher-level bug with the
> > CKRM modified memory subsystem. For this reason, I have not committed
> > this code into our tree.
> > 
> > Chandra, do you have any insight as to why the memory subsystem enters
> > such a tight loop in trying to shuffle pages from
> > active->inactive->swap, thereby locking out TSC ticks?
> 
> I think the problem is that we get into the shrink_class loop
> continuously (for all the classes) and hence try to do spin_lock_irq,
> and that gets into this state... BTW, you still have the shrink_class
> functionality (trying to shrink class at 90%) disabled, right?
> 
> BTW, can you send me the tests you are using that reproduce this problem?
> 
> > 
> > Marc
> > 
> > > -----Original Message-----
> > > From: devel-community-bounces at planet-lab.org
> > > [mailto:devel-community-bounces at planet-lab.org] On Behalf Of
> > > Andy Bavier via RT
> > > Sent: Tuesday, November 02, 2004 2:12 PM
> > > Subject: [PL #2899] CKRM memory controller hangs 
> > > 
> > > 
> > > Email Recipients (see http://www.planet-lab.org/Support)
> > >        Requestor: acb at cs.princeton.edu
> > > 
> > > 
> > > ==================================================
> > > 
> > > Tue Nov 02 14:12:25 2004: Request 2899 was acted upon.
> > > Transaction: Ticket created by Andy
> > > 
> > > Subject: CKRM memory controller hangs
> > > 
> > > With the CKRM memory controller active, a slice which is still well
> > > within its memory limit of 128MB can hang the machine by requesting
> > > memory when there is none free (even though swap is not full).  Often
> > > this generates a console message such as "spurious 8259A interrupt:
> > > IRQ 7" or "Losing too many ticks!  TSC cannot be used as a
> > > timesource...".
> > > 
> > > My test case is on ckrm.cs.princeton.edu in /root/acb/memtest.sh. It
> > > causes N slices to each allocate 100MB of data.  Usually 8 such
> > > slices are sufficient to hang the machine.
> > > 
> > > Andy
> > > 
> > > _______________________________________________
> > > Devel-community mailing list
> > > Devel-community at lists.planet-lab.org
> > > http://lists.planet-lab.org/mailman/listinfo/devel-community
> > > 
> 
> 
> 
> -- 
> 
> ----------------------------------------------------------------------
>     Chandra Seetharaman               | Be careful what you choose....
>               - sekharan at us.ibm.com   |      .......you may get it.
> ----------------------------------------------------------------------
> 




