[PL #2899] CKRM memory controller hangs
Marc E. Fiuczynski via RT
devel at planet-lab.org
Wed Nov 3 14:17:17 EST 2004
Email Recipients (see http://www.planet-lab.org/Support)
Requestor: acb at cs.princeton.edu
Ticket Ccs: frankeh at watson.ibm.com, mef at cs.princeton.edu, sekharan at us.ibm.com
==================================================
Hi Chandra & Hubertus,
The two functions shrink_class and shrink_classes are ifdef'd out. But the underlying functions -- refill_inactive_zone and shrink_caches -- are still called by other code (shrink_zone) in vmscan.c. The loop in shrink_zone is essentially identical to the one in your shrink_class function, and interrupts stay disabled for too long within the inner loop.
Maybe it is sufficient to move the "redo:" label inside these functions, before the spin_lock_irq(), and then explicitly unlock right before doing the goto back to the redo label, thereby letting pending interrupts through. Thoughts?
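To make the idea concrete, here is a rough user-space sketch of the "redo:" pattern, not the actual vmscan.c code. Everything in it is a stand-in I made up for illustration: struct zone_sim, sim_lock_irq()/sim_unlock_irq() (playing the role of spin_lock_irq()/spin_unlock_irq()), refill_inactive_sim() and the SCAN_BATCH bound. The only point is the control flow: take the lock after the label, do a bounded amount of work, and unlock before jumping back.

```c
#include <assert.h>
#include <stddef.h>

#define SCAN_BATCH 32   /* assumed bound on pages handled per critical section */

/* Hypothetical model of a zone; not the kernel's struct zone. */
struct zone_sim {
    int    irqs_off;           /* 1 while the simulated lock is held */
    size_t nr_active;          /* pages on the simulated active list */
    size_t nr_inactive;        /* pages on the simulated inactive list */
    size_t lock_acquisitions;  /* how many times the lock was taken */
    size_t max_hold;           /* most pages moved in one critical section */
};

/* Stand-in for spin_lock_irq(): marks interrupts as disabled. */
void sim_lock_irq(struct zone_sim *z)
{
    assert(!z->irqs_off);      /* this model does not allow nested locking */
    z->irqs_off = 1;
    z->lock_acquisitions++;
}

/* Stand-in for spin_unlock_irq(): interrupts may be delivered again. */
void sim_unlock_irq(struct zone_sim *z)
{
    assert(z->irqs_off);
    z->irqs_off = 0;
}

/* Move up to nr_to_scan pages from active to inactive, never holding the
 * simulated lock for more than SCAN_BATCH pages at a time. */
size_t refill_inactive_sim(struct zone_sim *z, size_t nr_to_scan)
{
    size_t moved = 0;
    size_t held;

redo:
    sim_lock_irq(z);           /* the label sits before the lock */
    held = 0;

    while (nr_to_scan > 0 && z->nr_active > 0) {
        if (held == SCAN_BATCH) {
            /* Bounded work done: unlock explicitly, then start over. */
            sim_unlock_irq(z);
            goto redo;
        }
        z->nr_active--;
        z->nr_inactive++;
        nr_to_scan--;
        moved++;
        held++;
        if (held > z->max_hold)
            z->max_hold = held;
    }

    sim_unlock_irq(z);
    return moved;
}
```

With 100 pages on the active list and a request to scan 100, this takes the simulated lock four times and never moves more than 32 pages per hold. In the real kernel the right batch size, and whether any list state needs revalidating after the lock is retaken, would have to be worked out against the actual code.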
The test case I have assumes classes are mapped to vservers; I need to port it over to use just standard classes. However, maybe a better approach would be for you to work directly with our kernel on your test system? Specifically, I thought that Hubertus got an x206 system (w/ RSA II adapter) like the one he is loaning me for exactly this purpose. Maybe there is a way you can use it to do the test. That way we can avoid going back and forth between different systems.
Marc
> -----Original Message-----
> From: sekharan at us.ibm.com via RT [mailto:devel at planet-lab.org]
> Sent: Wednesday, November 03, 2004 11:48 AM
> To: acb at CS.Princeton.EDU
> Subject: Re: [PL #2899] CKRM memory controller hangs
>
>
>
> On Tue, Nov 02, 2004 at 02:32:08PM -0500, Marc E. Fiuczynski wrote:
> > This problem with the TSC losing ticks has happened before in
> > response to shrinking the buffer cache and moving pages from the
> > active list to the inactive list and eventually swapping them out.
> >
> > Unfortunately, the 2.6.8.1 kernel (and in fact subsequent
> > kernels) has a bug in switching from TSC to PIT as a time source.
> > Specifically, the switch attempts to invoke initialization code
> > that has been removed after boot. Consequently, the switch from
> > TSC to PIT jumps into "random" code, causing the system to hang.
> > While I have developed a patch for this (see attached
> > file), it only masks the higher-level bug in the CKRM-modified
> > memory subsystem. For this reason, I have not committed this code
> > into our tree.
> >
> > Chandra, do you have any insight as to why the memory subsystem
> > enters such a tight loop in trying to shuffle pages from
> > active->inactive->swap, thereby locking out TSC ticks?
>
> I think the problem is that we get into the shrink_class loop
> continuously (for all the classes) and hence keep trying to do
> spin_lock_irq, which gets us into this state... BTW, you still have
> the shrink_class functionality (trying to shrink a class at 90%)
> disabled, right?
>
> BTW, can you send me the tests you are using that reproduce this problem?
>
> >
> > Marc
> >
> > > -----Original Message-----
> > > From: devel-community-bounces at planet-lab.org
> > > > [mailto:devel-community-bounces at planet-lab.org] On Behalf Of
> > > > Andy Bavier via RT
> > > Sent: Tuesday, November 02, 2004 2:12 PM
> > > Subject: [PL #2899] CKRM memory controller hangs
> > >
> > >
> > >
> > > Tue Nov 02 14:12:25 2004: Request 2899 was acted upon.
> > > Transaction: Ticket created by Andy
> > >
> > > Subject: CKRM memory controller hangs
> > >
> > > > With the CKRM memory controller active, a slice which is still
> > > > well within its memory limit of 128MB can hang the machine by
> > > > requesting memory when there is none free (even though swap is
> > > > not full). Often this generates a console message such as
> > > > "spurious 8259A interrupt: IRQ 7" or "Losing too many ticks!
> > > > TSC cannot be used as a timesource...".
> > >
> > > > My test case is on ckrm.cs.princeton.edu in /root/acb/memtest.sh. It
> > > > causes N slices to each allocate 100MB of data. Usually 8 such
> > > > slices are sufficient to hang the machine.
> > >
> > > Andy
> > >
> > >
>
>
>
> --
>
> ----------------------------------------------------------------------
> Chandra Seetharaman | Be careful what you choose....
> - sekharan at us.ibm.com | .......you may get it.
> ----------------------------------------------------------------------
>