[Planetlab-devel] identifying source of corrupt filesystems on pl

Marc Fiuczynski mef at CS.Princeton.EDU
Fri Oct 16 16:45:46 EDT 2009


Dear Stephen,

Nearly two months ago you wrote about FS corruption problems.  Since
then we've identified an undocumented "feature" of fsck: it returns a
non-zero status code even when the filesystem checks out OK, and the
only way to get fsck to exit with a zero status code is to run the
forced invocation (i.e., fsck -f -y).  Given this, is it possible that
the corruption we are seeing out there is less severe than it first
appeared?  Or were these cases truly ones where there was a lot of FS
corruption?
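
To make the distinction concrete, here is a minimal sketch of how a boot
script might interpret the fsck exit status as the bitmask described in
fsck(8), rather than treating every non-zero value as a failure.  The
thread does not say which status code was actually observed, so the
function names and the policy in needs_forced_fsck are hypothetical and
only illustrate the idea.

```python
# Exit-status bits as listed in the fsck(8) man page; the exit code is
# the sum of the conditions that apply.
FSCK_BITS = {
    1: "filesystem errors corrected",
    2: "system should be rebooted",
    4: "filesystem errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "checking canceled by user request",
    128: "shared-library error",
}


def decode_fsck_status(status):
    """Return the human-readable conditions encoded in an fsck exit status."""
    if status == 0:
        return ["no errors"]
    return [meaning for bit, meaning in sorted(FSCK_BITS.items()) if status & bit]


def needs_forced_fsck(status):
    """Hypothetical policy (an assumption, not PlanetLab's actual logic):
    bits 1 and 2 mean the check itself succeeded, so only higher bits
    call for a forced run."""
    return bool(status & ~0b11)


if __name__ == "__main__":
    for status in (0, 1, 4, 12):
        print(status, decode_fsck_status(status), needs_forced_fsck(status))
```

Under this policy a status of 1 (errors found and corrected) would not
count as severe corruption, while a status of 4 would.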

Marc


On Aug 25, 2009, at 10:29 PM, Chris Tracy wrote:

>>>>
>>>> Hello,
>>>>
>>>> Over the last three weeks PlanetLab has seen on the order of 50
>>>> occurrences of severely corrupt filesystems that needed a forced
>>>> fsck to repair the filesystem and continue with the boot.  The
>>>> problem has been occurring longer, but the numbers before the
>>>> last three weeks are not known by me.
>>>>
>>>> The nodes that are affected go offline unexpectedly either due to
>>>> power failure or a kernel crash.  When the system comes back up,
>>>> the standard fsck fails, and requires a forced invocation
>>>> (fsck -f -y).
>>>>
>>>> This is occurring with such a high frequency that I think it
>>>> would be preferable to identify the source of the problem (e.g.
>>>> use chopstix to track down a kernel bug) or adopt a different
>>>> strategy for handling the effects of the problem on nodes that
>>>> crash frequently (e.g. changing the fs type if that would really
>>>> help).
>>>>
>>>> I'm writing to solicit feedback from the community about this
>>>> problem.
>>>>
>>>> Best,
>>>> Stephen


