[Planetlab-devel] identifying source of corrupt filesystems on pl
Marc Fiuczynski
mef at CS.Princeton.EDU
Fri Oct 16 16:45:46 EDT 2009
Dear Stephen,
Nearly two months ago you wrote about FS corruption problems. Since
then we've identified an undocumented "feature" of fsck: it returns a
non-zero status code even when the filesystem checks out OK, and the
only way to get it to exit with a zero status code is to run the
forced invocation (i.e., fsck -f -y). Given this, is it possible that
the FSes we are seeing out there are less severely corrupt than they
appear? Or were these cases truly ones where there was a lot of FS
corruption?
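
To make the exit-status question concrete, here is a minimal sketch of
how a boot script might decode fsck's status instead of treating any
non-zero value as failure. It assumes the bit-flag values listed in the
fsck(8) man page; the names decode_fsck_status and needs_forced_fsck,
and the policy of forcing a rerun only when bit 4 is set, are
hypothetical and not what our boot scripts currently do.

```python
# Exit-status bits as listed in the fsck(8) man page; the status
# returned by fsck is the bitwise OR of the conditions that applied.
FSCK_STATUS_BITS = {
    1: "filesystem errors corrected",
    2: "system should be rebooted",
    4: "filesystem errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "checking canceled by user request",
    128: "shared-library error",
}


def decode_fsck_status(status):
    """Return the human-readable conditions encoded in an fsck status."""
    if status == 0:
        return ["no errors"]
    return [text for bit, text in FSCK_STATUS_BITS.items() if status & bit]


def needs_forced_fsck(status):
    """Hypothetical policy: force a rerun only if errors remain (bit 4)."""
    return bool(status & 4)


if __name__ == "__main__":
    for status in (0, 1, 4, 5):
        print(status, decode_fsck_status(status), needs_forced_fsck(status))
```

Under this reading, a status of 1 means fsck found and fixed something,
which is non-zero but would not by itself indicate severe corruption.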
Marc
On Aug 25, 2009, at 10:29 PM, Chris Tracy wrote:
>>>>
>>>> Hello,
>>>>
>>>> Over the last three weeks PlanetLab has seen on the order of 50
>>>> occurrences of severely corrupt filesystems that needed a forced
>>>> fsck to repair the filesystem and continue with the boot. The
>>>> problem has been occurring longer, but the numbers before the
>>>> last three weeks are not known by me.
>>>>
>>>> The nodes that are affected go offline unexpectedly either due to
>>>> power failure or a kernel crash. When the system comes back up,
>>>> the standard fsck fails, and requires a forced invocation
>>>> (fsck -f -y).
>>>>
>>>> This is occurring with such a high frequency that I think it
>>>> would be preferable to identify the source of the problem (i.e.
>>>> use chopstix to track down a kernel bug) or adopt a different
>>>> strategy for handling the effects of the problem on nodes that
>>>> crash frequently (i.e. changing the fs type if that would really
>>>> help).
>>>>
>>>> I'm writing to solicit feedback from the community about this
>>>> problem.
>>>>
>>>> Best,
>>>> Stephen