[Planetlab-devel] identifying source of corrupt filesystems on pl
Stephen Soltesz
soltesz at CS.Princeton.EDU
Wed Oct 21 13:55:15 EDT 2009
Hi, Marc,
I'm sure that some filesystems that passed the safe fsck check were
falsely identified as corrupt, but not all of them were.
I think the thing to do would be to deploy the fix to BM that excludes
the safe-fsck check and only fails when there is true fs corruption.
Then we could follow up and identify how often real corruption is
occurring now.
What do you think of that strategy?
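
To make the proposal concrete, here is a rough sketch of the kind of check
I have in mind. It is not the actual BM code. The bit values come from the
exit-status table in the fsck(8) man page; the function names
(classify_fsck_status, run_fsck) and the policy of treating "errors
corrected" as a non-failure are my own assumptions for illustration.

```python
import subprocess

# Exit-status bits as documented in fsck(8). The status is a bitmask,
# so several conditions can be reported at once.
FSCK_ERRORS_CORRECTED = 1    # filesystem errors were corrected
FSCK_REBOOT_NEEDED = 2       # system should be rebooted
FSCK_ERRORS_UNCORRECTED = 4  # filesystem errors left uncorrected

# Assumption: only these bits are tolerated; anything else is a failure.
TOLERATED_BITS = FSCK_ERRORS_CORRECTED | FSCK_REBOOT_NEEDED


def classify_fsck_status(status):
    """Map an fsck exit status to 'clean', 'repaired', or 'failed'.

    Hypothetical helper: 'failed' covers uncorrected errors as well as
    operational, usage, cancel, and library errors (bits 8 and above).
    """
    if status & ~TOLERATED_BITS:
        return "failed"
    if status & TOLERATED_BITS:
        return "repaired"
    return "clean"


def run_fsck(device, force=False):
    """Run fsck on a device and return its raw exit status.

    Hypothetical wrapper: -y answers yes to repair prompts and -f is
    passed through to the filesystem-specific checker to force a check.
    """
    cmd = ["fsck", "-y"]
    if force:
        cmd.append("-f")
    cmd.append(device)
    return subprocess.call(cmd)
```

With something like this, a node would only be flagged when
classify_fsck_status(run_fsck(dev)) comes back as "failed", and we could
log the "repaired" cases separately to count how often they occur.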
Best,
Stephen
On Oct 16, 2009, at 4:45 PM, Marc Fiuczynski wrote:
> Dear Stephen,
>
> Nearly two months ago you wrote about FS corruption problems. Since
> then we've identified an undocumented "feature" of fsck: it returns a
> non-zero status code even when the filesystem checks out OK, and it's
> only possible to get fsck to exit with a zero status code when
> running the forced invocation (i.e., fsck -f -y). For this reason,
> is it possible that we are seeing less severely corrupt FSes out
> there? Or were these cases truly ones where there was a lot of FS
> corruption?
>
> Marc
>
>
> On Aug 25, 2009, at 10:29 PM, Chris Tracy wrote:
>
>>>>>
>>>>> Hello,
>>>>>
>>>>> Over the last three weeks PlanetLab has seen on the order of 50
>>>>> occurrences of severely corrupt filesystems that needed a forced
>>>>> fsck to
>>>>> repair the filesystem and continue with the boot. The problem
>>>>> has been
>>>>> occurring longer, but the numbers before the last three weeks
>>>>> are not known
>>>>> by me.
>>>>>
>>>>> The nodes that are affected go offline unexpectedly either due
>>>>> to power
>>>>> failure or a kernel crash. When the system comes back up, the
>>>>> standard fsck
>>>>> fails, and requires a forced invocation (fsck -f -y).
>>>>>
>>>>> This is occurring with such a high frequency that I think it
>>>>> would be
>>>>> preferable to identify the source of the problem (e.g., use
>>>>> chopstix to track
>>>>> down a kernel bug) or adopt a different strategy for handling
>>>>> the effects of
>>>>> the problem on nodes that crash frequently (e.g., changing the fs
>>>>> type if
>>>>> that would really help).
>>>>>
>>>>> I'm writing to solicit feedback from the community about this
>>>>> problem.
>>>>>
>>>>> Best,
>>>>> Stephen
>
> _______________________________________________
> Devel mailing list
> Devel at lists.planet-lab.org
> https://lists.planet-lab.org/mailman/listinfo/devel