strange badblocks problem

Bill's LFS Login lfsbill at nospam.dot
Thu Jan 15 08:27:57 PST 2004


On Thu, 15 Jan 2004, marnixk wrote:

> Bill's LFS Login wrote:
><snip>

> Where does dd get its information about the size of the partition from? The
> kernel or the FS?

It doesn't. What they get is an error return when they try to read a
block outside of the partition. They *could* get this info if desired,
but they have no need for it (*ahem*, I've been wrong before about
these sorts of things, when it turned out that GNU/Linux hadn't cloned
a real UNIX function or utility as closely as I expected - this is one
of those cases where I haven't checked, but the docs give no indication
that this additional functionality is needed).

For sequential operations, such as with dd, the error will *usually*
happen when the read system call returns an EOF and the application
issues another read.
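
(If you want to see that in action, a quick sketch - /dev/hda9 as the
example device:

  dd if=/dev/hda9 of=/dev/null bs=1024

dd reads until the kernel returns EOF at the partition boundary, then
reports how many records it read. That count should match the
partition's size in /proc/partitions, which lists sizes in 1024-byte
blocks.)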

I just got reminded of an error in dd from many moons ago that I *think*
is not involved here, but we shouldn't overlook.

Shortly after Unix System Labs was sold to Novell, drives larger than
500MB started coming to market in larger quantities. Accommodating
these required expanding an integer field (human integer, not C) to
hold the larger numbers.

Novell did the update and I got the new version (relatively early, it
now appears). Well, I used dd for my backup routines and guess what?
They forgot to expand the field in dd. That caused some interesting
results that took a while to recognize because of the mix of usage
(whole disk = problem, a single partition = no problem, due to the
size of the partitions).

I must have been one of the first to report the problem and offer a
potential cause and solution because they opened a bug on it. It was
fixed sometime later.

Now, this narrative was not just for my entertainment; it reminds us
that we have another possibility to rule out: application error
(hopefully?).

Keep in mind that *any* application that reads data through the normal
mechanisms and is *not* dependent on specifics of data-storage media or
layout can read partitions in "raw" mode (Linux doesn't specifically
have a raw mode yet I guess, but the results are effectively the same).

This means we can use "cat", "tail" and other utilities to see if the
behavior differs.

I'm suggesting that you use something like

  cat </dev/hda9 >/dev/null

and see if the same problem occurs. Of course, duplicate the steps and
environment leading to the error. This may give a further confirmation
or a clue of where the error is. If it happens in cat too, that means it
is not application-specific.
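
For a side-by-side dd run on the same device - same caveat, just a
sketch:

  dd if=/dev/hda9 of=/dev/null bs=4096

If cat and dd fail at the same spot, the finger points below the
applications; if only one of them fails, it is application-specific
after all.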

Another thought occurred. See near the end of this post.

> I ask this because I have discovered something strange.
> My home system (also Gentoo based) shows the exact same behaviour. This
> kinda rules out hardware problems I guess, but I suspect the gentoo livecd
> that I have used on both machines even more. I have a theory that this only
> happens on partitions that have once been mounted when booting from this
> CD. I will go to work later and boot from a knoppix cd or something, create
> partitions and FS, mount, unmount, reboot, mount, ... and so on. Then I
> will reboot with some other boot disk and keep running badblocks and dd in

And consider the "cat" also to see if the error is duplicated.

> between all steps. Then if no errors occur I will reboot with the liveCD
> and mount/unmount and check if the problem re-appears and persists...

Since you have some indication that it might be OS related, have you
Googled or visited gentoo sites to see if there has been any bug or
threads related to this? Maybe we'd luck out. Regardless, your plan
sounds like it should help narrow the field of potential candidates.

>
> > I have no theory on why that is happening. It would sound as if the
> > update of the meta-data got some bad info from somewhere. When you do an
> > unmount, certain info about the file system is updated.

And I just realized, based on the other idea I had, that in-system
meta-data may not be removed from memory and the OS may be using that
in-memory version.
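
(A quick, hedged way to check for a lingering mount or open handle,
assuming you have fuser installed:

  grep hda9 /proc/mounts
  fuser -v /dev/hda9

The first shows whether the kernel still considers the FS mounted; the
second lists any process still holding the device open.)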

> >
> > Did you write down the alternate super-blocks when you made the file
> > system? If so, we can specify one of the alternates and run the fsck.
> > Normally this alternate will not have been updated and should have the
> > info as originally transcribed when the FS was created. This could be
> > compared with the default super-block that was modified at unmount time.
>
> I tried dumpe2fs to dump the super-block, but I am not sure how I can dump
> the backup super-block (which is located at 32768, 98306,...) However when

*chuckle* "dd is your friend!". Really, I'm serious. Use dd to skip the
proper number of blocks and write a block (count=xx) out to a temporary
file. Then you can examine it (reading it alongside the C source header
is most useful) or cmp it to another block similarly captured.
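
A sketch, assuming a 4096-byte FS block size (check the "Block size:"
line in the dumpe2fs output; the arithmetic changes if yours differs).
The primary super-block sits 1024 bytes into the partition, and a
backup at FS block 32768 starts at 1024-byte block 131072 (32768 * 4):

  dd if=/dev/hda9 of=/tmp/sb.primary bs=1024 skip=1 count=1
  dd if=/dev/hda9 of=/tmp/sb.backup bs=1024 skip=131072 count=1
  cmp -l /tmp/sb.primary /tmp/sb.backup

cmp -l lists each differing byte, which you can then chase down in the
ext2 header (ext2_fs.h). Expect a few fields to differ legitimately -
last-mount time and mount count, at least.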

> I run e2fsck -f /dev/hda9 it says all is OK, but when I do e2fsck -b 98304

98304 or 98306? It should be 98,304 - I think the 98306 above is a typo?

> it says FILE SYSTEM WAS MODIFIED, even if I do this repeatedly. But they
> both report the same line at the end: /dev/hda6: 11/141696 files (0.0%
> non-contiguous), 12663/283137 blocks

That is expected, given the sort of problem you're experiencing. Since
the backup super-block is "pristine", one of the fields (I think) it
has is last-mount time. So fsck should update at least that field?

>
> So I suspect there indeed *is* a difference between those two super-blocks,
> but I do not know how to "dump" the backup blocks to compare them to the
> first one.

Now you do!  :)
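
(Also, if your dumpe2fs is new enough, I believe it can read a backup
copy directly - again assuming 4096-byte blocks:

  dumpe2fs -o superblock=32768 -o blocksize=4096 /dev/hda9

and you can diff that output against a plain dumpe2fs run.)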

><snip>

> I am very sure that the FS is not mounted when running the fsck, but it is
> not fsck-ing that gives me problems. Filesystems are reported as being

Yes, but running an fsck while it is mounted is supposed to lead to
problems. Current e2fsck versions check for the mount status, but the
check can be suppressed with -f (or is it -F ?) and no message would
appear, IIRC. But the data on the FS could have been scrogged that way.

> clean, although I have had the experience on my home system that every
> time when it was time to check the rootFS it said: FILE SYSTEM WAS
> MODIFIED and something about a reboot needed. So maybe this has
> something to do with it as well...

Sounds suspiciously related, don't you think?

><snip>

> > Ugh! This puts a crimp on what I was hoping might be the problem. The
> > further we go, the less hope I have that I can ask the right questions
> > to help locate the problem. This because of the "remote" nature, I can't
> > get the normal "visual" clues that spark a thought, etc.
>
> I will try the approach that I have described above and try to find out if
> the mounting with the gentoo CD (it has a 2.4.21 kernel btw) writes
> something bad somewhere. And believe me I would not have had this idea
> without your questions and advice so far!
>
> > Right file system type specified everywhere? Ext2 or reiserfs?
>
> yup, pretty sure

The reason I mention reiserfs (I didn't recall if you mentioned the
type) is that reiserfs (and other?) logging FS's are *not* supposed to
be fsck'd as a normal course of action. Just wanted to make sure the
problem wasn't being caused by fsck-ing an FS that shouldn't be
checked that way.

> > I hope that's it. Do you have enough of an LFS system to test it?
> See above. Actually my LFS systems are my firewall and mailserver, so I
> really do not want to test anything on those machines...

Sounds like a wise course to follow. Too bad, though - a known-good
platform that you could stick that drive into would be a bonus.

>
> > That's making me *guess* that the FS is still mounted somewhere because
> > I think an in-core copy of the super-block (and other meta-data) is
> > being used, based on your earlier description that it goes away when you
> > reboot.
>
> So dd and badblocks get their info from the super-block or from this other
> meta-data?

AFAIK, they don't need or get this data from anywhere. They just do
standard reads of the medium and let the OS do its thing.

> Is there some way to dump this meta-data when I don't have the
> problem and compare it with the case when I do have the problem?

I can't say. I've never had to attack this particular sort of problem that
deeply. But my thought at the end might give a confirmation (or maybe
not).

>
> > This is especially pernicious if, as you suspect, it is something wrong
> > in the OS that is being run.
>
> <snip>
>
> > Sounds more and more like something flaky in the OS if it's not related
> > to dupe mounts.
>
> I hope to find out soon (see above)
>
> >
> > Have you checked to see that all the cables are in good condition and
> ><snip>

> Ok, again many thanks, will report back to you later!

Consider using sfdisk -R, which (if we are lucky) will cause the kernel
to replace its in-core copy of the partition info. I think this
*should* make it as if a reboot had been done. Since, IIRC, the problem
doesn't occur after a reboot until a mount has been done, this might
make the problem disappear until a mount is again done. Make sure the
partition is not mounted, of course.
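
Something like this - note that it takes the whole disk, not the
partition:

  umount /dev/hda9    # if it is mounted
  sfdisk -R /dev/hda

IIRC it will refuse to re-read the table if any partition on the disk
is still in use.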

The results *may* tell us something.

Another thought that occurred: what else is running that might be
updating something? Be sure to run the tests in single-user mode with as
few processes as possible. You can even kill several of the
kernel-spawned processes (5 or 6 of them, IIRC).
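
On a sysvinit setup, something like

  telinit 1

(or booting with "single" on the kernel command line) should get you
to single-user mode.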

If the problem doesn't occur, make repeated runs with more and more
processes left running until the problem occurs ...

>
> Marnix

Hoping you have good luck with all this!

-- 
NOTE: I'm on a new ISP, if I'm in your address book ...
Bill Maltby
lfsbillATearthlinkDOTnet
Fix line above & use it to mail me direct.


