strange badblocks problem

Bill's LFS Login lfsbill at nospam.dot
Wed Jan 14 15:12:10 PST 2004


On Wed, 14 Jan 2004, marnixk wrote:

> Bill's LFS Login wrote:
>
> > First, if the last block starts on sector 5879721, and you have 8
> ><snip>

> Thanks for all your great help, still I have discovered that blocksizes and
> sectors are probably not the cause of my problem. What I have done is the
> following:
>
> On /dev/hda9 I have created a FS of about 1/4 the size of the partition.
> First I discovered that no matter how small the FS inside the partition is,
> dd and badblocks always read right on to the end of the partition and not
> just to the end of the FS.

This is *correct* operation. Badblocks, and dd as you should be using
it, *will* read the whole partition (or regular file, if that was
specified). Dd can be limited by adding "count=<some number>". If you
don't do this, its job is to read until end of file, which in this case
is the end of the partition.
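
As a rough sketch of limiting dd this way (the function name read_sectors
is mine, purely for illustration; it only wraps a plain dd with bs=512 and
count=):

```shell
# Read only the first N 512-byte sectors of a device or regular file and
# discard the data. dd reports its record counts on stderr.
# $1 = device or file (for example /dev/hda9), $2 = number of sectors.
read_sectors() {
    dd if="$1" of=/dev/null bs=512 count="$2"
}
```

Something like "read_sectors /dev/hda9 1000" would read the first 1000
sectors and stop, no matter how large the partition is.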

> When I run badblocks on the partition just after I have booted into the
> system, it doesn't attempt to read past the end of the device, so no
> "badblocks" are found. But then I mounted the partition and unmounted it
> again and ran badblocks on the partition again. To my surprise now the
> badblocks program *did* find "badblocks" and the logs showed the same
> "attempt to read past the end of the device" stuff.

I have no theory on why that is happening. It would sound as if the
update of the meta-data got some bad info from somewhere. When you do an
unmount, certain info about the file system is updated.

Did you write down the alternate super-blocks when you made the file
system? If so, we can specify one of the alternates and run the fsck.
Normally this alternate will not have been updated and should have the
info as originally transcribed when the FS was created. This could be
compared with the default super-block that was modified at unmount time.
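
A minimal sketch of that comparison, assuming the FS is ext2 and the
e2fsprogs tools are available. The helper name block_count is mine, and
32768 is only the usual first-backup location for a 4096-byte block size,
not something I have read from your disk:

```shell
# Print the "Block count:" value from "dumpe2fs -h" output read on stdin.
block_count() {
    awk -F: '/^Block count:/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Intended use (commented out: needs root and a real ext2 partition).
# mke2fs -n writes nothing, but it only lists the right backups if given
# the same parameters that were used when the FS was made.
#   mke2fs -n /dev/hda9
#   dumpe2fs -h /dev/hda9 | block_count
#   dumpe2fs -h -o superblock=32768 -o blocksize=4096 /dev/hda9 | block_count
#   e2fsck -b 32768 /dev/hda9
```

If the primary and the backup report different block counts, that would
point at the primary super-block having been changed after creation.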

>
> Now, when I reboot the system and run badblocks on the partition again, it
> *does not* find any bad blocks, but once I do the mount/unmount routine
> again, it *does* find the badblocks. So the problem only seems to occur
> after I have mounted the FS at least once.

And only while the same instance of the OS is running? IIUC, all is good
when you do nothing more than reboot?

Then I *have* to ask this (don't take offense, I feel we are at the
stage of "what have we overlooked here"). Are you *sure* that the FS is
*not* mounted when you are running the fsck? An FS can be mounted
multiple times at the same time. E.g. two mount statements in
/etc/fstab.

Cat /proc/mounts to be sure that the partition is not referenced twice.
Be sure it is not mounted under another directory with a "mount --bind"
(this is harder to see because the secondary mounts don't show the
partition). Here's an example of things mounted with bind (from "mount"):

/dev/sr0 on /mnt/SourcesCD type ext2 (ro)
/mnt/workspac/New650MbFs on /mnt/Sources2 type ext2 (rw,loop=/dev/loop0)
/mnt/SourcesCD on /mnt/archives/Lfs_Sources/Sources1 type none (rw,bind)
/mnt/SourcesCD on /mnt/archives/Blfs_Sources/Sources1 type none (rw,bind)

Note that the secondary mounts show the path of the original mount
point in place of the device.

Here's the same from cat /proc/mounts:

/dev/sr0 /mnt/SourcesCD ext2 ro 0 0
/dev/loop0 /mnt/Sources2 ext2 rw 0 0
/dev/sr0 /mnt/archives/Lfs_Sources/Sources1 ext2 ro 0 0
/dev/sr0 /mnt/archives/Blfs_Sources/Sources1 ext2 ro 0 0

Note that here it does show the base device for the secondary mounts.
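
A small sketch for counting those references (the function name
mount_count is just for illustration; the second argument exists only so
the check can be tried against a saved copy of the table):

```shell
# Count how many entries in a mounts table name the given device in the
# first field. $1 = device, $2 = mounts file (defaults to /proc/mounts).
mount_count() {
    awk -v dev="$1" '$1 == dev { n++ } END { print n + 0 }' "${2:-/proc/mounts}"
}
```

Against the /proc/mounts listing above, "mount_count /dev/sr0" gives 3.
For your case, "mount_count /dev/hda9" should give 0 before badblocks is
run.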

>
> All this is when I boot the system with the Gentoo LiveCD (current), so I
> decided to try it with a boot floppy (RIP rescue disk). I wiped the entire
> disk with dd if=/dev/zero of=/dev/hda and repartitioned the drive with
> fdisk on the rescue disk. After this I created the filesystems and ran
> badblocks on it, which returned no errors. Then I did the mount/unmount
> thing on all FS and ran badblocks on them and again no badblocks! So I
> figured there must be something wrong with the Gentoo CD I have. After this
> I rebooted with the Gentoo CD, did the mount/unmount thing and still no
> badblocks. So then I thought my problem was gone. But only minutes later I
> tried the same on one partition again and the error was back.

Ugh! This puts a crimp on what I was hoping might be the problem. The
further we go, the less hope I have that I can ask the right questions
to help locate the problem. This is because of the "remote" nature: I
can't get the normal "visual" clues that spark a thought, etc.

Right file system type specified everywhere? Ext2 or reiserfs?

> After this I have not yet tried to boot from the floppy again and check for
> badblocks and I cannot try it now, but I would be surprised if the
> badblocks problem occurs when I have booted from the rescue disk (will try
> as soon as I can). If it does then I suspect some hardware problem, because
> then it occurs under two completely different linux systems, but until then
> my bet is on something wrong with my gentoo live cd... (although I have not
> experienced anything like this on any of my other boxes)

I hope that's it. Do you have enough of an LFS system to test it?

>
> Hope you have some comments on the fact that mount/unmount seems to
> influence the way the kernel "sees" the partition. Oh, one more thing, I
> did dd if=/dev/hda9 of=/dev/null bs=512 on /dev/hda9, like you suggested
> and I found 5879726 records in/out and no I/O error. Then I did the
> mount/unmount thing and I found 5879720 records in/out and an I/O error.

That's making me *guess* that the FS is still mounted somewhere because
I think an in-core copy of the super-block (and other meta-data) is
being used, based on your earlier description that it goes away when you
reboot.

This is especially pernicious if, as you suspect, it is something wrong
in the OS that is being run.
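
One thing that can be checked on paper is whether the shorter record count
falls on a file-system block boundary. The sketch below assumes a
4096-byte block size, i.e. 8 sectors of 512 bytes. I don't know your
actual block size, and the function name is only for illustration:

```shell
# Largest number of 512-byte sectors that fits in whole blocks.
# $1 = total sectors, $2 = sectors per block (8 assumes 4096-byte blocks).
whole_block_sectors() {
    echo $(( $1 / $2 * $2 ))
}

whole_block_sectors 5879726 8    # prints 5879720
```

That equals the 5879720 records you saw after the mount/unmount, with 6
sectors left over. If your block size really is 4096, that would fit with
something reading the device in whole blocks after the mount. It doesn't
prove anything by itself.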

> Then I did dd if=/dev/zero of=/dev/hda9 bs=512 and then I found 5879725 in,
> 5879724 out (IIRC). So there doesn't seem to be any agreement on how large
> my partition actually was.

Sounds more and more like something flaky in the OS if it's not related
to dupe mounts.

>
> One final point of interest may be the fact that before I discovered this
> problem I had another problem with my harddisk. I suddenly could not
> mount /dev/hda1 anymore (on /) because no superblock could be found (IIRC),
> I then checked the drive for bad blocks and there seemed to be many such
> badblocks. I scanned the drive with the tool provided by the vendor
> (maxtor) and this tool said the disk was bad and that I should return it as
> a warranty claim. Then I did a low-level format and rescanned the disk
> again. Then the tool said the drive was OK and "certified error free".
> After this I was more cautious and I decided I would scan my partitions for
> bad blocks with the badblocks program before installing Linux again, and
> then I discovered the errors that this thread is all about. Strangely when
> I rescan the drive with the maxtor tool it still says the drive is good
> (also checks for badblocks).

Have you checked to see that all the cables are in good condition and
*well* seated? I know sometimes intermittent hardware problems (like a
cable *almost* making solid contact) can cause erratic results.
Temperature changes of just a few degrees can cause it to appear and
disappear.

>
> Maybe this provides some more context on the possible cause of my
> problems... Very curious what you make of this. Thanks a lot so far!

Shoot. I'm beginning to feel like I've been no help. I would first
confirm that the cabling (data and power) is all good. Look for any
nicks in the data cables too (a break can be almost invisible).

If you can carry the drive to known-good hardware with a known-good OS
and test there, that would be a big plus.

>
> Marnix

BTW, what's the size of the drive? Any chance (part of) hda9 is beyond
some upper limit of the OS or bios or something (I don't think BIOS
normally affects this, but...).
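
One such limit that is easy to test for is the 28-bit LBA ceiling of older
IDE controllers and drivers: 2^28 sectors of 512 bytes, about 137 GB. A
sketch, assuming you take the partition's end sector from
"fdisk -l -u /dev/hda" (the function name is mine, and I'm not saying this
limit is what you are hitting):

```shell
# Succeed if the given end sector lies at or beyond the 28-bit LBA limit.
# Sectors are numbered from 0, so 0 .. 2^28-1 are addressable.
# $1 = last sector of the partition, as printed by "fdisk -l -u".
beyond_lba28() {
    [ "$1" -ge 268435456 ]    # 2^28 sectors * 512 bytes = 128 GiB
}
```

If "beyond_lba28 <end sector of hda9>" succeeds, the size limit is worth
a closer look; if it fails, that particular limit can be ruled out.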

-- 
NOTE: I'm on a new ISP, if I'm in your address book ...
Bill Maltby
lfsbillATearthlinkDOTnet
Fix line above & use it to mail me direct.


