strange badblocks problem

marnixk lfs at mkaart.net
Thu Jan 15 14:51:04 PST 2004


Bill's LFS Login wrote:

> It doesn't. What they get is an error return when they try to read a
> block outside of the partition. They *could* get this info if desired,
> but they have no need for it (*ahem*, I've been wrong before about these
> sorts of things when it turned out that GNU/Linux hasn't cloned a real
> UNIX function or utility as closely as I expected - this is one of those
> cases where I haven't checked, but the docs give no indication that this
> additional functionality was needed).
> 
> For sequential operations, such as with dd, the error will *usually*
> happen when the read system call returns an EOF and the application
> issues another read.

I think, then, that dd (and badblocks) do not respond correctly to the
error returned by the system, because they keep trying to read several
more blocks (?). And this "bug" only occurs when "the math" of sectors
and block size doesn't work out...
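
A quick way to check that math, assuming the sfdisk on the CD supports
these options: sfdisk -s reports the partition size in 1K blocks, while
sfdisk -uS lists the partition table in 512-byte sectors, so the two
should differ by exactly a factor of two unless an odd sector is left
over at the end:

cdimage root # sfdisk -s /dev/hda8
cdimage root # sfdisk -l -uS /dev/hda | grep hda8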

> Now this narrative was not just for my entertainment, but reminds that
> we have another possibility to rule out: application error (hopefully?).

Perhaps this is indeed the case (different hardware, "different" OS and
kernel versions, only the applications are the same).
 
> This means we can use "cat", "tail" and other utilities to see if the
> behavior differs.
> 
> I'm suggesting that you use something like
> 
>   cat </dev/hda9 >/dev/null

Okay, I've run several tests in a row on a partition (now /dev/hda8; I
changed the partitions a bit, and hda9 doesn't show the strange behaviour
anymore) with these results (booted from the Gentoo CD; /dev/hda8 has an
FS on it and was mounted/unmounted):

cdimage root # badblocks -v /dev/hda8
Checking for bad blocks in read-only mode
From block 0 to 3036253
3036252
Pass completed, 1 bad blocks found.

cdimage root # dd if=/dev/hda8 of=/dev/null bs=512
dd: reading `/dev/hda8': Input/output error
6072504+0 records in
6072504+0 records out

cdimage root # cat </dev/hda8 >/dev/null
cat: -: Input/output error (after a reasonably long wait)

cdimage root # dd if=/dev/zero of=/dev/hda8 bs=512
dd: writing `/dev/hda8': Input/output error
6072505+0 records in
6072504+0 records out

cdimage root # badblocks -v /dev/hda8
Checking for bad blocks in read-only mode
From block 0 to 3036253
3036252
Pass completed, 1 bad blocks found.

mke2fs -j /dev/hda8 --> ok

cdimage root # badblocks -v /dev/hda8
Checking for bad blocks in read-only mode
From block 0 to 3036253
3036252
Pass completed, 1 bad blocks found.

... 
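
If I read those numbers right (badblocks uses 1024-byte blocks by
default), they all fit an odd-sector story:

  badblocks wants to test 3036253 blocks * 2 = 6072506 sectors
  dd only ever reaches 6072504 sectors = 3036252 complete 1KB blocks

So if the partition is really 6072505 sectors long, the last 512-byte
sector can never be reached as part of a whole 1KB block, and the final
block will always show up as "bad", no matter what is written to the
disk.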

> and see if the same problem occurs. Of course, duplicate the steps and
> environment leading to the error. This may give a further confirmation
> or a clue of where the error is. If it happens in cat too, that means it
> is not application-specific.

And indeed it occurs in cat as well.
 
> Since you have some indication that it might be OS related, have you
> Googled or visited gentoo sites to see if there has been any bug or
> threads related to this? Maybe we'd luck out. Regardless, your plan
> sounds like it should help narrow the field of potential candidates.

See my other post: a Debian box I found has the same problem as well.
 
> And I just realized, based on the other idea I had, that in-system
> meta-data may not be removed from memory and the OS may be using that
> in-memory version.

> *chuckle* "dd is your friend!". Really, I'm serious. Use dd to skip the
> proper number of blocks and write a block (count=xx) out to a temporary
> file. Then you can examine this (in conjunction with the C source
> header, which is most useful) or cmp it to another block similarly
> captured.

I'm learning, I'm learning...
 
> That is expected, given the sort of problem you experience. Since the
> backup super-block is "pristine", one of the fields (I think) it has is
> last-mount time. So fsck should update at least that field?

Okay, that was short-sighted of me; it also has the field "Last checked",
which is updated in the main superblock on each check but not in the
backup superblock, of course (or am I wrong?)
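
A way to look at those fields directly, assuming dumpe2fs from e2fsprogs
is on the CD: -h prints only the superblock, and -o superblock=N together
with -o blocksize=N makes it read a backup copy instead, e.g. the backup
at block 163840 that I dump below:

cdimage root # dumpe2fs -h /dev/hda8 | grep -i last
cdimage root # dumpe2fs -h -o superblock=163840 -o blocksize=4096 \
    /dev/hda8 | grep -i last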
 
>> So I suspect there indeed *is* a difference between those two
>> super-blocks, but I do not now how to "dump" the backup blocks to compare
>> them to the first one.
> 
> Now you do!  :)

Hope I got this right:

cdimage root # dd if=/dev/hda8 skip=0 count=1 of=primary.txt bs=4096
1+0 records in
1+0 records out

cdimage root # dd if=/dev/hda8 skip=163840 count=1 of=backup.txt bs=4096
1+0 records in
1+0 records out

cdimage root # diff primary.txt backup.txt
--> no output, so they're the same?

When I cat both files I get:
cdimage root # cat backup.txt 
Í
 AdF
    õÌà=x at Sïx at Ní        bv5T~uämåtL'
                "Èû_FÁ³§            òmÒÍÙðx@

Both dumps seem identical!
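
Since cat on binary data is not really readable, cmp may be a more
trustworthy check than eyeballing it; silence from

cdimage root # cmp -l primary.txt backup.txt

would mean a byte-for-byte match (with -l, cmp lists every differing
byte).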

as a check i also did:

cdimage root # dd if=/dev/hda8 skip=163839 count=1 of=test.txt bs=4096
1+0 records in
1+0 records out

but this gave me an empty file (or rather, a file filled with a lot of ^@)
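
As a sanity check, od -c prints NUL bytes (what cat shows as ^@) as \0,
and a lone '*' line means the previous line just repeats, so a 4KB block
of zeros should look something like:

cdimage root # od -c test.txt
0000000  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0010000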

So I think it has nothing to do with the FS and the superblocks, which is
backed up by the fact that the problem persists even after
dd if=/dev/zero of=/dev/hda8...

>> clean, although I have had the experience on my home system that
>> every time it was time to check the rootFS it said: FILE SYSTEM WAS
>> MODIFIED and something about a reboot needed. So maybe this has to do
>> something with it as well...
> 
> Sounds suspiciously related, don't you think?
Maybe, but it might be a coincidence...

> The reason I mention reiserfs (I didn't recall if you mention the type)
> is that reiserfs (and other?) logging FS's are supposed to be *not*
> fsck'd as a normal course of action. Just wanted to make sure the
> problem wasn't being caused by fsck-ing an FS that should be checked.

I use ext3, but this also occurs on partitions/FSes that have not been
fscked yet... and even on partitions without an FS on them...
 
>> > I hope that's it. Do you have enough of an LFS system to test it?
>> See above. Actually my LFS systems are my firewall and mailserver, so I
>> really do not want to test anything on those machines...
> 
> Sounds like a wise course to follow. Too bad, though; a known-good
> platform that you could stick that drive into would be a bonus.

See my Debian experience in the other post...

> Consider using sfdisk -R, which (if we are lucky) will cause the kernel
> to replace its in-core copy of partition info. I think this *should*
> make it as if a reboot had been done. Since, IIRC, the problem doesn't
> occur after a reboot until a mount has been done, this might make the
> problem disappear until a mount is again done. Make sure the partition
> is not mounted, of course.

I did the following (no mounts in between):

cdimage root # sfdisk -R /dev/hda
cdimage root # badblocks -v /dev/hda8
Checking for bad blocks in read-only mode
From block 0 to 3036253
3036252
Pass completed, 1 bad blocks found.

Same problem...
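
In case sfdisk -R itself is suspect, blockdev should issue the same
partition-table re-read ioctl (assuming util-linux's blockdev is on the
CD), so it could serve as a cross-check:

cdimage root # blockdev --rereadpt /dev/hda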

> The results *may* tell us something.
> 
> Another thought that occurred: what else is running that might be
> updating something? Be sure to run the tests in single-user mode with as
> few processes as possible. You can even kill several of the
> kernel-spawned processes (5 or 6 of them, IIRC).

I'm currently remote so I cannot go to single-user mode (or can I?), but
I will try it next time I am at the console...
 
> If the problem doesn't occur, repeated runs with more and more left
> running until the problem occurs ...
Yup, I get the point, will try ASAP.
 
> Hoping you have good luck with all this!
Thanks a lot. It seems there are actually more people with this problem,
except they don't know it yet. Unless of course it has something to do
with *me* at the console... Maybe we should try my machines on a
"known-good hacker".

Thanks again, and I look forward to your reply.

Marnix
