strange badblocks problem

marnixk lfs at mkaart.net
Tue Jan 13 06:15:36 PST 2004


Hello,

I am using LFS on several of my systems and Gentoo on some others. I have a
problem with my Gentoo box, but since I know that there are many very good
Linux hackers on this list, I decided to post my question here as well as on
the Gentoo forums. I hope this is not a problem...

I am having a strange problem with badblocks on some of my partitions. I
have found two references to similar problems, but neither of them led to a
good solution:
http://forums.gentoo.org/viewtopic.php?t=96487
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=001001c27d02246297fe50%244500a8c0%40cybernet.cz

I have tried to analyze the problem and I will post my findings here.
Hopefully one of you can help me further with this:

The problem: some partitions have at most 3 "badblocks" at the end of the
partition. When running badblocks on /dev/hda9, for example, I find the
following bad blocks:

2939860
2939861
2939862

badblocks indicates that it is checking "From block 0 to 2939863", so these
are the last three blocks (except for the final one?).
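
For reference, this is roughly the command I used here (-s prints a progress
indicator, -v makes badblocks verbose):

badblocks -sv /dev/hda9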

Also these messages appear in the log:
Jan 13 10:50:28 xxxxxx attempt to access beyond end of device
Jan 13 10:50:28 xxxxxx 03:09: rw=0, want=2939864, limit=2939863
Jan 13 10:50:28 xxxxxx attempt to access beyond end of device
Jan 13 10:50:28 xxxxxx 03:09: rw=0, want=2939864, limit=2939863
Jan 13 10:50:28 xxxxxx attempt to access beyond end of device
Jan 13 10:50:28 xxxxxx 03:09: rw=0, want=2939864, limit=2939863
Jan 13 10:50:28 xxxxxx attempt to access beyond end of device
Jan 13 10:50:28 xxxxxx 03:09: rw=0, want=2939864, limit=2939863

Because I previously had other strange behaviour that might suggest a
failing hard disk, I ran a test program provided by the disk manufacturer
(Maxtor). This tool reported that all tests were passed and that the disk
was "certified error free", whatever that may mean.

After this I ran badblocks on /dev/hda and there were no bad blocks on the
disk at all, but running it on /dev/hda9 again gave the same output as above.
Testing with dd if=/dev/hda9 of=/dev/null and with cat /dev/hda9 gave the
same messages in the logs. So the bad blocks reported on hda9 are probably
caused by something other than actual bad blocks on the device. So where can
the problem be?
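
For completeness, these are more or less the exact commands I used for those
tests (redirecting cat to /dev/null is simply how I ran it, to keep the raw
data off my terminal):

badblocks -sv /dev/hda
dd if=/dev/hda9 of=/dev/null
cat /dev/hda9 > /dev/null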

When looking at the partition table in fdisk (expert mode) I found the
following for /dev/hda9:

Nr AF Hd Sec Cyl Hd Sec Cyl Start Size ID
9 00 1 1 292 254 63 657 63 5879727 83

One point of interest may be that the size reported by fdisk (5879727) is one
more than twice the number of blocks reported by badblocks (2 x 2939863 =
5879726). Strangely, the number of blocks reported in the log is 2939864, and
2 x 2939864 = 5879728. So different parts of the system seem to disagree
about the exact size of /dev/hda9.
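
To make the unit conversion explicit (fdisk reports sizes in 512-byte
sectors, while badblocks and the kernel messages here count 1K blocks), a
quick check in the shell gives:

echo $(( 5879727 / 2 ))   # 2939863 complete 1K blocks
echo $(( 5879727 % 2 ))   # 1 sector (512 bytes) left over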

Running tune2fs -l /dev/hda9 results in the following output (showing only
the interesting parts):

Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 368000
Block count: 734965
Reserved block count: 36748
Free blocks: 715200
Free inodes: 367989
First block: 0
Block size: 4096
Fragment size: 4096
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 16000
Inode blocks per group: 500

Considering a block size of 4K, the number of 1K blocks would be 734965 * 4 =
2939860.
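
In the same 1K units as above (the second line is just my own way of relating
this to the badblocks output, not something any tool printed):

echo $(( 734965 * 4 ))         # 2939860 1K blocks used by the filesystem
echo $(( 2939863 - 2939860 ))  # 3, exactly the number of reported "badblocks"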

From all these numbers I have drawn the following conclusion, but please
correct me if I'm wrong:

The partition consists of 5879727 512-byte blocks, which means that it
consists of 2939863.5 1K blocks. So the partition boundary falls halfway
through a 1K block, and this might explain why the programs want to access
2939864 blocks. This does not work, because the kernel has placed the limit
at 2939863 (I am not quite sure about this). But it puzzles me why any
program would want to read beyond block 2939860 in the first place, because
the filesystem reports its boundary to be there. Probably because the kernel
reports the wrong boundary? Also, the fact that we see four lines in the log
might indicate that something is trying to read four 1K blocks (or one 4K
block) too many.
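
Just to put a number on that last observation (this is purely my own
interpretation of the four log lines, not something the kernel reported):

echo $(( 2939864 - 2939860 ))  # 4 1K blocks, i.e. one 4K filesystem block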

After this short analysis I tried the following:
badblocks -sv -b 1024 /dev/hda9 (which should make it treat the block size as
1K), but the results are still the same (three "badblocks" and four lines in
the log).

Having read the two threads mentioned above, I decided to try this under
another kernel version (the current version is 2.4.22-gentoo-r4). First I
tried the 2.4.24 vanilla kernel (compiled on the same Gentoo machine):
I made a new filesystem on /dev/hda9 and ran badblocks on it. To my
surprise the problem no longer occurred. BUT, after I mounted the
filesystem and ran badblocks again, the problem reappeared. Even after
unmounting the filesystem the problem was still there.
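
The sequence I followed was essentially this (I do not remember the exact
mke2fs options I used, and /mnt/test is just a placeholder for the mount
point):

mke2fs /dev/hda9             # fresh filesystem
badblocks -sv /dev/hda9      # no bad blocks reported
mount /dev/hda9 /mnt/test
badblocks -sv /dev/hda9      # the three "badblocks" are back
umount /mnt/test
badblocks -sv /dev/hda9      # still there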

Next I booted off the Gentoo LiveCD (Gentoo kernel: 2.4.21-gss) and followed
the same procedure, with exactly the same results. I only used the tools
available on the Gentoo LiveCD to create a filesystem and to check it. I
think I may assume that the toolchain used to compile the Gentoo LiveCD
kernel is fully tested, so the suggestion that the problem is caused by a bad
toolchain used to compile the system (made by Alan Cox in the second thread,
see above) probably does not apply here?

(cdspecs of the LiveCD used:
Sun Sep 14 02:53:41 UTC 2003
cvs.gentoo.org
x86-livecd-20030801
x86-doc-20030911-drob
x86-distsnap-20030911
x86-x86-20030911-cd1
)

So much for my analysis of this strange problem. I am totally out of ideas
and I could not find any other posts on this matter. Hopefully someone else
can help in solving this problem, since right now I am not sure whether I
should trust the system with my valuable data...

Thanks in advance for any help!

Marnix


