strange badblocks problem

marnixk lfs at mkaart.net
Wed Jan 14 12:21:30 PST 2004


Bill's LFS Login wrote:

> First, if the last block starts on sector 5879721, and you have 8
> sectors per "block", you will fail when trying to read the last 4K
> block. My previous comments always assume that the partition size is
> large enough to hold all the sectors the file system refs (and that is
> what I meant when saying "if the partition is larger...").
> 
> Traditionally, many FSs I've dealt with do *not* use the first sector
> for a true part of the file system. It had a boot block. The super block
> began in sector 1, then came various i-node lists, ... I don't recall
> if ext2/3 also doesn't use sector 0, but I think it does not. This
> indicates that the last block would start on sector 5879721 (or maybe
> higher, I really haven't taken the time to learn all its details).
> 
> So it seems likely to me that if you reduce the blocksize/count combo or
> increase partition size just a little you'll be OK.
> 
> Re "You've lost me":
> 
> Now, we're both lost then. IIRC, on your previous post, you ref'd
> another system that apparently had one more block in the partition than
> the calculated FS size. IIRC, you asked why that one wasn't failing.
> That's what I was referring to. Its combo of blocksize and partition size
> was big enough to hold all the sectors and may even have more sectors
> than needed (I don't recall). That's OK.
> 
> On the example you mention in the above, you *may* be right. The problem
> can be the one I mention (block starts on *1 and occupies 8 sectors that
> extend beyond...) and can come from two other places I can think of
> right away.
> 
> FS corruption: if a pointer references an invalid block, that is, one
> outside of the partition. An fsck should detect this and allow you to
> correct it. I presume this is unlikely because you've run fsck.
> 
> Utilities (or any program) like dd: will try to read the number of
> blocks specified and will accept a blocksize on the command line. If
> its combination of bs= (or any of the pertinent blocksize specs) and
> count= (in conjunction with any possible skip= or seek=, as appropriate)
> causes it to attempt a read *after* the return indicates an end-of-file,
> you can also get this problem.
> 
> Another uncertainty is the counts you show. Do this
> 
>   dd if=/dev/hda9 of=/dev/null bs=512
> 
> It will be slow, but will show the number of sectors in the partition.
> To speed it up, you could add skip=<a large number of sectors> but then
> you need to do an extra math step.
> 
> You can then calculate and see if we have been misled by the old
> "some things count from one, some things count from zero" routine.
> 
> I also don't think this is the problem.
> 
> With the count from above, you should be able to do the calculations and
> confirm if it is really just a "block" extending beyond the partition.
> 
> And I really think it is. But I am often wrong. So, the "acid test"
> 
>   dd if=/dev/hda9 of=/dev/null bs=4096 skip=734964
> 
> Further confirmation can be gained by converting everything to sectors
> and attempting to read the last two "blocks". I expect you will get 7 or
> fewer sectors returned.

Thanks for all your great help. Still, I have discovered that block sizes
and sectors are probably not the cause of my problem. What I have done is
the following:

On /dev/hda9 I have created an FS of about 1/4 the size of the partition.
First I discovered that no matter how small the FS inside the partition is,
dd and badblocks always read right on to the end of the partition and not
just to the end of the FS.

When I run badblocks on the partition just after I have booted into the
system, it doesn't attempt to read past the end of the device, so no
"badblocks" are found. But then I mounted the partition and unmounted it
again and ran badblocks on the partition again. To my surprise now the
badblocks program *did* find "badblocks" and the logs showed the same
"attempt to read past the end of the device" stuff.

Now, when I reboot the system and run badblocks on the partition again, it
*does not* find any bad blocks, but once I do the mount/unmount routine
again, it *does* find the badblocks. So the problem only seems to occur
after I have mounted the FS at least once.
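
The sequence described above can be written down as a small script so it is
easy to repeat after each boot. This is only a sketch: the mount point
/mnt/test and the helper names run and repro_steps are made up for
illustration, and by default it just prints the commands (set DRY_RUN=0 as
root to actually run them).

```shell
#!/bin/sh
# Sketch of the mount/unmount reproduction. /dev/hda9 comes from this thread;
# /mnt/test, run and repro_steps are assumptions for illustration only.

# Print the command in dry-run mode (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

repro_steps() {
    dev=$1
    mnt=$2
    run badblocks -sv "$dev"   # read-only scan before any mount
    run mount "$dev" "$mnt"    # mount once ...
    run umount "$mnt"          # ... and unmount again
    run badblocks -sv "$dev"   # same scan after the mount/unmount cycle
}

repro_steps /dev/hda9 /mnt/test
```

Comparing the output of the two badblocks runs (and the kernel log after
each) shows whether the mount/unmount cycle is what makes the difference.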

All this is when I boot the system with the Gentoo LiveCD (current), so I
decided to try it with a boot floppy (RIP rescue disk). I wiped the entire
disk with dd if=/dev/zero of=/dev/hda and repartitioned the drive with
fdisk on the rescue disk. After this I created the filesystems and ran
badblocks on them, which returned no errors. Then I did the mount/unmount
thing on all the filesystems and ran badblocks on them again: still no
bad blocks! So I figured there must be something wrong with the Gentoo CD
I have. After this I rebooted with the Gentoo CD, did the mount/unmount
thing, and still found no bad blocks. So then I thought my problem was
gone. But only minutes later I tried the same on one partition again and
the error was back.

I have not yet booted from the floppy again to check for bad blocks, and
I cannot try it now, but I would be surprised if the problem occurs when
booted from the rescue disk (will try as soon as I can). If it does, then
I suspect a hardware problem, because it would occur under two completely
different Linux systems. Until then my bet is on something being wrong
with my Gentoo LiveCD (although I have not experienced anything like this
on any of my other boxes).

Hope you have some comments on the fact that mount/unmount seems to
influence the way the kernel "sees" the partition. Oh, one more thing: I
ran dd if=/dev/hda9 of=/dev/null bs=512, like you suggested, and got
5879726 records in/out and no I/O error. Then I did the mount/unmount
thing and got 5879720 records in/out and an I/O error. Then I ran
dd if=/dev/zero of=/dev/hda9 bs=512 and got 5879725 in, 5879724 out
(IIRC). So there doesn't seem to be any agreement on how large my
partition actually is.
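
To compare these counts against 4K blocks, each one can be split into whole
blocks plus leftover sectors. A minimal sketch, assuming 512-byte sectors
and the 4096-byte block size from the dd test quoted above (split_sectors is
a made-up helper name):

```shell
#!/bin/sh
# Split a count of 512-byte sectors into whole 4096-byte blocks plus the
# sectors left over. split_sectors is a hypothetical helper; the 4K block
# size is an assumption taken from the bs=4096 test quoted earlier.
split_sectors() {
    sectors=$1
    per_block=8   # 4096 / 512
    echo "$((sectors / per_block)) blocks + $((sectors % per_block)) sectors"
}

split_sectors 5879726   # before mounting: 734965 blocks + 6 sectors
split_sectors 5879720   # after mount/unmount: 734965 blocks + 0 sectors
```

So the first count leaves 6 sectors beyond the last whole 4K block, while
the count after mount/unmount is an exact multiple of 8. That is only an
observation about the numbers, not a confirmed cause.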

One final point of interest may be that before I discovered this problem
I had another problem with my hard disk. I suddenly could not mount
/dev/hda1 anymore (on /) because no superblock could be found (IIRC). I
then checked the drive for bad blocks and there seemed to be many of
them. I scanned the drive with the tool provided by the vendor (Maxtor),
and this tool said the disk was bad and that I should return it as a
warranty claim. Then I did a low-level format and rescanned the disk,
and this time the tool said the drive was OK and "certified error free".
After this I was more cautious and decided to scan my partitions with the
badblocks program before installing Linux again, and that is when I
discovered the errors that this thread is all about. Strangely, when I
rescan the drive with the Maxtor tool it still says the drive is good
(it also checks for bad blocks).

Maybe this provides some more context on the possible cause of my
problems... Very curious what you make of this. Thanks a lot so far!

Marnix


