Encoding of filenames

Jörg W Mittag Joerg.Mittag at Web.De
Sat Oct 9 10:07:00 PDT 2004


Nico R. wrote:
> When I have got a directory called "Fotos für Laura" (for the
> non-German-speaking people here, that's "photos for Laura" in English),
> how is the German character "ü" (Latin small letter u with diaeresis)
> stored in the filesystem?

It isn't stored at all. In Linux, filenames are just streams of bytes,
terminated by 0x00, with the path components separated by 0x2F. (Note that
0x2F looks like '/' when interpreted as an ASCII character, but it could
theoretically be anything in some hypothetical character encoding.) A stream
of bytes consists only of ... uhm ... bytes, so there are no "letters" or
"characters" at all. So, actually, filenames are not "names" at all, they are
just identifiers. Those identifiers *could* be interpreted as a string in a
certain locale, but they could also be something entirely different.

So, in short: native Linux filesystems only store streams of bytes; whether
these bytes have some special meaning to humans when interpreted according to
some specific character set is completely irrelevant.
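A minimal sketch of this (the directory and name are made up for the
example): the kernel happily accepts a name containing the raw byte 0xFC,
which is 'ü' in ISO-8859-15 but not a valid UTF-8 sequence on its own:

```shell
# To the kernel a filename is just bytes; only 0x00 and 0x2F are special.
cd "$(mktemp -d)"
touch "$(printf 'Fotos f\374r Laura')"   # 0xFC = 'ü' in ISO-8859-15
ls | od -c                               # the raw byte comes back unchanged
```

The `od -c` dump shows the byte as `374` (octal), proving nothing was
converted on the way in or out.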

> I am using ISO-8859-15, set via LANG=... on some boxes, but does this
> trigger how to store the characters on disk?

Yes. Every time you write a filename, whatever name you supply gets
converted to a stream of bytes according to whatever character encoding the
application doing the naming currently happens to work in. And every time you
read a filename, whatever stream of bytes you receive gets converted to
a string according to whatever character encoding the application doing the
listing currently happens to work in. Very funny things can happen when
those two character encodings are not the same, or when the supposed
file"name" is not a "name" at all, but some binary identifier that some
application chose to stick on a file.

> Or is UTF-8 used in the filesystem?

No. As I said: filenames aren't strings, they are just bytes.

This is different in other filesystems / operating systems. In OS/2's JFS,
the kernel knows about the locale of each process and it also knows about
the character encoding of each filename. IBM tried to port this behaviour
over to Linux JFS, but it was removed a few months ago, because it horribly
broke the above-mentioned concept of "everything is a bytestream". Windows
filesystems also have some default encodings, I think an OEM code page for
FAT and UTF-16 for VFAT long names and NTFS, but I don't know for sure.

> And if it's the former, how can I force e.g. the Linux kernel to write
> the filenames in UTF-8 to disk?

You can't. It's as simple as that. The kernel will always just write the
bytestream that is supplied by the application. It will never do any
character conversions or anything like that. (At least not for native Linux
filesystems; the filesystem drivers for (V)FAT, NTFS, SMB and CIFS do contain
such conversion code.)
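For the filesystems that do convert, the conversion is configured at mount
time. A hedged sketch (device and mount point are made up; check mount(8)
and your kernel's documentation for the options it actually supports):

```shell
# (V)FAT: present the on-disk names to userspace in a chosen encoding.
mount -t vfat -o iocharset=iso8859-15 /dev/sda1 /mnt/stick
# Or hand them to userspace as UTF-8 instead:
mount -t vfat -o utf8 /dev/sda1 /mnt/stick
```
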

So, you will have to make sure that whenever your application supplies a
bytestream to the kernel, this bytestream is actually the UTF-8
representation of your filename. You can usually achieve this by working in
a locale with UTF-8 character encoding. The GLib people believe that
filenames should always be stored in a UTF-8 character encoding, which of
course breaks horribly if you then manipulate those same files in an
ISO-8859-15 locale in Bash, for example.
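If you already have ISO-8859-15-named files and want UTF-8 names, one
sketch is to convert the name bytes yourself with iconv (this loop is
illustrative only; the convmv tool automates exactly this, if you have it
installed):

```shell
# Rename every entry in the current directory from ISO-8859-15 to UTF-8.
for f in *; do
    new=$(printf '%s' "$f" | iconv -f ISO-8859-15 -t UTF-8) || continue
    [ "$f" = "$new" ] || mv -- "$f" "$new"
done
```

Run it once, in the right directory, and only when you are sure the old
names really are ISO-8859-15; converting twice mangles the names.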

The only possibility you have is to make sure that everybody who works with
the same filesystem also uses the same character encoding.

> Because I'd like to take the disk (in this case a USB stick) with me to a
> friend and use it at her computer -- which uses different locale and
> charset settings than the boxes I mostly work at.

That simply won't work. The conversion from string to bytestream has to be
done using the same character encoding as the conversion from bytestream to
string.

You could try to archive the files using an archive format that stores the
character encoding used during archiving. Then, upon extracting the files
from the archive, the extracting utility could convert the filenames to the
currently used character encoding. The pax archive format supports that;
unfortunately, the much more widely deployed ustar and cpio formats do not.
However, I don't know if any existing tool actually implements the pax
archive format. The GNU paxutils are dead. The new integrated GNU
tar+paxutils (aka GNU tar 2.0) will not be released for a long time. SuSE's
port of OpenBSD's pax doesn't support the pax archive format, only several
tar and cpio variants. AFAIK Jörg Schilling's spax (from the star package)
doesn't support the pax archive format either, and even if it does, I don't
know whether it supports filename conversions to and from multibyte,
variable-length character encodings such as UTF-8. I don't know of other
archiving tools.

jwm
-- 
gimp-1.2.5% ./configure
checking for intelligent life... not found



More information about the lfs-chat mailing list