Encoding of filenames
Alexander E. Patrakov
patrakov at ums.usu.ru
Sat Oct 9 08:31:42 PDT 2004
Nico R. wrote:
> I'm a bit confused about some i18n things. Reading about
> G_FILENAME_ENCODING (see GLib or GTK+ sources) made me think about
I will try to explain, but I expect a patch for the BLFS book from you
in order to improve the wording.
> When I have got a directory called "Fotos für Laura" (for the
> non-German-speaking people here, that's "photos for Laura" in English),
> how is the German character "ü" (Latin small letter u with diaeresis)
> stored in the filesystem? I am using ISO-8859-15, set via LANG=... on
> some boxes, but does this trigger how to store the characters on disk?
For now, let me start with the following bash command line typed into
the properly configured xterm or Linux console (properly configured =
mkdir "Fotos für Laura"
You can type it into the bash shell, but bash will not interpret it
until you press Enter. Before you press Enter, the string "Fotos für
Laura" is just a sequence of 15 8-bit bytes that has some meaning to
humans if interpreted according to the ISO-8859-15 encoding.
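
To check the byte count yourself, here is a minimal Python sketch (Python
is used only for illustration, it is not part of the original exchange).
It encodes the directory name in ISO-8859-15 and, for comparison, in
UTF-8:

```python
# The directory name from the example; \u00fc is the letter "ü".
name = "Fotos f\u00fcr Laura"

latin9_bytes = name.encode("iso8859-15")  # what an ISO-8859-15 terminal sends
utf8_bytes = name.encode("utf-8")         # what a UTF-8 terminal would send

print(len(latin9_bytes), latin9_bytes)
print(len(utf8_bytes), utf8_bytes)
```

The first line reports 15 bytes with "ü" stored as the single byte 0xFC;
the second reports 16 bytes because UTF-8 needs two bytes (0xC3 0xBC) for
the same letter.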
Now you press Enter. Bash is stupid enough (in a good sense) to just
start the "mkdir" program with that string passed as an argument without
modifications (i.e. it is still in ISO-8859-15).
As you can find out by running strace, the mkdir program just passes
these 15 bytes (plus a terminating null byte) to the mkdir syscall
without modification. In other words, the kernel receives the 15 bytes
that form a meaningful string in ISO-8859-15.
<start of material to skip during the first reading>
Now the kernel decides how to write these 15 bytes to the disk. The
answer depends on the filesystem.
In all Linux native filesystems, the bytes passed to the mkdir system
call are written to the device without modification. These filesystems
don't have the "iocharset" mount parameter.
With vfat, ntfs and other Windowsish things, the string is first
converted (in the kernel) from the encoding specified in the "iocharset"
mount option to UCS-2 (note that the application does not know this and
should not know since the kernel will convert the string back upon reading).
In short, the kernel in both cases guarantees that the same sequence of
8-bit bytes can be read back.
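
As a rough model of the vfat case, the conversion to UCS-2 and back can
be sketched in Python. This is only an illustration: the function names
are made up, UTF-16-LE stands in for UCS-2 (the two agree for the
characters used here), and the real kernel code is of course C and looks
nothing like this.

```python
def vfat_store(name_bytes, iocharset):
    """Model of the write path: iocharset bytes -> 16-bit units on disk."""
    return name_bytes.decode(iocharset).encode("utf-16-le")


def vfat_load(disk_bytes, iocharset):
    """Model of the read path: 16-bit units on disk -> iocharset bytes."""
    return disk_bytes.decode("utf-16-le").encode(iocharset)


typed = b"Fotos f\xfcr Laura"                 # the 15 bytes from the terminal
on_disk = vfat_store(typed, "iso8859-15")     # 30 bytes, two per character
read_back = vfat_load(on_disk, "iso8859-15")

print(len(on_disk), read_back == typed)
```

The application only ever sees "typed" and "read_back", which are
identical, so it never notices the conversion.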
<end of material to skip during the first reading>
Now let's type another bash command line:
ls -d Foto*
The state of affairs is the same: the kernel returns 15 bytes, and they
are printed on the screen without any conversions. The terminal
interprets the string according to ISO-8859-15, and you can read the result.
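
The same round trip can be watched from a program. The sketch below is
an illustration in Python, assuming the temporary directory lives on a
native Linux filesystem. It passes the raw ISO-8859-15 bytes to mkdir and
reads the directory listing back as bytes:

```python
import os
import tempfile

raw_name = b"Fotos f\xfcr Laura"  # the 15 ISO-8859-15 bytes, no conversion

with tempfile.TemporaryDirectory() as tmp:
    tmp_bytes = os.fsencode(tmp)
    # Byte paths go to the mkdir system call exactly as given.
    os.mkdir(os.path.join(tmp_bytes, raw_name))
    # With a bytes argument, listdir returns the stored names as bytes.
    listed = os.listdir(tmp_bytes)

print(listed)
```

On such a filesystem the listing contains exactly the bytes that were
passed in, which is all that "ls" relies on.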
So the conclusion from the above is: With most programs, filenames are
just sequences of 8-bit bytes that should be displayed without
modifications. There is nothing special about them. In other words, it
is sufficient for the programs to be just 8-bit-clean (and not
necessarily locale-aware) in order to display readable filenames. The
terminal will then interpret them according to its configured encoding
and font.
Please note that the string "Fotos für Laura" is 15 bytes long, and
these 15 bytes are passed to the "mkdir" system call without any
modification. In other words, one can say that ISO-8859-15 is the only
encoding visible in userspace here.
> Or is UTF-8 used in the filesystem?
Glib2 developers think that the above approach is wrong (i.e. that the
"ls" command does the wrong thing). They think that the approach above
works only as long as users don't exchange files (the "problem" they see
is that the interpretation of a filename depends on the locale settings).
Therefore they say: programs should convert filenames to UTF-8 before
passing them to the kernel. That will of course work if people use
only Glib-based programs. Coreutils behave incorrectly according to that
Let's see the breakage.
First, create another user with an almost empty home directory (i.e.
only the standard bash startup files there) and log in as that user. Run
startx. Verify in xterm that the locale settings are correct.
Then, from xterm, unset G_FILENAME_ENCODING and run any gtk2 application
that uses the native "Save" dialog, e.g. gvim or gedit. Edit the text
and press the "save" button. Create the folder "Fotos für Laura" from
within the "Save" dialog (there is a button just for that). Save the
file there. Close the editor.
Now in bash, type the following command:
ls -d "Fotos für Laura"
You will get the following error:
ls: Fotos für Laura: Datei oder Verzeichnis nicht gefunden
(i.e. no such file or directory). Instead, the following directory
exists: "Fotos fÃŒr Laura" and it is difficult to delete from the
command line. For you, only one character changed. For Russians, the
string usually becomes completely unreadable.
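
What happens here can be reproduced at the byte level. The following
Python sketch is only an illustration; the Russian word and the KOI8-R
locale are my assumptions for the second example. It compares the bytes
the "Save" dialog stored with the bytes bash passes to "ls":

```python
name = "Fotos f\u00fcr Laura"               # \u00fc is "ü"

created_by_gtk = name.encode("utf-8")       # bytes stored by the Save dialog
typed_in_xterm = name.encode("iso8859-15")  # bytes bash hands to ls

# Different byte strings, so the lookup fails with "no such file".
print(created_by_gtk == typed_in_xterm)

# What an ISO-8859-15 terminal shows for the directory that does exist.
shown_by_ls = created_by_gtk.decode("iso8859-15")
print(ascii(shown_by_ls))  # ascii() keeps the output printable anywhere

# Assumed Russian example: "Фото" stored as UTF-8, viewed in KOI8-R.
russian = "\u0424\u043e\u0442\u043e"
garbled = russian.encode("utf-8").decode("koi8-r")
print(len(russian), len(garbled), ascii(garbled))
```

In the German name one letter turns into the two characters "ÃŒ". In the
Russian word every letter turns into two unrelated characters, so nothing
readable is left.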
So, until coreutils and all other programs support the notion of
filename encoding being different from the default locale encoding, I
will say that the default setting of glib2 introduces too much breakage.
Fortunately, Glib2 allows us to make it compatible with the rest of the
world, by setting G_FILENAME_ENCODING (please repeat the test with this
variable set to "@locale").
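
The effect of the variable can be modelled in a few lines of Python.
This is a simplified sketch of the behaviour as I understand it, not
GLib code: the function names are hypothetical, and the only cases
handled are the variable being unset, set to "@locale", or set to an
explicit charset name (first entry of a comma-separated list).

```python
def filename_charset(environ, locale_charset):
    """Pick the charset a GLib-like program would use for filenames."""
    value = environ.get("G_FILENAME_ENCODING", "")
    first = value.split(",")[0].strip()
    if not first:
        return "UTF-8"          # default when the variable is unset
    if first == "@locale":
        return locale_charset   # follow the locale, like everything else
    return first                # an explicitly named charset


def gtk_save_name(name, environ, locale_charset):
    """Bytes such a program would pass to the kernel for this name."""
    return name.encode(filename_charset(environ, locale_charset))


name = "Fotos f\u00fcr Laura"
print(gtk_save_name(name, {}, "ISO-8859-15"))
print(gtk_save_name(name, {"G_FILENAME_ENCODING": "@locale"}, "ISO-8859-15"))
```

With the variable unset the name goes out as UTF-8 (16 bytes); with
"@locale" it goes out as the same 15 ISO-8859-15 bytes that bash and
coreutils use.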
> So the question is whether setting a certain locale (and charset) only
> affects what is displayed on my screen or also affects how programs
> work internally.
This affects e.g. the sorting order; compare:
echo -e 'Foo\nbar' | LC_ALL=C sort
echo -e 'Foo\nbar' | LC_ALL=en_US sort
But, unless the program uses glib2, this has no effect on the
interpretation of filenames.
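
If the en_US locale is not installed on your box, the difference can
still be illustrated in Python. Note the assumption: the case-insensitive
key below is only a stand-in for en_US collation, not the real collation
rules.

```python
words = ["Foo", "bar"]

# LC_ALL=C compares raw byte values: "F" (0x46) comes before "b" (0x62).
c_order = sorted(words, key=lambda w: w.encode("ascii"))

# Rough stand-in for a dictionary-like locale order: ignore case.
dictionary_like = sorted(words, key=str.casefold)

print(c_order)          # ['Foo', 'bar']
print(dictionary_like)  # ['bar', 'Foo']
```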
> And if it's the former, how can I force e.g. the Linux kernel to write
> the filenames in UTF-8 to disk? Because I'd like to take the disk (in
> this case a USB stick) with me to a friend and use it at her computer
> - which uses different locale and charset settings than the boxes I
> mostly work at.
If the USB stick uses the vfat filesystem, mount it with the correct
"iocharset=iso8859-15,codepage=437" options both here and there. This
stick will also be Windows-compatible.
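
Why this works across machines can be sketched as follows. This is an
illustrative Python model only: the helper name is made up, UTF-16-LE
stands in for the on-disk UCS-2 name, and I am assuming that the friend's
box is mounted with a UTF-8 iocharset.

```python
# The name as stored on the stick, independent of any locale.
on_disk = "Fotos f\u00fcr Laura".encode("utf-16-le")


def name_seen_by_programs(disk_name, iocharset):
    """Bytes that userspace receives for a given iocharset mount option."""
    return disk_name.decode("utf-16-le").encode(iocharset)


here = name_seen_by_programs(on_disk, "iso8859-15")  # your box
there = name_seen_by_programs(on_disk, "utf-8")      # assumed friend's box

print(here)
print(there)
print(here.decode("iso8859-15") == there.decode("utf-8"))
```

The bytes differ between the two boxes, but each terminal decodes its own
bytes to the same readable name.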
If the stick uses one of the native Linux filesystems, you can use the
"convmv" Perl script (search Freshmeat for it). But after conversion, the
filenames will be unreadable on boxes that still use the old encoding.
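
To show what such a conversion amounts to, here is a dry-run sketch in
Python. It is not convmv's implementation and does not mirror its
options; the function name is hypothetical and nothing is renamed on
disk.

```python
def plan_renames(names, src, dst):
    """Return (old, new) byte-name pairs for names that would change."""
    plan = []
    for old in names:
        new = old.decode(src).encode(dst)
        if new != old:  # pure-ASCII names stay as they are
            plan.append((old, new))
    return plan


names = [b"Fotos f\xfcr Laura", b"README"]
plan = plan_renames(names, "iso8859-15", "utf-8")

for old, new in plan:
    print(old, "->", new)
```

Only the name containing "ü" changes. A real run would then rename each
pair using byte paths.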
> Would be very nice if somebody could enlighten me a bit about that (to
> me) very confusing i18n, locale and charset stuff - or, also
> appreciated, if somebody could give me a pointer to a man page or
> source file with some explanations. My Googling skills brought up some
> comments and articles, but it didn't help me a lot this time... :-/
Feel free to ask me specific questions. Unfortunately, since English
speakers don't have these problems, there is really little information
on that subject in English.
Alexander E. Patrakov