Encoding of filenames

Alexander E. Patrakov patrakov at ums.usu.ru
Sat Oct 9 08:31:42 PDT 2004


Nico R. wrote:

> I'm a bit confused about some i18n things. Reading about
> G_FILENAME_ENCODING (see GLib or GTK+ sources) made me think about
> this.

I will try to explain, but I expect a patch for the BLFS book from you in 
order to make the wording better.

> When I have got a directory called "Fotos für Laura" (for the
> non-German-speaking people here, that's "photos for Laura" in English),
> how is the German character "ü" (Latin small letter u with diaeresis)
> stored in the filesystem? I am using ISO-8859-15, set via LANG=... on
> some boxes, but does this trigger how to store the characters on disk?

It depends.

For now, let me start with the following bash command line typed into 
a properly configured xterm or Linux console (properly configured = 
speaks ISO-8859-15):

mkdir "Fotos für Laura"

You can type it into the bash shell, but bash will not interpret it 
until you press Enter. Before you press Enter, the string "Fotos für 
Laura" is just a sequence of 15 8-bit bytes that has some meaning to 
humans if interpreted according to the ISO-8859-15 encoding.
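The byte count is easy to check. The sketch below writes the name with octal escapes so that it does not depend on how your own terminal is configured: \374 is the ISO-8859-15 byte for "ü", and \303\274 is the UTF-8 form of the same letter, added here only for comparison. The variable names are made up for this example.

```shell
# "Fotos für Laura" spelled out byte by byte; \374 is "ü" in ISO-8859-15.
latin9_len=$(printf 'Fotos f\374r Laura' | wc -c | tr -d ' ')
# The same name in UTF-8, where "ü" takes two bytes (\303\274).
utf8_len=$(printf 'Fotos f\303\274r Laura' | wc -c | tr -d ' ')
echo "ISO-8859-15: $latin9_len bytes, UTF-8: $utf8_len bytes"
```

The one-byte difference between the two counts is the whole difference between the two spellings of the name.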

Now you press Enter. Bash is stupid enough (in a good sense) to just 
start the "mkdir" program with that string passed as an argument, without 
modification (i.e. it is still in ISO-8859-15).

As you can find out by running strace, the mkdir program just passes 
these 15 bytes (plus a terminating null byte) to the mkdir syscall 
without modification. In other words, the kernel receives the 15 bytes 
that form a meaningful string in ISO-8859-15.

<start of material to skip during the first reading>
Now the kernel decides how to write these 15 bytes to the disk. The 
answer depends on the filesystem.

On all native Linux filesystems, the bytes passed to the mkdir system call 
are written to the device without modification. These filesystems don't 
have the "iocharset" mount parameter.

With vfat, ntfs and other Windowsish things, the string is first 
converted (in the kernel) from the encoding specified in the "iocharset" 
mount option to UCS-2. The application does not know this and does not 
need to, since the kernel converts the string back upon reading.

In short, the kernel in both cases guarantees that the same sequence of 
8-bit bytes can be read back.
<end of material to skip during the first reading>

Now let's type another bash command line:

ls -d Foto*

The state of affairs is the same: the kernel returns 15 bytes, and they 
are printed on the screen without any conversions. The terminal 
interprets the string according to ISO-8859-15, and you can read the result.
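The round trip can be checked without relying on the terminal at all. This is a minimal sketch, assuming a native Linux filesystem under the temporary directory; the variable names are made up for the example. It creates the directory from raw ISO-8859-15 bytes and dumps what "ls" returns as hex.

```shell
# Work in a scratch directory created by mktemp.
scratch=$(mktemp -d)
# Build the name from raw bytes; \374 is "ü" in ISO-8859-15.
name=$(printf 'Fotos f\374r Laura')
mkdir "$scratch/$name"
# Read the name back and dump it as hex (ls adds a trailing newline, 0a).
readback=$(ls "$scratch" | od -An -tx1 | tr -d ' \n')
echo "$readback"
rm -r "$scratch"
```

If the filesystem stores the bytes untouched, the dump contains the single byte fc between 66 ("f") and 72 ("r"), exactly as it was passed in.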

So the conclusion from the above is: With most programs, filenames are 
just sequences of 8-bit bytes that should be displayed without 
modifications. There is nothing special about them. In other words, it 
is sufficient for the programs to be just 8-bit-clean (and not 
necessarily locale-aware) in order to display readable filenames. The 
terminal will then interpret them according to the configured font.

Please note that the string "Fotos für Laura" is 15 bytes long, and 
these 15 bytes are passed to the "mkdir" system call without any 
modification. In other words, one can say that ISO-8859-15 is the only 
encoding visible in userspace here.

> Or is UTF-8 used in the filesystem?

Glib2 developers think that the above approach is wrong (i.e. that the 
"ls" command does the wrong thing). They think that the approach above 
works only as long as users don't exchange files (the "problem" they see 
is that the interpretation of a filename depends on the locale settings). 
Therefore they say: programs should convert filenames to UTF-8 before 
passing them to the kernel. That will of course work if people use only 
Glib-based programs. Coreutils behave incorrectly according to that 
viewpoint.

Let's see the breakage.

First, create another user with an almost empty home directory (i.e. 
only the standard bash startup files there) and log in as that user. Run 
startx. Verify in xterm that the locale settings are correct.

Then, from xterm, unset G_FILENAME_ENCODING and run any gtk2 application 
that uses the native "Save" dialog, e.g. gvim or gedit. Edit the text 
and press the "save" button. Create the folder "Fotos für Laura" from 
within the "Save" dialog (there is a button just for that). Save the 
file there. Close the editor.

Now in bash, type the following command:

ls -d "Fotos für Laura"

You will get the following error:

ls: Fotos für Laura: Datei oder Verzeichnis nicht gefunden

(i.e. no such file or directory). Instead, the following directory 
exists: "Fotos fÃŒr Laura" and it is difficult to delete from the 
command line. For you, only one character changed. For Russians, the 
string usually becomes completely unreadable.
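Where the two strange characters come from can be reproduced with iconv (assuming your iconv knows the names ISO-8859-15 and UTF-8, as the glibc one does). The sketch takes the UTF-8 bytes that the "Save" dialog wrote and decodes them as if they were ISO-8859-15, which is what the terminal does. The result is re-encoded as UTF-8 only so that it can be stored and printed anywhere.

```shell
# The name as the GTK+ dialog stored it: UTF-8, "ü" is the two bytes \303\274.
# Decoding those bytes as ISO-8859-15 turns each byte into its own character.
shown=$(printf 'Fotos f\303\274r Laura' | iconv -f ISO-8859-15 -t UTF-8)
echo "$shown"
# Hex dump of the stored name: 16 bytes instead of the 15 that bash passes.
printf 'Fotos f\303\274r Laura' | od -An -tx1
```

Byte \303 is "Ã" and byte \274 is "Œ" in ISO-8859-15, hence the name shown above. When you type "ü" in bash, the single byte \374 is sent instead, so "ls" looks for a name that does not exist.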

So, until coreutils and all other programs support the notion of 
filename encoding being different from the default locale encoding, I 
will say that the default setting of glib2 introduces too much breakage. 
Fortunately, Glib2 allows us to make it compatible with the rest of the 
world, by setting G_FILENAME_ENCODING (please repeat the test with this 
variable set to "@locale").
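For the repeated test, the variable has to be in the environment of the gtk2 program. A minimal sketch; gedit is only an example, and where you put the two lines permanently (e.g. your bash startup files) is up to you.

```shell
# Tell Glib2 that filenames on disk are in the encoding of the current locale.
G_FILENAME_ENCODING=@locale
export G_FILENAME_ENCODING
# Then start the gtk2 program from the same shell, for example:
#   gedit
```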

> So the question is whether setting a certain locale (and charset) only
> affects what is displayed on my screen or also affects how programs
> work internally.

It affects, for example, the sorting order. Compare:

echo -e 'Foo\nbar' | LC_ALL=C sort
echo -e 'Foo\nbar' | LC_ALL=en_US sort

But, unless the program uses glib2, this has no effect on the 
interpretation of filenames.

> And if it's the former, how can I force e.g. the Linux kernel to write
> the filenames in UTF-8 to disk? Because I'd like to take the disk (in
> this case a USB stick) with me to a friend and use it at her computer
> - which uses different locale and charset settings than the boxes I
> mostly work at.

If the USB stick uses the vfat filesystem, mount it with the correct 
"iocharset=iso8859-15,codepage=437" options both here and there. This 
stick will also be Windows-compatible.
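As a sketch, the mount could be wrapped like this. The device and mount point in the comment are placeholders, not something I know about your system, and mount_stick is a made-up helper name. Your friend would put the charset of her own locale into "iocharset".

```shell
# Mount options for a box that uses ISO-8859-15.
vfat_opts="iocharset=iso8859-15,codepage=437"
# Hypothetical helper: $1 is the device, $2 is the mount point.
mount_stick() {
    mount -t vfat -o "$vfat_opts" "$1" "$2"
}
# Run as root, with your real device and directory, for example:
#   mount_stick /dev/sda1 /mnt/usb
```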

If the stick uses one of the native Linux filesystems, you can use the 
"convmv" Perl script (search Freshmeat for it). But after conversion the 
filenames will be unreadable on any box that still uses ISO-8859-15.
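To show what such a conversion amounts to, here is a rough sketch that renames the entries of one directory from ISO-8859-15 to UTF-8 using iconv and mv. It is not convmv and does far less: rename_to_utf8 is a made-up helper, it does not recurse, and it does not check whether a name is already UTF-8, so running it twice on the same directory will mangle the names.

```shell
# Rename every entry of directory $1 from ISO-8859-15 to UTF-8.
# rename_to_utf8 is a hypothetical helper, not part of convmv.
rename_to_utf8() {
    for old in "$1"/*; do
        base=${old##*/}
        # Convert only the last path component; skip it if iconv fails.
        new=$(printf '%s' "$base" | iconv -f ISO-8859-15 -t UTF-8) || continue
        # Pure ASCII names come out unchanged and need no rename.
        [ "$base" = "$new" ] || mv -- "$old" "$1/$new"
    done
}
```

Try it on a copy of the data first, e.g. "rename_to_utf8 /path/to/copy".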

> Would be very nice if somebody could enlighten me a bit about that (to
> me) very confusing i18n, locale and charset stuff - or, also
> appreciated, if somebody could give me a pointer to a man page or
> source file with some explanations. My Googling skills brought up some
> comments and articles, but it didn't help me a lot this time... :-/

Feel free to ask me specific questions. Unfortunately, since English 
speakers don't have these problems, there is really little information 
on the subject in English.

-- 
Alexander E. Patrakov



More information about the lfs-chat mailing list