Nano locale issues

Alexander E. Patrakov patrakov at ums.usu.ru
Thu Apr 13 07:03:53 PDT 2006


I wrote:
> DJ Lucas wrote:
>> Lastly, while still on the locale issues page, I'm not qualified to 
>> write the text, but I'd like to see some specific examples of the 
>> types of problems that may be seen, along with a short paragraph or 
>> two about the adoption of UTF-8 and its use in other distros as an 
>> example to prove (or disprove) its future.  I can see that this page 
>> would be very valuable WRT education.
> 
> Thanks for the suggestion. I will write some text for it later today.
> 

Below is an attempt to classify the problems. Locale-related issues stem from 
the following sources:

I) Incorrect assumptions about the encoding of some input or output.

The result is either wrong characters (which don't form words a human can 
read), blanks, or "can't render this" boxes. A simple real-world example is 
at http://wiki.linuxfromscratch.org/blfs/wiki/ID3v1Coding.

The problem is sometimes compounded by there being no way to tell the program 
the correct encoding (see the Cdrtools, A2PS and Enscript wiki pages).
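
A minimal sketch of case I, in Python (used here only for illustration; the 
sample string and the choice of encodings are assumptions, not taken from the 
wiki pages mentioned above). The same bytes, read under a wrong encoding 
assumption, give either wrong characters or replacement marks:

```python
# Text written by a program that assumes KOI8-R (hypothetical sample data).
original = "АБВГ"
raw = original.encode("koi8-r")

# A reader that wrongly assumes Latin-1 gets wrong characters, not an error.
wrong = raw.decode("latin-1")

# A reader that wrongly assumes UTF-8 cannot decode these bytes;
# errors="replace" substitutes U+FFFD, the usual "can't render this" mark.
boxes = raw.decode("utf-8", errors="replace")

# Only the correct assumption recovers the text.
right = raw.decode("koi8-r")

print(repr(wrong), repr(boxes), repr(right))
```

Note that nothing in the bytes themselves says which reading is correct, which 
is why a program with no way to be told the encoding cannot be fixed by the 
user.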

II) The assumption that "characters" and "bytes" are the same thing, and that 
each character occupies one character cell.

The results vary. Possible outcomes:

1) User-visible strings are cut too short because they actually occupy fewer 
character cells than the program thinks by counting bytes.

2) Misalignment and other misplacement of text strings, for the same reason.

This usually accumulates over time until the screen becomes completely messed up. 
The MC screenshot on the Locale Related Issues page is a good example.

3) Non-ASCII characters become pairs of "can't render this" boxes because a 
lone byte cannot be drawn (the program says "draw this byte, draw the next 
byte" without understanding that these bytes form a character only when taken 
together).
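
The three outcomes of case II can be reproduced with a short Python sketch. 
The sample strings are assumptions, and cells() is a hypothetical helper that 
only roughly approximates what a terminal does (it is not wcwidth()):

```python
import unicodedata

text = "АБВГ"                 # four Cyrillic letters (sample data)
data = text.encode("utf-8")   # what a byte-oriented program actually sees

n_chars = len(text)           # 4 characters
n_bytes = len(data)           # 8 bytes: each letter takes two bytes in UTF-8

# Cutting at a byte count can split a character in half.
cut = data[:3]                # "А" plus the first byte of "Б"
shown = cut.decode("utf-8", errors="replace")

def cells(s):
    """Rough count of terminal cells: wide (W/F) characters take two,
    combining marks take none, everything else takes one."""
    total = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total

print(n_chars, n_bytes, repr(shown), cells(text), cells("日本語"))
```

A program that pads or truncates by n_bytes instead of by cells() will be off 
by four columns on this one string, and the error adds up with every string 
drawn.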

III) Use of the byte-oriented isprint(), isupper(), islower(), ... C functions, 
which classify a single "char" (really, a byte), instead of the wide-character 
iswprint(), iswupper(), iswlower() family.

Since a byte that is only part of a character is neither a letter nor a digit, 
and is not even treated as a printable character, the program treats non-ASCII 
characters as something that must be hidden from the user.

Similarly, the tolower()/toupper() functions don't work with multibyte 
characters. Test case (try it in the ru_RU.UTF-8 and ru_RU.KOI8-R locales, in 
xterms started as "LANG=..... xterm -lc"): echo АБВГ | tr '[:upper:]' '[:lower:]'
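
The same effect can be imitated in Python. This is only an analogy to the C 
functions, not the C API itself: the range check below stands in for an 
ASCII-only isprint(), and bytes.lower() maps only ASCII letters and ignores 
the locale (unlike C tolower(), which does work per byte in a single-byte 
locale such as KOI8-R). The sample string is an assumption.

```python
text = "АБВГ"
data = text.encode("utf-8")

# Byte-at-a-time view: no byte of these characters is in the ASCII
# printable range, so an ASCII-only check rejects every one of them.
byte_printable = [0x20 <= b < 0x7F for b in data]

# bytes.lower() only maps ASCII A-Z, so the UTF-8 bytes pass through unchanged.
byte_lower = data.lower()

# Character-aware view: the text is printable and lowercases correctly.
char_printable = text.isprintable()
char_lower = text.lower()

print(any(byte_printable), byte_lower == data, char_printable, char_lower)
```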

IV) Use of characters outside of the Latin-1 set in PostScript output without 
embedding the corresponding fonts (this was the case in Firefox < 1.5). Real 
PostScript printers don't understand this and print blanks.

V) Misc. bugs, e.g. in the es_ES.UTF-8 locale the "cfdisk" program, when 
prompting for "yes or no" (translated), doesn't display what the user typed.

As for the minimal amount of work required:

Unfixable packages (and packages with horrible workarounds, which is the same):

           o Ed-0.2
           o ASH-0.4.0
           o Tcsh-6.14.00
           o ZSH-4.2.5 (devel version = partial fix)
           o Ispell-3.2.06.epa7
           o UnZip-5.52 (already covered)
           o Zip-2.31 (similar to UnZip, can't create Windows-readable Zip files)
           o MC-4.6.1 (already covered, the wiki is too optimistic)
           o Links-2.1pre17
           o Pine-4.63 (?)
           o Slrn-0.9.8.1 (?)
           o All FTP servers except wuftpd (which is insecure)
           o Dillo-0.8.5
           o Almost all MP3 players
           o a2ps-4.13b
           o Enscript-1.6.4

Fixable packages (even if the fix is to use a development version):
           o Nano-1.2.5
           o ZSH-4.2.5 (partially)
           o Lynx-2.8.5
           o Xorg-6.9.0
           o Xorg-7.0.0 (https://bugs.freedesktop.org/show_bug.cgi?id=2676 and 
its duplicates)
           o XFree86-4.5.0
           o Cdrtools-2.01

Victims of bugs in other packages:
           o GTK+-1.2.10 (Xorg bug)
           o LessTif-0.94.4 (Xorg bug)
           o Emacs-21.4a (Xorg bug)
           o Tk-8.4.12 (Xorg bug)
           o Nautilus CD Burner-2.12.2 (cdrtools bug)

Packages that work but are often misunderstood and thus require textual changes:
           o OpenSSH-4.3p2 Client
           o teTeX-3.0 (the package is misnamed as "TeX-3.0")

Packages that need to be (re-)tested:
           o All Gnome-1.4 apps
           o HAL
           o Pine-4.63 with patch
           o Slrn-0.9.8.1 with patch
           o Heirloom mailx-12.0
           o FOP-0.20.5
           o Yelp-2.12.2

Minor failures (like MPlayer not displaying non-ASCII characters in its main 
window) are not accounted for above.

Of course a whitelist would be a more sensible approach, but I am not ready to 
prepare a long whitelist now.

-- 
Alexander E. Patrakov



