HOWTO Make your system use unicode/utf-8

From Gentoo Linux Wiki

Because computers store information only in bits of zeros and ones, characters have to be represented by a string of bits and translated back and forth using "character tables". To conserve memory, each character should be made up of as few bits as possible.

The drawback is that this limits the number of characters that can be represented by the table. As long as the table contains all the characters you need, there are no problems. The moment one shares a file with someone who uses a different character table, things start going wrong.

Some tables (such as the ISO-8859-* family) overlap: the same bit string represents the same character in each of them. Other characters exist in only one of the tables, and these are, naturally, the main source of trouble.
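
As a small illustration of such a conflict, the single byte 0xA4 is the currency sign in ISO-8859-1 but the euro sign in ISO-8859-15. The following is only a sketch, assuming the iconv and od utilities are installed; the helper name utf8_hex_of_byte is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: interpret one byte (given in octal as $1) according to
# the character table named in $2, and print the UTF-8 bytes of the resulting
# character as hex digits.
utf8_hex_of_byte() {
    printf "\\$1" | iconv -f "$2" -t UTF-8 | od -An -tx1 | tr -d ' \n'
}

# Octal 244 is the byte 0xA4.
utf8_hex_of_byte 244 ISO-8859-1;  echo   # c2a4   (U+00A4, currency sign)
utf8_hex_of_byte 244 ISO-8859-15; echo   # e282ac (U+20AC, euro sign)
```

The same byte therefore ends up as two different characters, depending on which table the reader assumes.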

There are two solutions to this problem. Either every text file must carry information about the character table it uses, or there must be a single table that incorporates each and every character in the world.

Unicode implements the latter approach. It allows users to write and exchange information without compatibility worries and, with falling prices for storage, it has become very popular. Users only have to make sure that their software supports Unicode and that they have fonts installed that can display all the characters they wish to use (no single font implements all the characters in Unicode).

Kernel Stuff

To activate Unicode support in the kernel, set the following options:

Linux Kernel Configuration: Unicode support
File systems --->
 Native Language Support --->
   (utf8) Default NLS Option
   <*>   NLS UTF8

After you recompile your kernel, your filenames will be encoded in UTF-8 by default.

If you compiled it as a module, be sure to load it:

modprobe nls_utf8

To avoid doing this every time you boot, add "nls_utf8" to /etc/modules.autoload.d/kernel-2.6 (or kernel-2.4, depending on your kernel series).
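
To see whether the option ended up built in or as a module, the kernel configuration can be inspected. This is a minimal sketch, assuming the kernel sources live in /usr/src/linux; the helper name kernel_option is made up for this example, while CONFIG_NLS_UTF8 and CONFIG_NLS_DEFAULT are the configuration symbols behind the two menu entries above.

```shell
#!/bin/sh
# Hypothetical helper: report how an option is set in a kernel .config file.
# Prints the value ("y", "m", a quoted string, ...) or "unset".
kernel_option() {
    config_file=$1
    option=$2
    value=$(sed -n "s/^${option}=//p" "$config_file" 2>/dev/null | head -n 1)
    echo "${value:-unset}"
}

# Assumed location of the configuration of the kernel being built.
config=/usr/src/linux/.config
if [ -r "$config" ]; then
    echo "CONFIG_NLS_UTF8:    $(kernel_option "$config" CONFIG_NLS_UTF8)"
    echo "CONFIG_NLS_DEFAULT: $(kernel_option "$config" CONFIG_NLS_DEFAULT)"
fi
```

A value of "m" means the modprobe step above applies; "y" means the support is built in.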

Kernel Bugs

Note that some Linux kernel versions have a bug that affects dead keys in UTF-8 locales. The issue has reportedly been fixed as of kernel version 2.6.11.

Installing locales

The system locales come with the glibc package. By default almost all possible locales are installed, though you can choose to install only the locales you need.

See TIP Specifying only needed locales for instructions.

Console setup

In ~root/.bashrc add

File: ~/.bashrc
if [[ $TERM = "linux" ]]; then
  unicode_start
fi

to set the console into Unicode mode when root logs in (use "unicode_start foo_font" to set a custom font).


However, since "unicode_start" requires root privileges, you can instead configure your Gentoo system to use Unicode consoles by default for all logins. For this to work, you must have a recent version of sys-apps/baselayout installed (>=sys-apps/baselayout-1.11.9).

First, change the unicode setting in /etc/rc.conf

File: /etc/rc.conf
UNICODE="yes"

Mind the case. UNICODE="YES" will NOT work.
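
Since the setting is case-sensitive, a quick check can catch a mistyped value. This is only a sketch; the helper name unicode_enabled is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the file sets UNICODE to lower-case
# "yes" (with or without double quotes).
unicode_enabled() {
    grep -q '^UNICODE="\{0,1\}yes"\{0,1\}[[:space:]]*$' "$1"
}

rc_conf=/etc/rc.conf
if [ -r "$rc_conf" ]; then
    if unicode_enabled "$rc_conf"; then
        echo "UNICODE is enabled in $rc_conf"
    else
        echo "UNICODE is not set to \"yes\" in $rc_conf"
    fi
fi
```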

Then install Terminus, a good font for UTF-8 consoles:

Code: emerge terminus
emerge -av media-fonts/terminus-font


Also edit the following files, according to their comments:

/etc/conf.d/consolefont
/etc/conf.d/keymaps


One example for setting the console font is

File: /etc/conf.d/consolefont
CONSOLEFONT="ter-v16b"
#CONSOLETRANSLATION=""


Now reboot the system. The init scripts will automatically enable UTF-8 capability on all console logins. However, a particular console login won't actually display UTF-8 until it receives a switch-to-Unicode escape sequence.

The last step is to make the following change so that the switch-to-Unicode escape sequence is sent at each login:

File: ~/.bash_profile
if test -t 1 -a -t 2 ; then
        echo -n -e '\033%G'
fi

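This code sends the switch-to-Unicode escape sequence whenever standard output and standard error are both attached to a terminal. Note that the test by itself does not tell a console TTY apart from a terminal emulator or remote shell, since those present a terminal as well. The code block is taken directly from the internals of the "unicode_start" command.

Or, to make the switch to UTF-8 global for all users (this could be problematic):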

File: /etc/profile
if test -t 1 -a -t 2 ; then
        echo -n -e '\033%G'
fi
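
The test in the two snippets above only checks that standard output and standard error are terminals. A stricter variant, sketched here, additionally requires TERM to be "linux", as the ~/.bashrc example earlier does. The function name on_linux_console is made up for this sketch, and printf is used in place of echo -n -e for portability.

```shell
#!/bin/sh
# Hypothetical helper: true only when TERM is "linux" and both stdout and
# stderr are terminals.
on_linux_console() {
    [ "$TERM" = "linux" ] && [ -t 1 ] && [ -t 2 ]
}

if on_linux_console; then
    # ESC % G selects UTF-8 on the Linux console.
    printf '\033%%G'
fi
```

TERM can still be "linux" in a remote session that was started from a console, so this narrows the cases rather than ruling them all out.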


As a final, last-ditch alternative, you can use this init.d script to set all consoles into unicode mode on bootup:

File: /etc/init.d/unicode
#!/sbin/runscript
conf=/etc/env.d/02locale

# Using devfs?
if [ -e /dev/.devfsd ] || [ -e /dev/.udev -a -d /dev/vc ]; then
  device=/dev/vc/
else
  device=/dev/tty
fi

depend() {
        need localmount
        after keymaps
        before consolefont
}

checkconfig() {

  if [ -r ${conf} ]; then
          . ${conf}
          encoding=
          [ -n "${LC_ALL}" ]      && encoding=${LC_ALL#*.}   && return 0
          [ -n "${LC_MESSAGES}" ] && encoding=${LC_MESSAGES#*.} && return 0
          [ -n "${LANG}" ]        && encoding=${LANG#*.}   && return 0
  fi
  eend 1 "Locale is not configured, Please fix ${conf}"
  return 1
}

start() {
        ebegin "setting consoles to UTF-8"
        checkconfig || return 1
        if [[ "${encoding}" =~ [uU][tT][fF]-?8 ]]; then
                dumpkeys | loadkeys --unicode
                for ((i=1; i <= "${RC_TTY_NUMBER}"; i++)); do
                        echo -ne "\033%G" > ${device}${i}
                done
                eend 0
        else
                eend 1 "UTF-8 is not required"
        fi
}

Code: make the script executable
chmod +x /etc/init.d/unicode 

and then

Code: add the script
rc-update add unicode default
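
The way the init script decides whether UTF-8 is wanted can be tried out on its own. The following sketch mirrors the precedence used in checkconfig() (LC_ALL, then LC_MESSAGES, then LANG) and a check similar to the one in start(); the function names are made up for this example, and the case pattern stands in for the bash regular expression so that the sketch also runs in plain sh.

```shell
#!/bin/sh
# Hypothetical helper: print the part after the first dot of the first
# non-empty argument, e.g. "UTF-8" for "de_DE.UTF-8".
locale_encoding() {
    for value in "$@"; do
        if [ -n "$value" ]; then
            echo "${value#*.}"
            return 0
        fi
    done
    return 1
}

# Hypothetical helper: succeed if the encoding name starts with "utf8" or
# "utf-8" in any letter case.
is_utf8() {
    case $1 in
        [uU][tT][fF]-8*|[uU][tT][fF]8*) return 0 ;;
        *) return 1 ;;
    esac
}

if enc=$(locale_encoding "$LC_ALL" "$LC_MESSAGES" "$LANG"); then
    if is_utf8 "$enc"; then
        echo "UTF-8 consoles wanted (encoding: $enc)"
    else
        echo "UTF-8 is not required (encoding: $enc)"
    fi
else
    echo "Locale is not configured"
fi
```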


Sometimes you may also need to set the LC_ALL and LANG environment variables. This is easy to do by following the instructions in the Gentoo Linux Localization Guide.

Converting old files

Once Unicode support has been added, old files may need to be re-encoded to display properly.

To re-encode the contents of plain text files, you can choose among iconv, recode, and enconv (which is in app-i18n/enca).
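
With iconv, for instance, a conversion could look like the following sketch. The helper name convert_to_utf8 and the ".utf8" output suffix are made up for this example; the source encoding has to be known in advance, since iconv does not detect it.

```shell
#!/bin/sh
# Hypothetical helper: write a UTF-8 copy of a text file next to the original.
# $1 = source encoding, $2 = input file; the result goes to "$2.utf8".
convert_to_utf8() {
    iconv -f "$1" -t UTF-8 "$2" > "$2.utf8"
}

# Example with a throwaway file that contains "ä" as the single byte 0xE4.
sample=$(mktemp)
printf 'B\344r\n' > "$sample"
convert_to_utf8 ISO-8859-15 "$sample"
od -An -tx1 "$sample.utf8"    # the byte e4 has become the pair c3 a4
rm -f "$sample" "$sample.utf8"
```

Writing to a separate file keeps the original untouched until the result has been checked.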

app-text/convmv is a Perl script that re-encodes filenames, directory names, and entire subtrees. Emerge it with

Code:
emerge -av app-text/convmv

To test re-encoding a filename from ISO-8859-15 to UTF-8, try

Code:
convmv -f iso-8859-15 -t utf8 file-name-with-ä

and if the operations it prints seem sane, add --notest to actually re-encode the name.
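
To find out which names need converting in the first place, one option is to look for names that are not valid UTF-8. The sketch below relies on iconv failing when asked to convert invalid input from UTF-8 to UTF-8; both function names are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the string is valid UTF-8.
is_valid_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Hypothetical helper: list entries below directory $1 whose names are not
# valid UTF-8. (Names containing newlines are not handled by this loop.)
list_non_utf8_names() {
    find "$1" | while IFS= read -r path; do
        is_valid_utf8 "$path" || printf '%s\n' "$path"
    done
}
```

Running list_non_utf8_names on a directory gives the candidates to pass to convmv; plain ASCII names count as valid UTF-8 and are not listed.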

Apps

Terminal emulators

xterm

xterm runs in Unicode mode when started with one of:

Code:

xterm -u8

uxterm

urxvt

Urxvt from x11-terms/rxvt-unicode always runs in Unicode mode. If you want it to use UTF-8, you have to set your LANG accordingly (e.g. LANG="en_US.UTF-8").
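
Whether the current locale settings really select UTF-8 can be checked with "locale charmap" before starting the terminal. This is a minimal sketch; the helper name charmap_is_utf8 is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a charmap name, as printed by
# "locale charmap", denotes UTF-8.
charmap_is_utf8() {
    case $1 in
        UTF-8|utf-8|UTF8|utf8) return 0 ;;
        *) return 1 ;;
    esac
}

if charmap_is_utf8 "$(locale charmap 2>/dev/null)"; then
    echo "current locale uses UTF-8"
else
    echo "current locale does not use UTF-8; check LANG and LC_*"
fi
```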

GNU Screen

GNU Screen must be invoked with the -U command line option.

If you are using it as a login shell, you will have to write a wrapper that calls screen with the -U option in addition to the options normally used when screen runs as a login shell:

Code: GNU Screen wrapper
#!/bin/sh
exec /usr/bin/screen -xRR -U

For people who use it for irssi and the like, an alias is enough.

File: ~/.bashrc
alias screen="screen -U"

However, if you are running screen from an SSH or RSH session, then editing the screen configuration should be enough.

Add the following to ~/.screenrc

File: ~/.screenrc
defutf8 on

Editors

Vim should work out of the box, since version 6.3 or so.

Nano versions prior to 1.3.6 can't handle UTF-8 properly. At the time of writing, the keyword change below is only needed on the alpha and ppc-macos platforms.

Code:
echo "=app-editors/nano-1.3.6 ~alpha" >> /etc/portage/package.keywords
emerge -uDav nano

Emacs, when run in console mode, can be configured to handle Unicode by adding the following Lisp expressions to its configuration file:

File: ~/.emacs
(setq locale-coding-system 'utf-8)
(set-terminal-coding-system 'utf-8)
(set-keyboard-coding-system 'utf-8)
(set-selection-coding-system 'utf-8)
(prefer-coding-system 'utf-8)

Note, however, that the console itself must handle Unicode too.

LaTeX

Merge unicode support for LaTeX with

Code:
emerge dev-tex/latex-unicode

Mutt printing

Mutt should work flawlessly on a Unicode console. If you want pretty-printing, however, you need a few tricks, as a2ps does not support UTF-8. Your best bet may be app-misc/muttprint, which seems to work perfectly in both Unicode and single-byte environments and produces very elegant output. However, it requires LaTeX to be installed on your system.

Emerge the package and put this in your ~/.muttrc

File: ~/.muttrc
set print_command=muttprint

Otherwise you may emerge recode and a2ps:

emerge recode a2ps

and use this in

File: ~/.muttrc
set print_command="recode UTF-8..Latin-1 | a2ps -1 --portrait --borders=no -X latin1 --pretty-print=mail --strip 1 --highlight-level=heavy -P printername"

You may also use u2ps from the gnome-u2ps package, which has native Unicode support. (It is packaged for Debian; whether it is also available in Gentoo is not known.)

Shells

bash

Bash is Unicode-aware since version 3 when used with readline version 5. Both are in Portage.

emerge bash sys-libs/readline
revdep-rebuild --soname libreadline.so.4
rm /lib/libreadline.so.4*

Be sure you know what you are doing when you perform the last step (see the messages from the readline ebuild).

You will also need to have the package gentoolkit installed as it contains the revdep-rebuild tool.
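
Before removing the old library by hand, it may be worth spot-checking that an important binary no longer links against it. The sketch below uses ldd for this; it is only a quick check on one binary and not a replacement for revdep-rebuild, and both function names are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: read "ldd" output on stdin and succeed if it mentions
# the shared-library name given as $1.
mentions_soname() {
    grep -F -q "$1"
}

# Hypothetical helper: succeed if binary $1 links against library name $2.
uses_soname() {
    ldd "$1" 2>/dev/null | mentions_soname "$2"
}

if uses_soname /bin/bash libreadline.so.4; then
    echo "/bin/bash still uses libreadline.so.4; do not delete it yet"
else
    echo "/bin/bash does not link against libreadline.so.4"
fi
```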

Note: the manual deletion of libreadline.so.4 recommended above should be double-checked. On one contributor's system:

# qfile /lib/libreadline.so.4
sys-libs/readline (/lib/libreadline.so.4)
# eix -s readline
sys-libs/readline-5.2_p12-r1

Apparently, libreadline.so.4 belongs to readline-5*. This can be verified further with:

# qlist readline

This article may need a clean-up: it recommends modifying further configuration files in cases where further configuration might not be needed, and much of this should arguably already be handled by /etc/rc.conf and the unicode USE flag. See the Talk/Discussion page for more on these issues.

zsh

Zsh handles UTF-8 perfectly since version 4.3.1. Older versions are not Unicode-aware. They still work as long as you don't use Backspace on Unicode characters. (Backspace deletes part of the UTF-8 character bytewise and confuses zle's assumptions about the cursor position.)
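
The byte-wise deletion can be reproduced outside zsh. In this small sketch (the helper name hex is made up for this example), "ä" is written as its two UTF-8 bytes and then cut down to the first byte only, which is what a byte-wise Backspace leaves behind.

```shell
#!/bin/sh
# Hypothetical helper: hex dump of stdin without spaces.
hex() {
    od -An -tx1 | tr -d ' \n'
}

# "ä" (U+00E4) encoded in UTF-8 is the two bytes 0xC3 0xA4.
printf '\303\244' | wc -c | tr -d ' '                          # prints 2

# Keeping only the first byte leaves a lone lead byte, which is not a valid
# UTF-8 character on its own.
printf '\303\244' | dd bs=1 count=1 2>/dev/null | hex; echo    # prints c3
```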

mc

Mc must be compiled with the sys-libs/slang library for full unicode support.

 emerge gentoolkit
 euse -E slang
 emerge -avDN mc

X

X usually obeys the LC_* environment variables; however, X is picky about how you spell your locale settings. What works in the console may not work in X. You can find a list of all acceptable locale aliases in /usr/lib/X11/locale/locale.alias. As always, case matters. You should also make sure that the locale you choose corresponds to one of the glibc locales listed by "locale -a".

If you're doing advanced troubleshooting you may also be interested in the locale.dir file, in the same directory. It maps locale names to files. Make sure it maps your locale correctly (it usually does).

So to sum it up, the chain goes like this, and all of its links must be intact: LC_* -> locale.alias -> locale.dir -> [X locale definition file]
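
The first link of the chain, the glibc side, can be checked with a short script. One complication is that glibc's "locale -a" typically spells the codeset as "utf8" where users write "UTF-8", so the sketch below normalises both spellings before comparing. Both function names are made up for this example, and the X-side files are not inspected.

```shell
#!/bin/sh
# Hypothetical helper: normalise the codeset part of a locale name so that
# "en_US.UTF-8" and "en_US.utf8" compare equal.
normalise_locale() {
    case $1 in
        *.*) lang=${1%%.*}; codeset=${1#*.} ;;
        *)   echo "$1"; return 0 ;;
    esac
    codeset=$(printf '%s' "$codeset" | tr 'A-Z' 'a-z' | sed 's/-//g')
    echo "$lang.$codeset"
}

# Hypothetical helper: succeed if locale $1 appears in the list on stdin.
locale_available() {
    wanted=$(normalise_locale "$1")
    while IFS= read -r name; do
        [ "$(normalise_locale "$name")" = "$wanted" ] && return 0
    done
    return 1
}

if locale -a 2>/dev/null | locale_available "${LANG:-C}"; then
    echo "${LANG:-C} is listed by locale -a"
else
    echo "${LANG:-C} is not listed by locale -a"
fi
```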

Fluxbox

BUG 1: Fluxbox doesn't fully support Unicode yet. Some of its styles select fonts that are not suitable for Unicode. To fix this, edit the Fluxbox style file(s) in /usr/share/fluxbox/styles and add something like:

File: /usr/share/fluxbox/styles/$YourStyle
window.font:                         -*-*-*-*-*-*-*-*-*-*-*-*-*-u

to at least fix the window title bug.

Solution by user Holms:

Another solution is to set the locale in ~/.xinitrc. For example, I use Cyrillic most of the time. If you put this in your ~/.xinitrc

File: ~/.xinitrc
export LANG="ru_RU.UTF-8"
export LC_ALL="ru_RU.UTF-8"

then all window titles will be in Unicode and your locale will be Russian; set this to the locale of your own country. It may be wiser to use an English locale such as en_US.UTF-8 instead, because otherwise all programs will display everything in your language instead of English. The UTF-8 suffix tells the system which encoding to use by default. Some people also add the same two lines to ~/.bashrc (this did not help in my case). Do not forget to configure your locales in /etc/locale.gen; if you haven't configured it yet, read about locales in the Gentoo Handbook. If this doesn't help, read HOWTO_Xorg_and_Fonts and do everything described in its "Emerging the necessary packages" section, which is what helped in my case.


BUG 2 Fluxbox takes very long to load on a utf-8 locale http://bugs.gentoo.org/show_bug.cgi?id=71747

patch for fluxbox-0.9.11 here: http://www.fluxmod.org.ua/

(patch has been merged with mainline as of 0.9.14)

OpenOffice.org

To force OpenOffice.org to use UTF-8 (otherwise you'll have problems when entering Unicode characters), you have to set the LANGUAGE variable to an appropriate value:

File: /etc/env.d/02locale
LANG="de_DE.UTF-8"
...a lot of LC-Variables...
# For OpenOffice.org
LANGUAGE="en_GB:en"

Don't forget to run env-update && source /etc/profile after changing files in /etc/env.d/. You may need to log in again to apply the changes to your current environment.

Links