Kiyokazu in (hopefully) hacker mode (multibyte character handling library)

Functions to handle multi-octet character strings

libiso2mb-0.8.5.tar.gz (1236KB, 2000-12-07 02:11:05)

is a gzipped tarball of a collection of functions for handling sequences of characters that consist of multiple octets. It includes a character encoding conversion tool that was initially written for debugging this library. Despite that initial intention, I believe it is a very useful tool.

The gzipped unified diffs, the latest development version, and the log of changes

libiso2mb-0.8.5-ChangeLog.txt (13KB, 2000-12-07 02:15:34)

are also available.

The main function of the library is to convert the encoding of characters from ISO 2022 to "fake" UTF-8, and vice versa.

Requirement

To build and install this library, you need a C compiler and libraries conforming to the ANSI standard. In addition:

the "int" of your C compiler must be at least 32 bits wide,
your stdio library must provide the functions "fileno()" and "fdopen()",
if you are going to use the included Makefile, you need GNU Make, GNU fileutils, GNU binutils and a GNU C that supports shared objects.

I strongly recommend using GNU C and GNU Make.
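
The 32-bit requirement on "int" can be checked before building. The following is a minimal sketch and not part of the library; the helper name int_width_bits is made up for illustration.

```c
#include <assert.h>
#include <limits.h>

/* Stop the build early if int cannot hold 32-bit values. */
#if INT_MAX < 2147483647
#error "int must be at least 32 bits wide"
#endif

/* Hypothetical helper: storage width of int in bits.
   The assert is the run-time counterpart of the check above. */
static int int_width_bits(void)
{
    int w = (int)(sizeof(int) * CHAR_BIT);
    assert(w >= 32);
    return w;
}
```

Compiling this file with the compiler you intend to use is enough; the #error line stops the build on an unsuitable platform.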

If you build with the included Makefile, you need to tell your dynamic linker the directory (/usr/local/lib) in which the shared library is installed.

If you are installing on a Linux box, for example, add the line

/usr/local/lib

to the file /etc/ld.so.conf unless it already contains such a line, and then issue the command

/sbin/ldconfig

Acceptable ISO 2022 encoding

This library can handle the following subset of the ISO 2022 escape sequences:

designating an ISO 2022 registered character set on an intermediate buffer,
designating UTF-8,
returning from UTF-8,
locking shift,
7-bit single shift by "\x1B\x4E" or "\x1B\x4F",
8-bit single shift by "\x8E" or "\x8F".

In addition, it can handle the following non-ISO 2022 encodings:

Shift_JIS,
UTF-8,
Big Five,
EUC-tw,
Johab,
Unified Hangul,
KOI8-R,
KOI8-U,
Microsoft Windows Codepages 1250 -- 1258.

Fake UTF-8 encoding

The use of the following bit patterns in the UTF-8 encoding is strongly discouraged, because they are overlong (non-shortest) forms:

1100000x 10xxxxxx
11100000 100xxxxx 10xxxxxx
11110000 1000xxxx 10xxxxxx 10xxxxxx
11111000 10000xxx 10xxxxxx 10xxxxxx 10xxxxxx
11111100 100000xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The ISO 2022 encoded characters in input streams are embedded into these patterns.
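
The five patterns above can be recognized mechanically. The following is a minimal sketch, not the library's own code; the function name fake_utf8_decode and its signature are assumptions made for illustration. It returns the bits marked "x" as one integer.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not the library's API.
 * If p[0..n) starts with one of the five patterns listed above, store the
 * bits marked "x" in *payload and return the sequence length (2..6).
 * Otherwise return 0.
 */
static int fake_utf8_decode(const unsigned char *p, size_t n,
                            unsigned long *payload)
{
    /* Fixed bits of the 1st and 2nd octets for lengths 2..6. */
    static const struct {
        unsigned char lead, lead_mask, second_mask;
    } pat[5] = {
        { 0xC0, 0xFE, 0xC0 },   /* 1100000x 10xxxxxx     */
        { 0xE0, 0xFF, 0xE0 },   /* 11100000 100xxxxx ... */
        { 0xF0, 0xFF, 0xF0 },   /* 11110000 1000xxxx ... */
        { 0xF8, 0xFF, 0xF8 },   /* 11111000 10000xxx ... */
        { 0xFC, 0xFF, 0xFC },   /* 11111100 100000xx ... */
    };
    int i;

    assert(p != NULL && payload != NULL);
    for (i = 0; i < 5; i++) {
        size_t len = (size_t)i + 2, k;
        unsigned long v;

        if (n < len || (p[0] & pat[i].lead_mask) != pat[i].lead)
            continue;
        if ((p[1] & pat[i].second_mask) != 0x80)
            continue;           /* 2nd octet is not in the overlong range */
        v = p[0] & ~pat[i].lead_mask & 0xFF;
        v = (v << 6) | (p[1] & ~pat[i].second_mask & 0xFF);
        for (k = 2; k < len; k++) {
            if ((p[k] & 0xC0) != 0x80)
                break;          /* not a continuation octet */
            v = (v << 6) | (p[k] & 0x3F);
        }
        if (k < len)
            continue;
        *payload = v;
        return (int)len;
    }
    return 0;
}
```

For example, the octets C1 81 match the first pattern and carry the 7-bit payload 0x41, while the ordinary UTF-8 sequence C3 A9 matches none of the patterns.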

Mapping between ISO 2022 encoding and "fake" UTF-8

There are 2^26 + 2^21 + 2^16 + 2^11 + 2^7 patterns in the "fake" UTF-8. For each combination of a character and its designated intermediate buffer, we first calculate an integer as described below, and then assign a bit pattern in the "fake" UTF-8 to that integer. Smaller integers are assigned shorter bit patterns.

We adopt the following notation.
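
The text does not spell out exactly how an integer is placed into a pattern. The following sketch shows one plausible reading and is an assumption, not the library's confirmed behavior: the integer is stored unchanged in the "x" bits of the shortest pattern that has enough of them (7, 11, 16, 21 or 26 bits). The function name fake_utf8_encode is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not the library's API.
 * ASSUMPTION: the integer is stored as-is, with no offset, in the shortest
 * pattern whose payload is wide enough.
 * Writes 2..6 octets to out and returns the length, or 0 if v needs more
 * than 26 bits.
 */
static int fake_utf8_encode(unsigned long v, unsigned char out[6])
{
    static const int bits[5] = { 7, 11, 16, 21, 26 };
    static const unsigned char lead[5] = { 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
    int i, k, len;

    assert(out != NULL);
    for (i = 0; i < 5; i++)
        if (v < (1UL << bits[i]))
            break;
    if (i == 5)
        return 0;               /* does not fit into any pattern */

    len = i + 2;
    /* Fill the trailing octets from the least significant bits upward. */
    for (k = len - 1; k >= 1; k--) {
        out[k] = (unsigned char)(0x80 | (v & 0x3F));
        v >>= 6;
    }
    /* Only the 2-octet pattern has a payload bit left for the 1st octet. */
    out[0] = (unsigned char)(lead[i] | v);
    return len;
}
```

Under this assumption 0x41 becomes C1 81 and 0x7FF becomes E0 9F BF; the next integer, 0x800, already needs the 4-octet pattern F0 80 A0 80.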

gggg

is a sequence of 4 bits representing the intermediate buffer and the shift state.

fffff

is the least significant 5 bits of the final octet F of the designating escape sequence. For encodings which are neither ISO 2022 nor UTF-8, we use the following "pseudo" final octets.

Shift_JIS (JIS X 0201 and 0208 corresponding part): 0x40.
Shift_JIS (JIS X 0213 corresponding part): 0x41.
Big Five (with 2nd octets smaller than 0x7F): 0x42.
Big Five (with 2nd octets larger than 0xA0): 0x43.
Johab: 0x44 -- 0x46.
KOI8-R, KOI8-U, Microsoft Windows Codepages 1250 -- 1258: share 0x5E.

c[N]

is a sequence of N bits representing an integer C calculated from each character code.

With this notation, the integer is composed as follows.

94 * 94 set

Let C1 = ((<1st octet> & 0x7F) - 0x20) * 0x60 + (<2nd octet> & 0x7F) - 0x20. If C1 is smaller than 0x400, then put C = C1, otherwise put C = C1 - 0x400.

In case C1 >= 0x400: the integer is ggggfffffc[13]1.
In case C1 < 0x400: the integer is ggggfffffc[10]10.
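
As an illustration of the two cases above, the following sketch composes the integer for a character of a 94 * 94 set. It is not the library's code: the function name compose_94x94 is hypothetical, and since the encoding of gggg is not detailed here, the caller passes it in as a plain 4-bit value.

```c
#include <assert.h>

/*
 * Hypothetical helper, not the library's API.
 * g       : 4-bit value for the intermediate buffer and shift state
 * f_octet : final octet F of the designating escape sequence
 * o1, o2  : 1st and 2nd octets of the character
 */
static unsigned long compose_94x94(unsigned g, unsigned f_octet,
                                   unsigned o1, unsigned o2)
{
    unsigned long f = f_octet & 0x1F;           /* fffff */
    unsigned long c1;

    assert(g < 16);
    assert((o1 & 0x7F) >= 0x21 && (o1 & 0x7F) <= 0x7E);
    assert((o2 & 0x7F) >= 0x21 && (o2 & 0x7F) <= 0x7E);

    c1 = ((o1 & 0x7F) - 0x20) * 0x60UL + (o2 & 0x7F) - 0x20;
    if (c1 >= 0x400)    /* ggggfffffc[13]1 with C = C1 - 0x400 */
        return ((unsigned long)g << 19) | (f << 14) | ((c1 - 0x400) << 1) | 1UL;
    /* ggggfffffc[10]10 with C = C1 */
    return ((unsigned long)g << 17) | (f << 12) | (c1 << 2) | 2UL;
}
```

For instance, with F = 0x42 and gggg = 0, the octets 0x30 0x21 give C1 = 0x601 and the integer 0x8403, while 0x21 0x21 give C1 = 0x61 and the integer 0x2186.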

Shift_JIS

If the character corresponds to a JIS X 0201 katakana, let C = (<character code> & 0x7F) - 0x20. If the character corresponds to a JIS X 0208 or 0213 kanji, let c1 and c2 be the 1st and 2nd octets of the JIS X 0208 or 0213 kanji, then put C1 = ((c1 & 0x7F) - 0x20) * 0x60 + (c2 & 0x7F) - 0x20. If C1 is smaller than 0x400, then put C = C1, otherwise put C = C1 - 0x400.

In case C1 >= 0x400: the integer is fffffc[13]100.
In case C1 < 0x400: the integer is fffffc[10]1000.
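
A sketch of the kanji case for the JIS X 0208 part (pseudo final octet 0x40) follows. It is an illustration only: the function name compose_sjis_0208 is hypothetical, the conversion from Shift_JIS octets to JIS X 0208 octets is the commonly used arithmetic one, and the katakana and JIS X 0213 cases are left out.

```c
#include <assert.h>

/*
 * Hypothetical helper, not the library's API.
 * s1, s2: 1st and 2nd octets of a two-octet Shift_JIS character in the
 * JIS X 0208 part.
 */
static unsigned long compose_sjis_0208(unsigned s1, unsigned s2)
{
    unsigned long f = 0x40 & 0x1F;      /* pseudo final octet 0x40 */
    unsigned long c1;
    unsigned j1, j2;

    assert((s1 >= 0x81 && s1 <= 0x9F) || (s1 >= 0xE0 && s1 <= 0xEF));
    assert(s2 >= 0x40 && s2 <= 0xFC && s2 != 0x7F);

    /* Usual Shift_JIS to JIS X 0208 conversion. */
    if (s1 >= 0xE0)
        s1 -= 0x40;
    j1 = (s1 - 0x81) * 2 + 0x21;
    if (s2 >= 0x9F) {
        j1 += 1;
        j2 = s2 - 0x7E;
    } else {
        j2 = s2 - 0x1F - (s2 >= 0x80 ? 1 : 0);
    }

    c1 = (j1 - 0x20) * 0x60UL + (j2 - 0x20);
    if (c1 >= 0x400)    /* fffffc[13]100 with C = C1 - 0x400 */
        return (f << 16) | ((c1 - 0x400) << 3) | 4UL;
    /* fffffc[10]1000 with C = C1 */
    return (f << 14) | (c1 << 4) | 8UL;
}
```

For instance, Shift_JIS 0x88 0x9F corresponds to JIS X 0208 0x30 0x21, so C1 = 0x601 and the integer is 0x100C.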

Big Five

Let c1 and c2 be the 1st and 2nd octets, and put C1 = ((c1 & 0x7F) - 0x20) * 0x60 + (c2 & 0x7F) - 0x20. If C1 is smaller than 0x400, then put C = C1, otherwise put C = C1 - 0x400.

In case C1 >= 0x400: the integer is fffffc[13]100.
In case C1 < 0x400: the integer is fffffc[10]1000.
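
The Big Five rule, together with the choice between the pseudo final octets 0x42 and 0x43, can be sketched as follows. The function name compose_big5 is hypothetical, and the accepted lead octet range (0xA1 to 0xF9) is an assumption of this sketch.

```c
#include <assert.h>

/*
 * Hypothetical helper, not the library's API.
 * o1, o2: 1st and 2nd octets of a Big Five character.
 */
static unsigned long compose_big5(unsigned o1, unsigned o2)
{
    unsigned long f, c1;

    assert(o1 >= 0xA1 && o1 <= 0xF9);   /* ASSUMPTION: plain Big Five range */
    assert((o2 >= 0x40 && o2 <= 0x7E) || (o2 >= 0xA1 && o2 <= 0xFE));

    /* Pseudo final octet: 0x42 for 2nd octets below 0x7F, else 0x43. */
    f = (o2 < 0x7F ? 0x42 : 0x43) & 0x1F;

    c1 = ((o1 & 0x7F) - 0x20) * 0x60UL + (o2 & 0x7F) - 0x20;
    if (c1 >= 0x400)    /* fffffc[13]100 with C = C1 - 0x400 */
        return (f << 16) | ((c1 - 0x400) << 3) | 4UL;
    /* fffffc[10]1000 with C = C1 */
    return (f << 14) | (c1 << 4) | 8UL;
}
```

For instance, 0xA4 0x40 gives C1 = 0x1A0 and the integer 0x9A08, while 0xB0 0x40 gives C1 = 0x620 and the integer 0x21104.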

Johab

Let c1 and c2 be the 1st and 2nd octets.

In case c1 >= 0x84 and c1 <= 0xD3 and c2 <= 0x7E: C1 = (c1 - 0x84) * 0x60 + c2 - 0x41 and FC = 0x44,
in case c1 >= 0x84 and c1 <= 0xD3 and c2 >= 0x81 and c2 <= 0xA0: C1 = (c1 - 0x84) * 0x60 + c2 - 0x43 and FC = 0x44,
in case c1 >= 0x84 and c1 <= 0xD3 and c2 >= 0xA1 and c2 <= 0xFE: C1 = (c1 - 0x84) * 0x60 + c2 - 0xA0 and FC = 0x45,
in case c1 >= 0xD8 and c1 <= 0xDE and c2 >= 0x31 and c2 <= 0x7E: C1 = (c1 - 0xD7) * 0x60 + c2 - 0x30 and FC = 0x46,
in case c1 >= 0xD8 and c1 <= 0xDE and c2 >= 0x91 and c2 <= 0xA0: C1 = (c1 - 0xD7) * 0x60 + c2 - 0x42 and FC = 0x46,
in case c1 >= 0xD8 and c1 <= 0xDE and c2 >= 0xA1 and c2 <= 0xFE: C1 = (c1 - 0xB6) * 0x60 + c2 - 0xA0 and FC = 0x46,
in case c1 >= 0xE0 and c1 <= 0xF9 and c2 >= 0x31 and c2 <= 0x7E: C1 = (c1 - 0xD8) * 0x60 + c2 - 0x30 and FC = 0x46,
in case c1 >= 0xE0 and c1 <= 0xF9 and c2 >= 0x91 and c2 <= 0xA0: C1 = (c1 - 0xD8) * 0x60 + c2 - 0x42 and FC = 0x46,
in case c1 >= 0xE0 and c1 <= 0xF9 and c2 >= 0xA1 and c2 <= 0xFE: C1 = (c1 - 0xB7) * 0x60 + c2 - 0xA0 and FC = 0x46.

If C1 is smaller than 0x400, then put C = C1, otherwise put C = C1 - 0x400.

In case C1 >= 0x400: the integer is fffffc[13]100.
In case C1 < 0x400: the integer is fffffc[10]1000.
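
The Johab cases above translate naturally into a small table. The following sketch copies the ranges and offsets from the list as they stand; the function name compose_johab is hypothetical, and the lower bound 0x41 for c2 in the first case is an assumption, since the list gives only the upper bound.

```c
#include <assert.h>
#include <stddef.h>

/* One row per case: ranges of c1 and c2, the two offsets, and FC. */
struct johab_rule {
    unsigned char lo1, hi1, lo2, hi2, off1, off2, fc;
};

static const struct johab_rule johab_rules[] = {
    { 0x84, 0xD3, 0x41, 0x7E, 0x84, 0x41, 0x44 },  /* lo2 is an ASSUMPTION */
    { 0x84, 0xD3, 0x81, 0xA0, 0x84, 0x43, 0x44 },
    { 0x84, 0xD3, 0xA1, 0xFE, 0x84, 0xA0, 0x45 },
    { 0xD8, 0xDE, 0x31, 0x7E, 0xD7, 0x30, 0x46 },
    { 0xD8, 0xDE, 0x91, 0xA0, 0xD7, 0x42, 0x46 },
    { 0xD8, 0xDE, 0xA1, 0xFE, 0xB6, 0xA0, 0x46 },
    { 0xE0, 0xF9, 0x31, 0x7E, 0xD8, 0x30, 0x46 },
    { 0xE0, 0xF9, 0x91, 0xA0, 0xD8, 0x42, 0x46 },
    { 0xE0, 0xF9, 0xA1, 0xFE, 0xB7, 0xA0, 0x46 },
};

/*
 * Hypothetical helper, not the library's API.
 * Returns the composed integer, or 0 if (c1, c2) matches no case; 0 is
 * never a valid result because the low tag bits are always non-zero.
 */
static unsigned long compose_johab(unsigned c1, unsigned c2)
{
    size_t i;

    assert(c1 <= 0xFF && c2 <= 0xFF);
    for (i = 0; i < sizeof johab_rules / sizeof johab_rules[0]; i++) {
        const struct johab_rule *r = &johab_rules[i];
        unsigned long f, big;

        if (c1 < r->lo1 || c1 > r->hi1 || c2 < r->lo2 || c2 > r->hi2)
            continue;
        f = r->fc & 0x1F;
        big = (unsigned long)(c1 - r->off1) * 0x60 + (c2 - r->off2);
        if (big >= 0x400)   /* fffffc[13]100 with C = C1 - 0x400 */
            return (f << 16) | ((big - 0x400) << 3) | 4UL;
        return (f << 14) | (big << 4) | 8UL;    /* fffffc[10]1000 */
    }
    return 0;
}
```

For instance, 0x84 0x41 gives C1 = 0 with FC = 0x44 and the integer 0x10008, while 0x90 0x41 gives C1 = 0x480 and the integer 0x40404.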

KOI8-R, KOI8-U, Microsoft Windows Codepages 1250 -- 1258

First assign a non-negative integer c1 to each character set in order, and let c2 be the least significant 7 bits of a character in the right half of these character sets. Put C1 = c1 * 0x80 + c2. If C1 is smaller than 0x400, then put C = C1, otherwise put C = C1 - 0x400.

In case C1 >= 0x400: the integer is fffffc[13]100.
In case C1 < 0x400: the integer is fffffc[10]1000.
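
A sketch of this rule follows. The function name compose_8bit is hypothetical, and because the order in which the character sets are numbered is not listed here, the set number c1 is left to the caller instead of being guessed.

```c
#include <assert.h>

/*
 * Hypothetical helper, not the library's API.
 * set_index : the integer c1 assigned to the character set (caller's choice)
 * octet     : a character from the right half (0x80..0xFF) of that set
 */
static unsigned long compose_8bit(unsigned set_index, unsigned octet)
{
    unsigned long f = 0x5E & 0x1F;      /* shared pseudo final octet 0x5E */
    unsigned long c1;

    assert(octet >= 0x80 && octet <= 0xFF);
    assert(set_index < 0x48);           /* keeps C within 13 bits */

    c1 = set_index * 0x80UL + (octet & 0x7F);
    if (c1 >= 0x400)    /* fffffc[13]100 with C = C1 - 0x400 */
        return (f << 16) | ((c1 - 0x400) << 3) | 4UL;
    /* fffffc[10]1000 with C = C1 */
    return (f << 14) | (c1 << 4) | 8UL;
}
```

With set number 0, the octet 0xC1 gives C1 = 0x41 and the integer 0x78418. Set numbers from 8 upward fall into the C1 >= 0x400 case.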

96 set

Let C1 = (<character code> & 0x7F) - 0x20.

In case F >= 0x60: the integer is ggggfffffc[13]1, with C = C1 + 0x5F * 0x60.
In case F < 0x60: the integer is ggggfffffc[10]10, with C = C1.

94 set

With C = (<character code> & 0x7F) - 0x20, the integer is ggggfffffc[7]bb0000, where bb is a sequence of 2 bits defined as follows:

in case designated by a 3-octet sequence and F < 0x60, bb = 00,
in case designated by a 3-octet sequence and F >= 0x60, bb = 01,
in case designated by a 4-octet sequence and F < 0x60, bb = 10,
in case designated by a 4-octet sequence and F >= 0x60, bb = 11.
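
The composition for a 94 set can be sketched as follows. The function name compose_94 and its parameters are hypothetical; as before, gggg is passed in as a plain 4-bit value.

```c
#include <assert.h>

/*
 * Hypothetical helper, not the library's API.
 * g              : 4-bit value for the intermediate buffer and shift state
 * f_octet        : final octet F of the designating escape sequence
 * designator_len : length of that escape sequence in octets (3 or 4)
 * code           : the character code
 */
static unsigned long compose_94(unsigned g, unsigned f_octet,
                                int designator_len, unsigned code)
{
    unsigned long f, c, bb;

    assert(g < 16);
    assert(designator_len == 3 || designator_len == 4);
    assert((code & 0x7F) >= 0x20);

    f = f_octet & 0x1F;                         /* fffff */
    c = (code & 0x7F) - 0x20;                   /* c[7]  */
    /* High bit of bb: 4-octet designation; low bit: F >= 0x60. */
    bb = (designator_len == 4 ? 2UL : 0UL) | (f_octet >= 0x60 ? 1UL : 0UL);

    /* ggggfffffc[7]bb0000 */
    return ((unsigned long)g << 18) | (f << 13) | (c << 6) | (bb << 4);
}
```

For instance, a set designated by the 3-octet sequence ESC ( B (F = 0x42) with gggg = 0 maps the character code 0x41 to the integer 0x4840.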

Implementation note

This library associates each I/O stream with an automaton.

The automaton associated with an input stream reads the stream octet by octet via a function that is given to the automaton at initialization as its only access method to the stream.

After it encounters an ISO 2022 escape sequence, it records the intermediate buffers invoked to GL and GR, and the character sets designated on G0, G1, G2 and G3. Using this information, it determines how many octets compose a character and how to convert the character to "fake" UTF-8.
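
To make the idea concrete, here is a much reduced sketch of such an input automaton. It is not the library's code and every name in it is hypothetical. It tracks only 94-set designations (ESC followed by "(", ")", "*" or "+" and a final octet) and the locking shifts SO and SI; the initial state is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* The only access method to the stream: returns 0..255, or -1 at the end. */
typedef int (*read_octet_fn)(void *stream);

struct in_automaton {
    read_octet_fn read;
    void *stream;
    unsigned char final[4]; /* final octet designated on G0..G3, 0 if none */
    int gl;                 /* which of G0..G3 is invoked to GL */
};

static void in_init(struct in_automaton *a, read_octet_fn read, void *stream)
{
    assert(a != NULL && read != NULL);
    a->read = read;
    a->stream = stream;
    a->final[0] = 0x42;     /* ASSUMPTION: final octet 0x42 on G0 at start */
    a->final[1] = a->final[2] = a->final[3] = 0;
    a->gl = 0;
}

/* Consume shifts and designations; return the next other octet, or -1. */
static int in_next(struct in_automaton *a)
{
    for (;;) {
        int c = a->read(a->stream);
        if (c < 0) return -1;
        if (c == 0x0E) { a->gl = 1; continue; }  /* SO: G1 into GL */
        if (c == 0x0F) { a->gl = 0; continue; }  /* SI: G0 into GL */
        if (c == 0x1B) {
            int i = a->read(a->stream);
            if (i < 0) return -1;
            if (i >= 0x28 && i <= 0x2B) {        /* 94 set on G0..G3 */
                int f = a->read(a->stream);
                if (f < 0) return -1;
                a->final[i - 0x28] = (unsigned char)f;
            }
            continue;   /* other escape sequences are not handled here */
        }
        return c;
    }
}

/* In-memory stream for trying the automaton out. */
struct mem_stream { const unsigned char *data; size_t len, pos; };

static int mem_read(void *stream)
{
    struct mem_stream *s = stream;
    return s->pos < s->len ? s->data[s->pos++] : -1;
}
```

Because the automaton touches the stream only through the read callback, the same code works for a FILE, a socket or, as here, a memory buffer.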

The automaton associated with an output stream writes a sequence of octets to the stream via a function that is given to the automaton at initialization as its only access method to the stream.

It decomposes "fake" UTF-8 into the original character code, the character set to which the character belongs, the intermediate buffer on which the character set is designated, and the shift state, that is, whether the intermediate buffer is invoked to GL or to GR. It then compares this information against its internal state and determines which kind of escape sequence should be written to the stream.
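
The comparison against the internal state can be sketched as follows. This is a much reduced illustration with hypothetical names, not the library's code: it starts after the decomposition step, knows only 94 sets designated on G0, and writes the designation ESC ( F only when the requested set differs from the recorded one.

```c
#include <assert.h>
#include <stddef.h>

/* The only access method to the stream: returns 0 on success, -1 on error. */
typedef int (*write_octet_fn)(void *stream, int octet);

struct out_automaton {
    write_octet_fn write;
    void *stream;
    unsigned char g0_final; /* final octet designated on G0, 0 if none yet */
};

static void out_init(struct out_automaton *a, write_octet_fn write,
                     void *stream)
{
    assert(a != NULL && write != NULL);
    a->write = write;
    a->stream = stream;
    a->g0_final = 0;        /* ASSUMPTION: nothing designated at start */
}

/* Write one character of the 94 set with final octet `final`. */
static int out_put94(struct out_automaton *a, unsigned char final,
                     unsigned char code)
{
    if (a->g0_final != final) {
        /* State differs: designate the set on G0 with ESC ( F first. */
        if (a->write(a->stream, 0x1B) < 0 ||
            a->write(a->stream, 0x28) < 0 ||
            a->write(a->stream, final) < 0)
            return -1;
        a->g0_final = final;
    }
    return a->write(a->stream, code);
}

/* In-memory sink for trying the automaton out. */
struct mem_sink { unsigned char buf[64]; size_t len; };

static int mem_write(void *stream, int octet)
{
    struct mem_sink *s = stream;
    if (s->len >= sizeof s->buf)
        return -1;
    s->buf[s->len++] = (unsigned char)octet;
    return 0;
}
```

Writing two characters of the same set in a row produces only one designation, so the output stays as short as the recorded state allows.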

Any questions or comments about this page are greatly appreciated.

Almost all contents of this site are written by Kiyokazu SUTO (i.e. me) unless otherwise noted. I want to put all of them into the PUBLIC DOMAIN, even though some lawyers mention that it is impossible in my country.