
Functions to handle multi-octet character strings


libiso2mb-0.8.5.tar.gz (1236KB, 2000-12-07 02:11:05)
is a gzipped tarball of a collection of functions for handling sequences of characters that consist of multiple octets. It includes a character encoding conversion tool that was initially written for debugging this library. Despite that initial intention, I believe it is a very useful tool in its own right.

The gzipped unified diffs, the latest development version, and the log of changes

libiso2mb-0.8.5-ChangeLog.txt (13KB, 2000-12-07 02:15:34)
are also available.

The main function of this library is to convert the encoding of characters from ISO 2022 to "fake" UTF-8, and vice versa.


Requirement

To build and install this library, you need a C compiler and libraries conforming to the ANSI standard. Further, I strongly recommend using GNU C and GNU Make.

If you build with the included Makefile, you need to tell your dynamic linker which directory (/usr/local/lib) the shared library is installed in.

If you are installing on a Linux box, for example, add the line

/usr/local/lib
to the file /etc/ld.so.conf unless it already contains that line, and then issue the command
/sbin/ldconfig


Acceptable ISO 2022 encoding

This library can handle the following subset of the ISO 2022 escape sequences:

Further it can handle the following non-ISO 2022 encodings:


Fake UTF-8 encoding

Use of the following bit patterns in the UTF-8 encoding is strongly discouraged:

The ISO 2022 encoded characters in input streams are embedded into these patterns.


Mapping between ISO 2022 encoding and "fake" UTF-8

There are 2^26 + 2^21 + 2^16 + 2^11 + 2^7 patterns in the "fake" UTF-8. For each combination of a character and the intermediate buffer on which its character set is designated, we first calculate an integer as described below, and then assign a bit pattern in the "fake" UTF-8 to that integer. Smaller integers are assigned shorter bit patterns.

We adopt the following notation.

gggg
is a sequence of 4 bits representing the intermediate buffer and the shift state.
fffff
is the least significant 5 bits of the final octet F of the designating escape sequence. For encodings which are neither ISO 2022 nor UTF-8, we use the following "pseudo" final octets.
Shift_JIS (JIS X 0201 and 0208 corresponding part)
0x40.
Shift_JIS (JIS X 0213 corresponding part)
0x41.
Big Five (with 2nd octets smaller than 0x7F)
0x42.
Big Five (with 2nd octets larger than 0xA0)
0x43.
Johab
0x44 -- 0x46.
KOI8-R, KOI8-U, Microsoft Windows Codepages 1250 -- 1258
share 0x5E.
c[N]
is a sequence of N bits representing an integer C calculated from each character code.
With this notation, the integer is composed as follows.
94 * 94 set
Let C1 = ((<1st octet> & 0x7F) - 0x20) * 0x60 + (<2nd octet> & 0x7F) - 0x20. If C1 is smaller than 0x400, then put C = C1, otherwise put C = C1 - 0x400.
In case C1 >= 0x400
the integer is ggggfffffc[13]1.
In case C1 < 0x400
the integer is ggggfffffc[10]10.
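
The 94 * 94 rule can be sketched in C roughly as follows. This is only an illustration: the function name compose_94x94 and its parameters are hypothetical and are not taken from the library, and the sketch assumes that a notation such as ggggfffffc[13]1 lists the bit fields from the most significant bit down to the least significant bit.

```c
#include <stdint.h>

/* Hypothetical helper, not part of libiso2mb.
 *   gggg   : 4-bit value for the intermediate buffer and the shift state
 *   final  : final octet F of the designating escape sequence
 *   o1, o2 : 1st and 2nd octets of the character (0x21..0x7E after masking)
 * Assumption: the bit fields are concatenated most significant first. */
static uint32_t compose_94x94(unsigned gggg, unsigned final,
                              unsigned o1, unsigned o2)
{
    uint32_t g = gggg & 0x0Fu;
    uint32_t f = final & 0x1Fu;                      /* fffff */
    uint32_t c1 = ((o1 & 0x7Fu) - 0x20u) * 0x60u + (o2 & 0x7Fu) - 0x20u;

    if (c1 >= 0x400u) {
        /* ggggfffffc[13]1 with C = C1 - 0x400 */
        return (g << 19) | (f << 14) | ((c1 - 0x400u) << 1) | 0x1u;
    }
    /* ggggfffffc[10]10 with C = C1 */
    return (g << 17) | (f << 12) | (c1 << 2) | 0x2u;
}
```

For instance, with gggg = 0 and F = 0x42, the octets 0x30 0x21 give C1 = 0x601 and C = 0x201, so the integer is 0x8403 under the assumed bit order.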
Shift_JIS
If the character is a JIS X 0201 katakana, let C = (<character code> & 0x7F) - 0x20. If the character is a JIS X 0208 or 0213 kanji, let c1 and c2 be the 1st and 2nd octets of its JIS X 0208 or 0213 code, and put C1 = ((c1 & 0x7F) - 0x20) * 0x60 + (c2 & 0x7F) - 0x20. If C1 is smaller than 0x400, put C = C1; otherwise put C = C1 - 0x400.
In case C1 >= 0x400
the integer is fffffc[13]100.
In case C1 < 0x400
the integer is fffffc[10]1000.
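
As a rough sketch of the kanji case, the following C function first converts a Shift_JIS double-octet code to its JIS X 0208 code with the commonly used conversion and then applies the rule above. The name compose_sjis_0208 is hypothetical; the sketch covers only the JIS X 0208 part (pseudo final octet 0x40), leaves out JIS X 0201 katakana and JIS X 0213, and assumes that the bit fields are written most significant first.

```c
#include <stdint.h>

/* Hypothetical helper, not part of libiso2mb.
 * s1 must be in 0x81..0x9F or 0xE0..0xEF, s2 in 0x40..0x7E or 0x80..0xFC.
 * Assumption: the bit fields are concatenated most significant first. */
static uint32_t compose_sjis_0208(unsigned s1, unsigned s2)
{
    const uint32_t f = 0x40u & 0x1Fu;     /* fffff of pseudo final octet 0x40 */
    unsigned row, j1, j2;
    uint32_t c1;

    /* Commonly used Shift_JIS to JIS X 0208 conversion. */
    row = (s1 >= 0xE0u) ? s1 - 0xC1u : s1 - 0x81u;
    j1 = row * 2u + 0x21u;
    if (s2 >= 0x80u)
        s2 -= 1u;                         /* skip the unused octet 0x7F */
    if (s2 >= 0x9Eu) {
        j1 += 1u;                         /* second row of the pair */
        j2 = s2 - 0x9Eu + 0x21u;
    } else {
        j2 = s2 - 0x40u + 0x21u;
    }

    c1 = ((j1 & 0x7Fu) - 0x20u) * 0x60u + (j2 & 0x7Fu) - 0x20u;
    if (c1 >= 0x400u)
        return (f << 16) | ((c1 - 0x400u) << 3) | 0x4u;   /* fffffc[13]100 */
    return (f << 14) | (c1 << 4) | 0x8u;                  /* fffffc[10]1000 */
}
```

For example, Shift_JIS 0x889F corresponds to JIS X 0208 0x3021, which gives C1 = 0x601, C = 0x201, and the integer 0x100C under these assumptions.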
Big Five
Let c1 and c2 be the 1st and 2nd octets, and put C1 = ((c1 & 0x7F) - 0x20) * 0x60 + (c2 & 0x7F) - 0x20. If C1 is smaller than 0x400, then put C = C1, otherwise put C = C1 - 0x400.
In case C1 >= 0x400
the integer is fffffc[13]100.
In case C1 < 0x400
the integer is fffffc[10]1000.
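
A minimal C sketch of the Big Five rule follows. The function name compose_big5 is hypothetical, the choice between the pseudo final octets 0x42 and 0x43 follows the list of pseudo final octets given earlier, and the bit fields are again assumed to be written most significant first.

```c
#include <stdint.h>

/* Hypothetical helper, not part of libiso2mb.
 * c1, c2: 1st and 2nd octets of a Big Five character.
 * Assumption: the bit fields are concatenated most significant first. */
static uint32_t compose_big5(unsigned c1, unsigned c2)
{
    /* Pseudo final octet: 0x42 for 2nd octets smaller than 0x7F,
     * 0x43 for 2nd octets larger than 0xA0. */
    uint32_t f = ((c2 < 0x7Fu) ? 0x42u : 0x43u) & 0x1Fu;
    uint32_t n = ((c1 & 0x7Fu) - 0x20u) * 0x60u + (c2 & 0x7Fu) - 0x20u;

    if (n >= 0x400u)
        return (f << 16) | ((n - 0x400u) << 3) | 0x4u;    /* fffffc[13]100 */
    return (f << 14) | (n << 4) | 0x8u;                   /* fffffc[10]1000 */
}
```

With these assumptions, the octets 0xA4 0x40 give C1 = 0x1A0 and the integer 0x9A08, while 0xB0 0x40 give C1 = 0x620, C = 0x220, and the integer 0x21104.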
Johab
Let c1 and c2 be the 1st and 2nd octets, and let FC be the pseudo final octet.
In case c1 >= 0x84 and c1 <= 0xD3 and c2 <= 0x7E,
C1 = (c1 - 0x84) * 0x60 + c2 - 0x41 and FC = 0x44,
in case c1 >= 0x84 and c1 <= 0xD3 and c2 >= 0x81 and c2 <= 0xA0,
C1 = (c1 - 0x84) * 0x60 + c2 - 0x43 and FC = 0x44,
in case c1 >= 0x84 and c1 <= 0xD3 and c2 >= 0xA1 and c2 <= 0xFE,
C1 = (c1 - 0x84) * 0x60 + c2 - 0xA0 and FC = 0x45,
in case c1 >= 0xD8 and c1 <= 0xDE and c2 >= 0x31 and c2 <= 0x7E,
C1 = (c1 - 0xD7) * 0x60 + c2 - 0x30 and FC = 0x46,
in case c1 >= 0xD8 and c1 <= 0xDE and c2 >= 0x91 and c2 <= 0xA0,
C1 = (c1 - 0xD7) * 0x60 + c2 - 0x42 and FC = 0x46,
in case c1 >= 0xD8 and c1 <= 0xDE and c2 >= 0xA1 and c2 <= 0xFE,
C1 = (c1 - 0xB6) * 0x60 + c2 - 0xA0 and FC = 0x46,
in case c1 >= 0xE0 and c1 <= 0xF9 and c2 >= 0x31 and c2 <= 0x7E,
C1 = (c1 - 0xD8) * 0x60 + c2 - 0x30 and FC = 0x46,
in case c1 >= 0xE0 and c1 <= 0xF9 and c2 >= 0x91 and c2 <= 0xA0,
C1 = (c1 - 0xD8) * 0x60 + c2 - 0x42 and FC = 0x46,
in case c1 >= 0xE0 and c1 <= 0xF9 and c2 >= 0xA1 and c2 <= 0xFE,
C1 = (c1 - 0xB7) * 0x60 + c2 - 0xA0 and FC = 0x46.
If C1 is smaller than 0x400, then put C = C1, otherwise put C = C1 - 0x400.
In case C1 >= 0x400
the integer is fffffc[13]100.
In case C1 < 0x400
the integer is fffffc[10]1000.
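
The Johab case table can be written compactly in C. The sketch below is an illustration only: the names johab_c1_fc and compose_johab are hypothetical, the lower bound 0x41 on c2 in the first case is an assumption (the table states only c2 <= 0x7E), and the bit fields are assumed to be written most significant first.

```c
#include <stdint.h>

/* Hypothetical helpers, not part of libiso2mb.
 * johab_c1_fc computes C1 and the pseudo final octet FC from the octets
 * c1 and c2. Returns 0 on success, -1 if the octets match no listed case. */
static int johab_c1_fc(unsigned c1, unsigned c2,
                       uint32_t *out_c1, unsigned *out_fc)
{
    unsigned base, off, fc;

    if (c1 >= 0x84 && c1 <= 0xD3) {
        base = c1 - 0x84;
        if (c2 >= 0x41 && c2 <= 0x7E) {          /* 0x41 is an assumed bound */
            off = c2 - 0x41; fc = 0x44;
        } else if (c2 >= 0x81 && c2 <= 0xA0) {
            off = c2 - 0x43; fc = 0x44;
        } else if (c2 >= 0xA1 && c2 <= 0xFE) {
            off = c2 - 0xA0; fc = 0x45;
        } else {
            return -1;
        }
    } else if ((c1 >= 0xD8 && c1 <= 0xDE) || (c1 >= 0xE0 && c1 <= 0xF9)) {
        /* The 0xE0..0xF9 cases subtract one more than the 0xD8..0xDE cases. */
        unsigned hi = (c1 >= 0xE0) ? 1u : 0u;
        fc = 0x46;
        if (c2 >= 0x31 && c2 <= 0x7E) {
            base = c1 - 0xD7 - hi; off = c2 - 0x30;
        } else if (c2 >= 0x91 && c2 <= 0xA0) {
            base = c1 - 0xD7 - hi; off = c2 - 0x42;
        } else if (c2 >= 0xA1 && c2 <= 0xFE) {
            base = c1 - 0xB6 - hi; off = c2 - 0xA0;
        } else {
            return -1;
        }
    } else {
        return -1;
    }
    *out_c1 = (uint32_t)base * 0x60u + off;
    *out_fc = fc;
    return 0;
}

/* Packs C1 and FC as fffffc[13]100 or fffffc[10]1000.
 * Assumption: the bit fields are concatenated most significant first. */
static uint32_t compose_johab(uint32_t c1, unsigned fc)
{
    uint32_t f = fc & 0x1Fu;
    if (c1 >= 0x400u)
        return (f << 16) | ((c1 - 0x400u) << 3) | 0x4u;
    return (f << 14) | (c1 << 4) | 0x8u;
}
```

A caller would first obtain C1 and FC with johab_c1_fc and, if it succeeds, pass both to compose_johab.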
KOI8-R, KOI8-U, Microsoft Windows Codepages 1250 -- 1258
First assign a non-negative integer c1 to each character set in order, and let c2 be the least significant 7 bits of the character code in the right half of the character set. Put C1 = c1 * 0x80 + c2. If C1 is smaller than 0x400, put C = C1; otherwise put C = C1 - 0x400.
In case C1 >= 0x400
the integer is fffffc[13]100.
In case C1 < 0x400
the integer is fffffc[10]1000.
96 set
Let C1 = (<character code> & 0x7F) - 0x20.
In case F >= 0x60
the integer is ggggfffffc[13]1, with C = C1 + 0x5F * 0x60.
In case F < 0x60
the integer is ggggfffffc[10]10, with C = C1.
94 set
With C = (<character code> & 0x7F) - 0x20, the integer is ggggfffffc[7]bb0000, where bb is a sequence of 2 bits defined as


Implementation note

This library associates each I/O stream with an automaton.

The automaton associated with an input stream reads the stream octet by octet via a function that is given to the automaton at initialization as its only means of access to the stream.

After it encounters an ISO 2022 escape sequence, it records the intermediate buffers invoked to GL and GR, and the character sets designated on G0, G1, G2 and G3. Using this information, it determines how many octets compose a character and how to convert the character to "fake" UTF-8.
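
The input side can be pictured with a small C sketch. Everything in it is hypothetical and is not the library's actual interface: the type and function names are made up for illustration, only designations to G0 (ESC ( F, ESC $ F and ESC $ ( F) and the SO and SI controls are recognized, GR is ignored, and every buffer is assumed to hold ASCII (F = 0x42) at the start.

```c
#include <stddef.h>

/* All names below are hypothetical; this is not the libiso2mb API. */
typedef int (*octet_reader)(void *stream);   /* next octet, or -1 at end */

struct charset {
    int final;     /* final octet F of the designating escape sequence */
    int octets;    /* octets per character: 1 (94 set) or 2 (94 * 94 set) */
};

struct in_automaton {
    octet_reader read;       /* the only access method to the stream */
    void *stream;
    struct charset g[4];     /* character sets designated on G0..G3 */
    int gl;                  /* index of the buffer invoked to GL */
};

/* A reader over a memory buffer, used here only for demonstration. */
struct mem_stream { const unsigned char *p; size_t left; };

static int mem_read(void *stream)
{
    struct mem_stream *m = stream;
    if (m->left == 0)
        return -1;
    m->left--;
    return *m->p++;
}

static void in_init(struct in_automaton *a, octet_reader reader, void *stream)
{
    int i;
    a->read = reader;
    a->stream = stream;
    a->gl = 0;
    for (i = 0; i < 4; i++) {            /* assumed initial state: ASCII */
        a->g[i].final = 0x42;
        a->g[i].octets = 1;
    }
}

/* Reads the next character through GL. Returns its length in octets,
 * 0 at end of stream, or -1 on input this sketch does not handle. */
static int in_next(struct in_automaton *a, unsigned char out[2], int *final)
{
    for (;;) {
        int c = a->read(a->stream);
        if (c < 0)
            return 0;
        if (c == 0x1B) {                            /* ESC */
            int i = a->read(a->stream);
            int f = a->read(a->stream);
            int octets = 1;
            if (i == 0x24) {                        /* ESC $ F or ESC $ ( F */
                octets = 2;
                if (f == 0x28)
                    f = a->read(a->stream);
            } else if (i != 0x28) {                 /* only ESC ( F otherwise */
                return -1;
            }
            if (f < 0)
                return -1;
            a->g[0].final = f;                      /* record the designation */
            a->g[0].octets = octets;
            continue;
        }
        if (c == 0x0E) { a->gl = 1; continue; }     /* SO: invoke G1 to GL */
        if (c == 0x0F) { a->gl = 0; continue; }     /* SI: invoke G0 to GL */
        out[0] = (unsigned char)c;
        *final = a->g[a->gl].final;
        if (c <= 0x20 || c >= 0x7F || a->g[a->gl].octets == 1)
            return 1;
        c = a->read(a->stream);
        if (c < 0)
            return -1;
        out[1] = (unsigned char)c;
        return 2;
    }
}
```

Feeding the octets ESC $ B 0x30 0x21 ESC ( B 0x41 through mem_read yields first a two-octet character with F = 0x42 and then a one-octet character with F = 0x42.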

The automaton associated with an output stream writes a sequence of octets to the stream via a function that is given to the automaton at initialization as its only means of access to the stream.

It decomposes "fake" UTF-8 into the original character code, the character set to which the character belongs, the intermediate buffer on which the character set is designated, and the shift state, that is, whether the intermediate buffer is invoked to GL or to GR. It then compares this information against its internal state and determines which kind of escape sequence should be written to the stream.
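
The comparison against the internal state can be illustrated with the following C sketch. It is not the library's implementation: all names are hypothetical, only G0 is tracked, the state is assumed to start as ASCII (F = 0x42), and the short form ESC $ F is used for the 94 * 94 sets with F = 0x40 to 0x42, as is conventional for those sets.

```c
#include <stddef.h>

/* All names below are hypothetical; this is not the libiso2mb API. */
typedef int (*octet_writer)(void *stream, int octet);  /* 0 on success */

struct out_automaton {
    octet_writer write;   /* the only access method to the stream */
    void *stream;
    int g0_final;         /* final octet of the set designated on G0 */
    int g0_octets;        /* 1 for a 94 set, 2 for a 94 * 94 set */
};

/* A writer into a memory buffer, used here only for demonstration. */
struct mem_sink { unsigned char buf[64]; size_t len; };

static int mem_write(void *stream, int octet)
{
    struct mem_sink *m = stream;
    if (m->len >= sizeof m->buf)
        return -1;
    m->buf[m->len++] = (unsigned char)octet;
    return 0;
}

static void out_init(struct out_automaton *a, octet_writer writer, void *stream)
{
    a->write = writer;
    a->stream = stream;
    a->g0_final = 0x42;   /* assumed initial state: ASCII on G0 */
    a->g0_octets = 1;
}

/* Writes one character of the set (final, octets), preceded by a
 * designating escape sequence only when the recorded state differs.
 * Returns 0 on success, -1 if the writer fails. */
static int out_put(struct out_automaton *a, int final, int octets,
                   const unsigned char *ch)
{
    int i;

    if (final != a->g0_final || octets != a->g0_octets) {
        if (a->write(a->stream, 0x1B) < 0)
            return -1;
        if (octets == 2) {
            if (a->write(a->stream, 0x24) < 0)          /* ESC $ ... */
                return -1;
            if ((final < 0x40 || final > 0x42) &&
                a->write(a->stream, 0x28) < 0)          /* ESC $ ( F */
                return -1;
        } else if (a->write(a->stream, 0x28) < 0) {     /* ESC ( F */
            return -1;
        }
        if (a->write(a->stream, final) < 0)
            return -1;
        a->g0_final = final;
        a->g0_octets = octets;
    }
    for (i = 0; i < octets; i++) {
        if (a->write(a->stream, ch[i]) < 0)
            return -1;
    }
    return 0;
}
```

Writing a two-octet character of the set with F = 0x42 and then two ASCII characters produces one ESC $ B, one ESC ( B, and no escape sequence before the last character, since the recorded state already matches.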


Any questions or comments about this page are greatly appreciated.

Almost all contents of this site are written by Kiyokazu SUTO (i.e. me) unless otherwise noted. I want to put all of them into the PUBLIC DOMAIN, even though some lawyers say that this is impossible in my country.