Functions to handle multi-octet character strings
libiso2mb-0.8.5.tar.gz (1236KB, 2000-12-07 02:11:05)
is a gzipped tarball of a collection of functions
to handle sequences of characters consisting of multiple octets.
It includes
a character encoding conversion tool
which was initially written for debugging this library.
Despite that initial intention,
I believe that it is a very useful tool.
The gzipped unified diffs, the latest development version,
and the log of changes
libiso2mb-0.8.5-ChangeLog.txt (13KB, 2000-12-07 02:15:34)
are also available.
The main functionality is
to convert the encoding of a character
from ISO 2022 to "fake" UTF-8,
and vice versa.
To build and install this library,
you need a C compiler and libraries conforming to the ANSI standard.
In addition,
- the "int" type of your C compiler must be at least 32 bits wide,
- your stdio library must provide the functions "fileno()" and "fdopen()",
-
if you are going to use the included Makefile,
you need
GNU Make,
GNU fileutils,
GNU binutils and GNU C
supporting shared objects.
I strongly recommend using GNU C and GNU Make.
If you build with the included Makefile,
you need to tell your dynamic linker
the directory (/usr/local/lib) in which the shared library is installed.
If you are installing on a Linux box, for example,
add the line
/usr/local/lib
to the file /etc/ld.so.conf
unless it already contains such a line,
and then issue the command
/sbin/ldconfig
This library can handle the following subset of the ISO 2022 escape sequences:
- designating an ISO 2022 registered character set on an intermediate buffer,
- designating UTF-8,
- return from UTF-8,
- locking shift,
- 7-bit single shift by "\x1B\x4E" or "\x1B\x4F",
- 8-bit single shift by "\x8E" or "\x8F".
In addition, it can handle the following non-ISO 2022 encodings:
- Shift_JIS,
- UTF-8,
- Big Five,
- EUC-tw,
- Johab,
- Unified Hangul,
- KOI8-R,
- KOI8-U,
- Microsoft Windows Codepages 1250 -- 1258.
The use of the following bit patterns in the UTF-8 encoding
is strongly discouraged:
- 1100000x 10xxxxxx
- 11100000 100xxxxx 10xxxxxx
- 11110000 1000xxxx 10xxxxxx 10xxxxxx
- 11111000 10000xxx 10xxxxxx 10xxxxxx 10xxxxxx
- 11111100 100000xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The ISO-2022 encoded characters in input streams are embedded into these patterns.
There are 2^26 + 2^21 + 2^16 + 2^11 + 2^7 patterns in the "fake" UTF-8.
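The five reserved patterns listed above can be recognized from the first two octets alone. The following is a minimal sketch, not code from the library: the function names are hypothetical, and the caller is assumed to pass exactly one candidate sequence together with its length.

```c
#include <stddef.h>

/* Returns 1 if s[from..to-1] are all continuation octets (10xxxxxx). */
static int is_continuation_run(const unsigned char *s, size_t from, size_t to)
{
    size_t i;
    for (i = from; i < to; i++)
        if ((s[i] & 0xC0) != 0x80)
            return 0;
    return 1;
}

/*
 * Hypothetical helper: if the n octets at s form one of the five reserved
 * patterns, return the number of free "x" bits (7, 11, 16, 21 or 26);
 * otherwise return 0.
 */
int fake_utf8_free_bits(const unsigned char *s, size_t n)
{
    if (n == 2 && (s[0] & 0xFE) == 0xC0)
        return is_continuation_run(s, 1, 2) ? 7 : 0;
    if (n == 3 && s[0] == 0xE0 && (s[1] & 0xE0) == 0x80)
        return is_continuation_run(s, 2, 3) ? 11 : 0;
    if (n == 4 && s[0] == 0xF0 && (s[1] & 0xF0) == 0x80)
        return is_continuation_run(s, 2, 4) ? 16 : 0;
    if (n == 5 && s[0] == 0xF8 && (s[1] & 0xF8) == 0x80)
        return is_continuation_run(s, 2, 5) ? 21 : 0;
    if (n == 6 && s[0] == 0xFC && (s[1] & 0xFC) == 0x80)
        return is_continuation_run(s, 2, 6) ? 26 : 0;
    return 0;
}
```

The five return values are the exponents in the pattern count given above.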
For each combination of a character and the designated intermediate buffer,
we first calculate an integer as described below,
and then assign a bit pattern in the "fake" UTF-8 to that integer.
A smaller integer is assigned a shorter bit pattern.
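The text does not spell out exactly how an integer is placed into a pattern. The sketch below shows one possible reading: it assumes that the bits of the integer are copied directly into the "x" positions of the shortest pattern with enough room. That rule, and the function name fake_utf8_put, are assumptions made for illustration, not the library's documented behaviour.

```c
#include <stddef.h>

/*
 * ASSUMPTION (not stated in the text): the integer's bits are copied
 * directly into the "x" positions of the shortest pattern that can hold
 * them. Writes 2 to 6 octets to out and returns the count, or returns 0
 * if n does not fit in 26 bits.
 */
size_t fake_utf8_put(unsigned long n, unsigned char out[6])
{
    static const int free_bits[5] = { 7, 11, 16, 21, 26 };
    static const unsigned char lead[5] = { 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
    size_t len, i;
    int k;

    for (k = 0; k < 5; k++)
        if (n < (1UL << free_bits[k]))
            break;
    if (k == 5)
        return 0;

    len = (size_t)k + 2;
    if (k == 0) {
        /* 1100000x 10xxxxxx: one bit in the lead octet, six in the next. */
        out[0] = (unsigned char)(0xC0 | (n >> 6));
        out[1] = (unsigned char)(0x80 | (n & 0x3F));
        return len;
    }
    /* Longer patterns: fixed lead octet, six bits per trailing octet,
       and the remaining high bits in the second octet. */
    out[0] = lead[k];
    for (i = len - 1; i >= 2; i--) {
        out[i] = (unsigned char)(0x80 | (n & 0x3F));
        n >>= 6;
    }
    out[1] = (unsigned char)(0x80 | n);
    return len;
}
```

Under this assumption, integers below 2^7 use two octets, integers below 2^11 use three, and so on up to six octets for integers below 2^26.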
We adopt the following notations.
- gggg
- is a sequence of 4 bits representing the intermediate buffer and the shift state.
- fffff
-
is the least significant 5 bits of the final octet F of the designating escape sequence.
For encodings which are neither ISO 2022 nor UTF-8,
we use the following "pseudo" final octets.
- Shift_JIS (JIS X 0201 and 0208 corresponding part)
- 0x40.
- Shift_JIS (JIS X 0213 corresponding part)
- 0x41.
- Big Five (with 2nd octets smaller than 0x7F)
- 0x42.
- Big Five (with 2nd octets larger than 0xA0)
- 0x43.
- Johab
- 0x44 -- 0x46.
- KOI8-R, KOI8-U, Microsoft Windows Codepages 1250 -- 1258
- share 0x5E.
- c[N]
- is a sequence of N bits representing an integer C calculated from each character code.
With this notation,
the integer is composed as follows.
- 94 * 94 set
-
Let C1 = ((<1st octet> & 0x7F) - 0x20) * 0x60 + (<2nd octet> & 0x7F) - 0x20.
If C1 is smaller than 0x400, then put C = C1, otherwise put C = C1 - 0x400.
- In case C1 >= 0x400
- the integer is ggggfffffc[13]1.
- In case C1 < 0x400
- the integer is ggggfffffc[10]10.
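The composition for a 94 * 94 set can be sketched as follows. The function name is hypothetical, and the 4-bit value g is passed in as an opaque number because its encoding is not described here. The two octets are assumed to be graphic characters, so that each is at least 0x21 after masking with 0x7F.

```c
/*
 * Hypothetical helper: compose the integer for a character of a 94 * 94
 * set. g is the 4-bit buffer/shift value, final is the final octet F,
 * o1 and o2 are the two octets of the character.
 */
unsigned long int_94x94(unsigned g, unsigned final, unsigned o1, unsigned o2)
{
    unsigned long c1 = ((o1 & 0x7F) - 0x20) * 0x60UL + (o2 & 0x7F) - 0x20;
    /* head = gggg followed by fffff (9 bits in total) */
    unsigned long head = ((unsigned long)(g & 0xF) << 5) | (final & 0x1F);

    if (c1 >= 0x400)
        return (head << 14) | ((c1 - 0x400) << 1) | 1UL;  /* ggggfffffc[13]1  */
    return (head << 12) | (c1 << 2) | 2UL;                /* ggggfffffc[10]10 */
}
```

For example, with F = 0x42 and the octets 0x30 0x21, C1 is 0x601, so the longer form is used with C = 0x201.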
- Shift_JIS
-
If the character corresponds to a JIS X 0201 katakana,
let C = (<character code> & 0x7F) - 0x20.
If the character corresponds to a JIS X 0208 or 0213 kanji,
let c1 and c2 be the 1st and 2nd octets of the JIS X 0208 or 0213 kanji,
then put C1 = ((c1 & 0x7F) - 0x20) * 0x60 + (c2 & 0x7F) - 0x20.
If C1 is smaller than 0x400, then put C = C1, otherwise put C = C1 - 0x400.
- In case C1 >= 0x400
- the integer is fffffc[13]100.
- In case C1 < 0x400
- the integer is fffffc[10]1000.
- Big Five
-
Let c1 and c2 be the 1st and 2nd octets,
and put C1 = ((c1 & 0x7F) - 0x20) * 0x60 + (c2 & 0x7F) - 0x20.
If C1 is smaller than 0x400, then put C = C1, otherwise put C = C1 - 0x400.
- In case C1 >= 0x400
- the integer is fffffc[13]100.
- In case C1 < 0x400
- the integer is fffffc[10]1000.
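As an illustration of the Big Five rule, the sketch below picks the pseudo final octet from the 2nd octet (0x42 or 0x43, as listed in the notation above) and then composes the integer. The function name is hypothetical. The sketch also assumes a 1st octet of 0xA1 or larger, since smaller values would make the subtraction in C1 negative.

```c
/*
 * Hypothetical helper: compose the integer for a Big Five character with
 * 1st octet o1 (assumed >= 0xA1) and 2nd octet o2 (0x40..0x7E or
 * 0xA1..0xFE).
 */
unsigned long int_big5(unsigned o1, unsigned o2)
{
    /* Pseudo final octet: 0x42 for low 2nd octets, 0x43 for high ones. */
    unsigned long f = ((o2 < 0x7F) ? 0x42UL : 0x43UL) & 0x1F;
    unsigned long c1 = ((o1 & 0x7F) - 0x20) * 0x60UL + (o2 & 0x7F) - 0x20;

    if (c1 >= 0x400)
        return (f << 16) | ((c1 - 0x400) << 3) | 4UL;  /* fffffc[13]100  */
    return (f << 14) | (c1 << 4) | 8UL;                /* fffffc[10]1000 */
}
```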
- Johab
-
Let c1 and c2 be the 1st and 2nd octets.
- In case c1 >= 0x84 and c1 <= 0xD3 and c2 <= 0x7E,
- C1 = (c1 - 0x84) * 0x60 + c2 - 0x41 and FC = 0x44,
- in case c1 >= 0x84 and c1 <= 0xD3 and c2 >= 0x81 and c2 <= 0xA0,
- C1 = (c1 - 0x84) * 0x60 + c2 - 0x43 and FC = 0x44,
- in case c1 >= 0x84 and c1 <= 0xD3 and c2 >= 0xA1 and c2 <= 0xFE,
- C1 = (c1 - 0x84) * 0x60 + c2 - 0xA0 and FC = 0x45,
- in case c1 >= 0xD8 and c1 <= 0xDE and c2 >= 0x31 and c2 <= 0x7E,
- C1 = (c1 - 0xD7) * 0x60 + c2 - 0x30 and FC = 0x46,
- in case c1 >= 0xD8 and c1 <= 0xDE and c2 >= 0x91 and c2 <= 0xA0,
- C1 = (c1 - 0xD7) * 0x60 + c2 - 0x42 and FC = 0x46,
- in case c1 >= 0xD8 and c1 <= 0xDE and c2 >= 0xA1 and c2 <= 0xFE,
- C1 = (c1 - 0xB6) * 0x60 + c2 - 0xA0 and FC = 0x46,
- in case c1 >= 0xE0 and c1 <= 0xF9 and c2 >= 0x31 and c2 <= 0x7E,
- C1 = (c1 - 0xD8) * 0x60 + c2 - 0x30 and FC = 0x46,
- in case c1 >= 0xE0 and c1 <= 0xF9 and c2 >= 0x91 and c2 <= 0xA0,
- C1 = (c1 - 0xD8) * 0x60 + c2 - 0x42 and FC = 0x46,
- in case c1 >= 0xE0 and c1 <= 0xF9 and c2 >= 0xA1 and c2 <= 0xFE,
- C1 = (c1 - 0xB7) * 0x60 + c2 - 0xA0 and FC = 0x46.
If C1 is smaller than 0x400, then put C = C1, otherwise put C = C1 - 0x400.
- In case C1 >= 0x400
- the integer is fffffc[13]100.
- In case C1 < 0x400
- the integer is fffffc[10]1000.
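The Johab case table above translates directly into a chain of range checks. This is only a sketch with a hypothetical function name. The first case gives no lower bound for c2, so the sketch assumes 0x41, which is the smallest value that keeps C1 non-negative.

```c
/*
 * Hypothetical helper: returns C1 for the Johab octets c1, c2 and stores
 * the pseudo final octet FC in *fc. Returns -1 (leaving *fc untouched)
 * when the pair is outside the table.
 */
long johab_c1(int c1, int c2, int *fc)
{
    if (c1 >= 0x84 && c1 <= 0xD3) {
        long row = (long)(c1 - 0x84) * 0x60;
        /* ASSUMPTION: lower bound 0x41 for c2 in the first case. */
        if (c2 >= 0x41 && c2 <= 0x7E) { *fc = 0x44; return row + c2 - 0x41; }
        if (c2 >= 0x81 && c2 <= 0xA0) { *fc = 0x44; return row + c2 - 0x43; }
        if (c2 >= 0xA1 && c2 <= 0xFE) { *fc = 0x45; return row + c2 - 0xA0; }
        return -1;
    }
    if ((c1 >= 0xD8 && c1 <= 0xDE) || (c1 >= 0xE0 && c1 <= 0xF9)) {
        /* Row bases from the table: one for c2 < 0xA1, one for c2 >= 0xA1. */
        int low_base  = (c1 <= 0xDE) ? 0xD7 : 0xD8;
        int high_base = (c1 <= 0xDE) ? 0xB6 : 0xB7;
        if (c2 >= 0x31 && c2 <= 0x7E) { *fc = 0x46; return (long)(c1 - low_base) * 0x60 + c2 - 0x30; }
        if (c2 >= 0x91 && c2 <= 0xA0) { *fc = 0x46; return (long)(c1 - low_base) * 0x60 + c2 - 0x42; }
        if (c2 >= 0xA1 && c2 <= 0xFE) { *fc = 0x46; return (long)(c1 - high_base) * 0x60 + c2 - 0xA0; }
        return -1;
    }
    return -1;
}
```

The returned C1 and FC then go through the same C1 < 0x400 split as the other non-ISO 2022 encodings.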
- KOI8-R, KOI8-U, Microsoft Windows Codepages 1250 -- 1258
-
First assign a non-negative integer c1 to each character set in order,
and let c2 be the least significant 7 bits of a character in the right half of these character sets.
Put C1 = c1 * 0x80 + c2.
If C1 is smaller than 0x400, then put C = C1, otherwise put C = C1 - 0x400.
- In case C1 >= 0x400
- the integer is fffffc[13]100.
- In case C1 < 0x400
- the integer is fffffc[10]1000.
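A sketch for the single-octet sets follows. The numbering of the sets is an assumption: the text only says "in order", so the sketch takes the order of the list (KOI8-R = 0, KOI8-U = 1, codepage 1250 = 2, up to codepage 1258 = 10). The function name is hypothetical.

```c
/*
 * Hypothetical helper: compose the integer for a right-half character
 * (0x80..0xFF) of KOI8-R, KOI8-U or a Windows codepage.
 * ASSUMPTION: set_index is 0 for KOI8-R, 1 for KOI8-U, and 2..10 for
 * codepages 1250..1258.
 */
unsigned long int_right_half(unsigned set_index, unsigned octet)
{
    unsigned long f = 0x5EUL & 0x1F;  /* shared pseudo final octet 0x5E */
    unsigned long c1 = (unsigned long)set_index * 0x80 + (octet & 0x7F);

    if (c1 >= 0x400)
        return (f << 16) | ((c1 - 0x400) << 3) | 4UL;  /* fffffc[13]100  */
    return (f << 14) | (c1 << 4) | 8UL;                /* fffffc[10]1000 */
}
```

With this numbering, only sets with index 8 or higher reach C1 >= 0x400 and use the longer form.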
- 96 set
-
Let C1 = (<character code> & 0x7F) - 0x20.
- In case F >= 0x60
- the integer is ggggfffffc[13]1, with C = C1 + 0x5F * 0x60.
- In case F < 0x60
- the integer is ggggfffffc[10]10, with C = C1.
- 94 set
-
With C = (<character code> & 0x7F) - 0x20,
the integer is ggggfffffc[7]bb0000,
where bb is a sequence of 2 bits defined as follows:
- in case designated by a 3-octet sequence and F < 0x60, bb = 00,
- in case designated by a 3-octet sequence and F >= 0x60, bb = 01,
- in case designated by a 4-octet sequence and F < 0x60, bb = 10,
- in case designated by a 4-octet sequence and F >= 0x60, bb = 11.
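The 94 set layout can be sketched like this. The function name and the flag four_octet_designation are hypothetical, and g is again treated as an opaque 4-bit value.

```c
/*
 * Hypothetical helper: compose the integer ggggfffffc[7]bb0000 for a
 * character of a 94 set. four_octet_designation is non-zero when the set
 * was designated by a 4-octet escape sequence.
 */
unsigned long int_94(unsigned g, unsigned final,
                     int four_octet_designation, unsigned octet)
{
    unsigned long c = (octet & 0x7F) - 0x20;
    /* High bit of bb: length of the designation; low bit: F >= 0x60. */
    unsigned long bb = (four_octet_designation ? 2UL : 0UL)
                     | (final >= 0x60 ? 1UL : 0UL);

    return ((unsigned long)(g & 0xF) << 18)
         | ((unsigned long)(final & 0x1F) << 13)
         | (c << 6)
         | (bb << 4);
}
```

The four low bits are always zero, which keeps these integers apart from the forms ending in 1, 10, 100 and 1000 used for the other kinds of sets.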
This library associates each I/O stream with an automaton.
The automaton associated with an input stream
reads the stream octet by octet
via a function initially given to the automaton
as the only access method to the stream.
After it encounters an ISO 2022 escape sequence,
it records the intermediate buffers invoked to GL and GR,
and the character sets designated on G0, G1, G2 and G3.
Using this information,
it determines
how many octets compose a character, and
how to convert the character to "fake" UTF-8.
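To make the idea of the input automaton concrete, here is a much reduced sketch. It is not the library's interface: all names are hypothetical, only 7-bit input is handled, and only designations of 94 and 94 * 94 sets and the locking shifts SO and SI are recognized. The automaton sees the stream only through the reader function it is given.

```c
#include <stddef.h>

typedef int (*octet_reader)(void *ctx);  /* next octet, or -1 at end */

struct in_state {
    octet_reader read;
    void *ctx;
    unsigned char final[4];  /* final octet of the set on G0..G3 */
    unsigned char width[4];  /* octets per character on G0..G3   */
    int gl;                  /* which of G0..G3 is invoked to GL */
};

/* Example reader over a memory buffer. */
struct mem_src { const unsigned char *p; size_t left; };

int mem_read(void *ctx)
{
    struct mem_src *m = (struct mem_src *)ctx;
    if (m->left == 0)
        return -1;
    m->left--;
    return *m->p++;
}

void in_init(struct in_state *s, octet_reader read, void *ctx)
{
    int i;
    s->read = read;
    s->ctx = ctx;
    for (i = 0; i < 4; i++) { s->final[i] = 'B'; s->width[i] = 1; }
    s->gl = 0;
}

/* Called after ESC was read; records a designation. Returns -1 on failure. */
static int in_escape(struct in_state *s)
{
    int width = 1, g, c = s->read(s->ctx);
    if (c == '$') {                       /* multi-octet set follows */
        width = 2;
        c = s->read(s->ctx);
        if (c == '@' || c == 'A' || c == 'B') {  /* short form: ESC $ F */
            s->final[0] = (unsigned char)c;
            s->width[0] = 2;
            return 0;
        }
    }
    if (c < '(' || c > '+')               /* ( ) * + select G0..G3 */
        return -1;
    g = c - '(';
    c = s->read(s->ctx);
    if (c < 0x30 || c > 0x7E)
        return -1;
    s->final[g] = (unsigned char)c;
    s->width[g] = (unsigned char)width;
    return 0;
}

/* Reads one character; returns its octet count, or 0 at end or on error. */
int in_next(struct in_state *s, unsigned char octets[2], unsigned char *final)
{
    for (;;) {
        int i, n, c = s->read(s->ctx);
        if (c < 0) return 0;
        if (c == 0x1B) { if (in_escape(s) < 0) return 0; continue; }
        if (c == 0x0E) { s->gl = 1; continue; }  /* SO: G1 to GL */
        if (c == 0x0F) { s->gl = 0; continue; }  /* SI: G0 to GL */
        n = (c > 0x20 && c < 0x7F) ? s->width[s->gl] : 1;
        octets[0] = (unsigned char)c;
        for (i = 1; i < n; i++) {
            if ((c = s->read(s->ctx)) < 0) return 0;
            octets[i] = (unsigned char)c;
        }
        *final = s->final[s->gl];
        return n;
    }
}
```

The recorded final octet and the octets of the character are what a conversion step would need in order to compute the integer described earlier.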
The automaton associated with an output stream
writes a sequence of octets to the stream
via a function initially given to the automaton
as the only access method to the stream.
It first decomposes "fake" UTF-8
into
the original character code,
the character set to which the character belongs,
the intermediate buffer on which the character set is designated,
and the shift state, that is, whether the intermediate buffer is invoked to GL or to GR.
It then compares this information against its internal state,
and determines which kind of escape sequence should be written to the stream.
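The comparison against the internal state can be illustrated with a reduced sketch of the output side. It starts after the decomposition step, tracks only G0, and writes a designating escape sequence only when the wanted character set differs from the recorded one. All names are hypothetical and this is not the library's interface.

```c
#include <stddef.h>

typedef int (*octet_writer)(int octet, void *ctx);

struct out_state {
    octet_writer write;
    void *ctx;
    unsigned char g0_final;  /* final octet of the set currently on G0 */
    unsigned char g0_width;  /* octets per character of that set       */
};

/* Example writer that collects octets in memory. */
struct mem_sink { unsigned char buf[64]; size_t len; };

int mem_write(int octet, void *ctx)
{
    struct mem_sink *m = (struct mem_sink *)ctx;
    if (m->len >= sizeof m->buf)
        return -1;
    m->buf[m->len++] = (unsigned char)octet;
    return 0;
}

void out_init(struct out_state *s, octet_writer write, void *ctx)
{
    s->write = write;
    s->ctx = ctx;
    s->g0_final = 'B';   /* assume ASCII on G0 at the start */
    s->g0_width = 1;
}

/* Writes one character of the set (final, width), designating it first
   if it is not the set already recorded for G0. */
void out_put(struct out_state *s, unsigned char final, unsigned char width,
             const unsigned char *octets)
{
    unsigned char i;
    if (final != s->g0_final || width != s->g0_width) {
        s->write(0x1B, s->ctx);
        if (width == 2) {
            s->write('$', s->ctx);
            /* Finals @, A and B use the short form ESC $ F. */
            if (final != '@' && final != 'A' && final != 'B')
                s->write('(', s->ctx);
        } else {
            s->write('(', s->ctx);
        }
        s->write(final, s->ctx);
        s->g0_final = final;
        s->g0_width = width;
    }
    for (i = 0; i < width; i++)
        s->write(octets[i], s->ctx);
}
```

Writing two characters of the same set in a row therefore produces only one escape sequence.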
Any questions or comments about this page
are greatly appreciated.
Almost all content on this site is written by
Kiyokazu SUTO
(i.e. me)
unless otherwise noted.
I want to put all of it into the PUBLIC DOMAIN,
even though some lawyers say that this is impossible in my country.