NAME

mbconv - Character encoding scheme converter

SYNOPSIS

  mbconv [options] <file> ...

DESCRIPTION

mbconv is an application of a library for handling multi-octet character strings:

  http://pub.ks-and-ks.ne.jp/prog/libiso2mb.shtml

It was written mainly for debugging the library.

It reads input octet by octet from the files given on the command line (or from standard input if no file is specified), converts the character encoding scheme as specified by the command-line options described below, and writes the result to standard output (or to a file specified by the -t or -a option).
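
As a rough illustration of the read, convert, write cycle described above, the following Python sketch converts a stream one octet at a time. It uses Python's standard codecs rather than the library that mbconv is built on; the function name convert_stream and the sample charsets are assumptions made for the example only.

```python
import codecs
import io


def convert_stream(src, dst, from_charset, to_charset):
    """Copy src to dst, re-encoding one octet at a time (illustrative only)."""
    decoder = codecs.getincrementaldecoder(from_charset)()
    encoder = codecs.getincrementalencoder(to_charset)()
    while True:
        octet = src.read(1)
        if not octet:
            break
        # The decoder buffers incomplete multi-octet characters internally
        # and returns an empty string until a whole character is available.
        text = decoder.decode(octet)
        if text:
            dst.write(encoder.encode(text))
    # Flush both sides; a truncated character raises an error here.
    tail = decoder.decode(b"", final=True)
    dst.write(encoder.encode(tail, final=True))


source = io.BytesIO("日本語\n".encode("euc_jp"))
target = io.BytesIO()
convert_stream(source, target, "euc_jp", "shift_jis")
result = target.getvalue()
```

The incremental decoder is what makes octet-by-octet reading workable: a character split across reads is held back until its last octet arrives.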

Options

-?, -h, --help

displays a summary of options and exits.

-a file, --append-to=file

output is appended to file.

-c converters, --convert-to=converters

specifies the character encoding conversion. converters must be a comma-separated list of words described in Conversion specifiers.

-f flags, --flag=flags

specifies flags that change the behavior of the conversion. flags must be a comma-separated list of words described in Flag specifiers.

-i, --input

succeeding options apply to input stream.

-m string, --mime-charset=string

MIME encoding conforming to RFC 2047 is performed. string is used as the charset name.

-n, --line-number

a line number (>= 1) is inserted at the beginning of each line.

-o, --output

succeeding options apply to output stream.

-t file, --to=file

output is written to file, which is truncated first.

-w, --width

specifies the output width of each line.

-cs string, --charset=string

specifies the charset name. In addition to MIME charset names, some language specifications are accepted; these restrict the candidate encoding schemes of the input stream. The accepted languages are listed in Acceptable languages.

--cname=canonical-name=charset-names

specifies the canonical name for non-standard charset names. charset-names must be a comma-separated list of charset names.

--format=string

specifies the output format.

--which

outputs the charset name of each input stream to stderr, in the form

file name: charset name

if two or more files are specified on the command line, or

charset name

otherwise.
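
For reference, the -m option above produces text in the RFC 2047 encoded-word form, =?charset?encoding?encoded-text?=. The sketch below builds one such word in Python with base64 ("B") encoding. It is only an illustration of the format: the helper name encoded_word is hypothetical, whether mbconv chooses "B" or "Q" encoding is not stated here, and the sketch ignores the 75-character limit that RFC 2047 places on each encoded word.

```python
import base64
from email.header import decode_header


def encoded_word(text, charset):
    """Build a single RFC 2047 encoded-word using "B" (base64) encoding."""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return "=?" + charset + "?B?" + payload + "?="


word = encoded_word("日本語", "UTF-8")

# Round trip through the standard library's RFC 2047 decoder.
raw, charset_name = decode_header(word)[0]
decoded = raw.decode(charset_name)
```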

Conversion specifiers

Note: for the output stream, converter setup is performed automatically based on the charset, so in most cases you need not specify converters explicitly.

b, cn-big5: converted to Big Five,
c: converted to ISO-2022-CN,
j, a0: converted in such a way that charsets are designated to G0 and invoked to GL,
k: converted to ISO-2022-KR,
s, sjis, shift_jis: converted to Shift_JIS,
b2c, big5-to-cns: Big Five converted to CNS 11643,
i2u, iso-to-ucs: converted to UTF-8,
u2b, ucs-to-big5: UTF-8 converted to Big Five or others,
u2c, ucs-to-cn: UTF-8 converted to CNS 11643 or others,
u2j, ucs-to-ja: UTF-8 converted to JIS X 0208 or others,
u2k, ucs-to-kr: UTF-8 converted to KS X 1001 or others,
u2gb, ucs-to-gb: UTF-8 converted to GB2312 or others,
misc: UTF-8 converted to one of koi8-r, koi8-u, windows-1250, ..., or windows-1258,
ascii: domestic ASCII converted to US-ASCII,
cn-gb: converted to CN-GB,
euc-jp: converted to EUC-JP,
euc-kr: converted to EUC-KR,
euc-tw: converted to EUC-TW,
charset: converted appropriately according to the charset bound to the internal automaton,
ms-latin1: Unicode characters with code points between 0x80 and 0x9F (inclusive) are converted to other Unicode characters as if they were the characters at those code points in Microsoft Windows code page 1252,
ucs-to-johab: converted to JOHAB,
cn-gb-isoir165: converted to CN-GB-ISOIR165.
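
The ms-latin1 specifier above can be pictured with the following Python sketch, which reinterprets code points 0x80 through 0x9F through Python's own cp1252 table. This is an approximation for illustration, not the library's implementation: the function name ms_latin1 is hypothetical, and the five positions that Python's table leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are simply passed through unchanged, which may differ from what mbconv does.

```python
def ms_latin1(text):
    """Remap U+0080..U+009F as if they were Windows code page 1252 octets."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x80 <= cp <= 0x9F:
            try:
                ch = bytes([cp]).decode("cp1252")
            except UnicodeDecodeError:
                # Undefined in Python's cp1252 table: keep the character.
                pass
        out.append(ch)
    return "".join(out)


sample = "\u0093quoted\u0094 \u0085"
fixed = ms_latin1(sample)
```

Here 0x93 and 0x94 become the curly double quotes U+201C and U+201D, and 0x85 becomes the ellipsis U+2026.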

Flag specifiers

use-0x28-for-94x94inG0, 28: use ``1/11 2/4 2/8 F'' instead of ``1/11 2/4 F'' to designate charsets with final octet 4/0, 4/1, or 4/2 to G0,
ac, ascii-at-control: escape sequence ``1/11 2/8 4/2'' is output before every control character,
uc, check-utf-8: checks for overlong UTF-8 encodings,
nossl, ignore-7bit-single-shift: escape sequences for 7-bit single shift are ignored.
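
The escape sequences above are written in the column/row notation used by ISO/IEC 2022, where c/r stands for the octet 16*c + r, so 1/11 is ESC (0x1B). The following Python sketch, with a hypothetical helper named octets, turns that notation into bytes so the sequences can be compared with a hex dump.

```python
def octets(notation):
    """Convert column/row notation such as "1/11 2/8 4/2" into bytes."""
    values = []
    for item in notation.split():
        column, row = item.split("/")
        values.append(int(column) * 16 + int(row))
    return bytes(values)


# ESC ( B, the sequence written before control characters by the ac flag.
ascii_to_g0 = octets("1/11 2/8 4/2")

# Short and long forms of a G0 designation with final octet 4/2.
short_form = octets("1/11 2/4 4/2")
long_form = octets("1/11 2/4 2/8 4/2")
```

With final octet 4/2, the short form is ESC $ B and the long form selected by the 28 flag is ESC $ ( B.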

Acceptable languages

The following words may be given instead of a MIME charset name for the input stream. In that case, the coding scheme is detected automatically (hopefully) from among the candidates listed after the word.

c, cn, china, chinese: cn-gb, cn-big5, utf-8, or x-euc-tw.
j, ja, jp, japan, japanese: euc-jp, shift_jis, or utf-8.
k, ko, kr, korea, korean: euc-kr, x-johab, utf-8, or x-unified-hangul.
cjk: iso-8859-1, cn-gb, cn-big5, x-euc-tw, euc-jp, shift_jis, euc-kr, x-johab, x-unified-hangul, or utf-8.
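
How the library chooses among the candidates is not described here. As a naive stand-in, the Python sketch below tries each candidate in the listed order and returns the first one that decodes the input without error. The names CANDIDATES and guess_charset are hypothetical, only the Japanese entry is filled in, and Python codec names are used in place of the charset names above.

```python
# Candidate codecs per language word, in the order given in the list above.
CANDIDATES = {
    "japanese": ["euc_jp", "shift_jis", "utf-8"],
}


def guess_charset(data, language):
    """Return the first candidate that decodes data strictly, else None."""
    for name in CANDIDATES[language]:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


guess = guess_charset("日本語".encode("shift_jis"), "japanese")
```

Trial decoding like this is order-sensitive: input that is valid in several candidates, such as plain ASCII, is reported as the first one listed.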

AUTHOR

Kiyokazu SUTO <suto@ks-and-ks.ne.jp>

DISCLAIMER etc.

This program is distributed with absolutely no warranty.

Anyone can use, modify, and redistribute this program without any restriction.