1@node Encoding conversions
2@chapter Encoding conversions (@file{iconv.h})
3
4This chapter describes the Newlib iconv library.
The iconv function declarations are in
6@file{iconv.h}.
7
8@menu
9* iconv::                           Encoding conversion routines
10* Introduction to iconv::           Introduction to iconv and encodings
11* Supported encodings::             The list of currently supported encodings
12* iconv design decisions::          General iconv library design issues
13* iconv configuration::             iconv-related configure script options
14* Encoding names::                  How encodings are named.
15* CCS tables::                      CCS tables format and 'mktbl.pl' Perl script
16* CES converters::                  CES converters description
17* The encodings description file::  The 'encoding.deps' file and 'mkdeps.pl'
18* How to add new encoding::         The steps to add new encoding support
19* The locale support interfaces::   Locale-related iconv interfaces
20* Contact::                         The author contact
21@end menu
22
23@page
24@include iconv/lib/iconv.def
25
26@page
27@node Introduction to iconv
28@section Introduction to iconv
29@findex encoding
30@findex character set
31@findex charset
32@findex CES
33@findex CCS
34@*
The iconv library converts characters from one encoding to
another. It implements the @code{iconv()}, @code{iconv_open()} and @code{iconv_close()}
calls, which are defined by the Single Unix Specification.
38
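@*
As an illustration, the three calls are typically combined as in the
following minimal sketch. The helper name @code{convert_buffer} and its
buffer handling are assumptions made for this example and are not part of
the Newlib sources; the encodings passed to it must have been enabled when
Newlib was configured.

@verbatim
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: converts 'inlen' bytes of 'in' from encoding 'from'
   to encoding 'to', writing at most 'outcap' bytes to 'out'.
   Returns the number of bytes written, or (size_t)-1 on any error. */
size_t convert_buffer(const char *from, const char *to,
                      const char *in, size_t inlen,
                      char *out, size_t outcap)
{
    assert(from != NULL && to != NULL && in != NULL && out != NULL);

    /* Note the argument order: the target encoding comes first. */
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;              /* conversion is not supported */

    char *inp = (char *)in;             /* iconv() takes non-const pointers */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1 || inleft != 0)
        return (size_t)-1;              /* bad input or output buffer too small */
    return outcap - outleft;
}
@end verbatim

For stateful encodings, a further @code{iconv} call with a NULL input
pointer would be needed to flush the shift state; the sketch omits it.
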
39@*
In addition to these user-level interfaces, the iconv library provides
several interfaces which are needed to support the encoding-related
capabilities of the Newlib Locale infrastructure.  Since Locale
support also needs to
convert various character sets to and from the @emph{wide character
set}, the iconv library shares its capabilities with the Newlib Locale
subsystem. Moreover, the iconv library supports several features which are
only needed for the Locale infrastructure (for example, the @code{MB_CUR_MAX} value).
48
49@*
50The Newlib iconv library was created using concepts from another iconv
51library implemented by Konstantin Chuguev (ver 2.0). The Newlib iconv library
52was rewritten from scratch and contains a lot of improvements with respect to
53the original iconv library.
54
55@*
56Terms like @dfn{encoding} or @dfn{character set} aren't well defined and
57are often used with various meanings. The following are the definitions of terms
58which are used in this documentation as well as in the iconv library
59implementation:
60
61@itemize @bullet
62@item
63@dfn{encoding} - a machine representation of characters by means of bits;
64
65@item
66@dfn{Character Set} or @dfn{Charset} - just a collection of
67characters, i.e. the encoding is the machine representation of the character set;
68
69@item
@dfn{CCS} (@dfn{Coded Character Set}) - a mapping from a character set to a
set of integers (@dfn{character codes});
72
73@item
74@dfn{CES} (@dfn{Character Encoding Scheme}) - a mapping from a set of character
75codes to a sequence of bytes;
76@end itemize
77
78@*
79Users usually deal with encodings, for example, KOI8-R, Unicode, UTF-8,
80ASCII, etc. Encodings are formed by the following chain of steps:
81
82@enumerate
83@item
The user has a set of characters which are specific to his or her language (the character set).

@item
Each character from this set is uniquely numbered, resulting in a CCS.

@item
Each number from the CCS is converted to a sequence of bits or bytes by means
of a CES, which forms an encoding. Thus, a CES may be considered as a
function of a CCS which produces an encoding. Note that a CES may be
applied to more than one CCS.
94@end enumerate
95
96@*
Thus, an encoding may be considered as one or more CCS-es plus a CES.
98
99@*
Sometimes there is no CES, and in such cases the encoding is equivalent
to the CCS, e.g., KOI8-R or ASCII.
102
103@*
104An example of a more complicated encoding is UTF-8 which is the UCS
105(or Unicode) CCS plus the UTF-8 CES.
106
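@*
The split between a CCS and a CES can be sketched in code. In the
following illustration the character code is assumed to be already known
(the CCS step), and the hypothetical function @code{utf8_ces_encode}
performs only the CES step by turning a UCS code into UTF-8 bytes as
defined by RFC 3629. It is a sketch for explanation and is not the Newlib
converter.

@verbatim
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CES step only: encodes one UCS character code as UTF-8 bytes.
   Returns the number of bytes stored in 'out' (1..4), or 0 if 'ucs'
   cannot be encoded (a surrogate or a value above 0x10FFFF). */
size_t utf8_ces_encode(uint32_t ucs, unsigned char out[4])
{
    assert(out != NULL);

    if (ucs < 0x80) {
        out[0] = (unsigned char)ucs;
        return 1;
    }
    if (ucs < 0x800) {
        out[0] = (unsigned char)(0xC0 | (ucs >> 6));
        out[1] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 2;
    }
    if (ucs >= 0xD800 && ucs <= 0xDFFF)
        return 0;                       /* surrogates are not characters */
    if (ucs < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (ucs >> 12));
        out[1] = (unsigned char)(0x80 | ((ucs >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 3;
    }
    if (ucs <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (ucs >> 18));
        out[1] = (unsigned char)(0x80 | ((ucs >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((ucs >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 4;
    }
    return 0;
}
@end verbatim

The same function serves every character of the UCS, which is why the CES
can be kept separate from the CCS that assigns the codes.
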
107@*
108The following is a brief list of iconv library features:
109@itemize
110@item
111Generic architecture;
112@item
113Locale infrastructure support;
114@item
115Automatic generation of the program code which handles
116CES/CCS/Encoding/Names/Aliases dependencies;
117@item
The ability to choose a size- or speed-optimized
configuration;
120@item
121The ability to exclude a lot of unneeded code and data from the linking step.
122@end itemize
123
124
125
126
127@page
128@node Supported encodings
129@section Supported encodings
130@findex big5
131@findex cp775
132@findex cp850
133@findex cp852
134@findex cp855
135@findex cp866
136@findex euc_jp
137@findex euc_kr
138@findex euc_tw
139@findex iso_8859_1
140@findex iso_8859_10
141@findex iso_8859_11
142@findex iso_8859_13
143@findex iso_8859_14
144@findex iso_8859_15
145@findex iso_8859_2
146@findex iso_8859_3
147@findex iso_8859_4
148@findex iso_8859_5
149@findex iso_8859_6
150@findex iso_8859_7
151@findex iso_8859_8
152@findex iso_8859_9
153@findex iso_ir_111
154@findex koi8_r
155@findex koi8_ru
156@findex koi8_u
157@findex koi8_uni
158@findex ucs_2
159@findex ucs_2_internal
160@findex ucs_2be
161@findex ucs_2le
162@findex ucs_4
163@findex ucs_4_internal
164@findex ucs_4be
165@findex ucs_4le
166@findex us_ascii
167@findex utf_16
168@findex utf_16be
169@findex utf_16le
170@findex utf_8
171@findex win_1250
172@findex win_1251
173@findex win_1252
174@findex win_1253
175@findex win_1254
176@findex win_1255
177@findex win_1256
178@findex win_1257
179@findex win_1258
180@*
181The following is the list of currently supported encodings. The first column
182corresponds to the encoding name, the second column is the list of aliases,
the third column lists the names of its CES and CCS components, and the fourth column
is a short description.
185
186@multitable @columnfractions .20 .26 .24 .30
187@item
188Name
189@tab
190Aliases
191@tab
192CES/CCS
193@tab
194Short description
195@item
196@tab
197@tab
198@tab
199
200
201@item
202big5
203@tab
204csbig5, big_five, bigfive, cn_big5, cp950
205@tab
206table_pcs / big5, us_ascii
207@tab
The encoding for Traditional Chinese.
209
210
211@item
212cp775
213@tab
214ibm775, cspc775baltic
215@tab
216table / cp775
217@tab
The updated version of CP 437 that supports the Baltic languages.
219
220
221@item
222cp850
223@tab
224ibm850, 850, cspc850multilingual
225@tab
226table / cp850
227@tab
IBM 850 - the updated version of CP 437 where several Latin 1 characters have been
added instead of some less-often used characters like the line-drawing
and the Greek ones.
231
232
233@item
234cp852
235@tab
236ibm852, 852, cspcp852
@tab
table / cp852
@tab
IBM 852 - the updated version of CP 437 where several Latin 2 characters have been added
instead of some less-often used characters like the line-drawing and the Greek ones.
241
242
243@item
244cp855
245@tab
246ibm855, 855, csibm855
247@tab
248table / cp855
249@tab
250IBM 855 - the updated version of CP 437 that supports Cyrillic.
251
252
253@item
254cp866
255@tab
256866, IBM866, CSIBM866
257@tab
258table / cp866
259@tab
IBM 866 - the updated version of CP 855 that follows the logical Russian alphabet
ordering more closely; this alternative variant is preferred by many Russian users.
262
263
264@item
265euc_jp
266@tab
267eucjp
268@tab
269euc / jis_x0208_1990, jis_x0201_1976, jis_x0212_1990
270@tab
271EUC-JP - The EUC for Japanese.
272
273
274@item
275euc_kr
276@tab
277euckr
278@tab
279euc / ksx1001
280@tab
281EUC-KR - The EUC for Korean.
282
283
284@item
285euc_tw
286@tab
287euctw
288@tab
289euc / cns11643_plane1, cns11643_plane2, cns11643_plane14
290@tab
291EUC-TW - The EUC for Traditional Chinese.
292
293
294@item
295iso_8859_1
296@tab
297iso8859_1, iso88591, iso_8859_1:1987, iso_ir_100, latin1, l1, ibm819, cp819, csisolatin1
298@tab
299table / iso_8859_1
300@tab
301ISO 8859-1:1987 - Latin 1, West European.
302
303
304@item
305iso_8859_10
306@tab
307iso_8859_10:1992, iso_ir_157, iso885910, latin6, l6, csisolatin6, iso8859_10
308@tab
309table / iso_8859_10
310@tab
311ISO 8859-10:1992 - Latin 6, Nordic.
312
313
314@item
315iso_8859_11
316@tab
317iso8859_11, iso885911
318@tab
319table / iso_8859_11
320@tab
321ISO 8859-11 - Thai.
322
323
324@item
325iso_8859_13
326@tab
327iso_8859_13:1998, iso8859_13, iso885913
328@tab
329table / iso_8859_13
330@tab
331ISO 8859-13:1998 - Latin 7, Baltic Rim.
332
333
334@item
335iso_8859_14
336@tab
337iso_8859_14:1998, iso885914, iso8859_14
338@tab
339table / iso_8859_14
340@tab
341ISO 8859-14:1998 - Latin 8, Celtic.
342
343
344@item
345iso_8859_15
346@tab
iso885915, iso_8859_15:1998, iso8859_15
348@tab
349table / iso_8859_15
350@tab
351ISO 8859-15:1998 - Latin 9, West Europe, successor of Latin 1.
352
353
354@item
355iso_8859_2
356@tab
357iso8859_2, iso88592, iso_8859_2:1987, iso_ir_101, latin2, l2, csisolatin2
358@tab
359table / iso_8859_2
360@tab
361ISO 8859-2:1987 - Latin 2, East European.
362
363
364@item
365iso_8859_3
366@tab
367iso_8859_3:1988, iso_ir_109, iso8859_3, latin3, l3, csisolatin3, iso88593
368@tab
369table / iso_8859_3
370@tab
371ISO 8859-3:1988 - Latin 3, South European.
372
373
374@item
375iso_8859_4
376@tab
377iso8859_4, iso88594, iso_8859_4:1988, iso_ir_110, latin4, l4, csisolatin4
378@tab
379table / iso_8859_4
380@tab
381ISO 8859-4:1988 - Latin 4, North European.
382
383
384@item
385iso_8859_5
386@tab
387iso8859_5, iso88595, iso_8859_5:1988, iso_ir_144, cyrillic, csisolatincyrillic
388@tab
389table / iso_8859_5
390@tab
391ISO 8859-5:1988 - Cyrillic.
392
393
394@item
395iso_8859_6
396@tab
397iso_8859_6:1987, iso_ir_127, iso8859_6, ecma_114, asmo_708, arabic, csisolatinarabic, iso88596
398@tab
399table / iso_8859_6
400@tab
ISO 8859-6:1987 - Arabic.
402
403
404@item
405iso_8859_7
406@tab
407iso_8859_7:1987, iso_ir_126, iso8859_7, elot_928, ecma_118, greek, greek8, csisolatingreek, iso88597
408@tab
409table / iso_8859_7
410@tab
411ISO 8859-7:1987 - Greek.
412
413
414@item
415iso_8859_8
416@tab
417iso_8859_8:1988, iso_ir_138, iso8859_8, hebrew, csisolatinhebrew, iso88598
418@tab
419table / iso_8859_8
420@tab
421ISO 8859-8:1988 - Hebrew.
422
423
424@item
425iso_8859_9
426@tab
427iso_8859_9:1989, iso_ir_148, iso8859_9, latin5, l5, csisolatin5, iso88599
428@tab
429table / iso_8859_9
430@tab
431ISO 8859-9:1989 - Latin 5, Turkish.
432
433
434@item
435iso_ir_111
436@tab
437ecma_cyrillic, koi8_e, koi8e, csiso111ecmacyrillic
438@tab
439table / iso_ir_111
440@tab
441ISO IR 111/ECMA Cyrillic.
442
443
444@item
445koi8_r
446@tab
447cskoi8r, koi8r, koi8
448@tab
449table / koi8_r
450@tab
451RFC 1489 Cyrillic.
452
453
454@item
455koi8_ru
456@tab
457koi8ru
458@tab
459table / koi8_ru
460@tab
461The obsolete Ukrainian.
462
463
464@item
465koi8_u
466@tab
467koi8u
468@tab
469table / koi8_u
470@tab
471RFC 2319 Ukrainian.
472
473
474@item
475koi8_uni
476@tab
477koi8uni
478@tab
479table / koi8_uni
480@tab
481KOI8 Unified.
482
483
484@item
485ucs_2
486@tab
487ucs2, iso_10646_ucs_2, iso10646_ucs_2, iso_10646_ucs2, iso10646_ucs2, iso10646ucs2, csUnicode
488@tab
489ucs_2 / (UCS)
490@tab
491ISO-10646-UCS-2. Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
492
493
494@item
495ucs_2_internal
496@tab
497ucs2_internal, ucs_2internal, ucs2internal
498@tab
499ucs_2_internal / (UCS)
500@tab
501ISO-10646-UCS-2 in system byte order.
502NBSP is always interpreted as NBSP (BOM isn't supported).
503
504
505@item
506ucs_2be
507@tab
508ucs2be
509@tab
510ucs_2 / (UCS)
511@tab
512Big Endian version of ISO-10646-UCS-2 (in fact, equivalent to ucs_2).
513Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
514
515
516@item
517ucs_2le
518@tab
519ucs2le
520@tab
521ucs_2 / (UCS)
522@tab
523Little Endian version of ISO-10646-UCS-2.
524Little Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
525
526
527@item
528ucs_4
529@tab
530ucs4, iso_10646_ucs_4, iso10646_ucs_4, iso_10646_ucs4, iso10646_ucs4, iso10646ucs4
531@tab
532ucs_4 / (UCS)
533@tab
534ISO-10646-UCS-4. Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
535
536
537@item
538ucs_4_internal
539@tab
540ucs4_internal, ucs_4internal, ucs4internal
541@tab
542ucs_4_internal / (UCS)
543@tab
544ISO-10646-UCS-4 in system byte order.
545NBSP is always interpreted as NBSP (BOM isn't supported).
546
547
548@item
549ucs_4be
550@tab
551ucs4be
552@tab
553ucs_4 / (UCS)
554@tab
555Big Endian version of ISO-10646-UCS-4 (in fact, equivalent to ucs_4).
556Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
557
558
559@item
560ucs_4le
561@tab
562ucs4le
563@tab
564ucs_4 / (UCS)
565@tab
566Little Endian version of ISO-10646-UCS-4.
567Little Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
568
569
570@item
571us_ascii
572@tab
573ansi_x3.4_1968, ansi_x3.4_1986, iso_646.irv:1991, ascii, iso646_us, us, ibm367, cp367, csascii
574@tab
575us_ascii / (ASCII)
576@tab
5777-bit ASCII.
578
579
580@item
581utf_16
582@tab
583utf16
584@tab
585utf_16 / (UCS)
586@tab
587RFC 2781 UTF-16. The very first NBSP code in stream is interpreted as BOM.
588
589
590@item
591utf_16be
592@tab
593utf16be
594@tab
595utf_16 / (UCS)
596@tab
597Big Endian version of RFC 2781 UTF-16.
598NBSP is always interpreted as NBSP (BOM isn't supported).
599
600
601@item
602utf_16le
603@tab
604utf16le
605@tab
606utf_16 / (UCS)
607@tab
608Little Endian version of RFC 2781 UTF-16.
609NBSP is always interpreted as NBSP (BOM isn't supported).
610
611
612@item
613utf_8
614@tab
615utf8
616@tab
617utf_8 / (UCS)
618@tab
619RFC 3629 UTF-8.
620
621
622@item
623win_1250
624@tab
625cp1250
@tab
table / win_1250
@tab
Win-1250 - Croatian.
629
630
631@item
632win_1251
633@tab
634cp1251
635@tab
636table / win_1251
637@tab
638Win-1251 - Cyrillic.
639
640
641@item
642win_1252
643@tab
644cp1252
645@tab
646table / win_1252
647@tab
648Win-1252 - Latin 1.
649
650
651@item
652win_1253
653@tab
654cp1253
655@tab
656table / win_1253
657@tab
658Win-1253 - Greek.
659
660
661@item
662win_1254
663@tab
664cp1254
665@tab
666table / win_1254
667@tab
668Win-1254 - Turkish.
669
670
671@item
672win_1255
673@tab
674cp1255
675@tab
676table / win_1255
677@tab
678Win-1255 - Hebrew.
679
680
681@item
682win_1256
683@tab
684cp1256
685@tab
686table / win_1256
687@tab
688Win-1256 - Arabic.
689
690
691@item
692win_1257
693@tab
694cp1257
695@tab
696table / win_1257
697@tab
698Win-1257 - Baltic.
699
700
701@item
702win_1258
703@tab
704cp1258
705@tab
706table / win_1258
707@tab
Win-1258 - Vietnamese.
709@end multitable
710
711
712
713
714
715@page
716@node iconv design decisions
717@section iconv design decisions
718@findex CCS table
719@findex CES converter
720@findex Speed-optimized tables
721@findex Size-optimized tables
722@*
723The first iconv library design issue arises when considering the
724following two design approaches:
725
726@enumerate
727@item
Have modules which implement conversion from the encoding A to the encoding B
and vice versa, i.e., one conversion module relates to any two encodings.
@item
Have modules which implement conversion from the encoding A to the fixed
encoding C and vice versa, i.e., one conversion module relates to any
one encoding A and one fixed encoding C. In this case, to convert from
the encoding A to the encoding B, two modules are needed (in order to convert
from A to C and then from C to B).
736@end enumerate
737
738@*
Clearly, there is a tradeoff between commonality/flexibility and
efficiency: the first method is more efficient since it converts
directly; however, it is less flexible since a distinct module
is needed for each encoding pair.
743
744@*
The Newlib iconv model uses the second method and always converts through the 32-bit
UCS, but its design also allows one to write specialized conversion
modules if the conversion speed is critical.
748
749@*
The second design issue is how to break down (decompose) encodings.
The Newlib iconv library uses the fact that any encoding may be
considered as one or more CCS-es plus a CES. Accordingly, it decomposes its
conversion modules into a @dfn{CES converter} plus one or more @dfn{CCS
tables}. CCS tables map CCS codes to UCS and vice versa; the CES converters
map CCS codes to the encoding and vice versa.
756
757@*
As an example, let's consider conversion between the big5 encoding and
the EUC-TW encoding. The big5 encoding may be decomposed into the ASCII and BIG5
CCS-es plus the BIG5 CES. EUC-TW may be decomposed into the CNS11643_PLANE1, CNS11643_PLANE2,
and CNS11643_PLANE14 CCS-es plus the EUC CES.
762
763@*
The euc_tw -> big5 conversion is performed as follows:
765
766@enumerate
767@item
The EUC converter transforms the EUC-TW encoding to the corresponding CCS-es
(the CNS11643_PLANE1, CNS11643_PLANE2 and CNS11643_PLANE14
CCS-es);
771@item
772The obtained CCS codes are transformed to the UCS codes using the CNS11643_PLANE1,
773CNS11643_PLANE2 and CNS11643_PLANE14 CCS tables;
774@item
775The resulting UCS codes are transformed to the ASCII and BIG5 codes using
776the corresponding CCS tables;
777@item
778The obtained CCS codes are transformed to the big5 encoding using the corresponding
779CES converter.
780@end enumerate
781
782@*
783Analogously, the backward conversion is performed as follows:
784
785@enumerate
786@item
The BIG5 converter transforms the big5 encoding to the corresponding CCS-es
(the ASCII and BIG5 CCS-es);
@item
The obtained CCS codes are transformed to the UCS codes using the ASCII and BIG5 CCS tables;
@item
The resulting UCS codes are transformed to the CNS11643_PLANE1, CNS11643_PLANE2
and CNS11643_PLANE14 codes using the corresponding CCS tables;
794@item
795The obtained CCS codes are transformed to the EUC-TW encoding using the corresponding
796CES converter.
797@end enumerate
798
799@*
Note that the above is just an example; the real names of the CES converters
and the CCS tables implemented in the Newlib iconv are slightly different.
802
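@*
The idea of converting through UCS can be sketched with two toy modules.
Everything named below (@code{struct ucs_module}, @code{convert_via_ucs},
@code{NO_CODE} and the two module objects) is hypothetical and only
illustrates the second design approach; the mappings cover just ASCII and
the basic Russian letters of CP866 and ISO 8859-5, and the real Newlib
modules are table-driven.

@verbatim
#include <assert.h>
#include <stdint.h>

#define NO_CODE 0xFFFFFFFFu     /* marks "no mapping" in this sketch */

/* One conversion module of the second design: one encoding <-> UCS. */
struct ucs_module {
    uint32_t (*to_ucs)(uint32_t code);    /* encoding code -> UCS code */
    uint32_t (*from_ucs)(uint32_t ucs);   /* UCS code -> encoding code */
};

/* CP866, restricted to ASCII and the letters U+0410..U+044F. */
static uint32_t cp866_to_ucs(uint32_t c)
{
    if (c < 0x80) return c;
    if (c <= 0xAF) return 0x0410 + (c - 0x80);
    if (c >= 0xE0 && c <= 0xEF) return 0x0440 + (c - 0xE0);
    return NO_CODE;
}

static uint32_t cp866_from_ucs(uint32_t u)
{
    if (u < 0x80) return u;
    if (u >= 0x0410 && u <= 0x043F) return 0x80 + (u - 0x0410);
    if (u >= 0x0440 && u <= 0x044F) return 0xE0 + (u - 0x0440);
    return NO_CODE;
}

/* ISO 8859-5, restricted to ASCII and the letters U+0410..U+044F. */
static uint32_t iso8859_5_to_ucs(uint32_t c)
{
    if (c < 0x80) return c;
    if (c >= 0xB0 && c <= 0xEF) return c + 0x0360;
    return NO_CODE;
}

static uint32_t iso8859_5_from_ucs(uint32_t u)
{
    if (u < 0x80) return u;
    if (u >= 0x0410 && u <= 0x044F) return u - 0x0360;
    return NO_CODE;
}

const struct ucs_module cp866_module = { cp866_to_ucs, cp866_from_ucs };
const struct ucs_module iso8859_5_module = { iso8859_5_to_ucs, iso8859_5_from_ucs };

/* Converts one code from 'src' to 'dst', always passing through UCS. */
uint32_t convert_via_ucs(const struct ucs_module *src,
                         const struct ucs_module *dst, uint32_t code)
{
    assert(src != NULL && dst != NULL);
    uint32_t ucs = src->to_ucs(code);
    return (ucs == NO_CODE) ? NO_CODE : dst->from_ucs(ucs);
}
@end verbatim

With N encodings this arrangement needs N modules, whereas direct
pairwise conversion would need a module for every encoding pair.
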
803@*
The third design issue also relates to flexibility. Obviously, it isn't
desirable to always link all the CES converters and the CCS tables into the library;
instead, we want to be able to load the needed converters and tables
dynamically on demand. Dynamic loading isn't a problem on "big" machines such as
a PC, but it may be very problematic within "small" embedded systems.
809
810@*
Since the CCS tables are just data, it is possible to load them
dynamically from external files.  The CES converters, on the other hand,
are algorithms with some code, so a dynamic library loading
capability is required.
815
816@*
Apart from possible restrictions imposed by embedded systems (small
RAM for example), Newlib itself has no dynamic library support and
therefore, all the CES converters which will ever be used must be linked into
the library.   However, dynamic loading of the CCS tables is possible and is
implemented in the Newlib iconv library.  It may be enabled via the Newlib
configure script options.
823
824@*
The next design issue is fine-tuning the iconv library
configuration.  One important ability is for iconv to not link all its
converters and tables (if dynamic loading is not enabled) but instead
to enable only those encodings which are specified at configuration
time (see the section about the configure script options).
830
831@*
In addition, the Newlib iconv library configure options distinguish between
conversion directions. This means that not only the supported encodings
are selectable, but the conversion direction as well. For example, if a user wants
a configuration which allows conversions from UTF-8 to UTF-16 and
doesn't plan to use the "UTF-16 to UTF-8" conversions, he or she can
enable only
this conversion direction (i.e., no "UTF-16 -> UTF-8"-related code will
be included), thus saving some memory (note that this technique allows
one half of a CCS table, which may be quite big, to be excluded from linking).
841
842@*
One more design aspect is the speed- and size-optimized tables. Users can
select between them using configure script options. The
speed-optimized CCS tables are the same as the size-optimized ones in
the case of 8-bit CCS-es (e.g., KOI8-R), but for 16-bit CCS-es the size-optimized
CCS tables may be 1.5 to 2 times smaller than the speed-optimized ones. On the
other hand, conversion with the speed-optimized tables is several times faster.
849
850@*
It is worth stressing that support for a new encoding can't be
dynamically added to an already compiled Newlib library, even if it
needs only an additional CCS table and iconv is configured to use
external files with CCS tables (this isn't a fundamental restriction:
the possibility to add new table-based encoding support dynamically, by
just adding a new .cct file, could easily be implemented).
857
858@*
859Theoretically, the compiled-in CCS tables should be more appropriate for
860embedded systems than dynamically loaded CCS tables.  This is because the compiled-in tables are read-only and can be placed in ROM
861whereas dynamic loading requires RAM.  Moreover, in the current iconv
862implementation, a distinct copy of the dynamic CCS file is loaded for each opened iconv descriptor even in case of the same encoding.
This means, for example, that if two iconv descriptors for
"KOI8-R -> UCS-4BE" and "KOI8-R -> UTF-16BE" are opened, two copies of
the koi8-r .cct file will be loaded (actually, iconv loads only the needed part
of these files).  On the other hand, in the case of compiled-in CCS tables, there will always be only one copy.
867
868@page
869@node iconv configuration
870@section iconv configuration
871@findex iconv configuration
872@findex --enable-newlib-iconv-encodings
873@findex --enable-newlib-iconv-from-encodings
874@findex --enable-newlib-iconv-to-encodings
875@findex --enable-newlib-iconv-external-ccs
876@findex NLSPATH
877@*
To enable an encoding, the @option{--enable-newlib-iconv-encodings} configure
script option should be used. This option accepts a comma-separated list
of @emph{encodings} that should be enabled. The option enables each encoding in both
directions ("to" and "from").
882
883@*
884The @option{--enable-newlib-iconv-from-encodings} configure script option enables
885"from" support for each encoding that was passed to it.
886
887@*
888The @option{--enable-newlib-iconv-to-encodings} configure script option enables
889"to" support for each encoding that was passed to it.
890
891@*
Example: if a user plans to use only the "KOI8-R -> UTF-8", "UTF-8 -> ISO-8859-5" and
"KOI8-R -> UCS-2" conversions, the optimal way (the minimal iconv
code and data will be linked) is to configure Newlib with the following
options:
896@*
897@code{--enable-newlib-iconv-encodings=UTF-8
898--enable-newlib-iconv-from-encodings=KOI8-R
899--enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5}
900@*
901which is the same as
902@*
903@code{--enable-newlib-iconv-from-encodings=KOI8-R,UTF-8
904--enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5,UTF-8}
905@*
The user may also just use the
@*
@code{--enable-newlib-iconv-encodings=KOI8-R,ISO-8859-5,UTF-8,UCS-2}
@*
configure script option, but this is less optimal since some
unneeded data and code will be linked.
912
913@*
914The @option{--enable-newlib-iconv-external-ccs} option enables iconv's
915capabilities to work with the external CCS files.
916
917@*
The @option{--enable-target-optspace} Newlib configure script option also affects
the iconv library. If this option is present, the library uses the
size-optimized CCS tables. This means that only the size-optimized CCS
tables will be linked or, if the
@option{--enable-newlib-iconv-external-ccs} configure script option was used,
the iconv library will load the size-optimized tables. If the
@option{--enable-target-optspace} configure script option is disabled,
the speed-optimized CCS tables are used.
926
927@*
Note: @code{iconv_open} searches for .cct files in the @file{$NLSPATH/iconv_data/} directory.
Thus, the NLSPATH environment variable should be set.
930
931
932
933
934
935@page
936@node Encoding names
937@section Encoding names
938@findex encoding name
939@findex encoding alias
940@findex normalized name
941@*
Each encoding has one @dfn{name} and a number of @dfn{aliases}. When
a user works with the iconv library (i.e., when the @code{iconv_open} call
is used), either the name or any of the aliases may be used. The same applies
when encoding names are used in configure script options.
946
947@*
Names and aliases may be specified in any case (lowercase or uppercase
letters) and the @kbd{-} symbol is equivalent to the @kbd{_} symbol.
950
951@*
Internally, the Newlib iconv library always converts aliases to names. It
also converts names and aliases to the @dfn{normalized} form, which means
that all capital letters are converted to small letters and the @kbd{-}
symbols are converted to @kbd{_} symbols.
956
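@*
The normalization rule can be sketched as follows. The function name
@code{normalize_encoding_name} and its interface are assumptions made for
illustration; the actual Newlib routine may differ.

@verbatim
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: writes the normalized form of 'name' into 'out',
   whose capacity 'outcap' includes the terminating NUL character.
   Capital letters become small letters and '-' becomes '_'.
   Returns 0 on success, or -1 if 'out' is too small. */
int normalize_encoding_name(const char *name, char *out, size_t outcap)
{
    size_t i;

    assert(name != NULL && out != NULL);

    for (i = 0; name[i] != '\0'; i++) {
        unsigned char c = (unsigned char)name[i];

        if (i + 1 >= outcap)
            return -1;                  /* no room for this char plus NUL */
        out[i] = (c == '-') ? '_' : (char)tolower(c);
    }
    if (i >= outcap)
        return -1;                      /* covers outcap == 0 */
    out[i] = '\0';
    return 0;
}
@end verbatim

With this rule, @code{UTF-8}, @code{utf-8} and @code{Utf_8} all normalize
to @code{utf_8}.
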
957
958
959
960@page
961@node CCS tables
962@section CCS tables
963@findex Size-optimized CCS table
964@findex Speed-optimized CCS table
965@findex mktbl.pl Perl script
966@findex .cct files
967@findex The CCT tables source files
968@findex CCS source files
969@*
The iconv library stores files with CCS tables in the @emph{ccs/}
subdirectory. The CCS tables for any CCS may be kept in two forms: in binary form
(@dfn{.cct files}, see the @emph{ccs/binary/} subdirectory) and in the form
of compilable .c source files. The .cct files are only used when the
@option{--enable-newlib-iconv-external-ccs} configure script option is enabled.
The .c files are linked to the Newlib library if the corresponding
encoding is enabled.
977
978@*
As stated earlier, the Newlib iconv library performs all
conversions through the 32-bit UCS, but the codes which are used
in most CCS-es fit into the first 16-bit subset of the 32-bit UCS set.
Thus, in order to make the CCS tables more compact, the 16-bit UCS-2 is
used instead of the 32-bit UCS-4.
984
985@*
CCS tables may be 8 or 16 bits wide. 8-bit CCS tables map 8-bit CCS codes to
16-bit UCS-2 and vice versa, while 16-bit CCS tables map
16-bit CCS codes to 16-bit UCS-2 and vice versa.
8-bit tables are small, while 16-bit tables may be quite big.
Because of this, 16-bit CCS tables may be
either speed- or size-optimized. Size-optimized CCS tables are
smaller than speed-optimized ones, but the conversion process is
slower if the size-optimized CCS tables are used. 8-bit CCS tables have only
a speed-optimized variant.
995
Each CCS table (both speed- and size-optimized) consists of
@dfn{from_ucs} and @dfn{to_ucs} subtables. The "from_ucs" subtable maps
UCS-2 codes to CCS codes, while the "to_ucs" subtable maps CCS codes to
UCS-2 codes.
1000
1001@*
Almost all 16-bit CCS tables contain fewer than 0xFFFF codes, so
a lot of gaps exist.
1004
1005@subsection Speed-optimized tables format
1006@*
In the case of 8-bit speed-optimized CCS tables the "to_ucs" subtable format is
trivial: it is just an array of 256 16-bit UCS codes. Therefore, the
UCS-2 code @emph{Y} corresponding to a CCS code @emph{X} is calculated
as @emph{Y = to_ucs[X]}.
1011
1012@*
Obviously, the simplest way to create the "from_ucs" table or the
16-bit "to_ucs" table is to use a huge 16-bit array, as in the case
of the 8-bit "to_ucs" table. But almost all the 16-bit CCS tables contain
fewer than 0xFFFF code maps, and this fact may be exploited to reduce
the size of the CCS tables.
1018
1019@*
This section describes the "UCS-2 -> CCS" 8-bit CCS table format. The
16-bit "CCS -> UCS-2" CCS table format is the same, except for the mapping
direction and the number of CCS bits.
1023
1024@*
In the case of the 8-bit speed-optimized table the "from_ucs" subtable
corresponds to the "from_ucs" array and has the following layout:
1027
1028@*
1029from_ucs array:
1030@*
1031-------------------------------------
1032@*
10330xFF mapping (2 bytes) (only for
10348-bit table).
1035@*
1036-------------------------------------
1037@*
1038Heading block
1039@*
1040-------------------------------------
1041@*
1042Block 1
1043@*
1044-------------------------------------
1045@*
1046Block 2
1047@*
1048-------------------------------------
1049@*
1050  ...
1051@*
1052-------------------------------------
1053@*
1054Block N
1055@*
1056-------------------------------------
1057
1058@*
The 0x0000-0xFFFF 16-bit code range is divided into 256 code subranges. Each
subrange is represented by a 256-element @dfn{block} (256 1-byte
elements, or 256 2-byte elements in the case of a 16-bit CCS table) with
elements which are equivalent to the CCS codes of this subrange.
If the "UCS-2 -> CCS" mapping has big enough gaps, some blocks will be
absent and there will be fewer than 256 blocks.
1065
1066@*
Any element number @emph{m} of @dfn{the heading block} (which contains
256 2-byte elements) corresponds to the @emph{m}-th 256-element subrange.
If the subrange contains some codes, the @emph{m}-th element of
the heading block contains the offset of the corresponding block in the
"from_ucs" array. If there are no codes in the subrange, the heading
block element contains 0xFFFF.
1073
1074@*
If there are some gaps in a block, the corresponding block elements have
the 0xFF value. If there is a 0xFF code present in the CCS, its mapping
is defined in the first 2-byte element of the "from_ucs" array.
1078
1079@*
1080Having such a table format, the algorithm of searching the CCS code
1081@emph{X} which corresponds to the UCS-2 code @emph{Y} is as follows.
1082
1083@*
1084@enumerate
1085@item If @emph{Y} is equivalent to the value of the first 2-byte element
1086of the "from_ucs" array, @emph{X} is 0xFF. Else, continue to search.
1087
1088@item Calculate the block number: @emph{BlkN = (Y & 0xFF00) >> 8}.
1089
@item If the heading block element with number @emph{BlkN} is 0xFFFF, there
is no corresponding CCS code (error, wrong input data). Else, fetch the
"from_ucs" array index @emph{BlkOff} of the @emph{BlkN}-th block from that element.

@item Calculate the offset of the @emph{X} code in its block:
@emph{Xindex = Y & 0xFF}

@item If the value of the @emph{Xindex}-th element of the block (which is
@emph{from_ucs[BlkOff+Xindex]}) is 0xFF, there is no corresponding
CCS code (error, wrong input data). Else, @emph{X = from_ucs[BlkOff+Xindex]}.
1100@end enumerate
1101
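@*
The lookup steps listed above can be sketched in C as follows. The sketch
assumes, for simplicity, that the whole "from_ucs" array is held as 2-byte
elements (one CCS code per element) and that block offsets are element
indexes; the real Newlib storage may be packed differently. The names
@code{from_ucs_lookup}, @code{build_demo_table} and the macros are
hypothetical. The demo table covers only the ISO 8859-5 letters
0xB0-0xEF (U+0410-U+044F) and the 0xFF code (U+045F).

@verbatim
#include <assert.h>
#include <stdint.h>

#define HEADING_START  1        /* element 0 holds the 0xFF mapping */
#define BLOCK_SIZE     256
#define DEMO_TABLE_LEN (HEADING_START + 256 + BLOCK_SIZE)  /* one block */

/* Returns the 8-bit CCS code for the UCS-2 code 'y', or -1 if none. */
int from_ucs_lookup(const uint16_t *from_ucs, uint16_t y)
{
    assert(from_ucs != NULL);

    if (y == from_ucs[0])               /* step 1: the 0xFF special case */
        return 0xFF;

    unsigned blk_n = (y & 0xFF00u) >> 8;                /* step 2 */
    uint16_t blk_off = from_ucs[HEADING_START + blk_n]; /* step 3 */
    if (blk_off == 0xFFFF)
        return -1;                      /* the whole subrange is absent */

    uint16_t x = from_ucs[blk_off + (y & 0xFFu)];       /* steps 4 and 5 */
    return (x == 0xFF) ? -1 : (int)x;   /* 0xFF marks a gap in the block */
}

/* Fills 'tbl' with a one-block demo table (see the lead-in text). */
void build_demo_table(uint16_t tbl[DEMO_TABLE_LEN])
{
    const int block = HEADING_START + 256;  /* index of the only block */
    int i;

    tbl[0] = 0x045F;                    /* UCS-2 code mapped to CCS 0xFF */
    for (i = 0; i < 256; i++)
        tbl[HEADING_START + i] = 0xFFFF;    /* no blocks by default */
    for (i = 0; i < BLOCK_SIZE; i++)
        tbl[block + i] = 0xFF;              /* block starts as all gaps */

    tbl[HEADING_START + 0x04] = (uint16_t)block;  /* U+0400..U+04FF */
    for (i = 0x10; i <= 0x4F; i++)
        tbl[block + i] = (uint16_t)(0xB0 + (i - 0x10));
}
@end verbatim

A lookup costs two array reads and no searching, which is the reason
these tables are called speed-optimized.
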
1102@subsection Size-optimized tables format
1103@*
As stated above, size-optimized tables exist only for 16-bit CCS-es.
This is because the difference between the speed-optimized
and the size-optimized table sizes is too small in the case of 8-bit CCS-es.
1107
1108@*
1109Formats of the "to_ucs" and "from_ucs" subtables are equivalent in case of
1110size-optimized tables.
1111
This section describes the format of the "UCS-2 -> CCS" size-optimized
CCS table. The format of the "CCS -> UCS-2" table is the same.

The idea of the size-optimized tables is to split the UCS-2 codes
("from" codes) into @dfn{ranges} (a @dfn{range} is a number of consecutive UCS-2 codes).
Then CCS codes ("to" codes) are stored only for the codes from these
ranges. Distinct "from" codes which belong to no range (@dfn{unranged codes}) are stored
together with the corresponding "to" codes.
1120
1121@*
1122The following is the layout of the size-optimized table array:
1123
1124@*
1125size_arr array:
1126@*
1127-------------------------------------
1128@*
1129Ranges number (2 bytes)
1130@*
1131-------------------------------------
1132@*
1133Unranged codes number (2 bytes)
1134@*
1135-------------------------------------
1136@*
1137Unranged codes array index (2 bytes)
1138@*
1139-------------------------------------
1140@*
1141Ranges indexes (triads)
1142@*
1143-------------------------------------
1144@*
1145Ranges
1146@*
1147-------------------------------------
1148@*
1149Unranged codes array
1150@*
1151-------------------------------------
1152
1153@*
The @dfn{Ranges indexes} section of @emph{size_arr} helps to find
the offset of the needed range in @emph{size_arr} and consists of
triads in the following format:
1157@*
1158the first code in range, the last code in range, range offset.
1159
1160@*
The array of these triads is sorted by the first element, therefore it is
possible to quickly find the needed range index.
1163
1164@*
1165Each range has the corresponding sub-array containing the "to" codes. These
1166sub-arrays are stored in the place marked as "Ranges" in the layout
1167diagram.
1168
1169@*
The "Unranged codes array" contains pairs ("from" code, "to" code) for
1171each unranged code. The array of these pairs is sorted by "from" code
1172values, therefore it is possible to find the needed pair quickly.
1173
1174@*
Note that each range requires 6 bytes to form its index. If, for
example, there are two ranges (1 - 5 and 9 - 10) and one unranged code
(7), 12 bytes are needed for the two range indexes and 4 bytes for the unranged
code (16 bytes in total). But it is better to join both ranges as 1 - 10 and
mark codes 6 and 8 as absent. In this case, only 6 bytes for the single
range index and 4 bytes to mark codes 6 and 8 as absent are needed
(10 bytes in total). This optimization is done in the size-optimized tables.
Thus, ranges may contain small gaps. The absent codes in ranges are marked
as 0xFFFF.
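
@*
The trade-off above can be checked with a small calculation. The following
C sketch is only an illustration and is not part of Newlib: the names
@code{demo_range} and @code{demo_table_bytes} are hypothetical, and it assumes
that every field of the table is 16 bits wide. Unlike the figures above, it
counts every byte of the table, including the header and all "to" codes.

```c
#include <assert.h>
#include <stddef.h>

/* One contiguous range of "from" codes, inclusive on both ends. */
struct demo_range { unsigned first, last; };

/*
 * Total size in bytes of a size-optimized table, assuming every field is
 * 16 bits wide: a 3-field header, one triad per range, one "to" slot per
 * code covered by a range (gaps included), one pair per unranged code.
 */
static size_t demo_table_bytes(const struct demo_range *r, size_t nranges,
                               size_t nunranged)
{
    size_t bytes = 3 * 2;                   /* header: three 16-bit fields */

    for (size_t i = 0; i < nranges; i++) {
        assert(r[i].first <= r[i].last);
        bytes += 3 * 2;                     /* range index triad */
        bytes += (size_t)(r[i].last - r[i].first + 1) * 2;  /* "to" slots */
    }
    return bytes + nunranged * 2 * 2;       /* ("from", "to") pairs */
}

/* The example from the text: ranges 1-5 and 9-10 plus unranged code 7 ... */
static const struct demo_range demo_split[] = { { 1, 5 }, { 9, 10 } };

/* ... versus one joined range 1-10 with codes 6 and 8 marked absent. */
static const struct demo_range demo_joined[] = { { 1, 10 } };
```

@*
Under these assumptions, @code{demo_table_bytes(demo_split, 2, 1)} gives 36
bytes and @code{demo_table_bytes(demo_joined, 1, 0)} gives 32 bytes, so joining
the ranges still wins when the "to" codes are counted as well.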
1184
1185@*
Note that a pair of adjacent "from" codes is stored as two unranged codes,
since forming a range for them takes more space than storing two unranged
codes (5 16-bit values against 4: a three-value range index plus two "to"
codes, versus two two-value pairs).
1189
1190@*
The algorithm for finding the CCS code
@emph{X} which corresponds to the UCS-2 code @emph{Y} (input) in the "UCS-2 ->
CCS" size-optimized table is as follows.
1194
1195@*
1196@enumerate
@item Try to find the corresponding triad in the "Ranges indexes"
section. Since the array is sorted, a binary search finds it quickly
(divide by 2, compare, etc).
1200
1201@item If the triad is found, fetch the @emph{X} code from the corresponding
1202range array. If it is 0xFFFF, return an error.
1203
@item If there is no corresponding triad, search for the @emph{X} code among the
sorted unranged codes. Return an error if nothing was found.
1206@end enumerate
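
@*
The steps above can be sketched in C as follows. This is a minimal
illustration, not the actual Newlib code: the function name
@code{demo_size_lookup}, the sample table @code{demo_tbl} and its "to" codes
are hypothetical, and the sketch assumes that @emph{size_arr} is an array of
16-bit values in which the range offsets and the unranged codes array index
are plain indexes into the same array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_INVALID 0xFFFFu   /* "no corresponding code" marker */

/* Positions of the header fields and of the first triad in size_arr. */
enum { DEMO_NRANGES = 0, DEMO_NUNRANGED = 1, DEMO_UNRANGED_IDX = 2,
       DEMO_TRIADS = 3 };

/* Returns the "to" code for the "from" code y, or DEMO_INVALID. */
static uint16_t demo_size_lookup(const uint16_t *size_arr, uint16_t y)
{
    assert(size_arr != NULL);

    /* Step 1: binary search over the triads (first, last, offset). */
    int lo = 0, hi = (int)size_arr[DEMO_NRANGES] - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        const uint16_t *t = &size_arr[DEMO_TRIADS + 3 * mid];
        if (y < t[0])
            hi = mid - 1;
        else if (y > t[1])
            lo = mid + 1;
        else    /* Step 2: y is in this range; a gap reads as 0xFFFF. */
            return size_arr[t[2] + (y - t[0])];
    }

    /* Step 3: binary search over the sorted ("from", "to") pairs. */
    const uint16_t *u = &size_arr[size_arr[DEMO_UNRANGED_IDX]];
    lo = 0;
    hi = (int)size_arr[DEMO_NUNRANGED] - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        uint16_t from = u[2 * mid];
        if (y < from)
            hi = mid - 1;
        else if (y > from)
            lo = mid + 1;
        else
            return u[2 * mid + 1];
    }
    return DEMO_INVALID;
}

/* Sample table: range 1-10 with gaps at 6 and 8, one unranged code 0x20. */
static const uint16_t demo_tbl[] = {
    1,                                        /* ranges number */
    1,                                        /* unranged codes number */
    16,                                       /* unranged codes array index */
    1, 10, 6,                                 /* triad: first, last, offset */
    0x0101, 0x0102, 0x0103, 0x0104, 0x0105,   /* "to" codes for 1..5 */
    0xFFFF, 0x0107, 0xFFFF, 0x0109, 0x010A,   /* 6 and 8 are absent */
    0x0020, 0x0141                            /* pair: "from" 0x20, "to" 0x141 */
};
```

@*
With this sample table, looking up code 7 goes through the range, code 0x20
falls back to the unranged codes array, and codes 6 and 0x30 both yield the
0xFFFF error marker.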
1207
@subsection .cct and .c CCS Table files
1209@*
1210The .c source files for 8-bit CCS tables have "to_ucs" and "from_ucs"
1211speed-optimized tables. The .c source files for 16-bit CCS tables have
1212"to_ucs_speed", "to_ucs_size", "from_ucs_speed" and "from_ucs_size"
1213tables.
1214
1215@*
When .c files are compiled and used, all the 16-bit and 32-bit values
have the native endian format (Big Endian for the BE systems and Little
Endian for the LE systems) since they are compiled for the system before
they are used.
1220
1221@*
In case of .cct files, which are intended for dynamic CCS table
loading, the CCS tables are stored either in LE or BE format. Since the
.cct files are generated by the 'mktbl.pl' Perl script, it is possible
to choose the endianness of the tables. It is also possible to store two
copies (both LE and BE) of the CCS tables in one .cct file. The default
.cct files (which come with the Newlib sources) have both LE and BE CCS
tables. The Newlib iconv library automatically chooses the needed CCS tables
(with appropriate endianness).
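
@*
As a rough illustration of what choosing the appropriate endianness involves,
the following C sketch detects the host byte order, picks the matching table
copy and decodes a 16-bit value stored in a known byte order. It is not the
Newlib loader: the names @code{demo_host_is_be}, @code{demo_pick_copy} and
@code{demo_read_u16} are hypothetical, and nothing here reflects the real
.cct file layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 on a big-endian host, 0 on a little-endian one. */
static int demo_host_is_be(void)
{
    const uint16_t probe = 0x0102;

    /* On a BE host the most significant byte (0x01) is stored first. */
    return *(const unsigned char *)&probe == 0x01;
}

/* Pick the table copy whose byte order matches the host. */
static const void *demo_pick_copy(const void *be_copy, const void *le_copy)
{
    return demo_host_is_be() ? be_copy : le_copy;
}

/* Decode one 16-bit value stored in the given byte order. */
static uint16_t demo_read_u16(const unsigned char *p, int stored_be)
{
    assert(p != NULL);
    return stored_be ? (uint16_t)((p[0] << 8) | p[1])
                     : (uint16_t)((p[1] << 8) | p[0]);
}
```

@*
Picking the copy that matches the host lets the table be used in place,
while a byte-wise decoder such as @code{demo_read_u16} would be needed only if
a file held just the other byte order.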
1230
1231@*
Note that the .cct files are only used when the
@option{--enable-newlib-iconv-external-ccs} configure option is used.
1234
1235@subsection The 'mktbl.pl' Perl script
1236@*
1237The 'mktbl.pl' script is intended to generate .cct and .c CCS table
1238files from the @dfn{CCS source files}.
1239
1240@*
The CCS source files are just text files which have one or more columns
with the CCS <-> UCS-2 code mapping. For an example of a CCS source
file, see one of the URLs given below.
1244
1245@*
1246The following table describes where the source files for CCS table files
1247provided by the Newlib distribution are located.
1248
1249@multitable @columnfractions .25 .75
1250@item
1251Name
1252@tab
1253URL
1254
1255@item
1256@tab
1257
1258@item
1259big5
1260@tab
1261http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
1262
1263@item
1264cns11643_plane1
1265cns11643_plane14
1266cns11643_plane2
1267@tab
1268http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT
1269
1270@item
1271cp775
1272cp850
1273cp852
1274cp855
1275cp866
1276@tab
1277http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
1278
1279@item
1280iso_8859_1
1281iso_8859_2
1282iso_8859_3
1283iso_8859_4
1284iso_8859_5
1285iso_8859_6
1286iso_8859_7
1287iso_8859_8
1288iso_8859_9
1289iso_8859_10
1290iso_8859_11
1291iso_8859_13
1292iso_8859_14
1293iso_8859_15
1294@tab
1295http://www.unicode.org/Public/MAPPINGS/ISO8859/
1296
1297@item
1298iso_ir_111
1299@tab
1300http://crl.nmsu.edu/~mleisher/csets/ISOIR111.TXT
1301
1302@item
1303jis_x0201_1976
1304jis_x0208_1990
1305jis_x0212_1990
1306@tab
1307http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT
1308
1309@item
1310koi8_r
1311@tab
1312http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT
1313
1314@item
1315koi8_ru
1316@tab
1317http://crl.nmsu.edu/~mleisher/csets/KOI8RU.TXT
1318
1319@item
1320koi8_u
1321@tab
1322http://crl.nmsu.edu/~mleisher/csets/KOI8U.TXT
1323
1324@item
1325koi8_uni
1326@tab
1327http://crl.nmsu.edu/~mleisher/csets/KOI8UNI.TXT
1328
1329@item
1330ksx1001
1331@tab
1332http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT
1333
1334@item
1335win_1250
1336win_1251
1337win_1252
1338win_1253
1339win_1254
1340win_1255
1341win_1256
1342win_1257
1343win_1258
1344@tab
1345http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
1346@end multitable
1347
The CCS source files aren't distributed with Newlib because of license
restrictions on most of the Unicode.org files.
1350
The following are the 'mktbl.pl' options which were used to generate the .cct
files. Note that to generate the .c CCS table source files, the @option{-s}
option should be added.
1354
1355@enumerate
@item For the iso_8859_10.cct, iso_8859_13.cct, iso_8859_14.cct, iso_8859_15.cct,
iso_8859_1.cct, iso_8859_2.cct, iso_8859_3.cct, iso_8859_4.cct,
iso_8859_5.cct, iso_8859_6.cct, iso_8859_7.cct, iso_8859_8.cct,
iso_8859_9.cct, iso_8859_11.cct, win_1250.cct, win_1252.cct, win_1254.cct,
win_1256.cct, win_1258.cct, win_1251.cct,
win_1253.cct, win_1255.cct, win_1257.cct,
koi8_r.cct, koi8_ru.cct, koi8_u.cct, koi8_uni.cct, iso_ir_111.cct,
big5.cct, cp775.cct, cp850.cct, cp852.cct, cp855.cct, cp866.cct, cns11643.cct
files, only the @option{-i <SRC_FILE_NAME>} option was used.
1365
1366@item To generate the jis_x0208_1990.cct file, the
1367@option{-i jis_x0208_1990.txt -x 2 -y 3} options were used.
1368
1369@item To generate the cns11643_plane1.cct file, the
1370@option{-i cns11643.txt -p1 -N cns11643_plane1  -o cns11643_plane1.cct}
1371options were used.
1372
1373@item To generate the cns11643_plane2.cct file, the
1374@option{-i cns11643.txt -p2 -N cns11643_plane2  -o cns11643_plane2.cct}
1375options were used.
1376
1377@item To generate the cns11643_plane14.cct file, the
1378@option{-i cns11643.txt -p0xE -N cns11643_plane14  -o cns11643_plane14.cct}
1379options were used.
1380@end enumerate
1381
1382@*
1383For more info about the 'mktbl.pl' options, see the 'mktbl.pl -h' output.
1384
1385@*
It is assumed that CCS codes are 16 bits wide or less. If there are wider CCS codes
in the CCS source file, the bits above the lower 16 define the plane (see the
cns11643.txt CCS source file).
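
@*
The plane convention can be illustrated with the following C sketch. It is
not taken from 'mktbl.pl' (which is a Perl script): the function name
@code{demo_split_plane} is hypothetical, and the value 0xE2121 used in the
note below is only an illustrative wide code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Split a CCS code wider than 16 bits into the plane number (the bits
 * above the lower 16) and the 16-bit code inside that plane.
 */
static void demo_split_plane(uint32_t wide, unsigned *plane, uint16_t *code)
{
    assert(plane != NULL && code != NULL);
    *plane = (unsigned)(wide >> 16);
    *code = (uint16_t)(wide & 0xFFFFu);
}
```

@*
For example, a wide code of 0xE2121 splits into plane 0xE and the 16-bit code
0x2121, which is the kind of entry that a plane option such as
@option{-p0xE} would select.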
1389
1390@*
Sometimes it is impossible to map some CCS codes to the 16-bit UCS if, for example,
several different CCS codes are mapped to one UCS-2 code or one CCS code is mapped to
a pair of UCS-2 codes. In these cases, such CCS codes (@dfn{lost
codes}) aren't rejected. Instead, they are mapped to the default
UCS-2 code (which is currently the @kbd{?} character's code).
1396
1397
1398
1399
1400
1401@page
1402@node CES converters
1403@section CES converters
1404@findex PCS
1405@*
1406Similar to the CCS tables, CES converters are also split into "from UCS"
1407and "to UCS" parts. Depending on the iconv library configuration, these
1408parts are enabled or disabled.
1409
1410@*
The following is the list of CES converters which are currently present
in the Newlib iconv library.
1413
1414@itemize @bullet
1415@item
1416@emph{euc} - supports the @emph{euc_jp}, @emph{euc_kr} and @emph{euc_tw}
1417encodings. The @emph{euc} CES converter uses the @emph{table} and the
1418@emph{us_ascii} CES converters.
1419
1420@item
1421@emph{table} - this CES converter corresponds to "null" and just performs
1422tables-based conversion using 8- and 16-bit CCS tables. This converter
1423is also used by any other CES converter which needs the CCS table-based
1424conversions. The @emph{table} converter is also responsible for .cct files
1425loading.
1426
1427@item
@emph{table_pcs} - this is a wrapper over the @emph{table} converter
intended for 16-bit encodings which also use the @dfn{Portable
Character Set} (@dfn{PCS}), which is the same as @emph{US-ASCII}.
This means that if the first byte of the CCS code is in the range [0x00-0x7f],
it is a 7-bit PCS code. Otherwise, it is a 16-bit CCS code. Of course,
the 16-bit codes must not contain bytes in the range [0x00-0x7f].
1434The @emph{big5} encoding uses the @emph{table_pcs} CES converter and the
1435@emph{table_pcs} CES converter depends on the @emph{table} CES converter.
1436
1437@item
1438@emph{ucs_2} - intended for the @emph{ucs_2}, @emph{ucs_2be} and
1439@emph{ucs_2le} encodings support.
1440
1441@item
1442@emph{ucs_4} - intended for the @emph{ucs_4}, @emph{ucs_4be} and
1443@emph{ucs_4le} encodings support.
1444
1445@item
1446@emph{ucs_2_internal} - intended for the @emph{ucs_2_internal} encoding support.
1447
1448@item
1449@emph{ucs_4_internal} - intended for the @emph{ucs_4_internal} encoding support.
1450
1451@item
1452@emph{us_ascii} - intended for the @emph{us_ascii} encoding support. In
1453principle, the most natural way to support the @emph{us_ascii} encoding
1454is to define the @emph{us_ascii} CCS and use the @emph{table} CES
1455converter. But for the optimization purposes, the specialized
1456@emph{us_ascii} CES converter was created.
1457
1458@item
1459@emph{utf_16} - intended for the @emph{utf_16}, @emph{utf_16be} and
1460@emph{utf_16le} encodings support.
1461
1462@item
1463@emph{utf_8} - intended for the @emph{utf_8} encoding support.
1464@end itemize
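
@*
The first-byte rule described for the @emph{table_pcs} converter can be
sketched in C as follows. This is an illustration only, not the Newlib
converter: the function name @code{demo_pcs_next} and its parameters are
hypothetical, and the sketch assumes that a 16-bit code is stored with its
high byte first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Consume one character from in[0..len). Returns the number of bytes used
 * (1 for a PCS code, 2 for a 16-bit CCS code), or 0 if the input is empty
 * or ends in the middle of a 16-bit code.
 */
static size_t demo_pcs_next(const unsigned char *in, size_t len,
                            uint16_t *code, int *is_pcs)
{
    assert(in != NULL && code != NULL && is_pcs != NULL);

    if (len == 0)
        return 0;
    if (in[0] <= 0x7f) {            /* first byte in [0x00-0x7f]: PCS code */
        *code = in[0];
        *is_pcs = 1;
        return 1;
    }
    if (len < 2)                    /* truncated 16-bit code */
        return 0;
    *code = (uint16_t)((in[0] << 8) | in[1]);
    *is_pcs = 0;
    return 2;
}
```

@*
A real converter would then pass PCS codes through unchanged and hand the
16-bit codes to the table-based conversion.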
1465
1466
1467
1468
1469
1470@page
1471@node The encodings description file
1472@section The encodings description file
1473@findex encoding.deps description file
1474@findex mkdeps.pl Perl script
1475@*
To simplify the process of adding support for new encodings, a lot of
the "glue" files are generated automatically.
1478
1479@*
There is the 'encoding.deps' file in the @emph{lib/} subdirectory which
is used to describe the properties of encodings. The 'mkdeps.pl' Perl script
uses 'encoding.deps' to generate the "glue" files.
1483
1484@*
1485The 'encoding.deps' file is composed of sections, each section consists
1486of entries, each entry contains some encoding/CES/CCS description.
1487
1488@*
1489The 'encoding.deps' file's syntax is very simple. Currently only two
1490sections are defined: @emph{ENCODINGS} and @emph{CES_DEPENDENCIES}.
1491
1492@*
1493Each @emph{ENCODINGS} section's entry describes one encoding and
1494contains the following information.
1495
1496@itemize @bullet
1497@item
1498Encoding name (the @emph{ENCODING} field). The name should
1499be unique and only one name is possible.
1500
1501@item
1502The encoding's CES converter name (the @emph{CES} field). Only one CES
1503converter is allowed.
1504
1505@item
1506The whitespace-separated list of CCS table names which are used by the
1507encoding (the @emph{CCS} field).
1508
1509@item
The whitespace-separated list of alias names (the @emph{ALIASES}
field).
1512@end itemize
1513
1514@*
Note that all names in the 'encoding.deps' file must be in the normalized
form.
1517
1518@*
Each @emph{CES_DEPENDENCIES} section's entry describes the dependencies of
one CES converter. For example, the @emph{euc} CES converter depends on
the @emph{table} and the @emph{us_ascii} CES converters since the
@emph{euc} CES converter uses them. This means that both the @emph{table}
and @emph{us_ascii} CES converters should be linked if the @emph{euc}
CES converter is enabled.
1525
1526@*
1527The @emph{CES_DEPENDENCIES} section defines the following:
1528
1529@itemize @bullet
1530@item
1531the CES converter name for which the dependencies are defined in this
1532entry (the @emph{CES} field);
1533
1534@item
1535the whitespace-separated list of CES converters which are needed for
1536this CES converter (the @emph{USED_CES} field).
1537@end itemize
1538
1539@*
The 'mkdeps.pl' Perl script automatically solves the following tasks.
1541
1542@itemize @bullet
1543@item
The user works with the iconv library in terms of encodings and doesn't need
to know anything about CES converters and CCS tables. The script automatically
generates code which enables all needed CES converters and CCS tables
for all encodings which were enabled by the user.
1548
1549@item
1550The CES converters may have dependencies and the script automatically
1551generates the code which handles these dependencies.
1552
1553@item
1554The list of encoding's aliases is also automatically generated.
1555
1556@item
1557The script uses a lot of macros in order to enable only the minimum set
1558of code/data which is needed to support the requested encodings in the
1559requested directions.
1560@end itemize
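
@*
The macro-based approach described above can be pictured with the following
C sketch. It is not the generated Newlib code: all @code{DEMO_}-prefixed macro
names and the @code{demo_enabled_ces} array are hypothetical and only show
how preprocessor checks can pull in the converters that an enabled converter
depends on, while everything else is left out of the build.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical switch: pretend the user enabled an encoding that needs euc. */
#define DEMO_CES_EUC

/* Dependency glue: euc uses table and us_ascii, so enable them as well. */
#ifdef DEMO_CES_EUC
#  ifndef DEMO_CES_TABLE
#    define DEMO_CES_TABLE
#  endif
#  ifndef DEMO_CES_US_ASCII
#    define DEMO_CES_US_ASCII
#  endif
#endif

/* Only the enabled converters end up in the array (and in the binary). */
static const char *const demo_enabled_ces[] = {
#ifdef DEMO_CES_EUC
    "euc",
#endif
#ifdef DEMO_CES_TABLE
    "table",
#endif
#ifdef DEMO_CES_US_ASCII
    "us_ascii",
#endif
#ifdef DEMO_CES_UTF_8
    "utf_8",             /* not enabled above, so compiled out */
#endif
    NULL                 /* terminator */
};

/* Count the converters that survived preprocessing. */
static size_t demo_enabled_count(void)
{
    size_t n = 0;

    while (demo_enabled_ces[n] != NULL)
        n++;
    return n;
}
```

@*
In this sketch, enabling only the hypothetical @code{DEMO_CES_EUC} switch
leaves three converters in the array: @emph{euc}, @emph{table} and
@emph{us_ascii}.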
1561
1562@*
The 'mkdeps.pl' Perl script is intended to interpret the 'encoding.deps'
file and generate the following files.
1565
1566@itemize @bullet
1567@item
@emph{lib/encnames.h} - this header file contains macro definitions for all
encoding names.
1570
1571@item
@emph{lib/aliasesbi.c} - the array of encoding names and aliases. The array
is used to find the name of the requested encoding by its alias.
1574
1575@item
@emph{ces/cesbi.c} - this file defines two arrays
(@code{_iconv_from_ucs_ces} and @code{_iconv_to_ucs_ces}) which contain
descriptions of the enabled "from UCS" and "to UCS" CES converters and the
names of the encodings which are supported by these CES converters.
1580
1581@item
@emph{ces/cesbi.h} - this file contains the set of macros which derive
the set of CES converters to be enabled from the set of
enabled encodings (given through macros defined in the
@emph{newlib.h} file). Note that one CES converter may handle several
encodings.
1587
1588@item
1589@emph{ces/cesdeps.h} - the CES converters dependencies are handled in
1590this file.
1591
1592@item
1593@emph{ccs/ccsdeps.h} - the array of linked-in CCS tables is defined
1594here.
1595
1596@item
@emph{ccs/ccsnames.h} - this header file contains macro definitions for all
CCS names.
1599
1600@item
1601@emph{encoding.aliases} - the list of supported encodings and their
1602aliases which is intended for the Newlib configure scripts in order to
1603handle the iconv-related configure script options.
1604@end itemize
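
@*
To show what finding an encoding name by its alias amounts to, here is a
minimal C sketch. It does not reproduce the format of the generated
@emph{lib/aliasesbi.c} file: the @code{demo_alias} structure, the
@code{demo_resolve} function and the sample aliases are all hypothetical, and
the input name is assumed to be normalized already.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One alias and the encoding name it stands for (illustrative data). */
struct demo_alias { const char *alias; const char *name; };

static const struct demo_alias demo_aliases[] = {
    { "big5",    "big5"   },
    { "cn_big5", "big5"   },
    { "koi8_r",  "koi8_r" },
    { "cskoi8r", "koi8_r" }
};

/* Returns the encoding name for a normalized alias, or NULL if unknown. */
static const char *demo_resolve(const char *normalized)
{
    assert(normalized != NULL);

    for (size_t i = 0; i < sizeof demo_aliases / sizeof demo_aliases[0]; i++)
        if (strcmp(demo_aliases[i].alias, normalized) == 0)
            return demo_aliases[i].name;
    return NULL;
}
```

@*
A linear scan keeps the sketch short. With a sorted table, a binary search
would serve the same purpose.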
1605
1606
1607
1608
1609
1610@page
1611@node How to add new encoding
1612@section How to add new encoding
1613@*
First, the new encoding should be broken down into CCS and CES. Then,
the process of adding the new encoding is split into the following activities.
1616
1617@enumerate
@item Generate the .cct CCS file and the .c source file for the new
encoding's CCS (if it isn't already present). To do this, the CCS source
file must be available and the 'mktbl.pl' script should be used.
1621
1622@item Write the corresponding CES converter (if it isn't already
1623present). Use the existing CES converters as an example.
1624
1625@item
1626Add the corresponding entries to the 'encoding.deps' file and regenerate
1627the autogenerated "glue" files using the 'mkdeps.pl' script.
1628
1629@item
1630Don't forget to add entries to the newlib/newlib.hin file.
1631
1632@item
1633Of course, the 'Makefile.am'-s should also be updated (if new files were
1634added) and the 'Makefile.in'-s should be regenerated using the correct
1635version of 'automake'.
1636
1637@item
1638Don't forget to update the documentation (the list of
1639supported encodings and CES converters).
1640@end enumerate
1641
In case a new encoding doesn't fit the CES/CCS decomposition model or
it is desired to add specialized (non UCS-based) conversion support,
the Newlib iconv library code should be upgraded.
1645
1646
1647
1648
1649
1650@page
1651@node The locale support interfaces
1652@section The locale support interfaces
1653@*
1654The newlib iconv library also has some interface functions (besides the
1655@code{iconv}, @code{iconv_open} and @code{iconv_close} interfaces) which
1656are intended for the Locale subsystem. All the locale-related code is
1657placed in the @emph{lib/iconvnls.c} file.
1658
1659@*
1660The following is the description of the locale-related interfaces:
1661
1662@itemize @bullet
1663@item
1664@code{_iconv_nls_open} - opens two iconv descriptors for "CCS ->
1665wchar_t" and "wchar_t -> CCS" conversions. The normalized CCS name is
1666passed in the function parameters. The @emph{wchar_t} characters encoding is
1667either ucs_2_internal or ucs_4_internal depending on size of
1668@emph{wchar_t}.
1669
1670@item
@code{_iconv_nls_conv} - the function is similar to the @code{iconv}
function, but if there is no character in the output encoding which
1673corresponds to the character in the input encoding, the default
1674conversion isn't performed (the @code{iconv} function sets such output
1675characters to the @kbd{?} symbol and this is the behavior, which is
1676specified in SUSv3).
1677
1678@item
1679@code{_iconv_nls_get_state} - returns the current encoding's shift state
1680(the @code{mbstate_t} object).
1681
1682@item
@code{_iconv_nls_set_state} - sets the current encoding's shift state (the
@code{mbstate_t} object).
1685
1686@item
1687@code{_iconv_nls_is_stateful} - checks whether the encoding is stateful
1688or stateless.
1689
1690@item
1691@code{_iconv_nls_get_mb_cur_max} - returns the maximum length (the
1692maximum bytes number) of the encoding's characters.
1693@end itemize
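
@*
The choice of the @emph{wchar_t} encoding mentioned for
@code{_iconv_nls_open} can be sketched as follows. The helper
@code{demo_wchar_encoding} is hypothetical and is not a Newlib interface. It
only mirrors the rule stated above, and the sizes of 2 and 4 bytes are an
assumption about what "depending on size" means.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Map the size of wchar_t in bytes to the internal encoding name.
 * Returns NULL for sizes that the description above does not cover.
 */
static const char *demo_wchar_encoding(size_t wchar_size)
{
    switch (wchar_size) {
    case 2:
        return "ucs_2_internal";
    case 4:
        return "ucs_4_internal";
    default:
        return NULL;
    }
}
```

@*
A caller would typically pass @code{sizeof(wchar_t)} to such a helper.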
1694
1695
1696
1697
1698@page
1699@node Contact
1700@section Contact
1701@*
1702The author of the original BSD iconv library (Alexander Chuguev) no longer
1703supports that code.
1704
1705@*
1706Any questions regarding the iconv library may be forwarded to
1707Artem B. Bityuckiy (dedekind@@oktetlabs.ru or dedekind@@mail.ru) as
1708well as to the public Newlib mailing list.
1709
1710