@node Encoding conversions
@chapter Encoding conversions (@file{iconv.h})

This chapter describes the Newlib iconv library.
The iconv function declarations are in
@file{iconv.h}.

@menu
* iconv::                             Encoding conversion routines
* Introduction to iconv::             Introduction to iconv and encodings
* Supported encodings::               The list of currently supported encodings
* iconv design decisions::            General iconv library design issues
* iconv configuration::               iconv-related configure script options
* Encoding names::                    How encodings are named.
* CCS tables::                        CCS tables format and 'mktbl.pl' Perl script
* CES converters::                    CES converters description
* The encodings description file::    The 'encoding.deps' file and 'mkdeps.pl'
* How to add new encoding::           The steps to add new encoding support
* The locale support interfaces::     Locale-related iconv interfaces
* Contact::                           The author contact
@end menu

@page
@include iconv/lib/iconv.def

@page
@node Introduction to iconv
@section Introduction to iconv
@findex encoding
@findex character set
@findex charset
@findex CES
@findex CCS
@*
The iconv library is intended to convert characters from one encoding to
another. It implements the @code{iconv()}, @code{iconv_open()} and
@code{iconv_close()} calls, which are defined by the Single UNIX Specification.

@*
In addition to these user-level interfaces, the iconv library also has
several interfaces which are needed to support the coding
capabilities of the Newlib Locale infrastructure. Since Locale
support also needs to
convert various character sets to and from the @emph{wide character
set}, the iconv library shares its capabilities with the Newlib Locale
subsystem. Moreover, the iconv library supports several features which are
only needed for the Locale infrastructure (for example, the MB_CUR_MAX value).

@*
The Newlib iconv library was created using concepts from another iconv
library implemented by Konstantin Chuguev (ver 2.0).
The Newlib iconv library
was rewritten from scratch and contains many improvements with respect to
the original iconv library.

@*
Terms like @dfn{encoding} or @dfn{character set} aren't well defined and
are often used with various meanings. The following are the definitions of terms
which are used in this documentation as well as in the iconv library
implementation:

@itemize @bullet
@item
@dfn{encoding} - a machine representation of characters by means of bits;

@item
@dfn{Character Set} or @dfn{Charset} - just a collection of
characters, i.e., the encoding is the machine representation of the character set;

@item
@dfn{CCS} (@dfn{Coded Character Set}) - a mapping from a character set to a
set of integers (@dfn{character codes});

@item
@dfn{CES} (@dfn{Character Encoding Scheme}) - a mapping from a set of character
codes to a sequence of bytes.
@end itemize

@*
Users usually deal with encodings, for example, KOI8-R, Unicode, UTF-8,
ASCII, etc. Encodings are formed by the following chain of steps:

@enumerate
@item
The user has a set of characters which are specific to his or her language (character set).

@item
Each character from this set is uniquely numbered, resulting in a CCS.

@item
Each number from the CCS is converted to a sequence of bits or bytes by means
of a CES, forming an encoding. Thus, a CES may be considered as a
function of a CCS which produces an encoding. Note that a CES may be
applied to more than one CCS.
@end enumerate

@*
Thus, an encoding may be considered as one or more CCS + CES.

@*
Sometimes there is no CES, and in such cases the encoding is equivalent
to the CCS, e.g., KOI8-R or ASCII.

@*
An example of a more complicated encoding is UTF-8, which is the UCS
(or Unicode) CCS plus the UTF-8 CES.
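@*
As a rough illustration of the "CCS plus CES" decomposition, the following
Python sketch relies on Python's built-in codecs rather than on the Newlib
iconv library. The helper names @code{ccs_code}, @code{ces_utf8} and
@code{encode_utf8} are hypothetical and exist only for this example.

@example
```python
# Hypothetical helpers for illustration only; they are not Newlib APIs.

def ccs_code(ch):
    # CCS step: map a character to its integer code in the UCS (Unicode) CCS.
    return ord(ch)

def ces_utf8(code):
    # CES step: map a UCS character code to a byte sequence (the UTF-8 CES).
    return chr(code).encode("utf-8")

def encode_utf8(text):
    # Encoding = CCS followed by CES, applied character by character.
    return b"".join(ces_utf8(ccs_code(ch)) for ch in text)

# CYRILLIC CAPITAL LETTER ZHE encoded as UTF-8 (UCS CCS plus UTF-8 CES).
utf8_bytes = encode_utf8("\u0416")

# KOI8-R has no separate CES: each CCS code is stored directly as one byte.
koi8_bytes = "\u0416".encode("koi8_r")
```
@end example

In this sketch the same character yields a two-byte sequence in UTF-8 and a
single byte in KOI8-R, which reflects the difference between an encoding
with a CES and an encoding that is equivalent to its CCS.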

@*
The following is a brief list of iconv library features:
@itemize
@item
Generic architecture;
@item
Locale infrastructure support;
@item
Automatic generation of the program code which handles
CES/CCS/Encoding/Names/Aliases dependencies;
@item
The ability to choose a size- or speed-optimized
configuration;
@item
The ability to exclude a lot of unneeded code and data from the linking step.
@end itemize




@page
@node Supported encodings
@section Supported encodings
@findex big5
@findex cp775
@findex cp850
@findex cp852
@findex cp855
@findex cp866
@findex euc_jp
@findex euc_kr
@findex euc_tw
@findex iso_8859_1
@findex iso_8859_10
@findex iso_8859_11
@findex iso_8859_13
@findex iso_8859_14
@findex iso_8859_15
@findex iso_8859_2
@findex iso_8859_3
@findex iso_8859_4
@findex iso_8859_5
@findex iso_8859_6
@findex iso_8859_7
@findex iso_8859_8
@findex iso_8859_9
@findex iso_ir_111
@findex koi8_r
@findex koi8_ru
@findex koi8_u
@findex koi8_uni
@findex ucs_2
@findex ucs_2_internal
@findex ucs_2be
@findex ucs_2le
@findex ucs_4
@findex ucs_4_internal
@findex ucs_4be
@findex ucs_4le
@findex us_ascii
@findex utf_16
@findex utf_16be
@findex utf_16le
@findex utf_8
@findex win_1250
@findex win_1251
@findex win_1252
@findex win_1253
@findex win_1254
@findex win_1255
@findex win_1256
@findex win_1257
@findex win_1258
@*
The following is the list of currently supported encodings. The first column
corresponds to the encoding name, the second column is the list of aliases,
the third column is its CES and CCS components names, and the fourth column
is a short description.

@multitable @columnfractions .20 .26 .24 .30
@item
Name
@tab
Aliases
@tab
CES/CCS
@tab
Short description
@item
@tab
@tab
@tab


@item
big5
@tab
csbig5, big_five, bigfive, cn_big5, cp950
@tab
table_pcs / big5, us_ascii
@tab
The encoding for Traditional Chinese.


@item
cp775
@tab
ibm775, cspc775baltic
@tab
table / cp775
@tab
The updated version of CP 437 that supports the Baltic languages.


@item
cp850
@tab
ibm850, 850, cspc850multilingual
@tab
table / cp850
@tab
IBM 850 - the updated version of CP 437 in which several Latin 1 characters
replace some less-often used characters such as the line-drawing
and the Greek ones.


@item
cp852
@tab
ibm852, 852, cspcp852
@tab
table / cp852
@tab
IBM 852 - the updated version of CP 437 in which several Latin 2 characters
replace some less-often used characters such as the line-drawing and the Greek ones.


@item
cp855
@tab
ibm855, 855, csibm855
@tab
table / cp855
@tab
IBM 855 - the updated version of CP 437 that supports Cyrillic.


@item
cp866
@tab
866, IBM866, CSIBM866
@tab
table / cp866
@tab
IBM 866 - the updated version of CP 855 which more closely follows the logical
Russian alphabet ordering of the alternative variant preferred by many Russian users.


@item
euc_jp
@tab
eucjp
@tab
euc / jis_x0208_1990, jis_x0201_1976, jis_x0212_1990
@tab
EUC-JP - The EUC for Japanese.


@item
euc_kr
@tab
euckr
@tab
euc / ksx1001
@tab
EUC-KR - The EUC for Korean.


@item
euc_tw
@tab
euctw
@tab
euc / cns11643_plane1, cns11643_plane2, cns11643_plane14
@tab
EUC-TW - The EUC for Traditional Chinese.


@item
iso_8859_1
@tab
iso8859_1, iso88591, iso_8859_1:1987, iso_ir_100, latin1, l1, ibm819, cp819, csisolatin1
@tab
table / iso_8859_1
@tab
ISO 8859-1:1987 - Latin 1, West European.


@item
iso_8859_10
@tab
iso_8859_10:1992, iso_ir_157, iso885910, latin6, l6, csisolatin6, iso8859_10
@tab
table / iso_8859_10
@tab
ISO 8859-10:1992 - Latin 6, Nordic.


@item
iso_8859_11
@tab
iso8859_11, iso885911
@tab
table / iso_8859_11
@tab
ISO 8859-11 - Thai.


@item
iso_8859_13
@tab
iso_8859_13:1998, iso8859_13, iso885913
@tab
table / iso_8859_13
@tab
ISO 8859-13:1998 - Latin 7, Baltic Rim.


@item
iso_8859_14
@tab
iso_8859_14:1998, iso885914, iso8859_14
@tab
table / iso_8859_14
@tab
ISO 8859-14:1998 - Latin 8, Celtic.


@item
iso_8859_15
@tab
iso885915, iso_8859_15:1998, iso8859_15
@tab
table / iso_8859_15
@tab
ISO 8859-15:1998 - Latin 9, West Europe, successor of Latin 1.


@item
iso_8859_2
@tab
iso8859_2, iso88592, iso_8859_2:1987, iso_ir_101, latin2, l2, csisolatin2
@tab
table / iso_8859_2
@tab
ISO 8859-2:1987 - Latin 2, East European.


@item
iso_8859_3
@tab
iso_8859_3:1988, iso_ir_109, iso8859_3, latin3, l3, csisolatin3, iso88593
@tab
table / iso_8859_3
@tab
ISO 8859-3:1988 - Latin 3, South European.


@item
iso_8859_4
@tab
iso8859_4, iso88594, iso_8859_4:1988, iso_ir_110, latin4, l4, csisolatin4
@tab
table / iso_8859_4
@tab
ISO 8859-4:1988 - Latin 4, North European.


@item
iso_8859_5
@tab
iso8859_5, iso88595, iso_8859_5:1988, iso_ir_144, cyrillic, csisolatincyrillic
@tab
table / iso_8859_5
@tab
ISO 8859-5:1988 - Cyrillic.


@item
iso_8859_6
@tab
iso_8859_6:1987, iso_ir_127, iso8859_6, ecma_114, asmo_708, arabic, csisolatinarabic, iso88596
@tab
table / iso_8859_6
@tab
ISO 8859-6:1987 - Arabic.


@item
iso_8859_7
@tab
iso_8859_7:1987, iso_ir_126, iso8859_7, elot_928, ecma_118, greek, greek8, csisolatingreek, iso88597
@tab
table / iso_8859_7
@tab
ISO 8859-7:1987 - Greek.


@item
iso_8859_8
@tab
iso_8859_8:1988, iso_ir_138, iso8859_8, hebrew, csisolatinhebrew, iso88598
@tab
table / iso_8859_8
@tab
ISO 8859-8:1988 - Hebrew.


@item
iso_8859_9
@tab
iso_8859_9:1989, iso_ir_148, iso8859_9, latin5, l5, csisolatin5, iso88599
@tab
table / iso_8859_9
@tab
ISO 8859-9:1989 - Latin 5, Turkish.


@item
iso_ir_111
@tab
ecma_cyrillic, koi8_e, koi8e, csiso111ecmacyrillic
@tab
table / iso_ir_111
@tab
ISO IR 111/ECMA Cyrillic.


@item
koi8_r
@tab
cskoi8r, koi8r, koi8
@tab
table / koi8_r
@tab
RFC 1489 Cyrillic.


@item
koi8_ru
@tab
koi8ru
@tab
table / koi8_ru
@tab
The obsolete Ukrainian.


@item
koi8_u
@tab
koi8u
@tab
table / koi8_u
@tab
RFC 2319 Ukrainian.


@item
koi8_uni
@tab
koi8uni
@tab
table / koi8_uni
@tab
KOI8 Unified.


@item
ucs_2
@tab
ucs2, iso_10646_ucs_2, iso10646_ucs_2, iso_10646_ucs2, iso10646_ucs2, iso10646ucs2, csUnicode
@tab
ucs_2 / (UCS)
@tab
ISO-10646-UCS-2. Big Endian, U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).


@item
ucs_2_internal
@tab
ucs2_internal, ucs_2internal, ucs2internal
@tab
ucs_2_internal / (UCS)
@tab
ISO-10646-UCS-2 in system byte order.
U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).


@item
ucs_2be
@tab
ucs2be
@tab
ucs_2 / (UCS)
@tab
Big Endian version of ISO-10646-UCS-2 (in fact, equivalent to ucs_2).
Big Endian, U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).


@item
ucs_2le
@tab
ucs2le
@tab
ucs_2 / (UCS)
@tab
Little Endian version of ISO-10646-UCS-2.
Little Endian, U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).


@item
ucs_4
@tab
ucs4, iso_10646_ucs_4, iso10646_ucs_4, iso_10646_ucs4, iso10646_ucs4, iso10646ucs4
@tab
ucs_4 / (UCS)
@tab
ISO-10646-UCS-4. Big Endian, U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).


@item
ucs_4_internal
@tab
ucs4_internal, ucs_4internal, ucs4internal
@tab
ucs_4_internal / (UCS)
@tab
ISO-10646-UCS-4 in system byte order.
U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).


@item
ucs_4be
@tab
ucs4be
@tab
ucs_4 / (UCS)
@tab
Big Endian version of ISO-10646-UCS-4 (in fact, equivalent to ucs_4).
Big Endian, U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).


@item
ucs_4le
@tab
ucs4le
@tab
ucs_4 / (UCS)
@tab
Little Endian version of ISO-10646-UCS-4.
Little Endian, U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).


@item
us_ascii
@tab
ansi_x3.4_1968, ansi_x3.4_1986, iso_646.irv:1991, ascii, iso646_us, us, ibm367, cp367, csascii
@tab
us_ascii / (ASCII)
@tab
7-bit ASCII.


@item
utf_16
@tab
utf16
@tab
utf_16 / (UCS)
@tab
RFC 2781 UTF-16. The very first U+FEFF code in the stream is interpreted as BOM.


@item
utf_16be
@tab
utf16be
@tab
utf_16 / (UCS)
@tab
Big Endian version of RFC 2781 UTF-16.
U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).


@item
utf_16le
@tab
utf16le
@tab
utf_16 / (UCS)
@tab
Little Endian version of RFC 2781 UTF-16.
U+FEFF is always interpreted as ZWNBSP (BOM isn't supported).


@item
utf_8
@tab
utf8
@tab
utf_8 / (UCS)
@tab
RFC 3629 UTF-8.


@item
win_1250
@tab
cp1250
@tab
table / win_1250
@tab
Win-1250 - Croatian.


@item
win_1251
@tab
cp1251
@tab
table / win_1251
@tab
Win-1251 - Cyrillic.


@item
win_1252
@tab
cp1252
@tab
table / win_1252
@tab
Win-1252 - Latin 1.


@item
win_1253
@tab
cp1253
@tab
table / win_1253
@tab
Win-1253 - Greek.


@item
win_1254
@tab
cp1254
@tab
table / win_1254
@tab
Win-1254 - Turkish.


@item
win_1255
@tab
cp1255
@tab
table / win_1255
@tab
Win-1255 - Hebrew.


@item
win_1256
@tab
cp1256
@tab
table / win_1256
@tab
Win-1256 - Arabic.


@item
win_1257
@tab
cp1257
@tab
table / win_1257
@tab
Win-1257 - Baltic.


@item
win_1258
@tab
cp1258
@tab
table / win_1258
@tab
Win-1258 - Vietnamese.
@end multitable





@page
@node iconv design decisions
@section iconv design decisions
@findex CCS table
@findex CES converter
@findex Speed-optimized tables
@findex Size-optimized tables
@*
The first iconv library design issue arises when considering the
following two design approaches:

@enumerate
@item
Have modules which implement conversion from the encoding A to the encoding B
and vice versa, i.e., one conversion module relates to any two encodings.
@item
Have modules which implement conversion from the encoding A to the fixed
encoding C and vice versa, i.e., one conversion module relates to any
one encoding A and one fixed encoding C. In this case, to convert from
the encoding A to the encoding B, two modules are needed (in order to convert
from A to C and then from C to B).
@end enumerate

@*
Obviously, there is a tradeoff between generality/flexibility and
efficiency: the first method is more efficient since it converts
directly; however, it is less flexible since a distinct module is
needed for each encoding pair.

@*
The Newlib iconv model uses the second method and always converts through the 32-bit
UCS, but its design also allows one to write specialized conversion
modules if the conversion speed is critical.

@*
The second design issue is how to break down (decompose) encodings.
The Newlib iconv library uses the fact that any encoding may be
considered as one or more CCS plus a CES. It also decomposes its
conversion modules into a @dfn{CES converter} plus one or more @dfn{CCS
tables}. CCS tables map CCS to UCS and vice versa; the CES converters
map CCS to the encoding and vice versa.

@*
As an example, let's consider conversion between the big5 and
EUC-TW encodings. The big5 encoding may be decomposed into the ASCII and BIG5
CCS-es plus the BIG5 CES. EUC-TW may be decomposed into the CNS11643_PLANE1, CNS11643_PLANE2,
and CNS11643_PLANE14 CCS-es plus the EUC CES.

@*
The euc_tw -> big5 conversion is performed as follows:

@enumerate
@item
The EUC converter performs the EUC-TW encoding to the corresponding CCS-es
transformation (CNS11643_PLANE1, CNS11643_PLANE2 and CNS11643_PLANE14
CCS-es);
@item
The obtained CCS codes are transformed to the UCS codes using the CNS11643_PLANE1,
CNS11643_PLANE2 and CNS11643_PLANE14 CCS tables;
@item
The resulting UCS codes are transformed to the ASCII and BIG5 codes using
the corresponding CCS tables;
@item
The obtained CCS codes are transformed to the big5 encoding using the corresponding
CES converter.
@end enumerate

@*
Analogously, the backward (big5 -> euc_tw) conversion is performed as follows:

@enumerate
@item
The BIG5 converter performs the big5 encoding to the corresponding CCS-es transformation
(the ASCII and BIG5 CCS-es);
@item
The obtained CCS codes are transformed to the UCS codes using the ASCII and BIG5 CCS tables;
@item
The resulting UCS codes are transformed to the CNS11643_PLANE1, CNS11643_PLANE2
and CNS11643_PLANE14 codes using the corresponding CCS tables;
@item
The obtained CCS codes are transformed to the EUC-TW encoding using the corresponding
CES converter.
@end enumerate

@*
Note that the above is just an example and the real names (which are implemented
in the Newlib iconv) of the CES converters and the CCS tables are slightly different.

@*
The third design issue also relates to flexibility. Obviously, it isn't
desirable to always link all the CES converters and the CCS tables to the library;
instead, we want to be able to load the needed converters and tables
dynamically on demand. This isn't a problem on "big" machines such as
a PC, but it may be problematic within "small" embedded systems.

@*
Since the CCS tables are just data, it is possible to load them
dynamically from external files. The CES converters, on the other hand,
are algorithms with some code, so a dynamic library loading
capability is required.

@*
Apart from possible restrictions applied by embedded systems (small
RAM for example), Newlib itself has no dynamic library support and
therefore, all the CES converters which will ever be used must be linked into
the library. However, dynamic loading of the CCS tables is possible and is
implemented in the Newlib iconv library. It may be enabled via the Newlib
configure script options.

@*
The next design issue is fine-tuning the iconv library
configuration.
One important ability is for iconv to not link all its
converters and tables (if dynamic loading is not enabled) but instead,
enable only those encodings which are specified at configuration
time (see the section about the configure script options).

@*
In addition, the Newlib iconv library configure options distinguish between
conversion directions. This means that not only are supported encodings
selectable, the conversion direction is as well. For example, if the user wants
a configuration which allows conversions from UTF-8 to UTF-16 and
doesn't plan to use the "UTF-16 to UTF-8" conversions, he or she can
enable only
this conversion direction (i.e., no "UTF-16 -> UTF-8"-related code will
be included), thus saving some memory (note that this technique allows
excluding one half of a CCS table from linking, which may be quite large).

@*
One more design aspect is the speed- and size-optimized tables. Users can
select between them using configure script options. The
speed-optimized CCS tables are the same as the size-optimized ones in
case of 8-bit CCS (e.g., KOI8-R), but for 16-bit CCS-es the size-optimized
CCS tables may be 1.5 to 2 times smaller than the speed-optimized ones. On the
other hand, conversion with speed tables is several times faster.

@*
It is worth stressing that new encoding support can't be
dynamically added into an already compiled Newlib library, even if it
needs only an additional CCS table and iconv is configured to use
the external files with CCS tables (this isn't a fundamental restriction,
and the possibility to add new table-based encoding support dynamically, by
means of just adding a new .cct file, may be easily added).

@*
Theoretically, the compiled-in CCS tables should be more appropriate for
embedded systems than dynamically loaded CCS tables.
This is because the compiled-in tables are read-only and can be placed in ROM
whereas dynamic loading requires RAM. Moreover, in the current iconv
implementation, a distinct copy of the dynamic CCS file is loaded for each opened iconv descriptor even in case of the same encoding.
This means, for example, that if two iconv descriptors for
"KOI8-R -> UCS-4BE" and "KOI8-R -> UTF-16BE" are opened, two copies of
the koi8-r .cct file will be loaded (actually, iconv loads only the needed part
of these files). On the other hand, in the case of compiled-in CCS tables, there will always be only one copy.

@page
@node iconv configuration
@section iconv configuration
@findex iconv configuration
@findex --enable-newlib-iconv-encodings
@findex --enable-newlib-iconv-from-encodings
@findex --enable-newlib-iconv-to-encodings
@findex --enable-newlib-iconv-external-ccs
@findex NLSPATH
@*
To enable an encoding, the @option{--enable-newlib-iconv-encodings} configure
script option should be used. This option accepts a comma-separated list
of @emph{encodings} that should be enabled. The option enables each encoding in both
("to" and "from") directions.

@*
The @option{--enable-newlib-iconv-from-encodings} configure script option enables
"from" support for each encoding that was passed to it.

@*
The @option{--enable-newlib-iconv-to-encodings} configure script option enables
"to" support for each encoding that was passed to it.

@*
Example: if the user plans only the "KOI8-R -> UTF-8", "UTF-8 -> ISO-8859-5" and
"KOI8-R -> UCS-2" conversions, the optimal way (minimal iconv
code and data will be linked) is to configure Newlib with the following
options:
@*
@code{--enable-newlib-iconv-encodings=UTF-8
--enable-newlib-iconv-from-encodings=KOI8-R
--enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5}
@*
which is the same as
@*
@code{--enable-newlib-iconv-from-encodings=KOI8-R,UTF-8
--enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5,UTF-8}
@*
The user may also just use the
@*
@code{--enable-newlib-iconv-encodings=KOI8-R,ISO-8859-5,UTF-8,UCS-2}
@*
configure script option, but this is less optimal since some
unneeded data and code will be linked.

@*
The @option{--enable-newlib-iconv-external-ccs} option enables iconv's
capabilities to work with the external CCS files.

@*
The @option{--enable-target-optspace} Newlib configure script option also affects
the iconv library. If this option is present, the library uses the
size-optimized CCS tables. This means that only the size-optimized CCS
tables will be linked or, if the
@option{--enable-newlib-iconv-external-ccs} configure script option was used,
the iconv library will load the size-optimized tables. If the
@option{--enable-target-optspace} configure script option is disabled,
the speed-optimized CCS tables are used.

@*
Note: .cct files are searched by @code{iconv_open} in the $NLSPATH/iconv_data/ directory.
Thus, the NLSPATH environment variable should be set.





@page
@node Encoding names
@section Encoding names
@findex encoding name
@findex encoding alias
@findex normalized name
@*
Each encoding has one @dfn{name} and a number of @dfn{aliases}. When
the user works with the iconv library (i.e., when the @code{iconv_open} call
is used), either the name or any of the aliases may be used.
The same applies when encoding
names are used in configure script options.

@*
Names and aliases may be specified in any case (small or capital
letters) and the @kbd{-} symbol is equivalent to the @kbd{_} symbol.

@*
Internally the Newlib iconv library always converts aliases to names. It
also converts names and aliases to the @dfn{normalized} form, which means
that all capital letters are converted to small letters and the @kbd{-}
symbols are converted to @kbd{_} symbols.




@page
@node CCS tables
@section CCS tables
@findex Size-optimized CCS table
@findex Speed-optimized CCS table
@findex mktbl.pl Perl script
@findex .cct files
@findex The CCT tables source files
@findex CCS source files
@*
The iconv library stores files with CCS tables in the @emph{ccs/}
subdirectory. The CCS tables for any CCS may be kept in two forms - in the binary form
(@dfn{.cct files}, see the @emph{ccs/binary/} subdirectory) and in the form
of compilable .c source files. The .cct files are only used when the
@option{--enable-newlib-iconv-external-ccs} configure script option is enabled.
The .c files are linked to the Newlib library if the corresponding
encoding is enabled.

@*
As stated earlier, the Newlib iconv library performs all
conversions through the 32-bit UCS, but the codes which are used
in most CCS-es fit into the first 16-bit subset of the 32-bit UCS set.
Thus, in order to make the CCS tables more compact, the 16-bit UCS-2 is
used instead of the 32-bit UCS-4.

@*
CCS tables may be 8- or 16-bit wide. 8-bit CCS tables map 8-bit CCS to
16-bit UCS-2 and vice versa while 16-bit CCS tables map
16-bit CCS to 16-bit UCS-2 and vice versa.
8-bit tables are small (in size) while 16-bit tables may be quite large.
Because of this, 16-bit CCS tables may be
either speed- or size-optimized.
Size-optimized CCS tables are
smaller than speed-optimized ones, but the conversion process is
slower if the size-optimized CCS tables are used. 8-bit CCS tables have only
a speed-optimized variant.

Each CCS table (both speed- and size-optimized) consists of
@dfn{from_ucs} and @dfn{to_ucs} subtables. The "from_ucs" subtable maps
UCS-2 codes to CCS codes, while the "to_ucs" subtable maps CCS codes to
UCS-2 codes.

@*
Almost all 16-bit CCS tables contain fewer than 0xFFFF codes and
a lot of gaps exist.

@subsection Speed-optimized tables format
@*
In case of 8-bit speed-optimized CCS tables the "to_ucs" subtable format is
trivial - it is just an array of 256 16-bit UCS codes. Therefore, a
UCS-2 code @emph{Y} corresponding to a CCS code @emph{X} is calculated
as @emph{Y = to_ucs[X]}.

@*
Obviously, the simplest way to create the "from_ucs" table or the
16-bit "to_ucs" table is to use a huge 16-bit array like in case
of the 8-bit "to_ucs" table. But almost all the 16-bit CCS tables contain
fewer than 0xFFFF code maps and this fact may be exploited to reduce
the size of the CCS tables.

@*
In this chapter the "UCS-2 -> CCS" 8-bit CCS table format is described. The
16-bit "CCS -> UCS-2" CCS table format is the same, except for the mapping
direction and the number of CCS bits.

@*
In case of the 8-bit speed-optimized table the "from_ucs" subtable
corresponds to the "from_ucs" array and has the following layout:

@*
from_ucs array:
@*
-------------------------------------
@*
0xFF mapping (2 bytes) (only for
8-bit table).
@*
-------------------------------------
@*
Heading block
@*
-------------------------------------
@*
Block 1
@*
-------------------------------------
@*
Block 2
@*
-------------------------------------
@*
 ...
@*
-------------------------------------
@*
Block N
@*
-------------------------------------

@*
The 0x0000-0xFFFF 16-bit code range is divided into 256 code subranges. Each
subrange is represented by a 256-element @dfn{block} (256 1-byte
elements or 256 2-byte elements in case of a 16-bit CCS table) with
elements which are equivalent to the CCS codes of this subrange.
If the "UCS-2 -> CCS" mapping has big enough gaps, some blocks will be
absent and there will be fewer than 256 blocks.

@*
Any element number @emph{m} of @dfn{the heading block} (which contains
256 2-byte elements) corresponds to the @emph{m}-th 256-element subrange.
If the subrange contains some codes, the value of the @emph{m}-th element of
the heading block contains the offset of the corresponding block in the
"from_ucs" array. If there are no codes in the subrange, the heading
block element contains 0xFFFF.

@*
If there are some gaps in a block, the corresponding block elements have
the 0xFF value. If there is a 0xFF code present in the CCS, its mapping
is defined in the first 2-byte element of the "from_ucs" array.

@*
Having such a table format, the algorithm of searching the CCS code
@emph{X} which corresponds to the UCS-2 code @emph{Y} is as follows.

@*
@enumerate
@item If @emph{Y} is equivalent to the value of the first 2-byte element
of the "from_ucs" array, @emph{X} is 0xFF. Else, continue to search.

@item Calculate the block number: @emph{BlkN = (Y & 0xFF00) >> 8}.

@item If the heading block element with number @emph{BlkN} is 0xFFFF, there
is no corresponding CCS code (error, wrong input data). Else, fetch the
"from_ucs" array index of the @emph{BlkN}-th block.

@item Calculate the offset of the @emph{X} code in its block:
@emph{Xindex = Y & 0xFF}

@item If the value of the @emph{Xindex}-th element of the block (which is equivalent to
@emph{from_ucs[BlkIdx+Xindex]}, where @emph{BlkIdx} is the "from_ucs" array
index of the block fetched from the heading block) is 0xFF, there is no corresponding
CCS code (error, wrong input data). Else, @emph{X = from_ucs[BlkIdx+Xindex]}.
@end enumerate

@subsection Size-optimized tables format
@*
As stated above, size-optimized tables exist only for 16-bit CCS-es.
This is because the difference between the speed-optimized
and the size-optimized table sizes is too small in case of 8-bit CCS-es.

@*
Formats of the "to_ucs" and "from_ucs" subtables are equivalent in case of
size-optimized tables.

This section describes the format of the "UCS-2 -> CCS" size-optimized
CCS table. The format of the "CCS -> UCS-2" table is the same.

The idea of the size-optimized tables is to split the UCS-2 codes
("from" codes) into @dfn{ranges} (a @dfn{range} is a number of consecutive UCS-2 codes).
Then CCS codes ("to" codes) are stored only for the codes from these
ranges. Distinct "from" codes which have no range (@dfn{unranged codes}) are stored
together with the corresponding "to" codes.
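@*
The ranges idea can be sketched in Python as follows. This is only an
illustration under simplifying assumptions: plain Python lists stand in for
the packed table array, ranges are not merged across small gaps, and runs
shorter than three codes are kept as unranged pairs. The names
@code{build_tables}, @code{lookup} and @code{min_run} are hypothetical and
do not come from the Newlib sources.

@example
```python
import bisect

def build_tables(mapping, min_run=3):
    # Split the sorted "from" codes into runs of consecutive codes.
    # Runs of at least min_run codes become ranges (first, last, to_codes);
    # all other codes become unranged (from_code, to_code) pairs.
    ranges, unranged, run = [], [], []
    for code in sorted(mapping) + [None]:    # None is an end-of-input sentinel
        if run and code is not None and code == run[-1] + 1:
            run.append(code)
            continue
        if len(run) >= min_run:
            ranges.append((run[0], run[-1], [mapping[c] for c in run]))
        else:
            unranged.extend((c, mapping[c]) for c in run)
        run = [code]
    return ranges, unranged

def lookup(ranges, unranged, code):
    # Binary search for the range whose first code is <= code.
    starts = [first for first, last, codes in ranges]
    i = bisect.bisect_right(starts, code) - 1
    if i >= 0:
        first, last, codes = ranges[i]
        if code <= last:
            return codes[code - first]
    # Otherwise, binary search among the sorted unranged pairs.
    keys = [frm for frm, to in unranged]
    j = bisect.bisect_left(keys, code)
    if j < len(keys) and keys[j] == code:
        return unranged[j][1]
    return None    # no mapping for this code
```
@end example

Both the range list and the unranged pairs are kept sorted by "from" code,
so each lookup needs at most two binary searches.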

@*
The following is the layout of the size-optimized table array:

@*
size_arr array:
@*
-------------------------------------
@*
Ranges number (2 bytes)
@*
-------------------------------------
@*
Unranged codes number (2 bytes)
@*
-------------------------------------
@*
Unranged codes array index (2 bytes)
@*
-------------------------------------
@*
Ranges indexes (triads)
@*
-------------------------------------
@*
Ranges
@*
-------------------------------------
@*
Unranged codes array
@*
-------------------------------------

@*
The @dfn{Ranges indexes} section of @emph{size_arr} helps to find
the offset of the needed range in @emph{size_arr} and has
the following format (triads):
@*
the first code in range, the last code in range, range offset.

@*
The array of these triads is sorted by the first element, therefore it is
possible to quickly find the needed range index.

@*
Each range has the corresponding sub-array containing the "to" codes. These
sub-arrays are stored in the place marked as "Ranges" in the layout
diagram.

@*
The "Unranged codes array" contains pairs ("from" code, "to" code) for
each unranged code. The array of these pairs is sorted by "from" code
values, therefore it is possible to find the needed pair quickly.

@*
Note that each range requires 6 bytes to form its index. If, for
example, there are two ranges (1 - 5 and 9 - 10), and one unranged code
(7), 12 bytes are needed for two range indexes and 4 bytes for the unranged
code (total 16). But it is better to join both ranges as 1 - 10 and
mark codes 6 and 8 as absent. In this case, only 6 additional bytes for the
range index and 4 bytes to mark codes 6 and 8 as absent are needed
(total 10 bytes).
This optimization is done in the size-optimized tables.
Thus, ranges may contain small gaps. The absent codes in ranges are marked
as 0xFFFF.

@*
Note that a pair of consecutive "from" codes is stored as two unranged codes,
since the number of bytes needed to form a range of two codes is greater than
the number of bytes needed to store two unranged codes (10 against 8).

@*
The algorithm for finding the CCS code
@emph{X} which corresponds to the UCS-2 code @emph{Y} (input) in the "UCS-2 ->
CCS" size-optimized table is as follows.

@*
@enumerate
@item Try to find the corresponding triad in the "Ranges indexes"
section. Since the array is sorted, this can be done quickly with a
binary search (divide by 2, compare, etc).

@item If the triad is found, fetch the @emph{X} code from the corresponding
range array. If it is 0xFFFF, return an error.

@item If there is no corresponding triad, search for the @emph{X} code among the
sorted unranged codes. Return an error if nothing was found.
@end enumerate

@subsection .cct and .c CCS Table files
@*
The .c source files for 8-bit CCS tables have "to_ucs" and "from_ucs"
speed-optimized tables. The .c source files for 16-bit CCS tables have
"to_ucs_speed", "to_ucs_size", "from_ucs_speed" and "from_ucs_size"
tables.

@*
When .c files are compiled and used, all the 16-bit and 32-bit values
have the native endian format (Big Endian for the BE systems and Little
Endian for the LE systems) since they are compiled for the system before
they are used.

@*
In case of .cct files, which are intended for dynamic CCS tables
loading, the CCS tables are stored either in LE or BE format. Since the
.cct files are generated by the 'mktbl.pl' Perl script, it is possible
to choose the endianness of the tables. It is also possible to store two
copies (both LE and BE) of the CCS tables in one .cct file.
The default
.cct files (which come with the Newlib sources) have both LE and BE CCS
tables. The Newlib iconv library automatically chooses the needed CCS tables
(with the appropriate endianness).

@*
Note that the .cct files are only used when the
@option{--enable-newlib-iconv-external-ccs} configure option is used.

@subsection The 'mktbl.pl' Perl script
@*
The 'mktbl.pl' script is intended to generate .cct and .c CCS table
files from the @dfn{CCS source files}.

@*
The CCS source files are just text files which have one or more columns
with the CCS <-> UCS-2 codes mapping. For an example of a CCS
source file, see one of them using the URLs given below.

@*
The following table describes where the source files for the CCS table files
provided by the Newlib distribution are located.

@multitable @columnfractions .25 .75
@item
Name
@tab
URL

@item
@tab

@item
big5
@tab
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT

@item
cns11643_plane1
cns11643_plane14
cns11643_plane2
@tab
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT

@item
cp775
cp850
cp852
cp855
cp866
@tab
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/

@item
iso_8859_1
iso_8859_2
iso_8859_3
iso_8859_4
iso_8859_5
iso_8859_6
iso_8859_7
iso_8859_8
iso_8859_9
iso_8859_10
iso_8859_11
iso_8859_13
iso_8859_14
iso_8859_15
@tab
http://www.unicode.org/Public/MAPPINGS/ISO8859/

@item
iso_ir_111
@tab
http://crl.nmsu.edu/~mleisher/csets/ISOIR111.TXT

@item
jis_x0201_1976
jis_x0208_1990
jis_x0212_1990
@tab
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT

@item
koi8_r
@tab
http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT

@item
koi8_ru
@tab
http://crl.nmsu.edu/~mleisher/csets/KOI8RU.TXT

@item
koi8_u
@tab
http://crl.nmsu.edu/~mleisher/csets/KOI8U.TXT

@item
koi8_uni
@tab
http://crl.nmsu.edu/~mleisher/csets/KOI8UNI.TXT

@item
ksx1001
@tab
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT

@item
win_1250
win_1251
win_1252
win_1253
win_1254
win_1255
win_1256
win_1257
win_1258
@tab
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/
@end multitable

The CCS source files aren't distributed with Newlib because of license
restrictions on most of the Unicode.org files.

The following are the 'mktbl.pl' options which were used to generate the .cct
files. Note that to generate the CCS table source (.c) files, the @option{-s} option
should be added.

@enumerate
@item For the iso_8859_10.cct, iso_8859_13.cct, iso_8859_14.cct, iso_8859_15.cct,
iso_8859_1.cct, iso_8859_2.cct, iso_8859_3.cct, iso_8859_4.cct,
iso_8859_5.cct, iso_8859_6.cct, iso_8859_7.cct, iso_8859_8.cct,
iso_8859_9.cct, iso_8859_11.cct, win_1250.cct, win_1252.cct, win_1254.cct,
win_1256.cct, win_1258.cct, win_1251.cct,
win_1253.cct, win_1255.cct, win_1257.cct,
koi8_r.cct, koi8_ru.cct, koi8_u.cct, koi8_uni.cct, iso_ir_111.cct,
big5.cct, cp775.cct, cp850.cct, cp852.cct, cp855.cct, cp866.cct, cns11643.cct
files, only the @option{-i <SRC_FILE_NAME>} option was used.

@item To generate the jis_x0208_1990.cct file, the
@option{-i jis_x0208_1990.txt -x 2 -y 3} options were used.

@item To generate the cns11643_plane1.cct file, the
@option{-i cns11643.txt -p1 -N cns11643_plane1 -o cns11643_plane1.cct}
options were used.

@item To generate the cns11643_plane2.cct file, the
@option{-i cns11643.txt -p2 -N cns11643_plane2 -o cns11643_plane2.cct}
options were used.

@item To generate the cns11643_plane14.cct file, the
@option{-i cns11643.txt -p0xE -N cns11643_plane14 -o cns11643_plane14.cct}
options were used.
@end enumerate

@*
For more information about the 'mktbl.pl' options, see the 'mktbl.pl -h' output.

@*
It is assumed that CCS codes are 16 bits wide or less. If there are wider CCS codes
in the CCS source file, the bits above the lower 16 define the plane (see the
cns11643.txt CCS source file).

@*
Sometimes it is impossible to map some CCS codes to the 16-bit UCS, for example when
several different CCS codes are mapped to one UCS-2 code or one CCS code is mapped to
a pair of UCS-2 codes. In these cases, such CCS codes (@dfn{lost
codes}) aren't simply rejected; instead, they are mapped to the default
UCS-2 code (which is currently the @kbd{?} character's code).




@page
@node CES converters
@section CES converters
@findex PCS
@*
Similar to the CCS tables, CES converters are also split into "from UCS"
and "to UCS" parts. Depending on the iconv library configuration, these
parts are enabled or disabled.

@*
The following is the list of CES converters which are currently present
in the Newlib iconv library.

@itemize @bullet
@item
@emph{euc} - supports the @emph{euc_jp}, @emph{euc_kr} and @emph{euc_tw}
encodings. The @emph{euc} CES converter uses the @emph{table} and the
@emph{us_ascii} CES converters.

@item
@emph{table} - this CES converter corresponds to the "null" CES and just performs
table-based conversion using 8- and 16-bit CCS tables. This converter
is also used by any other CES converter which needs CCS table-based
conversions.
The @emph{table} converter is also responsible for loading .cct
files.

@item
@emph{table_pcs} - this is a wrapper over the @emph{table} converter
which is intended for 16-bit encodings which also use the @dfn{Portable
Character Set} (@dfn{PCS}), which is the same as @emph{US-ASCII}.
This means that if the first byte of the CCS code is in the range [0x00-0x7f],
it is a 7-bit PCS code. Otherwise, it is a 16-bit CCS code. Of course,
the 16-bit codes must not contain bytes in the range [0x00-0x7f].
The @emph{big5} encoding uses the @emph{table_pcs} CES converter, and the
@emph{table_pcs} CES converter depends on the @emph{table} CES converter.

@item
@emph{ucs_2} - intended for the @emph{ucs_2}, @emph{ucs_2be} and
@emph{ucs_2le} encodings support.

@item
@emph{ucs_4} - intended for the @emph{ucs_4}, @emph{ucs_4be} and
@emph{ucs_4le} encodings support.

@item
@emph{ucs_2_internal} - intended for the @emph{ucs_2_internal} encoding support.

@item
@emph{ucs_4_internal} - intended for the @emph{ucs_4_internal} encoding support.

@item
@emph{us_ascii} - intended for the @emph{us_ascii} encoding support. In
principle, the most natural way to support the @emph{us_ascii} encoding
is to define the @emph{us_ascii} CCS and use the @emph{table} CES
converter. But for optimization purposes, the specialized
@emph{us_ascii} CES converter was created.

@item
@emph{utf_16} - intended for the @emph{utf_16}, @emph{utf_16be} and
@emph{utf_16le} encodings support.

@item
@emph{utf_8} - intended for the @emph{utf_8} encoding support.
@end itemize




@page
@node The encodings description file
@section The encodings description file
@findex encoding.deps description file
@findex mkdeps.pl Perl script
@*
To simplify the process of adding support for new encodings, many of the
"glue" files are generated automatically.

@*
There is the 'encoding.deps' file in the @emph{lib/} subdirectory which
is used to describe the properties of encodings. The 'mkdeps.pl' Perl script
uses 'encoding.deps' to generate the "glue" files.

@*
The 'encoding.deps' file is composed of sections. Each section consists
of entries, and each entry contains an encoding/CES/CCS description.

@*
The syntax of the 'encoding.deps' file is very simple. Currently only two
sections are defined: @emph{ENCODINGS} and @emph{CES_DEPENDENCIES}.

@*
Each entry of the @emph{ENCODINGS} section describes one encoding and
contains the following information.

@itemize @bullet
@item
Encoding name (the @emph{ENCODING} field). The name should
be unique and only one name is possible.

@item
The encoding's CES converter name (the @emph{CES} field). Only one CES
converter is allowed.

@item
The whitespace-separated list of CCS table names which are used by the
encoding (the @emph{CCS} field).

@item
The whitespace-separated list of alias names (the @emph{ALIASES}
field).
@end itemize

@*
Note that all names in the 'encoding.deps' file must be in the normalized
form.

@*
Each entry of the @emph{CES_DEPENDENCIES} section describes the dependencies of
one CES converter. For example, the @emph{euc} CES converter depends on
the @emph{table} and the @emph{us_ascii} CES converters since the
@emph{euc} CES converter uses them.
This means that both the @emph{table}
and @emph{us_ascii} CES converters should be linked if the @emph{euc}
CES converter is enabled.

@*
The @emph{CES_DEPENDENCIES} section defines the following:

@itemize @bullet
@item
the CES converter name for which the dependencies are defined in this
entry (the @emph{CES} field);

@item
the whitespace-separated list of CES converters which are needed for
this CES converter (the @emph{USED_CES} field).
@end itemize

@*
The 'mkdeps.pl' Perl script automatically solves the following tasks.

@itemize @bullet
@item
The user works with the iconv library in terms of encodings and doesn't need to know
anything about CES converters and CCS tables. The script automatically
generates code which enables all the CES converters and CCS tables needed
for the encodings which were enabled by the user.

@item
The CES converters may have dependencies, and the script automatically
generates the code which handles these dependencies.

@item
The list of encoding aliases is also generated automatically.

@item
The script uses a lot of macros in order to enable only the minimum set
of code/data which is needed to support the requested encodings in the
requested directions.
@end itemize

@*
The 'mkdeps.pl' Perl script interprets the 'encoding.deps'
file and generates the following files.

@itemize @bullet
@item
@emph{lib/encnames.h} - this header file contains macro definitions for all
encoding names.

@item
@emph{lib/aliasesbi.c} - the array of encoding names and aliases. The array
is used to find the name of the requested encoding by its alias.

@item
@emph{ces/cesbi.c} - this file defines two arrays
(@code{_iconv_from_ucs_ces} and @code{_iconv_to_ucs_ces}) which contain
descriptions of the enabled "from UCS" and "to UCS" CES converters and the
names of the encodings which are supported by these CES converters.

@item
@emph{ces/cesbi.h} - this file contains the set of macros which define
the set of CES converters which should be enabled, given only the set of
enabled encodings (through macros defined in the
@emph{newlib.h} file). Note that one CES converter may handle several
encodings.

@item
@emph{ces/cesdeps.h} - the CES converter dependencies are handled in
this file.

@item
@emph{ccs/ccsdeps.h} - the array of linked-in CCS tables is defined
here.

@item
@emph{ccs/ccsnames.h} - this header file contains macro definitions for all
CCS names.

@item
@emph{encoding.aliases} - the list of supported encodings and their
aliases, which is intended for the Newlib configure scripts in order to
handle the iconv-related configure script options.
@end itemize




@page
@node How to add new encoding
@section How to add new encoding
@*
First, the new encoding should be broken down into a CCS and a CES. Then,
the process of adding the new encoding is split into the following steps.

@enumerate
@item Generate the .cct CCS file and the .c source file for the new
encoding's CCS (if it isn't already present). To do this, the CCS source
file must be available and the 'mktbl.pl' script should be used.

@item Write the corresponding CES converter (if it isn't already
present). Use the existing CES converters as an example.

@item
Add the corresponding entries to the 'encoding.deps' file and regenerate
the autogenerated "glue" files using the 'mkdeps.pl' script.

@item
Add the corresponding entries to the newlib/newlib.hin file.

@item
Update the 'Makefile.am'-s (if new files were
added) and regenerate the 'Makefile.in'-s using the correct
version of 'automake'.

@item
Update the documentation (the list of
supported encodings and CES converters).
@end enumerate

If a new encoding doesn't fit the CES/CCS decomposition model, or
if specialized (non UCS-based) conversion support is desired,
the Newlib iconv library code itself has to be extended.




@page
@node The locale support interfaces
@section The locale support interfaces
@*
The Newlib iconv library also has some interface functions (besides the
@code{iconv}, @code{iconv_open} and @code{iconv_close} interfaces) which
are intended for the Locale subsystem. All the locale-related code is
placed in the @emph{lib/iconvnls.c} file.

@*
The following is the description of the locale-related interfaces:

@itemize @bullet
@item
@code{_iconv_nls_open} - opens two iconv descriptors for "CCS ->
wchar_t" and "wchar_t -> CCS" conversions. The normalized CCS name is
passed in the function parameters. The @emph{wchar_t} character encoding is
either ucs_2_internal or ucs_4_internal, depending on the size of
@emph{wchar_t}.

@item
@code{_iconv_nls_conv} - the function is similar to the @code{iconv}
function, but if there is no character in the output encoding which
corresponds to the character in the input encoding, the default
conversion isn't performed (the @code{iconv} function sets such output
characters to the @kbd{?} symbol, which is the behavior
specified in SUSv3).

@item
@code{_iconv_nls_get_state} - returns the current encoding's shift state
(the @code{mbstate_t} object).

@item
@code{_iconv_nls_set_state} - sets the current encoding's shift state (the
@code{mbstate_t} object).

@item
@code{_iconv_nls_is_stateful} - checks whether the encoding is stateful
or stateless.

@item
@code{_iconv_nls_get_mb_cur_max} - returns the maximum length (the
maximum number of bytes) of the encoding's characters.
@end itemize




@page
@node Contact
@section Contact
@*
The author of the original BSD iconv library (Konstantin Chuguev) no longer
supports that code.

@*
Any questions regarding the iconv library may be forwarded to
Artem B. Bityuckiy (dedekind@@oktetlabs.ru or dedekind@@mail.ru) as
well as to the public Newlib mailing list.
