prepared by
Internet Mail Consortium
Internet Mail Consortium Report: MAIL-I18N
IMCR-010, August 1, 1998
It is important to note that this report does not create any new standards; instead, it shows how current and proposed standards should be used. Most of the issues in internationalization already have standards, and those that do not are already being addressed in standards bodies such as the IETF and the Unicode Consortium. However, in many cases, there are too many standards, and developers are waiting to see which standard becomes dominant. One of the main purposes of this report is to help break this logjam.
Some of the most important problems of internationalization in Internet mail are covered in this report. They include:
Other important topics that are not covered in this report include using international characters in client commands and human-readable responses in the SMTP, POP, and IMAP protocols, internationalization in mailbox names and sorting in IMAP, and international characters in digital certificates used with Internet mail.
Finally, this report does not cover what current implementations do for internationalization. Instead, a separate report may be made on this topic.
This report describes the internationalization issues when creating and displaying Internet mail messages. For some issues, there are specific recommendations for what a program should do to facilitate the best possible results; these are marked as Recommendation. For other issues, only general suggestions are made.
This section covers only a few key words and phrases. All readers of this report would do well to read the first five chapters of the Unicode Standard if for no other reason than to get a handle on the terminology used in the discussion of internationalization. Most people, on reading the Unicode Standard for the first time, remark that they had no idea that there were so many variations on how characters are formed, how they are displayed, and so on. Many of the definitions used in this report come straight from the Unicode Standard.
In some cases, a charset name matches a CCS or CES name. This is often the case when a particular CCS implies a single specific CES or vice versa. For example, the UTF-8 CES is only used with the ISO 10646 CCS, and consequently the registered charset name for CCS="ISO 10646" CES="UTF-8" is "UTF-8".
Some charsets are complex. Typical of this is the ISO-2022-JP charset, which is CCS="ASCII","JIS X 0201 left half","JIS X 0208" and CES="ISO 2022". Some charsets based on ISO 2022 are even more complicated than this.
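The difference between a simple charset and an ISO 2022-based one can be seen by encoding the same characters under both. The following is a minimal sketch using Python's built-in codecs; the function name `encode_samples` is hypothetical and chosen only for illustration.

```python
def encode_samples(text):
    """Encode the same characters under two registered charsets."""
    return {name: text.encode(name) for name in ("utf-8", "iso-2022-jp")}


if __name__ == "__main__":
    for name, octets in encode_samples("日本").items():
        # UTF-8 yields three octets per character here, while ISO-2022-JP
        # wraps the JIS X 0208 octets in escape sequences that switch
        # character sets (ESC $ B ... ESC ( B).
        print(name, octets)
```

The escape sequences in the ISO-2022-JP output are what make the CES "ISO 2022": the same octet values mean different characters depending on which coded character set is currently selected.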
Most MUAs are used by people, although there are certainly many MUAs that are automatic processes such as mailing list systems. For the purposes of this report, an MUA has four basic functions:
MTAs on the Internet run the SMTP protocol. MTAs mostly just move messages. The MTA that is responsible for receiving messages for the recipient accepts messages and saves them in a message store that the recipient has access to. This report only covers MUAs, not MTAs.
RFC 2277 specifies that protocols must be able to use the UTF-8 charset for all text. The UTF-8 charset is defined in the Unicode Standard as well as in RFC 2279. RFC 2277 also specifies that protocols that transfer text must provide for carrying information about the language of that text; protocols should also provide for carrying information about the language of names, where appropriate. It recommends the use of language tagging as defined in "Tags for the Identification of Languages", RFC 1766.
Internationalized text and names appear in both parts of Internet mail messages: in the headers, and in the body of the message. RFC 2047 covers how to use international characters in some parts of non-MIME headers, such as addresses and subject headers. RFC 2231 extends the concepts in RFC 2047 to cover MIME parameter values, and also specifies a method for using language tagging for international characters in these headers. RFC 2046 describes how to specify the charset for each part of the body of the message.
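As an illustration of the header mechanisms in RFC 2047 and RFC 2231, the sketch below uses Python's standard `email` package. The helper names (`encode_subject`, `decode_subject`, `encode_filename_param`) and the sample strings are assumptions made for this example, not part of either RFC.

```python
from email.header import Header, decode_header, make_header
from email.utils import encode_rfc2231


def encode_subject(text, charset="utf-8"):
    """Return an RFC 2047 encoded-word form of a header value."""
    return Header(text, charset=charset).encode()


def decode_subject(raw):
    """Turn RFC 2047 encoded words back into a Unicode string."""
    return str(make_header(decode_header(raw)))


def encode_filename_param(filename, charset="utf-8", language="de"):
    """Return an RFC 2231 extended parameter value: charset'language'value."""
    return encode_rfc2231(filename, charset, language)


if __name__ == "__main__":
    wire = encode_subject("Grüße aus Köln")
    print(wire)                      # ASCII-only encoded word
    print(decode_subject(wire))      # original text restored
    print(encode_filename_param("Grüße.txt"))
```

Note that the RFC 2231 form carries a language tag alongside the charset, which the RFC 2047 form produced here does not.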
Because charsets are so important to Internet standards, new charsets can be registered so that applications can refer to them. RFC 2278 defines what charsets are and how they can be registered with IANA.
It should be noted that the Unicode Standard also defines the UTF-7 charset, which was intended for Internet mail. However, MIME is quite capable of carrying UTF-8, and UTF-8 is expected to be used in many protocols, not just Internet mail. Fortunately, very few vendors implemented UTF-7, and its use is strongly discouraged in Internet mail.
More recently, ISO and the Unicode Consortium have created a single large character set that encompasses essentially all of the characters from all living languages (and many defunct languages as well). This character set is specified in ISO/IEC 10646 and in the Unicode Standard. However, the Unicode Standard goes much further than ISO/IEC 10646, and gives semantics to the characters, categorizes them, has many useful rules for handling them, and imposes tighter compliance requirements to guarantee the same behavior on different platforms.
As described earlier, current IETF practice for protocols is to use the UTF-8 charset, which maps to the characters in the Unicode Standard and ISO 10646. UTF-8 comes from the Unicode Standard and ISO/IEC 10646, although the definition of UTF-8 that is used in Internet protocols comes from RFC 2279.
The Unicode Consortium has defined a way to label the language of text that is encoded in the Unicode Standard. The document that defines this tagging is Unicode Technical Report #7: Plane 14 Characters for Language Tags. These tags can be used to switch languages within a single block of text; this differs from the MIME tagging defined in RFC 1766, which defines a single language for an entire body part. This kind of embedded tagging is most useful for multi-language text.
For example, a Content-type header might be:
Content-type: text/plain; charset="utf-8"
Recommendation: All body parts that include human-readable text and are created with a Content-type header should include an explicit charset parameter, even if the charset is US-ASCII.
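A mail-creating program could satisfy this recommendation along the following lines. This is a sketch using Python's standard `email.message.EmailMessage`; the function name `build_text_part` is hypothetical.

```python
from email.message import EmailMessage


def build_text_part(body, charset="utf-8"):
    """Create a text/plain part whose Content-type names its charset."""
    msg = EmailMessage()
    # set_content() writes the charset parameter explicitly,
    # including when the charset is us-ascii.
    msg.set_content(body, charset=charset)
    return msg


if __name__ == "__main__":
    print(build_text_part("Grüße\n")["Content-Type"])
    print(build_text_part("Hello\n", charset="us-ascii")["Content-Type"])
```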
The IANA maintains the list of charsets.
There is a strong tendency in the IETF to start using the UTF-8 charset as soon as possible. Of course, there is always a "who starts first" problem with adopting any new technology, but there is general agreement that this is quite important.
Recommendation: All mail-creating programs created or revised after January 1, 1999, must be able to create mail using the UTF-8 charset. Another way to say this is that any program created or revised after January 1, 1999, that cannot create mail using the UTF-8 charset should be considered deficient and lacking in standard internationalization capabilities. Of course, all mail-creating programs should try to meet this requirement as early as possible.
At the time this report is being written, few mail-displaying programs support the UTF-8 charset. Thus, it is not recommended that mail-creating programs should immediately start sending only UTF-8. There are dozens of charsets in wide use throughout the world in currently-deployed MUAs. However, the use of the UTF-7 charset is strongly discouraged.
Mail recipients expect to receive mail that they can view (although most people have gotten used to getting some messages that they cannot view due to charsets that their viewing program does not handle). Thus, senders still need to be able to control the charset that is used when creating messages.
Recommendation: All mail-creating programs that are controlled by humans should allow the sender to choose the charset used to create a message. These programs should also give advice to the user about the different charsets, such as about the likelihood that the recipient will be able to display a particular charset.
Of course, guessing what charset to use for a recipient with unknown capabilities is quite difficult. Even if the recipient has sent a message, the sender cannot assume that the charset used in that message is the best charset they can use. Neither side knows whether the other can handle a more capable charset, so neither can safely be the first to escalate.
Most MUAs have a "default" charset they use for messages. This default might be set based on a number of factors, including the country of origin of the software, the location of the user, settings from the operating system on which the software is being run, and so on. Because the user often knows the capabilities of most of the recipients of mail they send, the user should be able to set the default charset used in new messages.
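One way an MUA might combine a user-set default with the content of a particular message is to walk an ordered preference list and use the first charset that can represent the text. The sketch below assumes such a list; the function name `pick_charset` and the default ordering are illustrative choices, not something this report prescribes.

```python
def pick_charset(text, preferences=("us-ascii", "iso-8859-1", "utf-8")):
    """Return the first preferred charset that can encode all of text."""
    for name in preferences:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # this charset lacks some character; try the next one
        return name
    raise ValueError("no charset in the preference list can encode the text")


if __name__ == "__main__":
    for sample in ("hello", "café", "日本"):
        print(sample, "->", pick_charset(sample))
```

Placing UTF-8 last in the list reflects the caution above about recipients that cannot yet display it; a user who knows their recipients could reorder the list.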
RFC 1766 defines how to create language tags for message body parts, and Unicode Language Tags describes how to mark the language of text that uses the Unicode Standard. Many programs process MIME language information, but Unicode language tags are very new and are handled by very few agents so far. Thus, every mail body part should have a Content-language header if possible, and parts that have more than one language should use UTF-8 and Unicode language tags.
Recommendation: All body parts that include human-readable text and are created with a Content-type header should also include a Content-language header. This practice makes it more likely that programs whose processing depends on the language of a message will handle it correctly. Note that the MIME media type does not define whether or not the content is human-readable, and the Content-language header should be used with all types of human-readable content, not just plain text.
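A sketch of adding the Content-language header when a part is created, again using Python's standard `email` package. The function name `build_tagged_part` is hypothetical, and the language tags are supplied by the caller in RFC 1766 form.

```python
from email.message import EmailMessage


def build_tagged_part(body, languages, charset="utf-8"):
    """Create a text part carrying both a charset and a Content-language."""
    msg = EmailMessage()
    msg.set_content(body, charset=charset)
    # Multiple language tags are written as a comma-separated list.
    msg["Content-Language"] = ", ".join(languages)
    return msg


if __name__ == "__main__":
    print(build_tagged_part("Guten Tag\n", ["de"]))
```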
Recommendation: All plain text body parts that use UTF-8 and have more than one language should use Unicode Language Tags in addition to a Content-language header. However, Unicode Language Tags should only be used with plain text body parts that have more than one language; they should not be used with body parts that have a single language, nor should they be used with structured text body parts such as those coded with HTML.
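The Plane 14 scheme in Unicode Technical Report #7 spells a language tag with invisible tag characters: U+E0001 (LANGUAGE TAG) followed by one tag character for each ASCII character of the tag, offset by 0xE0000. The sketch below builds such a sequence; the function names are hypothetical, and the sample text is only an illustration.

```python
LANGUAGE_TAG = 0xE0001   # U+E0001 LANGUAGE TAG
TAG_OFFSET = 0xE0000     # tag characters mirror ASCII at this offset


def language_tag(tag):
    """Return the Plane 14 character sequence spelling a language tag."""
    if not tag.isascii():
        raise ValueError("language tags are spelled with ASCII characters")
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_OFFSET + ord(c)) for c in tag)


def tag_runs(runs):
    """Join (language, text) pairs into one string with embedded tags."""
    return "".join(language_tag(lang) + text for lang, text in runs)


if __name__ == "__main__":
    mixed = tag_runs([("en", "Good morning. "), ("fr", "Bonjour.")])
    print(len(mixed), mixed.encode("utf-8"))
```

Because every tag character lies outside the Basic Multilingual Plane, each one occupies four octets in UTF-8.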
Recommendation: All mail-creating programs should allow users to use non-ASCII characters in message headers, as described in RFC 2047 and RFC 2231. Headers that conform to these two RFCs are not known to harm any mail-displaying process that does not conform to the RFCs. The charsets used in these headers should be chosen using methods similar to those used for choosing charsets for the bodies of the messages to which they are attached.
There is a strong tendency in the IETF to start using the UTF-8 charset as soon as possible. Of course, there is always a "who starts first" problem with adopting any new technology, but there is general agreement that this is quite important.
Recommendation: All mail-displaying programs created or revised after January 1, 1999, must be able to display mail that uses the UTF-8 charset. Another way to say this is that any program created or revised after January 1, 1999, that cannot display mail using the UTF-8 charset should be considered deficient and lacking in standard internationalization capabilities. Of course, all mail-displaying programs should try to meet this requirement as early as possible. As noted above, programs that display UTF-8 do not have to display all possible UTF-8 characters. In fact, it is likely that only a few programs will be able to display them all, mostly due to display restrictions of the operating systems on which the mail programs run. Therefore, mail-displaying programs should have a method for displaying characters in messages that they can't represent by the correct glyph.
This report does not make any recommendations on how to represent undisplayable characters. However, any mail-displaying program that can understand a charset that it cannot fully display should have some reasonable method for showing undisplayable characters. This might be to use a single glyph that represents every undisplayable character, or it might be to show the underlying encoding for each undisplayable character, or some other method. The Unicode Standard contains a great deal of information on undisplayable characters, and additional suggestions on handling undisplayable characters can be found in section 5.4 of the HTML 4.0 specification from the W3C.
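The two fallback methods mentioned above (a single substitute glyph, or the underlying code for each character) might be sketched as follows. The function name `render`, the `can_display` callback, and the `<U+XXXX>` notation are all assumptions made for illustration; a real program would ask its display system which characters it can show.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def render(text, can_display, mode="glyph"):
    """Replace characters the display cannot show with a fallback."""
    out = []
    for ch in text:
        if can_display(ch):
            out.append(ch)
        elif mode == "glyph":
            out.append(REPLACEMENT)              # one glyph for everything
        else:
            out.append(f"<U+{ord(ch):04X}>")     # show the code point
    return "".join(out)


if __name__ == "__main__":
    ascii_only = lambda ch: ord(ch) < 128  # stand-in for a limited display
    print(render("Tokyo 東京", ascii_only))
    print(render("Tokyo 東京", ascii_only, mode="code"))
```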
Recommendation: All SMTP servers should support the 8BITMIME extension, as described in RFC 1652.
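A server that supports this extension lists the keyword 8BITMIME in its reply to the EHLO command. The sketch below checks a reply that has already been received as text; the function name `advertises_8bitmime` and the sample reply are hypothetical. With Python's `smtplib`, the equivalent check on a live connection is `SMTP.has_extn("8bitmime")` after calling `ehlo()`.

```python
def advertises_8bitmime(ehlo_reply):
    """Return True if an EHLO reply lists the 8BITMIME keyword."""
    lines = ehlo_reply.splitlines()
    for line in lines[1:]:  # the first line is the server's greeting
        if not line.startswith("250"):
            continue
        # Skip the "250-" or "250 " prefix; the keyword is the first word.
        keyword = line[4:].split(" ", 1)[0]
        if keyword.upper() == "8BITMIME":  # keywords are case-insensitive
            return True
    return False


if __name__ == "__main__":
    reply = "250-mail.example.com\r\n250-8BITMIME\r\n250 SIZE 10240000\r\n"
    print(advertises_8bitmime(reply))
```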
One concern that many people outside the US have is that MUAs will send and receive UTF-8, but only handle the portion of UTF-8 that overlaps with US-ASCII. Others worry that some MUAs will support only the characters in the iso-8859-1 charset and claim that they handle international characters. Clearly, it is difficult for an MUA to handle characters beyond what the operating system it is running under can show. As more and more MUAs start sending UTF-8, however, there will be a world-wide expectation that recipients will be able to view messages. Thus, every MUA maker should not only try to handle UTF-8, but should also work hard to display as many characters within that charset as possible.
Because the Unicode Standard covers almost every known character used anywhere in the world today, it makes a good "central platform" for internationalization. The Unicode Consortium has freely-available transcoding tables for all common charsets to and from the Unicode Standard. This means that software that uses the Unicode Standard as its core character set can transcode to and from any common charset easily using the transcoding tables. It also means that all software that uses these mapping tables will convert from one charset to another in an identical fashion.
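Using Unicode as the pivot reduces transcoding between any two charsets to two table lookups: source charset to Unicode, then Unicode to target charset. A minimal sketch, relying on the mapping tables built into Python's codecs (the function name `transcode` is hypothetical):

```python
def transcode(data, source, target):
    """Convert octets from one charset to another by way of Unicode."""
    text = data.decode(source)   # source charset -> Unicode
    return text.encode(target)   # Unicode -> target charset


if __name__ == "__main__":
    koi8 = "Привет".encode("koi8-r")
    print(transcode(koi8, "koi8-r", "iso-8859-5"))
    print(transcode(b"\xe9", "iso-8859-1", "utf-8"))
```

The encode step raises `UnicodeEncodeError` when the target charset lacks a character, which is the point at which a program would need to fall back to a more capable charset.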
Recommendation: All mail-creating and mail-displaying programs created or revised after January 1, 1999, should be able to handle many common charsets in addition to UTF-8. Another way to say this is that any mail-creating and mail-displaying program created or revised after January 1, 1999, that cannot handle a wide variety of common charsets should be considered deficient and lacking in standard internationalization capabilities. Of course, all mail-creating and mail-displaying programs should try to meet this requirement as early as possible.
Topic | Recommendation |
---|---|
Explicit charset parameter | All body parts that include human-readable text and are created with a Content-type header should include an explicit charset parameter, even if the charset is US-ASCII. |
Sending UTF-8 | All mail-creating programs created or revised after January 1, 1999, must be able to create mail using the UTF-8 charset. Another way to say this is that any program created or revised after January 1, 1999, that cannot create mail using the UTF-8 charset should be considered deficient and lacking in standard internationalization capabilities. Of course, all mail-creating programs should try to meet this requirement as early as possible. |
Choosing charsets on creation | All mail-creating programs that are controlled by humans should allow the sender to choose the charset used to create a message. These programs should also give advice to the user about the different charsets, such as about the likelihood that the recipient will be able to display a particular charset. |
Specifying languages | All body parts that include human-readable text and are created with a Content-type header should also include a Content-language header. This practice makes it more likely that programs whose processing depends on the language of a message will handle it correctly. Note that the MIME media type does not define whether or not the content is human-readable, and the Content-language header should be used with all types of human-readable content, not just plain text. |
Multi-language text | All plain text body parts that use UTF-8 and have more than one language should use Unicode Language Tags in addition to a Content-language header. However, Unicode Language Tags should only be used with plain text body parts that have more than one language; they should not be used with body parts that have a single language, nor should they be used with structured text body parts such as those coded with HTML. |
Non-ASCII headers | All mail-creating programs should allow users to use non-ASCII characters in message headers, as described in RFC 2047 and RFC 2231. Headers that conform to these two RFCs are not known to harm any mail-displaying process that does not conform to the RFCs. The charsets used in these headers should be chosen using methods similar to those used for choosing charsets for the bodies of the messages to which they are attached. |
Displaying UTF-8 | All mail-displaying programs created or revised after January 1, 1999, must be able to display mail that uses the UTF-8 charset. Another way to say this is that any program created or revised after January 1, 1999, that cannot display mail using the UTF-8 charset should be considered deficient and lacking in standard internationalization capabilities. Of course, all mail-displaying programs should try to meet this requirement as early as possible. |
MTAs and 8-bit content | All SMTP servers should support the 8BITMIME extension, as described in RFC 1652. |
Handling all common charsets | All mail-creating and mail-displaying programs created or revised after January 1, 1999, should be able to handle many common charsets in addition to UTF-8. Another way to say this is that any mail-creating and mail-displaying program created or revised after January 1, 1999, that cannot handle a wide variety of common charsets should be considered deficient and lacking in standard internationalization capabilities. Of course, all mail-creating and mail-displaying programs should try to meet this requirement as early as possible. |
Name | More Information |
---|---|
Internet Assigned Numbers Authority (IANA) | <http://www.isi.edu/iana/> |
Internet Engineering Task Force (IETF) | <http://www.ietf.org/> |
Internet Mail Consortium (IMC) | <http://www.imc.org/> |
International Organization for Standardization (ISO) | <http://www.iso.ch/> |
Unicode Consortium | <http://www.unicode.org/> |
World Wide Web Consortium (W3C) | <http://www.w3.org/> |
The following is a list of the main standards discussed in this report. A more complete list that includes historical and other documents can be found on the IMC web site at <http://www.imc.org/imc-intl/>.
Reference | Title | Where Available |
---|---|---|
Charset list | Official names for character sets | <http://www.isi.edu/in-notes/iana/assignments/character-sets> |
HTML 4 | HTML 4.0 Specification | <http://www.w3.org/TR/REC-html40/> |
ISO 639:1988 | Code for the Representation of Names of Languages | National ISO member bodies. An unofficial but useful version is at <http://www.unicode.org/unicode/onlinedat/languages.html> |
ISO/IEC 2022:1994 | Information Technology -- Character Code Structure and Extension Techniques | National ISO member bodies. The registry of the character sets that are referred to in this document can be found at <http://www.itscj.ipsj.or.jp/ISO-IR/> |
ISO 3166-1:1997 | Codes for the Representation of Names of Countries | National ISO member bodies. An unofficial but useful version is at <http://www.unicode.org/unicode/onlinedat/countries.html> |
ISO/IEC 8859 | Information Processing -- 8-bit Single-Byte Coded Graphic Character Sets (many parts for different alphabets) | National ISO member bodies. The official Unicode mappings for ISO 8859 are at <ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/> |
ISO 10646-1:1993 | Information Technology - Universal Multiple-Octet Coded Character Set (UCS) - Part 1: Architecture and Basic Multilingual Plane and its amendments | National ISO member bodies |
MIME | Multipurpose Internet Mail Extensions (RFCs 2045-2049) | <http://www.imc.org/rfc2045>, <http://www.imc.org/rfc2046>, <http://www.imc.org/rfc2047>, <http://www.imc.org/rfc2048>, <http://www.imc.org/rfc2049> |
RFC 1652 | SMTP Service Extension for 8bit-MIMEtransport | <http://www.imc.org/rfc1652> |
RFC 1766 | Tags for the Identification of Languages | <http://www.imc.org/rfc1766> |
RFC 2231 | MIME Parameter Value and Encoded Word Extensions: Character Sets, Languages, and Continuations | <http://www.imc.org/rfc2231> |
RFC 2277 | IETF Policy on Character Sets and Languages | <http://www.imc.org/rfc2277> |
RFC 2278 | IANA Charset Registration Procedures | <http://www.imc.org/rfc2278> |
RFC 2279 | UTF-8, a Transformation Format of Unicode and ISO 10646 | <http://www.imc.org/rfc2279> |
Unicode Standard | Unicode Standard, version 2.1 | Some parts of the version 2.0 standard are available at <http://www.unicode.org/unicode/standard/standard.html>. The paper edition (with CD-ROM) of version 2.0 from Addison Wesley is ISBN 0-201-48345-9, and is available from most online bookstores. The differences between version 2.0 and version 2.1 are detailed at <http://www.unicode.org/unicode/reports/tr8.html>. Other Unicode technical reports which may be of interest can be found at <http://www.unicode.org/unicode/reports/techreports.html>. |
Unicode Language Tags | Unicode Technical Report #7: Plane 14 Characters for Language Tags | <http://www.unicode.org/unicode/reports/tr7.html> |
The Internet Mail Consortium is an industry trade association for companies participating in the Internet mail market. To give feedback or to get more information on IMC reports, send mail to <mailto:reports@imc.org>. For information on the Internet Mail Consortium, please visit our web site at <http://www.imc.org/>, or call us at +1 (831) 426-9827.