RFC 7997 - The Use of Non-ASCII Characters in RFCs

In order to support the internationalization of protocols and a more diverse Internet community, the RFC Series must evolve to allow for the use of non-ASCII characters in RFCs. While English remains the required language of the Series, the encoding of future RFCs will be in UTF-8, allowing for a broader range of characters than typically used in the English language. This document describes the RFC Editor requirements and gives guidance regarding the use of non-ASCII characters in RFCs.¶

3. Rules for the Use of Non-ASCII Characters

This section describes the guidelines for the use of non-ASCII characters in an RFC. If the RFC Editor identifies areas where the use of non-ASCII characters negatively impacts the readability of the text, they will request alternate text.¶

The RFC Editor may, in cases of entire words represented in non-ASCII characters, ask for a set of reviewers to verify the meaning, spelling, characters, and grammar of the text.¶

3.1. General Usage throughout a Document

Where the use of non-ASCII characters is purely part of an example and not otherwise required for correct protocol operation, escaping the non-ASCII character is not required. Note, however, that as the language of the RFC Series is English, the use of non-ASCII characters is based on the spelling of words commonly used in the English language following the guidance in the Merriam-Webster dictionary [MerrWeb].¶

The RFC Editor will use the primary spelling listed in that dictionary by default.¶

Example of non-ASCII characters that do not require escaping (example from Section 3.1.1.12 of RFC 4475 [RFC4475], with a hex dump replaced by the actual character glyphs): ¶

This particular response contains unreserved and non-ASCII
UTF-8 characters.
This response is well formed.  A parser must accept this message.

Message Details : unreason

SIP/2.0 200 = 2**3 * 5**2 но сто девяносто девять - простое
Via: SIP/2.0/UDP 192.0.2.198;branch=z9hG4bK1324923
Call-ID: unreason.1234ksdfak3j2erwedfsASdf
CSeq: 35 INVITE
From: sip:user@example.com;tag=11141343
To: sip:user@example.edu;tag=2229 Content-Length: 154
Content-Type: application/sdp

¶⧉

3.2. Person Names

Person names may appear in several places within an RFC (e.g., the header, Acknowledgements, and References). When a script outside the Unicode Latin blocks [UNICODE-CHART] is used for an individual name, an author-provided, ASCII-only identifier will appear immediately after the non-Latin characters, surrounded by parentheses. This will improve general readability of the text.¶

Example header:¶

OLD:¶


Internet Engineering Task Force (IETF)                       J. Tong
Request for Comments: 7380                                C. Bi, Ed.
Category: Standards Track                              China Telecom
ISSN: 2070-1721                                              R. Even
                                                          Q. Wu, Ed.
                                                            R. Huang
                                                              Huawei
                                                       November 2014

¶⧉

PROPOSED/NEW:¶


Internet Engineering Task Force (IETF)                       J. Tong
Request for Comments: 7380                                C. Bi, Ed.
Category: Standards Track                              China Telecom
ISSN: 2070-1721                                   רנוי אןב (R. Even)
                                                    吴钦 (Q. Wu), Ed.
                                                            R. Huang
                                                              Huawei
                                                       November 2014

¶⧉

Example Acknowledgements section:¶

OLD:¶

The following people contributed significant text to early versions of this draft: Patrik Faltstrom, William Chan, and Fred Baker.¶

PROPOSED/NEW:¶

The following people contributed significant text to early versions of this draft: Patrik Fältström (Faltstrom), 陈智昌 (William Chan), and Fred Baker.¶

Example reference entry:¶

OLD:¶

   [RFC6630]  Cao, Z., Deng, H., Wu, Q., and G. Zorn, Ed., "EAP
              Re-authentication Protocol Extensions for Authenticated
              Anticipatory Keying (ERP/AAK)", RFC 6630, June 2012.

¶⧉

NEW¶

   [RFC6630]  Cao, Z., Deng, H., 吴钦 (Wu, Q.), and G. Zorn, Ed., "EAP
              Re-authentication Protocol Extensions for Authenticated
              Anticipatory Keying (ERP/AAK)", RFC 6630, June 2012.

¶⧉

3.3. Company Names

Company names may appear in several places within an RFC. In all cases, valid Unicode is required. For names that include characters outside of the Unicode Latin and Latin Extended scripts, an author-provided, ASCII-only identifier is required to assist in searching and indexing of the document.¶

3.4. Body of the Document

When the mention of non-ASCII characters is required for correct protocol operation and understanding, the characters' Unicode code points must be used in the text. The addition of each character name is encouraged.¶

Non-ASCII characters will require identifying the Unicode code point.
Use of the actual UTF-8 character (e.g., Δ) is encouraged so that a reader can more easily see what the character is, if their device can render the text.
The use of the Unicode character names like "INCREMENT" in addition to the use of Unicode code points is also encouraged. When used, Unicode character names should be in all capital letters.

Examples:¶

OLD [RFC7564]:¶

However, the problem is made more serious by introducing the full
range of Unicode code points into protocol strings.  For example,
the characters U+13DA U+13A2 U+13B5 U+13AC U+13A2 U+13AC U+13D2 from
the Cherokee block look similar to the ASCII characters
"STPETER" as they might appear when presented using a "creative"
font family.

¶⧉

NEW/ALLOWED:¶

However, the problem is made more serious by introducing the full
range of Unicode code points into protocol strings. For example,
the characters U+13DA U+13A2 U+13B5 U+13AC U+13A2 U+13AC U+13D2
(ᏚᎢᎵᎬᎢᎬᏒ) from the Cherokee block look similar to the ASCII
characters "STPETER" as they might appear when presented using a
"creative" font family.

¶⧉

ALSO ACCEPTABLE:¶

However, the problem is made more serious by introducing the full
range of Unicode code points into protocol strings. For example,
the characters "ᏚᎢᎵᎬᎢᎬᏒ" (U+13DA U+13A2 U+13B5 U+13AC U+13A2
U+13AC U+13D2) from the Cherokee block look similar to the ASCII
characters "STPETER" as they might appear when presented using a
"creative" font family.

¶⧉

Example of proper identification of Unicode characters in an RFC:¶

Acceptable:¶

Temperature changes in the Temperature Control Protocol are indicated by the U+2206 character.

Preferred:¶

Temperature changes in the Temperature Control Protocol are indicated by the U+2206 character ("Δ").
Temperature changes in the Temperature Control Protocol are indicated by the U+2206 character (INCREMENT).
Temperature changes in the Temperature Control Protocol are indicated by the U+2206 character ("Δ", INCREMENT).
Temperature changes in the Temperature Control Protocol are indicated by the U+2206 character (INCREMENT, "Δ").
Temperature changes in the Temperature Control Protocol are indicated by the "Delta" character "Δ" (U+2206).
Temperature changes in the Temperature Control Protocol are indicated by the character "Δ" (INCREMENT, U+2206).

Which option of (1), (2), (3), (4), (5), or (6) is preferred may depend on context and the specific character(s) in question. All are acceptable within an RFC. "US-ASCII Escaping of Unicode Character" [BCP137] describes the pros and cons of different options for identifying Unicode characters and may help authors decide how to represent the non-ASCII characters in their documents.¶

3.5. Tables

Tables follow the same rules for identifiers and characters as in "Body of the Document" (Section 3.4). If it is sensible (i.e., more understandable for a reader) for a given document to have two tables, -- one including the identifiers and non-ASCII characters and a second with just the non-ASCII characters -- then that will be allowed at the discretion of the authors.¶

Original text from "Preparation, Enforcement, and Comparison of Internationalized Strings Representing Usernames and Passwords" [RFC7613].¶

Table 3: A sample of legal passwords

+------------------------------------+------------------------------+
| # | Password                       | Notes                        |
+------------------------------------+------------------------------+
| 12| <correct horse battery staple> | ASCII space is allowed       |
+------------------------------------+------------------------------+
| 13| <Correct Horse Battery Staple> | Different from example 12    |
+------------------------------------+------------------------------+
| 14| <&#x3C0;&#xDF;&#xE5;>          | Non-ASCII letters are OK     |
|   |                                | (e.g., GREEK SMALL LETTER    |
|   |                                | PI, U+03C0)                  |
+------------------------------------+------------------------------+
| 15| <Jack of &#x2666;s>            | Symbols are OK (e.g., BLACK  |
|   |                                | DIAMOND SUIT, U+2666)        |
+------------------------------------+------------------------------+
| 16| <foo&#x1680;bar>               | OGHAM SPACE MARK, U+1680, is |
|   |                                | mapped to U+0020 and thus    |
|   |                                | the full string is mapped to |
|   |                                | <foo bar>                    |
+------------------------------------+------------------------------+

¶⧉

Preferred text:¶

Table 3: A sample of legal passwords

+------------------------------------+------------------------------+
| # | Password                       | Notes                        |
+------------------------------------+------------------------------+
| 12| <correct horse battery staple> | ASCII space is allowed       |
+------------------------------------+------------------------------+
| 13| <Correct Horse Battery Staple> | Different from example 12    |
+------------------------------------+------------------------------+
| 14| <πß๗>                          | Non-ASCII letters are OK     |
|   |                                | (e.g., GREEK SMALL LETTER    |
|   |                                | PI, U+03C0; LATIN SMALL      |
|   |                                | LETTER SHARP S, U+00DF; THAI |
|   |                                | DIGIT SEVEN, U+0E57)         |
+------------------------------------+------------------------------+
| 15| <Jack of ♦s>                   | Symbols are OK (e.g., BLACK  |
|   |                                | DIAMOND SUIT, U+2666)        |
+------------------------------------+------------------------------+
| 16| <foo bar>                      | OGHAM SPACE MARK, U+1680, is |
|   |                                | mapped to U+0020 and thus    |
|   |                                | the full string is mapped to |
|   |                                | <foo bar>                    |
+------------------------------------+------------------------------+

¶⧉

3.6. Code Components

The RFC Editor encourages the use of the U+ notation except within a code component where one must follow the rules of the programming language in which the code is being written.¶

Code components are generally expected to use fixed-width fonts. Where such fonts are not available for a particular script, the best script-appropriate font will be used for that part of the code component.¶

3.7. Bibliographic Text

The reference entry must be in English; whatever subfields are present must be available in ASCII-encoded characters. For references to RFCs and Internet-Drafts, the author's name will be formatted in the reference as per current RFC Style Guide recommendations. As long as good sense is used, the reference entry may also include non-ASCII characters at the author's discretion and as provided by the author. The RFC Editor may request that a third party, such as a language specialist or subject matter expert, review of any non-ASCII reference. This applies to both normative and informative references.¶

Example:¶

[GOST3410] "Information technology. Cryptographic data security.
           Signature and verification processes of [electronic]
           digital signature.", GOST R 34.10-2001, Gosudarstvennyi
           Standard of Russian Federation, Government Committee of
           Russia for Standards, 2001. (In Russian)

¶⧉

Allowable addition to the above citation:¶

           "Информационная технология. Криптографическая защита
           информации. Процессы формирования и проверки
           электронной цифровой подписи", GOST R 34.10-2001,
           Государственный стандарт Российской Федерации, 2001.

¶⧉

Alternatively:¶

[GOST3410] "Information technology. Cryptographic data security.
           Signature and verification processes of [electronic]
           digital signature.", GOST R 34.10-2001, Gosudarstvennyi
           Standard of Russian Federation, Правительственная
           комиссия России по стандартам (Government Committee of
           Russia for Standards), 2001. (In Russian)

¶⧉

3.8. Keywords and Citation Tags

Keywords (as tagged with the <keyword> element in XML) and citation tags (as defined in the anchor attributes of <reference> elements) must contain only ASCII characters.¶

3.9. Address Information

The purpose of providing address information, either postal or email, is to assist readers of an RFC in contacting the author or authors. Authors may include the official postal address as recognized by their company or local postal service without additional non-ASCII character escapes. If the email address includes non-ASCII characters and is a valid email address at the time of publication, non-ASCII character escapes are not required.¶

Example:¶

  Qin Wu (editor)
  Huawei
  101 Software Avenue, Yuhua District
  Nanjing, Jiangsu  210012
  China

Additional contact information:
  吴钦 (editor)
  华为技术有限公司
  ⾬雨花区软件⼤大道101号
  江苏南京 210012
  中国

------

  Roni Even
  Huawei
  14 David Hamelech
  Tel Aviv  64953
  Israel

Additional contact information:
  רוני אבן    
  וואווי
  דוד המלך 14   
  תל אביב 64953   
  ישראל

¶⧉

The Use of Non-ASCII Characters in RFCs

Abstract

Status of this Memo

Copyright Notice

1. Introduction

2. Basic Requirements

3. Rules for the Use of Non-ASCII Characters

3.1. General Usage throughout a Document

3.2. Person Names

3.3. Company Names

3.4. Body of the Document

3.5. Tables

3.6. Code Components

3.7. Bibliographic Text

3.8. Keywords and Citation Tags

3.9. Address Information

4. Normalization Forms

5. XML Markup

6. Internationalization Considerations

7. Security Considerations

8. Informative References

IAB Members at the Time of Approval

Acknowledgements

Author's Address