RFC 9233: Internationalized Domain Names for Applications 2008 (IDNA2008) and Unicode 12.0.0
- P. Fältström
Abstract
This document describes the changes between Unicode 6.0.0 and
Unicode 12.0.0 in the context of the current version of
Internationalized Domain Names for Applications 2008 (IDNA2008).
To improve understanding, this document describes systems that are being used as alternatives to those that conform to IDNA2008.¶
Status of This Memo
This is an Internet Standards Track document.¶
This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Further information on Internet Standards is available in Section 2 of RFC 7841.¶
Information about the current status of this document, any
errata, and how to provide feedback on it may be obtained at
https://
Copyright Notice
Copyright (c) 2022 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://
1. Introduction
The current version of Internationaliz
The derived property values that can be calculated are defined in RFC 5892 [RFC5892]. Below is a summary to aid in the reading of this document. For definition of the terms, please see RFC 5892 [RFC5892].¶
- PROTOCOL VALID:
- Those that are allowed to be used in IDNs. Code points with this property value are permitted for general use in IDNs. However, the fact that a label consists only of code points with this property value does not imply that the label can be used in DNS. The abbreviated term PVALID is used to refer to this value.¶
- CONTEXTUAL RULE REQUIRED:
- Some characteristics of the character, such as it being invisible in certain contexts or problematic in others, require that it not be used in labels unless specific other characters or properties are present. The abbreviated term CONTEXT is used to refer to this value. As explained in RFC 5892 [RFC5892], CONTEXT is in turn divided into CONTEXTJ and CONTEXTO.¶
- DISALLOWED:
- Those that should clearly not be included in IDNs. Code points with this property value are not permitted in IDNs.¶
- UNASSIGNED:
- Those code points that are not designated (i.e., are unassigned) in the Unicode Standard.¶
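As an illustration of the summary above, the derived property values can be modeled as a small enumeration. This is a minimal sketch, not part of IDNA2008: the names `DerivedProperty`, `SAMPLE_TABLE`, `lookup`, and `label_passes_code_point_check` are hypothetical, and the table is a hand-written excerpt rather than a table calculated from the Unicode Character Database. Treating every code point missing from the excerpt as UNASSIGNED is a simplification made only for this sketch.

```python
from enum import Enum


class DerivedProperty(Enum):
    """Derived property values as summarized above (terms from RFC 5892)."""
    PVALID = "PVALID"
    CONTEXTJ = "CONTEXTJ"
    CONTEXTO = "CONTEXTO"
    DISALLOWED = "DISALLOWED"
    UNASSIGNED = "UNASSIGNED"


# Hypothetical hand-written excerpt; a real table is calculated from the
# Unicode Character Database by the algorithm in RFC 5892.
SAMPLE_TABLE = {
    0x0061: DerivedProperty.PVALID,      # LATIN SMALL LETTER A
    0x0041: DerivedProperty.DISALLOWED,  # LATIN CAPITAL LETTER A
    0x200D: DerivedProperty.CONTEXTJ,    # ZERO WIDTH JOINER
    0x00B7: DerivedProperty.CONTEXTO,    # MIDDLE DOT
}


def lookup(code_point, table=SAMPLE_TABLE):
    """Return the derived property value for a code point.

    Code points missing from the excerpt are reported as UNASSIGNED,
    which is a simplification made only for this sketch.
    """
    return table.get(code_point, DerivedProperty.UNASSIGNED)


def needs_contextual_rule(value):
    """True for CONTEXTJ and CONTEXTO, the two forms of CONTEXT."""
    return value in (DerivedProperty.CONTEXTJ, DerivedProperty.CONTEXTO)


def label_passes_code_point_check(label, table=SAMPLE_TABLE):
    """True if every code point in the label is PVALID.

    This is a necessary condition only: a label made up solely of PVALID
    code points is not thereby usable in DNS.
    """
    return all(lookup(ord(ch), table) is DerivedProperty.PVALID for ch in label)
```

In this sketch, a label containing U+0041 fails the check, and a label containing U+200D fails as well because contextual rules are not evaluated here.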
When the Unicode Standard is updated, new code points are assigned and already assigned code points can have their property values changed.¶
There were three incompatible changes in the Unicode Standard between Unicode 5.2.0 [Unicode-5.2.0] and Unicode 6.0.0 [Unicode-6.0.0]; they are described in RFC 6452 [RFC6452]. The code points U+0CF1 and U+0CF2 had a derived property value change from DISALLOWED to PVALID, and the code point U+19DA had a change in derived property value from PVALID to DISALLOWED. These changes were examined in great detail, but the IETF concluded that these changes to the Unicode Standard did not warrant an update to RFC 5892 [RFC5892].
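The distinction between a new assignment and an incompatible change can be sketched as a comparison of two derived property tables. The function name `classify_changes` and the two table excerpts are hypothetical; the excerpts contain only the three code points named above, with the values stated in this section.

```python
def classify_changes(old, new):
    """Compare two derived property tables (code point -> value).

    Returns two dicts: changes away from UNASSIGNED (new assignments) and
    changes between other values (incompatible changes).
    """
    assignments, incompatible = {}, {}
    for cp in sorted(set(old) | set(new)):
        before = old.get(cp, "UNASSIGNED")
        after = new.get(cp, "UNASSIGNED")
        if before == after:
            continue
        target = assignments if before == "UNASSIGNED" else incompatible
        target[cp] = (before, after)
    return assignments, incompatible


# Excerpts limited to the three code points discussed in RFC 6452.
TABLE_5_2_0 = {0x0CF1: "DISALLOWED", 0x0CF2: "DISALLOWED", 0x19DA: "PVALID"}
TABLE_6_0_0 = {0x0CF1: "PVALID", 0x0CF2: "PVALID", 0x19DA: "DISALLOWED"}

assignments, incompatible = classify_changes(TABLE_5_2_0, TABLE_6_0_0)
for cp, (before, after) in incompatible.items():
    print(f"U+{cp:04X}: {before} -> {after}")
```

With these excerpts, all three code points are reported as incompatible changes and none as new assignments.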
As described in Section 3, more incompatible changes have been made to code points between Unicode 6.0.0 and Unicode 12.0.0 [Unicode-12.0.0]; however, the changes in the derived property values do not result in exceptions (as defined in Section 2.6 of RFC 5892 [RFC5892]) that would require an update to the "IDNA Contextual Rules" registry (which would also be considered an update to RFC 5892 [RFC5892]).¶
Further, in 2015, the Internet Architecture Board (IAB) issued a statement [IAB2005-1] that advised the community to avoid using any of the potentially problematic code points and asked the IETF to resolve the issues related to the code point ARABIC LETTER BEH WITH HAMZA ABOVE (U+08A1) that was introduced in Unicode 7.0.0 [Unicode-7.0.0]. In February of that year, the statement was revised [IAB2005-2] to focus on the latter request. More details about the problem of code point sequences not normalizing as one might expect appear in a draft that was part of the discussion [IDNA7].¶
The result of the work in the IETF was that no exception was added to RFC 5892 [RFC5892]; however, it should be noted that the review of the issues around U+08A1 indicated that this code point is not an isolated case and that a number of long-standing PVALID code points may have similar issues. While the affected code points remain PVALID in this document, identification of the problem resulted in a clarification of the review process for new Unicode versions. That clarification, which reinforces the original review plan to capture issues like these, was published as RFC 8753 [RFC8753]. Any review of Unicode versions after 12.0.0 should be made according to RFC 8753 [RFC8753]; an objective of this document is to ensure that a proper review of such versions after version 12.0.0 can be made.¶
2. Background
2.1. IDNA2008 Documents
IDNA2008 consists of the following documents. The documents in the set have informal names.¶
2.2. Additional Important IDNA2008-Related Documents
There are other documents important for the understanding and functioning of IDNA2008, for example this.¶
2.3. Deployment
There are many variations on the general IDNA model in use in the various parts of the community. The following lists some of the strategies that implementations that claim to be IDNA compliant are known to use, but it should be noted the list is not complete:¶
In practice, the Unicode Consortium creates a maximum set of code points by assigning code points in the Unicode Standard. The IDNA2008 rules use the Unicode Standard to create a further subset of code points and context that are permitted in DNS labels associated with its PVALID and CONTEXT (CONTEXTJ or CONTEXTO) derived property values. DNS registries and other organizations that deal with IDNs are supposed to create their own subsets from IDNA2008 for use by those registries and organizations.¶
This progressive subsetting and narrowing of the repertoire of code points that can be used in labels is an implementation of the principles of being conservative when deciding what code points to include in such a subset. SAC-084 [SAC-084] and RFC 6912 [RFC6912] recommend to DNS registries and other organizations to be conservative when creating their subsets and to use the principle of creating subsets by inclusion.¶
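Subsetting by inclusion can be sketched as follows. This is an illustration under stated assumptions, not a registry policy: `IDNA_PVALID_SAMPLE` is a hypothetical excerpt of code points that are PVALID under IDNA2008, `REGISTRY` is an invented repertoire, and the helper names are made up for this example.

```python
import string

ASCII_LDH = [ord(c) for c in string.ascii_lowercase + string.digits + "-"]

# Hypothetical excerpt of PVALID code points; a real implementation would
# take this set from the derived property value tables.
IDNA_PVALID_SAMPLE = frozenset(
    ASCII_LDH + [0x00DF, 0x00E4, 0x00E5, 0x00F6]  # sharp s, a/o with diaeresis, a with ring
)


def build_repertoire(registry_choice, idna_permitted=IDNA_PVALID_SAMPLE):
    """Build a registry repertoire by inclusion.

    Every code point must be listed explicitly by the registry and must
    also be permitted by the (sample) IDNA2008 set.
    """
    outside = sorted(set(registry_choice) - idna_permitted)
    if outside:
        listed = ", ".join(f"U+{cp:04X}" for cp in outside)
        raise ValueError(f"not in the IDNA2008 sample set: {listed}")
    return frozenset(registry_choice)


def label_in_repertoire(label, repertoire):
    """True if every code point of the label was explicitly included."""
    return all(ord(ch) in repertoire for ch in label)


# Invented registry policy: LDH plus U+00E4, U+00E5, and U+00F6.
REGISTRY = build_repertoire(ASCII_LDH + [0x00E4, 0x00E5, 0x00F6])
```

In this sketch, U+00DF is in the IDNA2008 sample set but is still refused by the registry because the registry did not include it, which is the narrowing effect described above.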
See also Security Considerations (Section 7) in this document.¶
3. Notable Changes between Unicode 6.0.0 and 12.0.0
Among the changes between the Unicode versions, most code
points that change derived property value change from
UNASSIGNED to PVALID or from UNASSIGNED to DISALLOWED. The
changes of interest are those of any other kind. All changes
between the major versions of Unicode can be found in
Appendix A (6.0.0-7.0.0), Appendix B (7.0.0-8.0.0),
Appendix C (8.0.0-9.0.0), Appendix D (9.0.0-10.0.0),
Appendix E (10.0.0-11.0.0), and Appendix F (11.0.0-12.0.0).
3.1. Changes between Unicode 6.0.0 and 7.0.0
Change in number of characters in each category:¶
There are no changes made to Unicode between version 6.0.0 and 7.0.0 that impact IDNA2008 calculation of the derived property values.¶
The code points U+17B4 KHMER VOWEL INHERENT AQ and U+17B5
KHMER VOWEL INHERENT AA both changed the General Category
from Cf (Format) to Mn
The character ARABIC LETTER BEH WITH HAMZA ABOVE (U+08A1) was introduced in Unicode 7.0.0. This was discussed extensively in the IETF and also by the IAB in their statement [IAB2005-1] requesting the IETF to investigate the issue. Specifically, the IAB stated:¶
On the same precautionary principle, the IAB recommends that the Internationalized Domain Names for Applications (IDNA) Parameters registry <https://www.iana.org/assignments/idna-tables/> not be updated to Unicode 7.0.0 until the IETF has consensus on a solution to this problem.
The discussion in the IETF concluded that although it is possible to create "the same" character in multiple ways, the issue with U+08A1 is not unique. The character U+08A1 (ARABIC LETTER BEH WITH HAMZA ABOVE) can be represented with the sequence ARABIC LETTER BEH (U+0628) and ARABIC HAMZA ABOVE (U+0654). This is analogous to the case of LATIN SMALL LETTER O WITH STROKE (U+00F8), which can be represented with the sequence LATIN SMALL LETTER O (U+006F) followed by COMBINING SHORT SOLIDUS OVERLAY (U+0337).
Although the discussion about this specific code point resulted in acceptance of the derived property value of PVALID, the underlying problem with combining sequences is not understood fully. Therefore, it cannot be claimed that this case can be extrapolated to other situations and other code points.¶
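The behavior discussed above can be observed with Python's standard `unicodedata` module. This is a minimal sketch; the helper name `composes_to` is hypothetical. It checks whether Normalization Form C (NFC) turns a combining sequence into the single code point. A sequence with a canonical composition (o followed by COMBINING DIAERESIS) is included for contrast.

```python
import unicodedata


def composes_to(sequence, precomposed):
    """True if NFC normalization of `sequence` yields `precomposed`."""
    return unicodedata.normalize("NFC", sequence) == precomposed


# ARABIC LETTER BEH + ARABIC HAMZA ABOVE versus U+08A1
beh_hamza_sequence = "\u0628\u0654"
beh_hamza_single = "\u08A1"

# LATIN SMALL LETTER O + COMBINING SHORT SOLIDUS OVERLAY versus U+00F8
o_stroke_sequence = "o\u0337"
o_stroke_single = "\u00F8"

# Contrast: LATIN SMALL LETTER O + COMBINING DIAERESIS versus U+00F6,
# a pair that is canonically equivalent.
o_diaeresis_sequence = "o\u0308"
o_diaeresis_single = "\u00F6"

for name, seq, single in [
    ("beh with hamza above", beh_hamza_sequence, beh_hamza_single),
    ("o with stroke", o_stroke_sequence, o_stroke_single),
    ("o with diaeresis", o_diaeresis_sequence, o_diaeresis_single),
]:
    print(f"{name}: NFC composes sequence = {composes_to(seq, single)}")
```

Only the last pair composes under NFC. For the first two pairs, the sequence and the single code point remain distinct strings after normalization, so a comparison based on normalized forms treats them as different labels.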
3.2. Changes between Unicode 7.0.0 and 10.0.0
Change in number of characters in each category:¶
There are no changes made to Unicode between version 7.0.0 and 10.0.0 that impact IDNA2008 calculation of the derived property values.¶
3.3. Changes between Unicode 10.0.0 and 11.0.0
Change in number of characters in each category:¶
These changes to the Unicode Standard have the following implications for these code points:¶
3.4. Changes between Unicode 11.0.0 and 12.0.0
Change in number of characters in each category:¶
4. U+111C9 SHARADA SANDHI MARK
As one can see in Section 3, an incompatible
property change was made between Unicode 6.0.0 and 12.0.0,
affecting the code point U+111C9. Its derived property value
thus changed from DISALLOWED to PVALID. In situations like
these, IDNA2008 allows for addition of rules to RFC 5892 [RFC5892], Section 2.7. If the code
point is accepted, it might still be rejected if validated by
software based on versions of Unicode older than 12.0.0. As
the character is rarely used outside the group of Sharada
specialists but is used in some records for indicating sandhi
breaks, the conclusion was that it could either be added as an
exception or allowed to change its property value. As
including an exception would require implementation changes to
deployments of IDNA2008, the IETF has decided not to
add a Backward
5. Conclusion
As described in Sections 3 and 4, changes have been made to Unicode between version 6.0.0 and 12.0.0. Some changes to specific characters changed their derived property value, whereas other changes did not. Given the deployment considerations described in Section 2.3 and changes in the Unicode Standard described in Sections 3 and 4, including implications to normalization, the conclusion is not to add any exception rules to IDNA2008.¶
This document addresses only changes to Unicode between version 6.0.0 and version 12.0.0. Changes in future Unicode versions might result in the conclusion that exception rules need to be added to IDNA2008 after the review process explained in RFC 8753 [RFC8753]. Separately from any changes in Unicode, the IETF might conclude that updates to RFC 5892 [RFC5892] or other IDNA2008 documents are necessary; such updates might include changes to the algorithm specified in IDNA2008 as well as additional rules, categories, or other forms of tuning, like the clarifications in RFC 8753 [RFC8753].
6. IANA Considerations
IANA updated the "IDNA Rules and Derived Property Values" [IANA-IDNA] registry after the expert reviewer validated that the derived property values were calculated correctly.¶
7. Security Considerations
This document makes recommendations regarding the use of the IDNA2008 algorithm for calculation of derived property values, based on Unicode version 12.0.0. These recommendations do not extend to future versions of the Unicode Standard.
Not following these recommendations can lead to various security issues. Specifically, allowing confusable characters may lead to various phishing attacks, as described in the Security Considerations sections of the documents listed in Section 2.1.
8. References
8.1. Normative References
- [RFC3491]
- Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)", RFC 3491, DOI 10.17487/RFC3491, <https://www.rfc-editor.org/info/rfc3491>.
- [RFC5890]
- Klensin, J., "Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework", RFC 5890, DOI 10.17487/RFC5890, <https://www.rfc-editor.org/info/rfc5890>.
- [RFC5891]
- Klensin, J., "Internationalized Domain Names in Applications (IDNA): Protocol", RFC 5891, DOI 10.17487/RFC5891, <https://www.rfc-editor.org/info/rfc5891>.
- [RFC5892]
- Faltstrom, P., Ed., "The Unicode Code Points and Internationalized Domain Names for Applications (IDNA)", RFC 5892, DOI 10.17487/RFC5892, <https://www.rfc-editor.org/info/rfc5892>.
- [RFC5893]
- Alvestrand, H., Ed. and C. Karp, "Right-to-Left Scripts for Internationalized Domain Names for Applications (IDNA)", RFC 5893, DOI 10.17487/RFC5893, <https://www.rfc-editor.org/info/rfc5893>.
- [RFC6452]
- Faltstrom, P., Ed. and P. Hoffman, Ed., "The Unicode Code Points and Internationalized Domain Names for Applications (IDNA) - Unicode 6.0", RFC 6452, DOI 10.17487/RFC6452, <https://www.rfc-editor.org/info/rfc6452>.
8.2. Informative References
- [IAB2005-1]
- Internet Architecture Board, "IAB Statement on Identifiers and Unicode 7.0.0", <https://www.iab.org/documents/correspondence-reports-documents/2015-2/iab-statement-on-identifiers-and-unicode-7-0-0/archive/>.
- [IAB2005-2]
- Internet Architecture Board, "IAB Statement on Identifiers and Unicode 7.0.0", <https://www.iab.org/documents/correspondence-reports-documents/2015-2/iab-statement-on-identifiers-and-unicode-7-0-0/>.
- [IANA-IDNA]
- IANA, "IDNA Rules and Derived Property Values", <https://www.iana.org/assignments/idna-tables-12.0.0/>.
- [IDNA7]
- Klensin, J. C. and P. Faltstrom, "IDNA Update for Unicode 7.0 and Later Versions", Work in Progress, Internet-Draft, draft-klensin-idna-5892upd-unicode70-05, <https://datatracker.ietf.org/doc/html/draft-klensin-idna-5892upd-unicode70-05>.
- [RFC3454]
- Hoffman, P. and M. Blanchet, "Preparation of Internationalized Strings ("stringprep")", RFC 3454, DOI 10.17487/RFC3454, <https://www.rfc-editor.org/info/rfc3454>.
- [RFC3490]
- Faltstrom, P., Hoffman, P., and A. Costello, "Internationalizing Domain Names in Applications (IDNA)", RFC 3490, DOI 10.17487/RFC3490, <https://www.rfc-editor.org/info/rfc3490>.
- [RFC5894]
- Klensin, J., "Internationalized Domain Names for Applications (IDNA): Background, Explanation, and Rationale", RFC 5894, DOI 10.17487/RFC5894, <https://www.rfc-editor.org/info/rfc5894>.
- [RFC5895]
- Resnick, P. and P. Hoffman, "Mapping Characters for Internationalized Domain Names in Applications (IDNA) 2008", RFC 5895, DOI 10.17487/RFC5895, <https://www.rfc-editor.org/info/rfc5895>.
- [RFC6912]
- Sullivan, A., Thaler, D., Klensin, J., and O. Kolkman, "Principles for Unicode Code Point Inclusion in Labels in the DNS", RFC 6912, DOI 10.17487/RFC6912, <https://www.rfc-editor.org/info/rfc6912>.
- [RFC8753]
- Klensin, J. and P. Fältström, "Internationalized Domain Names for Applications (IDNA) Review for New Unicode Versions", RFC 8753, DOI 10.17487/RFC8753, <https://www.rfc-editor.org/info/rfc8753>.
- [SAC-084]
- The Security and Stability Advisory Committee, "SAC084", SSAC Comments on Guidelines for the Extended Process Similarity Review Panel for the IDN ccTLD Fast Track Process, <https://www.icann.org/en/system/files/files/sac-084-en.pdf>.
- [Unicode-3.2.0]
- The Unicode Consortium, "The Unicode Standard, Version 3.2.0", Mountain View: The Unicode Consortium, ISBN 0-201-61633-5, <https://www.unicode.org/versions/Unicode3.2.0/>.
- [Unicode-5.2.0]
- The Unicode Consortium, "The Unicode Standard, Version 5.2.0", Mountain View: The Unicode Consortium, ISBN 978-1-936213-00-9, <https://www.unicode.org/versions/Unicode5.2.0/>.
- [Unicode-6.0.0]
- The Unicode Consortium, "The Unicode Standard, Version 6.0.0", Mountain View: The Unicode Consortium, ISBN 978-1-936213-01-6, <https://www.unicode.org/versions/Unicode6.0.0/>.
- [Unicode-7.0.0]
- The Unicode Consortium, "The Unicode Standard, Version 7.0.0", Mountain View: The Unicode Consortium, ISBN 978-1-936213-09-2, <https://www.unicode.org/versions/Unicode7.0.0/>.
- [Unicode-8.0.0]
- The Unicode Consortium, "The Unicode Standard, Version 8.0.0", Mountain View: The Unicode Consortium, ISBN 978-1-936213-10-8, <https://www.unicode.org/versions/Unicode8.0.0/>.
- [Unicode-10.0.0]
- The Unicode Consortium, "The Unicode Standard, Version 10.0.0", Mountain View: The Unicode Consortium, ISBN 978-1-936213-16-0, <https://www.unicode.org/versions/Unicode10.0.0/>.
- [Unicode-11.0.0]
- The Unicode Consortium, "The Unicode Standard, Version 11.0.0", Mountain View: The Unicode Consortium, ISBN 978-1-936213-19-1, <https://www.unicode.org/versions/Unicode11.0.0/>.
- [Unicode-12.0.0]
- The Unicode Consortium, "The Unicode Standard, Version 12.0.0", Mountain View: The Unicode Consortium, ISBN 978-1-936213-22-1, <https://www.unicode.org/versions/Unicode12.0.0/>.
- [UTS-46]
- The Unicode Consortium, "Unicode Technical Standard #46, Version 12.0.0", UNICODE IDNA COMPATIBILITY PROCESSING, <https://www.unicode.org/reports/tr46/tr46-23.html>.
Appendix A. Changes from Unicode 6.0.0 to Unicode 7.0.0
Changes from derived property value UNASSIGNED to either PVALID or DISALLOWED.¶
Appendix B. Changes from Unicode 7.0.0 to Unicode 8.0.0
Changes from derived property value UNASSIGNED to either PVALID or DISALLOWED.¶
Appendix C. Changes from Unicode 8.0.0 to Unicode 9.0.0
Changes from derived property value UNASSIGNED to either PVALID or DISALLOWED.¶
Appendix D. Changes from Unicode 9.0.0 to Unicode 10.0.0
Changes from derived property value UNASSIGNED to either PVALID or DISALLOWED.¶
Appendix E. Changes from Unicode 10.0.0 to Unicode 11.0.0
Changes from derived property value DISALLOWED to PVALID.¶
Changes from derived property value UNASSIGNED to either PVALID or DISALLOWED.¶
Appendix F. Changes from Unicode 11.0.0 to Unicode 12.0.0
Changes from derived property value UNASSIGNED to either PVALID or DISALLOWED.¶
Acknowledgments
Thanks to Harald Alvestrand, Marc Blanchet, Martin Dürst, Asmus Freytag, Ted Hardie, John Klensin, Erik Nordmark, Pete Resnick, Peter Saint-Andre, Michel Suignard, Andrew Sullivan, and Suzanne Woolf for input to this document.¶