Review of Unicode TR#46

I have reviewed Draft Unicode Technical Standard #46 - UNICODE IDNA COMPATIBILITY PROCESSING as could be found at http://www.unicode.org/reports/tr46/ on March 15, 2010 (below called TR#46).

The review is based on my personal experience from many years of activity in the IETF, and of course my work with both IDNA2003 (RFC3490 etc) and IDNA2008 (working group charter).

From my perspective, the document needs some major changes. The reasons are both structural, so that it ends up being an addition to IDNA2008, and technical/specific, so that it is clearer what message a reader is to take home from it.

1. Starting with the structural issues, the most problematic one, I think, is that the specification of the algorithm to use (section 4), which includes using the table (section 5) created according to what is specified in section 7, duplicates the rules used in IDNA2008 and merges them with what was created for the table used in IDNA2003. I.e. the rules are completely new instead of referencing the mapping table from IDNA2003 and the algorithm in IDNA2008 (plus potentially having an addition to those). This is problematic for various reasons, but let me mention two:

1.1. By creating a completely new table, instead of just stating additions to either (or both) of IDNA2003 and the derived values from IDNA2008, it will be hard to know that it is completely in sync with the IETF standards. This is especially so as IDNA2008 is not defined by a table, but by an algorithm.

1.2. IDNA2008 is according to the consensus in IETF independent of Unicode version. This was something that was needed due to the problems applications have ended up with in various operating systems when the operating system, the libraries used, and the application, might be optimised / developed for different versions of Unicode. And, in very few cases is it possible for the applications to detect what version of Unicode is in use. If TR#46 is as it looks today, dependent on Unicode version, that would turn things back to the situation with IDNA2003. What would be much better would be to (once again) have TR#46 be an addition to IDNA2008, that specified what to do before (in some cases) IDNA2008 was applied.
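To illustrate the version dependence described in 1.2, here is a minimal sketch in Python. It assumes the standard `unicodedata` module, which happens to be one of the few environments where the Unicode version can be detected (`unidata_version`) and which also bundles a frozen Unicode 3.2 database as `ucd_3_2_0`. The choice of U+1E9E (added in Unicode 5.1) is mine and only for illustration; it is not taken from TR#46.

```python
import unicodedata

# U+1E9E LATIN CAPITAL LETTER SHARP S was added in Unicode 5.1,
# so it is unassigned in the frozen Unicode 3.2 database.
CODE_POINT = "\u1e9e"


def categories(ch):
    """Return the General_Category of ch under Unicode 3.2 and under the runtime's version."""
    old = unicodedata.ucd_3_2_0.category(ch)
    new = unicodedata.category(ch)
    return old, new


old_cat, new_cat = categories(CODE_POINT)
print("runtime Unicode version:", unicodedata.unidata_version)
print("category under Unicode 3.2:", old_cat)
print("category under runtime version:", new_cat)
```

Any table derived from character properties will differ in the same way between two components built against different Unicode versions, which is the situation an application usually cannot detect.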

2. The second issue has to do more specifically with how the algorithm is specified, and it is divided into three parts:

2.1. The document uses statements like “should”, “should always”, “should” (in italics), “must”, “must only”, “must be”, “must not” etc and those terms are not defined. This makes it hard to understand what the implications are for those statements.

2.2. TR#46 does not specify clearly which parts of the document are normative and which are not. I do understand that some background information is needed, but the pieces that are normative must stand out, and it must be possible for the implementor to use them by themselves. Further, those normative sections should, for example, explain what to do when being a registry, what to do when being a registrar, an application (in various situations) etc.

2.3. The taxonomy used in TR#46 is not clear, and specifically it is not clear to me whether it really follows the taxonomy that the IETF consensus has approved, as specified in draft-ietf-idnabis-defs-13.txt. If UTC is to reference IETF standards, it is of course better if those references use IETF taxonomy. If the IETF taxonomy is different from what, for example, UTC is using, that should be made extremely clear, and the reader should not have to guess which one of the conflicting taxonomies is in use.

3. The third category has to do with the actual content itself, and is divided into several parts as well.

3.1. IDNA2008 specifies how to calculate whether a tentative domain name is a valid A-label/U-label or not. It is not clear to the reader of TR#46 where this calculation fits in. Specifically, if one looks at section 4 (Processing), there is no reference to IDNA2008 at all. For TR#46 to be effective, it must concentrate on the pre-processing of strings that is recommended in certain contexts before IDNA2008 is applied, and be written that way.

3.2. TR#46 is confusing regarding when it talks about URIs, IRIs, domain names etc. This is similar to the comment already made in 2.3, but more specific: section 1 talks about domain names, yet URLs (and IRIs) are given as examples. The document also talks in some places about “what happens when the user types” (something) and in other places about “comparison” (which is made on the server side, and not the client side). Basically, the introduction in section 1 should be split: one non-normative portion that is extremely clear on the various items, while what is normative should be moved to the actual (normative) processing. Specifically, it is important that the document keeps discussions about domain names separated from discussions about IRIs.

3.3. The relationship between section 4 and section 7 is not clear to me. I guess section 4 is about intra-label processing, while section 7 talks about whole labels. This must be clarified, and I think it would be clearer if the actual processing (section 4) were more explicit about its relationship with IDNA2008.

3.4. Regarding section 1, I must specifically talk about confusion in section 1.3 Security Considerations. We in the IETF also have a mandatory section with the same name, and as an IETF person I expected a bit different content in section 1.3. That is of course not wrong, but, the section mixes issues I think are background information, with things that really are security issues.

3.5. In section 3.1 there is a statement that is:

That is, the sequence "a<ZWJ>b" looks just like "ab"

This points at a problem with the document that is hard to explain other than by picking this specific example. What problem does it try to solve? Are we talking about making it easier for someone who reads a domain name on a billboard to type it in, or about minimizing the risk that someone clicks on a link on a webpage that goes to some phishing site, or something else? The “looks just like” makes this confusing. If the ZWJ is on a billboard, the user will not type it in, so there is no danger. The document must therefore be talking about the case where some IRI includes a ZWJ and the user “uses that IRI”. If we look at that problem and want to solve it, then we have to do many more things than “just” taking care of ZWJ. For example, we must handle the case where a webpage includes:

http://b.example.com
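As a small illustration of the “looks just like” statement quoted in 3.5, the following Python sketch (my own, not from TR#46) shows that the two strings are different code point sequences even though they typically render identically. The helper name `describe` is hypothetical.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

with_zwj = "a" + ZWJ + "b"
plain = "ab"


def describe(s):
    """List each code point of s as 'U+XXXX NAME'."""
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch)) for ch in s]


# The strings compare unequal although a reader cannot tell them apart on screen.
print(with_zwj == plain)
print(describe(with_zwj))
print(describe(plain))
```

A user can only end up with the first string by pasting or clicking, not by typing from a billboard, which is why the scenario matters for deciding what problem is being solved.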

3.6. The document should be much clearer about when it talks about Unicode characters (regardless of encoding) and when it talks about the various actual encoded strings. In the former case, we talk about issues that have to do with Unicode code points and their use, while in the latter case, we talk about problems that are caused by the encodings (where, for example, different encodings could create different problems). As one example, in section 4 there is a list of the steps to go through with a domain name in Unicode. The section starts with what is, for me, a confusing text about how the Unicode code points are represented. It talks about “escaping”, but at the same time it does not say how the Unicode code points are to be represented (UTF-8 or UTF-16 or…). I.e. it treats some representations as “weird”, and others as “not weird” (as they are not mentioned). This confusion continues: for example, step 4 in the processing talks about “If the label start with xn--…”, which implies the string might already be in Punycode from the beginning, i.e. using one of the possible representations of a Unicode string.
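To make the distinction in 3.6 concrete, here is a minimal Python sketch showing one sequence of code points in several encoded forms. The label “bücher” is my own example, not one from TR#46. Note that Python's built-in `idna` codec implements IDNA2003 (RFC 3490), not IDNA2008; it is used here only to show the `xn--` form.

```python
label = "b\u00fccher"  # "bücher", an illustrative label

utf8 = label.encode("utf-8")
utf16 = label.encode("utf-16-be")
puny = label.encode("punycode")  # bare Punycode, without the ACE prefix
ace = label.encode("idna")       # IDNA2003 ToASCII, adds the "xn--" prefix

for name, raw in [
    ("UTF-8", utf8),
    ("UTF-16BE", utf16),
    ("Punycode", puny),
    ("IDNA2003 ACE", ace),
]:
    print(name, len(raw), raw)
```

All four byte strings represent the same six code points, so a processing step that says “if the label starts with xn--” is already a statement about one particular encoded form.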

3.7. When it is explained how Table 3 is created, rules used in IDNA2008 are duplicated. I have not looked through the rules in detail to see whether they are the same or different, but it is clear that the document does not reference IDNA2008 regarding the derived property value. Further, in Step 3, there is a “handpicking” of characters, which, given the IETF experience from IDNA2003 and the work on IDNA2008, creates a high risk that some characters are missed and forgotten. In step 4, there is a specific list of characters that is supposed to be a list of differences between IDNA2008 and this created deviation set. The problem with this is that IDNA2008 is independent of Unicode version, while this deviation set is not. So the correct statement could have been “IDNA2008 applied to Unicode 5.2 compared with the deviation set”, and not what is said. See section 4 for a concrete suggestion on how to solve this problem.

3.8. In Table 4 it really looks as if IDNA2008 is one table and UTS46 is another, so that the developer has to choose between UTS46 and IDNA2008. This will, according to this document, “only create different results during the time deviations are taken into account”, but the document does not talk about the developer ever using IDNA2008. Instead, it promises to be “compatible with IDNA2008” (not a direct quote).

4. Last, I would like to give some suggestions on how to make this document one that can be used and that would not overlap with the IETF document(s). It should do the following:

4.1. Split the document more clearly into a non-normative portion and a very strict normative portion, so that the normative part can stand alone (without the non-normative part).

4.2. Describe, as chains of events in specific scenarios, the algorithm(s) that handle domain names before IDNA2008 is applied, as an add-on to IDNA2008 and not as a replacement, which is what the current document is. This could, for example, point out how to handle the following issues for registries, registrars, the zone file owner/editor, the domain name holder's use of the name, use in publications etc:

4.2.1. Mappings that are part of IDNA2003 that maps to characters that are PVALID (or CONTEXT) in IDNA2008.

4.2.2. Codepoints that are valid in IDNA2003, but not in IDNA2008 (including codepoints that in IDNA2003 maps to codepoints that are not PVALID in IDNA2008).

4.2.3. The four codepoints that are the real deviations between IDNA2003 and IDNA2008.
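The add-on structure suggested in 4.2 could be sketched roughly as follows, in Python. This is only an illustration under assumptions: the names `DEVIATIONS`, `find_deviations`, `preprocess` and `lookup` are hypothetical, the IDNA2008 step is a caller-supplied function rather than a real implementation, and only the four deviation code points of 4.2.3 (with their IDNA2003 handling) are covered. Cases 4.2.1 and 4.2.2 would need the IDNA2003 mapping table and the IDNA2008 derived property values, which are not reproduced here.

```python
# The four code points handled differently by IDNA2003 and IDNA2008,
# together with what IDNA2003 does to them.
DEVIATIONS = {
    "\u00df": "ss",      # LATIN SMALL LETTER SHARP S, mapped to "ss"
    "\u03c2": "\u03c3",  # GREEK SMALL LETTER FINAL SIGMA, mapped to sigma
    "\u200c": "",        # ZERO WIDTH NON-JOINER, removed
    "\u200d": "",        # ZERO WIDTH JOINER, removed
}


def find_deviations(label):
    """Return the deviation characters present in label, in order."""
    return [ch for ch in label if ch in DEVIATIONS]


def preprocess(label, map_deviations):
    """Hypothetical pre-processing step, run before IDNA2008 is applied.

    With map_deviations=True the IDNA2003 behaviour is reproduced;
    with False the label is passed through unchanged.
    """
    if map_deviations:
        return "".join(DEVIATIONS.get(ch, ch) for ch in label)
    return label


def lookup(label, map_deviations, idna2008_to_ascii):
    """Pre-process, then hand the result to an IDNA2008 implementation.

    idna2008_to_ascii is supplied by the caller; this sketch does not
    implement or replace IDNA2008.
    """
    return idna2008_to_ascii(preprocess(label, map_deviations))
```

The point of the design is that the pre-processing is a separate, optional step in front of an unmodified IDNA2008, and that whether to map the deviations is a decision made per scenario (registry, registrar, application) rather than baked into a new table.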

4.3. Explain specifically how to handle Unicode strings as part of IRIs, and how to handle IRIs in the browser environment, as a large community seems to be waiting for a document like this. This includes how to handle cases where the user types, pastes, clicks etc on various strings.

I hope this description of the issues I personally see helps in creating a version of TR#46 that I can support. As an individual, I do not see that I can support the current version, specifically as it is a complete overlap with IDNA2008. Or, I have completely misunderstood the document (which is a sign by itself that some changes might be needed).