Python and Unicode

As some of you saw, I wrote earlier about problems I had with wide Unicode code points. The cause was that the Python I use is not compiled with support for them (a so-called narrow build). Naive as I was, I thought this was one of the things fixed in Python 3.x, but it was not, at least not in the version I run. In such a build, code points above 0xFFFF, which do not fit in 16 bits, are stored as surrogate pairs, so a workaround was possible. It is not pretty, and possibly not the most efficient Python code out there (note that it uses some data structures that I have already created elsewhere):


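# unicodeSpecialCaseFoldingDict is created elsewhere; it maps a code point
# (int) to its case-folded replacement string.
def unicodeCaseFold(c):
  result = ""
  maxlen = len(c)
  i = 0
  while i < maxlen:
    value = ord(c[i])
    j = i + 1
    # A high surrogate (0xD800-0xDBFF) followed by a low surrogate
    # (0xDC00-0xDFFF) encodes one code point above 0xFFFF.
    if 0xD800 <= value <= 0xDBFF and j < maxlen and 0xDC00 <= ord(c[j]) <= 0xDFFF:
      value = 0x10000 + ((value - 0xD800) << 10) + (ord(c[j]) - 0xDC00)
      j = i + 2
    if value in unicodeSpecialCaseFoldingDict:
      result = result + unicodeSpecialCaseFoldingDict[value]
    else:
      result = result + c[i:j].lower()
    # Advance inside the loop, past the one or two code units just consumed
    i = j
  return result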
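
For anyone who wants to see the surrogate arithmetic on its own, here is a rough sketch of how a code point above 0xFFFF maps to a surrogate pair and back, plus a check for whether the running Python is a narrow build. The helper names (is_narrow_build, to_surrogates, from_surrogates) are made up for this illustration and are not part of the workaround above.

```python
import sys

# Hypothetical helper names, chosen only for this illustration.

def is_narrow_build():
    # sys.maxunicode is 0xFFFF on narrow builds and 0x10FFFF otherwise.
    return sys.maxunicode == 0xFFFF

def to_surrogates(cp):
    """Split a code point above 0xFFFF into (high, low) surrogate values."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the supplementary planes")
    offset = cp - 0x10000
    # Top 10 bits go into the high surrogate, bottom 10 bits into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+10400 (DESERET CAPITAL LETTER LONG I) is stored as D801 DC00.
high, low = to_surrogates(0x10400)
print(hex(high), hex(low), hex(from_surrogates(high, low)))
```

Computing the value from the two halves, rather than asking ord() for it, has the advantage that the same code behaves identically whether or not the interpreter is a narrow build.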

Updated: Some “<” and “>” were not visible earlier because I forgot to encode them properly in HTML. Now fixed.

Updated: Corrected some bugs in the code.