
I doubt that they are encoded in UCS-2, as that character set can't encode every Unicode code point (or even just the majority of them).

You are right though (and this is why I upvoted you back to 1) that you shouldn't care. In fact, your not knowing the internal encoding is the proof of that. In Python (I'm talking Python 3 here, which has done this right), you don't care how a string is stored internally.

The only place where you care about this is when your strings interact with the outside world (I/O). At that point your strings need to be converted into bytes, and thus the internal representation must be encoded using some explicit encoding.

This is what the .encode (string to bytes) and .decode (bytes to string) methods are used for.
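As a minimal sketch of that boundary in Python 3 (the sample text and variable names are just for illustration): the string itself has no visible encoding, and an encoding only comes into play when it is turned into bytes or read back from them.

```python
# A str is a sequence of code points; how it is stored is not visible here.
text = "na\u00efve caf\u00e9"  # "naive cafe" with two accented letters

# Going out to the world: pick an encoding, get bytes.
data = text.encode("utf-8")
assert isinstance(data, bytes)

# Coming back in: decode the bytes with the same encoding.
roundtrip = data.decode("utf-8")
assert roundtrip == text

# The same text yields different bytes under a different encoding.
utf16 = text.encode("utf-16-le")
print(len(text), len(data), len(utf16))  # 10 12 20
```

The length of the string (10 code points) stays the same no matter which encoding is used for the bytes.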

Have a look at http://diveintopython3.org/strings.html which manages to say this better (and with more words) than I ever would be able to.



In Python 2.x, strings are encoded in UCS-2, not UTF-16, at least by default (I'm not sure about Python 3.x, though I assume it's the same). If you want to support every single possible Unicode code point, you can tell Python to do so at compile time (via a ./configure flag).

In practice, the characters that aren't in UCS-2 tend to be characters that don't exist in modern languages, e.g. the character sets for Linear B, domino tiles, and cuneiform, so they're not supported since they're not of practical use to most people. There's a fairly good list at http://en.wikipedia.org/wiki/Plane_(Unicode) . Of the planes in this list, Python by default doesn't support anything outside the BMP.


No, the Python internals support surrogates so you can support characters outside the BMP. This makes it (basically) UTF-16.
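A rough illustration of what a surrogate pair looks like, using Python 3's UTF-16 codec. This only shows the UTF-16 encoding itself, not how any particular Python build stores strings internally; the variable names are made up for the example.

```python
# U+10000 is the first code point outside the BMP (a Linear B syllable).
ch = "\U00010000"

# Encode to big-endian UTF-16: one non-BMP code point becomes two 16-bit units.
units = ch.encode("utf-16-be")
assert len(units) == 4

high = int.from_bytes(units[:2], "big")
low = int.from_bytes(units[2:], "big")

# The two units fall in the reserved surrogate ranges.
assert 0xD800 <= high <= 0xDBFF
assert 0xDC00 <= low <= 0xDFFF

# Recombining the pair gives back the original code point.
cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
assert cp == ord(ch)
```

UCS-2 has no such pairing, which is why it stops at the BMP.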


Things outside of the BMP aren't just dead languages anymore. You have to be able to support characters outside the BMP if you want to sell your software in China:

http://en.wikipedia.org/wiki/GB_18030
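A small sketch of the point, assuming Python 3 and its built-in "gb18030" codec: U+20000, the first code point of the CJK Unified Ideographs Extension B block, lies outside the BMP, yet GB 18030 can encode it.

```python
# U+20000 is a CJK ideograph in plane 2, i.e. outside the BMP.
ch = "\U00020000"
assert ord(ch) > 0xFFFF

# GB 18030 encodes supplementary-plane characters as four bytes.
encoded = ch.encode("gb18030")
assert len(encoded) == 4

# Round-tripping through the codec returns the same character.
assert encoded.decode("gb18030") == ch
```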



