Python's strings have full Unicode support.
In Python 3.X, the normal str string handles Unicode text.
A distinct bytes string type represents raw byte values.
S = 'sp\xc4m'            # 3.X: normal str strings are Unicode text
print(S)
print(b'a\x01c')         # bytes strings are byte-based data
print(u'test\u00c4m')    # The 2.X Unicode literal works in 3.3+: just str
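To make the 3.X split between str and bytes more concrete, here is a minimal sketch, assuming Python 3.X. The variable names are illustrative only. It shows that indexing a str gives a one-character str, while indexing a bytes string gives an integer.

```python
# Assumes Python 3.X; the names text and data are illustrative only.
text = 'sp\xc4m'        # str: a sequence of Unicode code points
data = b'a\x01c'        # bytes: a sequence of small integers (0..255)

print(type(text).__name__, len(text))   # str 4
print(type(data).__name__, len(data))   # bytes 3

# Indexing a str gives a 1-character str; indexing bytes gives an int
print(ascii(text[2]))   # '\xc4'
print(data[1])          # 1
```

The ascii() call is used here only so that the output stays printable on terminals that cannot display non-ASCII characters.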
In Python 2.X, the normal str string handles both 8-bit character strings (including ASCII text) and raw byte values.
A distinct unicode string type represents Unicode text.
3.X bytes literals are supported in 2.6 and later for 3.X compatibility; they are treated the same as normal 2.X str strings:
print(u'sp\xc4m')    # 2.X: Unicode strings are a distinct type
print('a\x01c')      # Normal str strings contain byte-based text/data
print(b'a\x01c')     # The 3.X bytes literal works in 2.6+: just str
In both 2.X and 3.X, non-Unicode strings are sequences of 8-bit bytes that print with ASCII characters when possible.
Unicode strings are sequences of Unicode code points, the identifying numbers for characters.
print('test')                    # Characters may be 1, 2, or 4 bytes in memory
print('test'.encode('utf8'))     # Encoded to 4 bytes in UTF-8 in files
print('test'.encode('utf16'))    # But encoded to 10 bytes in UTF-16
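The size difference between encodings is easier to see with a non-ASCII character in the text. The following is a small sketch, assuming Python 3.X; the sizes dictionary is just an illustrative helper. It encodes two strings and records how many bytes each encoding produces.

```python
# Assumes Python 3.X; sizes is an illustrative helper, not a standard name.
sizes = {}
for text in ('test', 'sp\xc4m'):
    utf8 = text.encode('utf-8')
    utf16 = text.encode('utf-16')     # result starts with a 2-byte byte order mark
    sizes[text] = (len(text), len(utf8), len(utf16))
    print(len(text), len(utf8), len(utf16))

# Decoding with the same encoding recovers the original text
print('sp\xc4m'.encode('utf-8').decode('utf-8') == 'sp\xc4m')   # True
```

For 'test' this prints 4 4 10, and for 'sp\xc4m' it prints 4 5 10: the non-ASCII character takes two bytes in UTF-8, while each character here takes two bytes in UTF-16 in addition to the byte order mark.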
Both 3.X and 2.X also support the bytearray string type.
The bytearray type is essentially a bytes string (a str in 2.X) that is mutable: it supports most of the list object's in-place change operations.
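As a quick illustration of those in-place changes, here is a minimal sketch assuming Python 3.X. The name buf is illustrative.

```python
# Assumes Python 3.X; buf is an illustrative name.
buf = bytearray(b'spam')

buf[0] = ord('S')        # item assignment takes an integer byte value
buf.append(ord('!'))     # list-like append of a single byte
buf.extend(b'??')        # extend with another bytes-like object
del buf[-1]              # delete the last byte in place

print(buf)               # bytearray(b'Spam!?')
print(bytes(buf))        # b'Spam!?' (an immutable copy)
```

The same operations on a bytes object would raise an error, because bytes is immutable.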
Both 3.X and 2.X support coding non-ASCII characters with \x hexadecimal and short \u and long \U Unicode escapes.
Python also handles file-wide encodings declared in program source files.
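A file-wide encoding is declared with a special comment on the first or second line of a source file. The sketch below, assuming Python 3.X, simulates such a file in memory instead of writing it to disk: it builds Latin-1 encoded source bytes that start with a coding comment and compiles them. The names source and namespace and the '<latin1-demo>' label are illustrative.

```python
# Assumes Python 3.X. The source "file" is simulated as bytes in memory.
# Names such as source, namespace, and '<latin1-demo>' are illustrative.
source = "# -*- coding: latin-1 -*-\ntext = 'sp\xc4m'\n".encode('latin-1')

# The character is stored as the single Latin-1 byte 0xC4 in the source
print(b'\xc4' in source)              # True

# compile() honors the coding declaration when given bytes
namespace = {}
exec(compile(source, '<latin1-demo>', 'exec'), namespace)

print(ascii(namespace['text']))       # 'sp\xc4m'
```

Without the coding comment, Python 3.X would assume the bytes are UTF-8, and the lone 0xC4 byte would not decode correctly.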
Here's our non-ASCII character coded three ways in 3.X (in 2.X, add a leading "u" to each string literal to see the same):
print('test\xc4\u00c4\U000000c4m')
print('\u00A3', '\u00A3'.encode('latin1'), b'\xA3'.decode('latin1'))
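The three escape forms are only different spellings of the same code point. The following sketch, assuming Python 3.X, checks this directly; the names a, b, c, and pound are illustrative.

```python
# Assumes Python 3.X; a, b, c, and pound are illustrative names.
a = '\xc4'           # hex escape: 2 hex digits
b = '\u00c4'         # short Unicode escape: 4 hex digits
c = '\U000000c4'     # long Unicode escape: 8 hex digits

print(a == b == c)           # True
print(hex(ord(a)))           # 0xc4
print(chr(0xC4) == a)        # True

# The same character can encode to different bytes in different encodings
pound = '\u00A3'
print(pound.encode('latin-1'))   # b'\xa3'
print(pound.encode('utf-8'))     # b'\xc2\xa3'
```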
Python 2.X allows its normal and Unicode strings to be mixed in expressions as long as the normal string is all ASCII.
Python 3.X has a tighter model that never allows its normal and byte strings to mix without explicit conversion:
u'x' + b'y'            # Works in 2.X (where b is optional and ignored)
u'x' + 'y'             # Works in 2.X: u'xy'

u'x' + b'y'            # Fails in 3.3 (where u is optional and ignored)
u'x' + 'y'             # Works in 3.3: 'xy'

'x' + b'y'.decode()    # Works in 3.X if decode bytes to str: 'xy'
'x'.encode() + b'y'    # Works in 3.X if encode str to bytes: b'xy'
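The failing case can be observed safely by catching the error. This is a minimal sketch, assuming Python 3.X; the names mixed_failed, joined_text, and joined_bytes are illustrative, and the exact error message may differ between Python versions.

```python
# Assumes Python 3.X; variable names are illustrative.
mixed_failed = False
try:
    'x' + b'y'                     # str + bytes is not allowed in 3.X
except TypeError as exc:
    mixed_failed = True
    print('TypeError:', exc)       # message text varies by version

# Explicit conversion in either direction makes the types match
joined_text = 'x' + b'y'.decode('ascii')      # bytes -> str
joined_bytes = 'x'.encode('ascii') + b'y'     # str -> bytes

print(joined_text)     # xy
print(joined_bytes)    # b'xy'
```

Naming the encoding explicitly, as done here with 'ascii', is optional in 3.X (the default is UTF-8), but it makes the intended conversion clear.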