To access files containing non-ASCII Unicode text, pass an encoding name to open.
In this mode, Python text files automatically encode data on writes and decode it on reads, according to the encoding scheme you name.
In Python 3.X:
S = 'test\xc4m'                                        # Non-ASCII Unicode text
print(S)
print(S[4])                                            # Sequence of characters: index 4 is the non-ASCII 'Ä'

file = open('unidata.txt', 'w', encoding='utf-8')      # Write/encode UTF-8 text
file.write(S)                                          # 6 characters written
file.close()

text = open('unidata.txt', encoding='utf-8').read()    # Read/decode UTF-8 text
print(text)
print(len(text))                                       # 6 chars (code points)
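The encoding name you pass when reading back has to match the one used to write the file. As a minimal sketch of what happens otherwise (reusing the unidata.txt file written above; the mismatched encoding names are only for illustration):

try:
    open('unidata.txt', encoding='ascii').read()       # UTF-8 bytes for 'Ä' are not valid ASCII
except UnicodeDecodeError as err:
    print('decode failed:', err)

print(open('unidata.txt', encoding='latin-1').read())  # Decodes, but to the wrong characters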
You can see what's truly stored in your file by stepping into binary mode:
raw = open('unidata.txt', 'rb').read()                 # Read raw encoded bytes
print(raw)
print(len(raw))                                        # Really 7 bytes in UTF-8: 'Ä' takes two
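For completeness, a small sketch of the reverse direction: binary-mode writes take bytes objects and copy them verbatim, with no encoding applied (unidata2.txt is a hypothetical file name, and raw is the bytes object read above):

out = open('unidata2.txt', 'wb')                       # Binary mode: no encoding on writes
out.write(raw)
out.close()
print(open('unidata2.txt', 'rb').read() == raw)        # True: a byte-for-byte copy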
You can encode and decode manually if you get Unicode data from a source other than a file:
raw = open('unidata.txt', 'rb').read()                 # Read raw encoded bytes
text = 'test\xc4m'
print(text.encode('utf-8'))                            # Manual encode to bytes
print(raw.decode('utf-8'))                             # Manual decode to str
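If you prefer, the same conversions can be spelled with the str and bytes constructors; a brief equivalent sketch:

raw = open('unidata.txt', 'rb').read()
text = 'test\xc4m'
print(bytes(text, 'utf-8'))                            # Same as text.encode('utf-8')
print(str(raw, 'utf-8'))                               # Same as raw.decode('utf-8')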
To see how text files would automatically encode the same string under different encoding names, you can apply the same translations manually:
text = "test" print( text.encode('latin-1') ) # Bytes differ in others print( text.encode('utf-16') ) print( len(text.encode('latin-1')), len(text.encode('utf-16')) ) print( b'\xff\xfed\x00p\x00\xc4\x00m\x00'.decode('utf-16') ) # But same string decoded # from w w w .j av a 2 s . c om