Arcpy - Problem with accents

NathalieTONSON · ‎05-21-2014

Hello,

I use arcpy to generate some maps, and I have a problem with the accents which are in the shapefile (I am French, so I have a lot of accents in my data!).
I don't understand how encoding and decoding work, or how to use the decode() and encode() functions with arcpy.

For example, here is my session without import arcpy:

Python 2.6.5 (r265:79096, Mar 19 2010, 21:48:26) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> # -*- coding: utf-8 -*-
...
>>> a="é"
>>> print a
é
>>>

And with import arcpy:

Python 2.6.5 (r265:79096, Mar 19 2010, 21:48:26) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> # -*- coding: utf-8 -*-
... import arcpy
>>>
>>> a="é"
>>> print a
'
>>>

So I tried every combination of encode() and decode() with 'utf-8', 'latin9', ... but I always get errors like UnicodeDecodeError: 'utf-8' codec can't decode byte ...

I get the same error when I print a value from the .dbf of a shapefile.

...
ROWS = arcpy.SearchCursor(LAYER,"","","","")
for row in ROWS:
    print str(row.DESCRIPT)

Thanks !!!

Nat

T__WayneWhitley · ‎05-24-2014

...a little late, but I think the explanation below will help. Character encoding is interesting, although my understanding of it is still a work in progress, so I apologize up front for my limited explanation (and my limited French as well).

Basically, to cut to the chase: in Python 2, a plain "é" typed into a utf-8 source file is stored as a byte string (the two utf-8 bytes), not as a unicode character, so python has no way of knowing how to re-encode it for your console...so of course when you try to get it back (as with a print statement), you typically get nonsense. With my standard output (stdout) set to cp1252, I get: Ã©
If you prefix the literal with 'u' (making it a unicode string rather than a byte string), you should be in business with the declared utf-8 source encoding.
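
To make the mechanism concrete, here is a minimal sketch of where that two-character nonsense comes from. It assumes the text was saved as utf-8 and the console reads bytes as cp1252, as on my system; the variable names are only for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch: reproduce the garbled output by decoding utf-8 bytes as cp1252.
# Runs on Python 2 and on Python 3.3+ (the u"" prefix is accepted by both).

text = u"\xe9"                           # LATIN SMALL LETTER E WITH ACUTE
utf8_bytes = text.encode("utf-8")        # two bytes: 0xC3 0xA9

# A cp1252 console reads each byte as its own character.
misread = utf8_bytes.decode("cp1252")    # u"\xc3\xa9", displayed as "Ã©"

# Decoding with the encoding that produced the bytes restores the original.
roundtrip = utf8_bytes.decode("utf-8")   # u"\xe9"
```

The fix is therefore not to try encodings at random, but to decode bytes with the encoding they were written in and to let print encode unicode strings for the console.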

Not sure if you have the unicodedata module in your python version, but I think you'll get the idea just reading the script and corresponding output (output included further below). Try this short demo (my system is different from yours, so also attached is the script for you to run on your own system):

# -*- coding: utf-8 -*-
import sys, unicodedata
print 'The default encoding is {0}'.format(sys.getdefaultencoding())
print 'The standard output encoding is {0}\n'.format(sys.stdout.encoding)

a= u"\xe9"
print 'Trial 1- variable \'a\' is:', a
print '...and this is the type: ', type(a)
print 'This character is {0}\n'.format(unicodedata.name(a))

a= u"é"
print 'Trial 2- variable \'a\' is:', a
print '...and this is the type: ', type(a)
print 'This character is {0}\n\n'.format(unicodedata.name(a))

a= u"Wow caractères Unicode peuvent être difficiles à manipuler, ce qui avec le codage et le décodage en cours."
print 'Trial 3- (a bunch of French with accented characters):\n'
print a
print '\n...and this is the type: ', type(a)

a="é"
print '\n\nTrial 4- variable \'a\' is now becomes gibberish, improperly encoded utf-8:', a
print '...and this is the type: ', type(a)

On my system, I get back this (from the print statements):

>>> 
The default encoding is ascii
The standard output encoding is cp1252

Trial 1- variable 'a' is: é
...and this is the type:  <type 'unicode'>
This character is LATIN SMALL LETTER E WITH ACUTE

Trial 2- variable 'a' is: é
...and this is the type:  <type 'unicode'>
This character is LATIN SMALL LETTER E WITH ACUTE


Trial 3- (a bunch of French with accented characters):

Wow caractères Unicode peuvent être difficiles à manipuler, ce qui avec le codage et le décodage en cours.

...and this is the type:  <type 'unicode'>


Trial 4- variable 'a' is now becomes gibberish, improperly encoded utf-8: Ã©
...and this is the type:  <type 'str'>
>>>
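
The "default encoding is ascii" line above probably also explains the trouble with print str(row.DESCRIPT) in the question. This is a sketch under an assumption I have not verified on your data: that the cursor hands text fields back as unicode objects. In Python 2, str() on a unicode value encodes it with the default (ascii) codec, which fails on accents. The to_bytes helper and the sample value are hypothetical stand-ins, not part of arcpy.

```python
# -*- coding: utf-8 -*-
# Sketch: why str() on an accented unicode value fails in Python 2,
# and encoding explicitly instead. Runs on Python 2 and Python 3.3+.

def to_bytes(value, encoding="utf-8"):
    """Hypothetical helper: encode text explicitly, replacing
    any character the target encoding cannot represent."""
    return value.encode(encoding, "replace")

# Stand-in for a value such as row.DESCRIPT (assumed to be unicode).
descript = u"caract\xe8res"

# Python 2's str(descript) implicitly performs this ascii encoding step,
# which is what raises UnicodeEncodeError on accented characters.
try:
    descript.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

cp1252_bytes = to_bytes(descript, "cp1252")   # one byte for the accent: 0xE8
utf8_bytes = to_bytes(descript, "utf-8")      # two bytes for the accent: 0xC3 0xA8
```

In the cursor loop, that would mean dropping the str() call and printing the unicode value directly, or encoding it explicitly with the encoding your console or output file expects.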

-Wayne

PS- By the way, if I feed in the French from trial 3 above and forget the 'u' prefix, this is what gets printed, which should now be no surprise:

Wow caractÃ¨res Unicode peuvent Ãªtre difficiles Ã  manipuler, ce qui avec le codage et le dÃ©codage en cours.