Database QA/QC would be so much easier if no data were ever entered....
Last week I was dealing with newline characters (see Where clause for '\n'). Today I'm getting the following error:
Runtime error
Traceback (most recent call last):
File "<string>", line 26, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 10: ordinal not in range(128)
Okay... Best I can tell, u201c is a left double quote (http://www.fileformat.info/info/unicode/char/201C/index.htm). I put
# -*- coding: utf-8 -*-
as the first line of my script and it still errors out. Am I doomed, or is there a way past these special characters?
Mixing and matching is going to cause you grief and confuse us. I said it was Python 2.7 as I'd read you were running your code in 10.6.1.
You can install Spyder for Desktop (Python 2.7) and Spyder for Pro (Python 3).
Agreed. It can be confusing, and oftentimes is. We have a rather large collection of 2.7 scripts that we are in the process of migrating, finding gotchas all along the way.
This particular exercise may be one of futility, as it seems every time I check for one embedded special character, I error out on another, regardless of Python version...
Some distractions for you, Joe:
http://ptgmedia.pearsoncmg.com/imprint_downloads/informit/promotions/python/python2python3.pdf
Nice. Thanks. I can use a distraction...
If you are working in Python 3, try specifying the encoding when you open the file. From open() - Built-in Functions — Python 3.7.2 documentation:
encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.
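A minimal Python 3 sketch of that (the filename here is just a placeholder):

# Read a text file with an explicit encoding instead of the platform default.
with open('notes.txt', encoding='utf-8') as f:
    text = f.read()  # text is a str; no surprises from a wrong default encoding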
In Python 2, you can use the io module (15.2. io — Core tools for working with streams — Python 2.7.15 documentation), which allows specifying the text file encoding.
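For example, a Python 2 sketch (again with a placeholder filename):

import io

# io.open() accepts an encoding argument, unlike Python 2's built-in open().
with io.open('notes.txt', encoding='utf-8') as f:
    text = f.read()  # text is a unicode object, not a byte string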
On occasion, I've used a conversion dictionary as a "hammer":
conversion = {
    u"\u2019": "'",   # apostrophe
    u"\u201c": "\"",  # double quote
    u"\u00a0": " ",   # non-breaking space
    "\n": "=>",       # newline
}

def convert(data):
    for k, v in conversion.items():
        # https://stackoverflow.com/questions/14156473
        data = data.replace(k, v)
    return data.encode('ascii')

print convert("hello\nworld")
# hello=>world
print convert(u"hello \u201cworld\u201c")
# hello "world"
Hammers... Fixing the world's problems one smack at a time! 😉
# -*- coding: utf-8 -*-
This only tells the Python interpreter that string literals in your script are UTF-8; it doesn't apply to string data that you use in the script.
For example, without the encoding declaration:
status = "It's -10°C here in Überwald and I'm sitting outside at the Café"
print status
C:\Python27\ArcGIS10.3\python.exe test.py
File "test.py", line 1
SyntaxError: Non-ASCII character '\xc2' in file test.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
With the declaration added:
# -*- coding: utf-8 -*-
status = "It's -10°C here in Überwald and I'm sitting outside at the Café"
print status
C:\Python27\ArcGIS10.3\python.exe test.py
It's -10°C here in Überwald and I'm sitting outside at the Café
Process finished with exit code 0
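One caveat worth adding (my note, not from the docs): even with the declaration, status above is a UTF-8 byte string, not a unicode object. A hedged sketch of the safer pattern, using a unicode literal and encoding explicitly on output:

# -*- coding: utf-8 -*-
status = u"It's -10°C here in Überwald and I'm sitting outside at the Café"
print status.encode('utf-8')  # encode explicitly instead of relying on the default ascii codec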
I took Randy's hammer idea and got the following. I can actually get a handle on which records are the offenders:
""" Run in an ArcMap 10.6.1 Python window. I got the same output in
a Spyder console
"""
>>> import arcpy
arcpy.env.workspace = r'J:\WaterQuality\test_tables.gdb'
#arcpy.env.workspace = r'I:\GIS\ArcSDE\SuperUser\pwengfc\SLCOen@pweng.sde'
fields = ['OBJECTID', 'SITENOTES']
table = 'MacrosSamples'
with arcpy.da.SearchCursor(table,fields)as cursor:
for row in cursor:
if row[1] == None:
pass
elif u"\u201c" in row[1]:
print(row)
(3, u'Substrate \u201co\u201d = bedrock\nUpper reaches definitely take caution, high flow at this time. Lower reaches nice and easy to sample. Slick rocks. Better with multiple samplers.')
(16, u'Park at Spruces campground lot.\nX site (A) is just downstream of bridge. Work downstream from \u201cA\u201d site.\nSome rocks are very slippery\nSeveral large logs/debris jams in reach')
(33, u'\u201cBedrock\u201d is a calcified structure.\nSafe for 1-2 samplers\nGolfers utilizing course that river runs through\nCulvert at start and middle of reach\nEasy parking in golf course lot')
(65, u'Substrate \u201cother\u201d = bedrock')
(66, u'Substrate \u201cother\u201d = bedrock')
(69, u'Substrate \u201co\u201d = bedrock\nSafe for 1 sampler')
(81, u'Lambs Canyon Creek enters Parleys at beginning of \u201cA\u201d transect. Long hike to x site. Safe for 1 person.\n')
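If you want to catch any non-ASCII character rather than testing one code point at a time, a hedged variation on the same cursor (reusing table and fields from above):

# Flag every row whose SITENOTES contains any character outside the ASCII range.
with arcpy.da.SearchCursor(table, fields) as cursor:
    for row in cursor:
        if row[1] and any(ord(ch) > 127 for ch in row[1]):
            print(row)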
Can you explain a bit more about what you want to do with this text data? Your errors are happening when you print it out, and this is expected in Python 2 (in Python 3 you don't need to worry as much, since strings are unicode objects).
For example: I can read in some data with non-ASCII characters and write it out to another table with no issues, but if I try to print it (or write it to a text file/spreadsheet) without encoding it, I get the dreaded UnicodeEncodeError. If you want to print it or output it, you need to encode it.
import arcpy

table, field = 'c:/temp/default.gdb/test', 'testfield'
with arcpy.da.SearchCursor(table, field) as rows:
    for row in rows:
        data = row[0]
        break

print('{} = tablename; {} = fieldvalue'.format(table, data))
# Runtime error
# Traceback (most recent call last):
#   File "<string>", line 4, in <module>
# UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in position 8: ordinal not in range(128)

with open('c:/temp/test.txt', 'w') as t:
    t.write(data)
# Runtime error
# Traceback (most recent call last):
#   File "<string>", line 2, in <module>
# UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in position 8: ordinal not in range(128)

print('{} = tablename; {} = fieldvalue'.format(table, data.encode('utf-8')))
# c:/temp/default.gdb/test = tablename; It's -10°C here in Überwald and I'm sitting outside at the Café = fieldvalue
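And a hedged fix for the text-file failure above: open the output file with an explicit encoding so unicode data (the data variable from the cursor) can be written directly.

import io

with io.open('c:/temp/test.txt', 'w', encoding='utf-8') as t:
    t.write(data)  # no UnicodeEncodeError; the text is written out as UTF-8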
Hammers sometimes miss and hit you on the thumb. You can try to replace most of the usual non-ASCII characters, but you'll always run across another...
Your best bet is to deal with it as unicode, then encode it on output.
If you absolutely must force it to ASCII, try the hammer (manual replacement); if that fails, just strip them out:
data.encode('ascii', 'ignore')
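A quick sketch of what the 'ignore' handler does (any non-ASCII character is simply dropped):

print(u"Caf\xe9 \u201cquote\u201d".encode('ascii', 'ignore'))
# Caf quote   <- the e-acute and the curly quotes are gone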
You could also look at a third-party library that takes Unicode data and tries to represent it in ASCII characters - Unidecode · PyPI
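A minimal sketch, assuming Unidecode has been installed (e.g. via pip install unidecode):

from unidecode import unidecode

# Transliterate unicode text to a plain-ASCII approximation.
print(unidecode(u"It's -10\xb0C here in \xdcberwald at the Caf\xe9"))
# Something close to: It's -10degC here in Uberwald at the Cafe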