
gdal.VectorTranslate() transforms non-ASCII characters?

294
4
3 weeks ago
AlfredBaldenweck
MVP Regular Contributor

I'm trying a workflow using gdal.VectorTranslate(), since ogr2ogr isn't there anymore.

I'm having an issue where the original data uses non-ASCII characters, but they get replaced by � when translated.

[Screenshots: AlfredBaldenweck_0-1766169224937.png, AlfredBaldenweck_1-1766169237494.png]

How can I make this not happen?

I tried setting the config options, but "OGR_FORCE_ASCII" is only used by certain drivers and processes. Similarly, I cannot find a Translate option that would appear to take care of this.

This is kind of a major thing. If I absolutely have to, I suppose I can get the strings from the original data, but that will majorly slow things down, not to mention complicate things. 
Thanks

Edit: it appears that this is dependent on the source; I have no problems going from fGDB to fGDB, but the real data is in an MDB.

4 Replies
AlfredBaldenweck
MVP Regular Contributor

It seems that pyodbc can read it just fine, which is super frustrating: I can't figure out how to get it into a different format without downloading a new driver, and that doesn't work if I'm trying to distribute this workflow to various users (unless ?)

For context, this is what gdal is showing me for the strings

bytearray(b'S\xbdNE\xbc')

I'm not sure how some programs can read these bytes correctly as fractions while others can't.

I was able to convert that string to hex, which I fed to a converter online and got the desired output, and then the converter immediately broke when I tried doing it again. 

53bd4e45bc

Putting the string with the fractions into the same converter gives me this:

53 c2 bd 4e 45 c2 bc

As you can see, I'm missing some bytes here.

Doing bytes.fromhex().decode() fails with an "invalid start byte" error.
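The two hex dumps line up once the raw bytes are read as cp1252 (the Windows "ANSI" code page) rather than UTF-8: 0xbd and 0xbc are one-byte ½ and ¼ in cp1252, while UTF-8 spells those characters as the two-byte sequences c2 bd and c2 bc. A minimal sketch in pure Python, using the byte string shown above:

```python
raw = bytes.fromhex("53bd4e45bc")  # the bytes GDAL returned for the field

# 0xbd / 0xbc are not valid UTF-8 start bytes, so a plain .decode() fails:
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.reason)  # invalid start byte

# In cp1252 those same bytes are single-byte fractions:
print(raw.decode("cp1252"))  # S½NE¼

# Encoding the decoded string as UTF-8 reproduces the converter's
# two-byte sequences (c2 bd, c2 bc):
print(raw.decode("cp1252").encode("utf-8").hex(" "))  # 53 c2 bd 4e 45 c2 bc
```

So nothing is actually missing from the data; the online converter was showing the UTF-8 encoding, while the MDB stores the cp1252 one.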

Kind of out of ideas here, so if anyone has any I'd really appreciate it

AlfredBaldenweck
MVP Regular Contributor

Okay, with the help of this post, I'm able to read it all right.

import codecs
from osgeo import ogr

inmdb = ogr.Open(in_ds)  # in_ds is the path to the source MDB
sql = "SELECT TextString FROM LDAnno"
res = inmdb.ExecuteSQL(sql)
for r in res:
    # Grab the raw field bytes and decode them with the Windows ANSI
    # code page ('dbcs' is a codec alias for it; Windows-only).
    t = r.GetFieldAsBinary(0).hex()
    t = codecs.decode(t, "hex").decode("dbcs")
    print(t)
inmdb.ReleaseResultSet(res)

 S½NE¼ W½SE¼ E½SW¼ E½SW¼ W½NE¼ W½NE¼ S½NW¼ N½SW¼
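One portability note, as a sketch: 'dbcs' is a Windows-only codec alias for whatever the system ANSI code page happens to be, so naming cp1252 explicitly keeps the decode working on any OS (and for any user whose machine isn't set to cp1252). The hex round-trip is also unnecessary, since the raw bytes can be decoded directly:

```python
# Sample of what GetFieldAsBinary(0) returned for one feature:
field_bytes = bytearray(b'S\xbdNE\xbc')

# Decode the raw bytes directly, naming the code page explicitly:
text = bytes(field_bytes).decode("cp1252")
print(text)  # S½NE¼
```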

The question remains: how do I force GDAL to use that reading instead of doing its own thing?

HaydenWelch
MVP Regular Contributor

https://github.com/OSGeo/gdal/blob/9d2c301cb3e18d2fea3af32652d0a31de0447e10/apps/ogr2ogr_lib.cpp#L90

https://gdal.org/en/stable/doxygen/classCPLStringList.html

Seems like they're using a char array for strings? Could be that the source data isn't properly encoded, or is encoded as something that isn't utf-8 (latin1? cp1252?)

You're really having a lot of fun issues with encoding lately, huh?

AlfredBaldenweck
MVP Regular Contributor

It's ANSI, it appears. 

I'm looking at finding all the text fields, cycling through them, and then running an update cursor on the final product to set the correct values. I'm not 100% sure how I'm going to do all that (for reasons I don't really want to get into, I had to create a new unique ID field for each table during the Translate() process), but we're going to try. It'd be fine if the values just came over as-is and Pro couldn't read them, but instead they get evaluated as UTF-8, the evaluation freaks out, and the values themselves get changed.
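For what it's worth, re-reading the source like this really does seem necessary, rather than trying to repair the translated output: once a reader assumes UTF-8 and substitutes U+FFFD (�) for the bad bytes, the original byte values are gone. A quick pure-Python illustration using the sample bytes from earlier in the thread:

```python
raw = b'S\xbdNE\xbc'  # cp1252 bytes for S½NE¼

# A reader that assumes UTF-8 and substitutes U+FFFD for bad bytes:
mangled = raw.decode("utf-8", errors="replace")
print(mangled)  # S�NE�

# Every invalid byte maps to the same replacement character, so the
# original 0xbd / 0xbc values are unrecoverable from the output alone.
print(mangled == "S\ufffdNE\ufffd")  # True
```

That's why joining back to the source on the new unique ID field, rather than re-decoding the translated strings, looks like the right call.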
