Extract specific strings (more than one) into new fields

AlexanderPichlmayr · ‎07-09-2013

Hello:

I want to extract strings from a field called "other_tags" using the field calculator.
"other_tags" has a lot of unused information and the sting part that I want to extract is always located at a different position. Here some examples of the field "other_tags":

"lit"=>"yes","maxspeed"=>"50","oneway"=>"yes","ref"=>"B 6","voltage"=>"110000"
"voltage"=>"110000"
"lit"=>"yes","voltage"=>"110000"
"lanes"=>"2","oneway"=>"yes","ref"=>"B 6","voltage"=>"110000"

Always when the string " "voltage"=>" " accures the subsequent number should be written in a new field. Lets call the new field "Result1". It is also possible that the stirng " "voltage"=>" " existes 2,3 or 4 times in a single string so somehow i would have to write the second/third and fourth value in other fields. Lets call them "Result2", "Result3 and "Result4".

There was a similar thread wich didnt really help me further because my data in the field is not structured that way. But maybe it representes some productive food for thought:
http://forums.arcgis.com/threads/45108-Extract-string-in-field-calculator-using-python
Also I have the same problem then the auditor of that thread that the code he was using does not work with field calculator!? Even when my source field is a sting type just as the target field.

From my point of view it seams to be a bretty complex topic for someone like me whose python experience is very small - So I am glad for any kind of help!!!

markdenil · ‎07-09-2013

Did you try something like this?

txtStr = '"lit"=>"yes","maxspeed"=>"50","voltage"=>"900","voltage"=>"666","oneway"=>"yes","ref"=>"B 6","voltage"=>"110000"'
sList = txtStr.split(',')
newList = []
for q in sList:
    if q.find("voltage") >= 0:
        newList.append(q)
for n in newList:
    print n

This yeilds a list of just the ","voltage"=>... components of the value string.
Not a complete solution, but perhaps a start.

AlexanderPichlmayr · ‎07-10-2013

Hi thanks for your reply,

well as I said my python skills are still very lame so there are some things a little unclear to me:

1) why did you add more then one voltage value in txtStr? The list I am extracting from is very long so there is no way to know what values for voltage exist. Also the shape of all the unused information is unknown - I suppose all existing shapes have to be included in txtStr in order to get it to work!?

txtStr = '"lit"=>"yes","maxspeed"=>"50","voltage"=>"900","voltage"=>"666","oneway"=>"yes","ref"=>"B 6","voltage"=>"110000"'

2) what function has

newList = []

and

for q in sList:

3) isnt there a pre-logic script code to be set up?

markdenil · ‎07-10-2013

Yes, the pre-logic code block will be based on the snippit provided.

in the code snippit:
txtStr is the string in the attibute field, and is accessed itteritivly for each record in the table

sList = txtStr.split(',') ## this splits the string on the comma delimiters, into a list
(one entry per keyword_flag=>value)

newList = [] ## this creates a new, empty list. it will hold all the voltage values, however many there may be in that line.
(My initial understanding of your remark:
" It is also possible that the stirng " "voltage"=>" " existes 2,3 or 4 times in a single string "
to mean just that: the sample txtStr I used has 3 different voltage flags and values)

The next bit:

for q in sList:
    if q.find("voltage") >= 0:
        newList.append(q)

checks each item in the list (the one of the split-up text string)
and, if it finds the string 'voltage', writes it to the new list.

Now you have a list of just the voltage flags and values:
for example: ["voltage"=>"900","voltage"=>"666","voltage"=>"110000"]

The last bit would not go in the code block:
for n in newList: print n
just prints each "voltage"=>"whatever" string to the screen.
It is included here so the snippit does something as a snippit.....

what you would have to do instead is to set each item in the newList to a variable
that is then written to the new attribute item by the Calculator

You may want to strip off the "voltage"=> part using string.replace()
and perhaps convert the remaining numeric string to actual numbers
using short(), long(), or float()

Of course, the Calculator only writes one field, so you will have to get around that.
Using an Update Cursor in a custom script is recomended,
but to use the calculator you could make the calculation in stages:
you could calculate each Result field (Result1, Result2, Resultx)
in turn by indexing the newList; as in using newList[0] to calculate Result1
and so on.

JamesCrandall · ‎07-10-2013

If it were me I'd focus on getting the data squared away first instead of attempting to cobble it all together with scripting. Use some proper database design technique and get the data of interest into appropriate fields --- you may even want to use another tool that has friendlier string manipulation functions like MS Excel, THEN get your spatial data in order.

I wish I had a direct solution for you! But in your case it appears that you will be attempting to apply database methodology (what GIS data really is) to a hodgepoge of strings contained in a single field (an improper database design approach).

markdenil · ‎07-10-2013

Excel is a good tool for cobbling...
But getting the data squared away is not so very difficult with scripting.
Mucking about with Calculate Field is not the way to do it, however.

Assuming you have already added 6 new Results (numeric) fields to your table
(Result1, Result2, and so on)
and that the data string is in a field called BigFunkyField
and the fc is c:\FileGeodatabase.gdb\inFC.....

import arcpy

inTable = r"c:\FileGeodatabase.gdb\inFC"
cur = arcpy.UpdateCursor(inTable)
for row in cur:
    txtStr = row.BigFunkyField
    sList = txtStr.split(',')
    newList = []
    for q in sList:
        if q.find("voltage") >= 0:
            volt = q.replace('"voltage"=>', '')
            newList.append(long(volt.replace('"', '')))

    length = len(newList)
    r1, r2, r3, r4, r5, r6 = 0, 0, 0, 0, 0, 0
    rList = [r1, r2, r3, r4, r5, r6]

    for n in range(0, length):
        rList = newList

    row.setValue(Result1, rList[0])
    row.setValue(Result2, rList[1])
    row.setValue(Result3, rList[2])
    row.setValue(Result4, rList[3])
    row.setValue(Result5, rList[4])
    row.setValue(Result6, rList[5])
    cur.updateRow(upRow)
del cur

print 'Done'

This is not elegent, and not tested, but....
It also turns the number strings into real numbers...

JamesCrandall · ‎07-10-2013

Excel is a good tool for cobbling...
But getting the data squared away is not so very difficult with scripting.
Mucking about with Calculate Field is not the way to do it, however.

Assuming you have already added 6 new Results (numeric) fields to your table
(Result1, Result2, and so on)
and that the data string is in a field called BigFunkyField
and the fc is c:\FileGeodatabase.gdb\inFC.....

This is not elegent, and not tested, but....
It also turns the number strings into real numbers...

No doubt. I tend to always approach my development from a database-centric position --- that is, if the database (or GIS layer in this case) is properly designed and maintained, then there is never any need to cobble things together and I set myself up to better be able to develop elegant solutions. Though I'd question if any of my python apps could be considered elegant 🙂

...I am much more of a RDBMS database/N-Tier application developer that is now working in a python scripting world but continually find similarities where I can bring those experiences into my new world.

markdenil · ‎07-11-2013

You will get no disageement from me that
"if the database (or GIS layer in this case) is properly designed and maintained, then there is never any need to cobble things together".