Split Text Base on Multiple Delimiters

JudsonCrouch1 · ‎03-03-2016

I am working on creating a script that can be used when converting KMZ files to a feature class. I have that part down and am now getting rid of extraneous columns and cleaning up the data. The issue is that when the KMZ is imported, the "PopupInfo" field contains " " and "" characters separating things like type of site, phone number, address, and lat/long.

Ex:

Education (123) 456-7890 12345 Road Name Rd., City, ST 12345 Location:DD.DDDDDN, DD.DDDDDW

I was trying to think of a way to split the text out into respective columns (type, phone, address, Lat, Long) but for the life of me cannot figure out how to split this in an efficient way. In field calculator I have tried:

!PopupInfo!.split('<br>')[0]

and successfully brought over the type but have no idea how to do it with the phone number and addresses.

Is there anyway to split text based on multiple delimiters or can anyone suggest an efficient way to do this?

Thanks in advance!

DanPatterson_Retired · ‎03-03-2016

if it is text and only contains the break designator, you could replace the before and after case, then split... at least that would give you a list from which you could do something

x = "Education (123) 456-7890 12345 Road Name Rd., City, ST 12345 Location:DD.DDDDDN, DD.DDDDDW"
y = ((x.replace(" ","|")).replace("</br","")).split("|")
y

['Education', '(123) 456-7890>', '12345 Road Name Rd., City, ST 12345', 'Location:DD.DDDDDN, DD.DDDDDW']

View solution in original post

JakeSkinner · ‎03-03-2016

Hi Judson,

You could do this with an update cursor. Ex:

arcpy.da.UpdateCursor(fc, ["originalField", "type", "phone", "address", "lat", "Long"]) as cursor:
    for row in cursor:
        row[1] = row[0].split('<br>')[0]
        row[2] = row[0].split('<br>')[1]
        row[3] = row[0].split('<br>')[2]
        row[4] = row[0].split('<br>')[3]
        row[5] = row[0].split('<br>')[4]
        cursor.updateRow(row)

You will just need to update the "originalField" with the field name that contains the text string you are breaking up.

JudsonCrouch1 · ‎03-03-2016

Jake,

I used the following code however, it return with the following error.

import arcpy

ap.env.workspace = "...\\path\\to\\gdb"

with arcpy.da.UpdateCursor('Places', ["PopupInfo", "Type", "Phone", "Address", "Lat", "Long"]) as cursor:
     for row in cursor:
        row[1] = row[0].split('<br>')[0]
        row[2] = row[0].split('<br>')[1]
        row[3] = row[0].split('<br>')[2]
        row[4] = row[0].split('<br>')[3]
        row[5] = row[0].split('<br>')[4]
        cursor.updateRow(row)

row[5] = row[0].split(' ')[4]

IndexError: list index out of range.

I think this is caused by gaps in my data. I looked to make sure it was all the same format but it looks like it is not so I think I need to do a while loop to skip over the bad records in the table but am not sure how to accomplish that. Any suggestions?

BlakeTerhune · ‎03-03-2016

That error means it's trying to get the fifth thing in the row split. Looking at your example, it probably means there were only three instances of your split text ' '

I recommend you print row[0] before you start splitting it so you know which one it chokes on. If you really just want to skip this error, you could use a try, except block.

import arcpy

ap.env.workspace = "...\\path\\to\\gdb"

update_fields = ["PopupInfo", "Type", "Phone", "Address", "Lat", "Long"]
with arcpy.da.UpdateCursor('Places', update_fields) as cursor:
    for row in cursor:
        try:
            row[1] = row[0].split('<br>')[0]
            row[2] = row[0].split('<br>')[1]
            row[3] = row[0].split('<br>')[2]
            row[4] = row[0].split('<br>')[3]
            row[5] = row[0].split('<br>')[4]
            cursor.updateRow(row)
        except IndexError as iErr:
            print "Error processing {}\n{}".format(row[0], iErr)
            pass

This will only pass on that specific exception (IndexError). If something else happens, the script will still stop.

JudsonCrouch1 · ‎03-03-2016

I tried your code block and it returned every record. There are a couple of records that have only 2 or 3 instances of ' ' in them...

I'm still a bit new to python and arcpy so can you clarify how to "print row[0]" to find out where the error is? I tried replacing the cursor.updateRow(row) with print row[0] and it returned the same record as before.

Thanks for the help!

DanPatterson_Retired · ‎03-03-2016

if it is text and only contains the break designator, you could replace the before and after case, then split... at least that would give you a list from which you could do something

x = "Education (123) 456-7890 12345 Road Name Rd., City, ST 12345 Location:DD.DDDDDN, DD.DDDDDW"
y = ((x.replace(" ","|")).replace("</br","")).split("|")
y

['Education', '(123) 456-7890>', '12345 Road Name Rd., City, ST 12345', 'Location:DD.DDDDDN, DD.DDDDDW']

JudsonCrouch1 · ‎03-03-2016

A combination of all of this above made this work. Thanks for all the help!

See code below for final code:

fields = ["Type", "Phone", "Address", "Lat", "Long"]
addfield = partial(ap.AddField_management, feature, field_type = "STRING", field_length = 150)

print("Adding Fields...")
for field in fields:
    addfield(field)
print("Fields Added.")


with ap.da.UpdateCursor(feature, ["PopupInfo", "Type", "Phone", "Address", "Lat", "Long"]) as cursor:
    for row in cursor:
        try:
            row[1] = ((row[0].replace('<br>','|')).replace('</br>', '|')).split('|')[0]
            row[2] = ((row[0].replace('<br>','|')).replace('</br>', '|')).split('|')[1]
            row[3] = ((row[0].replace('<br>','|')).replace('</br>', '|')).split('|')[2]
            row[4] = ((row[0].replace('<br>','|')).replace('</br>', '|')).split('|')[3]
            row[5] = ((row[0].replace('<br>','|')).replace('</br>', '|')).split('|')[4]
            cursor.updateRow(row)
        except IndexError as iErr:
            print"Error processing: {}\n{}.".format(row[0], iErr)
            pass

This left me with 4 of the original 55 records as null due to those 4 records having only two instances of ' ' in them.