Extracting certain string/characters and numbers from field

721
14
Jump to solution
09-27-2019 03:14 PM
2Quiker
Occasional Contributor II

I have a field with with descriptions and i need to extract the tax numbers from this field and put them into another field in the same layer. 

Some examples of what i need .

1. 

12-5N-5W SW NW SW S & W OF CANAL,, TX 2 IN SWNW LS TX 2-A

- I need to extract "TX 2" from this field

2. 

23-3N-2W SW 4TH ST TOWNHOMES TX 11271 IN LT 4C BLK 1

- I need to extract "TX 11271" from this row

3. 

04-5N-5W SE TAX 3 & TAX 4 IN NWSE

- I need to extract "TAX 3, TAX 4" from this row

4. 

21-4N-3W SW TX 86, 89, 90 & 93 IN S 1/2 OF SW

- I need to exact "TX 86, 89,90 & 93" from this row

I currently can extract numbers with the following and i need help to include what i mentioned above.  I guess i need to extract everything beginning at TX and before the space before the next alpha characters. 

fc = r'C:\Temp\Parcels.shp'

with arcpy.da.UpdateCursor(fc,['Legal','Legal2']) as cursor:
    for row in cursor:
        row[1] = ''.join([str(i) for i in row[0] if i.isdigit()])
        cursor.updateRow(row)
0 Kudos
1 Solution

Accepted Solutions
JoshuaBixby
MVP Esteemed Contributor

You have some edge cases that will be challenging to address, but Python regular expressions should be able to get a vast majority of your cases:

>>> import re
>>>
>>> recs = [
...     "12-5N-5W SW NW SW S & W OF CANAL,, TX 2 IN SWNW LS TX 2-A",
...     "23-3N-2W SW 4TH ST TOWNHOMES TX 11271 IN LT 4C BLK 1",
...     "04-5N-5W SE TAX 3 & TAX 4 IN NWSE",
...     "21-4N-3W SW TX 86, 89, 90 & 93 IN S 1/2 OF SW"
... ]
>>>
>>> pattern = re.compile(r"((?:TX|TAX) [,& 0-9]+)")
>>>
>>> for rec in recs:
...     print(pattern.findall(rec))
...
...
['TX 2 ', 'TX 2']
['TX 11271 ']
['TAX 3 & ', 'TAX 4 ']
['TX 86, 89, 90 & 93 ']
>>> ‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

One edge case my re expression doesn't currently address is the "TX 2-A" at the end of example 1.  You don't mention wanting to extract it, why not?  Also, have a tax number with an alphabetic character in it does complicate the regular expression.

UPDATE:  I came up with a slightly different, I think better, approach that will return just the tax ids as a list from the description:

>>> import re
>>>
>>> recs = [
...     "12-5N-5W SW NW SW S & W OF CANAL,, TX 2 IN SWNW LS TX 2-A",
...     "23-3N-2W SW 4TH ST TOWNHOMES TX 11271 IN LT 4C BLK 1",
...     "04-5N-5W SE TAX 3 & TAX 4 IN NWSE",
...     "21-4N-3W SW TX 86, 89, 90 & 93 IN S 1/2 OF SW"
... ]
>>>
>>> pattern = re.compile(r"[ ,&]+")
>>>
>>> def extract_tax_ids(split_desc):
...     l = []
...     b = False
...     for i in split_desc:
...         if i in ("TX", "TAX"):
...             b = True
...             continue
...         if b and not i.isalpha():
...             l.append(i)
...         else:
...             b = False
...     return l
...
>>>
>>> for rec in recs:
...     print(extract_tax_ids(pattern.split(rec)))
...
['2', '2-A']
['11271']
['3', '4']
['86', '89', '90', '93']
>>> 

View solution in original post

14 Replies
DanPatterson_Retired
MVP Esteemed Contributor

strings are iterables and you can use an 'in' check.  since you have so few cases, you could make a list of those and compare the cases to each record in a searchcursor.  the logic is as follows

cases = ['TX 2', 'TX 11271']

records = ['12-5N-5W SW NW SW S & W OF CANAL,, TX 2 IN SWNW LS TX 2-A', 
           '23-3N-2W SW 4TH ST TOWNHOMES TX 11271 IN LT 4C BLK 1',
           '04-5N-5W SE TAX 3 & TAX 4 IN NWSE'
           ]

for rec in records:
    for case in cases:
        if case in rec:
            print("{} found in {}".format(case, rec))
        else:
            print("{} not found".format(case))
            
TX 2 found in 12-5N-5W SW NW SW S & W OF CANAL,, TX 2 IN SWNW LS TX 2-A
TX 11271 not found
TX 2 not found
TX 11271 found in 23-3N-2W SW 4TH ST TOWNHOMES TX 11271 IN LT 4C BLK 1
TX 2 not found
TX 11271 not found

Of course, if you are now going to say you have a few hundred cases... never mind

0 Kudos
JoshuaBixby
MVP Esteemed Contributor

You have some edge cases that will be challenging to address, but Python regular expressions should be able to get a vast majority of your cases:

>>> import re
>>>
>>> recs = [
...     "12-5N-5W SW NW SW S & W OF CANAL,, TX 2 IN SWNW LS TX 2-A",
...     "23-3N-2W SW 4TH ST TOWNHOMES TX 11271 IN LT 4C BLK 1",
...     "04-5N-5W SE TAX 3 & TAX 4 IN NWSE",
...     "21-4N-3W SW TX 86, 89, 90 & 93 IN S 1/2 OF SW"
... ]
>>>
>>> pattern = re.compile(r"((?:TX|TAX) [,& 0-9]+)")
>>>
>>> for rec in recs:
...     print(pattern.findall(rec))
...
...
['TX 2 ', 'TX 2']
['TX 11271 ']
['TAX 3 & ', 'TAX 4 ']
['TX 86, 89, 90 & 93 ']
>>> ‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

One edge case my re expression doesn't currently address is the "TX 2-A" at the end of example 1.  You don't mention wanting to extract it, why not?  Also, have a tax number with an alphabetic character in it does complicate the regular expression.

UPDATE:  I came up with a slightly different, I think better, approach that will return just the tax ids as a list from the description:

>>> import re
>>>
>>> recs = [
...     "12-5N-5W SW NW SW S & W OF CANAL,, TX 2 IN SWNW LS TX 2-A",
...     "23-3N-2W SW 4TH ST TOWNHOMES TX 11271 IN LT 4C BLK 1",
...     "04-5N-5W SE TAX 3 & TAX 4 IN NWSE",
...     "21-4N-3W SW TX 86, 89, 90 & 93 IN S 1/2 OF SW"
... ]
>>>
>>> pattern = re.compile(r"[ ,&]+")
>>>
>>> def extract_tax_ids(split_desc):
...     l = []
...     b = False
...     for i in split_desc:
...         if i in ("TX", "TAX"):
...             b = True
...             continue
...         if b and not i.isalpha():
...             l.append(i)
...         else:
...             b = False
...     return l
...
>>>
>>> for rec in recs:
...     print(extract_tax_ids(pattern.split(rec)))
...
['2', '2-A']
['11271']
['3', '4']
['86', '89', '90', '93']
>>> 
2Quiker
Occasional Contributor II

I was able to get the print outs. I still need to get the Legal2 field populated with the results some how.

When i add an update cursor i get the following error

line 27, in <module>
    with arcpy.da.UpdateCursor(recs,["Legal2"]) as cursor:
RuntimeError: 'in_table' is not a table or a featureclass

I am a little lost on how to populate the Legal2 field? -   with arcpy.da.UpdateCursor(recs,["Legal2"]) as cursor:

with arcpy.da.UpdateCursor(recs,["Legal2"]) as cursor:
    for row in cursor:

print results, but they looked like this.

['02362']
['03315', '05857', '05858']
['15022']
['03398']
['5858']
['05857']
['1', '2', 'T72007']
['1']
['2']
[]
[]
[]
[]
[]
['04089']
['04089']

currently have

import arcpy, re

recs = r'C:\Temp\Descriptions.shp'

field = 'Legal'
recs = (row[0] for row in arcpy.da.SearchCursor(recs, field))

pattern = re.compile(r"[ ,&]+")

def extract_tax_ids(split_desc):
    l = []
    b = False
    for i in split_desc:
         if i in ("TX", "TAX"):
             b = True
             continue
         if b and not i.isalpha():
             l.append(i)
         else:
             b = False
    return l


#for rec in recs:
   #where = extract_tax_ids(pattern.split(rec))

with arcpy.da.UpdateCursor(recs,["Legal2"]) as cursor:
    for row in cursor:‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍
0 Kudos
JoshuaBixby
MVP Esteemed Contributor

What do you want the Legal2 field to be?  A comma separated string?

Looking at your code, you have a couple issues going on.  First, you set recs to be a path to a shape file, but then it gets set again as a generator expression.  That is why your cursor is giving the error it is.  Second, you won't be able to pass a list of values directly into a single field with the update cursor. 

0 Kudos
RandyBurton
MVP Regular Contributor

To clarify, you want to start with either "TX" or "TAX" and stop just before "IN"?

0 Kudos
2Quiker
Occasional Contributor II

Some times it' not before the IN, some times it's before other alpha characters. It does appear to before alpha characters though.

22-3N-2W SE NAMPA ORIGINAL TX 98728 OF LT 32 BLK 34
02-3N-2W NW TX 03475 LS RD IN S 1/2 NW
0 Kudos
RandyBurton
MVP Regular Contributor

Here's my latest test.  It looks for a pattern of TX or TAX and selects everything until a letter is found that is neither TX nor TAX.  The test code:

import re

recs = [
    "12-5N-5W SW NW SW S & W OF CANAL,, TX 2 IN SWNW LS TX 2-A",
    "23-3N-2W SW 4TH ST TOWNHOMES TX 11271 IN LT 4C BLK 1",
    "07-4N-3W NE NENE S&W OF HWY",
    "04-5N-5W SE TAX 3 & TAX 4 IN NWSE",
    "22-3N-2W SE NAMPA ORIGINAL TX 98728 OF LT 32 BLK 34",
    "02-3N-2W NW TX 03475 LS RD IN S 1/2 NW12",
    "21-4N-3W SW TX 86, 89, 90 & 93 IN S 1/2 OF SW"
    ]

pattern = re.compile(r"((?:TX|TAX) [^A-Z]+ )")

for rec in recs:
    print ''.join((pattern.findall(rec))).strip()

# outputs:
TX 2
TX 11271
(blank line)
TAX 3 & TAX 4
TX 98728
TX 03475
TX 86, 89, 90 & 93‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

So I would suggest trying something like:

import arcpy
import re

pattern = re.compile(r"((?:TX|TAX) [^A-Z]+ )")

fc = r'C:\Temp\Descriptions.shp'

with arcpy.da.UpdateCursor(fc,['Legal','Legal2']) as cursor:
    for row in cursor:
        row[1] = ''.join(pattern.findall(row[0].upper())).strip()
        cursor.updateRow(row)‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

Hope this helps.

2Quiker
Occasional Contributor II

The code worked but i think the data in the legal field has lots of variations and i thought that the tax number ended before an alpha number but it doesn't, some will be TAX 30-A-1-E or TX 30-A-1, etc.

27-4N-3W SW BEALS ACREAGE E 278' OF TX 30-A OF BLKS 3,4,5,6 LESS TX 30-A-1 LS ST
22-4N-3W NE DEMENT TAX 7-B BLK 52 IN DEMENT & BOONE CALD
27-4N-3W SW BEALS ACREAGE TAX 14-B IN BLK 4 LESS TAX 14-B-1
27-4N-3W SW BEALS ACREAGE TAX 20, TAX 30-A-1-E BLK 3
27-4N-3W SW BEALS ACREAGE TX 30-A-1-D-A BLK 6‍‍‍‍‍
24-4N-5W NE TAX 2&3 IN NENE
31-4N-2W SW LONGVIEW PLACE TX 1 IN BLKS 1,2,3,LS HWY,, TX 4,5 IN BLK 3 & TX 9,10 IN BLK 4 TX 8 IN BLK 6
24-3N-2W NW NESW LESS TAX 34 & 46, S 1/2 NW LS TX 67 IN SWNW,,TX 46-A IN NESW
21-4N-3W NE CALLOWAY ADD TX 4 & 5 & W 7' OF S 114' OF TX 3 IN LT 3 LS HWY BLK 2

1. should be TX 30-A, TX 30-A-1

2. should be TAX 7-B

3. Should be TAX14-B, TAX 14-B-1

4. Should be TAX 20, TAX 30-A-1-E

5. Should be TX 30-A-1-D-A

6. Should be TAX 2, TAX3 or TAX 2&3

7. TX1, TX 4&5, TX9&10

8. TAX 34 &46. TX 46-A or TAX34, TAX46, TX 46-A

9. Should be TX 4&5, TX 3 or TX 4, TX 5, TX 3

Running a print i get the following.

20', '30-A-1-E']
['43']
['33', '4', '17', '30-A-1-B']
['4', '33', '30-A-1-A']
['18', '30A1CC']
['30-A-1-F']
['30-A-1-D', '30-A-1-D-A', '30-A-1-F']
['30-A-1-C-A']
['30-A-1-C-B']‍‍‍‍‍‍‍‍‍
['7', '7A', '7B', '7C']
0 Kudos
JoshuaBixby
MVP Esteemed Contributor

 i think the data in the legal field has lots of variations

I agree, and I don't think you will be able to come up with a rule set that will address all cases; well, come up with one in a relatively reasonable amount of time.  A couple of different approaches have been suggested and demonstrated, and I think now you just need to dig in.  I suggest find a rule set that address most cases and then validating the results and addressing the edge cases manually.