Select to view content in your preferred language

Extract particular values from field

2925
20
Jump to solution
03-06-2020 01:13 PM
CCWeedcontrol
Frequent Contributor

I need help with extract a particulate value from the field "PojectIinfo" field.  I need to only extract the value/characters of the letter B and the 10 digits after the B an place it in the field "BP".

There might be a space in that value as well, for example B24501000 space 0. The total characters would be 11; B plus 10 digits. I am thinking i need to use re.compile but i am uncertain on the code.

 

Current field looks like this.

New SFR w/Attached Garage, covered parch and patio B24501000 0  >>>Extract B24501000 0 to BP field

B34567000 0 30 X 48 Pole Barn >>>Extract B34567000 0 to BP field

NEW 22,778 SQ FT B33561000 0 Addition to existing hop building >>>Extract B33561000 0 to BP field

Residential B14356000 0 Storage>>>Extract B14356000 0 to BP field

0 Kudos
20 Replies
CCWeedcontrol
Frequent Contributor

I still got the same error on the same line. I checked the field and the lowest count of words in the filed is 2 and a max of 61.

I used len(!PojectIinfo!.split(" ")) to get the count of words in that field.

0 Kudos
RandyBurton
MVP Alum

Perhaps the following.

import re

recs = [
    "New SFR w/Attached Garage, covered parch and patio B24501000 0", # B24501000 0
    "B34567000 0 30 X 48 Pole Barn", # B34567000 0
    "NEW 22,778 SQ FT B33561000 0 Addition to existing hop building", # B33561000 0
    "Residential B143560000 Storage" # B14356000 
    ]

pattern = re.compile(r"([\d ?]{8,10})") # B followed by 8-10 digits, may include space

for rec in recs:
    print(pattern.findall(rec)[0].strip()) # strip trims a trailing space

''' Results:   
B24501000 0
B34567000 0
B33561000 0
B143560000
'''‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍
CCWeedcontrol
Frequent Contributor

After looking at the field a little closer there are Alphas characters in some records instead of a space. I need to get those as well. I would also like to add a 0 (zero) to the spaces if there is no Alph character if possible.

For example:

Single Family Res B34561000A0 >>>  Extract B34561000A0 to BP field

New SFR w/Attached Garage, covered parch and patio B24501000 0  >>>Extract B2450100000 to BP field

0 Kudos
DanPatterson_Retired
MVP Emeritus

Any other corner cases?  Nulls?  Permits that will kill a simple case( B1 BBQ and Bar   B245010000   0

0 Kudos
RandyBurton
MVP Alum

Updated expression used in my previous code:

import re

recs = [
    "New SFR w/Attached Garage, covered parch and patio B24501000 0", # B24501000 0
    "B34567000 0 30 X 48 Pole Barn", # B34567000 0
    "NEW 22,778 SQ FT B33561000 0 Addition to existing hop building", # B33561000 0
    "Residential B143560000 Storage", # B14356000
    "Single Family Res B34561000A0", # B34561000A0
    "New SFR w/Attached Garage, covered parch and patio B24501000 0" # B2450100000 
    ]

pattern = re.compile(r"([\d]+[A-Z][\d]*|[\d]+[\s][\d]*)"

for rec in recs:
    print(pattern.findall(rec)[0].strip())

'''
Results:
B24501000 0
B34567000 0
B33561000 0
B143560000
B34561000A0
B24501000 0
'''‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

Then replace the space in the results with a zero or whatever.

To experiment with regular expressions, see: Regular Expressions 101. This is the explanation given by the website of the expression used in line 12.

"[\d]+[A-Z][\d]*|[\d]+[\s][\d]*"
gm

1st Alternative [\d]+[A-Z][\d]*
  Match a single character present in the list below 
    B matches the character B literally (case sensitive)
  Match a single character present in the list below [\d]+
    + Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
    \d matches a digit (equal to [0-9])
  Match a single character present in the list below [A-Z]
    A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)
  Match a single character present in the list below [\d]*
    * Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    \d matches a digit (equal to [0-9])

2nd Alternative [\d]+[\s][\d]*
  Match a single character present in the list below 
    B matches the character B literally (case sensitive)
  Match a single character present in the list below [\d]+
    + Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
    \d matches a digit (equal to [0-9])
  Match a single character present in the list below [\s]
    \s matches any whitespace character (equal to [\r\n\t\f\v ])
  Match a single character present in the list below [\d]*
    * Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    \d matches a digit (equal to [0-9])

Global pattern flags
  g modifier: global. All matches (don't return after first match)
  m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

EDIT:

The following simpler expression also works for the examples in the code above:

pattern = re.compile(r"([\d][\d A-Z]{7,9})")‍‍

# Letter B followed by a number, followed by 7 to 9 numbers, spaces, and/or capital letters‍‍‍
CCWeedcontrol
Frequent Contributor

Getting closer, the other issue that i found was a lot records didn't have "B", people for got to put the "B" in front of the building permit. Some records have the following.

34567000 0
12345000 0
45125011A0
0 Kudos
DanPatterson_Retired
MVP Emeritus

Sounds like you aren't going to find a one-script-fits-all solution.

You might have to process and fix in batches... like finding all those records that begin with a "B", then doing those, and finding out how the rest differ.

JoshuaBixby
MVP Esteemed Contributor

The one constant appears to be an 8-digit number.  Assuming that is the case, the following regex should work to extract the 8-digit number plus 2 afterwards:

>>> l = [
... "New SFR w/Attached Garage, covered parch and patio B24501000 0",
... "B34567000 0 30 X 48 Pole Barn",
... "NEW 22,778 SQ FT B33561000 0 Addition to existing hop building",
... "Residential B14356000 0 Storage",
... "Single Family Res B34561000A0",
... "New SFR w/Attached Garage, covered parch and patio B24501000 0 "
... ]
>>>
>>> for i in l:
...   re.findall(r"\d{8}[ \w]\d", i)
...
['24501000 0']
['34567000 0']
['33561000 0']
['14356000 0']
['34561000A0']
['24501000 0']
>>> 
CCWeedcontrol
Frequent Contributor

Yes, this got me the closest to my goal. I was trying to understand how re.findall patterns work. The [ ] is used to match certain characters, and the reason you didn't use the [ ]  was because you were trying to capture the 8 digits and the last two regardless of the alpha "B". After looking at the data i noticed that some didn't have the "B" but had 8-10 digits. I appreciate the help.

 

0 Kudos
CCWeedcontrol
Frequent Contributor

Thank you for explaining "Matching characters". Thank you for the help.

0 Kudos