Extract particular values from field

CCWeedcontrol · ‎03-06-2020

I need help with extract a particulate value from the field "PojectIinfo" field. I need to only extract the value/characters of the letter B and the 10 digits after the B an place it in the field "BP".

There might be a space in that value as well, for example B24501000 space 0. The total characters would be 11; B plus 10 digits. I am thinking i need to use re.compile but i am uncertain on the code.

Current field looks like this.

New SFR w/Attached Garage, covered parch and patio B24501000 0 >>>Extract B24501000 0 to BP field

B34567000 0 30 X 48 Pole Barn >>>Extract B34567000 0 to BP field

NEW 22,778 SQ FT B33561000 0 Addition to existing hop building >>>Extract B33561000 0 to BP field

Residential B14356000 0 Storage>>>Extract B14356000 0 to BP field

CCWeedcontrol · ‎03-10-2020

I still got the same error on the same line. I checked the field and the lowest count of words in the filed is 2 and a max of 61.

I used len(!PojectIinfo!.split(" ")) to get the count of words in that field.

RandyBurton · ‎03-06-2020

Perhaps the following.

import re

recs = [
    "New SFR w/Attached Garage, covered parch and patio B24501000 0", # B24501000 0
    "B34567000 0 30 X 48 Pole Barn", # B34567000 0
    "NEW 22,778 SQ FT B33561000 0 Addition to existing hop building", # B33561000 0
    "Residential B143560000 Storage" # B14356000 
    ]

pattern = re.compile(r"([\d ?]{8,10})") # B followed by 8-10 digits, may include space

for rec in recs:
    print(pattern.findall(rec)[0].strip()) # strip trims a trailing space

''' Results:   
B24501000 0
B34567000 0
B33561000 0
B143560000
'''‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

CCWeedcontrol · ‎03-06-2020

After looking at the field a little closer there are Alphas characters in some records instead of a space. I need to get those as well. I would also like to add a 0 (zero) to the spaces if there is no Alph character if possible.

For example:

Single Family Res B34561000A0 >>> Extract B34561000A0 to BP field

New SFR w/Attached Garage, covered parch and patio B24501000 0 >>>Extract B2450100000 to BP field

DanPatterson_Retired · ‎03-06-2020

Any other corner cases? Nulls? Permits that will kill a simple case( B1 BBQ and Bar B245010000 0

RandyBurton · ‎03-06-2020

Updated expression used in my previous code:

import re

recs = [
    "New SFR w/Attached Garage, covered parch and patio B24501000 0", # B24501000 0
    "B34567000 0 30 X 48 Pole Barn", # B34567000 0
    "NEW 22,778 SQ FT B33561000 0 Addition to existing hop building", # B33561000 0
    "Residential B143560000 Storage", # B14356000
    "Single Family Res B34561000A0", # B34561000A0
    "New SFR w/Attached Garage, covered parch and patio B24501000 0" # B2450100000 
    ]

pattern = re.compile(r"([\d]+[A-Z][\d]*|[\d]+[\s][\d]*)"

for rec in recs:
    print(pattern.findall(rec)[0].strip())

'''
Results:
B24501000 0
B34567000 0
B33561000 0
B143560000
B34561000A0
B24501000 0
'''‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

Then replace the space in the results with a zero or whatever.

To experiment with regular expressions, see: Regular Expressions 101. This is the explanation given by the website of the expression used in line 12.

"[\d]+[A-Z][\d]*|[\d]+[\s][\d]*"
gm

1st Alternative [\d]+[A-Z][\d]*
  Match a single character present in the list below 
    B matches the character B literally (case sensitive)
  Match a single character present in the list below [\d]+
    + Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
    \d matches a digit (equal to [0-9])
  Match a single character present in the list below [A-Z]
    A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)
  Match a single character present in the list below [\d]*
    * Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    \d matches a digit (equal to [0-9])

2nd Alternative [\d]+[\s][\d]*
  Match a single character present in the list below 
    B matches the character B literally (case sensitive)
  Match a single character present in the list below [\d]+
    + Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
    \d matches a digit (equal to [0-9])
  Match a single character present in the list below [\s]
    \s matches any whitespace character (equal to [\r\n\t\f\v ])
  Match a single character present in the list below [\d]*
    * Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    \d matches a digit (equal to [0-9])

Global pattern flags
  g modifier: global. All matches (don't return after first match)
  m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

EDIT:

The following simpler expression also works for the examples in the code above:

pattern = re.compile(r"([\d][\d A-Z]{7,9})")‍‍

# Letter B followed by a number, followed by 7 to 9 numbers, spaces, and/or capital letters‍‍‍‍‍‍

CCWeedcontrol · ‎03-10-2020

Getting closer, the other issue that i found was a lot records didn't have "B", people for got to put the "B" in front of the building permit. Some records have the following.

34567000 0
12345000 0
45125011A0

DanPatterson_Retired · ‎03-10-2020

Sounds like you aren't going to find a one-script-fits-all solution.

You might have to process and fix in batches... like finding all those records that begin with a "B", then doing those, and finding out how the rest differ.

JoshuaBixby · ‎03-10-2020

The one constant appears to be an 8-digit number. Assuming that is the case, the following regex should work to extract the 8-digit number plus 2 afterwards:

>>> l = [
... "New SFR w/Attached Garage, covered parch and patio B24501000 0",
... "B34567000 0 30 X 48 Pole Barn",
... "NEW 22,778 SQ FT B33561000 0 Addition to existing hop building",
... "Residential B14356000 0 Storage",
... "Single Family Res B34561000A0",
... "New SFR w/Attached Garage, covered parch and patio B24501000 0 "
... ]
>>>
>>> for i in l:
...   re.findall(r"\d{8}[ \w]\d", i)
...
['24501000 0']
['34567000 0']
['33561000 0']
['14356000 0']
['34561000A0']
['24501000 0']
>>> ‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

CCWeedcontrol · ‎03-12-2020

Yes, this got me the closest to my goal. I was trying to understand how re.findall patterns work. The [ ] is used to match certain characters, and the reason you didn't use the [ ] was because you were trying to capture the 8 digits and the last two regardless of the alpha "B". After looking at the data i noticed that some didn't have the "B" but had 8-10 digits. I appreciate the help.

CCWeedcontrol · ‎03-12-2020

Thank you for explaining "Matching characters". Thank you for the help.