I need help with extract a particulate value from the field "PojectIinfo" field. I need to only extract the value/characters of the letter B and the 10 digits after the B an place it in the field "BP".
There might be a space in that value as well, for example B24501000 space 0. The total characters would be 11; B plus 10 digits. I am thinking i need to use re.compile but i am uncertain on the code.
Current field looks like this.
New SFR w/Attached Garage, covered parch and patio B24501000 0 >>>Extract B24501000 0 to BP field
B34567000 0 30 X 48 Pole Barn >>>Extract B34567000 0 to BP field
NEW 22,778 SQ FT B33561000 0 Addition to existing hop building >>>Extract B33561000 0 to BP field
Residential B14356000 0 Storage>>>Extract B14356000 0 to BP field
Solved! Go to Solution.
I still got the same error on the same line. I checked the field and the lowest count of words in the filed is 2 and a max of 61.
I used len(!PojectIinfo!
.split(" ")) to get the count of words in that field.
Perhaps the following.
import re
recs = [
"New SFR w/Attached Garage, covered parch and patio B24501000 0", # B24501000 0
"B34567000 0 30 X 48 Pole Barn", # B34567000 0
"NEW 22,778 SQ FT B33561000 0 Addition to existing hop building", # B33561000 0
"Residential B143560000 Storage" # B14356000
]
pattern = re.compile(r"([\d ?]{8,10})") # B followed by 8-10 digits, may include space
for rec in recs:
print(pattern.findall(rec)[0].strip()) # strip trims a trailing space
''' Results:
B24501000 0
B34567000 0
B33561000 0
B143560000
'''
After looking at the field a little closer there are Alphas characters in some records instead of a space. I need to get those as well. I would also like to add a 0 (zero) to the spaces if there is no Alph character if possible.
For example:
Single Family Res B34561000A0 >>> Extract B34561000A0 to BP field
New SFR w/Attached Garage, covered parch and patio B24501000 0 >>>Extract B2450100000 to BP field
Any other corner cases? Nulls? Permits that will kill a simple case( B1 BBQ and Bar B245010000 0
Updated expression used in my previous code:
import re
recs = [
"New SFR w/Attached Garage, covered parch and patio B24501000 0", # B24501000 0
"B34567000 0 30 X 48 Pole Barn", # B34567000 0
"NEW 22,778 SQ FT B33561000 0 Addition to existing hop building", # B33561000 0
"Residential B143560000 Storage", # B14356000
"Single Family Res B34561000A0", # B34561000A0
"New SFR w/Attached Garage, covered parch and patio B24501000 0" # B2450100000
]
pattern = re.compile(r"([\d]+[A-Z][\d]*|[\d]+[\s][\d]*)"
for rec in recs:
print(pattern.findall(rec)[0].strip())
'''
Results:
B24501000 0
B34567000 0
B33561000 0
B143560000
B34561000A0
B24501000 0
'''
Then replace the space in the results with a zero or whatever.
To experiment with regular expressions, see: Regular Expressions 101. This is the explanation given by the website of the expression used in line 12.
"[\d]+[A-Z][\d]*|[\d]+[\s][\d]*"
gm
1st Alternative [\d]+[A-Z][\d]*
Match a single character present in the list below
B matches the character B literally (case sensitive)
Match a single character present in the list below [\d]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
\d matches a digit (equal to [0-9])
Match a single character present in the list below [A-Z]
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)
Match a single character present in the list below [\d]*
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\d matches a digit (equal to [0-9])
2nd Alternative [\d]+[\s][\d]*
Match a single character present in the list below
B matches the character B literally (case sensitive)
Match a single character present in the list below [\d]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
\d matches a digit (equal to [0-9])
Match a single character present in the list below [\s]
\s matches any whitespace character (equal to [\r\n\t\f\v ])
Match a single character present in the list below [\d]*
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\d matches a digit (equal to [0-9])
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
EDIT:
The following simpler expression also works for the examples in the code above:
pattern = re.compile(r"([\d][\d A-Z]{7,9})")
# Letter B followed by a number, followed by 7 to 9 numbers, spaces, and/or capital letters
Getting closer, the other issue that i found was a lot records didn't have "B", people for got to put the "B" in front of the building permit. Some records have the following.
34567000 0
12345000 0
45125011A0
Sounds like you aren't going to find a one-script-fits-all solution.
You might have to process and fix in batches... like finding all those records that begin with a "B", then doing those, and finding out how the rest differ.
The one constant appears to be an 8-digit number. Assuming that is the case, the following regex should work to extract the 8-digit number plus 2 afterwards:
>>> l = [
... "New SFR w/Attached Garage, covered parch and patio B24501000 0",
... "B34567000 0 30 X 48 Pole Barn",
... "NEW 22,778 SQ FT B33561000 0 Addition to existing hop building",
... "Residential B14356000 0 Storage",
... "Single Family Res B34561000A0",
... "New SFR w/Attached Garage, covered parch and patio B24501000 0 "
... ]
>>>
>>> for i in l:
... re.findall(r"\d{8}[ \w]\d", i)
...
['24501000 0']
['34567000 0']
['33561000 0']
['14356000 0']
['34561000A0']
['24501000 0']
>>>
Yes, this got me the closest to my goal. I was trying to understand how re.findall patterns work. The [ ] is used to match certain characters, and the reason you didn't use the [ ] was because you were trying to capture the 8 digits and the last two regardless of the alpha "B". After looking at the data i noticed that some didn't have the "B" but had 8-10 digits. I appreciate the help.
Thank you for explaining "Matching characters". Thank you for the help.