Extracting number w/ specialcharacter from random position in string field

960
7
Jump to solution
01-22-2021 03:43 AM
SimonLange
New Contributor II

Hi all.

 

I am struggling with extracting a number in combination with a percentage sign (e.g 40 %, 60 %; 30 %, 30 % and 40 % and so on)  from a field. The issue is that there is no regular position of said number - see example below - and that there are numbers which I dont want to extract.

 

Any ideas on how to extract the number-character combination? Thanks for any suggestions

 

Example fields:

V 40 %, B112 60 %

V 30 %, B11 30 %, K 40 %

S 132 90 %, R113-GB00BK 10 %

0 Kudos
1 Solution

Accepted Solutions
jcarlson
MVP Esteemed Contributor

If the data are consistently formatted, you might consider looking at regular expressions to accomplish this.

Here's my code:

import re

values = [
    'V 40 %, B112 60 %',
    'V 30 %, B11 30 %, K 40 %',
    'S 132 90 %, R113-GB00BK 10 %'
]

# Regex patterns
other_patt = re.compile('\S+?(?=\s[0-9]+\s%)')
percent_patt = re.compile('[0-9]+(?=\s%)')

for value in values:
    print([re.findall(other_patt, value), re.findall(percent_patt, value)])

And here's what it returns:

[['V', 'B112'], ['40', '60']]
[['V', 'B11', 'K'], ['30', '30', '40']]
[['132', 'R113-GB00BK'], ['90', '10']]

 

Breaking down the regex patterns:

  • \S+?(?=\s[0-9]+\s%)
    • \S+?
      • \S+ matches one or more non-whitespace characters.
      • The '?' makes it non-greedy, meaning it will match as little as possible, so that we don't inadvertently grab more than one value
    • (?=...) Indicates a lookahead expression, meaning it specifically looks for strings which are followed by the value in the '...', but does not include that value in the returned match
    • \s looks for a whitespace character
    • [0-9]+ Looks for one or more consecutive numeric characters.
    • % Just a literal '%' character!
  • [0-9]+(?=\s%)
    • Again, [0-9]+ is looking for one or more consecutive numeric characters.
    • (?=\s$) A simpler lookahead expression, this time looking only for those numeric characters followed by a single whitespace character and a percent sign

Regex is quite useful for extracting text, and is a module well worth digging into.

- Josh Carlson
Kendall County GIS

View solution in original post

7 Replies
DavidPike
MVP Frequent Contributor

Which number do you want to extract in those examples?  Can you give examples along with the intended result?  There may be no regular position, but is there a regular format - do you want the numbers preceding the % sign?

0 Kudos
SimonLange
New Contributor II

Hi David - thanks for your response. I specified my question to which numbers I want to extract!

The codes are always separated by a comma - the codes themselfes vary though

Intended result would be a column with

V, B112

V, B11, K

S132, R113-GB00BK

 

and one with

40, 60

30, 30, 40

90, 10

0 Kudos
DanPatterson
MVP Esteemed Contributor

", ".join([i.strip().split(" ")[-1] for i in s0.split("%") if i])
'30, 30, 40'

like so?  See my post


... sort of retired...
0 Kudos
DanPatterson
MVP Esteemed Contributor

This will get you thinkin about what you want.

Currently it returns a list so you have to decide whether you want a particular value from the list (eg first, last etc) or whether you want to concatenate the values together (eg ", ".join([i for i in ... )

s = "S 132 90 %, R113-GB00BK 10 %"
[i.strip().split(" ")[-1] for i in s.split("%") if i]
['90', '10']

s0 = "V 30 %, B11 30 %, K 40 %"
[i.strip().split(" ")[-1] for i in s0.split("%") if i]
['30', '30', '40']

s1 = "V 40 %, B112 60 %"
[i.strip().split(" ")[-1] for i in s1.split("%") if i]
['40', '60']

 


... sort of retired...
jcarlson
MVP Esteemed Contributor

If the data are consistently formatted, you might consider looking at regular expressions to accomplish this.

Here's my code:

import re

values = [
    'V 40 %, B112 60 %',
    'V 30 %, B11 30 %, K 40 %',
    'S 132 90 %, R113-GB00BK 10 %'
]

# Regex patterns
other_patt = re.compile('\S+?(?=\s[0-9]+\s%)')
percent_patt = re.compile('[0-9]+(?=\s%)')

for value in values:
    print([re.findall(other_patt, value), re.findall(percent_patt, value)])

And here's what it returns:

[['V', 'B112'], ['40', '60']]
[['V', 'B11', 'K'], ['30', '30', '40']]
[['132', 'R113-GB00BK'], ['90', '10']]

 

Breaking down the regex patterns:

  • \S+?(?=\s[0-9]+\s%)
    • \S+?
      • \S+ matches one or more non-whitespace characters.
      • The '?' makes it non-greedy, meaning it will match as little as possible, so that we don't inadvertently grab more than one value
    • (?=...) Indicates a lookahead expression, meaning it specifically looks for strings which are followed by the value in the '...', but does not include that value in the returned match
    • \s looks for a whitespace character
    • [0-9]+ Looks for one or more consecutive numeric characters.
    • % Just a literal '%' character!
  • [0-9]+(?=\s%)
    • Again, [0-9]+ is looking for one or more consecutive numeric characters.
    • (?=\s$) A simpler lookahead expression, this time looking only for those numeric characters followed by a single whitespace character and a percent sign

Regex is quite useful for extracting text, and is a module well worth digging into.

- Josh Carlson
Kendall County GIS
SimonLange
New Contributor II

Thanks very much - that helped a lot!

JoshuaBixby
MVP Esteemed Contributor

The pattern does seem structured and simple enough that you might not need regular expressions, but I like regular expressions like @jcarlson.

>>> import re
>>>
>>> rows = [
...     r"V 40 %, B112 60 %",
...     r"V 30 %, B11 30 %, K 40 %",
...     r"S 132 90 %, R113-GB00BK 10 %"
... ]
>>>
>>> pct_pttn = re.compile("(\d*) %")
>>>
>>> # populuate column 1
>>> for s in rows:
...     pct_pttn.sub("", s).strip()
...
'V , B112'
'V , B11 , K'
'S 132 , R113-GB00BK'
>>> 
>>> # populate column 2
>>> for s in rows:
...     ", ".join(pct_pttn.findall(s))
...
'40, 60'
'30, 30, 40'
'90, 10'
>>>