Extracting a number with a special character from a random position in a string field

SimonLange · ‎01-22-2021

Hi all.

I am struggling to extract a number combined with a percent sign (e.g. 40 %, 60 %; 30 %, 30 % and 40 %; and so on) from a field. The problem is that the number has no fixed position (see the examples below), and the field also contains numbers that I don't want to extract.

Any ideas on how to extract the number-character combination? Thanks for any suggestions.

Example fields:

V 40 %, B112 60 %

V 30 %, B11 30 %, K 40 %

S 132 90 %, R113-GB00BK 10 %

jcarlson · ‎01-22-2021

If the data are consistently formatted, you might consider looking at regular expressions to accomplish this.

Here's my code:

import re

values = [
    'V 40 %, B112 60 %',
    'V 30 %, B11 30 %, K 40 %',
    'S 132 90 %, R113-GB00BK 10 %'
]

# Regex patterns
# Raw strings keep Python from treating \S and \s as string escapes
other_patt = re.compile(r'\S+?(?=\s[0-9]+\s%)')
percent_patt = re.compile(r'[0-9]+(?=\s%)')

for value in values:
    print([re.findall(other_patt, value), re.findall(percent_patt, value)])

And here's what it returns:

[['V', 'B112'], ['40', '60']]
[['V', 'B11', 'K'], ['30', '30', '40']]
[['132', 'R113-GB00BK'], ['90', '10']]

Breaking down the regex patterns:

\S+?(?=\s[0-9]+\s%)
- \S+?
  - \S+ matches one or more non-whitespace characters.
  - The '?' makes it non-greedy, meaning it will match as little as possible, so that we don't inadvertently grab more than one value
- (?=...) indicates a lookahead expression: it matches only strings that are followed by the pattern in the '...', but does not include that pattern in the returned match. Inside the lookahead:
  - \s looks for a single whitespace character (it appears twice: once before the digits and once before the '%').
  - [0-9]+ looks for one or more consecutive numeric characters.
  - % is just a literal '%' character!
[0-9]+(?=\s%)
- Again, [0-9]+ is looking for one or more consecutive numeric characters.
- (?=\s%) A simpler lookahead expression, this time looking only for those numeric characters followed by a single whitespace character and a percent sign.
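
To make the effect of the lookahead concrete, here is a small sketch comparing the percent pattern with and without it. The variable names are only for illustration; the sample string is the third example field from the question.

```python
import re

s = 'S 132 90 %, R113-GB00BK 10 %'

# With the lookahead, ' %' must follow the digits but is not part of the match.
with_lookahead = re.findall(r'[0-9]+(?=\s%)', s)

# Without it, the whitespace and '%' are consumed and returned as well.
without_lookahead = re.findall(r'[0-9]+\s%', s)

print(with_lookahead)     # ['90', '10']
print(without_lookahead)  # ['90 %', '10 %']
```

Either way, '132', '113' and '00' are skipped, because none of them is directly followed by a whitespace character and a '%'.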

Regex is quite useful for extracting text, and Python's re module is well worth digging into.

- Josh Carlson
Kendall County GIS


DavidPike · ‎01-22-2021

Which number do you want to extract in those examples? Can you give examples along with the intended result? There may be no regular position, but is there a regular format - do you want the numbers preceding the % sign?

SimonLange · ‎01-22-2021

Hi David, thanks for your response. I have clarified in my question which numbers I want to extract!

The codes are always separated by a comma, though the codes themselves vary.

Intended result would be a column with

V, B112

V, B11, K

S132, R113-GB00BK

and one with

40, 60

30, 30, 40

90, 10

DanPatterson · ‎01-22-2021

s0 = "V 30 %, B11 30 %, K 40 %"
", ".join([i.strip().split(" ")[-1] for i in s0.split("%") if i])
'30, 30, 40'

Like so? See my other post in this thread.

... sort of retired...

DanPatterson · ‎01-22-2021

This will get you thinking about what you want.

Currently it returns a list, so you have to decide whether you want a particular value from the list (e.g. first, last, etc.) or whether you want to concatenate the values together (e.g. ", ".join([i for i in ... )

s = "S 132 90 %, R113-GB00BK 10 %"
[i.strip().split(" ")[-1] for i in s.split("%") if i]
['90', '10']

s0 = "V 30 %, B11 30 %, K 40 %"
[i.strip().split(" ")[-1] for i in s0.split("%") if i]
['30', '30', '40']

s1 = "V 40 %, B112 60 %"
[i.strip().split(" ")[-1] for i in s1.split("%") if i]
['40', '60']
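
If both intended columns are needed (codes and percentages), one possible sketch is to split on the commas first and then pull each entry apart with a single regex. The function name split_codes_and_percents is hypothetical. The sketch assumes every entry ends with a number followed by '%', and that spaces inside a code should be dropped (so 'S 132' becomes 'S132', as in the intended result).

```python
import re

def split_codes_and_percents(value):
    """Return (codes, percents) as two comma-joined strings."""
    codes, percents = [], []
    for part in value.split(','):
        # Everything before the trailing '<number> %' is treated as the code.
        m = re.fullmatch(r'\s*(.+?)\s+([0-9]+)\s*%\s*', part)
        if m is None:
            continue  # assumption: silently skip entries without '<number> %'
        codes.append(m.group(1).replace(' ', ''))
        percents.append(m.group(2))
    return ', '.join(codes), ', '.join(percents)

for s in ['V 40 %, B112 60 %',
          'V 30 %, B11 30 %, K 40 %',
          'S 132 90 %, R113-GB00BK 10 %']:
    print(split_codes_and_percents(s))
```

For the three example fields this should give ('V, B112', '40, 60'), ('V, B11, K', '30, 30, 40') and ('S132, R113-GB00BK', '90, 10'). Whether malformed entries should be skipped or raise an error is a design choice that depends on your data.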