Using Regular Expression in Calculate Field

MattSmithCarto · ‎04-22-2020

Asking for some help on a task I'm dealing with.

I have a field (PLAN) with a string that I'd like to split into two separate text fields (PLAN_PFX & PLAN_NUM). Each entry is predictable in the sense that it will be 1-5 alphabetic characters followed by 1-6 numeric characters. Alphabetic characters are always upper-case.

So what I need to do is something like...

PLAN	PLAN_PFX	PLAN_NUM
A305	A	305
BTR86542	BTR	86542
RP808112	RP	808112
WR8	WR	8

I know I'm going to need to use Regular Expression for it. I've tried a few methods just focusing on calculating for PLAN_NUM initially but I'm just fumbling around in the dark really.

PLAN_NUM = regex(!PLAN!)

import re
def regex(text):
split_list = re.split(r'[A-Z]', text)
return split_list[-1]

But this just returns a blank entry.

Any help appreciated. Thanks in advance.

MattSmithCarto · ‎04-22-2020

OK, never mind, the above code does work. Just have to change it to get the other part now!

Alternatively, if anyone has a more elegant way to do this, I'm all ears.

BenTurrell · ‎04-22-2020

Matt Smith‌,

You could also use this regex expression which might save you some trouble if you end up with some dodgy data thrown in. It should also simplify your output as well as [0] will be text and [1] will be numbers:

import re

def regex(text):
split_list = re.split(r'^[A-Z]+', text)
return split_list[-1]

JoshuaBixby · ‎04-23-2020

re.split won't give the results you expect:

>>> l = ["A305", "BTR86542", "RP808112", "WR8"]
>>> for s in l:
...     re.split(r'^[A-Z]+', s)
...
['', '305']
['', '86542']
['', '808112']
['', '8']
>>> ‍‍‍‍‍‍‍‍‍

Although the numbers are present in the output, the text portion will not be present in [0].

I don't think re.split is the right method for this job, re.match will work better:

>>> l = ["A305", "BTR86542", "RP808112", "WR8"]
>>> for s in l:
...     re.match(r'([A-Z]+)([^A-Z]+)', s).groups()
...
('A', '305')
('BTR', '86542')
('RP', '808112')
('WR', '8')
>>> ‍‍‍‍‍‍‍‍‍

Even though the above code snippet will work, it still leaves the door open to issues with case sensitivity and fields that only have a number.

I prefer to use character classes:

>>> l = ["A305", "BTR86542", "RP808112", "WR8", "Wr8", "305"]
>>> for s in l:
...     re.match(r'(\D*)(\d*)', s).groups()
...
('A', '305')
('BTR', '86542')
('RP', '808112')
('WR', '8')
('Wr', '8')
('', '305')
>>> ‍‍‍‍‍‍‍‍‍‍‍