Asking for some help on a task I'm dealing with.
I have a field (PLAN) with a string that I'd like to split into two separate text fields (PLAN_PFX & PLAN_NUM). Each entry is predictable in the sense that it will be 1-5 alphabetic characters followed by 1-6 numeric characters. Alphabetic characters are always upper-case.
So what I need to do is something like...
PLAN | PLAN_PFX | PLAN_NUM |
---|---|---|
A305 | A | 305 |
BTR86542 | BTR | 86542 |
RP808112 | RP | 808112 |
WR8 | WR | 8 |
I know I'm going to need to use Regular Expression for it. I've tried a few methods just focusing on calculating for PLAN_NUM initially but I'm just fumbling around in the dark really.
PLAN_NUM = regex(!PLAN!)
import re
def regex(text):
split_list = re.split(r'[A-Z]', text)
return split_list[-1]
But this just returns a blank entry.
Any help appreciated. Thanks in advance.
OK, never mind, the above code does work. Just have to change it to get the other part now!
Alternatively, if anyone has a more elegant way to do this, I'm all ears.
You could also use this regex expression which might save you some trouble if you end up with some dodgy data thrown in. It should also simplify your output as well as [0] will be text and [1] will be numbers:
import re
def regex(text):
split_list = re.split(r'^[A-Z]+', text)
return split_list[-1]
re.split won't give the results you expect:
>>> l = ["A305", "BTR86542", "RP808112", "WR8"]
>>> for s in l:
... re.split(r'^[A-Z]+', s)
...
['', '305']
['', '86542']
['', '808112']
['', '8']
>>>
Although the numbers are present in the output, the text portion will not be present in [0].
I don't think re.split is the right method for this job, re.match will work better:
>>> l = ["A305", "BTR86542", "RP808112", "WR8"]
>>> for s in l:
... re.match(r'([A-Z]+)([^A-Z]+)', s).groups()
...
('A', '305')
('BTR', '86542')
('RP', '808112')
('WR', '8')
>>>
Even though the above code snippet will work, it still leaves the door open to issues with case sensitivity and fields that only have a number.
I prefer to use character classes:
>>> l = ["A305", "BTR86542", "RP808112", "WR8", "Wr8", "305"]
>>> for s in l:
... re.match(r'(\D*)(\d*)', s).groups()
...
('A', '305')
('BTR', '86542')
('RP', '808112')
('WR', '8')
('Wr', '8')
('', '305')
>>>