Most efficient way to concatenate strings in millions of records ignoring nulls and spaces

206
4
a week ago
asmith_tssw
New Contributor

Hi everyone,

I am trying to concatenate address strings across a dataset that has 3.5 million records. I want to concatenate the values of eight fields into a single string, but many of the fields have null values and many have spaces instead of nulls or extra spaces, etc.

What is the most efficient way in Python or Arcade to join strings ignoring nulls or spaces and just having a single space between the joined values? I usually use a function like " ".join((i for i in (str1,str2,str3) if i)) to do this, but across millions of records I'm hoping to avoid the bottleneck of using an if statement. Also, I'm hoping to avoid having to do trim calculations on each field first. Is there a more efficient way to do this?

Thanks!

0 Kudos
4 Replies
RPGIS
by
Occasional Contributor III

Hi @asmith_tssw.,

Depending on the method, either python or arcade, you can easily achieve this. You can do something like this:

Python:

Code Expression:
!Concatenation( FieldA , FieldB , FieldC, etc... )!

Code Block:
def Concatenation( FieldA , FieldB , FieldC, etc... ):
    Values = [FieldA , FieldB , FieldC, etc...]
    ConcatNonNullValues = ''.join([ x for x in Values if x is not None ])
    return ConcatNonNullValues

 

Arcade:

var ValueList = [ $feature.FieldA , $feature.FieldB, $feature.FieldC, etc... ]
var ConcatValue = ''
for ( var index in ValueList ){
    ListValue = ValueList[ index ]
    if ( IsEmpty( ListValue ) == False ){ ConcatValue + ListValue }
    }
return ConcatValue

You can use several arcade functions to make the code above easier, but it depends on how you want to go about it.

Feel free to play around with the code above. It should work for what you need or at least point you in the right direction.

RhettZufelt
MVP Frequent Contributor

I use this one quite often:

 

 

strval = ' '.join(filter(None, [val1, val2, val3,val4]))

 

RhettZufelt_1-1713989829838.png

The space in the single quotes is the delimiter (single space).  If any of the val variables have a value, it will be concatenated, but, if Null, will just be ignored without error.

R_

JoshuaBixby
MVP Esteemed Contributor

A Python if statement will be far from the slowest part of this process.

0 Kudos
DanPatterson
MVP Esteemed Contributor

 

v = ['a', 'c', 'b', 'd', None]

v*2  # --- just to show you can expand lists
['a', 'c', 'b', 'd', None, 'a', 'c', 'b', 'd', None]

v0 = v*1000000  # lets make one with 5 million values

# ---- As Joshua says, it isn't slow, and there is an `if` statement in the # 
       list comprehension

%timeit ' '.join([x for x in v0 if x is not None ])
217 ms ± 31.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# ---- What about filter using the first 4 values agains `None`

' '.join(filter(None, v[:4]))
'a c b d'

# ---- no slouch either, and largely insignificant

%timeit ' '.join(filter(None, v0))
170 ms ± 8.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

 

slow coding night 😉

 


... sort of retired...