Dealing with duplicates

DanPatterson · 02-14-2023

Always fun, duplicates in attributes and/or geometry. This missive is a quick demo of some of the things you can do with duplicates.

Begin with an array of "coordinates". These will be kept simple, as shown below. I will even go through the process of making your own dataset for experimentation.

import numpy as np

# -- make your base array of values
x = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])

# -- using numpy magic, repeat the sequences of inputs to suit
arr = np.repeat(x, [3, 4, 2, 5], axis=0)

# -- yielding your positional data with multiple repeats
arr
Out[60]: 
array([[  0.00,   0.00],
       [  0.00,   0.00],
       [  0.00,   0.00],
       [  1.00,   1.00],
       [  1.00,   1.00],
       [  1.00,   1.00],
       [  1.00,   1.00],
       [  2.00,   2.00],
       [  2.00,   2.00],
       [  3.00,   3.00],
       [  3.00,   3.00],
       [  3.00,   3.00],
       [  3.00,   3.00],
       [  3.00,   3.00]])

Easy so far, stick with it.

For more numpy magic, what is going to happen is:

line 1 : Get some base array information, the shape and the data type, which are
- arr.shape, arr.dtype # ((14, 2), dtype('float64')) 14 coordinates, floating point type
lines 2-3 : Prepare the array so it can be viewed as rows of single entities rather than pairs of values
- That is homework reading: views into arrays rather than copies of arrays.
lines 5-7 : Produce a mask array that flags every row that differs from the row before it. The first row has nothing before it, so mask[0] is set to True.

shp_in, dt_in = arr.shape, arr.dtype
dt = [('f{i}'.format(i=i), dt_in) for i in range(arr.shape[1])]
tmp = arr.view(dt).squeeze()  # -- view data and reshape to (N,)
# -- mask and check for sequential equality.
mask = np.empty((shp_in[0],), np.bool_)
mask[0] = True
mask[1:] = tmp[:-1] != tmp[1:]
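For the homework on views, here is a minimal sketch that rebuilds the demo array and checks that the structured view really is a view and not a copy. It assumes the array is C-contiguous (true for the one made above); a sliced or transposed array may need np.ascontiguousarray first.

```python
import numpy as np

arr = np.repeat(np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]]),
                [3, 4, 2, 5], axis=0)

# -- one field per column, each with the dtype of the input
dt = [(f"f{i}", arr.dtype) for i in range(arr.shape[1])]
tmp = arr.view(dt).squeeze()

print(tmp.shape)                    # (14,)  one record per row
print(np.shares_memory(arr, tmp))   # True   a view, not a copy

# -- records compare as whole rows, so a change in either column is flagged
mask = np.empty(arr.shape[0], dtype=np.bool_)
mask[0] = True
mask[1:] = tmp[:-1] != tmp[1:]
print(mask.astype(int))             # [1 0 0 1 0 0 0 1 0 1 0 0 0 0]
```

The 1s mark the first row of each run, which is all the mask needs to do.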

From the above, the original array can be examined using the mask.

lines 1-4 : Where are the start locations of the sequence groupings?
lines 6-12 : What are the group values?
lines 14 on : Split the array into subgroups.

# -- wh_ere are the breaks?
wh_ = np.nonzero(mask)[0]
wh_
array([0, 3, 7, 9], dtype=int64)

# -- what are their values?
tmp = arr[mask]
tmp 
array([[  0.00,   0.00],
       [  1.00,   1.00],
       [  2.00,   2.00],
       [  3.00,   3.00]])

# -- split the original array into subarrays of its values
sub_arrays = np.array_split(arr, wh_[wh_ > 0])
sub_arrays
[array([[  0.00,   0.00],
        [  0.00,   0.00],
        [  0.00,   0.00]]),
 array([[  1.00,   1.00],
        [  1.00,   1.00],
        [  1.00,   1.00],
        [  1.00,   1.00]]),
 array([[  2.00,   2.00],
        [  2.00,   2.00]]),
 array([[  3.00,   3.00],
        [  3.00,   3.00],
        [  3.00,   3.00],
        [  3.00,   3.00],
        [  3.00,   3.00]])]
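If you only need to know how long each run is, you can skip the split. The sketch below is one way to do it, not the only way: it swaps the structured view for a plain row-by-row comparison with np.any, and the names starts and counts are mine.

```python
import numpy as np

arr = np.repeat(np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]]),
                [3, 4, 2, 5], axis=0)

# -- flag rows that differ from the row before them (first row always flagged)
mask = np.ones(arr.shape[0], dtype=np.bool_)
mask[1:] = np.any(arr[1:] != arr[:-1], axis=1)

starts = np.nonzero(mask)[0]                        # where each run begins
counts = np.diff(np.append(starts, arr.shape[0]))   # how long each run is

print(starts)   # [0 3 7 9]
print(counts)   # [3 4 2 5]
```

The counts match the repeats that were passed to np.repeat when the array was built.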

But what if the data were all jumbled up, like this?

arr 
array([[  1.00,   1.00],
       [  2.00,   2.00],
       [  0.00,   0.00],
       [  3.00,   3.00],
       [  3.00,   3.00],
       [  0.00,   0.00],
       [  2.00,   2.00],
       [  3.00,   3.00],
       [  3.00,   3.00],
       [  1.00,   1.00],
       [  0.00,   0.00],
       [  1.00,   1.00],
       [  3.00,   3.00],
       [  1.00,   1.00]])
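# -- sort it by one of several methods. Here, argsort each column, then
#    reorder the rows using the sort order of the first column only
#    (fine for this data, since x and y match in every row)
arr[np.argsort(arr, axis=0)[:, 0]]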
array([[  0.00,   0.00],
       [  0.00,   0.00],
       [  0.00,   0.00],
       [  1.00,   1.00],
       [  1.00,   1.00],
       [  1.00,   1.00],
       [  1.00,   1.00],
       [  2.00,   2.00],
       [  2.00,   2.00],
       [  3.00,   3.00],
       [  3.00,   3.00],
       [  3.00,   3.00],
       [  3.00,   3.00],
       [  3.00,   3.00]])
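Sorting on the first column alone will not group rows that share an x but differ in y. Two of the "several methods" that handle that case are sketched here: np.lexsort to sort on both columns, and np.unique with axis=0 to get the distinct rows and their counts in one call. The variable names are mine.

```python
import numpy as np

arr = np.array([[1., 1.], [2., 2.], [0., 0.], [3., 3.], [3., 3.],
                [0., 0.], [2., 2.], [3., 3.], [3., 3.], [1., 1.],
                [0., 0.], [1., 1.], [3., 3.], [1., 1.]])

# -- lexsort treats the LAST key as the primary one: sort by x, then by y
order = np.lexsort((arr[:, 1], arr[:, 0]))
srted = arr[order]

# -- distinct rows (returned sorted) and how often each one occurs
uniq, counts = np.unique(arr, axis=0, return_counts=True)
print(counts)   # [3 4 2 5]
```

Once the rows are sorted, the sequential mask from earlier works on srted exactly as it did on the tidy array.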

So identifying duplicates or sequences can be a challenge, but there is a lot that you can do with them.

I should mention that the whole principle above isn't just for coordinates; it works for any form of tabular attribute.
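As a rough illustration with attributes, the sketch below runs the same sort-then-mask idea on a small structured array. The table, its field names (Name, Code), and its values are made up for the example.

```python
import numpy as np

# -- hypothetical attribute table: field names and values are invented
dt = [("Name", "U5"), ("Code", "i4")]
tbl = np.array([("oak", 1), ("elm", 2), ("oak", 1), ("ash", 3),
                ("elm", 2), ("oak", 1)], dtype=dt)

# -- bring identical records together
srt = np.sort(tbl, order=["Name", "Code"])

# -- flag records that differ from the record before them
mask = np.empty(srt.shape[0], dtype=np.bool_)
mask[0] = True
mask[1:] = srt[:-1] != srt[1:]

starts = np.nonzero(mask)[0]
counts = np.diff(np.append(starts, srt.shape[0]))

for rec, n in zip(srt[mask], counts):
    print(rec, n)   # each distinct record and how many times it occurs
```

Any count greater than 1 is a duplicated record.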

Have fun. Other homework readings:

Find Identical (Data Management)—ArcGIS Pro | Documentation

Delete Identical (Data Management)—ArcGIS Pro | Documentation