# Dealing with duplicates

02-14-2023 07:30 PM

Always fun, duplicates in attributes and/or geometry.  This missive is a quick demo of some of the things you can do with duplicates.

Begin with an array of "coordinates".  These will be simple as shown.  I will even go through the process of making your own dataset for experimentation.

``````
import numpy as np

# -- make your base array of values
x = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])

# -- using numpy magic, repeat the sequences of inputs to suit
arr = np.repeat(x, [3, 4, 2, 5], axis=0)

# -- yielding your positional data with multiple repeats
arr
array([[  0.00,   0.00],
       [  0.00,   0.00],
       [  0.00,   0.00],
       [  1.00,   1.00],
       [  1.00,   1.00],
       [  1.00,   1.00],
       [  1.00,   1.00],
       [  2.00,   2.00],
       [  2.00,   2.00],
       [  3.00,   3.00],
       [  3.00,   3.00],
       [  3.00,   3.00],
       [  3.00,   3.00],
       [  3.00,   3.00]])
``````

Easy so far, stick with it.

For more numpy magic, what is going to happen is:

• line 1 : Determine some base array information, namely the shape and data type:
  • arr.shape, arr.dtype  #  ((14, 2), dtype('float64'))   14 coordinates, floating point type
• lines 2-3 : Prepare the array so it can be viewed as rows of single entities rather than pairs of values.
  • That is homework reading: views into arrays rather than copies of arrays.
• lines 5-7 : Produce a mask array that flags every row that differs from the row before it.  The first row has no predecessor, so mask[0] is set to True.

``````
shp_in, dt_in = arr.shape, arr.dtype
dt = [('f{i}'.format(i=i), dt_in) for i in range(arr.shape[1])]
tmp = arr.view(dt).squeeze()  # -- view data and reshape to (N,)
# -- mask and check for sequential equality.
mask = np.empty((shp_in[0],), dtype=bool)
mask[0] = True
mask[1:] = tmp[:-1] != tmp[1:]
``````

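If the "view" step feels like a leap, here is a minimal sketch that isolates it. It rebuilds the same 14 x 2 array and assumes nothing beyond NumPy. The field names `f0` and `f1` are just placeholders. The point is that the structured view shares memory with the original array and lets whole rows be compared as single records.

```python
import numpy as np

# Same 14 x 2 coordinate array used in the post.
x = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
arr = np.repeat(x, [3, 4, 2, 5], axis=0)

# One field per column, so each row can be treated as a single record.
dt = [('f{}'.format(i), arr.dtype) for i in range(arr.shape[1])]
tmp = arr.view(dt).squeeze()

print(tmp.shape)                    # (14,)
print(np.shares_memory(arr, tmp))   # True: a view, not a copy

# Records compare as whole rows; count the places where a row
# differs from the one before it.
changes = tmp[:-1] != tmp[1:]
print(int(changes.sum()))           # 3
```

Note that a view like this needs a C-contiguous array, which `np.repeat` returns.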
From the above, the view of the original array can be examined using its mask.

• lines 1-4 : where are the start locations of the sequence groupings?
• lines 6-12 : what are the group values?
• lines 14 onward : Split the array into subgroups.

``````
# -- wh_ere are the breaks?
wh_ = np.nonzero(mask)[0]
wh_
array([0, 3, 7, 9], dtype=int64)

# -- what are their values?
tmp = arr[wh_]
tmp
array([[  0.00,   0.00],
       [  1.00,   1.00],
       [  2.00,   2.00],
       [  3.00,   3.00]])

# -- split the original array into subarrays of its values
sub_arrays = np.array_split(arr, wh_[wh_ > 0])
sub_arrays
[array([[  0.00,   0.00],
        [  0.00,   0.00],
        [  0.00,   0.00]]),
 array([[  1.00,   1.00],
        [  1.00,   1.00],
        [  1.00,   1.00],
        [  1.00,   1.00]]),
 array([[  2.00,   2.00],
        [  2.00,   2.00]]),
 array([[  3.00,   3.00],
        [  3.00,   3.00],
        [  3.00,   3.00],
        [  3.00,   3.00],
        [  3.00,   3.00]])]
``````
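For experimentation, the steps so far can be rolled into one function. This is only a sketch: the name `split_sequential` is made up for illustration, it assumes a 2D array with at least one row, and it tests for exact equality, so nearly identical floating point coordinates would need rounding first.

```python
import numpy as np


def split_sequential(arr):
    """Return run starts, run values and the runs of identical rows.

    Hypothetical helper; assumes a 2D array with at least one row.
    """
    arr = np.ascontiguousarray(arr)          # views need contiguous memory
    dt = [('f{}'.format(i), arr.dtype) for i in range(arr.shape[1])]
    tmp = arr.view(dt).squeeze(axis=1)       # (N, 1) records -> (N,)
    mask = np.empty(arr.shape[0], dtype=bool)
    mask[0] = True                           # first row always starts a run
    mask[1:] = tmp[:-1] != tmp[1:]           # True where a row differs
    starts = np.nonzero(mask)[0]
    return starts, arr[starts], np.split(arr, starts[1:])


x = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
arr = np.repeat(x, [3, 4, 2, 5], axis=0)

starts, values, groups = split_sequential(arr)
print(starts.tolist())            # [0, 3, 7, 9]
print([len(g) for g in groups])   # [3, 4, 2, 5]
```

Only sequential repeats are grouped, so the input has to be sorted first if identical rows can be scattered.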

But what if the data were all jumbled up, like this?

``````
arr
array([[  1.00,   1.00],
       [  2.00,   2.00],
       [  0.00,   0.00],
       [  3.00,   3.00],
       [  3.00,   3.00],
       [  0.00,   0.00],
       [  2.00,   2.00],
       [  3.00,   3.00],
       [  3.00,   3.00],
       [  1.00,   1.00],
       [  0.00,   0.00],
       [  1.00,   1.00],
       [  3.00,   3.00],
       [  1.00,   1.00]])
# -- sort it by one of several methods; here the rows are ordered by their first column
arr[np.argsort(arr, axis=0)[:, 0]]
array([[  0.00,   0.00],
       [  0.00,   0.00],
       [  0.00,   0.00],
       [  1.00,   1.00],
       [  1.00,   1.00],
       [  1.00,   1.00],
       [  1.00,   1.00],
       [  2.00,   2.00],
       [  2.00,   2.00],
       [  3.00,   3.00],
       [  3.00,   3.00],
       [  3.00,   3.00],
       [  3.00,   3.00],
       [  3.00,   3.00]])
``````
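Ordering by the first column is enough here because x and y always match. When rows can share an x value but differ in y, two other standard NumPy options are worth a look. The sketch below is one possible approach rather than the only one: `np.lexsort` orders by x and breaks ties on y, and `np.unique` with `axis=0` returns the distinct rows, their first positions and their counts in one call.

```python
import numpy as np

# The jumbled array from the post.
arr = np.array([[1., 1.], [2., 2.], [0., 0.], [3., 3.], [3., 3.],
                [0., 0.], [2., 2.], [3., 3.], [3., 3.], [1., 1.],
                [0., 0.], [1., 1.], [3., 3.], [1., 1.]])

# np.lexsort treats the LAST key as the primary one:
# sort by column 0 (x), then break ties with column 1 (y).
order = np.lexsort((arr[:, 1], arr[:, 0]))
srted = arr[order]
print(srted[:4].tolist())   # [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]

# Distinct rows, index of each first occurrence, and how often each appears.
uniq, first_idx, counts = np.unique(
    arr, axis=0, return_index=True, return_counts=True)
print(uniq.tolist())        # [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
print(first_idx.tolist())   # [2, 0, 1, 3]
print(counts.tolist())      # [3, 4, 2, 5]
```

Once the rows are sorted, the mask-and-split steps from earlier apply unchanged.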

So identifying duplicates or sequences can be a challenge, but there is a lot that you can do with them.

I should mention that the whole principle above isn't just for coordinates; it works for any form of tabular attribute.
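As a rough illustration of the attribute case, here is the same mask idea applied to a small structured array. The table, its field names (`species`, `zone`) and its values are invented for the example. The records are sorted first so that identical ones sit next to each other.

```python
import numpy as np

# Hypothetical attribute table: field names and values are made up.
tbl = np.array([('oak', 1), ('elm', 2), ('oak', 1), ('ash', 3),
                ('elm', 2), ('oak', 1)],
               dtype=[('species', 'U5'), ('zone', 'i4')])

# Sort on all fields so duplicate records become sequential.
srt = np.sort(tbl, order=['species', 'zone'])

# Same mask as for the coordinates: flag records that differ
# from the record before them.
mask = np.empty(srt.shape[0], dtype=bool)
mask[0] = True
mask[1:] = srt[:-1] != srt[1:]

starts = np.nonzero(mask)[0]
counts = np.diff(np.append(starts, srt.shape[0]))

print(starts.tolist())                  # [0, 1, 3]
print(counts.tolist())                  # [1, 2, 3]
print(srt[starts]['species'].tolist())  # ['ash', 'elm', 'oak']
```

No view is needed here because a structured array already stores each row as a single record.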

Find Identical (Data Management)—ArcGIS Pro | Documentation

Delete Identical (Data Management)—ArcGIS Pro | Documentation
