Dealing with duplicates

02-14-2023 07:30 PM
MVP Esteemed Contributor

Always fun, duplicates in attributes and/or geometry.  This missive is a quick demo of some of the things you can do with duplicates.

Begin with an array of "coordinates".  These will be simple as shown.  I will even go through the process of making your own dataset for experimentation.


import numpy as np

# -- make your base array of values
x = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])

# -- using numpy magic, repeat the sequences of inputs to suit
arr = np.repeat(x, [3, 4, 2, 5], axis=0)

# -- yielding your positional data with multiple repeats
array([[  0.00,   0.00],
       [  0.00,   0.00],
       [  0.00,   0.00],
       [  1.00,   1.00],
       [  1.00,   1.00],
       [  1.00,   1.00],
       [  1.00,   1.00],
       [  2.00,   2.00],
       [  2.00,   2.00],
       [  3.00,   3.00],
       [  3.00,   3.00],
       [  3.00,   3.00],
       [  3.00,   3.00],
       [  3.00,   3.00]])
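
A quick sanity check on the homemade dataset, which is my addition rather than part of the demo itself: `np.unique` with `axis=0` and `return_counts=True` should hand back the four distinct rows along with the repeat counts that were given to `np.repeat`. A minimal sketch:

```python
import numpy as np

# -- rebuild the demo array so this runs on its own
x = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
arr = np.repeat(x, [3, 4, 2, 5], axis=0)

# -- unique rows and how often each one occurs
uniq, counts = np.unique(arr, axis=0, return_counts=True)
print(arr.shape)  # (14, 2)
print(counts)     # [3 4 2 5]
```

Note that `np.unique` sorts its result, so it tells you what is duplicated but not where the runs sit in the original order. That is what the masking approach is for.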


Easy so far, stick with it.

For more numpy magic, what is going to happen is:

  • line 1 : Determine some base array information: the shape and data type, which here are
    • arr.shape, arr.dtype  #  ((14, 2), dtype('float64'))   14 coordinates, floating point type
  • lines 2-3 : Prepare the array so it can be viewed as rows of single entities rather than pairs of values
    • That is homework reading, views into arrays rather than copies of arrays.
  • lines 5-7 :  Produce a mask array where each value that differs from the one before it is flagged.  The first value (mask[0]) has no predecessor, so it is always set to True.


shp_in, dt_in = arr.shape, arr.dtype
dt = [('f{i}'.format(i=i), dt_in) for i in range(arr.shape[1])]
tmp = arr.view(dt).squeeze()  # -- view data and reshape to (N,)
# -- mask and check for sequential equality.
mask = np.empty((shp_in[0],), np.bool_)
mask[0] = True
mask[1:] = tmp[:-1] != tmp[1:]
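
For the homework on views, the claim can be checked directly. The sketch below rebuilds the demo array so it runs by itself, then shows that the structured view shares memory with `arr`, that each row has become a single record, and what the finished mask looks like. `np.shares_memory` is used here purely for illustration.

```python
import numpy as np

x = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
arr = np.repeat(x, [3, 4, 2, 5], axis=0)

# -- one field per column, so each row is viewed as a single record
dt = [(f'f{i}', arr.dtype) for i in range(arr.shape[1])]
tmp = arr.view(dt).squeeze()

print(tmp.shape)                   # (14,)
print(np.shares_memory(arr, tmp))  # True, a view rather than a copy

# -- flag each record that differs from the one before it
mask = np.empty((arr.shape[0],), np.bool_)
mask[0] = True
mask[1:] = tmp[:-1] != tmp[1:]
print(mask.astype(int))  # [1 0 0 1 0 0 0 1 0 1 0 0 0 0]
```

One caution: viewing with a wider dtype needs the last axis of the array to be contiguous in memory. For sliced or transposed input, passing it through `np.ascontiguousarray` first is a reasonable precaution.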



From the above, the view of the original array can be examined using its mask.

  • lines 1-3 : where are the start locations of the sequence groupings?
  • lines 5-10 : what are the group values?
  • line 12 onward : Split the array into subgroups.


# -- wh_ere are the breaks?
wh_ = np.nonzero(mask)[0]
array([0, 3, 7, 9], dtype=int64)

# -- what are their values?
tmp = arr[mask]
array([[  0.00,   0.00],
       [  1.00,   1.00],
       [  2.00,   2.00],
       [  3.00,   3.00]])

# -- split the original array into subarrays of its values
sub_arrays = np.array_split(arr, wh_[wh_ > 0])
[array([[  0.00,   0.00],
        [  0.00,   0.00],
        [  0.00,   0.00]]),
 array([[  1.00,   1.00],
        [  1.00,   1.00],
        [  1.00,   1.00],
        [  1.00,   1.00]]),
 array([[  2.00,   2.00],
        [  2.00,   2.00]]),
 array([[  3.00,   3.00],
        [  3.00,   3.00],
        [  3.00,   3.00],
        [  3.00,   3.00],
        [  3.00,   3.00]])]
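
The break positions also give you the length of each run without splitting anything. The sketch below is an assumption-laden extra: it builds the mask with a plain row-by-row comparison (`np.any` over the columns) instead of the structured view, and `ends` and `run_lengths` are names made up for the illustration.

```python
import numpy as np

x = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
arr = np.repeat(x, [3, 4, 2, 5], axis=0)

# -- flag rows that differ from the previous row (plain 2D comparison)
mask = np.ones(arr.shape[0], dtype=bool)
mask[1:] = np.any(arr[1:] != arr[:-1], axis=1)

wh_ = np.nonzero(mask)[0]                # start index of each run
ends = np.append(wh_[1:], arr.shape[0])  # one past the end of each run
run_lengths = ends - wh_
print(run_lengths)  # [3 4 2 5]

# -- start index of every run that holds duplicates, here all of them
print(wh_[run_lengths > 1])  # [0 3 7 9]
```

Any run with a length greater than 1 contains duplicates, so `run_lengths` doubles as a duplicate count per group.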


But what if the data were all jumbled up, like this?

array([[  1.00,   1.00],
       [  2.00,   2.00],
       [  0.00,   0.00],
       [  3.00,   3.00],
       [  3.00,   3.00],
       [  0.00,   0.00],
       [  2.00,   2.00],
       [  3.00,   3.00],
       [  3.00,   3.00],
       [  1.00,   1.00],
       [  0.00,   0.00],
       [  1.00,   1.00],
       [  3.00,   3.00],
       [  1.00,   1.00]])
# -- sort it by one of several methods; here the rows are reordered by the
#    first column, which is enough because x == y in every row of this demo
arr[np.argsort(arr, axis=0)[:, 0]]
array([[  0.00,   0.00],
       [  0.00,   0.00],
       [  0.00,   0.00],
       [  1.00,   1.00],
       [  1.00,   1.00],
       [  1.00,   1.00],
       [  1.00,   1.00],
       [  2.00,   2.00],
       [  2.00,   2.00],
       [  3.00,   3.00],
       [  3.00,   3.00],
       [  3.00,   3.00],
       [  3.00,   3.00],
       [  3.00,   3.00]])
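
Sorting on the first column alone works for this demo because x and y match in every row. When they don't, rows with the same x but different y can end up interleaved. One option, which goes beyond the demo, is `np.lexsort`: it sorts on several keys so identical rows land next to each other. The `jumbled` array below is invented for the illustration.

```python
import numpy as np

# -- hypothetical jumbled coordinates where x alone does not identify a row
jumbled = np.array([[1., 2.], [0., 5.], [1., 1.], [0., 5.],
                    [1., 2.], [0., 3.], [1., 1.], [1., 2.]])

# -- lexsort treats the LAST key as the primary key: sort by x, then y
order = np.lexsort((jumbled[:, 1], jumbled[:, 0]))
srted = jumbled[order]

# -- flag the first row of each run of identical rows
mask = np.ones(srted.shape[0], dtype=bool)
mask[1:] = np.any(srted[1:] != srted[:-1], axis=1)
wh_ = np.nonzero(mask)[0]

print(srted[mask])  # the four distinct rows
print(wh_)          # [0 1 3 5]
```

Keeping `order` around is handy, since `order[wh_]` points back to where the first member of each group sat in the jumbled input.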


So identifying duplicates or sequences can be a challenge, but there is a lot that you can do with them.

I should mention that the whole principle above isn't just for coordinates, but applies to any form of tabular attribute.
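
To make the tabular case concrete, here is a minimal sketch using a small structured array as a stand-in for an attribute table. The field names (`name`, `code`) and the records are made up; with real data you would substitute whatever fields define "identical" for you.

```python
import numpy as np

# -- hypothetical attribute table: a text field and an integer field
dt = [('name', 'U5'), ('code', 'i4')]
tbl = np.array([('oak', 2), ('elm', 1), ('oak', 2), ('ash', 3),
                ('elm', 1), ('oak', 1), ('elm', 1)], dtype=dt)

# -- sort on both fields so identical records become adjacent
srted = np.sort(tbl, order=['name', 'code'])

# -- same sequential mask as used for the coordinates
mask = np.empty(srted.shape, np.bool_)
mask[0] = True
mask[1:] = srted[:-1] != srted[1:]

# -- split into groups of identical records
wh_ = np.nonzero(mask)[0]
groups = np.array_split(srted, wh_[wh_ > 0])
for g in groups:
    print(g[0], len(g))  # the record and how many times it occurs
```

No view is needed here because the table is already one record per row, so the comparison in the mask works on it directly.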

Have fun.  Other homework reading:

Find Identical (Data Management)—ArcGIS Pro | Documentation

Delete Identical (Data Management)—ArcGIS Pro | Documentation






About the Author
Retired Geomatics Instructor (also DanPatterson_Retired). Currently working on geometry projects (various) as they relate to GIS and spatial analysis. I use NumPy, Python and kin and interface with ArcGIS Pro.