Duplicates in attributes and/or geometry are always fun. This missive is a quick demo of some of the things you can do with them.
Begin with an array of "coordinates". These will be simple as shown. I will even go through the process of making your own dataset for experimentation.
import numpy as np
# -- make your base array of values
x = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
# -- using numpy magic, repeat the sequences of inputs to suit
arr = np.repeat(x, [3, 4, 2, 5], axis=0)
# -- yielding your positional data with multiple repeats
arr
Out[60]:
array([[ 0.00, 0.00],
[ 0.00, 0.00],
[ 0.00, 0.00],
[ 1.00, 1.00],
[ 1.00, 1.00],
[ 1.00, 1.00],
[ 1.00, 1.00],
[ 2.00, 2.00],
[ 2.00, 2.00],
[ 3.00, 3.00],
[ 3.00, 3.00],
[ 3.00, 3.00],
[ 3.00, 3.00],
[ 3.00, 3.00]])
Easy so far, stick with it.
For more numpy magic, here is what is going to happen in the code that follows:
- line 1 : Get some base array information, the shape and data type, which are
- arr.shape, arr.dtype # ((14, 2), dtype('float64')) 14 coordinates, floating point type
- lines 2-3 : Prepare the array so it can be viewed as rows of single entities rather than pairs of values
- Views into arrays, rather than copies of arrays, are homework reading.
- lines 5-7 : Produce a mask array that flags each value that differs from the one before it. The first value (mask[0]) has nothing before it, so it is set to True.
shp_in, dt_in = arr.shape, arr.dtype
dt = [('f{i}'.format(i=i), dt_in) for i in range(arr.shape[1])]
tmp = arr.view(dt).squeeze() # -- view data and reshape to (N,)
# -- mask and check for sequential equality.
mask = np.empty((shp_in[0],), np.bool_)
mask[0] = True
mask[1:] = tmp[:-1] != tmp[1:]
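If you want to convince yourself of what the view is doing, here is a minimal sketch. It rebuilds the same array, checks that the structured view shares memory with the original (so it is a view, not a copy), and then builds the same mask by comparing whole rows directly. The row comparison with np.any is my own alternative for illustration, not the method used above.

```python
import numpy as np

x = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
arr = np.repeat(x, [3, 4, 2, 5], axis=0)

# Structured view: each row becomes one record with fields f0, f1.
dt = [('f{}'.format(i), arr.dtype) for i in range(arr.shape[1])]
tmp = arr.view(dt).squeeze()          # shape goes from (14, 2) to (14,)
is_view = np.shares_memory(arr, tmp)  # True when no copy was made

# Mask from the structured view, as in the code above.
mask_view = np.empty(arr.shape[0], dtype=bool)
mask_view[0] = True
mask_view[1:] = tmp[:-1] != tmp[1:]

# Alternative (an assumption of this sketch): compare whole rows directly.
mask_rows = np.empty(arr.shape[0], dtype=bool)
mask_rows[0] = True
mask_rows[1:] = np.any(arr[1:] != arr[:-1], axis=1)
```

Both masks should agree. The structured view comes into its own when the columns have mixed data types, which a plain 2D comparison cannot handle.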
From the above, the view of the original array can be examined using its mask.
- lines 1-4 : where are the start locations of the sequence groupings?
- lines 6-12 : what are the group values?
- lines 14 onward : Split the array into subgroups.
# -- wh_ere are the breaks?
wh_ = np.nonzero(mask)[0]
wh_
array([0, 3, 7, 9], dtype=int64)
# -- what are their values?
tmp = arr[mask]
tmp
array([[ 0.00, 0.00],
[ 1.00, 1.00],
[ 2.00, 2.00],
[ 3.00, 3.00]])
# -- split the original array into subarrays of its values
sub_arrays = np.array_split(arr, wh_[wh_ > 0])
sub_arrays
[array([[ 0.00, 0.00],
[ 0.00, 0.00],
[ 0.00, 0.00]]),
array([[ 1.00, 1.00],
[ 1.00, 1.00],
[ 1.00, 1.00],
[ 1.00, 1.00]]),
array([[ 2.00, 2.00],
[ 2.00, 2.00]]),
array([[ 3.00, 3.00],
[ 3.00, 3.00],
[ 3.00, 3.00],
[ 3.00, 3.00],
[ 3.00, 3.00]])]
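Once you know where the breaks are, the size of each group is only a subtraction away. A small sketch, assuming the same array and the break positions found above; the cross-check with np.unique is an extra I added, and it sorts the rows itself, so it does not depend on the mask.

```python
import numpy as np

x = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
arr = np.repeat(x, [3, 4, 2, 5], axis=0)

# Start index of each run, as found with the mask.
starts = np.array([0, 3, 7, 9])
# Append the total row count so the last run gets a length too.
counts = np.diff(np.append(starts, arr.shape[0]))

# Cross-check: unique rows and how often each occurs.
uniq, n = np.unique(arr, axis=0, return_counts=True)
```

Here counts and n should both come out as 3, 4, 2, 5, matching the repeats used to build the array.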
But what if the data were all jumbled up, like this?
arr
array([[ 1.00, 1.00],
[ 2.00, 2.00],
[ 0.00, 0.00],
[ 3.00, 3.00],
[ 3.00, 3.00],
[ 0.00, 0.00],
[ 2.00, 2.00],
[ 3.00, 3.00],
[ 3.00, 3.00],
[ 1.00, 1.00],
[ 0.00, 0.00],
[ 1.00, 1.00],
[ 3.00, 3.00],
[ 1.00, 1.00]])
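# -- sort it by one of several methods; here the rows are reordered by an argsort of the first column
# -- (enough for this data since both columns match; rows tied on column 0 are not ordered by column 1)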
arr[np.argsort(arr, axis=0)[:, 0]]
array([[ 0.00, 0.00],
[ 0.00, 0.00],
[ 0.00, 0.00],
[ 1.00, 1.00],
[ 1.00, 1.00],
[ 1.00, 1.00],
[ 1.00, 1.00],
[ 2.00, 2.00],
[ 2.00, 2.00],
[ 3.00, 3.00],
[ 3.00, 3.00],
[ 3.00, 3.00],
[ 3.00, 3.00],
[ 3.00, 3.00]])
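When the x and y values are not identical, sorting on the first column alone can leave duplicate rows separated. One option, sketched here with a small made-up array (pts is hypothetical, not the data above), is np.lexsort, which sorts on several keys at once. After that, the same sequential mask applies.

```python
import numpy as np

# Hypothetical jumbled points where the two columns differ.
pts = np.array([[1., 2.], [0., 5.], [1., 0.], [0., 5.], [1., 2.]])

# np.lexsort uses the LAST key as the primary one,
# so pass column 1 first and column 0 last.
order = np.lexsort((pts[:, 1], pts[:, 0]))
srt = pts[order]

# Flag the first row of each run of identical rows.
mask = np.empty(srt.shape[0], dtype=bool)
mask[0] = True
mask[1:] = np.any(srt[1:] != srt[:-1], axis=1)
```

With the rows sorted on both columns, identical points sit next to each other and srt[mask] gives one copy of each.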
So identifying duplicates or sequences can be a challenge, but there is a lot that you can do with them.
I should mention that the whole principle above isn't just for coordinates; it applies to any form of tabular attribute.
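As a quick illustration of the tabular case, here is a sketch with a made-up attribute table. The field names name and code and their values are hypothetical. It uses np.unique on a structured array to pull out the distinct records and count them.

```python
import numpy as np

# Hypothetical attribute table: a text field and an integer field.
tbl = np.array([('a', 1), ('b', 2), ('a', 1), ('c', 3), ('b', 2), ('a', 1)],
               dtype=[('name', 'U5'), ('code', 'i4')])

# Distinct records, sorted field by field, and the number of times each occurs.
uniq, counts = np.unique(tbl, return_counts=True)

# Records that occur more than once are the duplicates.
dups = uniq[counts > 1]
```

The same idea should carry over to a table pulled from a featureclass, as long as the fields you care about are in the structured array.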
Have fun. Other homework reading:
Find Identical (Data Management)—ArcGIS Pro | Documentation
Delete Identical (Data Management)—ArcGIS Pro | Documentation