Identifying Repeated Vegetation Enquiries Over Time
Context
I’m working with a dataset of over 80,000 vegetation-related enquiries submitted by the public over a 10-year period. Each enquiry is represented as a point with the following attributes:
Objective
My goal is to identify repeated enquiries: those occurring at least twice over the time period with at least a year between enquiries. I want to find all repeat enquiries to get a sense of the overall picture, and then filter/reduce each repeated case down to a single enquiry to get a sense of how many I'm dealing with.
This will help reveal patterns of recurring issues (e.g., persistent vegetation problems) and allow me to filter down to unique cases for further analysis.
Challenges Encountered
Some locations have multiple identical points due to users clicking repeatedly when submitting an enquiry. In some cases, this results in 6+ identical points.
Solution:
I used Excel to remove duplicates based on a combination of:
Because the data is submitted by the public:
Methods Used – Image below.
Method 1: Buffer-Based Spatial Join (Blue Points)
This method helped group nearby points with similar subjects but missed some cases.
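For anyone wanting to reproduce this step outside ArcGIS, a rough sketch of a buffer-based spatial join in geopandas is below; the layer name, the 25 m buffer distance, and the `subject` field are assumptions rather than the exact settings used here.

```python
import geopandas as gpd

# Hedged sketch of a buffer-based spatial join (layer/field names and the
# 25 m distance are assumptions). A projected CRS in metres is assumed.
gdf = gpd.read_file("enquiries.gpkg")

buffered = gdf[["subject", "geometry"]].copy()
buffered["geometry"] = buffered.geometry.buffer(25)

# Pair every enquiry with all enquiries falling inside its buffer,
# then keep pairs of different records that share the same subject.
pairs = gpd.sjoin(buffered, gdf, how="inner", predicate="intersects").reset_index()
nearby_same_subject = pairs[
    (pairs["index"] != pairs["index_right"])
    & (pairs["subject_left"] == pairs["subject_right"])
]
```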
Method 2: Python-Based Temporal Filter (Red Points)
This method captured some cases missed by Method 1 but also failed to group some closely overlapping points.
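A purely temporal repeat check along these lines might look roughly like the sketch below; the `location_id`, `subject`, and `date_submitted` columns are hypothetical stand-ins for whatever grouping key and date field the actual script uses.

```python
import pandas as pd

# Hedged sketch of a temporal repeat filter. Column names are hypothetical:
# "location_id" stands in for whatever key groups enquiries at the same place.
df = pd.read_csv("enquiries.csv", parse_dates=["date_submitted"])

def repeated_with_year_gap(group, min_gap_days=365):
    """True if a group has 2+ enquiries with at least a year between first and last."""
    dates = group["date_submitted"]
    return len(group) >= 2 and (dates.max() - dates.min()).days >= min_gap_days

repeats = df.groupby(["location_id", "subject"]).filter(repeated_with_year_gap)
```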
Visual Comparison
As shown in the image below, red points (Method 2) and blue points (Method 1) don’t always overlap. Each method captures different aspects of the problem, and neither is fully comprehensive.
Next Steps
I'm looking for recommendations on improvements and next steps. I could append the two datasets and remove the intersecting points, but I'd first like to understand any mistakes I've made, or alternative approaches that could highlight repeated enquiries I'm missing.
Unfortunately I don't have permission to upload the screenshots mentioned in the post, so links are below.
Screenshot 1 link - (https://postimg.cc/YL3LqxLH)
Screenshot 2 link - (https://i.postimg.cc/Xv2DLbtq/Comparison.png)
A bit cumbersome at first, but using a combination of both methods would likely provide the best results.
In theory, you would use a library like ArcPy or geopandas to create a spatial query based on a buffer around each individual record, then apply the same temporal logic from your existing Python script to systematically identify duplicates within that spatial constraint.
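A minimal sketch of that combined approach, assuming a geopandas workflow with placeholder layer/field names and a projected CRS in metres:

```python
import geopandas as gpd
import pandas as pd

# Hedged sketch combining the spatial buffer with the one-year temporal rule.
# Layer/field names and the 25 m buffer distance are placeholders.
gdf = gpd.read_file("enquiries.gpkg")
gdf["date_submitted"] = pd.to_datetime(gdf["date_submitted"])

buffered = gdf[["subject", "date_submitted", "geometry"]].copy()
buffered["geometry"] = buffered.geometry.buffer(25)

# Spatial constraint: pair each enquiry with every enquiry inside its buffer.
pairs = gpd.sjoin(buffered, gdf, how="inner", predicate="intersects").reset_index()

# Temporal constraint: same subject, different record, at least a year apart.
repeats = pairs[
    (pairs["index"] != pairs["index_right"])
    & (pairs["subject_left"] == pairs["subject_right"])
    & ((pairs["date_submitted_right"] - pairs["date_submitted_left"]).abs()
       >= pd.Timedelta(days=365))
]
```

Note that each repeated pair appears twice (A matched to B and B matched to A), so the result would still need deduplicating on a sorted pair of record IDs before reducing each case to a single enquiry.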
Once you have the dataset in a reasonable state, it would be wise to automate this process to run regularly. At that point you only need to query and compare records within your stated time constraint (the past year) instead of evaluating against the entire 80,000-record dataset, which keeps the process much less resource-intensive going forward.
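As a rough illustration of that incremental idea (file and column names are placeholders):

```python
import pandas as pd

# Hedged sketch: only check enquiries from the past year on each scheduled run,
# rather than re-processing the full 80,000-record dataset every time.
df = pd.read_csv("enquiries.csv", parse_dates=["date_submitted"])

cutoff = pd.Timestamp.today() - pd.DateOffset(years=1)
recent = df[df["date_submitted"] >= cutoff]

# Run the same spatial/temporal duplicate checks for `recent` against the
# already-processed historical dataset.
```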
Finally, if possible, try to build some constraints or preset options into the collection method so that records are easier to query and evaluate. This could look like a dedicated field for complaint type ('Foliage Overburden', 'Dead Tree', etc.) plus a separate free-text field where users can add comments or descriptions for that complaint.
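As a hedged illustration, in a file geodatabase that kind of constraint could be set up with a coded-value domain; every path, domain, and field name below is a placeholder:

```python
import arcpy

# Hedged sketch: constrain complaint type to a fixed coded-value list, with a
# separate free-text field for descriptions. Paths and names are placeholders.
gdb = r"C:\data\enquiries.gdb"
fc = gdb + r"\vegetation_enquiries"

arcpy.management.CreateDomain(gdb, "ComplaintType",
                              "Type of vegetation complaint", "TEXT", "CODED")
for value in ("Foliage Overburden", "Dead Tree"):
    arcpy.management.AddCodedValueToDomain(gdb, "ComplaintType", value, value)

arcpy.management.AddField(fc, "COMPLAINT_TYPE", "TEXT", field_length=50,
                          field_domain="ComplaintType")
arcpy.management.AddField(fc, "COMMENTS", "TEXT", field_length=500)
```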
Find Identical (Data Management)—ArcGIS Pro | Documentation
can be used to identify duplicates in attributes and even geometry or some combination thereof.
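For example, the tool can be driven from Python; the sketch below is only illustrative, with placeholder paths and field names:

```python
import arcpy

# Hedged sketch: compare geometry plus a subject field within a small XY
# tolerance and write out only the duplicate records. Paths/fields are placeholders.
arcpy.management.FindIdentical(
    in_dataset=r"C:\data\enquiries.gdb\vegetation_enquiries",
    out_dataset=r"C:\data\enquiries.gdb\identical_enquiries",
    fields=["Shape", "SUBJECT"],
    xy_tolerance="1 Meters",
    output_record_option="ONLY_DUPLICATES",
)
```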
An overview of the Generalization toolset—ArcGIS Pro | Documentation using
Aggregate Points (Cartography)—ArcGIS Pro | Documentation
might be worth a look as well
I did initially use the Find Identical tool like you described, but it doesn't find all the records for some reason; that's why I ended up using Excel in the end.
Thanks for linking to the Aggregate Points tool; I'll try the free Aggregate Points tool as I don't have access to the Cartography one.