Hello,
I am currently working on a research project to train a model to detect peaches on high-resolution imagery. Using the Train Deep Learning Model tool in ArcGIS Pro, you are only allowed to choose a percentage of your training data for your validation data. Since my training data is spatially autocorrelated, based on the existing literature, I would need to split the validation data from the training data in separate areas to avoid spatial leakage, where the validation set can contain chips that are adjacent to very similar training chips.
When I run the Train Deep Learning Model tool in ArcGIS Pro, the average precision is overestimated on the validation data.
Is there any way in ArcGIS Pro to choose a separate held-out area for validation instead of choosing a percentage of the training dataset?
We initially created a random tessellation of grids around the orchard and chose a random subset of those grids for digitization of the peaches. We then split the grids into 80% training and 20% testing. In an ideal experiment, I would want to choose which grids are used for training, which for validation, and which for testing.
Thank you for your time.
Sincerely,
Grisha Post
Your concern about spatial autocorrelation is valid, and in some workflows a spatially explicit split can be appropriate. However, the Train Deep Learning Model tool uses a single exported dataset and performs a random train/validation split to ensure consistency in schema, class definitions, metadata, and overall data distribution. Global image and label statistics in the EMD are also used to configure model-specific parameters (e.g., SSD settings such as zoom levels, aspect ratios, and grid sizes).
If the training and validation subsets are too spatially different, validation metrics may become less stable for guiding training, as they can reflect distribution shift rather than overfitting. For this reason, the validation split is intended to represent the same underlying distribution.
In many practical workflows, a large and diverse dataset combined with random splitting and data augmentation (which ArcGIS applies by default and can be controlled) is a solid and effective approach.
For evaluating true geographic generalization, a separate held-out spatial area evaluated after inference using Accuracy Assessment tools is typically more appropriate.
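As a rough illustration of the held-out idea, the grid cells from the original tessellation could be assigned to subsets as whole units before any chips are exported, so that no cell contributes to more than one subset. The sketch below is plain Python and not an ArcGIS Pro feature; the function name `split_grid_cells`, the fractions, and the integer cell IDs are all assumptions made for illustration.

```python
import random


def split_grid_cells(cell_ids, val_frac=0.1, test_frac=0.2, seed=42):
    """Assign whole grid cells to train/val/test (hypothetical helper).

    Every cell ID ends up in exactly one subset, so chips cut from the
    same cell never appear on both sides of a split.
    """
    ids = sorted(set(cell_ids))          # de-duplicate, fix a stable order
    rng = random.Random(seed)            # seeded for a reproducible split
    rng.shuffle(ids)

    n_test = round(len(ids) * test_frac)
    n_val = round(len(ids) * val_frac)

    return {
        "test": ids[:n_test],
        "val": ids[n_test:n_test + n_val],
        "train": ids[n_test + n_val:],
    }


# Example with 50 made-up cell IDs (1..50).
splits = split_grid_cells(range(1, 51))
for name, ids in splits.items():
    print(name, len(ids))
```

Note that a random draw of cells can still place neighbouring cells in different subsets; grouping cells into contiguous blocks before splitting would reduce leakage further. How the resulting lists are carried back into ArcGIS Pro (for example as an attribute on the grid polygons) is left open here.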
Here is how you can incorporate a test dataset in your workflow:
I hope this helps!
Cheers!