
How to perform Validation?

04-27-2020 12:59 AM
BankimYadav
Emerging Contributor

Okay, I got through step one, i.e., 'Subset Features' in the Geostatistical Analyst (GA) toolbox.

The cross-validation documentation says that the validation and cross-validation diagnostics are similar, except that the input model is built on the training portion and on the entire dataset, respectively.

How do I prepare a new model on the training portion only, and how do I obtain similar cross-validation graphs? Right now, I just have two output point feature classes generated from my input point dataset using 'Subset Features'.

Emily Norton


4 Replies
EricKrause
Esri Regular Contributor

Subset Features is used to split the data into "training" and "test" subsets.  You will build the interpolation model as normal on the training subset, using whichever interpolation method and parameters you decide.  You'll then run the GA Layer To Points tool to predict/validate at the test subset: specify the field with the measured values in the test subset, and run the tool.
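
For concreteness, here is a minimal arcpy sketch of that workflow, assuming ArcGIS Pro with the Geostatistical Analyst extension; the workspace, dataset names, and field name are placeholders, and Empirical Bayesian Kriging is only an example method, not a recommendation:

```python
import arcpy

arcpy.env.workspace = r"C:\data\ozone.gdb"   # hypothetical workspace
arcpy.CheckOutExtension("GeoStats")

# 1. Split the input points into training and test subsets (e.g., 80/20).
arcpy.ga.SubsetFeatures("all_points", "train_points", "test_points",
                        80, "PERCENTAGE_OF_INPUT")

# 2. Build the interpolation model on the training subset only.
#    EBK is just an example; any interpolation method works here.
arcpy.ga.EmpiricalBayesianKriging("train_points", "measured_val", "train_model")

# 3. Predict at the test locations; passing the measured-value field
#    makes the tool append the validation fields (Error, etc.).
arcpy.ga.GALayerToPoints("train_model", "test_points",
                         "measured_val", "validation_points")
```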

The output will be a feature class with all of the usual validation statistics for each individual feature.  The Predicted and Error fields will always appear, but some models will also create Standard Error, Standardized Error, and Normal Value fields.

You can then create scatter plot charts using these fields.  While they are not created automatically like they are for cross validation, they can all be created as simple scatter plots (a sketch follows the list below):

Predicted: use the field of measured values and the Predicted field.

Error: use the field of measured values and the Error field.

Standardized Error: use the field of measured values and the Standardized Error field.

Normal QQ Plot: use the Normal Value and Standardized Error fields.
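
Here is a rough sketch of rebuilding two of those charts from the GA Layer To Points output, assuming arcpy plus matplotlib; the layer name and the measured-value field name are placeholders (the Predicted and Error field names come from the tool's output):

```python
import arcpy
import matplotlib.pyplot as plt

# Pull the validation fields out of the output feature class.
fields = ["measured_val", "Predicted", "Error"]   # "measured_val" is a placeholder
data = arcpy.da.TableToNumPyArray("validation_points", fields, skip_nulls=True)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(data["measured_val"], data["Predicted"], s=10)
ax1.set(xlabel="Measured", ylabel="Predicted", title="Predicted")
ax2.scatter(data["measured_val"], data["Error"], s=10)
ax2.set(xlabel="Measured", ylabel="Error", title="Error")
plt.tight_layout()
plt.show()
```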

The reason these do not appear in a pop-up window like the cross validation results is that this pop-up is a property of geostatistical layers.  Feature classes cannot display them.

BankimYadav
Emerging Contributor

Thank you, Mr. Eric.

I did as you suggested and got what I wanted, and I am happy with it. I have two points:

Regarding the linear regression equation in cross-validation:

The regression equation between the measured and predicted values is not the same as the one I get through Python or Excel. I copied the columns of measured and predicted values from the Cross-Validation (CV) statistics into an Excel file and fit the line there.

Regarding cross-validation and validation:

Since I am working on the entire dataset to find the best model, ranked by CV statistics, should I even do the validation part, i.e., subsetting the data into training and test portions and measuring prediction performance on the test portion? As the GA documentation puts it, validation is a preliminary step: if the model works well on the training portion, then a 'similar' model should work well on the entire dataset.

Most probably, I would end up using the same model on the training set. Wouldn't that just provide more statistics about how the model performs on the train/test portions when I already know how it works on the entire dataset?

Please share your insights.

EricKrause
Esri Regular Contributor

The regression line in cross validation excludes some of the extreme values when fitting the line.  This is why it differs from the line you get in Excel.  Sorry, I had forgotten to mention this earlier.  From the help:

"This procedure first fits a standard linear regression line to the scatterplot. Next, any points that are more than two standard deviations above or below the regression line are removed, and a new regression equation is calculated. This procedure ensures that a few outliers will not corrupt the entire regression equation."

Regarding whether to do validation or cross validation, you have a few choices.  Validation is the most statistically defensible methodology (because it validates against data that was completely withheld), but it requires not using some of your data.  Cross validation, on the other hand, uses all data to build the model, but it then validates against the same data used to build the model, so there is a bit of data double-dipping.  The double-dipping isn't usually a problem because the influence of any individual point should not be too extreme.

The third option is to use a validation workflow to decide the parameters of the model and then apply that model to the entire dataset.  To do this, perform the entire validation workflow, then use the Create Geostatistical Layer tool, providing the geostatistical layer used for validation and the entire dataset as inputs.  This will apply the parameters of the validation model to all of the data.
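
A minimal sketch of that retargeting step, assuming arcpy in ArcGIS Pro; the layer, dataset, and field names are placeholders:

```python
import arcpy

arcpy.CheckOutExtension("GeoStats")

# Keep the parameters of the model fitted on the training subset,
# but point its source data at the entire dataset.
ga_datasets = arcpy.GeostatisticalDatasets("train_model")
ga_datasets.dataset1 = "all_points"          # the full dataset
ga_datasets.dataset1Field = "measured_val"   # measured-value field (placeholder)

arcpy.ga.GACreateGeostatisticalLayer("train_model", ga_datasets, "full_model")
```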

If, as you say, you're going to choose your model by cross validation statistics, then I would probably just do cross validation and not do a full validation workflow.  But it's up to you.
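
If you do go the cross-validation-only route, here is a small sketch of comparing candidate models by their CV statistics, assuming arcpy; the layer names are placeholders:

```python
import arcpy

arcpy.CheckOutExtension("GeoStats")

# Compare candidate geostatistical layers by their cross-validation diagnostics.
for layer in ["ebk_model", "ordinary_kriging_model"]:   # hypothetical layers
    cv = arcpy.ga.CrossValidation(layer)
    print(layer, "RMS error:", cv.rootMeanSquare, "Mean error:", cv.meanError)
```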

BankimYadav
Emerging Contributor

Thank you. You have great insight and vast experience in geostatistics.

The third option is interesting, although there is a bit of double-dipping in this method too, through the repeated use of the training portion of the data. I will try it and include it in my write-up if I can theoretically defend its usage.

I am closing the thread here. It's answered, and thank you.
