Forest Based Classification and Regression

ArnieWaddell1 · ‎11-13-2024

I am using 5 rasters in my model to predict a soil property from point data. What does it mean when the variance explained is low, 22% (Model Out of Bag Errors), but in my training data regression diagnostics the R2 is 89% and the SMSE and SE are also low? Also, when I look at my residuals, the model appears to have predicted very well.

CatherineMcSorley · ‎11-14-2024

Hi, it sounds like your model might be overfitting to the training data. In other words, it does a really good job on the data it was trained on, but it won't perform well when predicting to new data.

There are a couple of ways to avoid this. Start by looking at the Validation Options accordion in the Forest-based and Boosted Classification and Regression tool, and make sure there is some data set aside for evaluation (Training data excluded for validation %). Then, in the output, you can evaluate your R^2, errors, etc. on both the training and the validation data. If the metrics are much better for training than for validation, your model is overfitting.
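To make the training-versus-validation comparison concrete, here is a minimal sketch in plain Python. It is not the ArcGIS tool and not a random forest: as a stand-in for an overfit model it uses a 1-nearest-neighbor predictor that simply memorizes the training points. The synthetic data, the 80/20 split, and the names `r_squared` and `nearest_neighbor_predict` are all assumptions made for illustration.

```python
import random


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def nearest_neighbor_predict(train_x, train_y, query_x):
    """Predict each query by copying the target of the closest training point.

    This memorizes the training data, so it stands in for an overfit model.
    """
    predictions = []
    for q in query_x:
        closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - q))
        predictions.append(train_y[closest])
    return predictions


random.seed(0)

# Synthetic data (an assumption): a weak linear signal buried in strong noise.
x = [random.uniform(0, 10) for _ in range(200)]
y = [0.3 * xi + random.gauss(0, 2.0) for xi in x]

# Hold out the last 20% for validation, like "Training data excluded for validation %".
split = int(0.8 * len(x))
train_x, train_y = x[:split], y[:split]
valid_x, valid_y = x[split:], y[split:]

train_r2 = r_squared(train_y, nearest_neighbor_predict(train_x, train_y, train_x))
valid_r2 = r_squared(valid_y, nearest_neighbor_predict(train_x, train_y, valid_x))

print(f"Training R^2:   {train_r2:.2f}")
print(f"Validation R^2: {valid_r2:.2f}")
```

A training R^2 far above the validation R^2 is the overfitting pattern described above: the model reproduces the points it has already seen but does not generalize to held-out points.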

There is also a checkbox in the tool to Optimize Parameters. This will choose the parameters (such as tree depth) that give you the highest, say, R^2 specifically for your testing data. See more here: https://pro.arcgis.com/en/pro-app/latest/tool-reference/spatial-statistics/how-forest-works.htm#:~:t...
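The general idea behind choosing parameters by held-out performance can be sketched like this. It is only an illustration of the concept, not how the tool implements Optimize Parameters: it swaps in a k-nearest-neighbors average for the forest, with `k` playing the role of a tunable setting such as tree depth. The synthetic data, the candidate values, and the helper names are assumptions.

```python
import random


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def knn_predict(train_x, train_y, query_x, k):
    """Predict each query as the mean target of its k closest training points."""
    predictions = []
    for q in query_x:
        nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - q))[:k]
        predictions.append(sum(train_y[i] for i in nearest) / k)
    return predictions


random.seed(0)

# Synthetic data (an assumption): a weak linear signal buried in strong noise.
x = [random.uniform(0, 10) for _ in range(200)]
y = [0.3 * xi + random.gauss(0, 2.0) for xi in x]

# Hold out the last 20% of the points to score each candidate setting.
split = int(0.8 * len(x))
train_x, train_y = x[:split], y[:split]
held_x, held_y = x[split:], y[split:]

# Hypothetical candidate settings; k=1 memorizes, larger k smooths more.
candidates = [1, 5, 25]
scores = {
    k: r_squared(held_y, knn_predict(train_x, train_y, held_x, k))
    for k in candidates
}

# Keep the setting with the highest R^2 on the held-out data.
best_k = max(scores, key=scores.get)

for k in candidates:
    print(f"k={k:>2}  held-out R^2 = {scores[k]:.2f}")
print(f"Selected k = {best_k}")
```

Because the selected setting is the one that scored best on the held-out points, that score is somewhat optimistic; a further independent set of points gives a fairer estimate of how the final model will perform.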

--Catherine McSorley

ArnieWaddell1 · ‎11-15-2024

Hi Catherine,

Thanks for helping out.

My next questions are:

If my training and validation data show R2 values that vary greatly, as shown below for the Nb soil property, does this mean that Forest Based Classification and Regression does not work for this data and I should just use interpolation such as Kriging?
Does the low % variation explained (8.7% and 9.8%) in the Model Out of Bag Errors indicate that my rasters have very little relationship with Nb, and therefore that the model should not be used?
What do the importance values in the Top Variable Importance table indicate? They range from 0.02 to 0.03, which is very low compared to the Ca soil property, where they range from 17 to 57.
Ca also has a high variance explained in the Model Out of Bag Errors, so I am assuming that this model's rasters have a strong relationship with Ca?

[Image: Nb model output (not preserved)]

Arnie Waddell, M.A.
