Categorical variables while using Geographically Weighted Regression (GWR)

03-05-2020 10:25 AM
NaziaNawrin1
New Contributor

Hello!
I have a question regarding Geographically Weighted Regression (GWR) in ArcGIS. I am using ArcMap 10.2. I have a large dataset of groundwater geochemical constituents and physiography. In my GWR model, the dependent variable is chemical concentration (a continuous variable) and one of the explanatory variables is physiography (a categorical variable). I coded the eight physiographic classes by numbering them from 1 to 8 and ran the GWR model. My aim was to establish the relationship between groundwater quality and physiography, i.e., to measure the R² value.
Recently I tried the same eight classes numbered in a different order and got a slightly lower R² value.

I have found in literature that “Dependent and Explanatory variables should be numeric fields containing a variety of values. Linear regression methods, like GWR, are not appropriate for predicting binary outcomes (e.g., all of the values for the dependent variable are either 1 or 0).”

So, my question is: since one of my explanatory variables, physiography, is coded with the values 1 to 8, can the GWR model run properly with this type of variable?

And why did the R² value change when I changed the order of the physiography classes?


Accepted Solutions
EricKrause
Esri Regular Contributor

Hi Nazia,

In Geographically Weighted Regression, the explanatory variables can be categorical. As long as your dependent variable is continuous, it is fine to use categorical variables as explanatory variables. However, to use a categorical variable appropriately, you can't just assign values 1 through 8 to it. As you found, GWR will treat these codes as actual numbers on a scale (as if class 2 were one unit "more" than class 1, and class 8 seven units more), so you will get different results depending on which levels you label as 1 through 8. For GWR to work properly with your categorical variable, you need to convert it to several indicator variables (variables that have the value 0 or 1) and then use these indicator variables as explanatory variables in GWR.

The process of converting categorical variables to indicator variables is called "dummy encoding." Here is a good article about how to perform dummy encoding:

Dummy variable (statistics) - Wikiversity 

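In your case, your categorical variable has 8 levels, so you will need to make 7 indicator variables to represent the different levels. You always use one fewer indicator variable than the number of levels of the category: the level left out becomes the reference level, and adding an eighth indicator would make the indicators sum to 1 for every feature, which is perfectly collinear with the intercept.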

You'll need to make 7 new fields on your feature class.  For the first field, each feature that is in the first level of the category gets the value 1, and features in any other level get the value 0 (we say that the 1 "indicates" that the feature is in that level).  Similarly, in the second field, the features of the second level get a 1, and all other features get a 0.  Same for levels 3 through 7.  For level 8, the value 0 should go in all 7 fields.
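
The encoding steps above can be sketched in plain Python. This is only an illustration under assumptions: the function name `dummy_encode`, the `PHYS_` field-name prefix, and the sample values are made up for the example, and writing the results into fields on the feature class (for instance with the Field Calculator) is not shown.

```python
def dummy_encode(values, levels, reference, prefix="PHYS_"):
    """Build 0/1 indicator columns for every level except the reference.

    values    -- category code of each feature, e.g. [1, 3, 8, ...]
    levels    -- all possible category codes, e.g. range(1, 9)
    reference -- the level that gets 0 in every indicator column
    prefix    -- hypothetical field-name prefix, chosen for this example
    """
    levels = list(levels)
    if reference not in levels:
        raise ValueError("reference must be one of the levels")

    columns = {}
    for level in levels:
        if level == reference:
            continue  # the reference level gets no column of its own
        # 1 "indicates" that the feature belongs to this level
        columns[prefix + str(level)] = [1 if v == level else 0 for v in values]
    return columns


# Made-up physiography codes for six features
physiography = [1, 3, 8, 2, 3, 8]

# Levels 1..8, with level 8 as the reference (all seven fields are 0)
indicators = dummy_encode(physiography, levels=range(1, 9), reference=8)

for name, column in indicators.items():
    print(name, column)
```

The seven resulting columns (`PHYS_1` through `PHYS_7` in this sketch) are what would go into GWR as explanatory variables in place of the single 1 to 8 field.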

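When you encode this way, it does not matter which level of the category is called the first, second, etc. Changing the order will produce the same model fit (the same R²) in GWR. Only the meaning of the individual coefficients changes, because each one measures the difference between its level and whichever level was left out as the reference.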
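
To see why the order stops mattering, it can help to write the two models side by side. This is a sketch in standard GWR notation, which is an assumption here rather than something from the tool documentation: (u_i, v_i) is the location of feature i, c_i is its 1 to 8 code, and D_{ik} is 1 if feature i is in level k and 0 otherwise, with level 8 as the reference.

```latex
% Integer coding: a single local slope applied to the code c_i \in \{1, \dots, 8\}
y_i = \beta_0(u_i, v_i) + \beta_1(u_i, v_i)\, c_i + \varepsilon_i

% Indicator coding: one local coefficient per non-reference level
y_i = \beta_0(u_i, v_i) + \sum_{k=1}^{7} \beta_k(u_i, v_i)\, D_{ik} + \varepsilon_i
```

In the first form, going from code 1 to 2 is forced to change the prediction by the same amount as going from 7 to 8, so renumbering the classes gives a different model. In the second form, every level has its own coefficient, so renumbering only changes which term belongs to which level.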

Please let me know if you have any other questions or have any problems encoding your variable.

-Eric Krause

4 Replies
NaziaNawrin
New Contributor

Hi Eric,
Thank you so much for your reply. I really appreciate how clearly you explained this. I will get in touch with you if I have any further questions. Have a great day!
Thanks

Nazia

NasserSharareh1
New Contributor III

Thanks for the question and answer. 

What if I were predicting a binary outcome variable? Can I use binary explanatory variables in that case?

Also, after running the GWR, how should I interpret the results? Let's say I code males as 0 and females as 1. Should the coefficients then be interpreted for females, since they are coded 1?

SadiaAfroza
New Contributor

Hi,

Thanks for the insights.

In GWR how can I specify continuous and categorical variables in the input? I understand from your reply that categorical variables should be transformed into dummy variables. However, I'm curious about how the software distinguishes between these dummy variables and the continuous ones in the process.
