Which type of log transformation is most appropriate?

03-02-2019 03:54 PM
sajjidbudhwani
New Contributor

Hello Esri community,

I am a doctoral student at DU working on a research project that uses data from Colorado's 178 school districts. Guided by theory, I have compiled a list of variables to run OLS and analyze how these variables interact with those of neighboring school districts and predict school district performance scores.

Unfortunately, I found that some of these variables are moderately to substantially skewed (not normally distributed) and hence need data transformation. I am struggling because these variables show varying degrees of skewness (moderate to substantial), are either positively or negatively skewed, include some zero values, and come in different formats: percentages, ratios, dollar amounts, counts, and sum totals. Given this variability in my data, I am uncertain which log transformation would be most appropriate for each of these data types.

Any guidance would be very helpful.

Thanks.

Saj

9 Replies
DanPatterson_Retired
MVP Emeritus

More questions than answers:

- What do the descriptive statistics show, and/or the spatial patterns?

- Why do you need to use OLS when there are non-parametric alternatives?

- Are you just doing univariate analysis, or are you looking at multivariate descriptors?

- If zero is a valid observation, then that will limit your transformations (assuming that transformations make sense).

- Ratios, percentages and the like can be problematic (e.g. spurious correlation and the fallacy of the ratio standard revisited).

sajjidbudhwani
New Contributor

Thank you for your reply, Dan.

I ran hot spot analysis on each of the 20 variables and can see clusters of high and low values (especially for my dependent variable). Note: I did this on the raw data, without applying any log transformation. However, in the descriptive statistics, the skewness value for most of the variables is outside the range of +/- 1.
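For readers following along, the skewness check described above can be sketched in plain Python. This is only an illustration: it computes the moment-based (Fisher-Pearson) coefficient, which may differ slightly from the bias-adjusted value a given statistics package reports, and the `rates` values are made up, not taken from the district data.

```python
import math


def skewness(values):
    """Moment-based (Fisher-Pearson) skewness: g1 = m3 / m2**1.5."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values")
    mean = sum(values) / n
    # Second and third central moments (divide by n, not n - 1).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        raise ValueError("all values are identical; skewness is undefined")
    return m3 / m2 ** 1.5


# Hypothetical attrition rates (percent), including valid zeros.
rates = [0.0, 0.0, 1.5, 2.0, 2.5, 3.0, 4.0, 25.0]
g1 = skewness(rates)

# Rule of thumb used in this thread: |skewness| > 1 is substantial.
substantially_skewed = abs(g1) > 1
```

A positive result indicates a long right tail and a negative result a long left tail, which matters because a log transformation only pulls in a right tail.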

The only reason for performing OLS is that the research is about defining “district effectiveness”, meaning which district-level factors related to resource availability and allocation can explain a district's performance rating.

Yes, zero is a valid observation (not null). For example, a zero attrition rate of teachers or a zero student chronic absenteeism rate.

Sajjid

Sajjid Budhwani | ELPS Department Graduate Assistant

Ph.D. Candidate, COESA Executive Board President, & UCEA Jackson Scholar 2018-2020

Morgridge College of Education

1999 E. Evans Avenue | Denver, CO 80210

(c) 720.410.4674 | sajjid.budhwani@du.edu

www.morgridge.du.edu/

DanPatterson_Retired
MVP Emeritus

Perhaps something like Spearman's rank correlation would be useful, since it probably isn't the actual attrition rate (again, it is a rate) versus student absenteeism (a rate as well) that matters, but maybe the ranks themselves. The skewness may reflect a threshold. Does the histogram show any bimodality? (You may be dealing with two 'populations' in the broadest sense of the term.)
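As a rough illustration of the rank-based idea, Spearman's correlation is just Pearson's correlation applied to the ranks of the two variables, with tied values (such as several zero rates) sharing an average rank. The sketch below uses only the standard library, and the function names are made up for this example. In practice a library routine such as `scipy.stats.spearmanr` would be the more usual choice.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return pearson(average_ranks(x), average_ranks(y))
```

Because only the ordering of the values enters the calculation, skewness and outliers in the raw rates have no effect on the result, and no transformation is needed first.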

sajjidbudhwani
New Contributor

I don’t see bimodality in any of my variables.

Just to clarify, not all my variables are ratios/percentages. I have 8 variables in percentage form, 1 is a ratio, 4 are dollar amounts, and 7 are sum total of some kinds of population.

Sajjid

DanPatterson_Retired
MVP Emeritus

These are categorical data... "7 are sum total of some kinds of population." (old school nonparametric data)

sajjidbudhwani
New Contributor

No. They are discrete variables and not sub-types of a parent category. E.g. total population of teachers employed within a school district, total student population, free-and-reduced lunch population, etc.

DanPatterson_Retired
MVP Emeritus

That is different then. As for your second question, you can transform the values using whatever transformation you want by adding fields to your tables and using the field calculator. There are a whole load of 'math' module functions built in. But for the zeros there are workarounds, like using different distributions or assigning ridiculously small nonzero numbers (there is lots of discussion in the literature).

(First Dr Google hit: "Log transformation of values that include 0 (zero) for statistical analyses?")
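To make the zero workarounds concrete, here is a minimal sketch of two common options using Python's `math` module: `log(1 + x)` via `math.log1p`, and `log(x + c)` with a small constant `c`. The function names and the example shift are hypothetical, and whether either transformation is appropriate for a given variable is a separate question.

```python
import math


def log1p_transform(value):
    """log(1 + x): defined at zero, and maps 0 to 0."""
    if value is None:
        # Pass null field values through unchanged.
        return None
    if value < 0:
        raise ValueError("log1p_transform expects a non-negative value")
    return math.log1p(value)


def shifted_log(value, shift):
    """log(x + shift), where shift is a small positive constant."""
    if shift <= 0:
        raise ValueError("shift must be positive")
    if value + shift <= 0:
        raise ValueError("value + shift must be positive")
    return math.log(value + shift)


# Hypothetical counts that include a valid zero.
counts = [0, 3, 12, 250]
transformed = [log1p_transform(c) for c in counts]
```

In the ArcGIS Pro field calculator, a function like this could be pasted into the Python code block and called in the expression as `log1p_transform(!YourField!)`, where `YourField` is a placeholder field name. Note that with `shifted_log` the transformed values, and therefore the fitted coefficients, depend on the shift chosen, so the choice should be reported.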

I will leave aside whether doing it is appropriate in the first place, given there are alternatives. That is a discussion between you and your advisor.

Good luck.

sajjidbudhwani
New Contributor

I'm specifically keen on knowing whether there is any way I can transform ratio/percentage variables and use them along with other variables in one OLS model. Is that possible using ArcGIS Pro?

DanPatterson_Retired
MVP Emeritus

Almost forgot, if you use R for stats, then you might be interested in tools available through the R bridge... there is a space here that may have resource information https://community.esri.com/groups/rstats