Translate R function caret::findCorrelation to Python 3 via Pandas using vectorisation

01-20-2017 02:31 AM
TejuNC
by
New Contributor

The R function caret::findCorrelation searches a correlation matrix and returns a vector of integers corresponding to variables which, if removed, would reduce pair-wise correlations among the remaining variables. Here is the R code for this function:

function (x, cutoff = 0.9, verbose = FALSE, names = FALSE, exact = ncol(x) < 100)
{
  if (names & is.null(colnames(x)))
    stop("'x' must have column names when `names = TRUE`")
  out <- if (exact)
    findCorrelation_exact(x = x, cutoff = cutoff, verbose = verbose)
  else findCorrelation_fast(x = x, cutoff = cutoff, verbose = verbose)
  out
  if (names)
    out <- colnames(x)[out]
  out
}

And the function findCorrelation_fast, which is the one I am interested in (with optional arguments removed):

findCorrelation_fast <- function(x, cutoff = .90)
{
  if(any(!complete.cases(x)))
    stop("The correlation matrix has some missing values.")
  averageCorr <- colMeans(abs(x))
  averageCorr <- as.numeric(as.factor(averageCorr))
  x[lower.tri(x, diag = TRUE)] <- NA
  combsAboveCutoff <- which(abs(x) > cutoff)

  colsToCheck <- ceiling(combsAboveCutoff / nrow(x))
  rowsToCheck <- combsAboveCutoff %% nrow(x)

  colsToDiscard <- averageCorr[colsToCheck] > averageCorr[rowsToCheck]
  rowsToDiscard <- !colsToDiscard

  deletecol <- c(colsToCheck[colsToDiscard], rowsToCheck[rowsToDiscard])
  deletecol <- unique(deletecol)
  deletecol
}

I am writing a function that emulates the intent of this function in Python 3 with help from pandas. My implementation contains a nested for loop, which I understand is far from the most efficient way to achieve the desired result. The original R function does the job without any looping.

My two questions are:

  1. Based on my implementation below, is there a Pythonic way to replace the nested for loop with a vectorised implementation?
  2. Related to (1), the R function findCorrelation_fast uses the line averageCorr <- as.numeric(as.factor(averageCorr)). This construction seems both very alien to me and also crucial to the success of the loopless R implementation. Can anyone shed any light on what this line is doing? My intuition tells me that it is being incredibly clever and leveraging some unique behaviour of R.
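On question 2, a minimal sketch of what that line appears to do, as far as I understand R's factor handling: as.factor builds its levels from the sorted unique values, and as.numeric then returns each element's 1-based level code, so every mean is replaced by its dense rank. The helper name factor_codes and the sample values are made up for illustration. Since the later `>` comparisons give the same answers on ranks as on the raw means (apart from values that the factor conversion treats as identical), this step does not look like the thing that makes the R version loop-free.

```python
import numpy as np


def factor_codes(values):
    """Mimic R's as.numeric(as.factor(values)): 1-based position of each
    value among the sorted unique values (a dense rank)."""
    _, inverse = np.unique(values, return_inverse=True)
    return inverse + 1


# hypothetical mean absolute correlations for four variables
average_corr = np.array([0.42, 0.17, 0.42, 0.88])
codes = factor_codes(average_corr)
print(codes)  # [2 1 2 3]
```

pandas offers the same thing through Series.rank(method='dense') if the means are already in a Series.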

My Python implementation and an example of its usage:

import numpy as np
import pandas as pd


def findCorrelated(corrmat, cutoff=0.8):
    """Search a correlation matrix and identify variables that, if removed,
    would reduce pair-wise correlations.

    args:
        corrmat: a correlation matrix (square pandas DataFrame)
        cutoff: pairwise absolute correlation cutoff
    returns:
        list of positional indices of the variables to remove
    """
    if len(corrmat) != len(corrmat.columns):
        raise ValueError('Correlation matrix is not square')
    averageCorr = corrmat.abs().mean(axis=1)

    # set lower triangle and diagonal of correlation matrix to NaN
    # (k=1 keeps only the cells strictly above the diagonal)
    corrmat = corrmat.where(np.triu(np.ones(corrmat.shape, dtype=bool), k=1))

    # where a pairwise absolute correlation is greater than the cutoff value,
    # check whether mean abs. corr. of row or col is greater and cut it
    to_delete = list()
    for col in range(0, len(corrmat.columns)):
        for row in range(0, len(corrmat)):
            if abs(corrmat.iloc[row, col]) > cutoff:
                if averageCorr.iloc[row] > averageCorr.iloc[col]:
                    to_delete.append(row)
                else:
                    to_delete.append(col)

    to_delete = list(set(to_delete))

    return to_delete


# generate some data
df = pd.DataFrame(np.random.randn(50, 25))

# demonstrate usage of function
removeCols = findCorrelated(df.corr(), cutoff=0.01)  # set v. low cutoff as data is uncorrelated
print('Columns to be removed:')
print(removeCols)
uncorrelated = df.drop(df.columns[removeCols], axis=1, inplace=False)
print('Uncorrelated variables:')
print(uncorrelated)

3 Replies
DanPatterson_Retired
MVP Emeritus

Could you reformat your code so it can be examined

/blogs/dan_patterson/2016/08/14/script-formatting 

TejuNC
by
New Contributor

The R function caret::findCorrelation searches a correlation matrix and returns a vector of integers corresponding to variables which, if removed, would reduce pair-wise correlations among the remaining variables. Here is the R code for this function:

function (x, cutoff = 0.9, verbose = FALSE, names = FALSE, exact = ncol(x) <
100)
{
if (names & is.null(colnames(x)))
stop("'x' must have column names when `names = TRUE`")
out <- if (exact)
findCorrelation_exact(x = x, cutoff = cutoff, verbose = verbose)
else findCorrelation_fast(x = x, cutoff = cutoff, verbose = verbose)
out
if (names)
out <- colnames(x)[out]
out
}

And the function findCorrelation_fast, which is the one I am interested in (with optional arguments removed):

findCorrelation_fast <- function(x, cutoff = .90)
{
if(any(!complete.cases(x)))
stop("The correlation matrix has some missing values.")
averageCorr <- colMeans(abs(x))
averageCorr <- as.numeric(as.factor(averageCorr))
x[lower.tri(x, diag = TRUE)] <- NA
combsAboveCutoff <- which(abs(x) > cutoff)

colsToCheck <- ceiling(combsAboveCutoff / nrow(x))
rowsToCheck <- combsAboveCutoff %% nrow(x)

colsToDiscard <- averageCorr[colsToCheck] > averageCorr[rowsToCheck]
rowsToDiscard <- !colsToDiscard

deletecol <- c(colsToCheck[colsToDiscard], rowsToCheck[rowsToDiscard])
deletecol <- unique(deletecol)
deletecol
}
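The ceiling and %% lines are what turn the output of which() back into matrix positions: which() on a matrix returns 1-based linear indices that run down the columns. A small sketch of the same arithmetic in NumPy, using made-up indices for a 4 x 4 matrix; np.unravel_index with order='F' gives the 0-based equivalent.

```python
import numpy as np

n = 4  # nrow(x) for a hypothetical 4 x 4 correlation matrix

# made-up 1-based, column-major linear indices, as which() would return
# for cells strictly above the diagonal
linear = np.array([5, 9, 10, 15])

# the R arithmetic, kept 1-based
cols = np.ceil(linear / n).astype(int)
rows = linear % n  # never 0 here, because upper-triangle rows are < n

# NumPy equivalent, 0-based; order='F' means column-major
rows0, cols0 = np.unravel_index(linear - 1, (n, n), order='F')

print(rows, cols)    # [1 1 2 3] [2 3 3 4]
print(rows0, cols0)  # [0 0 1 2] [1 2 2 3]
```

The modulo shortcut only works because the lower triangle and diagonal were set to NA first; an index in the last row would otherwise come back as row 0.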

I am writing a function that emulates the intent of this function in Python 3 with help from pandas. My implementation contains a nested for loop, which I understand is far from the most efficient way to achieve the desired result. The original R function does the job without any looping.

My two questions are:

  1. Based on my implementation below, is there a Pythonic way to replace the nested for loop with a vectorised implementation?
  2. Related to (1), the R function findCorrelation_fast uses the line averageCorr <- as.numeric(as.factor(averageCorr)). This construction seems both very alien to me and also crucial to the success of the loopless R implementation. Can anyone shed any light on what this line is doing? My intuition tells me that it is being incredibly clever and leveraging some unique behaviour of R.

My Python implementation and an example of its usage:

import numpy as np
import pandas as pd

def findCorrelated(corrmat, cutoff=0.8):
    """Search a correlation matrix and identify variables that, if removed,
    would reduce pair-wise correlations.

    args:
        corrmat: a correlation matrix (square pandas DataFrame)
        cutoff: pairwise absolute correlation cutoff
    returns:
        list of positional indices of the variables to remove
    """
    if len(corrmat) != len(corrmat.columns):
        raise ValueError('Correlation matrix is not square')
    averageCorr = corrmat.abs().mean(axis=1)

    # set lower triangle and diagonal of correlation matrix to NaN
    # (k=1 keeps only the cells strictly above the diagonal)
    corrmat = corrmat.where(np.triu(np.ones(corrmat.shape, dtype=bool), k=1))

    # where a pairwise absolute correlation is greater than the cutoff value,
    # check whether mean abs. corr. of row or col is greater and cut it
    to_delete = list()
    for col in range(0, len(corrmat.columns)):
        for row in range(0, len(corrmat)):
            if abs(corrmat.iloc[row, col]) > cutoff:
                if averageCorr.iloc[row] > averageCorr.iloc[col]:
                    to_delete.append(row)
                else:
                    to_delete.append(col)

    to_delete = list(set(to_delete))

    return to_delete

# generate some data
df = pd.DataFrame(np.random.randn(50,25))

# demonstrate usage of function
removeCols = findCorrelated(df.corr(), cutoff = 0.01) #set v.low cutoff as data is uncorrelated
print('Columns to be removed:')
print(removeCols)
uncorrelated = df.drop(df.columns[removeCols], axis=1, inplace=False)
print('Uncorrelated variables:')
print(uncorrelated)

Hope you can see clearly now...

DanPatterson_Retired
MVP Emeritus

The indentation is completely off...

EDIT   

You can follow up over on Stack Exchange... the code is formatted properly there

Translate R function caret::findCorrelation to Python 3 via Pandas using vectorisation - Stack Overf... 

Did you recopy the original code and paste it with the Python syntax highlighter?

It appears you are using np.triu to manage the correlation matrix, but it is hard to follow.

Have you looked at np.corrcoef and np.cov in numpy (np)? See their inline help.
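For reference, a minimal sketch of getting the correlation matrix from np.corrcoef alone, assuming the variables are in columns (hence rowvar=False). The random data is only a stand-in for the real input. For complete data with no missing values this should give the same Pearson matrix as DataFrame.corr().

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((50, 5))  # 50 observations, 5 variables

# rowvar=False treats each column as a variable, like DataFrame.corr()
corr = np.corrcoef(data, rowvar=False)

print(corr.shape)                      # (5, 5)
print(np.allclose(corr, corr.T))       # symmetric
print(np.allclose(np.diag(corr), 1))   # ones on the diagonal
```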

numpy/function_base.py at 32ade3a75de147027c477a08d427d6f64603edfd · numpy/numpy · GitHub 

Also np.vectorize is a helper to reduce the need for loops.  I am not sure why you need to flip this all out to pandas, unless your resultant array ends up being an object dtype.
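A caveat worth showing: the NumPy documentation describes np.vectorize as a convenience wrapper rather than a performance tool, since it still calls the Python function once per element. The sketch below compares it with np.where for the "which of the pair to drop" decision. The function name pick_to_drop and all the arrays are hypothetical.

```python
import numpy as np


def pick_to_drop(avg_row, avg_col, row, col):
    """Scalar rule: drop the row variable if its mean abs. corr. is larger."""
    return row if avg_row > avg_col else col


# hypothetical pairs above the cutoff, with the mean abs. corr. of each side
avg_rows = np.array([0.7, 0.4, 0.6])
avg_cols = np.array([0.5, 0.9, 0.6])
rows = np.array([0, 1, 2])
cols = np.array([3, 4, 5])

# np.vectorize: same call syntax as a ufunc, but still a Python-level loop
via_vectorize = np.vectorize(pick_to_drop)(avg_rows, avg_cols, rows, cols)

# np.where: the comparison and the selection both run on whole arrays
via_where = np.where(avg_rows > avg_cols, rows, cols)

print(via_vectorize)  # [0 4 5]
print(via_where)      # [0 4 5]
```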

Perhaps you could show an example input array and the desired output array, to see whether findCorrelation_fast can be implemented within numpy directly.
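As a starting point, here is a rough numpy-only sketch of findCorrelation_fast with a small input and its output. It is an approximation of the R logic, not a verified port: the name find_correlation_fast and the 3 x 3 matrix are made up, the as.factor ranking step is skipped and the raw means are compared directly, and np.unique returns the indices sorted and 0-based, whereas R's unique() keeps first-seen order and is 1-based.

```python
import numpy as np


def find_correlation_fast(corr, cutoff=0.9):
    """Return 0-based indices of variables to drop, without Python loops."""
    corr = np.abs(np.asarray(corr, dtype=float))
    if corr.ndim != 2 or corr.shape[0] != corr.shape[1]:
        raise ValueError("Correlation matrix is not square")
    if np.isnan(corr).any():
        raise ValueError("The correlation matrix has some missing values.")

    # mean absolute correlation per variable (colMeans(abs(x)) in R)
    avg = corr.mean(axis=0)

    # positions strictly above the diagonal, then keep those over the cutoff
    rows, cols = np.triu_indices_from(corr, k=1)
    above = corr[rows, cols] > cutoff
    rows, cols = rows[above], cols[above]

    # for each pair, drop the column variable if its mean is larger,
    # otherwise drop the row variable (same tie rule as the R code)
    drop_col = avg[cols] > avg[rows]
    return np.unique(np.concatenate([cols[drop_col], rows[~drop_col]]))


# made-up input: variables 0 and 1 are correlated at 0.95
corr = np.array([[1.00, 0.95, 0.20],
                 [0.95, 1.00, 0.30],
                 [0.20, 0.30, 1.00]])

to_drop = find_correlation_fast(corr, cutoff=0.9)
print(to_drop)  # [1]
```

Variable 1 is dropped because its mean absolute correlation (0.75) is higher than that of variable 0 (about 0.717). With a DataFrame, the result could be passed on as df.drop(df.columns[to_drop], axis=1).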
