The R function caret::findCorrelation searches a correlation matrix and returns a vector of integers corresponding to variables which, if removed, would reduce pair-wise correlations among the remaining variables. Here is the R code for this function:
function (x, cutoff = 0.9, verbose = FALSE, names = FALSE, exact = ncol(x) < 100)
{
    if (names & is.null(colnames(x)))
        stop("'x' must have column names when `names = TRUE`")
    out <- if (exact)
        findCorrelation_exact(x = x, cutoff = cutoff, verbose = verbose)
    else findCorrelation_fast(x = x, cutoff = cutoff, verbose = verbose)
    out
    if (names)
        out <- colnames(x)[out]
    out
}
And the function findCorrelation_fast, which is the one I am interested in (with optional arguments removed):
findCorrelation_fast <- function(x, cutoff = .90)
{
    if(any(!complete.cases(x)))
        stop("The correlation matrix has some missing values.")
    averageCorr <- colMeans(abs(x))
    averageCorr <- as.numeric(as.factor(averageCorr))
    x[lower.tri(x, diag = TRUE)] <- NA
    combsAboveCutoff <- which(abs(x) > cutoff)

    colsToCheck <- ceiling(combsAboveCutoff / nrow(x))
    rowsToCheck <- combsAboveCutoff %% nrow(x)

    colsToDiscard <- averageCorr[colsToCheck] > averageCorr[rowsToCheck]
    rowsToDiscard <- !colsToDiscard

    deletecol <- c(colsToCheck[colsToDiscard], rowsToCheck[rowsToDiscard])
    deletecol <- unique(deletecol)
    deletecol
}
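For readers coming from Python, the ceiling and %% lines in findCorrelation_fast appear to be converting the 1-based, column-major positions returned by which() back into column and row numbers. Here is a minimal sketch of that arithmetic in NumPy. The 4x4 matrix and the variable names are made up for illustration, and this is my reading of the R code rather than anything taken from caret's documentation.

```python
import numpy as np

# Hypothetical 4x4 correlation matrix, invented purely for illustration.
corr = np.array([
    [1.00, 0.95, 0.20, 0.10],
    [0.95, 1.00, 0.30, 0.92],
    [0.20, 0.30, 1.00, 0.40],
    [0.10, 0.92, 0.40, 1.00],
])
n = corr.shape[0]
cutoff = 0.9

# Boolean mask of the strict upper triangle, the part R leaves non-NA.
upper = np.triu(np.ones((n, n), dtype=bool), k=1)
above = upper & (np.abs(corr) > cutoff)

# R's which() reports 1-based positions counted down the columns
# (column-major order), so flatten with order="F" and add 1.
combs_above_cutoff = np.flatnonzero(above.flatten(order="F")) + 1

cols_to_check = np.ceil(combs_above_cutoff / n).astype(int)  # 1-based columns
rows_to_check = combs_above_cutoff % n                       # 1-based rows

# The same decomposition using NumPy's own helper (0-based result).
rows0, cols0 = np.unravel_index(combs_above_cutoff - 1, (n, n), order="F")

print(rows_to_check, cols_to_check)  # prints [1 2] [2 4]
```

The modulo would return 0 for an entry in the last row, but that cannot happen here: in the strict upper triangle the row number is always smaller than the column number, so it never reaches nrow(x).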
I am writing a function that emulates the intent of this function in Python 3 with help from pandas. My implementation contains a nested for loop, which I understand is far from the most efficient way to achieve the desired result. The original R function does the job without any looping.
My two questions are:
1. How can I replace the nested for loop with a vectorised implementation?
2. findCorrelation_fast uses the line averageCorr <- as.numeric(as.factor(averageCorr)). This construction seems both very alien to me and also crucial to the success of the loopless R implementation. Can anyone shed any light on what this line is doing? My intuition tells me that it is being incredibly clever and leveraging some unique behaviour of R.

My Python implementation and an example of its usage:
import numpy as np
import pandas as pd

# calculate pair-wise correlations
def findCorrelated(corrmat, cutoff = 0.8):
    ### search correlation matrix and identify pairs that if removed would reduce pair-wise correlations
    # args:
    #   corrmat: a correlation matrix
    #   cutoff: pairwise absolute correlation cutoff
    # returns:
    #   variables to be removed
    if(len(corrmat) != len(corrmat.columns)):
        raise ValueError('Correlation matrix is not square')
    averageCorr = corrmat.abs().mean(axis = 1)

    # set lower triangle and diagonal of correlation matrix to NA
    # (k = 1 excludes the diagonal; plain bool replaces np.bool, which NumPy 1.24 removed)
    corrmat = corrmat.where(np.triu(np.ones(corrmat.shape), k = 1).astype(bool))

    # where a pairwise absolute correlation is greater than the cutoff value,
    # check whether mean abs.corr of a or b is greater and cut it
    to_delete = list()
    for col in range(0, len(corrmat.columns)):
        for row in range(0, len(corrmat)):
            if(abs(corrmat.iloc[row, col]) > cutoff):
                if(averageCorr.iloc[row] > averageCorr.iloc[col]):
                    to_delete.append(row)
                else:
                    to_delete.append(col)

    to_delete = list(set(to_delete))
    return to_delete

# generate some data
df = pd.DataFrame(np.random.randn(50,25))

# demonstrate usage of function
removeCols = findCorrelated(df.corr(), cutoff = 0.01) # set v.low cutoff as data is uncorrelated
print('Columns to be removed:')
print(removeCols)
# select by column label: df.index holds row labels, not column labels
uncorrelated = df.drop(df.columns[removeCols], axis = 1, inplace = False)
print('Uncorrelated variables:')
print(uncorrelated)
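On the second question, here is a hedged reading of the line. as.factor builds a factor whose levels are the sorted distinct values, and as.numeric then returns each element's level code. So the line appears to turn the mean absolute correlations into dense ranks: 1 for the smallest distinct value, with ties sharing a rank. The later step only compares these numbers with >, so the raw means should give the same answer apart from near-ties. That suggests the line is not what removes the loop. A minimal NumPy sketch of the same idea, with made-up values:

```python
import numpy as np

# Hypothetical mean absolute correlations, one per variable.
average_corr = np.array([0.42, 0.17, 0.42, 0.88, 0.05])

# np.unique sorts the distinct values; return_inverse gives, for every element,
# the position of its value in that sorted list. Adding 1 mimics R's 1-based codes.
_, codes = np.unique(average_corr, return_inverse=True)
dense_rank = codes + 1

print(dense_rank)  # prints [3 2 3 4 1]
```

Comparing two entries of dense_rank with > gives the same result as comparing the corresponding entries of average_corr, which is why the ranking looks optional for a Python port.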
Could you reformat your code so it can be examined
The R function caret::findCorrelation searches a correlation matrix and returns a vector of integers corresponding to variables which, if removed, would reduce pair-wise correlations among the remaining variables. Here is the R code for this function:
function (x, cutoff = 0.9, verbose = FALSE, names = FALSE, exact = ncol(x) < 100)
{
    if (names & is.null(colnames(x)))
        stop("'x' must have column names when `names = TRUE`")
    out <- if (exact)
        findCorrelation_exact(x = x, cutoff = cutoff, verbose = verbose)
    else findCorrelation_fast(x = x, cutoff = cutoff, verbose = verbose)
    out
    if (names)
        out <- colnames(x)[out]
    out
}
And the function findCorrelation_fast, which is the one I am interested in (with optional arguments removed):
findCorrelation_fast <- function(x, cutoff = .90)
{
    if(any(!complete.cases(x)))
        stop("The correlation matrix has some missing values.")
    averageCorr <- colMeans(abs(x))
    averageCorr <- as.numeric(as.factor(averageCorr))
    x[lower.tri(x, diag = TRUE)] <- NA
    combsAboveCutoff <- which(abs(x) > cutoff)

    colsToCheck <- ceiling(combsAboveCutoff / nrow(x))
    rowsToCheck <- combsAboveCutoff %% nrow(x)

    colsToDiscard <- averageCorr[colsToCheck] > averageCorr[rowsToCheck]
    rowsToDiscard <- !colsToDiscard

    deletecol <- c(colsToCheck[colsToDiscard], rowsToCheck[rowsToDiscard])
    deletecol <- unique(deletecol)
    deletecol
}
I am writing a function that emulates the intent of this function in Python 3 with help from pandas. My implementation contains a nested for loop, which I understand is far from the most efficient way to achieve the desired result. The original R function does the job without any looping.
My two questions are:
1. How can I replace the nested for loop with a vectorised implementation?
2. findCorrelation_fast uses the line averageCorr <- as.numeric(as.factor(averageCorr)). This construction seems both very alien to me and also crucial to the success of the loopless R implementation. Can anyone shed any light on what this line is doing? My intuition tells me that it is being incredibly clever and leveraging some unique behaviour of R.

My Python implementation and an example of its usage:
import numpy as np
import pandas as pd

# calculate pair-wise correlations
def findCorrelated(corrmat, cutoff = 0.8):
    ### search correlation matrix and identify pairs that if removed would reduce pair-wise correlations
    # args:
    #   corrmat: a correlation matrix
    #   cutoff: pairwise absolute correlation cutoff
    # returns:
    #   variables to be removed
    if(len(corrmat) != len(corrmat.columns)):
        raise ValueError('Correlation matrix is not square')
    averageCorr = corrmat.abs().mean(axis = 1)

    # set lower triangle and diagonal of correlation matrix to NA
    # (k = 1 excludes the diagonal; plain bool replaces np.bool, which NumPy 1.24 removed)
    corrmat = corrmat.where(np.triu(np.ones(corrmat.shape), k = 1).astype(bool))

    # where a pairwise absolute correlation is greater than the cutoff value,
    # check whether mean abs.corr of a or b is greater and cut it
    to_delete = list()
    for col in range(0, len(corrmat.columns)):
        for row in range(0, len(corrmat)):
            if(abs(corrmat.iloc[row, col]) > cutoff):
                if(averageCorr.iloc[row] > averageCorr.iloc[col]):
                    to_delete.append(row)
                else:
                    to_delete.append(col)

    to_delete = list(set(to_delete))
    return to_delete

# generate some data
df = pd.DataFrame(np.random.randn(50,25))

# demonstrate usage of function
removeCols = findCorrelated(df.corr(), cutoff = 0.01) # set v.low cutoff as data is uncorrelated
print('Columns to be removed:')
print(removeCols)
# select by column label: df.index holds row labels, not column labels
uncorrelated = df.drop(df.columns[removeCols], axis = 1, inplace = False)
print('Uncorrelated variables:')
print(uncorrelated)
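On the first question, one possible loop-free version is sketched below. It is not caret's code and not a definitive port. The name find_correlated_vectorised is made up, and the sketch assumes the cutoff applies to absolute correlations, as in the R function. It also assumes that a tie in mean absolute correlation drops the column member of the pair, which is what the nested loop does.

```python
import numpy as np
import pandas as pd

def find_correlated_vectorised(corrmat, cutoff=0.8):
    """Return positions of columns to drop; a loop-free sketch, not caret's code."""
    abs_corr = corrmat.abs().to_numpy()
    average_corr = abs_corr.mean(axis=1)

    # Row and column positions of every pair in the strict upper triangle.
    rows, cols = np.triu_indices_from(abs_corr, k=1)

    # Keep only the pairs whose absolute correlation exceeds the cutoff.
    above = abs_corr[rows, cols] > cutoff
    rows, cols = rows[above], cols[above]

    # For each flagged pair, drop the member with the larger mean absolute
    # correlation; on a tie the column member is dropped.
    drop_row = average_corr[rows] > average_corr[cols]
    to_delete = np.where(drop_row, rows, cols)
    return np.unique(to_delete).tolist()

# Usage on random data, mirroring the example above.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.standard_normal((50, 8)))
print(find_correlated_vectorised(df_demo.corr(), cutoff=0.1))
```

np.triu_indices_from replaces both the masking step and the two loops, because it lists each pair exactly once.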
Hope you can see clearly now...
The indentation is completely off...
EDIT
You can follow up over on Stack Exchange... the code is formatted properly there
Did you recopy the original code and paste it with the Python syntax highlighter?
It appears you are using np.triu to manage the correlation matrix, but it is hard to follow.
Have you looked at the inline help for np.corrcoef and np.cov in numpy (np)?
numpy/function_base.py at 32ade3a75de147027c477a08d427d6f64603edfd · numpy/numpy · GitHub
Also np.vectorize is a helper to reduce the need for loops. I am not sure why you need to flip this all out to pandas, unless your resultant array ends up being an object dtype.
Perhaps you could show an input array as an example and a desired output array, to see if findCorrelation_fast can be implemented within numpy directly.
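To make the suggestion concrete, here is a small NumPy-only sketch that goes from a raw input array to the pairs a cutoff would flag. The data are made up: two nearly identical columns plus one independent column. The 0.9 cutoff and all variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)
base = rng.standard_normal(200)

# Made-up input: columns 0 and 1 are near-copies, column 2 is independent noise.
data = np.column_stack([
    base,
    base + 0.01 * rng.standard_normal(200),
    rng.standard_normal(200),
])

# rowvar=False tells np.corrcoef that variables are in columns, as in pandas.
corr = np.abs(np.corrcoef(data, rowvar=False))

# Positions of every pair above the diagonal, then keep those over the cutoff.
rows, cols = np.triu_indices_from(corr, k=1)
flagged = corr[rows, cols] > 0.9
pairs = list(zip(rows[flagged].tolist(), cols[flagged].tolist()))
print(pairs)
```

With this input the only flagged pair should be columns 0 and 1. Deciding which member of each pair to drop would then need the mean absolute correlations, as in the R function.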