
Translate R function caret::findCorrelation to Python 3 via Pandas using vectorisation

Question asked by tejuseo.nc on Jan 20, 2017
Latest reply on Jan 21, 2017 by Dan_Patterson

The R function caret::findCorrelation searches a correlation matrix and returns a vector of integers corresponding to variables which, if removed, would reduce pair-wise correlations among the remaining variables. Here is the R code for this function:

function (x, cutoff = 0.9, verbose = FALSE, names = FALSE, exact = ncol(x) < 100)
{
  if (names & is.null(colnames(x)))
    stop("'x' must have column names when `names = TRUE`")
  out <- if (exact)
    findCorrelation_exact(x = x, cutoff = cutoff, verbose = verbose)
  else findCorrelation_fast(x = x, cutoff = cutoff, verbose = verbose)
  out
  if (names)
    out <- colnames(x)[out]
  out
}

And the function findCorrelation_fast, which is the one I am interested in (with optional arguments removed):

findCorrelation_fast <- function(x, cutoff = .90){
  if(any(!complete.cases(x)))
    stop("The correlation matrix has some missing values.")
  averageCorr <- colMeans(abs(x))
  averageCorr <- as.numeric(as.factor(averageCorr))
  x[lower.tri(x, diag = TRUE)] <- NA
  combsAboveCutoff <- which(abs(x) > cutoff)

  colsToCheck <- ceiling(combsAboveCutoff / nrow(x))
  rowsToCheck <- combsAboveCutoff %% nrow(x)

  colsToDiscard <- averageCorr[colsToCheck] > averageCorr[rowsToCheck]
  rowsToDiscard <- !colsToDiscard

  deletecol <- c(colsToCheck[colsToDiscard], rowsToCheck[rowsToDiscard])
  deletecol <- unique(deletecol)
  deletecol
}
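For readers who think in NumPy rather than R, the index arithmetic in the middle of that function can be sketched as follows. In R, which() returns 1-based linear indices in column-major order, and the ceiling() and %% lines turn each index back into a column and a row. The matrix and the names `x`, `upper`, and `hits` below are made up for illustration and do not come from caret.

```python
import numpy as np

# Hypothetical 4x4 symmetric "correlation" matrix, made up for illustration.
x = np.array([
    [1.00, 0.95, 0.10, 0.20],
    [0.95, 1.00, 0.30, 0.92],
    [0.10, 0.30, 1.00, 0.40],
    [0.20, 0.92, 0.40, 1.00],
])
cutoff = 0.9
n = x.shape[0]

# Keep only the strictly upper triangle, like x[lower.tri(x, diag = TRUE)] <- NA.
upper = np.where(np.triu(np.ones_like(x, dtype=bool), k=1), np.abs(x), np.nan)

# R-style 1-based, column-major linear indices of entries above the cutoff.
# Comparisons against NaN are False, so the masked entries drop out.
hits = np.flatnonzero(upper.flatten(order="F") > cutoff) + 1

cols_to_check = np.ceil(hits / n).astype(int)  # 1-based column numbers
rows_to_check = hits % n                       # 1-based row numbers
```

The modulo never yields 0 here because the last row has no entries in the strictly upper triangle, which is presumably why the R code can use %% without a special case.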

I am writing a function that emulates the intent of this function in Python 3 with help from pandas. My implementation contains a nested for loop, which I understand is far from the most efficient way to achieve the desired result. The original R function does the job without any looping.

My two questions are:

  1. Based on my implementation below, is there a Pythonic way to replace the nested for loop with a vectorised implementation?
  2. Related to (1), the R function findCorrelation_fast uses the line averageCorr <- as.numeric(as.factor(averageCorr)). This construction seems both very alien to me and also crucial to the success of the loopless R implementation. Can anyone shed any light on what this line is doing? My intuition tells me that it is being incredibly clever and leveraging some unique behaviour of R.
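One reading of the line in question (2): as.factor() on a numeric vector takes the sorted unique values as its levels, and as.numeric() on the resulting factor returns the integer level codes. The net effect would be to replace each mean correlation with its dense rank, where the smallest value gets 1 and ties share a rank. A minimal pandas sketch of that reading, using made-up values and hypothetical variable names:

```python
import pandas as pd

# Made-up mean absolute correlations, for illustration only.
average_corr = pd.Series([0.42, 0.17, 0.42, 0.88])

# Dense rank: 1 for the smallest distinct value, ties share a rank.
dense = average_corr.rank(method="dense").astype(int)

# The same result via sorted factor codes (0-based in pandas, so add 1).
codes, levels = pd.factorize(average_corr, sort=True)
codes = codes + 1
```

Because ranking preserves order, the later > comparison should come out the same on the ranks as on the raw means. If that reading is right, this line is probably not what makes the R version loop-free.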

My Python implementation and an example of its usage:

import numpy as np
import pandas as pd

# calculate pair-wise correlations
def findCorrelated(corrmat, cutoff = 0.8):
    ### search correlation matrix and identify pairs that if removed would reduce pair-wise correlations
    # args:
    #   corrmat: a correlation matrix (square pandas DataFrame)
    #   cutoff: pairwise absolute correlation cutoff
    # returns:
    #   positions of the variables to be removed

    if len(corrmat) != len(corrmat.columns):
        raise ValueError('Correlation matrix is not square')

    averageCorr = corrmat.abs().mean(axis = 1)

    # set lower triangle and diagonal of correlation matrix to NaN
    # (k = 1 keeps only the strictly upper triangle; np.bool no longer exists in NumPy, so use the built-in bool)
    corrmat = corrmat.where(np.triu(np.ones(corrmat.shape), k = 1).astype(bool))

    # where a pairwise absolute correlation is greater than the cutoff value,
    # check whether mean abs.corr of a or b is greater and cut it
    to_delete = list()
    for col in range(0, len(corrmat.columns)):
        for row in range(0, len(corrmat)):
            if abs(corrmat.iloc[row, col]) > cutoff:
                if averageCorr.iloc[row] > averageCorr.iloc[col]: to_delete.append(row)
                else: to_delete.append(col)

    to_delete = list(set(to_delete))

    return to_delete

# generate some data
df = pd.DataFrame(np.random.randn(50, 25))

# demonstrate usage of function
removeCols = findCorrelated(df.corr(), cutoff = 0.01) # set v.low cutoff as data is uncorrelated
print('Columns to be removed:')
print(removeCols)
# select by column labels (df.index holds row labels, not column labels)
uncorrelated = df.drop(df.columns[removeCols], axis = 1, inplace = False)
print('Uncorrelated variables:')
print(uncorrelated)
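One way the nested loop in findCorrelated might be replaced, offered as a sketch rather than a definitive answer: enumerate the strictly upper-triangle pairs once with np.triu_indices_from, filter them with a boolean mask, and pick which member of each pair to drop with np.where. The function name `find_correlated_vectorised` is hypothetical. The sketch compares absolute correlations, as the R function does, and it drops the column on ties as the loop does, so it is not guaranteed to reproduce caret's output exactly.

```python
import numpy as np
import pandas as pd

def find_correlated_vectorised(corrmat, cutoff=0.8):
    """Hypothetical loop-free variant of findCorrelated (a sketch, not caret's algorithm)."""
    if corrmat.shape[0] != corrmat.shape[1]:
        raise ValueError("Correlation matrix is not square")

    abs_corr = corrmat.abs().to_numpy()
    average_corr = abs_corr.mean(axis=1)

    # Row/column positions of the strictly upper triangle (row < col).
    rows, cols = np.triu_indices_from(abs_corr, k=1)

    # Keep only the pairs whose absolute correlation exceeds the cutoff.
    above = abs_corr[rows, cols] > cutoff
    rows, cols = rows[above], cols[above]

    # For each pair, drop the member with the larger mean absolute correlation
    # (ties drop the column, as in the loop version).
    drop_row = average_corr[rows] > average_corr[cols]
    to_delete = np.where(drop_row, rows, cols)

    return np.unique(to_delete).tolist()

# Usage on random data, with a fixed seed so the run is repeatable.
df = pd.DataFrame(np.random.default_rng(0).standard_normal((50, 25)))
remove_cols = find_correlated_vectorised(df.corr(), cutoff=0.01)
uncorrelated = df.drop(columns=df.columns[remove_cols])
```

The returned values are integer positions, so they are mapped through df.columns before dropping. That keeps the example working when the columns carry names rather than the default integer labels.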
