Logistic Regression in ArcGIS is this a possibility?

AnthoniaOnyeahialam · ‎12-08-2011

Is binary logistic regression in ArcGIS a possibility ? I know one can do linear regression but binary response regression is what am not sure of. Because I have this sort of data which is also mapped so am wondering what I can do with it in ArcGIS.

JeffreyEvans · ‎06-10-2013

I would recommend cutting out the middle-man and running this directly in R. I rewrote the R code so you can run it directly. Set arguments in the "USER ARGUMENTS" section and the rest of the code is automated. The OutShp will be written to the working directory. This will require the sp, rgdal and rms libraries.

####################################################
# USER ARGUMENTS 
####################################################
# SET WORKING DIRECTORY
setwd("D:/TMP")

# SET VARIABLE NAMES
dependentVar = "SppDen"
independentVars = c("Var1","Var2","Var3")

usePenalty=TRUE         # USE PENALTY
InShp="MyShape"         # IN SHAPEFILE (NO .shp NEEDED)
OutShp="LogitModel"     # OUT SHAPEFILE (NO .shp NEEDED) 
diagTable = "Diag.csv"  # DIAGNOSTIC TABLE
coefTable = "Coef.csv"  # COEFFICENT TABLE

####################################################
# START MODEL
####################################################
require(sp)
require(rgdal)
require(rms)

# READ SHAPEFILE 
shp <- readOGR(getwd(), InShp)

# CHECK VARIABLE NAMES 
if ( is.na(match(dependentVar,names(shp@data))) )
  stop("Dependent Variable not present in data")  

xNames <- intersect(independentVars,names(shp@data)) 
  if (length(xNames) < length(independentVars)) 
    stop("Mismatch in Independent Variable Names")

# CREATE FORMULA
form=as.formula(paste(dependentVar, paste(independentVars, collapse='+'), sep='~'))

# LOGISTIC REGRESSION WITH AIC
fit <- lrm(form, data=shp@data, x = TRUE, y = TRUE) 
  bf <- pentrace(fit, seq(.2,1,by=.05))
  if (usePenalty) {
      pen = bf$penalty
  } else {
      pen = 0.0
  }

allPens = bf$results.all[,1]
allAICs = bf$results.all[,3]
for (i in 1:length(allPens)){ 
    penValue = allPens
    if (penValue == pen){
        aic = allAICs
        }
    }

if (usePenalty){ fit = update(fit, penalty=bf$penalty) }

# RESIDUAL ERROR
res <- residuals.lrm(fit)
  resSTD <- (res - mean(res)) / sqrt(var(res))

# ADD RESIDUALS, STANDARDIZED RESIDUALS AND PROBABILITIES 
shp@data <- data.frame(shp@data,Residual=res, StdResid=resSTD, 
                Probs=predict(fit,shp@data[,rownames(fit$var)[-1]],
    type="fitted") )  

# WRITE SHAPEFILE    
writeOGR(shp, dsn=getwd(), OutShp, driver="ESRI Shapefile", 
         check_exists=TRUE, overwrite_layer=TRUE)
   
# WRITE COEFFICENT AND DIAGNOSTIC TABLES 
allIndVars = c("Intercept")
allIndVars = append(allIndVars, independentVars)
k = length(allIndVars)
  d = matrix(0, k, 4)
    d[,1] = fit$coefficients
      d[,2] = sqrt(diag(fit$var))
        d[,3] = d[,1] / d[,2]
          d[,4] = pnorm(abs(d[,3]), lower.tail = FALSE) * 2.0
coefList = list("Variable"=allIndVars, "Coef"=d[,1], 
               "StdError"=d[,2], "Wald"=d[,3], "Prob"=d[,4])
coefFrame = data.frame(coefList)
  write.csv(coefFrame, coefTable, row.names=FALSE)  

daigFrame <- data.frame(Names=c(names(fit$stats),"PEN","AIC"),
                      Value=c(as.vector(fit$stats), pen, aic))
  write.csv(daigFrame, diagTable, row.names=FALSE)

paolapasino · ‎09-06-2013

https://github.com/Esri/R-toolbox-py

Try this updated tool box it works ( not fully) for me

Paola

JeffreyEvans · ‎09-09-2013

Here is logistic regression R code that will work with the current toolbox (all the dependencies are correct). Copy the code into a text editor then save and replace "LogitWithR.r" in the "Scripts" directory of the toolbox.

##############################################################################
# PROGRAM: LogitRegression (logistic regression called form ArcGIS)
# USE: LOGISTIC (BIONOMIAL) REGRESSION
# REQUIRES: R > 2.15.0, rms
#                                                                                                                                                                                                     
# ARGUMENTS: 
#       ldata    DATAFRAME OBJECT CONTANING VARIABLES 
#       y        RESPONSE VARIABLE IN ldata                       
#       x        INDEPENDENT VARIABLES(S) IN ldata
#       penelty  APPLY REGRESSION PENELTY (TRUE/FALSE)           
#       ...      ADDITIONAL ARGUMENTS PASSED TO lrm                          
# 
# VALUE:  
#      A LIST OBJECT CONTANING OBJECTS:
#        model        lrm MODEL OBJECT 
#        diagTable    DATAFRAME OF REGRESSION DIAGNOSTICS
#        coefTable    DATAFRAME OF REGRESSION COEFFICENTS
#        Residuals    DATAFRAME OF RESIDUALS AND STANDARDIZED RESIDUALS
#
# REFERENCES:
#    Le Cessie S, Van Houwelingen JC: Ridge estimators in logistic regression. 
#      Applied Statistics 41:191�??201, 1992.
#
#    Shao J: Linear model selection by cross-validation. JASA 88:486�??494, 1993.
#                                                                       
# EXAMPLES: 
#    require(sp)
#    require(rms)
#    
#    data(meuse)
#      coordinates(meuse) <- ~x+y   
#        meuse@data <- data.frame(DepVar=rbinom(dim(meuse)[1], 1, 0.5), meuse@data)
#          names(meuse@data)
#    
#    # RUN Bernoulli Trials TEST
#    binom.test( c(length(meuse[meuse@data$DepVar == 1 ,]$DepVar),
#                length(meuse[meuse@data$DepVar == 0 ,]$DepVar)), 0.65)
#        
#    # RUN LOGISTIC MODEL
#    lmodel <- LogitRegression(meuse@data, y="DepVar", x=c("dist","cadmium","copper")) 
#      lmodel$model
#      lmodel$diagTable
#      lmodel$coefTable
#    
#    # ADD RESIDUALS, STANDARDIZED RESIDUALS AND PROBABILITIES 
#    meuse@data <- data.frame(meuse@data, Residual=lmodel$Residuals[,1], 
#                     StdResid=lmodel$Residuals[,2], 
#                     Probs=predict(lmodel$model,meuse@data[,rownames(lmodel$model$var)[-1]],
#          type="fitted") )  
#    
#    # PLOT PROBABILITIES     
#    library(lattice)
#      trellis.par.set(sp.theme())
#        spplot(meuse, c("Probs"))     
#                                                                                          
# CONTACT: 
#     Jeffrey S. Evans
#     Senior Landscape Ecologist 
#     The Nature Conservancy - Central Science
#     Adjunct Assistant Professor
#     Zoology and Physiology
#     University of Wyoming
#     Laramie, WY
#     (970)672-6766
#     jeffrey_evans@tnc.org
######################################################################################   
LogitRegression <- function (ldata, y, x, penalty=TRUE, ...) {
  if (!require (rms)) stop("sp PACKAGE MISSING")
   if ( is.na(match(y,names(ldata))) )
        stop("Dependent Variable not present in data")  
    xNames <- intersect(x,names(ldata)) 
      if (length(xNames) < length(x)) 
        stop("Mismatch in Independent Variable Names")
    form=as.formula(paste(y, paste(x, collapse='+'), sep='~'))
   fit <- lrm(form, data=ldata, x=TRUE, y=TRUE, ...) 
        bf <- pentrace(fit, seq(0.2,1,by=0.05))
          if (penalty) {
            pen = bf$penalty
          } else {
            pen = 0.0
        }
      allPens=bf$results.all[,1]
      allAICs=bf$results.all[,3]
        for (i in 1:length(allPens)){ 
            penValue=allPens
              if (penValue == pen){
                  aic = allAICs
                  }
              }
    if (penalty){ fit = update(fit, penalty=bf$penalty) }
      res <- residuals.lrm(fit)
        resSTD <- (res - mean(res)) / sqrt(var(res))
      allIndVars <- c("Intercept")
      allIndVars <- append(allIndVars, x)
        k <- length(allIndVars)
          d <- matrix(0, k, 4)
            d[,1] <- fit$coefficients
              d[,2] <- sqrt(diag(fit$var))
                d[,3] <- d[,1] / d[,2]
                  d[,4] <- pnorm(abs(d[,3]), lower.tail = FALSE) * 2.0
      coefList=list("Variable"=allIndVars, "Coef"=d[,1], 
                     "StdError"=d[,2], "Wald"=d[,3], "Prob"=d[,4])
      coefFrame=data.frame(coefList) 
      diagFrame=data.frame(Names=c(names(fit$stats),"PEN","AIC"),
                           Value=c(as.vector(fit$stats), pen, aic))
    return(list(model=fit, diagTable=diagFrame, coefTable=coefFrame, 
             Residuals=data.frame(res=res,resSTD=resSTD))) 
  }

### Version check
Rv <- paste(round(as.numeric(R.version$major),digits=0),
            as.numeric(R.version$minor),sep=".")
    if (Rv < 2.14) stop("NEED TO UPGRADE VERSION OF R TO >= 2.14")   
  
#### Load required packages
print("Loading Libraries....")
  if (!require(rms)) stop("rms PACKAGE MISSING")
  if (!require(sp)) stop("sp PACKAGE MISSING")
  if (!require(rgdal)) stop("rgdal PACKAGE MISSING")
  if (!require(foreign)) stop("foreign PACKAGE MISSING")

#### Get Arguments 
Args = commandArgs()
  inputFC = sub(".shp", "", Args[5], ignore.case=TRUE)
    outputFC = sub(".shp", "", Args[6], ignore.case=TRUE)
      dependentVar = Args[7]
        independentVarString = Args[8]
         usePenalty = as.integer(Args[9])
      usePenalty <- usePenalty == 1
    coefTable = sub(".csv", "", Args[10], ignore.case=TRUE)
  diagTable = sub(".csv", "", Args[11], ignore.case=TRUE)
print(paste(commandArgs(), collapse=" "))

### Extract path and shapefile names
tmp <- unlist(strsplit(inputFC, "/"))[-length(unlist(strsplit(inputFC, "/")))] 
  path <- paste(tmp,collapse="/") 
    in.shp.name <- unlist(strsplit(inputFC, "/"))[length(unlist(strsplit(inputFC, "/")))]
      out.shp.name <- unlist(strsplit(outputFC, "/"))[length(unlist(strsplit(outputFC, "/")))]

#### Get variable Names 
independentVars = strsplit(independentVarString, ";")
  independentVars = c(unlist(independentVars))

### Read shapefile using rgdal
shp <- readOGR(path, in.shp.name)  
  
### Logistic regression model
print("Running model....")

lmodel <- LogitRegression(shp@data, y=dependentVar, x=independentVars, 
                          penalty=usePenalty)
  print( lmodel$model )

### Add columns to data and write shapefile 
print("Writing Output....")
  shp@data$Probs <- predict(lmodel$model,shp@data[,rownames(lmodel$model$var)[-1]], 
                            type="fitted") 
    shp@data$Residual = lmodel$Residuals[,1]
    shp@data$StdResid = lmodel$Residuals[,2]
writeOGR(shp, path, out.shp.name, driver="ESRI Shapefile", check_exists=TRUE, 
         overwrite_layer=TRUE) 

### Write  comma seperated ASCII flatfile of model coefficients and diagnositics 
###   model        lrm MODEL OBJECT 
###   diagTable    DATAFRAME OF REGRESSION DIAGNOSTICS
###   coefTable    DATAFRAME OF REGRESSION COEFFICENTS
###   Residuals    DATAFRAME OF RESIDUALS AND STANDARDIZED RESIDUALS
write.csv(lmodel$coefTable, coefTable)
write.csv(lmodel$diagTable, diagTable)

#if (AIF == TRUE) {
#    p <- pentrace(lmodel$model, penalty=seq(0.1,1,by=0.05), which="aic.c")
#   plot(p)
#  }
print("Model Complete...")

MarilynMontgomery · ‎12-11-2013

Hi Dr. Evans,

thank you for posting the script for R for logistic regression. I started using R only a few weeks ago and I got it to work without the "middle man" (arcgis).

However I wanted to make sure, this is not an autologistic regression, correct? I want the spatial covariate for a autologistic/spatial logistic model; and it doesn't look like this script takes the geography of the observations into account. Am I right?

Thanks,

Marilyn

JeffreyEvans · ‎12-11-2013

Marilyn,
You are correct this is not an autologistic model. However, all you have to do to specify an autologistic regression using this code is to spatially lag your dependent variable and include it on the x side of the model as an independent variable (autocovariate). You can easily spatially lag a variable using the spdep package.

The caveat with this type of model is that there is an assumption that the autocovariate does not correlate with the other independent variables. If this is the case then the model becomes quite biased and the coefficients often misleading. It is highly recommend that you explore mixed effect models and treat the autocorrelation as a random effect. This type of model can be specified using the GLMM function in the nlme package. If you would like to correctly specify an autologistic model you could put it in an MCMC context and specify the autocorrelation function as a design matrix.

Just in case you have not read them I attached the Dormann (2007) paper on autologistic regression and well as the He et al., (2003) simulation paper.

Best,
Jeff

MarilynMontgomery · ‎12-26-2013

Hi Jeff,

Thank you very much for your helpful reply. Does the lag.listw command in R do the same thing as autocov_dist, if I use a distance-based weight matrix with lag.listw and specify the same distances for lag and autocovariate? It seems like I could use either of these commands to calculate an autocovariate for autologistic regression.

And could you give me any more guidance on how to "correctly specify an autologistic model ... put it in an MCMC context and specify the autocorrelation function as a design matrix"? When you say autocorrelation function as a design matrix, are you referring to the autocovariate?

Thanks again!

Marilyn

JeffreyEvans · ‎12-27-2013

The "autocov_dist" and "lag.listw" functions do not result in the same thing. For a autologistic model you want to use the "autocov_dist". The "lag.listw" results in a sparse matrix based on the spatial lag of the defined neighbor relationship. In concert with knearneigh (where k = (n-1)) you could trick this function to return lagged weights but the "autocov_dist" function is specifically designed to produce a autocovariate.

I have to say that an MCMC specification of a model is a very complicated procedure and requires either writing your own MCMC function, which is a huge undertaking if you are not familiar with the mathematical/statistical theory, or using Jags/WinBUGS through an R interface.

The design matrix is defining a specific autocovariance function in the parameter portion of the Bayesian model rather than depending on an autocovariate, which can bias the model. This type of model is really beyond the scope of this forum, as is much of the R advice thus far. If you have further questions I would recommend contacting me off list. jeffrey_evans<at>tnc.org