Predicting with R - anyone?

3178
3
04-08-2014 04:08 AM
mettegreve
New Contributor III
I hope someone outthere can give me some ideas as to what might be wrong. I'm using R for predicting a resulting raster from several rasters used as predictors. One of these rasters is causing me problems. If I use it in the prediction, R will crash. If I leave it out and only run with the 16 other predicting rasters everything is fine. I've tried reclassifying it and cutting it up into smaller chunks. When cut into three parts, two of them will run and one will crash. All predictor rasters are ESRI Grids, the problematic one is a categorical raster and it looks fine in ArcGIS.

All rasters have the same projection, cell size and extent.

Please if anyone have an idea or suggestion I could try, I'd be very grateful. It's a little off topic, but I might be lucky that someone have tried this before. My code is below, but it runs fine without that one particular raster, so...

TIA,
Mette

library(Cubist)
library(caret)
library(class)

setwd("D:/dtemp/Texture/ModellingMarch2014")
data<-read.table(file="clay_silt_fsand_Adata.txt", header=T, sep=";")
str(data)####data structure

###########give factor data####
data$georeg<-as.factor(data$georeg)
data$landscape<-as.factor(data$landscape)
data$wetlands<-as.factor(data$wetlands)
data$landuse<-as.factor(data$landuse)
data$geology<-as.factor(data$geology)
data$fgjord<-as.factor(data$fgjord)


names(data)

nc<-ncol(data)
data_predictors<-data[,c(6:nc)]


####make cubist model####
ctrl<- trainControl(method = "LGOCV",repeats=10, number=10,
                    summaryFunction = defaultSummary, selectionFunction = "best")
cubist_model<-train(data[,c(6:nc)],data[,2],method="cubist",trControl=ctrl)


###########package for read rasters###
library(raster)
library(rgdal)

#########read raster file###
aspect<-raster("D:/dtemp/Texture/Tiles/Tile1/aspect")
centallav0<-raster("D:/dtemp/Texture/Tiles/Tile1/centallav0")
dirinsola<-raster("D:/dtemp/Texture/Tiles/Tile1/dirinsola")
elevation<-raster("D:/dtemp/Texture/Tiles/Tile1/elevation")
fgjord<-raster("D:/dtemp/Texture/Tiles/Tile1/fgjord")
flowaccu<-raster("D:/dtemp/Texture/Tiles/Tile1/flowaccu")
geology<-raster("D:/dtemp/Texture/Tiles/Tile1/geology")
georeg<-raster("D:/dtemp/Texture/Tiles/Tile1/georeg")
landscape<-raster("D:/dtemp/Texture/Tiles/Tile1/landscape")
landuse<-raster("D:/dtemp/Texture/Tiles/Tile1/landuse")
midslppos<-raster("D:/dtemp/Texture/Tiles/Tile1/midslppos")
mrvbf<-raster("D:/dtemp/Texture/Tiles/Tile1/mrvbf")
sagawi<-raster("D:/dtemp/Texture/Tiles/Tile1/sagawi")
slpdeg<-raster("D:/dtemp/Texture/Tiles/Tile1/slpdeg")
valldepth<-raster("D:/dtemp/Texture/Tiles/Tile1/valldepth")
vdistchn<-raster("D:/dtemp/Texture/Tiles/Tile1/vdistchn")
wetlands<-raster("D:/dtemp/Texture/Tiles/Tile1/wetlands")

predictors<-c(aspect,centallav0,dirinsola,elevation,fgjord,flowaccu,geology,
              georeg,landscape,landuse,midslppos,
              mrvbf,sagawi,slpdeg,valldepth,vdistchn,wetlands)

var.stack <- stack(predictors)###all input should have same extent and cell size

#Cubist_result ###print the result
capture.output(summary(cubist_model), file = "summary(cubist_model_clayA_tile1).txt")###write results out to text file
system.time(predict(var.stack, cubist_model,
                    filename="D:/dtemp/Texture/ModellingMarch2014/clayA_tile1.img",
                    fun=predict, ext=NULL, const=NULL, index=1, na.rm=TRUE, progress="text",
                    format="HFA", overwrite=TRUE))
0 Kudos
3 Replies
JeffreyEvans
Occasional Contributor III
You forgot to explain details on the offensive raster. Is this a factorial variable in the model? If so, are all of the levels in the raster represented in the model? The GDAL bindings to the ESRI API have never been too stable so, you may want to convert your rasters to an "img" or "tif" format.     

Keep in mind that the authors of a given package write the appropriate predict function and then make it available, in the global environment through R's generic "predict" function. Sometimes the package developers do not do the best job at error checking their functions and bugs arise. In this case, not only do you have "predict.cubist" but also "predict.raster" interacting to make your life difficult. If they are handling NA or unallocated factors differently, R may throw an error. This is just speculation but as a first step I would check the match in levels between you raster and model.

Now, lets trim down your code a bit.

First, make sure that your object names do not conflict. There is a base function in R called "data" so you should never name an object the same as a existing function. As R searches the Global environment, this can cause bizarre problems in for/while loops.

Think efficient! For example, if you have a number of objects to operate on think about a loop or lapply. Here is a loop for coercing columns into factors.

for( i in c("georeg", "landscape", "wetlands", "landuse", "geology", "fgjord") ) {    
     data[,i] <- as.factor(data[,i]) } 
 

When possible use indexing, and the number of the index, rather than the name.

for( i in c(1,2,4,6,8,10) ) { data[,i] <- as.factor(data[,i]) } 
 

You are creating many unnecessary objects in reading your rasters. You can use wildcards to pass directory listing to stack. Say that your rasters are "img" format or the only files in the directory you could for example :

var.stack <- stack( list.files(getwd(), pattern="img$", full.names=TRUE)  )

   
You could also create a vector of everything in the directory:

predictors <- list.files()


And then index specific elements in the resulting vector:

var.stack <- stack( predictors[c(1,2,4,7)] )
0 Kudos
mettegreve
New Contributor III
Thanks for your answer Jeffrey, strange sometimes how things just work out when you reach out.

I figured out the exact same yesterday while explaining the problem to a colleague. The problematic raster was indeed factorial and one of these values did not exist in the data input file. It was the code for 'Bedrock' which in Denmark only exists on the island of Bornholm. As I was using soil sample lab results in the input file, no soil samples were taken on bedrock (of course) and this made R crash. As I said, when splitting up into three tiles, two would run and one would not - yes, the one with Bornholm. I'm so happy :-).

You provided me with lots of nice tips and trics - thank you for that.

Have a nice day,
Mette
0 Kudos
JeffreyEvans
Occasional Contributor III
The raster package predict function has error handling for unallocated factor levels so, the culprit must be cubist.predict crashing when presented with a new factor level in "new data". You may want to let the package authors know about this bug. You could easily demonstrate it by modifying the levels in one of the factorial covariates in "data" and back-predicting the model. 

I have some R tutorials on my Quantitative Spatial Ecology website that you may find interesting. In fact, I have a tutorial implementing a similar model but, using random forests (which I would highly recommend you looking into as an alternative to Cubist).
0 Kudos