Distance from one set of xy coordinatesto another set of xy coordinates

SarahNell · ‎09-19-2012

Hi,

Just looking for some brief input into an issue I'm having. I have two shapefiles with xy coordinates for each individual; one has data on disease (0,1) in species 1 and the other has data on disease (0,2) in species 2. Basically, background research suggests that infected individuals in species 1 increase the risk for infection in species 2 individuals. Therefore, I want to calculate if positives in species 2 are more likely to be located near positives in species 1.

I'm fairly familiar with statistics, but I've just started delving into spatial statistics. I think an appropriate analysis might be a bivariate K-function, but I'm not sure how that would be accomplished with ArcGIS and there also might be other more appropriate analyses that I haven't considered.

Thanks in advance.

JeffreyEvans · ‎09-20-2012

If I understand you correctly you wish to test the spatial relationship between the presence of disease in two species. The bivariate Cross-K statistic should get at this spatial relationship. You do not need the negative data [0] in each species. Unfortunately, bivariate point pattern statistics are not available in ArcGIS. Your only real option is conducting the analysis in R (http://cran.r-project.org/). I have included code that provides an example of a Cross-K, with a Monte Carlo simulation for significance testing, using the spatstat package. The simulated example will not be significant, so the simulation envelope will always encompass the full distribution. Applying this to real data will provide more sensible results. My example creates the simulated data in a SpatialPointsDataFrame sp object, which is what you get when you read point shapefiles using the readOGR function in rgdal. This should make it easy to apply the code example to your data. Most everything else is plug-and-play. In the code, the column name containing disease presence is called "MARKS", so you will need to change it to match your data. Please note that the way I have the analysis implemented, it returns the Besag-L(d) statistic which is a standardized version of the Ripley's-K(r) where the expected null is centered at 0. Some additional information and references on K(r) and L(d) is provided in this thread (http://forums.arcgis.com/threads/6160-K-Function).

# ADD REQUIRED PACKAGES require(rgdal) require(sp) require(spatstat) options(scipen=5)  # SIMULATE 1000 RANDOM OBS WITH A BINOMINAL [0,1] MARK COLUMN, THE ASSUMPTION IS THAT A  #  VALUE OF 1 & 2 REPRESENTS PRESENCE OF DESEASE IN EACH SPP i=1000 spp1 <- SpatialPointsDataFrame(cbind(round(runif(i), 2), round(runif(i), 2)),                             data.frame(ID=seq(1,i,1), MARKS=rbinom(i,1,0.5))) spp2 <- SpatialPointsDataFrame(cbind(round(runif(i), 2), round(runif(i), 2)),                             data.frame(ID=seq(1,i,1), MARKS=rbinom(i,1,0.5))) # RECODE spp2 MARKS AS 2 spp2@data$MARKS[spp2@data$MARKS == 1] <- 2   ############################################# ###### START MONTE CARLO CROSS-K ANALYSIS HERE ###### #############################################  # COMBINE INTO A SINGLE sp POINT OBJECT AND COERCE MARKS COL TO FACTOR spp <- rbind( spp1[spp1@data$MARKS > 0 ,], spp2[spp2@data$MARKS > 0 ,])   spp@data$MARKS <- as.factor(spp@data$MARKS)                    # CREATE CONVEX HULL FOR ANALYSIS WINDOW         W=ripras(coordinates(spp))   # COERCE TO spatstat ppp OBJECT AND CREATE MARKS   spp.ppp=as.ppp(coordinates(spp), W)                     spp.ppp=setmarks(spp.ppp, as.factor(spp$MARKS))    # ESTIMATE BANDWIDTH area <- area.owin(W) lambda <- spp.ppp$n/area  ripley <- min(diff(W$xrange), diff(W$yrange))/4    rlarge <- sqrt(1000/(pi * lambda))      rmax <- min(rlarge, ripley) bw <- seq(0, rmax, by=rmax/10)    # CALCULATE PERMUTED CROSS-K AND PLOT RESULTS    Lenv <- envelope(spp.ppp, fun="Kcross", r=bw, i="1", j="2", nsim=99, nrank=5,                   transform=expression(sqrt(./pi)-bw), global=TRUE)             plot(Lenv, main="Cross K", xlab="Distance", ylab="L(r)", legend=F, col=c("white","black","grey","grey"),      lty=c(1,2,2,2), lwd=c(2,1,1,1) )      polygon( c(Lenv$r, rev(Lenv$r)), c(Lenv$lo, rev(Lenv$hi)), col="lightgrey", border="grey")        lines(supsmu(bw, Lenv$obs), lwd=2)        lines(bw, Lenv$theo, lwd=1, lty=2)          legend("topleft", c(expression(hat(L)(r)), "Simulation Envelope", "theo"), pch=c(-32,22),                 col=c("black","grey"), lty=c(1,0,2), lwd=c(2,0,2), pt.bg=c("white","grey"))

View solution in original post

JeffreyEvans · ‎09-20-2012

If I understand you correctly you wish to test the spatial relationship between the presence of disease in two species. The bivariate Cross-K statistic should get at this spatial relationship. You do not need the negative data [0] in each species. Unfortunately, bivariate point pattern statistics are not available in ArcGIS. Your only real option is conducting the analysis in R (http://cran.r-project.org/). I have included code that provides an example of a Cross-K, with a Monte Carlo simulation for significance testing, using the spatstat package. The simulated example will not be significant, so the simulation envelope will always encompass the full distribution. Applying this to real data will provide more sensible results. My example creates the simulated data in a SpatialPointsDataFrame sp object, which is what you get when you read point shapefiles using the readOGR function in rgdal. This should make it easy to apply the code example to your data. Most everything else is plug-and-play. In the code, the column name containing disease presence is called "MARKS", so you will need to change it to match your data. Please note that the way I have the analysis implemented, it returns the Besag-L(d) statistic which is a standardized version of the Ripley's-K(r) where the expected null is centered at 0. Some additional information and references on K(r) and L(d) is provided in this thread (http://forums.arcgis.com/threads/6160-K-Function).

# ADD REQUIRED PACKAGES require(rgdal) require(sp) require(spatstat) options(scipen=5)  # SIMULATE 1000 RANDOM OBS WITH A BINOMINAL [0,1] MARK COLUMN, THE ASSUMPTION IS THAT A  #  VALUE OF 1 & 2 REPRESENTS PRESENCE OF DESEASE IN EACH SPP i=1000 spp1 <- SpatialPointsDataFrame(cbind(round(runif(i), 2), round(runif(i), 2)),                             data.frame(ID=seq(1,i,1), MARKS=rbinom(i,1,0.5))) spp2 <- SpatialPointsDataFrame(cbind(round(runif(i), 2), round(runif(i), 2)),                             data.frame(ID=seq(1,i,1), MARKS=rbinom(i,1,0.5))) # RECODE spp2 MARKS AS 2 spp2@data$MARKS[spp2@data$MARKS == 1] <- 2   ############################################# ###### START MONTE CARLO CROSS-K ANALYSIS HERE ###### #############################################  # COMBINE INTO A SINGLE sp POINT OBJECT AND COERCE MARKS COL TO FACTOR spp <- rbind( spp1[spp1@data$MARKS > 0 ,], spp2[spp2@data$MARKS > 0 ,])   spp@data$MARKS <- as.factor(spp@data$MARKS)                    # CREATE CONVEX HULL FOR ANALYSIS WINDOW         W=ripras(coordinates(spp))   # COERCE TO spatstat ppp OBJECT AND CREATE MARKS   spp.ppp=as.ppp(coordinates(spp), W)                     spp.ppp=setmarks(spp.ppp, as.factor(spp$MARKS))    # ESTIMATE BANDWIDTH area <- area.owin(W) lambda <- spp.ppp$n/area  ripley <- min(diff(W$xrange), diff(W$yrange))/4    rlarge <- sqrt(1000/(pi * lambda))      rmax <- min(rlarge, ripley) bw <- seq(0, rmax, by=rmax/10)    # CALCULATE PERMUTED CROSS-K AND PLOT RESULTS    Lenv <- envelope(spp.ppp, fun="Kcross", r=bw, i="1", j="2", nsim=99, nrank=5,                   transform=expression(sqrt(./pi)-bw), global=TRUE)             plot(Lenv, main="Cross K", xlab="Distance", ylab="L(r)", legend=F, col=c("white","black","grey","grey"),      lty=c(1,2,2,2), lwd=c(2,1,1,1) )      polygon( c(Lenv$r, rev(Lenv$r)), c(Lenv$lo, rev(Lenv$hi)), col="lightgrey", border="grey")        lines(supsmu(bw, Lenv$obs), lwd=2)        lines(bw, Lenv$theo, lwd=1, lty=2)          legend("topleft", c(expression(hat(L)(r)), "Simulation Envelope", "theo"), pch=c(-32,22),                 col=c("black","grey"), lty=c(1,0,2), lwd=c(2,0,2), pt.bg=c("white","grey"))

SarahNell · ‎10-01-2012

Thanks for the response and the suggestion of running the analysis in R. It was very helpful. After looking at your coding, I'm not sure I understand what distance units (x axis) are being reported for the plot. And I've often seen this output accompanied by 95% confidence intervals ... do you think they are necessary or can you confidently determine the strength of the association without them?

Thanks again.

JeffreyEvans · ‎10-02-2012

The distance units are the same as the projected units. You will need to modify the code to use great circle distance if your data is in a geographic projection (lat/long). I would highly recommend projecting your data so its projections units are meters or feet. The K-function tests clustering at a series of distance lags. The x-axis represents each distance lag. The envelope represents the upper and lower (0.5 & 0.95 percentiles) bounds of the Monte Carlo simulation. I would highly recommend reading some of the primary literature on this method which describes the assumptions, application and interpretation of this statistic.

Besag, J.E. (1977) Comments on Ripley's paper. Journal of the Royal Statistical Society B39:193-195.

Besag, J.E. and P.J., Diggle, (1977) Simple Monte Carlo tests for spatial pattern. Applied Statistics 26, 327-333.

Diggle P.J., (2003). Statistical analysis of spatial point patterns. Edward Arnold London

Ripley, B.D. (1977) Modelling Spatial Patterns. Journal of the Royal Statistical Society B39:172-212.

Ripley, B.D. (1976) The Second-Order Analysis of Stationary Point Processes. Journal of Applied Probability 13:255-266.

SarahNell · ‎10-03-2012

Thanks for the response on making sure the projection units are in meters or feet. I think this was the primary issue I was having with the code (and it kept the envelope boundaries from being visible on the plot).