Hi All,
I have total 13 raster files for different years (2001 - 2013). All data are in tiff format and same spatial extent (row and column). I am trying to perform regression line slope/trend analysis between each grid points for 13 raster data sets. I have tried in ArcGIS raster calculator but I won't run the task due to complexity.
The outcome would measure the net change between pixels through my time series data.
So my question is, can it be performed using raster calculator or ArcPy Script? For such type of statistical analysis using ArcGIS, really I am facing one of the big limitation with ArcGIS.
Using excel, it can be calculated using SLOPE command where X (Time) and Y (value) has been used. But for creating a Spatial map of Slope/trend analysis, we need to perform using raster data.
The download link is given below for Sample data sets (Due to size limit, I could not attach the data with here )
Data Link: Dropbox - Water_Use_Efficiency.zip
Please share your experience.
Thanks in advance.
Solved! Go to Solution.
Thanks for sharing and I am glad that the code helped.
Hi Xander Bakker, Recently we got a very useful script from you regarding regression analysis.
Here, I would like to add one more parameter in this script which is P Value (calculate the significance). In the above script, we have been used the Numpy library numpy.linalg.lstsq which gives the output as Slope, intercept etc. However, P value is not calculated. P value also needed because it tells the how significant increasing and decreasing the trend while we interpret the output statistically in front of the scientific community.
So while reading the python library, I found the Scipy library scipy.stats.linregress and i just ran one script on my system and i got the slope, P value. Script is attached below
from numpy import arange,array,ones#,random,linalg
from scipy import stats
xi = arange(0,11)
A = array([ xi, ones(11)])
# linearly generated sequence
y = [154.96, 151.39, 143.87, 152.24, 142.97, 154.23, 141.15, 135.37, 151.50, 143.82, 143.77]
slope, intercept, r_value, p_value, std_err = stats.linregress(xi,y)
print 'slope', slope
print 'p_value', p_value
which gives the output as
Returns:
slope -0.899090909091
slope of the regression line
p-value : 0.142283969084
two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero.
So instead of Numpy, can we use here Numpy and Scipy both library to get the addition p value or in the existing script can we add this function to obtain the p Value as a raster.
Please guide us to solve this.
Thanks & regards
Shouvik Jha
Hi shouvik jha , sure you can.
In the original code in the function RegressionLine on line 102, in the beginning there are two lists being created: lst_x and lst_y. You can use these two lists in the linregress function:
stats.linregress(lst_x, lst_y)
Return the p_value and create a raster from it.
Hi Xander Bakker, Thank you for the suggestions.
I tried the same, I just changed the function in line no 53 and 102 as using Numpy and Scipy both. but I getting the error.
Whatever line I have updated based on the source line which is marked by #
Error regarding 'tuple index out of range'. I have attached below updated code and Error massage. KIndly have a look
I just want to modify our earlier script to add one more parameter as p value raster. Earlier script we have calculated the parameters such ['slope', 'intersept', 'tot_dist', 'mean_dist']. I just want to remove the 'tot_dist', 'mean_dist' and it will replace by ['r_value', 'p_value', 'std_err'] remain will be the same. So the final output raster would be ['slope', 'intersept', 'r_value', 'p_value', 'std_err']
Updated script:
#-------------------------------------------------------------------------------
# Name: Regression Analysis.py
# Purpose: Regression Analysis through time series data
# Author: Xander Bakker
# Source: Geonet
# Created: 30-07-2017
#-------------------------------------------------------------------------------
def main():
print "load modules..."
import arcpy
from scipy import stats
from numpy import arange,array,ones
arcpy.env.overwriteOutput = True
import numpy as np
template = r'D:\NPP_GUJARAT_MONSOON\MODIS_ET\WUE4\r{0}_WUE.TIF'
nodata = -3.4028235e+38
out_template = r'D:\NPP_GUJARAT_MONSOON\MODIS_ET\WUE4\WUE_{0}.TIF'
# out_names = ['slope', 'intersept', 'tot_dist', 'mean_dist']
out_names = ['slope', 'intersept', 'r_value', 'p_value', 'std_err']
print "create numpy list..."
lst_np_ras = []
for i in range(2, 13): # 2 number will start of raster name e.g. 2 = r002_WUE, 3 = r003_WUE
ras_path = template.format("%03d" % (i,))
print " - ", ras_path
ras_np = arcpy.RasterToNumPyArray(ras_path)
lst_np_ras.append(ras_np)
print "read props numpy raster..."
ras_np = lst_np_ras[0]
rows = ras_np.shape[0]
cols = ras_np.shape[1]
print " - rows:", rows
print " - cols:", cols
print "create output numpy array..."
ras_path = template.format("%03d" % (2,)) # 2 number will take the number of raster name e.g. 2 = r002_WUE, 3 = r003_WUE
raster = arcpy.Raster(ras_path)
lst_ras_np_res = []
for out_name in out_names:
print "out_name", out_name
ras_np_res = np.ndarray((rows, cols))
lst_ras_np_res.append(ras_np_res)
print " - out rows:", ras_np_res.shape[0]
print " - out cols:", ras_np_res.shape[1]
print "loop through pixels..."
pix_cnt = 0
for row in range(rows):
for col in range(cols):
pix_cnt += 1
if pix_cnt % 5000 == 0:
print "row:", row, " col:", col, " pixel:", pix_cnt
lst_vals = []
for ras_np in lst_np_ras:
lst_vals.append(ras_np[row, col])
lst_vals = ReplaceNoData(lst_vals, nodata)
# perform calculation on list
# a, b, sum_abs_dif, avg_abs_dif, cnt_elem = RegressionLine(lst_vals)
a, b, sum_abs_dif, avg_abs_dif, cnt_elem = stats.linregress(lst_vals)
if cnt_elem > 1:
slope = a
intersept = b
lst_ras_np_res[0][row, col] = slope
lst_ras_np_res[1][row, col] = intersept
lst_ras_np_res[2][row, col] = sum_abs_dif
lst_ras_np_res[3][row, col] = avg_abs_dif
else:
lst_ras_np_res[0][row, col] = nodata
lst_ras_np_res[1][row, col] = nodata
lst_ras_np_res[2][row, col] = nodata
lst_ras_np_res[3][row, col] = nodata
pnt = arcpy.Point(raster.extent.XMin, raster.extent.YMin) # - raster.meanCellHeight
xcellsize = raster.meanCellWidth
ycellsize = raster.meanCellHeight
print "Write output rasters..."
i = 0
for out_name in out_names:
out_ras = out_template.format(out_name)
print " - ", out_ras
ras_np_res = lst_ras_np_res[i]
ras_res = arcpy.NumPyArrayToRaster(ras_np_res, lower_left_corner=pnt, x_cell_size=xcellsize,
y_cell_size=ycellsize, value_to_nodata=nodata)
ras_res.save(out_ras)
arcpy.DefineProjection_management(in_dataset=out_ras, coor_system="GEOGCS['GCS_WGS_1984',DATUM['D_WGS_1984',SPHEROID['WGS_1984',6378137.0,298.257223563]],PRIMEM['Greenwich',0.0],UNIT['Degree',0.0174532925199433]]")
i += 1
def ReplaceNoData(lst, nodata):
res = []
for a in lst:
if a < nodata / 2.0:
res.append(None)
else:
res.append(a)
return res
# def ReplaceNoData(lst, nodata):
stats.linregress(lst_x, lst_y)
from numpy import arange, array, ones, linalg
months = range(2, 13)
lst_x = []
lst_y = []
cnt = 0
for a in lst:
if not a is None:
lst_x.append(months[cnt])
lst_y.append(a)
cnt += 1
cnt_elem = len(lst_x)
if cnt_elem >= 2:
A = array([ lst_x, ones(len(lst_x))])
# linearly generated sequence
# w = linalg.lstsq(A.T, lst_y)[0]
w = stats.linregress(A.T, lst_y)[0]
# obtaining the parameters of the regression line
a = w[0] # slope
b = w[1] # offset
# calculate acumulated abs difference from trend line
zip_lst = zip(lst_x, lst_y)
sum_abs_dif = 0
for xy in zip_lst:
x = xy[0]
y = xy[1]
calc = x * a + b
abs_dif = abs(y - calc)
sum_abs_dif += abs_dif
avg_abs_dif = sum_abs_dif / cnt_elem
return a, b, sum_abs_dif, avg_abs_dif, cnt_elem
else:
return None, None, None, None, None
if __name__ == '__main__':
main()
Error message:
Traceback (most recent call last):
File "<module1>", line 142, in <module>
File "<module1>", line 61, in main
File "C:\Python27\ArcGIS10.3\lib\site-packages\scipy\stats\stats.py", line 2997, in linregress
elif x.shape[1] == 2:
IndexError: tuple index out of range
Hi xander_bakker, I am still stuck into this matter to modify the code based on SciPy.
My kind request please cooperate to solve the issue.
Scipy can definitely be used, it has more stats functions but it is built on numpy for many of its functions. If you are concerned about the methods of calculating the regression coefficients, just examine the constraints, if any, Generally differences may exist only at the degrees of freedom level, but many of their stats tests have a ddof parameter (ie N vs N-1 on the denominator).
given that your p-value is what it is, you can treat your slope correlation as insignificant. A graph would help
Dan_Patterson Thanks.
Yes, Scipy is more efficient to produce the statistical result and P value tells the how significant the trend. The significance of trend is more important while you interpret any output of trend or correlation. Everybody wants to know the How significance relation.
Hi shouvik jha ,
I implemented the following changes to the script:
#-------------------------------------------------------------------------------
# Name: stats_numpy_scipy_series.py
# Purpose:
#
# Author: Xander
#
# Created: 30-07-2017
# Updated: 28-08-2017
#-------------------------------------------------------------------------------
def main():
print "load modules..."
import arcpy
arcpy.env.overwriteOutput = True
import numpy as np
template = r'C:\GeoNet\WUE2\r{0}_WUE.TIF'
nodata = -3.4028235e+38
out_template = r'C:\GeoNet\WUE2\WUE2_{0}.TIF'
out_names = ['slope', 'intersept', 'tot_dist', 'mean_dist', 'p_value']
print "create numpy list..."
lst_np_ras = []
for i in range(1, 14):
ras_path = template.format("%03d" % (i,))
print " - ", ras_path
ras_np = arcpy.RasterToNumPyArray(ras_path)
lst_np_ras.append(ras_np)
print "read props numpy raster..."
ras_np = lst_np_ras[0]
rows = ras_np.shape[0]
cols = ras_np.shape[1]
print " - rows:", rows
print " - cols:", cols
print "create output numpy array..."
ras_path = template.format("%03d" % (1,))
raster = arcpy.Raster(ras_path)
lst_ras_np_res = []
for out_name in out_names:
print "out_name", out_name
ras_np_res = np.ndarray((rows, cols))
lst_ras_np_res.append(ras_np_res)
print " - out rows:", ras_np_res.shape[0]
print " - out cols:", ras_np_res.shape[1]
print "loop through pixels..."
pix_cnt = 0
for row in range(rows):
for col in range(cols):
pix_cnt += 1
if pix_cnt % 5000 == 0:
print "row:", row, " col:", col, " pixel:", pix_cnt
lst_vals = []
for ras_np in lst_np_ras:
lst_vals.append(ras_np[row, col])
lst_vals = ReplaceNoData(lst_vals, nodata)
# perform calculation on list
a, b, sum_abs_dif, avg_abs_dif, cnt_elem, p_value = RegressionLine(lst_vals)
if cnt_elem > 1:
slope = a
intersept = b
lst_ras_np_res[0][row, col] = slope
lst_ras_np_res[1][row, col] = intersept
lst_ras_np_res[2][row, col] = sum_abs_dif
lst_ras_np_res[3][row, col] = avg_abs_dif
lst_ras_np_res[4][row, col] = p_value
else:
lst_ras_np_res[0][row, col] = nodata
lst_ras_np_res[1][row, col] = nodata
lst_ras_np_res[2][row, col] = nodata
lst_ras_np_res[3][row, col] = nodata
lst_ras_np_res[4][row, col] = nodata
pnt = arcpy.Point(raster.extent.XMin, raster.extent.YMin) # - raster.meanCellHeight
xcellsize = raster.meanCellWidth
ycellsize = raster.meanCellHeight
print "Write output rasters..."
i = 0
for out_name in out_names:
out_ras = out_template.format(out_name)
print " - ", out_ras
ras_np_res = lst_ras_np_res[i]
ras_res = arcpy.NumPyArrayToRaster(ras_np_res, lower_left_corner=pnt, x_cell_size=xcellsize,
y_cell_size=ycellsize, value_to_nodata=nodata)
ras_res.save(out_ras)
arcpy.DefineProjection_management(in_dataset=out_ras, coor_system="GEOGCS['GCS_WGS_1984',DATUM['D_WGS_1984',SPHEROID['WGS_1984',6378137.0,298.257223563]],PRIMEM['Greenwich',0.0],UNIT['Degree',0.0174532925199433]]")
i += 1
def ReplaceNoData(lst, nodata):
res = []
for a in lst:
if a < nodata / 2.0:
res.append(None)
else:
res.append(a)
return res
def RegressionLine(lst):
from numpy import arange, array, ones, linalg
from scipy import stats
months = range(1, 14)
lst_x = []
lst_y = []
cnt = 0
for a in lst:
if not a is None:
lst_x.append(months[cnt])
lst_y.append(a)
cnt += 1
cnt_elem = len(lst_x)
if cnt_elem >= 2:
A = array([ lst_x, ones(len(lst_x))])
# linearly generated sequence
w = linalg.lstsq(A.T, lst_y)[0]
slope, intercept, r_value, p_value, std_err = stats.linregress(lst_x, lst_y)
# obtaining the parameters of the regression line
a = w[0] # slope
b = w[1] # offset
# calculate acumulated abs difference from trend line
zip_lst = zip(lst_x, lst_y)
sum_abs_dif = 0
for xy in zip_lst:
x = xy[0]
y = xy[1]
calc = x * a + b
abs_dif = abs(y - calc)
sum_abs_dif += abs_dif
avg_abs_dif = sum_abs_dif / cnt_elem
return a, b, sum_abs_dif, avg_abs_dif, cnt_elem, p_value
else:
return None, None, None, None, None, None
if __name__ == '__main__':
main()
Please be aware that I did not test the code...
Hi Xander Bakker, many thanks for your great cooperation.
I ran the script. The script is running perfectly and results also correct.
Once again thank you very much for your such great contribution.
Thanks & regards
Shouvik Jha
You are welcome. Great to hear that the code is working!