Aggregating hourly rasters into daily rasters

10-05-2018 04:38 PM
deleted-user-roJGZJe3MuBa
New Contributor

I have hundreds of 3-hourly soil moisture rasters that we need to aggregate into daily rasters. I am using a Python script run from PyCharm with the ArcGIS Pro Python interpreter. The script runs the first day's aggregation in under 10 seconds, but after that everything slows down: adding each 3-hourly raster takes up to 15 seconds, whereas the entire first day took less time than a single raster does afterwards. You will notice I am using a list comprehension to list my rasters, because using the ListRasters option caused all the temp files to be created in the same location as the soil moisture rasters. My script is as follows:

import arcpy
from arcpy.sa import *
from datetime import datetime
import os

arcpy.CheckOutExtension("Spatial")

location = r"E:\New Folder\\"
arcpy.env.scratchWorkspace = r"E:\SMAP\Scratch\\"
arcpy.env.overwriteOutput = True

rasters = [f for f in os.listdir(location) if f.endswith(".tif")]

current_month = datetime.strptime("20150331", "%Y%m%d")

x = 0
y = 0
total = 0

for r in rasters:

    print(r)
    month = r[0:8]
    month = datetime.strptime(month, "%Y%m%d")

    current_raster = Raster(location + r)
    print("RASTER MADE")

    if current_month.day == month.day:
        
        smap = Con(current_raster > -9999, current_raster)
        print("CON MADE")
        total = total + smap
        print("CON ADDED")
        x += 1

    else:
        print(x)
        arcpy.env.mask = location + "SMERGE.shp"

        total = total / x
        print("TOTAL OBTAINED")
        x = 1
        print("S" + str(current_month.year) + str(current_month.strftime("%m")))
        total.save(r"E:\SMAP\outdaily\\" + "S4" + str(current_month.year) + str(current_month.strftime("%m")) + str(current_month.strftime("%d")) + ".tif")
        print("TOTAL SAVED")
        total = 0
        total = smap


    current_month = month
    y += 1

print(x)
arcpy.env.mask = location + "SMERGE.shp"
total = total / x
print("S" + str(current_month.year) + str(current_month.strftime("%m")))
total.save(r"E:\SMAP\outdaily\\" + "S4" + str(current_month.year) + str(current_month.strftime("%m")) + str(current_month.strftime("%d")) + ".tif")

arcpy.CheckInExtension("Spatial")

Any Ideas will be appreciated.

4 Replies
DanPatterson_Retired
MVP Emeritus

one at a time is not the way to go.

This has come up before, and there are alternatives if you have the memory.

CellStatistics, if you want to stay purely within arcpy and the built-in tools. Otherwise numpy is quick, and its 'nan' functions have improved, which means you have to worry less about defining the nodata values for multiple arrays as long as they are all the same (i.e. -9999 or something).
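For the CellStatistics route, a minimal sketch might look like the following (assuming, as in your script, that the file names start with YYYYMMDD and that -9999 is already registered as NoData in the tifs; otherwise you would still need a Con/SetNull first, and the folder paths here are just placeholders):

import os
from collections import defaultdict

import arcpy
from arcpy.sa import CellStatistics

arcpy.CheckOutExtension("Spatial")
arcpy.env.overwriteOutput = True

location = r"E:\New Folder"        # placeholder input folder
out_folder = r"E:\SMAP\outdaily"   # placeholder output folder

# group the three-hourly tifs by their YYYYMMDD prefix
days = defaultdict(list)
for f in os.listdir(location):
    if f.endswith(".tif"):
        days[f[0:8]].append(os.path.join(location, f))

for day, tifs in sorted(days.items()):
    # "MEAN" with "DATA" averages only the cells that have data,
    # so a NoData cell in one raster does not poison the whole day
    daily_mean = CellStatistics(tifs, "MEAN", "DATA")
    daily_mean.save(os.path.join(out_folder, "S4" + day + ".tif"))

arcpy.CheckInExtension("Spatial")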

https://community.esri.com/message/593354?commentID=593354#comment-593354 

6,000+ rasters: the follow-up thread

 a0
array([[10., nan, nan, nan,  9., 10.,  7.,  9.,  8.,  6., nan,  4.,  5.,  9., 10.],
       [nan, nan,  7.,  6.,  9.,  8.,  4., nan, nan,  7.,  6., 10., nan,  6., 10.],
       [ 8.,  8.,  8., nan,  5.,  9.,  9.,  5.,  8.,  6.,  7., 10., nan, nan,  6.],
       [nan,  7.,  4.,  5., nan,  7., nan, 10., nan, nan,  4., nan,  8.,  7., 10.],
       [ 7., nan, nan, nan,  7.,  8., nan, 10.,  5.,  5., nan, nan, 10., nan,  8.],
       [ 7.,  9.,  7.,  7., nan, nan,  7.,  9., nan, nan,  5., 10.,  6., nan,  5.],
       [ 8., 10., nan,  7., 10., 10., 10.,  9.,  4.,  6., nan,  9., nan,  9., nan],
       [10.,  5.,  7.,  6., nan,  4., nan, nan,  5.,  8.,  8., nan, nan,  5.,  8.],
       [ 4.,  8.,  6.,  7., 10.,  7., nan,  5., 10.,  5.,  7., nan, nan,  5.,  8.],
       [ 9., nan,  9.,  4.,  7., nan, nan,  6., nan, nan, nan,  4., nan, nan, nan]])

a1
array([[nan,  5., nan,  6., 10., nan,  5., nan, nan,  7.,  7.,  4.,  7.,  8.,  5.],
       [nan, nan,  7.,  7.,  9., nan,  5.,  6.,  6.,  8., nan, nan,  5., 10., nan],
       [nan,  4., nan,  8.,  7.,  8.,  9.,  8.,  4., nan,  8.,  5., nan,  9., 10.],
       [ 8., nan,  8.,  8.,  5., nan,  4.,  5.,  5.,  8.,  7.,  6.,  9.,  8., nan],
       [ 7.,  6., 10.,  5., nan,  6.,  8.,  9.,  4., nan, nan,  4.,  5.,  9.,  4.],
       [ 7., 10.,  9., nan,  4.,  8.,  4., nan,  5.,  7.,  4.,  4.,  8., nan,  7.],
       [nan, nan,  8.,  9., nan,  5.,  8., 10., nan, nan,  4.,  4., nan,  6., 10.],
       [ 6.,  4., nan,  4., nan,  5., nan, nan,  5., 10.,  7., nan,  4., 10., 10.],
       [nan, nan,  5., nan, nan,  9., nan, nan,  5., nan, nan,  4.,  5.,  5., 10.],
       [10.,  4.,  5.,  4.,  7.,  7.,  9.,  8., nan,  4., nan,  9., 10.,  4., nan]])

a2
array([[ 4., nan,  9., nan, nan,  9.,  4., 10.,  4., nan,  4.,  4.,  9.,  6., nan],
       [ 6., 10.,  8.,  4.,  8.,  9.,  7., 10., nan,  9.,  7.,  6.,  9.,  5., 10.],
       [nan,  9.,  5., nan,  7., 10.,  9.,  9.,  4.,  5.,  5., nan,  9.,  5.,  6.],
       [nan,  8., nan,  4., nan, nan,  9.,  9.,  4.,  4.,  5., nan,  4., 10.,  9.],
       [ 5.,  7.,  9., nan,  7.,  5., nan,  5., nan, nan,  8., nan, 10., nan, nan],
       [nan,  6.,  5.,  6.,  6.,  6., 10., nan,  5.,  8., nan, 10.,  8.,  8.,  8.],
       [ 9., nan,  6., 10., nan,  8., nan,  7.,  7., nan, nan, nan,  5.,  8.,  8.],
       [ 4.,  5.,  9.,  6., nan, nan,  9.,  7.,  4.,  8.,  6.,  4., 10., nan, nan],
       [nan, nan,  4.,  6., nan, nan,  7.,  6.,  6.,  6.,  5.,  8.,  4.,  8.,  7.],
       [nan,  7., nan, nan,  8., nan,  6.,  9., 10.,  5.,  9.,  4.,  9.,  4.,  9.]])


z = np.array([a0, a1, a2])

# ---- now the sums ----
#

np.sum(z, axis=0)  # ---- if any time at any location has nodata, it is no data

array([[nan, nan, nan, nan, nan, nan, 16., nan, nan, nan, nan, 12., 21., 23., nan],
       [nan, nan, 22., 17., 26., nan, 16., nan, nan, 24., nan, nan, nan, 21., nan],
       [nan, 21., nan, nan, 19., 27., 27., 22., 16., nan, 20., nan, nan, nan, 22.],
       [nan, nan, nan, 17., nan, nan, nan, 24., nan, nan, 16., nan, 21., 25., nan],
       [19., nan, nan, nan, nan, 19., nan, 24., nan, nan, nan, nan, 25., nan, nan],
       [nan, 25., 21., nan, nan, nan, 21., nan, nan, nan, nan, 24., 22., nan, 20.],
       [nan, nan, nan, 26., nan, 23., nan, 26., nan, nan, nan, nan, nan, 23., nan],
       [20., 14., nan, 16., nan, nan, nan, nan, 14., 26., 21., nan, nan, nan, nan],
       [nan, nan, 15., nan, nan, nan, nan, nan, 21., nan, nan, nan, nan, 18., 25.],
       [nan, nan, nan, nan, 22., nan, nan, 23., nan, nan, nan, 17., nan, nan, nan]])

np.nansum(z, axis=0)  # ---- sum the values even if one is missing
 
array([[14.,  5.,  9.,  6., 19., 19., 16., 19., 12., 13., 11., 12., 21., 23., 15.],
       [ 6., 10., 22., 17., 26., 17., 16., 16.,  6., 24., 13., 16., 14., 21., 20.],
       [ 8., 21., 13.,  8., 19., 27., 27., 22., 16., 11., 20., 15.,  9., 14., 22.],
       [ 8., 15., 12., 17.,  5.,  7., 13., 24.,  9., 12., 16.,  6., 21., 25., 19.],
       [19., 13., 19.,  5., 14., 19.,  8., 24.,  9.,  5.,  8.,  4., 25.,  9., 12.],
       [14., 25., 21., 13., 10., 14., 21.,  9., 10., 15.,  9., 24., 22.,  8., 20.],
       [17., 10., 14., 26., 10., 23., 18., 26., 11.,  6.,  4., 13.,  5., 23., 18.],
       [20., 14., 16., 16.,  0.,  9.,  9.,  7., 14., 26., 21.,  4., 14., 15., 18.],
       [ 4.,  8., 15., 13., 10., 16.,  7., 11., 21., 11., 12., 12.,  9., 18., 25.],
       [19., 11., 14.,  8., 22.,  7., 15., 23., 10.,  9.,  9., 17., 19.,  8.,  9.]])

Once you get set up, use RasterToNumPyArray to create the arrays and finally NumPyArrayToRaster to get the answer back to a *.tif; then you have a template to reuse with other data.
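A rough sketch of that round trip, assuming all the tifs for one day share the same extent and cell size and that -9999 marks missing values (day_tifs and the output path are just placeholders):

import numpy as np
import arcpy

day_tifs = [...]   # placeholder: the three-hourly tifs for one day

# use the first raster as the spatial template for the output
template = arcpy.Raster(day_tifs[0])
lower_left = arcpy.Point(template.extent.XMin, template.extent.YMin)
cell_w, cell_h = template.meanCellWidth, template.meanCellHeight

# read each tif, turning -9999 (and true NoData) into np.nan
stack = []
for t in day_tifs:
    a = arcpy.RasterToNumPyArray(t, nodata_to_value=-9999).astype("float64")
    a[a == -9999] = np.nan
    stack.append(a)

# daily mean, ignoring nan cells at each location
daily = np.nanmean(np.array(stack), axis=0)

# back to a raster; -9999 cells become NoData again on the way out
daily[np.isnan(daily)] = -9999
out = arcpy.NumPyArrayToRaster(daily, lower_left, cell_w, cell_h,
                               value_to_nodata=-9999)
out.save(r"E:\SMAP\outdaily\S4_daily.tif")   # placeholder output name

You will probably also want to set arcpy.env.outputCoordinateSystem, since NumPyArrayToRaster does not carry the projection across on its own.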

In the examples in those threads, Xander and I were working with ASCII files being read from a folder, so if your data are all in one location it is even simpler.

Worth a look, because I suspect the sequential adding is slowing the process down as you run out of memory. At least with this alternative you can aggregate in chunks, then add the chunks as needed (i.e. daily to monthly, monthly to yearly, etc.).

deleted-user-roJGZJe3MuBa
New Contributor

I have worked with numpy before on netCDF files and I know it works great. I was just baffled by this specific problem, given that I am running on a server with 32 cores and 256 GB of RAM. I have worked with memory-intensive raster scripts before, such as converting rasters to points or polygons for our mapping needs. I did find an error in my script, but that would only affect the final averaged result, not the tools themselves. I will have to experiment with numpy-to-raster conversion and how that works, given that our final outputs are maps to be overlaid with other datasets. I just don't get why the first 8 rasters work like lightning and produce the same result, and after that everything stalls. It's not as if the script slowed down after processing hundreds of rasters.

Thanks for the suggestion.

deleted-user-roJGZJe3MuBa
New Contributor

Just as an update: I realized I had somehow deleted the line that cleared the mask environment setting after each daily average. The mask environment setting seems to slow down the Con geoprocessing tool much more than it slows the raster map algebra itself (eliminating negative values from the raster vs. doing the actual math on the rasters). Once I applied the mask only to the final result of the operation (the daily average), the script ran fast again. I ran tests on several days to corroborate the results, and running the tools one by one within ArcGIS Pro yielded the same results as the script.
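In case it helps anyone hitting the same thing, a minimal sketch of that ordering, reusing the variable names from the script above (out_name here just stands in for the daily output path):

# accumulate the three-hourly rasters with no mask set
arcpy.env.mask = None
smap = Con(current_raster > -9999, current_raster)
total = total + smap

# ...once the day is complete, apply the mask only to the final average
arcpy.env.mask = location + "SMERGE.shp"
daily = total / x
daily.save(out_name)
arcpy.env.mask = None   # clear it again before the next day's accumulation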

DanPatterson_Retired
MVP Emeritus

Makes sense. If you do this in numpy, it is equivalent to the difference between 'np.sum' and 'np.nansum' or 'np.ma.sum', if you look at numpy at some stage. Interrupting the work to convert classes (i.e. areas) to 'nan' or a 'mask' would be an extra, unneeded step... best to leave it to the end.
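For example, with a toy array the three behave like this:

import numpy as np

z = np.array([1., np.nan, 3.])

np.sum(z)                        # nan - one missing value poisons the plain sum
np.nansum(z)                     # 4.0 - missing values are skipped
np.ma.masked_invalid(z).sum()    # 4.0 - same idea, via a masked array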
