Things to consider when creating raster proxies for the cloud

08-30-2019 07:27 AM
NCOneMap
New Contributor III

We are in the process of moving from on-prem servers to AWS. We have several image services and underlying data that we need to migrate. We are converting source TIFFs to MRF and will be storing them in S3. As we go through this learning curve we've developed a few questions that hopefully aren't too ignorant. If someone could share their expertise, we would greatly appreciate it.
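For context, a minimal sketch of that TIFF-to-MRF step (using GDAL's Python bindings; the paths and MRF creation options here are illustrative only and will depend on the data):

    # Sketch: convert a source TIFF to MRF with GDAL's Python bindings.
    # Paths and creation options are illustrative.
    from osgeo import gdal

    gdal.UseExceptions()

    src = "D:/imagery/tile_001.tif"   # hypothetical input TIFF
    dst = "D:/mrf/tile_001.mrf"       # hypothetical output MRF

    gdal.Translate(
        dst,
        src,
        format="MRF",
        creationOptions=[
            "COMPRESS=JPEG",   # or LERC/DEFLATE, depending on loss tolerance
            "QUALITY=85",      # JPEG quality; ignored by lossless codecs
            "BLOCKSIZE=512",   # internal tile size
        ],
    )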

  • Why would one want to create overviews and then convert to raster proxies rather than letting the raster proxy cache be created dynamically?
  • Why would someone want to create external raster proxy files rather than having them embedded in the mosaic dataset?
  • How could we estimate the disk space needed for the raster proxy cache? Use the disk space needed for our on-prem overviews? Size of one of our cached image services?
  • What is the best practice, have mosaic datasets in individual FGDB's on each GIS server instance or in an enterprise geodatabase?
  • Is it best to set image compression, pyramid compression, and raster proxy compression to use the same type of compression for performance? All LERC, all JPEG, etc.
  • Historically, our imagery is primarily used for background contextual viewing and we have always used JPEG compression. We have always included a notice to users that because of the compression, it should not be used for any type of analysis. Would using LERC compression remove this caveat? If so, what are the recommended settings we should use?

Thanks again for any insight.

1 Reply
PeterBecker
Esri Regular Contributor

Why would one want to create overviews and then convert to raster proxies rather than letting the raster proxy cache be created dynamically?

Overviews are reduced-resolution versions of the mosaic dataset, i.e., they are the result of mosaicking multiple images together at lower resolution, and they are used when users zoom to small scales that would otherwise require opening a large number of the source images. There is no simple method to have the system generate such overviews automatically. Once you have added your imagery to the mosaic dataset, you typically then create overviews. If you create these locally, then you need to copy them to your cloud storage, and you can reference them through raster proxies. The alternative is to create them directly on the servers in the cloud; by default they would be created on the server file share, though they could go directly on the ephemeral drives. Another alternative is to use Generate Raster to write the overview image as a CRF directly to the cloud store.
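As a minimal sketch of the overview step (assuming an existing mosaic dataset; the paths here are hypothetical and parameter choices will vary):

    # Sketch: define and build overviews for a mosaic dataset with arcpy.
    # The mosaic dataset path and overview folder are hypothetical.
    import arcpy

    md = r"C:\data\imagery.gdb\OrthoMosaic"

    # Define where the overview images will be written; these could later be
    # copied to cloud storage and referenced through raster proxies.
    arcpy.management.DefineOverviews(md, r"D:\overviews")

    # Generate the overview images themselves.
    arcpy.management.BuildOverviews(md)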

Why would someone want to create external raster proxy files rather than having them embedded in the mosaic dataset?

In many cases users create raster proxies from data they have in the cloud in order to have a local representation of the cloud storage (and metadata) on local machines. This can simplify image management for users who prefer to work on local machines. Then, before publishing to a cloud server, they embed the raster proxies in the mosaic dataset so that the mosaic dataset does not reference external files.

How could we estimate the disk space needed for the raster proxy cache? Use the disk space needed for our on-prem overviews? Size of one of our cached image services?

The disk space used for the raster proxy cache is equivalent to the unique areas of imagery accessed by the server. If users access the same area frequently it will remain small; if users go to many different areas it will increase, directly dependent on the extents and resolutions visited. Note that the cache can be cleared at any time, and tools are included to automate this. It is recommended to keep the cache on local ephemeral drives on each server machine. Storing the mosaic dataset overviews on ephemeral drives is an option but may require more management if you have multiple machines. The cache size for a visited area (i.e., RGB 8-bit) can be estimated as:

    CacheSize (MB) ≈ AreaInKm2 * 0.5 / (PixelSize in m)^2

e.g., 20 km2 at 0.25 m resolution is approximately 160 MB.
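That rule of thumb as a small helper in plain Python (the 0.5 factor is the approximation above for 8-bit RGB imagery):

    def estimate_cache_mb(area_km2, pixel_size_m):
        """Approximate raster proxy cache size in MB for 8-bit RGB imagery."""
        return area_km2 * 0.5 / pixel_size_m ** 2

    print(estimate_cache_mb(20, 0.25))   # 160.0 MB, matching the example above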

What is the best practice, have mosaic datasets in individual FGDB's on each GIS server instance or in an enterprise geodatabase?

An enterprise geodatabase stored in RDS is simpler to manage. Any updates are immediately reflected on the servers, and nothing on the servers needs to be changed. Putting the FGDB on the ephemeral drive of each server means you don't need to manage (or pay for) an enterprise geodatabase (e.g., RDS), but you do need to update the servers when the mosaic datasets change. This could be automated but can be fiddly. Also, if you are working with very large mosaic datasets (millions of records), it can be time consuming to copy them to the server machines on each update. I do not recommend putting a file geodatabase on a file share.
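If you do go the per-server FGDB route, the update push can be as simple as the sketch below (paths are hypothetical; in practice you would stop or drain the image service during the swap):

    # Sketch: replace the local FGDB copy on a server's ephemeral drive.
    # Source and target paths are hypothetical.
    import os
    import shutil

    source_fgdb = r"\\staging\share\mosaics.gdb"   # where updates are authored
    target_fgdb = r"D:\local\mosaics.gdb"          # server's ephemeral drive

    if os.path.exists(target_fgdb):
        shutil.rmtree(target_fgdb)                 # remove the stale copy
    shutil.copytree(source_fgdb, target_fgdb)      # copy in the updated FGDB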

Is it best to set image compression, pyramid compression, and raster proxy compression to use the same type of compression for performance? All LERC, all JPEG, etc.

You can mix compression; there is no need to keep it the same throughout. Compression should depend on the data you have and how much information loss you can accept.

Historically, our imagery is primarily used for background contextual viewing and we have always used JPEG compression. We have always included a notice to users that because of the compression, it should not be used for any type of analysis. Would using LERC compression remove this caveat? If so, what are the recommended settings we should use?

If you use lossless compression (e.g., LERC with a suitable tolerance, or deflate) then the pixel values will not be changed from the original values, and no compression artifacts will be added. Whether the data source is suitable for analysis depends on the source itself: if, for example, compression was used in the production process, or the data was resampled in various ways, that could also have introduced artifacts upstream. Then again, many deep learning models are quite invariant to the artifacts created by compression, so the answer also depends on what analysis is to be performed.
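As an illustration of lossless versus lossy settings via the arcpy compression environment (the values here are examples; pick a tolerance appropriate to your data):

    # Sketch: the compression environment controls the compression applied to
    # subsequent raster outputs. Values are illustrative.
    import arcpy

    arcpy.env.compression = "LZ77"      # lossless, deflate-style
    arcpy.env.compression = "LERC 0"    # LERC with max error 0 is lossless;
                                        # a small nonzero tolerance bounds the
                                        # per-pixel error instead
    arcpy.env.compression = "JPEG 85"   # lossy; fine for display, not analysis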