why cache data?

05-03-2018 03:27 PM
DaveOrlando
Occasional Contributor III

I've been reading lots about the issues with the cache and resetting the index etc etc.

I don't understand the purpose of the cache. Can somebody please explain why an Open Data site is not accessing the data directly?

There are plenty of easy mechanisms to go from REST endpoint to JSON to shapefile/KML/GDB quite quickly so I don't understand the concept behind caching the data. Is it really just for speed?

I think when people are downloading Open Data they can accept a longer wait. It proves that it is actually fetching new data.

I would much rather see an option for 'do not cache'

Thanks

8 Replies
Asgharkhan
Occasional Contributor

Cached data is only used in basemaps, because a cached basemap loads quickly at every scale. You can also upload data to the server without caching. But if you upload a basemap without caching, it does not work well and you can't scale it.

DanielFenton1
Regular Contributor

Hi Dave,


Thanks for your question! Let me first separate out two concepts, caching for downloads and indexing for search.

To enable your users to easily search for and retrieve content from your sites, we store a set of metadata about each layer, csv, document etc in an index.

To enable your users to download data as KML, CSV, GeoJSON, and shapefiles, we cache all features in every layer or table stored on your site in a filesystem. There are several reasons for using this cache.

1. ArcGIS Server does not support all of these export formats. Some are supported but only in certain configurations and versions of Server that are part of ArcGIS Enterprise and ArcGIS Online. This means that in order to offer these download formats, we must extract the raw data from your servers and run them through a translation process. You are correct that there are many ways to do this process. E.g. there are some javascript packages that can do this in the browser. However, the only software tool that is feature complete for this process that can handle many different kinds of source data is GDAL. That cannot be run in the browser.

2. Downloadable files may be extremely large, e.g. gigabytes. It takes a lot of computation to produce these files. It would not be efficient for every user to have to produce the transformation in their browser even if the shortcomings outlined above did not exist.

3. It's often quite an intensive operation to extract all the features from a Feature Service. As Server was not designed directly for this procedure, our systems have to find a way to query for all features one chunk at a time. If every user who requested a download had to go through this process, it would quickly overwhelm the servers providing the data. Last year, ArcGIS Hub served 12.3 million downloads. Some, e.g. GeoJSON extracts used by web applications, were downloaded hundreds of thousands of times.

4. "I think when people are downloading Open Data they can accept a longer wait. It proves that it is actually fetching new data." It's an interesting point, and we used to have a process that forced users to wait for downloads to recache with new data. After receiving a lot of feedback from customers and users who felt the process was not a good user experience, we changed the procedure to only cache in the background. With that setting, we feel we can best serve users who come expecting a quick download. Our analytics also show that downloads increased significantly when we moved to this background caching process.
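To make point 3 concrete, here is a minimal sketch of pulling a layer "one chunk at a time". It is not Hub's actual code. It assumes the layer supports pagination through the REST query parameters `resultOffset` and `resultRecordCount` and signals remaining rows with `exceededTransferLimit`. The names `iter_all_features` and `make_fake_layer` are hypothetical, and the fake layer stands in for a real HTTP request so the loop can be run offline.

```python
from typing import Callable, Dict, Iterator


def iter_all_features(
    fetch_page: Callable[[Dict[str, object]], dict],
    page_size: int = 1000,
) -> Iterator[dict]:
    """Yield every feature from a layer by requesting one page at a time."""
    offset = 0
    while True:
        params = {
            "where": "1=1",
            "outFields": "*",
            "f": "json",
            "resultOffset": offset,
            "resultRecordCount": page_size,
        }
        page = fetch_page(params)
        features = page.get("features", [])
        yield from features
        # Stop when the page is empty or the service reports nothing left.
        if not features or not page.get("exceededTransferLimit", False):
            break
        offset += len(features)


def make_fake_layer(total: int, max_record_count: int):
    """Hypothetical in-memory stand-in for a layer's /query endpoint."""
    rows = [{"attributes": {"OBJECTID": i}} for i in range(1, total + 1)]

    def fetch_page(params: Dict[str, object]) -> dict:
        start = int(params["resultOffset"])
        # A server never returns more rows per request than its own limit.
        count = min(int(params["resultRecordCount"]), max_record_count)
        chunk = rows[start:start + count]
        return {"features": chunk, "exceededTransferLimit": start + count < total}

    return fetch_page


features = list(iter_all_features(make_fake_layer(2500, 1000)))
```

Against a real service, `fetch_page` would issue a request to the layer's `query` endpoint with these parameters. Every download that bypassed a cache would repeat the whole loop against the source server.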

With respect to issues users face with either our index or our cache being out of date, we are currently working on the next generation of our indexing technology. That will be rolling out this year and should solve most of the issues users face today. We will also be able to apply this technology to proactively detect when downloads should be updated. Our challenge is that ArcGIS Server does not, by default, report when data has been updated. Server also does not notify our download system when data needs to change. There are some techniques we can apply to check in a lightweight way and we will be researching those this year.

I'm sorry this process has been confusing for you. I hope I've been able to help you and others better understand why our systems work the way they do. If not, please feel free to let me know!

Daniel Fenton

Lead Engineer for ArcGIS Hub Search and Downloads

DaveOrlando
Occasional Contributor III

Hi Daniel,

Thanks very much for the response, I really appreciate the time you put into it. I can also appreciate the complexity you must face with all the different formats and the larger datasets.

I may have to create a geoprocessing script to create a shapefile and KML on the fly from a Hosted Feature Service. I've created a few in the past so it shouldn't be too much of a stretch for me.

We're dealing with Emergency Response and Evacuation Zones. It's great that most agencies will use our REST endpoints directly, but some are non-ESRI users and need a shapefile or KML download.

Thanks again,

NickShannon2
New Contributor III

Thanks for the information Daniel. 

It would be greatly appreciated if you could clarify what the 'update date' means. 

According to doc.arcgis.com:  "For datasets that are set to automatic, the download cache is dropped every 24 hours and is regenerated the next time a user downloads the dataset."
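The quoted behaviour (drop after 24 hours, regenerate on the next download) can be sketched as a lazily rebuilt cache. This is only an illustration of the documented wording, not Esri's implementation. `LazyDownloadCache`, `build`, and the injectable `clock` are hypothetical names introduced for the example.

```python
import time
from typing import Callable, Dict, Tuple

DAY_SECONDS = 24 * 60 * 60


class LazyDownloadCache:
    """Serve a stored file until it is older than ttl, then rebuild on demand."""

    def __init__(
        self,
        build: Callable[[str], bytes],
        ttl: float = DAY_SECONDS,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._build = build      # expensive export, e.g. extract + convert
        self._ttl = ttl
        self._clock = clock      # injectable so the behaviour can be tested
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, dataset_id: str) -> bytes:
        now = self._clock()
        entry = self._entries.get(dataset_id)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]      # still fresh: no work on the source server
        # Expired or never built: regenerate only because someone asked.
        data = self._build(dataset_id)
        self._entries[dataset_id] = (now, data)
        return data
```

Under this model nothing is rebuilt on a schedule. A dataset that nobody downloads is never regenerated, and a file served from the cache can be up to 24 hours behind the source.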

So, the download cache for all datasets is refreshed every 24 hours, or earlier if someone initiates a download.

What does the update-date mean for the State Road Network below?  The dataset is sourced from ArcGIS Server and is updated weekly. From what I can tell, the date reflects the last time someone initiated a download - rather than the actual date it was updated (or the 24-hour cache refresh). 

We currently don't have 'editor tracking' turned on.  Given that the download cache is refreshed every 24-hours, what is the advantage of turning on 'editor tracking'?  

Turning on 'Editor Tracking' results in a schema change which is undesirable. 

Thanks in advance. 

Open Data update Date

PatrickHammons1
Esri Contributor

Hi Nick,

The update date is currently triggered by a change to the item properties within ArcGIS Online (thumbnail, tags, description, etc), but this is changing soon! Later this year, that date will reflect the last time either the item properties or the data itself was updated, along with a hover message indicating what actually changed (e.g. "Data updated", "Metadata updated")

To my knowledge, turning on editor tracking has no impact on the data cache. Correct me if I'm wrong, Daniel Fenton.

Patrick

KevinHibma
Esri Contributor

Hi Nick,

A little late to the party here, but your question on dates -- I'm working with a client that is in this exact same situation. They have a process that updates their Server data (daily, weekly, etc.), but the Open Data site has no knowledge of that update, so it looks out of date because the ArcGIS item doesn't know about the data updates.

Using the ArcGIS Python API, I've put together some sample scripts. As part of their scripted processes to update Server services, we call some scripts which simply "touch" the arcgis.com item, thus updating the metadata and the date you see in Open Data. Feel free to hack them up. The specific function can be seen here: OpenData/codesamples.py at master · khibma/OpenData · GitHub (make sure to grab Manager.py). There are many ways you could do this: simply run it against all arcgis.com items, or couple it with some process to update only relevant items.
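For readers who just want the shape of the idea, a rough sketch of such a "touch" might look like the following. It is not the linked script. It assumes that a successful `Item.update()` call refreshes the item's modified date even when the saved value is unchanged, and `touch_item` and `touch_by_id` are hypothetical helper names.

```python
from typing import Any, Dict


def touch_item(item: Any) -> bool:
    """Re-save an item's existing description so the portal records an update."""
    # Assumption: writing back the same value is enough to bump the date.
    props: Dict[str, Any] = {"description": item.description or ""}
    return bool(item.update(item_properties=props))


def touch_by_id(portal_url: str, username: str, password: str, item_id: str) -> bool:
    """Look up an item by id and touch it."""
    # Imported lazily so touch_item can be tried without the arcgis package.
    from arcgis.gis import GIS

    gis = GIS(portal_url, username, password)
    item = gis.content.get(item_id)
    if item is None:
        raise ValueError(f"No item with id {item_id!r}")
    return touch_item(item)
```

Calling `touch_by_id` at the end of the job that refreshes the Server data keeps the item's date in step with the data. If re-saving an unchanged value turns out not to move the date, the linked repository is the reference to follow.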

NickShannon2
New Contributor III

Thanks Kevin, I will take advantage of your python scripts on GitHub.

This will be a good fix pending the update Patrick referred to.

With regards to editor tracking, one advantage is that it enables the dataset for offline use.  But if it has no impact on data caching or the 'update' date, then I see no need to enable editor tracking.

It would be helpful if the Esri documentation was clearer on how Open Data works behind the scenes.

For example, how is data cached for ArcGIS Server datasets that are updated hourly or every 5 minutes?

AaidaHoney
New Contributor

You can set up a cache on any nonterminal node. When you set up a cache on a node, the cache is filled with the data that passes through the node the next time you run the data stream. From then on, the data is read from the cache (which is stored on disk in a temporary directory) rather than from the data source.

