
Manage Map Cache Tiles: Errors at 10.2.2

05-13-2014 08:51 AM
DavidColey
MVP Regular Contributor
We are experiencing brand new caching issues at 10.2.2. I am unable to either DELETE_TILES or RECREATE_ALL_TILES from an existing cache. The error we keep receiving is:

Line 30: Output failure, error string = Error moving bundle Failed to cache extent: -9223211.962860 3087885.815688 -9079003.170544 3271103.291998 at scale 577790.55428899999.

The error occurs when running the RECREATE_ALL_TILES or DELETE_TILES operations of the Manage Map Cache Tiles tool (run as a tool or script) from a dedicated server GP cluster machine. Our clustered setup consists of a network file server storing our config, security, and data stores, as well as all server and system directories. Two servers serve as a mapCluster for map and image services, a single server in a gpCluster for caching and user-defined GP services, and a virtualized web server for the Web Adaptor.
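For reference, the failing call looks roughly like the arcpy sketch below. This is a minimal sketch, not our actual script: the connection path, service name, scale values, and instance count are all placeholders, and I'm using the 10.2-era `ManageMapServerCacheTiles_server` signature.

```python
def scales_param(scales):
    """Build the semicolon-delimited scales string the tool expects."""
    return ";".join(str(s) for s in scales)

def recreate_all_tiles(connection, service, scales, instances=2):
    """Run Manage Map Cache Tiles in RECREATE_ALL_TILES mode.

    This is the mode that raises "Error moving bundle" for us at 10.2.2.
    """
    # Imported inside the function so the sketch reads without an ArcGIS install.
    import arcpy
    arcpy.ManageMapServerCacheTiles_server(
        "{}\\{}.MapServer".format(connection, service),  # input service
        scales_param(scales),                            # e.g. "577790.554289;288895.277144"
        "RECREATE_ALL_TILES",                            # or "DELETE_TILES"
        instances)                                       # caching service instances

# Example (placeholder connection and service name):
# recreate_all_tiles(r"GIS Servers\arcgis on gisserver (admin)",
#                    "Basemap", [577790.554289, 288895.277144])
```

The DELETE_TILES failure is the same call with the mode string swapped.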

This has occurred as of last week, whether reading from the GIS SDE database OR from a data-store file gdb. All permissions for directory shares and folder and file security have been accounted for.

When the error occurs, the entire cache becomes corrupt, making any other caching operations inoperable (delete cache, delete tiles, etc.). During the recreate, the bundlx file is being removed but the bundle file is not, thus not allowing the tool to overwrite the bundle from the admin-defined D:/arcgistemp folder(s) on the GP server.

Thus far, the only solution has been to kill the service, stop the site, and then remove the entire cache directory, restart the site, recreate both the service and the cache.

Has anyone encountered this?

Thanks-
David
101 Replies
DavidColey
MVP Regular Contributor

Hi Bob- what you're doing will definitely work.  I'm taking a slightly different approach that may be less intensive step-wise but requires duplicate services and a bit more drive space, so it may not work for everyone:

1. I created duplicate services and caches on my production server, and secured them.

2. Run my recreate-all-tiles Python scripts on the secured services. Because these protected services are not under load, recreating all tiles does not fail. At least it hasn't yet.

3. I then use Windows commands in a bat file to remove the production cache directories, and xcopy the protected bundles over.

That's it.  In this way, I don't have to worry about locking or exporting caches or restarting services, and everything can still run from the task scheduler.  But again, whatever works for your environment, setup, security and all of the constraints we all face . . .
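The remove-and-copy step in 3 above can be sketched as building two Windows commands for the bat file. The cache paths and service names below are hypothetical placeholders, not my actual directories:

```python
# Sketch of the cache-swap step: remove the production cache tree, then xcopy
# the freshly built cache from the secured duplicate service into its place.
# Paths and service names are placeholders.

def swap_commands(prod_cache, staged_cache):
    """Return the Windows commands a .bat file would run, in order."""
    return [
        # rmdir /s /q: delete the whole tree without prompting
        'rmdir /s /q "{}"'.format(prod_cache),
        # xcopy /e: copy subdirectories (including empty ones);
        # /i: treat the target as a directory; /y: overwrite without prompting
        'xcopy "{}" "{}" /e /i /y'.format(staged_cache, prod_cache),
    ]

cmds = swap_commands(r"\\fileserver\arcgiscache\Basemap\Layers",
                     r"\\fileserver\arcgiscache\Basemap_staged\Layers")
```

A scheduled task can then run the generated bat unattended, which is what keeps the whole workflow on the task scheduler.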

BobNutsch1
Regular Contributor

Hi David,

Thanks for that info.  In my testing, I basically did what you are doing, but it didn't work for me because of the locks on the target production files.  I couldn't even remove the production cache directories in order to run xcopy. So the two of us might have something different in our architectures.  🙂  I do like your idea of hiding the services with security; that's more elegant than my approach of having TESTONLY in the name of the service.

This is one of the things that makes debugging this thing tricky: finding the exact sequence of steps that creates the issue.  For example, I spent a lot of hours working on this, and then thought that I had a way to make it work when Esri told me to use Export Cache, thinking that tool would properly work with locks on files.  For a while I thought it was working because in my testing I was only using the first 10 or so cache levels, where there was little if any locking going on.  But once I started doing some updates at deeper scale levels, I began to have all sorts of problems.

Hopefully we'll see some news from Esri soon.

DavidColey
MVP Regular Contributor

Agreed.  I completely had to make sure that I had no leftover .glock files in my config or missing bundlx files in my directories (at all cache levels).  I did that by tearing down and recreating both the public production and secured private services and caches.  Not insignificant, as I have both local (for Mobile, ArcPad) and Web Mercator basemaps and street framework caches.  Same as you, I'm sure.

Then I could use the rmdir and xcopy method....

JonathanIrwin
Esri Contributor

Hi all

I have been watching this thread closely over the last few weeks.  We have a 10.2.1 multi-cluster, multi-server site using shared storage with UNC paths.  We have regularly been seeing the "index was either too large or too small" and "field is not editable" errors when caching our services.  We get these errors both when updating caches and when creating them from fresh.

I was wondering if anyone has seen these following errors in the Caching Tools logs?

-  Error executing tool.: TilesWorker: Output failure, error string = Unable to construct the map service instance. Failed to execute (Manage Map Cache Tiles Worker).

-  Unable to instantiate class for xml schema type: CIMDEGeographicFeatureLayer

-  Invalid xml registry file: e:\program files\arcgis\server\bin\XmlSupport.dat

We first noticed these errors referencing the XmlSupport.dat file when publishing services.  It would happen randomly, then more frequently, up until we couldn't publish at all.  An AGS service restart appears to resolve the issue at that point in time, but eventually it comes back again.

I only recently noticed the same error when caching.  My caching came to a standstill; however, as soon as I restarted the AGS service all was good again - cache updates no problem.

I am really wondering what causes the XmlSupport.dat file to become an issue/corrupt.  Esri support have recommended replacing this file with the same file from a server install at the same build - but this won't prevent further corruption!

So really just wondering if you are all seeing the same errors relating to XmlSupport.dat?

Thanks

Jonathan

BobNutsch1
Regular Contributor

Hello,

I have not seen this, but then maybe that's because we are using 10.2.2. When you cache your services, are you updating cached services that are in use (e.g., being used in production), or are you caching on a staging service into a cache that is not in use?  I think that most of this thread has to do with working with cached services that are in use, and thus file locks not being properly dealt with.

Regards, Bob

DavidColey
MVP Regular Contributor

You are correct, Bob.  I started this thread because, at 10.2.2, the locks were not being handled when caching services under load in a clustered environment.  Prior to 10.2.2 we experienced many incomplete caches, but we never encountered cache corruption due to a .bundle file's exchange (.bundlx) file being removed without also removing the bundle.

The whole point here is that prior to .2, caching services under production was never a problem in a clustered environment. We may have had to re-create missing tiles, but we never had bundle file removal issues, and as such never had to create and cache staging services.

Now, I have essentially reinstated my pre-10.1 method of caching a staging service, and then moving its bundles into my production services' layer directory structure.

Disappointingly, as of yesterday's testing I re-discovered that in order to completely remove the production cache files, I must stop the ArcGIS Server service on each production machine.  In earlier testing, I was somehow getting away with only stopping the production cache service (apologies to Bob).
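That stop-everything step can be scripted too. Below is a minimal sketch that builds `sc` commands to stop (or start) the ArcGIS Server Windows service on each production machine; the machine names are placeholders, and it assumes remote service control is permitted and that the service name is "ArcGIS Server" (it may differ per install):

```python
# Sketch: build `sc` commands to stop/start the "ArcGIS Server" Windows
# service on each production machine before and after the cache swap.
# Machine names below are placeholders.

def service_commands(machines, action):
    """Return remote `sc` commands to stop or start ArcGIS Server."""
    if action not in ("stop", "start"):
        raise ValueError("action must be 'stop' or 'start'")
    return ['sc \\\\{} {} "ArcGIS Server"'.format(m, action)
            for m in machines]

stop_cmds = service_commands(["GISPROD1", "GISPROD2"], "stop")
start_cmds = service_commands(["GISPROD1", "GISPROD2"], "start")
```

Running the stop commands, doing the rmdir/xcopy swap, then running the start commands keeps the whole thing in one scheduled job.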

BobNutsch1
Regular Contributor

Thanks David for the info.  No apology needed, this is a very tricky one to nail down. When I first started working with Esri support on the problem they suggested using Export Cache (which is handy but is a very slow way to update cache) as a workaround.  I thought that using Export Cache fixed the problem but DOH! I initially only tested using the upper cache levels where there was no usage at the time, giving me the mistaken impression the problem was solved. 

Until I hear back from Esri (I haven't heard back yet, even after sending a reminder), my workflow above seems to be working.  I just have to tell the users that, due to the Esri bug, I am forced to pull down the service for a period of time while I update the production cache with the staging cache.

I haven't tried 10.3 pre-release to see if the problem has been fixed. It should be easy enough to test; from what I recall one doesn't even have to be using a clustered environment to duplicate the problem.

MichaelVolz
Esteemed Contributor

Bob:

I thought this issue only occurred in a clustered environment based on David Coley's initial observations.

Please correct me if I am wrong.

BobNutsch1
Regular Contributor

Hello Michael,

It for sure occurs in a clustered environment, but I believe that I duplicated it on a non-clustered environment in some of my early testing.  I may be wrong, but I will have to double-check and get back to you.

BobNutsch1
Regular Contributor

Hi, I did a quick check on a non-clustered environment and cannot reproduce the problem.  It is likely I was mistaken in my comment above; my apologies.
