How to avoid job status timeout ("Response already committed") for long-running geoprocessing jobs on ArcGIS Server (Linux)?

DanSlayback1 · ‎02-14-2016

I have some geoprocessing services published to ArcGIS Server that can take many hours to complete (which is expected and normal). This is using ArcGIS Server 10.3.1 on Linux (RHEL), launching the GP service via a REST call. At some point after they've run for several hours (maybe 12 or more), you can no longer check the job status at the REST endpoint, e.g.:

http://arcgisserver:6080/arcgis/rest/services/GPTool/GPServer/GPTool/jobs/jc5cfc333f7af41219b9d90c85...

Normally that brings up a listing of status updates from the GP job. But after some period of time, the returned webpage is blank, and trying to query it brings up SEVERE errors in the ArcGIS Server logs, with two separate entries:

Response already committed. Cannot forward to error page.

Unable to process request. org.apache.jasper.JasperException: java.lang.NullPointerException

Even so, the job appears to be running, as ArcGIS Server reports an instance in use. The timeout for the GP service is sufficient to let it run for 24 hours (at which point, I assume, Server will kill it whether it's complete or not, and that instance will no longer be in use).

I'm not clear if this disappearance of the job status occurs after a set time, or possibly at midnight; at midnight, the service properties are set to recycle (this is the default). I do know that such recycling does not kill running jobs, but maybe it kills the ability to retrieve job status?

I'd appreciate any workarounds if anyone knows of any. My GP services report various messages that normally appear in the job info page, which is very helpful for diagnostics (I suppose I could have them write those messages to a separate log file, but if I can simply retrieve the normally-available job status, that would be simpler!).
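As a rough illustration of the separate-log-file fallback mentioned above, here is a minimal Python sketch. The function names, the logger name, and the log path are all hypothetical; the `also` callback is where a script tool could pass `arcpy.AddMessage` so the same text still reaches the job messages while they remain retrievable.

```python
import logging


def make_job_logger(log_path, name="gp_job"):
    """Return a logger that appends timestamped lines to log_path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Avoid stacking duplicate handlers if this is called more than once.
    if not logger.handlers:
        handler = logging.FileHandler(log_path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


def report(logger, message, also=None):
    """Write message to the log file and, optionally, to a second sink.

    `also` is any callable taking one string (assumption: in a GP script
    this would be arcpy.AddMessage).
    """
    logger.info(message)
    if also is not None:
        also(message)
```

The log file should live outside the job directory (an assumption about where is safe on a given server) so that it survives if the job folder is cleaned up.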

VinceAngelo · ‎02-15-2016

There are many possible issues here.

First off, there's the way the service has been configured. On the Pooling page, there's a value for "The maximum time a client can use a service" (default is 600 sec -- 10 minutes). On the Processes page, there's a "Recycle this configuration every" property, which defaults to 24 hours. What values do you have on these parameters?

Even when these properties are set to values which would permit the job to continue, there's another aspect -- if the job directory remains unchanged for the service use duration, even when the job is running successfully, then the job will be terminated and the job folder removed. I've successfully used a Thread in my ArcObjects Java GP job to update the job folder to prevent preemptive termination.

Finally, you may just need to restructure your service to run in smaller processing units, so that no job exceeds 10-15 minutes of execution, with a "state set" (JSON or ASCII file or table) to track where the last processing step left off, so that execution can seamlessly continue. A servlet would be ideally situated to follow progress and resubmit work units as necessary, though you might be able to just let each job request a new job to continue its work. The sexy part of this option is that you might be able to use multiprocessing to solve the job fragments in parallel, reducing overall runtime.
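A minimal Python sketch of the "state set" idea, assuming the work can be expressed as an ordered list of items and that a small JSON file is an acceptable place to record progress. The names `load_state`, `save_state`, `run_work_unit`, and the `next_index` key are hypothetical, and `os.replace` needs Python 3.3 or later.

```python
import json
import os


def load_state(path):
    """Return the saved state, or a fresh one if no state file exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_index": 0}


def save_state(path, state):
    """Write the state to a temp file, then swap it in with one rename."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_work_unit(items, state_path, process_item, max_items=100):
    """Process up to max_items items, starting where the last run stopped.

    Returns True once every item has been processed.
    """
    state = load_state(state_path)
    start = state["next_index"]
    end = min(start + max_items, len(items))
    for i in range(start, end):
        process_item(items[i])
        # Record progress after each item so an interrupted job resumes here.
        state["next_index"] = i + 1
        save_state(state_path, state)
    return state["next_index"] >= len(items)
```

Each GP job would call `run_work_unit` once with a `max_items` small enough to finish within the 10-15 minute window, and whatever drives the jobs would resubmit until it returns True. If an item is interrupted mid-way it is processed again on the next run, so `process_item` should be safe to repeat.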

- V

DanSlayback1 · ‎02-16-2016

Thanks very much for the info on the dangers of idle directories, and suggestions to break the processing into chunks. That will be a more complicated piece of code, but might have other efficiencies as you point out. In this case, I'm updating a mosaic dataset attribute ('footprint') table, so I'm not sure if doing that in parallel chunks would be a good idea (if it might have locking problems) or not.

In your case, how do you 'update the job folder'? (And can you suggest how that might be done from a Python-based script?)

It would seem Server should be able to keep a directory 'active' for GP jobs that it knows are still running. I wonder if this is just an oversight on ESRI's part (not imagining people might run GP jobs for several hours), or intentional.

In any case, for this service, I had set the maximum time a client can use a service to 36 hours (in seconds), and set the recycling to 48 hours. Even so, at some point well before either, the status page disappears, but it sounds like that is a result of the directory being idle. The GP process/service seems to continue - Server Manager indicates an instance is in use, and CPU is being used. For smaller chunks of data (the norm), this takes a few hours to run, during which I can always retrieve status updates, and it completes successfully. I just have one very large chunk of data I was trying to push through, and it appears to be failing for other reasons.

VinceAngelo · ‎02-16-2016

Multi-threading is so easy in Java that kicking off a thread to "touch" the GP job folder at 5-minute intervals is trivial (once the job folder is identified, the thread just has a sleep(300000) and a File.setLastModified(now)). Porting that to Python would hinge on multithreading, but if your main loop runs in a fraction of the refresh interval, you can just add a loop element to invoke os.utime with the current time on the job folder.
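A minimal Python sketch of both variants, assuming the job folder path is already known. How to locate it is not covered here: one guess to verify on your own server is that it is the parent of `arcpy.env.scratchFolder`. `FolderToucher` and `start_keepalive` are hypothetical names, and whether a background thread behaves well inside a GP service process has not been tested here.

```python
import os
import threading
import time


def touch(path):
    """Set the access and modification times of path to the current time."""
    os.utime(path, None)


class FolderToucher(object):
    """In-loop variant: call poke() on every pass of the main loop."""

    def __init__(self, folder, interval_seconds=300):
        self.folder = folder
        self.interval = interval_seconds
        self._last = 0.0  # forces a touch on the first poke()

    def poke(self):
        """Touch the folder if the interval has elapsed; return True if so."""
        now = time.time()
        if now - self._last >= self.interval:
            touch(self.folder)
            self._last = now
            return True
        return False


def start_keepalive(job_folder, interval_seconds=300):
    """Thread variant: touch job_folder every interval_seconds.

    Returns (stop_event, thread); call stop_event.set() to end the thread.
    """
    stop = threading.Event()

    def worker():
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval_seconds):
            touch(job_folder)

    thread = threading.Thread(target=worker)
    thread.daemon = True  # do not keep the process alive after the job ends
    thread.start()
    return stop, thread
```

The in-loop variant only works if each pass of the main loop is short relative to the interval; a single long-running tool call would need the thread variant instead.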

It's hard to understand how updating a mosaic dataset footprint could take hours to run.

- V

DanSlayback1 · ‎02-17-2016

Thanks for the tip. I'm not familiar with Java or ArcObjects - I just work in Python - but maybe I'll look into doing this from Python at some point. It does seem doable (if the GP service can interact with the job directories...).

But for routine use, the GP service should not take this long, as the mosaic datasets will not be so large (maybe thousands of records, or some limit I can set). I've been struggling to operate on a very large, already-existing mosaic dataset with 600,000 items. I think I can just manually push through what I need to do in this one case, and in the future avoid the timeout issues by limiting mosaic dataset size before running these attribute updates.