Long-running thread stops without exception

MatthesRieke · ‎08-29-2014

Hi everybody,

my Java SOE (AGS 10.2) does a set of database queries (via IQueryDef) running in a scheduler thread (invoked via Quartz 2.2.1) that take a few minutes. These queries are scheduled once a day and dump the results into a file on the HDD, creating a fast-access cache of the results.

I now did some stress testing of this cache updates, making it run every 20 minutes and observing the timestamp of the files. The results are a bit patchy. It seems that from time to time the update job does not run until the end (file does not get updated, last logging statement is not executed). The most dubious thing about this that not a single exceptions is being logged. I am catching RuntimeException and Throwable (logging, then re-throwing) already but nothing comes up in the AGS log files. Though, the original scheduler thread (managed by Quartz) is in waiting mode, looking like the execution has finished successfully.

I ran out of ideas after the migration from ScheduledExecutorService to Quartz did not change anything. Does someone have an idea of what could be done? May it be that ArcObjects or COM gets in the way somehow here?

Thanks in advance,

Matthes

ErinBrimhall · ‎08-29-2014

Hi Matthes,

I observed a similar sounding problem with a .NET SOAP SOE where long-running calls would simply end. No exceptions caught; no ArcGIS Server log entries; just, nothing.

Ultimately, I had to convert the long-running functions of the SOE into Geoprocessing tools and a Geoprocessing service, which are supposedly more appropriate for intensive operations that take more than a couple seconds.

Everything worked smoothly after making this conversion.

I don't know how appropriate a change like this would be for your particular situation, but the problem sounded familiar and I wanted to share my experience.

MatthesRieke · ‎09-01-2014

Hi Erin,

thanks for your feedback. It's a pitty that something like this happens. As a (java) developer you do not expect that something like this happens. At least some sort of log should shed light on the issue. I really have no idea how a sequantial method can be stopped without any sort of exception or error.

As you also had this issue with a .NET SOE it seems like that the problem is happening somewhere deep inside the ArcObjects code.

Regarding the outsourcing of the cache update to a GP service, I have the feeling that this does not meet my requirements. I simply want to provide an installable SOE to the customer without any additional configuration/deployment work to be done on the production server.. Nevertheless I appreciate sharing your solution!

kind regards,

Matthes

MatthesRieke · ‎09-03-2014

I did some additional testing, debugging and logging on this issue.

First of all, I simulated some long running task with simple Thread.sleep() calls, making a scheduled thread work for more than 10 minutes three times per hour. Not a single execution of this simulated workers crashed in the way described above. That lead me to the consumption that there is probably something going wrong within the ArcObjects interaction.

So I created an additional observer thread, that logs the thread's state and dumps its StackTrace to a file once it exceeds the 10 minute mark. The results are of one of the two following kinds:

Thread-15 102 RUNNABLE

com.esri.arcgis.interop.NativeObjRef.nativeVtblInvokeNative(Native Method)

com.esri.arcgis.interop.NativeObjRef.a(Unknown Source)

com.esri.arcgis.interop.Dispatch.vtblInvoke(Unknown Source)

com.esri.arcgis.geodatabase.ICursorProxy.nextRow(Unknown Source)

MyCallingClass.pullData(MyCallingClass.java:142)

or

Thread-22 176 RUNNABLE

com.esri.arcgis.interop.NativeObjRef.nativeVtblInvokeNative(Native Method)

com.esri.arcgis.interop.NativeObjRef.a(Unknown Source)

com.esri.arcgis.interop.Dispatch.vtblInvoke(Unknown Source)

com.esri.arcgis.geodatabase.IQueryDefProxy.evaluate(Unknown Source)

MyCallingClass.pullData(MyCallingClass.java:89)

The thread is "stuck" (even it is in RUNNABLE state) at that method for 1-3 minutes and than happens the magic (!!): All logs suddenly stop. I never receive any more logs for this SOE instance - and a new one is created subsequently. My conclusion is that AGS is shutting down the SOE instance the hard way (similar to "kill -9 <processId>" on UNIX). No shutdown hook is executed, it is just killed.

I have no clue how to go on. This might be a deadlock bug inside the COM object or related. It cannot be reproduced 100% - it happens from time to time at different queries.

ErinBrimhall · ‎09-03-2014

Hi Matthes,

Nice sleuthing with the separate observer thread for grabbing the stack-trace.

Now I'm curious about your code. Can you share the lines performing the query and iteration over the cursor results?

MatthesRieke · ‎09-04-2014

Sure! It's in prototyping/testing conditions - so do not expect clean code

This one is the thread observer pattern of the quartz-scheduled cache update:

ArcGIS-Server-SOS-Extension/QuartzCacheScheduler.java at 6f340a4ed3106779e074f857ae859f8b0fd1d12e · ...

These are the actual database queries which are called from that scheduled task:

ArcGIS-Server-SOS-Extension/AccessGdbForOfferingsImpl.java at 6f340a4ed3106779e074f857ae859f8b0fd1d1...

As you can see, I have just tried to play around with FutureTask<T> and timeouts on these (I do not know an equivalent for C# but there is one for sure). This way I hope to catch the stucks and react on them (if AGS allows me to do so). First tests look promising, I will report after I have done some exhaustive stress testing!

MatthesRieke · ‎09-05-2014

After some testing (regular cache updates every 20 minutes / 10-15 parallel service requests every 15 minutes) I got some results. It seems like that the database server hangs for unkown reason. I am using MS SQL Server 2008 in our testing environment. By chance I catched the right moment when the SOE got stuck and had the possibility to check the DB server. I could not connect via the SQL Server Management Studio. In addition, the FutureTask pattern provided some additional insights. I was able to react and keep the SOE up a little longer. Here are the logs:

05.09.2014 07:17 - SEVERE - Processing request took longer than the usage timeout for service 'EEA/AirQualitySos.MapServer'. Server request timed out. Check that the usage timeout is appropriately configured for such requests. - Server

05.09.2014 07:18 - WARNING - "The StandaloneTable:'Airquality_E2a.DBO.%Observation' in Map:'Layers' is invalid. The base table definition string ""Airquality_E2a.DBO.%Observation"" is invalid. Failure to access the DBMS server." - EEA/AirQualitySos.MapServer

05.09.2014 07:18 - WARNING - "The StandaloneTable:'Airquality_E2a.DBO.%Network' in Map:'Layers' is invalid. The base table definition string ""Airquality_E2a.DBO.%Network"" is invalid. Failure to access the DBMS server." - EEA/AirQualitySos.MapServer

05.09.2014 07:18 - WARNING - "The Layer:'Airquality_E2a.DBO.FeatureOfInterest' in Map:'Layers' is invalid. The base table definition string ""Airquality_E2a.DBO.FeatureOfInterest"" is invalid. Failure to access the DBMS server." - EEA/AirQualitySos.MapServer

05.09.2014 07:18 - WARNING - "The Layer:'Airquality_E2a.DBO.SamplingPoint' in Map:'Layers' is invalid. The base table definition string ""Airquality_E2a.DBO.SamplingPoint"" is invalid. Failure to access the DBMS server." - EEA/AirQualitySos.MapServer

05.09.2014 07:18 - WARNING - "The Layer:'Airquality_E2a.DBO.Station' in Map:'Layers' is invalid. The base table definition string ""Airquality_E2a.DBO.Station"" is invalid. Failure to access the DBMS server." - EEA/AirQualitySos.MapServer

05.09.2014 07:19 - SEVERE - Failed to initialize server object 'EEA/AirQualitySos': 0x80004005: Failure to access the DBMS server - Server

05.09.2014 07:19 - SEVERE - [org.n52.sos.SosSoe] There was a problem while creating DB access: Argument is not a COM object: null com.esri.arcgis.interop.Dispatch.initDispatch(Unknown Source) com.esri.arcgis.interop.Dispatch.(Unknown Source) com.esri.arcgis.geodatabase.IDatasetProxy.(Unknown Source) com.esri.arcgis.geodatabase.FeatureClass.(Unknown Source) org.n52.sos.db.impl.AccessGDBImpl.(AccessGDBImpl.java:147) org.n52.sos.SosSoe.construct(SosSoe.java:216) sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) java.lang.reflect.Method.invoke(Method.java:606) com.esri.arcgis.interop.NativeObjRef.nativeVtblInvokeNative(Native Method) com.esri.arcgis.interop.NativeObjRef.a(Unknown Source) com.esri.arcgis.interop.NativeObjRef.a(Unknown Source) com.esri.arcgis.interop.Dispatch.vtblInvoke(Unknown Source) com.esri.arcgis.system.IObjectConstructProxy.construct(Unknown Source) com.esri.arcgis.discovery.servicelib.impl.SOThread.run(SOThread.java:323) - EEA/AirQualitySos.MapServer

05.09.2014 07:19 - SEVERE - Failed to construct instance of service 'EEA/AirQualitySos.MapServer'. Failure to access the DBMS server - Server

This looks like the SOE initiaization failing. I am now trying to quick fail on the timeouts, giving the DB server a chance to recover, and reschedule the update 30 minutes later...

ErinBrimhall · ‎09-10-2014

Hi Matthes,

This is good info. Have you done anything to determine what could be the root-cause of the DB server connection failures? And to be clear, have you reproduced this problem in your Test environment against the MS SQL 2008 database?

There could be some clues in the SQL Server logs.

Also, is there anything special about the credentials you are using to connect to the datasources of each layer in the published Map document? I presume the different layer datasources use a connection that matches one registered with ArcGIS Server?

And finally, the first error message logged: "Processing request took longer than the usage timeout for service 'EEA/AirQualitySos.MapServer'. Server request timed out. Check that the usage timeout is appropriately configured for such requests."

Are you using the default service usage timeout or has this been configured differently?

MatthesRieke · ‎09-16-2014

Hi Erin,

yeah I have played around with the request timeouts, but that did not have any effect. I also have the feeling that this specific log entry ("Processing request took longer ...") is just a subsequent symptom of the SOE getting into an undefined state.

We are using SQL Server 2008R2 on our test environment as well as on the production server. On my development machine there is a 2012 instance installed. I did not have the chance to run long-term tests on it - maybe I can start some off around the weekend.

Thanks for the hint with the SQL Server logs. I just digged into these and it turned out that each time the cache update fails with a timeout the SQL Server restarts a few minutes after the occurence (at least this is what I read from the messages).

09/10/2014 03:16:54 - spid52 - Unknown - Starting up database 'ReportServer$SQLEXPRESSTempDB'.

09/10/2014 03:16:53 - spid52 - Unknown - Starting up database 'ReportServer$SQLEXPRESS'.

These two messages are the only entries around the time of occurence. I also checked OS system messages, but could not find anything related.

Another interesting fact: I am doing the cache update every 20 minutes and did parallel HTTP service requests (around 10 batched requests) every 15 minutes. After stopping the HTTP request execution the updates went well for around 3 days. I am more and more getting the feeling that the SQL Server crashed due to some kind of overload.

MatthesRieke · ‎11-26-2014

Hi,

I just wanted to add some enlighting remarks. I have created a duplicate database and got rid of alle SDE relevant tables, schemas, etc (as I do not require these for my use case). The result is, that when I create a FeatureClass with the data source of my IMapServerDataAccess, that I receive an instance of SqlWorkspace (instead of IWorkspaceProxy).

The SqlWorkspace is now in the testing environment and for a week it did not crash in any way. It seems very stable. So imho the problem lies in the SDE-related workflows of ArcObjects.

Best

Matthes