
ArcGIS (10.1 SP1) Site and Web Adapter randomly crash and stop responding

05-21-2013 02:59 PM
by Anonymous User
Not applicable
Original User: btelliot

We are struggling with two main problems since moving from ArcServer 10.0 to 10.1.

1. Poor Performance
2. Constant Server downtime / general site instability.

Our ideal server architecture would be a site of multiple virtual machines with 2 clusters and a single web adapter running over SSL only.  See attached image for configuration.
[ATTACH=CONFIG]24566[/ATTACH]

We currently have about 350 services running on our site. 

  • ~300 of which are configured with a minimum instance of 0 (should turn themselves off) and a max instance of 2.

  • ~20-30 are cached

  • All running in High-Isolation

  • Licensed for 4 cores per machine, plus an additional staging license (12 cores total).

  • 16GB RAM per machine.

  • Web adapter is running with 1 core.
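As a rough sizing sketch for the configuration listed above: in high isolation each running service instance gets its own ArcSOC process, so the worst case for the ~300 on-demand services is their count times the maximum instances. The per-process memory figure below is a hypothetical placeholder, not a measurement from this site, and the cached services are left out because their instance limits are not given.

```python
# Rough worst-case sizing for the on-demand services described in the post.
SERVICES_ON_DEMAND = 300        # from the post: ~300 services with min 0 / max 2
MAX_INSTANCES_PER_SERVICE = 2   # from the post
RAM_PER_MACHINE_GB = 16         # from the post

# Hypothetical placeholder: measure real ArcSOC memory use before relying on this.
ASSUMED_MB_PER_PROCESS = 100


def worst_case_processes(services, max_instances):
    """High isolation: one ArcSOC process per running instance."""
    return services * max_instances


def ram_needed_gb(processes, mb_per_process):
    """Total RAM in GB if every process used mb_per_process megabytes."""
    return processes * mb_per_process / 1024


processes = worst_case_processes(SERVICES_ON_DEMAND, MAX_INSTANCES_PER_SERVICE)
needed = ram_needed_gb(processes, ASSUMED_MB_PER_PROCESS)
print(f"{processes} processes, about {needed:.1f} GB at "
      f"{ASSUMED_MB_PER_PROCESS} MB each, vs {RAM_PER_MACHINE_GB} GB per machine")
```

The point of the arithmetic is only that the worst case is far above what is normally running, since services with a minimum of 0 instances should be idle most of the time.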

Performance

Our main issue with performance comes from administering / publishing services.  Since we have multiple machines, we need to reference the config store from a UNC path.  This is a known bug that should be fixed in SP2.  (Why they haven't released a hotfix for this is beyond me).  For more details see thread:  http://forums.arcgis.com/threads/66388-Slow-performance-administering-services-in-ArcCatalog-and-Arc...

However, we also have performance issues on the web client side of things.  These issues are intermittent and difficult to replicate.  We can measure this latency using the Network tab of the "Developer Tools" in Google Chrome.  It will sometimes take 3-5 minutes for the server to return the data to the web browser, even on cached services that are already running.
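For intermittent latency like this, timing the same request from a script at regular intervals can complement the browser's Network tab. The helper below is a minimal sketch; the `fetch` callable is a stand-in for whatever request is being measured.

```python
import time


def timed(fetch):
    """Call fetch() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fetch()
    return result, time.perf_counter() - start


# Stand-in workload so the example runs anywhere; replace with a real request.
result, elapsed = timed(lambda: sum(range(1000)))
print(f"took {elapsed:.4f} s")
```

In practice `fetch` could wrap a call such as `urllib.request.urlopen(url).read()` against one of the slow services, with the elapsed times written to a file for later comparison with the server logs.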

Depending on our configuration and the complexity of the MXD, publishing a service usually takes around 5 minutes at the best of times.  At the worst, republishing an existing map document can take up to 30 minutes.  If anyone else has experienced any of these issues please let me know!

We have monitored our system resources on the virtual machines, and we rarely hit upwards of 30% CPU usage, unless caching or restarting the machines.

Stability

Since moving to 10.1, the longest we have gone without a server outage or issue is about one week.  As we grow as a company, more people rely on our services in their workflows, and downtime becomes less and less bearable. In theory, a multiple-machine site should be more stable: if one server goes offline, the web adapter recognizes this and redirects the traffic to a different server.

Main Issue:

  • We have noticed that ArcServer running on one of the machines will periodically crash and stop working.

  • We don't see a spike in system resources or any other telltale signs on the VMs; it just stops responding as it should.

  • We will experience this at least once a day.

Our normal fix is:

  • Check to see if the web adapter is responding; if not, restart the VM

  • Check to see if each individual machine is responding (try to log in to the ArcGIS Server Manager); if not, restart the VM

  • Reboot whichever server is crapping out; if that doesn't work, try the other one(s).

  • If it still doesn't work, try stopping the machine from https://[machinename]:6443/arcgis/admin, and then starting it again.
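The first two checks in the list above could be scripted along these lines. This is only a sketch: the URLs and host names in `TARGETS` are hypothetical, and a certificate that Python cannot verify (for example a self-signed one on port 6443) would be reported here as "not responding".

```python
import urllib.error
import urllib.request


def responds(url, timeout=10):
    """True if the URL returns a non-5xx HTTP response within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError as err:
        # The server answered; treat only 5xx statuses as a failure.
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, TLS problems, etc.
        return False


def unresponsive(targets, probe=responds):
    """Return the names whose URL fails the probe, in the order given."""
    return [name for name, url in targets.items() if not probe(url)]


# Hypothetical targets; substitute the real web adapter and machine URLs.
TARGETS = {
    "web adapter": "https://gis.example.com/arcgis/rest/services",
    "machine 1": "https://machine1.example.com:6443/arcgis/admin",
}
```

Running `unresponsive(TARGETS)` on a schedule would at least record when each component stopped answering, which helps when matching outages against the server logs.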

If anyone has some insight into what may be causing this issue, please let us know.

Thank you for taking the time to read this!

Brett



TL;DR: ArcServer 10.1 SP1 is still very buggy.  ArcServer will randomly stop working, and we will need to reboot the virtual machine it is running on.  We have to do this A LOT.
25 Replies
by Anonymous User
Not applicable
Original User: btelliot

Good to hear!  I really hope that this resolves the instability problems that you are seeing. 

Please report your findings so that everyone can benefit from them.

Did your server logs, as seen in ArcGIS Server Manager, have any errors in them?  In particular, I am wondering about out of memory/heap type errors.


Our server logs had a ton of errors in them.  I uploaded them to the dropbox folder yesterday as well.
RichardWatson
Deactivated User
I saw something like 1978 Severe errors over ~3 days.

I suggest that you take the time to create a spreadsheet in which you group the issues together.

Something like this:

Count    Problem
------   --------
105       Feature service not found
202       Server wait time expired

After doing that then you can sort on the count so that the most pressing problems appear first.
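If the log messages are exported to text, the same grouping can be sketched in a few lines of Python instead of a spreadsheet. The sample messages are made up for illustration and are not taken from the actual logs.

```python
from collections import Counter


def count_problems(messages):
    """Return (count, problem) pairs, most frequent first."""
    counts = Counter(m.strip() for m in messages if m.strip())
    return [(n, problem) for problem, n in counts.most_common()]


# Hypothetical sample; in practice these would come from the exported logs.
sample = [
    "Server wait time expired",
    "Feature service not found",
    "Server wait time expired",
]

print("Count    Problem")
print("------   --------")
for n, problem in count_problems(sample):
    print(f"{n:<8} {problem}")
```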

I did see a few out of memory errors.  I suspect that you might have several issues going on here.
by Anonymous User
Not applicable
Original User: btelliot

We had another server crash this afternoon.  Reading the logs makes it look like it made it about 20 minutes before the next crash.

It looks like changing the Max Heap Parameters didn't fix the issue...

I've updated the logs and DMPs in the dropbox folder.  I created Excel files of the most recent logs, and it looks like there are a couple of memory issues, but mostly errors where the server fails to initiate a MapService or can't find the service because the server has stopped.

One of the errors that stands out in the logs is the following:
2013-05-23T12:57:57,350  SEVERE  9000 Rest 3308 22  Internal Server Error. Error handling service request : Unknown Container Exception: org.apache.openejb.client.RemoteFailoverException: Cannot complete request.  Retry attempted on 1 servers; nested exception is: java.io.IOException: Cannot deternmine se.

No clue where to go from here!
RichardWatson
Deactivated User
Here is an example of the full error information:

<Msg time="2013-05-22T15:21:39,778" type="SEVERE" code="9003" source="Rest" process="3308" thread="22" methodName="" machine="TERAARCSERV1.TERA1.COM" user="" elapsed="">Unable to process request. Error handling service request : Unknown Container Exception: org.apache.openejb.client.RemoteFailoverException: Cannot complete request.  Retry attempted on 1 servers; nested exception is: java.io.IOException: Cannot deternmine server protocol version: Received null/0.0</Msg>

If you look at the source field, it is either Rest or Server.  I think that there is a single Rest process (on each GIS Server) whose job it is to receive the incoming requests and dispatch them.  The Server process is an instance of ArcSOC, which is the process that does the actual work (e.g. a map service), and there are many of these.
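To pull the source field (and the other attributes) out of entries like the one above, a small parser is enough. This sketch assumes each entry is a well-formed `<Msg>` element on its own; the message text is shortened for the example.

```python
import xml.etree.ElementTree as ET

# Entry copied from the log above, with the message text shortened.
entry = (
    '<Msg time="2013-05-22T15:21:39,778" type="SEVERE" code="9003" '
    'source="Rest" process="3308" thread="22" methodName="" '
    'machine="TERAARCSERV1.TERA1.COM" user="" elapsed="">'
    'Unable to process request.</Msg>'
)


def parse_entry(xml_text):
    """Return the main attributes and message text of one <Msg> element."""
    msg = ET.fromstring(xml_text)
    return {
        "time": msg.get("time"),
        "type": msg.get("type"),
        "code": msg.get("code"),
        "source": msg.get("source"),
        "machine": msg.get("machine"),
        "text": (msg.text or "").strip(),
    }


parsed = parse_entry(entry)
print(parsed["source"], parsed["code"], parsed["machine"])
```

Grouping the parsed entries by `source` and `code` would show quickly whether the failures are concentrated in the Rest process or in the ArcSOC instances.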

The above error comes from Rest.  I'm not a Java programmer but my guess is that the program is trying to communicate with another machine using Java and there is some type of mismatch between the machines.  If you Google on that error then it seems to say that perhaps there is a software version mismatch.

I do believe that the REST process can and does try to communicate with other nodes (i.e. GIS Servers) in the site.  I am guessing that there is some type of problem here.  I noticed that all the logs are about TERAARCSERV1.  Where are the logs/entries for the other machine in the site?  Your diagram indicates that you have 2 machines in the default cluster.  I am wondering if there is some type of compatibility and/or configuration problem here.  Do the machines have the exact same version of ArcGIS Server installed?  It could also be firewall/security configuration related.  Are all the required ports open?

http://resources.arcgis.com/en/help/main/10.1/index.html#//015400000537000000
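A quick way to test whether a port is reachable from one machine to another is a plain TCP connection attempt. The sketch below is not a complete check: port 6443 comes from the admin URL mentioned earlier in this thread, 6080 is assumed here as the default HTTP port, and the full list of required ports is in the help topic linked above.

```python
import socket

# 6443: from the admin URL in this thread. 6080: assumed default HTTP port.
# The help topic linked above lists the remaining ports to add here.
PORTS_TO_CHECK = [6080, 6443]


def port_is_open(host, port, timeout=3.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS lookup failures.
        return False


def closed_ports(host, ports, check=port_is_open):
    """Return the ports on host that could not be reached."""
    return [port for port in ports if not check(host, port)]
```

Calling `closed_ports("TERAARCSERV1.TERA1.COM", PORTS_TO_CHECK)` from each of the other machines in the site, and the reverse, would show whether a firewall is blocking traffic in either direction.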
by Anonymous User
Not applicable
Original User: btelliot

I realize I left out an important detail.  Right now we are only running the site on one machine (teraarcserv1) and a single web adapter. We removed the other machines because we found it to be even less stable. 

When the machines have been added to the site, they are all running the same software (ArcServer 10.1 SP1), and we are able to trace the network traffic and find that the machines can talk to each other no problem.

Thanks again for your help with this.  I really appreciate it!
by Anonymous User
Not applicable
Original User: btelliot

Quick update.

After a few emails back and forth with Peter Kovalchuk from ESRI Canada, he recommended that we change the config store from a UNC path to a local path.

To do this, I did the following:

  • I created a new folder called C:\arcgisserver\config-store on the machine teraarcserv1.

  • Next, I went into teraarcserv1 > ArcGIS Server Manager > Site > Configuration Store and clicked the edit icon next to the config store location.

  • I entered in the new destination folder "C:\arcgisserver\config-store" and clicked "Apply".

  • I got an error from the server "Name is null".  (See below).

[ATTACH=CONFIG]24652[/ATTACH]



Peter suggested that our site config storage is corrupted, and we should build a new site, start publishing services, and monitor site stability.

The last time we rebuilt the site, we had to republish upwards of 100 services, and it took two people over a week to accomplish this, even using python scripts to automate the publishing process :(.

I will try setting up a new site, and update this thread to see if this was the issue.

Thanks again for your help Richard.

Brett
by Anonymous User
Not applicable
Original User: btelliot

Update #2.

As advised by ESRI Tech support Canada, we created a fresh new Virtual Machine using a local config store (C:\arcgisserver\config-store), while having our server output directories on the UNC path.

I used a python script to publish 250 or so map services before we started getting a ton of error messages again.  We noticed that when publishing some of the larger services manually, the pagefile on the VM started to grow and ate up all of the space on the local hard drive, leaving 0 bytes of free space (60GB/60GB).  We then restarted the computer, and there was more free space on the local disk (30GB/60GB).  We set the pagefile to a maximum size of 10GB to prevent it from ballooning again.
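Free space on the local disk can be watched from a script while publishing, so that a growing pagefile is noticed before the drive fills up. This is a minimal sketch using only the standard library; the 10 GB threshold is an arbitrary assumption, not a recommended value.

```python
import shutil


def free_gb(path):
    """Free space, in GB, on the drive that contains path."""
    return shutil.disk_usage(path).free / 1024 ** 3


def low_on_space(free, threshold_gb=10):
    """True if the free space (in GB) is below the threshold."""
    return free < threshold_gb


# "." checks the drive of the current directory; on the server this would be "C:\\".
current = free_gb(".")
print(f"{current:.1f} GB free, low on space: {low_on_space(current)}")
```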

By this time, the server was crashing again, and not useable / stable.
by Anonymous User
Not applicable
Original User: btelliot

Update #3.

After another meeting with ESRI Canada Tech support, they decided that the python script used for publishing the services was potentially a source of the problem.  We scrapped the site once again, and started rebuilding from scratch.

Since this meeting on Tuesday, we have:

  • Created a fresh new VM: TeraArcServ3

  • Installed ArcServer and SQL Client on it

  • Published all 250 services to the server manually

  • Retired the old crashed / corrupted site

  • Pointed our WebAdapter at the new site, TeraArcServ3

We are finding similar errors in our logs, and our new site is crashing again.  The main error we are encountering is the following.  This is happening with about a dozen or so services.  These services use a variety of different data sources: shapefiles, file GDB, and SDE.

Watercourse Crossings layer failed to load: Fault code: 500

Fault info: Error handling service request : Cannot obtain a free instance.; nested exception is:
com.esri.arcgis.discovery.ejb.ArcGISServerEJBException: java.lang.Exception: Could not initialize service '8300_to_8399.t8395_WCC.MapServer'.


I've attached a copy of the logs and crash files to this forum post.

Thanks again for your help!
by Anonymous User
Not applicable
Original User: btelliot

Update #4

Our server is now running okay, although our performance is fairly sluggish right now.

Our problem was resolved by undoing the modifications to the heap sizes that ESRI Tech Support recommended. 

We are currently back to using the defaults for App server maximum heap size and SOC maximum heap size (256 and 64 MB respectively).


Lessons learned:


  • Our config-store was most likely corrupted

  • Rebuilding our site solved this problem

  • Modifying our maximum heap size did not increase server performance, and caused instability

  • We may need to do further testing to find an optimum heap size for maximum performance and stability

I will be on vacation for the next 3 weeks, so I won't be able to update this thread.  Thanks for the help Richard.

-Brett
JasonHarris2
Regular Contributor
I have been struggling with these very same issues - using almost the exact same configuration.  Contacting support now, but curious to know if you have anything else to add?  How has stability been?

Thanks
Jason