I have two ArcGIS Server 10.3 machines in a single cluster hosting the System/CachingTools service, with 4 instances on each machine.
When I run the Manage Map Cache Tiles tool and specify 8 instances, for single or multiple scale levels, all the work is done by one machine, so only 4 instances are actually used.
Another test was to run several Manage Map Cache Tiles jobs with the "Wait for job completion" flag unchecked and different scale levels on each run, but the result was the same: a single machine did all the work.
Is there any configuration I'm missing that would let me use all the computing power of the cluster?
Have you checked the logs to see if there are any messages coming from the node that is part of the cluster but isn't processing?
You can filter messages from just that system by selecting it in the source drop down list.
By the way, I ask this because with the info you've shared there doesn't seem to be an obvious reason for this behavior.
I've re-run the tests and I didn't find anything in the logs.
I'm experiencing a similar issue with dynamic services running on a multi-machine cluster. All requests are handled by a single machine. I've checked the firewall and it is disabled on both machines.
Just for troubleshooting, what happens if you specify only the "non-functioning" node as the sole member of the cluster supporting the caching tools?
Also, for this troubleshooting, I would suggest setting the log level to Debug to capture all the log messages.
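If you'd rather script it than click through Manager, the log level can also be changed through the Server Admin API. A minimal sketch below builds the request for the logs/settings/edit endpoint; the hostname and token are placeholders, and the parameter names are based on my reading of the 10.3 Admin API docs, so verify them against your site's Admin API page before relying on this:

```python
# Sketch: bump the site log level to DEBUG via the ArcGIS Server Admin API.
# ADMIN_URL host and the token value are hypothetical placeholders;
# parameter names should be checked against your site's admin docs.
from urllib.parse import urlencode

ADMIN_URL = "http://gisserver.domain.com:6080/arcgis/admin"  # hypothetical host

def edit_log_settings_request(token, log_level="DEBUG"):
    """Build the POST target URL and body for logs/settings/edit."""
    url = ADMIN_URL + "/logs/settings/edit"
    body = urlencode({
        "f": "json",
        "token": token,
        "logLevel": log_level,              # e.g. SEVERE, WARNING, INFO, FINE, VERBOSE, DEBUG
        "logDir": "C:\\arcgisserver\\logs", # keep your existing log directory
        "maxLogFileAge": 90,
    })
    return url, body

url, body = edit_log_settings_request("<token>")
print(url)
print(body)
```

POSTing that body to the printed URL (after generating an admin token) should switch the whole site to Debug logging; remember to set it back afterwards, since Debug logging is verbose.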
For the dynamic services, is there any commonality with the caching node, other than being a member of the same site?
And a few more questions to get a better idea of the system...
How many nodes are part of this site? Just the two?
How many clusters?
Are you using one or more Web Adaptors?
Is there a network load balancer in front of the site?
Are you publishing (or managing the service to start the caching process) through a connection to the site via port 6080 (or 6443), or via the Web Adaptor?
Checking the firewalls was a great step too.
By the way, how are you assessing that the work is only being done on the one system? I only ask because there are many ways to check this, and I don't want to assume anything.
I'll execute the procedure with only the previously idle node and then post the results. Right now I'm generating a large cache and can't stop it.
But to clarify our scenario: we've set up 6 new Windows machines to migrate from version 10.0 to 10.3: 1 is the Web Adaptor running on IIS; 1 serves as storage; and the other 4 are ArcGIS Server machines. There is no load balancer, and there is only 1 ArcGIS Server site.
Initially, I created four clusters with a single machine in each, to analyze the results of different kinds of requests against a single Web Adaptor. The results were not what I expected, but that's a subject for another thread.
This specific test consisted of a 2-machine cluster running the System services, including CachingTools configured to run up to 4 instances per machine, and another cluster running a single cached service on a single machine.
The first attempt was to run a single Manage Map Cache Tiles tool with more instances than a single machine could handle, so I specified 8 instances. Due to some considerations I've read about the cache controller, I redid this specifying 6 instances.
The second attempt was to run multiple Manage Map Cache Tiles tools, so I sent two requests specifying 4 instances each: one to generate the whole cache at scales 250,000, 180,000 and 115,000; and another to generate the cache for a specific area at scales 60,000, 30,000 and 10,000. Again, because of the cache controller, I re-ran specifying 3 instances for each tool.
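For what it's worth, the second attempt can be scripted by submitting jobs asynchronously to the CachingTools GP service over REST instead of running the tool from the UI. This is only a sketch: the host, token, and the GP task parameter names (service_url, num_of_caching_service_instances, etc.) are my assumptions from memory, so check them on the task's REST services page before using:

```python
# Sketch: submit two asynchronous Manage Map Cache Tiles jobs to the
# System/CachingTools GP service, 3 caching instances each.
# Host, token, and parameter names are assumptions -- verify them on
# the task's REST page for your site.
from urllib.parse import urlencode

GP_TASK = ("http://gisserver.domain.com:6080/arcgis/rest/services/"
           "System/CachingTools/GPServer/Manage%20Map%20Cache%20Tiles/submitJob")

def submit_job_body(token, service, scales, instances):
    """Build the POST body for one submitJob request."""
    return urlencode({
        "f": "json",
        "token": token,
        "service_url": service,                          # assumed parameter name
        "levels": ";".join(str(s) for s in scales),      # semicolon-separated scales
        "update_mode": "RECREATE_ALL_TILES",
        "num_of_caching_service_instances": instances,   # assumed parameter name
    })

# Two jobs covering different scale ranges, 3 instances each:
job1 = submit_job_body("<token>", "MyFolder/MyService.MapServer",
                       [250000, 180000, 115000], 3)
job2 = submit_job_body("<token>", "MyFolder/MyService.MapServer",
                       [60000, 30000, 10000], 3)
print(job1)
print(job2)
```

Because submitJob returns immediately with a job ID, this is effectively the same as running the tool with "Wait for job completion" unchecked, and it makes it easy to fire several jobs at once while you watch which machine picks them up.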
The results were the same in all attempts: one machine at 100% CPU, writing about a dozen MB/s to the storage, while the other was completely idle.
I've tried using System Monitor and the Services Dashboard, but they didn't provide the specific information I needed for this test, so I ended up using Resource Monitor on Windows Server 2012 to gather information about memory, CPU, disk access and bandwidth consumed during the tests.
*NOTE: the caches were deleted between attempts. I also thought it could be caused by the Bundled cache format, but tests using the Exploded format produced the same results.
Without messages from the server, it's hard to say what the issue may be.
Because the site is responsible for the internal load balancing of requests to the nodes in a cluster, there is very little configuration that must be done... normally this is a good thing, given the ease of setup.
Generally, the things I've found that cause issues in multi-node sites are:
- not specifying the exact same credentials for the ArcGIS Server account when installing the server software on each node
- if connecting to a registered database, not all nodes have the correct DB native client installed
- the config-store and server directories are not accessible by all nodes. This includes setting a secondary cache location that one node can't reach.
- firewalls blocking communication between nodes and clusters of the site (ports in the 4000 range), or to data sources.
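For the third item, a quick sanity check run on each node can confirm that the shared locations are both readable and writable from that machine. A minimal sketch (the UNC paths below are hypothetical examples; substitute your own config-store, server directory, and cache paths):

```python
# Sketch: run on EACH node to verify the shared locations every node must
# reach (config-store, server directories, cache directory) are readable
# and writable. The UNC paths are hypothetical examples.
import os
import tempfile

SHARED_PATHS = [
    r"\\storage\arcgis\config-store",
    r"\\storage\arcgis\directories\arcgiscache",
]

def check_path(path):
    """Return (readable, writable) for a shared directory."""
    readable = os.path.isdir(path) and os.access(path, os.R_OK)
    writable = False
    if readable:
        try:
            # Creating (and immediately deleting) a temp file proves write access.
            with tempfile.NamedTemporaryFile(dir=path, delete=True):
                writable = True
        except OSError:
            pass
    return readable, writable

for p in SHARED_PATHS:
    print(p, check_path(p))
```

Run it under the ArcGIS Server service account (not your own login), since that is the account whose permissions actually matter here.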
I assume you've read Allocation of server resources to caching—Documentation | ArcGIS for Server, and maybe Accelerating map cache creation | ArcGIS Blog, to help with the sizing and configuration of the services and clusters.
Sorry I don't have a definitive answer, this is a tough one.
Travis, by removing one machine from the cluster and generating the cache, then adding it back and removing the other one to generate the cache again, I've verified that:
1. credentials on both windows services are the same;
2. both servers can connect to the database;
3. all arcgis server directories and the config-store are accessible to both machines;
4. Windows firewall is disabled on both machines and they belong to the same domain.
My initial idea was: whenever I need to generate a big cache, I would allocate more machines to speed up the work; after the cache is complete, those machines could be reassigned to other tasks, either by taking them offline (freeing hardware resources) or by adding them to another cluster.
What I ended up doing was allocating more hardware resources to the single caching machine before running the job and deallocating them afterward.
Did you do all your testing at 10.3.0? If so, did you try upgrading to 10.3.1 to see if that solved the issue where only 1 machine in a 2-machine cluster was actually doing the cache processing?