Recommended Strategies for Load Testing an ArcGIS Server Deployment

AaronLopez · ‎05-27-2021

Start with a test plan

A test plan defines the questions the load test is meant to answer. Most testing software can produce some type of report once the test has completed, and this report should answer those questions.

For example:

a) Could the ArcGIS service utilize all the CPU (i.e. is it CPU-bound)?

b) Could the ArcGIS service deliver a specific throughput (e.g. a certain number of transactions/sec)?

c) Did the throughput deliver an average response time that met our performance requirement?

Defining a purpose for a test helps keep the testing effort focused on a goal.
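
One way to keep the plan's questions front and center is to write them down as explicit targets and check the report against them. The sketch below is only an illustration: the class, the function, and every threshold value are hypothetical and should be replaced with the numbers from your own test plan.

```python
from dataclasses import dataclass


@dataclass
class PlanTargets:
    """Hypothetical targets, one per test plan question."""
    min_peak_cpu_pct: float      # a) CPU level that counts as "used all the CPU"
    min_throughput_tps: float    # b) required transactions/sec
    max_avg_response_s: float    # c) required average response time (seconds)


def evaluate(targets, peak_cpu_pct, peak_throughput_tps, avg_response_s):
    """Return a yes/no answer for each test plan question."""
    return {
        "cpu_bound": peak_cpu_pct >= targets.min_peak_cpu_pct,
        "throughput_met": peak_throughput_tps >= targets.min_throughput_tps,
        "response_time_met": avg_response_s <= targets.max_avg_response_s,
    }


# Example values only; take the real ones from the test report.
targets = PlanTargets(min_peak_cpu_pct=90.0, min_throughput_tps=20.0, max_avg_response_s=1.0)
answers = evaluate(targets, peak_cpu_pct=95.0, peak_throughput_tps=18.5, avg_response_s=0.8)
for question, passed in answers.items():
    print(f"{question}: {'yes' if passed else 'no'}")
```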

Tests should be conducted against an application with no known major bugs or defects

Load testing should not be used to functionally test a deployment or application.
An application should pass quality assurance (QA) testing before load tests are conducted.
Additionally, if an application contains major bugs, such deficiencies could have a measurable impact on the performance and scalability of the offered services. This in turn would most likely negate the results of the test and potentially waste time, money, and resources.

Interact with your application (and ArcGIS services) before load testing them

If you are the only user on your system and the ArcGIS services respond very slowly, there is no need to conduct a test under load. The next step should be to tune and optimize the ArcGIS services for performance.

Coordinate test execution

Many times a load test is conducted in a QA/Staging or Test environment but occasionally can be carried out in Production.
Whatever the environment, it is important to remember that often, services and resources are being consumed by other users too (and not just the load testing team).
To help avoid confusion and unexpected experiences, it is highly recommended to coordinate the execution of any load tests with the appropriate personnel.
Doing so can help provide a better experience for actual users and can avoid unwanted noise in a load test (from actions real users might be doing).

Verify that the Test environment matches expectation

Sometimes the Test environment might be scaled back due to a variety of reasons. Then, over time, Test and Production become vastly different environments. In this case, test results derived from Test would have little meaning in relation to how Production will perform or scale.

For example:
a) If Production is expected to be deployed using a Highly Available architecture then Test should too

b) If Production is expected to utilize an enterprise geodatabase containing 500GB of vector data then Test should too

c) If Production uses ArcGIS Servers with 32 cores and service instance maximums are set to 32, then Test should too

Keeping the Test and Production environments in lock-step helps the results carry the most value (and set the right expectations). In cases where they knowingly do not match, make a note of it before testing begins and in any conclusions derived from the Test environment.

Start a load test at step 1

Using 1 as an initial load step can assist your post-test analysis.

Step 1 (or one concurrent test thread) represents the best-case scenario for your test. This is your baseline and is a good measuring stick to understand how well or poorly the ArcGIS service scaled as pressure increased (e.g. when additional test threads were added).
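
As a rough illustration of using step 1 as the measuring stick, the sketch below compares the throughput measured at each step against what perfectly linear scaling from the baseline would have delivered. The function name and the sample numbers are made up for the example; they are not results from a real test.

```python
def scaling_efficiency(step_results):
    """Compare each step's throughput to linear scaling from the 1-thread baseline.

    step_results: list of (threads, throughput_per_sec) tuples, starting at 1 thread.
    Returns a list of (threads, throughput_per_sec, fraction_of_linear_scaling).
    """
    baseline_threads, baseline_tps = step_results[0]
    if baseline_threads != 1:
        raise ValueError("the first step should use a single test thread")
    rows = []
    for threads, tps in step_results:
        ideal_tps = baseline_tps * threads  # what linear scaling would deliver
        rows.append((threads, tps, tps / ideal_tps))
    return rows


# Illustrative numbers only.
steps = [(1, 4.0), (2, 7.8), (4, 14.0), (8, 16.0)]
for threads, tps, efficiency in scaling_efficiency(steps):
    print(f"{threads:>2} threads: {tps:5.1f} trans/sec, {efficiency:.0%} of linear scaling")
```

A sharp drop in the last column marks the step where the service stopped scaling as pressure increased.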

Collect hardware utilization from all machines involved in the test

Most testing frameworks provide the ability to collect hardware utilization from the servers and the test client itself. This can be valuable for understanding resource consumption and identifying bottlenecks (e.g. the resources on the test client can also be a limiting factor).

However, despite having this feature in the testing software, collecting the hardware utilization is not always possible due to permission or network access (e.g. connecting through firewalls/routers).

While obtaining this information directly through a testing framework is certainly convenient, there are other ways to accomplish this task. Using free tools (like PerfMon in Windows or dstat in Linux) to capture this data is an extra step but worth the effort. Once the test is complete, analysis can still be performed on the manually created chart data from the hardware utilization.

Test individual ArcGIS services first

If a specific web application uses more than one ArcGIS service, test and tune each one separately.

This approach makes it easier to identify which services may have bottlenecks or limitations that prevent them from utilizing the available hardware.

If an ArcGIS service cannot utilize all the available CPU hardware of ArcGIS Server, the tester/analyst should notify the appropriate person that a tuning opportunity in the deployment exists.

Also, avoid starting the testing effort with the full application workflow as it can be difficult to spot potential bottlenecks when many ArcGIS services are tested at once.

Test as physically close to the deployment as possible

Try not to have the test “simulate” the Internet. Testing as physically close to the deployment as possible can help provide the best understanding of what the server hardware can deliver.

Purposely introducing network latency or poor bandwidth will add noise to a test and make it difficult to recognize the full capability of the ArcGIS services and the servers they are running on.

Step and test duration

Tests do not need to run for 8 hours to provide useful information on the ArcGIS service in question. However, it is also recommended to avoid running too short a load test. This comes down to choosing a length of test time (and for each step duration) that provides the right amount of information. In other words, it is about recording enough request samples to get a “good” average.

The length of time to use is typically tied to your response time. A fast response time can deliver many requests with a 5-minute step load, while a slow response time may need a 15-minute step load to record just as many values.
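
The relationship between response time and step length can be sketched with simple arithmetic. This assumes each test thread sends its next request as soon as the previous one returns (no think time), which will not match every test design; the function names and the response times are illustrative.

```python
import math


def samples_per_step(step_seconds, avg_response_seconds, threads=1):
    """Approximate number of requests recorded during one step."""
    return math.floor(threads * step_seconds / avg_response_seconds)


def step_seconds_needed(target_samples, avg_response_seconds, threads=1):
    """Approximate step length needed to record target_samples requests."""
    return math.ceil(target_samples * avg_response_seconds / threads)


# A 0.5 s response over a 5-minute step, one test thread.
fast = samples_per_step(step_seconds=300, avg_response_seconds=0.5)
print(f"fast service: about {fast} samples in 5 minutes")

# How long a 1.5 s response needs to record just as many samples.
slow = step_seconds_needed(target_samples=fast, avg_response_seconds=1.5)
print(f"slow service: about {slow / 60:.0f} minutes for the same sample count")
```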

As a tester, you may not always get the step and test duration right with your first estimation and may need to adjust and rerun the test.

Be mindful of the ArcGIS Server Log Level

Although the ArcGIS Server logs can provide a great deal of information to analyze, it is important to understand that the higher log levels of VERBOSE and DEBUG can slow down the performance of a very busy site and are not recommended settings for a Production environment. By contrast, WARNING (the default) provides the best possible service performance as it only records warnings and errors.

However, a setting of FINE is a good compromise between useful analytical information (like elapsed times on dynamic requests) and speed.

Traditional ArcGIS services can be tuned within (ArcGIS) Server

Before load testing a traditional (i.e. dedicated) ArcGIS service to tune it or understand how it performs, try setting its ArcSOC instance maximum to the number of CPU cores available on the ArcGIS Server machine.

After restarting the service, the setting will allow multiple, simultaneous requests to take advantage of the available hardware and to show it in the best possible light (this assumes the service is CPU-bound).

Increasing the ArcSOC instance maximum allows the service to utilize more CPU but, in turn, more memory as well. Please ensure there is adequate memory available on the ArcGIS Server machine to accommodate the adjustment. If the service is not in heavy demand, the extra instances will sit idle and, once the idle timeout elapses (default is 1800 seconds), they will shut down in order to free server memory.

Similarly, increasing a service's instance minimum (to match the maximum) is a good strategy for obtaining predictable performance. This is recommended for the most popular services but keep in mind that such a configuration will always consume memory (for that service) since no instances will shut down after the idle time has elapsed.
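
Before raising the instance maximum, a quick back-of-the-envelope check can show whether the machine has room for it. The sketch below is an assumption-laden estimate: the per-instance memory figure is hypothetical and varies by service, so measure it on your own system (for example, by watching the ArcSOC processes while the service is in use).

```python
def instance_memory_mb(instance_count, mb_per_instance):
    """Estimated memory used when instance_count instances are running."""
    return instance_count * mb_per_instance


def max_instances_that_fit(available_mb, mb_per_instance, cpu_cores):
    """Largest instance maximum that respects both the core count and memory."""
    by_memory = available_mb // mb_per_instance
    return int(min(cpu_cores, by_memory))


# Hypothetical numbers: 32 cores, 200 MB per instance, 4 GB free for this service.
print(instance_memory_mb(32, 200), "MB needed to match the core count")
print(max_instances_that_fit(4096, 200, 32), "instances fit in the available memory")
```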

Shared services have instance settings that could be adjusted as well. However, if a shared service is popular enough to be load tested, it should be adjusted to run as a traditional, dedicated service.

Not all ArcGIS services are CPU-bound

If an ArcGIS service is CPU-bound, it means the amount of throughput it can deliver (or capacity it can support) is limited only by the number of CPUs on the ArcGIS Server machine(s). In many ways, this can be a good thing.

However, this is not always the case. Sometimes, you can have bottlenecks in other hardware like network, for example. Occasionally, you can encounter a bottleneck in a software component which can be by design or unintended.

Therefore, the collection of hardware metrics during the test is very important. Observing utilization of the CPU, Memory, Network and Disk can give the tester/analyst vital information for understanding whether something is limiting the scalability of the ArcGIS Server service and whether it is the server’s hardware or the test client’s.
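
A first pass over the collected CPU metrics might look like the sketch below. It is deliberately simplistic: it only looks at CPU, and the 90% saturation threshold is an arbitrary assumption rather than a documented cutoff. It should prompt further investigation, not replace it.

```python
def likely_limit(server_cpu_pct, client_cpu_pct, saturation_pct=90.0):
    """Guess what limited throughput, from peak CPU utilization during the test."""
    if client_cpu_pct >= saturation_pct:
        # The load generator ran out of CPU, so the server was never fully pushed.
        return "test client"
    if server_cpu_pct >= saturation_pct:
        return "server CPU (CPU-bound)"
    return "something else (network, disk, memory, or software)"


# Illustrative peak values from a test run.
print(likely_limit(server_cpu_pct=96.0, client_cpu_pct=35.0))
print(likely_limit(server_cpu_pct=55.0, client_cpu_pct=30.0))
```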

It’s all about throughput (not users)

Throughput is measured, users are calculated…they are two different artifacts from a test.

In a test, throughput is typically defined as transactions/second (or operations/second), and this is a value that should be measured by the test client software. On the other hand, the definition of a “user” can vary but is something that is usually calculated off throughput.

Since throughput is observed directly from the results of a load test, it is one of the best metrics for determining the scalability of a deployment.
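
To illustrate how a user count gets calculated from throughput, the sketch below applies Little's Law: concurrent users are roughly throughput multiplied by the time each user spends per request cycle (response time plus think time). This is one common convention, not the only definition of a “user”, and the think time is an assumption you would need to justify for your own deployment.

```python
def estimated_users(throughput_per_sec, avg_response_seconds, think_seconds):
    """Concurrent users implied by a measured throughput (Little's Law)."""
    return throughput_per_sec * (avg_response_seconds + think_seconds)


# Measured: 10 transactions/sec at a 0.5 s average response time.
# Assumed: each user pauses 9.5 s between requests.
print(estimated_users(10.0, 0.5, 9.5), "estimated concurrent users")

# Same measurement, different assumption about think time.
print(estimated_users(10.0, 0.5, 29.5), "estimated concurrent users")
```

The measured throughput is identical in both cases; only the assumption about user behavior changed, which is why the user count is a calculated artifact.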

On a related note, a test thread (i.e. the item that increases as more pressure is added to a load test) is not the same as a user either. The number of test threads that are utilized and their duration are typically configured in the step load definition of a test.

Verify that the test was successful

The completion of a test does not necessarily mean it was a “good” test, capable of successfully answering the questions in the test plan. It is important to verify and validate that the test was sending the right requests where it was supposed to and getting the expected responses.

A quick manual quality control (QC) check of the request composition in the test can help with the former, while monitoring the average content length (per response) can help with the latter.

Most test software provides a way to capture the average content length (or something similar). The general rule of thumb is that the average value for this metric should hold relatively constant throughout the test. If it increases or dips drastically, further investigation is recommended, as the expected response for the requests may not be coming back (e.g. errors instead of image or JSON content). If the response is valid but wildly variable, a different test design may be needed.

Additionally, it is important to determine if the requests themselves were successful (e.g. HTTP 200). Some test software may allow the analyst to configure validation checks on the responses within the test itself. That said, profiling the average content length metric usually provides a more accurate view of the expected response from the server.
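
If the test software can export the average content length per step (or per time interval), a small script can flag the intervals worth a closer look. The sketch below compares each value to the median of the run; the 25% tolerance is an arbitrary assumption and the sample values are invented.

```python
from statistics import median


def flag_content_length_drift(avg_lengths, tolerance=0.25):
    """Return (index, value) for intervals that stray too far from the median.

    avg_lengths: average response content length (bytes) per step or interval.
    tolerance: allowed deviation, as a fraction of the median.
    """
    typical = median(avg_lengths)
    flagged = []
    for index, value in enumerate(avg_lengths):
        if abs(value - typical) > tolerance * typical:
            flagged.append((index, value))
    return flagged


# Invented example: the fourth interval looks like error pages, not map images.
lengths = [50000, 51000, 49500, 1200, 50500]
print(flag_content_length_drift(lengths))
```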

Test results are not a guarantee of the support for X number of users

Test results only validate the tested workflow. This tested workflow will show throughput for a specific type of request with a corresponding response time. It does not promise or guarantee that the deployment will support X number of users.

Remember, the definition of a user can vary, and can mean different things for different deployments.

Avoid testing shared resources like ArcGIS Online or Google Maps

Free and public service offerings from ArcGIS Online or Google Maps are there for the “community”. Such resources are quite robust and scalable but cannot be performance tuned with respect to every user.

Since they are not directly part of an on-premise deployment, they should be considered an “external” resource. As a result, requests to them should be removed from a load test as the test should focus solely on the capabilities of its own hardware.

Repeatable test results

If the results for a load test against an ArcGIS Server service show similar trend lines across test runs (e.g. the same throughput is achieved around the same point during the test), the resource is generally considered “stable”. Being able to repeat the results for a test is a good characteristic.
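
One simple way to put a number on repeatability is to compare the throughput reached at the same step across several runs. The sketch below uses the coefficient of variation (standard deviation divided by the mean); the 5% cutoff is an assumption for illustration, not an established threshold.

```python
from statistics import mean, pstdev


def throughput_variation(tps_by_run):
    """Coefficient of variation of throughput measured at the same step across runs."""
    return pstdev(tps_by_run) / mean(tps_by_run)


def is_repeatable(tps_by_run, max_variation=0.05):
    """True when run-to-run variation stays within the chosen cutoff."""
    return throughput_variation(tps_by_run) <= max_variation


# Invented throughput values (transactions/sec) from three runs of the same test.
runs = [15.8, 16.1, 16.0]
print(f"variation: {throughput_variation(runs):.1%}, repeatable: {is_repeatable(runs)}")
```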

When results are not immediately repeatable, the tester/analyst must look deeper and try to understand the inconsistent behavior. It could be that the hardware was being used to service requests other than the test (e.g. another user on the system). Or, if the deployment was on shared infrastructure (e.g. virtualization), the underlying hardware was being utilized for another purpose (other virtual machines were executing resource intensive tasks). For cases such as this, conducting the load tests during non-peak hours might yield more reproducible results showing that the service has the potential to be stable.

Design of tests/workflows should be realistic and based on what would be expected from a user

Avoid theories and projections; just focus on how the user should be using the application...the anticipated workflow.
Load testing can be easy to do but it can also be easy to expand the testing scope to include unnecessary or unlikely scenarios.

Understand the value

Many times, the path to a good and useful test is as important as the test itself. As an analyst this helps you:

a) Validate the testing procedures

b) Explain the results as fully as possible

Both, in turn, make your test valuable:

1) Some individuals will ask not just for results but for analysis and conclusions as well

2) Be prepared to back up these conclusions with data

Keep it simple

Sometimes the most informative testing efforts are simple and not overly complex.