Troubleshooting File Shares and ArcGIS Enterprise

08-20-2021 05:15 PM
DannyKrouk
Esri Contributor

Introduction

ArcGIS Enterprise architectures often require file shares to store shared configuration files, content, or backups (https://enterprise.arcgis.com/en/server/latest/install/windows/choosing-a-nas-device.htm). This is particularly true when the deployment pattern involves multiple-machine sites (such as high-availability configurations). Each of the main ArcGIS Enterprise components (Portal for ArcGIS, ArcGIS Server, and ArcGIS Data Store) uses file shares in some way.

In the diagram below, the “Shared Content” and “Shared Config-store and Directories” are located on a file share:

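[Diagram: ArcGIS Enterprise deployment in which the “Shared Content” and the “Shared Config-store and Directories” are located on a file share]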

File shares come in many types and varieties, ranging from physical hardware devices to virtual file systems or other providers. Shared file systems provide an efficient way of sharing content between multiple components, but they can also introduce challenges related to performance, permissions, or file-level consistency.

When there are problems in an ArcGIS Enterprise deployment and you suspect they may be related to shared storage, it can be difficult to know how to assess your architecture and what to do about it. Solving a file share problem is challenging in practice, but your chances of success are much better if you establish a specific measurement that you can use to define the suspected problem, create a baseline for it, and test whether a change resolves it. This article is designed to help you determine whether there is a file share problem in your ArcGIS Enterprise deployment, what symptoms and indicators point to this type of problem (its signature), and how to investigate the root cause.

While similar principles apply to Linux/NFS and Windows/SMB systems, the details of this article focus on Windows/SMB systems.

How does ArcGIS Enterprise use a file share?

There are several ways in which ArcGIS Enterprise uses a file share, including the following:

  1. A registered data repository in an ArcGIS Server site
  2. A shared backup location for ArcGIS Data Store backups
  3. A location for storage and extraction of WebGISDR backups for disaster recovery workflows
  4. A location to store the “config-store” and “server directories” of a multiple-machine ArcGIS Server site
  5. A location to store the “content directory” of a highly available ArcGIS Enterprise portal site

The latter two are the focus of this article; in these cases, the file share plays a role in how each machine in the multiple-machine site knows what is going on. Metaphorically, the file share acts like part of the “nervous system” of the ArcGIS Enterprise portal or ArcGIS Server site when it is supporting the “config-store”, “server directories”, or “content directory”. So, if there is a problem with the file share, the ArcGIS Server or Portal for ArcGIS site can display a wide variety of intermittent symptoms, such as trouble publishing services or server “instability”.

Recognizing and investigating a problem related to a file share

Analyzing a potential file share problem may begin as a solo effort, examining the software logs for each ArcGIS software component. However, if you find evidence there that points to issues related to file access, you will need to work with other information sources or software components. This will likely mean working with other people, as the required access privileges and knowledge are rarely concentrated in a single person.

This article describes what you can pursue with different sets of privileges and knowledge. We start with the logs in ArcGIS Enterprise for two reasons: first, we assume that you have those privileges—and second, this is where you make the initial determination of whether there is reason to believe there is a file share problem and what that reason is.

Foundations

Before you begin, set yourself up to be as productive as possible by choosing an appropriate environment, controlling for complexity, and assembling the right team.

Environments

You should try to make use of “lower environments” (in other words, not your production environment) if you can observe the problems in those systems. Whatever problem you see in production, use the ArcGIS Server logs to understand its signature. Finding that signature is discussed below. Then, look in your staging, testing, or UAT environment to see if you can find the same signature. Note that you may have to generate some activity in that environment to see the problem (many problems do not present when the system is idle).

If you can see the problem in the lower environment, that is the environment in which you should investigate. Since you have more control over the level of activity in that environment, you will be less distracted by unrelated factors. And, because it is more practical to know what is happening, you can form better ideas about potential causes. Finally, when it comes to making changes for troubleshooting, you should always make them in lower environments first, when possible, to avoid unnecessary disruptions to users.

Complexity

You want to remove as much complexity as possible without changing your basic configuration. Working in a lower environment helps to reduce complexity. But when you are working with multiple-machine sites, you still have multiple places to look for any given issue.

A frequently effective practice is to shut down the redundant machines in the site. For example, if there are three machines in an ArcGIS Server site, turn off (the OS) or stop (in the site) two machines. The remaining machine is still using the file share for the same purposes, and many problems will continue to express themselves. If shutting down the redundant machines causes the problem to disappear, this is an important clue about the nature of the problem. Specifically, it tells you that the problem has to do with concurrent client access, and perhaps not the network.

Involve others

When troubleshooting issues that span multiple environments, you may run into problems that exceed your access privileges or experience. While privileges can be granted temporarily, experience in those domains is just as valuable. Setting yourself up with a dedicated problem-solving team will ensure that you preemptively have resources on hand when questions come up.

A frequent source of dysfunction in this kind of virtual team effort is getting everyone to understand how they can make a meaningful contribution. Network administrators and file share administrators typically know little about Esri software and are not likely incentivized to learn a lot about the applications on their infrastructure. However, they typically have curious minds and like to solve problems. If you present them with an open-ended question (“Is the network having problems?”) or a prematurely broad accusation (“The network is causing us problems”), you are unlikely to get interesting responses. On the other hand, if you ask evidence-driven, specific questions, your odds of a productive response go up. For example, “We see ‘connection time out’ messages in our application logs at times X, Y, and Z. It seems to be happening every 2 to 3 hours. Would you be able to capture traffic for that period of time and help us understand what is happening with the connections?”

Evidence-based hypotheses and investigation

Whether you are trying to be effective on your own or as part of a virtual team, an evidence-based approach is the gold standard for making progress. Some cornerstones of an evidence-based approach are as follows:

  1. Make your initial observations before making any changes to the system, and establish that these observations are consistent.
  2. Start with specific observations about a certain workflow, request, or operation.
  3. Find repetition or repeatability. You don’t want to chase outliers or false alarms.
  4. Document as you go. You want to be able to go back and confirm details of your observations and share your observations with others.
  5. Create more than one hypothesized cause for each problem. Your first idea is rarely the answer; create several, pursuing them one at a time.
  6. Identify the evidence that could invalidate (or support) a hypothesis and then pursue that evidence.
  7. Leverage the expertise of others. One of the best ways to get participation from an expert is to ask them to show off their expertise by explaining the possible meanings of a specific observation. It is a double win: you advance your understanding, and you make them more likely to want to help you going forward.

What you can learn as an ArcGIS Enterprise administrator: The signature

The logs in ArcGIS Enterprise (ArcGIS Enterprise portal logs and ArcGIS Server logs) are where you identify your problem signature. It is the measurement you will use to determine whether you are having a file share problem and whether a change you make actually resolved it.

When an ArcGIS Enterprise component “talks” to a file share, it is reading and writing file objects, often referred to as Input/Output or I/O. And, it is doing so as a specific account (the service account), so you can have permissions problems and file system problems. 

Permissions problems are usually quite recognizable in the log messages and relatively straightforward to address. For example, the log message “Cannot write to directory path ''{0}''. Please check that the location is valid and that the ArcGIS Server account has permissions to the location.” (code 6697) describes the cause and provides an idea for the solution. Note that effective permissions will involve both those for the share itself and the files and directories exposed through the share.

[Screen captures: Share Permissions and File and Directory Permissions dialogs]

Log messages that mention a path related to the file share, I/O, or an IOException likely mean that permissions are not the issue. At the end of this article, there is an appendix with a partial list of log codes and message types that are correlated with file share problems. Correlation is foundational for establishing causation. If you see messages like these, it’s a good indication that you need to look more closely at the file share and/or the network pathway to it, though not irrefutable proof that the file share is at fault.   

The details of the log message types may obscure the high-level patterns you are looking for. A file share is a file system on the other side of a network, so if there’s a problem, there can be at least two types of sources: the file share solution itself or the network. The following are examples of messages that indicate a problem accessing files from a file share (some information removed for clarity or privacy):

| Enterprise Component | Level | Code | Message | Notes |
| --- | --- | --- | --- | --- |
| Server | WARNING | 7721 | “failed to write heartbeat” | |
| Server | WARNING | 7712 | “An error was encountered while synchronizing with the config store” | |
| Server | SEVERE | 6561 | “Failed to return all folder configurations” | |
| Server | SEVERE | 9000 | “Internal Server Error: Service <name> not found” | Note that a request for a service that does not currently exist will generate a message like this also. For this to indicate a problem with a file share, the service must actually exist in the site. |
| Server | SEVERE | 6605 | “Failed to return all services configurations in the folder … (The system cannot find the file specified)” | |
| Server | SEVERE | 6652 | “Unable to read the service … from the configuration store … (The system cannot find the file specified)” | |
| Server | SEVERE | 6566 | “Failed to retrieve the status of the service … (The system cannot find the file specified)” | |
| Server | SEVERE | 9015 | “Error getting list of services. … (The system cannot find the file specified)” | |
| Server | SEVERE | 6615 | “Unable to retrieve 'Permissions' resource information. …” | This message is not exclusive to file share issues, but it can be associated with file share issues. |
| Enterprise portal | SEVERE | 218037 | “The Portal site has been initialized and configured but is currently not accessible because the content directory is not available …” | |

Which of these messages might lead you to focus on the network and which might lead you to focus on the file share? None of these messages involve phrases like “connection timed out,” “connection was forcibly closed,” or “timeout.”  This absence suggests that there was not a problem connecting to the file share; there was a problem after that point. This points you to the file share as the next place you should focus your attention.

Conversely, if you do see a phrase like “connection timed out,” this is an indication that you should investigate your network, because connecting is what networks do.

Your Esri server log investigation should be able to establish 3.5 things:

  1. Whether there are messages that are specific to the file share locations supporting the config-store, server directories, or the content directory.
  2. How frequently these messages occur.
    1. Ideally, whether they are correlated with any particular activity (a pattern of requests or use of the software).
  3. Whether there is evidence in the messages that the problem has to do with connecting (the network).

This is your signature. When you have a signature, you can begin to look at other information sources for events that are coincident in time.
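As a rough illustration of turning the log review into a measurement, the following Python sketch counts signature messages per hour and per code, and flags messages that contain a connection-related phrase. It is a minimal sketch under stated assumptions: the `<Msg time="…" type="…" code="…">` line layout and the function name `summarize_signature` are assumptions made for illustration, so adjust the pattern to match what your own log files actually contain. The codes and phrases come from the table and discussion above.

```python
import re
from collections import Counter

# ASSUMPTION: each log record starts with attributes in this order and form.
# Verify against your own log files and adjust the pattern if needed.
MSG_PATTERN = re.compile(
    r'<Msg time="(?P<time>[^"]+)" type="(?P<level>[^"]+)" code="(?P<code>\d+)"'
)

# Codes listed in the table above. Code 9000 only counts as evidence
# when the named service actually exists in the site.
SIGNATURE_CODES = {
    "7721", "7712", "6561", "9000", "6605",
    "6652", "6566", "9015", "6615", "218037",
}

# Phrases that point toward the network rather than the file share.
NETWORK_PHRASES = ("connection timed out", "connection was forcibly closed", "timeout")


def summarize_signature(lines):
    """Return (Counter keyed by (hour, code), number of network-flavored messages)."""
    per_hour = Counter()
    network_hits = 0
    for line in lines:
        match = MSG_PATTERN.search(line)
        if not match or match["code"] not in SIGNATURE_CODES:
            continue
        hour = match["time"][:13]  # "YYYY-MM-DDTHH" when time is ISO-like
        per_hour[(hour, match["code"])] += 1
        if any(phrase in line.lower() for phrase in NETWORK_PHRASES):
            network_hits += 1
    return per_hour, network_hits
```

Running the same summary before and after a change, under a comparable workload, gives you a baseline and a comparison rather than an impression.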

What you can learn with a local machine administrator

We go to other log sources to look for the root cause, or the “why.” In the case of a file share-related problem, the Esri system is the victim, so the Esri logs are unlikely to offer insight into the root cause.

The local machine is the machine (or machines) on which the Esri software runs. It is the client to the file share server. Logs from the local machine allow you to learn whether there is a machine-specific problem or a problem that is detectable by the local machine. 

A local machine administrator has access to the Event Viewer logs. These logs contain a wealth of information from various subsystems of the Windows OS. You want to take the timestamps from your Esri server log observations and see if there are interesting records in a few categories of logs from the Event Viewer system. In most circumstances, you want to look at the Windows System, Security, and Application categories. The screen capture below illustrates the log filtering capabilities that allow you to focus on types of messages for time periods of interest for these main Windows logs:

[Screen capture: Event Viewer filter settings for event level and time range]

Because you are focused on file shares, there is an additional source of potential information buried a little more deeply in the Event Viewer log tree hierarchy. If you browse the tree structure on the left to Applications and Services Logs > Microsoft > Windows > SMBClient, there are three more logs (Connectivity, Operational, and Security) that relate specifically to SMB (SMB is the protocol for communicating with the file share). The screen capture below illustrates some log content with an error that, while not common, would certainly cause a problem for an Esri server:

[Screen capture: SMBClient log entry showing an error]

Whether you are looking at the main Windows logs or the SMB Client logs, your primary focus should be on messages that are classified as Critical, Warning, or Error. You should filter based on these event levels and time.

What can these logs tell you? 

  1. Does the local machine detect a problem?
    1. If you do not find thematically related errors in the main Windows logs for the timestamps from your Esri server logs, then you have some evidence that the problem is not due to a problem that is client-machine specific. While this may not be enough evidence to completely rule out the possibility, you have taken important steps toward that objective, and you can prioritize your efforts elsewhere.
    2. If you do find interesting errors on the local machine, you prioritize your efforts on understanding and trying to resolve those. 
  2. Does the SMB protocol detect a problem?
    1. If you do not find errors in the SMBClient logs that match the timestamps, you can infer that the problem is not an error as far as the SMB protocol is concerned. For example, if you are investigating messages that indicate a file cannot be found, the SMB protocol is not going to classify that as an error; for all it knows, that file doesn’t exist. In that case, the protocol functioned correctly and returned a correct piece of information.
    2. On the other hand, if you see an error such as the one shown above, it makes sense to focus the investigation on the SMB protocol itself.

Finding thematically related errors in these logs is often very meaningful to other specialists who are unfamiliar with Esri logging but more likely to understand or trust operating system logging.
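Matching timestamps across log sources by eye gets tedious quickly. The sketch below shows one way to find events from another source (for example, records exported from Event Viewer) that fall within a time window of your Esri log signature timestamps. The function name `coincident_events` and the 60-second default window are illustrative assumptions, and the sketch assumes that both sets of timestamps have already been converted to the same time zone and that the machine clocks are reasonably synchronized.

```python
from datetime import datetime, timedelta


def coincident_events(signature_times, other_events, window_seconds=60):
    """Pair each signature timestamp with other-source events close to it in time.

    signature_times: iterable of datetime objects from the Esri logs.
    other_events: iterable of (datetime, description) tuples from another log.
    Returns a list of (signature_time, event_time, description) tuples.
    """
    window = timedelta(seconds=window_seconds)
    matches = []
    for signature_time in signature_times:
        for event_time, description in other_events:
            # Keep events on either side of the signature timestamp.
            if abs(event_time - signature_time) <= window:
                matches.append((signature_time, event_time, description))
    return matches
```

An empty result is informative too: it is the “no thematically related errors at these timestamps” evidence described above.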

What you can learn with other specialists

Many file share problems will not present in the Event Viewer logs of the client OS. Most of the other sources of information require elevated privileges (beyond Esri server administration or local machine administration), specialized knowledge, or both. So, to make progress in this space, you must employ two important attributes: (a) your winning personality and (b) your commitment to evidence-based investigation.

Networking people

If you have the Esri application log signatures that suggest a network-related issue or failure, you will want to investigate in this space.

Most networking problems that relate to file shares will not express themselves very clearly in classic network monitoring or logging solutions. Network monitoring often focuses on capacity, throughput, and quality of service. That is useful for network planning and administration overall but not necessarily helpful for troubleshooting specific connections. A firewall’s log of denials is worth consulting, but it is not commonly where file share problem causes are found.

The answers are typically found by capturing network traffic during a short period of time when the problem occurs or is manually triggered. Capturing an intermittent problem is not as hard as it may seem. You can set up a ring buffer capture that writes continuously to a fixed set of files, overwriting the oldest as time passes. Your organization’s network team, in addition to having the tools and privileges, will likely know how to do that. So, you don’t need to know exactly when the problem will next occur.

In many cases, a ring buffer can persist many hours’ worth of network capture data on a rolling basis. When the problem does occur again, ask the network team to stop the trace. Then, use the timestamps from the Esri server logs to investigate the network capture information. With some solid network foundational knowledge (from the network team or you) and some dedication, a lot can be learned. Even if your team doesn’t have much network analysis knowledge, you can look for anomalies. Either there are or there aren’t abnormal network characteristics at the timestamps in question. If there is something abnormal or suspicious, additional experts can be brought in as needed. Specificity will likely be attractive to these experts.
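To make the ring buffer idea concrete, here is a small conceptual model in Python. It does not capture any traffic; it only illustrates how a fixed number of capture files, each covering a fixed duration, yields a rolling window of retained history. The class name, the file naming scheme, and the per-file duration are all hypothetical; your network team's capture tool will have its own options for these.

```python
from collections import deque


class CaptureRing:
    """Conceptual model of a ring buffer capture: fixed file count, oldest overwritten first."""

    def __init__(self, max_files, seconds_per_file):
        self.seconds_per_file = seconds_per_file
        self.files = deque(maxlen=max_files)  # deque drops the oldest entry when full
        self.sequence = 0

    def rotate(self):
        """Start a new capture file; return (new_name, overwritten_name_or_None)."""
        overwritten = self.files[0] if len(self.files) == self.files.maxlen else None
        self.sequence += 1
        name = f"capture_{self.sequence:05d}.pcap"  # hypothetical naming scheme
        self.files.append(name)
        return name, overwritten

    def retained_seconds(self):
        """How much history is currently available for analysis."""
        return len(self.files) * self.seconds_per_file
```

The practical point is the arithmetic: the retained window must be longer than the time it takes someone to notice the problem and ask for the capture to be stopped.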

File share people

If your Esri server log signatures do not suggest a network issue, you will want to investigate this space. File share solutions, like most IT systems, typically have logs. A review of these logs at the times in question is appropriate. If there are thematically related errors for the timestamps in question, you have a likely cause to investigate further. And, you have your file share administrative team (and their vendor’s support organization) to lead the way.

It is also possible that the file share logs do not have related errors. If the accumulated evidence has ruled out the other likely sources, and the file share logs do not indicate a problem, there is a remaining possibility. It is possible that the file share and the Esri software classify problems differently. For example, if a file share is designed or configured to provide eventual consistency, it will not record errors about that. But we know that ArcGIS Server is expecting immediate consistency. So, the file share sees no error, but ArcGIS Server is not getting what it needs.

Let’s explore this idea a bit more, as it does come up, using the example of a two-machine ArcGIS Server site with the configuration store and server directories stored on a file share. If machine A writes to a file, ArcGIS Server site machine B should be able to read that new information right away. That is immediate consistency: read after write for any client to the file share. So, if ArcGIS Server is saying, “Hey, I can’t find that file,” or “That file looks different than I thought,” and the file share is saying, “I don’t know of any problems,” what is your resulting hypothesis? Having ruled out other likely causes, your resulting hypothesis is that the file share is not providing “read after write” consistency. No other idea fits the data nearly as well.
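The read-after-write expectation can be probed directly. The following Python sketch is one possible approach, not an Esri tool: machine A writes a unique token to a file on the share, and machine B measures how long it takes before it can read that same token. The function names, the probe file name, and the timeout values are assumptions chosen for illustration. Run it against a scratch directory on the share, not inside the config-store or server directories.

```python
import os
import time
import uuid


def write_marker(directory, name="raw_probe.txt"):
    """Run on machine A: write a unique token to the share and return it."""
    token = uuid.uuid4().hex
    path = os.path.join(directory, name)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(token)
        handle.flush()
        os.fsync(handle.fileno())  # ask the OS to push the data out before returning
    return token


def read_marker(directory, expected_token, name="raw_probe.txt",
                timeout_seconds=5.0, poll_interval=0.05):
    """Run on machine B: return seconds until the token is visible, or None on timeout."""
    path = os.path.join(directory, name)
    start = time.monotonic()
    while True:
        try:
            with open(path, "r", encoding="utf-8") as handle:
                if handle.read() == expected_token:
                    return time.monotonic() - start
        except FileNotFoundError:
            pass  # the file is not visible to this client yet
        if time.monotonic() - start >= timeout_seconds:
            return None
        time.sleep(poll_interval)
```

With immediate consistency, machine B should see the token on its first read. A measurable delay, a stale token, or a timeout is evidence for the hypothesis, not proof; repeat the probe many times and record the results alongside your signature.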

What do you do with that? This is another investigation with the file share administration team. But, instead of focusing on an error in the file share logs, this investigation seeks to understand whether multiple clients looking at the same file at the same time always see it in exactly the same state. No client can see the before state as valid after the file has been changed by another client. And no client can change a file when a different client has an exclusive lock on it. Uh-oh. We said the “L-word” (lock). That brings up one additional term or concept that you may have heard discussed in the past: opportunistic locking of files, also referred to as Oplocks.

Is the problem Oplocks?

While it is certainly possible that Oplocks are causing a problem for your system, for the problems we are talking about here the odds are against it. But, as evidence-based decision making is the way to make progress, you can use evidence to prove whether it is a cause.

It is worth thinking about what Oplocks, and the alternatives, are. Oplocks are opportunistic locks. The client (the Esri system) assumes, optimistically, that the files it knows about on the file share are unchanged and/or safe to change unless it hears from the server (the file share). The opposite of opportunistic locking is pessimistic locking (no locking is not an option). In that case, the client assumes, pessimistically, that other clients could be using the files it knows about on the file share, so it checks before doing anything. Opportunistic locks are a great strategy when most files are not being accessed by more than one client at any given time. Pessimistic locks are a better strategy when files are frequently accessed by more than one client at a time. What do we know about a multiple-machine ArcGIS Server or Portal for ArcGIS site? There are at least two clients accessing the same files at the same time.

So, Oplocks are suboptimal for Esri server use cases. But are they the cause of your problem? You don’t know that yet. Since you have your problem signature from your Esri logs, you can change the Oplocks settings and see whether the problem signature changes (the presence of the message and its frequency, given the same workload). These parameters can be changed in the client OS (the machines on which the Esri software runs) or the file share solution, depending on your configuration.

Here is how to do it if you are a local machine administrator for the client OS. Open a PowerShell command window “as Administrator” and note the current settings:

Get-SmbClientConfiguration

Then, you can change them. The following commands will disable all client-side caching (including Oplocks):

Set-SmbClientConfiguration -OplocksDisabled $true

Set-SmbClientConfiguration -UseOpportunisticLocking $false

Set-SmbClientConfiguration -DirectoryCacheLifetime 0

Set-SmbClientConfiguration -FileNotFoundCacheLifetime 0

Set-SmbClientConfiguration -FileInfoCacheLifetime 0

You can then inspect these settings with this:

Get-SmbClientConfiguration

Note that the changes will take effect the next time the client establishes a new connection to the file share. Restarting the Esri server software would cause that to happen right away.

Once you have changed your settings, return to the Esri logs to see whether the signature (the error message information and frequency for the same workload) has been addressed. If it has, you can celebrate. If it has not, then what?

Is the problem something else?

Yes—if you are at this point, the problem is something else.

You have disproven all reasonable hypotheses thus far. You are left with a diagnosis by exclusion. That diagnosis is that the file share solution (whether by design or not) is not providing immediate consistency. 

In this circumstance, it can be useful to have corroborating evidence. A very good way to get it is to compare with a different file share solution. While a file share from a Windows virtual machine may not be a solution that you or your organization want to adopt permanently, it can be very useful to an investigation. Esri does not have empirical evidence of file shares from a single Windows machine producing immediate consistency problems. There is no strong theoretical basis for it. And, if you turn off Oplocks as described above, you neutralize the weak theoretical bases also. The Windows machine file share would need to be configured with the same SMB parameters as the original file share. PowerShell’s Set-SmbServerConfiguration command can be used to match most parameters that the file share administration team would indicate from their solution.

In any event, if you point your Esri server(s) at the Windows machine file share and the problem signature disappears, your diagnosis by exclusion has been corroborated. From there, the file share solution team can decide whether they want to try to match the immediate consistency characteristic or indicate that they do not want to provide such a service. So, you either have a solution or an answer. 

In conclusion

Investigating a suspected file share problem is one of the more challenging troubleshooting endeavors in the ArcGIS Enterprise administration space. This article does not aim to make you, the individual reader, a successful solo practitioner in this space; rather, the objective is to give you some foundational information and a process. If you carefully execute this process, you can bring in many different specialists to help you reach the solution. In addition to involving the related domain experts from your own organization, you can bring in Esri Technical Support and/or Professional Services. With time, careful execution and quality team participation can correctly diagnose problems in this space.

Appendix: Log messages that are correlated with file share issues

The information below is a partial list of error codes and messages that may indicate a file share issue in your ArcGIS Enterprise system.

Warning and severe log messages

At the default log level, the following codes and messages usually indicate a problem with a file share (in a multiple-machine site). It is often useful to look at the messages immediately before and after. When you do that, you want to use the machine, process, and thread fields to recognize related messages. Since there are many machines, processes, and threads, the immediately neighboring messages, by time, may be from some other process.

| Enterprise Component | Level | Code | Message | Notes |
| --- | --- | --- | --- | --- |
| Server | WARNING | 7721 | “failed to write heartbeat” | |
| Server | WARNING | 7712 | “An error was encountered while synchronizing with the config store” | |
| Server | SEVERE | 6561 | “Failed to return all folder configurations” | |
| Server | SEVERE | 9000 | “Internal Server Error: Service <name> not found” | Note that a request for a service that does not currently exist will generate a message like this also. For this to indicate a problem with a file share, the service must actually exist in the site. |
| Server | SEVERE | 6605 | “Failed to return all services configurations in the folder … (The system cannot find the file specified)” | |
| Server | SEVERE | 6652 | “Unable to read the service … from the configuration store … (The system cannot find the file specified)” | |
| Server | SEVERE | 6566 | “Failed to retrieve the status of the service … (The system cannot find the file specified)” | |
| Server | SEVERE | 9015 | “Error getting list of services. … (The system cannot find the file specified)” | |
| Server | SEVERE | 6615 | “Unable to retrieve 'Permissions' resource information. …” | This message is not exclusive to file share issues, but it can be associated with file share issues. |
| Enterprise portal | SEVERE | 218037 | “The Portal site has been initialized and configured but is currently not accessible because the content directory is not available …” | |

As should be evident from this table, most of the messages mention something about a file or directory that cannot be found.
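The earlier advice about using the machine, process, and thread fields to find related messages can be sketched in Python as follows. This is an illustration under assumptions: it presumes each log record is a single line whose opening tag carries `machine="…"`, `process="…"`, and `thread="…"` attributes, and the function names are hypothetical. Check your own log files and adjust the parsing accordingly.

```python
import re

ATTRIBUTE_PATTERN = re.compile(r'(\w+)="([^"]*)"')
KEY_FIELDS = ("machine", "process", "thread")


def parse_attributes(line):
    """Return the attributes of the record's opening tag as a dict."""
    # ASSUMPTION: attributes live before the first '>' on the line.
    opening_tag = line.split(">", 1)[0]
    return dict(ATTRIBUTE_PATTERN.findall(opening_tag))


def related_messages(lines, target_index, radius=5):
    """Return nearby lines that share machine, process, and thread with the target line."""
    target = parse_attributes(lines[target_index])
    key = tuple(target.get(field) for field in KEY_FIELDS)
    start = max(0, target_index - radius)
    end = min(len(lines), target_index + radius + 1)
    related = []
    for index in range(start, end):
        if index == target_index:
            continue
        candidate = parse_attributes(lines[index])
        if tuple(candidate.get(field) for field in KEY_FIELDS) == key:
            related.append(lines[index])
    return related
```

Filtering this way keeps the neighbors that belong to the same operation and drops messages that are adjacent in time but come from some other process.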

Debug level

At the DEBUG logging level, there are more messages available that can provide more evidence. Thus, if you are looking for a potential file share problem, it may be useful to turn on the DEBUG log level for a period of time. These debug-level messages would mostly be useful to support (or not) hypotheses formed from other log levels or sources and add potentially relevant detail. By themselves, they would not be an actionable indication of a file share problem; there are other plausible interpretations.

  • SITE_IN_USE=ArcGIS Server Site is currently being configured by another administrative operation. Please try again later. - (6850)
  • MACHINE_BEING_CONFIGURED=Server machine ''{0}'' is currently being configured by another administrative operation. Please try again later. - (6851)
  • CONFIG_STORE_BEING_ACCESSED=Another administrative operation is currently accessing the store. Please try again later. - (6852)
  • CLUSTER_BEING_CONFIGURED=Cluster ''{0}'' is currently being configured by another administrative operation. Please try again later. - (6853)
  • CLUSTER_MACHINE_BEING_CONFIGURED=Cluster ''{0}'' - Machine ''{1}'' is currently being configured by another administrative operation. Please try again later. - (7506)
  • CLUSTER_RESOURCE_BEING_CONFIGURED=Cluster ''{0}'' resource is currently being configured by another administrative operation. Please try again later. - (7507)
  • METRIC_BEING_CONFIGURED=Metric ''{0}'' is currently being configured by another administrative operation. Please try again later. - (6614)
  • SERVICE_BEING_CONFIGURED=Service ''{0}''.''{1}'' in folder ''{2}'' is currently being configured by another administrative operation. Please try again later. - (6854)
  • SERVICE_CONFIG_BEING_WRITTEN=Another administrative operation is currently writing the service configuration. Please try again later. - (6859)
  • FOLDER_IN_USE=folder ''{0}'' is currently used by another administrative operation. Please try again later. - (6877)

Some of these messages may also appear on DEBUG code 9999. 
