
A Fun Threading Situation With da.Walk

HaydenWelch
MVP Regular Contributor

So I have some library code that handles loading file geodatabases and indexing the contained data. The results are grouped into dictionaries by datatype, so I call da.Walk iteratively to extract each datatype from the dataset.

This can of course be pretty slow, especially when loading in a database that's on a network fileserver. No problem though, this is something that can be solved pretty simply by creating some threads using concurrent.futures.ThreadPoolExecutor:

from concurrent.futures import ThreadPoolExecutor, as_completed
from arcpy.da import Walk
from pathlib import Path

def walk(ds: str, dtype: str | None = None) -> list[Path]:
    """walk a dataset filtering on the supplied datatype"""
    paths: list[Path] = []
    for root, _, items in Walk(ds, datatype=dtype):
        for itm in items:
            paths.append(Path(root)/itm)
    return paths

def extract_types(ds: str, dtypes: list[str]) -> dict[str, list[Path]]:
    """Extract paths from a dataset grouped by type"""
    data: dict[str, list[Path]] = {}
    with ThreadPoolExecutor(max_workers=len(dtypes)) as executor:
        futures = {executor.submit(walk, ds, dtype): dtype for dtype in dtypes}
        for future in as_completed(futures):
            data[futures[future]] = future.result()
    return data

 

Seems simple enough: spool up one thread per Walk call and collect the results as they complete so the walks run concurrently. Let's run it:

>>> extract_types("My_GDB", ['FeatureClass', 'Table'])
{'FeatureClass': [], 'Table': []}

 

Hmmm. There's no output, but there are definitely both tables and feature classes in that gdb... Let's try a synchronous extract method:

def extract_types_sync(ds: str, dtypes: list[str]) -> dict[str, list[Path]]:
    """Extract paths from a dataset grouped by type"""
    return {
        dtype: walk(ds, dtype)
        for dtype in dtypes
    }

 

And run that:

>>> extract_types_sync("My_GDB", ['FeatureClass', 'Table'])
{'FeatureClass': [Path("My_GDB/FC1"), Path("My_GDB/FC2")], 
'Table': [Path("My_GDB/Table1"), Path("My_GDB/Table2")]}

 

Okay, so there IS data in the database, and Walk is able to find it. Let's try the concurrent version again:

>>> extract_types("My_GDB", ['FeatureClass', 'Table'])
{'FeatureClass': [Path("My_GDB/FC1"), Path("My_GDB/FC2")], 
'Table': [Path("My_GDB/Table1"), Path("My_GDB/Table2")]}

 

So now the concurrent version is able to find the data, but only **after** running a Walk synchronously? This little bug persists across interpreter sessions, it seems. So let's see if warming up the Walk function can fix it:

def extract_types(ds: str, dtypes: list[str]) -> dict[str, list[Path]]:
    """Extract paths from a dataset grouped by type"""
    for _ in Walk(ds): break  # warm up: run one Walk step on the main thread first
    data: dict[str, list[Path]] = {}
    with ThreadPoolExecutor(max_workers=len(dtypes)) as executor:
        futures = {executor.submit(walk, ds, dtype): dtype for dtype in dtypes}
        for future in as_completed(futures):
            data[futures[future]] = future.result()
    return data

 

And run one more time:

>>> extract_types("My_GDB", ['FeatureClass', 'Table'])
{'FeatureClass': [Path("My_GDB/FC1"), Path("My_GDB/FC2")], 
'Table': [Path("My_GDB/Table1"), Path("My_GDB/Table2")]}

 

Now it works! This is really odd though. I'm guessing that da.Walk relies on some global state that isn't initialized in a worker thread and must first be initialized on the main thread. This is definitely odd behavior, and I figured I'd share it here in case anyone else runs into it. I'm also curious how this pattern will behave when Python 3.14 is adopted and we have access to the InterpreterPoolExecutor. Will the arcpy global state need to be shared for functions as simple as da.Walk?

12 Replies
HaydenWelch
MVP Regular Contributor

I ended up doing a hybrid approach. Initially I was just counting the .gdbtable files in the gdb directory and switching modes depending on the count.
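
The file-count check described above can be sketched with pathlib. `count_gdbtable_files` is a hypothetical helper name, and the assumption is only that a file geodatabase stores each table as a numbered .gdbtable file:

```python
from pathlib import Path

def count_gdbtable_files(gdb: str) -> int:
    """Count the .gdbtable files directly inside a file geodatabase
    directory. Each table/feature class is stored as one numbered
    .gdbtable file, so the count approximates how many tables the gdb
    holds; it says nothing about how they're grouped into datasets."""
    return sum(1 for _ in Path(gdb).glob("*.gdbtable"))
```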

Seems that it all kinda falls apart with nested datasets though (which most of my databases are) so I went down the path of using the GDAL reverse engineering project.

Then I realized that you can literally just read the opening XML tag bytes line by line. Ideally I could parse the whole XML file, but since I really just need the paths (actual access to the features is done lazily), I can just strip the paths out.

Might not even need to worry about file locking since it loads the whole file in before it parses it.

May end up expanding on this in a submodule and actually properly parse the GDB using Python. It's kinda insane how a hacky 2 second pure Python solution is so much faster than every single official solution I tried.

(Describe is almost 10x slower than all of these options combined, and the List functions are about 2x slower when you're dealing with Datasets)
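
A minimal sketch of the raw-read idea, with loud caveats: both the catalog-table filename (`a00000004.gdbtable`, which holds GDB_Items in file geodatabases I've looked at) and the `<CatalogPath>` element are assumptions about what the XML definitions contain, not a documented API. Treat the whole thing as hypothetical:

```python
import re
from pathlib import Path

# ASSUMPTIONS: the system catalog table lives in a00000004.gdbtable and
# its XML definitions contain <CatalogPath> elements. Neither is a
# documented interface; this is a sketch of "read the bytes, strip the
# paths", not a proper GDB parser.
CATALOG_TABLE = "a00000004.gdbtable"
PATH_TAG = re.compile(rb"<CatalogPath>(.*?)</CatalogPath>", re.DOTALL)

def extract_paths(gdb: str) -> list[str]:
    """Pull catalog paths straight out of the raw catalog-table bytes."""
    raw = (Path(gdb) / CATALOG_TABLE).read_bytes()
    return [m.decode("utf-8", errors="replace") for m in PATH_TAG.findall(raw)]
```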

 

As to your Multiprocessing point, I try to avoid it since the process tree can play weirdly with exception flows. I've also had issues with multiprocessing kicking users out of their sessions or crashing because the session lock was in another process.

 

I may still give it a try tomorrow and see if it works, but I think the raw file read is probably the fastest solution since it has virtually no overhead.

 

Sidenote: I messed up my first 2 tables by a factor of 10; I forgot that timeit gives you total time for all runs, not the average.
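
For anyone else who hits this: timeit.timeit returns the total wall time for all `number` executions, so you have to divide by `number` yourself to get per-call time.

```python
import timeit

# timeit.timeit returns the TOTAL time for all `number` runs,
# not a per-call average; forgetting to divide inflates results.
number = 10_000
total = timeit.timeit("sum(range(100))", number=number)
per_call = total / number
print(f"total: {total:.4f}s  per call: {per_call * 1e6:.2f}us")
```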

JoshuaBixby
MVP Esteemed Contributor

OK, I think I finally understand (or mostly) what is going on with arcpy.da.Walk and multithreading. As you know, I created some synthetic datasets to represent a range of GDB structures, and have been testing those datasets on local SSD, LAN SMB (~1.25 ms), and WAN SMB (~20 ms).

arcpy.da.Walk is an os.walk-shaped generator (one __next__() per directory), and it holds some in-process lock during each __next__() call, releasing between yields. A probe driving two Walk generators from two threads (thread A, thread B) showed the two threads' iterations alternate perfectly in lockstep (B, A, B, A, ...) on a gdb with feature datasets, with ~80% wall-clock overlap. So threads do get scheduled and do share the work, but they share it by taking turns at the lock, not by running truly in parallel. That puts a ceiling on how much threading can help: total lock-holding time is bounded below by total work, so two threads can't beat one by much.
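
The turn-taking can be modeled in plain Python with a generator that holds a shared lock during each step. This is a toy model of the observed behavior, not arcpy itself; `locked_walk` and `shared_lock` are names I made up:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

shared_lock = threading.Lock()  # stand-in for the lock arcpy appears to hold

def locked_walk(name: str, steps: int, log: list[str]):
    """Toy generator: takes the shared lock for each step (like one
    __next__() per directory) and releases it between yields."""
    for i in range(steps):
        with shared_lock:
            log.append(name)  # "work" done while holding the lock
        yield i

def drive(name: str, log: list[str]) -> None:
    # Exhaust the generator, like a thread running one full Walk.
    for _ in locked_walk(name, 5, log):
        pass

log: list[str] = []
with ThreadPoolExecutor(max_workers=2) as ex:
    ex.submit(drive, "A", log)
    ex.submit(drive, "B", log)
print("".join(log))  # both threads finish, but only by taking turns at the lock
```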

The practical consequence is that how much threading wins depends on your gdb's directory structure. A flat gdb is one directory with one big yield, so two threads can't interleave at all; the second thread just waits until the first is done. A gdb with many feature datasets has many yield points, and the two threads can rotate through the lock productively, which is probably where some of your time savings is coming from in your tests.

In my testing, ProcessPoolExecutor gave a more predictable ~1.6-1.9× speedup at WAN latency regardless of gdb shape, because separate processes have separate arcpy state and don't share the lock at all. The tradeoff is the ~3s cost of spinning up each worker (fresh arcpy import, license check), which is fine if each gdb walk takes more than a few seconds but eats the benefit on fast walks. If you're walking many gdbs over a WAN, the biggest time savings is probably one process per gdb rather than threading within one gdb: that axis scales cleanly and doesn't depend on what any individual gdb looks like.
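
The per-gdb fan-out would look roughly like this. `walk_gdb` and `walk_many` are names I invented for the sketch, and the worker body is a placeholder so it runs without arcpy:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def walk_gdb(gdb: str) -> tuple[str, int]:
    """Per-gdb work, run in its own process. In real use this would
    import arcpy inside the worker and run da.Walk; each process has
    its own arcpy state, so workers never contend on the shared lock.
    Placeholder body so the sketch is runnable without arcpy."""
    return gdb, len(gdb)

def walk_many(gdbs: list[str]) -> dict[str, int]:
    """Fan out one worker per gdb (capped), paying the worker spin-up
    cost once per gdb rather than serializing the walks."""
    with ProcessPoolExecutor(max_workers=min(4, len(gdbs))) as ex:
        futures = [ex.submit(walk_gdb, g) for g in gdbs]
        return dict(f.result() for f in as_completed(futures))
```

On Windows (spawn start method), call walk_many under an `if __name__ == "__main__":` guard so the worker processes can re-import the module cleanly.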

HaydenWelch
MVP Regular Contributor

That's about what I expected after running it on your test databases. I still ended up just abandoning Walk entirely and switched over to the raw read of the gdb table. I included fallbacks in case that breaks, but I'm already seeing a consistent 100x speedup just parsing the XML GDBTable info in the a00..4.gdbtable file. I think Walk and List* are doing a lot more under the hood than I actually need since the initialization is inherently lazy, so just finding the path is enough for my needs.