DISCLAIMER: The information in this post shows a very hacky way of interacting with file geodatabases and should not be used unless you absolutely know what you're doing. The built-in functions for interacting with geodatabases have a lot of additional safeguards that are not implemented by this very simple parser code.
That being said, this is still a useful thing to know if you interact with geodatabases frequently: being able to inspect the system tables has allowed me to discover and fix a lot of small data corruption errors that can happen when a transaction fails or ArcGIS Pro crashes while writing to the database.
Recently I was running into odd behavior with da.Walk. After a bit of testing, Joshua and I determined that the built-in Describe/Walk functions are heavily blocking, even though all the information they return is directly available in the GDB_Items system table and can be accessed in a non-blocking way.
Every file geodatabase comes with a series of system tables that define the contents of the database; these live in the a...n.gdbtabl[e|x] files. Tables 1-7 are reserved for these system tables, which are defined as follows:
| File Name | Real Name |
| --- | --- |
| a00000001 | GDB_SystemCatalog |
| a00000002 | GDB_DBTune |
| a00000003 | GDB_SpatialRefs |
| a00000004 | GDB_Items |
| a00000005 | GDB_ItemTypes |
| a00000006 | GDB_ItemRelationships |
| a00000007 | GDB_ItemRelationshipTypes |
Table 4 (GDB_Items) contains all the information you need to Describe a geodatabase object and is what we will be targeting for the rest of this post.
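As a quick aside, those a00000001-style names are just the table's ID rendered as zero-padded hex, so mapping an ID to its pair of data files is trivial. Here's a small illustrative helper (not part of any API):

def table_file_names(table_id: int) -> tuple[str, str]:
    # e.g. 4 -> ('a00000004.gdbtable', 'a00000004.gdbtablx')
    base = f'a{table_id:08x}'
    return f'{base}.gdbtable', f'{base}.gdbtablx'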
Luckily we don't have to start from scratch: GDAL has an OpenFileGDB driver that is based on some reverse engineering work done by Even Rouault. This gives us the basic format of these gdbtable files, which means we can write something that parses them. Decoding shape data is a tad more complex than text and number fields, so we'll ignore geometry for now since we just want to emulate the behavior of da.Walk.
We first need to pull the information from the headers of the gdbtable and gdbtablx files to determine row count and field offsets. The gdbtablx file has a 16-byte header while the gdbtable file has a 40-byte header; these headers encode information about total row count, field byte offsets, and total block count.
Since we know the format of the different structures in the GDB files, we can go ahead and write a helper class that lets us decode these sequentially. We aren't looking to write any data back to the file in this exercise, so we don't need to worry about keeping track of where everything is.
from typing import Any, Literal, Self
from struct import Struct
from collections import deque

class MemoryReader:
    _struct_cache = dict[str, Struct]()
    __slots__ = 'view', 'index', 'byte_order', 'stack'

    def __init__(self, view: memoryview, start: int = 0,
                 *,
                 byte_order: Literal['@','=','<','>','!'] = '=',
                 stack_size: int = 10,
                 ) -> None:
        self.view = view
        self.index = start
        self.byte_order = byte_order
        self.stack = deque[Any](maxlen=stack_size)

    def __len__(self) -> int:
        return self.view.nbytes - self.index

    @property
    def last(self) -> Any:
        return self.stack[-1]

    def scan(self, n: int) -> Self:
        # Advance the read index by n bytes without decoding anything
        self.index += n
        return self

    def unpack(self, fmt: str) -> tuple[Any, ...]:
        if not any(fmt.startswith(c) for c in ('@','=','<','>','!')):
            fmt = self.byte_order+fmt
        struct = MemoryReader._struct_cache.setdefault(fmt, Struct(fmt))
        _sz = struct.size
        val = struct.unpack(self.view[self.index:self.index+_sz])
        self.scan(_sz)
        self.stack.extend(val)
        return val

    def int8(self, sign: bool=True) -> int:
        return self.unpack('b' if sign else 'B')[0]

    def int16(self, sign: bool=True) -> int:
        return self.unpack('h' if sign else 'H')[0]

    def int32(self, sign: bool=True) -> int:
        return self.unpack('i' if sign else 'I')[0]

    def int40(self, sign: bool=True) -> int:
        return self.int8(sign) | (self.int32(sign) << 8)

    def int48(self, sign: bool=True) -> int:
        return self.int8(sign) | (self.int40(sign) << 8)

    def int64(self, sign: bool=True) -> int:
        return self.unpack('q' if sign else 'Q')[0]

    def varint(self, sign: bool=True) -> int:
        neg = False
        if sign:
            val = self.int8(False)
            ret = (val & 0x3F)
            if val & 0x40:
                neg = True
            if not (val & 0x80):
                return -ret if neg else ret
            shift = 6
        else:
            shift = ret = 0
        while True:
            val = self.int8(False)
            ret = ret | ((val & 0x7F) << shift)
            if not val & 0x80:
                break
            shift += 7
        return -ret if neg else ret

    def float32(self) -> float:
        return self.unpack('f')[0]

    def float64(self) -> float:
        return self.unpack('d')[0]

    def read(self, length: int) -> bytes:
        val = self.view[self.index:self.index+length].tobytes()
        self.scan(length)
        return val

    def decode(self, length: int, encoding: str = 'utf-8') -> str:
        return self.read(length).decode(encoding)

This base MemoryReader class implements all the formats used by the FileGDB spec. It also includes a small stack of recently decoded values, which is useful when an earlier value encodes something like a string buffer length.
Since our MemoryReader advances through the view whenever we decode a value, we can write our decoders as a sequential series of reads. Passing False specifies that the numeric value is stored as unsigned (which most values are, since a negative string length or count doesn't make much sense).
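As a quick illustration (a hand-built buffer, not real GDB data), sequential reads consume the view and push values onto the stack:

buf = bytes.fromhex('03000000 0a000000')  # two little-endian uint32s
r = MemoryReader(memoryview(buf), byte_order='<')
assert r.int32(False) == 3
assert r.int32(False) == 10
assert r.last == 10   # the stack remembers recent values
assert len(r) == 0    # the whole view has been consumed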
Now that we have a good abstraction for reading the header data, we can go ahead and implement some functions that read it and return a mapping of keys to values:
def decode_tablx_header(tablx: bytes) -> dict[str, Any]:
    reader = MemoryReader(memoryview(tablx), byte_order='<')
    _version = reader.int32(False)
    if _version == 3:
        return {
            'version'    : _version,
            'blocks'     : reader.int32(False),
            'row_count'  : reader.int32(False),
            'byte_offset': reader.int32(False),
        }
    elif _version == 4:
        return {
            'version'    : _version,
            'blocks'     : reader.int32(False),
            '_unknown'   : reader.int32(False),
            'byte_offset': reader.int32(False),
            # Special case for 64bit OIDs
            'row_count':
                reader.scan(reader.last*1024*reader.stack[-3]).int64(False)
                if reader.stack[-3] else 0,
            'varying_section':
                reader.int32(False)
                if reader.stack[-4] else 0,
        }
    raise ValueError(f'Unknown GDB Version {_version} (must be 3 or 4)')

def decode_table_header(table: bytes) -> dict[str, Any]:
    reader = MemoryReader(memoryview(table), byte_order='<')
    _version = reader.int32(False)
    if _version == 3:
        return {
            'version'      : _version,
            'row_count'    : reader.int32(False),
            'row_size'     : reader.int32(False),
            'byte_offset'  : reader.int32(False),
            '_unknown'     : reader.int64(False),
            'file_size'    : reader.int64(False),
            'fields_offset': reader.int64(False),
        }
    elif _version == 4:
        return {
            'version'      : _version,
            'has_deleted'  : bool(reader.int32(False)),
            'row_size'     : reader.int32(False),
            'byte_offset'  : reader.int32(False),
            'row_count'    : reader.int64(False),
            'file_size'    : reader.int64(False),
            'fields_offset': reader.int64(False),
        }
    raise ValueError(f'Unknown GDB Version {_version} (must be 3 or 4)')

These functions take the raw bytes of the file and read the header data. The most important values here are row_count and byte_offset. Note that at this point we already have an incredibly fast way to inspect the number of rows in a GDB table: we only need to inspect the first few bytes of the gdbtablx file to find it. This can be orders of magnitude faster than functions like GetCount or sum(1 for _ in SearchCursor('table')).
Here's some testing on a table with ~150 features:

>>> %timeit arcpy.management.GetCount('table')
187 ms ± 2.95 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit sum(1 for _ in arcpy.da.SearchCursor('table', ['OID@']))
499 μs ± 14.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit decode_tablx_header(open('a..n.gdbtablx', 'rb').read())['row_count']
13.2 μs ± 131 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

We don't yet have a way to access the tables by name, but once we get the GDB_Items table decoded, we can bootstrap that process since it stores the table names and ids (a...n is the hex-encoded value of the object's FID stored in the GDB_Items table).
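Since only the header matters here, a helper that reads just the first 16 bytes avoids loading the whole file. A minimal sketch for version-3 tables (version 4 stores the row count past the block map, so it needs more of the file):

def fast_row_count(tablx_path: str) -> int:
    # Version-3 gdbtablx headers are 16 bytes; the row count is at bytes 8-12
    with open(tablx_path, 'rb') as f:
        return decode_tablx_header(f.read(16))['row_count']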
Now that we have the header info from the gdbtable and gdbtablx files, we can shift past them and get the field offsets and definitions that we will use later to decode the fields themselves:
def decode_field_descriptions(table: bytes) -> dict[str, Any]:
    fields_offset = decode_table_header(table)['fields_offset']
    reader = MemoryReader(memoryview(table), fields_offset, byte_order='<')
    return {
        'header_size': reader.int32(),
        'version'    : reader.int32(),
        'geom_info'  : reader.int32(False),
        'shape_type' : reader.last & 0xff,
        'has_m'      : bool(reader.last >> 24 & (1 << 6)),
        'has_z'      : bool(reader.last >> 24 & (1 << 7)),
        'columns'    : reader.int16(),
    }

def decode_field_offsets(tablx: bytes):
    tablx_header = decode_tablx_header(tablx)
    reader = MemoryReader(memoryview(tablx), start=16, byte_order='<')
    row_count = tablx_header['row_count']
    row_size = tablx_header['byte_offset']
    for _ in range(row_count):
        if row_size == 4:
            yield reader.int32(False)
        elif row_size == 5:
            yield reader.int40(False)
        elif row_size == 6:
            yield reader.int48(False)

The field descriptions will be used to determine the structure of a field at a given offset. This is the first time we've needed both the table and tablx info to determine the location of anything.
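As a quick sanity check, these two decoders can be pointed at any table's files (a00000004 is GDB_Items, which every file geodatabase contains; the exact output depends on your database):

table = open('a00000004.gdbtable', 'rb').read()
tablx = open('a00000004.gdbtablx', 'rb').read()
print(decode_field_descriptions(table)['columns'])  # number of columns
print(list(decode_field_offsets(tablx))[:5])        # first five row offsets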
Now that we know where our field data is and what each field is, we can go ahead and write a MemoryReader subclass that defines the structure of different field types:
class FieldReader(MemoryReader):
    def read_generic(self) -> dict[str, Any]:
        return {
            'width': self.int8(False),
            'flag' : self.int8(False),
        }

    def read_default(self) -> dict[str, Any]:
        return {
            'width'  : self.int8(False),
            'flag'   : self.int8(False),
            'length' : self.int8(False),
            'default': self.decode(self.last, 'utf-16le') if self.stack[-2] & 4 else None,
        }

    def read_geom(self, *, has_m: bool = False, has_z: bool = False) -> dict[str, Any]:
        self.int8(False)  # discard this byte
        return {
            'flag'     : self.int8(False),
            'reference': self.decode(self.int16(), 'utf-16le'),
            'sys-flg'  : self.int8(False),
            'has-z'    : has_z,
            'has-m'    : has_m,
            'x-origin' : self.float64(),
            'y-origin' : self.float64(),
            'xy-scale' : self.float64(),
            'm-origin' : self.float64(),
            'm-scale'  : self.float64(),
            'z-origin' : self.float64(),
            'z-scale'  : self.float64(),
            'xy-tol'   : self.float64(),
            'm-tol'    : self.float64(),
            'z-tol'    : self.float64(),
            'x-min'    : self.float64(),
            'y-min'    : self.float64(),
            'x-max'    : self.float64(),
            'y-max'    : self.float64(),
            'z-min'    : self.float64() if has_z else None,
            'z-max'    : self.float64() if has_z else None,
            'm-min'    : self.float64() if has_m else None,
            'm-max'    : self.float64() if has_m else None,
            'grids'    : self.unpack(f'{self.unpack("BI")[-1]}d'),
        }

    def read_string(self) -> dict[str, Any]:
        return {
            'max_len': self.int32(False),
            'flag'   : self.int8(False),
            'default': self.decode(self.varint(False), 'utf-8') if self.last & 4 else self.varint(False) and None,
        }

    def read_raster(self) -> dict[str, Any]:
        return {
            '_unk_flg'   : self.int8(),
            'flag'       : self.int8(False),
            'col_type'   : self.decode(self.int8(False), 'utf-16le'),
            'wkt'        : self.decode(self.int8(False), 'utf-16le'),
            'flags'      : (flg := self.int8(False)),
            'xorigin'    : self.float64() if flg else None,
            'yorigin'    : self.float64() if flg else None,
            'xyscale'    : self.float64() if flg else None,
            'morigin'    : self.float64() if flg else None,
            'mscale'     : self.float64() if flg else None,
            'zorigin'    : self.float64() if flg else None,
            'zscale'     : self.float64() if flg else None,
            'xytol'      : self.float64() if flg else None,
            'mtol'       : self.float64() if flg else None,
            'ztol'       : self.float64() if flg else None,
            'raster_type': self.int8(),
        }

This won't yet give us row information, but it will give us all the info we need to determine the location of each field in each row of the gdbtable rows section:
def decode_fields(table: bytes, tablx: bytes) -> tuple[int, dict[str, dict[str, Any]]]:
    """Returns a tuple with the beginning of the row data section and the field info dict"""
    _fields: dict[str, dict[str, Any]] = {}
    desc = decode_field_descriptions(table)
    table_header = decode_table_header(table)
    has_m = desc['has_m']
    has_z = desc['has_z']
    cols = desc['columns']
    # +14 bytes skips the field-description header
    # (header_size, version, geom_info, columns = 4+4+4+2 bytes)
    reader = FieldReader(memoryview(table), table_header['fields_offset']+14, byte_order='<')
    for _ in range(cols):
        info: dict[str, Any] = {
            'name'      : reader.decode(reader.int8(False)*2, 'utf-16le'),
            'alias'     : reader.decode(reader.int8(False)*2, 'utf-16le'),
            'field_type': reader.int8(False),
            'flag'      : None,
        }
        field_type = info['field_type']
        if field_type == 4:
            info.update(reader.read_string())
        elif field_type == 7:
            info.update(reader.read_geom(has_m=has_m, has_z=has_z))
        elif field_type == 9:
            info.update(reader.read_raster())
        elif field_type in (6, 8, 10, 11, 12):
            info.update(reader.read_generic())
        else:
            info.update(reader.read_default())
        _fields[info['name']] = info
        if info['flag'] is None:
            info['nullable'] = None
            info['required'] = None
            info['editable'] = None
        else:
            info['nullable'] = bool(info['flag'] & 1)
            info['required'] = bool(info['flag'] & 2)
            info['editable'] = bool(info['flag'] & 4)
    return reader.index, _fields

Since the row data starts right after the field description section, we return the byte offset at the end of the field descriptions. This will be the start byte for row decoding.
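Continuing with the GDB_Items bytes read earlier, a quick peek at the parsed definitions might look like this (illustrative only; the exact fields depend on your database):

row_start, fields = decode_fields(table, tablx)
for name, info in fields.items():
    print(name, info['field_type'], info['nullable'])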
Now that we have the field definitions and offsets, we can write another MemoryReader subclass that can decode actual row records. Since geometry decoding is a bit complex, we're going to skip it for now, but if there's enough interest I can create a new post that goes over implementing it.
from datetime import datetime, timedelta

START_DATE = datetime(1899, 12, 30, 0, 0, 0, 0)

class RowReader(MemoryReader):
    def read_geometry_field(self) -> int:
        size = self.varint(False)
        geo_type = self.varint(False)
        self.scan(size)
        return geo_type

    def read_binary_field(self, blob_len: int) -> bytes:
        # blob_len is unused here; the field length is encoded inline as a varint
        return self.read(self.varint(False))

    def read_string_field(self, encoding: str = 'utf-8') -> str:
        val = self.read(self.varint(False))
        try:
            return val.decode(encoding)
        except UnicodeDecodeError:
            return str(val)

    def read_xml_field(self, encoding: str = 'utf-8', strict: bool = False) -> str:
        val = self.read(self.varint(False))
        try:
            return val.decode(encoding, 'strict' if strict else 'replace')
        except UnicodeDecodeError:
            return str(val)

    def read_raster_field(self, raster_type: int) -> str | int | bytes:
        if raster_type == 0:
            return self.decode(self.varint(False), 'utf-16le')
        elif raster_type == 1:
            return self.int32()
        elif raster_type == 2:
            return self.read(self.varint(False))
        raise ValueError(f'Unknown Raster Type {raster_type}')

    def read_uuid_field(self) -> str:
        b = self.read(16)
        return (
            '{'
            f"{b[3]:02x}{b[2]:02x}{b[1]:02x}{b[0]:02x}-"
            f"{b[5]:02x}{b[4]:02x}-"
            f"{b[7]:02x}{b[6]:02x}-"
            f"{b[8]:02x}{b[9]:02x}-"
            f"{b[10]:02x}{b[11]:02x}{b[12]:02x}{b[13]:02x}{b[14]:02x}{b[15]:02x}"
            '}'
        )

    def read_objectid_field(self, version: int) -> int:
        if version == 3:
            return self.int32()
        elif version == 4:
            return self.int64()
        else:
            raise ValueError(f'Unknown OID version {version}')

    def read_generic_field(self, field_type: int) -> Any:
        if field_type == 0:
            return self.int16()
        elif field_type == 1:
            return self.int32()
        elif field_type == 2:
            return self.float32()
        elif field_type == 3:
            return self.float64()
        elif field_type == 13:
            return self.int64()
        elif field_type == 5:
            return START_DATE + timedelta(self.float64())
        elif field_type == 14:
            return (START_DATE + timedelta(self.float64())).date()
        elif field_type == 15:
            return (START_DATE + timedelta(self.float64())).time()
        elif field_type == 16:
            _date = (START_DATE + timedelta(self.float64()))
            _utc_offset = timedelta(hours=self.int16())
            return _date + _utc_offset
        raise ValueError(f'Unknown Field Type {field_type}')

And now we just need to implement this reader in a row decoder:
def decode_rows(table: bytes, tablx: bytes) -> list[dict[str, Any]]:
    offset, fields = decode_fields(table, tablx)
    table_header = decode_table_header(table)
    row_count = table_header['row_count']
    if not row_count:
        return []
    tablx_header = decode_tablx_header(tablx)
    field_offsets = decode_field_offsets(tablx)
    version = tablx_header['version']
    nullable = sum(1 for f in fields.values() if f['nullable'])
    reader = RowReader(memoryview(table), start=offset, byte_order='<')
    rows: list[dict[str, Any]] = []
    for fid, offset in enumerate(field_offsets, start=1):
        if not offset:
            continue  # a zero offset marks a deleted/missing row
        reader.index = offset
        row = {}
        blob_len = reader.int32(False)  # total byte length of the row's field data
        # one null-flag bit per nullable field, packed into bytes
        flags: list[int] = [
            reader.int8(False)
            for _ in range(0, nullable, 8)
        ]
        flag_test = 0
        for name, desc in fields.items():
            field_type = desc['field_type']
            if field_type == 6:
                # OIDs come from the row's position in the tablx index
                row[name] = fid
                continue
            if flags and desc['nullable']:
                is_null = (flags[flag_test >> 3] & (1 << (flag_test % 8)))
                flag_test += 1
                if is_null:
                    row[name] = None
                    continue
            if field_type == 7:
                row[name] = reader.read_geometry_field()
            elif field_type == 8:
                row[name] = reader.read_binary_field(blob_len)
            elif field_type == 9:
                row[name] = reader.read_raster_field(desc['raster_type'])
            elif field_type in (4, 12):
                row[name] = reader.read_string_field()
            elif field_type in (10, 11):
                row[name] = reader.read_uuid_field()
            elif field_type == 6:
                row[name] = reader.read_objectid_field(version)
            else:
                row[name] = reader.read_generic_field(desc['field_type'])
        rows.append(row)
    return rows

This should now give us enough information to be able to write our own walk function!
This will not be a full implementation of da.Walk, but instead a function that returns a dictionary of Item Type -> list(Path). In my current use case, this is the most valuable information since I need to dispatch different item types to specific places:
from pathlib import Path

def get_items(gdb: str, dtypes: list[str] | None = None) -> dict[str, list[Path]]:
    path = Path(gdb)
    item_types = decode_rows(
        (path / 'a00000005.gdbtable').read_bytes(),
        (path / 'a00000005.gdbtablx').read_bytes()
    )
    items = decode_rows(
        (path / 'a00000004.gdbtable').read_bytes(),
        (path / 'a00000004.gdbtablx').read_bytes()
    )
    item_fields = 'Type', 'Path', 'Definition', 'Name'
    type_map: dict[str, str] = {i['UUID']: i['Name'] for i in item_types}
    gdb_items: dict[str, list[Any]] = {k: [] for k in item_fields}
    for item in items:
        for k, v in item.items():
            if k not in item_fields:
                continue
            gdb_items[k].append(v)
    gdb_items['Type'] = [type_map[i].replace(' ','') for i in gdb_items['Type']]
    requested_types = {t for t in gdb_items['Type'] if dtypes is None or t in dtypes}
    ret = {k: list[Path]() for k in dtypes or requested_types}
    for dtype, catalog_path, definition, name in zip(*gdb_items.values()):
        # Determine if a FeatureClass is actually an AnnotationClass
        if definition and 'esriFTAnnotation' in definition:
            dtype = 'Annotation'
            ret.setdefault(dtype, [])
        # Skip non-requested types
        if dtype not in ret:
            continue
        # Remove the leading slash from CatalogPath to allow path joining
        if catalog_path and catalog_path.startswith('\\'):
            catalog_path = catalog_path[1:]
        # For pathless types (like Domains), just return
        # the name joined to the root path
        ret[dtype].append(path / (catalog_path or name or ''))
    return ret

Here we finally put everything together to read the GDB_Items table. This table contains most of the information we need to identify all the children of the geodatabase, as well as their definitions (which are stored in the Definition column).
>>> get_items(r'database.gdb', ['FeatureClass', 'Table'])
{
    'FeatureClass': [
        WindowsPath('database.gdb/fc1'),
        WindowsPath('database.gdb/fc2'),
        WindowsPath('database.gdb/ds1/dfc3'),
        WindowsPath('database.gdb/ds2/fc4')
    ],
    'Table': [
        WindowsPath('database.gdb/tbl1'),
        WindowsPath('database.gdb/tbl2')
    ]
}
Now that we have it working, let's compare the speed to the built-in da.Walk:
# Test done using a wide_datasets template from Joshua's code in the linked Walk question
>>> %timeit get_items('wide_datasets.gdb')
6.37 ms ± 766 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit list(Walk('wide_datasets.gdb'))
1.05 s ± 29.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
While this technique can give you really great results if you need speed, you do lose a lot of the normal functionality provided by the built-ins. Any additional flagging or conditions you want will need to be implemented manually. In my use case, getting file paths and types was all I needed, and I was willing to drop arcpy support to get there. If you want to check out the full implementation, you can find it here.
There's also the issue that not all the byte values are known, since the actual file structure of the GDB isn't published and getting to this point required a lot of reverse engineering work. This means that at any point Esri could change the internal structure of a database and your existing parser would break, though most of the structure seems pretty consistent, especially the headers.
The primary use case for this system is inspecting corrupted file geodatabases. I have ended up with plenty of broken gdbs over the years, born from sudden crashes and power failures. If you happen to have one of those, reading the raw data in it can help you recover at least some of the data (in this case I would use the GDAL OpenFileGDB driver though, since it's more feature complete).
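For reference, here's a minimal sketch of that recovery path using GDAL's Python bindings (the 'broken.gdb' path is a placeholder; exact layers and counts depend on your database):

from osgeo import ogr

# The OpenFileGDB driver is read-only and ships with GDAL,
# so no ArcGIS install is required to salvage what's readable
ds = ogr.Open('broken.gdb')
for i in range(ds.GetLayerCount()):
    layer = ds.GetLayer(i)
    print(layer.GetName(), layer.GetFeatureCount())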