Definition queries versus splitting out data

02-17-2016 06:53 AM
DuncanHornby
MVP Notable Contributor

All,

I am about to embark on a project to edit an existing Esri desktop Add-In I created a few years ago for a client, which processes AIS data (vessel movement data).

The original tool was designed around a volume of data specific to their original project. The data is held in a file geodatabase, with attribute indexes built on the key fields and the FeatureClass compressed. The client has been throwing larger and larger datasets at it, and the current tool logic needs to be improved so it can deal with them; at the moment the tool gets slower as the data size increases. The tool does a lot of spatial querying, and the logic of this needs to change.

I believe their large datasets are about 2 million rows.

I've had a good think about how to approach this and I've come up with two solutions: apply definition queries so that the tool is only ever looking at a subset of the data based upon date, or do some pre-processing, splitting the data out into separate FeatureClasses based upon date, so physically creating new but smaller datasets.

Does anyone have any advice? Has anyone done something similar and know of any pitfalls? I was wondering, if I go down the route of definition queries, whether there is any performance degradation as the size of the source dataset increases. For argument's sake, let's say the definition query always subsets a similar number of rows: is its performance influenced by the size of the underlying table? Or should I "explode" the dataset into new FeatureClasses, so that any processing always works from a source dataset of a smaller size (probably around 20,000 rows)?
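
To make the two mechanisms concrete, this is roughly what each looks like in arcpy terms (the Add-In itself uses ArcObjects, but the idea is the same; the paths and date field name below are placeholders for illustration, not my actual schema):

    import arcpy

    src = r"C:\data\ais.gdb\VesselPositions"  # placeholder source FeatureClass
    where = "POS_TIME >= date '2016-02-17' AND POS_TIME < date '2016-02-18'"

    # Option 1: definition query - the full table stays intact and a layer
    # restricts all subsequent processing to one day's worth of rows.
    arcpy.MakeFeatureLayer_management(src, "ais_one_day", where)

    # Option 2: pre-processing split - physically copy that day's rows out
    # into a separate, much smaller FeatureClass.
    arcpy.Select_analysis(src, r"C:\data\ais_split.gdb\AIS_20160217", where)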

I ask this question because the mechanisms for these two solutions are quite different, and before I start I felt I could benefit from other people's wisdom/experiences.

Duncan

3 Replies
DanPatterson_Retired
MVP Emeritus

If the dataset is:

  • not going to be updated during the period of the analysis... split...
  • going to be used in other analyses... split...
  • only of interest in parts... split...
  • I have a few other reasons and a long explanation, but I found splitting to be the fastest in any scenario that I have encountered (a sketch of a date-based split follows this list).
  • The above doesn't mean that there aren't other scenarios where a DQ would be faster... so
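
A minimal arcpy sketch of such a date-based split (the dataset path and timestamp field name here are assumptions, not from the original tool):

    import arcpy
    import datetime

    src = r"C:\data\ais.gdb\VesselPositions"   # hypothetical source path
    out_gdb = r"C:\data\ais_split.gdb"         # hypothetical output GDB
    date_field = "POS_TIME"                    # hypothetical timestamp field

    # Collect the distinct days present, then write each day's rows out
    # to its own, much smaller FeatureClass.
    days = {row[0].date()
            for row in arcpy.da.SearchCursor(src, [date_field]) if row[0]}
    for d in sorted(days):
        nxt = d + datetime.timedelta(days=1)
        where = "{0} >= date '{1}' AND {0} < date '{2}'".format(
            date_field, d.isoformat(), nxt.isoformat())
        arcpy.Select_analysis(
            src, "{0}\\AIS_{1}".format(out_gdb, d.strftime("%Y%m%d")), where)
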
VinceAngelo
Esri Esteemed Contributor

Two million features is quite large, especially in a temporally sequenced table which is likely to be spatially fragmented. Generally, the advice is to organize like data in a single table, but the exception is when the data won't ever be used together (especially when it just increases spatial fragmentation). Not only would your application benefit from having the data split into tables, but the tables should probably be spatially sorted during the split process, to optimize spatial selection performance.
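
A minimal sketch of that spatial sort step, with placeholder paths (I believe the Sort tool's spatial sort methods need an Advanced license):

    import arcpy

    # Rewrite a split FeatureClass in Peano-curve order so that features
    # close together on the map are also stored close together on disk.
    arcpy.Sort_management(r"C:\data\ais_split.gdb\AIS_20160217",
                          r"C:\data\ais_split.gdb\AIS_20160217_sorted",
                          [["Shape", "ASCENDING"]], "PEANO")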

- V

DuncanHornby
MVP Notable Contributor

Dan/Vince,

Thank you for your response. It sounds very much like I should split and run!

Sorry Vince, but Dan gets the star; he pipped you to the post. You have both given me useful insight and something to think about.

Much appreciated.

Duncan
