AnsweredAssumed Answered

Exploring geopackages, sqlite3 and python to replace shapefiles

Question asked by kimo on Sep 18, 2014
Latest reply on Sep 25, 2014 by kimo

Now that geopackages using sqlite has been publicised on the front cover of ArcNews I am wondering how you are getting on testing out this new format.

 

Of course it will be more fully supported in 10.3, but you can try out a lot of features at 10.2 already. I wanted to test out performance on medium datasets which I count as between 1 million and 10 million records. This size performs very badly with current geoprocessing tools and larger tables can be broken down and partitioned. I note that the cartographic team considers 50,000 records to be the basic target for partitioning using the very useful grid tool to split a large set of points into equal counts. Sqlite is the base database engine, which has been extended for spatial columns in spatialite and then conventions defined in the OGC geopackage to use spatialite for data exchange, vendor extensions, projections and so on. It is expected to replace shapefiles for data exchange.

 

Here is what I have tried:

  • Use the sqlite3 module built into Python 2.7
  • Install sqlite3 64 bit standalone
  • Install SQLite2009Pro development tool for testing SQL expressions. Recommended for debugging queries.
  • Read a few books on sqlite to understand the limits and conventions. Lots of tips and tricks to handle special features of sqlite.

I haven't tried yet:

  • Any spatialite extensions to manipulate spatial data. This is really because support is not there in ArcGIS  yet, and I have a large table problem to solve. I want to make a subset of related tables to go with a clip. This is very hard currently and takes hours and hours.

 

The good parts

Sqlite has a very wide, but hidden, support base because it is used in every mobile phone. So you are already using it! It is being used by ArcGIS Mobile apps to manage disconnected editing of a geodatabase.

Writing python code to manipulate sqlite databases is compact and powerful using SQL expressions.

Loading large CSV tables into a database is blinding fast compared to alternatives. I can load a set of tables totalling 50 million records and 4.5GB in 18 minutes instead of 9 hours.

It is easy to exchange tables between sqlite and filegeodatabases in ArcCatalog.

Complex SQL expressions with multiple joins across multiple databases are very very fast and don't need indexes. I can extract 1 million records from a series of tables containing 18 million records with 3 joins and two conditional tests in less than 100 seconds without indexes.

Since it is entirely open and already well supported it is a ready-made exchange format with direct access for editing (although it is read-only in ArcGIS at 10.2 you can use python to edit.)

 

The missing parts

There are quite a few tricks to learn to sidestep awkward features in sqlite.

While the text datatype is unlimited in width it is essential to define a width when creating a table if it is going to translate to a filegeodatabase.

The width is ignored by spatialite, but is kept as metadata and used to translate to other formats where it is required. If you don't define a width then an index in filegeodatabases will fail because the width will default to some huge value 18,000 chars.

I am still getting used to the internal rowid and primary keys. Do not define your primary key as a primary key or it will disappear into the OID. Instead define a key to be primary as an autoincrement so that it will go into the OID and you can keep your static primary key to edit as you wish.

I have taken the advice in the O'Reilly book 'Using Sqlite'  not to use the SQL construct to create a table from a select statement because the field widths are not set. Instead make a template from the source table and use INSERT INTO.

Performance using execute functions in python always fails for large tables.The same sql works just fine from a command line or development tool. I do not know why or how to work around this. This makes it hard to automate queries on large tables.

 

Adding geometry

 

You can add geometry by extending sqlite with spatial extensions that are compatible with spatialite 4.+ using a hidden Arcpy function.

The extension is .gpkg but .sqlite also works thank goodness, my fingers cannot spell gpkg.

 

 #------------------------------------------------------------------------------- # Name:        createSqliteDB.py # Purpose:     now use gpkg as extension # # Author:      kimo # # Created:     19/09/2014 29/10/2014 # Copyright:   (c) kimo 2014 # Licence:     Createive Commons 3.0 NZ #-------------------------------------------------------------------------------  import arcpy import sys import os  try:     path = sys.argv[1]     db   = sys.argv[2]     geom = sys.argv[3] except:     path = 'e:/crs/currentgdb'     db = 'example.gpkg'     geom = 'ST_GEOMETRY' arcpy.AddMessage(path + " " + db) # geomOption = ['ST_GEOMETRY','SPATIALITE'] sqlite_database_path = '{}/{}'.format(path,db) if not arcpy.Exists(sqlite_database_path) :     arcpy.gp.CreateSQliteDatabase(sqlite_database_path,geom) else:     arcpy.AddWarning("db already exists")     print "db already exists" 

 

This is great! Because now you can use the sqlite database as if it was a filegeodatabase. The normal da.InsertCursor works just fine to insert new records including a geometry column. Unfortunately there is a bug in the da.UpdateCursor that crashes Python if you try to insert a geometry column (10.3 Prerelease). But it looks very promising. Note that there is a shortcut to creating a geometry directly from a WKT string using a token on the da.InsertCursor. It also took me a while to realise that I cannot just call the geometry column 'SHAPE' as reported in Listfields, I have to use a token with an @ in it.  No need to fiddle about, and it is fast.

 

#------------------------------------------------------------------------------- # Name:        loadfc_gpkg_demo.py # Purpose:     demonstrate loading a CSV fc into spatialite # # Author:      kimo # # Created:    29/10/2014 # Copyright:   (c) kimo 2014 # Licence:     Ollivier & Company #------------------------------------------------------------------------------- # changed shapefield token to SHAPE@WKT for da.InsertCursor 24 Oct 2014 import arcpy import sys import os import csv import datetime import sqlite3    def update_existing(rcl):     ''' use the same changeset to update itself     to test out the UpdateCursor on a geopackage     read in entire changeset first and build row lists     before opening cursors to limit the number of records     much simpler to delete updates and re-insert'''     desc = arcpy.Describe(rcl)     print "target fc:", rcl, desc.shapeType, desc.spatialReference.name, desc.featureType     # create a buffer list without objectid and shape_length and rename shape field to a token      buf = [f.name.upper() for f in desc.fields]     print "Desc fields in target:",buf     buf.remove('OBJECTID')     # buf.remove('SHAPE_LENGTH') there isn't one?     buf[buf.index("SHAPE")] = 'SHAPE@WKT'     # shape field in cursor name MUST be one of special tokens so use SHAPE@WKT     dBuf = {f.upper():buf.index(f) for f in buf}     print "dBuf:", dBuf     # delete update or insert records into the featureclass     # just testing change = 'UPDATE' or 'DELETE'     debug = True     # splits changeset into action arrays ready to go     delete_rows = []     insert_rows = []     update_rows = []     with open(csv_file,'r') as f:         reader = csv.reader(f)         n = 0         for row in reader:             if reader.line_num == 1:                 print "header:", len(row),row                 rowCSV = [r.upper().replace('__','') for r in row] # get rid of double underscores in __change__                 print "CSV header:",rowCSV                 rowCSV[rowCSV.index("SHAPE")] = 'SHAPE@WKT'                 dCSV = {r.upper():rowCSV.index(r) for r in rowCSV}                 print "dCSV:", dCSV                 for f in dBuf.keys():                     try:                         print "Field: {} Buf {}, CSV {} ".format(f, dBuf[f], dCSV[f])                     except:                         print "Field name dictionaries do not match", f                         sys.exit(1)                 print                 print '-------------------- BEGIN --------------------------'             else:                 n+=1                 # fieldmapping by name dictionaries                 # initialise a new buffer                 feat = [None] * len(buf)                 for f in buf: # iterate over field names                     # just poke shape straight into buffer field as a WKT string!                     if row[dCSV[f]]: # populate if not an empty data element, otherwise leave as null                         if f == 'FID':                             feat[dBuf[f]] = int(row[dCSV[f]].split('.')[-1])                             # print f,feat[dBuf[f]]                         else:                             feat[dBuf[f]] = row[dCSV[f]]                         if row[dCSV['CHANGE']] == 'DELETE':                             delete_rows = delete_rows + [row[dCSV['ID']],feat]                         if row[dCSV['CHANGE']] == 'UPDATE':                             update_rows = update_rows + [row[dCSV['ID']],feat]                         if row[dCSV['CHANGE']] == 'INSERT':                             insert_rows = insert_rows + [row[dCSV['ID']],feat]     print "{} deletes {} updates {} inserts".format(len(delete_rows),len(update_rows),len(insert_rows))     # apply deletes     sql_query = "ID in " + str(tuple(row[0] for row in delete_rows))     print sql_query ##    with arcpy.da.UpdateCursor(rcl, '*', sql_query) as cur: # shape called SHAPE@ or SHAPE@WKT ##        for row in cur: ##            cur.deleteRow(row)      deltaTime = datetime.datetime.now() - start     print n,"Well Done", deltaTime     return  def create_new(rcl):     ''' create a new featureclass from scratch'''     # arcpy.management.CreateFeatureclass(ws, rcl, template='rcl_changeset_template', spatial_reference=sr)     arcpy.management.CreateFeatureclass(ws, rcl, 'POLYLINE', spatial_reference=sr)     arcpy.management.AddField(rcl, 'FID', 'LONG') # actually not a long but a complex string     arcpy.management.AddField(rcl, 'CHANGE', 'TEXT', field_length=16)     arcpy.management.AddField(rcl, 'ID', 'LONG')     arcpy.management.AddField(rcl, 'ALT_ID', 'LONG')     arcpy.management.AddField(rcl, 'STATUS', 'TEXT', field_length=4)     arcpy.management.AddField(rcl, 'NON_CADASTRAL_RD', 'TEXT', field_length=1)     arcpy.management.AddField(rcl, 'AUDIT_ID', 'LONG')     arcpy.management.AddField(rcl, 'SE_ROW_ID', 'LONG')     desc = arcpy.Describe(rcl)     print "target fc:", rcl, desc.shapeType, desc.spatialReference.name, desc.featureType     # create a buffer list without objectid and shape_length and rename shape field to a token      buf = [f.name.upper() for f in desc.fields]     print "Desc fields in target:",buf     buf.remove('OBJECTID')     # buf.remove('SHAPE_LENGTH')     buf[buf.index("SHAPE")] = 'SHAPE@WKT'     # shape field in cursor name MUST be one of special tokens so use SHAPE@WKT     dBuf = {f.upper():buf.index(f) for f in buf}     print "dBuf:", dBuf     # insert records into the featureclass     # maybe for speed we could use arcpy.da.Editor for bulk commits?     debug = False     icur = arcpy.da.InsertCursor(rcl, buf) # shape called SHAPE@ or SHAPE@WKT     with open(csv_file,'r') as f:         reader = csv.reader(f)         n = 0         for row in reader:             if reader.line_num == 1:                 print "header:", len(row),row                 rowCSV = [r.upper().replace('__','') for r in row] # get rid of double underscores in __change__                 print "CSV header:",rowCSV                 rowCSV[rowCSV.index("SHAPE")] = 'SHAPE@WKT'                 dCSV = {r.upper():rowCSV.index(r) for r in rowCSV}                 print "dCSV:", dCSV                 for f in dBuf.keys():                     try:                         print "Field: {} Buf {}, CSV {} ".format(f, dBuf[f], dCSV[f])                     except:                         print "Field name dictionaries do not match", f                         sys.exit(1)                 print                 print '-------------------- BEGIN --------------------------'             else:                 n+=1                 # fieldmapping by name dictionaries                 # initialise a new buffer                 feat = [None] * len(buf)                 for f in buf: # iterate over field names                     # just poke shape straight into buffer field as a WKT string!                     if row[dCSV[f]]: # populate if not an empty data element, otherwise leave as null                         if f == 'FID':                             feat[dBuf[f]] = int(row[dCSV[f]].split('.')[-1])                             # print f,feat[dBuf[f]]                         else:                             feat[dBuf[f]] = row[dCSV[f]]                 icur.insertRow(feat)     del icur     deltaTime = datetime.datetime.now() - start     print n,"Well Done", deltaTime     return  # ------------------------------------ main --------------------------- try:     csv_file = sys.argv[1] except:     csv_file = 'e:/crs/source/future/script/layer-2023-changeset.csv' start = datetime.datetime.now() # rcl changeset if not arcpy.Exists('e:/crs/source/future/future.gpkg'):     arcpy.gp.CreateSQLiteDatabase('e:/crs/source/future/future.gpkg') ws = 'e:/crs/source/future/future.gpkg' arcpy.env.workspace = ws arcpy.env.overwriteOutput = True rcl = 'rcl_changeset' sr = arcpy.SpatialReference(4167) # 4167 NZGD_2000 # sample is DD not NZTM so cannot use template NZTM projection if arcpy.Exists(rcl):     update_existing(rcl) else:     create_new(rcl)     #arcpy.management.Delete(rcl) 

 

So what to do about the failed UpdateCursor if I want to do some editing?

A search of the options for open source python tools suggests a module called pyspatialite will provide an extension to the sqlite3 python module. This works well on Linux because it is built-in to tools such as OSGeo4W and QGIS. There is no stand-alone windows executable I could find for pyspatialite to install in Python27/ArcGIS10.0. However there is a tricky way you can do this without trying to recompile a package from source on Windows.

  1. Copy the site-package pyspatialite out of OSGeo4W into the ArcGIS10.2 version site-packages
  2. Copy all the DLLs in the OSGeo4W/bin directory into a new directory under ArcGIS10.2 say called binOSGeo4w
  3. Add this new folder to the system path.

Test your new module with a script from SpatiaLite / Python

Outcomes