Searching strings in SQLite with SQL LIKE

Blog Post created by SQuan-esristaff Employee on Nov 29, 2018

Introduction

 

I have some strings in my database. I would like to query them, and I want the results quickly. Simple: use SQL LIKE and put an index on the column.

 

Oh wait, that's weird, it didn't work the way I wanted... This blog post covers the common traps when searching text strings in SQLite.

 

Scenario

 

Let's look at a sample database table. This one represents property parcels and their very famous owners.

 

CREATE TABLE parcel
(
  owner TEXT
);

INSERT INTO parcel (owner) values ('Bill Gates');
INSERT INTO parcel (owner) values ('Steve Jobs');
INSERT INTO parcel (owner) values ('Jack Dangermond');
INSERT INTO parcel (owner) values ('Steve Wozniak');
INSERT INTO parcel (owner) values ('Tim Cook');
INSERT INTO parcel (owner) values ('Mark Zuckerberg');

 

We will be running queries on the above data, similar to:

 

SELECT * FROM parcel;

 

and, at the same time, we will be studying the query's execution plan with something similar to:

 

EXPLAIN QUERY PLAN SELECT * FROM parcel;

 

Using and optimizing LIKE

 

Once you dabble in a bit of SQL you'll quickly realize that you need to use LIKE for your string searches. For example, if we want to find all owners with names beginning with the letter J we do this:

 

SELECT *
FROM parcel
WHERE owner LIKE 'j%';
-- EXPLAIN QUERY PLAN: SCAN TABLE parcel
-- OUTPUT: {"owner":"Jack Dangermond"}

 

Here, we observe that it found the right result, i.e. Jack Dangermond. However, the query needed to do a full table scan (i.e. SCAN TABLE parcel) to find that this was the only result.

 

Let's attempt to speed this up with an index and try again:

 

CREATE INDEX ix_parcel_owner ON parcel (owner);
SELECT *
FROM parcel
WHERE owner LIKE 'j%';
-- EXPLAIN QUERY PLAN: SCAN TABLE parcel
-- OUTPUT: {"owner":"Jack Dangermond"}

 

Well, we got the right result, i.e. Jack Dangermond, but why is it still doing a full table scan (SCAN TABLE parcel)?

 

Why didn't it use my index ix_parcel_owner?

 

The answer is that SQLite, unlike many other databases, implements LIKE as a case-insensitive search by default (for ASCII characters). That is why we got Jack Dangermond, which begins with a capital J, even though our search pattern was "j%", which begins with a lowercase j. The index that we created uses the default, case-sensitive collation, so LIKE ignored it: a case-insensitive LIKE needs a case-insensitive index.
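To see the case sensitivity mismatch in isolation, here is a minimal sketch using Python's built-in sqlite3 module (my choice for illustration, not what the original sample uses). The helper name scalar is hypothetical. It compares LIKE against a plain = comparison, which uses the default case-sensitive collation unless told otherwise.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

# LIKE ignores ASCII case by default, so a lowercase pattern matches "Jack".
like_match = scalar("SELECT 'Jack Dangermond' LIKE 'j%'")

# A plain comparison uses the default (case-sensitive) collation.
equal_binary = scalar("SELECT 'Jack' = 'jack'")

# The same comparison under NOCASE treats ASCII upper and lower case as equal.
equal_nocase = scalar("SELECT 'Jack' = 'jack' COLLATE NOCASE")

print(like_match, equal_binary, equal_nocase)  # prints: 1 0 1
```

An index built with the default collation orders its entries the way the middle comparison sees them, which is why it cannot answer a question asked the way LIKE asks it.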

 

To fix this, we try again. This time we create an index specifically for case-insensitive searches.

 

CREATE INDEX ix_parcel_owner_collate ON parcel (owner COLLATE NOCASE);
SELECT *
FROM   parcel
WHERE  owner LIKE 'j%';
-- EXPLAIN QUERY PLAN: SEARCH TABLE parcel USING COVERING INDEX ix_parcel_owner_collate (owner>? AND owner<?)
-- OUTPUT: {"owner":"Jack Dangermond"}

 

There, we did it. The result is still Jack Dangermond, and this time the index was used (SEARCH TABLE parcel USING COVERING INDEX ix_parcel_owner_collate). Take-home message: COLLATE NOCASE is your friend.
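If you want to replay the three steps outside AppStudio, the following is a rough sketch with Python's sqlite3 module. The helper plan is a hypothetical name, and the exact plan wording depends on the SQLite version bundled with your Python (newer versions may print "SCAN parcel" rather than "SCAN TABLE parcel"), so the comments only describe what I would expect on a stock build.

```python
import sqlite3

def plan(conn, sql):
    """Return the detail text of each EXPLAIN QUERY PLAN row for sql."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # detail is the last column

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parcel (owner TEXT)")
owners = ["Bill Gates", "Steve Jobs", "Jack Dangermond",
          "Steve Wozniak", "Tim Cook", "Mark Zuckerberg"]
conn.executemany("INSERT INTO parcel (owner) VALUES (?)",
                 [(name,) for name in owners])

query = "SELECT * FROM parcel WHERE owner LIKE 'j%'"

# Step 1: no index at all, expected to be a full scan.
plan_no_index = plan(conn, query)

# Step 2: default (case-sensitive) index, still expected to be a scan.
conn.execute("CREATE INDEX ix_parcel_owner ON parcel (owner)")
plan_binary_index = plan(conn, query)

# Step 3: NOCASE index, expected to turn the scan into a search.
conn.execute(
    "CREATE INDEX ix_parcel_owner_collate ON parcel (owner COLLATE NOCASE)")
plan_nocase_index = plan(conn, query)

for label, steps in [("no index", plan_no_index),
                     ("binary index", plan_binary_index),
                     ("nocase index", plan_nocase_index)]:
    print(label, "->", steps)
```

Note that the pattern is written as a literal in the SQL text, as in the examples above.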

 

Alternatives to LIKE for string contains searches

 

Now that we've had some success with LIKE, let's use it to find more things. Let's see if we can find all owners with the letter C anywhere in their name:

 

SELECT *
FROM parcel
WHERE owner LIKE '%c%';
-- EXPLAIN QUERY PLAN: SCAN TABLE parcel
-- OUTPUT: {"owner":"Jack Dangermond"}
-- OUTPUT: {"owner":"Tim Cook"}
-- OUTPUT: {"owner":"Mark Zuckerberg"}

 

Okay, we got the results we wanted. Jack Dangermond, Tim Cook and Mark Zuckerberg all have the letter C in their names. However, why are we back at full table scans (SCAN TABLE parcel)? Why isn't the index (ix_parcel_owner_collate) being used any more? That's because the B-Tree index works the way you look up names in a phone book. It works great if you have the starting letter(s) (i.e. divide the book in half, choose the half your letter is in, divide that half in half again, and so on).

 

However, because we don't have a starting letter, we can no longer use the phone book trick. We're back to scanning every record in the table. The index we created is useless for this type of query.
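The phone book idea can be sketched in a few lines of Python. This is only an analogy, not how SQLite is implemented: a sorted list stands in for the index, and the function names prefix_search and contains_search are made up for the illustration. The prefix search assumes a non-empty prefix.

```python
from bisect import bisect_left

# A sorted, lowercased list plays the role of the case-insensitive index.
owners = sorted(name.lower() for name in [
    "Bill Gates", "Steve Jobs", "Jack Dangermond",
    "Steve Wozniak", "Tim Cook", "Mark Zuckerberg"])

def prefix_search(sorted_names, prefix):
    """Find names starting with prefix using two binary searches."""
    # Lower bound: first name that is >= the prefix.
    lo = bisect_left(sorted_names, prefix)
    # Upper bound: bump the last character of the prefix ("j" becomes "k"),
    # which sorts after every name that starts with the prefix.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect_left(sorted_names, upper)
    return sorted_names[lo:hi]

def contains_search(names, fragment):
    """Sorting does not help here: every name has to be inspected."""
    return [name for name in names if fragment in name]

starts_with_j = prefix_search(owners, "j")
contains_c = contains_search(owners, "c")

print(starts_with_j)  # ['jack dangermond']
print(contains_c)     # ['jack dangermond', 'mark zuckerberg', 'tim cook']
```

The two bounds in prefix_search correspond to the (owner>? AND owner<?) range shown in the earlier query plan, while contains_search has no such range to work with.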

 

So, what can we do? Well, there's a technique you can use, but it requires rewriting the query. Have a look at the following:

 

SELECT *, INSTR(LOWER(owner), LOWER('c')) idx
FROM parcel;
-- EXPLAIN QUERY PLAN: SCAN TABLE parcel
-- OUTPUT: {"idx":0,"owner":"Bill Gates"}
-- OUTPUT: {"idx":0,"owner":"Steve Jobs"}
-- OUTPUT: {"idx":3,"owner":"Jack Dangermond"}
-- OUTPUT: {"idx":0,"owner":"Steve Wozniak"}
-- OUTPUT: {"idx":5,"owner":"Tim Cook"}
-- OUTPUT: {"idx":8,"owner":"Mark Zuckerberg"}

SELECT *
FROM parcel
WHERE INSTR(LOWER(owner), LOWER('c')) > 0;
-- EXPLAIN QUERY PLAN: SCAN TABLE parcel
-- OUTPUT: {"owner":"Jack Dangermond"}
-- OUTPUT: {"owner":"Tim Cook"}
-- OUTPUT: {"owner":"Mark Zuckerberg"}

 

What is happening is that we're using the function INSTR, which returns the position of a substring (here, the letter C) within a string (here, owner). If the substring doesn't exist, you simply get 0. We apply the LOWER function to both the letter C and the owner to make it a case-insensitive search.

 

The result is still the same as the previous version involving LIKE, i.e. we are getting all owners with the letter C in their names. However, the query plan is still doing a full table scan (SCAN TABLE parcel).

 

So, what's the point?

 

Well, SQLite allows you to index expressions! Think of it like a pre-calculated column.

 

CREATE INDEX ix_parcel_owner_instr_c ON parcel (INSTR(LOWER(owner), LOWER('c')));

 

The above statement creates the index. It may take some time; for example, with over 200,000 records, the index may take several seconds to create. Consider that to be a good thing. Time spent here means the INSTR expression is being calculated for all records in the database once and only once. After that, the expression is calculated or recalculated only when we add a new record or modify an existing one.

 

The expression does not need to be recalculated at query time. As long as the WHERE clause repeats the expression exactly as it was written in the CREATE INDEX statement, the query can reuse the pre-calculated value stored in the index:

 

 

SELECT *
FROM parcel
WHERE INSTR(LOWER(owner), LOWER('c')) > 0;
-- EXPLAIN QUERY PLAN: SEARCH TABLE parcel USING INDEX ix_parcel_owner_instr_c (<expr>>?)
-- OUTPUT: {"owner":"Jack Dangermond"}
-- OUTPUT: {"owner":"Tim Cook"}
-- OUTPUT: {"owner":"Mark Zuckerberg"}

 

The above confirms we are now using the index (SEARCH TABLE parcel USING INDEX ix_parcel_owner_instr_c).
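One thing worth checking for yourself is that this index only helps the exact expression it was built for. The sketch below uses Python's sqlite3 module (an assumption on my part; it needs a bundled SQLite that supports indexes on expressions) and a hypothetical helper named plan. It compares the plan for the letter C with the plan for the letter K.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parcel (owner TEXT)")
conn.executemany(
    "INSERT INTO parcel (owner) VALUES (?)",
    [("Bill Gates",), ("Steve Jobs",), ("Jack Dangermond",),
     ("Steve Wozniak",), ("Tim Cook",), ("Mark Zuckerberg",)])

# The expression index from the post, built for the letter C only.
conn.execute(
    "CREATE INDEX ix_parcel_owner_instr_c "
    "ON parcel (INSTR(LOWER(owner), LOWER('c')))")

def plan(sql):
    """Return the detail text of each EXPLAIN QUERY PLAN row for sql."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Same expression as the index: the index is a candidate.
plan_c = plan(
    "SELECT * FROM parcel WHERE INSTR(LOWER(owner), LOWER('c')) > 0")

# Different letter, therefore a different expression: the index cannot help.
plan_k = plan(
    "SELECT * FROM parcel WHERE INSTR(LOWER(owner), LOWER('k')) > 0")

print("c:", plan_c)
print("k:", plan_k)
```

On a stock build I would expect the first plan to mention ix_parcel_owner_instr_c and the second to fall back to a scan of the table.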

 

Code Sample

 

This AppStudio code sample was used to generate all the SQL content in this blog:

 

import QtQuick 2.7
import QtQuick.Controls 2.1
import ArcGIS.AppFramework 1.0
import ArcGIS.AppFramework.Sql 1.0

App {
    id: app
    width: 800 * AppFramework.displayScaleFactor
    height: 640 * AppFramework.displayScaleFactor

    property string logText: ""

    Flickable {
        id: flickable

        anchors.fill: parent
        anchors.margins: 10

        contentWidth: textArea.width
        contentHeight: textArea.height
        clip: true

        TextArea {
            id: textArea

            width: flickable.width
            wrapMode: Text.WrapAtWordBoundaryOrAnywhere
            selectByMouse: true
            text: logText
        }
    }

    SqlDatabase {
        id: db
        databaseName: ":memory:"
    }

    Component.onCompleted: {
        db.open();
        exec( [
            "CREATE TABLE parcel",
            "(",
            "    owner TEXT",
            ");"
            ].join("\n") );
        exec("INSERT INTO parcel (owner) values ('Bill Gates');");
        exec("INSERT INTO parcel (owner) values ('Steve Jobs');");
        exec("INSERT INTO parcel (owner) values ('Jack Dangermond');");
        exec("INSERT INTO parcel (owner) values ('Steve Wozniak');");
        exec("INSERT INTO parcel (owner) values ('Tim Cook');");
        exec("INSERT INTO parcel (owner) values ('Mark Zuckerberg');");
        exec( [
            "SELECT *",
            "FROM parcel",
            "WHERE owner LIKE 'j%';"
            ].join("\n") );
        exec("CREATE INDEX ix_parcel_owner ON parcel (owner);");
        exec( [
            "SELECT *",
            "FROM parcel",
            "WHERE owner LIKE 'j%';"
            ].join("\n") );
        exec("CREATE INDEX ix_parcel_owner_collate ON parcel (owner COLLATE NOCASE);");
        exec( [
            "SELECT *",
            "FROM parcel",
            "WHERE owner LIKE 'j%';"
            ].join("\n") );
        exec( [
            "SELECT *",
            "FROM parcel",
            "WHERE owner LIKE '%c%';"
            ].join("\n") );
        exec( [
            "SELECT *, INSTR(LOWER(owner), LOWER('c')) idx",
            "FROM parcel;",
            ].join("\n") );
        exec( [
            "SELECT *",
            "FROM parcel",
            "WHERE INSTR(LOWER(owner), LOWER('c')) > 0;"
            ].join("\n") );
        exec("CREATE INDEX ix_parcel_owner_instr_c ON parcel (INSTR(LOWER(owner), LOWER('c')));");
        exec( [
            "SELECT *",
            "FROM parcel",
            "WHERE INSTR(LOWER(owner), LOWER('c')) > 0;"
            ].join("\n") );
    }

    function exec(sql) {
        logText += "\n";
        logText += sql + "\n";

        var explain = db.query("EXPLAIN QUERY PLAN " + sql);
        if (!explain.error && explain.first()) {
            logText += "-- EXPLAIN QUERY PLAN: %1\n".arg(explain.values.detail);
            explain.finish();
        }

        var query = db.query(sql);
        if (query.error) {
            // Put each error field on its own line so the log stays readable
            logText += query.error.databaseText + "\n";
            logText += query.error.driverText + "\n";
            logText += qsTr("NativeErrorCode: %1").arg(query.error.nativeErrorCode) + "\n";
            logText += qsTr("ErrorType: %1").arg(query.error.type) + "\n";
            return;
        }

        var ok = query.first();
        while (ok) {
            logText += "-- OUTPUT: %1\n".arg(JSON.stringify(query.values));
            ok = query.next();
        }
        query.finish();
    }
}

 

 

Summary

 

If speed matters to you, you cannot just create an index and leave it there.

 

You need to check whether your queries use the index with EXPLAIN QUERY PLAN. If your index isn't being used, look at your query. Look at your WHERE clause. Think about what is happening there. Rewrite your WHERE clause if necessary. Create indexes that match your WHERE clause.
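That checking habit can be turned into a small routine. The following is only a sketch, written with Python's sqlite3 module; the function name full_scans is hypothetical, and it relies on the plan text starting with the word SCAN, which holds for the outputs shown in this post but is not a formal guarantee.

```python
import sqlite3

def full_scans(conn, queries):
    """Return the queries whose plan has a step that starts with SCAN."""
    flagged = []
    for sql in queries:
        details = [row[-1]
                   for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]
        if any(detail.startswith("SCAN") for detail in details):
            flagged.append(sql)
    return flagged

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parcel (owner TEXT)")
conn.execute(
    "CREATE INDEX ix_parcel_owner_collate ON parcel (owner COLLATE NOCASE)")

queries = [
    "SELECT * FROM parcel WHERE owner LIKE 'j%'",   # prefix search
    "SELECT * FROM parcel WHERE owner LIKE '%c%'",  # contains search
]

flagged = full_scans(conn, queries)
print(flagged)  # expected: only the '%c%' query
```

A scan over a covering index is still flagged by this check, which seems reasonable because it still visits every entry.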

 

If necessary, be prepared to index on expressions. Don't go overboard: we didn't create an index for every letter of the alphabet. That wasn't in our requirements today. We only wanted to search for the letter C and do that better. Over time, requirements change. If we lose interest in searching for the letter C, we can drop that index. And if a new criterion becomes more important, e.g. we want to search for all owners with Jack in the first, middle or last name, we can create an index to help with that.