dropdb pgindexes
createdb -E utf8 pgindexes
An index is just a data structure that makes searching a specific column in a database table faster. Without an index, the database would have to look at every single row in the table to see whether its value matches the query.
It takes up space, and the bigger your table is, the bigger your index will be. Another performance cost is that every insert, update, or delete on the table must also be applied to the index, because the index has to contain the same up-to-date data as the table column it covers.
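A quick way to see whether PostgreSQL would look at every row or use an index for a given query is `EXPLAIN`. A minimal sketch, assuming the `logins` table that is created further down in this example:

```sql
-- Show the query plan without running the query.
-- With no index on the date column this reports a sequential scan;
-- after an index is added it can switch to an index (or bitmap) scan.
EXPLAIN SELECT * FROM logins WHERE date > '2022-01-01';
```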
A hash index inverts the relationship between a primary key and a value in the table, so a value can be looked up directly instead of by scanning every row. Hash indexes only support the `=` operator.
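A minimal sketch of creating one, assuming the `logins` table defined further down (the index name is made up for illustration):

```sql
-- Build a hash index over the email column.
CREATE INDEX logins_email_hash_idx ON logins USING hash (email);

-- Only equality lookups like this one can use a hash index.
SELECT * FROM logins WHERE email = 'kim@example.org';
```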
A B-tree is a balanced tree, not a binary tree.
- Optimised for parallel index scans
- Very shallow unlike normal binary trees
- Cheap to keep balanced
- Sorted, which allows for the `<` and `>` operators.
Official docs: https://www.postgresql.org/docs/current/btree-intro.html
Good explainer: https://rahulreddyr3.medium.com/indexing-in-postgresql-b-tree-4377bd72d7e5
B-tree indexes can be used for queries involving the `<`, `<=`, `=`, `>=`, `BETWEEN`, `IN`, `IS NULL`, and `IS NOT NULL` operators.
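B-tree is PostgreSQL's default index type, so a plain `CREATE INDEX` builds one. A short sketch, again against the `logins` table defined below (the index name is made up):

```sql
-- CREATE INDEX with no USING clause builds a B-tree.
CREATE INDEX logins_date_idx ON logins (date);

-- Because B-tree entries are sorted, range predicates can use the index.
SELECT * FROM logins
WHERE date BETWEEN '2021-01-01' AND '2021-12-31';
```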
- Most commonly used for full text searching
- Good for when you have multiple values in a column, e.g. `array`, `jsonb` types.
Further reading:
- https://towardsdatascience.com/how-gin-indices-can-make-your-postgres-queries-15x-faster-af7a195a3fc5
- https://pganalyze.com/blog/gin-index
For example, given this menu table:

pk | menu item |
---|---|
1 | chicken burger |
2 | chickpea burger |

a GIN index on the menu item column works like an inverted index, mapping each word to the rows that contain it:

val | pk |
---|---|
chicken | 1 |
chickpea | 2 |
burger | [1, 2] |
GIN indexes can be used with the `<@`, `@>`, `=`, and `&&` operators.
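A sketch of a full text search GIN index over a hypothetical `menu` table with a text column `item`, like the example above:

```sql
-- Expression index: GIN over the tsvector form of each menu item.
CREATE INDEX menu_item_fts_idx ON menu USING gin (to_tsvector('english', item));

-- Full text queries that use the same expression can use the index.
SELECT * FROM menu
WHERE to_tsvector('english', item) @@ to_tsquery('english', 'burger');
```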
Good for when the value in one row can overlap with values in the same column of other rows. Think polygons on a map, or returning only the rows where a polygon contains a given point.
GiST indexes can also be used on text columns for full text search, similar to GIN, but GiST indexes have some fixed-size constraints.
Further reading: https://medium.com/postgres-professional/indexes-in-postgresql-5-gist-86e19781b5db
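A sketch with a geometric column; the table, column, and index names are all hypothetical:

```sql
-- Hypothetical table of named areas, each with a polygon boundary.
CREATE TABLE areas (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    boundary POLYGON NOT NULL
);

-- GiST index over the polygon column.
CREATE INDEX areas_boundary_gist_idx ON areas USING gist (boundary);

-- Return only the areas whose boundary overlaps the query polygon.
SELECT name FROM areas
WHERE boundary && polygon '((0,0),(0,1),(1,1),(1,0))';
```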
As the name implies, SP-GiST (space-partitioned GiST) is similar to GiST, but it works better for data where the values can be highly clustered, like post codes or IP addresses.
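A sketch over the `ip` column of the `logins` table defined below, which is an `inet` column (the index name is made up):

```sql
-- SP-GiST index over the inet column, using the inet_ops operator class.
CREATE INDEX logins_ip_spgist_idx ON logins USING spgist (ip inet_ops);

-- "Is contained within" queries over a network range can use the index.
SELECT * FROM logins WHERE ip << inet '22.12.0.0/16';
```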
When you have a huge dataset that is organised by dates, such as log events, BRIN indexes allow you to rapidly skip or eliminate a lot of the irrelevant data. BRIN indexes also stay small relative to the total size of the data, making them an excellent choice for huge datasets.
Docs: https://www.postgresql.org/docs/current/brin-intro.html
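A sketch against the `logins` table defined below, using its `date` column as the naturally ordered value (the index name is made up):

```sql
-- BRIN stores a small summary (min/max) per block range instead of one entry per row,
-- so the index stays tiny even for very large tables.
CREATE INDEX logins_date_brin_idx ON logins USING brin (date);

-- Range scans can skip every block range whose summary doesn't overlap the filter.
SELECT count(*) FROM logins
WHERE date >= '2022-01-01' AND date < '2022-02-01';
```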
Example: A database table storing logins
Rows will be represented as a `namedtuple`.
import datetime
from collections import namedtuple

Login = namedtuple('Login', ['email', 'date', 'user_agent', 'ip'])
Tables will be represented as a Python `list`.
[
Login(
email='[email protected]',
date=datetime.date(2010, 4, 13),
user_agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/531.0 (KHTML, like Gecko) Chrome/54.0.869.0 Safari/531.0',
ip='44.14.199.207'
),
Login(
email='[email protected]',
date=datetime.date(2013, 11, 3),
user_agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/531.0 (KHTML, like Gecko) Chrome/94.0.822.0 Safari/541.0',
ip='22.12.189.17'
)
]
CREATE TABLE IF NOT EXISTS logins (
    id SERIAL PRIMARY KEY,
    email VARCHAR(100) UNIQUE NOT NULL,
    date TIMESTAMP NOT NULL DEFAULT NOW(),
    user_agent VARCHAR(1000) NOT NULL,
    ip INET NOT NULL
);
from pprint import pprint
import faker
import random
fake = faker.Faker()
def make_row():
    # Build one fake Login row using faker's providers.
    return Login(
        fake.unique.email(),
        fake.date_object(),
        fake.chrome(),
        fake.ipv4_public()
    )
table = [make_row() for _ in range(10000)]
random_login = random.choice(table)
last_login = table[-1]
pprint(last_login)
from timeit import timeit
def find_by_email(t, email):
    # Linear scan: check every row until the email matches,
    # the equivalent of a sequential scan in Postgres.
    for x in t:
        if x.email == email:
            return x
timeit(
'find_by_email(table, last_login.email)',
globals=globals(),
number=10000
)
# The "hash index": map each email (the indexed value) to its row's position in the table.
h_i = {login.email: i for i, login in enumerate(table)}

def find_by_email_with_hash_index(t, email, hash_index):
    # A single dict lookup replaces the linear scan.
    return t[hash_index[email]]
# peek at the index
pprint(dict(list(h_i.items())[:3]))
timeit(
'find_by_email_with_hash_index(table, last_login.email, h_i)',
globals=globals(),
number=10000
)