dropdb pgindexes
createdb -E utf8 pgindexes
An index is just a data structure that makes searching a specific column in a database table faster. Without an index, the database would have to look at every single row in the table to see whether its value matches the query.
It takes up space, and the bigger your table is, the bigger your index will be. Another performance cost is that every insert, update, or delete on the table must also be applied to the index, because the index has to contain the same up-to-date data as the table column it covers.
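A quick way to see whether PostgreSQL would look at every row or use an index for a given query is `EXPLAIN`. A minimal sketch, assuming the `logins` table that is created further down in this example:

```sql
-- Show the query plan without running the query.
-- With no index on the date column this reports a sequential scan;
-- after an index is added it can switch to an index (or bitmap) scan.
EXPLAIN SELECT * FROM logins WHERE date > '2022-01-01';
```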
A hash index inverts the relationship between a primary key and a value in the table, so a value can be looked up directly instead of by scanning every row. Hash indexes only support the `=` operator.
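A minimal sketch of creating one, assuming the `logins` table defined further down (the index name is made up for illustration):

```sql
-- Build a hash index over the email column.
CREATE INDEX logins_email_hash_idx ON logins USING hash (email);

-- Only equality lookups like this one can use a hash index.
SELECT * FROM logins WHERE email = 'kim@example.org';
```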
A B-tree is a balanced tree, not a binary tree.
- Optimised for parallel index scans
- Very shallow unlike normal binary trees
- Cheap to keep balanced
- Sorted, which allows for the `<` and `>` operators.
Official docs: https://www.postgresql.org/docs/current/btree-intro.html
Good explainer: https://rahulreddyr3.medium.com/indexing-in-postgresql-b-tree-4377bd72d7e5
B-tree indexes can be used for queries involving the `<`, `<=`, `=`, `>=`, `BETWEEN`, `IN`, `IS NULL`, and `IS NOT NULL` operators.
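B-tree is PostgreSQL's default index type, so a plain `CREATE INDEX` builds one. A short sketch, again against the `logins` table defined below (the index name is made up):

```sql
-- CREATE INDEX with no USING clause builds a B-tree.
CREATE INDEX logins_date_idx ON logins (date);

-- Because B-tree entries are sorted, range predicates can use the index.
SELECT * FROM logins
WHERE date BETWEEN '2021-01-01' AND '2021-12-31';
```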
- Most commonly used for full text searching
- Good for when you have multiple values in a column, e.g. `array`, `jsonb` types.
Further reading:
- https://towardsdatascience.com/how-gin-indices-can-make-your-postgres-queries-15x-faster-af7a195a3fc5
- https://pganalyze.com/blog/gin-index
For example, given this menu table:

pk | menu item |
---|---|
1 | chicken burger |
2 | chickpea burger |

a GIN index on the menu item column works like an inverted index, mapping each word to the rows that contain it:

val | pk |
---|---|
chicken | 1 |
chickpea | 2 |
burger | [1, 2] |
GIN indexes can be used with the `<@`, `@>`, `=`, and `&&` operators.
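A sketch of a full text search GIN index over a hypothetical `menu` table with a text column `item`, like the example above:

```sql
-- Expression index: GIN over the tsvector form of each menu item.
CREATE INDEX menu_item_fts_idx ON menu USING gin (to_tsvector('english', item));

-- Full text queries that use the same expression can use the index.
SELECT * FROM menu
WHERE to_tsvector('english', item) @@ to_tsquery('english', 'burger');
```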
Good for when the value in one row can overlap with values in the same column of other rows. Think polygons on a map, or returning only the rows where a polygon contains a given point.
GiST indexes can also be used on text columns for full text search, similar to GIN, but GiST indexes have some fixed-size constraints.
Further reading: https://medium.com/postgres-professional/indexes-in-postgresql-5-gist-86e19781b5db
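A sketch with a geometric column; the table, column, and index names are all hypothetical:

```sql
-- Hypothetical table of named areas, each with a polygon boundary.
CREATE TABLE areas (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    boundary POLYGON NOT NULL
);

-- GiST index over the polygon column.
CREATE INDEX areas_boundary_gist_idx ON areas USING gist (boundary);

-- Return only the areas whose boundary overlaps the query polygon.
SELECT name FROM areas
WHERE boundary && polygon '((0,0),(0,1),(1,1),(1,0))';
```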
As the name implies, SP-GiST (space-partitioned GiST) is similar to GiST, but it works better for data where the values can be highly clustered, like post codes or IP addresses.
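A sketch over the `ip` column of the `logins` table defined below, which is an `inet` column (the index name is made up):

```sql
-- SP-GiST index over the inet column, using the inet_ops operator class.
CREATE INDEX logins_ip_spgist_idx ON logins USING spgist (ip inet_ops);

-- "Is contained within" queries over a network range can use the index.
SELECT * FROM logins WHERE ip << inet '22.12.0.0/16';
```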
When you have a huge dataset that is organised by dates, such as log events, BRIN indexes allow you to rapidly skip or eliminate a lot of the irrelevant data. BRIN indexes also stay small relative to the total size of the data, making them an excellent choice for huge datasets.
Docs: https://www.postgresql.org/docs/current/brin-intro.html
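A sketch against the `logins` table defined below, using its `date` column as the naturally ordered value (the index name is made up):

```sql
-- BRIN stores a small summary (min/max) per block range instead of one entry per row,
-- so the index stays tiny even for very large tables.
CREATE INDEX logins_date_brin_idx ON logins USING brin (date);

-- Range scans can skip every block range whose summary doesn't overlap the filter.
SELECT count(*) FROM logins
WHERE date >= '2022-01-01' AND date < '2022-02-01';
```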
Example: A database table storing logins
Rows will be represented as a `namedtuple`.
import datetime
from collections import namedtuple

Login = namedtuple('Login', ['email', 'date', 'user_agent', 'ip'])
Tables will be represented as a Python `list`.
[
Login(
email='[email protected]',
date=datetime.date(2010, 4, 13),
user_agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/531.0 (KHTML, like Gecko) Chrome/54.0.869.0 Safari/531.0',
ip='44.14.199.207'
),
Login(
email='[email protected]',
date=datetime.date(2013, 11, 3),
user_agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/531.0 (KHTML, like Gecko) Chrome/94.0.822.0 Safari/541.0',
ip='22.12.189.17'
)
]
CREATE TABLE IF NOT EXISTS logins (
    id SERIAL PRIMARY KEY,
    email VARCHAR(100) UNIQUE NOT NULL,
    date TIMESTAMP NOT NULL DEFAULT NOW(),
    user_agent VARCHAR(1000) NOT NULL,
    ip INET NOT NULL
);
from pprint import pprint
import faker
import random
fake = faker.Faker()
def make_row():
    # Build one fake Login row using faker's providers.
    return Login(
        fake.unique.email(),
        fake.date_object(),
        fake.chrome(),
        fake.ipv4_public()
    )
table = [make_row() for _ in range(10000)]
random_login = random.choice(table)
last_login = table[-1]
pprint(last_login)
from timeit import timeit
def find_by_email(t, email):
    # Linear scan: check every row until the email matches,
    # the equivalent of a sequential scan in Postgres.
    for x in t:
        if x.email == email:
            return x
timeit(
'find_by_email(table, last_login.email)',
globals=globals(),
number=10000
)
# The "hash index": map each email (the indexed value) to its row's position in the table.
h_i = {login.email: i for i, login in enumerate(table)}

def find_by_email_with_hash_index(t, email, hash_index):
    # A single dict lookup replaces the linear scan.
    return t[hash_index[email]]
# peek at the index
pprint(dict(list(h_i.items())[:3]))
timeit(
'find_by_email_with_hash_index(table, last_login.email, h_i)',
globals=globals(),
number=10000
)