
Comments (14)

djezzzl commented on June 12, 2024

Hi @toydestroyer,

Thank you for the follow-up discussion!

If I may just quote some of the answers from there:

I saw that topic too. Some of the answers indeed sound interesting, but researching them shows they are not worthwhile, IMHO.

In my case, this is definitely a win!

Could you please check it in your database and share the results? Would there be a win in terms of memory reduction or performance gain?

But I understand that this is not for everyone. Especially considering that Rails added an API to create enum columns only from version 7, and only for PostgreSQL: rails/rails@4eef348. And it doesn't provide an API for adding/deleting/changing values once an enum is created. So you have to write raw SQL in your migration files or use third-party gems (#1, #2).

I think it's still fine to have the checker in DatabaseConsistency even if it helps only you. In the end, anybody can simply disable it if they don't want it or disagree. Giving the customers (engineers) this choice is a big plus.

To sum up:

  • I'm fine with having a checker that promotes enums.
  • I would be happy to have the rationale so we can add it to the wiki.
  • Maybe we should change the severity from "issue" to "suggestion/warning" for this checker.

Some of my findings:

create type example_enum as enum ('value1', 'value2', 'value3');
create table tmp_varchar (example varchar);
create table tmp_enum (example example_enum);

insert into tmp_varchar (select (ARRAY['value1', 'value2', 'value3'])[generate_series % 3 + 1] from generate_series(1, 1000000));
insert into tmp_enum(select example::example_enum from tmp_varchar);

create index tmp_varchar_index on tmp_varchar(example);
create index tmp_enum_index on tmp_enum(example);

analyze tmp_varchar;
analyze tmp_enum;

select
    pg_size_pretty(sum(pg_column_size('example'))) as total_size,
    pg_size_pretty(avg(pg_column_size('example'))) as average_size,
    count(*) count
from tmp_varchar;
-- 7813 kB,8.0000000000000000 bytes,1000000

select
    pg_size_pretty(sum(pg_column_size('example'))) as total_size,
    pg_size_pretty(avg(pg_column_size('example'))) as average_size,
    count(*) count
from tmp_enum;
-- 7813 kB,8.0000000000000000 bytes,1000000
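-- Note: pg_column_size('example') measures the quoted string literal 'example', not the column value;
-- measuring the stored column would need pg_column_size(example) without quotes, which is likely why
-- both tables report identical averages above.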

select version();
-- PostgreSQL 12.9 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit


explain analyze select * from tmp_varchar where example = 'value2';
-- Bitmap Heap Scan on tmp_varchar  (cost=6218.32..14788.33 rows=331600 width=7) (actual time=14.082..45.372 rows=333334 loops=1)
--   Recheck Cond: ((example)::text = 'value2'::text)
--   Heap Blocks: exact=4425
--   ->  Bitmap Index Scan on tmp_varchar_index  (cost=0.00..6135.43 rows=331600 width=0) (actual time=13.688..13.688 rows=333334 loops=1)
--         Index Cond: ((example)::text = 'value2'::text)
-- Planning Time: 0.065 ms
-- Execution Time: 54.115 ms

explain analyze select * from tmp_enum where example = 'value2';
-- Bitmap Heap Scan on tmp_enum  (cost=6294.10..14915.35 rows=335700 width=4) (actual time=11.765..41.599 rows=333334 loops=1)
--   Recheck Cond: (example = 'value2'::example_enum)
--   Heap Blocks: exact=4425
--   ->  Bitmap Index Scan on tmp_enum_index  (cost=0.00..6210.18 rows=335700 width=0) (actual time=11.381..11.382 rows=333334 loops=1)
--         Index Cond: (example = 'value2'::example_enum)
-- Planning Time: 0.069 ms
-- Execution Time: 50.327 ms


djezzzl commented on June 12, 2024

Readability (compared to integer-based enums). Data analysts will be able to see what the data is about, instead of asking developers questions like "what does user status 6 mean?"

I absolutely agree that integer-based enums with ActiveRecord are hard to read. It can simply be solved with a string-based enum, though.

For example, enum :status, {active: 'active', inactive: 'inactive', ...}.

Consistency. Nobody with direct access to the DB will be able to typo the value.

That's true. And I agree that could be critical for data-based logic.
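
For instance, using the example_enum type from my findings above, a typo simply fails the cast (illustrative, not from the benchmark run):

INSERT INTO tmp_enum (example) VALUES ('value_2');
-- ERROR:  invalid input value for enum example_enum: "value_2"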

So, we have a "consistency" reason right now, which for me is already good. However, it would be great to have some more. If you find any, please feel free to share.

In any case, I think I will be focused on something else in the upcoming weeks so if you want, feel free to do a PR.


djezzzl commented on June 12, 2024

That is understandable to me!

I wish you and your family Happy Christmas and New Year!


djezzzl commented on June 12, 2024

Hi @toydestroyer,

I just finished reading this. Thank you again for the extraordinary explorations! I'm pretty confident this benchmarking deserves a post to share.

-- => enum_size: 58777600 (56 MB)
-- => int2_size: 58777600 (56 MB)
-- => int4_size: 58777600 (56 MB)

The above looks odd to me, but I think it is because of how PostgreSQL lays out rows: each row gets padded to some minimum aligned size, which is why int2 and int4 produce the same results.
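
One way to sanity-check that (illustrative, not part of the benchmark): heap rows are padded to alignment boundaries, and enum values are stored as 4-byte values, so the int2, int4, and enum columns all end up with the same padded row size here. The per-type length and alignment can be read straight from the catalog:

SELECT typname, typlen, typalign
FROM pg_type
WHERE typname IN ('int2', 'int4', 'int8', 'text', 'transaction_status');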

It would be nice to have some sanity check on my code because I'm pretty sure I'm very biased towards native enums 😆

The code looks legit to me.

I was also super surprised that int2 and int4 consume an equal amount of disk space.

Same feeling.

I used the text datatype for simplicity. varchar doesn't make a difference (I checked). varchar(255) is less space-consuming but insignificantly.

Makes sense to me.

And I'm even more skeptical about strings with foreign keys now. They are still strings and consume disk space as regular strings, and it seems like having a foreign key does affect the insert performance 🤷‍♂ī¸

Yes, this is disappointing, to be honest. The table that stores it is tiny, yet the insert time nearly doubles.

=========================================================================

I think we should try to conclude.

Native enum type in PostgreSQL 15+.

Pros:

  • has the best performance
  • has the best memory efficiency
  • ensures consistency
  • straightforward to read

Cons:

  • hard to manipulate (rename/drop); a rough sketch of this follows below
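
For the wiki, a rough sketch of what "hard to manipulate" means in practice (illustrative SQL, reusing the example_enum/tmp_enum names from my earlier findings): adding and renaming values is supported natively, while removing a value requires recreating the type.

-- supported directly (ADD VALUE since PostgreSQL 9.1, RENAME VALUE since 10)
ALTER TYPE example_enum ADD VALUE 'value4';
ALTER TYPE example_enum RENAME VALUE 'value4' TO 'value_four';

-- removing a value is not supported; the usual workaround recreates the type
-- (assuming no rows still hold the value being dropped, here 'value_four')
ALTER TYPE example_enum RENAME TO example_enum_old;
CREATE TYPE example_enum AS ENUM ('value1', 'value2', 'value3');
ALTER TABLE tmp_enum
  ALTER COLUMN example TYPE example_enum USING example::text::example_enum;
DROP TYPE example_enum_old;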

I think we can promote enums with the checker. It would also be great to highlight the reasoning and potential drawbacks. It would be super cool to mention how to overcome them.

P.S. Maybe it isn't as efficient on lower versions, but we have the benchmark to confirm that.

Thank you again for the topic, and please feel free to do a PR; I will be happy to merge and release it.


toydestroyer commented on June 12, 2024

Could you please check it in your database and share the results? Would there be a win in terms of memory reduction or performance gain?

I think the main benefits in our use case are not performance-related, but:

  • Consistency. Nobody with direct access to the DB will be able to typo the value.
  • Readability (compared to integer-based enums). Data analysts will be able to see what the data is about, instead of asking developers questions like "what does user status 6 mean?"

I would be happy to have the rationale so we can add it to the wiki.

Got it, I'll do more research on this!


toydestroyer commented on June 12, 2024

While I'm still researching, I want to point out an interesting note in the ActiveRecord::Enum documentation regarding string columns:

Finally it's also possible to use a string column to persist the enumerated value. Note that this will likely lead to slower database queries:


djezzzl commented on June 12, 2024

I understand that it makes sense for the comparison of integers to be faster than the comparison of strings; however, I see the following results:

create table tmp_1(a integer);
create table tmp_2(a varchar);

insert into tmp_1 (select generate_series % 3 from generate_series(1, 1000000));
insert into tmp_2 (select (ARRAY['active', 'inactive', 'invalid'])[a + 1] from tmp_1);

create index tmp_1_index on tmp_1(a);
create index tmp_2_index on tmp_2(a);

analyze tmp_1;
analyze tmp_2;

explain analyse select count(*) from tmp_1 where a = 1;
-- Finalize Aggregate  (cost=6386.73..6386.74 rows=1 width=8) (actual time=11.920..13.236 rows=1 loops=1)
--   ->  Gather  (cost=6386.51..6386.72 rows=2 width=8) (actual time=11.847..13.229 rows=3 loops=1)
--         Workers Planned: 2
--         Workers Launched: 2
--         ->  Partial Aggregate  (cost=5386.51..5386.52 rows=1 width=8) (actual time=10.235..10.236 rows=1 loops=3)
--               ->  Parallel Index Only Scan using tmp_1_index on tmp_1  (cost=0.42..5038.14 rows=139347 width=0) (actual time=0.021..6.298 rows=111111 loops=3)
--                     Index Cond: (a = 1)
--                     Heap Fetches: 0
-- Planning Time: 0.062 ms
-- Execution Time: 13.260 ms

explain analyse select count(*) from tmp_2 where a = 'active';
-- Finalize Aggregate  (cost=6379.72..6379.73 rows=1 width=8) (actual time=12.200..13.513 rows=1 loops=1)
--   ->  Gather  (cost=6379.51..6379.72 rows=2 width=8) (actual time=12.125..13.507 rows=3 loops=1)
--         Workers Planned: 2
--         Workers Launched: 2
--         ->  Partial Aggregate  (cost=5379.51..5379.52 rows=1 width=8) (actual time=10.350..10.350 rows=1 loops=3)
--               ->  Parallel Index Only Scan using tmp_2_index on tmp_2  (cost=0.42..5032.04 rows=138986 width=0) (actual time=0.021..6.398 rows=111111 loops=3)
--                     Index Cond: (a = 'active'::text)
--                     Heap Fetches: 0
-- Planning Time: 0.067 ms
-- Execution Time: 13.540 ms

drop table tmp_1;
drop table tmp_2;

I would say the gain isn't worth it.


djezzzl commented on June 12, 2024

Hi @toydestroyer,

Could you please say if there is any update on the topic? Just wondering if you had a moment to work on the PR.


toydestroyer commented on June 12, 2024

Hey @djezzzl, no updates ☚ī¸

I don't think I'll be able to do anything productive this year, my mind is already on holiday 😂


djezzzl commented on June 12, 2024

Hi @toydestroyer,

Happy New Year!

I'm reading the book "SQL Antipatterns" by Bill Karwin at the moment, and one thing I liked is its solution to what we have been discussing here.
You can read more about it on Medium.

Put shortly, the best way to handle it is to have a separate table as "enum storage" and define a foreign key constraint.
This gives a flexible, portable, performant, and durable solution.

CREATE TABLE user_statuses (
  status varchar primary key,
  active boolean -- optional if we want to deprecate some of the statuses but keep them in the data
);

CREATE TABLE users (
  id bigserial primary key,
  full_name varchar,
  status varchar,

  FOREIGN KEY (status) REFERENCES user_statuses(status)
);
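
For illustration (hypothetical values, not from the book), the constraint then catches typos at the database level:

INSERT INTO user_statuses (status) VALUES ('active'), ('inactive');

INSERT INTO users (full_name, status) VALUES ('Jane', 'active');  -- ok
INSERT INTO users (full_name, status) VALUES ('John', 'actve');   -- fails: foreign key violation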

What do you think about enforcing this instead of enum?


toydestroyer commented on June 12, 2024

Hey @djezzzl 🎉

I see how it is better than enums, especially for adding and removing values. It also looks like a nice solution to enforce consistency if you already store enums as strings.

But as a downside of this approach, I see maintenance overhead. For example, ~60% of our models have enums (some more than 1). We'll need to create many extra tables that will add unnecessary noise to our schema.rb. And we'll need to ensure that records in those tables are in sync between environments.

But I do think it is perfectly fine to have multiple checkers and allow developers to choose the ones that they think are good for their use case, like rubocop allows you to choose styles you want to enforce (" vs ').

What do you think about extending the configuration options of DatabaseConsistency to be something like:

DatabaseConsistencyCheckers:
  EnumTypeChecker:
    enforced_style: enum # foreign_key, string, integer

So far, the gem was mainly ensuring consistency between model definitions and the database schema, but it can do more. Namely — checking style consistency (good enough for v2? 🤔). Because at the end of the day, the gem is called DatabaseConsistency.

What do you think?

P. S. Saw your recent medium post. Will there be a checker for that? 😉
P. P. S. I hope I'll be able to allocate some time in the coming weeks to work on the checker to ensure the native enum type.


djezzzl commented on June 12, 2024

Hi @toydestroyer,

Sorry for taking so long to reply.

We'll need to create many extra tables that will add unnecessary noise to our schema.rb.

This point should be fine, as you would have to define the same number of enums either way. They may have shorter definitions, though.

And we'll need to ensure that records in those tables are in sync between environments.

This one is a bit more annoying but:

  • Rails has a sophisticated way of producing seeds (see the sketch after this list);
  • some companies prefer obfuscated dumps to populate other environments, so it should be trivial to keep them in sync.
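
A minimal sketch of keeping such a table in sync (assuming the user_statuses table from my earlier example; a seed would just run something like this idempotently):

INSERT INTO user_statuses (status)
VALUES ('active'), ('inactive')
ON CONFLICT (status) DO NOTHING;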

But I do think it is perfectly fine to have multiple checkers and allow developers to choose the ones that they think are good for their use case

Good suggestion! As long as we have a rationale behind every strategy, we can keep them all.

So far, in your case, PostgreSQL enums make more sense to you, and I respect that; you still prefer them even after considering the approach with tables and foreign keys. So the only thing that remains is to write up both directions in plain text, with the reasoning for when to choose which, so everybody can learn from this.

Could you please help me with that? It could be a good topic for a follow-up article 😄

Namely — checking style consistency (good enough for v2? 🤔). Because at the end of the day, the gem is called DatabaseConsistency.

Brilliant!

P. S. Saw your recent medium post. Will there be a checker for that?

Glad to hear that! Do you have any feedback?

And yes, the check is planned along with "hash index promotion." So adding to your idea, the library could cover three things now:

  • optimizations
  • consistency
  • styles

What do you think?

P. P. S. I hope I'll be able to allocate some time in the coming weeks to work on the checker to ensure the native enum type.

That would be awesome! But if your time for coding is tight, I can do that part; I would just ask for your help with finishing the preparations first so we can release it smoothly without surprising anybody.


toydestroyer commented on June 12, 2024

Okay, so I did some benchmarking myself on my laptop (MBP M1 16GB), and here are my results:

-- 0. Postgres version
SELECT version();
-- => PostgreSQL 15.1 (Homebrew) on aarch64-apple-darwin22.1.0, compiled by Apple clang version 14.0.0 (clang-1400.0.29.202), 64-bit

-- 1. Tablespaces
-- create a tablespace for enum data
CREATE TABLESPACE enum_tablespace LOCATION './enum_test/enum';

-- create a tablespace for string data
CREATE TABLESPACE string_tablespace LOCATION './enum_test/string';

-- create a tablespace for string data with a foreign key
CREATE TABLESPACE string_fk_tablespace LOCATION './enum_test/string_fk';

-- create a tablespace for smallint data
CREATE TABLESPACE int2_tablespace LOCATION './enum_test/int2';

-- create a tablespace for integer data
CREATE TABLESPACE int4_tablespace LOCATION './enum_test/int4';

-- create a tablespace for bigint data
CREATE TABLESPACE int8_tablespace LOCATION './enum_test/int8';

-- 2. Support
-- enum storage for string_fk
CREATE TABLE transaction_statuses (
  status text primary key
) TABLESPACE string_fk_tablespace;

INSERT INTO transaction_statuses (status) VALUES ('pending'), ('processing'), ('completed'), ('failed');

-- enum
CREATE TYPE transaction_status AS ENUM ('pending', 'processing', 'completed', 'failed');

-- 3. Tables
-- create a table with an enum column to store transaction status in the enum_tablespace tablespace
CREATE TABLE enum_transactions (
    id SERIAL PRIMARY KEY,
    status transaction_status
) TABLESPACE enum_tablespace;

-- create a table with a string column to store transaction status in the string_tablespace tablespace
CREATE TABLE string_transactions (
    id SERIAL PRIMARY KEY,
    status text
) TABLESPACE string_tablespace;

-- create a table with a string column and a foreign key to store transaction status in the string_fk_tablespace tablespace
CREATE TABLE string_fk_transactions (
    id SERIAL PRIMARY KEY,
    status text,
    FOREIGN KEY (status) REFERENCES transaction_statuses(status)
) TABLESPACE string_fk_tablespace;

-- create a table with a smallint column to store transaction status in the int2_tablespace tablespace
CREATE TABLE int2_transactions (
    id SERIAL PRIMARY KEY,
    status INT2
) TABLESPACE int2_tablespace;

-- create a table with an integer column to store transaction status in the int4_tablespace tablespace
CREATE TABLE int4_transactions (
    id SERIAL PRIMARY KEY,
    status INT4
) TABLESPACE int4_tablespace;

-- create a table with a bigint column to store transaction status in the int8_tablespace tablespace
CREATE TABLE int8_transactions (
    id SERIAL PRIMARY KEY,
    status INT8
) TABLESPACE int8_tablespace;

-- insert 1 million rows of random data into the enum_transactions table
INSERT INTO enum_transactions (status) (SELECT ((ARRAY['pending', 'processing', 'completed', 'failed'])[generate_series % 4 + 1])::transaction_status FROM generate_series(1, 1000000));
-- => 2.343000s

-- insert 1 million rows of random data into the string_transactions table
INSERT INTO string_transactions (status) (SELECT ((ARRAY['pending', 'processing', 'completed', 'failed'])[generate_series % 4 + 1]) FROM generate_series(1, 1000000));
-- => 2.431000s

-- insert 1 million rows of random data into the string_fk_transactions table
INSERT INTO string_fk_transactions (status) (SELECT ((ARRAY['pending', 'processing', 'completed', 'failed'])[generate_series % 4 + 1]) FROM generate_series(1, 1000000));
-- => 4.561000s

-- insert 1 million rows of random data into the int2_transactions table
INSERT INTO int2_transactions (status) (SELECT generate_series % 4 FROM generate_series(1, 1000000));
-- => 2.083000s

-- insert 1 million rows of random data into the int4_transactions table
INSERT INTO int4_transactions (status) (SELECT generate_series % 4 FROM generate_series(1, 1000000));
-- => 1.898000s

-- insert 1 million rows of random data into the int8_transactions table
INSERT INTO int8_transactions (status) (SELECT generate_series % 4 FROM generate_series(1, 1000000));
-- => 1.845000s

Before proceeding, note that inserting into the table with a foreign key takes about twice as long as the other examples.

Vacuuming and analyzing:

VACUUM ANALYZE enum_transactions;
VACUUM ANALYZE string_fk_transactions;
VACUUM ANALYZE string_transactions;
VACUUM ANALYZE int2_transactions;
VACUUM ANALYZE int4_transactions;
VACUUM ANALYZE int8_transactions;

Check the disk usage of each table:

SELECT pg_total_relation_size('enum_transactions') as enum_size,
       pg_total_relation_size('string_transactions') as string_size,
       pg_total_relation_size('string_fk_transactions') as string_fk_size,
       pg_total_relation_size('int2_transactions') as int2_size,
       pg_total_relation_size('int4_transactions') as int4_size,
       pg_total_relation_size('int8_transactions') as int8_size;

-- => enum_size:      58777600 (56 MB)
-- => string_size:    66822144 (64 MB)
-- => string_fk_size: 66822144 (64 MB)
-- => int2_size:      58777600 (56 MB)
-- => int4_size:      58777600 (56 MB)
-- => int8_size:      66813952 (64 MB)

I also used tablespaces because I wanted to see actual disk usage. So here's du -d1:

-- 71888 ./int4      (35M)
-- 71888 ./int2      (35M)
-- 88320 ./string_fk (43M)
-- 71408 ./enum      (35M)
-- 88272 ./int8      (43M)
-- 88288 ./string    (43M)

Cleanup:

-- tables
DROP TABLE IF EXISTS enum_transactions;
DROP TABLE IF EXISTS string_transactions;
DROP TABLE IF EXISTS string_fk_transactions;
DROP TABLE IF EXISTS int2_transactions;
DROP TABLE IF EXISTS int4_transactions;
DROP TABLE IF EXISTS int8_transactions;

DROP TABLE IF EXISTS transaction_statuses;

-- types
DROP TYPE transaction_status;

-- tablespaces
DROP TABLESPACE IF EXISTS enum_tablespace;
DROP TABLESPACE IF EXISTS string_tablespace;
DROP TABLESPACE IF EXISTS string_fk_tablespace;
DROP TABLESPACE IF EXISTS int4_tablespace;
DROP TABLESPACE IF EXISTS int2_tablespace;
DROP TABLESPACE IF EXISTS int8_tablespace;

Conclusion

Native enum is a clear winner for me. It combines the space efficiency of integers and the readability of strings. It also enforces consistency.

It would be nice to have some sanity check on my code because I'm pretty sure I'm very biased towards native enums 😆

I used the text datatype for simplicity. varchar doesn't make a difference (I checked). varchar(255) is less space-consuming but insignificantly.

I was also super surprised that int2 and int4 consume an equal amount of disk space.

And I'm even more skeptical about strings with foreign keys now. They are still strings and consume disk space as regular strings, and it seems like having a foreign key does affect the insert performance 🤷‍♂ī¸

Thoughts? 😊


djezzzl commented on June 12, 2024

Wow! Such tremendous benchmarking, @toydestroyer!

I'm so sorry I haven't looked at this as early as I should have. I will do my best to have a good look this weekend.

