Giter Site home page Giter Site logo

Comments (17)

brycethornton avatar brycethornton commented on August 21, 2024 2

@shayonj Thank you so much for spending some time on this. Just FYI, for this run I had upgraded to an insanely big database instance size on RDS (something like a db.m7g.16xlarge) and we didn't run out of available memory or CPU though I know many PG settings dictate some of this regardless of how much power the machine has.

As for the the way PG does a seq scan for foreign key validation, that surprises me but I'm sure they have their reasons. You'd think that if the column we are looking for has an index they would use that but maybe they don't want to trust the accuracy of an index when validating constraints?

I'm excited to try this out again with the "validation constraints one in each transaction" change you just made. I have a couple more table re-writes coming up soon.

from pg-osc.

jfrost avatar jfrost commented on August 21, 2024 2

Hey Bryce, if you try it again, can you make sure you have Performance Insights enabled on the RDS instance? That might give us a clue what's going wrong if Shayon's recent change doesn't fix it.

from pg-osc.

jfrost avatar jfrost commented on August 21, 2024 1

@brycethornton Was the query you killed a pg-osc query or something else? It would be great to know what the queries were blocked on, but I realize the forensics are likely non-existent.

from pg-osc.

shayonj avatar shayonj commented on August 21, 2024 1

So, looking into this further, I think its likely the full table scan affecting the other writes in the database, likely from resource exhaustion and not the lock itself.

From reading a bit into how ATRewriteTable works with validateForeignKeyConstraint which further performs a seq scan, it appears that the seq scan is the intentional behavior during foreign key validations

Ref: https://github.com/postgres/postgres/blob/5f27b5f848a433ba54c521ccb889788b8f4d6ba7/src/backend/commands/tablecmds.c#L12070

There are few things I can think of

  • Perform the validation constraints one in each transaction (PR in coming)
  • Allow a flag in pg-osc to not mark a constraint as validated and keep it as NOT VALID. This would help in your case I believe, however it means the constraint will always stay in a not valid state and only validate new rows. Maybe thats OK?
  • Update the pg system catalog to mark the constraint as valid thus by passing seq scan. I am not sure if this is possible managed databases like AWS RDS, Heroku, etc since most don't allow direct updates in pg catalog tables

from pg-osc.

brycethornton avatar brycethornton commented on August 21, 2024 1

@jfrost The next ones I need to do are on Heroku so definitely less visibility there but I'll do my best to report any performance issues I see.

After looking at my upcoming work I may just leave all the default settings in place vs. using the new "skip validations" option. There is only one foreign key on the tables I'm going to be rewriting and it's to a table with a few million records vs. 1.5 billion like the time I wrote about in this issue so I'm hoping it won't cause any issues. Though, one of the tables I'll be re-writing has around 1.3 billion records so still lots of validation to do.

One idea I had for pg-osc would be to do some quick observations about the table (and related tables) before actually starting the process. We could warn the user about any potential issues with large tables, constraints, etc. Or, we could just keep this functionality as it's own process that runs with a preview command instead of perform. Just thoughts. Here's what the output could look like:

pg-online-schema-change preview {all the same alter/connection options as perform}

Table to alter: table_name_from_alter_statement
Estimated number of records in table: 345,000,000
Size of table on disk: 30GB
Estimated time to write new table: 8 hours (based on a test run of 1% of your records)
Estimated time to validate constraints: 6 hours (based on ?)
Constraints to validate:
- Referential foreign key: column_name -> foreign_key_table (1.2B records) (WARNING: this could take a significant amount of time due to the large number of records)
- Referential foreign key: another_column_name -> another_table (200 records)
- Self foreign key: some_table has foreign key to table_name_from_alter_statement.id (20M records)
- Self foreign key: some_table2 has foreign key to table_name_from_alter_statement.id (2M records)
- Self foreign key: some_table3 has foreign key to table_name_from_alter_statement.id (2B records) (WARNING: this could take a significant amount of time due to the large number of records)

Suggestions:
- Given the large number of records in some related tables you may want to use the `--skip-foreign-key-validation` flag and then validate constraints separately to speed up the process.
- ???

from pg-osc.

shayonj avatar shayonj commented on August 21, 2024 1

Thats a great idea, i have been ideating a few features for apro version and dry run is one of them. I think it may make sense to have it on the open source version too. If you don't mine opening a ticket, otherwise i will get to it later.

Def keep us posted, hopefully it goes smooth on the next one 🀞🏾 . It might be tough but you would tail the lock statements on a loop from the query here

from pg-osc.

shayonj avatar shayonj commented on August 21, 2024

hm! thats interesting. The validate constraint grabs a SHARE UPDATE EXCLUSIVE so the lock up shouldn't happen. Is it blocking reads/writes or other ALTER commands perhaps?

from pg-osc.

shayonj avatar shayonj commented on August 21, 2024

We could add a CHECK constraint to be safe, so that data moving forward will be validated and then validate the older rows too and accordingly remove the NOT VALID constraint, but even in the first place it shouldn't block writes I believe

from pg-osc.

brycethornton avatar brycethornton commented on August 21, 2024

Sorry I don't have more details. The last time this bit me I was jolted out of bed by PagerDuty and I wearily killed a query to get things going again. Writes were blocked but I'm not sure why. Here's the relevant output from pg-osc at the time:

{"name":"pg-online-schema-change","hostname":"ip-10-0-1-1","pid":2685640,"level":20,"time":"2023-08-03T06:28:53.295+00:00","v":0,"msg":"Running query","query":"SELECT pg_get_serial_sequence('pgosc_st_table_name', 'id') as sequence_name\n","version":"0.9.2"}
WARNING:  there is already a transaction in progress
{"name":"pg-online-schema-change","hostname":"ip-10-0-1-1","pid":2685640,"level":20,"time":"2023-08-03T06:28:53.300+00:00","v":0,"msg":"Running query","query":"SELECT setval((select pg_get_serial_sequence('pgosc_st_table_name', 'id')), (SELECT max(id) FROM table_name));\n;\nALTER TABLE table_name RENAME to pgosc_op_table_table_name;\nALTER TABLE pgosc_st_table_name RENAME to table_name;\n\nALTER TABLE table_name ADD CONSTRAINT table_name_query_id_e53c7c03_fk_query_table_id FOREIGN KEY (query_id) REFERENCES query_table(id) DEFERRABLE INITIALLY DEFERRED NOT VALID;\n\nALTER TABLE table_name RESET (autovacuum_enabled, toast.autovacuum_enabled);\nDROP TRIGGER IF EXISTS primary_to_audit_table_trigger ON table_name;\n","version":"0.9.2"}
WARNING:  there is already a transaction in progress
{"name":"pg-online-schema-change","hostname":"ip-10-0-1-1","pid":2685640,"level":20,"time":"2023-08-03T06:28:53.323+00:00","v":0,"msg":"Running query","query":"COMMIT;","version":"0.9.2"}
WARNING:  there is already a transaction in progress
WARNING:  there is no transaction in progress
{"name":"pg-online-schema-change","hostname":"ip-10-0-1-1","pid":2685640,"level":20,"time":"2023-08-03T06:28:53.326+00:00","v":0,"msg":"Running query","query":"SET statement_timeout = 0;","version":"0.9.2"}
{"name":"pg-online-schema-change","hostname":"ip-10-0-1-1","pid":2685640,"level":30,"time":"2023-08-03T06:28:53.328+00:00","v":0,"msg":"Performing ANALYZE!","version":"0.9.2"}
{"name":"pg-online-schema-change","hostname":"ip-10-0-1-1","pid":2685640,"level":20,"time":"2023-08-03T06:28:53.328+00:00","v":0,"msg":"Running query","query":"ANALYZE VERBOSE table_name;","version":"0.9.2"}
INFO:  analyzing "public.table_name"
INFO:  "table_name": scanned 30000 of 32264370 pages, containing 1854617 live rows and 0 dead rows; 30000 rows in sample, 1994601637 estimated total rows
{"name":"pg-online-schema-change","hostname":"ip-10-0-1-1","pid":2685640,"level":30,"time":"2023-08-03T06:29:17.047+00:00","v":0,"msg":"Validating constraints!","version":"0.9.2"}
{"name":"pg-online-schema-change","hostname":"ip-10-0-1-1","pid":2685640,"level":20,"time":"2023-08-03T06:29:17.047+00:00","v":0,"msg":"Running query","query":"SELECT  conrelid::regclass AS table_on,\n        confrelid::regclass AS table_from,\n        contype as constraint_type,\n        conname AS constraint_name,\n        convalidated AS constraint_validated,\n        pg_get_constraintdef(oid) AS definition\nFROM   \tpg_constraint\nWHERE  \tcontype IN ('f', 'p')\n","version":"0.9.2"}
{"name":"pg-online-schema-change","hostname":"ip-10-0-1-1","pid":2685640,"level":20,"time":"2023-08-03T06:29:17.049+00:00","v":0,"msg":"Running query","query":"ALTER TABLE table_name VALIDATE CONSTRAINT table_name_query_id_e53c7c03_fk_query_table_id;","version":"0.9.2"}
{"name":"pg-online-schema-change","hostname":"ip-10-0-1-1","pid":2685640,"level":30,"time":"2023-08-03T07:01:46.205+00:00","v":0,"msg":"Exception raised, rolling back query","rollback":true,"query":"ALTER TABLE table_name VALIDATE CONSTRAINT table_name_query_id_e53c7c03_fk_query_table_id;","version":"0.9.2"}
{"name":"pg-online-schema-change","hostname":"ip-10-0-1-1","pid":2685640,"level":60,"time":"2023-08-03T07:01:46.206+00:00","v":0,"msg":"Something went wrong: ERROR:  canceling statement due to user request\nCONTEXT:  SQL statement \"SELECT fk.\"query_id\" FROM ONLY \"public\".\"table_name\" fk LEFT OUTER JOIN ONLY \"public\".\"query_table\" pk ON ( pk.\"id\" OPERATOR(pg_catalog.=) fk.\"query_id\") WHERE pk.\"id\" IS NULL AND (fk.\"query_id\" IS NOT NULL)\"\n","e":"ERROR:  canceling statement due to user request\nCONTEXT:  SQL statement \"SELECT fk.\"query_id\" FROM ONLY \"public\".\"table_name\" fk LEFT OUTER JOIN ONLY \"public\".\"query_table\" pk ON ( pk.\"id\" OPERATOR(pg_catalog.=) fk.\"query_id\") WHERE pk.\"id\" IS NULL AND (fk.\"query_id\" IS NOT NULL)\"\n","version":"0.9.2"}
{"name":"pg-online-schema-change","hostname":"ip-10-0-1-1","pid":2685640,"level":20,"time":"2023-08-03T07:01:46.206+00:00","v":0,"msg":"Running query","query":"DROP TRIGGER IF EXISTS primary_to_audit_table_trigger ON table_name;\nDROP TABLE IF EXISTS pgosc_at_table_name_6332cd;\nDROP TABLE IF EXISTS pgosc_st_table_name;\n\nRESET statement_timeout;\nRESET client_min_messages;\nRESET lock_timeout;\n","version":"0.9.2"}
/var/lib/gems/3.0.0/gems/pg_online_schema_change-0.9.2/lib/pg_online_schema_change/query.rb:78:in `exec': ERROR:  canceling statement due to user request (PG::QueryCanceled)
CONTEXT:  SQL statement "SELECT fk."query_id" FROM ONLY "public"."table_name" fk LEFT OUTER JOIN ONLY "public"."query_table" pk ON ( pk."id" OPERATOR(pg_catalog.=) fk."query_id") WHERE pk."id" IS NULL AND (fk."query_id" IS NOT NULL)"

from pg-osc.

shayonj avatar shayonj commented on August 21, 2024

Yeah, looks like the ALTER TABLE table_name VALIDATE CONSTRAINT.... query was started at 2023-08-03T06:29:17.049+00:00 and then the statement timeout request came in at 2023-08-03T07:01:46.205+00:00. Its possible the query was running for ~32 mins, but it shouldn't be blocking writes, esp. since the swap is done.

Whats the postgres version you are on? And yeah, i guess if there is an snapshot of queries that were blocked, that would be immensely helpful but hard to get ofc.

from pg-osc.

brycethornton avatar brycethornton commented on August 21, 2024

We're on Postgres v13. Yeah, I'm kicking myself for not getting the specific query I killed but I believe it was the ALTER TABLE table_name VALIDATE CONSTRAINT ... query. The queries that were blocked were inserts into the table I was re-writing (the "Main Table" below). Maybe the related table was in a locked state for some reason with all the foreign-key constraint validation happening and that didn't allow new rows in the primary table to validate their foreign key when being added. I know it'll be nearly impossible to figure out without the ability to see the queries or re-create the issue. Here is some very rough info about the tables involved:

Main Table to be re-written:

  • 1.9 billion rows
  • each row has a foreign key reference to a row in Secondary Table

Secondary Table:

  • 1.5 billion rows

from pg-osc.

shayonj avatar shayonj commented on August 21, 2024

hm! interesting. No worries regarding the blocking queries.

Q: Is this the first time you have the run pg-osc on the table or has it run successfully in the past?

Also, does any of the table have non foreign key related constraints?

from pg-osc.

shayonj avatar shayonj commented on August 21, 2024

Also, if you still have the shadow table hanging around (i think not, because of the cleanup script), would love to see the DDL /table structure for it \d+ $shadow_table

from pg-osc.

shayonj avatar shayonj commented on August 21, 2024

Also, looks like the swap has already happened so you are on the new table which means the constraints would still be marked as NOT VALID you can confirm it with \d+ $table-name from a psql session or from

SELECT  conrelid::regclass AS table_on,
                  confrelid::regclass AS table_from,
                  contype as constraint_type,
                  conname AS constraint_name,
                  convalidated AS constraint_validated,
                  pg_get_constraintdef(oid) AS definition
          FROM   	pg_constraint
          WHERE  	contype IN ('f', 'p')

If so, and if you are willing to try again then you can run ALTER TABLE table_name VALIDATE CONSTRAINT table_name_query_id_e53c7c03_fk_query_table_id; and see what kind of locks are held by running the following query ( modified from here). It will show the blocked and blocking query, along with the locks

 SELECT blocked_locks.pid     AS blocked_pid,
         blocked_activity.usename  AS blocked_user,
         blocking_locks.mode  AS   blocking_lock_mode,
         blocked_locks.mode  AS   blocked_locks_mode,
         blocked_activity.usename  AS blocked_user,
         blocking_locks.pid     AS blocking_pid,
         blocking_activity.usename AS blocking_user,
         blocked_activity.query    AS blocked_statement,
         blocking_activity.query   AS current_statement_in_blocking_process
   FROM  pg_catalog.pg_locks         blocked_locks
    JOIN pg_catalog.pg_stat_activity blocked_activity  ON blocked_activity.pid = blocked_locks.pid
    JOIN pg_catalog.pg_locks         blocking_locks 
        ON blocking_locks.locktype = blocked_locks.locktype
        AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
        AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
        AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
        AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
        AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
        AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
        AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
        AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
        AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
        AND blocking_locks.pid != blocked_locks.pid

    JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
   WHERE NOT blocked_locks.granted;

from pg-osc.

shayonj avatar shayonj commented on August 21, 2024

Thinking out loud: Because ALTER TABLE table_name VALIDATE CONSTRAINT constraint_name; only takes a β€˜SHARE UPDATE EXCLUSIVE and performs a full table scan, I wonder if its the full table scan on the large table is causing a resource contention (CPU/memory/io) on the database and if that is causing the slowdown in reads/writes πŸ€”

from pg-osc.

shayonj avatar shayonj commented on August 21, 2024

Closing the issue now after the latest release which has some workarounds, in case that is useful - https://github.com/shayonj/pg-osc/releases/tag/v0.9.5

I hope the last comments help, sorry if its not a satisfactory response. Feel free to share any findings and happy to help diagnose too.

from pg-osc.

shayonj avatar shayonj commented on August 21, 2024

thanks! def curious to see if it works out 🀞🏾

from pg-osc.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.