Comments (17)
@shayonj Thank you so much for spending some time on this. Just FYI, for this run I had upgraded to an insanely big database instance size on RDS (something like a db.m7g.16xlarge) and we didn't run out of available memory or CPU though I know many PG settings dictate some of this regardless of how much power the machine has.
As for the the way PG does a seq scan for foreign key validation, that surprises me but I'm sure they have their reasons. You'd think that if the column we are looking for has an index they would use that but maybe they don't want to trust the accuracy of an index when validating constraints?
I'm excited to try this out again with the "validation constraints one in each transaction" change you just made. I have a couple more table re-writes coming up soon.
from pg-osc.
Hey Bryce, if you try it again, can you make sure you have Performance Insights enabled on the RDS instance? That might give us a clue what's going wrong if Shayon's recent change doesn't fix it.
from pg-osc.
@brycethornton Was the query you killed a pg-osc
query or something else? It would be great to know what the queries were blocked on, but I realize the forensics are likely non-existent.
from pg-osc.
So, looking into this further, I think its likely the full table scan affecting the other writes in the database, likely from resource exhaustion and not the lock itself.
From reading a bit into how ATRewriteTable
works with validateForeignKeyConstraint
which further performs a seq scan, it appears that the seq scan is the intentional behavior during foreign key validations
There are few things I can think of
- Perform the validation constraints one in each transaction (PR in coming)
- Allow a flag in
pg-osc
to not mark a constraint as validated and keep it asNOT VALID
. This would help in your case I believe, however it means the constraint will always stay in a not valid state and only validate new rows. Maybe thats OK? - Update the pg system catalog to mark the constraint as valid thus by passing seq scan. I am not sure if this is possible managed databases like AWS RDS, Heroku, etc since most don't allow direct updates in pg catalog tables
from pg-osc.
@jfrost The next ones I need to do are on Heroku so definitely less visibility there but I'll do my best to report any performance issues I see.
After looking at my upcoming work I may just leave all the default settings in place vs. using the new "skip validations" option. There is only one foreign key on the tables I'm going to be rewriting and it's to a table with a few million records vs. 1.5 billion like the time I wrote about in this issue so I'm hoping it won't cause any issues. Though, one of the tables I'll be re-writing has around 1.3 billion records so still lots of validation to do.
One idea I had for pg-osc
would be to do some quick observations about the table (and related tables) before actually starting the process. We could warn the user about any potential issues with large tables, constraints, etc. Or, we could just keep this functionality as it's own process that runs with a preview
command instead of perform
. Just thoughts. Here's what the output could look like:
pg-online-schema-change preview {all the same alter/connection options as perform}
Table to alter: table_name_from_alter_statement
Estimated number of records in table: 345,000,000
Size of table on disk: 30GB
Estimated time to write new table: 8 hours (based on a test run of 1% of your records)
Estimated time to validate constraints: 6 hours (based on ?)
Constraints to validate:
- Referential foreign key: column_name -> foreign_key_table (1.2B records) (WARNING: this could take a significant amount of time due to the large number of records)
- Referential foreign key: another_column_name -> another_table (200 records)
- Self foreign key: some_table has foreign key to table_name_from_alter_statement.id (20M records)
- Self foreign key: some_table2 has foreign key to table_name_from_alter_statement.id (2M records)
- Self foreign key: some_table3 has foreign key to table_name_from_alter_statement.id (2B records) (WARNING: this could take a significant amount of time due to the large number of records)
Suggestions:
- Given the large number of records in some related tables you may want to use the `--skip-foreign-key-validation` flag and then validate constraints separately to speed up the process.
- ???
from pg-osc.
Thats a great idea, i have been ideating a few features for apro
version and dry run is one of them. I think it may make sense to have it on the open source version too. If you don't mine opening a ticket, otherwise i will get to it later.
Def keep us posted, hopefully it goes smooth on the next one π€πΎ . It might be tough but you would tail the lock statements on a loop from the query here
from pg-osc.
hm! thats interesting. The validate constraint grabs a SHARE UPDATE EXCLUSIVE
so the lock up shouldn't happen. Is it blocking reads/writes or other ALTER commands perhaps?
from pg-osc.
We could add a CHECK
constraint to be safe, so that data moving forward will be validated and then validate the older rows too and accordingly remove the NOT VALID
constraint, but even in the first place it shouldn't block writes I believe
from pg-osc.
Sorry I don't have more details. The last time this bit me I was jolted out of bed by PagerDuty and I wearily killed a query to get things going again. Writes were blocked but I'm not sure why. Here's the relevant output from pg-osc
at the time:
{"name":"pg-online-schema-change","hostname":"ip-10-0-1-1","pid":2685640,"level":20,"time":"2023-08-03T06:28:53.295+00:00","v":0,"msg":"Running query","query":"SELECT pg_get_serial_sequence('pgosc_st_table_name', 'id') as sequence_name\n","version":"0.9.2"}
WARNING: there is already a transaction in progress
{"name":"pg-online-schema-change","hostname":"ip-10-0-1-1","pid":2685640,"level":20,"time":"2023-08-03T06:28:53.300+00:00","v":0,"msg":"Running query","query":"SELECT setval((select pg_get_serial_sequence('pgosc_st_table_name', 'id')), (SELECT max(id) FROM table_name));\n;\nALTER TABLE table_name RENAME to pgosc_op_table_table_name;\nALTER TABLE pgosc_st_table_name RENAME to table_name;\n\nALTER TABLE table_name ADD CONSTRAINT table_name_query_id_e53c7c03_fk_query_table_id FOREIGN KEY (query_id) REFERENCES query_table(id) DEFERRABLE INITIALLY DEFERRED NOT VALID;\n\nALTER TABLE table_name RESET (autovacuum_enabled, toast.autovacuum_enabled);\nDROP TRIGGER IF EXISTS primary_to_audit_table_trigger ON table_name;\n","version":"0.9.2"}
WARNING: there is already a transaction in progress
{"name":"pg-online-schema-change","hostname":"ip-10-0-1-1","pid":2685640,"level":20,"time":"2023-08-03T06:28:53.323+00:00","v":0,"msg":"Running query","query":"COMMIT;","version":"0.9.2"}
WARNING: there is already a transaction in progress
WARNING: there is no transaction in progress
{"name":"pg-online-schema-change","hostname":"ip-10-0-1-1","pid":2685640,"level":20,"time":"2023-08-03T06:28:53.326+00:00","v":0,"msg":"Running query","query":"SET statement_timeout = 0;","version":"0.9.2"}
{"name":"pg-online-schema-change","hostname":"ip-10-0-1-1","pid":2685640,"level":30,"time":"2023-08-03T06:28:53.328+00:00","v":0,"msg":"Performing ANALYZE!","version":"0.9.2"}
{"name":"pg-online-schema-change","hostname":"ip-10-0-1-1","pid":2685640,"level":20,"time":"2023-08-03T06:28:53.328+00:00","v":0,"msg":"Running query","query":"ANALYZE VERBOSE table_name;","version":"0.9.2"}
INFO: analyzing "public.table_name"
INFO: "table_name": scanned 30000 of 32264370 pages, containing 1854617 live rows and 0 dead rows; 30000 rows in sample, 1994601637 estimated total rows
{"name":"pg-online-schema-change","hostname":"ip-10-0-1-1","pid":2685640,"level":30,"time":"2023-08-03T06:29:17.047+00:00","v":0,"msg":"Validating constraints!","version":"0.9.2"}
{"name":"pg-online-schema-change","hostname":"ip-10-0-1-1","pid":2685640,"level":20,"time":"2023-08-03T06:29:17.047+00:00","v":0,"msg":"Running query","query":"SELECT conrelid::regclass AS table_on,\n confrelid::regclass AS table_from,\n contype as constraint_type,\n conname AS constraint_name,\n convalidated AS constraint_validated,\n pg_get_constraintdef(oid) AS definition\nFROM \tpg_constraint\nWHERE \tcontype IN ('f', 'p')\n","version":"0.9.2"}
{"name":"pg-online-schema-change","hostname":"ip-10-0-1-1","pid":2685640,"level":20,"time":"2023-08-03T06:29:17.049+00:00","v":0,"msg":"Running query","query":"ALTER TABLE table_name VALIDATE CONSTRAINT table_name_query_id_e53c7c03_fk_query_table_id;","version":"0.9.2"}
{"name":"pg-online-schema-change","hostname":"ip-10-0-1-1","pid":2685640,"level":30,"time":"2023-08-03T07:01:46.205+00:00","v":0,"msg":"Exception raised, rolling back query","rollback":true,"query":"ALTER TABLE table_name VALIDATE CONSTRAINT table_name_query_id_e53c7c03_fk_query_table_id;","version":"0.9.2"}
{"name":"pg-online-schema-change","hostname":"ip-10-0-1-1","pid":2685640,"level":60,"time":"2023-08-03T07:01:46.206+00:00","v":0,"msg":"Something went wrong: ERROR: canceling statement due to user request\nCONTEXT: SQL statement \"SELECT fk.\"query_id\" FROM ONLY \"public\".\"table_name\" fk LEFT OUTER JOIN ONLY \"public\".\"query_table\" pk ON ( pk.\"id\" OPERATOR(pg_catalog.=) fk.\"query_id\") WHERE pk.\"id\" IS NULL AND (fk.\"query_id\" IS NOT NULL)\"\n","e":"ERROR: canceling statement due to user request\nCONTEXT: SQL statement \"SELECT fk.\"query_id\" FROM ONLY \"public\".\"table_name\" fk LEFT OUTER JOIN ONLY \"public\".\"query_table\" pk ON ( pk.\"id\" OPERATOR(pg_catalog.=) fk.\"query_id\") WHERE pk.\"id\" IS NULL AND (fk.\"query_id\" IS NOT NULL)\"\n","version":"0.9.2"}
{"name":"pg-online-schema-change","hostname":"ip-10-0-1-1","pid":2685640,"level":20,"time":"2023-08-03T07:01:46.206+00:00","v":0,"msg":"Running query","query":"DROP TRIGGER IF EXISTS primary_to_audit_table_trigger ON table_name;\nDROP TABLE IF EXISTS pgosc_at_table_name_6332cd;\nDROP TABLE IF EXISTS pgosc_st_table_name;\n\nRESET statement_timeout;\nRESET client_min_messages;\nRESET lock_timeout;\n","version":"0.9.2"}
/var/lib/gems/3.0.0/gems/pg_online_schema_change-0.9.2/lib/pg_online_schema_change/query.rb:78:in `exec': ERROR: canceling statement due to user request (PG::QueryCanceled)
CONTEXT: SQL statement "SELECT fk."query_id" FROM ONLY "public"."table_name" fk LEFT OUTER JOIN ONLY "public"."query_table" pk ON ( pk."id" OPERATOR(pg_catalog.=) fk."query_id") WHERE pk."id" IS NULL AND (fk."query_id" IS NOT NULL)"
from pg-osc.
Yeah, looks like the ALTER TABLE table_name VALIDATE CONSTRAINT
.... query was started at 2023-08-03T06:29:17.049+00:00
and then the statement timeout request came in at 2023-08-03T07:01:46.205+00:00
. Its possible the query was running for ~32 mins, but it shouldn't be blocking writes, esp. since the swap is done.
Whats the postgres version you are on? And yeah, i guess if there is an snapshot of queries that were blocked, that would be immensely helpful but hard to get ofc.
from pg-osc.
We're on Postgres v13. Yeah, I'm kicking myself for not getting the specific query I killed but I believe it was the ALTER TABLE table_name VALIDATE CONSTRAINT ...
query. The queries that were blocked were inserts into the table I was re-writing (the "Main Table" below). Maybe the related table was in a locked state for some reason with all the foreign-key constraint validation happening and that didn't allow new rows in the primary table to validate their foreign key when being added. I know it'll be nearly impossible to figure out without the ability to see the queries or re-create the issue. Here is some very rough info about the tables involved:
Main Table to be re-written:
- 1.9 billion rows
- each row has a foreign key reference to a row in Secondary Table
Secondary Table:
- 1.5 billion rows
from pg-osc.
hm! interesting. No worries regarding the blocking queries.
Q: Is this the first time you have the run pg-osc on the table or has it run successfully in the past?
Also, does any of the table have non foreign key related constraints?
from pg-osc.
Also, if you still have the shadow table hanging around (i think not, because of the cleanup script), would love to see the DDL /table structure for it \d+ $shadow_table
from pg-osc.
Also, looks like the swap has already happened so you are on the new table which means the constraints would still be marked as NOT VALID
you can confirm it with \d+ $table-name
from a psql session or from
SELECT conrelid::regclass AS table_on,
confrelid::regclass AS table_from,
contype as constraint_type,
conname AS constraint_name,
convalidated AS constraint_validated,
pg_get_constraintdef(oid) AS definition
FROM pg_constraint
WHERE contype IN ('f', 'p')
If so, and if you are willing to try again then you can run ALTER TABLE table_name VALIDATE CONSTRAINT table_name_query_id_e53c7c03_fk_query_table_id;
and see what kind of locks are held by running the following query ( modified from here). It will show the blocked and blocking query, along with the locks
SELECT blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.mode AS blocking_lock_mode,
blocked_locks.mode AS blocked_locks_mode,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS current_statement_in_blocking_process
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
from pg-osc.
Thinking out loud: Because ALTER TABLE table_name VALIDATE CONSTRAINT constraint_name;
only takes a βSHARE UPDATE EXCLUSIVE
and performs a full table scan, I wonder if its the full table scan on the large table is causing a resource contention (CPU/memory/io) on the database and if that is causing the slowdown in reads/writes π€
from pg-osc.
Closing the issue now after the latest release which has some workarounds, in case that is useful - https://github.com/shayonj/pg-osc/releases/tag/v0.9.5
I hope the last comments help, sorry if its not a satisfactory response. Feel free to share any findings and happy to help diagnose too.
from pg-osc.
thanks! def curious to see if it works out π€πΎ
from pg-osc.
Related Issues (20)
- Set the appropriate next sequence value on the new table HOT 11
- εΎδΈιηε·₯ε · HOT 1
- Potential need for a `pull-batch-delay` option HOT 4
- Views aren't updated after the swap HOT 7
- Auto-Vacuum Isn't Re-enabled HOT 14
- Using pg-osc only for backfilling data HOT 4
- Copy phase is acquiring a lock on my original table I don't know why HOT 18
- Ability to show progress HOT 16
- Views from different schema aren't updated because of error HOT 1
- INSERT's are failing for long table names HOT 2
- Scope of this tool HOT 1
- Add a preview/dry-run command
- fails to get view definition if it is from another schema HOT 4
- Does pg-osc is able to insert millions of rows to an existing table? HOT 2
- will support `after` sql syntax ? HOT 1
- How does this will work in a live project. HOT 4
- Unable to perform alterations more than once on a table. HOT 2
- Update Readme reg- docker setup of pg-osc HOT 2
- Error case with delta-count and pull-batch-count values HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
π Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. πππ
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google β€οΈ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from pg-osc.