paradedb / pg_analytics

DuckDB-powered analytics for Postgres

Home Page: https://paradedb.com

License: GNU Affero General Public License v3.0

Language: Rust 100.00%
Topics: analytics arrow columnar datafusion lakehouse paradedb parquet postgres postgresql

pg_analytics's Introduction

ParadeDB

Postgres for Search and Analytics

Website · Docs · Community · Blog · Changelog



ParadeDB is an Elasticsearch alternative built on Postgres. We're modernizing the features of Elasticsearch's product suite, starting with real-time search and analytics.

Status

ParadeDB is currently in Public Beta. Star and watch this repository to get notified of updates.

Roadmap

  • Search
    • Full-text search with BM25 via pg_search
    • Dense and sparse vector search with pgvector
    • Hybrid search
  • Analytics
    • An analytical query engine over data in Postgres or in any object store or table format with pg_analytics
    • Column-oriented table access method for fast analytics inside Postgres
    • High-volume data/Kafka ingest
  • Self-Hosted ParadeDB
  • Specialized Workloads
    • Support for geospatial data with PostGIS
    • Support for cron jobs with pg_cron
    • Support for basic incremental view maintenance (IVM) via pg_ivm

Get Started

To get started, please visit our documentation.

Deploying ParadeDB

ParadeDB and its extensions are available as commercial software for installation on self-hosted Postgres deployments, and via Docker and Kubernetes as standalone images. For more information, including enterprise features and support, please contact us by email.

Extensions

You can find prebuilt binaries for all ParadeDB extensions on Debian 12, Ubuntu 22.04 and 24.04, and Red Hat Enterprise Linux 8 and 9 for Postgres 14, 15 and 16 in the GitHub Releases. We officially support Postgres 12 and above, and you can compile the extensions for other versions of Postgres by following the instructions in the respective extension's README.
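
For orientation, once a prebuilt binary is installed, each extension is enabled with a standard CREATE EXTENSION call. A minimal sketch (some extensions may also need to be listed in shared_preload_libraries; check the respective extension's README):

-- Run inside the target database as a superuser
CREATE EXTENSION IF NOT EXISTS pg_analytics;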

Docker Image

To quickly get a ParadeDB instance up and running, simply pull and run the latest Docker image:

docker run --name paradedb -e POSTGRES_PASSWORD=password paradedb/paradedb

This will start a ParadeDB instance with default user postgres and password password. You can then connect to the database using psql:

docker exec -it paradedb psql -U postgres

To install ParadeDB locally or on-premise, we recommend using our docker-compose.yml file. Alternatively, you can pass the appropriate environment variables to the docker run command, replacing the <> with your desired values:

docker run \
  --name paradedb \
  -e POSTGRES_USER=<user> \
  -e POSTGRES_PASSWORD=<password> \
  -e POSTGRES_DB=<dbname> \
  -v paradedb_data:/var/lib/postgresql/data/ \
  -p 5432:5432 \
  -d \
  paradedb/paradedb:latest

This will start a ParadeDB instance with non-root user <user> and password <password>. The -v flag enables your ParadeDB data to persist across restarts in a Docker volume named paradedb_data.

You can then connect to the database using psql:

docker exec -it paradedb psql -U <user> -d <dbname> -p 5432 -W

ParadeDB collects anonymous telemetry to help us understand how many people are using the project. You can opt out of telemetry using configuration variables within Postgres:

ALTER SYSTEM SET paradedb.pg_search_telemetry TO 'off';
ALTER SYSTEM SET paradedb.pg_analytics_telemetry TO 'off';

Helm Chart

ParadeDB is also available for Kubernetes via our Helm chart. You can find our Helm chart in the ParadeDB Helm Chart GitHub repository or download it directly from Artifact Hub.

ParadeDB Cloud

At the moment, ParadeDB is not available as a managed cloud service. If you are interested in a ParadeDB Cloud service, please let us know by joining our waitlist.

Support

If you're missing a feature or have found a bug, please open a GitHub Issue.

To get community support, you can join the ParadeDB Community Slack.

If you need commercial support, please contact the ParadeDB team.

Contributing

We welcome community contributions, big or small, and are here to guide you along the way. To get started contributing, check our first timer issues or message us in the ParadeDB Community Slack. Once you contribute, ping us in Slack and we'll send you some ParadeDB swag!

For more information on how to contribute, please see our Contributing Guide.

This project is released with a Contributor Code of Conduct. By participating in this project, you agree to follow its terms.

Thank you for helping us make ParadeDB better for everyone ❤️.

License

ParadeDB is licensed under the GNU Affero General Public License v3.0 and as commercial software. For commercial licensing, please contact us at [email protected].

pg_analytics's People

Contributors

evanxg852000, philippemnoel, rebasedming, weijun-h


pg_analytics's Issues

Support Replace based on a version key

Since pg_analytics is append-only, it might be useful to replace existing entries based on a version key (for example, updated_at), similar to what happens in ClickHouse's ReplacingMergeTree.

What is the roadmap for these cases, or is full UPDATE support planned?
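
Until some form of replace/upsert exists, a possible read-time workaround is to deduplicate on the key with plain Postgres SQL; a sketch, assuming a hypothetical table events with columns id and updated_at:

-- Keep only the latest version of each id, emulating ReplacingMergeTree semantics at read time
SELECT DISTINCT ON (id) *
FROM events
ORDER BY id, updated_at DESC;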

Type coercion fails between certain types for UDFs

Bug Description
When registering a UDF (paradedb/paradedb#818) with certain parameter datatypes (e.g. Int32 params), a numeric constant is passed in as Int64, leading to a type coercion failure.

How To Reproduce
Register a UDF with a parameter of type integer and then call the function with a numeric constant. For example:

CREATE FUNCTION add(integer, integer) RETURNS integer
    AS 'select $1 + $2;'
    LANGUAGE SQL
    IMMUTABLE
    RETURNS NULL ON NULL INPUT;

SELECT add(1, a) FROM x;

You should get a type coercion error from Int64 to Int32.

Proposed Fix
Unclear! Could be a DataFusion error, or could be solved using some plan optimization params.

PREPARE statement doesn't work

What happens?

PREPARE test_query(text) AS
SELECT count(*) FROM predictions where client = $1;
EXECUTE test_query('example');

I get this warning when I run the query and it takes forever:
WARNING: This query was not fully pushed down to DuckDB because DuckDB returned an error. Query times may be impacted. If you would like to see this query pushed down, please submit a request to https://github.com/paradedb/paradedb/issues with the following context:
Not implemented Error: Prepared statement argument types are not supported, use CAST

With this identical query it works in a few milliseconds:

SELECT count(*) FROM predictions where client = 'example';

DuckDB itself supports it: https://duckdb.org/docs/api/c/prepared.html
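
Following the hint in the error message, a possible workaround (untested here) is to cast the parameter explicitly inside the prepared statement so that DuckDB sees a concrete type:

PREPARE test_query(text) AS
SELECT count(*) FROM predictions WHERE client = CAST($1 AS text);
EXECUTE test_query('example');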

To Reproduce

Run the query from above

OS:

Ubuntu

ParadeDB Version:

v0.8.4

Are you using ParadeDB Docker, Helm, or the extension(s) standalone?

ParadeDB pg_lakehouse Extension

Full Name:

Janis

Affiliation:

Typewise

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include the code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?

  • Yes, I have

Implement PARTITION BY

What
In Postgres when creating a table you can specify PARTITION BY options.

We should pass these options to Datafusion so Datafusion can partition accordingly.
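
For reference, this is the standard Postgres declarative-partitioning syntax whose options would need to be forwarded; whether the parquet table access method can be attached to partitions like this is part of what this issue needs to decide (illustrative sketch, hypothetical table):

CREATE TABLE metrics (
    metric_date date,
    value       double precision
) PARTITION BY RANGE (metric_date);

CREATE TABLE metrics_2024 PARTITION OF metrics
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01')
    USING parquet;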

Why
Please describe why this feature is important.

How
Please describe how you'd implement this feature.

Write tests for type conversion

What
We have a pretty comprehensive type conversion framework to convert between DataFusion and Postgres types, but there are no tests in place for the more complex types.

Why
Type conversion determinism.

How
Test suite!

Implement Kafka ingestion

What
The 3 most common sources for data ingestion into ClickHouse today are:

  • CSV
  • Postgres
  • Kafka

We support CSV and Postgres (obviously), but we probably want to support Kafka. Event streams are a very common use case for analytics.

I was talking with Tyler, one of the main inventors of delta-rs, and he pointed me to https://github.com/delta-io/kafka-delta-ingest. He mentioned that, at Scribd, they implemented kafka-delta-ingest to drastically reduce costs and ingestion bottlenecks when ingesting data from Kafka. I believe this could be the right way for us to implement Kafka ingestion as well, and it could potentially offer really good performance compared to other solutions out there (we would need to benchmark).

Why
^

How
^

Unexpected behavior with large precision/scales for numerics

Bug Description
Large precision/scales on numerics cause strange errors to be thrown.

I noticed that when we store numerics in DataFusion, they get multiplied by 10^scale. For instance, if I have a numeric(5,2) and I store the value 1.01, it gets stored as 101. When we retrieve the numeric value back we scale it down, but this causes problems when the scale is extremely large (say 30): the number is multiplied by 10^30 before being stored in DataFusion, which overflows.

How To Reproduce

CREATE TABLE t (a numeric(30,30)) USING parquet;
pg_analytics=# insert into t values (0.01);
ERROR:  Precommit callback failed: Delta(Arrow { source: ParseError("parse decimal overflow") })

Proposed Fix

Fix foreign table `EXPLAIN`s

What

EXPLAIN queries over pg_lakehouse foreign tables are not working correctly for a few reasons:

  1. Running EXPLAIN on a query that's fully pushed down to DuckDB returns the query plan of the foreign data wrapper, which isn't the actual query plan. This is because the FDW's query plan is only used if the query fails in the executor hook and falls back to the FDW.

For instance, EXPLAIN SELECT COUNT(*) returns

explain select count(*) from hits;
                          QUERY PLAN
--------------------------------------------------------------
 Aggregate  (cost=1.00..1.01 rows=1 width=8)
   ->  Foreign Scan on hits  (cost=0.00..1.00 rows=1 width=0)
         DuckDB Scan: SELECT * FROM public.hits
(3 rows)

This would be the correct plan if the query fell back to the FDW. In reality, this entire query is pushed down to DuckDB, so the query plan should be:

explain select count(*) from hits;
                          QUERY PLAN
--------------------------------------------------------------
DuckDB Scan: SELECT COUNT(*) FROM public.hits
(1 row)
  2. EXPLAIN with options like ANALYZE is not working at all. This is because:
    a) DuckDB does not recognize all of the Postgres EXPLAIN options, like VERBOSE, COSTS, etc.
    b) EXPLAIN ANALYZE is not properly handled; right now it throws an error.

Why
EXPLAIN should return the correct query plan.

How
Two ideas I've considered are:

  1. Using the process utility hook to intercept all EXPLAIN queries. However, we still need to figure out how to handle EXPLAIN options not supported by DuckDB.
  2. Convert the executor hook to a custom scan, which should make EXPLAIN queries work out of the box. However, I'm not sure about the scope of work that this would entail or if custom scans introduce unforeseen side effects to query performance.

Detoast datums

What
In our insert_tuples function we need to detoast datums or else large text values won't work.

Why
Please describe why this feature is important.

How
Please describe how you'd implement this feature.

UDF update and deletion

What
Upon CREATE OR REPLACE FUNCTION or DROP FUNCTION, we need to delete all functions with the corresponding signature and allow them to be reloaded when explicitly called.

Why
Right now, if a UDF is updated or deleted, the DataFusion context will not reflect this if the function has already been used once during the session.

How
Use the function name to delete all functions with the corresponding name from the context registry. Once overloaded functions are supported, this will become more complex, as it will require finding the right function by signature as well.

Need this PR to become available in the next release of DataFusion: apache/datafusion#9239

DATE_TRUNC('day', date_day) not supported

What happens?

When running this query I get an error:

SELECT DATE_TRUNC('day', date_day), AVG("queryLatency")
FROM predictions
GROUP BY DATE_TRUNC('day', date_day)

Error:
Query 1 ERROR at Line 1: : ERROR: Column date_trunc has Arrow data type Date32 but is mapped to the BuiltIn(TIMESTAMPOID) type in Postgres, which are incompatible. If you believe this conversion should be supported, please submit a request at https://github.com/paradedb/paradedb/issues.

Since DATE_TRUNC() is quite fundamental for many analyses, whenever you want to group by day, week, month, or year, it would be awesome if this could be fixed somehow. Tools like Metabase often offer users a way to choose the granularity of an analysis.

To Reproduce

Do a GROUP BY on DATE_TRUNC(). It happens with date and timestamp columns and also for 'week' etc.

OS:

Ubuntu(?) with PostgreSQL 16.3

ParadeDB Version:

0.8.3

Full Name:

Janis

Affiliation:

Typewise

What is the latest build you tested with? If possible, we recommend testing by compiling the latest dev branch.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include the code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?

  • Yes, I have

Support collation

What
Enable collation on columns/queries. Collation currently doesn't do anything:

CREATE TABLE names (
    name_en TEXT COLLATE "en_US",
    name_sv TEXT COLLATE "sv_SE"
);

INSERT INTO names (name_en, name_sv) VALUES ('Ostergren', 'Ostergren'), ('Östergren', 'Östergren');

-- These both return in the same order but they shouldn't
SELECT name_en FROM names ORDER BY name_en COLLATE "en_US";
SELECT name_sv FROM names ORDER BY name_sv COLLATE "sv_SE";

Why
Collation is useful for ordering text fields by language.

How

Strange behavior with inserts on new tables

I tried digging into this, but I can't seem to get past datafusion/deltalake, so maybe it's coming from within one of those crates? Or it could be something wrong in our implementation.

If you create a new table and insert into it and then try to execute a query with a UDF, the UDF crashes because pgrx won't operate on a new thread. The UDF executes on a different thread from the main one. If you call paradedb.init() and try again, it works... This also happens after a table has been newly truncated and inserted into. It isn't a problem after inserting on a non-empty table. For now, there's a stopgap solution to make sure we avoid the crash, but it doesn't get to the root of the threading issue.

@neilyio also reported some weird behavior with executing SQL directly from Rust when a new table is created and then inserted into in the same semicolon-separated query. The new data just doesn't get inserted!

Writer failures cause the table to become unqueryable

Bug Description
When the writer fails, the table is no longer queryable.

How To Reproduce

-- Trigger writer error
CREATE TABLE t (a numeric(30,30)) USING parquet;
insert into t values (0.01);
ERROR:  Precommit callback failed: Delta(Arrow { source: ParseError("parse decimal overflow") })

-- Now try to query table
SELECT * FROM t;
ERROR:  Failed to get public.t

Proposed Fix

Support UDF overloading

What
Right now, only one UDF can be registered under a given name and used on columnar tables. We want multiple overloaded UDFs to be usable for different use cases in a columnar table query.

Why
Function overloading is a built-in feature of Postgres. We should support it!

How
Infer the signature from the logical plan, pass the signature along with the function name as an argument to the UDF, use non-lossy type conversion (#826) to get the corresponding function OID and call the function.

Bug: Time interval "CAST((NOW() + INTERVAL '-40 day')" leads to error

What
Running this query results in a warning that the query can't be pushed down, and it finally crashes after 20-50 seconds.

SELECT *
FROM parquet_predictions as predictions
WHERE
 "predictions"."date_day" BETWEEN CAST((NOW() + INTERVAL '-40 day') AS date)
  AND CAST((NOW() + INTERVAL '-1 day') AS date)

This is the error I get:

WARNING: This query was not fully pushed down to DuckDB because DuckDB returned an error. Query times may be impacted. If you would like to see this query pushed down, please submit a request to https://github.com/paradedb/paradedb/issues with the following context:
Binder Error: No function matches the given name and argument types '+(TIMESTAMP WITH TIME ZONE, INTERVAL)'. You might need to add explicit type casts.
Candidate functions:
+(TINYINT) -> TINYINT
+(TINYINT, TINYINT) -> TINYINT
+(SMALLINT) -> SMALLINT
+(SMALLINT, SMALLINT) -> SMALLINT
+(INTEGER) -> INTEGER
+(INTEGER, INTEGER) -> INTEGER
+(BIGINT) -> BIGINT
+(BIGINT, BIGINT) -> BIGINT
+(HUGEINT) -> HUGEINT
+(HUGEINT, HUGEINT) -> HUGEINT
+(FLOAT) -> FLOAT
+(FLOAT, FLOAT) -> FLOAT
+(DOUBLE) -> DOUBLE
+(DOUBLE, DOUBLE) -> DOUBLE
+(DECIMAL) -> DECIMAL
+(DECIMAL, DECIMAL) -> DECIMAL
+(UTINYINT) -> UTINYINT
+(UTINYINT, UTINYINT) -> UTINYINT
+(USMALLINT) -> USMALLINT
+(USMALLINT, USMALLINT) -> USMALLINT
+(UINTEGER) -> UINTEGER
+(UINTEGER, UINTEGER) -> UINTEGER
+(UBIGINT) -> UBIGINT
+(UBIGINT, UBIGINT) -> UBIGINT
+(UHUGEINT) -> UHUGEINT
+(UHUGEINT, UHUGEINT) -> UHUGEINT
+(DATE, INTEGER) -> DATE
+(INTEGER, DATE) -> DATE
+(INTERVAL, INTERVAL) -> INTERVAL
+(DATE, INTERVAL) -> TIMESTAMP
+(INTERVAL, DATE) -> TIMESTAMP
+(TIME, INTERVAL) -> TIME
+(INTERVAL, TIME) -> TIME
+(TIMESTAMP, INTERVAL) -> TIMESTAMP
+(INTERVAL, TIMESTAMP) -> TIMESTAMP
+(TIME WITH TIME ZONE, INTERVAL) -> TIME WITH TIME ZONE
+(INTERVAL, TIME WITH TIME ZONE) -> TIME WITH TIME ZONE
+(TIME, DATE) -> TIMESTAMP
+(DATE, TIME) -> TIMESTAMP
+(TIME WITH TIME ZONE, DATE) -> TIMESTAMP WITH TIME ZONE
+(DATE, TIME WITH TIME ZONE) -> TIMESTAMP WITH TIME ZONE
+(ANY[], ANY[]) -> ANY[]

LINE 4: ...*
FROM parquet_predictions as predictions
WHERE
"predictions"."date_day" BETWEEN CAST((NOW() + INTERVAL '-40 day') AS date)

Why
Metabase filters data like this when you select "Last x days", and that can't be changed there. The same query runs fine if I run it directly in DuckDB.
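
A possible workaround until this is pushed down (not applicable to Metabase-generated SQL, but useful for hand-written queries) is to avoid the timestamptz + interval form and use date arithmetic, which both Postgres and DuckDB accept, assuming date_day is a date column:

SELECT *
FROM parquet_predictions AS predictions
WHERE predictions.date_day BETWEEN CURRENT_DATE - 40 AND CURRENT_DATE - 1;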

Non-lossy type conversion

What
Currently, conversion between DataFusion and Postgres types is lossy, and we don't want it to be!

Why
In order to support joins and function overloading, we need exact type conversion. To look up a specific function in Postgres, for example, we need to search by function signature. If the type conversion is lossy, the signature will not match between creation and lookup.

How
Update type translators.

Test GCS/other data stores against public buckets

What
Some object stores like GCS are difficult to emulate locally because there's no LocalStack or Testcontainers support. The best way to test them seems to be connecting to our own buckets.

Why
Testing.

How

JSON not supported

What happens?

When running this query I get an error:

SELECT "varchar_column"::JSON
FROM predictions

Error:
Query 1 ERROR at Line 1: : ERROR: Column varchar_column has Arrow data type Utf8 but is mapped to the BuiltIn(JSONOID) type in Postgres, which are incompatible. If you believe this conversion should be supported, please submit a request at https://github.com/paradedb/paradedb/issues.


To Reproduce

Try to cast to JSON, either in a query or when creating the foreign table.

OS:

Ubuntu(?) with PostgreSQL 16.3

ParadeDB Version:

0.8.3

Full Name:

Janis

Affiliation:

Typewise

What is the latest build you tested with? If possible, we recommend testing by compiling the latest dev branch.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include the code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?

  • Yes, I have

Cache Parquet metadata in `pg_analytics`

What feature are you requesting?

DuckDB has a feature, enable_object_cache, which can cache Parquet metadata in-memory. This is known to improve query performance. We would like to cache this on-disk, so that it can persist across queries.

Why are you requesting this feature?

See above ^

What is your proposed implementation for this feature?

I'm not sure, see #30 for reference
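
For reference, the DuckDB-side knob looks like the following in a plain DuckDB session; persisting the cached metadata on disk so it survives across queries is the open part:

-- DuckDB session: keep Parquet metadata cached in memory for repeated scans
SET enable_object_cache = true;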

Full Name:

Philippe Noël

Affiliation:

ParadeDB

Use `heap_prepare_insert` and `HeapTuple` in `insert` functions

What
Postgres provides APIs to convert slots into a HeapTuple and then pass the HeapTuple into heap_prepare_insert, which populates it with the correct xmin, xmax, and infomask flags. We're currently reimplementing this logic ourselves in the insert tableam functions and should instead let Postgres handle it.

Why
By not using heap_prepare_insert we're opening ourselves up to inconsistencies with Postgres.

How

Query bytea type result error

What happens?

I am developing the quals push-down feature and have found that the query results are incorrect:

pg_lakehouse=# select binary_col from t1   where  binary_col = '\x68656c6c6f';
 binary_col
------------
(0 rows)

pg_lakehouse=# select binary_col from t1   where  binary_col = E'\\x68656c6c6f';
 binary_col
------------
(0 rows)

pg_lakehouse=# select binary_col from t1;
    binary_col
------------------
 \x68656c6c6f
 \x
 \x70617271756574
(3 rows)

pg_lakehouse=# select binary_col from t1   where  binary_col = E'';
 binary_col
------------
 \x
(1 row)

pg_lakehouse=#

To Reproduce

  1. Export the parquet file from the code: test_arrow_types.parquet
  2. Create the foreign table:
CREATE FOREIGN DATA WRAPPER parquet_wrapper
HANDLER parquet_fdw_handler
VALIDATOR parquet_fdw_validator;

CREATE SERVER parquet_server
FOREIGN DATA WRAPPER parquet_wrapper;

CREATE FOREIGN TABLE t1 (
    boolean_col       boolean,
    int8_col          smallint,
    int16_col         smallint,
    int32_col         integer,
    int64_col         bigint,
    uint8_col         smallint,
    uint16_col        integer,
    uint32_col        bigint,
    uint64_col        numeric(20),
    float32_col       real,
    float64_col       double precision,
    date32_col        date,
    date64_col        date,
    binary_col        bytea,
    large_binary_col  bytea,
    utf8_col          text,
    large_utf8_col    text
)
SERVER parquet_server
OPTIONS (files '/home/kyss/labs/paradedb/test_arrow_types.parquet');
  3. Run the queries:
select binary_col from t1   where  binary_col = '\x68656c6c6f';
select binary_col from t1   where  binary_col = E'\\x68656c6c6f';
select binary_col from t1   where  binary_col = E'';
select binary_col from t1;

OS:

x86

ParadeDB Version:

0.8.6.

Are you using ParadeDB Docker, Helm, or the extension(s) standalone?

ParadeDB pg_lakehouse Extension

Full Name:

kyss

Affiliation:

NA

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include the code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?

  • Yes, I have

Support speculative inserts

What
The tableAM speculative insert function is not implemented.

Why
This means that ON CONFLICT clauses for inserts are not supported.
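
For illustration, this is the kind of statement that needs speculative inserts and therefore currently fails on parquet tables (hypothetical table and columns):

INSERT INTO t (id, value) VALUES (1, 'a')
ON CONFLICT (id) DO NOTHING;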

How

feat: Schema evolution

What
It's currently not possible to change the schema of a parquet table. delta-io/delta-rs#2024 might change that...

Why
Please describe why this feature is important.

How
Please describe how you'd implement this feature.

Interrupting `create_bm25` causes all subsequent queries to fail

Bug Description
Running create_bm25 over a Parquet table and then cancelling it before the transaction finishes causes all subsequent queries to error with

ERROR:  there is no reactor running, must be called from the context of a Tokio 1.x runtime

How To Reproduce

CREATE TABLE IF NOT EXISTS temp_json ( j JSONB );
\copy temp_json FROM '/path/to/wiki-articles.json' WITH CSV QUOTE E'\x01' DELIMITER E'\x02';
CREATE TABLE wikipedia_articles ( id SERIAL, url TEXT, title TEXT, body TEXT ) USING parquet;
INSERT INTO wikipedia_articles (url, title, body)
SELECT values->>'url' AS url,
       values->>'title' AS title,
       values->>'body' AS body
FROM   (SELECT j AS values from temp_json) A;
CALL paradedb.create_bm25(
        index_name => 'search_idx',
        schema_name => 'public',
        table_name => 'wikipedia_articles',
        key_field => 'id',
        text_fields => '{title: {}, body: {}}'
);

-- Now, cancel this transaction and run any other query
SELECT * FROM wikipedia_articles LIMIT 1;
-- Error!

Proposed Fix

Add support for time_bucket() in ParadeDB pg_lakehouse

What feature are you requesting?

A time_bucket() function for pg_lakehouse queries.

Why are you requesting this feature?

There are some helpful functions in DuckDB when working with timeseries data (https://duckdb.org/docs/sql/functions/date.html). In particular, time_bucket() comes in quite handy. TimescaleDB also has this and has a nice writeup explaining it (https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/).

What is your proposed implementation for this feature?

I'm not acquainted well enough yet with ParadeDB to propose an implementation, just hopeful that it could be added for queries that are pushed down to DuckDB.
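
For reference, the desired usage would look roughly like this (table and column names are made up; time_bucket follows the DuckDB/TimescaleDB signature):

SELECT time_bucket(INTERVAL '1 day', created_at) AS bucket,
       count(*) AS n
FROM predictions
GROUP BY bucket
ORDER BY bucket;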

Full Name:

Bryan Russett

Affiliation:

Caurus

Rebase `paradedb/wrappers` onto latest and add PG13 support

What feature are you requesting?

They've added support for PG13 in upstream wrappers. We should rebase our fork and add support for PG13 here.

Why are you requesting this feature?

We want to support all PGDG official versions

What is your proposed implementation for this feature?

See above^

Full Name:

Philippe Noël

Affiliation:

ParadeDB

Introduce debug functions to `pg_analytics`

What
Users are getting table not found errors that are tough to debug because we don't have access to their DB. We should introduce a function that prints the DataFusion schema, i.e.

SELECT * FROM paradedb.parquet_schema();
-- Returns a table with all DataFusion tables and their respective schemas

Why
For debugging

How

Support as many Postgres types as we can (e.g. JSON)

What
We're missing timestamp with time zone, JSON, and probably a bunch of others that could be mapped to DataFusion types.

Why
Please describe why this feature is important.

How
Please describe how you'd implement this feature.

Support for views

What feature are you requesting?

Querying a view that is on top of a pg_lakehouse/DuckDB table doesn't work and returns an error.

Why are you requesting this feature?

We need views to join two tables and add some dynamic columns to our static parquet files.

What is your proposed implementation for this feature?

No response

Full Name:

Janis

Affiliation:

Typewise

Sanitize user mapping options

What
We should sanitize the user mapping options just as we sanitize the foreign table options.

Why
Reduce user error.

How

Queries without schema specification fail to fully push down in DuckDB

What happens?

Create two foreign tables in different schemas, one in tpch and one in public. Queries that don't specify a schema will fail to fully push down to DuckDB.

To Reproduce

This issue is unrelated to specific data.

pg_lakehouse=# create schema tpch;
CREATE SCHEMA
pg_lakehouse=# CREATE FOREIGN TABLE tpch.t1 (
    boolean_col       boolean,
    int8_col          smallint,
    int16_col         smallint,
    int32_col         integer,
    int64_col         bigint,
    uint8_col         smallint,
    uint16_col        integer,
    uint32_col        bigint,
    uint64_col        numeric(20),
    float32_col       real,
    float64_col       double precision,
    date32_col        date,
    date64_col        date,
    binary_col        bytea,
    large_binary_col  bytea,
    utf8_col          text,
    large_utf8_col    text
)
SERVER parquet_server
OPTIONS (files '/home/kyss/labs/paradedb/test_arrow_types.parquet');
pg_lakehouse=# CREATE FOREIGN TABLE t2 (
    boolean_col       boolean,
    int8_col          smallint,
    int16_col         smallint,
    int32_col         integer,
    int64_col         bigint,
    uint8_col         smallint,
    uint16_col        integer,
    uint32_col        bigint,
    uint64_col        numeric(20),
    float32_col       real,
    float64_col       double precision,
    date32_col        date,
    date64_col        date,
    binary_col        bytea,
    large_binary_col  bytea,
    utf8_col          text,
    large_utf8_col    text
)
SERVER parquet_server
OPTIONS (files '/home/kyss/labs/paradedb/test_arrow_types.parquet');
CREATE FOREIGN TABLE
pg_lakehouse=# select * from t2 LEFT JOIN tpch.t1 ON TRUE;
WARNING:  This query was not fully pushed down to DuckDB because DuckDB returned an error. Query times may be impacted. If you would like to see this query pushed down, please submit a request to https://github.com/paradedb/paradedb/issues with the following context:
Catalog Error: Table with name t2 does not exist!
Did you mean "public.t2"?
LINE 1: select * from t2 LEFT JOIN tpch.t1 ON TRUE

OS:

linux

ParadeDB Version:

0.8.6

Are you using ParadeDB Docker, Helm, or the extension(s) standalone?

ParadeDB Docker Image

Full Name:

kyss

Affiliation:

NA

Did you include all relevant data sets for reproducing the issue?

N/A - The reproduction does not require a data set

Did you include the code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?

  • Yes, I have

Store command IDs (i.e. `cmin`) with tuples

What
Using the same approach we use to store a tuple's xmin/xmax, we should also store its cmin/cmax.

Why
While xmin/xmax determine if a tuple is visible across transactions, cmin/cmax determine if a tuple is visible within the same transaction. This is probably necessary for subtransaction/savepoint support.

How

FDW execution order

data.csv

What happens?

Using the FDW on joined tables, the same query returns different results depending on the join plan, as shown below.

pg_analytics=# set enable_nestloop = on;
SET
pg_analytics=# select count(*) from t1, t2 left join t3 on true  where t1.value = t2.value ;
 count
-------
     6
(1 row)

pg_analytics=# explain select count(*) from t1, t2 left join t3 on true  where t1.value = t2.value ;
                               QUERY PLAN
------------------------------------------------------------------------
 Aggregate  (cost=69.39..69.40 rows=1 width=8)
   ->  Nested Loop Left Join  (cost=0.00..63.01 rows=2550 width=0)
         ->  Nested Loop  (cost=0.00..2.01 rows=1 width=0)
               Join Filter: (t1.value = t2.value)
               ->  Foreign Scan on t1  (cost=0.00..1.00 rows=1 width=0)
                     DuckDB Scan: SELECT value FROM public.t1
               ->  Foreign Scan on t2  (cost=0.00..1.00 rows=1 width=0)
                     DuckDB Scan: SELECT value FROM public.t2
         ->  Seq Scan on t3  (cost=0.00..35.50 rows=2550 width=0)
(9 rows)
pg_analytics=# set enable_nestloop = off;
SET
pg_analytics=# explain select count(*) from t1, t2 left join t3 on true  where t1.value = t2.value ;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Aggregate  (cost=10000000069.40..10000000069.41 rows=1 width=8)
   ->  Nested Loop Left Join  (cost=10000000001.01..10000000063.03 rows=2550 width=0)
         ->  Hash Join  (cost=1.01..2.03 rows=1 width=0)
               Hash Cond: (t1.value = t2.value)
               ->  Foreign Scan on t1  (cost=0.00..1.00 rows=1 width=0)
                     DuckDB Scan: SELECT value FROM public.t1
               ->  Hash  (cost=1.00..1.00 rows=1 width=0)
                     ->  Foreign Scan on t2  (cost=0.00..1.00 rows=1 width=0)
                           DuckDB Scan: SELECT value FROM public.t2
         ->  Seq Scan on t3  (cost=0.00..35.50 rows=2550 width=0)
(10 rows)

pg_analytics=# select count(*) from t1, t2 left join t3 on true  where t1.value = t2.value ;
 count
-------
  2077
(1 row)

To Reproduce

CREATE FOREIGN TABLE t1()
SERVER csv_server
OPTIONS (files '/home/kyss/labs/paradedb/data.csv');

CREATE FOREIGN TABLE t2()
SERVER csv_server
OPTIONS (files '/home/kyss/labs/paradedb/data.csv');

CREATE TABLE t3(a int);

see above

OS:

x86

ParadeDB Version:

0.9.0

Are you using ParadeDB Docker, Helm, or the extension(s) standalone?

ParadeDB Docker Image

Full Name:

kysshsy

Affiliation:

NA

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include the code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?

  • Yes, I have

WAL support for `parquet` tables

What
parquet tables should write to the write-ahead log (WAL).

Why
This would unlock features like backups/replication.

How
In the table access method.

Show users a better error message for complex updates

What
Currently, UPDATE clauses are not supported. If the user executes a query that starts with UPDATE, they see a clean error message. But if they execute an UPDATE nested inside another query, like a CTE, they'll see a confusing error that just says failed to fetch tuple being updated.
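
For example, a statement like the following (hypothetical table) produces the confusing "failed to fetch tuple being updated" message rather than the clean unsupported-UPDATE error:

WITH moved AS (
    UPDATE t SET value = value + 1
    WHERE id = 1
    RETURNING *
)
SELECT * FROM moved;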

Why
So users don't think this is an unexpected bug.

How
The reason they're seeing this is because the table access method is failing behind the scenes on UPDATE clauses. The error comes before the update tableAM method is hit, and after the scan table method runs. I suspect it has to do with the fact that Postgres is looking for the tuple in block storage with a ctid, but we don't store tuples there.

I also noticed that this error message doesn't show up on Postgres 12.

Support timestamp with time zone

What
Support the timestamptz type

Why
timestamptz is a built-in Postgres type

How
I have a branch open for this. I'm blocked because DataFusion doesn't support the AT TIME ZONE clause, which is used to query timestamptz columns. I've opened an issue for it in DataFusion.
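
For context, this is the kind of query (hypothetical table and column) that relies on AT TIME ZONE and is therefore blocked:

SELECT ts AT TIME ZONE 'America/New_York' AS local_ts
FROM events
WHERE ts >= '2024-01-01 00:00:00+00';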

Support multi-level partition tables

What feature are you requesting?

Support for foreign tables in a multi-level partitioned table setup (see example).

Why are you requesting this feature?

When you have a lot of large files in one object store key/directory, scanning all of the files can be prohibitively time consuming. Changing the storage structure to support hive keys may not always be an option.

What is your proposed implementation for this feature?

This is an example of a multi-level partition table, but in many cases only the first partition level would be needed.

Example:

A “root” table that will have two partition “levels”:

CREATE TABLE my_table(
  <column definitions>
) PARTITION BY list(id_1);

The table for the first partition level:

create table if not exists my_table_123
partition of my_table
for values in ('123') partition by list(id_2);

Many partition tables may be created at this level - one for each id_1 value in the root table.

Note: If these first two tables have to be created as foreign tables in order for the last partition level table to be a foreign table, they shouldn’t require a files option.

The table for the next partition level looks like this:

create foreign table if not exists my_table_123_abc
partition of my_table_123
for values in ('abc')
options (files 's3://<bucket_name>/.../<filename>.parquet');

Many partition tables may be created at this level as well - one for each id_2 value in the first partition level tables.

Note: when creating this final “leaf” partition table as a foreign table, a files option should be required.

So, when querying the root table with a where clause like this:

select *
from my_table
where id_1 = '123'
and   id_2 = 'abc'

...only one parquet file should be scanned - the one specified by the files option of the “leaf” partition table. Or, if a glob pattern is used in the files option of the leaf partition table, only the (ideally few) matching files will be scanned.

Full Name:

Patrick D. Park

Affiliation:

Payzer

Show users a better error message for unsupported timestamp ranges

What
DataFusion only supports queries with timestamps from 1677 to 2262 (https://github.com/apache/arrow-datafusion/blob/b7f477237cc91bbdf90e655db181ca4f0a64fc25/datafusion/functions/src/datetime/common.rs#L34). When users query with a timestamp outside of this range, e.g. SELECT * FROM t WHERE a < '1650-01-01 01:01:01'::timestamp, they get a confusing error message about nanosecond overflow. We should intercept this error and show them a more helpful one.

Why

How

Support `UPDATE`

What
We've implemented support for DELETE but UPDATE is still a todo.

Why
Enable users to update parquet tables.

How
It can be done the same way that DELETE is currently implemented.

Push down `quals` in foreign table scan

What
The begin_scan function of the foreign data wrapper ignores quals, which means that predicates are not pushed down to DuckDB in the case that the query falls back to the FDW scan.

Why
Pushing down quals will make queries that fall back to the FDW faster. Could be useful for federated joins.

How
Parse quals and add them to the DuckDB query. I believe the clickhouse_fdw has a good implementation for this.
