Comments (7)
Hi @samdaviestvg, thanks for opening an issue.
Regarding the decimal type: yes, you're right. This can currently only be expressed by stating type: decimal, but neither precision nor scale can be set. I've created an issue to extend the Data Contract Specification here: datacontract/datacontract-specification#41. Feel free to also post your thoughts there on how to specify it.
Could you provide a small example parquet file and the accompanying datacontract.yaml regarding the time zone issue? This would help me reproduce the error exactly, as I have not been able to do so on my machine. In particular, which type did you encode in the parquet file for the timestamp_tz column?
from datacontract-cli.
Thank you. The timezone is encoded as required int64 field_id=4 createddate (Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false))
Uploading YAML and parquet files is not supported on GitHub, so please see the raw text version below. If it is of no use, please let me know.
dataContractSpecification: 0.9.3
id: iceberg-ingestion
info:
  title: ingestion to s3/iceberg
  version: 0.0.1
  description: The ingestion of parquet files from s3 into iceberg table format
### servers
servers:
  dev:
    type: s3
    location: 's3/path/here'
    format: parquet
### terms
#terms:
#  usage:
#  limitations:
#  billing:
#  noticePeriod:
### models
models:
  complaintcost_c:
    description: complaintcost_c
    type: table
    fields:
      id:
        type: varchar
        required: true
        primary: true
        description: ID
      isdeleted:
        type: boolean
        description: ISDELETED
        required: true
      name:
        type: varchar
        description: NAME_C
        required: true
      createddate:
        type: timestamp_tz
        description: CREATEDDATE
        required: true
Parquet as JSON:
{"id":"a2a3M000000iMFCQA2","isdeleted":false,"name":"CC-00002121","createddate":1686557172000}
from datacontract-cli.
Can you send me the parquet file via email, please? [email protected]
from datacontract-cli.
Thank you. I could reproduce the issue.
The tool basically loads the parquet file via DuckDB and uses soda-core to execute quality checks. The import using DuckDB leads to the following SQL schema:
{'column_name': {0: 'id', 1: 'isdeleted', 2: 'name', 3: 'createddate'}, 'column_type': {0: 'VARCHAR', 1: 'BOOLEAN', 2: 'VARCHAR', 3: 'TIMESTAMP WITH TIME ZONE'}, 'null': {0: 'YES', 1: 'YES', 2: 'YES', 3: 'YES'}, 'key': {0: None, 1: None, 2: None, 3: None}, 'default': {0: None, 1: None, 2: None, 3: None}, 'extra': {0: None, 1: None, 2: None, 3: None}}
This means that we need to convert the "timestamp_tz" to "TIMESTAMP WITH TIME ZONE" as the type that is used for DuckDB. This is a bug, and we should fix this. Basically, we need to add a type mapping based on the SQL dialect used (in this case the SQL dialect of DuckDB + parquet).
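To make the idea of a dialect-based type mapping concrete, here is a minimal sketch in Python. It is not the code that went into datacontract-cli: the names `DUCKDB_TYPES`, `TYPE_MAPPINGS`, and `convert_type` are hypothetical, and only the three contract types used in the contract above are covered.

```python
# Hypothetical mapping from Data Contract field types to DuckDB column types.
# Only the types that appear in the contract from this thread are listed;
# none of these names are taken from datacontract-cli itself.
DUCKDB_TYPES = {
    "varchar": "VARCHAR",
    "boolean": "BOOLEAN",
    "timestamp_tz": "TIMESTAMP WITH TIME ZONE",
}

# One table per SQL dialect, so further dialects can be added later.
TYPE_MAPPINGS = {"duckdb": DUCKDB_TYPES}


def convert_type(contract_type: str, dialect: str = "duckdb") -> str:
    """Return the SQL type name for a contract type in the given dialect."""
    try:
        return TYPE_MAPPINGS[dialect][contract_type.lower()]
    except KeyError:
        raise ValueError(
            f"no mapping for {contract_type!r} in dialect {dialect!r}"
        ) from None


print(convert_type("timestamp_tz"))
```

Raising on an unknown type, rather than passing the name through unchanged, is a design choice in this sketch: it surfaces missing mappings early instead of producing a confusing schema mismatch later.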
from datacontract-cli.
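Here's the Python script I used to detect the DuckDB SQL schema of this parquet file:
import duckdb

# Open an in-memory DuckDB connection
con = duckdb.connect()
# Make sure the parquet extension is installed and loaded
con.execute("INSTALL parquet;")
con.execute("LOAD parquet;")
# Expose the parquet file as a view so that its schema can be inspected
con.execute("CREATE VIEW test AS (SELECT * FROM 'testfile.parquet')")
# Print the inferred column names and types (.df() requires pandas)
print(con.execute("DESCRIBE test").df().to_dict())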
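The dictionary printed by this script can be flattened into a plain column-to-type lookup and compared with the types a contract expects. The sketch below does this for the output shown earlier in the thread; `describe_to_types` and `find_mismatches` are hypothetical helpers, and the assumption that the check compares type names literally is mine, not something confirmed by the tool's source.

```python
def describe_to_types(described: dict) -> dict:
    """Flatten the DataFrame.to_dict() form of DESCRIBE into {column: type}."""
    names = described["column_name"]
    types = described["column_type"]
    return {names[i]: types[i] for i in names}


def find_mismatches(actual: dict, expected: dict) -> dict:
    """Return {column: (expected_type, actual_type)} for every differing column."""
    return {
        name: (expected_type, actual.get(name))
        for name, expected_type in expected.items()
        if actual.get(name) != expected_type
    }


# Relevant part of the DESCRIBE output quoted in this thread.
described = {
    "column_name": {0: "id", 1: "isdeleted", 2: "name", 3: "createddate"},
    "column_type": {
        0: "VARCHAR",
        1: "BOOLEAN",
        2: "VARCHAR",
        3: "TIMESTAMP WITH TIME ZONE",
    },
}
actual = describe_to_types(described)

# Comparing against the raw contract type name reports a mismatch ...
print(find_mismatches(actual, {"createddate": "timestamp_tz"}))
# ... while comparing against the DuckDB type name does not.
print(find_mismatches(actual, {"createddate": "TIMESTAMP WITH TIME ZONE"}))
```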
from datacontract-cli.
I had to add a similar mapping for CSV files:
https://github.com/datacontract/datacontract-cli/blob/main/datacontract/export/csv_type_converter.py
from datacontract-cli.
Added a fix. Targeted for release v0.10.4
from datacontract-cli.