The lakeapi from bmsuisse

Architecture of LakeAPI: image is transparent

Doesn't look good in dark mode.

Installing dependencies takes ages

@aersam I fixed the blocking bug but somehow messed up the dependencies. They now take ages to install. Everything works, though.

Please have a look if you have some time.

Too many items in a combi parameter lead to a recursion error

Related to Pypika Bug

It is an old issue and won't be solved. A different approach is probably needed.

Implicit add parameters for partition columns

Filters on partitions are always fast, so we should enable them by default. You could still hide them by giving them an empty operators array:

params:
      - name: partition_col
        operators: []
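
A minimal sketch of how the implicit parameters could be merged with the configured ones. The function name, the dictionary layout, and the default operator list are assumptions for illustration, not the project's actual API.

```python
from typing import Any

# Hypothetical default operator list; the real project may use different names.
DEFAULT_OPERATORS = ["=", "in"]


def effective_params(
    configured: list[dict[str, Any]], partition_columns: list[str]
) -> list[dict[str, Any]]:
    """Merge configured params with implicit ones for partition columns."""
    by_name = {p["name"]: p for p in configured}
    for col in partition_columns:
        # Only add a partition column that the config does not mention at all.
        by_name.setdefault(col, {"name": col, "operators": list(DEFAULT_OPERATORS)})
    # An explicit empty operators list hides the parameter.
    return [p for p in by_name.values() if p.get("operators", DEFAULT_OPERATORS)]


configured = [{"name": "partition_col", "operators": []}, {"name": "city"}]
print(effective_params(configured, ["partition_col", "year"]))
```

Here partition_col stays hidden because its operators array is empty, while the unmentioned partition column year is added implicitly.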

Remove Datafusion from LakeAPI Architecture Image

Combine metadata tags under one tag "Metadata" to clean up OpenAPI /doc

Update to Pydantic V2 and FastAPI 0.100.0

Pydantic V2 is out:
https://github.com/pydantic/pydantic/releases/tag/v2.0

FastAPI should follow soon:
https://github.com/tiangolo/fastapi/releases/tag/0.99.0

Input should be a valid string

This happens when a parameter is optional and not used as a filter.

2023-07-18T06:49:45.330570936Z Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
2023-07-18T06:49:45.330576836Z For further information visit https://errors.pydantic.dev/2.1.2/v/string_type

Option to always use duckdb storage backend

Use the storage backend not only for search but always, if the option is enabled. Maybe also use DuckDB's primary key option for indexing, which could further help performance.

primary key

It should also not be a problem if the DuckDB storage is not stable, as we would use it on the fly.

Update Readme to use Datasource instead of Dataframe

Add additional context for Databases (ODBC)

Reading Delta files directly has its limitations. I tried it for CIP and the response time is not good enough for a web application.

I think we should store the data in a traditional database with indexes. However, we could still use lakeapi to serve the data by extending the context class.

Prefix trick

Do we have the option to define a prefix partition without converting it to MD5? For example, if the key is already an MD5 hash, it doesn't really make sense to hash it again with MD5.

It can block

I was able to block the API. We might have to revert and not use COPY INTO after all.

Perf Issue. Something seems to block after multiple calls of the same endpoint with the same parameter

Bump duckdb to 0.8.1

A lot of bugfixes in this release
https://github.com/duckdb/duckdb/releases/tag/v0.8.1

Enable security layer also on the metadata endpoints

Enable security layer on all endpoints

Migrate away from pypika

use either sqlglot or ibis

Enforce schema in Post request

At the moment you can pass arbitrary parameters, and we just return the unfiltered dataset if a parameter does not exist in the schema. We should return an error here.
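
A minimal sketch of such a check, using only the standard library. The function names and the idea of passing the allowed names as a set are assumptions for illustration; in a FastAPI endpoint the error would more likely be turned into an HTTP 4xx response.

```python
def unknown_parameters(body: dict, allowed: set[str]) -> list[str]:
    """Return the request keys that are not part of the endpoint schema."""
    return sorted(key for key in body if key not in allowed)


def validate_body(body: dict, allowed: set[str]) -> None:
    """Raise instead of silently ignoring unknown filter parameters."""
    unknown = unknown_parameters(body, allowed)
    if unknown:
        raise ValueError(f"Unknown parameter(s): {', '.join(unknown)}")


validate_body({"article_ean": "4041551502007"}, {"article_ean", "branch_code"})
```

If the request models are Pydantic V2 models, setting extra="forbid" in the model config should give a similar effect without a hand-written check.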

Correctly implement arrow-stream

Get rid of arrow-stream as a response type.

Multi parameter fails with more than 1000 parameters

The current approach is not optimal, as more than 1000 parameters result in a recursion overflow.
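
One possible alternative, sketched under assumptions: instead of nesting one criterion per combination, build a flat row-value filter with bound parameters. The function name is hypothetical, and SQLite is used only to keep the example self-contained; whether the same SQL shape suits the DuckDB queries in lakeapi would need to be checked.

```python
import sqlite3
from typing import Any, Sequence


def build_combi_filter(
    columns: Sequence[str], rows: Sequence[dict[str, Any]]
) -> tuple[str, list[Any]]:
    """Build a flat `(a, b) IN (VALUES (?, ?), ...)` filter with bound parameters."""
    if not rows:
        return "1 = 0", []  # an empty combination list matches nothing
    # Column names must come from the validated schema, never from user input.
    col_list = ", ".join(f'"{c}"' for c in columns)
    placeholder = "(" + ", ".join("?" for _ in columns) + ")"
    # One placeholder group per row, joined in a loop rather than nested.
    values = ", ".join(placeholder for _ in rows)
    params = [row[c] for row in rows for c in columns]
    return f"({col_list}) IN (VALUES {values})", params


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a TEXT, b TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", [("1", "x"), ("2", "y"), ("3", "z")])
where, params = build_combi_filter(
    ["a", "b"], [{"a": "1", "b": "x"}, {"a": "3", "b": "z"}]
)
rows_found = con.execute(
    f"SELECT a, b FROM t WHERE {where} ORDER BY a", params
).fetchall()
print(rows_found)
```

Because the SQL string is produced by joining placeholder groups, building it does not recurse, no matter how many combinations are passed. The database engine may still have its own limit on the number of bound parameters.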

Combi fields do not take data type into account

This works:
{
  "pk": [
    {
      "article_ean": "4041551502007",
      "branch_code": "0",
      "supplieramount": "0",
      "priceperunit": "0",
      "article_suppliernumber": "8021",
      "supplier_number": "976261"
    }
  ]
}

This does not work:
{
  "pk": [
    {
      "article_ean": "4041551502007",
      "branch_code": "0",
      "supplieramount": 0,
      "priceperunit": 0,
      "article_suppliernumber": "8021",
      "supplier_number": "976261"
    }
  ]
}

But the output and schema are correct.
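
One way to make both requests behave the same would be to cast every combi value to the type its column declares before building the filter. The sketch below assumes a hypothetical column-to-type mapping; the actual column types of this dataset are not stated in the issue, so int is used purely for illustration.

```python
from typing import Any, Callable

# Hypothetical mapping; a real endpoint would derive it from the dataset schema.
COLUMN_TYPES: dict[str, Callable[[Any], Any]] = {
    "article_ean": str,
    "branch_code": str,
    "supplieramount": int,  # assumed type, not confirmed by the issue
    "priceperunit": int,  # assumed type, not confirmed by the issue
    "article_suppliernumber": str,
    "supplier_number": str,
}


def coerce_row(
    row: dict[str, Any], types: dict[str, Callable[[Any], Any]]
) -> dict[str, Any]:
    """Cast each value to its column's declared type so "0" and 0 compare equal."""
    return {name: types[name](value) for name, value in row.items()}


as_text = coerce_row({"supplieramount": "0", "priceperunit": "0"}, COLUMN_TYPES)
as_number = coerce_row({"supplieramount": 0, "priceperunit": 0}, COLUMN_TYPES)
print(as_text == as_number)
```

A column missing from the mapping raises a KeyError here, which fits with rejecting parameters that are not in the schema.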

Get rid of polars extensions

At the moment used because of this bug in polars:

polars issue 7627

Considering in memory

For optimal performance, it would still sometimes make sense to keep small data in memory.

We can discuss that. If we implement it right, we should not have issues.

Bug with latest Polars Version

TypeError: 'bytes' object is not callable

When sorting, Null always comes first (for both desc and asc ordering)

Add support for ODBC Source

Like ROAPI is doing it:

https://roapi.github.io/docs/config/databases.html

Better code structure

SQL Execution is split between dataframe and endpoint, sometimes diff is a bit weird
Do we really need groupby/joins/sortby? If yes: test and document

Initializing DuckDB can lead to Error 500

Probably this happens when writing to the duckdb data store for the first time.

Version 0.10.0 after 0.9.0

We will go for 0.10.0 after 0.9.0 and not 1.0.0.

Documentation with Sphinx

Drop support for Avro

We neither use nor test it. Only Polars provides support for it; DuckDB does not.

Cache response

As we store the response as files, we could use it for a certain period and return it directly for the same request.

Important to respect response header:
https://github.com/long2ice/fastapi-cache/blob/8bfe814c3662343d2ecc39fcb6e31a2575ebfe9d/fastapi_cache/decorator.py#L198

In general, fastapi-cache is a good reference.
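
A minimal sketch of the idea, assuming an in-process dictionary and a fixed time-to-live. The class and method names are hypothetical, and the sketch does not look at any request or response headers, which the note above calls out as important.

```python
import hashlib
import json
import time
from typing import Any, Optional


class ResponseCache:
    """Map a request fingerprint to a stored response path for a limited time."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._entries: dict[str, tuple[float, str]] = {}

    @staticmethod
    def key(endpoint: str, params: dict[str, Any]) -> str:
        # sort_keys makes the fingerprint independent of parameter order.
        payload = json.dumps(
            {"endpoint": endpoint, "params": params}, sort_keys=True, default=str
        )
        return hashlib.sha256(payload.encode("utf8")).hexdigest()

    def get(self, key: str, now: Optional[float] = None) -> Optional[str]:
        now = time.monotonic() if now is None else now
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, path = entry
        if now - stored_at > self.ttl_seconds:
            del self._entries[key]  # expired, drop it
            return None
        return path

    def put(self, key: str, path: str, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._entries[key] = (now, path)
```

An endpoint would call get before running the query and put after writing the response file. The cached file itself would also have to be kept on disk for at least the same period.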

Change versioning path

By default, API Management wants to add the versioning at the end. Therefore, we should change the pattern from api/v1 to v1/api.

But this would mean a breaking change, and we have to be careful and also change all clients relying on the existing endpoints.

Add "sort by" to example

Md5 hash with Integer is not working

3-06-01T13:34:33.402792616Z ^^^^^^^^^^^^^^^^^^^^^^^^
2023-06-01T13:34:33.402797816Z File "/tmp/8db629d97f7a394/antenv/lib/python3.11/site-packages/bmsdna/lakeapi/core/dataframe.py", line 158, in get_partition_filter
2023-06-01T13:34:33.402801816Z hashvl = hashlib.md5(value.encode("utf8")).hexdigest()
2023-06-01T13:34:33.402810116Z ^^^^^^^^^^^^
2023-06-01T13:34:33.402814016Z AttributeError: 'int' object has no attribute 'encode'

Value can be integer.

hashvl = hashlib.md5(str(value).encode("utf8")).hexdigest()

Should solve the problem.
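
The proposed fix as a small standalone function. The function name is made up for this sketch; only the hashing expression comes from the issue.

```python
import hashlib
from typing import Any


def md5_partition_value(value: Any) -> str:
    """Hash a partition value after converting it to text, so ints work too."""
    return hashlib.md5(str(value).encode("utf8")).hexdigest()


# An integer and its string form now produce the same hash.
print(md5_partition_value(42) == md5_partition_value("42"))
```

This only helps if the partition values were hashed from the same text representation when the data was written.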

Test different input file format

We should also test CSV, Excel, and JSON inputs. Otherwise, we should not claim in the docs to be able to load different file formats :-).

Sort by direction desc doesn't work

bump duckdb to version 0.8.0

Provide basic example on how to set it up on an Azure Web Service

Might be helpful.

A dedicated Repo should be used for that.

Use of Azure Authentication

Do we need it to integrate the API with Streamlit or other Azure hosted Apps?

FileNotFoundError: [Errno 2] No such file or directory: 'config.yml'

config.yml must be present even if you want to switch to another config file later.

Drop create delta table can result in internal error

DuckDB probably holds information in memory. Dropping and re-creating a Delta table can therefore cause an internal error (the file is not there anymore). In this case we should catch the error and do a manual refresh.
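
A minimal sketch of the catch-and-refresh idea. The helper name and both callables are hypothetical, and which exception DuckDB actually raises in this situation is an assumption, so the error types are left as a parameter.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_refresh(
    query: Callable[[], T],
    refresh: Callable[[], None],
    errors: tuple[type[BaseException], ...] = (OSError,),
) -> T:
    """Run `query`; on a stale-file error, refresh once and retry."""
    try:
        return query()
    except errors:
        refresh()  # e.g. re-register the Delta table with the engine
        return query()  # a second failure is raised to the caller
```

Retrying only once keeps a table that is really missing from causing an endless loop.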

Switch to Robyn https://robyn.tech/

Long term goal as soon as Robyn catches up with FastAPI

Ditch support for Polars

The latest release has a "bug" and there is also a need for the Polars extension because the Polars delta reader also has a bug.

Does it make sense to keep support for Polars? It seems to add unnecessary complexity for no real benefit (speed is on par with DuckDB).

Of course, we now use Polars to serialise data into the various formats. We may have to look for solutions there.

Arrow format and get rid of IPC
Test DuckDB storage backend
