bmsuisse / lakeapi
API for distributing Data Lake Data
License: MIT License
Doesn't look good in dark mode.
@aersam I fixed the blocking bug but somehow messed up the dependencies. They now take ages to install. Everything works though.
Please have a look if you have some time.
Related to Pypika Bug
It is an old issue and won't be solved. There is probably another approach needed.
Filters on partitions are always fast, therefore we should enable them by default. You could still hide them by giving them an empty operators array:
params:
  - name: partition_col
    operators: []
Pydantic V2 is out:
https://github.com/pydantic/pydantic/releases/tag/v2.0
FastAPI should follow soon:
https://github.com/tiangolo/fastapi/releases/tag/0.99.0
Happens when a parameter is optional and not used as a filter:
2023-07-18T06:49:45.330570936Z Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
2023-07-18T06:49:45.330576836Z For further information visit https://errors.pydantic.dev/2.1.2/v/string_type
Use storage backend not only in search but always if option is enabled. Maybe also use duckdb's primary key option for indexing. Could help further with performance.
Should also not be a problem if duckdb storage is not stable as we would use it on the fly.
Reading delta files directly has its limitations. I tried it for CIP and the response time is not good enough for a web application.
I think we should store the data in a traditional database with indexes. However, we could still use lakeapi to serve the data by extending the context class.
Do we have the option to define a prefix partition without converting it to MD5? For example, if the key is already an MD5 hash, it doesn't really make sense to hash it again with MD5.
I was able to block the API. We might have to revert and not use copy into after all.
A lot of bugfixes in this release
https://github.com/duckdb/duckdb/releases/tag/v0.8.1
Enable security layer on all endpoints
use either sqlglot or ibis
At the moment you can pass an arbitrary parameter, and we just return the unfiltered dataset if the parameter does not exist in the schema. We should return an error here.
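A minimal sketch of the stricter behaviour, assuming the allowed parameter names can be derived from the schema. The names `validate_params` and `UnknownParameterError` are hypothetical and not part of lakeapi:

```python
class UnknownParameterError(ValueError):
    """Raised when a request contains parameters that are not in the schema."""


def validate_params(requested: dict, schema_columns: set) -> dict:
    """Return the parameters unchanged, or raise if any name is not in the schema."""
    unknown = sorted(set(requested) - set(schema_columns))
    if unknown:
        # Fail loudly instead of silently returning the unfiltered dataset.
        raise UnknownParameterError("Unknown parameter(s): " + ", ".join(unknown))
    return requested
```

In an endpoint, the exception could be mapped to a 4xx response so that a typo in a parameter name is visible to the client.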
Get rid of arrow-stream as a response type.
The current approach is not optimal, as more than 1000 parameters result in a recursion overflow.
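One possible way around this, assuming the overflow comes from chaining the conditions one after another into a deeply nested expression (the issue does not say so explicitly): combine them pairwise so the nesting depth grows with log2(n) instead of n. The tuple representation and the helper `combine_balanced` are hypothetical stand-ins for whatever the query builder uses:

```python
def combine_balanced(conditions, op="OR"):
    """Combine conditions pairwise into a balanced tree of (op, left, right) tuples."""
    if not conditions:
        raise ValueError("at least one condition is required")
    layer = list(conditions)
    while len(layer) > 1:
        next_layer = []
        for i in range(0, len(layer), 2):
            if i + 1 < len(layer):
                next_layer.append((op, layer[i], layer[i + 1]))
            else:
                # Odd element out is carried over to the next round unchanged.
                next_layer.append(layer[i])
        layer = next_layer
    return layer[0]


def depth(node) -> int:
    """Nesting depth of a tree built by combine_balanced (leaves have depth 0)."""
    if isinstance(node, tuple):
        return 1 + max(depth(node[1]), depth(node[2]))
    return 0
```

With 5000 conditions the resulting tree stays far below Python's default recursion limit, whereas a left-to-right chain would nest 4999 levels deep.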
This works:
{
  "pk": [
    {
      "article_ean": "4041551502007",
      "branch_code": "0",
      "supplieramount": "0",
      "priceperunit": "0",
      "article_suppliernumber": "8021",
      "supplier_number": "976261"
    }
  ]
}
This does not work:
{
  "pk": [
    {
      "article_ean": "4041551502007",
      "branch_code": "0",
      "supplieramount": 0,
      "priceperunit": 0,
      "article_suppliernumber": "8021",
      "supplier_number": "976261"
    }
  ]
}
But the output and schema are correct.
At the moment used because of this bug in polars:
For optimal performance, it would still sometimes make sense to keep small data in memory.
We can discuss that. If we implement it right, we should not have issues.
TypeError: 'bytes' object is not callable
Like ROAPI is doing it:
Probably this happens when writing to the duckdb data store for the first time.
We will go for 0.10.0 after 0.9.0 and not 1.0.0.
We neither use it nor test it. Only Polars provides support for it; DuckDB does not.
As we store the response as files, we could use it for a certain period and return it directly for the same request.
Important to respect response header:
https://github.com/long2ice/fastapi-cache/blob/8bfe814c3662343d2ecc39fcb6e31a2575ebfe9d/fastapi_cache/decorator.py#L198
In general, fastapi-cache is a good reference.
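A rough sketch of how stored response files could be reused for a limited period. It assumes the header to respect is `Cache-Control`; the helper names, the key scheme and the freshness check based on file modification time are all hypothetical and not how lakeapi or fastapi-cache implement it:

```python
import hashlib
import time
from pathlib import Path
from typing import Optional


def cache_key(path: str, query: dict) -> str:
    """Build a stable key so the same request maps to the same stored file."""
    canonical = path + "?" + "&".join(f"{k}={v}" for k, v in sorted(query.items()))
    return hashlib.sha256(canonical.encode("utf8")).hexdigest()


def is_fresh(file: Path, max_age_seconds: float, now: Optional[float] = None) -> bool:
    """A stored response is reusable while it is younger than max_age_seconds."""
    if not file.exists():
        return False
    now = time.time() if now is None else now
    return now - file.stat().st_mtime <= max_age_seconds


def should_bypass_cache(cache_control: Optional[str]) -> bool:
    """Skip the cache when the header value asks for no-cache or no-store."""
    if not cache_control:
        return False
    directives = {d.strip().lower() for d in cache_control.split(",")}
    return "no-cache" in directives or "no-store" in directives
```

Sorting the query parameters before hashing means that two requests differing only in parameter order share one cached file.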
Per default, API Management wants to add the versioning at the end. Therefore, we should change the pattern from api/v1 to v1/api.
But this would mean a breaking change, and we have to be careful and also change all clients relying on the existing endpoints.
3-06-01T13:34:33.402792616Z ^^^^^^^^^^^^^^^^^^^^^^^^
2023-06-01T13:34:33.402797816Z File "/tmp/8db629d97f7a394/antenv/lib/python3.11/site-packages/bmsdna/lakeapi/core/dataframe.py", line 158, in get_partition_filter
2023-06-01T13:34:33.402801816Z hashvl = hashlib.md5(value.encode("utf8")).hexdigest()
2023-06-01T13:34:33.402810116Z ^^^^^^^^^^^^
2023-06-01T13:34:33.402814016Z AttributeError: 'int' object has no attribute 'encode'
Value can be integer.
hashvl = hashlib.md5(str(value).encode("utf8")).hexdigest()
Should solve the problem.
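The proposed fix as a small standalone function, for illustration only. The name `partition_hash` is hypothetical; the actual code lives in `get_partition_filter`:

```python
import hashlib


def partition_hash(value: object) -> str:
    """MD5 hex digest of the value's string form, so integers hash like their text."""
    # str() avoids AttributeError: 'int' object has no attribute 'encode'.
    return hashlib.md5(str(value).encode("utf8")).hexdigest()
```

With this change the integer 976261 and the string "976261" map to the same partition hash, which only holds if the partition values were written using the same string representation.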
We should also test CSV, Excel and JSON inputs. Otherwise, we should not claim in the docs to be able to load different file formats :-).
Might be helpful.
A dedicated Repo should be used for that.
Do we need it to integrate the API with Streamlit or other Azure hosted Apps?
config.yml must be present even if you want to switch to another config file later.
DuckDB probably holds information in memory. Dropping and re-creating a delta table can therefore cause an internal error (file not there anymore). In this case we should catch the error and do a manual refresh.
Long term goal as soon as Robyn catches up with FastAPI
The latest release has a "bug" and there is also a need for the Polars extension because the Polars delta reader also has a bug.
Does it make sense to keep support for Polars? It seems to add unnecessary complexity for no real benefit (speed is on par with DuckDB).
Of course, we now use Polars to serialise data into the various formats. We may have to look for solutions there.
Datafusion does not support JSON reading, nor is there a way to register a pyarrow.Table instance. Therefore, we cannot enable JSON tests for Datafusion.
Also, the docs are ... not as good as those of Polars and DuckDB.
@aersam the newest Pydantic version is causing problems with Pyright. Can you take a look?
What do we need to have for version 1.0.0?
I think it is a killer feature to get sub-second results on a large dataset.