Parse SQL into JSON so we can translate it for other datastores!
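To illustrate the SQL-to-JSON idea in the tagline, here is a toy sketch. This is not databathing's actual parser (which handles far more of SQL); it is a minimal, hypothetical example of turning one flat `SELECT ... FROM ... WHERE ...` query into a JSON-style dict:

```python
import re

def sql_to_json(sql: str) -> dict:
    """Toy SQL -> JSON converter (illustration only, NOT databathing's parser).

    Handles only a single flat SELECT ... FROM ... [WHERE ...] query.
    """
    m = re.match(
        r"\s*SELECT\s+(?P<select>.+?)\s+FROM\s+(?P<table>\w+)"
        r"(?:\s+WHERE\s+(?P<where>.+?))?\s*$",
        sql,
        re.IGNORECASE,
    )
    if not m:
        raise ValueError("unsupported query")
    tree = {
        "select": [col.strip() for col in m.group("select").split(",")],
        "from": m.group("table"),
    }
    if m.group("where"):
        tree["where"] = m.group("where").strip()
    return tree

print(sql_to_json("SELECT a, b FROM Test WHERE info = 1"))
# {'select': ['a', 'b'], 'from': 'Test', 'where': 'info = 1'}
```

Once the query is a plain dict like this, emitting equivalent code for another datastore (or for PySpark, as databathing does) becomes a matter of walking the tree.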
After converting from SQL to Spark, data engineers have to write Spark code for their ETL pipelines instead of using YAML (SQL). This can improve the performance of an ETL job, but it also makes ETL development take longer than before.
This raises a question: can we have a solution that offers both good compute performance (Spark) and quick development (YAML - SQL)?
Yes, we can!
We plan to combine the benefits of Spark and YAML (SQL) to create a platform or library for developing ETL pipelines.
May 2022 - There are over 900 tests. This parser is good enough for basic usage, including:
- SELECT feature
- FROM feature
- INNER JOIN and LEFT JOIN feature
- ON feature
- WHERE feature
- GROUP BY feature
- HAVING feature
- ORDER BY feature
- AGG feature
- WINDOWS FUNCTION feature (SUM, AVG, MAX, MIN, MEAN, COUNT)
- ALIAS NAME feature
- WITH STATEMENT feature
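As a quick, hypothetical illustration (this helper is not part of databathing), here is a query that exercises several of the clauses listed above, together with a naive substring check of which supported clauses it uses:

```python
# Clauses from the feature list above (AGG / window functions omitted,
# since they are expressions rather than standalone clause keywords).
FEATURES = ["SELECT", "FROM", "INNER JOIN", "LEFT JOIN", "ON", "WHERE",
            "GROUP BY", "HAVING", "ORDER BY", "WITH"]

def used_features(sql: str) -> list:
    """Naive substring check (e.g. 'MONTH' would falsely match 'ON');
    good enough for a quick illustration, not for real parsing."""
    upper = sql.upper()
    return [f for f in FEATURES if f in upper]

query = """
SELECT dept, AVG(salary) AS avg_salary
FROM employees
WHERE active = 1
GROUP BY dept
HAVING AVG(salary) > 50000
ORDER BY avg_salary
"""
print(used_features(query))
# ['SELECT', 'FROM', 'WHERE', 'GROUP BY', 'HAVING', 'ORDER BY']
```

A query of this shape is exactly the kind of input the parser is meant to handle: aggregation, filtering before and after the GROUP BY, and an alias used in ORDER BY.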
pip install databathing
You may also generate PySpark code from a given SQL query. This is done by the Pipeline, which is in a Version 1 state (May 2022).
>>> from databathing import Pipeline
>>> pipeline = Pipeline("SELECT * FROM Test WHERE info = 1")
>>> pipeline.parse()
'final_df = Test\\\n.filter("info = 1")\\\n.selectExpr("a","b","c")\n\n'
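The quoted line above is the repr of the returned string; printed, it is a readable PySpark method chain. The snippet below pastes that literal (copied from the example output) so the layout can be inspected without installing databathing:

```python
# The string returned in the example above, pasted here as a literal.
generated = 'final_df = Test\\\n.filter("info = 1")\\\n.selectExpr("a","b","c")\n\n'

# Printing reveals the backslash line continuations of the generated code:
print(generated)
# final_df = Test\
# .filter("info = 1")\
# .selectExpr("a","b","c")
```

The generated text is ordinary PySpark DataFrame code, so it can be written to a script or handed to whatever job runner you use.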
In the event that databathing is not working for you, you can help make it better by simply pasting your SQL (or JSON) into a new issue. Extra points if you describe the problem. Even more points if you submit a PR with a test. If you also submit a fix, then you also have my gratitude.
Please follow this blog to publish a new version - https://circleci.com/blog/publishing-a-python-package/
See the tests directory for instructions on running tests, or writing new ones.
May 2022
Features and Functionalities - PySpark Version
- SELECT feature
- FROM feature
- INNER JOIN and LEFT JOIN feature
- ON feature
- WHERE feature
- GROUP BY feature
- HAVING feature
- ORDER BY feature
- AGG feature
- WINDOWS FUNCTION feature (SUM, AVG, MAX, MIN, MEAN, COUNT)
- ALIAS NAME feature
- WITH STATEMENT feature