
openmetadata-spark-agent's People

Contributors

efthymiosh, mohityadav766, ulixius9


openmetadata-spark-agent's Issues

Data lineage between different database services is not showing

We have OpenMetadata set up on an AKS cluster using the Helm chart, with Databricks and Azure SQL database services connected. When we create a table on Databricks from a table in Azure SQL, the lineage should be azure_sql_table --> databricks_table, but this lineage does not appear. We do see lineage between tables within Databricks, e.g. databricks_table1 --> databricks_table2. We tried creating tables with both openmetadata-spark-agent and openmetadata-spark-agent-1.0-beta; neither gives the expected result.
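One thing worth checking (a sketch, not a confirmed fix): the agent resolves tables against the services listed in `spark.openmetadata.transport.databaseServiceNames`, so cross-service lineage may require listing both services there. Assuming the services are registered in OpenMetadata under the hypothetical names `azure_sql` and `databricks`, the setting would look like:

```
spark.openmetadata.transport.databaseServiceNames=azure_sql,databricks
```

If only one service is listed, tables from the other service may not be matched, which would explain lineage appearing only between Databricks tables.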

Support iceberg catalog of type Glue

The agent does not support glue-backed iceberg catalogs, failing with:

24/04/23 12:17:37 ERROR PlanUtils3: Catalog glue is unsupported
io.openlineage.spark3.agent.lifecycle.plan.catalog.UnsupportedCatalogException: glue
	at io.openlineage.spark3.agent.lifecycle.plan.catalog.IcebergHandler.getDatasetIdentifier(IcebergHandler.java:83)
	at io.openlineage.spark3.agent.lifecycle.plan.catalog.CatalogUtils3.lambda$getDatasetIdentifier$2(CatalogUtils3.java:61)
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
	at java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1361)

where `glue` is a catalog of type `glue`.

This is supported in openlineage-spark starting from 1.8.0.
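For reference, a Glue-backed Iceberg catalog named `glue` (matching the error above) is typically configured with the standard Iceberg catalog properties along these lines (the warehouse path is a hypothetical placeholder):

```
spark.sql.catalog.glue=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.glue.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.glue.warehouse=s3://my-bucket/warehouse
spark.sql.catalog.glue.io-impl=org.apache.iceberg.aws.s3.S3FileIO
```

It is this `catalog-impl` of `GlueCatalog` that the agent's `IcebergHandler` fails to recognize.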

Improve column level lineage

spark.sql("insert into table1 (col1) select concat(col2, col3) from table2 limit 100").show()

This query only generates column-level lineage between col1 and col2; col3 is missing from the lineage.
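Since `col1` is computed as `concat(col2, col3)`, both source columns should contribute to it. The expected column-level lineage would be:

```
table2.col2 --> table1.col1
table2.col3 --> table1.col1
```

Currently only the first edge is produced.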

infinite recursion error

df1 = spark.sql("select ....")
df1.createOrReplaceTempView("view1")
df2 = spark.sql("select c1 from view1")

This program reports an infinite recursion error:
org.apache.spark.sql.catalyst.expressions.AttributeReference->org.apache.spark.sql.catalyst.expressions.AttributeReference["canonicalized"]
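This cycle comes from serializing the Spark logical plan (the `canonicalized` field of `AttributeReference` refers back into the plan). A workaround worth trying (a sketch, not a confirmed fix) is to disable the logical-plan facet, as is also done in the pipeline example further down this page:

```
spark.openlineage.facets.disabled=[spark_unknown;spark.logicalPlan]
```

With that facet disabled, the listener should no longer attempt to serialize the recursive plan structure.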

Cannot create pipeline with Spark job

Hi there, I am trying to create a pipeline with Spark job (My code is based on this tutorial: https://docs.open-metadata.org/v1.3.x/connectors/ingestion/lineage/spark-lineage#spark-lineage-ingestion). The tables used in the spark job all exist in OM, the spark job has a finished status, and a pipeline service named My_pipeline_service has been created.

But I don't see any pipeline. What can I do to create the pipeline with the Spark job? I have only added metadata ingestion in the DB service; do I need to add any other type of ingestion to create the pipeline?

This is my code:

```python
from pyspark.sql import SparkSession

ss_conf = {
    "database": "test_db",
    "password": "my_pwd",
    "port": "3306",
    "host": "my_host",
    "ssl": "false",
    "username": "my_username",
}

OM_JWT = "my_om_jwt"

spark = (
    SparkSession.builder.master("local")
    .appName("localTestApp")
    .config(
        "spark.jars",
        "/thuvien/driver/singlestore-spark-connector_2.12-4.1.3-spark-3.3.0.jar,"
        "/thuvien/driver/mariadb-java-client-3.1.4.jar,"
        "/thuvien/driver/singlestore-jdbc-client-1.2.0.jar,"
        "/thuvien/driver/commons-dbcp2-2.9.0.jar,"
        "/thuvien/driver/commons-pool2-2.11.1.jar,"
        "/thuvien/driver/spray-json_2.10-1.2.5.jar,"
        "/thuvien/driver/openmetadata-spark-agent-1.0-beta.jar",
    )
    .config(
        "spark.extraListeners",
        "org.openmetadata.spark.agent.OpenMetadataSparkListener",
    )
    .config("spark.openmetadata.transport.hostPort", "my_hostPort")
    .config("spark.openmetadata.transport.type", "openmetadata")
    .config("spark.openmetadata.transport.jwtToken", OM_JWT)
    .config("spark.openmetadata.transport.pipelineServiceName", "My_pipeline_service")
    .config("spark.openmetadata.transport.pipelineName", "My_pipeline_name")
    .config("spark.openmetadata.transport.pipelineDescription", "My ETL Pipeline")
    .config("spark.openmetadata.transport.databaseServiceNames", "analytic")
    .config("spark.openmetadata.transport.timeout", "30")
    .config("spark.openlineage.facets.disabled", "[spark_unknown;spark.logicalPlan]")
    .getOrCreate()
)

sc = spark.sparkContext
sc.setLogLevel("INFO")

# SingleStore connection settings
spark.conf.set("spark.datasource.singlestore.clientEndpoint", ss_conf["host"])
spark.conf.set("spark.datasource.singlestore.user", ss_conf["username"])
spark.conf.set("spark.datasource.singlestore.password", ss_conf["password"])

df1 = (
    spark.read.format("singlestore")
    .option("spark.datasource.singlestore.disablePushdown", "true")
    .option("enableParallelRead", "automatic")
    .option("parallelRead.Features", "readFromAggregatorsMaterialized,readFromAggregators")
    .option("parallelRead.repartition", "true")
    .option("parallelRead.maxNumPartitions", 20)
    .option("parallelRead.repartition.columns", "isdn_key")
    .load("test_db.uc16_02_tkc_20240229")
)

print(df1.count())

(
    df1.na.fill(0)
    .na.fill("")
    .write.mode("overwrite")
    .format("singlestore")
    .save("test_db.uc16_02_tkc_20240229_new")
)

spark.stop()
```

Data Lineage Not Added for New Table Creation in Spark

Description:
When creating a new Impala table using Spark from existing Impala tables, the data lineage is not being added. It seems that the `toEntity` in the code doesn't retrieve information for the new table.

Expected Behavior:
If the table is new, I would expect it to be added to the database service (Impala here) inside OpenMetadata, and data lineage to be automatically added for the new Impala table created in Spark, ensuring complete metadata lineage.

Actual Behavior:
No data lineage is added for the new table, resulting in incomplete metadata lineage.

Slack Link

Additional Information:

  • Environment: YARN
  • OpenMetadata version: 1.3.x
  • OpenMetadata-spark-agent : 1.0.0-beta
  • Spark Version: 3.2.3
  • Impala Version: 4.1.1

Jar Files

Kindly assist in providing the jar files for this project

Support more versions of spark/scala

Any chance we can add support for other versions of Spark and Scala?

In particular, I'd be interested in:

  • spark 3.3, scala 2.13
  • spark 3.4, scala 2.13

Thank you!
