
openmetadata-spark-agent's People

Contributors

efthymiosh, mohityadav766, ulixius9


openmetadata-spark-agent's Issues

Data lineage between different database services is not showing

We have OpenMetadata set up on an AKS cluster using the Helm chart, with Databricks and Azure SQL database services connected. When we create a table on Databricks from a table in Azure SQL, the lineage should be azure_sql_table --> databricks_table, but this lineage does not appear. We do see lineage between tables within Databricks, e.g. databricks_table1 --> databricks_table2. We tried creating tables with both openmetadata-spark-agent and openmetadata-spark-agent-1.0-beta; neither gives the expected result.
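One thing worth checking (a sketch, not a confirmed fix): the agent resolves tables against the services listed in `spark.openmetadata.transport.databaseServiceNames`, so cross-service lineage may require listing both services there. Assuming the services are registered in OpenMetadata under the hypothetical names `azure_sql` and `databricks`, the setting would look like:

```
spark.openmetadata.transport.databaseServiceNames=azure_sql,databricks
```

If only one service is listed, tables from the other service may not be matched, which would explain lineage appearing only between Databricks tables.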

Support iceberg catalog of type Glue

The agent does not support glue-backed iceberg catalogs, failing with:

24/04/23 12:17:37 ERROR PlanUtils3: Catalog glue is unsupported
io.openlineage.spark3.agent.lifecycle.plan.catalog.UnsupportedCatalogException: glue
	at io.openlineage.spark3.agent.lifecycle.plan.catalog.IcebergHandler.getDatasetIdentifier(IcebergHandler.java:83)
	at io.openlineage.spark3.agent.lifecycle.plan.catalog.CatalogUtils3.lambda$getDatasetIdentifier$2(CatalogUtils3.java:61)
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
	at java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1361)

where `glue` is a catalog of type `glue`.

This is supported in openlineage-spark starting from 1.8.0.
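For reference, a Glue-backed Iceberg catalog named `glue` (matching the error above) is typically configured with the standard Iceberg catalog properties along these lines (the warehouse path is a hypothetical placeholder):

```
spark.sql.catalog.glue=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.glue.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.glue.warehouse=s3://my-bucket/warehouse
spark.sql.catalog.glue.io-impl=org.apache.iceberg.aws.s3.S3FileIO
```

It is this `catalog-impl` of `GlueCatalog` that the agent's `IcebergHandler` fails to recognize.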

Improve column level lineage

spark.sql("insert into table1 (col1) select concat(col2, col3) from table2 limit 100").show()

This query only generates column-level lineage between col1 and col2; col3 is missing from the lineage.
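Since `col1` is computed as `concat(col2, col3)`, both source columns should contribute to it. The expected column-level lineage would be:

```
table2.col2 --> table1.col1
table2.col3 --> table1.col1
```

Currently only the first edge is produced.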

infinite recursion error

df1 = spark.sql("select ....")
df1.createOrReplaceTempView("view1")
df2 = spark.sql("select c1 from view1")

This program reports an infinite recursion error:
org.apache.spark.sql.catalyst.expressions.AttributeReference->org.apache.spark.sql.catalyst.expressions.AttributeReference["canonicalized"]
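This cycle comes from serializing the Spark logical plan (the `canonicalized` field of `AttributeReference` refers back into the plan). A workaround worth trying (a sketch, not a confirmed fix) is to disable the logical-plan facet, as is also done in the pipeline example further down this page:

```
spark.openlineage.facets.disabled=[spark_unknown;spark.logicalPlan]
```

With that facet disabled, the listener should no longer attempt to serialize the recursive plan structure.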

Cannot create pipeline with Spark job

Hi there, I am trying to create a pipeline with Spark job (My code is based on this tutorial: https://docs.open-metadata.org/v1.3.x/connectors/ingestion/lineage/spark-lineage#spark-lineage-ingestion). The tables used in the spark job all exist in OM, the spark job has a finished status, and a pipeline service named My_pipeline_service has been created.

But I don't see any pipeline. What can I do to create the pipeline with the Spark job? I have only added metadata ingestion in the DB service; do I need to add any other type of ingestion to create the pipeline?

This is my code:

```python
from pyspark.sql import SparkSession

ss_conf = {
    "database": "test_db",
    "password": "my_pwd",
    "port": "3306",
    "host": "my_host",
    "ssl": "false",
    "username": "my_username",
}

OM_JWT = "my_om_jwt"

spark = (
    SparkSession.builder.master("local")
    .appName("localTestApp")
    .config(
        "spark.jars",
        "/thuvien/driver/singlestore-spark-connector_2.12-4.1.3-spark-3.3.0.jar,"
        "/thuvien/driver/mariadb-java-client-3.1.4.jar,"
        "/thuvien/driver/singlestore-jdbc-client-1.2.0.jar,"
        "/thuvien/driver/commons-dbcp2-2.9.0.jar,"
        "/thuvien/driver/commons-pool2-2.11.1.jar,"
        "/thuvien/driver/spray-json_2.10-1.2.5.jar,"
        "/thuvien/driver/openmetadata-spark-agent-1.0-beta.jar",
    )
    .config(
        "spark.extraListeners",
        "org.openmetadata.spark.agent.OpenMetadataSparkListener",
    )
    .config("spark.openmetadata.transport.hostPort", "my_hostPort")
    .config("spark.openmetadata.transport.type", "openmetadata")
    .config("spark.openmetadata.transport.jwtToken", OM_JWT)
    .config("spark.openmetadata.transport.pipelineServiceName", "My_pipeline_service")
    .config("spark.openmetadata.transport.pipelineName", "My_pipeline_name")
    .config("spark.openmetadata.transport.pipelineDescription", "My ETL Pipeline")
    .config("spark.openmetadata.transport.databaseServiceNames", "analytic")
    .config("spark.openmetadata.transport.timeout", "30")
    .config("spark.openlineage.facets.disabled", "[spark_unknown;spark.logicalPlan]")
    .getOrCreate()
)

sc = spark.sparkContext
sc.setLogLevel("INFO")

# SingleStore connection settings
spark.conf.set("spark.datasource.singlestore.clientEndpoint", ss_conf["host"])
spark.conf.set("spark.datasource.singlestore.user", ss_conf["username"])
spark.conf.set("spark.datasource.singlestore.password", ss_conf["password"])

df1 = (
    spark.read.format("singlestore")
    .option("spark.datasource.singlestore.disablePushdown", "true")
    .option("enableParallelRead", "automatic")
    .option("parallelRead.Features", "readFromAggregatorsMaterialized,readFromAggregators")
    .option("parallelRead.repartition", "true")
    .option("parallelRead.maxNumPartitions", 20)
    .option("parallelRead.repartition.columns", "isdn_key")
    .load("test_db.uc16_02_tkc_20240229")
)

print(df1.count())

(
    df1.na.fill(0)
    .na.fill("")
    .write.mode("overwrite")
    .format("singlestore")
    .save("test_db.uc16_02_tkc_20240229_new")
)

spark.stop()
```

Data Lineage Not Added for New Table Creation in Spark

Description:
When creating a new Impala table using Spark from existing Impala tables, the data lineage is not being added. It seems that the `toEntity` in the code doesn't retrieve information for the new table.

Expected Behavior:
If the table is new, I would expect it to be added to the database service (Impala here) inside OpenMetadata, and data lineage to be automatically added for the new Impala table created in Spark, ensuring complete metadata lineage.

Actual Behavior:
No data lineage is added for the new table, resulting in incomplete metadata lineage.

Slack Link

Additional Information:

  • Environment: YARN
  • OpenMetadata version: 1.3.x
  • OpenMetadata-spark-agent : 1.0.0-beta
  • Spark Version: 3.2.3
  • Impala Version: 4.1.1

Jar Files

Kindly assist in providing the jar files for this project

Support more versions of spark/scala

Any chance we can add support for other versions of Spark and Scala?

In particular, I'd be interested in:

  • spark 3.3, scala 2.13
  • spark 3.4, scala 2.13

Thank you!
