
pulsar-io-lakehouse's Introduction

Pulsar IO :: Lakehouse Connector

The Lakehouse connector is a Pulsar IO connector for synchronizing data between Lakehouse products (Delta Lake, Iceberg, and Hudi) and Pulsar. It contains two types of connectors:

Lakehouse source connector (currently supports Delta Lake)

This source connector captures data changes from a Delta Lake table through DSR and writes them to Pulsar topics.

Lakehouse sink connector (currently supports Delta Lake, Hudi, and Iceberg)

This sink connector consumes messages from Pulsar topics and writes them into a Lakehouse table, so that users can further process the table data with other big-data engines.

Currently, Lakehouse connector versions (x.y.z) are based on Pulsar versions (x.y.z).

Delta connector version | Pulsar version | Doc
2.9.x                   | 2.9.2          | Lakehouse source connector, Lakehouse sink connector

Lakehouse Demos

Lakehouse  | Demo
Delta Lake | Delta Lake Source and Sink Demo

Project layout

Below are the subfolders and files of this project and their descriptions.

  ├── conf // stores configuration examples.
  ├── docs // stores user guides.
  ├── src // stores source codes.
  │   ├── checkstyle // stores checkstyle configuration files.
  │   ├── license // stores license headers. You can use `mvn license:format` to format the project with the stored license header.
  │   │   └── ALv2
  │   ├── main // stores all main source files.
  │   │   └── java
  │   ├── spotbugs // stores spotbugs configuration files.
  │   └── test // stores all related tests.

Build delta connector

Requirements:

Compile and install without cloud dependencies:

$ mvn clean install -DskipTests

Compile and install with cloud dependencies (including AWS, GCS, and Azure):

$ mvn clean install -P cloud -DskipTests

Run Unit Tests:

$ mvn test

Run Individual Unit Test:

$ mvn test -Dtest=<unit-test-name>    # e.g. mvn test -Dtest=ParquetReaderTest

License

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0

pulsar-io-lakehouse's People

Contributors

danpi, hangc0276, horizonzy, huanli-meng, merlimat, oneebhkhan, streamnativebot, zymap


pulsar-io-lakehouse's Issues

[Bug] Does not work with iceberg hive catalog

The sink does not work with the Iceberg Hive catalog: a dependency for the Hive catalog is missing, resulting in Missing org.apache.iceberg.hive.HiveCatalog.

To Reproduce
Steps to reproduce the behavior:

  1. Pulsar 3.1.0 deployed on Kubernetes using helm chart - Helm Chart and custom pulsar image
  2. Deployed private minio in private cluster
  3. Deployed hive metastore in private cluster
  4. Deployed pulsar sink v3.1.0.4 with iceberg hive catalog
  5. Started 5hz data flow on a persistent pulsar topic
  6. Receiving error message Missing org.apache.iceberg.hive.HiveCatalog [java.lang.NoClassDefFoundError: org/apache/hadoop/hive/metastore/api/UnknownDBException]
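The NoClassDefFoundError above means the Hive metastore classes are not visible on the connector's classpath at runtime. A minimal, plain-JDK probe (illustrative helper, not part of pulsar-io-lakehouse; class names taken from the error log) can confirm whether a class is loadable:

```java
// Classpath probe: reports whether a class can be loaded by the current
// class loader. Illustrative only; run it from inside the sink runtime
// to check visibility of the Hive/Iceberg jars.
public class ClasspathProbe {
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Present on every JVM:
        System.out.println(isLoadable("java.lang.String"));
        // Both must be true inside the sink runtime for the Hive catalog to load:
        System.out.println(isLoadable("org.apache.hadoop.hive.metastore.api.UnknownDBException"));
        System.out.println(isLoadable("org.apache.iceberg.hive.HiveCatalog"));
    }
}
```

If the last two checks print false inside the function worker, one possible cause is that jars copied into /pulsar/lib are not visible to the NAR's classloader; bundling the Hive dependencies with the connector may be required.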

Expected behavior
The sink should work and tables should be created in minio.

Sink Configuration

{
    "tenant":"unifiyadkinville",
    "namespace":"ifs",
    "name":"icerberg_sink",
    "parallelism":2,
    "inputs": [
      "persistent://unif/if/fiber"
    ],
    "archive": "connectors/pulsar-io-lakehouse-3.1.0.4-cloud.nar",
    "processingGuarantees":"EFFECTIVELY_ONCE",
    "configs":{
        "type":"iceberg",
        "maxCommitInterval":60,
        "maxRecordsPerCommit":60000,
        "catalogName":"test_v1",
        "tableNamespace":"iceberg_sink_test",
        "tableName":"ice_sink_person",
        "catalogProperties":{
            "uri":"thrift://hive-metastore-service.hive:9083",
            "warehouse":"s3a://warehouse/iceberg",
            "catalog-impl":"hiveCatalog"
        }
    }
}

Custom Pulsar Image

# Use a simple base for downloading and setting permissions
FROM alpine as downloader
ADD https://github.com/streamnative/pulsar-io-lakehouse/releases/download/v3.1.0.4/pulsar-io-lakehouse-3.1.0.4-cloud.nar /tmp/pulsar-io-lakehouse-3.1.0.4-cloud.nar
ADD https://github.com/streamnative/pulsar-io-lakehouse/releases/download/v3.1.0.4/pulsar-io-lakehouse-3.1.0.4.nar /tmp/pulsar-io-lakehouse-3.1.0.4.nar
ADD https://repo1.maven.org/maven2/org/apache/hive/hive-metastore/3.1.2/hive-metastore-3.1.2.jar /tmp/hive-metastore-3.1.2.jar
ADD https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-hive-runtime/0.13.2/iceberg-hive-runtime-0.13.2.jar /tmp/iceberg-hive-runtime-0.13.2.jar

RUN chmod 644 /tmp/pulsar-io-lakehouse-3.1.0.4-cloud.nar
RUN chmod 644 /tmp/pulsar-io-lakehouse-3.1.0.4.nar
RUN chmod 644 /tmp/hive-metastore-3.1.2.jar
RUN chmod 644 /tmp/iceberg-hive-runtime-0.13.2.jar

# Use the Pulsar image
FROM apachepulsar/pulsar-all:3.1.0
COPY --from=downloader /tmp/pulsar-io-lakehouse-3.1.0.4-cloud.nar /pulsar/connectors/pulsar-io-lakehouse-3.1.0.4-cloud.nar
COPY --from=downloader /tmp/pulsar-io-lakehouse-3.1.0.4.nar /pulsar/connectors/pulsar-io-lakehouse-3.1.0.4.nar
COPY --from=downloader /tmp/hive-metastore-3.1.2.jar /pulsar/lib/hive-metastore-3.1.2.jar
COPY --from=downloader /tmp/iceberg-hive-runtime-0.13.2.jar /pulsar/lib/iceberg-hive-runtime-0.13.2.jar

# Continue with the rest of your Dockerfile...
COPY ./iceberg.json /pulsar/connectors/iceberg.json

Error Log

2023-10-03T20:26:59,826+0000 [lakehouse-io-1-1] ERROR org.apache.pulsar.ecosystem.io.lakehouse.sink.SinkWriter - process record failed.
java.lang.IllegalArgumentException: Cannot initialize Catalog implementation org.apache.iceberg.hive.HiveCatalog: Cannot find constructor for interface org.apache.iceberg.catalog.Catalog
        Missing org.apache.iceberg.hive.HiveCatalog [java.lang.NoClassDefFoundError: org/apache/hadoop/hive/metastore/api/UnknownDBException]
        at org.apache.iceberg.CatalogUtil.loadCatalog(CatalogUtil.java:182) ~[iceberg-core-0.13.1.jar:?]
        at org.apache.pulsar.ecosystem.io.lakehouse.sink.iceberg.CatalogLoader$HiveCatalogLoader.loadCatalog(CatalogLoader.java:107) ~[cZDQlLbu3-KT07yEI4R2DQ/:?]
        at org.apache.pulsar.ecosystem.io.lakehouse.sink.iceberg.TableLoader$CatalogTableLoader.<init>(TableLoader.java:124) ~[cZDQlLbu3-KT07yEI4R2DQ/:?]
        at org.apache.pulsar.ecosystem.io.lakehouse.sink.iceberg.TableLoader$CatalogTableLoader.<init>(TableLoader.java:112) ~[cZDQlLbu3-KT07yEI4R2DQ/:?]
        at org.apache.pulsar.ecosystem.io.lakehouse.sink.iceberg.TableLoader.fromCatalog(TableLoader.java:48) ~[cZDQlLbu3-KT07yEI4R2DQ/:?]
        at org.apache.pulsar.ecosystem.io.lakehouse.sink.iceberg.IcebergWriter.<init>(IcebergWriter.java:83) ~[cZDQlLbu3-KT07yEI4R2DQ/:?]
        at org.apache.pulsar.ecosystem.io.lakehouse.sink.LakehouseWriter.getWriter(LakehouseWriter.java:41) ~[cZDQlLbu3-KT07yEI4R2DQ/:?]
        at org.apache.pulsar.ecosystem.io.lakehouse.sink.SinkWriter.getOrCreateWriter(SinkWriter.java:148) ~[cZDQlLbu3-KT07yEI4R2DQ/:?]
        at org.apache.pulsar.ecosystem.io.lakehouse.sink.SinkWriter.run(SinkWriter.java:104) ~[cZDQlLbu3-KT07yEI4R2DQ/:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.77.Final.jar:4.1.77.Final]
        at java.lang.Thread.run(Thread.java:833) ~[?:?]
Caused by: java.lang.NoSuchMethodException: Cannot find constructor for interface org.apache.iceberg.catalog.Catalog
        Missing org.apache.iceberg.hive.HiveCatalog [java.lang.NoClassDefFoundError: org/apache/hadoop/hive/metastore/api/UnknownDBException]
        at org.apache.iceberg.common.DynConstructors$Builder.buildChecked(DynConstructors.java:227) ~[iceberg-common-0.13.1.jar:?]
        at org.apache.iceberg.CatalogUtil.loadCatalog(CatalogUtil.java:180) ~[iceberg-core-0.13.1.jar:?]
        ... 12 more
2023-10-03T20:27:01,514+0000 [unifiyadkinville/ifs/icerberg_sink-0] ERROR org.apache.pulsar.ecosystem.io.lakehouse.SinkConnector - Exit caused by lakehouse writer stop working
2023-10-03T20:27:01,514+0000 [unifiyadkinville/ifs/icerberg_sink-0] INFO  org.apache.pulsar.functions.instance.JavaInstanceRunnable - Encountered exception in sink write:
org.apache.pulsar.ecosystem.io.lakehouse.exception.LakehouseConnectorException: Exit caused by lakehouse writer stop working
        at org.apache.pulsar.ecosystem.io.lakehouse.SinkConnector.write(SinkConnector.java:83) ~[cZDQlLbu3-KT07yEI4R2DQ/:?]
        at org.apache.pulsar.functions.instance.JavaInstanceRunnable.sendOutputMessage(JavaInstanceRunnable.java:439) ~[?:?]
        at org.apache.pulsar.functions.instance.JavaInstanceRunnable.handleResult(JavaInstanceRunnable.java:401) ~[?:?]
        at org.apache.pulsar.functions.instance.JavaInstanceRunnable.run(JavaInstanceRunnable.java:341) ~[?:?]
        at java.lang.Thread.run(Thread.java:833) ~[?:?]
2023-10-03T20:27:01,519+0000 [unifiyadkinville/ifs/icerberg_sink-0] ERROR org.apache.pulsar.functions.instance.JavaInstanceRunnable - [unifiyadkinville/ifs/icerberg_sink:0] Uncaught exception in Java Instance
java.lang.RuntimeException: Failed to process message: 69:48:-1:151
        at org.apache.pulsar.functions.source.PulsarSource.lambda$buildRecord$6(PulsarSource.java:155) ~[org.apache.pulsar-pulsar-functions-instance-3.1.0.jar:3.1.0]
        at org.apache.pulsar.functions.source.PulsarRecord.fail(PulsarRecord.java:133) ~[org.apache.pulsar-pulsar-functions-instance-3.1.0.jar:3.1.0]
        at org.apache.pulsar.functions.instance.JavaInstanceRunnable.sendOutputMessage(JavaInstanceRunnable.java:444) ~[org.apache.pulsar-pulsar-functions-instance-3.1.0.jar:3.1.0]
        at org.apache.pulsar.functions.instance.JavaInstanceRunnable.handleResult(JavaInstanceRunnable.java:401) ~[org.apache.pulsar-pulsar-functions-instance-3.1.0.jar:3.1.0]
        at org.apache.pulsar.functions.instance.JavaInstanceRunnable.run(JavaInstanceRunnable.java:341) ~[org.apache.pulsar-pulsar-functions-instance-3.1.0.jar:3.1.0]
        at java.lang.Thread.run(Thread.java:833) ~[?:?]
2023-10-03T20:27:01,520+0000 [unifiyadkinville/ifs/icerberg_sink-0] INFO  org.apache.pulsar.functions.instance.JavaInstanceRunnable - Closing instance

Environment

  • Ubuntu 20.04
  • Pulsar version: 3.1.0
  • Deployment: cluster
  • Connector/offloader/protocol handler/... version: 3.1.0.4

Additional context
I haven't added any additional sink, source, or Java libraries to the Pulsar image, as it was not mentioned in the documentation. Let me know if I am missing something.

[DISCUSS] Add extra metadata into lakehouse table field in sink connector

Is your enhancement request related to a problem? Please describe.
When we use Pulsar SQL to query data from a Pulsar topic, we find that Presto has added extra schema fields to the table. For the Lakehouse sink connector, should we add a configuration option for adding extra schema fields to the Lakehouse table? This would make integration with the Lakehouse offloader easier.

[Enhancement request] supports bytes schema for the lakehouse connector

Is your enhancement request related to a problem? Please describe.
A clear and concise description of what the enhancement is.

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

[CVE] Verizon requests a fix for log4j 1.2.17 vulnerabilities

Describe the bug
Fix CVEs introduced by log4j 1.2.17

Requested by Verizon in streamnative/eng-support-tickets#218

To Reproduce

  1. Go to snyk.io pulsar-io-lakehouse project
  2. Enter log4j in the Snyk search box
  3. Please fix the CVEs
    CVE-2019-17571, CVE-2020-9488, CVE-2022-23302, CVE-2022-23305, CVE-2022-23307

Expected behavior
CVEs fixed


Environment

  • OS: [e.g. Ubuntu]
  • Pulsar version: [e.g. 2.7.0]
  • Deployment: [e.g. standalone]
  • Connector/offloader/protocol handler/... version: [e.g. 2.7.0]

Additional context
Add any other context about the problem here.

[Enhancement request] Sink connector support pb and primitive schema format


[Bug] STRING Schema topic sink failed

I tried to run the Iceberg sink demo following the pulsar-io-lakehouse sink docs. Committing records fails because getSchema returns an unexpected result.

Describe the bug
My test flow is shown below:

  1. Create the topic and produce data.
    First, I produced a large amount of data to the test topic persistent://public/default/iceberg_test via the Flink connector.

Message format:
22061772,1670896138459,mmdc-bigdata-test,11.156.128.57,jobmanager,11.156.128.75,2022-12-13 09:48:58,29
I then set the topic schema with the bin/pulsar-admin schemas upload command.
  2. Run the lakehouse sink connector.
    The logs show the Iceberg sink failing with a schema exception.

There are two questions:

  1. Why does getSchemaType return different results when retrieved in these two ways?
  2. record.getSchema().getSchemaInfo().getSchemaDefinition() returns null, so records are skipped in SinkWriter.run.
    Looking at getSchemaDefinition, if the SchemaType is STRING or BYTES, the SchemaDefinition is always null, which causes the sink to fail.
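The skip behavior described above can be sketched as follows (the method name and logic here are illustrative; the real check lives in the connector's SinkWriter and getSchemaDefinition code paths):

```java
// Illustrative sketch: a record is only written when its schema carries a
// non-empty schema definition. For STRING/BYTES schemas the definition is
// null, so such records are skipped, matching the behavior reported above.
public class SchemaSkipSketch {
    static boolean shouldWrite(String schemaDefinition) {
        // A null or empty definition means the writer cannot derive a
        // Lakehouse table schema, so the record is skipped.
        return schemaDefinition != null && !schemaDefinition.isEmpty();
    }

    public static void main(String[] args) {
        // Avro-style record schema: has a definition, gets written.
        System.out.println(shouldWrite("{\"type\":\"record\",\"name\":\"t\",\"fields\":[]}"));
        // STRING/BYTES case: no definition, gets skipped.
        System.out.println(shouldWrite(null));
    }
}
```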

Environment

  • Pulsar version: 2.9.3
  • Deployment: On-premises cluster
  • pulsar-io-lakehouse-connector version: 2.9.3.16

4 brokers and 1 function worker (running as separate processes on separate machines).

[Bug] Multiple Hoodie writer commit conflict

Describe the bug
When we run the Hudi sink with multiple instances and use it to sink a partitioned topic with a failover subscription, both instances consume messages from the topic. Hudi provides a concurrency mode to support multiple writers.
When we enable this feature, the Hudi writer throws a FileAlreadyExists exception and the commit fails.
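The conflict boils down to two writers racing to create the same immutable commit file on the Hudi timeline. A self-contained, plain-JDK sketch of those filesystem semantics (not Hudi code):

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Two writers racing to create the same immutable commit file: the first
// create succeeds, the second fails with FileAlreadyExistsException, which
// is the root cause shown in the stack trace below.
public class CommitFileConflict {
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("hoodie-demo");
        Path commit = dir.resolve("20220517094051766.commit");

        Files.createFile(commit); // writer A commits the instant first
        try {
            Files.createFile(commit); // writer B attempts the same instant
        } catch (FileAlreadyExistsException e) {
            System.out.println("writer B conflict on " + commit.getFileName());
        }
    }
}
```

This is why multi-writer setups need external coordination (the ZooKeeper-based lock provider in the configuration below) so that only one writer transitions a given instant.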

org.apache.hudi.exception.HoodieIOException: Failed to create file file:/tmp/integration/hudi/.hoodie/20220517094051766.commit
	at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.createImmutableFileInPath(HoodieActiveTimeline.java:745) ~[hudi-common-0.11.0.jar:0.11.0]
	at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionState(HoodieActiveTimeline.java:560) ~[hudi-common-0.11.0.jar:0.11.0]
	at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionState(HoodieActiveTimeline.java:536) ~[hudi-common-0.11.0.jar:0.11.0]
	at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.saveAsComplete(HoodieActiveTimeline.java:183) ~[hudi-common-0.11.0.jar:0.11.0]
	at org.apache.hudi.client.BaseHoodieWriteClient.commit(BaseHoodieWriteClient.java:270) ~[hudi-client-common-0.11.0.jar:0.11.0]
	at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:234) ~[hudi-client-common-0.11.0.jar:0.11.0]
	at org.apache.hudi.client.HoodieJavaWriteClient.commit(HoodieJavaWriteClient.java:88) ~[hudi-java-client-0.11.0.jar:0.11.0]
	at org.apache.hudi.client.HoodieJavaWriteClient.commit(HoodieJavaWriteClient.java:51) ~[hudi-java-client-0.11.0.jar:0.11.0]
	at org.apache.hudi.client.BaseHoodieWriteClient.commit(BaseHoodieWriteClient.java:206) ~[hudi-client-common-0.11.0.jar:0.11.0]
	at org.apache.pulsar.ecosystem.io.sink.hudi.BufferedConnectWriter.flushRecords(BufferedConnectWriter.java:82) ~[PqY5lYEJSWPWMDq7E5HC2Q/:?]
	at org.apache.pulsar.ecosystem.io.sink.hudi.HoodieWriter.flush(HoodieWriter.java:85) ~[PqY5lYEJSWPWMDq7E5HC2Q/:?]
	at org.apache.pulsar.ecosystem.io.sink.SinkWriter.commitIfNeed(SinkWriter.java:128) ~[PqY5lYEJSWPWMDq7E5HC2Q/:?]
	at org.apache.pulsar.ecosystem.io.sink.SinkWriter.run(SinkWriter.java:113) [PqY5lYEJSWPWMDq7E5HC2Q/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_201]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_201]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-common-4.1.77.Final.jar:4.1.77.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: File already exists: file:/tmp/integration/hudi/.hoodie/20220517094051766.commit
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:315) ~[hadoop-common-3.2.2.jar:?]
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:353) ~[hadoop-common-3.2.2.jar:?]
	at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:403) ~[hadoop-common-3.2.2.jar:?]
	at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:466) ~[hadoop-common-3.2.2.jar:?]
	at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:445) ~[hadoop-common-3.2.2.jar:?]
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1125) ~[hadoop-common-3.2.2.jar:?]
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1105) ~[hadoop-common-3.2.2.jar:?]
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:994) ~[hadoop-common-3.2.2.jar:?]
	at org.apache.hudi.common.fs.HoodieWrapperFileSystem.lambda$create$2(HoodieWrapperFileSystem.java:222) ~[hudi-common-0.11.0.jar:0.11.0]
	at org.apache.hudi.common.fs.HoodieWrapperFileSystem.executeFuncWithTimeMetrics(HoodieWrapperFileSystem.java:101) ~[hudi-common-0.11.0.jar:0.11.0]
	at org.apache.hudi.common.fs.HoodieWrapperFileSystem.create(HoodieWrapperFileSystem.java:221) ~[hudi-common-0.11.0.jar:0.11.0]
	at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.createImmutableFileInPath(HoodieActiveTimeline.java:740) ~[hudi-common-0.11.0.jar:0.11.0]
	... 16 more

To Reproduce
Steps to reproduce the behavior:

  1. Create a partitioned topic pio
  2. Submit a Hudi writer with the following configuration:
    {
  "tenant": "public",
  "namespace": "default",
  "name": "lakehouse1",
  "topicName": "pio",
  "parallelism": 2,
  "sourceSubscriptionName": "sub",
  "processingGuarantees": "EFFECTIVELY_ONCE",
  "subscriptionType": "Failover",
  "type": "hudi",
  "inputs": [
    "pio"
  ],
  "archive": "/Volumes/work/github.com/streamnative/pulsar-io-lakehouse/target/pulsar-io-lakehouse-2.9.2.0-SNAPSHOT.nar",
  "className": "org.apache.pulsar.ecosystem.io.SinkConnector",
  "configs":
  {
    "type": "hudi",
    "hoodie.table.name": "hudi-connector-test",
    "hoodie.table.type": "COPY_ON_WRITE",
    "hoodie.base.path": "file:///tmp/integration/hudi",
    "hoodie.clean.async": "true",
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.write.lock.provider": "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
    "hoodie.write.lock.zookeeper.url": "localhost",
    "hoodie.write.lock.zookeeper.port": "2181",
    "hoodie.write.lock.zookeeper.lock_key": "pulsar_hudi",
    "hoodie.write.lock.zookeeper.base_path": "/hudi",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "id",
    "maxRecordsPerCommit": 10
  }
}
  3. Produce messages to the pio topic
  4. See the error

Expected behavior
Both commits should succeed.

Screenshots
If applicable, add screenshots to help explain your problem.

Environment

  • OS: [e.g. Ubuntu]
  • Pulsar version: [e.g. 2.7.0]
  • Deployment: [e.g. standalone]
  • Connector/offloader/protocol handler/... version: [e.g. 2.7.0]

Additional context
Add any other context about the problem here.

[Bugfix] The table created by the lakehouse sink failed to be read


Describe the bug
The table created by the lakehouse sink cannot be read.
This bug comes from user feedback: https://github.com/streamnative/eng-support-tickets/issues/130.

Environment

  • OS: mac
  • Pulsar version: 2.9.3
  • Deployment: standalone
  • Connector: pulsar-io-lakehouse-2.9.2.22.nar

To Reproduce
Steps to reproduce the behavior:

  1. Run the pulsar:2.9.3 locally.
  2. Create a sink connector with the pulsar-io-lakehouse-2.9.2.22.nar.
  3. Create a topic and send a message with Avro schema.
  4. After a short iteration, we get the following file structure:
|-- _delta_log
|   |-- 00000000000000000000.json
|   |-- 00000000000000000001.json
|-- part-0000-cb45d789-d6bf-43d5-adcc-10ea8e2f1f73-c000.snappy.parquet
  5. Run the following command to start Spark:
bin/spark-shell --packages io.delta:delta-core_2.12:1.1.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
  6. Load the files and read them as a Delta table.

Expected behavior
Reading the content should return one record, but in fact nothing is returned.


Test after bugfix

local

Environment

  • OS: mac
  • Pulsar version: 2.9.3
  • Deployment: standalone
  • Connector: pulsar-io-lakehouse-fix.nar
  • Storage: local

Result

Repeat the above steps, and we get the expected result.

cloud

Environment

  • OS: mac
  • Pulsar version: 2.9.3
  • Deployment: standalone
  • Connector: pulsar-io-lakehouse-cloud-fix.nar
  • Storage: AWS-S3

Download files from S3


Result

Conclusion

After the bugfix, the results were as expected.

[Bug] run iceberg sink process failed with Extracted Process death exception

I tried to run the Iceberg sink demo following the pulsar-io-lakehouse sink docs; it failed with the log: "Function Container is dead with following exception. Restarting."

Describe the bug
The lakehouse sink connector is available.

  1. I ran the Iceberg sink demo with the sink command.
    iceberg-sink-config.yaml is:
{
    "tenant":"demo",
    "namespace":"demo",
    "name":"iceberg_sink_test",
    "parallelism":1,
    "inputs": [
      "iceberg_test"
    ],
    "archive": "connectors/pulsar-io-lakehouse-2.9.3.16.nar",
    "processingGuarantees":"EFFECTIVELY_ONCE",
    "configs":{
        "type":"iceberg",
        "maxCommitInterval":120,
        "maxRecordsPerCommit":10000000,
        "catalogName":"icebergSinkConnector",
        "tableNamespace":"euler_live_iceberg",
        "tableName":"qunzhong_flink_memory_usage",
        "catalogProperties":{
            "uri":"xxxx",
            "catalog-impl":"hiveCatalog"
        }
    }
}

The process failed with the exception shown below:

2022-12-06T17:14:53,044+0800 [worker-scheduler-0] INFO  org.apache.pulsar.functions.runtime.process.ProcessRuntime - Started process successfully
2022-12-06T17:14:53,239+0800 [worker-scheduler-0] INFO  org.apache.pulsar.functions.worker.SchedulerManager - Schedule summary - execution time: 0.900067689 sec | total unassigned: 1 | stats: {"Added": 1, "Updated": 0, "removed": 0}
{
  "mmdctppulsar-worker-1" : {
    "originalNumAssignments" : 0,
    "finalNumAssignments" : 1,
    "instancesAdded" : 1,
    "instancesRemoved" : 0,
    "instancesUpdated" : 0,
    "alive" : true
  }
}
2022-12-06T17:15:23,436+0800 [function-timer-thread-33-1] ERROR org.apache.pulsar.functions.runtime.process.ProcessRuntime - Health check failed for iceberg_sink_test-0
java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999) ~[?:?]
	at org.apache.pulsar.functions.runtime.process.ProcessRuntime.lambda$start$1(ProcessRuntime.java:188) ~[org.apache.pulsar-pulsar-functions-runtime-2.9.3.jar:2.9.3]
	at org.apache.pulsar.common.util.Runnables$CatchingAndLoggingRunnable.run(Runnables.java:53) ~[org.apache.pulsar-pulsar-common-2.9.3.jar:2.9.3]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.77.Final.jar:4.1.77.Final]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
	at io.grpc.Status.asRuntimeException(Status.java:535) ~[io.grpc-grpc-api-1.45.1.jar:1.45.1]
	at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:533) ~[io.grpc-grpc-stub-1.45.1.jar:1.45.1]
	at io.grpc.internal.DelayedClientCall$DelayedListener$3.run(DelayedClientCall.java:463) ~[io.grpc-grpc-core-1.45.1.jar:1.45.1]
	at io.grpc.internal.DelayedClientCall$DelayedListener.delayOrExecute(DelayedClientCall.java:427) ~[io.grpc-grpc-core-1.45.1.jar:1.45.1]
	at io.grpc.internal.DelayedClientCall$DelayedListener.onClose(DelayedClientCall.java:460) ~[io.grpc-grpc-core-1.45.1.jar:1.45.1]
	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562) ~[io.grpc-grpc-core-1.45.1.jar:1.45.1]
	at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70) ~[io.grpc-grpc-core-1.45.1.jar:1.45.1]
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743) ~[io.grpc-grpc-core-1.45.1.jar:1.45.1]
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722) ~[io.grpc-grpc-core-1.45.1.jar:1.45.1]
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[io.grpc-grpc-core-1.45.1.jar:1.45.1]
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) ~[io.grpc-grpc-core-1.45.1.jar:1.45.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	... 1 more
Caused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /127.0.0.1:44195
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
	at io.grpc.netty.shaded.io.netty.channel.unix.Errors.newConnectException0(Errors.java:155) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:320) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	... 1 more
2022-12-06T17:15:23,438+0800 [function-timer-thread-33-1] ERROR org.apache.pulsar.functions.runtime.process.ProcessRuntime - Extracted Process death exception
java.lang.RuntimeException:
	at org.apache.pulsar.functions.runtime.process.ProcessRuntime.tryExtractingDeathException(ProcessRuntime.java:404) ~[org.apache.pulsar-pulsar-functions-runtime-2.9.3.jar:2.9.3]
	at org.apache.pulsar.functions.runtime.process.ProcessRuntime.isAlive(ProcessRuntime.java:391) ~[org.apache.pulsar-pulsar-functions-runtime-2.9.3.jar:2.9.3]
	at org.apache.pulsar.functions.runtime.RuntimeSpawner.lambda$start$0(RuntimeSpawner.java:88) ~[org.apache.pulsar-pulsar-functions-runtime-2.9.3.jar:2.9.3]
	at org.apache.pulsar.common.util.Runnables$CatchingAndLoggingRunnable.run(Runnables.java:53) ~[org.apache.pulsar-pulsar-common-2.9.3.jar:2.9.3]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.77.Final.jar:4.1.77.Final]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]
2022-12-06T17:15:23,439+0800 [function-timer-thread-33-1] ERROR org.apache.pulsar.functions.runtime.RuntimeSpawner - demo/demo/iceberg_sink_test Function Container is dead with following exception. Restarting.
java.lang.RuntimeException:
	at org.apache.pulsar.functions.runtime.process.ProcessRuntime.tryExtractingDeathException(ProcessRuntime.java:404) ~[org.apache.pulsar-pulsar-functions-runtime-2.9.3.jar:2.9.3]
	at org.apache.pulsar.functions.runtime.process.ProcessRuntime.isAlive(ProcessRuntime.java:391) ~[org.apache.pulsar-pulsar-functions-runtime-2.9.3.jar:2.9.3]
	at org.apache.pulsar.functions.runtime.RuntimeSpawner.lambda$start$0(RuntimeSpawner.java:88) ~[org.apache.pulsar-pulsar-functions-runtime-2.9.3.jar:2.9.3]
	at org.apache.pulsar.common.util.Runnables$CatchingAndLoggingRunnable.run(Runnables.java:53) ~[org.apache.pulsar-pulsar-common-2.9.3.jar:2.9.3]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.77.Final.jar:4.1.77.Final]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]
2022-12-06T17:15:23,441+0800 [function-timer-thread-33-1] INFO  org.apache.pulsar.functions.runtime.process.ProcessRuntime - Creating function log directory logs//functions/demo/demo/iceberg_sink_test

Environment

  • Pulsar version: 2.9.3
  • Deployment: on-premises cluster, 4 brokers and 1 function worker (the function worker runs as a separate process on a separate machine)
  • pulsar-io-lakehouse-connector version: 2.9.3.16

[Feature request] Add integration tests in the sn-tests repo

Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

[DISCUSS] Whether we need to support more primitive types

Motivation

In #75, we added support for the String and Bytes primitive types in the Lakehouse sink connector. In the general case, this is enough.

We are not sure whether we also need to support the other primitive types that Pulsar supports.

[Bug] Wrong "size" value in _delta_log metadata files in "Pulsar to Lakehouse" sink connector (target format Delta Lake)

Describe the bug
We are using a "Pulsar to Lakehouse" sink connector, trying to offload a Pulsar topic's content to S3 storage in Delta format, exactly as shown here.
The resulting files structure that we get in S3 after a short iteration:

|-- _delta_log
|   |-- 00000000000000000000.json
|   `-- 00000000000000000001.json
|-- part-0000-091d0c83-15ca-43bd-9919-355ab559a3c0-c000.snappy.parquet

When we try to read this content with Spark:

ddl_query = f"CREATE TABLE {database_name}.{table_name}_delta \
                USING DELTA \
                LOCATION 's3://some/bucket'"
spark.sql(ddl_query)
spark.sql(f"select * from {database_name}.{table_name}_delta").show(5)

it fails with an exception:
Caused by: java.lang.RuntimeException: s3a://some/bucket/part-0000-091d0c83-15ca-43bd-9919-355ab559a3c0-c000.snappy.parquet is not a Parquet file. Expected magic number at tail, but found [28, 21, 2, 21]

When debugging, we found that the actual *.parquet file size is 725 bytes, while _delta_log/00000000000000000001.json records a "size" of 16:

{"commitInfo":{"timestamp":1668373624666,"operation":"WRITE","operationParameters":{},"readVersion":0,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"pulsar-sink-connector-version-2.9.1 Delta-Standalone/0.3.0"}}
{"txn":{"appId":"pulsar-delta-sink-connector","version":1,"lastUpdated":1668373624658}}
{"add":{"path":"part-0000-091d0c83-15ca-43bd-9919-355ab559a3c0-c000.snappy.parquet","partitionValues":{},"size":16,"modificationTime":1668373624659,"dataChange":true,"stats":"{}"}}

When we manually edit this file and set "size"=725 (the actual file size), our Spark code works.
We tried this with two versions of the connector: 2.9.1 (the one demonstrated at the link above by StreamNative at the conference) and 2.10.1.12 (the latest at the moment), both with the "cloud" version of the *.nar that writes to S3 and with the one that writes files locally. The result is the same.
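A quick way to confirm this mismatch is to scan the delta log and compare each recorded size against the file on disk. The sketch below assumes a local (or locally mounted) table path and uses only the Python standard library; the function name is ours, not part of the connector:

```python
import glob
import json
import os

def check_delta_log_sizes(table_path):
    """Compare the 'size' recorded for each 'add' action in _delta_log
    commit files against the actual on-disk size of the parquet file.
    Returns a list of (path, recorded_size, actual_size) mismatches."""
    mismatches = []
    pattern = os.path.join(table_path, "_delta_log", "*.json")
    for log_file in sorted(glob.glob(pattern)):
        with open(log_file) as f:
            for line in f:
                action = json.loads(line)
                add = action.get("add")
                if not add:
                    continue  # skip commitInfo, txn, and other actions
                data_file = os.path.join(table_path, add["path"])
                actual = os.path.getsize(data_file)
                if actual != add["size"]:
                    mismatches.append((add["path"], add["size"], actual))
    return mismatches
```

On a table exhibiting this bug, the function would report the parquet file with the recorded size of 16 against its actual 725 bytes.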

To Reproduce
Steps to reproduce the behavior:

  1. Run Pulsar (in our case it was a standalone instance in Docker from an image apachepulsar/pulsar:2.10.2)
  2. Download a file "pulsar-io-lakehouse-2.10.1.12.nar", copy it inside a container's class path
  3. Create a topic in Pulsar
  4. Instantiate a connector: bin/pulsar-admin sinks create --sink-config-file /tmp/jar/config.yaml
    the content of config.yaml:
tenant: public
namespace: default
name: my-topic-2
parallelism: 1
inputs:
- my-topic-2
archive: /tmp/jar/pulsar-io-lakehouse-2.10.1.12.nar
processingGuarantees: EFFECTIVELY_ONCE
configs:
 type: delta
 maxCommitInterval: 120
 maxRecordsPerCommit: 10000000
 tablePath: file:///tmp/output/test_sink_v1
 processingGuarantees: "EXACTLY_ONCE"
 deltaFileType: "parquet"
 subscriptionType: "Failover"
  5. Make sure that the connector works: bin/pulsar-admin sinks status --name my-topic-2
  6. Produce something to the topic
  7. Observe the output files from the connector

Expected behavior
We expect the file size recorded in the _delta_log/*.json files to match the physical size of the *.parquet files produced by this sink connector (when the Delta target format is chosen), so that downstream applications, such as Spark, can read the result as Delta format without throwing exceptions.

Environment

  • OS: inside a docker: apachepulsar/pulsar:2.10.2
  • Pulsar version: 2.10.2
  • Deployment: standalone
  • Connector/offloader/protocol handler/... version: pulsar-io-lakehouse-2.10.1.12.nar

[Bug] Delta Sink Connector does not work on M1 Macbook Pro

The sink connector does not work when writing a Delta table to a local file path on an M1 MacBook Pro.
When testing the Lakehouse Delta sink connector in Pulsar standalone mode and writing to a local file path, the connector creates parquet files, but the records processed from the input topic are never written to them.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy Pulsar Standalone (version 3.1.0)
  2. Deploy Lakehouse Sink Connector (version 2.11.0)
  3. Produce messages to the input topic for the sink connector using a JSON Schema.
  4. The sink connector logs show that the records were received, yet only empty (zero-byte) parquet files are created.
  5. Query the delta table to confirm, e.g. using Spark; the Spark dataframe shows that the delta table only has the schema and no data.

Expected behavior
The sink connector should create parquet files of non-zero bytes and reflect the data being received by the sink connector.

Screenshots
(screenshot attached to the original issue)

Environment

  • OS: Mac M1 OS Sonoma
  • Pulsar version: 3.1.0
  • Deployment: Standalone
  • Connector version: 2.11.0
  • Connector Type: Delta Sink

Additional context
Sink Connector logs when the message is received from input topic:

2023-10-31T16:48:51,547+0500 [lakehouse-io-1-1] INFO  io.delta.standalone.internal.DeltaLogImpl - Loading version 0.
2023-10-31T16:48:51,562+0500 [lakehouse-io-1-1] INFO  io.delta.standalone.internal.SnapshotImpl - [tableId=099106fa-dca5-46ac-89dd-217a4805e0cc] Created snapshot io.delta.standalone.internal.SnapshotImpl@ecf44c3
2023-10-31T16:48:51,837+0500 [lakehouse-io-1-1] INFO  io.delta.standalone.internal.DeltaLogImpl - Returning initial snapshot io.delta.standalone.internal.SnapshotImpl@ecf44c3
2023-10-31T16:48:51,893+0500 [lakehouse-io-1-1] INFO  org.apache.hadoop.io.compress.CodecPool - Got brand-new compressor [.snappy]
2023-10-31T16:48:51,990+0500 [lakehouse-io-1-1] INFO  org.apache.pulsar.ecosystem.io.lakehouse.parquet.DeltaParquetFileWriter - open: file:///tmp/delta_sink_test/part-0000-d712a4a2-c377-4735-ab5e-1b691b0090f7-c000.snappy.parquet parquet writer succeed. org.apache.parquet.hadoop.ParquetWriter@3c1a110b
2023-10-31T16:48:51,992+0500 [lakehouse-io-1-1] INFO  org.apache.pulsar.ecosystem.io.lakehouse.parquet.DeltaParquetFileWriter - start to close internal parquet writer

This is the last visible log line; the connector then hangs here.

After some debugging, I determined that, crucially, the DeltaParquetFileWriter never successfully closes its writer object, so the thread blocks and the connector cannot close and commit files.

@horizonzy and I explored this issue further:
Looking at the Java stack trace of the connector process, we saw that the lakehouse-io thread was in a WAITING state, which is likely why close() never returned. The thread is explicitly marked as running, and is only removed from the running state when an exception occurs; however, there were no exceptions and no stack trace to indicate that this was the case. Yan suggested we catch Throwable instead of Exception. After making that change and running the sink connector, we saw this error:

ERROR org.apache.pulsar.ecosystem.io.lakehouse.sink.SinkWriter - process record failed. 
org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] no native library is found for os.name=Mac and os.arch=aarch64

Updating the snappy-java dependency, repackaging the connector, and retesting solved the issue.
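For reference, the fix can be sketched as a dependency override in the connector's pom.xml. The coordinates are the standard snappy-java ones; the exact version below is an assumption — any release that ships an osx-aarch64 (Apple Silicon) native library should work:

```xml
<!-- Illustrative override: pin a snappy-java release that bundles an
     osx-aarch64 native library; pick a current release in practice. -->
<dependency>
  <groupId>org.xerial.snappy</groupId>
  <artifactId>snappy-java</artifactId>
  <version>1.1.10.5</version>
</dependency>
```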

[Enhancement request] Master and branch-2.11.x support JDK 17

Is your enhancement request related to a problem? Please describe.
Pulsar 2.11.0+ uses JDK 17 to build and run, so the related plugins also need to support running on JDK 17.

Describe the solution you'd like
Master and branch-2.11.x support JDK 17


[Bug] NPE when closing the sink connector

Describe the bug
2022-06-17T00:41:45,756+0000 [public/default/lakehouse1-0] ERROR org.apache.pulsar.functions.instance.JavaInstanceRunnable - Failed to close sink
java.lang.NullPointerException: null
	at org.apache.pulsar.ecosystem.io.lakehouse.SinkConnector.close(SinkConnector.java:93) ~[GwrwkDjp5lHCnEV7YGIWyA/:?]
	at org.apache.pulsar.functions.instance.JavaInstanceRunnable.close(JavaInstanceRunnable.java:451) [io.streamnative-pulsar-functions-instance-2.10.0.3.jar:2.10.0.3]
	at org.apache.pulsar.functions.instance.JavaInstanceRunnable.run(JavaInstanceRunnable.java:319) [io.streamnative-pulsar-functions-instance-2.10.0.3.jar:2.10.0.3]
	at java.lang.Thread.run(Thread.java:829) [?:?]
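A minimal sketch of the likely fix: guard each resource with a null check so that close() succeeds even when the connector never finished initializing. The class below is hypothetical and merely stands in for the real SinkConnector:

```java
// Hypothetical sketch: a sink whose close() tolerates a failed or
// skipped open(), avoiding the NullPointerException seen above.
public class SafeCloseSink implements AutoCloseable {
    private AutoCloseable writer; // remains null if open() never ran

    public void open(AutoCloseable w) {
        this.writer = w;
    }

    @Override
    public void close() throws Exception {
        if (writer != null) {   // guard: writer may never have been created
            writer.close();
            writer = null;      // also makes close() idempotent
        }
    }
}
```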
