
vertica / spark-connector


This component acts as a bridge between Spark and Vertica, allowing the user to either retrieve data from Vertica for processing in Spark, or store processed data from Spark into Vertica.

License: Apache License 2.0

Scala 98.14% Dockerfile 0.53% Shell 1.33%

spark-connector's People

Contributors

ai-bq, alexey-temnikov, alexr-bq, aryex, ches, jbobson98, jeremyprime, jonathanl-bq, nerdlogic, ravjotbrar, valentina-bq


spark-connector's Issues

Merge less cols without copy_column_list

Currently, the connector requires a copy column list if the number of columns in the DataFrame is less than the number of columns in the target table. This will be fixed by inferring the schema from the tempTable when building the merge statement (see the sketch after the acceptance criteria below).

Df schema: (col1, col2, col3)
Target table schema: (col1, col2, col3, col4)

Acceptance criteria: only the first three columns in the target table should be affected.
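
A rough sketch of the scenario, assuming the V2 source name and option keys ("com.vertica.spark.datasource.VerticaSource", "merge_key", "staging_fs_url", and the connection options) as they commonly appear in connector examples; the values are placeholders:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().appName("merge-fewer-cols").getOrCreate()
    import spark.implicits._

    // The DataFrame has only col1, col2, col3; the target table also has col4.
    val df = Seq((1, "a", 10.0), (2, "b", 20.0)).toDF("col1", "col2", "col3")

    df.write
      .format("com.vertica.spark.datasource.VerticaSource")
      .option("host", "vertica-host")
      .option("db", "testdb")
      .option("user", "dbadmin")
      .option("password", "")
      .option("staging_fs_url", "hdfs://hdfs-host:8020/data/staging")
      .option("table", "target_table")
      .option("merge_key", "col1")      // merge on col1
      // No copy_column_list: the connector should infer the columns from the temp
      // table so that only col1-col3 of the target table are affected.
      .mode(SaveMode.Append)
      .save()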

Publish non-uber jars and/or relocate vendor libraries

Hi,

It seems the connector is published as an assembly JAR, without relocating packages of the third-party libraries. That's generally a frowned-upon practice.

For me it manifests as a problem because the connector depends on Scalactic in compile scope. Your version of Scalactic has a binary incompatibility with an older ScalaTest version that I'm currently limited to using in the test suite of an industrial project. I can't use build tool facilities to try excluding your Scalactic version, because it isn't a managed transitive dependency; vertica-spark brings it directly onto the classpath in /org/scalactic. This is hard to debug, since the source of errors is not the Scalactic version that I see in my dependency tree from my ScalaTest version; it's a sneaky one.

The connector might be able to eliminate its use of Scalactic as a runtime dependency, but that isn't the root issue; the same problem will likely arise with another vendored library for other users.

I understand there is a desire for shaded uber-jar packages in some cases. I believe good practice, and what I'd like to suggest, is:

  1. Publish an artifact with no shading, allowing users to manage transitive dependencies with the facilities of their build tools.
  2. Publish an artifact with shaded dependencies under a classifier such as shaded or assembly, as an expedient option for users that find binary compatibility problems in their builds.
  3. The shaded artifact must relocate all vendor packages to avoid classpath conflicts in user builds.

I feel the first should be the default artifact, with no classifier—it's best to promote modularity as the first option—but you might invert them if you prefer.

I can try taking a stab at the sbt setup for this, if maintainers agree with it.
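
Roughly, the sbt setup I have in mind would look something like the snippet below (build.sbt, assuming the sbt-assembly plugin; the relocation prefix is only an example):

    // Relocate vendored packages inside the shaded artifact.
    assembly / assemblyShadeRules := Seq(
      ShadeRule.rename("org.scalactic.**" -> "com.vertica.spark.shaded.org.scalactic.@1").inAll
      // ...and similar rules for the other vendored libraries.
    )

    // Publish the assembly jar under an "assembly" classifier,
    // alongside the default (unshaded) artifact.
    Compile / assembly / artifact := {
      val art = (Compile / assembly / artifact).value
      art.withClassifier(Some("assembly"))
    }
    addArtifact(Compile / assembly / artifact, assembly)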

Thanks for your work on the new connector!

Add option to create external table out of written connector data

This covers half of the requested external table functionality. The scope here is replacing the copy step of the connector with a create external table step.

Outline of feature:

  • New option: create_external_table - boolean defaulting to false
  • Normal table should not be created
  • Copy and Cleanup steps disabled
  • Create external table step added
  • Verification of external table step added

Acceptance criteria:

  • Unit tests
  • Integration tests -- include String length mismatch and Decimal precision mismatch
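
A sketch of what the write path might look like with this option, assuming the V2 source name and the other option keys; only "create_external_table" comes from this issue, everything else is a placeholder:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().appName("external-table-write").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

    df.write
      .format("com.vertica.spark.datasource.VerticaSource")
      .option("host", "vertica-host")
      .option("db", "testdb")
      .option("user", "dbadmin")
      .option("staging_fs_url", "hdfs://hdfs-host:8020/data/external")
      .option("table", "ext_table")
      // Skip the normal table creation, copy, and cleanup; create an external table instead.
      .option("create_external_table", "true")
      .mode(SaveMode.Append)
      .save()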

Spike: External Table Support

Explore functionality and prototype.

  1. Add option for creating an external table rather than loading data into Vertica
     • Compare the schema used with INFER_EXTERNAL_TABLE_DDL
     • Test reading from the external table afterwards

  2. Add option for skipping the write step, and only creating the external table based on existing parquet files
     • Test with partitioned parquet files

Acceptance criteria:

  • Document outlining solution
  • Prototype
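
For the schema comparison step, a rough JDBC sketch (the exact INFER_EXTERNAL_TABLE_DDL signature varies across Vertica versions, so verify it against the docs for the target server; host, credentials, and paths are placeholders):

    import java.sql.DriverManager

    val conn = DriverManager.getConnection(
      "jdbc:vertica://vertica-host:5433/testdb", "dbadmin", "")
    try {
      val stmt = conn.createStatement()
      // Ask Vertica what DDL it would infer for the exported parquet files,
      // then compare that with the schema the connector used for the export.
      val rs = stmt.executeQuery(
        "SELECT INFER_EXTERNAL_TABLE_DDL('hdfs:///data/external/*.parquet', 'ext_table')")
      while (rs.next()) println(rs.getString(1))
    } finally {
      conn.close()
    }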

Contribution to Spark: "done" state on read

The Spark V2 datasource API read side does not have any mechanism for performing overall operation cleanup on the driver node once the operation is completed.

This issue proposes contributing that functionality to Spark: a function called upon operation completion, with context indicating whether the operation failed or succeeded.
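
A hypothetical sketch of the kind of hook this proposal describes; this is not an existing Spark API, just an illustration of the desired driver-side callback:

    import org.apache.spark.sql.connector.read.Scan

    // Hypothetical extension point layered on top of the existing DSv2 Scan.
    trait ScanWithCleanup extends Scan {
      /** Called once on the driver when the read operation finishes.
        * @param succeeded whether the overall operation completed successfully */
      def onReadCompleted(succeeded: Boolean): Unit
    }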

Fix scripts for Pyspark and s3 examples

The PySpark example doesn't include a script to spin up a Spark cluster; as a result, without the correct initial setup, running the "run-s3-example.sh" script may fail with an error.

This change adds automation for the PySpark example to make it easier to run. Previously, we didn't install Python3 or set up the Spark cluster, so this makes it easier to get started.

It also adds comments to the Python script to make it easier for people unfamiliar with PySpark to understand what the script is doing.

Spike: complex types

Currently, the V2 connector does not support complex data types (ARRAY, MAP, ROW). Trying to convert Vertica's ARRAYs into Spark ArrayTypes is a bit complicated due to ArrayType being abstract.

  • Gather knowledge from the Vertica team on what changed with complex types in 11.0 - specifically with regard to parquet export and copy
  • Prototype
  • Split into task(s)
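
A minimal sketch of the direction this spike explores: mapping a Vertica ARRAY element type to a Spark ArrayType. The type-name handling is illustrative only, not a complete mapping:

    import org.apache.spark.sql.types._

    def verticaArrayToSpark(elementTypeName: String): ArrayType = {
      val elementType: DataType = elementTypeName.toLowerCase match {
        case "int" | "integer"  => LongType      // Vertica INTEGER is 64-bit
        case "float"            => DoubleType
        case "varchar" | "char" => StringType
        case "boolean"          => BooleanType
        case other              => throw new IllegalArgumentException(s"Unsupported element type: $other")
      }
      ArrayType(elementType, containsNull = true)
    }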

Determine new best default values for row group and file size

There is a bug, now fixed, that may cause Vertica to reserve much more disk space than needed for an export. For maximal support of older Vertica versions, we should investigate how the row group and file size parameters affect this behavior, and change the default values accordingly.

Acceptance criteria:

  • Report on how these parameters affect this bug
  • Change of default parameters to reflect this
  • Change to performance guide reflecting how to modify these parameters for optimal performance

Set Parquet Export Page Size Dynamically

There is an undocumented option in the parquet export: pageSizeKB.
Lowering it reduces the total memory that must be reserved per thread during parquet export, allowing for more threads.
The per-thread memory equation is:
(2 x RowGroupSize) + (4 x ColNum x PageSize)

  1. Experiment with setting this lower -- see the performance impact in a non-memory-bound setting

  2. Assuming the decrease is not significant -- or is deemed worth the tradeoff -- set the page size relative to the row group size. Something like:
    pageSize = max(8kb, rowGroupSize/colCount)
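
A worked example of the per-thread equation and the proposed rule; the numbers are illustrative, not recommended defaults:

    val rowGroupSizeBytes = 16L * 1024 * 1024   // 16 MB row groups
    val colCount          = 64
    val pageSizeBytes     = 1024L * 1024        // example current page size: 1 MB

    // (2 x RowGroupSize) + (4 x ColNum x PageSize)
    val memoryPerThread = 2 * rowGroupSizeBytes + 4 * colCount * pageSizeBytes
    println(s"Reserved per thread: ${memoryPerThread / (1024 * 1024)} MB")  // 32 + 256 = 288 MB

    // Proposed rule: pageSize = max(8 KB, rowGroupSize / colCount)
    val proposedPageSize   = math.max(8L * 1024, rowGroupSizeBytes / colCount)  // 256 KB here
    val memoryWithProposal = 2 * rowGroupSizeBytes + 4 * colCount * proposedPageSize
    println(s"With proposed page size: ${memoryWithProposal / (1024 * 1024)} MB")  // 32 + 64 = 96 MB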

Release 2.0.1

Acceptance Criteria:

  • external tables use case 1
  • YARN
  • sparklyr example
  • Merge statements
  • create changelog
  • run integration tests and manual tests on final build
  • create GitHub artifacts

Determine Optimal Partition Count

The working hypothesis: the partition count should be at or near the number of cores in the Spark cluster.

The goal of this issue is to thoroughly test this and get solid numbers on performance at different partition counts versus the cores available.
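
A sketch of the kind of read the benchmark would run, with the partition count tied to the cores available; the "num_partitions" option name and the source name are assumptions to confirm against the connector docs:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("partition-count-test").getOrCreate()
    val coreCount = spark.sparkContext.defaultParallelism  // roughly the total cores in the cluster

    val df = spark.read
      .format("com.vertica.spark.datasource.VerticaSource")
      .option("host", "vertica-host")
      .option("db", "testdb")
      .option("user", "dbadmin")
      .option("table", "big_table")
      .option("staging_fs_url", "hdfs://hdfs-host:8020/data/staging")
      .option("num_partitions", coreCount.toString)
      .load()

    println(s"Partitions: ${df.rdd.getNumPartitions}, cores: $coreCount")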

Test with Databricks Spark

Look into Databricks Spark and how it can work with our connector.

  • What licensing is needed?
  • What differences are there?
  • Run full test suite if possible

SSO: Kerberos Integration Tests in CI/CD (Github Actions)

  1. Move the Docker code from the custom repo to the spark-connector repo.
  2. Fine-tune the Docker code to ensure it is fully automated and doesn't require manual steps.
  3. Run the automation from the GitHub Action on the PR event.
  4. Ensure we do not impact the current non-Kerberos integration tests in GitHub Actions.

Update examples and functional tests to use WebHDFS

WebHDFS will be supported as a default configuration. Examples and functional tests need to be updated to use WebHDFS. More information about WebHDFS can be found here.

The following changes will be performed:

  1. Documentation will be updated to recommend using WebHDFS.
  2. Example applications will be updated to use WebHDFS.
  3. Integration tests will be updated to use WebHDFS.
  4. A small number of tests are expected to remain on HDFS.
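
In the examples and tests, the change amounts to pointing the intermediate storage at WebHDFS instead of plain HDFS. A sketch, assuming the "staging_fs_url" option key and using placeholder hosts and ports:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("webhdfs-example").getOrCreate()

    val opts = Map(
      "host"           -> "vertica-host",
      "db"             -> "testdb",
      "user"           -> "dbadmin",
      "table"          -> "dftest",
      // before: "hdfs://hdfs-host:8020/data"
      "staging_fs_url" -> "webhdfs://hdfs-host:50070/data"
    )

    val df = spark.read
      .format("com.vertica.spark.datasource.VerticaSource")
      .options(opts)
      .load()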

Application Flow Diagram

Acceptance Criteria:

Create a high level application flow diagram, with accompanying lower-level read and write flow diagrams. These diagrams should help contributors understand the connector and how the different components work together in the project.

Use a smart client_label on each connection to vertica

Each connection made to Vertica should set a client label that identifies its purpose. This helps identify and track down problems. It also has the added benefit of helping quantify customer usage if customers share their scrutinize data with Vertica.

Here are some thoughts on what that could look like. Some of these might not be possible, and we could add more later if needed.

vspark-<cluster id>[-config details][--purpose]
where <cluster id> is some scrubbed string that identifies this Spark cluster,
and where the optional [-config details] might be:
-vs - the version of the vertica spark connector
-sp - the version of spark
-sc - the version of scala
-py - the version of python
-ps - the version of pyspark
-n - the spark node number?
-m - the number of spark nodes in cluster
-k - kerberos was used to connect
-p - password was used to connect
-c - increment for each connection made from this node

e.g.
vspark-MyCluster5-vs2.0.1-sp3.1.0-sc2.13.1-n3-m12-k-c1--main
vspark-MyCluster5-vs2.0.1-sp3.1.0-sc2.13.1-n3-m12-k-c2--background_monitoring
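
A sketch of building such a label and attaching it to a connection. The Vertica JDBC "Label" connection property is used here on the assumption that it sets the session's client label; verify against the JDBC driver documentation. All values are placeholders:

    import java.sql.DriverManager
    import java.util.Properties

    val clusterId = "MyCluster5"  // scrubbed string identifying this Spark cluster
    val label = s"vspark-$clusterId-vs2.0.1-sp3.1.0-sc2.13.1-n3-m12-k-c1--main"

    val props = new Properties()
    props.setProperty("user", "dbadmin")
    props.setProperty("password", "")
    props.setProperty("Label", label)

    val conn = DriverManager.getConnection("jdbc:vertica://vertica-host:5433/testdb", props)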

[Spike] S3 Kerberos Authentication

  1. Identify environment and effort to configure environment
  2. Reach out to Benjamin to get more details on how Kerberos with S3 is used or anticipated to be used
  3. Build prototype for S3 Kerberos (might be a manual test / environment configuration)

Acceptance Criteria:

  1. Detailed scope and estimations for S3 Kerberos support

Remove unnecessary methods in TestUtils

It is currently wasteful to provide all the functionality in TestUtils with every example. Therefore, we can remove the unnecessary methods and possibly inline the rest.

Prototype for Merge statements


With the implementation of merge statements in our connector, users can process raw data in Spark and merge that data with an existing table in Vertica. The processed data will be written to a temporary table in Vertica before being merged. In the existing table, matched records will be updated and new records will be inserted, effectively constituting an “Upsert”. In order to execute a merge statement in our Spark Connector, the user needs to pass in a mergeKey, which will likely be an array of column attributes to join the existing table and temporary table on. If this option is not populated, a write will be performed as usual without the use of a temporary table.

Example use case: https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/Tables/MergeTables/MergeExample.htm

Cache dependencies in GitHub Actions workflows

Currently, for some of our GitHub Actions workflows (e.g. S3 integration test runs), we use wget to fetch dependencies such as Spark and Hadoop. This takes a long time and may not be reliable if the URL changes or goes down. To solve this, we should cache dependencies. This repo seems to detail a way to make that possible: https://github.com/actions/cache

Refactor examples' script code

There is common code between all the scripts for each example that should be extracted out to minimize redundancy. A possible solution is to move all of this code into the Dockerfile so that it is only run when the containers are composed up.

Set up Eon Accelerator to test our connector

  • As a user, you will log into https://www.vertica.com/accelerator and use that interface to talk indirectly with your own secure cloud. The link between the customer cloud and the Vertica cloud is established during onboarding using cross-account IAM access. There is currently no fee for the duration of the early access program; the customer just needs to take care of their own AWS bill. It comes with 5 databases per account and a primary plus 3 sub-clusters per database.

What do I need to get started?

  • AWS account
  • AWS admin access to set up cross-account IAM access

Merge Statements

With the implementation of merge statements in our connector, users can process raw data in Spark and merge that data with an existing table in Vertica. The processed data will be written to a temporary table in Vertica before being merged. In the existing table, matched records will be updated and new records will be inserted, effectively constituting an “Upsert”. In order to execute a merge statement in our Spark Connector, the user needs to pass in a mergeKey, which will likely be an array of column attributes to join the existing table and temporary table on. If this option is not populated, a write will be performed as usual without the use of a temporary table.

Example use case: https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/Tables/MergeTables/MergeExample.htm
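
A sketch of the proposed user-facing flow; the "merge_key" option name, the source name, and the shape of the generated statement are assumptions based on this issue, not a final API:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().appName("merge-demo").getOrCreate()
    import spark.implicits._

    val updates = Seq((1, "alice", 30), (4, "dana", 22)).toDF("id", "name", "age")

    updates.write
      .format("com.vertica.spark.datasource.VerticaSource")
      .option("host", "vertica-host")
      .option("db", "testdb")
      .option("user", "dbadmin")
      .option("staging_fs_url", "hdfs://hdfs-host:8020/data/staging")
      .option("table", "people")
      .option("merge_key", "id")  // key(s) joining the temporary table and the target
      .mode(SaveMode.Append)
      .save()

    // Conceptually, the connector would then run something like:
    //   MERGE INTO people t USING tmp_people s ON t.id = s.id
    //   WHEN MATCHED THEN UPDATE SET name = s.name, age = s.age
    //   WHEN NOT MATCHED THEN INSERT (id, name, age) VALUES (s.id, s.name, s.age);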

Write Pipe Tests:

  • It should create a temporary table if merge key exists

  • It should copy data to a temporary table if merge key exists

Integration Tests:

  • It should merge with an existing table in Vertica

Additional Requirements:

  • Make mergeKey a List of mergeKeys

  • What happens if we’re in overwrite mode when we also want to perform a merge?
    • Possible solution: force append mode in the logic if merge key exists

  • Consider when schema of target table is different from dataframe (using copyColumnList)
  • True temporary table created as part of session
  • Performance testing comparing a merge to a normal write
  • Side note: if the merge key is not unique in the data frame, merge won’t work

Merge Statements Demo

Create a demo to show merge statements functionality:

  1. Regular merge with overwrite mode
  2. Using copy column list
  3. Multiple columns

Important issues to note:

  • Merge key columns must have the same names in the target table and the df
  • Merge key values must be unique in the df
  • The number of columns in the df must be less than or equal to the number of columns in the target table (as long as we specify which columns are affected in the target table using a copy column list)

Schema mismatch: option for logging example data

It's been requested to log an example of data that doesn't match the schema. If added, this should be an option, as logging real data by default is a security concern.

It would be good to check this issue together with #293.

Change Spark download mirror site

The current mirror site in our scripts is not functional. This issue is evident when we try to run the S3 integration tests using the script that downloads Spark and Hadoop. The same link is also used in the PySpark example. We will change to a link in the Apache archive.

Audit Existing V1 open bugs

AC: Review existing V1 open bugs and create tickets for use cases that should be added to V2 coverage, ensuring that a) the bug is resolved and b) coverage is added for that use case if it didn't exist before.
