vertica / pstl

Parallel Streaming Transformation Loader

Home Page: https://www.vertica.com/services/

License: Apache License 2.0

Topics: bigdata, vertica, data-mining, data-science, ingestion, streaming-data, realtime-messaging, etl-pipeline, hadoop

pstl's Introduction

PSTL

One of the main hurdles in big data analytics has been the inability to ingest and transform large amounts of data from multiple sources in real time. While most organizations employ data analysts and data scientists, the reality is that these specialists spend most of their time (up to 80%, according to some generally accepted estimates) on data preparation: collecting data sets and cleaning and organizing data.

Parallel Streaming Transformation Loader (PSTL) from Vertica Professional Services is a Big Data solution that dramatically reduces both the time and the latency involved in real-time data collection, loading, and transformation.

To learn more about PSTL, click here.

Features

  • A Spark application with out-of-the-box integration from Kafka to Vertica and Hadoop; integration with other data systems via no-code configurations
  • A no-ETL, no-ELT, no-code SQL streaming solution
  • A single set of semantics across multiple sinks (Vertica, Kafka, Hive tables, OpenTSDB, or Spark Datasets)
  • Out-of-the-box support for Confluent Kafka sources
  • Processing of semi-structured JSON, Avro, Protobuf, delimited, and CSV data into optimized data at rest
  • Advanced job management for Spark Streaming jobs
  • A no-code approach to Change Data Capture, Slowly Changing Dimensions, and Streaming Table Mappings from external JDBC connectors
  • A simple extensibility model for data validation and transformations

Contributing

If you are interested in fixing issues and contributing directly to the code base, please see the document How to Contribute.

Building

Prerequisites

To build PSTL, you will need to make sure the following components are available in your build environment:

  • Java 8
  • Maven 3
  • RPM Build tools

If you use brew, this is generally as simple as:

brew update
brew tap caskroom/versions
brew cask install java8
brew install maven
brew install rpm
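
To confirm each tool is available on your PATH before building, you can run the usual version checks:

java -version
mvn -v
rpmbuild --version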

Tarball(s)

To build PSTL and generate a tarball distribution, simply run the following:

mvn -DskipTests clean install -Passembly

The generated tarball will be located under pstl-assembly/target. An unpacked copy of the same distribution (the directory without the .tar.gz suffix) will also be present in pstl-assembly/target for local use. Generally, the only work required on your part to run the full distribution locally is to provide SPARK_HOME in conf/pstl-env.sh.
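
For example, assuming pstl-env.sh is a sourced shell script (which its name suggests) and using a hypothetical version suffix in place of whatever directory name your build actually produces:

cd pstl-assembly/target/pstl-assembly_2.11-<version>
echo 'export SPARK_HOME=/path/to/spark' >> conf/pstl-env.sh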

RPM(s)

To build PSTL and generate an RPM distribution, simply run the following:

mvn -DskipTests clean install -Passembly-rpm

The generated RPM will be located under pstl-assembly-rpm/target/rpm/pstl-assembly-rpm_2.11/RPMS/noarch/. Installing the RPM is as simple as rpm -ivh /path/to/rpm; uninstalling it is as simple as rpm -e pstl-assembly_2.11. By default, the RPM installs to /usr/share/pstl.
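
For example (the RPM file name placeholder below is hypothetical and depends on the build version; rpm -ql simply lists the installed files as a sanity check):

rpm -ivh pstl-assembly-rpm/target/rpm/pstl-assembly-rpm_2.11/RPMS/noarch/<pstl-rpm-name>.noarch.rpm
rpm -ql pstl-assembly_2.11 | head
rpm -e pstl-assembly_2.11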

Documentation

Please read the documentation here.

Contact

To contact us, please e-mail this address.

pstl's People

Contributors

arumugaguru-muruganandaguru

pstl's Issues

Deploy job error - ClassNotFoundException: local

Hi,
When we deploy a simple job (like the example in the Getting Started wiki) we get this exception:

Exception in thread "main" java.lang.ClassNotFoundException: local
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at akka.actor.ReflectiveDynamicAccess$$anonfun$getClassFor$1.apply(DynamicAccess.scala:67)
        at akka.actor.ReflectiveDynamicAccess$$anonfun$getClassFor$1.apply(DynamicAccess.scala:66)
        at scala.util.Try$.apply(Try.scala:161)
        at akka.actor.ReflectiveDynamicAccess.getClassFor(DynamicAccess.scala:66)
        at akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:84)
        at akka.actor.ActorSystemImpl.liftedTree1$1(ActorSystem.scala:585)
        at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:578)
        at akka.actor.ActorSystem$.apply(ActorSystem.scala:142)
        at akka.actor.ActorSystem$.apply(ActorSystem.scala:119)
        at com.microfocus.pstl.Driver$.main(Driver.scala:31)
        at com.microfocus.pstl.Driver.main(Driver.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:750)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Any idea how to solve this?

I saw that you mentioned the Class Not Found issue in the Operations Guide (https://github.com/vertica/PSTL/wiki/Operations-Guide), but I am not sure how to solve it.

Thanks
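
A hedged hunch rather than a confirmed fix: the trace shows the ActorSystem trying to load a class literally named local, which is what happens when akka.actor.provider is set to the shorthand local under an Akka version that predates shorthand support (the shorthands arrived in Akka 2.5). If that is the case here, spelling out the fully qualified provider in the Akka configuration should get past this particular error:

akka.actor.provider = "akka.actor.LocalActorRefProvider"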

Job failed during deploy with Spark 2.3

Hello,

We compiled the new PSTL with Spark 2.3 and the RPM built successfully, but when we tried to run a PSTL stream-to-stream join example, we got the following error:

java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.analysis.FunctionRegistry.registerFunction(Lorg/apache/spark/sql/catalyst/FunctionIdentifier;Lscala/Function1;)V
at com.microfocus.pstl.spark.SparkExtension$$anonfun$withSession$1$$anonfun$apply$1.apply(SparkExtension.scala:56)
at com.microfocus.pstl.spark.SparkExtension$$anonfun$withSession$1$$anonfun$apply$1.apply(SparkExtension.scala:55)
at scala.collection.immutable.List.foreach(List.scala:381)
at com.microfocus.pstl.spark.SparkExtension$$anonfun$withSession$1.apply(SparkExtension.scala:55)
at com.microfocus.pstl.spark.SparkExtension$$anonfun$withSession$1.apply(SparkExtension.scala:52)
at scala.collection.immutable.Stream.foreach(Stream.scala:594)
at com.microfocus.pstl.spark.SparkExtension.withSession(SparkExtension.scala:52)
at com.microfocus.pstl.Driver$.main(Driver.scala:38)
at com.microfocus.pstl.Driver.main(Driver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
19/04/23 04:00:20 INFO spark.SparkContext: Invoking stop() from shutdown hook
19/04/23 04:00:20 INFO server.AbstractConnector: Stopped Spark@3adbe50f{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
19/04/23 04:00:20 INFO ui.SparkUI: Stopped Spark web UI at http://15.120.132.182:4040
19/04/23 04:00:20 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/04/23 04:00:20 INFO memory.MemoryStore: MemoryStore cleared
19/04/23 04:00:20 INFO storage.BlockManager: BlockManager stopped
19/04/23 04:00:21 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
19/04/23 04:00:21 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/04/23 04:00:21 INFO spark.SparkContext: Successfully stopped SparkContext
19/04/23 04:00:21 INFO util.ShutdownHookManager: Shutdown hook called
19/04/23 04:00:21 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-e6c8e20b-ff90-44d9-87c4-8d9045262701

JobGuardian should shut down so TRIGGER ONCE works as expected

Currently, JobGuardian remains online if underlying JobDaemons and QueryDaemons die off intentionally (e.g., they are done). In a TRIGGER ONCE workload, a QueryDaemon stops after processing a micro-batch. If that was the only query in a job, JobDaemon stops after QueryDaemon stops. JobGuardian stops in response to this (if there was only one job launched in this JVM), but the JVM remains online.

From JobGuardian:

private def maybeShutdownCleanly(): Unit = {
  if(context.children.isEmpty) {
    context.stop(self) // stops only this actor; the JVM stays online
  }
}

Should be:

private def maybeShutdownCleanly(): Unit = {
  if(context.children.isEmpty) {
    CoordinatedShutdown(context.system).run() // terminates the actor system so the JVM can exit
  }
}
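
Note that whether CoordinatedShutdown also calls System.exit is controlled by standard Akka configuration, independent of PSTL. If the JVM still lingers after the actor system terminates, this stock Akka setting (shown as an assumption about the deployment, not as PSTL documentation) forces the exit:

akka.coordinated-shutdown.exit-jvm = on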
