This project is forked from datareply/kafka-connect-document-source.

Kafka connector with content extraction that pushes extracted document contents to Kafka.

kafka-connect-document-source

The connector loads data extracted from documents (PDF, Word, ...) into Kafka.

Building

You can build the connector with Maven using the standard lifecycle phases:

mvn clean
mvn package

Sample Configuration

name=document-source
connector.class=org.apache.kafka.connect.document.DocumentSourceConnector
tasks.max=1
schema.name=test_schema3
topic=test_topic3
files=/path/to/file/filename1.pdf,/path/to/file/filename2.docx
files.prefix=prefix
content.extractor=tika
output.type=text_xml

Or as JSON:

{
	"name": "document-source",
	"connector.class": "org.apache.kafka.connect.document.DocumentSourceConnector",
	"tasks.max": "1",
	"schema.name": "test_schema3",
	"topic": "test_topic3",
	"files": "/path/to/file/filename1.pdf,/path/to/file/filename2.docx",
	"files.prefix": "prefix",
	"content.extractor": "tika",
	"output.type": "text_xml"
}
  • name: name of the connector
  • connector.class: class of the implementation of the connector
  • tasks.max: maximum number of tasks to create
  • schema.name: name to use for the schema
  • topic: name of the topic to append to
  • files: comma-separated list of paths to the files to extract content from
  • files.prefix: prefix for the files
  • content.extractor: type of content extractor to use; possible values are:
    • 'tika': use the Apache Tika content extractor (default)
    • 'oracle': use the Oracle Clean Content extractor (faster, but not recommended because it is less tested and less testable)
  • output.type: format in which the extracted content is sent to Kafka; one of:
    • 'text': extract only the plain text
    • 'xml': extract the content in a more structured format (XHTML), together with the file's metadata
    • 'text_xml' or 'xml_text': extract both (default)
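
The options above can also be assembled and checked in code before a connector is registered. The following is a minimal Python sketch, not part of this project: the helper names build_config and create_connector_request are hypothetical, the worker address http://localhost:8083 is an assumption about your deployment, and the value checks simply mirror the lists above (whether the connector itself rejects other values is not stated here). It relies on the Kafka Connect REST endpoint POST /connectors, which takes a body of the form {"name": ..., "config": {...}}.

```python
import json
import urllib.request

# Allowed values, copied from the option descriptions above.
EXTRACTORS = {"tika", "oracle"}
OUTPUT_TYPES = {"text", "xml", "text_xml", "xml_text"}


def build_config(files, topic, schema_name, files_prefix,
                 extractor="tika", output_type="text_xml"):
    """Return a flat connector config dict (hypothetical helper).

    Raises ValueError for option values not listed in the README.
    """
    if not files:
        raise ValueError("at least one file path is required")
    if extractor not in EXTRACTORS:
        raise ValueError(f"unknown content.extractor: {extractor!r}")
    if output_type not in OUTPUT_TYPES:
        raise ValueError(f"unknown output.type: {output_type!r}")
    return {
        "connector.class":
            "org.apache.kafka.connect.document.DocumentSourceConnector",
        "tasks.max": "1",
        "schema.name": schema_name,
        "topic": topic,
        "files": ",".join(files),  # comma-separated list of paths
        "files.prefix": files_prefix,
        "content.extractor": extractor,
        "output.type": output_type,
    }


def create_connector_request(name, config,
                             connect_url="http://localhost:8083"):
    """Build (but do not send) a POST /connectors request.

    connect_url is an assumed worker address; adjust it to your setup.
    """
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        f"{connect_url}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To actually register the connector, pass the returned request to urllib.request.urlopen while a Kafka Connect worker is reachable at the given address. Building the request separately from sending it keeps the configuration easy to inspect and test without a running cluster.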

Records

The records added to Kafka have the following fields:

  • 'name': the name of the file the content has been extracted from
  • 'raw_content': string containing the raw contents of the file (plain text only)
  • 'metadata': JSON string containing all metadata fields extracted from the file (empty if plain text only is extracted)
  • 'content': string containing the structured content of the file (XHTML) (not present if plain text only is extracted)
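
A consumer of these records might unpack the fields listed above as follows. This is only a sketch under assumptions: it presumes the record value has already been deserialized into a Python dict (how that happens depends on the converter configured for your Connect worker), and DocumentRecord and parse_record are hypothetical names, not part of this project.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocumentRecord:
    name: str                   # file the content was extracted from
    raw_content: Optional[str]  # plain text of the file
    metadata: dict              # parsed from the 'metadata' JSON string
    content: Optional[str]      # XHTML; None when only plain text is extracted


def parse_record(value: dict) -> DocumentRecord:
    """Convert a deserialized record value into a DocumentRecord."""
    raw_meta = value.get("metadata") or ""
    # 'metadata' is a JSON string and is empty for plain-text-only output.
    metadata = json.loads(raw_meta) if raw_meta.strip() else {}
    return DocumentRecord(
        name=value["name"],
        raw_content=value.get("raw_content"),
        metadata=metadata,
        content=value.get("content"),  # may be absent
    )
```

For example, parse_record({"name": "filename1.pdf", "raw_content": "hello", "metadata": ""}) yields a record with an empty metadata dict and content set to None, which matches the plain-text-only case described above.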
