Heritrix-Cassandra

A library for writing Heritrix 3 output directly to Cassandra as records.

Getting Started

  1. Visit http://github.com/openplaces/heritrix-cassandra/tree/master/releases/ and obtain a release of heritrix-cassandra that corresponds to the versions of Heritrix and Cassandra you are running. Consult the "Releases" section for more information.
  2. Copy the heritrix-cassandra-{version}.jar file into your Heritrix install's lib folder.
  3. Copy the following list of files from your Cassandra lib folder into your Heritrix install's lib folder:
    • apache-cassandra-*.jar
    • libthrift-*.jar
    • log4j-*.jar
    • slf4j-api-*.jar
    • slf4j-log4j*.jar
  4. Modify your Heritrix job configuration to use the heritrix-cassandra writer, as in the example below.

crawler-beans.cxml:

<!-- DISPOSITION CHAIN -->
<bean id="cassandraParameters" class="org.archive.io.cassandra.CassandraParameters">
  <!-- "seeds" and "keyspace" are required, while port defaults to 9160 -->

  <!-- Pass a comma-separated list of servers to Cassandra here -->
  <property name="seeds" value="localhost,127.0.0.1" />

  <!-- This is the thrift port -->
  <property name="port" value="9160" />

  <!-- Your application specific keyspace -->
  <property name="keyspace" value="MyApplication" />

  <!-- Change the crawlColumnFamily from its default value of 'crawl' -->
  <property name="crawlColumnFamily" value="crawled_pages" />

  <!-- Other parameters are overridden similarly and a full list is provided below -->
</bean>

<bean id="cassandraWriterProcessor" class="org.archive.modules.writer.CassandraWriterProcessor">
  <property name="cassandraParameters">
    <!-- Referencing the named bean we defined above -->
    <ref bean="cassandraParameters" />
  </property>
</bean>

[...]

<bean id="dispositionProcessors" class="org.archive.modules.DispositionChain">
  <property name="processors">
    <list>
      <!-- write to aggregate archival files... -->
      <ref bean="cassandraWriterProcessor"/>
      <!-- other references -->
    </list>
  </property>
</bean>
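
If the defaults suit your cluster, the parameters bean can be much shorter. The following is a minimal sketch, assuming only the two properties marked as required in the comments above are set; the host names and keyspace name are placeholders, not values from the project:

```xml
<!-- Minimal sketch: only the required properties are set. -->
<!-- port falls back to 9160 and crawlColumnFamily to 'crawl'. -->
<bean id="cassandraParameters" class="org.archive.io.cassandra.CassandraParameters">
  <!-- Placeholder host names; list any nodes of your cluster -->
  <property name="seeds" value="cassandra1.example.com,cassandra2.example.com" />
  <!-- Placeholder keyspace name -->
  <property name="keyspace" value="MyApplication" />
</bean>
```

This would replace the longer cassandraParameters bean shown above rather than sit alongside it, since both use the same bean id.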

org.archive.io.cassandra.CassandraParameters

| Parameter | Default Value | Description |
| --- | --- | --- |
| seeds | (none) | Comma-separated list of Cassandra servers (can be any nodes in your cluster). |
| port | 9160 | The Thrift port. |
| keyspace | (none) | The name of your Cassandra keyspace. |
| crawlColumnFamily | crawl | Name of the column family to use. |
| encodingScheme | UTF-8 | Character encoding scheme to use. |
| framedTransport | false | Whether to use Thrift's framed transport. |
| contentPrefix | content | Name of the logical prefix under which the raw content is saved. If contentColumnName is redefined, this prefix is overridden and no longer used. |
| contentColumnName | raw_data | Name of the column the raw content is saved to. |
| headersColumnName | headers | Name of the column the headers are saved to (only if separateHeaders is set to true). |
| curiPrefix | curi | Name of the logical prefix under which the crawl metadata is stored. If any of the following column-name parameters is redefined, this prefix is overridden and no longer used with that column. |
| ipColumnName | ip | Name of the column the resolved IP is saved to. |
| pathFromSeedColumnName | path-from-seed | Name of the column the path from the seed is saved to. |
| isSeedColumnName | is-seed | Name of the column that stores a boolean indicating whether the current entry is a seed. |
| viaColumnName | via | Name of the column that stores the via information. |
| urlColumnName | url | Name of the column that stores the URL. |
| requestColumnName | request | Name of the column that stores the request header. |
| separateHeaders | false | Whether to store the HTTP response headers separately from the content. |
| maximumContentSize | -1 | Maximum size of the content string that will be saved. Content larger than this is not written to Cassandra. -1 indicates unlimited size. |
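
Optional parameters are set in the same way as the properties in the job configuration earlier. As an illustrative sketch, the bean below separates response headers from content and caps the stored content size. The specific values are assumptions chosen for the example, and this README does not state the unit of maximumContentSize, so verify it before relying on a particular limit:

```xml
<bean id="cassandraParameters" class="org.archive.io.cassandra.CassandraParameters">
  <property name="seeds" value="localhost" />
  <property name="keyspace" value="MyApplication" />

  <!-- Store HTTP response headers in their own column -->
  <property name="separateHeaders" value="true" />
  <!-- Hypothetical column name; the default is 'headers' -->
  <property name="headersColumnName" value="response_headers" />

  <!-- Illustrative limit; larger content is not written. -1 means unlimited -->
  <property name="maximumContentSize" value="1048576" />

  <!-- Enable only if your Cassandra Thrift server expects framed transport -->
  <property name="framedTransport" value="true" />
</bean>
```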

Building

If you can't find a release that corresponds to your combination of Heritrix and Cassandra versions, you can build your own version of heritrix-cassandra, provided that the APIs of each application haven't changed dramatically.

  1. Obtain the heritrix-cassandra source from http://github.com/openplaces/heritrix-cassandra
  2. Install Gradle (http://www.gradle.org/installation.html)
  3. Edit build.gradle and change the properties "version", "heritrix_version", "cassandra_version" accordingly.
  4. Run "gradle jar" on the command line; the new jar should appear in the build/libs folder.

Releases

Each release of heritrix-cassandra is compiled against different version combinations of Heritrix and Cassandra. The following table summarizes them.

| heritrix-cassandra | Heritrix | Cassandra |
| --- | --- | --- |
| 0.8 | 3.1.0 | 0.7.6-2 |
| 0.7 | 3.0.1 | 0.6.5 |
| 0.5 | 3.0.0 | 0.6.1 |
| 0.4 | 3.0.0 | 0.6.1 |
| 0.3 | 3.0.0 | 0.6.0 |
| 0.2 | 3.0.0 | 0.6.1 |
| 0.1 | 3.0.0 | 0.6.0 |
