FoundationDB Lucene Layer

This layer provides two integration points with Lucene: FDBDirectory and FDBCodec. These are full implementations of Lucene's Directory and Codec interfaces, backed entirely by FoundationDB.

FDBDirectory can be used on its own, with Lucene's default Codec doing the interesting work. In that case, files generated by Lucene are stored as blobs in the database instead of on the file system.

FDBCodec, which must be used in conjunction with FDBDirectory, implements new serialization and data models for Lucene. This results in explicit keys and values in the database instead of file-like blobs.

Warning: Alpha Stage

This layer is at an early alpha stage (note the 0.0.1 version number). While most of the stock Lucene tests pass when using FDBDirectory, many currently fail when running with FDBCodec. There are no known correctness issues at this time, but slowness and timeouts could easily be hiding such problems.

Please try it out and let us know how it works (e.g. on our community site), but production usage is not recommended.

FDBCodec Data Model

The Subspace concept is used extensively to provide a simple, logical mapping and easy storage and retrieval. Each directory, segment and format is identified by a unique string. These identifier strings are then concatenated to yield the key range associated with each logical format being stored.

For example, assume we have an FDBDirectory created with the path ("lucene") and a segment named "_0". That would result in the following Tuples:

  • ("lucene", "_0", "dat") for DocValues
  • ("lucene", "_0", "inf") for FieldInfos
  • ("lucene", "_0", "liv") for LiveDocs
  • etc

Additional keys and values exist under each of those subspaces for storing the information associated with each format. In the documentation below, the full subspace is the concatenation of the directory, segment and format subspaces.
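
The concatenation described above can be sketched with plain Python tuples. This is only an illustration of the naming scheme: it does not use the FoundationDB client, and the helper full_subspace is a hypothetical name introduced for this example, not part of the layer.

```python
# Sketch: model subspaces as tuples and concatenate them.
# full_subspace is a hypothetical helper, not part of the layer's API.

def full_subspace(directory, segment, fmt):
    """Concatenate directory, segment and format subspaces into one prefix."""
    return tuple(directory) + (segment,) + (fmt,)

directory = ("lucene",)   # path the FDBDirectory was created with
segment = "_0"            # segment name

doc_values = full_subspace(directory, segment, "dat")
field_infos = full_subspace(directory, segment, "inf")
live_docs = full_subspace(directory, segment, "liv")

# A key inside a format is the full subspace plus the format's own key parts.
key = live_docs + (0,)    # e.g. the entry for live docs generation 0

print(doc_values)  # ('lucene', '_0', 'dat')
print(key)         # ('lucene', '_0', 'liv', 0)
```

Because every key of a format shares the same prefix, all data for one format of one segment falls in a single contiguous key range.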

DocValuesFormat

Encodes/decodes strongly typed, per-document values. See DocValuesFormat and FieldInfo.DocValuesType.

The long_BINARY, long_NUMERIC, long_SORTED and long_SORTED_SET key parts below refer to the DocValuesType enum ordinal() values.

Subspace: ("dat")

(str_fieldName, long_BINARY, long_doc0) => (bytes_value)
(str_fieldName, long_BINARY, long_doc1) => (bytes_value)
...
(str_fieldName, long_NUMERIC, long_doc0) => (long_value)
(str_fieldName, long_NUMERIC, long_doc1) => (long_value)
...
(str_fieldName, long_SORTED, "bytes", long_ordinal0) => (bytes_value)
(str_fieldName, long_SORTED, "bytes", long_ordinal1) => (bytes_value)
...
(str_fieldName, long_SORTED_SET, "ord", long_doc0) => (long_ordinal)
(str_fieldName, long_SORTED_SET, "ord", long_doc1) => (long_ordinal)
...
(str_fieldName, long_SORTED_SET, "bytes", long_ordinal0) => (bytes_value)
(str_fieldName, long_SORTED_SET, "bytes", long_ordinal1) => (bytes_value)
...
(str_fieldName, long_SORTED_SET, "doc_ord", long_doc0, long_ordinal0) => ()
(str_fieldName, long_SORTED_SET, "doc_ord", long_doc0, long_ordinal1) => ()
(str_fieldName, long_SORTED_SET, "doc_ord", long_doc1, long_ordinal0) => ()
...
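
As a rough illustration of how the "doc_ord" and "bytes" entries might be combined, the sketch below resolves the SORTED_SET values of one document. A dict stands in for the database, the field name and values are made up, and the constant 3 is a placeholder for the enum ordinal; values_for_doc is a hypothetical helper.

```python
# Sketch: resolve the SORTED_SET values of one document from the key layout.
# SORTED_SET stands in for DocValuesType.SORTED_SET.ordinal(); the value 3
# is a placeholder chosen for this example.
SORTED_SET = 3

# Keys are relative to the ("dat") subspace; a dict stands in for the database.
store = {
    ("tags", SORTED_SET, "bytes", 0): b"apple",
    ("tags", SORTED_SET, "bytes", 1): b"banana",
    ("tags", SORTED_SET, "bytes", 2): b"cherry",
    ("tags", SORTED_SET, "doc_ord", 0, 0): b"",
    ("tags", SORTED_SET, "doc_ord", 0, 2): b"",
    ("tags", SORTED_SET, "doc_ord", 1, 1): b"",
}

def values_for_doc(store, field, doc):
    """Scan the doc_ord prefix for one document, then look up each ordinal."""
    prefix = (field, SORTED_SET, "doc_ord", doc)
    ordinals = sorted(k[-1] for k in store if k[:len(prefix)] == prefix)
    return [store[(field, SORTED_SET, "bytes", o)] for o in ordinals]

print(values_for_doc(store, "tags", 0))  # [b'apple', b'cherry']
```

Against a real database the prefix filter would be a range read over the (fieldName, SORTED_SET, "doc_ord", doc) prefix rather than a full scan.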

FieldInfosFormat

Encodes/decodes field metadata. See FieldInfosFormat and FieldInfos.

Subspace: ("inf")

(long_field0, "name") => (string_fieldName)
(long_field0, "has_index") => (boolean_value)
(long_field0, "has_payloads") => (boolean_value)
(long_field0, "has_norms") => (boolean_value)
(long_field0, "has_vectors") => (boolean_value)
(long_field0, "doc_values_type") => (string_docValuesType)
(long_field0, "norms_type") => (string_normsType)
(long_field0, "index_options") => (string_indexOptions)
(long_field0, "attr", string_attr0) => (string_value)
(long_field0, "attr", string_attr1) => (string_value)
...
(long_field1, "name") => (string_fieldName)
...
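
To show how these flat entries relate to per-field metadata, the sketch below groups them into one record per field number. A dict stands in for the database; the field names, the attribute name and load_field_infos are invented for the example.

```python
# Sketch: group flat ("inf") entries into one record per field number.
store = {
    (0, "name"): "title",
    (0, "has_index"): True,
    (0, "attr", "boost"): "2",      # attribute name is made up for the example
    (1, "name"): "body",
    (1, "has_index"): True,
    (1, "has_vectors"): False,
}

def load_field_infos(store):
    """Return {field_number: {"attrs": {...}, property_name: value, ...}}."""
    fields = {}
    for key, value in store.items():
        info = fields.setdefault(key[0], {"attrs": {}})
        if key[1] == "attr":
            info["attrs"][key[2]] = value   # (field, "attr", name) => value
        else:
            info[key[1]] = value            # (field, property) => value
    return fields

infos = load_field_infos(store)
print(infos[0]["name"], infos[0]["attrs"])  # title {'boost': '2'}
print(infos[1]["has_vectors"])              # False
```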

LiveDocsFormat

Encodes/decodes live-ness of documents. See LiveDocsFormat.

Subspace: ("liv")

(long_liveGen0) => (long_totalSize)
(long_liveGen0, long_setBitIndex0) => ()
(long_liveGen0, long_setBitIndex1) => ()
(long_liveGen1) => (long_totalSize)
...
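
The layout stores one key per set bit plus a total size per generation. A minimal sketch of rebuilding the bit array from those entries follows, with a dict standing in for the database. That a set bit marks a live document is an assumption here, and load_bits is a hypothetical helper.

```python
# Sketch: rebuild the bit array for one live-docs generation.
# Keys are relative to the ("liv") subspace; a dict stands in for the database.
store = {
    (0,): 5,        # generation 0 covers 5 documents (long_totalSize)
    (0, 0): None,   # empty value: the key's presence marks the bit as set
    (0, 1): None,
    (0, 3): None,
}

def load_bits(store, gen):
    """Return a list of booleans, True where a set-bit key exists."""
    total_size = store[(gen,)]
    bits = [False] * total_size
    for key in store:
        if len(key) == 2 and key[0] == gen:
            bits[key[1]] = True
    return bits

print(load_bits(store, 0))  # [True, True, False, True, False]
```

Storing only the set bits keeps the representation sparse: the number of keys grows with the number of set bits, not with the segment size.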

NormsFormat

Encodes/decodes per-document score normalization values. See NormsFormat.

Subspace: ("len")

Uses the DocValuesFormat layout, stored under the ("len") subspace instead of ("dat").

PostingsFormat

Encodes/decodes terms, postings, and proximity data. See PostingsFormat.

Subspace: ("pst")

(long_field0, bytes_term0, "numDocs") => (littleEndianLong_value)
(long_field0, bytes_term0, long_doc0) => (long_termDocFreq)
(long_field0, bytes_term0, long_doc0, long_pos0) => (long_startOffset, long_endOffset, bytes_payload)
...
(long_field1, bytes_term1, "numDocs") => (littleEndianLong_value)
...
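
The "numDocs" value is labeled littleEndianLong rather than long, so it uses a different encoding from the other values. The sketch below shows one way to produce and read such a value, assuming it is a signed 64-bit integer stored in 8 bytes; the helper names are hypothetical.

```python
import struct

def encode_le_long(n):
    """Pack a signed 64-bit integer in little-endian byte order."""
    return struct.pack("<q", n)

def decode_le_long(raw):
    """Unpack a little-endian signed 64-bit integer."""
    return struct.unpack("<q", raw)[0]

raw = encode_le_long(3)
print(raw.hex())            # 0300000000000000
print(decode_le_long(raw))  # 3
```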

SegmentInfoFormat

Encodes/decodes segment metadata. See SegmentInfoFormat.

Subspace: ("si")

("doc_count") => (long_docCount)
("is_compound_file") => (boolean_value)
("version") => (long_version)
("attr", string_attr0) => (string_value)
("attr", string_attr1) => (string_value)
...
("diag", string_diag0) => (string_value)
("diag", string_diag1) => (string_value)
...
("file", string_file0) => ()
("file", string_file1) => ()
...

StoredFieldsFormat

Encodes/decodes per-document fields. See StoredFieldsFormat.

The key parts long_TYPE and long_DATA below refer to constant values, currently 0 and 1.

Subspace: ("fld")

(long_doc0, long_TYPE, long_field0) => (string_typeName, long_dataIndex)
(long_doc0, long_TYPE, long_field1) => (string_typeName, long_dataIndex)
...
(long_doc0, long_DATA, long_field0, long_dataIndex, long_offset0) => (bytes_value)
(long_doc0, long_DATA, long_field0, long_dataIndex, long_offset1) => (bytes_value)
(long_doc0, long_DATA, long_field1, long_dataIndex, long_offset0) => (bytes_value)
...
(long_doc1, long_TYPE, long_field0) => (string_typeName, long_dataIndex)
...
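
One plausible reading of this layout is that the TYPE entry names the data index, and the DATA entries under that index hold the value in chunks keyed by offset. The sketch below reassembles a value under that assumption. A dict stands in for the database; the type name "string" and the helper read_field are illustrative only.

```python
# Sketch: reassemble one stored field value from its DATA entries.
TYPE, DATA = 0, 1  # constant key parts, currently 0 and 1

# Keys are relative to the ("fld") subspace; a dict stands in for the database.
store = {
    (7, TYPE, 0): ("string", 0),       # (typeName, dataIndex); type name is illustrative
    (7, DATA, 0, 0, 0): b"hello, ",    # chunk at offset 0
    (7, DATA, 0, 0, 7): b"world",      # chunk at offset 7
}

def read_field(store, doc, field):
    """Look up the TYPE entry, then join the DATA chunks in offset order."""
    type_name, data_index = store[(doc, TYPE, field)]
    prefix = (doc, DATA, field, data_index)
    chunks = sorted(
        (k[-1], v) for k, v in store.items()
        if len(k) == 5 and k[:4] == prefix
    )
    return type_name, b"".join(v for _, v in chunks)

print(read_field(store, 7, 0))  # ('string', b'hello, world')
```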

TermVectorsFormat

Encodes/decodes per-document term vectors. See TermVectorsFormat.

Subspace: ("vec")

(long_doc0, "field", string_field0) => (long_fieldNum, long_numTerms, boolean_hasPositions, boolean_hasOffsets, boolean_hasPayloads)
(long_doc0, "field", string_field1) => (long_fieldNum, long_numTerms, boolean_hasPositions, boolean_hasOffsets, boolean_hasPayloads)
...
(long_doc0, "term", string_field0, bytes_term0) => (long_freq)
(long_doc0, "term", string_field0, bytes_term0, long_pos0) => (long_startOffset, long_endOffset, bytes_payload)
(long_doc0, "term", string_field0, bytes_term0, long_pos1) => (long_startOffset, long_endOffset, bytes_payload)
(long_doc0, "term", string_field0, bytes_term1) => (long_freq)
...
(long_doc1, "field", string_field0) => (long_fieldNum, long_numTerms, boolean_hasPositions, boolean_hasOffsets, boolean_hasPayloads)
...

Running Built-In Tests

Maven is used for building, packaging and running tests.

$ mvn test

Running Lucene and Solr Tests

  1. Package fdb-lucene-layer

     $ mvn package
    
  2. Download the Solr source

     $ curl -O http://mirror.nexcess.net/apache/lucene/solr/4.4.0/solr-4.4.0-src.tgz
     $ tar xzf solr-4.4.0-src.tgz
     $ cd solr-4.4.0/
    
  3. Run the full test suite

     $ ant test -Dtests.codec=FDBCodec \
                -Dtests.directory=com.foundationdb.lucene.FDBTestDirectory \
                -lib ../target/fdb-lucene-layer-0.0.1-SNAPSHOT.jar \
                -lib ../target/dependency/fdb-java-1.0.0.jar
    

Contributors

nathanlws