Giter Site home page Giter Site logo

sit's Introduction

SIT (Streaming Index Toy)

SIT is a lightweight TCP server that provides real-time full-text search over streams of JSON documents. It's also usable as a C library, where it can parse any stream using custom parsers.

Why?

  • Lightweight The executable is 228K on my machine. It only depends on libev and libc; and currently (and temporarily) uses Ruby for the build and test suites.
  • True realtime This is "true" real time, as opposed to "near" real time. SIT eliminates garbage collection, "commits," "flushes," and fsyncs. This is search highly optimized for recency and real-time.
  • Write, then read Trivially support search results which include the document you just added.
  • Computationally efficient percolation SIT is designed to add documents and execute query callbacks efficiently with up to 100,000 registered queries and/or connections.
  • Flexible pubsub Rather than publishing into channels, you can use SIT to publish into a global event stream, and subscribe to dynamic channels defined by search queries.

Protocol

SIT speaks a simple line-based pipelineable protocol. Any line that starts with a curly brace ({) is interpreted as a JSON document to add to the indexed dataset. Other lines are interpreted as commands.

All responses are JSON lines, and have a status key, which can be either ok, or error. Other keys are implemented on a per-command basis.

Search & Percolation

SIT provides traditional search, where you give a query and get a resultset back. It also provides percolation, where you can register a query that will notify you when matching documents are added to the index.

Commands

  • register QUERY
    Registers the query, or queries, for percolation. When any document that matches the query is added, SIT will print a "found" response in this stream.

    Sample request/response:

      > register title ~ "hello world" AND points > 4;
      < {"status": "ok", "message": "registered", "id": 29}
      # ...
      < {"status": "ok", "message": "found", "query_id": 29, "doc_id": 500, "doc": {"title": "hello sweet world", "points": 7}}
    

    The response to a register command includes an ID of the registered query.

  • unregister QUERYID
    Give the id provided in the register response, to stop the percolation.

    Sample request/response:

      > unregister 29
      < {"status": "ok", "message": "unregistered", "id": 29}
    
  • query QUERY
    Do a search.

    Sample request/response:

      # adding some docs
      > {"hello":"world 0"}
      > {"hello":"world 1"}
      > {"hello":"world 2"}
      > {"hello":"world 3"}
      > {"hello":"world 4"}
      > {"hello":"world 5"}
      > {"hello":"world 6"}
      > {"hello":"world 7"}
      > {"hello":"world 8"}
      > {"hello":"world 9"}
      >
      > query hello ~ world LIMIT 5;
      < {"status": "ok", "message": "querying", "id": 27}
      < {"status": "ok", "message": "found", "query_id": 27, "doc_id": 9, "doc": {"hello":"world 9"}}
      < {"status": "ok", "message": "found", "query_id": 27, "doc_id": 8, "doc": {"hello":"world 8"}}
      < {"status": "ok", "message": "found", "query_id": 27, "doc_id": 7, "doc": {"hello":"world 7"}}
      < {"status": "ok", "message": "found", "query_id": 27, "doc_id": 6, "doc": {"hello":"world 6"}}
      < {"status": "ok", "message": "found", "query_id": 27, "doc_id": 5, "doc": {"hello":"world 5"}}
      < {"status": "ok", "message": "complete", "id": 27}
    

Query Language

SIT has a simple query language, composed of boolean operations (AND, OR, NOT) over clauses. You can append LIMIT N to the end of a query. Queries are terminated with either a newline or a semicolon. The following are valid clauses:

  field_name ~ string 
  field_name > integer
  field_name = integer
  field_name < integer
  field_name >= integer
  field_name <= integer
  field_name != integer

Tokenization

SIT is designed to use pluggable tokenization strategies.

TODO: Describe tokenization 101 basics, concepts, examples.

Supported tokenization strategies

  • whitespace
  • Unicode UAX#29 word boundaries
  • regular expression pattern tokenizer

Pull requests are welcome.

What is the tilde?

The tilde indicates a full-text search. A full-text search identifies documents where a given term is present in the specified field.

The full-text search, title ~ hello will match JSON documents with a field named title which contain a token of hello. For example, the document {"title":"hello full text search"} when tokenized with a simple whitespace tokenizer.

A quoted value for the tilde operator will match documents where all the terms of the quoted text are present in the named field. For example, title ~ "hello world" will be transformed into (title ~ hello AND title ~ world).

TODO

  • Nested JSON support
  • Composable streams
  • Persistance
  • Replace
  • wchar/unicode support
  • Facets
  • Aggregations
  • Quoted/span/proximity queries
  • selectable parsers besides JSON in the server? (e.g., log output, msgpack, BSON)

Quick Start

Feeling adventurous? Run SIT on your own system and try some of these demos.

Downloading and building SIT

git clone git://github.com/fizx/sit.git
cd sit
bundle install
bundle exec rake

Demo: Running SIT with simple inputs and searches

Start the server with ./sit, which reads commands from standard input, and prints its output to standard out.

./sit
{"hello":"world"}
# {"status": "ok", "message": "added", "doc_id": 0"}
query hello ~ world;
# {"status": "ok", "message": "querying", "id": 0}
# {"status": "ok", "message": "found", "query_id": 0, "doc_id": 0, "doc":
# {"hello":"world"}}
# {"status": "ok", "message": "complete", "id": 0}

Demo: Twitter Streaming API

Reminder, this is pre-release software. We use Twitter Streaming APIs as testing to find new and novel edge cases which cause crashes.

You should assume that this demo has a 50/50 chance to format your hard drive, and proceed accordingly. Preferably by reading the source to see whether this paragraph is serious or not.

First, install and authenticate twurl.

Next, start a server listening to the network.

./sit --port 4000
[INFO] [2013:03:0221:15:53] Successfully started server.

Now stream documents from Twitter, via netcat, to your running server.

twurl -t -H stream.twitter.com /1/statuses/sample.json | nc localhost 4000

sit's People

Contributors

fizx avatar nz avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.