xapian-fu's Introduction

Xapian Fu

XapianFu is a Ruby library for working with Xapian databases. It builds on the GPL-licensed Xapian Ruby bindings but provides an interface more in line with “The Ruby Way”(tm) and is considerably easier to use.

For example, you can work almost entirely with Hash objects - XapianFu will handle converting the Hash keys into Xapian term prefixes when indexing and when parsing queries.

It also handles storing and retrieving hash entries as Xapian::Document values. XapianFu basically gives you a persistent Hash with full text indexing (and ACID transactions).
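
As a rough illustration of the idea (not XapianFu's actual implementation), the sketch below turns a Hash into prefixed terms in plain Ruby. The `terms_for` helper and the exact prefix scheme (`X` followed by the upper-cased key) are assumptions made for this example; Xapian's convention is that user-defined term prefixes start with `X`.

```ruby
# Hypothetical helper: split each value into lower-cased words and
# prepend a per-field prefix built from the Hash key.
def terms_for(fields)
  fields.flat_map do |key, value|
    prefix = "X#{key.to_s.upcase}"
    value.to_s.downcase.scan(/\w+/).map { |word| prefix + word }
  end
end

terms = terms_for(:title => 'Cold Mountain', :year => 2004)
puts terms.inspect
```

A query such as "title:mountain" can then be matched against the same prefixed form, which is why field names work on both the indexing and the searching side.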

Installation

sudo gem install xapian-fu

Xapian Bindings

Xapian Fu requires the Xapian Ruby bindings to be available. On Debian/Ubuntu, you can install the `ruby-xapian` package to get them. Alternatively, you can install the `xapian-ruby` gem, which reportedly also provides them. You can also get the upstream Xapian release and install it manually.

Documentation

XapianFu::XapianDb is the corner-stone of XapianFu. A XapianDb instance will handle setting up a XapianFu::XapianDocumentsAccessor for reading and writing documents from and to a Xapian database. It makes use of XapianFu::QueryParser for parsing and setting up a query.

XapianFu::XapianDoc represents a document retrieved from or to be added to a Xapian database.

Basic usage example

Create a database, add 3 documents to it and then search and retrieve them.

require 'xapian-fu'
include XapianFu
db = XapianDb.new(:dir => 'example.db', :create => true,
                  :store => [:title, :year])
db << { :title => 'Brokeback Mountain', :year => 2005 }
db << { :title => 'Cold Mountain', :year => 2004 }
db << { :title => 'Yes Man', :year => 2008 }
db.flush
db.search("mountain").each do |match|
  puts match.values[:title]
end

Ordering of results

Create an in-memory database, add 3 documents to it and then search and retrieve them in year order.

db = XapianDb.new(:store => [:title], :sortable => [:year])
db << { :title => 'Brokeback Mountain', :year => 2005 }
db << { :title => 'Cold Mountain', :year => 2004 }
db << { :title => 'Yes Man', :year => 2008 }
db.search("mountain", :order => :year)

will_paginate support

Simple integration with the will_paginate Rails helpers.

@results = db.search("mountain", :page => 1, :per_page => 5)
will_paginate @results

Spelling correction

Spelling suggestions, like Google’s “Did you mean…” feature:

db = XapianDb.new(:dir => 'example.db', :create => true)
db << "There is a mouse in this house"
@results = db.search "moose house"
unless @results.corrected_query.empty?
  puts "Did you mean '#{@results.corrected_query}'"
end

Transactions support

Ensure that a group of documents is either added to the database in its entirety or not at all. The transaction is aborted if an exception is raised inside the block. The documents only become available to searches at the end of the block, when the transaction is committed.

db = XapianDb.new(:store => [:title, :year], :sortable => [:year])
db.transaction do
  db << { :title => 'Brokeback Mountain', :year => 2005 }
  db << { :title => 'Cold Mountain', :year => 2004 }
  db << { :title => 'Yes Man', :year => 2008 }
end
db.search("mountain")

Complete field definition examples

Fields can be described in more detail using a hash. For example, telling XapianFu that a particular field is a Date, Fixnum or Bignum allows very efficient on-disk storage and ensures that the same type of object is instantiated when those stored values are returned. For Fixnum and Bignum, it also allows you to order search results without worrying about leading zeros.

db = XapianDb.new(:fields => {
                               :title => { :store => true },
                               :released => { :type => Date, :store => true },
                               :votes => { :type => Fixnum, :store => true }
                             })
db << { :title => 'Brokeback Mountain', :released => Date.parse('13th January 2006'), :votes => 105302 }
db << { :title => 'Cold Mountain', :released => Date.parse('2nd January 2004'), :votes => 45895 }
db << { :title => 'Yes Man', :released => Date.parse('26th December 2008'), :votes => 44936 }
db.search("mountain", :order => :votes)
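
The leading-zeros remark can be seen in plain Ruby, independently of XapianFu: numbers compared as strings are ordered character by character, so they only sort correctly when zero-padded to the same width. The vote counts are the ones from the example above.

```ruby
votes = [105302, 45895, 44936]

# Compared as strings, "105302" sorts first because "1" < "4".
as_strings = votes.map(&:to_s).sort

# Compared as integers, the order is numerically correct.
as_integers = votes.sort

puts as_strings.inspect
puts as_integers.inspect
```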

Simple max value queries

Find the document with the highest :year value

db.documents.max(:year)

Special queries

XapianFu supports Xapian’s `MatchAll` and `MatchNothing` queries:

db.search(:all)
db.search(:nothing)

Search examples

Search on particular fields

db.search("title:mountain year:2005")

Boolean AND (default)

db.search("ruby AND rails")
db.search("ruby rails")

Boolean OR

db.search("rails OR sinatra")
db.search("rails sinatra", :default_op => :or)

Exclude certain terms

db.search("ruby -rails")

Wildcards

db.search("xap*")

Phrase searches

db.search("'a steamer in the gene pool'", :phrase => true)

And any combinations of the above:

db.search("(ruby OR sinatra) -rails xap*")

Custom term weights

Sometimes you may want to increase the weight of a particular term in a document. Xapian supports adding [extra weight](trac.xapian.org/wiki/FAQ/ExtraWeight) to a term at index time by providing an integer “wdf” (default is 1).

You may set an optional :weights option when initializing a XapianDb. The :weights option accepts a Proc or Lambda that will be called with the key, value and list of document fields as each term is indexed. Your function should return an integer to set the weight to.

XapianDb.new(:weights => lambda {|k, v, f| k == :title ? 3 : 1})
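
A slightly fuller sketch of such a function, written so that it can be tried without a database. The field names and weight values are illustrative assumptions; only the three-argument shape (key, value, fields) comes from the description above.

```ruby
# Illustrative weights (assumed values): title terms count triple,
# tags double, everything else gets the default wdf of 1.
field_weights = { :title => 3, :tags => 2 }

weights = lambda do |key, _value, _fields|
  field_weights.fetch(key, 1)
end

# The lambda can be exercised directly before handing it to
# XapianDb.new(:weights => weights).
puts weights.call(:title, 'Cold Mountain', [:title, :year])
puts weights.call(:year, 2004, [:title, :year])
```

Keeping the weights in a lookup table rather than a chain of conditionals makes it easier to adjust them as more fields are added.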

Boolean terms

If you want to implement something like [this](getting-started-with-xapian.readthedocs.org/en/latest/howtos/boolean_filters.html#searching), then:

db = XapianDb.new(
  fields: {
    name:   {:index => true},
    colors: {:boolean => true}
  }
)

db << {name: "Foo", colors: ["red", "black"]}
db << {name: "Foo", colors: ["red", "green"]}
db << {name: "Foo", colors: ["blue", "yellow"]}

db.search("foo", filter: {:colors => ["red"]})

The main thing here is that filtering by color doesn’t affect the relevancy of the documents returned.

Facets

Many times you want to allow users to narrow down the search results by restricting the query to specific values of a given category. This is called [faceted search](readthedocs.org/docs/getting-started-with-xapian/en/latest/xapian-core-rst/facets.html).

To find out which values you can display to your users, you can do something like this:

results = db.search("foo", facets: [:colors, :year])

results.facets
# {
#   :colors => [
#     ["blue",  4],
#     ["red",   1]
#   ],
#
#   :year => [
#     [2010, 3],
#     [2011, 2],
#     [2012, 1]
#   ]
# }

When filtering by one of these values, it’s best to define the field as boolean (see section above) and then use `:filter`:

db.search("foo", filter: {colors: ["blue"], year: [2010]})
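
As a sketch of how the facet counts might be presented to users, the snippet below formats a Hash shaped like the `results.facets` output shown above. The `facets` literal is sample data copied from that output, and `facet_lines` is a hypothetical helper, not part of XapianFu.

```ruby
# Sample data in the shape of results.facets (copied from the output above).
facets = {
  :colors => [["blue", 4], ["red", 1]],
  :year   => [[2010, 3], [2011, 2], [2012, 1]]
}

# Hypothetical helper: one "field: value (count)" line per facet value.
def facet_lines(facets)
  facets.flat_map do |field, counts|
    counts.map { |value, count| "#{field}: #{value} (#{count})" }
  end
end

lines = facet_lines(facets)
puts lines
```

Each line corresponds to one value that could be passed back in through the `:filter` option when the user selects it.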

ActiveRecord Integration

XapianFu always stores the :id field, so you can easily use it with something like ActiveRecord to index database records:

db = XapianDb.new(:dir => 'posts.db', :create => true)
Post.all.each { |p| db << p.attributes }
docs = db.search("custard")
docs.each_with_index { |doc,i| docs[i] = Post.find(doc.id) }

Combine it with the max value search to do batch delta updates by primary key:

db = XapianDb.new(:dir => 'posts.db')
latest_doc = db.documents.max(:id)
new_posts = Post.find(:all, :conditions => ['id > ?', latest_doc.id])
new_posts.each { |p| db << p.attributes }

Or by :updated_at field if you prefer:

db = XapianDb.new(:dir => 'posts.db', :fields => { :updated_at => { :type => Time, :store => true } })
last_updated_doc = db.documents.max(:updated_at)
new_posts = Post.find(:all, :conditions => ['updated_at >= ?', last_updated_doc.updated_at])
new_posts.each { |p| db << p.attributes }

Deleted records won’t show up in results but can eventually put your result pagination out of whack. So, you’ll need to track deletions yourself, either with a deleted_at field, some kind of delete log or perhaps by reindexing once in a while.

db = XapianDb.new(:dir => 'posts.db')
deleted_posts = Post.find(:all, :conditions => 'deleted_at is not null')
deleted_posts.each do |post|
  db.documents.delete(post.id)
  post.destroy
end

More Info

Author

John Leach

Copyright

Copyright © 2009-2012 John Leach

License

MIT (The Xapian library is GPL)

Mailing list

rubyforge.org/mailman/listinfo/xapian-fu-discuss

Web page

johnleach.co.uk/documents/xapian-fu

Github

github.com/johnl/xapian-fu/tree/master

xapian-fu's People

Contributors

djanowski, johnl, pjk


xapian-fu's Issues

Searching for phrases in fields

The following code does not work.

require 'xapian-fu'

db = XapianFu::XapianDb.new(:create => true, :store => [:title, :year])
db << { :title => 'Brokeback Mountain', :year => 2005 }
db << { :title => 'Brokeback Mountain Again', :year => 2006 }
db << { :title => 'Cold Mountain', :year => 2004 }
db << { :title => 'Yes Man', :year => 2008 }
db.flush
db.search('title:"Brokeback Mountain"', :fields => [:title]).each do |match|
  p match.values[:title]
end

Stopwords file open mode

Hello,
on Debian using Ruby 1.9.2 I had to change line 37 of lib/xapian_fu/stopper_factory.rb from
open(stop_words_filename(lang), "r") do |f|
to
open(stop_words_filename(lang), "rb") do |f|
that is, adding a b to the open mode, to avoid an "invalid byte sequence in US-ASCII (ArgumentError)" with the Italian stopwords file.
There is probably a better solution, but it is something you should be aware of.
Thanks!
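
One possible alternative, sketched here under the assumption that the stopwords files are UTF-8 encoded: pass an explicit encoding in the open mode so that Ruby 1.9+ does not fall back to the default external encoding (US-ASCII in the report above). The temporary file and its contents are stand-ins for the real stopwords file.

```ruby
require 'tempfile'

# Stand-in for a stopwords file containing non-ASCII words.
file = Tempfile.new('stopwords')
file.binmode
file.write("perch\u00E9\npi\u00F9\n")
file.close

# "r:UTF-8" tags the data read as UTF-8, regardless of the locale default.
words = File.open(file.path, 'r:UTF-8') { |f| f.read.split("\n") }

puts words.first.encoding
puts words.all?(&:valid_encoding?)

file.unlink
```

Unlike "rb", which yields binary strings, this keeps the words tagged as UTF-8 so they compare equal to UTF-8 terms elsewhere in the program.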

Documentation on Document updating

I'm having trouble figuring out how to update a document. Could you please provide some documentation or at least comment on this issue? Thanks!

Complex types support

It would be nice if at least Array and Hash were supported for serialization and deserialization.

Date range searching

Wonder if you have any plans, or would consider, adding date range searching functionality?

Thanks :)

Requiring the Gem

How do you require the gem?

$ gem list | grep -i xapian

xapian-fu (1.0.1)

irb(main):002:0> require 'xapian'
LoadError: no such file to load -- xapian
irb(main):003:0> require 'xapian-fu'
LoadError: no such file to load -- xapian-fu
irb(main):004:0> require 'xapian_fu'
LoadError: no such file to load -- xapian
irb(main):005:0> require 'xapianfu'
LoadError: no such file to load -- xapianfu
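
The output above suggests that `require 'xapian'` itself fails, which would mean the Xapian Ruby bindings are not on the load path (they are installed separately from the xapian-fu gem). A small diagnostic sketch, with `loadable?` being a hypothetical helper written for this example:

```ruby
# Hypothetical helper: report whether a library can be loaded.
def loadable?(name)
  require name
  true
rescue LoadError
  false
end

puts "xapian bindings available: #{loadable?('xapian')}"
puts "xapian-fu available: #{loadable?('xapian-fu')}"
```

If the first line prints false, installing the bindings is the likely next step. On Ruby 1.8, gems may also need `require 'rubygems'` first.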

Text encoding

Since all text in xapian is utf-8, strings coming back out of xapian-fu should be encoded in utf-8 (probably just by calling force_encoding('utf-8') on strings as they come out)

Right now the strings come out marked as local encoding, but are actually utf-8, and this causes some problems.
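
The suggested fix can be illustrated in plain Ruby. The byte string below is a stand-in for a value read back from Xapian: its bytes are valid UTF-8, but it is tagged with a different encoding. `force_encoding` relabels the string without changing its bytes.

```ruby
# Stand-in for a stored value: the UTF-8 bytes of "café", tagged as binary.
raw = "caf\xC3\xA9".b

# Relabel a copy as UTF-8; the underlying bytes are unchanged.
text = raw.dup.force_encoding('UTF-8')

puts raw.encoding
puts text.encoding
puts text.valid_encoding?
puts text.length
```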

Add support for Xapian::QueryParser::set_max_wildcard_expansion()

When a Xapian database contains many variants of the same short word (such as test1, test2, test3, test4, test5, test6...), searching for a wildcard pattern such as "test*" becomes much more expensive, in both time and memory, than searching for a longer one such as "test4*" or "test40*".

The raw Xapian binding provides a QueryParser::set_max_wildcard_expansion function [1], which allows the user to limit the number of expansions retained in this case (the default is 0, meaning unlimited). However, there does not appear to be any easy way to achieve this using xapian-fu.

Would it be possible to add support for a :max_wildcard_expansion option in search requests?

[1] http://xapian.org/docs/apidoc/html/classXapian_1_1QueryParser.html#7651d48cdc661c0605c475925170cc71
