Simple search engine with TF-IDF ranking

This project started as a simple search engine following the general idea of this blog post. A starting-point implementation was given in Python and can be found here. The task was to make suitable changes to optimize the given implementation. A walkthrough of all the changes is given in report.pdf, currently in Greek.

Search engines

There are 3 major stages in developing a search engine:

  1. Finding/Crawling the Data
  2. Building the index
  3. Using the index to answer queries

On top of this, we can add result ranking (tf-idf, PageRank, etc.), query/document classification, and perhaps some machine learning that tracks users' past queries and selected results to improve the search engine's performance.
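The second and third stages can be sketched in a few lines. The following is a minimal illustration, not the project's actual code: `STOPWORDS`, `TOKEN_RE`, `tokenize`, `build_index`, and `search` are hypothetical names, and the stopword list is a tiny placeholder. Positions are counted after stopword removal.

```python
import re
from collections import defaultdict

# Hypothetical stopword set and token pattern, for illustration only.
STOPWORDS = frozenset({"the", "a", "of", "and", "to", "in"})
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and return its non-stopword tokens in order."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in STOPWORDS]


def build_index(docs):
    """Map each term to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)


def search(index, query):
    """Return the sorted ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*postings))
```

Keeping token positions in the postings is what makes phrase queries possible later, at the cost of a larger index.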

Indexer

The Indexer was benchmarked using a small random book subset from the Project Gutenberg collection. The improvements made to the Indexer are:

  1. Static lists to sets conversion, for faster searches. Speedup: +5%
  2. Precompiled Regex, for faster matches. Speedup: +2%
  3. Stopword removal using Modified Penn Treebank Tag-Set2 closed class categories. Inverted Index size: -5%
  4. Multi-process parallel Indexer with 2 worker processes. Speedup: +50%
  5. Serialize inverted index as Pickle file. Inverted Index size: -30%
  6. Apply D-Gap encoding. Inverted Index size: -25%
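As an illustration of the last item, D-Gap encoding stores each document id in a sorted posting list as its distance from the previous id. The sketch below assumes posting lists of strictly increasing positive integers; `dgap_encode` and `dgap_decode` are hypothetical names, not functions from this project.

```python
from itertools import accumulate


def dgap_encode(doc_ids):
    """Replace each id in a sorted posting list by its gap from the previous id."""
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def dgap_decode(gaps):
    """Rebuild the original ids as the running sum of the gaps."""
    return list(accumulate(gaps))


postings = [3, 7, 8, 15, 40]
gaps = dgap_encode(postings)  # [3, 4, 1, 7, 25]
restored = dgap_decode(gaps)  # [3, 7, 8, 15, 40]
```

The gaps are smaller numbers than the ids they replace, and small integers generally serialize more compactly, which is where the size reduction comes from.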

The overall result for P = 2 worker processes is a 2.68x speedup and a 52% reduction in the inverted index file size, compared to the original implementation. To get help on the Indexer sub-system's execution arguments, run the following command in the project's directory:

$ python BuildIndex.py --help

Query

The Query sub-system uses the inverted index and supports standard and phrase queries, with tf-idf ranking. To get help on the Query sub-system's execution arguments, run the following command in the project's directory:

$ python QueryIndex.py --help
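For readers unfamiliar with tf-idf ranking, here is a minimal sketch of one common variant: term frequency normalized by document length, multiplied by the natural log of the inverse document frequency. The weighting actually used by QueryIndex.py may differ, and `tf_idf_rank` is a hypothetical name.

```python
import math
from collections import Counter


def tf_idf_rank(docs, query_terms):
    """Rank docs (doc_id -> list of tokens) by summed tf-idf of the query terms."""
    n_docs = len(docs)
    term_counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    scores = {}
    for term in query_terms:
        # Document frequency: number of documents containing the term.
        df = sum(1 for counts in term_counts.values() if term in counts)
        if df == 0:
            continue
        idf = math.log(n_docs / df)
        for doc_id, counts in term_counts.items():
            if term in counts:
                tf = counts[term] / len(docs[doc_id])
                scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With this weighting, a term that appears in every document gets an idf of zero and so contributes nothing to the ranking.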

Recommender

A proof-of-concept Recommender sub-system was created for the books of the Gutenberg collection. It is independent of this implementation, but can easily be adapted to work with the Query sub-system. The provided recommendations are based on other users' ratings of the books (collaborative filtering). The similarity between users is calculated using the Pearson correlation coefficient.
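A rough sketch of user-based collaborative filtering with Pearson similarity follows. It is an illustration under assumptions, not the code in Recommender.py: ratings are assumed to be stored as a dict of user -> {book_id: rating}, only positively correlated users contribute, and `pearson` and `recommend` are hypothetical names.

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the books both users rated (book_id -> rating)."""
    common = sorted(set(ratings_a) & set(ratings_b))
    n = len(common)
    if n < 2:
        return 0.0
    xs = [ratings_a[book] for book in common]
    ys = [ratings_b[book] for book in common]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # Undefined correlation; treat as "no similarity".
    return cov / math.sqrt(var_x * var_y)


def recommend(target, all_ratings, top_n=3):
    """Rank books the target has not rated by similarity-weighted average rating."""
    totals, weights = {}, {}
    for user, ratings in all_ratings.items():
        if user == target:
            continue
        sim = pearson(all_ratings[target], ratings)
        if sim <= 0:
            continue  # Ignore dissimilar or uncorrelated users.
        for book, rating in ratings.items():
            if book not in all_ratings[target]:
                totals[book] = totals.get(book, 0.0) + sim * rating
                weights[book] = weights.get(book, 0.0) + sim
    predicted = {book: totals[book] / weights[book] for book in totals}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]
```

Calling `recommend("ann", ratings)` returns up to `top_n` book ids, ordered by the rating that users similar to "ann" gave them.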

The books' ratings were generated randomly, and a book index was created by parsing the master_list.csv file located in the books directory. That file contains the title, author, id, and other information for all the downloaded books. To execute the Recommender sub-system, run the following command in the project's directory:

$ python Recommender.py

Crawler

The Crawler sub-system uses the Scrapy web crawling framework. A custom spider was created to parse the Project Gutenberg website and download books. A custom Book item was created to represent the new entities, and a MySQL pipeline was used to insert the books into a database. The books can later be fed to the Indexer with some minor changes.

The database credentials as well as the crawler's configuration are located in the crawler/settings.py configuration file. To run the crawler, execute the following commands in the project's directory:

$ cd crawler
$ scrapy crawl gutenberg
# Set the crawler's max pages to 3
$ scrapy crawl gutenberg -a maxpages=3
