The search from memaskal

Simple search engine with TF-IDF ranking

This project started as a simple search engine following the general idea of this blog post. A starting-point implementation was given in Python and can be found here. The task was to make suitable changes to optimize the given implementation. A walkthrough of all the changes is given in report.pdf, which is currently available only in Greek.

Search engines

There are 3 major stages in developing a search engine:

Finding/Crawling the Data
Building the index
Using the index to answer queries

On top of this, we can add result ranking (tf-idf, PageRank, etc.), query/document classification, and perhaps some machine learning that tracks users' past queries and selected results to improve the search engine's performance.
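The second and third stages can be illustrated with a minimal sketch of a positional inverted index and a conjunctive lookup. This is not the code in BuildIndex.py or QueryIndex.py; the names `tokenize`, `build_index` and `search`, and the sample documents, are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical tokenizer: lowercase runs of letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def build_index(docs):
    """Map each term to {doc_id: [positions of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search(index, query):
    """Return the sorted ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*postings))


# Made-up sample documents.
docs = {1: "The quick brown fox", 2: "The lazy brown dog", 3: "A quick dog"}
index = build_index(docs)
print(search(index, "brown"))      # [1, 2]
print(search(index, "quick dog"))  # [3]
```

Storing positions alongside document ids is what makes phrase queries possible later, since adjacent terms can be checked for consecutive positions.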

Indexer

The Indexer was benchmarked using a small random book subset from the Project Gutenberg collection. The improvements made to the Indexer are:

Conversion of static lists to sets, for faster membership tests. Speedup: +5%
Precompiled Regex, for faster matches. Speedup: +2%
Stopword removal using Modified Penn Treebank Tag-Set2 closed class categories. Inverted Index size: -5%
Multi-process parallel Indexer with 2 worker processes. Speedup: +50%
Serialize inverted index as Pickle file. Inverted Index size: -30%
Apply D-Gap encoding. Inverted Index size: -25%
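Three of the items above (set lookups, a precompiled regex, and D-Gap encoding) can be sketched in a few lines. This is an illustration under assumptions, not the project's code: the stopword list here is a tiny made-up sample rather than the Penn Treebank closed-class list, and the function names are hypothetical.

```python
import re

# Compiled once at import time instead of on every call.
WORD_RE = re.compile(r"[a-z]+")

# A set gives constant-time membership tests on average; a list scans linearly.
# Illustrative sample only, not the stopword list used by the Indexer.
STOPWORDS = frozenset({"the", "a", "an", "of", "and"})


def index_terms(text):
    """Tokenize the text and drop stopwords."""
    return [w for w in WORD_RE.findall(text.lower()) if w not in STOPWORDS]


def dgap_encode(doc_ids):
    """Replace a sorted list of document ids with the gaps between neighbours."""
    gaps = []
    previous = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def dgap_decode(gaps):
    """Rebuild the original document ids with a running sum over the gaps."""
    doc_ids = []
    total = 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


postings = [3, 7, 8, 15, 102]
gaps = dgap_encode(postings)
print(gaps)                             # [3, 4, 1, 7, 87]
print(dgap_decode(gaps) == postings)    # True
print(index_terms("The Return of the King"))  # ['return', 'king']
```

The gaps are never larger than the ids they replace, and for long posting lists they are usually much smaller, so they tend to take less space once serialized.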

With P = 2 worker processes, the overall result is a 2.68x speedup and a 52% reduction in the inverted index file size, compared to the original implementation. To get help on the Indexer sub-system's execution arguments, type the following command in the project's directory:

$ python BuildIndex.py --help

Query

The Query sub-system uses the inverted index and supports standard and phrase queries, with tf-idf ranking. To get help on the Query sub-system's execution arguments, type the following command in the project's directory:

$ python QueryIndex.py --help
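As a rough illustration of tf-idf ranking, the sketch below scores documents against a list of query terms. The exact weighting used by QueryIndex.py is not stated here, so this assumes one common variant: term frequency normalized by document length, multiplied by ln(N / df). The function name `tf_idf_rank` and the sample data are hypothetical.

```python
import math
from collections import Counter


def tf_idf_rank(docs, query_terms):
    """Rank documents by summed tf-idf over the query terms.

    docs maps doc_id -> list of tokens. Returns (doc_id, score) pairs,
    highest score first. Documents matching no query term are omitted.
    """
    n_docs = len(docs)
    term_counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    scores = {}
    for term in query_terms:
        # Document frequency: number of documents containing the term.
        df = sum(1 for counts in term_counts.values() if term in counts)
        if df == 0:
            continue
        idf = math.log(n_docs / df)
        for doc_id, counts in term_counts.items():
            if term in counts:
                tf = counts[term] / len(docs[doc_id])
                scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
    # Sort by descending score, breaking ties by doc_id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up, already tokenized documents.
docs = {
    1: ["brown", "fox", "brown"],
    2: ["brown", "dog"],
    3: ["lazy", "dog"],
}
ranked = tf_idf_rank(docs, ["brown", "fox"])
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

In this example document 1 ranks above document 2 because it contains both query terms, and "fox" is rarer in the collection than "brown", so it carries more weight.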

Recommender

A proof-of-concept Recommender sub-system was created for the Gutenberg collection's books. It is independent of this implementation, but can easily be adapted to work with the Query sub-system. The recommendations are based on other users' ratings of the books (collaborative filtering). The similarity between users is calculated using the Pearson correlation coefficient.
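The Pearson similarity between two users can be sketched as follows. This is a minimal illustration, not the code in Recommender.py: it assumes ratings are stored as dictionaries mapping a book id to a numeric rating, and the function name `pearson` and the sample users are hypothetical.

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the books both users have rated.

    Returns 0.0 when fewer than two books are shared or when either
    user's shared ratings have no variance.
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    n = len(common)
    if n < 2:
        return 0.0
    xs = [ratings_a[book] for book in common]
    ys = [ratings_b[book] for book in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return covariance / math.sqrt(var_x * var_y)


# Made-up ratings keyed by hypothetical book ids.
alice = {"b1": 5, "b2": 3, "b3": 1}
bob = {"b1": 4, "b2": 3, "b3": 2, "b4": 5}
carol = {"b1": 1, "b2": 3, "b3": 5}

print(pearson(alice, bob))    # 1.0  (same taste, different scale)
print(pearson(alice, carol))  # -1.0 (opposite taste)
```

Because the coefficient compares deviations from each user's own mean, a user who rates everything one point lower than another still counts as highly similar. Books rated by the most similar users, but not yet by the target user, are then natural recommendation candidates.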

The books' ratings were generated randomly, and a book index was created by parsing the master_list.csv file located in the books directory. That file contains the title, author, id, etc. for all the downloaded books. To execute the Recommender sub-system, type the following command in the project's directory:

$ python Recommender.py

Crawler

The Crawler sub-system uses the Scrapy web crawling framework. A custom spider was created to parse Project Gutenberg's website and download books. A custom Book item was created to represent the new entities, and a MySQL pipeline was used to insert the books into a database. With some minor changes, the books can later be fed to the Indexer.

The database credentials as well as the crawler's configuration are located in the crawler/settings.py configuration file. To run the crawler, execute the following commands in the project's directory:

$ cd crawler
$ scrapy crawl gutenberg
# Limit the crawl to a maximum of 3 pages
$ scrapy crawl gutenberg -a maxpages=3

Useful Links

Nltk installation and usage:
Search Engines General:
Project Gutenberg’s TOP 600 April's 2003 e-books in .txt format:
- ftp://ftp.ibiblio.org/pub/docs/books/gutenberg/1/1/2/2/11220/ (August 2003 CD)
