CZ4034InfoRetrievalGrp10

The aim of this project is to carry out the tasks in each stage of information retrieval and to apply different methods to enhance and optimise each stage. The stages are Crawling, Indexing, Querying and Classification. In this report, our group selected Amazon Books as the target website for crawling and used the crawled information for further processing.

Python dependencies

Ensure Python and pip are installed on the machine and that both are on the PATH environment variable.

Crawling

Crawl links for books under the selected topic

To start crawling links, cd to the root folder and run:

cd crawl
python crawlLinks.py

crawlLinks.py currently crawls links under the cook-outdoor cooking topic. To crawl links for other topics, change the parameters at line 18 and line 47 of the script.
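As a rough illustration of what the link-crawling step involves, the sketch below pulls book links out of an already-fetched HTML page using only the standard library. It is not the code in crawlLinks.py: the class name `BookLinkParser`, the function `extract_book_links`, the `/dp/` marker used to recognise book detail pages, and the sample HTML are all assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BookLinkParser(HTMLParser):
    """Collect absolute URLs of <a> tags whose href contains a marker substring."""

    def __init__(self, base_url, marker="/dp/"):
        super().__init__()
        self.base_url = base_url
        self.marker = marker  # assumed pattern for book detail pages
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            url = urljoin(self.base_url, href)
            if url not in self.links:  # keep the first occurrence, skip duplicates
                self.links.append(url)


def extract_book_links(html, base_url):
    """Return the de-duplicated book links found in an HTML string."""
    parser = BookLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page fragment standing in for a fetched topic listing.
SAMPLE_HTML = """
<a href="/Some-Book/dp/0000000001">Book one</a>
<a href="/help">Help</a>
<a href="/Some-Book/dp/0000000001">Book one again</a>
<a href="https://www.amazon.com/Other/dp/0000000002">Book two</a>
"""

links = extract_book_links(SAMPLE_HTML, "https://www.amazon.com")
for link in links:
    print(link)
```

Separating parsing from fetching keeps the extraction logic easy to check against saved pages before running a full crawl.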

Crawl book details using the links

To start crawling book details, run:

python crawlBooks.py

Change the input (links) and output (book details) files if you want to crawl different topics.

Indexing

To start indexing, cd to the root folder and run:

cd solr/solr-7.2.1/bin
solr start
solr create -c amazon
cd ..
python solr_indexing.py
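For orientation, the sketch below shows one way documents could be sent to the amazon core through Solr's JSON update endpoint. It is not the contents of solr_indexing.py: the function name `build_update_request`, the document fields (`id`, `title`, `category`) and the sample record are assumptions, and the URL assumes Solr is running locally on its default port 8983.

```python
import json
import urllib.request

# Assumes a local Solr instance on the default port with the "amazon" core created above.
SOLR_UPDATE_URL = "http://localhost:8983/solr/amazon/update?commit=true"


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Build (but do not send) a POST request that adds docs to a Solr core."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical book record; the real field names depend on the crawled data.
books = [{"id": "book-1", "title": "Example Title", "category": "outdoor cooking"}]
request = build_update_request(books)

# To actually index the documents while Solr is running:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Building the request separately from sending it makes it possible to inspect the payload without a running Solr server.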

Querying

To install Django, run:

pip install django

To start the Django web server, run:

cd gui
python manage.py runserver

Open a web browser and go to:

http://127.0.0.1:8000

Type your query for the books, select the book categories, then press Enter or click the submit button to run the query.
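A query with selected categories could translate into a Solr select request along the lines of the sketch below. How the GUI actually talks to Solr is not shown in this README, so the function name `build_query_url`, the `category` field name and the default of 10 rows are assumptions; the URL again assumes a local Solr instance on port 8983 with the amazon core.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumes a local Solr instance on the default port with the "amazon" core.
SOLR_SELECT_URL = "http://localhost:8983/solr/amazon/select"


def build_query_url(text, categories=(), rows=10):
    """Build a select URL with free-text query and an optional category filter."""
    params = [("q", text), ("rows", rows), ("wt", "json")]
    if categories:
        # Filter query of the form category:("a" OR "b"); field name is hypothetical.
        quoted = " OR ".join('"{}"'.format(c) for c in categories)
        params.append(("fq", "category:({})".format(quoted)))
    # Note: the query text is URL-encoded but not escaped for Solr query syntax.
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_query_url("dutch oven", ["outdoor cooking"])
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded)
```

Putting the category restriction in `fq` rather than in `q` keeps it out of relevance scoring, which is the usual reason to use a filter query.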

Classification

To install the Python packages for classification, run:

pip install scikit-learn
pip install pandas
pip install numpy

To run classification, cd to the root folder and run:

cd classification
python classification2.py
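The README does not say which classifier classification2.py uses. Purely as an illustration of text classification with scikit-learn, the sketch below trains a TF-IDF plus multinomial Naive Bayes pipeline on a toy dataset; the training texts, the labels and the choice of model are assumptions, not the project's method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data invented for this example; real input would be crawled book details.
train_texts = [
    "grilling steak over a campfire",
    "dutch oven recipes for camping",
    "introduction to python programming",
    "learning algorithms and data structures",
]
train_labels = ["cooking", "cooking", "computing", "computing"]

# TF-IDF turns each text into a weighted term vector; Naive Bayes classifies the vectors.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predicted = model.predict(["campfire grilling recipes"])[0]
print(predicted)
```

Wrapping the vectoriser and classifier in one pipeline ensures the same text preprocessing is applied at training and prediction time.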

Contributors

getsong, aiqing1130, zhiyanggg, mikiwong
