ialexmp / irwa

Research in the field of Information Retrieval and Web Analytics (IRWA). The project involves the development of a search engine using Python 3, implementing various indexing and ranking algorithms.



Information Retrieval and Web Analytics (IRWA)

Group

Group: G_201_4.

Summary

Based on what we learned in the theory classes, the seminars, the lab exercises and our own research, we are asked to build a search engine in Python 3 that implements different indexing and ranking algorithms.

| Part | Topic | Delivery Date |
|------|-------|---------------|
| Part 1 | Text Processing and Exploratory Data Analysis | 21/10/2023 |
| Part 2 | Indexing and Evaluation | 29/10/2023 |
| Part 3 | Ranking | 14/11/2023 |
| Part 4 | User Interface and Web Analytics | 02/12/2023 |

Project Instructions

Part 1:

This part covers the initial steps of the project. First, we import the data files used throughout the project, which are located in the Data folder. We then clean the text of each tweet using the preprocess() function and, to make the data easier to analyze, convert the dates to a different format using preprocess_date().
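The README does not show the bodies of these two functions, so the following is only a minimal sketch of what such cleaning steps could look like. The stopword list, the regular expression, and both date formats are assumptions chosen for illustration, not the notebook's actual code.

```python
import re
import string
from datetime import datetime

# Tiny illustrative stopword list (assumption; the notebook may use a fuller one).
STOPWORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "for", "on"}


def preprocess(text):
    """Lowercase a tweet, strip URLs and punctuation, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    # Remove ASCII punctuation; this also strips the '#' and '@' prefixes.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]


def preprocess_date(raw, fmt="%a %b %d %H:%M:%S %z %Y"):
    """Convert a date string to day/month/year form.

    The default input format is the classic Twitter 'created_at' layout,
    which is an assumption about the dataset.
    """
    parsed = datetime.strptime(raw, fmt)
    return parsed.strftime("%d/%m/%Y %H:%M:%S")
```

For example, `preprocess("The storm is HERE! https://t.co/abc #Hurricane")` yields `['storm', 'here', 'hurricane']`, and `preprocess_date("Fri Sep 30 18:38:58 +0000 2022")` yields `'30/09/2022 18:38:58'`.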

Next, in the second section, we carried out several analyses to better understand the data, including a word cloud, a histogram and a boxplot. The results of these studies can be observed here.

Finally, to execute this part, simply run the notebook cells in order.

Part 2:

This part focuses on indexing and evaluation. Data preparation involves merging the previous work into a new notebook, creating a new dataframe, and indexing the tweets to build an inverted index. A custom function, "create_index", is developed for this purpose.
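As an illustration of the idea, an inverted index maps each term to the documents that contain it. The sketch below reuses the name create_index, but its signature and the shape of its input (a dictionary of already tokenized tweets) are assumptions rather than the project's actual implementation.

```python
from collections import defaultdict


def create_index(docs):
    """Build a simple inverted index.

    docs: dict mapping doc_id -> list of preprocessed tokens (assumed shape).
    Returns a dict mapping each term to a sorted list of doc ids.
    """
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for term in tokens:
            index[term].add(doc_id)
    # Sorted posting lists make results deterministic and easy to intersect.
    return {term: sorted(ids) for term, ids in index.items()}


tweets = {1: ["flood", "warning"], 2: ["flood", "rescue"]}
index = create_index(tweets)
print(index["flood"])  # posting list for the term "flood"
```

A fuller index would typically also store term positions or frequencies so that ranking functions can use them later.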

A search function, "search", retrieves tweets for specific queries, with the query keywords derived from the word cloud analysis. The evaluation has two components: one uses a subset of the dataset, and the other relies on expert judgment to assess document relevance. Several evaluation metrics are reported for different queries, comparing the two cases.

The analysis also includes a two-dimensional scatter plot, produced with the t-SNE algorithm, to visualize relationships between tweets in the dataset. A notably dense cluster of points suggests that the tweets in that area are similar, and the roughly symmetrical distribution around the origin is read as a sign of balanced word embeddings.
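The retrieval and evaluation steps described above can be sketched as follows. This is a hedged illustration: it assumes conjunctive (AND) matching over a term-to-doc-ids dictionary, and it shows two common metrics, precision@k and average precision, without claiming these are exactly the ones reported in the notebook. The sample index and relevance judgments are made up.

```python
def search(query_tokens, index):
    """Return ids of documents that contain every query term (AND semantics)."""
    postings = [set(index.get(term, ())) for term in query_tokens]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are judged relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k


def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0


# Made-up index and relevance judgments for illustration only.
toy_index = {"flood": [1, 2, 4], "rescue": [2, 4, 5]}
results = search(["flood", "rescue"], toy_index)
print(results, precision_at_k(results, {2}, k=2))
```

Averaging `average_precision` over several queries gives mean average precision (MAP), which is a convenient single number for comparing two evaluation setups.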

The results of these studies can be observed here.

Part 3:

In this section, we explored diverse methods for document retrieval and scoring. We started with TF-IDF and cosine similarity, a traditional approach that assesses document relevance based on term frequency and inverse document frequency. We then introduced Our-Score, which incorporates social media engagement metrics (retweets and likes) to better align with the context of social media content ranking. The comparison revealed that Our-Score can unearth more relevant social media content.
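To make the two scoring approaches concrete, here is a minimal sketch in plain Python. The TF-IDF weighting (log-scaled term frequency times log inverse document frequency) is one standard variant and may differ from the notebook's. The function `engagement_score` and its `alpha` parameter are hypothetical: the README does not give the Our-Score formula, so this only shows one possible way to blend textual similarity with likes and retweets.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: dict doc_id -> tokens. Returns (doc_id -> {term: weight}, idf)."""
    n_docs = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        vectors[doc_id] = {t: (1 + math.log(c)) * idf[t] for t, c in tf.items()}
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_tfidf(query_tokens, docs):
    """Rank documents by cosine similarity to the query, best first."""
    vectors, idf = tfidf_vectors(docs)
    q_tf = Counter(t for t in query_tokens if t in idf)
    q_vec = {t: (1 + math.log(c)) * idf[t] for t, c in q_tf.items()}
    scored = ((d, cosine(q_vec, vec)) for d, vec in vectors.items())
    return sorted((p for p in scored if p[1] > 0), key=lambda p: -p[1])


def engagement_score(cos_sim, likes, retweets, alpha=0.7):
    """HYPOTHETICAL blend of text similarity and engagement (not Our-Score itself)."""
    popularity = math.log1p(likes) + math.log1p(retweets)
    # Squash popularity into [0, 1) so it stays comparable to the cosine score.
    return alpha * cos_sim + (1 - alpha) * popularity / (1 + popularity)
```

With this kind of blend, two tweets with the same textual similarity are ordered by engagement, while `alpha` controls how much weight the text match keeps.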

Additionally, we experimented with Word2Vec embeddings to capture semantic similarity between words, and discussed the potential benefits and challenges of transformer-based embeddings.
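A common way to use word embeddings for retrieval is to average the vectors of the words in a tweet and compare the result with the averaged query vector. The sketch below assumes that approach; the three-dimensional vectors are made-up toy values standing in for the output of a trained Word2Vec model, and the function names are illustrative.

```python
import math

# Toy word vectors (made-up values); a real model would supply these.
EMBEDDINGS = {
    "flood": [0.9, 0.1, 0.0],
    "storm": [0.8, 0.2, 0.1],
    "rescue": [0.6, 0.5, 0.0],
    "sunny": [0.0, 0.1, 0.9],
}


def tweet_vector(tokens, embeddings):
    """Average the vectors of known tokens; return None if none are known."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


query = tweet_vector(["flood", "rescue"], EMBEDDINGS)
for name in ("storm", "sunny"):
    print(name, round(cosine(query, tweet_vector([name], EMBEDDINGS)), 3))
```

Unlike TF-IDF, this lets a query match a tweet that shares no exact terms with it, as long as their words lie close together in the embedding space.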

The results of these studies can be observed here.

Part 4:

Our project delivers an intuitive search engine interface with refined result displays, including user-friendly features like a navigation bar, session insights, emotional analysis, and a Light/Dark mode. We optimize performance by persistently storing the index for faster loads. On the analytics front, we offer dynamic showcases of user interactions, an intuitive dashboard for document clicks, sentiment analysis, and insightful visualizations of searched queries. Session details, including IP and engagement duration, are captured and saved, enriching user experience. These implementations collectively prioritize user-centric interaction and provide robust analytics for understanding user behavior within the search engine.
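Persisting the index so that later loads are faster could be done along the following lines. This is only a sketch under assumptions: the README does not say which storage format is used, so `pickle` is one plausible choice, and the function name, the file path and the builder callback are all hypothetical.

```python
import os
import pickle


def load_or_build_index(path, build_fn):
    """Load a cached index from `path`, or build it once and cache it.

    path: location of the cache file (hypothetical, e.g. "index.pkl").
    build_fn: zero-argument callable that builds the index from scratch.
    """
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)  # fast path: reuse the stored index
    index = build_fn()  # slow path: build the index once
    with open(path, "wb") as fh:
        pickle.dump(index, fh)
    return index
```

The first call pays the full indexing cost, and every later start of the web application only needs to read the file. Pickle files should only be loaded from trusted locations, and the cache has to be deleted or rebuilt whenever the underlying dataset changes.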

Detailed information about this section and the final results is accessible here.
