twitisen's Introduction

📊 Twitisen – Your Twitterverse ✨📲

In our Software Developer coursework, we used the Twitter API to collect tweet data and Natural Language Processing (NLP) to analyze it. The program is designed to be user-friendly. It is written in Python, tested with Unittest, and built around an Extract, Load, Transform (ELT) pipeline. It does not just gather tweets: it turns the raw data into visualizations that show trends and patterns from the Twitterverse. All data is stored in MongoDB.

Lessons Learned 🎓

🧠 NLP (Natural Language Processing) 🧠

  • πŸ“ˆ Implementing NLP techniques for sentiment analysis to gauge user opinions.
  • βš™οΈ Filtering Unnecessary Data: Removing emojis and special characters to clean the text.
  • βœ‚οΈ Tokenization: Breaking down text into individual tokens (words or phrases).
  • βš–οΈ Normalization: Converting verbs from their base form (Verb 3) to their infinitive form (Verb 1).
  • β›” Removing Stopwords: Eliminating common words (e.g., "the," "and") that don't carry significant meaning.
  • πŸ“Š Sentiment Analysis: Determining the emotional tone or sentiment expressed in the text.
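The steps above can be sketched for English text as follows. This is a minimal illustration using only the standard library, not the project's actual code: the tiny `STOPWORDS`, `BASE_FORMS`, and `LEXICON` tables and all function names are hypothetical stand-ins for what a library such as NLTK would provide.

```python
import re

# Hypothetical, deliberately tiny tables; a real project would use library data.
STOPWORDS = {"i", "the", "and", "a", "is", "was", "it", "this"}
BASE_FORMS = {"loved": "love", "broken": "break", "went": "go", "gone": "go"}
LEXICON = {"love": 1, "great": 1, "break": -1, "bad": -1}


def clean(text):
    """Drop URLs, then replace everything that is not an English letter."""
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    return text.lower()


def tokenize(text):
    """Split cleaned text into word tokens on whitespace."""
    return text.split()


def normalize(tokens):
    """Map inflected verb forms to their base form where known."""
    return [BASE_FORMS.get(token, token) for token in tokens]


def remove_stopwords(tokens):
    """Discard common words that carry little meaning."""
    return [token for token in tokens if token not in STOPWORDS]


def sentiment(tokens):
    """Sum word scores from the lexicon and map the total to a label."""
    score = sum(LEXICON.get(token, 0) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def analyze(text):
    """Run the full chain: clean, tokenize, normalize, filter, score."""
    tokens = remove_stopwords(normalize(tokenize(clean(text))))
    return tokens, sentiment(tokens)
```

Calling `analyze("I loved this phone!! https://t.co/abc")` runs every step in order and returns the remaining tokens together with a sentiment label. A lexicon lookup is only one simple way to score sentiment; the project may well use a different method.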

Python Programming

  • 🐍 Mastering Python for efficient scripting and data manipulation.
  • πŸ§ͺ Writing modular and reusable code for improved maintainability.
  • 🌐 Utilizing Python for data extraction, transformation, and loading (ETL) processes.

Twitter API

  • 🕊️ Extracting data from Twitter using the Twitter API.
  • 🔄 Transforming raw Twitter data for analysis and visualization.

GUI (Graphical User Interface)

  • 🖥️ Developing user-friendly graphical interfaces for data visualization.
  • 🎨 Enhancing user experience through intuitive design.

Unittest

  • 🧪 Implementing unit tests for code reliability and robustness.
  • 🚀 Ensuring the correctness of data extraction and transformation processes.

Data Visualization

  • 📊 Creating compelling visualizations to convey insights effectively.
  • 📈 Using tools like Matplotlib or Plotly for graphical representation.

ELT Pipeline (Extract, Load, Transform)

  • 🚀 Designing and implementing efficient ELT pipelines.
  • 🔄 Extracting data from sources, loading it into databases, and transforming it there.

MongoDB

  • 🗄️ Storing and retrieving data efficiently using MongoDB.
  • 🔐 Ensuring data security and scalability.

Screenshots 📷

🛢️🔗 ELT Pipeline 🔗🛢️
Executive Summary
Firstly, we extract tweets from the Twitter API. The API provides information such as id, username, datetime, text, favorite count, retweet count, and location.
Secondly, we store the extracted data in MongoDB. We refer to this dataset as raw data.
Thirdly, we transform the data. We filter out URLs, digits, emojis, and special characters using our custom implementation and Lexto+. The dataset contains many languages, but we focus solely on Thai and English. For Thai, we tokenize and normalize using Lexto+; for English, we use NLTK. We remove Thai stop words with PyThaiNLP and English stop words with NLTK.
Fourthly, we store the cleaned data in MongoDB. We refer to this dataset as clean data.
Lastly, we use the clean data for data visualization. The visualizations include sentiments, a donut chart, a word cloud, a bar chart, and a spatial chart, all of which are presented in the GUI.
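The flow of the first four steps can be sketched as follows. This is only an illustration under stated assumptions: two in-memory dictionaries stand in for the raw and clean MongoDB stores, `extract()` returns hard-coded sample tweets instead of calling the Twitter API, and the transform is reduced to stripping URLs and keeping English and Thai characters. All names are hypothetical.

```python
import re

raw_store = {}    # stand-in for the raw data stored in MongoDB
clean_store = {}  # stand-in for the clean data stored in MongoDB


def extract():
    """Hypothetical stand-in for a Twitter API call; returns sample tweets."""
    return [
        {"id": 1, "username": "alice", "text": "Great phone!! https://t.co/x 123 ดีมาก 😀"},
        {"id": 2, "username": "bob", "text": "Battery died after 2 days..."},
    ]


def load(store, docs):
    """Load documents into a store, keyed by tweet id."""
    for doc in docs:
        store[doc["id"]] = doc


def transform(doc):
    """Strip URLs, then keep only English letters and the Thai Unicode block."""
    text = re.sub(r"https?://\S+", " ", doc["text"])
    text = re.sub(r"[^A-Za-z\u0E00-\u0E7F\s]", " ", text)
    return {"id": doc["id"], "text": " ".join(text.split())}


def run_pipeline():
    """Extract, load the raw data, then transform it and load the result."""
    load(raw_store, extract())
    load(clean_store, [transform(doc) for doc in raw_store.values()])


run_pipeline()
```

Because the raw data is loaded before any transformation, the original tweets stay available in `raw_store` and the transform can be rerun or changed later without extracting again.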

🗂️💽 Database Schema 💽🗂️
We have four independent databases. The 'tweets' database contains the raw data collected from the Twitter API. The 'cleaned_data' database stores the transformed (cleaned) data. The 'locations' database holds the location and coordinates of tweets. The 'sentiments' database stores the keywords that users search for in the Twitter search bar and the corresponding ranked results.

  • tweets: PK: id, FK: location
  • cleaned_data: PK: id
  • locations: PK: id
  • sentiments: PK: id

🔍🧐 Competitor Analysis 🧐🔍
Before creating our pipeline, we researched competing tools. We aimed to combine the strengths and address the weaknesses identified during the analysis.

🤖📥 Extracting Algorithm 📥🤖
[Figure: extract]
The algorithm sweeps the timeline in 14-day periods, creating a checkpoint for each period. The extraction window covers the 7 days before and the 7 days after the checkpoint.
[Figure: extract]
If the sweep reaches the end of the timeline, the checkpoint is set to the end date. In some cases, this results in tweets being extracted twice. However, there is no need to worry, because an algorithm checks whether the data has already been extracted: it compares the tweet ID of the incoming tweet with the tweet IDs already in your database.
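A minimal sketch of the sweep and the duplicate check is shown below. It assumes that the first checkpoint sits 7 days after the start date, and it uses a plain set of tweet IDs in place of a database lookup. The function names are hypothetical and not taken from the project.

```python
from datetime import date, timedelta

HALF_WINDOW = timedelta(days=7)  # 7 days on either side of a checkpoint
PERIOD = timedelta(days=14)      # distance between two checkpoints


def checkpoints(start, end):
    """Place a checkpoint every 14 days; the last one is the end date."""
    result = []
    current = start + HALF_WINDOW  # assumption: first checkpoint is mid-window
    while current < end:
        result.append(current)
        current += PERIOD
    result.append(end)  # the sweep reached the end of the timeline
    return result


def window(checkpoint):
    """Extraction window: 7 days before to 7 days after the checkpoint."""
    return checkpoint - HALF_WINDOW, checkpoint + HALF_WINDOW


def store_new(tweets, stored_ids):
    """Return only tweets whose ID is not stored yet, and record their IDs."""
    fresh = []
    for tweet in tweets:
        if tweet["id"] not in stored_ids:
            stored_ids.add(tweet["id"])
            fresh.append(tweet)
    return fresh
```

For a timeline from 1 to 31 January 2023, the last window (around 31 January) overlaps the one before it, which is exactly the case where the ID check in `store_new` prevents a tweet from being stored twice.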

📅🕑 Timeline Classification 🕑📅
[Figure: timeline]
This example shows how it works on a continuous timeline, where every day is consecutive. The green lines represent the checkpoints.
[Figure: timeline]
I implemented a binary search for timeline classification, which makes it faster than the regular approach.
[Figure: timeline]
If the time period spans an odd number of days, we calculate the checkpoint using the formula shown. In this case, we extract the checkpoint first.
[Figure: timeline]
From the checkpoint, we then extract two dates at a time.
[Figure: timeline]
If the time period spans an even number of days, we calculate the checkpoints using the formula shown. In this case, we extract both checkpoints at the same time.
[Figure: timeline]
As in the odd case, we then extract two dates at a time from the checkpoints.
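The checkpoint formulas are not spelled out in the text, so the sketch below is only one plausible reading of the odd and even cases above. It assumes the checkpoint is the middle day of the period (or the two middle days when the period is even) and that extraction then moves outward one day on each side per step. `extraction_order` and the use of zero-based day indices are hypothetical.

```python
def extraction_order(n_days):
    """Return the order in which day indices 0..n_days-1 are extracted.

    Each tuple is one step: the checkpoint(s) first, then one day on
    either side of the already extracted range.
    """
    if n_days <= 0:
        return []
    if n_days % 2 == 1:
        # Odd period: a single middle checkpoint is extracted first.
        left = right = n_days // 2
        order = [(left,)]
    else:
        # Even period: the two middle checkpoints are extracted together.
        left, right = n_days // 2 - 1, n_days // 2
        order = [(left, right)]
    while left > 0:
        # Move outward and extract two dates at a time.
        left, right = left - 1, right + 1
        order.append((left, right))
    return order
```

For a 5-day period, this visits the middle day, then days 1 and 3, then days 0 and 4. For a 4-day period, it visits days 1 and 2 together, then days 0 and 3.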
[Figure: timeline]
This example shows a discrete timeline, where the days are not consecutive. The green lines represent the checkpoints, which we calculate using the formula shown.
[Figure: timeline]
Since there are two types of timeline, continuous and discrete, we use an algorithm to tell them apart. First, we sort the timeline, and then we check whether the dates are consecutive. If they are, it is a continuous timeline; if not, it is a discrete timeline.
[Figure: timeline]
Two dates are considered consecutive if they differ by exactly 1 day. However, if any difference is not 1, the sum of consecutive differences will be less than the length of the timeline.
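The check described above can be sketched as follows. The sketch assumes the timeline is a list of `datetime.date` values and treats a timeline of n distinct dates as continuous when all n - 1 adjacent differences equal 1 day. The function name `classify_timeline` is hypothetical.

```python
from datetime import date


def classify_timeline(dates):
    """Label a list of dates as 'continuous' or 'discrete'."""
    ordered = sorted(set(dates))  # sort first; ignore repeated dates
    # Difference in days between each pair of adjacent dates.
    diffs = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    # Count how many adjacent pairs are exactly one day apart.
    consecutive = sum(1 for diff in diffs if diff == 1)
    return "continuous" if consecutive == len(diffs) else "discrete"
```

For example, 1, 2, and 3 January in any order form a continuous timeline, while 1, 3, and 4 January form a discrete one because the first pair is two days apart.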
[Figure: timeline]
Lastly, this shows the difference between the two timeline types.

GUI Designing 🎨

This is our initial design, sketched by hand. We created a rough low-fidelity (lo-fi) draft of the GUI and then turned it into a working GUI using PyQt5. The disadvantages of this design are:

  • No spatial chart
  • Few options for extracting data
  • The search input field is too large
  • Only three tabs are shown
  • Bad layout

🗜️🧩 Prototype 1 🧩🗜️
[Images: prototype1 (3 images), GUI1 (3 images)]

This is our second design. Again, we created a rough low-fidelity (lo-fi) draft of the GUI and then turned it into a working GUI using PyQt5. This time, we named the program Twitter Harvest and recolored it in dark mode. The disadvantages of this design are:

  • There is an unnecessary push button.
  • Too many separate pages make the program difficult to use.
  • The layout of the elements is inconsistent.
  • There are too many push buttons, which makes the program difficult to use.

🗜️🧩 Prototype 2 🧩🗜️
[Images: prototype2 (6 images)]

Description of prototype 3

🗜️🧩 Prototype 3 🧩🗜️
[Images: prototype3 (8 images)]

Description of version 1.0

🎉🚀 Version 1.0 🎉🚀
[Images: ver1 (8 images)]

Contributor 👩‍💻👨‍💻

  • nshpam
  • tw94sh
