In our Software Development coursework, we used the Twitter API to collect tweet data and applied Natural Language Processing (NLP) to analyze it. The program integrates Python, Unittest for testing, and an Extract, Load, Transform (ELT) pipeline for data processing, and stores its data securely in MongoDB. It does more than gather information: it transforms raw tweets into meaningful visualizations that surface trends and patterns from the Twitterverse. This project reflects our team's collaboration in building a complete program for Twitter data analysis and visualization.
Firstly, we extract tweets from the Twitter API, which provides fields such as id, username, datetime, text, favorite count, retweet count, and location. Secondly, we store the extracted data in MongoDB; we refer to this dataset as raw data. Thirdly, we transform the data: we filter out URL symbols, numeric symbols, emojis, and special characters using our custom implementation and Lexto+. Because the dataset contains tweets in many languages, we focus only on Thai and English. For Thai, we tokenize and normalize with Lexto+; for English, we use NLTK. We remove Thai stop words with PyThaiNLP and English stop words with NLTK. Fourthly, we store the cleaned data in MongoDB, naming this dataset clean data. Lastly, we use the cleaned data for visualization: sentiment analysis, a donut chart, a word cloud, a bar chart, and a spatial chart, all presented in the GUI.
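As a rough illustration of the filtering step, here is a minimal sketch using Python's `re` module. This is a simplified stand-in: the actual program uses our custom implementation plus Lexto+ for Thai and NLTK for English, with stop words removed by PyThaiNLP and NLTK.

```python
import re

def clean_tweet(text: str) -> str:
    """Strip URLs, numeric symbols, emojis, and special characters.

    A minimal sketch of the filtering step only; real tokenization and
    normalization are done with Lexto+ (Thai) and NLTK (English).
    """
    text = re.sub(r"https?://\S+", " ", text)  # URL symbols
    text = re.sub(r"\d+", " ", text)           # numeric symbols
    text = re.sub(r"[^\w\s]", " ", text)       # emojis / special characters
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

print(clean_tweet("Check https://t.co/abc!!! 123 hello"))  # → "Check hello"
```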
We have four independent databases. The 'tweets' database will contain raw data collected from the Twitter API. The 'cleaned_data' database will store transformed or cleaned data. The 'locations' database will include the location and coordinates of tweets. The 'sentiments' database will house keywords that users use for searching in the Twitter search bar and the corresponding ranked results.
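For illustration, example documents for each of the four databases might look like the following. The field names follow the API fields listed above; all values here are hypothetical.

```python
# Hypothetical example documents for the four databases.
raw_tweet = {  # 'tweets': raw data from the Twitter API
    "id": 1234567890,
    "username": "example_user",
    "datetime": "2023-06-01T09:00:00Z",
    "text": "Hello from Bangkok! https://t.co/xyz",
    "favorite_count": 3,
    "retweet_count": 1,
    "location": "Bangkok",
}
cleaned_tweet = {  # 'cleaned_data': transformed text, keyed by the same id
    "id": 1234567890,
    "tokens": ["hello", "bangkok"],
}
location_doc = {  # 'locations': location and coordinates of a tweet
    "tweet_id": 1234567890,
    "location": "Bangkok",
    "coordinates": [100.5018, 13.7563],  # hypothetical lon/lat
}
sentiment_doc = {  # 'sentiments': search keyword and ranked results
    "keyword": "coffee",
    "ranked_results": [1234567890],
}
```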
Before creating our pipeline, we researched competing tools, aiming to merge their strengths and improve on the weaknesses we identified during the analysis.
The algorithm sweeps the timeline in 14-day periods, creating a checkpoint for each period. The extraction area covers the 7 days before and the 7 days after each checkpoint.
If the sweep reaches the end of the timeline, the checkpoint is set to the end date. In some cases, this may result in a duplicate extraction. This is not a problem, because an algorithm checks whether the data has already been extracted by comparing the tweet ID of the desired tweet with the tweet IDs in our database.
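The two steps above can be sketched as follows. This is one plausible reading of the sweep (the exact checkpoint placement in our code may differ), and a plain set stands in for the MongoDB ID lookup.

```python
from datetime import date, timedelta

def make_checkpoints(start: date, end: date, period_days: int = 14):
    """Sweep the timeline in 14-day periods. Each checkpoint sits in the
    middle of its period, so extraction covers 7 days before and 7 days
    after it. If the sweep runs past the end of the timeline, the final
    checkpoint is clamped to the end date (which may cause a duplicate
    extraction, caught by the ID check below)."""
    checkpoints = []
    cp = start + timedelta(days=period_days // 2)
    while cp < end:
        checkpoints.append(cp)
        cp += timedelta(days=period_days)
    checkpoints.append(end)  # clamp the last checkpoint to the end date
    return checkpoints

def is_already_extracted(tweet_id: int, stored_ids: set) -> bool:
    """Compare the desired tweet's ID against IDs already stored
    (a set stands in for a query against the 'tweets' database)."""
    return tweet_id in stored_ids
```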
This is an example of how it works: the green line represents the checkpoints. This is a continuous timeline, where each day is consecutive.
We implemented a binary search for timeline classification, making it faster than a straightforward linear check.
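One way a binary search can do this faster than checking every pair (an assumed sketch; our actual implementation may differ): in a sorted, fully consecutive prefix, `days[i]` equals `days[0]` plus `i` days, so the first gap can be located in O(log n) comparisons.

```python
from datetime import date, timedelta

def first_gap(days):
    """Binary-search a sorted timeline for the first gap.

    Invariant: in a fully consecutive prefix, days[i] == days[0] + i days.
    Returns the index of the first date that breaks consecutiveness,
    or None if the timeline is continuous.
    """
    if len(days) < 2:
        return None
    lo, hi = 0, len(days) - 1
    if (days[hi] - days[0]).days == hi:
        return None  # fully continuous, no gap anywhere
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if (days[mid] - days[0]).days == mid:
            lo = mid  # prefix up to mid is still consecutive
        else:
            hi = mid  # a gap lies at or before mid
    return hi
```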
If the time period spans an odd number of days, we calculate the checkpoint using the formula shown. In this process, we extract the checkpoint first.
From the checkpoint, we then extract two dates at the same time, one on each side.
If the time period spans an even number of days, we calculate the checkpoints using the formula shown. In this process, we extract both checkpoints at the same time.
As before, from the checkpoints we extract two dates at the same time.
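A hedged sketch of the extraction order for both cases follows. The midpoint calculations here are assumptions standing in for the formulas shown in the figures; only the extraction order (checkpoint first, then pairs of dates moving outwards) is taken from the text.

```python
def middle_out(days):
    """Return the extraction order for a window of days.

    Odd-length window: the single middle day is the checkpoint and is
    extracted first. Even-length window: the two middle days are the
    checkpoints and are extracted together. After the checkpoint(s),
    pairs of dates (one before, one after) are extracted at the same time.
    """
    n = len(days)
    mid = n // 2
    order = []
    if n % 2 == 1:
        order.append((days[mid],))           # odd: extract the checkpoint first
        for k in range(1, mid + 1):
            order.append((days[mid - k], days[mid + k]))
    else:
        lo, hi = mid - 1, mid
        order.append((days[lo], days[hi]))   # even: both checkpoints at once
        for k in range(1, lo + 1):
            order.append((days[lo - k], days[hi + k]))
    return order
```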
This is an example of how it works: the green line represents the checkpoints. This is a discrete timeline, where the days are non-consecutive. We calculate the checkpoint using the formula shown.
Since we have two types of timelines, continuous and discrete, we use an algorithm to distinguish them. First, we sort the timeline, and then we check whether the dates are consecutive. If they are, it is a continuous timeline; if not, it is a discrete timeline.
A date difference of 1 means two dates are consecutive. If every difference is 1, the differences sum to the timeline's length minus one, which is less than its length; if any difference is greater than 1, the sum is at least the length of the timeline, so the timeline is discrete.
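The sort-and-check classification above can be sketched as:

```python
from datetime import date

def classify_timeline(timeline):
    """Sort the timeline, then check that each date follows the previous
    one by exactly one day; if so it is continuous, otherwise discrete."""
    days = sorted(timeline)
    if all((b - a).days == 1 for a, b in zip(days, days[1:])):
        return "continuous"
    return "discrete"
```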
Lastly, this shows the difference between the two types of timelines.
GUI Designing 🎨
Back to top
This is our initial design, sketched by hand. We created a rough draft of the GUI in a low-fidelity (lo-fi) format and transformed it into a GUI using PyQt5. The disadvantage of this design is...
This is our second design. We created a rough draft of the GUI in a low-fidelity (lo-fi) format and transformed it into a GUI using PyQt5. This time we named the program Twitter Harvest and recolored it in dark mode. The disadvantages of this design are:
There is an unnecessary push button.
Too many separate pages make the program difficult for users to navigate.