In our Software Development coursework, we used the Twitter API to collect tweet data and applied Natural Language Processing (NLP) to analyze it. The program integrates Python, Unittest for testing, and an Extract, Load, Transform (ELT) pipeline for data processing, and stores its data securely in MongoDB. It does more than gather information: it transforms raw tweets into meaningful visualizations that surface trends and patterns from the Twitterverse. This project reflects our team's collaboration in building a complete program for Twitter data analysis and visualization.
Firstly, we extract tweets from the Twitter API, which provides fields such as id, username, datetime, text, favorite count, retweet count, and location. Secondly, we store the extracted data in MongoDB; we refer to this dataset as raw data. Thirdly, we transform the data: we filter out URL symbols, numeric symbols, emojis, and special characters using our custom implementation and Lexto+. Because the dataset contains tweets in many languages, we focus only on Thai and English. For Thai, we tokenize and normalize with Lexto+; for English, we use NLTK. We remove Thai stop words with PyThaiNLP and English stop words with NLTK. Fourthly, we store the cleaned data in MongoDB, naming this dataset clean data. Lastly, we use the cleaned data for visualization: sentiment analysis, a donut chart, a word cloud, a bar chart, and a spatial chart, all presented in the GUI.
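As a rough illustration of the filtering step, here is a minimal sketch using Python's `re` module. This is a simplified stand-in: the actual program uses our custom implementation plus Lexto+ for Thai and NLTK for English, with stop words removed by PyThaiNLP and NLTK.

```python
import re

def clean_tweet(text: str) -> str:
    """Strip URLs, numeric symbols, emojis, and special characters.

    A minimal sketch of the filtering step only; real tokenization and
    normalization are done with Lexto+ (Thai) and NLTK (English).
    """
    text = re.sub(r"https?://\S+", " ", text)  # URL symbols
    text = re.sub(r"\d+", " ", text)           # numeric symbols
    text = re.sub(r"[^\w\s]", " ", text)       # emojis / special characters
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

print(clean_tweet("Check https://t.co/abc!!! 123 hello"))  # → "Check hello"
```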
We have four independent databases. The 'tweets' database will contain raw data collected from the Twitter API. The 'cleaned_data' database will store transformed or cleaned data. The 'locations' database will include the location and coordinates of tweets. The 'sentiments' database will house keywords that users use for searching in the Twitter search bar and the corresponding ranked results.
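For illustration, example documents for each of the four databases might look like the following. The field names follow the API fields listed above; all values here are hypothetical.

```python
# Hypothetical example documents for the four databases.
raw_tweet = {  # 'tweets': raw data from the Twitter API
    "id": 1234567890,
    "username": "example_user",
    "datetime": "2023-06-01T09:00:00Z",
    "text": "Hello from Bangkok! https://t.co/xyz",
    "favorite_count": 3,
    "retweet_count": 1,
    "location": "Bangkok",
}
cleaned_tweet = {  # 'cleaned_data': transformed text, keyed by the same id
    "id": 1234567890,
    "tokens": ["hello", "bangkok"],
}
location_doc = {  # 'locations': location and coordinates of a tweet
    "tweet_id": 1234567890,
    "location": "Bangkok",
    "coordinates": [100.5018, 13.7563],  # hypothetical lon/lat
}
sentiment_doc = {  # 'sentiments': search keyword and ranked results
    "keyword": "coffee",
    "ranked_results": [1234567890],
}
```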
Before creating our pipeline, we researched competing tools, aiming to merge their strengths and improve on the weaknesses we identified during the analysis.
The algorithm sweeps the timeline in 14-day periods, creating a checkpoint for each period. The extraction area covers the 7 days before and the 7 days after each checkpoint.
If the sweep reaches the end of the timeline, the checkpoint is set to the end date. In some cases, this may result in a duplicate extraction. This is not a problem, because an algorithm checks whether the data has already been extracted by comparing the tweet ID of the desired tweet with the tweet IDs in our database.
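The two steps above can be sketched as follows. This is one plausible reading of the sweep (the exact checkpoint placement in our code may differ), and a plain set stands in for the MongoDB ID lookup.

```python
from datetime import date, timedelta

def make_checkpoints(start: date, end: date, period_days: int = 14):
    """Sweep the timeline in 14-day periods. Each checkpoint sits in the
    middle of its period, so extraction covers 7 days before and 7 days
    after it. If the sweep runs past the end of the timeline, the final
    checkpoint is clamped to the end date (which may cause a duplicate
    extraction, caught by the ID check below)."""
    checkpoints = []
    cp = start + timedelta(days=period_days // 2)
    while cp < end:
        checkpoints.append(cp)
        cp += timedelta(days=period_days)
    checkpoints.append(end)  # clamp the last checkpoint to the end date
    return checkpoints

def is_already_extracted(tweet_id: int, stored_ids: set) -> bool:
    """Compare the desired tweet's ID against IDs already stored
    (a set stands in for a query against the 'tweets' database)."""
    return tweet_id in stored_ids
```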
This is an example of how it works: the green line represents the checkpoints. This is a continuous timeline, where each day is consecutive.
We implemented a binary search for timeline classification, making it faster than a straightforward linear check.
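One way a binary search can do this faster than checking every pair (an assumed sketch; our actual implementation may differ): in a sorted, fully consecutive prefix, `days[i]` equals `days[0]` plus `i` days, so the first gap can be located in O(log n) comparisons.

```python
from datetime import date, timedelta

def first_gap(days):
    """Binary-search a sorted timeline for the first gap.

    Invariant: in a fully consecutive prefix, days[i] == days[0] + i days.
    Returns the index of the first date that breaks consecutiveness,
    or None if the timeline is continuous.
    """
    if len(days) < 2:
        return None
    lo, hi = 0, len(days) - 1
    if (days[hi] - days[0]).days == hi:
        return None  # fully continuous, no gap anywhere
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if (days[mid] - days[0]).days == mid:
            lo = mid  # prefix up to mid is still consecutive
        else:
            hi = mid  # a gap lies at or before mid
    return hi
```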
If the time period spans an odd number of days, we calculate the checkpoint using the formula shown. In this process, we extract the checkpoint first.
From the checkpoint, we then extract two dates at the same time, one on each side.
If the time period spans an even number of days, we calculate the checkpoints using the formula shown. In this process, we extract both checkpoints at the same time.
As before, from the checkpoints we extract two dates at the same time.
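A hedged sketch of the extraction order for both cases follows. The midpoint calculations here are assumptions standing in for the formulas shown in the figures; only the extraction order (checkpoint first, then pairs of dates moving outwards) is taken from the text.

```python
def middle_out(days):
    """Return the extraction order for a window of days.

    Odd-length window: the single middle day is the checkpoint and is
    extracted first. Even-length window: the two middle days are the
    checkpoints and are extracted together. After the checkpoint(s),
    pairs of dates (one before, one after) are extracted at the same time.
    """
    n = len(days)
    mid = n // 2
    order = []
    if n % 2 == 1:
        order.append((days[mid],))           # odd: extract the checkpoint first
        for k in range(1, mid + 1):
            order.append((days[mid - k], days[mid + k]))
    else:
        lo, hi = mid - 1, mid
        order.append((days[lo], days[hi]))   # even: both checkpoints at once
        for k in range(1, lo + 1):
            order.append((days[lo - k], days[hi + k]))
    return order
```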
This is an example of how it works: the green line represents the checkpoints. This is a discrete timeline, where the days are non-consecutive. We calculate the checkpoint using the formula shown.
Since we have two types of timelines, continuous and discrete, we use an algorithm to distinguish them. First, we sort the timeline, and then we check whether the dates are consecutive. If they are, it is a continuous timeline; if not, it is a discrete timeline.
A date difference of 1 means two dates are consecutive. If every difference is 1, the differences sum to the timeline's length minus one, which is less than its length; if any difference is greater than 1, the sum is at least the length of the timeline, so the timeline is discrete.
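The sort-and-check classification above can be sketched as:

```python
from datetime import date

def classify_timeline(timeline):
    """Sort the timeline, then check that each date follows the previous
    one by exactly one day; if so it is continuous, otherwise discrete."""
    days = sorted(timeline)
    if all((b - a).days == 1 for a, b in zip(days, days[1:])):
        return "continuous"
    return "discrete"
```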
Lastly, this shows the difference between the two types of timelines.
GUI Designing 🎨
Back to top
This is our initial design, sketched by hand. We created a rough draft of the GUI in a low-fidelity (lo-fi) format and transformed it into a GUI using PyQt5. The disadvantage of this design is...
This is our second design. We created a rough draft of the GUI in a low-fidelity (lo-fi) format and transformed it into a GUI using PyQt5. This time we named the program Twitter Harvest and recolored it in dark mode. The disadvantages of this design are:
There is an unnecessary push button.
Too many separate pages make the program difficult for users to navigate.