Lambda Architecture implementation using Apache Storm, Hadoop and HBase to perform real-time image processing and analysis on Twitter data.
The goal of this project is to find the most representative images among those collected from Twitter for a given keyword.
To find these representative images, we use the K-Means clustering algorithm.
Link to the whole paper
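As a rough illustration of the clustering idea (a minimal in-memory sketch with hypothetical names, not the project's Hadoop/MapReduce implementation): each image's feature vector is assigned to its nearest center, and the centers are recomputed until their total movement drops below a threshold.

```java
import java.util.Arrays;

// Minimal in-memory K-Means sketch (hypothetical helper, for illustration only):
// assign each feature vector to its nearest center, recompute the centers,
// and stop once the centers move less than a threshold.
public class KMeansSketch {

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    static int nearestCenter(double[] point, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dist = euclidean(point, centers[c]);
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    // points: one feature vector per image; k: number of centers.
    static double[][] cluster(double[][] points, int k, double threshold) {
        // Initialize centers with the first k points (a simplification).
        double[][] centers = Arrays.copyOfRange(points, 0, k);
        double movement;
        do {
            double[][] sums = new double[k][points[0].length];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = nearestCenter(p, centers);
                counts[c]++;
                for (int i = 0; i < p.length; i++) sums[c][i] += p[i];
            }
            movement = 0.0;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue;          // keep empty centers unchanged
                double[] newCenter = new double[sums[c].length];
                for (int i = 0; i < newCenter.length; i++) newCenter[i] = sums[c][i] / counts[c];
                movement += euclidean(centers[c], newCenter);
                centers[c] = newCenter;
            }
        } while (movement > threshold);                // stop once centers stabilize
        return centers;
    }
}
```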
- Apache Hadoop-3.2.1
- Apache HBase-2.2.3
- Apache Storm-2.1.0
- Twitter4j-4.0.4
- Lire ( Lucene Image Retrieval )-8.0.0
- Gradle-6.3
HBase, Storm and Hadoop have to be installed and configured correctly on your pseudo-distributed cluster.
To collect images from Twitter, you have to obtain your personal Twitter Developer credentials and insert them into a .txt file. You will find an example inside the project as FakeCredential.txt
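A minimal sketch of how such a credential file could be read and wired into Twitter4j (the property names and file layout below are assumptions; check FakeCredential.txt for the exact format the project expects):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;
import twitter4j.conf.ConfigurationBuilder;

// Sketch: read OAuth credentials from a key=value text file and build a
// Twitter4j stream instance. The property names here are assumptions;
// FakeCredential.txt shows the format the project actually uses.
public class CredentialLoader {
    public static TwitterStream fromFile(String path) throws IOException {
        Properties props = new Properties();
        props.load(Files.newBufferedReader(Paths.get(path)));

        ConfigurationBuilder cb = new ConfigurationBuilder()
                .setOAuthConsumerKey(props.getProperty("consumerKey"))
                .setOAuthConsumerSecret(props.getProperty("consumerSecret"))
                .setOAuthAccessToken(props.getProperty("accessToken"))
                .setOAuthAccessTokenSecret(props.getProperty("accessTokenSecret"));

        return new TwitterStreamFactory(cb.build()).getInstance();
    }
}
```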
Start in this order:
- Hadoop
- HBase
- Storm (it starts automatically from Eclipse in my case)
After that, you can execute the following in this order:
TwitterRealTimeImageProcessing.java
- Insert the keyword in the argument list.
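A sketch of how the keyword argument might drive the stream (not the project's actual spout; it only illustrates Twitter4j's filter API and how photo URLs can be pulled from matching tweets):

```java
import twitter4j.FilterQuery;
import twitter4j.MediaEntity;
import twitter4j.StallWarning;
import twitter4j.Status;
import twitter4j.StatusDeletionNotice;
import twitter4j.StatusListener;
import twitter4j.TwitterStream;

// Sketch: track a single keyword and print the URL of every photo attached
// to matching tweets. The real topology would forward these URLs onward
// instead of printing them.
public class KeywordImageListener implements StatusListener {

    @Override
    public void onStatus(Status status) {
        for (MediaEntity media : status.getMediaEntities()) {
            if ("photo".equals(media.getType())) {
                System.out.println(media.getMediaURL());
            }
        }
    }

    @Override public void onDeletionNotice(StatusDeletionNotice n) { }
    @Override public void onTrackLimitationNotice(int numberOfLimitedStatuses) { }
    @Override public void onScrubGeo(long userId, long upToStatusId) { }
    @Override public void onStallWarning(StallWarning warning) { }
    @Override public void onException(Exception ex) { ex.printStackTrace(); }

    public static void start(TwitterStream stream, String keyword) {
        stream.addListener(new KeywordImageListener());
        stream.filter(new FilterQuery().track(keyword));  // keyword taken from the argument list
    }
}
```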
HadoopDriver.java
- Choose appropriate values for the number of centers, the threshold, and the file where the centers will be written.
- With the CEDD descriptor, we obtain a 144-dimensional feature vector.
- If you want to change the descriptor, also change FeatureExtractorCEDD.java and all its references.
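A sketch of the feature-extraction step with LIRE's CEDD descriptor (the class path and methods below match recent LIRE releases; verify them against the version bundled with the project):

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

import javax.imageio.ImageIO;

import net.semanticmetadata.lire.imageanalysis.features.global.CEDD;

// Sketch: extract the 144-dimensional CEDD feature vector for one image.
// Swapping the descriptor would mean replacing the CEDD class here and
// updating FeatureExtractorCEDD.java plus everything that assumes the
// vector length.
public class CeddExample {
    public static double[] extract(File imageFile) throws IOException {
        BufferedImage image = ImageIO.read(imageFile);
        CEDD cedd = new CEDD();
        cedd.extract(image);               // compute the descriptor
        return cedd.getFeatureVector();    // 144 doubles for CEDD
    }
}
```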
At the end of K-Means, an HTML page will show the obtained results.
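One way the "most representative" images could be picked and rendered is sketched below (a hypothetical helper, not the project's actual results page): for each final center, select the image whose feature vector is closest and emit it into a simple HTML gallery.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;
import java.util.Map;

// Sketch: for every final K-Means center pick the image whose feature vector
// lies closest to it, then write a minimal HTML gallery of those images.
// The vectors map (image URL -> CEDD vector) is assumed to come from the
// earlier extraction step.
public class ResultsPage {

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void write(List<double[]> centers,
                             Map<String, double[]> vectors,
                             String outputPath) throws IOException {
        try (PrintWriter out = new PrintWriter(outputPath)) {
            out.println("<html><body><h1>Representative images</h1>");
            for (double[] center : centers) {
                String bestUrl = null;
                double bestDist = Double.MAX_VALUE;
                for (Map.Entry<String, double[]> e : vectors.entrySet()) {
                    double dist = euclidean(center, e.getValue());
                    if (dist < bestDist) {
                        bestDist = dist;
                        bestUrl = e.getKey();
                    }
                }
                if (bestUrl != null) {
                    out.println("<img src=\"" + bestUrl + "\" width=\"300\">");
                }
            }
            out.println("</body></html>");
        }
    }
}
```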
This whole project was executed on Ubuntu 20.04.2 LTS