### By Keshav Ramaswamy

This tool analyzes tweets (and their hashtags) read as JSON data from an input file and writes the rolling average degree of the resulting hashtag graph to an output file.

Functions:
- Analyze the hashtags in each tweet and build a hashtag graph.
- Update the graph over a 60-second window while continuously reading tweets.
- Calculate the average degree after every new tweet is read and write it to the output file.
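
The three steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code in `hashtag_graph.py`: it uses plain dictionaries instead of NetworkX so that it runs without dependencies, the class and method names are hypothetical, timestamps are assumed to be already converted to seconds, an edge is assumed to expire once it is 60 seconds or more older than the newest tweet, and a tweet that arrives already outside the window is assumed to add no edges.

```python
from collections import defaultdict
from itertools import combinations

WINDOW_SECONDS = 60  # window length stated above


class RollingHashtagGraph(object):
    """Hypothetical helper: hashtag graph over the most recent 60 seconds."""

    def __init__(self):
        self.latest = None  # newest timestamp seen so far, in seconds
        self.edges = {}     # (tag_a, tag_b) -> newest timestamp of that edge

    def add_tweet(self, timestamp, hashtags):
        """Add one tweet and return the average degree afterwards."""
        if self.latest is None or timestamp > self.latest:
            self.latest = timestamp
        cutoff = self.latest - WINDOW_SECONDS
        # Assumption: a tweet that is already outside the window adds nothing.
        if timestamp > cutoff:
            # Sorting makes (a, b) and (b, a) map to the same edge key.
            tags = sorted(set(hashtags))
            for edge in combinations(tags, 2):
                self.edges[edge] = max(timestamp, self.edges.get(edge, timestamp))
        # Evict edges that fell out of the window (simple, not optimized).
        self.edges = dict((e, t) for e, t in self.edges.items() if t > cutoff)
        return self.average_degree()

    def average_degree(self):
        """Sum of node degrees divided by the number of connected nodes."""
        degree = defaultdict(int)
        for tag_a, tag_b in self.edges:
            degree[tag_a] += 1
            degree[tag_b] += 1
        if not degree:
            return 0.0
        return sum(degree.values()) / float(len(degree))


graph = RollingHashtagGraph()
print(graph.add_tweet(0, ["Spark", "Apache"]))             # 1.0
print(graph.add_tweet(10, ["Apache", "Hadoop", "Storm"]))  # 2.0
print(graph.add_tweet(65, ["Flink", "Spark"]))             # 1.6
```

In the last call the edge created at second 0 is older than the window and is dropped before the average is recomputed. Rebuilding the edge dictionary on every tweet keeps the sketch short; a real implementation would evict incrementally.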

The source files are in `src/modules/` and their tests are in `src/modules/tests/`:
- twitter_hashtag_graph_analyzer.py
- hashtag_graph.py
- tweet.py
- test_twitter_hashtag_graph_analyzer.py
- test_hashtag_graph.py
- test_tweet.py
- mock_data.py
- Execute the bash script `./run.sh`.
  - The Python source file `twitter_hashtag_graph_analyzer.py` is executed with optional arguments.
  - The two arguments it expects, `input_file` and `output_file`, are set to their default values, `./tweet_input/tweets.txt` and `./tweet_output/output.txt`, so no arguments are required.
  - To use a different input or output file path, pass the arguments from the terminal: `python ./src/Modules/twitter_hashtag_graph_analyzer.py input_filepath output_filepath`
- Note:
  - The tool requires the `NetworkX` library. Use `pip install networkx` to install `NetworkX` manually if the source files are executed from the terminal. The bash script `run.sh` installs the library automatically if it is not already installed.
  - The Python module `argparse` is used for command-line argument parsing. Use `pip install argparse` to install the library (from Python 2.7 onwards it is part of the standard library, so this is only needed on older interpreters).
  - Development was done in `Python 2.7`, and `run.sh` expects the `python` command to be Python 2.7.
- The unit tests can be run either by executing `tests/run_tests.sh` from `src/modules/` or by running `py.test ./tests/` from `src/modules/` directly in the terminal.
  - The `pytest` library is required for unit testing. Use `pip install pytest` to install it manually, or execute `run_tests.sh`, which installs the library automatically if it is not present.
- Additional test suites are in `/insight_testsuite/tests/`. They can be run by executing `./insight_testsuite/run_tests.sh`.
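
The optional `input_file` and `output_file` arguments described above could be declared with `argparse` roughly as follows. This is a minimal sketch, not the parser in `twitter_hashtag_graph_analyzer.py`: the function name `parse_args` and the help texts are hypothetical, and the arguments are assumed to be optional positionals because the README passes them without flags.

```python
import argparse

# Default paths taken from the README.
DEFAULT_INPUT = "./tweet_input/tweets.txt"
DEFAULT_OUTPUT = "./tweet_output/output.txt"


def parse_args(argv=None):
    """Parse the two optional positional arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Compute the rolling average degree of a hashtag graph.")
    # nargs="?" makes a positional argument optional, falling back to default.
    parser.add_argument("input_file", nargs="?", default=DEFAULT_INPUT,
                        help="file with one JSON tweet per line")
    parser.add_argument("output_file", nargs="?", default=DEFAULT_OUTPUT,
                        help="file that receives the rolling average degree")
    return parser.parse_args(argv)


args = parse_args([])  # no arguments given, so the defaults apply
print(args.input_file)   # ./tweet_input/tweets.txt
print(args.output_file)  # ./tweet_output/output.txt
```

Calling `parse_args(["in.txt", "out.txt"])` would override both defaults, which matches the two-argument invocation shown above.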
- The hashtag graph is implemented using the mentioned module, NetworkX.
- NetworkX is a Python library for building graphs and networks. It was chosen because NetworkX "is suitable for operation on large real-world graphs: e.g., graphs in excess of 10 million nodes and 100 million edges" (see https://networkx.github.io/documentation/latest/).
- The basic graph data structure in NetworkX is a "dictionary of dictionaries of dictionaries": each node can be any hashable object (strings, numbers, files, or functions), while an edge is represented as a tuple of nodes.
- NetworkX was used instead of "reinventing the wheel" because real-world hashtag graphs are expected to be sparse, and NetworkX uses adjacency lists, whose storage requirements are the lowest (proportional to `m + n` for `m` nodes and `n` edges).
- As mentioned earlier, the adjacency list is maintained as a "dictionary of dictionaries of dictionaries": each node in the graph is a key whose value is a dictionary mapping each of its neighbours to the edge data.
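- For example, a graph G with edges 'A - B' and 'B - C' is represented as the adjacency structure below. Each neighbour maps to its edge data, which is an empty dictionary by default in NetworkX 1.x (older releases, as in the paper referenced below, stored 1 as the default edge data). The key order of the printed dictionary may vary:

      import networkx

      G = networkx.Graph()
      G.add_edge('A', 'B')
      G.add_edge('B', 'C')
      print G.adj  # Python 2 print statement, matching the Python 2.7 target
      # {'A': {'B': {}},
      #  'B': {'A': {}, 'C': {}},
      #  'C': {'B': {}}}
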
- This choice of data structure gives fast lookup and removal of edges and nodes. For more information on the internals of NetworkX, see: http://conference.scipy.org/proceedings/SciPy2008/paper_2/full_text.pdf
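
To make the lookup and removal cost concrete, the layout described above can be imitated with plain Python dictionaries. This is an illustrative sketch of the "dictionary of dictionaries of dictionaries" idea only, not NetworkX's actual code; the function names are hypothetical and merely echo the corresponding graph operations.

```python
def add_edge(adj, u, v, **data):
    """Store the edge in both directions; both entries share one data dict."""
    adj.setdefault(u, {})[v] = data
    adj.setdefault(v, {})[u] = data


def has_edge(adj, u, v):
    """Edge lookup is two hash lookups, independent of graph size."""
    return u in adj and v in adj[u]


def remove_node(adj, node):
    """Remove a node by visiting only its own neighbours."""
    for neighbour in adj.pop(node):
        if neighbour != node:  # a self-loop entry vanished with the pop
            del adj[neighbour][node]


adj = {}
add_edge(adj, 'A', 'B')
add_edge(adj, 'B', 'C')
print(has_edge(adj, 'A', 'B'))  # True
print(has_edge(adj, 'A', 'C'))  # False

remove_node(adj, 'B')
print(adj == {'A': {}, 'C': {}})  # True
```

Removing `'B'` touches only the two neighbour entries that mention it, which is why the cost depends on the degree of the node rather than on the size of the whole graph.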
- After extensive profiling with the Python module `cprofilev` and the resulting optimization of the code, the tool was able to scale to the required levels.
- The scalability graph is shown below; the tool processes 10000 tweets within 22 sec.
- The accompanying analysis is included in an IPython notebook at `./scalability_analysis.ipynb`.
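
A profiling run of the kind mentioned above can be approximated with the standard-library `cProfile` and `pstats` modules. The sketch below uses them directly rather than `cprofilev`, and `process_tweets` is a hypothetical stand-in for the analyzer's main loop, so the numbers it reports say nothing about the real tool.

```python
import cProfile
import pstats

try:
    from StringIO import StringIO  # Python 2.7
except ImportError:
    from io import StringIO        # Python 3


def process_tweets(count):
    """Hypothetical stand-in for the work done per batch of tweets."""
    total = 0
    for i in range(count):
        total += i % 7
    return total


def profile_call(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


result, report = profile_call(process_tweets, 10000)
print(report)
```

The report lists call counts and cumulative times per function, which is the information needed to decide where optimization effort pays off.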