Community Detection on Higher-Order Networks: Identifying Patterns in US Air Traffic


Bradley Dice (@bdice), Shannon Moran (@shannon-moran), Dan McCusker (@dm1181), University of Michigan.

Project Overview

In this project, we studied air traffic patterns from 1993 to 2018 using higher-order network representations [1]. Many airline passengers travel in round-trips and have layovers in airports that act as hubs for their carrier of choice. Because of patterns like these, the place a passenger flies next strongly depends on the previous airports they've visited. Network analyses that assume first-order Markov processes (random walks) on the network of US airports cannot capture important dynamics and patterns in flight data, and higher-order networks are one solution to this problem.
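
As a toy illustration of this memory effect (not code from the project), compare transition counts conditioned on one versus two previous airports:

```python
from collections import Counter

# Toy round-trip itineraries through a hub airport (illustrative only, not DB1B data).
itineraries = [
    ["LAX", "ORD", "LAX"],
    ["SFO", "ORD", "SFO"],
    ["JFK", "ORD", "JFK"],
]

# First-order counts: the next airport conditioned only on the current airport.
first_order = Counter(
    (trip[i], trip[i + 1])
    for trip in itineraries
    for i in range(len(trip) - 1)
)

# Second-order counts: the next airport conditioned on the previous AND current airports.
second_order = Counter(
    ((trip[i], trip[i + 1]), trip[i + 2])
    for trip in itineraries
    for i in range(len(trip) - 2)
)

# Seen from ORD alone, the next stop looks uniform over LAX, SFO, and JFK ...
print({edge: n for edge, n in first_order.items() if edge[0] == "ORD"})
# ... but for a passenger who reached ORD from LAX, the next stop is always LAX.
print(second_order[(("LAX", "ORD"), "LAX")])  # 1 of 1 transitions out of (LAX -> ORD)
```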

Our analysis processes nearly one billion ticket coupons covering 25 years of data obtained from the Bureau of Transportation Statistics [2]. We build higher-order networks from these itineraries and detect communities in the resulting graph structure with InfoMap [3]. All data and workflows were managed with the signac framework [4]. This project was completed for the course Advanced Data Mining with Danai Koutra (EECS 598, Winter 2019) at the University of Michigan [5]. Read the poster or the report.

Results

The primary results of our project are documented in this report. Some of our major findings are below.

Finding 1: Higher-Order Networks Exhibit Meaningful Clusters

Communities in First-Order vs. Higher-Order Networks

Shown are the communities identified using InfoMap for (a) a first-order network and (b) a higher-order network. Only one community (the largest) is shown for the first-order network, while the five largest communities are shown for the higher-order network (HON). Data is a snapshot from 2011 Q1. Node colors indicate communities, and sizes correspond to flow volume, defined in the text. Labels indicate three-letter IATA airport codes.

Finding 2: Airline Mergers Do Not Significantly Change Network Structure

Data from Delta-Northwest Merger, 2010

Shown are data for (a) 2004 Q3 and (b) 2013 Q1. Delta (DL) and Northwest (NW) officially merged in 2010 Q1. Interestingly, we observe that this merger did not lead to major structural changes in the networks. Community numbers in the headers correspond to the order in which they were found by the InfoMap algorithm and are not otherwise meaningful. For each community, the associated pie chart indicates the carriers represented, weighted by flow through the community. In the maps, node color and size indicate flow in the community (see colorbar). Labels indicate three-letter IATA airport codes.

Finding 3: Airport PageRanks Show Seasonal Dependence

Seasonal Changes in First-Order PageRank

Shown above is the seasonality of U.S. airports, as measured by PageRank in the first-order network. Node colors correspond to the PageRank seasonality correlation coefficient described in the text and indicated on the colorbar. Node sizes correspond to PageRank in 2018 Q3, and labels indicate three-letter IATA airport codes.

Airline Data Workflow

The primary source code of this project is in the subdirectory airline_data. This contains code which fetches ticket information from the Airline Origin and Destination Survey (DB1B) and analyzes it using PySpark, HON, and InfoMap [1, 2, 3].

Exploratory data analysis was performed in Jupyter notebooks. These notebooks, included in airline_data/exploration/, were used to prototype code, collect intermediate results, and visualize the data.

The workflow was written as a series of operations in a signac-flow project, described below. These operations include data cleaning, processing, analysis, and visualization tasks.

In addition to Jupyter notebooks, the data space was visualized using signac-dashboard, which enabled rapid inspection of data plots generated by the signac-flow project.
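
A dashboard launcher typically amounts to a few lines; the module choices below are a hedged sketch, not necessarily what this project's dashboard script configures:

```python
from signac_dashboard import Dashboard
from signac_dashboard.modules import ImageViewer, StatepointList

if __name__ == "__main__":
    # Serve the signac workspace in a browser; ImageViewer displays the
    # plots (e.g. PNG files) written into each job's workspace directory.
    Dashboard(modules=[StatepointList(), ImageViewer()]).main()
```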

Installation

Use the requirements file (pip install -U -r requirements.txt) to install the project dependencies.

Some notes:

  • It was necessary to use Python 2 for compatibility with Spark 2.2 on the flux-hadoop cluster.
  • The dependency cartopy is much easier to install via conda than pip. It is only needed for generating map visualizations.

Running the workflow

Run python init.py to create the signac data space.
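
For reference, a minimal sketch of such an initialization script follows; the statepoint keys (year, quarter) are assumptions about the schema, not taken from the project:

```python
import signac

# Create (or open) the signac project rooted in the current directory.
project = signac.init_project("airline_data")

# One signac job per quarter of DB1B data; the statepoint keys are illustrative.
for year in range(1993, 2019):
    for quarter in (1, 2, 3, 4):
        project.open_job({"year": year, "quarter": quarter}).init()
```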

Run python project.py run --progress --show-traceback to execute the workflow. The operations below are performed independently for each signac job, corresponding to one quarter of data. All file paths are relative to the job workspace directory.

  1. fetch_data: Fetches data with urllib.request.
  2. remove_readmes: Removes readme.html from all zip files (having an extra file in the zip breaks pandas).
  3. determine_size: Reads dataset sizes (rows, columns) from each zip file with pandas and saves that metadata to the job document.
  4. unzip_data: Unzips the CSV file.
  5. label_data: Renames the CSV file (for consistency across all jobs).
  6. get_IATA: Locates the unique airport codes in this quarter of data, for cross-referencing with the IATA database.
  7. extract_edges: Uses PySpark to count the number of occurrences of each directed edge (origin, destination), output to edges.tsv/part-* (a sketch of this step follows the list).
  8. combine_edgefile: Combines the (distributed) outputs from edges.tsv/part-* into one file, all_edges.tsv.
  9. extract_itineraries: Uses PySpark to combine all coupons from the same itinerary into one trajectory, output to hon_itineraries.txt/part-*.
  10. combine_itineraries: Combines the (distributed) outputs from hon_itineraries.txt/part-* into one file, all_hon_itineraries.txt.
  11. plot_first_order_pagerank: Computes, saves, and plots PageRank data from the first-order network. Creates first_order_pagerank.txt and pr_1st_order.png.
  12. generate_hon: Uses pyHON (modified from original source code for ease of use) to generate the higher-order network. Produces hon_rules.txt and hon_network.txt.
  13. hon_pagerank: Runs pyHON PageRank algorithm on higher-order network and saves PageRank data to hon_pagerank.txt.
  14. plot_hon_pagerank: Plots PageRanks from nodes in the higher-order network, saved to pr_hon.png.
  15. hon_prepare_infomap: Translates the format of the higher-order network to the states format for processing with InfoMap. Produces hon_infomap.txt.
  16. hon_infomap: Executes InfoMap algorithm, producing hierarchical cluster outputs in hon_infomap.clu and hon_infomap.tree.
  17. plot_hon_communities: Plots each community with more than 0.5% of the network's flow, saved in hon_community_plots/hon_community_*.png.
  18. first_order_prepare_infomap: Translates the format of the first-order network to the Pajek format for processing with InfoMap. Produces first_order_infomap.txt.
  19. plot_first_order_communities: Plots each community with more than 0.5% of the network's flow, saved in first_order_community_plots/first_order_community_*.png.
  20. plot_hon_top_communities: Creates a plot similar to the previous HON community plots, overlaying the top 5 communities and coloring them by their respective clusters in hon_top_communities.png.
  21. plot_hon_community_carriers: For each community with more than 0.5% of the network's flow, plots a pie chart of the different airlines' presence within that community, weighted by the sum of origin and destination carrier flows. Produces hon_community_plots/hon_community_carriers_*.png and hon_community_carriers.json.
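
As referenced in step 7 above, the PySpark edge extraction boils down to a grouped count. The following rough sketch illustrates the idea; the input filename and DB1B column names are assumptions, not the project's exact code:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("extract_edges").getOrCreate()

# Read one quarter of DB1B coupon data (path and column names are illustrative).
coupons = spark.read.csv("Origin_and_Destination_Survey_DB1BCoupon_2011_1.csv",
                         header=True, inferSchema=True)

# Count how many coupons traverse each directed (origin, destination) edge
# and write the distributed result as tab-separated part files.
edges = coupons.groupBy("OriginAirportID", "DestAirportID").count()
edges.write.csv("edges.tsv", sep="\t")
```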

Users may specify which operation should be run with the -o operation_name flag, like python project.py run -o fetch_data --progress --show-traceback. For the data generation in this project, operations 1-5 were run on a local directory. Then, the data was copied into HDFS and operations 6-21 were run on HDFS (accessed via hadoop-fuse).

The command python project.py status -d shows the status of the workflow, including which operations will run next.
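
Each of the operations above is declared in project.py with signac-flow decorators. The following is a minimal sketch of the pattern; the operation body shown here is a simplification for illustration, not the project's actual implementation:

```python
import glob

from flow import FlowProject


class Project(FlowProject):
    pass


@Project.operation
@Project.post.isfile("all_edges.tsv")  # skip the job once this output exists
def combine_edgefile(job):
    """Concatenate distributed Spark output parts into a single edge file."""
    with open(job.fn("all_edges.tsv"), "w") as out:
        for part in sorted(glob.glob(job.fn("edges.tsv/part-*"))):
            with open(part) as partfile:
                out.write(partfile.read())


if __name__ == "__main__":
    Project().main()
```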

Data Formats

The all_edges.tsv file follows this tab-separated format:

origin_id destination_id count

where origin_id and destination_id come from the DB1B data's identifiers and count is the number of coupons found for that edge.
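
For instance, the edge list can be loaded with pandas (assuming the file has no header row):

```python
import pandas as pd

# Load a job's combined edge list; the path is relative to the job workspace.
edges = pd.read_csv("all_edges.tsv", sep="\t",
                    names=["origin_id", "destination_id", "count"])
print(edges.nlargest(10, "count"))  # the ten busiest directed edges
```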

The all_hon_itineraries.txt file follows the space-separated line input format required by the pyHON code [1]:

unique_id node1 node2 node3 node4 ... nodeN

where unique_id is an identifier for the whole trajectory and node1 through nodeN are stops along that trajectory.
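
For example, the file can be parsed into a dictionary mapping each itinerary ID to its sequence of stops:

```python
# Read trajectories from a job workspace (path shown is illustrative).
trajectories = {}
with open("all_hon_itineraries.txt") as f:
    for line in f:
        fields = line.split()
        if len(fields) > 1:  # unique_id followed by at least one node
            trajectories[fields[0]] = fields[1:]
```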

InfoMap reads the higher-order network in its states format and the first-order network in the Pajek format, as produced in steps 15 and 18 above.

Sample Data

The directory airline_data/sample_data_2011Q1/ contains most of the files generated by this workflow for the data from 2011 Q1. See the sample data README for example images and additional information.

References

  1. Xu, J., Wickramarathne, T. L., & Chawla, N. V. (2016). Representing higher-order dependencies in networks. Science Advances, 2(5), e1600028. Publication, GitHub repository.
  2. Bureau of Transportation Statistics, Office of Airline Information. Airline Origin and Destination Survey (DB1B). Data Source.
  3. Rosvall, M., Esquivel, A. V., Lancichinetti, A., West, J. D., & Lambiotte, R. (2014). Memory in network flows and its effects on spreading dynamics and community detection. Nature Communications, 5, 4630. Publication, Website, GitHub repository.
  4. Adorf, C. S., Dodd, P. M., Ramasubramani, V., & Glotzer, S. C. (2018). Simple data and workflow management with the signac framework. Comput. Mater. Sci., 146(C), 220-229. Publication, Website, GitHub repository.
  5. Koutra, D. EECS 598 W19 course webpage: http://web.eecs.umich.edu/~dkoutra/courses/W19_598/.
