Giter Site home page Giter Site logo

admhw4's Introduction

Homework4

Algorithmic Methods of Data Mining

Authors:

  • Biko Catalano
  • Andrea Marcocchia
  • Malick Alexandre Sarr Ngorovitch

The folder contains:

  • load_save: file.py some functions to save, load and read the following format: json, pickle and graph

  • Part_1: file.py, process and create graph

  • Part_2_A: file.py, subgraph induced by conference in input, some centralities measures

  • Part_2_B: file.py, Hop Distance of subgraph induced

  • Part_3_A: file.py, Evaluate dijkstra distance from a starting node

  • Part_3_B: file.py, GroupNumbers of a subset of nodes

  • building_output: file.py, built a nice output, using previous modules

  • main_code: file.ipynb, run building_output.py and give the output in a notebook format

  • report: file.pdf, extensively comment the results obtained

Analyze modules

load_save.py

A module that contains functions to load, save json and pickle and to save and read graph. In input is requested the path of the file without file extension.

The file are called in this way:

def load_json(name):
    with open(name + '.json', 'r') as f:
        return json.load(f)

Part_1.py

We create a class, parsejson_graph. This module provides, when called, to create a dictionary and a graph. It takes in input the json file. This class has the following methods:

  • jaccard_sim

    This function computes the jaccard similarity between two nodes.

  • dict_name

    this function creates a dictionary that has the author_id as keys, and as values a nested dictionary with the following attributes:

    • author_name: name
    • id_publication_int: list of publication
    • id_conference_int: list of conference
  • graph_complete

This function build the graph. It use the dictionary created in dict_name function to give the attributes to nodes.

  • pub_id

    This function provide to build a index of publications:

    • publication_id: values list of authors

    This index allows a faster creation of edges. Compute the combination among authors with the same publications.

  • add_weighted_edges

Add weighted edges to graph. We use the itertools.combinations to find all the combinations of linked authors. This functions provides to weight the nodes using the function jaccard_sim.

part_2_A.py

This module contains:

  • class Part_2_a

    this class take in input a conference_id, the dictionary and the graph. It builds the subgraph induced and compute some centralities measure.

In this module, there are also:

  • 3 functions: plot the previous centralities measure (degree, betweenness and closeness)

  • 1 function that plots correlation measures and prints correlation values in a table

part_2_B.py

This module contains:

  • class Part_2_b:

    This class has 3 methods:

    • author_neighbors: it searchs and return a list of all neighbours of a node

    • hop_distance: Compute all neighbours at given distance d and compute induced subgraph. Originally, we had found the methods ego_graph, but we have seen it was very slow, so we prefer build the function ex-novo, using only the methods nx.all_neighbors(graph,node). Iterating this method we have found all edges linked to starting node at distance d, given in input. We used this list of edges to build the subgraph.

    • draw_author_subgraph:draw the induced subgraph, obtained in the previous function. The input node is blue in the plot, all the others are red.

part_3_A

This module contains two functions:

  • weight_dict: create a dictionary that contains for each node of the graph all the linked nodes and their weigths. It ask in input only the graph. This function returns a dictionary.

  • dijkstra_heap2: return for a node in input (start) the shortest path for all the nodes that have path with the starting one. This function asks in input a starting node and the dictionary obtained in the previous function. The return is a dictionary.

part_3_B

This module contains only one function:

  • group_number: this function return a dictionary with authors_ID of the subset as keys, and for values the node that have for GroupNumber that key. The requested input is the subset of nodes, the dictionary obtained in part_3_A with function weight_dict and the graph.

building_output

This module contains six functions that organize the output for all the modules already analyzed:

  • Part_1
  • Part_2_A
  • Part_2_B
  • Part_3_A
  • Part_3_A_solution
  • Part_3_B

How to run the code

Open Jupyter notebook and run main_code.ipynb.

The output is in .ipynb file, in order to obtain clear results and plots, following the homework questions.

It is necessary to save the database (.json file) in the same directory in which are stored the other modules. In other case is necessary to write the full path.

The code works by default with full_dbpl stored in the same directory. If you want to change the database or the path, you have to change the argument of building_output.Part_1(new argument), in main_code.ipynb.

The homework questions are at the following link: http://aris.me/contents/teaching/data-mining-ds-2017/homeworks/homework4.pdf

admhw4's People

Contributors

andremarco avatar

Watchers

James Cloos avatar Malick Sarr avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.