
jacquelinegarrahan / silk-road-author-identification


EECE5644 final project documentation. Applies LSTM and RNN neural networks to authorship classification in dark web marketplaces using Twitter GloVe vector representations.

License: GNU General Public License v3.0



Silk Road Author Identification

This is the source code for my EECE5644 final project (Summer 2018), which applies LSTM and RNN neural networks to classifying sequence representations of Silk Road forum posts by author. The repository contains notebooks for preparing and running both model types, as well as for building datasets. Vector representations of the forum posts were built using the Stanford GloVe vector representations. The BuildDataset notebook includes utility functions for building embedding matrices from the Stanford vector representations and preparing datasets. The dataset used for this project was compiled by Gwern Branwen and can be accessed at https://www.gwern.net/DNM-archives.
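As a rough illustration of what building an embedding matrix from the Stanford vectors involves, here is a minimal sketch. It is not the code from the BuildDataset notebook: the function names `parse_glove_lines` and `build_embedding_matrix`, the `word_index` mapping, and the convention of reserving row 0 for padding are all assumptions made for this example. It relies only on the GloVe text format, where each line is a token followed by its space-separated vector components.

```python
from typing import Dict, Iterable, List


def parse_glove_lines(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse GloVe text lines of the form 'token v1 v2 ... vd' (hypothetical helper)."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(
    word_index: Dict[str, int],
    vectors: Dict[str, List[float]],
    dim: int,
) -> List[List[float]]:
    """Return a matrix whose row i is the vector of the word with index i.

    Assumptions for this sketch: indices run from 1 to len(word_index),
    row 0 is reserved for padding, and words missing from the GloVe
    vocabulary keep an all-zero row.
    """
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None and len(vec) == dim:
            matrix[idx] = list(vec)
    return matrix


# Tiny in-memory stand-in for a GloVe file; a real file would be read with
# open(path, encoding="utf-8") and passed to parse_glove_lines line by line.
sample_lines = ["the 0.1 0.2 0.3", "market 0.4 0.5 0.6"]
vectors = parse_glove_lines(sample_lines)
word_index = {"the": 1, "market": 2, "escrow": 3}  # made-up vocabulary
matrix = build_embedding_matrix(word_index, vectors, dim=3)
```

The sketch uses plain lists to stay dependency-free. In a notebook the result would typically be converted to a NumPy array before being handed to an embedding layer.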

Installation

Download the Stanford GloVe Twitter and Wikipedia pretrained vectors using wget. Note that these archives are large and will take a long time to download.
$ wget http://nlp.stanford.edu/data/wordvecs/glove.twitter.27B.zip
$ wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip

Unzip the files:

$ unzip glove.twitter.27B.zip
$ unzip glove.6B.zip

The parse_file code will prepare the pages in a given directory. Pre-prepared files are provided in the files folder.

The environment for the project can be installed using conda:

$ conda env create -f environment.yml

$ conda activate silk-road-author-id

Once activated, register the environment as a Jupyter kernel using ipykernel:

$ python -m ipykernel install --user --name=silk-road-author-id

Launch the notebooks:

$ jupyter notebook

File naming scheme:

data frames: {GLOVE_TYPE}_{N_AUTHORS}_{EMBEDDING_VECTOR_SIZE}_{INPUT_SIZE}_df.pickle
embedding matrices: {GLOVE_TYPE}_{N_AUTHORS}_{EMBEDDING_VECTOR_SIZE}_{INPUT_SIZE}_embedding.pickle
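The naming scheme can be expressed as a small helper. This is only a sketch: the function name `dataset_filenames` and the example values (`"twitter"`, 10 authors, 50-dimensional vectors, input size 200) are hypothetical and are not taken from the repository.

```python
from typing import Tuple


def dataset_filenames(
    glove_type: str,
    n_authors: int,
    embedding_vector_size: int,
    input_size: int,
) -> Tuple[str, str]:
    """Return (data frame filename, embedding matrix filename) for one configuration."""
    stem = f"{glove_type}_{n_authors}_{embedding_vector_size}_{input_size}"
    return f"{stem}_df.pickle", f"{stem}_embedding.pickle"


# Example values are illustrative only.
df_name, embedding_name = dataset_filenames("twitter", 10, 50, 200)
```

Both filenames share the same stem, so a data frame and its matching embedding matrix can be located from a single set of parameters.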
