
This project forked from ssc-oscar/biman_bot_detection

Code for the Bot Detection paper: https://dl.acm.org/doi/abs/10.1145/3379597.3387478

License: MIT License

Shell 0.17% R 1.63% Jupyter Notebook 98.20%

biman_bot_detection's Introduction

BIMAN: Bot Identification by commit Message, commit Association, and author Name

Code for running the BIMAN bot detection method, as described in the MSR 2020 paper:

"Detecting and Characterizing Bots that Commit Code"

Data processing steps, using World of Code (WoC):

Getting the Author to Commit map (a2c):

cat bot_authors | ~/lookup/getValues -vQ a2c | gzip > paper_a2c.gz # all commits for all bot authors

Getting List of all Commits from a2c map:

Use code in bot_datagen.ipynb (first code block)
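
The step above turns the a2c output into a flat list of commits. A minimal Python sketch of that idea, assuming each a2c line has the form `author;commit1;commit2;...` (semicolon-separated) and that the result is written to commits.gz as used in the later steps. The function names are illustrative, and this is not the code from bot_datagen.ipynb:

```python
import gzip


def commits_from_a2c(lines):
    """Collect unique commit hashes from a2c lines, keeping first-seen order."""
    seen = {}
    for line in lines:
        fields = line.rstrip("\n").split(";")
        # fields[0] is the author ID; the remaining fields are assumed to be commit hashes.
        for sha in fields[1:]:
            if sha:
                seen.setdefault(sha, None)
    return list(seen)


def write_commit_list(a2c_path="paper_a2c.gz", out_path="commits.gz"):
    """Read a gzipped a2c map and write one commit hash per line (gzipped)."""
    with gzip.open(a2c_path, "rt", encoding="utf-8", errors="replace") as src:
        commits = commits_from_a2c(src)
    with gzip.open(out_path, "wt", encoding="utf-8") as dst:
        dst.writelines(sha + "\n" for sha in commits)
    return len(commits)
```

Deduplicating here keeps the later showCnt and getValues calls from processing the same commit twice when several author IDs share commits.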

Getting Commit Contents from List of Commits:

zcat commits.gz | ~/lookup/showCnt commit 2 | gzip > paper_cnt.gz

Getting Commit to Project map (c2p):

zcat commits.gz | ~/lookup/getValues -vQ c2p | gzip > paper_c2p.gz

Getting Commit to File map (c2f):

zcat commits.gz | ~/lookup/getValues -vQ c2f | gzip > paper_c2f.gz

Snapshot of the Data

A snapshot of the data files used by the scripts is available in the data_snapshot folder. It is provided for demonstration, so that researchers who do not generate their data with WoC know how to format their input.

Running BIMAN

Running the BIN (name-based detection) approach:

./BIN.sh |file with list of Author IDs|

Example Author ID file structure:

dependabot[bot] <[email protected]>
felix <[email protected]>
John Smith <[email protected]>
Abbot <[email protected]>

Only dependabot[bot] <[email protected]> should come out as a bot.
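
The expected outcome above (dependabot[bot] flagged, Abbot not) suggests that the word "bot" has to stand apart from the surrounding letters. A minimal Python sketch of such a check, assuming a pattern in which "bot" is neither preceded nor followed by a letter. The regular expression, the function name `looks_like_bot`, and the example.com addresses are illustrative assumptions, not the contents of BIN.sh:

```python
import re

# Assumed pattern: "bot" with no letter directly before or after it (case-insensitive).
BOT_PATTERN = re.compile(r"(?<![A-Za-z])bot(?![A-Za-z])", re.IGNORECASE)


def looks_like_bot(author_id: str) -> bool:
    """Return True if the name part of 'Name <email>' matches the assumed pattern."""
    # Only the name (the text before "<email>") is inspected, not the address.
    name = author_id.split("<", 1)[0].strip()
    return BOT_PATTERN.search(name) is not None


if __name__ == "__main__":
    for author in [
        "dependabot[bot] <bot@example.com>",
        "felix <felix@example.com>",
        "John Smith <john@example.com>",
        "Abbot <abbot@example.com>",
    ]:
        print(author, looks_like_bot(author))
```

In "dependabot[bot]" the first "bot" follows a letter and is skipped, while the bracketed "[bot]" matches; "Abbot" never matches because its "bot" follows the letter "b".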

Running BIM:

Use code in bot_datagen.ipynb

Running BICA:

Prepare the data using the code in bot_datagen.ipynb.

Run the Random Forest using the code in BICA_BIMAN.ipynb.

You can use the Pre-Trained model for prediction directly: BICA_model.Rdata

Predictors (In Order):

'Uniq.File.Exten' 'Tot.FilesChanged' 'Std.File.pCommit' 'Tot.uniq.Projects' 'Avg.File.pCommit' 'Median.Project.pCommit'

Description:

Variable Name | Variable Description
--- | ---
Tot.FilesChanged | Total number of files changed by the author in all their commits (including duplicates)
Uniq.File.Exten | Total number of different file extensions in all the author's commits
Std.File.pCommit | Standard deviation of the number of files per commit
Avg.File.pCommit | Mean number of files per commit
Tot.uniq.Projects | Total number of different projects the author's commits have been associated with
Median.Project.pCommit | Median number of projects the author's commits have been associated with (with duplicates); the median is used because the distribution of projects per commit was very skewed and the mean was heavily influenced by the maximum value
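
The six predictors described above can be computed per author from the c2f and c2p maps. A minimal Python sketch, assuming the maps have already been loaded into dictionaries keyed by commit hash for a single author with at least one commit. The function name, the input layout, the use of the sample standard deviation, and the choice to count files without an extension as one "empty" extension are all assumptions, not the logic in bot_datagen.ipynb:

```python
import os
import statistics


def bica_predictors(files_by_commit, projects_by_commit):
    """Compute the six BICA predictors for one author.

    files_by_commit: dict mapping commit hash -> list of file paths (from c2f)
    projects_by_commit: dict mapping commit hash -> list of project names (from c2p)
    """
    files_per_commit = [len(files) for files in files_by_commit.values()]
    projects_per_commit = [len(projs) for projs in projects_by_commit.values()]
    # Files without an extension all share the empty string "" as their extension.
    extensions = {
        os.path.splitext(path)[1]
        for files in files_by_commit.values()
        for path in files
    }
    projects = {p for projs in projects_by_commit.values() for p in projs}
    return {
        "Uniq.File.Exten": len(extensions),
        "Tot.FilesChanged": sum(files_per_commit),
        # Sample standard deviation; undefined for a single commit, so 0.0 is used.
        "Std.File.pCommit": statistics.stdev(files_per_commit) if len(files_per_commit) > 1 else 0.0,
        "Tot.uniq.Projects": len(projects),
        "Avg.File.pCommit": statistics.mean(files_per_commit),
        "Median.Project.pCommit": statistics.median(projects_per_commit),
    }
```

The returned dictionary lists the predictors in the order given under "Predictors (In Order)", which matters if the values are passed positionally to the pre-trained model.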

Running BIMAN:

Code snippet available in BICA_BIMAN.ipynb

You can use the Pre-Trained model for prediction directly: ensemble.Rdata

Predictors (In Order):

p, ratio, name

Description:

Variable Name | Variable Type | Variable Description
--- | --- | ---
p | numeric | Probability of the author being a bot, from the BICA prediction
ratio | numeric | 1 - (no. of message templates detected / no. of messages), as calculated by BIM
name | Factor w/ 2 levels "0","1" | Whether the author has the word "Bot" in their name in the required pattern, as indicated by BIN: "0" -> not a bot, "1" -> bot
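
To make the `ratio` predictor concrete, here is a minimal Python sketch of the formula 1 - (templates / messages). The template detection used here (`crude_template`, which masks numbers and hash-like tokens) is a hypothetical stand-in chosen for illustration and is not the BIM template detection from bot_datagen.ipynb; only the final ratio formula comes from the description above:

```python
import re


def crude_template(message: str) -> str:
    """Hypothetical stand-in for template detection: mask hashes and numbers."""
    return re.sub(r"\b[0-9a-f]{7,40}\b|\d+", "<X>", message.lower())


def bim_ratio(messages):
    """Return 1 - (number of distinct templates / number of messages)."""
    if not messages:
        raise ValueError("at least one commit message is required")
    templates = {crude_template(m) for m in messages}
    return 1 - len(templates) / len(messages)


if __name__ == "__main__":
    example = [
        "Bump lodash from 4.17.15 to 4.17.19",
        "Bump lodash from 4.17.19 to 4.17.20",
        "Bump lodash from 4.17.20 to 4.17.21",
        "Fix typo in README",
    ]
    print(bim_ratio(example))  # 2 templates over 4 messages -> 0.5
```

An author whose messages all follow a few templates gets a ratio close to 1, while an author whose messages are all different gets a ratio of 0.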

biman_bot_detection's People

Contributors

tapjdey
