Supplementary Material for the ECSA2020 paper "Does BERT Understand Code? -- An Exploratory Study on the Detection of Architectural Tactics in Code" by Keim et al.

License: GNU General Public License v3.0

BERT4DAT

Supplementary material for the paper "Does BERT Understand Code? -- An Exploratory Study on the Detection of Architectural Tactics in Code" by Jan Keim, Angelika Kaplan, Anne Koziolek, and Mehdi Mirakhorli for the 14th European Conference on Software Architecture (ECSA 2020).

Note that we are not able to provide the actual models that were used to produce the results of the paper. The results may still be reproduced with the supplied notebooks and correct configurations.

The original data sets can be found at SoftwareDesignLab/Archie.

How to cite

@InProceedings{Keim2020,
author={Keim, Jan and Kaplan, Angelika and Koziolek, Anne and Mirakhorli, Mehdi},
title={Does {BERT} Understand Code? -- An Exploratory Study on the Detection of Architectural Tactics in Code},
booktitle={14th European Conference on Software Architecture (ECSA 2020)},
year={2020},
publisher={Springer International Publishing},
}

This repository can be cited as:

@software{keim2020_3925166,
  author       = {Jan Keim and Angelika Kaplan and Anne Koziolek and Mehdi Mirakhorli},
  title        = {Gram21/BERT4DAT},
  month        = jul,
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v1.0},
  doi          = {10.5281/zenodo.3925165},
  url          = {https://doi.org/10.5281/zenodo.3925165}
}

How to use

The artifacts can be downloaded from Zenodo.

We recommend opening this repository in Google Colab, which lets you open any of the notebooks directly. In Colab, open a notebook, set the preferred configuration parameters, and run all cells (CTRL+F9 or Runtime -> Run All). Colab should already use a GPU as hardware accelerator. The notebooks contain a cell that checks whether a GPU is in use; if none is, please enable it (in Colab: Edit -> Notebook settings -> Hardware accelerator: GPU).
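
A GPU check of this kind can be sketched as follows. This is a minimal illustration, not the cell from the notebooks: the function name `describe_accelerator` is made up here, and the import is guarded so that the snippet also runs on machines without PyTorch.

```python
def describe_accelerator() -> str:
    """Return a short description of the device PyTorch would use."""
    try:
        import torch  # PyTorch is required by the notebooks
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        # Index 0 is the first visible CUDA device.
        return f"GPU: {torch.cuda.get_device_name(0)}"
    return "No GPU found, running on CPU"


print(describe_accelerator())
```

If the output reports no GPU in Colab, switch the hardware accelerator as described above and rerun the cell.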

If you plan to run the notebooks locally, you need to install Jupyter. Depending on your Python installation, you may also need dependencies beyond those installed in the first cell of each notebook: make sure that every Python library imported in the second cell is installed via pip. PyTorch is required. You will need a machine with a powerful GPU (at least 12 GB of GPU RAM is recommended), because the pretrained BERT model is very memory-hungry, and your GPU and drivers must support CUDA. We recommend Ubuntu as the operating system. Some parts of the notebooks are coded and designed for Colab, so their appearance may differ locally. The code itself should run without problems, but issues related to your installation, system, or setup are still possible.
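
Before starting a local run, a quick check for missing libraries can save a failed first cell. The sketch below uses only the standard library; the helper name `missing_modules` is hypothetical, and the list of required modules contains only `torch` (the import name of PyTorch), so extend it with the imports from the second cell of the notebook you want to run.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumption: extend this list with the imports of the notebook's second cell.
REQUIRED = ["torch"]

missing = missing_modules(REQUIRED)
if missing:
    print("Install via pip:", ", ".join(missing))
else:
    print("All listed modules are available")
```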

Each notebook has a configuration cell that can be adapted to tune hyperparameters and to configure the experiment setup, such as the sampling or folding strategy. Details on the hyperparameters and settings are given in the respective notebooks.
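
To give a rough idea of what such a configuration and a folding strategy might look like, here is a small sketch. All field names and default values are hypothetical placeholders and do not mirror the actual configuration cells; the fold generator is plain k-fold cross-validation, which may differ from the strategies implemented in the notebooks.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class ExperimentConfig:
    # All field names and defaults are hypothetical placeholders.
    learning_rate: float = 2e-5
    batch_size: int = 8
    epochs: int = 3
    folds: int = 10
    oversample: bool = False


def k_fold_indices(n_items: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for plain k-fold cross-validation."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    indices = list(range(n_items))
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


config = ExperimentConfig(folds=5)
for train, test in k_fold_indices(10, config.folds):
    print(len(train), len(test))
```

Keeping the settings in one immutable object makes it easy to log exactly which configuration produced a given result.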

This repository contains the code used in the paper, as well as additional results:

  • notebooks contains the Python notebooks (code)
  • scripts contains various scripts:
    • prepareInput.py: Script to preprocess the data
    • eval.py: Script used to process the log produced for evaluation and calculate the metrics
  • eval contains the results of all tested hyperparameter configurations for each task
  • data contains the already preprocessed data to be directly used in the notebooks.
    • Files with the prefix 1_ are part of the TSE paper by Mirakhorli et al. from 2016 (see attribution)
    • Files with the prefix 2_ are part of the ICSE 2012 paper by Mirakhorli et al. (see attribution)
    • Files with the prefix Hadoop are part of the Hadoop case study from the TSE paper by Mirakhorli et al. from 2016 (see attribution)
    • The suffix BegOnly_512 means that classes are cut after 512 tokens
    • The suffix Shrunk means that classes are shrunk by removing method bodies
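
The two suffixes describe simple transformations of a class's source code, which can be illustrated with the sketch below. It is not the logic of prepareInput.py: both function names are hypothetical, tokens are approximated by whitespace splitting (BERT itself uses a subword tokenizer), and the brace counting ignores braces inside strings and comments and also empties nested blocks such as inner classes.

```python
def truncate_tokens(source: str, max_tokens: int = 512) -> str:
    """Keep only the first max_tokens whitespace-separated tokens."""
    return " ".join(source.split()[:max_tokens])


def shrink_class(source: str) -> str:
    """Drop everything inside method bodies, keeping signatures and fields."""
    out = []
    depth = 0  # 0 = outside the class, 1 = class body, 2+ = inside a method
    for ch in source:
        if ch == "{":
            depth += 1
            if depth <= 2:
                out.append(ch)  # keep class brace and method opening brace
        elif ch == "}":
            if depth <= 2:
                out.append(ch)  # keep the matching closing braces
            depth -= 1
        elif depth < 2:
            out.append(ch)  # keep text outside method bodies
    return "".join(out)


java = "class A { int x; int f() { if (x > 0) { return 1; } return 0; } }"
print(shrink_class(java))
print(len(truncate_tokens("tok " * 600).split()))
```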

Attribution

The preprocessed datasets are based on the original dataset that can be found at SoftwareDesignLab/Archie. These datasets are used in the following papers:

  • Mirakhorli, M., & Cleland-Huang, J. (2016). Detecting, Tracing, and Monitoring Architectural Tactics in Code. IEEE Transactions on Software Engineering, 42(3), 205-220.
  • Mirakhorli, M., Shin, Y., Cleland-Huang, J., & Cinar, M. (2012, June). A tactic-centric approach for automating traceability of quality concerns. In 2012 34th international conference on software engineering (ICSE) (pp. 639-649). IEEE.

For the case study, we used Apache Hadoop.
