
deeplearningsmells's Introduction

Detecting smells using deep learning

Overview

The figure below provides an overview of the experiment. We download 1,072 C# and 100 Java repositories from GitHub. We use Designite and DesigniteJava to analyze C# and Java code respectively. We use CodeSplit to extract each method and class definition from the C# and Java programs into separate files. The learning data generator then uses the detected smells to bifurcate code fragments into positive and negative samples for each smell: positive samples contain the smell, while negative samples are free from it. We apply preprocessing operations on these samples, such as removing duplicates, and feed the output to Tokenizer. Tokenizer takes a method or class definition and generates an integer for each token in the source code. The output of Tokenizer is ready to be fed to neural networks.

Figure: Overview of the study

Data Generation and Curation

Download repositories

We used the following protocol to identify and download our subject systems.

  • We download repositories containing C# and Java code from GitHub. We use RepoReapers [1] to filter out low-quality repositories. RepoReapers analyzes GitHub repositories and provides scores for eight dimensions of their quality: architecture, community, continuous integration, documentation, history, license, issues, and unit tests.
  • We selected C# repositories where at least 6 of the 8 RepoReapers dimensions had suitable scores, and Java repositories where at least 7 of the 8 did. We consider a score suitable if its value is greater than zero.
  • Next, we discarded repositories with fewer than 5 stars and fewer than 1,000 LOC.
  • RepoReapers does not include forked repositories.
  • Following these criteria, we get a filtered list of 1,072 C# and 2,528 Java repositories. We select 100 repositories randomly from the filtered list of Java repositories. Finally, we download and analyze 1,072 C# and 100 Java repositories. (A sketch of this selection logic follows the list.)
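
The selection criteria can be sketched as follows, assuming the RepoReapers scores are available as a CSV file with one row per repository. The column names, including the stars and loc fields, are hypothetical; adapt them to the actual data export.

import csv

# The eight RepoReapers quality dimensions (hypothetical column names).
DIMENSIONS = ["architecture", "community", "continuous_integration",
              "documentation", "history", "license", "issues", "unit_tests"]

def score(value):
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0  # treat missing or non-numeric scores as unsuitable

def select_repos(reapers_csv_path, min_suitable, min_stars=5, min_loc=1000):
    selected = []
    with open(reapers_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # A dimension score is suitable if it is greater than zero.
            suitable = sum(1 for d in DIMENSIONS if score(row[d]) > 0)
            if (suitable >= min_suitable
                    and int(row["stars"]) >= min_stars
                    and int(row["loc"]) >= min_loc):
                selected.append(row["repository"])
    return selected

# C# repositories: at least 6 of 8 suitable dimensions; Java: at least 7 of 8.
cs_repos = select_repos("reapers_cs.csv", min_suitable=6)
java_repos = select_repos("reapers_java.csv", min_suitable=7)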

Smell detection

We use Designite to detect smells in C# code. Designite is a software design quality assessment tool for projects written in C#. It supports detection of 11 implementation, 19 design, and 7 architecture smells. It also provides commonly used code metrics and other features, such as trend analysis, code clone detection, and a dependency structure matrix, to help assess software quality. A free academic license can be requested for academic purposes. Similar to the C# version, we developed DesigniteJava, an open-source tool to analyze Java code. We use DesigniteJava to detect smells in Java codebases. The tool supports detection of 17 design and 10 implementation smells.

We use the console versions of Designite (version 2.5.10) and DesigniteJava (version 1.1.0) to analyze C# and Java code respectively, and to detect design and implementation smells in each of the downloaded repositories.

Splitting code fragments

CodeSplit is a pair of utilities that split C# and Java source code into individual files, one per method or class. Given a C# or Java project, the utilities parse the code correctly (using Roslyn for C# and Eclipse JDT for Java) and emit each method or class fragment into a separate file, following the hierarchical structure of the code (i.e., namespaces/packages become folders). CodeSplit for Java is an open-source project that can be found on GitHub. CodeSplit for C# can be downloaded freely online.
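
For illustration, in class mode a class Parser declared in namespace MyApp.Core would be emitted roughly as follows (the names and the exact file-naming convention are hypothetical):

codesplit_out_class
	- MyApp
		- Core
			- Parser.cs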

Generating learning data

The learning data generator requires information from two sources:

  1. A list of detected smells for each analyzed repository.

  2. The path to the folder where code fragments are stored corresponding to the repository.

The program takes one method (or class, in the case of design smells) at a time and checks whether the given smell has been detected in it. If yes, the program puts the code fragment into the positive folder corresponding to the smell; otherwise, into the negative folder.

Input: smells_result_path, code_fragments_base_path, smells_list
Output: code fragments in learning_data/<smell>/(positive|negative) folders

for smell in smells_list:
    for repo in smells_result_path:
        all_smells_files = get files containing detected smells
        positive_file_list = initialize a list
        for file_smell_info in all_smells_files:
            file_name = read a line where detected smell is smell and compose a file name from namespace and class information
            if file_name exists in repo folder in code_fragments_base_path:
                add the file to positive_file_list
        for file in files in repo folder in code_fragments_base_path:
            if file is in positive_file_list:
                copy file in learning_data/<smell>/positive folder
            else:
                copy file in learning_data/<smell>/negative folder
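
A minimal Python rendering of this pseudocode, for design smells (class-level fragments), might look as follows. Composing a fragment's path from the Namespace_name and Class_name columns (see the CSV formats described below) is an assumption about CodeSplit's output layout; for implementation smells, Method_name would also take part in the file name.

import csv
import os
import shutil

def generate_data(smells_result_path, code_fragments_base_path, smells_list,
                  out_base="learning_data"):
    for smell in smells_list:
        for repo in os.listdir(smells_result_path):
            results_dir = os.path.join(smells_result_path, repo)
            # Collect the paths of fragments in which this smell was detected.
            positive = set()
            for name in os.listdir(results_dir):
                if not name.endswith("designSmells.csv"):
                    continue
                with open(os.path.join(results_dir, name), newline="") as f:
                    for row in csv.DictReader(f):
                        if row["Design_smell_name"] != smell:
                            continue
                        fragment = os.path.join(
                            code_fragments_base_path, repo,
                            *row["Namespace_name"].split("."),
                            row["Class_name"] + ".cs")
                        if os.path.isfile(fragment):
                            positive.add(os.path.abspath(fragment))
            # Copy every fragment of the repository into the positive or
            # negative folder for this smell.
            for root, _, files in os.walk(
                    os.path.join(code_fragments_base_path, repo)):
                for name in files:
                    path = os.path.abspath(os.path.join(root, name))
                    label = "positive" if path in positive else "negative"
                    dest = os.path.join(out_base, smell, label)
                    os.makedirs(dest, exist_ok=True)
                    shutil.copy(path, dest)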

Here, smells_result_path has the following structure:

smells_result_path
	- repo_1
		- analysis_summary.csv
		- project1_archSmells.csv
		- project1_implSmells.csv
		- project1_designSmells.csv
		- project1_methodMetrics.csv
		- project1_classMetrics.csv
		- project2_archSmells.csv
		- project2_implSmells.csv
		- project2_designSmells.csv
		- project2_methodMetrics.csv
		- project2_classMetrics.csv
		...
	- repo_2
		...
	- repo_n
		...

all_smells_files is a collection of files containing information about detected smells. For implementation smells, it is the list of projectN_implSmells.csv files; similarly, for design smells, it is the list of projectN_designSmells.csv files.

Each project1_implSmells.csv has the following columns:

Implementation_smell_name | Namespace_name | Class_name | File_path | Method_name | Description |

Each project1_designSmells.csv has the following columns:

Design_smell_name | Namespace_name | Class_name | File_path | Description |
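
For illustration, a row in an implSmells.csv file might look like this (all values are hypothetical):

Complex Method | MyApp.Core | Parser | src/Parser.cs | ParseLine | The method has a cyclomatic complexity of 12 |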

Tokenizing learning data

Machine learning algorithms, including neural networks, take vectors of numbers as input. Hence, we need to convert source code into vectors of numbers while honoring the language keywords and other semantics. Tokenizer is an open-source tool that tokenizes source code into integer vectors, symbols, or discrete tokens. It currently supports six programming languages, including C# and Java.
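
A thin Python wrapper around the tokenizer executable might look as follows; the command-line flags shown are assumptions for illustration, not the tool's documented interface:

import subprocess

def tokenize_file(tokenizer_exe_path, language, in_file):
    # "-l <language>" is a hypothetical flag; consult the tokenizer's
    # documentation for the actual options.
    result = subprocess.run([tokenizer_exe_path, "-l", language, in_file],
                            capture_output=True, text=True, check=True)
    return result.stdout  # integer tokens of the input code fragment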

Input: learning_data_path, smells_list
Output: tokenized output in tokenizer_out/<smell>/<dim>/(training|eval)/(positive|negative) folders

for all smell in smells_list do:
    for all dim in (1d, 2d) do:
        for all training_case in (training, eval) do:
            for all learning_case in (positive, negative) do:
                in_file = get file path at learning_data_path/smell/training_case/learning_case
                out_file = create a file if not exists at tokenizer_out/smell/dim/training_case/learning_case
                tokenized_file = tokenize in_file using tokenizer; pass appropriate parameters based on dim and programming language
                if size of out_file is greater than 50 MB:
                    out_file = create a new file
                append the contents of tokenized_file at the end of out_file
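
In plain Python, the append-and-rotate step at the end of this loop could be sketched like this (the part-file naming scheme is an assumption):

import os

def append_with_rotation(out_dir, text, max_bytes=50 * 1024 * 1024):
    os.makedirs(out_dir, exist_ok=True)
    # Zero-padded names keep the parts in order when sorted lexically.
    parts = sorted(p for p in os.listdir(out_dir) if p.startswith("tokenized_"))
    if parts and os.path.getsize(os.path.join(out_dir, parts[-1])) <= max_bytes:
        path = os.path.join(out_dir, parts[-1])  # keep appending to the newest
    else:
        path = os.path.join(out_dir, "tokenized_%03d.txt" % len(parts))
    with open(path, "a") as f:
        f.write(text)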

Data format

For the 1D format, each sample is stored on a single line.

Sample-1
Sample-2
...

For the 2D format, consecutive samples are separated by a blank line:

Sample-1-line1
Sample-1-line2
...
Sample-1-linen

Sample-2-line1
Sample-2-line2
...
Sample-2-linen

...
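
A minimal reader for both formats, assuming tokens within a line are whitespace-separated integers:

def read_1d(path):
    # One sample per line; each sample is a list of integer tokens.
    with open(path) as f:
        return [[int(t) for t in line.split()] for line in f if line.strip()]

def read_2d(path):
    # Samples are blocks of lines separated by a blank line.
    with open(path) as f:
        blocks = f.read().split("\n\n")
    return [[[int(t) for t in line.split()] for line in block.splitlines()]
            for block in blocks if block.strip()]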

Data preparation

The stored samples are read into numpy arrays, preprocessed, and filtered. We first perform minimal preprocessing to clean the data: for both 1D and 2D samples, we scan all the samples for each smell and remove duplicates, if any exist.

We balance the number of training samples by taking the smaller of the positive and negative sample counts and discarding the excess training samples on the larger side. We determine the maximum input length (or maximum input height and width, in the case of 2D samples) of an individual sample. To filter out outliers, we read all the samples into a numpy array and compute the mean and standard deviation of sample lengths; we discard all samples whose length is greater than mean + standard deviation. This filtering keeps the training set within reasonable bounds and avoids wasting memory and processing resources. Finally, we shuffle the array of input samples along with its corresponding labels array.

Input: tokenize_out_path, smell, dim
Output: training_data, training_labels, eval_data, eval_labels

for all training_case in (training, eval) do:
    folder_path = initialize the folder path for the training_case, smell, and dim
    filter out duplicates in the samples in folder_path
    if training_case is training:
        total_cases = initialize with minimum of total positive and negative samples
    outlier_threshold = read all samples into a numpy array, compute mean and standard deviation, and set the threshold at mean + standard deviation
    if training_case is training:
        data, labels = read all samples into a numpy array (maximum total_cases per positive and negative) of type float with size of sample less than the outlier_threshold and set their corresponding labels in the labels array
    else:
        data, labels = read all samples into a numpy array of type float with size of sample less than the outlier_threshold and set their corresponding labels in the labels array
    shuffle data and labels
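
A numpy sketch of these steps for 1D samples in the training case: balance the two classes, drop samples longer than mean plus one standard deviation, and shuffle data and labels in unison. Padding samples to a common length is implied by the maximum-input-length step; the zero-padding used here is an assumption.

import numpy as np

def prepare(positive, negative, rng=np.random.default_rng(42)):
    # Balance: keep the smaller of the two class counts (training case).
    n = min(len(positive), len(negative))
    samples = positive[:n] + negative[:n]
    labels = np.array([1] * n + [0] * n)

    # Outlier filtering: discard samples longer than mean + std.
    lengths = np.array([len(s) for s in samples])
    threshold = lengths.mean() + lengths.std()
    keep = [i for i, length in enumerate(lengths) if length <= threshold]

    # Zero-pad to the maximum remaining length (an assumption) and stack.
    max_len = max(len(samples[i]) for i in keep)
    data = np.zeros((len(keep), max_len), dtype=np.float32)
    for row, i in enumerate(keep):
        data[row, :len(samples[i])] = samples[i]
    labels = labels[keep]

    # Shuffle data and labels with the same permutation.
    perm = rng.permutation(len(keep))
    return data[perm], labels[perm]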

References

  1. Nuthan Munaiah, Steven Kroh, Craig Cabrey, and Meiyappan Nagappan. 2017. Curating GitHub for engineered software projects. Empirical Software Engineering 22, 6 (Dec 2017), 3219–3253. https://doi.org/10.1007/s10664-017-9512-6


deeplearningsmells's Issues

Instructions on running the model

How can I run the trained model to test a new repository? I could not find any .h5 files that could be the weights for the DL model. I may be wrong; I am still pretty new to the field of machine learning and deep learning.

Thanks for your fantastic work!

How can I find all_cs_repos or all_java_repos?

Hello. Thank you for the awesome project.

I am trying to run the program on my own, but I could not find all_cs_repos and BatchFiles. I have made some adjustments and thought I could use training_data_cs, but analyze_repositories is searching for .csproj files and could not find any. My question is: should we make our own dataset, or can I really use training_data_cs?

Here is my data_curation_main.py:

# This program generates the required data and puts it into the required form
# to apply machine/deep learning, in a step-by-step way.
# Set the parameters first before running it.
# Steps are in general independent of one another, except that some steps
# consume data generated by previous steps.

# -- imports --
# import cs_designite_runner
# import tokenizer_runner
import cs_code_split_runner

import os

import cs_designite_runner


# New folders getting created: codesplit_out_class,

# Base directory of the project: the current working directory
BASE_DIR = os.path.abspath(os.path.join(os.getcwd()))
DATA_BASE_PATH = os.path.join(os.getcwd(), "data")


CS_REPO_SOURCE_FOLDER = os.path.join(DATA_BASE_PATH, "training_data_cs")
BATCH_FILES_FOLDER = os.path.join(DATA_BASE_PATH, "BatchFiles")
CS_SMELLS_RESULTS_FOLDER = os.path.join(DATA_BASE_PATH, "designite_out_new")
CS_DESIGNITE_CONSOLE_PATH = os.path.join(
    BASE_DIR, "Designite_4_1_1_0", "DesigniteConsole.exe"
)

CS_CODE_SPLIT_OUT_FOLDER_CLASS = os.path.join(DATA_BASE_PATH, "codesplit_out_class")
CS_CODE_SPLIT_OUT_FOLDER_METHOD = os.path.join(DATA_BASE_PATH, "codesplit_out_method")
CS_CODE_SPLIT_MODE_CLASS = "-c"
CS_CODE_SPLIT_MODE_METHOD = "-m"
CS_CODE_SPLIT_EXE_PATH = os.path.join(
    BASE_DIR, "CodeSplit", "CodeSplit_1_1_0_0", "CodeSplit.exe"
)

TOKENIZER_EXE_PATH = os.path.join(DATA_BASE_PATH, "tokenizer.exe")
CS_TOKENIZER_OUT_PATH = os.path.join(DATA_BASE_PATH, "tokenizer_out")

JAVA_REPO_SOURCE_FOLDER = os.path.join(DATA_BASE_PATH, "all_java_repos")
JAVA_SMELLS_RESULTS_FOLDER = os.path.join(DATA_BASE_PATH, "designite_out_java")
DESIGNITE_JAVA_JAR_PATH = os.path.join(DATA_BASE_PATH, "DesigniteJava.jar")

JAVA_CODE_SPLIT_OUT_FOLDER_CLASS = os.path.join(DATA_BASE_PATH, "codesplit_java_class")
JAVA_CODE_SPLIT_OUT_FOLDER_METHOD = os.path.join(
    DATA_BASE_PATH, "codesplit_java_method"
)

JAVA_CODE_SPLIT_MODE_CLASS = "class"
JAVA_CODE_SPLIT_MODE_METHOD = "method"
JAVA_CODE_SPLIT_EXE_PATH = os.path.join(DATA_BASE_PATH, "CodeSplitJava.jar")

JAVA_LEARNING_DATA_FOLDER_BASE = os.path.join(DATA_BASE_PATH, "smellML_data_java")
JAVA_TOKENIZER_OUT_PATH = os.path.join(DATA_BASE_PATH, "tokenizer_out_java")


if __name__ == "__main__":
    # 1. Run Designite to analyze C# repositories
    # This step requires that you have downloaded C# repositories to analyze and have installed
    # Designite on your machine. Designite can be downloaded from its website (http://www.designite-tools.com).
    # alright
    cs_designite_runner.analyze_repositories(
        CS_REPO_SOURCE_FOLDER,
        BATCH_FILES_FOLDER,
        CS_SMELLS_RESULTS_FOLDER,  # -> designite_out
        CS_DESIGNITE_CONSOLE_PATH,
    )

    # 2. Run codeSplit for all C# repositories
    # 2.1 Run codeSplit to generate class code fragments (each code fragment will contain a class definition)
    cs_code_split_runner.cs_code_split(
        CS_REPO_SOURCE_FOLDER,
        CS_CODE_SPLIT_OUT_FOLDER_CLASS,
        CS_CODE_SPLIT_MODE_CLASS,
        CS_CODE_SPLIT_EXE_PATH,
    )

    # 2.2 Run codeSplit to generate method code fragments (each code fragment will contain a method definition)
    cs_code_split_runner.cs_code_split(
        CS_REPO_SOURCE_FOLDER,
        CS_CODE_SPLIT_OUT_FOLDER_METHOD,
        CS_CODE_SPLIT_MODE_METHOD,
        CS_CODE_SPLIT_EXE_PATH,
    )

    # 3. Run learning data generator that will classify code fragments into either positive or negative cases
    # based on occurrence of smell in that fragment
    # cs_learning_data_generator.generate_data(CS_SMELLS_RESULTS_FOLDER, CS_CODE_SPLIT_OUT_FOLDER_CLASS,
    #                                        CS_CODE_SPLIT_OUT_FOLDER_METHOD, CS_LEARNING_DATA_FOLDER_BASE)

    # 4. Run tokenizer to convert code fragments into vectors/matrices of numbers that can be fed to neural network.
    # tokenizer_runner.tokenize("CSharp", CS_LEARNING_DATA_FOLDER_BASE, CS_TOKENIZER_OUT_PATH, TOKENIZER_EXE_PATH)

    # 5-8. We repeat steps 1 to 4 for Java repositories
    # 5. Run DesigniteJava to analyze Java repositories
    # java_designite_runner.analyze_repositories(JAVA_REPO_SOURCE_FOLDER, JAVA_SMELLS_RESULTS_FOLDER, DESIGNITE_JAVA_JAR_PATH)

    # 6. Run CodeSplit for all Java repositories
    # 6.1 Run codeSplit to generate class code fragments
    # java_codeSplit_runner.java_code_split(JAVA_REPO_SOURCE_FOLDER, JAVA_CODE_SPLIT_MODE_CLASS,
    #                                       JAVA_CODE_SPLIT_OUT_FOLDER_CLASS, JAVA_CODE_SPLIT_EXE_PATH)

    # 6.2 Run codeSplit to generate method code fragments
    # java_codeSplit_runner.java_code_split(JAVA_REPO_SOURCE_FOLDER, JAVA_CODE_SPLIT_MODE_METHOD,
    #                                       JAVA_CODE_SPLIT_OUT_FOLDER_METHOD, JAVA_CODE_SPLIT_EXE_PATH)

    # 7. Run learning data generator that will classify Java code fragments into either positive or negative cases
    # based on occurrence of smell in that fragment
    # java_learning_data_generator.generate_data(JAVA_SMELLS_RESULTS_FOLDER, JAVA_CODE_SPLIT_OUT_FOLDER_CLASS,
    #                                            JAVA_CODE_SPLIT_OUT_FOLDER_METHOD, JAVA_LEARNING_DATA_FOLDER_BASE)

    # 8. Run tokenizer to convert code fragments into vectors/matrices of numbers that can be fed to neural network.
    # tokenizer_runner.tokenize(
    #     "Java",  # tokenizer_language
    #     JAVA_LEARNING_DATA_FOLDER_BASE,  # tokenizer_input_base_path
    #     JAVA_TOKENIZER_OUT_PATH,  # tokenizer_out_base_path
    #     TOKENIZER_EXE_PATH,  # tokenizer_exe_path
    # )

Is "ValueError: Dimensions must be equal" normal?

Thank you for the awesome project!

I was trying to run this project on my own and ran into an issue; could you please take a look:

Epoch 1/20
in user code:

    File "/Users/nguyenbinhminh/miniconda3/envs/deepsmells/lib/python3.9/site-packages/keras/engine/training.py", line 1284, in train_function  *
        return step_function(self, iterator)
    File "/Users/nguyenbinhminh/miniconda3/envs/deepsmells/lib/python3.9/site-packages/keras/engine/training.py", line 1268, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/Users/nguyenbinhminh/miniconda3/envs/deepsmells/lib/python3.9/site-packages/keras/engine/training.py", line 1249, in run_step  **
        outputs = model.train_step(data)
    File "/Users/nguyenbinhminh/miniconda3/envs/deepsmells/lib/python3.9/site-packages/keras/engine/training.py", line 1051, in train_step
        loss = self.compute_loss(x, y, y_pred, sample_weight)
    File "/Users/nguyenbinhminh/miniconda3/envs/deepsmells/lib/python3.9/site-packages/keras/engine/training.py", line 1109, in compute_loss
        return self.compiled_loss(
    File "/Users/nguyenbinhminh/miniconda3/envs/deepsmells/lib/python3.9/site-packages/keras/engine/compile_utils.py", line 265, in __call__
        loss_value = loss_obj(y_t, y_p, sample_weight=sw)
    File "/Users/nguyenbinhminh/miniconda3/envs/deepsmells/lib/python3.9/site-packages/keras/losses.py", line 142, in __call__
        losses = call_fn(y_true, y_pred)
    File "/Users/nguyenbinhminh/miniconda3/envs/deepsmells/lib/python3.9/site-packages/keras/losses.py", line 268, in call  **
        return ag_fn(y_true, y_pred, **self._fn_kwargs)
    File "/Users/nguyenbinhminh/miniconda3/envs/deepsmells/lib/python3.9/site-packages/keras/losses.py", line 1470, in mean_squared_error
        return backend.mean(tf.math.squared_difference(y_pred, y_true), axis=-1)

    ValueError: Dimensions must be equal, but are 700 and 714 for '{{node mean_squared_error/SquaredDifference}} = SquaredDifference[T=DT_FLOAT](model_107/dense_131/Relu, IteratorGetNext:1)' with input shapes: [?,700,1], [?,714,1].

Here are the steps:

  • I don't have a super strong computer, so I created a very small subset of your ComplexMethod data: ComplexMethod.zip
  • I modified autoencoder.py to run only the ComplexMethod smell
  • I ran python3 autoencoder.py from the program/dl_models folder

Sorry for my poor English, if you need more information please ask!
