code2seq's Introduction

code2seq

PyTorch implementation of the code2seq model.

Installation

You can easily install the model via pip:

pip install code2seq

Dataset mining

To prepare your own dataset in a storage format supported by this implementation, use one of the following:

  1. Original dataset preprocessing from the vanilla repository
  2. astminer: a tool for mining path-based representations and more, with support for multiple languages.
  3. PSIMiner: a tool for extracting PSI trees from the IntelliJ Platform and creating datasets from them.

Available checkpoints

Method name prediction

Dataset       Checkpoint   # epochs   F1-score   Precision   Recall   ChrF
Java-small    link         11         41.49      54.26       33.59    30.21
Java-med      link         10         48.17      58.87       40.76    42.32
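
As a quick sanity check, F1 here is the harmonic mean of precision and recall, so each row can be reproduced from its own columns:

p, r = 54.26, 33.59       # Java-small precision and recall from the table
f1 = 2 * p * r / (p + r)  # harmonic mean
print(round(f1, 2))       # 41.49, matching the F1-score column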

Configuration

The model is fully configurable via a standalone YAML file. Navigate to the config directory for example configs.
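
The training example in the next section reads the following top-level keys. Below is a minimal sketch of that layout with illustrative placeholder values; it is an assumption based on the example code, and the configs in the config directory are the authoritative reference.

# Sketch only -- values and empty sections are placeholders.
data_folder: path/to/java-small   # read as config.data_folder

data: {}        # options for PathContextDataModule (config.data)
model: {}       # options for Code2Seq (config.model)
optimizer: {}   # optimizer settings (config.optimizer)

train:
  n_epochs: 10          # read as config.train.n_epochs
  teacher_forcing: 1.0  # read as config.train.teacher_forcing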

Examples

Model training can be done via the PyTorch Lightning Trainer. See its documentation for more information.

from argparse import ArgumentParser

from omegaconf import DictConfig, OmegaConf
from pytorch_lightning import Trainer

from code2seq.data.path_context_data_module import PathContextDataModule
from code2seq.model import Code2Seq


def train(config: DictConfig):
    # Define data module
    data_module = PathContextDataModule(config.data_folder, config.data)

    # Define model
    model = Code2Seq(
        config.model,
        config.optimizer,
        data_module.vocabulary,
        config.train.teacher_forcing
    )

    # Define hyper parameters
    trainer = Trainer(max_epochs=config.train.n_epochs)

    # Train model
    trainer.fit(model, datamodule=data_module)


if __name__ == "__main__":
    arg_parser = ArgumentParser()
    arg_parser.add_argument("config", help="Path to YAML configuration file", type=str)
    args = arg_parser.parse_args()

    config = OmegaConf.load(args.config)
    train(config)
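
Assuming the snippet above is saved as train.py, training is launched by passing the config path as the only argument:

python train.py path/to/config.yaml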

code2seq's Issues

'\n' mixed in Vocabulary['token']

It seems that the counter in the vocabulary is counting 'token' tokens that include a newline character. For example, in vocabulary.pkl for the java-small dataset, I can find
'return': 6020684
and
'return\n': 33290
as separate entries.

I personally fixed this problem by stripping the path context in Vocabulary._process_raw_sample (see the sketch below), but I'm a little confused about whether this behavior (mixing '\n' into tokens) is intended.
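
A minimal standalone illustration of the problem and the strip-based fix; the Counter here stands in for the real vocabulary counting, and the input lines are made up:

from collections import Counter

# Raw dataset lines keep their trailing newline, so the last token of each
# line is counted separately from its stripped form:
lines = ["return x\n", "return x\n", "x return\n"]
naive = Counter(tok for line in lines for tok in line.split(" "))
# Counter({'return': 2, 'x\n': 2, 'x': 1, 'return\n': 1})

# Stripping the newline before splitting merges the duplicates:
fixed = Counter(tok for line in lines for tok in line.strip("\n").split(" "))
# Counter({'return': 3, 'x': 3})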

Thank you!

Application to real case study

Hello JetBrains team,

Could you please give some hints: given a whole software system written in several programming languages, how can the code2seq tool be applied to it?

Save model for prediction

Thanks JetBrains team,

I would like to train the model first on a Python dataset and a Java dataset, each of around 20k files. Then, once trained, run prediction to get embeddings for both the Java and the Python files. We would like one vector embedding of size, say, 120 predicted for each file after training. Is that possible with your implementation?

Also, do you have a trained model available, to avoid training altogether? We had trouble training the original code2seq model on a Python dataset: the tensors are very large and we got an OOM (out of memory) error. Our PC has 16 GB of RAM and a 4 GB GPU. The dataset we used to train the original code2seq model was only around 1 GB for train and test combined, but we still got the OOM error. Can we run your model, or do we need better hardware?

Thanks.
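
Regarding the released checkpoints listed under Available checkpoints above: a hedged sketch of loading one for inference, assuming Code2Seq behaves as a standard pytorch_lightning.LightningModule (the checkpoint file name is illustrative):

from code2seq.model import Code2Seq

# Illustrative file name; download a checkpoint from the table above first.
checkpoint_path = "code2seq-java-small.ckpt"

# load_from_checkpoint is the standard LightningModule way to restore
# weights together with saved hyperparameters.
model = Code2Seq.load_from_checkpoint(checkpoint_path)
model.eval()  # disable dropout and other training-only behavior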

Improve Code2Class usage scenario

Following the discussion in #115, it seems that we need to:

  • Support Code2Class usage for building vector representations (the current implementation requires a number of classes, which is undefined for vector representations).
  • Add a model description and usage documentation for this model.
  • Update all configs for the latest version of the model usage pipeline.

Make typechecker happy

Probably a nitpick, but it would be nice to add some annotations and make the typechecker happy:

$ mypy .

utils/metrics.py:21: error: Incompatible types in assignment (expression has type "float", variable has type "int")
utils/metrics.py:23: error: Incompatible types in assignment (expression has type "float", variable has type "int")
utils/metrics.py:25: error: Incompatible types in assignment (expression has type "float", variable has type "int")
dataset/path_context_dataset.py:24: error: Need type annotation for '_buffered_files_paths' (hint: "_buffered_files_paths: List[<type>] = ...")
dataset/path_context_dataset.py:46: error: Argument 1 to "len" has incompatible type "None"; expected "Sized"
dataset/path_context_dataset.py:65: error: Unsupported left operand type for >= ("None")
dataset/path_context_dataset.py:68: error: Argument 1 to "_prepare_buffer" of "PathContextDataset" has incompatible type "None"; expected "int"
dataset/path_context_dataset.py:70: error: Unsupported operand types for + ("None" and "int")
dataset/path_context_dataset.py:70: error: Incompatible types in assignment (expression has type "int", variable has type "None")
dataset/path_context_dataset.py:71: error: Unsupported left operand type for >= ("None")
dataset/path_context_dataset.py:73: error: Argument 1 to "_prepare_buffer" of "PathContextDataset" has incompatible type "None"; expected "int"
dataset/path_context_dataset.py:74: error: Value of type "None" is not indexable
model/base_code_model.py:41: error: Incompatible types in assignment (expression has type "Adam", variable has type "SGD")
model/code2seq.py:69: error: Argument 1 to "update" of "dict" has incompatible type "Dict[str, int]"; expected "Mapping[str, Tensor]"
model/code2seq.py:93: error: Argument 1 to "update" of "dict" has incompatible type "Dict[str, int]"; expected "Mapping[str, Tensor]"
train.py:35: error: Incompatible types in assignment (expression has type "Callable[[str], Code2ClassConfig]", variable has type "Callable[[str], Code2SeqConfig]")
train.py:37: error: Argument 1 to "Code2Class" has incompatible type "Code2SeqConfig"; expected "Code2ClassConfig"
train.py:46: error: "None" has no attribute "dir"
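
As an illustration, the dataset error above comes with mypy's own hint; a sketch of the kind of annotation it asks for (the element type is an assumption, since the log only shows List[<type>]):

from typing import List

class PathContextDataset:
    def __init__(self) -> None:
        # An explicit element type lets mypy see through the empty list;
        # str is assumed here -- the hint leaves <type> open.
        self._buffered_files_paths: List[str] = []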

Not able to replicate results of paper

Hi, thanks for sharing your code!
After running the code for 12 epochs (early stopping) on the java-small dataset, the F1 score on the test data is only 26.6%, while the original paper reports 43.2%.
Do you have any ideas about this? Thanks!

Can't install black 20.8b1 with pip

pip install code2seq
Collecting code2seq
  Using cached code2seq-0.0.3-py3-none-any.whl (32 kB)
Requirement already satisfied: numpy==1.20.1 in c:\miniconda\envs\code2seq-jb\lib\site-packages (from code2seq) (1.20.1)
Requirement already satisfied: wandb==0.10.20 in c:\miniconda\envs\code2seq-jb\lib\site-packages (from code2seq) (0.10.20)
Requirement already satisfied: hydra-core==1.0.6 in c:\miniconda\envs\code2seq-jb\lib\site-packages (from code2seq) (1.0.6)
Requirement already satisfied: pytorch-lightning==1.1.7 in c:\miniconda\envs\code2seq-jb\lib\site-packages (from code2seq) (1.1.7)
Requirement already satisfied: omegaconf==2.0.6 in c:\miniconda\envs\code2seq-jb\lib\site-packages (from code2seq) (2.0.6)
Requirement already satisfied: tqdm==4.58.0 in c:\miniconda\envs\code2seq-jb\lib\site-packages (from code2seq) (4.58.0)
Requirement already satisfied: torch==1.7.1 in c:\miniconda\envs\code2seq-jb\lib\site-packages (from code2seq) (1.7.1)
Collecting black==20.8b1
  Using cached black-20.8b1.tar.gz (1.1 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
  WARNING: Requested black==20.8b1 from https://files.pythonhosted.org/packages/dc/7b/5a6bbe89de849f28d7c109f5ea87b65afa5124ad615f3419e71beb29dc96/black-20.8b1.tar.gz#sha256=1c02557aa099101b9d21496f8a914e9ed2222ef70336404eeeac8edba836fbea (from code2seq), but installing version 0.0.0
WARNING: Discarding https://files.pythonhosted.org/packages/dc/7b/5a6bbe89de849f28d7c109f5ea87b65afa5124ad615f3419e71beb29dc96/black-20.8b1.tar.gz#sha256=1c02557aa099101b9d21496f8a914e9ed2222ef70336404eeeac8edba836fbea (from https://pypi.org/simple/black/) (requires-python:>=3.6). Requested black==20.8b1 from https://files.pythonhosted.org/packages/dc/7b/5a6bbe89de849f28d7c109f5ea87b65afa5124ad615f3419e71beb29dc96/black-20.8b1.tar.gz#sha256=1c02557aa099101b9d21496f8a914e9ed2222ef70336404eeeac8edba836fbea (from code2seq) has inconsistent version: filename has '20.8b1', but metadata has '0.0.0'
Collecting code2seq
  Using cached code2seq-0.0.2-py3-none-any.whl (32 kB)
  Using cached code2seq-0.0.1-py3-none-any.whl (32 kB)
  Using cached code2seq-0.0.0-py3-none-any.whl (32 kB)
ERROR: Cannot install code2seq==0.0.0, code2seq==0.0.1, code2seq==0.0.2 and code2seq==0.0.3 because these package versions have conflicting dependencies.

The conflict is caused by:
    code2seq 0.0.3 depends on black==20.8b1
    code2seq 0.0.2 depends on black==20.8b1
    code2seq 0.0.1 depends on black==20.8b1
    code2seq 0.0.0 depends on black==20.8b1

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies

I think there's a problem between Windows and black 20.8b1: as the log shows, the sdist builds with metadata version 0.0.0, so pip rejects it. Could you loosen the pinned package versions?

Thank you.

Failed to run minimal code example

I tried to run the minimal code example with this configuration file, but it failed with the following error:

Traceback (most recent call last):
  File "code2seq_example.py", line 34, in <module>
    train(__config)
  File "code2seq_example.py", line 24, in train
    trainer = Trainer(max_epochs=config.hyper_parameters.n_epochs)
  File "/home/junkaichen/miniconda3/envs/d2l/lib/python3.7/site-packages/omegaconf/dictconfig.py", line 354, in __getattr__
    key=key, value=None, cause=e, type_override=ConfigAttributeError
  File "/home/junkaichen/miniconda3/envs/d2l/lib/python3.7/site-packages/omegaconf/base.py", line 196, in _format_and_raise
    type_override=type_override,
  File "/home/junkaichen/miniconda3/envs/d2l/lib/python3.7/site-packages/omegaconf/_utils.py", line 821, in format_and_raise
    _raise(ex, cause)
  File "/home/junkaichen/miniconda3/envs/d2l/lib/python3.7/site-packages/omegaconf/_utils.py", line 719, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set end OC_CAUSE=1 for full backtrace
  File "/home/junkaichen/miniconda3/envs/d2l/lib/python3.7/site-packages/omegaconf/dictconfig.py", line 351, in __getattr__
    return self._get_impl(key=key, default_value=_DEFAULT_MARKER_)
  File "/home/junkaichen/miniconda3/envs/d2l/lib/python3.7/site-packages/omegaconf/dictconfig.py", line 438, in _get_impl
    node = self._get_node(key=key, throw_on_missing_key=True)
  File "/home/junkaichen/miniconda3/envs/d2l/lib/python3.7/site-packages/omegaconf/dictconfig.py", line 470, in _get_node
    raise ConfigKeyError(f"Missing key {key}")
omegaconf.errors.ConfigAttributeError: Missing key hyper_parameters
    full_key: hyper_parameters
    object_type=dict

What I did:

  1. Copy the code example to a .py file and this configuration file to a .yaml file.
  2. Download the data from the URL https://s3.eu-west-1.amazonaws.com/datasets.ml.labs.aws.intellij.net/java-paths-methods/java-small.tar.gz in the configuration file and extract it.
  3. Change the first line of the .yaml file (data_folder) to the location of the data.
  4. Run the script: python xxx.py xxx.yaml

I noticed that this error is related to the YAML config, but I don't know where exactly the problem lies. I want to run a demo of the code2seq model with this library. Thank you!
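
For what it's worth, the traceback shows the script reading config.hyper_parameters.n_epochs, while the Examples snippet above reads config.train.n_epochs, so the script and the YAML file likely disagree on the key layout. A minimal way to check which sections a config actually defines (illustrative, not from the repo):

from omegaconf import OmegaConf

config = OmegaConf.load("config.yaml")  # illustrative path
print(OmegaConf.to_yaml(config))        # dump the keys the file really defines
print("train" in config, "hyper_parameters" in config)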

Debug output with hyperparameters

The original code2seq implementation prints a summary of the hyperparameters before training begins:

Training batch size:			 512
Dataset path:				 java-small/java-small
Training file path:			 java-small/java-small.train.c2s
Validation path:			 java-small/java-small.val.c2s
Taking max contexts from each example:	 200
Random path sampling:			 True
Embedding size:				 128
Using BiLSTMs, each of size:		 128
Decoder size:				 320
Decoder layers:				 1
Max path lengths:			 9
Max subtokens in a token:		 5
Max target length:			 6
Embeddings dropout keep_prob:		 0.75
LSTM dropout keep_prob:			 0.5
============================================
Number of trainable params: 10950144

It would be very nice to do something similar here as well.
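
A hedged sketch of reproducing such a summary with the tools already used in this repository (OmegaConf for the config tree, plain PyTorch for the parameter count); the function name is illustrative:

from omegaconf import DictConfig, OmegaConf
from torch import nn

def print_training_summary(config: DictConfig, model: nn.Module) -> None:
    # OmegaConf.to_yaml renders the whole hyperparameter tree at once.
    print(OmegaConf.to_yaml(config))
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print("=" * 44)
    print(f"Number of trainable params: {n_params}")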
