code2seq's Introduction

code2seq

PyTorch implementation of the code2seq model.

Installation

You can easily install the model via pip:

pip install code2seq

Dataset mining

To prepare your own dataset in a storage format supported by this implementation, use one of the following:

  1. Original dataset preprocessing from the vanilla repository
  2. astminer: a tool for mining path-based representations and more, with support for multiple languages.
  3. PSIMiner: a tool for extracting PSI trees from the IntelliJ Platform and creating datasets from them.

Available checkpoints

Method name prediction

Dataset       Checkpoint   # epochs   F1-score   Precision   Recall   ChrF
Java-small    link         11         41.49      54.26       33.59    30.21
Java-med      link         10         48.17      58.87       40.76    42.32
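
As a quick sanity check, F1 here is the harmonic mean of precision and recall, so each row can be reproduced from its own columns:

p, r = 54.26, 33.59       # Java-small precision and recall from the table
f1 = 2 * p * r / (p + r)  # harmonic mean
print(round(f1, 2))       # 41.49, matching the F1-score column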

Configuration

The model is fully configurable via a standalone YAML file. Navigate to the config directory for example configs.
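
The training example in the next section reads the following top-level keys. Below is a minimal sketch of that layout with illustrative placeholder values; it is an assumption based on the example code, and the configs in the config directory are the authoritative reference.

# Sketch only -- values and empty sections are placeholders.
data_folder: path/to/java-small   # read as config.data_folder

data: {}        # options for PathContextDataModule (config.data)
model: {}       # options for Code2Seq (config.model)
optimizer: {}   # optimizer settings (config.optimizer)

train:
  n_epochs: 10          # read as config.train.n_epochs
  teacher_forcing: 1.0  # read as config.train.teacher_forcing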

Examples

Model training can be done via the PyTorch Lightning Trainer. See its documentation for more information.

from argparse import ArgumentParser

from omegaconf import DictConfig, OmegaConf
from pytorch_lightning import Trainer

from code2seq.data.path_context_data_module import PathContextDataModule
from code2seq.model import Code2Seq


def train(config: DictConfig):
    # Define data module
    data_module = PathContextDataModule(config.data_folder, config.data)

    # Define model
    model = Code2Seq(
        config.model,
        config.optimizer,
        data_module.vocabulary,
        config.train.teacher_forcing
    )

    # Define hyper parameters
    trainer = Trainer(max_epochs=config.train.n_epochs)

    # Train model
    trainer.fit(model, datamodule=data_module)


if __name__ == "__main__":
    arg_parser = ArgumentParser()
    arg_parser.add_argument("config", help="Path to YAML configuration file", type=str)
    args = arg_parser.parse_args()

    config = OmegaConf.load(args.config)
    train(config)
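
Assuming the snippet above is saved as train.py, training is launched by passing the config path as the only argument:

python train.py path/to/config.yaml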

code2seq's Issues

'\n' mixed in Vocabulary['token']

It seems that the counter in the vocabulary is counting 'token' tokens that include a newline character. For example, in vocabulary.pkl for the java-small dataset, I can find
'return': 6020684
and
'return\n': 33290
as separate entries.

I personally fixed this problem by stripping the path context in Vocabulary._process_raw_sample (see the sketch below), but I'm a little confused about whether this behavior (mixing '\n' into tokens) is intended.
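
A minimal standalone illustration of the problem and the strip-based fix; the Counter here stands in for the real vocabulary counting, and the input lines are made up:

from collections import Counter

# Raw dataset lines keep their trailing newline, so the last token of each
# line is counted separately from its stripped form:
lines = ["return x\n", "return x\n", "x return\n"]
naive = Counter(tok for line in lines for tok in line.split(" "))
# Counter({'return': 2, 'x\n': 2, 'x': 1, 'return\n': 1})

# Stripping the newline before splitting merges the duplicates:
fixed = Counter(tok for line in lines for tok in line.strip("\n").split(" "))
# Counter({'return': 3, 'x': 3})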

Thank you!

Application to real case study

Hello JetBrains team,

Could you please give some hints: given a whole software system written in several programming languages, how can the code2seq tool be applied to it?

Save model for prediction

Thanks JetBrains team,

I would like to train the model first on a Python dataset and a Java dataset, each of around 20k files. Then, once trained, run prediction to get embeddings for both the Java and the Python files. We would like one vector embedding of size, say, 120 predicted for each file after training. Is that possible with your implementation?

Also, do you have a trained model available, to avoid training altogether? We had trouble training the original code2seq model on a Python dataset: the tensors are very large and we got an OOM (out of memory) error. Our PC has 16 GB of RAM and a 4 GB GPU. The dataset we used to train the original code2seq model was only around 1 GB for train and test combined, but we still got the OOM error. Can we run your model, or do we need better hardware?

Thanks.
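
Regarding the released checkpoints listed under Available checkpoints above: a hedged sketch of loading one for inference, assuming Code2Seq behaves as a standard pytorch_lightning.LightningModule (the checkpoint file name is illustrative):

from code2seq.model import Code2Seq

# Illustrative file name; download a checkpoint from the table above first.
checkpoint_path = "code2seq-java-small.ckpt"

# load_from_checkpoint is the standard LightningModule way to restore
# weights together with saved hyperparameters.
model = Code2Seq.load_from_checkpoint(checkpoint_path)
model.eval()  # disable dropout and other training-only behavior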

Improve Code2Class usage scenario

Following the discussion in #115, it seems that we need to:

  • Support Code2Class usage for building vector representations (the current implementation requires a number of classes, which is undefined for vector representations).
  • Add a model description and usage documentation for this model.
  • Update all configs for the latest version of the model usage pipeline.

Make typechecker happy

Probably a nitpick, but it would be nice to add some annotations and make the typechecker happy:

$ mypy .

utils/metrics.py:21: error: Incompatible types in assignment (expression has type "float", variable has type "int")
utils/metrics.py:23: error: Incompatible types in assignment (expression has type "float", variable has type "int")
utils/metrics.py:25: error: Incompatible types in assignment (expression has type "float", variable has type "int")
dataset/path_context_dataset.py:24: error: Need type annotation for '_buffered_files_paths' (hint: "_buffered_files_paths: List[<type>] = ...")
dataset/path_context_dataset.py:46: error: Argument 1 to "len" has incompatible type "None"; expected "Sized"
dataset/path_context_dataset.py:65: error: Unsupported left operand type for >= ("None")
dataset/path_context_dataset.py:68: error: Argument 1 to "_prepare_buffer" of "PathContextDataset" has incompatible type "None"; expected "int"
dataset/path_context_dataset.py:70: error: Unsupported operand types for + ("None" and "int")
dataset/path_context_dataset.py:70: error: Incompatible types in assignment (expression has type "int", variable has type "None")
dataset/path_context_dataset.py:71: error: Unsupported left operand type for >= ("None")
dataset/path_context_dataset.py:73: error: Argument 1 to "_prepare_buffer" of "PathContextDataset" has incompatible type "None"; expected "int"
dataset/path_context_dataset.py:74: error: Value of type "None" is not indexable
model/base_code_model.py:41: error: Incompatible types in assignment (expression has type "Adam", variable has type "SGD")
model/code2seq.py:69: error: Argument 1 to "update" of "dict" has incompatible type "Dict[str, int]"; expected "Mapping[str, Tensor]"
model/code2seq.py:93: error: Argument 1 to "update" of "dict" has incompatible type "Dict[str, int]"; expected "Mapping[str, Tensor]"
train.py:35: error: Incompatible types in assignment (expression has type "Callable[[str], Code2ClassConfig]", variable has type "Callable[[str], Code2SeqConfig]")
train.py:37: error: Argument 1 to "Code2Class" has incompatible type "Code2SeqConfig"; expected "Code2ClassConfig"
train.py:46: error: "None" has no attribute "dir"
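
As an illustration, the dataset error above comes with mypy's own hint; a sketch of the kind of annotation it asks for (the element type is an assumption, since the log only shows List[<type>]):

from typing import List

class PathContextDataset:
    def __init__(self) -> None:
        # An explicit element type lets mypy see through the empty list;
        # str is assumed here -- the hint leaves <type> open.
        self._buffered_files_paths: List[str] = []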

Not able to replicate results of paper

Hi, thanks for sharing your code!
After running the code for 12 epochs (early stopping) on the java-small dataset, the F1 score on the test data is only 26.6%, while the original paper reports 43.2%.
Do you have any ideas about this? Thanks!

Can't install black 20.8b1 with pip

pip install code2seq
Collecting code2seq
  Using cached code2seq-0.0.3-py3-none-any.whl (32 kB)
Requirement already satisfied: numpy==1.20.1 in c:\miniconda\envs\code2seq-jb\lib\site-packages (from code2seq) (1.20.1)
Requirement already satisfied: wandb==0.10.20 in c:\miniconda\envs\code2seq-jb\lib\site-packages (from code2seq) (0.10.20)
Requirement already satisfied: hydra-core==1.0.6 in c:\miniconda\envs\code2seq-jb\lib\site-packages (from code2seq) (1.0.6)
Requirement already satisfied: pytorch-lightning==1.1.7 in c:\miniconda\envs\code2seq-jb\lib\site-packages (from code2seq) (1.1.7)
Requirement already satisfied: omegaconf==2.0.6 in c:\miniconda\envs\code2seq-jb\lib\site-packages (from code2seq) (2.0.6)
Requirement already satisfied: tqdm==4.58.0 in c:\miniconda\envs\code2seq-jb\lib\site-packages (from code2seq) (4.58.0)
Requirement already satisfied: torch==1.7.1 in c:\miniconda\envs\code2seq-jb\lib\site-packages (from code2seq) (1.7.1)
Collecting black==20.8b1
  Using cached black-20.8b1.tar.gz (1.1 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
  WARNING: Requested black==20.8b1 from https://files.pythonhosted.org/packages/dc/7b/5a6bbe89de849f28d7c109f5ea87b65afa5124ad615f3419e71beb29dc96/black-20.8b1.tar.gz#sha256=1c02557aa099101b9d21496f8a914e9ed2222ef70336404eeeac8edba836fbea (from code2seq), but installing version 0.0.0
WARNING: Discarding https://files.pythonhosted.org/packages/dc/7b/5a6bbe89de849f28d7c109f5ea87b65afa5124ad615f3419e71beb29dc96/black-20.8b1.tar.gz#sha256=1c02557aa099101b9d21496f8a914e9ed2222ef70336404eeeac8edba836fbea (from https://pypi.org/simple/black/) (requires-python:>=3.6). Requested black==20.8b1 from https://files.pythonhosted.org/packages/dc/7b/5a6bbe89de849f28d7c109f5ea87b65afa5124ad615f3419e71beb29dc96/black-20.8b1.tar.gz#sha256=1c02557aa099101b9d21496f8a914e9ed2222ef70336404eeeac8edba836fbea (from code2seq) has inconsistent version: filename has '20.8b1', but metadata has '0.0.0'
Collecting code2seq
  Using cached code2seq-0.0.2-py3-none-any.whl (32 kB)
  Using cached code2seq-0.0.1-py3-none-any.whl (32 kB)
  Using cached code2seq-0.0.0-py3-none-any.whl (32 kB)
ERROR: Cannot install code2seq==0.0.0, code2seq==0.0.1, code2seq==0.0.2 and code2seq==0.0.3 because these package versions have conflicting dependencies.

The conflict is caused by:
    code2seq 0.0.3 depends on black==20.8b1
    code2seq 0.0.2 depends on black==20.8b1
    code2seq 0.0.1 depends on black==20.8b1
    code2seq 0.0.0 depends on black==20.8b1

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies

I think there's a problem between Windows and black 20.8b1: as the log shows, the sdist builds with metadata version 0.0.0, so pip rejects it. Could you loosen the pinned package versions?

Thank you.

Failed to run minimal code example

I tried to run the minimal code example with this configuration file, but it failed with the following error:

Traceback (most recent call last):
  File "code2seq_example.py", line 34, in <module>
    train(__config)
  File "code2seq_example.py", line 24, in train
    trainer = Trainer(max_epochs=config.hyper_parameters.n_epochs)
  File "/home/junkaichen/miniconda3/envs/d2l/lib/python3.7/site-packages/omegaconf/dictconfig.py", line 354, in __getattr__
    key=key, value=None, cause=e, type_override=ConfigAttributeError
  File "/home/junkaichen/miniconda3/envs/d2l/lib/python3.7/site-packages/omegaconf/base.py", line 196, in _format_and_raise
    type_override=type_override,
  File "/home/junkaichen/miniconda3/envs/d2l/lib/python3.7/site-packages/omegaconf/_utils.py", line 821, in format_and_raise
    _raise(ex, cause)
  File "/home/junkaichen/miniconda3/envs/d2l/lib/python3.7/site-packages/omegaconf/_utils.py", line 719, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set end OC_CAUSE=1 for full backtrace
  File "/home/junkaichen/miniconda3/envs/d2l/lib/python3.7/site-packages/omegaconf/dictconfig.py", line 351, in __getattr__
    return self._get_impl(key=key, default_value=_DEFAULT_MARKER_)
  File "/home/junkaichen/miniconda3/envs/d2l/lib/python3.7/site-packages/omegaconf/dictconfig.py", line 438, in _get_impl
    node = self._get_node(key=key, throw_on_missing_key=True)
  File "/home/junkaichen/miniconda3/envs/d2l/lib/python3.7/site-packages/omegaconf/dictconfig.py", line 470, in _get_node
    raise ConfigKeyError(f"Missing key {key}")
omegaconf.errors.ConfigAttributeError: Missing key hyper_parameters
    full_key: hyper_parameters
    object_type=dict

What I did:

  1. Copy the code example to a .py file and this configuration file to a .yaml file.
  2. Download the data from the URL https://s3.eu-west-1.amazonaws.com/datasets.ml.labs.aws.intellij.net/java-paths-methods/java-small.tar.gz in the configuration file and extract it.
  3. Change the first line of the .yaml file (data_folder) to the location of the data.
  4. Run the script: python xxx.py xxx.yaml

I noticed that this error is related to the YAML config, but I don't know where exactly the problem lies. I want to run a demo of the code2seq model with this library. Thank you!
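
For what it's worth, the traceback shows the script reading config.hyper_parameters.n_epochs, while the Examples snippet above reads config.train.n_epochs, so the script and the YAML file likely disagree on the key layout. A minimal way to check which sections a config actually defines (illustrative, not from the repo):

from omegaconf import OmegaConf

config = OmegaConf.load("config.yaml")  # illustrative path
print(OmegaConf.to_yaml(config))        # dump the keys the file really defines
print("train" in config, "hyper_parameters" in config)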

Debug output with hyperparameters

The original code2seq implementation prints a summary of the hyperparameters before training begins:

Training batch size:			 512
Dataset path:				 java-small/java-small
Training file path:			 java-small/java-small.train.c2s
Validation path:			 java-small/java-small.val.c2s
Taking max contexts from each example:	 200
Random path sampling:			 True
Embedding size:				 128
Using BiLSTMs, each of size:		 128
Decoder size:				 320
Decoder layers:				 1
Max path lengths:			 9
Max subtokens in a token:		 5
Max target length:			 6
Embeddings dropout keep_prob:		 0.75
LSTM dropout keep_prob:			 0.5
============================================
Number of trainable params: 10950144

It would be very nice to do something similar here as well.
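
A hedged sketch of reproducing such a summary with the tools already used in this repository (OmegaConf for the config tree, plain PyTorch for the parameter count); the function name is illustrative:

from omegaconf import DictConfig, OmegaConf
from torch import nn

def print_training_summary(config: DictConfig, model: nn.Module) -> None:
    # OmegaConf.to_yaml renders the whole hyperparameter tree at once.
    print(OmegaConf.to_yaml(config))
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print("=" * 44)
    print(f"Number of trainable params: {n_params}")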
