

License: Apache License 2.0


type4py's Introduction

Type4Py: Deep Similarity Learning-Based Type Inference for Python


This repository contains the implementation of Type4Py and instructions for reproducing the results of the paper.

Dataset

For Type4Py, we use the ManyTypes4Py dataset. You can download the latest version of the dataset here. Also, note that the dataset is already de-duplicated.

Code De-duplication

If you want to use your own dataset, it is essential to de-duplicate the dataset by using a tool like CD4Py.

Installation Guide

Requirements

Here are the recommended system requirements for training Type4Py on the MT4Py dataset:

  • Linux-based OS (Ubuntu 18.04 or newer)
  • Python 3.6 or newer
  • A high-end NVIDIA GPU (w/ at least 8GB of VRAM)
  • A CPU with 16 or more threads (w/ at least 64GB of RAM)

Quick Install

git clone https://github.com/saltudelft/type4py.git && cd type4py
pip install .

Usage Guide

Follow the steps below to train and evaluate the Type4Py model.

1. Extraction

NOTE: Skip this step if you're using the ManyTypes4Py dataset.

$ type4py extract --c $DATA_PATH --o $OUTPUT_DIR --d $DUP_FILES --w $CORES

Description:

  • $DATA_PATH: The path to the Python corpus or dataset.
  • $OUTPUT_DIR: The path to store processed projects.
  • $DUP_FILES: The path to the duplicate files, i.e., the *.jsonl.gz file produced by CD4Py. [Optional]
  • $CORES: Number of CPU cores to use for processing projects.

2. Preprocessing

$ type4py preprocess --o $OUTPUT_DIR --l $LIMIT

Description:

  • $OUTPUT_DIR: The path that was used in the first step to store processed projects. For the MT4Py dataset, use the directory in which the dataset is extracted.
  • $LIMIT: The number of projects to be processed. [Optional]

3. Vectorizing

$ type4py vectorize --o $OUTPUT_DIR

Description:

  • $OUTPUT_DIR: The path that was used in the previous step to store processed projects.

4. Learning

$ type4py learn --o $OUTPUT_DIR --c --p $PARAM_FILE

Description:

  • $OUTPUT_DIR: The path that was used in the previous step to store processed projects.

  • --c: Trains the complete model. Use type4py learn -h to see other configurations.

  • --p $PARAM_FILE: The path to user-provided hyper-parameters for the model. See this file as an example. [Optional]

5. Testing

$ type4py predict --o $OUTPUT_DIR --c

Description:

  • $OUTPUT_DIR: The path that was used in the first step to store processed projects.
  • --c: Predicts using the complete model. Use type4py predict -h to see other configurations.

6. Evaluating

$ type4py eval --o $OUTPUT_DIR --t c --tp 10

Description:

  • $OUTPUT_DIR: The path that was used in the first step to store processed projects.
  • --t: Evaluates the model considering different prediction tasks. E.g., --t c considers all prediction tasks, i.e., parameters, return, and variables. [Default: c]
  • --tp 10: Considers Top-10 predictions for evaluation. For this argument, you can choose an integer between 1 and 10. [Default: 10]

Use type4py eval -h to see other options.
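For intuition, the Top-n criterion used above can be sketched as follows. This is an illustrative metric computation, not the repository's own evaluation code; the function name and data are hypothetical:

```python
# Illustrative sketch (not type4py's evaluation code): Top-n counts a
# prediction as correct if the ground-truth type appears among the
# model's first n ranked candidates.
def top_n_accuracy(ranked_predictions, ground_truth, n=10):
    hits = sum(true in preds[:n]
               for preds, true in zip(ranked_predictions, ground_truth))
    return hits / len(ground_truth)

# Hypothetical ranked candidates for three prediction slots:
preds = [["int", "str"], ["List[int]", "list"], ["bool", "int"]]
truth = ["str", "list", "int"]

top_n_accuracy(preds, truth, n=1)  # → 0.0 (no top candidate matches)
top_n_accuracy(preds, truth, n=2)  # → 1.0 (all truths are in the top 2)
```

Larger n makes the task easier, which is why the Top-10 numbers in the paper are higher than the Top-1 numbers.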

Reduce

To reduce the dimension of the created type clusters in step 5, run the following command:

Note: The reduced version of type clusters causes a slight performance loss in type prediction.

$ type4py reduce --o $OUTPUT_DIR --d $DIMENSION

Description:

  • $OUTPUT_DIR: The path that was used in the first step to store processed projects.
  • $DIMENSION: Reduces the dimension of type clusters to the specified value. [Default: 256]

Converting Type4Py to ONNX

To convert the pre-trained Type4Py model to the ONNX format, use the following command:

$ type4py to_onnx --o $OUTPUT_DIR

Description:

  • $OUTPUT_DIR: The path that was used in the usage section to store processed projects and the model.

VSCode Extension


Type4Py's VSCode extension provides ML-based type auto-completion for Python files in VSCode. It can be installed from the VS Marketplace here.

Using Local Pre-trained Model

Type4Py's pre-trained model can be queried locally by using the provided Docker images. See here for usage info.

Type4Py Server


The Type4Py server is deployed on our infrastructure; it exposes a public API and powers the VSCode extension. If you would like to deploy the Type4Py server on your own machine, you can adapt the server code here. Feel free to create an issue if you have questions about deployment, using the pre-trained Type4Py model, or training your own model.

Citing Type4Py

@inproceedings{mir2022type4py,
  title={Type4Py: practical deep similarity learning-based type inference for python},
  author={Mir, Amir M and Lato{\v{s}}kinas, Evaldas and Proksch, Sebastian and Gousios, Georgios},
  booktitle={Proceedings of the 44th International Conference on Software Engineering},
  pages={2241--2252},
  year={2022}
}

type4py's People

Contributors: gousiosg, mir-am, p-fruck

type4py's Issues

Cannot preprocess ManyTypes4Py dataset

Hey there,

I am currently trying to get this project up and running and was following the instructions to train the model using the ManyTypes4Py dataset. Unfortunately, the preprocess command just skips the dataset (or rather, does not find any relevant information). I solved this issue by removing the files all_fns.csv and all_vars.csv and symlinking processed_projects_complete to processed_projects.
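The workaround described above can be sketched as follows (file names are from the report; the dataset directory is a placeholder for wherever the ManyTypes4Py dataset was extracted):

```shell
# Sketch of the reported workaround; OUTPUT_DIR is a placeholder path.
OUTPUT_DIR=./mt4py_dataset
mkdir -p "$OUTPUT_DIR/processed_projects_complete"  # present in a real dataset
rm -f "$OUTPUT_DIR/all_fns.csv" "$OUTPUT_DIR/all_vars.csv"
ln -sfn processed_projects_complete "$OUTPUT_DIR/processed_projects"
```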

Did I miss anything during the setup? Are those steps expected and should be added to the documentation?

Integrate with pyre incremental and adapt the TypeWriter search strategy

It would be interesting to see how well the TypeWriter algorithm (https://software-lab.org/publications/TypeWriter_arXiv_1912.03768.pdf) for searching type annotation suggestions works against type4py. We might get dramatically better results for two reasons:

  • type4py's ML model seems to perform quite a bit better
  • today's pyre incremental is orders of magnitude faster than pyre was when the TypeWriter paper was written, so we may be able to try many more combinations and get correspondingly better results

At one point we'd considered hacking this very quickly as an internal project in my company, but we ran out of time. I think it would be better done open-source anyway because then

  • it would be easier to try out against external projects
  • we could publish our results with code if they are interesting enough to be worth a paper
  • the entire OSS community could benefit

I'm unsure if I can find time to prioritize this in the next 6 months at work but it's a little more likely if I treat it as a side project, which would also open the door to an informal weekend hackathon as a way to kick it off :)

I could do this in a separate repository or inside of type4py. What do you think @mir-am ? And does this sound interesting to you?

JSON output file not JSON conformant

The JSON output file is not JSON conformant in two aspects:

  1. Single quotes (') are used instead of double quotes (")
  2. Some words such as None, True or False are not wrapped in any quotes at all

This may affect some simpler JSON parsers; more robust JSON parsers can handle these minor errors just fine.

'error': None
# should be
"error": "None"
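Output like this is a Python-literal repr rather than JSON, so one possible client-side workaround is to round-trip it through the standard library (a sketch, not a fix in type4py itself; the function name is hypothetical):

```python
import ast
import json

def pythonic_to_json(text: str) -> str:
    """Convert Python-literal output (single quotes, None/True/False)
    into valid JSON by parsing it safely and re-serialising."""
    return json.dumps(ast.literal_eval(text))

raw = "{'error': None, 'ok': True}"
print(pythonic_to_json(raw))  # {"error": null, "ok": true}
```

Note that this maps Python's None to JSON null (rather than the string "None"), which is the usual JSON convention.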

Crash when trying to infer single file with freshly trained model using ManyTypes4Py

Hello, thank you for creating and providing this great project! I plan to use this project for my bachelor thesis. Therefore, I am mainly interested in the inference functionality provided with infer.py on branch server (branch infer seems to be outdated).
I am aware of the VS Code extension and the public JSON API. I, however, prefer to use this project locally.

Since infer.py takes a pre-trained model as a program argument, I followed all the steps in the README to train such a model.
Unfortunately, the script crashes with the following message (excerpt):

onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Got invalid dimensions for input: tok for the following indices
 index: 0 Got: 7 Expected: 1
 Please fix either the inputs or the model.

Below you can find a link to a Google Colab notebook with all the steps from start (downloading the ManyTypes4Py dataset, pip-installing type4py, preprocessing) to finish (training a model, trying to infer the types of a single file) and the corresponding output from when I ran it the last time (including the full error backtrace on the bottom):

https://colab.research.google.com/drive/1kRIffMlgGCeW55wXelksGrXfSd0WjhKQ?usp=sharing

It should be relatively self-explanatory. Evidently, I use a fork of this project and not the project itself. The differences are minor though: In learn.py, I just uncommented the .to(DEVICE)-calls (c42144d) as otherwise it would lead to a crash in the notebook (vectors are on different devices). The remaining changes don't affect Python files and are not relevant to this issue.
Further, I am using venv, although I doubt this has any negative influence on the execution of this project.


My question is, how can I successfully use infer.py? How can I obtain a proper compatible model for it?
Are any of those steps in the linked notebook incorrect?

Return fixed amount of type predictions

I experimented with the type prediction (http://localhost:5001/api/predict?tc=0) using the provided docker image.
I noticed that depending on the analysed source code, I get different amounts of type predictions per parameter/return/variable type.
Is it possible to retrieve a fixed number of predicted types?
For example, I would like to retrieve the Top-10 type predictions for each parameter and return type.

Best regards
Florian
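Until the API supports this directly, the response can be normalised client-side. A minimal sketch, assuming each slot yields a ranked list of (type, score) pairs; the helper name and placeholder are hypothetical:

```python
# Hypothetical client-side helper: normalise a per-slot prediction list
# to a fixed length k, padding with a placeholder when the server
# returns fewer candidates than requested.
def fixed_top_k(predictions, k=10, placeholder=("Any", 0.0)):
    preds = list(predictions)[:k]
    preds += [placeholder] * (k - len(preds))
    return preds

fixed_top_k([("int", 0.9), ("str", 0.05)], k=4)
# → [("int", 0.9), ("str", 0.05), ("Any", 0.0), ("Any", 0.0)]
```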

Error in variable initialisation

When using the preprocess command with only the -o argument, the code crashes with the following error:

UnboundLocalError: local variable 'train_files_vars' referenced before assignment

This is because in the following extract

if all(processed_proj_fns['set'].isin(['train', 'valid', 'test'])) and \
   all(processed_proj_vars['set'].isin(['train', 'valid', 'test'])):
    logger.info("Found the sets split in the input dataset")
    train_files = processed_proj_fns['file'][processed_proj_fns['set'] == 'train']
    valid_files = processed_proj_fns['file'][processed_proj_fns['set'] == 'valid']
    test_files = processed_proj_fns['file'][processed_proj_fns['set'] == 'test']
    train_files_vars = processed_proj_vars['file'][processed_proj_vars['set'] == 'train']
    valid_files_vars = processed_proj_vars['file'][processed_proj_vars['set'] == 'valid']
    test_files_vars = processed_proj_vars['file'][processed_proj_vars['set'] == 'test']
else:
    logger.info("Splitting sets randomly")
    train_files, test_files = train_test_split(
        pd.DataFrame(processed_proj_fns['file'].unique(), columns=['file']),
        test_size=0.2)
    train_files, valid_files = train_test_split(
        pd.DataFrame(processed_proj_fns[processed_proj_fns['file'].isin(
            train_files.to_numpy().flatten())]['file'].unique(), columns=['file']),
        test_size=0.1)

the train_files_vars variable (along with its valid/test counterparts) is only initialised in the if branch.
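One possible fix (a sketch, not the project's actual patch) is to also initialise the *_vars splits in the else branch by reusing the file partition already computed for functions. The toy DataFrames below are hypothetical stand-ins for the real ones:

```python
import pandas as pd

# Hypothetical stand-in for processed_proj_vars when the dataset has no
# pre-defined train/valid/test split (the 'set' column is empty):
processed_proj_vars = pd.DataFrame({
    'file': ['a.py', 'b.py', 'c.py'],
    'set': [None, None, None],
})

# Suppose the functions DataFrame was already split into these files:
train_files = pd.Series(['a.py'])
valid_files = pd.Series(['b.py'])
test_files = pd.Series(['c.py'])

# Derive the variables splits from the same file partition, so the
# *_vars names are defined on both branches:
train_files_vars = processed_proj_vars['file'][processed_proj_vars['file'].isin(train_files)]
valid_files_vars = processed_proj_vars['file'][processed_proj_vars['file'].isin(valid_files)]
test_files_vars = processed_proj_vars['file'][processed_proj_vars['file'].isin(test_files)]
```

Splitting by file membership also keeps functions and variables from the same file in the same set, which matters for avoiding train/test leakage.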
