
DescribeML is a Visual Studio Code language plug-in to describe machine-learning datasets in a structured format. Build better data by describing the composition, provenance, and social concerns of your dataset.

License: MIT License


describeml's Introduction

DescribeML

DescribeML is a VSCode language plugin to describe machine-learning datasets.

Precisely describe your data's provenance, composition, and social concerns in a structured format.

Make it easy for others to reproduce your experiments when you cannot share your data.

Check out the quick video presentation of the tool and the tutorial presented at the MODELS '22 conference.

Installation

Via marketplace

The easiest way to install the plugin is through the Visual Studio Code Marketplace. Just type "describeML" in the Extensions tab, and that's it!

Manually

Alternatively, you can install it manually using the packaged release of the plugin, which is located at the root of this repository.

The file is DescribeML-1.2.1.vsix

Open your terminal (or the integrated terminal in VSCode) and run:


git clone https://github.com/SOM-Research/DescribeML.git datasets
cd datasets 
code --install-extension DescribeML-1.2.1.vsix

Troubleshooting: If you cannot see syntax highlighting in the example files (e.g., Melanoma.descml), reload the VSCode editor and run the code --install-extension command again.

Great! That's it.

Getting Started

  1. The first step is to create a .descml file

  2. The easiest way to start using the tool is the data preloader service. Click its button, located at the top left of your editor.

  3. Select your dataset file (.csv), and the tool will generate a draft of your description file.

  4. For help, consult the Language Reference Guide and follow the examples in the examples/evaluation folder to get a sense of the tool's possibilities. Take a look at the Melanoma.descml file, for example.

  5. During the documentation process, pressing Ctrl + Space (or the equivalent shortcut on other operating systems) gives you auto-completion help. In addition, the parts marked with dots underneath give you hints to complete the documentation, and the outline on the right shows you the document structure.

Autocompletion feature

  6. Once you are happy with your documentation, you can generate HTML documentation by clicking the generator button next to the preloader service.
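
The README does not describe how the preloader builds its draft from a .csv file. As a rough illustration of the kind of per-column profiling such a step could perform, here is a minimal TypeScript sketch. Everything in it is an assumption made for illustration: the summarizeCsv function, the ColumnSummary shape, and the sample columns are hypothetical and are not taken from DescribeML's source code.

```typescript
// Hypothetical sketch: NOT DescribeML's actual preloader implementation.
type ColumnSummary = {
  name: string;
  kind: "numeric" | "categorical";
  missing: number;  // number of empty cells
  distinct: number; // number of distinct non-empty values
  mean?: number;    // only set for numeric columns
};

// Naive CSV handling: assumes a header row, comma separators,
// and no quoted fields that contain commas or line breaks.
function summarizeCsv(text: string): ColumnSummary[] {
  const rows = text
    .trim()
    .split(/\r?\n/)
    .map((line) => line.split(",").map((cell) => cell.trim()));
  const [header, ...body] = rows;

  return header.map((name, col) => {
    const cells = body.map((row) => row[col] ?? "");
    const present = cells.filter((cell) => cell !== "");
    const numbers = present.map(Number);
    // A column counts as numeric only if every non-empty cell parses as a number.
    const numeric = present.length > 0 && numbers.every((n) => !Number.isNaN(n));

    const summary: ColumnSummary = {
      name,
      kind: numeric ? "numeric" : "categorical",
      missing: cells.length - present.length,
      distinct: new Set(present).size,
    };
    if (numeric) {
      summary.mean = numbers.reduce((a, b) => a + b, 0) / numbers.length;
    }
    return summary;
  });
}

// Made-up sample data, for illustration only.
const sample = "age,sex,site\n45,female,torso\n60,male,\n51,female,torso";
console.log(summarizeCsv(sample));
```

A summary like this could then be rendered into a draft description for manual review, which is why the generated file should be treated as a starting point rather than finished documentation.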

For more information, check out the quick presentation video and the tutorial presented at the MODELS '22 conference.

Contributing

This project is being developed as part of a research line at the SOM Research Lab, but we are open to contributions from the community. If you are interested in contributing to this project, please first read the CONTRIBUTING.md guidelines file.

Repository structure

The following list and tree show the repository's relevant sections:

  • The documentation and examples folders contain the mentioned examples and the language reference guide.
  • The out folder contains the executable plugin in JS. You may not want to dive into it, as it is generated by the TypeScript compiler.
  • The src folder contains the project's source code.
    • The cli folder contains the generated grammar and AST from Langium. You may not want to dive into it, as it is a generated asset.
    • The generator-service folder contains all the code of the generation service. It is a good place to start if you want to improve the tool's HTML generation.
    • The uploader-service folder contains all the code of the uploader service. It is a good place to contribute new statistical metrics or ML techniques for dataset reverse engineering.
    • The language-server folder contains all the language features and the grammar declaration. If you want to improve the grammar or some of the features the plugin offers, this is the place to start.
      • The dataset-description.langium file contains the main grammar declaration. This grammar is developed using the Langium Grammar Language. Please refer to the linked documentation for more insight into how to develop the grammar.
├── documentation
│   └── language-reference-guide.md         // The language reference guide
├── examples
│   └── evaluation
│       ├── Gender.descml                   // Gender dataset example
│       ├── Melanoma.descml                 // Melanoma dataset example
│       └── Polarity.descml                 // Polarity dataset example
├── out                                     // The generated JS from the src folder
└── src                                     // The source code of the project
    ├── cli                                 // Langium framework utils
    ├── generator-service                   // The tool's HTML generator service
    ├── uploader-service                    // The tool's HTML uploader service
    └── language-server                     // The tool's language features
        ├── generated                           // Generated grammar and AST from Langium
        ├── dataset-description-index.ts        // Custom index feature
        ├── dataset-description-module.ts       // Declaration of the custom language features
        ├── dataset-description-validator.ts    // Custom language features
        └── dataset-description.langium         // The main grammar file of the tool
  

Debugging the extension

This repo comes with a ready-made debug configuration. Go to the Debug view in VSCode and launch the Extension config. Before launching, check that port 6009 is free.
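
One way to check the port before launching is to try to bind it briefly from Node.js. The sketch below uses only Node's built-in net module. The isPortFree helper and its default host are assumptions made for illustration (they are not part of DescribeML), and the check assumes the debugger listens on the local loopback address.

```typescript
import * as net from "node:net";

// Resolves to true if nothing is listening on the given local TCP port.
// Hypothetical helper for illustration; not part of DescribeML.
function isPortFree(port: number, host = "127.0.0.1"): Promise<boolean> {
  return new Promise((resolve) => {
    const probe = net.createServer();
    // Binding fails (for example with EADDRINUSE) when the port is taken.
    probe.once("error", () => resolve(false));
    // Binding succeeded: release the port again before reporting it as free.
    probe.once("listening", () => probe.close(() => resolve(true)));
    probe.listen(port, host);
  });
}

isPortFree(6009).then((free) => {
  console.log(free ? "Port 6009 is free." : "Port 6009 is already in use.");
});
```

If the port is reported as in use, closing the process that holds it (often a previous debug session) is usually enough.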

For more information about how the framework works and how the language can be extended, please refer to https://github.com/langium/langium or the VSCode extension API documentation https://code.visualstudio.com/api

Research background and citation

DescribeML is part of an ongoing research project to improve dataset documentation for machine learning. The core of our proposal is a domain-specific language published in the Journal of Computer Languages that allows data creators to describe relevant aspects of their data for the machine learning field and beyond. The Critical Dataset Studios of the Knowing Machines project have compiled an excellent list of current documentation practices.

To cite the domain-specific language:

Giner-Miguelez, J., Gómez, A., & Cabot, J. (2023). A domain-specific language for describing machine learning datasets. Journal of Computer Languages, 76, 101209.

The tool has been presented at the ACM/IEEE 25th International Conference on Model Driven Engineering Languages and Systems and published as an Original Software Publication in the Science of Computer Programming journal.

To cite the tool:

Giner-Miguelez, J., Gómez, A., & Cabot, J. (2023). DescribeML: A dataset description tool for machine learning. Science of Computer Programming, 2023, 103030, ISSN 0167-6423, https://doi.org/10.1016/j.scico.2023.103030.

Code of Conduct

At SOM Research Lab we are dedicated to creating and maintaining welcoming, inclusive, safe, and harassment-free development spaces. Anyone participating will be subject to and agrees to sign on to our Code of Conduct.

License

Shield: License: MIT

The source code for the site is licensed under the MIT license, which you can find in the MIT-LICENSE file.

All graphical assets are licensed under the Creative Commons Attribution 3.0 Unported License.


describeml's Issues

Indentation defines the scope

💣 Describe the issue or problem you detected

To define the scope more precisely, and therefore provide more accurate suggestions during autocompletion, it would be useful for the indentation level to define the scope, as Python does.

📋 Provide the solution you'd like

In this issue, there is a proposed path to achieve this behavior.

eclipse-langium/langium#782
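
For readers unfamiliar with the idea, the following TypeScript sketch shows the general technique Python-style languages use: leading whitespace is converted into explicit INDENT and DEDENT tokens, which a parser can then treat like opening and closing braces. This is a generic illustration under stated assumptions, not the approach proposed in the linked Langium issue and not code from DescribeML. The tokenizeIndentation function and the keywords in the sample input are hypothetical.

```typescript
type Token =
  | { kind: "INDENT" | "DEDENT" }
  | { kind: "LINE"; text: string };

// Turns leading-space indentation into explicit INDENT/DEDENT tokens.
// Assumes indentation uses spaces only (no tabs).
function tokenizeIndentation(source: string): Token[] {
  const tokens: Token[] = [];
  const stack: number[] = [0]; // indentation widths of the currently open scopes

  for (const raw of source.split(/\r?\n/)) {
    if (raw.trim() === "") continue; // blank lines do not affect scope
    const width = raw.search(/\S/);  // index of the first non-space character

    if (width > stack[stack.length - 1]) {
      stack.push(width);
      tokens.push({ kind: "INDENT" });
    }
    while (width < stack[stack.length - 1]) {
      stack.pop();
      tokens.push({ kind: "DEDENT" });
    }
    // After dedenting, the line must align with an enclosing scope.
    if (width !== stack[stack.length - 1]) {
      throw new Error(`Inconsistent indentation: "${raw}"`);
    }
    tokens.push({ kind: "LINE", text: raw.trim() });
  }

  // Close every scope that is still open at the end of the input.
  while (stack.length > 1) {
    stack.pop();
    tokens.push({ kind: "DEDENT" });
  }
  return tokens;
}

// Illustrative input; the keywords are placeholders.
const example = ["Composition:", "  Instances:", "    Attribute: age", "  Statistics:"].join("\n");
console.log(tokenizeIndentation(example).map((t) => (t.kind === "LINE" ? t.text : t.kind)));
```

With scopes made explicit this way, completion logic can look at the innermost open scope to decide which keywords are valid at the cursor position.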
