marlon-br / neuro-comma

This project forked from sviperm/neuro-comma

Punctuation restoration production-ready models for English, Spanish, Portuguese, German, Russian and French languages

License: MIT License

Dockerfile 0.74% Shell 1.81% Python 88.35% Jupyter Notebook 9.10%

neuro-comma's Introduction

Repository of trained punctuation restoration models for use with https://github.com/sviperm/neuro-comma and https://github.com/xashru/punctuation-restoration

Colab notebook to prepare data for any language: https://colab.research.google.com/drive/1-EUY-Do3vR_h3htyjjAMR-zVQyUsaH-t?usp=sharing
Set the target language in the LANG1 and LANG2 parameters. You also have to set the correct UTF-8 characters for your language: https://www.periodni.com/unicode_utf-8_encoding.html.

Training Datasets in neuro-comma format:

Dataset language Google Drive Link Zipped Size
Portuguese https://drive.google.com/file/d/1-0t7sskF3bKGeXiCOx4nL9R53-Z0f7ak/view?usp=sharing 358.2 MB
Spanish https://drive.google.com/file/d/1JHATFSu4amgz-cHcKXsSUyUs9T70kPa7/view?usp=sharing 476.7 MB
German https://drive.google.com/file/d/1Us6mMtPjOxUi1Mu_M1PHWTB9pxuwyod7/view?usp=sharing 721.9 MB
Russian https://drive.google.com/file/d/1-0MBsnGfId6AGxgPhnkO2gAxaDn1DWGQ/view?usp=sharing 584.7 MB
French https://drive.google.com/file/d/1--ocXMVscsIKghB9xzEUUeIVvJqY0qUX/view?usp=sharing 971.8 MB

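Neuro-comma (punctuation-restoration) format: each word is lowercased and written on its own line, followed by a label for the punctuation mark that comes after it (O when there is none). For example, the text

Oh, nun ja, dann wird sie es wohl gewesen sein. Wo sind die Karten geblieben?

turns into: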
oh COMMA
nun O
ja COMMA
dann O
wird O
sie O
es O
wohl O
gewesen O
sein PERIOD
wo O
sind O
die O
karten O
geblieben QUESTION
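
The conversion shown above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the Colab notebook: the function name `to_labeled_lines` and the `PUNCT_LABELS` table are hypothetical, and it assumes only the three labels that appear in the example (COMMA, PERIOD, QUESTION), with every other word labelled O.

```python
import re

# Assumed label set, taken from the example above; the real data
# preparation notebook may handle more punctuation marks.
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}


def to_labeled_lines(text):
    """Convert plain text into 'word LABEL' lines (hypothetical helper)."""
    lines = []
    for raw in text.split():
        # The label is decided by the last character of the raw token.
        label = PUNCT_LABELS.get(raw[-1], "O")
        # Drop all non-word characters and lowercase what remains.
        word = re.sub(r"[^\w]", "", raw).lower()
        if word:  # skip tokens that were pure punctuation
            lines.append(f"{word} {label}")
    return lines


example = "Oh, nun ja, dann wird sie es wohl gewesen sein. Wo sind die Karten geblieben?"
for line in to_labeled_lines(example):
    print(line)
```

Because `\w` matches Unicode letters in Python 3, accented and non-Latin characters are kept, which matters for the German, Russian and French datasets.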

Pretrained models:

Model language Base model Google Drive Link
Portuguese (quantized) xlm-roberta-large https://drive.google.com/file/d/174kc3436Vck9jBYbkfr5gwY6bn5wki14/view?usp=sharing
Spanish (quantized) xlm-roberta-large https://drive.google.com/file/d/1181DFuVoYIEiTqaAnNAVSQKe9n69xB49/view?usp=sharing
German (quantized) xlm-roberta-large https://drive.google.com/file/d/1Tw4ISUVMPqMXIwEU0IFy5ldSLa9z7Zjt/view?usp=sharing
Russian (quantized) xlm-roberta-large https://drive.google.com/file/d/1VgmF4wHbpftyWxZQrY7BJHOYdKYjZSBH/view?usp=sharing
French xlm-roberta-large https://drive.google.com/file/d/1wg6GeRi5EKE_60bni3PWsDQM_IdpXIn_/view?usp=sharing
English (from punctuation-restoration) roberta-large https://drive.google.com/file/d/17BPcnHVhpQlsOTC8LEayIFFJ7WkL00cr/view?usp=sharing

How it works

https://github.com/sviperm/neuro-comma
https://github.com/xashru/punctuation-restoration
