SpanPSP

This repository contains the code accompanying the paper "A Character-Level Span-Based Model for Mandarin Prosodic Structure Prediction", submitted to ICASSP 2022.

Environment

  • Python 3.7 or higher.
  • PyTorch 1.6.0, or any compatible version.
  • NLTK 3.2, torch-struct 0.4, transformers 4.3.0, or compatible.
  • pytokenizations 0.7.2 or compatible.
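
A quick way to see which of the listed packages are installed is sketched below. This is a hypothetical helper, not part of the repository. It assumes the PyPI distribution names torch, nltk, torch-struct, transformers and pytokenizations, and it relies on importlib.metadata, which needs Python 3.8 or later.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions as listed in the Environment section.
# The distribution names used as keys are assumptions.
REQUIRED = {
    "torch": "1.6.0",
    "nltk": "3.2",
    "torch-struct": "0.4",
    "transformers": "4.3.0",
    "pytokenizations": "0.7.2",
}


def installed_versions(names):
    """Return {name: installed version string, or None if not installed}."""
    report = {}
    for name in names:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


report = installed_versions(REQUIRED)
for name, listed in REQUIRED.items():
    print(f"{name}: listed {listed}, installed {report[name] or 'missing'}")
```

The script only reports versions. Whether a newer version counts as "compatible" still has to be checked by running the code.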

Repository structure

SpanPSP
├──bert-base-chinese
|   ├──config.json
|   ├──pytorch_model.bin
|   └──vocab.txt
├──data
|   ├──train
|   |   ├──raw_data
|   |   |   ├──raw_train.txt
|   |   |   ├──raw_validate.txt
|   |   |   └──raw_test.txt
|   |   └──tree_data
|   |       ├──tree_train.txt
|   |       ├──tree_validate.txt
|   |       └──tree_test.txt
|   └──inference
|       ├──raw_data
|       |   └──raw_data.txt
|       └──tree_data
|           └──tree_data.txt
├──models
|   ├──pretrained_model
|   |   └──pretrained_SpanPSP.pt
|   └──yours
├──src
|   ├──benepar
|   |   └── ...
|   ├──count_fscore.py
|   ├──evaluate.py
|   ├──export.py
|   ├──inference_seq2tree.py
|   ├──learning_rate.py
|   ├──main.py
|   ├──seq_with_label.py
|   ├──train_seq2tree.py
|   ├──transliterate.py
|   └──treebank.py
└──README.md

Training and testing with your own dataset

Data preprocessing

First, prepare your own dataset in the following format and split it into training, validation and test sets named raw_train.txt, raw_validate.txt and raw_test.txt, respectively. Place them as shown in the repository structure above.

猴子#2用#1尾巴#2荡秋千#3。
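
To make the labeled format concrete, here is a minimal sketch that splits such a line into text chunks and the marker that follows each chunk. The function names are hypothetical and not part of the repository. The README does not define the markers, so reading #1, #2 and #3 as boundary levels attached to the preceding text is an assumption.

```python
import re

# Assumption: a marker is "#" followed by a single digit.
MARKER = re.compile(r"#(\d)")


def parse_labeled(line):
    """Split a labeled sentence into (text, level) pairs.

    level is the integer after '#', or None for trailing text
    that is not followed by a marker (e.g. final punctuation).
    """
    pairs = []
    pos = 0
    for match in MARKER.finditer(line):
        pairs.append((line[pos:match.start()], int(match.group(1))))
        pos = match.end()
    if pos < len(line):
        pairs.append((line[pos:], None))
    return pairs


def strip_labels(line):
    """Remove all markers, leaving the plain sentence."""
    return MARKER.sub("", line)


example = "猴子#2用#1尾巴#2荡秋千#3。"
print(parse_labeled(example))
# [('猴子', 2), ('用', 1), ('尾巴', 2), ('荡秋千', 3), ('。', None)]
print(strip_labels(example))
# 猴子用尾巴荡秋千。
```

A check like this can catch malformed lines (for example a missing digit after `#`) before the data is converted.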

Then convert each of the three files from sequence format to tree format with the following command, editing the file paths in the script for each file in turn. This produces tree_train.txt, tree_validate.txt and tree_test.txt.

$ python src/train_seq2tree.py

Training

Train your model using:

$ python src/main.py  train  --train-path [your_training_data_path]  --dev-path [your_dev_data_path]  --model-path-base [your_saving_model_path] 

Test

Test your model using:

$ python src/main.py  test  --model-path [your_trained_model_path]  --test-path [your_test_data_path]
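
When training and testing are run repeatedly, the two commands above can be assembled from Python. This is a sketch using only the subcommands and flags shown in this README. The helper names and the model paths under models/yours are hypothetical, and the data paths follow the repository structure above.

```python
import subprocess


def build_train_command(train_path, dev_path, model_path_base):
    """Argument list for `src/main.py train`."""
    return [
        "python", "src/main.py", "train",
        "--train-path", train_path,
        "--dev-path", dev_path,
        "--model-path-base", model_path_base,
    ]


def build_test_command(model_path, test_path):
    """Argument list for `src/main.py test`."""
    return [
        "python", "src/main.py", "test",
        "--model-path", model_path,
        "--test-path", test_path,
    ]


def run(command):
    """Run a command from the repository root; raises on a non-zero exit."""
    subprocess.run(command, check=True)


train_cmd = build_train_command(
    "data/train/tree_data/tree_train.txt",
    "data/train/tree_data/tree_validate.txt",
    "models/yours/my_model",          # hypothetical save path
)
test_cmd = build_test_command(
    "models/yours/my_model.pt",       # hypothetical file name of the saved model
    "data/train/tree_data/tree_test.txt",
)
print(" ".join(train_cmd))
print(" ".join(test_cmd))
# Call run(train_cmd) and then run(test_cmd) to execute them.
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.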

Using the pretrained model to automatically label the prosodic structure of text data

Data preprocessing

First, prepare your own dataset in the following format and place it as shown in the repository structure.

猴子用尾巴荡秋千。
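
A small sketch for producing such a file from a list of sentences follows. It assumes one sentence per line in UTF-8, and it strips any `#` plus digit markers so that previously labeled text can be reused as input. The function name is hypothetical, and the demo writes to a temporary directory rather than to data/inference/raw_data/raw_data.txt.

```python
import re
import tempfile
from pathlib import Path

MARKER = re.compile(r"#\d")  # assumption: markers look like #1, #2, #3


def write_raw_sentences(sentences, out_path):
    """Write unlabeled sentences, one per line; return how many were written."""
    cleaned = []
    for sentence in sentences:
        text = MARKER.sub("", sentence).strip()
        if text:  # skip blank entries
            cleaned.append(text)
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text("".join(line + "\n" for line in cleaned), encoding="utf-8")
    return len(cleaned)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "raw_data" / "raw_data.txt"
    count = write_raw_sentences(
        ["猴子#2用#1尾巴#2荡秋千#3。", "   ", "猴子用尾巴荡秋千。"],
        target,
    )
    written = target.read_text(encoding="utf-8")

print(count)
print(written, end="")
```

Text that legitimately contains `#` followed by a digit would be altered by the marker removal, so that step should be dropped for such data.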

Then use the following command to convert the dataset from sequence format to tree format:

$ python src/inference_seq2tree.py

Download the pretrained model

The pretrained model will be released soon.

Automatic labeling

$ python src/main.py  auto_labels  --model-path [your_pretrained_model_path]  --test-path [your_test_data_path]  --output-path [your_output_data_path]
