
Dataset and code for "Label-Wise Document Pre-Training for Multi-Label Text Classification" (NLPCC 2020)

License: MIT License


Label-Wise Pre-Training (LW-PT)

This is the code for the NLPCC 2020 paper "Label-Wise Document Pre-Training for Multi-Label Text Classification".

Requirements

  • Ubuntu 16.04
  • Python >= 3.6.0
  • PyTorch >= 1.3.0

Reproducibility

  • --data and --outputs

To ensure reproducibility, we provide the preprocessed RMSC and AAPD datasets together with pretrained checkpoints of the LW-LSTM+PT+FT and HLW-LSTM+PT+FT models. Please download them from the link and decompress them into the root directory of this repository.

--data
    |--aapd
    	|--label_test
    	|--label_train
    	...
    |--rmsc
    	|--rmsc.data.test.json
    	|--rmsc.data.train.json
    	|--rmsc.data.valid.json
    aapd_word2vec.model
    aapd_word2vec.model.wv.vectors.npy
    aapd.meta.json
    aapd.pkl
    rmsc_word2vec.model
    rmsc_word2vec.model.wv.vectors.npy
    rmsc.meta.json
    rmsc.pkl
--outputs
    |--aapd
    |--rmsc

Note that data/aapd and data/rmsc contain the original datasets. For RMSC we provide our own split (i.e. RMSC-V2).
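After decompressing the archive, it may help to confirm that the files ended up where the scripts expect them. The following is a minimal sketch, not part of the repository: the helper name `missing_paths` is hypothetical, and the path list is copied from the layout above, skipping the entries elided by "...".

```python
from pathlib import Path

# Paths taken from the directory layout listed in this README.
# Entries hidden behind "..." in that listing are not checked.
EXPECTED = [
    "data/aapd/label_test",
    "data/aapd/label_train",
    "data/rmsc/rmsc.data.test.json",
    "data/rmsc/rmsc.data.train.json",
    "data/rmsc/rmsc.data.valid.json",
    "data/aapd_word2vec.model",
    "data/aapd_word2vec.model.wv.vectors.npy",
    "data/aapd.meta.json",
    "data/aapd.pkl",
    "data/rmsc_word2vec.model",
    "data/rmsc_word2vec.model.wv.vectors.npy",
    "data/rmsc.meta.json",
    "data/rmsc.pkl",
    "outputs/aapd",
    "outputs/rmsc",
]


def missing_paths(root="."):
    """Return the expected paths that do not exist under `root` (hypothetical helper)."""
    base = Path(root)
    return [p for p in EXPECTED if not (base / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing files or directories:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected files and directories are present.")
```

Run it from the repository root; an empty result means the archive was unpacked in the right place.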

  • Testing on AAPD
python classification.py -config=aapd.yaml -in=aapd -gpuid [GPU_ID] -test
  • Testing on RMSC
python classification.py -config=rmsc.yaml -in=rmsc -gpuid [GPU_ID] -test

Preprocessing

If you want to preprocess a dataset yourself, run the following command with the name of the dataset (RMSC or AAPD).

PYTHONHASHSEED=1 python preprocess.py -data=[RMSC/AAPD]

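Note that PYTHONHASHSEED is fixed because word2vec relies on Python's string hashing; a fixed seed keeps the trained embeddings reproducible across runs.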
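The effect of PYTHONHASHSEED can be seen without word2vec at all. The sketch below is illustrative only and not part of the repository: the helper `string_hash` is hypothetical. It starts a fresh interpreter with a given seed and reports the hash of a string. With the same seed the value is identical on every launch; without a fixed seed Python 3 randomizes string hashes per process.

```python
import os
import subprocess
import sys


def string_hash(seed, text="document"):
    """Return hash(text) as computed by a fresh interpreter started with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "print(hash({!r}))".format(text)],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    # Two separate interpreter launches with the same seed agree on the hash.
    print(string_hash(1) == string_hash(1))
```

This is why the seed is set on the command line in front of `python preprocess.py` rather than inside the script: the variable has to be in place before the interpreter starts.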

Pre-Train

Pre-train the LW-PT model.

python pretrain.py -config=[CONFIG_NAME] -out=[OUT_INFIX] -gpuid [GPU_ID] -train -test
  • CONFIG_NAME: aapd.yaml or rmsc.yaml
  • OUT_INFIX: infix of the outputs directory that will contain the logs and checkpoints

MLTC Task

Train the downstream model for the multi-label text classification (MLTC) task.

python classification.py -config=[CONFIG_NAME] -in=[IN_INFIX] -out=[OUT_INFIX] -gpuid [GPU_ID] -train -test
  • IN_INFIX: infix of the inputs directory that contains the pre-trained checkpoints
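The two stages can be chained from a small driver script. The sketch below only assembles the command lines documented above; the helper names `pretrain_cmd` and `classify_cmd` are hypothetical. Passing the pre-training OUT_INFIX as the classification IN_INFIX is an assumption based on the parameter descriptions, not something the repository states explicitly.

```python
import subprocess


def pretrain_cmd(config, out_infix, gpu_id):
    """Build the documented pretrain.py command as an argument list."""
    return ["python", "pretrain.py", "-config=" + config, "-out=" + out_infix,
            "-gpuid", str(gpu_id), "-train", "-test"]


def classify_cmd(config, in_infix, out_infix, gpu_id):
    """Build the documented classification.py command as an argument list."""
    return ["python", "classification.py", "-config=" + config, "-in=" + in_infix,
            "-out=" + out_infix, "-gpuid", str(gpu_id), "-train", "-test"]


def run_pipeline(config, pt_infix, ft_infix, gpu_id):
    """Pre-train, then train the MLTC model on the pre-trained checkpoints.

    Assumption: the infix written by pretrain.py is the one read by
    classification.py through -in.
    """
    subprocess.run(pretrain_cmd(config, pt_infix, gpu_id), check=True)
    subprocess.run(classify_cmd(config, pt_infix, ft_infix, gpu_id), check=True)


if __name__ == "__main__":
    # Print the commands instead of running them; the infix names are placeholders.
    print(" ".join(pretrain_cmd("aapd.yaml", "aapd_pt", 0)))
    print(" ".join(classify_cmd("aapd.yaml", "aapd_pt", "aapd_ft", 0)))
```

Call `run_pipeline` from the repository root to execute both stages; `check=True` stops the pipeline if pre-training fails.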

Others

  • Build static document representations to facilitate downstream tasks
python build_doc_rep.py -config=[CONFIG_NAME] -in=[IN_INFIX] -gpuid [GPU_ID]

Not used unless necessary.

  • Make the RMSC-V2 dataset: tests/make_rmsc.py
  • Visualize document embeddings: tests/visual_emb.py
  • Visualize per-label F1 scores: tests/visual_label_f1.py
  • Case study: tests/case_study.py

Reference

If you find our work useful, please cite the paper:

@inproceedings{liu2020label,
	title="Label-Wise Document Pre-Training for Multi-Label Text Classification",
	author="Han Liu and Caixia Yuan and Xiaojie Wang",
	booktitle="CCF International Conference on Natural Language Processing and Chinese Computing",
	year="2020"
}


lw-pt's Issues

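The download link has expired.

Sorry to bother you, but the download link has expired. Could you please resend the link?
Hello! I am a second-year graduate student in computer science. Thank you very much for open-sourcing the code, but the link for downloading the data no longer works. Could you post it again?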
