This project forked from randolphvi/hierarchical-multi-label-text-classification

Code for the CIKM'19 paper "Hierarchical Multi-label Text Classification: An Attention-based Recurrent Network Approach"

License: Apache License 2.0

poi_fenlei's Introduction

Hierarchical Multi-Label Text Classification

This repository is my research project, which has been accepted by CIKM'19. The paper is already published.

The main objective of the project is to solve the hierarchical multi-label text classification (HMTC) problem. Unlike flat multi-label text classification, HMTC assigns each instance (object) to multiple categories that are organized in a hierarchy. It is a fundamental but challenging task in numerous applications.

Requirements

  • Python 3.6
  • Tensorflow 1.15.0
  • Tensorboard 1.15.0
  • Sklearn 0.19.1
  • Numpy 1.16.2
  • Gensim 3.8.3
  • Tqdm 4.49.0

Introduction

Many real-world applications organize data in a hierarchical structure, where classes are specialized into subclasses or grouped into superclasses. For example, an electronic document (e.g. a web page, digital library entry, patent or e-mail) is associated with multiple categories, and all these categories are stored hierarchically in a tree or Directed Acyclic Graph (DAG).

The hierarchy provides an elegant way to show the characteristics of the data and offers a multi-dimensional perspective for tackling the classification problem.

The Figure shows an example of predefined labels in hierarchical multi-label classification of documents in patent texts.

  • Documents are shown as colored rectangles, labels as rounded rectangles.
  • Circles in the rounded rectangles indicate that the corresponding document has been assigned the label.
  • Arrows indicate a hierarchical structure between labels.

Project

The project structure is below:

.
├── HARNN
│   ├── train.py
│   ├── layers.py
│   ├── ham.py
│   ├── test.py
│   └── visualization.py
├── utils
│   ├── checkmate.py
│   ├── param_parser.py
│   └── data_helpers.py
├── data
│   ├── word2vec_100.model.* [Need Download]
│   ├── Test_sample.json
│   ├── Train_sample.json
│   └── Validation_sample.json
├── LICENSE
├── README.md
└── requirements.txt

Data

You can download the Patent Dataset used in the paper. The Word2vec model file (dim=100) has also been uploaded. Make sure both are placed under the /data folder.

⚠️ The Education Dataset may be subject to copyright protection under Chinese law, so detailed information about it is not provided.

Text Segment

  1. Use the nltk package if you are working with English text data.

  2. Use the jieba package if you are working with Chinese text data.

Data Format

See the sample files in the /data folder for the data format. For example:

{"id": "3930316", 
"title": ["sighting", "firearm"], 
"abstract": ["rear", "sight", "firearm", "ha", "peephole", "device", "formed", "hollow", "tube", "end", ...], 
"section": [5], "subsection": [104], "group": [512], "subgroup": [6535], 
"labels": [5, 113, 649, 7333]}
  • id: the record identifier.
  • title & abstract: the segmented words (after removing stopwords).
  • section / subsection / group / subgroup: the category index at the first / second / third / fourth level.
  • labels: the category indices of all levels combined, with an index offset added per level (explained in "How to construct the data?").
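
As a minimal sketch of reading this format, the snippet below parses one record and checks that the fields described above are present. It assumes each record is a single JSON object on one line; parse_record and REQUIRED_FIELDS are illustrative names and not part of the repository, and the abstract in the sample string is shortened to the first few tokens.

```python
import json

# Field names taken from the sample record; the helper itself is hypothetical.
REQUIRED_FIELDS = ("id", "title", "abstract", "section",
                   "subsection", "group", "subgroup", "labels")


def parse_record(line):
    """Parse one JSON record and verify that every expected field exists."""
    record = json.loads(line)
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError("record %s is missing fields: %s"
                         % (record.get("id"), ", ".join(missing)))
    return record


# Sample record with a shortened abstract (assumption: one JSON object per line).
sample = ('{"id": "3930316", "title": ["sighting", "firearm"], '
          '"abstract": ["rear", "sight", "firearm"], '
          '"section": [5], "subsection": [104], "group": [512], '
          '"subgroup": [6535], "labels": [5, 113, 649, 7333]}')
record = parse_record(sample)
print(record["title"], record["labels"])
```

A check like this can catch malformed records early, before they reach the preprocessing code.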

How to construct the data?

The following steps use the Patent Dataset sample to explain how to construct the label index. For the Patent Dataset, the number of classes at each level is: [9, 128, 661, 8364].

Step 1: For the first level, Patent dataset has 9 classes. You should index these 9 classes first, like:

{"Chemistry": 0, "Physics": 1, "Electricity": 2, "XXX": 3, ..., "XXX": 8}

Step 2: Next, you index the next level (total 128 classes), like:

{"Inorganic Chemistry": 0, "Organic Chemistry": 1, "Nuclear Physics": 2, "XXX": 3, ..., "XXX": 127}

Step 3: Then, you index the third level (total 661 classes), like:

{"Steroids": 0, "Peptides": 1, "Heterocyclic Compounds": 2, ..., "XXX": 660}

Step 4: If you have a fourth or deeper level, index it in the same way.

Step 5: Now suppose you have one record (id: 3930316 mentioned before):

{"id": "3930316", 
"title": ["sighting", "firearm"], 
"abstract": ["rear", "sight", "firearm", "ha", "peephole", "device", "formed", "hollow", "tube", "end", ...], 
"section": [5], "subsection": [104], "group": [512], "subgroup": [6535],
"labels": [5, 104+9, 512+9+128, 6535+9+128+661]}

Thus, the record should be constructed as follows:

{"id": "3930316", 
"title": ["sighting", "firearm"], 
"abstract": ["rear", "sight", "firearm", "ha", "peephole", "device", "formed", "hollow", "tube", "end", ...], 
"section": [5], "subsection": [104], "group": [512], "subgroup": [6535], 
"labels": [5, 113, 649, 7333]}
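
The offset arithmetic can be sketched in a few lines of Python. This is an illustration under the class counts stated above, not the repository's own code; NUM_CLASSES, LEVELS, OFFSETS and build_labels are names introduced here for the example.

```python
from itertools import accumulate

# Number of classes per level for the Patent Dataset, as given above.
NUM_CLASSES = [9, 128, 661, 8364]
LEVELS = ["section", "subsection", "group", "subgroup"]

# Offset of a level = total number of classes in all shallower levels.
OFFSETS = [0] + list(accumulate(NUM_CLASSES))[:-1]  # [0, 9, 137, 798]


def build_labels(record):
    """Return the flat label list: per-level indices shifted by their offset."""
    labels = []
    for level, offset in zip(LEVELS, OFFSETS):
        labels.extend(index + offset for index in record[level])
    return labels


record = {"section": [5], "subsection": [104], "group": [512], "subgroup": [6535]}
print(build_labels(record))  # [5, 113, 649, 7333]
```

Because each level is shifted past all shallower levels, every class in the hierarchy receives a unique index in the combined label space.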

This repository can be used with other text classification datasets in two ways:

  1. Convert your dataset into the same format as the sample.
  2. Modify the data preprocessing code in data_helpers.py.

Which option fits best depends on your data and task.
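
For the first option, a conversion might look like the sketch below. Everything about the raw input is an assumption made for illustration: the "paths" field (one class name per hierarchy level), the helper names build_index and convert, and the toy hierarchy, which has only two levels to keep the example short while the sample files use four. Class names are indexed in sorted order here, but any fixed order works as long as it is applied consistently.

```python
def build_index(names):
    """Map each distinct class name of one level to an index (sorted for stability)."""
    return {name: i for i, name in enumerate(sorted(set(names)))}


def convert(raw, level_keys, level_indexes):
    """Convert a raw example into the sample format.

    raw["paths"] is a hypothetical field: a list of class-name paths,
    one name per hierarchy level, e.g. ["Physics", "Nuclear Physics"].
    """
    sizes = [len(index) for index in level_indexes]
    # Offset of a level = number of classes in all shallower levels.
    offsets = [sum(sizes[:depth]) for depth in range(len(sizes))]
    record = {"id": raw["id"], "title": raw["title"], "abstract": raw["abstract"]}
    labels = []
    for depth, (key, index) in enumerate(zip(level_keys, level_indexes)):
        ids = sorted({index[path[depth]] for path in raw["paths"]})
        record[key] = ids
        labels.extend(i + offsets[depth] for i in ids)
    record["labels"] = labels
    return record


# Toy two-level hierarchy (hypothetical data).
level_keys = ["section", "subsection"]
level_indexes = [
    build_index(["Chemistry", "Physics"]),
    build_index(["Inorganic Chemistry", "Organic Chemistry", "Nuclear Physics"]),
]
raw = {"id": "demo-1", "title": ["reactor"], "abstract": ["neutron", "source"],
       "paths": [["Physics", "Nuclear Physics"]]}
converted = convert(raw, level_keys, level_indexes)
print(converted)
```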

Pre-trained Word Vectors

You can pre-train word vectors on your own corpus in several ways:

  • Use the gensim package.
  • Use the GloVe tools.
  • You can even use BERT.
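
A minimal sketch of the gensim route follows. It assumes the training file holds one JSON record per line and that title and abstract tokens are concatenated into one sentence per record; both are assumptions for illustration, as are the helper names and the window and min_count values. The size keyword matches gensim 3.8.3 from the Requirements list; gensim 4.x renamed it to vector_size.

```python
import json


def iter_sentences(lines):
    """Yield one token list per record: title tokens followed by abstract tokens."""
    for line in lines:
        line = line.strip()
        if line:
            record = json.loads(line)
            yield record["title"] + record["abstract"]


def train_word2vec(sentences, dim=100, out_path="word2vec_100.model"):
    """Train and save a Word2vec model with gensim (imported lazily)."""
    from gensim.models import Word2Vec  # gensim 3.8.3, as pinned in Requirements

    # `size` is the gensim 3.x keyword; gensim 4.x renamed it to `vector_size`.
    # window and min_count are illustrative choices, not the paper's settings.
    model = Word2Vec(sentences=list(sentences), size=dim, window=5,
                     min_count=1, workers=4)
    model.save(out_path)
    return model
```

To use it, open the training file, pass the file object to iter_sentences, and hand the result to train_word2vec with an out_path under the /data folder.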

Usage

See Usage.

Network Structure

Reference

If you want to follow the paper or use the code, please cite the following in your work:

@inproceedings{huang2019hierarchical,
  author    = {Wei Huang and
               Enhong Chen and
               Qi Liu and
               Yuying Chen and
               Zai Huang and
               Yang Liu and
               Zhou Zhao and
               Dan Zhang and
               Shijin Wang},
  title     = {Hierarchical Multi-label Text Classification: An Attention-based Recurrent Network Approach},
  booktitle = {Proceedings of the 28th {ACM} {CIKM} International Conference on Information and Knowledge Management, {CIKM} 2019, Beijing, CHINA, Nov 3-7, 2019},
  pages     = {1051--1060},
  year      = {2019},
}

About Me

黄威,Randolph

SCU SE Bachelor; USTC CS Ph.D.

Email: [email protected]

My Blog: randolph.pro

LinkedIn: randolph's linkedin
