LegalQA using SentenceKoBART

An implementation of a legal QA system based on SentenceKoBART.

Setup

# Install Git LFS: https://github.com/git-lfs/git-lfs/wiki/Installation
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt install git-lfs
git clone https://github.com/haven-jeon/LegalQA.git
cd LegalQA
git lfs pull
# If the LFS quota is exceeded, download the model with the commands below instead.
# wget http://gogamza.ipdisk.co.kr:80/gogamzapubs/VOL1/URLs/models/SentenceKoBART.bin
# mv SentenceKoBART.bin model/
pip install -r requirements.txt

Index

python app.py -t index

GPU-based indexing is available as an option:

  • pods/encode.yml - device: cuda

Train

SentenceKoBART is not fine-tuned on the legal task, so it retrieves with good recall but needs help on precision. Re-ranking the top-k results with a cross-encoder makes up for that.

  • Model: ranking for general purposes
  • Learn to Rank: ranking for a specific task
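The two-stage idea above (retrieve top-k by embedding similarity, then re-rank those candidates with a pair scorer) can be sketched in plain Python. This is only an illustration, not the code in this repository: the toy vectors, the toy documents, and the `overlap_score` function are assumptions, with `overlap_score` standing in for a real cross-encoder.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Sequence[float],
             doc_vecs: List[Sequence[float]],
             top_k: int) -> List[int]:
    """Stage 1: rank all documents by embedding similarity, keep top_k ids."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: cosine(query_vec, doc_vecs[i]),
                   reverse=True)
    return order[:top_k]


def rerank(query: str,
           candidates: List[int],
           docs: List[str],
           pair_score: Callable[[str, str], float]) -> List[int]:
    """Stage 2: re-score only the candidates with a (query, doc) scorer."""
    return sorted(candidates,
                  key=lambda i: pair_score(query, docs[i]),
                  reverse=True)


def overlap_score(query: str, doc: str) -> float:
    """Hypothetical stand-in for a cross-encoder: shared word count."""
    return float(len(set(query.split()) & set(doc.split())))


# Toy data (assumed for illustration only).
docs = ["inheritance tax filing",
        "inheritance dispute between siblings",
        "traffic accident compensation"]
doc_vecs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
query, query_vec = "sibling inheritance dispute", [1.0, 0.0]

candidates = retrieve(query_vec, doc_vecs, top_k=2)
reranked = rerank(query, candidates, docs, overlap_score)
print(candidates, reranked)
```

The first stage scores every document cheaply, while the second stage only sees `top_k` candidates, which is what makes a slower pair scorer affordable.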

Learn to Rank with KoBERT

Initial training is a binary classification task: given a title from the dataset and a question, predict whether the two form a related pair. The input format and example pairs are shown below.

Why BERT?

  • To take advantage of BERT's next sentence prediction (NSP) pretraining, which uses the same two-segment input.

[CLS] title [SEP] question [SEP]

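title | question | label
오토바이의 고속도로 주행금지가 행복추구권 등을 침해한 것은 아닌지 여부 | 甲은 평소 오토바이를 좋아하여 주말, 휴일이면 오토바이로 전국을 여행하였습니다. 그런데 ... | positive
피해자과실로 인한 교통사고로 개인택시사업면허가 취소된 경우 | 甲은 평소 오토바이를 좋아하여 주말, 휴일이면 오토바이로 전국을 여행하였습니다. 그런데 ... | negative

In English, the first title reads "Whether banning motorcycles from expressways infringes the right to pursue happiness and other rights", the second reads "When a private taxi business license is revoked over a traffic accident caused by the victim's negligence", and the shared question begins "A (甲) has always liked motorcycles and toured the country by motorcycle on weekends and holidays. However, ...".

python app.py -t train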

The trained model is saved in the rerank_model directory.

A KoBERT model fine-tuned on LegalQA is provided as gogamza/kobert-legalqa-v1.
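The pair construction described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's training code: the record keys `title` and `question` mirror the table columns, and drawing the negative title at random from a different record is an assumed sampling strategy. In practice a tokenizer usually inserts `[CLS]` and `[SEP]` itself; `format_pair` only makes the layout visible.

```python
import random
from typing import Dict, List, Tuple


def format_pair(title: str, question: str) -> str:
    """Lay out one pair as: [CLS] title [SEP] question [SEP]."""
    return f"[CLS] {title} [SEP] {question} [SEP]"


def build_pairs(records: List[Dict[str, str]],
                rng: random.Random) -> List[Tuple[str, str, str]]:
    """Return (title, question, label) triples.

    Each record yields one positive pair (its own title) and one
    negative pair (a title borrowed from a different record).
    """
    if len(records) < 2:
        raise ValueError("need at least two records to sample negatives")
    pairs = []
    for i, rec in enumerate(records):
        pairs.append((rec["title"], rec["question"], "positive"))
        # Assumed negative sampling: pick any other record's title.
        j = rng.choice([k for k in range(len(records)) if k != i])
        pairs.append((records[j]["title"], rec["question"], "negative"))
    return pairs


# Placeholder records (hypothetical, for illustration only).
records = [
    {"title": "title A", "question": "question A"},
    {"title": "title B", "question": "question B"},
]
pairs = build_pairs(records, random.Random(0))
for title, question, label in pairs:
    print(label, format_pair(title, question))
```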

Search

With REST API

To start the Jina server for REST API:

# python app.py -t query_restful --query_flow flows/query_numpy_rerank.yml
python app.py -t query_restful 

Then use a client to query:

curl --request POST -d '{"parameters": {"top_k": 1}, "data": ["상속 관련 문의"]}' -H 'Content-Type: application/json' 'http://127.0.0.1:1234/search'

Or use Jinabox with endpoint http://127.0.0.1:1234/search
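The same request can also be sent from Python using only the standard library. This is a sketch that mirrors the curl payload and the endpoint given above; the function names are hypothetical, and it assumes the server answers with a JSON body, whose schema is not described here.

```python
import json
import urllib.request


def build_search_request(
    query: str,
    top_k: int = 1,
    url: str = "http://127.0.0.1:1234/search",
) -> urllib.request.Request:
    """Build a POST request with the same JSON payload as the curl call."""
    payload = {"parameters": {"top_k": top_k}, "data": [query]}
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, top_k: int = 1):
    """Send the request and return the decoded JSON response.

    Requires the REST server to be running; the response is assumed
    to be JSON and is returned without further interpretation.
    """
    with urllib.request.urlopen(build_search_request(query, top_k)) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_search_request("상속 관련 문의", top_k=1)
print(request.get_method(), request.full_url)
```

Calling `search("상속 관련 문의")` then performs the actual query once the server from the previous step is up.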

From the terminal

# python app.py -t query --query_flow flows/query_numpy_rerank.yml
python app.py -t query

Approximate KNN Search

python app.py -t query_restful --query_flow flows/query_hnswlib_rerank.yml

python app.py -t query_restful --query_flow flows/query_faiss_rerank.yml

python app.py -t query_restful --query_flow flows/query_annoy_rerank.yml

  • Retrieval time (sec.)
    • AMD Ryzen 5 PRO 4650U, 16 GB memory
    • Average of 100 searches
    • Excluding BertReRanker

top-k | Numpy | Hnswlib | Faiss | Annoy
10 | 1.433 | 0.101 | 0.131 | 0.118
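To put these numbers in perspective, the sketch below computes each approximate index's speedup over the exact Numpy search, using the values from the table. It also shows one way an "average of 100 searches" figure could be measured; `average_seconds` and its `search_fn` argument are hypothetical helpers, not the benchmark script behind the table.

```python
import time
from typing import Callable, Dict, Iterable


def average_seconds(search_fn: Callable[[str], object],
                    queries: Iterable[str]) -> float:
    """Mean wall-clock seconds per call of search_fn over the queries."""
    queries = list(queries)
    start = time.perf_counter()
    for q in queries:
        search_fn(q)
    return (time.perf_counter() - start) / len(queries)


def speedups(times: Dict[str, float],
             baseline: str = "Numpy") -> Dict[str, float]:
    """Ratio of baseline time to each other backend's time."""
    base = times[baseline]
    return {name: round(base / t, 1)
            for name, t in times.items() if name != baseline}


# Seconds per search at top-k = 10, copied from the table.
measured = {"Numpy": 1.433, "Hnswlib": 0.101, "Faiss": 0.131, "Annoy": 0.118}
print(speedups(measured))

# Example harness run with a no-op search function (placeholder).
print(average_seconds(lambda q: None, ["query"] * 100))
```

On this hardware the approximate indexes are roughly 11 to 14 times faster than exact search at top-k = 10.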

FAQ

Why this dataset?

Legal text is full of technical terms, which makes it hard to search for anyone unfamiliar with them. That made it a good example for showing the effectiveness of neural IR.

LFS quota is exceeded

You can download SentenceKoBART.bin directly instead, for example with the wget command shown in the Setup section, and then move it into model/.

Citation

Model training, data crawling, and demo system were all supported by the AWS Hero program.

@misc{heewon2021,
author = {Heewon Jeon},
title = {LegalQA using SentenceKoBART},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/haven-jeon/LegalQA}}
}

License

  • The QA data in data/legalqa.jsonlines was crawled from www.freelawfirm.co.kr in compliance with its robots.txt. Commercial use is prohibited; only academic use is permitted.
  • We are not responsible for any legal decisions you make based on the resources provided here.
