1024-m / semeval-2024-task-8-machine-text-detection

This project is forked from kinit-sk/semeval-2024-task-8-machine-text-detection

Detection system for multilingual, multidomain and multigenerator machine-generated texts

License: GNU General Public License v3.0

Python 55.88% Jupyter Notebook 44.12%

KInIT at SemEval-2024 Task 8: Fine-tuned LLMs for Multilingual Machine-Generated Text Detection

Source code for replicating the detection system that ranked fourth in the multilingual track of SemEval-2024 Task 8, Subtask A.

Cite

If you use the data, code, or the information in this repository, cite the following paper.

@inproceedings{KInITSemeval2024task8,
    title={{KInIT} at {SemEval}-2024 Task 8: Fine-tuned {LLMs} for Multilingual Machine-Generated Text Detection},
    author={Michal Spiegel and Dominik Macko},
    booktitle = {Proceedings of the 18th International Workshop on Semantic Evaluation},
    series = {SemEval 2024},
    year = {2024},
    address = {Mexico City, Mexico},
    month = {June},
    pages = {},    
    doi= {},
    misc = {https://github.com/kinit-sk/semeval-2024-task-8-machine-text-detection}           
}

Source Code Structure

File                                Description
baseline/transformer_baseline.py    the official baseline script, modified to also export machine-class probabilities
baseline/transformer_peft.py        the script for QLoRA PEFT fine-tuning of the input LLM
LLM2S3.ipynb                        a Jupyter notebook for ensembling the Falcon-7B, Mistral-7B, and statistical predictions
predictions/*                       dumped predictions for easier analysis of the LLM2S3 ensemble (without retraining the models)
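
The "machine-class probabilities" mentioned above can be illustrated with a minimal sketch: a softmax over the two class logits of a binary classifier, keeping the probability of the machine class. This is not taken from baseline/transformer_baseline.py; the function name `machine_probability` is hypothetical, and the assumption that index 1 is the machine class should be checked against the label mapping in the data.

```python
import math


def machine_probability(logits, machine_index=1):
    """Return the softmax probability of the machine class.

    `machine_index=1` is an assumption about the label mapping
    (0 = human, 1 = machine); adjust it if the data differs.
    """
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[machine_index] / sum(exps)


# Equal logits carry no preference, so the probability is 0.5.
print(machine_probability([0.0, 0.0]))
```

Exporting this probability rather than only the argmax label is what makes later threshold tuning and ensembling possible.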

Installation

  1. Clone and install the IMGTB framework, then activate the conda environment.
    git clone https://github.com/michalspiegel/IMGTB.git
    cd IMGTB
    conda env create -f environment.yaml
    conda activate IMGTB
    
  2. To integrate with the official scoring scripts, clone the official SemEval-2024 Task 8 repository, copy the official data to the data folder as described in that repository, and copy the contents of this repository to the subtaskA folder.
    git clone https://github.com/mbzuai-nlp/SemEval2024-task8.git
    cd SemEval2024-task8
    
  3. Or just clone this repository and use it independently.
    git clone https://github.com/kinit-sk/semeval-2024-task-8-machine-text-detection.git
    cd semeval-2024-task-8-machine-text-detection
    

Code Usage

  1. To retrain the Mistral-7B model, run the following command (the data must first be downloaded as described in Step 2 of the Installation). To retrain Falcon-7B, run the same command with the --model and --prediction_file_path arguments changed accordingly.
    python3 baseline/transformer_peft.py --train_file_path data/subtaskA_train_multilingual.jsonl --test_file_path data/subtaskA_test_multilingual.jsonl --prediction_file_path predictions/mistral_test_predictions_probs.jsonl --subtask A --model 'mistralai/Mistral-7B-v0.1'
    
  2. To regenerate statistical metrics, use the IMGTB framework.
  3. For LLM2S3 ensembling, use the provided Jupyter notebook (LLM2S3.ipynb).
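
As a rough illustration of how per-example probabilities from several detectors can be combined, the sketch below averages machine-class probabilities and applies a threshold. It is a generic averaging ensemble, not necessarily the scheme implemented in LLM2S3.ipynb. The JSONL field names `id` and `prob`, the helper names `load_probs` and `ensemble`, and the 0.5 threshold are all assumptions made for the example.

```python
import json


def load_probs(lines, key="prob"):
    """Map example id -> machine-class probability from JSONL lines.

    The field names "id" and `key` are hypothetical; check the actual
    files in predictions/ before reusing this.
    """
    probs = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        probs[record["id"]] = float(record[key])
    return probs


def ensemble(prob_maps, threshold=0.5):
    """Average probabilities per id across detectors, then threshold."""
    shared_ids = set.intersection(*(set(m) for m in prob_maps))
    result = {}
    for example_id in sorted(shared_ids):
        mean = sum(m[example_id] for m in prob_maps) / len(prob_maps)
        result[example_id] = {"prob": mean, "label": int(mean >= threshold)}
    return result


# In-memory stand-ins for three prediction files (made-up values).
mistral = load_probs(['{"id": 0, "prob": 0.9}', '{"id": 1, "prob": 0.2}'])
falcon = load_probs(['{"id": 0, "prob": 0.8}', '{"id": 1, "prob": 0.3}'])
statistical = load_probs(['{"id": 0, "prob": 0.4}', '{"id": 1, "prob": 0.1}'])

combined = ensemble([mistral, falcon, statistical])
print(combined[0]["label"], combined[1]["label"])
```

To use it with real files, pass an open file handle to `load_probs` in place of the in-memory lists, since iterating a text file yields its lines.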
