
deepfaketextdetect's Introduction

Hello! My name is Yafu Li, and I am currently a fourth-year PhD student jointly trained by Zhejiang University and Westlake University, under the supervision of Prof. Yue Zhang.

My research focuses on machine translation and natural language generation, with a recent emphasis on LLM-related topics. You can find our recent work on detecting AI-generated texts at DeepfakeTextDetect.

I am always open to discussing research, potential collaborations, or opportunities. Feel free to reach out to me at [email protected].

deepfaketextdetect's People

Contributors

linzwcs, nealcly, oashrafouad, yafuly


deepfaketextdetect's Issues

The same model gives different answers

Hi!
I can't understand why the same model gives different answers.
Here's an example:
"Apples are crisp, juicy fruits that belong to the Rosaceae family and are widely cultivated around the world. Known scientifically as Malus domestica, apples come in a variety of colors, including red, green, and yellow, each offering a distinct flavor profile. Renowned for their sweet taste and versatility, apples are not only enjoyed as a delightful snack but also used in a multitude of culinary applications."
If you run it online, the model's answer is "machine-generated", which is correct: I generated this text with GPT-3.5.
But if you run it locally with

import torch
import os
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = 'cpu'
model_dir = "nealcly/detection-longformer"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir).to(device)

# preprocess and detect are the helper functions provided with the repository
text = example  # the apple paragraph quoted above
text = preprocess(text)
result = detect(text, tokenizer, model, device)

the answer is "human-written".
Why is that? And how do I download the model used by the online demo so I can run it locally?

How is the value of "th" obtained in the detect function?

Hello, I'm a beginner and this question might be simple, but I'm really curious about how the value of "th" is calculated. It would be great if you could provide an answer.

def detect(input_text, tokenizer, model, device='cuda:0', th=-3.08583984375):
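
For instance, is it a decision threshold tuned on held-out validation data, roughly like the made-up sketch below? (This is only my guess, with invented details, to clarify what I am asking; the actual computation inside detect may be different.)

import numpy as np

# Made-up sketch: assume the detector scores each text with the difference of the
# two class logits and compares that score to a fixed threshold th.
def classify(machine_logit, human_logit, th):
    score = machine_logit - human_logit
    return "machine-generated" if score > th else "human-written"

# Was th chosen on held-out validation data, e.g. as the cut-off that maximizes accuracy?
def tune_threshold(scores, labels):
    # scores: array of logit differences; labels: True where the text is machine-generated
    best_th, best_acc = None, -1.0
    for th in np.unique(scores):
        acc = ((scores > th) == labels).mean()
        if acc > best_acc:
            best_th, best_acc = th, acc
    return best_th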

Validation file parameter in the training script

Hi,

I've been trying to re-run your experimental setup for a specific setting. One thing that confuses me is that although train.sh has a validation file parameter, it is populated with test.csv, as shown below, rather than valid.csv.

...

valid_file="$data_path/test.csv"

...

When I change the parameter from test.csv to valid.csv, accuracy improves by around +2%. Even though I have read the related parts of the code in main.py, I wasn't able to find an explanation for why this parameter is set to the test file and why the validation file isn't used.

Any help would be much appreciated.

Separating model names and domains from source

Hi! Thanks for releasing the code and datasets.

I am trying to separate model names and domains from the source field, but some model names/domains contain multiple underscores, making it difficult to do so.
Could you use a more distinctive separator (other than '_') between them, or provide a list of domains and model names?
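
To illustrate the ambiguity, here is a minimal sketch of the kind of prefix matching one has to fall back on; both the source values and the model list below are made up and not taken from the dataset:

# Made-up example: neither the source values nor the model list are from the dataset.
KNOWN_MODELS = ["gpt_3.5_turbo", "flan_t5_xxl", "llama_65b"]

def split_source(source):
    # Try to peel a known model name off the end of the source field.
    for model in KNOWN_MODELS:
        if source.endswith("_" + model):
            return source[: -(len(model) + 1)], model
    # Without a distinctive separator or an official list, this case stays ambiguous.
    raise ValueError("cannot split source field: " + source)

print(split_source("news_articles_gpt_3.5_turbo"))  # ('news_articles', 'gpt_3.5_turbo')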

Thanks!

Local dependencies in requirements.txt

Hi,

There are some local dependencies in requirements.txt:

  1. cffi @ file:///tmp/build/80754af9/cffi_1625807838443/work
  2. chardet @ file:///tmp/build/80754af9/chardet_1607706746162/work
  3. conda-package-handling @ file:///tmp/build/80754af9/conda-package-handling_1618262148928/work
  4. cryptography @ file:///tmp/build/80754af9/cryptography_1616769286105/work
  5. idna @ file:///home/linux1/recipes/ci/idna_1610986105248/work
  6. pycparser @ file:///tmp/build/80754af9/pycparser_1594388511720/work
  7. pyOpenSSL @ file:///tmp/build/80754af9/pyopenssl_1608057966937/work
  8. PySocks @ file:///tmp/build/80754af9/pysocks_1605305779399/work
  9. ruamel-yaml-conda @ file:///tmp/build/80754af9/ruamel_yaml_1616016699510/work
  10. six @ file:///tmp/build/80754af9/six_1623709665295/work
  11. urllib3 @ file:///tmp/build/80754af9/urllib3_1625084269274/work

Therefore, running pip install -r requirements.txt raises errors such as ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/tmp/build/80754af9/cffi_1625807838443/work'.

I guess the file either needs to be fixed manually or re-generated using an external package such as pipreqs.
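
As a stopgap, something like the following sketch could strip the local @ file:/// specifiers so the remaining packages resolve from PyPI (it keeps only the bare package names, so the pinned versions are lost; the output filename is just an example):

# Drop the conda-build "@ file:///..." specifiers so the packages resolve from PyPI.
with open("requirements.txt") as f:
    lines = f.readlines()

cleaned = []
for line in lines:
    if "@ file://" in line:
        # keep only the package name, e.g. "cffi @ file:///..." -> "cffi"
        cleaned.append(line.split("@", 1)[0].strip() + "\n")
    else:
        cleaned.append(line)

with open("requirements_clean.txt", "w") as f:
    f.writelines(cleaned)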

Thanks.

Asking some question about Longformer

Hi, I am very interested in your research, but I have one more question. You have constructed many datasets for training and testing, and then provided a Longformer classifier. What I would like to ask is: on which dataset setting was this Longformer trained (for example, unseen_domains, unseen_models, or something else)?
Also, could you please provide URLs for the other models that appear in your paper?
Thanks!

Maybe a typo on page 5

On page 5 of your paper:

This process creates 10 testbeds for cross-validation. We train 7 classifiers for each testbed and report their weighted average performance.

I think it should be 10 classifiers for each testbed instead of 7.

