
deepfaketextdetect's Introduction

Hello! My name is Yafu Li, and I am currently a fourth-year PhD student jointly trained by Zhejiang University and Westlake University, under the supervision of Prof. Yue Zhang.

My research focuses on machine translation and natural language generation, with a recent emphasis on LLM-related topics. You can find our recent work on detecting AI-generated texts at DeepfakeTextDetect.

I am always open to discussing research, potential collaborations, or opportunities. Feel free to reach out to me at [email protected].

deepfaketextdetect's People

Contributors

linzwcs, nealcly, oashrafouad, yafuly


deepfaketextdetect's Issues

The same model gives different answers

Hi!
I can't understand why the same model gives different answers.
Here's an example:
"Apples are crisp, juicy fruits that belong to the Rosaceae family and are widely cultivated around the world. Known scientifically as Malus domestica, apples come in a variety of colors, including red, green, and yellow, each offering a distinct flavor profile. Renowned for their sweet taste and versatility, apples are not only enjoyed as a delightful snack but also used in a multitude of culinary applications."
If you run it online, the model's answer is "machine-generated", which is correct: I generated this text with GPT-3.5.
But if you run it locally with

import torch
import os
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = 'cpu'
model_dir = "nealcly/detection-longformer"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir).to(device)

# preprocess and detect are the helper functions provided with the repository
text = example  # the apple paragraph quoted above
text = preprocess(text)
result = detect(text, tokenizer, model, device)

the answer is "human-written".
Why is that? And how do I download the model used by the online demo so I can run it locally?

How is the value of "th" obtained in the detect function?

Hello, I'm a beginner and this question might be simple, but I'm really curious about how the value of "th" is calculated. It would be great if you could provide an answer.

def detect(input_text, tokenizer, model, device='cuda:0', th=-3.08583984375):
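
For instance, is it a decision threshold tuned on held-out validation data, roughly like the made-up sketch below? (This is only my guess, with invented details, to clarify what I am asking; the actual computation inside detect may be different.)

import numpy as np

# Made-up sketch: assume the detector scores each text with the difference of the
# two class logits and compares that score to a fixed threshold th.
def classify(machine_logit, human_logit, th):
    score = machine_logit - human_logit
    return "machine-generated" if score > th else "human-written"

# Was th chosen on held-out validation data, e.g. as the cut-off that maximizes accuracy?
def tune_threshold(scores, labels):
    # scores: array of logit differences; labels: True where the text is machine-generated
    best_th, best_acc = None, -1.0
    for th in np.unique(scores):
        acc = ((scores > th) == labels).mean()
        if acc > best_acc:
            best_th, best_acc = th, acc
    return best_th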

Validation file parameter in the training script

Hi,

I've been trying to re-run your experimental setup for a specific setting. One thing that confuses me is that although train.sh has a validation file parameter, it is populated with test.csv, as shown below, rather than valid.csv.

...

valid_file="$data_path/test.csv"

...

When I change the parameter from test.csv to valid.csv, accuracy improves by around +2%. Even though I have read the related parts of the code in main.py, I wasn't able to find an explanation for why this parameter is set to the test file and why the validation file isn't used.

Any help would be much appreciated.

Separating model names and domains from source

Hi! Thanks for releasing the code and datasets.

I am trying to separate model names and domains from the source field, but some model names/domains contain multiple underscores, making it difficult to do so.
Could you use a more distinctive separator (other than '_') between them, or provide a list of domains and model names?
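
To illustrate the ambiguity, here is a minimal sketch of the kind of prefix matching one has to fall back on; both the source values and the model list below are made up and not taken from the dataset:

# Made-up example: neither the source values nor the model list are from the dataset.
KNOWN_MODELS = ["gpt_3.5_turbo", "flan_t5_xxl", "llama_65b"]

def split_source(source):
    # Try to peel a known model name off the end of the source field.
    for model in KNOWN_MODELS:
        if source.endswith("_" + model):
            return source[: -(len(model) + 1)], model
    # Without a distinctive separator or an official list, this case stays ambiguous.
    raise ValueError("cannot split source field: " + source)

print(split_source("news_articles_gpt_3.5_turbo"))  # ('news_articles', 'gpt_3.5_turbo')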

Thanks!

Local dependencies in requirements.txt

Hi,

There are some local dependencies in requirements.txt:

  1. cffi @ file:///tmp/build/80754af9/cffi_1625807838443/work
  2. chardet @ file:///tmp/build/80754af9/chardet_1607706746162/work
  3. conda-package-handling @ file:///tmp/build/80754af9/conda-package-handling_1618262148928/work
  4. cryptography @ file:///tmp/build/80754af9/cryptography_1616769286105/work
  5. idna @ file:///home/linux1/recipes/ci/idna_1610986105248/work
  6. pycparser @ file:///tmp/build/80754af9/pycparser_1594388511720/work
  7. pyOpenSSL @ file:///tmp/build/80754af9/pyopenssl_1608057966937/work
  8. PySocks @ file:///tmp/build/80754af9/pysocks_1605305779399/work
  9. ruamel-yaml-conda @ file:///tmp/build/80754af9/ruamel_yaml_1616016699510/work
  10. six @ file:///tmp/build/80754af9/six_1623709665295/work
  11. urllib3 @ file:///tmp/build/80754af9/urllib3_1625084269274/work

Therefore, running pip install -r requirements.txt raises errors such as ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/tmp/build/80754af9/cffi_1625807838443/work'.

I guess the file either needs to be fixed manually or re-generated using an external package such as pipreqs.
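
As a stopgap, something like the following sketch could strip the local @ file:/// specifiers so the remaining packages resolve from PyPI (it keeps only the bare package names, so the pinned versions are lost; the output filename is just an example):

# Drop the conda-build "@ file:///..." specifiers so the packages resolve from PyPI.
with open("requirements.txt") as f:
    lines = f.readlines()

cleaned = []
for line in lines:
    if "@ file://" in line:
        # keep only the package name, e.g. "cffi @ file:///..." -> "cffi"
        cleaned.append(line.split("@", 1)[0].strip() + "\n")
    else:
        cleaned.append(line)

with open("requirements_clean.txt", "w") as f:
    f.writelines(cleaned)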

Thanks.

Asking some question about Longformer

Hi, I am very interested in your research, but I have one more question. You have constructed many datasets for training and testing, and then provided a Longformer classifier. What I would like to ask is: on which dataset setting was this Longformer trained (for example, unseen_domains, unseen_models, or something else)?
Also, could you please provide URLs for the other models that appear in your paper?
Thanks!

Maybe a typo on page 5

On page 5 of your paper:

This process creates 10 testbeds for cross-validation. We train 7 classifiers for each testbed and report their weighted average performance.

I think it should be 10 classifiers for each testbed instead of 7.

