
arXiv Dataset Scraper

The arXiv Dataset Scraper is a tool for building text datasets from the arXiv repository, focusing on domains relevant to continued pretraining of machine learning models. It uses the arXiv API — with thanks to the arXiv team for providing open-access interoperability — to search for specific keywords such as 'ASD' and 'Autism Spectrum Disorder', then leverages GPT-4 or deepseek-chat to filter the results down to relevant articles. The matching PDFs are downloaded from the mirror site at export.arxiv.org and converted to Markdown with Marker, making the text easier to process and manipulate.
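As a rough illustration of the kind of request the keyword-search step issues, a query URL for the arXiv API can be assembled like this. Note that `build_query_url` is a hypothetical helper for illustration, not a function from this repository:

```python
from urllib.parse import urlencode

# Base endpoint of the arXiv API, served from the export mirror.
ARXIV_API = "http://export.arxiv.org/api/query"

def build_query_url(keywords, start=0, max_results=100):
    """Build an arXiv API query URL for a list of search keywords.

    Multi-word keywords are quoted so they match as phrases, and the
    terms are OR-ed together across all metadata fields (`all:`).
    """
    terms = " OR ".join(f'all:"{kw}"' for kw in keywords)
    params = {
        "search_query": terms,
        "start": start,          # paging offset
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"

url = build_query_url(["ASD", "Autism Spectrum Disorder"])
print(url)
```

The API returns an Atom feed, so the actual script would fetch this URL and parse the entry titles out of the response.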

Prerequisites

  • It is recommended to use Conda for managing your Python environments, as this project was developed with Python 3.10.
  • Ensure you have git installed to clone the repository.

Setup and Installation

# Create a new Conda environment with Python 3.10
conda create -n py310 python=3.10 -y
# Activate the newly created environment
conda activate py310
# Clone the repository
git clone https://github.com/afafw/arXiv_dataset_scrape4post_pretrain.git
# Navigate to the project directory
cd arXiv_dataset_scrape4post_pretrain
# Install the required dependencies
pip install -r requirements.txt

Usage Guide

  1. Fetch Article Names: Run Get_ALL_ARXIV_ARTICLE_NAMES.py to collect all article names from arXiv and export them to target_titles.csv.
  2. Filter Unrelated Articles: Before executing filter_unrelated.py, make sure to set the required environment variables. You can either create a .env file based on the provided .env.example or export the variables directly.
    • Sample .env file:
      OPENAI_API_KEY="your_api_key"
      OPENAI_BASE_URL="https://api.openai.com"
      USE_THIS_MODEL="gpt-4"
      
  3. Download Articles: Use download_filtered.py to download the articles that have been identified as related.
  4. Convert PDFs to Markdown: Convert the downloaded PDF files to Markdown format using Marker.
  5. Access Converted Files: Check the converted_pdf directory for the Markdown-formatted articles.
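For step 3, arXiv serves each paper's PDF from the export mirror under a predictable path. A minimal sketch of constructing that URL from an arXiv identifier (`pdf_url` is an illustrative helper, not necessarily how download_filtered.py is written):

```python
def pdf_url(arxiv_id):
    """Return the PDF URL on the export.arxiv.org mirror for an
    arXiv identifier, e.g. '2101.00001' or 'cs/9901002'."""
    return f"https://export.arxiv.org/pdf/{arxiv_id}"

print(pdf_url("2101.00001"))
```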
Putting the steps together:

python Get_ALL_ARXIV_ARTICLE_NAMES.py
python filter_unrelated.py
python download_filtered.py
# Assuming Marker is installed and on your PATH
marker ./downloaded_pdfs ./converted_pdf
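The relevance filter in step 2 presumably sends each article title to the chat model named in USE_THIS_MODEL and keeps titles the model judges related. A minimal sketch of assembling such a chat-completions request body from the environment variables above — the prompt wording and the `build_filter_request` helper are assumptions for illustration, not the repository's actual code:

```python
import json
import os

def build_filter_request(title):
    """Assemble a chat-completions request body asking the model
    whether an article title is related to the target domain."""
    model = os.environ.get("USE_THIS_MODEL", "gpt-4")
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Answer YES or NO: is this arXiv article related "
                        "to Autism Spectrum Disorder (ASD)?"},
            {"role": "user", "content": title},
        ],
        # Deterministic output makes the YES/NO filter reproducible.
        "temperature": 0,
    }

body = build_filter_request("Deep learning for early ASD screening")
print(json.dumps(body, indent=2))
```

The body would then be POSTed to `{OPENAI_BASE_URL}/chat/completions` with the OPENAI_API_KEY as a bearer token, which is why a deepseek-chat endpoint can be swapped in by changing the base URL and model name.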
