huggingface / chug

Minimal sharded dataset loaders, decoders, and utils for multi-modal document, image, and text datasets.

License: Apache License 2.0

Languages: Python 100%
Topics: computer-vision, dataloading, datasets, distributed-training, document-understanding, multi-modal-learning, pdf-document, webdataset

chug's Introduction

Chugging Data

A library to help w/ efficient training for multi-modal data. Initially focused on image & document + text tasks.

chug currently leverages webdataset and Hugging Face datasets.

webdataset tar files and dataset pipelines are preferred for scalable pretraining.

Hugging Face datasets are supported and work great for exploration, validation, and fine-tune use cases.

chug provides on-the-fly PDF decoding and rendering via either pypdfium2 (https://github.com/pypdfium2-team/pypdfium2) as the default, or fitz/pymupdf (https://github.com/pymupdf/PyMuPDF) if your use case is okay with its AGPL-3.0 license; fitz support must be manually enabled. PDF handling is implemented at the webdataset level, so you can plug it into other webdataset pipelines. This enables large-scale sharded streaming of native .pdf files without needing to pre-render to .png/.tiff, etc.

Status

This library is still a WIP; consider this an alpha release (pre-announcement). Major features should be working, and the library has been tested with several PDF datasets that we will shortly make public. However, do expect breaking changes, lots of improvements, etc.

pip install --pre chug will install the current dev version.

TODOs

Nearish

  • Cleanup and refinement, codebase will change
  • Documentation & unit-tests
  • Support reading of info .json/.yaml files for automatic shard info resolution for webdatasets (like timm)

Mediumish

  • Option to output bbox annotations for lines (or word + word output) for tasks that leverage layout
  • Unified preprocessor functions for combined image + text tokenization (img+text token interleaving, etc.)
  • Image token (patch) packing à la NaViT. Online bin-packing based algorithms integrated with image preprocessing and the pipeline.

Longish

  • Increase the range of task pipelines to cover other tasks and modelling needs
  • Support additional modalities & targets (video, audio, detection/dense pixel targets, image/video/audio targets)
  • Explore alternatives to .tar shards (array_record, arrow, etc)

Design

Submodule Hierarchy

The library has been designed so that functions and classes at different levels can be used independently.

If one wants to build a loader & pipeline with JSON/YAML-serializable configs, use the top-level chug.create_loader() in chug/loader.py. Depending on the dataset source, one can easily switch between webdataset and HF datasets (and, in the future, other sources).

Bypassing the highest level, one can also call the build_pipeline_* methods in task_pipeline and then call create_loader_wds with a full array of args for wds-only use cases.

If one doesn't want to use chug loaders and pipelines at all, the image, text, and wds (especially decoder) functionality may be useful in other projects.

Library modules (highest to lowest level)

The dependencies of modules within the library are intended to follow the hierarchy below, e.g. doc depends on wds, but wds should never depend on doc.

app
|
loader (chug/loader.py)
|
task_pipeline
|
doc
|
wds, hfds, image, text
|
common

Submodules

common

Configs, structures (dataclasses) for general use across the library

wds

Webdataset (wds for short) specific code. Extensions and alterations of webdataset functionality to fit the covered use cases and improve robustness.

All data pipelines in chug currently leverage wds pipelines, even when not using wds datasets.

Document-oriented decoding (the PDF decoder) is present in chug/wds/decode.py; it can be used with any webdataset pipeline as a decoder, e.g. wds.decode(chug.wds.DecodeDoc('pill'), 'pill').
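
As a fuller sketch, the decoder can drop into a standard webdataset pipeline like so (the shard URL is a placeholder, and the .pdf/.json key names follow the pixparse-style shard layout; treat the details as illustrative):

import webdataset as wds
import chug

pipeline = wds.DataPipeline(
    wds.SimpleShardList('pipe:curl -s -f -L https://example.com/doc-{000..009}.tar'),  # placeholder shards
    wds.tarfile_to_samples(),
    wds.decode(chug.wds.DecodeDoc('pill'), 'pill'),  # render .pdf pages to PIL images
    wds.to_tuple('pdf', 'json'),
)
pages, anno = next(iter(pipeline))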

hfds

Hugging Face datasets support. A minimal wrapper that allows datasets to be used with chug processing pipelines.

The processing pipelines remain webdataset based when using datasets; they are invoked by a custom collate class.
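
A conceptual sketch of that collate step (class and argument names here are illustrative stand-ins, not chug's actual API):

import torch

class PipelineCollate:  # hypothetical stand-in for chug's collate class
    def __init__(self, pipeline):
        # pipeline: an iterator -> iterator callable, e.g. composed wds processing stages
        self.pipeline = pipeline

    def __call__(self, samples):
        # run the webdataset-style processing over this batch of raw samples,
        # then stack into tensors as a regular DataLoader collate would
        processed = list(self.pipeline(iter(samples)))
        return torch.utils.data.default_collate(processed)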

image

Image processing: torchvision- and albumentations-based transform building code. A mix of generic image transforms (imagenet, simclr) and document-specific transforms, including an implementation of the albumentations-based nougat transforms.

text

Text processing, tokenization code.

doc

Document processing code. Currently focused on processors that apply image/pdf decoders and process document OCR or VQA annotations.

task_pipeline

Task specific pipelines, where dataset formats meet modelling needs.

Inputs to task pipelines are sample dictionaries based on the dataset form; they are decoded and then processed into outputs that match model input requirements.

Task-specific pipelines that handle the data <--> model input interface are inserted into an encompassing data pipeline which handles shard lists, shuffling, wrapping, distributed worker splitting, batching, etc.
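
For illustration, a raw doc-read sample from a pixparse-style shard enters the pipeline looking roughly like this (field names mirror the pixparse annotation schema; the exact shape is illustrative, not a spec):

sample = {
    '__key__': 'train-000000-0000',  # hypothetical sample key
    'pdf': b'%PDF-1.4 ...',  # raw PDF bytes, rendered on the fly
    'json': {'pages': [{
        'lines': {'bbox': [[10., 10., 200., 24.]], 'text': ['Hello world'], 'score': [0.99], 'word_slice': [[0, 2]]},
        'words': {'bbox': [[10., 10., 60., 24.], [65., 10., 200., 24.]], 'text': ['Hello', 'world'], 'line_pos': [[0, 0], [0, 1]], 'score': [0.99, 0.98]},
    }]},
}

The task pipeline decodes the PDF, selects page(s) per page_sampling, applies the image/text process functions, and emits model-ready batches.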

chug.loader

This lone top-level file includes the main factory methods for creating loaders w/ associated pipelines from config dataclasses.

app

Most applications using chug will exist outside of the lib in training libraries, etc. Some builtin utility / exploration apps will be included here.

Concepts

WIP

Datasets

Datasets that work well with this library can be found on the Hugging Face Hub under the pixparse organization (https://huggingface.co/pixparse).

We'll add links to other noteworthy datasets that can be used as we become aware of them.

Usage / Examples

Document Reading, Training w/ IDL

import chug
img_cfg = chug.ImageInputCfg(size=(1024, 768), transform_type='doc_better')
img_fn = chug.create_image_preprocessor(input_cfg=img_cfg, is_training=True)
txt_fn = chug.create_text_preprocessor(
    'naver-clova-ix/donut-base',
    prompt_end_token='<s_idl>',
    task_start_token='<s_idl>',  # NOTE needs to be added to tokenizer
)

task_cfg = chug.DataTaskDocReadCfg(
    image_process_fn=img_fn,
    text_process_fn=txt_fn,
    page_sampling='random',
    error_handler='dump_and_reraise',
)
data_cfg = chug.DataCfg(
    source='pipe:curl -s -f -L https://huggingface.co/datasets/pixparse/idl-wds/resolve/main/idl-train-0{0000..2999}.tar',
    batch_size=8,
    num_samples=3144726,
    format='wds',
)
lb = chug.create_loader(
    data_cfg,
    task_cfg,
    is_training=True,
)
ii = iter(lb)
sample = next(ii)

Document Reading, Exploring IDL

import chug
task_cfg = chug.DataTaskDocReadCfg(page_sampling='all')
data_cfg = chug.DataCfg(
    source='pixparse/idl-wds',
    split='train',
    batch_size=None,
    format='hfids',
    num_workers=0,    
)
lb = chug.create_loader(
    data_cfg,
    task_cfg,
)
ii = iter(lb)
sample = next(ii)

Document Reading, Training with PDFA

import chug
img_cfg = chug.ImageInputCfg(size=(1024, 768), transform_type='doc_nougat')
img_fn = chug.create_image_preprocessor(input_cfg=img_cfg, is_training=True)
txt_fn = chug.create_text_preprocessor(
    'naver-clova-ix/donut-base',
    prompt_end_token='<s_pdfa>',
    task_start_token='<s_pdfa>',  # NOTE needs to be added to tokenizer
)

task_cfg = chug.DataTaskDocReadCfg(
    image_process_fn=img_fn,
    text_process_fn=txt_fn,
    page_sampling='random',
)
data_cfg = chug.DataCfg(
    source='pipe:curl -s -f -L https://huggingface.co/datasets/pixparse/pdfa-english-train/resolve/main/pdfa-eng-train-{000000..005000}.tar',
    batch_size=8,
    num_samples=1000000,  # FIXME replace with actual
    format='wds',   
)
lb = chug.create_loader(
    data_cfg,
    task_cfg,
    is_training=True,
)
ii = iter(lb)
sample = next(ii)

Document Reading, Exploring PDFA

import chug

task_cfg = chug.DataTaskDocReadCfg(
    page_sampling='all',
)
data_cfg = chug.DataCfg(
    source='pixparse/pdfa-eng-wds',
    split='train',
    batch_size=None,
    format='hfids',
    num_workers=0,
)
lb = chug.create_loader(
    data_cfg,
    task_cfg,
)
ii = iter(lb)
sample = next(ii)

Image + Text

Training

import chug
import transformers
from functools import partial
img_cfg = chug.ImageInputCfg(size=(512, 512), transform_type='image_timm')
img_fn = chug.create_image_preprocessor(input_cfg=img_cfg, is_training=True)
tokenizer = transformers.AutoTokenizer.from_pretrained('laion/CLIP-ViT-H-14-laion2B-s32B-b79K')
txt_fn = partial(chug.tokenize, max_length=1000, tokenizer=tokenizer)
task_cfg = chug.DataTaskImageTextCfg(
    image_process_fn=img_fn,
    text_process_fn=txt_fn,
)
data_cfg = chug.DataCfg(
    source='pipe:curl -s -f -L https://huggingface.co/datasets/pixparse/cc12m-wds/resolve/main/cc12m-train-{0000..2175}.tar',
    batch_size=8,
    num_samples=10968539,
    format='wds',   
)
lb = chug.create_loader(
    data_cfg,
    task_cfg,
    is_training=True,
)
ii = iter(lb)
sample = next(ii)

Document VQA

Training, Fine-tuning

import chug
from chug.task_pipeline import create_task_pipeline
img_cfg = chug.ImageInputCfg(size=(1024, 768), transform_type='doc_basic')
img_fn = chug.create_image_preprocessor(img_cfg, is_training=True)
txt_fn = chug.create_text_preprocessor(
    'naver-clova-ix/donut-base-finetuned-docvqa',
    prompt_end_token='<s_answer>',
    task_start_token='<s_docvqa>',
)

task_cfg = chug.DataTaskDocVqaCfg(
    image_process_fn=img_fn,
    text_process_fn=txt_fn,
)
data_cfg = chug.DataCfg(
    source='pipe:curl -s -f -L https://huggingface.co/datasets/pixparse/docvqa-wds/resolve/main/docvqa-train-{000..383}.tar',
    batch_size=8,
    format='wds',
    num_samples=39463,
)
lb = chug.create_loader(
    data_cfg,
    task_cfg,
    is_training=True,
)
ii = iter(lb)
sample = next(ii)

Exploration

import chug
from chug.task_pipeline import create_task_pipeline
task_cfg = chug.DataTaskDocVqaCfg(
    question_prefix='Question: ',
    question_suffix='',
    answer_prefix='Answer: ',
    answer_suffix=''
)
data_cfg = chug.DataCfg(
    source='pixparse/docvqa-single-page-questions',
    split='validation',
    batch_size=None,
    format='hfids',
    num_workers=0,
)
lb = chug.create_loader(
    data_cfg,
    task_cfg
)
ii = iter(lb)
sample = next(ii)

Acknowledgement

chug evolved from the webdataset data pipeline used successfully in the OpenCLIP project. Thanks to all the contributors to that project. Future work will likely involve closing the loop and leveraging chug in OpenCLIP for increased capability.

The image/document augmentations in chug rely on a number of external influences. Our document-oriented doc_better torchvision augmentations are influenced by nougat, and doc_nougat is a direct adaptation of the albumentations + cv2 document pipeline in nougat. Several image augmentations leverage existing work in the timm library.

Also, big thanks to the maintainers of webdataset and Hugging Face datasets.

chug's People

Contributors: rwightman

chug's Issues

[Bug]: requires pyarrow >=14.0.0

Bug description

Hi team - thanks for releasing this!

I was trying to use chug to test loading a large dataset as follows:

import chug

task_cfg = chug.DataTaskDocReadCfg(
    page_sampling='all',
)
data_cfg = chug.DataCfg(
    source='pixparse/pdfa-eng-wds',
    split='train',
    batch_size=None,
    format='hfids',
    num_workers=0,
)
data_loader = chug.create_loader(
    data_cfg,
    task_cfg,
)
sample = next(iter(data_loader))

However, I ran into issues with my local env, specifically around the pyarrow dependency. Originally I had pyarrow==13.0.0 in my env, and I encountered the following bug when running the above code snippet:


ArrowInvalid                              Traceback (most recent call last)
Cell In[1], line 13
      3 task_cfg = chug.DataTaskDocReadCfg(
      4     page_sampling='all',
      5 )
      6 data_cfg = chug.DataCfg(
      7     source='pixparse/pdfa-eng-wds',
      8     split='train',
   (...)
     11     num_workers=0,
     12 )
---> 13 data_loader = chug.create_loader(
     14     data_cfg,
     15     task_cfg,
     16 )
     17 sample = next(iter(data_loader))

File ~/miniconda3/envs/ocr_eval/lib/python3.10/site-packages/chug/loader.py:48, in create_loader(data_cfg, task_cfg, task_pipeline, is_training, start_interval, seed, distributed)
     37     loader = create_loader_from_config_wds(
     38         data_cfg=data_cfg,
     39         task_cfg=task_cfg,
   (...)
     44         distributed=distributed,
     45     )
...
File ~/miniconda3/envs/ocr_eval/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~/miniconda3/envs/ocr_eval/lib/python3.10/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: Unable to merge: Field json has incompatible types: struct<pages: list<item: struct<images_bbox: list<item: list<item: double>>, images_bbox_no_text_overlap: list<item: list<item: double>>, lines: struct<bbox: list<item: list<item: double>>, score: list<item: double>, text: list<item: string>, word_slice: list<item: list<item: int64>>>, words: struct<bbox: list<item: list<item: double>>, line_pos: list<item: list<item: int64>>, score: list<item: double>, text: list<item: string>>>>> vs struct<pages: list<item: struct<images_bbox: list<item: null>, images_bbox_no_text_overlap: list<item: null>, lines: struct<bbox: list<item: list<item: double>>, score: list<item: double>, text: list<item: string>, word_slice: list<item: list<item: int64>>>, words: struct<bbox: list<item: list<item: double>>, line_pos: list<item: list<item: int64>>, score: list<item: double>, text: list<item: string>>>>

Updating to pyarrow==14.0.0 resolved this bug. Since a minimum pyarrow version is not specified in the requirements, it may be useful to add one?
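
If confirmed, a minimal fix would be to add a floor to the dependency spec (exact placement in the project's requirements is an assumption):

# requirements.txt / install_requires
pyarrow>=14.0.0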

[feature] Add dataloader support for non-webdataset datasets

Currently in pixparse, other dataloaders are defined like so:

elif cfg.format == "hf_dataset":
    # In the case of hf datasets, we use the collator defined at task level
    dataset = load_dataset(cfg.source)[cfg.split]
    training_sampler = DistributedSampler(
        dataset, rank=global_rank, shuffle=True, seed=seed, num_replicas=world_size, drop_last=True
    )
    if is_train:
        # create a shared epoch store to sync epoch to dataloader worker proc
        shared_interval_count = SharedCount(count=start_interval)
    else:
        shared_interval_count = None
    num_batches = len(dataset) // cfg.batch_size
    base_loader = DataLoader(
        dataset=dataset,
        collate_fn=collate_fn,
        sampler=training_sampler,
        batch_size=cfg.batch_size,
        num_workers=cfg.num_workers,
    )
    loader = LoaderBundle(
        loader=base_loader,
        num_batches=num_batches,
        num_samples=cfg.num_samples,
        shared_interval=shared_interval_count,
    )
    return loader

Instead of having this util in pixparse, we can write it here to handle batch creation at a lower level, and then use chug normally from the pixparse lib.
