modelscope / data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷为大语言模型提供更高质量、更丰富、更易”消化“的数据!

License: Apache License 2.0

Python 99.80% Shell 0.15% Dockerfile 0.05%
chinese data-analysis data-science data-visualization dataset gpt gpt-4 instruction-tuning large-language-models llama llava llm llms multi-modal nlp opendata pre-training pytorch sora streamlit

data-juicer's Introduction




English | 中文 | 日本語

Introduction

ModelScope is built upon the notion of "Model-as-a-Service" (MaaS). It seeks to bring together the most advanced machine learning models from the AI community, and to streamline the process of leveraging AI models in real-world applications. The core ModelScope library open-sourced in this repository provides the interfaces and implementations that allow developers to perform model inference, training and evaluation.

In particular, with rich layers of API abstraction, the ModelScope library offers a unified experience for exploring state-of-the-art models spanning domains such as CV, NLP, Speech, Multi-Modality, and Scientific Computation. Model contributors from different areas can integrate models into the ModelScope ecosystem through the layered APIs, allowing easy and unified access to their models. Once integrated, model inference, fine-tuning, and evaluation can be done with only a few lines of code. At the same time, flexibility is provided so that different components in a model application can be customized wherever necessary.

Apart from harboring implementations of a wide range of models, the ModelScope library also enables the necessary interactions with the ModelScope backend services, particularly the Model-Hub and Dataset-Hub. Such interactions allow the management of various entities (models and datasets) to be performed seamlessly under the hood, including entity lookup, version control, cache management, and many others.

Models and Online Accessibility

Hundreds of models are made publicly available on ModelScope (700+ and counting), covering the latest developments in areas such as NLP, CV, Audio, Multi-Modality, and AI for Science. Many of these models represent the SOTA in their specific fields and made their open-source debut on ModelScope. Users can visit ModelScope (modelscope.cn) and experience first-hand how these models perform online, with just a few clicks. An immediate developer experience is also possible through the ModelScope Notebook, which is backed by a ready-to-use CPU/GPU development environment in the cloud - only one click away on ModelScope.



Some representative examples include:

NLP:

Multi-Modal:

CV:

Audio:

AI for Science:

Note: Most models on ModelScope are public and can be downloaded without account registration on the ModelScope website (www.modelscope.cn). Please refer to the instructions for model download to download models via the API provided by the modelscope library or via git.
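As a quick illustration (a minimal sketch; the cache directory below is only an example), a model snapshot can be fetched to a local directory with the snapshot_download API:

>>> from modelscope.hub.snapshot_download import snapshot_download

>>> # download all files of the model repo and return the local directory path
>>> model_dir = snapshot_download('damo/nlp_structbert_word-segmentation_chinese-base', cache_dir='./models')
>>> print(model_dir)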

QuickTour

We provide a unified interface for inference using pipeline, and for fine-tuning and evaluation using Trainer, across different tasks.

For any given task with any type of input (image, text, audio, video...), an inference pipeline can be implemented with only a few lines of code; it automatically loads the underlying model and returns the inference result, as exemplified below:

>>> from modelscope.pipelines import pipeline
>>> word_segmentation = pipeline('word-segmentation',model='damo/nlp_structbert_word-segmentation_chinese-base')
>>> word_segmentation('今天天气不错,适合出去游玩')
{'output': '今天 天气 不错 , 适合 出去 游玩'}

Given an image, portrait matting (a.k.a. background removal) can be accomplished with the following code snippet:

image

>>> import cv2
>>> from modelscope.pipelines import pipeline

>>> portrait_matting = pipeline('portrait-matting')
>>> result = portrait_matting('https://modelscope.oss-cn-beijing.aliyuncs.com/test/images/image_matting.png')
>>> cv2.imwrite('result.png', result['output_img'])

The output image with the background removed is: image

Fine-tuning and evaluation can also be done with a few more lines of code to set up the training dataset and trainer, with the heavy lifting of training and evaluating a model encapsulated in the implementation of the trainer.train() and trainer.evaluate() interfaces.

For example, the GPT-3 base model (1.3B) can be fine-tuned with the chinese-poetry dataset, resulting in a model that can be used for Chinese poetry generation.

>>> from modelscope.metainfo import Trainers
>>> from modelscope.msdatasets import MsDataset
>>> from modelscope.trainers import build_trainer

>>> train_dataset = MsDataset.load('chinese-poetry-collection', split='train').remap_columns({'text1': 'src_txt'})
>>> eval_dataset = MsDataset.load('chinese-poetry-collection', split='test').remap_columns({'text1': 'src_txt'})
>>> max_epochs = 10
>>> tmp_dir = './gpt3_poetry'

>>> kwargs = dict(
     model='damo/nlp_gpt3_text-generation_1.3B',
     train_dataset=train_dataset,
     eval_dataset=eval_dataset,
     max_epochs=max_epochs,
     work_dir=tmp_dir)

>>> trainer = build_trainer(name=Trainers.gpt3_trainer, default_args=kwargs)
>>> trainer.train()
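Evaluation on the held-out split follows the same pattern (a minimal sketch continuing the example above; the exact metrics reported depend on the model configuration):

>>> # run evaluation with the trainer built above, using the eval_dataset passed in kwargs
>>> metrics = trainer.evaluate()
>>> print(metrics)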

Why should I use the ModelScope library?

  1. A unified and concise user interface is abstracted for different tasks and different models. Model inference and training can be implemented in as few as 3 and 10 lines of code, respectively. It is convenient for users to explore models across different fields in the ModelScope community. All models integrated into ModelScope are ready to use, which makes it easy to get started with AI in both educational and industrial settings.

  2. ModelScope offers a model-centric development and application experience. It streamlines the support for model training, inference, export and deployment, and facilitates users in building their own MLOps based on the ModelScope ecosystem.

  3. For the model inference and training processes, a modular design is put in place, and a wealth of functional module implementations are provided, making it convenient for users to customize their own model inference, training and other pipelines.

  4. For distributed model training, especially for large models, rich training-strategy support is provided, including data parallelism, model parallelism, hybrid parallelism and so on.

Installation

Docker

The ModelScope library currently supports popular deep learning frameworks for model training and inference, including PyTorch, TensorFlow and ONNX. All releases are tested and run on Python 3.7+, PyTorch 1.8+, and TensorFlow 1.15 or TensorFlow 2.0+.

To allow out-of-the-box usage of all the models on ModelScope, official Docker images are provided for all releases. Based on these images, developers can skip all environment installation and configuration and use the library directly. Currently, the latest versions of the CPU image and GPU image can be obtained from:

CPU docker image

# py37
registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-py37-torch1.11.0-tf1.15.5-1.6.1

# py38
registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-py38-torch2.0.1-tf2.13.0-1.9.5

GPU docker image

# py37
registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.3.0-py37-torch1.11.0-tf1.15.5-1.6.1

# py38
registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.8.0-py38-torch2.0.1-tf2.13.0-1.9.5

Setup Local Python Environment

One can also set up a local ModelScope environment using pip and conda. ModelScope supports Python 3.7 and above. We suggest Anaconda for creating the local Python environment:

conda create -n modelscope python=3.8
conda activate modelscope

PyTorch or TensorFlow can be installed separately according to each model's requirements.

  • Install pytorch doc
  • Install tensorflow doc

After installing the necessary machine-learning framework, you can install the modelscope library as follows:

If you only want to play around with the modelscope framework, or try out model/dataset download, you can install the core modelscope components:

pip install modelscope

If you want to use multi-modal models:

pip install modelscope[multi-modal]

If you want to use nlp models:

pip install modelscope[nlp] -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

If you want to use cv models:

pip install modelscope[cv] -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

If you want to use audio models:

pip install modelscope[audio] -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

If you want to use science models:

pip install modelscope[science] -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

Notes:

  1. Currently, some audio-task models only support the Python 3.7, TensorFlow 1.15.4 Linux environment. Most other models can be installed and used on Windows and Mac (x86).

  2. Some models in the audio field use the third-party library SoundFile for wav file processing. On Linux, users need to manually install libsndfile, the system dependency of SoundFile (doc link). On Windows and macOS, it will be installed automatically without user intervention. For example, on Ubuntu, you can use the following commands:

    sudo apt-get update
    sudo apt-get install libsndfile1
  3. Some models in computer vision need mmcv-full; you can refer to the mmcv installation guide. A minimal installation is as follows:

    pip uninstall mmcv # if you have installed mmcv, uninstall it
    pip install -U openmim
    mim install mmcv-full

Learn More

We provide additional documentation, including:

License

This project is licensed under the Apache License (Version 2.0).

data-juicer's People

Contributors

alibaba-oss, beachwang, cathy0908, chenhesen, chg0901, co63oc, drcege, garyzhang99, hylcool, jongsky, lingzhq, liuyanyi, pan-x-c, shiweijiezero, xieyxclack, xuruidong, yxdyc, zhenqincn, zhijianma


data-juicer's Issues

FT-Data Ranker-1b OOM finetuning on single GPU

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of main branch to run again and the problem still existed.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

Using the code provided by the competition, I fine-tuned falcon-rw-1b with DeepSpeed on a single GPU (3090, 24 GB). I have tried adjusting the parameters and the DeepSpeed config, but it always runs out of memory.

Additional

  • Script
#!/bin/bash

set -e 
export CUDA_DEVICE_MAX_CONNECTIONS=1

if [ -z $XDG_CACHE_HOME ]; then
    export XDG_CACHE_HOME=$HOME/.cache
fi

if [[ $# -ne 3 ]]; then
    echo "Three arguments required! " >&2
    exit 2
fi

# Model Path
# e.g /home/model/baichuan2-7b/
model_path=${1} #/path/to/your/model/
tokenizer=${model_path}

# Data Path
# e.g /home/data/train.jsonl
data_path=${2} # /path/to/your/dataset.jsonl

# Output Path
# e.g ${WORK_DIR}/checkpoints/baichuan2-7b/
output_path=${3} #/path/to/your/output/

mkdir -p ${output_path}/

WORK_DIR=$(echo `cd $(dirname $0); pwd | xargs dirname`)
cd ${WORK_DIR}

# Deepspeed
# ds_config_file=${WORK_DIR}/train_scripts/deepspeed_configs/ds_config_stage3.json
ds_config_file=${WORK_DIR}/train_scripts/deepspeed_configs/ds_config_stage3_offload-para.json

# Train Parameter
bs_per_gpu=1
num_nodes=1
# nproc_per_node=`nvidia-smi | grep MiB | wc -l`
nproc_per_node=1
master_port=50000

# grad_acc=`expr 256 / ${bs_per_gpu} / ${num_nodes} / ${nproc_per_node}`
grad_acc=`expr 32 / ${bs_per_gpu} / ${num_nodes} / ${nproc_per_node}`
deepspeed --num_gpus ${nproc_per_node} --num_nodes ${num_nodes} --master_port ${master_port} train.py \
    --model_name_or_path ${model_path} \
    --tokenizer ${tokenizer} \
    --data_path ${data_path} \
    --output_dir ${output_path} \
    --per_device_train_batch_size ${bs_per_gpu} \
    --gradient_accumulation_steps ${grad_acc} \
    --lang en \
    --bf16 True \
    --gradient_checkpointing_enable True \
    --num_train_epochs 3 \
    --model_max_length 1024 \
    --learning_rate 2.5e-5 \
    --weight_decay 0 \
    --warmup_ratio 0.03 \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --save_steps -1 \
    --save_total_limit 999 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --deepspeed ${ds_config_file} | tee ${output_path}/training_log.txt
  • Log
ORI NUMBER: 23237, AFTER FILETER: 22564, DROP NUMBER: 673
Total 22564 samples [ 6.48M tokens] in training!
  0%|                                                                                                                                                                                                                                                             | 0/705 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/ubuntu/Projects/ft_data_ranker_1b/competition_kit/lm-training/train.py", line 465, in <module>
    train()
  File "/home/ubuntu/Projects/ft_data_ranker_1b/competition_kit/lm-training/train.py", line 457, in train
    trainer.train()
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
    return inner_training_loop(
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/transformers/trainer.py", line 1971, in _inner_training_loop
    self.optimizer.step()
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/accelerate/optimizer.py", line 145, in step
    self.optimizer.step(closure)
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/torch/optim/optimizer.py", line 373, in wrapper
    out = func(*args, **kwargs)
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/torch/optim/adamw.py", line 184, in step
    adamw(
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/torch/optim/adamw.py", line 335, in adamw
    func(
  File "/home/ubuntu/Softwares/anaconda3/envs/dj_comp/lib/python3.10/site-packages/torch/optim/adamw.py", line 599, in _multi_tensor_adamw
    exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 0 has a total capacty of 23.69 GiB of which 36.38 MiB is free. Including non-PyTorch memory, this process has 23.64 GiB memory in use. Of the allocated memory 22.50 GiB is allocated by PyTorch, and 105.61 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

dedup across different datasets

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of main branch to run again and the problem still existed.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

Hi, thanks for the great work. I'm wondering whether Data-Juicer applies dedup across different datasets when reproducing RedPajama. For example, will there be duplicates between CommonCrawl-2023-06 and CommonCrawl-2022-05?

Additional

No response

tokenization parameter in StopWordsFilter ops

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of main branch to run again and the problem still existed.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

https://github.com/alibaba/data-juicer/blob/main/data_juicer/ops/filter/stopwords_filter.py#L85

  • The tokenization parameter looks meaningless, because only one way of counting stopwords is actually implemented (the other path reads from already-tokenized text), namely sentencepiece tokenization.

  • Moreover, when the condition at #L83 does not hold, setting tokenization to False may cause an error at #L86.

Additional

No response

Process killed while saving data during deduplication of a ~31 GB dataset

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of main branch to run again and the problem still existed.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

1111

Additional

1111111

[Bug]: RAY error

Before Reporting

  • I have pulled the latest code of main branch to run again and the bug still existed.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

ubuntu

Installation Method

pip

Data-Juicer Version

v0.1.2

Python Version

3.8

Describe the bug

An error is raised when running the language_id_score_filter operator with RAY.

To Reproduce

# ok
python tools/process_data.py --config configs/demo/process.yaml

# error
python tools/process_data.py --config configs/demo/process.yaml --executor_type ray

# ok, after changing the op to - alphanumeric_filter:
python tools/process_data.py --config configs/demo/process.yaml --executor_type ray

Configs

# Process config example for dataset

# global parameters
project_name: 'demo-process'
dataset_path: 'demos/data/demo-dataset.jsonl'  # path to your dataset directory or file
np: 4  # number of subprocess to process your dataset
use_cache: false
export_path: './outputs/demo-process/demo-processed.jsonl'
save_stats_in_one_file: true
# process schedule
# a list of several process operators with their arguments
process:
  - language_id_score_filter:
      lang: 'zh'
  # - alphanumeric_filter:

Logs

(python3.8) wzp@vastai-NF5468M6:~/code/LLMData/open_source/data-juicer$ python tools/process_data.py --config configs/demo/process.yaml --executor_type ray
<class 'list'>
<class 'list'>
2023-11-30 15:32:52 | WARNING  | data_juicer.config.config:329 - Cache management of datasets is disabled.
2023-11-30 15:32:52 | WARNING  | data_juicer.config.config:340 - Set temp directory to store temp files to [None].
2023-11-30 15:32:52 | INFO     | data_juicer.config.config:442 - Back up the input config file [/home/wzp/code/LLMData/open_source/data-juicer/configs/demo/process.yaml] into the work_dir [./outputs/demo-process]
2023-11-30 15:32:52 | INFO     | data_juicer.config.config:463 - Configuration table: 
╒════════════════════════╤══════════════════════════════════════════════════════════════════════════════════════════╕
│ key                    │ values                                                                                   │
╞════════════════════════╪══════════════════════════════════════════════════════════════════════════════════════════╡
│ config                 │ [Path_fr(configs/demo/process.yaml, cwd=/home/wzp/code/LLMData/open_source/data-juicer)] │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ hpo_config             │ None                                                                                     │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ project_name           │ 'demo-process'                                                                           │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ executor_type          │ 'ray'                                                                                    │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ dataset_path           │ 'demos/data/demo-dataset.jsonl'                                                          │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ export_path            │ './outputs/demo-process/demo-processed.jsonl'                                            │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ export_shard_size      │ 0                                                                                        │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ export_in_parallel     │ False                                                                                    │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ np                     │ 4                                                                                        │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ text_keys              │ 'text'                                                                                   │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ image_key              │ 'images'                                                                                 │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ image_special_token    │ '<__dj__image>'                                                                          │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ eoc_special_token      │ '<|__dj__eoc|>'                                                                          │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ suffixes               │ []                                                                                       │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ use_cache              │ False                                                                                    │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ ds_cache_dir           │ PosixPath('/home/wzp/.cache/huggingface/datasets')                                       │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ cache_compress         │ None                                                                                     │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ use_checkpoint         │ False                                                                                    │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ temp_dir               │ None                                                                                     │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ open_tracer            │ False                                                                                    │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ op_list_to_trace       │ []                                                                                       │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ trace_num              │ 10                                                                                       │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ op_fusion              │ False                                                                                    │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ process                │ [{'language_id_score_filter': {'image_key': 'images',                                    │
│                        │                                'lang': 'zh',                                             │
│                        │                                'min_score': 0.8,                                         │
│                        │                                'text_key': 'text'}}]                                     │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ save_stats_in_one_file │ True                                                                                     │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ ray_address            │ 'auto'                                                                                   │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ work_dir               │ './outputs/demo-process'                                                                 │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ timestamp              │ '20231130153252'                                                                         │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ dataset_dir            │ '/home/wzp/code/LLMData/open_source/data-juicer/demos/data'                              │
├────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│ add_suffix             │ False                                                                                    │
╘════════════════════════╧══════════════════════════════════════════════════════════════════════════════════════════╛
2023-11-30 15:32:53 | INFO     | data_juicer.core.ray_executor:35 - Initing Ray ...
2023-11-30 15:32:53,326 INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.23.4.252:6379...
2023-11-30 15:32:53,333 INFO worker.py:1642 -- Connected to Ray cluster.
2023-11-30 15:32:53 | INFO     | data_juicer.core.ray_executor:47 - Loading dataset with Ray...
2023-11-30 15:32:54,324 INFO read_api.py:406 -- To satisfy the requested parallelism of 192, each read task output is split into 192 smaller blocks.
2023-11-30 15:32:54 | INFO     | data_juicer.core.ray_executor:51 - Preparing process operators...
2023-11-30 15:32:54 | INFO     | data_juicer.utils.model_utils:87 - Loading fasttext language identification model...
2023-11-30 15:32:54 | INFO     | data_juicer.core.ray_executor:59 - columns ['text', 'meta']
2023-11-30 15:32:54 | INFO     | data_juicer.core.ray_executor:62 - Processing data...
2023-11-30 15:32:54,702 INFO streaming_executor.py:93 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[ReadJSON->SplitBlocks(192)] -> TaskPoolMapOperator[MapBatches(process_batch)->Map(compute_stats)->Filter(process)]
2023-11-30 15:32:54,702 INFO streaming_executor.py:94 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-11-30 15:32:54,702 INFO streaming_executor.py:96 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
(MapBatches(process_batch)->Map(compute_stats)->Filter(process) pid=48497) 2023-11-30 15:33:00.312 | ERROR    | data_juicer.ops.filter.language_id_score_filter:compute_stats:52 - Model not loaded. Please retry later.                                                                              
(MapBatches(process_batch)->Map(compute_stats)->Filter(process) pid=48497) 2023-11-30 15:33:00.363 | ERROR    | data_juicer.ops.filter.language_id_score_filter:compute_stats:52 - Model not loaded. Please retry later.                                                                              
(MapBatches(process_batch)->Map(compute_stats)->Filter(process) pid=48497) 2023-11-30 15:33:01.362 | ERROR    | data_juicer.ops.filter.language_id_score_filter:compute_stats:52 - Model not loaded. Please retry later.                                                                              
--- Logging error in Loguru Handler #1 ---                                                                                                                                                                                                                                                            
Record was: {'elapsed': datetime.timedelta(seconds=13, microseconds=527897), 'exception': (type=<class 'ray.exceptions.RayTaskError(ValueError)'>, value=RayTaskError(ValueError)(ValueError('Model not loaded. Please retry later.')), traceback=<traceback object at 0x7f298c1324c0>), 'extra': {}, 'file': (name='process_data.py', path='tools/process_data.py'), 'function': '<module>', 'level': (name='ERROR', no=40, icon='❌'), 'line': 19, 'message': "An error has been caught in function '<module>', process 'MainProcess' (48135), thread 'MainThread' (139830410995520):", 'module': 'process_data', 'name': '__main__', 'process': (id=48135, name='MainProcess'), 'thread': (id=139830410995520, name='MainThread'), 'time': datetime(2023, 11, 30, 15, 33, 2, 531776, tzinfo=datetime.timezone(datetime.timedelta(seconds=28800), 'CST'))}
Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 345, in ray._raylet.StreamingObjectRefGenerator._next_sync
  File "python/ray/_raylet.pyx", line 4533, in ray._raylet.CoreWorker.try_read_next_object_ref_stream
  File "python/ray/_raylet.pyx", line 443, in ray._raylet.check_status
ray.exceptions.ObjectRefStreamEndOfStreamError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 80, in on_waitable_ready
    meta = ray.get(next(self._streaming_gen))
  File "python/ray/_raylet.pyx", line 300, in ray._raylet.StreamingObjectRefGenerator.__next__
  File "python/ray/_raylet.pyx", line 351, in ray._raylet.StreamingObjectRefGenerator._next_sync
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/loguru/_logger.py", line 1277, in catch_wrapper
    return function(*args, **kwargs)
  File "tools/process_data.py", line 15, in main
    executor.run()
  File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/ray_executor.py", line 83, in run
    logger.info(f'Op [{op_name}] Done. Left '
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/dataset.py", line 2498, in count
    [get_num_rows.remote(block) for block in self.get_internal_block_refs()]
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/dataset.py", line 4799, in get_internal_block_refs
    blocks = self._plan.execute().get_blocks()
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/plan.py", line 591, in execute
    blocks = execute_to_legacy_block_list(
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/legacy_compat.py", line 119, in execute_to_legacy_block_list
    block_list = _bundles_to_block_list(bundles)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/legacy_compat.py", line 357, in _bundles_to_block_list
    for ref_bundle in bundles:
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces/executor.py", line 37, in __next__
    return self.get_next()
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py", line 129, in get_next
    raise item
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py", line 187, in run
    while self._scheduling_loop_step(self._topology) and not self._shutdown:
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor.py", line 235, in _scheduling_loop_step
    process_completed_tasks(topology)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 333, in process_completed_tasks
    active_tasks[ref].on_waitable_ready()
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 88, in on_waitable_ready
    ex = ray.get(block_ref)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/_private/worker.py", line 2547, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::MapBatches(process_batch)->Map(compute_stats)->Filter(process)() (pid=48715, ip=10.23.4.252)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 405, in _map_task
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 345, in __call__
    for data in iter:
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 171, in __call__
    yield from self._row_fn(input, ctx)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 256, in transform_fn
    for row in rows:
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 223, in __call__
    for block in blocks:
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 345, in __call__
    for data in iter:
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 171, in __call__
    yield from self._row_fn(input, ctx)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 233, in transform_fn
    out_row = fn(row)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 119, in fn
    return op_fn(item, *fn_args, **fn_kwargs)
  File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/data.py", line 47, in wrapped_f
    return f(*args, **kargs)
  File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/ops/filter/language_id_score_filter.py", line 53, in compute_stats
    raise ValueError(err_msg)
ValueError: Model not loaded. Please retry later.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/loguru/_handler.py", line 204, in emit
    self._queue.put(str_record)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/multiprocessing/queues.py", line 362, in put
    obj = _ForkingPickler.dumps(obj)
  File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'ray.exceptions.RayTaskError(ValueError)'>: attribute lookup RayTaskError(ValueError) on ray.exceptions failed
--- End of logging error ---
(MapBatches(process_batch)->Map(compute_stats)->Filter(process) pid=48713) 2023-11-30 15:33:02.374 | ERROR    | data_juicer.ops.filter.language_id_score_filter:compute_stats:52 - Model not loaded. Please retry later. [repeated 9x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)

Screenshots

No response

Additional

No response

[Bug]: alphanumeric_filter, char.isalnum()

Before Reporting

  • I have pulled the latest code of main branch to run again and the bug still existed.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

ubuntu

Installation Method

pip

Data-Juicer Version

v0.1.2

Python Version

3.8

Describe the bug

https://github.com/alibaba/data-juicer/blob/main/data_juicer/ops/filter/alphanumeric_filter.py#L75

            alnum_count = sum(
                map(lambda char: 1
                    if char.isalnum() else 0, sample[self.text_key]))

Python 3 strings use Unicode by default, so '汉字'.isalnum() returns True. encode() defaults to UTF-8, and after encoding to UTF-8 bytes, Chinese characters no longer count as alphanumeric:

            alnum_count = sum(
                map(lambda char: 1
                    if char.encode().isalnum() else 0, sample[self.text_key]))
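A quick interactive check of the difference (a standalone snippet illustrating the report above):

>>> '汉字'.isalnum()           # Unicode str: CJK characters count as alphanumeric
True
>>> '汉字'.encode().isalnum()  # UTF-8 bytes: only ASCII letters/digits count
False
>>> 'abc123'.encode().isalnum()
True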

To Reproduce

python tools/analyze_data.py --config configs/demo/analyser.yaml

Configs

project_name: 'demo-analyser'
dataset_path: 'demos/data/demo-dataset.jsonl'  # path to your dataset directory or file
np: 4  # number of subprocess to process your dataset
text_keys: 'text'
export_path: './outputs/demo-analyser/demo-analyser-result.jsonl'

# process schedule
# a list of several process operators with their arguments
process:
    - alphanumeric_filter:

Logs

No response

Screenshots

image

The fifth row is entirely Chinese, so its alnum_ratio (alphanumeric ratio) should be 0.

Additional

No response

[Bug]: date format changed from input to output

Before Reporting

  • I have pulled the latest code of main branch to run again and the bug still existed.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

Ubuntu

Installation Method

from source

Data-Juicer Version

latest

Python Version

3.8

Describe the bug

I have a jsonl dataset to be processed. There is a time key in the data records that originally looks like '2023-10-13 16:06:31'. After following the python tools/process_data.py --config configs/demo/process.yaml command to process the data, I found the time in the output jsonl changed to 1678, an integer. It seems to be caused by datasets.to_json: there is a parameter called date_format, and when I set it to 'iso', the output changes to '1970-01-01T00:00:01.698'. So not only the format but also the value is changed.
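The behavior can be reproduced outside Data-Juicer with pandas alone, which, as far as I can tell, Dataset.to_json uses for serialization under the hood (a small sketch; the timestamp is the example from this report):

import pandas as pd

df = pd.DataFrame({'time': pd.to_datetime(['2023-10-13 16:06:31'])})
# default date_format='epoch' serializes the timestamp as epoch milliseconds
print(df.to_json(orient='records', lines=True))                     # {"time":1697213191000}
# date_format='iso' keeps an ISO string representation instead
print(df.to_json(orient='records', lines=True, date_format='iso'))  # {"time":"2023-10-13T16:06:31.000"}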

To Reproduce

  1. Prepare a jsonl dataset with a time field.
  2. Run python tools/process_data.py --config configs/demo/process.yaml
  3. The time format and value in the output are changed.

Configs

project_name: 'all'                                         # project name for distinguish your configs
dataset_path: '/path/to/dataset/0.jsonl'                     
export_path: '/path/to/result/result.jsonl'              
export_shard_size: 0                                      
export_in_parallel: false                                  
np: 4                                                       # number of subprocess to process your dataset
text_keys: '内容'                                      
suffixes: ['.jsonl']                                                # the suffix of files that will be read. For example: '.txt', 'txt' or ['txt', '.pdf', 'docx']
use_cache: true                                             # whether to use the cache management of Hugging Face datasets. It might take up lots of disk space when using cache
ds_cache_dir: null                                        
use_checkpoint: false                              
open_tracer: false                                          # whether to open the tracer to trace the changes during process. It might take more time when opening tracer
op_list_to_trace: []                                        # only ops in this list will be traced by tracer. If it's empty, all ops will be traced. Only available when tracer is opened.
trace_num: 10                                               # number of samples to show the differences between datasets before and after each op. Only available when tracer is opened.
op_fusion: false                                            # whether to fuse operators that share the same intermediate variables automatically. Op fusion might reduce the memory requirements slightly but speed up the whole process.
cache_compress: null                                        # The compression method of the cache file, which can be specified in ['gzip', 'zstd', 'lz4']. If this parameter is None, the cache file will not be compressed. We recommend you turn on this argument when your input dataset is larger than tens of GB and your disk space is not enough.

# for distributed processing
executor_type: default                                      # Type of executor, support "default" or "ray" for now.
ray_address: auto                                           # The address of the Ray cluster.

# only for data analysis
save_stats_in_one_file: false                               # whether to store all stats result into one file

# process schedule: a list of several process operators with their arguments
process:
  - clean_email_mapper:                                     # remove emails from text.
  - clean_html_mapper:                                      # remove html formats form text.
  - clean_ip_mapper:                                        # remove ip addresses from text.
  - clean_links_mapper:                                     # remove web links from text.
  - clean_copyright_mapper:                                 # remove copyright comments.
  - punctuation_normalization_mapper:                       # normalize unicode punctuations to English punctuations.
  - whitespace_normalization_mapper:                        # normalize different kinds of whitespaces to English whitespace.

Logs

No response

Screenshots

No response

Additional

No response

[MM] llava2dj & dj2llava tools

There are diverse formats for multimodal datasets. So we provide several format conversion tools, which convert original datasets to the Data-Juicer intermediate format and convert them back. Data-Juicer will accept datasets in Data-Juicer format and process them.

Here we first add tools for converting LLaVA-like datasets to the target dataset in Data-Juicer format and back. Other formats will be supported in the future.

[MM] clip_similarity_filter

A new Filter clip_similarity_filter will be supported. For a multimodal sample that contains images and texts, if the CLIP similarities between image-text pairs are out of a specific range, this sample will be filtered out.
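To make the intended behavior concrete, here is a rough standalone sketch of the underlying image-text similarity check using Hugging Face transformers; the model name, thresholds, and image path are illustrative, not necessarily what the operator will use:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

image = Image.open('example.jpg')        # hypothetical sample image
text = 'a photo of a cat'                # the text paired with the image
inputs = processor(text=[text], images=image, return_tensors='pt', padding=True)
with torch.no_grad():
    outputs = model(**inputs)
# cosine similarity between the image and text embeddings
sim = torch.nn.functional.cosine_similarity(outputs.image_embeds, outputs.text_embeds).item()
keep_sample = 0.2 <= sim <= 1.0          # hypothetical [min, max] similarity range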

Tianchi FT-Data Ranker 1B track: RuntimeError: Error(s) in loading state_dict for FalconForCausalLM when evaluating the trained model on test data

After LoRA fine-tuning the falcon-rw-1b model, calling the trained model on the test data raises the following error:

(dj_comp) root@xht-ddc311903c-03ffb4f2:~/competition_kit/lm-evaluation-harness# bash ./examples/challenge-1B-stage1.sh \

dev
/root/output/1b
/root/competition_kit/data/challenge-data
/root/output/result_1b
[MODEL] /root/output/1b
[DATA] /root/competition_kit/data/challenge-data/dev
[OUT] /root/output/result_1b/dev
[TASK] challenge_mc: 25-shot
Using device 'cuda:0'
/root/miniconda3/envs/dj_comp/lib/python3.10/site-packages/torch/nn/init.py:412: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
Traceback (most recent call last):
File "/root/competition_kit/lm-evaluation-harness/main.py", line 103, in
main()
File "/root/competition_kit/lm-evaluation-harness/main.py", line 70, in main
results = evaluator.simple_evaluate(
File "/root/competition_kit/lm-evaluation-harness/lm_eval/utils.py", line 243, in _wrapper
return fn(*args, **kwargs)
File "/root/competition_kit/lm-evaluation-harness/lm_eval/evaluator.py", line 80, in simple_evaluate
lm = lm_eval.models.get_model(model).create_from_arg_string(
File "/root/competition_kit/lm-evaluation-harness/lm_eval/base.py", line 115, in create_from_arg_string
return cls(**args, **args2)
File "/root/competition_kit/lm-evaluation-harness/lm_eval/models/gpt2.py", line 85, in init
self.model = transformers.AutoModelForCausalLM.from_pretrained(
File "/root/miniconda3/envs/dj_comp/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
return model_class.from_pretrained(
File "/root/miniconda3/envs/dj_comp/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3187, in from_pretrained
) = cls._load_pretrained_model(
File "/root/miniconda3/envs/dj_comp/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3636, in _load_pretrained_model
raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for FalconForCausalLM:
size mismatch for transformer.word_embeddings.weight: copying a param with shape torch.Size([50258, 2048]) from checkpoint, the shape in current model is torch.Size([0, 2048]).
size mismatch for lm_head.weight: copying a param with shape torch.Size([50258, 2048]) from checkpoint, the shape in current model is torch.Size([0, 2048]).
You may consider adding ignore_mismatched_sizes=True in the model from_pretrained method.

  • Is this a problem in my training that breaks the evaluation, or did something only go wrong at evaluation time? The [0, 2048] shape looks very odd.

I searched Google and Baidu a lot; most results discuss problems with the output of the projection/head layer, but I don't understand how a 0 could appear. The suggested ignore_mismatched_sizes=True does not solve the problem either.

Deduplication of a ~50 GB Chinese corpus gets stuck

I am using the project to run deduplication (minhash) and other operators on a Chinese corpus of about 40 million samples, and it gets stuck at the deduplication stage.
The full config is as follows:

project_name: 'CC100-zh'
dataset_path: xxx.jsonl
export_path: xxx-processed.jsonl

np: 50
open_tracer: true
text_keys: 'text'
process:
  - perplexity_filter:
      lang: zh
      max_ppl: 2500
  - document_minhash_deduplicator:                          
      tokenization: character                                    
      window_size: 5                                        
      num_permutations: 256                                   
      jaccard_threshold: 0.7                                 
      num_bands: null                                        
      num_rows_per_band: null                                 
      lowercase: true                                         
      ignore_pattern: null                                    
  - text_length_filter:
      min_len: 200
      max_len: 65589
  - character_repetition_filter:
      rep_len: 10
      max_ratio: 0.3
  - word_repetition_filter:
      lang: zh
      tokenization: true
      rep_len: 10
      max_ratio: 0.279

In addition, is there a detailed time-cost breakdown for processing corpora of a similar scale?

Token counting in token_num_filter

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of main branch to run again and the problem still existed.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

https://github.com/alibaba/data-juicer/blob/main/data_juicer/ops/filter/token_num_filter.py#L49
The filter is based on EleutherAI/pythia-6.9b-deduped.
Since this model is trained on English, when token_num_filter tokenizes Chinese text, are there problems with the token counts it reports?
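A quick way to see the effect (a sketch; the second tokenizer is just an arbitrary Chinese-aware baseline for comparison):

from transformers import AutoTokenizer

text = '今天天气不错,适合出去游玩'
for name in ['EleutherAI/pythia-6.9b-deduped', 'bert-base-chinese']:
    tokenizer = AutoTokenizer.from_pretrained(name)
    # token counts for the same Chinese sentence can differ a lot between tokenizers
    print(name, len(tokenizer.tokenize(text)))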

image

Additional

No response

[Bug]: Operators such as clean_links_mapper and clean_ip_mapper cause OOM

Before Reporting

  • I have pulled the latest code of main branch to run again and the bug still existed.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

Ubuntu20.04

Installation Method

from source

Data-Juicer Version

latest

Python Version

3.10.12

Describe the bug

While cleaning English web data of roughly 150 GB with the following operators enabled:

  - clean_email_mapper:                                     # remove emails from text.
  - clean_html_mapper:                                      # remove html formats form text.
  - clean_ip_mapper:                                        # remove ip addresses from text.
  - clean_links_mapper:                                     # remove web links from text.

the cleaning process uses up to 1.4 TB of memory, which leads to OOM.

To Reproduce

The full config file is as follows:

project_name: 'xxx'
dataset_path: 'xxx'
export_path: 'xxx'

np: 128  
open_tracer: true
text_keys: 'text'
trace_num: 0x3f3f3f3f

process:
  - words_num_filter:                                     
      lang: en                                              
      tokenization: false                                    
      min_num: 100                                            
  - clean_email_mapper:                                     
  - clean_html_mapper:                                      
  - clean_ip_mapper:                                       
  - clean_links_mapper:                                     
  - perplexity_filter:
      lang: en
      max_ppl: 2500 
  - document_minhash_deduplicator:                          
      tokenization: space                                    
      window_size: 5                                          
      num_permutations: 256                                   
      jaccard_threshold: 0.7                                 
      num_bands: null                                        
      num_rows_per_band: null                                 
      lowercase: false                                         
      ignore_pattern: null                                    
  - word_repetition_filter:
      lang: en
      tokenization: true
      rep_len: 10
      max_ratio: 0.1  

Configs

No response

Logs

No response

Screenshots

No response

Additional

No response

[Bug]: Error when profiling with scalene

Before Reporting

  • I have pulled the latest code of main branch to run again and the bug still existed.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

ubuntu

Installation Method

pip

Data-Juicer Version

v0.1.2

Python Version

3.8

Describe the bug

To Reproduce

pip install -U scalene
scalene tools/process_data.py --config configs/demo/process.yaml

Configs

# Process config example for dataset

# global parameters
project_name: 'demo-process'
dataset_path: 'demos/data/demo-dataset.jsonl'  # path to your dataset directory or file
np: 4  # number of subprocess to process your dataset
use_cache: false
export_path: './outputs/demo-process/demo-processed.jsonl'

# process schedule
# a list of several process operators with their arguments
process:
  - language_id_score_filter:
      lang: 'zh'

Logs

2023-11-29 11:13:15 | INFO | data_juicer.core.executor:107 - Processing data...
2023-11-29 11:13:15 | ERROR | data_juicer.core.executor:165 - An error occurred during Op [language_id_score_filter].
Traceback (most recent call last):
File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/executor.py", line 131, in run
dataset = dataset.add_column(name=Fields.stats,
File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/data.py", line 255, in add_column
return NestedDataset(super().add_column(*args, **kargs))
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/fingerprint.py", line 511, in wrapper
out = func(dataset, *args, **kwargs)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 5446, in add_column
dataset = self.flatten_indices() if self._indices is not None else self
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/fingerprint.py", line 511, in wrapper
out = func(dataset, *args, **kwargs)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3592, in flatten_indices
return self.map(
File "/home/wzp/code/LLMData/open_source/data-juicer/data_juicer/core/data.py", line 180, in map
new_ds = NestedDataset(super().map(*args, **kargs))
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3004, in map
for rank, done, content in Dataset._map_single(**dataset_kwargs):
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3392, in _map_single
buf_writer, writer, tmp_file = init_buffer_and_writer()
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3326, in init_buffer_and_writer
tmp_file = tempfile.NamedTemporaryFile("wb", dir=os.path.dirname(cache_file_name), delete=False)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/tempfile.py", line 541, in NamedTemporaryFile
(fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
File "/home/wzp/anaconda3/envs/python3.8/lib/python3.8/tempfile.py", line 250, in _mkstemp_inner
fd = _os.open(file, flags, 0o600)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp1pvg2dc5/tmpbf_w1ds6'

Screenshots 截图

(screenshot)

Additional 额外信息

No response

How to configure the cleaned output so that it does not contain the "__dj__stats__" field?

Before Asking 在提问之前

  • I have read the README carefully. 我已经仔细阅读了 README 上的操作指引。

  • I have pulled the latest code of main branch to run again and the problem still existed. 我已经拉取了主分支上最新的代码,重新运行之后,问题仍不能解决。

Search before asking 先搜索,再提问

  • I have searched the Data-Juicer issues and found no similar questions. 我已经在 issue列表 中搜索但是没有发现类似的问题。

Question

While using Data-Juicer, I noticed that the cleaned data contains not only the original fields but also a new "__dj__stats__" field, which holds a series of attribute values related to the operators applied during cleaning. This is indeed helpful for understanding why a sample was kept during cleaning.

Now I want to run multi-stage data cleaning, e.g. a first stage that cleans within a dataset and a second stage that cleans across datasets. With this multi-stage design, when I try to preprocess data that has already been preprocessed, I get the following error:

pyarrow.lib.ArrowInvalid: Unable to merge: Field __dj__stats__ has incompatible types: struct<alnum_ratio: double, avg_line_length: double, char_rep_ratio: double, flagged_words_ratio: double, max_line_length: int64, num_words: int64, perplexity: double, special_char_ratio: double, text_len: int64, word_rep_ratio: double> vs struct<alnum_ratio: double>
  0%|          | 0/1 [00:00<?, ?it/s]

It looks like the problem is caused by the schema of the "__dj__stats__" field.
I could not find a way to make the cleaned output exclude the "__dj__stats__" field; from the source code, including "__dj__stats__" seems to be the default design?
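
Until there is a built-in option, one workaround is to strip the stats field from the exported jsonl before feeding it into the next stage. A minimal sketch using only the standard library (the field name __dj__stats__ is taken from the error above; the file paths are placeholders):

import json

STATS_KEY = '__dj__stats__'

with open('stage1-processed.jsonl', 'r', encoding='utf-8') as fin, \
     open('stage2-input.jsonl', 'w', encoding='utf-8') as fout:
    for line in fin:
        sample = json.loads(line)
        # drop the per-sample operator stats so the schemas of different
        # stages no longer conflict when the datasets are merged
        sample.pop(STATS_KEY, None)
        fout.write(json.dumps(sample, ensure_ascii=False) + '\n')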

Additional 额外信息

No response

[MM] mmc42dj & dj2mmc4 tools

There are diverse formats for multimodal datasets. So we provide several format conversion tools, which convert original datasets to the Data-Juicer intermediate format and convert them back. Data-Juicer will accept datasets in Data-Juicer format and process them.

Here we first add tools that convert MMC4-like datasets to target datasets in the Data-Juicer format and back. Other formats will be supported in the future.
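
As an illustration of what such a conversion involves (not the actual mmc42dj implementation), the sketch below flattens an MMC4-like sample, whose text is a list of sentences with images attached to sentence indices, into a single text plus an image list. The MMC4 field names follow the public MMC4 release; the target field names and the image placeholder token are assumptions here:

def mmc4_sample_to_dj(sample, image_token='<__dj__image>'):
    """Convert one MMC4-like sample into a single-text sample.

    Expects MMC4-style fields: 'text_list' (list of sentences) and
    'image_info' (list of dicts with 'image_name' and 'matched_text_index').
    """
    sentences = list(sample['text_list'])
    images = []
    # insert an image placeholder before the sentence each image is matched to
    for info in sorted(sample.get('image_info', []),
                       key=lambda x: x['matched_text_index']):
        idx = info['matched_text_index']
        sentences[idx] = f"{image_token} {sentences[idx]}"
        images.append(info['image_name'])
    return {'text': ' '.join(sentences), 'images': images}

demo = {
    'text_list': ['A cat sits on a mat.', 'It looks happy.'],
    'image_info': [{'image_name': 'cat.jpg', 'matched_text_index': 0}],
}
print(mmc4_sample_to_dj(demo))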

[feature] release of data-juicer model checkpoints in huggingface format

Our reference models pre-trained with Data-Juicer are in Megatron format; their download links are provided below:

Please @zhijianma paste the links below later.

We are now converting them to HuggingFace format and will upload them to ModelScope and the HuggingFace Hub.

[Bug]: NameError: name 'fingerprint_warnings' is not defined TypeError: cannot pickle 'OpenCC' object

Before Reporting 报告之前

  • I have pulled the latest code of main branch to run again and the bug still existed. 我已经拉取了主分支上最新的代码,重新运行之后,问题仍不能解决。

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template) 我已经仔细阅读了 README 上的操作指引,并且在安装过程中没有错误发生。(否则,我们建议您使用Question模板向我们进行提问)

Search before reporting 先搜索,再报告

  • I have searched the Data-Juicer issues and found no similar bugs. 我已经在 issue列表 中搜索但是没有发现类似的bug报告。

OS 系统

ubuntu

Installation Method 安装方式

pip

Data-Juicer Version Data-Juicer版本

v0.1.2

Python Version Python版本

3.8

Describe the bug 描述这个bug

(screenshot of the error)

To Reproduce 如何复现

# configs/demo/process.yaml

# global parameters
project_name: 'demo-process'
dataset_path: './demos/data/demo-dataset.jsonl'  # path to your dataset directory or file
np: 4  # number of subprocess to process your dataset

export_path: './outputs/demo-process/demo-processed.jsonl'

# process schedule
# a list of several process operators with their arguments
process:
  # - language_id_score_filter:
  #     lang: 'zh'
  - chinese_convert_mapper:
      mode: 's2t'

python tools/process_data.py --config configs/demo/process.yaml

The error is shown in the screenshot above.

Testing with the following standalone script works fine:

from data_juicer.ops.mapper.chinese_convert_mapper import ChineseConvertMapper

# standalone test: the mapper works correctly when called directly
text = {"text": "我在馬路邊嫲緑變"}
op = ChineseConvertMapper(mode='t2s')
aa = op.process(text)
print(aa)

Configs 配置信息

No response

Logs 报错日志

No response

Screenshots 截图

No response

Additional 额外信息

No response

[Bug]: simhash-py cannot be installed

Before Reporting 报告之前

  • I have pulled the latest code of main branch to run again and the bug still existed. 我已经拉取了主分支上最新的代码,重新运行之后,问题仍不能解决。

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template) 我已经仔细阅读了 README 上的操作指引,并且在安装过程中没有错误发生。(否则,我们建议您使用Question模板向我们进行提问)

Search before reporting 先搜索,再报告

  • I have searched the Data-Juicer issues and found no similar bugs. 我已经在 issue列表 中搜索但是没有发现类似的bug报告。

OS 系统

Ubuntu

Installation Method 安装方式

from source

Data-Juicer Version Data-Juicer版本

No response

Python Version Python版本

3.9

Describe the bug 描述这个bug

/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/Cython/Compiler/Main.py:381: FutureWarning: Cython directive 'language_level' not set, using '3str' for now (Py3). This has changed from earlier releases! File: /tmp/pip-install-642i4337/simhash-py_707a16c4f0d24a878223a92ec6376dbd/simhash/simhash.pyx
tree = Parsing.p_module(s, pxd, full_module_name)

Error compiling Cython file:

...
import hashlib
import struct

from simhash cimport compute as c_compute
^

simhash/simhash.pyx:4:0: 'simhash/compute.pxd' not found

Error compiling Cython file:

...
import hashlib
import struct

from simhash cimport compute as c_compute
from simhash cimport find_all as c_find_all
^

simhash/simhash.pyx:5:0: 'simhash/find_all.pxd' not found

Error compiling Cython file:

...
Find the set of all matches within the provided vector of hashes.

  The provided hashes are manipulated in place, but upon completion are
  restored to their original state.
  '''
  cdef matches_t results_set = c_find_all(hashes, number_of_blocks, different_bits)
       ^

simhash/simhash.pyx:26:9: 'matches_t' is not a type identifier

Error compiling Cython file:

...

  The provided hashes are manipulated in place, but upon completion are
  restored to their original state.
  '''
  cdef matches_t results_set = c_find_all(hashes, number_of_blocks, different_bits)
  cdef vector[match_t] results_vector
       ^

simhash/simhash.pyx:27:9: 'vector' is not a type identifier

Error compiling Cython file:

...

  The provided hashes are manipulated in place, but upon completion are
  restored to their original state.
  '''
  cdef matches_t results_set = c_find_all(hashes, number_of_blocks, different_bits)
  cdef vector[match_t] results_vector
       ^

simhash/simhash.pyx:27:9: 'vector' is not a type identifier

Error compiling Cython file:

...
# Unpacks the binary bytes in digest into a Python integer
return struct.unpack('>Q', digest)[0] & 0xFFFFFFFFFFFFFFFF

def compute(hashes):
'''Compute the simhash of a vector of hashes.'''
return c_compute(hashes)
^

simhash/simhash.pyx:17:11: 'c_compute' is not a constant, variable or function identifier

Error compiling Cython file:

...
Find the set of all matches within the provided vector of hashes.

  The provided hashes are manipulated in place, but upon completion are
  restored to their original state.
  '''
  cdef matches_t results_set = c_find_all(hashes, number_of_blocks, different_bits)
                               ^

simhash/simhash.pyx:26:33: 'c_find_all' is not a constant, variable or function identifier
Traceback (most recent call last):
File "", line 2, in
File "", line 34, in
File "/tmp/pip-install-642i4337/simhash-py_707a16c4f0d24a878223a92ec6376dbd/setup.py", line 38, in
setup(
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/_distutils/core.py", line 148, in setup
return run_commands(dist)
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/_distutils/core.py", line 163, in run_commands
dist.run_commands()
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 967, in run_commands
self.run_command(cmd)
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
cmd_obj.run()
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/wheel/bdist_wheel.py", line 364, in run
self.run_command("build")
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/_distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
cmd_obj.run()
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/_distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/_distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
cmd_obj.run()
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 339, in run
self.build_extensions()
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 448, in build_extensions
self._build_extensions_serial()
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/_distutils/command/build_ext.py", line 473, in _build_extensions_serial
self.build_extension(ext)
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/Cython/Distutils/build_ext.py", line 130, in build_extension
new_ext = cythonize(
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/Cython/Build/Dependencies.py", line 1154, in cythonize
cythonize_one(*args)
File "/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/Cython/Build/Dependencies.py", line 1321, in cythonize_one
raise CompileError(None, pyx_file)
Cython.Compiler.Errors.CompileError: simhash/simhash.pyx
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
full command: /home/kemove/anaconda3/envs/sakura/bin/python -u -c '
exec(compile('"'"''"'"''"'"'

This is -- a caller that pip uses to run setup.py

- It imports setuptools before invoking setup.py, to enable projects that directly

import from distutils.core to work with newer packaging standards.

- It provides a clear error message when setuptools is not installed.

- It sets sys.argv[0] to the underlying setup.py, when invoking setup.py so

setuptools doesn'"'"'t think the script is -c. This avoids the following warning:

manifest_maker: standard file '"'"'-c'"'"' not found".

- It generates a shim setup.py, for handling setup.cfg-only projects.

import os, sys, tokenize

try:
import setuptools
except ImportError as error:
print(
"ERROR: Can not execute setup.py since setuptools is not available in "
"the build environment.",
file=sys.stderr,
)
sys.exit(1)

file = %r
sys.argv[0] = file

if os.path.exists(file):
filename = file
with tokenize.open(file) as f:
setup_py_code = f.read()
else:
filename = ""
setup_py_code = "from setuptools import setup; setup()"

exec(compile(setup_py_code, filename, "exec"))
'"'"''"'"''"'"' % ('"'"'/tmp/pip-install-642i4337/simhash-py_707a16c4f0d24a878223a92ec6376dbd/setup.py'"'"',), "", "exec"))' bdist_wheel -d /tmp/pip-wheel-oyoenh4d
cwd: /tmp/pip-install-642i4337/simhash-py_707a16c4f0d24a878223a92ec6376dbd/
Building wheel for simhash-py (setup.py) ... error
ERROR: Failed building wheel for simhash-py
Running setup.py clean for simhash-py
Running command python setup.py clean
Building from Cython
/home/kemove/anaconda3/envs/sakura/lib/python3.9/site-packages/setuptools/dist.py:723: UserWarning: Usage of dash-separated 'description-file' will not be supported in future versions. Please use the underscore name 'description_file' instead
warnings.warn(
running clean
removing 'build/lib.linux-x86_64-3.9' (and everything under it)
'build/bdist.linux-x86_64' does not exist -- can't clean it
'build/scripts-3.9' does not exist -- can't clean it
Failed to build simhash-py
ERROR: Could not build wheels for simhash-py, which is required to install pyproject.toml-based projects
WARNING: There was an error checking the latest version of pip.

To Reproduce 如何复现

cd data-juicer
pip install -v -e .

Configs 配置信息

No response

Logs 报错日志

No response

Screenshots 截图

No response

Additional 额外信息

No response

Load ModelScope datasets

Search before continuing 先搜索,再继续

  • I have searched the Data-Juicer issues and found no similar feature requests. 我已经搜索了 Data-Juicer 的 issue 列表但是没有发现类似的功能需求。

Description 描述

Support loading datasets from ModelScope.

Use case 使用场景

No response

Additional 额外信息

No response

Are you willing to submit a PR for this feature? 您是否乐意为此功能提交一个 PR?

  • Yes I'd like to help by submitting a PR! 是的!我愿意提供帮助并提交一个PR!

[MM] image_shape_filter

A new Filter image_shape_filter will be supported. For a multimodal sample that contains images, if shapes (w, h) of images are out of a specific range, this sample will be filtered out.
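
A rough sketch of the intended check, outside the Data-Juicer Filter class hierarchy (Pillow is used only for reading image sizes; the parameter names are illustrative):

from PIL import Image

def keep_by_image_shape(image_paths, min_width=1, max_width=10_000,
                        min_height=1, max_height=10_000, any_or_all='any'):
    """Return True if the sample should be kept based on its images' (w, h)."""
    flags = []
    for path in image_paths:
        with Image.open(path) as img:
            w, h = img.size
        flags.append(min_width <= w <= max_width and min_height <= h <= max_height)
    if not flags:
        return True  # samples without images are kept
    # 'any': keep if at least one image is in range; 'all': all must be in range
    return any(flags) if any_or_all == 'any' else all(flags)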

[Bug]: command-line arguments

Before Reporting 报告之前

  • I have pulled the latest code of main branch to run again and the bug still existed. 我已经拉取了主分支上最新的代码,重新运行之后,问题仍不能解决。

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template) 我已经仔细阅读了 README 上的操作指引,并且在安装过程中没有错误发生。(否则,我们建议您使用Question模板向我们进行提问)

Search before reporting 先搜索,再报告

  • I have searched the Data-Juicer issues and found no similar bugs. 我已经在 issue列表 中搜索但是没有发现类似的bug报告。

OS 系统

ubuntu

Installation Method 安装方式

pip

Data-Juicer Version Data-Juicer版本

v0.1.2

Python Version Python版本

3.8

Describe the bug 描述这个bug

https://github.com/alibaba/data-juicer/blob/main/docs/DeveloperGuide_ZH.md#丰富的配置源和类型提示

When updating arguments from the command line, if both process-operator arguments and non-process arguments are given, the following line raises an error:
https://github.com/alibaba/data-juicer/blob/main/data_juicer/config/config.py#L270

To Reproduce 如何复现

python tools/analyze_data.py --config configs/demo/analyser.yaml --project_name l666 --language_id_score_filter.min_score=0.9

Configs 配置信息

The default configs/demo/analyser.yaml

Logs 报错日志

No response

Screenshots 截图

(screenshot)

Additional 额外信息

No response

[MM] image_aspect_ratio_filter

A new Filter image_aspect_ratio_filter is supported. It will filter out multimodal samples that contain images with unexpected aspect ratios.
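
The check itself is a small variation on the shape filter above; a sketch of the core predicate (parameter names and default thresholds are illustrative):

from PIL import Image

def image_aspect_ratio_ok(path, min_ratio=0.333, max_ratio=3.0):
    """Aspect ratio is width / height; reject extreme panoramas and strips."""
    with Image.open(path) as img:
        w, h = img.size
    return min_ratio <= w / h <= max_ratio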

[Bug]: cache location issue when the formatter calls datasets.load_dataset

Before Reporting 报告之前

  • I have pulled the latest code of main branch to run again and the bug still existed. 我已经拉取了主分支上最新的代码,重新运行之后,问题仍不能解决。

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template) 我已经仔细阅读了 README 上的操作指引,并且在安装过程中没有错误发生。(否则,我们建议您使用Question模板向我们进行提问)

Search before reporting 先搜索,再报告

  • I have searched the Data-Juicer issues and found no similar bugs. 我已经在 issue列表 中搜索但是没有发现类似的bug报告。

OS 系统

Ubuntu 20.04.5 LTS x86_64

Installation Method 安装方式

source

Data-Juicer Version Data-Juicer版本

v0.1.2

Python Version Python版本

3.10

Describe the bug 描述这个bug

After reading the data-juicer source code, I found that when LocalFormatter.load_dataset calls datasets.load_dataset, the HF_DATASETS_CACHE environment variable set on the system does not take effect: when loading a local json file, the cache is still placed under the default ~/.cache/huggingface/datasets.

To Reproduce 如何复现

  1. Run the command
python tools/process_data.py --config configs/data_juicer_recipes/alpaca_cot/alpaca-cot-en-refine.yaml
  2. Check the .cache folder
ls ~/.cache/huggingface/datasets
  3. Result
--> ls ~/.cache/huggingface/datasets
xxx-960372e22b98db9d_0.0.0_fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e.lock  json

Configs 配置信息

# global parameters
project_name: 'Data-Juicer-recipes-alpaca-cot-en'
dataset_path: '../data/raw_data/raw_data_en.jsonl'  # path to your dataset directory or file
export_path: '../data/refine_data/dataset.jsonl'

np: 100  # number of subprocess to process your dataset
open_tracer: true

# process schedule
# a list of several process operators with their arguments
process:
  - document_deduplicator:            # 104636705
      lowercase: true
      ignore_non_character: true
  - alphanumeric_filter:              # 104636381
      tokenization: false
      min_ratio: 0.1
  - character_repetition_filter:      # 104630030
      rep_len: 10
      max_ratio: 0.6
  - flagged_words_filter:             # 104576967
      lang: en
      tokenization: true
      max_ratio: 0.017
  - maximum_line_length_filter:       # 104575811
      min_len: 20
  - text_length_filter:               # 104573711
      min_len: 30
  - document_simhash_deduplicator:    # 72855345
      tokenization: space
      window_size: 3
      lowercase: true
      ignore_pattern: '\p{P}'
      num_blocks: 9
      hamming_distance: 7

Logs 报错日志

None

Screenshots 截图

None

Additional 额外信息

The relevant code is in data-juicer/data_juicer/format/formatter.py

Expected solution:

  1. Allow passing arguments to load_dataset dynamically from the config, i.e. read extra keys and values from the yaml and forward them via **kwargs into LocalFormatter.load_dataset, so that the cache_dir parameter can be specified.

If possible, I would be glad to contribute a PR as a community developer.
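
A minimal sketch of the proposed workaround, assuming the extra keys read from the yaml are forwarded as keyword arguments to datasets.load_dataset (cache_dir is a standard parameter of that function; the dataset_kwargs name below is hypothetical):

import datasets

# hypothetical extra kwargs read from the yaml config
dataset_kwargs = {'cache_dir': './outputs/hf_cache'}

# load a local jsonl file, caching Arrow files under the given cache_dir
# instead of the default ~/.cache/huggingface/datasets
ds = datasets.load_dataset(
    'json',
    data_files='./demos/data/demo-dataset.jsonl',
    split='train',
    **dataset_kwargs,
)
print(ds)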

Why can't my operator receive parameters?

I created a mapper operator, declared the parameters it needs in the operator's __init__, and registered the operator in Mapper's __init__. When the variables are hard-coded inside the operator, it works and the results are correct. But when I pass the parameters through the config, I get an error saying the config is written incorrectly. What is going on? (A sketch of the usual registration pattern follows below.)
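
A minimal sketch of how a parameterized mapper is typically declared, based on the pattern of the built-in operators; the exact import path and registration decorator are assumptions, so check an existing mapper such as clean_email_mapper for the authoritative pattern. Note the *args/**kwargs forwarded to the base class, which the config system relies on:

from data_juicer.ops.base_op import OPERATORS, Mapper  # assumed import path

@OPERATORS.register_module('my_custom_mapper')  # the name used in the yaml config
class MyCustomMapper(Mapper):

    def __init__(self, my_param: str = 'default', *args, **kwargs):
        # forward common config keys (e.g. text_key) to the base class,
        # otherwise config parsing may fail
        super().__init__(*args, **kwargs)
        self.my_param = my_param

    def process(self, sample):
        sample[self.text_key] = sample[self.text_key].replace('foo', self.my_param)
        return sample

It would then be referenced in the process list of the yaml config as "- my_custom_mapper:" with "my_param: 'bar'" nested beneath it.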

export_path can not be a folder

Search before continuing 先搜索,再继续

  • I have searched the Data-Juicer issues and found no similar feature requests. 我已经搜索了 Data-Juicer 的 issue 列表但是没有发现类似的功能需求。

Description 描述

  • dataset_path is a folder path that contains multiple json files
  • after processing, I would like export_path to also be a folder containing the corresponding json files

Use case 使用场景

No response

Additional 额外信息

No response

Are you willing to submit a PR for this feature? 您是否乐意为此功能提交一个 PR?

  • Yes I'd like to help by submitting a PR! 是的!我愿意提供帮助并提交一个PR!

How to use this package offline

Before Asking 在提问之前

  • I have read the README carefully. 我已经仔细阅读了 README 上的操作指引。

  • I have pulled the latest code of main branch to run again and the problem still existed. 我已经拉取了主分支上最新的代码,重新运行之后,问题仍不能解决。

Search before asking 先搜索,再提问

  • I have searched the Data-Juicer issues and found no similar questions. 我已经在 issue列表 中搜索但是没有发现类似的问题。

Question

I currently need to use this package inside an offline container. The package itself is installed, but at run time it downloads many model weights and other resources. Is there an effective way to use this package in an offline environment? (A sketch of one possible setup follows below.)
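
One common approach (a sketch, not a Data-Juicer-specific feature): pre-download the required models on a machine with network access, copy the cache directories into the container, and force the Hugging Face libraries into offline mode. The environment variables below are standard Hugging Face ones; whether they cover every resource Data-Juicer downloads is an assumption.

import os

# point the HF libraries at the cache copied into the container
os.environ['HF_HOME'] = '/data/hf_cache'

# force offline mode so nothing tries to reach the network
os.environ['HF_HUB_OFFLINE'] = '1'
os.environ['HF_DATASETS_OFFLINE'] = '1'
os.environ['TRANSFORMERS_OFFLINE'] = '1'

# set these before importing data_juicer or running tools/process_data.py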

Additional 额外信息

No response

Is Auto-HPO fully automated end-to-end, or does it require manual intervention?

Before Asking 在提问之前

  • I have read the README carefully. 我已经仔细阅读了 README 上的操作指引。

  • I have pulled the latest code of main branch to run again and the problem still existed. 我已经拉取了主分支上最新的代码,重新运行之后,问题仍不能解决。

Search before asking 先搜索,再提问

  • I have searched the Data-Juicer issues and found no similar questions. 我已经在 issue列表 中搜索但是没有发现类似的问题。

Question

I read the paper and saw the Auto-HPO feature, but I could not find the corresponding code. Is this feature fully automated end-to-end at the moment? Or is the workflow to adjust the yaml file to generate different data, evaluate it with https://modelscope.cn/studios/Data-Juicer/auto_evaluation_helm/summary, observe the results, adjust the data-generation strategy, and iterate? Or is it done some other way?

Additional 额外信息

No response

Lexical diversity ModelScope demo error

https://modelscope.cn/studios/Data-Juicer/data_visulization_diversity/summary — after opening the lexical-diversity ModelScope link on this page, the ModelScope page reports the following error:

ImportError: cannot import name 'prepare_diversity_model' from 'data_juicer.analysis.diversity_analysis' (/opt/conda/lib/python3.8/site-packages/data_juicer/analysis/diversity_analysis.py)
Traceback:
File "/opt/conda/lib/python3.8/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
exec(code, module.__dict__)
File "/home/studio_service/studio_file/PROJECT/app.py", line 9, in
from data_juicer.analysis.diversity_analysis import (DiversityAnalysis,

[Bug]: nlpaug could generate an indefinite number of augmented samples.

Before Reporting 报告之前

  • I have pulled the latest code of main branch to run again and the bug still existed. 我已经拉取了主分支上最新的代码,重新运行之后,问题仍不能解决。

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template) 我已经仔细阅读了 README 上的操作指引,并且在安装过程中没有错误发生。(否则,我们建议您使用Question模板向我们进行提问)

Search before reporting 先搜索,再报告

  • I have searched the Data-Juicer issues and found no similar bugs. 我已经在 issue列表 中搜索但是没有发现类似的bug报告。

OS 系统

all

Installation Method 安装方式

from source

Data-Juicer Version Data-Juicer版本

latest

Python Version Python版本

3.8

Describe the bug 描述这个bug

During the FT-Ranker competition, a user used nlpaug_en_mapper to augment the dataset but an error occurs during processing:

(screenshot)

which shows that for some samples, nlpaug generated no augmented samples. The alignment of fields other than text should be adjusted to the actual number of generated samples, similar to what nlpcda_zh_mapper does (see the sketch after this paragraph).
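
A minimal sketch of the field-alignment idea, independent of the actual nlpaug_en_mapper implementation (the sample layout and the list of augmented texts are assumptions for illustration):

def align_augmented_fields(sample, aug_texts):
    """Build output columns for a variable (possibly zero) number of
    augmented texts, replicating all non-text fields for each of them."""
    out = {key: [] for key in sample}
    # keep the original sample
    for key, value in sample.items():
        out[key].append(value)
    # one new sample per generated augmentation; if aug_texts is empty,
    # nothing is added and no misaligned columns are produced
    for text in aug_texts:
        for key, value in sample.items():
            out[key].append(text if key == 'text' else value)
    return out

print(align_augmented_fields({'text': 'hello', 'meta': 'x'}, []))
print(align_augmented_fields({'text': 'hello', 'meta': 'x'}, ['hi', 'hey']))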

To Reproduce 如何复现

Run nlpaug_en_mapper on FT-Ranker Competition dataset raw_data_en.jsonl.

Configs 配置信息

Logs 报错日志

Screenshots 截图

Additional 额外信息

pip install py-data-juicer fails

Before Asking 在提问之前

  • I have read the README carefully. 我已经仔细阅读了 README 上的操作指引。

  • I have pulled the latest code of main branch to run again and the problem still existed. 我已经拉取了主分支上最新的代码,重新运行之后,问题仍不能解决。

Search before asking 先搜索,再提问

  • I have searched the Data-Juicer issues and found no similar questions. 我已经在 issue列表 中搜索但是没有发现类似的问题。

Question

pip install py-data-juicer fails with:

Cython.Compiler.Errors.CompileError: simhash/simhash.pyx
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for simhash-py
Running setup.py clean for simhash-py
Failed to build kenlm simhash-py
ERROR: Could not build wheels for kenlm, simhash-py, which is required to install pyproject.toml-based projects

The complete error output is in the attached full error log.

Additional 额外信息

  • Windows 11; gcc, cmake, and the Visual Studio 2022 MSVC 140/143 build tools are installed. I am not sure whether incompatible gcc or cmake versions are what cause the kenlm and simhash-py errors.
  • gcc (x86_64-win32-seh-rev3, Built by MinGW-W64 project) 12.1.0
    Copyright (C) 2022 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions. There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
  • cmake version 3.28.0-rc5

[Bug]: running the document_simhash_deduplicator operator raises NameError: name 'fingerprint_warnings' is not defined

Before Reporting 报告之前

  • I have pulled the latest code of main branch to run again and the bug still existed. 我已经拉取了主分支上最新的代码,重新运行之后,问题仍不能解决。

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template) 我已经仔细阅读了 README 上的操作指引,并且在安装过程中没有错误发生。(否则,我们建议您使用Question模板向我们进行提问)

Search before reporting 先搜索,再报告

  • I have searched the Data-Juicer issues and found no similar bugs. 我已经在 issue列表 中搜索但是没有发现类似的bug报告。

OS 系统

Ubuntu

Installation Method 安装方式

docker image build from Dockerfile by myself

Data-Juicer Version Data-Juicer版本

v0.1.2

Python Version Python版本

3.8.18

Describe the bug 描述这个bug

(1) Key error messages
An error occurred during Op [document_simhash_deduplicator]

_raise PicklingError(_pickle.PicklingError: Can't pickle <cyfunction _Pyx_CFunc_size__t____hash__t____hash__t___to_py..wrap at 0x7ff22bad4e80>: it's not found as cfunc.to_py.wrap

NameError: name 'fingerprint_warnings' is not defined
(2) Error screenshots
(screenshots)

To Reproduce 如何复现

1. Pull the latest code and build the docker image from the Dockerfile myself (image name: data-juicer:v0.1.2s).
2. Start the docker container as follows:
docker run -dit \
  --name dj \
  -v ~/.cache/:/root/.cache/ \
  data-juicer:v0.1.2s /bin/bash

docker cp -a dj:/data-juicer/ /data/llm-data/data-juicer

docker stop dj && docker rm dj

docker run -dit \
  --name dj \
  -p 8501:8501 \
  -v /data/llm-data/data-juicer:/data-juicer \
  -v /data/llm-data/cache/:/root/.cache/ \
  data-juicer:v0.1.2s /bin/bash
3. docker exec -it dj /bin/bash
4. cd /data-juicer/demos/process_cft_zh_data
   streamlit run app.py
5. Access ip:8501 in a browser.
6. Click the "start to process data" button.

Configs 配置信息

No response

Logs 报错日志

No response

Screenshots 截图

No response

Additional 额外信息

Issue #83 has a similar problem; pulling the latest code and reinstalling datasets==2.11.0 and dill==0.3.4 did not solve it.

[MM] image_size_filter

A new Filter image_size_filter will be supported. For a multimodal sample that contains images, if the byte sizes of the images are out of a specific range, this sample will be filtered out.
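
For byte size the check would read file sizes rather than pixel dimensions; a sketch (thresholds are illustrative):

import os

def image_size_ok(path, min_bytes=0, max_bytes=10 * 1024 * 1024):
    """Keep images whose file size in bytes falls inside [min_bytes, max_bytes]."""
    size = os.path.getsize(path)
    return min_bytes <= size <= max_bytes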

[MM] image_deduplicator

A new Deduplicator image_deduplicator will be supported. It will remove duplicate images in multimodal samples, perhaps based on the imagededup library.

TBD: when a sample contains multiple images, should we remove only the duplicate images, or remove the whole sample?
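
A simplified sketch of exact-duplicate removal based on content hashes; the planned operator may instead use perceptual hashing via imagededup, so treat this only as an outline of the bookkeeping:

import hashlib

def dedup_images(samples, image_key='images'):
    """Drop images whose bytes have been seen before, across all samples."""
    seen = set()
    for sample in samples:
        kept = []
        for path in sample.get(image_key, []):
            with open(path, 'rb') as f:
                digest = hashlib.md5(f.read()).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(path)
        sample[image_key] = kept  # alternative: drop the whole sample on any duplicate
    return samples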
