Comments (8)
@garyzhang99 It worked, thanks a lot.
from data-juicer.
I was unable to reproduce this bug when running the command. Could you try running it again after executing pip install -v -e .[sci]?
This error occurred again after I executed pip install -v -e .[sci].
- ray version=2.7.0
- python version=3.8.18
- data-juicer version=0.2.0
This error is weird: after I commented out line 83, the error is raised from line 88 instead.
When running with Ray in a distributed setting, computation is lazy: because of Ray futures, Ray does not execute the work at the line of code where it is declared. Instead, the computation is performed only when the result is requested. After you commented out line 83, the computation that was originally triggered at line 83 is triggered at line 88, so the error is reported at line 88, even though the actual failure originates before line 83.
Ray's error reporting makes it difficult to pinpoint the real source of the error. Could you first try running the same code without Ray, in the single-machine version, to see whether it produces a more complete error message?
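The deferred-execution behaviour described above can be illustrated with a small standard-library sketch. It does not use Ray at all: `LazyPipeline`, `broken_op`, and `where_does_it_fail` are hypothetical names made up for this illustration, and the class only imitates the general idea that steps are recorded first and executed later.

```python
class LazyPipeline:
    """Toy stand-in for a lazily evaluated dataset (this is NOT Ray's API)."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._steps = []  # steps are recorded here, not executed

    def map(self, fn):
        # Only records the step; nothing runs at this line.
        self._steps.append(fn)
        return self

    def materialize(self):
        # Every recorded step runs here, so any failure surfaces here.
        out = []
        for row in self._rows:
            for fn in self._steps:
                row = fn(row)
            out.append(row)
        return out


def broken_op(row):
    # Hypothetical failing operator, standing in for a filter whose model is missing.
    raise ValueError("Model not loaded")


def where_does_it_fail():
    pipeline = LazyPipeline(["a", "b"])
    pipeline.map(str.upper)   # recorded only
    pipeline.map(broken_op)   # recorded only, no error raised yet
    try:
        pipeline.materialize()  # the error from broken_op appears here
    except ValueError as exc:
        return f"raised at materialize: {exc}"
    return "no error"
```

In this sketch the traceback points at whichever line forces the result, which is why commenting out one forcing line can make the same error appear at a later one.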
Thanks for your advice. I tested the code without Ray, and everything worked as expected. I then double-checked the demo.yaml file, changed ray_address: 'ray://localhost:10001' to ray_address: 'auto', and ran the code again. Everything worked except for two operators that use models, namely language_id_score_filter and perplexity_filter. When I commented out these two operators, it worked fine. I ran unit tests on both operators and they passed, but on the local Ray cluster they were unable to find the model.
# language_id_score_filter
File "/home/lzj/project/open-source/data-juicer/data_juicer/ops/filter/language_id_score_filter.py", line 53, in compute_stats
raise ValueError(err_msg)
ValueError: Model not loaded. Please retry later.
# perplexity_filter
File "/home/lzj/project/open-source/data-juicer/data_juicer/ops/filter/perplexity_filter.py", line 71, in compute_stats
logits += kenlm_model.score(line)
AttributeError: 'NoneType' object has no attribute 'score'
Based solely on the provided description, we have not been able to reproduce the bug, nor can we pinpoint the specific issue. It appears it might be an environmental problem or an issue with the get_model and check_model functions. Could you provide more information?
Additionally, I would like to ask whether CUDA is enabled in your local environment and whether the corresponding Data-Juicer version is up to date.
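One possible way to gather the environment details asked about above is a short script like the following. It is only a sketch: the function name `collect_env_report` is hypothetical, and the distribution names in the default `packages` tuple are assumptions that may not match what `pip list` shows in a given environment.

```python
import platform
from importlib import metadata


def collect_env_report(packages=("ray", "torch", "py-data-juicer")):
    """Return a dict of version info suitable for pasting into an issue.

    The distribution names in `packages` are assumptions; adjust them
    to the names that `pip list` reports in your environment.
    """
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    try:
        import torch  # optional; only needed for the CUDA check

        report["torch.cuda.is_available()"] = torch.cuda.is_available()
    except ImportError:
        report["torch.cuda.is_available()"] = "torch not importable"
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"- {key} = {value}")
```

Running it in the same environment that launches the Ray job keeps the reported versions consistent with the ones actually in use.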
- torch.cuda.is_available() = True
- data-juicer = v0.2.0
Here are the screenshots. I hope they are helpful for you.
It looks like the issue may be caused by using an older version of data-juicer, which did not have good support for CUDA in the Ray distributed version. You can try either of these two solutions:
- Pull the latest data-juicer code from the main branch on GitHub, then build from source (pip install -v -e .).
- Avoid using CUDA by setting the use_cuda related configurations to False in the code and modifying the CUDA environment variables accordingly.
Either of these should solve your problem.
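For the second option, a minimal sketch of the environment-variable part is shown below. `CUDA_VISIBLE_DEVICES` is the standard CUDA variable for limiting visible GPUs; the helper name `hide_gpus` is hypothetical, and where the use_cuda settings live in data-juicer is not shown here because it depends on the version in use.

```python
import os


def hide_gpus(env=None):
    """Set CUDA_VISIBLE_DEVICES to an empty string so CUDA sees no GPUs.

    This must run before torch (or any other CUDA library) initializes
    CUDA in the process, otherwise it has no effect on that process.
    """
    env = os.environ if env is None else env
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env


if __name__ == "__main__":
    hide_gpus()
    print(repr(os.environ["CUDA_VISIBLE_DEVICES"]))
```

Because Ray workers are separate processes, the variable presumably also needs to be set in the environment where the workers start, not only in the driver script.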