Comments (4)
If you can load the whole dataset into memory, you can use tfdv.generate_statistics_from_dataframe with the n_jobs argument set to -1 to utilize all the CPU cores.
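A minimal sketch of the in-memory approach, assuming the data already fits in a pandas DataFrame. The wrapper name stats_from_dataframe is hypothetical; only generate_statistics_from_dataframe and its n_jobs argument come from the suggestion above.

```python
def stats_from_dataframe(df):
    """Compute TFDV statistics for an in-memory pandas DataFrame.

    n_jobs=-1 asks TFDV to use all available CPU cores.
    """
    # Imported lazily so this module can be loaded without TFDV installed.
    import tensorflow_data_validation as tfdv

    return tfdv.generate_statistics_from_dataframe(df, n_jobs=-1)


# Hypothetical usage; "train.csv" is a placeholder path:
#   import pandas as pd
#   stats = stats_from_dataframe(pd.read_csv("train.csv"))
```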
Can you share an example graph where it ignores a subset of the data? I am unable to replicate the issue with the code below. Also, make sure you are using Apache Beam >= 2.22.0, as support for direct_num_workers = 0 was introduced in 2.22.0.
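import tensorflow_data_validation as tfdv
from apache_beam.options.pipeline_options import PipelineOptions

# train_tf_file and new_stats_location are paths defined elsewhere.
stats = tfdv.generate_statistics_from_tfrecord(
    data_location=train_tf_file,
    # stats_options=tfdv.StatsOptions(enable_semantic_domain_stats=True),
    pipeline_options=PipelineOptions(['--runner', 'DirectRunner', '--direct_num_workers', '0', '--direct_running_mode', 'multi_processing']),
    output_path=new_stats_location,
)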
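Since the snippet above relies on Apache Beam >= 2.22.0, a quick version check can rule that out as a cause. This is only a sketch: the helper name supports_auto_direct_workers is hypothetical, and the 2.22.0 threshold is taken from the comment above.

```python
import re

# Minimum Beam version stated in this thread for direct_num_workers = 0.
MIN_BEAM_VERSION = (2, 22, 0)


def supports_auto_direct_workers(version):
    """Return True if a Beam version string is at least 2.22.0."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups()) >= MIN_BEAM_VERSION
```

With Beam installed, the installed version can be checked by passing apache_beam.__version__ to the helper.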
Thank you!
from data-validation.
Thank you for the reply.
I would like to avoid the DataFrame approach.
I will share an example when I can.
I think the main difference is most likely that I am working with a folder that contains multiple compressed tfrecord files.
from data-validation.
By default, generate_statistics_from_tfrecord will pick up multiple tfrecord files from the given data folder (location). Make sure the given path contains all the tfrecord files, and avoid subfolders. Please share example code so we can replicate this issue on our end. Thanks.
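One way to check the folder layout before running TFDV is to list which files sit directly in the data folder and which are hidden in subfolders. The sketch below uses only the standard library; the helper name check_tfrecord_folder and the "*.tfrecord*" file pattern are assumptions, so adjust the pattern to match your actual file names.

```python
from pathlib import Path


def check_tfrecord_folder(folder, pattern="*.tfrecord*"):
    """Return (top_level_files, nested_files) matching pattern under folder.

    top_level_files sit directly in the folder; nested_files live in
    subfolders and may be missed if only the folder itself is read.
    """
    root = Path(folder)
    top_level = sorted(p for p in root.glob(pattern) if p.is_file())
    nested = sorted(
        p for p in root.rglob(pattern) if p.is_file() and p.parent != root
    )
    return top_level, nested
```

If nested_files is non-empty, some of the data lives in subfolders, which would be consistent with a subset of the data being ignored.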
from data-validation.
Closing this due to inactivity. Please take a look at the answers provided above, and feel free to reopen and post your comments if you still have questions about this. Thank you!
from data-validation.