hansalemaos / dfcsv2parquet

converts large CSV files into smaller, Pandas-compatible Parquet files

Home Page: https://pypi.org/project/dfcsv2parquet/

License: MIT License

Languages: Python 100.00%

Topics: csv, dataframe, df, pandas, parquet

dfcsv2parquet's Introduction

pip install dfcsv2parquet

Tested against Windows 10 / Python 3.10 / Anaconda

The convert2parquet function is used to convert large CSV files into smaller Parquet files. It offers several advantages such as reducing memory usage, improving processing speed, and optimizing data types for more efficient storage.

This function might be interesting for individuals or organizations working with large datasets in CSV format and looking for ways to optimize their data storage and processing. By converting CSV files to Parquet format, which is a columnar storage format, several benefits can be achieved:

Reduced Memory Usage:

Parquet stores data column by column and compresses it, so the output file is usually much smaller than the row-based CSV it came from. Column data types are stored in the file, so a DataFrame loaded from Parquet keeps its compact dtypes instead of having them re-inferred from text.

Improved Processing Speed:

Parquet readers can load only the columns they need and can process a file's row groups in parallel. Converting CSV files to Parquet can therefore speed up data ingestion and queries.

Optimized Data Types:

The function includes a data type optimization step (optimize_dtypes) that aims to minimize the memory usage of the resulting Parquet files. It intelligently selects appropriate data types based on the actual data values, which can further enhance storage efficiency.
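To illustrate the general idea of dtype optimization, here is a minimal sketch that downcasts numeric columns with `pd.to_numeric`. It is not the package's `optimize_dtypes` implementation, which may work differently; `shrink_numeric_dtypes` is a hypothetical helper written for this example.

```python
import numpy as np
import pandas as pd


def shrink_numeric_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns to the smallest dtype that holds their values.

    Hypothetical helper for illustration; not the package's optimize_dtypes.
    """
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


df = pd.DataFrame(
    {
        "small_int": np.arange(100, dtype="int64"),            # fits in int8
        "big_int": np.arange(100, dtype="int64") * 1_000_000,  # needs int32
        "ratio": np.arange(100, dtype="float64") * 0.25,       # exact in float32
    }
)
small = shrink_numeric_dtypes(df)

print(df.dtypes.astype(str).to_dict())
print(small.dtypes.astype(str).to_dict())
print(df.memory_usage(index=False).sum(), "->", small.memory_usage(index=False).sum())
```

Downcasting floats to float32 reduces precision, so a real optimizer has to check whether that loss is acceptable for the data at hand.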

Categorical Data Optimization:

The function handles categorical columns efficiently by limiting the number of categories (categorylimit). It uses the union_categoricals function to merge categorical data from different chunks, reducing duplication and optimizing memory usage.
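The merging step matters because chunks read separately rarely share the same set of categories. The sketch below uses two made-up chunks to show what `pandas.api.types.union_categoricals` does in that situation; how dfcsv2parquet calls it internally is not shown here.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two chunks of the same column, each with its own set of categories.
chunk_a = pd.Series(["red", "green", "red"], dtype="category")
chunk_b = pd.Series(["blue", "green", "blue"], dtype="category")

# Plain concatenation drops the categorical dtype when the categories differ.
concatenated = pd.concat([chunk_a, chunk_b], ignore_index=True)

# union_categoricals builds one Categorical covering the categories of both chunks.
merged = pd.Series(union_categoricals([chunk_a, chunk_b]))

print(concatenated.dtype)
print(merged.dtype)
print(sorted(merged.cat.categories))
```

The merged column stores each distinct string once and keeps small integer codes per row, which is where the memory saving comes from.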

        Args:
            csv_file (str | None): Path to the input CSV file. Default is None.
            parquet_file (str | None): Path to the output Parquet file. Default is None.
            chunksize (int): Number of rows to read from the CSV file per chunk. Default is 1000000.
            categorylimit (int): The minimum number of categories in categorical columns. Default is 4.
            verbose (int | bool): Verbosity level. Set to 1 or True for verbose output, 0 or False for no output. Default is 1.
            zerolen_is_na (int | bool): Whether to treat zero-length strings as NaN values. Set to 1 or True to enable, 0 or False to disable. Default is 0.
            args: passed to pd.read_csv (doesn't work with the cli)
            kwargs: passed to pd.read_csv (doesn't work with the cli)

        Returns:
            None

        Examples:
            # Download the CSV (multi-part RAR archive):
            https://github.com/hansalemaos/csv2parquet/raw/main/bigcsv.part01.rar
            https://github.com/hansalemaos/csv2parquet/raw/main/bigcsv.part02.rar
            https://github.com/hansalemaos/csv2parquet/raw/main/bigcsv.part03.rar

            # In Python
            from dfcsv2parquet import convert2parquet

            convert2parquet(
                csv_file=r"C:\bigcsv.csv",
                parquet_file=r"c:\parquettest4.pqt",
                chunksize=1000000,
                categorylimit=4,
                verbose=True,
                zerolen_is_na=False,
            )

            # CLI
            python.exe "...\__init__.py" --csv_file "C:\bigcsv.csv" --parquet_file  "c:\parquettest4.pqt" --chunksize 100000 --categorylimit 4 --verbose 1 --zerolen_is_na 1         
