
Comments (12)

bl-young commented on July 17, 2024

I attempted to make the switch to fastparquet. My first attempt indicated I was missing python-snappy despite that not being a listed requirement of fastparquet, but I believe we can force that in setup.py.

However, to install via the command line, it seems fastparquet requires Microsoft Visual C++ Build Tools 14.0.

from fedelemflowlist.

WesIngwersen commented on July 17, 2024

read_parquet, which is called to open a flowlist, is the critical function that triggers a parquet engine dependency. It will use either pyarrow or fastparquet, or by default whichever of the two is set as the default on the user's machine. Here is a comparison of these two parquet engines
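As a rough illustration of the engine dependency described above, the sketch below checks which of the two engines can be imported in the current environment. The helper name available_parquet_engines is hypothetical and not part of fedelemflowlist or pandas; it only uses the standard library, so it can be run before pandas itself tries to pick an engine.

```python
import importlib.util

# Engine names accepted by the `engine` argument of pandas' parquet functions.
PARQUET_ENGINES = ("pyarrow", "fastparquet")


def available_parquet_engines():
    """Return the parquet engines that can be imported in this environment.

    Hypothetical helper for illustration; not part of fedelemflowlist.
    """
    return [
        name
        for name in PARQUET_ENGINES
        if importlib.util.find_spec(name) is not None
    ]


if __name__ == "__main__":
    engines = available_parquet_engines()
    if engines:
        print("Installed parquet engines:", ", ".join(engines))
    else:
        print("No parquet engine found; install pyarrow or fastparquet.")
```

If the returned list is empty, read_parquet would be expected to fail with an import error until one of the two engines is installed.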


WesIngwersen commented on July 17, 2024

One option, I would think, is to not specify the engine in the read or write calls. It would then be possible to remove the required dependency and list it as an optional dependency. But we'd have to add instructions to the Wiki Install page, both here and on the pages of every package that uses fedelemflowlist, telling the user to install (or verify the install of) either pyarrow or fastparquet before installing the package (following the Python and pip install instructions). @dyoung11 @bl-young @hottleta what do you all think about this proposal?
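One way the optional dependency could be declared is through setuptools extras. This is only a sketch under assumptions: the keys pyarrow and fastparquet for the extras are made-up labels, and listing pandas as the sole hard requirement is a guess rather than the project's actual setup.py.

```python
# Hypothetical fragment of a setup.py; extras labels and the requirement
# list are illustrative assumptions, not the project's real configuration.
SETUP_KWARGS = {
    # Hard requirement: pandas only, with no parquet engine pinned.
    "install_requires": ["pandas"],
    # Optional extras: the user chooses one engine at install time.
    "extras_require": {
        "pyarrow": ["pyarrow"],
        "fastparquet": ["fastparquet"],
    },
}

# In setup.py these would be passed on as:
#     from setuptools import setup
#     setup(name=..., **SETUP_KWARGS)
```

With extras declared this way, a user could request an engine at install time by appending the extra in square brackets to the package requirement, for example [pyarrow].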


WesIngwersen commented on July 17, 2024

Whatever is decided here will also be applied to flowsa and to any other of our ecosystem tools. We also have to make sure the resulting parquet files can be read and written across operating systems, which was one issue that we encountered in flowsa.


WesIngwersen commented on July 17, 2024

@gschivley also adding you to this discussion, in case you have any suggestions, since you were the one who led me down this path originally (no blame; actually it's thanks, because it's helped us avoid the much more complex database dependency and storage issues).


WesIngwersen commented on July 17, 2024

Ah yes, I got the same error message:

Failed to build fastparquet
Installing collected packages: llvmlite, numba, packaging, thrift, fastparquet
  Running setup.py install for fastparquet ... error
...
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": https://visualstudio.microsoft.com/downloads/

This dependency is not good: it requires too much work, and I'm not sure what happens on macOS and Linux.
fastparquet doesn't seem to be the easy option I was hoping for.


bl-young commented on July 17, 2024

After updating BuildTools, I can install fastparquet, but still can't install python-snappy. It turns out installing python-snappy is no simple task either as it has its own set of dependencies.
https://stackoverflow.com/questions/11416024/error-installing-python-snappy-snappy-c-h-no-such-file-or-directory


bl-young commented on July 17, 2024

Anyway, if anyone else wants to try the fedefl import based on fastparquet on their machine:

pip install git+https://github.com/bl-young/Federal-LCA-Commons-Elementary-Flow-List.git@parquet#egg=Federal-LCA-Commons-Elementary-Flow-List


bl-young commented on July 17, 2024

When writing the parquet, using
flows.to_parquet(outputpath + 'FedElemFlowListMaster.parquet', engine='fastparquet', compression='GZIP')
eliminates the need for python-snappy. (Note that you need to rewrite the parquet with GZIP compression before attempting to access it.) If we don't specify the engine, then users on 32-bit could use fastparquet while users on 64-bit could stick with pyarrow (and avoid the need to figure out how to install Microsoft Build Tools).


bl-young commented on July 17, 2024

OK, last thought on the issue for now. The import issue can be partially resolved by requiring pyarrow on 64-bit but fastparquet on 32-bit (which would then require the user to update Microsoft Build Tools). I don't know how this will impact USEPA/flowsa#2. My hope is that specifying the compression type across engines will avoid the issue seen there.
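The 64-bit versus 32-bit split could be sketched roughly as below. The function names are hypothetical, and the sketch only picks an engine name based on the pointer size of the running interpreter; it does not install anything or reflect how the linked commit actually implements the requirement.

```python
import struct


def interpreter_bits():
    """Return the pointer size of the running Python in bits (32 or 64)."""
    return struct.calcsize("P") * 8


def preferred_parquet_engine(bits=None):
    """Pick pyarrow on 64-bit Python and fastparquet otherwise.

    Hypothetical helper for illustration; `bits` can be passed explicitly
    to check the choice for a platform other than the current one.
    """
    if bits is None:
        bits = interpreter_bits()
    return "pyarrow" if bits == 64 else "fastparquet"


if __name__ == "__main__":
    print(interpreter_bits(), "bit Python ->", preferred_parquet_engine())
```

The same split could instead be expressed at install time with environment markers in the requirement list, which would avoid a runtime check.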

https://github.com/bl-young/Federal-LCA-Commons-Elementary-Flow-List/commit/e04587e3066a5cccb27dba7ae85b10edc449d55c


WesIngwersen commented on July 17, 2024

I agree, @bl-young: let's just not specify the engine and leave it up to users to have one of the engines installed. But our findings from the work with USEPA/flowsa#2 were that it's better not to do compression at all, so just set compression=None. See my comment, then do the pull request.


WesIngwersen commented on July 17, 2024

This guide is really helpful:
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#handling-indexes
We need to make sure we specify index=False so that the files are more compatible.
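Putting the settings agreed in this thread together, a write call might look like the sketch below. The wrapper name write_flowlist is hypothetical; the only pandas call it relies on is DataFrame.to_parquet with its compression and index parameters, and the engine is deliberately left unspecified so pandas uses whichever engine is installed.

```python
def write_flowlist(flows, path):
    """Write a DataFrame to parquet without pinning an engine.

    Hypothetical wrapper for illustration. No `engine` argument is passed,
    so pandas falls back to whichever parquet engine is installed.
    """
    # compression=None: no compression, as suggested from the flowsa findings.
    # index=False: do not store the DataFrame index in the file.
    flows.to_parquet(path, compression=None, index=False)
    return path


# Example use, assuming pandas and one parquet engine are installed:
#     import pandas as pd
#     flows = pd.DataFrame({"Flowable": ["Carbon dioxide"], "Unit": ["kg"]})
#     write_flowlist(flows, "FedElemFlowListMaster.parquet")
```

The column names in the commented example are placeholders and are not taken from the actual flow list schema.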

