apache / arrow-cookbook

Apache Arrow Cookbook

Home Page: https://arrow.apache.org/

License: Apache License 2.0

Languages: C++ 47.22% · Python 20.09% · Makefile 7.94% · HTML 5.55% · Shell 5.51% · R 5.47% · Batchfile 4.96% · CMake 3.26%

arrow-cookbook's Introduction

Apache Arrow Cookbooks

Cookbooks are a collection of recipes for common tasks that Arrow users might want to perform. The cookbook is actually composed of multiple cookbooks, one for each supported platform, each containing the recipes for that specific platform.

The cookbook aims to provide immediate instructions for common tasks, in contrast with the Arrow User Guides, which provide in-depth explanations. In terms of the Diátaxis framework, the cookbook is task-oriented while the user guide is learning-oriented. The cookbook will often refer to the user guide for deeper explanation.

All cookbooks can be built to HTML and verified by running a set of tests that confirm the recipes still work as expected.

Each cookbook is implemented using platform-specific tools. For this reason, a Makefile is provided that abstracts platform-specific concerns and makes it possible to build and test all cookbooks without any platform-specific knowledge (as long as the dependencies are available on the target system).

See https://arrow.apache.org/cookbook/ for the latest published version using the latest stable version of Apache Arrow. See https://arrow.apache.org/cookbook/dev for the latest published version using the development version of Apache Arrow.

Building All Cookbooks

make all

Testing All Cookbooks

make test

Listing Available Commands

make help

Building a Platform-Specific Cookbook

Refer to make help to learn the commands that build or test the cookbook for the platform you are targeting.

Prerequisites

Both the R and Python cookbooks will try to install the dependencies they need (including the latest pyarrow/Arrow R package versions). This means that as long as you have a working Python/R environment that can install dependencies through the respective package manager, you shouldn't need to install anything manually.
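
If you want to sanity-check your Python environment before building, a quick import check is enough. This is illustrative only; the Makefile handles dependency installation itself:

import pyarrow  # fails here if the environment cannot install/resolve pyarrow

# Print the installed version to confirm which Arrow release the build will use
print(pyarrow.__version__)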

Contributing to the Cookbook

Please refer to the CONTRIBUTING.md file for instructions about how to contribute to the Apache Arrow Cookbook.


All participation in the Apache Arrow project is governed by the Apache Software Foundation’s code of conduct.

arrow-cookbook's People

Contributors

alamb · alenkaf · alistaire47 · amoeba · amol- · benjaminwolfe · davisusanibar · dependabot[bot] · dgreiss · drabastomek · humbedooh · johnmackintosh · jorisvandenbossche · kou · lidavidm · liry · lwhite1 · nealrichardson · nlte · paliwalashish · raulcd · stephhazlitt · thatstatsguy · thisisnic · toddfarmer · tonyfujs · vibhatha · wesm · westonpace · wjones127


arrow-cookbook's Issues

Difference between cookbook and user guide docs

I'm looking at this repo for the first time and am surprised to see a lot of overlap with the user guide. For example, how to read a CSV file in Python. My prior expectation was that recipes were common but non-trivial uses of Arrow (for example, creating a sub-sample of an Arrow table), rather than simple examples of functionality like those shown in the user guide.

How do we define what a recipe is here? And what's the relationship between the user guide and the cookbook?
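
For context, the kind of basic recipe the issue refers to is a one-liner in pyarrow (the file name here is hypothetical):

import pyarrow.csv

# Read a CSV file into an Arrow Table -- the sort of simple example
# that also appears in the user guide.
table = pyarrow.csv.read_csv("example.csv")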

[R] Add content on Tables vs. Datasets

There are fundamental differences between working with Tables/InMemoryDatasets and file-based datasets. There should be content about working with datasets that has an intro covering those differences, plus recipes for working with them.

Possible topics:

  • Adding new data to a dataset
  • Link to reading/writing sections
  • Mention that only the relevant data is read in

This content could potentially be part of other chapters, but it is definitely needed.
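
For reference, a minimal Python sketch of the Table vs. Dataset distinction that the R content would parallel (the directory name is hypothetical):

import pyarrow as pa
import pyarrow.dataset as ds

# An in-memory Table: all data is materialized at once.
table = pa.table({"x": [1, 2, 3]})

# A file-based Dataset: files are scanned lazily, and filters are
# pushed down so only the relevant data is read.
dataset = ds.dataset("some_dir/", format="parquet")
subset = dataset.to_table(filter=ds.field("x") > 1)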

Common example files across implementations

Do we want some common example data files that are used across all the implementations? For example, common files used in the various dataset API recipes. I don't really know how often people will be bouncing between languages or comparing them, though.

[R] Content feedback via DM

You’ll notice we’ve used collect() in the Arrow pipeline above. That’s because one of the ways in which arrow is efficient is that it works out the instructions for the calculations it needs to perform (expressions) and only runs them once you actually pull the data into your R session.

We might rephrase this to make it clear we mean the computation only happens when you trigger it, but that the computation happens in Arrow and not in R

It also means that you are able to manipulate data that is larger than you can fit into memory on the machine you’re running your code on, if you only pull data into R when you have selected the desired subset.

We also have the ability to operate on chunks of data, so you might not even need to subset it to be smaller than memory; you just need the compute kernels to finish with chunks that are smaller than memory. I'm not sure if we want/need to mention that here, just something to note.
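
For comparison, the same deferred, chunked execution is visible on the Python side (the dataset path is hypothetical):

import pyarrow.dataset as ds

# Nothing is read until the scan actually runs, and results stream back
# in batches, so the working set can stay smaller than memory.
dataset = ds.dataset("flights/", format="parquet")
for batch in dataset.to_batches(filter=ds.field("month") == 1):
    ...  # process each chunk independently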

You want to use a function which is implemented in Arrow's C++ library but either:

  • it doesn't have a mapping to a base R or tidyverse equivalent, or
  • it has a mapping but nevertheless you want to call the C++ function directly

It looks like the bullets aren’t being caught here (probably need a stupid extra new line somewhere)

[Python] Makefile: pytest target fails

I'm getting one failing test when running the make pytest target.

Document: io
------------
**********************************************************************
File "io.rst", line 799, in default
Failed example:
    dataset = ds.dataset("s3://ursa-labs-taxi-data/2011",
                         partitioning=["month"])
    for f in dataset.files[:10]:
        print(f)
    print("...")
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/doctest.py", line 1336, in __run
        exec(compile(example.source, filename, "single",
      File "<doctest default[0]>", line 1, in <module>
        dataset = ds.dataset("s3://ursa-labs-taxi-data/2011",
      File "/Users/nathanael.leaute/Documents/github/arrow-cookbook/venv/lib/python3.9/site-packages/pyarrow/dataset.py", line 655, in dataset
        return _filesystem_dataset(source, **kwargs)
      File "/Users/nathanael.leaute/Documents/github/arrow-cookbook/venv/lib/python3.9/site-packages/pyarrow/dataset.py", line 410, in _filesystem_dataset
        return factory.finish(schema)
      File "pyarrow/_dataset.pyx", line 2402, in pyarrow._dataset.DatasetFactory.finish
      File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
      File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
    OSError: Error creating dataset. Could not read schema from 'ursa-labs-taxi-data/2011/01/data.parquet': Could not open Parquet input source 'ursa-labs-taxi-data/2011/01/data.parquet': AWS Error [code 15]: Access Denied. Is this a 'parquet' file?
**********************************************************************
1 items had failures:
   1 of  27 in default
27 tests in 1 items.
26 passed and 1 failed.
***Test Failed*** 1 failures.

It seems like the ACL on the ursa-labs-taxi-data bucket doesn't allow public access. I don't know if you want to open up the bucket/prefix to the public and incur the AWS bandwidth costs, though. Those are definitely a thing.
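
If the bucket did allow public reads, one possible workaround would be to pass an explicitly anonymous S3 filesystem so local AWS credentials aren't picked up and rejected (the region here is an assumption):

import pyarrow.dataset as ds
from pyarrow import fs

# Anonymous access avoids sending the local credentials that triggered
# the Access Denied error above.
s3 = fs.S3FileSystem(anonymous=True, region="us-east-2")
dataset = ds.dataset("ursa-labs-taxi-data/2011",
                     filesystem=s3, partitioning=["month"])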

[R] Recipe for random sampling

It would be great if there were a way to sample from an Arrow dataset. I put together this somewhat hacky example, but I bet there's something a bit more elegant...

library(arrow)
library(dplyr)
library(nycflights13)

flights <- nycflights13::flights

# Add a row id so we can sample rows across the whole dataset
flights$id <- seq_len(nrow(flights))

# Write one Parquet file per month (create the directory first)
dir.create("flight_ds", showWarnings = FALSE)
for (i in unique(flights$month)) {
  out <- filter(flights, month == i)
  arrow::write_parquet(out, paste0("flight_ds/", i, ".parquet"))
}

ds <- arrow::open_dataset("flight_ds")

# Draw 100 random row ids, then filter the dataset down to them
sampled_ids <- sample(flights$id, 100)

ds %>%
  filter(id %in% sampled_ids) %>%
  collect()

Investigate PR templates

I'd like to see that PRs link to an issue (if applicable, I don't think creating an issue is necessary to submit a PR). Is there anything else we might want to make sure we include in a PR?

Define writing style

It seems there are some inconsistencies in the cookbook's writing style.
For example:

  • verb tense in titles: "Write a Parquet file" vs. "Reading a Parquet file"
  • capitalization: "IPC" vs. "ipc", "parquet" vs. "Parquet", "numpy", "Arrow array" vs. "Arrow Array", etc.

[C++] Add annotated gRPC + Flight service example

Add an example of a Flight and gRPC service coexisting on the same port. Document the caveats (namely: everything should link gRPC dynamically, not statically). Talk about why this is useful (including being able to customize low-level server options).

This can be done in Java as well, but not Python. We should also document setting gRPC client options (this can be done in Java, Python, and C++).

The caveats should be noted in the C++ docs as well (see ARROW-14662).

See apache/arrow#11657 where a user ran into this.
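
As a sketch of the client-side half, pyarrow already exposes generic gRPC channel arguments when connecting (the port and option value here are illustrative):

import pyarrow.flight as flight

# generic_options passes standard gRPC channel arguments through to the
# underlying channel; the option names are gRPC's, not Arrow's.
client = flight.connect(
    "grpc://localhost:8815",
    generic_options=[("grpc.max_receive_message_length", 64 * 1024 * 1024)],
)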

Separate CI jobs for Python and R cookbook implementations

Currently the cookbooks are built via make all; however, they have different dependencies, and if no changes have been made to a particular implementation, this makes builds take longer than necessary. The C++ build already has its own CI job (see #22); we could improve build/deploy time by doing the same for the Python and R cookbooks.

Update contributors guide with conventions on issue assignment

I'm not sure what the best way is to mark that someone is interested in working on an issue. In JIRA you can just assign an issue to yourself, but it does not seem that GitHub Issues has the same capability (you have to have write permission to be able to assign issues, I think). So maybe the best we can do is ask people to leave a comment if they start working on an issue?

Or, we can ask people to create an empty draft PR and reference the issue so that the PR is linked to the issue even if it doesn't have any associated work.

[DISCUSS] Handling Arrow Versioning

I think there are a number of questions around Arrow versioning:

  1. Should recipes be based on the latest released version of the implementation? Or should they be based on the nightly build or the latest commit?
  2. Should there be recipes that use deprecated methods? What about methods that have been removed (presumably after having been deprecated for some time)?
  3. How should we mark features that were only newly added? Do we just rely on the compatibility matrix or do we specifically call it out in the cookbook (e.g. "Since 5.0.0")?
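
One possible convention for point 3, sketched in Python (the version number and the use of the packaging library are assumptions, not an established cookbook pattern):

from packaging.version import Version
import pyarrow

# Guard a recipe that relies on a feature added in a newer release,
# rather than relying only on a compatibility matrix.
if Version(pyarrow.__version__) < Version("5.0.0"):
    raise RuntimeError("This recipe requires pyarrow >= 5.0.0")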

[R] tibble or data.frame?

The language is a bit loose in a few places. I'd suggest at least noting that tibble is not required, but that the data frames returned by arrow include the tibble class attributes. So if you use tibble, they'll print and otherwise behave like tibbles. Otherwise, they're just data.frames and that's fine.

You may also want to consider having (at least) your first examples use data.frame(...) to make clear that tibble is not required. (This is a cookbook, and (food) recipes often allow for substitutions or tell you ways you can do variations, right?)

gh-pages or Apache hosting?

It appears that in addition to gh-pages we can use Apache hosting. The only real difference would be the URLs.

https://apache.github.io/arrow-cookbook
https://arrow.apache.org/cookbook

However, the latter approach may require some synchronization with the main Arrow repository (I'm not 100% sure if it is sufficient to just make sure the main arrow site doesn't have a cookbook directory). We might need to ask Infra, but I'd rather test if there is interest before doing that.

Allow cookbook to pair build version with Arrow release version

As more features are added to Apache Arrow, we might want to build versions of the cookbook that are relevant to that release.

I'm not sure what a good strategy would be in terms of adding content, e.g. is the main branch the latest version, and we create branches for releases, and then cherry-pick any commits which are relevant to multiple cookbook versions? And how many versions should we have, in terms of major/minor releases?

Change builds to use prebuilt binaries

Right now the CI build is pretty slow and a fair amount of time is spent downloading / installing build dependencies and building Arrow. We should be able to use prebuilt binaries for this.

This is somewhat related to #10. Do we grab the last nightly build or the last released build?
