apache / arrow-cookbook

Apache Arrow Cookbook

Home Page: https://arrow.apache.org/

License: Apache License 2.0

Languages: C++ 47.22% · Python 20.09% · Makefile 7.94% · HTML 5.55% · Shell 5.51% · R 5.47% · Batchfile 4.96% · CMake 3.26%

arrow-cookbook's Introduction

Apache Arrow Cookbooks

Cookbooks are a collection of recipes for common tasks that Arrow users might want to perform. The cookbook is actually composed of multiple cookbooks, one for each supported platform, each containing the recipes for that specific platform.

The cookbook aims to provide immediate instructions for common tasks, in contrast with the Arrow User Guides, which provide in-depth explanations. In terms of the Diátaxis framework, the cookbook is task-oriented while the user guide is learning-oriented. The cookbook will often refer to the user guide for deeper explanation.

All cookbooks can be built to HTML and verified by running a set of tests that confirm the recipes still work as expected.

Each cookbook is implemented using platform-specific tools. For this reason, a Makefile is provided that abstracts platform-specific concerns and makes it possible to build and test all cookbooks without any platform-specific knowledge (as long as the dependencies are available on the target system).

See https://arrow.apache.org/cookbook/ for the latest published version using the latest stable version of Apache Arrow. See https://arrow.apache.org/cookbook/dev for the latest published version using the development version of Apache Arrow.

Building All Cookbooks

make all

Testing All Cookbooks

make test

Listing Available Commands

make help

Building a Platform-Specific Cookbook

Refer to make help to learn the commands that build or test the cookbook for the platform you are targeting.

Prerequisites

Both the R and Python cookbooks will try to install the dependencies they need (including the latest pyarrow/Arrow R package versions). This means that as long as you have a working Python/R environment that can install dependencies through the respective package manager, you shouldn't need to install anything manually.
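
If you want to sanity-check your Python environment before building, a quick import check is enough. This is illustrative only; the Makefile handles dependency installation itself:

import pyarrow  # fails here if the environment cannot install/resolve pyarrow

# Print the installed version to confirm which Arrow release the build will use
print(pyarrow.__version__)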

Contributing to the Cookbook

Please refer to the CONTRIBUTING.md file for instructions about how to contribute to the Apache Arrow Cookbook.


All participation in the Apache Arrow project is governed by the Apache Software Foundation’s code of conduct.

arrow-cookbook's People

Contributors

alamb · alenkaf · alistaire47 · amoeba · amol- · benjaminwolfe · davisusanibar · dependabot[bot] · dgreiss · drabastomek · humbedooh · johnmackintosh · jorisvandenbossche · kou · lidavidm · liry · lwhite1 · nealrichardson · nlte · paliwalashish · raulcd · stephhazlitt · thatstatsguy · thisisnic · toddfarmer · tonyfujs · vibhatha · wesm · westonpace · wjones127


arrow-cookbook's Issues

Difference between cookbook and user guide docs

I'm looking at this repo for the first time and am surprised to see a lot of overlap with the user guide. For example, how to read a CSV file in Python. My prior expectation was that recipes were common but non-trivial uses of Arrow (for example, creating a sub-sample of an Arrow table), rather than simple examples of functionality like those shown in the user guide.

How do we define what a recipe is here? And what's the relationship between the user guide and the cookbook?
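
For context, the kind of basic recipe the issue refers to is a one-liner in pyarrow (the file name here is hypothetical):

import pyarrow.csv

# Read a CSV file into an Arrow Table -- the sort of simple example
# that also appears in the user guide.
table = pyarrow.csv.read_csv("example.csv")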

[R] Add content on Tables vs. Datasets

There are fundamental differences between working with Tables/InMemoryDatasets and file-based datasets. There should be content about working with datasets that has an intro covering those differences, plus recipes for working with them.

Possible topics:

  • Adding new data to a dataset
  • Link to reading/writing sections
  • Mention that only the relevant data is read in

This content could potentially be part of other chapters, but it is definitely needed.
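
For reference, a minimal Python sketch of the Table vs. Dataset distinction that the R content would parallel (the directory name is hypothetical):

import pyarrow as pa
import pyarrow.dataset as ds

# An in-memory Table: all data is materialized at once.
table = pa.table({"x": [1, 2, 3]})

# A file-based Dataset: files are scanned lazily, and filters are
# pushed down so only the relevant data is read.
dataset = ds.dataset("some_dir/", format="parquet")
subset = dataset.to_table(filter=ds.field("x") > 1)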

Common example files across implementations

Do we want some common example data files that are used across all the implementations? For example, common files used in the various dataset API recipes. I don't really know how often people will be bouncing between languages or comparing them, though.

[R] Content feedback via DM

You’ll notice we’ve used collect() in the Arrow pipeline above. That’s because one of the ways in which arrow is efficient is that it works out the instructions for the calculations it needs to perform (expressions) and only runs them once you actually pull the data into your R session.

We might rephrase this to make it clear we mean the computation only happens when you trigger it, but that the computation happens in Arrow and not in R

It also means that you are able to manipulate data that is larger than you can fit into memory on the machine you’re running your code on, if you only pull data into R when you have selected the desired subset.

We also have the ability to operate on chunks of data, so you might not even need to subset it to be smaller than memory; you just need the compute kernels to finish with chunks that are smaller than memory. I'm not sure if we want/need to mention that here, just something to note.
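
For comparison, the same deferred, chunked execution is visible on the Python side (the dataset path is hypothetical):

import pyarrow.dataset as ds

# Nothing is read until the scan actually runs, and results stream back
# in batches, so the working set can stay smaller than memory.
dataset = ds.dataset("flights/", format="parquet")
for batch in dataset.to_batches(filter=ds.field("month") == 1):
    ...  # process each chunk independently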

You want to use a function which is implemented in Arrow's C++ library but either:

  • it doesn't have a mapping to a base R or tidyverse equivalent, or
  • it has a mapping but nevertheless you want to call the C++ function directly

It looks like the bullets aren’t being caught here (probably need a stupid extra new line somewhere)

[Python] Makefile: pytest target fails

I'm getting one failing test when running the make pytest target.

Document: io
------------
**********************************************************************
File "io.rst", line 799, in default
Failed example:
    dataset = ds.dataset("s3://ursa-labs-taxi-data/2011",
                         partitioning=["month"])
    for f in dataset.files[:10]:
        print(f)
    print("...")
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/doctest.py", line 1336, in __run
        exec(compile(example.source, filename, "single",
      File "<doctest default[0]>", line 1, in <module>
        dataset = ds.dataset("s3://ursa-labs-taxi-data/2011",
      File "/Users/nathanael.leaute/Documents/github/arrow-cookbook/venv/lib/python3.9/site-packages/pyarrow/dataset.py", line 655, in dataset
        return _filesystem_dataset(source, **kwargs)
      File "/Users/nathanael.leaute/Documents/github/arrow-cookbook/venv/lib/python3.9/site-packages/pyarrow/dataset.py", line 410, in _filesystem_dataset
        return factory.finish(schema)
      File "pyarrow/_dataset.pyx", line 2402, in pyarrow._dataset.DatasetFactory.finish
      File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
      File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
    OSError: Error creating dataset. Could not read schema from 'ursa-labs-taxi-data/2011/01/data.parquet': Could not open Parquet input source 'ursa-labs-taxi-data/2011/01/data.parquet': AWS Error [code 15]: Access Denied. Is this a 'parquet' file?
**********************************************************************
1 items had failures:
   1 of  27 in default
27 tests in 1 items.
26 passed and 1 failed.
***Test Failed*** 1 failures.

It seems like the ACL on the ursa-labs-taxi-data bucket doesn't allow public access. I don't know if you want to open up the bucket/prefix to the public and incur the AWS bandwidth costs, though. Those are definitely a thing.
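
If the bucket did allow public reads, one possible workaround would be to pass an explicitly anonymous S3 filesystem so local AWS credentials aren't picked up and rejected (the region here is an assumption):

import pyarrow.dataset as ds
from pyarrow import fs

# Anonymous access avoids sending the local credentials that triggered
# the Access Denied error above.
s3 = fs.S3FileSystem(anonymous=True, region="us-east-2")
dataset = ds.dataset("ursa-labs-taxi-data/2011",
                     filesystem=s3, partitioning=["month"])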

[R] Recipe for random sampling

It would be great if there were a way to sample from an Arrow dataset. I put together this somewhat hacky example, but I bet there's something a bit more elegant...

library(arrow)
library(dplyr)
library(nycflights13)

flights <- nycflights13::flights

# Add a row id so we can sample rows across the whole dataset
flights$id <- seq_len(nrow(flights))

# Write one Parquet file per month (create the directory first)
dir.create("flight_ds", showWarnings = FALSE)
for (i in unique(flights$month)) {
  out <- filter(flights, month == i)
  arrow::write_parquet(out, paste0("flight_ds/", i, ".parquet"))
}

ds <- arrow::open_dataset("flight_ds")

# Draw 100 random row ids, then filter the dataset down to them
sampled_ids <- sample(flights$id, 100)

ds %>%
  filter(id %in% sampled_ids) %>%
  collect()

Investigate PR templates

I'd like to see that PRs link to an issue (if applicable, I don't think creating an issue is necessary to submit a PR). Is there anything else we might want to make sure we include in a PR?

Define writing style

It seems there are some inconsistencies in the cookbook's writing style.
For example:

  • verb tense in titles: "Write a Parquet file" vs. "Reading a Parquet file"
  • capitalization: "IPC" vs. "ipc", "parquet" vs. "Parquet", "numpy", "Arrow array" vs. "Arrow Array", etc.

[C++] Add annotated gRPC + Flight service example

Add an example of a Flight and gRPC service coexisting on the same port. Document the caveats (namely: everything should link gRPC dynamically, not statically). Talk about why this is useful (including being able to customize low-level server options).

This can be done in Java as well, but not Python. We should also document setting gRPC client options (this can be done in Java, Python, and C++).

The caveats should be noted in the C++ docs as well (see ARROW-14662).

See apache/arrow#11657 where a user ran into this.
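
As a sketch of the client-side half, pyarrow already exposes generic gRPC channel arguments when connecting (the port and option value here are illustrative):

import pyarrow.flight as flight

# generic_options passes standard gRPC channel arguments through to the
# underlying channel; the option names are gRPC's, not Arrow's.
client = flight.connect(
    "grpc://localhost:8815",
    generic_options=[("grpc.max_receive_message_length", 64 * 1024 * 1024)],
)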

Separate CI jobs for Python and R cookbook implementations

Currently the cookbooks are built via make all; however, they have different dependencies, and if no changes have been made to a particular implementation, this makes builds take longer than necessary. The C++ build already has its own CI job (see #22); we could improve build/deploy time by doing the same for the Python and R cookbooks.

Update contributors guide with conventions on issue assignment

I'm not sure what the best way is to mark that someone is interested in working on an issue. In JIRA you can just assign an issue to yourself, but it does not seem that GitHub Issues has the same capability (you have to have write permission to be able to assign issues, I think). So maybe the best we can do is ask people to leave a comment if they start working on an issue?

Or, we can ask people to create an empty draft PR and reference the issue so that the PR is linked to the issue even if it doesn't have any associated work.

[DISCUSS] Handling Arrow Versioning

I think there are a number of questions around Arrow versioning:

  1. Should recipes be based on the latest released version of the implementation? Or should they be based on the nightly build or the latest commit?
  2. Should there be recipes that use deprecated methods? What about methods that have been removed (presumably after having been deprecated for some time)?
  3. How should we mark features that were only newly added? Do we just rely on the compatibility matrix or do we specifically call it out in the cookbook (e.g. "Since 5.0.0")?
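
One possible convention for point 3, sketched in Python (the version number and the use of the packaging library are assumptions, not an established cookbook pattern):

from packaging.version import Version
import pyarrow

# Guard a recipe that relies on a feature added in a newer release,
# rather than relying only on a compatibility matrix.
if Version(pyarrow.__version__) < Version("5.0.0"):
    raise RuntimeError("This recipe requires pyarrow >= 5.0.0")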

[R] tibble or data.frame?

The language is a bit loose in a few places. I'd suggest at least noting that tibble is not required, but that the data frames returned by arrow include the tibble class attributes. So if you use tibble, they'll print and otherwise behave like tibbles. Otherwise, they're just data.frames and that's fine.

You may also want to consider having (at least) your first examples use data.frame(...) to make clear that tibble is not required. (This is a cookbook, and (food) recipes often allow for substitutions or tell you ways you can do variations, right?)

gh-pages or Apache hosting?

It appears that in addition to gh-pages we can use Apache hosting. The only real difference would be the URLs.

https://apache.github.io/arrow-cookbook
https://arrow.apache.org/cookbook

However, the latter approach may require some synchronization with the main Arrow repository (I'm not 100% sure if it is sufficient to just make sure the main arrow site doesn't have a cookbook directory). We might need to ask Infra, but I'd rather test if there is interest before doing that.

Allow cookbook to pair build version with Arrow release version

As more features are added to Apache Arrow, we might want to build versions of the cookbook that are relevant to that release.

I'm not sure what a good strategy would be in terms of adding content, e.g. is the main branch the latest version, and we create branches for releases, and then cherry-pick any commits which are relevant to multiple cookbook versions? And how many versions should we have, in terms of major/minor releases?

Change builds to use prebuilt binaries

Right now the CI build is pretty slow and a fair amount of time is spent downloading / installing build dependencies and building Arrow. We should be able to use prebuilt binaries for this.

This is somewhat related to #10. Do we grab the last nightly build or the last released build?
