julia-dataframes-tutorial's Introduction

An Introduction to DataFrames.jl

Bogumił Kamiński, February 13, 2023

The tutorial is for DataFrames.jl 1.5.0

A brief introduction to basic usage of DataFrames.

The tutorial contains a specification of the project environment under which it should be run. To prepare this environment before using the tutorial notebooks, run the following command from the command line while in the project folder:

julia -e 'using Pkg; Pkg.activate("."); Pkg.instantiate()'
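
After the environment is instantiated, one way to open the tutorial notebooks is through IJulia, which is among the project dependencies (a sketch only; any Jupyter setup works equally well):

julia -e 'using Pkg; Pkg.activate("."); using IJulia; notebook(dir=".")'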

Tested under Julia 1.9.0. The project dependencies are the following:

  [69666777] Arrow v2.4.3
  [6e4b80f9] BenchmarkTools v1.3.2
  [336ed68f] CSV v0.10.9
  [324d7699] CategoricalArrays v0.10.7
  [8be319e6] Chain v0.5.0
  [944b1d66] CodecZlib v0.7.1
  [a93c6f00] DataFrames v1.5.0
  [1313f7d8] DataFramesMeta v0.13.0
  [5789e2e9] FileIO v1.16.0
  [da1fdf0e] FreqTables v0.4.5
  [7073ff75] IJulia v1.24.0
  [babc3d20] JDF v0.5.1
  [9da8a3cd] JLSO v2.7.0
  [b9914132] JSONTables v1.0.3
  [86f7a689] NamedArrays v0.9.6
  [2dfb63ee] PooledArrays v1.4.2
  [f3b207a7] StatsPlots v0.15.4
  [bd369af6] Tables v1.10.0
  [a5390f91] ZipFile v0.10.1
  [9a3f8284] Random
  [10745b16] Statistics v1.9.0

I will try to keep the material up to date as the packages evolve.

This tutorial covers DataFrames.jl and CategoricalArrays.jl, which constitute the core of the DataFrames.jl ecosystem, along with selected file reading and writing packages.

The last, extras, part mentions selected functionality of additional packages that I find useful for data manipulation; currently these are FreqTables.jl, DataFramesMeta.jl, and StatsPlots.jl.

TOC

File                    Topic
01_constructors.ipynb   Creating DataFrame and conversion
02_basicinfo.ipynb      Getting summary information
03_missingvalues.ipynb  Handling missing values
04_loadsave.ipynb       Loading and saving DataFrames
05_columns.ipynb        Working with columns of a DataFrame
06_rows.ipynb           Working with rows of a DataFrame
07_factors.ipynb        Working with categorical data
08_joins.ipynb          Joining DataFrames
09_reshaping.ipynb      Reshaping DataFrames
10_transforms.ipynb     Transforming DataFrames
11_performance.ipynb    Performance tips
12_pitfalls.ipynb       Possible pitfalls
13_extras.ipynb         Additional interesting packages

Changelog:

Date Changes
2017-12-05 Initial release
2017-12-06 Added description of insert!, merge!, empty!, categorical!, delete!, DataFrames.index
2017-12-09 Added performance tips
2017-12-10 Added pitfalls
2017-12-18 Added additional worthwhile packages: FreqTables and DataFramesMeta
2017-12-29 Added description of filter and filter!
2017-12-31 Added description of conversion to Matrix
2018-04-06 Added example of extracting a row from a DataFrame
2018-04-21 Major update of whole tutorial
2018-05-01 Added byrow! example
2018-05-13 Added StatPlots package to extras
2018-05-23 Improved comments in sections 1 to 5 by Jane Herriman
2018-07-25 Update to 0.11.7 release
2018-08-25 Update to Julia 1.0 release: sections 1 to 10
2018-08-29 Update to Julia 1.0 release: sections 11, 12 and 13
2018-09-05 Update to Julia 1.0 release: FreqTables section
2018-09-10 Added CSVFiles section to chapter on load/save
2018-09-26 Updated to DataFrames 0.14.0
2018-10-04 Updated to DataFrames 0.14.1, added haskey and repeat
2018-12-08 Updated to DataFrames 0.15.2
2019-01-03 Updated to DataFrames 0.16.0, added serialization instructions
2019-01-18 Updated to DataFrames 0.17.0, added passmissing
2019-01-27 Added Feather.jl file read/write
2019-01-30 Renamed StatPlots.jl to StatsPlots.jl and added Tables.jl
2019-02-08 Added groupvars and groupindices functions
2019-04-27 Updated to DataFrames 0.18.0, dropped JLD2.jl
2019-04-30 Updated handling of missing values description
2019-07-16 Updated to DataFrames 0.19.0
2019-08-14 Added JSONTables.jl and Tables.columnindex
2019-08-16 Added Project.toml and Manifest.toml
2019-08-26 Update to Julia 1.2 and DataFrames 0.19.3
2019-08-29 Add example how to compress/decompress CSV file using CodecZlib
2019-08-30 Add examples of JLSO.jl and ZipFile.jl by xiaodaigh
2019-11-03 Add examples of JDF.jl by xiaodaigh
2019-12-08 Updated to DataFrames 0.20.0
2020-05-06 Updated to DataFrames 0.21.0 (except load/save and extras)
2020-11-20 Updated to DataFrames 0.22.0 (except DataFramesMeta.jl which does not work yet)
2020-11-26 Updated to DataFramesMeta.jl 0.6; update by @pdeffebach
2021-05-15 Updated to DataFrames.jl 1.1.1
2021-05-15 Updated to DataFrames.jl 1.2 and DataFramesMeta.jl 0.8, added Chain.jl instead of Pipe.jl
2021-12-12 Updated to DataFrames.jl 1.3
2022-10-05 Updated to DataFrames.jl 1.4
2023-02-13 Updated to DataFrames.jl 1.5

Core functions summary

  1. Constructors: DataFrame, DataFrame!, Tables.rowtable, Tables.columntable, Matrix, eachcol, eachrow, Tables.namedtupleiterator, empty, empty!
  2. Getting summary: size, nrow, ncol, describe, names, eltypes, first, last, getindex, setindex!, @view, isapprox, metadata, metadata!, colmetadata, colmetadata!
  3. Handling missing: missing (singleton instance of Missing), ismissing, nonmissingtype, skipmissing, replace, replace!, coalesce, allowmissing, disallowmissing, allowmissing!, disallowmissing!, completecases, dropmissing, dropmissing!, passmissing
  4. Loading and saving: CSV (package), CSVFiles (package), Serialization (module), CSV.read, CSV.write, save, load, serialize, deserialize, Arrow.write, Arrow.Table (from Arrow.jl package), JSONTables (package), arraytable, objecttable, jsontable, CodecZlib (module), GzipCompressorStream, GzipDecompressorStream, JDF.jl (package), JDF.save, JDF.load, JLSO.jl (package), JLSO.save, JLSO.load, ZipFile.jl (package), ZipFile.reader, ZipFile.writer, ZipFile.addfile
  5. Working with columns: rename, rename!, hcat, insertcols!, categorical!, columnindex, hasproperty, select, select!, transform, transform!, combine, Not, All, Between, ByRow, AsTable
  6. Working with rows: sort!, sort, issorted, append!, vcat, push!, view, filter, filter!, deleteat!, unique, nonunique, unique!, allunique, repeat, parent, parentindices, flatten, @chain (from Chain.jl package), only, subset, subset!, shuffle, prepend!, pushfirst!, insert!, keepat!
  7. Working with categorical: categorical, cut, isordered, ordered!, levels, unique, levels!, droplevels!, unwrap, recode, recode!
  8. Joining: innerjoin, leftjoin, leftjoin!, rightjoin, outerjoin, semijoin, antijoin, crossjoin
  9. Reshaping: stack, unstack
  10. Transforming: groupby, mapcols, parent, groupcols, valuecols, groupindices, keys (for GroupedDataFrame), combine, select, select!, transform, transform!, @chain (from Chain.jl package)
  11. Extras:
    • FreqTables: freqtable, prop, Name
    • DataFramesMeta: @with, @subset, @select, @transform, @orderby, @by, @combine, @eachrow, @newcol, ^, $
    • StatsPlots: @df, plot, density, histogram, boxplot, violin

julia-dataframes-tutorial's People

Contributors

albandecrevoisier, arnaudh, bkamins, carstenbauer, derekmahar, henry2004y, pdeffebach, sbacelar, strickek, xiaodaigh, xorjane, y1my1

julia-dataframes-tutorial's Issues

Converting Dataframe row to raw Julia vector?

Could you please add an example to the rows tutorial explaining how users can convert DataFrame rows into Julia vectors or arrays?

I was used to doing vec(convert(Array, df[row,cols])), but this doesn't work anymore because of the introduction of Missing Unions.
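
A minimal sketch of one way to do this in current DataFrames.jl (not taken from the tutorial itself): a DataFrameRow is iterable, so collect materializes its values as a plain vector.

using DataFrames

df = DataFrame(a = 1:3, b = 4:6)

row = df[1, :]              # a DataFrameRow (view into the first row)
collect(row)                # plain Julia vector of the row's values
collect(df[2, [:a, :b]])    # same idea for a subset of columns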

coalesce example doesn't work

coalesce no longer works as you describe. It just returns the first argument that is not missing. Perhaps you left out map(...)?

Sadly, I had to do:

df[ismissing.(df.mycol),:mycol] = "" # or whatever substitute value is appropriate.
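
For reference, a short sketch of the element-wise approach that is commonly used now (the column name mycol is taken from the snippet above): broadcasting coalesce replaces every missing entry with the fallback value.

using DataFrames

df = DataFrame(mycol = ["a", missing, "b"])

# element-wise coalesce: each missing entry is replaced by ""
df.mycol = coalesce.(df.mycol, "")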

update DataFramesMeta.jl

@pdeffebach - it would be great if you found time to update notebook 13 with DataFramesMeta.jl examples once it is up to the DataFrames.jl 0.22 release. Feel free to remove/change everything - as now you know best what things should be shown there.

Would you be willing to do this? Thank you!

Creating a DataFrame from a vector of vectors of different lengths

Thank you for these tutorials, they are really helpful. Today I came across this task and I wonder if you have a good solution for it that could be added to the repo:

using DataFrames

cols = [rand(i) for i in 1:5]
DataFrame(Dict("col$i" => cols[i] for i in 1:length(cols)))

ERROR: DimensionMismatch("column length 1 for column(s) col1 is incompatible with column length 2 for column(s) col2 is incompatible with column length 3 for column(s) col3 is incompatible with column length 4 for column(s) col4, and is incompatible with column length 5 for column(s) col5")

The expected result is a data frame with 5 rows where the shorter columns are padded with missing at the end.
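
One possible approach (a sketch, not part of the tutorial): pad the shorter vectors with missing up to the length of the longest one before constructing the DataFrame.

using DataFrames

cols = [rand(i) for i in 1:5]
maxlen = maximum(length.(cols))

# pad each vector with `missing` so that all columns have the same length
padded = [vcat(c, fill(missing, maxlen - length(c))) for c in cols]
df = DataFrame(Dict("col$i" => padded[i] for i in eachindex(padded)))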

"transform" function undefined

Hi, I'm a beginner Julia user, so please forgive me for any mistakes.

I can't use the transform function; it throws an UndefVarError (and the function is nowhere to be found).

I'm using Julia 1.5.3 and DataFrames 0.20.2

Steps to reproduce:

  1. Import DataFrames with using DataFrames
  2. Run transform

Thanks in advance.
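
A likely explanation (an assumption, not confirmed in the issue thread): transform was only added to DataFrames.jl in the 0.21 release, so it does not exist in 0.20.2. Upgrading the package makes it available:

using Pkg
Pkg.update("DataFrames")    # transform requires DataFrames.jl >= 0.21

using DataFrames
df = DataFrame(a = 1:3)
transform(df, :a => ByRow(x -> 2x) => :b)   # adds column :b with doubled values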

Performance Tips for Categorical Data

In your performance tips at In [8] you mentioned:

Allowing missing as well as categorical slows down computations

I might have an idea on how to fix this (at least partially). I have not researched whether any work has already been done on this. If we compute the count map on the references into the pool, we only operate on unsigned integers:

julia> function StatsBase.countmap(x::CategoricalArray; alg = :auto)
           cm = countmap(x.refs, alg=alg)
           Dict(x.pool[ref] => count for (ref, count) in cm)
       end

julia> x = rand(1:10, 10^6);

julia> y = categorical(x);

julia> z = compress(y);

julia> @btime countmap(x);
  2.927 ms (8 allocations: 7.63 MiB)

julia> @btime countmap(y);
  2.141 ms (14 allocations: 3.82 MiB)

julia> @btime countmap(z);
  541.943 μs (11 allocations: 3.27 KiB)

On top of that, if we guard x.pool[ref] for ref == 0, which corresponds to a missing value, we also fix the performance loss for missing values.

My question is: where should I propose and contribute a change like the above, CategoricalArrays.jl, StatsBase.jl, or somewhere else? Neither of the two has the other as a dependency, but one would have to be added to allow dispatching on these types (I think; I'm still a Julia novice), so maybe it should go into DataFrames.jl?
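
A hypothetical sketch of the guard described above (the helper name countmap_refs is made up for illustration and is not part of either package): reference 0 encodes missing, so it is mapped back explicitly instead of being looked up in the pool.

using StatsBase, CategoricalArrays

# count via the integer references; ref 0 marks a missing value
function countmap_refs(x::CategoricalArray)
    cm = countmap(x.refs)
    Dict((ref == 0 ? missing : x.pool[ref]) => n for (ref, n) in cm)
end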

Just wanted to say....

this is AWESOME. Thank you SO MUCH for putting the time into putting this together for the community!

JDF dependency errored

julia version : 1.6.2 (2021-07-14)

Precompiling project...
  ✗ JDF
  125 dependencies successfully precompiled in 76 seconds (71 already precompiled)
  1 dependency errored. To see a full report either run `import Pkg; Pkg.precompile()` or load the package

I've only been using Julia for a bit over a week, so not sure where to go from here. Happy to answer questions...

Custom index type

Given a dataset like:

        Fat   Sodium   Carbonates   Protein   total
soy     7     14       10           13        100
peas    3     12       61           22        100
wheat   1     15       60           23        100
corn    1     180      15           2         82
beans   0.5   680      30           12        NA

How can it be loaded so that the first column is interpreted as an index column, and how can the resulting data frame be manipulated? Is this possible with the current state of DataFrames.jl?
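
For what it is worth, a minimal sketch of the usual workaround (an assumption about intent, since DataFrames.jl has no dedicated row index): keep the labels as an ordinary column and look rows up by filtering on it.

using DataFrames

df = DataFrame(name = ["soy", "peas", "wheat", "corn", "beans"],
               Fat = [7, 3, 1, 1, 0.5],
               Sodium = [14, 12, 15, 180, 680],
               Carbonates = [10, 61, 60, 15, 30],
               Protein = [13, 22, 23, 2, 12],
               total = [100, 100, 100, 82, missing])

df[df.name .== "soy", :]               # look a row up by its label
only(df[df.name .== "corn", :Sodium])  # extract a single value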

Buggy example?

Hello and thanks for these wonderful tutorials.

I was trying to run the following example from the tutorial (I use Pluto, Julia 1.9, and DataFrames 1.5.0):

x = DataFrame(id=rand('a':'d', 100), v=rand(100))
combine(groupby(x, :id)) do sdf
    n = nrow(sdf)
    n < 25 ? DataFrame() : DataFrame(n=n) # drop groups with low number of rows
end

And got the following error:

UndefVarError: nrow not defined

_combine(::DataFrames.GroupedDataFrame{DataFrames.DataFrame}, ::Vector{Any}, ::Vector{Bool}, ::Bool, ::Bool, ::Bool, ::Bool)@splitapplycombine.jl:739
_combine_prepare_norm(::DataFrames.GroupedDataFrame{DataFrames.DataFrame}, ::Vector{Any}, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool)@splitapplycombine.jl:86
var"#_combine_prepare#671"(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(DataFrames._combine_prepare), ::DataFrames.GroupedDataFrame{DataFrames.DataFrame}, ::Base.RefValue{Any})@splitapplycombine.jl:51
[email protected]:25[inlined]
#combine#[email protected]:845[inlined]
[email protected]:845[inlined]
#combine#[email protected]:830[inlined]
[email protected]:824[inlined]
top-level scope@Local: 1[inlined]

Then I tried to fix it:

combine(groupby(x, :id)) do sdf
    n = size(sdf)[1]
    n < 25 ? DataFrame() : DataFrame(n=n) # drop groups with low number of rows
end

And got this.

UndefVarError: DataFrame not defined

_combine(::DataFrames.GroupedDataFrame{DataFrames.DataFrame}, ::Vector{Any}, ::Vector{Bool}, ::Bool, ::Bool, ::Bool, ::Bool)@splitapplycombine.jl:739
_combine_prepare_norm(::DataFrames.GroupedDataFrame{DataFrames.DataFrame}, ::Vector{Any}, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool)@splitapplycombine.jl:86
var"#_combine_prepare#671"(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(DataFrames._combine_prepare), ::DataFrames.GroupedDataFrame{DataFrames.DataFrame}, ::Base.RefValue{Any})@splitapplycombine.jl:51
[email protected]:25[inlined]
#combine#[email protected]:845[inlined]
[email protected]:845[inlined]
#combine#[email protected]:830[inlined]
[email protected]:824[inlined]
top-level scope@Local: 1[inlined]

The example is of particular interest to me, since my problem is precisely to drop groups with fewer than a certain number of rows; a minimal alternative sketch is shown below.
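
One possible workaround (a sketch, assuming it is the anonymous do block that trips Pluto up; this is not a confirmed fix): count the rows per group with combine and filter the result, which avoids calling nrow and DataFrame inside the do block.

using DataFrames

x = DataFrame(id = rand('a':'d', 100), v = rand(100))

# count rows per group, then keep only groups with at least 25 rows
counts = combine(groupby(x, :id), nrow => :n)
filter(:n => n -> n >= 25, counts)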

Thanks!

Help with complex groupby and combine in efficient way

Here is a sample dataset with 7.5 million rows

using Chain, DataFrames, Dates

# some ids have single date entries
data = DataFrame(
    "unique_id" => [i for i in 1:1500000], 
    "datestamp" => [Date("2021-08-16") for i in 1:1500000],
)

# some ids have 2 date entries
data = vcat(data, DataFrame(
    "unique_id" => repeat(1500001:4500000, 2), 
    "datestamp" => vcat(repeat([Date("2021-08-16")],3000000), repeat([Date("2021-08-22")],3000000)), 
))

# a lot of columns that need to be processed
real_number_columns = [string(c)*"_"*string(i) for c in 'a':'z' for i in 1:4]
for c in real_number_columns
    data[!, c] = rand(7500000)
end
# some dimension
data[!, "dimension"] = rand([string(c) for c in 'A':'Z'], 7500000)
size(data)

This gives us a dataset of size (7500000, 107).

I'm looking to process this large dataset as fast as possible; currently it takes around 40 minutes without any parallel processing. The resulting data frame should have:

  1. unique_id
  2. real_number_columns with suffix "_DIFF": the difference of the values across the two datestamps when unique_id has 2 datestamps, or missing otherwise
  3. real_number_columns with suffix "_RECENT": the most recent value when unique_id has 1 or 2 datestamps
  4. just the most recent dimension
# get the difference between real_number_columns across dates if 2 dates exist
df = @chain data begin
    transform!(:, :datestamp => ByRow(d -> d == Date("2021-08-22") ? true : false) => :recent_date)
    groupby(["unique_id"])
    combine(df1 -> begin
        df1_max = nothing
        df1_min = nothing
        try
            d = Dict()
            if nrow(df1) == 1
                d["day_diff"] = [0]
                df1_max = df1[1, :]
                df1_min = df1[1, :]
            else
                d["day_diff"] = [6]
                df1_max = df1[findfirst(df1.recent_date), :]
                df1_min = df1[findfirst(.!df1.recent_date), :]
            end
            d["dimension"] = [df1_max.dimension]
            for m in real_number_columns
                d[m*"_RECENT"] = [df1_max[m]]
                if nrow(df1) > 1
                    d[m*"_DIFF"] = [df1_max[m] - df1_min[m]]
                else
                    d[m*"_DIFF"] = [missing]
                end
            end
            DataFrame(d)
        catch e
            @error "Something went wrong" exception=(e, catch_backtrace())
            rethrow()
        end
    end)
end

Questions

  1. How can one make this processing more efficient? (One possible direction is sketched after this list.)
  2. Can we get faster processing time at the expense of memory efficiency (say, on a device with around 1 TB of RAM)?
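
One possible direction (a sketch of an alternative approach, not the issue author's solution and not benchmarked here): after sorting by unique_id and datestamp, combine can operate on whole columns per group, which avoids building a Dict and a one-row DataFrame for every group.

using DataFrames, Dates

# assumes `data` and `real_number_columns` are defined as in the snippets above
sort!(data, [:unique_id, :datestamp])
gd = groupby(data, :unique_id)

df = combine(gd,
    :dimension => last => :dimension,
    [c => last => c * "_RECENT" for c in real_number_columns]...,
    [c => (v -> length(v) == 2 ? v[2] - v[1] : missing) => c * "_DIFF"
        for c in real_number_columns]...)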
