julia-dataframes-tutorial's Introduction

An Introduction to DataFrames.jl

Bogumił Kamiński, February 13, 2023

The tutorial is for DataFrames.jl 1.5.0

A brief introduction to basic usage of DataFrames.

The tutorial contains a specification of the project environment under which it should be run. To prepare this environment before using the tutorial notebooks, run the following command from the command line while in the project folder:

julia -e 'using Pkg; Pkg.activate("."); Pkg.instantiate()'
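
After the environment is instantiated, one way to open the tutorial notebooks is through IJulia, which is among the project dependencies (a sketch only; any Jupyter setup works equally well):

julia -e 'using Pkg; Pkg.activate("."); using IJulia; notebook(dir=".")'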

Tested under Julia 1.9.0. The project dependencies are the following:

  [69666777] Arrow v2.4.3
  [6e4b80f9] BenchmarkTools v1.3.2
  [336ed68f] CSV v0.10.9
  [324d7699] CategoricalArrays v0.10.7
  [8be319e6] Chain v0.5.0
  [944b1d66] CodecZlib v0.7.1
  [a93c6f00] DataFrames v1.5.0
  [1313f7d8] DataFramesMeta v0.13.0
  [5789e2e9] FileIO v1.16.0
  [da1fdf0e] FreqTables v0.4.5
  [7073ff75] IJulia v1.24.0
  [babc3d20] JDF v0.5.1
  [9da8a3cd] JLSO v2.7.0
  [b9914132] JSONTables v1.0.3
  [86f7a689] NamedArrays v0.9.6
  [2dfb63ee] PooledArrays v1.4.2
  [f3b207a7] StatsPlots v0.15.4
  [bd369af6] Tables v1.10.0
  [a5390f91] ZipFile v0.10.1
  [9a3f8284] Random
  [10745b16] Statistics v1.9.0

I will try to keep the material up to date as the packages evolve.

This tutorial covers DataFrames.jl and CategoricalArrays.jl, which constitute the core of the DataFrames.jl ecosystem, along with selected file reading and writing packages.

The last, extras, part mentions selected functionality of additional packages that I find useful for data manipulation; currently these are FreqTables.jl, DataFramesMeta.jl, and StatsPlots.jl.

TOC

File                    Topic
01_constructors.ipynb   Creating DataFrame and conversion
02_basicinfo.ipynb      Getting summary information
03_missingvalues.ipynb  Handling missing values
04_loadsave.ipynb       Loading and saving DataFrames
05_columns.ipynb        Working with columns of a DataFrame
06_rows.ipynb           Working with rows of a DataFrame
07_factors.ipynb        Working with categorical data
08_joins.ipynb          Joining DataFrames
09_reshaping.ipynb      Reshaping DataFrames
10_transforms.ipynb     Transforming DataFrames
11_performance.ipynb    Performance tips
12_pitfalls.ipynb       Possible pitfalls
13_extras.ipynb         Additional interesting packages

Changelog:

Date Changes
2017-12-05 Initial release
2017-12-06 Added description of insert!, merge!, empty!, categorical!, delete!, DataFrames.index
2017-12-09 Added performance tips
2017-12-10 Added pitfalls
2017-12-18 Added additional worthwhile packages: FreqTables and DataFramesMeta
2017-12-29 Added description of filter and filter!
2017-12-31 Added description of conversion to Matrix
2018-04-06 Added example of extracting a row from a DataFrame
2018-04-21 Major update of whole tutorial
2018-05-01 Added byrow! example
2018-05-13 Added StatPlots package to extras
2018-05-23 Improved comments in sections 1 to 5 by Jane Herriman
2018-07-25 Update to 0.11.7 release
2018-08-25 Update to Julia 1.0 release: sections 1 to 10
2018-08-29 Update to Julia 1.0 release: sections 11, 12 and 13
2018-09-05 Update to Julia 1.0 release: FreqTables section
2018-09-10 Added CSVFiles section to chapter on load/save
2018-09-26 Updated to DataFrames 0.14.0
2018-10-04 Updated to DataFrames 0.14.1, added haskey and repeat
2018-12-08 Updated to DataFrames 0.15.2
2019-01-03 Updated to DataFrames 0.16.0, added serialization instructions
2019-01-18 Updated to DataFrames 0.17.0, added passmissing
2019-01-27 Added Feather.jl file read/write
2019-01-30 Renamed StatPlots.jl to StatsPlots.jl and added Tables.jl
2019-02-08 Added groupvars and groupindices functions
2019-04-27 Updated to DataFrames 0.18.0, dropped JLD2.jl
2019-04-30 Updated handling of missing values description
2019-07-16 Updated to DataFrames 0.19.0
2019-08-14 Added JSONTables.jl and Tables.columnindex
2019-08-16 Added Project.toml and Manifest.toml
2019-08-26 Update to Julia 1.2 and DataFrames 0.19.3
2019-08-29 Add example how to compress/decompress CSV file using CodecZlib
2019-08-30 Add examples of JLSO.jl and ZipFile.jl by xiaodaigh
2019-11-03 Add examples of JDF.jl by xiaodaigh
2019-12-08 Updated to DataFrames 0.20.0
2020-05-06 Updated to DataFrames 0.21.0 (except load/save and extras)
2020-11-20 Updated to DataFrames 0.22.0 (except DataFramesMeta.jl which does not work yet)
2020-11-26 Updated to DataFramesMeta.jl 0.6; update by @pdeffebach
2021-05-15 Updated to DataFrames.jl 1.1.1
2021-05-15 Updated to DataFrames.jl 1.2 and DataFramesMeta.jl 0.8, added Chain.jl instead of Pipe.jl
2021-12-12 Updated to DataFrames.jl 1.3
2022-10-05 Updated to DataFrames.jl 1.4
2023-02-13 Updated to DataFrames.jl 1.5

Core functions summary

  1. Constructors: DataFrame, DataFrame!, Tables.rowtable, Tables.columntable, Matrix, eachcol, eachrow, Tables.namedtupleiterator, empty, empty!
  2. Getting summary: size, nrow, ncol, describe, names, eltypes, first, last, getindex, setindex!, @view, isapprox, metadata, metadata!, colmetadata, colmetadata!
  3. Handling missing: missing (singleton instance of Missing), ismissing, nonmissingtype, skipmissing, replace, replace!, coalesce, allowmissing, disallowmissing, allowmissing!, disallowmissing!, completecases, dropmissing, dropmissing!, passmissing
  4. Loading and saving: CSV (package), CSVFiles (package), Serialization (module), CSV.read, CSV.write, save, load, serialize, deserialize, Arrow.write, Arrow.Table (from Arrow.jl package), JSONTables (package), arraytable, objecttable, jsontable, CodecZlib (module), GzipCompressorStream, GzipDecompressorStream, JDF.jl (package), JDF.save, JDF.load, JLSO.jl (package), JLSO.save, JLSO.load, ZipFile.jl (package), ZipFile.reader, ZipFile.writer, ZipFile.addfile
  5. Working with columns: rename, rename!, hcat, insertcols!, categorical!, columnindex, hasproperty, select, select!, transform, transform!, combine, Not, All, Between, ByRow, AsTable
  6. Working with rows: sort!, sort, issorted, append!, vcat, push!, view, filter, filter!, deleteat!, unique, nonunique, unique!, allunique, repeat, parent, parentindices, flatten, @chain (from Chain.jl package), only, subset, subset!, shuffle, prepend!, pushfirst!, insert!, keepat!
  7. Working with categorical: categorical, cut, isordered, ordered!, levels, unique, levels!, droplevels!, unwrap, recode, recode!
  8. Joining: innerjoin, leftjoin, leftjoin!, rightjoin, outerjoin, semijoin, antijoin, crossjoin
  9. Reshaping: stack, unstack
  10. Transforming: groupby, mapcols, parent, groupcols, valuecols, groupindices, keys (for GroupedDataFrame), combine, select, select!, transform, transform!, @chain (from Chain.jl package)
  11. Extras:
    • FreqTables: freqtable, prop, Name
    • DataFramesMeta: @with, @subset, @select, @transform, @orderby, @by, @combine, @eachrow, @newcol, ^, $
    • StatsPlots: @df, plot, density, histogram, boxplot, violin

julia-dataframes-tutorial's People

Contributors

albandecrevoisier, arnaudh, bkamins, carstenbauer, derekmahar, henry2004y, pdeffebach, sbacelar, strickek, xiaodaigh, xorjane, y1my1

julia-dataframes-tutorial's Issues

Converting Dataframe row to raw Julia vector?

Could you please add an example to the rows tutorial explaining how users can convert DataFrame rows into Julia vectors or arrays?

I was used to doing vec(convert(Array, df[row,cols])), but this doesn't work anymore because of the introduction of Missing Unions.
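
A minimal sketch of one way to do this in current DataFrames.jl (not taken from the tutorial itself): a DataFrameRow is iterable, so collect materializes its values as a plain vector.

using DataFrames

df = DataFrame(a = 1:3, b = 4:6)

row = df[1, :]              # a DataFrameRow (view into the first row)
collect(row)                # plain Julia vector of the row's values
collect(df[2, [:a, :b]])    # same idea for a subset of columns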

coalesce example doesn't work

coalesce no longer works as you describe. It just returns the first argument that is not missing. Perhaps you left out map(...)?

Sadly, I had to do:

df[ismissing.(df.mycol),:mycol] = "" # or whatever substitute value is appropriate.
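
For reference, a short sketch of the element-wise approach that is commonly used now (the column name mycol is taken from the snippet above): broadcasting coalesce replaces every missing entry with the fallback value.

using DataFrames

df = DataFrame(mycol = ["a", missing, "b"])

# element-wise coalesce: each missing entry is replaced by ""
df.mycol = coalesce.(df.mycol, "")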

update DataFramesMeta.jl

@pdeffebach - it would be great if you found time to update notebook 13 with DataFramesMeta.jl examples once it is up to the DataFrames.jl 0.22 release. Feel free to remove/change everything - as now you know best what things should be shown there.

Would you be willing to do this? Thank you!

Creating a DataFrame from a vector of vectors of different lengths

Thank you for these tutorials, they are really helpful. Today I came across this task and I wonder if you have a good solution for it that could be added to the repo:

using DataFrames

cols = [rand(i) for i in 1:5]
DataFrame(Dict("col$i" => cols[i] for i in 1:length(cols)))

ERROR: DimensionMismatch("column length 1 for column(s) col1 is incompatible with column length 2 for column(s) col2 is incompatible with column length 3 for column(s) col3 is incompatible with column length 4 for column(s) col4, and is incompatible with column length 5 for column(s) col5")

The expected result is a data frame with 5 rows where the shorter columns are padded with missing at the end.
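
One possible approach (a sketch, not part of the tutorial): pad the shorter vectors with missing up to the length of the longest one before constructing the DataFrame.

using DataFrames

cols = [rand(i) for i in 1:5]
maxlen = maximum(length.(cols))

# pad each vector with `missing` so that all columns have the same length
padded = [vcat(c, fill(missing, maxlen - length(c))) for c in cols]
df = DataFrame(Dict("col$i" => padded[i] for i in eachindex(padded)))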

"transform" function undefined

Hi, I'm a beginner Julia user, so please forgive me for any mistakes.

I can't use the transform function; it throws an UndefVarError (and the function is nowhere to be found).

I'm using Julia 1.5.3 and DataFrames 0.20.2

Steps to reproduce:

  1. Import DataFrames with using DataFrames
  2. Run transform

Thanks in advance.
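
A likely explanation (an assumption, not confirmed in the issue thread): transform was only added to DataFrames.jl in the 0.21 release, so it does not exist in 0.20.2. Upgrading the package makes it available:

using Pkg
Pkg.update("DataFrames")    # transform requires DataFrames.jl >= 0.21

using DataFrames
df = DataFrame(a = 1:3)
transform(df, :a => ByRow(x -> 2x) => :b)   # adds column :b with doubled values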

Performance Tips for Categorical Data

In your performance tips at In [8] you mentioned:

Allowing missing as well as categorical slows down computations

I might have an idea on how to fix this (at least partially). I have not researched whether any work has already been done on this. If we compute the count map on the references into the pool, we only operate on unsigned integers:

julia> function StatsBase.countmap(x::CategoricalArray; alg = :auto)
           cm = countmap(x.refs, alg=alg)
           Dict(x.pool[ref] => count for (ref, count) in cm)
       end

julia> x = rand(1:10, 10^6);

julia> y = categorical(x);

julia> z = compress(y);

julia> @btime countmap(x);
  2.927 ms (8 allocations: 7.63 MiB)

julia> @btime countmap(y);
  2.141 ms (14 allocations: 3.82 MiB)

julia> @btime countmap(z);
  541.943 μs (11 allocations: 3.27 KiB)

On top of that, if we guard x.pool[ref] for ref == 0, which corresponds to a missing value, we also fix the performance loss for missing values.

My question is: where should I propose and contribute a change like the above, CategoricalArrays.jl, StatsBase.jl, or somewhere else? Neither of the two has the other as a dependency, but one would have to be added to allow dispatching on these types (I think; I'm still a Julia novice), so maybe it should go into DataFrames.jl?
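
A hypothetical sketch of the guard described above (the helper name countmap_refs is made up for illustration and is not part of either package): reference 0 encodes missing, so it is mapped back explicitly instead of being looked up in the pool.

using StatsBase, CategoricalArrays

# count via the integer references; ref 0 marks a missing value
function countmap_refs(x::CategoricalArray)
    cm = countmap(x.refs)
    Dict((ref == 0 ? missing : x.pool[ref]) => n for (ref, n) in cm)
end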

Just wanted to say....

this is AWESOME. Thank you SO MUCH for putting the time into putting this together for the community!

JDF dependency errored

julia version : 1.6.2 (2021-07-14)

Precompiling project...
  ✗ JDF
  125 dependencies successfully precompiled in 76 seconds (71 already precompiled)
  1 dependency errored. To see a full report either run `import Pkg; Pkg.precompile()` or load the package

I've only been using Julia for a bit over a week, so not sure where to go from here. Happy to answer questions...

Custom index type

Given a dataset like:

        Fat   Sodium   Carbonates   Protein   total
soy     7     14       10           13        100
peas    3     12       61           22        100
wheat   1     15       60           23        100
corn    1     180      15           2         82
beans   0.5   680      30           12        NA

How can it be loaded so that the first column is interpreted as an index column, and how can the resulting data frame be manipulated? Is this possible with the current state of DataFrames.jl?
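
For what it is worth, a minimal sketch of the usual workaround (an assumption about intent, since DataFrames.jl has no dedicated row index): keep the labels as an ordinary column and look rows up by filtering on it.

using DataFrames

df = DataFrame(name = ["soy", "peas", "wheat", "corn", "beans"],
               Fat = [7, 3, 1, 1, 0.5],
               Sodium = [14, 12, 15, 180, 680],
               Carbonates = [10, 61, 60, 15, 30],
               Protein = [13, 22, 23, 2, 12],
               total = [100, 100, 100, 82, missing])

df[df.name .== "soy", :]               # look a row up by its label
only(df[df.name .== "corn", :Sodium])  # extract a single value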

Buggy example?

Hello and thanks for these wonderful tutorials.

I was trying to run the following example from the tutorial (I use Pluto, Julia 1.9, and DataFrames 1.5.0):

x = DataFrame(id=rand('a':'d', 100), v=rand(100))
combine(groupby(x, :id)) do sdf
    n = nrow(sdf)
    n < 25 ? DataFrame() : DataFrame(n=n) # drop groups with low number of rows
end

And got the following error:

UndefVarError: nrow not defined

_combine(::DataFrames.GroupedDataFrame{DataFrames.DataFrame}, ::Vector{Any}, ::Vector{Bool}, ::Bool, ::Bool, ::Bool, ::Bool)@splitapplycombine.jl:739
_combine_prepare_norm(::DataFrames.GroupedDataFrame{DataFrames.DataFrame}, ::Vector{Any}, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool)@splitapplycombine.jl:86
var"#_combine_prepare#671"(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(DataFrames._combine_prepare), ::DataFrames.GroupedDataFrame{DataFrames.DataFrame}, ::Base.RefValue{Any})@splitapplycombine.jl:51
[email protected]:25[inlined]
#combine#[email protected]:845[inlined]
[email protected]:845[inlined]
#combine#[email protected]:830[inlined]
[email protected]:824[inlined]
top-level scope@Local: 1[inlined]

Then I tried to fix it:

combine(groupby(x, :id)) do sdf
    n = size(sdf)[1]
    n < 25 ? DataFrame() : DataFrame(n=n) # drop groups with low number of rows
end

And got this.

UndefVarError: DataFrame not defined

_combine(::DataFrames.GroupedDataFrame{DataFrames.DataFrame}, ::Vector{Any}, ::Vector{Bool}, ::Bool, ::Bool, ::Bool, ::Bool)@splitapplycombine.jl:739
_combine_prepare_norm(::DataFrames.GroupedDataFrame{DataFrames.DataFrame}, ::Vector{Any}, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool)@splitapplycombine.jl:86
var"#_combine_prepare#671"(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(DataFrames._combine_prepare), ::DataFrames.GroupedDataFrame{DataFrames.DataFrame}, ::Base.RefValue{Any})@splitapplycombine.jl:51
[email protected]:25[inlined]
#combine#[email protected]:845[inlined]
[email protected]:845[inlined]
#combine#[email protected]:830[inlined]
[email protected]:824[inlined]
top-level scope@Local: 1[inlined]

The example is of particular interest to me, since my problem is precisely to drop groups with fewer than a certain number of rows; a minimal alternative sketch is shown below.
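
One possible workaround (a sketch, assuming it is the anonymous do block that trips Pluto up; this is not a confirmed fix): count the rows per group with combine and filter the result, which avoids calling nrow and DataFrame inside the do block.

using DataFrames

x = DataFrame(id = rand('a':'d', 100), v = rand(100))

# count rows per group, then keep only groups with at least 25 rows
counts = combine(groupby(x, :id), nrow => :n)
filter(:n => n -> n >= 25, counts)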

Thanks!

Help with complex groupby and combine in efficient way

Here is a sample dataset with 7.5 million rows

using Chain, DataFrames, Dates

# some ids have single date entries
data = DataFrame(
    "unique_id" => [i for i in 1:1500000], 
    "datestamp" => [Date("2021-08-16") for i in 1:1500000],
)

# some ids have 2 date entries
data = vcat(data, DataFrame(
    "unique_id" => repeat(1500001:4500000, 2), 
    "datestamp" => vcat(repeat([Date("2021-08-16")],3000000), repeat([Date("2021-08-22")],3000000)), 
))

# a lot of columns that need to be processed
real_number_columns = [string(c)*"_"*string(i) for c in 'a':'z' for i in 1:4]
for c in real_number_columns
    data[!, c] = rand(7500000)
end
# some dimension
data[!, "dimension"] = rand([string(c) for c in 'A':'Z'], 7500000)
size(data)

This gives us a dataset of size (7500000, 107).

I'm looking to process this large dataset as fast as possible; currently it takes around 40 minutes without any parallel processing. The resulting data frame should have:

  1. unique_id
  2. real_number_columns with suffix "_DIFF": the difference of the values across the two datestamps when unique_id has 2 datestamps, or missing otherwise
  3. real_number_columns with suffix "_RECENT": the most recent value when unique_id has 1 or 2 datestamps
  4. just the most recent dimension
# get the difference between real_number_columns across dates if 2 dates exist
df = @chain data begin
    transform!(:, :datestamp => ByRow(d -> d == Date("2021-08-22") ? true : false) => :recent_date)
    groupby(["unique_id"])
    combine(df1 -> begin
        df1_max = nothing
        df1_min = nothing
        try
            d = Dict()
            if nrow(df1) == 1
                d["day_diff"] = [0]
                df1_max = df1[1, :]
                df1_min = df1[1, :]
            else
                d["day_diff"] = [6]
                df1_max = df1[findfirst(df1.recent_date), :]
                df1_min = df1[findfirst(.!df1.recent_date), :]
            end
            d["dimension"] = [df1_max.dimension]
            for m in real_number_columns
                d[m*"_RECENT"] = [df1_max[m]]
                if nrow(df1) > 1
                    d[m*"_DIFF"] = [df1_max[m] - df1_min[m]]
                else
                    d[m*"_DIFF"] = [missing]
                end
            end
            DataFrame(d)
        catch e
            @error "Something went wrong" exception=(e, catch_backtrace())
            rethrow()
        end
    end)
end

Questions

  1. How can one make this processing more efficient? (One possible direction is sketched after this list.)
  2. Can we get faster processing time at the expense of memory efficiency (say, on a device with around 1 TB of RAM)?
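
One possible direction (a sketch of an alternative approach, not the issue author's solution and not benchmarked here): after sorting by unique_id and datestamp, combine can operate on whole columns per group, which avoids building a Dict and a one-row DataFrame for every group.

using DataFrames, Dates

# assumes `data` and `real_number_columns` are defined as in the snippets above
sort!(data, [:unique_id, :datestamp])
gd = groupby(data, :unique_id)

df = combine(gd,
    :dimension => last => :dimension,
    [c => last => c * "_RECENT" for c in real_number_columns]...,
    [c => (v -> length(v) == 2 ? v[2] - v[1] : missing) => c * "_DIFF"
        for c in real_number_columns]...)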
