Dataframe

Dataframe is a Torch7 class to load and manipulate tabular data (e.g. Kaggle-style CSVs), inspired by R's and pandas' data frames.

As of release 1.5 it fully supports the torchnet data structure. It also has custom iterators for convenient integration with torchnet's engines; see the mnist example. As of release 1.6 the internal storage has changed to tensors.

For a more detailed look at the changes between versions, see the NEWS file.

Requirements
Installation
Changelog
Usage
Tests
Documentation
Contributing

Requirements

Installation

You can clone this repository or directly install it through luarocks:

git clone https://github.com/AlexMili/torch-dataframe
cd torch-dataframe
luarocks make rocks/torch-dataframe-scm-1.rockspec

Or do the same in one line:

luarocks install torch-dataframe scm-1

luarocks install torch-dataframe

Changelog

Version: 1.6

The data is now stored in Dataseries objects that handle all the manipulations, statistics, categoricals, etc. internally. The data backend is either a tensor or a tds.Vec in order to better accommodate large datasets.
self.columns has been dropped; only self.column_order now keeps track of column order.
Most functions now return tds.Hash or tds.Vec values instead of regular tables.
The data types are now more sophisticated: boolean, integer, long, double, and string. The first and the last are internally stored as tds.Vec while the remaining are stored as torch tensors.
Since conversions are more restricted with the new column types, there are now boolean2tensor and boolean2categorical, which help convert boolean columns into numerical ones.
The Dataframe.schema property has been removed as it now resides in the series. The same information can be retrieved using get_schema().
There is now a custom busted assertion that can compare tensors, tds, and Dataseries.
The CSV data is read using csvigo's large mode, thus circumventing the memory limit for large CSV files.
to_categorical and from_categorical now always return a single value when a single value is entered.
add_column now takes a Dataseries instead of a Df_Array.
Generalized the argcheck by adding string.split for |-separated arguments.
Multiple minor bug fixes involving non-local variables.

See NEWS.md file for previous changes.

Usage

Named arguments

The Dataframe relies on argcheck for parsing arguments. This means that you can use named parameters with the function{arg_name=value} syntax. Named arguments are supported by all functions except the constructor, and are mandatory in certain functions in order to avoid ambiguity.

The argcheck package also works as the API documentation. It checks arguments and if you happen to provide the function with invalid arguments it will automatically output the function documentation.

Important: Due to limitations in the Lua language, the package uses helper classes to distinguish tables passed as argument values from the table that carries named arguments. The three classes are:

Df_Array - contains only values and no keys
Df_Dict - a dictionary table whose named keys map to the values
Df_Tbl - a raw table wrapper that does a shallow argument copy
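
To see why such wrappers help, consider the following minimal pure-Lua sketch. It is not the library's implementation: `MyArray`, `new_array` and `is_array` are made-up names used only for illustration. It shows that an array-like table and a named-argument table are indistinguishable by type, and that tagging a table through its metatable removes the ambiguity.

```lua
-- Hypothetical illustration; NOT torch-dataframe's actual implementation.
-- An array-like table and a named-argument table are both plain tables,
-- so a function that receives a single table cannot tell which was meant.
local array_like = {1, 2, 3}
local named_args = {path = "data.csv", header = true}
assert(type(array_like) == type(named_args)) -- both are "table"

-- A tiny wrapper "class" (made-up name) that tags a table via its metatable.
local MyArray = {}
MyArray.__index = MyArray

-- Wraps the given values so they can later be recognised as an array argument.
local function new_array(...)
  return setmetatable({data = {...}}, MyArray)
end

-- True only for values created through new_array.
local function is_array(x)
  return type(x) == "table" and getmetatable(x) == MyArray
end

local wrapped = new_array(1, 2, 3)
print(is_array(wrapped))    -- true
print(is_array(array_like)) -- false
print(is_array(named_args)) -- false
```

With a tag like this, a function can treat an untagged table as a set of named arguments and a tagged one as a value, which is the role Df_Array, Df_Dict and Df_Tbl play in the calls shown in the rest of this document.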

Load data

Initiate the object:

require 'Dataframe'
df = Dataframe()

Load CSV file:

df:load_csv{path='./data/training.csv', header=true}

Load from table:

df:load_table{data=Df_Dict{firstColumn={1,2,3},
                           secondColumn={4,5,6}}}

You can also instantiate the object with a csv-filename or a table by passing the table or filename as an argument:

require 'Dataframe'
df = Dataframe('./data/training.csv')

Data inspection

You can discover your dataset with the following functions:

-- you can either view the data as a plain text output or itorch html table
df:output() -- prints html if in itorch otherwise prints plain table
df:output{html=true} -- forces html output

df:show() -- prints the head + tail of the table

-- You can also directly call print() on the object
-- and it will print the ascii-table
print(df)

General dataset information can be found using:

df:shape() -- print {rows=3, cols=3}
#df -- gets the number of rows
df:size() -- returns a tensor with the size rows, columns
df.column_order -- table of column names
df:count_na() -- print all the missing values by column name

If you want to inspect random elements you can use get_random():

df:get_random(10):output()

Manipulate

You can manipulate the data:

df:insert(Df_Dict({['first_column']={7,8,9},['second_column']={10,11,12}}))
df:remove_index(3) -- remove line 3 of the entire dataset

df:has_column('x') -- return true if the column exists
df:get_column('y') -- return column 'y' as table
df["$y"] -- alias for get_column

df:add_column('z', 0) -- Add column with default value 0 at the end (right side of the table)
df:add_column('first_column', 1, 2) -- Add column with default value 2 at the beginning (left side of the table)
df:drop('x') -- delete column
df:rename_column('x', 'y') -- rename column 'x' to 'y'

df:reset_column('my_col', 0) -- reset the given column with 0
df:fill_na('x', 0) -- replace missing values in 'x' column with 0
df:fill_all_na(0) -- replace all missing values with the value 0

df:unique('col_name') -- return table with unique values of the given column
df:unique('col_name', true) -- return table with unique values of the given column as keys

df:where('column_name','my_value') -- find the first row where the column has the given value

-- Update all rows that fulfil the condition defined in the first function;
-- the condition function must return a boolean
df:update(function(row) return row['column'] == 'test' end,
          function(row) row['other_column'] = 'new_value' return row end)

Categorical variables

You can define categorical variables that are treated internally as numbers ranging from 1 to n levels while being displayed as strings. The numeric representation is retained when exporting with to_tensor, which makes a classifier's output easier to interpret:

df:as_categorical('my string column') -- converts a column to categorical
df:get_cat_keys('my string column') -- retrieves the keys used for the conversion
df:to_categorical(Df_Array({1,2,1}), 'my string column') -- converts numbers to the categories
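
The idea behind the categorical encoding can be sketched in plain Lua. This is a hedged illustration and not the library's code: `build_levels` is a hypothetical helper, and assigning levels in order of first appearance is an assumption made for the example (the library may order its levels differently).

```lua
-- Hypothetical sketch of categorical encoding; NOT the library's implementation.
-- Assumption: levels are numbered in order of first appearance.
local function build_levels(values)
  local key_of, label_of, n = {}, {}, 0
  for _, v in ipairs(values) do
    if key_of[v] == nil then
      n = n + 1
      key_of[v] = n      -- string -> number (1..n)
      label_of[n] = v    -- number -> string
    end
  end
  return key_of, label_of
end

local column = {"cat", "dog", "cat", "bird"}
local key_of, label_of = build_levels(column)

-- Encode the strings as numbers, as a tensor export would need.
local encoded = {}
for i, v in ipairs(column) do encoded[i] = key_of[v] end
print(table.concat(encoded, ","))  -- 1,2,1,3

-- Decode numbers (e.g. a classifier's output) back into the category labels.
local decoded = {}
for i, k in ipairs({1, 2, 1}) do decoded[i] = label_of[k] end
print(table.concat(decoded, ","))  -- cat,dog,cat
```

The decoding step mirrors what the to_categorical call above does with the numbers 1, 2, 1.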

Subsetting

You can subset your data using:

df:head(20) -- print the first 20 elements (10 by default)
df:tail(5) -- print the last 5 elements (10 by default)
df:show() -- print the first 10 and the last 10 elements

df[13] -- returns a table with the row values
df["13:17"] -- returns a Dataframe with values in that span
df["13:"] -- returns a Dataframe with values starting from index 13
df[Df_Array(1,3,4)] -- returns a Dataframe with the rows at indexes 1, 3 and 4

Exporting

Finally, you can save your dataset to tensor (only numerical/categorical columns will be taken):

df:to_tensor{filename = './data/train.th7'} -- saves data
data = df:to_tensor{columns = Df_Array('first_column', 'my string column')} -- Converts the two columns into tensor

or to CSV:

df:to_csv('data.csv')

Batch loading

The Dataframe provides a built-in system for handling batch loading. It also has an extensive set of samplers that you can use. See the API docs for the samplers that are available.

The gist of it is:

The main Dataframe is initialized for batch loading by calling create_subsets. This creates random subsets that have their own samplers. The default split is 70% train, 20% validate, and 10% test, but you can choose any split and any names.
Each subset is a separate dataframe subclass that has two columns, (1) indexes with the corresponding index in the main dataframe, (2) labels that some of the samplers require.
When you want to retrieve a batch from a subset you call the subset using my_dataframe:get_subset('train'):get_batch(30) or my_dataframe['/train']:get_batch(30).
The batch returned is also a subclass that has a custom to_tensor function that returns the data and corresponding label tensors. You can provide custom functions that receive the full row as an argument, allowing you to use e.g. a filename stored in the row to load an external resource.

A simple example:

local df = Dataframe('my_csv'):
	create_subsets()

local batch = df["/train"]:get_batch(10)
local data, label = batch:to_tensor{
	load_data_fn = my_image_loader
}
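
The subset creation step described above can be pictured with a small pure-Lua sketch. It is only an illustration under stated assumptions: `split_indexes` is a hypothetical function, the rounding rule is made up for the example, and the library's create_subsets may be implemented quite differently. The sketch shuffles the row indexes and cuts them into named subsets, each holding indexes into the main dataset.

```lua
-- Hypothetical sketch of a random train/validate/test split;
-- NOT the implementation behind create_subsets.
local function split_indexes(n_rows, proportions, order)
  -- Fisher-Yates shuffle of the row indexes 1..n_rows
  local idx = {}
  for i = 1, n_rows do idx[i] = i end
  for i = n_rows, 2, -1 do
    local j = math.random(i)
    idx[i], idx[j] = idx[j], idx[i]
  end

  -- Hand out consecutive slices of the shuffled indexes to each subset
  local subsets, start = {}, 1
  for k, name in ipairs(order) do
    local count
    if k == #order then
      count = n_rows - start + 1  -- the last subset takes the remainder
    else
      count = math.floor(n_rows * proportions[name] + 0.5)
    end
    subsets[name] = {}
    for i = start, start + count - 1 do
      table.insert(subsets[name], idx[i])
    end
    start = start + count
  end
  return subsets
end

local subsets = split_indexes(
  100,
  {train = 0.7, validate = 0.2, test = 0.1},
  {"train", "validate", "test"}
)
-- Subset sizes for 100 rows: 70, 20 and 10
print(#subsets.train, #subsets.validate, #subsets.test)
```

Storing indexes rather than copies of the rows keeps each subset small, which matches the description of subsets as two-column dataframes that point back into the main dataframe.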

As of version 1.5 you may also want to consider using the iterators that integrate with the torchnet infrastructure. Take a look at the iterator API and the mnist example to see how an implementation may look.

Tests

The package contains an extensive test suite and tries to apply a behavior driven development approach. All features should be accompanied by a test-case.

To launch the tests you need to install busted (See: Olivine-Labs/busted) via luarocks:

luarocks install busted

Then you can run all tests from the command line:

cd specs/
./run_all.sh

Documentation

The package relies on self-documenting functions via the argcheck package; the generated documentation resides in the doc folder. The GitHub Wiki is intended for more extensive, in-depth documentation.

To generate the documentation please run:

th doc.lua > /dev/null

Contributing

Feel free to report a bug, suggest enhancements, or submit cool new features using Issues, or send us a Pull Request directly :). Don't forget to test your code and generate the doc before submitting. You can see how we implemented our tests in the specs directory. See "Behavior Driven Development" for more details on this technique.
