Dataframe is a Torch7 class for loading and manipulating tabular data (e.g. Kaggle-style CSVs), inspired by R's and pandas' data frames.
As of release 1.5 it fully supports the torchnet data structure. It also has custom iterators for convenient integration with torchnet's engines; see the mnist example. As of release 1.6 the internal storage has changed to tensors.
For a more detailed look at the changes between the versions have a look at the NEWS file.
You can clone this repository or install it directly through luarocks:
git clone https://github.com/AlexMili/torch-dataframe
cd torch-dataframe
luarocks make rocks/torch-dataframe-scm-1.rockspec
The same in one line:
luarocks install torch-dataframe scm-1
or
luarocks install torch-dataframe
- The data is now stored in Dataseries that handle all the manipulations, statistics, categoricals, etc. internally. The data backend is either a tensor or a tds.Vec in order to better accommodate large datasets.
- The self.columns has been dropped and there is now only self.column_order that keeps track of column order.
- Most functions now use either tds.Hash or tds.Vec for returning values instead of regular tables.
- The data types are now more sophisticated: boolean, integer, long, double, and string. The first and the last are stored internally as tds.Vec while the remaining are stored as torch tensors.
- Since conversions are more restricted with the new column types, there are boolean2tensor and boolean2categorical functions that help convert boolean columns into numerical ones.
- The Dataframe.schema property has been removed as it now resides in the series. The same information can be retrieved using get_schema().
- There is now a custom busted assertion that can compare tensors, tds, and Dataseries.
- The csv data is entered using csvigo's large mode, thus circumventing the memory limit for large CSVs.
- The to_/from_categorical functions now always return a single value when a single value is entered.
- Add column now takes a Dataseries instead of a Df_Array
- Generalized the argcheck by adding string.split for | separated arguments
- Multiple minor bug-fixes with non-local variables
See the NEWS.md file for previous changes.
The Dataframe relies on argcheck for parsing arguments. This means that you can use named parameters with the function{arg_name=value} syntax. Named arguments are supported by all functions except the constructor, and in certain functions they are mandatory in order to avoid ambiguity.
The argcheck package also works as the API documentation. It checks arguments and if you happen to provide the function with invalid arguments it will automatically output the function documentation.
Important: Due to limitations in the Lua language, the package uses helper classes for separating regular table arguments from tables passed in as arguments. The three classes are:
- Df_Array - contains only values and no keys
- Df_Dict - a dictionary table that has named keys that map to all values
- Df_Tbl - a raw table wrapper that does a shallow argument copy
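The ambiguity that such wrappers avoid can be illustrated in plain Lua, without Torch. This is a minimal sketch under assumptions: `describe`, `wrap`, `My_Array` and `My_Dict` are hypothetical names invented for the illustration and are not how the package implements Df_Array or Df_Dict. It relies only on the fact that `f{a = 1}` is syntactic sugar for `f({a = 1})`, so a function that receives a single table cannot tell named arguments from a data table unless the data is marked somehow.

```lua
-- Plain Lua, no Torch required. All names here are hypothetical.

-- Classifies how a call was made, based only on what the callee can see.
local function describe(...)
  local args = {...}
  if #args == 1 and type(args[1]) == 'table' then
    if args[1].__wrapper then
      return 'data wrapped as ' .. args[1].__wrapper
    end
    -- A bare table is indistinguishable from a named-argument call.
    return 'named arguments'
  end
  return 'positional arguments'
end

-- Tiny wrapper constructor standing in for helper classes such as Df_Array.
local function wrap(kind)
  return function(data) return {__wrapper = kind, data = data} end
end
local My_Array, My_Dict = wrap('array'), wrap('dict')

print(describe{column = 'x', value = 0})   -- named arguments
print(describe({1, 2, 3}))                 -- named arguments (ambiguous)
print(describe(My_Array({1, 2, 3})))       -- data wrapped as array
print(describe(My_Dict({a = {1, 2}})))     -- data wrapped as dict
print(describe('x', 0))                    -- positional arguments
```

The second call shows the problem: a raw data table is mistaken for named arguments, while the wrapped variants are unambiguous.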
Initiate the object:
require 'Dataframe'
df = Dataframe()
Load CSV file:
df:load_csv{path='./data/training.csv', header=true}
Load from table:
df:load_table{data=Df_Dict{firstColumn={1,2,3},
secondColumn={4,5,6}}}
You can also instantiate the object with a csv-filename or a table by passing the table or filename as an argument:
require 'Dataframe'
df = Dataframe('./data/training.csv')
You can discover your dataset with the following functions:
-- you can either view the data as a plain text output or itorch html table
df:output() -- prints html if in itorch otherwise prints plain table
df:output{html=true} -- forces html output
df:show() -- prints the head + tail of the table
-- You can also directly call print() on the object
-- and it will print the ascii-table
print(df)
General dataset information can be found using:
df:shape() -- print {rows=3, cols=3}
#df -- gets the number of rows
df:size() -- returns a tensor with the size rows, columns
df.column_order -- table of column names
df:count_na() -- print all the missing values by column name
If you want to inspect random elements you can use get_random():
df:get_random(10):output()
You can manipulate it:
df:insert(Df_Dict({['first_column']={7,8,9},['second_column']={10,11,12}}))
df:remove_index(3) -- remove line 3 of the entire dataset
df:has_column('x') -- return true if the column exists
df:get_column('y') -- return column y as table
df["$y"] -- alias for get_column
df:add_column('z', 0) -- Add column with default value 0 at the end (right side of the table)
df:add_column('first_column', 1, 2) -- Add column with default value 2 at the beginning (left side of the table)
df:drop('x') -- delete column
df:rename_column('x', 'y') -- rename column 'x' to 'y'
df:reset_column('my_col', 0) -- reset the given column with 0
df:fill_na('x', 0) -- replace missing values in 'x' column with 0
df:fill_all_na(0) -- replace all missing values with the value 0
df:unique('col_name') -- return table with unique values of the given column
df:unique('col_name', true) -- return table with unique values of the given column as keys
df:where('column_name','my_value') -- find the first row where the column has the given value
-- Update all rows fulfilling the condition defined in the first function
df:update(function(row) return row['column'] == 'test' end,
          function(row) row['other_column'] = 'new_value' return row end)
You can define categorical variables that will be treated internally as numbers ranging from 1 to n levels while displayed as strings. The numeric representation is retained when exporting with to_tensor, allowing a simpler understanding of a classifier's output:
df:as_categorical('my string column') -- converts a column to categorical
df:get_cat_keys('my string column') -- retrieves the keys used for the conversion
df:to_categorical(Df_Array({1,2,1}), 'my string column') -- converts numbers to the categories
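The idea behind the categorical representation can be sketched in plain Lua. This is an illustration only: `encode_levels` and `decode_levels` are hypothetical names, and the sketch assumes that levels are numbered in order of first appearance, which may differ from how the package assigns them.

```lua
-- Plain Lua sketch, no Torch required. Function names are hypothetical.

-- Maps each distinct string to an integer code in 1..n
-- (numbering by first appearance is an assumption of this sketch).
local function encode_levels(values)
  local keys, levels, codes = {}, {}, {}
  for i, v in ipairs(values) do
    if keys[v] == nil then
      levels[#levels + 1] = v
      keys[v] = #levels
    end
    codes[i] = keys[v]
  end
  return codes, keys, levels
end

-- Maps integer codes back to their string levels.
local function decode_levels(codes, levels)
  local out = {}
  for i, c in ipairs(codes) do out[i] = levels[c] end
  return out
end

local codes, keys, levels = encode_levels({'cat', 'dog', 'cat', 'bird'})
print(table.concat(codes, ','))                             -- 1,2,1,3
print(table.concat(decode_levels({1, 2, 1}, levels), ','))  -- cat,dog,cat
```

Keeping the integer codes in the exported tensor means a classifier's predicted class index can be mapped straight back to a readable label.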
You can subset your data using:
df:head(20) -- print the first 20 elements (10 by default)
df:tail(5) -- print the last 5 elements (10 by default)
df:show() -- print the first 10 and the last 10 elements
df[13] -- returns a table with the row values
df["13:17"] -- returns a Dataframe with values in that span
df["13:"] -- returns a Dataframe with values starting from index 13
df[Df_Array(1,3,4)] -- returns a Dataframe with the values at indexes 1, 3 and 4
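How a span string such as "13:17" or "13:" might be turned into row bounds can be sketched in plain Lua. This is not the package's parser: `parse_span` is a hypothetical helper, and the defaults used here (a missing start means row 1, a missing end means the last row) are assumptions of the sketch.

```lua
-- Plain Lua sketch, no Torch required. parse_span is a hypothetical helper.

-- Returns the first and last row of a "start:stop" span,
-- or nil when the string is not a span at all.
local function parse_span(spec, n_rows)
  local first, last = spec:match('^(%d*):(%d*)$')
  if first == nil then return nil end
  first = tonumber(first) or 1       -- assumed default: start at row 1
  last = tonumber(last) or n_rows    -- assumed default: run to the last row
  return first, last
end

local a, b = parse_span('13:17', 100)
print(a, b)   -- 13 and 17
local c, d = parse_span('13:', 100)
print(c, d)   -- 13 and 100
print(parse_span('not a span', 100))  -- nil
```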
Finally, you can save your dataset to tensor (only numerical/categorical columns will be taken):
df:to_tensor{filename = './data/train.th7'} -- saves data
data = df:to_tensor{columns = Df_Array('first_column', 'my string column')} -- Converts the two columns into tensor
or to CSV:
df:to_csv('data.csv')
The Dataframe provides a built-in system for handling batch loading. It also has an extensive set of samplers that you can use. See the API docs for details on which samplers are available.
The gist of it is:
- The main Dataframe is initialized for batch loading by calling create_subsets. This creates random subsets that have their own samplers. The default is a split of 70% train, 20% validate, and 10% test, but you can choose any split and any names.
- Each subset is a separate dataframe subclass that has two columns: (1) indexes with the corresponding index in the main dataframe, and (2) labels that some of the samplers require.
- When you want to retrieve a batch from a subset you call the subset using my_dataframe:get_subset('train'):get_batch(30) or my_dataframe['/train']:get_batch(30).
- The batch returned is also a subclass that has a custom to_tensor function that returns the data and corresponding label tensors. You can provide custom functions that will get the full row as an argument, allowing you to use e.g. a filename to load an external resource.
A simple example:
local df = Dataframe('my_csv'):create_subsets()
local batch = df["/train"]:get_batch(10)
local data, label = batch:to_tensor{
  load_data_fn = my_image_loader
}
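The random 70/20/10 split described above can be sketched in plain Lua to show what the subsets hold, namely row indexes into the main data. This is an illustration under assumptions, not the package's create_subsets: `split_indexes` and its `proportions` argument are hypothetical, and the rounding rule (round each share, give the remainder to the last subset) is a choice made for this sketch.

```lua
-- Plain Lua sketch, no Torch required. split_indexes is a hypothetical helper.

-- Shuffles the row indexes 1..n_rows and cuts them into named subsets.
-- proportions is an ordered list of {name = ..., share = ...} entries.
local function split_indexes(n_rows, proportions, seed)
  math.randomseed(seed or 42)
  local idx = {}
  for i = 1, n_rows do idx[i] = i end
  -- Fisher-Yates shuffle
  for i = n_rows, 2, -1 do
    local j = math.random(i)
    idx[i], idx[j] = idx[j], idx[i]
  end
  local subsets, start = {}, 1
  for k, spec in ipairs(proportions) do
    local remaining = n_rows - start + 1
    local count = math.floor(n_rows * spec.share + 0.5)
    -- The last subset takes whatever is left so that no row is dropped.
    if k == #proportions or count > remaining then count = remaining end
    local rows = {}
    for i = start, start + count - 1 do rows[#rows + 1] = idx[i] end
    subsets[spec.name] = rows
    start = start + count
  end
  return subsets
end

local subsets = split_indexes(100, {
  {name = 'train', share = 0.7},
  {name = 'validate', share = 0.2},
  {name = 'test', share = 0.1},
})
-- prints 70, 20 and 10 (tab-separated)
print(#subsets.train, #subsets.validate, #subsets.test)
```

Storing only indexes keeps each subset small, since the rows themselves stay in the main dataframe.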
As of version 1.5 you may also want to consider using the iterators that integrate with the torchnet infrastructure. Take a look at the iterator API and the mnist example for how an implementation may look.
The package contains an extensive test suite and tries to apply a behavior driven development approach. All features should be accompanied by a test-case.
To launch the tests you need to install busted (see Olivine-Labs/busted) via luarocks:
luarocks install busted
Then you can run all tests from the command line:
cd specs/
./run_all.sh
The package relies on self-documenting functions via the argcheck package; the resulting documentation resides in the doc folder. The GitHub Wiki is intended for more extensive, in-depth documentation.
To generate the documentation please run:
th doc.lua > /dev/null
Feel free to report a bug, suggest enhancements or submit new cool features using Issues or directly send us a Pull Request :). Don't forget to test your code and generate the doc before submitting. You can find how we implemented our tests in the specs directory. See "Behavior Driven Development" for more details on this technique.