
dataframe's Introduction

Java DataFrame

An easy-to-use DataFrame Library for Java.


Documentation

Javadocs

Install

The library is available on Maven Central.

Add this to your pom.xml:

<dependencies>
...
    <dependency>
        <groupId>de.unknownreality</groupId>
        <artifactId>dataframe</artifactId>
        <version>0.7.6</version>
    </dependency>
...
</dependencies>

Build

To build the library from sources:

  1. Clone the GitHub repository:

    $ git clone https://github.com/nRo/DataFrame.git

  2. Change to the created folder and run mvn install

    $ cd DataFrame

    $ mvn install

  3. Include it by adding the following to your project's pom.xml:

<dependencies>
...
    <dependency>
        <groupId>de.unknownreality</groupId>
        <artifactId>dataframe</artifactId>
        <version>0.7.6-SNAPSHOT</version>
    </dependency>
...
</dependencies>

Version 0.7.5

  • Direct value access for DataRow objects.

    DataRows now directly access the respective values from the columns.
    This improves runtime and memory footprint for most DataFrame operations. DataRow objects are invalidated once the source DataFrame is changed.
    Accessing an invalidated row results in an exception.

  • Row collections are now returned as a DataRows object.
    DataRows can be converted to a new DataFrame.

  • Improved 'groupBy' method.

Version 0.7

  • The read and write functions have been rewritten from scratch for this version. Some existing methods have been removed.

  • Data grouping has been refactored and aggregation functions can now be applied to data groupings. In general, data groupings can now be used like normal DataFrames.

  • Java 8 is now required.

  • Empty DataFrame instances are now created using DataFrame.create().

Examples

Select all users named Meier or Schmitt from Germany, group them by age, and add a column that contains the number of users with the respective age. Then sort by age and print the result.

URL csvUrl = new URL("https://raw.githubusercontent.com/nRo/DataFrame/master/src/test/resources/users.csv");

DataFrame users = DataFrame.load(csvUrl, FileFormat.CSV);

users.select("(name == 'Schmitt' || name == 'Meier') && country == 'Germany'")
        .groupBy("age").agg("count",Aggregate.count())
        .sort("age")
        .print();

/*
    age count
    20   1
    24   2
    30   2
 */

Load a CSV file, set a unique column as the primary key, and add an index for two other columns. Select rows using the previously created index, change the values in their NAME column, and join them with the original DataFrame.

URL csvUrl = new URL("https://raw.githubusercontent.com/nRo/DataFrame/master/src/test/resources/data_index.csv");

DataFrame dataFrame = DataFrame.load(csvUrl, FileFormat.CSV);

dataFrame.setPrimaryKey("UID");
dataFrame.addIndex("id_name_idx","ID","NAME");

DataRow row = dataFrame.selectByPrimaryKey(1);
System.out.println(row);
//1;A;1

DataFrame idxExample = dataFrame.selectByIndex("id_name_idx",3,"A");

idxExample.print();
/*
    ID	NAME	UID
    3	A	4
    3	A	8
 */
idxExample.getStringColumn("NAME").map(value -> value + "_idx_example");
idxExample.print();
/*
    ID	NAME	UID
    3	A_idx_example	4
    3	A_idx_example	8
 */

dataFrame.joinInner(idxExample,"UID").print();
/*
    ID.A    NAME.A	UID	ID.B	NAME.B
    3   A   4	3   A_idx_example
    3   A   8	3   A_idx_example
 */

Usage

Load a DataFrame from a CSV file. Column types (String, Double, Integer, Boolean) are detected automatically.

File file = new File("person.csv");
DataFrame users = DataFrame.fromCSV(file, ';', true);

Load a DataFrame with custom options and predefined column types.

File file = new File("person.csv");

CSVReader csvReader = CSVReaderBuilder.create()
                .containsHeader(true)
                .withHeaderPrefix("#")
                .withSeparator(';')
                .setColumnType("person_id", Integer.class)
                .setColumnType("first_name", String.class)
                .setColumnType("last_name", String.class)
                .setColumnType("age", Integer.class).build();

DataFrame users = DataFrame.load(file,csvReader);
        
System.out.println(users.getHeader());
for(DataRow row : users)
{
    System.out.println(row);
}

DataFrames can be written using the default formats (CSV or TSV). Additionally, different options can be set when writing DataFrames.

dataFrame.write(file); // TSV per default

dataFrame.write(file, FileFormat.CSV);

dataFrame.writeCSV(file, ';',true); // use ';' as separator and include the header


CSVWriter csvWriter = CSVWriterBuilder.create()
                .withHeader(true)
                .withSeparator('\t')
                .useGzip(true).build();
dataFrame.write(file, csvWriter);

If a meta file is written for a DataFrame, the DataFrame can simply be loaded by pointing at the DataFrame file. The meta file has the same path as the DataFrame file with a '.dfm' extension.

File file = new File("dataFrame.csv");
dataFrame.write(file);
DataFrame loadedDataFrame = DataFrame.load(file);

Values within a DataFrame are accessed using DataRow objects. If the source DataFrame changes after a DataRow object is created, the DataRow is invalidated and can no longer be accessed.

for(DataRow row : dataFrame){
    ... = row.getInteger("id");
}


DataRows rows = dataFrame.getRows();

//returns the value within the id column in the first row
rows.get(0).getInteger("id");

dataFrame.sort("name");

//The DataFrame was sorted after the DataRows were obtained.
//The first row can now differ. 
//To avoid such effects, a RuntimeException is thrown
//if a row that was created before the DataFrame was altered is accessed

rows.get(0).getInteger("id"); //throws exception

rows = dataFrame.getRows();

//rows is now valid again and rows can be accessed
rows.get(0).getInteger("id");


//DataRows can be converted to a new independent DataFrame.
//changes to the original DataFrame have no effect on the new DataFrame.
DataFrame dataFrame2 = rows.toDataFrame();

dataFrame.sort("id");

dataFrame.getRow(0).getInteger("id"); // no exception

DataRows can be used to change values within a DataFrame.

DataRows rows = dataFrame.getRows();

//sets the value in the second row in the name column to 'A'
rows.get(1).set("name","A");

//sets the value in the second row in the first column to 'A'
rows.get(1).set(0,"A");

Use indices for fast row access.

//set the primary key of a data frame
users.setPrimaryKey("person_id");
DataRow firstUser = users.selectByPrimaryKey(1);

//add a multi-column index

users.addIndex("name-address","last_name","address");

//returns rows containing all users with the last name Smith living in Example-Street 15
DataRows user = users.selectRowsByIndex("name-address","Smith","Example-Street 15");

It is possible to define and use other index types. The following example shows interval indices. This index type requires two number columns, start and end. The index can then be used to find rows where start and end value overlap with a region specified by two number values. It is also possible to find rows where the region defined by start and end contains a certain value.

 DataFrame dataFrame = DataFrame.create()
                .addStringColumn("name")
                .addIntegerColumn("start")
                .addIntegerColumn("end");
dataFrame.append("A",1,3);
dataFrame.append("B",2,3);
dataFrame.append("C",4,5);
dataFrame.append("D",6,7);
IntervalIndex index = new IntervalIndex("idx",
    dataFrame.getNumberColumn("start"),
    dataFrame.getNumberColumn("end"));
dataFrame.addIndex(index);

//returns a new dataframe containing all rows where (start,end) overlaps with (1,3)
// -> A, B
DataFrame df = dataFrame.selectByIndex("idx",1,3);

//rows where (start,end) overlaps with (4,5)
// -> C
dataFrame.selectByIndex("idx",4,5);

//rows where (start,end) contains 2.5
// -> A, B
dataFrame.selectByIndex("idx",2.5);

Perform operations on columns.

//max value of the "age" column
users.getIntegerColumn("age").max();

DoubleColumn dc1 = ...;
DoubleColumn dc2 = ...;
//add one column to another
dc1.add(dc2);

//multiply each value with 2
dc1.multiply(2);

//Use a MapFunction to convert all values in a column
dataFrame.getIntegerColumn("age").map(value -> value + 2);

Filter and select rows using predicates.

//keep users with age between 18 and 60
users.filter(FilterPredicate.btwn("age",18,60));

//find all users with age > 18 and first_name == "Max"
DataFrame foundUsers = users.select(
      FilterPredicate.and(
              FilterPredicate.gt("age",18),
              FilterPredicate.eq("first_name","Max")
));

Create and compile predicates from strings.
Available value comparison operations:
==, !=, <, <=, >, >=, ~= (regex)
Available predicate operations:
&&, ||, NOR, XOR, !(predicate) (negates the predicate)

//find all users older than 18 with first_name == 'Max', or users younger than 18
DataFrame foundUsers = users.select("(age > 18 && first_name == 'Max') OR (age < 18)");

//Boolean column filter
//find all users that are older than 18 or where the selected column is true
DataFrame foundUsers = users.select("(age > 18) OR selected");

//find all users where the selected column is false
DataFrame notSelected = users.select("!selected");

//compare two columns
//returns all rows where col1 equals col2
FilterPredicate.eqColumn("col1","col2");

//Get all users where first name does not equal the last name
//Column comparisons require '.' as prefix
DataFrame dataframe = users.select(".first_name != .last_name");


// regex filter
//find all users where the street begins with A, B or C followed by lowercase characters
DataFrame foundUsers = users.select("street ~= /[ABC][a-z]+/");

Sort rows by one or more columns.

//sort by column "person_id" (ascending)
users.sort("person_id", SortColumn.Direction.Ascending);

//sort by "last_name" and "first_name"
users.sort(
   new SortColumn("last_name", SortColumn.Direction.Descending),
   new SortColumn("first_name", SortColumn.Direction.Descending)
);

Group dataframes using one or more columns.

//group by "age" and "first_name"
DataGrouping grouping = users.groupBy("age","first_name");


//iterate through all found groups
for(DataRow row : grouping){
    DataGroup group = grouping.getGroup(row.getIndex());
    //print the group description (group values)
    System.out.println(group.getGroupDescription());
   
    //iterate through all rows of the respective group
    for(DataRow groupRow : group){
        System.out.println(groupRow);
    }
}

Direct access to groups in a grouping.

//group by "age" and "first_name"
DataGrouping grouping = users.groupBy("age","first_name");

//Get all users that are called John and are 18 years old
DataGroup group = grouping.findByGroupValues(18, "John");

It is possible to apply aggregation functions to data groups. In this example, a column "max_age" is added to the grouping DataFrame. This column contains the maximum value of the "age" column of the respective rows in the original DataFrame. The resulting grouping DataFrame contains two columns: "first_name" and "max_age".

DataGrouping grouping = users.groupBy("first_name").agg("max_age", Aggregate.max("age"));

//some other aggregate functions
grouping.agg("mean_age", Aggregate.mean("age"));

grouping.agg("median_age", Aggregate.median("age"));

grouping.agg("25quantile_age", Aggregate.quantile("age", 0.25));

grouping.agg("older_30_count", Aggregate.filterCount("age > 30"));

//custom aggregate function (cast to double to avoid integer division)
grouping.agg("org_percentage", group -> (double) group.size() / users.size());

Join two dataframes using one or more columns.

//join DataFrames users and visits by columns users.person_id == visits.person_id
DataFrame visitors = users.joinLeft(visits,"person_id");

//join DataFrames users and orders by columns users.person_id == orders.customer_id
DataFrame userOrders = users.joinInner(orders,new JoinColumn("person_id","customer_id"));


dataframe's Issues

DataFrame needs a count function

In my situation, I need to get the DataFrame size frequently.

But considering the API, I would do this with:

df.toList().size()

But the toList method defined in BaseDataFrame is expensive:

    @Override
    public List<List> toList() {
        ArrayList<List> list = new ArrayList<>();

        for (DataRow row : this) {
            List data = new ArrayList();
            for (int i = 0; i < columns.length; i++) {
                data.add(row.get(i));
            }
            list.add(data);
        }

        return list;
    }

Any ideas?
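
For reference, other snippets in this README call size() directly on a DataFrame (e.g. users.size() and file.size()). Assuming that method returns the row count, it avoids materializing the intermediate list:

// assuming size() returns the number of rows, as used elsewhere in this README
int rowCount = df.size();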

Trying to filter data by applying conditions to the CSV file

Hi. I am currently facing an issue with the DataFrame.
I have downloaded a file from a private Amazon S3 bucket and I am having trouble filtering the rows that satisfy a certain condition.
Here is my code:
//This function allows me to connect to the private s3 bucket
connection();
S3Object s3object = s3client.getObject(bucketName, sourceFile);
DataFrame file = DataFrame.load(s3object.getObjectContent(), FileFormat.CSV);

//listColumns & size displaying
System.out.println(file.getColumnNames().toString());
System.out.println(file.size());
//getting the rows where the column "AreaQ" is greater than 2
file.select("(AreaQ > 2)").print();

On this last line I get an exception: "Exception in getValues() with cause = 'NULL' and exception = 'column header name not found 'AreaQ'' de.unknownreality.dataframe.DataFrameRuntimeException: column header name not found 'AreaQ'".
And yet I do have a column named AreaQ with numeric values that are greater than 2.
Can you help me please?

Support of temporal column types

Hi Alexander,

I'm looking at "DataFrame" to use it in one of my pet projects.

  • How can temporal column types, e.g. LocalDate or LocalDateTime, be supported? What I want to do is hide a parsed CSV or Excel sheet behind "DataFrame", but Excel has temporal column types.
  • Did you have a look at Commons CSV? Since you strive for minimal dependencies it might not be an option, but I wanted to ask :-)

Siegfried

Initialization of the parser map is not thread-safe

Discovered this while adding columns to data frames in a multi threaded environment. Each thread has its own frame, so concurrency should not be an issue. However, ParserUtil#getParserMap() and ParserUtil#init() are not thread-safe. The lazy initialization of parserMap can lead to unexpected "Parser not found" errors when adding columns to multiple frames in multiple threads. This occurs because the if (parserMap == null) check no longer triggers (since the map has been created by init()), but the map is not done initializing.

From just looking at ParserUtil.java, it doesn't seem like there is a reason to lazily initialize this. Making the parser map static final and initializing it in a static block seems like a good solution that avoids multi-threading issues.
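
A minimal sketch of the eager initialization described above, with illustrative names and signatures rather than the library's actual code:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public final class ParserRegistry {

    // placeholder for the library's parser abstraction (illustrative only)
    public interface Parser<T> {
        T parse(String value);
    }

    // built eagerly in a static initializer, so the JVM guarantees the map is
    // fully populated before any thread can call getParser()
    private static final Map<Class<?>, Parser<?>> PARSER_MAP;

    static {
        Map<Class<?>, Parser<?>> map = new HashMap<>();
        map.put(Integer.class, (Parser<Integer>) Integer::valueOf);
        map.put(Double.class, (Parser<Double>) Double::valueOf);
        // ... register the remaining default parsers here
        PARSER_MAP = Collections.unmodifiableMap(map);
    }

    public static Parser<?> getParser(Class<?> type) {
        return PARSER_MAP.get(type);
    }
}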

Binary File Format

With the introduction of custom value types (#22), all column values can be written to and read from data streams. This enables the implementation of a binary file format to improve performance and decrease file sizes.

[Request] Make findByIndex return Iterable<DataRow>

Hey Alex,

if I am putting an index on my DataFrame I can search for a row like this:

dataFrame.addIndex("idx", "barcode");
DataRow row = dataFrame.findByIndex("idx", "barcode1");

However, in some cases multiple rows are returned. So ideally it could return something like:

Iterable<DataRow> row = dataFrame.findByIndex("idx", "barcode1");

Could you maybe add this?

Best, Simon

Support for custom column types

Introduce a value type abstraction to support custom column types (like temporal column types, #21).
Value types should provide the following methods:

  • read/write values to DataOutputStream
  • convert values to String
  • parse values from String
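
A hypothetical sketch of what such a value type interface could look like; the names and signatures are illustrative and do not necessarily match the actual branch:

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.text.ParseException;

// illustrative only - not the library's actual API
public interface ValueType<T> {

    // read/write values to data streams (e.g. for a binary file format)
    void write(DataOutputStream out, T value) throws IOException;
    T read(DataInputStream in) throws IOException;

    // convert values to String (e.g. for the CSV writer and the printer)
    String toString(T value);

    // parse values from String (e.g. for the CSV reader and column type detection)
    T parse(String value) throws ParseException;
}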

These value types must then be used in the following library parts:

  • CSV reader / writer
  • Automatic column type detection
  • Printer
  • Filter query parser

Value types are currently being worked on in the value-type-abstraction branch.

Progress:

  • There are value types for all previously available columns.
  • Value type objects are available from column, row and header objects.
  • The CSV writer uses value types to write String representations.

Locale is not taken into account in operations such as column copy

Hi,
I changed the Locale to FRENCH in the NumberUtil class, but it is not taken into account when copying a column (I tried with the Double type, but I think it is the same for other types).

The only place where it is used is the DataFrame.print function.
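
For illustration, the underlying discrepancy can be reproduced with plain java.text.NumberFormat; this is not library code, just a sketch of why the Locale matters when parsing Double values:

import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

public class LocaleParseDemo {
    public static void main(String[] args) throws ParseException {
        String value = "3,14"; // French-style decimal separator

        NumberFormat french = NumberFormat.getInstance(Locale.FRENCH);
        NumberFormat english = NumberFormat.getInstance(Locale.ENGLISH);

        System.out.println(french.parse(value));  // 3.14 - comma is the decimal separator
        System.out.println(english.parse(value)); // 314  - comma is read as a grouping separator
    }
}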
