
Java dataframe and visualization library

Home Page: https://jtablesaw.github.io/tablesaw/

License: Apache License 2.0


tablesaw's Introduction

Tablesaw


Overview

Tablesaw is a dataframe and visualization library that supports loading, cleaning, transforming, filtering, and summarizing data. If you work with data in Java, it may save you time and effort. Tablesaw also supports descriptive statistics and can be used to prepare data for machine learning libraries such as Smile, Tribuo, H2O.ai, and DL4J.

Tablesaw features

Data processing & transformation

  • Import data from RDBMS, Excel, CSV, TSV, JSON, HTML, or Fixed Width text files, whether they are local or remote (http, S3, etc.)
  • Export data to CSV, JSON, HTML or Fixed Width files.
  • Combine tables by appending or joining
  • Add and remove columns or rows
  • Sort, Group, Filter, Edit, Transpose, etc.
  • Map/Reduce operations
  • Handle missing values

Visualization

Tablesaw supports data visualization by providing a wrapper for the Plot.ly JavaScript plotting library. Here are a few examples of the new library in action.

[Figures: example plots; image alt text "Tornadoes"]

Statistics

  • Descriptive stats: mean, min, max, median, sum, product, standard deviation, variance, percentiles, geometric mean, skewness, kurtosis, etc.

Getting started

Add tablesaw-core to your project. You can find the version number for the latest release in the release notes:

<dependency>
    <groupId>tech.tablesaw</groupId>
    <artifactId>tablesaw-core</artifactId>
    <version>VERSION_NUMBER_GOES_HERE</version>
</dependency>

You may also add supporting projects:

  • tablesaw-beakerx - for using Tablesaw inside BeakerX
  • tablesaw-excel - for using Excel workbooks
  • tablesaw-html - for using HTML
  • tablesaw-json - for using JSON
  • tablesaw-jsplot - for creating charts

External supporting projects - outside of this organization:

Documentation and support

Integrations

Jupyter Notebooks

Other integrations

tablesaw's People

Contributors

agebhar1, antoine-guillou, ashvina, benmccann, carl-rabbit, ccleva, danielmao1, dependabot[bot], ebalaitung, emilianbold, gregorco, hallvard, jackie-h, jbsooter, kerwinooooo, kiamesdavies, lina2002, lujop, lwhite-fmi, lwhite1, mario-s, murtuza-ranapur, numericoverflow, r1j1t, richiethom, ryancerf, s1ny1998, smarks, ustitc, yukoba


tablesaw's Issues

Enhance .saw format compression

The Saw data store uses Snappy for compression by default. It could possibly use type-specific compression that is both smaller and faster (Roaring Bitmaps for Booleans, FastPFOR for Ints). Bitmaps could double as the in-memory representation, eliminating translation overhead.

Evaluate use of specialized compression for in-memory data

It's possible that both boolean and integer columns could use specialized compression that allows operations on the data in its compressed form. Booleans could use Roaring Bitmaps, for example. Integers could use integer compression (library from groupon?). Ints are especially important, as they're used for dates, times, and categorical data, as well as for IntColumns.
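As a rough illustration of operating on booleans in bitmap form, the sketch below uses `java.util.BitSet` as a stand-in for a compressed bitmap such as Roaring. The class `BitmapBooleanColumn` and its methods are hypothetical and are not part of Tablesaw; `BitSet` is uncompressed, so this only shows the shape of the idea.

```java
import java.util.BitSet;

/** Hypothetical boolean column stored as a bitmap; BitSet stands in for a compressed bitmap. */
final class BitmapBooleanColumn {
    private final BitSet trueBits = new BitSet(); // bit r is set when row r is true
    private int size = 0;

    void append(boolean value) {
        if (value) {
            trueBits.set(size);
        }
        size++;
    }

    int size() {
        return size;
    }

    int countTrue() {
        return trueBits.cardinality();
    }

    int countFalse() {
        return size - trueBits.cardinality();
    }

    /** Row-wise AND computed on the bitmaps themselves, without unpacking to booleans. */
    BitmapBooleanColumn and(BitmapBooleanColumn other) {
        BitmapBooleanColumn result = new BitmapBooleanColumn();
        result.trueBits.or(trueBits);        // copy this column's bits
        result.trueBits.and(other.trueBits); // intersect with the other column
        result.size = Math.min(size, other.size);
        return result;
    }
}
```

Counting and combining happen directly on the bitmap, which is the property the issue hopes a compressed representation would keep.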

enable parallel query execution

Using index results, you can divide the keys among workers and reassemble the results in order.

Standard queries may require a different approach if we are to maintain result order, à la kdb.
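One way to keep result order while splitting the work is sketched below. It divides row positions rather than index keys, and relies on the fact that an ordered Java stream keeps its encounter order in `toArray()` even when run in parallel. `ParallelFilterSketch` and `rowsGreaterThan` are hypothetical names, not Tablesaw API.

```java
import java.util.stream.IntStream;

final class ParallelFilterSketch {
    /** Returns the row positions whose value exceeds the threshold, in ascending row order. */
    static int[] rowsGreaterThan(int[] column, int threshold) {
        return IntStream.range(0, column.length)
                .parallel()                              // rows are split across worker threads
                .filter(row -> column[row] > threshold)
                .toArray();                              // ordered stream: results stay in row order
    }
}
```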

Convert operations to use indexes when they are present

For an integer column, the following operations can all be implemented on top of indexes, rather than on the column itself, and so should be more efficient:

sum()
mean()
median() (and all other percentile operations)
max()/min()
equalTo(), greaterThan(), atLeast(), atMost(), lessThan()
between() and all variations on between()
standardDeviation()
variance()
histogram binning?
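To illustrate why such operations can be cheaper on an index, here is a minimal sketch that assumes the index is a sorted map from each distinct value to the number of rows holding it. A real index would map values to row positions; `IntIndexSketch` is a hypothetical stand-in and assumes a non-empty column.

```java
import java.util.Map;
import java.util.TreeMap;

final class IntIndexSketch {
    // distinct value -> number of rows holding that value, kept in sorted key order
    private final TreeMap<Integer, Integer> counts = new TreeMap<>();
    private long rows = 0;

    IntIndexSketch(int[] column) {
        for (int v : column) {
            counts.merge(v, 1, Integer::sum);
            rows++;
        }
    }

    /** Touches one entry per distinct value instead of one per row. */
    long sum() {
        long total = 0;
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            total += (long) e.getKey() * e.getValue();
        }
        return total;
    }

    double mean() {
        return (double) sum() / rows;
    }

    int min() {
        return counts.firstKey(); // smallest key of the sorted map
    }

    int max() {
        return counts.lastKey(); // largest key of the sorted map
    }

    /** Number of rows with lower <= value <= upper; requires lower <= upper. */
    long countBetween(int lower, int upper) {
        long n = 0;
        for (int c : counts.subMap(lower, true, upper, true).values()) {
            n += c;
        }
        return n;
    }
}
```

With many repeated values, `sum()` and `countBetween()` visit far fewer entries than a scan of the column, and `min()`/`max()` become lookups at the ends of the sorted map.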

lots of functions in FloatColumn seem to be broken

Here is an example:

    public FloatColumn round() {
        FloatColumn newColumn = create(this.name() + "[rounded]");

        for (int r = 0; r < this.size(); ++r) {
            float value = this.data.getFloat(r);
            newColumn.set(r, (float) Math.round(value));
        }

        return newColumn;
    }

The set fails because the length of newColumn is 0.
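One possible fix is to append to the new column instead of setting by position. The sketch below shows the idea on a hypothetical `SimpleFloatColumn` backed by a list; it is not Tablesaw's `FloatColumn`, and its `append` method is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a float column; not Tablesaw's FloatColumn. */
final class SimpleFloatColumn {
    private final String name;
    private final List<Float> data = new ArrayList<>();

    SimpleFloatColumn(String name) {
        this.name = name;
    }

    String name() {
        return name;
    }

    int size() {
        return data.size();
    }

    float get(int row) {
        return data.get(row);
    }

    void append(float value) {
        data.add(value);
    }

    SimpleFloatColumn round() {
        SimpleFloatColumn newColumn = new SimpleFloatColumn(name + "[rounded]");
        for (int r = 0; r < size(); r++) {
            // append grows the new column; setting position r on an empty column would fail
            newColumn.append((float) Math.round(get(r)));
        }
        return newColumn;
    }
}
```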

is there no way to copy columns or tables?

I can't see one. If there isn't, I think we should add a way of copying both columns and tables. Otherwise, if you add a column from one Table to another and then modify that column, you corrupt the original Table.

Clearly, with big-data workflows you want to avoid copying, but Table will also be useful in many cases where people are dealing with much smaller things, for instance a Table of summary statistics that you want to save as CSV. Calling copy in these cases won't be an issue.

The other approach is to make Column immutable and so you don't have to defensively copy. It may be too late for that here.

Change the sort api to allow the use of "-columnName" to indicate a descending sort

Sorting currently defaults to ascending order:
t.sortOn("column1", "column2");
sorts in ascending order based on the values in the two given columns.

To specify a descending sort, you have to use:
t.sortDescendingOn("column1");

If you want to mix ascending and descending order, for example sorting by age starting with the youngest and then by height starting with the tallest, you have to construct a Sort object and pass it to a specialized sortOn method. It should be possible to simply write:
sortOn("age", "-height"); // by age starting with the youngest, then by height starting with the tallest

This can be implemented by parsing the column names and constructing a Sort object behind the scenes.
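The parsing step could look roughly like the sketch below, which turns names such as "age" or "-height" into a chained comparator. Rows are modelled as `Map<String, Integer>` purely for illustration; `SortSpecSketch` is hypothetical and does not use Tablesaw's Sort class.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class SortSpecSketch {
    /** Builds a row comparator from names such as "age" (ascending) or "-height" (descending). */
    static Comparator<Map<String, Integer>> parse(String... columnNames) {
        Comparator<Map<String, Integer>> result = null;
        for (String raw : columnNames) {
            boolean descending = raw.startsWith("-");
            String name = descending ? raw.substring(1) : raw;
            Comparator<Map<String, Integer>> next = Comparator.comparing(row -> row.get(name));
            if (descending) {
                next = next.reversed();
            }
            // earlier names take precedence; later names only break ties
            result = (result == null) ? next : result.thenComparing(next);
        }
        if (result == null) {
            throw new IllegalArgumentException("at least one column name is required");
        }
        return result;
    }

    /** Returns a sorted copy, leaving the input list untouched. */
    static List<Map<String, Integer>> sortOn(List<Map<String, Integer>> rows, String... columnNames) {
        List<Map<String, Integer>> sorted = new ArrayList<>(rows);
        sorted.sort(parse(columnNames));
        return sorted;
    }
}
```

One design consequence worth noting: with this convention a column whose real name begins with "-" could not be sorted ascending without some escape mechanism.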

Store Category columns in .saw format with dictionary encoding intact

When loading a very large file from disk using .saw, it seemed like the encoding process took most of the time. The file should be written encoded as it is in memory, and read back the same way, avoiding as much work in building the dictionary as possible.

This will have the positive side effect of significantly reducing disk space requirements for category columns.
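A minimal sketch of the idea, assuming a simple layout of dictionary entries followed by one integer code per row. This is not the .saw format; `DictionaryColumnSketch` and its byte layout are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DictionaryColumnSketch {
    final List<String> dictionary = new ArrayList<>(); // code -> category
    final List<Integer> codes = new ArrayList<>();      // one code per row
    private final Map<String, Integer> lookup = new HashMap<>(); // category -> code

    void append(String category) {
        Integer code = lookup.get(category);
        if (code == null) {
            code = dictionary.size();
            dictionary.add(category);
            lookup.put(category, code);
        }
        codes.add(code);
    }

    String get(int row) {
        return dictionary.get(codes.get(row));
    }

    /** Writes the dictionary once, then the per-row codes, exactly as held in memory. */
    byte[] write() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(dictionary.size());
            for (String s : dictionary) {
                out.writeUTF(s);
            }
            out.writeInt(codes.size());
            for (int c : codes) {
                out.writeInt(c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads codes back directly; each distinct string is handled once, not once per row. */
    static DictionaryColumnSketch read(byte[] data) {
        DictionaryColumnSketch column = new DictionaryColumnSketch();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int dictSize = in.readInt();
            for (int i = 0; i < dictSize; i++) {
                String s = in.readUTF();
                column.dictionary.add(s);
                column.lookup.put(s, i);
            }
            int rowCount = in.readInt();
            for (int i = 0; i < rowCount; i++) {
                column.codes.add(in.readInt());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return column;
    }
}
```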

Ensure that the names of columns are unique

when compared in a case-insensitive way.

This will require changes to Table and also to CSV I/O. (Result sets, at least those originating in an RDBMS, must have unique names.)
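A case-insensitive uniqueness check could be as small as the sketch below. `ColumnNameCheck.requireUnique` is a hypothetical helper, and throwing `IllegalArgumentException` on a duplicate is an assumption about the desired behavior.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class ColumnNameCheck {
    /** Throws if two names are equal ignoring case, e.g. "Age" and "AGE". */
    static void requireUnique(List<String> columnNames) {
        Set<String> seen = new HashSet<>();
        for (String name : columnNames) {
            // Locale.ROOT keeps the comparison independent of the default locale
            if (!seen.add(name.toLowerCase(Locale.ROOT))) {
                throw new IllegalArgumentException("Duplicate column name (ignoring case): " + name);
            }
        }
    }
}
```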

Add difference function for numeric and date-time cols

The user should be able to "difference" a temporal column. The function would be an instance method on the column. The template below is for an IntColumn version. The temporal version would take a TimeUnit argument.

/**
 * Returns a new column of the same type as the receiver, such that the values in the new column
 * contain the difference between each cell in the original and its predecessor.
 * The value for the first cell in the new column would be the missing value indicator for that column
 * (e.g. IntColumn.MISSING_VALUE)
 */
IntColumn difference() {..... }

See the attached pdf for an example of the output.
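For the integer case, the described behavior can be sketched on a plain `int[]`. The `MISSING_VALUE` constant below is a hypothetical stand-in for IntColumn.MISSING_VALUE, and propagating a missing input into the output is an assumption the issue does not spell out.

```java
final class DifferenceSketch {
    /** Hypothetical missing-value marker; stands in for IntColumn.MISSING_VALUE. */
    static final int MISSING_VALUE = Integer.MIN_VALUE;

    /** result[i] = values[i] - values[i - 1]; the first cell has no predecessor and is missing. */
    static int[] difference(int[] values) {
        int[] result = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            if (i == 0 || values[i] == MISSING_VALUE || values[i - 1] == MISSING_VALUE) {
                // assumption: a missing operand yields a missing difference
                result[i] = MISSING_VALUE;
            } else {
                result[i] = values[i] - values[i - 1];
            }
        }
        return result;
    }
}
```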

Indexing takes too long

The current index implementation takes too long to perform an index operation on a large int column. Indexing times vary but are in the 7 to 8 minute range on a 500 million record column.

Parallel CSV file imports

CSV imports are orders of magnitude slower than Saw file loads. Try to speed things up by opening multiple importers and loading separate columns with each. (Activity Monitor suggests that CSV imports are CPU-bound and that they currently use only one CPU.)
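The per-column split might look like the following sketch. It assumes the file has already been split into string cells and that every column holds integers; `ParallelColumnParseSketch` is hypothetical and skips real CSV concerns such as quoting and type detection.

```java
import java.util.stream.IntStream;

final class ParallelColumnParseSketch {
    /** Parses each column on its own task; result[c][r] holds column c, row r. */
    static int[][] parseColumns(String[][] rows, int columnCount) {
        int[][] columns = new int[columnCount][rows.length];
        IntStream.range(0, columnCount).parallel().forEach(c -> {
            // each task writes only to its own column array, so no locking is needed
            for (int r = 0; r < rows.length; r++) {
                columns[c][r] = Integer.parseInt(rows[r][c].trim());
            }
        });
        return columns;
    }
}
```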

Add missing value support to boolean columns

Java's boolean type doesn't offer much support for dealing with missing values. Convert the column to use byte instead of boolean, and use Byte.MIN_VALUE as the missing value.
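A small sketch of the proposed encoding, with Byte.MIN_VALUE as the missing marker as suggested above. The choice of 0 and 1 for false and true, and the `ByteBooleanSketch` helper itself, are assumptions made for illustration.

```java
final class ByteBooleanSketch {
    static final byte MISSING = Byte.MIN_VALUE;
    static final byte FALSE = 0; // assumed encoding for false
    static final byte TRUE = 1;  // assumed encoding for true

    /** A null input represents a missing value. */
    static byte encode(Boolean value) {
        if (value == null) {
            return MISSING;
        }
        return value ? TRUE : FALSE;
    }

    /** Returns null for the missing marker. */
    static Boolean decode(byte stored) {
        if (stored == MISSING) {
            return null;
        }
        return stored == TRUE;
    }

    static int countMissing(byte[] column) {
        int n = 0;
        for (byte b : column) {
            if (b == MISSING) {
                n++;
            }
        }
        return n;
    }
}
```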

Feature: auto indexing

Implement auto indexing to speed up search. Indexing could occur on load or on the first query that uses a particular column.
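The "index on first query" variant can be sketched as lazy initialization. `LazilyIndexedIntColumn` is a hypothetical class using a plain hash map as the index; it is not thread-safe and does not reflect Tablesaw's index classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LazilyIndexedIntColumn {
    private final int[] data;
    private Map<Integer, List<Integer>> index; // value -> row positions; null until first query

    LazilyIndexedIntColumn(int[] data) {
        this.data = data.clone();
    }

    boolean isIndexed() {
        return index != null;
    }

    /** The first call pays for building the index; later calls are map lookups. */
    List<Integer> rowsEqualTo(int value) {
        if (index == null) {
            buildIndex();
        }
        return index.getOrDefault(value, List.of());
    }

    private void buildIndex() {
        index = new HashMap<>();
        for (int row = 0; row < data.length; row++) {
            index.computeIfAbsent(data[row], k -> new ArrayList<>()).add(row);
        }
    }
}
```

Indexing on load would instead call `buildIndex()` from the constructor, trading slower loads for a faster first query.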

Add Apache license text to every file

like so:

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
