amplab / succinct

Enabling queries on compressed data.

Home Page: succinct.cs.berkeley.edu

License: Apache License 2.0

Languages: Shell 1.40%, Java 75.81%, Scala 22.79%
Topics: big-data, compression, succinct, spark, java, scala

succinct's People

Contributors

anuragkh, concretevitamin, khandelwalwires, koertkuipers, maocorte, muditsin, poolis, sinabz, ujvl


succinct's Issues

Support for faster scans

Faster scans can be supported by maintaining a Snappy-compressed representation of the data alongside the Succinct data structures: operations on the Succinct RDDs / DataFrames that require full scans (e.g., aggregates) can execute efficiently on the alternate representation, while search and random-access queries are handled by the Succinct data structures. The two representations should remain under the hood, exposing a single unified interface for the Succinct RDDs / DataFrames.
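
A rough sketch of how such a hybrid might look, assuming a hypothetical HybridSuccinctRDD wrapper that is not part of the current API (the SuccinctRDD search signature is assumed from the README usage):

import org.apache.spark.rdd.RDD
import edu.berkeley.cs.succinct.SuccinctRDD

// Hypothetical wrapper holding both representations; only this unified
// interface would be exposed to users.
class HybridSuccinctRDD(succinct: SuccinctRDD, rowData: RDD[Array[Byte]]) {

  // Search and random access go to the Succinct data structures.
  def search(query: String): RDD[Long] = succinct.search(query)

  // Full scans (e.g., aggregates) run over the alternate row-oriented copy,
  // which could be kept Snappy-compressed via Spark's block compression.
  def countMatching(predicate: Array[Byte] => Boolean): Long =
    rowData.filter(predicate).count()
}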

Please add support for ArrayType

scala> spark.read.json(rdd)
res59: org.apache.spark.sql.DataFrame = [ProgressionRegex: string, Progressions: array ... 27 more fields]

scala> val flattenedDf = spark.read.json(rdd)
flattenedDf: org.apache.spark.sql.DataFrame = [ProgressionRegex: string, Progressions: array ... 27 more fields]

scala> val flattenedSuccinctDf = flattenedDf.toSuccinctDF
java.lang.IllegalArgumentException: Unexpected type. ArrayType(StringType,true)
at edu.berkeley.cs.succinct.sql.SuccinctTableRDD$$anonfun$7.apply(SuccinctTableRDD.scala:224)
at edu.berkeley.cs.succinct.sql.SuccinctTableRDD$$anonfun$7.apply(SuccinctTableRDD.scala:214)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at edu.berkeley.cs.succinct.sql.SuccinctTableRDD$.min(SuccinctTableRDD.scala:214)
at edu.berkeley.cs.succinct.sql.SuccinctTableRDD$.apply(SuccinctTableRDD.scala:158)
at edu.berkeley.cs.succinct.sql.SuccinctTableRDD$.apply(SuccinctTableRDD.scala:176)
at edu.berkeley.cs.succinct.sql.SuccinctInMemoryRelation.<init>(SuccinctInMemoryRelation.scala:11)
at edu.berkeley.cs.succinct.sql.package$SuccinctDataFrame.toSuccinctDF(package.scala:33)
... 58 elided
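
Until ArrayType is handled natively, one possible workaround (a sketch; it assumes Progressions is an array of strings) is to flatten array columns into plain strings before converting:

import org.apache.spark.sql.functions.{col, concat_ws}

// Join each array<string> column into a single delimited string so that
// toSuccinctDF only sees supported column types. The delimiter and the
// assumption that Progressions holds strings are illustrative.
val stringifiedDf = flattenedDf.withColumn("Progressions", concat_ws("|", col("Progressions")))
val succinctDf = stringifiedDf.toSuccinctDF  // should no longer hit the ArrayType check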

Publishing to Spark Packages

Hi,

I noticed that you ran into several problems while publishing a release to Spark Packages. I strongly suggest that you use the sbt-spark-package plugin, as it would greatly ease your publishing process. The Spark Packages repo has several different requirements regarding publishing, which you can find here. Please let me know if you have any questions, and we would greatly appreciate any feedback you might have!

Thanks,
Burak

Could you share the C++ version of Succinct?

Hi,

I recently read the Succinct paper. I think it would be a great tool for text analysis, and I want to explore more use cases with Succinct. I wonder if you could share the C++ implementation of the Succinct core, so I could run more experiments?

Thanks

Add regexMatch to SuccinctKVRDD

SuccinctKVRDD currently supports a regexSearch method, which returns an RDD of keys for documents that contain matches for a regular expression. We should add support for a regexMatch method as follows:

def regexMatch(query: String): RDD[(K, RegExMatch)]

where each RegExMatch encapsulates:

  • The offset into the value for the match
  • The length of the match

We already have a similar method in SuccinctRDD; it should be a simple translation to SuccinctKVRDD.
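
A hedged sketch of how the result would be consumed, assuming RegExMatch exposes getOffset/getLength as in the core library (adjust if the accessor names differ):

import org.apache.spark.rdd.RDD
import edu.berkeley.cs.succinct.regex.RegExMatch

// Given the proposed regexMatch output, pull out (key, offset, length) triples.
def matchSpans[K](matches: RDD[(K, RegExMatch)]): RDD[(K, Long, Int)] =
  matches.map { case (key, m) => (key, m.getOffset, m.getLength) }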

Add support for JSON documents

It would be nice to have support for JSON documents, perhaps through a SuccinctJsonRDD. Each JSON object would have an associated primary key, which could be a field within the document itself called "id". It should support the following semantics:

// Return the JSON document associated with the given ID.
def get(id: Long): String

// Return an RDD of IDs for documents that match a particular JSON field value
def filter(field: String, value: String): RDD[Long]

// Return the RDD of IDs for documents that contain a particular query term
def search(query: String): RDD[Long]
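
Collected into a compilable sketch (the trait and the helper below are hypothetical; only the three signatures above come from the proposal):

import org.apache.spark.rdd.RDD

// Hypothetical interface for the proposed SuccinctJsonRDD (signatures from above).
trait SuccinctJsonRDD {
  def get(id: Long): String
  def filter(field: String, value: String): RDD[Long]
  def search(query: String): RDD[Long]
}

// Example composition: fetch the documents in which a given field takes a given value.
def docsWithFieldValue(json: SuccinctJsonRDD, field: String, value: String): Array[String] =
  json.filter(field, value).collect().map(json.get)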

Clean up the Spark SQL interface

The current Spark SQL interface still uses constructs from Spark 1.4.0. The implementation could benefit from newer optimizations if the data source were brought up to date with Spark 2.0 constructs, including Datasets.

Additionally, the current Spark SQL interface supports only a subset of operators; we should add support for all Spark SQL operators.

Add support for bulk appends

It would be nice to support bulk appends for SuccinctRDD and SuccinctKVRDD[K] as follows:

// For SuccinctRDD; the preservePartitioning flag dictates whether the 
// partitioning scheme for the data RDD should be preserved
def bulkAppend(data: RDD[Array[Byte]], preservePartitioning: Boolean = false)

// For SuccinctKVRDD
def bulkAppend[K](data: RDD[(K, Array[Byte])], preservePartitioning: Boolean = false)
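
A hedged usage sketch: the .succinct construction follows the README-style usage shown in the OOM issue further down this page, while bulkAppend itself is the proposed addition and does not exist yet.

import org.apache.spark.rdd.RDD
import edu.berkeley.cs.succinct._

// Build a SuccinctRDD from an existing dataset (construction as in the README),
// then append freshly arriving records in bulk.
val base = sc.textFile("hdfs://data/base.txt").map(_.getBytes).succinct
val extra: RDD[Array[Byte]] = sc.textFile("hdfs://data/incoming.txt").map(_.getBytes)
base.bulkAppend(extra, preservePartitioning = true)  // proposed API, not yet available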

Convert an arbitrary DataFrame to a Succinct DataFrame

Right now, to convert an arbitrary DataFrame to a Succinct DataFrame, I need to write it out to a file and then read it back, as follows:

val df = spark.sql("SELECT * from table_name limit 100") //Cassandra Query
df.write.format("edu.berkeley.cs.succinct.sql").save("/path/to/data")
val succinctCities = sqlContext.succinctTable("/path/to/data")

It would be great to have functionality that converts an arbitrary DataFrame to a Succinct DataFrame with a single method call, if possible.
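
For what it's worth, the ArrayType issue above already invokes an in-memory conversion through an implicit on DataFrame; a hedged sketch of calling it directly, assuming the implicits live in the edu.berkeley.cs.succinct.sql package object as that stack trace suggests:

import edu.berkeley.cs.succinct.sql._   // assumed to provide the toSuccinctDF implicit

// Convert the query result directly, with no intermediate write to disk.
val df = spark.sql("SELECT * FROM table_name LIMIT 100")
val succinctDf = df.toSuccinctDF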

@anuragkh

RDD Size estimation

Spark's RDD size estimation often fails for Succinct RDDs / DataFrames, leading to incorrect cache decisions (e.g., not caching partitions even when abundant RAM is available). Spark 1.6.0 exposes a KnownSizeEstimation trait, which enables classes to report their own size. Succinct RDDs / DataFrames should implement this trait and report the correct size of the Succinct data structures.
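
A minimal sketch of what implementing the hook could look like. Note that KnownSizeEstimation is package-private to Spark, so in practice the implementing class would need to live under an org.apache.spark package (or the size would have to be reported through another mechanism):

import org.apache.spark.util.KnownSizeEstimation

// Hypothetical partition wrapper that reports the true size of its Succinct
// buffers instead of letting SizeEstimator traverse (and misjudge) them.
class SizedSuccinctPartition(succinctBufferBytes: Long) extends KnownSizeEstimation {
  // SizeEstimator consults this value when Spark decides whether the
  // partition fits in the storage memory pool.
  override def estimatedSize: Long = succinctBufferBytes
}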

Fast join

Hello, sorry for asking this question here but I couldn't find a mailing list or similar.

Does Succinct provide (or plan to provide) a faster join implementation that does not need to shuffle keys in both datasets?

I have two datasets, A and B, where B << A (so I want to avoid sorting A), but B is not small enough to use a broadcast join.
Is this a potential use case for Succinct?
Thanks

Add support for case-insensitive search

The current search() implementation performs a case-sensitive match. We should add a flag to the search call to make it case-insensitive, i.e.,

def search(query: String, caseInsensitive: Boolean = false): RDD[Long]

For a query string abc, it would be equivalent to executing the following regular expression query:

[aA][bB][cC]
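
For example, a literal query could be expanded into that character-class form with a small helper like the one below (a sketch, not part of the codebase; special regex characters would still need escaping):

// Expand a literal query into an equivalent case-insensitive character-class
// regex, e.g. "abc" -> "[aA][bB][cC]". Non-letter characters pass through.
def toCaseInsensitiveRegex(query: String): String =
  query.map { c =>
    if (c.isLetter) s"[${c.toLower}${c.toUpper}]" else c.toString
  }.mkString

// toCaseInsensitiveRegex("abc") == "[aA][bB][cC]"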

Is succinct suitable for OLAP?

I am implementing a new data warehouse (https://github.com/shunfei/indexr) aimed at OLAP scenarios. It should support fast scans, a good compression ratio, and fast enough random access (OK to be slower than HBase, but not too slow). It does not need to be updatable. We group rows into files and store them on HDFS, then run queries on them. It is designed to store very large datasets, e.g., over 100 TB, while remaining fast enough for ad-hoc queries.

Currently I separate the data into blocks, where each block is compressed according to its data types, and index the data at the block level. This design greatly reduces data size, and with the help of block-level indices, random-access queries are not too bad, since they can filter out most irrelevant blocks.

The main problem now is the STRING type. Decompressing strings is too expensive, and search queries are very slow because they cannot take advantage of my indices (which are not designed for search). My last option is tokenizing those strings and using an inverted index, like Lucene.

Then I searched around and found Succinct. Direct queries on compressed data, with search supported natively, seems perfect! The benchmarks look very promising.

My questions for Succinct are:

  1. Are there any improvements to the compression speed? According to this post (http://succinct.cs.berkeley.edu/wp/wordpress/), the process is a bit too slow.
  2. How is the performance of scans? Scanning is a very important feature for big-data analysis.
  3. Do you think Succinct is suitable for OLAP scenarios? Any suggestions for me if I decide to work on it?

Thank you for your great work!

IllegalArgumentException when writing SuccinctTable

Hi,
I was trying to convert a Parquet table to a Succinct table, but I was not able to write it using saveAsTable().

My code to produce the Succinct table is fairly straightforward:

spark.sql("select * from my_table")
      .write
      .format("edu.berkeley.cs.succinct.sql")
      .option("path", succinctDir)
      .saveAsTable("my_table_succinct")

The error message shows the stack trace below:

java.lang.IllegalArgumentException: Negative initial size: -1683087478
	at java.io.ByteArrayOutputStream.<init>(ByteArrayOutputStream.java:74)
	at edu.berkeley.cs.succinct.sql.SuccinctTableRDD$.createSuccinctTablePartition(SuccinctTableRDD.scala:202)
	at edu.berkeley.cs.succinct.sql.SuccinctTableRDD$$anonfun$6.apply(SuccinctTableRDD.scala:163)
	at edu.berkeley.cs.succinct.sql.SuccinctTableRDD$$anonfun$6.apply(SuccinctTableRDD.scala:163)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
	at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

I guessed it exceeds the 2 GB limit on shuffle partitions and increased spark.sql.shuffle.partitions to a large number; however, I still get the same error. Is this a misconfiguration in Spark or a bug in the Succinct library?

Thanks.
Hao Luo

Version 0.1.8 isn't uploaded to Maven Central

Hi,
The instructions in the README recommend using version 0.1.8, but I found that this version isn't uploaded to Maven Central.
Did you perhaps forget to upload version 0.1.8?

Cheers,

OOM at spark shell in local mode

Hi, thank you for open sourcing this project.

I tried to run it with Spark 1.5.2 in local mode from the spark-shell on two datasets:

  1. A 300 MB .gz file (2.1 GB uncompressed).

    I consistently get an OOM (Java heap space); it does not matter whether the input is a single non-splittable .gz or an uncompressed text file:

    15/12/15 12:27:14 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
    java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.lang.StringCoding.safeTrim(StringCoding.java:79)
    at java.lang.StringCoding.access$300(StringCoding.java:50)
    at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:305)
    at java.lang.StringCoding.encode(StringCoding.java:344)
    at java.lang.StringCoding.encode(StringCoding.java:387)
    at java.lang.String.getBytes(String.java:956)
    at $line20.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:21)
    at $line20.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:21)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    
  2. An 8 MB sample of the same text (the first 200,000 lines).

Every time on the 8 MB sample I get an OOM again, a GC overhead limit this time:

15/12/15 12:36:55 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 4, localhost): java.lang.OutOfMemoryError: GC overhead limit exceeded
at edu.berkeley.cs.succinct.buffers.SuccinctBuffer.constructNPA(SuccinctBuffer.java:445)
at edu.berkeley.cs.succinct.buffers.SuccinctBuffer.construct(SuccinctBuffer.java:307)
at edu.berkeley.cs.succinct.buffers.SuccinctBuffer.<init>(SuccinctBuffer.java:81)
at edu.berkeley.cs.succinct.buffers.SuccinctFileBuffer.<init>(SuccinctFileBuffer.java:30)
at edu.berkeley.cs.succinct.buffers.SuccinctIndexedFileBuffer.<init>(SuccinctIndexedFileBuffer.java:31)
at edu.berkeley.cs.succinct.buffers.SuccinctIndexedFileBuffer.<init>(SuccinctIndexedFileBuffer.java:42)
at edu.berkeley.cs.succinct.SuccinctRDD$.createSuccinctPartition(SuccinctRDD.scala:288)

The code I tried was

import edu.berkeley.cs.succinct._
val wikiData = sc.textFile("....").map(_.getBytes)
val wikiSuccinctData = wikiData.succinct
//or wikiData.saveAsSuccinctFile(...)

I also tried adding the same parameters you use for spark-submit, like:

./bin/spark-shell --executor-memory 1G --driver-memory 1G --packages amplab:succinct:0.1.6

with no success yet.

Please advise!

Support for Spark 2.0

Port the code to work with Spark 2.0, and make sure the APIs work with:

  • SparkSession
  • Datasets/DataFrames

Also make sure that any changes/additions/removals do not change the behavior of, or break, any of Succinct's features.

BinaryType not supported in a Succinct DataFrame

val df = spark.sql("SELECT * from table_name limit 100")
df.write.format("edu.berkeley.cs.succinct.sql").save("/Users/hello/df")

This is what happens:

java.lang.IllegalArgumentException: Unexpected type. BinaryType
at edu.berkeley.cs.succinct.sql.SuccinctTableRDD$$anonfun$7.apply(SuccinctTableRDD.scala:223)
at edu.berkeley.cs.succinct.sql.SuccinctTableRDD$$anonfun$7.apply(SuccinctTableRDD.scala:213)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at edu.berkeley.cs.succinct.sql.SuccinctTableRDD$.min(SuccinctTableRDD.scala:213)
at edu.berkeley.cs.succinct.sql.SuccinctTableRDD$.apply(SuccinctTableRDD.scala:157)
at edu.berkeley.cs.succinct.sql.SuccinctTableRDD$.apply(SuccinctTableRDD.scala:175)
at edu.berkeley.cs.succinct.sql.package$SuccinctDataFrame.saveAsSuccinctTable(package.scala:29)
at edu.berkeley.cs.succinct.sql.DefaultSource.createRelation(DefaultSource.scala:46)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:429)

@anuragkh
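
Until BinaryType is supported, one possible workaround is to encode binary columns as strings before saving (a sketch; the column name payload is a placeholder):

import org.apache.spark.sql.functions.{base64, col}

// Replace the binary column with its Base64 string encoding so the Succinct
// data source only sees supported column types.
val stringified = df.withColumn("payload", base64(col("payload")))
stringified.write.format("edu.berkeley.cs.succinct.sql").save("/Users/hello/df")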

Problems with Spark 1.4

I've built Succinct with Spark 1.4, expecting a smooth migration, but the SQL module generates several errors. I've resolved a few of them but still haven't figured out how to fix serialVersionUID mismatches like the ones below.

[info] - dsl test *** FAILED ***
[info] java.io.InvalidClassException: org.apache.spark.sql.types.StructType; local class incompatible: stream classdesc serialVersionUID = 8479641856817081483, local class serialVersionUID = -7860166653361823912
[info] at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
[info] at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
[info] at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
[info] at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
[info] at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
[info] at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
[info] at edu.berkeley.cs.succinct.sql.SuccinctUtils$.readObjectFromFS(SuccinctUtils.scala:40)
[info] at edu.berkeley.cs.succinct.sql.SuccinctRelation.getSchema(SuccinctRelation.scala:29)
[info] at edu.berkeley.cs.succinct.sql.SuccinctRelation.<init>(SuccinctRelation.scala:14)
[info] at edu.berkeley.cs.succinct.sql.package$SuccinctContext.succinctFile(package.scala:12)
[info] ...

[info] - sql test *** FAILED ***
[info] java.io.InvalidClassException: org.apache.spark.sql.types.StructType; local class incompatible: stream classdesc serialVersionUID = 8479641856817081483, local class serialVersionUID = -7860166653361823912
[info] at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
[info] at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
[info] at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
[info] at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
[info] at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
[info] at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
[info] at edu.berkeley.cs.succinct.sql.SuccinctUtils$.readObjectFromFS(SuccinctUtils.scala:40)
[info] at edu.berkeley.cs.succinct.sql.SuccinctRelation.getSchema(SuccinctRelation.scala:29)
[info] at edu.berkeley.cs.succinct.sql.SuccinctRelation.<init>(SuccinctRelation.scala:14)
[info] at edu.berkeley.cs.succinct.sql.DefaultSource.createRelation(DefaultSource.scala:18)
[info] ...

Support non-ASCII characters and arbitrary binary files

Succinct currently does not support all byte values, since some byte values are internally reserved as special markers. We can remove this constraint by switching to a larger alphabet internally (e.g., an integer range) and using values outside the byte range (-128 to 127) for the internal markers. This would allow Succinct to support arbitrary binary files in addition to non-ASCII characters.
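
A rough sketch of the widening idea, purely illustrative; the actual marker values and internal encoding in Succinct are assumptions here:

// Widen each input byte to a non-negative int symbol (0..255) so that values
// >= 256 stay free for internal markers. The marker value below is hypothetical.
val RecordSeparator = 256  // reserved symbol outside the byte range

def widen(record: Array[Byte]): Array[Int] =
  record.map(b => b & 0xFF) :+ RecordSeparator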
