spotify / sparkey-java
Java implementation of the Sparkey key value store
License: Apache License 2.0
Say you have a Sparkey index file and log file named as follows:
a-b-c.sparkey.spi and a-b-c.sparkey.spl. With the "base file" here being "a-b-c.sparkey", the getIndexFile method returns "a-b-c.spi" and the getLogFile method returns "a-b-c.spl", so the ".sparkey" part of the base name is silently dropped.
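A minimal sketch of the behavior the reporter expects, using hypothetical helper methods (the real logic lives inside the library): the index and log filenames should be formed by appending the suffix to the full base name, not by replacing whatever follows the base name's last dot.

```java
import java.io.File;

public class SparkeyFileNames {
    // Hypothetical helpers illustrating the expected suffix handling:
    // append ".spi"/".spl" to the full base path instead of replacing
    // the text after the base name's last dot.
    static File getIndexFile(File baseFile) {
        return new File(baseFile.getPath() + ".spi");
    }

    static File getLogFile(File baseFile) {
        return new File(baseFile.getPath() + ".spl");
    }

    public static void main(String[] args) {
        File base = new File("a-b-c.sparkey");
        // Expected: the ".sparkey" part of the base name is preserved.
        System.out.println(getIndexFile(base)); // a-b-c.sparkey.spi
        System.out.println(getLogFile(base));   // a-b-c.sparkey.spl
    }
}
```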
It seems you forgot the statement `displacement++` at line 275:
https://gist.github.com/zhenchuan/ac33042b68f4b01733da
I'm using Sparkey in Spark: one Sparkey reader per JVM executor, with the reader used by up to 4 executor threads. I'm seeing much higher memory usage than I expected.
I'm thinking the thread local reader might be the issue.
My code is like this:
import scala.util.Try
import com.spotify.sparkey.Sparkey
import java.io.File
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

object SparkeyState extends Serializable {
  val SPARKEY_PATH = "/usr/local/share/sparkey/"

  // This Sparkey dataset maps a UTF-8 string key to a 32-bit integer value.
  @transient lazy val SPARKEY_READER = Try(Sparkey.open(
    new File(SPARKEY_PATH)
  ))

  def getCount(token: String): Option[Int] = {
    SPARKEY_READER.map(reader =>
      Option(reader.getAsByteArray(token.getBytes(StandardCharsets.UTF_8))).map(
        intBytes => ByteBuffer.wrap(intBytes).getInt
      )
    ).getOrElse(None)
  }
}

val data = spark.read.csv("/datapath/").as[String]
data.map(s => SparkeyState.getCount(s)).write.csv("/outputpath/")
Is high memory usage expected in this scenario? My understanding is that Sparkey should mostly leave data on disk. Is there possibly a GC issue?
Any tips on reducing memory usage?
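If the thread-local wrapper is the suspect, one thing worth trying is holding a single shared reader and handing each worker thread its own `duplicate()`, which shares the underlying memory mapping. A minimal sketch, assuming `duplicate()` yields a reader that is safe to use from the calling thread (this is a suggestion to experiment with, not a confirmed fix for the memory behavior):

```java
import com.spotify.sparkey.CompressionType;
import com.spotify.sparkey.Sparkey;
import com.spotify.sparkey.SparkeyReader;
import com.spotify.sparkey.SparkeyWriter;
import java.io.File;

public class DuplicateReaderSketch {
    public static void main(String[] args) throws Exception {
        // Build a tiny store so the example is self-contained.
        File indexFile = new File("dup-test.spi");
        SparkeyWriter writer = Sparkey.createNew(indexFile, CompressionType.NONE, 0);
        writer.put("foo", "bar");
        writer.flush();
        writer.writeHash();
        writer.close();

        // One shared reader per JVM; each worker thread uses its own
        // duplicate instead of going through the thread-local wrapper.
        final SparkeyReader shared = Sparkey.open(indexFile);
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                SparkeyReader local = shared.duplicate();
                try {
                    System.out.println(local.getAsString("foo"));
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        shared.close();
    }
}
```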
No big rush, especially if you're still on vacation :)
Thanks for adding the sorting hash algorithm, it is great for loading 500M entries.
With version 2.3.0, the sorting hash writer and a Windows PC, this small example
import com.spotify.sparkey.CompressionType;
import com.spotify.sparkey.HashType;
import com.spotify.sparkey.Sparkey;
import com.spotify.sparkey.SparkeyWriter;
import java.io.File;
import java.io.IOException;

public static void main(String[] args) throws IOException {
    final SparkeyWriter writer = Sparkey.createNew(new File("test-store.spi"), CompressionType.NONE, 0);
    writer.put("foo", "bar");
    writer.flush();
    writer.setConstructionMethod(SparkeyWriter.ConstructionMethod.SORTING);
    writer.setHashType(HashType.HASH_64_BITS);
    writer.writeHash();
    writer.close();
}
fails with
Exception in thread "main" java.io.IOException: Could not rename C:\inno\kvtest\test-store.spi-tmp9431a3cc-4f48-453b-bcab-bb67f10ffa54 to test-store.spi
at com.spotify.sparkey.SingleThreadedSparkeyWriter.writeHash(SingleThreadedSparkeyWriter.java:106)
at dk.innovasion.TestCleaner.main(TestCleaner.java:21)
I guess the cause is the scheduled call to ByteBufferCleaner.cleanMapping(chunk), which executes long after the attempt to rename the .spi file, so the file is still memory-mapped (and therefore locked on Windows) when the rename happens.
Hi, @krka , @davidpoblador , I'd like to report a vulnerability issue in com.spotify.sparkey:sparkey:3.2.1.
I noticed that com.spotify.sparkey:sparkey:3.2.1 directly depends on com.github.luben:zstd-jni:v1.4.5-5 in its POM. However, as shown in the following dependency graph, com.github.luben:zstd-jni:v1.4.5-5 suffers from the vulnerability that the C library zstd (version 1.4.5) exposed: CVE-2021-24032.
com.github.luben:zstd-jni:v1.4.9-1 (and any later version) has upgraded this vulnerable C library zstd
to the patched version 1.4.9.
Java build tools cannot report vulnerable C libraries, which can introduce security issues into many downstream Java projects. Could you please upgrade this vulnerable dependency?
Thanks for your help~
Best regards,
Helen Parr
In IndexHash.open, logData is initialized, then the IndexHash constructor is called, then validate() is called on the new object. If the index file got truncated somehow, a runtime exception is thrown and logData never gets closed. Since this whole path is typically invoked from SingleThreadedSparkeyReader's constructor, a library user has no way to close the resource in an exception handler.
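A sketch of the close-on-failure pattern the report implies, using hypothetical stand-in types (the real logData and validation live inside the library): if construction or validation throws, the already-opened resource is closed before the exception propagates.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

public class CloseOnFailureSketch {
    // Hypothetical stand-in for the logData resource.
    static class LogData implements Closeable {
        final AtomicBoolean closed = new AtomicBoolean(false);
        public void close() { closed.set(true); }
    }

    // Sketch of the fix: if construction/validation throws, close the
    // already-opened logData before rethrowing, so nothing leaks.
    static LogData open(boolean truncated) throws IOException {
        LogData logData = new LogData();
        try {
            if (truncated) {
                throw new RuntimeException("Corrupt index file: truncated");
            }
            return logData;
        } catch (RuntimeException e) {
            logData.close();
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        try {
            open(true);
        } catch (RuntimeException expected) {
            // The resource opened inside open() was closed before rethrow.
        }
        System.out.println(open(false).closed.get()); // false: success path stays open
    }
}
```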
We're currently using Sparkey in a way that requires us to use many large Sparkey files at once for a short period of time, with new files being opened at a (potentially) very high rate during peak times.
This happens in a multi-threaded context, with each reader being used from a separate thread; readers are not shared between threads. During our performance tests we've occasionally had issues with virtual memory filling up and not being released quickly enough, using up all virtual address space on the machine. It appears this is largely because the current implementation uses a fixed timeout of 1 second before the actual memory cleanup happens. I do understand the reasoning behind this after reading the following discussion:
#13.
However, my question is whether it would be an option to make this behavior configurable, in the sense that it should be possible to choose whether a reader is closed immediately instead of being closed via an executor and a timeout. If not explicitly disabled, the timeout would still apply, so the change would be backward compatible.
Another option that comes to my mind is that it would be possible to pass a timeout value in the new constructor. If the timeout is 0, the closing of the file would happen immediately within the same thread, otherwise the timeout for the close action would simply be set to the given value and execution would happen in a separate thread (like today). Does this sound like a reasonable suggestion or am I missing anything?
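A rough sketch of that second option, with hypothetical names (`closeTimeoutMillis` is not an existing Sparkey parameter, and `unmap()` stands in for ByteBufferCleaner.cleanMapping): a timeout of 0 unmaps synchronously in the calling thread, while any other value keeps today's deferred, executor-based cleanup.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class ConfigurableCloseSketch {
    static final ScheduledExecutorService CLEANER =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "sparkey-cleaner");
            t.setDaemon(true);
            return t;
        });

    final AtomicBoolean unmapped = new AtomicBoolean(false);
    final long closeTimeoutMillis; // hypothetical new constructor parameter

    ConfigurableCloseSketch(long closeTimeoutMillis) {
        this.closeTimeoutMillis = closeTimeoutMillis;
    }

    // Stand-in for the actual memory-unmapping step.
    void unmap() { unmapped.set(true); }

    public void close() {
        if (closeTimeoutMillis == 0) {
            unmap(); // immediate, same-thread cleanup
        } else {
            // today's behavior: deferred cleanup after the timeout
            CLEANER.schedule(this::unmap, closeTimeoutMillis, TimeUnit.MILLISECONDS);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ConfigurableCloseSketch immediate = new ConfigurableCloseSketch(0);
        immediate.close();
        System.out.println(immediate.unmapped.get()); // true

        ConfigurableCloseSketch deferred = new ConfigurableCloseSketch(50);
        deferred.close();
        Thread.sleep(200);
        System.out.println(deferred.unmapped.get()); // true after the timeout
    }
}
```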