
sparkey-java's People

Contributors

dependabot[bot], krka, mattnworb, mbruggmann, nevillelyh, nresare, protocol7, spkrka, vasi, vasi-stripe, yonromai


sparkey-java's Issues

Possible problems with sparkey files with two dots in file names.

Say you have a Sparkey index file and log file named as follows:

a-b-c.sparkey.spi, a-b-c.sparkey.spl. Assuming the "base file" here is "a-b-c.sparkey", the getIndexFile method will return "a-b-c.spi" and the getLogFile method will return "a-b-c.spl", so the ".sparkey" part of the name is lost.
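
The behavior described above can be reproduced with a small stand-alone sketch. Both helper methods below are hypothetical and are not Sparkey's actual code: replaceLastExtension assumes the ending is derived by cutting the name at its last dot, which yields the reported result, and withSparkeyEnding shows one possible alternative that only strips a known ".spi" or ".spl" suffix.

```java
import java.io.File;

final class SparkeyFileNames {
    // Hypothetical: replaces everything after the last dot in the file name.
    // This reproduces the reported result for names that contain two dots.
    static File replaceLastExtension(File file, String ending) {
        String name = file.getName();
        int dot = name.lastIndexOf('.');
        String stem = dot < 0 ? name : name.substring(0, dot);
        return new File(file.getParentFile(), stem + ending);
    }

    // Hypothetical alternative: strips only a known Sparkey suffix,
    // then appends the requested ending.
    static File withSparkeyEnding(File file, String ending) {
        String name = file.getName();
        if (name.endsWith(".spi") || name.endsWith(".spl")) {
            name = name.substring(0, name.length() - 4);
        }
        return new File(file.getParentFile(), name + ending);
    }

    public static void main(String[] args) {
        File base = new File("a-b-c.sparkey");
        // prints "a-b-c.spi"
        System.out.println(replaceLastExtension(base, ".spi").getName());
        // prints "a-b-c.sparkey.spi"
        System.out.println(withSparkeyEnding(base, ".spi").getName());
    }
}
```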

Very high memory usage of Sparkey reader in Spark

I'm using Sparkey in Spark: one Sparkey reader per JVM executor, and the reader is used in up to 4 executor threads. I'm seeing much higher memory requirements than I expected.

I'm thinking the thread local reader might be the issue.

My code is like this:

import scala.util.Try
import com.spotify.sparkey.Sparkey
import java.io.File
import java.nio.ByteBuffer

object SparkeyState extends Serializable {

  val SPARKEY_PATH = "/usr/local/share/sparkey/"
  // this sparkey dataset maps a UTF-8 string to a 32 bit integer
  @transient lazy val SPARKEY_READER = Try(Sparkey.open(
    new File(SPARKEY_PATH)
  ))

  def getCount(token: String): Option[Int] = {
    SPARKEY_READER.map(reader =>
      Option(reader.getAsByteArray(token.getBytes)).map(
        intBytes =>
          ByteBuffer.wrap(intBytes).getInt
      )
    ).getOrElse(None)
  }
}

val data = spark.read.csv("/datapath/").as[String]
data.map(s => SparkeyState.getCount(s)).write.csv("/outputpath/")

Is high memory usage expected in this scenario? My understanding is that Sparkey should mostly leave data on disk. Is there possibly a GC issue?

Any tips on reducing memory usage?

ConstructionMethod.SORTING on windows prevents renaming .spi file

Thanks for adding the sorting hash algorithm, it is great for loading 500M entries.

With version 2.3.0, the sorting hash writer and a Windows PC, this small example

    public static void main(String[] args) throws IOException {
        final SparkeyWriter writer = Sparkey.createNew(new File("test-store.spi"), CompressionType.NONE, 0);

        writer.put("foo", "bar");

        writer.flush();
        writer.setConstructionMethod(SparkeyWriter.ConstructionMethod.SORTING);
        writer.setHashType(HashType.HASH_64_BITS);
        writer.writeHash();
        writer.close();
    }

fails with

Exception in thread "main" java.io.IOException: Could not rename C:\inno\kvtest\test-store.spi-tmp9431a3cc-4f48-453b-bcab-bb67f10ffa54 to test-store.spi
	at com.spotify.sparkey.SingleThreadedSparkeyWriter.writeHash(SingleThreadedSparkeyWriter.java:106)
	at dk.innovasion.TestCleaner.main(TestCleaner.java:21)

I guess the cause is the scheduled call to ByteBufferCleaner.cleanMapping(chunk), which is executed a long time after the attempt to rename the .spi file, so the file is still memory-mapped when the rename is attempted.
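
If the guess above is right, one caller-side workaround might be to retry the rename for a short while until the mapping has been released. The following is a minimal sketch under that assumption, using only the JDK; the class name, the attempt count, the delay, and the file names in demo() are all hypothetical, and this is not how Sparkey itself handles the rename.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class RenameWithRetry {
    // Tries to move source to target, retrying after an IOException.
    // Rethrows the last failure if every attempt fails.
    static void rename(Path source, Path target, int attempts, long delayMillis)
            throws IOException, InterruptedException {
        if (attempts < 1) {
            throw new IllegalArgumentException("attempts must be >= 1");
        }
        IOException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
                return;
            } catch (IOException e) {
                last = e;
                if (i + 1 < attempts) {
                    Thread.sleep(delayMillis); // give pending cleanup time to run
                }
            }
        }
        throw last;
    }

    // Small demonstration on a temporary directory (hypothetical file names).
    static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("rename-demo");
            Path source = Files.createFile(dir.resolve("store.spi-tmp"));
            Path target = dir.resolve("store.spi");
            rename(source, target, 5, 200L);
            return Files.exists(target) && !Files.exists(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Retrying only hides the timing problem; it does not change when the mapping is actually released.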

Vulnerable shared library might make sparkey vulnerable. Can you help upgrade to patch versions?

Hi @krka, @davidpoblador, I'd like to report a vulnerability issue in com.spotify.sparkey:sparkey:3.2.1.

Issue Description

I noticed that com.spotify.sparkey:sparkey:3.2.1 directly depends on com.github.luben:zstd-jni:v1.4.5-5 in the pom. However, as shown in the following dependency graph, com.github.luben:zstd-jni:v1.4.5-5 suffers from a vulnerability exposed by the C library zstd (version 1.4.5): CVE-2021-24032.

Dependency Graph between Java and Shared Libraries

[Image: dependency graph]

Suggested Vulnerability Patch Versions

com.github.luben:zstd-jni:v1.4.9-1 (>=v1.4.9-1) has upgraded the vulnerable C library zstd to the patched version 1.4.9.

Java build tools cannot report vulnerable C libraries, which may introduce security issues in many downstream Java projects. Could you please upgrade this vulnerable dependency?

Thanks for your help~
Best regards,
Helen Parr

Potential file descriptor leak in IndexHash.open()

In IndexHash.open, logData is initialized, then the IndexHash constructor is called, then validate() is called on the new object. If the index file has been truncated somehow, a runtime exception is thrown and logData is never closed. As this typically gets called from SingleThreadedSparkeyReader's constructor, a library user has no way to close the resource in an exception handler.
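
The usual remedy for this kind of leak is to close the already-opened resource when construction or validation fails, then rethrow. The sketch below illustrates that pattern with stand-in types; TrackedResource and Index are hypothetical and only mimic the roles of logData and IndexHash described above, they are not Sparkey classes.

```java
import java.io.Closeable;

final class CloseOnFailure {
    // Stand-in for the opened log data; records whether close() was called.
    static final class TrackedResource implements Closeable {
        boolean closed;

        @Override
        public void close() {
            closed = true;
        }
    }

    // Stand-in for the index object that owns the resource after a successful open.
    static final class Index {
        final TrackedResource logData;
        final boolean truncated;

        Index(TrackedResource logData, boolean truncated) {
            this.logData = logData;
            this.truncated = truncated;
        }

        void validate() {
            if (truncated) {
                throw new IllegalStateException("index file is truncated");
            }
        }
    }

    // Closes logData if construction or validation throws, then rethrows.
    static Index open(TrackedResource logData, boolean truncated) {
        try {
            Index index = new Index(logData, truncated);
            index.validate();
            return index;
        } catch (RuntimeException e) {
            logData.close();
            throw e;
        }
    }
}
```

On success the resource stays open and is owned by the returned object; only the failure path closes it.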

Allow direct closing of Sparkey Reader

We're currently using Sparkey in a way that requires us to use many large Sparkey files at once for a short period of time, with new files being opened at a (potentially) very high rate during peak times.
This happens in a multithreaded context, with each reader being used from a separate thread. Readers are not shared between threads. During our performance tests we've occasionally had issues with virtual memory filling up and not being released quickly enough, using up all virtual address space on the machine. It appears that this is largely because the current implementation uses a fixed timeout of 1 second before the actual memory cleanup happens. I do understand the reasoning behind this after reading the discussion in #13.

However, my question is whether this behavior could be made configurable, so that a reader can be closed immediately instead of being closed via an executor and a timeout. Unless explicitly disabled, the timeout would still apply, so the change should be backward compatible.
Another option that comes to mind is to pass a timeout value to a new constructor. If the timeout is 0, the file would be closed immediately in the calling thread; otherwise the close action would be delayed by the given value and executed in a separate thread (as it is today). Does this sound like a reasonable suggestion, or am I missing anything?
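
To make the second option concrete, here is a minimal sketch of a closer with a configurable delay. DelayedCloser and its parameters are hypothetical and do not exist in Sparkey; the sketch only shows the proposed control flow (run in the calling thread when the delay is 0, otherwise schedule on an executor) and does not address the reasons for the existing delay.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class DelayedCloser {
    private final ScheduledExecutorService executor;
    private final long delayMillis;

    // delayMillis == 0 means "clean up immediately in the calling thread".
    DelayedCloser(ScheduledExecutorService executor, long delayMillis) {
        if (delayMillis < 0) {
            throw new IllegalArgumentException("delayMillis must be >= 0");
        }
        this.executor = executor;
        this.delayMillis = delayMillis;
    }

    // Runs the cleanup action now, or schedules it after the configured delay.
    void close(Runnable cleanup) {
        if (delayMillis == 0) {
            cleanup.run();
        } else {
            executor.schedule(cleanup, delayMillis, TimeUnit.MILLISECONDS);
        }
    }
}
```

A caller that wants today's behavior would pass a delay of 1000 ms; a caller that opens and discards many readers quickly would pass 0 and accept responsibility for not using the reader after closing it.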
