spotify / sparkey-java
Java implementation of the Sparkey key value store
License: Apache License 2.0
Say you have a Sparkey index file and log file named as follows:
a-b-c.sparkey.spi and a-b-c.sparkey.spl. With the "base file" here being "a-b-c.sparkey", the getIndexFile method returns "a-b-c.spi" and the getLogFile method returns "a-b-c.spl", so the ".sparkey" part of the base name is silently dropped.
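A minimal sketch of the behavior the reporter expects, using hypothetical helper methods (the real logic lives inside the library): the index and log filenames should be formed by appending the suffix to the full base name, not by replacing whatever follows the base name's last dot.

```java
import java.io.File;

public class SparkeyFileNames {
    // Hypothetical helpers illustrating the expected suffix handling:
    // append ".spi"/".spl" to the full base path instead of replacing
    // the text after the base name's last dot.
    static File getIndexFile(File baseFile) {
        return new File(baseFile.getPath() + ".spi");
    }

    static File getLogFile(File baseFile) {
        return new File(baseFile.getPath() + ".spl");
    }

    public static void main(String[] args) {
        File base = new File("a-b-c.sparkey");
        // Expected: the ".sparkey" part of the base name is preserved.
        System.out.println(getIndexFile(base)); // a-b-c.sparkey.spi
        System.out.println(getLogFile(base));   // a-b-c.sparkey.spl
    }
}
```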
It seems you forgot the statement `displacement++` at line 275:
https://gist.github.com/zhenchuan/ac33042b68f4b01733da
I'm using Sparkey in Spark: one Sparkey reader per JVM executor, with the reader used by up to 4 executor threads. I'm seeing much higher memory usage than I expected.
I'm thinking the thread local reader might be the issue.
My code is like this:
import scala.util.Try
import com.spotify.sparkey.Sparkey
import java.io.File
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

object SparkeyState extends Serializable {
  val SPARKEY_PATH = "/usr/local/share/sparkey/"

  // This Sparkey dataset maps a UTF-8 string key to a 32-bit integer value.
  @transient lazy val SPARKEY_READER = Try(Sparkey.open(
    new File(SPARKEY_PATH)
  ))

  def getCount(token: String): Option[Int] = {
    SPARKEY_READER.map(reader =>
      Option(reader.getAsByteArray(token.getBytes(StandardCharsets.UTF_8))).map(
        intBytes => ByteBuffer.wrap(intBytes).getInt
      )
    ).getOrElse(None)
  }
}

val data = spark.read.csv("/datapath/").as[String]
data.map(s => SparkeyState.getCount(s)).write.csv("/outputpath/")
Is high memory usage expected in this scenario? My understanding is that Sparkey should mostly leave data on disk. Is there possibly a GC issue?
Any tips on reducing memory usage?
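If the thread-local wrapper is the suspect, one thing worth trying is holding a single shared reader and handing each worker thread its own `duplicate()`, which shares the underlying memory mapping. A minimal sketch, assuming `duplicate()` yields a reader that is safe to use from the calling thread (this is a suggestion to experiment with, not a confirmed fix for the memory behavior):

```java
import com.spotify.sparkey.CompressionType;
import com.spotify.sparkey.Sparkey;
import com.spotify.sparkey.SparkeyReader;
import com.spotify.sparkey.SparkeyWriter;
import java.io.File;

public class DuplicateReaderSketch {
    public static void main(String[] args) throws Exception {
        // Build a tiny store so the example is self-contained.
        File indexFile = new File("dup-test.spi");
        SparkeyWriter writer = Sparkey.createNew(indexFile, CompressionType.NONE, 0);
        writer.put("foo", "bar");
        writer.flush();
        writer.writeHash();
        writer.close();

        // One shared reader per JVM; each worker thread uses its own
        // duplicate instead of going through the thread-local wrapper.
        final SparkeyReader shared = Sparkey.open(indexFile);
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                SparkeyReader local = shared.duplicate();
                try {
                    System.out.println(local.getAsString("foo"));
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        shared.close();
    }
}
```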
No big rush, especially if you're still on vacation :)
Thanks for adding the sorting hash algorithm, it is great for loading 500M entries.
With version 2.3.0, the sorting hash writer and a Windows PC, this small example
import com.spotify.sparkey.CompressionType;
import com.spotify.sparkey.HashType;
import com.spotify.sparkey.Sparkey;
import com.spotify.sparkey.SparkeyWriter;
import java.io.File;
import java.io.IOException;

public static void main(String[] args) throws IOException {
    final SparkeyWriter writer = Sparkey.createNew(new File("test-store.spi"), CompressionType.NONE, 0);
    writer.put("foo", "bar");
    writer.flush();
    writer.setConstructionMethod(SparkeyWriter.ConstructionMethod.SORTING);
    writer.setHashType(HashType.HASH_64_BITS);
    writer.writeHash();
    writer.close();
}
fails with
Exception in thread "main" java.io.IOException: Could not rename C:\inno\kvtest\test-store.spi-tmp9431a3cc-4f48-453b-bcab-bb67f10ffa54 to test-store.spi
at com.spotify.sparkey.SingleThreadedSparkeyWriter.writeHash(SingleThreadedSparkeyWriter.java:106)
at dk.innovasion.TestCleaner.main(TestCleaner.java:21)
I guess the cause is the scheduled call to ByteBufferCleaner.cleanMapping(chunk), which executes long after the attempt to rename the .spi file, so the file is still memory-mapped (and therefore locked on Windows) when the rename happens.
Hi, @krka , @davidpoblador , I'd like to report a vulnerability issue in com.spotify.sparkey:sparkey:3.2.1.
I noticed that com.spotify.sparkey:sparkey:3.2.1 directly depends on com.github.luben:zstd-jni:v1.4.5-5 in its POM. However, as shown in the following dependency graph, com.github.luben:zstd-jni:v1.4.5-5 suffers from the vulnerability that the C library zstd (version 1.4.5) exposed: CVE-2021-24032.
com.github.luben:zstd-jni:v1.4.9-1 (and any later version) has upgraded this vulnerable C library zstd
to the patched version 1.4.9.
Java build tools cannot report vulnerable C libraries, which can introduce security issues into many downstream Java projects. Could you please upgrade this vulnerable dependency?
Thanks for your help~
Best regards,
Helen Parr
In IndexHash.open, logData is initialized, then the IndexHash constructor is called, then validate() is called on the new object. If the index file got truncated somehow, a runtime exception is thrown and logData never gets closed. Since this whole path is typically invoked from SingleThreadedSparkeyReader's constructor, a library user has no way to close the resource in an exception handler.
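A sketch of the close-on-failure pattern the report implies, using hypothetical stand-in types (the real logData and validation live inside the library): if construction or validation throws, the already-opened resource is closed before the exception propagates.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

public class CloseOnFailureSketch {
    // Hypothetical stand-in for the logData resource.
    static class LogData implements Closeable {
        final AtomicBoolean closed = new AtomicBoolean(false);
        public void close() { closed.set(true); }
    }

    // Sketch of the fix: if construction/validation throws, close the
    // already-opened logData before rethrowing, so nothing leaks.
    static LogData open(boolean truncated) throws IOException {
        LogData logData = new LogData();
        try {
            if (truncated) {
                throw new RuntimeException("Corrupt index file: truncated");
            }
            return logData;
        } catch (RuntimeException e) {
            logData.close();
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        try {
            open(true);
        } catch (RuntimeException expected) {
            // The resource opened inside open() was closed before rethrow.
        }
        System.out.println(open(false).closed.get()); // false: success path stays open
    }
}
```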
We're currently using Sparkey in a way that requires us to use many large Sparkey files at once for a short period of time, with new files being opened at a (potentially) very high rate during peak times.
This happens in a multi-threaded context, with each reader being used from a separate thread; readers are not shared between threads. During our performance tests we've occasionally had issues with virtual memory filling up and not being released quickly enough, using up all virtual address space on the machine. It appears this is largely because the current implementation uses a fixed timeout of 1 second before the actual memory cleanup happens. I do understand the reasoning behind this after reading the following discussion:
#13.
However, my question is whether it would be an option to make this behavior configurable, in the sense that it should be possible to choose whether a reader is closed immediately instead of being closed via an executor and a timeout. If not explicitly disabled, the timeout would still apply, so the change would be backward compatible.
Another option that comes to my mind is that it would be possible to pass a timeout value in the new constructor. If the timeout is 0, the closing of the file would happen immediately within the same thread, otherwise the timeout for the close action would simply be set to the given value and execution would happen in a separate thread (like today). Does this sound like a reasonable suggestion or am I missing anything?
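A rough sketch of that second option, with hypothetical names (`closeTimeoutMillis` is not an existing Sparkey parameter, and `unmap()` stands in for ByteBufferCleaner.cleanMapping): a timeout of 0 unmaps synchronously in the calling thread, while any other value keeps today's deferred, executor-based cleanup.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class ConfigurableCloseSketch {
    static final ScheduledExecutorService CLEANER =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "sparkey-cleaner");
            t.setDaemon(true);
            return t;
        });

    final AtomicBoolean unmapped = new AtomicBoolean(false);
    final long closeTimeoutMillis; // hypothetical new constructor parameter

    ConfigurableCloseSketch(long closeTimeoutMillis) {
        this.closeTimeoutMillis = closeTimeoutMillis;
    }

    // Stand-in for the actual memory-unmapping step.
    void unmap() { unmapped.set(true); }

    public void close() {
        if (closeTimeoutMillis == 0) {
            unmap(); // immediate, same-thread cleanup
        } else {
            // today's behavior: deferred cleanup after the timeout
            CLEANER.schedule(this::unmap, closeTimeoutMillis, TimeUnit.MILLISECONDS);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ConfigurableCloseSketch immediate = new ConfigurableCloseSketch(0);
        immediate.close();
        System.out.println(immediate.unmapped.get()); // true

        ConfigurableCloseSketch deferred = new ConfigurableCloseSketch(50);
        deferred.close();
        Thread.sleep(200);
        System.out.println(deferred.unmapped.get()); // true after the timeout
    }
}
```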