zrlio / crail

[Archived] A Fast Multi-tiered Distributed Storage System based on User-Level I/O

Home Page: http://crail.incubator.apache.org/

License: Apache License 2.0

Shell 2.08% Java 97.92%
crail distributed-systems high-performance java-8

crail's People

Contributors: animeshtrivedi, asqasq, follitude, patrickstuedi


crail's Issues

crail-site.conf configuration mistake leads to java.lang.ArrayIndexOutOfBoundsException

In the current version, crail.storage.types has replaced crail.datanode.types.
In conf/crail-site.conf.template it reads:
crail.storage.types com.ibm.crail.datanode.rdma.RdmaDataNode
but in the README.md it is:
crail.datanode.types com.ibm.crail.storage.rdma.RdmaStorageTier
In fact, crail.storage.types com.ibm.crail.storage.rdma.RdmaStorageTier is the correct entry for the current version.
Here com.ibm.crail.storage.rdma.RdmaStorageTier is the class used in StorageServer.java.
If com.ibm.crail.datanode.rdma.RdmaDataNode is used instead, storageTierIndex ends up as 1, which leads to a java.lang.ArrayIndexOutOfBoundsException at the namenode.
So the STORAGE_TYPES setting should be made consistent across README.md, crail-site.conf.template, CrailConstants.java and StorageServer.java.

fsck tool issues

  1. https://github.com/zrlio/crail/blob/master/client/src/main/java/com/ibm/crail/tools/CrailFsck.java#L181

     The usage shows getLocation whereas the check is for getLocations.

  2. https://github.com/zrlio/crail/blob/master/client/src/main/java/com/ibm/crail/tools/CrailFsck.java#L176

     Please give more detail telling the user which option was wrong. Just exiting is unintuitive and unhelpful to a user who might not know what went wrong (see the sketch after this list).

  3. The -r feature is undocumented.

  4. https://github.com/zrlio/crail/blob/master/client/src/main/java/com/ibm/crail/tools/CrailFsck.java#L49

     A bit more explanation of the commands would be helpful too, as I keep forgetting which one to use and what the difference between getLocation and blockStatistics is.
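For point 2, a minimal sketch of a friendlier failure path, assuming the tool keeps using Apache Commons CLI; the options object, tool name and wording below are placeholders, not the actual CrailFsck code:

import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;

class FsckUsage {
	// Parse the command line; on a bad option, report which option failed and
	// print the full usage text instead of silently exiting.
	static void parseOrExplain(Options options, String[] args) {
		try {
			new DefaultParser().parse(options, args);
		} catch (ParseException e) {
			System.err.println("crail fsck: " + e.getMessage());  // e.g. "Unrecognized option: -x"
			new HelpFormatter().printHelp("crail fsck", options); // lists all valid options with descriptions
			System.exit(-1);
		}
	}
}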

multiple IPs per interface

Crail picks the IP to report to the namenode from the configured interface name; this fails if multiple IPs are assigned to that interface. A better solution might be to choose the interface via a specified IP, as sketched below.
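A minimal sketch of that alternative, resolving the datanode address from a configured IP rather than an interface name; the configuration property feeding configuredIp is hypothetical:

import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;

class DataNodeAddress {
	// Resolve the configured IP and report exactly that address,
	// regardless of how many IPs the local interface carries.
	static InetAddress fromConfiguredIp(String configuredIp)
			throws UnknownHostException, SocketException {
		InetAddress address = InetAddress.getByName(configuredIp);
		NetworkInterface nic = NetworkInterface.getByInetAddress(address);
		if (nic == null) {
			throw new SocketException("no local interface carries IP " + configuredIp);
		}
		return address; // unambiguous even with multiple IPs per interface
	}
}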

Move the default build to Hadoop 2.7

None of us are using Hadoop 2.6, and all Spark versions are built for 2.7 as well. If I run a plain mvn build by mistake, I get 2.6 packages that later cause problems with Spark.

I propose making 2.7 the default build, with a backward-compatibility option for 2.6. Whoever does the next pull request, please consider setting this (or I will, whatever happens first).

Access permission to cachepath and datapath files

I have been having the following issue while running Crail-Spark-TeraSort. It looks like an access-permission problem on the cachepath and datapath files that occurs inside MappedBufferCache's constructor. This is probably because I am using Cloudera Hadoop, which runs YARN and HDFS under separate usernames such as "yarn" and "hdfs".

...
WARN scheduler.DAGScheduler: Creating new stage failed due to exception - job: 0
java.lang.NullPointerException
at com.ibm.crail.memory.MappedBufferCache.<init>(MappedBufferCache.java:55)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
...

I was running Crail-TeraSort with my own username, but "yarn" is also creating cache files, as shown below. So essentially the files are created by a different user than the one trying to read them.
ls -lth /memory/cache/
drwxrwxrwx 2 yarn hadoop 4.0K Dec 6 17:06 1512608812456
drwxrwxrwx 2 yarn hadoop 4.0K Dec 5 11:43 1512503023326

I was wondering if there is a way to fix this permanently. For example, would it help to change the access permissions when creating the RandomAccessFile for MappedBufferCache in the datapath?
RandomAccessFile dataFile = new RandomAccessFile(dataFilePath, "rw"); // (MappedBufferCache.java:89)
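Not a confirmed fix, just a sketch of the kind of change being asked about: widening the permissions of the cache file right after it is created, so a process running under a different user (e.g. "yarn") can still access it. POSIX-only; the class and method names are illustrative:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermissions;

class CacheFilePermissions {
	static RandomAccessFile openShared(String dataFilePath) throws IOException {
		RandomAccessFile dataFile = new RandomAccessFile(dataFilePath, "rw");
		// Allow other users to read and write the cache file as well.
		Path path = Paths.get(dataFilePath);
		Files.setPosixFilePermissions(path, PosixFilePermissions.fromString("rw-rw-rw-"));
		return dataFile;
	}
}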

Thanks,
Kevin

DirectoryRecord: update

  1. Only process valid entries, i.e., check the valid flag before using the length to create the array.
  2. Don't allocate a new array on every update; the maximum size is already known from DIRECTORY_RECORD (see the sketch below).
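A minimal sketch of point 2, reusing a preallocated buffer across updates. The constant and the record layout (an int valid flag followed by an int length and the name bytes) are assumptions for illustration, not the actual DirectoryRecord format:

import java.nio.ByteBuffer;

class DirectoryRecordUpdate {
	// Hypothetical maximum payload size derived from DIRECTORY_RECORD.
	private static final int MAX_RECORD_BYTES = 512;

	// Reuse one buffer across updates instead of allocating a new array each time.
	private final byte[] nameBuffer = new byte[MAX_RECORD_BYTES];

	void update(ByteBuffer record) {
		boolean valid = record.getInt() != 0;
		if (!valid) {
			return; // skip invalid entries before touching the length field
		}
		int length = record.getInt();
		record.get(nameBuffer, 0, length); // read into the preallocated buffer
	}
}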

Divide by zero exception if running with a small block size

When running with a block size of less than 1MB, no blocks seem to be registered, which leads to a divide-by-zero exception:

com.ibm.crail.namenode.rpc.darpc.DaRPCServiceDispatcher:130 - ERROR: Unknown error/ by zero
java.lang.ArithmeticException: / by zero
at com.ibm.crail.namenode.StorageTier$RoundRobinBlockSelection.getNext(BlockStore.java:185)
at com.ibm.crail.namenode.StorageTier$DataNodeArray.get(BlockStore.java:225)
at com.ibm.crail.namenode.StorageTier$DataNodeArray.access$0(BlockStore.java:220)
at com.ibm.crail.namenode.StorageTier.getBlock(BlockStore.java:118)
at com.ibm.crail.namenode.BlockStore.getBlock(BlockStore.java:63)
at com.ibm.crail.namenode.NameNodeService.createFile(NameNodeService.java:99)
at com.ibm.crail.namenode.rpc.darpc.DaRPCServiceDispatcher.processServerEvent(DaRPCServiceDispatcher.java:78)
at com.ibm.darpc.RpcServerGroup.processServerEvent(RpcServerGroup.java:122)
at com.ibm.darpc.RpcServerEndpoint.dispatchReceive(RpcServerEndpoint.java:75)
at com.ibm.darpc.RpcEndpoint.dispatchCqEvent(RpcEndpoint.java:157)
at com.ibm.darpc.RpcCluster.dispatchCqEvent(RpcCluster.java:36)
at com.ibm.darpc.RpcCluster.dispatchCqEvent(RpcCluster.java:1)
at com.ibm.disni.rdma.RdmaCqProcessor.dispatchCqEvent(RdmaCqProcessor.java:106)
at com.ibm.disni.rdma.RdmaCqProcessor.run(RdmaCqProcessor.java:136)
at java.lang.Thread.run(Thread.java:745)
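The failing modulo is presumably the round-robin selection over registered data nodes. A hedged sketch of a guard that turns the condition into a readable error instead of an ArithmeticException; class and field names are simplified, not the actual BlockStore code:

import java.util.concurrent.atomic.AtomicLong;

class RoundRobinBlockSelection {
	private final AtomicLong counter = new AtomicLong();

	// Return the index of the next data node, or fail with a descriptive error
	// when no data node has registered any blocks (e.g. block size too small).
	int getNext(int registeredDataNodes) {
		if (registeredDataNodes == 0) {
			throw new IllegalStateException(
				"no data nodes with registered blocks; check crail.blocksize and the storage tier configuration");
		}
		return (int) (counter.getAndIncrement() % registeredDataNodes);
	}
}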

mmap in hugetlbfs

Hi,
I read the Crail code and found that it supports hugetlbfs, so I performed some experiments with hugetlbfs (not involving Crail). My question: we use a MappedByteBuffer mapped from a file in hugetlbfs and never unmap it; supposedly the GC takes care of that, but in fact I sometimes run into this error:

java.lang.Error: Cleaner terminated abnormally
	at sun.misc.Cleaner$1.run(Cleaner.java:147)
	at sun.misc.Cleaner$1.run(Cleaner.java:144)
	at java.security.AccessController.doPrivileged(Native Method)
	at sun.misc.Cleaner.clean(Cleaner.java:144)
	at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:141)
Caused by: java.io.IOException: Invalid argument
	at sun.nio.ch.FileChannelImpl.unmap0(Native Method)
	at sun.nio.ch.FileChannelImpl.access$000(FileChannelImpl.java:40)
	at sun.nio.ch.FileChannelImpl$Unmapper.run(FileChannelImpl.java:787)
	at sun.misc.Cleaner.clean(Cleaner.java:142)

Can anyone give me some hints? I am not very familiar with hugetlbfs.
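For context, a minimal sketch of the kind of mapping being discussed: mapping a file on a hugetlbfs mount into a MappedByteBuffer. The mount point and sizes are assumptions, and hugetlbfs generally requires lengths that are multiples of the huge page size:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

class HugePageMapping {
	static MappedByteBuffer map() throws IOException {
		long hugePageSize = 2L * 1024 * 1024;   // assuming 2MB huge pages
		long length = 512 * hugePageSize;       // 1GB region, a multiple of the huge page size
		// "/dev/hugepages" as the hugetlbfs mount point is an assumption.
		try (RandomAccessFile file = new RandomAccessFile("/dev/hugepages/cache/0", "rw")) {
			return file.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, length);
		}
	}
}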

Exception stack trace overwritten

Catch-all clauses sometimes obfuscate the original exception. For example, in client/src/main/java/com/ibm/crail/CrailBufferedInputStream.java:230, the catch clause creates and throws a new exception, which results in losing the actual stack trace:

private void triggerFetch() throws IOException {
	try {
		if (future == null && internalBuf.remaining() == 0){
			internalBuf.clear();
			future = inputStream.read(internalBuf);
			if (future == null){
				internalBuf.clear().flip();
			}
		}
	} catch(Exception e) {
		throw new IOException(e);
	}
}

A simple solution would be to print the original stack trace. A better solution would be to remove the try-catch and force the enclosed code to properly deal with all non-IOExceptions.
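A hedged sketch of the first suggestion, rethrowing IOExceptions untouched and logging the original exception before wrapping anything else; LOG stands for the class's existing logger and is an assumption:

private void triggerFetch() throws IOException {
	try {
		if (future == null && internalBuf.remaining() == 0) {
			internalBuf.clear();
			future = inputStream.read(internalBuf);
			if (future == null) {
				internalBuf.clear().flip();
			}
		}
	} catch (IOException e) {
		throw e;                              // already fits the signature: rethrow untouched
	} catch (Exception e) {
		LOG.error("triggerFetch failed", e);  // make the original stack trace visible
		throw new IOException(e);             // keep the original as the cause
	}
}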

Hang when writing directory record entry

Possible concurrency problem when using the BufferedOutputStream. Process hangs here:

"Executor task launch worker for task 143" #152 daemon prio=5 os_prio=0 tid=0x00007f8cd0e78000 nid=0x6ae3 runnable [0x00007f8cd4253000]
java.lang.Thread.State: RUNNABLE
at com.ibm.crail.storage.rdma.client.RdmaStoragePassiveEndpoint.write(RdmaStoragePassiveEndpoint.java:194)
at com.ibm.crail.core.CoreOutputStream.trigger(CoreOutputStream.java:110)
at com.ibm.crail.core.CoreStream.prepareAndTrigger(CoreStream.java:238)
at com.ibm.crail.core.CoreStream.dataOperation(CoreStream.java:104)
at com.ibm.crail.core.CoreOutputStream.write(CoreOutputStream.java:67)
at com.ibm.crail.core.DirectoryOutputStream.writeRecord(DirectoryOutputStream.java:53)
at com.ibm.crail.core.CoreFileSystem._createNode(CoreFileSystem.java:211)
at com.ibm.crail.core.CreateNodeFuture.process(CoreMetaDataOperation.java:164)
at com.ibm.crail.core.CreateNodeFuture.process(CoreMetaDataOperation.java:150)
at com.ibm.crail.core.CoreMetaDataOperation.get(CoreMetaDataOperation.java:87)
at com.ibm.crail.core.CoreEarlyFile.file(CoreFile.java:167)
- eliminated <0x00007f8d7bb59890> (a com.ibm.crail.core.CoreEarlyFile)
at com.ibm.crail.core.CoreEarlyFile.getDirectOutputStream(CoreFile.java:104)
- locked <0x00007f8d7bb59890> (a com.ibm.crail.core.CoreEarlyFile)
at com.ibm.crail.CrailBufferedOutputStream.outputStream(CrailBufferedOutputStream.java:329)
at com.ibm.crail.CrailBufferedOutputStream.syncSlice(CrailBufferedOutputStream.java:320)
at com.ibm.crail.CrailBufferedOutputStream.write(CrailBufferedOutputStream.java:124)
at com.ibm.crail.CrailBufferedOutputStream.write(CrailBufferedOutputStream.java:102)
at com.ibm.crail.terasort.serializer.F22SerializerStream.writeObject(F22Serializer.scala:92)
at com.ibm.crail.terasort.serializer.F22SerializerStream.writeValue(F22Serializer.scala:102)
at org.apache.spark.storage.CrailObjectWriter.write(CrailStore.scala:717)
at org.apache.spark.shuffle.crail.CrailShuffleWriter$$anonfun$write$1.apply(CrailShuffleWriter.scala:67)
at org.apache.spark.shuffle.crail.CrailShuffleWriter$$anonfun$write$1.apply(CrailShuffleWriter.scala:65)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.shuffle.crail.CrailShuffleWriter.write(CrailShuffleWriter.scala:65)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Check the directory entry size

When experimenting with the crail.directoryrecord or crail.directorydepth parameters, the errors produced when a file name or the directory depth exceeds these limits are cryptic; see the output below. It would be nice to detect this case and print a more sensible error, for example "limit for these parameters exceeded", etc. ;)

To the rest of the system it looks as if a particular file just vanished underneath the Crail file system. A hypothetical sketch of such a check is shown below, followed by the output I currently get.
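The limits and their interpretation in this sketch are made up for illustration and would need to come from the actual crail.directoryrecord and crail.directorydepth settings:

import java.io.IOException;

class DirectoryLimits {
	// Hypothetical limits mirroring crail.directoryrecord and crail.directorydepth.
	static final int MAX_RECORD_BYTES = 512;
	static final int MAX_DEPTH = 16;

	static void validate(String path) throws IOException {
		String[] components = path.split("/");
		if (components.length > MAX_DEPTH) {
			throw new IOException("path depth " + components.length
					+ " exceeds crail.directorydepth (" + MAX_DEPTH + "): " + path);
		}
		for (String name : components) {
			if (name.getBytes().length > MAX_RECORD_BYTES) {
				throw new IOException("file name '" + name
						+ "' exceeds crail.directoryrecord (" + MAX_RECORD_BYTES + " bytes)");
			}
		}
	}
}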

 17/09/21 13:44:48 2501 main WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
17/09/21 13:45:17 30985 main ERROR FileFormatWriter: Aborting job null.         
 java.io.FileNotFoundException: /sql/data1.pq/_temporary/0/task_20170921134508_0001_m_000021
	at com.ibm.crail.hdfs.CrailHadoopFileSystem.listStatus(CrailHadoopFileSystem.java:199)
	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:426)
	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:362)
	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:334)
	at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:47)
	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:128)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:209)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
	at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:438)
	at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:474)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
	at com.ibm.crail.spark.tools.ParquetGenerator$.main(ParquetGenerator.scala:116)
	at com.ibm.crail.spark.tools.ParquetGenerator.main(ParquetGenerator.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Exception in thread "main" org.apache.spark.SparkException: Job aborted.
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:215)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
	at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:438)
	at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:474)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
	at com.ibm.crail.spark.tools.ParquetGenerator$.main(ParquetGenerator.scala:116)
	at com.ibm.crail.spark.tools.ParquetGenerator.main(ParquetGenerator.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.FileNotFoundException: /sql/data1.pq/_temporary/0/task_20170921134508_0001_m_000021
	at com.ibm.crail.hdfs.CrailHadoopFileSystem.listStatus(CrailHadoopFileSystem.java:199)
	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:426)
	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:362)
	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:334)
	at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:47)
	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:128)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:209)
	... 44 more

iobench read faulty request

Hi, I ran into a problem when running iobench.
When I run ./bin/crail iobench -t writeClusterDirect -s 1048576 -k 1024 -f /tmp.dat, it works well: I see the result below and can find tmp.dat using ./bin/crail fs -l /:

starting benchmark...
execution time 0.157
ops 1024.0
sumbytes 1.073741824E9
throughput 54712.95918471338
latency 153.3203125

But reading the file fails. I use the following command:
./bin/crail iobench -t readSequentialDirect -s 1048576 -k 1024 -f /tmp.dat
The output is the following, and then it hangs:

starting benchmark...
17/04/10 15:28:29 INFO crail: faulty request, status 12

Could anyone tell me what is going wrong?

Multiple devices per datanode, per tier

I don't believe Crail supports multiple devices per datanode for a specific tier, for instance exporting two nvmef targets from a storage tier on a single datanode, like:

crail.storage.blkdev.datapath /dev/nvme0n1,/dev/nvme1n1,

I'm trying to scope out how much effort this would be, but I first wanted to check whether there are already plans or existing work to support such functionality. This would probably be most useful in the blk-dev repo, where we could simply expose multiple iscsi/nvmef targets to a namenode, as in the conf example above. A sketch of how such a comma-separated datapath could be parsed follows.
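Not existing functionality, just an illustration of splitting the datapath value above into per-device paths on the storage-tier side:

import java.util.ArrayList;
import java.util.List;

class BlkDevDataPaths {
	// Split e.g. "/dev/nvme0n1,/dev/nvme1n1," into ["/dev/nvme0n1", "/dev/nvme1n1"],
	// so each device can be exported as its own target from the same datanode.
	static List<String> parse(String dataPathProperty) {
		List<String> devices = new ArrayList<>();
		for (String path : dataPathProperty.split(",")) {
			String trimmed = path.trim();
			if (!trimmed.isEmpty()) {
				devices.add(trimmed); // ignores the trailing comma in the example
			}
		}
		return devices;
	}
}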

Thanks,
Tim

Use map from test case name to test case function

CrailBenchmark and HdfsIOBenchmark should use a map from test names (command line) to test case functions (see PR 35).

Example:
[...]
int locationClass = 0;
boolean useBuffered = true;

String benchmarkTypes = "write|writeAsync|readSequential|readRandom|readSequentialAsync|readMultiStream|"
		+ "createFile|createFileAsync|createMultiFile|getKey|getFile|getFileAsync|getMultiFile"
		+ "getMultiFileAsync|enumerateDir|browseDir|"
		+ "writeInt|readInt|seekInt|readMultiStreamInt|printLocationclass";
Option typeOption = Option.builder("t").desc("type of experiment [" + benchmarkTypes + "]").hasArg().build();
Option fileOption = Option.builder("f").desc("filename").hasArg().build();

[...]
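A minimal sketch of the proposed map-based dispatch; the test names are taken from the string above, but the bodies are placeholders rather than the actual CrailBenchmark methods:

import java.util.LinkedHashMap;
import java.util.Map;

class BenchmarkDispatch {
	// Map each command-line test name to its test case; adding a benchmark then
	// means adding one entry instead of editing a '|'-separated string and a switch.
	private final Map<String, Runnable> tests = new LinkedHashMap<>();

	BenchmarkDispatch() {
		tests.put("write", () -> System.out.println("running write test"));            // placeholder body
		tests.put("readSequential", () -> System.out.println("running readSequential")); // placeholder body
		// ... one entry per benchmark type ...
	}

	void run(String type) {
		Runnable test = tests.get(type);
		if (test == null) {
			throw new IllegalArgumentException(
				"unknown experiment type '" + type + "', valid types: " + tests.keySet());
		}
		test.run();
	}
}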

potential bug?

dataBuf.position(opDesc.getBufferPosition());

A dataBuf.clear() might be required here before setting the limit and position; otherwise, in my Parquet reader, I get the following (a sketch of the suggested ordering follows the trace):

java.lang.IllegalArgumentException
	at java.nio.Buffer.position(Buffer.java:244)
	at com.ibm.crail.memory.OffHeapBuffer.position(OffHeapBuffer.java:46)
	at com.ibm.crail.core.CoreStream.prepareAndTrigger(CoreStream.java:239)
	at com.ibm.crail.core.CoreStream.dataOperation(CoreStream.java:105)
	at com.ibm.crail.core.CoreInputStream.read(CoreInputStream.java:77)
	...
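Buffer.position() throws IllegalArgumentException when the new position exceeds the current limit, so calling clear() first resets the limit to the capacity. The helper below only mirrors the dataBuf/opDesc usage above; the parameter names are illustrative:

import java.nio.ByteBuffer;

class BufferReuse {
	// Sketch of the suggested ordering: clear() first, so the subsequent
	// position() call can never exceed the current limit.
	static void prepare(ByteBuffer dataBuf, int bufferPosition, int length) {
		dataBuf.clear();                        // limit = capacity, position = 0
		dataBuf.position(bufferPosition);       // safe, assuming capacity >= bufferPosition
		dataBuf.limit(bufferPosition + length); // restrict the window to this operation
	}
}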

More descriptive names for Crail parameters

The default config that is generated in assembly/target contains crail-site.conf, which has a list of parameters:

crail.blocksize				1048576
crail.buffersize			1048576
crail.regionsize			1073741824
crail.cachelimit			1073741824
crail.cachepath				<path, should be huge page mountpoint>
crail.singleton				true
crail.statistics 			true
crail.namenode.address			crail://<hostname>:9060
..

It would be nice if we could fully qualify all the parameters listed there. For example,
crail.blocksize could become crail.fs.blocksize, and crail.buffersize could become crail.client.buffersize. This would make explicit which parameter is relevant for which Crail component.

Hibench-pagerank: java.io.StreamCorruptedException: invalid type code: AC

Hello,
A problem has been confusing me a lot; I believe it is a bug in Crail.
When running spark-io with HiBench PageRank with 3 iterations, stage 0 is fine, but as stage 1 starts the problem pops up: java.io.StreamCorruptedException: invalid type code: AC.
The debug info is as follows:
WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 40, 172.18.0.17): java.io.StreamCorruptedException: invalid type code: AC
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1381)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.DeserializationStream.readKey(Serializer.scala:157)
at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:189)
at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:186)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.shuffle.crail.CrailInputCloser.hasNext(CrailInputCloser.scala:33)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.shuffle.crail.CrailShuffleWriter.write(CrailShuffleWriter.scala:65)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

How can this be solved? Any help would be appreciated.
