parallelai / spyglass
Cascading and Scalding wrapper for HBase with advanced read features
License: Apache License 2.0
Does this work with a secure HBase cluster?
In an HBase MapReduce job written in Java, TableMapReduceUtil.initTableMapperJob internally calls TableMapReduceUtil.initCredentials. What is the equivalent of this in the SpyGlass HBase taps for working with secure HBase?
If I skip this step, the map tasks fail with Kerberos errors:
"No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)"
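As a workaround sketch (not a documented SpyGlass feature), the delegation tokens that initTableMapperJob would fetch can be obtained explicitly before the flow is submitted. TableMapReduceUtil.initCredentials is part of the stock HBase client API; the Job instance here is assumed to be only a carrier for the credentials that the launcher then ships to the tasks:

```scala
// Hedged sketch: fetch HBase delegation tokens up front, mirroring what
// TableMapReduceUtil.initTableMapperJob does internally for plain MR jobs.
// SpyGlass itself may need a hook to pick these credentials up.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil
import org.apache.hadoop.mapreduce.Job

object SecureHBaseCredentials {
  def attachTokens(): Job = {
    val conf = HBaseConfiguration.create()
    val job  = Job.getInstance(conf)
    // Obtains an HBase delegation token for the current Kerberos user
    // and stores it in job.getCredentials.
    TableMapReduceUtil.initCredentials(job)
    job
  }
}
```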
I noticed that when writing to HBase, HBasePipeWrapper.toBytesWritable replaces nulls with empty strings, which end up stored in HBase as empty byte arrays.
It looks like HBaseScheme would throw an NPE if one tried to store a null directly.
When reading from HBase, HBaseScheme replaces nulls (which represent missing column values) with empty byte arrays.
HBase is a sparse store, after all, so why not benefit from that?
I propose that when writing to HBase we do not replace nulls with empty strings but simply skip writing them, and that when reading from HBase we put the nulls that HBase emits for missing values into the Cascading tuple instead of substituting empty byte arrays. This works well because null is also how Cascading tuples represent missing values.
Are there any downsides to this approach?
I have a pull request ready if this is considered a good idea.
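For concreteness, a minimal sketch of the proposed semantics (this is the suggestion, not the current SpyGlass behaviour); Put, Result and the byte-array calls are from the pre-1.0 HBase client API:

```scala
// Hedged sketch of the proposed null handling: skip null values on write,
// pass HBase's nulls straight through on read.
import org.apache.hadoop.hbase.client.{Put, Result}

object SparseNullHandling {
  // Write: only add a cell when the tuple value is non-null, so a missing
  // value simply creates no cell in the sparse store.
  def addIfPresent(put: Put, family: Array[Byte], qualifier: Array[Byte],
                   value: Array[Byte]): Put =
    if (value != null) put.add(family, qualifier, value) else put

  // Read: a missing cell comes back as null and stays null, matching how
  // Cascading tuples represent absent values.
  def readCell(result: Result, family: Array[Byte],
               qualifier: Array[Byte]): Array[Byte] =
    result.getValue(family, qualifier) // null when the cell is absent
}
```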
Hi, I am trying to run the SpyGlass API in a Kerberos environment. I used the following code:
// read the input file
val pipe = TextLine( args("input") )
  // split each line of the input file on any whitespace "\s+"
  .flatMap('line -> 'WORD) { line : String =>
    line.toLowerCase.split("""\s+""")
  }
  // get the count of each word in the input file
  .groupBy('WORD) { _.size }
  // convert to uppercase for use with HBase Phoenix
  .rename('size -> 'SIZE)

// create a new pipe with byte values rather than string/number values
val pipe2 = new HBasePipeWrapper(pipe)
  .toBytesWritable(new Fields("WORD"), new Fields("SIZE"))

// write to the TEST table with WORD as the row key, CF as the column family
// and SIZE as the only value column
pipe2.write(new HBaseSource("TEST", "myserver.mydomain.com:2181", 'WORD,
  List("CF"), List(new Fields("SIZE"))))

I replaced "myserver.mydomain.com:2181" with my specific quorum values. The job launches and shows that 2 mappers and 60 reducers are required. The mappers complete fine, but the reducers get stuck at 57 and then keep failing without the task being killed.
The Scalding Fields API is deprecated and its use is discouraged. It would be nice to make HBaseSource work with the type-safe API as both a source and a sink.
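In the meantime, one possible bridge, sketched under the assumption that HBaseSource stays a fields-based tap: read it fields-based and lift the result into the typed world with Scalding's toTypedPipe shim (from typed.TDsl). The table, quorum and field names here are made up for the example:

```scala
// Hedged sketch: lift a fields-based HBaseSource read into a TypedPipe.
// Not a SpyGlass feature; just the standard Scalding fields<->typed shim.
import cascading.tuple.Fields
import com.twitter.scalding._
import com.twitter.scalding.typed.TDsl._
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import parallelai.spyglass.hbase.HBaseSource

class TypedBridge(args: Args) extends Job(args) {
  val source = new HBaseSource("TEST", "zkhost:2181", 'WORD,
    List("CF"), List(new Fields("SIZE")))

  // Read fields-based, then convert to a typed pipe of (rowkey, value).
  val rows: TypedPipe[(ImmutableBytesWritable, ImmutableBytesWritable)] =
    source.read
      .toTypedPipe[(ImmutableBytesWritable, ImmutableBytesWritable)]('WORD, 'SIZE)
}
```

Sinking would need the reverse conversion (toPipe) back into a fields-based write, so a first-class typed HBaseSource would still be the cleaner solution.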
HBaseRawSource is commented out in the source code.
I tried to look at the tags, but there is no git tag or release commit for it that I could find, only a scala 2.9.3 tag.
Is there a reason why it was commented out? Was it perhaps merged into another framework, or did it not work?
Thanks.
HBaseSource should be able to create a link between two tables via a row key or a column. In order to retain performance and indexing, it also needs to support secondary indexing.
There appears to be an incompatibility with newer versions of HBase; the method below is apparently deprecated and/or removed:
java.lang.NoSuchMethodError: org.apache.hadoop.hbase.HRegionLocation.getServerAddress()Lorg/apache/hadoop/hbase/HServerAddress;
at parallelai.spyglass.hbase.HBaseInputFormatGranular.getSplits(HBaseInputFormatGranular.java:113)
at parallelai.spyglass.hbase.HBaseInputFormatGranular.getSplits(HBaseInputFormatGranular.java:43)
at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:200)
at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:142)
at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1469)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1469)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:107)
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:196)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
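For reference, HRegionLocation.getServerAddress() (and the HServerAddress class itself) was removed from the HBase client around 0.96. A hedged sketch of what HBaseInputFormatGranular.getSplits would call on newer clients, using accessors from the 0.96+ API:

```scala
// Hedged sketch: locating a region's server on HBase >= 0.96, where
// HRegionLocation.getServerAddress() no longer exists.
import org.apache.hadoop.hbase.HRegionLocation

object RegionHost {
  // Replaces location.getServerAddress().getHostname() / .getPort().
  def hostAndPort(location: HRegionLocation): (String, Int) =
    (location.getHostname, location.getPort)
}
```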
We would like to use SpyGlass in a Scala 2.11 project. Would you consider upgrading it or cross-building?
We have a job that writes a lot of data to HBase, and we were hitting severe performance issues. We discovered that HBaseRecordWriter sets autoflush to true; when we set autoflush to false, write performance went up by several orders of magnitude. Is there any reason why autoflush is enabled?
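A sketch of the buffered-write setup described above; HTable and its setAutoFlush/setWriteBufferSize methods are from the pre-1.0 HBase client API, and the buffer size is an illustrative value, not a recommendation:

```scala
// Hedged sketch: buffer puts client-side instead of issuing one RPC per Put.
// The buffer is flushed when full and again on close()/flushCommits().
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable

object BufferedWrites {
  def openBufferedTable(tableName: String): HTable = {
    val conf: Configuration = HBaseConfiguration.create()
    val table = new HTable(conf, tableName)
    table.setAutoFlush(false)                 // batch puts client-side
    table.setWriteBufferSize(4 * 1024 * 1024) // 4 MB, tune for your row sizes
    table
  }
}
```

The trade-off is durability: with autoflush off, puts sitting in the client buffer are lost if the process dies before a flush, which may be why it was enabled by default.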
To get all fixes from 0.98.1 to 0.98.6, upgrade the dependency to 0.98.6-CDH.5.2.1. This picks up in particular the fix for this blocker issue: https://issues.apache.org/jira/browse/HBASE-11118
I am seeing a lot of error messages like this:
13/11/04 16:01:04 ERROR hbase.HBaseInputFormatGranular: Cannot resolve the host name for /192.168.1.56 because of javax.naming.NameNotFoundException: DNS name not found [response code 3]; remaining name '56.1.168.192.in-addr.arpa'
Our cluster uses hosts files only; we have no DNS. The job completes successfully. Are these real errors? If not, the logging level should probably be lowered.
I tried to understand what is going on, but got confused by this call:
hostName = Strings.domainNamePointerToHostName(DNS.reverseDns(
ipAddress, this.nameServer));
this.nameServer is a private variable initialized to null. I cannot find where in the code it ever gets set.
Best, Koert
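One possible hardening for hosts-file-only clusters, sketched under the assumption that the Hadoop DNS helper is the right entry point: fall back to the JDK resolver (which consults /etc/hosts) when the reverse lookup fails, rather than logging an error. DNS.reverseDns is from org.apache.hadoop.net:

```scala
// Hedged sketch: tolerate missing PTR records instead of logging ERROR.
import java.net.InetAddress
import javax.naming.NamingException
import org.apache.hadoop.net.DNS

object HostNames {
  def resolve(ip: InetAddress, nameServer: String): String =
    try {
      // Strip the trailing dot of the PTR answer, as HBase's
      // Strings.domainNamePointerToHostName does.
      DNS.reverseDns(ip, nameServer).stripSuffix(".")
    } catch {
      case _: NamingException =>
        // No PTR record (hosts-file-only cluster): use the JDK lookup,
        // which falls back to the literal IP if nothing resolves.
        ip.getCanonicalHostName
    }
}
```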
Hello,
What are the options for testing HBase access? I can use Scalding's JobTest, but that would mock out the whole source. Is there a way to use SpyGlass with some test double for the database, such as HTableMock? Currently I am testing against a live HBase, but that does not scale well.
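One middle ground, sketched here as a suggestion rather than a supported SpyGlass setup: run the job against an in-process mini-cluster via HBaseTestingUtility, which ships with the HBase test artifacts. The table and family names are made up for the example:

```scala
// Hedged sketch: an in-process HBase mini-cluster as a test fixture.
import org.apache.hadoop.hbase.HBaseTestingUtility
import org.apache.hadoop.hbase.util.Bytes

object MiniClusterFixture {
  def withMiniHBase(body: HBaseTestingUtility => Unit): Unit = {
    val util = new HBaseTestingUtility()
    util.startMiniCluster()
    try {
      util.createTable(Bytes.toBytes("TEST"), Bytes.toBytes("CF"))
      // Point the job's quorum setting at the mini-cluster via
      // util.getConfiguration (hbase.zookeeper.quorum / clientPort)
      // before running it.
      body(util)
    } finally util.shutdownMiniCluster()
  }
}
```

Startup is slow (seconds, not milliseconds), so this suits integration tests rather than unit tests, but it avoids depending on a shared live cluster.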
Opening this as a follow-up to issue #17.
Error: java.lang.NullPointerException
at parallelai.spyglass.hbase.HBaseRecordReaderBase.setHTable(HBaseRecordReaderBase.java:64)
at parallelai.spyglass.hbase.HBaseInputFormatGranular.getRecordReader(HBaseInputFormatGranular.java:373)
at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:253)
at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:248)
at cascading.util.Util.retry(Util.java:762)
at cascading.tap.hadoop.io.MultiInputFormat.getRecordReader(MultiInputFormat.java:247)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:172)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:414)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1469)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
It appears that HBaseRawSource.scala (as opposed to the Java code for the raw Scheme and Tap) was removed from the project in commit 42249fa. This source type is still documented in the README file, so I assume it should still be present. I found it in the history at https://github.com/ParallelAI/SpyGlass/blob/b72c234dd35c3eb807e8050385adf697dcf97fad/src/main/scala/parallelai/spyglass/hbase/HBaseRawSource.scala
I gathered from issue #4, along with some commit messages, that the raw-related code had been commented out in the past. It looks like the Scheme and Tap were later uncommented, while the Source was instead removed completely.
I recently had a problem where I was reading a lot of rows from an HBase table and filtering out the majority of them in the first steps of my Scalding job. As a result, the Hadoop counters did not change and the job timed out after 10 minutes.
Would it be possible to add a counter that counts rows read (or hundreds of rows read) and publishes the value as a Hadoop counter, to avoid the timeout?
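A sketch of what such a counter could look like inside the record reader, assuming the old mapred Reporter that SpyGlass's input format already receives; the group and counter names are made up for the example:

```scala
// Hedged sketch: bump a Hadoop counter every N rows from the record reader,
// so heavily-filtered jobs still report progress and avoid the task timeout.
import org.apache.hadoop.mapred.Reporter

object ReadProgress {
  val Group   = "SpyGlass"
  val Counter = "ROWS_READ"

  // Call rowRead() once per row from next(); the counter is flushed in
  // batches to keep the reporting overhead negligible.
  final class RowCounter(reporter: Reporter, every: Int = 100) {
    private var pending = 0L
    def rowRead(): Unit = {
      pending += 1
      if (pending >= every) {
        reporter.incrCounter(Group, Counter, pending)
        pending = 0
      }
    }
  }
}
```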
I would like to request that this project be published to the conjars repository.