parallelai / spyglass
Cascading and Scalding wrapper for HBase with advanced read features
License: Apache License 2.0
Does this work with a secure HBase cluster?
In an HBase MapReduce job written in Java, TableMapReduceUtil.initTableMapperJob internally calls TableMapReduceUtil.initCredentials. What is the equivalent of this in the SpyGlass HBase taps for working with secure HBase?
If I skip this step, the map tasks fail with Kerberos errors:
"No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)"
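As a workaround sketch (not a documented SpyGlass feature), the delegation tokens that initTableMapperJob would fetch can be obtained explicitly before the flow is submitted. TableMapReduceUtil.initCredentials is part of the stock HBase client API; the Job instance here is assumed to be only a carrier for the credentials that the launcher then ships to the tasks:

```scala
// Hedged sketch: fetch HBase delegation tokens up front, mirroring what
// TableMapReduceUtil.initTableMapperJob does internally for plain MR jobs.
// SpyGlass itself may need a hook to pick these credentials up.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil
import org.apache.hadoop.mapreduce.Job

object SecureHBaseCredentials {
  def attachTokens(): Job = {
    val conf = HBaseConfiguration.create()
    val job  = Job.getInstance(conf)
    // Obtains an HBase delegation token for the current Kerberos user
    // and stores it in job.getCredentials.
    TableMapReduceUtil.initCredentials(job)
    job
  }
}
```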
I noticed that when writing to HBase, HBasePipeWrapper.toBytesWritable replaces nulls with empty strings, which end up stored in HBase as empty byte arrays.
It looks like HBaseScheme would throw an NPE if one tried to store a null directly.
When reading from HBase, HBaseScheme replaces nulls (which represent missing column values) with empty byte arrays.
HBase is a sparse store, after all, so why not benefit from that?
I propose that when writing to HBase we do not replace nulls with empty strings but simply skip writing them, and that when reading from HBase we put the nulls that HBase emits for missing values into the Cascading tuple instead of substituting empty byte arrays. This works well because null is also how Cascading tuples represent missing values.
Are there any downsides to this approach?
I have a pull request ready if this is considered a good idea.
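For concreteness, a minimal sketch of the proposed semantics (this is the suggestion, not the current SpyGlass behaviour); Put, Result and the byte-array calls are from the pre-1.0 HBase client API:

```scala
// Hedged sketch of the proposed null handling: skip null values on write,
// pass HBase's nulls straight through on read.
import org.apache.hadoop.hbase.client.{Put, Result}

object SparseNullHandling {
  // Write: only add a cell when the tuple value is non-null, so a missing
  // value simply creates no cell in the sparse store.
  def addIfPresent(put: Put, family: Array[Byte], qualifier: Array[Byte],
                   value: Array[Byte]): Put =
    if (value != null) put.add(family, qualifier, value) else put

  // Read: a missing cell comes back as null and stays null, matching how
  // Cascading tuples represent absent values.
  def readCell(result: Result, family: Array[Byte],
               qualifier: Array[Byte]): Array[Byte] =
    result.getValue(family, qualifier) // null when the cell is absent
}
```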
Hi, I am trying to run the SpyGlass API in a Kerberos environment. I used the following code:
// read the input file
val pipe = TextLine( args("input") )
  // split each line of the input file on any whitespace "\s+"
  .flatMap('line -> 'WORD) { line : String =>
    line.toLowerCase.split("""\s+""")
  }
  // get the count of each word in the input file
  .groupBy('WORD) { _.size }
  // convert to uppercase for use with HBase Phoenix
  .rename('size -> 'SIZE)

// create a new pipe with byte values rather than string/number values
val pipe2 = new HBasePipeWrapper(pipe)
  .toBytesWritable(new Fields("WORD"), new Fields("SIZE"))

// write to the TEST table with WORD as the row key, CF as the column family
// and SIZE as the only value column
pipe2.write(new HBaseSource("TEST", "myserver.mydomain.com:2181", 'WORD,
  List("CF"), List(new Fields("SIZE"))))

I replaced "myserver.mydomain.com:2181" with my specific quorum values. The job launches and shows that 2 mappers and 60 reducers are required. The mappers complete fine, but the reducers get stuck at 57 and then keep failing without the task being killed.
The Scalding Fields API is deprecated and its use is discouraged. It would be nice to make HBaseSource work with the type-safe API as both a source and a sink.
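In the meantime, one possible bridge, sketched under the assumption that HBaseSource stays a fields-based tap: read it fields-based and lift the result into the typed world with Scalding's toTypedPipe shim (from typed.TDsl). The table, quorum and field names here are made up for the example:

```scala
// Hedged sketch: lift a fields-based HBaseSource read into a TypedPipe.
// Not a SpyGlass feature; just the standard Scalding fields<->typed shim.
import cascading.tuple.Fields
import com.twitter.scalding._
import com.twitter.scalding.typed.TDsl._
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import parallelai.spyglass.hbase.HBaseSource

class TypedBridge(args: Args) extends Job(args) {
  val source = new HBaseSource("TEST", "zkhost:2181", 'WORD,
    List("CF"), List(new Fields("SIZE")))

  // Read fields-based, then convert to a typed pipe of (rowkey, value).
  val rows: TypedPipe[(ImmutableBytesWritable, ImmutableBytesWritable)] =
    source.read
      .toTypedPipe[(ImmutableBytesWritable, ImmutableBytesWritable)]('WORD, 'SIZE)
}
```

Sinking would need the reverse conversion (toPipe) back into a fields-based write, so a first-class typed HBaseSource would still be the cleaner solution.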
HBaseRawSource is commented out in the source code.
I tried to look at the tags, but there is no git tag or release commit for it that I could find, only a scala 2.9.3 tag.
Is there a reason why it was commented out? Was it perhaps merged into another framework, or did it not work?
Thanks.
HBaseSource should be able to create a link between two tables via a row key or a column. In order to retain performance and indexing, it also needs to support secondary indexing.
There appears to be an incompatibility with newer versions of HBase; the method below is apparently deprecated and/or removed:
java.lang.NoSuchMethodError: org.apache.hadoop.hbase.HRegionLocation.getServerAddress()Lorg/apache/hadoop/hbase/HServerAddress;
at parallelai.spyglass.hbase.HBaseInputFormatGranular.getSplits(HBaseInputFormatGranular.java:113)
at parallelai.spyglass.hbase.HBaseInputFormatGranular.getSplits(HBaseInputFormatGranular.java:43)
at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:200)
at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:142)
at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1469)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1469)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:107)
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:196)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
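For reference, HRegionLocation.getServerAddress() (and the HServerAddress class itself) was removed from the HBase client around 0.96. A hedged sketch of what HBaseInputFormatGranular.getSplits would call on newer clients, using accessors from the 0.96+ API:

```scala
// Hedged sketch: locating a region's server on HBase >= 0.96, where
// HRegionLocation.getServerAddress() no longer exists.
import org.apache.hadoop.hbase.HRegionLocation

object RegionHost {
  // Replaces location.getServerAddress().getHostname() / .getPort().
  def hostAndPort(location: HRegionLocation): (String, Int) =
    (location.getHostname, location.getPort)
}
```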
We would like to use SpyGlass in a Scala 2.11 project. Would you consider upgrading it or cross-building?
We have a job that writes a lot of data to HBase, and we were hitting severe performance issues. We discovered that HBaseRecordWriter sets autoflush to true; when we set autoflush to false, write performance went up by several orders of magnitude. Is there any reason why autoflush is enabled?
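A sketch of the buffered-write setup described above; HTable and its setAutoFlush/setWriteBufferSize methods are from the pre-1.0 HBase client API, and the buffer size is an illustrative value, not a recommendation:

```scala
// Hedged sketch: buffer puts client-side instead of issuing one RPC per Put.
// The buffer is flushed when full and again on close()/flushCommits().
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable

object BufferedWrites {
  def openBufferedTable(tableName: String): HTable = {
    val conf: Configuration = HBaseConfiguration.create()
    val table = new HTable(conf, tableName)
    table.setAutoFlush(false)                 // batch puts client-side
    table.setWriteBufferSize(4 * 1024 * 1024) // 4 MB, tune for your row sizes
    table
  }
}
```

The trade-off is durability: with autoflush off, puts sitting in the client buffer are lost if the process dies before a flush, which may be why it was enabled by default.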
To get all fixes from 0.98.1 to 0.98.6, upgrade the dependency to 0.98.6-CDH.5.2.1. This picks up in particular the fix for this blocker issue: https://issues.apache.org/jira/browse/HBASE-11118
I am seeing a lot of error messages like this:
13/11/04 16:01:04 ERROR hbase.HBaseInputFormatGranular: Cannot resolve the host name for /192.168.1.56 because of javax.naming.NameNotFoundException: DNS name not found [response code 3]; remaining name '56.1.168.192.in-addr.arpa'
Our cluster uses hosts files only; we have no DNS. The job completes successfully. Are these real errors? If not, the logging level should probably be lowered.
I tried to understand what is going on, but got confused by this call:
hostName = Strings.domainNamePointerToHostName(DNS.reverseDns(
ipAddress, this.nameServer));
this.nameServer is a private variable initialized to null. I cannot find where in the code it ever gets set.
Best, Koert
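One possible hardening for hosts-file-only clusters, sketched under the assumption that the Hadoop DNS helper is the right entry point: fall back to the JDK resolver (which consults /etc/hosts) when the reverse lookup fails, rather than logging an error. DNS.reverseDns is from org.apache.hadoop.net:

```scala
// Hedged sketch: tolerate missing PTR records instead of logging ERROR.
import java.net.InetAddress
import javax.naming.NamingException
import org.apache.hadoop.net.DNS

object HostNames {
  def resolve(ip: InetAddress, nameServer: String): String =
    try {
      // Strip the trailing dot of the PTR answer, as HBase's
      // Strings.domainNamePointerToHostName does.
      DNS.reverseDns(ip, nameServer).stripSuffix(".")
    } catch {
      case _: NamingException =>
        // No PTR record (hosts-file-only cluster): use the JDK lookup,
        // which falls back to the literal IP if nothing resolves.
        ip.getCanonicalHostName
    }
}
```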
Hello,
What are the options for testing HBase access? I can use Scalding's JobTest, but that would mock out the whole source. Is there a way to use SpyGlass with some test double for the database, such as HTableMock? Currently I am testing against a live HBase, but that does not scale well.
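One middle ground, sketched here as a suggestion rather than a supported SpyGlass setup: run the job against an in-process mini-cluster via HBaseTestingUtility, which ships with the HBase test artifacts. The table and family names are made up for the example:

```scala
// Hedged sketch: an in-process HBase mini-cluster as a test fixture.
import org.apache.hadoop.hbase.HBaseTestingUtility
import org.apache.hadoop.hbase.util.Bytes

object MiniClusterFixture {
  def withMiniHBase(body: HBaseTestingUtility => Unit): Unit = {
    val util = new HBaseTestingUtility()
    util.startMiniCluster()
    try {
      util.createTable(Bytes.toBytes("TEST"), Bytes.toBytes("CF"))
      // Point the job's quorum setting at the mini-cluster via
      // util.getConfiguration (hbase.zookeeper.quorum / clientPort)
      // before running it.
      body(util)
    } finally util.shutdownMiniCluster()
  }
}
```

Startup is slow (seconds, not milliseconds), so this suits integration tests rather than unit tests, but it avoids depending on a shared live cluster.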
Opening this as a follow-up to issue #17.
Error: java.lang.NullPointerException
at parallelai.spyglass.hbase.HBaseRecordReaderBase.setHTable(HBaseRecordReaderBase.java:64)
at parallelai.spyglass.hbase.HBaseInputFormatGranular.getRecordReader(HBaseInputFormatGranular.java:373)
at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:253)
at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:248)
at cascading.util.Util.retry(Util.java:762)
at cascading.tap.hadoop.io.MultiInputFormat.getRecordReader(MultiInputFormat.java:247)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:172)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:414)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1469)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
It appears that HBaseRawSource.scala (as opposed to the Java code for the raw Scheme and Tap) was removed from the project in commit 42249fa. This source type is still documented in the README file, so I assume it should still be present. I found it in the history at https://github.com/ParallelAI/SpyGlass/blob/b72c234dd35c3eb807e8050385adf697dcf97fad/src/main/scala/parallelai/spyglass/hbase/HBaseRawSource.scala
I gathered from issue #4, along with some commit messages, that the raw-related code had been commented out in the past. It looks like the Scheme and Tap were later uncommented, while the Source was instead removed completely.
I recently had a problem where I was reading a lot of rows from an HBase table and filtering out the majority of them in the first steps of my Scalding job. As a result, the Hadoop counters did not change and the job timed out after 10 minutes.
Would it be possible to add a counter that counts rows read (or hundreds of rows read) and publishes the value as a Hadoop counter, to avoid the timeout?
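A sketch of what such a counter could look like inside the record reader, assuming the old mapred Reporter that SpyGlass's input format already receives; the group and counter names are made up for the example:

```scala
// Hedged sketch: bump a Hadoop counter every N rows from the record reader,
// so heavily-filtered jobs still report progress and avoid the task timeout.
import org.apache.hadoop.mapred.Reporter

object ReadProgress {
  val Group   = "SpyGlass"
  val Counter = "ROWS_READ"

  // Call rowRead() once per row from next(); the counter is flushed in
  // batches to keep the reporting overhead negligible.
  final class RowCounter(reporter: Reporter, every: Int = 100) {
    private var pending = 0L
    def rowRead(): Unit = {
      pending += 1
      if (pending >= every) {
        reporter.incrCounter(Group, Counter, pending)
        pending = 0
      }
    }
  }
}
```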
I would like to request that this project be published to the conjars repository.