spyglass's People

Contributors

antwnis, blakeney, crajah, deepakas, galarragas, jamoozy, koertkuipers, kristiankaufmann, rore, saad373, sathish316, xargsgrep


spyglass's Issues

HBaseTap for secure HBase cluster

Does this work with a secure HBase cluster?

In an HBase MapReduce job written in Java, TableMapReduceUtil.initTableMapperJob internally calls TableMapReduceUtil.initCredentials. What is the equivalent of this in the SpyGlass HBase taps for working with secure HBase?

If I don't do this step, it results in Kerberos errors in the map step:
"No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)"

nulls in hbase

I noticed that when writing to hbase, HBasePipeWrapper.toBytesWritable replaces nulls with empty strings, which end up stored in hbase as empty byte arrays.
It looks like HBaseScheme would throw an NPE if one tried to store a null.

When reading from hbase, HBaseScheme replaces nulls (which represent missing column values) with empty byte arrays.

hbase is a sparse store after all, so why not benefit from this?

I propose that when writing to hbase we do not replace nulls with empty strings but instead skip writing them entirely, and when reading from hbase we put the nulls that hbase emits for missing values into the cascading tuple instead of replacing them with empty byte arrays. This works well because null is also what cascading tuples use to represent missing values.

Are there any downsides to this approach?
I have a pull request ready if this is considered a good idea.
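
For concreteness, a minimal sketch of the proposed semantics (the helper names are hypothetical, not SpyGlass's actual write path):

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes

// Writing: only emit a cell when the tuple actually carries a value, so the
// row stays sparse instead of storing an empty byte array.
def addCell(put: Put, family: Array[Byte], qualifier: Array[Byte], value: Option[String]): Put = {
  value.foreach(v => put.add(family, qualifier, Bytes.toBytes(v)))
  put
}

// Reading: a missing column comes back from hbase as null; keep it as null in
// the cascading tuple, which already uses null to represent missing values.
def cellToTupleValue(raw: Array[Byte]): Any =
  if (raw == null) null else Bytes.toString(raw)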

Not able to run in a Kerberos-secured environment

Hi, I am trying to run the SpyGlass API in a Kerberos environment. I used the following code:

// read the input file
val pipe = TextLine(args("input"))
  // split each line of the input file on any whitespace (regex \s+)
  .flatMap('line -> 'WORD) { line: String =>
    line.toLowerCase.split("\\s+")
  }
  // get the count of each word in the input file
  .groupBy('WORD) { _.size }
  // convert to uppercase for use with HBase Phoenix
  .rename('size -> 'SIZE)

// create a new pipe with byte values rather than string/number values
val pipe2 = new HBasePipeWrapper(pipe).toBytesWritable(new Fields("WORD"), new Fields("SIZE"))
  // write to the TEST table with WORD as the rowkey, CF as the column family and SIZE as the only value column
  .write(new HBaseSource("TEST", "myserver.mydomain.com:2181", 'WORD, List("CF"), List(new Fields("SIZE"))))

I change "myserver.mydomain.com:2181" to my specific quorum values. The job launches and shows that 2 mappers and 60 reducers are required. The mappers run fine, but the reducers get stuck at 57 and then keep failing without the task being killed.

HBaseRawSource

HBaseRawSource is commented out in the source code.

I tried looking at the tags, but there is no git tag or release check-in containing it that I could find; the only tag is for Scala 2.9.3.

Is there a reason why it was commented out? Was it perhaps merged into another framework, or did it not work?

Thanks.

java.lang.NoSuchMethodError: org.apache.hadoop.hbase.HRegionLocation.getServerAddress()Lorg/apache/hadoop/hbase/HServerAddress

There appears to be an incompatibility with newer versions of HBase: this method has apparently been deprecated and/or removed (see the sketch after the stack trace below).

java.lang.NoSuchMethodError: org.apache.hadoop.hbase.HRegionLocation.getServerAddress()Lorg/apache/hadoop/hbase/HServerAddress;
at parallelai.spyglass.hbase.HBaseInputFormatGranular.getSplits(HBaseInputFormatGranular.java:113)
at parallelai.spyglass.hbase.HBaseInputFormatGranular.getSplits(HBaseInputFormatGranular.java:43)
at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:200)
at cascading.tap.hadoop.io.MultiInputFormat.getSplits(MultiInputFormat.java:142)
at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1469)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1469)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:107)
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:196)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
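
For reference, a hedged sketch of how the failing lookup might be ported to HBase 0.95+, where HRegionLocation.getServerAddress() no longer exists and ServerName is the replacement:

import org.apache.hadoop.hbase.HRegionLocation

def regionHostAndPort(location: HRegionLocation): (String, Int) = {
  // Replaces the removed location.getServerAddress() call.
  val serverName = location.getServerName
  (serverName.getHostname, serverName.getPort)
}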

Scala 2.11 support

We would like to use SpyGlass in a Scala 2.11 project. Would you consider upgrading it or cross building?
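
For illustration, a minimal sketch of what cross building could look like under sbt (SpyGlass's actual build tool may differ; the version numbers are only examples):

// build.sbt
scalaVersion := "2.10.6"

crossScalaVersions := Seq("2.10.6", "2.11.12")

// "sbt +package" then builds artifacts for every listed Scala version.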

autoflush and performance problems

We have a job that writes a lot of data to hbase, and we were hitting severe performance issues. We discovered that the HBaseRecordWriter sets autoflush to true. When we set autoflush to false, write performance went up by several orders of magnitude. Is there any reason why the autoflush feature is enabled?
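
For context, a hedged sketch of the client-side setting in question (the table name is illustrative; this uses the plain HBase client API, not SpyGlass's HBaseRecordWriter itself):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable

val table = new HTable(HBaseConfiguration.create(), "TEST")
// With autoflush on, every Put travels in its own RPC; with it off, Puts are
// batched in the client write buffer until the buffer fills or is flushed.
table.setAutoFlush(false)
table.setWriteBufferSize(12 * 1024 * 1024) // e.g. a 12 MB write buffer
// ... many table.put(...) calls ...
table.flushCommits() // send any Puts still sitting in the buffer
table.close()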

reverseDNS error logs

I am seeing a lot of error messages like this:
13/11/04 16:01:04 ERROR hbase.HBaseInputFormatGranular: Cannot resolve the host name for /192.168.1.56 because of javax.naming.NameNotFoundException: DNS name not found [response code 3]; remaining name '56.1.168.192.in-addr.arpa'

Our cluster uses hosts files only. We have no DNS. The job completes successfully. Are these real errors? If not, the logging level should probably be changed.

I tried to understand what is going on, but got confused by this call:

hostName = Strings.domainNamePointerToHostName(DNS.reverseDns(ipAddress, this.nameServer));

this.nameServer is a private variable initialized to null. I cannot find where in the code it gets set.
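
For what it's worth, a hedged sketch of how the lookup could be guarded so that clusters without reverse DNS fall back to the raw address instead of logging an error (the helper name is hypothetical):

import java.net.InetAddress
import javax.naming.NamingException
import org.apache.hadoop.net.DNS
import org.apache.hadoop.hbase.util.Strings

def safeHostName(ipAddress: InetAddress, nameServer: String): String =
  try {
    Strings.domainNamePointerToHostName(DNS.reverseDns(ipAddress, nameServer))
  } catch {
    // No PTR record (e.g. a hosts-file-only cluster): fall back quietly.
    case _: NamingException => ipAddress.getHostAddress
  }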

Best, Koert

testing via htablemock or similar

Hello,
What are the options for testing hbase access? I can use scalding's JobTest, but that would mock out the whole source. Is there a way to use SpyGlass with a test double for the database, like htablemock? Currently I'm testing against a live hbase, but that doesn't scale well.
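
For reference, a hedged sketch of the JobTest route being ruled out above (the job class and row data are hypothetical): the HBaseSource is swapped out wholesale for in-memory tuples, so no HBase client code (mocked or real) ever runs.

import com.twitter.scalding._
import cascading.tuple.Fields
import parallelai.spyglass.hbase.HBaseSource

JobTest(new MyHBaseReadingJob(_))
  .source(
    new HBaseSource("TEST", "myserver.mydomain.com:2181", 'WORD, List("CF"), List(new Fields("SIZE"))),
    List(("word1", "3"), ("word2", "7"))) // rows supplied in memory
  .sink[(String, String)](Tsv("output")) { out =>
    // assertions on the job's output go here
  }
  .run
  .finish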

NullPointerException using CDH5 Branch

Opening this as a follow-up to issue #17.

Error: java.lang.NullPointerException
at parallelai.spyglass.hbase.HBaseRecordReaderBase.setHTable(HBaseRecordReaderBase.java:64)
at parallelai.spyglass.hbase.HBaseInputFormatGranular.getRecordReader(HBaseInputFormatGranular.java:373)
at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:253)
at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:248)
at cascading.util.Util.retry(Util.java:762)
at cascading.tap.hadoop.io.MultiInputFormat.getRecordReader(MultiInputFormat.java:247)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:172)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:414)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1469)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)

HBaseRawSource Missing from Project

It appears that HBaseRawSource.scala (as opposed to the Java code for the raw Scheme and Tap) was removed from the project with Commit 42249fa. This source type is still documented in the README file, so I assume it should still be present. I found it in the history at https://github.com/ParallelAI/SpyGlass/blob/b72c234dd35c3eb807e8050385adf697dcf97fad/src/main/scala/parallelai/spyglass/hbase/HBaseRawSource.scala

I gathered from Issue #4, along with some commit messages, that the raw-related code had been commented out in the past. It looks like the Scheme and Tap were uncommented, while the Source was instead removed completely.

Add Hadoop Counter to avoid timeouts

I recently had a problem where I was reading a lot of rows from an HBase table and filtering out the majority of them in the first steps of my scalding job. As a result, the Hadoop counters didn't change and the job timed out after 10 minutes.

Would it be possible to add a counter that counts lines read (or hundreds of lines read) and publishes the value as a Hadoop counter, to avoid timing out?
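
For illustration, a hedged sketch of the idea (the counter group and name are hypothetical) using the old mapred API that Cascading runs on:

import org.apache.hadoop.mapred.Reporter

def onRowRead(reporter: Reporter): Unit = {
  // Bumping any counter (or reporting progress) resets the task timeout
  // clock (mapred.task.timeout, 600000 ms by default), so a job that reads
  // many rows but emits few records no longer appears hung.
  reporter.incrCounter("SpyGlass", "ROWS_READ", 1L)
  reporter.progress()
}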
