imhotep's Issues

“NoSuchMethodErrors” due to multiple versions of commons-io:commons-io:jar

Issue description

Hi, there are multiple versions of commons-io:commons-io in imhotep-master\imhotep-archive. As shown in the dependency tree below, commons-io:commons-io:2.4 is introduced transitively by org.apache.hadoop:hadoop-client:2.6.0-cdh5.4.11, but dependency management forces it down to version 1.4.

However, several methods defined in the shadowed version commons-io:commons-io:2.4 are still referenced by the project via org.apache.hadoop:hadoop-client:2.6.0-cdh5.4.11, and are missing from the version that is actually loaded, commons-io:commons-io:1.4.

For instance, the following methods (defined in commons-io:commons-io:2.4 but absent from 1.4) are reachable from imhotep-master\imhotep-archive, which can trigger a runtime error (a NoSuchMethodError) in imhotep-master\imhotep-archive.

1. org.apache.commons.io.IOUtils: void closeQuietly(java.net.Socket) is invoked by imhotep-master\imhotep-archive via the following path:


paths------
<com.indeed.imhotep.archive.compression.NoCompressionInputStream: int read(byte[],int,int)> imhotep-master\imhotep-archive\target\classes
<org.apache.hadoop.hdfs.DFSInputStream: int read(byte[],int,int)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.DFSInputStream: int readWithStrategy(org.apache.hadoop.hdfs.DFSInputStream$ReaderStrategy,int,int)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.DFSInputStream: org.apache.hadoop.hdfs.protocol.DatanodeInfo blockSeekTo(long)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.BlockReaderFactory: org.apache.hadoop.hdfs.BlockReader build()> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.BlockReaderFactory: org.apache.hadoop.hdfs.BlockReader getRemoteBlockReaderFromTcp()> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.BlockReaderFactory: org.apache.hadoop.hdfs.BlockReaderFactory$BlockReaderPeer nextTcpPeer()> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.server.namenode.NamenodeFsck$1: org.apache.hadoop.hdfs.net.Peer newConnectedPeer(java.net.InetSocketAddress,org.apache.hadoop.security.token.Token,org.apache.hadoop.hdfs.protocol.DatanodeID)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.commons.io.IOUtils: void closeQuietly(java.net.Socket)>

2. org.apache.commons.io.IOUtils: void closeQuietly(java.io.Closeable) is invoked by imhotep-master\imhotep-archive via the following path:


paths------
<com.indeed.imhotep.archive.compression.GzipCompressionInputStream: int read(byte[],int,int)> imhotep-master\imhotep-archive\target\classes
<org.apache.hadoop.hdfs.DFSInputStream: int read(byte[],int,int)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.DFSInputStream: int readWithStrategy(org.apache.hadoop.hdfs.DFSInputStream$ReaderStrategy,int,int)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.DFSInputStream: org.apache.hadoop.hdfs.protocol.DatanodeInfo blockSeekTo(long)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.BlockReaderFactory: org.apache.hadoop.hdfs.BlockReader build()> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.BlockReaderFactory: org.apache.hadoop.hdfs.BlockReader getRemoteBlockReaderFromDomain()> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.BlockReaderFactory: org.apache.hadoop.hdfs.BlockReaderFactory$BlockReaderPeer nextDomainPeer()> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory: org.apache.hadoop.net.unix.DomainSocket createSocket(org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory$PathInfo,int)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.commons.io.IOUtils: void closeQuietly(java.io.Closeable)>

3. org.apache.commons.io.input.BoundedInputStream: void init (java.io.InputStream,long) is invoked by imhotep-master\imhotep-archive via the following path:


paths------
<com.indeed.imhotep.archive.compression.NoCompressionInputStream: int read(byte[],int,int)> githubProject\imhotep-master\imhotep-archive\target\classes
<org.apache.hadoop.hdfs.web.ByteRangeInputStream: int read(byte[],int,int)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.web.ByteRangeInputStream: java.io.InputStream getInputStream()> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.web.ByteRangeInputStream: java.io.InputStream openInputStream()> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.commons.io.input.BoundedInputStream: void init (java.io.InputStream,long)>

Suggested fixing solutions

  1. Use a <dependencyManagement> section in the POM of imhotep-master\imhotep-archive to pin commons-io:commons-io to version 2.4.
  2. Declare a direct dependency on commons-io:commons-io:2.4 in the POM of imhotep-master\imhotep-archive to override the managed version.

Please let me know which solution you prefer; I can submit a PR to fix it.

Thank you very much for your attention.
Best regards,

Dependency tree----


[INFO] com.indeed:imhotep-archive:jar:1.0.11-SNAPSHOT
[INFO] \- org.apache.hadoop:hadoop-client:jar:2.6.0-cdh5.4.11:compile
[INFO]    +- org.apache.hadoop:hadoop-common:jar:2.6.0-cdh5.4.11:compile
[INFO]    |  \- commons-io:commons-io:jar:1.4:compile (version managed from 2.4)
[INFO]    +- org.apache.hadoop:hadoop-hdfs:jar:2.6.0-cdh5.4.11:compile
[INFO]    |  \- (commons-io:commons-io:jar:1.4:compile - version managed from 2.4; omitted for duplicate)
[INFO]    \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.6.0-cdh5.4.11:compile
[INFO]       \- org.apache.hadoop:hadoop-yarn-common:jar:2.6.0-cdh5.4.11:compile
[INFO]          \- (commons-io:commons-io:jar:1.4:compile - version managed from 2.4; omitted for duplicate)

Persistent Problems with Shard / Cache corruption

I am having a persistent problem like this:

Some day, suddenly, I will get an error:
java.lang.RuntimeException: unable to open session

The error is tied to a specific date range in a dataset: if my query includes that date in that dataset it breaks; if not, it doesn't.

I can work around it by deleting the contents of the cache:
sudo rm -rf /var/data/file_cache/*
then restarting the daemon and killing all active Imhotep processes (workaround due to #19).

However, it's a pain that I have to do this manually, and with some frequency.

When I look in the logs for the daemon, I see some periodic problem that looks like:

2017-08-21 17:29:06,327 INFO  [CachingLocalImhotepServiceCore] loading shard index20170710.00-20170717.00 from com.indeed.imhotep.io.caching.CachedFile@3463f366

2017-08-21 17:29:06,578 ERROR [CachingLocalImhotepServiceCore] Exception during cleanup of a Closeable, ignoring
java.lang.NullPointerException
at com.indeed.imhotep.io.Shard.close(Shard.java:131)
at com.indeed.util.core.reference.SharedReference.decRef(SharedReference.java:111)
at com.indeed.util.core.reference.SharedReference.close(SharedReference.java:76)
at com.indeed.util.core.io.Closeables2.closeQuietly(Closeables2.java:29)
at com.indeed.imhotep.service.CachingLocalImhotepServiceCore.updateShards(CachingLocalImhotepServiceCore.java:308)
at com.indeed.imhotep.service.CachingLocalImhotepServiceCore.(CachingLocalImhotepServiceCore.java:148)
at com.indeed.imhotep.service.ImhotepDaemon.newImhotepDaemon(ImhotepDaemon.java:758)
at com.indeed.imhotep.service.ImhotepDaemon.main(ImhotepDaemon.java:728)
at com.indeed.imhotep.service.ImhotepDaemon.main(ImhotepDaemon.java:694)

Not sure if related.

Apache Tomcat crashes during upload and needs to be manually restarted

Most processes are managed by supervisor, but Tomcat is not.
Sometimes, when uploading a large file, Tomcat crashes.
Since it isn't managed by supervisor, it has to be bounced manually; IQL and IUpload will not work until this is done.

This is hard to troubleshoot because there's little indication of what's going on.
When it occurs, IUpload returns a 503, and IQL loads but shows a "can't connect" error when trying to query.

IUpload lets you create invalid dataset names

If a dataset is named incorrectly (e.g. containing a number or uppercase characters), the upload will spin forever without either completing or reporting an error.

IUpload currently allows such names, so it should probably validate and reject them up front.
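
One way to prevent this would be a simple name check before IUpload accepts the file. The sketch below is hypothetical: the exact naming rules aren't documented here, so "lowercase letters, digits, and underscores, not starting with a digit" is an assumption based on the failure cases described above.

```java
import java.util.regex.Pattern;

// Hypothetical dataset-name validator sketch. The rule (lowercase letters,
// digits, underscores; must not start with a digit) is an assumption, not
// Imhotep's documented naming policy.
class DatasetNameValidator {
    private static final Pattern VALID = Pattern.compile("[a-z_][a-z0-9_]*");

    static boolean isValid(String name) {
        return name != null && VALID.matcher(name).matches();
    }
}
```

IUpload could run a check like this when the upload is submitted and return an immediate error instead of spinning.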

Imhotep doesn't clean up folders for overwritten shards

Minor issue:
If I upload a new TSV with the same name as an existing TSV, the new data overwrites the old.
If I look in the data bucket, I can see that the old shard file is gone, but the folder it was in still sticks around.

Outdated AWS java sdk causes problems

Dependencies are outdated.

There's an issue with the AWS Java SDK version that Imhotep uses; it's basically the same issue as soabase/exhibitor#213.

ImhotepDaemon fails to start with the same error if I point it at an S3 bucket in Frankfurt, but works for Ireland and other AWS regions:

Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256. (Service: Amazon S3; Status Code: 400; Error Code: InvalidRequest; Request ID: XXX), S3 Extended Request ID: XXX
	at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:820)
	at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:439)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:245)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3722)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3675)
	at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:620)
	at com.indeed.imhotep.io.caching.S3RemoteFileSystem.getListing(S3RemoteFileSystem.java:160)
	at com.indeed.imhotep.io.caching.S3RemoteFileSystem.stat(S3RemoteFileSystem.java:176)
	at com.indeed.imhotep.io.caching.SqarAutomountingRemoteFileSystem.stat(SqarAutomountingRemoteFileSystem.java:126)
	at com.indeed.imhotep.io.caching.RemappingRemoteFileSystem.stat(RemappingRemoteFileSystem.java:66)
	at com.indeed.imhotep.io.caching.CachedRemoteFileSystem.stat(CachedRemoteFileSystem.java:149)
	at com.indeed.imhotep.io.caching.CachedFile.exists(CachedFile.java:85)
	at com.indeed.imhotep.service.CachingLocalImhotepServiceCore.updateShards(CachingLocalImhotepServiceCore.java:245)
	at com.indeed.imhotep.service.CachingLocalImhotepServiceCore.<init>(CachingLocalImhotepServiceCore.java:148)
	at com.indeed.imhotep.service.ImhotepDaemon.newImhotepDaemon(ImhotepDaemon.java:758)
	at com.indeed.imhotep.service.ImhotepDaemon.main(ImhotepDaemon.java:728)
	at com.indeed.imhotep.service.ImhotepDaemon.main(ImhotepDaemon.java:694)

Imhotep is using 1.7.14, and according to that GitHub issue the problem is fixed in at least 1.9.16, which was released in 2015. The current SDK version is 1.11.262.

Parse errors can cause the indexer to spin forever.

If there is a parse problem (I have seen this multiple times with CSVs), Imhotep will spin forever and never finish indexing the file.

Looking into the logs, I see a cycle like this:
scanning for int field ....
sleep 8 minutes
restart
continue until 'scanning for int fields'
and then stop

It never produces an error or stops on its own; the file must be deleted manually.

Imhotep Daemon doesn't shut down properly for supervisor

While helping someone debug an issue in the forum recently, I noticed that after running "sudo /usr/local/bin/supervisorctl restart ImhotepDaemonMain", the old Java process remained alongside the new one, and there is some evidence that this was causing problems with queries. I also saw that "sudo /usr/local/bin/supervisorctl stop ImhotepDaemonMain" did not make the daemon process go away.

Make time zone configurable

See discussion here: https://groups.google.com/forum/#!topic/indeedeng-imhotep-users/GFuYpDI06e4

Question: "Is it possible to adjust default timezone to something other than GMT-6?"
Answer: """
Yes, but it requires code changes since there isn't a config parameter for it right now.
It has to be changed in 3 projects:
https://github.com/indeedeng/imhotep-tsv-converter/search?utf8=%E2%9C%93&q=-6&type=Code
https://github.com/indeedeng/imhotep/search?utf8=%E2%9C%93&q=-6&type=Code
https://github.com/indeedeng/iql/search?utf8=%E2%9C%93&q=-6
"""

Time zone should be made configurable for these components.
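
A config hook for this could be as small as reading the zone from a system property instead of hard-coding the offset. This is a sketch only: the property name `imhotep.timezone` is an invented placeholder, not an existing Imhotep setting.

```java
import java.util.TimeZone;

// Hypothetical config hook: read the shard time zone from a system property
// instead of hard-coding GMT-6. The property name "imhotep.timezone" is an
// assumption for illustration, not an existing Imhotep parameter.
class ShardTimeZone {
    static TimeZone get() {
        String id = System.getProperty("imhotep.timezone", "GMT-6");
        return TimeZone.getTimeZone(id);
    }
}
```

Each of the three projects listed above would then resolve the zone through one call like this rather than embedding the -6 offset in code.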

Add custom install instructions

We would like to do a custom install in AWS. Since we manage so much in AWS, we would prefer a custom install we can control.

This is the main thing preventing us from promoting Imhotep to an officially supported tool.

Sunset Notice

NOTE: Indeed has discontinued supporting this project. Archiving will take place on 8/16/21.
If you are interested in taking over as the Maintainer, please contact Indeed at [email protected]

Impossible to troubleshoot if a metric switches from int to str (or vice versa)

  • imhotep automatically detects and assigns type (int or str) to all fields in a shard based on the value of those fields
  • if the type of a field switches (eg is sometimes a str and sometimes an int) imhotep has a bad time and that field cannot be used in the group-by
  • however there is no way to know way to know which metrics have which types in which shards

It would be really nice to have some sort of a debug / validation tool that let's operators know which fields in which datasets contain multiple types, and for which shards.

Otherwise when this happens we basically have to backfill all time and hope it fixes it.
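
The core of such a tool could be a pass over the raw rows that flags columns mixing integer and string values. The sketch below is hypothetical (it is not part of Imhotep, and operates on parsed TSV rows rather than on shard files); it only illustrates the mixed-type detection step.

```java
import java.util.*;

// Hypothetical validation sketch (not an Imhotep tool): given a TSV header
// and parsed rows, report which columns contain both int-like and
// string-like values. Per-shard reporting would wrap this per shard.
class FieldTypeChecker {
    static Set<String> mixedTypeFields(String[] header, List<String[]> rows) {
        // For each column, record whether we saw int-like and/or non-int values.
        Map<String, Set<Boolean>> seen = new LinkedHashMap<>();
        for (String[] row : rows) {
            for (int i = 0; i < header.length && i < row.length; i++) {
                boolean isInt = row[i].matches("-?\\d+");
                seen.computeIfAbsent(header[i], k -> new HashSet<>()).add(isInt);
            }
        }
        // A column is mixed if both kinds were observed.
        Set<String> mixed = new LinkedHashSet<>();
        for (Map.Entry<String, Set<Boolean>> e : seen.entrySet()) {
            if (e.getValue().size() > 1) mixed.add(e.getKey());
        }
        return mixed;
    }
}
```

Run per shard, a report of the flagged columns would tell operators exactly which dataset/shard/field combinations need a backfill, instead of backfilling all time blindly.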
