indeedeng / imhotep

Imhotep is a large-scale analytics platform built by Indeed.
License: Apache License 2.0
Hi, there are multiple versions of commons-io:commons-io in imhotep-master\imhotep-archive. As shown in the following dependency tree, the library commons-io:commons-io:2.4 is transitively introduced by org.apache.hadoop:hadoop-client:2.6.0-cdh5.4.11, but dependency management forces it down to version 1.4.
However, several methods defined in the shadowed version commons-io:commons-io:2.4 are referenced by the client project via org.apache.hadoop:hadoop-client:2.6.0-cdh5.4.11, yet are missing from the version that is actually loaded, commons-io:commons-io:1.4.
For instance, the following missing methods (defined in commons-io:commons-io:2.4) are actually referenced by imhotep-master\imhotep-archive, which will introduce a runtime error (a NoSuchMethodError) into imhotep-master\imhotep-archive.
1. org.apache.commons.io.IOUtils: void closeQuietly(java.net.Socket) is invoked by imhotep-master\imhotep-archive via the following path:
paths------
<com.indeed.imhotep.archive.compression.NoCompressionInputStream: int read(byte[],int,int)> imhotep-master\imhotep-archive\target\classes
<org.apache.hadoop.hdfs.DFSInputStream: int read(byte[],int,int)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.DFSInputStream: int readWithStrategy(org.apache.hadoop.hdfs.DFSInputStream$ReaderStrategy,int,int)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.DFSInputStream: org.apache.hadoop.hdfs.protocol.DatanodeInfo blockSeekTo(long)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.BlockReaderFactory: org.apache.hadoop.hdfs.BlockReader build()> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.BlockReaderFactory: org.apache.hadoop.hdfs.BlockReader getRemoteBlockReaderFromTcp()> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.BlockReaderFactory: org.apache.hadoop.hdfs.BlockReaderFactory$BlockReaderPeer nextTcpPeer()> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.server.namenode.NamenodeFsck$1: org.apache.hadoop.hdfs.net.Peer newConnectedPeer(java.net.InetSocketAddress,org.apache.hadoop.security.token.Token,org.apache.hadoop.hdfs.protocol.DatanodeID)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.commons.io.IOUtils: void closeQuietly(java.net.Socket)>
2. org.apache.commons.io.IOUtils: void closeQuietly(java.io.Closeable) is invoked by imhotep-master\imhotep-archive via the following path:
paths------
<com.indeed.imhotep.archive.compression.GzipCompressionInputStream: int read(byte[],int,int)> imhotep-master\imhotep-archive\target\classes
<org.apache.hadoop.hdfs.DFSInputStream: int read(byte[],int,int)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.DFSInputStream: int readWithStrategy(org.apache.hadoop.hdfs.DFSInputStream$ReaderStrategy,int,int)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.DFSInputStream: org.apache.hadoop.hdfs.protocol.DatanodeInfo blockSeekTo(long)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.BlockReaderFactory: org.apache.hadoop.hdfs.BlockReader build()> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.BlockReaderFactory: org.apache.hadoop.hdfs.BlockReader getRemoteBlockReaderFromDomain()> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.BlockReaderFactory: org.apache.hadoop.hdfs.BlockReaderFactory$BlockReaderPeer nextDomainPeer()> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory: org.apache.hadoop.net.unix.DomainSocket createSocket(org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory$PathInfo,int)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.commons.io.IOUtils: void closeQuietly(java.io.Closeable)>
3. org.apache.commons.io.input.BoundedInputStream: void <init>(java.io.InputStream,long) is invoked by imhotep-master\imhotep-archive via the following path:
paths------
<com.indeed.imhotep.archive.compression.NoCompressionInputStream: int read(byte[],int,int)> githubProject\imhotep-master\imhotep-archive\target\classes
<org.apache.hadoop.hdfs.web.ByteRangeInputStream: int read(byte[],int,int)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.web.ByteRangeInputStream: java.io.InputStream getInputStream()> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.web.ByteRangeInputStream: java.io.InputStream openInputStream()> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.commons.io.input.BoundedInputStream: void <init>(java.io.InputStream,long)>
Please let me know which solution you prefer; I can submit a PR to fix it.
Thank you very much for your attention.
Best regards,
[INFO] com.indeed:imhotep-archive:jar:1.0.11-SNAPSHOT
[INFO] \- org.apache.hadoop:hadoop-client:jar:2.6.0-cdh5.4.11:compile
[INFO] +- org.apache.hadoop:hadoop-common:jar:2.6.0-cdh5.4.11:compile
[INFO] | \- commons-io:commons-io:jar:1.4:compile (version managed from 2.4)
[INFO] +- org.apache.hadoop:hadoop-hdfs:jar:2.6.0-cdh5.4.11:compile
[INFO] | \- (commons-io:commons-io:jar:1.4:compile - version managed from 2.4; omitted for duplicate)
[INFO] \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.6.0-cdh5.4.11:compile
[INFO] \- org.apache.hadoop:hadoop-yarn-common:jar:2.6.0-cdh5.4.11:compile
[INFO] \- (commons-io:commons-io:jar:1.4:compile - version managed from 2.4; omitted for duplicate)
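A quick way to confirm which overloads the resolved jar actually provides is a reflection probe. The sketch below is illustrative and uses only JDK classes for the demonstration; pointing `hasMethod` at "org.apache.commons.io.IOUtils" with `java.net.Socket.class` on the project classpath would show whether the 2.4-only overload is present.

```java
public class MethodProbe {
    /**
     * Returns true if the named class, as loaded by this JVM, declares a
     * public method with the given name and parameter types.
     */
    static boolean hasMethod(String className, String method, Class<?>... params) {
        try {
            Class.forName(className).getMethod(method, params);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Demonstrated on a JDK class; against this project's classpath you
        // would probe "org.apache.commons.io.IOUtils" for
        // closeQuietly(java.net.Socket.class) to see which commons-io won.
        System.out.println(hasMethod("java.util.Objects", "requireNonNull", Object.class)); // true
        System.out.println(hasMethod("java.util.Objects", "noSuchMethod"));                 // false
    }
}
```

Running this once at daemon startup (or in a unit test) would turn the runtime NoSuchMethodError into an early, explicit diagnostic.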
I am having persistent problems like this:
Some day, suddenly, I will get an error:
java.lang.RuntimeException: unable to open session
The error is tied to a specific date range in a dataset.
E.g. if my query includes that date in that dataset it breaks; if not, it doesn't.
I can work around this by deleting the contents of the cache:
sudo rm -rf /var/data/file_cache/*
and then restarting the daemon and killing all active Imhotep processes (workaround due to #19).
However, it is a pain to have to do this manually and with some frequency.
When I look in the logs for the daemon, I see some periodic problem that looks like:
2017-08-21 17:29:06,327 INFO [CachingLocalImhotepServiceCore] loading shard index20170710.00-20170717.00 from com.indeed.imhotep.io.caching.CachedFile@3463f366
2017-08-21 17:29:06,578 ERROR [CachingLocalImhotepServiceCore] Exception during cleanup of a Closeable, ignoring
java.lang.NullPointerException
at com.indeed.imhotep.io.Shard.close(Shard.java:131)
at com.indeed.util.core.reference.SharedReference.decRef(SharedReference.java:111)
at com.indeed.util.core.reference.SharedReference.close(SharedReference.java:76)
at com.indeed.util.core.io.Closeables2.closeQuietly(Closeables2.java:29)
at com.indeed.imhotep.service.CachingLocalImhotepServiceCore.updateShards(CachingLocalImhotepServiceCore.java:308)
at com.indeed.imhotep.service.CachingLocalImhotepServiceCore.<init>(CachingLocalImhotepServiceCore.java:148)
at com.indeed.imhotep.service.ImhotepDaemon.newImhotepDaemon(ImhotepDaemon.java:758)
at com.indeed.imhotep.service.ImhotepDaemon.main(ImhotepDaemon.java:728)
at com.indeed.imhotep.service.ImhotepDaemon.main(ImhotepDaemon.java:694)
Not sure if related.
Over HTTPS we end up with a ~2 min load time due to problems caused by appcache.
As a workaround we have to run over HTTP.
We would like to run over HTTPS without the appcache problems.
Most processes are managed by supervisor; however, Tomcat is not.
Sometimes when uploading a large file, Apache Tomcat crashes.
Since it's not managed by supervisor, it needs to be bounced manually.
IQL and IUpload will not work until this is done.
This is hard to troubleshoot if you can't tell what's going on.
When this occurs, IUpload returns a 503.
IQL loads, but shows a 'can't connect' error when trying to query.
If a dataset is named incorrectly (e.g. containing a number or uppercase characters), the upload will spin forever and neither complete nor report an error.
IUpload currently allows such names, so it should probably reject them up front.
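A cheap guard would be to validate the name before accepting the upload. The pattern below encodes only the rule implied by this report (lowercase ASCII letters, no digits or uppercase); the real constraint may be broader, so treat it as a sketch.

```java
public class DatasetNameCheck {
    // Assumed rule, taken from this report: lowercase ASCII letters only.
    // Adjust the pattern if the actual naming constraint differs.
    static boolean isValidName(String name) {
        return name.matches("[a-z]+");
    }

    public static void main(String[] args) {
        System.out.println(isValidName("jobsearch"));  // true
        System.out.println(isValidName("JobSearch1")); // false
    }
}
```

Rejecting the name with a clear message at upload time beats an indexing job that spins forever.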
Minor issue:
If I upload a new TSV with the same name as an existing TSV, it will overwrite the old data with the new.
If I look in the data bucket, I can see that the old shard file is gone; however, the folder it was in still sticks around.
Dependencies are outdated:
There's an issue with the AWS Java SDK version that Imhotep uses,
basically the same issue as here: soabase/exhibitor#213.
ImhotepDaemon fails to start with the error below if I point it at an S3 bucket in Frankfurt, but it works for Ireland and other AWS regions:
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256. (Service: Amazon S3; Status Code: 400; Error Code: InvalidRequest; Request ID: XXX), S3 Extended Request ID: XXX
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:820)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:439)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:245)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3722)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3675)
at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:620)
at com.indeed.imhotep.io.caching.S3RemoteFileSystem.getListing(S3RemoteFileSystem.java:160)
at com.indeed.imhotep.io.caching.S3RemoteFileSystem.stat(S3RemoteFileSystem.java:176)
at com.indeed.imhotep.io.caching.SqarAutomountingRemoteFileSystem.stat(SqarAutomountingRemoteFileSystem.java:126)
at com.indeed.imhotep.io.caching.RemappingRemoteFileSystem.stat(RemappingRemoteFileSystem.java:66)
at com.indeed.imhotep.io.caching.CachedRemoteFileSystem.stat(CachedRemoteFileSystem.java:149)
at com.indeed.imhotep.io.caching.CachedFile.exists(CachedFile.java:85)
at com.indeed.imhotep.service.CachingLocalImhotepServiceCore.updateShards(CachingLocalImhotepServiceCore.java:245)
at com.indeed.imhotep.service.CachingLocalImhotepServiceCore.<init>(CachingLocalImhotepServiceCore.java:148)
at com.indeed.imhotep.service.ImhotepDaemon.newImhotepDaemon(ImhotepDaemon.java:758)
at com.indeed.imhotep.service.ImhotepDaemon.main(ImhotepDaemon.java:728)
at com.indeed.imhotep.service.ImhotepDaemon.main(ImhotepDaemon.java:694)
Imhotep is using 1.7.14, and according to the commenter in that GitHub issue it was fixed in at least 1.9.16, which was released in 2015. The current version is 1.11.262.
@ThomasBergman1 asked if there was a way to isolate TSV conversion on the command line for a single file.
I saw an error like this when doing a particularly heavy query.
Query failed:
Looks like the IQL server got overloaded and is restarting.
Please take a look at your query and consider if it is too heavy.
Wiki:
https://wiki.indeed.com/display/INTEL/_Performance+Considerations+for+IQL+usage
If there is a parse problem (I have seen this multiple times with CSVs),
Imhotep will spin forever and never complete indexing the file.
Looking at the logs, I see this cycle:
scanning for int field ....
(sleeps 8 minutes)
(restarts)
(continues until 'scanning for int fields', then stops, and the cycle repeats)
It never produces an error or completes;
the file must be manually deleted.
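One way to break the endless scan/sleep/restart cycle would be a bounded retry that surfaces a real error once the limit is hit. This is a generic sketch of the idea, not Imhotep's actual indexing code:

```java
import java.util.function.Supplier;

public class BoundedRetry {
    /**
     * Runs a step up to maxAttempts times; if every attempt fails, rethrows
     * with the last failure as the cause instead of looping forever.
     */
    static <T> T retry(int maxAttempts, Supplier<T> step) {
        RuntimeException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return step.get();
            } catch (RuntimeException e) {
                last = e;
            }
        }
        throw new RuntimeException("giving up after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        try {
            // A step that always fails, standing in for the CSV parse step.
            retry(3, () -> { throw new IllegalStateException("parse error"); });
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a cap like this, a bad file fails loudly once instead of being retried silently forever.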
While trying to help someone debug an issue in the forum recently, I noticed that after running "sudo /usr/local/bin/supervisorctl restart ImhotepDaemonMain", the old java process remained alongside the new one. There is some evidence that was causing problems with queries. I was also able to see that "sudo /usr/local/bin/supervisorctl stop ImhotepDaemonMain" was not resulting in the daemon process going away.
Had some problems where the cache gets too full, takes up all the space, and then new shards can't be loaded.
As an alternative to S3, support accessing remote files in a Minio (open-source, S3-compatible) service.
Open-source Imhotep currently has only S3 remote-filesystem support. While running outside of AWS can probably work using https://minio.io/, we should add HDFS as an alternative.
See discussion here: https://groups.google.com/forum/#!topic/indeedeng-imhotep-users/GFuYpDI06e4
Question: "Is it possible to adjust default timezone to something other than GMT-6?"
Answer: """
Yes, but it requires code changes since there isn't a config parameter for it right now.
It has to be changed in 3 projects:
https://github.com/indeedeng/imhotep-tsv-converter/search?utf8=%E2%9C%93&q=-6&type=Code
https://github.com/indeedeng/imhotep/search?utf8=%E2%9C%93&q=-6&type=Code
https://github.com/indeedeng/iql/search?utf8=%E2%9C%93&q=-6
"""
Time zone should be made configurable for these components.
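Until such a config option exists, the change amounts to replacing the hard-coded offset with something configurable. A minimal sketch, assuming a hypothetical `imhotep.timezone` system property (this property does not exist in the codebase; the name is illustrative):

```java
import java.util.TimeZone;

public class ShardTimeZone {
    // Hypothetical: read the zone from a system property, defaulting to the
    // currently hard-coded GMT-6 so existing clusters keep their behavior.
    static TimeZone shardTimeZone() {
        return TimeZone.getTimeZone(System.getProperty("imhotep.timezone", "GMT-6"));
    }

    public static void main(String[] args) {
        // With no property set, the raw offset is -6 hours.
        System.out.println(shardTimeZone().getRawOffset() / 3_600_000);
    }
}
```

The same lookup would need to land in each of the three projects linked above, since each hard-codes the -6 offset independently.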
We would like to do a custom install in AWS.
As we manage so much in AWS, we would prefer a custom install we can control.
This is the main thing preventing us from making Imhotep an officially supported tool.
I often find myself screenshotting a graph and then sending it in an email along with a query.
It would be nice to be able to grab it directly from the tool.
iupload is somehow creating corrupt S3 object keys.
This prevents Imhotep from loading shards.
When running multiple queries, exporting from the server only exports the first one.
NOTE: Indeed has discontinued supporting this project. Archiving will take place on 8/16/21.
If you are interested in taking over as the Maintainer, please contact Indeed at [email protected]
@ThomasBergman1 has requested instructions on how to manually deploy the services required for an Imhotep cluster (as an alternative to the CloudFormation scripts).
Do you have wrappers for querying Imhotep programmatically?
That would be very useful.
A Bash wrapper was included in a previous open-source release, but is now missing.
The breakage manifests as follows:
All files are moved into 'indexed'.
Shards are created in the data bucket.
The shards are not loaded into Imhotep.
No new shards will be loaded until the most recently loaded shard is deleted.
Looking in the logs, you can see it trying to load a file and never succeeding.
E.g. I will have a value "Österreich" showing in IQL when I do a group-by.
However, if I click on that value and try to include it in a filter, I get zero results.
It would be really nice to have some sort of debug/validation tool that lets operators know which fields in which datasets contain multiple types, and for which shards.
Otherwise, when this happens, we basically have to backfill all time and hope that fixes it.
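Such a validation tool could, for each field in each shard, check whether the stored terms mix integer and string forms. A toy sketch of the core check (this is not Imhotep's shard format, just the idea):

```java
import java.util.List;

public class FieldTypeCheck {
    /**
     * Flags a field whose raw values mix integer-parseable and
     * non-integer strings, the situation described above.
     */
    static boolean isMixedType(List<String> values) {
        boolean sawInt = false, sawString = false;
        for (String v : values) {
            try {
                Long.parseLong(v);
                sawInt = true;
            } catch (NumberFormatException e) {
                sawString = true;
            }
        }
        return sawInt && sawString;
    }

    public static void main(String[] args) {
        System.out.println(isMixedType(List.of("1", "2", "Österreich"))); // true
        System.out.println(isMixedType(List.of("1", "2", "3")));          // false
    }
}
```

A report of (dataset, field, shard) triples that trip this check would tell operators exactly which shards to rebuild instead of backfilling all time.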
I set up a builder directory in the Github wiki here:
https://github.com/indeedeng/imhotep/wiki/Imhotep-Builder-Directory
I chose the wiki for ease of addition to the directory. The Imhotep documentation in the gh-pages branch should be updated to link to the directory.
Our current workflow is: build data in Redshift > export to TSV > clean up the TSV > copy into tsvtoindex.
It would be fantastic to be able to export directly from Redshift.
I am sure that many AWS users face similar issues.