indeedeng / imhotep

Imhotep is a large-scale analytics platform built by Indeed.
License: Apache License 2.0
Hi, there are multiple versions of commons-io:commons-io in imhotep-master\imhotep-archive. As shown in the following dependency tree, the library commons-io:commons-io:2.4 is transitively introduced by org.apache.hadoop:hadoop-client:2.6.0-cdh5.4.11, but dependency management forces it down to version 1.4.
However, several methods defined in the shadowed version commons-io:commons-io:2.4 are referenced by the client project via org.apache.hadoop:hadoop-client:2.6.0-cdh5.4.11, yet are missing from the version that is actually loaded, commons-io:commons-io:1.4.
For instance, the following missing methods (defined in commons-io:commons-io:2.4) are actually referenced by imhotep-master\imhotep-archive, which will introduce a runtime error (a NoSuchMethodError) into imhotep-master\imhotep-archive.
1. org.apache.commons.io.IOUtils: void closeQuietly(java.net.Socket) is invoked by imhotep-master\imhotep-archive via the following path:
paths------
<com.indeed.imhotep.archive.compression.NoCompressionInputStream: int read(byte[],int,int)> imhotep-master\imhotep-archive\target\classes
<org.apache.hadoop.hdfs.DFSInputStream: int read(byte[],int,int)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.DFSInputStream: int readWithStrategy(org.apache.hadoop.hdfs.DFSInputStream$ReaderStrategy,int,int)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.DFSInputStream: org.apache.hadoop.hdfs.protocol.DatanodeInfo blockSeekTo(long)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.BlockReaderFactory: org.apache.hadoop.hdfs.BlockReader build()> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.BlockReaderFactory: org.apache.hadoop.hdfs.BlockReader getRemoteBlockReaderFromTcp()> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.BlockReaderFactory: org.apache.hadoop.hdfs.BlockReaderFactory$BlockReaderPeer nextTcpPeer()> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.server.namenode.NamenodeFsck$1: org.apache.hadoop.hdfs.net.Peer newConnectedPeer(java.net.InetSocketAddress,org.apache.hadoop.security.token.Token,org.apache.hadoop.hdfs.protocol.DatanodeID)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.commons.io.IOUtils: void closeQuietly(java.net.Socket)>
2. org.apache.commons.io.IOUtils: void closeQuietly(java.io.Closeable) is invoked by imhotep-master\imhotep-archive via the following path:
paths------
<com.indeed.imhotep.archive.compression.GzipCompressionInputStream: int read(byte[],int,int)> imhotep-master\imhotep-archive\target\classes
<org.apache.hadoop.hdfs.DFSInputStream: int read(byte[],int,int)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.DFSInputStream: int readWithStrategy(org.apache.hadoop.hdfs.DFSInputStream$ReaderStrategy,int,int)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.DFSInputStream: org.apache.hadoop.hdfs.protocol.DatanodeInfo blockSeekTo(long)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.BlockReaderFactory: org.apache.hadoop.hdfs.BlockReader build()> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.BlockReaderFactory: org.apache.hadoop.hdfs.BlockReader getRemoteBlockReaderFromDomain()> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.BlockReaderFactory: org.apache.hadoop.hdfs.BlockReaderFactory$BlockReaderPeer nextDomainPeer()> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory: org.apache.hadoop.net.unix.DomainSocket createSocket(org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory$PathInfo,int)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.commons.io.IOUtils: void closeQuietly(java.io.Closeable)>
3. org.apache.commons.io.input.BoundedInputStream: void <init>(java.io.InputStream,long) is invoked by imhotep-master\imhotep-archive via the following path:
paths------
<com.indeed.imhotep.archive.compression.NoCompressionInputStream: int read(byte[],int,int)> githubProject\imhotep-master\imhotep-archive\target\classes
<org.apache.hadoop.hdfs.web.ByteRangeInputStream: int read(byte[],int,int)> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.web.ByteRangeInputStream: java.io.InputStream getInputStream()> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.hadoop.hdfs.web.ByteRangeInputStream: java.io.InputStream openInputStream()> Repositories\org\apache\hadoop\hadoop-hdfs\2.6.0-cdh5.4.11\hadoop-hdfs-2.6.0-cdh5.4.11.jar
<org.apache.commons.io.input.BoundedInputStream: void <init>(java.io.InputStream,long)>
Please let me know which solution you prefer; I can submit a PR to fix it.
Thank you very much for your attention.
Best regards,
[INFO] com.indeed:imhotep-archive:jar:1.0.11-SNAPSHOT
[INFO] \- org.apache.hadoop:hadoop-client:jar:2.6.0-cdh5.4.11:compile
[INFO] +- org.apache.hadoop:hadoop-common:jar:2.6.0-cdh5.4.11:compile
[INFO] | \- commons-io:commons-io:jar:1.4:compile (version managed from 2.4)
[INFO] +- org.apache.hadoop:hadoop-hdfs:jar:2.6.0-cdh5.4.11:compile
[INFO] | \- (commons-io:commons-io:jar:1.4:compile - version managed from 2.4; omitted for duplicate)
[INFO] \- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.6.0-cdh5.4.11:compile
[INFO] \- org.apache.hadoop:hadoop-yarn-common:jar:2.6.0-cdh5.4.11:compile
[INFO] \- (commons-io:commons-io:jar:1.4:compile - version managed from 2.4; omitted for duplicate)
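A quick way to confirm which overloads the resolved jar actually provides is a reflection probe. The sketch below is illustrative and uses only JDK classes for the demonstration; pointing `hasMethod` at "org.apache.commons.io.IOUtils" with `java.net.Socket.class` on the project classpath would show whether the 2.4-only overload is present.

```java
public class MethodProbe {
    /**
     * Returns true if the named class, as loaded by this JVM, declares a
     * public method with the given name and parameter types.
     */
    static boolean hasMethod(String className, String method, Class<?>... params) {
        try {
            Class.forName(className).getMethod(method, params);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Demonstrated on a JDK class; against this project's classpath you
        // would probe "org.apache.commons.io.IOUtils" for
        // closeQuietly(java.net.Socket.class) to see which commons-io won.
        System.out.println(hasMethod("java.util.Objects", "requireNonNull", Object.class)); // true
        System.out.println(hasMethod("java.util.Objects", "noSuchMethod"));                 // false
    }
}
```

Running this once at daemon startup (or in a unit test) would turn the runtime NoSuchMethodError into an early, explicit diagnostic.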
I am having persistent problems like this:
Some day, suddenly, I will get an error:
java.lang.RuntimeException: unable to open session
The error is tied to a specific date range in a dataset.
E.g. if my query includes that date in that dataset it breaks; if not, it doesn't.
I can work around this by deleting the contents of the cache:
sudo rm -rf /var/data/file_cache/*
and then restarting the daemon and killing all active Imhotep processes (workaround due to #19).
However, it is a pain to have to do this manually and with some frequency.
When I look in the logs for the daemon, I see some periodic problem that looks like:
2017-08-21 17:29:06,327 INFO [CachingLocalImhotepServiceCore] loading shard index20170710.00-20170717.00 from com.indeed.imhotep.io.caching.CachedFile@3463f366
2017-08-21 17:29:06,578 ERROR [CachingLocalImhotepServiceCore] Exception during cleanup of a Closeable, ignoring
java.lang.NullPointerException
at com.indeed.imhotep.io.Shard.close(Shard.java:131)
at com.indeed.util.core.reference.SharedReference.decRef(SharedReference.java:111)
at com.indeed.util.core.reference.SharedReference.close(SharedReference.java:76)
at com.indeed.util.core.io.Closeables2.closeQuietly(Closeables2.java:29)
at com.indeed.imhotep.service.CachingLocalImhotepServiceCore.updateShards(CachingLocalImhotepServiceCore.java:308)
at com.indeed.imhotep.service.CachingLocalImhotepServiceCore.<init>(CachingLocalImhotepServiceCore.java:148)
at com.indeed.imhotep.service.ImhotepDaemon.newImhotepDaemon(ImhotepDaemon.java:758)
at com.indeed.imhotep.service.ImhotepDaemon.main(ImhotepDaemon.java:728)
at com.indeed.imhotep.service.ImhotepDaemon.main(ImhotepDaemon.java:694)
Not sure if related.
Over HTTPS we end up with a ~2 min load time due to problems caused by appcache.
As a workaround we have to run over HTTP.
We would like to run over HTTPS without the appcache problems.
Most processes are managed by supervisor; however, Tomcat is not.
Sometimes when uploading a large file, Apache Tomcat crashes.
Since it's not managed by supervisor, it needs to be bounced manually.
IQL and IUpload will not work until this is done.
This is hard to troubleshoot if you can't tell what's going on.
When this occurs, IUpload returns a 503.
IQL loads, but shows a 'can't connect' error when trying to query.
If a dataset is named incorrectly (e.g. containing a number or uppercase characters), the upload will spin forever and neither complete nor report an error.
IUpload currently allows such names, so it should probably reject them up front.
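A cheap guard would be to validate the name before accepting the upload. The pattern below encodes only the rule implied by this report (lowercase ASCII letters, no digits or uppercase); the real constraint may be broader, so treat it as a sketch.

```java
public class DatasetNameCheck {
    // Assumed rule, taken from this report: lowercase ASCII letters only.
    // Adjust the pattern if the actual naming constraint differs.
    static boolean isValidName(String name) {
        return name.matches("[a-z]+");
    }

    public static void main(String[] args) {
        System.out.println(isValidName("jobsearch"));  // true
        System.out.println(isValidName("JobSearch1")); // false
    }
}
```

Rejecting the name with a clear message at upload time beats an indexing job that spins forever.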
Minor issue:
If I upload a new TSV with the same name as an existing TSV, it will overwrite the old data with the new.
If I look in the data bucket, I can see that the old shard file is gone; however, the folder it was in still sticks around.
Dependencies are outdated:
There's an issue with the AWS Java SDK version that Imhotep uses,
basically the same issue as here: soabase/exhibitor#213.
ImhotepDaemon fails to start with the error below if I point it at an S3 bucket in Frankfurt, but it works for Ireland and other AWS regions:
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256. (Service: Amazon S3; Status Code: 400; Error Code: InvalidRequest; Request ID: XXX), S3 Extended Request ID: XXX
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:820)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:439)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:245)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3722)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3675)
at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:620)
at com.indeed.imhotep.io.caching.S3RemoteFileSystem.getListing(S3RemoteFileSystem.java:160)
at com.indeed.imhotep.io.caching.S3RemoteFileSystem.stat(S3RemoteFileSystem.java:176)
at com.indeed.imhotep.io.caching.SqarAutomountingRemoteFileSystem.stat(SqarAutomountingRemoteFileSystem.java:126)
at com.indeed.imhotep.io.caching.RemappingRemoteFileSystem.stat(RemappingRemoteFileSystem.java:66)
at com.indeed.imhotep.io.caching.CachedRemoteFileSystem.stat(CachedRemoteFileSystem.java:149)
at com.indeed.imhotep.io.caching.CachedFile.exists(CachedFile.java:85)
at com.indeed.imhotep.service.CachingLocalImhotepServiceCore.updateShards(CachingLocalImhotepServiceCore.java:245)
at com.indeed.imhotep.service.CachingLocalImhotepServiceCore.<init>(CachingLocalImhotepServiceCore.java:148)
at com.indeed.imhotep.service.ImhotepDaemon.newImhotepDaemon(ImhotepDaemon.java:758)
at com.indeed.imhotep.service.ImhotepDaemon.main(ImhotepDaemon.java:728)
at com.indeed.imhotep.service.ImhotepDaemon.main(ImhotepDaemon.java:694)
Imhotep is using 1.7.14, and according to the commenter in that GitHub issue it was fixed in at least 1.9.16, which was released in 2015. The current version is 1.11.262.
@ThomasBergman1 asked if there was a way to isolate TSV conversion on the command line for a single file.
I saw an error like this when doing a particularly heavy query.
Query failed:
Looks like the IQL server got overloaded and is restarting.
Please take a look at your query and consider if it is too heavy.
Wiki:
https://wiki.indeed.com/display/INTEL/_Performance+Considerations+for+IQL+usage
If there is a parse problem (I have seen this multiple times with CSVs),
Imhotep will spin forever and never complete indexing the file.
Looking at the logs, I see this cycle:
scanning for int field ....
(sleeps 8 minutes)
(restarts)
(continues until 'scanning for int fields', then stops, and the cycle repeats)
It never produces an error or completes;
the file must be manually deleted.
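One way to break the endless scan/sleep/restart cycle would be a bounded retry that surfaces a real error once the limit is hit. This is a generic sketch of the idea, not Imhotep's actual indexing code:

```java
import java.util.function.Supplier;

public class BoundedRetry {
    /**
     * Runs a step up to maxAttempts times; if every attempt fails, rethrows
     * with the last failure as the cause instead of looping forever.
     */
    static <T> T retry(int maxAttempts, Supplier<T> step) {
        RuntimeException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return step.get();
            } catch (RuntimeException e) {
                last = e;
            }
        }
        throw new RuntimeException("giving up after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        try {
            // A step that always fails, standing in for the CSV parse step.
            retry(3, () -> { throw new IllegalStateException("parse error"); });
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a cap like this, a bad file fails loudly once instead of being retried silently forever.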
While trying to help someone debug an issue in the forum recently, I noticed that after running "sudo /usr/local/bin/supervisorctl restart ImhotepDaemonMain", the old java process remained alongside the new one. There is some evidence that was causing problems with queries. I was also able to see that "sudo /usr/local/bin/supervisorctl stop ImhotepDaemonMain" was not resulting in the daemon process going away.
Had some problems where the cache gets too full, takes up all the space, and then new shards can't be loaded.
As an alternative to S3, support accessing remote files in a Minio (open-source, S3-compatible) service.
Open-source Imhotep currently has only S3 remote-filesystem support. While running outside of AWS can probably work using https://minio.io/, we should add HDFS as an alternative.
See discussion here: https://groups.google.com/forum/#!topic/indeedeng-imhotep-users/GFuYpDI06e4
Question: "Is it possible to adjust default timezone to something other than GMT-6?"
Answer: """
Yes, but it requires code changes since there isn't a config parameter for it right now.
It has to be changed in 3 projects:
https://github.com/indeedeng/imhotep-tsv-converter/search?utf8=%E2%9C%93&q=-6&type=Code
https://github.com/indeedeng/imhotep/search?utf8=%E2%9C%93&q=-6&type=Code
https://github.com/indeedeng/iql/search?utf8=%E2%9C%93&q=-6
"""
Time zone should be made configurable for these components.
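Until such a config option exists, the change amounts to replacing the hard-coded offset with something configurable. A minimal sketch, assuming a hypothetical `imhotep.timezone` system property (this property does not exist in the codebase; the name is illustrative):

```java
import java.util.TimeZone;

public class ShardTimeZone {
    // Hypothetical: read the zone from a system property, defaulting to the
    // currently hard-coded GMT-6 so existing clusters keep their behavior.
    static TimeZone shardTimeZone() {
        return TimeZone.getTimeZone(System.getProperty("imhotep.timezone", "GMT-6"));
    }

    public static void main(String[] args) {
        // With no property set, the raw offset is -6 hours.
        System.out.println(shardTimeZone().getRawOffset() / 3_600_000);
    }
}
```

The same lookup would need to land in each of the three projects linked above, since each hard-codes the -6 offset independently.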
We would like to do a custom install in AWS.
As we manage so much in AWS, we would prefer a custom install we can control.
This is the main thing preventing us from making Imhotep an officially supported tool.
I often find myself screenshotting a graph and then sending it in an email along with a query.
It would be nice to be able to grab it directly from the tool.
iupload is somehow creating corrupt S3 object keys.
This prevents Imhotep from loading shards.
When running multiple queries, exporting from the server only exports the first one.
NOTE: Indeed has discontinued supporting this project. Archiving will take place on 8/16/21.
If you are interested in taking over as the Maintainer, please contact Indeed at [email protected]
@ThomasBergman1 has requested instructions on how to manually deploy the services required for an Imhotep cluster (as an alternative to the CloudFormation scripts).
Do you have wrappers for querying Imhotep programmatically?
That would be very useful.
A Bash wrapper was included in a previous open-source release, but is now missing.
The breakage manifests as follows:
All files are moved into 'indexed'.
Shards are created in the data bucket.
The shards are not loaded into Imhotep.
No new shards will be loaded until the most recently loaded shard is deleted.
Looking in the logs, you can see it trying to load a file and never succeeding.
E.g. I will have a value "Österreich" showing in IQL when I do a group-by.
However, if I click on that value and try to include it in a filter, I get zero results.
It would be really nice to have some sort of debug/validation tool that lets operators know which fields in which datasets contain multiple types, and for which shards.
Otherwise, when this happens, we basically have to backfill all time and hope that fixes it.
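Such a validation tool could, for each field in each shard, check whether the stored terms mix integer and string forms. A toy sketch of the core check (this is not Imhotep's shard format, just the idea):

```java
import java.util.List;

public class FieldTypeCheck {
    /**
     * Flags a field whose raw values mix integer-parseable and
     * non-integer strings, the situation described above.
     */
    static boolean isMixedType(List<String> values) {
        boolean sawInt = false, sawString = false;
        for (String v : values) {
            try {
                Long.parseLong(v);
                sawInt = true;
            } catch (NumberFormatException e) {
                sawString = true;
            }
        }
        return sawInt && sawString;
    }

    public static void main(String[] args) {
        System.out.println(isMixedType(List.of("1", "2", "Österreich"))); // true
        System.out.println(isMixedType(List.of("1", "2", "3")));          // false
    }
}
```

A report of (dataset, field, shard) triples that trip this check would tell operators exactly which shards to rebuild instead of backfilling all time.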
I set up a builder directory in the Github wiki here:
https://github.com/indeedeng/imhotep/wiki/Imhotep-Builder-Directory
I chose the wiki for ease of addition to the directory. The Imhotep documentation in the gh-pages branch should be updated to link to the directory.
Our current workflow is: build data in Redshift > export to TSV > clean up the TSV > copy into tsvtoindex.
It would be fantastic to be able to export directly from Redshift.
I am sure that many AWS users face similar issues.