
hRaven collects run time data and statistics from MapReduce jobs in an easily queryable format

Home Page: https://twitter.com/twitterhadoop

License: Apache License 2.0


hraven's Introduction

hRaven

Status: retired

hRaven collects run time data and statistics from map reduce jobs running on Hadoop clusters and stores the collected job history in an easily queryable format. For the jobs that are run through frameworks (Pig or Scalding/Cascading) that decompose a script or application into a DAG of map reduce jobs for actual execution, hRaven groups job history data together by an application construct. This allows for easier visualization of all of the component jobs' execution for an application and more comprehensive trending and analysis over time.

Requirements

  • Apache HBase (1.1.3) - a running HBase cluster is required for the hRaven data storage
  • Apache Hadoop - hRaven currently supports collection of job data on specific versions of Hadoop:
    • Hadoop 2.6+ (as of hRaven 1.0.0, Hadoop 1 is no longer supported)
  • JRE 8 - As of hRaven 1.0.0, Java 8 jars are generated. Use hRaven 0.9.x for Java 7 runtimes.

Quick start

Clone the github repo or download the latest release:

git clone git://github.com/twitter/hraven.git

If you cloned the repository, build the full tarball:

mvn clean package assembly:single

Extract the assembly tarball on a machine with HBase client access.

Create the initial schema

hbase [--config /path/to/hbase/conf] shell bin/create_schema.rb

Schema

hRaven requires the following HBase tables in order to store data for map reduce jobs:

  • job_history - job-level statistics, one row per job
  • job_history_task - task-level statistics, one row per task attempt
  • job_history-by_jobId - index table pointing to job_history row by job ID
  • job_history_app_version - distinct versions associated with an application, one row per application
  • job_history_raw - stores the raw job configuration and job history files, as byte[] blobs
  • job_history_process - meta table storing progress information for the data loading process
  • flow_queue - time based index of flows for Ambrose integration
  • flow_event - stores flow progress events for Ambrose integration

The initial table schema can be created by running the create_schema.rb script:

hbase [--config /path/to/hbase/conf] shell bin/create_schema.rb

Data Loading

Currently, hRaven loads data for completed map reduce jobs by reading and parsing the job history and job configuration files from HDFS. As a pre-requisite, the Hadoop Job Tracker must be configured to archive job history files in HDFS, by adding the following setting to your mapred-site.xml file:

<property>
  <name>mapred.job.tracker.history.completed.location</name>
  <value>hdfs://<namenode>:8020/hadoop/mapred/history/done</value>
  <description>Store history and conf files for completed jobs in HDFS.
  </description>
</property>

Once your Job Tracker is running with this setting in place, you can load data into hRaven with a series of map reduce jobs:

  1. JobFilePreprocessor - scans the HDFS job history archive location for newly completed jobs; writes the new filenames to a sequence file for processing in the next stage; records the sequence file name in a new row in the job_history_process table
  2. JobFileRawLoader - scans the processing table for new records from JobFilePreprocessor; reads the associated sequence files; writes the associated job history files for each sequence file entry into the HBase job_history_raw table
  3. JobFileProcessor - reads new records from the raw table; parses the stored job history contents into individual puts for the job_history, job_history_task, and related index tables

Each job has an associated shell script under the bin/ directory. See these scripts for more details on the job parameters.

REST API

Once data has been loaded into hRaven tables, a REST API provides access to job data for common query patterns. hRaven ships with a simple REST server, which can be started or stopped with the command:

./bin/hraven-daemon.sh (start|stop) rest

The following endpoints are currently supported:

Get Job

Path: /job/<cluster>[/jobId]
Returns: single job
Optional QS Params: n/a

Get Flow By JobId

Path: /jobFlow/<cluster>[/jobId]
Returns: the flow for the jobId
Optional QS Params - v1:

  • limit (default=1)

Get Flows

Path: /flow/<cluster>/<user>/<appId>[/version]
Returns: list of flows
Optional QS Params - v1:

  • limit (default=1) - max number of flows to return
  • includeConf - filter configuration property keys to return only the given names
  • includeConfRegex - filter configuration property keys to return only those matching the given regex patterns
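
As an example of querying the REST API over HTTP, the sketch below fetches flows using plain java.net.HttpURLConnection. The host, port, and the example cluster/user/appId values are placeholders; adjust them for your deployment of the hRaven REST server.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal sketch of calling the "Get Flows" endpoint; host, port, and the
// cluster/user/appId path segments below are placeholders.
public class FlowQueryExample {
  public static void main(String[] args) throws Exception {
    URL url = new URL(
        "http://localhost:8080/flow/mycluster/someuser/myapp?limit=5&includeConf=mapreduce.job.queuename");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    conn.setRequestProperty("Accept", "application/json");
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
      StringBuilder body = new StringBuilder();
      String line;
      while ((line = reader.readLine()) != null) {
        body.append(line);
      }
      // The response is a JSON list of flows; print it as-is here.
      System.out.println(body);
    } finally {
      conn.disconnect();
    }
  }
}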

Get Flow Timeseries

Path: /flowStats/<cluster>/<user>/<app>
Returns: list of flows with only minimal stats
Optional QS params:

  • version (optional filter)
  • startRow (base64 encoded row key)
  • startTime (ms since epoch) - restrict results to given time window
  • endTime (ms since epoch) - restrict results to given time window
  • limit (default=100) - max flows to return
  • includeJobs (boolean flag) - include per-job details

Note: This endpoint duplicates functionality from the "/flow/" endpoint and may be combined back into it in the future.

Get Tasks

Path: /tasks/<cluster>/[jobId]

Returns: task details of that single job

Get App Versions

Path: /appVersion/<cluster>/<user>/<app>
Returns: list of distinct app versions
Optional QS params:

  • limit - max results to return

Get New Jobs

Path: /newJobs/<cluster>/

Returns: list of apps with only minimal stats

Optional params:

  • startTime (epoch timestamp in milliseconds)
  • endTime (epoch timestamp in milliseconds)
  • limit (max rows to return)
  • user (user name to filter it by)

Project Resources

Bug tracker

Have a bug? Please create an issue here on GitHub https://github.com/twitter/hraven/issues

Mailing list

Have a question? Ask on our mailing list!

hRaven Users:

[email protected]

hRaven Developers:

[email protected]

Contributing to hRaven

For more details on how to contribute to hRaven, see CONTRIBUTING.md.

Known Issues

  1. While hRaven stores the full data available from job history logs, the rolled-up statistics in the Flow class only represent data from successful task attempts. We plan to extend this so that the Flow class also reflects resources used by failed and killed task attempts.

Copyright and License

Copyright 2016 Twitter, Inc. and other contributors

Licensed under the Apache License Version 2.0: http://www.apache.org/licenses/LICENSE-2.0


hraven's Issues

make hRaven hadoop 2.0 compatible

hRaven needs to be enhanced to process the job history files generated by hadoop 2.0.

The job history file format has changed from 1.0. It is now an Avro JSON format.

There is a job history parser api which can be used to process the jobhistory files.

Ensure REST api can process 2.0 related hdfs stats

hRaven REST api presently supports fetching hdfs stats from existing data which is for 1.0 hadoop clusters.

The collection for 2.0 hadoop clusters may be slightly different. The REST api may need enhancements to process this new data (in the context of namespaces etc).
Also, the cost needs to be a double instead of a long.

Aggregate job counters in the Flow instance

For the /flowStats endpoint, it would be useful to see combined job counters for the complete flow. We should provide an aggregated version of the counters in the Flow instance.

Ensure queue name /pool name does not get stored as "default" string

In hadoop2 job confs, the mapreduce.job.queuename property has the value of "default" as a String.

Although hadoop correctly interprets this and runs the job in that user's pool, it can be misinterpreted when viewed through hRaven.

Add a check so that we don't store "default" as a string; instead we can store the user name, as sketched below.
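
A minimal sketch of that check; the helper below is illustrative, not existing hRaven code.

// Illustrative sketch: don't store the literal "default" queue name; fall back
// to the submitting user's name instead, as described in this issue.
public final class QueueNameSketch {

  public static String resolveQueueName(String configuredQueueName, String userName) {
    if (configuredQueueName == null || "default".equals(configuredQueueName)) {
      return userName;
    }
    return configuredQueueName;
  }
}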

Refactor code in TaskDetails

Some of the tasks in hadoop2 may or may not have certain fields (either intentional or due to bugs in hadoop2 itself). Need to refactor the code in TaskDetails to accommodate this.

For instance, hadoop2 has an AM_STARTED event in the history file but does not have a corresponding AM_FINISHED event. Due to this the AM task has only start time but no finish time in the history file.

Have queue/pool name returned at flow level

The flowStats api does not return the pool name/queue at the flow level. It would be a good idea to have that parameter.

Note: As unlikely as it may be, a job inside a flow in theory could have a different poolname (since it is a command line argument). Need to figure out what should be done in this case.

Correct the timestamp being stored in appVersion table

In JobFileTableMapper, when the version is added to the appVersion table, presently it stores the submit time of the job as the run id.

This needs to be corrected to store jobDesc.getRunId() instead of submitTime.

The appVersionService.addVersion() call should be invoked with values from the jobDesc object that it holds, not from the initialized value of submitTime.

Correct the Slot millis seen in hadoop2 counters

Map slot millis in Hadoop 2.0 are not calculated properly; they are approximately 4X off from the actual value.

Calculate the correct slot millis as: hadoop2ReportedMapSlotMillis * yarn.scheduler.minimum-allocation-mb / mapreduce.map.memory.mb

Similarly for reduce slot millis.
Also, there is https://issues.apache.org/jira/browse/MAPREDUCE-5463 where they have deprecated this altogether.
And there is a patch from Sandy https://issues.apache.org/jira/browse/MAPREDUCE-5464 that adds analogs for SLOTS_MILLIS that better fit the MR2 resource model.
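
A sketch of the correction formula above; the method and parameter names are illustrative and not taken from the hRaven code base.

// Illustrative sketch of the slot-millis correction described in this issue.
public final class SlotMillisCorrection {

  /**
   * Corrects the map slot millis reported by Hadoop 2 counters:
   * reported value * yarn.scheduler.minimum-allocation-mb / mapreduce.map.memory.mb.
   */
  public static long correctedMapSlotMillis(long reportedMapSlotMillis,
                                            long yarnMinAllocationMb,
                                            long mapMemoryMb) {
    return reportedMapSlotMillis * yarnMinAllocationMb / mapMemoryMb;
  }

  /** Same correction applied to the reported reduce slot millis. */
  public static long correctedReduceSlotMillis(long reportedReduceSlotMillis,
                                               long yarnMinAllocationMb,
                                               long reduceMemoryMb) {
    return reportedReduceSlotMillis * yarnMinAllocationMb / reduceMemoryMb;
  }
}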

Set a hadoop version in the hbase record for hadoop2 jobs

When processing a 2.0 job history/conf file, add another put for a column that indicates the hadoop version. It can be null for hadoop1, which would be the default.

It will be good to have the major and minor version (if that is available).

Deal with changed counter sub group names in hadoop2

There have been some changes in counter sub-group names from hadoop1 to hadoop2. Until now, hRaven did not peek inside counter groups to look at their subgroup names. But now we need to.

We envision a long term and short term fix for this.

The long term goal is to come up with a naming scheme for counter sub group names, since there is a lot of repetition of org.apache.hadoop.mapred.XX.YY for every counter. So we could map org.apache.hadoop.mapred. to something like o.a.h.m. But that is a bigger fix: we first need an audit of all existing counter sub group names, come up with a map, change every place where counters are written, queried for, and returned, and then rewrite the existing data in hbase as well.

So the short term fix is to add a check in the REST response where we return counters, specifically in the JobDetails populate function: determine whether the job is hadoop1 or hadoop2 and look for the corresponding counter sub group names, for example via a lookup like the one sketched after the mappings below.

i:gm!FileSystemCounters!FILE_BYTES_READ becomes i:gm!org.apache.hadoop.mapreduce.FileSystemCounter!FILE_BYTES_READ

i:gm!FileSystemCounters!FILE_BYTES_WRITTEN becomes i:gm!org.apache.hadoop.mapreduce.FileSystemCounter!FILE_BYTES_WRITTEN

i:gr!FileSystemCounters!FILE_BYTES_READ - similar change in package name

i:gr!FileSystemCounters!FILE_BYTES_WRITTEN similar change in package name

i:g!FileSystemCounters!HDFS_BYTES_READ similar change in package name

i:g!FileSystemCounters!HDFS_BYTES_WRITTEN similar change in package name

i:g!org.apache.hadoop.mapred.JobInProgress$Counter!SLOTS_MILLIS_MAPS becomes i:g!org.apache.hadoop.mapreduce.JobCounter!SLOT_MILLIS_MAPS

i:g!org.apache.hadoop.mapred.JobInProgress$Counter!SLOTS_MILLIS_REDUCES becomes i:g!org.apache.hadoop.mapreduce.JobCounter!SLOTS_MILLIS_REDUCES

i:g!org.apache.hadoop.mapred.Task$Counter!REDUCE_SHUFFLE_BYTES becomes i:g!org.apache.hadoop.mapreduce.TaskCounter!REDUCE_SHUFFLE_BYTES
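
A minimal sketch of such a lookup, limited to the counter groups listed above (illustrative only, not hRaven's code).

import java.util.HashMap;
import java.util.Map;

// Sketch of a hadoop1 -> hadoop2 counter group name lookup, covering only the
// groups mentioned in this issue.
public final class CounterGroupNames {

  private static final Map<String, String> HADOOP1_TO_HADOOP2 = new HashMap<>();
  static {
    HADOOP1_TO_HADOOP2.put("FileSystemCounters",
        "org.apache.hadoop.mapreduce.FileSystemCounter");
    HADOOP1_TO_HADOOP2.put("org.apache.hadoop.mapred.JobInProgress$Counter",
        "org.apache.hadoop.mapreduce.JobCounter");
    HADOOP1_TO_HADOOP2.put("org.apache.hadoop.mapred.Task$Counter",
        "org.apache.hadoop.mapreduce.TaskCounter");
  }

  /** Returns the hadoop2 group name for a hadoop1 group, or the input if unmapped. */
  public static String toHadoop2Group(String hadoop1Group) {
    return HADOOP1_TO_HADOOP2.getOrDefault(hadoop1Group, hadoop1Group);
  }
}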

Extend hRaven to include hdfs usage

Presently hRaven includes job level statistics. It collects run time data and statistics from map reduce jobs running on Hadoop clusters and stores the collected job history in an easily queryable format.

It will be good to extend hRaven capabilities and add in hdfs usage statistics. This involves two broad aspects - collection and query apis (rest endpoints).

Consider not attempting to load huge history files into raw table

Some of the 2.0 job history files can be huge. HBase can't load such big files and the RawLoader step fails with java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space.

Filing this issue for a short term fix: consider not loading such huge history files.

JobDescFactoryBase should check for mapreduce.job.name as well

For hadoop1 jobs, the mapred.job.name config param can be used to interpret the job name for hRaven.

This code needs to be enhanced for hadoop2 jobs since the config parameter is now called mapreduce.job.name.

Also for scalding jobs, it may be more useful to look for cascading.app.name.

Calculate cost of job as post processing step

We can determine how expensive a job was based on the megabyteMillis it consumed if we know the total cost of operating a machine and its memory.

Filing this issue to track enabling calculation of the cost of a job within hRaven, based on the job's megabyteMillis, the TCO of the node the job ran on, and the max memory of that node. A possible shape of the calculation is sketched below.
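
This sketch assumes the cost is the job's share of the node's memory over time multiplied by the node's per-millisecond TCO; the formula and names are illustrative, not hRaven's actual calculation.

// Illustrative sketch only: cost = (fraction of the node's memory-milliseconds
// the job consumed) * (the node's cost per millisecond).
public final class JobCostSketch {

  /**
   * @param megaByteMillis   memory-time consumed by the job (MB * ms)
   * @param machineMemoryMb  max memory of the node the job ran on, in MB
   * @param machineTcoPerDay total cost of operating that node for one day
   */
  public static double estimateJobCost(long megaByteMillis,
                                       long machineMemoryMb,
                                       double machineTcoPerDay) {
    double millisPerDay = 24.0 * 60 * 60 * 1000;
    double tcoPerMilli = machineTcoPerDay / millisPerDay;
    // Machine-milliseconds the job effectively occupied, times the machine cost rate.
    return ((double) megaByteMillis / machineMemoryMb) * tcoPerMilli;
  }
}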

Add start and end time to flow API

The flow endpoint currently supports start and end times in the endpoint with version: /flow/<cluster>/<user>/<appId>/<version>

but not in the endpoint without version: /flow/<cluster>/<user>/<appId>

Need to extend the flow endpoint to include start and end times.

Refactor existing job history parsing into factory model

Job History file formats have been modified in
https://issues.apache.org/jira/browse/MAPREDUCE-1016

Hadoop 2.0 also has a different file format.

In order for hRaven to process these different file formats, we need to adopt a factory based model for history file processing.

This bug is to abstract the existing file processing into a factory based model. As we add support for more file formats, these classes will be enhanced/modified as needed.

Allow hbase table prefix to be configurable at run-time

In our HBase environment, we namespace tables in order to group them and more easily identify which project they're associated with.

Currently, the table names are hardcoded in ./hraven-core/src/main/java/com/twitter/hraven/Constants.java unless you set IS_DEV (which sets PREFIX to "dev.") or you just override what PREFIX is set to. Both options require rebuilding hraven.

It would be great if this were a run-time option that could be set without having to rebuild.
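
A minimal sketch of what a run-time option could look like, using a Hadoop Configuration lookup that falls back to no prefix. The property name "hraven.hbase.table.prefix" is hypothetical; today the prefix is a compile-time constant in Constants.java.

import org.apache.hadoop.conf.Configuration;

// Sketch of a run-time table prefix; the property name below is hypothetical.
public final class TablePrefixSketch {

  public static String prefixedTableName(Configuration conf, String baseName) {
    String prefix = conf.get("hraven.hbase.table.prefix", "");
    return prefix.isEmpty() ? baseName : prefix + baseName;
  }
}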

Add skip count to REST calls

Current REST calls support LIMIT to limit the number of records fetched. Adding support for a SKIP count would make pagination easy on the server side when processing and returning results.

fix getDuration() in Flow

The getDuration function looks at the start time of the first job in the flow and the end time of the last job in the flow.

This needs to be fixed to look for the smallest start time and the biggest end time across all jobs in the flow, as sketched below.
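
A minimal sketch of that behavior; the Job interface here is a placeholder, not hRaven's JobDetails class.

import java.util.Collection;

// Sketch of the intended fix: scan all jobs in the flow for the minimum start
// time and the maximum end time instead of relying on first/last job order.
public final class FlowDurationSketch {

  interface Job {
    long getStartTime();  // ms since epoch
    long getEndTime();    // ms since epoch
  }

  public static long getDuration(Collection<? extends Job> jobs) {
    if (jobs == null || jobs.isEmpty()) {
      return 0L;
    }
    long minStart = Long.MAX_VALUE;
    long maxEnd = Long.MIN_VALUE;
    for (Job job : jobs) {
      minStart = Math.min(minStart, job.getStartTime());
      maxEnd = Math.max(maxEnd, job.getEndTime());
    }
    return maxEnd - minStart;
  }
}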

Ensure hadoop 1.0 job history files can be processed on 2.0 cluster

The work that has gone into refactoring the code and enhancing hraven has been towards ensuring hraven can run on any hadoop version and process any version of job history files.

The earlier related pull request (#18) ensured that we can process 2.0 job history files on hadoop 1 as well as hadoop 2 clusters

We still need to ascertain that 1.0 job history files can be successfully processed on a hadoop 2.0 cluster. Opening jira to track those changes, if any.

Update MRJobDescFactory/getSubmitTimeMillisFromJobHistory to check for hadoop2 submit time

A couple of places need to be updated in the code for setting runId for Map Reduce jobs.

For a Map Reduce job, the run ID is set based on the submit time in the config. Currently the submit time conf param that's being checked for is mapred.app.submitted.timestamp. But this does not seem to exist in hadoop2. Will find out the corresponding hadoop2 config param. Also, the function getSubmitTimeMillisFromJobHistory(byte[] jobHistoryRaw) in JobHistoryRawService is outdated. It should be updated to look for the new offset in the 2.0 history file. It should probably be refactored into the hraven-etl module.

Certain map reduce jobs have the param mapred.app.submitted.timestamp set whereas some don't. Since the submit time is obtained well before the entire job history file is parsed, some byte seeking or something else needs to be done in getSubmitTimeMillisFromJobHistory(byte[] jobHistoryRaw). So this is not as trivial as simply updating the code for a new job conf param.

Add a newJobs REST API

Add a rest api that will point out the new jobs that were launched on a cluster/pool in a given time range

The output should have cluster, pool, user, app name, quota for the pool, utilization of the app.

This will help in determining on a given day how many new jobs were seen and how much of the pool/cluster capacity these jobs took up.

Ensure megabyte millis is stored in hbase at post processing step

megabyte millis is to be calculated and stored in hbase for that job key in the post processing step.

For hadoop2, megabyte millis is calculated as:

if not uberized:
map slot millis * mapreduce.map.memory.mb
+ reduce slot millis * mapreduce.reduce.memory.mb
+ yarn.app.mapreduce.am.resource.mb * job run time

if uberized:
yarn.app.mapreduce.am.resource.mb * job run time

For hadoop1:
total estimated memory (Xmx as 75%, see below) * map slot millis
+ total estimated memory (Xmx as 75%, see below) * reduce slot millis

For hadoop1 jobs, we can consider the -Xmx value to be 75% of the memory used by that task. For example, if Xmx is set to 3G, we can consider 4G to be the task's memory usage so that we account for native memory (a 25% presumption). This way we don't depend on cluster-specific memory and the max map and reduce tasks on that cluster.
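
The same calculation expressed as a sketch; names are illustrative and not taken from the hRaven code base, and the 75% Xmx assumption is the one stated in this issue.

// Illustrative sketch of the megabyte-millis calculation described above.
public final class MegaByteMillisSketch {

  /** Hadoop 2: map/reduce slot millis weighted by container memory, plus the AM container. */
  public static long hadoop2MegaByteMillis(boolean uberized,
                                           long mapSlotMillis, long mapMemoryMb,
                                           long reduceSlotMillis, long reduceMemoryMb,
                                           long amResourceMb, long jobRunTimeMillis) {
    if (uberized) {
      // Uberized jobs run entirely inside the AM container.
      return amResourceMb * jobRunTimeMillis;
    }
    return mapSlotMillis * mapMemoryMb
        + reduceSlotMillis * reduceMemoryMb
        + amResourceMb * jobRunTimeMillis;
  }

  /**
   * Hadoop 1: treat Xmx as 75% of the task's total memory (to account for native
   * memory) and weight the map and reduce slot millis by that estimate.
   */
  public static long hadoop1MegaByteMillis(long xmxMb,
                                           long mapSlotMillis,
                                           long reduceSlotMillis) {
    long estimatedTaskMemoryMb = Math.round(xmxMb / 0.75);
    return estimatedTaskMemoryMb * (mapSlotMillis + reduceSlotMillis);
  }
}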

REST client throws error when run against older jackson libraries in client's classpath

When the REST client is invoked for the flow endpoint with older jackson libraries (specifically jackson-mapper-asl-1.5.2.jar and jackson-core-asl-1.5.2.jar) appearing in the classpath before the 1.9.6 versions of these libraries (as declared in the pom), the error noted below is thrown.

java.lang.UnsupportedOperationException: Should never call 'set' on setterless property
at org.codehaus.jackson.map.deser.SettableBeanProperty$SetterlessProperty.set(SettableBeanProperty.java:294)
at org.codehaus.jackson.map.deser.PropertyValue$Regular.assign(PropertyValue.java:57)
at org.codehaus.jackson.map.deser.Creator$PropertyBased.build(Creator.java:253)
at org.codehaus.jackson.map.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:507)
at org.codehaus.jackson.map.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:367)
at org.codehaus.jackson.map.deser.BeanDeserializer.deserialize(BeanDeserializer.java:303)
at org.codehaus.jackson.map.deser.CollectionDeserializer.deserialize(CollectionDeserializer.java:107)
at org.codehaus.jackson.map.deser.CollectionDeserializer.deserialize(CollectionDeserializer.java:84)
at org.codehaus.jackson.map.deser.CollectionDeserializer.deserialize(CollectionDeserializer.java:24)
at org.codehaus.jackson.map.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:135)
at org.codehaus.jackson.map.deser.SettableBeanProperty$MethodProperty.deserializeAndSet(SettableBeanProperty.java:221)
at org.codehaus.jackson.map.deser.BeanDeserializer.deserialize(BeanDeserializer.java:323)
at org.codehaus.jackson.map.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:483)
at org.codehaus.jackson.map.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:367)
at org.codehaus.jackson.map.deser.BeanDeserializer.deserialize(BeanDeserializer.java:286)
at org.codehaus.jackson.map.deser.CollectionDeserializer.deserialize(CollectionDeserializer.java:107)
at org.codehaus.jackson.map.deser.CollectionDeserializer.deserialize(CollectionDeserializer.java:84)
at org.codehaus.jackson.map.deser.CollectionDeserializer.deserialize(CollectionDeserializer.java:24)
at org.codehaus.jackson.map.ObjectMapper._readMapAndClose(ObjectMapper.java:1588)
at org.codehaus.jackson.map.ObjectMapper.readValue(ObjectMapper.java:1165)
at com.twitter.hraven.util.JSONUtil.readJson(JSONUtil.java:63)
at com.twitter.hraven.rest.client.HRavenRestClient.retrieveFlowsFromURL(HRavenRestClient.java:162)
at com.twitter.hraven.rest.client.HRavenRestClient.fetchFlowsWithConfig(HRavenRestClient.java:112)

Add more logging for REST endpoints

Add more logging for the REST endpoints to help with debugging and with tracking down which queries ran with what parameters, how much time they took, how many jobs/flows they fetched, and how many hbase rows and columns they looked at.

Processing job is pending on CDH 4.4

Here is part of log from TT and JT:

JT:
Job job_201311211731_0002 initialized successfully with 1 map tasks and 0 reduce tasks.

Adding task (JOB_SETUP) 'attempt_201311211731_0002_m_000002_0' to tip task_201311211731_0002_m_000002, for tracker 'tracker_test-mr.lol.ru:localhost/127.0.0.1:59964'

TT:
JVM Runner jvm_201311211731_0002_m_1483851591 spawned.

Writing commands to /mapred/local/ttprivate/taskTracker/devops/jobcache/job_201311211731_0002/attempt_201311211731_0002_m_000002_0/taskjvm.sh

JVM with ID: jvm_201311211731_0002_m_1483851591 given task: attempt_201311211731_0002_m_000002_0

HRaven:
13/11/21 17:37:24 INFO etl.ProcessRecordService: Returning 2 process records

13/11/21 17:37:24 INFO etl.JobFileRawLoader: ProcessRecords for mycluster: 2

13/11/21 17:37:24 INFO etl.JobFileRawLoader: Processing ProcessRecord(ProcessRecordKey[cluster=mycluster, timestamp=1385033014695]) 20131121110405-20131121112334 PREPROCESSED: 100 job files in /hadoop/mapred/history/processing/hraven-mycluster-20131121125615-0 minJobId: job_201311211502_0001 maxJobId: job_201311211502_0050

13/11/21 17:37:24 INFO etl.JobFileRawLoader: Processing using myHBaseConf:Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, hdfs-default.xml, hdfs-site.xml, hbase-default.xml, hbase-site.xml

13/11/21 17:37:24 INFO etl.JobFileRawLoader: Processing using getProcessFile:/hadoop/mapred/history/processing/hraven-mycluster-20131121125615-0

Job output:
13/11/21 17:37:25 INFO mapreduce.TableOutputFormat: Created table instance for job_history_raw
13/11/21 17:37:25 INFO input.FileInputFormat: Total input paths to process : 1
13/11/21 17:37:25 INFO mapred.JobClient: Running job: job_201311211731_0002
13/11/21 17:37:26 INFO mapred.JobClient: map 0% reduce 0%
13/11/21 17:47:27 INFO mapred.JobClient: Task Id : attempt_201311211731_0002_m_000002_0, Status : FAILED
Task attempt_201311211731_0002_m_000002_0 failed to report status for 600 seconds. Killing!

What are we doing wrong?

Return (wall clock time) elapsedTime for flows

Presently, the hRaven rest api for flows or flowStats will return the duration of the flow, which is the finish time of the last job in the flow minus the launch time of the first job.

It will be good to have the elapsedTime or wall clock time of the flow, which would be the finish time of the last job in the flow minus the submit time of the first job.

Sometimes jobs wait a long time before they get launched, and this will help us determine the delay between submit and launch.

Ensure MR job status is set for successful hadoop2 jobs

hRaven looks at the jobStatus field in hadoop1 and hadoop2 history files. In hadoop2 (at least the most recent ones that I am looking at), the jobStatus field seems to occur only in JOB_INITED, JOB_KILLED and JOB_FAILED events. It is missing from the JOB_FINISHED event. The jobStatus field should be a part of the JOB_FINISHED event in the history file when it's generated.

Since it's not, on the hRaven side we need to ensure we store the job status while processing the file and insert a SUCCEEDED state for jobs with a JOB_FINISHED event. Otherwise the jobStatus field from JOB_INITED is considered by hRaven processing to be the terminal status of the job.
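
A minimal sketch of that guard; the string-based event handling below is illustrative, not hRaven's actual history parser.

// Sketch: treat a JOB_FINISHED event that carries no jobStatus as terminal success.
public final class JobStatusSketch {

  public static String resolveStatus(String eventType, String jobStatusField,
                                     String previouslyStoredStatus) {
    if ("JOB_FINISHED".equals(eventType)
        && (jobStatusField == null || jobStatusField.isEmpty())) {
      return "SUCCEEDED";
    }
    if (jobStatusField != null && !jobStatusField.isEmpty()) {
      return jobStatusField;
    }
    return previouslyStoredStatus;
  }
}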

redundant maven-jar-plugin definition in hraven-etl

The pom file for hraven-etl contains two definitions of the maven-jar-plugin. The build says:

[WARNING]
[WARNING] Some problems were encountered while building the effective model for com.twitter.hraven:hraven-etl:jar:0.9.4-SNAPSHOT
[WARNING] 'build.pluginManagement.plugins.plugin.(groupId:artifactId)' must be unique but found duplicate declaration of plugin org.apache.maven.plugins:maven-jar-plugin @ line 101, column 17
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]

REST api for getting timeseries of hdfs usage for a particular attribute

Path: /hdfs/path/{cluster}/{attribute}

Required QS params - v1:

  • path (the path for which the timeseries is to be fetched)

Optional QS Params - v1:

  • startTime (default=now)
  • endTime (default=1 week ago)

Returns: json response containing a list of data points per hour timestamp for the given input time range, providing the attribute queried for. (The attribute could be any column in hbase, such as storageCost, accessCost, hdfsCost, fileCount, spaceConsumed, accessCountTotal, trashFileCount, trashSpaceConsumed, tmpFileCount, tmpSpaceConsumed, or dirCount, for that time range.)

Ensure the scan for flowseries endpoint with version is time bound

The flow series endpoint presently looks for flows belonging to a cluster/user/app/version. It is limited by the number of flows to retrieve. In some cases a version occurs very few times, like once or twice, but the scan will look through all runs of the app to see where that version occurs.

We would like to add a time bound to this scan. The default can be to look back 30 days, configurable by start and end times.

jobFileProcessor.sh complains about missing arguments.

Running from origin/master.

I have things patched up enough to get the jobFilePreprocessor.sh and jobFileLoader.sh connecting to our Hadoop environment. The last step in hraven-etl.sh invokes jobFileProcessor.sh, but this throws errors about missing arguments.

I poked around in the code and it's not really clear what these should be. machinetype appears like it should be set to "default" if not explicitly set, but the arg processor makes this argument required. Additionally, I can't find a great deal of discussion on what's supposed to be in the cost properties.

ERROR: Missing required options: z, m

usage: JobFileProcessor  [-b <batch-size>] -c <cluster> [-d] -m
       <machinetype> [-p <processFileSubstring>] [-r] [-t <thread-count>]
       -z <costfile>
 -b,--batchSize <batch-size>                        The number of files to
                                                    process in one batch.
                                                    Default 100
 -c,--cluster <cluster>                             cluster for which jobs
                                                    are processed
 -d,--debug                                         switch on DEBUG log
                                                    level
 -m,--machineType <machinetype>                     The type of machine
                                                    this job ran on
  -p,--processFileSubstring <processFileSubstring>   use only those process
                                                     records where the
                                                     process file path
                                                     contains the provided
                                                     string. Useful when
                                                     processing production
                                                     jobs in parallel to
                                                     historic loads.
  -r,--reprocess                                     Reprocess only those
                                                     records that have been
                                                     marked to be
                                                     reprocessed. Otherwise
                                                     process all rows
                                                     indicated in the
                                                     processing records,
                                                     but successfully
                                                     processed job files
                                                     are skipped.
  -t,--threads <thread-count>                        Number of parallel
                                                     threads to use to run
                                                     Hadoop jobs
                                                     simultaniously.
                                                     Default = 1
  -z,--costFile <costfile>                           The cost properties
                                                     file on local disk

Flow, FlowStats end points should include megabytemillis, hadoop version

As part of #30 we added megabytemillis calculations and loading into hbase. We need to ensure megabytemillis is returned at the flow level in the REST apis.

As part of #27, the hadoop version was also added. It will be good to return the hadoop version at the flow level as well, in case we want to compare flows across two different versions.

Shorten counter sub group names - long term fix

As noted in #34, we need a long term fix for dealing with counter subgroup names. Presently hRaven stores the subgroup name for each counter in the column name. The idea here is to come up with a naming scheme for counter sub group names, since there is a lot of repetition of org.apache.hadoop.mapred.XX.YY for every counter. So we could map org.apache.hadoop.mapred. to something like o.a.h.m. But that is a bigger fix: we first need an audit of all existing counter sub group names, come up with a map, change every place where counters are written, queried for, and returned, and then rewrite the existing data in hbase as well.

REST api for hdfs usage

As part of #55, we need a rest api that returns hdfs stats for a cluster for a given hour.

The endpoint can be:
Path: /hdfs//

Optional QS Params - v1:

  • limit (default=250)
  • path (can be a full path or a partial path; prefix matching will be done on the path)
  • runId (default=2 hours ago)

Returns: json response containing a list of paths and cost, file counts, space consumed, access counts, trash file counts, trash space consumed

Add an hRaven specific configuration parameter for pool/queue

In hadoop1, the pool name property comes from the job conf as mapred.fairscheduler.pool

In hadoop2, we have mapreduce.job.queuename

It will be good to have a generic name for pools/queues across hadoop1 and hadoop2. Currently hRaven processing does not look at configuration file keys and values, but now we would need to maintain some kind of enum/map so that our generic names are stored as well.

It could potentially be called "hraven.poolname".
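
A minimal sketch of resolving such a generic pool name from a job configuration; the helper is illustrative, and the property names are the hadoop1/hadoop2 keys mentioned above.

import org.apache.hadoop.conf.Configuration;

// Sketch of resolving a single generic pool/queue name from either the hadoop1
// fair scheduler property or the hadoop2 queue name property.
public final class PoolNameSketch {

  public static String resolvePoolName(Configuration jobConf) {
    // hadoop1 fair scheduler pool, if present
    String pool = jobConf.get("mapred.fairscheduler.pool");
    if (pool == null || pool.isEmpty()) {
      // hadoop2 queue name
      pool = jobConf.get("mapreduce.job.queuename");
    }
    return pool;
  }
}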
