
baimingze / project1

Python 65.63% NSIS 11.16% Shell 3.81% R 14.56% Batchfile 0.11% XSLT 2.99% CSS 1.74%

project1's Introduction

project1

project1's Issues

input/output format for mapper/reducer

Hadoop Streaming uses stdin and stdout to communicate between the mapper and the reducer.
By default, the input/output format is a key-value pair, and Hadoop treats each line as one key-value pair.
So when I try to output a multi-line text (a slice of the spectrum) as a key, the reducer receives it split into separate single lines.
Here are two possible solutions:

1. Replace '\n' with a separator such as | in the mapper, and restore it in the reducer step.
2. Hadoop Streaming provides the parameters -inputformat JavaClassName and -outputformat JavaClassName, but they require writing a custom Java class.
What do you think?
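
A minimal sketch of the first option, assuming the separator | and the tab character never occur inside the spectrum text (the helper names encode_record and decode_record are made up for illustration):

```python
SEP = "|"  # assumed separator; it must never occur in the spectrum text itself


def encode_record(key, value):
    """Mapper side: flatten a multi-line key into one 'key<TAB>value' line."""
    if SEP in key or "\t" in key:
        raise ValueError("key contains the separator or a tab; pick another scheme")
    return key.replace("\n", SEP) + "\t" + value


def decode_record(line):
    """Reducer side: split the line at the first tab and restore the newlines."""
    flat_key, _, value = line.rstrip("\n").partition("\t")
    return flat_key.replace(SEP, "\n"), value
```

The mapper would print encode_record(...) for each spectrum slice, and the reducer would call decode_record on each line read from stdin. If the spectrum text can contain | or tabs, a different separator or an escaping scheme would be needed.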

Questions about spectrum file

xtandem supports DTA, PKL, or MGF files, and each file contains more than one spectrum.
I think the information needs to stay grouped by spectrum when a file is divided, but I'm not familiar with spectrum files.
Does one file contain the results for several peptides, meaning that each spectrum is the result of one peptide?
I wonder whether xtandem calculates the value from the mass-m/z pairs of the fragments.
Do the peptide mass and charge matter?
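
For the MGF case, each spectrum sits between a BEGIN IONS line and an END IONS line, with header lines such as PEPMASS= and CHARGE= followed by peak lines of m/z and intensity. A rough sketch of dividing a file along those boundaries, so that no spectrum is cut in half (split_mgf is a made-up helper, and it does not say anything about how xtandem scores the peaks):

```python
def split_mgf(text):
    """Yield one dict per BEGIN IONS ... END IONS block of an MGF text."""
    spectrum = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is not None and line:
            if "=" in line and not line[0].isdigit():
                # Header line such as TITLE=..., PEPMASS=..., CHARGE=...
                name, _, value = line.partition("=")
                spectrum["params"][name] = value
            else:
                # Peak line: m/z, intensity, and possibly more columns
                fields = line.split()
                if len(fields) >= 2:
                    spectrum["peaks"].append((float(fields[0]), float(fields[1])))


SAMPLE = """BEGIN IONS
TITLE=spectrum 1
PEPMASS=500.25
CHARGE=2+
100.1 10
200.2 20
END IONS
BEGIN IONS
TITLE=spectrum 2
PEPMASS=612.3
CHARGE=3+
150.5 7
END IONS
"""

spectra = list(split_mgf(SAMPLE))
```

The sample values are invented. DTA and PKL files are laid out differently and would need their own splitting rule.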

new architecture of the Python-MapReduce-Xtandem

Main idea

Why MapReduce, not Spark? We follow mr-xtandem and implement the MapReduce version first.
Top-down (by Sangzhe): follow the same MapReduce design as mr-xtandem and implement the multiple map/reduce tasks, using a fake process method and some fake results for now.
Bottom-up (by Mingze): find a way to wrap the original C++ class files as a library that can be called from Cython.
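
To make the top-down half concrete, here is a local sketch of one map/reduce pair with a fake process method. It only shows the shape of the chain; it does not reproduce the actual steps of mr-xtandem, and fake_score, map_step and reduce_step are invented names:

```python
from itertools import groupby


def fake_score(spectrum_id, peptide):
    """Placeholder for the real X!Tandem scoring; returns a made-up number."""
    return len(spectrum_id) + len(peptide)


def map_step(spectrum_ids, peptides):
    """Emit (spectrum_id, (peptide, score)) for every candidate pair."""
    for sid in spectrum_ids:
        for pep in peptides:
            yield sid, (pep, fake_score(sid, pep))


def reduce_step(pairs):
    """Group by spectrum id, as the shuffle would, and keep the best-scoring peptide."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    for sid, group in groupby(ordered, key=lambda kv: kv[0]):
        yield sid, max((value for _, value in group), key=lambda v: v[1])


results = list(reduce_step(map_step(["s1", "s2"], ["PEP", "PEPTIDE"])))
```

Once the Cython wrapper from the bottom-up half exists, fake_score would be the single place to swap in the real call.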

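HDFS directory

general_config.json requires hadoop_dr to be configured. Which path should this be?
The run currently fails when copying files to HDFS, and I suspect this directory path is set incorrectly.

The attached image is the Hadoop log.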
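
One way to rule the path in or out is to try the copy by hand with the hdfs command-line tool. The sketch below assumes hadoop_dr is meant to be the HDFS directory the job copies its files into, which the thread has not confirmed; the path /user/example/mrtandem and the helper names are placeholders:

```python
import subprocess


def hdfs_copy_commands(hadoop_dir, local_file):
    """Build the hdfs commands that create a directory, upload a file, and list it."""
    target = hadoop_dir.rstrip("/") + "/"
    return [
        ["hdfs", "dfs", "-mkdir", "-p", hadoop_dir],
        ["hdfs", "dfs", "-put", "-f", local_file, target],
        ["hdfs", "dfs", "-ls", hadoop_dir],
    ]


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Placeholder path and file name; replace with the value configured as hadoop_dr.
commands = hdfs_copy_commands("/user/example/mrtandem", "gpm_input.xml")
```

Calling run_all(commands) on a machine with a working Hadoop client shows whether the directory is writable for the current user. If the manual copy works but the script still fails, the problem is more likely in how the script builds the path.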

Input and output

See line 481 of mt-tandem.py (line 482 after modification).
Suppose our computation logic also executes the following code:

          if ( mrh.runOldSkool() ) : 
             # for debug and performance comparison, just runs regular tandem on a single node in the reducer

             # run tandem - "reducer99" is its cue to fall back into traditional single-node multi-thread behavior
             workStepOldSkool = boto.emr.StreamingStep( name = '%s-OldSkool-final' % baseName, # "-final" is cue to grab results below
                                        mapper = '%s -mapreduceinstalltest' % xtandemCmd,
                                        reducer = '%s -reducer99_%d %s %s -reportURL %s' % (xtandemCmd,nParamFiles,outputName, mainXtandemParametersName, finalReportURL),
                                        cache_files = cachefiles,
                                        input = '%s/%s' %  (baseURL, mapper1InputFile),
                                        output = '%s/%s' %  (baseURL, resultsDir),
                                        step_args = stepArgs)
             worksteps.extend([workStepOldSkool])

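Where do the mapper, reducer, input, and output here come from, and what do they mean?
How does the information in the original FASTA file and the spectrum files make its way into this input?
The above only assumes that this code path is executed. For the real computation logic, we can trace the whole flow in Python's debug mode and see how it proceeds.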
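
As a first step before debugging, it may help to see what the four format strings expand to. The sketch below copies the format strings from the excerpt above and fills them with invented values; the real values are whatever mt-tandem.py computes earlier, which is exactly what the debugger session should reveal:

```python
# Invented example values, used only to show how the format strings expand.
xtandemCmd = "tandem"
nParamFiles = 1
outputName = "output"
mainXtandemParametersName = "gpm_input.xml"
finalReportURL = "hdfs:///example/report"
baseURL = "hdfs:///example/job"
mapper1InputFile = "mapper1_input"
resultsDir = "results"


def old_skool_step_fields():
    """Expand the same format strings that the StreamingStep call uses."""
    return {
        "mapper": "%s -mapreduceinstalltest" % xtandemCmd,
        "reducer": "%s -reducer99_%d %s %s -reportURL %s"
        % (xtandemCmd, nParamFiles, outputName, mainXtandemParametersName, finalReportURL),
        "input": "%s/%s" % (baseURL, mapper1InputFile),
        "output": "%s/%s" % (baseURL, resultsDir),
    }


fields = old_skool_step_fields()
```

So mapper and reducer are command lines that Hadoop Streaming runs, while input and output are locations under baseURL. Setting a breakpoint just before the StreamingStep call and printing the real variables would show the actual values.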

how to view output file?

The output file generated by the tandem program should contain the concrete results of the run.
But I don't know how to open it and view its content.
Do you have any idea? @baimingze

PS: I searched online; some say the output file should be opened with tandem-style.css, but I don't know how.
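
Assuming the output is an XML file, which the mention of a stylesheet suggests, a quick first look is possible without any stylesheet by counting the element names in it. This sketch makes no assumption about the schema; summarize and summarize_file are made-up helpers:

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize(root):
    """Count element names under root, ignoring any XML namespace prefix."""
    return Counter(elem.tag.split("}")[-1] for elem in root.iter())


def summarize_file(path):
    """Parse an XML file from disk and count its element names."""
    return summarize(ET.parse(path).getroot())
```

Calling summarize_file on the output file and printing the result shows which elements the report contains and how many of each, which is a starting point for pulling out specific attributes.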

Error when test-running tandem

Error message

[sangzhe@localhost mrtandem_bin]$ ./tandem gpm_input.xml
./tandem: error while loading shared libraries: libboost_serialization.so.1.57.0: cannot open shared object file: No such file or directory
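
The message means the dynamic loader cannot find that Boost library at run time. A small sketch for listing every shared library the binary is missing, based on the "not found" lines that ldd prints (missing_libraries and check_binary are illustrative names):

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        name, arrow, target = line.strip().partition(" => ")
        if arrow and target.strip() == "not found":
            missing.append(name)
    return missing


def check_binary(path):
    """Run ldd on a binary and return its unresolved libraries."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return missing_libraries(result.stdout)
```

Running check_binary("./tandem") in mrtandem_bin should list libboost_serialization.so.1.57.0 and any other unresolved library. The usual remedies are to install a Boost package that provides that exact version, or to add the directory that holds the file to LD_LIBRARY_PATH.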

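Work plan

Overall plan

Get MR-tandem running successfully on a Hadoop cluster
Review MR-tandem's Python driver program, focusing on its logic flow and its input/output interfaces
Review MR-tandem's input/output interfaces
Using the information above, write a Spark-oriented xtandem program

Environment

The OS version is fixed as fedora22
Hadoop uses the version specified in the Hadoop-Version file
The mrtandam-ica-code directory contains the code that runs mrtandem (mainly Python)
The mrtandem_bin directory contains the tandem executable compiled successfully on fedora22
The mrtandem_fasta directory contains FASTA database files that mrtandem may use
The mrtandem_readme file describes how to run the single-machine version of mrtandem

Running mrtandem successfully

Install fedora22
Install the Python and Java development environments
Install Hadoop and test the bundled examples
Test the single-machine (non-Hadoop) xtandem program
Write the various input scripts for the Hadoop version of mrtandem
Test and debug the Hadoop version of mrtandem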
