m9994project1's Issues
input/output format for mapper/reducer
Hadoop Streaming uses stdin and stdout to communicate between the mapper and the reducer.
By default, the input/output format is a key-value pair, and Hadoop treats each line as one key-value pair.
So when I try to output a multi-line text (a slice of the spectrum) as a key, the reducer receives it split into separate single lines.
Here are 2 possible solutions:
- Replace '\n' with a separator such as | in the mapper, and restore the line breaks at the reducer step.
- Hadoop Streaming provides the parameters -inputformat JavaClassName and -outputformat JavaClassName, but they have to be customized by writing a Java class.
What do you think?
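A minimal sketch of the first option, to make the idea concrete. The separator value, the function names, and the choice to emit a short id as the key with the encoded spectrum text as the value are all assumptions for illustration, not part of the project code. It also assumes the separator never occurs in the spectrum text.

```python
import sys

# Hypothetical separator; assumed never to occur in the spectrum text itself.
SEP = "|"


def encode_record(text, sep=SEP):
    """Collapse a multi-line record onto one line so streaming sees one k-v pair."""
    if sep in text:
        raise ValueError("separator already occurs in the record")
    return text.replace("\n", sep)


def decode_record(line, sep=SEP):
    """Restore the original line breaks on the reducer side."""
    return line.rstrip("\n").replace(sep, "\n")


def mapper(records, out=sys.stdout):
    """Write each (key, multi-line text) pair as one tab-separated line."""
    for key, text in records:
        out.write("%s\t%s\n" % (key, encode_record(text)))


def reducer(lines):
    """Read tab-separated lines and yield (key, restored multi-line text)."""
    for line in lines:
        key, _, value = line.partition("\t")
        yield key, decode_record(value)
```

In a real streaming job the reducer would be fed `sys.stdin`; the tab between key and value follows the default Hadoop Streaming convention.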
Questions about spectrum file
- xtandem supports DTA, PKL, and MGF files, and each file can contain more than one spectrum.
I think the information needs to be grouped by spectrum when the file is split, but I'm not familiar with spectrum files.
Does one file contain results for several peptides, i.e. is each spectrum the result of one peptide?
- I wonder whether xtandem calculates its value from the mass and m/z pairs of the fragments?
Do the peptide mass and charge matter?
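For the MGF case, each spectrum sits between a BEGIN IONS line and an END IONS line. Header lines such as PEPMASS= and CHARGE= come first, followed by the fragment peaks as "m/z intensity" pairs. The sketch below splits such a file into per-spectrum records so that no spectrum is cut in half when the input is divided. The function name and the dict layout are assumptions for illustration; DTA and PKL files would need their own handling.

```python
def split_mgf(lines):
    """Yield one dict per spectrum found between BEGIN IONS and END IONS."""
    spectrum = None
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is None or not line:
            # Skip blank lines and anything outside a spectrum block.
            continue
        elif "=" in line and not line[0].isdigit():
            # Header line such as TITLE=..., PEPMASS=..., CHARGE=...
            name, _, value = line.partition("=")
            spectrum["params"][name] = value
        else:
            # Peak line: m/z, then intensity (any further columns are ignored).
            fields = line.split()
            if len(fields) >= 2:
                spectrum["peaks"].append((float(fields[0]), float(fields[1])))
```

Each yielded record keeps the precursor information (PEPMASS, CHARGE) together with its own peaks, which is the grouping the question above is concerned with.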
new architecture of the Python-MapReduce-Xtandem
Main idea
- Why MapReduce, not Spark? We follow mr-xtandem and implement the MapReduce version first.
- Top-down (by Sangzhe): follow the same MapReduce design as mr-xtandem and implement the multiple map/reduce tasks as follows (with a fake process method and some fake results):
- Bottom-up (by Mingze): find a way to wrap the original C++ class files as a library that can be called from Cython.
HDFS directories
Input and output
See line 481 of mt-tandem.py (line 482 after modification).
Suppose our computation logic also executes the following code:
if ( mrh.runOldSkool() ) :
    # for debug and performance comparison, just runs regular tandem on a single node in the reducer
    # run tandem - "reducer99" is its cue to fall back into traditional single-node multi-thread behavior
    workStepOldSkool = boto.emr.StreamingStep( name = '%s-OldSkool-final' % baseName, # "-final" is cue to grab results below
        mapper = '%s -mapreduceinstalltest' % xtandemCmd,
        reducer = '%s -reducer99_%d %s %s -reportURL %s' % (xtandemCmd,nParamFiles,outputName, mainXtandemParametersName, finalReportURL),
        cache_files = cachefiles,
        input = '%s/%s' % (baseURL, mapper1InputFile),
        output = '%s/%s' % (baseURL, resultsDir),
        step_args = stepArgs)
    worksteps.extend([workStepOldSkool])
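- So where do the mapper, reducer, input, and output here come from, and what does each of them mean?
- How does the information in the original FASTA file and spectrum file end up in the input here?
- The above only assumes that this code path is executed. To see the real computation logic, we can trace the whole flow in Python's debug mode and watch how it actually runs.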
How to view the output file?
The output file generated by the tandem program should contain the concrete results of the run.
But I don't know how to open it and view its contents.
Do you have any idea? @baimingze
PS: I searched online; some say the output file should be opened with tandem-style.css, but I don't know how.
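One way to look inside the file without any stylesheet, assuming it is the XML document that X!Tandem normally writes: load it with Python's standard XML parser and list which elements it contains. The function name is hypothetical and the sketch assumes nothing about the schema; a stylesheet would only affect how the file is rendered in a browser.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_xml(source):
    """Return the root tag and a count of every element tag in the file.

    `source` may be a file path or an open binary file object.
    """
    root = ET.parse(source).getroot()
    counts = Counter(elem.tag for elem in root.iter())
    return root.tag, counts
```

Calling `summarize_xml("output.xml")` (the file name is a placeholder) gives a quick overview of the structure; individual elements and their attributes can then be inspected with `root.iter(tag)`.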
Error when test-running tandem
Error message
[sangzhe@localhost mrtandem_bin]$ ./tandem gpm_input.xml
./tandem: error while loading shared libraries: libboost_serialization.so.1.57.0: cannot open shared object file: No such file or directory
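The message means the dynamic loader cannot find that Boost library on this machine. A small sketch for listing every unresolved library of a binary by parsing `ldd` output follows; the helper names are hypothetical, and it assumes a Linux system where `ldd` is available.

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the library names that ldd reports as 'not found'."""
    libs = []
    for line in ldd_output.splitlines():
        name, arrow, target = line.strip().partition(" => ")
        if arrow and target.strip() == "not found":
            libs.append(name)
    return libs


def check_binary(path):
    """Run ldd on a binary and return the libraries it cannot resolve."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True, check=False)
    return missing_libraries(result.stdout)
```

For example, `check_binary("./tandem")` should include libboost_serialization.so.1.57.0 on the machine above. The usual remedies are to install that Boost version or to add the directory containing the library to LD_LIBRARY_PATH.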
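Work plan
Overall approach
- Get MR-tandem running successfully on a Hadoop cluster
- Walk through MR-tandem's Python driver program, focusing on the logic flow and the input/output interfaces
- Walk through MR-tandem's input/output interfaces
- Use the information above to write a Spark-oriented xtandem program
Environment
- The OS version is fixed at Fedora 22
- Hadoop uses the version specified in the Hadoop-Version file
- The mrtandam-ica-code directory contains the mrtandem runtime code (mostly Python)
- The mrtandem_bin directory holds the tandem executable compiled successfully on Fedora 22
- The mrtandem_fasta directory contains FASTA database files that mrtandem may use
- The mrtandem_readme file describes how to run the standalone version of mrtandem
Getting mrtandem to run
- Install Fedora 22
- Install the Python and Java development environments
- Install Hadoop and test the bundled examples
- Test the standalone (non-Hadoop) xtandem program
- Write the input scripts for the Hadoop version of mrtandem
- Test/debug the Hadoop version of mrtandem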