project1's Introduction

project1

project1's People

Contributors

baimingze

Watchers

Yasset Perez-Riverol, James Cloos

Forkers

m9994

project1's Issues

input/output format for mapper/reducer

Hadoop Streaming uses stdin and stdout to communicate between the mapper and the reducer.
By default, the input/output format is a key-value pair, and Hadoop treats each line as one key-value pair.
So when I try to output a multi-line text (a slice of the spectrum) as a key, the reducer receives it split into separate single lines.
Here are 2 possible solutions:

  1. Replace '\n' with a separator such as | and restore the newlines in the reducer step.
  2. Hadoop Streaming provides the parameters -inputformat JavaClassName and -outputformat JavaClassName, but they need to be customized by writing a Java class.

What do you think?
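
A minimal sketch of solution 1, assuming the separator | never occurs in the spectrum text and that the default Hadoop Streaming convention applies (everything before the first tab on a line is the key, the rest is the value). The names SEP, encode_record and decode_record are made up for illustration, and the sketch puts the multi-line text in the value rather than the key:

```python
SEP = "|"  # assumed separator; it must never occur in the spectrum text itself


def encode_record(key, text, sep=SEP):
    """Flatten a multi-line text into one 'key<TAB>value' line for the mapper's stdout."""
    if sep in text:
        raise ValueError("separator %r already occurs in the text" % sep)
    return key + "\t" + text.strip("\n").replace("\n", sep)


def decode_record(line, sep=SEP):
    """Recover (key, multi-line text) from one line read on the reducer's stdin."""
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value.replace(sep, "\n")
```

The mapper would print encode_record(...) once per spectrum, and the reducer would call decode_record(...) on every line it reads from stdin. If the spectrum text can contain the separator, a different separator or an escaping scheme would be needed.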

Questions about spectrum file

  1. xtandem supports DTA, PKL and MGF files, and each file can contain more than one spectrum.
    I think the information needs to be grouped by spectrum when the file is split, but I'm not familiar with spectrum files.
    Does one file contain results for several peptides, meaning that each spectrum is the result of one peptide?
  2. I wonder whether xtandem calculates its value from the mass (m/z) pairs of the fragments?
    Do the peptide mass and charge matter?
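
For MGF at least, the file layout suggests how to split it. Each spectrum sits between a BEGIN IONS line and an END IONS line, with header lines such as TITLE=, PEPMASS= (the precursor m/z) and CHARGE=, followed by one "m/z intensity" peak per line. A rough sketch of a splitter that keeps each spectrum's header and peaks together is below. The function name split_mgf and the dict layout are illustrative only, and DTA and PKL files would need their own handling:

```python
def split_mgf(text):
    """Yield one dict per BEGIN IONS ... END IONS block of an MGF file."""
    spectrum = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is not None and line:
            if "=" in line:
                # Header line such as TITLE=..., PEPMASS=..., CHARGE=...
                name, _, value = line.partition("=")
                spectrum["params"][name] = value
            else:
                # Peak line: m/z and intensity (extra columns are ignored)
                fields = line.split()
                if len(fields) >= 2:
                    spectrum["peaks"].append((float(fields[0]), float(fields[1])))
```

Each yielded dict could then be sent to a mapper as one unit, so the precursor mass and charge travel together with the fragment peaks they belong to.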

new architecture of the Python-MapReduce-Xtandem

Main idea

  1. Why MapReduce and not Spark? We follow mr-xtandem and implement the MapReduce version first.
  2. Top-down (by Sangzhe): follow the same MapReduce design as mr-xtandem and implement the multiple map/reduce tasks as follows (with a fake process method and some fake results):
    architecture design1
  3. Bottom-up (by Mingze): find a way to wrap the original C++ class files as a library that can be called from Cython.
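
To make the top-down idea concrete, here is a small local skeleton with a fake process method and fake results. It is not the mr-xtandem design itself, only one map/reduce stage simulated in plain Python. fake_process, mapper, reducer and run_local are hypothetical names, and the "score" is just the length of the spectrum id:

```python
from itertools import groupby


def fake_process(spectrum_id):
    """Placeholder for the real X!Tandem search; returns a made-up score."""
    return len(spectrum_id)


def mapper(lines):
    """Emit (key, value) pairs, one per non-empty input line."""
    for line in lines:
        spectrum_id = line.strip()
        if spectrum_id:
            yield spectrum_id, fake_process(spectrum_id)


def reducer(pairs):
    """Keep the best fake score per key."""
    # Hadoop sorts mapper output by key before the reducer sees it;
    # sorted() imitates that step locally.
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, max(score for _, score in group)


def run_local(lines):
    """Run the map and reduce steps in-process, without Hadoop."""
    return list(reducer(mapper(lines)))
```

Once the wrapped C++ library from the bottom-up work exists, fake_process would be the place to call it.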

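HDFS directory

general_config.json requires hadoop_dr to be configured. Which path should this be?
The run currently fails when copying files to HDFS, and I think this directory path may be wrong.

The screenshot shows the Hadoop log:
2015-07-12 08 21 09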

Input and output

Please see line 481 of mt-tandem.py (line 482 after the modification).
Suppose our computation logic also executes the following code:

          if ( mrh.runOldSkool() ) : 
             # for debug and performance comparison, just runs regular tandem on a single node in the reducer

             # run tandem - "reducer99" is its cue to fall back into traditional single-node multi-thread behavior
             workStepOldSkool = boto.emr.StreamingStep( name = '%s-OldSkool-final' % baseName, # "-final" is cue to grab results below
                                        mapper = '%s -mapreduceinstalltest' % xtandemCmd,
                                        reducer = '%s -reducer99_%d %s %s -reportURL %s' % (xtandemCmd,nParamFiles,outputName, mainXtandemParametersName, finalReportURL),
                                        cache_files = cachefiles,
                                        input = '%s/%s' %  (baseURL, mapper1InputFile),
                                        output = '%s/%s' %  (baseURL, resultsDir),
                                        step_args = stepArgs)
             worksteps.extend([workStepOldSkool])
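  1. So where do the mapper, reducer, input and output here come from, and what do they mean?
  2. How does the information in the original fasta file and spectrum files make its way into the input here?
  3. The above assumes this code path is executed. For the real computation logic, we can trace the whole flow in Python's debug mode and see how it actually runs.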
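
As a partial answer to question 1: in boto's StreamingStep, mapper and reducer are the command lines that Hadoop Streaming runs for each phase, and input and output are the locations the step reads from and writes to. One way to see what the excerpt passes, without calling boto or EMR, is to expand the same format strings with placeholder values. The helper build_step_strings and every value used with it are hypothetical:

```python
def build_step_strings(xtandemCmd, nParamFiles, outputName,
                       mainXtandemParametersName, finalReportURL,
                       baseURL, mapper1InputFile, resultsDir):
    """Expand the format strings from the excerpt into plain strings."""
    return {
        "mapper": "%s -mapreduceinstalltest" % xtandemCmd,
        "reducer": "%s -reducer99_%d %s %s -reportURL %s" % (
            xtandemCmd, nParamFiles, outputName,
            mainXtandemParametersName, finalReportURL),
        "input": "%s/%s" % (baseURL, mapper1InputFile),
        "output": "%s/%s" % (baseURL, resultsDir),
    }


if __name__ == "__main__":
    # All of these values are invented for illustration only.
    strings = build_step_strings("tandem", 2, "out.xml", "params.xml",
                                 "s3://bucket/report", "s3://bucket/job",
                                 "mapper1_input", "results")
    for name, value in sorted(strings.items()):
        print("%s: %s" % (name, value))
```

Printing the real values at this point in the script (or inspecting them in the debugger, as item 3 suggests) would show which file the input actually points to.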

how to view output file?

The output file generated by the tandem program should contain the concrete results of the run.
But I don't know how to open it and view its contents.
Do you have any idea? @baimingze

PS: I searched online; some say the output file should be opened with tandem-style.css, but I don't know how.
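
Assuming the output is an XML report (which is what X!Tandem normally writes), one quick way to look inside it without a stylesheet is to load it with Python's standard library and count the element tags. This is a rough inspection sketch only. summarize_root and summarize_xml_file are made-up helper names, and the sketch makes no assumption about which tags the report contains:

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_root(root):
    """Return the root tag and a count of every element tag below (and including) it."""
    return root.tag, Counter(elem.tag for elem in root.iter())


def summarize_xml_file(path):
    """Parse an XML file from disk and summarize its element tags."""
    return summarize_root(ET.parse(path).getroot())
```

Calling summarize_xml_file on the output file and printing the counter's most_common() entries gives a first overview. From there, individual elements and their attributes can be printed with root.iter(tag).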

Error when test-running tandem

Error message

[sangzhe@localhost mrtandem_bin]$ ./tandem gpm_input.xml
./tandem: error while loading shared libraries: libboost_serialization.so.1.57.0: cannot open shared object file: No such file or directory
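
The message means the dynamic loader cannot find that Boost library at run time. On Linux, ldd lists every shared library a binary needs and marks unresolved ones with "=> not found". Below is a small sketch that wraps ldd and extracts those names. The helpers missing_libraries and check_binary are hypothetical, and check_binary needs Python 3.7+ and an ldd on the PATH:

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path):
    """Run ldd on a binary (Linux only) and list its unresolved shared libraries."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return missing_libraries(result.stdout)
```

For example, check_binary("./tandem") should list libboost_serialization.so.1.57.0 while the problem persists. The usual remedies are installing the Boost serialization runtime library of that version, or pointing LD_LIBRARY_PATH at the directory that contains it.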

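Work plan

Overall approach

  • Get MR-tandem running successfully on a Hadoop cluster
  • Walk through MR-tandem's Python driver program, paying attention to the logic flow and the input/output interfaces
  • Walk through MR-tandem's input/output interfaces
  • Using the information above, write a Spark-oriented xtandem program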

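Environment

  • The OS version is fixed as fedora22
  • Hadoop uses the version specified in the Hadoop-Version file
  • The mrtandam-ica-code directory contains mrtandem's runtime code (mainly Python)
  • The mrtandem_bin directory holds the tandem executable that was compiled successfully on fedora22
  • The mrtandem_fasta directory contains fasta database files that mrtandem may use
  • The mrtandem_readme file describes how to run the standalone version of mrtandem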

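Running mrtandem successfully

  • Install fedora22
  • Install the Python and Java development environments
  • Install Hadoop and test the bundled examples
  • Test the standalone (non-Hadoop) xtandem program
  • Write the various input scripts for the Hadoop version of mrtandem
  • Test/debug the Hadoop version of mrtandem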
