cumbof / opengdc

An open-source Java tool to automatically extract and convert all clinical and genomic data from the Genomic Data Commons to BED, GTF, CSV, and JSON formats

Home Page: http://geco.deib.polimi.it/opengdc/

License: MIT License

Java 100.00%
bed bioinformatics csv gdc gtf json target tcga

opengdc's People

Contributors

cumbof, eleonoracappelli, emanuelws

Forkers

deib-geco

opengdc's Issues

file_uuid issue

HOW IT CURRENTLY WORKS

During the download phase, the software prepends the file_uuid to the output file name. This prefix is required to retrieve the aliquot_uuid associated with a given file during the conversion process.

ISSUE

The procedure described above works as long as the end user converts datasets that were downloaded with OpenGDC. It generates an error, however, when a dataset is downloaded manually from the GDC portal and then converted with OpenGDC, because the input file names do not contain the file_uuid as a prefix.

This issue affects all parsers except MetadataParser* and MaskedSomaticMutationParser, because the aliquot_uuid is already included in the Clinical and Biospecimen Supplements and in the Masked Somatic Mutation (MAF) input files.
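
One possible way to detect the problem early is to check whether an input file name actually carries the prefix before relying on it. The sketch below is not part of OpenGDC: the class name `FileUuidPrefix` and the method `extract` are hypothetical, and it assumes that the file_uuid has the canonical 8-4-4-4-12 hexadecimal UUID form and is separated from the original file name by an underscore, as in the download code shown in the next issue.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not an OpenGDC class): detects a file_uuid prefix in a file name. */
final class FileUuidPrefix {
    // Assumption: the prefix is a canonical UUID (8-4-4-4-12 hex digits)
    // followed by "_" and then the original, non-empty file name.
    private static final Pattern PREFIX = Pattern.compile(
        "^([0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})_(.+)$");

    private FileUuidPrefix() {}

    /** Returns the file_uuid if the name starts with "<uuid>_", otherwise an empty Optional. */
    static Optional<String> extract(String fileName) {
        Matcher m = PREFIX.matcher(fileName);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

A parser could call `FileUuidPrefix.extract(file.getName())` and, when the result is empty, report that the file was probably not downloaded with OpenGDC instead of failing later during the aliquot_uuid lookup.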

Control of existing file during download and conversion procedures

This issue proposes a way to restart the download and/or conversion procedure while skipping the files that were already downloaded or converted.

In DownloadDataAction class, method retrieveData:

private static void retrieveData( ... ) {
   ...
   for (String uuid: dataMap.keySet()) {
      // build the output file name: <file_uuid>_<file_name>
      String fileName = uuid + "_" + dataMap.get(uuid).get("file_name");
      // download the file only if it does not already exist
      File existing_file = new File(gdc_path + fileName);
      if (!existing_file.exists()) {
         for (String s: dataMap.get(uuid).keySet()) {
            ......
         }
      }
   }
   ...
}

In all parser classes, method convert:

public int convert( ... ) {
   int acceptedFiles = FSUtils.acceptedFilesInFolder(inPath, getAcceptedInputFileFormats());   
   System.err.println("Data Amount: " + acceptedFiles + " files" + "\n\n");
   GUI.appendLog(this.getLogger(), "Data Amount: " + acceptedFiles + " files" + "\n\n");

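   // if the output folder is not empty, delete the most recently modified bed file
   // (presumably because it may be incomplete after an interrupted run)
   File folder = new File(outPath);
   File[] files_out = folder.listFiles();
   // listFiles() returns null if outPath is not a readable directory
   if (files_out != null && files_out.length != 0) {
      File lastmodif = null;
      long time = 0;
      for (File file : files_out) {
         if (file.getName().endsWith("bed") && file.lastModified() > time) {
            time = file.lastModified();
            lastmodif = file;
         }
      }
      // delete only if a bed file was actually found
      if (lastmodif != null) {
         System.out.println("deleted: " + lastmodif.getName());
         lastmodif.delete();
      }
   }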

   ...

   if (!aliquot_uuid.trim().equals("")) {
      String suffix_id = this.getOpenGDCSuffix(dataType, false);
      String filePath = outPath + aliquot_uuid + "-" + suffix_id + "." + this.getFormat();
      // create file if it does not exist
      File existing_file = new File(filePath);
      if (!existing_file.exists()) {
         try {
            System.out.println("creation of file: "+existing_file.getName());
            ...
         }
         ...
      }
   } 
}

Note that the System.out.println calls are there only to log the names of the converted files.

Cleaning appdata

This is just a reminder to remove deprecated and unused files from the package/appdata directory before deploying the application.

Improvement considerations about Gene Expression Quantification parser

HOW IT WORKS NOW:
In the GeneExpressionQuantificationParser class, the conversion starts by searching for counts files and automatically retrieving the related fpkm and fpkmuq files.
If the counts file does not exist for some reason but the fpkm and fpkmuq files do, the experiment is skipped. Conversely, if the counts file exists but fpkm and/or fpkmuq are missing, the output file is still generated, with missing values.

IMPROVEMENT:
Because of this asymmetry, I think the GeneExpressionQuantificationParser needs to be improved. In general, the output file should be generated if at least one of the three input files exists and contains no errors (see the GeneExpressionQuantificationReader.getEnsembl2Value(File expFile) method).
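
The proposed rule could be sketched roughly as follows. This is only an illustration under stated assumptions: the class `ExpressionMerge`, the method `merge`, and the `MISSING` placeholder are hypothetical, and each input file is assumed to have already been read into a map from Ensembl ID to value (the actual return type of getEnsembl2Value may differ). A `null` map stands for a file that is missing or could not be read.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch (not OpenGDC code): merges counts, fpkm and fpkmuq values per gene. */
final class ExpressionMerge {
    // Assumed placeholder written when a value is not available.
    static final String MISSING = "null";

    private ExpressionMerge() {}

    /**
     * Each argument maps an Ensembl ID to its value; pass null for a missing or invalid file.
     * Returns one row per Ensembl ID in the order {counts, fpkm, fpkmuq}.
     * The result is empty only when no input is available, in which case
     * the caller would skip the experiment.
     */
    static Map<String, String[]> merge(Map<String, String> counts,
                                       Map<String, String> fpkm,
                                       Map<String, String> fpkmuq) {
        // Arrays.asList tolerates null elements, unlike List.of.
        List<Map<String, String>> sources = Arrays.asList(counts, fpkm, fpkmuq);

        // Collect every Ensembl ID seen in any available input, in sorted order.
        TreeSet<String> ids = new TreeSet<>();
        for (Map<String, String> source : sources) {
            if (source != null) {
                ids.addAll(source.keySet());
            }
        }

        Map<String, String[]> rows = new LinkedHashMap<>();
        for (String id : ids) {
            String[] row = new String[sources.size()];
            for (int i = 0; i < sources.size(); i++) {
                Map<String, String> source = sources.get(i);
                String value = (source == null) ? null : source.get(id);
                row[i] = (value == null) ? MISSING : value;
            }
            rows.put(id, row);
        }
        return rows;
    }
}
```

With this shape, no input is treated specially: the output is produced whenever at least one map is non-null, and absent values are filled with the placeholder.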

ReadableByteChannel [buffer override for large files]

The downloadFile(...) method in the GDCQuery class should be improved to support downloading large files.

According to this article:

The FileChannel will try to read all data from ReadableByteChannel, starting from first byte (0) 
until the last byte in cache, but no more than maximum number of bytes (Long.MAX_VALUE). 
So the problem is: if cache does not contain complete file, also Java cannot download 
complete file. **For larger files this can occur quite often.**
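
A common way to avoid depending on a single transfer call is to copy the channel in a loop until it reports end of stream. The sketch below is a minimal illustration and not the current downloadFile(...) implementation: the class `ChannelCopy`, the method `copyFully`, and the 64 KiB buffer size are assumptions, and the source is assumed to be a blocking channel (for example one obtained from `Channels.newChannel(InputStream)`).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.ReadableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch (not OpenGDC code): copies a channel to a file until end of stream. */
final class ChannelCopy {
    private ChannelCopy() {}

    /** Writes everything readable from source into target and returns the number of bytes written. */
    static long copyFully(ReadableByteChannel source, Path target) throws IOException {
        long total = 0;
        ByteBuffer buffer = ByteBuffer.allocate(64 * 1024); // assumed buffer size
        try (FileChannel out = FileChannel.open(target,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            // read() returns -1 only at end of stream, so the loop does not stop
            // just because a single read delivered fewer bytes than requested.
            while (source.read(buffer) != -1) {
                buffer.flip();
                // write() may consume only part of the buffer, so drain it completely.
                while (buffer.hasRemaining()) {
                    total += out.write(buffer);
                }
                buffer.clear();
            }
        }
        return total;
    }
}
```

The returned byte count can be compared with the file size reported by the server, if available, to verify that the download is complete.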
