Giter Site home page Giter Site logo

gwascatftp's Introduction

gwascatftp - Access the GWAS Catalog FTP server from R

Lifecycle: experimental

This R package provides functions to interact with the GWAS Catalog's FTP server that are (as far as I know) not available in the GWAS Catalog API and by extension the R package {gwasrapidd}. It provides wrapper functions that call to lftp, a command-line file transfer program.

The main goal of {gwascatftp} is to query the GWAS Catalog FTP server with a user-provided study accession (e.g. GCST009541) and:

  1. Find all associated files (documentation, meta-data, summary statistics files)
  2. Identify whether harmonised summary statistics are available
  3. Download & parse the yaml meta-data file
  4. Download all available files from the FTP server
  5. Work from behind an institution's HTTP proxy server (e.g. from a university HPC cluster)

Use Case

This package wants to cover this basic use case: You read a GWAS paper about heart failure that that makes summary statistics available on the GWAS Catalog under the study accession GCST90162626. You want to use the full summary statistics for your own research. In R, {gwasrapidd} can be used to query the Catalog's API to download association summary statistics with genome-wide significance, but not the full summary statistics. These are available from the FTP server, but the links to those files are not available in the data the API provides. Instead, you have to have to go to the study website at https://www.ebi.ac.uk/gwas/studies/GCST90162626, click on the FTP Download link and then manually copy the links to each file and download them using wget, curl, or a similar program. Wouldn't it be much easier to do this from within R, without having to manually search the FTP server in your web browser for the links to the file? Especially when trying to download data for several studies, this can be tedious. That is where {gwascatftp} comes in. It provides a simple interface to access the GWAS Catalog FTP server using only the study accession from within R to find and download all files associated with a specific study accession.

Installation

You can install the development version of gwascatftp from GitHub with:

# install.packages("devtools")
remotes::install_github("cfbeuchel/gwascatftp")

Prerequisites

This package requires an installation of lftp and the path of the program's executable file. To get basic installation information from within the package, call the install_lftp() function.

library(gwascatftp)
install_lftp()

An easy way to install lftp without root access is to use conda. See below for the commands to install lftp in a fresh conda environment and identify the lftp binary's path with whereis. If you do not have lftp installed in your system's $PATH (which it is not unless you activate the environment from within your R session), you need to provide the full path when using {gwascatftp}.

conda create -n lftp -c conda-forge lftp
conda activate lftp
whereis lftp
#> e.g. `~/miniconda3/envs/lftp/bin/lftp`

Getting Started

Before using the package, make sure you have lftp installed and know where the executable is located. When using conda, this should hopefully work on most systems.

Due to interfacing with lftp, we need to create a simple list containing the basic settings for getting lftp to run.

# This returns a named list to be passed to functions calling `lftp`
my_lftp_settings <- create_lftp_settings(
  lftp_bin = "~/miniconda3/envs/lftp/bin/lftp", # Enter your path here!
  use_proxy = FALSE, # When behind a HTTP proxy, set to TRUE
  ftp_proxy = NA, # When behind a HTTP proxy, enter the link, e.g. http://proxy.my_uni.com:8080"
  ftp_root = "ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/"
)

When querying the GWAS Catalog, two large lists are used often. To avoid repeatedly downloading these files, we will download them once, save them in variables and supply those to the functions using them.

my_directory_list <- get_directory_list(lftp_settings = my_lftp_settings)
my_harmonised_list <- get_harmonised_list(lftp_settings = my_lftp_settings)

Now we can query the GWAS Catalog FTP server using accession names. The most high-level function is download_all_accession_data(), that, given a GWAS Catalog study accession, will try to download all available files for that accession and return the parsed meta data when available.

# Supply a single accession you want to download the data from
my_study_accession <- "GCST009541"

# Call the function with all the necessary data, including the settings and pre-downloaded lists
download_all_accession_data(
    study_accession = my_study_accession,
    harmonised_list = my_harmonised_list,
    directory_list = my_directory_list,
    lftp_settings = my_lftp_settings,
    download_directory = "/PATH/TO/DOWNLOAD/TO/",
    create_accession_directory = TRUE,
    overwrite_existing_files = FALSE,
    return_meta_data = TRUE
)

The second high-level function, download_multiple_accession_meta_data() will accept a vector of study accessions and return a data.table containing all meta data available for each accession's [...]-meta.yaml files if available.

# Prepare a list of GWAS IDs
my_mulltiple_study_accessions <- c("GCST009541", "GCST90204201")

# Call the functions including the necessary lists to download the meta data
my_meta_data <- download_multiple_accession_meta_data(
  study_accessions = my_mulltiple_study_accessions, 
  harmonised_list = my_harmonised_list, 
  directory_list = my_directory_list, 
  lftp_settings = my_lftp_settings
)

See Also

gwascatftp's People

Contributors

cfbeuchel avatar

Stargazers

 avatar

Watchers

 avatar

Forkers

comp-med

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.