leykocytes_dmr_analysis

Effect of smoking on human leukocyte epigenome

Students: Polina Pavlova, Maria Firuleva

Supervisors: Oleg Sergeev, Yulia Medvedeva, Julia Kornienko

Project Description

The Russian Children’s Study (RCS) is a prospective cohort of 516 boys who were enrolled at 8–9 years of age and provided semen samples at 18–19 years of age. Reduced representation bisulfite sequencing (RRBS) of sperm was conducted to measure the methylation level of CpG dinucleotides. At enrollment, the blood level of 2,3,7,8-tetrachlorodibenzodioxin (TCDD), the most toxic dioxin congener and one of the most harmful endocrine-disrupting chemicals, was measured in the boys for a further longitudinal study of its influence on reproductive health. Subjects visited the clinic biennially for blood sampling and annually for urine sampling, follow-up of growth and puberty, and interviewing. In total, more than 20,000 sample aliquots and more than 1,000 analysed parameters were collected for further analysis.

What is already known?

  • 52 differentially methylated regions (DMRs) in sperm were identified that distinguish the lowest and highest peripubertal serum TCDD concentrations (RCS, Pilsner et al., 2018)
  • Peripubertal exposure to toxicants (such as dioxins) and smoking affect the methylation of sperm DNA at the age of 18 (last year's project, J. Kornienko)

What remains to be learned?

  • How does smoking influence DNA methylation in peripheral blood leukocytes at the age of 18?
  • How are epigenome changes in sperm and leukocytes associated in the same study participants?

Aims of the project:

  • Analysis of the influence of smoking on the DNA methylation level of peripheral blood leukocytes at the age of 18
  • Comparison of the DNA methylation level in sperm (results of last year's project, Julia Kornienko) and leukocytes in the same study participants

Data analysed:

  • 36 samples: 11 smokers and 25 non-smokers
  • Data on the lifestyle habits of each of the 36 chosen participants
  • Methylation levels of 2277623 CpGs present in at least one of the 36 samples

The histogram shows the distribution of samples by the number of cigarettes smoked.

Scripts Description

  1. data_preparation.ipynb - script for data preprocessing
  2. visualization.ipynb - script for drawing plots presented here
  3. Aclust_GEE.R - script for A-clustering and GEE analysis
  4. DMRcate_RRBS_leuk.R - script for DMRcate analysis

Brief summary of the results:

  1. Leukocytes and sperm show a similar methylation distribution in the dataset of CpGs present in at least one sample.
  2. The methylation distribution differs between sperm and leukocytes in the dataset used for A-clustering (CpGs present in all samples).
  3. The distribution of CpGs across samples is similar in sperm and leukocytes.
  4. The comparison of methylation range also shows a similar distribution.
  5. The A-clustering method found 217 clusters, of which GEE identified 77 as significant (p-value < 0.05).
  6. The A-clustering+GEE comparison shows a similar distribution in sperm and leukocytes. In the sperm data, 814 clusters were identified (A-clust), 136 of them significant after GEE.
  7. The DMRcate package (R) was used to search for DMRs associated with smoking. The factor was smoking in the last 6 months as a binary classification (yes/no). 145 significant CpGs and 34 significant DMRs were found; 19 of the DMRs overlap with at least one promoter (reference: hg38).
  8. Genes associated with significant CpGs correspond to pseudogenes, lincRNAs, antisense RNAs, zinc fingers, miRNA, regulatory elements, cell adhesion proteins and an amino-acid transporter.

Detailed description of results:

Samples selection:

51 out of 516 samples were chosen by the following criteria:

  • Prepubertal TCDD exposure at 8-9 years old
  • Semen quality at 18-19 years old
  • Frozen semen samples at 18-19 years old
  • Buffy coat samples at 18-19 years old closest to the date of semen sampling (buffy coat samples are the source of leukocyte libraries)

First, 10 IDs with either the highest or the lowest TCDD concentration were selected. Then one ID with the highest semen quality and one with the lowest were selected. Later, 39 more IDs were chosen by randomly selecting 13 IDs from each tercile of TCDD concentration. This gave a data set of 51 samples. 45 leukocyte libraries were sequenced. 4 samples were excluded because of low quality, and 5 more were excluded because their coverage was less than 10. Finally, 36 samples with satisfactory sequencing quality were selected for this project.
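The random selection of 13 IDs per TCDD tercile could be sketched roughly as follows. This is only an illustration under assumptions: the function name, the seed, the ID labels and the concentrations are all hypothetical, and the actual selection procedure used in the study is not described in more detail here.

```python
import random

def sample_per_tercile(tcdd_by_id, per_tercile, seed=0):
    """Rank IDs by TCDD value, split them into three equal-sized groups
    (terciles) and draw `per_tercile` IDs at random from each group."""
    rng = random.Random(seed)
    ranked = sorted(tcdd_by_id, key=tcdd_by_id.get)
    n = len(ranked)
    bounds = [0, n // 3, 2 * n // 3, n]
    chosen = []
    for lo, hi in zip(bounds, bounds[1:]):
        chosen.extend(rng.sample(ranked[lo:hi], per_tercile))
    return chosen

# Hypothetical concentrations for 60 made-up participant IDs
value_rng = random.Random(42)
tcdd = {f"ID{i:03d}": value_rng.uniform(0.5, 10.0) for i in range(60)}

selected = sample_per_tercile(tcdd, per_tercile=13)
print(len(selected))  # 3 terciles x 13 IDs
```

Fixing the seed makes the draw reproducible, which helps when the selected IDs have to be matched against sample aliquots later.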

Data preprocessing:

From the file containing information on all CpGs for 41 samples (with CpG coverage >= 10x), we extracted methylation levels for the selected 36 samples. This gave a dataset with methylation levels of 2277623 CpGs for 36 samples (CpGs present in at least one sample). We then prepared a dataset for the A-clustering algorithm, which works only with CpGs present in all samples. For the A-clustering+GEE approach this left a dataset with methylation levels of 19466 CpGs.
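The two datasets differ only in how missing values are treated. A minimal sketch of the two filters, assuming a toy table in which `None` marks a CpG that was not covered in a sample (the table layout and function names are hypothetical and are not taken from data_preparation.ipynb):

```python
# Hypothetical toy table: CpG position -> methylation level per sample.
# None marks a CpG that was not covered in that sample.
methylation = {
    "chr1:1000": [0.10, 0.12, None],
    "chr1:1050": [0.90, 0.85, 0.88],
    "chr2:500": [None, None, None],
    "chr2:750": [None, 0.40, None],
}

def present_in_at_least_one(table):
    """Keep CpGs that have a value in one or more samples."""
    return {cpg: v for cpg, v in table.items() if any(x is not None for x in v)}

def present_in_all(table):
    """Keep CpGs that have a value in every sample (no missing data)."""
    return {cpg: v for cpg, v in table.items() if all(x is not None for x in v)}

any_sample = present_in_at_least_one(methylation)
all_samples = present_in_all(methylation)
print(len(any_sample), len(all_samples))
```

The second filter is much stricter, which is why the real data shrink from 2277623 to 19466 CpGs.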

Comparison with sperm (last year's project, Julia Kornienko):

  • Comparison of methylation distribution for the datasets of CpGs present in at least one sample. The distributions are similar: most CpGs have a very low methylation level, and some CpGs have a high methylation level.

  • Comparison of methylation distribution for the datasets of CpGs present in all samples. Here the distributions differ. In sperm, most CpGs have a very low methylation level and some have a high one, whereas in leukocytes it is vice versa. We suppose the reason is the global demethylation that occurs in germ cells.

  • Comparison of the distribution of CpGs across samples. In both cell types the distribution looks geometric: a CpG is most often found in only one sample, about 1.5 times less often in two samples, and so on.
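The distribution of CpGs across samples can be tabulated by counting, for each CpG, the number of samples in which it has a value. The sketch below uses a made-up six-CpG table (all names and values are hypothetical); the ratio between successive counts is what the factor of about 1.5 refers to.

```python
from collections import Counter

def samples_per_cpg(table):
    """Map 'number of samples covering a CpG' -> 'number of such CpGs'."""
    return Counter(
        sum(x is not None for x in values) for values in table.values()
    )

# Hypothetical toy data: None means the CpG is absent from that sample.
coverage = {
    "cpg_a": [0.1, None, None],
    "cpg_b": [None, 0.7, None],
    "cpg_c": [0.2, None, None],
    "cpg_d": [0.3, 0.4, None],
    "cpg_e": [0.5, None, 0.6],
    "cpg_f": [0.9, 0.8, 0.7],
}

histogram = samples_per_cpg(coverage)

# Ratio of CpGs found in k samples to CpGs found in k + 1 samples;
# a roughly constant ratio is what a geometric distribution would give.
ratios = {
    k: histogram[k] / histogram[k + 1]
    for k in sorted(histogram)
    if histogram.get(k + 1)
}
print(dict(histogram), ratios)
```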

A-clustering+GEE:

Next, we performed A-clustering and GEE regression using the R packages “Aclust” and “gee”. The A-clustering algorithm detects sets of neighboring CpG sites that are correlated with each other. With this approach we identified 217 A-clusters, of which 77 were significant (p-value < 0.05). However, this approach has a significant drawback: because the algorithm works only with CpGs that have values in all samples (no NAs), we had to restrict the data from 2277623 to 19466 CpGs, so a lot of important information could have been lost.
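The idea of grouping neighboring, correlated CpG sites can be illustrated with a much simplified sketch. This is not the Aclust implementation: the greedy merging rule, the Pearson correlation, the thresholds `min_corr` and `max_gap`, and the toy data are all assumptions made for illustration, and the GEE step is not shown.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def cluster_adjacent(sites, min_corr=0.75, max_gap=1000):
    """Greedily merge consecutive sites on one chromosome.

    `sites` is a list of (position, methylation values across samples),
    sorted by position. A site joins the current cluster when it lies within
    `max_gap` bp of the previous site and correlates with it at `min_corr`
    or more. Only clusters with at least two sites are returned.
    """
    if not sites:
        return []
    clusters, current = [], [sites[0]]
    for prev, site in zip(sites, sites[1:]):
        close = site[0] - prev[0] <= max_gap
        if close and pearson(prev[1], site[1]) >= min_corr:
            current.append(site)
        else:
            clusters.append(current)
            current = [site]
    clusters.append(current)
    return [[pos for pos, _ in c] for c in clusters if len(c) >= 2]

# Hypothetical toy data: 5 CpG sites measured in 4 samples, no missing values.
sites = [
    (100, [0.10, 0.20, 0.30, 0.40]),
    (150, [0.20, 0.30, 0.40, 0.50]),
    (220, [0.15, 0.22, 0.35, 0.41]),
    (5000, [0.20, 0.30, 0.40, 0.50]),
    (5050, [0.90, 0.10, 0.80, 0.20]),
]
clusters = cluster_adjacent(sites)
print(clusters)
```

Because the correlation is computed across samples, every site needs a value in every sample, which is the reason for the restriction to CpGs without NAs.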

  • Comparison of the A-clustering+GEE approach. In general, the distribution of the number of CpGs per cluster is similar. We found 217 clusters in leukocytes and 874 in sperm. The number of significant clusters (p-value < 0.05) is 77 in leukocytes and 136 in sperm.

The histogram shows the distribution of the number of CpGs per cluster.

Because of the drawback of the A-clustering+GEE approach described above, another approach was suggested.

DMRcate:

DMRcate is an R package for finding DMRs associated with exposure to a factor. In our case, we used smoking as a binary classification (smoker or non-smoker) to find DMRs associated with smoking. 145 significant CpGs and 34 significant DMRs (p-value < 0.05) were found in our data. Of these, 19 DMRs overlap with at least one promoter (reference: hg38). We found 23 genes associated with significant DMRs. These genes are associated with antisense RNAs, lincRNAs, miRNA, pseudogenes, zinc fingers, a transcription factor, the spliceosome, cell adhesion and migration, a kinase, a metalloprotease, the electron transport chain and an amino-acid transporter.
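Counting the DMRs that overlap at least one promoter comes down to an interval overlap test. The following is a minimal sketch with made-up coordinates; the tuple layout, the function names and the use of closed intervals are assumptions, and this is not how DMRcate_RRBS_leuk.R necessarily performs the annotation.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) closed intervals share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

def count_overlapping(dmrs, promoters):
    """Number of DMRs that overlap one or more promoters."""
    return sum(any(overlaps(d, p) for p in promoters) for d in dmrs)

# Hypothetical coordinates, for illustration only
dmrs = [("chr1", 1000, 1500), ("chr1", 9000, 9200), ("chr2", 300, 800)]
promoters = [("chr1", 1400, 3400), ("chr2", 900, 2900), ("chr3", 300, 800)]

n_overlapping = count_overlapping(dmrs, promoters)
print(n_overlapping)
```

This pairwise check is quadratic in the number of intervals, which is fine for a few dozen DMRs; larger sets are usually handled with sorted intervals or an interval tree.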

The histogram shows the distribution of DMRs per chromosome.

The figure shows the genes associated with significant DMRs.

Summary

In the course of the project, we analyzed reduced representation bisulfite sequencing data from participants in the Russian Children's Study longitudinal project. We applied several approaches to the statistical interpretation of the data, in particular the A-clustering algorithm with GEE and the DMRcate software package. In the future, it will be interesting to compare the results of the previous spring project, devoted to the analysis of RRBS data from spermatozoa, with the results of our project.
