TL;DR
tidy tibbles can contain non-atomic classes. This is a proof of concept demonstration for such implementation with S4 object-oriented classes, for meta-analysis of complex genomic data.
Motivation
In my previous post I reviewed the evolution of Bioconductor S4 classes for omics data. The most recent extended class is the MultiAssayExperiment for multi-assay, where each (single) assay store data for thousand of analytes/gens for many subjects.
What if… I want to run a meta-analysis of the multiple multi-omic data?
For example, suppose I want to find a genetic marker/signature (a set of genes) that is unique to a subset of subject with a specific phenotype (e.g. disease), and then compare it to another cohort of subjects, with different phenotype. Is that genetic marker unique only to a specific cohort of subjects, or also shared across different sub-types of the disease?
Public repositories of multiple multi-omic datasets
Below are some of common big public repositories for multi-omics experiments.
ImmuneSpace: a web-portal for integrative modeling of human immunological data. Currently, it contain data from 72 human immunology studies, covering a total of 5,355 participants. Disclaimer: I am a proud member of one of the groups that develop and maintain this repository.
The Cancer Genome Atlas (TCGA)
Cancer Cell Line Encyclopedia (CCLE)
These repositories collect and annotate an enormous amount of genomics data in a unified way that allow proper downstream analysis of the data, with minimum additional handling. for simplicity, I will refer to different collections of subjects of the same phenotype (disease type) as a ‘study’.
Each ‘study’ is stored in a single S4 object that contain an entire “universe” of multi-assays data sets, each of multiple genes, for many subjects. This is a collection of very rich data sets, well condensed in a systematic way into a single object in R. And,… each of these repositories have many of these ‘Universes’.
Image credit: the movie ‘Man in Black’, Columbia pictures. Now I understand why Purrr has a cat on its hexagon logo.
Where should I store my multiple S4 classes?
In a meta-analysis pipeline, I need the flexibility of accessing each ‘Study’ and play with it, that is, get/extract specific data of interest, e.g. an entire group of analytes within a specific assay, a gene, or a phenotype (clinical feature), and run some analytical procedures (normalization, scaling etc) on it. I also need to do it systematically across all of the studies, within each study across multiple assays, within each assay across all subjects and all genes… Oy Vey.
In my early training as a Statistician, I was taught that in most data analysis, you have a data set with rows for the sampling unit, and columns for the variables/features. Therefore, it is not surprise that the tidyverse paradigm is totally make sense to me. In a meta-analysis the sampling unit is each of the studies. But my data at each cell (unit in my tidy dataset) is not so simple… it is NOT an atomic vector! (i.e. not a number or a string). It is a S4 class. What the #$%& should I do?
Where should I store my multiple S4 objects? I wish I had some sort of a data frame that could store non-atomic vectors like my S4 objects.
OK, prepared yourself for a scary thought. Is there such a thing in R which can act like a data frame of (nested) data frames? Something like a list of a lists, but for data frames?
Luckily, yes! there is! and it is called a tibble. It looks like a data frame, you could access the rows and columns just like any data frame, but in fact each column is a list, and the rows are the elements in that list. And because it is a list, it can also contain non-atomic objects.
For the meta-analysis, all I had to do is creating a tibble with a single column, where each rows has a single study. From that column I extracted embedded/nested ‘layer’ within the S4, into new columns in my tibble (more details below). I then calculated an association statistic (odds ratios from a Logisitic regression). And findally, the last step was plotting forest plot that summarize association at each study.
How to extract the embedded layers from each of the tibble columns?
Arranging the S4 objects in a tibble column, for the purpose of structure simplicity, does not change the fact that it is still a list. And handling a list is sometimes (ok, many times), a lot of pain!
I used to handle such nested lists with lapply, sapply, tapply and its similar functions, but I it wasn’t quite stable. I never felt confidence enough that the input to these functions will result the expected output, and what would be the format of the returned output? Is it going to be a list, a number, a character, a table? What if I have multiple lists that I wish to go though simultaneously? It was a mess!
Luckily the purrr package makes some order in that chaos. The map() function unified the syntax for the returned object, and it goes along nicely with the tidyverse and pipe operator. It also has an elegant implementation for going though each element simultaneously across multiple lists. (map2, pmap). It is kind of like writing as many as lapply as needed, on the fly, which allowed me to send a (very) long hand into the Kishkas of the S4 class, get/extract what i need from there, and run some analysis on it. It was definitely wasn’t trivial at the beginning, but after some time I could finally see the benefit of using it. It ease my coding style so much that it felt like typing code with an extra finger.
credit: Disney XD
So finally it all made sense! I had a rich well annotated omic data with tremendous clinical interpretation potential, and a good hypothesis. I stored the data in a designated S4 container that support my routine analysis tasks. For the meta-analysis, where my row unit is a single study, I stored my S4 objects in a tibble, and extracted the layers of interest using purrr.
I even had a poster to present at the annual Bioconductor conference.
To sum it up, no need to reinvent the wheel. The secret recipe is very simple: MultiAssayExperiment+tibble+purrr
Let the designated packages do the heavy lifting. This is what they are good at, well, because this is what they were designed to do. What’s next?
Should there be a bioconductor package for meta-analysis of multi-omic data? Maybe. Hopefully someone is already working on that right now. It could have designated methods for forest plots and other mate-analysis techniques.
What I find really cool about this tidyverse approach is that it allows to extend the idea of nested objects, and go beyond the current resolution, for example, what about a meta-meta-analysis (this is not a typo). OK, maybe I got carried away. This can be a really scary idea.
How would you do it with other tools? What could be an interesting next analytical step to follow and extend this idea?
Check more related topics here: https://drorberel.github.io/