Stemformatics is a collection of curated, stem cell-focussed data that now contains over 400 independent datasets. Combining these datasets provides a representative sample of the stem cell biology landscape. However, combining data from different platforms, namely microarray and RNA sequencing experiments, poses a significant challenge. The platforms differ in the structure of their data, including dynamic range and noise characteristics, so different assumptions about the behaviour of the data hold for each. Nevertheless, we find that gene-pair correlations can be reasonably compared between platforms, showing that biological signal can outweigh platform variance.
We present the results of applying two relatively simple steps: transformation of expression values to Spearman ranks, and elimination of genes based on a univariate estimate of their platform dependence. The impact of platform is nonlinear, but the biological signal can be revealed by removing genes with substantial platform dependence. This yields a reduced-dimensional space onto which new samples can be overlaid. We can then infer the topology of the landscape, use it to test and benchmark new data, and gain deeper insight into the major drivers of cell-specific clustering.
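
The two steps can be illustrated with a minimal sketch. The abstract does not specify how the univariate platform dependence is estimated, so the code assumes one plausible choice: the fraction of each gene's rank variance explained by the platform label (a one-way ANOVA eta-squared). The function names, the `max_fraction` cut-off and its default value are hypothetical and are not taken from the Stemformatics pipeline.

```python
import numpy as np

def rank_transform(expr):
    """Replace each sample's (column's) values with within-sample ranks.

    expr is a genes x samples array. Tied values receive their average rank.
    """
    expr = np.asarray(expr, dtype=float)
    ranked = np.empty_like(expr)
    for j in range(expr.shape[1]):
        _, inverse, counts = np.unique(
            expr[:, j], return_inverse=True, return_counts=True
        )
        last_rank = np.cumsum(counts)                # highest rank in each tie group
        average_rank = last_rank - (counts - 1) / 2  # mean rank of each tie group
        ranked[:, j] = average_rank[inverse]
    return ranked

def platform_variance_fraction(ranked, platforms):
    """Per-gene fraction of variance explained by platform (assumed measure)."""
    platforms = np.asarray(platforms)
    grand_mean = ranked.mean(axis=1)
    ss_total = ((ranked - grand_mean[:, None]) ** 2).sum(axis=1)
    ss_between = np.zeros(ranked.shape[0])
    for platform in np.unique(platforms):
        group = ranked[:, platforms == platform]
        ss_between += group.shape[1] * (group.mean(axis=1) - grand_mean) ** 2
    # Genes with no variance at all are given a fraction of zero.
    return np.divide(
        ss_between, ss_total, out=np.zeros_like(ss_between), where=ss_total > 0
    )

def filter_platform_dependent_genes(expr, platforms, max_fraction=0.5):
    """Rank-transform, then drop genes above the (arbitrary) cut-off."""
    ranked = rank_transform(expr)
    fraction = platform_variance_fraction(ranked, platforms)
    keep = fraction <= max_fraction
    return ranked[keep], keep, fraction
```

Here `expr` would be a genes-by-samples matrix pooled across datasets and `platforms` a label per sample (for example `"microarray"` or `"rnaseq"`). The retained rows define the reduced space, and a new sample can be rank-transformed and restricted to the same genes before being overlaid. Because ranks are computed within each sample, removing or shifting one gene changes the ranks of the others, so in practice the cut-off would need to be chosen empirically.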