Get More Out Of Your Data By Visualizing Screening Results with PCA
A principal component analysis visualization depicting breast cancer subtype data. Image source: Wikimedia Commons
Screening sets of bacterial mutants for the expression of a single trait is tough, but imagine how hard it would be to screen hundreds of mutants for dozens of different traits simultaneously. Sure, it’s doable, experimentally speaking. Harvest your colonies, then set up your robotic screening microarrays and let them go to town [1].
The real question is how to approach the mountain of data that you’ll have in your lap afterward. You’ll need to upload your mass of raw data to your organization’s collaborative software platform, then find a way to analyze and depict the screening data. The more traits of interest and the broader your set of colonies, the more hassle you’ll have.
At its worst, each trait requires a chart of its own to map the expression levels of all of the mutants that were screened, leaving you with hours of tedious data processing and a stack of charts that can’t be understood at a glance. Sharing the analyzed data with others will also be quite a chore unless you have a reliable software platform to manage the data’s storage, processing, and labeling. New research in multivariate data analysis and visualization offers a promising solution to the problem of sifting through vast quantities of bacterial mutant screening data: high-efficiency principal component analysis (PCA) and visualization [2].
The New Screening Scene
The creators of the technique tested it on data which they produced while optimizing bacterial fermentation for industrial applications, but it is broadly applicable to other trait optimization scenarios.
The basic process of optimizing bacteria for a given trait is familiar to most researchers [3]:
- Culture bacterial colonies
- Harvest the colonies
- Induce mutations via irradiation or another method
- Screen for the desired trait
- Analyze the data across all variables and all screened colonies, relative to the data sets from earlier rounds
- Repeat the process starting from the colonies which had the most desirable phenotype
The number of repetitions you perform depends on how much each trait needs to be optimized. Each repetition creates another set of data, which must then be analyzed in relation to the previous sets of data. A screen that investigates more than one variable will create even more data sets to compare.
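The loop described above can be sketched in a few lines of Python. This is only an illustration of the mutate, screen, select, and repeat cycle: the `mutate` and `screen` functions, the trait names, and all numeric values are hypothetical stand-ins for real mutagenesis and assay steps, not part of the cited research.

```python
import random

def mutate(colony, rng, spread=0.1):
    """Hypothetical stand-in for mutagenesis: randomly shift every trait value."""
    return {trait: value + rng.gauss(0.0, spread) for trait, value in colony.items()}

def screen(colonies, trait):
    """Hypothetical stand-in for a screen: rank colonies by one trait, best first."""
    return sorted(colonies, key=lambda colony: colony[trait], reverse=True)

def optimize(founder, trait, rounds=3, mutants_per_parent=20, keep=5, seed=0):
    rng = random.Random(seed)
    parents = [founder]
    history = []  # one data set per round, retained for comparison across rounds
    for _ in range(rounds):
        mutants = [mutate(parent, rng)
                   for parent in parents
                   for _ in range(mutants_per_parent)]
        ranked = screen(parents + mutants, trait)
        history.append(ranked)
        parents = ranked[:keep]  # the next round starts from the best colonies
    return parents, history

founder = {"yield": 1.0, "growth_rate": 1.0}  # synthetic starting strain
best, history = optimize(founder, "yield")
print(len(history), "rounds; best yield:", round(best[0]["yield"], 3))
```

Note how `history` grows by one full data set per round. That accumulation is the analysis burden the rest of this article is concerned with.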
Traditionally, the output of this process scaled linearly with the amount of time that the researcher was willing to commit to screening and analysis. The rise of robotic screening has disrupted this trend, and researchers can now generate vast quantities of screening data with relatively little lab time. The problem of optimizing bacterial traits has shifted from an issue of hands-on time in the laboratory to an issue of hands-on time during repetitive data analysis. The quantities of data output by robotic systems can quickly overwhelm researchers whose institutions lack robust information technology capacity.
Analyzing screening data for a single trait is typically simple, so the work is doled out among researchers or processed via software when possible. However, dividing the analysis among collaborators won’t provide enough capacity for bacterial screening data as the field moves toward total automation. Automated processing is necessary. There are many software-based analytical methods for processing data relating to the optimization of bacterial traits, of which PCA is one. Fermentation is an area of particular interest, and will continue to be an area of innovation within multi-trait screening analysis [4].
PCA for the Data Masses
So, what is PCA, and why is the new research exciting? Fully appreciating the impact of the new research requires a quick familiarization with the general concept of PCA itself. PCA was introduced as a mathematical technique in 1901 by the mathematician Karl Pearson [5]. Abstractly, PCA takes a group of data with potentially correlated variables, measures how the variables vary together, and then converts them into a set of linearly uncorrelated variables called principal components. PCA accomplishes this by performing an orthogonal transformation of the data, so that a group of highly correlated variables is largely captured by a single component. The resulting components are ordered by the amount of variance each one explains, with the first component capturing the most. Ordering the data in this way lets you summarize most of the variation with the first few components, which makes outliers easier to spot during analysis while still describing all of the original variables simultaneously.
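To make the description concrete, here is a minimal sketch of textbook PCA using NumPy, computed from the eigendecomposition of the covariance matrix. It is a generic illustration, not the method from the cited research; the function name `pca` and the synthetic three-variable data set are assumptions made for the example.

```python
import numpy as np

def pca(data):
    """Textbook PCA via eigendecomposition of the covariance matrix.

    data: array of shape (n_samples, n_variables).
    Returns (scores, components, explained_variance), ordered so that the
    first component explains the most variance.
    """
    centered = data - data.mean(axis=0)
    covariance = np.cov(centered, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1]
    explained_variance = eigenvalues[order]
    components = eigenvectors[:, order]   # columns are orthonormal directions
    scores = centered @ components        # the orthogonal transformation
    return scores, components, explained_variance

# Synthetic data: two strongly correlated variables plus one independent variable.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
data = np.column_stack([
    base + 0.1 * rng.normal(size=200),
    2.0 * base + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

scores, components, explained_variance = pca(data)
print("share of variance per component:",
      np.round(explained_variance / explained_variance.sum(), 3))
print("covariance of the scores (off-diagonal entries are near zero):")
print(np.round(np.cov(scores, rowvar=False), 3))
```

The two correlated input variables end up largely described by the first component, and the covariance matrix of the transformed data is diagonal, which is what "uncorrelated" means here.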
In other words, PCA is useful when you’ve got a ton of data points, each with a ton of variables, as is the case when screening bacterial mutants. The innovation of the new research lies in its application of PCA specifically to the difficulties of screening and visualizing data for bacterial colonies with many traits of interest. The new research’s scope underlines the need for an effective data management system for collaboration in the laboratory.
In the context of screening bacterial mutants, there are a few data sets to keep track of during PCA:
- One bacterial colony’s traits represented as individual variables, which will be necessary to retain for any future engineering efforts after PCA
- Traits from all of the bacterial colonies combined, represented as individual variables, which are necessary to perform PCA
- The correlations found between the individual variables across all colony traits, which will contain insights in and of themselves
- The smaller set of new variables (the principal components) that describe the relationships between multiple traits, which are necessary to understand the PCA results and will be useful to retain as a shorthand description for a group of traits
- One bacterial colony’s traits after PCA’s orthogonal transformation
- Traits from all bacterial colonies after PCA, which is necessary for visualizing the data
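As a rough sketch of how those six data sets relate to one another, the NumPy snippet below derives each of them from a single colonies-by-traits matrix. The matrix is synthetic, the variable names are hypothetical, and standardizing the traits before PCA is an assumption (a common choice when traits are measured in different units), not something the cited research is confirmed to do.

```python
import numpy as np

rng = np.random.default_rng(1)
n_colonies, n_traits = 50, 6

# Synthetic screen: rows are colonies, columns are traits.
raw = rng.normal(size=(n_colonies, n_traits))
raw[:, 1] += 0.8 * raw[:, 0]  # make two traits correlated

# Assumption: put every trait on a comparable scale before PCA.
standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# SVD of the standardized data; rows of vt are the principal axes,
# already ordered from most to least variance explained.
_, singular_values, vt = np.linalg.svd(standardized, full_matrices=False)

datasets = {
    "one_colony_raw": raw[0],                            # one colony's original traits
    "all_colonies_raw": raw,                             # input to PCA
    "correlations": np.corrcoef(raw, rowvar=False),      # trait-to-trait correlations
    "loadings": vt.T,                                    # how traits combine into components
    "all_colonies_pca": standardized @ vt.T,             # every colony after the transformation
}
datasets["one_colony_pca"] = datasets["all_colonies_pca"][0]  # one colony after PCA

explained_variance = singular_values ** 2 / (n_colonies - 1)
for name, values in datasets.items():
    print(f"{name}: shape {values.shape}")
```

Keeping the loadings alongside the transformed data matters in practice: they are what lets you trace an interesting point in a PCA plot back to the original traits of a specific colony.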
After PCA, you’ve got a new data set, but no way to visualize it. After all, PCA is just a way of transforming variables. In traditional analysis, you’d simply have a series of conglomerate trait plots containing all of the different bacterial colonies and all of the conglomerate variables, but this is impractical for large screens. To fully unlock the power of PCA, the new research also visualizes the data.
The new research includes a method for the multidimensional visualization of all the PCA data at once. Potentially hundreds of individual graphs are compressed into a single 3D graph that encodes additional dimensions beyond the x, y, and z axes. Using the new methods, it’s possible to process and depict orders of magnitude more data in less time than before. The resulting 3D plots aren’t trivial to understand, but they’re a massive improvement over squinting at dozens or hundreds of separate plots.
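One common way to show more than three dimensions in a single 3D plot is to map extra components onto marker color and marker size. The sketch below takes that generic approach. It is not a reproduction of the visualization from the cited research, and the function names `plot_encoding` and `draw`, the output file name, and the placeholder scores are all assumptions for illustration. Drawing requires matplotlib; the data preparation only needs NumPy.

```python
import numpy as np

def plot_encoding(scores):
    """Map the first five principal components onto x, y, z, color, and marker size."""
    if scores.shape[1] < 5:
        raise ValueError("need at least five components")
    fifth = scores[:, 4]
    span = fifth.max() - fifth.min()
    # Rescale the fifth component to marker areas between 20 and 120.
    size = 20.0 + 100.0 * (fifth - fifth.min()) / (span if span > 0 else 1.0)
    return {"x": scores[:, 0], "y": scores[:, 1], "z": scores[:, 2],
            "color": scores[:, 3], "size": size}

def draw(encoding, path="pca_screen.png"):
    """Render the encoding as a 3D scatter plot and save it to a file."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    points = ax.scatter(encoding["x"], encoding["y"], encoding["z"],
                        c=encoding["color"], s=encoding["size"], cmap="viridis")
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.set_zlabel("PC3")
    fig.colorbar(points, ax=ax, label="PC4")  # marker size encodes PC5
    fig.savefig(path)
    plt.close(fig)

# Placeholder scores standing in for real PCA output (100 colonies, 5 components).
rng = np.random.default_rng(2)
scores = rng.normal(size=(100, 5))
encoding = plot_encoding(scores)
# draw(encoding)  # uncomment to write pca_screen.png
```

Each point is one colony, so colonies that sit apart from the main cloud, or that stand out by color or size, are candidates for a closer look.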
In order to get the most out of the new bacterial optimization screening PCA and visualization technique, you’ll need a powerful information technology suite which can handle data storage, collaborative analysis, documentation, and experiment planning. A typical ELN isn’t equipped to seamlessly move from vast quantities of raw data to algorithmic analysis and depiction; there are just too many levels of organization and too much data. Thankfully, there is a cloud-based software suite which is equipped to support researchers using PCA.
Science Cloud is the software package which allows for the seamless large scale data processing and management that will be necessary to visualize bacterial screens analyzed via PCA. Using Science Cloud, you’ll be able to find deeper insights from your vast data sets with ease. Contact us today to find out how we can help you use advanced statistical and visualization techniques to overhaul your bacterial screening process.
[1] “Basic Concepts of Microarrays and Potential Applications in Clinical Microbiology.” October 2009, http://cmr.asm.org/content/22/4/611.full
[2] “Simplifying multidimensional fermentation dataset analysis and visualization: One step closer to capturing high-quality mutant strains.” January 2017, http://www.nature.com/articles/srep39875
[3] “Gamma irradiation as a useful tool for the isolation of astaxanthin-overproducing mutant strains of Phaffia rhodozyma.” August 2011, https://www.researchgate.net/publication/51583411_Gamma_irradiation_as_a_useful_tool_for_the_isolation_of_astaxanthin-overproducing_mutant_strains_of_Phaffia_rhodozyma
[4] “Fermentation Process Optimization.” 2007, http://scialert.net/fulltext/?doi=jm.2007.201.208
[5] “On lines and planes of closest fit to systems of points in space.” 1901, http://www.tandfonline.com/doi/abs/10.1080/14786440109462720