--- title: "Introduction to Partition" author: "Malcolm Barrett" date: "`r Sys.Date()`" output: rmarkdown::html_vignette references: - id: R-partition type: article-journal author: - family: Millstein given: Joshua - family: Battaglin given: Francesca - family: Barrett given: Malcolm - family: Cao given: Shu - family: Zhang given: Wu - family: Stintzing given: Sebastian - family: Heinemann given: Volker - family: Lenz given: Heinz-Josef issued: - year: 2020 title: 'Partition: A surjective mapping approach for dimensionality reduction' title-short: Partition container-title: Bioinformatics page: 676-681 volume: '36' issue: '3' URL: https://doi.org/10.1093/bioinformatics/btz661 vignette: > %\VignetteIndexEntry{Introduction to Partition} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} if (identical(Sys.getenv("IN_PKGDOWN"), "true")) { dpi <- 320 } else { dpi <- 72 } knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, fig.align = "center", fig.dpi = dpi, warning = FALSE, message = FALSE ) options(tibble.max_extra_cols = 10) ``` ## Introduction to the partition package partition is a fast and flexible data reduction framework for R [@R-partition]. There are many approaches to data reduction, such as principal components analysis (PCA) and hierarchical clustering (both supported in base R). In contrast, partition attempts to create a reduced data set that is both interpretable (each raw feature maps to one and only one reduced feature) and information-rich (reduced features must meet an information constraint). Reducing the data this way often results in a data set that has a mix of raw features from the original data and reduced features. partition is particularly useful for highly correlated data, such as genomic data, where there is a lot of redundancy. A simple model of say, gene expression data could be block correlated Gaussian variables. `simulate_block_data()` simulates data like this: blocks of correlated data that are themselves independent of the other blocks in the data. ```{r} library(partition) library(ggplot2) set.seed(1234) # create a 100 x 15 data set with 3 blocks df <- simulate_block_data( # create 3 correlated blocks of 5 features each block_sizes = rep(5, 3), lower_corr = .4, upper_corr = .6, n = 100 ) ``` In a heatmap showing the correlations between the simulated features, blocks of correlated features are visible: ```{r} ggcorrplot::ggcorrplot(corr(df)) ``` Many types of data follow a pattern like this. Closely related to the block correlation structure found in genetic data is that found in microbiome data. The data set `baxter_otu` has microbiome data on 172 healthy patients. Each row represents a patient, and each column represents an Operational Taxonomic Unit (OTU). OTUs are species-like relationships between bacteria determined by analyzing their RNA. Each cell in the dataset represents the logged-count of an OTU found in a patient's stool sample, with 1,234 OTUs in all. ```{r} baxter_otu ``` While not as apparent as simulated data, correlated blocks also appear in these data; bacteria tend to group together into communities or cliques in the microbiomes of participants. Here are the first 200 OTUs: ```{r} correlation_subset <- corr(baxter_otu[, 1:200]) ggcorrplot::ggcorrplot(correlation_subset, hc.order = TRUE) + ggplot2::theme_void() ``` # Reducing data with partition Because there are many more features (OTUs) in these data than rows (patients), it's useful to reduce the data for use in statistical modeling. The primary function, `partition()`, takes a data frame and an information threshold and reduces the data to as few variables as possible, subject to the information constraint. ```{r} prt <- partition(baxter_otu, threshold = .5) prt ``` For the microbiome data, `partition()` reduced 158 of the OTUs to 63 reduced features. The other 1,076 OTUs were not reduced because doing so would have removed too much information from the data. `partition()` creates a `tibble` with the newly reduced features, as well as any features that were not reduced, which you can get with `partition_scores()`: ```{r} partition_scores(prt) ``` In comparison, PCA produces a data set of the same dimensions of the original data where all original variables map to all new components in some amount. While components are organized by their informativeness (the first component explains the most variance), no original features are retained. ```{r} pca <- prcomp(baxter_otu) # print the results more neatly tibble::as_tibble(pca$x) ``` Notably, these approaches can be easily combined (see below). # The partition algorithm partition uses an approach called Direct-Measure-Reduce to create agglomerative (bottom-up) partitions that capture the user-specified minimum level of information. Each variable starts as an individual partition subset, and candidate partition subsets are assessed by verifying that the minimum information is captured in the reduced variable. Reduced variables are easily interpretable because original variables map to one and only one variable in the reduced data set. The partition software is flexible and customizable in the way features are agglomerated, information is measured, and data are reduced. In this partition, 63 reduced features consist of two to seven features each, as well as 1,076 of the original features that did not get reduced because reducing them would lose too much information. Here are the top 20 clusters, ordered by how many raw features they represent: ```{r} plot_ncluster(prt, show_n = 20) + # plot_*() functions return ggplots, so they can be extended using ggplot2 theme_minimal(14) ``` Each reduced feature explains at least 50% of the information of the original features that it summarizes. The distribution of information has a lower limit of our threshold, .5. ```{r} plot_information(prt, geom = geom_histogram) + theme_minimal(14) ``` Retrieve a key for these mappings and the information each feature explains with `mapping_key()`, which returns a nested `tibble`. ```{r} mapping_key(prt) ``` To see each mapping, unnest them using `unnest_mappings()` (or do it yourself with `tidyr::unnest()`) ```{r} unnest_mappings(prt) ``` ## Direct-Measure-Reduce and Partitioners Partitioners are functions that tell the partition algorithm 1) what to reduce 2) how to measure how much information and 3) how to reduce the data. We call this approach Direct-Measure-Reduce. In partition, functions that handle (1) are thus called directors, functions that handle (2) are called metrics, and functions that handle (3) are called reducers. partition has many pre-specified partitioners, but this aspect of the approach is also quite flexible. See the [vignette on extending partition](extending-partition.html) to learn more about custom partitioners. The default partitioner in `partition()` is `part_icc()`. `part_icc()` uses a correlation-based distance matrix to find the pair of features with the smallest distance between them; intraclass correlation (ICC) to measure information explained by the reduced feature; and scaled row means to reduce features with a sufficient minimum ICC. `part_icc()` is generally fast and scalable. ```{r} part_icc() ``` There are several other partitioners, which all have names in the format `part_*()` | partitioner | direct | measure | reduce | |:--|:--|:--|:--| | `part_icc()`| Minimum Distance | ICC | scaled row means | | `part_kmeans()`| K-Means Clusters | Minimum ICC | scaled row means | | `part_minr2()`| Minimum Distance | Minimum R-Squared | scaled row means | | `part_pc1()`| Minimum Distance | Variance Explained (PCA) | first principal component | | `part_icc()`| Minimum Distance | Standardized Mutual Information | scaled row means | To apply a different partitioner, use the `partitioner` argument in `partition()`. ```{r} prt_pc1 <- partition(baxter_otu, threshold = .5, partitioner = part_pc1()) prt_pc1 ``` ## Permutation tests Data sets with statistically independent variables reduce only at lower thresholds because variables can only be reduced if a substantial proportion of information is discarded. In this set of 10 independent variables, setting the threshold to 0.5 results in a partition with no reduction. `partition()` returns the original data. ```{r} # create a data.frame of 10 independent features ind_df <- purrr::map_dfc(1:10, ~ rnorm(30)) ind_part <- partition(ind_df, .5) ind_part identical(ind_df, partition_scores(ind_part)) ``` Comparing partitioning that occurs in our observed data to what occurs in data where all features are statistically independent gives us insight into the structure of dependencies in our data. We can make this type of comparison with the function `plot_stacked_area_clusters()`, which creates partitions for a series of information thresholds using both the observed data and a permuted (independent) version of the data. In the permuted version, each variable is randomly permuted relative to all other variables, thereby enforcing statistical independence. In general, there are many fewer reduced features for the real data than the independent data due to these dependencies; there is more common information across features in the real data. ```{r} plot_stacked_area_clusters(df) + theme_minimal(14) ``` partition also has a set of tools for more extensive permutation tests. `map_partition()` will fit partitions for a range of thresholds for the observed data; `test_permutation()` will do the same but also for a set of permuted data sets (100 by default). `plot_permutation()` visualizes the results, comparing information, number of clusters, or the number of raw features reduced. ```{r} perms <- test_permutation(df, nperm = 10) perms ``` ```{r, fig.height = 7} plot_permutation(perms, .plot = "nreduced") + theme_minimal(14) ``` `plot_ncluster()` and `plot_information()`, in addition to plotting individual partitions, also plot the results of `test_permutation()`. # References