Package 'modACDC'

Title: Association of Covariance for Detecting Differential Co-Expression
Description: A series of functions to implement association of covariance for detecting differential co-expression (ACDC), a novel approach for detection of differential co-expression that simultaneously accommodates multiple phenotypes or exposures with binary, ordinal, or continuous data types. Users can use the default method which identifies modules by Partition or may supply their own modules. Also included are functions to choose an information loss criterion (ILC) for Partition using OmicS-data-based Complex trait Analysis (OSCA) and Genome-wide Complex trait Analysis (GCTA). The manuscript describing these methods is as follows: Queen K, Nguyen MN, Gilliland F, Chun S, Raby BA, Millstein J. "ACDC: a general approach for detecting phenotype or exposure associated co-expression" (2023) <doi:10.3389/fmed.2023.1118824>.
Authors: Katelyn Queen [aut, cre, cph], Joshua Millstein [aut, cph]
Maintainer: Katelyn Queen <[email protected]>
License: MIT + file LICENSE
Version: 2.0.1
Built: 2024-11-05 04:13:09 UTC
Source: https://github.com/uscbiostats/acdc


ACDC

Description

ACDC detects differential co-expression between a set of genes, such as a module of co-expressed genes, and a set of external features (exposures or responses) by using canonical correlation analysis (CCA) on the external features and module co-expression values. Modules are detected via Partition.

Usage

ACDC(
  fullData,
  ILC = 0.5,
  externalVar,
  identifierList = colnames(fullData),
  numNodes = 1
)

Arguments

fullData

data frame or matrix with samples as rows, all features as columns; each entry should be numeric gene expression or other molecular data values

ILC

information loss criterion for Partition, or the minimum intra-class correlation required for features to be condensed; 0 <= ILC <= 1; default is 0.50

externalVar

data frame, matrix, or vector containing external variable data to be used for CCA, rows are samples; all elements must be numeric

identifierList

optional row vector of identifiers, of the same length and order, corresponding to columns in fullData (e.g., HUGO symbols for genes); default value is the column names from fullData

numNodes

number of available compute nodes for parallelization; default is 1

Details

Modules are identified by Partition, an agglomerative data reduction method which performs both feature condensation and extraction based on a user-provided information loss criterion (ILC). Feature condensation into a module is only accepted if the intraclass correlation between the features is at least the ILC. For more information about how the co-expression features are calculated, see the coVar documentation.

Following CCA, which finds the linear combinations of the co-expression and external feature vectors that are maximally correlated for each module, a Wilks' Lambda test is performed to determine whether the correlation between these linear combinations is significant. A significant result implies differential co-expression. If there is only one co-expression value for a module (i.e., two features in the module) and a single external variable, CCA reduces to a simple correlation test, and the t-distribution is used to test for significant correlation (Widmann, 2005). If the number of co-expression features in a particular module is larger than the number of samples, CCA will return correlation coefficients of 1, and p-values and BH FDR q-values will not be calculated. See ACDChighdim for our solution.
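The testing step described above can be illustrated with base R alone. The sketch below is not the package's implementation: it assumes Rao's F approximation for the Wilks' Lambda statistic, uses stats::cancor for CCA, and the helper name wilks_cca_test and the simulated data are hypothetical. It also checks the special case noted above, where one co-expression value and one external variable give the same p-value as a correlation t-test.

```r
# Wilks' Lambda test for CCA using Rao's F approximation (assumed variant).
wilks_cca_test <- function(x, y) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  n <- nrow(x); p <- ncol(x); q <- ncol(y)
  rho <- stats::cancor(x, y)$cor            # canonical correlations
  lambda <- prod(1 - rho^2)                 # Wilks' Lambda
  s <- if (p^2 + q^2 - 5 > 0) sqrt((p^2 * q^2 - 4) / (p^2 + q^2 - 5)) else 1
  df1 <- p * q
  df2 <- (n - 1.5 - (p + q) / 2) * s - p * q / 2 + 1
  Fstat <- (1 - lambda^(1 / s)) / lambda^(1 / s) * df2 / df1
  list(cor = rho, lambda = lambda, F = Fstat,
       p.value = stats::pf(Fstat, df1, df2, lower.tail = FALSE))
}

# Simulated single co-expression profile and single external variable
set.seed(1)
n <- 40
coexp <- rnorm(n)
ext <- 0.5 * coexp + rnorm(n)

res <- wilks_cca_test(coexp, ext)
tt <- stats::cor.test(coexp, ext)           # t-test on the Pearson correlation

# With p = q = 1 the F test and the t-test agree
c(wilks = res$p.value, t_test = tt$p.value)
```

With more than one column in x or y, the same helper returns a single p-value for the joint test across all canonical correlations.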

Value

Tibble, sorted by ascending BH FDR value, with columns

moduleNum

module identifier

colNames

list of column names from fullData of the features in the module

features

list of identifiers from input parameter "identifierList" for all features in the module

CCA_corr

list of CCA canonical correlation coefficients

CCA_pval

Wilks' Lambda F-test p-value or t-test p-value

BHFDR_qval

Benjamini-Hochberg false discovery rate q-value

Author(s)

Katelyn Queen, [email protected]

References

Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57 (1995) 289–300.

Martin P, et al. Novel aspects of PPARalpha-mediated regulation of lipid and xenobiotic metabolism revealed through a nutrigenomic study. Hepatology, in press, 2007.

Millstein J, Battaglin F, Barrett M, Cao S, Zhang W, Stintzing S, et al. Partition: a surjective mapping approach for dimensionality reduction. Bioinformatics 36 (2019) 676–681. doi:10.1093/bioinformatics/btz661.

Queen K, Nguyen MN, Gilliland F, Chun S, Raby BA, Millstein J. ACDC: a general approach for detecting phenotype or exposure associated co-expression. Frontiers in Medicine (2023) 10. doi:10.3389/fmed.2023.1118824.

Widmann M. One-Dimensional CCA and SVD, and Their Relationship to Regression Maps. Journal of Climate 18 (2005) 2785–2792. doi:10.1175/jcli3424.1.

Examples

#load CCA package for example dataset
library(CCA)

# load dataset
data("nutrimouse")

# run function for diet and genotype
ACDC(fullData = nutrimouse$lipid,
     ILC = 0.50, 
     externalVar = data.frame(diet=as.numeric(nutrimouse$diet), 
                              genotype=as.numeric(nutrimouse$genotype)))

ACDChighdim

Description

ACDChighdim detects differential co-expression between a set of genes, such as a module of co-expressed genes, and a set of external features (exposures or responses) by using canonical correlation analysis (CCA) on the external features and module co-expression values. A high-dimensional module is supplied by the user.

Usage

ACDChighdim(
  moduleIdentifier = 1,
  moduleCols,
  fullData,
  externalVar,
  identifierList = colnames(fullData),
  corrThreshold = 0.75
)

Arguments

moduleIdentifier

the module identifier given by Partition or other dimension reduction/clustering algorithm; default is 1

moduleCols

list containing indices of column locations in fullData that specify features in the module

fullData

data frame or matrix with samples as rows, all features as columns; each entry should be numeric gene expression or other molecular data values

externalVar

data frame, matrix, or vector containing external variable data to be used for CCA, rows are samples; all elements must be numeric

identifierList

optional row vector of identifiers, of the same length and order, corresponding to columns in fullData (e.g., HUGO symbols for genes); default value is the column names from fullData

corrThreshold

minimum correlation required between two features to be kept in the dataset; 0 <= corrThreshold <= 1; default value is 0.75

Details

If the number of co-expression features in a particular module is larger than the number of samples, CCA will return correlation coefficients of 1, and p-values and BH FDR q-values will not be calculated. This function accepts one of these high-dimensional modules and reduces the dimensionality by calculating the pairwise correlations for all features and keeping only feature pairs with |correlation| > the user-defined corrThreshold, up to a maximum of \lfloor N/2 \rfloor feature pairs. We posit that these highly correlated pairs are the skeleton structure of the full module and therefore an appropriate approximation. Once this structure is identified, co-expression values are calculated and CCA is performed as in ACDC.

For more information about how the co-expression features are calculated, see the coVar documentation.

Following CCA, which finds the linear combinations of the co-expression and external feature vectors that are maximally correlated for each module, a Wilks' Lambda test is performed to determine whether the correlation between these linear combinations is significant. A significant result implies differential co-expression. If there is only one co-expression value for a module (i.e., two features in the module) and a single external variable, CCA reduces to a simple correlation test, and the t-distribution is used to test for significant correlation (Widmann, 2005).
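The pair-selection step described in these Details can be sketched in base R. This is an illustration under stated assumptions, not the package's code: N is taken to be the number of samples, pairs are ranked by descending |correlation| when the cap applies, and the helper name select_pairs and the simulated data are hypothetical.

```r
# Keep feature pairs with |correlation| above the threshold, at most floor(N/2).
select_pairs <- function(dat, corrThreshold = 0.75) {
  dat <- as.matrix(dat)
  cmat <- stats::cor(dat)
  # Row/column indices of every unordered pair (j < k)
  idx <- unname(which(upper.tri(cmat), arr.ind = TRUE))
  pairs <- data.frame(j = idx[, 1], k = idx[, 2], r = cmat[idx])
  pairs <- pairs[abs(pairs$r) > corrThreshold, , drop = FALSE]
  pairs <- pairs[order(-abs(pairs$r)), , drop = FALSE]  # assumed ranking
  utils::head(pairs, floor(nrow(dat) / 2))              # assumed N = samples
}

# Simulated module: features 1-2 and 3-4 are strongly (anti-)correlated
set.seed(2)
n <- 20
f1 <- rnorm(n)
f3 <- rnorm(n)
dat <- cbind(f1, f2 = f1 + rnorm(n, sd = 0.01),
             f3, f4 = -f3 + rnorm(n, sd = 0.01),
             f5 = rnorm(n), f6 = rnorm(n))

sel <- select_pairs(dat, corrThreshold = 0.75)
sel
```

Each retained row identifies one feature pair whose co-expression profile would then enter CCA.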

Value

Tibble, designed to be row-bound with output from other ACDC functions after removing the final column, with columns

moduleNum

module identifier

colNames

list of column names from fullData of the features in the module

features

list of identifiers from input parameter "identifierList" for all features in the module

CCA_corr

list of CCA canonical correlation coefficients

CCA_pval

Wilks' Lambda F-test p-value or t-test p-value

numPairsUsed

number of feature pairs with correlation above corrThreshold
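One way to read the row-binding note above is sketched below. The two data frames are hypothetical stand-ins with made-up p-values, not real output, and the list columns (colNames, features, CCA_corr) are omitted for brevity. The sketch assumes the final column is dropped from both results and that Benjamini-Hochberg q-values are then recomputed over the combined set with stats::p.adjust; the package may intend a different workflow.

```r
# Hypothetical stand-ins for ACDC() and ACDChighdim() results (values made up)
acdc_out <- data.frame(
  moduleNum  = 1:3,
  CCA_pval   = c(0.001, 0.20, 0.03),
  BHFDR_qval = stats::p.adjust(c(0.001, 0.20, 0.03), method = "BH")
)
highdim_out <- data.frame(moduleNum = 4, CCA_pval = 0.01, numPairsUsed = 5)

# Drop the final column of each so the remaining columns match
drop_last <- function(d) d[, -ncol(d), drop = FALSE]
combined <- rbind(drop_last(acdc_out), drop_last(highdim_out))

# Recompute BH q-values across all modules and sort ascending
combined$BHFDR_qval <- stats::p.adjust(combined$CCA_pval, method = "BH")
combined <- combined[order(combined$BHFDR_qval), ]
combined
```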

Author(s)

Katelyn Queen, [email protected]

References

Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57 (1995) 289–300.

Martin P, et al. Novel aspects of PPARalpha-mediated regulation of lipid and xenobiotic metabolism revealed through a nutrigenomic study. Hepatology, in press, 2007.

Queen K, Nguyen MN, Gilliland F, Chun S, Raby BA, Millstein J. ACDC: a general approach for detecting phenotype or exposure associated co-expression. Frontiers in Medicine (2023) 10. doi:10.3389/fmed.2023.1118824.

Widmann M. One-Dimensional CCA and SVD, and Their Relationship to Regression Maps. Journal of Climate 18 (2005) 2785–2792. doi:10.1175/jcli3424.1.

Examples

#load CCA package for example dataset
library(CCA)

# load dataset
data("nutrimouse")

# run function for diet and genotype
ACDChighdim(moduleIdentifier = 1,
            moduleCols = list(1:ncol(nutrimouse$lipid)),
            fullData = nutrimouse$lipid,
            externalVar = data.frame(diet=as.numeric(nutrimouse$diet), 
                                     genotype=as.numeric(nutrimouse$genotype)))

ACDCmod

Description

ACDCmod detects differential co-expression between a set of genes, such as a module of co-expressed genes, and a set of external features (exposures or responses) by using canonical correlation analysis (CCA) on the external features and module co-expression values. Modules are provided by the user.

Usage

ACDCmod(
  fullData,
  modules,
  externalVar,
  identifierList = colnames(fullData),
  numNodes = 1
)

Arguments

fullData

data frame or matrix with samples as rows, all probes as columns; each entry should be numeric gene expression or other molecular data values

modules

vector of lists where each list contains indices of column locations in fullData that specify features in each module

externalVar

data frame, matrix, or vector containing external variable data to be used for CCA, rows are samples; all elements must be numeric

identifierList

optional row vector of identifiers, of the same length and order, corresponding to columns in fullData (e.g., HUGO symbols for genes); default value is the column names from fullData

numNodes

number of available compute nodes for parallelization; default is 1

Details

For more information about how the co-expression features are calculated, see the coVar documentation.

Following CCA, which finds the linear combinations of the co-expression and external feature vectors that are maximally correlated for each module, a Wilks' Lambda test is performed to determine whether the correlation between these linear combinations is significant. A significant result implies differential co-expression. If there is only one co-expression value for a module (i.e., two features in the module) and a single external variable, CCA reduces to a simple correlation test, and the t-distribution is used to test for significant correlation (Widmann, 2005). If the number of co-expression features in a particular module is larger than the number of samples, CCA will return correlation coefficients of 1, and p-values and BH FDR q-values will not be calculated. See ACDChighdim for our solution.
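Because modules are supplied by the user, any clustering method can produce them as long as the result is a list of column-index vectors. The sketch below builds such a list with base R hierarchical clustering, a swapped-in technique chosen only for illustration (the package's own examples use Partition). The simulated data, the cut height of 0.5, and the choice to drop single-feature clusters are assumptions.

```r
# Simulated data: two correlated feature pairs plus one unrelated feature
set.seed(11)
n <- 100
base1 <- rnorm(n)
base2 <- rnorm(n)
dat <- cbind(a1 = base1 + rnorm(n, sd = 0.1),
             a2 = base1 + rnorm(n, sd = 0.1),
             b1 = base2 + rnorm(n, sd = 0.1),
             b2 = base2 + rnorm(n, sd = 0.1),
             lone = rnorm(n))

# Cluster features on the distance 1 - |correlation|
hc <- stats::hclust(stats::as.dist(1 - abs(stats::cor(dat))), method = "average")
cl <- stats::cutree(hc, h = 0.5)

# One vector of column indices per cluster; keep clusters with >= 2 features
mods <- unname(split(seq_len(ncol(dat)), cl))
mods <- mods[lengths(mods) >= 2]
mods
```

A list built this way could then be passed as the modules argument, with fullData set to the same matrix the indices refer to.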

Value

Tibble, sorted by ascending BH FDR value, with columns

moduleNum

module identifier

colNames

list of column names from fullData of the features in the module

features

list of identifiers from input parameter "identifierList" for all features in the module

CCA_corr

list of CCA canonical correlation coefficients

CCA_pval

Wilks' Lambda F-test p-value; t-test p-value if there are only 2 features in the module and a single external variable

BHFDR_qval

Benjamini-Hochberg false discovery rate q-value

Author(s)

Katelyn Queen, [email protected]

References

Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57 (1995) 289–300.

Martin P, et al. Novel aspects of PPARalpha-mediated regulation of lipid and xenobiotic metabolism revealed through a nutrigenomic study. Hepatology, in press, 2007.

Millstein J, Battaglin F, Barrett M, Cao S, Zhang W, Stintzing S, et al. Partition: a surjective mapping approach for dimensionality reduction. Bioinformatics 36 (2019) 676–681. doi:10.1093/bioinformatics/btz661.

Queen K, Nguyen MN, Gilliland F, Chun S, Raby BA, Millstein J. ACDC: a general approach for detecting phenotype or exposure associated co-expression. Frontiers in Medicine (2023) 10. doi:10.3389/fmed.2023.1118824.

Widmann M. One-Dimensional CCA and SVD, and Their Relationship to Regression Maps. Journal of Climate 18 (2005) 2785–2792. doi:10.1175/jcli3424.1.

Examples

#load CCA package for example dataset
library(CCA)

# load dataset
data("nutrimouse")

# partition dataset and save modules
library(partition)
part <- partition(nutrimouse$lipid, threshold = 0.50)
mods <- part$mapping_key[which(grepl("reduced_var_", part$mapping_key$variable)), ]$mapping

# run function for diet and genotype
ACDCmod(fullData = nutrimouse$lipid,
        modules = mods,
        externalVar = data.frame(diet=as.numeric(nutrimouse$diet), 
                                  genotype=as.numeric(nutrimouse$genotype)))

coVar

Description

Function to calculate ACDC covariances within a data pair for all samples

Usage

coVar(dataPair, fullData)

Arguments

dataPair

column indices of two genes to calculate covariance between

fullData

dataframe or matrix with samples as rows, all probes as columns; each entry should be numeric gene expression or other molecular data values

Details

Co-expression for a single sample, s, is defined as

c_{s,j,k} \equiv \left(g_{s,j}-\bar{g_j}\right)\left(g_{s,k}-\bar{g_k}\right)

where g_{s,j} denotes the expression of gene j in sample s and \bar{g_j} denotes the mean expression of gene j in all samples.

Denoting the sample size as N, coVar returns the co-expression profile across all samples:

c_{j,k} = (c_{1,j,k}, c_{2,j,k}, \ldots, c_{N,j,k})
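The definition above translates directly into a few lines of base R. The sketch below is a re-implementation for illustration only, with a hypothetical helper name (co_expression) and simulated data; it is not the package's coVar source. It also shows a useful sanity check: summing the profile and dividing by N - 1 recovers the ordinary sample covariance of the two features.

```r
# Per-sample co-expression terms for one pair of columns
co_expression <- function(dataPair, fullData) {
  g <- as.matrix(fullData)[, dataPair, drop = FALSE]
  (g[, 1] - mean(g[, 1])) * (g[, 2] - mean(g[, 2]))
}

# Simulated expression matrix: 10 samples (rows) by 3 features (columns)
set.seed(42)
expr <- matrix(rnorm(30), nrow = 10, ncol = 3)

prof <- co_expression(c(1, 2), expr)   # one value per sample
length(prof)

# The profile sums to (N - 1) times the sample covariance
all.equal(sum(prof) / (nrow(expr) - 1), stats::cov(expr[, 1], expr[, 2]))
```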

Value

Vector containing the co-expression profile, i.e., the per-sample pairwise covariance terms, for the given pair of features

Author(s)

Katelyn Queen, [email protected]

References

Martin P, et al. Novel aspects of PPARalpha-mediated regulation of lipid and xenobiotic metabolism revealed through a nutrigenomic study. Hepatology, in press, 2007.

Queen K, Nguyen MN, Gilliland F, Chun S, Raby BA, Millstein J. ACDC: a general approach for detecting phenotype or exposure associated co-expression. Frontiers in Medicine (2023) 10. doi:10.3389/fmed.2023.1118824.

Examples

#load CCA package for example dataset
library(CCA)

# load dataset
data("nutrimouse")

# run function with the first two features (columns)
coVar(dataPair = c(1, 2), 
      fullData = nutrimouse$lipid)

GCTA_par

Description

GCTA_par determines the average heritability of the first principal component of either the co-expression or covariance of gene expression modules for a range of increasingly reduced datasets. Dimension reduction is done with Partition, where features are only condensed into modules if the intraclass correlation between the features is at least the user-supplied information loss criterion (ILC), 0 <= ILC <= 1. An ILC of one returns the full dataset with no reduction, and an ILC of zero returns one module of all input features, reducing the dataset to the mean value. For each ILC value, with the number of ILCs tested determined by input parameter ILCincrement, the function returns the point estimate and standard error of the average heritability of the first principal component of the co-expression or covariance of the gene expression modules in the reduced dataset. If input parameter permute is true, the function also returns the same values for a random permutation of the first principal component of the appropriate matrix.

Usage

GCTA_par(
  df,
  ILCincrement = 0.05,
  fileLoc,
  gctaPath,
  remlAlg = 0,
  maxRemlIt = 100,
  numCovars = NULL,
  catCovars = NULL,
  summaryType,
  permute = TRUE,
  numNodes = 1,
  verbose = TRUE
)

Arguments

df

n x p data frame or matrix of numeric -omics values with no ID column

ILCincrement

float between zero and one determining interval between tested ILC values; default is 0.05

fileLoc

absolute file path to bed, bim, and fam files, including prefix

gctaPath

absolute path to GCTA software

remlAlg

algorithm to run REML iterations in GCTA; 0 = average information (AI), 1 = Fisher-scoring, 2 = EM; default is 0 (AI)

maxRemlIt

the maximum number of REML iterations; default is 100

numCovars

n x c_n matrix of numerical covariates to adjust heritability model for; must be in same person order as fam file; default is NULL

catCovars

n x c_c matrix of categorical covariates to adjust heritability model for; must be in same person order as fam file; default is NULL

summaryType

one of "coexpression" or "covariance"; determines how to summarize each module

permute

boolean value for whether or not to calculate values for a random permutation module summary; default is true

numNodes

number of available compute nodes for parallelization; default is 1

verbose

logical for whether or not to display progress updates; default is TRUE

Details

Genome-wide Complex Trait Analysis (GCTA) is a suite of C++ functions. In order to use the GCTA functions, the user must specify the absolute path to the GCTA software, which can be downloaded from the Yang Lab website at https://yanglab.westlake.edu.cn/software/gcta/.

Here, we use GCTA's genomic-relatedness-based restricted maximum likelihood (GREML) method to estimate the heritability of the first principal component of each module summary. If permutations are included, GREML is called 2 x (number of modules) times for each ILC tested.

Dimension reduction is done with Partition, an agglomerative data reduction method which performs both feature condensation and extraction based on a user-provided information loss criterion (ILC). Feature condensation into a module is only accepted if the intraclass correlation between the features is at least the ILC. The superPartition function is called if the gene expression dataset contains more than 4,000 features.
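The module summary that is passed to GREML can be pictured with a small base-R sketch. It covers one reading of summaryType = "coexpression": co-expression profiles are computed for every feature pair in a module, and the first principal component of those profiles is taken as the phenotype. The centering choice in prcomp, the hypothetical module, and the simulated data are all assumptions, and no GCTA call is made here.

```r
# Simulated expression data: 25 samples by 4 features
set.seed(7)
n <- 25
expr <- matrix(rnorm(n * 4), nrow = n)
module_cols <- c(1, 2, 3)                 # hypothetical module of three features

# Co-expression profile for every pair of features in the module
pairs <- utils::combn(module_cols, 2)
coexp <- apply(pairs, 2, function(pr) {
  (expr[, pr[1]] - mean(expr[, pr[1]])) * (expr[, pr[2]] - mean(expr[, pr[2]]))
})

# First principal component of the co-expression profiles (one score per sample)
pc1 <- stats::prcomp(coexp, center = TRUE, scale. = FALSE)$x[, 1]

# Permuted version used as the null comparison when permute = TRUE
pc1_perm <- sample(pc1)
```

Shuffling PC1 breaks its link to the genotype data while keeping its distribution, which is why the permuted estimate serves as a baseline.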

Value

Data frame with columns

ILC

the information loss criterion used for that iteration

InformationLost

percent information lost due to data reduction

PercentReduction

percent of variables condensed compared to unreduced data

AveVarianceExplained_Observed

average heritability estimate for PC1 of observed summary data

OverallSD_Observed

standard deviation of the heritability estimates for PC1 of observed summary data

VarianceExplained_Observed

list of heritability estimates for PC1 of observed summary for all modules

SE_Observed

list of standard errors of the heritability estimates for PC1 of observed summary data for all modules

AveVarianceExplained_Permuted

average heritability for PC1 of permuted summary data

OverallSD_Permuted

standard deviation of the heritability estimates for PC1 of permuted summary data

VarianceExplained_Permuted

list of heritability estimates for PC1 of permuted summary data for all modules

SE_Permuted

list of standard errors of the heritability estimates for PC1 of permuted summary data for all modules

Author(s)

Katelyn Queen, [email protected]

References

Millstein J, Battaglin F, Barrett M, Cao S, Zhang W, Stintzing S, et al. Partition: a surjective mapping approach for dimensionality reduction. Bioinformatics 36 (2019) 676–681. doi:10.1093/bioinformatics/btz661.

Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet. 2011 Jan 7;88(1):76-82. doi: 10.1016/j.ajhg.2010.11.011. Epub 2010 Dec 17. PMID: 21167468; PMCID: PMC3014363.

See Also

GCTA software - https://yanglab.westlake.edu.cn/software/gcta/

Examples

# run function; input absolute path to GCTA software before running
## Not run: 
GCTA_par(df = geneExpressionData,
         ILCincrement = 0.25,
         fileLoc = "pathHere",
         gctaPath = "pathHere",
         summaryType = "coexpression",
         permute = TRUE,
         numNodes = 1)
## End(Not run)

GCTA_parPlot

Description

GCTA_parPlot creates a graph of the output from the GCTA_par function, plotting average heritability of the first principal component of either co-expression or covariance of gene modules against information lost/percent reduction for both observed and permuted data.

Usage

GCTA_parPlot(df, dataName = "", summaryType)

Arguments

df

output from GCTA_par function with permutations

dataName

string of name of data for graph labels; default is blank

summaryType

one of "coexpression" or "covariance"; how modules were summarized for GCTA calculations

Details

Genome-wide Complex Trait Analysis (GCTA) is a suite of C++ functions. In order to use the GCTA functions, the user must specify the absolute path to the GCTA software, which can be downloaded from the Yang Lab website at https://yanglab.westlake.edu.cn/software/gcta/.

In GCTA_par, we use GCTA's genomic-relatedness-based restricted maximum likelihood (GREML) method to estimate the average heritability of the first principal component of either co-expression or covariance of gene modules. The produced plot shows these heritability estimates at varying levels of dataset reduction, calculated for observed data in blue and permuted data in red. An information loss value of 0 represents the unreduced dataset, and an information loss level of 100 represents the data being reduced to the average expression of all features.

Value

ggplot object

Author(s)

Katelyn Queen, [email protected]

References

Millstein J, Battaglin F, Barrett M, Cao S, Zhang W, Stintzing S, et al. Partition: a surjective mapping approach for dimensionality reduction. Bioinformatics 36 (2019) 676–681. doi:10.1093/bioinformatics/btz661.

Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet. 2011 Jan 7;88(1):76-82. doi: 10.1016/j.ajhg.2010.11.011. Epub 2010 Dec 17. PMID: 21167468; PMCID: PMC3014363.

See Also

GCTA software - https://yanglab.westlake.edu.cn/software/gcta/#Overview

Examples

# run GCTA_par and save output; input absolute path to GCTA software before running
## Not run: 
par <- GCTA_par(df = geneExpressionData,
                ILCincrement = 0.25,
                fileLoc = "pathHere",
                gctaPath = "pathHere",
                summaryType = "coexpression",
                permute = TRUE,
                numNodes = 1)
## End(Not run)

# run function
## Not run: 
GCTA_parPlot(df = par, dataName = "Example Data", summaryType = "coexpression")
## End(Not run)

GCTA_singleValue

Description

Function to return the heritability of an external phenotype for a single dataset

Usage

GCTA_singleValue(
  fileLoc,
  externalVar,
  gctaPath,
  remlAlg = 0,
  maxRemlIt = 100,
  numCovars = NULL,
  catCovars = NULL
)

Arguments

fileLoc

absolute file path to bed, bim, and fam files, including prefix

externalVar

vector of length n of external variable values with no ID column; must be in the same sample order as bed, bim, fam files

gctaPath

absolute path to GCTA software

remlAlg

algorithm to run REML iterations in GCTA; 0 = average information (AI), 1 = Fisher-scoring, 2 = EM; default is 0 (AI)

maxRemlIt

the maximum number of REML iterations; default is 100

numCovars

n x c_n matrix of numerical covariates to adjust heritability model for; must be in same person order as fam file; default is NULL

catCovars

n x c_c matrix of categorical covariates to adjust heritability model for; must be in same person order as fam file; default is NULL

Details

Genome-wide Complex Trait Analysis (GCTA) is a suite of C++ functions. In order to use the GCTA functions, the user must specify the absolute path to the GCTA software, which can be downloaded from the Yang Lab website at https://yanglab.westlake.edu.cn/software/gcta/.

Here, we use GCTA's genomic-relatedness-based restricted maximum likelihood (GREML) method to estimate the heritability of an external phenotype.

Value

Row of GREML output containing heritability point estimate of external data and standard error

Author(s)

Katelyn Queen, [email protected]

References

Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet. 2011 Jan 7;88(1):76-82. doi: 10.1016/j.ajhg.2010.11.011. Epub 2010 Dec 17. PMID: 21167468; PMCID: PMC3014363.

See Also

GCTA software - https://yanglab.westlake.edu.cn/software/gcta/

Examples

externalVar <- c()

# run function; input data before running
## Not run: 
GCTA_singleValue(fileLoc = "pathHere",
                 externalVar = externalVar,
                 gctaPath = "pathHere")
## End(Not run)

OSCA_par

Description

OSCA_par determines the percent variance explained in an external variable (exposures or responses) for a range of increasingly reduced datasets. Dimension reduction is done with Partition, where features are only condensed into modules if the intraclass correlation between the features is at least the user-supplied information loss criterion (ILC), 0 <= ILC <= 1. An ILC of one returns the full dataset with no reduction, and an ILC of zero returns one module of all input features, reducing the dataset to the mean value. For each ILC value, with the number of ILCs tested determined by input parameter ILCincrement, the function returns the point estimate and standard error of the percent variance explained in the observed external variable by the reduced dataset. If input parameter permute is true, the function also returns the same values for a random permutation of the external variable.

Usage

OSCA_par(
  df,
  externalVar,
  ILCincrement = 0.05,
  oscaPath,
  remlAlg = 0,
  maxRemlIt = 100,
  numCovars = NULL,
  catCovars = NULL,
  permute = TRUE,
  numNodes = 1,
  verbose = TRUE
)

Arguments

df

n x p data frame or matrix of numeric -omics values with no ID column

externalVar

vector of length n of external variable values with no ID column

ILCincrement

float between zero and one determining interval between tested ILC values; default is 0.05

oscaPath

absolute path to OSCA software

remlAlg

algorithm to run REML iterations in OSCA; 0 = average information (AI), 1 = Fisher-scoring, 2 = EM; default is 0 (AI)

maxRemlIt

the maximum number of REML iterations; default is 100

numCovars

n x c_n matrix of numerical covariates to adjust heritability model for; must be in same person order as externalVar; default is NULL

catCovars

n x c_c matrix of categorical covariates to adjust heritability model for; must be in same person order as externalVar; default is NULL

permute

boolean value for whether or not to calculate values for a random permutation of the external variable; default is true

numNodes

number of available compute nodes for parallelization; default is 1

verbose

logical for whether or not to display progress updates; default is TRUE

Details

OmicS-data-based Complex trait Analysis (OSCA) is a suite of C++ functions. In order to use the OSCA functions, the user must specify the absolute path to the OSCA software, which can be downloaded from the Yang Lab website at https://yanglab.westlake.edu.cn/software/osca/.

Here, we use OSCA's Omics Restricted Maximum Likelihood (OREML) method to estimate the percent of variance in an external phenotype that can be explained by an omics profile, akin to heritability estimates in GWAS. OREML is called twice for each ILC tested if permutations are included.

Dimension reduction is done with Partition, an agglomerative data reduction method which performs both feature condensation and extraction based on a user-provided information loss criterion (ILC). Feature condensation into a module is only accepted if the intraclass correlation between the features is at least the ILC. The superPartition function is called if the gene expression dataset contains more than 4,000 features.
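To make the cost of a run concrete, the sketch below lays out the ILC values implied by ILCincrement and the permuted external variable used for the null comparison. It assumes the grid runs from 0 to 1 inclusive in steps of ILCincrement, which is not stated above, and the external variable values are made up. No OSCA call is made.

```r
# Assumed ILC grid: 0 to 1 inclusive, in steps of ILCincrement
ILCincrement <- 0.25
ilc_grid <- seq(0, 1, by = ILCincrement)
ilc_grid

# Hypothetical external variable and one random permutation of it
set.seed(3)
externalVar <- c(1, 1, 2, 2, 3, 3, 4, 4)
permutedVar <- sample(externalVar)

# OREML is called twice per ILC when permute = TRUE (observed + permuted)
permute <- TRUE
n_oreml_calls <- length(ilc_grid) * (if (permute) 2 else 1)
n_oreml_calls
```

Under these assumptions, halving ILCincrement roughly doubles the number of OREML calls, so a coarse increment is a reasonable first pass.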

Value

Data frame with columns

ILC

the information loss criterion used for that iteration

InformationLost

percent information lost due to data reduction

PercentReduction

percent of variables condensed compared to unreduced data

VarianceExplained_Observed

percent variance explained in observed external variable by the data

SE_Observed

standard error of the percent variance estimate for observed external variable

VarianceExplained_Permuted

percent variance explained in permuted external variable by the data; only if input parameter "permute" is true

SE_Permuted

standard error of the percent variance estimate for permuted external variable; only if input parameter "permute" is true

Author(s)

Katelyn Queen, [email protected]

References

Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57 (1995) 289–300.

Martin P, et al. Novel aspects of PPARalpha-mediated regulation of lipid and xenobiotic metabolism revealed through a nutrigenomic study. Hepatology, in press, 2007.

Millstein J, Battaglin F, Barrett M, Cao S, Zhang W, Stintzing S, et al. Partition: a surjective mapping approach for dimensionality reduction. Bioinformatics 36 (2019) 676–681. doi:10.1093/bioinformatics/btz661.

Queen K, Nguyen MN, Gilliland F, Chun S, Raby BA, Millstein J. ACDC: a general approach for detecting phenotype or exposure associated co-expression. Frontiers in Medicine (2023) 10. doi:10.3389/fmed.2023.1118824.

See Also

OSCA software - https://yanglab.westlake.edu.cn/software/osca/

Examples

#load CCA package for example dataset
library(CCA)

# load dataset
data("nutrimouse")

# run function; input absolute path to OSCA software before running
## Not run: 
OSCA_par(df = nutrimouse$gene,
         externalVar = as.numeric(nutrimouse$diet),
         ILCincrement = 0.25,
         oscaPath = "pathHere")
## End(Not run)

OSCA_parPlot

Description

OSCA_parPlot creates a graph of the output from the OSCA_par function, plotting percent variance explained in an external variable (exposure or response) against information lost/percent reduction for both observed and permuted data.

Usage

OSCA_parPlot(df, externalVarName = "", dataName = "")

Arguments

df

output from OSCA_par function with permutations

externalVarName

string of name of external variable for graph labels; default is blank

dataName

string of name of data for graph labels; default is blank

Details

OmicS-data-based Complex trait Analysis (OSCA) is a suite of C++ functions. In order to use the OSCA functions, the user must specify the absolute path to the OSCA software, which can be downloaded from the Yang Lab website at https://yanglab.westlake.edu.cn/software/osca/.

In OSCA_par, we use OSCA's Omics Restricted Maximum Likelihood (OREML) method to estimate the percent of variance in an external phenotype that can be explained by an omics profile, akin to heritability estimates in GWAS. The produced plot shows the percent variance explained in an external variable at varying levels of dataset reduction, calculated for observed external variables in blue and permuted external variables in red. An information loss value of 0 represents the unreduced dataset, and an information loss level of 100 represents the data being reduced to the average expression of all features.

Value

ggplot object

Author(s)

Katelyn Queen, [email protected]

References

Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57 (1995) 289–300.

Martin P, et al. Novel aspects of PPARalpha-mediated regulation of lipid and xenobiotic metabolism revealed through a nutrigenomic study. Hepatology, in press, 2007.

Millstein J, Battaglin F, Barrett M, Cao S, Zhang W, Stintzing S, et al. Partition: a surjective mapping approach for dimensionality reduction. Bioinformatics 36 (2019) 676–681. doi:10.1093/bioinformatics/btz661.

Queen K, Nguyen MN, Gilliland F, Chun S, Raby BA, Millstein J. ACDC: a general approach for detecting phenotype or exposure associated co-expression. Frontiers in Medicine (2023) 10. doi:10.3389/fmed.2023.1118824.

See Also

OSCA software - https://yanglab.westlake.edu.cn/software/osca/

Examples

# load CCA package for example dataset
library(CCA)

# load dataset
data("nutrimouse")

# run OSCA_par and save output; input absolute path to OSCA software before running
## Not run: par <- OSCA_par(df = nutrimouse$gene,
                externalVar = as.numeric(nutrimouse$diet),
                ILCincrement = 0.25,
                oscaPath = "pathHere")
## End(Not run)

# run function
## Not run: OSCA_parPlot(df = par, externalVarName = "Diet", dataName = "Nutritional Issue Genes")
## End(Not run)
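
Because OSCA_parPlot returns a ggplot object, the result can be restyled or saved with ordinary ggplot2 functions. The following is a minimal sketch, assuming ggplot2 is installed. The data frame and its column names are made-up stand-ins so that the code runs without OSCA; they are not the actual columns of the OSCA_par output.

```r
# Made-up stand-in data; NOT the real OSCA_par output format.
toy <- data.frame(
  infoLost = rep(c(0, 50, 100), times = 2),
  varExp   = c(0.60, 0.50, 0.20, 0.05, 0.04, 0.03),
  group    = rep(c("Observed", "Permuted"), each = 3)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  # Stand-in for the object that OSCA_parPlot() would return.
  p <- ggplot2::ggplot(
    toy,
    ggplot2::aes(x = infoLost, y = varExp, colour = group)
  ) +
    ggplot2::geom_line()

  # Any ggplot object accepts further layers, themes, and titles.
  p2 <- p + ggplot2::theme_minimal() + ggplot2::ggtitle("Customized title")

  # Save the customized plot to a temporary PDF file.
  out_file <- tempfile(fileext = ".pdf")
  ggplot2::ggsave(out_file, plot = p2, width = 6, height = 4)
}
```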

OSCA_singleValue

Description

OSCA_singleValue returns the percent variance explained in an external phenotype for a single dataset.

Usage

OSCA_singleValue(
  df,
  externalVar,
  oscaPath,
  remlAlg = 0,
  maxRemlIt = 100,
  numCovars = NULL,
  catCovars = NULL
)

Arguments

df

n x p dataframe or matrix of numeric -omics values with no ID column

externalVar

vector of length n of external variable values with no ID column

oscaPath

absolute path to OSCA software

remlAlg

which algorithm to run REML iterations in GCTA; 0 = average information (AI), 1 = Fisher-scoring, 2 = EM; default is 0 (AI)

maxRemlIt

the maximum number of REML iterations; default is 100

numCovars

n x c_n matrix of numerical covariates to adjust heritability model for; must be in same person order as fam file; default is NULL

catCovars

n x c_c matrix of categorical covariates to adjust heritability model for; must be in same person order as fam file; default is NULL
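
The shape requirements listed above can be checked before calling OSCA. The helper below, check_osca_inputs, is hypothetical and not part of modACDC; it is a sketch on simulated data and only verifies that df is numeric and that every other input has one entry or row per sample.

```r
# Hypothetical helper (not part of modACDC): checks input shapes only.
check_osca_inputs <- function(df, externalVar,
                              numCovars = NULL, catCovars = NULL) {
  df <- as.matrix(df)  # a non-numeric ID column would make this non-numeric
  n <- nrow(df)
  stopifnot(
    "df must be numeric" = is.numeric(df),
    "externalVar must have one value per row of df" =
      length(externalVar) == n,
    "numCovars must have n rows" =
      is.null(numCovars) || nrow(as.matrix(numCovars)) == n,
    "catCovars must have n rows" =
      is.null(catCovars) || nrow(as.matrix(catCovars)) == n
  )
  invisible(TRUE)
}

# Simulated inputs (an assumption): 10 samples, 4 features.
set.seed(1)
df_sim  <- matrix(rnorm(40), nrow = 10)
ext_sim <- rnorm(10)
age_sim <- matrix(rnorm(10), ncol = 1)            # one numerical covariate
sex_sim <- matrix(rep(c("F", "M"), 5), ncol = 1)  # one categorical covariate

ok <- check_osca_inputs(df_sim, ext_sim,
                        numCovars = age_sim, catCovars = sex_sim)
```

Passing an externalVar of the wrong length, or a data frame that still contains a character ID column, stops with the corresponding message.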

Details

OmicS-data-based Complex trait Analysis (OSCA) is a suite of C++ functions. In order to use the OSCA functions, the user must specify the absolute path to the OSCA software, which can be downloaded from the Yang Lab website (https://yanglab.westlake.edu.cn/software/osca/).

Here, we use OSCA's Omics Restricted Maximum Likelihood (OREML) method to estimate the percent of variance in an external phenotype that can be explained by an omics profile, akin to heritability estimates in GWAS.

Value

Row of OREML output containing percent variance explained in external data and standard error

Author(s)

Katelyn Queen, [email protected]

References

Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57 (1995) 289–300.

Martin P, et al. Novel aspects of PPARalpha-mediated regulation of lipid and xenobiotic metabolism revealed through a nutrigenomic study. Hepatology, in press, 2007.

Millstein J, Battaglin F, Barrett M, Cao S, Zhang W, Stintzing S, et al. Partition: a surjective mapping approach for dimensionality reduction. Bioinformatics 36 (2019) 676–681. doi:10.1093/bioinformatics/btz661.

Queen K, Nguyen MN, Gilliland F, Chun S, Raby BA, Millstein J. ACDC: a general approach for detecting phenotype or exposure associated co-expression. Frontiers in Medicine (2023) 10. doi:10.3389/fmed.2023.1118824.

See Also

OSCA software - https://yanglab.westlake.edu.cn/software/osca/

Examples

# load CCA package for example dataset
library(CCA)

# load dataset
data("nutrimouse")

# run function; input absolute path to OSCA software before running
## Not run: OSCA_singleValue(df = nutrimouse$gene, 
                  externalVar = as.numeric(nutrimouse$diet),
                  oscaPath = "pathHere")
## End(Not run)