Title: | Latent Unknown Clustering Integrating Multi-View Data |
---|---|
Description: | An implementation of the LUCID model (Peng (2019) <doi:10.1093/bioinformatics/btz667>). LUCID conducts integrated clustering using exposures, omics data (and outcome as an option). An EM algorithm is implemented to estimate MLE of the LUCID model. 'LUCIDus' features integrated variable selection, incorporation of missing omics data, bootstrap inference, prediction and visualization of the model. |
Authors: | Yinqi Zhao [aut, cre] , David Conti [ths] , Jesse Goodrich [ctb] , Cheng Peng [ctb] |
Maintainer: | Yinqi Zhao <[email protected]> |
License: | GPL-3 |
Version: | 2.2.1 |
Built: | 2024-11-21 03:20:13 UTC |
Source: | https://github.com/uscbiostats/lucidus |
Generate R
bootstrap replicates of LUCID parameters and
derive confidence interval (CI) base on bootstrap. Bootstrap replicates are
generated based on nonparameteric resampling, implemented by ordinary
method of codeboot::boot function.
boot_lucid(G, Z, Y, CoG = NULL, CoY = NULL, model, conf = 0.95, R = 100)
boot_lucid(G, Z, Y, CoG = NULL, CoY = NULL, model, conf = 0.95, R = 100)
G |
Exposures, a numeric vector, matrix, or data frame. Categorical variable should be transformed into dummy variables. If a matrix or data frame, rows represent observations and columns correspond to variables. |
Z |
Omics data, a numeric matrix or data frame. Rows correspond to observations and columns correspond to variables. |
Y |
Outcome, a numeric vector. Categorical variable is not allowed. Binary outcome should be coded as 0 and 1. |
CoG |
Optional, covariates to be adjusted for estimating the latent cluster. A numeric vector, matrix or data frame. Categorical variable should be transformed into dummy variables. |
CoY |
Optional, covariates to be adjusted for estimating the association between latent cluster and the outcome. A numeric vector, matrix or data frame. Categorical variable should be transformed into dummy variables. |
model |
A LUCID model fitted by |
conf |
A numeric scalar between 0 and 1 to specify confidence level(s) of the required interval(s). |
R |
An integer to specify number of bootstrap replicates for LUCID model. If feasible, it is recommended to set R >= 1000. |
A list, containing the following components:
beta |
effect estimate for each exposure |
mu |
cluster-specific mean for each omics feature |
gamma |
effect estiamte for the association btween latent cluster and outcome |
bootstrap |
The |
## Not run: # use simulated data G <- sim_data$G Z <- sim_data$Z Y_normal <- sim_data$Y_normal # fit lucid model fit1 <- est_lucid(G = G, Z = Z, Y = Y_normal, family = "normal", K = 2, seed = 1008) # conduct bootstrap resampling boot1 <- boot_lucid(G = G, Z = Z, Y = Y_normal, model = fit1, R = 100) # check distribution for bootstrap replicates of the variable of interest plot(boot1$bootstrap, 1) # use 90% CI boot2 <- boot_lucid(G = G, Z = Z, Y = Y_normal, model = fit1, R = 100, conf = 0.9) ## End(Not run)
## Not run: # use simulated data G <- sim_data$G Z <- sim_data$Z Y_normal <- sim_data$Y_normal # fit lucid model fit1 <- est_lucid(G = G, Z = Z, Y = Y_normal, family = "normal", K = 2, seed = 1008) # conduct bootstrap resampling boot1 <- boot_lucid(G = G, Z = Z, Y = Y_normal, model = fit1, R = 100) # check distribution for bootstrap replicates of the variable of interest plot(boot1$bootstrap, 1) # use 90% CI boot2 <- boot_lucid(G = G, Z = Z, Y = Y_normal, model = fit1, R = 100, conf = 0.9) ## End(Not run)
This function deprecates. Please use boot_lucid instead.
boot.lucid(G, Z, Y, CoG = NULL, CoY = NULL, model, conf = 0.95, R = 100)
boot.lucid(G, Z, Y, CoG = NULL, CoY = NULL, model, conf = 0.95, R = 100)
G |
Exposures, a numeric vector, matrix, or data frame. Categorical variable should be transformed into dummy variables. If a matrix or data frame, rows represent observations and columns correspond to variables. |
Z |
Omics data, a numeric matrix or data frame. Rows correspond to observations and columns correspond to variables. |
Y |
Outcome, a numeric vector. Categorical variable is not allowed. Binary outcome should be coded as 0 and 1. |
CoG |
Optional, covariates to be adjusted for estimating the latent cluster. A numeric vector, matrix or data frame. Categorical variable should be transformed into dummy variables. |
CoY |
Optional, covariates to be adjusted for estimating the association between latent cluster and the outcome. A numeric vector, matrix or data frame. Categorical variable should be transformed into dummy variables. |
model |
A LUCID model fitted by |
conf |
A numeric scalar between 0 and 1 to specify confidence level(s) of the required interval(s). |
R |
An integer to specify number of bootstrap replicates for LUCID model. If feasible, it is recommended to set R >= 1000. |
Check missing patterns in omics data Z
check_na(Z)
check_na(Z)
Z |
A data matrix representing omics data |
1. index:indeces for missing values in omics data 2. indicator_na: missing pattern for each observation 3. impute_flag: - flag to initialize imputation. Only happens when sporadic missing pattern is observed
The Latent Unknown Clustering with Integrated Data (LUCID) performs integrative clustering using multi-view data. LUCID model is estimated via EM algorithm for model-based clustering. It also features variable selection, integrated imputation, bootstrap inference and visualization via Sankey diagram.
est_lucid( G, Z, Y, CoG = NULL, CoY = NULL, K = 2, family = c("normal", "binary"), useY = TRUE, tol = 0.001, max_itr = 1000, max_tot.itr = 10000, Rho_G = 0, Rho_Z_Mu = 0, Rho_Z_Cov = 0, modelName = NULL, seed = 123, init_impute = c("mclust", "lod"), init_par = c("mclust", "random"), verbose = FALSE )
est_lucid( G, Z, Y, CoG = NULL, CoY = NULL, K = 2, family = c("normal", "binary"), useY = TRUE, tol = 0.001, max_itr = 1000, max_tot.itr = 10000, Rho_G = 0, Rho_Z_Mu = 0, Rho_Z_Cov = 0, modelName = NULL, seed = 123, init_impute = c("mclust", "lod"), init_par = c("mclust", "random"), verbose = FALSE )
G |
Exposures, a numeric vector, matrix, or data frame. Categorical variable should be transformed into dummy variables. If a matrix or data frame, rows represent observations and columns correspond to variables. |
Z |
Omics data, a numeric matrix or data frame. Rows correspond to observations and columns correspond to variables. |
Y |
Outcome, a numeric vector. Categorical variable is not allowed. Binary outcome should be coded as 0 and 1. |
CoG |
Optional, covariates to be adjusted for estimating the latent cluster. A numeric vector, matrix or data frame. Categorical variable should be transformed into dummy variables. |
CoY |
Optional, covariates to be adjusted for estimating the association between latent cluster and the outcome. A numeric vector, matrix or data frame. Categorical variable should be transformed into dummy variables. |
K |
Number of latent clusters. An integer greater or equal to 2. User
can use |
family |
Distribution of outcome. For continuous outcome, use "normal"; for binary outcome, use "binary". Default is "normal". |
useY |
Flag to include information of outcome when estimating the latent cluster. Default is TRUE. |
tol |
Tolerance for convergence of EM algorithm. Default is 1e-3. |
max_itr |
Max number of iterations for EM algorithm. |
max_tot.itr |
Max number of total iterations for |
Rho_G |
A scalar. This parameter is the LASSO penalty to regularize
exposures. If user wants to tune the penalty, use the wrapper
function |
Rho_Z_Mu |
A scalar. This parameter is the LASSO penalty to
regularize cluster-specific means for omics data (Z). If user wants to tune the
penalty, use the wrapper function |
Rho_Z_Cov |
A scalar. This parameter is the graphical LASSO
penalty to estimate sparse cluster-specific variance-covariance matrices for omics
data (Z). If user wants to tune the penalty, use the wrapper function |
modelName |
The variance-covariance structure for omics data.
See |
seed |
An integer to initialize the EM algorithm or imputing missing values. Default is 123. |
init_impute |
Method to initialize the imputation of missing values in
LUCID. "mclust" will use |
init_par |
Method to initialize the EM algorithm. "mclust" will use mclust model to initialize parameters; "random" initialize parameters from uniform distribution. |
verbose |
A flag indicates whether detailed information for each iteration of EM algorithm is printed in console. Default is FALSE. |
A list which contains the several features of LUCID, including:
pars |
Estimates of parameters of LUCID, including beta (effect of exposure), mu (cluster-specific mean for omics data), sigma (cluster-specific variance-covariance matrix for omics data) and gamma (effect estimate of association between latent cluster and outcome) |
K |
Number of latent cluster |
modelName |
Geometric model to estiamte variance-covariance matrix for omics data |
likelihood |
The log likelihood of the LUCID model |
post.p |
Posterior inclusion probability (PIP) for assigning observation i to latent cluster j |
Z |
If missing values are observed, this is the complet dataset for omics data with missing values imputed by LUCID |
Cheng Peng, Jun Wang, Isaac Asante, Stan Louie, Ran Jin, Lida Chatzi, Graham Casey, Duncan C Thomas, David V Conti, A Latent Unknown Clustering Integrating Multi-Omics Data (LUCID) with Phenotypic Traits, Bioinformatics, btz667, https://doi.org/10.1093/bioinformatics/btz667.
## Not run: # use simulated data G <- sim_data$G Z <- sim_data$Z Y_normal <- sim_data$Y_normal Y_binary <- sim_data$Y_binary cov <- sim_data$Covariate # fit LUCID model with continuous outcome fit1 <- est_lucid(G = G, Z = Z, Y = Y_normal, family = "normal", K = 2, seed = 1008) # fit LUCID model with block-wise missing pattern in omics data Z_miss_1 <- Z Z_miss_1[sample(1:nrow(Z), 0.3 * nrow(Z)), ] <- NA fit2 <- est_lucid(G = G, Z = Z_miss_1, Y = Y_normal, family = "normal", K = 2) # fit LUCID model with sporadic missing pattern in omics data Z_miss_2 <- Z index <- arrayInd(sample(length(Z_miss_2), 0.3 * length(Z_miss_2)), dim(Z_miss_2)) Z_miss_2[index] <- NA # initialize imputation by imputing fit3 <- est_lucid(G = G, Z = Z_miss_2, Y = Y_normal, family = "normal", K = 2, seed = 1008, init_impute = "lod") LOD # initialize imputation by mclust fit4 <- est_lucid(G = G, Z = Z_miss_2, Y = Y, family = "normal", K = 2, seed = 123, init_impute = "mclust") # fit LUCID model with binary outcome fit5 <- est_lucid(G = G, Z = Z, Y = Y_binary, family = "binary", K = 2, seed = 1008) # fit LUCID model with covariates fit6 <- est_lucid(G = G, Z = Z, Y = Y_binary, CoY = cov, family = "binary", K = 2, seed = 1008) # use LUCID model to conduct integrated variable selection # select exposure fit6 <- est_lucid(G = G, Z = Z, Y = Y_normal, CoY = NULL, family = "normal", K = 2, seed = 1008, Rho_G = 0.1) # select omics data fit7 <- est_lucid(G = G, Z = Z, Y = Y_normal, CoY = NULL, family = "normal", K = 2, seed = 1008, Rho_Z_Mu = 90, Rho_Z_Cov = 0.1, init_par = "random") ## End(Not run)
## Not run: # use simulated data G <- sim_data$G Z <- sim_data$Z Y_normal <- sim_data$Y_normal Y_binary <- sim_data$Y_binary cov <- sim_data$Covariate # fit LUCID model with continuous outcome fit1 <- est_lucid(G = G, Z = Z, Y = Y_normal, family = "normal", K = 2, seed = 1008) # fit LUCID model with block-wise missing pattern in omics data Z_miss_1 <- Z Z_miss_1[sample(1:nrow(Z), 0.3 * nrow(Z)), ] <- NA fit2 <- est_lucid(G = G, Z = Z_miss_1, Y = Y_normal, family = "normal", K = 2) # fit LUCID model with sporadic missing pattern in omics data Z_miss_2 <- Z index <- arrayInd(sample(length(Z_miss_2), 0.3 * length(Z_miss_2)), dim(Z_miss_2)) Z_miss_2[index] <- NA # initialize imputation by imputing fit3 <- est_lucid(G = G, Z = Z_miss_2, Y = Y_normal, family = "normal", K = 2, seed = 1008, init_impute = "lod") LOD # initialize imputation by mclust fit4 <- est_lucid(G = G, Z = Z_miss_2, Y = Y, family = "normal", K = 2, seed = 123, init_impute = "mclust") # fit LUCID model with binary outcome fit5 <- est_lucid(G = G, Z = Z, Y = Y_binary, family = "binary", K = 2, seed = 1008) # fit LUCID model with covariates fit6 <- est_lucid(G = G, Z = Z, Y = Y_binary, CoY = cov, family = "binary", K = 2, seed = 1008) # use LUCID model to conduct integrated variable selection # select exposure fit6 <- est_lucid(G = G, Z = Z, Y = Y_normal, CoY = NULL, family = "normal", K = 2, seed = 1008, Rho_G = 0.1) # select omics data fit7 <- est_lucid(G = G, Z = Z, Y = Y_normal, CoY = NULL, family = "normal", K = 2, seed = 1008, Rho_Z_Mu = 90, Rho_Z_Cov = 0.1, init_par = "random") ## End(Not run)
This function deprecates. Please use est_lucid instead.
est.lucid( G, Z, Y, CoG = NULL, CoY = NULL, K = 2, family = c("normal", "binary"), useY = TRUE, tol = 0.001, max_itr = 1000, max_tot.itr = 10000, Rho_G = 0, Rho_Z_Mu = 0, Rho_Z_Cov = 0, modelName = "VVV", seed = 123, init_impute = c("mclust", "lod"), init_par = c("mclust", "random"), verbose = FALSE )
est.lucid( G, Z, Y, CoG = NULL, CoY = NULL, K = 2, family = c("normal", "binary"), useY = TRUE, tol = 0.001, max_itr = 1000, max_tot.itr = 10000, Rho_G = 0, Rho_Z_Mu = 0, Rho_Z_Cov = 0, modelName = "VVV", seed = 123, init_impute = c("mclust", "lod"), init_par = c("mclust", "random"), verbose = FALSE )
G |
Exposures, a numeric vector, matrix, or data frame. Categorical variable should be transformed into dummy variables. If a matrix or data frame, rows represent observations and columns correspond to variables. |
Z |
Omics data, a numeric matrix or data frame. Rows correspond to observations and columns correspond to variables. |
Y |
Outcome, a numeric vector. Categorical variable is not allowed. Binary outcome should be coded as 0 and 1. |
CoG |
Optional, covariates to be adjusted for estimating the latent cluster. A numeric vector, matrix or data frame. Categorical variable should be transformed into dummy variables. |
CoY |
Optional, covariates to be adjusted for estimating the association between latent cluster and the outcome. A numeric vector, matrix or data frame. Categorical variable should be transformed into dummy variables. |
K |
Number of latent clusters. An integer greater or equal to 2. User
can use |
family |
Distribution of outcome. For continuous outcome, use "normal"; for binary outcome, use "binary". Default is "normal". |
useY |
Flag to include information of outcome when estimating the latent cluster. Default is TRUE. |
tol |
Tolerance for convergence of EM algorithm. Default is 1e-3. |
max_itr |
Max number of iterations for EM algorithm. |
max_tot.itr |
Max number of total iterations for |
Rho_G |
A scalar. This parameter is the LASSO penalty to regularize
exposures. If user wants to tune the penalty, use the wrapper
function |
Rho_Z_Mu |
A scalar. This parameter is the LASSO penalty to
regularize cluster-specific means for omics data (Z). If user wants to tune the
penalty, use the wrapper function |
Rho_Z_Cov |
A scalar. This parameter is the graphical LASSO
penalty to estimate sparse cluster-specific variance-covariance matrices for omics
data (Z). If user wants to tune the penalty, use the wrapper function |
modelName |
The variance-covariance structure for omics data.
See |
seed |
An integer to initialize the EM algorithm or imputing missing values. Default is 123. |
init_impute |
Method to initialize the imputation of missing values in
LUCID. "mclust" will use |
init_par |
Method to initialize the EM algorithm. "mclust" will use mclust model to initialize parameters; "random" initialize parameters from uniform distribution. |
verbose |
A flag indicates whether detailed information for each iteration of EM algorithm is printed in console. Default is FALSE. |
Impute missing data by optimizing the likelihood function
fill_data(obs, mu, sigma, p, index)
fill_data(obs, mu, sigma, p, index)
obs |
a vector of length M |
mu |
a matrix of size M x K |
sigma |
a matrix of size M x M x K |
p |
a vector of length K |
index |
a vector of length M, indicating whether a value is missing or not in the raw data |
an observation with updated imputed value
generate bootstrp ci (normal, basic and percentile)
gen_ci(x, conf = 0.95)
gen_ci(x, conf = 0.95)
x |
an object return by boot function |
conf |
A numeric scalar between 0 and 1 to specify confidence level(s) of the required interval(s). |
a matrix, the first column is t0 statistic from original model
The Human Early-Life Exposome (HELIX) project is multi-center research project that aims to characterize early-life environmental exposures and associate these with omics biomarkers and child health outcomes (Vrijheid, 2014. doi: 10.1289/ehp.1307204). We used a subset of HELIX data from Exposome Data Challenge 2021 (hold by ISGlobal) as an example to illustrate LUCID model.
helix_data
helix_data
A list with 4 matrices corresponding to exposures (G), omics data (Z), outcome (Y) and covariates (CoY)
8 exposures to environmental pollutants. Variables end with m represent maternal exposures; end with c represent children exposures
10 proteins
A continuous outcome for BMI-z score based on WHO standard, A binary outcome for body mass index categories at 6-11 years old based on WHO reference (0: Thinness or Normal; 1: Overweight or Obese)
3 covariates including mother's bmi, child sex, maternal age
Impute missing data in Z by maximizing the likelihood given fixed parameters of LUCID
Istep_Z(Z, p, mu, sigma, index)
Istep_Z(Z, p, mu, sigma, index)
Z |
an N by P matrix representing the omics data |
p |
an N by K matrix representing posterior inclusion probability for each latent cluster |
mu |
an M by K matrix representing cluster-specific means |
sigma |
an M by M by K array representing cluster-specific covariance |
index |
an N by M matrix representing missing values in Z |
a complete dataset of Z
Fit a lucid model for integrated analysis on exposure, outcome and multi-omics data
lucid( G, Z, Y, CoG = NULL, CoY = NULL, family = "normal", K = 2, Rho_G = 0, Rho_Z_Mu = 0, Rho_Z_Cov = 0, verbose_tune = FALSE, ... )
lucid( G, Z, Y, CoG = NULL, CoY = NULL, family = "normal", K = 2, Rho_G = 0, Rho_Z_Mu = 0, Rho_Z_Cov = 0, verbose_tune = FALSE, ... )
G |
Exposures, a numeric vector, matrix, or data frame. Categorical variable should be transformed into dummy variables. If a matrix or data frame, rows represent observations and columns correspond to variables. |
Z |
Omics data, a numeric matrix or data frame. Rows correspond to observations and columns correspond to variables. |
Y |
Outcome, a numeric vector. Categorical variable is not allowed. Binary outcome should be coded as 0 and 1. |
CoG |
Optional, covariates to be adjusted for estimating the latent cluster. A numeric vector, matrix or data frame. Categorical variable should be transformed into dummy variables. |
CoY |
Optional, covariates to be adjusted for estimating the association between latent cluster and the outcome. A numeric vector, matrix or data frame. Categorical variable should be transformed into dummy variables. |
family |
Distribution of outcome. For continuous outcome, use "normal"; for binary outcome, use "binary". Default is "normal". |
K |
Number of latent clusters (should be greater or equal than 2). Either an integer or a vector of integer. If K is a vector, model selection on K is performed. |
Rho_G |
A scalar or a vector. This parameter is the LASSO penalty to regularize
exposures. If it is a vector, |
Rho_Z_Mu |
A scalar or a vector. This parameter is the LASSO penalty to
regularize cluster-specific means for omics data (Z). If it is a vector,
|
Rho_Z_Cov |
A scalar or a vector. This parameter is the graphical LASSO
penalty to estimate sparse cluster-specific variance-covariance matrices for omics
data (Z). If it is a vector, |
verbose_tune |
A flag to print details of tuning process. |
... |
Other parameters passed to |
An optimal lucid model
## Not run: G <- sim_data$G Z <- sim_data$Z Y_normal <- sim_data$Y_normal Y_binary <- sim_data$Y_binary cov <- sim_data$Covariate # fit lucid model fit1 <- lucid(G = G, Z = Z, Y = Y_normal, family = "normal") fit2 <- lucid(G = G, Z = Z, Y = Y_binary, family = "binary", useY = FALSE) # including covariates fit3 <- lucid(G = G, Z = Z, Y = Y_binary, family = "binary", CoG = cov) fit4 <- lucid(G = G, Z = Z, Y = Y_binary, family = "binary", CoY = cov) # tune K fit5 <- lucid(G = G, Z = Z, Y = Y_binary, family = "binary", K = 2:5) # variable selection fit6 <- lucid(G = G, Z = Z, Y = Y_binary, family = "binary", Rho_G = seq(0.01, 0.1, by = 0.01)) fit7 <- lucid(G = G, Z = Z, Y = Y_binary, family = "binary", Rho_Z_Mu = seq(10, 100, by = 10), Rho_Z_Cov = 0.5, init_par = "random", verbose_tune = TRUE) ## End(Not run)
## Not run: G <- sim_data$G Z <- sim_data$Z Y_normal <- sim_data$Y_normal Y_binary <- sim_data$Y_binary cov <- sim_data$Covariate # fit lucid model fit1 <- lucid(G = G, Z = Z, Y = Y_normal, family = "normal") fit2 <- lucid(G = G, Z = Z, Y = Y_binary, family = "binary", useY = FALSE) # including covariates fit3 <- lucid(G = G, Z = Z, Y = Y_binary, family = "binary", CoG = cov) fit4 <- lucid(G = G, Z = Z, Y = Y_binary, family = "binary", CoY = cov) # tune K fit5 <- lucid(G = G, Z = Z, Y = Y_binary, family = "binary", K = 2:5) # variable selection fit6 <- lucid(G = G, Z = Z, Y = Y_binary, family = "binary", Rho_G = seq(0.01, 0.1, by = 0.01)) fit7 <- lucid(G = G, Z = Z, Y = Y_binary, family = "binary", Rho_Z_Mu = seq(10, 100, by = 10), Rho_Z_Cov = 0.5, init_par = "random", verbose_tune = TRUE) ## End(Not run)
In the Sankey diagram, each node either represents a variable (exposure, omics or outcome) or a latent cluster. Each line represents an association. The color of the node represents variable type, either exposure, omics or outcome. The width of the line represents the effect size of a certain association; the color of the line represents the direction of a certain association.
plot_lucid( x, G_color = "dimgray", X_color = "#eb8c30", Z_color = "#2fa4da", Y_color = "#afa58e", pos_link_color = "#67928b", neg_link_color = "#d1e5eb", fontsize = 7 )
plot_lucid( x, G_color = "dimgray", X_color = "#eb8c30", Z_color = "#2fa4da", Y_color = "#afa58e", pos_link_color = "#67928b", neg_link_color = "#d1e5eb", fontsize = 7 )
x |
A LUCID model fitted by |
G_color |
Color of node for exposure |
X_color |
Color of node for latent cluster |
Z_color |
Color of node for omics data |
Y_color |
Color of node for outcome |
pos_link_color |
Color of link corresponds to positive association |
neg_link_color |
Color of link corresponds to negative association |
fontsize |
Font size for annotation |
A DAG graph created by sankeyNetwork
## Not run: # prepare data G <- sim_data$G Z <- sim_data$Z Y_normal <- sim_data$Y_normal Y_binary <- sim_data$Y_binary cov <- sim_data$Covariate # plot lucid model fit1 <- est_lucid(G = G, Z = Z, Y = Y_normal, CoY = NULL, family = "normal", K = 2, seed = 1008) plot_lucid(fit1) # change node color plot_lucid(fit1, G_color = "yellow") plot_lucid(fit1, Z_color = "red") # change link color plot_lucid(fit1, pos_link_color = "red", neg_link_color = "green") ## End(Not run)
## Not run: # prepare data G <- sim_data$G Z <- sim_data$Z Y_normal <- sim_data$Y_normal Y_binary <- sim_data$Y_binary cov <- sim_data$Covariate # plot lucid model fit1 <- est_lucid(G = G, Z = Z, Y = Y_normal, CoY = NULL, family = "normal", K = 2, seed = 1008) plot_lucid(fit1) # change node color plot_lucid(fit1, G_color = "yellow") plot_lucid(fit1, Z_color = "red") # change link color plot_lucid(fit1, pos_link_color = "red", neg_link_color = "green") ## End(Not run)
Predict cluster assignment and outcome based on LUCID model
predict_lucid(model, G, Z, Y = NULL, CoG = NULL, CoY = NULL, response = TRUE)
predict_lucid(model, G, Z, Y = NULL, CoG = NULL, CoY = NULL, response = TRUE)
model |
A model fitted and returned by |
G |
Exposures, a numeric vector, matrix, or data frame. Categorical variable should be transformed into dummy variables. If a matrix or data frame, rows represent observations and columns correspond to variables. |
Z |
Omics data, a numeric matrix or data frame. Rows correspond to observations and columns correspond to variables. |
Y |
Outcome, a numeric vector. Categorical variable is not allowed. Binary outcome should be coded as 0 and 1. |
CoG |
Optional, covariates to be adjusted for estimating the latent cluster. A numeric vector, matrix or data frame. Categorical variable should be transformed into dummy variables. |
CoY |
Optional, covariates to be adjusted for estimating the association between latent cluster and the outcome. A numeric vector, matrix or data frame. Categorical variable should be transformed into dummy variables. |
response |
If TRUE, when predicting binary outcome, the response will be returned. If FALSE, the linear predictor is returned. |
A list contains predicted latent cluster and outcome for each observation
## Not run: # prepare data G <- sim_data$G Z <- sim_data$Z Y_normal <- sim_data$Y_normal # fit lucid model fit1 <- est_lucid(G = G, Z = Z, Y = Y_normal, K = 2, family = "normal") # prediction on training set pred1 <- predict_lucid(model = fit1, G = G, Z = Z, Y = Y_normal) pred2 <- predict_lucid(model = fit1, G = G, Z = Z) ## End(Not run)
## Not run: # prepare data G <- sim_data$G Z <- sim_data$Z Y_normal <- sim_data$Y_normal # fit lucid model fit1 <- est_lucid(G = G, Z = Z, Y = Y_normal, K = 2, family = "normal") # prediction on training set pred1 <- predict_lucid(model = fit1, G = G, Z = Z, Y = Y_normal) pred2 <- predict_lucid(model = fit1, G = G, Z = Z) ## End(Not run)
est_lucid
Print the output of est_lucid
## S3 method for class 'lucid' print(x, ...)
## S3 method for class 'lucid' print(x, ...)
x |
An object of LUCID model, returned by |
... |
Other arguments to be passed to |
Print the output of LUCID in a nicer table
## S3 method for class 'sumlucid' print(x, ...)
## S3 method for class 'sumlucid' print(x, ...)
x |
An object returned by |
... |
Other parameters to be passed to |
This is an example dataset to illustrate LUCID model. It is simulated by assuming there are 2 latent clusters in the data. We assume the exposures are associated with latent cluster which ultimately affects the PFAS concentration and liver injury in children. The latent clusters are also characterized by differential levels of metabolites.
sim_data
sim_data
A list with 5 matrices corresponding to exposures (G), omics data (Z), a continuous outcome, a binary outcome and 2 covariates (can be used either as CoX or CoY). Each matrice contains 2000 observations.
10 exposures
10 metabolites
Outcome, PFAS concentration in children
Bianry outcome, liver injury status
2 continous covariates, can be treated as either CoX or CoY
Latent clusters
Summarize results of LUCID model
summary_lucid(object, boot.se = NULL)
summary_lucid(object, boot.se = NULL)
object |
A LUCID model fitted by |
boot.se |
An object returned by |
## Not run: # use simulated data G <- sim_data$G Z <- sim_data$Z Y_normal <- sim_data$Y_normal # fit lucid model fit1 <- est_lucid(G = G, Z = Z, Y = Y_normal, family = "normal", K = 2, seed = 1008) # conduct bootstrap resampling boot1 <- boot_lucid(G = G, Z = Z, Y = Y_normal, model = fit1, R = 100) # summarize lucid model summary_lucid(fit1) # summarize lucid model with bootstrap CIs summary_lucid(fit1, boot.se = boot1) ## End(Not run)
## Not run: # use simulated data G <- sim_data$G Z <- sim_data$Z Y_normal <- sim_data$Y_normal # fit lucid model fit1 <- est_lucid(G = G, Z = Z, Y = Y_normal, family = "normal", K = 2, seed = 1008) # conduct bootstrap resampling boot1 <- boot_lucid(G = G, Z = Z, Y = Y_normal, model = fit1, R = 100) # summarize lucid model summary_lucid(fit1) # summarize lucid model with bootstrap CIs summary_lucid(fit1, boot.se = boot1) ## End(Not run)
Given a grid of K and L1 penalties (incluing Rho_G, Rho_Z_mu and Rho_Z_Cov), fit LUCID model over all combinations of K and L1 penalties to determine the optimal penalty.
tune_lucid( G, Z, Y, CoG = NULL, CoY = NULL, family = "normal", K = 2:5, Rho_G = 0, Rho_Z_Mu = 0, Rho_Z_Cov = 0, ... )
tune_lucid( G, Z, Y, CoG = NULL, CoY = NULL, family = "normal", K = 2:5, Rho_G = 0, Rho_Z_Mu = 0, Rho_Z_Cov = 0, ... )
G |
Exposures, a numeric vector, matrix, or data frame. Categorical variable should be transformed into dummy variables. If a matrix or data frame, rows represent observations and columns correspond to variables. |
Z |
Omics data, a numeric matrix or data frame. Rows correspond to observations and columns correspond to variables. |
Y |
Outcome, a numeric vector. Categorical variable is not allowed. Binary outcome should be coded as 0 and 1. |
CoG |
Optional, covariates to be adjusted for estimating the latent cluster. A numeric vector, matrix or data frame. Categorical variable should be transformed into dummy variables. |
CoY |
Optional, covariates to be adjusted for estimating the association between latent cluster and the outcome. A numeric vector, matrix or data frame. Categorical variable should be transformed into dummy variables. |
family |
Distribution of outcome. For continuous outcome, use "normal"; for binary outcome, use "binary". Default is "normal". |
K |
Number of latent clusters. An integer greater or equal to 2. If K is a vector, model selection on K is performed |
Rho_G |
A scalar or a vector. This parameter is the LASSO penalty to regularize
exposures. If it is a vector, |
Rho_Z_Mu |
A scalar or a vector. This parameter is the LASSO penalty to
regularize cluster-specific means for omics data (Z). If it is a vector,
|
Rho_Z_Cov |
A scalar or a vector. This parameter is the graphical LASSO
penalty to estimate sparse cluster-specific variance-covariance matrices for omics
data (Z). If it is a vector, |
... |
Other parameters passed to |
A list:
best_model |
the best model over different combination of tuning parameters |
tune_list |
a data frame contains combination of tuning parameters and c orresponding BIC |
res_model |
a list of LUCID models corresponding to each combination of tuning parameters |
## Not run: # use simulated data G <- sim_data$G Z <- sim_data$Z Y_normal <- sim_data$Y_normal # find the optimal model over the grid of K tune_K <- tune_lucid(G = G, Z = Z, Y = Y_normal, useY = FALSE, tol = 1e-3, seed = 1, K = 2:5) # tune penalties tune_Rho_G <- tune_lucid(G = G, Z = Z, Y = Y_normal, useY = FALSE, tol = 1e-3, seed = 1, K = 2, Rho_G = c(0.1, 0.2, 0.3, 0.4)) tune_Rho_Z_Mu <- tune_lucid(G = G, Z = Z, Y = Y_normal, useY = FALSE, tol = 1e-3, seed = 1, K = 2, Rho_Z_Mu = c(10, 20, 30, 40)) tune_Rho_Z_Cov <- tune_lucid(G = G, Z = Z, Y = Y_normal, useY = FALSE, tol = 1e-3, seed = 1, K = 2, Rho_Z_Cov = c(0.1, 0.2, 0.3)) ## End(Not run)
## Not run: # use simulated data G <- sim_data$G Z <- sim_data$Z Y_normal <- sim_data$Y_normal # find the optimal model over the grid of K tune_K <- tune_lucid(G = G, Z = Z, Y = Y_normal, useY = FALSE, tol = 1e-3, seed = 1, K = 2:5) # tune penalties tune_Rho_G <- tune_lucid(G = G, Z = Z, Y = Y_normal, useY = FALSE, tol = 1e-3, seed = 1, K = 2, Rho_G = c(0.1, 0.2, 0.3, 0.4)) tune_Rho_Z_Mu <- tune_lucid(G = G, Z = Z, Y = Y_normal, useY = FALSE, tol = 1e-3, seed = 1, K = 2, Rho_Z_Mu = c(10, 20, 30, 40)) tune_Rho_Z_Cov <- tune_lucid(G = G, Z = Z, Y = Y_normal, useY = FALSE, tol = 1e-3, seed = 1, K = 2, Rho_Z_Cov = c(0.1, 0.2, 0.3)) ## End(Not run)