Yi-Juan Hu
LDM: Testing Hypotheses about the Microbiome using an Ordination-based Linear Decomposition Model


LDM is an R package that provides a single analysis path that includes distance-based ordination, global tests of any effect of the microbiome, and tests of the effects of individual operational taxonomic units (OTUs) with false discovery rate (FDR)-based correction for multiple testing. It accommodates both continuous and discrete variables (e.g., clinical outcomes, environmental factors, treatment groups) as well as interaction terms to be tested either singly or in combination, allows for adjustment of confounding covariates, and uses permutation-based p-values that can control for correlation (e.g., repeated measurements on the same individual). It can also be applied to transformed data, and an “omnibus” test can easily combine results from analyses conducted on different transformation scales.
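As an illustration of FDR-based correction, the sketch below computes Benjamini-Hochberg adjusted p-values in Python. It is a generic example, not code from the LDM package: the function name `bh_adjust` is hypothetical, and the FDR procedure that LDM applies to its permutation-based p-values may differ, so please consult the manual for the actual method.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Hypothetical helper for illustration; not part of the LDM package.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    otu_pvalues = [0.01, 0.04, 0.03, 0.005]  # made-up per-OTU p-values
    print(bh_adjust(otu_pvalues))
```

An OTU would then be reported at a nominal FDR level (say 10%) when its adjusted p-value falls below that level.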


To download the R package for Linux/Mac/Windows, please click here. To download the vignette, please click here. To download the manual, please click here.
TASER: Test of Association using SEquencing Reads


TASER is a command-line program written in C/C++ for testing association using sequencing reads without calling genotypes. TASER constructs the weighted or unweighted burden test for rare-variant associations. TASER includes a screening procedure to estimate the loci that are variants and a bootstrap procedure for assessing the significance of the burden statistic. Our test is robust to a wide range of differential sequencing qualities between cases and controls, and is at least as powerful as the standard genotype-calling approach when the latter controls the type I error.


To download TASER for 64-bit X86 based Linux, documentation, and an example dataset, please click here.


process_BAM_for_TASER.bash is a utility program to generate input files used by TASER. To download, please click here.

TASER-PC: Robust Inference of Population Structure from Next-Generation Sequencing Data


TASER-PC is a command-line program written in C/C++ for estimating principal components (PCs) that uses read count data directly. TASER-PC uses a subsampling procedure and a read-flipping procedure to adjust the data so that the sequencing quality appears to be equal among groups. TASER-PC is robust to various differential sequencing qualities among a number of groups. TASER-PC can perform subsampling and read-flipping in parallel if number_jobs is greater than 1 and the specified number of cores is available.


To download TASER-PC for 64-bit X86 based Linux, documentation, and an example dataset, please click here.
PhredEM: a phred-score-informed genotype-calling approach


PhredEM is a command-line program written in C/C++ for genotype calling in next-generation sequencing studies. PhredEM estimates base-calling error rates from the read data while incorporating the information in phred scores. PhredEM can also identify loci with no variation through a simple, computationally efficient screening algorithm.
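For readers unfamiliar with phred scores, the following Python sketch shows the standard relationship between a phred score Q and the nominal base-calling error probability, P = 10^(-Q/10), together with decoding of the common Phred+33 ASCII quality encoding. It only illustrates the input information; it is not PhredEM's estimation procedure, and the function names are hypothetical.

```python
def decode_phred33(quality_string):
    """Convert a Phred+33 encoded quality string to integer phred scores."""
    return [ord(ch) - 33 for ch in quality_string]


def phred_to_error_prob(q):
    """Nominal base-calling error probability implied by phred score q."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    scores = decode_phred33("I5+")  # made-up quality string
    for q in scores:
        print(q, phred_to_error_prob(q))
```

The nominal probabilities can be miscalibrated in practice, which is the motivation for estimating error rates from the read data rather than taking the scores at face value.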


To download PhredEM for 64-bit X86 based Linux, documentation, and an example dataset, please click here.


process_BAM_for_PhredEM.bash is a utility program to generate input files used by PhredEM. To download, please click here.

TRECASE_MLE: eQTL Mapping based on Total Read Count and Allele-Specific Expression in RNA-Seq Data with Maximum-Likelihood Estimation


TRECASE_MLE is a command-line program written in C/C++ for eQTL mapping with RNA-seq data. TRECASE_MLE implements the following steps of the five-step pipeline: (step 1) testing every local SNP for association with the expression of a gene and reporting the SNP with the minimum p-value (referred to as the minimum-p SNP) for each gene; (step 2) assessing the significance of every minimum-p SNP by a permutation process; (step 4) conducting the cis-trans test at every minimum-p SNP; and (step 5) estimating the effect size at every minimum-p SNP. All of these steps are performed for the TReC and TReCASE models in parallel. Note that Step 3 of the five-step pipeline, which detects eQTLs among genome-wide minimum-p SNPs by FDR control, can be performed using the R utility program “detect_eQTLs_byFDRcontrol.R” (provided in the zip file) based on the output of TRECASE_MLE; this R program generates the final list of detected eQTLs with determined cis or trans mechanisms and estimated effect sizes. We are working intensely to improve the capabilities of TRECASE_MLE, so please check back frequently for updates.
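The idea behind steps 1 and 2 can be illustrated with a small Python sketch. It is not the TRECASE_MLE implementation: a simple absolute Pearson correlation stands in for the TReC and TReCASE likelihood-based tests, and all function names and data are hypothetical. With a fixed sample size, the SNP with the largest absolute correlation is the minimum-p SNP, and permuting the expression values gives a null distribution for that maximum.

```python
import random


def abs_corr(x, y):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def best_snp(genotypes, expression):
    """Return (index, statistic) of the SNP most associated with expression."""
    stats = [abs_corr(g, expression) for g in genotypes]
    idx = max(range(len(stats)), key=stats.__getitem__)
    return idx, stats[idx]


def min_p_permutation(genotypes, expression, n_perm=1000, seed=0):
    """Permutation p-value for the most associated (minimum-p) SNP of a gene."""
    rng = random.Random(seed)
    idx, observed = best_snp(genotypes, expression)
    permuted = list(expression)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)  # break the SNP-expression link
        if best_snp(genotypes, permuted)[1] >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return idx, (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    snps = [[0, 1, 2, 0, 1, 2, 0, 1], [0, 0, 1, 1, 2, 2, 0, 1]]
    expr = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0]
    print(min_p_permutation(snps, expr, n_perm=500, seed=1))
```

Taking the maximum over all local SNPs inside each permutation is what accounts for the multiple SNPs tested per gene.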


To download TRECASE_MLE for 64-bit X86 based Linux, documentation, and an example dataset, please click here.
SEQGWAS: Integrative Analysis of Sequencing and GWAS Data


SEQGWAS is a command-line program written in C/C++ for integrative analysis of sequencing and GWAS data. SEQGWAS produces all commonly used gene-level tests, including the burden test, variable threshold (VT) test, and sequence-kernel association test (SKAT), all of which are based on the score statistic for assessing the effects of individual variants on the trait of interest. SEQGWAS calculates the score statistic based on the observed genotypes for sequenced subjects and the imputed genotypes for non-sequenced subjects, and constructs a robust variance estimator that reflects the true variability of the score statistic regardless of the sampling scheme and imputation quality, so that the corresponding association tests always have correct type I error.
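To make the score-statistic formulation concrete, the Python sketch below computes a burden test in its usual textbook form, T = (w'U)^2 / (w'Vw), referred to a chi-square distribution with one degree of freedom. Here U is the vector of single-variant score statistics, V its covariance matrix, and w the variant weights. The function name and inputs are hypothetical, and the sketch takes U and V as given; it does not reproduce how SEQGWAS combines observed and imputed genotypes or constructs its robust variance estimator.

```python
import math


def burden_test(U, V, weights=None):
    """Score-based burden statistic and its 1-df chi-square p-value.

    U: list of single-variant score statistics.
    V: covariance matrix of U as a list of lists.
    weights: optional per-variant weights (defaults to all ones).
    """
    m = len(U)
    w = weights if weights is not None else [1.0] * m
    numerator = sum(w[j] * U[j] for j in range(m))
    variance = sum(w[j] * V[j][k] * w[k] for j in range(m) for k in range(m))
    if variance <= 0:
        raise ValueError("w'Vw must be positive")
    stat = numerator * numerator / variance
    # For a 1-df chi-square variable, P(X > t) = erfc(sqrt(t / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    U = [1.0, 2.0]                  # made-up score statistics
    V = [[1.0, 0.0], [0.0, 1.0]]    # made-up covariance matrix
    print(burden_test(U, V))
```

The VT and SKAT statistics are built from the same U and V but have less simple null distributions, so they are omitted here.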


To download SEQGWAS for 64-bit X86 based Linux, documentation, and an example dataset, please click here.
MAGA: Meta-Analysis of Gene-Level Associations


MAGA is a command-line program written in C/C++ for meta-analysis of gene-level associations based on single-variant statistics (i.e., p-values of association tests and effect estimates) of rare variants from participating studies. MAGA recovers the multivariate statistics of gene-level association tests from single-variant statistics together with the correlation matrix of the single-variant test statistics, which is estimated from one of the participating studies or from a publicly available database. MAGA accommodates any disease phenotype and any study design and produces all commonly used gene-level tests, i.e., the burden, variable threshold, and variance-component tests. MAGA can perform meta-analysis of gene-level associations by combining rare variants in sequencing studies or by combining low-frequency variants in genome-wide association studies (GWAS). By treating each variant as a “gene”, MAGA can also perform meta-analysis of single-variant associations, which is more stable than the inverse-variance method in the presence of rare or low-frequency variants.
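One way to picture the recovery step is the Python sketch below. It rests on assumptions that are ours rather than taken from the MAGA documentation: each study reports an effect estimate and a standard error per variant, and the usual large-sample approximations hold (estimate ≈ U/V and squared standard error ≈ 1/V for each variant). Under these assumptions the score vector U and its covariance matrix V can be rebuilt from the single-variant results and a correlation matrix. The function name and numbers are hypothetical, and MAGA's actual computations may differ.

```python
def recover_score_statistics(beta, se, corr):
    """Rebuild score statistics U and covariance V from single-variant results.

    beta: per-variant effect estimates.
    se:   per-variant standard errors.
    corr: correlation matrix of the single-variant test statistics.
    """
    m = len(beta)
    # Diagonal of V: approximate information for each variant.
    info = [1.0 / (s * s) for s in se]
    # Score statistic for each variant: estimate times information.
    U = [b * i for b, i in zip(beta, info)]
    # Off-diagonal covariances follow from the supplied correlations.
    V = [[corr[j][k] * (info[j] * info[k]) ** 0.5 for k in range(m)]
         for j in range(m)]
    return U, V


if __name__ == "__main__":
    beta = [0.5, -0.2]                  # made-up effect estimates
    se = [0.5, 0.1]                     # made-up standard errors
    corr = [[1.0, 0.5], [0.5, 1.0]]     # made-up correlation matrix
    print(recover_score_statistics(beta, se, corr))
```

In a meta-analysis, the U vectors and V matrices recovered from each study would typically be summed across studies before forming the gene-level tests.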


To download MAGA for 64-bit X86 based Linux, documentation, and an example dataset, please click here.
CNVstat: Statistical Association Analysis of Copy Number Variants


Copy number variants (CNVs) and single nucleotide polymorphisms (SNPs) co-exist throughout the human genome and jointly contribute to phenotypic variations. Thus, it is desirable to consider both types of variants, as characterized by allele-specific copy numbers (ASCNs), in association studies of complex human diseases. Current SNP genotyping technologies capture the CNV and SNP information simultaneously via fluorescent intensity measurements. The common practice of calling ASCNs from the intensity measurements and then using the ASCN calls in downstream association analysis has important limitations. First, the association tests are prone to false-positive findings when differential measurement errors between cases and controls arise from differences in DNA quality or handling. Second, the uncertainties in the ASCN calls are ignored.


CNVstat is a command-line program written in C/C++ for the statistical association analysis of CNVs and SNPs. CNVstat allows the user to estimate or test the effects of CNVs and SNPs by maximizing the (observed-data) likelihood that properly accounts for differential measurement errors and calling uncertainties. It is versatile in several aspects: (1) it provides the integrated analysis of CNVs and SNPs as well as the analysis of total CNVs; (2) it can accommodate both Affymetrix and Illumina data, as well as all platforms that assay CNVs quantitatively, such as array CGH; (3) it accounts for case-control sampling, differential measurement errors and calling uncertainties; (4) it can be readily extended to other study designs and traits; (5) it formulates the effects of CNVs and SNPs on the phenotype through flexible regression models, which can accommodate various genetic mechanisms and gene-environment interactions; and (6) it allows genetic and environmental variables to be correlated. The program is fast and scalable to genome-wide association scans. For example, it took about 2 hours on a 64-bit, 3.0-GHz Intel Xeon machine to perform the analysis on chromosome 1 of the schizophrenia data (Hu et al., submitted for publication).


For more information or to download CNVstat, please click here.
tagIMPUTE: Tag-based Imputation


TagIMPUTE is a command-line program written in C/C++ for the imputation of untyped SNPs. TagIMPUTE is based on a few flanking SNPs that can optimally predict the SNP under imputation. For more details, see Hu and Lin (2010).


For more information or to download tagIMPUTE, please click here.
SNPMStat: Statistical Analysis of SNP-Disease Association with Missing Genotype Data


SNPMStat is a command-line program written in C/C++ for the statistical analysis of SNP-disease association in case-control/cohort/cross-sectional studies with potentially missing genotype data. SNPMStat allows the user to estimate or test SNP effects and SNP-environment interactions by maximizing the (observed-data) likelihood that properly accounts for phase uncertainty, study design and gene-environment dependence. For SNPs without missing data, the program performs the standard association analysis. For typed SNPs with missing data or untyped SNPs, the program performs the maximum-likelihood analysis described in Lin, Hu and Huang (2008) and Hu, Lin and Zeng (2010).


For more information or to download SNPMStat, please click here.