Documentation

Index

Description

Here, we explain how Stargazer works using the CYP2D6 gene as an example.

Stargazer diagram
Figure 1. A schematic diagram of the Stargazer CYP2D6 genotyping pipeline. This figure was taken from Lee et al., 2018.

Input and output data of Stargazer

The Stargazer CYP2D6 genotyping pipeline is outlined in Figure 1. The pipeline uses BAM files comprised of sequence reads aligned with BWA-MEM to human reference genome assembly GRCh37. BAM files are then used to generate a VCF file with GATK-HaplotypeCaller, from which Stargazer extracts all SNVs/indels located within 3kb from either end of CYP2D6. More specifically, Stargazer stores the genomic position of each variant, the reference allele, the alternate allele(s), the genotype status (homozygous or heterozygous), and the allelic depth for each sample. Stargazer uses the variant information from the VCF file to call star alleles based on SNVs/indels.

BAM files are also used to calculate read depth for CYP2D6 and CYP2D7 with GATK-DepthOfCoverage. For convenience, we will refer to this output as a target GDF (GATK-DepthOfCoverage format) file. Since the high homology between CYP2D6 and CYP2D7 can cause reads to align to erroneous or multiple locations, only uniquely mapping reads with a mapping quality >= 20 are counted. Similarly, a control GDF file is produced from a user-chosen locus which serves as a read depth normalization factor. Stargazer computes paralog-specific copy number using read depth from target and control GDF files in order to detect SVs.

The output data of Stargazer include each sample's CYP2D6 diplotype, predicted phenotype, and plots to visually inspect copy number for CYP2D6 and CYP2D7 (Figure 2). Stargazer also returns all detected SNVs/indels including those that are novel as well as those that are known but are not currently used to define any star allele, so investigators will know which variants to potentially follow up. In addition, those variants can be annotated with SeattleSeq Annotation and used to define new star alleles. Note that when calling diplotypes, Stargazer only considers those variants that are currently used by the Pharmacogene Variation Consortium.

Structural variation
Figure 2. Examples of SVs detected by Stargazer. Grey dots are copy number calculated from read depth. Dots colored purple and orange are the mean copy number for CYP2D6 and CYP2D7, respectively, determined by the changepoint algorithm. Each panel contains scaled CYP2D6 and CYP2D7 gene models, in which the exons and introns are depicted with boxes and lines, respectively. (A) shows no SVs and is included for comparison. (B) shows a gene deletion (CYP2D6*5). (C) shows a gene duplication (CYP2D6*2x2). (D) shows both a gene duplication and a gene conversion (CYP2D6*68+*4). (E) shows multiple gene duplications and gene conversions (CYP2D6*36+*10/*36+*10). (F) shows a CYP2D6-to-CYP2D7 gene conversion (CYP2D6*78+*2). This figure was taken from Lee et al., 2018.

Prediction of star alleles

From a VCF file, Stargazer uses Beagle to haplotype phase heterozygous variants for CYP2D6 with over 2,500 reference samples from the 1000 Genomes Project. Stargazer then matches phased haplotypes to star alleles with a translation table built from publicly available data (e.g., the Pharmacogene Variation Consortium). The table contains information on more than 90 star alleles and 185 SNVs/indels including variant positions and nucleotide changes in relation to the reference CYP2D6*1 allele and human reference genome assembly GRCh37.

Detection of SVs

From a target GDF file, Stargazer converts read depth for CYP2D6 and CYP2D7 to copy number by performing intra- and inter-sample normalizations. The former accounts for individual variation in sequencing efficiency using read depth from a control GDF file, while the latter considers heterogeneity in coverage across all samples. Stargazer then automates detection of SVs with changepoint, an R package that approximates one or more points at which the statistical properties of a sequence of observations change. Here, the sequence is DNA, the observation is per-base copy number, and the statistical property is the mean copy number. If there is a significant shift in the mean copy number (e.g., from 2 to 1), the algorithm returns the change point location and the two mean values (e.g., 2 and 1).

Identification of diplotypes

For samples without SVs, Stargazer determines CYP2D6 diplotypes by combining the star allele used to assign each phased haplotype. For samples with a whole gene deletion, the affected haplotype is assigned the CYP2D6*5 deletion allele, which is then combined with the star allele assigned to the other haplotype to form a diplotype. For samples with a whole gene duplication, the affected haplotype will be assigned "x2" (e.g., CYP2D6*1x2, *2x2, etc.) because it has two gene copies of CYP2D6. For samples with more complex SVs such as CYP2D6/CYP2D7 hybrids, individual algorithms have been developed to determine diplotypes.

Assignment of predicted phenotypes

There are four CYP2D6 metabolizer classes: poor, intermediate, normal, and ultrarapid. To predict these phenotypes, Stargazer first translates CYP2D6 diplotypes into a standard unit of enzyme activity known as an activity score. The fully functional reference CYP2D6*1 allele is assigned a value of 1, decreased function alleles such as CYP2D6*10 and *17 receive a value of 0.5, and non-functional alleles including CYP2D6*4 and *5 have a value of 0. The sum of values assigned to both alleles constitutes the activity score of a diplotype. Consequently, subjects with CYP2D6*1/*1, *1/*4 and *4/*5 diplotypes will have an activity score of 2, 1 and 0, respectively. These activity scores are used to predict the four metabolizer classes as following: poor, 0; intermediate, 0.5; normal, 1 to 2; ultrarapid, greater than 2.

Arguments

Argument Description
Tools
genotype tool for genotyping PGx gene from genomic data
view tool for viewing result for various analyses
summary tool for creating project-level summary report from genotype result
meta tool for creating comparison chart from two or more summary reports
sgep tool for running per-project genotyping pipeline with SGE
sges tool for running per-sample genotyping pipeline with SGE
Inputs
--analysis, -a type of analysis
--target_gene, -t name of target gene (e.g. cyp2d6)
--control_gene, -c name of control gene (choose among egfr, ryr1, and vdr), custom, or cnsr
--vcf path to VCF file (.vcf or .vcf.gz)
--gdf path to GDF file (.gdf)
--data source of input data (choose among wgs, ts, and chip)
--output_prefix, -o prefix for output files
--item tool-specific input item (see the Usage section for details)
--list tool-specific list of space-delimited input items (see the Usage section for details)
--region chromosome interval (e.g. chr22:42546883-42551883)
--gatk path to GATK file (GenomeAnalysisTK.jar)
--java version of Java (e.g. 8u25) for GATK
--assembly path to reference genome assembly file (.fa or .fasta) for GATK
--dbsnp path to dbSNP file (.vcf) for GATK
Parameters
--ref_vcf path to reference haplotype panel file (.vcf or .vcf.gz) for statistical phasing (default: 1KGP)
--working_dir, -w path to directory at which output will be saved (default: current directory)
--batch, -b number of sample batches for parallelization (default: 1)
Flags
--keep do not discard intermediate files
--detail output more detailed versions of the CN plot and the AF plot
--impute allow imputation of ungenotyped markers during statistical phasing
--sampling allow sampling when computing read depth for control gene (using this flag should speed up processing)

Usage

Below are example commands that show how to run each of the Stargazer tools for typical use cases. For each synopsis, optional arguments are shown in square brackets and user-provided arguments in capital letters. WE STRONGLY RECOMMEND that you try running these commands with the test datasets included in Stargazer.

1. genotype

Genotyping with NGS data

python3 stargazer.py genotype \
-o OUTPUT_PREFIX \
-t TARGET_GENE \
-c CONTROL_GENE \
--vcf VCF_FILE \
--gdf GDF_FILE \
--data INPUT_TYPE \
[-w WORKING_DIR] \
[--keep] \
[--ref_vcf REF_VCF_FILE] \
[--impute] \
[--detail] \
[--list SAMPLE_ID_1 SAMPLE_ID_2 ...]

If the input is TS data, you can use --list tag to provide a list of reference samples that are known to have no SVs and to make Stargazer use this set of samples as the normalization factor during intersample normalization.

Genotyping with VCF only

python3 stargazer.py genotype \
-o OUTPUT_PREFIX \
-t TARGET_GENE \
--vcf VCF_FILE \
--data INPUT_TYPE \
[-w WORKING_DIR] \
[--keep] \
[--ref_vcf REF_VCF_FILE] \
[--impute]

In the VCF-only mode, Stargazer assumes the samples have no SVs. This assumption holds well for most of the PGx genes currently targeted by Stargazer except for few such as CYP2D6. See the DPSV page for the complete list of the PGx genes in which Stargazer has detected SVs.

2. view

Viewing the list of SNPs and their information of a given star allele for specific samples (e.g. genotype call, allelic depth, and functional annotation)

python3 stargazer.py view \
-a snp \
-o OUTPUT_PREFIX \
-t TARGET_GENE \
--list SAMPLE_STAR_PAIR_1 SAMPLE_STAR_PAIR_2 ...

For the SNP analysis, Stargazer assumes the current working diretory already contains the required sif.finalized.vcf file, which can be created by the genotype tool with the --keep tag.

Use a slash (/) to separate the sample ID and the star allele in each pair (e.g. "--list sample_id_1/*10 sample_id_1/*36 sample_id_2/*10 sample_id_2/*36").

3. summary

Creating project-level summary report from genotype result

python3 stargazer.py summary \
--item RESULT_FILE

The --item tag specifies the path to the project's result file (.result.txt), which is one of the standard outputs of the genotype tool.

4. meta

Creating a comparison chart for multiple projects

python3 stargazer.py meta \
-o OUTPUT_PREFIX \
--list SUMMARY_FILE_1 SUMMARY_FILE_2 ...

Use the --list tag to provide two or more summary files (.result.summary.txt), which are the standard output of the summary tool.

5. sgep

Running the per-project genotyping pipeline with SGE

python3 stargazer.py sgep \
-o OUTPUT_PREFIX \
--item MANIFEST_FILE \
-t TARGET_GENE \
-c CONTROL_GENE \
--data INPUT_TYPE \
[-b BATCH] \
[-w WORKING_DIR] \
[--keep] \
[--detail] \
[--sampling]

The --item tag specifies the path to the project's manifest file. This file must be tab-delimited and have two required columns with the following headers: sample_id and bam.

6. sges

Running the per-sample genotyping pipeline with SGE

python3 stargazer.py sges \
-o OUTPUT_PREFIX \
--item BAM_FILE \
-c CONTROL_GENE \
[--list TARGET_GENE_1 TARGET_GENE_2 ...]

The --item tag specifies the path to the sample's BAM file (.bam).

The default is to attempt genotyping all of the PGx genes currently targeted by Stargazer. If you only want a subset, use the --list tag (e.g. "--list cyp2a6 cyp2d6 tpmt").

Interpretation

Explanation of the headers in the .result.txt file (produced from the genotype tool)

name: Sample ID used in the VCF file and the GDF file

status: Genotype status with following choices: g (genotyped), ng (not genotyped), qc (failed to pass quality control)

hap1_main: Main star allele for the first haplotype

hap2_main: Main star allele for the second haplotype

hap1_cand: Candidate star alleles for the first haplotype

hap2_cand: Candidate star alleles for the second haplotype

hap1_score: Activity score for the first haplotype

hap2_score: Activity score for the second haplotype

dip_score: Combined activity score for both haplotypes

phenotype: Predicted phenotype based on the diplotype activity score

dip_sv: Combination of two unphased structural variant (SV) calls

hap1_sv: Structural variant (SV) call for the first haplotype

hap2_sv: Structural variant (SV) call for the second haplotype

ssr: The sum of squared residuals (SSR) computed during the SV detection step

dip_cand: All possible candidate star alleles, obtained by combining the SNPs of both haplotypes to create a pseudo-haplotype

hap1_main_snp: SNPs of the first haplotype's main star allele

hap2_main_snp: SNPs of the second haplotype's main star allele

hap1_af_mean_gene: Mean allele fraction for the first haplotype, calculated using all detected SNPs from the target gene

hap2_af_mean_gene: Mean allele fraction for the second haplotype, calculated using all detected SNPs from the target gene

hap1_af_mean_main: Mean allele fraction for the first haplotype, calculated using the SNPs of the main star allele

hap2_af_mean_main: Mean allele fraction for the second haplotype, calculated using the SNPs of the main star allele