Documentation

Index

Description

Here, we explain how Stargazer works using the CYP2D6 gene as an example.

Stargazer diagram
Figure 1. A schematic diagram of the Stargazer CYP2D6 genotyping pipeline. This figure was taken from Lee et al., 2018.

Input and output data of Stargazer

The Stargazer CYP2D6 genotyping pipeline is outlined in Figure 1. The pipeline uses BAM files comprised of sequence reads aligned with BWA-MEM to human reference genome assembly GRCh37. BAM files are then used to generate a VCF file with GATK-HaplotypeCaller, from which Stargazer extracts all SNVs/indels located within 3kb from either end of CYP2D6. More specifically, Stargazer stores the genomic position of each variant, the reference allele, the alternate allele(s), the genotype status (homozygous or heterozygous), and the allelic depth for each sample. Stargazer uses the variant information from the VCF file to call star alleles based on SNVs/indels.

BAM files are also used to calculate read depth for CYP2D6 and CYP2D7 with GATK-DepthOfCoverage. For convenience, we will refer to this output as a target GDF (GATK-DepthOfCoverage format) file. Since the high homology between CYP2D6 and CYP2D7 can cause reads to align to erroneous or multiple locations, only uniquely mapping reads with a mapping quality >= 20 are counted. Similarly, a control GDF file is produced from a user-chosen locus which serves as a read depth normalization factor. Stargazer computes paralog-specific copy number using read depth from target and control GDF files in order to detect SVs.

The output data of Stargazer include each sample's CYP2D6 diplotype, predicted phenotype, and plots to visually inspect copy number for CYP2D6 and CYP2D7 (Figure 2). Stargazer also returns all detected SNVs/indels including those that are novel as well as those that are known but are not currently used to define any star allele, so investigators will know which variants to potentially follow up. In addition, those variants can be annotated with SeattleSeq Annotation and used to define new star alleles. Note that when calling diplotypes, Stargazer only considers those variants that are currently used by the Pharmacogene Variation Consortium.

Structural variation
Figure 2. Examples of SVs detected by Stargazer. Grey dots are copy number calculated from read depth. Dots colored purple and orange are the mean copy number for CYP2D6 and CYP2D7, respectively, determined by the changepoint algorithm. Each panel contains scaled CYP2D6 and CYP2D7 gene models, in which the exons and introns are depicted with boxes and lines, respectively. (A) shows no SVs and is included for comparison. (B) shows a gene deletion (CYP2D6*5). (C) shows a gene duplication (CYP2D6*2x2). (D) shows both a gene duplication and a gene conversion (CYP2D6*68+*4). (E) shows multiple gene duplications and gene conversions (CYP2D6*36+*10/*36+*10). (F) shows a CYP2D6-to-CYP2D7 gene conversion (CYP2D6*78+*2). This figure was taken from Lee et al., 2018.

Prediction of star alleles

From a VCF file, Stargazer uses Beagle to haplotype phase heterozygous variants for CYP2D6 with over 2,500 reference samples from the 1000 Genomes Project. Stargazer then matches phased haplotypes to star alleles with a translation table built from publicly available data (e.g., the Pharmacogene Variation Consortium). The table contains information on more than 90 star alleles and 185 SNVs/indels including variant positions and nucleotide changes in relation to the reference CYP2D6*1 allele and human reference genome assembly GRCh37.

Detection of SVs

From a target GDF file, Stargazer converts read depth for CYP2D6 and CYP2D7 to copy number by performing intra- and inter-sample normalizations. The former accounts for individual variation in sequencing efficiency using read depth from a control GDF file, while the latter considers heterogeneity in coverage across all samples. Stargazer then automates detection of SVs with changepoint, an R package that approximates one or more points at which the statistical properties of a sequence of observations change. Here, the sequence is DNA, the observation is per-base copy number, and the statistical property is the mean copy number. If there is a significant shift in the mean copy number (e.g., from 2 to 1), the algorithm returns the change point location and the two mean values (e.g., 2 and 1).

Identification of diplotypes

For samples without SVs, Stargazer determines CYP2D6 diplotypes by combining the star allele used to assign each phased haplotype. For samples with a whole gene deletion, the affected haplotype is assigned the CYP2D6*5 deletion allele, which is then combined with the star allele assigned to the other haplotype to form a diplotype. For samples with a whole gene duplication, the affected haplotype will be assigned "x2" (e.g., CYP2D6*1x2, *2x2, etc.) because it has two gene copies of CYP2D6. For samples with more complex SVs such as CYP2D6/CYP2D7 hybrids, individual algorithms have been developed to determine diplotypes.

Assignment of predicted phenotypes

There are four CYP2D6 metabolizer classes: poor, intermediate, normal, and ultrarapid. To predict these phenotypes, Stargazer first translates CYP2D6 diplotypes into a standard unit of enzyme activity known as an activity score. The fully functional reference CYP2D6*1 allele is assigned a value of 1, decreased function alleles such as CYP2D6*10 and *17 receive a value of 0.5, and non-functional alleles including CYP2D6*4 and *5 have a value of 0. The sum of values assigned to both alleles constitutes the activity score of a diplotype. Consequently, subjects with CYP2D6*1/*1, *1/*4 and *4/*5 diplotypes will have an activity score of 2, 1 and 0, respectively. These activity scores are used to predict the four metabolizer classes as following: poor, 0; intermediate, 0.5; normal, 1 to 2; ultrarapid, greater than 2.

Usage

The most current version of Stargazer (v1.0.6) has four tools: genotype, summary, meta, and sge. For each synopsis, frequently used optional arguments are shown in square brackets. The complete list of arguments can be found in the next section.

Genotype using targeted sequencing data

python3 stargazer.py genotype \
-t TARGET_GENE \
-c CONTROL_GENE \
--vcf VCF \
--gdf GDF \
[-o OUTPUT_PREFIX] \
[-w WORKING_DIR] \
[--keep] \
[--detail] \
[--ref_samples]

Genotype using WGS data

python3 stargazer.py genotype \
-t TARGET_GENE \
-c CONTROL_GENE \
--vcf VCF \
--gdf GDF \
--wgs \
[-o OUTPUT_PREFIX] \
[-w WORKING_DIR] \
[--keep] \
[--detail]

Genotype only using VCF from NGS data

Note: Stargazer will assume the samples have no SVs. This is a fair assumption for most PGx genes targeted by Stargazer except few such as CYP2D6. See the DPSV page for the complete list of PGx genes where Stargazer has detected SVs so far.

python3 stargazer.py genotype \
-t TARGET_GENE \
--vcf VCF \
[--wgs]

Genotype only using VCF from chip data

Note: Stargazer will assume the samples have no SVs. This is a fair assumption for most PGx genes targeted by Stargazer except few such as CYP2D6. See the DPSV page for the complete list of PGx genes where Stargazer has detected SVs so far.

python3 stargazer.py genotype \
-t TARGET_GENE \
--vcf VCF \
--chip

Create summary report from Stargazer result

python3 stargazer.py summary \
--result prefix.result.csv

Combine several summary reports for meta analysis

python3 stargazer.py meta \
--summary prefix1.result.summary.txt \
--summary prefix2.result.summary.txt \
[...]

Run the entire Stargazer genotyping pipeline with Sun Grid Engine (SGE)

python3 stargazer.py sge \
-o OUTPUT_PREFIX \
-m MANIFEST \
-t TARGET_GENE \
-c CONTROL_GENE \
[-b BATCH] \
[-w WORKING_DIR] \
[--keep] \
[--detail] \
[--wgs] \
[--sampling]

Arguments

Tools

genotype: Genotype using various genomic data (WGS, targeted sequencing, and SNP array)

summary: Create a summary report using results generated from the genotype tool

meta: Compare summary reports generated from the summary tool

sge: Create shell scripts and directories for running the entire Stargazer genotyping pipeline via SGE

Inputs

--target_gene, -t [string]: Target gene (e.g., cyp2d6, gstt1, ugt2b15)

--control_gene, -c [string]: Control gene (e.g., egfr, ryr1, vdr)

--vcf [file]: Target .vcf file

--gdf [file]: Target .gdf file

--output_prefix, -o [string]: Output file prefix

--result [file]: Output .result.txt file from the genotype tool

--summary [file]: Output .result.summary.txt file from the summary tool

--manifest, -m [file]: Manifest .csv file with two required columns, 'sample_id' and 'bam'

--ref_samples [file]: A file containing a single column of sample ID(s). Use this argument if you already know samples that don't have SVs (used as reference during inter-sample normalization)

--region [string]: A chromosome interval with the format chr:start-end (e.g., chr22:42546883-42551883)

Parameters

--batch, -b [positive number]: Number of batches used for parallelization (default: 1)

--working_dir, -w [directory]: Directory at which output will be saved (default: current directory)

--ref_vcf [file]: Reference haplotype panel used for phasing (default: 1000 Genomes Project)

Flags

--keep: Keep all intermediate files

--wgs: Omit the intersample normalization if input is WGS data (should not be used with targeted sequencing data)

--chip: Assume input VCF was generated from chip data

--sampling: Allow sampling when getting read depth for control locus (use this flag to increase the speed)

--impute: Impute ungenotyped SNVs/indels during haplotype phasing by Beagle

--detail: Output detailed copy number and allele fraction profiles