Intro
NOTE: THIS DOCUMENTATION IS FOR STARGAZER_V2.0. If you are using an older version, please download the latest version.
Stargazer is a set of utilities that began as a single-purpose tool for genotyping pharmacogenetics (PGx) genes from next-generation sequencing (NGS) data. It has since expanded to genotype PGx genes from both NGS data and single nucleotide polymorphism (SNP) array data, and to provide comprehensive end-to-end genotyping solutions. Stargazer can also create a genotype data file (GDF) from a BAM/CRAM file.
You can get help on running Stargazer in the console by entering:
python3 stargazer --help
The files and arguments you supply determine the mode in which Stargazer runs. To perform structural variation detection, you must provide both '-c/--control-gene' and '-g/--gdf-file'. Without the control gene and GDF file, Stargazer runs in VCF-only mode. To create a GDF using Stargazer, run:
python3 stargazer --create-gdf-file <name_for_GDF> --control-gene <control_gene> --bam-file <path_to_BAM>
For more information, refer to the extensive documentation.
Usage
Description
Stargazer v2.0 has been simplified to a standalone program that runs one of three commonly used functions of the original Stargazer: genotyping with read depth information, genotyping without read depth information, and generating the file (GDF) required to call star alleles with copy number or structural variant contributions. In this section, we will first explain how Stargazer's genotyping algorithm works using the CYP2D6 gene as an example. After that, we will provide some example usages and the full list of command-line arguments.
Figure 1. A schematic diagram of the Stargazer genotyping pipeline for any star allele that, like many in CYP2D6, requires read depth information to determine copy number variation (CNV) or other structural variant (SV) contributions to the genotype. For star alleles defined solely by single-nucleotide variants (SNVs) or small insertions/deletions (indels), the left path of the flow, which uses a VCF file as input, is all that is required. This figure was modified from an original taken from Lee et al., 2018.
Part I. Input and output data of Stargazer
The Stargazer CYP2D6 genotyping pipeline is outlined in Figure 1. The pipeline uses BAM- or CRAM-formatted files containing sequence reads aligned with a mapping program (such as BWA-MEM or DRAGEN) to either the GRCh37 or GRCh38 human reference genome assembly. From these BAM or CRAM files, the user must first generate a VCF file (again, with variant-calling software such as GATK-HaplotypeCaller or DRAGEN), from which Stargazer extracts all SNVs/indels located within 3 kb of either end of CYP2D6. More specifically, Stargazer stores the genomic position of each variant, the reference allele, the alternate allele(s), the genotype status (homozygous or heterozygous), and the allelic depth for each sample. Stargazer uses the variant information from the VCF file to call star alleles based on SNVs/indels.
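The per-variant information listed above (position, reference allele, alternate alleles, genotype status, allelic depth) can be illustrated with a minimal Python sketch. This is not Stargazer's own parser: the `Variant` class, the `parse_variant` function, and the gene coordinates in the example are hypothetical, and the sketch assumes a single-sample VCF data line whose FORMAT column contains `GT` and, optionally, `AD`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Variant:
    pos: int                  # genomic position (1-based, as in the VCF)
    ref: str                  # reference allele
    alts: List[str]           # alternate allele(s)
    genotype: str             # "heterozygous", "homozygous", "reference", or "missing"
    allelic_depth: List[int]  # per-allele read depth from the AD field, if present


def parse_variant(line: str, gene_start: int, gene_end: int,
                  padding: int = 3000) -> Optional[Variant]:
    """Parse one single-sample VCF data line.

    Returns None when the variant lies more than `padding` bases (3 kb by
    default) outside the gene. Coordinates are supplied by the caller.
    """
    fields = line.rstrip("\n").split("\t")
    pos = int(fields[1])
    if pos < gene_start - padding or pos > gene_end + padding:
        return None

    # Pair FORMAT keys (column 9) with the sample's values (column 10).
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))

    # Treat phased ("|") and unphased ("/") genotypes the same way here.
    alleles = sample.get("GT", ".").replace("|", "/").split("/")
    if "." in alleles:
        genotype = "missing"
    elif len(set(alleles)) > 1:
        genotype = "heterozygous"
    elif alleles[0] == "0":
        genotype = "reference"
    else:
        genotype = "homozygous"

    # AD may be absent or "."; keep only well-formed integer entries.
    depth = [int(x) for x in sample.get("AD", "").split(",") if x.isdigit()]
    return Variant(pos, fields[3], fields[4].split(","), genotype, depth)


# Toy example with made-up coordinates (gene spans 1000-2000).
example = "22\t1500\t.\tG\tA\t50\tPASS\t.\tGT:AD\t0/1:12,9"
variant = parse_variant(example, gene_start=1000, gene_end=2000)
```

A multi-sample VCF would need the same logic applied to each sample column; that loop is left out to keep the sketch short.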
BAM/CRAM files are also used to calculate read depth for CYP2D6 and CYP2D7 with a new internal feature of Stargazer. For convenience, we will continue to refer to this output as the GDF (GATK-DepthOfCoverage format) file. With the -G/--create-gdf-file option, Stargazer v2.0 uses internal Python routines to collect read depth across both the gene of interest (target gene) and a user-chosen reference gene with a known copy number of 2 (control gene), which serves as a read depth normalization factor; both sets of information are stored in one GDF file. Since the high homology between CYP2D6 and CYP2D7 can cause reads to align to erroneous or multiple locations, only uniquely mapping reads with a mapping quality >= 20 are counted. Stargazer computes paralog-specific copy number using read depth information from the target and control genes in the GDF file in order to detect SVs.
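The role of the control gene as a normalization factor can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `per_base_copy_number` is hypothetical, the control gene is assumed to have exactly two copies, and the scaling by the control gene's mean depth is a simple stand-in for whatever normalization Stargazer applies internally.

```python
from statistics import mean
from typing import List, Sequence


def per_base_copy_number(target_depth: Sequence[float],
                         control_depth: Sequence[float],
                         control_copies: int = 2) -> List[float]:
    """Scale per-base target depth by the mean depth of the control gene.

    Assumption: the control gene has `control_copies` copies, so its mean
    depth corresponds to that copy number.
    """
    control_mean = mean(control_depth)
    if control_mean <= 0:
        raise ValueError("control gene has no usable read depth")
    return [control_copies * depth / control_mean for depth in target_depth]


# Toy depths: the second half of the target has half the control's coverage.
copy_number = per_base_copy_number([30, 30, 15, 15], [30, 28, 32])
```

With these toy values, the first two positions come out at two copies and the last two at one copy, which is the kind of shift in mean copy number that SV detection looks for.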
The output data of Stargazer include each sample's CYP2D6 diplotype, predicted phenotype, and plots to visually inspect copy number for CYP2D6 and CYP2D7 (Figure 2). Stargazer also returns, in a finalized.vcf file, all detected SNVs/indels, including novel variants and known variants that are not currently used to define any star allele, so investigators know which variants to potentially follow up. In addition, those variants can be annotated with SeattleSeq Annotation and used to define new star alleles. Note that when calling diplotypes, Stargazer only considers those variants that are currently used by the Pharmacogene Variation Consortium (PharmVar).
Figure 2. An example of a hybrid star allele detected by Stargazer: CYP2D6 *2x2/*68+*4 diplotype. (*68 is a CYP2D6-CYP2D7 hybrid with a switch region in intron 1; it is found in tandem with *4.) Green dots are points of raw read depth, shown here for both the target and control genes. Grey dots are copy number calculated from read depth. The light and dark blue dots show the allele fraction values of relevant SNV or indel variants.
Part II. Prediction of star alleles
From a VCF file, Stargazer uses Beagle to haplotype phase heterozygous variants for CYP2D6 with over 2,500 reference samples from the 1000 Genomes Project. Stargazer then matches phased haplotypes to star alleles with a translation table built from publicly available data (e.g., the Pharmacogene Variation Consortium). The table contains information on more than 90 star alleles and 185 SNVs/indels, including variant positions and nucleotide changes in relation to the reference CYP2D6*1 allele and the GRCh37 and GRCh38 human reference genome assemblies.
Part III. Detection of SVs
From a GDF file, Stargazer converts read depth for CYP2D6 and CYP2D7 (target gene) to copy number by performing intra- and inter-sample normalizations. The former accounts for individual variation in sequencing efficiency using read depth from a control gene, which is also contained in the GDF file, while the latter considers heterogeneity in coverage across all samples. Stargazer automates the detection of SVs with a change-point procedure, built on four basic Python modules (numpy, pandas, scipy, and matplotlib), that approximates one or more points at which the statistical properties of a sequence of observations change. Here, the sequence is DNA, the observation is per-base copy number, and the statistical property is the mean copy number. If there is a significant shift in the mean copy number (e.g., from 2 to 1), the algorithm returns the change point location and the two mean values (e.g., 2 and 1).
Part IV. Identification of diplotypes
For samples without SVs, Stargazer determines CYP2D6 diplotypes by combining the star alleles assigned to the two phased haplotypes. For samples with a whole gene deletion, the affected haplotype is assigned the CYP2D6*5 deletion allele, which is then combined with the star allele assigned to the other haplotype to form a diplotype. For samples with a whole gene duplication, the affected haplotype is assigned "x2" (e.g., CYP2D6*1x2, *2x2, etc.) because it has two gene copies of CYP2D6. For samples with more complex SVs such as CYP2D6/CYP2D7 hybrids, individual algorithms have been developed to determine diplotypes.
Part V. Assignment of predicted phenotypes
There are four CYP2D6 metabolizer classes: poor, intermediate, normal, and ultrarapid. To predict these phenotypes, Stargazer first translates CYP2D6 diplotypes into a standard unit of enzyme activity known as an activity score. The fully functional reference CYP2D6*1 allele is assigned a value of 1, decreased function alleles such as CYP2D6*10 and *17 receive a value of 0.5, and non-functional alleles including CYP2D6*4 and *5 have a value of 0. The sum of the values assigned to both alleles constitutes the activity score of a diplotype. Consequently, subjects with CYP2D6*1/*1, *1/*4 and *4/*5 diplotypes will have activity scores of 2, 1 and 0, respectively. These activity scores are used to predict the four metabolizer classes as follows: poor, 0; intermediate, 0.5; normal, 1 to 2; ultrarapid, greater than 2.
Part VI. Reporting
Stargazer v2.0 creates a report that shows the phased haplotypes as well as other candidates to consider, such as unphased calls and multiple star alleles in one haplotype, and warns about missing genotypes imputed by Beagle.
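The activity-score rules from Part V can be written out as a short Python sketch. The per-allele values and class thresholds are the ones quoted in the text; everything else is an assumption made for illustration: the names `ALLELE_ACTIVITY`, `activity_score`, and `predict_phenotype` are hypothetical, the table covers only the five alleles mentioned above, and an "xN" suffix is assumed to multiply the allele's value by N.

```python
import re

# Per-allele activity values quoted in the text (not a complete table).
ALLELE_ACTIVITY = {"*1": 1.0, "*10": 0.5, "*17": 0.5, "*4": 0.0, "*5": 0.0}

# Matches a star allele with an optional duplication suffix, e.g. "*1x2".
HAPLOTYPE_PATTERN = re.compile(r"^(\*\d+)(?:x(\d+))?$")


def haplotype_score(haplotype: str) -> float:
    match = HAPLOTYPE_PATTERN.match(haplotype)
    if match is None or match.group(1) not in ALLELE_ACTIVITY:
        raise ValueError(f"no activity value known for {haplotype!r}")
    # Assumption: each extra gene copy contributes the same activity.
    copies = int(match.group(2)) if match.group(2) else 1
    return ALLELE_ACTIVITY[match.group(1)] * copies


def activity_score(diplotype: str) -> float:
    """Sum the values of both haplotypes, e.g. '*1/*4' gives 1.0."""
    first, second = diplotype.split("/")
    return haplotype_score(first) + haplotype_score(second)


def predict_phenotype(score: float) -> str:
    """Map an activity score to one of the four metabolizer classes."""
    if score == 0:
        return "poor"
    if score == 0.5:
        return "intermediate"
    if 1 <= score <= 2:
        return "normal"
    if score > 2:
        return "ultrarapid"
    raise ValueError(f"unexpected activity score: {score}")


for diplotype in ("*1/*1", "*1/*4", "*4/*5"):
    print(diplotype, activity_score(diplotype),
          predict_phenotype(activity_score(diplotype)))
```

Hybrid and tandem arrangements such as *68+*4 are deliberately not handled, since the text does not state how their activity values are assigned.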
Main usages
Genotype with WGS data (or TS data)
stargazer -a <genome_build> -o <output_prefix> -d wgs (or -d ts) -t <target_gene> -i <vcf> -c <control_gene> -g <gdf>
Genotype with WGS-generated (or TS-generated) VCF only
stargazer -a <genome_build> -o <output_prefix> -d wgs (or -d ts) -t <target_gene> -i <vcf>
Genotype with SNP array-generated VCF only
stargazer -a <genome_build> -o <output_prefix> -d chip -t <target_gene> -i <vcf>
Genotype with SNP array-generated VCF only, using the Beagle program's imputation algorithm
stargazer -a <genome_build> -o <output_prefix> -d chip -t <target_gene> -i <vcf> -impute
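When Stargazer is driven from a script, the rule that '-c' and '-g' must be supplied together can be checked before the program is launched. The Python helper below is a hypothetical convenience wrapper, not part of Stargazer: `build_genotype_command` only assembles an argument list from the flags shown in the usages above, and all values in the example call are placeholders.

```python
from typing import List, Optional


def build_genotype_command(genome_build: str, output_prefix: str,
                           data_type: str, target_gene: str, vcf: str,
                           control_gene: Optional[str] = None,
                           gdf: Optional[str] = None) -> List[str]:
    """Assemble a stargazer argument list from the flags shown above."""
    if data_type not in {"wgs", "ts", "chip"}:
        raise ValueError("data_type must be 'wgs', 'ts', or 'chip'")
    # SV detection needs both the control gene and the GDF file;
    # supplying only one of them is treated as a mistake here.
    if (control_gene is None) != (gdf is None):
        raise ValueError("provide both control_gene and gdf, or neither")

    command = ["stargazer", "-a", genome_build, "-o", output_prefix,
               "-d", data_type, "-t", target_gene, "-i", vcf]
    if control_gene is not None and gdf is not None:
        command += ["-c", control_gene, "-g", gdf]
    return command


# Placeholder values; substitute the ones that apply to your data.
vcf_only = build_genotype_command("<genome_build>", "<output_prefix>",
                                  "wgs", "<target_gene>", "<vcf>")
```

The resulting list can be passed to `subprocess.run(command, check=True)`; consult `python3 stargazer --help` for the accepted values of each argument.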
Command-line arguments
Interpretation of results
Here's the explanation of the column headers in output_prefix.stargazer-genotype.txt files:
Here's the explanation of the column headers in report.txt files: