Intro

NOTE: THIS DOCUMENTATION IS FOR STARGAZER_V2.0 If you are using an older version, please download the latest version.

Stargazer is a set of utilities that first began as a single-purpose tool for genotyping pharmacogenetics (PGx) genes from next-generation sequencing (NGS) data. It has now expanded to include the capabilities to genotype PGx genes from both NGS data and single nucleotide polymorphism (SNP) array data, as well as provide comprehensive end-to-end genotyping solutions. Stargazer can also create a genotype data file (GDF) from a BAM/CRAM file.

You can get help on running Stargazer in the console by entering:

python3 stargazer --help

The files / arguments determine the method for which Stargazer will be run. For performing structural variation detection, you must provide both

'-c/--control-gene' and '-g/--gdf-file'. Without the control gene and gdf file Stargazer will run in the VCF-only mode. To create a GDF using Stargazer run as follows:

python3 stargazer –create-gdf-file <name_for_GDF> –control-gene <Control gene> –bam-file <path_to_BAM>

For more information, refer to the extensive documentation.

Usage

Description

Stargazer v2.0 has been simplified to a standalone program one to run as one of three of the commonly used functions of the original Stargazer: genotyping with and without the addition of read depth information, and the generation of the required file (GDF) for use when attempting to call star alleles with copy number or structural variant contributions. In this section, we will first explain how Stargazer's genotyping algorithm works using the CYP2D6 gene as an example. After that, we will provide some example usages and the full list of command-line arguments.

Figure 1. A schematic diagram of the Stargazer genotyping pipeline for any star allele, like many on CYP2D6, that requires read depth information for the determination of copy number (CNV) of other structural variant (SV) contributions to the genotype. For star alleles that are defined solely with single-nucleotide (SNV) or small insertion/deletion (indel) variants, the left path flow using a VCF file as input is all that is required. This figure was modified from an original taken from Lee et al., 2018.

Part I. Input and output data of Stargazer

The Stargazer CYP2D6 genotyping pipeline is outlined in Figure 1. The pipeline uses BAM- or CRAM-formatted files comprised of sequence reads aligned with a mapping program (such as BWA-MEM or DRAGEN) to either the GRCh37 or GRCh38 human reference genome assembly. From these BAM or CRAM files, the user must initially generate a VCF file (again, with variant-calling software such at  GATK-HaplotypeCaller or DRAGEN) from which Stargazer extracts all SNVs/indels located within 3kb from either end of CYP2D6. More specifically, Stargazer stores the genomic position of each variant, the reference allele, the alternate allele(s), the genotype status (homozygous or heterozygous), and the allelic depth for each sample. Stargazer uses the variant information from the VCF file to call star alleles based on SNVs/indels.

BAM/CRAM files are also used to calculate read depth for CYP2D6 and CYP2D7 with a new internal feature of Stargazer. For convenience, we will continue to refer to this output with the GDF (GATK-DepthOfCoverage format) file. Using the -G/–-create-gdf-file as shown below, Stargazer v2.0 uses a few internal Python routines to generate the necessary files containing information of read depth across both the gene of interest (target gene) and a known user-chosen copy-number 2 reference gene (control gene) for use as a read depth normalization factor; with both sets of information being contained in one GDF file. Since the high homology between CYP2D6 and CYP2D7 can cause reads to align to erroneous or multiple locations, only uniquely mapping reads with a mapping quality >= 20 are counted. Stargazer computes paralog-specific copy number using read depth information from the target and control genes in the GDF files in order to detect SVs.

The output data of Stargazer include each sample's CYP2D6 diplotype, predicted phenotype, and plots to visually inspect copy number for CYP2D6 and CYP2D7 (Figure 2). Stargazer also, in a finalized.vcf file, returns all detected SNVs/indels including those that are novel as well as those that are known but are not currently used to define any star allele, so investigators will know which variants to potentially follow up. In addition, those variants can be annotated with SeattleSeq Annotation and used to define new star alleles. Note that when calling diplotypes, Stargazer only considers those variants that are currently used by the Pharmacogene Variation Consortium (PharmVar).

Figure 2. An example of a hybrid star allele detected by Stargazer: CYP2D6 *2x2/*68+*4 diplotype. (*68 is a CYP2D6-CYP2D7 hybrid with a switch region in intron 1; it is found in tandem with *4.) Green dots are points of raw read depth, shown here for both the target and control genes. Grey dots are copy number calculated from read depth. The light and dark blue dots show the allele fraction values of relevant SNV or indel variants.

Part II. Prediction of star alleles

From a VCF file, Stargazer uses Beagle to haplotype phase heterozygous variants for CYP2D6 with over 2,500 reference samples from the 1000 Genomes Project. Stargazer then matches phased haplotypes to star alleles with a translation table built from publicly available data (e.g., the Pharmacogene Variation Consortium). The table contains information on more than 90 star alleles and 185 SNVs/indels including variant positions and nucleotide changes in relation to the reference CYP2D6*1 allele and the GRCh37 and GRCh38 human reference genome assemblies.

Part III. Detection of SVs

From a GDF file, Stargazer converts read depth for CYP2D6 and CYP2D7 (target gene) to copy number by performing intra- and inter-sample normalizations. The former accounts for individual variation in sequencing efficiency using read depth from a control gene; also contained in the GDF file, while the latter considers heterogeneity in coverage across all samples. Stargazer automates the detection of SVs using four of the most basic Python modules: numpy, pandas, scipy and matplotlib that approximate one or more points at which the statistical properties of a sequence of observations change. Here, the sequence is DNA, the observation is per-base copy number, and the statistical property is the mean copy number. If there is a significant shift in the mean copy number (e.g., from 2 to 1), the algorithm returns the change point location and the two mean values (e.g., 2 and 1).

Part IV. Identification of diplotypes

For samples without SVs, Stargazer determines CYP2D6 diplotypes by combining the star allele used to assign each phased haplotype. For samples with a whole gene deletion, the affected haplotype is assigned the CYP2D6*5 deletion allele, which is then combined with the star allele assigned to the other haplotype to form a diplotype. For samples with a whole gene duplication, the affected haplotype will be assigned "x2" (e.g., CYP2D6*1x2, *2x2, etc.) because it has two gene copies of CYP2D6. For samples with more complex SVs such as CYP2D6/CYP2D7 hybrids, individual algorithms have been developed to determine diplotypes.

Part V. Assignment of predicted phenotypes

There are four CYP2D6 metabolizer classes: poor, intermediate, normal, and ultrarapid. To predict these phenotypes, Stargazer first translates CYP2D6 diplotypes into a standard unit of enzyme activity known as an activity score. The fully functional reference CYP2D6*1 allele is assigned a value of 1, decreased function alleles such as CYP2D6*10 and *17 receive a value of 0.5, and non-functional alleles including CYP2D6*4 and *5 have a value of 0. The sum of values assigned to both alleles constitutes the activity score of a diplotype. Consequently, subjects with CYP2D6*1/*1, *1/*4 and *4/*5 diplotypes will have an activity score of 2, 1 and 0, respectively. These activity scores are used to predict the four metabolizer classes as following: poor, 0; intermediate, 0.5; normal, 1 to 2; ultrarapid, greater than 2.

Part VI. Reporting

Stargazer v2.0 creates a report that shows the phased haplotypes as well as other possible candidates to consider such as unphased, multiple star alleles in one haplotype, and warns against BEAGLE imputed missing genotypes.

Main usages

Genotype with WGS data (or TS data)

stargazer -a <genome_build> -o <output_prefix> -d wgs (or -d ts) -t <target_gene> -i <vcf> -c <control_gene> -g <gdf>

Genotype with WGS-generated (or TS-generated) VCF only

stargazer -a <genome_build> -o <output_prefix> -d wgs (or -d ts) -t <target_gene> -i <vcf>

Genotype with SNP array-generated VCF only

stargazer -a <genome_build> -o <output_prefix> -d chip t <target_gene> -i <vcf>

Genotype with SNP array-generated VCF only, using the Beagle program's imputation algorithm

stargazer -a <genome_build> -o <output_prefix> -d chip t <target_gene> -i <vcf>

-impute

Command-line arguments

Options

Argument

Description

-v/--version

.

Print the Stargazer version number and exit.

-i/--vcf-file

PATH

Path to the input VCF file (.vcf or .vcf.gz). An empty VCF file containing only the header and metadata is still accepted by Stargazer as it means no variants were detected from the sample(s). [required for standard running]

-o/--output-dir

PATH

Path to the output directory. If the directory with the same name already exists Stargazer will overwrite it. [required for all processes]

-t/--target-gene

TEXT

Name of the target gene. [required for all processes]

-a/--genome-build

TEXT

Build of the reference genome assembly used to create the input data. [default: 'hg19']

-d/--data-type

TEXT

Type of the input data: wgs whole genome sequencing; ts targeted sequencing including whole exome sequencing; or chip single nucleotide polymorphism microarray. [default: 'wgs']

-c/--control-gene

TEXT

Name of a preselected control gene. Used for intrasample normalization during copy number analysis.  Alternatively you can provide a custom genomic region with the 'chr:start-end' format (e.g. chr12:48232319-48301814). [required with --create-gdf-file]

-g/--gdf-file

PATH

Path to a GDF file containing read depth for both the target and control genes.

-r/--hap-panel

PATH

Path to a reference haplotype panel file (.vcf or .vcf.gz). By default Stargazer uses the haplotype panel from the 1000 Genomes Project (phase 3).

-s/--sample-list

TEXT

List of space-delimited sample IDs to be used as control for intersample normalization during copy number analysis. This argument has an effect only when the input data type is targeted sequencing.

-p/--include-profiles

.

Use this tag to include profiles for read depth copy number and allele fraction. This tag has no effect if Stargazer is running in the VCF-only mode.

-b/--impute-ungenotyped-markers

.

Use this tag to impute ungenotyped markers during statistical phasing. Potentially useful for low-density DNA microarray data.

-st/--star_table

PATH

Use this tag to specify the path for star_table to be used [default=f {program_dir}/star_table.tsv]

-G/--create-gdf-file

TEXT

This tag causes Stargazer to run as a GDF-file generator. Argument requires a file name (which will be placed in the output directory). [must include -B option]

-B/--bam-file

TEXT

List of paths to BAM or CRAM files. [required for --create-gdf-file runs]

*IS BAM LIST TO BE SPACE-DELIMITED?

Interpretation of results

Here's the explanation of the column headers in output_prefix.stargazer-genotype.txt files:

Header

Description

name

Sample ID used in the VCF file and the GDF file

status

Genotype status with following choices: g

  (genotyped), ng (not genotyped)

hap1_main

Main star allele for the 1st haplotype

hap2_main

Main star allele for the 2nd haplotype

hap1_cand

Candidate star alleles for the 1st haplotype

hap2_cand

Candidate star alleles for the 2nd haplotype

hap1_score

Activity score for the 1st haplotype

hap2_score

Activity score for the 2nd haplotype

dip_score

Combined activity score for both haplotypes

phenotype

Predicted phenotype based on the diplotype

  activity score

dip_sv

Combination of two unphased SV calls

hap1_sv

SV call for the 1st haplotype

hap2_sv

SV call for the 2nd haplotype

ssr

The sum of squared residuals (SSR) computed

  during the SV detection step

dip_cand

All possible candidate star alleles, obtained by

  combining the SNPs of both haplotypes to create a pseudo-haplotype

hap1_main_core

Core SNPs of the 1st haplotype's main star

  allele

hap2_main_core

Core SNPs of the 2nd haplotype's main star

  allele

hap1_main_tag

Tag SNPs of the 1st haplotype's main star allele

hap2_main_tag

Tag SNPs of the 2nd haplotype's main star allele

hap1_af_mean_gene

Mean allele fraction for the 1st haplotype,

  calculated using all detected SNPs from the target gene

hap2_af_mean_gene

Mean allele fraction for the 2nd haplotype,

  calculated using all detected SNPs from the target gene/td>

hap1_af_mean_main

Mean allele fraction for the 1st haplotype,

  calculated using the SNPs of the main star allele

hap2_af_mean_main

Mean allele fraction for the 2nd haplotype, calculated using the

  SNPs of the main star allele

 

Here's the explanation of the column headers in report.txt files:

Name

Description

Sample

name of the sample

Gene

name of the gene

Diplotype

diplotype (hap1_main/hap2_main)

BEAGLE imputed

if hap1_main or hap2_main is imputed that star-allele will be listed

Phenotype

the phenotype calculated as per the additive activity score of the diplotype

Score

the additive activity score calculated for the diplotype

May also be

the other possible candidates of haplotypes as per the alleles present in the sample

Also possible haplotype

the other possible candidates of haplotypes as per the alleles present in the sample if not phased