3.7.6. Gene and Transcript Preferences
When performing variant interpretation and filtering, gene annotations alone are insufficient to evaluate a variant’s effect and clinical relevance. Ultimately, specific choices must be made about how to report each gene, and those preferences must be remembered. The VarSeq algorithms and VSClinical workflow engine look up these properties from Gene Preferences files, which specify the clinically relevant transcript, inheritance mode, and disorder for each gene. In this chapter we describe the process used by VarSeq to select the clinically relevant transcript for a gene, and we discuss how this selection process can be overridden through custom Gene Preferences.
Clinically Relevant Transcripts
While many genes have a single dominant transcript, there are often multiple distinct mRNA transcripts associated with a given gene. When describing a variant, a choice of transcript must be made to provide an HGVS coding and protein description of the mutation. In some cases, the choice of transcript may change the variant from being described as exonic to intronic or from a loss-of-function pathogenic mutation to a non-coding benign length polymorphism.
An example illustrating the importance of selecting the correct transcript can be seen in the variant 2: 39,351,827 T/-, which overlaps the transcript NM_001382394 of the gene SOS1. Evaluating the variant in the context of this transcript, we see that it is a frameshift mutation in exon 1 (NM_001382394.1:p.Q15Rfs*5) and is likely to result in nonsense-mediated decay. However, in VarSeq, the default clinically relevant transcript for this gene is NM_005633. Using this transcript, the variant is considered intergenic, as it does not overlap the clinically relevant transcript.
This example shows that picking the most biologically and clinically relevant transcript is an important choice when performing variant interpretation.
The VarSeq transcript annotation algorithm performs a one-to-many annotation of a variant against every overlapping transcript. By default, the annotated transcripts are filtered to mRNA and non-coding RNAs, because the complete set of RefSeq transcripts includes many computationally predicted transcripts (with an XM prefix) that should generally be ignored. However, the algorithm can be manually configured to annotate against all transcripts.
To handle the presence of multiple transcripts, the VarSeq gene annotation algorithms produce three column groups:
Summary Fields that display per-transcript annotations in the context of a single clinically relevant or combined transcript annotation.
Transcript Interactions that include the per-transcript annotations for all transcripts.
Aux Fields that pass through the transcript-level additional auxiliary fields from the annotation source.
To aggregate the per-transcript annotations into a single value that can be used for filtering or exporting the variant table in a meaningful way, there are two strategies and corresponding summary fields:
Combined: For fields like Sequence Ontology and Gene Region, this takes the most damaging annotation result from all transcripts and is useful for conservative filtering.
Clinically Relevant: These fields show annotation results for the gene’s clinically relevant transcript.
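The difference between the two summary strategies can be illustrated with a minimal Python sketch. It is not VarSeq's implementation: the severity ranking and the function names are assumptions made for illustration, and the per-transcript data mirrors the SOS1 example above, where the variant overlaps only NM_001382394.

```python
# Hypothetical severity ranking, most damaging first.
# This ordering is an assumption for illustration, not VarSeq's actual ranking.
SEVERITY = [
    "frameshift_variant",
    "stop_gained",
    "missense_variant",
    "synonymous_variant",
    "intron_variant",
    "intergenic_variant",
]
RANK = {term: index for index, term in enumerate(SEVERITY)}


def combined(annotations, default="intergenic_variant"):
    """Return the most damaging term across all overlapping transcripts."""
    if not annotations:
        return default
    return min(annotations.values(), key=lambda term: RANK[term])


def clinically_relevant(annotations, transcript, default="intergenic_variant"):
    """Return the term for one chosen transcript, or the default if the
    variant does not overlap that transcript."""
    return annotations.get(transcript, default)


# Per-transcript annotations for the SOS1 variant described earlier:
# only NM_001382394 overlaps the variant.
per_transcript = {"NM_001382394": "frameshift_variant"}

combined_term = combined(per_transcript)
relevant_term = clinically_relevant(per_transcript, "NM_005633")
```

Here the Combined value reports the frameshift, while the Clinically Relevant value for NM_005633 is intergenic, which is why the Combined fields are the more conservative choice for filtering.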
Clinically Relevant Transcript Selection
VarSeq uses a set of heuristics to select the Clinically Relevant transcript for a given gene. These heuristics are designed to match the leading variant annotation sources and are based on two key annotation sources:
The Matched Annotation from NCBI and EMBL-EBI (MANE)
The Locus Reference Genomic (LRG) database
The MANE transcript set, which was created through a joint effort by NCBI and Ensembl, aims to define a single representative transcript per gene that is well-supported by experimental data and represents the biology of the gene. The LRG is a DNA sequence format that provides unique identifiers for genes along with fixed genome-independent reference sequences. These annotations are leveraged when selecting a clinically relevant transcript based on the following heuristics:
Prefer a transcript that is a MANE “Select” transcript
Prefer a transcript that has an LRG identifier
Prefer a transcript that has correctly encoded start and stop codons over incomplete transcripts
Prefer a transcript that is protein coding over one that is non-coding
Prefer transcripts with longer coding sequences
If all else is identical, select the first in lexicographic order
In practice, for the roughly 18,000 genes in MANE, the MANE Select transcript, itself chosen using computational evidence such as per-tissue expression, will be selected as the Clinically Relevant transcript. However, the heuristics described above may be overridden by a saved user or system gene preference.
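One way to picture the heuristics is as a sort key, sketched below in Python. This assumes the rules are applied strictly in the order listed, with each later rule acting only as a tie-breaker; the `Transcript` class, its field names, and the `TX_*` identifiers are hypothetical and do not reflect VarSeq's internal data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    """Hypothetical record holding the attributes the heuristics inspect."""
    name: str
    mane_select: bool = False
    has_lrg: bool = False
    complete_cds: bool = False    # correctly encoded start and stop codons
    protein_coding: bool = False
    cds_length: int = 0


def preference_key(t):
    """Smaller keys are preferred. False sorts before True, so each
    'preferred' flag is negated; CDS length is negated so longer wins."""
    return (
        not t.mane_select,
        not t.has_lrg,
        not t.complete_cds,
        not t.protein_coding,
        -t.cds_length,
        t.name,                   # final tie-break: lexicographic order
    )


def select_clinically_relevant(transcripts):
    """Pick the transcript that ranks first under the ordered heuristics."""
    return min(transcripts, key=preference_key)


candidates = [
    Transcript("TX_B", complete_cds=True, protein_coding=True, cds_length=900),
    Transcript("TX_A", complete_cds=True, protein_coding=True, cds_length=900),
    Transcript("TX_C", has_lrg=True, complete_cds=True, protein_coding=True,
               cds_length=600),
    Transcript("TX_D", mane_select=True, complete_cds=True,
               protein_coding=True, cds_length=300),
]
chosen = select_clinically_relevant(candidates)
```

With these made-up candidates, the MANE Select transcript `TX_D` is chosen even though it has the shortest coding sequence, because earlier rules dominate later ones under this assumed ordering.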
Transcript Selection in VSClinical
While VSClinical adds variants to an evaluation based on the Clinically Relevant transcript, it will also warn when the annotation differs in effect on other transcripts.
VSClinical supports switching the analysis of the variant to another transcript at any time. When switching transcripts, a variant is updated to reflect the per-transcript annotations, sequence ontology, in-silico functional predictions, splicing effects, and ultimately the recommended criteria following the ACMG guidelines.
VarSeq uses two Gene Preferences files to specify clinically relevant transcripts for certain common clinical genes:
User Gene Preferences
System Gene Preferences
The transcripts specified in these files override the default clinically relevant transcript selected by VarSeq. The User Gene Preferences always take precedence and can be shared among multiple lab users using the same folder location as assessment catalogs (often configured as a network share). The System Gene Preferences file is shipped with VarSeq and is updated with the software. The Golden Helix team adds transcripts to the System Gene Preferences file if:
A transcript has the most ClinVar submission references
and it has a submission count greater than 10
and it is not the default transcript choice
and the default transcript choice does not have a submission count greater than half of the most submitted transcript
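The four conditions above can be read as a single predicate. The Python sketch below is one interpretation, not the Golden Helix team's actual procedure: the function name, the dictionary of per-transcript ClinVar submission counts, and the `TX_*` identifiers are assumptions for illustration.

```python
def system_override(clinvar_counts, default_transcript):
    """Return the transcript that would qualify for the System Gene
    Preferences under the stated conditions, or None if none qualifies.

    clinvar_counts maps transcript name -> number of ClinVar submissions
    referencing it (a hypothetical input format).
    """
    if not clinvar_counts:
        return None

    # Transcript with the most ClinVar submission references.
    # On a tie, max() keeps the first entry encountered.
    top, top_count = max(clinvar_counts.items(), key=lambda item: item[1])
    default_count = clinvar_counts.get(default_transcript, 0)

    qualifies = (
        top_count > 10                        # more than 10 submissions
        and top != default_transcript         # not already the default
        and default_count <= top_count / 2    # default has at most half
    )
    return top if qualifies else None


override = system_override({"TX_1": 40, "TX_2": 12}, default_transcript="TX_2")
no_override = system_override({"TX_1": 40, "TX_2": 25}, default_transcript="TX_2")
```

In the first call the default transcript has 12 submissions, which is no more than half of 40, so `TX_1` qualifies. In the second, the default's 25 submissions exceed half of 40, so the default choice stands.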
The Gene Preferences files can be used to customize various gene level properties in addition to the clinically relevant transcript. These customizable properties include:
Transcript: The preferred clinically relevant transcript to use, overriding the default choice chosen by VarSeq
Inheritance Mode: The disease inheritance model to be used: Dominant or Recessive
Disorder: The disorder to report for this gene (along with linked OMIM / MONDO identifiers)
WildType for Tumors: A Cancer workflow specific option to specify which tumor types indicate this gene as clinically significant in a Wild Type state.
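The precedence between the two files and the default heuristics can be sketched as a simple lookup chain. The Python below is illustrative only: the `GenePreference` class and its fields are hypothetical stand-ins for the properties listed above, not the on-disk format of the Gene Preferences files, and falling through to the next source when a transcript is unset is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class GenePreference:
    """Hypothetical in-memory view of one gene's saved preferences."""
    transcript: Optional[str] = None
    inheritance: Optional[str] = None      # "Dominant" or "Recessive"
    disorder: Optional[str] = None
    wild_type_tumors: Tuple[str, ...] = ()


def resolve_transcript(gene, user_prefs, system_prefs, heuristic_default):
    """User preferences win over system preferences, which win over the
    transcript chosen by the default heuristics."""
    for prefs in (user_prefs, system_prefs):
        pref = prefs.get(gene)
        if pref is not None and pref.transcript:
            return pref.transcript
    return heuristic_default


# Made-up gene and transcript names for illustration.
user_prefs = {"GENE1": GenePreference(transcript="TX_USER")}
system_prefs = {
    "GENE1": GenePreference(transcript="TX_SYSTEM"),
    "GENE2": GenePreference(transcript="TX_SYSTEM"),
}

gene1 = resolve_transcript("GENE1", user_prefs, system_prefs, "TX_DEFAULT")
gene2 = resolve_transcript("GENE2", user_prefs, system_prefs, "TX_DEFAULT")
gene3 = resolve_transcript("GENE3", user_prefs, system_prefs, "TX_DEFAULT")
```

`GENE1` resolves to the user's transcript, `GENE2` to the system transcript, and `GENE3`, which has no saved preference, to the heuristic default.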
Updating User Gene Preferences
There are several ways to modify the User Gene Preferences. In VSClinical, when a variant interpretation is saved and the transcript or currently selected disorder differs from the current gene preference, the bottom of the save dialog displays an option to update the saved gene preference for each of these properties.
The mode of inheritance preference is updated in the Tolerated Frequency section of the ACMG guidelines. You can change and save the Inheritance value directly and see the implications on the frequency thresholds used for BA1, BS2, and PM2. On the right, the associated conditions from OMIM and their annotated inheritance model are provided for reference.
Users can also directly modify the GenePreferences.gene-pref file used to store the User Gene Preferences by going to Tools > Open Folder > Assessments Catalog Folder. This structured text file can be browsed to review and modify the currently saved gene preferences.
Where Gene Preferences Are Used
There are several contexts in the VarSeq annotation and filtering workflow and the VSClinical interpretation workflow that leverage the gene preferences system. Because you can customize gene preferences, the clinical lab defines the behavior in these contexts when defining and validating their clinical test.
Transcript Annotation: The VarSeq gene annotation algorithm deals with a lot of complexity when describing the interaction between project variants and overlapping transcripts. While there is a detailed table called “Transcript Interactions” that provides the combination of every alternate allele for a variant combined with every transcript, each variant must be reported in the context of a single clinically relevant transcript. This Clinically Relevant transcript is selected by VarSeq based on the heuristics described in the preceding section, but these heuristics can be overridden by specifying a clinically relevant transcript in the gene preferences file. The transcript annotation algorithm provides the HGVS notation, sequence ontology, splice site predictions, and other relevant gene annotation details for the clinically relevant transcript.
ACMG Classifier: The scoring of the ACMG guideline criteria codes can be automated for the criteria based entirely on annotations and bioinformatically derived attributes of variants. The ACMG Classifier will evaluate a variant based on the clinically relevant transcript selected by the gene annotation algorithm in the project. Also, the Inheritance of the gene is reported and used when evaluating certain criteria. The evaluation of whether a variant is rare or common uses stricter thresholds for genes with a dominant inheritance model than for genes with a recessive (or unknown) one.
VSClinical Recommendations: After annotation and filtering, certain high-quality variants of interest will be taken forward to interactive scoring and interpretation using VSClinical. Both the ACMG and AMP (Cancer) workflows perform up-to-date variant scoring when the variants are added to an evaluation. At this time, the clinically relevant transcript, inheritance model, and any previously saved disorder are looked up for the variant and used to perform the variant scoring and populate the interpretation. The cancer workflow will also look up whether the current patient’s tumor type matches any marked as relevant wild types in the gene preferences.