High-performance Variant Calling Implementation: A Bayesian approach to discovering variants in NGS data.

Grant holder:

Vera Pinto

Supervisor(s):

Carina Silva (CEAUL-EsTSL)

Lisete Sousa (CEAUL-FCUL)

Type of grant

PhD grant

Project status:

Ongoing

Introduction

In the last two decades, advances in whole-genome DNA sequencing have improved our knowledge of the human genome and changed modern biological science, while next-generation sequencing (NGS¹) technology has reduced DNA sequencing costs. As a result, NGS now generates a high volume of data. There is a clear demand for data handling and statistical processing, particularly for downstream analysis tools and analysis processes for genomic data, which remain an everyday challenge for scientists.
In the downstream analysis of NGS data, one of the critical steps is variant calling (VC), the process of identifying variants that differ from the chosen reference sample. These nucleotide alterations can have distinctive effects on human molecular biology, making the identification of true variants fundamental. This identification can be achieved by using a precise VC tool.
1 Next-generation sequencing (NGS) is a high-throughput methodology that enables rapid sequencing of the base pairs in DNA or RNA samples, supporting a broad range of applications, including gene expression profiling, chromosome counting, detection of epigenetic changes, and molecular analysis.
Variant calling tools are designed to identify specific types of variants. These include germline variants, somatic variants (Xu, 2018), copy number variants (CNV) (Yu & Du, 2021) and structural variants (SV). Different VC tools are built on different algorithms, which have been evolving and being refined in recent years (Alosaimi et al., 2021).
A standard pipeline to detect single nucleotide variants (SNVs) and indels (insertions or deletions) includes: (1) sequencing, (2) pre-processing (quality control analysis), (3) mapping reads to the reference genome using tools such as Bowtie, (4) post-processing of the alignment results (marking duplicates and sorting), (5) calling SNVs/indels using tools such as SAMtools and/or the UnifiedGenotyper implemented in the Genome Analysis Toolkit (GATK), and (6) filtering (Jia et al., 2012).
Despite the many existing computational methods, their performance may degrade in scenarios with severe fluctuation of the read depth signal (Yu & Du, 2021) and in the presence of structural variants (SV) (Mahmoud et al., 2019). Broadly, VC methods fall into four categories: germline callers, somatic callers, CNV identifiers and SV identifiers. Within germline variant callers, two approaches are employed: heuristic and statistical (Alosaimi et al., 2021; Kojima et al., 2013).
Present-day VC approaches have been designed to leverage populations of long-range haplotypes and were benchmarked using populations of European descent, even though most genetic diversity, with high rates of heterogeneity, is found in non-European populations such as African populations (Alosaimi et al., 2021). These VC tools may produce false positive (FP) and false negative (FN) results, which can lead to ambiguous conclusions about which mutations to prioritize and call into question the clinical relevance and actionability of variants. Recent efforts focus on increasing specificity, which may result in the loss of true positive calls. Conversely, giving preference to sensitivity results in more FP. Depending on the desired output of a study, either sensitivity or specificity should be prioritized, as there is a trade-off between the two (Alosaimi et al., 2021).
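The trade-off described above can be made concrete with a small sketch. It assumes that a call set and a truth set are available as collections of variant identifiers and that the total number of evaluated sites is known; the names `CallCounts`, `compare_calls` and `n_sites` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallCounts:
    """Confusion counts from comparing a call set with a truth set."""
    tp: int  # true variants that were called
    fp: int  # called variants absent from the truth set
    fn: int  # true variants that were missed
    tn: int  # non-variant sites correctly left uncalled


def compare_calls(called, truth, n_sites):
    """Count TP/FP/FN/TN, assuming n_sites evaluated positions in total."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)
    fp = len(called - truth)
    fn = len(truth - called)
    tn = n_sites - tp - fp - fn
    return CallCounts(tp, fp, fn, tn)


def _ratio(num, den):
    # Return NaN instead of raising when a metric is undefined.
    return num / den if den else float("nan")


def sensitivity(c):
    """Fraction of true variants that were called."""
    return _ratio(c.tp, c.tp + c.fn)


def specificity(c):
    """Fraction of non-variant sites that were correctly left uncalled."""
    return _ratio(c.tn, c.tn + c.fp)


def precision(c):
    """Fraction of called variants that are true variants."""
    return _ratio(c.tp, c.tp + c.fp)


# Toy example: a lenient and a strict call set against the same truth set.
truth = {1, 2, 3, 4}
lenient = compare_calls({1, 2, 3, 4, 5, 6}, truth, n_sites=100)
strict = compare_calls({1, 2, 3}, truth, n_sites=100)
```

In this toy example the lenient call set recovers every true variant (sensitivity 1.0) at the cost of two false positives (precision 4/6), whereas the strict call set has no false positives (specificity 1.0) but misses one true variant (sensitivity 0.75).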
Variants identified from sequencing data may contain errors and hence include false-positive calls. Conventionally, genetic studies take one of two approaches to assessing variant quality: “filtering” and “classification”. In the filtering approach, several filters are applied to remove problematic variants. One main problem with this approach is that thresholds are often study-specific and need to be manually fine-tuned for each study (Li et al., 2019). The classification approach attempts to learn to recognize low-quality variants using machine learning techniques. For instance, Variant Quality Score Recalibration (VQSR) of GATK² uses a Gaussian mixture model to learn the multidimensional annotation profile of variants with high and low quality. One of the concerns with VQSR is that it needs training datasets acquired from existing variant databases such as the 1000 Genomes Project and HapMap, which could bias it towards retaining known variants and filtering out novel ones. Known databases of genetic variants may not always be accurate, which would lead to incorrect classification of variants (Li et al., 2019).
2 The GATK is the industry standard for identifying SNPs and indels in germline DNA and RNAseq data. Its scope is now expanding to include somatic short variant calling, and to tackle copy number (CNV) and structural variation (SV). In addition to the variant callers themselves, the GATK also includes many utilities to perform related tasks such as processing and quality control of high-throughput sequencing data.
The heuristic approach is the original method of VC, employed by tools such as VarScan2³. This method uses filtering and quality cut-off values to identify an initial set of genotypes from which single nucleotide polymorphisms (SNPs⁴) are inferred (Alosaimi et al., 2021). Its appeal is its simplicity: it makes no assumptions about the data that could be violated. A disadvantage is that, without a high degree of sequencing depth⁵, the heuristic method tends to identify fewer SNPs. Accurate VC therefore requires high sequencing depth, which makes it computationally more complex and demanding. This method also cannot use information on the quality of individual reads⁶ and so provides no measure of uncertainty for each predicted variant. Because the SNPs are derived from the identified genotypes, an incorrect genotype leads to incorrect variant calls (Alosaimi et al., 2021).
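As a minimal sketch of the heuristic idea, the function below applies hard cut-offs at a single site. It is not VarScan2's algorithm: the function name `heuristic_call` and the threshold values are hypothetical and chosen only for illustration.

```python
from collections import Counter


def heuristic_call(ref_base, bases, quals,
                   min_qual=20, min_depth=8, min_alt_fraction=0.2):
    """Return the alternate allele called at one site, or None.

    bases: read bases overlapping the site.
    quals: Phred base qualities, one per base.
    All thresholds are illustrative assumptions, not defaults of any tool.
    """
    # Hard filter: bases below the quality cut-off are discarded outright.
    kept = [b for b, q in zip(bases, quals) if q >= min_qual]
    depth = len(kept)
    if depth < min_depth:
        return None  # too little evidence after filtering: no call
    # Count the non-reference bases that survived the filter.
    alt_counts = Counter(b for b in kept if b != ref_base)
    if not alt_counts:
        return None
    alt, alt_count = alt_counts.most_common(1)[0]
    # Call the variant only if its allele fraction passes the cut-off.
    if alt_count / depth >= min_alt_fraction:
        return alt
    return None
```

The sketch reproduces the two limitations noted above: a site with four reads is never called under `min_depth=8`, however clear the signal, and the result is a bare yes/no decision with no probability attached.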
Statistical methods solve several of the problems associated with the heuristic approach and provide a measure of uncertainty for each variant, thereby increasing the accuracy of variant identification (Kojima et al., 2013). These methods are based on the likelihood of observing a specific outcome given all previously known information. In this context, the information required is the base quality (BQ), assessed after pre-processing of the data. Bayes’ formula is used to determine the posterior probability Pr(G|X) of the genotype (G) given the observed data (X) (Nunn et al., 2021). The genotype with the highest posterior probability is chosen, and likelihood ratios provide a measure of confidence in the output. This approach also allows easy incorporation of prior information such as known population-scale allele frequencies (Alosaimi et al., 2021).
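The Bayesian calculation can be sketched for a single biallelic site, using Pr(G|X) ∝ Pr(X|G) Pr(G). This is a simplified textbook-style model, not the method proposed in this project or the model of any specific tool: it assumes a diploid sample, independent reads, a per-base error probability of 10^(-Q/10) derived from the Phred base quality Q, and hypothetical prior values.

```python
import math


def base_likelihood(base, allele, qual):
    """Pr(observed base | true allele), from the Phred base quality."""
    # Cap the error so a quality of 0 is treated as uninformative.
    err = min(10 ** (-qual / 10), 0.75)
    # An error is assumed equally likely to yield any of the 3 other bases.
    return 1 - err if base == allele else err / 3


def genotype_posteriors(ref, alt, bases, quals, priors=None):
    """Posterior probability of genotypes RR, RA and AA at one site."""
    genotypes = {"RR": (ref, ref), "RA": (ref, alt), "AA": (alt, alt)}
    if priors is None:
        # Hypothetical priors; real ones could come from allele frequencies.
        priors = {"RR": 0.9985, "RA": 0.001, "AA": 0.0005}
    log_post = {}
    for name, (a1, a2) in genotypes.items():
        total = math.log(priors[name])
        for b, q in zip(bases, quals):
            # Each read comes from either chromosome with probability 1/2.
            p = 0.5 * base_likelihood(b, a1, q) + 0.5 * base_likelihood(b, a2, q)
            total += math.log(p)
        log_post[name] = total
    # Normalize in log space to avoid underflow at high depth.
    top = max(log_post.values())
    norm = sum(math.exp(v - top) for v in log_post.values())
    return {g: math.exp(v - top) / norm for g, v in log_post.items()}


def call_genotype(ref, alt, bases, quals, priors=None):
    """Return the most probable genotype and its posterior probability."""
    post = genotype_posteriors(ref, alt, bases, quals, priors)
    best = max(post, key=post.get)
    return best, post[best]
```

For example, `call_genotype("A", "G", "AAAAAGGGGG", [30] * 10)` selects the heterozygous genotype `"RA"`. Unlike a hard cut-off, every call carries a posterior probability, and low-quality bases are down-weighted rather than discarded.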
Emerging approaches respond to a limitation of traditional model-based variant callers, which rely heavily on ad hoc filters to reduce false calls because artifacts are generated in ways too complex for simple modelling. Indeed, most available VC tools have high sensitivity but low specificity; this problem is caused by the complexity of the data and results in an increase in false positives (Bian et al., 2018).
Moreover, a variant caller often has dozens of parameters that are difficult to understand. In response, deep neural networks (DNN) have recently been applied to variant calling with superior performance, and the trained model can be easily applied to other datasets with consistent performance (Xu, 2018), creating an opportunity to make variant calling reproducible and applicable everywhere.

3 VarScan is a platform-independent mutation caller for targeted, exome, and whole-genome resequencing data generated on Illumina, SOLiD, Life/PGM, Roche/454, and similar instruments.
4 SNPs are the most common type of genetic variation among people. Each SNP represents a difference in a single DNA building block, called a nucleotide.
5 Sequencing depth, or read depth, describes the number of times that a given nucleotide in the genome has been read in an experiment.
6 Refers to the number of base pairs (bp) sequenced from a DNA fragment.

Objectives

The general goal is to develop a Bayesian method that deals with the different types of challenges related to variant calling.
We are interested in increasing specificity without losing any information from the original data, thus designing a method to identify specific types of variants with a higher level of precision.
We intend to examine the methods that underlie both approaches (heuristic and statistical), dissecting their theoretical statistical assumptions, to see whether their strengths can be combined in a single, more robust and specific statistical method.

Summary of the Work Plan

State-of-the-art of Variant Calling methodologies through a scoping review.
Evaluation of the designed method using the pipeline proposed in Olson et al. (2015).
Understanding the underlying advantages and disadvantages of the VC tools applied nowadays.
Developing a Bayesian method that may increase specificity without the loss of true positive calls and that can be adapted to all populations, addressing the problem raised in Alosaimi et al. (2021).
Evaluation of the performance of the developed statistical method through NGS simulated data and real data.
Implementation of the proposed method in an R/Python package.
Designing a clear workflow on how to implement the package/algorithm.
Proposing a consensus workflow for analyzing NGS data from the beginning, incorporating the developed package/algorithm.
Exploring machine learning and deep learning algorithms to improve the performance of variant calling, through a review of Yu & Du (2021) but extending the approach to the identification of all specific types of variants, as performance validation for the designed method.

Expected Results

We expect to implement the proposed method successfully and to apply it to real NGS data to identify new variants related to haemoglobinopathies (data provided by Professor Miguel Brito, HT&RC), thus facilitating the development of personalized therapies.
We also expect to apply the method to other rare genetic diseases, to identify unknown or problematic variants, in the hope that more precise therapies can be developed for these diseases.
We intend to explore further application areas, such as standardizing pipelines and workflows for other types of genomics data. One of the outputs of these analyses will be Variant Call Format (VCF) files predicted with greater precision and accuracy, together with the respective performance indicators for the VC method developed.
The statistical method will be integrated into an R/Python package, to be developed as part of the project, making it more user-friendly for the scientific community and facilitating access to precise variant identification within medical laboratories.
We expect to collaborate with teams from the application areas (statistical, medical and biological) to formulate the relevant questions and answers, thus creating a space for science and knowledge to grow.
The methodologies and the results of the applications will be published in international scientific journals.
We anticipate participating in conferences and scientific meetings to publicize the work developed among peers.

References

Alosaimi, S., van Biljon, N., Awany, D., Thami, P. K., Defo, J., Mugo, J. W., Bope, C. D., Mazandu, G. K., Mulder, N. J., & Chimusa, E. R. (2021). Simulation of African and non-African low and high coverage whole genome sequence data to assess variant calling approaches. Briefings in Bioinformatics. https://doi.org/10.1093/bib/bbaa366.

Bian, X., Zhu, B., Wang, M., Hu, Y., Chen, Q., Nguyen, C., Hicks, B., & Meerzaman, D. (2018). Comparing the performance of selected variant callers using synthetic data and genome segmentation. BMC bioinformatics, 19(1), 429. https://doi.org/10.1186/s12859-018-2440-7.

Ferretti, L., Tennakoon, C., Silesian, A., Freimanis, G., Ribeca, P. (2019). SiNPle: Fast and sensitive variant calling for deep sequencing data. Genes, 10(8). https://doi.org/10.3390/genes10080561.

Jia, P., Li, F., Xia, J., Chen, H., Ji, H., Pao, W., & Zhao, Z. (2012). Consensus rules in variant detection from next-generation sequencing data. PloS one, 7(6), e38470. https://doi.org/10.1371/journal.pone.0038470.

Kojima, K., Nariai, N., Mimori, T., Takahashi, M., Yamaguchi-Kabata, Y., Sato, Y., & Nagasaki, M. (2013). A statistical variant calling approach from pedigree information and local haplotyping with phase informative reads. Bioinformatics (Oxford, England), 29(22), 2835–2843. https://doi.org/10.1093/bioinformatics/btt503.

Li, J., Jew, B., Zhan, L., Hwang, S., Coppola, G., et al. (2019). ForestQC: Quality control on genetic variants from next-generation sequencing data using random forest. PLOS Computational Biology, 15(12), e1007556. https://doi.org/10.1371/journal.pcbi.1007556.

Mahmoud, M., Gobet, N., Cruz-Dávalos, D. I., et al. (2019). Structural variant calling: the long and the short of it. Genome Biology, 20, 246. https://doi.org/10.1186/s13059-019-1828-7.

Nunn, A., Otto, C., Fasold, M., Stadler, P. F., & Langenberger, D. (2021). Manipulating base quality scores enables variant calling from bisulfite sequencing alignments using conventional Bayesian approaches. https://doi.org/10.1101/2021.01.11.425926.

Olson, N. D., Lund, S. P., Colman, R. E., Foster, J. T., Sahl, J. W., Schupp, J. M., Keim, P., Morrow, J. B., Salit, M. L., & Zook, J. M. (2015). Best practices for evaluating single nucleotide variant calling methods for microbial genomics. Frontiers in Genetics, 6. https://doi.org/10.3389/fgene.2015.00235.

Xu, C. (2018). A review of somatic single nucleotide variant calling algorithms for next-generation sequencing data. Computational and Structural Biotechnology Journal, 16, 15–24. https://doi.org/10.1016/j.csbj.2018.01.003.

Yu, Z., & Du, F. (2021). Enhanced Bayesian detection for copy number alterations from next-generation sequencing data. IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2021, pp. 2931-2936. https://doi.org/10.1109/BIBM52615.2021.9669548.