Identification of SNPs in rice GPAT genes and in silico analysis of their functional impact on GPAT proteins

SNPs are the most common nucleotide variations in the genome. Functional SNPs in the coding region, known as nonsynonymous SNPs (nsSNPs), change amino acid residues and affect protein function. Identifying functional SNPs is an uphill task as it is difficult to correlate between variation and phenotypes in association studies. Computational in silico analysis provides an opportunity to understand the SNPs functional impact to proteins and facilitate experimental approaches in understanding the relationship between the phenotype and genotype. Advancement in sequencing technologies contributed to sequencing thousands of genomes. As a result, many public databases have been designed incorporating this sequenced data to explore nucleotide variations. In this study, we explored functional SNPs in the rice GPAT family (as a model plant gene family), using 3000 Rice Genome Sequencing Project data. We identified 1056 SNPs, among hundred rice varieties in 26 GPAT genes, and filtered 98 nsSNPs. We further investigated the structural and functional impact of these nsSNPs using various computational tools and shortlisted 13 SNPs having high damaging effects on protein structure. We found that rice GPAT genes can be influenced by nsSNPs and they might have a major effect on regulation and function of GPAT genes. This information will be useful to understand the possible relationships between genetic mutation and phenotypic variation, and their functional implication on rice GPAT proteins. The study will also provide a computational pathway to identify SNPs in other rice gene families.

Rice genome was initially sequenced among cereal crops. It paved the path to sequence other complex plant genomes. The impact of sequencing rice genome was immediate: elicited high citations, DNA marker usage and research groups curated public databases (Jackson, 2016). An easy access to genome data stored in public databases provides opportunity to identify and explore nucleotide variations.
2 SNPs are the key nucleotide variations found predominantly in genome, and are simplest in form. For instance, in the human genome, 90% of the variations belong to SNPs, located after every 100-300bp, and 170 bp in rice, but their densities may vary in different regions (Goff et al., 2002;Nelson et al., 2004;Kharabian, 2010). They have a wide presence in the genome and can be found in any part such as intronic region, mRNA, or in the intergenic regions.
SNPs are involved in many processes like the occurrence of disease in humans where a single nucleotide variation can be responsible for life-threatening disease (Lek et al., 2016). In plants they can induce physiological, biochemical and phenotypic changes and alter function (Gailing et al., 2009;Majeed et al., 2019;Sandhu et al., 2020). SNPs are named and categorized according to the region they are present and functions they perform. Among them, SNPs in the coding regions, termed as non-synonymous SNPs (nsSNPs) or missense SNPs are particularly important as they are responsible to change amino acid residues. These nsSNPs, therefore influence protein function, provide harmful or neutral effects to the protein structure, or reduce protein solubility (Zhang et al., 2018) . They can alter gene regulation, protein charge, or either change inter and intra protein interactions. Therefore, they are crucial and it is beneficial to catalogue functional SNPs in different species. Although nsSNPs are important but SNPs including the coding synonymous SNPs (SNPs present in coding regions, but don't change the amino acid residues) or SNPs positioned outside the coding regions still have functions. They are attributed to impact gene expression due to changes in regulatory elements, exonic splice enhancers, binding of a transcription factor or in splicing processes (De Alencar and Lopes, 2010).
With the passage of time SNPs are gradually taking the place of SSRs, to use as markers in breeding purposes, as they are stable, efficient, have high presence, and bear less cost (McCouch et al., 2010;Mammadov et al., 2012). Marker-assisted selection is one of the methods used to attain molecular markers, discovered in large scale SNP-genotyping. Similarly, GWAS studies are also used to establish relationships with the SNP and the candidate genes (Zhang et al., 2018).
Identification of SNPs is an initial yet tedious task to explore functional effects of SNPs and their correlation with the phenotype(Tibbs Cortes et al., 2021), as it needs multi testing of hundreds of SNPs in the gene of interest. Still, the question would be which SNP set required for selection is important for association studies success, providing the strong reason for SNP variation. Many SNP assays have been employed, but because of not having accurate phenotypic and genotypic data, it's not easy to conduct these experiments. A particular breeding segregation population or near-isogenic line would be required to identify SNPS functional effects (Pea et al., 2013).
Similarly, another task after the explosion of genome wide studies is to understand functional significance of the identified SNP, and apply it to application studies. Regarding experimental assays, many SNPs from the GWAS are not as impactful to select for certain traits, and breeders found it difficult to use these molecular markers. As a successful breaded crop depend largely on the accuracy of these functional SNPs (Tibbs Cortes et al., 2021).
SNPs detected by experimental studies provide the strongest evidence to demonstrate functional significance. However, because of the lack of exactness of the phenotypic and genotypic data, these experiments are not easy to identify. An alternative, to explore the possible significant pattern of SNPs is to prioritize SNPs on their functional significance using in silico computational tools. These computational tools can identify and differentiate the neutral SNPs from the functionally significant SNPs, and infer nature of mutation and changes in protein structure caused by the particular SNP (Kharabian, 2010;Arshad and Attya Bhatti, 2018).
In addition, similar to past in silico studies, they could also provide an independent evidence source alongside experimental studies to explore functionally important SNPs (Yang et al., 2004;Jiang et al., 2010;Chaisan et al., 2012;Bhardwaj et al., 2016;Withana et al., 2020).
Advances in big data analytics and artificial intelligence systems provide opportunities to build machine learning models that can predict accurate DNA variations and protein models identified from sequence data. For instance, Alphafold recently gained appreciation among the scientific community. Alphafold is an artificial 3 intelligence and deep learning based prediction model for protein structures that achieved high accuracy even for sequences having lower homologous sequences (Senior et al., 2020;Yang et al., 2020). Such breakthroughs reveal the potential of AI on genome data and it is presumed that computational approaches will be frequently used in plants to recruit SNPs in the future (Korani et al., 2019).
There are several public databases to curate SNPs in rice genome. 3000 Rice Genome Project (3KRGP) consortium has sequenced 3000 rice varieties from 89 different countries to explore genomic diversity in rice crop, and approximately 18.9 million SNPs were identified from this project (Li et al., 2014). The consortium provides rice breeders and scientists a massive resource. In the past years, some comprehensive databases have been created using the 3KRGP data including, SNP-Seek database (Alexandrov et al., 2015), Rice Functional and Genomic Breeding (RFGB) (Wang et al., 2020a), RiceVarMap (Zhao et al., 2015). These public datasets provide opportunities to identify large-scale discovery of genomic variants associated with various traits, and can be tapped to increase yield potential. SNP-Seek database platform has incorporated millions of variants from 3000 rice accessions and provide easy access to mine alleles (Mansueto et al., 2016).
In this study, we aim to analyze all the functional SNPs in the rice Glycerol-3-phosphate acyltransferase (GPAT) gene family, using 3 K RGP data among 100 Chinese rice varieties from different geographical regions.
We used rice GPAT genes as a model family, we previously identified in a genome wide study  to demonstrate in silico analysis in a gene family. We used different computational tools to explore SNPs, and their impact to the structure and function on rice GPAT proteins and prioritized SNPs having significant impact on GPAT proteins. The study will provide useful information about important SNPs that can affect protein functions and would be useful in future investigations.

SNP Dataset and missense SNP identification in GPAT genes
We retrieved the SNP data set of 100 rice accessions from the SNP-Seek database (http://snpseek.irri.org) (Alexandrov et al., 2015) at each rice GPAT locus. We used 26 GPAT locus positions identified in a genome wide study (Table S6). Then, we searched each locus position among 100 rice varieties in the SNP- Seek database and find all the SNPs in 26 GPAT genes. Afterwards, we filtered all missense or nonsynonymous SNPs (nsSNPs) among initially identified SNPs. We find and documented nsSNPs details (SNP position, protein remnant change) to investigate further.

SNPs identification with damaging effects
We used three computational databases to explore functional impact of nsSNPs on protein structures including, Protein Variation Effect Analyzer (PROVEAN) [http://provean.jcvi.org/index.php] Choi and Chan, 2015), Sorting Intolerant from Tolerant (SIFT) (Ng and Henikoff, 2003), and PolyPhen-2 (Adzhubei et al., 2010). These computational tools predicts whether an amino acide substitution by a DNA variant affects protein function based on sequence homology and physical properties of amino acids. The nsSNPs having a damaging or deleterious effect identified by these three tools were regarded as high risk nsSNPs and taken further for more investigation to analyze their putative effect.
SNPs influence on structural and functional properties of GPAT proteins GPAT protein sequences carrying the high risk nsSNPs identified in the previous step were further examined along with the mutated amino acid residues, and submitted to the MutPred v1.2 (Li et al., 2009a) database. The MutPred database investigates the outcome of mutations at proteins and predicts the molecular mechanism associated with the mutation and also provides different gain and loss structural properties of a protein.

Effects on protein stability
We used I-Mutant to examine protein stability affected by nonsynonymous SNPs by submitting the normal and mutated amino acid sequences. I-Mutant evaluates protein stability changes or any structural change after variation. The tool provides a relativity RI index (RI) of results, ranges 0 to 10 score, showing the reliability of the score.

Evolutionary conservation of amino acid positions
We used ConSurf tool to determine amino acids evolutionary conservation, affected by nsSNPs; to analyze if this amino acid position is highly conserved in a protein sequence. Consurf analyzes the degree of evolutionary conservation using 50 homologous sequences from different species based on phylogenetic relation. We considered those positions important, if they were highly conserved and located on sites affected by nsSNPs. (Ashkenazy et al., 2010;Celniker et al., 2013;Ashkenazy et al., 2016).

3D modeling of protein structures and RMSD calculation
Native and mutated protein structure models were generated using protein homology tools to evaluate effects to protein structure caused by high risk nsSNPs. Phyre2 (Kelley et al., 2015) was used to generate the protein models, and these structures were further viewed by Chimera (Pettersen et al., 2004). In addition, we used two more tools including Root Mean Square Deviation (RMSD), and Template Modeling Score (TMalign). RMSD indicates variation between two protein structures; a high score shows more variation between native and mutant structure (Carugo and Pongor, 2001;Zhang and Skolnick, 2005).
Post translational modification (PTM) sites Post translation modification sites (PTM) in the rice GPAT Proteins were predicted including the phosphorylation, ubiquitylation and methylation sites. We used two computational tools to predict every PTM event. Phosphorylation PTM sites at Serine (S), Threonine (T), and Tyrosine (Y) amino acid residues were predicted using GPS 5.0 (Xue et al., 2005) and NetPhos 3.1 (Blom et al., 1999). In NetPhos 3.1, residues having at least 0.5 score were considered as phosphorylated. Ubiquitylation PTM sites were predicted using BDM-PUB (Li et al., 2009b) and UbPred (Radivojac et al., 2010). In UbPred only those residues were considered having a 0.62 or more score. Methlyation sites were predicted using PSSMe (Wen et al., 2016) and iMethyl-PseAAC (Qiu et al., 2014) tools. PSSMe predicted those lysine or arginine residues that had higher probability ratios.

Results Results Results
Functional SNPs extracted from the rice SNP-seek database We used the rice SNP-seek database to explore all the SNPs in 26 GPAT genes. We used rice GPAT gene family as a model plant gene family, we reported in a genome wide study . SNP-seek database comprised 18 million SNPs identified in 3KRGP. To recruit SNPs, we shortlisted 100 Chinese rice accessions from different geographical regions (including indica, japonica) (Table S1). Initially we find 1056 SNPs, among 100 varieties at each GPAT locus; 99 were found in the UTR regions, 683 in the intronic regions and 140 in the codon synonymous region (Figure 1). We find 98 SNPs (non-synonymous SNPs or nsSNPs) in the coding regions, which can influence protein structure and function (Table 1). We selected these 98 nsSNPs for further investigation.   amino acid residues. These tools predict functional effects of amino acid substitutions to the protein structure (Table 2). Each tool has a cut-off value; a nsSNP is considered functional below this value (Table 2.). For instance, PROVEAN score is -2.5, below this cutoff, the substitutions are considered "deleterious", having high risk or deleterious effects on protein, and above this cutoff variant are considered Neutral. We considered nsSNPs as high risk or having damaging effects, if predicted by 2 of these tools. We find at least 13 nsSNPs (Table 2) in different genes having high risk or damaging effects on the protein structure.

Structural and functional modifications
We submitted all the high risk nsSNPs to MuPred, to explore functional changes due to amino acid variations (Table 3). Deleterious nsSNPs influenced several mechanisms; some with high probability score in Arg355Pro (altered trans membrane protein), Pro157Leu (gain of helix), E62 (gain of Acetylation, loss of loop). Variations in V492M and V549I affected most mechanisms. Some amino acid variations had low probability score; therefore, no mechanism was detected. These functional changes revealed nsSNPs could influence functional variation in GPAT proteins. Table 3. Table 3. Table 3. Effects on protein stability We used I-Mutant to analyze stability changes to GPAT proteins by the nsSNPs. This tool predicts the reliability index (RI) and energy change values. The results revealed that all the variations decreased the stability of GPAT proteins (Table 4). Protein stability is vital to maintain its three dimensional structure; a reduction in stability causes protein denaturation, unfolding, and protein aggregation. (Ortbauer et al., 2013;Deller et al., 2016). Free Energy change value (DDG); Reliability index (RI); Tm-Align score -similarity between native and mutant protein 3D models; Root mean square difference (RMSD)-variation between wild type and mutant protein; 3D model score Evolutionary conservation of amino acid residue positions in rice GPAT genes We used Consurf web tool to infer the evolutionary conservation of amino acid residues positions effected by nsSNPs. Consurf analyzes amino acid residues and allocate a conservation scale ranging between highly conserved, average, and variable to each amino acid residues, where variable reflect the lowest score. The score is calculated by merging the solvent accessibility predictions and evolutionary conservation data. Table 5 demonstrates the conservation score of some high-risk nsSNPs in each GPAT gene. We found that amino acid positions R355P, P157L, F128L, V549I were highly conserved, whereas R430H, E62K, A502P, D36H were conserved. These results suggested amino acid residues affected by nsSNPs are located at evolutionary conserved positions. We have shown two highly conserved and exposed nsSNPs structures in Figures 2A and 2B respectively.

Structural analysis and modeling of wild type and mutant protein 3D structures
To explore whether the high risk nsSNPs impact protein structure, we generated 3D protein models of native and variant structures in genes carrying high impact nsSNPs mentioned in Table 4. We initially used Phyre2 that generated structures in pdb format, and then visualized by using Chimera. Besides, we analyzed the TM-align score and RMSD scores for the protein models (Table 4). A higher RMSD value demonstrates the deviation between mutant and the native model. Based on RMSD values, most mutant models showed high variation in structures with 2 or more RMSD value. Similarly, no model revealed low or zero value, demonstrating each nsSNPs have affected protein structures in some capacity. Finally, to demonstrate 3D structures, we selected four proteins having at least 2 RMSD values and low TM values. Figure 3 display the location of amino acid substitutions in each protein native and variant model. In these models, there were residual changes, along with variation in parameters including total energy, decrease in protein stability, suggesting they can influence protein folding. We further superimposed these variant models in figure 3 over wild types to explore the structural difference between them. Every superimposed model showed a difference in 3D structure, suggesting these nsSNPs affected the protein structure ( Figure 4). 12 Figure 2A. Figure 2A. Figure 2A. Amino acid residues conservation profile demonstrating the extent of evolutionary conservation. The conservation scale ranged from 1-10, where 10 is the highly conserved and 1 the lowest. Protein regions associated with function tend to be highly conserved over the course of evolution 13 Figure 2B. Figure 2B. Figure 2B. Figure 2B. Evolutionary conservation profile in OsGPAT 8 demonstrating R355P amino acid position Amino acid residues conservation profile demonstrating the extent of evolutionary conservation. The conservation scale ranged from 1-10, where 10 is the highly conserved and 1 the lowest. Protein regions associated with function tend to be highly conserved over the course of evolution 14 Figure 3.       Post translational modifications (PTM) influence protein conformation, stability, activity, localization and their interactions. They provide proteome diversity, impact signaling pathways, influence gene expression and enzyme kinetics (Piquerez et al., 2014;Friso and van Wijk, 2015). Therefore, we estimated if nsSNPS resided on the PTM sites and they had any influence on PTM sites. We utilized many in silico tools to predict PTM sites in GPAT proteins. We used at least two in silico tools for each PTM event including the phosphorylation, methylation and ubiquitylation and selected positions predicted by both tools. We also explored whether any PTM site was conserved as well.
We investigated Phosphorylation sites by GPS 5.0 and NetPhos 3.1; both tools predicted 100 sites in various rice GPAT proteins (Table S2). 79 amino acid residues were Serine, 16 residues were Threonine and only 4 residues were Tyrosine. For methylation PTMs of lysine residues PSSMe and iMethyl-PseAAC tools were used. We considered those sites predicted by both these tools at 0.5 SVM Probability thresholds. There were 127 lysine residues that were common in both tools (Table S3). We used BDM-PUB and Ub-Pred to predict Ubiquitylation sites. 36 common amino acid sites were predicted for having a potential ubiquitylation site (Table S4). We used Consurf results to explore the highly conserved PTM sites (Table S5). But we could not find any nsSNP coincided with the conserved PTM sites. Past studies have demonstrated a variation in the conserved PTM site affect protein function.

Discussion Discussion Discussion
The current study demonstrates various nonsynonymous SNPs could disturb rice GPAT protein structure and function. We filtered the nsSNPs (Table 1), shortlisted the high risk or deleterious SNPs (Table  2), and analysed their impact on protein structure and functions.
Initially, we find 1056 SNPs among 100 Chinese rice varieties in 26 GPAT genes (Table S6) as a model gene family, we identified in a genome-wide study . We shortlisted 98 coding nsSNPs from these 1056 SNPs, that can change amino acid residues and alter protein function. We further used various computational tools including Provean, SIFT, Polyphen-2 to find significant non-synonymous SNPs having deleterious or potentially damaging effects to proteins. Since every computational tool uses different algorithms, their output could slightly differ; still, we find similarity in the predictions. For instance, many nsSNPs predicted as deleterious by an individual tool (Table 2), were often predicted by a second tool. Though these nsSNPs (predicted by a single tool in Table 2) can't be neglected and should be considered important, we only selected nsSNPs predicted by all the tools to strengthen our results. Hence, we found 13 SNPs (predicted by all three tools), considered as high-risk nsSNPs among the 98 nsSNPs (Table 3). We also explored functional PTM amino acid residues at phosphorylation, ubiquitylation and methylation sites, and investigated if the nsSNPs coincide with the PTM site; as modification in PTM sites can alter protein function.
Therefore, SNP identification and their functional elucidation provide a better understanding of their effects on gene functions and their phenotypes. Identifying functional SNPs is a complicated task in large population studies. Hence, it is beneficial to prior explore putative functional SNPs. Past studies suggest nsSNPs influence different mechanisms. Biosynthesis of secondary metabolite, and signal transduction pathways are involved in fruit ripening and defense responses. A change in the cell wall structure and starch conversion to sugar, play roles in fruit ripening. Deleterious nsSNPs are involved in these signal transduction and metabolic pathways. An SNP in tomato invertase gene changed amino acid residue near proteins catalytic site and affected enzyme activity. In rice, 66 functional SNPs were discovered from 18 genes involved in starch biosynthesis. An important SNP was reported at the 1188 nucleotide position in Glucose-6-Phosphate Translocator 1 (GPT1), changed amino acid residues associated with amylase content. In another study, SNPs in several genes were identified involving in maize root development and associated with seedling root traits. Four functional SNPs in the HSP17.8 gene in barley varieties were associated with agronomic traits. Similarly, in tomato, functional SNPs affect gene expression and are associated with phenotypic differences among the tomato lines. These examples demonstrate functional SNPs influence gene functions, reflecting the significance of the current study, as the identified nsSNPs could have a major effect on GPAT genes (Kharabian-Masouleh et al., 2012;Hirakawa et al., 2013;Seymour et al., 2013;Xia et al., 2013;Kumar et al., 2014;Schreiber et al., 2014;Bhardwaj et al., 2016;Huq et al., 2016;Zaynab et al., 2018).
In past studies we find, in-silico approaches used in humans, plants, and identified functionally significant SNPs (De Alencar and Lopes, 2010;Kharabian, 2010;Kamaraj and Purohit, 2013;Arshad and Attya Bhatti, 2018;Zhang et al., 2020). In rice, the Granule Bound Starch Synthase I GBSSI was used as a model plant gene and explored functional SNPs. The study identified a candidate SNP imparting a major impact on GBSSI and its phenotype. This nsSNP at exon 6, showed the highest effect on amylose content according to the SIFT prediction results (Kharabian, 2010), also reported in a previous study (Chen et al., 2008), suggesting coherence between in silico analysis results and experimental approaches. Although experimental approaches provide strong evidence for estimating SNPs functional significance, they are tedious, lack exactness in phenotypic and genotypic data, so these experiments are not easy to detect SNPs (Cobb et al., 2013). Besides, high false positive results crop up due to population structures and multiple testing. These results, therefore, mislead when a casual SNP is considered more significant in contrast to a truly casual variant (Tibbs Cortes et al.).
Computational methods are used to explore functional variants to reduce the burden from statistical association studies. Therefore, in silico tools provide an opportunity to facilitate experimental approaches and they can be used to recruit, identify and characterize functionally important SNPs having a major impact on phenotype (Tam et al., 2019). In this regard, an in-silico study identified SNP diversity in cultivated and wild tomato genomes, investigated 1838 nsSNPs among 988 genes. There were 28 deleterious SNPs distributed among 27 genes predicted in hormonal and plant pathogen pathways similar to the current study. Further, they selected nsSNPs deleterious effect on the protein structure (Bhardwaj et al., 2016). Likewise, a study explored important SNPs in the TGF-β receptor type 3 gene in chicken by using tools Sift, PANTHER, and I-mutant, and found a nsSNP in the coding region with deleterious impact, decreasing the protein stability (Rasal et al., 2015). Another study predicted putative effects of SNPs on 58 Prunus rootstocks genes using SnpEff. The SNPs were categorized as a modifier, low, moderate, and high impact; high impact SNPs were further explored using in silico tools (Guajardo et al., 2020).
Our results identified many high impact nsSNPs found in GPATs conserved protein regions. For instance, nsSNPs including P157L, R355P, F128L, V549I, and A32V located at highly conserved positions (Table 5). Past studies have suggested mutations in conserved regions can affect protein functions. Mutations in the ZP domain in the TGFBR3 gene affected its protein function; a conserved amino acid is essential for product and substrate specificity in triterpene synthases. Regarding variation between native and variant protein structures, we found most protein models had high RMSD value (two or more), reflecting differences between native and variant models affecting the protein stability (Jovine et al., 2002;Han et al., 2006;Salmon et al., 2016;Islam et al., 2019;Liao et al., 2019;Bhardwaj and Purohit, 2020).
We could not find any PTM site coincided with the nsSNP, but many PTM sites were highly conserved. Past studies demonstrated PTM sites influence protein function, if there is a variation in their conserved positions, and affect protein stability or inter protein structures (Arif et al., 2017;Gulzar et al., 2017).

Conclusions Conclusions Conclusions Conclusions
Briefly, in the study we identified all candidate SNPs most likely to affect rice GPAT proteins and influence their function. We demonstrated in silico tools could help us to characterize functional SNPs which possibly have potential impact on GPAT genes and related phenotypes. These functional SNPs could provide value in developing functional markers by associating their link to phenotypic traits. The study will also provide a computational pathway to find candidate SNPs in other rice gene families.    Table S2. Table S2. Table S2.