Genome-wide identification of WD40 superfamily genes and prediction of WD40 genes involved in flavonoid biosynthesis in Ginkgo biloba

The WD40 transcription factor family is a superfamily found in eukaryotes and implicated in the regulation of growth and development. In this study, 167 WD40 family genes are identified in the Ginkgo biloba genome. They are divided into 5 clusters and 16 subfamilies based on phylogenetic analysis and domain structures. The chromosomal distribution, gene structures, and motifs of the WD40 genes are analyzed. Promoter analysis shows that the promoters of five GbWD40 genes contain the MYB binding site involved in the regulation of flavonoid metabolism, suggesting that these five genes may participate in the regulation of flavonoid synthesis in G. biloba. A correlation analysis is carried out based on the FPKM values of WD40 genes and the flavonoid contents in 8 tissues of G. biloba, and six GbWD40 genes that may participate in flavonoid metabolism are screened. The biological functions of the WD40 family genes in G. biloba are systematically analyzed, providing a foundation for further elucidating their regulatory mechanisms. A number of WD40 candidate genes involved in flavonoid biosynthesis and metabolism in G. biloba are also predicted. This study provides an important basis and direction for further research on the regulatory network of flavonoid synthesis and metabolism.


Introduction
Ginkgo biloba L. is a typical relict gymnosperm that survived the Quaternary glaciation. G. biloba, also known as the white fruit tree, Gongsun tree, and "living fossil," is a deciduous tree with a history of about 270 million years, over which its shape and structure have changed only slightly (Gong et al., 2008; Liu et al., 2017). G. biloba, native to China, is an ancient dioecious plant with important medicinal properties (Lin et al., 2011). It is rich in natural active ingredients, such as flavonoids and terpene trilactones (TTLs). Among them, flavonoids are commonly used as an herbal dietary supplement for the treatment of many diseases, and they can improve the cognitive ability of patients with Alzheimer's disease (Albert et al., 2018; Ni et al., 2018). Flavonoids are a large class of secondary polyphenol metabolites in plants. They can be divided into six categories based on molecular structure: chalcones, flavones, flavonols, flavandiols, anthocyanins, and proanthocyanidins (PAs; also called condensed tannins). As antibacterial agents, flavonoids play an important role in the interaction and defense reactions between plants and microorganisms and have potentially beneficial effects on human health (Winkel-Shirley et al., 2001). The flavonoids in G. biloba extract are active components that play an important role in its pharmacology (Ude et al., 2013).
WD40 proteins, also known as WD40 repeat proteins, are widespread in eukaryotes (Migliori et al., 2012). They are characterized by a repeated peptide motif of 40-60 amino acids, which is usually delimited by a GH dipeptide (Gly-His) near the N terminus and a WD dipeptide (Trp-Asp) at the C terminus (van Nocker et al., 2003). WD40 proteins are widely involved in many functional processes, including signal transduction, cell division, vesicle formation, secondary metabolite synthesis, transcription regulation, cell cycle regulation, and chromatin histone modification (van Nocker et al., 2003; Suganuma et al., 2008; Xu et al., 2011). They act as adaptors in many protein complexes or protein-DNA complexes (Pesch et al., 2015) and have been identified in plants, humans, and prokaryotes (Hu et al., 2011; Zou et al., 2016; Feng et al., 2019). A total of 743 WD40 proteins are identified in the wheat genome and divided into 5 clusters and 11 subfamilies (Hu et al., 2018). A total of 220 WD40 protein family genes are identified in the peach genome (Feng et al., 2019). In Oryza sativa, a monocotyledonous model plant, 200 WD40 family members are divided into 5 clusters and 11 subfamilies based on their domain composition (Ouyang et al., 2012). The WD40 protein family has been studied in many plants, but it has rarely been reported in G. biloba. Many findings have shown that WD40 proteins are involved in flavonoid regulation (Xie et al., 2016; Chen et al., 2019). Some studies have demonstrated that flavonoid synthesis regulators interact with R2R3-MYBs or bHLHs, which either increase or decrease the flavonoid contents. The interaction between WD40, MYB, and bHLH transcription factor proteins has been widely studied, and flavonoid biosynthesis may be a good model for this interaction (Nakatsuka et al., 2008).
Ye et al. (2019) carried out third-generation full-length transcriptome sequencing of eight different tissues of G. biloba, revealed 12 structural genes and transcription factor modules involved in flavonoid biosynthesis, and identified 7 hub genes participating in flavonoid biosynthesis and metabolism. Wu et al. identified the key genes involved in flavonoid synthesis, transport, and regulation through the transcriptome sequencing of G. biloba with different flavonoid contents. However, few studies have been conducted on the WD40 genes involved in the flavonoid biosynthesis of G. biloba. In this study, the WD40 family genes in G. biloba are comprehensively and systematically identified and analyzed on the basis of the genomic data of G. biloba (http://gigadb.org/dataset/100613) (Guan et al., 2019) and the third-generation transcriptome sequencing data of G. biloba obtained previously (Ye et al., 2019). The GbWD40 genes related to flavonoid biosynthesis are screened through a correlation analysis with the flavonoid contents in eight different tissues of G. biloba. This work is performed to comprehensively identify the WD40 family genes in the Ginkgo genome by (1) determining the exact number and IDs of the WD40 family genes in the Ginkgo genome; (2) phylogenetically analyzing the identified GbWD40s together with the WD40 family genes of Arabidopsis thaliana and subclassifying the WD40 family members of G. biloba in combination with their domain structures; (3) analyzing the structure of the genes and proteins of each GbWD40 member; (4) evaluating the chromosomal distribution; and (5) carrying out a correlation analysis based on FPKM data and flavonoid contents in eight tissues of G. biloba and screening out the GbWD40s significantly correlated with flavonoid contents. Our study provides a basis for revealing the synthesis, metabolism, and content variation of flavonoids in G. biloba.

Materials and Methods
Plant materials and determination of flavonoids
The plant materials of G. biloba were collected from 31-year-old trees at the Ginkgo Germplasm Repository of Yangtze University (N30.35, E112.14), China. The 8 independently sampled tissues included root (R), stem (S), immature leaf (IL), mature leaf (ML), microstrobilus (M), ovulate strobilus (OS), immature fruit (IF), and mature fruit (MF). The tissue samples were immediately frozen in liquid nitrogen and stored in an ultra-low-temperature freezer for further analysis. The flavonoid contents of the 8 tissues were determined according to the method of Ye et al. (2019), with three biological replicates and six technical replicates for each biological replicate.
Identification of WD40 superfamily genes in G. biloba
To identify WD40 proteins in G. biloba, the whole-proteome sequences of G. biloba were downloaded from the GIGADB (http://gigadb.org/dataset/100613, 2019), and the whole-proteome sequences of A. thaliana were downloaded from TAIR (https://www.arabidopsis.org/, TAIR 10). The hidden Markov model (HMM) profile of the WD40 domain (PF00400) was downloaded from Pfam (http://pfam.xfam.org/family/PF00400). The hmmsearch program (HMMER 3.0 package, https://hmmer.org/) was run against the whole-proteome sequences by using the HMM of the WD40 domain (PF00400) as the query with an E-value ≤ 10^-5. To confirm the presence of the WD domain in each protein, the protein sequences were uploaded to the Batch CD search (https://www.ncbi.nlm.nih.gov/). After redundant sequences were removed, all candidate proteins were evaluated via Pfam (https://pfam.xfam.org) (El-Gebali et al., 2019) and SMART (https://smart.embl-heidelberg.de/) (Ponting et al., 1999). The protein, coding (CDS), and genomic sequences were extracted from the Ginkgo genome database by using the FASTA extraction tool of TBtools (Chen et al., 2020). The basic physical and chemical parameters (primary structure) of each protein, including the number of amino acids, molecular weight (Mw), theoretical pI, aliphatic index, and grand average of hydropathicity (GRAVY), were collected from the ProtParam website (https://web.expasy.org/protparam/).
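The E-value filtering step described above can be illustrated with a short sketch. It assumes that hmmsearch was run with its `--tblout` option, whose per-target table lists the target name in the first column and the full-sequence E-value in the fifth; the source does not state which output format was actually used, and the sequence names below are hypothetical.

```python
def filter_hmmsearch_tblout(lines, max_evalue=1e-5):
    """Return target names whose full-sequence E-value is <= max_evalue.

    Assumes the whitespace-delimited layout of `hmmsearch --tblout`:
    column 1 = target name, column 5 = full-sequence E-value.
    """
    hits = []
    seen = set()
    for line in lines:
        # Skip blank lines and the '#' comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        target, evalue = fields[0], float(fields[4])
        if evalue <= max_evalue and target not in seen:
            seen.add(target)
            hits.append(target)
    return hits


# Hypothetical table rows (not taken from the study).
example = [
    "# target name accession query name accession E-value score bias ...",
    "seqA - WD40 PF00400.35 1.2e-30 105.3 0.1 3.4e-08 33.1 0.0 6.5 6 0 0 6 6 6 5 -",
    "seqB - WD40 PF00400.35 2.0e-03 18.2 0.0 2.0e-03 18.2 0.0 1.0 1 0 0 1 1 1 0 -",
    "seqC - WD40 PF00400.35 4.0e-09 40.1 0.0 5.0e-05 27.0 0.0 2.1 2 0 0 2 2 2 1 -",
]
passing = filter_hmmsearch_tblout(example)
print(passing)
```

Candidates that pass this filter would still need the domain confirmation steps (CD search, Pfam, SMART) described in the text.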

Chromosome locations of GbWD40s
The General Feature Format (GFF) annotation file was downloaded from the GIGADB (http://gigadb.org/dataset/100613, 2019), and the physical locations of all GbWD40 genes on the chromosomes were drawn using TBtools (Chen et al., 2020).
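As an illustration of how gene positions can be tallied from an annotation file, the sketch below counts selected genes per chromosome. It assumes a GFF3-style file with nine tab-separated columns and an `ID=` attribute on `gene` features; the attribute layout of the actual Ginkgo annotation is not given in the source, and the identifiers are hypothetical.

```python
from collections import Counter


def count_genes_per_chromosome(gff_lines, gene_ids):
    """Count how many of the given gene IDs fall on each sequence (column 1)."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Keep only well-formed 'gene' features.
        if len(cols) != 9 or cols[2] != "gene":
            continue
        # Column 9 holds key=value pairs separated by ';'.
        attrs = dict(item.split("=", 1) for item in cols[8].split(";") if "=" in item)
        if attrs.get("ID") in gene_ids:
            counts[cols[0]] += 1
    return counts


# Hypothetical annotation records.
gff = [
    "##gff-version 3",
    "\t".join(["Chr1", "src", "gene", "100", "900", ".", "+", ".", "ID=geneA;Name=x"]),
    "\t".join(["Chr1", "src", "mRNA", "100", "900", ".", "+", ".", "ID=geneA.t1;Parent=geneA"]),
    "\t".join(["Chr2", "src", "gene", "50", "700", ".", "-", ".", "ID=geneB"]),
    "\t".join(["Chr2", "src", "gene", "800", "950", ".", "+", ".", "ID=geneC"]),
]
counts = count_genes_per_chromosome(gff, {"geneA", "geneB"})
print(dict(counts))
```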
Phylogenetic analysis and classification of WD40 proteins in G. biloba
The WD40 gene sequences of A. thaliana were downloaded from the TAIR database (https://www.arabidopsis.org/index.jsp) and extracted using TBtools. These 100 WD40 superfamily proteins of A. thaliana were combined with the WD40 proteins of G. biloba, and a phylogenetic tree was constructed using MEGAX software (https://www.megasoftware.net). All protein sequences were aligned by using the MUSCLE algorithm, and the tree was constructed with the neighbor-joining (NJ) method with a bootstrap test (1000 replicates). The phylogenetic tree was classified according to its topology and evolutionary relationships.
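Two ingredients of a distance-based tree with bootstrap support can be sketched briefly: a pairwise distance computed from aligned sequences, and the resampling of alignment columns that produces one bootstrap replicate. The p-distance used here is an assumption for illustration only; the source does not state which substitution model was selected in MEGAX, and the sequences are toy examples.

```python
import random


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    return differing / compared if compared else 0.0


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


# Toy aligned sequences (hypothetical).
aln = {"a": "MKV-L", "b": "MRVAL"}
print(p_distance(aln["a"], aln["b"]))
replicate = bootstrap_alignment(aln, random.Random(0))
print(replicate)
```

In a full analysis, a distance matrix would be computed for each of the 1000 replicates, an NJ tree built from each, and the fraction of replicate trees containing a given clade reported as its bootstrap support.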
The gene structure prediction and motif distribution of the WD40 superfamily members in G. biloba
The exon-intron structures of the GbWD40s were predicted by GSDS (http://gsds.cbi.pku.edu.cn/index.php) (Hu et al., 2015). The conserved motifs of the GbWD40 proteins were searched in MEME 5.1.0 (http://meme-suite.org/tools/meme) (Bailey et al., 2009) with a maximum of 20 motifs and otherwise default parameters, and the results were drawn by TBtools (Chen et al., 2020).
Promoter analysis of GbWD40 genes
The 2000 bp genomic DNA sequences upstream of the WD40 genes were downloaded and submitted to PlantCARE (Lescot et al., 2002) to predict putative cis-elements.
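Extracting an upstream region depends on the strand of the gene, which the following sketch makes explicit. It computes the 1-based inclusive coordinates of the 2000 bp window and assumes the window is measured from the annotated gene boundary; whether the study measured from the gene start or the start codon of each model is not specified in this section, and the coordinates are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def upstream_window(start, end, strand, chrom_length, size=2000):
    """Return (start, end) of the region upstream of a gene, or None if empty.

    Coordinates are 1-based and inclusive. For '+' genes the window lies
    before `start`; for '-' genes it lies after `end`.
    """
    if strand == "+":
        win_start, win_end = max(1, start - size), start - 1
    elif strand == "-":
        win_start, win_end = end + 1, min(chrom_length, end + size)
    else:
        raise ValueError("strand must be '+' or '-'")
    return (win_start, win_end) if win_start <= win_end else None


def reverse_complement(seq):
    """Reverse-complement a DNA string (needed for '-' strand promoters)."""
    return seq.translate(_COMPLEMENT)[::-1]


print(upstream_window(5001, 8000, "+", 10000))
print(upstream_window(5001, 8000, "-", 9000))
print(reverse_complement("ATGC"))
```

For a minus-strand gene, the extracted window would be reverse-complemented so that the promoter reads 5' to 3' relative to the gene before submission to a cis-element predictor.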

Correlation analysis of flavonoids
The fragments per kilobase of transcript per million mapped fragments (FPKM) values of the WD40 genes were obtained from the full-length transcriptome of G. biloba (Ye et al., 2019), and the correlation coefficients between these values and the flavonoid contents were determined with OmicShare tools, a free online platform (http://www.omicshare.com/tools).
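The kind of calculation involved can be sketched as a Pearson correlation between the expression of one gene and the flavonoid content across the eight tissues. The choice of Pearson's r is an assumption, since the source does not state which coefficient the OmicShare platform was set to compute, and all numbers below are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values for the eight tissues (R, S, IL, ML, M, OS, IF, MF).
fpkm = [2.1, 3.4, 15.2, 12.8, 6.0, 5.5, 4.1, 3.9]
flavonoid = [0.3, 0.5, 2.4, 2.0, 1.1, 0.9, 0.7, 0.6]
r = pearson_r(fpkm, flavonoid)
print(round(r, 3))
```

Genes whose coefficient exceeds a chosen threshold, with an accompanying significance test, would then be retained as candidates.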
Gene ontology (GO) annotation
According to the annotations of the full-length transcriptome of G. biloba (Ye et al., 2019), the GO annotations of WD40 family genes were classified and counted.

Results
Identification of WD40 genes in G. biloba and chromosomal distribution of GbWD40 genes
A total of 167 GbWD40 family genes are identified on the basis of the HMM results and the verification of conserved motifs. For convenience, the 167 WD40 genes are sorted by their positions on the Ginkgo chromosomes and named GbWD40-001 to GbWD40-167 (Figure 1, Table S1). The length of the WD40 protein sequences ranges from 99 aa to 3230 aa, with an average of about 620 aa. The relative molecular mass of the WD40 family proteins ranges from 11,116.5 Da (GbWD40-015) to 361,146.48 Da (GbWD40-052), with an average of 68,670.55 Da. The theoretical pI ranges from 4.51 to 9.62, and 68 WD40 proteins have a pI above the average of 6.93. The instability index ranges from 18.86 to 59.78, with an average of 41.11. A total of 96 WD40 protein sequences are considered to be unstable, and 71 are considered to be stable (Table S1). The grand average of hydropathicity (GRAVY) ranges from −0.822 to 0.331, with an average of −0.28. The length of the 167 WD40 genomic sequences ranges from 425 bp to 448,032 bp, with an average of 96,329 bp.
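To make the GRAVY and stability figures above easier to interpret, the sketch below computes GRAVY as the mean Kyte-Doolittle hydropathy of the residues and applies the ProtParam convention that an instability index above 40 marks a protein as potentially unstable. The threshold is taken from the ProtParam documentation rather than from this text, and the peptide is a made-up example.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathicity: summed hydropathy divided by length."""
    if not sequence:
        raise ValueError("empty sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence.upper()) / len(sequence)


def is_predicted_unstable(instability_index, threshold=40.0):
    """ProtParam convention (assumed here): index above 40 suggests instability."""
    return instability_index > threshold


print(round(gravy("MKWDGHLS"), 3))    # hypothetical peptide
print(is_predicted_unstable(41.11))   # average index reported above
```

A negative GRAVY, such as the reported average of −0.28, indicates that a protein is hydrophilic on balance.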
Phylogenetic analysis and subfamily classification of WD40 proteins in G. biloba
A phylogenetic analysis is conducted using the NJ method to identify the evolutionary relationships among the GbWD40 protein members. The WD40 protein sequences of G. biloba and A. thaliana are used to construct an evolutionary tree in MEGAX based on protein sequence similarity. The GbWD40 genes are divided into five clusters (Cluster I to V) based on the evolutionary relationship between the WD40 proteins of G. biloba and those of A. thaliana, and these five clusters include 40, 54, 20, 61, and 92 WD40 members, respectively (Figure 2, Table S1); the total of 267 corresponds to the 167 G. biloba and 100 A. thaliana proteins combined. According to their domain structures, the GbWD40 proteins are divided into 16 subfamilies. Among them, 116 proteins contain only the WD40 domain and are classified as subfamily A. The 51 other GbWD40 proteins contain additional domains besides the WD40 domain, and they are classified as subfamilies B to P (Figure 3; Figure S3). Except in subfamily P, the domain structures within each subfamily are similar. This result indicates that WD40 proteins with similar domains are clustered together.
The motif distribution and gene structure of WD40 superfamily members in G. biloba
A motif analysis is carried out for each protein sequence by using the SMART online tool to confirm that the GbWD40s are WD40 superfamily genes (Figure S3). The results show that motifs 1, 2, 3, and 4 contain 21, 15, 11, and 15 amino acids, respectively, and are highly conserved in the WD40 protein sequences of G. biloba; 166, 164, 164, and 153 GbWD40 proteins contain the corresponding motifs (Figure 4). Motif 1 is the most conserved of the 20 motifs, followed by motifs 2 and 3. The sequences of motifs 1-20 correspond to logos 1-20 (Figure S2). The results suggest that proteins in the same subfamily, such as subfamilies B and E, have similar motifs (Figure 4).
The intron-exon structures of the GbWD40 genes are drawn according to the subfamily classification of the 167 GbWD40s to study the distribution of introns and exons in each gene (Figure 5). The numbers of introns and exons differ greatly among GbWD40 family genes, even within the same subfamily. The differences are especially pronounced in subfamily A, which contains 116 members. By contrast, subfamilies H, I, and J have only two members each, and their gene structures vary only slightly. Among the 167 GbWD40 genes, GbWD40-052 has the most introns (36), followed by GbWD40-069 (31) and by GbWD40-059 and GbWD40-153 (30 each). A total of 21 (12.57%) GbWD40s have a single exon and no introns, 89 contain 1-10 introns, and the remaining 57 contain 11 or more introns (Figure 5, Table S1).
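The intron tallies reported above can be checked with a small binning routine. The list of intron counts below is synthetic: only the bin totals (21, 89, and 57) and the four highest counts come from the text, while the remaining per-gene values are placeholders chosen to fall in the stated ranges.

```python
def bin_intron_counts(intron_counts):
    """Tally genes into the three intron-number classes used in the text."""
    bins = {"0": 0, "1-10": 0, ">=11": 0}
    for n in intron_counts:
        if n < 0:
            raise ValueError("intron count cannot be negative")
        if n == 0:
            bins["0"] += 1
        elif n <= 10:
            bins["1-10"] += 1
        else:
            bins[">=11"] += 1
    return bins


# Synthetic counts: 21 intronless genes, 89 genes with 1-10 introns,
# and 57 genes with 11 or more (including the reported 36, 31, 30, 30).
counts = [0] * 21 + [5] * 89 + [36, 31, 30, 30] + [12] * 53
bins = bin_intron_counts(counts)
total = sum(bins.values())
percent_intronless = round(100 * bins["0"] / total, 2)
print(bins, total, percent_intronless)
```

The three classes sum to 167, and 21 of 167 corresponds to the 12.57% stated for intronless genes.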

Promoter analysis of GbWD40 genes
To further investigate the putative functions of the GbWD40 genes, we identified and analysed the potential cis-elements in the promoter regions 2000 bp upstream of the start codon of the WD40 genes using PlantCARE software. As shown in Figure S1, the GbWD40 genes are rich in cis-acting elements, including elements that respond to hormones (gibberellin-responsive element, cis-acting element involved in salicylic acid responsiveness, auxin-responsive element, cis-acting regulatory element involved in MeJA-responsiveness, cis-acting element involved in abscisic acid responsiveness), elements that respond to environmental stimuli (cis-acting element involved in low-temperature responsiveness, MYB binding site involved in drought-inducibility, cis-acting element involved in light responsiveness), the MYB binding site involved in flavonoid biosynthetic genes regulation, the MYBHv1 binding site, the cis-regulatory element involved in endosperm expression, and the cis-acting regulatory element essential for anaerobic induction. Among those GbWD40s, GbWD40-036 (Gb_06853), GbWD40-059 (Gb_14861), GbWD40-030 (Gb_28754), GbWD40-078 (Gb_32521), and GbWD40-163 (Gb_38061) have the "MYB binding site involved in flavonoid biosynthetic genes regulation" (Figure 7). We speculate that these five WD40 genes are involved in the synthesis of flavonoids in G. biloba.
GO annotation analysis of GbWD40 genes
GO annotation analysis is performed on the GbWD40 genes. A total of 79 (47.31%) GbWD40s are assigned to one or more functional GO categories, including molecular function, cellular component, and biological process (Figure 8). All the matched WD40s are annotated to 35 functional terms, including cellular component (20 subgroups), molecular function (10 subgroups), and biological process (5 subgroups; Figure 8). In the biological process category, 46 proteins are annotated to cellular process, 40 to single-organism process, 33 to metabolic process, 22 to cellular component organization or biogenesis, 21 to biological regulation, and 21 to regulation of biological process. In the cellular component category, 52 proteins are annotated to cell and cell part, 38 to macromolecular complex, and 24 to organelle. In the molecular function category, 29 proteins are annotated to binding and 21 to catalytic activity. Therefore, the members of this family likely play an important role in the regulation of protein-protein interactions and secondary metabolite synthesis.

Discussion
WD40 proteins are scaffold molecules for protein-protein interaction. They are highly conserved and abundant in eukaryotes, and they may play a key role in many basic biological processes, including signal transduction, protein trafficking, chromatin modification, and transcription (van Nocker et al., 2003; Stirnimann et al., 2010; Jain et al., 2018). The WD40 family has been identified and analysed in many species (Hu et al., 2018; Salih et al., 2018; Feng et al., 2019; Liu et al., 2020). These results show that WD40 family genes have expanded to different extents at different stages of plant evolution. In this study, 167 WD40 proteins are identified in the G. biloba genome. The analysis of the positions of these GbWD40 genes shows that they are unevenly distributed among the 12 chromosomes, which may be related to the evolution of G. biloba. The WD40 family genes in G. biloba are divided into 5 clusters according to the topology of the evolutionary tree and further divided into 16 subclasses according to the domain structures of the GbWD40 proteins. A total of 116 proteins contain only the WD40 domain, and they are classified as subclass A. The 51 remaining GbWD40 proteins also contain other domains, and they are categorized as subclasses B to P. In comparison with the classifications in other plants, our results on the WD40 family genes in G. biloba show both similarities and differences (Feng et al., 2019; Liu et al., 2020; Sun et al., 2020). The GbWD40 protein sequences are subjected to motif analysis, and the result indicates that motif 1 is the most conserved in the WD40 family protein sequences of G. biloba.
The exon/intron diversification of gene family members plays an important role in the evolution of multiple gene families through three main types of mechanisms, namely, exon/intron gain/loss, exonization/pseudoexonization, and insertion/deletion (Xu et al., 2012). Of the 167 identified WD40 genes, 146 contain 1-36 introns, and 21 have no introns. This result is consistent with the findings of Sun et al. (2020) in Rosa chinensis. Introns are thus present in most WD40 genes and absent from only a few. The WD40 family genes in G. biloba appear to have undergone intron loss or gain during evolution. The function of WD40 family members is affected by the size of exons and the length of introns. Studies on the gene structure of GbWD40s show that exon sizes and intron lengths are diverse, with genomic sequence lengths ranging from 425 bp to about 448 kb. This result indicates that the functions of GbWD40 genes may be diversified (Table S1).
Plant promoters are regulatory elements that are important for gene transcription and its regulation (Danino et al., 2015). Feyissa et al. (2019) observed that SPL13, as an inhibitory factor, can bind directly to the DFR promoter to affect its expression. Gou et al. (2011) indicated that SPL9 can bind to the promoter region of DFR. Therefore, promoter analysis is essential for studies on gene function. The promoters of the 167 GbWD40 genes (2000 bp) are analysed, and GbWD40-036 (II, B), GbWD40-059 (IV, P), GbWD40-030 (I, A), GbWD40-078 (III, F), and GbWD40-163 (I, C) contain the cis-element "MYB binding site involved in the regulation of flavonoid biosynthesis genes." These five WD40 genes are distributed across Clusters I-IV, with none in Cluster V, and belong to subfamilies A, B, C, F, and P. Therefore, genes with similar domain structures may share functions, whereas genes with different domain structures may also converge on the same function. Studies have shown that WD40 combines with bHLH and MYB to form a complex, which is involved in the regulation of flavonoid biosynthesis (Hichri et al., 2011; Hong et al., 2015). bHLH, WD40, and MYB are also involved as monomers in the regulation of flavonoid synthesis. In maize, PAC1 (WD40), R (bHLH), and C1 (MYB) seem to be independently regulated (Carey et al., 2004). In apples, MdMYB10 does not seem to regulate the expression of MdbHLH3 and MdbHLH33 (Espley et al., 2007). Xu et al. (2014) demonstrated that the R2R3-MYB gene GbMYBF2 acts as a negative regulator of flavonoid biosynthesis in G. biloba leaves. Therefore, the five GbWD40 genes containing the cis-element "MYB binding site involved in the regulation of flavonoid biosynthesis genes" likely combine with MYB, or with MYB and bHLH, to form complexes that play a role in regulating the synthesis and metabolism of flavonoids in G. biloba.
The synthesis of flavonoids in A. thaliana is relatively clear (Li, 2014), but the metabolic pathway of flavonoids in G. biloba is still unclear. A number of studies have shown that flavonoid synthesis is regulated by bHLH, MYB, WD40, and other transcription factors (Hichri et al., 2011; Carey et al., 2004; Li, 2014; Ye et al., 2019). Five and six WD40 genes related to flavonoid synthesis are screened through promoter analysis and correlation analysis, respectively. Two of these 11 genes are identified by both approaches, and they are key candidates for the next step of functional verification. This result provides a direction for further studying the role of WD40 in the synthesis and metabolism of flavonoids in G. biloba.

Conclusions
A total of 167 WD40 family genes in the G. biloba genome are identified and analysed. A phylogenetic tree is constructed on the basis of these 167 protein sequences and 100 WD40 family proteins of A. thaliana. The genes are divided into 5 clusters based on their evolutionary relationships and further divided into 16 subfamilies, A to P, based on their domain structures. The analysis of the promoters of the GbWD40 genes reveals that the promoters of GbWD40-036 (Gb_06853), GbWD40-059 (Gb_14861), GbWD40-030 (Gb_28754), GbWD40-078 (Gb_32521), and GbWD40-163 (Gb_38061) contain the cis-acting element of the MYB binding site