Flavonoid accumulation and identification of flavonoid biosynthesis genes in Dimocarpus longan Lour. by transcriptome sequencing

Dimocarpus longan Lour. is widely cultivated and is very popular around the world. Its by-products such as roots and leaves have been used as traditional Chinese medicines due to their content of important secondary metabolites, especially flavonoids. However, the economic value and application of D. longan roots and leaves are limited because they accumulate relatively low levels of flavonoids. Therefore, it is important to find key genes that regulate the accumulation of the predominant flavonoid compounds in D. longan roots and leaves. Here, we have used RNA-sequencing to describe the transcriptome of D. longan. We obtained 75,229,529 raw reads and 15.04 GB of clean data, generating 56,055 unigenes (N50 = 1,583 nt, mean length = 829.61 nt). Next, we annotated these unigenes using the various available bioinformatics databases. By this approach, we identified 6,684 genes differentially expressed between root and leaf tissues, of which thirteen were identified as flavonoid biosynthesis genes. Of these, eight genes were much highly expressed in roots (DlC4H, DlHCT, DlDFR, DlANS, DlANR, DlCHS, DlF3′H, and DlF3H), and two were much highly expressed in leaves (DlLAR and DlFLS). The contents of thirteen flavonoids in D. longan roots and leaves were measured by LCMS, and epicatechin was found to be the predominant flavonoid in both tissues, which was significantly higher than the other flavonoids measured in the study. Its contents were 213,773.65 ng/g in roots and 22,388.71 ng/g in leaves. Our findings will facilitate efforts to increase the economic value and expand the applications of D. longan roots and leaves by means of genetic engineering.


Introduction
Dimocarpus longan Lour., a member of the Sapindaceae family, is a well-known tropical and subtropical fruit tree that has been widely cultivated in Southeast Asia, especially in China, Vietnam, and Thailand . Due to its delicate flavour and sweet taste, the fruit of D. longan has been extensively accepted by consumers and has enjoyed great popularity around the world (Tseng et al., 2014). Most tissues of D. longan (fruit, leaf, flower, root, and seed) contain abundant bioactive compounds such as flavonoids, phenolic acids, and polysaccharides, and they have been used as traditional Chinese medicines for more than a century (Tang et al., 2019). The use of D. longan as a traditional Chinese medicine has been associated with improved blood metabolism, increased immunity, relief of insomnia, and improvements in learning and memory (Yang et al., 2008;Park et al., 2010).
Flavonoids are an abundant class of plant secondary metabolites. More than 10,000 plant flavonoids have been identified, and antiviral, antibacterial, estrogenic, and anti-obese properties have been described for many of these compounds (Ferrer et al., 2008;Tang et al., 2009;Wilson et al., 2009;Fang, 2014). In plants, flavonoids contribute to plant growth and development, reproduction, and stress responses. They are categorized into six subclasses based on their structural properties: flavones, flavanones, flavonols, isoflavones, proanthocyanidins, and anthocyanins (Dixon and Pasinetti, 2010;Yao et al., 2018). Mechanisms of flavonoid accumulation have been characterized in numerous plants, including Salvia miltiorrhiza, Dendrobium officinale, and Syringa oblata (Deng et al., 2018;Liu et al., 2019;Yuan et al., 2020). Because of the functional importance of flavonoids, much research effort is being aimed at cataloguing the genes responsible for plant flavonoid biosynthesis. Numerous enzymes encoded by plant flavonoid biosynthesis genes have been characterized (Petrussa et al., 2013;Saito et al., 2013;Dastmalchi and Dhaubhadel, 2015;Deng et al., 2018;Ni et al., 2018;Yuan et al., 2020). However, flavonoid biosynthesis genes from D. longan have not yet been studied.
Because of its accuracy, throughput, and reproducibility, RNA sequencing (RNA-Seq) is widely used for the study of transcriptomes (Lei et al., 2018). In plants, RNA-Seq data has been used to greatly advance our understanding of gene transcription patterns, posttranscriptional modification and mutation, functional analysis, and gene regulatory networks Glenn, 2011). To date, we still know little about the expression of flavonoid biosynthesis genes in its roots and leaves. Here, we use RNA-Seq to identify genes differently expressed between the roots and leaves of D. longan. We also identified genes associated with flavonoid biosynthesis: DlC4H, DlHCT,DlDFR,DlANS,DlANR,DlCHS,DlF3′H,and DlF3H. Genes identified as differentially expressed by RNA-Seq were then validated by quantitative real time PCR (qRT-PCR). Also, we measured the flavonoid contents of roots and leaves by LC-MS in order to study the relationship between flavonoid accumulation and biosynthesis gene expression. To our knowledge, no other study has applied RNA-Seq to investigate flavonoid biosynthesis gene expression in the roots and leaves of D. longan. Our results will aid future efforts to increase flavonoid accumulation in D. longan through genetic engineering.

Materials and Methods
Plant materials D. longan was cultivated in a greenhouse at 25 ℃ and 50% relative humidity. Leaves from the upper peripheral branches and roots were harvested from ten 2-month-old plants. Upon collection, samples were frozen in liquid nitrogen and stored at -80 ℃.

RNA isolation, library construction, and RNA sequencing
First, for each sample, RNA was extracted using a commercial kit (RN38 EASYspin plus Plant RNA Kit, Aidlab Biotech, Beijing, China). The recovered RNA was assessed (The concentration and integrity) using a UV-Vis spectrophotometer and Agilent 2100 Bioanalyzer. High-quality RNA samples were then pooled to achieve the optimal diversity of transcriptional units.
A commercially available kit (TruSeq RNA Sample Preparation Kit, Illumina Inc, San Diego, CA, USA) was then used to construct mRNA-Seq libraries. Poly-A mRNA was then isolated using magnetic beads carrying poly-T oligos. The recovered poly-A mRNA was then fragmented using divalent cations at high temperature. First strand cDNA synthesis was done using a reverse transcriptase enzyme and random primers, and second-strand cDNA synthesis was done using DNA Polymerase I. Next, the purified short fragments were ligated to a single 'A' base and ligated to sequencing adapters. Fragments (~200 ± 25 bp) were then separated by agarose gel electrophoresis and used as sequencing templates for PCR. This generated a cDNA library that was sequenced on the Illumina HiSeq TM 2000 platform.

Sequence analysis and assembly
To ensure that the data used for assembly was of sufficient quality, adaptor sequences, low-quality reads, and reads where > 10% of bases had a Q-value < 20 were removed from the raw reads Xie et al., 2012). Sequences shorter than 60 bases were also discarded (because small reads are likely to be sequencing artifacts) (Meyer et al., 2009). Trinity (http://trinityrnaseq.sourceforge.net/) was used to assemble clean reads into contigs, recovering full-length transcripts (Grabherr et al., 2011). These contigs were then combined into transcripts based on paired-end information. The resulting transcripts were then clustered based on sequence identity. The longest transcripts classified as unigenes were then combined, giving the final assembly that could be used for functional annotation.

Functional annotation
We used multiple databases [NCBI Nr, Gene Ontology (GO), Clusters of Orthologous Group (COG), Kyoto Encyclopaedia of Genes and Genomes (KEGG), and SwissProt] to annotate the assembled sequences. Gene names were assigned to unigenes based on the top BLAST hit. Genes without descriptions were excluded. As an additional step, we submitted predicted protein sequences to BLASTX (searching against GenBank, NCBI Nt, PDB, and RefSeq). The GetORF function of EMBOSS was used to predict open reading frames (ORFs) and the longest of there were extracted for each unigene (Hernández et al., 2018).
The Swiss-Prot BLAST results were imported into Blast2GO to assign GO terms and putative gene functions (Conesa et al., 2005;Conesa and Gotz, 2008). ANNEX was then applied to enrich and refine these annotations (Ye et al., 2006;Myhre et al., 2006).
To further annotate gene functions, unigenes were aligned to COG database (Galperin et al., 2019), and the KEGG pathways were assigned using the online server (http://www. genome.jp/kegg/kaas/). KEGG Orthology (KO) assignments were made using the bi-directional best hit method (Moriya et al., 2007). The KEGG analysis outputs included KO assignments and KEGG pathways.
Detection of simple sequence repeat (SSR) markers MISA software (http://pgrc.ipk-gatersleben.de/misa/) was used to detect SSR markers in the assembled sequences (those longer than 1 kb). The parameters were optimized to identify perfect dinucleotide motifs (minimum of six repeats, as well as tri-, tetra-, penta-and hexa-nucleotide motifs with a minimum of five repeats) (Wei et al., 2011).

Quantitative Real Time PCR analysis (qRT-PCR)
Total RNA was extracted from D. longan root and leaf tissues using the cetyltrime thylammonium bromide (CTAB) method (Jordon-Thaden et al., 2015). The cDNA Synthesis SuperMix (TransGen Biotech, Beijing, China) and TransScript® One-Step gDNA Removal kits were used to synthesize first strand cDNA. All qRT-PCR reactions were done using TransStart® Top Green qPCR SuperMix (TransGen Biotech, Beijing, China) according to the manufacturer's instructions. The primers used for qRT-PCR are given in Table 1, and the D. longan tubulin gene was selected as a reference gene. The qRT-PCR cycling parameters were: 95 °C for 1 min, followed by 95 for 5 s, 60 °C for 30 s, and 72 °C for 30 s. All experiments were conducted in triplicate.

Identification and quantification of individual flavonoid by Liquid Chromatography-Mass Spectrometry (LC-MS)
Root and leaf tissues from D. longan were freeze-dried at -80 °C for 48 h, then ground into a powder using a mortar and pestle. An aliquot of 50 mg of dried sample was transferred to a microcentrifuge tube (1.5 mL), and two steel balls added to the tube. Then 80% methanol solution (1.2 mL, containing 0.01 mol L -1 BHT and 0.1% formic acid) was added to each tube, and the tissue was ground at 60 Hz for 2 min in a grinder. The mixture was ultrasonicated at 50 °C for 2 h, then centrifuged at 12,000 rpm for 10 min at room temperature. The supernatant was transferred to a glass sampling vial. Later, 1 mL of 80% methanol solution (containing 0.01 mol L -1 BHT and 0.1% formic acid) was added to the residue, and the mixture was ultrasonicated at 50 °C for 30 min, followed by centrifugation at 12,000 rpm for 10 min at ambient temperature. The two supernatants were combined and dried at room temperature. An aliquot of 200 µL of 35% methanol/ 65% water were added to dissolve the dried supernatants. The resulting mixture was vortexed vigorously and then centrifuged for 10 min. The supernatants were placed at room temperature for 30 min before LC-MS analysis. Flavonoids were separated on a Waters ACQUITY UPLC HSS T3 (100 × 2.1 mm, 1.8 um) with a Thermo UHPLC system. The mobile phase A consisted of 0.1% formic acid and water, and the mobile phase B consisted of methanol. The flow rate was maintained at 0.2 mL min -1 , and the injection volume was 5 μL. The concentrations of flavonoid compounds were determined from a standard curve. All samples were analysed in triplicate. Values are displayed as means ± SDs.

D. longan RNA-Seq and transcriptome assembly
To comprehensively analyse transcript diversity and gene expression, RNA samples from D. longan leaves and roots (three replicate samples of each tissue) were sequenced on the Illumina HiSeq TM 2000 platform. We obtained 75,229,529 raw reads and 15.04 GB of clean data. The mean GC content was 45.84%. Q value ≥ 20 was the basic requirement for all reads, and greater than 85.03% of raw reads showed Q ≥ 30, indicating that the reads were of acceptable quality for assembly. After a stringent filtering step to remove contaminated, low-quality, and short reads, 15,397,494 contigs were assembled . Table 2 summarizes the sequencing pipeline. The high-quality reads have been submitted to the NCBI SRA database (SRP155595). Using Trinity, short-read sequences were assembled into 106,991 transcripts (N50, 2100 nt; mean length, 1249.38 nt). All assembled transcripts were clustered to produce 56,055 unigenes with an N50 of 1,583 nt and a mean length of 829.61 nt. Among these unigenes, 14,260 were longer than 1000 nt. Figure 1 shows the length distributions of contigs, transcripts, and unigenes.

Functional annotation
The assembled sequences were annotated in multiple ways. In total, 44.9% of the unigenes (25,161) were annotated based on similarity to sequences deposited in diverse databases, as described in the Methods. Of these, 3,028 unigenes were successfully annotated by all the databases used in this study ( Figure 2 and Table 3). First, the Basic Local Alignment Search Tool (BLAST) tool was used to search the NCBI Nr and Swiss-Prot protein databases (E ≤ 10-5). In total, 25,074 and 17,484 unigenes were homologous to sequences in the Nr and Swiss-Prot databases (E ≤ 10-5), accounting for 44.7% and 31.2% of the total assembled unigenes, respectively. Additionally, 8,372 (14.9%), 19,631 (35.0%), and 5,267 (9.4%) of the unigenes matched entries in the COG, GO, and KEGG databases, respectively.
Among the matches to the Nr database, 62.0% of the aligned unigenes had an E-value of < 1.0E -60 , indicating strong sequence homology. The remaining aligned unigenes had E-values ranging between 1.0E -5 and 1.0E -60 ( Figure 3A). The similarities of the unigenes to their top Blast hits in the Nr database was also assessed: 74.5% of the unigenes had a similarity > 60%, 21.5% had a similarity between 40% and 60%, and 4.0% had a similarity lower than 40% ( Figure 3B). Over 80% of the unigenes had top Blast hits from the following seven species: Fragaria vescasubsp. vesca, Glycine max, Populus trichocarpa, Prunus persica, Ricinus communis, Theobroma cacao, and Vitis vinifera ( Figure 3C).

GO classification
Gene Ontology (GO) analysis was used to classify the unigenes into functional groups (Hillier et al., 2009). Of the 56,066 assembled unigenes, 19,631 unigenes were categorized into 57 functional groups and three major categories: 49.6% (n = 2,849) in biological process, 23.7% (n = 540) in cellular component, and 26.7% (n = 2,236) in molecular function (Figure 4). Among the biological process terms, metabolic process, cellular process, and response to stimulus were relatively well represented. Among the molecular function terms, binding, catalytic activity, and transporter activity were common. Among the cellular component terms, cell part, cell, and organelle were most abundant.

Differential gene expression in D. longan root and leaf tissues
In total, 6,684 genes were differentially expressed between root and leaf tissues, and their expression levels are presented as a clustered heat map in Figure 6. Among the differentially expressed genes (DEGs), 3,550 genes had higher expression in roots, and 3,134 genes had higher expression in leaves (Figure 7).
The DEGs were analyzed based on their functions in diverse pathways. These pathways could be divided into five categories: genetic information processing, organismal systems, cellular processes, environmental information processing, and metabolism ( Figure 8). One-hundred-thirty-one DEGs were related to ribosomes in genetic information processing, accounting for the largest percentage in this category. Thirty-four DEGs were related to plant-pathogen interactions in the organismal systems category. In the cellular process's category, 20 and 16 DEGs were associated with the peroxisome and the phagosome, respectively. Only one function (plant hormone signal transduction) was identified in the environmental information processing category (n = 63 DEGs). In the metabolism category, 53 DEGs were annotated as having a role in glycolysis/gluconeogenesis, followed by fewer genes related to purine metabolism, pyrimidine metabolism, carbon fixation in photosynthetic organisms, and other functions. Given that flavonoids are a principal group of secondary metabolites in D. longan, we next studied putative flavonoid biosynthesis genes and their relationships with flavonoid contents in root and leaf tissues.   Table 5 indicate that flavonoid biosynthesis genes from D. longan exhibit high percentage identities with other orthologous genes.

Expression analysis of flavonoid biosynthesis genes in roots and leaves of D. longan
Based on RNA-Seq data from root and leaf tissues, we found that the thirteen putative flavonoid biosynthesis genes had different expression levels. qRT-PCR was used to verify the accuracy of the transcriptome data for the 13 flavonoid biosynthetic genes. The qRT-PCR results (Figure 9) were broadly in line with the RNA-Seq results. All 13 flavonoid biosynthesis genes were expressed in root and leaf tissues but with different expression patterns. DlCHI, DlCCoAOMT, and DlCYP were expressed similarly in root and leaf tissues. Eight genes (DlC4H,DlHCT,DlDFR,DlANS,DlANR,DlCHS,DlF3′H,and DlF3H) had relatively higher expression in roots (vs. leaves), and two genes (DlLAR and DlFLS) had much higher expression levels in leaves (vs. roots).

Analysis of flavonoid contents of D. longan root and leaf tissues
The same plant materials used for qRT-PCR were also used to analyse the contents of fifteen flavonoids in D. longan roots and leaves by LC-MS. Eight flavonoids were not detected in roots or leaves (genistein, isoliquiritigenin, isorhamnetin, formononetin, (S)-pinocembrin, chrysin, galangin, and daidzein). Among the seven flavonoids detected, vitexin was detected only in roots. The remaining six flavonoids (catechin, epicatechin, rutin, morin, luteolin, and apigenin) were detected in both roots and leaves, but at different levels. Of these, epicatechin was the predominant ingredient in both root and leaf tissues of D. longan, and epicatechin content differed by approximately an order of magnitude between roots (213,773.65 ng/g) and leaves (22,388.71 ng/g). By comparison, apigenin content in roots (5.31 ng/g) and leaves (6.06 ng/g) was very low and did not differ significantly between the tissues. The flavonoid composition and contents in roots and leaves of D. longan were shown in Table 6. In general, roughly half of the flavonoids we tested were not detected in either root or leaf tissue. Among the remaining flavonoids, the most abundant one was epicatechin in the roots. It was present at much higher levels than any other flavonoid and deserves further investigation for future development and utilization.

Discussion
Flavonoids are one of the most important groups of phenolic compounds and the most common secondary metabolites in plants. Up to the present, more than 10,000 flavonoids have been identified in plant, and they are divided into six subclasses including flavones, flavanones, flavonols, isoflavones, proanthocyanidins, and anthocyanins (Dixon and Pasinetti, 2010;Yao et al., 2018). Flavonoids have numerous pharmacological activities. They can not only be beneficial to the plant itself but also can be used to treat human diseases such as hypertension, dysentery, liver and urinary problems, diarrhoea, and cerebral disorders (Azuma et al., 2012;Jennings et al., 2012;Chang et al., 2013).
D. longan is one of the most famous fruits in China, and China accounts for more than 50% of its global production. The secondary metabolites, especially flavonoids, found in D. longan fruits, leaves, flowers, and seeds are thought to have medicinal properties, and D. longan has long been used in traditional Chinese medicine (Thitiratsakul and Anprung, 2014). However, the leaves and roots of D. longan, while rich in resources, nonetheless accumulate relatively small amounts of flavonoids, limiting the scope of their application and their economic value. We therefore hope to enhance flavonoid accumulation in D. longan roots and leaves by means of biotechnology.
Transcriptome sequencing is a powerful biological technology for discovering molecular markers and identifying novel genes. Compared with the SOLiD, Roche 454, and HeliScope platforms, the Illumina HiSeq platform used in this study has been widely adopted because of its many advantages, in particular its capacity to analyse the expression levels of transcripts and to identify previously undiscovered transcripts (Ronaghi et al., 1998;Harris et al., 2008;Smith et al., 2008). In this study, we used RNA-Seq to profile D. longan transcript levels. A total of 15.04 GB of clean data were obtained and assembled into 56,055 unigenes, 25,161 (44.9%) of which were successfully annotated with the GO, COG, KEGG, Swiss-Prot, or NCBI Nr databases, suggesting that they may have relatively conserved functions. The remaining unannotated unigenes (55.1%) could not be matched to known sequences in public databases, likely because of the presence of poorly conserved regions, such as untranslated regions (UTRs) (Hou et al., 2011). These unannotated unigenes may play specific roles in D. longan which have not yet been identified and therefore deserve further investigation. The mean D. longan unigene length (829.61 nt) was greater than that of Litchi chinensis Sonn (687 nt for phase 1 and 669 nt for phase 2) (Zhang et al., 2014). N50 length is commonly used to evaluate the quality of a transcriptome assembly, with a high number indicating a higher quality assembly (Li and Dewey, 2011). The N50 of the D. longan transcriptome was 1,583 nt, significantly longer than that of Litchi chinensis Sonn (811 nt), suggesting that the assembly quality was adequate for subsequent transcriptome analysis (Li et al., 2013). We also identified DEGs when comparing the root and leaf transcriptomes of D. longan. In total, 6,684 genes differed in expression between roots and leaves: 3,550 genes were much highly expressed in roots, and 3,134 genes were much highly expressed in leaves. Expression patterns analysed by RNA-Seq and qRT-PCR were broadly consistent, suggesting that the RNA-Seq data will be very useful for subsequent studies of D. longan.
Thirteen putative flavonoid biosynthesis genes were differentially expressed between D. longan leaf and root tissues: DlC4H, DlHCT, DlDFR, DlANS, DlANR, DlCHS, DlF3′H, and DlF3H were much highly expressed in roots, and DlLAR and DlFLS were much highly expressed in leaves. Based on known flavonoid biosynthetic pathways (Vogt, 2010;Tohge et al., 2017), all of the thirteen identified genes were predicted to function in flavonoid biosynthesis. To ascertain the relationship between gene expression and flavonoid accumulation, we measured the contents of fifteen flavonoids by LC-MS. In total, seven flavonoids were detected, including catechin, epicatechin, rutin, morin, luteolin, apigenin, and vitexin. Vitexin was only detected in roots. The other eight flavonoids (genistein, isoliquiritigenin, isorhamnetin, formononetin, (S)pinocembrin, chrysin, galangin, and daidzein) were detected in neither roots nor leaves. The epicatechin contents were 213,773.65 ng/g in roots and 22,388.71 ng/g in leaves, both significantly higher than the contents of the other six flavonoids. Thus, epicatechin was clearly the predominant bioactive flavonoid in root and leaf tissues of D. longan and deserves further investigation into means of increasing its accumulation. Epicatechin has been demonstrated to have multiple functions including anticancer, antilipidemic, antineoplastic, cardio protective, anti-inflammatory, and antioxidant properties (Shanmugam et al., 2017). In addition, the natural epicatechin was verified to regulate insulin secretion from pancreatic β cells (Yang and Chan, 2018). ANR is thought to be the gene involved in catalysing plant epicatechin production (Han et al., 2012;Tan et al., 2018;Yuan et al., 2020). The expression level of DlANR was higher in roots than that in leaves, consistent with the high accumulation of epicatechin in roots. Therefore, we speculate that protein encoded by DlANR catalyses the biosynthesis of epicatechin in D. longan. We could increase epicatechin content of D. longan roots and leaves by means of genetic engineering in future. As a result, the medicinal value of roots and leaves, by-products of D. longan fruit cultivation, will be enhanced, turning waste into treasure.

Conclusions
In this study, RNA-Seq technology was used to profile the transcriptome of D. longan on the Illumina HiSeq TM 2000 platform. In total, 56,055 unigenes were obtained, with an N50 of 1,583 nt and a mean length of 829.61 nt. Unigenes were annotated using NCBI Nr, Gene Ontology (GO), Clusters of Orthologous Group (COG), and Kyoto Encyclopaedia of Genes and Genomes (KEGG) databases. A total of 6,684 genes were differentially expressed between root and leaf tissues, of which thirteen were identified as flavonoid biosynthesis genes. Of these, eight genes were much highly expressed in roots (DlC4H,DlHCT,DlDFR,DlANS,DlANR,DlCHS,DlF3′H,and DlF3H), and two were much highly expressed in leaves (DlLAR and DlFLS). The contents of thirteen flavonoids in D. longan roots and leaves were measured by LC-MS, and epicatechin was found to be the predominant flavonoid in both tissues, which was significantly higher than the other flavonoids measured in the study. Its contents were 213,773.65 ng/g in roots and 22,388.71 ng/g in leaves. The results of this study will help to increase the economic value and expand the applications of D. longan roots and leaves by means of genetic engineering.