Transcriptome analysis to identify genes involved in lignan, sesquiterpenoid and triterpenoid biosynthesis in medicinal plant Kadsura heteroclita

Stems and roots of Kadsura plant species were the significant ingredients of traditional Chinese medicine. Kadsura heteroclita is one of the popular medicinal plants used in Tujia and Yao nationalities of China. Antioxidant compounds like lignan, sesquiterpenoid and triterpenoid are the major active components of K. hetroclita. Mass cultivation and bio-manufacturing strategies were being proposed to meet the increasing demand of Kadsura species plant parts. Therefore, it is important to reveal the molecular networks involved in biosynthesis of these highly efficient medicinal compounds. Here, transcriptomes of roots, stems and leaves in K. heteroclite seedling were sequenced by Hiseq2000 and unigenes involved in biosynthesis of lignan, sesquiterpenoid and triterpenoid biosynthesis were mined. As a result, 472 million clean reads were obtained which after aligning resulted in 160,248 transcripts and 98,005 genes. 191 and 279 unigenes were expected to be involved in biosynthesis of lignan, sesquiterpenoid and triterpenoid biosynthetic pathways respectively. Lignan, sesquiterpenoid and triterpenoid biosynthesis pathway genes were highly significant and differentially upregulated in roots and stems and downregulated in leaves. Also, genes encoding for MYB and bHLH transcription factors possibly involved in regulation of lignan, sesquiterpenoid and triterpenoid biosynthesis were discovered. These results provide the fundamental genomic resources for dissecting of biosynthetic pathways of the active components in K. hetroclita.

family Schisandraceae. It is mainly distributed in Hubei, Guangdong, Hainan, Guangxi, Guizhou, Yunnan provinces of China, Bangladesh, Vietnam, Laos, Myanmar, Thailand, India, Sri Lanka, etc. in Asia (Editorial Committee of Flora of China, 1996). K. heteroclita is one of the major ingredients of both ethnomedicines "Xuetong" and "dahongzuan" which is highly used among the Tujia and Yao nationalities . Rattan and root parts of K. heteroclita exhibit several medicinal properties such as promoting Qi, relieving pain, expelling wind and dehumidification, which is mainly used to treat rheumatism and rheumatoid arthritis, low back pain, lumbar muscle strain, stomach pain, abdominal pain, hemiplegia, postpartum paralysis, dysmenorrhea, fracture and bruise (Editorial Committee of Flora of China, 1996;Zhang et al., 2019;. Modern studies have reported that lignans, sesquiterpenes and triterpenes are main components of rattan and root in K. heteroclita. The lignans, sesquiterpenes and triterpenes exhibit various pharmacological functions such as anti-oxidation, anti-virus, anti-tumor, cardiovascular protection, liver protection, anti-inflammatory, analgesic, and in memory improvement respectively . The biosynthesis of lignan consists of three stages in plant: (1) biosynthesis of lignan precursor coniferylalcohol through phenylpropanoid pathway, (2) biosynthesis of lignan with various structures through coniferyl-alcohol and (3) in final stage lignans are modified by glycosylation to form lignan glycosides Kanehisa, 2020). Similarly, terpenoids are synthesized from isopentenyl diphosphate (IPP) and dimethylallyl diphosphate (DMADP) which are derived from plastidial methylerythritol phosphate (MEP) pathway and cytosolic mevalonic acid (MVA) pathway (Zhang et al., 2015). Later, one molecule of IPP and one molecule of DMADP are condensed into geranyl diphosphate (GPP), a C10-compound, by geranyl diphosphate synthase (GPPS) (Zhang et al., 2015). One molecule of GPP and one IPP are further condensed into farnesyl diphosphate (FPP), a C15-compound, by farnesyl diphosphate synthase (FPPS) in the cytoplasm (Fidan and Zhan, 2018). The FPP is a branch-point precursor for the synthesis of both sesquiterpenoid and triterpenoid (Fidan and Zhan, 2018). The sesquiterpenoid solavetivone can be synthesized by vetispiradiene synthase (VES) and premnaspirodiene oxygenase (PO) from FPP, while germacrene D, 7-epi-α-selinene and valencene can be synthesized by germacrene D synthase (GDS), 7-epi-alpha-selinene synthase (SS) and valencene synthase (VS), separately (Kanehisa, 2020). Thus, briefly the biosynthesis of triterpenoid involves two molecules of FPP are condensed into one molecule of squalene by squalene synthase (SQS) in the endoplasmic reticulum membrane followed by conversion of squalene into 2,3-epoxy-2,3-dihydrosqualene by squalene monooxygenase or squalene epoxidase (SE). Thus, 2,3-epoxy-2,3-dihydrosqualene is a branch-point precursor for the synthesis of the linear and cyclic triterpenoid compounds (Kanehisa, 2020).
Till date research studies on K. heteroclita are majorly focused on isolation of chemical components, pharmacological effects, pharmacokinetics, fingerprints and DNA barcodes (Zhou et al., 2016;Guo, 2017;Fan et al., 2019;Liu et al., 2020;Shehla et al., 2020). Limited research has been conducted towards understanding the lignan, sesquiterpenoid and triterpenoid biosynthetic pathways. As of today, 69 different types of lignan compounds (including dibenzocyclooctadienes, aryltetralins, aryltetralins, diarylbutanes, and tetrahydrofurans), 64 different types of triterpenoids (including cycloartane type, lanolin type, oleanane type and schiartane type) and 15 different types of sesquiterpenoids has been isolated from K. heteroclita (Cao et al., 2019;Zhang et al., 2019;. As these bioactive components are sparsely produced in plant, it is highly important to implement and develop novel synthetic biological strategies for industrial production to meet the demand (Fidan and Zhan, 2018). Another major problem in production is usage of several names for the same medicinal plant and the same name for various other species which have caused the wrong usage of medicinal material (Editorial Committee of Flora of China, 1996). Therefore, presence of complete genome sequence of K. heteroclita would significantly benefit both scientific and industrial communities by separating it from other species. As of today, the complete genome sequence of K. heteroclita is unavailable and only chloroplast genome is publicly available (Guo, 2017).
In our present study, we have performed genome-wide transcriptome analysis of the roots, stems and leaves of K. heteroclita using the Illumina Hiseq 2000 ® platform. For the first time we have reported the complete genome-wide transcriptome analysis of K. heteroclita to reveal the gene expression patterns of roots, stems and leaves of K. heteroclita seedlings, which are involved in the biosynthesis pathways of lignan, sesquiterpenoid and terpenoid and their regulatory mechanisms. We strongly believe that the results obtained in our study will lay a strong foundation for future studies on identification of medicinal compounds and synthetic biological studies on K. heteroclita.

Plant materials
The seedlings of Kadsura heteroclita (24°28'25''N, 100°8'33''E) were cultivated from the branch cuttings obtained from the K. hetroclita. All the branch cuttings of K. hetroclita were kindly provided by Lincang Biological Pharmaceutical Technology Company Limited, Yun County, China. The roots, stems and leaves were collected from five-month-old K. heteroclita seedlings (Figure 1). Materials from three individual plants were collected using scissors to yield 1 g of root, stem and leaf samples which were used in our further transcriptome studies (BioSample accession numbers: SAMN14883617, SAMN14883618, SAMN14883619, SAMN14883620, SAMN14883621, SAMN14883622, SAMN14883623, SAMN14883624, SAMN14883625). All the samples were immediately frozen in liquid nitrogen and then stored in -80 ℃ refrigerator after cutting and wrapping with tinfoil respectively. RNA isolation and quality control Total RNA was extracted by TRIzol ® reagent (Invitrogen, Carlsbad, CA, USA), and DNase I (Takara, Dalian, China) was used to remove the residual genomic DNA. The RNA isolation was performed according to the manufacturer's protocols. The purity of the RNA was determined by the NanoPhotometer ® (IMPLEN, Westlake Village, CA, USA). The integrity of the RNA was determined by the Bioanalyzer 2100 system (Agilent Technologies, Palo Alto, CA, USA). RNA concentration was precisely measured by the Qubit ® 2.0 Flurometer (Life Technologies, Carlsbad, CA, USA). Thus, the obtained RNA was used for library construction.
Library construction, sequencing and quality control Illumina TruSeq™ RNA Sample Preparation Kit v2 (Illumia, San Diego, USA) was used for the library construction according to the manufacturer's manual. mRNA was enriched by the magnetic beads with Oligo(dT). Fragmentation buffer is then added to break the mRNA into short fragments. The 1st-strand of cDNA is synthesized with random hexamers using pure mRNA as a template and then the 2nd-strand of cDNA was synthesized by adding the buffer, dNTPs, DNA polymerase I and RNase H. AMPure® XP beads (Beckman Coulter, High Wycombe, UK) were used to purify the double-stranded cDNA. Thus, obtained purified double-stranded cDNA is first subjected to end repair, A-tailing, adding the sequencing adapter, and later the DNA fragments were selected using the AMPure XP beads respectively. Finally, PCR amplification was performed, and the PCR product were purified using the AMPure XP beads to obtain the final library. Once the library construction is completed, we have used Qubit ® 2.0 flurometer (Life Technologies, Carlsbad, CA, USA) for preliminary quantification and the libraries were diluted to 1.5 ng/μl, and then Agilent 2100 bioanalyzer (Agilent Technologies, Palo Alto, California) was used to detect the insert size of the library. The fragments meeting the ideal insert size were used, the fragments with effective concentration above 2 nM of library was accurately quantified using the Q-PCR method. The libraries obtained from the quality control analysis (effective concentration and the target data volume) were pooled and further used for the transcriptome sequencing which was achieved using Illumina HiSeq® sequencing platform.

Transcriptome assembly
The original sequence image files obtained from Illumina HiSeq were converted to raw sequenced reads using Casava base calling software. Thus, these obtained raw reads were stored in fastq files, and they were later processed to remove the ones containing adapter, ploy(N) and low-quality. We have calculated the Q20, Q30, GC content and sequence duplication level of the clean data. Then these clean reads were assembled by Trinity ® v2.4.0 software (Grabherr et al., 2011) with following parameters min_kmer_cov set to 3 and all other parameters were set to default.
Gene functional annotation, coding sequence (CDS) and transcription factor (TF) prediction The assembled transcriptome was annotated to obtain the functional annotations by mapping against BLAST databases: Nr (NCBI non-redundant protein sequences, diamond v0.8.22, e-value = 10 −5 ), Nt (NCBI nucleotide sequences, NCBI blast 2.2.28+, e-value = 10 −5 ), PFAM (Protein family, HMMER 3.0 package, hmmscan e-value = 10 −2 ), SwissProt (A manually annotated and reviewed protein sequence database, diamond v0.8.22, e-value = 10 −5 ), KOG (eukaryotic Ortholog Groups, diamond v0.8.22, e-value = 10 −5 ), KEGG (Kyoto Encyclopedia of Genes and Genomes, KEGG Automatic Annotation Server, e-value= 10 −10 ) and GO (Gene Ontology, Blast2GO v2.5, e-value = 10 −6 ). For the CDS prediction, unigenes were first aligned to the NR database, ORF (open reading frame) information of transcripts was extracted from the alignment results, and the sequence of the coding region was translated into amino acid sequence according to the standard codon table. For the transcripts without a possible match to the NR, ESTSCAN 3.0.3 software was used to predict the CDS. We have used iTAK (https://github.com/kentn/iTAK/) software for the prediction, identification and classification of the TF as previously reported by (Paulino et al., 2010).

Differential expression analysis
We have used RSEM software package to analyse the differential gene expression profiles (Li and Dewey, 2011). Also, we have used DESeq 1.18.0 R package to perform differential gene expression analysis. Unigenes with an adjusted p-value < 0.05 reported by DESeq were defined as differentially expressed genes (DEGs). The heatmaps were generated using the R package pheatmap.

GO enrichment and KEGG pathway enrichment analysis
GOseq R packages was used to perform GO enrichment analysis on the above obtained DEGs (Young et al., 2010). KOBAS software was used to perform the pathway enrichment of DEGs (Mao et al., 2005).

Phylogenetic analysis
The gene sequences coding for KhDIR, KhPLR, KhSDH, KhSQS, KhSE, KhMYB and KhbHLH TFs were retrieved from the K. heteroclita CDS, and all the sequences used were listed in supplementary file. The above-mentioned sequences coding for TFs were aligned using Clustal X 2.1 to generate the .phy format files. Then PhyML 3.0 (http://www.atgc-montpellier.fr/phyml/) was used to construct the phylogenetic tree by using the default parameters with the boostrap value set at = 100 respectively (Guindon et al., 2010). We have used the iTOL 5.5.1 (https://itol.embl.de/) software to visualize the trees (Letunic and Bork, 2019).

Sequence alignment and building of three-dimensional models
For the sequence alignment of DIR and SDH, we have used DNAMAN 10.0.2.8 software with the default parameters. And three-dimensional models of DIR proteins were built using the SWISS-MODEL online sever (https://swissmodel.expasy.org/) with the default parameters.

Sequencing and assembly
In order to understand and reveal the biosynthesis pathway of lignan, sesquiterpenoid and triterpenoid in K. heteroclita, nine sequencing libraries including roots (K_R1, K_R2, K_R3), stems (K_S1, K_S2, K_S3) and leaves (K_L1, K_L2, K_L3) were prepared and sequenced using the Illumina Hiseq 2000 ® platform. More than 6.58 G clean reads per library was obtained after quality control analysis ( Table 1). The error rate of all libraries was found to be 0.03%, while Q20 and Q30 were over 95.19% and 88.62% respectively, which indicated that these data were suitable for the downstream analysis. The raw datasets from the nine sequencing libraries have been deposited in the Short Reads Archive (SRA) database under the accession numbers: SRR11747016-SRR11747024. The clean reads were combined and assembled using the Trinity v2.4.0, with min_kmer_cov set to 3 and for all the other settings we have used default parameters (Grabherr et al., 2011).
Assembled sequences were subjected to cluster using the trinity algorithm. Finally, a total of 160,248 transcripts and 98,005 genes were assembled, with 49,623 (30.97%) transcripts and 48,659 (49.65%) genes being longer than 1 Kb in length ( Figure 2). The average length of transcripts and genes were found to be 927 bp and 1306 bp (Table 2), and the N50 for transcripts and genes were 1609 bp and 1838 bp respectively (Table 2).   Gene function annotation and classification A total of 98,005 unigenes were annotated against seven classic databases which included NR, NT, PFAM, SwissProt, KOG, KEGG and GO databases with an E-value cutoff of 10 -5 , 10 -5 , 10 -2 , 10 -5 , 10 -3 , 10 -10 and 10 -6 respectively. The 7458 (7.60%) unigenes were found to be commonly annotated to all the above seven databases, while 76,101 unigenes (77.65%) were annotated at least to one database (Table 3). Totally 54,930 unigenes (56.04%) exhibited high similarity with sequences in the NR database, 36,225 unigenes (36.96%) similarity to protein sequences in NT, and finally 56,998 unigenes (56.15%) exhibited similarity with known genes in SwissProt. Results obtained from the species distribution based on their similarity results against NR databases showed that highest number of transcripts were annotated to Nelumbo nucifera (20.42%), followed by Macleaya cordata (11.36%) and Amborella trichopoda (11.30%) respectively ( Figure 3). Coding sequences were also predicted using NR database and ESTSCAN software respectively. A total of 61,074 peptides were predicted by using BLASTp with the peptide length ranging between 15 to 1254 ( Figure 4a), whereas 37,045 peptides were predicted by ESTSCAN with peptide length ranging from 15 to 734 respectively ( Figure 4b). Out of which 26, 20 and 10 groups belong to biological process, cellular component and molecular function respectively. Results obtained from these biological contextualization studies have shown that the majority of the cellular and metabolic processes are predicted to be involved in biosynthesis of lignan, sesquiterpenoid and triterpenoid. The gene ontology distribution of unigenes can be categorized as: "cellular process" (30,693, 55.14%), "metabolic process" (28,461, 51.13%) and "single-organism process" (22,794, 40.95%), "binding" (33,337, 59.89%) and "catalytic activity" (24,598, 44.19%) ( Figure 5) respectively. Results obtained in this study are in accordance with previous genome-wide annotation results of Dendrobium huoshanense and Arisaema heterophyllum (Wang et al., 2018;Zhou et al., 2020).
Annotating against KOG database has resulted in a total of 20,480 unigenes classified among 26 KOG ( Figure 6). Amongst the 26 KOG groups, the majority of genes were distributed among "post-translational modification, protein turnover, and chaperon" (3126, 15.26%), followed by "general function prediction only" (2671, 13.04%), "translation, ribosomal structure and biogenesis" (2407, 11.75%) and "intracellular trafficking, secretion, and vesicular transport" (1468, 7.16%) respectively. Interestingly, the genes distributed among the top-three groups in Arisaema heterophyllum were "general function prediction only (9,277, 26.57%)", "transcription (5,784, 16.56%)" and "translation, ribosomal structure and biogenesis (5,667, 16.23%)" (Wang et al., 2018;Zhao et al., 2020). These results show that protein synthesis, processing and transport are more active in K. heteroclita seedling. Figure 5. GO classification map. The ordinate represents the GO terms of the three GO categories while the abscissa represents the number and percentage of genes annotated into the corresponding term Figure 6. KOG classification map. The abscissa represents KOG groups, while the ordinate represents the percentage of annotated genes Finally, to understand and reveal the active metabolic pathways differentially expressed in K. heteroclita seedlings, we have annotated the transcriptome against the KEGG database. A total of 98,005 unigenes were assigned into five categories including cellular processes, environmental information processing, genetic information processing, metabolism and organismal systems, 19 sub-categories ( Figure 7) and 130 pathways respectively. Especially, the following pathways: "Translation" (2794, 9.90%), "carbohydrate metabolism" (2282, 8.08%), "folding, sorting and degradation" (2048, 7.25%) and "overview" (1572, 5.57%) were found to be the top-four pathways ( Figure 7). Especially, we have observed that a total of 1359, 933 and 890 genes were involved in "amino acid metabolism", "metabolism of terpenoids and polyketides" and "biosynthesis of other secondary metabolites" respectively. These results conveyed that the amino acid pathway and terpenoid pathways were also highly active in K. heteroclita, and the corresponding genes would be good candidate genes for lignan, sesquiterpenoid and triterpenoid biosynthesis respectively.

Identification of DEGs, GO and KEGG enrichment analysis
Results obtained from the statistical analysis have revealed a list of highly significant and DEGs (p < 0.05 and log2 foldchange > 1). As our current study is focused on understanding the DEGs involved biosynthesis of lignan, sesquiterpenoid and triterpenoid pathways among the leaves, roots and stems of K. heteroclite, we have designed our analysis as K_L vs. K_R, K_S vs. K_R and K_L vs. K_S respectively. The statistical analysis has resulted in total of 41,497 significant DEGs. Among these significant DEGs, 22,468 (22.92%) genes were found to be significantly expressed in leaves and roots samples, out of which 9,511 (42.33%) genes were upregulated in leaves samples and 12,957 (57.67%) genes were found to be downregulated in leaves ( Figure  8A). Similarly, 7,373 genes (7.52% of all genes) were identified as significantly DEGs between leaves and stems samples with 3,950 (53.57%) genes upregulated and 3,423 (46.43%) genes downregulated in leaves ( Figure 8B). Finally, 11,656 genes (11.89%) were identified as significantly DEGs between stems and roots with 4,302 (36.91%) genes upregulated and 7,354 (63.09%) genes downregulated in stems ( Figure 8C). In order to compare the commonly expressed genes among the K_L vs. K_R, K_S vs. K_R and K_L vs. K_S conditions, Venn diagram for different comparison groups was used. Among these three groups, 1761 DEGs were identified in common among all the datasets ( Figure 8D). Specifically, 9594 DEGs were commonly identified among the "K_L vs. K_R" and "K_S vs. K_R"; 6032 DEGs were commonly identified among the "K_L vs. K_R" and "K_L vs. K_S" groups, while 2348 DEGs were identified commonly among the "K_L vs. K_S" and "K_S vs. K_R" groups respectively ( Figure 8D).

Figure 7. KEGG classification map
The ordinate is the pathway, and the abscissa is the proportion of genes belonging to the corresponding pathway. These genes were divided into five categories: A. Cellular Processes; B. Environmental Information Processing; C. Genetic Information Processing; D. Metabolism; E. Organismal Systems.
The above obtained significant DEGs were further analyzed using the GO and KEGG databases for understanding and revealing the functional involvement and biological contextualization of these genes in biosynthesis of lignan, sesquiterpenoid and triterpenoids in K. heteroclita (Table S1-S6). The results of gene enrichment analysis using KEGG pathway database showed that 83 DEGs (number of DEGs ranked 4 th ) fell into sesquiterpenoid and triterpenoid biosynthesis pathway, while 192 DEGs (ranked 10 th ) fell into phenylpropanoid biosynthesis pathway in the K_L vs. K_R comparison ( Figure 9A, Table S4). Whereas in the comparison of K_L vs. K_S, 18 DEGs (ranked 4 th ) were assigned into sesquiterpenoid and triterpenoid biosynthesis pathway and 79 DEGs (ranked 2 nd ) were assigned into phenylpropanoid biosynthesis pathways, respectively ( Figure 9B, Table S5). Finally, in the comparison of K_S vs. K_R, 60 (ranked 4 th ) and 107 (ranked 14 th ) significant DEGs were expected to be involved in the sesquiterpenoid and triterpenoid biosynthesis pathway and phenylpropanoid biosynthesis pathway respectively ( Figure 9C, Table S6). Also, the top 20 KEGG enrichment pathways obtained from the above enrichment analysis were shown in Figure 9. Biosynthetic genes of lignan in K. heteroclita The annotation of K. heteroclita genes against the KEGG database has resulted in a total of 191 unigenes assigned to the lignan biosynthesis pathway (Table 4). Interestingly, out of these 191 genes, 28 genes code for DIR (dirigent protein), 25 genes code for CAD (cinnamyl alcohol dehydrogenase), 24 genes code for SDH (shikimic acid dehydrogenase), 24 genes code for 4CL (4-Coumarate-CoA Ligase), and only 1 gene codes for CSE (Caffeoylshikimate esterase) respectively. Results obtained from our gene expression study reports that genes involved in lignan biosynthesis pathways were highly expressed in root samples. Specifically, we have observed that genes encoding for PAL (phenylalanine ammonia-lyase), C4H(Cinnamate-4-Hydroxylase), 4CL, CAD, CCOMT (caffeoyl CoA 3-O-methyltransferase), DIR and SDH were significant and highly expressed in root samples compared to leaf and stem ( Figure 10, Table 4). Interestingly, we have observed two genes coding for Coq6 (Coenzyme Q6, Monooxygenase) were slightly up-regulated in root and stem samples with zero expression in leaf samples. These results indicated that caffeoyl-CoA was synthesized from p-cinnamoyl-CoA by HCT and C3'H during developmental stages of K. heteroclita ( Figure 10, Table 4). We also predicted that K. heteroclita employs CCOMT instead of COMT gene for the synthesis of feruloyl-CoA based on the gene expression ratios of CCOMT/COMT in roots (4.77), stems (3.19) and leaves (1.07). Using the information obtained from our current transcriptome study and by using the pre-existing literature we have developed the tentative lignan biosynthesis pathway in K. heteroclita  The expression level is the sum of all the unigenes for each gene, and log10 (sum(FPKM) + 1) was used to plot the heatmap. Candidate unigenes come from the annotation. Among the list of significant DEGs, 28 genes code for DIRs (Table S7). Interestingly, the K. heteroclita genome codes double the number of DIR genes compared to its closely related species Schisandra chinensis (Chen et al., 2020). Earlier studies have revealed and classified plant DIR proteins into six subfamilies : DIR-a, DIR-b/d, DIR-c, DIR-e, DIR-f and DIR-g (Ralph et al., 2007), interestingly these subfamilies exhibits very low sequence similarity between them (Song and Peng, 2019). Kim et al. (2002) have reported that only the DIRa subfamily can guide the cellular machinery to synthesis correct three-dimensional structure of pinoresinol (Kim et al., 2002). While the functional involvement of other DIR-like subfamily proteins are unclear (Kim et al., 2002). Among these 28 significant KhDIR genes: 17 genes belong to DIR-b/d, 4 genes belong to DIR-c, 6 genes belong to DIR-e and only 1 gene belongs to DIR-a subfamilies respectively ( Figure 11A). These results suggest that the only one gene coding for KhDIR4 which belongs to DIR-a subfamily is responsible for the formation of pinoresinol in K. heteroclita. However, the expression levels of KhDIR4 gene are comparatively lesser than other DIR encoding genes ( Figure 11). Kim et al. (2012) has reported that DIR genes expressed in different phylogenetically related species of K. heteroclita such as SchDIR1, FiDIR, TpDIR5 and TpDIR8 transcribed for (+)-pinoresinol-forming proteins, and two other candidate AtDIR5 and AtDIR6 transcribed for (-)-pinoresinol formation (Kim et al., 2012). Research studies based on site directed mutagenesis, protein modelling and docking experiments have revealed that three phenylalanines F90, F113, F163 and amino acid sequence starting from Asn98 to Pro146 in SchDIR1 protein are critical for its (+)-pinoresinol-forming activity (Kim et al., 2012).
Results obtained from the phylogenetic analysis of DIR protein sequences have revealed that KhDIR4 and SchDIR1 shares same clade exhibiting 92.82 % similarity ( Figure 11). Interestingly, KhDIR4 protein also possesses the three key phenylalanine residues, and exhibits almost similar three-dimensional configuration as SchDIR1 protein ( Figure 12A-E). These results suggest that the role of KhDIR4 in stereoselective synthesis of (+)-pinoresinol from coniferyl alcohol. Li et al. (2017) has been reported the functional involvement of GmDIR22 (belonging to DIR-b/d subfamily) in synthesis of lignan (+)-pinoresinol by effectively directing Econiferyl alcohol coupling . We have observed from our results that, among the 28 DIRs three homologs KhDIR8, KhDIR19 and KhDIR20 are in the same clade with GmDIR22, suggesting their similar functions as GmDIR22 respectively ( Figure 11A reported that MtDIR exhibited 40% higher expression levels in root compared to other plant parts, while we have observed similar behavior in gene expression level ranging as high as 57% in K. heteroclita root samples respectively (Zhao et al., 2020). Thus, from these results we suspect the involvement of KhDIR8, KhDIR22, KhDIR23 and KhDIR28 genes in (+)-pinoresinol synthesis ( Figure 11B).
Previous studies have reported that PLR protein plays a crucial role in production of lariciresinol and/or secoisolariciresinol from pinoresinol (Markulin et al., 2019). However, the substrates and products involved in its production are species-dependent. For example, the substrates for LuPLR1 and TpPLR1 proteins are (-)pinoresinol and (-)-lariciresinol which leads to the product (+)-secoisolariciresinol (Fujita et al., 1999;von Heimendahl et al., 2005). Similarly, the substrates for LuPLR2, TcPLR2.2, TcPLR3, TpPLR2 and PpPLR proteins are (+)-pinoresinol, (+)-lariciresinol but the product is (-)-secoisolariciresinol respectively (Fujita et al., 1999;Hemmati et al., 2010;Chiang et al., 2019). It is still unknown that whether these two steps are catalyzed by one PLR or two different PLRs. We have observed two PLR candidate genes in K. heteroclita transcriptome. Further studies must be conducted to understand the function and mechanism of these PLR genes in production of lariciresinol/secoisolariciresinol.  Finally, we have observed 24 significant DEGs in K. heteroclita transcriptome which are coding for SDH protein. Xia et al. (2001) has reported that, NAD-dependent SDH protein of Forsythia intermedia (FiSDH) and Podophyllum peltatum (PpSDH) converts (-)-secoisolariciresinol into dibenzylbutyrolactone lignan (-)-matairesinol (Xia et al., 2001). Results obtained from our sequence alignment studies involving KhSDH, FiSDH and PpSDH showed that they have commonly exhibited the conserved glycine-rich motif GXXGXG which was reported to be involved in binding of the pyrophosphate group of NAD + . Also, these proteins commonly possess D47 found in SDRs which preferentially binds to NAD(H) ( Figure 13A). The catalytic triad Ser153, Tyr167, and Lys171 adjacent to both NAD + and substrate molecules in PpSDH (Youn et al., 2005) were also observed in KhSDH proteins (except the KhSDH9 protein) ( Figure 13A). Results obtained from phylogenetic analysis has revealed that KhSDH17, KhSDH6, KhSDH7 proteins shared same clade with FiSDH and ShSDH2 ( Figure S1). Based on these results and from these gene expression patterns we report that both KhSDH3 and KhSDH8 genes are comparatively more active during developmental stages ( Figure 13B).
Biosynthetic genes of the sesquiterpenoid and triterpenoid in K. heteroclita Naturally, sesquiterpenoid and triterpenoid are synthesized from methylerythritol phosphate (MEP) and mevalonate (MVA) pathways respectively. The genome-wide annotation of K. heteroclite against the KEGG database has resulted in a total of 124 genes assigned to the terpenoid backbone biosynthetic pathways, out of which 64 genes coding for 8 enzymes involved in MEP pathway and 60 genes coding for 6 enzymes involved in MVA pathway respectively ( Figure 14A, Table 5). Further classifying these 124 genes has showed that 32 genes code for DXS, 26 genes code for HMGS, 15 genes code for HMGR, 14 genes code for AACT and 1 gene each for PMK, MVD, MCT and CMK respectively. The expression patterns of these genes suggest that MEP pathway is highly active in root, stem and leaf, whereas MVA pathway is highly active in root and stem respectively ( Figure 14B). The genome-wide annotation of K. heteroclite against the KEGG database has resulted in a total of 150 genes assigned to the sesquiterpenoid biosynthetic pathway. Out of these 150 genes, 16 genes are significantly upregulated and 67 genes significantly downregulated in leaf vs. root, and 2 genes significantly upregulated and 13 genes significantly downregulated in leaf vs. stem (Figure14A , Table 5).
Further classifying these 150 genes has showed that: 121 genes code for GDS, 13 genes code for GPPS2, 6 genes code for GPPS, 4 genes code for FPPS and 1 gene each for PO, SS and VS respectively (Table 5). Among these significant DEGs, we have observed that six genes FPPS, VES, PO, GDS, SS and VS were significantly upregulated in root compared to stem and leaf samples ( Figure 14B, Table 5). These results suggest that sesquiterpenoid pathway is highly active in root samples (Figure14 , Table 5).

Identification of transcription factors
Transcription factors play an important role in regulation of plant secondary metabolism by activating or inhibiting the expression of functional genes involved in various biosynthesis pathways (Nag et al., 2020).
We have used the software HMMER3.0 to specifically search the TFs against the annotated transcriptome of K. heteroclita. Results obtained from the HMMER3.0 search have revealed that 4,528 genes coding for various TFs are belonging to 82 categories. Among these 4,528 genes, 297 (6.16% of total TFs) genes code for MYB, 274 (6.05%) genes code for mTERF, 247 (5.45%) genes code for bHLH, 198 (4.37%) genes code for NAC and 192 (4.24%) genes code for AP2-EREBP respectively ( Figure 15, Table 6, Table S8). We have also observed the genes coding for the bZIP and WRKY TFs in K. heteroclita transcriptome (Table 6). In summary, the majority of the above-mentioned TFs are highly upregulated in root and stem compared with leaf samples (Table 6).  Figure 15. Top 20 TFs in K. heteroclita analysis has revealed that KhMYB4 and KhMYB145 TFs share the same S7 subclade with AaMYB1, indicating their possible regulatory role in sesquiterpenoid biosynthesis ( Figure 16). Recent studies based on PtMYB4 in Pinus taeda and VvMYB5b in Vitis vinifera have reported that these TFs influenced the accumulation of terpenoids and phenylpropanoids in plants (Nagegowda and Gupta, 2020). Similarly, in Actinidia deliciosa the genes coding for AdMYBR2, AdMYBR3, AdMYB3, AdMYB7 and AdMYB8 TFs positively regulate carotenoid biosynthesis via transcriptional activation of lycopene beta-cyclase gene (Ampomah-Dwamena et al., 2019). These results suggest that KhMYBs in clade S2, S6, S8, S9, and S10 have the potential regulatory roles in terpenoid biosynthesis ( Figure 16).  Table S9.
The basic-helix-loop-helix (bHLH) is a class of TF which were also found to be involved in regulation of plant terpenoid biosynthesis by recruiting defined cis-regulatory elements which are part of a conserved model for jasmonate signaling pathway . , has reported about two candidate genes of Medicago truncatula coding for triterpene saponin biosynthesis activating regulators: MtTSAR1 and MtTSAR2 which play a key role in enhancing nonhemolytic and hemolytic soyasaponin biosynthesis by activating the corresponding pathway genes respectively . Recent study has reported that MtTSAR3 controls hemolytic saponin biosynthesis in developing seeds (Ribeiro et al., 2020). Jarvis et al. (2020), has reported that Chenopodium quinoa genes encoding for CqTSARL1 in seed and CqTSARL2 in root also regulate the triterpenoid saponin biosynthesis (Jarvis et al., 2017). Studies have also reported that Catharanthus roseus, a medicinal plant without triterpenoid saponin codes two bHLH TFs  (Table 6), out of which 179 putative KhbHLH are obtained after removing the identical proteins which are transcribed by different genes. We have totally observed ten classical bHLH clades Ia, Ib, IIIc, IIId, IIIe, IIIf, IVa, IVb, IVc and IVd in our phylogenetic tree (Figure 17, . Results obtained in our study shows that KhbHLH54 and KhbHLH154 cluster with CqTSARL1, CqTSARL2, CrBIS1, CrBIS2, MtTSAR1, MtTSAR2 and MtTSAR3 belonging to IVa bHLH (Figure 17), indicating their potential functional involvement in sesquiterpenoid and triterpenoid biosynthesis in K. heteroclita. Study conducted on Aquilaria sinensis has reported that AsMYC2 promotes the agarwood sesquiterpene biosynthesis by activating the expression of ASS1 (Aquilaria sesquiterpene synthase 1) gene through the jasmonate signaling pathway . Similar study conducted on Solanum lycopersicum, SlMYC1 reported that it positively regulates the monoterpene biosynthesis in both leaf and stem trichomes but negatively regulate the sesquiterpene biosynthesis in stem trichomes (Xu et al., 2018). However, in Catharanthus roseus, CrMYC2 promotes the accumulation of monoterpenoid indole alkaloids in shoot (Schweizer et al., 2018). Results obtained in our study are in accordance as the protein sequences encoding for KhbHLH114, KhbHLH134, KhbHLH60 and KhbHLH114 were found to cluster with CrMYC2, SlMYC1 and AsMYC2 in the IIIe clade (Figure 17), suggesting their role in sesquiterpenoid and triterpenoid biosynthesis in K. heteroclita. In A. thaliana, clade IIId bHLH TFs AtbHLH3, AtbHLH13, AtbHLH14 and AtbHLH17 acted as transcription repressors against the transcription activators, such as MYC2 and the MYB/bHLH/WD40 complex by binding to their target sequences (Song et al., 2013). Results obtained in our study shows that TFs encoding for KhbHLH60, KhbHLH163, KhbHLH162 and KhbHLH169 cluster with AtbHLH3, AtbHLH13, AtbHLH14 and AtbHLH17 in the IIId clade (Figure 17), which indicates their role as a negative regulator of the sesquiterpenoid and triterpenoid biosynthesis pathway in K. heteroclita.  Table S10 Discussion As of August 11 th , 2020, 19,936,210 cases of CoVID-19 infection and 732,499 deaths have been confirmed worldwide (WHO, 2020). However, this epidemic has been effectively controlled in May, 2020 in China, whose anti-epidemic experience has verified the effectiveness of traditional Chinese medicine (TCM) against the new coronavirus (State Council Information Office of the PRC, 2020). Therefore, strengthening basic research on TCM will provide a solid strategic reserve for future epidemic prevention and control. K. heteroclita is one of the most famous TCMs widely used in Tujia and Yao Nationality in China. Despite its extensive pharmacological effects and remarkable medicinal effects, it has not been collected in the Chinese Pharmacopoeia because of lacking in-depth basic research, which also prevent its national and global use . At the same time, several factors are leading to the incorrect usage of K. heteroclita as the same species with different names and same name with different species were being found and circulated, especially K. interior being used for K. heteroclita. Thus, addressing such problems will significantly benefit the growing research on TCM. For the first time we have reported complete genome-wide transcriptome sequence of K. hetroclita to elucidate molecular mechanisms underlying the biosynthesis of lignan, sesquiterpenoid and triterpenoid.
Our present study reports the transcriptome of roots, stems and leaves samples obtained from K. heteroclita seedlings. A total of 160,248 transcripts and 98,005 genes were obtained after analysis, suggesting us that one gene may have more than one transcript. The N50 value is an important value used for comparing the de novo transcriptome assemblies (Sadat-Hosseini et al., 2020). The average length of transcripts and genes were 927 bp and 1306 bp while their N50 were 1609 bp and 1838 bp respectively ( Table 2). The average transcript length of K. heteroclita were found to be longer than the average transcript lengths of Lolium multiflorum and Persea americana (Ge et al., 2019;Cechin et al., 2020). These results indicated that K. heteroclita transcriptome assembly quality was good for further downstream analysis. Results obtained from the species distribution analysis of unigenes showed us three top matched species: Nelumbo nucifera (20.42%), Macleaya cordata (11.36%) and Amborella trichopoda (11.30%) (Figure 3). The whole genome sequences of these plant species closely related to K. heteroclita were not studied till date.
Earlier studies, have reported that main active components of K. heteroclita are lignan, sesquiterpenoid and triterpenoid . Lignans are widely distributed in plants and it imparts strength and disease resistance in plants. But most importantly lignans constitute the active pharmacological compounds used for wellbeing of human beings (Markulin et al., 2019). Therefore, understanding the biosynthesis of lignan is gaining attention in the scientific community. In our study, we have revealed about the involvement of 191 unigenes (coding for 15 enzymes) in lignan biosynthetic pathway. Results obtained from gene expression profiling suggests that the lignan biosynthesis is more active in root, followed by stems and leaves samples respectively ( Figure 10). The results obtained were in accordance with previous studies and endorses the usage of stems and roots medicinal compounds. Using the results obtained from this gene expression analysis we have deduced tentative lignan biosynthesis mechanism in K. heteroclita which commences via PAL, C4H, 4CL, HCT, C3'H, HCT, CCOMT, CCR, CAD, DIR, PLR and SDH as shown using red arrows in Figure 10A. It is well known that lignan is synthesized via the phenylpropanoid pathway, therefore we have specifically focused on the lignan pathway genes KhDIR, KhPLR and KhSDH in our analysis. Studies have reported that in Urtica dioica and Schisandra chinensis, tissue-and organ-specific distribution of lignans are dependent on the expression patterns of DIR and PLR genes Chen et al., 2020). Results obtained from sequence alignment, phylogenetic analysis and gene expression analysis reveal that KhDIR4 differentially expressed in roots, stems and leaves might be responsible for the conversion of coniferyl alcohol to (+)pinoresinol. K. heterocliata transcriptome hosts two KhPLR genes, which supports the findings from previous reports, that more than one PLR coding genes exists in each plant (Markulin et al., 2019). However, the number of KhPLR is far less than that of KhDIR and KhSDH genes indicating that KhPLR might be a ratelimiting enzyme in the lignan biosynthesis pathway of K. heteroclita. The relatively lower expression of KhPLR gene in roots, stems and leaves samples further supports our hypothesis ( Figure 10B). The expression of PLR gene is species-and organ-specific. In mature Linum usitatissimum, both LuPLR1 and LuPLR2 are expressed in seeds and roots, whereas only LuPLR2 is expressed in stems and leaves (Hemmati et al., 2010). Similarly, the seedlings of U. dioica, UdPLR1, UdPLR2 and UdPLR3 genes are highly expressed in top stems, roots and leaves, respectively . It was also observed in the seedlings of A. thaliana, where AtPrP1 is highly expressed in roots and stems whereas, AtPrR2 is only expressed in roots (Nakatsubo et al., 2008). Similar expression pattern was observed with K. heteroclita transcriptome, only KhPLR1 gene is highly expressed in roots, with a lower expression in stems and leaves and KhPLR2 gene expressions in all three tissues are found be very weak (Table S11).
Sesquiterpenoids and triterpenoids are derived from terpenoid backbone pathway which further consists of MEP and MVA pathways. Previous studies have revealed that in MEP pathway, DXS is the first and rate-limiting enzyme, and it expresses based on the feedback generated from the last metabolite. Importantly, the gene expression and activity of the first enzyme in a pathway is considered as a common regulatory mechanism (Banerjee and Sharkey, 2014). K. heteroclita genome codes for 32 KhDXS genes (Table 5) indicating that it is not a rate-limiting enzyme in K. heteroclita terpenoid pathway, and occurrence of so many DXS could avoid the feedback inhibition in K. heteroclita. Recent study conducted by Xue et al. (2019) have reported that in Panax ginseng, MCT is an rate-limiting enzyme in MEP pathway (Xue et al., 2019). The transcriptome of K. heteroclita contains only one KhMCT gene and we have observed that the expression of KhMCT in root, stem and leaf samples were relatively low compared to other MEP pathway genes ( Figure 14B, Table 5). The gene expression profile of KhMCT indicates that it is a rate-limiting gene in MEP pathway.
Similarly, CMK could be another rate-limiting enzyme ( Figure 14B, Table 5). Previous studies also suggest that plastidial IDI plays an important role in regulating the relative ratio of IPP to the DMADP, as they are the key precursors for various downstream pathways (Pankratov et al., 2016). K. heteroclita transcriptome contains 4 IDI genes with two genes upregulated in leaves samples (Table 5). Although, there are no significant differences among the overall expression patterns of IDI genes in roots, stems and leaves ( Figure 14B), suggesting that IDI might not be a rate-limiting gene. Similarly, K. heteroclita transcriptome contains two genes encoding for KhHMGR genes (key rate-limiting enzyme of MVA pathway) which were found to be highly up regulated in stem and root samples (Table 5). Thus, results obtained in our study shed light on a new topic of investigation focused on the relative contribution of MEP and MVA pathway, and its downstream biosynthesis of active medicinal compounds. Previous studies conducted on Nothapodytes nimmoniana has revealed that MEP pathway is a major route for production of camptothecin (an active medicinal compound) (Rather et al., 2019). Whereas studies based on P. ginseng has reported that both MEP and MVA pathway contribute to the ginsenoside biosynthesis (Xue et al., 2019). However, differential gene expression analysis of K. heteroclita leaf, stem and root samples revealed that genes encoding for MVA pathway genes such as KhMK, KhPMK, KhMVD are lowly expressed in comparison with MEP pathway genes ( Figure 14B), suggesting that MEP pathway might play a crucial role in the production of precursors IPP and DMAPP involved in sesquiterpenoid and triterpenoid biosynthesis. Further studies must be conducted to understand and reveal the exact functional involvement of these genes.
Similarly, the FPP is a branchpoint for sesquiterpenoid and triterpenoid biosynthesis pathway. In the sesquiterpenoid pathway, FPP is cyclized into structurally diverse sesquiterpenoids by various sesquiterpene synthases, whereas in triterpenoid biosynthesis pathway FPP is dimerized into squalene by SQS (Kanehisa, 2020). Results obtained from differential gene expression analysis of K. heteroclita showed that genes encoding for GDS are highly expressed in root samples suggesting that the germacrene D is the main sesquiterpenoid product during this stage of development ( Figure 14B). K. heteroclita transcriptome expresses only one KhSQS gene involved in the triterpenoid pathway. These results are in accordance with other species such as Euphorbia pekinensis, Chlorophytum borivilianum, Euphorbia tirucalli, Taxus cuspidata, Lotus japonicus and Oryza sativa respectively. Whereas there are more than one SQS gene in the genomes of Glycyrrhiza glabra, Salvia miltiorrhiza and P. ginseng (Rong et al., 2016). In Salvia miltiorrhiza, the gene expression profiles of SmSQS2 in roots is almost twice the gene expression level of leaves samples (Rong et al., 2016). Similarly, previous studies conducted on Tripterygium wilfordii, Medicago sativa and Dryopteris fragrans, TwSQS, MsSQS and DfSQS1 have reported highest gene expression patterns in roots, followed by leaves and stem samples respectively (Zhang et al., 2018;Gao et al., 2019;Kang et al., 2019). Interestingly in K. heteroclita transcriptome, KhSQS gene exhibited almost the same expression pattern in roots, stems and leaves samples ( Figure 14B), suggesting the significance of KhSQS gene during K. heteroclita seedling development. Interestingly the results obtained from the phylogenetic analysis of KhSQS shows that it shares the same clade with monocots, although it belongs to eudicot, which implies its significance in the evolution from monocot to eudicot ( Figure S2).
Squalene epoxidase is an important rate-limiting enzyme which is involved in conversion of squalene to 2,3-epoxy-2,3-dihydrosqualene during the biosynthesis of sterols and triterpenoids. The T. wilfordii genome contains a total of five SE genes, out of which TwSE1-4 exhibited higher gene expression levels in the roots followed by stem samples, while TwSE5 exhibited lower gene expression profiles in roots, stem and leaves and eventually lost its SE activity (Zhou et al., 2018). Similar study conducted on A. thaliana reported that it contained six SE genes, among them AtSQE1-3 encodes functional SEs whereas the AtSQE4-6 gene does not code any functional genes, out of all these genes AtSE1 is responsible for the root and seed development (Rasbery et al., 2007). K. heteroclita transcriptome harbors four KhSE genes, while naturally most of the plants exhibit more than two SE genes (Rasbery et al., 2007). KhSE4 gene was found to be highly expressed in roots samples while KhSE1-3 exhibited weak and lower gene expression profile ( Figure S3). These results suggest us that only KhSE4 is majorly involved in the biosynthesis of triterpenoid during the K. heteroclita developmental stage. Results obtained from the phylogenetic analysis has revealed that KhSE1-4 clusters with TwSE1-4, whereas the TwSE5 gene formed a single clade ( Figure S4). These results mainly suggest that KhSE1-4 all other genes possessed the SE activity, and they are mostly organ-specific in K. heteroclita.
Recent studies have implemented transgenic technology for successfully enhancing the production of triterpenoid in plants. For example, over expression of MsSQS in alfalfa led to the accumulation of total saponins, suggesting a correlation between MsSQS expression level with saponins content respectively (Kang et al., 2019). Similarly, in Eleutherococcus Senticosus gene expression of the SQS and SE genes are positively correlated with the saponin content , Ganoderma lingzhi, and overexpression of SE doubled the production of ganoderic acid . Based on the occurrence of one SQS and the four SE genes and their expression profiles in K. heteroclita seedling samples, we believe that KhSQS and KhSE genes might exhibit great potential in increasing the active triterpenoid production. Based on the results obtained in our study we believe that K. heteroclita can synthesize various types of lignans, sesquiterpenoids and triterpenoids, indicating the diversity of genes and enzymes involved in their biosynthesis. For the first time our study provides preliminary information about the structural genes and possible regulatory genes involved in lignan, sesquiterpenoid and triterpenoid biosynthetic pathways in K. heteroclita. However, further studies must be conducted in future to figure out the downstream metabolic pathways by conducting other molecular and isotope tracing techniques.

Conclusions
Kadsura heteroclita is popularly known for its medicinal importance owing to its key ingredients used in TCM in Tujia and Yao Nationality. Although K. heteroclita has been used for centuries, the cellular and molecular by mechanisms involved in regulating the biosynthesis of the medicinal components, especially lignan, sesquiterpenoid and triterpenoid, is not known till date. Our present study reports genome-wide transcriptome of roots, stems and leaves samples are obtained from K. heteroclita seedlings. We have extensively reported about the genes involved in the lignan, sesquiterpenoid and triterpenoid biosynthetic pathways. Based on the information obtained from the gene expression profiles and sequence alignments we have reported a putative lignan biosynthetic pathway in K. heteroclita. Also, we have revealed that genes involved in these biosynthetic pathways were highly upregulated in root and stem samples. We strongly believe that results obtained from K. heteroclita transcriptome analysis could significantly benefit future studies conducted on understanding the lignan, sesquiterpenoid and terpenoid biosynthetic pathways in K. heteroclita.

Authors' Contributions
Conceptualization: XZ and WQ; Investigation: XZ and CL; Methodology: XZ and CL; Formal analysis: XZ, CL, CC and TM; Funding acquisition: CL and XZ; Writing-original draft: CL, XZ and AK; and Writing-review and editing: WQ and CC. All authors read and approved the final manuscript. Chiang