Characterization the coding and non-coding RNA components in the transcriptome of invasion weed Alternanthera philoxeroides

Alternanthera philoxeroides is a notorious invasive weed worldwide, but it still lacks a genome information currently. In this study, we collected 4 groups of A. philoxeroides Illumina RNA-seq data (62.5 Gb) and performed a comprehensive de novo assembling. Totally, 421,372 unigenes were obtained with a total length of 230,842,460 bp, with 43,430 (10.31%) unigenes longer than 1000 bp. Then 119,222 (28.3%) unigenes were functional annotated and 235,885 (56.0%) were grouped into reliable lncRNAs reservoir. Besides, 534 tRNA and 234 rRNAs were identified in assembly sequences. Additionally, 131,624 microsatellites were characterized in 106,761 sequences. Then SSR primers were developed for the amplification of 40,752 microsatellites in 36,329 sequences. The miRNAs are key post-transcriptional regulators, about 86 candidate miRNA sequences were detected from A. philoxeroides assembly, and miRNA target genes prediction revealed possible functions of them in growth and development as well as stress responding processes. These results provide a vital basis for sequence-based studies of A. philoxeroides in the future, especially gene function analysis.


Introduction
Alternanthera philoxeroides (Mart.) Griseb (A. philoxeroides), also known as alligator weed, belongs to family Amaranthaceae. It is a noxious semi-aquatic invasive weed native to South America and now is invasive in many parts of the world such as India, China, and Australia (Bai et al., 2017). This invasion weed displays strong adaptive ability to a variety of habitats (e.g. aquatic, terrestrial habitats, farm lands) and fluctuating environments (Liu et al., 2019). It inflicts serious damage to biodiversity and agricultural production to its invasion habitats . In China, A. philoxeroides caused about 98 million US dollars lost annually, and is on the first shortlist of the invasive species that requires special control (Liu et al., 2019).
2 High-throughput sequencing has been widely used to identify transcripts involved in specific biological processes in plants, which is particularly useful for those species without whole genome sequencing (Li et al., 2015;Zhu et al., 2019a;Zhu et al., 2019b). In A. philoxeroides, RNA-seq has been carried out in several studies to reveal its adaptations mechanisms to different habitats. For instance, Li et al. (2015) compared the gene expression profiles of A. philoxeroides under control and low potassium stress and discovered a large number of root-specific transcripts in response to low potassium stress. Gao et al. (2015) characterized transcript abundance variation in A. philoxeroides in contrasting hydrological conditions using RNA-seq, which improved our understanding of the biological processes underlying the phenotypic changes in this species . Liu et al. (2019) examined the differences in cold-responsive gene expression in A. philoxeroides from the northern edge and the central portion of its range, and suggested that genetic changes have accompanied the recent northward expansion of A. philoxeroides in China. However, since different RNA-seq projects were designed to solve specific problems, a single RNA-seq project may not enough to excavate as many transcripts as we expect, especially for A. philoxeroides, a species that lacks a genome information. Thus, combining different RNA-seq data together to perform de novo assembling is one robust strategy to obtain a more comprehensive transcripts reservoir (Iyer et al., 2015).
Previous studies revealed that A. philoxeroides showed strong phenotypic plasticity with extremely low genetic diversity in China (Wang et al., 2005;Geng et al., 2019). Thus, to provide a more comprehensive transcriptome information of A. philoxeroides, we downloaded all accessible RNA-seq data of this species from NCBI and used them as input to carried out the de novo assembling with Trinity. Then expression profiling and function annotation were performed to obtain unigenes (the clustered contigs designated during the Trinity assembly). This study will provide useful gene resource for further study in elucidating the environmental acclimation mechanism and invasion management of A. philoxeroides.

Function annotation
For unigenes annotation, the public databases including eggNOG (a database of orthologous groups and functional annotation), GO (Gene Ontology), KEGG (Kyoto Encyclopedia of Genes and Genomes), 3 KOG/COG (Clusters of Orthologous Groups of Proteins), Nr (NCBI non-redundant protein sequences), Pfam (Protein family), and Swiss-Prot (a manually annotated and reviewed protein sequence database, https://www.uniprot.org/) were used to perform gene function annotation, with a cut-off Expect value (E value) of 10 −10 (Yin et al., 2018a;Zhu et al,. 2019c).

Results and Discussion
De novo assembling To provide a comprehensive transcriptome data of A. philoxeroides, we collected all useful high throughput transcriptome sequencing data deposited in NCBI, and conducted the de novo assembling (Yin et al., 2020). The overall process of reads processing, transcriptome assembling, expression profiling, quality evaluation, coding RNA function annotation and non-coding RNA detection, is outlined in Figure 1. Illumina data belonging to four projects were used and their detailed information can be found in File S1. 4 Totally, 62.5 Gb data from four projects were used as input to perform the de novo assembling. Detailed information of these data can be found in File S1. As a result, 421,372 unigenes were obtained with a total length of 230,842,460 bp, an average length of 548 bp, and an N50 length of 621 bp. Specifically, 43,430 (10.31%) of 421,372 unigenes were longer than 1000 bp (Figure 2A, File S2). Similarly, in the RNA-seq study of Liu et al. (2019), a large number of unigenes (745,348) were detected and this may due to species specificity.
To assess the assembling quality, Benchmarking Universal Single-Copy Orthologs (BUSCO) was used to estimate the completeness of the assembly sequences and results showed that the ratio of "Complete BUSCOs (C)" for assembled unigenes was 77.08% ( Figure 2B). We then used assembly sequences as query to perform blastn against the previous assembly sequences by Liu et al. (2019). The result showed that 316,150 sequences assembled (75% of 421,372) in this study hit 723,070 of Liu et al. (2019) (97% of 745,348), which means most of the sequences reported by Liu et al. (2019) were assembled in this study. To be specific, 25% new sequences were assembled in the present study. Liu et al. (2019) used CEGMA to assess the completeness of their de novo assembly sequences and results showed that "complete" ratio was 92.7%, "missing" ratio was 2.7%, and "fragmented" ratio was 4.6%. In this research, BUSCO, the most popular and widely used assessment software for now, was used to assess the completeness of the de novo assembly results. For comparison, we further used BUSCO to assess the completeness of assembly sequences produced by Liu et al. (2019). As a result, the ratio of "Complete BUSCOs (C)" of assembly sequences produced by Liu et al. (2019) (85.4%) is slightly higher than ours (77.1%). However, most of sequences in the study of Liu et al. (2019) seem to be "duplicated" (75.1%), which is much higher than ours (2.3%). These results implied that the de novo assembly of this study contained more unique sequences with low duplication ratio, suggesting assembly sequences in this study are favourable and applicable for subsequent studies to functional characterize potential candidate genes involved in various metabolic pathways. the open reading frame (ORF) prediction analysis. The "complete" represents ORFs with 5prime start codon and 3prime stop codon; "5prime_partial" represents ORFs with 5prime start codon but without 3prime stop codon; "3prime_partial" represents ORFs with 3prime stop codon but without 5prime start codon; and "internal" represents ORFs without both 5prime start codon and 3prime stop codon. The detailed information can be found in File

5
Then the TransDecoder was used to perform the gene structure annotation and the detailed information of output files including "transdecoder.gff3", "transdecoder.cds", and "transdecoder.pep" were listed in Supplementary Files 3 to 5. Totally, 263,143 ORFs were recognized in 155,422 unigenes (File S3). Among them, we detected 91,812 complete length ORFs, 72,920 5prime_partial length, 38,297 3prime_partial length, and 60,114 internal length with both 5prime_partial and 3prime_partial ( Figure 2C). It seems that there is a significant bias between incomplete 5'UTR and 3'UTRs. The bias was also reported in other studies. For example, in the leaf of Cucurbita pepo L. infested by Aphis Gossypii, 25,000 and 8,220 truncated 5'UTR and 3'UTRs were detected, respectively (Vitiello et al., 2018). We speculated that the reasons for this significant bias between incomplete 5'UTR and 3'UTRs may due to: (1) software (e.g. Trinity and TransDecoder) induced bias; (2) Illumina sequencing strategy induced bias, since left reads quality are generally better than right reads when pair-end sequencing; (3) and reasons we still don't know.

Expression profiling and function annotation
Expression levels of unigenes represented by FPKM values were calculated by RSEM and the details were listed in File S6. Then expression pattern profiling was performed and results showed that the expression levels of most unigenes were generally low (84.90% unigenes, FPKM Mean value < 1), while 11.12% unigenes showed medium expression level (1 ≤ FPKM Mean value < 10), and 3.98% unigenes showed high expression level (FPKM Mean value ≥ 10) ( Figure 2D).
Functional annotation and classification of the assembled transcripts provide insights into the diverse molecular and biological functions of represented genes (Xia et al., 2011;Zhou et al., 2018). We then performed BLAST to functional annotate the unigene sequences with the public annotation databases (Yin et al., 2018b Figure 2E, File S7). The ratio of unigenes being functional annotated is acceptable since non-protein-coding RNAs (ncRNAs) constitute a substantial portion of transcribed sequences. For example, in human, over 68% of genes were classified as lncRNAs (long non coding RNAs) (Iyer et al., 2015). And in Arabidopsis at least 50% of transcripts were lncRNAs and more lncRNAs will be detected with the development of the next-generation sequencing technologies and transcriptomic analysis approaches (Liu et al., 2012;Yuan et al., 2016;Severing et al., 2018).

The tRNA and rRNA annotation
Despite the essential role of transfer RNA (tRNA) genes in protein translation in all living cells, little is known about the composition and amount of tRNAs in A. philoxeroides (Chan et al., 2008). Here, tRNAscan-SE were used to scan the tRNA and 534 tRNA sequences were detected, including the tRNAs of Ala (33) (21), and Trp (11) ( Figure 3A, File S8). The ribosomal RNA (rRNA) is another type of fundamental components in cells. Since rRNA genes are highly conserved sequences that are widespread across all genomes, they have extensive application values in the field of bioinformatics. For example, they can be used as the standard phylogenetic markers to characterize living species diversity and evolution (Karst et al., 2018). Therefore, the barrnap was used to detect rRNA sequences and 234 rRNAs were detected, including 129 sequences of 5S rRNA, 20 of 5.8S rRNA, 52 of 18S rRNA, and 33 of 28S rRNA ( Figure 3B, File S9).

Microsatellites identification and SSR primers design
Microsatellite markers (SSR) derived from plant genomic sequences have been extensively used over the past decades due to their characteristic of co-dominant, locus-specific, and high polymorphism levels (Fang et al., 2019;Zhang et al., 2020). Transcriptomic sequences can provide a rich source for genetic markers development, such as simple sequence repeats (SSRs) in microsatellites and single nucleotide polymorphisms (SNPs), especially for plant species without reference genome (Vašek et al., 2020). Previously, inter-simple sequence repeats (ISSR) markers were employed to interpret the genetic diversity of this invasive weed. However, due to the limited number of markers, the resolution of genetic diversity in natural populations was extremely low (Geng et al., 2016). Here, we used MIRA to detect microsatellite sequences from A. philoxeroides and 131,624 microsatellites in 106,761 sequences were detected, among which 8,754 were compound formation microsatellites. To be specific, 20,218 sequences contain more than one microsatellite and each 3.1 kb of analysed sequence contains one SSR marker ( Figure 3C, File S10). The most prevalent microsatellites were mononucleotide repeat pattern (80,750), followed by trinucleotide (19,265), dinucleotide (11,422), and compound (8,063) patterns. The most frequently observed microsatellite repeat was A/T, followed by G and AG/TC ( Figure 3D, File S10). Then 36,447 sequences were chosen to design SSR amplification primer using Primer3, which resulted in each of 40,752 microsatellites in 36,329 sequences been covered by four pair of PCR primers (File S11). Successful control of A. philoxeroides dependents on the fully 7 understanding of its genetic basis (Li and Ye, 2006). These large number of SSR markers developed in this study enable us to analyze population structure of A. philoxeroides at much higher resolution, which can accelerate the population genetic analysis of this invasion weed.

The miRNA
The miRNAs are key post-transcriptional regulators (Zhu et al., 2019a). They are involved in a broad range of plants growth and development and stress responding processes through cleavage and/or inhibition translation of functional gene transcripts (Zhu et al., 2019d). In this study, to explore the possible functions of A. philoxeroides miRNAs (Aph-miRNA), we detected miRNAs and predicted miRNA target genes. After BLASTn searching and ShortStack validation, about 86 candidate miRNA sequences were detected from A. philoxeroides assemblies (Files S12, S13). They were divided into 61 conserved miRNA families, including some well-studied members such as Aph-miR156 (regulation of flowering time), Aph-miR164 (response to water deficit), Aph-miR168 (influences phase transition, leaf epinasty, and fruit development), and Aph-miR394 (cold stress response) (Wang, 2014;Phookaew et al., 2014;Xian et al., 2014;Song et al., 2016). Then, miRNA target genes were predicted by psRNATarget. Most of these target genes were transcripts coding transcription factors (File S14), such as squamosa promoter-binding protein (targets of miR156), NAC domain containing protein (miR164), nuclear transcription factors-YA4, -YA5, -YA6 (miR169), and GRAS family transcription factors (miR171). Besides transcription factors, miRNA targets including genes coding UDP-glucoronate decarboxylase 2 (miR164), AGO1-1 (miR168), protein phosphatase and kinase (miR390), F-box family protein (miR394), and proteins with unknown functions were predicted. Most conserved miRNA target genes, especially transcription factors (SBP, NAC, YA, GRAS), were found to be homologous to the conserved miRNA targets in Arabidopsis and other plants, which lay the foundation for further miRNA functional analysis in A. philoxeroides (Lakhotia et al., 2014). Furthermore, gene ontology (GO) and enrichment analysis were performed by Blast2GO. Results demonstrated that within Cellular Component category most targets are genes coding membrane and nucleus components ( Figure 4A). Within Molecular Function category, targets are mainly involved in ATP, metal ion, and DNA binding functions ( Figure 4A). Within Biological Processes category, most targets participate in processes of oxidation-reduction, protein phosphorylation, and regulation of transcription ( Figure 4A). Using GO annotations of all assembling unigenes as background, miRNA targets are enriched in cellular protein modification process (GO:0006464) and macromolecule modification (GO:0043412) under, kinase activity (GO:0016301), transferase activity (GO:0016772), carbohydrate binding (GO:0030246) under Molecular Function, and membrane (GO:0016020) and plastid (GO:0009536) under Cellular Component ( Figure 4B).

The lncRNA reservoir
Considering the fact that large ratio of assembly sequences was not functionally annotated, we then used CNCI and CPC to confirm whether these sequences belong to lncRNA. Sequences that were identified as lncRNAs by both CNCI and CPC and did not contain TransDecoder interpreted protein coding regions were selected as reliable lncRNAs. Totally, 235,885 sequences (56% of 421,372 assembly sequences) were identified as lncRNAs ( Figure 5, File S15). It is widely accepted that ncRNAs play important roles in transcriptional and/or post-transcriptional regulation (Liu et al., 2012;Yuan et al., 2016;Severing et al., 2018). Accordingly, these large number of lncRNAs potentially involve in the regulation of physiological and biochemical processes of A. philoxeroides, and the function of which need to be deciphered in the future.

Conclusions
In this study, we performed a comprehensive de novo assembling using Trinity based on 4 groups of released A. philoxeroides Illumina RNA-seq data (62.5 Gb). Totally, 421,372 unigenes were obtained.
Completeness estimate showed the ratio of "Complete BUSCOs (C)" for assembled unigenes is 77.08%. The unigenes expression profiles were calculated by RSEM and their functions were annotated by public database including Nr, GO, KEGG, KOG/COG, Pfam, and Swiss-Prot. Further non-coding components analysis identified 534 tRNA and 234 rRNAs sequences. Microsatellites in transcriptome were scanned and detected 131,624 microsatellites in 106,761 sequences. Then 36,447 sequences were selected to design SSR primer, which resulted in each of 40,752 microsatellites in 36,329 sequences being covered by four pair of primers.
Eighty-six candidate miRNA sequences from A. philoxeroides assembly were detected, and miRNA target genes prediction suggested their possible functions in plant growth and stress responding processes. Besides the 119,222 (28.3%) functional annotated unigenes, 235,885 (56.0%) were grouped into reliable lncRNAs reservoir. These results provide an important reference for sequence-based studies of A. philoxeroides and enriched resource for gene function investigations.
All authors read and approved the final manuscript.