There is now considerable evidence that the activity of transposable elements (TEs; short DNA sequences that replicate themselves within a host’s genome) contributes to the genomic diversity observed in plants. A principal goal of evolutionary genomics is to understand how TE activity drives this diversity. One of the main questions we can ask about TEs concerns the age of insertion of the different types of elements.
TEs fall into two classes, defined by their mode of transposition. Class 1 elements are retrotransposons, which transpose through an RNA intermediate. A newly produced element is initially identical to its parental element, and copy number increases in the host’s genome by a copy-and-paste mechanism: the parental copy stays in place while the new copy is inserted at a break made in the host’s DNA. Class 1 elements may or may not have long terminal repeats (LTRs), which are short sequences present at both ends of the TE. In LTR retrotransposons the LTRs play a key role in proliferation, whereas non-LTR retrotransposons lack them.
Class 2 elements, DNA transposons, exist solely as DNA and have no RNA intermediate. Proteins translated either from an active element (one that can proliferate on its own) or from the host help these elements proliferate; most of them move by a cut-and-paste mechanism rather than by copying. Most DNA transposons insert randomly into host genomes, so their potential to generate deleterious mutations is high. Because of their copy-and-paste mechanism of insertion, retrotransposons are highly proliferative compared with DNA transposons.
Among the class 1 elements, LTR/Gypsy and LTR/Copia, whose terminal repeats are typically a few hundred base pairs long, are very common in plants, particularly in rosid genomes. Because presumably no selective forces act on these elements, we expect them to evolve neutrally. Nucleotide divergence (the proportion of DNA letters that differ between copies) can therefore be used to date the insertion of these elements. Studies have shown that LTR retrotransposons in plants are relatively young, with insertions dated to less than 15 million years ago.
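To illustrate how divergence translates into an age, the sketch below assumes the common approach of comparing the two LTRs of a single element, which are identical when the element inserts and then accumulate substitutions independently. The sequences are toy data and `ASSUMED_RATE` is a placeholder substitution rate that would need to be replaced by a value appropriate for the species; neither comes from this study.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns containing anything other than A, C, G or T (gaps, Ns) are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return mismatches / compared


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor correction for multiple hits: K = -3/4 * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("divergence too high for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def insertion_age_years(k: float, rate_per_site_per_year: float) -> float:
    """T = K / (2r): the factor 2 reflects that both LTRs diverge independently."""
    return k / (2.0 * rate_per_site_per_year)


# Placeholder rate (substitutions per site per year); an assumption, not a measured value.
ASSUMED_RATE = 1.3e-8

# Toy aligned LTR pair differing at one of 20 sites.
ltr_5prime = "ACGTACGTACGTACGTACGT"
ltr_3prime = "ACGTACGTACGAACGTACGT"

p = p_distance(ltr_5prime, ltr_3prime)
k = jukes_cantor(p)
age = insertion_age_years(k, ASSUMED_RATE)
print(f"p = {p:.3f}, K = {k:.4f}, estimated age = {age:.3g} years")
```

Because the age scales inversely with the assumed rate, the choice of rate matters as much as the divergence estimate itself.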
Cannabis sativa has a genome of approximately 800 million base pairs. Since there is no full assembly of the Cannabis genome, we have devised a method to estimate the divergence of TEs from the raw sequencing data. We have already determined the repetitive content of the genome and annotated (identified) the transposable elements present in it. We found that Gypsy (15.4%) and Copia (14%) elements make up a large part of the Cannabis genome, as expected because Cannabis belongs to the rosid clade. We characterized the repetitive content using the software RepeatExplorer, which uses a novel approach of graph-based clustering of the sequences generated by whole-genome sequencing. This clustering approach to TE detection also helps identify elements that have diverged significantly from the parental sequence and cannot be recognized by homology. Sequences are grouped into clusters based on sequence similarity. The shapes of the clusters (De Bruijn graphs) reflect characteristics of the genetic sequences they contain.
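The idea of grouping sequences by similarity can be sketched as follows. This is a simplified stand-in and not RepeatExplorer's own algorithm: it assumes that shared k-mers are an acceptable proxy for read similarity and takes plain connected components as clusters. The function names, the k-mer size, the threshold, and the reads are all hypothetical.

```python
from itertools import combinations


def kmers(seq: str, k: int = 5) -> set:
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two k-mer sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_reads(reads, k: int = 5, min_similarity: float = 0.5):
    """Group reads into clusters: reads are nodes, sufficiently similar
    pairs are edges, and each connected component is one cluster.
    Returns lists of read indices, largest cluster first."""
    parent = list(range(len(reads)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    profiles = [kmers(r, k) for r in reads]
    for i, j in combinations(range(len(reads)), 2):
        if jaccard(profiles[i], profiles[j]) >= min_similarity:
            parent[find(i)] = find(j)  # merge the two components

    clusters = {}
    for i in range(len(reads)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values(), key=len, reverse=True)


# Toy reads: three overlapping copies of one repeat and two of another.
toy_reads = [
    "ACGTACGTTGCAACGT",
    "ACGTACGTTGCAACGA",
    "CGTACGTTGCAACGTT",
    "GGGGCCCCTTTTAAAA",
    "GGGCCCCTTTTAAAAG",
]
toy_clusters = cluster_reads(toy_reads)
print(toy_clusters)
```

Because membership depends only on read-to-read similarity, a diverged copy can still join a cluster through intermediate reads even when it no longer resembles a reference element.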
The figure below shows that about 65% of the sequences have high copy numbers in the Cannabis genome: the top 254 clusters (each represented by a bar in the plot) cover 65% of the genome. We suppose that TEs are present in these high-copy-number regions.
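A cumulative proportion of this kind can be computed as sketched below, assuming the reads are a random sample of the genome so that the share of reads in a cluster approximates the share of the genome it occupies. The counts are invented toy numbers, not the Cannabis data, and the function name is hypothetical.

```python
def cumulative_genome_fraction(cluster_read_counts, total_reads):
    """Cumulative fraction of all reads covered by clusters,
    taken from the largest cluster to the smallest."""
    fractions = []
    running = 0
    for count in sorted(cluster_read_counts, reverse=True):
        running += count
        fractions.append(running / total_reads)
    return fractions


# Toy example: four clusters out of 1000 analysed reads.
toy_counts = [100, 300, 50, 200]
toy_curve = cumulative_genome_fraction(toy_counts, 1000)
print(toy_curve)
```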
We extracted a representative set of sequences from every class of TE cluster present in the Cannabis genome. We calculated pairwise divergence separately for each cluster, among the sequences mapped to that cluster. From the depth of the alignment we can establish the number of copies of each element present in the genome. Together these give us the divergence (calculated as Pi or Theta) between the LTR elements.
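The two divergence statistics and the depth-based copy estimate can be sketched as follows. This is a minimal illustration, not the pipeline used here: it assumes gap-free sequences of equal length that are already aligned, uses the standard per-site definitions of nucleotide diversity (Pi) and Watterson's Theta, and treats copy number as the ratio of depth on the element to an assumed single-copy depth. All names and numbers are hypothetical.

```python
from itertools import combinations


def count_differences(a: str, b: str) -> int:
    """Number of alignment columns at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal aligned length")
    return sum(1 for x, y in zip(a, b) if x != y)


def nucleotide_diversity(seqs) -> float:
    """Pi: mean number of pairwise differences per site."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    total = sum(count_differences(a, b) for a, b in pairs)
    return total / (len(pairs) * length)


def watterson_theta(seqs) -> float:
    """Theta_W per site: S / (a_n * L), with a_n = sum of 1/i for i = 1..n-1."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    segregating = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating / (a_n * length)


def estimate_copy_number(mean_depth_on_element: float, single_copy_depth: float) -> float:
    """Copies per genome, assuming depth scales linearly with copy number."""
    return mean_depth_on_element / single_copy_depth


# Toy cluster: four aligned copies, 10 sites, two variable sites.
toy_alignment = ["ACGTACGTAC", "ACGTACGTAC", "ACGAACGTAC", "ACGAACGTTC"]
pi = nucleotide_diversity(toy_alignment)
theta = watterson_theta(toy_alignment)
copies = estimate_copy_number(mean_depth_on_element=120.0, single_copy_depth=0.5)
print(f"pi = {pi:.4f}, theta = {theta:.4f}, copies = {copies:.0f}")
```

Under neutrality the two statistics estimate the same quantity, so a large gap between them within a cluster is itself informative about how the copies arose.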