THE SATELLITE DNA OF Eimeria TENELLA: A QUANTITATIVE AND QUALITATIVE STUDY
T.J.P. Sobreira1,*, A. Gruber1,**, A.M. Durham2 & The
Eimeria tenella Genome Consortium3
1Faculty of Veterinary Medicine and Zootechny, 2Institute of
Mathematics and Statistics, University of São Paulo, Brazil;
3A full list of authors is available at the web address http://www.lbm.fmvz.usp.br/Eimeria/consortium/members.html
*CNPq fellow; **argruber@usp.br
The Eimeria tenella genome has a complexity
of 58 Mb, distributed in 14 chromosomes ranging from 1 to more than 7 Mb. One of
the most interesting features of this genome is its very high tandem repeat
content, with the triplet (GCA) and heptamer (TTTAGGG) repeats constituting the
predominant repetitive units. Genome sequencing started in 2002 and a
set of circa 800,000 shotgun reads was generated at the Wellcome Trust Sanger
Institute, and made available on the internet at the address ftp://ftp.sanger.ac.uk/pub/pathogens/Eimeria/tenella.
Aiming to characterize and quantify the whole satellite content of the E.
tenella genome, our group in Brazil developed TRAP, the Tandem Repeat Analysis
Program (Sobreira, T.J.P.; Durham, A.M. & Gruber, A., manuscript in
preparation). TRAP is a companion tool for Tandem Repeats Finder (TRF; Benson, 1999),
a widely used application for ab initio tandem repeat finding. The
program provides a unified set of analyses for the selection, classification and
quantification of tandemly repeated sequences. The E. tenella genome assembly
file (version of May 24, 2005) was downloaded from the Sanger Institute FTP site and
processed with TRF version 3.21. TRF output files were analyzed with TRAP, selecting
repeat loci with at least two repeat units, a minimum repeat period of 2 bp and
a maximum period of 1,000 bp. The repetitive content of the genome was
calculated using different identity percentages (id%), where id% values
represent the overall percentage of matches between adjacent repeat units. The
whole-genome satellite content varied from 1.9% at 100% identity to 16.8% at 70%
identity. This latter figure comprised 9.0% microsatellite (repeat
period size of 2-6 bp), 7.5% minisatellite (period size of 7-100 bp) and 0.3%
to satellite (period size longer than 100 bp) sequences. The five most prevalent
repeat units were GCA, TTTAGGG, TAAA, GCTA and AAATT, with the first two units
corresponding to 8.8% of the genome content. A complete catalogue of the repeat
units and statistics will be publicly available on the internet upon publication
of the corresponding paper. In the meantime, the authors can provide a username
and password upon request.
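
The selection and classification criteria described above can be illustrated
with a minimal Python sketch. It is not TRAP itself, whose implementation is
not described here. The RepeatLocus record, the function names and the toy loci
are hypothetical. The sketch also assumes that the id% threshold is applied as
a minimum per-locus identity and that loci do not overlap, so that their
lengths can simply be summed.

```python
from dataclasses import dataclass


@dataclass
class RepeatLocus:
    """Hypothetical record for one tandem repeat locus (not TRAP's own format)."""
    start: int       # 1-based, inclusive
    end: int         # 1-based, inclusive
    period: int      # repeat unit length in bp
    copies: float    # number of repeat units
    identity: float  # percent matches between adjacent units


def classify(period):
    """Class boundaries as given in the text: 2-6, 7-100 and >100 bp."""
    if 2 <= period <= 6:
        return "microsatellite"
    if 7 <= period <= 100:
        return "minisatellite"
    if period > 100:
        return "satellite"
    return None  # period of 1 bp falls outside the selection


def select(loci, min_identity, min_copies=2.0, min_period=2, max_period=1000):
    """Keep loci meeting the copy number, period and identity criteria."""
    return [
        locus for locus in loci
        if locus.copies >= min_copies
        and min_period <= locus.period <= max_period
        and locus.identity >= min_identity
    ]


def content_by_class(loci, genome_length, min_identity):
    """Percentage of the genome covered by each class (assumes no overlaps)."""
    totals = {"microsatellite": 0, "minisatellite": 0, "satellite": 0}
    for locus in select(loci, min_identity):
        totals[classify(locus.period)] += locus.end - locus.start + 1
    return {name: 100.0 * bp / genome_length for name, bp in totals.items()}


# Toy data on an imaginary 1,000 bp sequence (illustrative values only).
toy_loci = [
    RepeatLocus(1, 30, 3, 10.0, 100.0),     # microsatellite, perfect
    RepeatLocus(101, 170, 7, 10.0, 80.0),   # minisatellite
    RepeatLocus(201, 500, 150, 2.0, 75.0),  # satellite
    RepeatLocus(601, 650, 1, 50.0, 100.0),  # rejected: period below 2 bp
    RepeatLocus(701, 720, 4, 1.5, 100.0),   # rejected: fewer than two units
    RepeatLocus(801, 820, 2, 10.0, 60.0),   # rejected at a 70% threshold
]

at_70 = content_by_class(toy_loci, 1000, 70.0)
at_100 = content_by_class(toy_loci, 1000, 100.0)
```

As in the reported results, lowering the identity threshold admits more loci,
so the total content at 70% identity is at least as large as at 100% identity.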