High-order statistical compressor for long-term storage of DNA sequencing data
RAIRO - Operations Research - Recherche Opérationnelle, Volume 50 (2016) no. 2, pp. 351-361.

We present a specialized compressor designed for efficient storage of FASTQ files produced by high-throughput DNA sequencers. Since the method has been optimized for compression quality, it is especially suitable for long-term storage and for genome research centers processing huge amounts of data (counted in petabytes). The proposed compressor uses high-order statistical models for range encoding. These models are similar to Markov models, but the whole input is considered in building a symbol context. DNA reads are compressed in an LZ style using a model of order 5 to 7, while the nucleotide quality scores are encoded with a 3rd-order model.
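To illustrate why a higher-order context model helps a range coder, the following minimal sketch estimates the ideal code length of a sequence under an adaptive order-k model. It is not the authors' implementation: the function name `estimated_bits`, the add-one smoothing, and the restriction of the context to the preceding k symbols are assumptions made for illustration, and the range coder itself is omitted (the returned value is the bit count an ideal arithmetic or range coder would approach).

```python
import math
from collections import defaultdict


def estimated_bits(symbols, order, alphabet):
    """Ideal adaptive code length (in bits) of `symbols` under an
    order-`order` context model. Hypothetical helper for illustration."""
    # counts[context][symbol]: how often `symbol` has followed `context` so far
    counts = defaultdict(lambda: defaultdict(int))
    # totals[context]: how often `context` has been seen so far
    totals = defaultdict(int)
    bits = 0.0
    for i, sym in enumerate(symbols):
        # Context = up to `order` preceding symbols (shorter at the start)
        ctx = symbols[max(0, i - order):i]
        # Add-one smoothing (an assumption) keeps unseen symbols encodable
        p = (counts[ctx][sym] + 1) / (totals[ctx] + len(alphabet))
        bits += -math.log2(p)
        # Update the model only after coding, as a decoder would
        counts[ctx][sym] += 1
        totals[ctx] += 1
    return bits


if __name__ == "__main__":
    reads = "ACGT" * 50  # toy, highly repetitive input
    for k in (0, 3):
        print(k, round(estimated_bits(reads, k, "ACGT"), 1))
```

On repetitive input such as the toy string above, the order-3 estimate is much smaller than the order-0 estimate, because each 3-symbol context predicts its successor almost deterministically once it has been seen a few times.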

DOI: 10.1051/ro/2015039
Classification: 68P20, 68P30, 68W32, 92D20
Keywords: High-throughput DNA sequencing, data compression, FASTQ files
Chlopkowski, Marek 1; Antczak, Maciej 1; Slusarczyk, Michal 1; Wdowinski, Aleksander 1; Zajaczkowski, Michal 1; Kasprzak, Marta 1, 2

1 Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland.
2 Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poznan, Poland.
