High-order statistical compressor for long-term storage of DNA sequencing data
RAIRO - Operations Research - Recherche Opérationnelle, Volume 50 (2016) no. 2, pp. 351-361.

We present a specialized compressor designed for efficient storage of FASTQ files produced by high-throughput DNA sequencers. Because the method is optimized for compression ratio rather than speed, it is especially suitable for long-term storage and for genome research centers that process huge amounts of data (counted in petabytes). The proposed compressor uses high-order statistical models for range encoding; these resemble Markov models, except that the whole input is considered when building a symbol's context. DNA reads are compressed in an LZ-style scheme using a model of order 5 to 7, while the nucleotides' quality scores are encoded with a 3rd-order model.
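To illustrate why a higher-order context helps on repetitive nucleotide data, the following is a minimal sketch, not the authors' implementation. It assumes a plain adaptive order-k model over the alphabet ACGT with add-one smoothing, and it only reports the ideal code length that a range coder driven by such a model would approach; no actual range coder, LZ-style matching, or whole-input context is implemented. The names `ALPHABET` and `ideal_code_length_bits` are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: reads contain only these four symbols


def ideal_code_length_bits(seq, order):
    """Return the number of bits an ideal entropy coder would spend on `seq`
    when driven by an adaptive order-`order` context model."""
    index = {c: i for i, c in enumerate(ALPHABET)}
    # One frequency table per context, initialised to 1 (add-one smoothing)
    counts = defaultdict(lambda: [1] * len(ALPHABET))
    bits = 0.0
    for pos, sym in enumerate(seq):
        # Context = up to `order` preceding symbols (shorter at the start)
        context = seq[max(0, pos - order):pos]
        freq = counts[context]
        # Cost of the symbol is -log2 of its current estimated probability
        bits -= math.log2(freq[index[sym]] / sum(freq))
        # Update the model after coding, as a decoder would
        freq[index[sym]] += 1
    return bits


if __name__ == "__main__":
    read = "ACGT" * 50
    for k in (0, 3):
        print(k, round(ideal_code_length_bits(read, k), 1))
```

On a strongly periodic input such as the one above, the order-3 model needs far fewer bits than the order-0 model, because each three-symbol context almost always predicts the next symbol. Real reads are much less regular, so the gain there is smaller.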

DOI: 10.1051/ro/2015039
Classification: 68P20, 68P30, 68W32, 92D20
Keywords: High-throughput DNA sequencing, data compression, FASTQ files
Chlopkowski, Marek 1; Antczak, Maciej 1; Slusarczyk, Michal 1; Wdowinski, Aleksander 1; Zajaczkowski, Michal 1; Kasprzak, Marta 1, 2

1 Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland.
2 Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poznan, Poland
@article{RO_2016__50_2_351_0,
     author = {Chlopkowski, Marek and Antczak, Maciej and Slusarczyk, Michal and Wdowinski, Aleksander and Zajaczkowski, Michal and Kasprzak, Marta},
     title = {High-order statistical compressor for long-term storage of {DNA} sequencing data},
     journal = {RAIRO - Operations Research - Recherche Op\'erationnelle},
     pages = {351--361},
     publisher = {EDP-Sciences},
     volume = {50},
     number = {2},
     year = {2016},
     doi = {10.1051/ro/2015039},
     mrnumber = {3479875},
     language = {en},
     url = {http://archive.numdam.org/articles/10.1051/ro/2015039/}
}

