Fast lossless compression via cascading Bloom filters

BARCODE: Fast lossless compression via cascading Bloom filters
Roye Rozov, Ron Shamir, Eran Halperin

[Figure: BARCODE compression, decompression, and cascade diagrams; the original layout is not recoverable. Panel titles: BARCODE Intuition; Background – read compression; Bloom Filters (BFs). Recoverable labels: l; R, the reads; G, the reference genome; repeated reads; unique reads; missed unique reads; "query G against B1"; FP; FN; decompression steps "query G against B", "Accepted, not in FP", "Initialize R to FN"; cascade elements Q(R’), B4, P2, P3, FP1 to FP4. Bloom filter image source: Bloom Filter, http://en.wikipedia.org/wiki/Bloom_filter]

Cascading BFs

• To reduce the size of FP, the BF loading and querying steps are repeated
• Each BF is loaded with the false positives relative to the previous BF's accepts
• The size of FP drops off exponentially
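The drop-off can be made concrete with a rough estimate. This is only a sketch under assumptions the poster does not state: every filter has the same false-positive rate p, W is the number of genome windows that are not unique reads, and filter B_{i+1} is loaded with FP_i and then queried with the set that was loaded into B_i (R’ for B2, FP1 for B3, FP2 for B4), as the Q(R’), Q(FP1), Q(FP2) figure labels suggest.

```latex
\begin{align*}
\mathbb{E}\,|FP_1| &\approx p\,W, &
\mathbb{E}\,|FP_2| &\approx p\,|R'|, \\
\mathbb{E}\,|FP_3| &\approx p\,|FP_1| \approx p^{2}\,W, &
\mathbb{E}\,|FP_4| &\approx p\,|FP_2| \approx p^{2}\,|R'|, \\
\mathbb{E}\,|FP_{i+2}| &\approx p\,|FP_i|.
\end{align*}
```

Under these assumptions each further pair of filters shrinks the set that must be stored explicitly by roughly a factor of p, which is one way to read the exponential drop-off.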


Encoding

• Queries are drawn from the genome instead of all 4^l possible sequences
• Repeated reads are not hashed, which preserves their multiplicities
• False positives and false negatives are stored to allow lossless decoding
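The encoding steps above might be sketched as follows. This is a minimal illustration, not the authors' implementation: `TinyFilter`, `genome_windows`, `encode`, and `num_bits` are hypothetical names, the one-hash CRC32 filter is a stand-in for a real Bloom filter, and reverse complements and compression of the outputs are ignored.

```python
import zlib
from collections import Counter


class TinyFilter:
    """Stand-in one-hash Bloom filter that records only the set bit positions."""

    def __init__(self, num_bits):
        self.num_bits = num_bits
        self.bits = set()

    def _position(self, read):
        return zlib.crc32(read.encode("ascii")) % self.num_bits

    def add(self, read):
        self.bits.add(self._position(read))

    def __contains__(self, read):
        return self._position(read) in self.bits


def genome_windows(genome, read_len):
    """Yield every substring of length read_len from the reference."""
    for start in range(len(genome) - read_len + 1):
        yield genome[start:start + read_len]


def encode(reads, genome, read_len, num_bits=64):
    """Return (filter, false positives, false negatives, repeated reads)."""
    counts = Counter(reads)
    repeats = {r: c for r, c in counts.items() if c > 1}  # not hashed
    unique = {r for r, c in counts.items() if c == 1}     # R', the unique reads

    bloom = TinyFilter(num_bits)
    for read in unique:
        bloom.add(read)

    # Query the filter with genome windows rather than all 4^l sequences.
    accepted = {w for w in genome_windows(genome, read_len) if w in bloom}
    false_pos = accepted - unique   # accepted windows that are not unique reads
    false_neg = unique - accepted   # unique reads the genome scan never yields
    return bloom, false_pos, false_neg, repeats
```

A small `num_bits` makes collisions, and therefore false positives, likely; the stored `false_pos` set is what lets a decoder discard them.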

[Figure residue. Panel titles: Results; Decoding. Recoverable labels: "hash reads to B1"; R’, the unique reads; B; FP.]

Bonfield JK, Mahoney MV (2013) Compression of FASTQ and SAM Format Sequencing Data. PLoS ONE 8(3): e59190. doi:10.1371/journal.pone.0059190

Bloom Filters (BFs)

• Probabilistic hash structure supporting insertion and membership queries
• Keys are not stored explicitly, giving a compact set representation
• Queries can return false positives but never false negatives

“Do (1) or do not (0). There is no try.”
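A minimal Bloom filter sketch illustrating these properties is given below. The class name, the parameters `num_bits` and `num_hashes`, and the use of salted BLAKE2b digests as hash functions are assumptions made for illustration; the poster does not say which hash functions or sizes BARCODE uses.

```python
import hashlib


class BloomFilter:
    """Bit array plus several hash functions; keys themselves are never stored."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # One salted digest per hash function (illustrative choice).
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                key.encode("utf-8"),
                digest_size=8,
                salt=seed.to_bytes(8, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, key):
        """Set the bit at every hash position of the key."""
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        """True if all bits are set: the key was inserted, or is a false positive."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )
```

Every inserted key is always reported as present, since `add` sets exactly the bits that `__contains__` checks; a key that was never inserted is reported as present only if all of its bits happen to have been set by other keys.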

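[Figure: encoding and cascade diagram; the original layout is not recoverable. Panel title: Encoding. Recoverable labels: Compress; R’, the unique reads; filters B1 to B4; P1; FN; Q(FP1); Bloom filter bit array "Reads BF 0 0 1 0 1 . . . . 1 0".]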

4^l queries (one per possible sequence of length l): AA…A? AA…C? … TT…T?

Background – read compression

• Compression tools are either reference-free or reference-based
• Reference-based tools compress better, but they rely on aligning the reads to the reference, which leads to long run times


Highlights

• BARCODE: Bloom filter Alignment-free Reference-based Compression and Decompression
• Lossless compression of read sequences
• Whole reads are hashed to Bloom filters
• Up to 9x faster than alignment-based tools, with similar compression
• Better compression than reference-free methods

BARCODE Intuition

• Whole reads are hashed to a BF
• The BF then acts as a read oracle
• The oracle can be queried for the reads present in the data
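Using the filter as a read oracle naively would mean asking it about every possible sequence of the read length. The short calculation below compares that count with the number of windows in a reference. The read length of 100 and the reference length of about 63 million bases are illustrative assumptions, not values taken from the poster, and the function names are hypothetical.

```python
def candidate_queries(read_len):
    """Number of distinct DNA sequences of length read_len (alphabet A, C, G, T)."""
    return 4 ** read_len


def genome_queries(genome_len, read_len):
    """Number of length-read_len windows on one strand of the reference."""
    return max(genome_len - read_len + 1, 0)


if __name__ == "__main__":
    read_len = 100            # assumed read length
    genome_len = 63_000_000   # assumed, approximate reference length
    print(f"all sequences:  {candidate_queries(read_len):.2e}")
    print(f"genome windows: {genome_queries(genome_len, read_len):.2e}")
```

With these assumed numbers the exhaustive query count is on the order of 10^60, while the reference offers tens of millions of windows, which is why drawing queries from a reference makes the oracle usable.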

[Figure residue. Panel titles: Cascading BFs; Highlights. Recoverable label: Q(G).]

Decoding

• The genome scan is repeated for decoding
• Repeated reads and false negatives are added to the reconstruction immediately
• Reads accepted by the BF and not in FP are also added
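The decoding steps above could look roughly like this. It is a sketch under assumptions: `decode` and its parameters are hypothetical names, `read_filter` is any object supporting `in` (a Bloom filter in practice), reads are recovered as a multiset without their original order, and reverse complements are ignored.

```python
from collections import Counter


def decode(genome, read_len, read_filter, false_pos, false_neg, repeats):
    """Rebuild the multiset of reads from the filter and the stored sets."""
    reads = Counter()

    # Repeat the genome scan: keep windows the filter accepts unless they
    # were recorded as false positives during encoding.
    for start in range(len(genome) - read_len + 1):
        window = genome[start:start + read_len]
        if window in read_filter and window not in false_pos:
            reads[window] = 1

    # False negatives and repeated reads were stored explicitly.
    for read in false_neg:
        reads[read] = 1
    for read, count in repeats.items():
        reads[read] = count
    return reads


if __name__ == "__main__":
    # A plain set stands in for the Bloom filter; "GTAC" plays the role of a
    # false positive that the filter accepts although it was never a read.
    stand_in_filter = {"CGTA", "TTTT", "GTAC"}
    result = decode("ACGTAC", 4, stand_in_filter,
                    false_pos={"GTAC"}, false_neg={"TTTT"},
                    repeats={"ACGT": 2})
    print(sorted(result.elements()))
```

Accepting any container for `read_filter` keeps the sketch independent of a particular Bloom filter implementation.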

Results

• Reads with random errors were sampled from hg19 chr20
• Coverage was varied from 10x to 50x
• BARCODE demonstrated a superior balance of speed and compression

Acknowledgements

RS was supported in part by the Israel Science Foundation (grant 317/13) and by the Raymond and Beverly Sackler chair in bioinformatics. RR was supported in part by a fellowship from the Edmond J. Safra Center for Bioinformatics at Tel Aviv University, and by the Center for Absorption in Science, the Ministry of Immigrant Absorption in Israel. EH is a faculty fellow of the Edmond J. Safra Center for Bioinformatics at Tel Aviv University. EH and RR were partially supported by the Israel Science Foundation (grant 1425/13). EH was also partially supported by National Science Foundation grant III1217615.