Tuesday, October 2, 2012

How Accumulo Compresses Keys and Values

From the Accumulo User mailing list, Keith T said:

There are two levels of compression in Accumulo. First, redundant parts of the key are not stored. If the row in a key is the same as the previous row, then it's not stored again. The same is done for columns and timestamps. After the relative encoding is done, a block of key values is then compressed with gzip.
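To make the two levels concrete, here is a minimal Python sketch of the idea. It is not Accumulo's actual RFile format: the (row, column, timestamp) tuple layout, the use of None as a "same as previous" marker, the JSON serialization, and the function names are all assumptions made for illustration.

```python
import gzip
import json

# Assumption: None marks a field that repeats the previous key's field.
# A real format would use flag bits, not a reserved value.
SAME = None

def relative_encode(entries):
    """Level 1: drop key fields that repeat the previous key.

    entries is a list of ((row, column, timestamp), value) in sorted order.
    """
    encoded = []
    # Sentinels that never compare equal to a real field, so the
    # first key is always stored in full.
    prev = (object(), object(), object())
    for key, value in entries:
        fields = [SAME if cur == old else cur for cur, old in zip(key, prev)]
        encoded.append([fields, value])
        prev = key
    return encoded

def compress_block(entries):
    """Level 2: gzip the relatively encoded block of key values."""
    payload = json.dumps(relative_encode(entries)).encode("utf-8")
    return gzip.compress(payload)

entries = [
    (("row1", "cf:a", 5), "v1"),
    (("row1", "cf:b", 5), "v2"),  # same row and timestamp as the previous key
    (("row2", "cf:b", 7), "v3"),  # same column as the previous key
]
encoded = relative_encode(entries)
block = compress_block(entries)
```

In this sketch the second entry keeps only its column and the third keeps only its row and timestamp; everything else is replaced by the marker before gzip ever sees the block.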





As data is read from an RFile, when the row of a key is the same as the previous key, it will just point to the previous key's row. This is carried forward over the wire. As keys are transferred, duplicate fields in the key are not transferred.
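The read side can be sketched the same way. The snippet below is an illustration only, not Accumulo code: it assumes a hypothetical encoded form in which None stands for "same as the previous key's field", and it shows a decoder reusing the previous key's object instead of making a copy.

```python
def relative_decode(encoded):
    """Rebuild full keys from relatively encoded entries.

    encoded is a list of ([row, column, timestamp], value) where None
    means "same as the previous key". The first key is assumed complete.
    """
    decoded = []
    prev = None
    for fields, value in encoded:
        # A repeated field points at the previous key's object.
        key = tuple(prev[i] if f is None else f for i, f in enumerate(fields))
        decoded.append((key, value))
        prev = key
    return decoded

encoded = [
    (["row1", "cf:a", 5], "v1"),
    ([None, "cf:b", None], "v2"),
    (["row2", None, 7], "v3"),
]
decoded = relative_decode(encoded)
```

Here the row of the second key is the very same object as the row of the first key, which mirrors the "point to the previous key's row" behavior described above.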



The general consensus seemed to favor double compression: compress the values at the application level, and let Accumulo compress as well (the relative encoding followed by gzip).



In support of double compression, Ameet K. said:

I've switched to double compression as per previous posts and it's working nicely. I see about 10-15% more compression over just application level Value compression.
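For readers who want to try double compression, the application-level half might look like the following Python sketch. The helper names and the choice of zlib are assumptions, not something the thread specifies, and the final gzip call is only a stand-in for the block compression Accumulo applies on its own.

```python
import gzip
import zlib

def compress_value(value: bytes) -> bytes:
    """Application-level compression, applied before writing the Value."""
    return zlib.compress(value)

def decompress_value(stored: bytes) -> bytes:
    """Inverse of compress_value, applied after reading the Value."""
    return zlib.decompress(stored)

# Hypothetical payload; real savings depend entirely on the data.
value = b"some application payload " * 50
stored = compress_value(value)

# Stand-in for the second level: a block of stored values gzipped together.
block = gzip.compress(b"".join([stored] * 3))

print(len(value), len(stored), len(block))
```

Any extra gain from the second level will vary with the workload, so it is worth measuring on your own data rather than assuming the figure quoted above.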

