[Zope] Any working experiences in using compression (for storage)?

Tue Jul 6 11:30:21 EDT 2004

From: "Ausum Studio" <ausum_studio at hotmail.com>
> ----- Original Message -----
> From: "Jonathan Hobbs" <toolkit at magma.ca>
> To: "Ausum Studio" <ausum_studio at hotmail.com>; <zope at zope.org>
> Sent: Tuesday, July 06, 2004 7:32 AM
> >
> > (...)
> >
> > Our current zodb size is about 7Gb and we are still investigating various
> > methods for improving update speed and retrieval speed (we're not Google,
> > so throwing 100,000 cpus at the problem is not an option!).  How much data
> > are you trying to store that would cause you to want to get into
> > compression?
>
> Hi, Jonathan. I just want to deliver a read only CD catalog.
> Your approach seems to make it possible to get rid of the real objects and
> to have a compressed zcatalog. Both things save space. Interesting. Would
> you share the core lines of your external method?

Here are the relevant source code bits (it may be a bit cryptic as I just
ripped it from the various routines):

*** Source Code extracts from ZCatalog update routine ***

<snip>
from cStringIO import StringIO
import gzip
</snip>

<snip>
# all of the text fields are stored in the child record as a compressed field
# (to save space and allow the field to be included in meta data)

displaytext = compresstext(fielddata)

# we still need an uncompressed version of the text data for the zcatalog
# indexer to use
fulltext = fielddata
</snip>

<snip>
#add new object to BTreeFolder2 folder (WebSitesChildData)

newobj = self.WebSitesChildData.manage_addProduct['SWV2'].WebSites.createInObjectManager(
        REQUEST['id'], REQUEST)

# add fields to new object

newobj.propertysheets.WebSitesPS.manage_editProperties({
                        'title' : title,
                        'displaytext' : displaytext,
                        'company_text' : company_text,
                        'rating' : rating,
                        'rating_count' : rating_count,
                        'createdby' : createdby,
                        'date' : date,
                        'last_modified_date' : date,
                        'master_recid' : master_recid})

# zclass is NOT 'catalog aware' (catalog aware instances require 2 catalog
# updates: one when object is added to folder - the automatic update - and a
# second manual 'reindex' after the fields are added to the object). The
# 'non-aware' approach only requires one catalog update

self.Catalog.catalog_object(newobj, REQUEST['id'])
</snip>

<snip>
# compresstext.py
#
# Returns concatenated fields that have been compressed to save storage space
# and allow 'comptext' field to be held in meta data table (for fast access at
# retrieval time)
#
# BIG NOTE: tres important eh
#
#       zope seems to 'misplace' chr(13) characters from time to time, so we
#       need to replace all chr(13) with something we can fix at decompression
#       time. This routine replaces all chr(13) characters with the string
#       '*fixme*'

def compresstext(buf):

        zbuf = StringIO()
        zfile = gzip.GzipFile(mode = 'w', fileobj = zbuf, compresslevel = 9)
        zfile.write(buf)
        zfile.close()
        return replace(zbuf.getvalue())

# replace.py
#
# replaces all occurrences of chr(13) with '*fixme*'.
# Fix to get around problem of zope corrupting compressed data strings when
# they are stored in property fields on objects (zope seems to lose some
# chr(13) bytes)
#
# Note: string.replace and re (regular expressions) substitution don't work
# on binary strings (binary strings can have embedded null chars which cause
# them to barf), so this binary replace routine was necessary

def replace(instr):

        newstr = ''
        pos = 0
        while pos < len(instr):
                if  ord(instr[pos:pos+1]) == 13:
                        newstr += '*fixme*'
                else:
                        newstr += instr[pos:pos+1]
                pos += 1
        return newstr
</snip>
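
For readers on a current Python 3, the compression side above can be sketched
more compactly. This is only an illustration under stated assumptions, not the
code from the original routines: the names compress_text, CR and MARKER are
made up here, and it relies on bytes.replace, which in Python 3 treats embedded
NUL bytes as ordinary data, so no hand-written loop is needed.

```python
import gzip

CR = b'\r'            # chr(13), the byte that reportedly goes missing
MARKER = b'*fixme*'   # placeholder written in its place (same string as above)

def compress_text(buf):
    """gzip-compress buf (bytes) and escape every CR byte in the result."""
    compressed = gzip.compress(buf, compresslevel=9)
    # bytes.replace is binary-safe: NUL bytes are handled like any other byte
    return compressed.replace(CR, MARKER)
```

The escaped value contains no chr(13) bytes, so it can be stored where those
bytes are at risk. The scheme is not strictly reversible if the compressed
stream happens to contain the seven bytes '*fixme*' on its own, which is
unlikely but possible.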

*** Source Code extract from external method which does zcatalog search and
decompresses metadata (we use an external method instead of a dtml method or
python script and we don't use zpt) ***

<snip>
from cStringIO import StringIO
import gzip
import jhgzip
</snip>

<snip>
 results = self.Catalog.searchResults(searchdict)
 for rec in results:
</snip>

<snip>
try:
        # try uncompressing with CRC enabled
        fulltext = uncompresstextCRC(rec.displaytext)
except:
        # here if error during decompression, so try decompressing with CRC
        # error checking turned off
        try:
                fulltext = uncompresstext(rec.displaytext)
        except:
                # died due to some unknown decompression error
                fulltext = 'description not available'
</snip>
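
The nested try/except above amounts to "try each decoder in turn and fall back
to a default". A minimal Python 3 sketch of that pattern follows; the helper
name decode_with_fallback and its parameters are hypothetical, and
gzip.decompress stands in for the strict (CRC-checking) routine.

```python
import gzip

def decode_with_fallback(data, decoders, default=b'description not available'):
    """Return the result of the first decoder that succeeds, else default."""
    for decode in decoders:
        try:
            return decode(data)
        except Exception:
            # this decoder failed, so move on to the next (more lenient) one
            continue
    return default

blob = gzip.compress(b'some catalog text')
print(decode_with_fallback(blob, [gzip.decompress]))         # strict decoder succeeds
print(decode_with_fallback(b'not gzip', [gzip.decompress]))  # falls through to default
```

Listing the strict decoder first keeps the CRC check in force for every record
that passes it; the lenient decoder is only consulted for the ones that fail.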

<snip>
# uncompresstextCRC
#
# Decompresses the concatenated fields that were compressed to save storage
# space and allow 'comptext' field to be held in meta data table (for fast
# access at retrieval time), and returns the original text
#
# NOTE: this version of uncompresstext uses the original/unmodified gzip.py
# module and will raise CRC errors
#
# IMPORTANT:
#
# the compresstext.py routine replaces all occurrences of the chr(13)
# character with the string '*fixme*'.  This is done because zope sometimes
# misplaces the chr(13) character when it is stored/retrieved from the zodb.
# Therefore we must replace the substituted string with the original
# character before decompressing.

def uncompresstextCRC(buf):

        buf = replacefixme(buf)

        zbuf = StringIO(buf)
        zfile= gzip.GzipFile(mode = 'rb', fileobj = zbuf)
        obuf = ''
        while 1:
                chunk = zfile.read()
                if not chunk:
                        break
                obuf = obuf + chunk
        zfile.close()

        return obuf

# uncompresstext
#
# Decompresses the concatenated fields that were compressed to save storage
# space and allow 'comptext' field to be held in meta data table (for fast
# access at retrieval time), and returns the original text
#
# NOTE: this version of uncompresstext uses a modified version of gzip which
# ignores CRC errors
#
# IMPORTANT:
#
# the compresstext.py routine replaces all occurrences of the chr(13)
# character with the string '*fixme*'.  This is done because zope sometimes
# misplaces the chr(13) character when it is stored/retrieved from the zodb.
# Therefore we must replace the substituted string with the original
# character before decompressing.

def uncompresstext(buf):

        buf = replacefixme(buf)

        zbuf = StringIO(buf)
        zfile= jhgzip.GzipFile(mode = 'rb', fileobj = zbuf)
        obuf = ''
        while 1:
                chunk = zfile.read()
                if not chunk:
                        break
                obuf = obuf + chunk
        zfile.close()

        return obuf

# replacefixme.py
#
# replaces all occurrences of the string '*fixme*' with chr(13).
# Fix to get around problem of zope corrupting compressed data strings when
# they are stored in property fields on objects.
#
# Note: string.replace and re (regular expressions) substitution don't work
# on binary strings (binary strings can have embedded null chars which cause
# them to barf), so this binary replace routine was necessary

def replacefixme(instr):

        newstr = ''
        pos = 0

        while pos < len(instr):
                currchar = ord(instr[pos:pos+1])

                # check to see if we have found '*fixme*' at this position
                # (comparing the 7-character slice is exact and cannot run
                # off the end of the string, unlike summing ord() values)
                if currchar == 42 and instr[pos:pos+7] == '*fixme*':
                        # found '*fixme*', so insert missing chr(13) character
                        newstr = newstr + chr(13)
                        # don't add '*fixme*' characters to new string
                        pos = pos + 7
                else:
                        # '*fixme*' not found, so add current character to
                        # new string
                        newstr = newstr + chr(currchar)
                        pos = pos + 1

        return newstr
</snip>
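
On Python 3 the decoding side reduces to a few lines. The sketch below is an
illustration only: uncompress_text and MARKER are names invented here, and
gzip.decompress plays the role of the strict, CRC-checking variant.

```python
import gzip

MARKER = b'*fixme*'   # same placeholder the compression side writes

def uncompress_text(buf):
    """Undo the chr(13) escaping, then gunzip (gzip verifies the CRC)."""
    restored = buf.replace(MARKER, b'\r')
    return gzip.decompress(restored)
```

The order matters: the placeholder has to be turned back into chr(13) before
decompression, because the gzip stream is only valid once its original bytes
are restored.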

The modifications to gzip.py (called jhgzip in the above code) are as
follows:

# standard gzip _read_eof with crc checking commented out
# (some CRC errors can be ignored)

    def _read_eof(self):
        # We've read to the end of the file, so we have to rewind in order
        # to reread the 8 bytes containing the CRC and the file size.
        # The stock gzip.py then checks that the computed CRC and size of
        # the uncompressed data match the stored values; here those checks
        # are commented out.
        self.fileobj.seek(-8, 1)
        crc32 = read32(self.fileobj)
        isize = read32(self.fileobj)
#        if crc32%0x100000000L != self.crc%0x100000000L:
#            raise ValueError, "CRC check failed"
#        elif isize != self.size:
#            raise ValueError, "Incorrect length of data produced"
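
With a current Python 3, a similar effect is possible without keeping a
patched copy of gzip.py. The sketch below is an assumption-laden alternative,
not the jhgzip module: the function name gunzip_ignoring_crc is hypothetical,
and it handles a single gzip member only. It skips the gzip header by hand and
feeds the rest to zlib as a raw deflate stream, so the CRC and size trailer
are never examined.

```python
import gzip
import struct
import zlib

def gunzip_ignoring_crc(data):
    """Decompress one gzip member without verifying its CRC-32/size trailer."""
    if data[:2] != b'\x1f\x8b' or data[2] != 8:
        raise ValueError('not a deflate-compressed gzip stream')
    flags = data[3]
    pos = 10                          # fixed-size part of the gzip header
    if flags & 0x04:                  # FEXTRA: 2-byte length, then payload
        (xlen,) = struct.unpack('<H', data[pos:pos + 2])
        pos += 2 + xlen
    if flags & 0x08:                  # FNAME: zero-terminated file name
        pos = data.index(b'\x00', pos) + 1
    if flags & 0x10:                  # FCOMMENT: zero-terminated comment
        pos = data.index(b'\x00', pos) + 1
    if flags & 0x02:                  # FHCRC: 2-byte header checksum
        pos += 2
    # wbits=-15 means "raw deflate"; the 8 trailer bytes are left unread
    return zlib.decompressobj(-15).decompress(data[pos:])

# usage: damage the stored CRC, then decode anyway
blob = bytearray(gzip.compress(b'hello\r\nworld\r\n' * 50))
blob[-8] ^= 0xFF
text = gunzip_ignoring_crc(bytes(blob))
```

As with the patched _read_eof, skipping the check means that genuinely
corrupted data can come back without any error being raised.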

HTH,  Good Luck!

Jonathan