From: "Ausum Studio" <ausum_studio@hotmail.com>
----- Original Message -----
From: "Jonathan Hobbs" <toolkit@magma.ca>
To: "Ausum Studio" <ausum_studio@hotmail.com>; <zope@zope.org>
Sent: Tuesday, July 06, 2004 7:32 AM
(...)
Our current zodb size is about 7Gb and we are still investigating various methods for improving update speed and retrieval speed (we're not Google, so throwing 100,000 cpus at the problem is not an option!). How much data are you trying to store that would cause you to want to get into compression?
Hi, Jonathan. I just want to deliver a read-only CD catalog. Your approach seems to make it possible to get rid of the real objects and keep a compressed ZCatalog; both things save space. Interesting. Would you share the core lines of your external method?
Here are the relevant source code bits (they may be a bit cryptic as I just ripped them from the various routines):

*** Source Code extracts from ZCatalog update routine ***

<snip>
from cStringIO import StringIO
import gzip
</snip>

<snip>
# all of the text fields are stored in the child record as a compressed field (to save
# space and allow the field to be included in meta data)
displaytext = compresstext(fielddata)

# we still need an uncompressed version of the text data for the zcatalog indexer to use
fulltext = fielddata
</snip>

<snip>
# add new object to BTreeFolder2 folder (WebSitesChildData)
newobj = self.WebSitesChildData.manage_addProduct['SWV2'].WebSites.createInObjectManager(REQUEST['id'], REQUEST)

# add fields to new object
newobj.propertysheets.WebSitesPS.manage_editProperties({
    'title': title,
    'displaytext': displaytext,
    'company_text': company_text,
    'rating': rating,
    'rating_count': rating_count,
    'createdby': createdby,
    'date': date,
    'last_modified_date': date,
    'master_recid': master_recid})

# zclass is NOT 'catalog aware' (catalog aware instances require 2 catalog updates: one when the
# object is added to the folder - the automatic update - and a second manual 'reindex' after the
# fields are added to the object). The 'non-aware' approach only requires one catalog update
self.Catalog.catalog_object(newobj, REQUEST['id'])
</snip>

<snip>
# compresstext.py
#
# Returns concatenated fields that have been compressed to save storage space and allow the
# 'comptext' field to be held in the meta data table (for fast access at retrieval time)
#
# BIG NOTE: tres important eh
#
# zope seems to 'misplace' chr(13) characters from time to time, so we need to
# replace all chr(13) with something we can fix at decompression time.
# This routine replaces all chr(13) characters with the string '*fixme*'

def compresstext(buf):
    zbuf = StringIO()
    zfile = gzip.GzipFile(mode='w', fileobj=zbuf, compresslevel=9)
    zfile.write(buf)
    zfile.close()
    return replace(zbuf.getvalue())


# replace.py
#
# replaces all occurrences of chr(13) with '*fixme*'.
# Fix to get around a problem of zope corrupting compressed data strings when they
# are stored in property fields on objects (zope seems to lose some chr(13) bytes)
#
# Note: string.replace and re (regular expression) substitution don't work
# on binary strings (binary strings can have embedded null chars which cause
# them to barf), so this binary replace routine was necessary

def replace(instr):
    newstr = ''
    pos = 0
    while pos < len(instr):
        if ord(instr[pos:pos+1]) == 13:
            newstr += '*fixme*'
        else:
            newstr += instr[pos:pos+1]
        pos += 1
    return newstr
</snip>

*** Source Code extract from the external method which does the zcatalog search and decompresses the metadata (we use an external method instead of a dtml method or python script, and we don't use zpt) ***

<snip>
from cStringIO import StringIO
import gzip
import jhgzip
</snip>

<snip>
results = self.Catalog.searchResults(searchdict)
for rec in results:
</snip>

<snip>
try:
    # try uncompressing with CRC enabled
    fulltext = uncompresstextCRC(rec.displaytext)
except:
    # here if error during decompression, so try decompressing with CRC error checking turned off
    try:
        fulltext = uncompresstext(rec.displaytext)
    except:
        # died due to some unknown decompression error
        fulltext = 'description not available'
</snip>

<snip>
# uncompresstextCRC
#
# Returns concatenated fields that have been compressed to save storage space and allow the
# 'comptext' field to be held in the meta data table (for fast access at retrieval time)
#
# NOTE: this version of uncompresstext uses the original/unmodified gzip.py module and
# will return CRC errors
#
# IMPORTANT:
#
# the compresstext.py routine replaces all occurrences of the chr(13) character with the
# string
# '*fixme*'. This is done because zope sometimes misplaces the chr(13) character
# when it is stored/retrieved from the zodb. Therefore we must replace the substituted string
# with the original character string.

def uncompresstextCRC(buf):
    buf = replacefixme(buf)
    zbuf = StringIO(buf)
    zfile = gzip.GzipFile(mode='rb', fileobj=zbuf)
    obuf = ''
    while 1:
        chunk = zfile.read()
        if not chunk:
            break
        obuf = obuf + chunk
    zfile.close()
    return obuf


# uncompresstext
#
# Returns concatenated fields that have been compressed to save storage space and allow the
# 'comptext' field to be held in the meta data table (for fast access at retrieval time)
#
# NOTE: this version of uncompresstext uses a modified version of gzip which ignores CRC errors
#
# IMPORTANT:
#
# the compresstext.py routine replaces all occurrences of the chr(13) character with the
# string '*fixme*'. This is done because zope sometimes misplaces the chr(13) character
# when it is stored/retrieved from the zodb. Therefore we must replace the substituted string
# with the original character string.

def uncompresstext(buf):
    buf = replacefixme(buf)
    zbuf = StringIO(buf)
    zfile = jhgzip.GzipFile(mode='rb', fileobj=zbuf)
    obuf = ''
    while 1:
        chunk = zfile.read()
        if not chunk:
            break
        obuf = obuf + chunk
    zfile.close()
    return obuf


# replacefixme.py
#
# replaces all occurrences of the string '*fixme*' with chr(13).
# Fix to get around a problem of zope corrupting compressed data strings when they
# are stored in property fields on objects.
#
# Note: string.replace and re (regular expression) substitution don't work
# on binary strings (binary strings can have embedded null chars which cause
# them to barf), so this binary replace routine was necessary

def replacefixme(instr):
    newstr = ''
    pos = 0
    while pos < len(instr):
        currchar = ord(instr[pos:pos+1])
        # check to see if we have found a '*' character
        if currchar == 42:
            # found a '*', so check for '*fixme*'
            chk = currchar + ord(instr[pos+1:pos+2]) \
                + ord(instr[pos+2:pos+3]) \
                + ord(instr[pos+3:pos+4]) \
                + ord(instr[pos+4:pos+5]) \
                + ord(instr[pos+5:pos+6]) \
                + ord(instr[pos+6:pos+7])
            if (chk == 621) and (ord(instr[pos+6:pos+7]) == 42):
                # found '*fixme*', so insert the missing chr(13) character
                newstr = newstr + chr(13)
                # don't add the '*fixme*' characters to the new string
                pos = pos + 7
            else:
                # '*fixme*' not found, so add the current character to the new string
                newstr = newstr + chr(currchar)
                pos = pos + 1
        else:
            # '*' character not found, so add the current character to the new string
            newstr = newstr + chr(currchar)
            pos = pos + 1
    return newstr
</snip>

The modifications to gzip.py (called jhgzip in the above code) are as follows:

# standard gzip _read_eof with crc checking commented out
# (some CRC errors can be ignored)
def _read_eof(self):
    # We've read to the end of the file, so we have to rewind in order
    # to reread the 8 bytes containing the CRC and the file size.
    # We check that the computed CRC and size of the
    # uncompressed data match the stored values.
    self.fileobj.seek(-8, 1)
    crc32 = read32(self.fileobj)
    isize = read32(self.fileobj)
    # if crc32%0x100000000L != self.crc%0x100000000L:
    #     raise ValueError, "CRC check failed"
    # elif isize != self.size:
    #     raise ValueError, "Incorrect length of data produced"

HTH, Good Luck!

Jonathan
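For anyone reading this on a current Python, here is a minimal sketch of the same compress-then-escape round trip in Python 3 (the function names mirror Jonathan's but this is illustrative, not his code; on Python 3, bytes.replace handles binary data with embedded NULs, so the hand-rolled loops are no longer needed):

```python
import gzip

SENTINEL = b"*fixme*"  # stand-in for the escaped chr(13) byte

def compresstext(text):
    """Gzip-compress text, then escape every 0x0d byte with a sentinel."""
    zdata = gzip.compress(text.encode("utf-8"), compresslevel=9)
    # caveat: if the compressed stream itself ever contained the literal
    # sentinel bytes, the round trip would be ambiguous; a base64 or
    # length-prefixed encoding avoids that corner case entirely
    return zdata.replace(b"\x0d", SENTINEL)

def uncompresstext(buf):
    """Undo the sentinel escape, then gunzip (gzip verifies the CRC)."""
    return gzip.decompress(buf.replace(SENTINEL, b"\x0d")).decode("utf-8")
```

The escaped blob contains no 0x0d bytes at all, which is the whole point of the trick: whatever was eating chr(13) never sees one.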
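A side note on replacefixme: the summed-ordinal check (chk == 621) can misfire, since any seven characters starting and ending with '*' whose codes sum to 621 (a rearrangement such as '*fmxie*', for instance) would be treated as the sentinel. A direct slice comparison avoids that; a Python 3 sketch:

```python
def replacefixme(instr: bytes) -> bytes:
    """Restore chr(13) bytes from the '*fixme*' sentinel.

    Compares the 7-byte slice directly instead of summing character
    codes, so look-alike sequences with the same code sum are not
    mistaken for the sentinel.
    """
    out = bytearray()
    pos = 0
    while pos < len(instr):
        if instr[pos:pos + 7] == b"*fixme*":
            out.append(13)   # re-insert the escaped chr(13)
            pos += 7
        else:
            out.append(instr[pos])
            pos += 1
    return bytes(out)
```

On Python 3 the whole loop is equivalent to the one-liner `instr.replace(b"*fixme*", b"\r")`, since bytes.replace also scans left to right over non-overlapping matches.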
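The jhgzip patch predates today's zlib conveniences; on a current Python you can get the same "ignore the CRC trailer" effect without patching gzip.py, by stripping the gzip header and inflating the raw deflate stream. This sketch assumes a plain 10-byte header with no optional fields (FNAME, FEXTRA, etc.), which is what gzip.compress() emits:

```python
import gzip
import zlib

def uncompress_ignoring_crc(gz_bytes):
    """Inflate a gzip stream while skipping the CRC/length trailer check.

    wbits=-MAX_WBITS selects a raw deflate stream, so zlib never looks
    at -- or verifies -- the 8-byte gzip trailer; the trailer bytes
    simply end up in the decompressor's unused_data.
    """
    return zlib.decompressobj(-zlib.MAX_WBITS).decompress(gz_bytes[10:])
```

A blob with a flipped trailer byte makes gzip.decompress raise, while the raw inflate still recovers the payload, which is exactly the failure mode the commented-out checks in jhgzip were working around.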