[Zope] Any working experiences in using compression (for storage)?
Jonathan Hobbs
toolkit at magma.ca
Tue Jul 6 11:30:21 EDT 2004
From: "Ausum Studio" <ausum_studio at hotmail.com>
> ----- Original Message -----
> From: "Jonathan Hobbs" <toolkit at magma.ca>
> To: "Ausum Studio" <ausum_studio at hotmail.com>; <zope at zope.org>
> Sent: Tuesday, July 06, 2004 7:32 AM
> >
> > (...)
> >
> > Our current zodb size is about 7Gb and we are still investigating various
> > methods for improving update speed and retrieval speed (we're not Google, so
> > throwing 100,000 cpus at the problem is not an option!). How much data are
> > you trying to store that would cause you to want to get into compression?
>
> Hi, Jonathan. I just want to deliver a read-only CD catalog.
> Your approach seems to make it possible to get rid of the real objects and to
> have a compressed zcatalog. Both things save space. Interesting. Would you
> share the core lines of your external method?
Here are the relevant source code bits (it may be a bit cryptic as I just
ripped it from the various routines):
*** Source Code extracts from ZCatalog update routine ***
<snip>
from cStringIO import StringIO
import gzip
</snip>
<snip>
# all of the text fields are stored in the child record as a compressed field
# (to save space and allow the field to be included in meta data)
displaytext = compresstext(fielddata)
# we still need an uncompressed version of the text data for the zcatalog
# indexer to use
fulltext = fielddata
</snip>
<snip>
# add new object to BTreeFolder2 folder (WebSitesChildData)
newobj = self.WebSitesChildData.manage_addProduct['SWV2'].WebSites.createInObjectManager(
    REQUEST['id'], REQUEST)
# add fields to new object
newobj.propertysheets.WebSitesPS.manage_editProperties({
    'title' : title,
    'displaytext' : displaytext,
    'company_text' : company_text,
    'rating' : rating,
    'rating_count' : rating_count,
    'createdby' : createdby,
    'date' : date,
    'last_modified_date' : date,
    'master_recid' : master_recid})
# zclass is NOT 'catalog aware' (catalog aware instances require 2 catalog
# updates: one when object is added to folder - the automatic update - and a
# second manual 'reindex' after the fields are added to the object). The
# 'non-aware' approach only requires one catalog update
self.Catalog.catalog_object(newobj, REQUEST['id'])
</snip>
<snip>
# compresstext.py
#
# Returns concatenated fields that have been compressed to save storage space
# and allow 'comptext' field to be held in meta data table (for fast access at
# retrieval time)
#
# BIG NOTE: tres important eh
#
# zope seems to 'misplace' chr(13) characters from time to time, so we need to
# replace all chr(13) with something we can fix at decompression time. This
# routine replaces all chr(13) characters with the string '*fixme*'
def compresstext(buf):
    zbuf = StringIO()
    zfile = gzip.GzipFile(mode = 'w', fileobj = zbuf, compresslevel = 9)
    zfile.write(buf)
    zfile.close()
    return replace(zbuf.getvalue())
# replace.py
#
# replaces all occurrences of chr(13) with '*fixme*'.
# Fix to get around problem of zope corrupting compressed data strings when
# they are stored in property fields on objects (zope seems to lose some
# chr(13) bytes)
#
# Note: string.replace and re (regular expressions) substitution doesn't work
# on binary strings (binary strings can have embedded null chars which cause
# them to barf), so this binary replace routine was necessary
def replace(instr):
    newstr = ''
    pos = 0
    while pos < len(instr):
        if ord(instr[pos:pos+1]) == 13:
            newstr += '*fixme*'
        else:
            newstr += instr[pos:pos+1]
        pos += 1
    return newstr
</snip>
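For anyone trying the same idea on a current Python 3, here is a minimal sketch of the compress-then-escape step and its inverse. It is not the code from our system: it assumes the field data is already a bytes object, it reuses the names compresstext and uncompresstext purely for illustration, and it relies on bytes.replace, which in Python 3 works on arbitrary binary data (embedded null bytes included). The mtime=0 argument is my addition so that the output is reproducible.

```python
import gzip
import io

CR = b"\r"            # chr(13), the byte that seemed to go missing in storage
MARKER = b"*fixme*"   # placeholder string used in the routines above


def compresstext(text: bytes) -> bytes:
    """Gzip-compress text at level 9, then swap every CR byte for the marker."""
    zbuf = io.BytesIO()
    # mtime=0 keeps the gzip header identical from run to run (an assumption
    # made for this sketch; the original code used the default timestamp).
    with gzip.GzipFile(mode="wb", fileobj=zbuf, compresslevel=9, mtime=0) as zfile:
        zfile.write(text)
    return zbuf.getvalue().replace(CR, MARKER)


def uncompresstext(stored: bytes) -> bytes:
    """Put the CR bytes back, then gunzip (with the normal CRC check)."""
    restored = stored.replace(MARKER, CR)
    with gzip.GzipFile(mode="rb", fileobj=io.BytesIO(restored)) as zfile:
        return zfile.read()


sample = b"line one\r\nline two\r\n\x00binary\x00" * 20
stored = compresstext(sample)
print(len(sample), len(stored), uncompresstext(stored) == sample)
```

One caveat with this kind of escaping: if the compressed bytes ever happened to contain the literal sequence '*fixme*', the inverse step would turn it into a CR and corrupt the stream. That is very unlikely for any given record, but it is not impossible.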
*** Source Code extract from external method which does zcatalog search and
decompresses metadata (we use an external method instead of a dtml method or
python script and we don't use zpt) ***
<snip>
from cStringIO import StringIO
import gzip
import jhgzip
</snip>
<snip>
results = self.Catalog.searchResults(searchdict)
for rec in results:
</snip>
<snip>
try:
    # try uncompressing with CRC checking enabled
    fulltext = uncompresstextCRC(rec.displaytext)
except:
    # here if error during decompression, so try decompressing with CRC
    # error checking turned off
    try:
        fulltext = uncompresstext(rec.displaytext)
    except:
        # died due to some unknown decompression error
        fulltext = 'description not available'
</snip>
<snip>
# uncompresstextCRC
#
# Returns the decompressed text of a field that was compressed to save storage
# space and allow 'comptext' field to be held in meta data table (for fast
# access at retrieval time)
#
# NOTE: this version of uncompresstext uses the original/unmodified gzip.py
# module and will raise CRC errors
#
# IMPORTANT:
#
# the compresstext.py routine replaces all occurrences of the chr(13)
# character with the string '*fixme*'. This is done because zope sometimes
# misplaces the chr(13) character when it is stored/retrieved from the zodb.
# Therefore we must replace the substituted string with the original
# character before decompressing.
def uncompresstextCRC(buf):
    buf = replacefixme(buf)
    zbuf = StringIO(buf)
    zfile = gzip.GzipFile(mode = 'rb', fileobj = zbuf)
    obuf = ''
    while 1:
        chunk = zfile.read()
        if not chunk:
            break
        obuf = obuf + chunk
    zfile.close()
    return obuf
# uncompresstext
#
# Returns the decompressed text of a field that was compressed to save storage
# space and allow 'comptext' field to be held in meta data table (for fast
# access at retrieval time)
#
# NOTE: this version of uncompresstext uses a modified version of gzip which
# ignores CRC errors
#
# IMPORTANT:
#
# the compresstext.py routine replaces all occurrences of the chr(13)
# character with the string '*fixme*'. This is done because zope sometimes
# misplaces the chr(13) character when it is stored/retrieved from the zodb.
# Therefore we must replace the substituted string with the original
# character before decompressing.
def uncompresstext(buf):
    buf = replacefixme(buf)
    zbuf = StringIO(buf)
    zfile = jhgzip.GzipFile(mode = 'rb', fileobj = zbuf)
    obuf = ''
    while 1:
        chunk = zfile.read()
        if not chunk:
            break
        obuf = obuf + chunk
    zfile.close()
    return obuf
# replacefixme.py
#
# replaces all occurrences of the string '*fixme*' with chr(13).
# Fix to get around problem of zope corrupting compressed data strings when
# they are stored in property fields on objects.
#
# Note: string.replace and re (regular expressions) substitution doesn't work
# on binary strings (binary strings can have embedded null chars which cause
# them to barf), so this binary replace routine was necessary
def replacefixme(instr):
    newstr = ''
    pos = 0
    while pos < len(instr):
        # compare the whole 7-character slice so that only an exact '*fixme*'
        # matches; a slice that runs past the end of the string is just shorter
        if instr[pos:pos+7] == '*fixme*':
            # found '*fixme*', so insert missing chr(13) character
            newstr = newstr + chr(13)
            # don't add '*fixme*' characters to new string
            pos = pos + 7
        else:
            # '*fixme*' not found, so add current character to new string
            newstr = newstr + instr[pos:pos+1]
            pos = pos + 1
    return newstr
</snip>
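As a cross-check of the marker scan, here is a small Python 3 sketch (bytes in, bytes out; the function name is reused only for illustration). It compares the whole seven-byte slice against the marker. A direct comparison cannot be fooled by a different sequence whose byte values merely add up to the same total ('*gixmd*' sums to 621, exactly like '*fixme*'), and slicing past the end of the data just yields a shorter slice instead of an error.

```python
MARKER = b"*fixme*"
CR = b"\r"


def replacefixme(data: bytes) -> bytes:
    """Replace each exact occurrence of the marker with a single CR byte."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        if data[pos:pos + len(MARKER)] == MARKER:
            out += CR                 # exact marker found: restore chr(13)
            pos += len(MARKER)
        else:
            out.append(data[pos])     # any other byte is copied unchanged
            pos += 1
    return bytes(out)


print(replacefixme(b"ab*fixme*cd"))   # marker in the middle
print(replacefixme(b"ab*"))           # lone '*' at the very end
print(replacefixme(b"*gixmd*"))       # same byte sum, different text
```

The three calls cover the awkward inputs: a marker surrounded by data, a stray '*' as the last byte, and a near-miss that only matches by byte sum.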
The modifications to gzip.py (called jhgzip in the above code) are as follows:
# standard gzip _read_eof with crc checking commented out
# (some CRC errors can be ignored)
def _read_eof(self):
    # We've read to the end of the file, so we have to rewind in order
    # to reread the 8 bytes containing the CRC and the file size.
    # We check that the computed CRC and size of the
    # uncompressed data matches the stored values.
    self.fileobj.seek(-8, 1)
    crc32 = read32(self.fileobj)
    isize = read32(self.fileobj)
    # if crc32%0x100000000L != self.crc%0x100000000L:
    #     raise ValueError, "CRC check failed"
    # elif isize != self.size:
    #     raise ValueError, "Incorrect length of data produced"
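If patching a private copy of gzip.py is not attractive, roughly the same effect can probably be had on Python 3 with zlib alone. The sketch below is an assumption-laden illustration, not what we run: the helper names skip_gzip_header and gunzip_lenient are made up, the header layout follows the gzip file format (RFC 1952), and it handles a single gzip member only. It inflates the raw deflate stream, which involves no CRC check, and then reports whether the 8-byte trailer matches instead of raising.

```python
import gzip
import struct
import zlib

# gzip header flag bits, per RFC 1952
FHCRC, FEXTRA, FNAME, FCOMMENT = 2, 4, 8, 16


def skip_gzip_header(data: bytes) -> int:
    """Return the offset at which the deflate stream starts."""
    if data[:2] != b"\x1f\x8b" or data[2] != 8:
        raise ValueError("not a gzip (deflate) stream")
    flags = data[3]
    pos = 10  # magic(2) method(1) flags(1) mtime(4) xfl(1) os(1)
    if flags & FEXTRA:
        (xlen,) = struct.unpack_from("<H", data, pos)
        pos += 2 + xlen
    if flags & FNAME:
        pos = data.index(b"\x00", pos) + 1    # zero-terminated file name
    if flags & FCOMMENT:
        pos = data.index(b"\x00", pos) + 1    # zero-terminated comment
    if flags & FHCRC:
        pos += 2
    return pos


def gunzip_lenient(data: bytes):
    """Inflate one gzip member; return (payload, trailer_matches)."""
    inflater = zlib.decompressobj(-zlib.MAX_WBITS)   # raw deflate, no CRC check
    payload = inflater.decompress(data[skip_gzip_header(data):])
    trailer = inflater.unused_data[:8]               # CRC32 + size, little-endian
    expected = (zlib.crc32(payload) & 0xFFFFFFFF, len(payload) & 0xFFFFFFFF)
    ok = len(trailer) == 8 and struct.unpack("<II", trailer) == expected
    return payload, ok


good = gzip.compress(b"hello\r\nworld " * 50)
bad = good[:-8] + bytes([good[-8] ^ 0xFF]) + good[-7:]   # damage the stored CRC
print(gunzip_lenient(good)[1], gunzip_lenient(bad)[1])
```

As with the commented-out check above, this only tolerates a wrong trailer. Damage inside the compressed stream itself would still surface as a zlib.error.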
HTH, Good Luck!
Jonathan