From: "Ausum Studio" <ausum_studio@hotmail.com>
----- Original Message -----
From: "Jonathan Hobbs" <toolkit@magma.ca>
To: "Ausum Studio" <ausum_studio@hotmail.com>; <zope@zope.org>
Sent: Tuesday, July 06, 2004 7:32 AM
(...)
Our current zodb size is about 7Gb and we are still investigating various methods for improving update speed and retrieval speed (we're not Google, so throwing 100,000 cpus at the problem is not an option!). How much data are you trying to store that would cause you to want to get into compression?
Hi, Jonathan. I just want to deliver a read-only CD catalog. Your approach seems to make it possible to get rid of the real objects and keep a compressed ZCatalog; both things save space. Interesting. Would you share the core lines of your external method?
Here are the relevant source code bits (they may be a bit cryptic as I just ripped them from the various routines):

*** Source Code extracts from ZCatalog update routine ***

<snip>
from cStringIO import StringIO
import gzip
</snip>

<snip>
# all of the text fields are stored in the child record as a compressed field (to save
# space and allow the field to be included in meta data)
displaytext = compresstext(fielddata)

# we still need an uncompressed version of the text data for the zcatalog indexer to use
fulltext = fielddata
</snip>

<snip>
# add new object to BTreeFolder2 folder (WebSitesChildData)
newobj = self.WebSitesChildData.manage_addProduct['SWV2'].WebSites.createInObjectManager(REQUEST['id'], REQUEST)

# add fields to new object
newobj.propertysheets.WebSitesPS.manage_editProperties({
    'title': title,
    'displaytext': displaytext,
    'company_text': company_text,
    'rating': rating,
    'rating_count': rating_count,
    'createdby': createdby,
    'date': date,
    'last_modified_date': date,
    'master_recid': master_recid})

# zclass is NOT 'catalog aware' (catalog aware instances require 2 catalog updates: one when the
# object is added to the folder - the automatic update - and a second manual 'reindex' after the
# fields are added to the object). The 'non-aware' approach only requires one catalog update
self.Catalog.catalog_object(newobj, REQUEST['id'])
</snip>

<snip>
# compresstext.py
#
# Returns concatenated fields that have been compressed to save storage space and allow the
# 'comptext' field to be held in the meta data table (for fast access at retrieval time)
#
# BIG NOTE: tres important eh
#
# zope seems to 'misplace' chr(13) characters from time to time, so we need to
# replace all chr(13) with something we can fix at decompression time.
# This routine replaces all chr(13) characters with the string '*fixme*'

def compresstext(buf):
    zbuf = StringIO()
    zfile = gzip.GzipFile(mode='w', fileobj=zbuf, compresslevel=9)
    zfile.write(buf)
    zfile.close()
    return replace(zbuf.getvalue())


# replace.py
#
# replaces all occurrences of chr(13) with '*fixme*'.
# Fix to get around a problem of zope corrupting compressed data strings when they
# are stored in property fields on objects (zope seems to lose some chr(13) bytes)
#
# Note: string.replace and re (regular expression) substitution don't work
# on binary strings (binary strings can have embedded null chars which cause
# them to barf), so this binary replace routine was necessary

def replace(instr):
    newstr = ''
    pos = 0
    while pos < len(instr):
        if ord(instr[pos:pos+1]) == 13:
            newstr += '*fixme*'
        else:
            newstr += instr[pos:pos+1]
        pos += 1
    return newstr
</snip>

*** Source Code extract from the external method which does the zcatalog search and decompresses the metadata (we use an external method instead of a dtml method or python script, and we don't use zpt) ***

<snip>
from cStringIO import StringIO
import gzip
import jhgzip
</snip>

<snip>
results = self.Catalog.searchResults(searchdict)
for rec in results:
</snip>

<snip>
try:
    # try uncompressing with CRC enabled
    fulltext = uncompresstextCRC(rec.displaytext)
except:
    # here if error during decompression, so try decompressing with CRC error checking turned off
    try:
        fulltext = uncompresstext(rec.displaytext)
    except:
        # died due to some unknown decompression error
        fulltext = 'description not available'
</snip>

<snip>
# uncompresstextCRC
#
# Returns concatenated fields that have been compressed to save storage space and allow the
# 'comptext' field to be held in the meta data table (for fast access at retrieval time)
#
# NOTE: this version of uncompresstext uses the original/unmodified gzip.py module and
# will return CRC errors
#
# IMPORTANT:
#
# the compresstext.py routine replaces all occurrences of the chr(13) character with the
# string
# '*fixme*'. This is done because zope sometimes misplaces the chr(13) character
# when it is stored/retrieved from the zodb. Therefore we must replace the substituted string
# with the original character string.

def uncompresstextCRC(buf):
    buf = replacefixme(buf)
    zbuf = StringIO(buf)
    zfile = gzip.GzipFile(mode='rb', fileobj=zbuf)
    obuf = ''
    while 1:
        chunk = zfile.read()
        if not chunk:
            break
        obuf = obuf + chunk
    zfile.close()
    return obuf


# uncompresstext
#
# Returns concatenated fields that have been compressed to save storage space and allow the
# 'comptext' field to be held in the meta data table (for fast access at retrieval time)
#
# NOTE: this version of uncompresstext uses a modified version of gzip which ignores CRC errors
#
# IMPORTANT:
#
# the compresstext.py routine replaces all occurrences of the chr(13) character with the
# string '*fixme*'. This is done because zope sometimes misplaces the chr(13) character
# when it is stored/retrieved from the zodb. Therefore we must replace the substituted string
# with the original character string.

def uncompresstext(buf):
    buf = replacefixme(buf)
    zbuf = StringIO(buf)
    zfile = jhgzip.GzipFile(mode='rb', fileobj=zbuf)
    obuf = ''
    while 1:
        chunk = zfile.read()
        if not chunk:
            break
        obuf = obuf + chunk
    zfile.close()
    return obuf


# replacefixme.py
#
# replaces all occurrences of the string '*fixme*' with chr(13).
# Fix to get around a problem of zope corrupting compressed data strings when they
# are stored in property fields on objects.
#
# Note: string.replace and re (regular expression) substitution don't work
# on binary strings (binary strings can have embedded null chars which cause
# them to barf), so this binary replace routine was necessary

def replacefixme(instr):
    newstr = ''
    pos = 0
    while pos < len(instr):
        currchar = ord(instr[pos:pos+1])
        # check to see if we have found a '*' character
        if currchar == 42:
            # found a '*', so check for '*fixme*'
            chk = currchar + ord(instr[pos+1:pos+2]) \
                + ord(instr[pos+2:pos+3]) \
                + ord(instr[pos+3:pos+4]) \
                + ord(instr[pos+4:pos+5]) \
                + ord(instr[pos+5:pos+6]) \
                + ord(instr[pos+6:pos+7])
            if (chk == 621) and (ord(instr[pos+6:pos+7]) == 42):
                # found '*fixme*', so insert the missing chr(13) character
                newstr = newstr + chr(13)
                # don't add the '*fixme*' characters to the new string
                pos = pos + 7
            else:
                # '*fixme*' not found, so add the current character to the new string
                newstr = newstr + chr(currchar)
                pos = pos + 1
        else:
            # '*' character not found, so add the current character to the new string
            newstr = newstr + chr(currchar)
            pos = pos + 1
    return newstr
</snip>

The modifications to gzip.py (called jhgzip in the above code) are as follows:

# standard gzip _read_eof with crc checking commented out
# (some CRC errors can be ignored)
def _read_eof(self):
    # We've read to the end of the file, so we have to rewind in order
    # to reread the 8 bytes containing the CRC and the file size.
    # We check that the computed CRC and size of the
    # uncompressed data match the stored values.
    self.fileobj.seek(-8, 1)
    crc32 = read32(self.fileobj)
    isize = read32(self.fileobj)
    # if crc32%0x100000000L != self.crc%0x100000000L:
    #     raise ValueError, "CRC check failed"
    # elif isize != self.size:
    #     raise ValueError, "Incorrect length of data produced"

HTH, Good Luck!

Jonathan
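For anyone reading this on a current Python, here is a minimal sketch of the same compress-then-escape round trip in Python 3 (the function names mirror Jonathan's but this is illustrative, not his code; on Python 3, bytes.replace handles binary data with embedded NULs, so the hand-rolled loops are no longer needed):

```python
import gzip

SENTINEL = b"*fixme*"  # stand-in for the escaped chr(13) byte

def compresstext(text):
    """Gzip-compress text, then escape every 0x0d byte with a sentinel."""
    zdata = gzip.compress(text.encode("utf-8"), compresslevel=9)
    # caveat: if the compressed stream itself ever contained the literal
    # sentinel bytes, the round trip would be ambiguous; a base64 or
    # length-prefixed encoding avoids that corner case entirely
    return zdata.replace(b"\x0d", SENTINEL)

def uncompresstext(buf):
    """Undo the sentinel escape, then gunzip (gzip verifies the CRC)."""
    return gzip.decompress(buf.replace(SENTINEL, b"\x0d")).decode("utf-8")
```

The escaped blob contains no 0x0d bytes at all, which is the whole point of the trick: whatever was eating chr(13) never sees one.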
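A side note on replacefixme: the summed-ordinal check (chk == 621) can misfire, since any seven characters starting and ending with '*' whose codes sum to 621 (a rearrangement such as '*fmxie*', for instance) would be treated as the sentinel. A direct slice comparison avoids that; a Python 3 sketch:

```python
def replacefixme(instr: bytes) -> bytes:
    """Restore chr(13) bytes from the '*fixme*' sentinel.

    Compares the 7-byte slice directly instead of summing character
    codes, so look-alike sequences with the same code sum are not
    mistaken for the sentinel.
    """
    out = bytearray()
    pos = 0
    while pos < len(instr):
        if instr[pos:pos + 7] == b"*fixme*":
            out.append(13)   # re-insert the escaped chr(13)
            pos += 7
        else:
            out.append(instr[pos])
            pos += 1
    return bytes(out)
```

On Python 3 the whole loop is equivalent to the one-liner `instr.replace(b"*fixme*", b"\r")`, since bytes.replace also scans left to right over non-overlapping matches.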
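The jhgzip patch predates today's zlib conveniences; on a current Python you can get the same "ignore the CRC trailer" effect without patching gzip.py, by stripping the gzip header and inflating the raw deflate stream. This sketch assumes a plain 10-byte header with no optional fields (FNAME, FEXTRA, etc.), which is what gzip.compress() emits:

```python
import gzip
import zlib

def uncompress_ignoring_crc(gz_bytes):
    """Inflate a gzip stream while skipping the CRC/length trailer check.

    wbits=-MAX_WBITS selects a raw deflate stream, so zlib never looks
    at -- or verifies -- the 8-byte gzip trailer; the trailer bytes
    simply end up in the decompressor's unused_data.
    """
    return zlib.decompressobj(-zlib.MAX_WBITS).decompress(gz_bytes[10:])
```

A blob with a flipped trailer byte makes gzip.decompress raise, while the raw inflate still recovers the payload, which is exactly the failure mode the commented-out checks in jhgzip were working around.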