Any working experiences using compression (for storage)?
We need to put more text information into Zope's database without exceeding its file size. I know there's a product called CompressedStorage out there, but it's rather old. Hence I'd like to know whether there are successful experiences using it and/or compressing data inside the ZODB. Which is the best approach for a task like this? Thanks in advance, Ausum
From: "Ausum Studio" <ausum_studio@hotmail.com>
We need to put more text information into Zope's database without exceeding its file size. I know there's a product called CompressedStorage out there, but it's rather old. Hence I'd like to know whether there are successful experiences using it and/or compressing data inside the ZODB.
Which is the best approach for a task like this?
We used gzip (via an external method) to compress data before storing it in a meta-data table of a ZCatalog, then decompressed the meta-data on the fly at retrieval time. We found this was faster than accessing the object itself at retrieval time to get the data field (ZCatalog meta-data is available without having to access the indexed object). We did, however, encounter a couple of difficulties with this approach:

(1) It is only scalable to a certain point. When the size of the ZCatalog (indexes + metadata) exceeds available RAM, swapping eliminates the speed advantage of storing compressed metadata (we went back to accessing the object itself in order to support a larger ZCatalog).

(2) Zope seems to lose chr(13) characters in certain compressed data sequences. To fix this problem we modified the compression routine to replace chr(13) characters with '*fixme*' before storing the data. On the decompression side, we replaced the '*fixme*' occurrences with chr(13) characters before decompressing, and all was well (if you don't do this, the decompression process produces corrupted data).

Our current zodb size is about 7Gb and we are still investigating various methods for improving update speed and retrieval speed (we're not Google, so throwing 100,000 cpus at the problem is not an option!). How much data are you trying to store that would cause you to want to get into compression?

HTH
Jonathan
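The core idea Jonathan describes, gzip-compressing a text field before storing it as catalog metadata and decompressing it on the fly at retrieval time, can be sketched in a few lines. This is a minimal illustration using Python's standard `gzip` module; the function names are illustrative, not Jonathan's actual code:

```python
import gzip
import io


def compress_text(text):
    """Gzip-compress a text field before storing it as catalog metadata."""
    buf = io.BytesIO()
    with gzip.GzipFile(mode='wb', fileobj=buf, compresslevel=9) as zfile:
        zfile.write(text.encode('utf-8'))
    return buf.getvalue()


def uncompress_text(data):
    """Decompress a stored metadata field on the fly at retrieval time."""
    with gzip.GzipFile(mode='rb', fileobj=io.BytesIO(data)) as zfile:
        return zfile.read().decode('utf-8')


# round-trip check: repetitive text compresses well and survives intact
original = "Some long description text for the catalog record. " * 100
stored = compress_text(original)
assert uncompress_text(stored) == original
assert len(stored) < len(original)
```

The trade-off is exactly the one Jonathan notes: decompression is cheap compared to waking the indexed object from the ZODB, but only while the catalog itself still fits in RAM.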
----- Original Message ----- From: "Jonathan Hobbs" <toolkit@magma.ca> To: "Ausum Studio" <ausum_studio@hotmail.com>; <zope@zope.org> Sent: Tuesday, July 06, 2004 7:32 AM
(...)
Our current zodb size is about 7Gb and we are still investigating various methods for improving update speed and retrieval speed (we're not Google, so throwing 100,000 cpus at the problem is not an option!). How much data are you trying to store that would cause you to want to get into compression?
Hi, Jonathan. I just want to deliver a read-only CD catalog. Your approach seems to make it possible to get rid of the real objects and to keep a compressed ZCatalog instead; both things save space. Interesting. Would you share the core lines of your external method? Thanks for the response, Ausum
From: "Ausum Studio" <ausum_studio@hotmail.com>
----- Original Message ----- From: "Jonathan Hobbs" <toolkit@magma.ca> To: "Ausum Studio" <ausum_studio@hotmail.com>; <zope@zope.org> Sent: Tuesday, July 06, 2004 7:32 AM
(...)
Our current zodb size is about 7Gb and we are still investigating various methods for improving update speed and retrieval speed (we're not Google, so throwing 100,000 cpus at the problem is not an option!). How much data are you trying to store that would cause you to want to get into compression?
Hi, Jonathan. I just want to deliver a read-only CD catalog. Your approach seems to make it possible to get rid of the real objects and to keep a compressed ZCatalog instead; both things save space. Interesting. Would you share the core lines of your external method?
Here are the relevant source code bits (it may be a bit cryptic as I just ripped it from the various routines):

*** Source code extracts from ZCatalog update routine ***

<snip>
from cStringIO import StringIO
import gzip
</snip>

<snip>
# all of the text fields are stored in the child record as a compressed field
# (to save space and allow the field to be included in meta data)
displaytext = compresstext(fielddata)
# we still need an uncompressed version of the text data for the zcatalog indexer to use
fulltext = fielddata
</snip>

<snip>
# add new object to BTreeFolder2 folder (WebSitesChildData)
newobj = self.WebSitesChildData.manage_addProduct['SWV2'].WebSites.createInObjectManager(REQUEST['id'], REQUEST)

# add fields to new object
newobj.propertysheets.WebSitesPS.manage_editProperties({
    'title'              : title,
    'displaytext'        : displaytext,
    'company_text'       : company_text,
    'rating'             : rating,
    'rating_count'       : rating_count,
    'createdby'          : createdby,
    'date'               : date,
    'last_modified_date' : date,
    'master_recid'       : master_recid})

# zclass is NOT 'catalog aware' (catalog aware instances require 2 catalog updates:
# one when the object is added to the folder - the automatic update - and a second
# manual 'reindex' after the fields are added to the object). The 'non-aware'
# approach only requires one catalog update
self.Catalog.catalog_object(newobj, REQUEST['id'])
</snip>

<snip>
# compresstext.py
#
# Returns concatenated fields that have been compressed to save storage space and
# allow the 'comptext' field to be held in the meta data table (for fast access
# at retrieval time)
#
# BIG NOTE: tres important, eh
#
# zope seems to 'misplace' chr(13) characters from time to time, so we need to
# replace all chr(13) with something we can fix at decompression time. This
# routine replaces all chr(13) characters with the string '*fixme*'

def compresstext(buf):
    zbuf = StringIO()
    zfile = gzip.GzipFile(mode='w', fileobj=zbuf, compresslevel=9)
    zfile.write(buf)
    zfile.close()
    return replace(zbuf.getvalue())


# replace.py
#
# Replaces all occurrences of chr(13) with '*fixme*'.
# Fix to get around the problem of zope corrupting compressed data strings when
# they are stored in property fields on objects (zope seems to lose some chr(13)
# bytes).
#
# Note: string.replace and re (regular expression) substitution don't work on
# binary strings (binary strings can have embedded null chars which cause them
# to barf), so this binary replace routine was necessary

def replace(instr):
    newstr = ''
    pos = 0
    while pos < len(instr):
        if ord(instr[pos:pos+1]) == 13:
            newstr += '*fixme*'
        else:
            newstr += instr[pos:pos+1]
        pos += 1
    return newstr
</snip>

*** Source code extract from the external method which does the zcatalog search and decompresses the metadata (we use an external method instead of a dtml method or python script, and we don't use zpt) ***

<snip>
from cStringIO import StringIO
import gzip
import jhgzip
</snip>

<snip>
results = self.Catalog.searchResults(searchdict)
for rec in results:
</snip>

<snip>
    try:
        # try uncompressing with CRC checking enabled
        fulltext = uncompresstextCRC(rec.displaytext)
    except:
        # here if there was an error during decompression, so try decompressing
        # with CRC error checking turned off
        try:
            fulltext = uncompresstext(rec.displaytext)
        except:
            # died due to some unknown decompression error
            fulltext = 'description not available'
</snip>

<snip>
# uncompresstextCRC.py
#
# Returns concatenated fields that have been compressed to save storage space and
# allow the 'comptext' field to be held in the meta data table (for fast access
# at retrieval time)
#
# NOTE: this version of uncompresstext uses the original/unmodified gzip.py
# module and will return CRC errors
#
# IMPORTANT:
#
# the compresstext.py routine replaces all occurrences of the chr(13) character
# with the string '*fixme*'. This is done because zope sometimes misplaces the
# chr(13) character when it is stored/retrieved from the zodb. Therefore we must
# replace the substituted string with the original character.

def uncompresstextCRC(buf):
    buf = replacefixme(buf)
    zbuf = StringIO(buf)
    zfile = gzip.GzipFile(mode='rb', fileobj=zbuf)
    obuf = ''
    while 1:
        chunk = zfile.read()
        if not chunk:
            break
        obuf = obuf + chunk
    zfile.close()
    return obuf


# uncompresstext.py
#
# Same as uncompresstextCRC, but uses a modified version of gzip (jhgzip) which
# ignores CRC errors.

def uncompresstext(buf):
    buf = replacefixme(buf)
    zbuf = StringIO(buf)
    zfile = jhgzip.GzipFile(mode='rb', fileobj=zbuf)
    obuf = ''
    while 1:
        chunk = zfile.read()
        if not chunk:
            break
        obuf = obuf + chunk
    zfile.close()
    return obuf


# replacefixme.py
#
# Replaces all occurrences of the string '*fixme*' with chr(13).
# Fix to get around the problem of zope corrupting compressed data strings when
# they are stored in property fields on objects.
#
# Note: string.replace and re (regular expression) substitution don't work on
# binary strings (binary strings can have embedded null chars which cause them
# to barf), so this binary replace routine was necessary

def replacefixme(instr):
    newstr = ''
    pos = 0
    while pos < len(instr):
        currchar = ord(instr[pos:pos+1])
        # check to see if we have found a '*' character
        if currchar == 42:
            # found a '*', so check for '*fixme*'
            chk = currchar + ord(instr[pos+1:pos+2]) \
                           + ord(instr[pos+2:pos+3]) \
                           + ord(instr[pos+3:pos+4]) \
                           + ord(instr[pos+4:pos+5]) \
                           + ord(instr[pos+5:pos+6]) \
                           + ord(instr[pos+6:pos+7])
            if (chk == 621) and (ord(instr[pos+6:pos+7]) == 42):
                # found '*fixme*', so insert the missing chr(13) character
                newstr = newstr + chr(13)
                # don't add the '*fixme*' characters to the new string
                pos = pos + 7
            else:
                # '*fixme*' not found, so add the current character to the new string
                newstr = newstr + chr(currchar)
                pos = pos + 1
        else:
            # '*' character not found, so add the current character to the new string
            newstr = newstr + chr(currchar)
            pos = pos + 1
    return newstr
</snip>

The modifications to gzip.py (called jhgzip in the above code) are as follows:

# standard gzip _read_eof with crc checking commented out
# (some CRC errors can be ignored)
def _read_eof(self):
    # We've read to the end of the file, so we have to rewind in order
    # to reread the 8 bytes containing the CRC and the file size.
    # We check that the computed CRC and size of the
    # uncompressed data matches the stored values.
    self.fileobj.seek(-8, 1)
    crc32 = read32(self.fileobj)
    isize = read32(self.fileobj)
    # if crc32 % 0x100000000L != self.crc % 0x100000000L:
    #     raise ValueError, "CRC check failed"
    # elif isize != self.size:
    #     raise ValueError, "Incorrect length of data produced"

HTH, Good Luck!

Jonathan
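The chr(13) workaround in Jonathan's code amounts to an escape pass over the compressed bytes before storage and an unescape pass before decompression. The same idea can be sketched much more compactly in modern Python, where `bytes.replace` handles embedded nulls without trouble; the function names here are illustrative, not from the original code:

```python
import gzip

MARKER = b'*fixme*'


def protect(data):
    # replace every CR byte in the compressed stream with the marker
    # before the value is stored
    return data.replace(b'\r', MARKER)


def restore(data):
    # put the CR bytes back before handing the stream to gzip
    return data.replace(MARKER, b'\r')


# round-trip check: compress, escape, unescape, decompress
payload = b'line one\r\nline two\r\n' * 50
stored = protect(gzip.compress(payload, compresslevel=9))
assert b'\r' not in stored
assert gzip.decompress(restore(stored)) == payload
```

One caveat with this substitution scheme (present in the original as well): if the compressed byte stream happens to contain a literal `*fixme*` sequence, `restore` will corrupt it. In practice the collision risk is tiny, but an encoding that cannot collide, such as base64-encoding the compressed bytes before storage, avoids the problem entirely at the cost of about 33% more space.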
(Looked around and cannot find anything on this.) When starting Zope 2.7.1 with Python 2.3.4 I get:

2004-07-10T05:09:34 INFO(0) Zope Set effective user to "nobody"
Traceback (most recent call last):
  File "/usr/local/Zope-2.7.1-0/lib/python/Zope/Startup/run.py", line 50, in ?
  File "/usr/local/Zope-2.7.1-0/lib/python/Zope/Startup/run.py", line 19, in run
  File "/usr/local/lib/python2.3/site-packages/PIL/__init__.py", line 49, in start_zope
  File "/usr/local/lib/python2.3/site-packages/PIL/__init__.py", line 245, in makeLockFile
ImportError: No module named misc.lock_file

Which to me looks like a PIL problem, but this is a fresh install with no links to PIL. Anyone?

Jake
____________________
http://www.ZopeZone.com "Zoping for the rest of us"
participants (3)
- Ausum Studio
- Jake
- Jonathan Hobbs