Hi zopistas, I need to access large text files (~120 MB) from Zope. I know about Python's large file support and that it is better to keep large files out of the ZODB. I have to do a full-text search on these files, which reside in a folder hierarchy on the server, show their content around the location of the found string, and allow scrolling through a file's source from Zope. Has anybody done something similar with large files like these and would share their experiences? Are there any do's and don'ts, or best ways to do it? Thanks for your answers, SK
Sebastian wrote:
I need to access large text files (~120 MB) from Zope. I know about Python's large file support and that it is better to keep large files out of the ZODB. I have to do a full-text search on these files, which reside in a folder hierarchy on the server, show their content around the location of the found string, and allow scrolling through a file's source from Zope.
Has anybody done something similar with large files like these and would share their experiences? Are there any do's and don'ts, or best ways to do it?
We have a similar application, but with many smaller text records instead of one large one. We currently use ZCatalog with ZCTextIndex to maintain a database of about 700,000 text records; the average record size is about 10k bytes. The total ZODB size is about 5 GB.

We also do a similar thing in that we locate the user's search term within the record and display the relevant sections of the search result records. We found that we had to include the full text of the record within the metadata table (even though the recommended practice is to have a maximum of 200 bytes in the metadata table) because the time required to access the original document was much too long.

There are some downsides to our approach, though. Retrieval speed is excellent; however, it currently takes us about 30 days of processing on a dedicated server to rebuild the database. We are currently running Zope 2.6.1 on Linux servers and will be upgrading to 2.6.2 as soon as our current update cycle completes. We are also looking for alternative ways to store the full text.

HTH
Jonathan
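As an aside, the "locate the search term and display the relevant sections" step Jonathan describes could look roughly like the following sketch. This is illustrative only -- `context_snippets`, its parameters, and the defaults are invented here, not Jonathan's actual code:

```python
def context_snippets(text, term, window=60, max_hits=10):
    """Return up to max_hits snippets of `text` surrounding each
    occurrence of `term`, with `window` characters of context on
    either side (case-insensitive match)."""
    snippets = []
    lower_text = text.lower()
    lower_term = term.lower()
    pos = lower_text.find(lower_term)
    while pos != -1 and len(snippets) < max_hits:
        start = max(0, pos - window)
        end = min(len(text), pos + len(term) + window)
        snippets.append(text[start:end])
        pos = lower_text.find(lower_term, pos + len(term))
    return snippets
```

The snippets would then be rendered next to each search result, which is cheap as long as the full text itself is fast to reach -- which is exactly the access-time problem Jonathan worked around with the metadata table.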
Small Business Services wrote at 2003-12-5 08:31 -0500:
... We also do a similar thing in that we locate the user's search term within the record and display the relevant sections of the search result records. We found that we had to include the full text of the record within the metadata table (even though the recommended practice is to have a maximum of 200 bytes in the metadata table) because the time required to access the original document was much too long.
Something seems to be strange with your setup. In general, it is no problem to load a few 10k objects from the ZODB (this is different when you load hundreds or thousands). Putting the text in your MetaData causes huge MetaData blocks. With your average 10k content, you get 300k to 600k MetaData blocks (each block contains metadata records for up to 60 objects). -- Dieter
Small Business Services wrote at 2003-12-5 08:31 -0500:
... We also do a similar thing in that we locate the user's search term within the record and display the relevant sections of the search result records. We found that we had to include the full text of the record within the metadata table (even though the recommended practice is to have a maximum of 200 bytes in the metadata table) because the time required to access the original document was much too long.
Something seems to be strange with your setup.
In general, it is no problem to load a few 10k objects from the ZODB (this is different when you load hundreds or thousands).
Putting the text in your MetaData causes huge MetaData blocks. With your average 10k content, you get 300k to 600k MetaData blocks (each block contains metadata records for up to 60 objects).
When we used getitem to access the full-text field of the target document, our average response time (using Call Profiler) was in the range of 5-7 seconds (we actually had to call getitem on 10 different documents for each search result set). When we moved the full-text data to metadata, the average response time dropped to 1-1.5 seconds. There were no other differences between the tests, so we concluded that a getitem call to the actual document was very expensive. Jonathan
Small Business Services wrote at 2003-12-8 08:19 -0500:
...
Something seems to be strange with your setup.
In general, it is no problem to load a few 10k objects from the ZODB (this is different when you load hundreds or thousands).
Putting the text in your MetaData causes huge MetaData blocks. With your average 10k content, you get 300k to 600k MetaData blocks (each block contains metadata records for up to 60 objects).
When we used getitem to access the full-text field of the target document, our average response time (using Call Profiler) was in the range of 5-7 seconds (we actually had to call getitem on 10 different documents for each search result set). When we moved the full-text data to metadata, the average response time dropped to 1-1.5 seconds.
There were no other differences between the tests, so we concluded that a getitem call to the actual document was very expensive.
I recently measured ZEO interaction (on a quite fast computer with the ZEO client and server on the same machine). Usually, a ZEO load took about 1 to 3 ms for small to medium-size objects. When you access an object from a catalog search, you probably load more than one object (all objects on the path must be loaded, unless they are cached). Nevertheless, I have difficulty understanding the 5-7 seconds. -- Dieter
On Fri, Dec 05, 2003 at 10:31:08AM +0100, Sebastian Krollmann wrote:
Hi zopistas,
I need to access large text files (~120 MB) from Zope. I know about Python's large file support and that it is better to keep large files out of the ZODB. I have to do a full-text search on these files, which reside in a folder hierarchy on the server, show their content around the location of the found string, and allow scrolling through a file's source from Zope.
Has anybody done something similar with large files like these and would share their experiences? Are there any do's and don'ts, or best ways to do it?
I think you will find that serving a 120 MB object through Zope will cripple your performance. Zope is reeeeallly slow with large chunks of data. A couple of concurrent downloads of 100 MB files can cause your site to crawl for all users.

However, there are a couple of ways you could store and index the text files in Zope but avoid having the users hit Zope to download them.

I'm experimenting with FSCacheManager (downloadable from CVS on collective.sf.net), which does "funky caching" in conjunction with an Apache rule. Apache tries to serve the file directly from the filesystem. If it doesn't exist, Apache then forwards the request to Zope. The FSCacheManager causes the file to be stored to the filesystem each time it's hit in Zope. Once a file is on the filesystem, Zope won't see further requests for it. This works fine and it's very easy to set up. The big limitation is that, once the file is on the filesystem, it's available to all ... Zope authorization is never checked again. Also, you can't really control the life of the cache, but that may not be an issue.

You could do something similar with Squid filesystem caching, which IIRC can be configured to request authorization from Zope each time someone downloads the file, and to clean out the cache according to some policy. Of course, you'll need a lot of disk space either way, but who cares?

In either case, the first download will still be slow, but you can prevent that by using wget or similar to "prime" the cache during off-hours.

--
Paul Winkler
http://www.slinkp.com
Look! Up in the sky! It's FLYING ACTION HERO!
(random hero from isometric.spaceninja.com)
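For the curious, the Apache side of this "funky caching" setup might look roughly like the sketch below. This is a guess at the shape of the rewrite rule, not FSCacheManager's documented configuration; the cache path and Zope port are invented:

```apache
# Hypothetical "funky caching" rules: serve the file straight from
# disk when a cached copy exists, otherwise proxy the request
# through to Zope (which writes the file out via FSCacheManager).
RewriteEngine On
RewriteCond /var/cache/zope%{REQUEST_URI} -f
RewriteRule ^/(.*)$ /var/cache/zope/$1 [L]
RewriteRule ^/(.*)$ http://localhost:8080/$1 [P,L]
```

Note the trade-off Paul mentions: once the first rule matches, the request never reaches Zope, so Zope's security machinery is bypassed entirely.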
Sebastian Krollmann wrote at 2003-12-5 10:31 +0100:
I need to access large text files (~120 MB) from Zope. I know about Python's large file support and that it is better to keep large files out of the ZODB. I have to do a full-text search on these files, which reside in a folder hierarchy on the server, show their content around the location of the found string, and allow scrolling through a file's source from Zope.
You can do most of this even when the files are in the file system (you would use an External Method to extract ranges from the external file). Expect a hard time efficiently indexing "positions" in your file, though. I think you will not want to search for a string in a 100 MB+ file without position information (at least not often)... -- Dieter
Hi Peter,
Hm, I think (downloading) performance should be roughly the same -- after all, the files would still have to be loaded by Zope, at least with products like LocalFS and ExtFile that first load the whole file into memory before starting to serve it? I wrote myself a streaming method for this reason.
So the options are either to patch LocalFS and ExtFile not to wait for the complete load, or to read the file via External Methods as Dieter Maurer suggested. What does your streaming method do, just download the file? Thanks for your answers, SK
Sebastian Krollmann wrote:
Hi Peter,
Hm, I think (downloading) performance should be roughly the same -- after all, the files would still have to be loaded by Zope, at least with products like LocalFS and ExtFile that first load the whole file into memory before starting to serve it? I wrote myself a streaming method for this reason.
So the options are either to patch LocalFS and ExtFile not to wait for the complete load, or to read the file via External Methods as Dieter Maurer suggested.
What does your streaming method do, just download the file?
It's pretty simple -- I actually use LocalFS so users can browse / upload files, and this:

    def streamFile(file, name, REQUEST):
        """file is a LocalFS file object, name is the file name"""
        resp = REQUEST.RESPONSE
        path = file.path
        ctype = file._getType()
        size = file.get_size()
        # logging.debug("setting type: %s, path: %s" % (ctype, path))
        resp.setHeader('Content-Type', ctype)
        resp.setHeader('Content-Length', size)
        resp.setHeader('Content-Location', name)
        read = open(path, 'rb').read
        write = resp.write
        while 1:
            # CONST_STREAMREADSIZE is the chunk size, defined elsewhere
            data = read(CONST_STREAMREADSIZE)
            if data == '':
                break
            write(data)

is for downloading.

hth, peter.
Thanks for your answers,
SK
_______________________________________________
Zope maillist - Zope@zope.org
http://mail.zope.org/mailman/listinfo/zope
** No cross posts or HTML encoding! **
(Related lists -
http://mail.zope.org/mailman/listinfo/zope-announce
http://mail.zope.org/mailman/listinfo/zope-dev)
participants (5)
- Dieter Maurer
- Paul Winkler
- Peter Sabaini
- Sebastian Krollmann
- Small Business Services