Efficient Processing Of Large ZCatalog Queries
Hello,

I have a Zope setup that does a ZCatalog Query, grabs each item (i.e., it does not use the query-return objects), and then does some processing on each returned object. In pseudocode, whenever I do a ZCatalog Query, I do the following:

    return [myFunction(getObject(x)) for x in catalog.search(myquery)]

The problem is that some of the query responses will be quite large -- up to 10,000 objects returned. Doing a dtml-in over a result set this size does not seem to be feasible -- the browser times out, for one thing.

I have avoided batch processing so far, because I understand (perhaps incorrectly) that batch processing delays the evaluation of later batch results until the batch is viewed. As I am running a script on the results, often via a cron job, I want all of the objects to be processed. If this understanding is incorrect, please let me know.

So my questions are as follows:

1. Is my understanding about batch processing correct or incorrect? Can I have the browser only display the first hundred responses, but have all 5-10,000 results processed?

2. Is there a way to stream responses as they come? More specifically, would it perhaps be feasible to grab the result set and process the results one at a time, displaying partial results? I tried using REQUEST.RESPONSE.write, but the results all seemed to come out at the same time anyway.

3. Finally, what factors influence the speed of a ZCatalog query? Is it the total number of indexed objects? Is it constant? Currently I do one query and then process the results. If I reworked this so that behind the scenes it was doing n queries, I would obviously be slowing myself down, but by how much?

I have created a system for managing data, but I seem to be running into some difficulty now that I have put in the data. For smaller numbers (up to 5000 main records), Zope seems to work fine, but I'm having trouble scaling up. Can anyone help? I'm happy to answer any questions to illuminate the problem more clearly.
Thanks In Advance, Van Lindberg
On Thursday 17 Oct 2002 9:48 pm, VanL wrote:
Hello,
I have a Zope setup that does a ZCatalog Query, grabs each item (i.e., it does not use the query-return objects), and then does some processing on each returned object.
In pseudocode, whenever I do a ZCatalog Query, I do the following:
return [myFunction(getObject(x)) for x in catalog.search(myquery)]
The problem is that some of the query responses will be quite large -- up to 10,000 objects returned. Doing a dtml-in over a result set this size does not seem to be feasible -- the browser times out, for one thing.
Are you sure the time is spent in the dtml, rather than in the pseudocode loop you presented above?
Yes, the time is spent in the pseudocode loop, not in the dtml. Sorry that wasn't clearer. However, the dtml waits for the loop to completely finish processing before it will display the result page. One of the things that I am curious about is a method to display the results as they become available. (As I noted in the original message, REQUEST.RESPONSE.write doesn't seem to be working for me in my current setup.)

Thanks,
Van
On Thursday 17 Oct 2002 10:04 pm, VanL wrote:
Yes, the time is spent in the pseudocode loop, not in the dtml. Sorry that wasn't clearer. However, the dtml waits for the loop to completely finish processing before it will display the result page. One of the things that I am curious about is a method to display the results as they become available. (As I noted in the original message, REQUEST.RESPONSE.write doesn't seem to be working for me in my current setup.)
That makes sense if you are only calling RESPONSE.write once the pseudocode loop has finished. I think you need:

    for x in catalog.search(myquery):
        ob = myFunction(getObject(x))
        RESPONSE.write(formatting_function(ob))

If you don't want to always display the whole list:

    for x in catalog.search(myquery)[start:end]:
        ob = myFunction(getObject(x))
        RESPONSE.write(formatting_function(ob))
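The difference between writing inside the loop and writing after it can be sketched outside Zope. This is only an illustration under stated assumptions: FakeResponse is a hypothetical stand-in that mimics nothing but the write() method of RESPONSE, and process() and format_row() are placeholders for myFunction(getObject(x)) and formatting_function(ob).

```python
# Hypothetical stand-in: mimics only the write() method of Zope's RESPONSE.
class FakeResponse:
    def __init__(self):
        self.chunks = []

    def write(self, data):
        # A real RESPONSE.write would send this chunk to the client.
        self.chunks.append(data)


def process(item):
    # Placeholder for myFunction(getObject(x)).
    return item * 2


def format_row(value):
    # Placeholder for formatting_function(ob).
    return "<li>%s</li>\n" % value


def stream_results(results, response):
    """Write each formatted result as soon as it has been processed."""
    count = 0
    for item in results:
        response.write(format_row(process(item)))
        count += 1
    return count


response = FakeResponse()
search_results = [1, 2, 3]  # stand-in for catalog.search(myquery)
total = stream_results(search_results, response)
```

Each item produces its own write() call, so output can leave the server while later items are still being processed. A list comprehension that returns the whole list has nothing to write until every item is done.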
That makes sense if you are only calling RESPONSE.write once the pseudocode loop has finished. I think you need:
    for x in catalog.search(myquery):
        ob = myFunction(getObject(x))
        RESPONSE.write(formatting_function(ob))
I actually try to call RESPONSE.write in myFunction, above. I actually have:

    <dtml-in expr="Query(REQUEST.form)">
      [formatting code for responses]
    </dtml-in>

It is actually a bit more complicated than I originally expressed: I can't hardcode myFunction(x) into the response page because I don't know in advance what function (if any) will be called on the query results. I dynamically bind the name of the script to the actual script at run time. The calling chain looks like this:

    Form
    Response Page
    Query Control Script
    Query Preparsing (separate out control and query fields)
    Query Control
    Catalog Query (We actually query the catalog at this point, return result.getObject)
    Query Control
    Association Manager (given a list of objects, returns a list of related objects)
    Query Control
    Object Processor
    NameBinding (Associates the name of a script with the actual script object)
    Object Processor (Calls the script on each object in turn)
        ** This is where I try to call RESPONSE.write and it doesn't seem to work **
    Query Control
    Response Page (displays result)

I realize this is rather complex, but this gives me a simple API which I can write to that allows me to run an arbitrary script on an arbitrary set of input objects.

Performance seems to be reasonable for smaller input sets -- going through this calling chain does not appear to take significantly longer than just doing a straight query for result sets up to about 500 (unless, of course, the called script does something that takes a lot of time). For larger queries, though, it seems to take an extraordinarily long time to return. I'm trying to figure out where the problem is and fix it.

You said that perhaps I could do:
for x in catalog.search(myquery)[start:end]
That is reasonable, but would I only process the objects between [start:end], or would I process *all* the objects, but only truncate the displayed result?

Thanks,
Van
I didn't say, but one reason that I am looking at the dtml and ZCatalog as culprits is that I appear to be getting the same poor performance on large queries, even if I short-circuit the other parts of the chain I just described. Thanks, Van
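On the [start:end] question above, a small sketch in plain Python may help. It assumes nothing about ZCatalog itself: results is an ordinary list standing in for catalog.search(myquery), and process() is a placeholder for myFunction(getObject(x)) that records which items it was called on.

```python
processed = []  # records which items were actually processed


def process(item):
    # Placeholder for myFunction(getObject(x)).
    processed.append(item)
    return item * 10


results = list(range(10))  # stand-in for catalog.search(myquery)
start, end = 2, 5

# Variant A: slice first, so only the items in [start:end] are processed.
shown_a = [process(x) for x in results[start:end]]
processed_a = list(processed)

# Variant B: process everything, then truncate only what is displayed.
del processed[:]
all_values = [process(x) for x in results]
shown_b = all_values[start:end]
processed_b = list(processed)
```

With the slice in the for statement (variant A), the loop body only ever sees the sliced items, so the other objects are never processed. To process all objects but display only a few, the slice has to be applied to the output, as in variant B.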
participants (2): Toby Dickenson, VanL