Hi,

I'm looking for documentation on Record.pyd, preferably the Record.py equivalent (for Zope 2.5.1).

I'm trying to figure out if it's possible to add grouping and statistics to ZCatalogs.

Best Regards,
Johan Carlsson

--
Torped Strategi och Kommunikation AB
Johan Carlsson  johanc@torped.se
Mail: Birkagatan 9, SE-113 36 Stockholm, Sweden
Visit: Västmannagatan 67, Stockholm, Sweden
Phone +46-(0)8-32 31 23  Fax +46-(0)8-32 31 83  Mobile +46-(0)70-558 25 24
http://www.torped.se  http://www.easypublisher.com
I googled up these a while ago:

http://www.zope.org/Members/petrilli/WritingADA
http://www.zope.org/Members/Caseman/ZCatalog_for_2.3
http://www.amk.ca/python/writing/DB-API.html

And I would suggest reading the (Z)Catalog sources.

HTH,
Stefan

--On Thursday, 08 August 2002 09:01 +0200 "Johan Carlsson [Torped]" <johanc@torped.se> wrote:
Hi, I'm looking for documentation on Record.pyd, preferably the Record.py equivalent (for Zope 2.5.1).
I'm trying to figure out if it's possible to add grouping and statistics to ZCatalogs.
Best Regards, Johan Carlsson
-- Those who write software only for pay should go hurt some other field. /Erik Naggum/
At 11:39 2002-08-08 +0200, Stefan H. Holek said:
I googled up these a while ago:
http://www.zope.org/Members/petrilli/WritingADA
http://www.zope.org/Members/Caseman/ZCatalog_for_2.3
http://www.amk.ca/python/writing/DB-API.html
Thanks for the tips, even though they didn't help much :-(
And I would suggest reading the (Z)Catalog sources.
Well, it was while reading the ZCatalogBrain that I realized it used Record.pyd :-) In the worst case I'll probably have to read the .c file, even though it's not as fun as reading .py files :-)

Johan
HTH, Stefan
On Thursday 08 August 2002 03:01 am, Johan Carlsson [Torped] wrote:
Hi, I'm looking for documentation on Record.pyd, preferably the Record.py equivalent (for Zope 2.5.1).
I'm trying to figure out if it's possible to add grouping and statistics to ZCatalogs.
Best Regards, Johan Carlsson
I'm not sure the record structure is relevant to grouping. A record is just a fast and efficient way to represent data stored in a tuple (from the catalog metadata) as a Python object with attributes, where the attribute names are mapped to the correct tuple item (column).

Grouping can be done at a higher level, probably using a Python sort or a dictionary. Let's say you have a record/metadata schema like so:

    id (string)
    name (string)
    amount (float)

and you want to group by name and total the amounts; then you could use something like:

    totals = {}
    for r in records:
        totals[r.name] = totals.get(r.name, 0.0) + r.amount
    totals = totals.items()
    totals.sort()

which would give you a list of (name, total amount) tuples sorted by name.

hth,
Casey
At 09:07 2002-08-08 -0400, Casey Duncan said:
Thanks for the input, Casey. (I'm still curious about Record.pyd; it's not the first time I have wondered what it does inside :-)

Would your example scale well? I suppose it's not very lazy?

I envision a brain that would represent a group of records, with methods for calculating statistical data when needed.

Well, it's something I need to think more about anyway.

Best Regards,
Johan Carlsson
On Thursday 08 August 2002 09:22 am, Johan Carlsson [Torped] wrote: [snip]
Thanks for the input, Casey. (I'm still curious about Record.pyd; it's not the first time I have wondered what it does inside :-)
Would your example scale well? I suppose it's not very lazy?
Ooooo, lazy... Nope, it ain't; the example above is quick and dirty. To do it lazily you would get the results presorted from the catalog (by name, for the above) and compute the statistics in batches. That would only load the records that were being grouped. That would be only a minor change to the above code.
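For illustration, a minimal sketch of that presorted approach, assuming the results come back already sorted on the grouping column (for example via the catalog's sort_on support), that names are plain strings, and that each brain has 'name' and 'amount' metadata; the helper name is made up:

    def grouped_totals(records):
        # 'records' is assumed to be a (possibly lazy) sequence of brains,
        # already sorted by name; only one group's running total is kept
        # in memory at a time.
        results = []
        current_name = None
        current_total = 0.0
        for r in records:
            if r.name != current_name:
                if current_name is not None:
                    results.append((current_name, current_total))
                current_name = r.name
                current_total = 0.0
            current_total = current_total + r.amount
        if current_name is not None:
            results.append((current_name, current_total))
        return results

    # e.g. totals = grouped_totals(catalog(sort_on='name'))

Each (name, total) pair is closed off as soon as the name changes, so records are only touched as the walk reaches them.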
I envision a brain that would represent a group of records, with methods for calculating statistical data when needed.
I suppose, but random access to groups would be pretty slow. I would favor a simpler walk over the list, since brains are really designed for a one-to-one brain-to-record mapping. I think maybe a lazy grouping class that takes a lazy sequence as input could be devised.
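As a rough idea of what such a lazy grouping class might look like (all names hypothetical; it assumes the input sequence is already sorted on the grouping field and only scans far enough to find the group you ask for):

    class LazyGroups:
        # Wraps an already-sorted (lazy) sequence of brains and serves
        # groups by index; the underlying sequence is only scanned as far
        # as needed to find the requested group's boundary.
        def __init__(self, seq, field):
            self._seq = seq        # e.g. a LazyMap of catalog brains
            self._field = field    # metadata column to group on, e.g. 'name'
            self._starts = [0]     # start position of each group found so far
            self._pos = 0          # how far the sequence has been scanned
            self._done = len(seq) == 0

        def _scan_to(self, i):
            # Extend self._starts until group i's end is known or input ends.
            while not self._done and len(self._starts) <= i + 1:
                key = getattr(self._seq[self._pos], self._field)
                self._pos = self._pos + 1
                while self._pos < len(self._seq):
                    if getattr(self._seq[self._pos], self._field) != key:
                        self._starts.append(self._pos)
                        break
                    self._pos = self._pos + 1
                else:
                    self._done = 1

        def __getitem__(self, i):
            if not len(self._seq):
                raise IndexError(i)
            self._scan_to(i)
            if i >= len(self._starts):
                raise IndexError(i)
            start = self._starts[i]
            if i + 1 < len(self._starts):
                end = self._starts[i + 1]
            else:
                end = self._pos
            return [self._seq[j] for j in range(start, end)]

    # groups = LazyGroups(results_sorted_by_name, 'name')
    # groups[0] -> list of brains sharing the first name

Asking for groups[i] only reads the underlying sequence up to the end of group i, which is the "only load the records being grouped" behaviour from above.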
Well, it's something I need to think more about anyway.
indeed.
Best Regards, Johan Carlsson
good luck! Maybe you should post a recipe on Zopelabs when you get it done. -Casey
At 09:40 2002-08-08 -0400, Casey Duncan said:
On Thursday 08 August 2002 09:22 am, Johan Carlsson [Torped] wrote: [snip]
I suppose, but random access to groups would be pretty slow. I would favor a simpler walk over the list, since brains are really designed for a one-to-one brain-to-record mapping.
You're probably correct. I'm still thinking about this (and it kind of gives me a headache, or it might be the weather).
I think maybe a lazy grouping class that takes a lazy sequence as input could be devised.
I'll probably end up with a special kind of LazyMap which takes a list of sets, I think. I'm currently trying to figure out how it would work with other Lazy sequences.
good luck!
Thanks, I'll probably need it :-)
Maybe you should post a recipe on Zopelabs when you get it done.
It'll probably end up as a Product/Patch and a Proposal I guess, if I get that far :-)
Cheers,
Johan
Hi,

I'm back on the Brain track :-) What function does the __record_schema__ attribute of the Brains have?

Does it do anything else than provide the has_key feature?

    def has_key(self, key):
        return self.__record_schema__.has_key(key)

Best Regards,
Johan Carlsson
__record_schema__ is simply a dictionary which maps field names to column positions (ints), so that the record knows the index of each field in the record tuples.

See line 154 of Catalog.py to see how it is initialized to the Metadata schema plus a few extra columns for catalog rid and scores.

-Casey

On Friday 09 August 2002 07:17 am, Johan Carlsson [Torped] wrote:
Hi, I'm back on the Brain track :-) What function does the __record_schema__ attribute of the Brains have?
Does it do anything else than provide the has_key feature?

    def has_key(self, key):
        return self.__record_schema__.has_key(key)
Best Regards, Johan Carlsson
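For anyone else following along, a rough pure-Python approximation of what the C type does with that schema; the application field names below are invented, and the data_record_* entries stand for the bookkeeping columns (rid and scores) the Catalog appends:

    class PyRecord:
        # Rough pure-Python stand-in for Record.pyd behaviour: attribute
        # access is resolved through __record_schema__, which maps column
        # names to positions in the underlying metadata tuple.
        __record_schema__ = {'id': 0, 'title': 1, 'amount': 2,
                             'data_record_id_': 3,
                             'data_record_score_': 4,
                             'data_record_normalized_score_': 5}

        def __init__(self, data):
            self._data = data          # the metadata tuple from the catalog

        def __getattr__(self, name):
            schema = self.__record_schema__
            if name in schema:
                return self._data[schema[name]]
            raise AttributeError(name)

        def has_key(self, key):
            return key in self.__record_schema__

    # r = PyRecord(('001', 'Report', 42.0, 12345, 1.0, 100))
    # r.title -> 'Report'

The real Record.c can also keep the data as attributes rather than a tuple, depending on how it is constructed, which Johan notes further down the thread.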
At 08:59 2002-08-09 -0400, Casey Duncan said:
__record_schema__ is simply a dictionary which maps field names to column positions (ints) so that the record knows the index of each field in the record tuples.
See line 154 of Catalog.py to see how it is initialized to the Metadata schema plus a few extra columns for catalog rid and scores.
Hi Casey (and zope-dev),

Thanks! After some experimenting I realized that :-)

One of the reasons I asked is that I am thinking about how to implement a "SELECT col1 AS 'name', ..." type of feature for ZCatalogs.

I'm not entirely sure it's a good idea to start with, but I'm thinking along the lines of large ZCatalogs (by large I mean a lot of columns in the self.data structure). If all columns are copied, the brains grow larger as well; by selecting explicitly which columns should be copied to the brain, they would be lighter.

Now that I understand how the data tuples are copied to the brain, I'm not at all sure that adding a filter when copying the tuple will optimize things, because of the overhead in the filter process.

(The way that I "solved" the group/calc part of my "project", I don't think it will lead to memory bloat. I'm going to implement a LazyGroupMap which takes an extra parameter (a list of IISets). Each brain created in the LazyMap will have methods for calculations directly on the self.data in the Catalog. The data itself will not be stored. There will most probably be a pre-calculate method that calculates all applicable variables and caches the result.)

One way to reduce memory consumption in wide Catalogs would be to have LazyBrains (vertical laziness; there might be reasons why that would be a bad idea which I'm not aware of).

Another way would be to have multiple data attributes in the Catalog, like tables, and to join the tuples from them with a "from table1, table2" statement. In this way it would be possible to control the width of the brains. It would also be possible for the object being indexed to tell the catalog in which "tables" it should store metadata.

There have been some proposals (ObjectHub et al.) which I read some time ago. I didn't feel then that they were what I was looking for. Please tell me if there have been any proposals or discussions regarding this.

Regards,
Johan Carlsson
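To make the LazyGroupMap idea a bit more concrete, here is a very rough sketch. All names are hypothetical, the list of groups is assumed to arrive as IISets of catalog rids, and it reaches straight into the Catalog's internal self.data mapping (rid -> metadata tuple), which is exactly the kind of coupling discussed below:

    class GroupBrain:
        # One 'row' of a grouped result: a set of rids plus statistics
        # computed on demand from the catalog's metadata tuples and cached.
        def __init__(self, catalog, rids, schema):
            self._catalog = catalog    # the Catalog instance
            self._rids = rids          # IISet of rids belonging to this group
            self._schema = schema      # column name -> tuple position
            self._cache = {}

        def total(self, column):
            key = ('total', column)
            if key not in self._cache:
                pos = self._schema[column]
                data = self._catalog.data     # internal rid -> tuple mapping
                t = 0.0
                for rid in self._rids:
                    t = t + data[rid][pos]
                self._cache[key] = t
            return self._cache[key]

        def average(self, column):
            n = len(self._rids)
            if n:
                return self.total(column) / n
            return 0.0

    class LazyGroupMap:
        # Sequence of GroupBrains built lazily from a list of IISets.
        def __init__(self, catalog, sets, schema):
            self._catalog = catalog
            self._sets = sets          # one IISet of rids per group
            self._schema = schema

        def __len__(self):
            return len(self._sets)

        def __getitem__(self, i):
            return GroupBrain(self._catalog, self._sets[i], self._schema)

The plumbing that would actually produce the per-group IISets is left out here; nothing is copied out of self.data until a statistic is asked for.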
On Saturday 10 August 2002 11:25 am, Johan Carlsson [Torped] wrote:
At 08:59 2002-08-09 -0400, Casey Duncan said:
__record_schema__ is simply a dictionary which maps field names to column positions (ints) so that the record knows the index of each field in the record tuples.
See line 154 of Catalog.py to see how it is initialized to the Metadata schema plus a few extra columns for catalog rid and scores.
Hi Casey (and zope-dev), Thanks! After some experimenting I realized that :-)
One of the reasons I asked is that I am thinking about how to implement a "SELECT col1 AS 'name', ..." type of feature for ZCatalogs.
I'm not entirely sure it's a good idea to start with, but I'm thinking along the lines of large ZCatalogs (by large I mean a lot of columns in the self.data structure). If all columns are copied, the brains grow larger as well; by selecting explicitly which columns should be copied to the brain, they would be lighter.
Now that I understand how the data tuples are copied to the brain, I'm not at all sure that adding a filter when copying the tuple will optimize things, because of the overhead in the filter process.
This occurs lazily, so the savings would be heavily dependent on the application. For most web apps presenting small batches of records, the savings in limiting columns returned would be pretty minimal. The general usage is to put a minimal set of columns in metadata, only enough to create a results page, and to load the objects in cases where large, dynamic or otherwise arbitrary data elements are needed.
(The way that I "solved" the group/calc part of my "project", I don't think it will lead to memory bloat. I'm going to implement a LacyGroupMap which take an extra parameter (a list of IISet). Each brain created in the LacyMap will have methods for calculations directly on the self.data in the Catalog. The data it self will not be stored. There will most probably be a pre calculate method that calculate all variables that are applicable and caches the result.)
Sounds like a pretty good solution. However, I would be hesitant to create direct dependencies on the internal Catalog data structures if you can help it (sometimes you can't, though).
One way to reduce memory consumption in wide Catalogs would be to have LazyBrains (vertical laziness; there might be reasons why that would be a bad idea which I'm not aware of).
That would pretty much require a rewrite of the Catalog as the data structures would need to be completely different. It would introduce significant database overhead since each metadata field would need to be loaded individually. I think that would negate whatever performance benefit metadata might have over simply loading the objects.
Another way would be to have multiple data attributes in the Catalog, like tables, and to join the tuples from them with a "from table1, table2" statement. In this way it would be possible to control the width of the brains. It would also be possible for the object being indexed to tell the catalog in which "tables" it should store metadata.
Yes, this would be better. You could have different sets of metadata for each catalog record. You would select which one you wanted at query time.
There have been some proposals (ObjectHub et al.) which I read some time ago. I didn't feel then that they were what I was looking for. Please tell me if there have been any proposals or discussions regarding this.
I don't think so. If you feel strongly about this, write up a proposal and provide some use cases for discussion.
Regards, Johan Carlsson
-Casey
At 21:28 2002-08-10 -0400, Casey Duncan said:
On Saturday 10 August 2002 11:25 am, Johan Carlsson [Torped] wrote:
Now that I understand how the data tuples are copied to the brain, I'm not at all sure that adding a filter when copying the tuple will optimize things, because of the overhead in the filter process.
This occurs lazily, so the savings would be heavily dependent on the application. For most web apps presenting small batches of records, the savings in limiting columns returned would be pretty minimal.
But there must be some savings, though, from implementing Record.pyd in C; but of course I suppose Record.pyd was first used for ZSQL?

An easy filter would be to let __record_schema__ control which columns to save. As it works today, __record_schema__ must point to a sequence starting at 0, so I can't specify indexes into the tuple like this:

    __record_schema__ = {'hey': 12, 'dude': 22}

Maybe this is "easy" to change in Record.pyd, or I could just implement it in a special brain base class?

After revisiting Record.c I realized that the tuple from the catalog's self.data is stored either as a tuple (or as a C array, I suppose?) in a Record, or as attributes, depending on what you provide to the constructor. I suppose copying data to a C array is much faster than creating attributes on each brain, but if the array is large and the number of attributes that need to be set is small, it might be the other way around. I have no idea where they would break even. Maybe I will just settle for having two different brain base classes and use the one that suits the current need.
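A sketch of the brain-base-class variant (rather than touching Record.c); the column names and class name are made up, and full_schema is assumed to be the catalog's name-to-position mapping:

    class SlimBrain:
        # Copies only the columns named in __wanted_columns__ out of the
        # full metadata tuple, instead of carrying the whole row around.
        __wanted_columns__ = ('name', 'amount')    # hypothetical selection

        def __init__(self, data, full_schema):
            # data: the full metadata tuple; full_schema: name -> position
            for name in self.__wanted_columns__:
                setattr(self, name, data[full_schema[name]])

    # brain = SlimBrain(row_tuple, full_schema)

Whether the extra Python-level loop beats Record.c's wholesale copy is the break-even question above.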
The general usage is to put a minimal set of columns in metadata, only enough to create a results page and load the objects in cases where either large, dynamic or otherwise arbitrary data elements are needed.
Yes, and that is somewhat restricting. My current applications use several different catalogs to keep the width of the metadata down. The downside of this approach is that I end up with a lot of catalogs, and that there are many times more things to do for management, e.g. I must reindex all catalogs instead of just one.

My primary goals are:

1. Get a general ZCatalog that can be used for all ZCatalog requirements (not only site searches).

2. Implement features that remove the need for an external RDBMS (for instance, report generation is hard with ZCatalogs because of the lack of grouping/statistics).

3. Make ZCatalogs easier to manage. For instance, the need to update index and metadata definitions every time you change your application's data structure is annoying, especially at development time. Objects could tell the ZCatalog which metadata and indexes they want, removing the need to add them manually. Of course you will still need to clean up the ZCatalog from time to time.
(The way that I "solved" the group/calc part of my "project", I don't think it will lead to memory bloat. I'm going to implement a LacyGroupMap which take an extra parameter (a list of IISet). Each brain created in the LacyMap will have methods for calculations directly on the self.data in the Catalog. The data it self will not be stored. There will most probably be a pre calculate method that calculate all variables that are applicable and caches the result.)
Sounds like a pretty good solution. However, I would be hesitant to create direct dependencies on the internal Catalog data structures if you can help it (sometimes you can't, though).
I could "soften" the dependency by providing the catalog with an interface for calculations and give the brain an reference to the catalog it self and use the interface on that reference.
One way to reduce memory consumption in wide Catalogs would be to have LazyBrains (vertical laziness; there might be reasons why that would be a bad idea which I'm not aware of).
That would pretty much require a rewrite of the Catalog as the data structures would need to be completely different. It would introduce significant database overhead since each metadata field would need to be loaded individually. I think that would negate whatever performance benefit metadata might have over simply loading the objects.
I'm not sure it would be necessary to change the data structure; the brain could use the same method the LazyMap uses to load the data. But a LazyBrain would need to save all applicable data at once to be efficient. The difference would be that the brain would not fetch any data before the first attribute is accessed. When the first one is accessed, all applicable data would be copied to attributes according to __record_schema__. This would probably not be more efficient for regular use of brains, but calculated group brains wouldn't need to store the data at all if they only used calculated fields.
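Something like that deferred copy could look like the sketch below (hypothetical names; whether it actually beats the eager copy is exactly the open question):

    class DeferredBrain:
        # Leaves the metadata row alone until the first attribute access;
        # at that point every schema column is copied onto the instance.
        def __init__(self, fetch, schema):
            self._fetch = fetch      # callable returning the metadata tuple
            self._schema = schema    # column name -> tuple position

        def __getattr__(self, name):
            # Only reached for attributes not already set on the instance.
            schema = self.__dict__.get('_schema', {})
            if name not in schema:
                raise AttributeError(name)
            data = self._fetch()               # load the row on first use
            for key in schema.keys():          # ...then copy every column at once
                self.__dict__[key] = data[schema[key]]
            return self.__dict__[name]

    # brain = DeferredBrain(lambda rid=rid: catalog_data[rid], schema)
    # nothing is fetched until e.g. brain.name is first touched

A group brain that only ever uses calculated fields would then never trigger the copy at all.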
Another way would be to have multiple data attributes in the Catalog, like tables, and to join the tuples from them with a "from table1, table2" statement. In this way it would be possible to control the width of the brains. It would also be possible for the object being indexed to tell the catalog in which "tables" it should store metadata.
Yes, this would be better. You could have different sets of metadata for each catalog record. You would select which one you wanted at query time.
Yeah I like it as well. It would also require a more SQL-like query interface.
There have been some proposals (ObjectHub et al.) which I read some time ago. I didn't feel then that they were what I was looking for. Please tell me if there have been any proposals or discussions regarding this.
I don't think so. If you feel strongly about this, write up a proposal and provide some use cases for discussion.
Yes, but implementation first :-) I'm very XP in that aspect. I find code easier to communicate with than specifications :-) Or at least Python code; I don't find C code easier to communicate with.

Cheers,
Johan Carlsson
Participants (3):
- Casey Duncan
- Johan Carlsson [Torped]
- Stefan H. Holek