[Zope-dev] KeywordIndex performance / multiunion

Thu Nov 6 15:46:51 EST 2003

[Seb Bacon]
>> But main the reason I'm posting is to wonder if there any reason not
>> to use the multiunion operator instead of the union operator in
>> UnIndex.py... it should be faster, right?  It seems a touch faster in
>> some informal tests.

[Casey Duncan]
> Yes, it probably should be used. I think it is much faster when
> unioning very small and very large sets as well. Tim would know
> better, tho.

time.clock() knows best <wink>.  multiunion() has a systemic algorithmic
advantage when applied to at least 3 input sets:  its runtime is
proportional to the sum of the sizes of the input sets.  A plain union of
two sets also has runtime proportional to the sum of the sizes of its input
sets, but if all you've got is plain two-argument union, it's not possible
to unite N sets in time proportional to the sum of their sizes.  You have to
pick some pair to unite first, and at worst that result has a size equal to
the sum of the sizes of its inputs, and tnat larger intermediate result has
to feed in again to a later union (etc).

OTOH, two-argument union crawls over its input sets once each.  multiunion()
crawls over each of its inputs essentially six or seven times each (once to
extract the values, once for each of 4 passes of a bytewise radix sort, once
over the sorted result to weed out duplicates, and once again over the
duplicate-free sorted result to stuff the integers back into a set object).

That implies the relative timing results can be data-dependent -- and they
are.  The more input sets you have, though, the harder it is for a chain of
two-input unions to overcome multiunion's ever-growing algorithmic
advantage.

Of course, if the data isn't already in memory, none of that matters unless
you've got many sets to unite (data fetch time will dominate).