[Zope-dev] RE: [Zope] Need a list of words not indexed by Catalog

Michel Pelletier michel@digicool.com
Mon, 13 Sep 1999 15:53:22 -0400


> -----Original Message-----
> From: Jason Spisak [mailto:webmaster@hiretechs.com]
> Sent: Monday, September 13, 1999 12:31 PM
> To: zope@zope.org
> Subject: [Zope] Need a list of words not indexed by Catalog
> 
> 
> Can we get a public list of the words not indexed by the Catalog?

You asked for it:

stop_words=(
    'am', 'ii', 'iii', 'per', 'po', 're', 'a', 'about', 'above',
    'across',
    'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone',
    'along', 'already', 'also', 'although', 'always', 'am', 'among',
    'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any',
    'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are',
    'around',
    'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes',
    'becoming', 'been', 'before', 'beforehand', 'behind', 'being',
    'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both',
    'bottom', 'but', 'by', 'can', 'cannot', 'cant', 'con', 'could',
    'couldnt', 'cry', 'describe', 'detail', 'do', 'done', 'down', 'due',
    'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else',
    'elsewhere', 'empty', 'enough', 'even', 'ever', 'every', 'everyone',
    'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty',
    'fill', 'find', 'fire', 'first', 'five', 'for', 'former',
    'formerly',
    'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get',
    'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her',
    'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers',
    'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred',
    'i',
    'ie', 'if', 'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it',
    'its', 'itself', 'keep', 'last', 'latter', 'latterly', 'least',
    'less', 'made', 'many', 'may', 'me', 'meanwhile', 'might', 'mill',
    'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much',
    'must',
    'my', 'myself', 'name', 'namely', 'neither', 'never',
    'nevertheless',
    'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not',
    'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once',
    'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our',
    'ours', 'ourselves', 'out', 'over', 'own', 'per', 'perhaps',
    'please', 'pre', 'put', 'rather', 're', 'same', 'see', 'seem',
    'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should',
    'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so', 'some',
    'somehow', 'someone', 'something', 'sometime', 'sometimes',
    'somewhere', 'still', 'such', 'take', 'ten', 'than', 'that', 'the',
    'their', 'them', 'themselves', 'then', 'thence', 'there',
    'thereafter', 'thereby', 'therefore', 'therein', 'thereupon',
    'these',
    'they', 'thick', 'thin', 'third', 'this', 'those', 'though',
    'three',
    'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too',
    'toward', 'towards', 'twelve', 'twenty', 'two', 'un', 'under',
    'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well',
    'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where',
    'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon',
    'wherever', 'whether', 'which', 'while', 'whither', 'who',
    'whoever',
    'whole', 'whom', 'whose', 'why', 'will', 'with', 'within',
    'without',
    'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves',
    )

This tuple is defined in both lib/python/SearchIndex/UnTextIndex.py and
TextIndex.py.  ZCatalog uses UnTextIndexes.  I believe Confera uses
TextIndexes.

This list is obviously not going to work for lots of applications.  What
if you want to search for words like 'six' and 'five'?  What if you're
indexing Python code and want to find all occurrences of 'while' and
'for'?  What if you're indexing French or German?  We have these as stop
words because, for most (English) 'documents' (meaning large bodies of
text), these are the most commonly encountered words, and any one of
them is more likely to appear in a given document than not.  The
technical reasons get into the very precise and voodooish concepts of
relevance, precision, and result set size.  This is very deep stuff.
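
To make the "more likely to appear than not" point concrete, here is a
rough sketch that measures what fraction of documents contain each word.
It is only an illustration, not the SearchIndex code: the
document_frequency helper and the three sample sentences are made up for
this example.

```python
import re

def document_frequency(docs):
    """Map each word to the fraction of documents that contain it."""
    counts = {}
    for doc in docs:
        # Use a set so a word counts once per document.
        for word in set(re.findall(r'[a-z]+', doc.lower())):
            counts[word] = counts.get(word, 0) + 1
    return {word: n / len(docs) for word, n in counts.items()}

# Hypothetical sample corpus, purely for illustration.
docs = [
    'The catalog indexes the text of each object.',
    'Many of the words in a document are very common.',
    'The index maps a word to the records that contain it.',
]

df = document_frequency(docs)

# Words found in more than half of the documents.
common = sorted(word for word, freq in df.items() if freq > 0.5)
print(common)  # ['a', 'of', 'the']
```

Words like these match nearly every document, so indexing them costs
space and adds little to narrowing a search.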

Suffice it to say, without stop words indexes grow huge, so clip this
list at your own risk.  Be aware that while taking out 'many' may improve
your ability to index your employees, it may also drastically reduce
relevance and precision, and increase the size of your result set, for
a different 'kind' of text corpus.
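
If you do decide to clip the list, the idea looks roughly like the
sketch below.  The trim_stop_words and split_words helpers are
hypothetical names used only for illustration; they are not the
UnTextIndex API, and the four-word tuple stands in for the full list.
How the trimmed tuple gets back into your copy of UnTextIndex.py is not
covered here.

```python
import re

def trim_stop_words(stop_words, keep):
    """Return a copy of stop_words without the words listed in keep."""
    keep = set(word.lower() for word in keep)
    return tuple(word for word in stop_words if word not in keep)

def split_words(text, stop_words):
    """Lowercase text, split it into words, and drop the stop words."""
    words = re.findall(r'[a-z0-9]+', text.lower())
    return [word for word in words if word not in stop_words]

# Abbreviated stand-in for the full stop_words tuple.
stop_words = ('made', 'many', 'may', 'me')
custom = trim_stop_words(stop_words, ['Many'])

print(split_words('Max Many', stop_words))  # ['max']
print(split_words('Max Many', custom))      # ['max', 'many']
```

With the stock list the last name never reaches the index; with 'many'
removed it does, at the cost of also indexing every ordinary use of the
word.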

> I just spent a few hours being tortured trying to find out why a guy
> in the Database of 10,000 whose name is 'Max Many' doesn't come up on
> a last name search, and guess what, the Catalog doesn't index his last
> name.  The word 'many'.  Surprise! This is the third gotcha word we've
> experienced (don't try to find C/C++ in a document) and it makes
> people doubt the software.  (I explain about the size of indexes and
> such, and I understand it, but every surprise word you can't find is
> another step backward.)

The current TextIndexing machinery was intended to be used from a
document-centric standpoint.  The need to text index small values was
not considered way back when the first TextIndex was written.  Given the
amount of feedback we have received on this, it is obviously on our
radar.

-Michel