RE: [Zope] Need a list of words not indexed by Catalog
-----Original Message-----
From: Jason Spisak [mailto:webmaster@hiretechs.com]
Sent: Monday, September 13, 1999 12:31 PM
To: zope@zope.org
Subject: [Zope] Need a list of words not indexed by Catalog
Can we get a public list of the words not indexed by the Catalog?
You asked for it: stop_words=( 'am', 'ii', 'iii', 'per', 'po', 're', 'a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'can', 'cannot', 'cant', 'con', 'could', 'couldnt', 'cry', 'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i', 'ie', 'if', 'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'made', 'many', 'may', 'me', 'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'per', 'perhaps', 'please', 'pre', 'put', 'rather', 're', 'same', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should', 'show', 'side', 'since', 'sincere', 
'six', 'sixty', 'so', 'some', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', 'still', 'such', 'take', 'ten', 'than', 'that', 'the', 'their', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these', 'they', 'thick', 'thin', 'third', 'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'toward', 'towards', 'twelve', 'twenty', 'two', 'un', 'under', 'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well', 'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves', )

This is defined in lib/python/SearchIndex/UnTextIndex.py and TextIndex.py. ZCatalog uses UnTextIndexes. I believe Confera uses TextIndexes.

This list is obviously not going to work for lots of applications. What if you want to search for words like 'six' and 'five'? What if you're indexing Python code and want to find all occurrences of 'while' and 'for'? What if you're indexing French or German?

The reason we have these as stop words is that, for most (English) 'documents' (meaning large bodies of text), these are the most common words encountered, and each is more likely to appear in a given document than not. The technical reasons get into the very precise and voodooish concepts of relevance, precision, and result set size. This is very deep stuff.

Suffice it to say that without stop words, indexes grow huge, so clip this list at your own risk. Be aware that while taking out 'many' may improve your ability to index your employees, it may also drastically reduce relevance and precision, and increase the size of your result set, for a different 'kind' of text corpus.
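To make the effect of the list concrete, here is a minimal sketch of a stop-word-filtering inverted index. It is not the UnTextIndex code: `STOP_WORDS` is only a small excerpt of the list above, and `tokenize`, `build_index`, and `search` are hypothetical helpers made up for illustration.

```python
import re
from collections import defaultdict

# Small excerpt of the stop word list quoted above (not the full list).
STOP_WORDS = frozenset(['a', 'for', 'five', 'many', 'six', 'the', 'while'])

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r'[a-z0-9]+', text.lower())

def build_index(documents, stop_words=STOP_WORDS):
    """Map each non-stop word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            if word not in stop_words:
                index[word].add(doc_id)
    return index

def search(index, word):
    """Return the sorted ids of documents containing the word."""
    return sorted(index.get(word.lower(), ()))

# Hypothetical employee records: id -> name.
employees = {1: 'Max Many', 2: 'Max Power'}
index = build_index(employees)

print(search(index, 'Max'))   # [1, 2]
print(search(index, 'Many'))  # [] because 'many' was never indexed
```

In this toy version, a word on the stop list is simply never stored, so no query for it can match, however short the indexed value is.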
I just spent a few hours being tortured trying to find out why a guy in the database of 10,000 whose name is 'Max Many' doesn't come up on a last name search, and guess what: the Catalog doesn't index his last name, the word 'many'. Surprise! This is the third gotcha word we've experienced (don't try to find C/C++ in a document), and it makes people doubt the software. (I explain about the size of indexes and such, and I understand it, but every surprise word you can't find is another step backward.)
The current TextIndexing machinery was intended to be used from a document-centric standpoint. The need to text index small values was not considered way back when the first TextIndex was written. Given the amount of feedback we have received on this, it is obviously on our radar. -Michel
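One way to picture the difference between document-sized and small values is an index whose stop word list is a per-index setting. The sketch below is purely hypothetical and is not an existing Zope API: `SimpleTextIndex` and its `stop_words` argument are invented names, and the whitespace splitting is a simplification.

```python
class SimpleTextIndex:
    """Toy text index with a configurable stop word list (hypothetical)."""

    def __init__(self, stop_words=()):
        self.stop_words = frozenset(w.lower() for w in stop_words)
        self._postings = {}  # word -> set of object ids

    def index_object(self, obj_id, text):
        """Record every non-stop word of the text under the object id."""
        for word in text.lower().split():
            if word in self.stop_words:
                continue
            self._postings.setdefault(word, set()).add(obj_id)

    def query(self, word):
        """Return the sorted ids of objects whose text contains the word."""
        return sorted(self._postings.get(word.lower(), ()))

# Large bodies of text: filter common words to keep the index small.
body_index = SimpleTextIndex(stop_words=['many', 'the', 'of'])
# Short values such as last names: keep every word.
name_index = SimpleTextIndex()

body_index.index_object(1, 'Many of the candidates')
name_index.index_object(1, 'Max Many')

print(body_index.query('many'))  # []
print(name_index.query('many'))  # [1]
```

The trade-off is the one described above: the index without stop words finds every name, but applied to long documents it would grow much larger and return less precise results.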
Michel, I understand completely. Now I can look for these possibilities and even publish the list to the higher-ups, so that there is no misunderstanding. I am fully aware of the work you guys are doing to allow for variable stop words, letters. Thanks Michel! Jason Spisak
Participants (2): Jason Spisak, Michel Pelletier