[Zope-CMF] Dublin Core Subject Qualifier Implementation

seb bacon seb@jamkit.com
20 Feb 2002 10:32:54 +0000


Sean, 

That's a really interesting idea.  It would be a great thing to
integrate with the CMF.  

Here are some of my thoughts, since you asked ;-)

The namespace qualifier seems like a good idea.  
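
As a rough illustration of the colon syntax, here is a minimal Python
sketch of splitting a qualified subject string.  The helper name
split_subject and the KNOWN_VOCABULARIES set are made up for this
example.  It assumes that only registered prefixes count as qualifiers,
so plain text that happens to contain a colon passes through untouched.

```python
# Hypothetical registry of vocabulary names; the real tool would own this.
KNOWN_VOCABULARIES = {"NAICS", "IPTC", "NASDAQ"}


def split_subject(subject, known=KNOWN_VOCABULARIES):
    """Return (vocabulary, code); vocabulary is None for unqualified text."""
    prefix, sep, rest = subject.partition(":")
    # Treat the prefix as a qualifier only if it names a known vocabulary.
    if sep and prefix in known:
        return prefix, rest
    return None, subject
```

With this, "NAICS:511110" splits into ("NAICS", "511110"), while
"Media Companies" comes back as (None, "Media Companies").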

The language aspect should be dealt with by i18n structures rather than
on the application level, e.g. the system locale (I've never looked at
ZBabel etc so I'm not up on the accepted way of doing this).

The UI problem of selecting a subject from 1000s has been discussed on
the list before - have a search around for ideas.  My feeling is that
the best way of doing this is to arrange the subjects hierarchically.
For example, there are 17 categories in the IPTC subjects.  The UI
should allow you to select an entire category as well as its
subdivisions.

The internal representation should be an XML-like tree, which you could
manipulate in a similar way to XML (like a SAX parser, for example). 
The tool could have an 'import' function, so people can load in
specialist vocabularies - possibly from an XML format?
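
To make the import idea a bit more concrete, here is a sketch using
Python's standard xml.etree.ElementTree module.  The XML layout (nested
<subject> elements with id and name attributes) is purely an assumption
for illustration; I don't know of any vocabulary that is published in
this format.  The function name load_vocabulary and the sample data are
also hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical import format, invented for this sketch.
SAMPLE = """
<vocabulary id="IPTC">
  <subject id="01000000" name="Arts, culture and entertainment">
    <subject id="01016000" name="Television"/>
  </subject>
</vocabulary>
"""


def load_vocabulary(xml_text):
    """Return (vocabulary_id, names, parents) parsed from xml_text.

    names maps subject id -> name; parents maps subject id -> parent id,
    with None as the parent of root subjects.
    """
    root = ET.fromstring(xml_text)
    names, parents = {}, {}

    def walk(element, parent_id):
        # Visit direct <subject> children, then recurse into each one.
        for child in element.findall("subject"):
            subject_id = child.get("id")
            names[subject_id] = child.get("name")
            parents[subject_id] = parent_id
            walk(child, subject_id)

    walk(root, None)
    return root.get("id"), names, parents
```

Storing a flat id -> parent mapping rather than nested objects is just
one option; it keeps lookups by id cheap, at the cost of scanning the
mapping to find children.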

The job of mapping between id and name shouldn't be tricky - you should
only ever specify an id to the tool, and it could always return (id,
name) tuples.  I noticed that a lot of subjects in the specs you mention
have descriptions too - you could make it an (id, name, description)
tuple, or something similar, to expose this.

Regarding vocabulary, you could optionally supply a vocabulary id to
each method, or you could rely on a default vocabulary which can be set
by the user.
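
A minimal sketch of the "optional vocabulary with a default" behaviour
might look like the following.  The class name VocabularyRegistry and
the addVocabulary method are assumptions for the example, and returning
the vocabulary id on success is just one way to read "return None if it
doesn't exist".

```python
class VocabularyRegistry:
    """Hypothetical holder for named vocabularies plus a default."""

    def __init__(self):
        self._vocabularies = {}  # vocabulary id -> {subject id: name}
        self._default = None

    def addVocabulary(self, vocabulary, subjects):
        """Register a vocabulary (hypothetical helper for this sketch)."""
        self._vocabularies[vocabulary] = dict(subjects)

    def setDefaultVocabulary(self, vocabulary):
        """Set the default; return its id, or None if it doesn't exist."""
        if vocabulary not in self._vocabularies:
            return None
        self._default = vocabulary
        return vocabulary

    def getSubject(self, subject_id, vocabulary=None):
        """Return an (id, name) tuple, falling back to the default."""
        vocabulary = vocabulary or self._default
        if vocabulary is None:
            raise ValueError("no vocabulary given and no default set")
        return subject_id, self._vocabularies[vocabulary][subject_id]
```

An explicit vocabulary argument always wins; the default is only
consulted when the caller passes nothing.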

I'd be tempted to miss out the icon thing, although it's a nice idea. 
It's only any use if the application requires it, and someone has the
time to generate 1000s of icons - wouldn't this be a minority of cases? 
Anyway, here's my take on the interface:

 getSubject(subject_id, vocabulary=None):
   "return (id, name) tuple"

 searchSubjects(search_term, vocabulary=None):
   """do a text search of subject names,
      return list of (id, name) tuples"""

 getChildSubjects(subject_id, vocabulary=None):
   "return list of children of subject_id"

 getParentSubject(subject_id, vocabulary=None):
   "return parent of subject_id"

 getSiblingSubjects(subject_id, vocabulary=None):
   "return siblings of subject_id"

 getRootSubjects(vocabulary=None):
   "return list of root subjects"

 setDefaultVocabulary(vocabulary):
   "set a default vocabulary, return None if it doesn't exist"
 
 setSubject(subject_id, subject_name, vocabulary):
   "add a new subject to vocabulary"

 getVocab(subject_id):
   """return an (id, name, description) tuple for the vocabulary
      of the specified subject"""
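
For what it's worth, the navigation methods could be sketched over a
flat id -> parent mapping as below.  This is only an in-memory
illustration for a single vocabulary (so the vocabulary argument is left
out); the class name SubjectTree and the placeholder ids in the usage
note are invented, and a real tool would want persistence and indexing.

```python
class SubjectTree:
    """Hypothetical single-vocabulary store: id -> name, id -> parent id."""

    def __init__(self, names, parents):
        self._names = dict(names)
        self._parents = dict(parents)  # root subjects map to None

    def getSubject(self, subject_id):
        return subject_id, self._names[subject_id]

    def getRootSubjects(self):
        return [self.getSubject(s)
                for s, p in self._parents.items() if p is None]

    def getChildSubjects(self, subject_id):
        return [self.getSubject(s)
                for s, p in self._parents.items() if p == subject_id]

    def getParentSubject(self, subject_id):
        parent = self._parents[subject_id]
        return None if parent is None else self.getSubject(parent)

    def getSiblingSubjects(self, subject_id):
        # Siblings share a parent; the subject itself is excluded.
        parent = self._parents[subject_id]
        return [self.getSubject(s) for s, p in self._parents.items()
                if p == parent and s != subject_id]

    def searchSubjects(self, search_term):
        # Case-insensitive substring match on subject names.
        term = search_term.lower()
        return [(s, n) for s, n in self._names.items()
                if term in n.lower()]
```

Given names {"a": "Arts", "a1": "Television", "a2": "Cinema"} and
parents {"a": None, "a1": "a", "a2": "a"}, getChildSubjects("a") returns
both subdivisions, which is what a "select the whole category" UI would
need.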



On Tue, 2002-02-19 at 20:46, sean.upton@uniontrib.com wrote:
> Hey everybody,
> I am looking at implementing Dublin Core Qualifiers for Subject metadata as
> a means of expressing subjects within multiple controlled and standardized
> vocabularies (namely, IPTC subjects for news and sports stuff, and NAICS or
> SIC codes for Business information), in addition to supporting plain-text
> subject vocabularies as well.  Is there any established pattern or syntax
> for dealing with subject codes this way in the CMF?  I haven't found
> anything, so I have been thinking about a solution... my thoughts are below.
> 
> The Dublin Core Qualifiers spec has several recommended element encoding
> schemes for LC and medical subjects, but nothing excludes other
> industry-standard subject vocabularies, such as IPTC (news/media, worldwide)
> or NAICS (used by North American governments, business/economic news, and
> yellow pages), or market names (stock tickers).
> 	http://www.dublincore.org/documents/dcmes-qualifiers/#subject
> 	http://www.iptc.org/
> 	http://www.census.gov/epcd/www/naics.html
> 
> My first hunch is that the best way to convey a namespace/qualifier for a
> subject code system is with a colon in the text, separating the vocabulary
> "NAME" (in dcmes-qualifiers terms).  My second hunch is that I need to create
> a subject lookup tool that performs lookups for "human-readable"
> counterparts for codes, so that codes with qualifiers can get a description
> that makes sense to a content user.  I also think such a framework might be
> useful for content creators if the user interface for metadata entry enabled
> efficient lookup with these codes (the biggest UI issue is that the number
> of these codes may be on the order of thousands; something like a popup
> search/browse dialog might be appropriate).
> 
> Example lookup/translation input/output:
> 
> 	NAICS:511110  --> "Newspaper Publishers"
> 	IPTC:01016000 --> "Television"
> 	NASDAQ:MSFT --> "Microsoft Corporation"
> 	Media Companies --> "Media Companies" (verbatim translation of
> 	unqualified text)
> 
> This tool should support internationalization (or is it localization?) of
> description lookup, because these vocabularies are often defined by
> multi-national organizations (thus multi-lingual lookup tables might exist,
> for example IPTC supports most European languages, Turkish, and Arabic);
> this isn't to say that one need implement every language a vocabulary
> supports to satisfy this, but that the interface for this tool should
> support a language encoding parameter for this purpose, so that a
> multi-lingual site can support multiple languages with one vocabulary
> (SignOnSanDiego publishes content in English and some Spanish).
> 
> In use of this tool, there would still be interfacing issues to make this
> work with the metadata tool and content types, both in terms of supporting a
> user interface for massive amounts of subject codes, as well as determining
> when to display the code and when to display the lookup description...
> 
> I'd be interested to see what people think about this.  I wrote some
> interface documentation, which is pasted below that might help in explaining
> my idea.  Thoughts?
> 
> Thanks,
> Sean
> 
> #####################################
> ##################################### 
> 
> import Interface
> 
> class portal_subjectlookup(Interface.Base):
>       """
>         Interface for registry of subject code qualifier
>         vocabularies.  Among other things that a tool 
>         implementing this interface should do is provide the
>         ability to query with a code, language, and
>         vocabulary, and get descriptions.
>       """
> 
>       def getDescriptionFromCode(code, vocabulary=None, language='en-US'):
>           """
>             Lookup code in registry specified by vocabulary for language 
>             specified in language.
> 
>             Pre-condition:  code is a string object and is not None
>             Post-condition: a string is returned with a human-readable
>                             text description (string) for a code in
>                             the language specified, if available.
> 
>                             If a registry implementation in the tool is not
>                             available in the language specified, a default 
>                             language should be used.
> 
>                             If no viable option can be found in lookup, 
>                             method should return None.
> 
>             Notes: sorry about the ethnocentrism in the language default.
>           """
> 
>       def findCodeByKeyword(query, vocabulary=None, language='en-US'):
>           """
>             Used primarily by content producers, or agents on their behalf.
> 
>             This method is used to find a correct code for a piece of
>             content when the code is unknown, but the subject matter is.
>             This allows a query, which can be either a single string
>             keyword, or a sequence of keyword strings.  The query is an
>             "or" query, so that if query == ['foo','bar'] topic codes
>             with descriptions matching either term should be returned.
> 
>             Pre-condition:  query is a string or a sequence of strings and
>                             is not None.  If query is a sequence, a query
>                             will be performed for all terms as specified
>                             above.  If vocabulary is specified, only search
>                             that vocabulary, otherwise a 'search all' is
>                             assumed.
> 
>             Assumptions:    it is assumed that the query that is passed to
>                             this method should match with a wildcard on the
>                             end of each keyword, so that a query of
>                             ['bio','tech','medi'] would find biotechnology,
>                             technology, medicine, technical, medical, etc...
> 
>             Post-condition: a sequence of matches is returned, where a match
>                             is a tuple of vocabulary, code, and description
>                             in the language of choice.
> 
>                             If a registry implementation in the tool is not
>                             available in the language specified, a default 
>                             language should be used.
> 
>                             If no viable option can be found in lookup, 
>                             method should return None.           
>           """
> 
>       def listAllCodes(vocabulary=None, language='en-US'):
>           """
>             This method lists all entries in lookup tables for subject
>             vocabulary codes, either globally, or within a particular
>             vocabulary.  Output is similar to findCodeByKeyword()...
>             
>             Assumptions:    If vocabulary is None, then search 
>                             globally across all vocabularies present
>                             in this tool.
>                            
>             Post-Condition: a sequence of entries is returned, where an
>                             entry is a tuple of vocabulary, code, and
>                             description in the language of choice.
> 
>                             If a registry implementation in the tool is not
>                             available in the language specified, a default 
>                             language should be used.
> 
>                             If no viable option can be found in lookup, 
>                             method should return None.
>           """
> 
>       def getIconPathForSubject(code, vocabulary=None):
>           """
>             Attempts to find an icon path registered for a code/vocabulary
>             combo.  Since vocab is optional, this could potentially need
>             to look through the registry for entries in all vocabularies.
> 
>             Returns a list of "wrapped-icons" where a wrapped-icon is a 
>             tuple containing the icon width, icon height, and icon path
>             as a list; example: 
>             [ (32,32,['path','to','images','subj32.png']),
>               (16,16,['path','to','images','tiny','subj16.png']) ]
>           """