[Zope-CMF] Dublin Core Subject Qualifier Implementation
sean.upton@uniontrib.com
Tue, 19 Feb 2002 12:46:11 -0800
Hey everybody,
I am looking at implementing Dublin Core Qualifiers for Subject metadata as
a means of expressing subjects within multiple controlled and standardized
vocabularies (namely, IPTC subjects for news and sports stuff, and NAICS or
SIC codes for business information), in addition to supporting plain-text
subject vocabularies. Is there any established pattern or syntax
for dealing with subject codes this way in the CMF? I haven't found
anything, so I have been thinking about a solution... my thoughts are below.
The Dublin Core Qualifiers spec has several recommended element encoding
schemes for LC and medical subjects, but nothing excludes other
industry-standard subject vocabularies, such as IPTC (news/media, worldwide)
or NAICS (used by North American governments, business/economic news, and
yellow pages), or market names (stock tickers).
http://www.dublincore.org/documents/dcmes-qualifiers/#subject
http://www.iptc.org/
http://www.census.gov/epcd/www/naics.html
My first hunch is that the best way to convey a namespace/qualifier for a
subject code system is with a colon in the text, separating the vocabulary
"NAME" (in dcmes-qualifiers terms) from the code. My second hunch is that I
need to create
a subject lookup tool that performs lookups for "human-readable"
counterparts for codes, so that codes with qualifiers can get a description
that makes sense to a content user. I also think such a framework might be
useful for content creators if the user interface for metadata entry enabled
efficient lookup of these codes (the biggest UI issue is that the number of
codes may be on the order of thousands, so something like a popup
search/browse dialog might be appropriate).
Example lookup/translation input/output:
NAICS:511110 --> "Newspaper Publishers"
IPTC:01016000 --> "Television"
NASDAQ:MSFT --> "Microsoft Corporation"
Media Companies --> "Media Companies" (verbatim translation of
unqualified text)
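To make the colon syntax concrete, here is a minimal sketch in present-day
Python of how the translation above might work. The names split_subject,
describe, and DESCRIPTIONS are hypothetical, and the table is just the
examples above held in memory; treating a prefix as a qualifier only when it
names a registered vocabulary is an assumption on my part, so that plain text
containing a colon is left alone.

```python
# Hypothetical in-memory lookup table; a real tool would load these
# entries from the vocabulary publishers' own data.
DESCRIPTIONS = {
    ("NAICS", "511110"): "Newspaper Publishers",
    ("IPTC", "01016000"): "Television",
    ("NASDAQ", "MSFT"): "Microsoft Corporation",
}
KNOWN_VOCABULARIES = {vocab for vocab, _ in DESCRIPTIONS}


def split_subject(subject):
    """Split 'VOCAB:code' into (vocabulary, code).

    Unqualified text, or text whose prefix is not a registered
    vocabulary, comes back as (None, subject).
    """
    prefix, sep, rest = subject.partition(":")
    if sep and prefix in KNOWN_VOCABULARIES:
        return prefix, rest
    return None, subject


def describe(subject):
    """Return the human-readable description for a subject string."""
    vocabulary, code = split_subject(subject)
    if vocabulary is None:
        return code  # verbatim translation of unqualified text
    # None when the code is unknown within its vocabulary
    return DESCRIPTIONS.get((vocabulary, code))
```

With this sketch, describe("NAICS:511110") gives "Newspaper Publishers" and
describe("Media Companies") gives the text back unchanged.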
This tool should support internationalization (or is it localization?) of
description lookup, because these vocabularies are often defined by
multi-national organizations (thus multi-lingual lookup tables might exist,
for example IPTC supports most European languages, Turkish, and Arabic);
this isn't to say that one need implement every language a vocabulary
supports to satisfy this, but that the interface for this tool should
support a language encoding parameter for this purpose, so that a
multi-lingual site can support multiple languages with one vocabulary
(SignOnSanDiego publishes content in English and some Spanish).
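A rough sketch of how the language parameter might behave, again in
present-day Python. The get_description function, the table layout, and the
fallback order (exact tag, then base language, then a default) are all
assumptions for illustration, and the Spanish text is my own translation
rather than official IPTC data.

```python
DEFAULT_LANGUAGE = "en-US"

# Hypothetical per-language tables for a single vocabulary.
IPTC_DESCRIPTIONS = {
    "en-US": {"01016000": "Television"},
    "es": {"01016000": "Televisión"},
}


def get_description(code, language=DEFAULT_LANGUAGE, tables=IPTC_DESCRIPTIONS):
    """Try the exact language tag, then its base language, then the default."""
    candidates = [language, language.split("-")[0], DEFAULT_LANGUAGE]
    for tag in candidates:
        description = tables.get(tag, {}).get(code)
        if description is not None:
            return description
    return None  # no viable option in any candidate language
```

A request for "es-MX" falls back to the "es" table here, and a request for a
language with no table at all falls back to the English default.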
In use of this tool, there would still be interfacing issues to make this
work with the metadata tool and content types, both in terms of supporting a
user interface for a massive number of subject codes, as well as determining
when to display the code and when to display the lookup description...
I'd be interested to see what people think about this. I wrote some
interface documentation, pasted below, that might help explain
my idea. Thoughts?
Thanks,
Sean
#####################################
#####################################
import Interface


class portal_subjectlookup(Interface.Base):
    """
    Interface for a registry of subject code qualifier
    vocabularies.  Among other things, a tool implementing
    this interface should provide the ability to query with
    a code, language, and vocabulary, and get descriptions.
    """

    def getDescriptionFromCode(code, vocabulary=None, language='en-US'):
        """
        Look up code in the registry specified by vocabulary, for the
        language specified in language.

        Pre-condition:  code is a string object and is not None

        Post-condition: a string is returned with a human-readable
                        text description for the code in
                        the language specified, if available.
                        If a registry implementation in the tool is not
                        available in the language specified, a default
                        language should be used.
                        If no viable option can be found in lookup,
                        method should return None.

        Notes: sorry about the ethnocentrism in the language default.
        """

    def findCodeByKeyword(query, vocabulary=None, language='en-US'):
        """
        Used primarily by content producers, or agents on their behalf.
        This method is used to find a correct code for a piece of
        content when the code is unknown, but the subject matter is.
        The query can be either a single string keyword or a sequence
        of keyword strings.  The query is an "or" query, so that if
        query == ['foo','bar'], topic codes with descriptions matching
        either keyword should be returned.

        Pre-condition:  query is a string or a sequence of strings and
                        is not None.  If query is a sequence, a query
                        will be performed for all terms as specified
                        above.  If vocabulary is specified, only search
                        that vocabulary, otherwise a 'search all' is
                        assumed.

        Assumptions:    it is assumed that the query passed to this
                        method should match with a wildcard on the end
                        of each keyword, so that a query of
                        ['bio','tech','medi'] would find biotechnology,
                        technology, medicine, technical, medical, etc...

        Post-condition: a sequence of matches is returned, where a match
                        is a tuple of vocabulary, code, and description
                        in the language of choice.
                        If a registry implementation in the tool is not
                        available in the language specified, a default
                        language should be used.
                        If no viable option can be found in lookup,
                        method should return None.
        """

    def listAllCodes(vocabulary=None, language='en-US'):
        """
        This method lists all entries in lookup tables for subject
        vocabulary codes, either globally, or within a particular
        vocabulary.  Output is similar to findCodeByKeyword()...

        Assumptions:    If vocabulary is None, then search
                        globally across all vocabularies present
                        in this tool.

        Post-condition: a sequence of entries is returned, where an
                        entry is a tuple of vocabulary, code, and
                        description in the language of choice.
                        If a registry implementation in the tool is not
                        available in the language specified, a default
                        language should be used.
                        If no viable option can be found in lookup,
                        method should return None.
        """

    def getIconPathForSubject(code, vocabulary=None):
        """
        Attempts to find an icon path registered for a code/vocabulary
        combo.  Since vocabulary is optional, this could potentially
        need to look through the registry for entries in all
        vocabularies.

        Returns a list of "wrapped-icons" where a wrapped-icon is a
        tuple containing the icon width, icon height, and icon path
        as a list; example:

        [ (32,32,['path','to','images','subj32.png']),
          (16,16,['path','to','images','tiny','subj16.png']) ]
        """
#####################################
#####################################
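For discussion, here is a toy in-memory sketch of how findCodeByKeyword() and
listAllCodes() might behave, written in present-day Python rather than as a
real CMF tool. The SubjectLookup class and its constructor are hypothetical,
the language parameter is left out for brevity, and matching each keyword
against the start of any word in a description is my reading of the wildcard
assumption in the interface.

```python
class SubjectLookup:
    """Toy registry: one {code: description} dict per vocabulary name."""

    def __init__(self, tables):
        self._tables = tables

    def _entries(self, vocabulary=None):
        # Yield (vocabulary, code, description) in a stable sorted order.
        for vocab, table in sorted(self._tables.items()):
            if vocabulary is None or vocab == vocabulary:
                for code, description in sorted(table.items()):
                    yield vocab, code, description

    def listAllCodes(self, vocabulary=None):
        # vocabulary=None means "list across all vocabularies"
        return list(self._entries(vocabulary)) or None

    def findCodeByKeyword(self, query, vocabulary=None):
        # Accept a single keyword or a sequence of keywords.
        keywords = [query] if isinstance(query, str) else list(query)
        prefixes = tuple(keyword.lower() for keyword in keywords)
        matches = []
        for vocab, code, description in self._entries(vocabulary):
            words = description.lower().split()
            # "or" query: any keyword may prefix any word of the description
            if any(word.startswith(prefixes) for word in words):
                matches.append((vocab, code, description))
        return matches or None


# Sample data taken from the lookup examples earlier in this message.
tool = SubjectLookup({
    "NAICS": {"511110": "Newspaper Publishers"},
    "IPTC": {"01016000": "Television"},
    "NASDAQ": {"MSFT": "Microsoft Corporation"},
})
```

Here tool.findCodeByKeyword(['news', 'micro']) returns the NAICS and NASDAQ
entries, and a search restricted to a vocabulary with no match returns None,
as the interface asks.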
=========================
Sean Upton
Site Technology Supervisor
Development & Integration
SignOnSanDiego.com
The San Diego Union-Tribune
619.718.5241
sean.upton@uniontrib.com
=========================