Blog

Aligning OpenSearch and SRU

Tony Hammond – 2009 June 05

In Search

[Update - 2009.06.07: As pointed out by Todd Carpenter of NISO (see comments below) the phrase “SRU by contrast is an initiative to update Z39.50 for the Web” is inaccurate. I should have said “By contrast SRU is an initiative recognized by ZING (Z39.50 International Next Generation) to bring Z39.50 functionality into the mainstream Web”.]

[Update - 2009.06.08: Bizarrely I find in mentioning query languages below that I omitted to mention SQL. I don’t know what that means. Probably just that there’s no Web-based API. And that again it’s tied to a particular technology - RDBMS.]

queryType.png

There are two well-known public search APIs for generic Web-based search: OpenSearch and SRU. (Note that the key term here is “generic”, so neither Solr/Lucene nor XQuery really qualify for that slot. Also, I am concentrating here on “classic” query languages rather than on semantic query languages such as SPARQL.)

OpenSearch was created by Amazon’s A9.com and is a cheap and cheerful means to interface to a search service by declaring a template URL and returning a structured XML format. It therefore allows for structured result sets while placing no constraints on the query string. As outlined in my earlier post Search Web Service, there is support for search operation control parameters (pagination, encoding, etc.), but no inroads are made into the query string itself which is regarded as opaque.
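To make the template URL idea concrete, here is a minimal sketch of how a client might fill in an OpenSearch template. The endpoint `http://example.org/search` and the helper `fill_template` are assumptions made up for illustration; only the parameter names (`searchTerms`, `startPage`, `count`) and the trailing `?` for optional parameters come from OpenSearch itself. Note that the query string is simply percent-encoded and passed through untouched, which is what "opaque" means here.

```python
import re
from urllib.parse import quote

# Hypothetical template, as it might appear in an OpenSearch description file.
TEMPLATE = "http://example.org/search?q={searchTerms}&start={startPage?}&n={count?}"


def fill_template(template, values):
    """Substitute {name} and {name?} placeholders in an OpenSearch URL template."""

    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in values:
            # The value is percent-encoded but otherwise not interpreted.
            return quote(str(values[name]), safe="")
        if optional:
            # Optional parameters with no value are replaced by the empty string.
            return ""
        raise KeyError(f"missing required template parameter: {name}")

    return re.sub(r"\{([A-Za-z:]+)(\?)?\}", substitute, template)


url = fill_template(TEMPLATE, {"searchTerms": "open search", "count": 10})
print(url)
```

The control parameters (`startPage`, `count`) are the only part of the request the protocol knows anything about; whatever goes into `searchTerms` is the search engine's business.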

SRU by contrast is an initiative to update Z39.50 for the Web and is firmly focussed on structured queries and responses. Specifically a query can be expressed in the high-level query language CQL which is independent of any underlying implementation. Result records are returned using any declared W3C XML Schema format and are transported within a defined XML wrapper format for SRU. (Note that the SRU 2.0 draft provides support for arbitrary result formats based on media type.)
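By way of contrast, an SRU request carries a structured CQL query alongside named control parameters. The sketch below builds a `searchRetrieve` URL; the base URL `http://example.org/sru`, the helper `sru_search_url` and the sample query are assumptions for illustration only, while the parameter names (`operation`, `version`, `query`, `startRecord`, `maximumRecords`, `recordSchema`) are the standard SRU ones.

```python
from urllib.parse import urlencode, quote

BASE_URL = "http://example.org/sru"  # hypothetical SRU endpoint


def sru_search_url(base_url, cql_query, start_record=1,
                   maximum_records=10, record_schema=None):
    """Build an SRU searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,                 # structured CQL, not an opaque string
        "startRecord": start_record,        # pagination control
        "maximumRecords": maximum_records,  # page size control
    }
    if record_schema is not None:
        # Identifier of the XML schema the result records should use.
        params["recordSchema"] = record_schema
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return base_url + "?" + urlencode(params, quote_via=quote)


url = sru_search_url(BASE_URL, 'dc.title = "open search" and dc.creator = hammond')
print(url)
```

Here the server is expected to parse the `query` value as CQL (index, relation, term, boolean operators), which is exactly the structure that OpenSearch leaves undefined.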

One can summarize the respective OpenSearch and SRU functionalities as in this table:

<table>
<tr>
<th>
  Structure
</th>
<th width="33%" align="center">
  OpenSearch
</th>
<th width="33%" align="center">
  SRU
</th>
</tr>
<tr>
<td>
  query
</td>
<td align="center">
  no
</td>
<td align="center">
  yes
</td>
</tr>
<tr>
<td>
  results
</td>
<td align="center">
  yes
</td>
<td align="center">
  yes
</td>
</tr>
<tr>
<td>
  control
</td>
<td align="center">
  yes
</td>
<td align="center">
  yes
</td>
</tr>
<tr>
<td>
  diagnostics
</td>
<td align="center">
  no
</td>
<td align="center">
  yes
</td>
</tr>
</table>

What I wanted to discuss here was the OpenSearch and SRU interfaces to a Search Web Service such as the one outlined in my previous post. The diagram at the top of this post shows query forms for OpenSearch and SRU and the associated result types. The Search Web Service is taken to be exposing an SRU interface. It might be simplest to walk through each of the cases.

(Continues below.)

Search Web Service

Tony Hammond – 2009 May 30

In Search

search-web-service.png

While the OASIS Search Web Services TC is currently working towards reconciling SRU and OpenSearch, I thought it would be useful to share here a simple graphic outlining how a search web service for structured search might be architected.

Basically there are two views of this search web service (described in separate XML description files and discoverable through autodiscovery links added to HTML pages):

Structured Search Using PRISM Elements

Tony Hammond – 2009 May 30

In Search

We just registered in the SRU (Search and Retrieve by URL) search registry the following components:

Context Sets

This means that an SRU (Search and Retrieve by URL) search engine that supported one of the PRISM context sets registered above could accept CQL (Contextual Query Language) queries such as the following:

OAI-ORE: Workshop Slides

Tony Hammond – 2009 May 26

In Interoperability

This is a very slick presentation by Herbert Van de Sompel on OAI-ORE which he’s due to give today for a workshop at the INFORUM 2009 15th Conference on Professional Information Resources in Prague. It’s on the long side at 167 slides, but even if you just flip through it or sample it selectively you’ll be bound to come away with something.

PRISM Aggregator Message

Tony Hammond – 2009 May 08

In Interoperability

The new OAI-PMH interface to Nature.com sports one particular novelty which may well be of interest here: it makes use of the PRISM Aggregator Message. (For an announcement of this service see the post on our web publishing blog Nascent.)

OAI-PMH is a protocol for harvesting metadata records from a digital repository, and its records may be expressed in a variety of different metadata formats. For reasons of interoperability a base metadata format (‘Dublin Core’) is mandated for all OAI-PMH implementations. The expectation is that this base format would be augmented by community-specific vocabularies.

Our natural inclination was to mirror the article descriptions which we already circulate in our RSS feeds and within our HTML pages (as META tags) and PDF files (as XMP packets). In these cases we have used open data models (e.g. RDF) with simple properties cherry-picked from the DC and PRISM namespaces. But OAI-PMH has a special ‘gotcha’ in this regard: any metadata format must allow for W3C XML Schema validation. That is, the properties need to be constrained by an XSD data model. Enter PRISM Aggregator Message (PAM).
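As a rough illustration of how the choice of metadata format shows up on the wire, the sketch below builds OAI-PMH `ListRecords` requests. The `verb`, `metadataPrefix`, `from` and `set` arguments are standard OAI-PMH, and `oai_dc` is the prefix for the mandated Dublin Core format. The base URL `http://example.org/oai`, the helper `list_records_url` and the prefix `pam` are assumptions for illustration only; the prefix a real repository uses for PAM records is whatever its `ListMetadataFormats` response advertises.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date   # selective harvesting by datestamp
    if set_spec is not None:
        params["set"] = set_spec     # selective harvesting by set
    return base_url + "?" + urlencode(params)


BASE_URL = "http://example.org/oai"  # hypothetical repository endpoint

# The mandated Dublin Core format.
dc_url = list_records_url(BASE_URL)

# A community-specific format; "pam" is an assumed prefix, not a confirmed one.
pam_url = list_records_url(BASE_URL, metadata_prefix="pam", from_date="2009-05-01")

print(dc_url)
print(pam_url)
```

Each `metadataPrefix` a repository offers is tied to an XML Schema, which is why a purely open data model is not enough on its own.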

(Continues)

Crossref’s OpenURL query interface

Chuck Koscher – 2009 May 06

In OpenURLAPIs

Over the past two weeks we’ve focused on our OpenURL query interface with the goal being to improve its reliability. I’d like to mention some things we’ve done.

  1. We now require an OpenURL account to use this interface (see the registration page). This account is still free, there are no fixed usage limits, and the terms of use have been greatly simplified.

  2. Resources have been rearranged to dedicate more horsepower to the OpenURL function.

OCLC defines requirements for a “Cooperative Identities Hub”

Geoffrey Bilder – 2009 May 01

In ORCID

OCLC has published a report (PDF) identifying some requirements for what they call a “Cooperative Identities Hub”. A quick glance through it seems to show that the use cases focus on what we are calling the “Knowledge Discovery” use cases. As I mentioned in my interview with Martin Fenner, there is also a category of “authentication” use cases that I think needs to be addressed by a contributor identifier system. Still, this is a good report that highlights many of the complexities that an identifier system needs to address.

What do people want from an author identifier?

Geoffrey Bilder – 2009 April 27

In ORCID

Martin Fenner continues his interest in the subject of author identifiers. He recently posted an online poll asking people some specific questions about how they would like to see an author identifier implemented.*

The results of the poll are in and, though the sample was very small, the results are interesting. The responses are both gratifying (there seems to be a general belief that Crossref has a role to play here) and perplexing (most think the identifier needs to identify other “contributors” to the scholarly communications process, yet there seems to be a preference for the moniker “digital author identifier”). This latter preference is certainly a surprise to us as we had been focusing our efforts on identifying analog authors. The only “digital authors” I know of are this one at MIT and possibly this one at Aberystwyth University. 😉

Introductory Signals

So while doing some background reading today I realized that legal citations already widely support a form of “citation typing” in the form of “Introductory Signals”. The 10 introductory signals break down as follows…

In support of an argument:

   1) [no signal]. (NB that, apparently, this is increasingly deprecated.)

   2) accord;

   3) see;

   4) see also;

   5) cf.;

For Comparisons:

   6) compare … with …;

For contradiction:

   7) but see;

Citation Typing Ontology

I was happy to read David Shotton’s recent Learned Publishing article, Semantic Publishing: The Coming Revolution in scientific journal publishing, and see that he and his team have drafted a Citation Typing Ontology.*

Anybody who has seen me speak at conferences knows that I often like to proselytize about the concept of the “typed link”, a notion that hypertext pioneer Randy Trigg discussed extensively in his 1983 Ph.D. thesis. Basically, Trigg points out something that should be fairly obvious: a citation (i.e. “a link”) is not always a “vote” in favor of the thing being cited.

In fact, there are all sorts of reasons that an author might want to cite something. They might be elaborating on the item cited, they might be critiquing the item cited, they might even be trying to refute the item cited. (For an exhaustive and entertaining survey of the use and abuse of citations in the humanities, Anthony Grafton’s The Footnote: A Curious History is a rich source of examples.)

Unfortunately, the naive assumption that a citation is tantamount to a vote of confidence has become enshrined in everything from the way in which we measure scholarly reputation, to the way in which we fund universities and the way in which search engines rank their results. The distorting effect of this assumption is profound. If nothing else, it leads to a perverse situation in which people will often discuss books, articles, and blog postings that they disagree with without actually citing the relevant content, just so that they can avoid inadvertently conferring “whuffie” on the item being discussed. This can’t be right.

Having said that, there has been a half-hearted attempt to introduce a gross level of link typology with the introduction of the “nofollow” link attribute, an initiative started by Google in order to try to address the increasing problem of “Spamdexing”. But this is a pretty ham-fisted form of link typing, particularly in the way it is implemented by the Wikipedia, where Crossref DOI links to formally published scholarly literature have a “nofollow” attribute attached to them but, inexplicably, items with a PMID are not so hobbled (view the HTML source of this page, for example). Essentially, this means that the Wikipedia is a black hole of reputation. That is, it absorbs reputation (through links to the Wikipedia), but it doesn’t let reputation back out again. Hell, I feel dirty for even linking to it here ;-).

Anyway, scholarly publishers should certainly read Shotton’s article because it is full of good and practical ideas about what can be done with today’s technology in order to help us move beyond the “digital incunabula” that the industry is currently churning out. The sample semantic article that Shotton’s team created is inspirational and I particularly encourage people to look at the source file for the ontology-enhanced bibliography, which reveals just how much more useful metadata can be associated with the humble citation.

And now I wonder whether CiteULike, Connotea, 2Collab or Zotero will consider adding support for the Citation Typing Ontology into their respective services?

* Disclosure:
a) I am on the editorial board of Learned Publishing.
b) Crossref has consulted with David Shotton on the subject of semantically enhancing journal articles.