Report on CODATA 2002 conference

To: Multiple recipients of list <epc-l@iucr.org>
Subject: Report on CODATA 2002 conference
From: Brian McMahon <bm@iucr.org>
Date: Mon, 7 Oct 2002 16:12:01 +0100 (BST)
To members of the Electronic Publishing and Database Committees
(copy to IUCr staff for information)

Herewith my digest of the highlights of last week's CODATA meeting
in Montreal. The Crystallography News Online editor is welcome to add
this to the Conference Reports section if he feels it of sufficient
interest.

Brian
_________________________________________________________________________
Brian McMahon                                       tel: +44 1244 342878
Research and Development Officer                    fax: +44 1244 314888
International Union of Crystallography            e-mail:  bm@iucr.org
5 Abbey Square, Chester CH1 2HU, England                   bm@iucr.ac.uk


CODATA 2002 - Montreal, 29 September - 3 October 2002
-----------------------------------------------------

The biennial CODATA meeting just finished in Montreal set out its agenda
under six headings intended to emphasise cross-disciplinary concerns
relevant to all the CODATA participating organisations:
 * Preserving and Archiving of Scientific and Technical Data;
 * Legal Issues in Using and Sharing Scientific and Technical Data;
 * Interoperability and Data Integration;
 * Information Economics for Scientific and Technical Data;
 * Emerging Tools and Techniques for Data Handling;
 * Ethics in the Creation and Use of Scientific and Technical Data.

In an interesting keynote lecture on preservation and archiving, Kevin
Ashley of the University of London Computing Centre struck some
encouraging notes. The technical problems of archiving digital data are
well understood and pose no severe challenges; the challenges are
economic, social and managerial. The sheer volume of data available is
daunting, but the growth in bulk storage capacity continues to accelerate,
and the difficulty with preserving large volumes will become more one of
locating the desired data within the collection. Of course, one should not
aim to store absolutely everything. Selection is important, and
professional archivists are skilled in selection. It rapidly becomes
apparent that collectors of raw data often are not skilled in the critical
assessment of what should be stored to maximise the usefulness of archived
data to future exploitation. Hence there is merit in the current trend
of funding bodies in Europe and Australasia to support the digital
archiving and libraries communities because of their experience in
archiving and preservation; but these are communities not necessarily
attuned to the specific requirements of scientific data management, and
collaboration is needed to ensure that these requirements are met when
scientific data are transferred to non-specialised repositories.

It is clear that the more detail accompanying the data collections
(i.e. the richer the metadata at source), the more value will emerge from
the archive. Some anecdotes pointed out the need to retain 'bad' data (in
Ashley's talk the example was given of statistical demographic data that
was known to be flawed, but which had influenced political arguments - the
availability of the bad data was valuable to historical analysts). In the
Canadian Virtual Observatory, astronomical data from US sources was
recalibrated on the fly when requested. This dynamic processing using best
current techniques improved the usefulness of the data served at this
point in time. On the other hand, NASA demands that its archives retain
the original raw data (after all, the latest recalibration might contain
systematic errors). So, while the point was not made explicitly, it is
apparent that archives may not be static repositories, but may be called
upon to reflect changes overlaid upon the data they contain. This
emphasises the importance of audit trails to accompany the data as a
further level of metadata. (There is a parallel with our electronic
journals, where errata should be combined with the content of a paper to
assist current researchers, while yet the initial form must also be
accessible as a document of historical record.)
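
As an illustration of the audit-trail idea, here is a minimal sketch in
Python. It is not taken from any system described at the meeting: the
class name ArchivedDataset, the serve method and the calibration used are
all hypothetical. The raw values are never overwritten, each recalibrated
view is computed on request, and every such request is logged.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ArchivedDataset:
    """Hypothetical archive record: raw data plus a log of derived views."""

    raw: Tuple[float, ...]  # immutable tuple, so the original survives intact
    audit_trail: List[str] = field(default_factory=list)

    def serve(self, label: str, calibrate: Callable[[float], float]) -> List[float]:
        # Recalibrate on the fly and record what was done; raw is untouched.
        self.audit_trail.append(label)
        return [calibrate(value) for value in self.raw]


dataset = ArchivedDataset(raw=(1.0, 2.0, 3.0))
# The gain factor here is an invented example of a 'best current' calibration.
served = dataset.serve("gain x2 (hypothetical)", lambda v: v * 2.0)
```

A later, better calibration can be served from the same raw values, and
the trail shows which treatment any previously served copy received.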

The Open Archival Information System (OAIS) reference model appears to
have been well accepted. It is not clear how widespread it is, but where
it has been adopted as a working reference it has proven effective,
whether in the actual generation of code from its formal UML
representation (as has been done at the Jet Propulsion Laboratory), or as
a more traditional blueprint for software engineering using XML, SOAP and
other web services tools as practised by Centre National d'Etudes
Spatiales (CNES). The CNES experience showed that its use promoted easy
interoperability between different databases, and it seems to me that its
level of abstraction makes it one of the most effective tools to
date in working towards proper cross-disciplinary interoperability.

Among other contributions related to archiving, the point was repeatedly
made that original data must be retained, whatever subsequent processing
it might undergo. The US Geological Survey distinguished between the
'migration' of data to different physical media, which was easy, from its
'transcription' into different formats which might be required for subsequent
reprocessing or even storage. One difficulty with transcription strategies
applied to very large volumes of data was that the lifetime of the target
format was often short compared with the time taken for the transcription
operation, so that one was forever chasing one's tail. The astronomy
community well understood the value of archived data: over 600 papers a
year are published from old observational data retrieved from data stores,
and data from the Hubble Space Telescope is being extracted for research
at a rate four times greater than it is being added to the archives.

The Principal Director of the Erpanet project (a European funded venture
somewhat focused on cultural digital objects) reiterated the archivist's
principle of selectivity. It is better to collect little but document well
than to aggregate huge amounts of poorly documented material. Acquisition
strategies of librarians are developed hand in hand with disposal
strategies (though this is perhaps driven by storage space concerns which
are less pressing in the current digital environment). One point he made
that I thought worth pondering is that archiving resources should be
allocated less on a cost/benefit analysis than through risk analysis: what
do you stand to lose?

The interoperability thread was introduced by a keynote talk full of ideas
(about 120 slides worth!) by Robert Robbins of the Hutchinson Cancer
Research Centre, Seattle. His theme was that interoperability between
databases in the life/molecular biology sciences alone was hampered,
partly by the scale, but also by obstacles to technical, semantic and
social connectivity. In practice it was found that people were more
willing to tackle the semantic and to some degree social obstacles as the
technical connectivity was improved. The problems that he saw at the
technical level had to do with the fact that current relational database
management systems were optimised for business databases. But business and
science differ: business is concerned with a closed universe and deductive
logic; science deals in an open universe of observations with inductive
logic. Nevertheless, relational database systems were attractive inasmuch
as they had a sound theoretical basis: their behaviour and properties were
tractable to set-theoretic analysis. Object-oriented databases with local
methods were attractive in terms of efficiency of manipulation, but tended
to be designed ad hoc to match the problem in hand. The difficulties of
integrating ad hoc solutions are more severe. In practice biological
databases will form at best a 'loosely coupled federation' within a formal
taxonomy of databases. Earlier attempts to rigorously analyse such systems
foundered on the impossibility of synchronising loosely coupled
structures, but Robbins believes that a formal theory of 'read-only'
loosely-coupled federated databases is possible, and is essential to
provide a sound basis for the design and implementation of the desired
integration of very large-scale biological databases. One thing he
identified as an essential was some sort of resource registry acting as an
analogue of domain name service to direct structure queries to the
appropriate server within a WWW technical model.
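
To make the registry idea concrete, here is a minimal Python sketch. It is
an assumption-laden illustration, not Robbins's design: the dotted
resource names, the example.org server addresses and the resolve function
are all invented. The lookup tries the most specific name first and then
falls back to broader ones, loosely in the spirit of hierarchical DNS
names.

```python
from typing import Dict, Optional

# Hypothetical registry mapping a kind of data to the server that holds it.
REGISTRY: Dict[str, str] = {
    "protein.structure": "https://structures.example.org/query",
    "gene.sequence": "https://sequences.example.org/query",
}


def resolve(resource: str, registry: Dict[str, str] = REGISTRY) -> Optional[str]:
    """Return the server registered for a resource name, or None."""
    parts = resource.split(".")
    while parts:
        server = registry.get(".".join(parts))
        if server is not None:
            return server
        parts.pop()  # drop the most specific label and try a broader name
    return None


server = resolve("protein.structure.xray")
```

Because the registry is only ever read by clients, a sketch like this fits
the 'read-only' federation described above: no synchronisation between the
member databases is needed to answer the lookup.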

The other talks in the interoperability thread seemed to illustrate that
interoperability needs to - or at least tends to - start at the technical
level. The oceanographic OPeNDAP protocol defining syntactic metadata was
successful in bringing together data sets in a number of different formats
behind a common front-end. It is an open-source network data access
protocol that sits, rather like a format translation layer, on top of
TCP/IP in a network transfer process. Mechanical procedures for
translating between formats are employed, and the amount of semantic
metadata required by the search and retrieval applications is rather
low. Indeed, the point was made that the format transfer layer reduces the
amount of metadata that is needed to facilitate meaningful data transport.
It was also pointed out that one can get a lot of functionality out of
'smart' clients, but the more intelligent the client, the lower (in
general) its capability for interoperability. OPeNDAP appears to be
sufficiently low-level that it has been used by oceanography, earth
sciences and solar-terrestrial communities to good effect.

The OpenGIS Consortium demonstrated some impressive overlaying of
map-based information from different geography sources using a number of
web-compatible services. Among the tools used in XML-based data transfer
applications, UDDI (Universal Description, Discovery and Integration) was
mentioned by a number of speakers, and may go some way towards fulfilling
the role of a 'semantic DNS' mentioned in the keynote presentation.

Impressive though some of the working examples were, they are still
largely restricted to one discipline or to related disciplines with rather
similar data descriptions. Cross-discipline interoperability still seems a
long way off.

A speaker from the Open Archives Initiative (OAI) discussed the protocol
for metadata harvesting (PMH) that is designed to collect metadata across
disciplines. It is based on Dublin Core metadata, but may provide a way to
aggregate disparate metadata from different sources. This sounded
interesting, but unfortunately the speaker disappeared before I could chat
to him; but there might be more on this at the forthcoming CERN meeting.
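
For readers unfamiliar with the protocol, a harvesting request is an
ordinary HTTP query. The Python sketch below only builds such a query
string; it fetches nothing. The repository address is a placeholder and
the function name list_records_url is my own. The verb ListRecords and the
metadata prefix oai_dc (unqualified Dublin Core) are taken from the OAI-PMH
specification.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec  # optional selective harvesting by set
    return base_url + "?" + urlencode(params)


# Placeholder address; a real harvester would use a repository's base URL.
url = list_records_url("https://repository.example.org/oai")
```

The response to such a request is an XML document of metadata records,
which a harvester can aggregate with records collected from other
repositories.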

The keynote talk in the 'Emerging Tools' thread was on Text Mining by
Stan Matwin of the University of Ottawa. This is the process of analysing
natural-language text to uncover new knowledge (that is, to extract
structured information from an unstructured source). The distinction was
made between 'uncovering' and 'discovering' new knowledge - an example of
the latter would be the recognition that references to birds in Grimms'
fairy tales were always metaphors for death. (To my mind this sounded far
more interesting, but doesn't appear to be anywhere near realisation yet!)
Text mining projects of today combine linguistic analysis with machine
learning. Linguistic analysis includes word stemming, tagging, and
rule-based parsing of the grammatical structures of a natural-language
text. The objective is to work towards a semantic analysis, which for
scientific texts is imaginable because the formal language of scientific
discourse uses relatively direct mappings between syntax and semantics
(unlike, say, metaphor-rich literary text). The machine-learning component
involves the preliminary feeding into the system of portions of text
which are variously tagged by experts as relevant or not relevant to a
particular type of query. This is seen as an effective way to generate the
thesauri relevant to a topic area. It was claimed that early projects
concerned with the automatic categorisation of documents in genomics, and
the detection of email spam, were showing promise.
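
The combination of stemming with learning from expert-tagged text can be
sketched in a few lines of Python. This is a toy under stated assumptions,
not the method presented in the talk: the suffix-stripping stemmer is
deliberately crude, the scoring is a simple naive-Bayes-style word count
that I have substituted for whatever learner a real project would use, and
the training sentences are invented.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def stem(word: str) -> str:
    # Crude suffix stripping, standing in for a real stemming algorithm.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokens(text: str) -> List[str]:
    return [stem(word) for word in text.lower().split()]


def train(tagged: Iterable[Tuple[str, bool]]) -> Dict[bool, Counter]:
    """Count stemmed words separately for relevant and irrelevant text."""
    counts: Dict[bool, Counter] = {True: Counter(), False: Counter()}
    for text, relevant in tagged:
        counts[relevant].update(tokens(text))
    return counts


def is_relevant(text: str, counts: Dict[bool, Counter]) -> bool:
    """Compare smoothed log-likelihoods of the text under each label."""
    vocabulary = set(counts[True]) | set(counts[False])
    score = 0.0
    for label, sign in ((True, 1.0), (False, -1.0)):
        total = sum(counts[label].values()) + len(vocabulary)
        for token in tokens(text):
            # Add-one smoothing so unseen words do not give log(0).
            score += sign * math.log((counts[label][token] + 1) / total)
    return score > 0


# Invented examples standing in for passages tagged by experts.
tagged_examples = [
    ("gene sequences encoding proteins", True),
    ("protein binding genes", True),
    ("cheap offers buy now", False),
    ("buy cheap pills", False),
]
model = train(tagged_examples)
```

With this training set, is_relevant("genes and proteins", model) comes out
true and is_relevant("buy cheap offers", model) false; the word counts the
model accumulates are a rudimentary form of the topic thesaurus mentioned
above.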

Among the contributions to this thread, Henry Kehiaian presented the
standard file format SELF for physicochemical data as a technique for
publishing, retrieving and exchanging such data. A man from Oracle
discussed some of the innovations within Oracle databases for storing
spatial data, extending the database query language to include operations
on spatial data types and introducing optimised spatial indexing. I
gather that they are working with SDSC on applications in protein structure
representation and that they plan to provide biospatial types that are
compliant with mmCIF. This sounds to me like very good news, because I am
sure the integration of mmCIF objects in a commercial product of this
importance will be very welcome. If I understood correctly, he is
collaborating or proposing to collaborate - I presume with Phil Bourne's
group? - on PDB-mmCIF conversion tools. Unfortunately, he also vanished
before I could talk to him.

Because of the structure of the conference I could not follow the other
topic threads in detail. But their keynotes were all of interest.

Masamitsu Negishi of the Japanese National Institute of Informatics (NII)
described Japan's current heavy investment in information technology:
the e-Japan strategy is a government-driven initiative to become the
world's most advanced IT nation by 2005. As its contribution, NII hosts
2892 databases (only 5% of them scientific), but provides a national
portal to these. It is building a citation index of Japanese publications
(the Japanese-language equivalent of ISI). Japanese electronic libraries
are exploring consortial subscription models, and Japanese interest in the
SPARC initiative for low-cost academic publication is high.

Pamela Samuelson of the University of California at Berkeley discussed some
topical legal concerns and emphasised the need for the scientific
community to uphold the value of the public domain in safeguarding access
to information and ideas, fertilising new ideas, and upholding the general
principles of scientific openness. Current legislation on intellectual
property rights provides strong safeguards for the owner of data, but at
the potential cost of eroding the doctrine of 'fair use' for educational and
research purposes.

Such restrictions on access were also referred to by M. G. K. Menon (Dr
Vikram Sarabhai Distinguished Professor of Space and President, LEAD,
India), in an eloquent address on the ethical problems that would
certainly arise in the globalisation of information science and technology.
There is already an economic divide between the rich and poor
nations, but the digital divide exists too and is growing. The poor cannot
afford computers; telecommunications infrastructure in the developing
world is poor; illiterates cannot use keyboards; the Internet is
English-language dominated; and the developing nations have difficulty
enough in meeting their energy needs. The problem of access to data is
particularly one that concerns the poor nations. However, CODATA is active
in involving members of the developing nations in its activities, and
shows by its promotion of ethics-related sessions at this meeting and
elsewhere that it is an active participant in the quest for a proper
balance between ethical and economic values.

Among the entertainments provided for delegates were a pair of public
lectures and a session to predict the future. The public lectures were:
a bilingual presentation by Guy Baillargeon on biodiversity and the Global
Biodiversity Information Facility front-end to a collection of
interoperable taxonomy and specimen databases; and a presentation of
high-definition television satellite images of geographic, geologic and
meteorological phenomena by Fritz Hasler. The prophets in the CODATA 2015
session were Paul Ginsparg (who envisaged very effective text mining
through optimisation of the simple algorithms that now power Google; and
who suggested that future trends would favour the publisher whose income
per article was nearer the $1-5 of the arXiv preprint server than the
$10,000-20,000 of certain commercial publishers); Werner Martienssen (who
foresaw developments in the understanding of the natural laws of physics
that encompassed fractal concepts and elaboration of knowledge from
models that included evolutionary competition); and David Thomas (who saw
the progression in understanding of molecular biology through gene
function and cell function into the complete mapping of the cell, with
deep understanding of the organism and populations still to come).

Finally, I enjoyed a number of presentations illustrating the Virtual
Human Project, although the one with the most intellectual content had to
be abandoned because the speaker could not make his Mac talk to the data
projector!