
Report on CODATA 2004 conference

To members of the Electronic Publishing and Database Committees
(copy to IUCr staff for information)

Herewith my digest of the highlights of the recent CODATA meeting in
Berlin. The Crystallography News Online editor is welcome to add
this to the Conference Reports section if he feels it of sufficient
interest.

Brian
_________________________________________________________________________
Brian McMahon                                       tel: +44 1244 342878
Research and Development Officer                    fax: +44 1244 314888
International Union of Crystallography            e-mail:  bm@iucr.org
5 Abbey Square, Chester CH1 2HU, England                   bm@iucr.ac.uk
_________________________________________________________________________


        The Information Society: New Horizons for Science
        -------------------------------------------------
             CODATA 2004 - Berlin, 7-10 November 2004

The title of the 2004 biennial CODATA conference reflected the
growing emphasis within CODATA on data science and scientific data
management as crucial components of the "information society". An
important session during the conference programme was devoted to
presentations and discussions on CODATA and ICSU activities
leading to Phase II of the World Summit on the Information Society
in Tunis 2005 (see below). However, the relationship between
society and scientific data was a recurring theme running through
many of the conference presentations, sometimes explicitly,
sometimes unspoken.

Keynote lectures
----------------
The Berlin Declaration of October 2003 is an initiative to
encourage open access to knowledge in the sciences and humanities,
with the goal of disseminating knowledge widely through
society. It has been signed by several representatives of national
and international academic institutions, and is strongly promoted
by the Max Planck Society (MPS). Jürgen Renn of the MPS described
his vision of a web of culture and science, arising from strenuous
efforts to expose scholarly knowledge on the web. Without a concerted
effort, many artefacts and components of cultural heritage - art,
literature, languages, oral traditions - will lose visibility as
they become the preserve of specialist scholars. The alternative is
to use the power and universality of the web to provide access for all. 
He argued that there is presently a crisis of dissemination,
linked to spiralling costs of journals and books. The current
standard solutions for web-based distribution are flawed: the
"big player" model tries to secure exclusive rights for commercial
exploitation, but fails to create an adequate access and retrieval
infrastructure and promotes the Digital Divide; the "scout" model
of transfer of content through pilot ventures lacks self-sustaining
dynamics. What is needed is a self-sustaining infrastructure built
through an "agora" solution - a support programme arising from the
contribution of all citizens towards the common good. In this
vision, the web of the future will be constituted by informed
peer-to-peer interactions. The engine for this will be dynamic
ontologies engendering a self-organising mechanism, so that
semantic linking runs deeper than the current linking between
predefined metadata collections. The Berlin Declaration is seen as
a starting point; it encourages scientists to publish the results
of their research through open-access vehicles, and calls upon the
holders of cultural heritage collections to make them available
via the web. Projects such as European Cultural Heritage Online
are informed by this vision and demonstrate its potential. A major
short-term goal of the signatories to the Berlin Declaration is to
raise the awareness of learned societies and to lay out a roadmap
that systematically addresses the issues of legal obstruction,
economics and the Digital Divide.

For all the current interest in scientific data and society, it
remains true that science itself is built on data, and Johann
Gasteiger of the University of Erlangen-Nürnberg described recent
developments in chemical informatics that drew new knowledge from
the mass of accumulating data. His theme was that chemistry is
more interested in properties than in compounds; although a
million new compounds are described each year in over 800,000
publications, the sheer volume can make it difficult to solve new
problems. On the other hand, there is often not enough data. While
there are in excess of 41 million recorded chemical compounds,
only a quarter of a million well-defined crystal structures are
known. Sometimes it is opportune for academia, which pioneers new
methods, to work closely with industry, which has the capacity to
provide large numbers of new compounds, and hence data points,
required for a specific study. It was noteworthy that industry was
poorly represented among CODATA members. The new discipline of
chemoinformatics brings informatics and mathematical methods to bear
on chemical problems. Neural networks
are one of a growing number of powerful new tools. Case studies
were presented of industrial/academic collaborations in which
self-organising two-dimensional neural networks had been applied
to problems of solubility prediction, infrared spectra
characterisation and drug discovery.
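
As a flavour of the technique, here is a minimal Python sketch of a
two-dimensional self-organising (Kohonen) map of the general kind
described; the grid size, learning schedule and synthetic "compound
descriptors" are my own illustrative assumptions, not details from
the talk.

    import numpy as np

    rng = np.random.default_rng(0)
    grid_w, grid_h, n_features = 10, 10, 5       # 10x10 map, 5 descriptors
    weights = rng.random((grid_w, grid_h, n_features))

    def train(samples, epochs=100, lr0=0.5, radius0=5.0):
        for t in range(epochs):
            lr = lr0 * np.exp(-t / epochs)          # decaying learning rate
            radius = radius0 * np.exp(-t / epochs)  # shrinking neighbourhood
            for x in samples:
                # best-matching unit: the node whose weights are closest to x
                d = np.linalg.norm(weights - x, axis=2)
                bx, by = np.unravel_index(np.argmin(d), d.shape)
                # pull the winner and its neighbours towards the sample
                for i in range(grid_w):
                    for j in range(grid_h):
                        dist2 = (i - bx) ** 2 + (j - by) ** 2
                        h = np.exp(-dist2 / (2 * radius ** 2))
                        weights[i, j] += lr * h * (x - weights[i, j])

    train(rng.random((50, n_features)))             # 50 synthetic "compounds"

After training, similar descriptor vectors map to nearby grid nodes,
which is what makes such maps useful for clustering properties such
as solubility.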

Gladys Cotter, of the US Geological Survey, discussed biodiversity
studies, another area where informatics is challenged by the
volume of data, by new techniques for collecting and processing
data, and by the demands of organisation and knowledge
management. Recent developments in the biological sciences have
produced close cooperation between participants at many levels of
scale, from UN-sponsored programmes through regional and national
organisations, local institutions and individual field workers,
all trying to communicate through increasingly interoperable
channels. This is an area of science where the hierarchy of levels
does seem to work quite well together, despite (but perhaps also
helped by) the proliferation of new data discovery techniques such
as personal digital assistants (PDAs) equipped with Global
Positioning System (GPS) locators, field computers, unmanned
aerial survey vehicles and lidar. Effective metadata schemes
within the Global Biodiversity Information Facility (GBIF) project
allowed the exposure of 35 million taxonomic records within a year
using the DiGIR software framework. New data models are moving
away from purely descriptive taxonomy towards a more predictive
function.

Yoshiyuki Sakaki, Director of the RIKEN Genomic Sciences Center,
described the recent work on "finishing" the euchromatic sequence
of the human genome, first published in draft in 2001. The
motivation was to produce data of the highest quality to form the
foundation for future medical research. The result is the
identification of some 20,000-25,000 protein-coding genes. As well as
providing insight into genetic function within man, the complete
genome provides the raw data for the new study of comparative
genomics, where comparison of highly conserved sequences across
many species provides clues to evolution. Current bioinformatics
techniques applied to the genome have the potential to map
phylogenetic relationships.

Data and Society
----------------
Addressing the general theme of the conference, the plenary
session "Data and Society" provided two wide-ranging reviews.
René Deplanque, from FIZ Chemie Berlin, surveyed "The Use of
Scientific and Technological Data in Today's Society" in the
context of the development of the web and related information
sciences. Search engines like Google are working towards a
paradigm of intuitive searching without the need for formal query
languages, but are limited by the depth of information they crawl
and by the indiscriminate nature of the results they return. Structured
information will certainly help, but there is still a vast
challenge in integrating databases from very different domains of
science. Ontology management software is needed, and is gradually
evolving; perhaps a suitable machinery for software development in
this area will be based on languages such as Prolog that have
significant logical inference capability. Exciting new
developments continue to emerge: examples are e-learning systems,
Grid technology, distributed virtual reality and ever more
powerful supercomputing. Despite all this, the processing
efficiency of the human brain is still far beyond anything
we can currently envisage.

If that was an upbeat and technically optimistic review, Roberta
Balstad Miller, Director of the Earth Institute at Columbia
University, rang warning bells over the potential to abuse
demographic and other human-population data gathered for
scientific research purposes. Science had contributed much to
warfare and human oppression in the 20th century through atomic
and chemical weapons and bioterrorism; without responsible
management, social data held the potential for large-scale harm in
the 21st century. Existing controls, such as the 72-year embargo
on census data in the US, are well intentioned but inadequate as
the combination of multiple databases can allow data mining and
discovery of detailed personal information about individuals. She
argued for a widespread educational programme to raise awareness
of these concerns, and the establishment of protocols allowing
independent academic advisory committees to work alongside
Government bodies in the collection and management of large social
data sets. CODATA had an important role to play in leading the
necessary educational programme and in defining appropriate checks
and balances. Technology-driven monitoring solutions would be
helpful, but the problem needed to be brought into the global
spotlight and might need to be addressed through international
treaties.
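
Her point about combining databases can be made concrete with a toy
example. The Python sketch below (all records invented) joins a
public, named register to an "anonymised" research file on shared
quasi-identifiers; neither file alone is revealing, but the join is.

    # Invented records, purely for illustration.
    census = [   # public, named register
        {"name": "A. Smith", "zip": "10115", "born": "1950-03-01", "sex": "F"},
    ]
    survey = [   # "anonymised" research file
        {"zip": "10115", "born": "1950-03-01", "sex": "F", "condition": "X"},
    ]

    key = lambda r: (r["zip"], r["born"], r["sex"])
    named = {key(r): r["name"] for r in census}
    for r in survey:
        if key(r) in named:
            print(named[key(r)], "->", r["condition"])   # re-identified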

Mark-up Languages
-----------------
Brian Matthews of CCLRC Rutherford Labs, which hosts the UK office
of W3C, surveyed the tools developed for and promoted by W3C as
essential components of "The Semantic Web and Science Communities".
These were promoted as the standards for implementing the vision
of the Web of Culture and Science presaged by Jürgen Renn's
keynote address. Current thinking on the semantic web uses a
layered model: Unicode and URIs provide the base layer, on which
are overlaid: XML with namespaces and schemas as a transport
layer; the resource description framework (RDF) for metadata;
ontology vocabularies managed by languages such as OWL to express
formal relationships; and above that, layers of logic, proof and
trust that were still to be addressed. Notable among emerging
projects to develop languages suitable for supporting thesauri is
SKOS.
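
To give a flavour of these layers in practice, the sketch below
builds a two-concept SKOS thesaurus fragment, assuming the Python
rdflib library is available; the URIs and the choice of library are
mine, purely for illustration.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, SKOS

    EX = Namespace("http://example.org/thesaurus/")   # invented namespace
    g = Graph()
    g.bind("skos", SKOS)

    g.add((EX.crystallography, RDF.type, SKOS.Concept))
    g.add((EX.crystallography, SKOS.prefLabel,
           Literal("crystallography", lang="en")))
    g.add((EX.diffraction, RDF.type, SKOS.Concept))
    g.add((EX.crystallography, SKOS.narrower, EX.diffraction))  # hierarchy
    g.add((EX.diffraction, SKOS.broader, EX.crystallography))

    print(g.serialize(format="turtle"))   # emits the RDF as Turtle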

Haruki Nakamura of the Institute for Protein Research at Osaka
University presented PDBML as an example of an XML language built
on a formal ontology (the mmCIF dictionary) and now used as the
standard exchange mechanism between the components of the
Worldwide Protein Data Bank (wwPDB). Standard XML tools can be
used to manage the data in this format; for example, XPath searches
can express quite complex queries. The PDBjViewer is an
alternative to RasMol for protein structure visualisation, and can
be distributed as a Java applet, demonstrating the platform
independence essential for sustained progress. The presentation
also described a biomolecular simulation markup language (BMSML) 
that was being developed under a grid architecture to allow
biosimulations at multiple size scales simultaneously.
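
To indicate what an XPath search over such data involves, here is a
minimal Python sketch; the element names are simplified stand-ins,
not the real PDBML schema (which is generated from the mmCIF
dictionary).

    import xml.etree.ElementTree as ET

    # Simplified, invented PDBML-like fragment.
    doc = """
    <datablock>
      <atom_siteCategory>
        <atom_site id="1"><type_symbol>N</type_symbol><occupancy>1.0</occupancy></atom_site>
        <atom_site id="2"><type_symbol>C</type_symbol><occupancy>0.5</occupancy></atom_site>
      </atom_siteCategory>
    </datablock>
    """
    root = ET.fromstring(doc)

    # XPath-style predicate: all atom_site records of a given element type
    for site in root.findall(".//atom_site[type_symbol='C']"):
        print(site.get("id"), site.findtext("occupancy"))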

Peter Murray-Rust described his ongoing work with CML and
presented his collaborative work with Acta as an example of the
ease of interoperability between structured data representations
such as CIF. He also presented his standard appeal for explicit
licensing declarations in machine-readable format to promote data
reuse, and his advocacy of the need for community cooperation.

Data Archiving
--------------
CODATA has had an active interest for some time in long-term
preservation and access, and there were a number of sessions and
presentations on this topic. Increasingly, archival solutions are
designed under the influence of the Open Archival Information
System (OAIS) Reference Model; but, although this provides an
essential conceptual framework for the management of large
systems, its richness and complexity can be overwhelming for small
organisations. In a very nice presentation of her doctoral research
project, Jacqueline Spence of the University of Wales at Aberystwyth
demonstrated a questionnaire-based approach to scoring small
organisations' performance within the OAIS framework. The
objective is not so much to rank by merit, but to demonstrate the
areas where work is needed (and possibly to highlight areas where
work is not needed, according to the requirements of the
organisation). The scorecard is useful especially for allowing
organisations to work together collaboratively to ensure that the
archiving function is delegated and managed at an appropriate
level. I am not sure that the actual scoring methodology is
optimal (numeric scores assigned to risk and perceived
requirements are added, where multiplication might seem a better
weighting); but the idea suggests how small(ish) organisations can
present their actual archiving abilities and status in a reasonably
understandable and standard way. This could be very helpful, for
instance, in my long-standing desire to record the
crystallographic databases' status as archives.
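
The point about scoring arithmetic is easily illustrated. In the toy
comparison below (Python, invented numbers), adding risk to
requirement still awards a substantial score to a function the
organisation does not need at all, whereas multiplying drops it out.

    # Invented scorecard entries: (risk 1-5, requirement 0-5).
    items = {
        "ingest":           (3, 5),   # moderate risk, critical function
        "format migration": (5, 0),   # high risk, not required here
    }
    for name, (risk, need) in items.items():
        print(f"{name:16s} additive = {risk + need}  multiplicative = {risk * need}")
    # additive:       ingest 8, format migration 5 (still looks significant)
    # multiplicative: ingest 15, format migration 0 (drops out, as arguably it should)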

The US National Committee for CODATA has been working with an
Archiving Task Group in collaboration with ICSTI to create a
portal site for resources connected with the archiving of
scientific data. The prototype site
(http://stills.nap.edu/shelves/codata/index.html) demonstrates the
potential uses of the portal, although its development is hampered
by the content management system used in this prototype. It is
hoped that the fully developed portal will be hosted in a
developing country as a capacity-building exercise. Note that this
fits in well with my suggestion some time back to provide
information about domain-specific data resources through CODATA
(perhaps with archiving activities measured through a scorecard of
the type mentioned above).

The new Digital Curation Centre was introduced by David Giaretta,
its Associate Director (Development). The DCC (http://www.dcc.ac.uk) 
was established following a recommendation in the JISC Continuing
Access and Digital Preservation Strategy (October 2002) to establish
a UK centre to address challenges that no single institution or
discipline could solve alone, including generic services,
development activity, and research. It does not seek to be a
repository of primary research data, but might nevertheless be a
useful establishment for providing us with advisory services,
ideas, tools and access to standards. The DCC development site is
at http://dev.dcc.rl.ac.uk and includes some demonstration
projects (see e.g. the astronomy FITS example, which has some
parallels with our CIF development).

A German equivalent is nestor, a distributed partnership of German
libraries and museums (www.langzeitarchivierung.de).

Among other points of interest to emerge from the presentations in
these sessions I noted the following.

China recognises long-term preservation and access as an objective
specifically listed in the WSIS draft plan of action. Chinese
receiving stations for the NASA MODIS (imaging spectroradiometer)
satellite programme can distribute received data online within an
hour, while within the same time frame the data are also entered
into a long-term storage system.

The OAIS idea of a "designated user community" is important in
designing archive systems, but developers must be aware that there
may well be unanticipated demands for use by a broader user
community. Some principles of good practice follow:
* define a user community with allowance for outreach (but within
  reason);
* engage non-technical authors to write the documentation for data
  centres (obviously in collaboration with the technicians);
* design architectures that rely on transparency, interoperability,
  extensibility, and storage or transaction economy;
* ensure that uncertainties in data are properly documented.

These principles are being applied in metadata and ontology
development for a German project concerned with the *very*
long-term preservation of digital information (specifically, that
relating to nuclear waste disposal sites where the design goal is
to make information available for at least 100,000 years). An
important component of this is seen to be crafting ontologies that
are aware of IT infrastructure (the principles of storage,
database formats, communications channels and security), so that
these can also be migrated to new platforms over time. A useful
backup mechanism is the HD-Rosetta approach of etching text or
other analogue information microscopically on a hardened nickel
substrate (e.g. http://www.norsam.com/hdrosetta.htm).

NASA itself is building more complex archiving applications on top
of the OAIS model, and increasingly integrating these into live
projects. The motivation behind well-characterised software
systems is to create complex systems that self-adjust to the
loss of one or more components in a network of satellites and
receiving stations. The NASA view is that archiving and e-science
together are essential for 21st-century science and technology.

Open Scientific Communications/Publication and Citation of
Scientific Data
----------------------------------------------------------
Norman Paskin of the International DOI Foundation discussed the
use of digital object identifiers (DOIs) for scientific data
sets. DOIs are used in publishing to identify literature articles
and, through searching of associated bibliographic metadata, to
provide a linking service for publishers through the CrossRef
registration agency. Similar functionality is possible for
scientific data sets. DOIs are intended as persistent identifiers,
and allow for more reliable long-term access than ad hoc and
frequently transient URLs. Two case studies were presented of
projects employing interesting DOI applications with science
data. One is the "Names for Life" project, which proposes DOIs as
persistent identifiers of taxonomic definitions. Because taxonomic
definitions change over time, the unambiguous identification of a
species can be difficult. Assignment of a DOI to a specific
definition, and the provision of forward linking to synonyms or
other related resources, will provide an audit trail of taxonomic
changes, and allow both the unambiguous identification of a cited
species and an understanding of the contemporary definition in its
historical context. Note the distinction between an identifier for
a specific data record (a taxonomic description) and an identifier
for a concept (the taxon itself). DOIs are most likely to be used
for the former purpose, since concept identifiers tend to be
domain-specific (e.g. genus/species scientific names, InChIs,
phase identifiers, chemical element symbols...). Nonetheless, the
use of DOIs as concept identifiers is not entirely ruled out,
especially if there is no existing systematic identification
scheme in place.
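
Mechanically, a DOI resolves through the central Handle System proxy
(dx.doi.org at the time; https://doi.org/ today), which is what
decouples the identifier from the hosting URL. A minimal Python
sketch follows; the DOI below is a placeholder and will not actually
resolve.

    import urllib.request

    doi = "10.1234/example"   # placeholder; a registered DOI would redirect
    req = urllib.request.Request(f"https://doi.org/{doi}", method="HEAD")
    with urllib.request.urlopen(req) as resp:
        print("resolves to:", resp.url)   # final URL after redirection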

Paskin's second example was the assignment of DOIs to climate data
from the World Data Center for Climate (WDCC) in Hamburg. The
German National Library for Science and Technology (TIB, Hannover)
is acting as the registration agency in this case, and the WDCC
application is a pilot within a longer-term project to define
metadata suitable for different disciplines. TIB has an objective
of becoming the central registration agency for scientific primary
data by 2006. Michael Lautenschlager of the Hamburg WDCC gave more
details of the pilot project, and made it clear that one of their
objectives was to promote academic credit associated with the
"publication" of primary data sets identified by DOIs, together
with integration of data sets into library catalogues and their
appearance in the Science Citation Index.

I chatted to Paskin about these developments, and mentioned that I
thought they were filling an important need, one that CrossRef had
declared itself unwilling to take on board when we spoke with them
some years ago. Subsequently, however, I discovered that CrossRef
have been discussing with the PDB the assignment of DOIs for
protein structures, and so the field appears to be opening
up. There are a number of considerations that will come into play:
will CrossRef or TIB create the better set of metadata for
characterising scientific data? Is there a case for distinguishing
between "primary" data and "supplementary" data associated with
publications? What will be the financial model for scientific data
publication?

In a presentation on "Open Access to Data and the Berlin
Declaration", Jens Klump of the GeoForschungsZentrum Potsdam
also proposed that data centres could act as the equivalents of
data publishers within an open-access environment. He proposed
that the Berlin Declaration, and its effective endorsement by
Governments in the OECD Final Communiqué of January 2004
(http://www.oecd.org/document/15/0,2340,en_21571361_21590465_25998799_1_1_1_1,00.html)
should apply also to data. The key components of such a model
would be: irrevocable free access, worldwide; licences to copy,
use or distribute; licences for derivative works; and availability
through at least one long-term archival gateway. At this point, a
major difficulty was in formulating principles of "fair use" for
applications of openly-accessible scientific data.

Heinrich Behrens presented a paper considering the growth in the
number of publications in scientific literature and data since the
seventeenth century. Growth curves rise very rapidly over this
period, but in the absence of any underlying model the best one can
do is fit empirical functions statistically. Often growth
curves are fitted by exponentials, sometimes by a succession of
exponentials when the curve exhibits changes in growth rate over
time. Behrens demonstrated that statistical residuals could be
much smaller if multiple quadratics were fitted through the same
empirical data points. While the differences in fitting past
curves were small, future growth predictions will of course
differ markedly depending on whether exponentials or polynomials
are extrapolated. It would be interesting to predict growth in
CCDC or PDB by extrapolating quadratic fits into the future.
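
The underlying point is easy to reproduce. In the Python sketch below
(synthetic data, invented growth parameters), exponential and
quadratic fits describe the historic series comparably well yet
diverge markedly on extrapolation.

    import numpy as np

    years = np.arange(1960, 2005, 5).astype(float)
    counts = 120 * np.exp(0.045 * (years - 1960))    # synthetic "growth" series

    # exponential fit: a straight line in log space
    b, log_a = np.polyfit(years - 1960, np.log(counts), 1)
    exp_2020 = np.exp(log_a + b * 60)

    # quadratic fit in the original space
    quad = np.polyfit(years - 1960, counts, 2)
    quad_2020 = np.polyval(quad, 60)

    print(f"extrapolated to 2020: exponential {exp_2020:.0f}, "
          f"quadratic {quad_2020:.0f}")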

A paper that wasn't in fact presented nevertheless had an
interesting abstract demonstrating the close synergy between data
and publications in astronomy.
(http://www.codata.org/04conf/abstracts/OpenSciComm/Genova-Informationnetworking.htm)

Data Quality
------------
Ronald G. Munro of the Ceramics Division of NIST gave a talk on
"Data Evaluation as a Scientific Discipline", which presented a
mathematical model for assessing quality, but also made a number
of interesting general points. One was that the objective of data
evaluation should be considered as ascertaining the credibility of
data. Another was the benefit of classifying quality indicators
into functional groups - at NIST a useful scheme (roughly in
ascending order) was: Unacceptable / Research / Commercial /
Validated / Unevaluated / Typical / Qualified / Certified.

Volkmar Vill of the University of Hamburg demonstrated some
applications of SciDex, an object-oriented database allowing 2D
and 3D data sets as data types. The system was developed for
implementation of LiqCryst, a liquid crystals database, and hence
contains some rather general chemical validation methods (such as
substructure comparison) that fit it for other purposes. It has
been used to create a search engine for the index of Springer's
Landolt-Börnstein Online, as well as a number of other scientific
databases: 29Si-NMR, Phytobase, Hazardous Substances...

World Summit on the Information Society
---------------------------------------
The World Summit on the Information Society (WSIS) takes place, in
two stages, in Geneva in December 2003 and Tunis in November 2005,
organised by the International Telecommunication Union under the
patronage of the UN Secretary-General. It aims to bring together
Heads of State, Executive Heads of United Nations Agencies,
industry leaders, non-governmental organizations, media
representatives and civil society in a single high-level event, to
discuss the broad range of questions concerning the Information
Society and move towards a common vision and understanding of this
societal transformation.

ICSU and CODATA worked closely together to raise the visibility of
science as a contributor to the information society at the first
leg of the Summit. Now ICSU wishes to delegate to CODATA more
involvement in the run-up to the Tunis event. The WSIS Session
during the CODATA conference is part of that involvement.

The first phase of the summit produced an Agenda for Action that
includes a number of charges related to science. The most relevant
single item is

22. E-science
a) Promote affordable and reliable high-speed Internet connection
   for all universities and research institutions to support their
   critical role in information and knowledge production, education
   and training, and to support the establishment of partnerships,
   cooperation and networking between these institutions.
b) Promote electronic publishing, differential pricing and open
   access initiatives to make scientific information affordable and
   accessible in all countries on an equitable basis.
c) Promote the use of peer-to-peer technology to share scientific
   knowledge and pre-prints and reprints written by scientific
   authors who have waived their right to payment.
d) Promote the long-term systematic and efficient collection,
   dissemination and preservation of essential scientific digital
   data, for example, population and meteorological data in all
   countries.
e) Promote principles and metadata standards to facilitate
   cooperation and effective use of collected scientific
   information and data as appropriate to conduct scientific
   research.

The CODATA session aimed specifically to highlight the initiatives
currently under way in the scientific community relating to the
Agenda Action items, and to identify particular outstanding
problems. A round-table discussion was structured around five
questions that had previously been distributed to attendees. Below
I give terse summaries of some of the points raised.

1. What are the major challenges regarding scientific data
   management and access?
..........................................................
* 20,000 petabytes of data are being produced annually. The
  problem is not just one of access, but of the usability of such
  volumes.
* Access and connectivity are essential, but only as a first step.
  We also need new techniques for knowledge discovery, which depend
  on an ability to integrate knowledge at different scales.
* New forms of dissemination are potentially useful in helping
  policy makers and the general public to understand scientific
  issues.
* Funding is a common problem - how to persuade governments to
  finance data management as well as the basic science?
* There is a lack of resources (and interest) in digitising
  heritage data (e.g. astronomical photographic plates).
* There remains a mismatch in the collection of environmental data
  between what is being gathered and what is actually of most use,
  particularly in the developing world.
* The International Mathematical Union is working on the goal of
  digitising all mathematical publications to produce a complete
  digital library of mathematics.
* Geodiversity needs to be emphasised.
* WSIS should emphasise the need for common data standards.
* Personnel in the developing world need to become more involved.
  There are issues of language and training; and specifically a
  lack of awareness of the need for archives.
* INASP emphasised the need for improved access as a first step,
  and can provide many examples of how the benefits of increased
  bandwidth to developing institutions are very quickly realised.
* The Third World Academy of Science acknowledges the need for
  archiving, but their priority is rapid access to the latest
  information.

2. What issues and accomplishments should be highlighted at Tunis?
..................................................................
* Need to discriminate among different types (i.e. quality) of data.
* Want to see more new horizons for science arising from WSIS, and
  a proper respect for, and understanding of, the role of science
  within the broader Information Society.
* The IAU wants to see *better* science coming out of WSIS, and a
  culture change. Data should be taken seriously; the science is
  not finished until the associated data have been publicly posted.
* NASA looks forward to the emergence of a common language of
  science, with more collaborations in scientific endeavour.

3. Activities relating to e-Science
...................................
* The International Polar Year of 2007/8 (marking the 50th
  anniversary of the International Geophysical Year) 
  demonstrates the role of science in promoting international
  cooperation.
* The World Data Centres offer another good example.
* The forthcoming "Electronic Geophysical Year" will contribute
  towards the new horizon of taking data and information seriously.
* A project is under way to create a 1:1,000,000 digital map of
  the entire world, with eight layers of sustainable
  development. The best input so far has come from the developing
  world.
* The OAI-PMH transport mechanism for metadata in the provision of
  open access is a noteworthy achievement (a minimal request is
  sketched below).
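
A feel for how lightweight the OAI-PMH protocol is: it consists
entirely of HTTP GET requests carrying a "verb" parameter and
returning XML. A minimal harvesting sketch in Python; the endpoint
URL is a made-up placeholder.

    import urllib.parse
    import urllib.request

    base = "http://repository.example.org/oai"   # hypothetical endpoint
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    with urllib.request.urlopen(base + "?" + urllib.parse.urlencode(params)) as r:
        print(r.read(500).decode())              # start of the XML response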

4/5. What outcomes and actions are expected?
............................................
* Renewed efforts towards the provision of electricity and power
  globally - no data if no power!
* Much science is based on relationships, and initiatives
  promoting interpersonal contacts should be encouraged.
* ICT developments may lead to an entirely different structure for
  science in the future - CODATA should paint the picture of what
  science will be like in 15 years' time.
* The exercise of producing an inventory of specific activities is
  very important, but should not end with the Tunis summit.
* Scientists need to engage more with policy makers on issues of
  relevance. Internet governance is one such area.
* The summit is an opportunity to emphasise the non-monetary value
  of sharing knowledge. This is understood intrinsically within
  scientific culture, but may need to be spelled out to the world at
  large.
* Intellectual property rights must be managed sensitively in
  cooperation with WIPO.
* Open access to data and equitable access to publications remain
  specific goals that should emerge from the WSIS summit.


Summary
=======
CODATA 2004 billed itself as the first major interdisciplinary
conference addressing new horizons for science in the information
society. The organisers believed that it had merited that
description. There were 260 participants from 28 countries, and
activities of most of the scientific unions were represented. The
participation by representatives from ICSU, UNESCO, IIASA and the
African Academy of Languages was taken as evidence of the growth
of interest in CODATA.

The IUCr observes the basic policy of non-discrimination and affirms the right and freedom of scientists to associate in international scientific activity without regard to such factors as ethnic origin, religion, citizenship, language, political stance, gender, sex or age, in accordance with the Statutes of the International Council for Science.