Crystallographic data

Workshop on Research Data Management

Friday 26 May 2017

Hyatt Regency Hotel, New Orleans, USA

This is the third and final Workshop organised by the Diffraction Data Deposition Working Group (DDDWG), appointed by the IUCr Executive Committee to define the need for and practicalities of routine deposition of primary experimental data in X-ray diffraction and related experiments. It will take the form of a full-day workshop at the 2017 American Crystallographic Association Meeting with lectures from crystallographic practitioners, data management specialists and standards maintainers.

Objective: This workshop has two plenary sessions:

  • What every experimentalist needs to know about recording essential metadata of primary (raw) diffraction data
    This will include: sample preparation and characterization; correct recording of instrument axes, correction factors and calibration (with input from instrument manufacturers); attention to diffuse scattering or other interesting "metadata".
  • Research Data Management policy mandates and requirements on Principal Investigators (PIs)
    This will include: metadata standardization; data repositories; linking of primary data to publications.

There will also be an optional technical session:

  • High-data-rate/high-performance-computing issues of research data management for MX
    For synchrotron- and XFEL-based macromolecular crystallography (MX), high source brightness and the new generation of pixel-array detectors raise big-data, high-performance-computing and high-performance-networking issues in research data management. An optional early-evening sub-session of the Research Data Management workshop will discuss these issues, including appropriate hardware choices and programming techniques that are useful in this context. All registrants for the Research Data Management workshop are welcome to attend this sub-session.

There is a public forum for discussion of the issues covered in this workshop at http://forums.iucr.org.

Acknowledgements: We express our gratitude to the organizations and companies listed in the 'About our sponsors' section below, without whose generous contributions these events could not take place.

Programme

Friday 26 May

08:30 Open

Session I: What every experimentalist needs to know about recording essential metadata of primary (raw) diffraction data

08:30-08:40 John R. Helliwell and Brian McMahon Introduction to the DDDWG 2017 Workshop on Research Data Management Abstract | Presentation (608 kB)

John R. Helliwell1 and Brian McMahon2
1 School of Chemistry, University of Manchester, M13 9PL, UK. Email: john.helliwell@manchester.ac.uk
2 IUCr, 5 Abbey Square, Chester CH1 2HU, UK. Email: bm@iucr.org

The IUCr Executive Committee established a Diffraction Data Deposition Working Group (DDDWG) to define the need for and practicalities of routine deposition of primary experimental data in X-ray diffraction and related experiments. Since the Working Group's first Workshop in Bergen, Norway (August 2012), important strides have been taken to make routine deposition of raw data a reality. The major facilitator for this has been the establishment of digital data storage repositories registered to issue persistent unique Digital Object Identifiers (DOIs) for raw datasets. Such repositories include universities (e.g. the University of Manchester), the EU's Zenodo initiative, and several centralised neutron, synchrotron and X-ray laser facilities. As stressed by John Westbrook of the PDB (http://www.iucr.org/resources/data/dddwg/bergen-workshop), metadata descriptors for raw data are vital for its effective re-use. The PDB has extensive experience of specifying metadata descriptors for structure factors, coordinates and B factors, as well as for cryoEM and bioNMR data depositions. Kroon-Batenburg and Helliwell [1] provided an example of appropriate metadata, critically including a picture of their diffractometer, for their local raw diffraction data archive. This archive has seen successful examples of raw data re-use, such as that by Wladek Minor and collaborators [2].

A second DDDWG workshop on 'Metadata for Raw Data' (Rovinj, Croatia, August 2015) brought together another wide range of global experts (http://www.iucr.org/resources/data/dddwg/rovinj-workshop), including the Chair of the IUCr Committee for the Maintenance of the CIF Standard (James Hester), who has vast experience of metadata descriptors for processed and derived data. An outcome of the second Workshop was 'checkCIF for raw diffraction data', a notional service akin to the existing IUCr checkCIF for processed structure factors and derived atomic coordinates data (http://checkcif.iucr.org).

This third Workshop at ACA 2017, New Orleans, broadly titled 'Research Data Management', includes the charge to Workshop participants to focus on metadata (including their experiences with processed structure factors and derived atomic coordinates data) and to help define as closely as possible the optimum metadata for raw diffraction data, so as to guide the raw data archives listed above. Re-use of raw data built upon good metadata descriptions has already been shown to be viable [1,2]. This should now be built on more energetically by the single-crystal diffraction community (including chemical crystallography), as well as by the various scattering, diffraction, imaging and spectroscopy techniques represented in the IUCr Commissions. Excellent headway has been made in defining SAXS and EXAFS metadata, for example. For an overview of raw diffraction data preservation and re-use, including an update on practicalities and metadata requirements, see the recent publication by Kroon-Batenburg et al. (2017) [3].

  1. Kroon-Batenburg, L. M. J. & Helliwell, J. R. (2014). Acta Cryst. D70, 2502-2509.
  2. Shabalin, I., Dauter, Z., Jaskolski, M., Minor, W. & Wlodawer, A. (2015). Acta Cryst. D71, 1965-1979.
  3. Kroon-Batenburg, L. M. J., Helliwell, J. R., McMahon, B. & Terwilliger, T. C. (2017). IUCrJ 4, 87-99.
08:40-09:00 Marvin L. Hackert, Luc Van Meervelt, John R. Helliwell and Brian McMahon The Science International Accord on Open Data in a Big Data World and the IUCr's response Abstract | Presentation (2.41 MB)

Marvin L. Hackert1, Luc Van Meervelt2, John R. Helliwell3 and Brian McMahon4
1 Department of Molecular Biosciences, University of Texas at Austin, Austin, TX 78712, USA
2 Chemistry Department, Universiteit Leuven, Celestijnenlaan 200F, BE-3001, Leuven, Belgium
3 School of Chemistry, University of Manchester, M13 9PL, UK. Email: john.helliwell@manchester.ac.uk
4 IUCr, 5 Abbey Square, Chester CH1 2HU, UK. Email: bm@iucr.org

Science is best served when access barriers to data (and publications) are low. Open Data in a Big Data World [1] is the IUCr's response to an international Accord [2] by ICSU, IAP, TWAS and ISSC on the values, in an emerging scientific culture of big data, of open data that are discoverable, accessible, intelligible, assessable and usable. Technological advances in scientific instrumentation and computer technology have dramatically increased the quantities of data involved in scientific inquiry. The Accord expresses the dependence of scientific assertions on supporting data and asserts that 'openness and transparency are the bedrock of modern science.' The IUCr supports this assertion, but argues that such data should also be subject to scrutiny through peer review and automated validation where possible, to look for systematic bias or error. An overlooked challenge in handling ever-growing volumes of data is the need to apply the same level of critical evaluation as has historically been applied to smaller data sets. Any software implementations used to scrutinize such data should employ open algorithms whose results can be cross-checked by independent implementations.

A major barrier to access is cost. Evaluating, storing and curating quality data is an expensive component of the scientific process, and care must be taken to understand how to obtain the maximum benefit from public funding of science.

  1. http://www.iucr.org/__data/assets/pdf_file/0011/125687/OpenData_crystallography_web.pdf
  2. http://www.icsu.org/science-international/accord
09:00-09:30 Herbert J. Bernstein What every experimentalist needs to know about recording essential metadata of primary (i.e. raw) diffraction data Abstract | Presentation (335 kB)

Herbert J. Bernstein
Rochester Institute of Technology, Rochester, NY, USA

As the rate of production of diffraction images rises to several hundred datasets per day per beamline, it is becoming increasingly important to record essential metadata in an efficiently retrievable form. It is impractical to expect to refer to laboratory notebooks and do manual metadata entry in such an environment. Indeed, as data rates increase further it will become impractical to handle the same images multiple times in order to transform metadata from one convention to another. The last time our community faced a similar speed-constrained transition was with the Dectris Pilatus pixel-array detectors, which strained the computers and networks of that time by producing ten images per second, leading to the adoption of the imgCIF/CBF and miniCBF metadata conventions. Now, with data arriving one to three orders of magnitude faster, the introduction of NeXus/HDF5 images, and the adoption of new experimental techniques including serial synchrotron crystallography, adoption of consistent, well-documented crystallographic-image metadata handling is essential to conserve processing resources and maximize beamline structure production. To this end, the necessary concordances of the imgCIF/CBF, miniCBF and NeXus NXmx metadata specifications [1,2,3] are being maintained on a common web site. In this talk we review compromises between a common minimal set of metadata to allow for processing of simple rotation data and richer sets of metadata needed for more demanding experiments. We also consider the implications of these choices for future reprocessing of archived datasets.
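
By way of illustration, the minimal Python sketch below lists the scalar metadata items stored in a NeXus/HDF5 master file with h5py; the file name is a placeholder, and the exact layout of real beamline files will vary.

    # Sketch only: enumerate scalar metadata datasets in a NeXus/HDF5 (NXmx-style)
    # master file. The file name is a placeholder; real files may be organized
    # differently from beamline to beamline.
    import h5py

    def dump_metadata(path="example_master.h5"):
        with h5py.File(path, "r") as f:
            def visitor(name, obj):
                # Scalar datasets are where items such as wavelength, detector
                # distance, pixel size and goniometer axis settings usually live.
                if isinstance(obj, h5py.Dataset) and obj.shape == ():
                    print(f"{name} = {obj[()]}")
            f.visititems(visitor)

    if __name__ == "__main__":
        dump_metadata()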

  1. H. J. Bernstein, J. M. Sloan, G. Winter, T. S. Richter, NIAC, COMCIFS, 'Coping with BIG DATA image formats: integration of CBF, NeXus and HDF5', Computational Crystallography Newsletter, 2014, 5, 12-18.
  2. A. S. Brewster, J. Hattne, J. M. Parkhurst, D. G. Waterman, H. J. Bernstein, G. Winter, N. K. Sauter, 'XFEL Detectors and ImageCIF', Computational Crystallography Newsletter, 2014, 5, 19-25.
  3. M. Mueller, 'EIGER HDF5 data and NeXus format', in Workshop on Metadata for raw data from X-ray diffraction and other structural techniques, 22-23 Aug 2015, Rovinj, Croatia.
Work supported in part by Dectris.
09:30-10:00 Loes Kroon-Batenburg Correct recording of metadata: towards archiving and re-use of raw diffraction images in crystallography Abstract | Presentation (630 kB) | Supplementary information (2.1 MB)

L. M. J. Kroon-Batenburg
Crystal and Structural Chemistry, Utrecht University, Utrecht, The Netherlands. Email: l.m.j.kroon-batenburg@uu.nl

In recent years scientists and policy makers have made major steps toward Open Science. The incentive is to allow validation and falsification of research based on the data, and to allow its re-use, since the acquisition of the data is mostly funded by the taxpayer. New methods and technologies can be developed with the availability of large databases covering diverse types of experiments. In this framework the IUCr established a Diffraction Data Deposition Working Group (DDDWG) with the aim of developing standards for the representation of raw diffraction data in crystallography. Two key issues play a role: the importance of persistent identifiers and the full recording of metadata. While discussions continue about which data to archive (only those related to published papers, or also data from incomplete or unsuccessful research that could be particularly interesting for the development of new science), the field should prepare itself for depositing fully self-contained data. A recent review [1] summarizes the ongoing developments. Ideally, metadata should comprise the following: identification of the image format, number of pixels, pixel sizes, byte-storage architecture, baseline offset and handling of overflows; information on the corrections that are applied (dark current, distortion correction, non-uniformity correction); detector gain; goniometer axis orientations and rotation directions; and information on the experiment such as exposure time, number of repeats, oscillation axis and range, wavelength used, beam polarization, detector position (or beam position) and offsets. Details and the importance of such information will be discussed. The necessity of using a structured language (DDL) that defines data names (tags) in data formats like CIF or NeXus [2], so as to ensure unambiguous interpretation, will be demonstrated. Awareness among detector manufacturers and experimentalists of the need to record sufficient metadata is essential, and guidelines for them are under way.
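
By way of illustration, the sketch below gathers the metadata items enumerated above into a single Python record that could later be mapped onto CIF or NeXus data names; all field names and example values are assumptions, not a prescribed schema.

    # Sketch only: one possible container for the per-dataset metadata listed in
    # the abstract, prior to mapping onto imgCIF/CBF or NeXus (NXmx) tags.
    # Every field name and example value here is an assumption for illustration.
    from dataclasses import dataclass, field

    @dataclass
    class RawImageMetadata:
        image_format: str                      # e.g. "miniCBF" or "HDF5"
        pixels: tuple                          # (nx, ny)
        pixel_size_mm: tuple                   # (x, y)
        byte_order: str                        # byte-storage architecture
        baseline_offset: int
        overflow_value: int
        corrections: list = field(default_factory=list)      # dark current, distortion, non-uniformity
        detector_gain: float = 1.0
        goniometer_axes: dict = field(default_factory=dict)  # axis name -> direction, rotation sense
        exposure_time_s: float = 0.0
        oscillation_axis: str = "omega"
        oscillation_range_deg: float = 0.0
        wavelength_A: float = 0.0
        polarization: float = 0.0
        detector_distance_mm: float = 0.0
        beam_centre_px: tuple = (0.0, 0.0)

    example = RawImageMetadata(
        image_format="miniCBF", pixels=(2463, 2527), pixel_size_mm=(0.172, 0.172),
        byte_order="little_endian", baseline_offset=0, overflow_value=1048500,
        corrections=["dark current", "flat field"], detector_gain=1.0,
        goniometer_axes={"omega": {"direction": (1, 0, 0), "sense": "+"}},
        exposure_time_s=0.1, oscillation_range_deg=0.1, wavelength_A=0.9795,
        polarization=0.99, detector_distance_mm=200.0, beam_centre_px=(1231.5, 1263.5),
    )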

  1. Kroon-Batenburg, L. M. J., Helliwell, J. R., McMahon, B. & Terwilliger, T. C. (2017). IUCrJ 4, 87-99.
  2. Bernstein, H. J., DDDWG Workshop (2015). http://www.iucr.org/resources/data/dddwg/rovinj-workshop
10:00-10:30 Coffee
10:30-11:00 D. Marian Szebenyi, Devin Bougie, Aaron Finke, Richard Gillilan, Jesse Hopkins, David Schuller and Werner Sun Research data management at CHESS Abstract | Presentation (1.70 MB)

D. Marian Szebenyi1, Devin Bougie, Aaron Finke, Richard Gillilan, Jesse Hopkins, David Schuller and Werner Sun
MacCHESS and CHESS, Cornell University, Ithaca, NY 14853, USA. 1E-mail: dms35@cornell.edu

Historically, the Cornell High Energy Synchrotron Source, CHESS, with its relatively small number of beamlines, has relied on users to manage their own data. The facility has provided adequate RAID storage at each station for a 6-8 week run, with some longer-term backup. The advent of increasing numbers of experiments involving massive amounts of data has strained this system. Accordingly, we have recently implemented a large, centralized, more organized system ('CHESS DAQ'), with separate storage for raw data, metadata, and general user data. Nightly incremental backups and full archiving at the end of each run protect against data loss. This system is used for most experiments at CHESS, with individual variations to suit the needs of users and staff. Our primary goal has been, and remains, to facilitate research by our users, by providing them the means to collect, process, and store the most useful data possible, while avoiding excessive bureaucracy.

BioSAXS raw data from SAXS and WAXS detectors (the two detectors record images simultaneously), as well as metadata, are written directly to CHESS DAQ. Processing is carried out locally on copies of the raw data, and processed data are backed up locally as well as on the DAQ and to user-supplied media.

Raw crystallographic data from the dedicated MX station, i.e. diffraction images, are stored locally, with users responsible for processing data and transferring raw and processed data to their home labs. Raw data are kept on-line for a few weeks and off-line for several years. Limited metadata are stored in image headers. A new database, to facilitate organization of raw data and metadata, is under development in parallel with adoption of a new user interface for data collection based on JBluIce.
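
As a hedged illustration of reading such headers, the snippet below uses the third-party fabio library (assumed to be installed; the file name is a placeholder) to print the metadata held in a single diffraction image.

    # Sketch only: print the header metadata of a single diffraction image.
    # Assumes the fabio library is installed; the file name is a placeholder.
    import fabio

    img = fabio.open("frame_0001.cbf")        # fabio also reads many other detector formats
    for key, value in sorted(img.header.items()):
        print(f"{key}: {value}")
    print("image dimensions:", img.data.shape)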

11:00-11:30 Andrew Allen, Fan Zhang, Jan Ilavsky and Pete Jemian Metadata for small-angle scattering measurements Abstract

Andrew J. Allen1, Fan Zhang1, Jan Ilavsky2 and Pete R. Jemian2
1Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, MD 20899, USA
2X-ray Science Division, Argonne National Laboratory, 9700 S. Cass Avenue, Argonne, IL 60439, USA

Measurements based on small-angle scattering (SAS) of X-rays or neutrons (SAXS or SANS) differ critically in several ways from those based on X-ray or neutron Bragg diffraction (XRD or ND) or on X-ray or neutron spectroscopic methods. XRD or ND measurements yield diffraction peaks at discrete scattering angles or scattering vectors, Q, from which a pattern may be identified, and from there the underlying crystal structure. Similarly, spectroscopic measurements frequently yield information directly relatable to bond energies or energies of transition within the sample material. In contrast, SAXS or SANS measurements yield data that usually comprise a smooth curve of SAS intensity as a function of scattering angle or Q. This requires interpretation in terms of the likely scattering features (inhomogeneities) that underlie the sample microstructure before a quantifiable data analysis can be carried out in any meaningful way. Thus, in archiving SAXS or SANS data, very significant emphasis is required on the metadata to accompany the measured data – both metadata providing detailed qualitative information on sample microstructures, and metadata providing detailed instrumental parameters and other information on the measurements themselves.

Metadata requirements for SAS are inextricably linked to aspects that may be more or less closely related to the measurements themselves. Examples might include the measurement configuration (SAXS versus SANS, transmission versus grazing-incidence geometry, 1D Bonse-Hart versus 2D pinhole camera, angular-dispersive SAXS or SANS versus time-of-flight SANS, etc.), the nature of the sample (e.g., precipitates in metallic alloys, pores in ceramics, polymer structures, nanoparticles in suspension, protein complexes, expected polydispersity in feature size and shape, etc.), absolute calibration and correction issues (e.g. for scattering geometry and Q-values, scattering intensity, effective sample volume), the effective spatial and Q-resolution, background subtraction issues, and even the requirements for common data formats and publication standards. This paper will discuss these issues and current ongoing international efforts within the SAS community to address them.
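
To make the Q calibration mentioned above concrete, the short sketch below converts a scattering angle and wavelength into the scattering-vector magnitude via the standard relation Q = 4π sin(θ)/λ; the numerical values are arbitrary examples.

    # Illustration only: magnitude of the scattering vector for a given
    # scattering angle 2-theta (degrees) and wavelength (angstroms).
    import math

    def q_from_angle(two_theta_deg, wavelength_A):
        theta = math.radians(two_theta_deg) / 2.0
        return 4.0 * math.pi * math.sin(theta) / wavelength_A   # in 1/angstrom

    print(q_from_angle(1.0, 1.54))   # small-angle example, ~0.07 1/angstrom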

11:30-12:00 General discussion
12:00-13:00 Lunch

Session II: Research data management policy mandates and requirements on Principal Investigators (PIs)

13:00-13:30 Marshall Ma Open Science and research data policy mandates and requirements on Principal Investigators (PIs) Abstract | Presentation (1.97 MB)

Marshall Ma
Department of Computer Science, University of Idaho, 875 Perimeter Drive MS 1010, Moscow, ID 83844-1010, USA

This invited presentation will explore the policy landscape relating to research data. It aims to cast light on the latest developments in funder, institutional and journal policies and to clarify a number of issues relating to Open Science, Open Data, FAIR Data, Research Data Management etc. A simplified but useful and well-tested typology describes three categories of publicly-funded research data: (1) data resulting from large data creation/collection exercises that are often cumulative (e.g. EO/remote sensing, statistical data, meteorological data, refined crystallographic data); (2) full datasets created by funded research projects; (3) data that directly underpin research publications as the evidence (often a subset of (2)). The presentation will analyse the data policies that exist in relation to these ‘types’ of data and the requirements they impose upon Principal Investigators and other parties.

The presentation will examine the benefits and challenges of Open Science and FAIR data in relation to the following issues and developments:

  • the major transformations and opportunities described in the Science International Accord on Open Data in a Big Data World;
  • the implications for peer review, for the scrutiny and validation of data, and for the way in which scientific contribution is assessed and recognised;
  • the need and opportunities for international development and coordination of standards and vocabularies within and across established disciplines;
  • the funding, governance and economic challenges for data resources being addressed by the CODATA-OECD Global Science Forum Project on Business Models for Sustainable Research Data Repositories.
13:30-14:00 Stephen Burley Research data management: structure factors and atomic coordinates Abstract | Presentation (3.85 MB)

Stephen K. Burley
Director, RCSB Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, 174 Frelinghuysen Road, Piscataway, NJ 08854, USA

The Protein Data Bank (PDB; pdb.org) was established in 1971 as the first open-access digital data resource in biology. Today, the PDB archive serves as the single global repository for more than 125,000 experimentally determined atomic-level structures of biological macromolecules (protein, DNA, RNA) and their complexes. The worldwide PDB (wwPDB; wwpdb.org) partnership, the international collaboration that manages the PDB archive, supports Deposition, Biocuration, Validation, and Distribution of PDB data. The mission of the wwPDB organization is to ensure that the PDB archive will continue in perpetuity as a high-quality, open-access digital data resource with no limitations on usage.

Through its global collaboration, the wwPDB has developed OneDep, a unified platform for Deposition, Biocuration, and Validation of 3D biological macromolecules experimentally determined by X-ray crystallography, NMR spectroscopy, and 3D Electron Microscopy. Data are submitted to the PDB archive via this OneDep system. OneDep is designed to help the wwPDB and the global structural biology research community meet the challenges of rapidly changing technologies and keep pace with evolving data archiving needs over the coming decades. The PDB archive and the OneDep system are underpinned by an extensible data architecture based on the PDBx/mmCIF dictionary (mmcif.wwpdb.org). Community involvement in the development of this data dictionary is coordinated by the wwPDB PDBx/mmCIF Working Group (Chaired by Paul Adams, LBL/UC Berkeley).

At present, ~90% of PDB holdings have been derived from diffraction methods. The earliest entry in the PDB archive for which structure factors are available was deposited in 1976. Deposition of structure factors became mandatory in 2008, and ~90% of all crystallographic entries now include these data.

Management of structure factors and atomic coordinates within the PDB archive will be discussed, with emphasis on current efforts to extend the range and the complexity of the diffraction data and metadata items that can be deposited.
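
By way of illustration, the sketch below writes a minimal PDBx/mmCIF-style structure-factor loop of the kind archived by the PDB; the reflections are invented example values, and real depositions carry many more _refln items.

    # Sketch only: emit a tiny mmCIF structure-factor loop. Values are invented.
    reflections = [
        # h, k, l, F_meas, sigma(F)
        (1, 0, 0, 123.4, 2.1),
        (1, 1, 0,  87.9, 1.8),
        (2, 1, 3,  45.2, 1.1),
    ]

    with open("example_sf.cif", "w") as out:
        out.write("data_example_sf\n")
        out.write("loop_\n")
        out.write("_refln.index_h\n_refln.index_k\n_refln.index_l\n")
        out.write("_refln.F_meas_au\n_refln.F_meas_sigma_au\n")
        for h, k, l, f, sig in reflections:
            out.write(f"{h} {k} {l} {f:.2f} {sig:.2f}\n")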

Acknowledgements: The RCSB Protein Data Bank (RCSB PDB; rcsb.org) is a founding member of the Worldwide Protein Data Bank organization (wwPDB; wwpdb.org). Additional members of the wwPDB include the Protein Data Bank in Europe (PDBe; pdbe.org), Protein Data Bank Japan (PDBj; pdbj.org), and BioMagResBank (BMRB; bmrb.org). Core RCSB PDB operations are funded by a grant to SKB (NSF DBI-1338415) from the National Science Foundation, the National Institutes of Health, and the US Department of Energy.

14:00-14:30 Wladek Minor The Integrated Resource for Reproducibility in Macromolecular Crystallography (IRRMC) Abstract | Presentation (7.24 MB)


W. Minor
Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA 22901, USA

The Integrated Resource for Reproducibility in Macromolecular Crystallography (IRRMC) has been developed as part of the BD2K (Big Data to Knowledge) NIH project to archive raw data from diffraction experiments and, more importantly, to extract metadata from diffraction images alone, or from a combination of information obtained from a PDB deposit and diffraction images. As of February 2017, the IRRMC resource contained indexed data from 3235 macromolecular diffraction experiments (6189 data sets), accounting for around 3% of all structures in the Protein Data Bank (PDB). The IRRMC utilizes a distributed storage system implemented with a federated architecture of many independent storage servers, which provides both scalability and sustainability. The resource, which is accessible via the web portal at https://www.proteindiffraction.org, can be searched using various criteria. All data are available for unrestricted access and download. The resource serves as a proof of concept and demonstrates the feasibility of archiving raw diffraction data and associated metadata from X-ray crystallographic studies of biological macromolecules. The goal is to expand this resource to include data sets that have failed to yield X-ray structures in order to facilitate collaborative efforts that will improve protein structure-determination methods and to ensure the availability of ‘orphan’ data left behind for various reasons by individual investigators and/or extinct structural genomics projects. Every dataset in the IRRMC resource is assigned a DOI (Digital Object Identifier), which should provide a reliable mechanism of data location, even if the URL or the maintainer of the data changes.
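
As a small illustration of the DOI mechanism described above, the sketch below resolves a dataset DOI through the standard doi.org redirect service to find where the data currently reside; the DOI string is a placeholder, not a real deposit.

    # Sketch only: follow the doi.org redirect for a dataset DOI.
    # Assumes the requests library; the DOI below is a placeholder.
    import requests

    def resolve_doi(doi):
        r = requests.head(f"https://doi.org/{doi}", allow_redirects=True, timeout=30)
        return r.url        # current landing page of the dataset

    print(resolve_doi("10.xxxx/example-dataset"))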

14:30-15:00 Simon Coles Research data management: administration, raw diffraction data, structure factors and coordinates at the UK's National Crystallographic Service (NCS) Abstract | Presentation (5.32 MB)

S. J. Coles
 UK National Crystallography Service, Chemistry, Faculty of Natural and Environmental Sciences, University of Southampton, Southampton, SO17 1BJ, UK

The need to manage, curate and disseminate data has become paramount in the modern era of academic research. The concurrent data explosion has prompted an increased requirement for transparency about how data are generated, and a greater responsibility and accountability for facilities to provide accurate and long-term mechanisms for archival and curation.

The NCS has led the way for chemical crystallography for around 15 years in developing approaches to addressing this problem [1, 2]. The eCrystals project (http://ecrystals.chem.soton.ac.uk/) developed an institutional repository approach to curating and disseminating coordinates, structure factors and a range of other information relating to the 'derived' data from a crystallographic experiment. Raw diffraction data, however, although rigorously archived and, for the last 15 years, highly curated, are only available on request directly from the NCS. eCrystals was designed to act as a discipline-specific data repository, which has resulted in a pragmatic metadata scheme for the description of its contents; this promotes discovery and reuse of the material it makes available.

There is also a necessary administrative function in running a facility that provides a service, and this is intrinsically related to the data itself. Over 30 years of operation the NCS has accumulated a range of databases, spreadsheets and forms to meet ever-changing requirements for administration, tracking and reporting. For the last 15 years a range of NCS projects has been researching and addressing this problem. However, becoming an EPSRC mid-range facility in 2010 prompted a review of requirements; with additional demands for enhanced user interaction and reporting, and with the Web becoming a more prevalent and mature technology that people readily engage with, 'Portal' was conceived. Portal aimed to bring together all the elements described above into a single unified and coherent system.

We have learnt a lot from this work, and Portal has largely achieved its goals; however, there are significant aspects of the data repository yet to be incorporated, as well as a need to maintain a modern codebase. We have therefore embarked on an 18-month project to address these matters. 'Portal - The Next Generation' will be a combination of a laboratory information system and a data repository with specific functions and plug-ins tailored for the operation of a crystallographic facility and its resulting data. The design objectives of this system, development progress and the potential for its availability to the community will be discussed.

  1. S. J. Coles et al. (2005), J. Appl. Cryst. 38, 819-826
  2. M. B. Hursthouse & S. J. Coles (2014), Crystallogr. Rev. 20:2, 117-154.
15:00-15:30 Peter Meyer, Stephanie Socias, Jason Key, Mercè Crosas and Piotr Sliz SBGrid Databank Abstract | Presentation (1.41 MB)

Peter A. Meyer1, Stephanie Socias1, Jason Key1, Mercè Crosas2 and Piotr Sliz1,3
1BCMP, Harvard Medical School and SBGrid Consortium, USA
2IQSS, Harvard University and Dataverse Project, USA
3Boston Children's Hospital, Dept of Pediatrics, USA

Access to experimental X-ray diffraction image data is fundamental for validation and reproduction of macromolecular models, and indispensable for development of improved data processing algorithms. We have established a diffraction data publication and dissemination system, the SBGrid Databank, to preserve diffraction datasets supporting published crystal structures. Published datasets are openly available through direct download and through Data Access Alliance (DAA) sites. Data deposition is open to all structural biologists, and datasets for unpublished structures can be held for later publication. Existing databases (such as the PDB and PubMed) are used to reduce the amount of additional information depositors need to provide. Reprocessing of published datasets is used to provide a baseline for ensuring that the datasets will be useful to other researchers. A set of REST APIs supports reprocessing pipelines, and allows users to access information about published datasets programmatically.
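
A purely hypothetical sketch of such programmatic access is shown below; the endpoint and response fields are assumptions for illustration, not the documented SBGrid Databank API.

    # Hypothetical sketch only: query metadata for a published dataset over REST.
    # The base URL, path and response fields are assumptions for illustration.
    import requests

    BASE = "https://data.sbgrid.org/api"      # assumed base URL
    dataset_id = "123"                        # placeholder identifier

    resp = requests.get(f"{BASE}/dataset/{dataset_id}/", timeout=30)
    resp.raise_for_status()
    print(resp.json())                        # e.g. DOI, related PDB entry, download size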

15:30-16:00 General discussion
16:00-16:15 Tea

Session III: High-data-rate/high-performance-computing issues of research data management in macromolecular crystallography

16:15-16:45 Jean Jakoncic, Herbert J. Bernstein, Alexei Soares, Wuxian Shi, Martin Fuchs, Robert Petkus, Robert Sweet and Sean McSweeney Dealing with the avalanche of data generated in high-data-rate macromolecular crystallography Abstract

Jean Jakoncic1, Herbert J. Bernstein2, Alexei Soares1, Wuxian Shi3, Martin Fuchs1, Robert Petkus1, Robert M. Sweet1, Sean McSweeney1
1Brookhaven National Laboratory, Upton, NY, USA
2Rochester Institute of Technology, Rochester, NY, USA
3Case Western Reserve University, Cleveland, Ohio

Newly commissioned state-of-the-art MX beamlines fitted with current advanced hybrid pixel detectors are now in operation. At the NSLS-II, AMX and FMX, two high-brightness microfocusing beamlines (> 10¹¹ and > 5×10¹² ph s⁻¹ µm⁻², respectively), are fitted with Dectris Eiger detectors and are equipped with advanced automation that will ultimately allow screening of up to 1000 crystals per day. We have seen throughput greater than 1 GB/s per beamline during demanding experiments and expect this to increase in the coming months. With this level of throughput, near-real-time data analysis feedback is a necessity. This requires infrastructure with a high-bandwidth network, fast-I/O large-scale storage and significant computational capacity. Optimized data processing software and pipelines are being developed to help cope with the throughput. We will present the current problems that the community is facing and some of the solutions that are currently deployed at various facilities.
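
A back-of-envelope sketch of the data volumes implied above follows; the frame size, frame rate and beam-time hours are assumed round numbers, not measured beamline figures.

    # Illustration only: rough daily data volume for one high-data-rate beamline.
    # All inputs are assumptions chosen for illustration.
    frame_mb = 8.0          # compressed detector frame, MB
    frames_per_s = 100      # sustained collection rate
    hours_per_day = 12      # hours of data collection per day

    gb_per_s = frame_mb * frames_per_s / 1024
    tb_per_day = gb_per_s * 3600 * hours_per_day / 1024
    print(f"~{gb_per_s:.2f} GB/s sustained, ~{tb_per_day:.1f} TB per beamline per day")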

16:45-17:15 Henry Gabb Intel Scalable System Framework Abstract

Henry Gabb
Intel Corporation, USA

The world depends on high-performance computing (HPC) to solve ever larger scientific, industrial, and societal problems, but we face growing technical and architectural challenges as HPC systems get larger. In traditional HPC, computation, memory/storage, and network performance are becoming more unbalanced so an integrated, holistic approach is needed for future systems. Also, different workloads (e.g. modeling and simulation, scientific visualization, big data analytics, machine learning) stress different parts of the system (compute, memory, I/O). This can lead to divergent, specialized system infrastructures that are dedicated to a particular type of workload. Specialized systems are more expensive to design, build, and manage because they do not benefit from economies of scale. They often require proprietary solutions that can limit software reusability. The solution to this problem requires innovative technologies that are tightly integrated. Intel Scalable System Framework (SSF) provides breakthrough compute, memory/storage, and network performance; a common infrastructure that supports a variety of workloads; standards-based programmability; and broad vendor availability. This is made possible by Intel’s broad portfolio of innovative compute, memory and storage, network fabric, and software technologies, which allows unprecedented co-design and system integration. Tighter component integration improves compute density, I/O bandwidth, and network latency while lowering power consumption and overall cost. Intel SSF creates a stable system target for software vendors to help reduce development and maintenance costs. HPC users benefit from a common infrastructure. Reference designs based on Intel SSF help lower entry barriers for equipment manufacturers while still allowing them to innovate. The technical details of each of these high-level Intel SSF features will be discussed.

17:15-17:45 Henry Gabb Intel software and programming tools ecosystem for HPC Abstract

Henry Gabb
Intel Corporation, USA

High-performance computing (HPC) users are no strangers to code optimization and performance tuning. However, future HPC systems are likely to be even more heterogeneous than they are now. The mix of CPU, GPU, FPGA, and ASIC architectures could be quite diverse. Software will have to be modernized to take advantage of this heterogeneity, and Intel has an extensive ecosystem of programming tools to help. The Intel Fortran and C/C++ compilers have extensive auto-vectorization capability to deliver maximum performance on Intel processors. Support for the most popular productivity language is provided through the Intel Distribution for Python. Intel also provides a wide range of performance libraries, e.g. the Intel Math Kernel Library (FFT and numerical linear algebra), the Intel Integrated Performance Primitives (compression/decompression, image, vision, and signal processing), and the Intel Data Analytics Acceleration Library (big data analytics and machine learning). Many Python modules, machine learning frameworks, and third-party libraries already take advantage of the Intel performance libraries. Finally, Intel offers programming tools to support parallel debugging and tuning at the vector, thread, and process level. The key features of each tool will be discussed.
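
As a minimal, hedged example (assuming a NumPy build linked against the Intel Math Kernel Library), the sketch below reports the BLAS/LAPACK backend in use and times a matrix multiplication that such libraries accelerate; the matrix size is arbitrary.

    # Sketch only: report NumPy's linear-algebra backend and time a matmul.
    import time
    import numpy as np

    np.show_config()                # lists MKL if NumPy is linked against it

    a = np.random.rand(2000, 2000)
    b = np.random.rand(2000, 2000)
    t0 = time.perf_counter()
    c = a @ b
    print(f"2000x2000 matmul took {time.perf_counter() - t0:.3f} s, checksum {c.sum():.3e}")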

17:45-18:15 General discussion
18:15 Close
18:30 ACA 2017 Opening Ceremony
 

About our sponsors

We acknowledge the generous financial support of the following partners.

The International Union of Crystallography (IUCr) is a scientific union whose objectives are to promote international cooperation in crystallography and to contribute to the advancement of crystallography in all its aspects.
The IUCr fulfills part of these objectives by publishing high-quality crystallographic research through nine primary scientific journals: Acta Crystallographica Section A: Foundations and Advances; Acta Crystallographica Section B: Structural Science, Crystal Engineering and Materials; Acta Crystallographica Section C: Structural Chemistry; Acta Crystallographica Section D: Structural Biology; Acta Crystallographica Section E: Crystallographic Communications; Acta Crystallographica Section F: Structural Biology Communications; Journal of Applied Crystallography; Journal of Synchrotron Radiation; and, launched for the International Year of Crystallography in 2014, IUCrJ, a gold open-access title publishing articles in all of the sciences and technologies supported by the IUCr.
CODATA, the Committee on Data for Science and Technology, is an interdisciplinary Scientific Committee of the International Council for Science (ICSU), established in 1966. Its mission is to strengthen international science for the benefit of society by promoting improved scientific and technical data management and use. CODATA works to improve the quality, reliability, management and accessibility of data of importance to all fields of science and technology. CODATA provides scientists and engineers with access to international data activities for increased awareness, direct cooperation and new knowledge. It is concerned with all types of data resulting from experimental measurements, observations and calculations in every field of science and technology, including the physical sciences, biology, geology, astronomy, engineering, environmental science, ecology and others. Particular emphasis is given to data management problems common to different disciplines and to data used outside the field in which they were generated.
For more than 50 years, Bruker has been driven by the idea of always providing the best technological solution for each analytical task. Today, more than 6,000 employees at over 90 locations on all continents are working on this permanent challenge. Bruker systems cover a broad spectrum of applications in all fields of research and development and are used in industrial production processes to ensure quality and process reliability. Bruker continues to build upon its extensive range of products and solutions, its broad base of installed systems and a strong reputation among its customers. As one of the world's leading analytical instrumentation companies, Bruker is strongly committed to fully meeting its customers’ needs and to continuing to develop state-of-the-art technologies and innovative solutions for today's analytical questions.
Wiley's Scientific, Technical, Medical, and Scholarly (STMS) business serves the world's research and scholarly communities, and is the largest publisher for professional and scholarly societies. Wiley's programs encompass journals, books, major reference works, databases, and laboratory manuals, offered in print and electronically. Through Wiley Online Library, online access is provided to a broad range of STMS content: over 4 million articles from 1,500 journals, 9,000+ books, and many reference works and databases. Access to abstracts and searching is free, full content is accessible through licensing agreements, and large portions of the content are provided free or at nominal cost to nations in the developing world through partnerships with organizations such as HINARI, AGORA, and OARE.