Cancer Imaging Program (CIP)
Last Updated: 10/28/16

Overview of Cancer Imaging Databases

Cancer Imaging Informatics

Michael W. Vannier, Edward V. Staab, and Laurence C. Clarke
National Cancer Institute
Biomedical Imaging Program
Bethesda, MD

Neuroimaging Databases, from Science Vol 292, June 1, 2001

Appendix: Outline of Issues Related to Neuroimaging Databases

  • Data contents
    • Imaging data
    • Metadata
    • Data import and export
    • Data quality
  • Data access
  • Data ownership, credit, and confidentiality
  • Database structure
  • Interactions with the community

The Problem:

  • Most biological knowledge is stored in databases
  • Creation, expansion, and integration of these databases has become central to the advancement of biology and medicine
  • Many databases are isolated "silos"
  • Medical imaging is unusual: publicly accessible databases are few, links to mainstream biological knowledge collections are absent, and few software tools exist for using them

Why are imaging databases important?

  • Images contain the phenotype
  • In other fields (e.g., astronomy, geoscience, neuroscience, …), the integration of image (and other) databases has had a revolutionary effect
    • Coalescence of the scientific community
    • Opening of the field to rapid technological advancement
    • Ability to address questions that could not otherwise be answered (e.g., trans-species, multiscale, ad hoc group collaboration)

As an example: Italy in the Middle Ages, a time of city-states (from "Building a Nation from a Land of City States," Lincoln Stein, Cold Spring Harbor Laboratory)

Effect on Trade & Technology

  • Italian city states had
    • Different legal & political systems
    • Different dialects & cultures
    • Different weights & measures
    • Different taxation systems
    • Different currencies
  • Italy generated brilliant scientists, but lagged in technology & industrialization

Italy, 1796, united parts under 1 flag
Italy, ca 1820, still fractured under several flags

Bioinformatics, ca. 2002, compares to Italy in 1820, with NCBI, Ensembl, UCSC, SGD, WormBase, FlyBase

Making Easy Things Hard
Sample Question: Give me all human sequences submitted to GenBank/EMBL last week.

Lots of ways to do it

  • Download weekly update of GenBank/EMBL from FTP site
  • Use official network-based interfaces to data:
    • NCBI toolkit
    • EBI CORBA & XEMBL servers
  • Use friendly web interfaces at NCBI, EBI
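As a hedged illustration of the network-interface route, the sketch below builds (but does not send) a query URL for NCBI's E-utilities `esearch` endpoint. E-utilities is not one of the interfaces named above and is used here as an assumption; the database name, search term, and seven-day window are illustrative choices.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumption: NCBI E-utilities esearch endpoint (not an interface named in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def recent_human_sequences_url(days=7, max_ids=100):
    """Build (without sending) a query for human nucleotide records from the last `days` days."""
    params = {
        "db": "nuccore",                   # nucleotide database (illustrative choice)
        "term": "Homo sapiens[Organism]",  # restrict to human records
        "reldate": days,                   # look back this many days
        "datetype": "pdat",                # filter on publication date
        "retmax": max_ids,                 # cap the number of returned IDs
    }
    return ESEARCH + "?" + urlencode(params)


url = recent_human_sequences_url()
print(url)
```

Whether "publication date" is the right reading of "submitted last week" depends on the archive's own date fields, which is part of why the easy question is hard.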

Creating a Bioinformatics Nation
Nature Vol 417, 9 May 2002, pp 119-120

Special Supplement — Nature Genetics
September 2002: A user’s guide to the human genome

3 slides from Collation talk
Harold Varmus:
"…all modern biologists using genomic methods have become dependent on computer science to store, organize, search, manipulate and retrieve the new information.
Thus biology has been revolutionized by genomic information and by the methods that permit useful access to it."

  • Foreword: Power to the people (A D Baxevanis & F S Collins)
  • Perspective: Genomic empowerment: The importance of public databases (H Varmus)
  • User’s Guide
    • Question 1: How does one find a gene of interest and determine that gene’s structure? Once the gene has been located on the map, how does one easily examine other genes in that same region? pp 9-17
    • Question 2: How can sequence-tagged sites within a DNA sequence be identified? pp 18-20
    • . . .
    • Question 12: How does a user find characterized mouse mutants corresponding to human genes? pp 66-69
    • Question 13: A user has identified an interesting phenotype in a mouse model and has been able to narrow down the critical region for the responsible gene to approximately 0.5 cM. How does one find the mouse genes in this region? pp 70-73

(protein/DNA) Sequence Data and Molecular Biology Journals

  • Prior to publication, peer-reviewed molecular biology journals require that the authors deposit their data sets in a publicly accessible archive and obtain an Accession Number.
  • The Accession Number is included with the publication (in both printed and electronic forms).
  • In many cases, the software tools used to analyze the sequence data are in the public domain

INSIGHT — Imaging Tools
from the Visible Human Project
Terry S. Yoo, Ph.D.
Office of High Performance Computing and Communications
National Library of Medicine

What is it?: Insight

  • A toolkit for registration and segmentation.
  • A common Application Programming Interface (API).
  • A validation model for segmentation and registration.
  • Open-source resource for future research.

A New Research Program

  • Image Segmentation
    • multivalued (multimodal) data
  • Image Registration
    • rigid and deformable registration
  • Validation
    • Generation of mathematical models as test data
    • Acquisition of validation datasets from medical scanners
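To make the validation idea concrete, here is a minimal sketch that uses a mathematical model (a binary disk) as test data and scores a candidate segmentation against it. The Dice overlap used here is one common choice of metric and is an assumption, not necessarily the measure adopted by the Insight project; the function names and the shifted-disk "segmentation" are hypothetical.

```python
def disk_mask(size, cx, cy, radius):
    """Binary disk on a size x size grid: a simple mathematical test object."""
    return [
        [1 if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2 else 0 for x in range(size)]
        for y in range(size)
    ]


def dice(a, b):
    """Dice overlap, 2 * |A and B| / (|A| + |B|), between two binary masks."""
    inter = sum(p & q for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b))
    total = sum(map(sum, a)) + sum(map(sum, b))
    return 2.0 * inter / total if total else 1.0


truth = disk_mask(64, 32, 32, 10)      # known ground truth from the model
candidate = disk_mask(64, 34, 32, 10)  # stand-in for a segmentation result, shifted 2 pixels
score = dice(truth, candidate)
print(round(score, 3))
```

A score of 1.0 means perfect agreement with the model and 0.0 means no overlap; real validation would replace the shifted disk with the output of an actual segmentation method.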

Why is it?

  • Segmentation and registration are recognized as major areas of medical imaging research.
  • Common platforms may encourage communication and dissemination of research results.

Who’s sponsoring it?

  • National Library of Medicine (NLM)
  • National Institute of Dental and Craniofacial Research (NIDCR)
  • National Institute of Neurological Disorders and Stroke (NINDS)
  • National Institute of Mental Health (NIMH)
  • National Eye Institute (NEI)
  • National Science Foundation (NSF)
  • National Institute on Deafness and Other Communication Disorders (NIDCD)
  • National Cancer Institute (NCI)

Contractors and Subcontractors
GE CRD - Bill Lorensen
MathSoft - Vikram Chalana
U Penn - Dimitris Metaxas
Harvard BWH - Ron Kikinis
U Penn - Jim Gee
Columbia U - Celina Imielinska
Kitware - Will Schroeder
UNC-CH - Stephen Aylward
U Tennessee - Ross Whitaker
U Pittsburgh - George Stetten
U Utah - Ross Whitaker

An Open Source Initiative

  • Encourages high-level technical communication.
  • Provides conventions (vs. standards) for inter-operable software development.
  • Establishes a baseline for improvement.
  • Opens the field to "beginners."
  • Creates common ground for product growth.
    • example: the creation of HTML enabled Web-based internet development
    • originally part of a broader Government sponsored initiative (incl. gopher, WAIS, etc.)

NIH Draft Statement on Sharing Research Data
Issued March 2002, by the NIH Office of Extramural Research

Proposed Effective Date: January 1, 2003

  • NIH will expect the timely release and sharing of final research data for use by other researchers.
  • NIH will require applicants to include a plan for data sharing or to state why data sharing is not possible.

What do we mean by data?

  • We mean final research data necessary to validate research findings.
  • Research data do not include:
    • laboratory notebooks
    • partial data sets
    • preliminary analyses
    • drafts of scientific papers
    • plans for future research
    • communications with colleagues
    • physical objects, such as gels or laboratory specimens

Statement will apply to:

  • Intramural scientists
  • Extramural scientists seeking
    • Grants
    • Cooperative agreements
    • Contracts

What Kind of Research Does This Apply To?

  • Data generated with support from the NIH, including:
    • Basic research
    • Clinical studies
    • Surveys
    • Other types of research
  • Unless human research participants’ identities cannot be protected
  • Especially important to share:
    • Unique data sets that cannot be readily replicated
    • Large, expensive data sets

Will NIH Provide Support for Data Sharing?

  • Yes
  • In grant application - budget and budget justification
  • Administrative supplements

Caveats for Studies Including Human Research Participants

  • Investigators need to be cautious with
    • Studies with very small samples
    • Studies collecting very sensitive data
  • However, even these data can be shared if
    • Safeguards exist to ensure confidentiality and protect the identity of subjects

What is Meant by Timely?

  • No timeline specified
    • Will vary depending on nature of the data collected
  • Investigators who collected the data have a legitimate interest in benefiting from their investment of time and effort
  • Therefore, they could benefit from first and continuing use but not from prolonged exclusive use

Why Share?

  • Extends NIH policy on sharing research resources
  • Reinforces open scientific inquiry
  • Encourages diversity of analysis and opinion
  • Promotes new research
  • Supports testing of new or alternative hypotheses and methods of analysis
  • Facilitates the education of new researchers
  • Enables the exploration of topics not envisioned by the initial investigators
  • Permits the creation of new data sets from combined data

How to Share Data

  • Provide in publications
  • Share under the investigator’s own auspices
  • Place data sets in public archives
  • Put data on a web site
  • Place in restricted access data centers or data enclaves
  • Other ways?

What Will NIH Applicants Need to Do?

  • Include a data sharing plan in application
    • Statement of how data will be shared
    • If not, why not
  • Where in application
    • End of research plan
    • Budget, budget justification if asking for funds
    • Significance, if creating an important scientific resource

Why "tag" images with concepts?

  • Retrieving images by concept
    • Different people use different words
    • The term used to catalog is not always the term used to search
  • Multi-resource knowledge acquisition (Medline, image repository, patient database)
  • Knowledge mining
    • "Are there unknown relationships between image findings and clinical history?"
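A minimal sketch of retrieval by concept: synonyms are mapped to a shared concept identifier, so an image cataloged under one term can be found by searching with another. The concept identifiers, synonym table, and image IDs are hypothetical; a real system would draw them from a controlled vocabulary.

```python
from collections import defaultdict

# Hypothetical concept identifiers and synonyms, for illustration only.
TERM_TO_CONCEPT = {
    "lung cancer": "C001",
    "pulmonary carcinoma": "C001",
    "lung neoplasm": "C001",
    "kidney stone": "C002",
    "renal calculus": "C002",
}


class ConceptIndex:
    """Index images by concept rather than by the literal cataloging term."""

    def __init__(self):
        self._images = defaultdict(set)

    def tag(self, image_id, term):
        # Store the image under the concept that the cataloger's term maps to.
        self._images[TERM_TO_CONCEPT[term.lower()]].add(image_id)

    def search(self, term):
        # Map the searcher's term to a concept, then return the tagged images.
        concept = TERM_TO_CONCEPT.get(term.lower())
        return sorted(self._images.get(concept, set()))


index = ConceptIndex()
index.tag("img-17", "pulmonary carcinoma")
index.tag("img-42", "Renal calculus")
print(index.search("lung cancer"))  # ['img-17']
```

The cataloger wrote "pulmonary carcinoma" and the searcher typed "lung cancer", yet the image is still found because both terms resolve to the same concept.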

Grid Computing

"The Grid"

  • Coined in the 1990s to denote a proposed distributed computing architecture.
  • "Flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions and resources"

From "The Anatomy of the Grid"

  • Resource Sharing
    • Computers, storage, sensors, networks, scientific instruments
    • Sharing is highly controlled: providers and consumers define
      • What is shared
      • Who is allowed to share
      • Conditions for sharing
  • Coordinated problem solving
    • Beyond client-server: distributed data analysis, visualization, computation, collaboration
  • Similar to the Power Grid, Faucets (Water supply), Nationwide Phone System.
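The three things providers and consumers define (what is shared, who may share, and under what conditions) can be sketched as a small policy record. This is an illustrative assumption, not how any Grid middleware actually represents policy; the class, field names, and the hour-limit condition are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SharingPolicy:
    resource: str                                     # what is shared
    allowed_users: set = field(default_factory=set)   # who is allowed to share
    max_hours: int = 0                                # one example condition for sharing

    def permits(self, user, hours):
        """Grant access only to listed users who stay within the time limit."""
        return user in self.allowed_users and hours <= self.max_hours


policy = SharingPolicy("image-archive-storage", {"alice@site-a", "bob@site-b"}, max_hours=4)
print(policy.permits("alice@site-a", 2))  # True
print(policy.permits("carol@site-c", 1))  # False
```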

Major Grid Projects

Name | URL & Sponsors | Focus
BlueGrid | IBM | Grid testbed linking IBM laboratories
 | DOE Defense Programs | Create operational Grid providing access to resources at three U.S. DOE weapons laboratories
DOE Science Grid | DOE Office of Science | Create operational Grid providing access to resources & applications at U.S. DOE science laboratories & partner universities
Earth System Grid (ESG) | DOE Office of Science | Delivery and analysis of large climate model datasets for the climate research community
European Union (EU) DataGrid | European Union | Create & apply an operational grid for applications in high energy physics, environmental science, bioinformatics


iVDGL: International Virtual Data Grid Laboratory

National Digital Mammography Archive
(NGI) Next Generation Internet Demonstration Project
- Three Applications -

  • Archive Storage and retrieval for clinical use
  • Teaching File repository for Radiology Departments
  • Computer Assisted Diagnosis (CAD) as a service

BIRN - Biomedical Informatics Research Network
National Center for Research Resources
National Institutes of Health

  • Integrating data from different brain mapping research sites
    • UCSD, UCLA, Caltech, Duke, Mass General, Harvard
    • Mouse and human brain
  • BIRN Data/Knowledge Grid
    • High-speed networking
    • Access to distributed data
    • Semantic mediation
  • Intra-species and inter-species queries
    • Visualization and analysis tools

NPACI - National Partnership for Advanced Computational Infrastructure (NSF)

  • ~50 partner sites
  • shared compute resources
  • high-speed networks
  • Computational science efforts in "thrusts"
    • Neuroscience
    • Molecular Science
    • Earth Systems Science
    • Engineering

Enabling Technology Thrusts

  • Resources (TeraFlops, High Performance Networks, Data Caches)
  • Metacomputing (Grid Tools - Middleware)
  • Interaction Environments (Visualization - Science Portals)
  • Data-Intensive Computing (Databases - Data Migration - Knowledge Eng.)


Standards for Information Interchange: The Basis for Multidisciplinary Collaboration
Clinical Data Standards
Building consensus within the industry towards standards for exchanging electronic data:
"Speaking the same language" to achieve more efficient and higher quality clinical trials
Outline

  • Benefits of Standards and Potential Value
  • What is CDISC?
  • Principles of CDISC
  • Organization of CDISC
  • CDISC Progress
  • CDISC Models
  • Implementation of CDISC Models

A Case for Data Standards
Current State:
Costly and Time-consuming

CDISC Value: Cost of Clinical Data Interchange in Clinical Trials

  • ~7,000-8,000 clinical studies/year
  • ~30% outsourced and 5-10% using electronic data capture (EDC)
  • Estimated cost of $35,000 for EDC transfers, $25,000 for contract research organization (CRO) data transfers, and $10,000 for lab data transfers
  • Conservative annual cost to the industry: $156 million

NOTE: The costs incurred with development partners or merged companies sharing data and the cost of preparing data for eSubmissions are not addressed in this set of calculations, nor are other costs such as training, planning or equipment.
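The slides do not show how the $156 million figure was derived. As a rough check under stated assumptions (one transfer of each applicable type per study, and a lab data transfer for every study, neither of which is confirmed by the source), the sketch below computes the range implied by the quoted study counts, percentages, and per-transfer costs.

```python
def annual_transfer_cost(studies, pct_outsourced, pct_edc,
                         edc_cost=35_000, cro_cost=25_000, lab_cost=10_000):
    """Estimated yearly data-interchange cost in dollars, using integer percentages."""
    cro = studies * pct_outsourced * cro_cost // 100  # outsourced studies: CRO transfers
    edc = studies * pct_edc * edc_cost // 100         # EDC studies: EDC transfers
    lab = studies * lab_cost                          # assumption: every study has a lab transfer
    return cro + edc + lab


low = annual_transfer_cost(7_000, 30, 5)    # fewest studies, lowest EDC share
high = annual_transfer_cost(8_000, 30, 10)  # most studies, highest EDC share
print(f"${low:,} to ${high:,}")  # $134,750,000 to $168,000,000
```

Under these assumptions the quoted $156 million falls inside the computed range, which is consistent with it being a mid-range estimate rather than an upper bound.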

Desired State
Labs, Pharma, Tech/Software, Other Vendors, Patients, CROs, Regulatory Agencies, and BioTech, all connected to central CDISC Data Standards

Where are CDISC standards being used?
One Consistent Data Standard
Applied Across Systems & Processes

Standards to Enable Seamless Flow of Data from Patient to Reviewers
Flow is from Safety, EDC, Lab, CRO in and out of operational data to submission data, to eSubmission for regulatory review.

Emerging Tools for Building Integrated Scientific Data Resources
Joe Futrelle
National Center for Supercomputing Applications


  • Vision: A Digital Library of Scientific Data
  • How to Integrate Scientific Data
  • New Technologies for Data and Metadata: XML and friends
  • Current Scenarios, Projects and Technology

Vision: a Digital Library of Scientific Data

  • Contents
    • scientific literature
    • data used in studies
    • software used to do the studies
  • Services
    • digital publishing
    • retrieval of data based on scientific criteria
    • remote analysis and visualization
    • access to computational resources
    • ability to link data from different studies and disciplines together to do new studies

This is a Really Hard Problem

  • Scientific data is exploding in
    • resolution
    • complexity
    • heterogeneity
    • volume
  • It’s not enough to just turn every science data collection into a website
    • large data sets cannot go "the last mile"
    • a digital library of science data will integrate many (1000s of) collections
    • data management tools must work across collections

How to Integrate Scientific Data

  • Generate integrated use scenario
    • input from scientific community
    • represent ~100 groups of researchers with common scientific specialization
    • informal
    • more than one data collection across
      • discipline or sub-discipline (e.g. wavelength in radio astronomy, species in biology, process in chemical engineering)
      • scientific data type (e.g. satellite swath, genetic sequence, sensor trace)
      • access modality (e.g. browsing, search, visualization, simulation)

How to Integrate Scientific Data (cont.)

  • Develop data and metadata models to enable the scenarios
    • identify community-wide data semantics
    • formal, incremental process
    • ongoing review and documentation
    • target key semantics for scenarios
    • use extensible data modeling technologies (e.g. XML, RDF, HDF) to implement data models
  • Link scenarios to build network of data services
  • Other concerns
    • security
    • intellectual property
    • data preservation

New Technologies for Data and Metadata

  • What’s the difference between data and metadata?
    • Metadata is data that describes other data (e.g. a card catalog)
    • Within an item in a collection of information:
      • "Data" grows as the amount of information in the item grows
      • "Metadata" grows as the complexity of information in the item grows
    • All metadata is data but not all data is metadata
  • Why does it matter?
    • Data and metadata have different usage patterns and performance implications
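A small sketch of the distinction, using a hypothetical image item: the pixel buffer (data) grows with the amount of information, while the descriptive record (metadata) keeps the same number of fields because the item's structure has not become more complex. The field names are illustrative assumptions.

```python
def make_image_item(width, height):
    """Return (metadata, data) for a placeholder 8-bit image of the given size."""
    metadata = {
        "modality": "CT",
        "width": width,
        "height": height,
        "bits_per_pixel": 8,
    }
    data = bytes(width * height)  # zero-filled placeholder pixels
    return metadata, data


small_meta, small_data = make_image_item(64, 64)
large_meta, large_data = make_image_item(512, 512)
print(len(small_meta), len(small_data))  # 4 4096
print(len(large_meta), len(large_data))  # 4 262144
```

This is also why the usage patterns differ: the four-field record is cheap to index and search, while the pixel buffer is expensive to move.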

New Technologies: XML

  • XML is really a set of closely-related technologies, including
    • XML: generalized markup
    • XLink and URI: interobject reference and linking
    • XML-Schema: document model definition
    • XSL: transformation and presentation
    • RDF: metadata and inference
    • XQuery: retrieval from XML documents
    • SOAP: remote procedure calling
  • Key commonalities:
    • draft standards from WWW consortium
    • text-based
    • extensible/portable

New Technologies: XML (cont.)

  • Suitable for metadata and "light data"
  • Structured
  • Hierarchical
    • Limited graph-like relationships (e.g. IDs)
  • Portable across
    • languages
    • operating systems
  • Becoming ubiquitous
    • standard parser APIs (DOM, SAX)
    • parsers available in all major languages, platforms
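As an illustration of the two standard parser APIs, the sketch below reads the same small metadata record with DOM (whole document loaded as a tree) and with SAX (event callbacks while streaming), using Python's standard library. The record's element names and content are hypothetical.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical image metadata record, for illustration only.
RECORD = """<image id="img-17">
  <modality>CT</modality>
  <finding ref="f1">nodule</finding>
  <finding ref="f2">effusion</finding>
</image>"""

# DOM: parse the whole document into a tree, then query it.
doc = minidom.parseString(RECORD)
dom_findings = [node.firstChild.data for node in doc.getElementsByTagName("finding")]


# SAX: react to start/end/character events as the parser streams through the text.
class FindingHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.findings = []
        self._in_finding = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "finding":
            self._in_finding = True
            self._buffer = []

    def characters(self, content):
        if self._in_finding:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "finding":
            self.findings.append("".join(self._buffer))
            self._in_finding = False


handler = FindingHandler()
xml.sax.parseString(RECORD.encode("utf-8"), handler)
print(dom_findings, handler.findings)
```

DOM is convenient for small metadata records like this one; SAX avoids holding the whole document in memory, which matters as records grow.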


  • 8 AM - Presentation by Jim Gray, Microsoft Research on Databases
  • 9 AM - 1 PM … Many image databases, architectures, and applications
  • 2 PM - 3:30 PM … Emerging standards, FDA e-submission process, and WEAR
  • 3:30 - ??? PM … Breakout sessions

The End