Cancer Imaging Program (CIP)
Last Updated: 10/28/16

LIDC — Data Collection Process

Slide 1

The Lung Image Database Consortium (LIDC) Data Collection Process

This presentation is based on the RSNA 2004 InfoRAD theater presentation titled "The Lung Imaging Database Consortium (LIDC): Creating a Publicly Available Database to Stimulate Research in CAD Methods for Lung Cancer" (9110 DS-i).

November 29, 2004

Michael McNitt-Gray (UCLA), Anthony P. Reeves (Cornell), Roger Engelmann (U. Chicago), Peyton Bland (U. Michigan), Chris Piker (U. Iowa), John Freymann (NCI) and The Lung Image Database Consortium (LIDC)

Slide 2

Principal Goals

To establish standard formats and processes for managing thoracic CT scans and related technical and clinical data for use in the development and testing of computer-aided diagnostic algorithms.

Slide 3

To establish standard formats and processes for managing thoracic CT scans and related technical and clinical data for use in the development and testing of computer-aided diagnostic algorithms.
To develop an image database as a web-accessible international research resource for the development, training, and evaluation of computer-aided diagnostic (CAD) methods for lung cancer detection and diagnosis using helical CT.

Slide 4

The database will contain:

  1. A collection of CT scan images
  2. Technical factors about the CT scan
    • Non-patient information in DICOM header
  3. For Nodules > 3 mm diameter
    • Radiologist drawn boundaries
    • Description of characteristics
  4. For Nodules < 3 mm diameter
    • Radiologist marks centroid, no characteristics
  5. Pathology results or diagnosis information whenever available
  6. All in a searchable relational database

Slide 5

The LIDC Data Collection Process

  • For nodule detection, recent research has demonstrated that the results from a single reader are not sufficient

Slide 6

  • At least two and perhaps four readers may be required.
  • Not practical to do joint reading sessions across five institutions
  • LIDC will NOT do a forced consensus read. We won’t force agreement on either the location of a nodule or its boundary.

Slide 7

Truth - Detection
LIDC - Initial Approach

  • Multiple Reads with Multiple Readers
    • First Read - 4 readers, each reads independently (Blinded)
    • Compile 4 blinded reads and distribute to readers
    • Second Read - Same 4 readers, this time unblinded to the results of the other readers from the first reading.
    • Still, no forced consensus on either the location of nodules or their boundaries.

Slides 8-12

Lung CT scan images with readers’ marks indicating the nodules; readers were reading blinded to each other’s marks.

Slides 13-17

Lung CT scan images with readers’ marks indicating the nodules; readers could see the previous blinded reads.

Slide 18

Lung CT scan image showing unblinded reads for all 4 readers

Slide 19

Radiologist Review & Reconcile

  • 4 Radiologists Perform Blinded Read - R1B, R2B, R3B, R4B
  • Submit to Requesting Site; This site compiles markings and re-sends case
  • 4 Radiologists see all (anonymized) markings
  • 4 Radiologists Perform Unblinded Read (R1U, R2U, R3U, R4U)

Database (will contain Blinded AND Unblinded reads)
Nodules for each condition: (R1B, R2B, R3B, R4B, R1U, R2U, R3U, R4U)

  • Location
  • Outline (where appropriate)
  • Label (where appropriate)

Slides 20-33

4 readers, 3 marking methods: illustrated with marks on a partial CT image of the lung.

Slide 34

How to Represent This Variability? Create a Probabilistic Description of Nodule Boundary

  • For each voxel, sum the number of occurrences (across reader markings) that it was included as part of the nodule
  • Create a probabilistic map of nodule voxels
  • Higher probability voxels are shown as brighter; lower probability are darker
  • Can apply a threshold and show only voxels above some probability value, if desired.
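
The voxel-counting idea above can be sketched in a few lines of NumPy. This is only an illustration, not the LIDC implementation: the function names `probability_map` and `threshold_map`, the tiny 1 x 3 x 3 volume, and the four reader masks are all hypothetical. Each reader's outline is assumed to be available as a boolean voxel mask of the same shape.

```python
import numpy as np

def probability_map(reader_masks):
    """Fraction of reader markings that included each voxel in the nodule."""
    stacked = np.stack([np.asarray(m, dtype=bool) for m in reader_masks])
    # Summing over the reader axis counts occurrences per voxel.
    return stacked.sum(axis=0) / stacked.shape[0]

def threshold_map(pmap, min_prob):
    """Keep only voxels whose probability is strictly above min_prob."""
    return pmap > min_prob

# Four hypothetical reader masks on a tiny 1 x 3 x 3 volume.
r1 = np.zeros((1, 3, 3), dtype=bool)
r1[0, 1, 1] = True
r2 = r1.copy()
r2[0, 1, 2] = True
r3 = r2.copy()
r3[0, 0, 1] = True
r4 = np.zeros((1, 3, 3), dtype=bool)  # this reader marked nothing

pmap = probability_map([r1, r2, r3, r4])
fixed_region = threshold_map(pmap, 0.4)
```

Displaying `pmap` as a gray-scale image gives brighter voxels where more readers agreed; `fixed_region` is the thresholded version.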

Slide 35

Probabilistic Description of Boundary
Illustrated with an image of the nodule with gray scale indicating probability that the region was marked.

Slide 36

Apply Threshold if Desired
Illustrated with an image of the nodule's gray-scale probability map, bounded by an edge that encloses the voxels whose probability of having been marked exceeds the chosen threshold.

Slide 37

Challenge: Define the Boundary of a Nodule

  • Do we need to have agreement between radiologists on boundaries?
  • LIDC’s answer is no.
  • LIDC Approach will be to:
    • Construct a probabilistic description of boundaries to capture reader variability
    • Use a threshold value (50% centile or 1% centile) to give fixed contours.

Slide 38

Pathology Information

  • In those cases in which pathology is available, we will extract from reports:
    • Whether histology or cytology was performed
    • If histology, try to establish the cell type according to WHO classifications
    • If cytology, establish whether it was benign or malignant

Slide 39

  • If no pathology is available, other diagnostic information may be substituted when available (such as 2 years of diagnostic follow-up with no change in radiographic appearance).
  • If neither is available, then case will be used for detection purposes only.
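
The fallback order described on these two slides (pathology first, then other diagnostic information, otherwise detection only) could be expressed roughly as follows. The `DiagnosisInfo` record, its field names, and the returned labels are hypothetical and chosen only for this sketch; they are not LIDC data elements.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DiagnosisInfo:
    # All field names are illustrative assumptions.
    pathology_method: Optional[str] = None       # "histology", "cytology", or None
    histology_cell_type: Optional[str] = None    # cell type, if histology
    cytology_malignant: Optional[bool] = None    # benign/malignant, if cytology
    other_diagnostic_info: Optional[str] = None  # e.g. stable radiographic follow-up

def case_usage(info: DiagnosisInfo) -> str:
    """Decide how a case can be used, following the order on the slides."""
    if info.pathology_method in ("histology", "cytology"):
        return "pathology"
    if info.other_diagnostic_info:
        return "other diagnostic information"
    return "detection only"

with_path = DiagnosisInfo(pathology_method="cytology", cytology_malignant=False)
with_followup = DiagnosisInfo(other_diagnostic_info="2-year follow-up, no change")
no_info = DiagnosisInfo()
```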

Slide 40

Database Implementation
How to capture and collect all of this data?
5 Phases of Data Collection

  • Initial Review
    • review case for inclusion in database;
    • anonymize case;
    • Index case, e.g. Full Chest/Limited Chest, Image Quality.
  • Blinded Read
    • identifying and drawing nodules independently
  • Unblinded Read
    • confirming using an overread, labeling nodules (characteristics)
  • Subject info
    • demographics, smoking history, pathology.
  • Export Data to NCI-hosted database (public)

Slide 41

How to capture and collect all of this data?

We have developed an internal, xml-based standard for representing a 3-D region of interest (ROI). It is portable across software drawing tools. We are also using xml to capture radiologists' interpretations of nodule characteristics (shape, subtlety, etc.) through a limited set of descriptors.
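
To make the idea of an xml-encoded 3-D ROI concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The element and attribute names (`nodule`, `characteristics`, `roi`, `edge`, `z`, `x`, `y`) are invented for illustration and should not be read as the LIDC schema. The sketch assumes a 3-D ROI can be stored as one 2-D contour per slice position.

```python
import xml.etree.ElementTree as ET

def build_roi_xml(nodule_id, contours, characteristics):
    """Serialize per-slice contours {z: [(x, y), ...]} plus descriptors to xml."""
    nodule = ET.Element("nodule", id=nodule_id)
    chars = ET.SubElement(nodule, "characteristics")
    for name, value in characteristics.items():
        ET.SubElement(chars, name).text = str(value)
    # One <roi> element per slice; together they describe the 3-D region.
    for z, points in sorted(contours.items()):
        roi = ET.SubElement(nodule, "roi", z=str(z))
        for x, y in points:
            ET.SubElement(roi, "edge", x=str(x), y=str(y))
    return ET.tostring(nodule, encoding="unicode")

def parse_roi_xml(text):
    """Recover the {z: [(x, y), ...]} contours from the xml text."""
    nodule = ET.fromstring(text)
    return {
        float(roi.get("z")): [(int(e.get("x")), int(e.get("y")))
                              for e in roi.findall("edge")]
        for roi in nodule.findall("roi")
    }

contours = {-125.0: [(10, 20), (11, 20), (11, 21)], -122.5: [(10, 20), (11, 21)]}
xml_text = build_roi_xml("n1", contours, {"shape": 3, "subtlety": 4})
round_trip = parse_roi_xml(xml_text)
```

Because the contours survive a round trip through plain text, any drawing tool that can read and write the same elements could exchange ROIs, which is the portability point made above.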

Slide 42

How to capture and collect all of this data?

We have designed and tested a communication protocol to send image data and xml messages

  • Read Request messages (with a code/mechanism to distinguish blinded from unblinded read request)
  • Read Response messages (with a code/mechanism to distinguish blinded from unblinded read response)
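
As a rough illustration of the two message kinds, the sketch below models a request and a response that both carry a blinded/unblinded code. The class names, field names, and the rule that a blinded request carries no prior markings are assumptions made for this example. The slides do not specify the actual message layout.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class ReadType(Enum):
    BLINDED = "blinded"
    UNBLINDED = "unblinded"

@dataclass
class ReadRequest:
    case_id: str
    read_type: ReadType
    # Compiled, anonymized markings from the first read (unblinded requests only).
    prior_markings: List[str] = field(default_factory=list)

    def __post_init__(self):
        if self.read_type is ReadType.BLINDED and self.prior_markings:
            raise ValueError("a blinded read request must not carry prior markings")

@dataclass
class ReadResponse:
    case_id: str
    read_type: ReadType
    reader_label: str   # anonymized reader label, e.g. "R1"
    markings_xml: str   # reader's markings as an xml string

def respond(request: ReadRequest, reader_label: str, markings_xml: str) -> ReadResponse:
    """Build a response that echoes the request's case and read type."""
    return ReadResponse(request.case_id, request.read_type, reader_label, markings_xml)

blinded_request = ReadRequest("case-001", ReadType.BLINDED)
blinded_response = respond(blinded_request, "R1", "<nodule/>")
unblinded_request = ReadRequest("case-001", ReadType.UNBLINDED,
                                prior_markings=[blinded_response.markings_xml])
```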

Slide 43

How to capture and collect all of this data?

Designed and implemented a database at each host site for all case data.
Designed and are implementing the central NCI-hosted database.

Slide 44

Communication Model

  • Each Site Plays Dual Roles
  • As a Requesting Site
    • Identify Case and collect data
    • Phase 1 - Initial Review
    • Manage it through blinded and unblinded read process
    • Create database entry for case
    • Phase 4 - Demographics, Pathology
    • Phase 5 - Export to NCI
    • NOTE: Site does not READ/MARK its own cases
  • As a Servicing Site
    • Perform blinded (Phase 2) and unblinded (Phase 3) reads
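
The rule that a site never reads its own cases can be sketched as a simple assignment function. The site labels and the function name are hypothetical. The sketch also assumes that, with five institutions and four readers per case, each of the other four sites supplies one reader, which the slides imply but do not state outright.

```python
def servicing_sites(requesting_site, all_sites):
    """Return the sites that read a case: every site except the requester."""
    if requesting_site not in all_sites:
        raise ValueError(f"unknown site: {requesting_site}")
    return [site for site in all_sites if site != requesting_site]

# Placeholder labels for the five participating institutions.
SITES = ["site_A", "site_B", "site_C", "site_D", "site_E"]

readers_for_c = servicing_sites("site_C", SITES)
```

Under that assumption, every site acts as the requesting site for its own cases and as a servicing site for the cases of the other four.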

Slide 45

A schematic drawing of the various parts of the process:

IRB Approval

LIDC Activities
Patient/Nodule criteria
CT scan criteria
Labeling vocabulary
Image quality criteria

Participants are imaged as part of a study/clinical program

Apply Inclusion Criteria:

  • IF Meets Scan Parameter Criteria [NOTE: All NLST & ELCAP eligible]
  • AND IF Meets Patient Inclusion Criteria
  • THEN Include in Db, Label Nodule Characteristics and Score Image Quality

Data Collected

  • Image data (illustrated with a lung CT slice)
  • Non-Image Data
  • Demographic Data
  • Image Quality score
  • Scan Classification
  • Patient Classification

Taking the process apart:

  • Definition of Nodules to be included in Db
  • Agreement on Marking/Contouring process
  • Radiologist Review Process (described next)

Slide 46

Access to LIDC Database

  • Cases Exported to NCI
  • NCI hosts Database
  • Publicly Available
  • Query Based on Data Elements Collected
    • Imaging Data such as Slice Thickness, etc.
    • Pathology or F/U Data
    • Other Fields
  • Obtain
    • Image Data including DICOM headers
    • Serial Imaging when available
    • Radiologists’ Identification, Contours and Characterization of Nodules
    • Diagnosis Data (Path, Radiographic F/U, etc) whenever available
    • Case Demographics whenever available
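
A query "based on data elements collected" might look like the following sketch, which uses an in-memory SQLite database. The table names, column names, and sample rows are invented for illustration and do not reflect the NCI-hosted schema or any real LIDC cases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scan (
    case_id TEXT PRIMARY KEY,
    slice_thickness_mm REAL
);
CREATE TABLE diagnosis (
    case_id TEXT REFERENCES scan(case_id),
    source TEXT,   -- e.g. 'pathology' or 'follow-up'
    result TEXT
);
""")
# Made-up sample rows, purely for the example.
conn.executemany("INSERT INTO scan VALUES (?, ?)",
                 [("case-001", 1.25), ("case-002", 2.5), ("case-003", 1.0)])
conn.executemany("INSERT INTO diagnosis VALUES (?, ?, ?)",
                 [("case-001", "pathology", "malignant"),
                  ("case-002", "pathology", "benign")])

def cases_with_diagnosis(conn, max_thickness_mm):
    """Cases at or below a slice thickness that also have diagnosis data."""
    return conn.execute(
        """SELECT s.case_id, s.slice_thickness_mm, d.result
           FROM scan AS s
           JOIN diagnosis AS d ON d.case_id = s.case_id
           WHERE s.slice_thickness_mm <= ?
           ORDER BY s.case_id""",
        (max_thickness_mm,),
    ).fetchall()

thin_slice_cases = cases_with_diagnosis(conn, 1.5)
```

The join drops cases that have no diagnosis row, mirroring the "whenever available" qualifier on diagnosis data.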

Slide 47

Database Implementation
TASKS COMPLETED (see reports on website):

  • Specification of Inclusion Criteria:
    • CT scanning technical parameters
    • Patient inclusion criteria
  • Process Model for Data collection
    • Determination of Spatial "truth" Using Blinded and Unblinded reads
  • Development of Boundary Drawing Tools
  • Development and implementation of xml standard for ROIs

Slide 48


  • Defined Common Data Elements for LIDC
  • Database design - tables and relationships between tables
  • Communication protocol
  • Establishing Public Database and Access Mechanism at NCI

Slide 49

Other Products: Publications/Presentations

  • LIDC Overview manuscript
    • Radiology 2004 Sep;232(3):739-748.
  • Assessment Methodologies manuscript
    • Academic Radiology April 2004
      • (Acad Radiol 2004; 11:462-475)
  • Special Session SPIE Medical Imaging
    • Sunday evening session

Slide 50


  • LIDC mission - to create public database
  • Current understanding of problem dictated multiple readers
  • Multi-institution participation dictated distributed, asynchronous reads

Slide 51

  • LIDC developed:
    • Process Model for Blinded and Unblinded Reads w/Multiple Readers
    • Infrastructure to Communicate Radiologist Expert Information (Markings, Contours, Labelings)
      • Data Elements - image, metadata (DICOM), radiologist markings, contours and labels, pathology, demographics
      • Data Representation Scheme (xml)
      • Communication (messaging) protocol
      • Database Design
      • Mechanism to handle reader disagreement/variability