Extraction¶
Extraction scans DICOM files and imports metadata into the NILS database. This is the first step in processing a cohort.
What Extraction Does¶
- Subject Discovery - Scans the raw directory for folder structure
- Series Grouping - Organizes files by Series/Study/Subject
- DICOM Parsing - Extracts metadata from each DICOM file
- Database Insertion - Writes to metadata database tables
- Stack Detection - Groups instances into homogeneous SeriesStacks
- Progress Tracking - Supports resumable extraction
Extraction Phases¶
Phase 1: Subject Discovery¶
- Scans source_path (raw root directory)
- Identifies subject folders based on naming convention
- Builds list of SubjectFolder objects
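The discovery step can be sketched roughly as follows. This is an illustration under stated assumptions, not the NILS implementation: the "sub-..." naming pattern, the SubjectFolder fields, and the discover_subjects helper are all hypothetical.

```python
import re
from dataclasses import dataclass
from pathlib import Path

# Hypothetical naming convention: folders such as "sub-001".
SUBJECT_PATTERN = re.compile(r"^sub-\w+$")


@dataclass(frozen=True)
class SubjectFolder:
    # Field names are assumed for this sketch.
    subject_id: str
    path: Path


def discover_subjects(source_path, pattern=SUBJECT_PATTERN):
    """Return one SubjectFolder per matching directory directly under source_path."""
    root = Path(source_path)
    return [
        SubjectFolder(subject_id=entry.name, path=entry)
        for entry in sorted(root.iterdir())
        if entry.is_dir() and pattern.match(entry.name)
    ]
```

Sorting the directory entries keeps the resulting list stable across runs, which is convenient when extraction is resumed later.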
Phase 2: Series Planning¶
- Enumerates all DICOM files per subject
- Groups files by SeriesInstanceUID
- Groups series by StudyInstanceUID
- Groups studies by Subject
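The grouping above amounts to building a Subject > Study > Series hierarchy. A minimal sketch, assuming each file has already been reduced to a small dict with hypothetical keys subject_id, study_uid, series_uid, and path:

```python
def plan_series(headers):
    """Nest file paths as plan[subject_id][study_uid][series_uid] -> [paths]."""
    plan = {}
    for header in headers:
        studies = plan.setdefault(header["subject_id"], {})
        series = studies.setdefault(header["study_uid"], {})
        series.setdefault(header["series_uid"], []).append(header["path"])
    return plan


headers = [
    {"subject_id": "S1", "study_uid": "1.1", "series_uid": "1.1.1", "path": "a.dcm"},
    {"subject_id": "S1", "study_uid": "1.1", "series_uid": "1.1.1", "path": "b.dcm"},
    {"subject_id": "S1", "study_uid": "1.1", "series_uid": "1.1.2", "path": "c.dcm"},
]
plan = plan_series(headers)
# plan["S1"]["1.1"] now holds two series, the first with two files.
```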
Phase 3: Parallel Extraction¶
- Multiple workers parse DICOM files in parallel (ProcessPoolExecutor)
- Extracts full DICOM metadata from each file
- Writes results to batch buffers
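The worker-pool pattern described here can be sketched with the standard library. The parse_header function is a hypothetical stand-in for the real DICOM parser, and extract_in_batches and its parameters are assumed names for illustration only.

```python
from concurrent.futures import ProcessPoolExecutor


def parse_header(path):
    """Hypothetical stand-in for the real DICOM parser."""
    return {"path": path, "name_length": len(path)}


def extract_in_batches(paths, parse=parse_header, workers=4, batch_size=1000,
                       executor_cls=ProcessPoolExecutor):
    """Parse paths in parallel and yield lists of at most batch_size records."""
    batch = []
    with executor_cls(max_workers=workers) as pool:
        # map() preserves input order; parse must be picklable for process pools.
        for record in pool.map(parse, paths):
            batch.append(record)
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:
        # Flush the final, possibly partial, batch.
        yield batch
```

Yielding fixed-size batches means the caller never holds more than one batch of parsed records at a time. When worker processes are spawned rather than forked, the calling code should sit under an `if __name__ == "__main__":` guard.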
Phase 4: Batch Database Insertion¶
- Inserts records into the database in bulk, one batch at a time
- Wraps each batch in an atomic transaction to prevent data corruption
- Keeps memory use bounded by batch size, which prevents out-of-memory errors on large datasets
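One way to picture per-batch atomic inserts is the sketch below. It uses SQLite from the standard library purely for illustration; the database engine, the instance table, and its columns are assumptions, not the NILS schema.

```python
import sqlite3
from itertools import islice

# Hypothetical table for the sketch.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS instance ("
    "sop_instance_uid TEXT PRIMARY KEY, series_instance_uid TEXT)"
)


def chunks(rows, size):
    """Yield successive lists of at most size rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def insert_instances(conn, rows, batch_size=1000):
    """Insert rows batch by batch; each batch is its own transaction."""
    inserted = 0
    for chunk in chunks(rows, batch_size):
        with conn:  # commits on success, rolls back the batch on error
            conn.executemany("INSERT INTO instance VALUES (?, ?)", chunk)
        inserted += len(chunk)
    return inserted
```

If a batch fails, only that batch is rolled back; batches committed earlier stay in place, which is what makes a later resume possible.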
What Gets Extracted¶
Study-Level Metadata¶
| Field | DICOM Tag |
|---|---|
| study_instance_uid | (0020,000D) |
| study_date | (0008,0020) |
| study_time | (0008,0030) |
| study_description | (0008,1030) |
| modality | (0008,0060) |
| manufacturer | (0008,0070) |
| manufacturer_model_name | (0008,1090) |
| station_name | (0008,1010) |
| institution_name | (0008,0080) |
Series-Level Metadata¶
| Field | DICOM Tag |
|---|---|
| series_instance_uid | (0020,000E) |
| series_description | (0008,103E) |
| protocol_name | (0018,1030) |
| sequence_name | (0018,0024) |
| body_part_examined | (0018,0015) |
| scanning_sequence | (0018,0020) |
| sequence_variant | (0018,0021) |
| scan_options | (0018,0022) |
MR-Specific Metadata¶
| Field | DICOM Tag |
|---|---|
| repetition_time | (0018,0080) |
| echo_time | (0018,0081) |
| inversion_time | (0018,0082) |
| flip_angle | (0018,1314) |
| echo_train_length | (0018,0091) |
| mr_acquisition_type | (0018,0023) |
| diffusion_b_value | (0018,9087) |
| parallel_acquisition_technique | (0018,9078) |
CT-Specific Metadata¶
| Field | DICOM Tag |
|---|---|
| kvp | (0018,0060) |
| exposure_time | (0018,1150) |
| x_ray_tube_current | (0018,1151) |
| convolution_kernel | (0018,1210) |
| spiral_pitch_factor | (0018,9311) |
| ctdi_vol | (0018,9345) |
PET-Specific Metadata¶
| Field | DICOM Tag |
|---|---|
| radiopharmaceutical | (0018,0031) |
| radionuclide_total_dose | (0018,1074) |
| radionuclide_half_life | (0018,1075) |
| reconstruction_method | (0054,1103) |
| attenuation_correction_method | (0054,1101) |
| suv_type | (0054,1006) |
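The field tables above are effectively lookup maps from a database column to a DICOM (group, element) tag. A small sketch of that idea, using a subset of the listed fields; the extract_fields helper and the plain-dict header are assumptions for illustration, not the NILS parser.

```python
# Subset of the fields listed above, keyed to (group, element) tags.
STUDY_TAGS = {
    "study_instance_uid": (0x0020, 0x000D),
    "study_date": (0x0008, 0x0020),
    "modality": (0x0008, 0x0060),
}
MR_TAGS = {
    "repetition_time": (0x0018, 0x0080),
    "echo_time": (0x0018, 0x0081),
}


def format_tag(tag):
    """Render a tag the way the tables show it, e.g. (0020,000D)."""
    group, element = tag
    return f"({group:04X},{element:04X})"


def extract_fields(header, tag_map):
    """Pick the mapped tags out of a header dict; absent tags become None."""
    return {field: header.get(tag) for field, tag in tag_map.items()}


# Hypothetical parsed header, keyed by (group, element).
header = {(0x0020, 0x000D): "1.2.840.1", (0x0008, 0x0060): "MR", (0x0018, 0x0081): 4.92}
study_row = extract_fields(header, STUDY_TAGS)
mr_row = extract_fields(header, MR_TAGS)
```

Returning None for absent tags mirrors the fact that many of these attributes are optional or modality-specific.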
Stack Detection¶
After parsing instances, NILS groups them into SeriesStacks.
What Defines a Stack?¶
Instances are grouped by matching:
MR Stacks:
- Echo Time (TE)
- Inversion Time (TI)
- Flip Angle
- Echo Number
- Image Type (MAGNITUDE, PHASE, etc.)
- Image Orientation
CT Stacks:
- kVp
- Tube Current
- Exposure
PET Stacks:
- Bed Index
- Frame Type
Stack Key¶
Each stack gets a deterministic stack_key: a string combining all defining parameters. Because the same parameters always produce the same key, classification is idempotent across reruns.
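The actual key format is not shown here, so the following is only a sketch of how a deterministic key could be derived and used for grouping. The field names, the "|" separator, the four-decimal rounding, and the use of "PT" as the PET modality code are all assumptions.

```python
from collections import defaultdict

# Defining parameters per modality, following the lists above (names assumed).
STACK_FIELDS = {
    "MR": ("echo_time", "inversion_time", "flip_angle",
           "echo_number", "image_type", "image_orientation"),
    "CT": ("kvp", "x_ray_tube_current", "exposure"),
    "PT": ("bed_index", "frame_type"),
}


def _canon(value):
    """Render a value so that equal parameters always give equal text."""
    if value is None:
        return ""
    if isinstance(value, (int, float)):
        return f"{float(value):.4f}"  # 120 and 120.0 map to the same text
    if isinstance(value, (list, tuple)):
        return "/".join(_canon(v) for v in value)
    return str(value)


def stack_key(modality, instance):
    """Build a key from the modality's defining fields, in a fixed order."""
    parts = [f"{name}={_canon(instance.get(name))}" for name in STACK_FIELDS[modality]]
    return modality + "|" + "|".join(parts)


def group_into_stacks(modality, instances):
    """Group instance dicts that share the same stack key."""
    stacks = defaultdict(list)
    for instance in instances:
        stacks[stack_key(modality, instance)].append(instance)
    return dict(stacks)
```

Fixing the field order and normalising numbers are what make the key reproducible: two runs over the same instances yield the same keys, so a rerun can match existing stacks instead of creating new ones.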
Resume Capability¶
Extraction supports resumable processing:
- Tracks previously extracted subjects/series
- On restart, skips already-processed data
- Enables recovery from interruptions
The resume index stores:
- Subject folder paths already processed
- Series UIDs already in database
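A resume index of this shape can be sketched as two sets persisted between runs. Storing it as a JSON file is an assumption made for the example, as are the ResumeIndex class and its method names; NILS may keep this state elsewhere.

```python
import json
from pathlib import Path


class ResumeIndex:
    """Tracks processed subject folders and series UIDs (illustrative only)."""

    def __init__(self, path):
        self.path = Path(path)
        self.subject_paths = set()
        self.series_uids = set()
        if self.path.exists():
            data = json.loads(self.path.read_text())
            self.subject_paths = set(data.get("subject_paths", []))
            self.series_uids = set(data.get("series_uids", []))

    def mark_done(self, subject_path, series_uids):
        self.subject_paths.add(str(subject_path))
        self.series_uids.update(series_uids)

    def pending(self, subject_paths):
        """Return the subject folders that still need extraction."""
        return [p for p in subject_paths if str(p) not in self.subject_paths]

    def save(self):
        # Write to a temporary file first so an interruption cannot
        # leave a half-written index behind.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps({
            "subject_paths": sorted(self.subject_paths),
            "series_uids": sorted(self.series_uids),
        }))
        tmp.replace(self.path)
```

On restart, calling pending() with the full list of discovered subject folders returns only the ones not yet processed.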
Running Extraction¶
From Web Interface¶
- Navigate to the cohort
- Click Extract
- Monitor progress in the Jobs tab
From CLI¶
Options¶
| Option | Description | Default |
|---|---|---|
| --workers | Parallel parsing workers | 4 |
| --batch-size | DB insert batch size | 1000 |
| --resume | Resume from last checkpoint | true |
After Extraction¶
Once extraction completes, the database contains:
- Subject records for each patient
- Study records for each imaging session
- Series records for each acquisition
- Instance records for each image
- SeriesStack records grouping instances
The cohort is now ready for Sorting.
Troubleshooting¶
"No DICOM files found"¶
- Check that source_path points to actual DICOM files
- NILS scans recursively, so files can be in subdirectories
- Verify files have .dcm extension or no extension
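To check a directory by hand, a short script like the one below can list candidate files. It is a troubleshooting sketch, not part of NILS: find_candidates and looks_like_dicom are hypothetical helpers. The check relies on the DICOM file format, where a 128-byte preamble is followed by the four bytes "DICM".

```python
from pathlib import Path


def looks_like_dicom(path):
    """True if the file carries the 'DICM' marker after the 128-byte preamble."""
    with open(path, "rb") as fh:
        fh.seek(128)
        return fh.read(4) == b"DICM"


def find_candidates(root):
    """Recursively yield (path, has_marker) for .dcm or extensionless files."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in ("", ".dcm"):
            yield path, looks_like_dicom(path)
```

A False result does not prove a file is unusable, since some older files omit the preamble, but a directory where nothing matches is a strong hint that source_path points to the wrong place.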
"Duplicate series"¶
- This is normal for re-extractions
- Enable resume mode to skip already-imported data
- Or clear existing data before re-extraction
Memory Issues¶
- Reduce --workers count
- Reduce --batch-size
- Ensure adequate system RAM (8GB+ recommended)