Sorting

Sorting runs the classification pipeline on all extracted series. It consists of four steps that must run in sequence.

The Four Steps

flowchart LR
    A["Step 1: Checkup"] --> B["Step 2: Stack Fingerprint"]
    B --> C["Step 3: Classification"]
    C --> D["Step 4: Output Generation"]

Step 1: Checkup

Purpose: Validate data and prepare series for processing.

What It Does

  1. Cohort Subject Resolution
     • Gets all subjects in the cohort
     • Validates subject membership

  2. Study Discovery
     • Finds all studies for these subjects
     • Validates study-subject relationships

  3. Study Date Validation & Repair
     • Checks for missing study_date
     • Attempts recovery from acquisition_date or content_date
     • Flags studies with unrecoverable dates

  4. Series Collection
     • Gets all series from valid studies
     • Filters by modality if configured

  5. Existing Classification Filter
     • Checks whether each series is already classified
     • Skips or reprocesses based on configuration

Output

Step1Handover containing:

  • List of SeriesForProcessing (series_id, study_id, subject_id)
  • Validation results
  • Excluded series with reasons
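
As a rough illustration, the handover can be pictured as a small dataclass. The field names beyond those listed above are assumptions, not the actual schema:

    from dataclasses import dataclass, field

    @dataclass
    class SeriesForProcessing:
        series_id: int
        study_id: int
        subject_id: int

    @dataclass
    class Step1Handover:
        series: list[SeriesForProcessing]
        # Hypothetical containers for validation results and exclusions
        validation_results: dict[str, str] = field(default_factory=dict)
        excluded: dict[int, str] = field(default_factory=dict)  # series_id -> reason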

Step 2: Stack Fingerprint

Purpose: Build classification-ready feature vectors for each stack.

What It Does

  1. Load Handover
     • Receives series IDs from Step 1

  2. Query Stack Data
     • Fetches all SeriesStack records
     • Joins with Series, Study, and modality-specific details

  3. Build Fingerprints (Polars)
     • Vectorized transformations using Polars (see the sketch after this list)
     • Normalizes values across modalities
     • Aggregates text fields into searchable blobs
     • Computes geometry features (FOV, aspect ratio)

  4. Database Upsert
     • Bulk COPY into the stack_fingerprint table
     • UPSERT for existing fingerprints

  5. Batched Commits
     • Commits in batches to prevent OOM
     • Enables progress tracking
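
As a rough sketch of the vectorized approach, the transformations amount to column expressions evaluated once over the whole frame. All input column names here are illustrative assumptions, not the actual schema:

    import polars as pl

    def build_fingerprints(stacks: pl.DataFrame) -> pl.DataFrame:
        """One pass of vectorized expressions instead of per-row ORM objects."""
        return stacks.with_columns(
            # Normalize vendor strings across modalities
            pl.col("manufacturer").str.to_uppercase().str.strip_chars(),
            # Aggregate free-text fields into one searchable blob
            pl.concat_str(
                [
                    pl.col("series_description").fill_null(""),
                    pl.col("protocol_name").fill_null(""),
                ],
                separator=" ",
            ).str.to_lowercase().alias("text_search_blob"),
            # Geometry features: field of view in mm from matrix size and pixel spacing
            (pl.col("columns") * pl.col("pixel_spacing_x")).alias("fov_x"),
            (pl.col("rows") * pl.col("pixel_spacing_y")).alias("fov_y"),
        ).with_columns(
            (pl.col("fov_x") / pl.col("fov_y")).alias("aspect_ratio"),
        )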

Performance

  • Processes ~450K stacks in 45-60 seconds
  • The previous ORM-based approach caused out-of-memory failures on large datasets
  • Polars vectorization provides a 10-50x speedup

Output

Step2Handover containing:

  • List of fingerprint_id values
  • Processing statistics

Step 3: Classification

Purpose: Run the 10-stage classification pipeline on each fingerprint.

What It Does

For each StackFingerprint:

  1. Stage 0: Exclusion Check
     • Filters screenshots and secondary reformats
     • Checks ImageType flags

  2. Stage 1: Provenance Detection
     • Determines the processing pipeline
     • Routes to the appropriate branch

  3. Stage 2: Technique Detection
     • Identifies the pulse sequence family

  4. Stage 3: Branch Logic
     • Executes provenance-specific logic:
       • SWI Branch → SWI/QSM classification
       • SyMRI Branch → Synthetic MRI classification
       • EPIMix Branch → Multi-contrast EPI
       • RawRecon Branch → Standard detection

  5. Stage 4: Modifier Detection
     • Detects FLAIR, FatSat, MT, etc.

  6. Stage 5: Acceleration Detection
     • Detects GRAPPA, SMS, etc.

  7. Stage 6: Contrast Agent Detection
     • Determines pre/post contrast status

  8. Stage 7: Body Part Detection
     • Flags spinal cord coverage

  9. Stage 8: Intent Synthesis
     • Maps to the BIDS directory_type (anat, dwi, func, fmap)

  10. Stage 9: Review Flag Aggregation
     • Combines all review triggers
     • Sets manual_review_required
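
Conceptually, the stages compose as a short-circuiting chain: Stage 0 can end processing early, Stage 1 selects the branch, and later stages annotate the result. A minimal sketch under assumed names (none of these helpers or fields are the actual implementation):

    from dataclasses import dataclass, field

    @dataclass
    class ClassificationResult:
        excluded: bool = False
        provenance: str | None = None
        directory_type: str | None = None  # BIDS intent: anat, dwi, func, fmap
        manual_review_required: bool = False
        review_reasons: list[str] = field(default_factory=list)

    def classify(fp: dict) -> ClassificationResult:
        result = ClassificationResult()
        # Stage 0: exclusion check via ImageType flags
        if "SCREEN SAVE" in fp.get("image_type", ()):
            result.excluded = True
            return result
        # Stage 1: provenance routes Stage 3 to a branch (SWI, SyMRI, EPIMix, RawRecon)
        blob = fp.get("text_search_blob", "")
        result.provenance = "SWI" if "swi" in blob else "RawRecon"
        # Stages 2-8 would fill technique, modifiers, acceleration, contrast,
        # body part, and intent; shown here as a single toy decision
        result.directory_type = "dwi" if fp.get("mr_diffusion_b_value") else "anat"
        # Stage 9: aggregate any review triggers collected along the way
        result.manual_review_required = bool(result.review_reasons)
        return result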

Output

SeriesClassificationCache records containing:

  • All six classification axes
  • Flags (post_contrast, localizer, spinal_cord)
  • BIDS intent (directory_type)
  • Review requirements

Step 4: Output Generation

Purpose: Export classified data to the target output structure.

What It Does

  1. Filter by Classification
     • Include/exclude by provenance
     • Include/exclude by intent

  2. Organize Output
     • BIDS structure or flat layout
     • Provenance-specific routing

  3. Copy/Convert Files
     • DICOM copy or NIfTI conversion
     • Parallel processing

Output Modes

Mode     Description
dcm      Copy DICOM files
nii      Convert to NIfTI
nii.gz   Convert to compressed NIfTI
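
A sketch of how the mode might drive per-series export; the function and the conversion backend are assumptions, not the tool's actual code:

    import shutil
    from pathlib import Path

    def export_series(src: Path, dest: Path, mode: str) -> None:
        """Hypothetical export honoring the configured output mode."""
        dest.mkdir(parents=True, exist_ok=True)
        if mode == "dcm":
            # Copy DICOM files verbatim
            for f in src.glob("*.dcm"):
                shutil.copy2(f, dest / f.name)
        elif mode in ("nii", "nii.gz"):
            # Hand off to a DICOM-to-NIfTI converter, compressing for nii.gz;
            # the converter itself is not specified in this document
            raise NotImplementedError("plug in your NIfTI conversion backend")
        else:
            raise ValueError(f"unknown output mode: {mode}")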

Running Sorting

From Web Interface

  1. Navigate to the cohort
  2. Click Sort
  3. Steps run automatically in sequence
  4. Monitor in Jobs tab

Step-by-Step Execution

You can also run steps individually:

  1. Run Step 1 (Checkup)
  2. Review validation results
  3. Run Step 2 (Fingerprint)
  4. Run Step 3 (Classification)
  5. Run Step 4 (Output) when ready

Configuration Options

Option               Description                             Default
reprocess            Reclassify already-classified series    false
include_modalities   Filter to specific modalities           all
parallel_workers     Classification workers                  4
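
Expressed as plain settings, the table maps to something like the following (an illustrative snippet only; consult your deployment for the real configuration format):

    # Defaults shown in the table above
    sorting_config = {
        "reprocess": False,            # skip series that already have a classification
        "include_modalities": None,    # None = all modalities; e.g. ["MR"] to restrict
        "parallel_workers": 4,         # classification workers for Step 3
    }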

Date Recovery

Step 1 attempts to repair missing study_date:

  1. Check acquisition_date from Instance
  2. Check content_date from Instance
  3. Mark as excluded if unrecoverable

This handles DICOM files with missing or corrupted dates.
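
The fallback chain amounts to a simple priority order; a minimal sketch (the function name is illustrative):

    from datetime import date

    def resolve_study_date(study_date: date | None,
                           acquisition_date: date | None,
                           content_date: date | None) -> date | None:
        """Return the first available date; None marks the study unrecoverable."""
        return study_date or acquisition_date or content_date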


Stack Key

Each SeriesStack has a deterministic stack_key:

MR|TE=2.46|TI=900|FA=9|ECHO=1|TYPE=M|ORIENT=AX

This enables:

  • Duplicate detection across reruns
  • Idempotent classification
  • Stack grouping within series
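
A minimal sketch of how such a key could be assembled (the builder shown here is illustrative, not the actual implementation):

    def build_stack_key(modality: str, te: float, ti: float | None,
                        flip_angle: float, echo: int,
                        image_type: str, orientation: str) -> str:
        """Same acquisition parameters always yield the same key across reruns."""
        ti_part = f"TI={ti:g}" if ti is not None else "TI=NA"
        return "|".join([modality, f"TE={te:g}", ti_part,
                         f"FA={flip_angle:g}", f"ECHO={echo}",
                         f"TYPE={image_type}", f"ORIENT={orientation}"])

    build_stack_key("MR", 2.46, 900, 9, 1, "M", "AX")
    # -> 'MR|TE=2.46|TI=900|FA=9|ECHO=1|TYPE=M|ORIENT=AX'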

Fingerprint Features

StackFingerprint contains normalized features:

General Features

  • modality - MR, CT, PET
  • manufacturer - Normalized (GE, SIEMENS, PHILIPS, etc.)
  • text_search_blob - Concatenated descriptions

Geometry Features

  • stack_orientation - Axial, Coronal, Sagittal
  • fov_x, fov_y - Field of view in mm
  • aspect_ratio - FOV ratio

MR Features

  • mr_te, mr_tr, mr_ti - Timing parameters (ms)
  • mr_flip_angle - Flip angle (degrees)
  • mr_acquisition_type - 2D or 3D
  • mr_diffusion_b_value - Diffusion b-value

CT Features

  • ct_kvp - Tube voltage
  • ct_tube_current - Tube current
  • ct_convolution_kernel - Reconstruction kernel

PET Features

  • pet_tracer - Radiopharmaceutical
  • pet_reconstruction_method - Recon algorithm
  • pet_suv_type - SUV calculation type

Troubleshooting

"No series to process"

  • Check that extraction completed successfully
  • Verify that series exist in the database
  • Check modality filters

Classification Issues

  • Review manual_review_required flags
  • Check manual_review_reasons_csv for details
  • Use QC interface to review flagged series

Performance

  • Step 2 is typically the bottleneck
  • Ensure adequate RAM (8GB+ for large datasets)
  • Reduce batch size if memory issues occur