Cancer as the proving ground for information-theoretic thinking machines
We integrate multi-omics data and treat missingness as an information-budget problem. When the model reconstructs masked values, we measure evidence sufficiency in bits/nats and use verifiers to keep downstream discovery stable and auditable.
Can a model learn cross-omics structure by reconstructing what we hide?
Most ML learns from labels. Here, the dataset is the teacher: we mask values and ask the model to reconstruct them.
If a model can predict masked expression from CNV, clinical variables, and knowledge context, it has captured real cross-modal constraints—not just a brittle mapping.
Our goal isn’t blind imputation. It’s an auditable pipeline: propose a value, verify it against available evidence, and certify what cleared the budget.
Traditional Approach
Train on labeled data. The model learns mappings from input to output but not the underlying structure.
Our Approach
Mask data, force reconstruction. The model learns biological functions to fill gaps—revealing how omics layers connect.
We don't train the model on biology papers. We train it to reconstruct masked data. The biology emerges from the structure.
Four data layers, one learned function
Each layer provides a different view into cancer biology. The model learns how they connect.
Mask, Reconstruct, Discover
A simple protocol that turns biological structure into a measurable test.
Mask
Hide 10% of values across gene expression, CNV, and clinical features. These become reconstruction targets.
Integrate
Feed remaining multi-omics data to the LLM with knowledge graph context for grounding.
Reconstruct
Model predicts masked values using cross-modal relationships it has learned.
Validate
Compare predictions to held-out ground truth. Measure where reconstruction is accurate, calibrated, and evidence-supported.
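The four steps above can be sketched end to end on toy data. This is a minimal illustration, not the production pipeline: a column-mean baseline stands in for the LLM reconstructor, and the matrix is simulated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy multi-omics matrix: rows = patients, columns = features
# (expression, CNV, clinical). Stand-in for real cohort data.
X = rng.normal(size=(100, 20))

# 1. Mask: hide ~10% of entries; these become reconstruction targets.
mask = rng.random(X.shape) < 0.10
X_masked = X.copy()
X_masked[mask] = np.nan

# 2-3. Integrate + Reconstruct: the real pipeline queries the model with
# knowledge-graph context; here a simple column-mean baseline stands in.
col_means = np.nanmean(X_masked, axis=0)
X_hat = np.where(mask, col_means[None, :], X_masked)

# 4. Validate: compare predictions to held-out ground truth.
rmse = np.sqrt(np.mean((X_hat[mask] - X[mask]) ** 2))
print(f"masked fraction = {mask.mean():.3f}, RMSE = {rmse:.3f}")
```

Any imputer can be dropped in place of the baseline; the protocol only requires that masked entries are never shown to the model before validation.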
Patterns the model found without being told
Reconstruction is the test. The interesting part is what a model must learn to pass it: cross-omics couplings, cohort structure, and constraints that we didn’t hard-code.
CNV-Expression Coupling
The model strongly weights copy number when predicting expression. It learned gene dosage effects—more copies means more transcription.
"More copies → More expression"Stage-Specific Constraints
Validation confidence varies by tumor stage. Early-stage tumors reconstruct with tight confidence bounds; late-stage tumors show higher uncertainty, consistent with greater biological heterogeneity.
"Early = tight bounds, Late = chaos"Pathway Co-Regulation
Genes in the same biological pathway predict each other's expression. The model learned regulatory networks from reconstruction.
"Same pathway → mutual prediction"Patient Neighborhoods
Cohort similarity improves reconstruction accuracy. Age and stage define "biological neighborhoods" where patients share expression patterns.
"Similar patients → similar biology"From discovery to deployment: Evidence-grounded imputation
These discoveries become a practical tool. When you have missing cancer data, the model proposes imputations and our verification layer scores them with an evidence budget—producing a certificate or an abstention, not a guess.
Claim Validation
Each imputed value is treated as a claim requiring evidence.
Cross-Modal Consistency
Validates expression using CNV and clinical signals.
Similar-Patient Priors
Leverages cohort similarity for validation.
Uncertainty Quantification
Per-value uncertainty estimates and calibrated intervals.
KG Grounding
Anchors predictions in biomedical knowledge graphs to suppress unsupported completions.
Ensemble Filtering
Compares multiple imputation methods and flags values where they disagree.
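One way the checks above can combine is as additive evidence: each verifier contributes a log-likelihood ratio in bits, and the claim is certified only if the total clears the budget. The check names, weights, and threshold below are illustrative assumptions, not the production scoring rules.

```python
import math
from dataclasses import dataclass

@dataclass
class Claim:
    gene: str
    value: float
    checks: dict  # check name -> likelihood ratio in favor of the claim

def score(claim: Claim, budget_bits: float = 3.0) -> dict:
    # Sum log2 likelihood ratios from (assumed independent) verifiers.
    bits = sum(math.log2(lr) for lr in claim.checks.values())
    if bits >= budget_bits:
        return {"action": "certify", "bits": bits,
                "evidence": sorted(claim.checks)}
    return {"action": "abstain", "bits": bits}

# Hypothetical claim: high ERBB2 expression, backed by three verifiers.
claim = Claim("ERBB2", 8.1, {
    "cnv_consistency": 4.0,      # amplification supports high expression
    "cohort_prior": 2.5,         # similar patients show similar levels
    "ensemble_agreement": 1.5,   # independent methods roughly agree
})
print(score(claim))  # log2(4.0 * 2.5 * 1.5) ≈ 3.9 bits >= 3.0: certify
```

The same scoring yields an abstention when checks disagree: a likelihood ratio below 1 subtracts bits, pulling the total under the budget.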
Why this approach works
Structure as Teacher
Masking turns the dataset into supervision: reconstruction forces the model to internalize cross-omics constraints.
Evidence Budgets
We operationalise confidence in bits/nats: accept an imputation only when the evidence clears the budget; otherwise abstain.
Verifier Coupling
Predictions are checked against multi-modal signals (CNV, clinical, cohort priors, KG) and disagreement across methods.
Certificates
Outputs ship with an audit trail: what evidence was used, how much was needed, and why the system acted.
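A certificate can be as simple as a structured record attached to each accepted value. This sketch shows one plausible shape; the field names and identifiers are illustrative assumptions, not a fixed schema.

```python
import json
from datetime import datetime, timezone

# Hypothetical audit record shipped with one accepted imputation.
certificate = {
    "claim": {
        "patient": "PATIENT-001",   # placeholder identifier
        "gene": "ERBB2",
        "imputed_value": 8.1,
    },
    "evidence_bits": 3.9,           # total evidence accumulated
    "budget_bits": 3.0,             # threshold the evidence had to clear
    "checks": ["cnv_consistency", "cohort_prior", "ensemble_agreement"],
    "action": "certify",            # would be "abstain" below budget
    "timestamp": datetime.now(timezone.utc).isoformat(),
}
print(json.dumps(certificate, indent=2))
```

Because the record names every check and the budget it cleared, a reviewer can replay the decision without rerunning the model.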
Ready to build auditable multi-omics pipelines?
Stop imputing blind. Start shipping evidence-rated imputations with certificates.