Cancer Research

Cancer as the proving ground for information-theoretic thinking machines

We integrate multi-omics data and treat missingness as an information-budget problem. When the model reconstructs masked values, we measure evidence sufficiency in bits/nats and use verifiers to keep downstream discovery stable and auditable.

[Figure: Multi-omics input (expression, CNV, clinical) with masked values, e.g. Age: ?, feeds the LLM f(x), a non-parametric function approximator; the learned patterns let it reconstruct the masked values with confidence, e.g. Age: 62, Stage: III.]
The Question

Can a model learn cross-omics structure by reconstructing what we hide?

Most ML learns from labels. Here, the dataset is the teacher: we mask values and ask the model to reconstruct them.

If a model can predict masked expression from CNV, clinical variables, and knowledge context, it has captured real cross-modal constraints—not just a brittle mapping.

Our goal isn’t blind imputation. It’s an auditable pipeline: propose a value, verify it against available evidence, and certify what cleared the budget.

Traditional Approach

Train on labeled data. The model learns mappings from input to output but not the underlying structure.

Our Approach

Mask data, force reconstruction. The model learns biological functions to fill gaps—revealing how omics layers connect.

"We don't train the model on biology papers. We train it to reconstruct masked data. The biology emerges from the structure."
Multi-Omics Integration

Four data layers, one learned function

Each layer provides a different view into cancer biology. The model learns how they connect.

[Figure: Four input layers, expression, CNV (amplifications and deletions), clinical, and knowledge graph, integrated by the LLM f(x).]
The Experiment

Mask, Reconstruct, Discover

A simple protocol that turns biological structure into a measurable test.

1

Mask

Hide 10% of values across gene expression, CNV, and clinical features. These become reconstruction targets.

2

Integrate

Feed remaining multi-omics data to the LLM with knowledge graph context for grounding.

3

Reconstruct

Model predicts masked values using cross-modal relationships it has learned.

4

Validate

Compare predictions to held-out ground truth. Measure where reconstruction is accurate, calibrated, and evidence-supported.
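
The four-step protocol above can be sketched end to end in a few lines. This is a minimal illustration, not the production pipeline: a column-mean imputer stands in for the LLM, and only the 10% mask rate comes from the protocol itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_protocol(X, reconstruct, mask_frac=0.10):
    """Mask `mask_frac` of entries, reconstruct them, return mean abs error."""
    X = X.astype(float)
    mask = rng.random(X.shape) < mask_frac   # 1. Mask ~10% of values
    X_obs = np.where(mask, np.nan, X)        #    hidden entries become targets
    X_hat = reconstruct(X_obs)               # 2-3. Integrate and reconstruct
    return np.abs(X_hat[mask] - X[mask]).mean()  # 4. Validate vs. ground truth

# Toy stand-in for the LLM: impute each missing entry with its column mean.
def column_mean_reconstruct(X_obs):
    col_means = np.nanmean(X_obs, axis=0)
    return np.where(np.isnan(X_obs), col_means, X_obs)

X = rng.normal(size=(200, 12))               # synthetic feature matrix
mae = run_protocol(X, column_mean_reconstruct)
```

Any model exposing a `reconstruct` callable slots into the same loop, so the protocol doubles as a benchmark harness: better cross-modal structure means lower masked-entry error.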

PATIENT P1 BRCA1: 8.2 TP53: MASKED EGFR: 4.1 CNV: +2 chr17 Stage: III LLM Reconstruction f(Expr, CNV, Clinical, KG) → TP53 prediction RECONSTRUCTED BRCA1: 8.2 TP53: 6.7 EGFR: 4.1 CNV: +2 chr17 Stage: III LEARNED CNV amp + Stage III → elevated TP53
Discoveries

Patterns the model found without being told

Reconstruction is the test. The interesting part is what a model must learn to pass it: cross-omics couplings, cohort structure, and constraints that we didn’t hard-code.


CNV-Expression Coupling

The model strongly weights copy number when predicting expression. It learned gene dosage effects—more copies means more transcription.

"More copies → More expression"
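
A quick simulation makes the dosage effect concrete; the intercept, slope, and noise level below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate expression driven by copy number: more copies, more transcript.
cnv = rng.integers(-2, 3, size=500).astype(float)   # copy-number calls, -2..+2
expr = 5.0 + 1.5 * cnv + rng.normal(0, 0.5, 500)    # made-up dosage relationship

# A least-squares fit recovers a clearly positive CNV weight,
# mirroring the coupling the model learns implicitly.
slope = np.polyfit(cnv, expr, 1)[0]
```
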

Stage-Specific Constraints

Validation confidence varies by tumor stage: early-stage tumors are predictable, while late-stage tumors show higher uncertainty, reflecting biological heterogeneity.

"Early = tight bounds, Late = chaos"
[Figure: DNA repair pathway genes BRCA1, RAD51, PALB2.]

Pathway Co-Regulation

Genes in the same biological pathway predict each other's expression. The model learned regulatory networks from reconstruction.

"Same pathway → mutual prediction"
[Figure: Patient P1's neighborhood, similar patients (Age: 55-60, Stage: II) vs. distant ones.]

Patient Neighborhoods

Cohort similarity improves reconstruction accuracy. Age and stage define "biological neighborhoods" where patients share expression patterns.

"Similar patients → similar biology"
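
As a hypothetical sketch, cohort similarity can be exploited with a simple nearest-neighbor imputer over (age, stage); the feature scaling and the stage-driven toy cohort below are assumptions, not the real model.

```python
import numpy as np

rng = np.random.default_rng(2)

def knn_impute(target_meta, cohort_meta, cohort_expr, k=5):
    """Average expression of the k cohort patients closest in (age, stage)."""
    d = np.linalg.norm(cohort_meta - target_meta, axis=1)
    nearest = np.argsort(d)[:k]
    return cohort_expr[nearest].mean()

# Toy cohort where expression tracks stage, so neighborhoods are informative.
stage = rng.integers(1, 5, size=100)
age = rng.normal(60, 8, size=100)
meta = np.column_stack([age / 10.0, stage.astype(float)])  # crude scaling
expr = 4.0 + 1.0 * stage + rng.normal(0, 0.3, 100)

target = np.array([62 / 10.0, 3.0])   # a patient aged 62 at Stage III
pred = knn_impute(target, meta, expr)
```

Because the target's neighbors share its stage, the neighborhood average lands near the stage-III expression level, which is the effect the heading describes.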
The Application

From discovery to deployment: Evidence-grounded imputation

These discoveries become a practical tool. When you have missing cancer data, the model proposes imputations and our verification layer scores them with an evidence budget—producing a certificate or an abstention, not a guess.

Claim Validation

Each imputed value is treated as a claim requiring evidence.

Cross-Modal Consistency

Validates expression using CNV and clinical signals.

Similar-Patient Priors

Leverages cohort similarity for validation.

Uncertainty Quantification

Per-value uncertainty estimates and calibrated intervals.

KG Grounding

Grounds predictions in biomedical knowledge to reduce ungrounded completions.

Ensemble Filtering

Compares multiple imputation methods.
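
A minimal sketch of the ensemble-filtering idea: several methods vote, and high disagreement triggers abstention instead of a guess. Both the method outputs and the threshold here are placeholders.

```python
import statistics

DISAGREEMENT_THRESHOLD = 0.5  # illustrative max std-dev across methods

def ensemble_filter(proposals):
    """Accept the ensemble mean only when the methods agree closely."""
    spread = statistics.pstdev(proposals)
    if spread <= DISAGREEMENT_THRESHOLD:
        return {"value": statistics.mean(proposals), "status": "accept"}
    return {"value": None, "status": "abstain"}

agree = ensemble_filter([6.7, 6.8, 6.6])     # methods concur: accept
conflict = ensemble_filter([6.7, 2.1, 9.3])  # methods conflict: abstain
```
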

Philosophy

Why this approach works

[Figure: The learned function surface, f(CNV, Stage, ...) → Expression, spanning CNV amplification and Stages I through IV, with masked (unknown) points and their reconstructions shown on the surface. Key insight: the LLM learns this surface implicitly from the reconstruction loss.]

Structure as Teacher

Masking turns the dataset into supervision: reconstruction forces the model to internalize cross-omics constraints.

Evidence Budgets

We operationalize confidence in bits/nats: accept an imputation only when the evidence clears the budget; otherwise abstain.
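
A toy version of the budget check: a proposed value is charged its surprisal under a Gaussian fit to the available evidence, and is certified only if that cost clears the budget. The budget value and the Gaussian verifier are illustrative assumptions, not the real system.

```python
import math

BUDGET_NATS = 2.0  # illustrative: max surprisal an accepted imputation may incur

def surprisal_nats(value, evidence):
    """Negative log-density of `value` under a Gaussian fit to `evidence`."""
    mu = sum(evidence) / len(evidence)
    var = max(sum((x - mu) ** 2 for x in evidence) / len(evidence), 1e-9)
    return 0.5 * math.log(2 * math.pi * var) + (value - mu) ** 2 / (2 * var)

def certify(value, evidence):
    """Return a certificate when the evidence cost clears the budget, else abstain."""
    cost = surprisal_nats(value, evidence)
    status = "certified" if cost <= BUDGET_NATS else "abstain"
    return {"value": value if status == "certified" else None,
            "nats": cost, "status": status}
```

A proposal consistent with the evidence is cheap in nats and ships with its cost attached; an outlier blows the budget and the system abstains rather than guesses.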

Verifier Coupling

Predictions are checked against multi-modal signals (CNV, clinical, cohort priors, KG) and disagreement across methods.

Certificates

Outputs ship with an audit trail: what evidence was used, how much was needed, and why the system acted.

Ready to build auditable multi-omics pipelines?

Stop imputing blind. Start shipping evidence-rated imputations with certificates.