Cancer as the proving ground for information-theoretic thinking machines
We integrate multi-omics data and treat missingness as an information-budget problem. When the model reconstructs masked values, we measure evidence sufficiency in bits/nats and use verifiers to keep downstream discovery stable and auditable.
Can a model learn cross-omics structure by reconstructing what we hide?
Most ML learns from labels. Here, the dataset is the teacher: we mask values and ask the model to reconstruct them.
If a model can predict masked expression from CNV, clinical variables, and knowledge context, it has captured real cross-modal constraints—not just a brittle mapping.
Our goal isn’t blind imputation. It’s an auditable pipeline: propose a value, verify it against available evidence, and certify what cleared the budget.
Traditional Approach
Train on labeled data. The model learns mappings from input to output but not the underlying structure.
Our Approach
Mask data, force reconstruction. The model learns biological functions to fill gaps—revealing how omics layers connect.
We don't train the model on biology papers. We train it to reconstruct masked data. The biology emerges from the structure.
Four data layers, one learned function
Each layer provides a different view into cancer biology. The model learns how they connect.
Mask, Reconstruct, Discover
A simple protocol that turns biological structure into a measurable test.
Mask
Hide 10% of values across gene expression, CNV, and clinical features. These become reconstruction targets.
Integrate
Feed remaining multi-omics data to the LLM with knowledge graph context for grounding.
Reconstruct
Model predicts masked values using cross-modal relationships it has learned.
Validate
Compare predictions to held-out ground truth. Measure where reconstruction is accurate, calibrated, and evidence-supported.
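The four steps above can be sketched end to end on toy data. This is a minimal illustration, not the production pipeline: a column-mean baseline stands in for the LLM reconstructor, and the matrix is simulated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy multi-omics matrix: rows = patients, columns = features
# (expression, CNV, clinical). Stand-in for real cohort data.
X = rng.normal(size=(100, 20))

# 1. Mask: hide ~10% of entries; these become reconstruction targets.
mask = rng.random(X.shape) < 0.10
X_masked = X.copy()
X_masked[mask] = np.nan

# 2-3. Integrate + Reconstruct: the real pipeline queries the model with
# knowledge-graph context; here a simple column-mean baseline stands in.
col_means = np.nanmean(X_masked, axis=0)
X_hat = np.where(mask, col_means[None, :], X_masked)

# 4. Validate: compare predictions to held-out ground truth.
rmse = np.sqrt(np.mean((X_hat[mask] - X[mask]) ** 2))
print(f"masked fraction = {mask.mean():.3f}, RMSE = {rmse:.3f}")
```

Any imputer can be dropped in place of the baseline; the protocol only requires that masked entries are never shown to the model before validation.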
Patterns the model found without being told
Reconstruction is the test. The interesting part is what a model must learn to pass it: cross-omics couplings, cohort structure, and constraints that we didn’t hard-code.
CNV-Expression Coupling
The model strongly weights copy number when predicting expression. It learned gene dosage effects—more copies means more transcription.
"More copies → More expression"Stage-Specific Constraints
Validation confidence varies by tumor stage. Early-stage tumors reconstruct with tight confidence bounds; late-stage tumors show higher uncertainty, consistent with greater biological heterogeneity.
"Early = tight bounds, Late = chaos"Pathway Co-Regulation
Genes in the same biological pathway predict each other's expression. The model learned regulatory networks from reconstruction.
"Same pathway → mutual prediction"Patient Neighborhoods
Cohort similarity improves reconstruction accuracy. Age and stage define "biological neighborhoods" where patients share expression patterns.
"Similar patients → similar biology"From discovery to deployment: Evidence-grounded imputation
These discoveries become a practical tool. When you have missing cancer data, the model proposes imputations and our verification layer scores them with an evidence budget—producing a certificate or an abstention, not a guess.
Claim Validation
Each imputed value is treated as a claim requiring evidence.
Cross-Modal Consistency
Validates expression using CNV and clinical signals.
Similar-Patient Priors
Leverages cohort similarity for validation.
Uncertainty Quantification
Per-value uncertainty estimates and calibrated intervals.
KG Grounding
Anchors predictions in biomedical knowledge graphs to suppress unsupported completions.
Ensemble Filtering
Compares multiple imputation methods and flags values where they disagree.
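One way the checks above can combine is as additive evidence: each verifier contributes a log-likelihood ratio in bits, and the claim is certified only if the total clears the budget. The check names, weights, and threshold below are illustrative assumptions, not the production scoring rules.

```python
import math
from dataclasses import dataclass

@dataclass
class Claim:
    gene: str
    value: float
    checks: dict  # check name -> likelihood ratio in favor of the claim

def score(claim: Claim, budget_bits: float = 3.0) -> dict:
    # Sum log2 likelihood ratios from (assumed independent) verifiers.
    bits = sum(math.log2(lr) for lr in claim.checks.values())
    if bits >= budget_bits:
        return {"action": "certify", "bits": bits,
                "evidence": sorted(claim.checks)}
    return {"action": "abstain", "bits": bits}

# Hypothetical claim: high ERBB2 expression, backed by three verifiers.
claim = Claim("ERBB2", 8.1, {
    "cnv_consistency": 4.0,      # amplification supports high expression
    "cohort_prior": 2.5,         # similar patients show similar levels
    "ensemble_agreement": 1.5,   # independent methods roughly agree
})
print(score(claim))  # log2(4.0 * 2.5 * 1.5) ≈ 3.9 bits >= 3.0: certify
```

The same scoring yields an abstention when checks disagree: a likelihood ratio below 1 subtracts bits, pulling the total under the budget.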
Why this approach works
Structure as Teacher
Masking turns the dataset into supervision: reconstruction forces the model to internalize cross-omics constraints.
Evidence Budgets
We operationalise confidence in bits/nats: accept an imputation only when the evidence clears the budget; otherwise abstain.
Verifier Coupling
Predictions are checked against multi-modal signals (CNV, clinical, cohort priors, KG) and disagreement across methods.
Certificates
Outputs ship with an audit trail: what evidence was used, how much was needed, and why the system acted.
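A certificate can be as simple as a structured record attached to each accepted value. This sketch shows one plausible shape; the field names and identifiers are illustrative assumptions, not a fixed schema.

```python
import json
from datetime import datetime, timezone

# Hypothetical audit record shipped with one accepted imputation.
certificate = {
    "claim": {
        "patient": "PATIENT-001",   # placeholder identifier
        "gene": "ERBB2",
        "imputed_value": 8.1,
    },
    "evidence_bits": 3.9,           # total evidence accumulated
    "budget_bits": 3.0,             # threshold the evidence had to clear
    "checks": ["cnv_consistency", "cohort_prior", "ensemble_agreement"],
    "action": "certify",            # would be "abstain" below budget
    "timestamp": datetime.now(timezone.utc).isoformat(),
}
print(json.dumps(certificate, indent=2))
```

Because the record names every check and the budget it cleared, a reviewer can replay the decision without rerunning the model.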
Ready to build auditable multi-omics pipelines?
Stop imputing blind. Start shipping evidence-rated imputations with certificates.