A study of PulmoFoundation, an AI model built for lung pathology, has been posted on arXiv with broader validation data than is usual in computational pathology (CPath) papers. The paper matters because of how the model was evaluated: on biopsies, frozen sections, and resection specimens; against IHC markers, molecular markers, and patient survival; and then in prospective validation and a reader study involving pathologists.
From the laboratory’s side, these numbers raise one practical question: can the model become a triage and support layer at defined points in the workflow, without replacing the physician’s judgment or encouraging overconfidence in its outputs? The paper gives enough data for a professional discussion of that question.
What did the study test?
PulmoFoundation builds on Virchow2, with additional lung-focused training on more than 88 million tiles taken from about 40,000 digital H&E slides from 12 institutional and public sources. The model was then tested on more than 26,000 whole-slide images (WSIs) across 32 clinical tasks, using 32 internal cohorts and 21 external cohorts from 8 independent institutions.
This scale does not remove questions about bias or protocol differences, but it moves the discussion away from a narrow technical demonstration and closer to the daily workflow. The paper does not stop at classifying one image or one task. It measures model performance across biopsy, frozen section, and resection reporting, then connects it with selected IHC decisions, molecular markers, and prognostic endpoints.
Biopsy: the first decision gate
For biopsies, the study evaluated four main tasks. The model reached a mean AUC of 0.936 internally and 0.916 externally. For benign versus malignant classification, the AUC was 0.970 internally and 0.916 externally. This point matters because a biopsy is often a limited sample, and the first decision sets the pace for histologic subtyping and ancillary testing.
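As a reading aid for the AUC figures quoted here, AUC can be understood as the probability that a randomly chosen positive case (for example, malignant) receives a higher model score than a randomly chosen negative case. The following is a minimal sketch of that definition, not the paper's evaluation code; the function name and the scores are made up for illustration.

```python
from typing import Sequence


def pairwise_auc(positive_scores: Sequence[float],
                 negative_scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one score in each class")
    wins = 0.0
    for p in positive_scores:
        for q in negative_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical scores, for illustration only.
malignant = [0.9, 0.8, 0.4]
benign = [0.7, 0.3, 0.4]
print(pairwise_auc(malignant, benign))
```

An AUC near 0.9 therefore still leaves roughly one in ten such pairs ranked the wrong way round, which is why the operating point and the error pattern matter more than the summary number.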
For distinguishing primary lung cancer from metastasis, some comparisons reached an AUC of 1.000 within a defined cohort, but that result has to be read within the limits of the metastatic tumor types, their number, and the data source. It should not be turned into a general promise. Its practical value is that WSI representations may carry useful signals for triaging cases that need deeper review, not that they replace the clinical context or an IHC panel when one is needed.
Frozen section: where time matters
Frozen section tests the model under a different kind of pressure. The decision is rapid, and the surgical consequence is immediate. Across four frozen-section tasks, PulmoFoundation reached a mean AUC of 0.908 internally and 0.985 externally. At an operating point requiring specificity of at least 99%, the model missed fewer malignant cases than the reference models in the test centers described in the paper.
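To make the "specificity of at least 99%" operating point concrete, here is a minimal sketch of how a laboratory might pick such a threshold on its own benign cases and then count the malignant cases missed at that threshold. It assumes a case is called malignant when its score is strictly above the threshold. The function names and scores are hypothetical, and the paper's own procedure may differ.

```python
from bisect import bisect_right
from typing import Sequence


def threshold_for_specificity(benign_scores: Sequence[float],
                              min_specificity: float = 0.99) -> float:
    """Lowest observed benign score t such that calling 'malignant'
    only when score > t keeps specificity >= min_specificity."""
    if not benign_scores:
        raise ValueError("need at least one benign score")
    ordered = sorted(benign_scores)
    n = len(ordered)
    for t in ordered:
        # bisect_right counts benign scores <= t: the true negatives at t.
        if bisect_right(ordered, t) / n >= min_specificity:
            return t
    return ordered[-1]  # fallback: the maximum score gives specificity 1.0


def missed_malignant(malignant_scores: Sequence[float], threshold: float) -> int:
    """Malignant cases scored at or below the threshold (false negatives)."""
    return sum(1 for s in malignant_scores if s <= threshold)


# Hypothetical local validation scores, for illustration only.
benign = [i / 100 for i in range(100)]
malignant = [0.50, 0.985, 0.99, 0.999]
t = threshold_for_specificity(benign, 0.99)
print(t, missed_malignant(malignant, t))
```

Fixing specificity first and then counting misses mirrors the comparison described above: at the same false-alarm budget, the better model is the one that misses fewer malignant cases.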
Here, the word “accurate” is not enough. The pathologist cares about the error pattern: does it miss malignancy? Does it push the surgeon toward overtreatment? How does it handle freezing artifact? The study provides some numbers around these questions, but local testing is still needed before any clinical adoption, because freezing quality, specimen type, and surgical-team behavior differ between laboratories.
Resection specimens: classification, grading, and beyond
In resection specimens, the model was tested on 12 tasks related to classification, grading, and pathologic assessment. The tasks included benign versus malignant disease, primary tumor versus metastasis, origin of metastasis, and distinction between adenocarcinoma and squamous cell carcinoma. They also covered reporting elements that affect adjuvant therapy and prognosis.
What stands out in this section is that the model was not presented as a tool that gives only one answer. The more realistic use is as a second-reading layer that points to areas or probabilities that deserve attention, especially in large cases or when the histologic impression conflicts with later test results. Any real deployment should stay inside a clear quality-control system, with errors recorded and reviewed regularly.
IHC and molecular markers: early value, not a replacement for testing
The paper tested the model’s ability to infer markers from H&E, including TTF-1, Napsin-A, CK7, P40, and P63, along with Ki-67, some molecular markers, and survival outcomes. For example, TTF-1 reached an internal AUC of 0.923, Napsin-A 0.936, and CK7 0.899, with higher numbers in an external cohort reported in the paper.
These results do not mean stains can be cancelled. A better professional reading is that they may help organize the work: which cases look clear enough to skip a low-yield stain order, and which cases need rapid confirmation? In prospective validation, the study suggested that the system could defer 44.5% of IHC orders within prespecified safety thresholds, with a pooled PPV of 0.966 for the marker panel. That is an important number, but it requires a local definition of “defer”: does it mean not ordering the stain at all, waiting for pathologist review, or showing an internal recommendation in the work interface?
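One way to reproduce this kind of figure locally is to measure, at a fixed confidence threshold, what share of cases would be deferred and how often the model's positive call agrees with the actual stain. The sketch below assumes one possible definition of "defer" (the model score for a marker is at or above the threshold). That definition, the class and function names, and the data are illustrative assumptions, not the paper's protocol.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MarkerCase:
    score: float          # model probability that the marker is positive
    stain_positive: bool  # reference result from the actual IHC stain


def deferral_summary(cases: List[MarkerCase],
                     threshold: float) -> Tuple[float, Optional[float]]:
    """Return (deferral rate, PPV among deferred cases).

    PPV is None when no case reaches the threshold.
    """
    if not cases:
        raise ValueError("need at least one case")
    deferred = [c for c in cases if c.score >= threshold]
    rate = len(deferred) / len(cases)
    if not deferred:
        return rate, None
    ppv = sum(c.stain_positive for c in deferred) / len(deferred)
    return rate, ppv


# Hypothetical cases, for illustration only.
cases = [
    MarkerCase(0.99, True),
    MarkerCase(0.97, True),
    MarkerCase(0.96, False),
    MarkerCase(0.50, True),
    MarkerCase(0.20, False),
]
print(deferral_summary(cases, threshold=0.95))
```

Raising the threshold trades deferral rate for PPV, so the two numbers should always be reported together and for the same threshold.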
Prospective validation and the pathologist reader study
The strongest part of the paper is the prospective validation on 1,357 consecutive patients across 11 tasks in routine practice. The model reached a mean AUC of 0.923. The triage thresholds also suggested that the system could reduce the second-review workload in 68.8% of biopsies and 83.0% of frozen sections, with PPV values of 1.000 and 0.991, respectively.
The study also ran a randomized crossover experiment with eight pathologists across 4,928 case-reader pairs. Accuracy with assistance increased from 83.8% to 91.7%, median diagnostic time fell by 19.6%, diagnostic confidence increased by 8.7%, and inter-reader agreement improved from κ=0.56 to κ=0.76. These are strong figures, but the number of pathologists is limited. The display interface and the way the model result is shown also matter, because they may change reader behavior as much as the model itself does.
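For readers less familiar with κ, the sketch below computes Cohen's kappa for two readers: observed agreement corrected for the agreement expected by chance. This summary does not state which kappa variant the paper used for its eight readers, so the two-reader form here is only an illustration of the idea, with made-up labels.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(reader_a: Sequence[Hashable], reader_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (observed - expected) / (1 - expected)."""
    if len(reader_a) != len(reader_b) or not reader_a:
        raise ValueError("readers must label the same non-empty set of cases")
    n = len(reader_a)
    observed = sum(a == b for a, b in zip(reader_a, reader_b)) / n
    freq_a, freq_b = Counter(reader_a), Counter(reader_b)
    # Chance agreement: product of each reader's label frequencies, summed.
    expected = sum(freq_a[label] * freq_b[label]
                   for label in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both readers use one label")
    return (observed - expected) / (1.0 - expected)


# Hypothetical diagnoses from two readers, for illustration only.
a = ["adeno", "adeno", "squamous", "squamous"]
b = ["adeno", "adeno", "squamous", "adeno"]
print(cohen_kappa(a, b))
```

Because kappa discounts chance agreement, it is sensitive to how balanced the case mix is, which is one more reason to recompute it on local cases.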
The risk of bias toward the machine
The paper did not ignore automation bias. Among 4,928 AI-assisted observations, the reader’s answer became less accurate after the model result was shown in 0.5% of observations, and strict harm, meaning the reader adopted the model’s error, occurred in 0.1%. These rates are small, but they are not zero. They are a reminder that any support system needs an interface designed so that the output is not accepted as final truth.
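A laboratory running its own reader comparison could tabulate the same two rates from paired reads. The sketch below assumes working definitions that may not match the paper's exactly: a read is "worsened" when the unaided answer was correct and the assisted answer was not, and "strict harm" when, in addition, the model was wrong and the reader's assisted answer matched the model. All names and data are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class PairedRead(NamedTuple):
    truth: str     # reference diagnosis
    unaided: str   # reader's answer without the model
    model: str     # model's answer
    assisted: str  # reader's answer after seeing the model


def automation_bias_rates(reads: List[PairedRead]) -> Tuple[float, float]:
    """Return (worsened rate, strict-harm rate) over all paired reads."""
    if not reads:
        raise ValueError("need at least one paired read")
    worsened = [r for r in reads
                if r.unaided == r.truth and r.assisted != r.truth]
    # Strict harm (assumed definition): the reader switched to the model's wrong answer.
    strict_harm = [r for r in worsened
                   if r.model != r.truth and r.assisted == r.model]
    return len(worsened) / len(reads), len(strict_harm) / len(reads)


# Hypothetical paired reads, for illustration only.
reads = [
    PairedRead("A", "A", "A", "A"),  # unchanged and correct
    PairedRead("A", "A", "B", "B"),  # worsened, adopted the model's error
    PairedRead("A", "A", "A", "C"),  # worsened, but not the model's error
    PairedRead("A", "B", "A", "A"),  # improved with assistance
]
print(automation_bias_rates(reads))
```

For scale, 0.5% and 0.1% of 4,928 observations correspond to about 25 and about 5 observations respectively, so each such case can realistically be reviewed one by one.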
In practice, the best place for this kind of tool may be triage, prioritization, and detection of cases that deserve a second review, not independent diagnosis. A laboratory considering this type of solution needs internal testing on its own archive, then silent running, then comparison with pathologist diagnoses and downstream test results before the tool enters the clinical workflow.
What does this mean for the laboratory?
PulmoFoundation is an example of a more mature direction in digital pathology model evaluation: multiple tasks, external cohorts, prospective validation, and a reader study. That does not make the model ready for every laboratory. It makes the paper useful for defining the evaluation standards we should ask of any vendor or research team: performance by specimen type, prespecified operating thresholds, error analysis, measurement of time impact, and monitoring for bias toward the automated result.
Paper source: arXiv:2605.25878