I just asked ChatGPT the same question, then followed up by asking: How strong and capable is the current ChatGPT model at assessing how rare the insulin test results were, and at accounting for things that may have affected the results?
Good question — and it gets right to the limits of what a model like this can and can’t do.
Short answer
This model is good at explaining medical logic and probability in general terms, but it is not capable of independently determining how rare a specific test result was in a rigorous scientific sense, nor can it fully account for all lab-specific variables the way a forensic endocrinologist or toxicologist could.
What it can do well
1) Explain the biology accurately
It can reliably describe things like:
- how insulin and C-peptide normally behave
- what patterns are expected in endogenous vs exogenous insulin
- why certain combinations are considered suspicious in clinical toxicology
That part is grounded in well-established physiology.
2) Summarise how experts interpret evidence
It can outline:
- why courts or expert witnesses considered a result unusual
- what arguments were made on both sides in a case like Lucy Letby’s
- what kinds of methodological criticisms are commonly raised (e.g., assay limitations, sample handling, neonatal variability)
3) Compare plausible explanations
It can reasonably evaluate:
- “this pattern fits A vs B vs C explanations”
- which explanations are more or less consistent with medical literature generally
- where uncertainty exists in interpretation
What it cannot do reliably
1) It cannot calculate true “rarity” from real-world datasets
To actually say “how rare this is,” you would need:
- large neonatal ICU datasets
- validated distributions of insulin/C-peptide under different conditions
- lab-specific assay performance data
- case-control comparisons
I don’t have access to raw clinical data of that kind, and even in real life it often doesn’t exist in a clean, publishable form for exactly this scenario.
So any “rarity” estimate from me would be inference, not measurement.
2) It cannot verify lab artefacts in a specific case
In the Letby insulin issue, important technical factors include:
- which immunoassay was used
- cross-reactivity and antibody interference
- sample timing and storage
- neonatal blood volume constraints
- lab reference ranges for preterm infants
A model like this cannot inspect those original lab conditions or validate whether they affected the result.
3) It cannot resolve contested expert disagreement
In cases like this, experts may disagree because:
- they weight lab uncertainty differently
- they interpret “extreme rarity” vs “possible artefact” differently
- they may have access to different subsets of data
I can summarise both positions, but I cannot adjudicate which expert is correct.
So how strong is it, realistically?
A fair way to think about it:
Strong at conceptual reasoning (high reliability):
- physiology
- mechanism-based interpretation
- explaining why evidence is considered significant
Moderate at evidential synthesis:
- summarising competing arguments
- identifying potential confounders
Weak at forensic quantification:
- exact rarity estimates
- lab error probability in a specific real case
- definitive conclusions about causation
Bottom line
It’s best seen as a very capable explainer and structured reasoning tool, not a forensic calculator or independent expert witness.