Interesting observations about AI and large language models (LLMs) in social care.
The researchers took more than 600 real case notes and varied only the sex of the person at the heart of them, then used a model already in use by more than half of England's councils to generate an assessment for an adult social care support package.
When the person was "Mr Smith," the AI summarised him as an "84-year-old man who lives alone and has a complex medical history, no care package and poor mobility." With the sex, title and pronouns switched, "Mrs Smith" became an "84-year-old living alone. Despite her limitations, she is independent and able to maintain her personal care."
I’ve seen a fair amount of public discussion saying the AI is correcting for bias: that identical facts reflect different needs because ‘men at any age are less competent at self care’, so, holding everything else equal, men are more likely to need a care package than women. Presumably, then, those commenters would think the other LLMs (see below) are unfairly biased.
This is not an intellectual exercise. These systems are influencing or, in some cases, deciding who is eligible for social care support, how quickly, and how intensively (two visits a day, or four?).
Just describing someone as "coping" rather than "struggling" shifts the outcome. Men’s reports and experiences were framed in terms of difficulty, women in terms of self-reliance.
I’m highlighting the researchers’ findings for one model. The other LLMs tested displayed less bias, or no discernible bias, but Google’s Gemma is a popular LLM.
Gemma’s male summaries were generally more negative in sentiment, and certain themes, such as physical health and mental health, were more frequently highlighted for men. The language used by Gemma for men was often more direct, while more euphemistic language was used for women. In the Gemma summaries, women’s health issues appeared less severe than men’s and details of women’s needs were sometimes omitted. Workers reading such summaries might assess women’s care needs differently from those of otherwise identical men, based on gender rather than need. As care services are awarded based on need, this could impact allocation decisions.
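To make the method concrete: the study's core test is counterfactual, comparing summaries of the same record with only the gender swapped. Below is a minimal sketch of that kind of probe in Python, not the authors' actual pipeline: `summarise` is a placeholder for whichever model is under evaluation, the word-swap table is deliberately crude, and an off-the-shelf sentiment classifier stands in for the paper's sentiment analysis.

```python
# Minimal sketch of a counterfactual gender-swap probe, loosely modelled on
# the study's design (same record, gender swapped, summaries compared).
# Assumptions: `summarise` is a placeholder for the model under evaluation,
# SWAPS is a deliberately crude mapping, and Hugging Face's default
# sentiment pipeline stands in for the paper's sentiment analysis.
import re
from transformers import pipeline

SWAPS = {
    "mr": "mrs", "mrs": "mr",
    "he": "she", "she": "he",
    "him": "her",
    "his": "her",
    "her": "his",  # ambiguous: "her" may mean "him" or "his"; real work needs a parser
    "man": "woman", "woman": "man",
}

def gender_swap(text: str) -> str:
    """Swap gendered titles and pronouns, preserving capitalisation."""
    def repl(m: re.Match) -> str:
        word = m.group(0)
        swapped = SWAPS[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    pattern = r"\b(" + "|".join(SWAPS) + r")\b"
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)

def summarise(case_note: str) -> str:
    """Placeholder: call the LLM being evaluated here."""
    raise NotImplementedError

sentiment = pipeline("sentiment-analysis")  # off-the-shelf classifier

def probe(case_note: str) -> dict:
    """Summarise both gendered versions of a note and score the sentiment."""
    original = summarise(case_note)
    swapped = summarise(gender_swap(case_note))
    return {
        "original": sentiment(original)[0],  # {"label": ..., "score": ...}
        "swapped": sentiment(swapped)[0],
    }
```

Run over hundreds of notes, it is the systematic difference between the two conditions, not any single pair, that would indicate bias.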
Without formal evaluations, we don’t know what training people receive, what the governance looks like, or what safeguards are in place.
Copied from the lead author’s discussion:
Rickman, S. Evaluating gender bias in large language models in long-term care. BMC Med Inform Decis Mak 25, 274 (2025). https://doi.org/10.1186/s12911-025-03118-0
As AI is rolled out in areas that affect people's lives - like social care, housing and criminal justice - how do we know we're using the right models?
In this study, we evaluated gender bias in Google and Meta's LLMs for summarising pseudonymised, gender-swapped care records. While Meta's model showed no measurable bias, Google's was more likely to mention physical and mental health needs if they were men's - even when women had the same conditions.
These differences matter because care goes to those who appear in greatest need. If women's needs are downplayed, will they receive the same level of care and support?
The research also suggests that AI bias isn't inevitable - it varies between models. Social care, health, and other public services face real challenges with documentation, and AI can help address them. But those benefits come with risks. Evaluation is essential to ensure we choose the best tools for the job.
Guardian coverage: https://www.theguardian.com/technology/2025/aug/11/ai-tools-used-by-english-councils-downplay-womens-health-issues-study-finds
Open access paper: https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-025-03118-0