MNHQ

Why we're taking legal action against Open AI and other scrapers

134 replies

JustineMumsnet · 19/07/2024 09:46

Hi all - you may have noticed this piece (https://www.thetimes.com/uk/technology-uk/article/mumsnet-openai-sues-copyright-infringement-cz5hzvf8s) in the Times today and I wanted to explain why we're doing this.

Earlier this year, we became aware that OpenAI was scraping Mumsnet - presumably to train their large language model (LLM). Such scraping without permission is an explicit breach of our terms of use, which clearly state that no part of the site may be distributed, scraped or copied for any purpose without our express approval. So we approached OpenAI and suggested they might like to license our content. In truth there are some very good reasons why the LLMs should ingest our conversational data to train their models. The six billion plus words on Mumsnet are a unique record of twenty-four years of female conversation about everything from global politics to fashion to relationships with in-laws. By contrast the majority of the content on the web was written by and for men. AI models have misogyny baked in and we’d love to help counter the gender bias likely to be present in many of them and raise women’s voices. Their response was that they were more interested in datasets that are not easily accessible online.

Much of the content on the open web is likewise being lifted. Mustafa Suleyman, CEO of Microsoft AI, pronounced only two weeks ago that machine-learning companies are perfectly within their rights to scrape content published online because the moment it’s published it becomes ‘freeware’.

You might ask why the theft of online content for model-training poses a problem - hasn’t Google been crawling all over websites and ingesting their data for search purposes since the dawn of the internet? True, but there is a clear value exchange in allowing Google to access that data, namely the resulting search traffic that comes from being indexed by Google. The LLMs are building models like ChatGPT to provide the answers to any and all prospective questions that will mean we’ll no longer need to go elsewhere for solutions. And they’re building those models with scraped content from the websites they are poised to replace.

At Mumsnet we’re in a stronger position than most because much of our traffic comes to us direct. Though it’s a piece of cake for an LLM to spit out a Mumsnet-style answer to a parenting question, I doubt they’ll ever be as funny about parking wars or as honest about relationships, and they’ll certainly never provide the emotional support that sees around a thousand women a year helped to leave abusive partners by other Mumsnet users. But if these trillion-dollar giants are simply allowed to pillage content from online publishers - and get away with it - they will destroy many of them.

Not surprisingly, a number of large, global publishers are currently suing OpenAI and Microsoft for copyright infringement. Here at Mumsnet, though we’re neither large (in revenue terms) nor global, we’ve decided we have no choice but to initiate a legal complaint too.

That’s not to say that A.I. is all bad of course. It plainly has the potential to advance human progress and improve our lives in multiple ways. But if the LLMs are allowed to simply steal content from publishers and communities like Mumsnet they risk destroying them. Everything that’s unique and brilliant about sites like ours will be lost, and a handful of Silicon Valley giants will be left with even more control over the world’s content and commerce.

We know that taking on a multinational giant like OpenAI, with its $3bn of revenues, is not an easy task in the face of the huge resources they’ll throw at us but this is too important an issue to simply roll over. Not just for Mumsnet but for every website you’ve ever landed on for news, advice or simply to ask if you’re being unreasonable.

Mumsnet launches first British legal action against OpenAI

Parenting website accuses the California tech giant of scraping six billion words from it to help build the chatbot ChatGPT

https://www.thetimes.com/uk/technology-uk/article/mumsnet-openai-sues-copyright-infringement-cz5hzvf8s


ArabellaScott · 20/07/2024 11:05

dieselKiller · 20/07/2024 11:03

Could you give a brief summary of the Aston thing? I’m interested, but short of time. I think I read someone suggest that the US govt funded de-anonymisation research using mumsnet data. Is that correct?


Aston scraped the whole site and used parts of it as a 'sandbox' to experiment with and test analytics models on, specifically to see whether they could identify users by posting style across different sites.

They downloaded the whole site onto an air gapped server, despite this being explicitly forbidden in MNs T&Cs, to create their corpus.

A PhD student then proposed a study on 'transphobia' on MN. This alerted MN users to the existence of the corpus.

Corpus has since been mostly deleted, still a couple of lesser scrapes to go, lawyers in discussion.

Briefly.

ArabellaScott · 20/07/2024 11:06

Sorry, and yes, the US govt have funded some of Aston's research.

Amazinggrace842 · 20/07/2024 12:41

Sethera · 20/07/2024 09:44

humans can be malicious, mistaken, or overconfident, but we do have a concept of reality and the laws of physics and social obligation and logic and … which LLMs do not.

Co-pilot is overconfident in my experience. It won't admit it doesn't know something or can't do something, it just keeps returning rubbish and when you point out what it's doing, it apologises meaninglessly and does the same again.

Copilot just like a typical man then! 😂

allowedtochoosewhosepartytogoto · 20/07/2024 15:11

There are a number of issues with scraping.

If an abused woman - in desperation - posts on here, receives help, then realises that her posts mean she could be identified in real life, she will often ask for her posts to be retrospectively deleted. Which MN do.

If a scrape has happened before that, the identifying, highly sensitive personal data is not deleted from the scraped copy that is held. So there's an issue there. It's not academic: lots of women are murdered and injured every year by violent partners, who also often isolate them successfully from real world friends and family.

MN rightly are proud of the number of abused women who are helped to leave their partners every year by MN. So these issues do need to be considered.

Another issue is that Aston (and presumably others) have linguistic models which seek to identify which posts across usernames and threads are by the same person and identify that person IRL. I can't confidently assert that this sort of technology looking at all my posts on MN across many years and usernames wouldn't be able to figure out who I am in real life.

Aston claim they haven't done this to MN data scrapes, and also that they'd only do it - because they brag they can do it - in cases of crime. But their staff also think women knowing what biological reality is is a crime. So we get into a very sticky area where they are judge and jury. It's all a bit 1984.

It is possible for MN to want to remain a viable, profitable business and also to want to help abused women seek help as an ethical position. Charities have to raise money to do good. The two positions are not mutually exclusive but - this being a women centred site - I do think consent is important, and so is ensuring those using the data are not doing so to harm women. The latter clearly isn't the case with Aston, who started the PhD using the scraped data from the premise that posters on MN were bigoted. They lumped Mums on MN in with other sites they've scraped specifically to target illegal terrorist activity. They are not neutral.

I've read all the T&Cs and I accept that MN can make money from my data, mostly by selling me stuff, but I also trust that MN will not seek to identify me - it's not in their interests to do so. I manage my household's entire budget and I do buy things recommended on MN. I spend thousands and thousands every year on feeding, clothing and activities for my children. I often search MN for others' opinions before buying things.

With the Aston scrapers, there seems to be some zealotry and McCarthyist thinking going on. So I am very uncomfortable with Aston holding my MN data. There's a big, big difference between data being used in limited contexts and the entirety of all the posts being put through some big computer system to identify who you are in real life. I don't trust Aston not to do this, so am glad MN have involved lawyers.

allowedtochoosewhosepartytogoto · 20/07/2024 15:15

As far as Aston go, what they did is illegal. It's against T&Cs. The issue is how well they will be enforced to delete the illegal data and how they will be punished. Because if they're not punished, they'll do it again, as will many others.

And eventually this will not be a site that women can turn to for help, and it will not be a profitable business.

ArabellaScott · 20/07/2024 16:07

https://ico.org.uk/about-the-ico/media-centre/news-and-blogs/2023/08/joint-statement-on-data-scraping-and-data-protection/

ICO on scraping.

'Scraping from social media creates privacy risks and potential harms, such as the information people post online being used for reasons they don’t expect, exploited in cyberattacks or used for identity fraud.

Social media companies and the operators of websites that host publicly accessible personal data have obligations under data protection and privacy laws to protect personal information on their platforms from unlawful data scraping.

Mass data scraping incidents that harvest personal information can constitute reportable data breaches in many jurisdictions'

Joint statement on data scraping and data protection

The Information Commissioner’s Office and eleven other data protection and privacy authorities from around the world have today published a joint statement calling for the protection of people’s personal data from unlawful data scraping taking place on...

https://ico.org.uk/about-the-ico/media-centre/news-and-blogs/2023/08/joint-statement-on-data-scraping-and-data-protection

HaveToSaySomethingHere · 20/07/2024 18:26

Fair play to you mumsnet!

Boiledbeetle · 20/07/2024 21:07

AstonUniDataScraperWankers · 20/07/2024 05:50

And a whole range of merch!
grimut-gerbils.teemill.com/


It's wonderful stuff! A very eclectic range!


Deuteragonist · 20/04/2025 19:25

@JustineMumsnet
Sorry I'm late to this topic, but I'm new here and only just found it. I just wanted to reply because I've been involved in something similar via my own day-job, from which I've learned a bit about "AI ethics" (or the lack of it).

The Aston example is relevant, as it's fairly consistent with how many of the AI providers operate. That is, the original AI product is developed in some university or research lab, which then scrapes content from loads of websites (like MumsNet) "for research purposes", which makes it sound like non-commercial use. Excusable so far?

But then they license their product to a major player (like Microsoft, OpenAI etc), who use the scraped content in their commercial AI offerings. This is not only a fairly blatant breach of copyright, it's also a breach of Data Governance rules. Within the Data Governance community, I've seen AI described as the "Fruit of a poisoned tree".

Many major professional standards organisations (e.g. BCIM, IEEE, Chartered accountants, Lawyers, DAMA) now have a Code of Ethics that means abuses of AI should be called-out.

Some website copyright boilerplate now adds "Not for use in AI or Large Language Models". But enforcing that is difficult if the AI scrapers ignore it. Most websites have a file called robots.txt, which is intended to tell bots and scrapers what they may not access. But OpenAI has been ignoring that too.
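For anyone curious what a robots.txt rule looks like in practice, here's a rough sketch using Python's standard `urllib.robotparser`. The rules shown are a made-up example (not Mumsnet's actual file), `is_allowed` and the example.com URL are my own placeholders, and `GPTBot` is the user-agent name OpenAI publishes for its crawler. It only shows how a well-behaved bot is supposed to check the file; nothing in it forces a scraper to obey.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: refuse OpenAI's published crawler name,
# allow every other user-agent. Not taken from any real site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    thread_url = "https://www.example.com/talk/some-thread"  # placeholder URL
    print("GPTBot allowed:", is_allowed("GPTBot", thread_url))
    print("SomeOtherBot allowed:", is_allowed("SomeOtherBot", thread_url))
```

The check is run by the crawler itself, which is why robots.txt is only a polite request: a scraper that never calls anything like `can_fetch` is not stopped by it.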

It needs brave people - like MumsNet - to challenge them - thank you! 😀
