
Why we're taking legal action against OpenAI and other scrapers

134 replies

JustineMumsnet · 19/07/2024 09:46

Hi all - you may have noticed this piece (https://www.thetimes.com/uk/technology-uk/article/mumsnet-openai-sues-copyright-infringement-cz5hzvf8s) in the Times today and I wanted to explain why we're doing this.

Earlier this year, we became aware that OpenAI was scraping Mumsnet - presumably to train their large language model (LLM). Such scraping without permission is an explicit breach of our terms of use, which clearly state that no part of the site may be distributed, scraped or copied for any purpose without our express approval. So we approached OpenAI and suggested they might like to license our content. In truth there are some very good reasons why the LLMs should ingest our conversational data to train their models: the six billion plus words on Mumsnet are a unique record of twenty-four years of female conversation about everything from global politics to fashion to relationships with in-laws. By contrast, the majority of the content on the web was written by and for men. AI models have misogyny baked in, and we'd love to help counter the gender bias likely to be present in many of them and raise women's voices. Their response was that they were more interested in datasets that are not easily accessible online.

Much of the content on the open web is likewise being lifted. Mustafa Suleyman, CEO of Microsoft AI, pronounced only two weeks ago that machine-learning companies are perfectly within their rights to scrape content published online because the moment it's published it becomes 'freeware'.

You might ask why the theft of online content for model training poses a problem - hasn't Google been crawling all over websites and ingesting their data for search purposes since the dawn of the internet? True, but there is a clear value exchange in allowing Google to access that data, namely the search traffic that comes from being indexed by Google. The AI companies are building models like ChatGPT to answer any and all prospective questions, which will mean we'll no longer need to go elsewhere for solutions. And they're building those models with content scraped from the websites they are poised to replace.
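For sites that want search traffic but not model training, the standard (though entirely voluntary) mechanism is a robots.txt file. A minimal sketch, assuming the publicly documented crawler tokens GPTBot (OpenAI's training crawler) and Googlebot (Google's search crawler) are the ones in question:

```
# robots.txt - allow search indexing, refuse the AI-training crawler
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /
```

Compliance depends entirely on the crawler choosing to honour the file; crawlers that ignore robots.txt, or that scraped before announcing a token, are untouched by it - which is part of why publishers are turning to terms-of-use and copyright claims instead.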

At Mumsnet we're in a stronger position than most because much of our traffic comes to us direct, and though it's a piece of cake for an LLM to spit out a Mumsnet-style answer to a parenting question, I doubt they'll ever be as funny about parking wars or as honest about relationships, and they'll certainly never provide the emotional support that sees around a thousand women a year helped to leave abusive partners by other Mumsnet users. But if these trillion-dollar giants are simply allowed to pillage content from online publishers - and get away with it - they will destroy many of them.

Not surprisingly, a number of large, global publishers are currently suing OpenAI and Microsoft for copyright infringement, and here at Mumsnet - though we're neither large (in revenue terms) nor global - we've decided we have no choice but to initiate a legal complaint too.

That's not to say that AI is all bad, of course. It plainly has the potential to advance human progress and improve our lives in multiple ways. But if the LLMs are allowed to simply steal content from publishers and communities like Mumsnet, they risk destroying them. Everything that's unique and brilliant about sites like ours will be lost, and a handful of Silicon Valley giants will be left with even more control over the world's content and commerce.

We know that taking on a multinational giant like OpenAI, with its $3bn of revenues, is not an easy task in the face of the huge resources they'll throw at us, but this is too important an issue to simply roll over on - not just for Mumsnet, but for every website you've ever landed on for news, advice or simply to ask if you're being unreasonable.

Mumsnet launches first British legal action against OpenAI

Parenting website accuses the California tech giant of scraping six billion words from it to help build the chatbot ChatGPT

https://www.thetimes.com/uk/technology-uk/article/mumsnet-openai-sues-copyright-infringement-cz5hzvf8s

AstonToTheNaughtyStep · 19/07/2024 23:29

Ereshkigalangcleg · 19/07/2024 22:48

People who are debating the merits of LLMs should read the Aston thread, and see just how our posts could potentially be misused.

Aw, Aston University, the inspiration for quite a few name changes.

AstonUniDataScraperWankers · 20/07/2024 05:50

AstonToTheNaughtyStep · 19/07/2024 23:29

Aw, Aston University, the inspiration for quite a few name changes.

And a whole range of merch!
grimut-gerbils.teemill.com/

ThisOldThang · 20/07/2024 05:58

I think there is also the risk of AI being able to match up usernames with real life people based upon writing style and content.

People share some very private details on this site and need to be able to do that anonymously.

Thesquarerootofnotgivingafuck · 20/07/2024 06:23

ThisOldThang · 20/07/2024 05:58

I think there is also the risk of AI being able to match up usernames with real life people based upon writing style and content.

People share some very private details on this site and need to be able to do that anonymously.

This was one of the aims of the Aston researchers who scraped the entirety of Mumsnet without permission, to use as a sandbox for their AI to "play in".
Think about that for a moment: we know that they specifically used the infertility board (women sharing their struggles to get pregnant, miscarriage etc) as a toy! They also used the LGBTQ Children board for the same purpose.

Thesquarerootofnotgivingafuck · 20/07/2024 06:27

And I don't know whether people who frequent those boards have been told this yet, perhaps @JustineMumsnet could clarify?

Timebox · 20/07/2024 06:38

I don't want my Mumsnet ramblings scraped by anyone thank you, even if it is to 'make women's voices be heard' ... really??!!

If you're going to license my words, just be upfront about cashing in to make money rather than pretending you're addressing some gender imbalance for the goodness of it.

I suppose I should have read the small print - or maybe you should be more upfront and explain this more visibly - but this might be my last contribution. It annoys me to think my words can be sold off to ANYONE.

moderate · 20/07/2024 08:13

dieselKiller · 19/07/2024 21:55

Why give defective tools a veneer of quality so that more people are fooled by them?

The attitude that I want to shape is that LLMs are dangerously defective by design.

Apparently ChatGPT already contains mumsnet content? It hasn’t fixed its flaws.

People are going to use these systems whether you like it or not. In my opinion you're overstating the case against them. Not knowing why you believe something is really rather a human quality.

dieselKiller · 20/07/2024 09:10

moderate · 20/07/2024 08:13

People are going to use these systems whether you like it or not. In my opinion you're overstating the case against them. Not knowing why you believe something is really rather a human quality.

I don’t have to support harmful products and I won’t, regardless of how & whether other people use them. I recognise that I don’t control other people’s actions and that life is imperfect, but I would like people to understand what LLMs are and hope that would discourage their use.

I think we disagree both on the value of LLMs and on the likely effects of adding additional mumsnet data to the training sets.

Could you explain what you mean by “Not knowing why you believe something is really rather a human quality”.

moderate · 20/07/2024 09:19

dieselKiller · 20/07/2024 09:10

I don’t have to support harmful products and I won’t, regardless of how & whether other people use them. I recognise that I don’t control other people’s actions and that life is imperfect, but I would like people to understand what LLMs are and hope that would discourage their use.

I think we disagree both on the value of LLMs and on the likely effects of adding additional mumsnet data to the training sets.

Could you explain what you mean by “Not knowing why you believe something is really rather a human quality”.

One of your two main complaints against LLMs was that they have no concept of correctness. They can sound authoritative on a subject but there might not be any justified true belief behind it.

Rather like a human.

People know the environmental arguments against aviation and private transport too but it really doesn’t stop them. You’ve listed the negatives but for most people the positives will outweigh them — especially as the technology improves (introspection is in the pipeline).

moderate · 20/07/2024 09:35

dieselKiller · 20/07/2024 09:10

I don’t have to support harmful products and I won’t, regardless of how & whether other people use them. I recognise that I don’t control other people’s actions and that life is imperfect, but I would like people to understand what LLMs are and hope that would discourage their use.

I think we disagree both on the value of LLMs and on the likely effects of adding additional mumsnet data to the training sets.

Could you explain what you mean by “Not knowing why you believe something is really rather a human quality”.

In order to discuss further whether or not adding Mumsnet data to the training set would make any tangible difference, I needed the term "Sorites paradox", which had temporarily escaped me. I entered a loose description into ChatGPT and it immediately returned the name I had not been able to call to mind. I could maybe have got this answer through a traditional search engine, but I knew an LLM would definitely return it.

Now consider the effect of including vs excluding thousands of posts drawing a distinction between “gender critical” and “transphobic”.

dieselKiller · 20/07/2024 09:36

moderate · 20/07/2024 09:19

One of your two main complaints against LLMs was that they have no concept of correctness. They can sound authoritative on a subject but there might not be any justified true belief behind it.

Rather like a human.

People know the environmental arguments against aviation and private transport too but it really doesn’t stop them. You’ve listed the negatives but for most people the positives will outweigh them — especially as the technology improves (introspection is in the pipeline).

I see what you're saying, but the way LLMs fail is not at all like a human. Humans can be malicious, mistaken, or overconfident, but we do have a concept of reality and the laws of physics and social obligation and logic and … which LLMs do not.

Besides, if I introduced you to my friend and explained that she is frequently malicious, mistaken, and overconfident and suggested you take important advice from her, you just wouldn’t, would you?

“Here’s my friend. Not only can’t she do simple arithmetic, she doesn’t even have the concept of addition. She doesn’t know what a percentage is. Go ahead and ask her questions about your taxes.”

moderate · 20/07/2024 09:40

dieselKiller · 20/07/2024 09:36

I see what you're saying, but the way LLMs fail is not at all like a human. Humans can be malicious, mistaken, or overconfident, but we do have a concept of reality and the laws of physics and social obligation and logic and … which LLMs do not.

Besides, if I introduced you to my friend and explained that she is frequently malicious, mistaken, and overconfident and suggested you take important advice from her, you just wouldn’t, would you?

“Here’s my friend. Not only can’t she do simple arithmetic, she doesn’t even have the concept of addition. She doesn’t know what a percentage is. Go ahead and ask her questions about your taxes.”

Everyone should always be considering the fitness-for-purpose of any source they use. My calculator is considerably better than my accountant at multiplying large numbers together, but I won’t be asking it for tax advice.

Sethera · 20/07/2024 09:44

humans can be malicious, mistaken, or overconfident, but we do have a concept of reality and the laws of physics and social obligation and logic and … which LLMs do not.

Copilot is overconfident in my experience. It won't admit it doesn't know something or can't do something; it just keeps returning rubbish, and when you point out what it's doing, it apologises meaninglessly and does the same again.

dieselKiller · 20/07/2024 09:44

moderate · 20/07/2024 09:40

Everyone should always be considering the fitness-for-purpose of any source they use. My calculator is considerably better than my accountant at multiplying large numbers together, but I won’t be asking it for tax advice.

Certainly. And fortunately your calculator doesn’t claim to offer tax advice, does not respond when asked tax advice, and people are not confused about whether it offers tax advice.

moderate · 20/07/2024 09:57

dieselKiller · 20/07/2024 09:44

Certainly. And fortunately your calculator doesn’t claim to offer tax advice, does not respond when asked tax advice, and people are not confused about whether it offers tax advice.

But making that my metric would lull me into a false sense of security, because when I meet someone who does claim to know about tax, or read something that someone has written about tax, I still have to consider the veracity of the source.

dieselKiller · 20/07/2024 10:20

moderate · 20/07/2024 09:35

In order to discuss further whether or not adding Mumsnet data to the training set would make any tangible difference, I needed the term "Sorites paradox", which had temporarily escaped me. I entered a loose description into ChatGPT and it immediately returned the name I had not been able to call to mind. I could maybe have got this answer through a traditional search engine, but I knew an LLM would definitely return it.

Now consider the effect of including vs excluding thousands of posts drawing a distinction between “gender critical” and “transphobic”.

Finding words associated with other words is something that LLMs could be said to be designed for. It’s an electricity-inefficient way of doing the search, but it worked for you.

Drawing fine distinctions between the meanings of words is a less reliable use for LLMs. In order to get a response, the user has to provide some input. That input primes the response in unpredictable ways. (Consider that men and women may ask the same question in different ways for example).

You may be able to win an ideological battle through LLM training data, but victory is far from certain. You’re as likely to provide cover for your ideological opponents. Victory is only guaranteed if you have overwhelming numbers in both question space & response space (or you control the entire system). The way that LLMs have been designed to “both sides” their responses is typically a loss for non-extremists and non-majority positions.

Since you think LLM use is inevitable and you presumably write about topics that you feel are underrepresented in the datasets, and you think that adding that data will translate to a positive effect, of course it makes sense that you would opt your own writing in to the dataset.

I think the better option is to starve the LLMs of money, data, and users, and I’d be happy to have that option for the content I write.

dieselKiller · 20/07/2024 10:27

moderate · 20/07/2024 09:57

But making that my metric would lull me into a false sense of security, because when I meet someone who does claim to know about tax, or read something that someone has written about tax, I still have to consider the veracity of the source.

Yes, the “certainly” applied to always having to consider your tools and sources. I’m not suggesting that people stop thinking. Quite the opposite, I’m encouraging people to understand what LLMs do, how they work, and when it is appropriate to use them. (The answer being: “almost never”).

moderate · 20/07/2024 10:29

dieselKiller · 20/07/2024 10:20

Finding words associated with other words is something that LLMs could be said to be designed for. It’s an electricity-inefficient way of doing the search, but it worked for you.

Drawing fine distinctions between the meanings of words is a less reliable use for LLMs. In order to get a response, the user has to provide some input. That input primes the response in unpredictable ways. (Consider that men and women may ask the same question in different ways for example).

You may be able to win an ideological battle through LLM training data, but victory is far from certain. You’re as likely to provide cover for your ideological opponents. Victory is only guaranteed if you have overwhelming numbers in both question space & response space (or you control the entire system). The way that LLMs have been designed to “both sides” their responses is typically a loss for non-extremists and non-majority positions.

Since you think LLM use is inevitable and you presumably write about topics that you feel are underrepresented in the datasets, and you think that adding that data will translate to a positive effect, of course it makes sense that you would opt your own writing in to the dataset.

I think the better option is to starve the LLMs of money, data, and users, and I’d be happy to have that option for the content I write.

In short, you believe your contribution to be too small to make a difference to the behaviour of an LLM, but large enough to form the tipping point in a boycott.

I remain unconvinced.

dieselKiller · 20/07/2024 10:38

moderate · 20/07/2024 10:29

In short, you believe your contribution to be too small to make a difference to the behaviour of an LLM, but large enough to form the tipping point in a boycott.

I remain unconvinced.

What are you unconvinced by?

That I should be able to decide whether the mumsnet content that I write is included in LLM training sets?

I haven’t said that my content would be too small to make a difference. I have said that the difference that any particular content makes is unpredictable.

I’m also not organising a boycott.

ArabellaScott · 20/07/2024 10:38

moderate · 19/07/2024 21:45

Legally, we’ve given up all our rights to Mumsnet HQ. They can do whatever they like with our posts, subject to statutory rights.

I just don’t understand why people want to withhold what they’ve publicly posted. It’s excluding yourself from helping to shape attitudes.

No, we haven't. Users retain copyright, last time I checked the T&Cs.

ArabellaScott · 20/07/2024 10:40

Yep, we retain copyright, MN license it:

'By submitting User Content to us, simultaneously with such posting you automatically grant to us a worldwide, fully-paid, royalty-free, perpetual, irrevocable, non-exclusive, fully sublicensable, and transferable right and license to use, record, sell, lease, reproduce, distribute, create derivative works based upon (including, without limitation, translations), publicly display, publicly perform, transmit, publish and otherwise exploit the User Content (in whole or in part) as Mumsnet, in its sole discretion, deems appropriate. We may exercise this grant in any format, media or technology now known or later developed for the full term of any copyright that may exist in such User Content.
Subject to the rights and license you grant to us under these Terms of Use, you retain all your right, title and interest in your User Content submissions. This means that copyright in your User Content will remain with you and that you can continue to use the material in any way, including allowing others to use it.'

ArabellaScott · 20/07/2024 10:43

There is this, though:

'you waive any and all claims you may now or later have in any jurisdiction to so-called "moral rights" or rights of "droit moral" with respect to the User Content.'

Rights and licensing can get very, very complicated.
https://www.lawbite.co.uk/resources/blog/moral-rights

What are moral rights and how do they work?

Whilst copyright is concerned with the economics of owning creative work, moral rights are focused on the ethical right to be acknowledged as a creator.


moderate · 20/07/2024 10:46

ArabellaScott · 20/07/2024 10:38

No, we haven't. Users retain copyright, last time I checked the T&Cs.

True, but that just means you can still use it yourself and permit others to do so too. Mumsnet can still do whatever they like with it.

> By submitting User Content to us, simultaneously with such posting you automatically grant to us a worldwide, fully-paid, royalty-free, perpetual, irrevocable, non-exclusive, fully sublicensable, and transferable right and license to use, record, sell, lease, reproduce, distribute, create derivative works based upon (including, without limitation, translations), publicly display, publicly perform, transmit, publish and otherwise exploit the User Content (in whole or in part) as Mumsnet, in its sole discretion, deems appropriate. We may exercise this grant in any format, media or technology now known or later developed for the full term of any copyright that may exist in such User Content.

> Subject to the rights and license you grant to us under these Terms of Use, you retain all your right, title and interest in your User Content submissions. This means that copyright in your User Content will remain with you and that you can continue to use the material in any way, including allowing others to use it.

moderate · 20/07/2024 10:59

dieselKiller · 20/07/2024 10:38

What are you unconvinced by?

That I should be able to decide whether the mumsnet content that I write is included in LLM training sets?

I haven’t said that my content would be too small to make a difference. I have said that the difference that any particular content makes is unpredictable.

I’m also not organising a boycott.

Unconvinced by your “better option”.

dieselKiller · 20/07/2024 11:03

Ereshkigalangcleg · 19/07/2024 22:48

People who are debating the merits of LLMs should read the Aston thread, and see just how our posts could potentially be misused.

Could you give a brief summary of the Aston thing? I’m interested, but short of time. I think I read someone suggest that the US govt funded de-anonymisation research using mumsnet data. Is that correct?
