Mumsnet Corpus

1000 replies

TokyoBouncyBall · 19/04/2024 11:36

Not a TAAT, but a bit of googling as a result of a now deleted thread has led me to this:

https://fold.aston.ac.uk/handle/123456789/18

I note it says that the License is uncertain. Can you confirm that you have given permission for posts to be used in this way, or is there something that Aston might like to look into?

I note it says Users who wish to access this dataset must make a detailed application to FoLD and the researcher, as well as potentially gain additional agreement from an external organisation before they can be approved for access.

Given one of the uses it is being put to, I think it is a bit dubious to say the least.

MaiseeBee · 02/05/2024 21:41

Great to finally have an update from MN but it looks like this could be the tip of the iceberg. Manchester Uni seems to have 10 years' worth of MN data and they say it's available "on reasonable request".

https://link.springer.com/article/10.1007/s13278-023-01155-z

Newcastle Uni have 21 years of MN posts.

https://www.jmir.org/2023/1/e47849/

It took me five minutes to find those using Scopus. @JustineMumsnet Can you have a look please?

AmaryllisNightAndDay · 02/05/2024 22:07

Well well, who knew? Is large-scale linguistic data analysis just a copyright Wild West and is "they're all at it" a defence in law and in ethics? Then again, maybe Newcastle and Manchester got permission. The linked papers state that ethical approval was granted by the respective universities..... but not the basis for approval.

Time to invest in an even bigger bucket of popcorn.

ADoggyDogWorld · 02/05/2024 22:09

MaiseeBee · 02/05/2024 21:41

Great to finally have an update from MN but it looks like this could be the tip of the iceberg. Manchester Uni seems to have 10 years' worth of MN data and they say it's available "on reasonable request".

https://link.springer.com/article/10.1007/s13278-023-01155-z

Newcastle Uni have 21 years of MN posts.

https://www.jmir.org/2023/1/e47849/

It took me five minutes to find those using Scopus. @JustineMumsnet Can you have a look please?

OMG.

Codlingmoths · 02/05/2024 22:10

MaiseeBee · 02/05/2024 21:41

Great to finally have an update from MN but it looks like this could be the tip of the iceberg. Manchester Uni seems to have 10 years' worth of MN data and they say it's available "on reasonable request".

https://link.springer.com/article/10.1007/s13278-023-01155-z

Newcastle Uni have 21 years of MN posts.

https://www.jmir.org/2023/1/e47849/

It took me five minutes to find those using Scopus. @JustineMumsnet Can you have a look please?

I’m just bumping this.

VitoCorleoneOfMNMafia · 02/05/2024 22:28

ArsetonUniversity · 02/05/2024 20:19

Thanks for the update.

I'm shocked that, contrary to what I had believed and to my defence of the student, the student actually did do more scraping specifically for their research. How the fuck did that get approved?? Madness.

Same. I've suddenly lost a lot of my prior sympathy for her. And also, given that this was a second scrape, the ethics board fucked up twice.

Tell me that the ethics board holds women and our autonomy in utter contempt without telling me that the ethics board holds women and our autonomy in utter contempt.

DewinDwl · 02/05/2024 22:28

WTF

MarkMenziesFakeMugger · 02/05/2024 22:36

OMG Newcastle uni and Manchester uni as well - 30 odd years of posts between them? How can this be?

VitoCorleoneOfMNMafia · 02/05/2024 22:41

MarkMenziesFakeMugger · 02/05/2024 22:36

OMG Newcastle uni and Manchester uni as well - 30 odd years of posts between them? How can this be?

Manchester used a kind of software tool called a scraper. Which tells me that they didn't ask permission to use the posts, because if they had, Mumsnet could have supplied a database dump.

This is stated in section 3.1 of their paper.

VitoCorleoneOfMNMafia · 02/05/2024 22:47

MarkMenziesFakeMugger · 02/05/2024 22:36

OMG Newcastle uni and Manchester uni as well - 30 odd years of posts between them? How can this be?

Newcastle also used a scraper. It's stated in the Data Collection and Cleaning section of their paper.

Again, had they asked for and been granted permission to take this data, Mumsnet could have supplied a database dump.

Talulahalula · 02/05/2024 22:50

Somewhere upthread, I said that I had done a quick library search of MN to see how many articles and suchlike I can find and I found 34.

To be absolutely clear, although I have not read all of these, the ones I did look at (including the one on vaccination linked above) are not wholesale scrapes of the site. In some cases, researchers have written whole papers based on one thread; in others, the material is from specific boards or looks at specific issues, and some sections have been scraped. There are some examples linked early on in this thread or the first FWR thread and also some discussion of the ethics of this, and where it has been discussed in the papers (or not, as sometimes it is just not). I didn’t find anything (yet) which is a wholesale scraping of the site which was not linked to Aston University.

The papers I looked at set out with a defined research question and used posts to help answer it. Or in other cases, the thread inspired the paper (so for example, there was one about space and geography based on a thread about childhood memories of houses or something like that).

The conclusion I drew was that anything I post could end up being analysed in a research paper and it would be up to the moral compass of the author how ethically this was done. I have not posted on anything except this thread and the FWR thread since I realised this (and it depressed me as when I was having a hard time a few years back, I posted a lot and I have posted quite personal things, more fool me). I mean to be clear, if you put Mumsnet and Corpus into Twitter, you will find out that someone has created a corpus of posts on PND and written about that. I don’t know the ethics they went through or whether MN gave permission, and I have not read the papers to see if posters were asked permission, but it’s a bit jarring to know that people posted in dark times looking for support and then some academic is tweeting about their great paper on the subject and experience presenting it.

All that said, I still think the Aston data scrape was something else, because it was wholesale, because they created their ‘sandbox to play in’ and it has been stored and analysed in the same space as criminal online material and such like, there has been no nuanced engagement with content and social context of content or anything like that, it’s just words detached from people. I am not explaining that very well. The whole thing has depressed me.

TokyoBouncyBall · 02/05/2024 23:08

MaiseeBee · 02/05/2024 21:41

Great to finally have an update from MN but it looks like this could be the tip of the iceberg. Manchester Uni seems to have 10 years' worth of MN data and they say it's available "on reasonable request".

https://link.springer.com/article/10.1007/s13278-023-01155-z

Newcastle Uni have 21 years of MN posts.

https://www.jmir.org/2023/1/e47849/

It took me five minutes to find those using Scopus. @JustineMumsnet Can you have a look please?

Boring PSA but the best way to get MN’s attention is to report your own post. That’s what I did when I started it.

Have reported yours as I think it is potentially very important.

Bastard fuckers everywhere.

Also, there is a screen shot of Eden Palmer’s rather unfortunate LinkedIn post somewhere on one of the threads too.

TokyoBouncyBall · 02/05/2024 23:14

@MaiseeBee You are not wrong either. A cursory scan reveals nothing about ethics to boot. The lawyers are going to be busy for months to come.

DrSoupDragonsFriend · 03/05/2024 02:19

@Talulahalula , I found the post you'd mentioned about PND (Reading University) and only skimmed it but the researchers there quote MN posts verbatim - it's easy to search back to the original posters and the discussions on MN using phrases from the quoted texts. I've only skimmed the chapter but it looks like 53 extracts of women's writing about PND are discussed possibly* from three different corpuses. MN's was the biggest used with 4,778,285 words. *The others are clinician texts, and printed media. No mention of ethics, or consent.
['Using a comparative corpus-assisted approach to study health and illness discourses across domains: the case of postnatal depression (PND) in lay, medical and media texts', Karen Kinloch and Sylvia Jaworska, in Applying Linguistics in Illness and Healthcare Contexts. Zsófia Demjén (Anthology Editor). Bloomsbury Academic, 2020.]
https://centaur.reading.ac.uk/77236/3/Jawcorska%20Kinloch%20PND%20chapter_revised_FINAL.pdf

The authors reference another work of a similar ilk.

Then, more fool me, I typed 'Mumsnet scrape' into Google Scholar.
The first of a long list was the vaccine one, the second was this from a BMJ scientific meeting - this is another one that feels like it's overstepped a line in terms of the personal nature of the data taken.

‘How do I get a grip on this weight gain?’ An analysis of weight-related behaviours reported on Mumsnet by perinatal women with overweight or obesity
<...cut...>
This study aimed to explore the experiences and perceptions of weight and weight-related behaviours in pregnant and postpartum women with overweight or obesity through analysing posts on a widely used online forum (Mumsnet).
Methods Using a qualitative design, we adapted an existing method of analysing textual data on Mumsnet. Data generated between July 2021 and March 2022 were extracted using the web scraping tool Parsehub. We applied a priori inclusion and exclusion criteria to screen and identify posts for inclusion in the analysis. Specifically, posts suggestive of overweight/obesity (e.g., BMI, weight/height mentioned) and posts related to diet and physical activity were eligible for inclusion. We analysed data using thematic and content analysis, following established methods.
Results: Of 3,124 replies extracted from the Mumsnet Talk forum, 113 met the inclusion criteria and were included in the analysis. We identified six themes from the data analysis: ‘concerns surrounding overweight pregnancy’, ‘impacts of pregnancy on eating and physical activity behaviour’, ‘experiences and attitudes concerning weight change and management during the perinatal period’, ‘self-esteem during and after pregnancy’, ‘postpartum diet and personal struggles with weight’ and ‘healthcare professionals’ impact on women’. Women discussed their weight management goals throughout the perinatal period, specifically weight loss and maintenance. Concerns around overweight and obesity during the perinatal period resulted in self-directed research online, as women reported these concerns being left unaddressed by healthcare professionals. etc. etc.
https://doi.org/10.1136/jech-2023-SSMabstracts.18

I can't access the full text atm.

Like you I'm narrowing down where I'm posting and I'm going to restrict it further. It's not just all these random bits of research and scraped data taken without permission, it's where else it's being used and how easily all of this stuff will be cross-referenced in the future with the user-identification software that Aston (and no doubt many others) are working on or have already developed. I will keep posting only where I actively want my words to be read from now on.

@JustineMumsnet In my opinion, you need to be talking to all users about this. Data breaches without consent, especially about vulnerable and personal things, potentially affect us all as individuals as well as you as a business. Thanks.

AstonVillains · 03/05/2024 04:50

If the scraping is more widespread than originally thought, I think it might be necessary to make the talk section only visible to logged in members of the site. It won't stop people taking screenshots and posting on twitter, but could stop the scraping I think, although it might require some clever coding on the servers.

ArabellaScott · 03/05/2024 06:44

Oh my word. This is gross. It's the volume, the scraping.

Remember LangClag would talk about how users were the content creators...

WookeyHole · 03/05/2024 06:56

I am no expert in this area and I am as appalled as you all; it was a pretty bleak personal circumstance which first brought me to MN (under a different username, of course) and the support I received was invaluable.

However, I don't think it's right to call this a data breach, and whilst MNHQ absolutely need to put a stop to this, the precautionary tale for all of us is that anything on the internet is accessible to anyone. It doesn't make it ok to misuse, but we can't be too trusting.

BritishBeatleMania · 03/05/2024 07:10

I have been thinking the same @WookeyHole. A data breach, reportable to the ICO, would usually mean a failure of systems and processes meaning that our data has been made publicly available.

In this instance MN haven’t failed to protect the content, a third party has used data scraping to retrieve and store the information. This is contrary to current Ts and Cs. And even before that, they have breached the IP terms by taking data without consent.

What I see here is a potentially huge liability for the university who have:

  1. taken the data set without consent
  2. stored the data set beyond any reasonable use
  3. done the above without MN consent
  4. not allowed users the opportunity or right to control their data / seek deletion
  5. sought to monetise / market from this unlawfully gained data

IANAL but I reckon the legal team at Aston are very busy at the moment.

I also agree that this needs to be a site wide announcement. Anyone potentially affected needs to be notified that their content may have been used in this way.

ArabellaScott · 03/05/2024 07:18

Yes, there needs to be a site wide announcement and as many MN users and past users as possible need to be informed. The ICO needs to be informed.

Women have shared the most intimate details of their lives on here; and while most know it's a public forum, that doesn't mean anyone anticipated the possibility of software potentially being used to identify people.

Aston may deny that that's been done so far, but this is one of the activities the Institute are involved in, and it seems these Unis feel entitled to scrape, store, use and sell data as they please. Which doesn't inspire confidence.

Any mass data held by universities - or commercial companies- should be deleted if it was scraped without permission.

Whinge · 03/05/2024 07:23

I also agree that this needs to be a site wide announcement. Anyone potentially affected needs to be notified that their content may have been used in this way.

I've said this a few times on the thread, but I continue to be surprised MNHQ haven't made a site wide announcement yet. We get pop ups for webchats, pinned / sticky threads for new features, so we know it's possible.

I appreciate there are a few posts on the issue, and MNHQ are keeping users updated on different threads. But I suspect a large majority of MN users have no idea about the situation, and they deserve to know.

MarkMenziesFakeMugger · 03/05/2024 07:29

So far, so intrusive…

I’d like to know if my content has been scraped, analysed or used in any discussion / paper.

What I’m not understanding is how all of these linguists submitted these articles - without any of them asking permission from mn. And how no one in charge, no supervisor, no journal editor spotted that (mostly) women’s private thoughts had just been pilfered and used. Some people inevitably knew this had been going on.

Justine said we don’t have any reason to suspect that identification via linguistic analysis was the purpose of some of the work. But weren’t there hints of this? Just because Aston deny this - it doesn’t mean it’s true. There is the lucrative $$ American contract to consider. Aston for one have not been ethical or honest so why start now? Are they not just scrambling to avoid litigation? And trouble from the ICO?

Universities … not quite the ambassadors I once naively believed them to be.

I’m done. I’ll log out now but I’ll be keeping a close eye on this thread. Huge thanks to those beady eyed researchers who first noticed the data mining.

Presumably there will also be an announcement soon too. So I’ll look out for that as well. Bye ladies.

Talulahalula · 03/05/2024 07:31

DrSoupDragonsFriend thank you. I think I will go back and start looking systematically at what has been done with MN data, and what level of permissions/consent/ethics are mentioned. What I did was quite impressionistic and looking specifically for the Aston datascrape.

There is something I am thinking through but cannot quite articulate yet which is about the historical importance of women’s communities as support and activism to improve women’s lives. Mumsnet is the online version of that for the current day. So in a sense, not posting or being circumspect harms women themselves. But we then also need to weigh up whether we want our discussions to be used by unethical researchers (or the press) rather than archived for the use of historians later (in the case of campaign groups or collectives or co-operatives). Archiving would be a conscious action saying, we want people to know what we did and said. (So here I am thinking of for example, the Women’s Co-operative Guild which has archived papers, rather than a housing community where the women had coffee weekly, but MN provides both the activism and the social).

Which leads me to my next thought, which is how is MN archived? I seem to remember that it is one of the sites which is formally archived but I might be making that up, and I don’t remember if that involves the talk forums.
Checked, and I was not making that up: the British Library did a ‘web harvest’ in 2013 and maybe has done one regularly since.

https://www.standard.co.uk/panewsfeeds/british-library-begins-web-harvest-8560972.html

https://www.nls.uk/guides/publishers/web-harvesting/

The last link is to the National Library of Scotland (could not get the British Library page to load; that site was hacked in November I think and I'm not sure if it is all up and running again). The NLS page does have advice on what to do if you don’t want their web crawler to visit your site (only read it quickly, need to get on with the day).
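
For anyone wondering what that kind of crawler opt-out usually looks like: it is normally done with a robots.txt file on the site. The sketch below is only an illustration under assumptions. The crawler name `ExampleArchiveBot`, the `/talk/` path and the example.org URLs are all made up, and nothing here is taken from the NLS page or from how Mumsnet is actually set up. It uses Python's standard `urllib.robotparser` to check what a well-behaved crawler would conclude from such a file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. The crawler name and the path are illustrative
# assumptions, not the real name of any library's crawler or any real site layout.
ROBOTS_TXT = """\
User-agent: ExampleArchiveBot
Disallow: /talk/

User-agent: *
Allow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The named crawler is asked to stay out of /talk/ but not the rest of the site.
    print(can_crawl(ROBOTS_TXT, "ExampleArchiveBot", "https://example.org/talk/thread-1"))
    print(can_crawl(ROBOTS_TXT, "ExampleArchiveBot", "https://example.org/about"))
    # Any other crawler falls through to the catch-all rule.
    print(can_crawl(ROBOTS_TXT, "SomeOtherBot", "https://example.org/talk/thread-1"))
```

Worth bearing in mind that robots.txt is only a request: it works for crawlers that choose to honour it and does nothing to technically block a scraper that ignores it.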

AstonToTheNaughtyStep · 03/05/2024 08:20

I'm shocked by how widespread data scraping of Mumsnet seems to be and by the sense of entitlement of the researchers who have done it.

I do not consent to my posts on Mumsnet being analysed and used in research, and I suspect many here feel the same way. I've participated willingly as a subject in many research projects and even donated part of my tumour to a tissue bank for researchers to use. The difference is that the researchers in those projects looked for volunteers, explained their research and most importantly sought explicit consent for inclusion in the research.

We all know that posts on Mumsnet are public and viewable by anyone. When we post here, it is with the expectation of anonymity. We make up user names, and can change them freely and frequently to protect that anonymity. We feel able to post about deeply personal issues safe in the knowledge we are anonymous and in the millions of new words posted on various internet forums daily ours will briefly be noticed by readers before being swallowed up by the sheer quantity of other views and opinions, ending up largely forgotten or neglected.

It was not reasonable to expect that one day our posts, our data, would be scraped by people who could potentially deanonymize our posts. Or that our content would be pored over, examined and analysed by researchers without warning and without our consent.

I hope Mumsnet will be providing a quick and easy way for members to have all of our posts across all of our user names deleted from Mumsnet and from the various illegitimately held datasets.

Encyclopediaofnonsense · 03/05/2024 08:23

I think the longer MNHQ leave a lot of these things unanswered now the more reputationally damaging it will be in the long run. Lots of posters will want to know why they weren't informed early on, lots of posters will want their data deleting and lots more will want assurances they won't be identified.

I think this stoic silence and non-answers from MNHQ is going to do them more harm in the long run.

IDoNotConsentToAstonResearch · 03/05/2024 08:30

‘I do not consent to my posts on Mumsnet being analysed and used in research, and I suspect many here feel the same way. I've participated willingly as a subject in many research projects and even donated part of my tumour to a tissue bank for researchers to use. The difference is that the researchers in those projects looked for volunteers, explained their research and most importantly sought explicit consent for inclusion in the research.’

If you don’t consent to it at all you probably shouldn’t be posting on here because Mumsnet have been perfectly open about the fact that they do sometimes allow it and I don’t think they always make individual permission a condition?

AmaryllisNightAndDay · 03/05/2024 08:37

Hm, The link to the Springer (Manchester) article isn't working this morning. Points to the journal but not the article.

Meanwhile the Newcastle article has references to other articles doing similar things: "Data collected from Mumsnet have previously been used to describe the views of parents (particularly mothers) and answer a wide range of research questions [17-20]."

which led me (among others) to this from UCL (2020) https://mental.jmir.org/2020/9/e18271
"The use of mumsnet by parents of young people with mental health needs: qualitative investigation. JMIR Ment Health."

MumsNet might possibly have agreed to this one but there's no indication that MNHQ were asked or permission was granted, and it looks as if the reverse is true:

"Parsehub was used to extract data from Mumsnet threads. Parsehub is a freely available web-based scraping tool designed to extract internet data. Any original posts or comments including information that could potentially identify the user, such as age, name, or location, were omitted manually by the researchers before the data were analyzed. Following this, the raw data were transferred into word documents for analysis."

So this team at UCL have been scraping conversations between distressed mothers about their children with mental health needs. The paper includes brief anonymised quotes from individual posters.

And to start answering @MarkMenziesFakeMugger why do they think this is OK?

The ethics section says they had approval from the UCL ethics team, plus this:
"In line with the recommendations of the Association of Internet Researchers Ethics Working Committee [23], all data were extracted without the inclusion of usernames, and direct quotes were altered slightly (without changing meaning) to maintain the privacy of those posting on the forum during the initial data extraction stage of the analysis."

So here is a link to that paper about ethics: https://aoir.org/reports/ethics2.pdf

Markham A, Buchanan E. Ethical Decision-Making and Internet Research: Recommendations from the AoIR Ethics Working Committee (Version 2.0). Association of Internet Researchers. 2012.

Members of the AoIR Ethics Working Committee who contributed to the report are listed at the top, dunno what the mix is of Information Retrieval and ethicists. It's rather elderly (2012). My quick skim and a keyword search says that this document does mention copyright as a potential problem, and terms and conditions. It mentions scraping right at the start but nowhere else. It says this is ethically complex and it doesn't mention the bleeding obvious - try reading the Ts&Cs and asking the site owners. Maybe they have a better idea of what's acceptable on their own site than you do!
