

Mumsnet Corpus

1000 replies

TokyoBouncyBall · 19/04/2024 11:36

Not a TAAT, but a bit of googling as a result of a now deleted thread has led me to this:

https://fold.aston.ac.uk/handle/123456789/18

I note it says that the License is uncertain. Can you confirm that you have given permission for posts to be used in this way, or is there something that Aston might like to look into?

I note it says Users who wish to access this dataset must make a detailed application to FoLD and the researcher, as well as potentially gain additional agreement from an external organisation before they can be approved for access.

Given one of the uses it is being put to, I think it is a bit dubious to say the least.

AirGappedServerScrapings · 26/04/2024 20:17

I'm not saying I think it's a good idea, I'm just being pragmatic, in event of MN not coming out all guns blazing on this one. Aston have had the stolen data set for ages, they've been playing with it, they think they have legitimate rights over it - a bit like adverse possession of land. They might have gotten away with it if it wasn't for the Land Registry Eden Palmer promoting the cancelled talk.

everythingthelighttouches · 26/04/2024 20:32

Did anyone get a new set of Mumsnet terms and conditions pop up on their screens today too? I had to consent (or not) to the use of my data. But it was quite fiddly to go through and turn off all the consents.

VitoCorleoneOfMNMafia · 26/04/2024 20:52

DeanElderberry · 26/04/2024 19:40

I don't see how Mumsnet could accept any settlement that involved breaking their contract with their users.

I think that only paid users have a contract with MNHQ. A contract requires payment in return for something.

VitoCorleoneOfMNMafia · 26/04/2024 20:53

everythingthelighttouches · 26/04/2024 20:32

Did anyone get a new set of Mumsnet terms and conditions pop up on their screens today too? I had to consent (or not) to the use of my data. But it was quite fiddly to go through and turn off all the consents.

No. I'm using the mobile site.

Was it the Cookie dialog that you saw? https://www.mumsnet.com/talk/site_stuff/5062473-constant-cookie-banner

Talulahalula · 26/04/2024 20:55

From Tim Grant, The Idea of Progress in Forensic Authorship Analysis (2022) p.22

Kredens et al. (2019b) refer to this task in the forensic context as ‘linguistically enabled offender identification’ and have described a series of studies and experiments across two corpora of forum messages. In the first experiment, with a corpus of 418 million words from 114,000 authors, they first used a single message-board post as a basis for a search, and in their results, which rank likely authors from the rest of the corpus, they achieved 47.9 per cent recall in the first thirty positions in the ranking (average hit rank = 7.2, median = 4). This means that in nearly half the experiments, other posts by the anonymous author appeared in the top thirty search hits. When they took fifty randomly selected posts from the author, rather than a single post, their results improved to achieve 89.7 per cent recall in the first thirty predictions (average hit rank = 2.7, median = 1). In an even bigger experiment with a bulletin board corpus of 2 billion words from 588,000 authors, results with a single input post were reduced to 19.2 per cent recall in the first thirty predictions (average hit rank = 5.46, median = 3); and with fifty random posts from a single author as input to 45.1 per cent (average hit rank = 5.46, median = 3). These results demonstrate both the plausibility and the power of author search approaches, and the issue of increasing the size of the search pool up to a full internet-scale technology.

https://www.cambridge.org/core/books/idea-of-progress-in-forensic-authorship-analysis/6A4F7668B4831CCD7DBF74DECA3EBA06

[The reference to Kredens 2019b is to the paper ‘Towards linguistic explanation of idiolectal variation: understanding the black box’ (IAFL 2019), which is the subject of the FoI request re use of MN corpus. I checked the conference abstract and it says ‘the paper reports on what to the best of our knowledge is the biggest corpus (ca. 3 billion words produced by over one million authors) ever used in author classification research’. So Grant is giving slightly different figures for some reason - maybe he has the wrong reference - but there is nothing in Grant’s piece above to suggest that the MN corpus is not dark web bad stuff. But the really interesting part is Grant on ethics, which I will post in a minute]
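For anyone unsure what "recall in the first thirty positions" and "average hit rank" mean in the passage quoted above, here is a rough sketch on made-up data. It is not the researchers' code. The function names and the toy rankings are hypothetical, and it assumes the hit rank is averaged only over searches where the right author actually turned up in the top thirty, which the quoted text does not spell out.

```python
from statistics import mean, median


def first_hit_rank(ranked_authors, true_author):
    """Return the 1-based position of the true author in a ranking, or None."""
    for position, author in enumerate(ranked_authors, start=1):
        if author == true_author:
            return position
    return None


def summarise(queries, k=30):
    """Summarise a list of (ranked_authors, true_author) search results.

    Assumption: hit-rank statistics cover only the searches that scored
    a hit within the first k positions.
    """
    ranks = [first_hit_rank(ranking, truth) for ranking, truth in queries]
    hits = [rank for rank in ranks if rank is not None and rank <= k]
    return {
        "recall_at_k": len(hits) / len(queries),
        "mean_hit_rank": mean(hits) if hits else None,
        "median_hit_rank": median(hits) if hits else None,
    }


# Hypothetical toy searches: each ranking lists candidate authors, best first.
toy_queries = [
    (["a", "b", "c"], "a"),  # true author ranked 1st
    (["b", "c", "a"], "a"),  # true author ranked 3rd
    (["b", "c", "d"], "a"),  # true author not found at all
    (["c", "a", "b"], "a"),  # true author ranked 2nd
]

result = summarise(toy_queries, k=30)
print(result)
```

On this toy data three of the four searches find the right author, so recall is 0.75, and the successful searches have a mean and median hit rank of 2.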

DeanElderberry · 26/04/2024 21:00

okay, not a contract so, but whatever you call it when we agree to terms and conditions that include being used to sell advertising (as with everything free, we are the product here), and being limited by house rules, while Mumsnet agrees not to let our data be scraped.

Talulahalula · 26/04/2024 21:02

From Tim Grant, The Idea of Progress in Forensic Authorship Analysis (2022)

p. 32
A further area which requires considerable development is in thinking through the ethics of forensic authorship analysis. The starting point has to be that nearly all authorship analyses constitute an intrusion into the privacy of an individual and that this requires justification. This is true whether that individual is a novelist writing under a pseudonym, such as Eleanor Ferrante or Robert Galbraith, or whether the analysis is of a username of an individual on a dark web site set up to exchange indecent images of children. Authorship analysis is neither morally neutral nor inherently constrained in its application. It can be used as a necessary and proportionate method to protect individuals and society from a variety of harms, or it might be used to identify and potentially do harm to whistle-blowers or political activists of one persuasion or another.

One interesting development in security ethics which might apply to authorship analyses is the ‘ethical liability model’ developed by Nathan (2017). Nathan, in considering the ethics of undercover policing, suggests a departure from the simplistic dichotomy between rules-based and utilitarian models of ethics to one that addresses the location of the liability for certain types of wrong. Nathan is dissatisfied with utilitarian justifications of deception and manipulation in undercover policing as he says it leaves an ‘ethical residue’ with those undercover police officers. In a legal and well-justified undercover operation, police officers might be asked to deceive and manipulate, and Nathan contends that these are still wrongs even if they carried out for the greater good. He argues that in these situations, the liability for the wrong, the ethical residue for wrongdoing, can be placed with the offenders. He argues that this is similar to a situation in which an attacker is killed by a defender in an act of self-defence. The defender is well justified, but harm has still been done. Nathan locates the responsibility for this harm with the attacker.

Similarly, then, intrusion against an anonymous abuser is still intrusion, but the responsibility for that intrusion is theirs, created by the actions and harms of their cause. The responsibility for the intrusion done to a pseudonymous novelist, against whom there is no issue of liability, thus lies squarely with the authorship analyst. Clearly there needs to be considerably more thinking with regard to the issues of authorship analysis as an intrusion and where and how it might be justified, and such discussions are well beyond the scope of this Element, but Nathan’s work provides a useful starting point.

https://www.cambridge.org/core/books/idea-of-progress-in-forensic-authorship-analysis/6A4F7668B4831CCD7DBF74DECA3EBA06

Talulahalula · 26/04/2024 21:11

So, taking these extracts together, Prof Grant

a) does not acknowledge that the Kredens paper uses what he has called elsewhere ‘an open parenting forum’ - ie people posting with no malicious intent

and

b) does acknowledge that in respect of an intrusion of privacy of an individual - ‘The responsibility for the intrusion done to a pseudonymous novelist, against whom there is no issue of liability, thus lies squarely with the authorship analyst’

Now, to me, that is clear that Grant recognises that doing forensic analysis on someone using a pseudonym where there is no liability or offence is an intrusion of privacy which lies with the analyst.
I would go as far as to say this is precisely why we do not find the Mumsnet Corpus referred to again by name after 2019, and not at all by name in print, and why it is not directly referred to by Grant here.

Talulahalula · 26/04/2024 21:12

the text I have linked to is Open Access and easily available to read online.

IcakethereforeIam · 26/04/2024 21:15

Nathan seems to be claiming that bad stuff can be done to bad people because they are bad; if they weren't bad we wouldn't be doing this bad stuff to them. It's interesting that he mentions undercover policemen. Those women who were deceived into relationships, and presumably the children who were conceived in a few cases, deserved it because they were bad!!?

I'm.....

I hope I've misunderstood.

IcakethereforeIam · 26/04/2024 21:18

Grant seems to think that Nathan's blame shifting could be tweaked so that it could be applied to people who have committed no wrong doing.

Boiledbeetle · 26/04/2024 21:22

@Talulahalula weasels the lot of them! He knows it's as unethical as you can get!

RethinkingLife · 26/04/2024 21:28

Nathan contends that these are still wrongs even if they carried out for the greater good. He argues that in these situations, the liability for the wrong, the ethical residue for wrongdoing, can be placed with the offenders.

Both Grant and Nathan might benefit from considering the issues raised in this Triggernometry interview with a former undercover officer who reflects upon the people whom he deceived, manipulated, and plausibly betrayed.

Neil Woods:

Or this Good Morning interview with a woman deceived into a relationship with an undercover officer. Interesting discussion that it's a known tactic to target women as 'women are easier to form relationships with' - it seems there was no named target.

[The tribunal heard] he was told to develop personal relationships in order to gather 'pre-emptive intelligence' on activists at the centre, with the help of 'an extensive support system for the purpose of this long-term infiltration'.

https://www.dailymail.co.uk/femail/article-13100047/eco-activist-lover-secret-undercover-cop-mark-kennedy.html

Grant and Nathan might not, but I consider the women involved in these as more than collateral damage, a means to an end, and the true locus of responsibility for the "ethical residue".

Undercover Cop: "Drug Policing Makes Things Worse"

https://www.youtube.com/watch?v=ABqpbr2FXog

Talulahalula · 26/04/2024 21:31

Boiledbeetle · 26/04/2024 21:22

@Talulahalula weasels the lot of them! He knows it's as unethical as you can get!

Yes.

Talulahalula · 26/04/2024 21:46

Talulahalula · 26/04/2024 21:31

Yes.

Grant’s last sentence about the ethics needing more consideration is him looking for someone else to square his ethical circle because as of 2022, he couldn’t. That is how I would read it.

GrannyAchingsShepherdsHut · 26/04/2024 21:55

It's beginning to additionally piss me off that not only was the data stolen, but the research it's been used for is quite frankly shoddy.

It seems to me, that if the mumsnet corpus was indeed used for that author identification research, then the research is inherently flawed.

Someone cleverer than me, please correct me if I'm wrong, but as I understand it they selected a single post, then attempted to identify posts by the same author. They then selected multiple posts by the same username and ran the identification process again.

They, I presume, measured their success by looking at the username of the posts that their process identified as a match. Where it was the same as the initial post(s) they marked that as a success. No match, no success. I cannot see how else they can have checked their results.

It seems to me that the very nature of mumsnet that makes it anonymous - that one user can and will have multiple usernames, makes their measuring of success completely nonsensical. There's no way for them to know if those 'incorrect' matches from other usernames are in fact the same author.

They've carried out a completely pointless exercise.
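To put the point above in concrete terms, here is a small sketch on invented data. It is not the researchers' method, which is only presumed in this thread. The usernames, the `person_behind` mapping and the `top_k_recall` helper are all hypothetical. It scores the same search results twice: once by matching usernames, and once against the real person behind each name, which is exactly the information nobody outside the site has.

```python
def top_k_recall(queries, k, same_author):
    """Fraction of searches with at least one 'correct' candidate in the top k.

    same_author(query_name, candidate_name) decides what counts as correct.
    """
    hits = 0
    for query_name, ranked_names in queries:
        if any(same_author(query_name, name) for name in ranked_names[:k]):
            hits += 1
    return hits / len(queries)


# Hypothetical ground truth: one person posts under two usernames.
person_behind = {
    "name_a1": "person_a",
    "name_a2": "person_a",
    "name_b": "person_b",
    "name_c": "person_c",
}

# Each search: (username on the query post, usernames on the ranked results).
queries = [
    ("name_a1", ["name_a2", "name_b"]),  # right person, different username
    ("name_b", ["name_b", "name_c"]),    # right person, same username
    ("name_c", ["name_b", "name_a1"]),   # genuinely the wrong people
    ("name_a2", ["name_a1", "name_c"]),  # right person, different username
]

# Scoring by username is all that is possible without knowing who is who.
by_username = top_k_recall(queries, 2, lambda a, b: a == b)

# Scoring by real person needs the hidden mapping.
by_person = top_k_recall(
    queries, 2, lambda a, b: person_behind[a] == person_behind[b]
)

print(by_username, by_person)
```

Here the username-based score is 0.25 while the score against real people is 0.75. Two of the three apparent failures were the same person under another name, and without the hidden mapping there is no way to tell those apart from the genuine miss.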

VitoCorleoneOfMNMafia · 26/04/2024 22:11

GrannyAchingsShepherdsHut · 26/04/2024 21:55

It's beginning to additionally piss me off that not only was the data stolen, but the research it's been used for is quite frankly shoddy.

It seems to me, that if the mumsnet corpus was indeed used for that author identification research, then the research is inherently flawed.

Someone cleverer than me, please correct me if I'm wrong, but as I understand it they selected a single post, then attempted to identify posts by the same author. They then selected multiple posts by the same username and ran the identification process again.

They, I presume, measured their success by looking at the username of the posts that their process identified as a match. Where it was the same as the initial post(s) they marked that as a success. No match, no success. I cannot see how else they can have checked their results.

It seems to me that the very nature of mumsnet that makes it anonymous - that one user can and will have multiple usernames, makes their measuring of success completely nonsensical. There's no way for them to know if those 'incorrect' matches from other usernames are in fact the same author.

They've carried out a completely pointless exercise.

We discussed this on the FWR threads and came to the same conclusion.

Boiledbeetle · 26/04/2024 22:40

GrannyAchingsShepherdsHut · 26/04/2024 21:55

It's beginning to additionally piss me off that not only was the data stolen, but the research it's been used for is quite frankly shoddy.

It seems to me, that if the mumsnet corpus was indeed used for that author identification research, then the research is inherently flawed.

Someone cleverer than me, please correct me if I'm wrong, but as I understand it they selected a single post, then attempted to identify posts by the same author. They then selected multiple posts by the same username and ran the identification process again.

They, I presume, measured their success by looking at the username of the posts that their process identified as a match. Where it was the same as the initial post(s) they marked that as a success. No match, no success. I cannot see how else they can have checked their results.

It seems to me that the very nature of mumsnet that makes it anonymous - that one user can and will have multiple usernames, makes their measuring of success completely nonsensical. There's no way for them to know if those 'incorrect' matches from other usernames are in fact the same author.

They've carried out a completely pointless exercise.

Totally. They failed to appreciate that one poster could post on just one thread under, say, 20 different usernames if they wanted, never mind how many name changes some posters make over a day on a board. Some might use a different username for each thread, and I presume some posters have different or multiple usernames for different boards.

For all they know, from that one single post their AI's actual success rate could be 100%, but to quote a poster I've not seen for a while (hi Felix), "we just don't know!" And I'm betting neither do they! The difference being we can see the massive flaw; it's honestly looking like they didn't even know it was a possibility.

Astontacious · 26/04/2024 22:41

I have had at least 20 usernames, several different named ponies, children of different sexes and combinations and even lived in different countries. Or have I? He’ll be ending up with false negatives. If I do do that.

Boiledbeetle · 26/04/2024 22:55

Astontacious · 26/04/2024 22:41

I have had at least 20 usernames, several different named ponies, children of different sexes and combinations and even lived in different countries. Or have I? He’ll be ending up with false negatives. If I do do that.

And that's just today!

😏

Myrmecophagatridactyla · 26/04/2024 23:06

I've only been around Mumsnet since last autumn. I already have over twenty names in use or in reserve. I keep adding to my reserve list when I think of good ones! I use several in parallel then I'll drop them randomly. I've permanently retired a couple of names now. Multiple names make sense for anonymity, even more so now that Aston's databases have come to light.

HabeasCorpus · 26/04/2024 23:52

Yes, similar, I’ve got loads of names. I keep some for specific threads but others I just use as and when I fancy. I often use two, three or more from my stash in a day if I’m posting on threads where I’m not a regular.

SqueakyDinosaur · 26/04/2024 23:59

I don't actually think that Aston can possibly think they have a legitimate right to hold this data. I am really looking forward to their arguments in this case, because I don't actually think there are any. So I am washing and ironing my pointing&laughing trousers.

Lassiata · 27/04/2024 00:02

I don't know who the feck Tim Grant thinks "Eleanor" Ferrante is but it doesn't fill me with confidence in his work.

Fallingirl · 27/04/2024 01:05

When you apply for ethics permission for your research to go ahead, you have to include a fairly detailed description, not only to demonstrate (hopefully) that there isn’t anything dodgy about it, but also to demonstrate that the proposed research is good, that the method hangs together, that the method will actually give you what you are trying to find out etc.

The research the dodgy sand box scrapers have done is utter shite, given they didn’t account for several user names per poster, so quite apart from breaching MN’s T&C, stealing women’s posts without consent etc, methodologically the research they stole it for wasn’t even of a quality that merited ethical consent to be given from the university.

In other news, Audrey Ludwig's tweet now has 152.2K views. This cat is well and truly out of the bag.

This thread is not accepting new messages.