Mumsnet Corpus

1000 replies

TokyoBouncyBall · 19/04/2024 11:36

Not a TAAT, but a bit of googling as a result of a now deleted thread has led me to this:

https://fold.aston.ac.uk/handle/123456789/18

I note it says that the License is uncertain. Can you confirm that you have given permission for posts to be used in this way, or is there something that Aston might like to look into?

I note it says Users who wish to access this dataset must make a detailed application to FoLD and the researcher, as well as potentially gain additional agreement from an external organisation before they can be approved for access.

Given one of the uses it is being put to, I think it is a bit dubious to say the least.

everythingthelighttouches · 24/04/2024 13:01

NotTHATCorpusLinguist · 24/04/2024 11:33

I wanted to offer some thoughts on the doxxing threat angle taken up in this and the other thread, and give some potentially helpful info on other corpus linguistic technicalities... NOT to excuse or explain the PhD in question (because wow) but because this is my field and some of us have research integrity not reflected by this PhD and people at Aston. I'm livid that this kind of research is what gets corpus linguistics more known and people then think we're all unethical twits. Ehem. So:

The whole point of corpus linguistics is to look at a dataset as a whole. There are no individuals.

Forensic linguists might be looking at authorship identification but it is a whole different field to corpus linguistics.

The fact that the Mumsnet corpus held by Aston has/could be used by both forensic and corpus linguists makes this confusing.

EP's PhD looks to be squarely in the corpus linguistics wheelhouse, likely without any access to any usernames at all. The data you use in corpus software is just the post content. Usernames are miles away.

It is likely that ethical approval has been given for EP's PhD already precisely because the dataset already exists at Aston, in a kind of "I'm using an approved internal dataset, nothing to see here" way. But any researcher worth their salt should CHECK that the rules haven't changed, and should check the rules of the website in question. So that's a failure on the supervisor and the student. Ethics is a big thing. Not getting informed consent from participants is a massive thing. And the rules on online datasets are changing rapidly and it is vital to stay informed.

The big issue here is the ethics of the Mumsnet dataset existing in the first place as it contravenes the T&Cs of the site.

To be clear, I am not affiliated with Aston or the PhD in question. Just someone else in the field watching with interest in how this all ends...

Thank you. Incredibly helpful.

I wonder what your thoughts are about if research that was forensic linguistics was applied to these scraped datasets?

KellieJaysLapdog · 24/04/2024 13:05

I just found a video of a second, more recent round table talk from 2 years ago. Nothing in the description about the speakers, can’t find anything online yet. Might be totally irrelevant to our discussion but will leave it here to keep track of it and try and find more later.

C8H10N4O2 · 24/04/2024 13:07

NotTHATCorpusLinguist · 24/04/2024 12:04

I wouldn't say it's the main issue. It's definitely a problem but more in the area of research quality. It's definitely a bit suspect! But it is entirely separate to the problem of lack of consent.

Yes, my area of concern is the applications of models built on those assumptions. This is where we regularly see issues - the assumptions fed into the fundamental build of the models result in not just replicating human biases but magnifying them over time (and "computer sez" taking precedence over human input).

This is a department producing models used for author analysis and identification of crimes and propensity to crime (according to their website). It's this kind of subjective sloppiness in model build which results in the problems of racism and ageism (to name two) in tools used by some law enforcement and other background checking agencies.

KellieJaysLapdog · 24/04/2024 13:08

AgathaAllAlong · 24/04/2024 13:01

@KellieJaysLapdog oh god it's over seven hours long! Do you happen to know which talk (or perhaps roughly when in the video) mentions the database?

About 3.20 for the Automatic Doxing Machine Men.

NotTHATCorpusLinguist · 24/04/2024 13:09

everythingthelighttouches · 24/04/2024 13:01

Thank you. Incredibly helpful.

I wonder what your thoughts are about if research that was forensic linguistics was applied to these scraped datasets?

Broadly, it depends. Forensic linguistics is about more than just authorship identification. I don't think that kind of forensic linguistics can be done on scraped social media data because of the issue with people changing usernames.

But, ultimately, I will die on the hill that any research done without sufficient consent gained (in this case, Mumsnet for the corpus) is highly unethical.

Boiledbeetle · 24/04/2024 13:10

This reply has been deleted

This has been deleted by MNHQ for breaking our Talk Guidelines.

No, I'm shocked that a university would go against Mumsnet's T and Cs and scrape/steal decades of data to do who knows what with, whilst letting a student use that information to do a PhD that starts from the assumption we are transphobic on here.

I've used plenty of posts on here. Two books' worth, but I sought permission from the posters and from Mumsnet to use that data and have credited the original authors of the posts as having the copyright and thanked them in the books for letting me use their copyrighted posts with their permission.

What I didn't do was just wholesale copy people's posts and think fuck them I don't care if they don't want that information disseminated to other people in a different format.

Now I could have used posts without gaining individual copyright permission, I'd even agreed an acceptable amount with MNHQ in fact, but when it came to it those posters who said no I don't want you to use my posts, you know what... I didn't. Some posters had deactivated their accounts, and some posters didn't respond to my direct messages, you know what... I didn't include their posts!

Over two books only three comments had to be referenced back to mumsnet as the copyright holder rather than the individual.

Mumsnet has copyright over the entirety of the info on this site, but we still have individual copyright of our own posts.

AstonsDataThief · 24/04/2024 13:13

EP's PhD looks to be squarely in the corpus linguistics wheelhouse, likely without any access to any usernames at all. The data you use in corpus software is just the post content. Usernames are miles away.

With a few exceptions, usernames are not what makes posts identifying. What does is location, job, illness, situation, other details IN POSTS.

AstonsDataThief · 24/04/2024 13:15

The big issue here is the ethics of the Mumsnet dataset existing in the first place as it contravenes the T&Cs of the site.

No, the big issue here is that under the European Convention on Human Rights (and Human Rights Act 1998) I have a right to privacy, including of my correspondence, and to not have that interfered with by public authorities.

IDoNotConsentToAstonResearch · 24/04/2024 13:18

This reply has been deleted

This has been deleted by MNHQ for breaking our Talk Guidelines.

You’re really missing the point there.
I am shocked that a university is acting that unethically. Joe Random out there, sure, he could be doing all kinds of data scraping and analysing from his bedroom, but universities are institutions which benefit from a level of credibility and public esteem and one reason is that they are generally seen to uphold academic and ethical standards.
I am disappointed but probably not shocked by the poor quality of research being done and disseminated while still in progress. There’s a reason some institutions rank lower than others.

EasternStandard · 24/04/2024 13:20

This reply has been deleted

This has been deleted by MNHQ for breaking our Talk Guidelines.

I’m more surprised that someone can read the posts from mnhq and the rest and reach this conclusion

AlisonDonut · 24/04/2024 13:21

It might be worth trying to get hold of the funding application to the US to see exactly what they have said they have and what is being funded.

This is so dark! We need a journalist on it.

Boiledbeetle · 24/04/2024 13:27

Boiledbeetle · 24/04/2024 13:10

No, I'm shocked that a university would go against Mumsnet's T and Cs and scrape/steal decades of data to do who knows what with, whilst letting a student use that information to do a PhD that starts from the assumption we are transphobic on here.

I've used plenty of posts on here. Two books' worth, but I sought permission from the posters and from Mumsnet to use that data and have credited the original authors of the posts as having the copyright and thanked them in the books for letting me use their copyrighted posts with their permission.

What I didn't do was just wholesale copy people's posts and think fuck them I don't care if they don't want that information disseminated to other people in a different format.

Now I could have used posts without gaining individual copyright permission, I'd even agreed an acceptable amount with MNHQ in fact, but when it came to it those posters who said no I don't want you to use my posts, you know what... I didn't. Some posters had deactivated their accounts, and some posters didn't respond to my direct messages, you know what... I didn't include their posts!

Over two books only three comments had to be referenced back to mumsnet as the copyright holder rather than the individual.

Mumsnet has copyright over the entirety of the info on this site, but we still have individual copyright of our own posts.

For clarity

Mumsnet wouldn't and didn't give me permission to use the posts in the quantity I wanted to use them. That permission was quite rightly denied. But as I had the individual posters' permission I wasn't doing anything wrong. Mumsnet did agree that the amount of information that needed to be referenced back to them rather than individual posters was acceptable under fair usage rules.

Ereshkigalangcleg · 24/04/2024 13:39

I just found a video of a second, more recent round table talk from 2 years ago. Nothing in the description about the speakers, can’t find anything online yet. Might be totally irrelevant to our discussion but will leave it here to keep track of it and try and find more later.

I watched it straight after the 2019 one and before I posted that video. They don't really say anything interesting to us directly so I didn't cite it. Kredens chimes in towards the end; the session focused a lot on literary authorship and he seems a bit miffed that it's not strictly relevant to forensic linguistics. His bit is around the 4 hour mark I think. He doesn't directly reference MN but he talks in passing about a database with 600k posters which I think is MN.

Ereshkigalangcleg · 24/04/2024 13:39

Apologies meant to quote @KellieJaysLapdog

Ereshkigalangcleg · 24/04/2024 13:40

But do watch it, I skimmed most of it.

MyLadyDisdainlsYetLiving · 24/04/2024 13:44

I am not a data expert, but I know enough that if you store pieces of information about a person, even if they are apparently anonymous, it is possible to identify that person. Imagine in Mumsnet terms, a theoretical poster who has shared over the years in different posts that she is middle aged, she lives in the south west of England, she has a fairly unusual name, she has two teenage kids with an ex husband who was in the military and a third child with a new husband, they are renting but can’t find an affordable house with enough space for her step kids too, she works in a healthcare related profession, he’s a builder, they have a new puppy etc etc. It wouldn’t take a great deal of work to identify the person in real life. Especially if you knew them.

The safety mechanism we thought we had to keep us anonymous is regular name changes. The whole point of this Aston database is to develop software that will identify people based on how they write and use language, regardless of name change. The alleged “legitimate” point of such software is to track criminals and terrorists. Not to prove that Sue Jones with the problematic neighbour parking on her local Facebook page is also poster xyz987 on mumsnet who is trying to leave her abusive husband.

If you want to assure people you are developing software for a legitimate purpose, then creating a dataset by scraping a website without asking permission and contravening the T&Cs of that website does not assure me that you have pure intentions and are behaving ethically. Quite the opposite. And especially then if you allow anyone else to access the database for their own research.

And all of that is well before the topic of the PhD that brought it to our attention. That is now a relatively minor side point in this whole affair.

LarissaFeodorovna · 24/04/2024 13:51

NotTHATCorpusLinguist · 24/04/2024 13:09

Broadly, it depends. Forensic linguistics is about more than just authorship identification. I don't think that kind of forensic linguistics can be done on scraped social media data because of the issue with people changing usernames.

But, ultimately, I will die on the hill that any research done without sufficient consent gained (in this case, Mumsnet for the corpus) is highly unethical.

I work in a related field and I can absolutely see the usefulness of having a huge corpus of normal informal written language as a reference dataset for helping determine the significance of features that you come across in forensic samples. That’s inherently non-controversial, as you’re not using the database to dox, the lack of authorship or name changing is irrelevant - it just gives you a fantastic resource to mine as an indication of normal usage among a very large group of contemporary users of British English.

But if you don’t have permission to hold or use the data, then it doesn’t matter how well-intentioned or important your research is.

hamstersarse · 24/04/2024 13:51

Let's face it, data sets have massive value - Meta is free to all specifically because they want all our data.

It would have cost Aston millions and millions of pounds to get such a data set.

It's outright stealing and shady as fuck

Astontacious · 24/04/2024 13:56

MyLadyDisdainlsYetLiving · 24/04/2024 13:44

I am not a data expert, but I know enough that if you store pieces of information about a person, even if they are apparently anonymous, it is possible to identify that person. Imagine in Mumsnet terms, a theoretical poster who has shared over the years in different posts that she is middle aged, she lives in the south west of England, she has a fairly unusual name, she has two teenage kids with an ex husband who was in the military and a third child with a new husband, they are renting but can’t find an affordable house with enough space for her step kids too, she works in a healthcare related profession, he’s a builder, they have a new puppy etc etc. It wouldn’t take a great deal of work to identify the person in real life. Especially if you knew them.

The safety mechanism we thought we had to keep us anonymous is regular name changes. The whole point of this Aston database is to develop software that will identify people based on how they write and use language, regardless of name change. The alleged “legitimate” point of such software is to track criminals and terrorists. Not to prove that Sue Jones with the problematic neighbour parking on her local Facebook page is also poster xyz987 on mumsnet who is trying to leave her abusive husband.

If you want to assure people you are developing software for a legitimate purpose, then creating a dataset by scraping a website without asking permission and contravening the T&Cs of that website does not assure me that you have pure intentions and are behaving ethically. Quite the opposite. And especially then if you allow anyone else to access the database for their own research.

And all of that is well before the topic of the PhD that brought it to our attention. That is now a relatively minor side point in this whole affair.

It’s not a relatively minor side issue if it looks like the data is being held in a site usually used for criminal analysis, that the researcher has stated that mumsnetters post transphobic content in her PhD title, that the supervisor has changed her banner to trans rights are human rights and is used as an expert police witness, and that Scotland in particular talks about transphobic hate crime.

Edited to say that’s how I understand it

Boiledbeetle · 24/04/2024 14:05

Can you imagine in future court cases anyone from Aston trying to come across as an ethical and trustworthy expert witness?

The other side will be "and did you gain your knowledge in this area using the data your university stole from mumsnet against their terms and conditions and without obtaining any permissions from the site or the users to do so?"

That will go down well with juries.

MyLadyDisdainlsYetLiving · 24/04/2024 14:07

Astontacious · 24/04/2024 13:56

It’s not a relatively minor side issue if it looks like the data is being held in a site usually used for criminal analysis, that the researcher has stated that mumsnetters post transphobic content in her PhD title, that the supervisor has changed her banner to trans rights are human rights and is used as an expert police witness, and that Scotland in particular talks about transphobic hate crime.

Edited to say that’s how I understand it

Edited

I don’t disagree with that interpretation of events. And if it was not for the nature of the dataset, the discussion would be solely on the FWR board regarding the researcher and their supervisor and their piss-poor attempt at a decent research proposal.

For me the concerns around the creation and storage of the entire body of MN’s posts, and posters, are so massive that they eclipse both the individual actions of the researcher and supervisor, which are relatively minor by comparison, and the implied insult of posters being transphobic.

KellieJaysLapdog · 24/04/2024 14:08

Simply having Mumsnet in the same Forensics Library as Harold Shipman, The Unabomber and Text Generated by Paedos is surely damaging to Mumsnet’s corporate brand?
The FoLD is not a big repository, haven’t counted but it’s only about 20-30 datasets.

wibdib · 24/04/2024 14:09

I’ve been talking to DH about this thread as his work involves using assorted data and datasets, and he found it very interesting.

It’s interesting that the first thing he said was that MN should ask Aston to immediately shut down the illegally scraped and stolen dataset and prevent any access to it while the matter is under discussion, which is something I haven’t seen mentioned here. This is a reasonable ask given the circumstances - their reaction to being asked this will be insightful in itself

Also - words (as linguistic folk they know only too damn well) are important - so keep reinforcing your view of the dataset - when they call it scraped or in a sandbox and so on, it’s deliberately misinterpreting and minimising their bad actions in allowing the database to be created and run for many years. It makes it much harder to justify that they should be allowed to keep using it if they have to refer to it as the ‘stolen mumsnet dataset’ every time or even if they just have to listen to others (ie MN!) do so during the discussions.

DH also said that lots of people just don’t get that they have to specify the use of the dataset and that it can’t be used for anything else if the permission wasn’t given up front - it’s something he has to battle with regularly - especially with managers from overseas and/or those who figure they can quiz an existing dataset to get an answer quickly and who don’t want to jump through the admin hoops they need to, to get permission each time. They see him as being deliberately obstructive rather than realising that he is saving their asses from big fines and legal bills if they get caught doing the wrong stuff.

The onus is very much on Aston to explain why they think they shouldn’t stop using the illegal dataset immediately - the fact that they don’t want to and are making money on it is beside the point.

Really looking forward to hearing what the VC says next!

ArabellaScott · 24/04/2024 14:13

MN should ask Aston to immediately shut down the illegally scraped and stolen dataset and prevent any access to it while the matter is under discussion,

This.

Boiledbeetle · 24/04/2024 14:13

wibdib · 24/04/2024 14:09

I’ve been talking to DH about this thread as his work involves using assorted data and datasets, and he found it very interesting.

It’s interesting that the first thing he said was that MN should ask Aston to immediately shut down the illegally scraped and stolen dataset and prevent any access to it while the matter is under discussion, which is something I haven’t seen mentioned here. This is a reasonable ask given the circumstances - their reaction to being asked this will be insightful in itself

Also - words (as linguistic folk they know only too damn well) are important - so keep reinforcing your view of the dataset - when they call it scraped or in a sandbox and so on, it’s deliberately misinterpreting and minimising their bad actions in allowing the database to be created and run for many years. It makes it much harder to justify that they should be allowed to keep using it if they have to refer to it as the ‘stolen mumsnet dataset’ every time or even if they just have to listen to others (ie MN!) do so during the discussions.

DH also said that lots of people just don’t get that they have to specify the use of the dataset and that it can’t be used for anything else if the permission wasn’t given up front - it’s something he has to battle with regularly - especially with managers from overseas and/or those who figure they can quiz an existing dataset to get an answer quickly and who don’t want to jump through the admin hoops they need to, to get permission each time. They see him as being deliberately obstructive rather than realising that he is saving their asses from big fines and legal bills if they get caught doing the wrong stuff.

The onus is very much on Aston to explain why they think they shouldn’t stop using the illegal dataset immediately - the fact that they don’t want to and are making money on it is beside the point.

Really looking forward to hearing what the VC says next!

If nothing else if they keep it they will forever be forced to look at/include/search/reference/explain (depending on the task at hand) the 2024 blip in data re usernames publicly condemning them and references to Aston university and stolen mumsnet datasets
