Corpus 2

766 replies

TokyoBouncyBall · 11/05/2024 11:48

A summary would be good and I might do one later but Aston, data scraping, astonishing lack of contrition…

OP posts:
TheAutopsyOfMNCorpus · 10/07/2024 16:48

Having to come up with a new idea for a thesis, I hope.

Ereshkigalangcleg · 10/07/2024 16:53

Thanks @JustineMumsnet

AstonUniversityScrapedMyCorpus · 10/07/2024 17:00

Sounds good Justine - don’t let those Aston-A-Holes get away with besmirching us!

Onetransphobicmother · 10/07/2024 17:02

Thanks @JustineMumsnet sorry if we don't always play nice. I'll try and do better in future.

Astontacious · 10/07/2024 17:03

Thank you

Onetransphobicmother · 10/07/2024 17:04

AstonUniversityScrapedMyCorpus · 10/07/2024 17:00

Sounds good Justine - don’t let those Aston-A-Holes get away with besmirching us!

Aston-Holes? That's what they call the potholes on the M6

Dumbledoreslemonsherbets · 10/07/2024 17:29

This is a great update thanks Justine. Look forward to hearing the next steps once you can share.

I personally hope the ICO is closely monitoring this.

TokyoBouncyBall · 10/07/2024 17:36

Excellent news. And agree, will be interesting to see if the ICO has a thought or three.

OP posts:
ifIwerenotanandroid · 10/07/2024 17:44

Many thanks, Justine. Good luck with it!

DrSpartacular · 10/07/2024 17:45

Thanks for the update @JustineMumsnet

DrBlackbird · 10/07/2024 17:48

BIWI · 10/07/2024 16:46

Thanks for the update @JustineMumsnet

Where does that leave things with the PhD student though, and her presentation/thesis?

As quite a bit of her/their data was scraped pre 2018 I imagine she’ll/they’ll still be able to use that data (assume that was pre change of T&Cs) even though the basis of her/their ethics submission misrepresented the literature cited to support her/their methodology.

@JustineMumsnet would you be able to ask for a copy of Eden’s recent presentation please? We’d quite like to know what comments / corpora constitute transphobia.

popebishop · 10/07/2024 18:26

would you be able to ask for a copy of Eden’s recent presentation please? We’d quite like to know what comments / corpora constitutes transphobia.

I'll second this.
Thanks Justine!

BeckyAMumsnet · 10/07/2024 18:35

@DrBlackbird @popebishop we're not sure but we'll certainly look into this.

Boiledbeetle · 10/07/2024 19:41

I'm guessing those in charge have been hoping if they ignore @JustineMumsnet the whole thing would just go away! Nothing to see here!

AlisonDonut · 10/07/2024 19:47

Thank you for the update.

DrSoupDragonsFriend · 10/07/2024 20:03

Justine, thanks from me too.

AstonVillains · 10/07/2024 21:44

More thanks to Justine here, and I'd also be interested to see if Aston are forthcoming with Eden's presentation.

cancelledduetoillnessapparently · 10/07/2024 21:45

Thanks from me too.

Talulahalula · 11/07/2024 08:03

That is very, very welcome news, thank you. I really do appreciate the efforts MN are taking in this to safeguard the content of the site and ensure posters can have confidence in how their words are used (i.e. by MN in line with the T&Cs). Using the relationships board and LGBT children is particularly egregious in my opinion but a continuation of a disregard which also thought it fine to use the adoption board and infertility threads as examples.

Talulahalula · 11/07/2024 08:21

DrBlackbird · 10/07/2024 17:48

As quite a bit of her/their data was scraped pre 2018 I imagine she’ll/they’ll still be able to use that data (assume that was pre change of T&Cs) even though the basis of her/their ethics submission misrepresented the literature cited to support her/their methodology.

@JustineMumsnet would you be able to ask for a copy of Eden’s recent presentation please? We’d quite like to know what comments / corpora constitutes transphobia.

No, I don’t think so. I think the 2019 dataset which Justine refers to was the full scrape which was hosted on FOLD and became the sandbox to play in (so that was the project by Kredens and another of his colleagues); the 2024 dataset is the one scraped for the PhD. This will have content going back to 2008 or whenever it was but that is irrelevant to the fact that scraping it was against MN’s terms and conditions.

The other flaw with the ethics review was precisely this, that the data was collected in breach of terms and conditions, and there was no way therefore that posters could reasonably know this is how their words would be used. I think also the papers cited were out of date, if I recall correctly, and there is much more recent writing on the ethics of social media and forum research, including from one of Aston’s alumni.

Basically, the PhD student would be well advised to broaden their research questions and methodology at this point, and if I was their supervisor, I would be encouraging them to think about the issues in a broader way. If the student is interested in how the language around trans issues has changed since 2004 (which would be a good starting point with the GRA), then there is plenty of publicly available media to focus on. Heck, one could even use Hansard to compare political discourse with media coverage. There are also articles and whole books by people advocating one way or the other around the issues. Might be more work than scraping some data, but I think this would be a more open approach to the issues. I am not a linguist though, so my advice comes with that caveat.

I also think Aston should compensate with a year’s additional fees and maintenance to start fresh on this, because having one’s research the centre of a legal issue is the result of flawed supervisory and ethics procedures as much as anything else.

EggcornAcorn · 11/07/2024 09:33

Thank you Justine.

Ormally · 17/07/2024 14:13

Related vein, perhaps. The Times Higher Ed Academic Freedom survey: web intro says it is open to academics and university administrators.

There's a link for it, published 10th July. The preamble reads as follows.

"Should academics be allowed to make any lawful statement without censure by their institutions? Has your freedom of speech as an academic ever been restricted for reasons other than legal ones? Is academic freedom of speech more restricted than it was 10 years ago? Help us find the answers to this and many more question in our survey"

DrBlackbird · 17/07/2024 14:25

Talulahalula · 11/07/2024 08:21

No, I don’t think so. I think the 2019 dataset which Justine refers to was the full scrape which was hosted on FOLD and became the sandbox to play in (so that was the project by Kredens and another of his colleagues); the 2024 dataset is the one scraped for the PhD. This will have content going back to 2008 or whenever it was but that is irrelevant to the fact that scraping it was against MN’s terms and conditions.

The other flaw with the ethics review was precisely this, that the data was collected in breach of terms and conditions, and there was no way therefore that posters could reasonably know this is how their words would be used. I think also the papers cited were out of date, if I recall correctly, and there is much more recent writing on the ethics of social media and forum research, including from one of Aston’s alumni.

Basically, the PhD student would be well advised to broaden their research questions and methodology at this point, and if I was their supervisor, I would be encouraging them to think about the issues in a broader way. If the student is interested in how the language around trans issues has changed since 2004 (which would be a good starting point with the GRA), then there is plenty of publicly available media to focus on. Heck, one could even use Hansard to compare political discourse with media coverage. There are also articles and whole books by people advocating one way or the other around the issues. Might be more work than scraping some data, but I think this would be a more open approach to the issues. I am not a linguist though, so my advice comes with that caveat.

I also think Aston should compensate with a year’s additional fees and maintenance to start fresh on this, because having one’s research the centre of a legal issue is the result of flawed supervisory and ethics procedures as much as anything else.

It would be reassuring to know that the PhD’s data was only taken in 2024, and thus subject to MN’s T&Cs, if I trusted either this student or their supervisor or Aston.

These are all sensible suggestions to help the student progress their doctorate and learn from the experience but the immediate evidence of doubling down by the researcher, and more importantly her/their supervisor, indicates this change of methodology or approach is unlikely to happen. Also as indicated by going ahead with their conference paper talk with barely concealed maligning of MN/FWR as being transphobic.

UtopiaPlanitia · 17/07/2024 14:27

Another news story about scraped data has broken:

https://9to5mac.com/2024/07/16/apple-used-youtube-videos/

An investigation by Proof News found some of the wealthiest AI companies in the world have used material from thousands of YouTube videos to train AI. Companies did so despite YouTube’s rules against harvesting materials from the platform without permission.

Our investigation found that subtitles from 173,536 YouTube videos, siphoned from more than 48,000 channels, were used by Silicon Valley heavyweights, including Anthropic, Nvidia, Apple, and Salesforce…

The downloads were reportedly performed by a non-profit called EleutherAI, which says it helps developers train AI models. While the aim appears to have been to provide training materials to small developers and academics, the dataset has also been used by several tech giants, including Apple…

…while Apple and the other companies named likely used a publicly-available dataset in good faith, it’s a good illustration of the legal minefield created by scraping the web to train AI systems. There have been multiple examples of AI systems plagiarizing entire paragraphs of text when asked about niche topics, and the dangers of using material without permission are only increased when companies use datasets compiled by third parties.

lcakethereforeIam · 19/07/2024 00:13

Woh! Go @JustineMumsnet 🥳 🎉

www.thetimes.com/uk/technology-uk/article/mumsnet-openai-sues-copyright-infringement-cz5hzvf8s

https://archive.ph/Cf3Ru erm, I'll just put this here