Another news story about scraped data has broken:
https://9to5mac.com/2024/07/16/apple-used-youtube-videos/
‘An investigation by Proof News found some of the wealthiest AI companies in the world have used material from thousands of YouTube videos to train AI. Companies did so despite YouTube’s rules against harvesting materials from the platform without permission.
Our investigation found that subtitles from 173,536 YouTube videos, siphoned from more than 48,000 channels, were used by Silicon Valley heavyweights, including Anthropic, Nvidia, Apple, and Salesforce…
The downloads were reportedly performed by a non-profit called EleutherAI, which says it helps developers train AI models. While the aim appears to have been to provide training materials to small developers and academics, the dataset has also been used by several tech giants, including Apple…
…while Apple and the other companies named likely used a publicly-available dataset in good faith, it’s a good illustration of the legal minefield created by scraping the web to train AI systems. There have been multiple examples of AI systems plagiarizing entire paragraphs of text when asked about niche topics, and the dangers of using material without permission are only increased when companies use datasets compiled by third parties.’