this post was submitted on 26 Sep 2023
142 points (90.3% liked)

[–] [email protected] 52 points 1 year ago (14 children)

That's like saying smartphones are fundamentally a surveillance technology. There's truth to it, but it's not inherent to the technology. It's a deliberate act by people using the tech that we allow for whatever reason.

[–] [email protected] 5 points 1 year ago* (last edited 1 year ago) (13 children)

Right, you can still do traditional advertising without the targeted metrics provided by smartphones, but....

AI LLMs literally require a corpus of language to learn from. Thus the "Large Language" part of "LLM." The amount of data these models need to function is so staggeringly huge there is no way they can compile all that data without scraping the entire internet and pirating a bunch of copyrighted books.

It's fundamentally a surveillance technology, because the technology fundamentally cannot function without that large dataset of language to begin with. It needs massive amounts of data, and that data has to be surveilled to be obtained, because unless you're Reddit or Facebook, your own site probably doesn't contain enough data to meet the needs of the LLM. Thus you need to scrape the internet for more data to fill the gap.

Books3 is used widely as part of "The Pile" and is clearly all of the content of the private torrent tracker Bibliotik. People theorize Books2 is all of the books from Library Genesis. To make their models functional at all, they have to scrape the internet and pirate thousands of books.

This is also fundamentally why AI starts to fail so quickly. These tools have been used to flood the internet with AI-generated pages, which in turn become training data for AI. That means the training data is tainted with AI-generated garbage, which will further degrade the LLM. The plus side, I guess, is that if they keep using this kind of business model, they will unintentionally make their AI pretty useless within a few years by flooding the internet with useless, incorrect data.

[–] [email protected] -1 points 1 year ago

And thus the Tech Industry Hype Cycle will begin anew. Maybe next time it'll be The Fediverse. Maybe it'll be Holograms. Maybe it'll be Blockchain But This Time It's Not A Scam, Pinky Promise.
