this post was submitted on 16 May 2024

516 points (97.1% liked)

OpenAI strikes Reddit deal to train its AI on your posts (www.theverge.com)

submitted 11 months ago by [email protected] to c/[email protected]

[–] [email protected] 152 points 11 months ago (2 children)

So they filled reddit with bot generated content, and now they're selling back the same stuff likely to the company who generated most of it.

At what point can we call an AI inbred?

[–] [email protected] 90 points 11 months ago (2 children)

This is actually a thing. It's called "Model Collapse". You can read about it here.

[–] [email protected] 21 points 11 months ago (5 children)

"Model collapse" can be easily avoided by keeping old human data with new synthetic data in the training set. The old archives of Reddit content from before there was AI are still around.
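A rough sketch of what that kind of mixing could look like in practice. The function name `mix_training_data`, its parameters, and the 75/25 split in the usage note are all made up for illustration; this only shows the bookkeeping of keeping a fixed share of human-written text in the set, not that any particular ratio actually prevents collapse.

```python
import random


def mix_training_data(human_docs, synthetic_docs, total, human_fraction=0.5, seed=0):
    """Build a shuffled training set with a fixed share of human-written documents.

    Hypothetical helper: `human_fraction` is the share of the final set
    drawn from `human_docs`; the rest comes from `synthetic_docs`.
    """
    if not 0.0 <= human_fraction <= 1.0:
        raise ValueError("human_fraction must be between 0 and 1")

    n_human = round(total * human_fraction)
    n_synth = total - n_human
    if n_human > len(human_docs) or n_synth > len(synthetic_docs):
        raise ValueError("not enough documents for the requested mix")

    rng = random.Random(seed)  # seeded so the mix is reproducible
    # Sample without replacement from each pool, then interleave by shuffling.
    mixed = rng.sample(human_docs, n_human) + rng.sample(synthetic_docs, n_synth)
    rng.shuffle(mixed)
    return mixed
```

For example, `mix_training_data(old_archive, generated, total=40, human_fraction=0.75)` would return 40 documents, 30 of them from the pre-AI archive (both list names are placeholders).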

[–] [email protected] 15 points 11 months ago (4 children)

A model trained on jokes about bacon, narwhals, and rage comics.

[–] [email protected] 18 points 11 months ago

I wonder if OpenAI or any of the other firms have thought to put in any kind of stipulations about monitoring and moderating Reddit content to reduce AI-generated posts and the risk of model collapse.

Anybody who's looked at Reddit in the past 2 years especially has seen the impact of AI pretty clearly. If I were running OpenAI, I wouldn't want that crap contaminating my models.

[–] [email protected] 104 points 11 months ago (2 children)

They always were.

Only now they've agreed to pay Reddit for it. This is what their third party lockdown was really all about.

They're helping themselves to your Lemmy comments for free, as that's just how it's designed. If you post anything publicly anywhere, it's getting slurped up by a bot somewhere.

[–] [email protected] 15 points 11 months ago (5 children)

I'm not a lawyer. But isn't the reason they had to go to reddit to get permission that users hand over ownership to reddit the moment you post? And since there's no such clause on Lemmy, they'd have to ask the actual authors of the comments for permission instead?

Mind you, I understand there's no technical limitation that prevents bots from harvesting the data, I'm talking about the legality. After all, public does not equate to public domain.

[–] [email protected] 14 points 11 months ago (1 children)

users hand over ownership to reddit the moment you post

Not ownership. Just permission to copy and distribute freely. Which basically is necessary to run a service like this, where user-submitted content is displayed.

And since there's no such clause on Lemmy, they'd have to ask the actual authors of the comments for permission instead?

It's more of a fuzzy area, but simply by posting on a federated service you're agreeing to let that service copy and display your comments, and sync with other servers/instances to copy and display your comments to their users. It's baked into the protocol, that your content will be copied automatically all over the internet.

Does that imply a license to let software be run on that text? Does it matter what the software does with it, like displaying the content in a third-party mobile app? What about when it engages in text-to-speech or braille conversion for accessibility? Or indexes the page for a search engine? Does AI training make any difference at that point?

The fact is, these services have APIs, and the APIs allow for the efficient copying and ingest of the user-created information, with metadata about it, at scale. From a technical perspective obviously scraping is easy. But from a copyright perspective submitting your content into that technical reality is implicit permission to copy, maybe even for things like AI training.

[–] [email protected] 9 points 11 months ago* (last edited 11 months ago) (2 children)

What if I say the word gasp fuck?

[–] [email protected] 87 points 11 months ago (3 children)

BRB - changing my entire 15 year reddit comment history to "Fuck Spez". LOL.

[–] [email protected] 18 points 11 months ago (4 children)

Know any bots or ways to perma delete all Reddit comments?

[–] [email protected] 64 points 11 months ago (5 children)

Reddit has backups; permanent deletion isn’t an option.

[–] [email protected] 17 points 11 months ago (2 children)

yep they fuckin got us

but it's not like our posts are safe here either. This is the world we live in now.

[–] [email protected] 7 points 11 months ago (1 children)

But here, the API is open and I can run my own copy and train my own LLM same as anyone else. It's not one asshole who decides to whom and for how much he'll sell the content we all gave him for free, so he can justify his $193 million paycheck.

[–] [email protected] 12 points 11 months ago (2 children)

They don't keep multiple versions though; edit it and then delete it and it's gone. They disabled all the tools to do it though, so it's manual or nothing now.

[–] [email protected] 14 points 11 months ago

Damn. You outsmarted them well-paid data jockeys. That's assuming your edits change the actual comment and don’t simply hide the original.

I could be an idiot too though. Reddit might have been running this whole shit show on the original version of the database system and be upselling to buyers.

[–] [email protected] 11 points 11 months ago

They just reload a previously cached comment; it doesn’t matter how many times you edit or delete, it’s all logged and backed up.

[–] [email protected] 11 points 11 months ago* (last edited 11 months ago)

I used redact.dev to mass edit all my comments, worked pretty well. Problem is that if you mass delete, they'll restore them pretty quick, but so far they haven't reverted my edits.

[–] [email protected] 6 points 11 months ago (1 children)

https://github.com/j0be/PowerDeleteSuite

[–] [email protected] 8 points 11 months ago

Realistically, when you're operating at Reddit's scale, you're probably keeping a history of each comment for analytics purposes.

[–] [email protected] 46 points 11 months ago (6 children)

Some day historians will be able to look back at this moment and be able to determine it was what caused ChatGPT to become horny and weird.

[–] [email protected] 38 points 11 months ago (2 children)

LLMs have been training on Reddit posts since at least 2012. Nothing really new here.

[–] [email protected] 6 points 11 months ago (4 children)

Now they get to train on all the "deleted" comments/posts as well.

[–] [email protected] 32 points 11 months ago* (last edited 11 months ago) (4 children)

What makes you think that they are not scraping Lemmy too? The only reason they might not be is probably how niche Lemmy and the fediverse are, but I am sure there have been people already doing it.

[–] [email protected] 28 points 11 months ago

Fediverse is designed to do exactly that. It's free flow of information which is a good thing. Don't let corporations hijack this beautiful concept. We all want information to be free.

[–] [email protected] 16 points 11 months ago

I’m not mad about the scraping. The LinkedIn scraping case pretty much cemented that there was nothing that could be done to stop it. I’m just mad that I can no longer use the app of my choice. No such problem with Lemmy.

[–] [email protected] 30 points 11 months ago* (last edited 11 months ago) (3 children)

They're paying Reddit now? I thought they could just scrape for free.

Also, you cannot delete anything on the internet. Once something is public there will always be a copy somewhere.

[–] [email protected] 25 points 11 months ago (3 children)

Scraping through a website at the scale they are talking about isn't really viable. You need access to the API so that you can have very targeted requests.

This is why reddit changed their API pricing and screwed over everyone using third party apps. They can make more money selling access to LLM trainers than they could from having millions of people using apps that rely on the API.

[–] [email protected] 10 points 11 months ago* (last edited 11 months ago) (1 children)

There's actually legal precedent against scraping a website through unofficial channels, even if the information is public. Basically, if you scrape a website and hinder its ability to operate, it falls under "virtual trespassing".

I'm assuming it would be even worse now that everyone is using the cloud, since scraping their site would cause a noticeable increase in resource usage (and thus directly cost them more money in cloud usage fees).

It's why APIs are such a big deal. They provide you with an official, controlled, entry point to a platform's data.

[–] [email protected] 10 points 11 months ago* (last edited 11 months ago) (1 children)

It's the opposite! There's legal precedent that scraping public data is 100% legal in the US.

There are a few countries where scraping is illegal, though, like Japan and China. European countries often also have so-called "database protection" laws that forbid replicating public databases through scraping or any other means, but that only applies when you copy a big chunk of the overall database. There are also personally identifiable information (PII) protection laws, like GDPR, that restrict storing people's data without their consent.

Source: I work with anti-bot tech, and we have to explain this to almost every customer who wants to "sue the web scrapers": lol, if LinkedIn couldn't do it, you're not suing anyone.

[–] [email protected] 20 points 11 months ago* (last edited 11 months ago) (1 children)

Reddit banned me through IP address or something. Whatever new account I create gets banned within 24 hours, even if I don't upvote a single post or comment. I tried with 10 new accounts, all with new email addresses, and all were banned. So I gave up and randomly changed all my good comments. Shifted permanently to Lemmy. I miss some of the most niche communities, but not so much that I'd return to Reddit.

Edit: I didn't even commit any rule violation. I took too long to switch away from a modded Reddit app. I only logged in once. That doesn't justify blocking me from ever using Reddit.

[–] [email protected] 19 points 11 months ago (1 children)

Meh, good luck with that.

All my Reddit comments have just said “Comment redacted in protest against Reddit's deranged attacks against third party apps, the community, and common sense. See you'll in Lemmy or Kbin once this embarrassment of a site is done enshittifying itself out of existence. Monetize this, u/spez, you greedy little pigboy. 🖕” since I edited them before moving here. 🤷‍♂️

[–] [email protected] 18 points 11 months ago (2 children)

You better double check. I just found out that only my comments with few upvotes are still that way, the others have been restored.

A script replacing them with random words might do the trick.
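Something like this could generate the replacement text. It's only a sketch: the `WORDS` list and the `scramble` function are placeholders invented here, and it deliberately doesn't talk to Reddit at all, so actually submitting the edits is left out.

```python
import random

# Placeholder vocabulary; any word list would do.
WORDS = ["apple", "river", "quiet", "orbit", "lantern", "maple", "thunder", "pixel"]


def scramble(comment, rng):
    """Return random words, one for each word in the original comment.

    Keeping a similar word count means each replacement differs per
    comment instead of being one identical protest message.
    """
    n_words = max(1, len(comment.split()))  # at least one word, even for empty text
    return " ".join(rng.choice(WORDS) for _ in range(n_words))


if __name__ == "__main__":
    rng = random.Random()  # unseeded: different output on every run
    print(scramble("This was once a thoughtful comment about narwhals.", rng))
```

Whether edits like these survive on Reddit's side is a separate question; this only covers producing the text.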

[–] [email protected] 17 points 11 months ago (4 children)

This form of propaganda is my pet peeve. It's not "your posts": as soon as you put something out in public, you don't get to have your cake and eat it too. It's out there, you shared it. Don't share it if you don't want humanity to ingest and use it.

[–] [email protected] 21 points 11 months ago (3 children)

You're technically right, but nobody anticipated, and therefore nobody agreed to, their posts being used for training LLMs.

[–] [email protected] 17 points 11 months ago (2 children)

No wonder AI is crazy AF.

[–] [email protected] 15 points 11 months ago

Isn’t this news like every month?

[–] [email protected] 12 points 11 months ago (3 children)

Finally found a use for MS Edge, loaded up Nuke Reddit History and removed all comments and posts: https://microsoftedge.microsoft.com/addons/detail/nuke-reddit-history/bklbcgohenjegdibgmppligaapohkgip

[–] [email protected] 32 points 11 months ago (3 children)

Hate to break it to you, but the time to do that was over a year ago, and even then it wasn’t ever really a sure thing - we don’t really know what their backup policies are around that stuff.

This is what the former power user community that made an exodus from Reddit roughly a year ago has been trying to communicate, but a ton of people here seem to enjoy keeping their toes in the water over there, with rather predictable consequences (literally, the post we’re commenting on).

All that said: I am very much looking forward to the absolutely titanic lawsuit around GDPR I’m sure is in the works over this.

[–] [email protected] 13 points 11 months ago

Worth doing, but I suspect they’re sending OpenAI snapshots of the database from before you did that.

[–] [email protected] 7 points 11 months ago

Does this mean I can stop prefacing my AI requests with “According to Reddit…”?

[–] [email protected] 6 points 11 months ago

I didn't delete my comments before nuking my account, but I'm pretty sure the grand majority were shitposts containing ample amounts of smut, gore and other ridiculous over the top shit. So I consider this a win.
