Technology

134

Largest Dataset Powering AI Images Removed After Discovery of ‘Suspected’ Child Sexual Abuse Material (www.404media.co)

submitted 10 months ago by [email protected] to c/[email protected]

21 comments


[–] [email protected] 43 points 10 months ago* (last edited 10 months ago) (4 children)

Sounds like nothing particularly unusual or alarming. Researchers found a few thousand images that could be illegal that were referenced by it, told LAION about it, and LAION pulled the database down temporarily while checking and removing them. A few thousand images out of five billion is not significant.

There's also the persistent misunderstanding of what the LAION database is, which is even perpetuated by the paper itself (making me suspicious of the researchers' motivations since they surely know better). The paper says: “We find that having possession of a LAION‐5B dataset populated even in late 2023 implies the possession of thousands of illegal images—not including all of the intimate imagery published and gathered non‐consensually, the legality of which is more variable by jurisdiction.” But the LAION-5B dataset doesn't actually contain any pictures at all. It's purely a list of URLs pointing at images that are on the Internet, each with text describing them. Possessing the dataset doesn't put you in possession of any of those images.

Edit: Yeah, down at the bottom of the article I see the researcher state that in his opinion LAION-5B shouldn't even exist and use inaccurate emotionally-charged language about how AI training data is "stolen." So there's the motivation I was suspicious of.

[–] [email protected] 17 points 10 months ago

While I get what you are saying, it's pretty clear that what he was saying was that if you actually populate the dataset by downloading the images the links point to (which anyone who is actually using the dataset to train a model would need to do), then you have inadvertently downloaded illegal images.

It is mentioned repeatedly in the article that the dataset itself is simply a list of urls to the images.
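For anyone unclear on the distinction being argued here, a rough sketch may help. It is not LAION's actual schema or tooling: the `Entry` fields, the example URLs, and the `populate` and `fake_fetch` helpers are all made up for illustration. It only shows that an index of URLs and captions holds no image bytes, and that image data appears only after a separate download step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Entry:
    """One row of a URL-plus-caption index (hypothetical field names)."""
    url: str
    caption: str


# Made-up rows standing in for index entries; note there is no image data here.
index: List[Entry] = [
    Entry("https://example.com/cat.jpg", "a cat sitting on a sofa"),
    Entry("https://example.com/bridge.png", "a bridge at sunset"),
]


def populate(entries: List[Entry],
             fetch: Callable[[str], Optional[bytes]]) -> Dict[str, bytes]:
    """Follow each link; only this step puts image bytes on the user's disk."""
    images: Dict[str, bytes] = {}
    for entry in entries:
        data = fetch(entry.url)
        if data is not None:  # dead or blocked links yield nothing
            images[entry.url] = data
    return images


def fake_fetch(url: str) -> Optional[bytes]:
    """Offline stand-in for an HTTP download (assumption, not a real client)."""
    return b"<image bytes>" if url.endswith(".jpg") else None


populated = populate(index, fake_fetch)
```

In this toy version `index` is what the first comment means by "the dataset", and `populated` is what the paper means by a "populated" dataset.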

[–] [email protected] 9 points 10 months ago (1 children)

Makes one wonder if there is some lobby org behind this. The benefits to major corporate interests are obvious, and it feels a little campaigny.

[–] [email protected] 3 points 10 months ago* (last edited 9 months ago) (1 children)

deleted

[–] [email protected] 2 points 10 months ago (1 children)

What?

[–] [email protected] 4 points 10 months ago (1 children)

He's (correctly) taking the piss

[–] [email protected] 2 points 10 months ago (1 children)

I don't get it. What's the joke?

[–] [email protected] 1 points 10 months ago* (last edited 9 months ago)

deleted

[–] [email protected] 3 points 10 months ago

This new "journalism" site is not doing itself any favors with bullshit headlines like this. And this is not the first wildly inaccurate article I've seen from 404 Media.

[–] [email protected] 2 points 10 months ago* (last edited 9 months ago) (2 children)

deleted

[–] [email protected] 6 points 10 months ago

LAION is a database of URLs, gathered from publicly-available data on the Web. Who is "taking" anything?

[–] [email protected] 4 points 10 months ago (1 children)

“Taking” is doing a lot of work there, and fundamentally the issue at heart.

[–] [email protected] 1 points 10 months ago* (last edited 9 months ago) (1 children)

deleted

[–] [email protected] 5 points 10 months ago (2 children)

"Copyright violation" is probably the wording you're looking for. Copyright violation is not taking or theft or stealing or any of those other words - it's copyright violation.

Whether training an AI on a copyrighted work without permission of the copyright holder is a violation of copyright is something that is debatable. But it most definitely is not stealing or theft. Theft is covered by completely different laws.

[–] [email protected] 1 points 10 months ago* (last edited 9 months ago)

deleted

[–] [email protected] -2 points 10 months ago

Unless you feel like being a pedant, copyright infringement is also known as content theft.

https://www.deviantart.com/team/journal/Calling-All-Creator-Platforms-to-Fight-Art-Theft-901238948