this post was submitted on 20 Dec 2023
112 points (90.6% liked)

Technology

[–] [email protected] 25 points 11 months ago* (last edited 11 months ago)

It occurs to me that a lot of people don’t know the background here. (ETA: I wrote this in response to a different article, so some refs don't make sense.)

LAION is a German Verein (a club). It's mainly one German physics/computer-science teacher who does this in his spare time. (German teachers have the equivalent of a Master's degree.)

He took data collected by an American non-profit called Common Crawl. “Crawl” means that they have a computer program that automatically follows all links on a page, and then all links on those pages, and so on. In this way, Common Crawl basically downloads the internet (or rather the publicly reachable parts of it).
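To make "crawling" concrete, here is a minimal sketch in Python. It is not Common Crawl's actual code; it assumes the third-party requests and BeautifulSoup libraries, and the seed URL is hypothetical. Real crawlers add politeness delays, robots.txt handling, and deduplication at enormous scale.

```python
# Minimal illustration of crawling: fetch a page, collect its links,
# then fetch those pages too (breadth-first).
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url: str, max_pages: int = 100):
    seen = {seed_url}
    queue = deque([seed_url])
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip unreachable pages
        fetched += 1
        yield url, html
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)

# Example (hypothetical seed URL):
# for url, html in crawl("https://example.org"):
#     print(url)
```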

Search engines, like Google or Microsoft’s Bing, crawl the internet to create the databases that power their search. But these and other for-profit businesses aren’t sharing the data. Common Crawl exists so that independent researchers also have some data to study the internet and its history.

Obviously, these data sets include illegal content. It's not feasible to detect all of it. Even if you could manually look at all of it, doing so would be illegal in a lot of jurisdictions. Besides, whose standards of illegal content should one apply? If a Chinese researcher downloads some data and learns things about Tiananmen Square in 1989, what should the US do about that?

Well, that data is somehow not the issue here, for some reason. Interesting, no?

The German physics teacher wrote a program that extracted links to images, along with their accompanying text descriptions, from Common Crawl. These links and descriptions were put into a list - a spreadsheet, basically. The list also contains metadata like the image size. On top of that, he used AI to guess whether the images are "NSFW" (i.e. porn) and whether people would find them beautiful. This list, with 5 billion entries, is LAION-5B.
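Roughly, that extraction step amounts to something like the sketch below. This is not LAION's actual pipeline - in particular, score_image() is a stand-in for the separate ML models that produce the NSFW and aesthetic predictions, and real runs parse Common Crawl's metadata archives rather than raw HTML.

```python
# Sketch of turning crawled HTML into LAION-style rows:
# image URL, caption (alt text), plus metadata.
import csv
from bs4 import BeautifulSoup

def extract_rows(page_url: str, html: str):
    for img in BeautifulSoup(html, "html.parser").find_all("img"):
        src, alt = img.get("src"), img.get("alt", "").strip()
        if src and alt:  # keep only images that come with a caption
            yield {"url": src, "text": alt, "page": page_url,
                   "width": img.get("width"), "height": img.get("height")}

def score_image(row):
    # Stand-in for the model-based NSFW / aesthetic predictions.
    return {"nsfw_prob": None, "aesthetic": None}

def write_list(pages, out_path="image_list.csv"):
    # 'pages' is an iterable of (page_url, html) pairs, e.g. from a crawler.
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = None
        for page_url, html in pages:
            for row in extract_rows(page_url, html):
                row.update(score_image(row))
                if writer is None:
                    writer = csv.DictWriter(f, fieldnames=row.keys())
                    writer.writeheader()
                writer.writerow(row)
```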

Sifting through petabytes of data to do all that is not something you can do on your home computer. The funding that Stability AI provided for this was a few thousand USD for supercomputer time in "the cloud".

German researchers at the LMU - a government-funded university in Munich - had developed a new image AI that is especially efficient and can run on normal gaming PCs. (The main people now work at a start-up in New York.) The AI was trained on that open-source data set and named Stable Diffusion in honor of Stability AI, which had provided the several hundred thousand USD needed to pay for the supercomputer time.

These supposed issues only arise for free and open-source AI. The for-profit AI companies keep their data sets secret, so they are fairly safe from such accusations.

Maybe one should use PhotoDNA to search for illegal content? PhotoDNA - which so kindly provided its services for free to this study - is a Microsoft product, and Microsoft is also behind OpenAI.

Or maybe one should only use data that has been manually checked by humans? That checking would be outsourced to a low-wage country for pennies - but no need: luckily, billion-dollar corporations exist that offer just such data sets.

This article solely attacks non-profit endeavors. The only for-profit interests mentioned (PhotoDNA, Getty) stand to gain from these attacks.

[–] [email protected] 18 points 11 months ago* (last edited 11 months ago)

1,679 out of 5,000,000,000.

So around 0.000034% of the LAION-5B data set.
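For anyone who wants to check the arithmetic (counts taken from the figures above):

```python
# 1,679 suspect entries out of 5,000,000,000 total, as a percentage.
ratio = 1_679 / 5_000_000_000
print(f"{ratio * 100:.6f}%")  # -> 0.000034%
```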

[–] [email protected] 8 points 11 months ago (1 children)
[–] [email protected] -3 points 11 months ago (2 children)

Man, it's so disappointing that this joke, or some variation of it, is the top comment on so many news threads. What do you think you're contributing?

[–] [email protected] 8 points 11 months ago* (last edited 11 months ago)

the top comment on so many news threads

The problem is not the comment or the joke but the news articles with low-quality or obvious stories.

most "tech" journalism is, generally, garbage.

In this case, the fact that roughly 1,000 CSAM photos were found in a database of 5 billion photos scraped from the internet is not surprising in any way. These photos, unfortunately, are spammed all over the internet.

[–] [email protected] 1 points 11 months ago

Man, it's so disappointing that this sanctimony, or some variation of it, is the second comment on so many inane news threads. What do you think you're contributing?

[–] [email protected] 5 points 10 months ago

People who write articles about a subject should at least have a bare-minimum education in said subject matter. The fact that less than a percent of the images in a multi-billion-image set were problematic is actually impressive.

Anyone who has ever moderated any kind of risqué forum or board understands just how ridiculously common this shit is - and ON THE CLEAR NET, of all things.

If top-level domains can't deal with this shit, how is a NON-PROFIT supposed to? Especially when a majority of the images come from the companies that make money off gathering and selling them (Getty/Google/et al.).

This is such a garbage article written by a garbage person with little to no education in any of the matters they're discussing.

[–] [email protected] 3 points 11 months ago

This is the best summary I could come up with:


The researchers began combing through the LAION dataset in September 2023 to investigate how much, if any, child sexual abuse material (CSAM) was present.

The flagged images were sent to CSAM detection platforms like PhotoDNA and verified by the Canadian Centre for Child Protection.

Stanford’s researchers said the presence of CSAM does not necessarily influence the output of models trained on the dataset.

“The presence of repeated identical instances of CSAM is also problematic, particularly due to its reinforcement of images of specific victims,” the report said.

The researchers acknowledged it would be difficult to fully remove the problematic content, especially from the AI models trained on it.

US attorneys general have called on Congress to set up a committee to investigate the impact of AI on child exploitation and prohibit the creation of AI-generated CSAM.


The original article contains 339 words, the summary contains 134 words. Saved 60%. I'm a bot and I'm open source!

[–] [email protected] 0 points 11 months ago (2 children)

How could this even happen by accident?

[–] [email protected] 11 points 11 months ago (1 children)

Because it has five billion images?

The potentially problematic images make up less than one percent of one percent of one percent of the total.

[–] [email protected] 3 points 10 months ago (1 children)

Don't they need to label the data?

[–] [email protected] 4 points 10 months ago

No, it's not manually labeled. The pipeline connects text to an image based on things like alt text or the caption next to it in a social media post, then runs the pairs through a different AI (CLIP), which rates how well the text description matches the image; pairs with a low score are filtered out.

The point of the OP research is that they should add another step - checking against CSAM databases - rather than relying on social media curation to have avoided illegal material (which they should, even though it's a very, very small portion of the overall dataset).

But at no time was a human reviewing CSAM, labeling it, and including it in the data.
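To make that concrete, here is a rough sketch of the two automated gates being discussed: the caption-match filter that was used, and the hash-database check the researchers recommend adding. Neither function reflects LAION's or Stanford's actual code - clip_similarity() is a stand-in for a real CLIP model, the 0.3 threshold is illustrative, and the plain SHA-256 hash set stands in for services like PhotoDNA, which use robust perceptual hashing instead.

```python
# Sketch of automated filtering for image/text pairs, under the
# assumptions described above.
import hashlib

KNOWN_BAD_HASHES: set[str] = set()  # would be supplied by a hash-matching service

def clip_similarity(image_bytes: bytes, caption: str) -> float:
    """Stand-in for a real CLIP image/text similarity score."""
    return 1.0  # a real pipeline would run both through a CLIP model

def keep_pair(image_bytes: bytes, caption: str,
              min_similarity: float = 0.3) -> bool:
    # Gate 1: drop pairs where the caption does not match the image well.
    if clip_similarity(image_bytes, caption) < min_similarity:
        return False
    # Gate 2: drop anything whose hash matches a known-bad database.
    digest = hashlib.sha256(image_bytes).hexdigest()
    if digest in KNOWN_BAD_HASHES:
        return False
    return True
```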

[–] [email protected] 7 points 11 months ago* (last edited 11 months ago) (1 children)

Removing these images from the open web has been a headache for webmasters and admins for years on sites that host user-uploaded images.

If the billions of images in the training data were automatically scraped from the internet, I don't find it surprising that there was CSAM in there.

[–] [email protected] 0 points 10 months ago (1 children)

Don't they need to label the data?

[–] [email protected] 1 points 10 months ago

Not manually