this post was submitted on 24 Jul 2024
436 points (97.2% liked)

Technology

[–] [email protected] 12 points 3 months ago* (last edited 3 months ago) (14 children)

provenance requires some way to filter the internet into human-generated and AI-generated content, which hasn’t been cracked yet

It doesn't need to be filtered into human / AI content. It needs to be filtered into good (true) / bad (false) content. Or a "truth score" for each.

We don't teach children to read by just handing them random tweets. We give them books that are made specifically for children. Our filtering mechanism for good / bad content is very robust for humans. Why can't AI just read every piece of "classic literature", famous speeches, popular books, good TV and movie scripts, textbooks, etc?

[–] [email protected] 6 points 3 months ago (10 children)

It doesn’t need to be filtered into human / AI content. It needs to be filtered into good (true) / bad (false) content. Or a “truth score” for each.

That isn't enough because the model isn't able to reason.

I'll give you an example. Suppose that you feed the model with both sentences:

  1. Cats have fur.
  2. Birds have feathers.

Both sentences are true. And based on the vocabulary of both, the model can output the following sentences:

  1. Cats have feathers.
  2. Birds have fur.

Both are false but the model doesn't "know" it. All that it knows is that "have" is allowed to go after both "cats" and "birds", and that both "feathers" and "fur" are allowed to go after "have".
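To make that concrete, here is a minimal sketch of the kind of model being described: one that only records which word may directly follow which. It is a toy for illustration, not how any real LLM is implemented, and the helper names `build_bigram_table` and `allowed_sentences` are made up for this example.

```python
from collections import defaultdict


def build_bigram_table(sentences):
    """Map each word to the set of words seen directly after it."""
    table = defaultdict(set)
    for sentence in sentences:
        words = sentence.lower().rstrip(".").split()
        for current, following in zip(words, words[1:]):
            table[current].add(following)
    return table


def allowed_sentences(table, start, length):
    """List every `length`-word sequence the table permits from `start`."""
    if length == 1:
        return [[start]]
    results = []
    for following in sorted(table.get(start, ())):
        for tail in allowed_sentences(table, following, length - 1):
            results.append([start] + tail)
    return results


training = ["Cats have fur.", "Birds have feathers."]
table = build_bigram_table(training)

# The table only knows "have" may follow "cats" and that both
# "fur" and "feathers" may follow "have".
cat_sentences = [" ".join(s) for s in allowed_sentences(table, "cats", 3)]
bird_sentences = [" ".join(s) for s in allowed_sentences(table, "birds", 3)]
print(cat_sentences)
print(bird_sentences)
```

Under this adjacency-only assumption, the false sentences are exactly as "allowed" as the true ones.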

[–] [email protected] 3 points 3 months ago (8 children)

It's not just a predictive text program. That's been around for decades. That's a common misconception.

As I understand it, it uses statistics from the whole text to create new text. It would be very rare to output "cats have feathers" because that phrase doesn't ever appear in the training data. The words "have feathers" never follow "cats".
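A rough way to picture this point: if the next word is chosen from statistics over the whole preceding context rather than just the last word, "feathers" has no support after "cats have". The sketch below uses a simple count table as a stand-in; real LLMs use a learned neural network rather than literal counts, and `build_context_counts` is a hypothetical helper for this example.

```python
from collections import Counter, defaultdict


def build_context_counts(sentences, context_size):
    """Count which word follows each run of `context_size` words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().rstrip(".").split()
        for i in range(context_size, len(words)):
            context = tuple(words[i - context_size:i])
            counts[context][words[i]] += 1
    return counts


training = ["Cats have fur.", "Birds have feathers."]
counts = build_context_counts(training, context_size=2)

# With two words of context, only continuations actually seen after
# "cats have" get any weight.
after_cats_have = dict(counts[("cats", "have")])
after_birds_have = dict(counts[("birds", "have")])
print(after_cats_have)
print(after_birds_have)
```

With only two training sentences this is trivially true, but it shows why more context makes "cats have feathers" unlikely rather than equally allowed.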

[–] [email protected] 1 points 3 months ago* (last edited 3 months ago)

This isn't really accurate either. At the moment of generation, an LLM only has context for the input string and the network of token relationships it learned in training. It pulls from a "pool" of these tokens based on what it's already output and the input context, nothing more.

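Most LLMs have sampling settings called "Top K", "Top P", etc. These limit the pool of candidate tokens it ends up selecting from: Top K keeps only the K most likely tokens, and Top P keeps the smallest set of tokens whose probabilities add up to at least P. Those probabilities are based on the tokens it's already output, alongside the input tokens. It then randomly chooses one from the pool, with the temperature setting controlling how strongly the likelier candidates are favored.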

It's why if you turn these models' temperature settings really high they output pure nonsense, both conceptually and grammatically. The probabilities get flattened so much that unlikely tokens are picked almost as often as likely ones, and the tenuous thread linking the earlier context to the next token loses any semblance of cohesiveness.
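Here is a rough sketch of how temperature, Top K and Top P can fit together when picking the next token. The logit values are made up, the function names `candidate_pool` and `sample_next_token` are hypothetical, and the order of the steps (temperature, then Top K, then Top P) is one common arrangement that actual implementations may not share.

```python
import math
import random


def candidate_pool(logits, temperature=1.0, top_k=None, top_p=None):
    """Return the (token, probability) pairs that survive filtering."""
    # Temperature rescales the logits: low values sharpen the
    # distribution, high values flatten it.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    # Softmax, shifted by the max for numerical stability.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    if top_k is not None:
        ranked = ranked[:top_k]  # keep the K most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:  # smallest set reaching mass P
                break
        ranked = kept
    return ranked


def sample_next_token(logits, rng=random, **settings):
    """Randomly pick one token from the filtered pool."""
    pool = candidate_pool(logits, **settings)
    tokens = [tok for tok, _ in pool]
    weights = [prob for _, prob in pool]
    # random.choices normalizes the weights itself.
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up scores for the token after "cats have".
logits = {"fur": 5.0, "whiskers": 3.0, "feathers": 0.0}
normal = dict(candidate_pool(logits, temperature=1.0))
hot = dict(candidate_pool(logits, temperature=100.0))
print(normal["feathers"], hot["feathers"])
```

At temperature 1.0 "feathers" is almost never chosen; at 100.0 all three tokens end up close to equally likely, which is the flattening that produces nonsense.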
