this post was submitted on 12 Apr 2024
1001 points (98.5% liked)

Technology

60080 readers
3368 users here now

This is a most excellent place for technology news and articles.


Our Rules


  1. Follow the lemmy.world rules.
  2. Only tech related content.
  3. Be excellent to each another!
  4. Mod approved content bots can post up to 10 articles per day.
  5. Threads asking for personal tech support may be deleted.
  6. Politics threads may be removed.
  7. No memes allowed as posts, OK to post as comments.
  8. Only approved bots from the list below, to ask if your bot can be added please contact us.
  9. Check for duplicates before posting, duplicates may be removed

Approved Bots


founded 2 years ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
[–] [email protected] 2 points 8 months ago* (last edited 8 months ago)

No. There's only model collapse (the term for this in academia) if literally all the content is synthetic.

In fact, a mix of synthetic and human generated performs better than either/or.

Which makes sense, as the collapse is a result of distribution edges eroding, so keeping human content prevents that, but then the synthetic content is increasingly biased towards excellence in more modern models, so the overall data set has an improved median oven the human only set. Best of both worlds.

Using Gab as an example, you can see from other comments that in spite of these instructions the model answers are more nuanced and correct than Gab posts. So if you only had Gab posts you'd have answers from morons, and the synthetic data is better. But if you only had synthetic data, it wouldn't know what morons look like to avoid those answers and develop nuance around them.