this post was submitted on 26 May 2024
454 points (95.6% liked)

These are 17 of the worst, most cringeworthy Google AI Overview answers:

  1. Eating Boogers Boosts the Immune System?
  2. Use Your Name and Birthday for a Memorable Password
  3. Training Data is Fair Use
  4. Wrong Motherboard
  5. Which USB is Fastest?
  6. Home Remedies for Appendicitis
  7. Can I Use Gasoline in a Recipe?
  8. Glue Your Cheese to the Pizza
  9. How Many Rocks to Eat
  10. Health Benefits of Tobacco or Chewing Tobacco
  11. Benefits of Nuclear War, Human Sacrifice and Infanticide
  12. Pros and Cons of Smacking a Child
  13. Which Religion is More Violent?
  14. How Old is Gen D?
  15. Which Presidents Graduated from UW?
  16. How Many Muslim Presidents Has the U.S. Had?
  17. How to Type 500 WPM
[–] [email protected] 2 points 7 months ago

People get very confused about this. Pre-training ChatGPT (or any transformer model) on "internet shitposting text" doesn't make it reply with garbage comments; bad alignment does. Google seems to have implemented no framework to prevent hallucinations whatsoever, and the RLHF/DPO it applied appears to be lacking. But this is not a "problem with training on the entire web". You could pre-train a model exclusively on a 4chan dump and, with the right fine-tuning, still end up with a perfectly healthy and harmless model. In fact, having "shitposting" or "toxic" text in the pre-training data isn't bad, because it gives the model the ability to identify and understand that kind of content.
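For a rough idea of what that alignment step actually optimizes, here is a minimal sketch of the DPO (Direct Preference Optimization) loss in PyTorch. The tensor names, the toy log-probabilities, and the beta value are illustrative assumptions, not anything Google or OpenAI has published about their setup:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss (Rafailov et al., 2023).

    Each argument is a tensor of summed log-probabilities that the
    trainable policy (or the frozen reference model) assigns to the
    human-preferred ("chosen") or dispreferred ("rejected") completion.
    """
    # How far the policy has shifted toward/away from each completion,
    # relative to the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred completions.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example with made-up log-probabilities for a batch of 3 prompts:
loss = dpo_loss(torch.tensor([-12.0, -9.5, -11.2]),
                torch.tensor([-13.4, -10.1, -10.9]),
                torch.tensor([-12.5, -9.8, -11.0]),
                torch.tensor([-12.8, -10.0, -11.1]))
print(loss)  # scalar; lower when chosen completions out-score rejected ones
```

The point of the reference-model terms is exactly the one above: the model keeps everything it learned in pre-training (including what toxic text *looks like*) and is only nudged toward preferred behavior.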

If anything, the real "problem with training on the entire web" is that we would be drinking from a poisoned well: AI-generated text has a very different statistical distribution from human-written text, which would degrade the quality of subsequent models. Evidence for how much data quality matters can be seen with the SlimPajama dataset, a cleaned and deduplicated version of RedPajama that improves the scores of trained models simply because it contains less duplicated information and is a denser dataset: https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama
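To make the deduplication point concrete, here is a minimal sketch of the kind of fuzzy dedup a SlimPajama-style pipeline performs. The shingle size and similarity threshold are illustrative assumptions, and the real pipeline uses MinHashLSH to avoid comparing every pair of documents:

```python
def shingles(text, n=5):
    """Lowercased word n-gram shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def dedup(docs, threshold=0.8):
    """Keep a document only if it isn't a near-duplicate of one already kept.

    O(n^2) pairwise comparison -- fine for a sketch; at corpus scale
    you'd bucket candidates with MinHashLSH first.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept

corpus = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "large language models are trained on web scale text corpora",
]
print(dedup(corpus))  # drops the near-duplicate second document
```

Removing near-duplicates like this is why a smaller, denser dataset can train a better model than a larger, repetitive one.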