this post was submitted on 22 Dec 2024
998 points (97.5% liked)

Technology


It's all made from our data, anyway, so it should be ours to use as we want

[–] [email protected] 95 points 11 hours ago* (last edited 8 hours ago) (31 children)

It won't really do anything, though. The model itself is whatever. The training tools, the data, and the resulting weights are where the meat is. Unless you can prove they used unlicensed data in those three pieces, open sourcing the model is kind of moot.

What we need is legislation to stop it from happening in perpetuity. Maybe just ONE civil case win to make them think twice about training on unlicensed data, but they'll drag that out for years until people go broke fighting, or stop giving a shit.

They pulled a very public and out in the open data heist and got away with it. Stopping it from continuously happening is the only way to win here.

[–] [email protected] 3 points 10 hours ago (9 children)

But wouldn't open sourcing it help with that? If it stopped functioning properly without the data once it's open, wouldn't that prove it's using a huge amount of unlicensed data?

Probably not "burden of proof in a court of law" prove though.

[–] [email protected] 8 points 10 hours ago (1 children)

Making it open source doesn't change how it works. It doesn't need the data after it's been trained. Most of these AIs have just figured out patterns to look for in the new data they come across.
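As a toy illustration of this point (a simple linear fit, nothing like a real LLM): you can "train" on some data, delete that data entirely, and still answer new queries from the learned weights alone. All names here are made up for the sketch.

```python
import numpy as np

# Toy illustration (a linear fit, not an LLM): "train" on some data,
# throw the data away, and answer new queries from the weights alone.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # training data
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w                          # training targets

w, *_ = np.linalg.lstsq(X, y, rcond=None)  # "training": fit the weights
del X, y                                    # the training data is gone

x_new = np.array([1.0, 1.0, 1.0])       # new, unseen input
prediction = float(x_new @ w)           # inference uses only the weights
print(prediction)                       # close to 2.0 - 1.0 + 0.5 = 1.5
```

Once `w` exists, the model is just those numbers; nothing in inference ever touches the original data again.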

[–] [email protected] 3 points 10 hours ago (1 children)

So you're saying the data wouldn't exist anywhere in the source code, but it would still be able to answer questions based on the data it has previously seen?

[–] [email protected] 16 points 10 hours ago (1 children)

That is how LLMs work: they don't store the data as data, but as weight values.
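A toy sketch of the same idea (a character-bigram table, vastly simpler than a real transformer): after "training", only a table of numeric weight values survives, not the text itself, yet the weights alone can still generate output.

```python
import random
from collections import defaultdict

# Toy character-bigram model: after "training", all that survives is a
# table of weight values (transition counts), not the original text.
text = "the cat sat on the mat"
weights = defaultdict(lambda: defaultdict(int))
for a, b in zip(text, text[1:]):
    weights[a][b] += 1        # learned weights, derived from the data

del text                      # the training data no longer exists

# Yet the weights alone can still generate plausible text.
random.seed(1)
ch, out = "t", ["t"]
for _ in range(12):
    nxt = weights[ch]         # possible next characters and their weights
    if not nxt:
        break
    ch = random.choices(list(nxt), weights=list(nxt.values()))[0]
    out.append(ch)
print("".join(out))
```

The generated string is built purely from the weight table; searching the "model" for the original sentence finds nothing, which is roughly why open sourcing weights doesn't directly expose what they were trained on.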

[–] [email protected] 1 points 8 hours ago (1 children)

So then why, if it were all open sourced, including the weights, would the AI be worthless? Surely an identical but open source version would strip profitability from the original paid product.

[–] [email protected] 4 points 6 hours ago (1 children)

It wouldn't be. It would still work. It just wouldn't be exclusively available to the group that created it, so any competitive advantage is lost.

But all of this ignores the real issue: you're not really punishing the use of unauthorized data. Those who owned that data are still harmed by this.

[–] [email protected] 1 points 5 hours ago

It does discourage the use of unauthorised data. If stealing doesn't give you a competitive advantage, it's not worth the risk and cost of stealing it in the first place.
