this post was submitted on 22 Dec 2024
1535 points (97.5% liked)

Technology
It's all made from our data, anyway, so it should be ours to use as we want

[–] [email protected] 2 points 22 hours ago (1 children)

My view is that of a scholar - one who does devote a large part of their life to freely creating and disseminating knowledge. I do indeed hold a strong bias here, one I'm happy to admit.

Much of the time when I've run across copyright, it is rarely (if ever, come to think of it) held in the name of the author - a common requirement of journals being the surrender of ownership of one's work. It normally falls to a company, one usually driven by shareholder value with little (if any) concern for the author's rights. This tends to be the rule rather than the exception, and I'd argue that copyright in its current incarnation merely provides a legal avenue to steal the work of another, or to hold their works to ransom from future generations. This contradicts the first point, and also the second (paywalled papers); indeed, the lack of availability of academic works (created for free, or with public funding) is, I believe, a key driver of inequality in this world.

One can withhold or even selectively share knowledge, and history will never know what that has cost us.

In terms of AI training, I wouldn't say it is copyright infringement even in spirit, and I say this as one whose works are vomited out verbatim by LLMs when questioned about the field. The comparison with speaking is an interesting one, for we generally do try to attribute ideas if we hold the speaker in esteem, or feel their name will enhance our point. An AI, however, is not speaking of its own volition, but is instead acting in the interest of the company hosting it (and so would fall under the professional label rather than the personal). This might contradict your final point, if one assumes AI progresses as a subscription product (which looks likely).

I think your framework has merit, mostly because it is built on ideals (and we need more such thinking in the world); however, it does not quite match the observed data. That said, it does suggest the rules a better incarnation of copyright could adhere to.

Moreover, I think no one has an issue with training openly available models - it's the models that are themselves under copyright that people are leery of.

[–] [email protected] 2 points 20 hours ago

I wholeheartedly agree about proprietary models. My perspective is as someone who saw the initial momentum of AI and has only run models on my own hardware. What you are seeing with your work is not possible from a base model in practice. There are too many holes that need to align in the Swiss cheese to make that possible, especially with the sampling (softmax) settings used for general-purpose deployments. Even with deterministic sampling settings this doesn't happen. I've even tried overtraining with a fine-tune, and it won't reproduce verbatim. What you are seeing is only possible with an agentic RAG architecture. RAG is retrieval-augmented generation: retrieval from a database used to augment the model's context. The common open-source libraries are LangChain and ChromaDB for the agent and the database. The agent is just a group of models running at the same time, with a central model capable of function calling in the model-loader code.
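To make the distinction concrete, here is a toy sketch of the RAG flow described above: verbatim text comes out of a retrieval database, not out of the model weights. The store, the overlap scoring, and the stub "agent" are all hypothetical stand-ins for illustration, not actual LangChain or ChromaDB calls.

```python
def retrieve(query, store, k=1):
    """Rank stored passages by naive word overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(store, key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return scored[:k]

def answer(query, store):
    """A stub 'agent': fetch a passage, then hand it to the generator."""
    context = retrieve(query, store)[0]
    # A real agent would insert `context` into an LLM prompt; any verbatim
    # quote in the reply originates here, in the retrieved text.
    return f"According to my sources: {context}"

store = [
    "Copyright assignment to journals is a common publishing requirement.",
    "The scheduler picks the next runnable task from the run queue.",
]
print(answer("what does the scheduler do", store))
```

The point of the sketch is the data path: swap the database and the "verbatim reproduction" changes, while the model itself is untouched.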

I can coax stuff out of a base model that is not supposed to be there, but it is so extreme and unreliable that it is not at all useful. If I give a model something like 10k tokens (words/fragments) of lead-in, then I can start a sentence of the reply, and the model might get a sentence or two correct before it goes off on some tangent. Those kinds of paths through the tensor layers are like walking on a knife edge. There is absolutely no way to get that kind of result at random or without extreme effort. The first few words of a model's reply are very important too, and with open-source models I can control every aspect. Indeed, I run models from a text-editor interface where I see and control every step of generation.
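For anyone unfamiliar with what "deterministic sampling settings" means here, a minimal sketch: greedy decoding always takes the highest-probability token, so repeated runs are identical, while temperature sampling draws from the softmax distribution and introduces chance. The vocabulary and logits below are made up purely for illustration.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw logits to a probability distribution."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_token(vocab, logits, temperature=None, rng=random):
    """Greedy (deterministic) when temperature is None; sampled otherwise."""
    if temperature is None:
        # Deterministic: always the argmax token, run after run.
        return vocab[max(range(len(logits)), key=lambda i: logits[i])]
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

vocab = ["the", "a", "tangent"]
logits = [2.0, 1.5, 0.5]
print(pick_token(vocab, logits))                   # greedy: always "the"
print(pick_token(vocab, logits, temperature=1.2))  # sampled: may vary
```

With the deterministic path, reproducing a long passage requires the argmax to land on the right token at every single step, which is the "knife edge" described above.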

I tried to create a RAG system for learning from Operating Systems: Principles and Practice, Computer Systems: A Programmer's Perspective, and Linux Kernel Development, as the next step in learning CS on my own. I learned a lot about the limits of present AI systems. They have a lot of promise, but progress mostly involves the peripheral model-loader code more than the base model itself, IMO.
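The first practical step in a textbook RAG like that is splitting long text into overlapping chunks before indexing them in the database. A sketch, with illustrative (not recommended) sizes:

```python
def chunk_text(text, chunk_words=100, overlap=20):
    """Split text into word windows that overlap so ideas aren't cut mid-thought."""
    words = text.split()
    step = chunk_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_words]
        if piece:
            chunks.append(" ".join(piece))
        if start + chunk_words >= len(words):
            break  # this window already reaches the end of the text
    return chunks

sample = ("word " * 250).strip()
pieces = chunk_text(sample, chunk_words=100, overlap=20)
print(len(pieces))  # 250 words -> windows starting at 0, 80, 160 -> 3 chunks
```

The overlap is what decides whether a retrieved chunk carries enough surrounding context to be useful, and in my experience tuning this kind of peripheral code matters more than which base model sits behind it.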

I don't know the answer to the stagnation and corruption of academia in so many areas. I figure there must be a group somewhere that has decided civilization is going to collapse soon, so why bother.