this post was submitted on 15 Jun 2024



AI Loophole #1; Your GitHub README.md (lemmy.world)

submitted 4 months ago* (last edited 4 months ago) by [email protected] to c/[email protected]


I used to be the Security Team Lead for Web Applications at one of the largest government data centers in the world, but now I mostly do "source available" security, mainly focusing on BSD. I'm on GitHub, but I run a self-hosted Gogs (which Gitea came from) git repo at Quadhelion Engineering Dev.

Well, on that server I tried to deny AI with Suricata, robots.txt, "NO AI" licenses, Human Intelligence (HI) License links in the software, and "NO AI" comments in posts everywhere on the Internet where my software was posted. Here is what I found today after correlating all my logs of git clones or scrapes and tracing them all back to IP/Company/Server.

Having formerly been loath to even give my thinking pattern to a potential enemy, I asked Perplexity AI questions specifically about BSD security, a very niche topic. Although there is a huge data pool here in general spanning many decades, my type of software is pretty unique. It is buried, since it does not come up in a GitHub search for BSD Security within the first two pages, which is all most users will click through; it is very recent compared to the "dead pool" of old knowledge; and it is fairly well received yet not generally popular, so GitHub Traffic Analysis is very useful.

The traceback and AI result analysis shows the following:

1. GitHub cloning vs. visitor activity in the Traffic tab DOES NOT MATCH any useful pattern for me, the engineer. Rough estimate of the likelihood of AI training on my own repositories: 60% of clones are AI/automata.
2. GitHub README.md is not licensable material and is a public document able to be trained on no matter what the software license, copyright, statements, or any technical measures used to dissuade/defeat it.
   a. I'm trying to see whether tracking down if any README.md, no matter the context, is trainable is a solvable engineering project considering my life constraints.
3. Plagiarism of technical writing: Probable
4. Theft of programming "snippets" or perhaps "single lines of code" and the overall logic design pattern for that solution: Probable
5. Supremely interesting choice of datasets used vs. available, in summary use, but also checking for validation against other software and weighted upon reputation factors with "Coq"-like proofing, GitHub "Stars", Employer History?
6. Even though I can see my own writing and formatting right out of my README.md, the citation was to "Phoronix Forum", but that isn't true. That's like saying your post is something "Tick Tock" said. I wrote that; a real flesh and blood human being took comparatively massive amounts of time to do that. My birth name is there in the post 2 times [EDIT: post signature with my name no longer? Name not in "about" either hmm], in the repo, in the comments, all over the Internet.

[EDIT continued] Did it choose the Phoronix vector to that information because it was less attributable? It found my other repos in other ways. My Phoronix handle is the same as my GitHub username, and that handle is my name, easily inferable from either, and there is a biography link with my full name in the about section. [EDIT cont end]
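For anyone who wants to repeat the log correlation on their own server, here is a minimal sketch of the idea, not the exact method used for the numbers above. It assumes access logs in the common "combined" format, and the user-agent markers and sample lines are illustrative assumptions. A scraper can spoof or omit its user agent, so this only counts clients that identify themselves.

```python
import re
from collections import Counter

# Illustrative substrings of self-declared crawler user agents (an assumption,
# not a complete list; spoofed or generic agents will not be caught).
BOT_MARKERS = ("gptbot", "ccbot", "claudebot", "perplexitybot", "bytespider")

# Pulls the request line and the user agent out of a combined-format log line.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def classify(lines):
    """Count requests per client category, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue
        agent = match.group("agent").lower()
        if any(marker in agent for marker in BOT_MARKERS):
            counts["declared_bot"] += 1
        elif agent.startswith("git/"):
            # The stock git client identifies itself as git/<version>.
            counts["git_client"] += 1
        else:
            counts["other"] += 1
    return counts

def declared_bot_share(counts):
    """Fraction of parsed requests that came from self-declared crawlers."""
    total = sum(counts.values())
    return counts["declared_bot"] / total if total else 0.0

# Hypothetical sample lines, made up for illustration only.
SAMPLE_LINES = [
    '203.0.113.5 - - [15/Jun/2024:10:00:00 +0000] "GET /user/repo.git/info/refs?service=git-upload-pack HTTP/1.1" 200 512 "-" "git/2.43.0"',
    '198.51.100.7 - - [15/Jun/2024:10:01:00 +0000] "GET /user/repo/raw/master/README.md HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    '192.0.2.9 - - [15/Jun/2024:10:02:00 +0000] "GET /user/repo HTTP/1.1" 200 8192 "-" "Mozilla/5.0 (X11; FreeBSD amd64; rv:126.0) Gecko/20100101 Firefox/126.0"',
    'not a log line',
]

if __name__ == "__main__":
    counts = classify(SAMPLE_LINES)
    print(dict(counts), declared_bot_share(counts))
```

Tracing the remaining "git_client" and "other" addresses back to a company still needs a separate IP lookup, which this sketch leaves out.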

You should test this out for yourself as I'm not going to take days or a week making a great presentation of a technical case. Check your own niche code, a specific code question of application, or make a mock repo with super niche stuff with lots of code in the README.md and then check it against AI every day until you see it.
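One rough way to run that daily check is to measure how much of an AI answer repeats word sequences from your README. The sketch below is only an assumption about how you might do it: the function names, the n-gram length, and the sample strings are all hypothetical, and a high overlap suggests reuse without proving it.

```python
def word_ngrams(text, n=5):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(readme, answer, n=5):
    """Fraction of the answer's n-grams that also appear in the README."""
    answer_grams = word_ngrams(answer, n)
    if not answer_grams:
        return 0.0
    shared = answer_grams & word_ngrams(readme, n)
    return len(shared) / len(answer_grams)

if __name__ == "__main__":
    # Hypothetical strings standing in for a README and a pasted AI answer.
    readme = "harden the jail with strict sysctl defaults"
    answer = "you can harden the jail with custom rules"
    print(overlap_ratio(readme, answer, n=3))
```

Longer n-grams cut down on false positives from common phrases, at the cost of missing lightly reworded passages.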

P.S. I pulled up TabNine and tried to write Ruby so complicated and magically mashed that AI could offer me nothing, just as an AI obfuscation/smartness test. You should try something similar to see what results you get.


[–] [email protected] 21 points 4 months ago (2 children)

Lmao you got some criticism and now you’re saying everyone else is a bot or has an agenda. I am a software engineer and my organization does not gain any specific benefits for promoting AI in any way. They don’t sell AI products and never will. We do publish open source work however, and per its license anyone is free to use it for any purpose, AI training included. It’s actually great that our work is in training sets, because it means our users can ask tools like ChatGPT questions and it can usually generate accurate code, at least for the simple cases. Saves us time answering those questions ourselves.

I think that the anti-AI hysteria is stupid virtue signaling for luddites. LLMs are here, whether or not they train on your random project isn’t going to affect them in any meaningful way, there are more than enough fully open source works to train on. Better to have your work included so that the LLM can recommend it to people or answer questions about it.

[–] [email protected] 6 points 4 months ago (1 children)

The way that I see it, LLMs are a powerful tool to quickly and easily generate an output that should then be checked by a human. The problem is that it’s being shoehorned into every product it feasibly can be, often as an unchecked source of truth, by people who don’t understand it and just don’t want to miss out. If at any point you have to simply trust an LLM is “right”, it’s being used wrong.

[–] [email protected] 2 points 4 months ago (1 children)

Yeah this is super sensible. Out of curiosity, do you have any decent examples of bad usage? I find chatbots and GitHub Copilot type stuff to be fine. I find the rewording applications to be fine. I haven’t used it but Duolingo has an AI mode now and it is questionable sounding, but maybe it is elementary enough and fine tuned well enough for the content in the supported courses that errors are extremely rare or even detectable.

[–] [email protected] 3 points 4 months ago (1 children)

I would say chatbots are bad if their job is to provide accurate information, and the same goes for their use in search engines. GitHub, on the other hand, would be an example of a good use, as the code will be checked by whoever is using it. I also like all the image generation/processing uses, assuming that they aren’t taken as a source of truth.

[–] [email protected] 3 points 4 months ago (1 children)

Chatbots are fine as long as it’s clearly disclosed to the user that anything they generate could be wrong. They’re super useful just as an idea generating machine for example, or even as a starting point for technical questions when you don’t know what the right vocabulary is to describe a problem.

[–] [email protected] 1 points 4 months ago (1 children)

Yeah I was thinking more along the lines of customer support chatbots

[–] [email protected] 2 points 4 months ago

Oh yeah those are problematic, but I’m pretty sure a court has ruled in a customer’s favor when the AI fucked up, which is good at least.

[–] [email protected] 3 points 4 months ago* (last edited 4 months ago) (1 children)

you got some criticism and now you’re saying everyone else is a bot or has an agenda

Please look up ad hominem, and stop doing it. Yes, their responses are a distraction from the topic at hand, but so were the random posts calling OP paranoid. I'd have been on the defensive too.

[Our company] publish[es] open source work ... anyone is free to use it for any purpose, AI training included

Great, I hope this makes the models better. But you made that decision. OP clearly didn't. In fact, they attempted to use several methods to explicitly block it, and the model trainers did it anyway.

I think that the anti-AI hysteria is stupid virtue signaling for luddites

Many loudly outspoken figures against the use of stolen data for the training of generative models work in the tech industry, myself included (I've been in the industry for over two decades). We're far from Luddites.

LLMs are here

I've heard this used as a justification for using them, and reasonable people can discuss the merits of the technology in various contexts. However, this is not a justification for defending the blatant theft of content to train the models.

whether or not they train on your random project isn’t going to affect them in any meaningful way

And yet, they did it while ignoring explicit instructions to the contrary.

there are more than enough fully open source works to train on

I agree, and model trainers should use that content, instead of whatever they happen to grab off every site they happen to scrape.

Better to have your work included so that the LLM can recommend it to people or answer questions about it

I agree if you give permission for model trainers to do so. That's not what happened here.

[–] [email protected] -1 points 4 months ago (1 children)

Why do you think they need your permission to use information you posted publicly to train their models? Copyright isn’t unlimited, and model training is probably fair use.

[–] [email protected] 5 points 4 months ago (2 children)

"Your honor, we can use whatever data we want because model training is probably fair use, or whatever".

I don't know what's worse, the fact that you think creators don't have the right to dictate how their works are used, or that you apparently have no idea what fair use is.

This might help; https://copyright.gov/fair-use/

[–] [email protected] 2 points 4 months ago* (last edited 4 months ago) (1 children)

I mean, this is how courts work. Someone will sue because a work they hold copyright to was used in a training set without their authorization, the defendant will claim it was fair use, the judge will pick a side. To the best of my knowledge this hasn’t happened just yet, and since I’m not a judge, I use “probably”. Fair use is both vague and broad, and this is important to ensure copyright holders don’t have complete control over their work. It was recognized a long time ago that you can make works that utilize another copyrighted work, but don’t functionally replace the original work, and are therefore fair use. The whole point was to try and foster innovation, not to allow copyright holders to dictate how their works are used, and fair use is an essential part of that.

Training an LLM with a work doesn’t functionally replace that work. If there is a filter that prevents 1:1 reproduction, then it literally cannot. It also provides significant benefit to have these LLMs, they are a unique and valuable work themselves. That’s why it’s fair use.

[–] [email protected] 1 points 4 months ago (1 children)

Agreed on all points, except my personal interpretation of "fair use" specific to the case of generative models.

You call out "doesn't replace the original work". Is that not how you see an LLM Q/A bot replacing a user going to a git repo for established examples, or a website for an article (generating page views, subscriptions, ad revenue), or similar? Why would anyone go to the source materials if they're getting their answer from the bot?

This is practically the same as when Google started showing articles in AMP, and not bringing people to the original website, is it not?

[–] [email protected] 1 points 4 months ago (1 children)

How would an LLM answering questions about a git repo be legally different from a person answering those same questions (think stackoverflow)? Specific to this case, US law does not consider “APIs” to be copyrightable (Oracle v Google, Google reimplemented Java using the same APIs but their own implementation code, court ruled that Oracle couldn’t copyright the APIs).

Regarding “replace”, the primary use of the git repo is the code itself, not the Q&A about how to use it. The LLM doesn’t generate code that fully replaces that library or program, or if it does, it is distinct enough to be a different work.

[–] [email protected] 1 points 4 months ago

First, a chat bot is not an API. Second, they were talking about the formatting and delivery method of the data, not the content.

Regarding the output of the model: Some repos are entirely READMEs by their nature. No code, just documentation and walkthroughs. Notwithstanding that, if I set a flag that says "don't use my data" and they use it anyway, that's theft, even if it's only one file, even if the file is just a description of the code. That's my work, not yours. You don't get to use it however you want, unless I specifically note that it's public domain (or you use it and follow the license, like attributing me, or linking to the repo, etc).

As to the difference between a bot and a human (re: stack overflow)? The former is a representative of a company (automation or not, whether it's a bot or a page on their corporate site), the latter is a person relating experience and opinion. The legal difference is that one is using the data commercially, and the other is just a person in the world, answering another person's question for no reason other than a desire to be helpful (and if they're decent, attributing the source instead of claiming that they're generating wisdom on their own).

That last parenthetical used to be called plagiarism, by the way.

[–] [email protected] 0 points 4 months ago (2 children)

authors should have no say in how published works are used.

[–] [email protected] 1 points 4 months ago (1 children)

I already replied to the essence of this in my reply to your other post about how "illegal downloads aren't theft because it's a copy", but I'll mention here that this is even more evidence that you aren't a creator, and I suggest that your opinions on this subject aren't relevant, and you should avoid subjecting other people to them.

[–] [email protected] -1 points 4 months ago (1 children)

your attacks on my identity don't undercut my claims at all.

[–] [email protected] 1 points 4 months ago (1 children)

"evidence suggests that you probably aren't a creator" "As a result, I suggest that your opinions aren't relevant"

Aside from the fact that these are not character attacks, I encourage you to refute my assumptions. Otherwise, my points will stand on their own.

[–] [email protected] -1 points 4 months ago

on the internet, no one knows you're a dog. whether I have or not, saying so doesn't prove it. what I said stands on its own merits and your inability to make an argument without attacking identity speaks to the strength of your argument, your understanding of the subject, and your ability (or willingness) to engage in good faith.

[–] [email protected] 1 points 4 months ago (1 children)

Authors shouldn't be paid for their labor?

[–] [email protected] -2 points 4 months ago (1 children)

I didn't say that. you're making a leap of logic

[–] [email protected] 1 points 4 months ago (1 children)

Yes, I am. Logically, if an author creates something and cannot control its distribution, it is available to everyone at no cost, therefore the author will never see a dime for their labor.

This discounts the donation model, because in practice, it rarely pays the bills. It also ignores patronage, because I doubt that you want the creation of art to be dependent on the generosity of the rich.

Thus, it makes sense for the author to maintain certain rights over the product of their labor. They provide the work under their terms, e.g. requiring payment for a copy, and that relatively low cost to the average Joe provides the money they need to buy food, pay rent, etc.

[–] [email protected] -2 points 4 months ago (1 children)

you recognize two well known cases where copyright is not necessary to get paid. I don't think there is even an argument at this point. have a nice day.

[–] [email protected] 1 points 4 months ago (1 children)

Yes, and I said they're not feasible, because they've been tried in the past and present and found to not work very well. If you disagree, I'm happy to hear your thoughts.

[–] [email protected] -2 points 4 months ago (1 children)

you claim they are not feasible, but we know people do get paid through them, so you're just lying.

[–] [email protected] 1 points 4 months ago (1 children)

Yes, they do get paid, but not a living wage.

For the donation model, most people doing that work that I've talked to have day jobs, and do the other work on the side. There's a reason the donation platform buttons say things like "buy me a coffee" and not "pay my rent for the month": it's because the donations don't cover rent.

For the patronage model, like I said, I don't think anyone wants work like this to be controlled by a handful of rich people.

I'm still interested in hearing your thoughts if you have more than "nuh uh" and "you're lying".

[–] [email protected] -2 points 4 months ago (1 children)

only one person would need to be able to live on either model to disprove your claim. since that has definitely happened, you're definitely lying.

[–] [email protected] 1 points 4 months ago (1 children)

Just because it works once doesn't mean it'll work all the time for everyone.

[–] [email protected] -1 points 4 months ago (1 children)

any time it has worked proves you are wrong. the top 50 patreons clear over $100k a year

[–] [email protected] 1 points 4 months ago (2 children)

Exactly, and since it certainly follows a long tail distribution, the rest of the 250,000 creators on patreon make a tiny fraction of that. For the vast majority of people, it doesn't provide a primary income.

I'm not sure you want to rely on Patreon in any case, since it also relies on the retention of rights for profit. In your scenario, when they upload to Patreon, anyone involved could tell them to get fucked and pay the author nothing.

[–] [email protected] 0 points 4 months ago

Exactly, and since it certainly follows a long tail distribution, the rest of the 250,000 creators on patreon make a tiny fraction of that. For the vast majority of people, it doesn’t provide a primary income.

this is true for the vast majority of storytellers and artists and musicians through all of history.

[–] [email protected] -1 points 4 months ago

I'm not sure you want to rely on Patreon in any case, since it also relies on the retention of rights for profit. In your scenario, when they upload to Patreon, anyone involved could tell them to get fucked and pay the author nothing.

anyone could do that now. people still get paid.