antonim

joined 1 year ago
[–] [email protected] 2 points 8 hours ago

(Sorry for the late response.) Well it depends a lot on the site. Since I focus on books and scholarly articles, the ideal way is to find the URL of the original PDF. The website might show you just individual pages as images, but it might hide the link to the PDF somewhere in the code. Alternatively, you might just obtain all the URLs of the individual page images, put them all into a download manager, and later bundle them all into a new PDF. (When you open the "inspect element" window, you just have to figure out which part of the code is meant to display the pages/images to you.) Sometimes the PDFs and page images can be found in your browser cache, as I mention in the OP. There's quite some variety among the different sites, but with even the most rudimentary knowledge of web design you should be able to figure out most of them.
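To give a rough idea of the "figure out which part of the code displays the pages" step, here is a minimal Python sketch using only the standard library. It assumes the viewer puts its pages in plain `<img>` tags, which is often not the case. The names `PageImageCollector` and `collect_image_urls`, the `library.example` URL, and the `data-src` fallback are all made up for illustration, not taken from any particular site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageImageCollector(HTMLParser):
    """Collect the URL of every <img> tag in a page, in document order."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # Assumption: lazy-loading viewers may keep the real URL in data-src.
        src = attributes.get("src") or attributes.get("data-src")
        if src:
            # Resolve relative paths against the page address.
            self.image_urls.append(urljoin(self.base_url, src))


def collect_image_urls(html, base_url):
    """Return absolute image URLs found in the given HTML string."""
    collector = PageImageCollector(base_url)
    collector.feed(html)
    return collector.image_urls


# Hypothetical viewer markup, just to show the call.
sample = """
<div class="viewer">
  <img src="/scans/page-001.jpg">
  <img data-src="/scans/page-002.jpg">
</div>
"""
for url in collect_image_urls(sample, "https://library.example/book/42"):
    print(url)
```

The printed list is what you would paste into a download manager. Sites that build the page list with JavaScript won't show up this way, and for those the network tab of the inspector is the better place to look.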

If you need help with ripping something in particular, DM me and I'll give it a try.

[–] [email protected] 15 points 1 week ago

I never said I follow the law, I'm just wondering what the law says ;)

[–] [email protected] 2 points 1 week ago (3 children)

Honestly much of your reply is confusing me and doesn't seem to be relevant to my questions. This is what I think is crucial:

Just because a file is cached on your device does not mean you are the legal owner of that content forever.

What does being "the legal owner forever" actually entail, either with regards to a physical book or its scan? And what does that mean regarding what I can legally do with the cached file on my computer?

 

Quite frequently I come across scanned books that are viewable for free online. For example, the publisher put them there (such as preview chapters), a library (old books from their collection that are in public domain), etc. Since I like hoarding data, and the online viewers that are used to present the book to me might not be very practical, I frequently try to download the books one way or another. This requires toying with the "inspect element" tool and various other methods of getting the images/PDF. Now, all that I access is what is, well, accessible; I don't hack into the servers or something. But - the stuff is meant to be hidden from the normal user. Does that act of hiding the material, no matter how primitive and easily circumvented, mean that I'm not allowed to access it at all?

I suppose ripping a public domain book is no big deal, but would books under copyright fare differently?

Mainly I'm asking out of curiosity, I don't expect the police to come visit me for ripping a 16th century dictionary.

Note: I live in the EU, but I'd be curious to hear how this is treated elsewhere too.

Edit: I also remembered a funny trick I noticed on one site - it allows viewing PDFs on their website, but not downloading, unless you pay for the PDF. But when you load the page, even without paying, the PDF is already downloaded onto your computer and can be found in the browser cache. Is it legal to simply save the file that is already on your computer?

[–] [email protected] 25 points 1 week ago* (last edited 1 week ago) (2 children)

FYI, there are multiple methods to download "digitally loaned" books off IA, the guides exist on reddit. The public domain stuff is safe, but the stuff that is still under copyright yet unavailable by other means (Libgen/Anna's Archive, or even normal physical copies) should definitely be ripped and uploaded to LG.

The method I use, which results in best images, is to "loan" the book, zoom in to load the highest resolution, and then leaf through the book. Periodically extract the full images from your browser cache (with e.g. MZCacheView). This should probably be automatised, but I'm yet to find a method, other than making e.g. an Autohotkey script. When you have everything downloaded, the images can be easily modified (if the book doesn't have coloured illustrations IMO it is ideal to convert all images to black-and-white 2-bit PNG), and bundled up into a PDF with a PDF editor (I use X-Change Editor; I also like doing OCR, adding the bookmarks/outline, and adding special page numbering if needed - but that stuff can take a while and just makes the file easier to handle, it's not necessary). Then the book can be uploaded to proper pirate sites and hopefully live on freely forever. Also there are some other methods you can find online, on reddit, etc.

[–] [email protected] 33 points 1 week ago

Produce infinite copies of bread loaves, and then get arrested because the baker lobby doesn't like that.

[–] [email protected] 1 points 1 week ago

Came here to post this.

[–] [email protected] 1 points 2 weeks ago

Yeah I'm wondering as well. It seems to save webpages, whereas the issue is with scanned books which may be removed from IA...

[–] [email protected] 2 points 2 weeks ago* (last edited 2 weeks ago) (1 children)

So child porn is okay then? You would already have it on your system

You'd have to look for it, knowing full well that it is illegal to produce in the first place and distribute to others, access it online, and then deliberately retain it. It's not really the same as something that's legal to produce and distribute (it is certainly legal for me to view your site). You wouldn't "already" have it.

I doubt you are either.

Well I've read some copyright laws, had to solve some issues regarding usage of copyrighted works, etc. Nothing that makes me an expert, but I'm not talking wholly out of my ass either.

It does… on paper… A lot. https://time.com/6266147/internet-archive-copyright-infringement-books-lawsuit/ To the point it’s losing lawsuits over exactly that.

That's not the Wayback Machine per se, that's the Internet Archive's book scanning and "digital lending" system, which was most definitely doing legally questionable (and stupid) things even to an amateur eye. However, the Wayback Machine making read-only copies of websites has so far never been successfully challenged.

[–] [email protected] 6 points 2 weeks ago

What do you mean by "saving a copy"? I still have the .doc file somewhere in my emails. If I told you I'm a serious published writer, and then you asked me where you can read my texts, and I sent you a .doc that hasn't been proofread, would you take me seriously?

[–] [email protected] 4 points 2 weeks ago (3 children)

You don’t have any rights to do anything else with it.

That's patently false. At a minimum, I can quote parts of your content, just as you can quote smaller portions of any published text anywhere, you don't have to ask the publisher or author for permission. It's also ridiculous and impossible to control, the content is on my private machine already, how can any law be relevant or exerted upon what I do there? I doubt you're writing this comment on the basis of your knowledge of copyright law.

Incorrect. Your browser made it do that. How that data is accessed and displayed is not controlled by me.

You're arguing semantics that really don't make any difference. The display is irrelevant, because the data by itself is stored on my computer before it is displayed. That data is what you've put up online to be accessed.

Owning the CD grants you a license to the content on that CD. That’s about as good as ownership gets there. They own the CD/license. As long as that CD exists/works. You don’t gain that same right by simply visiting a website.

I fail to see the difference between getting a CD with some data (buying it or being given it for free, e.g. as a gift) and being sent some data online for free. More importantly - says who? Does copyright law say this about websites?

If an artist makes a painting… and posts a picture of it. They have no rights to the painting anymore? They deserve no ownership/pay for what they’ve done?

This simply doesn't follow from what I've written. They certainly retain the rights to the painting. Besides, "deserving pay" depends on completely different factors than the ones we're discussing, usually artists sell the actual object, the painting. A digital reproduction is, as far as most people care (I think), merely an informative reproduction, and not the real thing. Stuff that's posted online for free is... free. It wasn't intended to make money directly.

Your final paragraph is really confusing me, you seem to be saying that Wayback Machine is also committing theft, which I'm pretty sure is not true (I've followed the lawsuits against IA for a while and don't remember anyone invoking that term). And at this point I don't know what "theft" is even supposed to mean to you or to anyone else, and what was the point of the discussion anyway. Maybe I should reread the whole discussion carefully all over again, but I'm on my phone and it's all giving me a headache.

[–] [email protected] 17 points 2 weeks ago* (last edited 2 weeks ago) (3 children)

, it’s a salty article

Actually the author himself is somewhat harmed by this situation. I would be salty too. When I wish to write my CV, I can say: my texts have been published at X and Y. Especially nice if it's an important and well known publication. Now a part of his CV is literally erased, he can't access his own texts anymore (not even on Internet Archive). That's... utterly ridiculous. It's a common practice to send the author a copy (or multiple) of the text he has published, he has every right to own a copy of them. Now the copy that was intended to be available to everyone is not available even to him. Something of the sort has happened to me too, when a website I published an article on underwent a redesign and now the text just isn't available anymore. Admittedly it's still on IA, but it's an awkward situation.

 
96
submitted 8 months ago* (last edited 8 months ago) by [email protected] to c/[email protected]
 

https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2024-01-10/Traffic_report

Here's the top 50 list, with the number of views in brackets. The actual article also includes commentary and the dates with the peak number of views.

  1. ChatGPT [52,565,681]
  2. Deaths in 2023 [48,603,284]
  3. 2023 Cricket World Cup [38,723,498]
  4. Oppenheimer (film) [31,265,503]
  5. J. Robert Oppenheimer [28,681,943]
  6. Cricket World Cup [26,390,217]
  7. Jawan (film) [23,112,884]
  8. Taylor Swift [22,179,656]
  9. The Last of Us (TV series) [21,000,722]
  10. Pathaan (film) [20,614,066]
  11. Premier League [19,968,486]
  12. Barbie (film) [19,930,916]
  13. Cristiano Ronaldo [19,287,757]
  14. The Idol (TV series) [19,186,512]
  15. United States [18,135,421]
  16. Matthew Perry [17,882,508]
  17. Lionel Messi [17,768,818]
  18. Animal (2023 film) [16,988,676]
  19. Elon Musk [16,026,256]
  20. India [15,200,006]
  21. Avatar: The Way of Water [15,062,733]
  22. Lisa Marie Presley [14,812,928]
  23. Guardians of the Galaxy Vol. 3 [14,155,874]
  24. Russian invasion of Ukraine [13,998,378]
  25. Leo (2023 Indian film) [13,994,461]
  26. List of highest-grossing Indian films [13,904,959]
  27. 2023 Israel–Hamas war [13,647,220]
  28. Israel [13,344,140]
  29. Andrew Tate [13,604,475]
  30. Elizabeth II [13,021,033]
  31. David Beckham [12,850,994]
  32. Fast X [12,763,269]
  33. Sinéad O'Connor [12,712,846]
  34. Spider-Man: Across the Spider-Verse [12,705,868]
  35. Elvis Presley [12,584,150]
  36. Killers of the Flower Moon (film) [12,525,826]
  37. Twitter [12,220,814]
  38. List of American films of 2023 [12,197,227]
  39. Travis Kelce [12,155,733]
  40. The Super Mario Bros. Movie [12,065,680]
  41. Pedro Pascal [12,022,551]
  42. Charles III [11,978,873]
  43. Donald Trump [11,925,480]
  44. Tina Turner [11,634,915]
  45. Indiana Jones and the Dial of Destiny [11,563,900]
  46. Joe Biden [11,152,150]
  47. John Wick: Chapter 4 [11,133,720]
  48. Gadar 2 [11,129,684]
  49. Everything Everywhere All at Once [11,115,623]
  50. Margot Robbie [11,041,143]
 

From https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2023-10-03/Recent_research

By Tilman Bayer

A preprint titled "Do You Trust ChatGPT? -- Perceived Credibility of Human and AI-Generated Content" presents what the authors (four researchers from Mainz, Germany) call surprising and troubling findings:

"We conduct an extensive online survey with overall 606 English speaking participants and ask for their perceived credibility of text excerpts in different UI [user interface] settings (ChatGPT UI, Raw Text UI, Wikipedia UI) while also manipulating the origin of the text: either human-generated or generated by [a large language model] ("LLM-generated"). Surprisingly, our results demonstrate that regardless of the UI presentation, participants tend to attribute similar levels of credibility to the content. Furthermore, our study reveals an unsettling finding: participants perceive LLM-generated content as clearer and more engaging while on the other hand they are not identifying any differences with regards to message’s competence and trustworthiness."

The human-generated texts were taken from the lead section of four English Wikipedia articles (Academy Awards, Canada, malware and US Senate). The LLM-generated versions were obtained from ChatGPT using the prompt: 'Write a dictionary article on the topic "[TITLE]". The article should have about [WORDS] words.'

The researchers report that

"[...] even if the participants know that the texts are from ChatGPT, they consider them to be as credible as human-generated and curated texts [from Wikipedia]. Furthermore, we found that the texts generated by ChatGPT are perceived as more clear and captivating by the participants than the human-generated texts. This perception was further supported by the finding that participants spent less time reading LLM-generated content while achieving comparable comprehension levels."

One caveat about these results (which is only indirectly acknowledged in the paper's "Limitations" section) is that the study focused on four quite popular (i.e. non-obscure) topics – Academy Awards, Canada, malware and US Senate. Also, it sought to present only the most important information about each of these, in the form of a dictionary entry (as per the ChatGPT prompt) or the lead section of a Wikipedia article. It is well known that the output of LLMs tends to have fewer errors when it draws from information that is amply present in their training data (see e.g. our previous coverage of a paper that, for this reason, called for assessing the factual accuracy of LLM output on a benchmark that specifically includes lesser-known "tail topics"). Indeed, the authors of the present paper "manually checked the LLM-generated texts for factual errors and did not find any major mistakes," something that is well reported to not be the case for ChatGPT output in general. That said, it has similarly been claimed that Wikipedia, too, is less reliable on obscure topics. Also, the paper used the freely available version of ChatGPT (in its 23 March 2023 revision) which is based on the GPT 3.5 model, rather than the premium "ChatGPT Plus" version which, since March 2023, has been using the more powerful GPT-4 model (as does Microsoft's free Bing chatbot). GPT-4 has been found to have a significantly lower hallucination rate than GPT 3.5.
