this post was submitted on 30 Dec 2024
31 points (97.0% liked)

Piracy: ꜱᴀɪʟ ᴛʜᴇ ʜɪɢʜ ꜱᴇᴀꜱ


Not sure if this is the right community, but it seems close enough.

Ideally I want a URL prefix that I can put any paywalled news article link after, and that will return the unpaywalled version.

E.g.: https://somedomain/https://somenewssite/somenewsartle

I need it to work with https://pypi.org/project/newspaper4k/

Alternatively, if someone knows of another Python library that can automatically extract article text and images from just a link, that would also solve my problem.

top 9 comments
[–] [email protected] 1 points 3 days ago

12ft works, if you really need to. But in general, I just don’t read any publications that paywall their content. Mass media is all owned by one or two billionaires, if they need money they can get it from them.

[–] [email protected] 7 points 6 days ago (1 children)

Looks like newspaper4k uses headless Chrome. You could try loading the Bypass Paywalls Clean extension and browsing the pages directly.

I regularly use it (in Firefox) without even thinking about it. Only notice when I send someone an article they can't access.

[–] [email protected] 1 points 5 days ago (1 children)

It does not use headless Chrome; it just uses the Python requests library. Did you get got by an AI hallucination?

Source: I went digging in the source code.

[–] [email protected] 2 points 4 days ago (1 children)

No, just this example code from their site:

browser = p.chromium.launch(headless=True)

My mistake was not knowing where newspaper4k fits in the stack. They're wrapping it with Playwright, which it seems you could do here.

[–] [email protected] 1 points 4 days ago

Ahh, I see. I'm using newspaper4k to fetch articles directly. It seems the example you found uses it purely as a parser, with Playwright as the HTML fetcher. I might try that approach.
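
For anyone following along, the split discussed above (Playwright fetches the rendered HTML, newspaper4k only parses it) might look roughly like the sketch below. The function names `fetch_rendered_html`, `parse_article` and `extract_article` are made up for illustration and are not part of either library. It assumes `playwright` and `newspaper4k` are installed and that a Chromium build has been set up with `playwright install chromium`. Nothing here is confirmed to get past any particular paywall; it only shows where each library sits in the stack.

```python
from typing import Callable, Dict, Optional


def fetch_rendered_html(url: str, timeout_ms: int = 30_000) -> str:
    """Load the page in headless Chromium and return the rendered HTML."""
    # Imported lazily so the parser below can be used without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()


def parse_article(url: str, html: str) -> Dict[str, object]:
    """Use newspaper4k purely as a parser on HTML fetched elsewhere."""
    from newspaper import Article

    article = Article(url)
    # input_html skips newspaper's own requests-based download step.
    article.download(input_html=html)
    article.parse()
    return {
        "title": article.title,
        "text": article.text,
        "top_image": article.top_image,
        "images": list(article.images),
    }


def extract_article(
    url: str, fetch_html: Optional[Callable[[str], str]] = None
) -> Dict[str, object]:
    """Fetch with the given fetcher (Playwright by default), then parse."""
    html = (fetch_html or fetch_rendered_html)(url)
    return parse_article(url, html)
```

Keeping the fetcher swappable means the same parsing step works whether the HTML comes from Playwright, plain requests, or a file saved from a browser.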

[–] [email protected] 3 points 6 days ago (1 children)
[–] [email protected] 1 points 6 days ago

Yeah, I've tried that. Only some of them work in a way that's easy to implement, but if the one I'm currently using goes down then I guess I'll have to bodge something together.

[–] [email protected] 15 points 1 week ago

Generally, 12ft.io works pretty well for me.

[–] [email protected] 10 points 1 week ago* (last edited 1 week ago)

Most of the time archive.today gets the work done

It also offers a URL to get a snapshot from a given URL: http://archive.is/newest/http://lemmy.dbzer0.com/c/piracy
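
Since the original question asked for exactly this kind of prefix-style URL, here is a minimal sketch of building it in Python. The helper name `newest_snapshot_url` is hypothetical, and the base URL is simply the pattern quoted above. Whether a plain requests-based fetch (which is what newspaper4k does by default) is actually served the snapshot rather than a redirect or challenge page is an assumption this sketch does not verify.

```python
def newest_snapshot_url(
    article_url: str, archive_base: str = "http://archive.is/newest/"
) -> str:
    """Prefix an article URL with the 'newest snapshot' pattern from the thread."""
    # The article URL is appended as-is, matching the example in the comment;
    # no percent-encoding is applied.
    return archive_base + article_url


if __name__ == "__main__":
    print(newest_snapshot_url("http://lemmy.dbzer0.com/c/piracy"))
```

The resulting string can then be handed to whatever fetcher is in use in place of the original article link.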