this post was submitted on 26 Jul 2024
661 points (97.4% liked)

Technology

[–] [email protected] 16 points 1 month ago (6 children)

I work for a different sort of company that hosts some publicly available user generated content. And honestly the crawlers can be a serious engineering cost for us, and supporting them is simply not part of our product offering.

I can see how reddit users might have different expectations. But I just wanted to offer a perspective. (I'm not saying it's the right or best path.)

[–] [email protected] 4 points 1 month ago* (last edited 1 month ago) (5 children)

Can you use something like a DDoS filter to block automated AI scraping (i.e. too many requests per second)?

I'm not a tech person so probably don't even know what I'm talking about.

[–] [email protected] 5 points 1 month ago* (last edited 1 month ago)

I worked with a company that used product data from competitors (you can debate the morals of it, but everyone is doing it). Their crawlers were set up so that each new line of requests came from a new IP. I don’t recall the name of the service, and it did not offer that many unique IPs, but it did allow their crawlers to run unhindered.

They didn’t do IP banning themselves for the same reason, but they did notice that one of their competitors did not change its IP when scraping them. If they had had malicious intent, they could have changed the data served to that IP only, e.g. raising or lowering the prices so the competitor ended up with bad data.

I’d imagine companies like OpenAI have many times as many IPs and could do something similar, meaning that if you try to ban IPs, you might hit real users as well, which would be unfortunate.
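To make the point about rotating IPs concrete, here is a minimal sketch of a per-IP rate limiter and what happens when a crawler spreads its requests across many addresses. Everything in it is an assumption for illustration: the `PerIpRateLimiter` class, the limit of 5 requests per second, and the documentation-range IP addresses are all made up, and a real DDoS filter is far more involved than this.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Hypothetical sliding-window limiter: max_requests per window per IP."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip, now):
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


def count_allowed(limiter, ips, total_requests, step=0.01):
    """Send total_requests, cycling through ips, and count how many get through."""
    allowed = 0
    for i in range(total_requests):
        ip = ips[i % len(ips)]
        if limiter.allow(ip, i * step):
            allowed += 1
    return allowed


# One crawler IP sending 100 requests within a second: mostly blocked.
single_ip = count_allowed(PerIpRateLimiter(5, 1.0), ["203.0.113.1"], 100)

# Same 100 requests rotated across 20 IPs: each IP stays under the limit.
pool = [f"198.51.100.{n}" for n in range(1, 21)]
rotating = count_allowed(PerIpRateLimiter(5, 1.0), pool, 100)

print(single_ip, rotating)
```

With these made-up numbers the single-IP crawler gets 5 of its 100 requests through, while the rotating crawler gets all 100 through, because no individual address ever exceeds the limit. Tightening the limit enough to catch the rotating crawler would start blocking ordinary users too, which is the trade-off described above.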
