this post was submitted on 17 Mar 2024

244 points (81.8% liked)

We're building a search engine to compete with DuckDuckGo. No JS, no WASM, no spying. Just a statically generated results page. (lemmy.world)

submitted 8 months ago by [email protected] to c/[email protected]

We're (a group of friends) building a search engine from scratch to compete with DuckDuckGo. It still needs a name and logo.

Here are some pictures (results not cherrypicked): https://imgur.com/a/eVeQKWB

Unique traits:

Backend written in pure Rust, frontend in HTML and CSS only - no JavaScript, PHP, SQL, etc.
Has a custom database, schema, engine, indexer, parser, and spider
Extensively themeable with CSS - theme submissions welcome
Only two crates used - TOML and Rocket (plus Rust's standard library)
Homegrown index - not based on Google, Bing, Yandex, Baidu, or anything else
Pages are statically generated - super fast load times
If an onion link is available, an "Onion" button appears to the left of the clearnet URL
Easy to audit - no JavaScript, WASM, etc.; requests can be audited with the F12 network tab
Works over Tor with strictest settings (official Tor hidden service address at the bottom of this post)
Allows for modifiers: hacker -news +youtube removes all results containing hacker news and only includes results that contain the word "youtube"
Optional tracker removal from results - on by default
No censorship - results are what they are (exception: underage material)
No ads in results - if we do ever have ads, they'll be purely text in the bottom right corner, away from results, no media
Everything runs in memory, no user queries saved.
Would make Richard Stallman smile :)
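
To illustrate the modifier syntax from the list above, here is a minimal Python sketch. The post doesn't say how the real parser works, so this is only one plausible reading: `-word` drops results containing that word, `+word` requires it, and plain words must all be present (that last rule is an assumption). The function names `parse_query` and `matches` are made up for this example and are not from the project.

```python
def parse_query(query):
    """Split a query into plain, required (+) and excluded (-) terms."""
    plain, required, excluded = [], [], []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        else:
            plain.append(token)
    return plain, required, excluded


def matches(text, query):
    """Return True if text satisfies the modifiers in query."""
    words = set(text.lower().split())
    plain, required, excluded = parse_query(query)
    if any(term in words for term in excluded):
        return False
    if not all(term in words for term in required):
        return False
    # Assumption: a result must also contain every plain term.
    return all(term in words for term in plain)
```

Under this reading, `matches("hacker talks on youtube", "hacker -news +youtube")` is true, while a page containing the word "news" is filtered out.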

THIS IS A PRE-ALPHA PRODUCT, it will get much MUCH better over the coming months. The dataset in the temporary hidden service linked below does not do our algorithm justice, it's there to prove our concept. Please don't judge the technology until beta.

Onion URL (hosted on my laptop since so many people asked for the link): ht6wt7cs7nbzn53tpcnliig6zrqyfuimoght2pkuyafz5lognv4uvmqd.onion

[–] [email protected] 16 points 8 months ago (3 children)

pay for it

I wonder what a distributed search engine would look like. Basically, the index would be sharded across user computers, and queries would hit some representative sample of that index. This means:

hosting costs are very low - just need a way to proxy requests to the network
search times should improve as more people use the service
no risk of the service logging anything - individual nodes don't need to know who requested the data, just who to send the response to

My biggest concern is how to build the index, but if OP is willing to share that, I might start hacking on a distributed version.
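
A rough sketch of the idea, just to make it concrete. Everything here is hypothetical (the `Peer` class, random page placement, the sampling policy), and real peers would talk over a network rather than live in one process:

```python
import random


class Peer:
    """Hypothetical peer holding a small inverted index for its share of pages."""

    def __init__(self):
        self.index = {}  # word -> set of URLs

    def add_page(self, url, text):
        for word in set(text.lower().split()):
            self.index.setdefault(word, set()).add(url)

    def search(self, word):
        # The peer only sees the word, not who originally asked for it.
        return self.index.get(word.lower(), set())


def build_network(pages, num_peers, seed=0):
    """Spread pages across peers at random (assumed placement policy)."""
    rng = random.Random(seed)
    peers = [Peer() for _ in range(num_peers)]
    for url, text in pages.items():
        rng.choice(peers).add_page(url, text)
    return peers


def sampled_search(peers, word, sample_size, seed=None):
    """Query a random sample of peers and merge whatever they return."""
    rng = random.Random(seed)
    chosen = rng.sample(peers, min(sample_size, len(peers)))
    results = set()
    for peer in chosen:
        results |= peer.search(word)
    return sorted(results)
```

Querying every peer returns the full result set; querying a smaller sample returns a subset of it, which is the trade-off between coverage and cost that the sampling idea implies.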

[–] [email protected] 13 points 8 months ago (2 children)

Don't start new; contribute to what already exists: https://en.wikipedia.org/wiki/YaCy

[–] [email protected] 3 points 8 months ago

Awesome! That's pretty much exactly what I'm looking for, though I'm interested to see how easy it is to limit certain peers to certain functions. Not everyone has resources to crawl and index pages, but a lot of people can store the index.

I'm interested in having client-side web storage, so you can participate in the network by just having the search page open (opt-in of course).

I'm honestly not actively working on it, but if OP provides the database and/or crawler, I'll do some research on feasibility.

[–] [email protected] 2 points 8 months ago (2 children)

This is really neat and I’m just hearing about it after over twenty years of development. I need to try it out, thank you. How do you stay in the know about this kind of stuff? I’m curious about all the cool stuff out there I wouldn’t even know I’m curious to find.

[–] [email protected] 6 points 8 months ago

How do you stay in the know about this kind of stuff?

By being terminally online, I guess?

More concretely, I've spent (probably too much) time on Slashdot, Reddit and now Lemmy over the years (subscribed to Free Software and privacy-related communities in particular). Also, looking through sites like https://awesome-selfhosted.net/ and https://www.privacytools.io/, wiki-walking through articles about Free Software projects on Wikipedia, browsing the Debian repositories, etc.

I'm sure there are plenty of things I haven't heard of either, though.

[–] [email protected] 3 points 8 months ago* (last edited 8 months ago)

How do you stay in the know about this kind of stuff? I’m curious about all the cool stuff out there I wouldn’t even know I’m curious to find.

I was going to mention YaCy as well if nobody else was, so I can chip in to this somewhat. My method is to keep wondering and researching. In this case it was a matter of being interested in alternative search engines and different applications of peer to peer/decentralized technologies that led me to finding this.

So from this you might go: take something you're even passingly interested in, try to find more information about it, and follow whatever tangential trails it leads to. With rare exceptions, there are good chances someone out there on the internet will also have had some interest in whatever it is, asked about it, and written about it.

Also be willing to make throwaway accounts to get into the walled gardens for whatever info might be buried away there and, if you think others may be interested, share it outside of those spaces.

[–] [email protected] 6 points 8 months ago (1 children)

I wonder what a distributed search engine would look like.

Isn't that what Searx is/can be?

https://en.wikipedia.org/wiki/Searx#Instances

I admit it's not something I've looked closely at.

[–] [email protected] 7 points 8 months ago (2 children)

No, Searx is a metasearch engine that queries and aggregates results from multiple normal search engines (Google, Bing, etc.)

A distributed search engine would be more like YaCy, which does its own crawling and stores the index as a distributed hash table shared across all instances.
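
For anyone unfamiliar with the idea, here is a toy Python illustration of storing an index in a distributed hash table. This is not YaCy's actual scheme, only a generic sketch: each search term is hashed, and the peer whose hashed name is closest by XOR distance (an assumption borrowed from Kademlia-style DHTs) stores the URLs for that term. `ToyDHT` and the peer names are invented for the example.

```python
import hashlib


def key_for(text):
    """Hash a string to a 64-bit integer key (stable across runs)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def responsible_peer(term, peer_names):
    """Pick the peer whose key is closest to the term's key by XOR distance."""
    term_key = key_for(term)
    return min(peer_names, key=lambda name: key_for(name) ^ term_key)


class ToyDHT:
    """In-process stand-in for a network of peers sharing one index."""

    def __init__(self, peer_names):
        self.peer_names = list(peer_names)
        self.storage = {name: {} for name in self.peer_names}

    def publish(self, term, url):
        peer = responsible_peer(term, self.peer_names)
        self.storage[peer].setdefault(term, set()).add(url)

    def lookup(self, term):
        peer = responsible_peer(term, self.peer_names)
        return self.storage[peer].get(term, set())
```

Any node can work out which peer to ask for a term without a central server, which is the property that separates this from a metasearch engine.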

[–] [email protected] 1 points 8 months ago

Ah thanks - appreciate the clarification.

[–] [email protected] 1 points 8 months ago

Exactly. The main difference I would bring is a web client that hooks into the network, and perhaps an alternative client (e.g. I'm interested in Tauri, so I may rewrite part of the backend in Rust).

But I'm probably not going to start on this project on my own. DDG is good enough for now, so I'm putting my efforts elsewhere.

[–] [email protected] 3 points 8 months ago (1 children)

i feel that decentralized search is an extremely valuable thing to start thinking about. but the devil is in practically every one of the details.

[–] [email protected] 1 points 8 months ago

Yup. Even if you trust all your peers (which isn't reasonable), there's still a ton of practical issues that need to be resolved:

pagination with a different set of peers
moderation of CSAM and whatnot
outdated peers and stale data
how much data gets transferred, and where results are reduced

It's a really complex problem without getting p2p involved, and p2p just adds a ton of other problems.
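
To give a feel for just two of those (stale data and pagination), here is a small Python sketch. The tuple layout `(url, score, crawled_at)`, the `max_age` cutoff, and both function names are assumptions made up for illustration:

```python
def merge_results(peer_results, now, max_age):
    """Merge (url, score, crawled_at) tuples from several peers.

    Keeps the freshest copy of each URL and drops entries older than max_age.
    """
    freshest = {}
    for results in peer_results:
        for url, score, crawled_at in results:
            if now - crawled_at > max_age:
                continue  # stale entry from an outdated peer
            known = freshest.get(url)
            if known is None or crawled_at > known[1]:
                freshest[url] = (score, crawled_at)
    # Sort by score (descending), then URL, so ties have a fixed order.
    return sorted(
        ((url, score) for url, (score, _) in freshest.items()),
        key=lambda item: (-item[1], item[0]),
    )


def page(merged, page_number, page_size):
    """Return one page (1-based) of an already merged result list."""
    start = (page_number - 1) * page_size
    return merged[start:start + page_size]
```

This only makes pages stable when the same peer responses are merged each time. If page 2 is answered by a different set of peers than page 1, the merged list itself changes, which is exactly the hard part.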

So I'm probably going to stick with building my Reddit clone, which I think is simpler (search doesn't need to happen at the start).