My main account is [email protected]. However, as of roughly ~~24 hours ago~~ (it seems this has been going on since March 10th and has gotten worse since), the server has stopped properly retrieving content from lemmy.world.
It's been running smoothly for well over 9 months, and (I think) working fine for content coming in from other instances. So I'm curious if anyone else experienced anything strange with lemmy.world federation recently?
Setup Description
The server flow in my case is as follows:
[Public Internet] <-> [Digital Ocean Droplet] <-> [ZeroTier] <-> [Physical Machine in my Basement (HW Info)]
The Digital Ocean droplet is a virtual host machine that forwards requests via nginx to the physical machine where a second nginx server (running the standard lemmy nginx config) then forwards the request to the lemmy server software itself.
Current Status
Lemmy Internal Error
I've found this in my lemmy logs:
2024-03-24T00:42:10.062274Z WARN lemmy_utils: error in spawn: Unknown: Request limit was reached during fetch
0: lemmy_apub::objects::community::from_json
at crates/apub/src/objects/community.rs:126
1: lemmy_apub::fetcher::user_or_community::from_json
at crates/apub/src/fetcher/user_or_community.rs:87
2: lemmy_server::root_span_builder::HTTP request
with http.method=POST http.scheme="http" http.host=social.packetloss.gg http.target=/inbox otel.kind="server" request_id=688ad030-f892-4925-9ce9-fc4f3070a967
at src/root_span_builder.rs:16
I'm thinking this could be the cause ... though I'm not sure how to raise the limit (it seems to be hard coded). I opened an issue with the Lemmy devs but I've since closed it while gathering more information/making sure this is truly an issue with the Lemmy server software.
Nginx 408 and 499s
I'm seeing the Digital Ocean nginx server reporting 499s on various "/inbox" route requests, and the nginx running on the physical machine that talks directly to Lemmy reporting 408s on the same route.
There are some examples in this comment: https://lemmy.world/comment/8728858
If you're only failing on lemmy.world like others suggest, but federation with other instances is working, it's likely you're getting large queries that are timing out on the lemmy.world traffic. You can check for timeouts in your proxy's logs.
Troubleshooting tips
This guide is for tuning postgres if you find that it's the issue. It has good tips, including how to analyze problems on postgres, though it's oriented toward tuning for Matrix. work_mem (and the temp files postgres falls back to when it's too low) will likely need ongoing tuning to meet your demands if postgres has limited memory.
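As a quick sanity check on the temp-file side, something like this against the lemmy database (I'm guessing at the database name, adjust to yours) shows whether postgres has been spilling to disk, and work_mem can be bumped with just a reload:

```sql
-- How often this database has spilled sorts/hashes to temp files, and how much
SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_size
FROM pg_stat_database
WHERE datname = 'lemmy';  -- assumption: your database is actually named "lemmy"

-- If temp_files keeps climbing, try raising work_mem and re-check
ALTER SYSTEM SET work_mem = '16MB';  -- a starting point only, size it to the RAM you can spare
SELECT pg_reload_conf();
```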
Lastly, I wanted to note that Lemmy does application-level connection pooling, so if you have a pooler service in front of postgres, don't; let Lemmy connect directly to postgres.
Hope these help. I went from a recent backlog of 1 million federation activities from lemmy.world to near none. I believe I'm at a RAM limit and a couple of expensive queries (deleting posts, for example) are creating temp files and taking too long.
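If you want to see whether one of those expensive queries is running right now, a quick look in psql along these lines (just a sketch) will surface the long runners:

```sql
-- Currently running statements, longest first (idle sessions excluded)
SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;
```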
So, of the responses so far, I think you're the most on the right track...
It seems to be exclusively content coming in from lemmy.world. If you look here, my most recent comment does show up on lemmy.world:
https://social.packetloss.gg/comment/1415801 https://lemmy.world/comment/8710941
The instance just isn't getting any new posts, comments, or votes back from lemmy.world.
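For what it's worth, one way to double-check that from the database side would be something like this (assuming the 0.19 schema still has a received_activity table with ap_id and published columns; I haven't verified that against the migrations):

```sql
-- When did this instance last record *any* activity from lemmy.world?
SELECT max(published) AS last_received
FROM received_activity
WHERE ap_id LIKE 'https://lemmy.world/%';  -- assumption: activity ids start with the sending instance's domain
```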
Every time I shut down the lemmy server I see this:
That number never seems to move; there are always 1618 workers. I'm not sure if that means anything or not regarding pending processing or what have you.
I'm seeing this in my publicly facing nginx logs:
There's then an internal nginx server that sees:
So, things did start timing out. I'm not sure what to do about that though.
This server is not resource starved:
It's just the lemmy.world data that's suddenly out of whack after months of normal operation (both on Lemmy 0.18 and 0.19).
It feels like a bad payload that the server just can't move past for some reason and lemmy.world keeps sending.
I had logging on the lemmy container itself piped to /dev/null because it's just such a noisy log. I turned it back on... I'll see if I can find more information next time lemmy.world posts.
Another note: you want a federation round trip to complete in under 3 seconds. Those 499s are likely 10-second timeouts from lemmy.world's side, mostly with a long query as the culprit. Lemmy federation in 0.19 processes changes in order per instance, so lemmy.world won't sync newer stuff and will keep sending the same large request. I believe there is a retry limit, though.
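If you want to catch which queries are blowing past that window, one option (a sketch; pick whatever threshold you like, and note it needs superuser) is to log anything slower than the 3-second budget:

```sql
-- Log every statement that takes longer than 3 seconds
ALTER SYSTEM SET log_min_duration_statement = '3s';
SELECT pg_reload_conf();
```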
Yeah, I mean things should be fine in general; like I said this has been working for quite a long time now without issue.
The machine that's actually doing the work here is quite powerful and is used to run several game servers in addition to Lemmy ... Lemmy really isn't much more than a footnote in resource usage:
How much RAM is postgres using? (There's a query at the end of this comment you can use to check what it's configured for.) I'm running on an 11th Gen i7 with 16 GB of RAM, so your hardware is fine. This is the resource usage of my setup. The first six are Lemmy itself, broken up for scaling and redundancy; I have a node down right now, normally there are three http ones. The http ones receive federation from other instances, and the lemmy-instance2s are the databases, set up to be highly available as well.
NAME CPU(cores) MEMORY(bytes)
lemmy-6b9fbd75f6-4q4n8 1m 5Mi
lemmy-fed-0 2m 18Mi
lemmy-fed-1 1m 18Mi
lemmy-fed-2 6m 17Mi
lemmy-http-57886f784f-5x9vb 4m 73Mi
lemmy-http-57886f784f-tpbtb 13m 121Mi
lemmy-instance2-czc6-0 47m 3362Mi
lemmy-instance2-jpw7-0 13m 1017Mi
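If you're not sure how much RAM postgres is using, this will at least show what it's configured to use (a sketch; shared_buffers is the big one, and compare it against what the container actually consumes):

```sql
-- Memory-related settings postgres is currently running with
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'effective_cache_size', 'max_connections');
```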
Not sure what you're using to generate that list; the formatting is a bit difficult to read.
I don't have a cluster since it's effectively a single-user instance (me + @[email protected]; in theory a few other people have access, but they're not active) on a single machine. It's just more or less the out-of-the-box docker setup on a bare metal machine in my basement + a Digital Ocean droplet.
The droplet is what I'm using to get a static IP and avoid dynamic DNS nonsense, + it provides some level of protection against the kind of naive DDoS attack random fediverse servers get (since, in the worst case, I can get on my phone and sever the ZeroTier connection that connects the droplet to my basement server).
I'm pretty confident whatever is going on is payload related at this point.
I've got quite a bit of experience with postgres; I don't see any indication it's the problem.