ruud

joined 1 year ago
submitted 1 year ago* (last edited 1 year ago) by [email protected] to c/[email protected]
 

We've updated Lemmy.world to Lemmy 0.18.1.

For the release notes, see https://lemmy.world/post/1139237

submitted 1 year ago* (last edited 1 year ago) by [email protected] to c/[email protected]
 

For those who find it interesting, enjoy!

submitted 1 year ago* (last edited 1 year ago) by [email protected] to c/[email protected]
 

Another day, another update.

More troubleshooting was done today. Here's what we did:

  • Yesterday evening, @phiresky@[email protected] did some SQL troubleshooting with some of the lemmy.world admins. Afterwards, phiresky submitted some PRs on GitHub.
  • @[email protected] created a Docker image containing 3 PRs: Disable retry queue, Get follower Inbox Fix, Admin Index Fix.
  • We started using this image and saw a big drop in CPU usage and disk load.
  • We saw thousands of errors per minute in the nginx log from old clients trying to access the websockets (which were removed in 0.18), so we added a `return 404` in the nginx config for /api/v3/ws.
  • We updated lemmy-ui from RC7 to RC10, which fixed a lot, including the issue with replying to DMs.
  • We found that the many 502 errors were caused by an issue in Lemmy (markdown-it/actix or thereabouts) that makes nginx temporarily mark an upstream as dead. As a workaround we can either 1) use only 1 container, or 2) set ~~proxy_next_upstream timeout;~~ max_fails=5 in nginx.

Currently we're running with 1 lemmy container, so the 502 errors are completely gone so far, and thanks to the fixes in the Lemmy code everything seems to be running smoothly. If needed, we could spin up a second lemmy container using the ~~proxy_next_upstream timeout;~~ max_fails=5 workaround, but for now it seems to hold with 1.

Thanks to @[email protected], @[email protected], @[email protected], @[email protected], @[email protected], @[email protected] for their help!

And not to forget, thanks to @[email protected] and @[email protected] for their continuing hard work on Lemmy!

And thank you all for your patience, we'll keep working on it!

Oh, and as a bonus, an image (thanks Phiresky!) of the change in bandwidth after implementing the new Lemmy Docker image with the PRs.

Edit: As soon as the US folks woke up (hi!), we did need the second Lemmy container for performance after all, so that's now started. I also noticed the proxy_next_upstream timeout setting didn't work (or I didn't set it properly), so I used max_fails=5 for each upstream instead, and that does work.
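
For reference, here's a minimal sketch of the nginx side of these two workarounds. It is not our exact config: the upstream name is made up, the ports assume Lemmy's default 8536 plus a second container on 8537, and TLS/proxy headers are omitted.

```nginx
# Sketch only: tolerate a few failures per lemmy container before nginx marks it
# as down, and answer old websocket clients directly instead of hitting the backend.
upstream lemmy_backend {
    server 127.0.0.1:8536 max_fails=5;
    server 127.0.0.1:8537 max_fails=5;   # second lemmy container, when enabled
}

server {
    server_name lemmy.world;

    # Websockets were removed in Lemmy 0.18, so old clients polling this
    # endpoint get a 404 straight from nginx.
    location /api/v3/ws {
        return 404;
    }

    location / {
        proxy_pass http://lemmy_backend;
    }
}
```

By default nginx marks an upstream server as unavailable after a single failure (max_fails=1) for the duration of fail_timeout, and it ignores max_fails entirely when the upstream group has only one server, which would explain why the 502s only appeared when running more than one container.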

submitted 1 year ago* (last edited 1 year ago) by [email protected] to c/[email protected]
 

Status update July 4th

Just wanted to let you know where we are with Lemmy.world.

Issues

As you might have noticed, things still don't work as desired. We see several issues:

Performance

  • Loading is mostly OK, but sometimes things take forever.
  • We (and you) see many 502 errors, resulting in empty pages etc.
  • System load: the server is at roughly 60% CPU usage and around 25 GB of RAM usage. (That is, if we restart Lemmy every 30 minutes; otherwise memory climbs to 100%. See the sketch below this list.)
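
Purely as a sketch of that stopgap: the periodic restart can be automated with a cron entry like the one below. The compose project path /srv/lemmy and the service name lemmy are assumptions, and it assumes the docker compose plugin is installed.

```
# crontab entry (sketch): restart the lemmy container every 30 minutes
*/30 * * * * cd /srv/lemmy && docker compose restart lemmy
```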

Bugs

  • Replying to a DM doesn't seem to work. When hitting reply, you get a box with the original message which you can edit and save (which does nothing).
  • 2FA seems to be a problem for many people; it doesn't always work as expected.

Troubleshooting

We have many people helping us with (site) moderation, sysadmin work, troubleshooting, advice, etc. There are currently 25 people in our Discord, including admins of other servers, and the sysadmin channel has 8 people. We do troubleshooting sessions with them, and sometimes with others. One of the Lemmy devs, @[email protected], is also helping with the current issues.

So, not everything is running as smoothly as we hoped yet, but with all this help we'll surely get there! Also, thank you all for the donations; they make it possible to use the hardware and tools needed to keep Lemmy.world running!

 

If you need support, it's best not to DM me here or mention me in comments. I now have 300 notifications and probably won't have time to read them any time soon. Also, I don't do moderation, so I have to forward any moderation questions to the moderation team.

Where to get support

There's the [email protected] community, and another option is to send mail to [email protected]. Mail is converted to tickets, which can be picked up by admins and moderators.

Thanks! Enjoy your day!

submitted 1 year ago* (last edited 1 year ago) by [email protected] to c/[email protected]
 

Looks like it works.

Edit: We still see some performance issues. Needs more troubleshooting.

Update: registrations re-opened. We encountered a bug where people could not log in (see https://github.com/LemmyNet/lemmy/issues/3422#issuecomment-1616112264). As a workaround, we re-opened registrations.

Thanks

First of all, I would like to thank the Lemmy.world team and the 2 admins of other servers, @[email protected] and @[email protected], for their help! We did some thorough troubleshooting to get this working!

The upgrade

The upgrade itself isn't too hard. Create a backup, and then change the image names in the docker-compose.yml and restart.
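
In docker-compose terms that boils down to something like the excerpt below. dessalines/lemmy and dessalines/lemmy-ui are the published images, but the service names and the exact RC tag here are illustrative rather than our actual file.

```yaml
# docker-compose.yml excerpt (sketch): the upgrade is essentially bumping the image tags.
services:
  lemmy:
    image: dessalines/lemmy:0.18.1-rc.1       # illustrative tag; use the RC you're targeting
  lemmy-ui:
    image: dessalines/lemmy-ui:0.18.1-rc.1
# then: docker compose pull && docker compose up -d
```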

But, like the first 2 tries, after a few minutes the site started getting slow until it stopped responding. Then the troubleshooting started.

The solutions

What I had noticed previously is that the lemmy container would reach around 1500% CPU usage, and above that the site got slow. Which is weird, because the server has 64 threads, so 6400% should be the max. So we tried what @[email protected] had suggested before: we created extra lemmy containers (and extra lemmy-ui containers) to spread the load, and used nginx to load-balance between them.

Et voilà. That seems to work.

Also, as he suggested, we start the lemmy containers with the scheduler disabled, and have 1 extra lemmy container running with the scheduler enabled that isn't used for anything else.
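
A rough sketch of that layout, with names and tags that are illustrative only (the post doesn't spell out how the scheduler is toggled, so that part isn't shown):

```yaml
# docker-compose.yml excerpt (sketch): several identical lemmy containers that nginx
# load-balances across, plus one dedicated container that keeps the scheduler enabled
# and is left out of the nginx upstream.
services:
  lemmy-1:
    image: dessalines/lemmy:0.18.1-rc.1       # scheduler disabled (mechanism not shown)
  lemmy-2:
    image: dessalines/lemmy:0.18.1-rc.1       # scheduler disabled (mechanism not shown)
  lemmy-scheduler:
    image: dessalines/lemmy:0.18.1-rc.1       # scheduler enabled, not used for requests
  lemmy-ui-1:
    image: dessalines/lemmy-ui:0.18.1-rc.1
  lemmy-ui-2:
    image: dessalines/lemmy-ui:0.18.1-rc.1
```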

There is still room for improvement, and probably new bugs, but we're very happy that lemmy.world is now on 0.18.1-rc, which fixes a lot of bugs.

 

We'll give the upgrade a new try tomorrow. I've had some good input from admins of other instances, who are also going to help troubleshoot during/after the upgrade.

Also, newer RC versions are available that fix some of the issues.

Be aware that if we need to roll back again, posts created between the upgrade and the rollback will be lost.

We're seeing a huge rise in new user signups (duh… it's July 1st), which also stresses the server. Let's hope the improvements in 0.18.1 will help with that too.

 

So I've been troubleshooting the federation issues with some other admins:

(Thanks for the help)

What we see is that when many federation workers are running at the same time, they get too slow, causing them to time out and fail.

I had the federation worker count set to 200000. I've now lowered that to 8192, and set the ActivityPub logging to debug to get queue stats: `RUST_LOG="warn,lemmy_server=warn,lemmy_api=warn,lemmy_api_common=warn,lemmy_api_crud=warn,lemmy_apub=warn,lemmy_db_schema=warn,lemmy_db_views=warn,lemmy_db_views_actor=warn,lemmy_db_views_moderator=warn,lemmy_routes=warn,lemmy_utils=warn,lemmy_websocket=warn,activitypub_federation=debug"`
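
For a docker-compose deployment like ours, that log level can be passed in through the container environment; a sketch, with the service name assumed (the worker count itself is a Lemmy config setting and isn't shown here):

```yaml
# docker-compose.yml excerpt (sketch): debug logging for the federation queue stats.
services:
  lemmy:
    environment:
      - RUST_LOG=warn,lemmy_server=warn,lemmy_api=warn,lemmy_api_common=warn,lemmy_api_crud=warn,lemmy_apub=warn,lemmy_db_schema=warn,lemmy_db_views=warn,lemmy_db_views_actor=warn,lemmy_db_views_moderator=warn,lemmy_routes=warn,lemmy_utils=warn,lemmy_websocket=warn,activitypub_federation=debug
```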

Also, I saw that many workers were retrying servers that are unreachable, so I've blocked some of these servers:

commallama.social,mayheminc.win,lemmy.name,lm.runnerd.net,frostbyrne.io,be-lemmy.org,lemmonade.marbledfennec.net,lemmy.sarcasticdeveloper.com,lemmy.kosapps.com,pawb.social,kbin.wageoffsite.com,lemmy.iswhereits.at,lemmy.easfrq.live,lemmy.friheter.com,lmy.rndmm.us,kbin.korgen.xyz

This gave good results: way fewer active workers, so fewer timeouts. (I see that timeouts start above roughly 3000 active workers.)

(If you own one of these servers, let me know once it's back up, so I can unblock it.)

Now it's after midnight, so I'm going to bed. Surely more troubleshooting will follow tomorrow and over the weekend.

Please let me know if you see improvements, or if you still have many issues.

 

We've upgraded lemmy.world to 0.18.1-rc.1 and rolled back that upgrade because of issues.

(If you had posted anything in those 10 minutes between upgrade and rollback, that post is gone. Sorry!)

The main issue we saw is that users couldn't log in anymore. Existing sessions still worked, but new logins failed (from macOS, iOS and Android; from Linux and Windows they worked).

Also new account creation didn't work.

I'll create an issue for the devs and retry once it's fixed.

Edit: I contacted the devs; they told me to try again with lemmy-ui at version 0.18.0. Will try again, brace for some downtime!

Edit 2: So we upgraded again, and it seemed to work nicely! But then it slowed down so much it was unusable. There were many locks in the database, and people reported many JSON errors. Sorry, we won't be on 0.18.1 any time soon, I'm afraid…

submitted 1 year ago* (last edited 1 year ago) by [email protected] to c/[email protected]
 

Version 0.18 of Lemmy has been announced. It will solve many issues.

But we can't upgrade yet, because captcha was removed: it relied on websockets, which are gone in 0.18. The devs agreed to my request to add captcha back, but that won't land until 0.18.1. Without captcha we would be overrun by bots.

Hopefully 0.18.1 will be released soon, because another issue is that the newest version of the Jerboa app doesn't work with servers older than 0.18. So if you're on Lemmy.world, please (temporarily) use another app or the web version.

 

I added some known issues with websockets / the spinning wheel to the known issues post.

 

I wrote my first post about Lemmy.world. When June is over, I'll also include Lemmy in the financial update on the same blog.

[–] [email protected] 1 points 1 year ago

You're welcome. Enjoy!

[–] [email protected] 1 points 1 year ago

I think there's a misunderstanding. In docker-compose.yml you specify services, and these services can use the official container images. All docker-compose really does is define your services so you don't have to spell them out every time you start a container.
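
For example, a minimal sketch of the idea (the tags are illustrative):

```yaml
# docker-compose.yml (sketch): each service just points at a published image;
# "docker compose up -d" then starts them all without long docker run commands.
services:
  lemmy:
    image: dessalines/lemmy:0.18.1
    restart: always
  lemmy-ui:
    image: dessalines/lemmy-ui:0.18.1
    restart: always
```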
