lazyadmin

joined 2 years ago

lazysoci.al was offline whilst the cluster was being updated.

The outage lasted 38 minutes, from 11:00 to 11:38 BST.

This was expected to take under 15 minutes. The extended outage was due to an issue bringing up a Docker container that was a prerequisite for the load balancer.

submitted 10 months ago* (last edited 10 months ago) by lazyadmin@lazysoci.al to c/announcements@lazysoci.al

I shall be migrating pict-rs to Postgres now.

The pict-rs service will need to be offline, which will affect thumbnail generation and the ability to post images.

Should be completed within 4 hours of this post.

Edit: The migration completed at 15:15 UK time.
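For anyone planning the same move: pict-rs 0.5 can keep its metadata repository in Postgres instead of the default sled database. A minimal sketch of the target settings follows; the connection string is a placeholder, and the `PICTRS__` double-underscore env-var mapping is pict-rs's documented convention, but verify it against your version's docs:

```sh
# Sketch: pict-rs 0.5 repo settings as environment variables (compose-style).
# Values are placeholders, not our real credentials.
export PICTRS__REPO__TYPE=postgres
export PICTRS__REPO__URL='postgres://pictrs:CHANGEME@postgres:5432/pictrs'
```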

submitted 10 months ago* (last edited 10 months ago) by lazyadmin@lazysoci.al to c/announcements@lazysoci.al

Registration requirements

I've had to ban multiple vile accounts this morning. It seems this instance has found itself on the radar of trolls.

To that end, sign-ups now require users to fill in a questionnaire prior to joining. I've always thought questionnaires were a bit lame, and one won't really prevent a troll account from joining, but it will slow them down and likely push them to find another, easier-to-join instance.

Upgrade

I shall be upgrading the instance to ~~0.19.4~~ 0.19.5 ~~this afternoon~~ on Friday morning, so expect a little downtime.

I'll unpin this notice once the update is complete.

Edit: Unfortunately work got in the way of me performing the update today, so I'm postponing it to first thing tomorrow morning.
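For the curious, the upgrade itself is just an image-tag bump on the standard Docker deployment. A minimal sketch, assuming the stock compose file with its `lemmy` and `lemmy-ui` services (paths and tags here are illustrative):

```sh
# Sketch: bump a docker-compose Lemmy deployment to 0.19.5.
# Snapshot/back up first; a failed migration is far easier to roll back.
sed -i \
  -e 's|dessalines/lemmy:.*|dessalines/lemmy:0.19.5|' \
  -e 's|dessalines/lemmy-ui:.*|dessalines/lemmy-ui:0.19.5|' \
  docker-compose.yml
docker compose pull
docker compose up -d   # database migrations run on first start of the new version
```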

Cached images

We run the standard set of services for Lemmy, including the pict-rs image/thumbnail cache. This cache recently grew to 700 GB and will keep growing as Lemmy grows, so some effort has been made to keep it under control.

Cached images from other instances are now kept for only 90 days. This doesn't remove the original image from its home instance.
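For those running their own instance, one way to script this is sketched below. The SQL leans on Lemmy's post table (thumbnail_url, published, local) and pict-rs's internal purge endpoint; the database name, pict-rs address and API key are placeholders, and using post age as a stand-in for cache age is an approximation.

```sh
#!/bin/sh
# Sketch: purge cached thumbnails for remote posts older than 90 days.
# Placeholders: 'lemmy' database, pict-rs on 127.0.0.1:8080, $PICTRS_API_KEY.
CUTOFF_DAYS=90

psql -At lemmy -c "
  SELECT thumbnail_url FROM post
  WHERE thumbnail_url LIKE 'https://lazysoci.al/pictrs/image/%'
    AND local = false
    AND published < now() - interval '${CUTOFF_DAYS} days';" |
while read -r url; do
  alias="${url##*/}"   # the pict-rs alias is the last path segment
  # pict-rs internal API: purge the alias and its cached variants.
  curl -s -X POST -H "X-Api-Token: ${PICTRS_API_KEY}" \
    "http://127.0.0.1:8080/internal/purge?alias=${alias}"
done
```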


Upgraded to 0.19.2

The upgrade itself went OK. I also updated pict-rs to 0.5.1, which performed a metadata update.

I tested this upgrade last Thursday (2024-01-11) but rolled back when outgoing federation didn't work.

Found a fix on matrix.org (thanks to Ategon@programming.dev and Illecors@lemmy.cafe): update the "updated" column on the instance table.

UPDATE instance SET updated = now() WHERE updated > now() - interval '14 days';

Total downtime was 12 minutes.
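For anyone hitting the same thing: a quick way to watch outgoing federation recover is the per-instance queue state that 0.19 introduced. Column names below are from my reading of the federation_queue_state schema, so treat this as a sketch:

```sql
-- Instances the send queue is struggling with; fail_count should fall
-- back to 0 once outgoing federation is healthy again.
SELECT i.domain, q.fail_count, q.last_retry
FROM federation_queue_state q
JOIN instance i ON i.id = q.instance_id
WHERE q.fail_count > 0
ORDER BY q.fail_count DESC
LIMIT 20;
```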

Outage: 2023-12-17 (lazysoci.al)
submitted 1 year ago* (last edited 1 year ago) by lazyadmin@lazysoci.al to c/announcements@lazysoci.al

Upgraded to 0.19.0

Looks good so far. I took a snapshot prior to the upgrade as a precaution, but initial monitoring shows no issues.

Total downtime was 42 minutes.


What was meant to be a quick blip ended up being over an hour. Migrating the reverse proxy that sits in front of the Lemmy server failed because Docker Hub was having an outage.

Outage: 2023-09-10 (lazysoci.al)
submitted 2 years ago* (last edited 2 years ago) by lazyadmin@lazysoci.al to c/announcements@lazysoci.al

Overview

lazysoci.al was offline for 3h 15m today following suspected database corruption. The server is now back online, and federated data is flowing again.

Details

I moved the server to its own dedicated host this morning, for both performance and security (a dedicated VLAN). It should have been a simple case of moving the virtual disk holding the Lemmy data to the new VM and spinning up the new Docker image.

The Docker logs didn't show any initial issues; however, every UPDATE query against the database failed with ERROR: relation "approvals" does not exist.
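In hindsight, a couple of quick queries would have separated "corrupt database" from "schema mismatch". A minimal check, assuming psql access to the lemmy database (Lemmy records its Diesel migrations in __diesel_schema_migrations):

```sql
-- Is the relation present in the schema at all?
SELECT schemaname, tablename
FROM pg_tables
WHERE tablename = 'approvals';

-- What migrations has this database actually run?
SELECT version, run_on
FROM __diesel_schema_migrations
ORDER BY run_on DESC
LIMIT 5;
```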

After some troubleshooting I concluded the database was corrupted, so I started a restore from last night's backup. The restore took approx. 2h 30m.

Post-restore, the same issue remained. I then updated to the latest beta, and the issue is now resolved, which suggests the missing relation was a schema/version mismatch rather than corruption.

This has highlighted one problem. I use Proxmox Backup Server with Proxmox Virtual Environment, and you can't easily restore a single disk from a VM into a ZFS volume: the web interface only lets you restore the entire VM. The restore therefore took much longer than necessary, as several other disks had to be restored first.

Improvements

  • The new setup will restore more quickly, as the VM now contains only the OS and the data unique to Lemmy.
  • Backups are now performed every 2 hours instead of nightly.
  • A new script performs a snapshot every hour and retains snapshots for 24 hours (a sketch of the script follows this list). Snapshots can be restored pretty much instantly.
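A sketch of the hourly snapshot script, assuming the Lemmy data lives on a single ZFS dataset (the dataset name is a placeholder):

```sh
#!/bin/sh
# Sketch: hourly ZFS snapshot with 24-hour retention.
DATASET="rpool/data/lemmy"   # placeholder -- substitute your dataset

# Take a snapshot named with a sortable timestamp.
zfs snapshot "${DATASET}@hourly-$(date +%Y%m%d-%H%M)"

# Prune hourly snapshots older than 24 hours.
CUTOFF=$(( $(date +%s) - 86400 ))
zfs list -H -p -d 1 -t snapshot -o name,creation "$DATASET" |
while read -r name creation; do
  case "$name" in
    "${DATASET}@hourly-"*)
      [ "$creation" -lt "$CUTOFF" ] && zfs destroy "$name"
      ;;
  esac
done
```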

I'll also do some testing with the proxmox-backup-client CLI so I can work out how to restore a single disk into a zvol.
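My current working theory, untested, is that the CLI can restore a single disk archive to a raw file, after which it's a plain block copy onto the zvol. Snapshot, archive and zvol names below are examples, not our real ones:

```sh
# Sketch: restore one disk archive from PBS straight onto a zvol.
export PBS_REPOSITORY='backup-user@pbs@pbs-host:datastore'   # placeholder

# Find the snapshot and the disk archive name (e.g. drive-scsi1.img).
proxmox-backup-client snapshots

# Restore the single-disk image to a raw file...
proxmox-backup-client restore "vm/100/2023-09-10T01:00:00Z" drive-scsi1.img disk.raw

# ...then block-copy it onto a pre-created zvol of equal or larger size.
dd if=disk.raw of=/dev/zvol/rpool/data/vm-100-disk-1 bs=1M status=progress
```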
