Stability announcement: downtime yesterday
Started by @Eldest-God-andrew (Admin)

102 followers

@Eldest-God-andrew (Admin)

First off, let me apologize for the downtime we saw yesterday. For approximately 4 hours, about 40% of users were unable to reliably access the site.

There were multiple issues that happened at once, resulting in a perfect storm that made debugging quite difficult: once one issue was fixed, another popped up. Here's what happened:

  1. A configuration issue with our host removed a machine dedicated to sending forum emails to users. Instead of the emails simply going unsent, our main machine (the one that powers the site) took over that work, which tied up resources that would otherwise be dedicated to running the site smoothly. The forums were temporarily disabled while this was debugged, and are now online again. I'm discussing the issue with the hosting provider that runs the machines to ensure that it does not happen again.

  2. We experienced a minor DDoS from a user who repeatedly attempted to upload the same maliciously crafted 5GB file. This user has been banned, and we'll be implementing better protection against this soon.

  3. A few users seem to have written scripts to create tens of thousands of pages on their behalf. These scripts also refreshed the entire page list after each creation, which took significant amounts of server resources away from legitimate users. These users have been given the option to export their notebooks, but are banned from creating new pages so that people who legitimately use the site get the resources they need.

  4. There still seems to be some kind of configuration issue with how our host is managing the new Postgres database we introduced a few days ago. I am actively working with them to solve these issues, and fighting slowness on the site along the way.
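
For the curious, here is a rough sketch of the kind of guards that help against issues 2 and 3: a hard cap on upload size, enforced while the file is still streaming in, and a per-user token bucket that limits how quickly pages can be created. This is only an illustration, not the site's actual code. The names `read_capped`, `UploadTooLarge`, and `TokenBucket`, the 50 MB cap, and the bucket settings are all made up for the example.

```python
import io
import time

# Hypothetical cap for illustration; not the site's real limit.
MAX_UPLOAD_BYTES = 50 * 1024 * 1024


class UploadTooLarge(Exception):
    """Raised when an upload exceeds the allowed size."""


def read_capped(stream, max_bytes=MAX_UPLOAD_BYTES, chunk_size=64 * 1024):
    """Read a file-like object in chunks, aborting once max_bytes is exceeded.

    Reading in chunks means an oversized file is rejected early instead of
    being buffered in full first.
    """
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            raise UploadTooLarge(f"upload exceeds {max_bytes} bytes")
        chunks.append(chunk)
    return b"".join(chunks)


class TokenBucket:
    """Allow bursts of up to `capacity` actions, refilled at `rate` per second."""

    def __init__(self, capacity, rate, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock  # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True and spend a token if the action is permitted right now."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Usage: a small upload is accepted; each user would get their own bucket.
data = read_capped(io.BytesIO(b"hello"), max_bytes=1024)
page_creation_limit = TokenBucket(capacity=2, rate=0.5)
```

In practice the size cap is usually also enforced at the web server or proxy in front of the application, so oversized requests never reach the site's own code at all.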

Any of these issues individually wouldn't have caused such long downtime, but all of them together caused problems across the machines, the database, and routing through our hosting provider.

Again, I'm very sorry for this downtime. As a silver lining, the database upgrade we did the other day made it easier to diagnose pieces of each of the issues and respond quicker.

The issues seem to be resolved now, and I will continue to monitor server uptime and performance. I will also be bringing aboard an expert in infrastructure to better protect against incidents like this in the future.

Thank you everyone, and happy worldbuilding.

andrew (Our Supreme Lord and Overseer)

@Eldest-God-andrew (Admin)

Fingers crossed: I may have found (and fixed) a large oversight in the database configuration. We're already seeing speedups of 300%, and things are looking even faster than before this whole fiasco.

I hope this fixes the intermittent issues we've been seeing today and yesterday. I will continue to monitor the site closely. I apologize to everyone who's been running into issues; scaling is difficult, but we've handled it relatively well, even with the occasional bump in the road. Back to optimizations!

@Eldest-God-andrew (Admin)

Any time. :)

Just want to say we're running blazingly fast with zero errors in the past 24 hours.

Some statistics for page load times over the past 24 hours:

  • 50% of page loads take less than 150ms
  • 95% of page loads take less than 520ms
  • 99% of page loads take less than 1.6 seconds
  • the longest time a page has taken to load has been 5.4 seconds
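
If you're wondering how numbers like these are produced, here is a minimal sketch that assumes you have a plain list of page load times in milliseconds. It uses the nearest-rank definition of a percentile; the function names `percentile` and `summarize` are made up for the example, and real monitoring tools may interpolate slightly differently.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Rank is 1-based; clamp to 1 so pct=0 returns the minimum.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(load_times_ms):
    """Return the median, tail percentiles, and worst-case load time."""
    return {
        "p50": percentile(load_times_ms, 50),
        "p95": percentile(load_times_ms, 95),
        "p99": percentile(load_times_ms, 99),
        "max": max(load_times_ms),
    }
```

Reading the list this way, "95% of page loads take less than 520ms" corresponds to the `p95` value, and the longest load is simply `max`.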

I have a few small improvements to add in from here (so it should only get faster!), but I'm pretty happy with these speeds.

Happy worldbuilding, everyone!

@Urby

I am suddenly curious: is there a size limit on single uploads in the gallery? The description of the DDoS attack makes me wonder how it was possible in the first place.

Anyway, I was one of the users who was having problems earlier this week, and I'm happy to report that there are no more issues! Thanks!