Server Issues over Last Few Days

ScottW

Founder
Staff member
We have been having on-and-off issues caused by a combination of external problems and bugs found in our fail-over setup. The bugs are not so much in the setup itself as in some of the software we use behind the scenes. Now that those bugs are getting killed, we might just have a great setup moving forward.

Additional testing will be needed to exercise the fail-over again, hopefully overnight rather than in the morning or at noon, which are our high-traffic periods.

Part of the issues came up when I found the configuration problem that was making the fail-over (Europe-based) system seem unusually slow. Some users reported that it was quick, so the performance problem was not obvious from reports alone. It turned out that the Europe-based server was accessing the database on the USA side for some calls, which defeated the whole purpose of a geographically distant, self-sufficient system, hence the slowdown.
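To illustrate the kind of misconfiguration described above, here is a minimal sketch of a check that flags any database host pointing outside its own region. It is not our actual setup: the config layout, the host names, and the convention that a local host name contains ".<region>." are all assumptions made up for the example.

```python
# Hypothetical configuration: regions, service names, and host names are
# invented for illustration only.
CONFIG = {
    "us": {"db_hosts": {"main": "db.us.example.com", "sessions": "db.us.example.com"}},
    "eu": {"db_hosts": {"main": "db.eu.example.com", "sessions": "db.us.example.com"}},
}


def cross_region_db_calls(config):
    """Return (region, service, host) for each DB host outside its own region."""
    problems = []
    for region, settings in sorted(config.items()):
        for service, host in sorted(settings["db_hosts"].items()):
            # Assumed naming convention: a local host contains ".<region>."
            if f".{region}." not in host:
                problems.append((region, service, host))
    return problems


if __name__ == "__main__":
    for region, service, host in cross_region_db_calls(CONFIG):
        print(f"{region}: '{service}' points at remote database {host}")
```

In this made-up config the "sessions" entry for the Europe side still points at the USA database, which is the sort of leftover that only affects some calls and is easy to miss.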

Fixing this issue significantly improved the performance of the Europe server, even when accessed from inside the USA. This was a big relief. However, unknown at the time, a bug was lurking in the configuration of the software that previously connected to the USA server and now pointed to itself. This bug made our real-life fail-overs fail and brought down both servers instead of just the one.

With all these details nailed down, we can probably look at the viability of offering a Europe-based server for those in Europe (we will need some testers) and providing an effective fail-over system for those in the USA with minimal performance loss due to distance.

Scott
 
Nice work Scott. Sounds stressful to say the least. I did notice the slowdown/lack of operation this morning but it's working fine at the moment.
 
Fail-over testing earlier this evening went well. I am glad we went through it. It is good to test things when they are stable and you know what changed, so that if things don't go right you can pinpoint the cause. I discovered that manually modifying IPs in the DNS table breaks the monitoring system's ability to update DNS automatically. After that I was able to re-test, and it did work.

A few issues remain. One is tweaking the performance settings of the database server on the fail-over, as its utilization climbed quickly and made the server slow. Another is that, because of the monitor's periodic checks, there can be upwards of 5 minutes of downtime before traffic moves over. In the big scheme of things, 5 minutes is better than 10 or 15 minutes, or even 8 hours, but I would like to shorten it.
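As a rough way to reason about where those minutes go, here is a small sketch of the worst-case delay before traffic moves over. The model and all of the numbers are assumptions for illustration, not the monitor's real settings: it assumes the monitor checks at a fixed interval, needs a number of consecutive failed checks before switching, and that clients may hold the old DNS answer for up to one TTL.

```python
def worst_case_failover_seconds(check_interval, failures_required, dns_ttl):
    """Estimate the worst-case downtime before clients reach the fail-over.

    Worst case: the outage starts right after a successful check, so it takes
    failures_required full intervals for the monitor to declare the server
    down, and clients may then keep the old DNS answer for up to one TTL.
    """
    detection = check_interval * failures_required
    return detection + dns_ttl


if __name__ == "__main__":
    # Hypothetical settings: 60 s checks, 3 failed checks, 120 s DNS TTL.
    print(worst_case_failover_seconds(60, 3, 120) / 60, "minutes")
    # Same model with tighter, equally hypothetical settings.
    print(worst_case_failover_seconds(30, 2, 60) / 60, "minutes")
```

Under this model, shortening the check interval, requiring fewer failed checks, or lowering the DNS TTL each reduce the total, at the cost of more monitoring traffic or a higher chance of a false fail-over.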
 
I have been sick with a 102-degree temperature since last night, so I have been out of it. Still looking to see what happened today.
 