We have been having on-and-off issues caused by a combination of external problems and bugs found in our fail-over setup. The bugs are not so much in the setup itself as in some of the software we use behind the scenes. Now that those bugs are getting killed, we might just have a great setup moving forward.
Additional testing will be needed to exercise the fail-over again, hopefully overnight rather than in the morning or at noon, which are our high-traffic periods.
Part of the trouble surfaced when I found the configuration problem that made the fail-over (Europe-based) system seem unusually slow. Some users reported it was quick, so going by reports alone the performance issues seemed less noticeable. It turned out that the Europe-based server was accessing the database on the USA side for some calls, which defeated the whole purpose of a geographically distant, self-sufficient system, hence the slowdown.
Fixing this significantly improved the performance of the Europe server, even when accessed from inside the USA. That was a big relief. However, unknown to us at the time, a bug was lurking in the configuration of the software that previously connected to the USA server and now points to itself. This bug made our real-life fail-overs fail and brought down both servers instead of just one.
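A cross-region dependency like the database calls described above is the sort of thing a small configuration check can flag before a real fail-over does. The following is only a rough sketch under assumed conditions: the host names, the region tags, and the `find_cross_region_calls` helper are all hypothetical and do not reflect our actual setup or software.

```python
# Hypothetical layout: which region each host lives in.
HOST_REGION = {
    "db-us.example.internal": "us",
    "db-eu.example.internal": "eu",
    "cache-eu.example.internal": "eu",
}

# Hypothetical per-region settings: service name -> host it connects to.
REGION_CONFIG = {
    "us": {"database": "db-us.example.internal"},
    "eu": {
        "database": "db-eu.example.internal",
        # A stray setting that still reaches across the ocean.
        "sessions": "db-us.example.internal",
        "cache": "cache-eu.example.internal",
    },
}


def find_cross_region_calls(region_config, host_region):
    """Return (region, service, host) for every setting that leaves its own region."""
    problems = []
    for region, services in region_config.items():
        for service, host in services.items():
            # Unknown hosts are reported too, since they cannot be verified as local.
            if host_region.get(host) != region:
                problems.append((region, service, host))
    return problems


if __name__ == "__main__":
    for region, service, host in find_cross_region_calls(REGION_CONFIG, HOST_REGION):
        print(f"{region}: '{service}' points outside the region at {host}")
```

A check along these lines only catches dependencies that are visible in the configuration; it would not have caught the second bug, which only showed up during a real fail-over.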
With all these details nailed down, we can probably look at the viability of offering a Europe-based server for those in Europe (we will need some testers) and of providing an effective fail-over system for those in the USA with minimal performance loss due to distance.
Scott