Another outage; see below for a report from the people who look after the servers.
Unfortunately, when this happens it corrupts the forum database if someone is posting at the time.
So I have to do a manual database repair to correct it.
Incident 1
Date: Tuesday 10th September
Time: Morning
VPS Downtime: VPSs located on the Windows 2012 Cluster on SAN01 were offline for a period of 3-4 hours. During this time our mail server was also offline, which made it difficult to receive and respond to tickets. Coupled with a very heavy phone load, this meant some users were unable to communicate with our staff.
Cause: The central licensing server for the SAN provided incorrect activation expiry details, which caused one of our SANs to fail to renew its license automatically. This took some time to sort out as it involved overseas contact. This issue has been resolved permanently and will not recur.
Future Preventative Action: Our immediate action was to confirm with the SAN provider that this could not happen again, and to modify our license/activation structure accordingly in conjunction with them.
Incident 2
Date: Wednesday 11th September
Time: 11:40pm-12:15am & 1:30am
VPS Downtime: VPSs located on the Windows 2012 Cluster (across all SANs) were offline for short periods ranging from 5 to 35 minutes. All VPSs located on the Windows 2012 Cluster were restarted as a result.
Cause: The virtualization system began throwing errors on some nodes. This snowballed and caused issues for a number of VPSs that were in a hung state. We needed to restart the virtualization system to restore all services quickly.
Future Preventative Action: Following the initial problem, we installed the latest hotfix provided by Microsoft; installing it also caused a recurrence of the issue while the patches were being applied. The hotfix had been scheduled for installation later in September, and had not been applied earlier because the virtualization system had been relatively stable since the last issues caused by this problem. As a result of the recurrence, we have installed the hotfix across all nodes, which according to Microsoft should prevent this problem from recurring.
General Information - Virtualization
We have known for some time that the virtualization system under 2012 has not been as stable as our 2008 cluster, and we believe we adopted the technology on a broad scale too early. MS has given assurances that, as of the latest hotfix, there are no continuing known issues that should cause the same ongoing problem.
Even so, we are very soon shifting to a model offered by our Australian competitors for our standard VPS products. This will involve individual servers rather than clustered failover, as the issues with the virtualization system, as well as other problems, have all been related to either the SAN or the clustering system. These servers will also use 100% RAID 10 local SSD storage, so in effect they will also offer faster disk access. We expect to be offering this within 30 days, and the launch will coincide with a new-look website. We will still be offering failover clustering as a separate product.
Existing clients will not be moved unless they request a move after the new systems are online and our internal services integration is completed. Now that the new hotfix has been installed, we will be assessing our 2012 cluster to determine its stability before making any further decisions that would affect existing VPSs.
Regards, peterp