Yesterday, 2022-11-28, we needed to do some fairly urgent maintenance work on the DEOSS Community Server because one of its hard disks was failing. The server resides in a data centre in York, and engineers from Bytemark (my upstream provider) undertook the work. This is a fairly common operation for them. Nevertheless, there was still the possibility of unforeseen technical issues.
The server uses two disks in a “RAID 1” arrangement, so that if one disk fails, the data is retained on the other disk. When the faulty disk is replaced, the RAID will then write a copy of the data back to the new disk.
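For readers who like to see the idea in code, here is a minimal toy sketch of how a two-disk mirror behaves. It is only an illustration: the class `Raid1Mirror` and its method names are hypothetical, each "disk" is just a Python dictionary, and none of this reflects how the server's actual RAID software is implemented.

```python
class Raid1Mirror:
    """Toy model of a two-disk RAID 1 mirror (illustrative only)."""

    def __init__(self):
        # Each "disk" maps a block number to its data.
        # A value of None marks a failed or missing disk.
        self.disks = [{}, {}]

    def write(self, block, data):
        # Every write goes to all healthy disks, so each holds a full copy.
        for disk in self.disks:
            if disk is not None:
                disk[block] = data

    def read(self, block):
        # Any healthy disk can serve a read.
        for disk in self.disks:
            if disk is not None:
                return disk[block]
        raise IOError("no healthy disk left in the mirror")

    def fail(self, index):
        # Simulate a disk failure.
        self.disks[index] = None

    def replace_and_rebuild(self, index):
        # Fit a new disk and copy everything across from the survivor.
        survivor = next(d for d in self.disks if d is not None)
        self.disks[index] = dict(survivor)


array = Raid1Mirror()
array.write(0, b"customer data")
array.fail(1)                    # one disk dies
data = array.read(0)             # still readable from the other disk
array.replace_and_rebuild(1)     # new disk receives a full copy
```

In a real array the rebuild step copies the whole disk block by block, which is why it can take hours on a large volume.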
Work was initially scheduled for Monday, 2022-11-28, 10:00 UTC. However, we hit a small snag. I back up the server daily to a local machine via rsync, so data loss was unlikely. However, if the repair did fail completely, I wanted to ensure that a replacement server could be set up quickly, without having to securely squirt 200 GiB of data across the internet.
So, in addition to my own off-site backups, Bytemark kindly provided me with a backup system that is local to the server. Having a complete dataset in the same building as the server would enable us to build a new server much more quickly if the repair went badly wrong.
We planned to create this additional backup over the previous weekend. Unfortunately, the backup volume was limited to 100 GiB. Bytemark fixed that limitation on Monday morning, which meant that the full backup did not complete until later that day.
Consequently, the disk swap was not undertaken until 16:00 on Monday afternoon. Fortunately for users, the server remained up until then. The swap itself took about 30 minutes. However, the reboot took over two hours, as a program called ‘fsck’ checked the remaining RAID disk for errors before the RAID was rebuilt onto the replacement disk. The server went live and became usable again at approximately 18:38. This meant we had a total of 2 hours and 38 minutes of actual downtime. This was a little longer than I had hoped, but considering the circumstances, it really wasn’t too bad.
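As a quick check on that figure, the downtime can be worked out from the two timestamps above. This is just a small sketch: the variable names are mine, and it assumes the outage is counted from the 16:00 start of the disk swap to the 18:38 return to service, both on the same clock.

```python
from datetime import datetime

# Timestamps taken from the account above (same day, same clock).
swap_started = datetime(2022, 11, 28, 16, 0)
back_online = datetime(2022, 11, 28, 18, 38)

downtime = back_online - swap_started

# Split the elapsed time into whole hours and minutes.
hours, remainder = divmod(int(downtime.total_seconds()), 3600)
minutes = remainder // 60

print(f"{hours} hours {minutes} minutes")
```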
However, the server still ran rather slowly for a few more hours as the RAID array rebuilt itself onto the new disk. The rebuild completed at around 23:00 on Monday evening. Since then, I have performed fairly extensive tests on the server and all seems well. However, if you are a DEOSS customer and still have any problems with your service, then please contact me through the usual channels.
(RAID = ‘redundant array of inexpensive disks’.)