Hello,
This is to share an ordeal in trying to get a RAID1 array back without the risk of losing all the data on the surviving drive.
One of the drives in a RAID1-configured DNS323 failed. The failed drive was a Seagate 1TB with firmware SD15, which is known to die 'expectedly'. I originally had firmware 1.06 in the DNS323. I replaced the failed drive with a Western Digital Caviar Green WD10EARS 1TB (32MB cache), and the configuration wizard guided me to format the drive; however, it got stuck at the 94% completion mark (I found out later that this is widely reported in the forum). After leaving it at that stage for hours (>2hrs) with the progress still at 94%, I had to force a power cycle because the web configuration was no longer responsive. After the unit booted up and I got back into the web configuration, the wizard once again asked for the drive to be formatted to become part of the RAID1 array. I clicked "Skip" (this step is crucial; by proceeding, the whole RAID1 array would likely be damaged). Then, going to "Tools" > "RAID1", I manually started a "Re-build", which re-synced the drives. After many hours, the sync was reported as complete.
However, upon re-booting the DNS323, the web configuration wizard again reported that the new drive needed to be formatted; again, I clicked "Skip". Checking the "Status" page, it reported that the RAID1 array was in sync.
I googled for similar reports and decided to flash in the 1.08 firmware. However, the behavior was the same, even though the firmware release notes did mention that the "stuck at 94%" problem had been resolved. I think this is because the problem had already been triggered in my case; had I flashed 1.08 before replacing the drive, it might have been avoided, I suppose.
So, this bug is what the rest of this post is about. The fix requires familiarity with Linux and with all the steps necessary to get funplug onto the DNS323, so do not proceed if you are not comfortable with those.
First, let me zero in on the problem: despite the internal Linux RAID subsystem being perfectly happy with the health of the RAID1 array, the web configuration wizard depends on a proprietary DNS323 data file to track the state of the array, while the "Status" page queries the Linux /proc/mdstat to establish the health of the array. These two different ways of establishing the state of the RAID array are what caused the conflicting information.
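For reference, here is a minimal sketch of how to see what the Linux md layer itself thinks of the array, assuming shell access via funplug and that the data array is /dev/md0 (as on a stock DNS323):

# this is essentially what the "Status" page looks at
cat /proc/mdstat
# more detail on the array and its member drives
/usr/sbin/mdadm --detail /dev/md0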
The internal RAID state tracking data file appears to be hd_magic_num, kept in /dev/mtdblock0 and /dev/mtdblock1 (both are minix filesystems). The file is also copied out to /mnt/HD_xx/.systemfile/ (i.e. HD_a2, HD_a4 and HD_b4, typically). The format appears to be two random tokens followed by the serial numbers of the right and left drives, in that order. The random tokens are probably meant for consistency verification across all the scattered copies; if they do not match, that probably allows the firmware to reuse the same two drives for building a RAID1 from scratch. In a degraded RAID case with a replacement drive, because of the "stuck at 94%" problem, these files were never updated with the correct information.
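To illustrate, hd_magic_num is just four short lines. The example below is entirely made up (your tokens and serial numbers will differ); it only shows the layout:

1836467291        <- random token 1 (32-bit decimal integer)
402857163         <- random token 2 (32-bit decimal integer)
9VP4EXAMPLE       <- serial number of the right drive
WD-WCAVEXAMPLE    <- serial number of the left drive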
Therefore, funplug is required for getting into the box and fixing these files manually. Assuming you can enter the box via telnet or ssh (dropbear), the following needs to be done (a consolidated command sketch follows the numbered steps):
1. Mount /dev/mtdblock0: mount -t minix /dev/mtdblock0 /sys/mtd1
2. Go to /sys/mtd1: cd /sys/mtd1
3. Make a backup copy of hd_magic_num: cp hd_magic_num hd_magic_num.old
4. Edit hd_magic_num: vi hd_magic_num
5. Change the first two numbers (lines 1 and 2) to any numbers of your choice (32-bit integers in decimal).
6. Change the 3rd line to the serial number of the right drive.
7. Change the 4th line to the serial number of the left drive.
8. Exit vi.
9. Check carefully that all the information is correct.
10. Un-mount /dev/mtdblock0: umount /sys/mtd1 (this step is crucial).
11. Mount /dev/mtdblock1: mount -t minix /dev/mtdblock1 /sys/mtd2
12. Go to /sys/mtd2: cd /sys/mtd2
13. Make a backup copy of hd_magic_num: cp hd_magic_num hd_magic_num.old
14. Edit hd_magic_num: vi hd_magic_num
15. Change the first two numbers (lines 1 and 2) to any numbers of your choice (32-bit integers in decimal); presumably they should be the same numbers used in step 5, so that all copies stay consistent.
16. Change the 3rd line to the serial number of the right drive.
17. Change the 4th line to the serial number of the left drive.
18. Exit vi.
19. Check carefully that all the information is correct.
20. Copy hd_magic_num to copies in hard drives:
- cp hd_magic_num /mnt/HD_a2/.systemfile
- cp hd_magic_num /mnt/HD_a4/.systemfile
- cp hd_magic_num /mnt/HD_b4/.systemfile
21. Un-mount /dev/mtdblock1: umount /sys/mtd2 (this step is crucial).
22. Re-start the unit and verify that the web configuration wizard does not ask to format the drive again.
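Here is the whole procedure again as one shell session, for those who prefer to see it that way. It is only a sketch; the edits in vi still have to be done by hand, and remember to leave each mount point before unmounting it, otherwise umount will report the device as busy:

# --- first copy, on /dev/mtdblock0 ---
mount -t minix /dev/mtdblock0 /sys/mtd1
cd /sys/mtd1
cp hd_magic_num hd_magic_num.old   # backup
vi hd_magic_num                    # two tokens, right drive serial, left drive serial
cd /                               # leave the mount point before unmounting
umount /sys/mtd1

# --- second copy, on /dev/mtdblock1 ---
mount -t minix /dev/mtdblock1 /sys/mtd2
cd /sys/mtd2
cp hd_magic_num hd_magic_num.old   # backup
vi hd_magic_num                    # same content as the first copy
# push the same file out to the copies on the hard drives
cp hd_magic_num /mnt/HD_a2/.systemfile
cp hd_magic_num /mnt/HD_a4/.systemfile
cp hd_magic_num /mnt/HD_b4/.systemfile
cd /
umount /sys/mtd2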
Alternative way to get the RAID1 array rebuilt

I had experimented with the following approach and found it equally workable, but it also requires familiarity with funplug and Linux (a command sketch follows the steps):
1. Get into the box via telnet or ssh.
2. Manually partition the replacement drive to match the surviving drive using fdisk.
3. Manually rebuild the RAID1: /usr/sbin/mdadm --manage --add /dev/md0 /dev/sdx2
4. Check /proc/mdstat for re-sync status.
5. When the re-sync is done, update hd_magic_num as described above.
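As a sketch, assuming the data array is /dev/md0 (as on a stock DNS323), the surviving drive is /dev/sda and the replacement drive is /dev/sdb (adjust the device names to your own unit):

# inspect the surviving drive's partition table, then recreate the same
# layout on the replacement drive
fdisk -l /dev/sda
fdisk /dev/sdb

# add the RAID partition of the replacement drive to the degraded array
/usr/sbin/mdadm --manage --add /dev/md0 /dev/sdb2

# watch the re-sync progress
cat /proc/mdstat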
This saga once again reaffirms my trust in the DNS323, because it uses Linux and there are therefore many ways in which a user-interface failure can be worked around.
One more thing: if you have a Linux box with a spare SATA slot, it may be worthwhile to pull out the surviving drive and slot it into that box to give the drive a health check. For example, with Fedora (I recommend 9 and above), use "Disk Utility" to read the SMART status. What you should look at are the reallocated bad sector count and the pending bad sector count; any reading other than 0 on either of these means that the drive will fail pretty soon. If you have a 1TB Seagate Barracuda, you can also check the firmware release, and if it is the affected SD15, download the bootable ISO image and update the firmware to SD1A this way.
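If you prefer the command line over the GUI, the same SMART attributes can be read with smartctl from the smartmontools package; a minimal sketch, assuming the drive shows up as /dev/sdb on the checking box:

# dump all SMART information; look at Reallocated_Sector_Ct and
# Current_Pending_Sector -- anything non-zero is a bad sign
smartctl -a /dev/sdb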
Good luck, and I hope this post will never be of any use to you. :-)