D-Link Forums

The Graveyard - Products No Longer Supported => D-Link Storage => DNS-321 => Topic started by: irha on October 30, 2010, 10:06:25 PM

Title: SMART report question
Post by: irha on October 30, 2010, 10:06:25 PM: My DNS-321 is showing "Abnormal" status for one of the drives in a RAID-1 configuration, and it is a bit confusing which drive is failing. Under hard drive info, I see something like this:

Right WDC: ... Normal
Left WDC: ... Abnormal

When I click on "Abnormal", I get a report that says:
Item                          Now  Worst  Thresh  Updated
1   Raw_Read_Error_Rate         1      1      51   302541
3   Spin_Up_Time              110    107      21     7466
4   Start_Stop_Count          100    100       0      247
5   Reallocated_Sector_Ct     200    200     140        0
7   Seek_Error_Rate           200    200       0        0
9   Power_On_Hours             99     99       0      825
10  Spin_Retry_Count          100    100       0        0
11  Calibration_Retry_Count   100    253       0        0
12  Power_Cycle_Count         100    100       0       26
192 Power-Off_Retract_Count   200    200       0       19
193 Load_Cycle_Count          200    200       0      278
194 Temperature_Celsius       108    102       0       39
196 Reallocated_Event_Count   200    200       0        0
197 Current_Pending_Sector    192    192       0     1458
198 Offline_Uncorrectable     192    192       0     1402
199 UDMA_CRC_Error_Count      200    200       0        0
200 Multi_Zone_Error_Rate     185    184       0     3178

What exactly is abnormal here? For comparison, I clicked on the "Normal" button and this is what I see:
Item                          Now  Worst  Thresh  Updated
1   Raw_Read_Error_Rate       200    200      51        0
3   Spin_Up_Time              168    164      21     6600
4   Start_Stop_Count          100    100       0      257
5   Reallocated_Sector_Ct     200    200     140        0
7   Seek_Error_Rate           200    200       0        0
9   Power_On_Hours             90     90       0     7589
10  Spin_Retry_Count          100    100       0        0
11  Calibration_Retry_Count   100    253       0        0
12  Power_Cycle_Count         100    100       0       26
192 Power-Off_Retract_Count   200    200       0        9
193 Load_Cycle_Count          200    200       0      257
194 Temperature_Celsius       115    106       0       35
196 Reallocated_Event_Count   200    200       0        0
197 Current_Pending_Sector    200    200       0        0
198 Offline_Uncorrectable     200    200       0        0
199 UDMA_CRC_Error_Count      200    200       0        0
200 Multi_Zone_Error_Rate     200    200       0        0
Looks like the one which is supposed to be normal has a very high Raw_Read_Error_Rate, whereas the one that is supposed to be abnormal has a value of only 1. So which one is actually failing? Also, would the "Left WDC" drive be the top or bottom drive? I guess I can pull one of the drives out and match the s/no, but I am not even sure if the DNS-321 is giving accurate information.
Title: Re: SMART report question
Post by: jamieburchell on October 31, 2010, 03:19:33 AM: The drive you say is "abnormal" in your post has

1 Raw_Read_Error_Rate 1 1 51 302541
198 Offline_Uncorrectable 192 192 0 1402
200 Multi_Zone_Error_Rate 185 184 0 3178

What do you mean top and bottom drives? They are installed left to right?
Title: Re: SMART report question
Post by: irha on November 01, 2010, 12:53:28 PM: Quote from: jamieburchell on October 31, 2010, 03:19:33 AM
The drive you say is "abnormal" in your post has

1 Raw_Read_Error_Rate 1 1 51 302541
198 Offline_Uncorrectable 192 192 0 1402
200 Multi_Zone_Error_Rate 185 184 0 3178

I looked at a reference http://www.z-a-recovery.com/man-smart.htm, but I am still a bit confused on how to interpret these numbers. So in general, a lower number means it is bad? Is 200 considered a good reference number, and anything lower than that might indicate a failure?

Edit: I jumped to the relevant information without reading the beginning, and I see that the link already answers my question above. It basically says: "on the scale from 0 (bad) to some maximum (good) value. Maximum values are typically 100, 200 or 253. Rule of thumb is: high values are good, low values are bad." I now understand what these numbers indicate.

Quote from: jamieburchell on October 31, 2010, 03:19:33 AM
What do you mean top and bottom drives? They are installed left to right?
You are right, I think I was confused because I had Corza for a few days before replacing it with the D-Link, and it had drives one over the other.

Thanks for your help.
Title: Re: SMART report question
Post by: jamieburchell on November 01, 2010, 01:04:23 PM: Actually it's my understanding that the last column of values are that of interest and that the higher the error values, the "worse" it is. I.e. 302541

In any case, the abnormal drive has defective sectors so I would consider replacing it.
Title: Re: SMART report question
Post by: irha on November 01, 2010, 01:31:10 PM: Quote from: jamieburchell on November 01, 2010, 01:04:23 PM
Actually it's my understanding that the last row of values are that of interest and that the higher the error values, the "worse" it is. I.e. 302541

In any case, the abnormal drive has defective sectors so I would consider replacing it.
Thanks for the clarification.

PS: I guess you mean last column of values.
Title: Re: SMART report question
Post by: irha on November 01, 2010, 03:15:13 PM: I already requested an RMA advance replacement from WD, but I am wondering if I should do something right now. Is it better to remove the drive that is reported abnormal in advance and run with the one remaining drive, or should I leave it in there? Also, is it better to keep the NAS shut down until I get the replacement?
Title: Re: SMART report question
Post by: jamieburchell on November 01, 2010, 03:22:09 PM: Yes I meant column :)
I would back up your data somewhere if possible. Not sure about your other questions. I'd probably leave it in; it hasn't totally failed yet, and two up-to-date copies are better than one.
Title: Re: SMART report question
Post by: jamieburchell on November 01, 2010, 03:36:55 PM: You're right about higher normalised values being good and that they shouldn't be lower than the threshold. If those column headings are correct, the first row (error rate) means the drive is failing as the normalised value is 1 and is less than the threshold. There's also an indication of bad sectors.
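To make that comparison concrete, here is a minimal Python sketch of the check. It assumes the report columns really are ID, name, Now, Worst, Thresh and raw value, as in the first post. The function and constant names are made up for illustration, and treating a value equal to the threshold as failing is my assumption (the rule above only says "lower than"). The sample rows are copied from the "Abnormal" report.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    now: int      # normalised value: higher is better
    worst: int    # lowest normalised value seen so far
    thresh: int   # failure threshold for the normalised value
    raw: int      # raw counter (the "Updated" column in the D-Link report)


# Raw counters where any non-zero value hints at bad sectors.
# The choice of attributes is an assumption, not something the NAS reports.
SECTOR_COUNTERS = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def parse_row(line):
    """Parse one row such as '1 Raw_Read_Error_Rate 1 1 51 302541'."""
    attr_id, name, now, worst, thresh, raw = line.split()
    return SmartAttr(int(attr_id), name, int(now), int(worst), int(thresh), int(raw))


def check(attrs):
    """Return a list of human-readable findings for the given attributes."""
    findings = []
    for a in attrs:
        # Normalised value at or below the threshold: attribute is failing.
        if a.now <= a.thresh:
            findings.append(
                f"{a.name}: normalised value {a.now} is at or below threshold {a.thresh}"
            )
        # Non-zero raw sector counters are worth a warning on their own.
        if a.name in SECTOR_COUNTERS and a.raw > 0:
            findings.append(f"{a.name}: raw count {a.raw} suggests bad sectors")
    return findings


# Rows copied from the "Abnormal" drive's report.
ABNORMAL_ROWS = """\
1 Raw_Read_Error_Rate 1 1 51 302541
5 Reallocated_Sector_Ct 200 200 140 0
197 Current_Pending_Sector 192 192 0 1458
198 Offline_Uncorrectable 192 192 0 1402
"""

if __name__ == "__main__":
    for finding in check(parse_row(row) for row in ABNORMAL_ROWS.splitlines()):
        print(finding)
```

Run against the "Normal" drive's rows, the same check should come back empty, which matches what the web UI shows.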
Title: Re: SMART report question
Post by: irha on November 01, 2010, 05:29:38 PM: Thanks jamieburchell!

I am also wondering if one of the drives failing could cause the read speeds to go down drastically. Currently, the read speeds vary anywhere from just a few KB/s to sometimes up to 10 MB/s and are very unpredictable. I actually never got good speeds with this NAS (mostly an average of 2.5 to 3 MB/s) but didn't have time to diagnose it until now. So far I have verified that the connections from my PC and the DNS-321 to the gigabit switch are indeed at gigabit speed (my switch lights two LEDs if the speed is gigabit), so it looks like the problem is not with the networking and could have been due to the bad disk all along.
Title: Re: SMART report question
Post by: jamieburchell on November 02, 2010, 03:12:36 AM: Yes it could be, what with the read error rate being so high. If you did pull the bad disk out you might see the speed improve but I don't know that for sure.
Title: Re: SMART report question
Post by: irha on November 05, 2010, 12:40:02 PM: I got my replacement drive and put it in after shutting down the NAS. During startup, the NAS took me through a couple of pages to initialize the drive, and then the process got stuck with the formatting progress bar at 0% and the NAS unresponsive to the web UI and ssh. I waited overnight and it was the same, so I restarted the NAS using the power switch, but the same behavior repeats (i.e., the same wizard at startup and the same unresponsiveness later). What is going wrong here? I would appreciate any help. Thanks
Title: Re: SMART report question
Post by: jamieburchell on November 05, 2010, 03:11:24 PM: Try a factory reset first, methinks.
Are you also certain that you replaced the correct drive? Which drive did you go for? Have you tried checking the drive with the manufacturer's diagnostics to ensure it's healthy?
Title: Re: SMART report question
Post by: irha on November 05, 2010, 04:31:50 PM: I replaced the left side one, and confirmed the s/no to be the same as the one I RMA'ed. I didn't test the drive, but I am sure the one inside is the one reported as "Normal". Could doing a factory reset make the NAS forget how to rebuild the array? If the only option after a reset is to build a fresh array together, then that would cause a loss of my data, so I am a little worried.
Title: Re: SMART report question
Post by: gunrunnerjohn on November 06, 2010, 07:13:44 AM: A factory reset should not lose the data, I've done that exact procedure with a RAID-1 array.
Title: Re: SMART report question
Post by: irha on November 06, 2010, 06:19:35 PM: OK, I went ahead and did a factory reset and logged in with no password. I got the same screen, but I skipped it, went into the status page and verified the status. I then went into the RAID page and opted to rebuild the RAID-1 array, and it is now stuck at the same screen as before. The device is now unresponsive. One thing I failed to mention is that in this state, I don't see any drive activity (as per the LEDs) taking place. Here is a screenshot of this screen:
(http://img215.imageshack.us/img215/5013/dns321raidhang.jpg)
Title: Re: SMART report question
Post by: gunrunnerjohn on November 06, 2010, 06:22:35 PM: I think I'm out of ideas here. My take would be to remove the drive and connect it to a PC and recover the data before doing anything else. :)
Title: Re: SMART report question
Post by: irha on November 06, 2010, 08:06:50 PM: If you clone the entire drive by connecting both drives to a PC, would the DNS-321 treat them as being in sync? The other option is to ssh in and use command-line tools to rebuild the array. At least that would probably give a better idea of where things are failing, but I am not sure if the DNS-321 uses the standard Linux RAID configuration and tools, or whether they are accessible from the command line.
Title: Re: SMART report question
Post by: gunrunnerjohn on November 06, 2010, 08:59:29 PM: I would not screw around if the data on the drive is your only copy...
Title: Re: SMART report question
Post by: irha on November 06, 2010, 11:32:02 PM: I am following this thread (http://forums.dlink.com/index.php?topic=11903.0) now to do a manual resync. I am monitoring /proc/mdstat for progress and will continue with the rest of the steps and report back.

PS: I had to get the fdisk2 attached to this post (http://forum.dsmg600.info/viewtopic.php?id=5729) to get the manual partitioning to work, as the fdisk that is already there was crashing.
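For anyone wanting to watch the resync without rereading the file by eye, here is a small Python sketch that pulls the progress figures out of /proc/mdstat text. The sample below is made up in the usual mdstat layout, not copied from this NAS, and the names are hypothetical. Whether Python is even available on the DNS-321 is an assumption; the function works just as well on a PC against output pasted from an ssh session.

```python
import re

# Matches the progress line md prints during a rebuild, for example:
#   [=>...................]  recovery =  8.5% (41383296/486544512) finish=...
PROGRESS_RE = re.compile(r"(recovery|resync)\s*=\s*([\d.]+)%\s*\((\d+)/(\d+)\)")

# Illustrative sample only; device names and block counts are invented.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb2[2] sda2[0]
      486544512 blocks [2/1] [U_]
      [=>...................]  recovery =  8.5% (41383296/486544512) finish=95.3min speed=77800K/sec

unused devices: <none>
"""


def rebuild_progress(mdstat_text):
    """Return (kind, percent, done_blocks, total_blocks), or None if no rebuild is running."""
    match = PROGRESS_RE.search(mdstat_text)
    if match is None:
        return None
    kind, percent, done, total = match.groups()
    return kind, float(percent), int(done), int(total)


if __name__ == "__main__":
    print(rebuild_progress(SAMPLE))
```

On the device itself, the text would come from reading /proc/mdstat instead of the SAMPLE string; a result of None means no recovery or resync line was found.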
Title: Re: SMART report question
Post by: jamieburchell on November 07, 2010, 02:25:58 AM: A potentially easier and safer approach would be:

Back up data somewhere
Remove both drives, attach to PC and erase all partitions/zero fill
Check drives for errors/surface scan
Factory reset
Attach both drives in NAS
Format as RAID1
Copy data back

If any of those steps fail, you've got a problem somewhere.
Title: Re: SMART report question
Post by: irha on November 07, 2010, 09:32:06 AM: I agree, it is a risky operation without having a backup copy. In my case, since I just started consolidating misc. external drives to the NAS solution, I won't have a problem starting over (though I will lose some time).

I have now successfully finished the manual process. Here is what I did:
1. I downloaded fdisk2 from the thread I previously pointed out.
2. Created partitions in the same order and size as on the existing drive.
3. Manually added the partition to the array and waited for that to finish.
-- /usr/sbin/mdadm --manage --add /dev/md0 /dev/sdx2
4. Manually created ext3 filesystems on the other two partitions
-- /ffp/sbin/mke2fs -j /dev/sdx1
-- /ffp/sbin/mke2fs -j /dev/sdx4
5. Mounted the sdx4 partition
-- mount /dev/sdx4 /mnt/HD_x4
6. Forced an update of the hd_magic_num file for the web UI
-- /usr/sbin/hd_verify -w

Now the UI seems to be happy about the state of the array. My only concern is about steps 4 and 5, which nobody seems to have mentioned as necessary. I am not sure if I am missing any more steps, e.g., should I be adding an fstab entry somewhere?
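In case it helps someone repeat this, here is a rough Python sketch that only prints the commands from steps 3 to 6 for a given drive letter; it does not run anything. It assumes the "x" in sdx and HD_x4 stands for the same drive letter in both places, which is how I read my own notes but is worth checking on your unit. The function name is made up.

```python
def rebuild_commands(letter):
    """Build the command list from steps 3-6 for drive /dev/sd<letter>.

    Nothing is executed. The mdadm resync from the first command has to
    finish (watch /proc/mdstat) before moving on to the rest.
    """
    if len(letter) != 1 or not letter.isalpha() or not letter.islower():
        raise ValueError("expected a single lowercase drive letter, e.g. 'b'")
    dev = f"/dev/sd{letter}"
    return [
        # Step 3: add the data partition back into the RAID-1 array.
        f"/usr/sbin/mdadm --manage --add /dev/md0 {dev}2",
        # Step 4: create ext3 filesystems on the other two partitions.
        f"/ffp/sbin/mke2fs -j {dev}1",
        f"/ffp/sbin/mke2fs -j {dev}4",
        # Step 5: mount the fourth partition (mount point name is an assumption).
        f"mount {dev}4 /mnt/HD_{letter}4",
        # Step 6: rewrite hd_magic_num so the web UI accepts the drive.
        "/usr/sbin/hd_verify -w",
    ]


if __name__ == "__main__":
    for command in rebuild_commands("b"):
        print(command)
```

Printing first and pasting by hand keeps a typo in the drive letter from formatting the wrong disk.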