Is the HD about to die, or is this some other kind of log?

The One


Greetings, colleagues. I have some messages in the “/var/log/messages” log that look like this:

Mar 15 18:13:27 corp-190-12-4-250-cue kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Mar 15 18:13:27 corp-190-12-4-250-cue kernel: ata1.00: cmd c8/00:08:45:7b:32/00:00:00:00:00/e1 tag 0 cdb 0x0 data 4096 in
Mar 15 18:13:27 corp-190-12-4-250-cue kernel: res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 15 18:13:27 corp-190-12-4-250-cue kernel: ata1: soft resetting port
Mar 15 18:13:29 corp-190-12-4-250-cue kernel: ata1.00: configured for UDMA/100
Mar 15 18:13:29 corp-190-12-4-250-cue kernel: ata1: EH complete
Mar 15 18:13:29 corp-190-12-4-250-cue kernel: SCSI device sda: 156301488 512-byte hdwr sectors (80026 MB)
Mar 15 18:13:29 corp-190-12-4-250-cue kernel: sda: Write Protect is off
Mar 15 18:13:29 corp-190-12-4-250-cue kernel: SCSI device sda: drive cache: write back

I'm not sure whether I'm right, but it looks to me like the error points to one or more bad sectors on the HD. If so, what options do I have to repair it? Maybe running “fsck -fyc” from CentOS disc 1 would help, but what do you recommend? Or will I have to replace this HD?
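
One way to read the kernel lines above is to pull out the error-mask label that libata prints in parentheses on the "res" line. The sketch below is only an illustration in Python: the `LOG` sample is copied from the messages in this post, and `summarize_ata_errors` is a hypothetical helper name, not part of any tool. On this sample the label is `timeout`, meaning the command did not complete in time; that label alone does not show whether the cause is a bad sector, the cable, the controller, or the driver.

```python
import re

# Sample copied from /var/log/messages in the post above.
LOG = """\
Mar 15 18:13:27 corp-190-12-4-250-cue kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Mar 15 18:13:27 corp-190-12-4-250-cue kernel: ata1.00: cmd c8/00:08:45:7b:32/00:00:00:00:00/e1 tag 0 cdb 0x0 data 4096 in
Mar 15 18:13:27 corp-190-12-4-250-cue kernel: res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 15 18:13:27 corp-190-12-4-250-cue kernel: ata1: soft resetting port
"""

# Matches the result line, e.g. "res ... Emask 0x4 (timeout)".
RES_RE = re.compile(r"kernel: res \S+ Emask (0x[0-9a-f]+) \((\w[\w ]*)\)")


def summarize_ata_errors(log_text):
    """Return (timestamp, emask, label) for each libata result line."""
    found = []
    for line in log_text.splitlines():
        match = RES_RE.search(line)
        if match:
            timestamp = line[:15]  # syslog timestamp, "Mar 15 18:13:27"
            found.append((timestamp, int(match.group(1), 16), match.group(2)))
    return found


for timestamp, emask, label in summarize_ata_errors(LOG):
    print(f"{timestamp}: Emask={emask:#x} ({label})")
```

Counting how often such lines appear, and with which label, gives a rough picture of how frequent the resets are before deciding between a filesystem check and a hardware swap.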

The server hangs when least expected. I have already configured smartd to email me so I can see what exactly is going on, but I hope you can help me with a solution.

Thanks for your comments.

I hope it's not a hard disk error

juandarcy2000

But searching on Google, a lot of people talk about this: some suggest disabling ACPI in the BIOS, some mention Intel motherboards, and others blame the kernel. Check the hard disk with a diagnostic tool; I recommend hddregenerator, which has worked very well for me for ruling out logical-level faults on the hard disk.
Try adding this option to the kernel parameters in GRUB.

Search Google, there is plenty of information about that message, and if you find a solution, tell us how you solved it. It would also be a good idea to tell us your Linux version, kernel version, and hardware (motherboard, processor, hard disk).

The One

Greetings, friend, and thanks for your reply.

I already tried disabling ACPI in the BIOS, but it still doesn't work. This problem has been going on for some weeks now: I kept getting the famous kernel panic, the server would hang and nothing could be done. That's why I set up a small configuration so that the server reboots on its own when that KP occurs. The server's specifications are:

AMD Athlon 64
Turion board
1 GB memory
80 GB SATA disk
and it has CentOS 5.1 installed with kernel "2.6.18-53.1.13.el5". I don't have the latest one because when I try to install it I get a kernel panic; I think the latest is "kernel-2.6.18-53.1.14.el5".

And when I run the command "smartctl -a -d ata /dev/sda" I get this:

smartctl version 5.36 [i686-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device Model: ST3808110AS
Serial Number: 5LR6BTV9
Firmware Version: 3.AAH
User Capacity: 80,026,361,856 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 7
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sun Mar 16 17:08:07 2008 ECT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 430) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 27) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
1 Raw_Read_Error_Rate 0x000f 112 088 006 Pre-fail Always - 46413852
3 Spin_Up_Time 0x0003 098 094 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 517
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 078 060 030 Pre-fail Always - 66661092
9 Power_On_Hours 0x0032 098 098 000 Old_age Always - 1901
10 Spin_Retry_Count 0x0013 099 099 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 769
187 Unknown_Attribute 0x0032 083 083 000 Old_age Always - 17
189 Unknown_Attribute 0x003a 100 100 000 Old_age Always - 0
190 Unknown_Attribute 0x0022 068 061 045 Old_age Always - 572522528
194 Temperature_Celsius 0x0022 032 040 000 Old_age Always - 32 (Lifetime Min/Max 0/19)
195 Hardware_ECC_Recovered 0x001a 060 046 000 Old_age Always - 125619585
197 Current_Pending_Sector 0x0012 001 001 000 Old_age Always - 4294967295
198 Offline_Uncorrectable 0x0010 001 001 000 Old_age Offline - 4294967295
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age Offline - 0
202 TA_Increase_Count 0x0032 100 253 000 Old_age Always - 0

SMART Error Log Version: 1
ATA Error Count: 21 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 21 occurred at disk power-on lifetime: 1426 hours (59 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
-- -- -- -- -- -- --
40 51 00 a7 f3 24 e1 Error: UNC at LBA = 0x0124f3a7 = 19198887

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 18 9a f3 24 e1 00 02:11:27.719 READ DMA
ec 00 00 a7 f3 24 a0 00 02:11:27.718 IDENTIFY DEVICE
c8 00 18 9a f3 24 e1 00 02:11:25.817 READ DMA
ec 00 00 a7 f3 24 a0 00 02:11:25.816 IDENTIFY DEVICE
c8 00 18 9a f3 24 e1 00 02:11:23.915 READ DMA

Error 20 occurred at disk power-on lifetime: 1426 hours (59 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
-- -- -- -- -- -- --
40 51 00 a7 f3 24 e1 Error: UNC at LBA = 0x0124f3a7 = 19198887

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 18 9a f3 24 e1 00 02:11:27.719 READ DMA
ec 00 00 a7 f3 24 a0 00 02:11:27.718 IDENTIFY DEVICE
c8 00 18 9a f3 24 e1 00 02:11:25.817 READ DMA
ec 00 00 a7 f3 24 a0 00 02:11:25.816 IDENTIFY DEVICE
c8 00 18 9a f3 24 e1 00 02:11:23.915 READ DMA

Error 19 occurred at disk power-on lifetime: 1426 hours (59 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
-- -- -- -- -- -- --
40 51 00 a7 f3 24 e1 Error: UNC at LBA = 0x0124f3a7 = 19198887

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 18 9a f3 24 e1 00 02:11:27.719 READ DMA
ec 00 00 a7 f3 24 a0 00 02:11:27.718 IDENTIFY DEVICE
c8 00 18 9a f3 24 e1 00 02:11:25.817 READ DMA
ec 00 00 a7 f3 24 a0 00 02:11:25.816 IDENTIFY DEVICE
c8 00 18 9a f3 24 e1 00 02:11:23.915 READ DMA

Error 18 occurred at disk power-on lifetime: 1426 hours (59 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
-- -- -- -- -- -- --
40 51 00 a7 f3 24 e1 Error: UNC at LBA = 0x0124f3a7 = 19198887

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 18 9a f3 24 e1 00 02:11:20.997 READ DMA
ec 00 00 a7 f3 24 a0 00 02:11:20.997 IDENTIFY DEVICE
c8 00 18 9a f3 24 e1 00 02:11:25.817 READ DMA
ec 00 00 a7 f3 24 a0 00 02:11:25.816 IDENTIFY DEVICE
c8 00 18 9a f3 24 e1 00 02:11:23.915 READ DMA

Error 17 occurred at disk power-on lifetime: 1426 hours (59 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
-- -- -- -- -- -- --
40 51 00 a7 f3 24 e1 Error: UNC at LBA = 0x0124f3a7 = 19198887

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 18 9a f3 24 e1 00 02:11:20.997 READ DMA
ec 00 00 a7 f3 24 a0 00 02:11:20.997 IDENTIFY DEVICE
c8 00 18 9a f3 24 e1 00 02:11:20.997 READ DMA
c8 00 18 7a f3 24 e1 00 02:11:20.994 READ DMA
c8 00 08 6a f3 24 e1 00 02:11:23.915 READ DMA

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 1885 -

SMART Selective self-test log data structure revision number 1
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
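
The error entries above can be cross-checked by hand. All five logged errors report UNC (an uncorrectable read) at the same LBA, and in 28-bit LBA addressing that address is assembled from the low four bits of DH followed by CH, CL and SN. A short Python sketch using the register values from the log; `lba28_from_registers` is a hypothetical helper name for illustration, and the register order ER ST SC SN CL CH DH for the "registers were" row is assumed from the usual smartctl layout, since the pasted output lost that header:

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine ATA taskfile registers into a 28-bit LBA."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Row "40 51 00 a7 f3 24 e1", read as ER ST SC SN CL CH DH (assumed order).
sn, cl, ch, dh = 0xA7, 0xF3, 0x24, 0xE1
lba = lba28_from_registers(sn, cl, ch, dh)

# Agrees with the line "UNC at LBA = 0x0124f3a7 = 19198887".
print(f"{lba:#010x} = {lba}")

# Byte offset on the disk, using the 512-byte sectors the kernel reported.
SECTOR_SIZE = 512
print(f"byte offset: {lba * SECTOR_SIZE}")
```

Knowing that every logged error hits one sector helps narrow down which partition and, with further tools, which file sits on it.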

I don't know how to read this, but it seems to be telling me that my HD has 59 days + 10 hours of life left.
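
For what it is worth, that figure in the smartctl error log reads as elapsed time rather than remaining life: "occurred at disk power-on lifetime: 1426 hours (59 days + 10 hours)" is the drive's power-on age when the error was logged. A quick Python check of the arithmetic, using that value and the Power_On_Hours raw value of 1901 from the attribute table (the variable names are illustrative only):

```python
HOURS_AT_LAST_ERROR = 1426  # "Error 21 occurred at disk power-on lifetime"
POWER_ON_HOURS_NOW = 1901   # attribute 9, Power_On_Hours raw value

# Convert the error timestamp into days and hours, as smartctl prints it.
days, hours = divmod(HOURS_AT_LAST_ERROR, 24)
print(f"{HOURS_AT_LAST_ERROR} hours = {days} days + {hours} hours")

# Powered-on time that has passed since that error was recorded.
hours_since_error = POWER_ON_HOURS_NOW - HOURS_AT_LAST_ERROR
print(f"powered-on hours since the last logged error: {hours_since_error}")
```

So the last logged error dates from 1426 power-on hours, while the short self-test in the log above completed without error later, at 1885 hours.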

What other solution can you suggest?

