Practical case

On a server configured as RAID 10 with 4 hard disks, the following message appears on the screen:

BTRFS error (device sda1): bdev /dev/sdf1 errs: wr 51956436, rd 28739183, flush 124601, corrupt 0, gen 0
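
The fields after `errs:` are cumulative per-device counters for write, read, flush, corruption and generation errors. As a minimal sketch, assuming the message format shown above, the counters can be pulled out of a log line with `sed`; the function name `btrfs_errs_summary` is hypothetical and not part of btrfs-progs.

```shell
# Hypothetical helper (not part of btrfs-progs): summarise the error
# counters found in a "BTRFS error ... errs:" kernel log line.
# Assumes the exact message layout shown in this case.
btrfs_errs_summary() {
    # $1: a single kernel log line
    printf '%s\n' "$1" | sed -n \
        's/.*bdev \([^ ]*\) errs: wr \([0-9]*\), rd \([0-9]*\), flush \([0-9]*\), corrupt \([0-9]*\), gen \([0-9]*\).*/\1 write=\2 read=\3 flush=\4 corrupt=\5 gen=\6/p'
}
```

The same counters can also be read directly from the mounted file system with `sudo btrfs device stats /media/btrfs/`.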

The message tells us there is a problem with the file system. Running a few checks:

 sudo btrfs scrub start /media/btrfs/
 scrub started on /media/btrfs/, fsid bb600f14-9fbb-4f27-af33-95c6ac1975fe (pid=16546)
 sudo btrfs scrub status /media/btrfs/
   scrub status for bb600f14-9fbb-4f27-af33-95c6ac1975fe
	scrub started at Mon Nov 18 13:54:43 2019, running for 00:15:05
	total bytes scrubbed: 84.04GiB with 5340488 errors
	error details: read=5340485 super=3
	corrected errors: 0, uncorrectable errors: 5340485, unverified errors: 0

Indeed, the scrub finds many errors that it cannot fix.
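
For scripting or monitoring, the uncorrectable count can be extracted from the status output. This is a minimal sketch that assumes the output format shown above (other btrfs-progs versions may format the status differently); the function name `scrub_uncorrectable` is hypothetical.

```shell
# Hypothetical helper: read `btrfs scrub status` output on stdin and
# print the number of uncorrectable errors.
# Assumes the "uncorrectable errors: N" wording shown in this case.
scrub_uncorrectable() {
    sed -n 's/.*uncorrectable errors: \([0-9]*\).*/\1/p'
}
```

It would be used as `sudo btrfs scrub status /media/btrfs/ | scrub_uncorrectable`. Adding `-d` to `btrfs scrub status` prints separate statistics for each device, which helps confirm which disk the errors come from.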

Running smartctl on the disk that appears to be failing:

sudo smartctl /dev/sdf -a
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-70-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke,

Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST2000DM001-1ER164
Serial Number:    W4Z19BCP
LU WWN Device Id: 5 000c50 07d44b288
Firmware Version: CC25
User Capacity:    2.000.398.934.016 bytes [2,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Nov 18 15:53:14 2019 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(   89) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 219) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x1085)	SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       162249504
  3 Spin_Up_Time            0x0003   100   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       175
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   079   060   030    Pre-fail  Always       -       89979761
  9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       16693
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       85
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   089   000    Old_age   Always       -       1 2 12
189 High_Fly_Writes         0x003a   099   099   000    Old_age   Always       -       1
190 Airflow_Temperature_Cel 0x0022   064   048   045    Old_age   Always       -       36 (Min/Max 30/38)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       44
193 Load_Cycle_Count        0x0032   067   067   000    Old_age   Always       -       67287
194 Temperature_Celsius     0x0022   036   052   000    Old_age   Always       -       36 (0 9 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   069   000    Old_age   Always       -       262
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       16485h+48m+15.949s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       19632474908
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1261379460714

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
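
In a long report like this one, only a few attributes usually matter for a quick diagnosis: Reallocated_Sector_Ct, Current_Pending_Sector and Offline_Uncorrectable for the media, and UDMA_CRC_Error_Count for the link between disk and controller. The sketch below assumes the attribute table layout shown above, where the raw value starts at the tenth column; the function name `smart_raw` is hypothetical.

```shell
# Hypothetical helper: read smartctl attribute rows on stdin and print
# the raw value of the attribute named in $1.
# Only the first whitespace-separated token of the raw value is printed,
# so "36 (Min/Max 30/38)" yields "36".
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}
```

For example, `sudo smartctl -A /dev/sdf | smart_raw UDMA_CRC_Error_Count`. Since the report shows that no self-tests have been logged, a short test can also be started with `sudo smartctl -t short /dev/sdf`.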

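SMART reports that the disk is not entirely healthy but still works: the overall self-assessment is PASSED and there are no reallocated or pending sectors, although UDMA_CRC_Error_Count stands at 262, which often points to the cable or connection rather than the disk surface. The next step is therefore to try to repair the file system: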

btrfs rescue zero-log /dev/sdf1       # If the error comes from transactions interrupted by, for example, a power cut.
btrfs rescue super-recover /dev/sdXY  # If the superblock is missing, recovers it from one of the copies kept on the disk.
btrfs rescue chunk-recover /dev/sdXY  # If the metadata is failing and there are unknown files. Slow and dangerous.
btrfs check --repair /dev/sdXY        # To repair the file system.