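During a recent core-system migration, our NetApp storage behaved unexpectedly: although front-end load was not particularly high, storage CPU utilization exceeded 55% and read throughput reached 1 GB/s.
Because we could not identify the source of the 1 GB/s of reads, the project had to be rolled back, and more than 50 people lost a night's work. We later found the cause during a NetApp check: it was the storage system's own self-check. The NetApp disk scrub ("NetApp DISK SCRUB") starts by default at 1:00 a.m. on Sunday and runs for six hours, which collided with our migration window. Here is a summary:
At the time, load on both controller heads, A and B, spiked to 60% simultaneously, and reads on each exceeded 1 GB/s. Load on head A was higher than on head B. This is because the system used head B as its primary head, and during the scrub NetApp dynamically reduced the scan load on head B, which was actively serving data.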
It’s a well-known fact in the storage world that firmware bugs (and sometimes hardware and data path problems) can cause silent data corruption; the data that ends up on disk is not the data that was sent down the pipe. To protect against this, when Data ONTAP writes data to disk, it creates a checksum for each 4kB block that is stored as part of the block’s metadata. When data is later read from disk, the checksum is recalculated and compared to the stored checksum. If they are different, the requested data is recreated from parity. In addition, the data from parity is rewritten to the original 4kB block, then read back to verify its accuracy.
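The read path described above (verify the checksum, rebuild from parity on a mismatch, rewrite, then read back) can be illustrated with a small sketch. This is not Data ONTAP's implementation: CRC32 stands in for the unspecified checksum, single XOR parity stands in for the RAID layer, and the names `read_block` and `scrub` are hypothetical.

```python
import zlib
from functools import reduce

BLOCK_SIZE = 4096  # 4 kB blocks, as in the description above


def checksum(block: bytes) -> int:
    # Assumption: CRC32 is only a stand-in; the real checksum is not specified.
    return zlib.crc32(block)


def xor_blocks(blocks):
    # XOR equal-length blocks byte by byte (single-parity illustration).
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def read_block(stripe, checksums, parity, index):
    """Return block `index`, rebuilding it from parity if its checksum fails."""
    block = stripe[index]
    if checksum(block) == checksums[index]:
        return block
    # Checksum mismatch: recreate the data from the other blocks plus parity.
    others = [b for i, b in enumerate(stripe) if i != index]
    rebuilt = xor_blocks(others + [parity])
    stripe[index] = rebuilt  # rewrite the repaired data to the original block
    # Read it back and verify before handing it to the caller.
    if checksum(stripe[index]) != checksums[index]:
        raise IOError("block could not be repaired from parity")
    return rebuilt


def scrub(stripe, checksums, parity):
    """Read every block and return the indices that had to be repaired."""
    repaired = []
    for i in range(len(stripe)):
        before = stripe[i]
        if read_block(stripe, checksums, parity, i) != before:
            repaired.append(i)
    return repaired
```

A scrub in this model is simply a full read pass, which is why it generates heavy read traffic even when no client is asking for data.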
To ensure the accuracy of archive data that may remain on disk for long periods without being read, NetApp offers the configurable RAID scrub feature. A scrub can be configured to run when the system is idle and reads every 4kB block on disk, triggering the checksum mechanism to identify and correct hidden corruption or media errors that may occur over time. This proactive diagnostic software promotes self-healing and general drive maintenance.
To NetApp, rule number 1 is to protect our customer data at all costs. Protection against firmware-induced silent data corruption is an example of NetApp’s continuing focus on developing innovative storage resiliency features to ensure the highest level of data integrity.
How you schedule automatic RAID-level scrubs
By default, Data ONTAP performs a weekly RAID-level scrub starting on Sunday at 1:00 a.m. for a duration of six hours. You can change the start time and duration of the weekly scrub, add more automatic scrubs, or disable the automatic scrub.
To schedule an automatic RAID-level scrub, you use the raid.scrub.schedule option.
To change the duration of automatic RAID-level scrubbing without changing the start time, you use the raid.scrub.duration option, specifying the number of minutes you want automatic RAID-level scrubs to run. If you set this option to -1, all automatic RAID-level scrubs run to completion.
Note: If you specify a duration using the raid.scrub.schedule option, that value overrides the value you specify with this option.
To enable or disable automatic RAID-level scrubbing, you use the raid.scrub.enable option.
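Given the incident above, it can be worth checking a planned change window against the scrub window before the work starts. The following is a minimal sketch, assuming naive datetimes in the storage controller's local time; the function name `overlaps_scrub` and its parameters are hypothetical, and the defaults mirror the documented default schedule (Sunday, 1:00 a.m., six hours).

```python
from datetime import datetime, time, timedelta


def overlaps_scrub(start, end, weekday=6, hour=1, minutes=360):
    """Return True if [start, end) overlaps a weekly scrub window.

    weekday follows datetime.weekday(): Monday is 0, Sunday is 6.
    """
    duration = timedelta(minutes=minutes)
    # Find the scrub weekday on or before `start`, then step back one more
    # week so a long scrub that began earlier is still considered.
    days_back = (start.weekday() - weekday) % 7
    first_day = start.date() - timedelta(days=days_back + 7)
    scrub_start = datetime.combine(first_day, time(hour))
    while scrub_start < end:
        if scrub_start + duration > start:
            return True
        scrub_start += timedelta(days=7)
    return False


if __name__ == "__main__":
    # A migration from Sunday 00:00 to 04:00 collides with the default scrub.
    print(overlaps_scrub(datetime(2024, 3, 3, 0, 0), datetime(2024, 3, 3, 4, 0)))
```

If a custom schedule is configured with raid.scrub.schedule, each of its entries would need to be checked the same way.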
Scheduling example
The following command schedules two weekly RAID scrubs. The first scrub is for 240 minutes (four hours) every Tuesday starting at 2 a.m. The second scrub is for eight hours every Saturday starting at 10 p.m.
options raid.scrub.schedule 240m@tue@2,8h@sat@22
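To make the structure of the schedule value easier to read, here is a small parser sketch. It assumes each comma-separated entry has the form duration (with an `h` or `m` suffix), a three-letter weekday, and a start hour from 0 to 23, as the example above suggests; the full grammar accepted by Data ONTAP may differ, and `ScrubWindow` and `parse_scrub_schedule` are hypothetical names.

```python
import re
from typing import NamedTuple

WEEKDAYS = ("sun", "mon", "tue", "wed", "thu", "fri", "sat")
# Assumed entry shape: <number><h|m>@<weekday>@<start hour>
ENTRY = re.compile(r"^(\d+)([hm])@([a-z]{3})@(\d{1,2})$")


class ScrubWindow(NamedTuple):
    weekday: str
    start_hour: int
    duration_minutes: int


def parse_scrub_schedule(value: str):
    """Parse a raid.scrub.schedule value into a list of ScrubWindow entries."""
    windows = []
    for entry in value.split(","):
        match = ENTRY.match(entry.strip().lower())
        if match is None:
            raise ValueError(f"unrecognized schedule entry: {entry!r}")
        amount, unit, day, hour = match.groups()
        if day not in WEEKDAYS or int(hour) > 23:
            raise ValueError(f"invalid weekday or hour in: {entry!r}")
        # Normalize hours to minutes so all durations are comparable.
        minutes = int(amount) * (60 if unit == "h" else 1)
        windows.append(ScrubWindow(day, int(hour), minutes))
    return windows


if __name__ == "__main__":
    for window in parse_scrub_schedule("240m@tue@2,8h@sat@22"):
        print(window)
```

For the example command, this yields a 240-minute window on Tuesday at hour 2 and a 480-minute window on Saturday at hour 22.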
Verification example
The following command displays your current RAID-level automatic scrub schedule. If you are using the default schedule, nothing is displayed.
options raid.scrub.schedule
Reverting to the default schedule example
The following command reverts your automatic RAID-level scrub schedule to the default (Sunday at 1:00 a.m., for six hours):
options raid.scrub.schedule " "