SQL Server Corruption Post Mortem

· 2 min read
Table of Contents

“No matter what method you use to recover from corruption, you should always determine why it happened to avoid future problems.” – Paul Randal

Most Likely Causes

The most likely causes are outlined below, in order of likelihood:

  1. Run I/O subsystem and server memory diagnostics (almost always the cause)
  2. Windows OS
  3. File system filter drivers – e.g. antivirus, defraggers, encryption
  4. Network cards, switches, cables
  5. Memory corruption
    • Bad memory chips
    • Scribblers
  6. SAN controllers
  7. RAID controllers
  8. Disks

Examine the Logs

Check the SQL Server error log and Windows event logs for clues:

  • Error 823: A hard I/O error (when Windows can’t return the data to SQL)
  • Error 824: A soft I/O error (when SQL detects there is a problem with the data it was given by Windows)
  • Error 825: A read-retry error (these show as information alerts however they are critical because they are impending-doom warning signs)
  • More information: Microsoft KB 2015757

Also:

  • Check that the firmware is up-to-date
  • Investigate NTFS filter drivers

What Does NOT Cause Corruption

It is important to remember that corruptions are not caused by the following:

  • Anything an application can do
  • Anything you can do in SQL Server with supported, documented commands
  • Interrupting a database shrink, index rebuild, or long-running batch
  • Shutting down SQL Server

Early Detection Strategies

If corruption happened, it is likely that certain steps are not being done inside your organization to protect your data. Here is a list of what can be done to help detect early signs:

  1. Implement error alerts for Severity 19 and above errors – see Configuring Alerts for High Severity Problems
  2. Implement Page Verify – Checksum option – see Setting Your Page Verify Database Option to Checksum
  3. Implement backups with Checksum option
    • All databases INCLUDING system databases
    • Allows you to use RESTORE VERIFYONLY with CHECKSUM to validate your backups
  4. Implement an integrity check script – either Ola Hallengren’s Maintenance Solution (widely used) or at minimum a DBCC CHECKDB WITH NO_INFOMSGS job to check integrity

Comments