
October 07, 2006

Identifying Common Reliability/Stability Problems Caused by File Fragmentation

Earlier this year I wrote a white paper on fragmentation's effect on system and application reliability, presenting information about the root cause of the issue. Below is the first section of that paper. You can read the entire document in our Knowledge Center.

An Overview of the Problem

Having all program and data files stored in contiguous form on the hard drive is a key factor in keeping a system stable and performing at peak efficiency. Fragmentation is unavoidable, but the moment a file is broken into pieces and scattered across a drive, it opens the door to a host of stability and reliability issues. Having just a few key files fragmented can lead to crashes, conflicts and errors.

The principle behind fragmentation's impact on system or application reliability is that a requestor or service provider times out while collecting or reassembling fragmented data. This principle holds true for both IP datagram fragmentation and file/disk fragmentation.
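As a rough illustration of the timeout principle, the sketch below estimates how long a read takes when every fragment costs one extra seek, and compares that with a requestor's timeout. The function names, the 9 ms seek time, the 60 MB/s throughput and the 3-second timeout are all assumptions chosen for illustration, not measurements from the paper.

```python
def estimated_read_ms(size_mb, fragments, seek_ms=9.0, throughput_mb_s=60.0):
    """Rough read time: one seek per fragment plus sequential transfer time.

    seek_ms and throughput_mb_s are assumed values, not measured ones.
    """
    transfer_ms = size_mb / throughput_mb_s * 1000.0
    return fragments * seek_ms + transfer_ms


def times_out(size_mb, fragments, timeout_ms):
    """True if the estimated read would exceed the requestor's timeout."""
    return estimated_read_ms(size_mb, fragments) > timeout_ms


if __name__ == "__main__":
    timeout_ms = 3000.0  # hypothetical requestor timeout
    for fragments in (1, 500):
        ms = estimated_read_ms(60, fragments)
        print(fragments, "fragment(s):", ms, "ms, times out:",
              times_out(60, fragments, timeout_ms))
```

Under these assumed numbers the same 60 MB file is read comfortably within the timeout when contiguous, but misses it when split into 500 fragments. Real drives, caches and schedulers behave far less simply than this model.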

Many system and application breakage points can be defined as stress exerted on buffers to the point of overflow/overrun. DoS attacks are well-documented examples of exploiting IP datagrams, but far less information is available on reliability considerations in the case of file objects. A good overview of the effect of stress when requesting file objects comes from a Microsoft Knowledge Base article, which states: "The Server service cannot process the requested network I/O items to the hard disk quickly enough to prevent the Server service from running out of resources."

Disk fragmentation is often the straw that breaks the camel's back where stability or reliability issues are concerned. Stressed I/O activity, compounded by fragmentation, can expose faulty device drivers or file filters that may otherwise operate effectively (in non-fragmented environments). The reliability of third-party applications is highly dependent on the degree to which those applications can accommodate bottlenecks, such as in disk subsystems.

The point at which application or system stability is compromised is difficult, if not impossible, to calculate. It depends on the combination of hardware, software and operations in progress at the moment of instability. A poorly written driver or file filter can be exposed in some environments but not in others, and the amount of fragmentation required to reach critical mass on a specific file or files will vary greatly depending on all the other variables involved.

This issue can be illustrated by a closer look at asynchronous I/O. Asynchronous I/O exists to compensate for conditions that may rule out synchronous I/O (e.g. I/O is much slower than data processing). The alternative to handling I/O asynchronously is to "block" other I/O, which generally offers lower performance.

Here is an example: a Win32 application either creates an I/O completion port, uses overlapped I/O with a completion routine, or calls the WaitForSingleObject / WaitForMultipleObjects APIs at the time of thread creation. In any case where the wait period is exceeded (e.g. queued I/O is paged to disk), a failure can occur. As suggested, low available memory (non-paged pool) can exacerbate failures as it re-introduces the physical disk into the equation. Instead of failing, an application can use extended queuing/waiting and proper exception handling to mitigate issues, at the expense of lower performance (operations take longer) or increased system resource requirements.
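The wait-with-timeout pattern can be sketched in Python as a loose analogy. This is not Win32 code: threading.Event.wait stands in for a call such as WaitForSingleObject, and the simulated_read and request_with_timeout helpers, along with the delay and timeout values, are hypothetical names and numbers made up for illustration.

```python
import threading
import time


def simulated_read(fragments, per_fragment_s, done):
    """Hypothetical stand-in for a disk read: one delay per fragment."""
    time.sleep(fragments * per_fragment_s)
    done.set()  # signal completion, like an event object being set


def request_with_timeout(fragments, per_fragment_s=0.01, timeout_s=0.2):
    """Start the simulated read and wait for it, giving up after timeout_s."""
    done = threading.Event()
    worker = threading.Thread(
        target=simulated_read,
        args=(fragments, per_fragment_s, done),
        daemon=True,  # do not keep the process alive for an abandoned read
    )
    worker.start()
    # Event.wait returns True if the event was set before the timeout expired.
    if done.wait(timeout_s):
        return "completed"
    return "timed out"


if __name__ == "__main__":
    print(request_with_timeout(fragments=1))    # short read
    print(request_with_timeout(fragments=100))  # heavily fragmented read
```

A caller that gets "timed out" has the choice described above: treat it as a failure, or wait longer and handle the delay gracefully at the cost of slower operations.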

"The problem we were having was the server would get so busy that it would stop processing I/O requests and network traffic would just hang. Working with Microsoft and Compaq we concluded it was due to fragmentation. When we installed Diskeeper it resolved the problem overnight." -Mike N, System Administrator, John Deere

Failure to routinely address fragmentation, or to understand its role in helping to cause these problems, results in increased workloads for IT staff as they attempt to troubleshoot and identify the source of problems. This frequently leads to common and often unnecessary actions such as reinstalling software, re-imaging hard drives, replacing hardware at great expense and implementing unnecessary work-arounds, as well as overwork at the Help Desk. Forcing IT to work reactively on problems increases IT costs and adversely affects user productivity due to unacceptable levels of downtime.

Reliability and Stability Issues Traceable to Disk Fragmentation

The most common problems caused by file fragmentation are:
a) Crashes and system hangs/freezes
b) Slow boot-up and computers that will not boot
c) Slow backup times and aborted backups
d) File corruption and data loss
e) Errors in programs
f) RAM and cache problems
g) Hard drive failures

Posted by Michael at October 7, 2006 01:15 AM
