Tuesday, November 23, 2004

Why Macs are so vulnerable to bad RAM

MacInTouch Home Page

Because they've missed a step in computer evolution:
I manage about 150 Macs in a creative agency. Over the last year and a half I have noticed a sharp increase in the discovery of bad RAM.

My fifty or so G5s (all dual 1.8 or 2.0) have been subject to about 5 instances of bad RAM. That's a ridiculously high percentage (granted we have 4 DIMMs in each, but please...). I don't understand why this would suddenly become such a bigger problem. We have more mirrored-door machines (and more DIMMs) and don't have anywhere near this level of trouble. I pull RAM from machines at the first sign of multiple kernel panics now. I never used to think that way, but if a user is getting panics, the odds are these days that it's the hardware, not my system.

What's more (and most importantly), none of the available utilities diagnose the bad DIMMs. I have to send them to a break/fix shop with a hardware-based RAM tester to see if the RAM is OK. I recently ordered 4 GB from CDW and immediately sent it to the shop for a check. 1 of the 8 DIMMs was bad. I'm now pricing a RAM tester to use in-house so I can rest assured about what I'm putting in my machines.

The bottom line is that this is a major quality concern that both Apple and the VARs need to take more seriously. Aren't they testing this stuff themselves? Why does it seem like G5 RAM is much more prone to problems? My main point is to check that stuff (with a hardware-based diagnostic) and don't be surprised to find your OS is fine but your RAM is not.

[The Xserve G5 is the only Mac that bothers to use ECC memory to avoid this pernicious problem. Here's Apple's description from the Xserve G5 Architecture page. -MacInTouch:

Xserve G5 uses Error Correction Code (ECC) logic to protect the system from corrupt data and transmission errors. Each DIMM has an extra memory module that stores checksum data for every transaction. The system controller uses this ECC data to identify single-bit errors and corrects them on the fly, preventing unplanned system shutdowns. In the rare event of multiple-bit errors, the system controller detects the error and triggers a system notification to prevent bad data from corrupting further operations. You can set the Server Monitor software to alert you if error rates exceed the defined threshold.]
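To make the "correct single-bit errors, detect multiple-bit errors" behavior in Apple's description concrete, here is a minimal Python sketch of a Hamming code with one extra overall parity bit (often called SECDED). It is only an illustration under stated assumptions: it protects a 4-bit value with 4 check bits, the names encode and decode are made up for this example, and it is not Apple's implementation. Real ECC memory does the equivalent in hardware across a much wider data word.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit codeword (list of 0/1 values).

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions, with check bits at 1, 2 and 4.
    """
    code = [0] * 8
    data = [(nibble >> i) & 1 for i in range(4)]  # data[0] is the low bit
    code[3], code[5], code[6], code[7] = data
    # The check bit at position p covers every position whose index has bit p set.
    for p in (1, 2, 4):
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    # The overall parity bit makes the total number of 1s even.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)  # work on a copy
    # The syndrome is the index of the flipped bit if exactly one bit in 1..7 flipped.
    syndrome = 0
    for p in (1, 2, 4):
        if sum(code[i] for i in range(1, 8) if i & p) % 2:
            syndrome |= p
    overall_odd = sum(code) % 2 == 1

    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd number of flips: treat it as a single-bit error and fix it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                 # simulate one bad bit
    print(decode(word))          # (11, 'corrected')
    word[2] ^= 1                 # simulate a second bad bit
    print(decode(word))          # (None, 'uncorrectable')
```

Without the check bits, both of those flipped words would simply be read back as wrong data with no warning, which is the situation non-ECC memory leaves you in.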

PCs use ECC memory, so vendors know the ECC will catch errors of a certain type, and it's no longer cost-effective to prevent those errors from occurring in the first place. This makes sense: you get more reliable memory for less money.

The problem, though, is that Macs don't use ECC. So they get the cheaper, less reliable memory without the compensating mechanism. Bad news.
