The CPU need not have referenced the particular word that triggered the fault. Thus, when HWPOISON is coupled with the appropriate fault-tolerant processors, Linux users can enjoy systems that are more tolerant to memory errors in spite of increased memory densities. The OS can then take appropriate action, like killing the process with the corrupted data or logging the event properly to disk. Yes, that’s the scenario in the sentences I excerpted from the article.


At first glance, an obvious solution for the poison handler would focus on the specific process and memory address(es) associated with the data error. The OS marks the memory as poisoned, or discards the contents of the page if it was clean.

A longer ECC code can detect and correct more bits. The handler must allow for multiple poisoning events occurring in a short time window. If I’m not mistaken, that’s the processor family this article was referring to. The machine check is action optional and it can do just as you suggest.
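To make the difference between a corrected and an uncorrected error concrete, here is a toy sketch of a single-error-correcting, double-error-detecting (SECDED) code over a 4-bit value. It is an extended Hamming(8,4) code chosen purely for illustration; the function names `encode` and `decode` are hypothetical, and the ECC used by real memory controllers is wider and not described in this article. A double-bit flip is the kind of error the code can only detect, which is the case where poisoning becomes relevant.

```python
def encode(nibble):
    """Encode a 4-bit value into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions (parity bits at 1, 2 and 4).
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total parity even
    return code


def decode(code):
    """Return (value, status); status is 'ok', 'corrected' or 'uncorrected'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_odd = sum(code) % 2 == 1

    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd number of flips: treat it as a single-bit error and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        # The error is detected but the data cannot be trusted.
        return None, "uncorrected"

    value = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return value, status
```

Flipping one bit of `encode(0b1011)` and decoding gives the original value back with status `corrected`; flipping two bits yields `(None, "uncorrected")`, which in this toy model stands in for the hardware reporting an uncorrected error instead of silently returning bad data.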

The poisoned bit in the flags field serves as a lock, allowing rapid-fire poisoning machine checks on the same page to be handled only once by ignoring subsequent calls to the handler. This allows system software to perform recovery actions on certain classes of uncorrected errors and continue execution. Potentially corrupted processes can then be located by finding all processes that have the corrupted page mapped. Er, maybe I’m missing the thrust of your question, but I thought it was sort of straightforward: dirty pages are unmapped from all associated processes, which are subsequently killed.
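The "flag as a lock" idea can be sketched as an atomic test-and-set: the first handler call that flips the poisoned flag does the work, and every later call for the same page returns immediately. The model below is a user-space illustration only; `Page`, `handle_poison_event` and `handled_log` are hypothetical names, and a `threading.Lock` stands in for whatever atomic bit operation the kernel actually uses.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Page:
    pfn: int                                      # page frame number
    mappers: set = field(default_factory=set)     # pids that map this page
    poisoned: bool = False
    _guard: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def test_and_set_poisoned(self) -> bool:
        """Atomically set the poisoned flag and return its previous value."""
        with self._guard:
            previous = self.poisoned
            self.poisoned = True
            return previous


handled_log = []  # (pfn, pids) entries, one per page actually handled


def handle_poison_event(page: Page) -> bool:
    """Handle a poisoning event; return True only for the first call per page."""
    if page.test_and_set_poisoned():
        return False  # already poisoned: ignore the repeated machine check
    # Locate potentially corrupted processes: everyone who maps the page.
    handled_log.append((page.pfn, sorted(page.mappers)))
    return True
```

Calling `handle_poison_event` from several threads on the same `Page` appends exactly one entry to `handled_log`, regardless of how the calls interleave.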



Try to keep everything running as smoothly as possible, bringing down only the affected tasks, if any. Originally developed for ia64 processors, Intel’s MCA Recovery architecture supports memory poisoning and various other hardware failure recovery mechanisms.

Intel’s recent preview of its Xeon processor codenamed Nehalem-EX promises support for memory poisoning. Since these pages have a duplicate backing copy on disk, the in-memory cache copy can be invalidated.

For users:

Usenix Annual Tech Conference. The famous Google memory error study. There is a notion of an “action optional” machine check. The MCA can occur on any “word”, where “word” is defined by the width of the ECC code applied at the corresponding level of memory.

Now flip with me to page and look at what SRAO errors are architecturally defined, there in section However, these pages containing critical kernel data cannot be isolated.

The blanket action of crashing the machine for all uncorrected soft and hard memory errors is sometimes an overreaction. The dirty flag is cleared for the page and the page swap cache entry is maintained. Action Optional means that the CPU detected some form of corruption in the background and tells the OS about it using a machine check exception. Thus, processes can decide if they want to handle the data poisoning.
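The per-page decisions mentioned in this article can be summarized as a small lookup from page state to recovery steps. This is only a reading aid assembled from the sentences here, not the kernel's actual control flow; `PageState` and `recovery_steps` are hypothetical names.

```python
from enum import Enum


class PageState(Enum):
    CLEAN_PAGE_CACHE = "clean page cache page"
    DIRTY_MAPPED = "dirty page mapped by processes"
    DIRTY_SWAP_CACHE = "dirty swap cache page"
    KERNEL_DATA = "page holding critical kernel data"


# Recovery steps per state, paraphrased from the article text.
_STEPS = {
    PageState.CLEAN_PAGE_CACHE: [
        "invalidate the in-memory copy (a backing copy exists on disk)",
    ],
    PageState.DIRTY_MAPPED: [
        "unmap the page from all associated processes",
        "kill those processes",
    ],
    PageState.DIRTY_SWAP_CACHE: [
        "clear the dirty flag",
        "keep the swap cache entry",
        "handle the page later (delayed)",
    ],
    PageState.KERNEL_DATA: [
        "cannot be isolated",
    ],
}


def recovery_steps(state: PageState) -> list:
    """Return the recovery steps the article describes for a page state."""
    return list(_STEPS[state])
```

For example, `recovery_steps(PageState.DIRTY_MAPPED)` returns the unmap-then-kill sequence, while `PageState.KERNEL_DATA` maps to the single entry saying the page cannot be isolated.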


mcelog — further reading

Additionally, the architecture must support data poisoning. Do you have different documentation that suggests otherwise? These delays include asynchronous hardware reporting of the machine check event. How can a machine check for accessing erroneous memory contents be asynchronous? ECC is able to recover from multibit errors. The Intel Software Developer’s manual describes the low level register interface of the x86 machine check architecture. Machine checks are described in Volume 3A. Perhaps this is handled properly, but by just unmapping, aren’t you running the risk that some later memory allocation by that process might get the same virtual address, and thus instead of a SIGBUS the process keeps running with corrupted memory?

Thus, any poisoning machine check event may happen long after the corresponding data error event. Dirty pages in the swap cache are handled in a delayed fashion. Whether or not the CPU referenced the particular word that triggered the fault, the existing MCA may consider such faults catastrophic at the task level, and so does not bother to precisely track which instruction(s) may have consumed the bogus data.