[0001] 1. Field of the Invention
[0002] The present invention relates to a redundant memory system and a memory controller used therefor. More particularly, the invention relates to a redundant memory system including a plurality of memory modules, such as a Redundant Array of Independent Memory Modules (RAIMM), and a memory controller used for controlling the memory system. The modules are typically in the form of the Dual Inline Memory Module (DIMM) or Single Inline Memory Module (SIMM).
[0003] 2. Description of the Related Art
[0004] Conventionally, to make it possible to realize continuous operation of a computer system in spite of the failure of memories, various memory control techniques have been developed and used. Typical examples of the techniques are the Error Checking and Correction (ECC) technique and the ChipKill technique. The ECC technique is a well-known technique to check and correct errors using a parity code. The ChipKill technique, which is disclosed, for example, in the Japanese Non-Examined Patent Publication No. 2001-142789 published on May 25, 2001, is a technique to avoid the use of the data read out from a failed memory element.
[0005] For example, the Japanese Non-Examined Patent Publication No. 5-128012 published on May 25, 1993 discloses an electronic disk apparatus. This electronic disk apparatus comprises M memory packages for each storing data of (N×M) bits/word, where N and M are positive integers; a memory power supply circuit for controlling the turn-on and turn-off of power supplied to the respective M memory packages; control means for reading data from a new memory package word by word in response to the turn-on operation of the memory power supply circuit with respect to the new memory package after replacement; and error correction means for correcting an error of at least N bits about the data thus read from the new memory package. This apparatus makes it possible to reconstitute the data at high speed using the error correction function.
[0006] The Japanese Non-Examined Patent Publication No. 10-111839 published on Apr. 28, 1998 discloses a memory circuit module. This memory circuit module comprises a data memory section for storing data; an ECC memory section for storing an error correction code of data stored in the data memory section; an error correction code generation section for generating an error correction code for data; and an error-correction/detection section for detecting and correcting errors using the error correction code stored in the ECC memory section. This module makes it possible to detect and correct ECC errors.
[0007] With the above-described conventional techniques, obtainable fault tolerance with respect to the memory is improved by the ECC or ChipKill technique. However, the following problems still exist:
[0008] The first problem is that if the operating system (OS) used in a computer system does not support the memory redundancy function, the operation of the computer system needs to be stopped in order to replace a failed memory module operating in a critical situation where the ECC or ChipKill function has been activated due to failure.
[0009] The second problem is that a failed memory module incorporated in a memory system is unable to be replaced with a new memory module in the energized state where electric power is supplied to the memory system, in other words, a failed memory module is unable to be replaced with a new one unless the operation of a computer system using the memory system is stopped. This is because the conventional memory control technique directly assigns the memory addresses in the memory space to the memory modules used and therefore, the modules used are unable to be replaced during the energized or in-service state.
[0010] Accordingly, an object of the present invention is to provide a redundant memory system that makes it possible to replace a failed one of memory modules incorporated into a memory system with a new memory module during the energized or in-service state even if the OS used in a computer system does not support the memory redundancy function.
[0011] Another object of the present invention is to provide a redundant memory system that makes it possible to replace dynamically a failed one of memory modules incorporated into a memory system with a new memory module according to the necessity even if the memory system is being energized.
[0012] Still another object of the present invention is to provide a memory controller that makes it possible to replace a failed one of memory modules incorporated into a memory system with a new memory module during the in-service state even if the OS used in a computer system does not support the memory redundancy function.
[0013] A further object of the present invention is to provide a memory controller that makes it possible to replace dynamically a failed one of memory modules incorporated into a memory system with a new memory module according to the necessity even if the memory system is being energized.
[0014] The above objects together with others not specifically mentioned will become clear to those skilled in the art from the following description.
[0015] According to a first aspect of the present invention, a redundant memory system is provided, which comprises:
[0016] memory slots;
[0017] memory modules for storing data, the modules being inserted into the respective slots; and
[0018] a memory controller connected to the slots and providing redundancy;
[0019] wherein the controller defines one of the modules as a parity memory and its remainder as data memories;
[0020] and wherein a first parity code is generated from desired data to be stored and written into the parity memory and the desired data are written into the respective data memories;
[0021] and wherein the desired data are read from the respective data memories and the first parity code is read from the parity memory to thereby conduct a parity check operation and an error correction operation of the desired data using the desired data and the first parity code, resulting in the redundancy.
[0022] With the redundant memory system according to the first aspect of the present invention, memory modules for storing data are inserted into respective slots. A memory controller for controlling the modules is connected to the slots and provides redundancy. Moreover, the controller defines one of the modules as a parity memory and the remainder thereof as data memories. A first parity code is generated from desired data to be stored and written into the parity memory and the desired data are written into the respective data memories. The desired data are read from the respective data memories while the first parity code is read from the parity memory to thereby conduct a parity check operation and an error correction operation of the desired data using the desired data and the first parity code, resulting in the redundancy.
[0023] Accordingly, the memory controller controls the incorporated modules in such a way as to make an operation corresponding to a Redundant Array of Inexpensive Disks (RAID). Thus, a failed one of the memory modules incorporated into the memory system can be replaced with a new memory module during the energized or in-service state even if the OS (operating system) used in a computer system does not support the memory redundancy function.
[0024] In a preferred embodiment of the module according to the first aspect of the invention, the memory slots are capable of hot plugging or hot swapping operation, wherein a failed one of the memory modules is replaceable with a new memory module in an energized state of the memory system.
[0025] In another preferred embodiment of the module according to the first aspect of the invention, the controller generates a second parity code using the desired data read from respective data memories and then, compares the second parity code with the first parity code read from the parity memory. The parity check operation is conducted by comparing the second parity code with the first parity code. When one of the modules defined as the data memories is failed, the error correction operation of the desired data is conducted by reconfiguring the desired data read from the remaining non-failed data memories and the first parity data read from the parity memory.
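The parity check and error correction just described can be written compactly. The following is a sketch only, assuming the Exclusive OR parity that is named later for the second aspect; the symbols m (number of data memories), D_i (part stored in the i-th data memory), P (first parity code), P' (second parity code), and k (index of the failed data memory) are notation introduced here for illustration and do not appear in the original text.

```latex
\begin{align*}
P   &= D_1 \oplus D_2 \oplus \cdots \oplus D_m
      && \text{(first parity code, generated on write)} \\
P'  &= D'_1 \oplus D'_2 \oplus \cdots \oplus D'_m
      && \text{(second parity code, generated from the parts read back)} \\
P'  &= P \;\Longrightarrow\; \text{no error is detected}
      && \text{(parity check)} \\
D_k &= P \oplus \bigoplus_{i \neq k} D_i
      && \text{(reconfiguration of the part held by failed memory } k\text{)}
\end{align*}
```

The last line follows because every part other than D_k cancels itself under Exclusive OR. Note that the comparison of P' with P only signals that an error exists; which memory has failed must be known from another source before the last relation can be applied.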
[0026] According to a second aspect of the present invention, another redundant memory system is provided, which comprises:
[0027] n memory slots, where n is an integer greater than one;
[0028] n memory modules for storing data, the modules being inserted into the respective slots; and
[0029] a memory controller connected to the slots and providing redundancy;
[0030] wherein the controller comprises
[0031] n ECC/CHIPKILL circuits connected to the respective slots, for ECC code generation, error check, data reconfiguration, and ChipKill operation;
[0032] a parity-generation/check/reconfiguration circuit connected to the n ECC/CHIPKILL circuits, the parity-generation/check/reconfiguration circuit defining one of the n modules as a parity memory and its remainder as (n−1) data memories; wherein a first parity code is generated from desired data to be stored and written into the parity memory while the desired data are written into the respective (n−1) data memories and wherein a second parity code is generated from the desired data read from the (n−1) data memories and compared with the first parity code read from the parity memory, thereby conducting an error checking operation; and wherein when one of the (n−1) data memories is failed, the desired data is reconfigured using the first parity code and the (n−2) data memories other than the failed one; and
[0033] an error count circuit including a generation counter register for storing generation counts of ECC errors and ChipKill errors, and a comparator for comparing the generation counts with a threshold; wherein the comparator outputs an interrupt signal to the upper system when one of the generation counts exceeds the threshold.
[0034] With the redundant memory system according to the second aspect of the present invention, in the memory controller, n ECC/CHIPKILL circuits are connected to the respective slots, for ECC code generation, error check, data reconfiguration, and ChipKill operation.
[0035] Moreover, a parity-generation/check/reconfiguration circuit is connected to the n ECC/CHIPKILL circuits. The parity-generation/check/reconfiguration circuit defines one of the n modules as a parity memory and its remainder as (n−1) data memories. A first parity code is generated from desired data to be stored and written into the parity memory while the desired data are written into the respective (n−1) data memories. A second parity code is generated from the desired data read from the (n−1) data memories and compared with the first parity code read from the parity memory, thereby conducting an error checking operation. When one of the (n−1) data memories is failed, the desired data is reconfigured using the first parity code and the (n−2) data memories other than the failed one.
[0036] An error count circuit is further provided, which includes a generation counter register for storing generation counts of ECC errors and ChipKill errors, and a comparator for comparing the generation counts with a threshold. The comparator outputs an interrupt signal to the upper system when one of the generation counts exceeds the threshold.
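The behavior of the error count circuit can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the circuit itself: the class name `ErrorCountCircuit`, the method `record`, and the two category labels are hypothetical, and "generation count" is read here as the number of times an error of a given kind has occurred.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the two error kinds the text names.
ERROR_KINDS = ("ecc", "chipkill")


@dataclass
class ErrorCountCircuit:
    """Toy model of a generation counter register plus a threshold comparator."""
    threshold: int
    counts: dict = field(default_factory=lambda: {k: 0 for k in ERROR_KINDS})

    def record(self, kind: str) -> bool:
        """Count one error; return True when an interrupt should be signalled."""
        if kind not in self.counts:
            raise ValueError(f"unknown error kind: {kind}")
        self.counts[kind] += 1
        # The comparator fires only when a count exceeds the threshold,
        # not when it merely reaches it.
        return self.counts[kind] > self.threshold
```

With a threshold of 2, the first two ECC errors leave the interrupt inactive and the third raises it, while the ChipKill count is tracked separately.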
[0037] Accordingly, the memory controller controls the n modules in such a way as to make an operation corresponding to a RAID. Thus, a failed one of the n modules incorporated into the memory system can be replaced with a new memory module during the energized or in-service state even if the OS (operating system) used in a computer system does not support the memory redundancy function.
[0038] In a preferred embodiment of the module according to the second aspect of the invention, the parity-generation/check/reconfiguration circuit has the function of:
[0039] deblocking the desired data to (n−1) parts of data;
[0040] generating the first parity code through an Exclusive OR operation of the (n−1) parts of data;
[0041] writing the (n−1) parts of data into the respective (n−1) data memories;
[0042] reading the (n−1) parts of data from the respective (n−1) data memories;
[0043] generating the second parity code through an Exclusive OR operation of the (n−1) parts of data read from the respective (n−1) data memories; and
[0044] comparing the second parity code with the first parity code to generate a result for error finding;
[0045] wherein when no error is found according to the result, the (n−1) parts of data read are blocked to reconstitute the desired data and output the said desired data;
[0046] and wherein when an error is found in one of the (n−1) parts of data read according to the result, the error is corrected using the first parity data and the remaining (n−2) parts of data other than the failed one, and the (n−1) parts of data read are blocked to reconstitute the desired data.
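The sequence of functions listed above (deblocking, Exclusive OR parity generation, writing, reading, comparison, and correction) can be sketched in software as follows. This is an illustrative model only: the function names, the use of byte strings for the parts, the zero padding, and the convention that the parity memory is the last module are assumptions made here, and the index of the failed data memory is passed in explicitly because the parity comparison by itself does not locate the failure.

```python
from functools import reduce
from typing import List, Optional


def deblock(data: bytes, parts: int) -> List[bytes]:
    """Split data into `parts` equal-length pieces (zero-padded at the end)."""
    size = -(-len(data) // parts)  # ceiling division
    padded = data.ljust(size * parts, b"\x00")
    return [padded[i * size:(i + 1) * size] for i in range(parts)]


def xor_parity(pieces: List[bytes]) -> bytes:
    """Bytewise Exclusive OR of equal-length pieces."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*pieces))


def write_word(data: bytes, n: int) -> List[bytes]:
    """Return the contents of n modules: n-1 data parts, then the parity part."""
    pieces = deblock(data, n - 1)
    return pieces + [xor_parity(pieces)]


def read_word(modules: List[bytes], failed: Optional[int] = None) -> bytes:
    """Read the data back, rebuilding one failed data part if `failed` is given."""
    pieces, first_parity = list(modules[:-1]), modules[-1]
    if failed is not None:
        # Correct the failed part from the first parity code and the
        # remaining (n-2) parts.
        survivors = [p for i, p in enumerate(pieces) if i != failed]
        pieces[failed] = xor_parity(survivors + [first_parity])
    second_parity = xor_parity(pieces)
    if second_parity != first_parity:
        raise ValueError("parity mismatch: error detected but not located")
    return b"".join(pieces)  # "blocking" the parts back into the desired data
```

For example, `write_word(b"ABCDEFGH", 5)` yields four 2-byte data parts and one 2-byte parity part; if the third part is later corrupted, `read_word(modules, failed=2)` still returns the original eight bytes. When the input length is not a multiple of n−1, the returned data includes the zero padding added by `deblock`.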
[0047] According to a third aspect of the present invention, a memory controller used for a memory system is provided. This memory controller comprises:
[0048] means for defining one of memory modules inserted into respective memory slots as a parity memory and its remainder as data memories;
[0049] means for generating a first parity code from desired data to be stored;
[0050] means for writing the desired data into the respective data memories and the first parity code into the parity memory; and
[0051] means for reading the desired data from the respective data memories and the first parity code from the parity memory to thereby conduct a parity check operation and an error correction operation of the desired data using the desired data and the first parity code, resulting in the redundancy.
[0052] With the memory controller according to the third aspect of the present invention, there are the same advantages as those of the redundant memory system according to the first aspect of the invention because of the same reason as explained in the redundant memory system according to the first aspect of the invention.
[0053] In a preferred embodiment of the controller according to the third aspect of the invention, the memory slots are capable of hot plugging or hot swapping operation, wherein a failed one of the memory modules is replaceable with a new memory module in an energized state of the memory system.
[0054] In another preferred embodiment of the controller according to the third aspect of the invention, a second parity code is generated using the desired data read from respective data memories and then, the second parity code is compared with the first parity code read from the parity memory. The parity check operation is conducted by comparing the second parity code with the first parity code. When one of the modules defined as the data memories is failed, the error correction operation of the desired data is conducted by reconfiguring the desired data read from the remaining non-failed data memories and the first parity data read from the parity memory.
[0055] According to a fourth aspect of the present invention, another memory controller used for a memory system is provided. This memory controller comprises:
[0056] n ECC/CHIPKILL circuits connected to respective n memory slots, for ECC code generation, error check, data reconfiguration, and ChipKill operation, where n is an integer greater than one;
[0057] a parity-generation/check/reconfiguration circuit connected to the n ECC/CHIPKILL circuits, the parity-generation/check/reconfiguration circuit defining one of n memory modules as a parity memory and its remainder as (n−1) data memories; wherein a first parity code is generated from desired data to be stored and written into the parity memory while the desired data are written into the respective (n−1) data memories; and wherein a second parity code is generated from the desired data read from the (n−1) data memories and compared with the first parity code read from the parity memory, thereby conducting an error checking operation; and wherein when one of the (n−1) data memories is failed, the desired data is reconfigured using the first parity code and the (n−2) data memories other than the failed one; and
[0058] an error count circuit including a generation counter register for storing generation counts of ECC errors and ChipKill errors, and a comparator for comparing the generation counts with a threshold; wherein the comparator outputs an interrupt signal to the upper system when one of the generation counts exceeds the threshold.
[0059] With the memory controller according to the fourth aspect of the present invention, there are the same advantages as those of the redundant memory system according to the second aspect of the invention because of the same reason as explained in the redundant memory system according to the second aspect of the invention.
[0060] In a preferred embodiment of the controller according to the fourth aspect of the invention, the parity-generation/check/reconfiguration circuit has the function of:
[0061] deblocking the desired data to (n−1) parts of data;
[0062] generating the first parity code through an Exclusive OR operation of the (n−1) parts of data;
[0063] writing the (n−1) parts of data into the respective (n−1) data memories;
[0064] reading the (n−1) parts of data from the respective (n−1) data memories;
[0065] generating the second parity code through an Exclusive OR operation of the (n−1) parts of data read from the respective (n−1) data memories; and
[0066] comparing the second parity code with the first parity code to generate a result for error finding;
[0067] wherein when no error is found according to the result, the (n−1) parts of data read are blocked to reconstitute the desired data and output the said desired data;
[0068] and wherein when an error is found in one of the (n−1) parts of data read according to the result, the error is corrected using the first parity data and the remaining (n−2) parts of data other than the failed one, and the (n−1) parts of data read are blocked to reconstitute the desired data.
[0069] In the above-described redundant memory systems according to the first and second aspects of the invention and the above-described memory controllers according to the third and fourth aspects of the invention, there is an additional advantage that dynamic replacement of memory modules is possible even if the system is in service by using memory slots capable of the hot plugging operation according to the definition by the Joint Electron Device Engineering Council (JEDEC).
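One way to picture what dynamic replacement implies for the stored contents is a rebuild pass over the newly inserted module. The sketch below is an assumption made for illustration rather than a procedure stated in this passage: it models each module as a hypothetical mapping from address to stored part, and regenerates the blank module address by address from the Exclusive OR of the other modules. Because the Exclusive OR of all data parts and the parity part is zero, the same routine works whether the replaced module held data or parity.

```python
from functools import reduce
from typing import Dict, List

Module = Dict[int, bytes]  # hypothetical model: address -> stored part


def xor_parts(parts: List[bytes]) -> bytes:
    """Bytewise Exclusive OR of equal-length parts."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*parts))


def rebuild_module(modules: List[Module], replaced: int) -> None:
    """Fill the blank module at index `replaced` from all the other modules."""
    others = [m for i, m in enumerate(modules) if i != replaced]
    for address in others[0]:
        modules[replaced][address] = xor_parts([m[address] for m in others])
```

In a real controller this pass would have to be interleaved with normal reads and writes; that scheduling is outside the scope of this sketch.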
[0070] In order that the present invention may be readily carried into effect, it will now be described with reference to the accompanying drawings.
[0080] Preferred embodiments of the present invention will be described in detail below while referring to the drawings attached.
[0081] As shown in
[0082] The memory controller
[0083] The ECC/CHIPKILL circuits
[0084] The parity-generation/check/reconfiguration circuit
[0085] The bypass circuit
[0086] Referring to
[0087] The generation count register
[0088] Referring to
[0089] Referring to
[0090] Referring to
[0091] Referring to
[0092] Next, the overall operation of the redundant memory system
[0093] When the power is turned on, as shown in
[0094] When the data is written into the memory system
[0095] For example, as shown in
[0096] Next, when the input data is read from the memory system
[0097] When no read error is found in the Step C
[0098] For example, as shown in
[0099] On the other hand, when a read error is found in one of the DIMMs
[0100] For example, as shown in
[0101] When the ECC error found in the step C
[0102] When the ECC error found in the step C
[0103] When the error is judged non-correctable by the ChipKill correction operation in the step CD, the memory controller
[0104] When one of the generation counts of the ECC 1-bit error, the ECC 2-bit error, the ChipKill error, and the read error of the generation counter
[0105] In the following explanation, it is supposed that one of the generation counts of the ECC 1-bit error, the ECC 2-bit error, the ChipKill error, and the read error of the generation counter
[0106] When the CPU
[0107] In response to the fault detection alarm thus emitted, the operator knows the occurrence of the fault in the memory system
[0108] Subsequently, a new, normal DIMM is inserted into the slot
[0109] In this way, the four parts of the correct data α1, α2, α3, and α4 and the parity code p
[0110] It is supposed that a correctable 1-bit error exists in the 16-bit faulty sub-data B
[0111] With the redundant memory system
[0112] (i) Redundancy can be given to the DIMMs
[0113] (ii) A failed one of the DIMMs
[0114] (iii) Dynamic replacement of the DIMMs
[0115] (iv) The system availability is improved because dynamic replacement of the DIMMs
[0116] It is needless to say that the invention is not limited to the above-described embodiment. Any modification is applicable to the embodiment. For example, the memory modules used in the above embodiment are in the form of the DIMM. However, any other form (e.g., SIMM) of memory modules may be used if it is replaceable in the energized state of a computer system.
[0117] While the preferred forms of the present invention have been described, it is to be understood that modifications will be apparent to those skilled in the art without departing from the spirit of the invention. The scope of the present invention, therefore, is to be determined solely by the following claims.