MS630 Memory Problem Determination/Resolution Guide

Order Number EK-MS630-FI-001

ABSTRACT

The objective of this guide is to clearly define the recommended memory maintenance strategy for all MS630 memory arrays. There are no new procedures defined here. These are the original maintenance procedures explained in detail with an emphasis on problem determination (that is, determine what the underlying cause of the problem is and when to replace the FRU).

Digital Equipment Corporation

June, 1991

The information in this document is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in this document.

Possession, use, duplication, or dissemination of the software described in this documentation is authorized only pursuant to a valid written license from Digital or the third-party owner of the software copyright.

No responsibility is assumed for the use or reliability of software on equipment that is not supplied by Digital Equipment Corporation.

Copyright © Digital Equipment Corporation 1991

All Rights Reserved.

Printed in U.S.A.

The following are trademarks of Digital Equipment Corporation:

MicroVAX, MicroVAX II, VMS, and the Digital logo.

This document was prepared and published by Educational Services Development and Publishing, Digital Equipment Corporation.

Contents

About This Manual

1 START HERE
1.1 Problem Symptom Determination
1.2 FRU Replacement Procedures
1.3 Non-Conforming Material Tag

2 HARD FAULT — THEORY NUMBER 1
2.1 Theory Description
2.2 Recommended Action

3 TRANSIENT FAULT — THEORY NUMBER 2
3.1 Theory Description
3.2 Recommended Action

4 MULTIPLE SYMPTOM FAULT — THEORY NUMBER 3
4.1 Theory Description
4.2 Recommended Action

5 INTERMITTENT FAULT — THEORY NUMBER 4
5.1 Theory Description
5.2 Recommended Action

Glossary

Figures

1 Memory Parity Error Entry
2 Bad Pages Indicated Under SHOW MEMORY VMS/DCL
3 Memory Error Detected While Running MDM or POST

Tables

1 Symptom Determination

About This Manual

This document provides guidance in the event of a MicroVAX memory problem. To use the guide, start at Section 1, identify the problem symptom that you are experiencing, and then follow the procedures.

The procedures in this document apply to all MicroVAX II based systems and all supported memory products (1-Mbyte on-CPU memory, 1-Mbyte MS630-AA through 8-Mbyte MS630-CA). These procedures also take into account the recommended guidelines of FCO MS630-I001.

This guide assumes that the user has service knowledge of the MicroVAX system and appropriate service tools and procedures.


1 START HERE

1.1 Problem Symptom Determination

Use Table 1 to determine the current symptom that your system is experiencing. Once the symptom has been determined, refer to the figure indicated in column 2.

NOTE

The symptoms are listed in order of priority. In the event that more than one of the following symptoms exist, follow the directions for the first symptom found (in order).

Table 1 Symptom Determination

If this Symptom Exists . . .                               Refer to this Figure
Fatal memory error in the ERRLOG                           Figure 1
Bad pages shown under the VMS/DCL SHOW MEMORY command      Figure 2
Power-on self-test or the MDM diagnostic fails             Figure 3
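The priority rule in the note above (follow the first matching symptom) can be sketched in a few lines of Python. This is an illustrative aid only; the short symptom keys are hypothetical names chosen for this example and do not come from any Digital tool.

```python
# Symptoms from Table 1, highest priority first.
# The keys are illustrative names, not identifiers used by VMS, MDM, or ERF.
SYMPTOM_FIGURES = [
    ("errlog_fatal_memory_error", "Figure 1"),
    ("vms_bad_pages", "Figure 2"),
    ("post_or_mdm_failure", "Figure 3"),
]


def figure_for(observed):
    """Return the figure for the highest-priority observed symptom, or None."""
    for symptom, figure in SYMPTOM_FIGURES:
        if symptom in observed:
            return figure
    return None


# Two symptoms present: the higher-priority one decides where to start.
print(figure_for({"vms_bad_pages", "post_or_mdm_failure"}))  # Figure 2
```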


You are here because there is a MEMORY PARITY ERROR ENTRY in the system ERRLOG.

Did the system REBOOT properly?

NO: Probable HARD FAULT (Theory Number 1, Section 2).

    1. Confirm the problem. Find the FRU with either POST or MDM (if possible).

    2. Use either ERRLOG or console information to find the FRU and replace the failed FRU (Section 1.2).

YES: Has it been more than 3 months since the last failure (same symptom)?

    NOTE: Use the site management guide or other site log to determine this.

    YES: Probable TRANSIENT FAULT (Theory Number 2, Section 3).

        1. No FRU replacement required.

        2. Record the event in the site management guide, then EXIT.

    NO: Has only 1 FRU failed and are there no other problem symptoms evident?

        NO: MULTIPLE SYMPTOM FAULT (Theory Number 3, Section 4).

            1. Cannot find the failed FRU.

            2. Refer to the theory for further details.

        YES: Probable INTERMITTENT FAULT (Theory Number 4, Section 5).

            Replace the failed FRU (Section 1.2).

Figure 1 Memory Parity Error Entry
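The decision flow of Figure 1 can be expressed as a small Python function. This is a sketch, not a Digital tool. It assumes the branch assignments implied by the theory sections: a failed reboot points to a hard fault, a gap of more than 3 months points to a transient fault, and a recent repeat on a single FRU points to an intermittent fault. The function name, parameter names, and the handling of a missing site-log history are hypothetical.

```python
def triage_parity_error(rebooted_ok, months_since_last_failure, single_fru_only):
    """Classify a memory parity error entry following Figure 1.

    months_since_last_failure: None when the site log shows no earlier failure
        with the same symptom (treated here as "more than 3 months", an assumption).
    single_fru_only: True when exactly one FRU failed and no other symptoms exist.
    """
    if not rebooted_ok:
        return "HARD FAULT (Theory 1): confirm with POST or MDM, replace the FRU"
    if months_since_last_failure is None or months_since_last_failure > 3:
        return "TRANSIENT FAULT (Theory 2): no replacement, record in the site log"
    if single_fru_only:
        return "INTERMITTENT FAULT (Theory 4): replace the FRU"
    return "MULTIPLE SYMPTOM FAULT (Theory 3): further diagnosis required"


# System rebooted, same symptom seen one month ago, one FRU implicated.
print(triage_parity_error(True, 1, True))
```

The order of the tests mirrors the figure: each question is asked only when the previous one did not already settle the diagnosis.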


You are here because BAD PAGES were indicated under SHOW MEMORY VMS/DCL.

Using ERF, examine the ERRLOG for memory errors.

Any errors found?

YES: Go to Figure 1.

NO: Schedule system time. Run MDM (memory diagnostic) to identify the FRU.

    Problem confirmed?/FRU identified?

    YES: Probable HARD FAULT (Theory Number 1, Section 2).

        Replace the failed FRU (Section 1.2).

    NO: Probable INTERMITTENT or MULTIPLE SYMPTOM FAULT.

        Reboot the system and if the symptom persists (and you are unable to identify the FRU), contact support.

Figure 2 Bad Pages Indicated Under SHOW MEMORY VMS/DCL
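Figure 2 reduces to two questions, sketched below in Python. The sketch assumes that errors found in the ERRLOG lead to Figure 1 and that a confirmed MDM result indicates a hard fault; the function and parameter names are hypothetical.

```python
def triage_bad_pages(errlog_has_memory_errors, mdm_identified_fru):
    """Follow Figure 2 for bad pages reported by SHOW MEMORY."""
    if errlog_has_memory_errors:
        return "Go to Figure 1"
    if mdm_identified_fru:
        return "HARD FAULT (Theory 1): replace the FRU"
    return ("INTERMITTENT or MULTIPLE SYMPTOM FAULT: reboot, "
            "contact support if the symptom persists")


# No ERRLOG entries, and MDM could not pin down a module.
print(triage_bad_pages(False, False))
```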


You are here because of a memory error detected either while running an MDM diagnostic or while POST was running.

Problem confirmed?/FRU identified?

NO: Further diagnosis required.

    Either contact support for further assistance or attempt to reproduce the problem.

YES: Probable HARD FAULT (Theory Number 1, Section 2).

    Replace the failed FRU (Section 1.2).

Figure 3 Memory Error Detected While Running MDM or POST


1.2 FRU Replacement Procedures

The following recommended procedures should be followed when FRU replacement of the memory module is necessary.

1. Identify/verify the FRU type (module number) and location (slot number).

2. Physically remove the FRU and install a spare FRU (same type) in its place. If the FRU is an M7609 (8-Mbyte MS630-C), verify that the spare is either revision A2 or C1. If not, then find one that is.

3. If a second memory array module is present in the system and it is an M7609 (8-Mbyte MS630-C), verify that it is either revision A2 or C1. If not, then acquire and install FCO MS630-I001, which involves the replacement of this module as well.

4. Power up the system. Verify that POST passes. Run one pass of the MDM diagnostics. Reboot the operating system and verify that the problem symptom does not recur.
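The revision check in steps 2 and 3 can be captured as a simple lookup. The sketch below is illustrative; the function name is hypothetical, and it assumes that modules other than the M7609 carry no revision constraint in this procedure, since the steps name only that module.

```python
# Acceptable M7609 revisions, as stated in steps 2 and 3.
ACCEPTED_M7609_REVISIONS = {"A2", "C1"}


def revision_acceptable(module_number, revision):
    """Return True if the module may be installed or left in place.

    Assumption: only the M7609 has a revision requirement here.
    """
    if module_number.upper() != "M7609":
        return True
    return revision.upper() in ACCEPTED_M7609_REVISIONS


print(revision_acceptable("M7609", "A1"))  # False
print(revision_acceptable("M7609", "C1"))  # True
```

Both the spare and any second M7609 already in the system would be checked the same way.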

1.3 Non-Conforming Material Tag

After replacing an FRU, the module must be tagged prior to returning it to Logistics.

The following information should be included on the repair tag to aid in module repair and tracking:

• Indicate whether or not the FRU problem was:

– Hard (easily reproducible)

– Intermittent (comes and goes)

• Indicate the method used to diagnose the failed FRU:

– POST failure

– MDM failure

– VMS bad pages

– Parity error (in the ERRLOG)


2 HARD FAULT — THEORY NUMBER 1

2.1 Theory Description

This theory is valid if the fault is hard (reproducible). The underlying cause of such a fault is typically a physical component failure. If this class of fault is present, then it is quite likely that the memory array exhibits one or more of the following symptoms:

• MDM diagnostics fail

• VMS maps out bad pages when booted

• POST fails

• The system cannot boot successfully

2.2 Recommended Action

Replace the failed FRU. In the comments field on the repair tag, indicate "HARD FAULT, XXX FAILURE", where XXX is the primary symptom (first item you encounter from the following list):

1. POST — if POST failed

2. MDM — if MDM failed (also fill out the diagnostic section on the repair tag)

3. VMS bad pages — if bad pages mapped out

4. Parity error — if the console/ERRLOG indicated this
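The repair-tag comment is built from the fault class and the first matching item in the list above. A minimal Python sketch follows; the function name is hypothetical and the symptom strings are taken from the list as written.

```python
# Primary symptoms in the order given in the list above.
PRIMARY_SYMPTOMS = ["POST", "MDM", "VMS bad pages", "Parity error"]


def repair_tag_comment(fault_class, observed):
    """Build the tag comment, e.g. "HARD FAULT, MDM FAILURE".

    fault_class: "HARD" or "INTERMITTENT".
    observed: the set of symptoms seen on this FRU.
    """
    for symptom in PRIMARY_SYMPTOMS:
        if symptom in observed:
            return f"{fault_class} FAULT, {symptom} FAILURE"
    raise ValueError("none of the listed symptoms was observed")


# MDM failed and the ERRLOG shows a parity error: MDM comes first in the list.
print(repair_tag_comment("HARD", {"Parity error", "MDM"}))  # HARD FAULT, MDM FAILURE
```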


3 TRANSIENT FAULT — THEORY NUMBER 2

3.1 Theory Description

This theory is valid when the parity error has been categorized as a transient event. In other words, the parity error happened only once and appears to be an isolated incident.

The most probable source of this failure is an alpha particle. An alpha particle is a minute, one-shot disturbance which inverts the contents of a single DRAM cell (in other words, changes a "1" to a "0" or vice versa). Once a cell is impacted by an alpha particle, it remains in the "inverted" state until the cell is rewritten. Once the cell is rewritten, the fault is no longer present.

The alpha particle phenomenon is well known and documented and is experienced by all DRAM systems of all vendors. It is the most common cause of MicroVAX memory parity errors, as its rate of occurrence is 100 times that of hard/reproducible DRAM faults.

From past experience and field data, it is possible for a MicroVAX II system (with a fully populated memory subsystem) to experience a transient memory fault once every 3 to 6 months (worst case). The actual rate is dependent upon system load (usage), memory access rates, and application.

3.2 Recommended Action

As alpha particles inflict no permanent damage to a DRAM, repair is not necessary. Do not replace the FRU. Simply record the symptoms in the appropriate place (for example, the site management guide or the customer site log). The pertinent information recorded should include:

• Date/time of error

• Error description (for example, fatal memory error)

• FRU isolation information (for example, slot 2 or 3)

• Diagnosis/theory (for example, transient as only one error)
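The four items above map naturally onto a small record. The Python sketch below shows one possible layout for a site log entry; the class, its field names, and the one-line output format are hypothetical and are not prescribed by the site management guide.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SiteLogEntry:
    """One transient-fault record, holding the four items listed above."""
    when: datetime      # date/time of error
    description: str    # error description
    fru_slot: int       # FRU isolation information
    diagnosis: str      # diagnosis/theory

    def as_line(self):
        """Render the entry as a single line for a paper or text log."""
        return (f"{self.when:%Y-%m-%d %H:%M} | {self.description} | "
                f"slot {self.fru_slot} | {self.diagnosis}")


entry = SiteLogEntry(datetime(1991, 6, 3, 14, 30), "fatal memory error",
                     2, "transient, only one error")
print(entry.as_line())
```

Keeping the date in every entry is what later makes the "more than 3 months since the last failure" question in Figure 1 answerable.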


4 MULTIPLE SYMPTOM FAULT — THEORY NUMBER 3

4.1 Theory Description

This theory is valid if either one or both of the following conditions exist:

• There is more than one problem symptom evident

• Multiple FRUs have failed

Due to the underlying complexity, it is not possible to identify the failed FRU(s) reliably from the symptoms alone, and any attempt to do so would be inaccurate. However, the following are some guidelines for further diagnosis:

• If multiple FRUs fail, all exhibiting memory parity errors, then suspect a common component (for example, a cable or CPU module).

• If other problem symptoms are exhibited, focus on the earliest and/or common symptom.

• If something has recently been changed/installed in the system, consider that component.

4.2 Recommended Action

Perform the additional manual diagnosis of all problem symptoms and/or contact the next level of support/service.


5 INTERMITTENT FAULT — THEORY NUMBER 4

5.1 Theory Description

This theory is valid if the fault is recurring but not easily reproducible. The underlying cause of such a fault is typically a marginal physical component failure. If this class of fault is present, then it is quite likely that the memory array exhibits one or both of the following symptoms:

• System crashes periodically due to a memory parity error

• MDM diagnostics probably do NOT fail

5.2 Recommended Action

Replace the failed FRU. In the comments field on the repair tag, indicate "INTERMITTENT FAULT, XXX FAILURE", where XXX is the primary symptom (first item you encounter from the following list):

1. POST — if POST failed

2. MDM — if MDM failed (also fill out the diagnostic section on the repair tag)

3. VMS bad pages — if bad pages mapped out

4. Parity error — if the console/ERRLOG indicated this

Glossary

The following terms used within this document are described below as they pertain to memory systems, faults, and errors.

MEMORY SYSTEM TERMS

Alpha particle

An alpha particle is a minute, one-shot disturbance which inverts the contents of a single DRAM cell (in other words, changes a "1" to a "0" or vice versa). The physical source of alpha particles is the DRAM packaging material.

Cell

The basic unit of a DRAM. This element corresponds to one bit of storage. For example, a 1-Megabit DRAM contains approximately 1,000,000 cells (1,048,576).

DRAM

Dynamic Random Access Memory. This is the basic physical component (IC) of a memory array module. For example, there are 288 DRAMs on the M7609 MS630-CA 8-Mbyte memory array module.

Error

An error occurs when the actual state deviates from the expected state. For example, if a parity check is made on a byte of information fetched from memory, and even parity is computed (we expect odd parity), then a parity error is the result.

Fault

The term fault is used to describe the underlying cause (or source) of an error. For example, if a parity error occurs, the underlying cause may be a physical component fault.

Parity

Refers to a technique used to protect data storage. As implemented in the MicroVAX system, a spare (parity) bit is stored with every eight bits of data to aid in the detection of errors.
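The Python sketch below illustrates byte parity as described in this glossary: one parity bit per eight data bits, with odd parity expected. It is a software illustration only and does not model the actual MicroVAX parity logic. The function names are hypothetical, and the final calculation assumes 1 Mbyte = 2**20 bytes and equal-sized devices.

```python
def odd_parity_bit(byte):
    """Return the parity bit that gives the nine stored bits an odd number of 1s."""
    ones = bin(byte & 0xFF).count("1")
    return 0 if ones % 2 == 1 else 1


def has_parity_error(byte, stored_parity):
    """True if the byte read back no longer matches its stored parity bit."""
    return odd_parity_bit(byte) != stored_parity


data = 0b10110010                 # four 1 bits
parity = odd_parity_bit(data)     # 1, giving five 1s in total (odd)
one_flip = data ^ 0b00000100      # a single inverted cell
two_flips = data ^ 0b00000110     # two inverted cells in the same byte

print(has_parity_error(data, parity))       # False
print(has_parity_error(one_flip, parity))   # True
print(has_parity_error(two_flips, parity))  # False: an even number of flips goes undetected

# Nine bits are stored per byte. Under the assumptions stated above, the
# 288 DRAMs on an 8-Mbyte module work out to 256 Kbit per device.
bits_per_dram = (8 * 2**20 * 9) // 288
print(bits_per_dram // 1024)  # 256
```

Parity detects a single inverted cell but cannot correct it, which is why a transient fault surfaces as a parity error rather than being silently repaired.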


FAULT TERMS

The following definitions are all considered attributes of errors or faults. As such, the definitions of these adjectives are given as they apply to the terms "error" and "fault".

Faults can be categorized into three distinct groups. The nature of the group relates to the "period" of the fault (that is, how long the fault is present).

Hard fault

Permanent. The fault is always present. Any access to or use of the faulty component results in an error. An example of a hard fault is a permanently damaged DRAM (which results in a parity error upon every access to the DRAM).

Transient fault

This class of fault refers to a fault that occurs for only a brief period of time, then disappears forever. In other words, the fault occurs only once. Examples of transient faults include power line disturbances or alpha particle faults.

Intermittent fault

This class of fault refers to a fault which occurs periodically. The fault is not easily reproducible but does recur over some period of time. Sources of this class of fault may include marginal components or infrequently accessed logic.

ERROR TERMS

Once a fault occurs and the faulted "component" is accessed, an error results. In one sense, the error "inherits" the same attributes of the fault (for example, a permanent fault results in a permanent error). However, errors are more appropriately defined in terms of how they impact the system. To this degree there are two main classes of errors.

Recoverable

This attribute means that the error condition can be corrected. An example of error recovery is ECC single-bit correction. This class of error has little or no impact on system operation. (Note that this class of error is sometimes referred to as a soft error.)

Unrecoverable

This attribute means that the error condition cannot be corrected and the operation fails to complete. An example of an unrecoverable error is a memory parity error while in kernel mode.
