

Monitoring group members

Member hardware problems typically cause event messages and alarms. Monitor the member hardware and replace any failed components immediately.

1. Click Group.
2. Click Members.

The Group Disk Space panel shows the total amount of free space in the group and in each pool (if applicable).

The Group Members panel lists all the members, the pool each belongs to, their capacity and amount of free space, RAID policy, number of disks, status, PS Series Firmware version (should be the same for all members), and the number of iSCSI connections to each member. (This indicates the number of volumes or snapshots with data on that member that are connected to an initiator. Nothing connects directly to a member.)
3. Check the following:
Member status – The Member Status table describes each member status value. If a member is offline, investigate the cause. Volumes with data on an offline member are also offline. If a member has a problem, double-click the member to display additional information.
Low free space – Low free space in a member might indicate that overall group space is low. You can free space in a member by adding more members to the same pool (the group distributes volume data across the pool members).
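The checks above can also be run against member details recorded by hand from the Group Members panel. The following is a minimal sketch, not a PS Series API: the record layout (name, status, firmware) and the function name are assumptions made for illustration only.

```python
def review_members(members):
    """Return findings for a list of member records.

    Each record is a hypothetical dict with 'name', 'status', and
    'firmware' keys, copied by hand from the Group Members panel.
    """
    findings = []
    # An offline member takes its volumes offline, so flag it first.
    for member in members:
        if member["status"] == "offline":
            findings.append(f"{member['name']}: offline, investigate the cause")
    # The firmware version should be the same for all members.
    versions = sorted({member["firmware"] for member in members})
    if len(versions) > 1:
        findings.append("firmware versions differ: " + ", ".join(versions))
    return findings


if __name__ == "__main__":
    sample = [
        {"name": "member1", "status": "online", "firmware": "5.0.2"},
        {"name": "member2", "status": "offline", "firmware": "5.0.1"},
    ]
    for finding in review_members(sample):
        print(finding)
```

The member names and version strings in the sample are placeholders, not values from a real group.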

Monitoring a specific member

To display details about a specific member, click Group, expand Members, then select the member name, and then click the Status tab.

Displaying general member information

In the General Member Information panel, check the RAID status. RAID Status describes the RAID status values and provides possible solutions where appropriate.

RAID Status shows RAID status for a member. The first column lists the status values, the second column provides descriptions, and the third column provides solutions.

Table: RAID Status

Status | Description | Solution
catastrophicLoss | Disk array lost group metadata or user data. The array does not initialize. | Contact your support provider.
degraded | A RAID set is in a degraded state. If the member RAID level is RAID 5, RAID 50, or RAID 6, performance might be impaired. | Identify and replace any failed drives.
expanding | Disk array is expanding (for example, because you installed additional disk drives or changed the member RAID policy). | None needed; informational.
failed | Multiple disk drive failures occurred in the same RAID set. The member is set offline. | Contact your support provider.
ok | Disk array initialization is complete. Performance is normal. | None needed; informational.
reconstructing | Disk array is reconstructing data on a drive (for example, because a drive failed, and a spare is replacing it). During reconstruction, performance might decrease. After reconstruction, performance returns to normal, unless a RAID set is degraded. | Identify and replace any failed drives.
verifying | Disk array is initializing (for example, because you set the member RAID policy). | None needed; informational.
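If you record RAID status values in your own monitoring notes or scripts, the RAID Status table can be expressed as a simple lookup. This is an illustrative sketch only: the dictionary and function names are assumptions, and the status strings and solutions are copied from the table.

```python
# Solutions from the RAID Status table; None marks informational states.
RAID_STATUS_SOLUTIONS = {
    "catastrophicLoss": "Contact your support provider.",
    "degraded": "Identify and replace any failed drives.",
    "expanding": None,
    "failed": "Contact your support provider.",
    "ok": None,
    "reconstructing": "Identify and replace any failed drives.",
    "verifying": None,
}


def raid_action(status):
    """Return the documented solution for a RAID status value."""
    if status not in RAID_STATUS_SOLUTIONS:
        # Do not guess at values that the table does not list.
        raise ValueError(f"unknown RAID status: {status!r}")
    return RAID_STATUS_SOLUTIONS[status] or "None needed; informational."


if __name__ == "__main__":
    print(raid_action("degraded"))
```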

Displaying member health status

1. Click Group, then expand Members.
2. Select the member name, and then click the Status tab.

In the Member Health Status panel:

Click Front view to display the front panel of the array.
Click Rear view to display the back panel of the array, including the control modules and the power supply and cooling modules. The front and rear views shown in your GUI depend on the array model of the group member.
Click Inside view (not available on all array models) to display the interior disk drive slots.
Click View alarms to display all the alarms for the member.

A red X over a hardware component indicates uninstalled or unconfigured hardware. A warning or error status symbol in the array graphic indicates a failed or failing component. Move the pointer over a component to show status details.

Member Status shows member status. The first column lists the status values, the second column describes them, and the third column lists solutions.

Table: Member Status

Status | Description | Solution
unconfigured | You did not select a RAID policy for the member. | None needed; informational.
initializing | Member is initializing according to the selected RAID policy. | None needed; informational.
online | Array is a functioning member of the group. | None needed; informational. A member can experience a failure but still be online.
offline | Member is unavailable, failed, or power was removed. | Identify and correct the problem.
vacating-in-progress | Member is moving data to the remaining pool members before it is removed from the group. | None needed; informational. This can be a long operation, based on the amount of data that must be moved to the other pool members.
vacated | Member has successfully moved its data to the other pool members before it is removed from the group. | None needed; informational.

Displaying member space

1. Click Group, then expand Members.
2. Select the member name, and click the Status tab.

The Member Space panel shows the total amount of usable space on the member, how much space is used by volumes, snapshots, and replicas, and the amount of free space, numerically and in a graphic.

Using LEDs to identify a member

If a hardware failure occurs in a member, the group generates an alarm, which causes the member LED to light.

In addition, to help you identify a member, you can make the fan tray LED and the control module ERR LED on the member chassis flash.

To make a member’s LED flash, click Group, then expand Members, then select the member name, and then click Start LEDs flashing.
To stop a member’s LEDs from flashing, click Group, then expand Members, then select the member name, and then click Stop LEDs flashing.

Warning:  Never turn off power to a group member unless the member has been cleanly shut down. See Shutting down a member.

Monitoring the member enclosure

The member enclosure information includes the power supplies, cooling fans (usually integrated into the power supplies), and, on some array models, channel cards and an EIP card.

To display the member enclosure information, click Group, then expand Members, then select the member name, and then click the Enclosure tab.

Monitoring power supplies

A member has two or three power supplies. Most PS Series arrays use power supplies that have integrated cooling modules.

A member can survive one power supply failure. Replace failed power supplies as soon as possible.

For proper cooling, do not remove a power supply until you have a replacement.

For information about replacing a power supply, see the Hardware Maintenance manual for your array model or contact your PS Series support provider.

The Power Supplies panel shows the status of the power supplies. The number and type of hardware components shown depend on your array model.

Power Supply Status shows power supply status and possible solutions. The first column lists the status values, the second column provides descriptions, and the third column provides solutions.

Table: Power Supply Status

Status | Description | Solution
OK | Array is receiving power from the power supply. | None needed; informational.
no-power | Power supply is not installed or not connected to a power source, or the power supply is not turned on (not all power supply models). | Keep all power supplies installed and connected to a power source. If the power supply has a power switch, make sure the power switch is on.
failed | Power supply failure. | See your PS Series support provider for information about replacing the power supply.

Monitoring cooling and fans

A member has two or three cooling modules and multiple fans. Most PS Series arrays use power supplies that have integrated cooling modules.

Periodically check the temperature of the room where the hardware is located, and make sure that the room is sufficiently cool and ventilated. Also make sure the fan trays and cooling modules have no red LEDs, and monitor the member temperature.

A member can survive one cooling module failure. Replace failed cooling modules as soon as possible.

The Cooling Fans panel shows the status of the fans on the cooling modules. Cooling Fan Status shows the cooling fan status. These status values apply to array models with combination power supply and cooling modules. The first column lists the status values, the second column provides descriptions, and the third provides solutions.

Table: Cooling Fan Status

Status | Description | Solution
fan-present | Cooling modules and fan are functioning. | None needed; informational.
fan-not-present | Cooling module or fan failed, or the cooling module is not installed, not turned on, or not connected to a power source. | Install a functioning cooling module, make sure the cooling module is connected to a power source, or turn on the cooling module (not available on all cooling module models). See your PS Series support provider for information about replacing a failed cooling module.

The Temperature Sensors panel shows the current temperature for the various array controllers and processors, in addition to the normal temperature range. Array Temperature Status describes the array temperature status. The first column lists the status values, the second column provides descriptions, and the third provides solutions.

Table: Array Temperature Status

Status | Description | Solution
normal | Temperature is within normal range. | None needed; informational.
warning | Temperature is outside normal range, but within limits. | Check that all fans are working properly. Monitor the temperature carefully.
critical | Temperature is outside operating limits. | Check that all fans are working properly. Make sure the air conditioning system is working correctly, and make sure there is air flow around the array. If a processor's temperature stays high, replace the control module. See your PS Series support provider for information about replacing a failed cooling module.

Multiple fan failures increase the array temperature. A high temperature results in event messages. The array might shut down before damage occurs.

Some PS Series arrays also show the ambient temperature, which is calculated in degrees Celsius from the two highest sensor temperatures, using the following formula:

((Backplane Sensor 0 + Backplane Sensor 1) / 2) – 7
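As a worked illustration of the formula, the following sketch computes the ambient temperature from two backplane sensor readings. The function and parameter names are assumptions for this example, and the sample readings are made up.

```python
def ambient_temperature_c(backplane_sensor_0, backplane_sensor_1):
    """Apply the documented formula to two sensor readings in Celsius."""
    # Average the two readings, then subtract the fixed 7-degree offset.
    return (backplane_sensor_0 + backplane_sensor_1) / 2 - 7


if __name__ == "__main__":
    # Hypothetical readings of 30 and 34 degrees Celsius.
    print(ambient_temperature_c(30, 34))
```

With readings of 30 and 34, the average is 32, so the calculated ambient temperature is 25 degrees Celsius.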

Monitoring channel cards

Some array models include redundant channel cards. An array continues to operate if a channel card fails. You can replace the failed channel card with no impact on group operation.

Channel Card Status describes channel card status and possible solutions, where appropriate.

Channel card status

Channel Card Status shows channel card status. The first column lists the status values, the second provides descriptions, and the third provides solutions.

Table: Channel Card Status

Status | Description | Solution
good | Channel card is functioning normally. | None needed; informational.
failed | Channel card failure. | Contact your PS Series support provider for information about replacing a channel card.
not-present | Channel card is missing or status is unavailable. | Contact your PS Series support provider for information about installing or replacing a channel card.

For information about replacing channel cards, see the Hardware Maintenance manual for your array model or contact your PS Series support provider.

Monitoring the EIP card

Some array models include an Enclosure Interface Processor (EIP) card. An array continues to operate if the EIP card fails. You can replace the failed EIP card with no impact on group operation.

In the Member Enclosure window, the EIP card panel shows the EIP card status.

EIP Card Status describes EIP card status and possible solutions to problems. The first column lists the status values, the second describes them, and the third provides solutions.

Table: EIP Card Status

Status | Description | Solution
good | EIP card is functioning normally. | None needed; informational.
failed | EIP card failure. | Contact your PS Series support provider for information about replacing an EIP card.
not-present | EIP card is missing or status is unavailable. | Contact your PS Series support provider for information about installing or replacing an EIP card.

For information about replacing the EIP card, see the Hardware Maintenance manual for your array model or contact your PS Series support provider.

Monitoring control modules

Each group member has one or two control modules installed. One control module is designated as active (responsible for serving I/O to the member). On the active control module the LED labeled ACT is lit.

In a dual control module array, the other control module is secondary (mirrors cache data from the active control module). Upon startup, either control module can be designated active or secondary, regardless of its previous status.

Under normal operation, the status of a control module (active or secondary) does not change, unless you restart the member.

In a single control module array, if the control module fails, the member is offline.

In a dual control module array, if the active control module fails, the secondary control module becomes active and begins serving I/O. This is called control module failover. I/O should continue if you connect cables to the newly active control module.

For information about replacing control modules, see the Hardware Maintenance manual for your array model or contact your PS Series support provider.

To display control module information, click Group, then expand Members, then select the member name, and then click the Controllers tab.

Each Control Module Slot panel shows the following information:

Status. See Control Module Status.
Boot time.
Cache battery status and NVRAM battery status. See Cache Battery Status and NVRAM Battery Status for descriptions of battery status and possible solutions where appropriate.
Model number.
Boot ROM version.
PS Series firmware version.

An empty slot means that a control module is not installed or has failed.

For information about replacing a control module, see the Hardware Maintenance manual for your array model or contact your PS Series support provider. Do not remove a failed control module until you have a replacement.

The Memory Cache panel displays the cache mode. Control module and battery status affect the cache mode. Write-through mode might impair performance. Identify why the cache is in write-through mode and correct the problem, if necessary. See About write cache operations.

Control module status

Control Module Status describes the control module status. The first column lists the status values, the second column describes them, and the third column provides solutions.

Table: Control Module Status

Status | Description | Solution
active | Serving I/O to the member. | None needed; informational.
secondary | Mirroring cache data from the active control module. | None needed; informational.

Cache battery status

Cache Battery Status describes the control module cache battery status. The first column lists the status values, the second column describes them, and the third column provides solutions.

Table: Cache Battery Status

Status | Description | Solution
ok | Battery is fully charged. | None needed; informational.
failed | Battery failure. | Contact your service provider for information about replacing batteries.
missing battery | Battery is missing. | Contact your service provider for information about replacing batteries.
low voltage | Battery is below the limit for normal operation. | If the battery status is low voltage for an extended period of time, contact your PS Series service provider for information about replacing batteries.
low voltage, is charging | Battery is charging but is still below the limit for normal operation. | If the battery status is low voltage, is charging for an extended period of time, contact your PS Series service provider for information about replacing batteries.
good battery, is charging | Battery is charging but has enough charge for normal operation. | None needed; informational.

NVRAM battery status

NVRAM Battery Status describes the control module NVRAM coin cell battery status. Not every array has an NVRAM battery. The first column lists the status values, the second column describes them, and the third column provides solutions.

Table: NVRAM Battery Status

Status | Description | Solution
good | Battery installed and fully charged. | None needed; informational.
bad | Battery failure. | Contact your PS Series service provider for information about replacing batteries.
not-present | Battery is not installed. | Contact your PS Series service provider for information about replacing batteries.
unknown | Battery status is not known. | Contact your PS Series service provider for information about replacing batteries.

Monitoring disk drives

Make sure you detect and replace failed disk drives as soon as possible. Although spare disks and RAID protect data against disk failures, multiple disk failures might put data in jeopardy.

To display the disk drive information, click Group, expand Members, then select the member name, and then click the Disks tab.

The Disk Array Summary panel shows the disk drives in the member. The number and type of drives shown depend on your array model.

The Installed Disks panel shows more information about each disk, including the slot, type, model and revision, size, status, and errors. Closely monitor drives with errors.

Disk drive status

Disk Drive Status shows disk drive status. The first column lists the status values, the second column describes them, and the third column provides solutions.

Table: Disk Drive Status

Status | Description | Solution
too-small | Disk drive is smaller than other drives in the member. The drive cannot be used in the member. | Replace the drive with a drive that has the same size or a greater size than the installed drives.
failed | Disk drive failure. | See your PS Series support provider for information about replacing failed disk drives.
foreign | Disk drive has a foreign label. The drive was probably removed from a different array and then installed in this array. | To use the drive, click foreign disk and clear the label.
history-of-failures | Previously failed disk drive. | See your support provider. To use the drive, click history-of-failure and agree to use the disk.
offline | Indicates that the disk drive does not fall into the other status categories. | See your PS Series support provider.
online | Disk drive is functioning. | None needed; informational.
spare | Disk drive is a spare drive. | None needed; informational.
unsupported-version | Disk drive cannot use the firmware running on the member. | See your PS Series support provider.

Warning:  A disk drive failure in a RAID 5 or RAID 10 set that is degraded might result in data loss.

When a drive in a RAID set fails, a member behaves as follows:

If a spare disk drive is available: Data from the failed drive is reconstructed on the spare. During the reconstruction, the RAID set that contains the failed drive is temporarily degraded.
If a spare disk drive is not available, and the RAID set has not reached the maximum number of drive failures: The RAID set that contains the failed drive is degraded. For RAID 5, RAID 50, or RAID 6, performance might decrease.
If a spare disk drive is not available, and the RAID set has reached the maximum number of drive failures: The member is set offline, and any volumes and snapshots that have data stored on the member are set offline. Data might be lost and must be recovered from a backup or replica.
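The three cases above amount to a short decision sequence. The sketch below restates that sequence for illustration only. It does not query an array, and the function name, parameter names, and returned wording are assumptions.

```python
def drive_failure_outcome(spare_available, max_failures_reached):
    """Describe how a member behaves when a drive in a RAID set fails.

    spare_available: True if a spare disk drive is available.
    max_failures_reached: True if the RAID set has reached its
        maximum number of drive failures.
    """
    if spare_available:
        # Data is rebuilt on the spare; the set is degraded meanwhile.
        return "reconstructing on spare; RAID set temporarily degraded"
    if not max_failures_reached:
        # No spare, but the RAID set can still tolerate the failure.
        return "RAID set degraded; performance might decrease"
    # No spare and no remaining tolerance for drive failures.
    return "member offline; volumes and snapshots offline; data might be lost"


if __name__ == "__main__":
    print(drive_failure_outcome(spare_available=False, max_failures_reached=False))
```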

When you replace a failed disk, a member behaves as follows:

If a spare disk drive was used: The new drive automatically becomes a spare, with no effect on performance.
If a RAID set was degraded: Data is automatically reconstructed on the new drive and performance goes back to normal after reconstruction.
If a member was offline because of multiple RAID set drive failures: Any volumes and snapshots with data on the member are set offline and data might be lost.

In some cases, a member might detect a problem with a disk drive. The member automatically copies the data on the failing disk drive to a spare disk drive, with no impact on availability and little impact on performance. The group generates event messages informing you of the progress of the copy-to-spare operation. I/O is written to both drives until the copy-to-spare operation completes. If the disk drive completely fails during the operation, data is reconstructed on the spare using parity data, as usual.

Replace any failed disks immediately. For information about replacing disk drives, see the Hardware Maintenance manual for your array model or contact your PS Series support provider.

Monitoring network hardware

A member must have at least one functioning network interface connected to a network and configured with an IP address. Each control module has multiple Ethernet ports.

If you experience network problems, group members might lose the ability to communicate with each other over the network. In such a group, some management operations are not allowed. For example, you cannot change the IP addresses of an isolated member.

If the members of a group cannot communicate, identify and correct the network problems. This restores the group to normal full operation, including network communication.

To display the network information, click Group, expand Members, then select the member name, and then click the Network tab.

The Status of Network Interfaces panel shows the following information:

Operational status – This is the current status of the network interface and can be:
up – Operational, connected to a functioning network, configured with an IP address and subnet mask, and enabled.
down – Not operational, not connected to a functioning network, not configured with an IP address or subnet mask, or disabled.
Requested status – This status is set by administrative action:
enabled – Configured and serving I/O.
disabled – Not serving I/O. Might be configured.

If the operational status is down, but the requested status is enabled, identify and correct the error.

To protect against network interface or port failure, connect multiple network interfaces on both control modules to the network.

Speed – Make sure that the interface speed is adequate.
MTU size – The path MTU size depends on the iSCSI initiator setting.
Packet errors – A few packet errors are not usually a problem. If a large number of packet errors occur, a network problem or a network interface or port failure might exist. Identify and correct the problem.
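The comparison of operational status and requested status described above can be sketched as a small filter over interface records. This is an illustration under stated assumptions: the record layout and the interface names are hypothetical, and the values are entered by hand from the Status of Network Interfaces panel.

```python
def interfaces_needing_attention(interfaces):
    """Return names of interfaces that are enabled but not operational.

    Each record is a hypothetical dict with 'name', 'operational'
    ('up' or 'down'), and 'requested' ('enabled' or 'disabled') keys.
    """
    return [
        nic["name"]
        for nic in interfaces
        # A disabled interface that is down is expected; an enabled one is not.
        if nic["operational"] == "down" and nic["requested"] == "enabled"
    ]


if __name__ == "__main__":
    sample = [
        {"name": "eth0", "operational": "up", "requested": "enabled"},
        {"name": "eth1", "operational": "down", "requested": "enabled"},
        {"name": "eth2", "operational": "down", "requested": "disabled"},
    ]
    print(interfaces_needing_attention(sample))
```

In the sample data, only the second interface is reported, because it is the only one that is enabled but down.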

The IP Configuration panel shows each interface and its IP address, netmask, MAC address and description, if any.

Monitoring iSCSI connections to a member

To display all connections to a member, click Group, expand Members, select the member name, and then click the Connections tab.

The iSCSI Connections panel shows information about the initiator address, which volume or snapshot it is connected to (Target column), how long the connection has been active, and which Ethernet port the initiator is using.

Check for multiple initiators writing to the same iSCSI target. This can cause target corruption if not handled correctly by the servers.
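One way to spot candidates for this check is to group the rows of the iSCSI Connections panel by target. The sketch below assumes that the initiator address and Target column have been copied by hand into (initiator, target) pairs. The function name and sample values are hypothetical. It reports targets with more than one connected initiator; it cannot tell whether those initiators are writing or whether the servers coordinate access.

```python
from collections import defaultdict


def shared_targets(connections):
    """Map each target with more than one initiator to its initiators.

    connections: iterable of (initiator_address, target_name) pairs.
    """
    initiators_by_target = defaultdict(set)
    for initiator, target in connections:
        initiators_by_target[target].add(initiator)
    # Keep only targets that two or more distinct initiators are using.
    return {
        target: sorted(initiators)
        for target, initiators in initiators_by_target.items()
        if len(initiators) > 1
    }


if __name__ == "__main__":
    sample = [
        ("10.0.0.11", "vol1"),
        ("10.0.0.12", "vol1"),
        ("10.0.0.11", "vol2"),
    ]
    print(shared_targets(sample))
```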


 


Copyright 2010 Dell Inc.