
Decoding the Alert: "ASM Health Checker Found 1 New Failures" – Causes, Fixes, and Prevention

If you manage Oracle Grid Infrastructure (GI) or a standalone Automatic Storage Management (ASM) instance, one notification can send a chill down your spine: "ASM health checker found 1 new failures."

This message, often found in your alert log, crsd.log, or email alerts from Enterprise Manager (EM12c/13c), indicates that the automated ASM Health Checker has detected a new issue affecting the integrity, availability, or performance of your ASM environment. Ignoring it is not an option; unresolved failures can lead to disk group mount issues, I/O latency, or even database crashes.

This article provides a 360-degree breakdown of this alert: what triggers it, how to diagnose the root cause, step-by-step repair procedures, and long-term prevention strategies.


Post-incident actions (short incident report template)


6.3 Maintain Consistent Naming with ASMFD

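Use ASM Filter Driver (ASMFD) instead of raw devices or ASMLIB. ASMFD removes the need to rebind devices and reset permissions after each restart, and it rejects write I/O that does not come from Oracle software, which protects ASM disks from accidental overwrites.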

Essay: “asm health checker found 1 new failures” — diagnosis, causes, and remediation

Introduction

The terse message “asm health checker found 1 new failures” appears straightforward but carries significant operational weight: it signals that an ASM (Automatic Storage Management, or a similarly named subsystem) health-check routine has detected a failure. Whether that ASM is Oracle ASM, a cloud Autoscaling/Service Mesh monitor, or a custom “Application Service Monitor,” the phrasing implies an automated health-scan discovered one additional fault relative to its prior baseline. This essay examines the message’s possible meanings, root causes, investigative approach, risk implications, and systematic remediation and prevention strategies. The aim is to move from alarm to actionable resolution, and from reactive fixes to durable system hardening.

  1. Interpreting the message
  2. Immediate triage checklist (first 15–60 minutes)
  3. Root-cause analysis (systematic approach)
  4. Common root causes and how they manifest
  5. Remediation steps (concrete actions)
  6. Validation and recovery verification
  7. Post-incident actions (SRE-style)
  8. Design considerations for health checkers to reduce false positives and improve signal
  9. Risk assessment and business impact

Conclusion

“asm health checker found 1 new failures” is more than a log line: it is an early warning. Responding effectively requires prompt triage, methodical diagnosis, and decisive remediation, combined with post-incident learning and engineering improvements to reduce recurrence. By classifying possible causes (storage, probe, resource, network, regression, auth), following a disciplined RCA approach, and implementing monitoring and automation best practices, teams can convert such alerts from frightening unknowns into manageable events and steadily improve system resilience.

Appendix: Minimal quick runbook (steps to execute immediately)

  1. Capture the alert details and correlate logs/metrics.
  2. Identify the affected resource (disk/pod/node/service).
  3. Attempt a manual probe/connection to reproduce failure.
  4. If production-impacting, trigger failover/scaleup and notify on-call.
  5. Apply targeted remediation (replace disk, fix probe, rollback deployment).
  6. Verify health across multiple intervals; monitor for recurrence.
  7. Create postmortem and assign permanent fixes.
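
Step 1 of the runbook can be scripted so that evidence is preserved before remediation changes anything. The following is a minimal sketch, not an Oracle tool: the function name `capture_alert_evidence` and the output file names are illustrative, and it assumes you pass it the path to the relevant alert log.

```shell
# capture_alert_evidence LOGFILE OUTDIR [LINES]
# Saves the tail of a log file into OUTDIR and records every line that
# mentions the health checker message, prefixed with its line number.
# Prints the number of matching lines.
# (Function and file names are illustrative assumptions.)
capture_alert_evidence() {
    log_file=$1
    out_dir=$2
    tail_lines=${3:-500}

    [ -r "$log_file" ] || { echo "cannot read $log_file" >&2; return 1; }
    mkdir -p "$out_dir" || return 1

    # Preserve recent log content before remediation alters the picture.
    tail -n "$tail_lines" "$log_file" > "$out_dir/alert_tail.log"

    # Case-insensitive match; grep exits 1 when nothing matches, which is
    # not an error for this helper.
    grep -in 'ASM Health Checker found' "$log_file" \
        > "$out_dir/health_checker_hits.txt" || true

    wc -l < "$out_dir/health_checker_hits.txt" | tr -d ' '
}
```

A call such as `capture_alert_evidence /path/to/alert.log ./incident_evidence` leaves two files that can be attached to the postmortem in step 7.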


Troubleshooting Oracle ASM Health Checker Failures

The message "ASM Health Checker found 1 new failures" is a critical alert in Oracle Automatic Storage Management (ASM). It typically appears in the ASM alert log when the background health monitoring process detects a problem that could threaten disk group availability.

Immediate Impact

When this error is triggered, it often coincides with other critical events:

Disk Group Dismounting: ASM may force a dismount of a disk group (e.g., ORA-15130) to prevent data corruption.

Instance Reconfiguration: A "Dirty detach reconfiguration" may start as the cluster tries to handle the failure.

Database Downtime: If the affected disk group contains critical files like the OCR, voting files, or database data files, the associated Oracle instance or Clusterware may crash.

Common Root Causes

Lost Storage Connectivity: One or more LUNs/disks became inaccessible due to hardware, cable, or storage controller issues.

Write I/O Errors: ASM takes disks offline if it cannot complete a write operation, which can lead to a disk group failure if redundancy is lost.

Insufficient Redundancy: In "External Redundancy" disk groups, the failure of even a single disk causes the entire group to fail.

Disk Header Corruption: Physical corruption of the disk header can prevent ASM from identifying the disk as a "MEMBER" of a group.

Investigative Steps

To identify and resolve the specific failure, follow these steps:
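
One way to start is to see which ORA- errors were logged shortly before the health checker message, since those usually point at the underlying cause (for example ORA-15130 for a dismount). The sketch below is a hedged example: the function name `ora_codes_before_failure` and the default window of 20 lines are assumptions, and it only relies on the alert log being plain text.

```shell
# ora_codes_before_failure LOGFILE [WINDOW]
# Prints the distinct ORA-nnnnn codes that appear within the WINDOW lines
# leading up to (and including) each health checker message.
# (Hypothetical helper; WINDOW defaults to 20 lines.)
ora_codes_before_failure() {
    log_file=$1
    window=${2:-20}
    awk -v window="$window" '
        {
            buf[NR % window] = $0      # ring buffer of the most recent lines
        }
        tolower($0) ~ /asm health checker found/ {
            for (i = 0; i < window; i++) {
                line = buf[i]
                # Pull every ORA-nnnnn token out of the buffered line.
                while (match(line, /ORA-[0-9]+/)) {
                    code = substr(line, RSTART, RLENGTH)
                    if (!(code in seen)) { seen[code] = 1; print code }
                    line = substr(line, RSTART + RLENGTH)
                }
            }
        }
    ' "$log_file" | sort
}
```

The codes it prints can then be matched against the root causes listed above.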

ASM Health Checker Found 1 New Failure: What It Means and How to Resolve It

The Automatic Storage Management (ASM) health checker is a crucial tool in Oracle databases that monitors the health and integrity of the storage infrastructure. When the ASM health checker reports a new failure, it's essential to understand the implications and take corrective actions to prevent data loss or system downtime. In this blog post, we'll discuss what an ASM health checker failure means, how to investigate the issue, and steps to resolve it.

What does an ASM health checker failure mean?

When the ASM health checker detects a problem, it writes a message to the ASM alert log. The message looks like this:

"ASM Health Checker found 1 new failures"

This message indicates that the ASM health checker has detected a single failure in the storage system. The failure could be related to various issues, such as:

Investigating the ASM health checker failure

To investigate the failure, follow these steps:

  1. Check the ASM alert log: The ASM alert log provides detailed information about the failure, including the error message, timestamp, and affected disk group. You can find the alert log in the $ORACLE_BASE/diag/asm/+asm/<instance_name>/trace directory.
  2. Run the asmcmd command: The asmcmd command-line tool provides a comprehensive view of the ASM configuration and status. Run asmcmd with the lsdg command to list the mounted disk groups and their status: asmcmd lsdg
  3. Check the disk group status: Pass the disk group name to lsdg to limit the output to the affected disk group: asmcmd lsdg <disk_group_name>
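
The lsdg output from steps 2 and 3 can be filtered automatically. The sketch below assumes the usual lsdg header, which includes the column names `State`, `Offline_disks`, and `Name`; because column positions differ between releases, it looks them up by name. The function name `flag_diskgroups` is illustrative.

```shell
# flag_diskgroups: reads `asmcmd lsdg` output on stdin and prints the name
# of every disk group that is not MOUNTED or that reports offline disks.
# (Hypothetical helper; assumes the header names State, Offline_disks, Name.)
flag_diskgroups() {
    awk '
        NR == 1 {
            # Map each header name to its column position.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("State" in col) || !("Offline_disks" in col) || !("Name" in col)) {
                print "unexpected lsdg header" > "/dev/stderr"
                exit 2
            }
            next
        }
        NF > 0 {
            if ($(col["State"]) != "MOUNTED" || $(col["Offline_disks"]) + 0 > 0)
                print $(col["Name"])
        }
    '
}
```

Typical use would be `asmcmd lsdg | flag_diskgroups`; empty output means every listed disk group is mounted with no offline disks.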

Resolving the ASM health checker failure

Once you've identified the root cause of the failure, take corrective actions to resolve the issue:

  1. Replace a failed disk: If the failure is due to a disk error, replace the disk and re-add it to the ASM disk group.
  2. Check and correct connectivity: Verify that the storage connections are stable and functioning correctly.
  3. Free up disk space: If the failure is due to insufficient disk space, free up space by deleting unnecessary files or expanding the disk group.
  4. Reconfigure ASM: If the failure is due to an ASM configuration error, reconfigure ASM with the correct settings.

Best practices to prevent ASM health checker failures

To minimize the likelihood of ASM health checker failures:

  1. Monitor ASM alerts: Check the ASM alert log regularly and respond promptly to any errors or warnings.
  2. Perform routine maintenance: Check disk space and replace failed disks on a regular schedule.
  3. Test and validate ASM configurations: Confirm after each change that the configuration is correct and performs as expected.

By understanding the causes of ASM health checker failures and taking proactive steps to prevent them, you can ensure the reliability and performance of your Oracle database storage infrastructure.

ASM Health Checker Found 1 New Failure: What It Means and How to Resolve It

If you're a database administrator or a system administrator working with Oracle databases, you're likely familiar with the Automatic Storage Management (ASM) system. ASM is a storage management system that provides a simple and efficient way to manage storage for Oracle databases. One of the tools used to monitor and maintain ASM is the ASM Health Checker, which periodically checks the health of the ASM infrastructure and reports any issues or failures.

Recently, you may have encountered an alert or message indicating that the "ASM health checker found 1 new failure." This message can be concerning, especially if you're not familiar with what it means or how to resolve it. In this article, we'll explore what this message means, the possible causes, and step-by-step instructions on how to resolve the issue.

What Does the ASM Health Checker Do?

The ASM Health Checker is part of Oracle's Health Monitor framework and runs diagnostic checks against the ASM infrastructure. It monitors various aspects of ASM, including:

The ASM Health Checker runs automatically and reports any issues or failures it detects. Checks are typically triggered when ASM encounters a critical error, and they can also be run manually; the documented ASM initialization parameters (such as ASM_DISKGROUPS, ASM_DISKSTRING, and ASM_POWER_LIMIT) do not include a check-interval setting.

What Does "ASM Health Checker Found 1 New Failure" Mean?

When the ASM Health Checker detects a new failure, it reports the issue and provides information about the failure. The message "ASM health checker found 1 new failure" indicates that the checker has detected a problem with the ASM infrastructure that requires attention.

The failure can be related to various aspects of ASM, such as:

Possible Causes of the Failure

There are several possible causes for the ASM Health Checker to report a new failure. Some common causes include:

How to Resolve the Issue

To resolve the issue, follow these step-by-step instructions:

  1. Check the ASM alert log: The ASM alert log provides detailed information about the failure, including the error message and the time it occurred. You can find the ASM alert log in the $ORACLE_BASE/diag/asm/+asm/<instance_name>/trace directory.
  2. Investigate the failure: Use the information from the ASM alert log to investigate the failure. Check the ASM disk groups, disks, and instances to identify any issues.
  3. Run a manual disk group check: Connect to the ASM instance as SYSASM and ask ASM to verify the metadata of the affected disk group without attempting any repair:
ALTER DISKGROUP <disk_group_name> CHECK NOREPAIR;

A summary of any errors the check finds is written to the ASM alert log.

  4. Check the disk groups and disks: Check the disk groups and disks to ensure they are configured correctly and are online.
SELECT * FROM V$ASM_DISKGROUP;
SELECT * FROM V$ASM_DISK;
  5. Check the ASM instance: Check the ASM instance to ensure it is running and configured correctly.
SELECT instance_name, status FROM V$INSTANCE;
  6. Perform corrective actions: Based on the investigation, perform corrective actions to resolve the issue. This may include:
    • Replacing a failed disk
    • Reconfiguring a disk group
    • Restarting the ASM instance
    • Correcting an I/O error or performance problem

Best Practices to Avoid Future Failures

To avoid future failures and ensure the health of your ASM infrastructure, follow these best practices:

By following these best practices and resolving the issue reported by the ASM Health Checker, you can ensure the health and performance of your ASM infrastructure and prevent future failures.

The alert "ASM Health Checker found 1 new failures" typically appears in your Oracle ASM alert logs when the Automatic Diagnostic Repository (ADR) health monitor detects a critical issue during a maintenance task, such as a diskgroup rebalance or a disk add operation.

Understanding the Failure

When this message occurs, it indicates that a health check, either triggered automatically by an incident or run manually, has identified a problem that could compromise your storage. Common triggers include:

Disk Failgroup Issues: A diskgroup has fewer failure groups than recommended (e.g., fewer than 3 for normal redundancy).

Disk Status/Mount Failures: Disks are missing, offline, or have lost membership.

Metadata Corruption: Corruption found in the first 250 blocks of an ASM disk, which contain essential metadata.

Quorum Loss: The diskgroup cannot maintain a read quorum, often leading to an automatic dismount.

How to Diagnose and Fix

To resolve the failure, follow these diagnostic steps:
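
The failure group trigger listed above lends itself to a quick automated check. The sketch below assumes you have exported one line per disk in the form `DISKGROUP TYPE FAILGROUP` (for example from a join of V$ASM_DISKGROUP and V$ASM_DISK); that input format and the function name `check_failgroups` are assumptions. The default threshold of 3 for NORMAL comes from the text above; the default of 5 for HIGH is an assumption based on Oracle's general recommendation, and both can be overridden.

```shell
# check_failgroups [NORMAL_MIN] [HIGH_MIN]
# Reads "DISKGROUP TYPE FAILGROUP" lines (one per disk) on stdin and reports
# NORMAL and HIGH redundancy disk groups that have fewer distinct failure
# groups than the given thresholds. Other redundancy types are not evaluated.
# (Hypothetical helper; input format and defaults are assumptions.)
check_failgroups() {
    awk -v normal_min="${1:-3}" -v high_min="${2:-5}" '
        NF >= 3 {
            type[$1] = $2
            # Count each failure group once per disk group.
            if (!(($1, $3) in seen)) { seen[$1, $3] = 1; count[$1]++ }
        }
        END {
            for (dg in count) {
                if (type[dg] == "NORMAL")    need = normal_min + 0
                else if (type[dg] == "HIGH") need = high_min + 0
                else continue
                if (count[dg] < need)
                    printf "%s %s has %d failure group(s), recommended minimum is %d\n",
                           dg, type[dg], count[dg], need
            }
        }
    ' | sort
}
```

Passing `2 3` as arguments checks against lower thresholds instead of the recommended ones.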

4. Mismatched Disk Group Compatibility

If compatible.asm, compatible.rdbms, or compatible.advm values are set incorrectly relative to the GI version, the health checker will report advisories as failures.
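
Comparing the attribute values by hand is error-prone because they are dotted version strings. The sketch below checks two rules: compatible.asm should not be higher than the GI version, and compatible.rdbms should not be higher than compatible.asm. The function names `version_gt` and `check_compat` are illustrative, and the values are assumed to be supplied by you (for example from V$ASM_ATTRIBUTE); compatible.advm is not covered here.

```shell
# version_gt A B: succeeds when dotted version A is strictly greater than B.
version_gt() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            # Missing components count as 0, so 12.1 equals 12.1.0.0.
            xi = x[i] + 0; yi = y[i] + 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 1
    }'
}

# check_compat GI_VERSION COMPATIBLE_ASM COMPATIBLE_RDBMS
# Prints one line per violated rule and returns non-zero if any rule fails.
# (Hypothetical helper; callers supply the three version strings.)
check_compat() {
    gi=$1; casm=$2; crdbms=$3
    status=0
    if version_gt "$casm" "$gi"; then
        echo "compatible.asm $casm is higher than GI version $gi"
        status=1
    fi
    if version_gt "$crdbms" "$casm"; then
        echo "compatible.rdbms $crdbms is higher than compatible.asm $casm"
        status=1
    fi
    return $status
}
```

For example, `check_compat 19.0.0.0 19.0.0.0 12.1.0.2` prints nothing and returns success.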

Step 3: Check Disk Group Status

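SELECT name, state, type, total_mb, free_mb, required_mirror_free_mb, offline_disks
FROM v$asm_diskgroup;

Look for disk groups that are in the MOUNTED state but report a nonzero OFFLINE_DISKS count. To see which disks are affected, check MODE_STATUS and MOUNT_STATUS in v$asm_disk.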
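
The free_mb and required_mirror_free_mb columns from this query also show how much space is safely usable. As a worked example, the sketch below applies the usual relationship usable = (free_mb - required_mirror_free_mb) / number of mirror copies, with 1 copy for EXTERN, 2 for NORMAL, and 3 for HIGH. The function name `usable_file_mb` is illustrative, and shell integer division may round differently from the USABLE_FILE_MB column that Oracle reports.

```shell
# usable_file_mb TYPE FREE_MB REQUIRED_MIRROR_FREE_MB
# Estimates the space that can be used for new files while still being able
# to restore redundancy after a failure. A negative result means the disk
# group could not re-mirror its data after losing a failure group.
# (Hypothetical helper; only EXTERN, NORMAL, and HIGH are handled.)
usable_file_mb() {
    case $1 in
        EXTERN) copies=1 ;;
        NORMAL) copies=2 ;;
        HIGH)   copies=3 ;;
        *) echo "unsupported redundancy type: $1" >&2; return 1 ;;
    esac
    echo $(( ($2 - $3) / copies ))
}
```

For a NORMAL redundancy disk group with free_mb = 1000 and required_mirror_free_mb = 200, `usable_file_mb NORMAL 1000 200` prints 400.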

Immediate actions (first 10 minutes)

  1. Don’t panic — gather context.
  2. Check the Health Checker details/console immediately for:
    • The specific check name and ID.
    • Timestamp of the failure.
    • Failure severity (critical/warning/info).
    • Any short description or error code provided.
  3. Look at recent system events (last 30–60 minutes):
    • Service restarts
    • Deployments/patches
    • Scheduled jobs or cron tasks
    • OS reboots or kernel messages
  4. Check logs related to the failed check:
    • Health checker logs
    • Application server logs
    • System logs (syslog/journalctl)
    • Network or load balancer logs if relevant
  5. Confirm whether the failure is still present by re-running the single failed check (if your health checker supports manual re-run).
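
For steps 4 and 5, it helps to look at the most recent occurrence of the message together with the lines that led up to it, and to repeat that after a fix to confirm that no newer occurrence has been logged. A minimal sketch, assuming a plain-text log; the function name `last_failure_context` and the default of 10 context lines are illustrative.

```shell
# last_failure_context LOGFILE [BEFORE]
# Prints the most recent health checker message in LOGFILE together with the
# BEFORE lines that precede it. Returns 1 if the message never appears.
# (Hypothetical helper; BEFORE defaults to 10 lines.)
last_failure_context() {
    log_file=$1
    before=${2:-10}

    # Line number of the latest match, or empty if there is none.
    last=$(grep -in 'ASM Health Checker found' "$log_file" | tail -n 1 | cut -d: -f1)
    [ -n "$last" ] || { echo "no health checker failure found in $log_file"; return 1; }

    start=$(( last - before ))
    [ "$start" -ge 1 ] || start=1
    sed -n "${start},${last}p" "$log_file"
}
```

If the line range it prints has not moved after remediation, no new failure has been logged since the one you investigated.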

8. Potential Edge Cases



Subject: ASM Health Check Report – New Failures Detected

To: Database Administration Team / System Health Monitoring Group
Date: [Insert Date]
Priority: Medium


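Step 2: Query the ASM Health Checker Views (SQL)

Health checker findings are recorded by Oracle's Health Monitor framework and can be queried through V$HM_RUN and V$HM_FINDING, assuming these views are populated on your ASM release. Connect to the ASM instance and run:

sqlplus / as sysasm
SET LINESIZE 200
COL check_name FORMAT a30
COL description FORMAT a60
SELECT f.finding_id, f.type, r.check_name, f.time_detected, f.status, f.description
FROM v$hm_finding f
JOIN v$hm_run r ON r.run_id = f.run_id
WHERE f.type = 'FAILURE'
ORDER BY f.time_detected DESC;

Rows are sorted newest first, so if only one new failure exists, the first row describes it and gives you actionable details.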