Decoding the Alert: "ASM Health Checker Found 1 New Failures" – Causes, Fixes, and Prevention
If you manage Oracle Grid Infrastructure (GI) or a standalone Automatic Storage Management (ASM) instance, one notification can send a chill down your spine: "ASM health checker found 1 new failures."
This message, often found in your alert log, crsd.log, or email alerts from Enterprise Manager (EM12c/13c), indicates that the automated ASM Health Checker has detected a new issue affecting the integrity, availability, or performance of your ASM environment. Ignoring it is not an option; unresolved failures can lead to disk group mount issues, I/O latency, or even database crashes.
This article provides a 360-degree breakdown of this alert: what triggers it, how to diagnose the root cause, step-by-step repair procedures, and long-term prevention strategies.
Post-incident actions (short incident report template)
- Title: ASM Health Checker — 1 new failure detected — [component]
- Time detected: [timestamp UTC]
- Duration: [time to remediation]
- Impact: [user-facing or internal only]
- Root cause: [brief diagnosis]
- Actions taken: [steps performed]
- Permanent fix implemented: [yes/no — describe]
- Follow-ups: [improve monitoring, automation, runbook updates]
6.3 Maintain Consistent Naming with ASMFD
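Use ASM Filter Driver (ASMFD) instead of raw devices or ASMLIB. ASMFD keeps device names consistent across reboots and rejects write I/O that does not come from Oracle software, which protects ASM disks from accidental overwrites.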
Essay: “asm health checker found 1 new failures” — diagnosis, causes, and remediation
Introduction
The terse message “asm health checker found 1 new failures” appears straightforward but carries significant operational weight: it signals that an ASM (Automatic Storage Management, or a similarly named subsystem) health-check routine has detected a failure. Whether that ASM is Oracle ASM, a cloud Autoscaling/Service Mesh monitor, or a custom “Application Service Monitor,” the phrasing implies that an automated health scan discovered one additional fault relative to its prior baseline. This essay examines the message’s possible meanings, root causes, investigative approach, risk implications, and systematic remediation and prevention strategies. The aim is to move from alarm to actionable resolution, and from reactive fixes to durable system hardening.
- Interpreting the message
- Literal reading: an automated health-check process labelled “asm health checker” has logged a detection: exactly one new failure event.
- Context sensitivity: precise interpretation depends on environment:
- Oracle ASM: a disk/member or diskgroup issue (failed redundancy, I/O errors, disk offline).
- Service mesh/Autoscaling manager: a service instance or probe failed a liveness/readiness check.
- Custom agent named “asm health checker”: any monitored component (process, thread, port, storage, network) could be implicated.
- Important implications:
- “New” indicates a state transition from healthy to unhealthy (not a stale alert).
- “1” suggests a single point of failure, but single failures often cascade; it’s important to identify whether this is isolated or symptomatic.
- Immediate triage checklist (first 15–60 minutes)
- Capture the context:
- Timestamp and full log entry (message, surrounding logs, stack traces).
- Host, resource identifier (disk name, instance ID, pod, container, node).
- Correlate with monitoring dashboards, metrics (CPU, memory, I/O, latency), and recent deployments/changes.
- Prevent escalation:
- If the failure affects production traffic, consider circuit-breaker actions: divert traffic, scale up healthy instances, enable read-only modes, or failover to replicas.
- Announce to on-call and stakeholders with concise facts: what, when, where, impact, mitigation in progress.
- Gather artifacts:
- Health-check configuration (probe interval, timeout, retries).
- Recent configuration changes, deployments, patching, or maintenance windows.
- Checkpoint / snapshot data for storage systems; container logs and systemd/journald entries for nodes.
- Root-cause analysis (systematic approach)
- Classify the failure type:
- Infrastructure (disk or network failure, node crash, lost quorum).
- Application-level (process crash, thread deadlock, memory leak).
- Configuration/compatibility (misconfigured probe, wrong path, permissions).
- Environmental change (updated TLS certs, rotated keys, firewall rules).
- Use hypothesis-driven testing:
- Reproduce the health-check manually (curl, nc, mock probe) to see failure mode and error output.
- Run platform-specific diagnostics:
- Oracle ASM: check v$asm_disk, v$asm_disk_stat, alert logs; verify diskgroup status, disk headers, ASMLib/udev mappings.
- Kubernetes/service mesh: describe Pod, inspect readiness/liveness probe commands, check kubelet and container logs, look for OOMKilled or CrashLoopBackOff events.
- Generic: run system-level checks (dmesg for kernel I/O errors, SMART for disks, iptables/netstat for network problems).
- Cross-check metrics around the failure time (spikes in latency, error rates, system load).
- Look for correlated events:
- Recent rolling deployment, configuration commit, or scaledown/scaleup around the timestamp.
- Hardware alerts from infrastructure providers or cloud provider incident notices.
- Common root causes and how they manifest
- Disk/device failure (storage-backed ASM):
- Symptoms: I/O errors, device disappearing, degraded redundancy, “disk offline” in ASM tooling.
- Manifest: slow operations, timeouts, database errors, degraded redundancy alarms.
- Probe misconfiguration:
- Symptoms: probe path or command changed, insufficient privileges, changed binary path or API endpoint.
- Manifest: instant failures after configuration change or deployment, but application otherwise healthy.
- Resource exhaustion:
- Symptoms: OOMs, CPU saturation, slow responsiveness leading to probe timeouts.
- Manifest: spike in latency/queue depths, container restarts.
- Network partitions and DNS:
- Symptoms: connectivity failures, name resolution errors, requests timing out.
- Manifest: distributed systems losing quorum or failing health checks intermittently.
- Software/regression bug:
- Symptoms: new code path introduced a crash or deadlock.
- Manifest: reproducible failure tied to a recent change, stack traces in logs.
- Permission or credential expiry:
- Symptoms: auth failures, TLS handshake errors, permission denied.
- Manifest: logs showing unauthorized, certificate expired, or permission denied messages.
- Remediation steps (concrete actions)
- If storage/device failure (ASM storage example):
- Mark failed disk offline in ASM; replace or reattach physical/virtual disk.
- Recreate or restore disk headers if corrupted (only after backups and vendor guidance).
- Rebalance diskgroup to restore redundancy; monitor rebalance progress.
- Verify backups before destructive repair; engage support for hardware-level issues.
- If probe/config issue:
- Fix probe command/path/permissions; redeploy probe configuration.
- Tune probe tolerances (timeouts/retries) conservatively; loosening them too far can mask real failures.
- Use start-up probes to give warm-up time before liveness checks.
- If resource exhaustion:
- Increase resource limits, add replicas, tune GC or request/limit settings in orchestrator.
- Identify memory leaks or CPU hotspots; apply fixes or rollback problematic release.
- If networking:
- Restore connectivity (route, firewall rules, security groups), verify DNS resolution.
- Consider adjusting health-check endpoints to use local checks where possible.
- If software regression:
- Roll back to last known-good version or hotfix the bug; add test coverage to catch similar errors.
- If credential/certificate issues:
- Renew or rotate credentials; update configuration; automate certificate renewal.
- Validation and recovery verification
- Confirm health-check returns healthy consistently across multiple intervals.
- Re-run full end-to-end tests and smoke tests for functionality.
- Monitor for recurrence over a longer window than the probe interval (e.g., 3× intervals).
- Check downstream systems for residual impact: queues, caches, replication lag, user-facing errors.
- Post-incident actions (SRE-style)
- Incident timeline: assemble a precise timeline of events, alerts, root cause, and actions taken.
- Blameless postmortem: document root cause, contributing factors, mitigations, and long-term fixes.
- Action items with owners and deadlines:
- Fix root cause (code, config, hardware replacement).
- Add or adjust monitoring (better metrics, alert thresholds, synthetic tests).
- Improve runbooks: clearly document steps for this exact alert.
- Automate repetitive fixes where safe (auto-replace failed disks, auto-scale on thresholds).
- Prevent recurrence:
- Increase redundancy, improve testing (chaos testing, canary releases), and validate health-check semantics.
- Ensure alerts escalate appropriately (avoid noisy alerts causing fatigue).
- Design considerations for health checkers to reduce false positives and improve signal
- Health-check best practices:
- Use layered checks: shallow liveness probe (process alive) and deep readiness/smoke tests (end-to-end).
- Include health endpoints that report application-specific readiness (dependency statuses, DB connectivity).
- Use gradual degradation rather than binary failure where possible (report degraded vs unhealthy).
- Align probe intervals, retries, and timeouts with realistic warm-up, GC, and transient network behavior.
- Avoid heavy-weight operations inside probes that amplify load.
- Observability:
- Emit structured health-check telemetry (success/failure counts, latencies, error codes).
- Correlate health-check events with trace IDs to debug distributed failures.
- Provide contextual metadata in alerts (node id, disk id, pod name, last successful probe timestamp).
- Risk assessment and business impact
- Single failure significance:
- Might be low impact if redundancy absorbs the fault; but single failure can escalate into wider outage if not contained (e.g., degraded rebuilds, increased load on remaining resources).
- Time-to-recovery (MTTR) and detection (MTTD):
- Shortening MTTD and MTTR reduces blast radius. Invest in precise alerts and automated remediation for common faults.
- Regulatory and data risks:
- For storage/ASM, degraded redundancy increases risk of data loss if another failure occurs before rebuild completes; prioritize repair.
Conclusion
“asm health checker found 1 new failures” is more than a log line: it is an early warning. Responding effectively requires prompt triage, methodical diagnosis, and decisive remediation, combined with post-incident learning and engineering improvements to reduce recurrence. By classifying possible causes (storage, probe, resource, network, regression, auth), following a disciplined RCA approach, and implementing monitoring and automation best practices, teams can convert such alerts from frightening unknowns into manageable events and steadily improve system resilience.
Appendix: Minimal quick runbook (steps to execute immediately)
- Capture the alert details and correlate logs/metrics.
- Identify the affected resource (disk/pod/node/service).
- Attempt a manual probe/connection to reproduce failure.
- If production-impacting, trigger failover/scaleup and notify on-call.
- Apply targeted remediation (replace disk, fix probe, rollback deployment).
- Verify health across multiple intervals; monitor for recurrence.
- Create postmortem and assign permanent fixes.
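The "verify health across multiple intervals" step can be sketched as a small polling helper. This is a minimal illustration, not a production tool: `wait_for_recovery`, its parameters, and the injected `probe` callable are hypothetical names, and the probe itself (a query, a connection attempt, an HTTP call) is left to the caller.

```python
import time

def wait_for_recovery(probe, required_consecutive=3, interval_s=30.0,
                      max_attempts=20, sleep=time.sleep):
    """Return True once `probe` passes `required_consecutive` times in a row.

    `probe` is any zero-argument callable returning True (healthy) or False.
    A single failed probe resets the streak, so a flapping resource is not
    reported as recovered.
    """
    streak = 0
    for attempt in range(1, max_attempts + 1):
        if probe():
            streak += 1
            if streak >= required_consecutive:
                return True
        else:
            streak = 0
        if attempt < max_attempts:
            sleep(interval_s)  # wait one interval before probing again
    return False

# Usage with a scripted probe and a no-op sleep so the example runs instantly.
outcomes = iter([True, False, True, True, True])
print(wait_for_recovery(lambda: next(outcomes), sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the helper easy to test; in real use the default `time.sleep` spaces the probes by `interval_s` seconds.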
Troubleshooting Oracle ASM Health Checker Failures
The message "ASM Health Checker found 1 new failures" is a critical alert in Oracle Automatic Storage Management (ASM). It typically appears in the ASM alert log when the background health monitoring process detects a problem that could threaten disk group availability.
Immediate Impact
When this error is triggered, it often coincides with other critical events:
- Disk Group Dismounting: ASM may force a dismount of a disk group (e.g., ORA-15130) to prevent data corruption.
- Instance Reconfiguration: A "Dirty detach reconfiguration" may start as the cluster tries to handle the failure.
- Database Downtime: If the affected disk group contains critical files like the OCR, Voting files, or database data files, the associated Oracle instance or Clusterware may crash.
Common Root Causes
- Lost Storage Connectivity: One or more LUNs/disks became inaccessible due to hardware, cable, or storage controller issues.
- Write I/O Errors: ASM takes disks offline if it cannot complete a write operation, which can lead to a disk group failure if redundancy is lost.
- Insufficient Redundancy: In "External Redundancy" disk groups, the failure of even a single disk causes the entire group to fail.
- Disk Header Corruption: Physical corruption of the disk header can prevent ASM from identifying the disk as a "MEMBER" of a group.
Investigative Steps
To identify and resolve the specific failure, follow these steps:
ASM Health Checker Found 1 New Failure: What It Means and How to Resolve It
The Automatic Storage Management (ASM) health checker is a crucial tool in Oracle databases that monitors the health and integrity of the storage infrastructure. When the ASM health checker reports a new failure, it's essential to understand the implications and take corrective actions to prevent data loss or system downtime. In this blog post, we'll discuss what an ASM health checker failure means, how to investigate the issue, and steps to resolve it.
What does an ASM health checker failure mean?
When the ASM health checker detects a problem, it logs an error message indicating that a failure has been detected. The message may look like this:
"ASM health checker found 1 new failure"
This message indicates that the ASM health checker has detected a single failure in the storage system. The failure could be related to various issues, such as:
- Disk errors or corruption
- Connectivity problems between the database server and storage
- Insufficient disk space or quota issues
- ASM configuration errors
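To locate this message in an alert log, a short script can scan the text and extract the reported failure count. The sketch below is illustrative only: the function name is hypothetical, the sample text is made up rather than taken from a real log, and the pattern assumes the message wording quoted above (accepting both "failure" and "failures").

```python
import re

# Case-insensitive match for the message, capturing the failure count.
PATTERN = re.compile(r"asm health checker found (\d+) new failures?", re.IGNORECASE)

def find_health_checker_alerts(log_text):
    """Return (line_number, failure_count) for every matching line."""
    hits = []
    for line_number, line in enumerate(log_text.splitlines(), start=1):
        match = PATTERN.search(line)
        if match:
            hits.append((line_number, int(match.group(1))))
    return hits

# Made-up sample text standing in for an alert log excerpt.
sample = (
    "example line before the alert\n"
    "ASM Health Checker found 1 new failures\n"
    "example line after the alert\n"
)
print(find_health_checker_alerts(sample))
```

The line numbers make it easy to jump to the surrounding entries, which usually name the affected disk group.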
Investigating the ASM health checker failure
To investigate the failure, follow these steps:
- Check the ASM alert log: The ASM alert log provides detailed information about the failure, including the error message, timestamp, and affected disk group. By default it is the alert_<instance_name>.log file in the $ORACLE_BASE/diag/asm/+asm/<instance_name>/trace directory.
- Run the asmcmd command: The asmcmd command-line tool provides a comprehensive view of the ASM configuration and status. Run asmcmd with the lsdg command to list the disk groups and their status:
asmcmd lsdg
- Check the disk group status: Pass the disk group name to lsdg to check the status of the affected disk group, and use lsdsk to list its disks:
asmcmd lsdg <disk_group_name>
asmcmd lsdsk -G <disk_group_name>
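When scripting the first step, the alert log location can be derived from the Oracle base and the instance name. This is a small sketch assuming the default ADR layout (a lowercase `+asm` directory followed by the instance name); the function name and the example values are hypothetical, and the actual location should be confirmed against the instance's diagnostic destination.

```python
from pathlib import PurePosixPath

def asm_alert_log_path(oracle_base, instance_name):
    """Build the conventional ADR path of the ASM alert log (assumed layout)."""
    trace_dir = PurePosixPath(oracle_base) / "diag" / "asm" / "+asm" / instance_name / "trace"
    return trace_dir / f"alert_{instance_name}.log"

# Example values are placeholders, not defaults of any installation.
print(asm_alert_log_path("/u01/app/grid", "+ASM1"))
```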
Resolving the ASM health checker failure
Once you've identified the root cause of the failure, take corrective actions to resolve the issue:
- Replace a failed disk: If the failure is due to a disk error, replace the disk and re-add it to the ASM disk group.
- Check and correct connectivity: Verify that the storage connections are stable and functioning correctly.
- Free up disk space: If the failure is due to insufficient disk space, free up space by deleting unnecessary files or expanding the disk group.
- Reconfigure ASM: If the failure is due to an ASM configuration error, reconfigure ASM with the correct settings.
Best practices to prevent ASM health checker failures
To minimize the likelihood of ASM health checker failures:
- Monitor ASM alerts: Check the ASM alert log regularly and respond promptly to any errors or warnings.
- Perform routine maintenance: Check disk space and replace failed disks on a regular schedule.
- Test and validate ASM configurations: Test and validate ASM configurations to ensure they are correct and optimal.
By understanding the causes of ASM health checker failures and taking proactive steps to prevent them, you can ensure the reliability and performance of your Oracle database storage infrastructure.
ASM Health Checker Found 1 New Failure: What It Means and How to Resolve It
If you're a database administrator or a system administrator working with Oracle databases, you're likely familiar with the Automatic Storage Management (ASM) system. ASM is a storage management system that provides a simple and efficient way to manage storage for Oracle databases. One of the tools used to monitor and maintain ASM is the ASM Health Checker, which periodically checks the health of the ASM infrastructure and reports any issues or failures.
Recently, you may have encountered an alert or message indicating that the "ASM health checker found 1 new failure." This message can be concerning, especially if you're not familiar with what it means or how to resolve it. In this article, we'll explore what this message means, the possible causes, and step-by-step instructions on how to resolve the issue.
What Does the ASM Health Checker Do?
The ASM Health Checker is a background process that periodically checks the health of the ASM infrastructure. It monitors various aspects of ASM, including:
- Disk availability and performance
- Disk group configuration and status
- ASM instance status and performance
- I/O operations and errors
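The ASM Health Checker runs automatically and reports any issues or failures it detects. Note that ASM_CHECK_INTERVAL is not among the documented ASM initialization parameters (such as ASM_DISKGROUPS, ASM_DISKSTRING, and ASM_POWER_LIMIT), so verify any interval setting against the documentation for your release before relying on it.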
What Does "ASM Health Checker Found 1 New Failure" Mean?
When the ASM Health Checker detects a new failure, it reports the issue and provides information about the failure. The message "ASM health checker found 1 new failure" indicates that the checker has detected a problem with the ASM infrastructure that requires attention.
The failure can be related to various aspects of ASM, such as:
- A disk failure or error
- A disk group configuration issue
- An ASM instance failure or performance issue
- An I/O error or performance problem
Possible Causes of the Failure
There are several possible causes for the ASM Health Checker to report a new failure. Some common causes include:
- Disk failure or error: A disk failure or error can occur due to hardware issues, such as a disk crash or a cable problem.
- Disk group configuration issue: A disk group configuration issue can occur if there are problems with the disk group configuration, such as a missing disk or an incorrect disk group name.
- ASM instance failure or performance issue: An ASM instance failure or performance issue can occur due to problems with the ASM instance, such as a lack of resources or a configuration issue.
- I/O error or performance problem: An I/O error or performance problem can occur due to issues with the storage subsystem, such as a slow disk or a network problem.
How to Resolve the Issue
To resolve the issue, follow these step-by-step instructions:
- Check the ASM alert log: The ASM alert log provides detailed information about the failure, including the error message and the time it occurred. You can find the ASM alert log in the $ORACLE_BASE/diag/asm/+asm/<instance_name>/trace directory.
- Investigate the failure: Use the information from the ASM alert log to investigate the failure. Check the ASM disk groups, disks, and instances to identify any issues.
- Run a manual disk group check: The PL/SQL block sometimes quoted for this step (ALTER SESSION SET CONTAINER = '+ASM' followed by DBMS_ASMADM.check_health) does not correspond to a documented Oracle interface. To check the internal consistency of a disk group's metadata on demand, connect to the ASM instance as SYSASM and run:
ALTER DISKGROUP <disk_group_name> CHECK NOREPAIR;
If the check finds errors, it displays a summary message and writes the details to the ASM alert log.
- Check the disk groups and disks: Check the disk groups and disks to ensure they are configured correctly and are online.
SELECT * FROM V$ASM_DISKGROUP;
SELECT * FROM V$ASM_DISK;
- Check the ASM instance: Check the ASM instance to ensure it is running and configured correctly.
SELECT instance_name, status FROM V$INSTANCE;
- Perform corrective actions: Based on the investigation, perform corrective actions to resolve the issue. This may include:
- Replacing a failed disk
- Reconfiguring a disk group
- Restarting the ASM instance
- Correcting an I/O error or performance problem
Best Practices to Avoid Future Failures
To avoid future failures and ensure the health of your ASM infrastructure, follow these best practices:
- Regularly monitor the ASM alert log: Regularly monitoring the ASM alert log can help you detect issues before they become major problems.
- Check disk groups regularly: Run periodic disk group consistency checks (for example, ALTER DISKGROUP ... CHECK NOREPAIR) to identify potential issues before they cause an outage.
- Configure disk groups and disks correctly: Ensure disk groups and disks are configured correctly and are online.
- Monitor ASM instance performance: Monitor ASM instance performance to ensure it is running optimally.
By following these best practices and resolving the issue reported by the ASM Health Checker, you can ensure the health and performance of your ASM infrastructure and prevent future failures.
The alert "ASM Health Checker found 1 new failures" typically appears in your Oracle ASM alert logs when the Automatic Diagnostic Repository (ADR) health monitor detects a critical issue during a maintenance task, such as a diskgroup rebalance or a disk add operation.
Understanding the Failure
When this message occurs, it indicates that a health check, either triggered automatically by an incident or run manually, has identified a problem that could compromise your storage. Common triggers include:
- Disk Failgroup Issues: A diskgroup has fewer failure groups than recommended (e.g., fewer than 3 for normal redundancy).
- Disk Status/Mount Failures: Disks are missing, offline, or have lost membership.
- Metadata Corruption: Corruption found in the first 250 blocks of an ASM disk, which contain essential metadata.
- Quorum Loss: The diskgroup cannot maintain a read quorum, often leading to an automatic dismount.
How to Diagnose and Fix
To resolve the failure, follow these diagnostic steps:
4. Mismatched Disk Group Compatibility
If compatible.asm, compatible.rdbms, or compatible.advm values are set incorrectly relative to the GI version, the health checker will report advisories as failures.
Step 3: Check Disk Group Status
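SELECT name, state, type, total_mb, free_mb, required_mirror_free_mb, offline_disks
FROM v$asm_diskgroup;
Look for disk groups that are MOUNTED but report a non-zero OFFLINE_DISKS count, or whose FREE_MB has dropped below REQUIRED_MIRROR_FREE_MB. Per-disk status is available in the MODE_STATUS column of v$asm_disk.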
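Once the rows from v$asm_diskgroup have been fetched, the review can be automated. The sketch below is a hedged illustration: it assumes each row is a plain dict keyed by lowercase column name (however your database driver delivers them), treats `offline_disks` as optional in case that column was not selected, and uses a hypothetical function name and made-up sample values.

```python
def flag_disk_groups(rows):
    """Return {disk_group_name: [reasons]} for groups that need attention."""
    flagged = {}
    for row in rows:
        reasons = []
        if row["state"] != "MOUNTED":
            reasons.append(f"state is {row['state']}")
        # offline_disks is optional: skip the test if it was not selected.
        if row.get("offline_disks", 0) > 0:
            reasons.append(f"{row['offline_disks']} disk(s) offline")
        # Too little free space to restore redundancy after a failure.
        if row["free_mb"] < row["required_mirror_free_mb"]:
            reasons.append("free_mb below required_mirror_free_mb")
        if reasons:
            flagged[row["name"]] = reasons
    return flagged

# Made-up rows for illustration only.
rows = [
    {"name": "DATA", "state": "MOUNTED", "free_mb": 1000,
     "required_mirror_free_mb": 2000, "offline_disks": 1},
    {"name": "FRA", "state": "MOUNTED", "free_mb": 5000,
     "required_mirror_free_mb": 0},
]
print(flag_disk_groups(rows))
```

The state test assumes the query runs on the ASM instance; from a database instance, a disk group in use reports CONNECTED rather than MOUNTED.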
Immediate actions (first 10 minutes)
- Don’t panic — gather context.
- Check the Health Checker details/console immediately for:
- The specific check name and ID.
- Timestamp of the failure.
- Failure severity (critical/warning/info).
- Any short description or error code provided.
- Look at recent system events (last 30–60 minutes):
- Service restarts
- Deployments/patches
- Scheduled jobs or cron tasks
- OS reboots or kernel messages
- Check logs related to the failed check:
- Health checker logs
- Application server logs
- System logs (syslog/journalctl)
- Network or load balancer logs if relevant
- Confirm whether the failure is still present by re-running the single failed check (if your health checker supports manual re-run).
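The "recent system events" review above amounts to filtering a timeline by a window that ends at the failure timestamp. A minimal sketch, assuming events have already been collected as (timestamp, description) pairs; the function name and the sample events are hypothetical.

```python
from datetime import datetime, timedelta

def events_before(events, failure_time, window_minutes=60):
    """Return events in the window leading up to the failure, newest first."""
    start = failure_time - timedelta(minutes=window_minutes)
    recent = [event for event in events if start <= event[0] <= failure_time]
    return sorted(recent, key=lambda event: event[0], reverse=True)

# Made-up timeline for illustration.
failure_time = datetime(2024, 1, 1, 12, 0)
events = [
    (datetime(2024, 1, 1, 9, 0), "scheduled job finished"),
    (datetime(2024, 1, 1, 11, 20), "patch applied"),
    (datetime(2024, 1, 1, 11, 55), "service restarted"),
    (datetime(2024, 1, 1, 12, 30), "unrelated later event"),
]
for timestamp, description in events_before(events, failure_time):
    print(timestamp.isoformat(), description)
```

Listing the newest events first puts the most likely triggers at the top of the review.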
8. Potential Edge Cases
- First run ever → all failures are considered new (optionally treat as baseline, not alert)
- Check item disappears from results → ignore (unless required to track)
- Flapping failure (pass→fail→pass→fail) → should re-alert on each new appearance after a pass
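These edge cases can be captured in a small comparison function. The sketch below is one possible reading of the rules above, not a definitive implementation: it assumes each run is a dict mapping a check identifier to "PASS" or "FAIL", treats a missing previous run as the baseline, and uses hypothetical names throughout.

```python
def detect_new_failures(previous, current):
    """Return the sorted check ids that fail now but were not failing before.

    previous: results of the prior run, or None on the first run ever.
    current:  results of this run, mapping check id -> "PASS" or "FAIL".
    """
    if previous is None:
        return []  # first run: record a baseline instead of alerting
    # Iterating over `current` ignores checks that disappeared from the results.
    return sorted(
        check_id
        for check_id, status in current.items()
        if status == "FAIL" and previous.get(check_id) != "FAIL"
    )

# Flapping check: pass -> fail -> pass -> fail re-alerts on each new failure.
runs = [
    {"disk_online": "PASS"},
    {"disk_online": "FAIL"},
    {"disk_online": "PASS"},
    {"disk_online": "FAIL"},
]
previous = None
alerts = []
for current in runs:
    alerts.append(detect_new_failures(previous, current))
    previous = current
print(alerts)
```

Because only the immediately preceding run is compared, a check that keeps failing alerts once, while one that recovers and fails again alerts on each new appearance.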
Subject: ASM Health Check Report – New Failures Detected
To: Database Administration Team / System Health Monitoring Group
Date: [Insert Date]
Priority: Medium
Step 2: Query the ASM Health Checker Views (SQL)
Connect to the ASM instance and query the Health Monitor findings view, V$HM_FINDING (Oracle does not provide a v$asm_health_check view):
sqlplus / as sysasm
SET LINESIZE 200
COL name FORMAT a30
COL description FORMAT a60
SELECT finding_id, run_id, name, type, priority, status, time_detected, description
FROM v$hm_finding
WHERE status = 'OPEN'
ORDER BY time_detected DESC;
Each new failure should appear as a recent open finding; the DESCRIPTION column carries the actionable detail.