[Plugin Proposal] Custom HealthCheckProcessor to fix delayed failure detection in production (FAST_TCP plugin)

Background

Nacos’s current health-check mechanism often fails to detect instance failures in time, especially under high QPS and node volatility.

We’ve encountered cases where:

  • Instances remain marked as healthy even after the backend process behind the port has crashed
  • Heartbeat intervals are too generous for fast-fail microservices
  • Health status updates are delayed by the lazy sync strategy

Solution: Custom FAST_TCP HealthCheckProcessor

We wrote a minimal plugin, FastHealthCheckProcessor, that implements two enhancements:

  1. Shortened heartbeat timeout (default 15s → now 10s)
  2. Actual TCP connectivity test (verifies that the target instance is connectable via socket)

Highlights:

  • Non-invasive, uses HealthCheckType.CUSTOM
  • Triggers immediate updateInstance sync on failure
  • Can be plugged in via standard configuration
  • Limits false positives: an instance is marked down only when its heartbeat has timed out and the TCP probe also fails
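
The rule that combines the two signals can be isolated as a pure function, which makes it easy to reason about. The following is a minimal sketch, not part of the plugin itself: the class name DownDecision and the method shouldMarkDown are hypothetical, and only the 10-second threshold comes from the proposal.

```java
public class DownDecision {
    // Threshold from the proposal: 10 seconds without a heartbeat.
    static final long HEARTBEAT_TIMEOUT_MS = 10_000L;

    /**
     * Hypothetical helper: returns true only when the heartbeat is stale
     * AND the TCP probe failed. Either signal alone keeps the instance up.
     */
    static boolean shouldMarkDown(long lastBeatMillis, long nowMillis, boolean tcpAlive) {
        boolean heartbeatTimedOut = (nowMillis - lastBeatMillis) > HEARTBEAT_TIMEOUT_MS;
        return heartbeatTimedOut && !tcpAlive;
    }

    public static void main(String[] args) {
        long now = 100_000L;
        // Stale heartbeat (12s old) and dead port: marked down.
        System.out.println(shouldMarkDown(now - 12_000, now, false)); // true
        // Stale heartbeat but port still accepts connections: stays up.
        System.out.println(shouldMarkDown(now - 12_000, now, true));  // false
        // Fresh heartbeat (3s old) but dead port: stays up until the heartbeat goes stale.
        System.out.println(shouldMarkDown(now - 3_000, now, false));  // false
    }
}
```

The third case is the trade-off of this rule: a port that dies while heartbeats are still fresh is not reported until the heartbeat also passes the 10-second threshold.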

Code

```java
// Server-side Instance from the Nacos 1.x naming module. Unlike the API-level
// com.alibaba.nacos.api.naming.pojo.Instance, it exposes getLastBeat().
import com.alibaba.nacos.naming.core.Instance;
import com.alibaba.nacos.naming.healthcheck.AbstractHealthCheckProcessor;
import com.alibaba.nacos.naming.healthcheck.HealthCheckTask;
import com.alibaba.nacos.naming.healthcheck.HealthCheckType;

import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Optional;

/**
 * Pushes status changes immediately to avoid delayed synchronization.
 */
public class FastHealthCheckProcessor extends AbstractHealthCheckProcessor {

    private static final long HEARTBEAT_TIMEOUT_MS = 10_000L;
    private static final int CONNECT_TIMEOUT_MS = 2_000;

    @Override
    public String getType() {
        return "FAST_TCP";
    }

    @Override
    public void process(HealthCheckTask task) {
        // Find the instance this task refers to by IP and port.
        Optional<Instance> optional = task.getCluster().getService().allIPs().stream()
                .filter(inst -> inst.getIp().equals(task.getIp()) && inst.getPort() == task.getPort())
                .findFirst();
        if (!optional.isPresent()) {
            return;
        }

        Instance instance = optional.get();
        long lastBeat = instance.getLastBeat();
        long now = System.currentTimeMillis();
        // No heartbeat received for more than 10 seconds.
        boolean heartbeatTimeout = (now - lastBeat) > HEARTBEAT_TIMEOUT_MS;
        boolean tcpAlive = isPortAlive(instance.getIp(), instance.getPort());

        // Mark the instance down only when both signals agree that it is gone.
        if (heartbeatTimeout && !tcpAlive) {
            instance.setHealthy(false);
            task.getCluster().getService().updateInstance(instance);
            System.out.println("[FAST_TCP] Down: " + instance.getIp() + ":" + instance.getPort());
        } else {
            instance.setHealthy(true);
        }
    }

    /** Returns true if a TCP connection to ip:port succeeds within the connect timeout. */
    private boolean isPortAlive(String ip, int port) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(ip, port), CONNECT_TIMEOUT_MS);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    @Override
    public HealthCheckType getHealthCheckType() {
        return HealthCheckType.CUSTOM;
    }
}
```

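FastHealthCheckProcessor plugin (intended to replace the default Nacos health-check mechanism)

Features:

  • More responsive heartbeat detection (10-second threshold)
  • Adds a real TCP port probe so that dead instances do not appear alive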
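
To see what the TCP probe reports in practice, it can be exercised outside Nacos against a local listener. This is a minimal sketch using only the JDK: the class name TcpProbeDemo and the extra timeoutMillis parameter are assumptions made for the demo, while the connect-with-timeout logic mirrors the plugin's isPortAlive.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class TcpProbeDemo {
    /** Returns true if a TCP connection to ip:port succeeds within timeoutMillis. */
    static boolean isPortAlive(String ip, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(ip, port), timeoutMillis);
            return true;
        } catch (Exception e) {
            // Refused, timed out, unreachable: all treated as "not alive".
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        int port;
        // Port 0 asks the OS for any free port; the listener stands in for a backend.
        try (ServerSocket server = new ServerSocket(0)) {
            port = server.getLocalPort();
            System.out.println("while listening: " + isPortAlive("127.0.0.1", port, 2000));
        }
        // The listener is closed now, so the same probe should be refused.
        System.out.println("after close: " + isPortAlive("127.0.0.1", port, 2000));
    }
}
```

A refused connection fails almost immediately, whereas a host that silently drops packets takes the full timeout, so with the 2-second value used in the plugin each probe can block its checking thread for up to 2 seconds.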
