Saturday, October 30, 2021

Troubleshooting a weird Nagios NRPE SSL/TLS error

We recently gained limited access to a customer data center in order to monitor some machines that our software is running on.  For historical reasons, we use Nagios as our monitoring tool (yes, I know that it's 2021) and we use NRPE to monitor our Linux boxes (yes, I know that NRPE is deprecated in favor of NCPA).

We had to provide the customer with a list of source IP addresses and target ports (for example, 5666 for NRPE) as part of the process to get the VPN set up.  Foreshadowing: this will become relevant soon.

After getting NRPE installed on all of our machines, we noticed that Nagios was failing to connect to any of them.  The NRPE logs all had the following errors:

Starting up daemon
Server listening on 0.0.0.0 port 5666.
Server listening on :: port 5666.
Warning: Daemon is configured to accept command arguments from clients!
Listening for connections on port 5666
Allowing connections from: 127.0.0.1,::1,[redacted]
Error: Network server getpeername() failure (107: Transport endpoint is not connected)
warning: can't get client address: Connection reset by peer
Error: (!log_opts) Could not complete SSL handshake with [redacted]: 5
warning: can't get client address: Connection reset by peer
Error: Network server getpeername() failure (107: Transport endpoint is not connected)
Error: Network server getpeername() failure (107: Transport endpoint is not connected)
warning: can't get client address: Connection reset by peer
Error: (!log_opts) Could not complete SSL handshake with [redacted]: 5
warning: can't get client address: Connection reset by peer
warning: can't get client address: Connection reset by peer

So, this is obviously an SSL/TLS problem.

However, everyone on the Internet basically says that this is a problem with the NRPE client machine (the Nagios source address isn't listed in "allowed_hosts", it's not set up for SSL correctly, you didn't compile it right, etc.).

After fighting with this for hours, we finally figured out what was wrong.

A hint was the "getpeername() failure"; if you open up the NRPE source code, that call runs immediately after the connection is established.  The only way that you could see this error ("Transport endpoint is not connected") is if the socket was closed between that initial connection and the call to "getpeername".
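To make that concrete, here is a minimal Python sketch (not NRPE's actual code) that reproduces the same failure on the loopback interface.  It assumes Linux behavior: the client aborts the connection with a RST by setting SO_LINGER to zero before closing, and the server then calls "getpeername" on the socket it already accepted.  The function name is made up for illustration, and other platforms may report a different error or none at all.

```python
import errno
import socket
import struct
import time


def getpeername_after_reset():
    """Return the errno raised by getpeername() after the peer resets, or None."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(server.getsockname())
    conn, _addr = server.accept()  # the TCP handshake has completed here

    # SO_LINGER enabled with a zero timeout makes close() send a RST
    # instead of the usual FIN (assumption: Linux-style struct linger).
    client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    client.close()
    time.sleep(0.2)  # give the RST time to arrive on the accepted socket

    try:
        conn.getpeername()
        return None  # this platform still reports the peer address
    except OSError as exc:
        return exc.errno
    finally:
        conn.close()
        server.close()


if __name__ == "__main__":
    code = getpeername_after_reset()
    if code is None:
        print("getpeername() succeeded")
    else:
        print("getpeername() failed:", errno.errorcode.get(code, code))
```

On Linux I would expect this to report ENOTCONN, which is the same error 107 that shows up in the NRPE log above.  The point is only that a reset arriving right after the handshake is enough to produce it.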

Running "tcpdump" on both sides yielded the following findings:

On Nagios:

Nagios → NRPE machine: SYN
NRPE machine → Nagios: SYN, ACK
Nagios → NRPE machine: ACK
Nagios → NRPE machine: TLSv1 Client Hello
NRPE machine → Nagios: RST, ACK

On the NRPE machine to be monitored:

Nagios → NRPE machine: SYN
NRPE machine → Nagios: SYN, ACK
Nagios → NRPE machine: ACK
Nagios → NRPE machine: RST, ACK

Both machines agreed on the first 3 packets: the classic TCP handshake.  However, they differed on the subsequent packets.  Nagios sent a TLSv1 "Client Hello" packet and immediately received a reset (RST, ACK) that appeared to come from the NRPE machine.  However, the NRPE machine did not see the TLSv1 "Client Hello" at all; rather, it saw a reset that appeared to come from Nagios.  Neither side actually sent the reset that the other side received.
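The comparison of the two captures can be sketched in a few lines of Python.  The packet lists below are typed in by hand from the two traces above (they are not parsed from a real capture file), and the names NAGIOS_VIEW, NRPE_VIEW, and first_divergence are made up for illustration.

```python
# Each packet is (sender, receiver, summary), as seen by one capture point.
NAGIOS_VIEW = [
    ("nagios", "nrpe", "SYN"),
    ("nrpe", "nagios", "SYN, ACK"),
    ("nagios", "nrpe", "ACK"),
    ("nagios", "nrpe", "TLSv1 Client Hello"),
    ("nrpe", "nagios", "RST, ACK"),
]

NRPE_VIEW = [
    ("nagios", "nrpe", "SYN"),
    ("nrpe", "nagios", "SYN, ACK"),
    ("nagios", "nrpe", "ACK"),
    ("nagios", "nrpe", "RST, ACK"),
]


def first_divergence(trace_a, trace_b):
    """Return the index of the first packet where the traces differ, or None."""
    for index, (pkt_a, pkt_b) in enumerate(zip(trace_a, trace_b)):
        if pkt_a != pkt_b:
            return index
    if len(trace_a) != len(trace_b):
        # One trace is a strict prefix of the other.
        return min(len(trace_a), len(trace_b))
    return None


if __name__ == "__main__":
    index = first_divergence(NAGIOS_VIEW, NRPE_VIEW)
    print("traces agree on the first", index, "packets")
    print("Nagios saw:", NAGIOS_VIEW[index])
    print("NRPE saw:  ", NRPE_VIEW[index])
```

The two views agree on indexes 0 through 2 (the handshake) and split at index 3, where each side sees a packet that the other side never recorded.  That is the signature of something in the path rewriting the conversation.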

This is indicative of some trickery being done by the customer's equipment (firewall, VPN, etc.).  From what I can tell, it's quietly stripping out any TLS packets and killing the connection whenever it finds one.  They probably have an incorrect port rule set up for port 5666, but anyway, that's the problem here: the network infrastructure is tearing out the TLS packets and closing the connection.