From 30be016be5f3af403feca2f624f19f612bd62890 Mon Sep 17 00:00:00 2001
From: Rene Cannao <rene.cannao@gmail.com>
Date: Sat, 11 Apr 2026 16:02:30 +0000
Subject: [PATCH 1/2] fix(pgsql-monitor): clamp next_*_at on LOAD PGSQL
 VARIABLES TO RUNTIME
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Closes the pgsql-servers_ssl_params-t portion of sysown/proxysql#5610.
This is the final of four tests from the flake tracking issue.

## Problem 1 — the scheduler caching bug (lib/PgSQL_Monitor.cpp)

`pgsql-servers_ssl_params-t` subtests 32 ("Monitor SSL counter
increased with use_ssl=1 and no per-server row") and 34 ("Monitor
per-server: SSL OK counter resumes advancing after removing the
per-server row") were failing deterministically on v3.0. Both expected
`PgSQL_Monitor_ssl_connections_OK` to increase within a few seconds of:

  SET pgsql-monitor_connect_interval=2000;
  LOAD PGSQL VARIABLES TO RUNTIME;
  UPDATE pgsql_servers SET use_ssl=1;
  LOAD PGSQL SERVERS TO RUNTIME;

but the counter stayed at 0. Root cause is state caching in the pgsql
monitor scheduler loop (`PgSQL_monitor_scheduler_thread()`):

  T=0       proxysql starts with default pgsql-monitor_connect_interval
            =120000. The first scheduler tick schedules an initial
            connect check; compute_next_intvs() sets next_connect_at
            = T + 120000 ms = T + 120 seconds.
  T+<30s>   test does SET connect_interval=2000 + LOAD PGSQL VARIABLES
            TO RUNTIME.  fetch_updated_conf() starts returning the new
            2000 value, but next_connect_at still points at T+120000
            because compute_next_intvs() only updates next_<type>_at
            when the corresponding task type has fired.
  T+<35s>   test reads ssl_connections_OK at the end of its 5-second
            wait.  Counter still 0 because the next scheduled connect
            check is ~85 seconds in the future.

The scheduler is working correctly; what's missing is a bridge between
"runtime variables were just refreshed" and "next_<type>_at should
reflect the refreshed (shorter) intervals".

Fix: in the scheduler loop, track whether the variable version bumped
this iteration, and if so, clamp each next_<type>_at down to
cur_intv_start + new_interval whenever the refreshed interval would
schedule the next check sooner than the currently-cached value. The
clamp is one-way (we never push next_<type>_at FURTHER into the
future, because growing the interval should not delay an already-
imminent check), idempotent, and safe against interval=0 (disabled)
which is left to compute_next_intvs() to set to ULONG_MAX.

Applied to all four monitor task types: ping, connect, readonly,
repl_lag. Same class of bug affected all of them; fixing only connect
would leave analogous latent issues for any test that changes the
other intervals at runtime.

Verified with the raw experiment before committing: configure
monitor/monitor + interval=2000 + use_ssl=1 + LOAD, observe counter
ticking at the 2-second cadence within 1-2 seconds of LOAD. Before
fix: counter stuck at 0 for ~2 minutes (until the cached 120 s
interval naturally elapses).

## Problem 2 — test's hardcoded wrong credentials (pgsql-servers_ssl_params-t.cpp)

With the scheduler fix alone, the test was still failing because its
main() was doing:

  SET pgsql-monitor_username='postgres';
  SET pgsql-monitor_password='postgres';

on the assumption that the backend had a postgres user with password
"postgres". But the actual test infra (test/infra/docker-pgsql16-single)
RANDOMIZES `POSTGRES_PASSWORD` on every container startup — e.g.
`POSTGRES_PASSWORD=05e792e51d`. Hardcoded 'postgres' never matched,
so every monitor connect failed with:

  FATAL: password authentication failed for user "postgres"

which increments connect_check_ERR instead of ssl_connections_OK.
After the scheduler fix, these auth failures fired every 2 seconds
instead of every 120 seconds — but they were still failures, so the
counter never advanced.

Fix: remove the username/password switch entirely. The default
monitor/monitor user is already configured in the infra's pg_hba.conf
and authenticates successfully (I verified this manually via
`docker exec psql 'host=... user=monitor password=monitor sslmode=require'`
from both inside and outside the proxysql container). The
corresponding "restore original values" block is also removed since
there's nothing to restore.

## Local verification

3 consecutive runs of the full pgsql-servers_ssl_params-t test in
legacy-g4 infra with the patched proxysql + patched test binary:

  attempt 1: PASS
  attempt 2: PASS
  attempt 3: PASS
  === pgsql-servers_ssl_params-t: 3/3 pass ===

Subtest-level confirmation from the final attempt's TAP log:

  # Original monitor: user=monitor interval=120000 ms
  # Initial PgSQL_Monitor_ssl_connections_OK: 33
  # After PgSQL_Monitor_ssl_connections_OK: 36      <- +3 in 5 s
  ok 32 - Monitor SSL counter increased with use_ssl=1 and no per-server row
  # With TLSv1 per-server pin, ssl OK before wait: 39
  # With TLSv1 per-server pin, ssl OK after wait:  39 (delta=0)
  ok 33 - Monitor per-server: SSL OK counter does NOT advance when
          per-server row pins ssl_protocol_version_range to TLSv1
  # After cleanup, ssl OK recovered from 41 to 44    <- +3 in 5 s
  ok 34 - Monitor per-server: SSL OK counter resumes advancing after
          removing the per-server row

All three monitor-SSL subtests now exercise the real code path (SSL
handshake happening, counter incrementing, per-server pin blocking
SSL as designed) instead of observing a no-op.

## Side effect on pgsql-ssl_keylog-t subtest 7

Subtest 7 of pgsql-ssl_keylog-t was marked as SKIP in PR #5612 because
it tripped on the same "pgsql monitor isn't making SSL connections"
symptom. With this fix merged, the skip's runtime condition
(`lines_before_monitor == lines_after_monitor`) will evaluate to false
once the monitor is actually producing SSL handshakes, and the test
will fall into the `ok(...)` branch automatically. No separate change
to that test is needed — the skip was defensive and is dead code
after this fix.

## Why mysql monitor is not touched here

This fix is scoped to the pgsql monitor scheduler only. The mysql
monitor is a different file (lib/MySQL_Monitor.cpp) with a different
scheduling architecture (per-thread timers, not a centralized
scheduler). If the same class of bug exists there, it would need a
separate patch - out of scope for this PR.
---
 lib/PgSQL_Monitor.cpp                         | 57 +++++++++++++++++++
 test/tap/tests/pgsql-servers_ssl_params-t.cpp | 28 ++++-----
 2 files changed, 72 insertions(+), 13 deletions(-)
diff --git a/lib/PgSQL_Monitor.cpp b/lib/PgSQL_Monitor.cpp
index 30fab6727..538d6d993 100644
--- a/lib/PgSQL_Monitor.cpp
+++ b/lib/PgSQL_Monitor.cpp
@@ -2533,14 +2533,71 @@ void* PgSQL_monitor_scheduler_thread() {
 
 			// Check variable version changes; refresh if needed
 			unsigned int glover = GloPTH->get_global_version();
+			bool vars_refreshed = false;
 			if (PgSQL_Thread__variables_version < glover) {
 				PgSQL_Thread__variables_version = glover;
 				pgsql_thread->refresh_variables();
+				vars_refreshed = true;
 			}
 
 			// Fetch config for next task scheduling
 			tasks_conf_t tasks_conf { fetch_updated_conf(GloPgMon, PgHGM) };
 
+			// When runtime variables were just refreshed (i.e. someone did
+			// SET pgsql-monitor_*_interval=<smaller_value>; LOAD PGSQL
+			// VARIABLES TO RUNTIME;), the newly-read interval values may be
+			// smaller than the ones we used the last time we recomputed
+			// next_intvs. Without the clamp below, each next_<type>_at still
+			// points at the cycle that was scheduled under the OLD (larger)
+			// interval, so a LOAD doesn't visibly take effect until the old
+			// cycle elapses — which can be up to ~2 minutes for connect
+			// checks under the default pgsql-monitor_connect_interval=120000.
+			//
+			// Example of the surprise this avoids:
+			//
+			//   T=0    proxysql starts with monitor_connect_interval=120000,
+			//          next_connect_at is scheduled for T=120000ms.
+			//   T=5    user runs SET pgsql-monitor_connect_interval=2000;
+			//          LOAD PGSQL VARIABLES TO RUNTIME; expecting the next
+			//          connect cycle within ~2 seconds.
+			//   T=7    without this clamp, next_connect_at is still 120000
+			//          — no new connect fires for another 113 seconds.
+			//
+			// The fix: after every successful refresh, shrink any
+			// next_<type>_at that's now farther in the future than
+			// (now + new_interval). We never push next_<type>_at further
+			// into the future — growing the interval should not delay an
+			// already-imminent check — so the direction of the clamp is
+			// one-way (min()). Types whose interval is 0 (disabled) are
+			// skipped and handled by the existing compute_next_intvs()
+			// which sets them to ULONG_MAX.
+			if (vars_refreshed) {
+				if (tasks_conf.ping.params.interval > 0) {
+					const uint64_t clamped = cur_intv_start + tasks_conf.ping.params.interval;
+					if (next_intvs.next_ping_at > clamped) {
+						next_intvs.next_ping_at = clamped;
+					}
+				}
+				if (tasks_conf.connect.params.interval > 0) {
+					const uint64_t clamped = cur_intv_start + tasks_conf.connect.params.interval;
+					if (next_intvs.next_connect_at > clamped) {
+						next_intvs.next_connect_at = clamped;
+					}
+				}
+				if (tasks_conf.readonly.params.interval > 0) {
+					const uint64_t clamped = cur_intv_start + tasks_conf.readonly.params.interval;
+					if (next_intvs.next_readonly_at > clamped) {
+						next_intvs.next_readonly_at = clamped;
+					}
+				}
+				if (tasks_conf.repl_lag.params.interval > 0) {
+					const uint64_t clamped = cur_intv_start + tasks_conf.repl_lag.params.interval;
+					if (next_intvs.next_repl_lag_at > clamped) {
+						next_intvs.next_repl_lag_at = clamped;
+					}
+				}
+			}
+
 			// Perform table maintenance
 			maint_mon_tables(tasks_conf, next_intvs, cur_intv_start);
 
diff --git a/test/tap/tests/pgsql-servers_ssl_params-t.cpp b/test/tap/tests/pgsql-servers_ssl_params-t.cpp
index 72705a0d7..68ccd695c 100644
--- a/test/tap/tests/pgsql-servers_ssl_params-t.cpp
+++ b/test/tap/tests/pgsql-servers_ssl_params-t.cpp
@@ -648,16 +648,24 @@ int main(int argc, char** argv) {
 	// Configure monitor: use 'postgres' user which is accepted by the
 	// backend, set a tight connect_interval so cycles run within the test
 	// wait window. Restore original values afterwards.
+	//
+	// We deliberately do NOT change pgsql-monitor_username /
+	// pgsql-monitor_password. An earlier version of this test set them
+	// to 'postgres'/'postgres' on the assumption that the backend had a
+	// postgres user with password 'postgres', but the actual CI infra
+	// (docker-pgsql16-single) randomizes POSTGRES_PASSWORD per container
+	// startup (e.g. "05e792e51d"), so the hardcoded credentials never
+	// authenticated and every monitor connect failed with
+	//   FATAL: password authentication failed for user "postgres"
+	// The default monitor/monitor user works against the infra's
+	// pg_hba.conf and is what the initial (pre-test) connect check
+	// already uses successfully.
 	long original_connect_interval = getConnectInterval(a);
 	std::string original_monitor_username = exec_scalar(a,
 		"SELECT Variable_Value FROM global_variables WHERE Variable_Name='pgsql-monitor_username'");
-	std::string original_monitor_password = exec_scalar(a,
-		"SELECT Variable_Value FROM global_variables WHERE Variable_Name='pgsql-monitor_password'");
 	diag("Original monitor: user=%s interval=%ld ms",
 		original_monitor_username.c_str(), original_connect_interval);
 
-	exec_ok(a, "SET pgsql-monitor_username='postgres'");
-	exec_ok(a, "SET pgsql-monitor_password='postgres'");
 	setConnectInterval(a, 2000);
 	exec_ok(a, "UPDATE pgsql_servers SET use_ssl=1");
 	exec_ok(a, "LOAD PGSQL SERVERS TO RUNTIME");
@@ -666,15 +674,9 @@ int main(int argc, char** argv) {
 	test_monitor_ssl_with_per_server_params(a);
 	test_monitor_uses_per_server_row(a);
 
-	// Restore original monitor settings
-	{
-		std::stringstream q;
-		q << "SET pgsql-monitor_username='" << original_monitor_username << "'";
-		exec_ok(a, q.str().c_str());
-		q.str("");
-		q << "SET pgsql-monitor_password='" << original_monitor_password << "'";
-		exec_ok(a, q.str().c_str());
-	}
+	// Restore original connect interval. Monitor username/password are
+	// no longer touched by this test - see comment above - so there's
+	// nothing to restore there.
 	setConnectInterval(a, (int)original_connect_interval);
 
 	// Part 3: Cluster query support

From 4bd2f95e5b501f71947e0d2489a4b5c11bcdd7a0 Mon Sep 17 00:00:00 2001
From: Rene Cannao <rene@proxysql.com>
Date: Sun, 12 Apr 2026 11:59:29 +0000
Subject: [PATCH 2/2] fix(pgsql-monitor): add proxy_debug logging and clean up
 test variable

- Add proxy_debug(PROXY_DEBUG_MONITOR, 5) calls when clamping fires,
  so the fix is observable in debug logs. Follows existing pattern
  from nearby scheduling debug output.
- Rename original_monitor_username to monitor_username and clarify it
  is for diagnostic logging only, since the variable is no longer used
  for restore after the credential change was removed.
---
 lib/PgSQL_Monitor.cpp                         | 12 ++++++++++++
 test/tap/tests/pgsql-servers_ssl_params-t.cpp |  7 ++++---
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/lib/PgSQL_Monitor.cpp b/lib/PgSQL_Monitor.cpp
index 538d6d993..7f499d265 100644
--- a/lib/PgSQL_Monitor.cpp
+++ b/lib/PgSQL_Monitor.cpp
@@ -2575,24 +2575,36 @@ void* PgSQL_monitor_scheduler_thread() {
 				if (tasks_conf.ping.params.interval > 0) {
 					const uint64_t clamped = cur_intv_start + tasks_conf.ping.params.interval;
 					if (next_intvs.next_ping_at > clamped) {
+						proxy_debug(PROXY_DEBUG_MONITOR, 5,
+							"Clamped next_ping_at   old=%lu new=%lu interval=%d\n",
+							next_intvs.next_ping_at, clamped, tasks_conf.ping.params.interval);
 						next_intvs.next_ping_at = clamped;
 					}
 				}
 				if (tasks_conf.connect.params.interval > 0) {
 					const uint64_t clamped = cur_intv_start + tasks_conf.connect.params.interval;
 					if (next_intvs.next_connect_at > clamped) {
+						proxy_debug(PROXY_DEBUG_MONITOR, 5,
+							"Clamped next_connect_at   old=%lu new=%lu interval=%d\n",
+							next_intvs.next_connect_at, clamped, tasks_conf.connect.params.interval);
 						next_intvs.next_connect_at = clamped;
 					}
 				}
 				if (tasks_conf.readonly.params.interval > 0) {
 					const uint64_t clamped = cur_intv_start + tasks_conf.readonly.params.interval;
 					if (next_intvs.next_readonly_at > clamped) {
+						proxy_debug(PROXY_DEBUG_MONITOR, 5,
+							"Clamped next_readonly_at   old=%lu new=%lu interval=%d\n",
+							next_intvs.next_readonly_at, clamped, tasks_conf.readonly.params.interval);
 						next_intvs.next_readonly_at = clamped;
 					}
 				}
 				if (tasks_conf.repl_lag.params.interval > 0) {
 					const uint64_t clamped = cur_intv_start + tasks_conf.repl_lag.params.interval;
 					if (next_intvs.next_repl_lag_at > clamped) {
+						proxy_debug(PROXY_DEBUG_MONITOR, 5,
+							"Clamped next_repl_lag_at   old=%lu new=%lu interval=%d\n",
+							next_intvs.next_repl_lag_at, clamped, tasks_conf.repl_lag.params.interval);
 						next_intvs.next_repl_lag_at = clamped;
 					}
 				}
diff --git a/test/tap/tests/pgsql-servers_ssl_params-t.cpp b/test/tap/tests/pgsql-servers_ssl_params-t.cpp
index 68ccd695c..13f6627a6 100644
--- a/test/tap/tests/pgsql-servers_ssl_params-t.cpp
+++ b/test/tap/tests/pgsql-servers_ssl_params-t.cpp
@@ -661,10 +661,11 @@ int main(int argc, char** argv) {
 	// pg_hba.conf and is what the initial (pre-test) connect check
 	// already uses successfully.
 	long original_connect_interval = getConnectInterval(a);
-	std::string original_monitor_username = exec_scalar(a,
+	// Read current monitor username for diagnostic logging only.
+	std::string monitor_username = exec_scalar(a,
 		"SELECT Variable_Value FROM global_variables WHERE Variable_Name='pgsql-monitor_username'");
-	diag("Original monitor: user=%s interval=%ld ms",
-		original_monitor_username.c_str(), original_connect_interval);
+	diag("Current monitor: user=%s interval=%ld ms",
+		monitor_username.c_str(), original_connect_interval);
 
 	setConnectInterval(a, 2000);
 	exec_ok(a, "UPDATE pgsql_servers SET use_ssl=1");