fix(pgsql-monitor): clamp next_*_at on LOAD PGSQL VARIABLES TO RUNTIME

Closes the pgsql-servers_ssl_params-t portion of sysown/proxysql#5610. This is the final of four tests from the flake tracking issue. ## Problem 1 — the scheduler caching bug (lib/PgSQL_Monitor.cpp) `pgsql-servers_ssl_params-t` subtests 32 ("Monitor SSL counter increased with use_ssl=1 and no per-server row") and 34 ("Monitor per-server: SSL OK counter resumes advancing after removing the per-server row") were failing deterministically on v3.0. Both expected `PgSQL_Monitor_ssl_connections_OK` to increase within a few seconds of: SET pgsql-monitor_connect_interval=2000; LOAD PGSQL VARIABLES TO RUNTIME; UPDATE pgsql_servers SET use_ssl=1; LOAD PGSQL SERVERS TO RUNTIME; but the counter stayed at 0. Root cause is state caching in the pgsql monitor scheduler loop (`PgSQL_monitor_scheduler_thread()`): T=0 proxysql starts with default pgsql-monitor_connect_interval =120000. The first scheduler tick schedules an initial connect check; compute_next_intvs() sets next_connect_at = T + 120000 ms = T + 120 seconds. T+<30s> test does SET connect_interval=2000 + LOAD PGSQL VARIABLES TO RUNTIME. fetch_updated_conf() starts returning the new 2000 value, but next_connect_at still points at T+120000 because compute_next_intvs() only updates next_<type>_at when the corresponding task type has fired. T+<35s> test reads ssl_connections_OK at the end of its 5-second wait. Counter still 0 because the next scheduled connect check is ~85 seconds in the future. The scheduler is working correctly; what's missing is a bridge between "runtime variables were just refreshed" and "next_<type>_at should reflect the refreshed (shorter) intervals". Fix: in the scheduler loop, track whether the variable version bumped this iteration, and if so, clamp each next_<type>_at down to cur_intv_start + new_interval whenever the refreshed interval would schedule the next check sooner than the currently-cached value. The clamp is one-way (we never push next_<type>_at FURTHER into the future, because growing the interval should not delay an already- imminent check), idempotent, and safe against interval=0 (disabled) which is left to compute_next_intvs() to set to ULONG_MAX. Applied to all four monitor task types: ping, connect, readonly, repl_lag. Same class of bug affected all of them; fixing only connect would leave analogous latent issues for any test that changes the other intervals at runtime. Verified with the raw experiment before committing: configure monitor/monitor + interval=2000 + use_ssl=1 + LOAD, observe counter ticking at the 2-second cadence within 1-2 seconds of LOAD. Before fix: counter stuck at 0 for ~2 minutes (until the cached 120 s interval naturally elapses). ## Problem 2 — test's hardcoded wrong credentials (pgsql-servers_ssl_params-t.cpp) With the scheduler fix alone, the test was still failing because its main() was doing: SET pgsql-monitor_username='postgres'; SET pgsql-monitor_password='postgres'; on the assumption that the backend had a postgres user with password "postgres". But the actual test infra (test/infra/docker-pgsql16-single) RANDOMIZES `POSTGRES_PASSWORD` on every container startup — e.g. `POSTGRES_PASSWORD=05e792e51d`. Hardcoded 'postgres' never matched, so every monitor connect failed with: FATAL: password authentication failed for user "postgres" which increments connect_check_ERR instead of ssl_connections_OK. After the scheduler fix, these auth failures fired every 2 seconds instead of every 120 seconds — but they were still failures, so the counter never advanced. Fix: remove the username/password switch entirely. The default monitor/monitor user is already configured in the infra's pg_hba.conf and authenticates successfully (I verified this manually via `docker exec psql 'host=... user=monitor password=monitor sslmode=require'` from both inside and outside the proxysql container). The corresponding "restore original values" block is also removed since there's nothing to restore. ## Local verification 3 consecutive runs of the full pgsql-servers_ssl_params-t test in legacy-g4 infra with the patched proxysql + patched test binary: attempt 1: PASS attempt 2: PASS attempt 3: PASS === pgsql-servers_ssl_params-t: 3/3 pass === Subtest-level confirmation from the final attempt's TAP log: # Original monitor: user=monitor interval=120000 ms # Initial PgSQL_Monitor_ssl_connections_OK: 33 # After PgSQL_Monitor_ssl_connections_OK: 36 <- +3 in 5 s ok 32 - Monitor SSL counter increased with use_ssl=1 and no per-server row # With TLSv1 per-server pin, ssl OK before wait: 39 # With TLSv1 per-server pin, ssl OK after wait: 39 (delta=0) ok 33 - Monitor per-server: SSL OK counter does NOT advance when per-server row pins ssl_protocol_version_range to TLSv1 # After cleanup, ssl OK recovered from 41 to 44 <- +3 in 5 s ok 34 - Monitor per-server: SSL OK counter resumes advancing after removing the per-server row All three monitor-SSL subtests now exercise the real code path (SSL handshake happening, counter incrementing, per-server pin blocking SSL as designed) instead of observing a no-op. ## Side effect on pgsql-ssl_keylog-t subtest 7 Subtest 7 of pgsql-ssl_keylog-t was marked as SKIP in PR #5612 because it tripped on the same "pgsql monitor isn't making SSL connections" symptom. With this fix merged, the skip's runtime condition (`lines_before_monitor == lines_after_monitor`) will evaluate to false once the monitor is actually producing SSL handshakes, and the test will fall into the `ok(...)` branch automatically. No separate change to that test is needed — the skip was defensive and is dead code after this fix. ## Why mysql monitor is not touched here This fix is scoped to the pgsql monitor scheduler only. The mysql monitor is a different file (lib/MySQL_Monitor.cpp) with a different scheduling architecture (per-thread timers, not a centralized scheduler). If the same class of bug exists there, it would need a separate patch - out of scope for this PR.
1 month ago · 30be016be5
parent 04466b35a5
commit 30be016be5
2 changed files with 72 additions and 13 deletions
--- a/lib/PgSQL_Monitor.cpp
+++ b/lib/PgSQL_Monitor.cpp
@ -2533,14 +2533,71 @@ void* PgSQL_monitor_scheduler_thread() {

 			// Check variable version changes; refresh if needed
 			unsigned int glover = GloPTH->get_global_version();
+			bool vars_refreshed = false;
 			if (PgSQL_Thread__variables_version < glover) {
 				PgSQL_Thread__variables_version = glover;
 				pgsql_thread->refresh_variables();
+				vars_refreshed = true;
 			}

 			// Fetch config for next task scheduling
 			tasks_conf_t tasks_conf { fetch_updated_conf(GloPgMon, PgHGM) };

+			// When runtime variables were just refreshed (i.e. someone did
+			// SET pgsql-monitor_*_interval=<smaller_value>; LOAD PGSQL
+			// VARIABLES TO RUNTIME;), the newly-read interval values may be
+			// smaller than the ones we used the last time we recomputed
+			// next_intvs. Without the clamp below, each next_<type>_at still
+			// points at the cycle that was scheduled under the OLD (larger)
+			// interval, so a LOAD doesn't visibly take effect until the old
+			// cycle elapses — which can be up to ~2 minutes for connect
+			// checks under the default pgsql-monitor_connect_interval=120000.
+			//
+			// Example of the surprise this avoids:
+			//
+			//   T=0    proxysql starts with monitor_connect_interval=120000,
+			//          next_connect_at is scheduled for T=120000ms.
+			//   T=5    user runs SET pgsql-monitor_connect_interval=2000;
+			//          LOAD PGSQL VARIABLES TO RUNTIME; expecting the next
+			//          connect cycle within ~2 seconds.
+			//   T=7    without this clamp, next_connect_at is still 120000
+			//          — no new connect fires for another 113 seconds.
+			//
+			// The fix: after every successful refresh, shrink any
+			// next_<type>_at that's now farther in the future than
+			// (now + new_interval). We never push next_<type>_at further
+			// into the future — growing the interval should not delay an
+			// already-imminent check — so the direction of the clamp is
+			// one-way (min()). Types whose interval is 0 (disabled) are
+			// skipped and handled by the existing compute_next_intvs()
+			// which sets them to ULONG_MAX.
+			if (vars_refreshed) {
+				if (tasks_conf.ping.params.interval > 0) {
+					const uint64_t clamped = cur_intv_start + tasks_conf.ping.params.interval;
+					if (next_intvs.next_ping_at > clamped) {
+						next_intvs.next_ping_at = clamped;
+					}
+				}
+				if (tasks_conf.connect.params.interval > 0) {
+					const uint64_t clamped = cur_intv_start + tasks_conf.connect.params.interval;
+					if (next_intvs.next_connect_at > clamped) {
+						next_intvs.next_connect_at = clamped;
+					}
+				}
+				if (tasks_conf.readonly.params.interval > 0) {
+					const uint64_t clamped = cur_intv_start + tasks_conf.readonly.params.interval;
+					if (next_intvs.next_readonly_at > clamped) {
+						next_intvs.next_readonly_at = clamped;
+					}
+				}
+				if (tasks_conf.repl_lag.params.interval > 0) {
+					const uint64_t clamped = cur_intv_start + tasks_conf.repl_lag.params.interval;
+					if (next_intvs.next_repl_lag_at > clamped) {
+						next_intvs.next_repl_lag_at = clamped;
+					}
+				}
+			}
+
 			// Perform table maintenance
 			maint_mon_tables(tasks_conf, next_intvs, cur_intv_start);

--- a/test/tap/tests/pgsql-servers_ssl_params-t.cpp
+++ b/test/tap/tests/pgsql-servers_ssl_params-t.cpp
@ -648,16 +648,24 @@ int main(int argc, char** argv) {
 	// Configure monitor: use 'postgres' user which is accepted by the
 	// backend, set a tight connect_interval so cycles run within the test
 	// wait window. Restore original values afterwards.
+	//
+	// We deliberately do NOT change pgsql-monitor_username /
+	// pgsql-monitor_password. An earlier version of this test set them
+	// to 'postgres'/'postgres' on the assumption that the backend had a
+	// postgres user with password 'postgres', but the actual CI infra
+	// (docker-pgsql16-single) randomizes POSTGRES_PASSWORD per container
+	// startup (e.g. "05e792e51d"), so the hardcoded credentials never
+	// authenticated and every monitor connect failed with
+	//   FATAL: password authentication failed for user "postgres"
+	// The default monitor/monitor user works against the infra's
+	// pg_hba.conf and is what the initial (pre-test) connect check
+	// already uses successfully.
 	long original_connect_interval = getConnectInterval(a);
 	std::string original_monitor_username = exec_scalar(a,
 		"SELECT Variable_Value FROM global_variables WHERE Variable_Name='pgsql-monitor_username'");
-	std::string original_monitor_password = exec_scalar(a,
-		"SELECT Variable_Value FROM global_variables WHERE Variable_Name='pgsql-monitor_password'");
 	diag("Original monitor: user=%s interval=%ld ms",
 		original_monitor_username.c_str(), original_connect_interval);

-	exec_ok(a, "SET pgsql-monitor_username='postgres'");
-	exec_ok(a, "SET pgsql-monitor_password='postgres'");
 	setConnectInterval(a, 2000);
 	exec_ok(a, "UPDATE pgsql_servers SET use_ssl=1");
 	exec_ok(a, "LOAD PGSQL SERVERS TO RUNTIME");
@ -666,15 +674,9 @@ int main(int argc, char** argv) {
 	test_monitor_ssl_with_per_server_params(a);
 	test_monitor_uses_per_server_row(a);

-	// Restore original monitor settings
-	{
-		std::stringstream q;
-		q << "SET pgsql-monitor_username='" << original_monitor_username << "'";
-		exec_ok(a, q.str().c_str());
-		q.str("");
-		q << "SET pgsql-monitor_password='" << original_monitor_password << "'";
-		exec_ok(a, q.str().c_str());
-	}
+	// Restore original connect interval. Monitor username/password are
+	// no longer touched by this test - see comment above - so there's
+	// nothing to restore there.
 	setConnectInterval(a, (int)original_connect_interval);

 	// Part 3: Cluster query support