From 30be016be5f3af403feca2f624f19f612bd62890 Mon Sep 17 00:00:00 2001 From: Rene Cannao Date: Sat, 11 Apr 2026 16:02:30 +0000 Subject: [PATCH 1/2] fix(pgsql-monitor): clamp next_*_at on LOAD PGSQL VARIABLES TO RUNTIME MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Closes the pgsql-servers_ssl_params-t portion of sysown/proxysql#5610. This is the final of four tests from the flake tracking issue. ## Problem 1 — the scheduler caching bug (lib/PgSQL_Monitor.cpp) `pgsql-servers_ssl_params-t` subtests 32 ("Monitor SSL counter increased with use_ssl=1 and no per-server row") and 34 ("Monitor per-server: SSL OK counter resumes advancing after removing the per-server row") were failing deterministically on v3.0. Both expected `PgSQL_Monitor_ssl_connections_OK` to increase within a few seconds of: SET pgsql-monitor_connect_interval=2000; LOAD PGSQL VARIABLES TO RUNTIME; UPDATE pgsql_servers SET use_ssl=1; LOAD PGSQL SERVERS TO RUNTIME; but the counter stayed at 0. Root cause is state caching in the pgsql monitor scheduler loop (`PgSQL_monitor_scheduler_thread()`): T=0 proxysql starts with default pgsql-monitor_connect_interval =120000. The first scheduler tick schedules an initial connect check; compute_next_intvs() sets next_connect_at = T + 120000 ms = T + 120 seconds. T+<30s> test does SET connect_interval=2000 + LOAD PGSQL VARIABLES TO RUNTIME. fetch_updated_conf() starts returning the new 2000 value, but next_connect_at still points at T+120000 because compute_next_intvs() only updates next__at when the corresponding task type has fired. T+<35s> test reads ssl_connections_OK at the end of its 5-second wait. Counter still 0 because the next scheduled connect check is ~85 seconds in the future. The scheduler is working correctly; what's missing is a bridge between "runtime variables were just refreshed" and "next__at should reflect the refreshed (shorter) intervals". Fix: in the scheduler loop, track whether the variable version bumped this iteration, and if so, clamp each next__at down to cur_intv_start + new_interval whenever the refreshed interval would schedule the next check sooner than the currently-cached value. The clamp is one-way (we never push next__at FURTHER into the future, because growing the interval should not delay an already- imminent check), idempotent, and safe against interval=0 (disabled) which is left to compute_next_intvs() to set to ULONG_MAX. Applied to all four monitor task types: ping, connect, readonly, repl_lag. Same class of bug affected all of them; fixing only connect would leave analogous latent issues for any test that changes the other intervals at runtime. Verified with the raw experiment before committing: configure monitor/monitor + interval=2000 + use_ssl=1 + LOAD, observe counter ticking at the 2-second cadence within 1-2 seconds of LOAD. Before fix: counter stuck at 0 for ~2 minutes (until the cached 120 s interval naturally elapses). ## Problem 2 — test's hardcoded wrong credentials (pgsql-servers_ssl_params-t.cpp) With the scheduler fix alone, the test was still failing because its main() was doing: SET pgsql-monitor_username='postgres'; SET pgsql-monitor_password='postgres'; on the assumption that the backend had a postgres user with password "postgres". But the actual test infra (test/infra/docker-pgsql16-single) RANDOMIZES `POSTGRES_PASSWORD` on every container startup — e.g. `POSTGRES_PASSWORD=05e792e51d`. Hardcoded 'postgres' never matched, so every monitor connect failed with: FATAL: password authentication failed for user "postgres" which increments connect_check_ERR instead of ssl_connections_OK. After the scheduler fix, these auth failures fired every 2 seconds instead of every 120 seconds — but they were still failures, so the counter never advanced. Fix: remove the username/password switch entirely. The default monitor/monitor user is already configured in the infra's pg_hba.conf and authenticates successfully (I verified this manually via `docker exec psql 'host=... user=monitor password=monitor sslmode=require'` from both inside and outside the proxysql container). The corresponding "restore original values" block is also removed since there's nothing to restore. ## Local verification 3 consecutive runs of the full pgsql-servers_ssl_params-t test in legacy-g4 infra with the patched proxysql + patched test binary: attempt 1: PASS attempt 2: PASS attempt 3: PASS === pgsql-servers_ssl_params-t: 3/3 pass === Subtest-level confirmation from the final attempt's TAP log: # Original monitor: user=monitor interval=120000 ms # Initial PgSQL_Monitor_ssl_connections_OK: 33 # After PgSQL_Monitor_ssl_connections_OK: 36 <- +3 in 5 s ok 32 - Monitor SSL counter increased with use_ssl=1 and no per-server row # With TLSv1 per-server pin, ssl OK before wait: 39 # With TLSv1 per-server pin, ssl OK after wait: 39 (delta=0) ok 33 - Monitor per-server: SSL OK counter does NOT advance when per-server row pins ssl_protocol_version_range to TLSv1 # After cleanup, ssl OK recovered from 41 to 44 <- +3 in 5 s ok 34 - Monitor per-server: SSL OK counter resumes advancing after removing the per-server row All three monitor-SSL subtests now exercise the real code path (SSL handshake happening, counter incrementing, per-server pin blocking SSL as designed) instead of observing a no-op. ## Side effect on pgsql-ssl_keylog-t subtest 7 Subtest 7 of pgsql-ssl_keylog-t was marked as SKIP in PR #5612 because it tripped on the same "pgsql monitor isn't making SSL connections" symptom. With this fix merged, the skip's runtime condition (`lines_before_monitor == lines_after_monitor`) will evaluate to false once the monitor is actually producing SSL handshakes, and the test will fall into the `ok(...)` branch automatically. No separate change to that test is needed — the skip was defensive and is dead code after this fix. ## Why mysql monitor is not touched here This fix is scoped to the pgsql monitor scheduler only. The mysql monitor is a different file (lib/MySQL_Monitor.cpp) with a different scheduling architecture (per-thread timers, not a centralized scheduler). If the same class of bug exists there, it would need a separate patch - out of scope for this PR. --- lib/PgSQL_Monitor.cpp | 57 +++++++++++++++++++ test/tap/tests/pgsql-servers_ssl_params-t.cpp | 28 ++++----- 2 files changed, 72 insertions(+), 13 deletions(-) diff --git a/lib/PgSQL_Monitor.cpp b/lib/PgSQL_Monitor.cpp index 30fab6727..538d6d993 100644 --- a/lib/PgSQL_Monitor.cpp +++ b/lib/PgSQL_Monitor.cpp @@ -2533,14 +2533,71 @@ void* PgSQL_monitor_scheduler_thread() { // Check variable version changes; refresh if needed unsigned int glover = GloPTH->get_global_version(); + bool vars_refreshed = false; if (PgSQL_Thread__variables_version < glover) { PgSQL_Thread__variables_version = glover; pgsql_thread->refresh_variables(); + vars_refreshed = true; } // Fetch config for next task scheduling tasks_conf_t tasks_conf { fetch_updated_conf(GloPgMon, PgHGM) }; + // When runtime variables were just refreshed (i.e. someone did + // SET pgsql-monitor_*_interval=; LOAD PGSQL + // VARIABLES TO RUNTIME;), the newly-read interval values may be + // smaller than the ones we used the last time we recomputed + // next_intvs. Without the clamp below, each next__at still + // points at the cycle that was scheduled under the OLD (larger) + // interval, so a LOAD doesn't visibly take effect until the old + // cycle elapses — which can be up to ~2 minutes for connect + // checks under the default pgsql-monitor_connect_interval=120000. + // + // Example of the surprise this avoids: + // + // T=0 proxysql starts with monitor_connect_interval=120000, + // next_connect_at is scheduled for T=120000ms. + // T=5 user runs SET pgsql-monitor_connect_interval=2000; + // LOAD PGSQL VARIABLES TO RUNTIME; expecting the next + // connect cycle within ~2 seconds. + // T=7 without this clamp, next_connect_at is still 120000 + // — no new connect fires for another 113 seconds. + // + // The fix: after every successful refresh, shrink any + // next__at that's now farther in the future than + // (now + new_interval). We never push next__at further + // into the future — growing the interval should not delay an + // already-imminent check — so the direction of the clamp is + // one-way (min()). Types whose interval is 0 (disabled) are + // skipped and handled by the existing compute_next_intvs() + // which sets them to ULONG_MAX. + if (vars_refreshed) { + if (tasks_conf.ping.params.interval > 0) { + const uint64_t clamped = cur_intv_start + tasks_conf.ping.params.interval; + if (next_intvs.next_ping_at > clamped) { + next_intvs.next_ping_at = clamped; + } + } + if (tasks_conf.connect.params.interval > 0) { + const uint64_t clamped = cur_intv_start + tasks_conf.connect.params.interval; + if (next_intvs.next_connect_at > clamped) { + next_intvs.next_connect_at = clamped; + } + } + if (tasks_conf.readonly.params.interval > 0) { + const uint64_t clamped = cur_intv_start + tasks_conf.readonly.params.interval; + if (next_intvs.next_readonly_at > clamped) { + next_intvs.next_readonly_at = clamped; + } + } + if (tasks_conf.repl_lag.params.interval > 0) { + const uint64_t clamped = cur_intv_start + tasks_conf.repl_lag.params.interval; + if (next_intvs.next_repl_lag_at > clamped) { + next_intvs.next_repl_lag_at = clamped; + } + } + } + // Perform table maintenance maint_mon_tables(tasks_conf, next_intvs, cur_intv_start); diff --git a/test/tap/tests/pgsql-servers_ssl_params-t.cpp b/test/tap/tests/pgsql-servers_ssl_params-t.cpp index 72705a0d7..68ccd695c 100644 --- a/test/tap/tests/pgsql-servers_ssl_params-t.cpp +++ b/test/tap/tests/pgsql-servers_ssl_params-t.cpp @@ -648,16 +648,24 @@ int main(int argc, char** argv) { // Configure monitor: use 'postgres' user which is accepted by the // backend, set a tight connect_interval so cycles run within the test // wait window. Restore original values afterwards. + // + // We deliberately do NOT change pgsql-monitor_username / + // pgsql-monitor_password. An earlier version of this test set them + // to 'postgres'/'postgres' on the assumption that the backend had a + // postgres user with password 'postgres', but the actual CI infra + // (docker-pgsql16-single) randomizes POSTGRES_PASSWORD per container + // startup (e.g. "05e792e51d"), so the hardcoded credentials never + // authenticated and every monitor connect failed with + // FATAL: password authentication failed for user "postgres" + // The default monitor/monitor user works against the infra's + // pg_hba.conf and is what the initial (pre-test) connect check + // already uses successfully. long original_connect_interval = getConnectInterval(a); std::string original_monitor_username = exec_scalar(a, "SELECT Variable_Value FROM global_variables WHERE Variable_Name='pgsql-monitor_username'"); - std::string original_monitor_password = exec_scalar(a, - "SELECT Variable_Value FROM global_variables WHERE Variable_Name='pgsql-monitor_password'"); diag("Original monitor: user=%s interval=%ld ms", original_monitor_username.c_str(), original_connect_interval); - exec_ok(a, "SET pgsql-monitor_username='postgres'"); - exec_ok(a, "SET pgsql-monitor_password='postgres'"); setConnectInterval(a, 2000); exec_ok(a, "UPDATE pgsql_servers SET use_ssl=1"); exec_ok(a, "LOAD PGSQL SERVERS TO RUNTIME"); @@ -666,15 +674,9 @@ int main(int argc, char** argv) { test_monitor_ssl_with_per_server_params(a); test_monitor_uses_per_server_row(a); - // Restore original monitor settings - { - std::stringstream q; - q << "SET pgsql-monitor_username='" << original_monitor_username << "'"; - exec_ok(a, q.str().c_str()); - q.str(""); - q << "SET pgsql-monitor_password='" << original_monitor_password << "'"; - exec_ok(a, q.str().c_str()); - } + // Restore original connect interval. Monitor username/password are + // no longer touched by this test - see comment above - so there's + // nothing to restore there. setConnectInterval(a, (int)original_connect_interval); // Part 3: Cluster query support From 4bd2f95e5b501f71947e0d2489a4b5c11bcdd7a0 Mon Sep 17 00:00:00 2001 From: Rene Cannao Date: Sun, 12 Apr 2026 11:59:29 +0000 Subject: [PATCH 2/2] fix(pgsql-monitor): add proxy_debug logging and clean up test variable - Add proxy_debug(PROXY_DEBUG_MONITOR, 5) calls when clamping fires, so the fix is observable in debug logs. Follows existing pattern from nearby scheduling debug output. - Rename original_monitor_username to monitor_username and clarify it is for diagnostic logging only, since the variable is no longer used for restore after the credential change was removed. --- lib/PgSQL_Monitor.cpp | 12 ++++++++++++ test/tap/tests/pgsql-servers_ssl_params-t.cpp | 7 ++++--- 2 files changed, 16 insertions(+), 3 deletions(-) diff --git a/lib/PgSQL_Monitor.cpp b/lib/PgSQL_Monitor.cpp index 538d6d993..7f499d265 100644 --- a/lib/PgSQL_Monitor.cpp +++ b/lib/PgSQL_Monitor.cpp @@ -2575,24 +2575,36 @@ void* PgSQL_monitor_scheduler_thread() { if (tasks_conf.ping.params.interval > 0) { const uint64_t clamped = cur_intv_start + tasks_conf.ping.params.interval; if (next_intvs.next_ping_at > clamped) { + proxy_debug(PROXY_DEBUG_MONITOR, 5, + "Clamped next_ping_at old=%lu new=%lu interval=%d\n", + next_intvs.next_ping_at, clamped, tasks_conf.ping.params.interval); next_intvs.next_ping_at = clamped; } } if (tasks_conf.connect.params.interval > 0) { const uint64_t clamped = cur_intv_start + tasks_conf.connect.params.interval; if (next_intvs.next_connect_at > clamped) { + proxy_debug(PROXY_DEBUG_MONITOR, 5, + "Clamped next_connect_at old=%lu new=%lu interval=%d\n", + next_intvs.next_connect_at, clamped, tasks_conf.connect.params.interval); next_intvs.next_connect_at = clamped; } } if (tasks_conf.readonly.params.interval > 0) { const uint64_t clamped = cur_intv_start + tasks_conf.readonly.params.interval; if (next_intvs.next_readonly_at > clamped) { + proxy_debug(PROXY_DEBUG_MONITOR, 5, + "Clamped next_readonly_at old=%lu new=%lu interval=%d\n", + next_intvs.next_readonly_at, clamped, tasks_conf.readonly.params.interval); next_intvs.next_readonly_at = clamped; } } if (tasks_conf.repl_lag.params.interval > 0) { const uint64_t clamped = cur_intv_start + tasks_conf.repl_lag.params.interval; if (next_intvs.next_repl_lag_at > clamped) { + proxy_debug(PROXY_DEBUG_MONITOR, 5, + "Clamped next_repl_lag_at old=%lu new=%lu interval=%d\n", + next_intvs.next_repl_lag_at, clamped, tasks_conf.repl_lag.params.interval); next_intvs.next_repl_lag_at = clamped; } } diff --git a/test/tap/tests/pgsql-servers_ssl_params-t.cpp b/test/tap/tests/pgsql-servers_ssl_params-t.cpp index 68ccd695c..13f6627a6 100644 --- a/test/tap/tests/pgsql-servers_ssl_params-t.cpp +++ b/test/tap/tests/pgsql-servers_ssl_params-t.cpp @@ -661,10 +661,11 @@ int main(int argc, char** argv) { // pg_hba.conf and is what the initial (pre-test) connect check // already uses successfully. long original_connect_interval = getConnectInterval(a); - std::string original_monitor_username = exec_scalar(a, + // Read current monitor username for diagnostic logging only. + std::string monitor_username = exec_scalar(a, "SELECT Variable_Value FROM global_variables WHERE Variable_Name='pgsql-monitor_username'"); - diag("Original monitor: user=%s interval=%ld ms", - original_monitor_username.c_str(), original_connect_interval); + diag("Current monitor: user=%s interval=%ld ms", + monitor_username.c_str(), original_connect_interval); setConnectInterval(a, 2000); exec_ok(a, "UPDATE pgsql_servers SET use_ssl=1");