fix(pgsql-monitor): clamp next_*_at on LOAD PGSQL VARIABLES TO RUNTIME

Closes the pgsql-servers_ssl_params-t portion of sysown/proxysql#5610.
This is the final of four tests from the flake tracking issue.

## Problem 1 — the scheduler caching bug (lib/PgSQL_Monitor.cpp)

`pgsql-servers_ssl_params-t` subtests 32 ("Monitor SSL counter
increased with use_ssl=1 and no per-server row") and 34 ("Monitor
per-server: SSL OK counter resumes advancing after removing the
per-server row") were failing deterministically on v3.0. Both expected
`PgSQL_Monitor_ssl_connections_OK` to increase within a few seconds of:

  SET pgsql-monitor_connect_interval=2000;
  LOAD PGSQL VARIABLES TO RUNTIME;
  UPDATE pgsql_servers SET use_ssl=1;
  LOAD PGSQL SERVERS TO RUNTIME;

but the counter stayed at 0. Root cause is state caching in the pgsql
monitor scheduler loop (`PgSQL_monitor_scheduler_thread()`):

  T=0       proxysql starts with default pgsql-monitor_connect_interval
            =120000. The first scheduler tick schedules an initial
            connect check; compute_next_intvs() sets next_connect_at
            = T + 120000 ms = T + 120 seconds.
  T+<30s>   test does SET connect_interval=2000 + LOAD PGSQL VARIABLES
            TO RUNTIME.  fetch_updated_conf() starts returning the new
            2000 value, but next_connect_at still points at T+120000
            because compute_next_intvs() only updates next_<type>_at
            when the corresponding task type has fired.
  T+<35s>   test reads ssl_connections_OK at the end of its 5-second
            wait.  Counter still 0 because the next scheduled connect
            check is ~85 seconds in the future.

The scheduler is working correctly; what's missing is a bridge between
"runtime variables were just refreshed" and "next_<type>_at should
reflect the refreshed (shorter) intervals".

Fix: in the scheduler loop, track whether the variable version bumped
this iteration, and if so, clamp each next_<type>_at down to
cur_intv_start + new_interval whenever the refreshed interval would
schedule the next check sooner than the currently-cached value. The
clamp is one-way (we never push next_<type>_at FURTHER into the
future, because growing the interval should not delay an already-
imminent check), idempotent, and safe against interval=0 (disabled)
which is left to compute_next_intvs() to set to ULONG_MAX.

Applied to all four monitor task types: ping, connect, readonly,
repl_lag. Same class of bug affected all of them; fixing only connect
would leave analogous latent issues for any test that changes the
other intervals at runtime.

Verified with the raw experiment before committing: configure
monitor/monitor + interval=2000 + use_ssl=1 + LOAD, observe counter
ticking at the 2-second cadence within 1-2 seconds of LOAD. Before
fix: counter stuck at 0 for ~2 minutes (until the cached 120 s
interval naturally elapses).

## Problem 2 — test's hardcoded wrong credentials (pgsql-servers_ssl_params-t.cpp)

With the scheduler fix alone, the test was still failing because its
main() was doing:

  SET pgsql-monitor_username='postgres';
  SET pgsql-monitor_password='postgres';

on the assumption that the backend had a postgres user with password
"postgres". But the actual test infra (test/infra/docker-pgsql16-single)
RANDOMIZES `POSTGRES_PASSWORD` on every container startup — e.g.
`POSTGRES_PASSWORD=05e792e51d`. Hardcoded 'postgres' never matched,
so every monitor connect failed with:

  FATAL: password authentication failed for user "postgres"

which increments connect_check_ERR instead of ssl_connections_OK.
After the scheduler fix, these auth failures fired every 2 seconds
instead of every 120 seconds — but they were still failures, so the
counter never advanced.

Fix: remove the username/password switch entirely. The default
monitor/monitor user is already configured in the infra's pg_hba.conf
and authenticates successfully (I verified this manually via
`docker exec psql 'host=... user=monitor password=monitor sslmode=require'`
from both inside and outside the proxysql container). The
corresponding "restore original values" block is also removed since
there's nothing to restore.

## Local verification

3 consecutive runs of the full pgsql-servers_ssl_params-t test in
legacy-g4 infra with the patched proxysql + patched test binary:

  attempt 1: PASS
  attempt 2: PASS
  attempt 3: PASS
  === pgsql-servers_ssl_params-t: 3/3 pass ===

Subtest-level confirmation from the final attempt's TAP log:

  # Original monitor: user=monitor interval=120000 ms
  # Initial PgSQL_Monitor_ssl_connections_OK: 33
  # After PgSQL_Monitor_ssl_connections_OK: 36      <- +3 in 5 s
  ok 32 - Monitor SSL counter increased with use_ssl=1 and no per-server row
  # With TLSv1 per-server pin, ssl OK before wait: 39
  # With TLSv1 per-server pin, ssl OK after wait:  39 (delta=0)
  ok 33 - Monitor per-server: SSL OK counter does NOT advance when
          per-server row pins ssl_protocol_version_range to TLSv1
  # After cleanup, ssl OK recovered from 41 to 44    <- +3 in 5 s
  ok 34 - Monitor per-server: SSL OK counter resumes advancing after
          removing the per-server row

All three monitor-SSL subtests now exercise the real code path (SSL
handshake happening, counter incrementing, per-server pin blocking
SSL as designed) instead of observing a no-op.

## Side effect on pgsql-ssl_keylog-t subtest 7

Subtest 7 of pgsql-ssl_keylog-t was marked as SKIP in PR #5612 because
it tripped on the same "pgsql monitor isn't making SSL connections"
symptom. With this fix merged, the skip's runtime condition
(`lines_before_monitor == lines_after_monitor`) will evaluate to false
once the monitor is actually producing SSL handshakes, and the test
will fall into the `ok(...)` branch automatically. No separate change
to that test is needed — the skip was defensive and is dead code
after this fix.

## Why mysql monitor is not touched here

This fix is scoped to the pgsql monitor scheduler only. The mysql
monitor is a different file (lib/MySQL_Monitor.cpp) with a different
scheduling architecture (per-thread timers, not a centralized
scheduler). If the same class of bug exists there, it would need a
separate patch - out of scope for this PR.
v3.0-pgsql-monitor-reschedule-on-interval-change
Rene Cannao 1 month ago
parent 04466b35a5
commit 30be016be5

@ -2533,14 +2533,71 @@ void* PgSQL_monitor_scheduler_thread() {
// Check variable version changes; refresh if needed
unsigned int glover = GloPTH->get_global_version();
bool vars_refreshed = false;
if (PgSQL_Thread__variables_version < glover) {
PgSQL_Thread__variables_version = glover;
pgsql_thread->refresh_variables();
vars_refreshed = true;
}
// Fetch config for next task scheduling
tasks_conf_t tasks_conf { fetch_updated_conf(GloPgMon, PgHGM) };
// When runtime variables were just refreshed (i.e. someone did
// SET pgsql-monitor_*_interval=<smaller_value>; LOAD PGSQL
// VARIABLES TO RUNTIME;), the newly-read interval values may be
// smaller than the ones we used the last time we recomputed
// next_intvs. Without the clamp below, each next_<type>_at still
// points at the cycle that was scheduled under the OLD (larger)
// interval, so a LOAD doesn't visibly take effect until the old
// cycle elapses — which can be up to ~2 minutes for connect
// checks under the default pgsql-monitor_connect_interval=120000.
//
// Example of the surprise this avoids:
//
// T=0 proxysql starts with monitor_connect_interval=120000,
// next_connect_at is scheduled for T=120000ms.
// T=5 user runs SET pgsql-monitor_connect_interval=2000;
// LOAD PGSQL VARIABLES TO RUNTIME; expecting the next
// connect cycle within ~2 seconds.
// T=7 without this clamp, next_connect_at is still 120000
// — no new connect fires for another 113 seconds.
//
// The fix: after every successful refresh, shrink any
// next_<type>_at that's now farther in the future than
// (now + new_interval). We never push next_<type>_at further
// into the future — growing the interval should not delay an
// already-imminent check — so the direction of the clamp is
// one-way (min()). Types whose interval is 0 (disabled) are
// skipped and handled by the existing compute_next_intvs()
// which sets them to ULONG_MAX.
if (vars_refreshed) {
if (tasks_conf.ping.params.interval > 0) {
const uint64_t clamped = cur_intv_start + tasks_conf.ping.params.interval;
if (next_intvs.next_ping_at > clamped) {
next_intvs.next_ping_at = clamped;
}
}
if (tasks_conf.connect.params.interval > 0) {
const uint64_t clamped = cur_intv_start + tasks_conf.connect.params.interval;
if (next_intvs.next_connect_at > clamped) {
next_intvs.next_connect_at = clamped;
}
}
if (tasks_conf.readonly.params.interval > 0) {
const uint64_t clamped = cur_intv_start + tasks_conf.readonly.params.interval;
if (next_intvs.next_readonly_at > clamped) {
next_intvs.next_readonly_at = clamped;
}
}
if (tasks_conf.repl_lag.params.interval > 0) {
const uint64_t clamped = cur_intv_start + tasks_conf.repl_lag.params.interval;
if (next_intvs.next_repl_lag_at > clamped) {
next_intvs.next_repl_lag_at = clamped;
}
}
}
// Perform table maintenance
maint_mon_tables(tasks_conf, next_intvs, cur_intv_start);

@ -648,16 +648,24 @@ int main(int argc, char** argv) {
// Configure monitor: use 'postgres' user which is accepted by the
// backend, set a tight connect_interval so cycles run within the test
// wait window. Restore original values afterwards.
//
// We deliberately do NOT change pgsql-monitor_username /
// pgsql-monitor_password. An earlier version of this test set them
// to 'postgres'/'postgres' on the assumption that the backend had a
// postgres user with password 'postgres', but the actual CI infra
// (docker-pgsql16-single) randomizes POSTGRES_PASSWORD per container
// startup (e.g. "05e792e51d"), so the hardcoded credentials never
// authenticated and every monitor connect failed with
// FATAL: password authentication failed for user "postgres"
// The default monitor/monitor user works against the infra's
// pg_hba.conf and is what the initial (pre-test) connect check
// already uses successfully.
long original_connect_interval = getConnectInterval(a);
std::string original_monitor_username = exec_scalar(a,
"SELECT Variable_Value FROM global_variables WHERE Variable_Name='pgsql-monitor_username'");
std::string original_monitor_password = exec_scalar(a,
"SELECT Variable_Value FROM global_variables WHERE Variable_Name='pgsql-monitor_password'");
diag("Original monitor: user=%s interval=%ld ms",
original_monitor_username.c_str(), original_connect_interval);
exec_ok(a, "SET pgsql-monitor_username='postgres'");
exec_ok(a, "SET pgsql-monitor_password='postgres'");
setConnectInterval(a, 2000);
exec_ok(a, "UPDATE pgsql_servers SET use_ssl=1");
exec_ok(a, "LOAD PGSQL SERVERS TO RUNTIME");
@ -666,15 +674,9 @@ int main(int argc, char** argv) {
test_monitor_ssl_with_per_server_params(a);
test_monitor_uses_per_server_row(a);
// Restore original monitor settings
{
std::stringstream q;
q << "SET pgsql-monitor_username='" << original_monitor_username << "'";
exec_ok(a, q.str().c_str());
q.str("");
q << "SET pgsql-monitor_password='" << original_monitor_password << "'";
exec_ok(a, q.str().c_str());
}
// Restore original connect interval. Monitor username/password are
// no longer touched by this test - see comment above - so there's
// nothing to restore there.
setConnectInterval(a, (int)original_connect_interval);
// Part 3: Cluster query support

Loading…
Cancel
Save