mirror of https://github.com/sysown/proxysql
Closes the pgsql-servers_ssl_params-t portion of sysown/proxysql#5610.
This is the final of four tests from the flake tracking issue.
## Problem 1 — the scheduler caching bug (lib/PgSQL_Monitor.cpp)
`pgsql-servers_ssl_params-t` subtests 32 ("Monitor SSL counter
increased with use_ssl=1 and no per-server row") and 34 ("Monitor
per-server: SSL OK counter resumes advancing after removing the
per-server row") were failing deterministically on v3.0. Both expected
`PgSQL_Monitor_ssl_connections_OK` to increase within a few seconds of:
SET pgsql-monitor_connect_interval=2000;
LOAD PGSQL VARIABLES TO RUNTIME;
UPDATE pgsql_servers SET use_ssl=1;
LOAD PGSQL SERVERS TO RUNTIME;
but the counter stayed at 0. Root cause is state caching in the pgsql
monitor scheduler loop (`PgSQL_monitor_scheduler_thread()`):
T=0 proxysql starts with default pgsql-monitor_connect_interval
=120000. The first scheduler tick schedules an initial
connect check; compute_next_intvs() sets next_connect_at
= T + 120000 ms = T + 120 seconds.
T+<30s> test does SET connect_interval=2000 + LOAD PGSQL VARIABLES
TO RUNTIME. fetch_updated_conf() starts returning the new
2000 value, but next_connect_at still points at T+120000
because compute_next_intvs() only updates next_<type>_at
when the corresponding task type has fired.
T+<35s> test reads ssl_connections_OK at the end of its 5-second
wait. Counter still 0 because the next scheduled connect
check is ~85 seconds in the future.
The scheduler is working correctly; what's missing is a bridge between
"runtime variables were just refreshed" and "next_<type>_at should
reflect the refreshed (shorter) intervals".
Fix: in the scheduler loop, track whether the variable version bumped
this iteration, and if so, clamp each next_<type>_at down to
cur_intv_start + new_interval whenever the refreshed interval would
schedule the next check sooner than the currently-cached value. The
clamp is one-way (we never push next_<type>_at FURTHER into the
future, because growing the interval should not delay an already-
imminent check), idempotent, and safe against interval=0 (disabled)
which is left to compute_next_intvs() to set to ULONG_MAX.
Applied to all four monitor task types: ping, connect, readonly,
repl_lag. Same class of bug affected all of them; fixing only connect
would leave analogous latent issues for any test that changes the
other intervals at runtime.
Verified with the raw experiment before committing: configure
monitor/monitor + interval=2000 + use_ssl=1 + LOAD, observe counter
ticking at the 2-second cadence within 1-2 seconds of LOAD. Before
fix: counter stuck at 0 for ~2 minutes (until the cached 120 s
interval naturally elapses).
## Problem 2 — test's hardcoded wrong credentials (pgsql-servers_ssl_params-t.cpp)
With the scheduler fix alone, the test was still failing because its
main() was doing:
SET pgsql-monitor_username='postgres';
SET pgsql-monitor_password='postgres';
on the assumption that the backend had a postgres user with password
"postgres". But the actual test infra (test/infra/docker-pgsql16-single)
RANDOMIZES `POSTGRES_PASSWORD` on every container startup — e.g.
`POSTGRES_PASSWORD=05e792e51d`. Hardcoded 'postgres' never matched,
so every monitor connect failed with:
FATAL: password authentication failed for user "postgres"
which increments connect_check_ERR instead of ssl_connections_OK.
After the scheduler fix, these auth failures fired every 2 seconds
instead of every 120 seconds — but they were still failures, so the
counter never advanced.
Fix: remove the username/password switch entirely. The default
monitor/monitor user is already configured in the infra's pg_hba.conf
and authenticates successfully (I verified this manually via
`docker exec psql 'host=... user=monitor password=monitor sslmode=require'`
from both inside and outside the proxysql container). The
corresponding "restore original values" block is also removed since
there's nothing to restore.
## Local verification
3 consecutive runs of the full pgsql-servers_ssl_params-t test in
legacy-g4 infra with the patched proxysql + patched test binary:
attempt 1: PASS
attempt 2: PASS
attempt 3: PASS
=== pgsql-servers_ssl_params-t: 3/3 pass ===
Subtest-level confirmation from the final attempt's TAP log:
# Original monitor: user=monitor interval=120000 ms
# Initial PgSQL_Monitor_ssl_connections_OK: 33
# After PgSQL_Monitor_ssl_connections_OK: 36 <- +3 in 5 s
ok 32 - Monitor SSL counter increased with use_ssl=1 and no per-server row
# With TLSv1 per-server pin, ssl OK before wait: 39
# With TLSv1 per-server pin, ssl OK after wait: 39 (delta=0)
ok 33 - Monitor per-server: SSL OK counter does NOT advance when
per-server row pins ssl_protocol_version_range to TLSv1
# After cleanup, ssl OK recovered from 41 to 44 <- +3 in 5 s
ok 34 - Monitor per-server: SSL OK counter resumes advancing after
removing the per-server row
All three monitor-SSL subtests now exercise the real code path (SSL
handshake happening, counter incrementing, per-server pin blocking
SSL as designed) instead of observing a no-op.
## Side effect on pgsql-ssl_keylog-t subtest 7
Subtest 7 of pgsql-ssl_keylog-t was marked as SKIP in PR #5612 because
it tripped on the same "pgsql monitor isn't making SSL connections"
symptom. With this fix merged, the skip's runtime condition
(`lines_before_monitor == lines_after_monitor`) will evaluate to false
once the monitor is actually producing SSL handshakes, and the test
will fall into the `ok(...)` branch automatically. No separate change
to that test is needed — the skip was defensive and is dead code
after this fix.
## Why mysql monitor is not touched here
This fix is scoped to the pgsql monitor scheduler only. The mysql
monitor is a different file (lib/MySQL_Monitor.cpp) with a different
scheduling architecture (per-thread timers, not a centralized
scheduler). If the same class of bug exists there, it would need a
separate patch - out of scope for this PR.
v3.0-pgsql-monitor-reschedule-on-interval-change
parent
04466b35a5
commit
30be016be5
Loading…
Reference in new issue