The first CI run of pgsql-retry_guard_in_txn_on_broken_backend-t (now
registered in legacy-g2) failed at "pg_terminate_backend signalled
successfully" because ProxySQL intercepts pg_backend_pid() and returns
its own thread_session_id (see PgSQL_Protocol.cpp:1398), not the real
backend PID.
The Postgres server log from the failing run made this unambiguous:
    [239] LOG: statement: BEGIN
    [239] LOG: statement: SELECT pg_sleep(3)
    [240] LOG: statement: SELECT pg_terminate_backend(22)
    [240] WARNING: PID 22 is not a PostgreSQL backend process
    [239] LOG: statement: ROLLBACK
The real backend is PID 239. The "22" came from ProxySQL's fake-PID
response, so the kill went nowhere, pg_sleep(3) ran to completion, and
the post-fix FATAL_ERROR assertion was never exercised. The regression
guard was therefore not actually guarding against the bug.
Replace the pg_backend_pid() probe with a pg_stat_activity lookup
issued from the direct superuser connection:
- embed a unique literal marker ("retry_guard_marker_<time>_<pid>")
  in the sleep query so the backend running it is trivially
  identifiable;
- do the find + terminate in one PQexecParams round-trip against
  pg_stat_activity, so the pid can't change between lookup and kill.
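
A minimal sketch of the two steps above, not the test's actual code.
The helper names (make_marker, build_sleep_query), the constant
kFindAndTerminateSql, and the exact SQL text are assumptions made for
illustration. The statement is meant to be sent through PQexecParams on
the direct superuser connection with the marker bound as $1:

```cpp
#include <string>

// Hypothetical helper: builds a marker of the form
// retry_guard_marker_<time>_<pid>, unique per test run.
std::string make_marker(long long unix_time, long pid) {
    return "retry_guard_marker_" + std::to_string(unix_time) + "_" +
           std::to_string(pid);
}

// Hypothetical helper: embeds the marker as a string literal in the sleep
// query sent through ProxySQL, so it shows up in pg_stat_activity.query.
// The marker contains only letters, digits and underscores, so it is safe
// to place between single quotes without escaping.
std::string build_sleep_query(const std::string& marker, int seconds) {
    return "SELECT pg_sleep(" + std::to_string(seconds) + "), '" + marker + "'";
}

// Assumed shape of the single-statement find + terminate. $1 is the marker.
// pg_backend_pid() is reliable here because this statement runs on the
// direct connection, not through ProxySQL; excluding it keeps the probing
// session from ever terminating itself.
const char* const kFindAndTerminateSql =
    "SELECT pid, pg_terminate_backend(pid) "
    "FROM pg_stat_activity "
    "WHERE strpos(query, $1) > 0 "
    "AND pid <> pg_backend_pid()";
```

Binding the marker as a parameter rather than inlining it means the
probing statement's own text in pg_stat_activity contains "$1", not the
marker, so it cannot match itself. strpos is used instead of LIKE so the
underscores in the marker are not treated as wildcards.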
plan() drops from 6 to 5 because we no longer have a separate "got a
pid" assertion. On a build that includes 68c76eb42 all 5 pass; on a
build without the fix the FATAL_ERROR assertion still fails as
before.