tap/utils: add wait_for_log_match() to fix log-scrape races

check_logs_for_command() in pgsql-copy_from_test-t did a single-shot call to get_matching_lines() right after the SQL that was expected to cause the "Switching to Fast Forward" / "Switching back to Normal" log line. Because ProxySQL writes that line asynchronously with respect to the query's reply, the scan occasionally reached EOF before the producer had flushed the line, turning the assertion into a flake. The failure that originally surfaced this was legacy-g1, test 85/327 of pgsql-copy_from_test-t: 1 of 236 assertions (not ok 126 - "Switching back to Normal mode"). It never reproduced under three targeted reruns, which is consistent with a timing race rather than a logic bug. Two changes: 1. Add wait_for_log_match() in test/tap/tap/utils.{h,cpp}: retry get_matching_lines() up to a timeout, clearing both eofbit and failbit between iterations so successive getline() calls pick up lines appended since the last scan. (Clearing only failbit, as get_matching_lines does internally, is not enough — once a stream has hit EOF its sticky eofbit keeps getline() from reading new bytes.) Defaults: 2000 ms total budget, 100 ms between polls — matching the idiom already used in admin_set_credentials_logging-t. 2. Rewrite check_logs_for_command() in pgsql-copy_from_test-t.cpp to call the new helper. Positive assertions still return as soon as the line appears (no added latency in the common case). Negative assertions (check_logs_for_command(...) == false) now wait up to the full timeout to confirm absence, closing the symmetrical race where a line would have arrived a few hundred ms later than the single-shot scan. Verified: 3/3 reruns of pgsql-copy_from_test-t in legacy-g1 post-fix all pass 236/236, runtime ~40 s per run (single-shot version was slightly faster because it short-circuited on negative assertions without waiting).
1 month ago · 2b1a4262b1
parent 6322d58376
commit 2b1a4262b1
3 changed files with 71 additions and 2 deletions
--- a/test/tap/tap/utils.cpp
+++ b/test/tap/tap/utils.cpp
@ -1693,6 +1693,36 @@ pair<size_t,vector<line_match_t>> get_matching_lines(
 }


+bool wait_for_log_match(
+	fstream& f_stream, const string& s_regex, uint32_t timeout_ms, uint32_t poll_interval_ms
+) {
+	// Guard against a zero interval, which would spin.
+	if (poll_interval_ms == 0) { poll_interval_ms = 10; }
+
+	const useconds_t poll_us = poll_interval_ms * 1000;
+	uint32_t elapsed_ms = 0;
+
+	while (true) {
+		// Clear both eofbit and failbit so the subsequent getline() in
+		// get_matching_lines() can read any bytes appended since the last scan.
+		// Without this, once the stream hits EOF on the first iteration, every
+		// retry short-circuits and returns no matches.
+		f_stream.clear(f_stream.rdstate() & ~std::ios_base::eofbit & ~std::ios_base::failbit);
+
+		const auto& [_, matches] = get_matching_lines(f_stream, s_regex);
+		if (!matches.empty()) {
+			return true;
+		}
+		if (elapsed_ms >= timeout_ms) {
+			return false;
+		}
+
+		usleep(poll_us);
+		elapsed_ms += poll_interval_ms;
+	}
+}
+
+
 std::pair<size_t,std::vector<line_match_t>> get_matching_lines_from_filename(
 	const std::string& filename, const std::string& s_regex, bool get_matches, size_t max_lines
 ) {
--- a/test/tap/tap/utils.h
+++ b/test/tap/tap/utils.h
@ -790,6 +790,38 @@ std::pair<size_t,std::vector<line_match_t>> get_matching_lines(
 );


+/**
+ * @brief Poll a log stream for a regex match, retrying until a match appears or the
+ *        timeout elapses. Handles the EOF sticky state that a single-shot
+ *        `get_matching_lines()` leaves behind after exhausting the stream.
+ *
+ * Use this in place of single-shot `get_matching_lines()` calls that check whether
+ * a log line was emitted as a consequence of an earlier action: ProxySQL's writes
+ * to its log are asynchronous with respect to the SQL that triggers them, so a
+ * single scan immediately after the SQL completes is racy. Retrying gives the
+ * producer time to flush; when the match is already present the first iteration
+ * returns immediately.
+ *
+ * @param f_stream Log stream (opened with `open_file_and_seek_end` or similar). On
+ *                 match, the stream position is advanced past the last match (same
+ *                 semantics as `get_matching_lines`); on no-match, the stream is
+ *                 rewound to its position at entry.
+ * @param regex Pattern to search for.
+ * @param timeout_ms Maximum wall time to wait for the first match. Defaults to
+ *                   2000 ms (20 attempts of 100 ms), which is the same budget used
+ *                   by admin_set_credentials_logging-t.
+ * @param poll_interval_ms Delay between scans when no match has appeared yet.
+ *                         Defaults to 100 ms.
+ * @return true if at least one matching line was observed within the budget.
+ */
+bool wait_for_log_match(
+	std::fstream& f_stream,
+	const std::string& regex,
+	uint32_t timeout_ms = 2000,
+	uint32_t poll_interval_ms = 100
+);
+
+
 /**
 * @brief Scan last N lines from a file and find lines matching a regex pattern.
 *
--- a/test/tap/tests/pgsql-copy_from_test-t.cpp
+++ b/test/tap/tests/pgsql-copy_from_test-t.cpp
@ -286,8 +286,15 @@ int is_string_in_result(PGresult* result, const char* target_str) {
 }

 bool check_logs_for_command(std::fstream& f_proxysql_log, const std::string& command_regex) {
-    const auto& [_, cmd_lines] { get_matching_lines(f_proxysql_log, command_regex) };
-	return cmd_lines.empty() ? false : true;
+    // ProxySQL's log writes for 'Switching {to Fast Forward,back to Normal} mode'
+    // happen asynchronously relative to the SQL that triggers them, so a single
+    // scan right after the SQL completes is racy. wait_for_log_match() retries
+    // with a short timeout, clearing EOF between scans; on hit it returns
+    // immediately, so assertions against a line that is already present pay no
+    // extra latency. Negative assertions (check_logs_for_command(...) == false)
+    // also benefit: they now wait up to the timeout to confirm absence, avoiding
+    // a false pass when the line would have arrived a few hundred ms later.
+    return wait_for_log_match(f_proxysql_log, command_regex);
 }

 bool setupTestTable(PGconn* conn) {