Tighten alias-lookup trust + add ambiguity gate + diagnostic log

Cin pre-review pass on the false-positive risk. Three tightenings: # 1. Bumped MB-search trust threshold from 0.6 → 0.85 `MusicBrainzService.lookup_artist_aliases` previously trusted any MB search match scoring ≥ 0.6 combined (name-similarity + MB relevance). For distinctive cross-script artists the user-reported case targets (Hiroyuki Sawano, Сергей Лазарев, etc.) real matches score ~1.0 — well above 0.85. The 0.6 floor was loose enough to let in moderate matches for ambiguous names, risking aliases for the wrong artist getting cached + applied. Bumped to 0.85. Tighter without rejecting any of the legit cross-script cases the PR is for. # 2. Ambiguity gate — skip when results within 0.1 of best When MB search returns multiple results all scoring high (within 0.1 of the best), the artist name is ambiguous — common name with multiple distinct artists ("John Smith" returning 10 different John Smiths). Pulling aliases for any one of them risks the wrong artist's data bridging incorrectly to a file's tag. Added explicit ambiguity detection: when 2+ results within 0.1, skip alias lookup entirely + cache empty. Matches Cin's "explicit > implicit" — the prior code just picked the highest score blindly. # 3. Diagnostic log when alias rescues a comparison When the alias path triggers a PASS that direct similarity would have FAILed, emit an INFO log: `Artist alias rescued comparison: expected='X' vs actual='Y' (direct sim=0.00, alias 'Z' → score=1.00)`. Lets future bug reports trace which alias triggered which decision. Doesn't change behavior — visibility only. Logs ONLY the rescue case, not happy-path direct matches (no log spam). # Tests added (5) `test_artist_alias_service.py` (+3): - `test_moderate_confidence_match_now_skipped_strict_threshold` - `test_ambiguous_results_skipped` - `test_unambiguous_high_confidence_match_succeeds` `test_acoustid_verification_aliases.py` (+3): - `test_alias_rescue_emits_info_log` — direct-fail + alias-pass emits INFO log - `test_no_log_when_direct_match_succeeds` — happy path quiet - `test_no_log_when_alias_doesnt_help` — failed path also quiet # Test infrastructure note Logging tests use a directly-attached `ListHandler` on `soulsync.acoustid.verification` (the actual logger name — dot-separated by `get_logger`), NOT pytest's caplog. Same pattern as the prior watchdog-test fix — caplog is intermittently flaky in full-suite runs for soulsync namespace loggers. An owned handler sidesteps both issues. # Verification - 85/85 matching tests pass (+5 from prior commit) - 2543 full suite passes (+6 from prior, +85 PR-total) - Ruff clean - Reporter's Japanese + Russian regression tests still pass — legit cross-script case (sim ≈ 1.0) clears the new 0.85 threshold easily
5 days ago · bc34d39ce9
parent 11397307b2
commit bc34d39ce9
4 changed files with 200 additions and 12 deletions
--- a/core/acoustid_verification.py
+++ b/core/acoustid_verification.py
@ -115,6 +115,12 @@ def _alias_aware_artist_sim(
      needed. Verifications where the direct match already passes
      never trigger the lookup chain — no wasted DB query for the
      happy path.
+
+    Diagnostic logging: emits an INFO line whenever an alias rescues
+    a comparison that direct similarity would have failed. Lets
+    future bug reports trace which alias triggered which PASS
+    decision (e.g. "this file passed because alias `澤野弘之` matched
+    the file's artist tag").
    """
    from core.matching.artist_aliases import artist_names_match

@ -140,6 +146,22 @@ def _alias_aware_artist_sim(
        threshold=ARTIST_MATCH_THRESHOLD,
        similarity=_similarity,
    )
+
+    # Diagnostic — alias rescued a comparison that direct would
+    # have failed. Worth logging at INFO since it's a user-visible
+    # decision (file PASS instead of FAIL). One line per rescue
+    # within a single verify call.
+    if score >= ARTIST_MATCH_THRESHOLD and direct < ARTIST_MATCH_THRESHOLD:
+        from core.matching.artist_aliases import best_alias_match
+        winner, _ = best_alias_match(
+            expected_artist, actual_artist, resolved, similarity=_similarity,
+        )
+        logger.info(
+            "Artist alias rescued comparison: expected=%r vs actual=%r "
+            "(direct sim=%.2f, alias %r → score=%.2f)",
+            expected_artist, actual_artist, direct, winner, score,
+        )
+
    return score


--- a/core/musicbrainz_service.py
+++ b/core/musicbrainz_service.py
@ -439,23 +439,48 @@ class MusicBrainzService:
            self._save_to_cache('artist_aliases', artist_name, None, None, {'aliases': []}, 0)
            return []

-        # Pick the best match — highest combined score of MB's relevance
-        # and our name-similarity check (mirrors `match_artist`).
-        best_mbid = None
-        best_score = 0
+        # Score each result: combined of name-similarity + MB's own
+        # relevance. Score range 0.0-1.0.
+        scored = []
        for result in results:
            mb_name = result.get('name', '')
            mb_score = result.get('score', 0)
            sim = self._calculate_similarity(artist_name, mb_name)
            combined = (sim * 0.7) + (mb_score / 100 * 0.3)
-            if combined > best_score:
-                best_score = combined
-                best_mbid = result.get('id')
-
-        # Threshold: only trust the lookup when name + MB-relevance
-        # combined is reasonably high. Otherwise we're guessing,
-        # which could pull in aliases for the wrong artist.
-        if not best_mbid or best_score < 0.6:
+            mbid = result.get('id')
+            if mbid:
+                scored.append((combined, mbid))
+
+        if not scored:
+            self._save_to_cache('artist_aliases', artist_name, None, None, {'aliases': []}, 0)
+            return []
+
+        scored.sort(key=lambda x: -x[0])
+        best_score, best_mbid = scored[0]
+
+        # Strict trust threshold: real matches for distinctive cross-
+        # script artists (the user-reported case) score >= 0.95.
+        # Anything below 0.85 is ambiguous and not worth the false-
+        # positive risk of pulling in aliases for the wrong artist.
+        if best_score < 0.85:
+            logger.debug(
+                "lookup_artist_aliases: best match for %r below trust "
+                "threshold (score=%.2f)", artist_name, best_score,
+            )
+            self._save_to_cache('artist_aliases', artist_name, None, None, {'aliases': []}, 0)
+            return []
+
+        # Ambiguity detection: when 2+ results both score high (within
+        # 0.1 of the best), the search hit multiple distinct artists
+        # with similar names ("John Smith" returning 10 different
+        # John Smiths all at score 100). Pulling aliases for one of
+        # them could produce wrong matches. Skip + cache empty.
+        if len(scored) >= 2 and (scored[0][0] - scored[1][0]) < 0.1:
+            logger.debug(
+                "lookup_artist_aliases: ambiguous match for %r — top "
+                "two results within 0.1 (%.2f / %.2f). Skipping alias lookup.",
+                artist_name, scored[0][0], scored[1][0],
+            )
            self._save_to_cache('artist_aliases', artist_name, None, None, {'aliases': []}, 0)
            return []

--- a/tests/matching/test_acoustid_verification_aliases.py
+++ b/tests/matching/test_acoustid_verification_aliases.py
@ -399,3 +399,100 @@ class TestAliasProviderCallable:
        )
        # ~0 because direct cross-script comparison fails
        assert score < 0.1
+
+
+# ---------------------------------------------------------------------------
+# Diagnostic logging — alias rescues are visible in logs
+# ---------------------------------------------------------------------------
+
+
+class TestAliasRescueLogging:
+    """When an alias bridges a comparison that direct similarity
+    would have failed, log it at INFO level. Future bug reports
+    where a file passed verification incorrectly can be traced back
+    to which alias triggered which decision.
+
+    Uses a directly-attached handler instead of pytest's caplog —
+    full-suite caplog is intermittently flaky for soulsync namespace
+    loggers (handler ordering, parallel test state). An owned
+    handler on the specific logger sidesteps both issues, same
+    pattern as the prior watchdog-test fix.
+    """
+
+    @staticmethod
+    def _capture_records():
+        """Attach an owned ListHandler to the verifier's logger.
+        Returns (records list, teardown callable)."""
+        import logging as _logging
+        records: list = []
+
+        class _ListHandler(_logging.Handler):
+            def emit(self, record):
+                records.append(record)
+
+        handler = _ListHandler(level=_logging.INFO)
+        # Logger name is `soulsync.acoustid.verification` per
+        # `core.acoustid_verification`'s `get_logger("acoustid_verification")`
+        # — dot-separated, NOT underscored.
+        verifier_logger = _logging.getLogger('soulsync.acoustid.verification')
+        verifier_logger.addHandler(handler)
+        prior_level = verifier_logger.level
+        verifier_logger.setLevel(_logging.INFO)
+
+        def teardown():
+            verifier_logger.removeHandler(handler)
+            verifier_logger.setLevel(prior_level)
+
+        return records, teardown
+
+    def test_alias_rescue_emits_info_log(self):
+        records, teardown = self._capture_records()
+        try:
+            _alias_aware_artist_sim(
+                'Hiroyuki Sawano', '澤野弘之', aliases=['澤野弘之'],
+            )
+        finally:
+            teardown()
+
+        rescue_logs = [
+            r.getMessage() for r in records
+            if 'alias rescued' in r.getMessage().lower()
+        ]
+        assert len(rescue_logs) >= 1, (
+            f"Expected an INFO log line about alias rescue; got "
+            f"{[r.getMessage() for r in records]}"
+        )
+
+    def test_no_log_when_direct_match_succeeds(self):
+        """Happy path doesn't spam logs — only rescue cases log."""
+        records, teardown = self._capture_records()
+        try:
+            _alias_aware_artist_sim(
+                'Foreigner', 'Foreigner', aliases=['ignored-alias'],
+            )
+        finally:
+            teardown()
+
+        rescue_logs = [
+            r.getMessage() for r in records
+            if 'alias rescued' in r.getMessage().lower()
+        ]
+        assert rescue_logs == []
+
+    def test_no_log_when_alias_doesnt_help(self):
+        """If aliases were available but didn't bridge the gap (still
+        below threshold), no rescue log — there was no rescue."""
+        records, teardown = self._capture_records()
+        try:
+            _alias_aware_artist_sim(
+                'Hiroyuki Sawano', 'Khalil Turk',
+                aliases=['Sergey Lazarev'],  # unrelated alias
+            )
+        finally:
+            teardown()
+
+        rescue_logs = [
+            r.getMessage() for r in records
+            if 'alias rescued' in r.getMessage().lower()
+        ]
+        assert rescue_logs == []
--- a/tests/matching/test_artist_alias_service.py
+++ b/tests/matching/test_artist_alias_service.py
@ -498,6 +498,50 @@ class TestLookupArtistAliasesMultiTier:
        # Didn't even try fetching aliases for the bad match
        service.mb_client.get_artist.assert_not_called()

+    def test_moderate_confidence_match_now_skipped_strict_threshold(self, service):
+        """Threshold tightened to 0.85 (was 0.6) — moderate matches
+        (sim ~0.7) are no longer trusted. Reduces false-positive
+        risk on ambiguous artist names."""
+        service.mb_client.search_artist.return_value = [
+            # Different name, MB matched on weak signal — combined
+            # score lands around 0.6-0.7, below the new 0.85 floor.
+            {'id': 'mb-x', 'name': 'John Williams', 'score': 50},
+        ]
+        aliases = service.lookup_artist_aliases('John Smith')
+        assert aliases == []
+        service.mb_client.get_artist.assert_not_called()
+
+    def test_ambiguous_results_skipped(self, service):
+        """When MB search returns multiple results with similar high
+        scores (within 0.1 of each other), the artist name is
+        ambiguous — common name with multiple distinct artists
+        ('John Smith' returning 10 different John Smiths). Pulling
+        aliases for one could mismatch the wrong artist's data
+        against our file. Skip + cache empty."""
+        service.mb_client.search_artist.return_value = [
+            {'id': 'mb-smith-1', 'name': 'John Smith', 'score': 100},
+            {'id': 'mb-smith-2', 'name': 'John Smith', 'score': 100},
+            {'id': 'mb-smith-3', 'name': 'John Smith', 'score': 100},
+        ]
+        aliases = service.lookup_artist_aliases('John Smith')
+        assert aliases == []
+        # Didn't fetch aliases for either ambiguous candidate
+        service.mb_client.get_artist.assert_not_called()
+
+    def test_unambiguous_high_confidence_match_succeeds(self, service):
+        """Sanity: a clear winner (top result high, no near-tie with
+        runner-up) still triggers alias fetch — the ambiguity gate
+        doesn't break the legit case."""
+        service.mb_client.search_artist.return_value = [
+            {'id': 'mb-sawano', 'name': 'Hiroyuki Sawano', 'score': 100},
+            {'id': 'mb-other', 'name': 'Unrelated Artist', 'score': 30},
+        ]
+        service.mb_client.get_artist.return_value = {
+            'aliases': [{'name': '澤野弘之'}],
+        }
+        aliases = service.lookup_artist_aliases('Hiroyuki Sawano')
+        assert '澤野弘之' in aliases
+
    def test_library_with_empty_aliases_falls_through_to_live(self, service, temp_db):
        """Edge case: library has the artist but `aliases` column is
        NULL (worker hasn't enriched yet). Don't get stuck — fall