Tighten alias-lookup trust + add ambiguity gate + diagnostic log

Cin pre-review pass on the false-positive risk. Three tightenings:

# 1. Bumped MB-search trust threshold from 0.6 → 0.85

`MusicBrainzService.lookup_artist_aliases` previously trusted any
MB search match scoring ≥ 0.6 combined (name-similarity + MB
relevance). For distinctive cross-script artists the user-reported
case targets (Hiroyuki Sawano, Сергей Лазарев, etc.) real matches
score ~1.0 — well above 0.85. The 0.6 floor was loose enough to
let in moderate matches for ambiguous names, risking aliases for
the wrong artist getting cached + applied.

Bumped to 0.85. Tighter without rejecting any of the legit
cross-script cases the PR is for.

# 2. Ambiguity gate — skip when results within 0.1 of best

When MB search returns multiple results all scoring high (within
0.1 of the best), the artist name is ambiguous — common name with
multiple distinct artists ("John Smith" returning 10 different
John Smiths). Pulling aliases for any one of them risks the wrong
artist's data bridging incorrectly to a file's tag.

Added explicit ambiguity detection: when 2+ results within 0.1,
skip alias lookup entirely + cache empty. Matches Cin's
"explicit > implicit" — the prior code just picked the highest
score blindly.

# 3. Diagnostic log when alias rescues a comparison

When the alias path triggers a PASS that direct similarity would
have FAILed, emit an INFO log: `Artist alias rescued comparison:
expected='X' vs actual='Y' (direct sim=0.00, alias 'Z' →
score=1.00)`.

Lets future bug reports trace which alias triggered which decision.
Doesn't change behavior — visibility only. Logs ONLY the rescue
case, not happy-path direct matches (no log spam).

# Tests added (5)

`test_artist_alias_service.py` (+3):
- `test_moderate_confidence_match_now_skipped_strict_threshold`
- `test_ambiguous_results_skipped`
- `test_unambiguous_high_confidence_match_succeeds`

`test_acoustid_verification_aliases.py` (+3):
- `test_alias_rescue_emits_info_log` — direct-fail + alias-pass
  emits INFO log
- `test_no_log_when_direct_match_succeeds` — happy path quiet
- `test_no_log_when_alias_doesnt_help` — failed path also quiet

# Test infrastructure note

Logging tests use a directly-attached `ListHandler` on
`soulsync.acoustid.verification` (the actual logger name —
dot-separated by `get_logger`), NOT pytest's caplog. Same pattern
as the prior watchdog-test fix — caplog is intermittently flaky
in full-suite runs for soulsync namespace loggers. An owned
handler sidesteps both issues.

# Verification

- 85/85 matching tests pass (+5 from prior commit)
- 2543 full suite passes (+6 from prior, +85 PR-total)
- Ruff clean
- Reporter's Japanese + Russian regression tests still pass —
  legit cross-script case (sim ≈ 1.0) clears the new 0.85
  threshold easily
pull/541/head
Broque Thomas 5 days ago
parent 11397307b2
commit bc34d39ce9

@ -115,6 +115,12 @@ def _alias_aware_artist_sim(
needed. Verifications where the direct match already passes
never trigger the lookup chain no wasted DB query for the
happy path.
Diagnostic logging: emits an INFO line whenever an alias rescues
a comparison that direct similarity would have failed. Lets
future bug reports trace which alias triggered which PASS
decision (e.g. "this file passed because alias `澤野弘之` matched
the file's artist tag").
"""
from core.matching.artist_aliases import artist_names_match
@ -140,6 +146,22 @@ def _alias_aware_artist_sim(
threshold=ARTIST_MATCH_THRESHOLD,
similarity=_similarity,
)
# Diagnostic — alias rescued a comparison that direct would
# have failed. Worth logging at INFO since it's a user-visible
# decision (file PASS instead of FAIL). One line per rescue
# within a single verify call.
if score >= ARTIST_MATCH_THRESHOLD and direct < ARTIST_MATCH_THRESHOLD:
from core.matching.artist_aliases import best_alias_match
winner, _ = best_alias_match(
expected_artist, actual_artist, resolved, similarity=_similarity,
)
logger.info(
"Artist alias rescued comparison: expected=%r vs actual=%r "
"(direct sim=%.2f, alias %r → score=%.2f)",
expected_artist, actual_artist, direct, winner, score,
)
return score

@ -439,23 +439,48 @@ class MusicBrainzService:
self._save_to_cache('artist_aliases', artist_name, None, None, {'aliases': []}, 0)
return []
# Pick the best match — highest combined score of MB's relevance
# and our name-similarity check (mirrors `match_artist`).
best_mbid = None
best_score = 0
# Score each result: combined of name-similarity + MB's own
# relevance. Score range 0.0-1.0.
scored = []
for result in results:
mb_name = result.get('name', '')
mb_score = result.get('score', 0)
sim = self._calculate_similarity(artist_name, mb_name)
combined = (sim * 0.7) + (mb_score / 100 * 0.3)
if combined > best_score:
best_score = combined
best_mbid = result.get('id')
# Threshold: only trust the lookup when name + MB-relevance
# combined is reasonably high. Otherwise we're guessing,
# which could pull in aliases for the wrong artist.
if not best_mbid or best_score < 0.6:
mbid = result.get('id')
if mbid:
scored.append((combined, mbid))
if not scored:
self._save_to_cache('artist_aliases', artist_name, None, None, {'aliases': []}, 0)
return []
scored.sort(key=lambda x: -x[0])
best_score, best_mbid = scored[0]
# Strict trust threshold: real matches for distinctive cross-
# script artists (the user-reported case) score >= 0.95.
# Anything below 0.85 is ambiguous and not worth the false-
# positive risk of pulling in aliases for the wrong artist.
if best_score < 0.85:
logger.debug(
"lookup_artist_aliases: best match for %r below trust "
"threshold (score=%.2f)", artist_name, best_score,
)
self._save_to_cache('artist_aliases', artist_name, None, None, {'aliases': []}, 0)
return []
# Ambiguity detection: when 2+ results both score high (within
# 0.1 of the best), the search hit multiple distinct artists
# with similar names ("John Smith" returning 10 different
# John Smiths all at score 100). Pulling aliases for one of
# them could produce wrong matches. Skip + cache empty.
if len(scored) >= 2 and (scored[0][0] - scored[1][0]) < 0.1:
logger.debug(
"lookup_artist_aliases: ambiguous match for %r — top "
"two results within 0.1 (%.2f / %.2f). Skipping alias lookup.",
artist_name, scored[0][0], scored[1][0],
)
self._save_to_cache('artist_aliases', artist_name, None, None, {'aliases': []}, 0)
return []

@ -399,3 +399,100 @@ class TestAliasProviderCallable:
)
# ~0 because direct cross-script comparison fails
assert score < 0.1
# ---------------------------------------------------------------------------
# Diagnostic logging — alias rescues are visible in logs
# ---------------------------------------------------------------------------
class TestAliasRescueLogging:
"""When an alias bridges a comparison that direct similarity
would have failed, log it at INFO level. Future bug reports
where a file passed verification incorrectly can be traced back
to which alias triggered which decision.
Uses a directly-attached handler instead of pytest's caplog —
full-suite caplog is intermittently flaky for soulsync namespace
loggers (handler ordering, parallel test state). An owned
handler on the specific logger sidesteps both issues, same
pattern as the prior watchdog-test fix.
"""
@staticmethod
def _capture_records():
"""Attach an owned ListHandler to the verifier's logger.
Returns (records list, teardown callable)."""
import logging as _logging
records: list = []
class _ListHandler(_logging.Handler):
def emit(self, record):
records.append(record)
handler = _ListHandler(level=_logging.INFO)
# Logger name is `soulsync.acoustid.verification` per
# `core.acoustid_verification`'s `get_logger("acoustid_verification")`
# — dot-separated, NOT underscored.
verifier_logger = _logging.getLogger('soulsync.acoustid.verification')
verifier_logger.addHandler(handler)
prior_level = verifier_logger.level
verifier_logger.setLevel(_logging.INFO)
def teardown():
verifier_logger.removeHandler(handler)
verifier_logger.setLevel(prior_level)
return records, teardown
def test_alias_rescue_emits_info_log(self):
records, teardown = self._capture_records()
try:
_alias_aware_artist_sim(
'Hiroyuki Sawano', '澤野弘之', aliases=['澤野弘之'],
)
finally:
teardown()
rescue_logs = [
r.getMessage() for r in records
if 'alias rescued' in r.getMessage().lower()
]
assert len(rescue_logs) >= 1, (
f"Expected an INFO log line about alias rescue; got "
f"{[r.getMessage() for r in records]}"
)
def test_no_log_when_direct_match_succeeds(self):
"""Happy path doesn't spam logs — only rescue cases log."""
records, teardown = self._capture_records()
try:
_alias_aware_artist_sim(
'Foreigner', 'Foreigner', aliases=['ignored-alias'],
)
finally:
teardown()
rescue_logs = [
r.getMessage() for r in records
if 'alias rescued' in r.getMessage().lower()
]
assert rescue_logs == []
def test_no_log_when_alias_doesnt_help(self):
"""If aliases were available but didn't bridge the gap (still
below threshold), no rescue log there was no rescue."""
records, teardown = self._capture_records()
try:
_alias_aware_artist_sim(
'Hiroyuki Sawano', 'Khalil Turk',
aliases=['Sergey Lazarev'], # unrelated alias
)
finally:
teardown()
rescue_logs = [
r.getMessage() for r in records
if 'alias rescued' in r.getMessage().lower()
]
assert rescue_logs == []

@ -498,6 +498,50 @@ class TestLookupArtistAliasesMultiTier:
# Didn't even try fetching aliases for the bad match
service.mb_client.get_artist.assert_not_called()
def test_moderate_confidence_match_now_skipped_strict_threshold(self, service):
"""Threshold tightened to 0.85 (was 0.6) — moderate matches
(sim ~0.7) are no longer trusted. Reduces false-positive
risk on ambiguous artist names."""
service.mb_client.search_artist.return_value = [
# Different name, MB matched on weak signal — combined
# score lands around 0.6-0.7, below the new 0.85 floor.
{'id': 'mb-x', 'name': 'John Williams', 'score': 50},
]
aliases = service.lookup_artist_aliases('John Smith')
assert aliases == []
service.mb_client.get_artist.assert_not_called()
def test_ambiguous_results_skipped(self, service):
"""When MB search returns multiple results with similar high
scores (within 0.1 of each other), the artist name is
ambiguous common name with multiple distinct artists
('John Smith' returning 10 different John Smiths). Pulling
aliases for one could mismatch the wrong artist's data
against our file. Skip + cache empty."""
service.mb_client.search_artist.return_value = [
{'id': 'mb-smith-1', 'name': 'John Smith', 'score': 100},
{'id': 'mb-smith-2', 'name': 'John Smith', 'score': 100},
{'id': 'mb-smith-3', 'name': 'John Smith', 'score': 100},
]
aliases = service.lookup_artist_aliases('John Smith')
assert aliases == []
# Didn't fetch aliases for either ambiguous candidate
service.mb_client.get_artist.assert_not_called()
def test_unambiguous_high_confidence_match_succeeds(self, service):
"""Sanity: a clear winner (top result high, no near-tie with
runner-up) still triggers alias fetch the ambiguity gate
doesn't break the legit case."""
service.mb_client.search_artist.return_value = [
{'id': 'mb-sawano', 'name': 'Hiroyuki Sawano', 'score': 100},
{'id': 'mb-other', 'name': 'Unrelated Artist', 'score': 30},
]
service.mb_client.get_artist.return_value = {
'aliases': [{'name': '澤野弘之'}],
}
aliases = service.lookup_artist_aliases('Hiroyuki Sawano')
assert '澤野弘之' in aliases
def test_library_with_empty_aliases_falls_through_to_live(self, service, temp_db):
"""Edge case: library has the artist but `aliases` column is
NULL (worker hasn't enriched yet). Don't get stuck fall

Loading…
Cancel
Save