AcoustID scanner: handle multi-value artist credits

Discord report (Foxxify): the AcoustID scanner repair job flagged
multi-artist tracks as Wrong Song because AcoustID returns the
FULL credit ("Okayracer, aldrch & poptropicaslutz!") while the
library DB carries only the primary artist ("Okayracer"). Raw
SequenceMatcher similarity scored ~43%, well below the 60%
threshold, so the scanner created a finding even though the
audio was correct. The only workaround was lowering the global
artist threshold to ~30%, which would let real mismatches through.
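To make the arithmetic behind the low score concrete, here is a minimal sketch using only the standard library. It compares the lowercased strings directly, so it lands at 40% rather than the reported ~43%; the scanner's own `_normalize` step is not reproduced here, which is presumably where the small difference comes from.

```python
from difflib import SequenceMatcher

expected = "Okayracer"                                    # library DB value
acoustid_credit = "Okayracer, aldrch & poptropicaslutz!"  # AcoustID value
threshold = 0.6

# ratio() = 2 * matched_chars / (len(a) + len(b)). Only the 9 characters
# of "okayracer" match, against 9 + 36 characters in total.
raw_score = SequenceMatcher(
    None, expected.lower(), acoustid_credit.lower()
).ratio()

print(f"raw similarity: {raw_score:.2f}")             # 0.40
print(f"passes threshold: {raw_score >= threshold}")  # False
```

The longer the credit, the lower the ratio, even though the expected artist is present verbatim.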

# Fix

Extended the shared `core/matching/artist_aliases.py::artist_names_match`
helper (originally lifted for #441) with credit-token splitting.
When the actual artist string contains common separators —

- punctuation: `,`  `&`  `;`  `/`  `+`
- keywords (whitespace-bounded): `feat.` `ft.` `featuring` `with`
  `vs.` `x`

— the helper splits into individual contributors and checks each
against the expected artist. Primary-in-credit cases now resolve
at 100% instead of 43%.

There are two pattern groups because punctuation separators don't
need surrounding whitespace, but keyword separators MUST be
whitespace-bounded. Otherwise the splitter would break up artists
with `x` / `with` etc. in their names (e.g. "JAY-X" would become
"JAY-" plus an empty token).
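The whitespace rule can be checked in isolation. The sketch below copies the two patterns from this commit's diff; `split_credit` and `_NAIVE_KEYWORDS` are illustrative names that exist only in this example, and the sample artist strings are made up.

```python
import re
from typing import List

# Patterns as they appear in the diff for core/matching/artist_aliases.py.
_CREDIT_PUNCT_SPLITTER = r'\s*[,&;/+]\s*'
_CREDIT_KEYWORD_SPLITTER = r'\s+(?:feat\.?|ft\.?|featuring|with|vs\.?|x)\s+'
_CREDIT_SPLITTER = re.compile(
    rf'(?:{_CREDIT_PUNCT_SPLITTER}|{_CREDIT_KEYWORD_SPLITTER})',
    re.IGNORECASE,
)

# Hypothetical naive variant: keywords NOT required to be whitespace-bounded.
_NAIVE_KEYWORDS = re.compile(
    r'\s*(?:feat\.?|ft\.?|featuring|with|vs\.?|x)\s*', re.IGNORECASE
)


def split_credit(credit: str) -> List[str]:
    """Split on the commit's separators and drop empty tokens."""
    return [p.strip() for p in _CREDIT_SPLITTER.split(credit) if p and p.strip()]


tight_punct = split_credit("Artist1,Artist2&Artist3")  # punctuation, no spaces
keyword = split_credit("A x B")                        # whitespace-bounded keyword
embedded_x = split_credit("JAY-X")                     # 'x' inside a name
embedded_with = split_credit("Without Warning")        # 'with' inside a name
naive = _NAIVE_KEYWORDS.split("JAY-X")                 # what the bounding prevents

print(tight_punct, keyword, embedded_x, embedded_with, naive)
```

The last value shows the failure mode the commit message describes: without the `\s+` bounds, "JAY-X" is cut into "JAY-" and an empty string.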

Composes with the existing alias path: cross-script multi-artist
credits ("Hiroyuki Sawano" expected, "澤野弘之, FeaturedJp"
actual) work via alias-token-against-credit-token compare.
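A rough sketch of how the alias path and the credit split compose is given below. `best_credit_score` is a hypothetical, simplified stand-in for `artist_names_match` (it returns only the best score and ignores early exits and custom normalisers); the splitter regex is the one from the diff.

```python
import re
from difflib import SequenceMatcher
from typing import Iterable, List

_CREDIT_SPLITTER = re.compile(
    r'(?:\s*[,&;/+]\s*|\s+(?:feat\.?|ft\.?|featuring|with|vs\.?|x)\s+)',
    re.IGNORECASE,
)


def _sim(a: str, b: str) -> float:
    """Case-insensitive SequenceMatcher ratio."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def best_credit_score(expected: str, actual: str,
                      aliases: Iterable[str] = ()) -> float:
    """Best score of (expected or any alias) vs (full credit or any token)."""
    tokens: List[str] = [
        t.strip() for t in _CREDIT_SPLITTER.split(actual) if t and t.strip()
    ]
    candidates = [expected, *aliases]
    targets = [actual, *tokens]
    return max(_sim(c, t) for c in candidates for t in targets)


with_alias = best_credit_score(
    "Hiroyuki Sawano", "澤野弘之, FeaturedJp", aliases=["澤野弘之"]
)
without_alias = best_credit_score("Hiroyuki Sawano", "澤野弘之, FeaturedJp")

print(with_alias, without_alias)
```

With the alias supplied, the kanji token matches exactly; without it, neither the full credit nor any single token comes close to the 0.6 threshold.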

# Wire-in

Scanner at `core/repair_jobs/acoustid_scanner.py:202` replaces
the raw `SequenceMatcher` call with `artist_names_match`. It passes
RAW artist strings (not pre-normalised by `_normalize`) so the
splitter can recognise separators: `_normalize` strips ALL
punctuation, which would destroy the very tokens the splitter needs.
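The reason for passing raw strings can be illustrated as follows. `strip_punct_normalize` is a hypothetical stand-in written for this example; it only assumes what the message states about `_normalize` (that it strips all punctuation) and is not the scanner's actual implementation.

```python
import re
import string
from typing import List

_CREDIT_SPLITTER = re.compile(
    r'(?:\s*[,&;/+]\s*|\s+(?:feat\.?|ft\.?|featuring|with|vs\.?|x)\s+)',
    re.IGNORECASE,
)


def split_credit(credit: str) -> List[str]:
    """Split on the commit's separators and drop empty tokens."""
    return [p.strip() for p in _CREDIT_SPLITTER.split(credit) if p and p.strip()]


def strip_punct_normalize(text: str) -> str:
    """Hypothetical normaliser: lowercase, delete ASCII punctuation,
    collapse runs of whitespace."""
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    return ' '.join(cleaned.split())


raw = "Okayracer, aldrch & poptropicaslutz!"
tokens_from_raw = split_credit(raw)
tokens_from_normalised = split_credit(strip_punct_normalize(raw))

print(tokens_from_raw)         # three contributors
print(tokens_from_normalised)  # one merged token, separators are gone
```

Once the comma and ampersand are removed, the credit collapses into a single token and the per-contributor compare has nothing to work with.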

The AcoustID post-download verifier (`core/acoustid_verification.py`)
already routes through `_alias_aware_artist_sim`, which calls the
same helper, so it gets the multi-value benefit automatically without
a separate wire-in.

# New `split_artist_credit` exported helper

Pure-function helper for callers who want token-level access to
the credit list (debugging, UI, future per-token enrichment). Same
splitter logic, exposed as a top-level function.
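As one possible use of token-level access for debugging, the sketch below prints a per-token similarity table. `split_artist_credit` here is a local copy of the logic in the diff so the example runs on its own; `per_token_scores` is a hypothetical helper, not part of the commit.

```python
import re
from difflib import SequenceMatcher
from typing import List, Tuple

_CREDIT_SPLITTER = re.compile(
    r'(?:\s*[,&;/+]\s*|\s+(?:feat\.?|ft\.?|featuring|with|vs\.?|x)\s+)',
    re.IGNORECASE,
)


def split_artist_credit(credit: str) -> List[str]:
    """Local stand-in for the exported helper (same logic as the diff)."""
    if not credit:
        return []
    parts = _CREDIT_SPLITTER.split(str(credit))
    return [p.strip() for p in parts if p and p.strip()]


def per_token_scores(expected: str, credit: str) -> List[Tuple[str, float]]:
    """Hypothetical debugging helper: score `expected` against each token."""
    want = expected.strip().lower()
    return [
        (token, SequenceMatcher(None, want, token.lower()).ratio())
        for token in split_artist_credit(credit)
    ]


scores = per_token_scores("Okayracer", "Okayracer, aldrch & poptropicaslutz!")
for token, score in scores:
    print(f"{token!r}: {score:.2f}")
```

A table like this makes it easy to see which contributor, if any, carried the match.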

# Tests added (14)

`tests/matching/test_artist_aliases.py` (+11):
- `TestSplitArtistCredit` — parametrised across 12 credit-string
  formats (comma, ampersand, semicolon, slash, plus, feat./ft./
  featuring, with, vs., x, single-token, empty), drops empty
  tokens, strips per-token whitespace
- `TestMultiValueCreditMatching` — reporter's exact case
  (Okayracer in 3-artist credit → 100%), primary in middle/end of
  credit, genuine-mismatch still fails, single-token actual falls
  through to direct compare, multi-value composes with aliases,
  threshold still respected

`tests/test_acoustid_scanner.py` (+3):
- Reporter's case end-to-end through `_scan_file` — fingerprint
  99% / title 100% / multi-artist credit → no finding created
- Genuine artist mismatch still creates finding (no false
  suppression of real mismatches)
- `JobResultStub` minimal scaffold for the integration tests

# Verification

- 14 new tests pass (49 helper + 5 scanner total in their files)
- 110 matching + scanner tests pass total
- 2584 full suite passes (+25 from baseline 2559)
- Ruff clean
- Reporter's exact case (Okayracer in `Okayracer, aldrch &
  poptropicaslutz!`) now scores 100% match → no Wrong Song flag
pull/543/head
Broque Thomas 4 weeks ago
parent c038400d84
commit df304eb016

@ -31,8 +31,9 @@ direct similarity comparison — identical to the pre-fix behaviour.
from __future__ import annotations
import re
from difflib import SequenceMatcher
-from typing import Callable, Iterable, Optional, Tuple
from typing import Callable, Iterable, List, Optional, Tuple
# Default threshold matches the existing ARTIST_MATCH_THRESHOLD in
@ -41,6 +42,28 @@ from typing import Callable, Iterable, Optional, Tuple
DEFAULT_ARTIST_MATCH_THRESHOLD = 0.6
# Multi-value credit-string separators. AcoustID returns the FULL
# artist credit ("Okayracer, aldrch & poptropicaslutz!") while the
# library DB carries only the primary artist ("Okayracer"). Raw string
# similarity scores ~40% — the primary IS in the credit but split by
# punctuation. Splitting on these tokens lets each contributor compare
# individually so the primary-artist match wins at near-100%.
#
# Two patterns because the punctuation separators (comma, ampersand,
# slash, etc.) don't need surrounding whitespace, but the keyword
# separators ("feat", "ft", "vs", etc.) MUST be whitespace-bounded —
# otherwise we'd split "JAY-X" or any artist with "x" / "with" etc.
# in their name.
_CREDIT_PUNCT_SPLITTER = r'\s*[,&;/+]\s*'
_CREDIT_KEYWORD_SPLITTER = (
r'\s+(?:feat\.?|ft\.?|featuring|with|vs\.?|x)\s+'
)
_CREDIT_SPLITTER = re.compile(
rf'(?:{_CREDIT_PUNCT_SPLITTER}|{_CREDIT_KEYWORD_SPLITTER})',
re.IGNORECASE,
)
def _default_normalize(text: str) -> str:
"""Lowercase + strip whitespace. Minimal — caller's normaliser
almost always replaces this with something stricter (parenthetical
@ -64,6 +87,24 @@ def _default_similarity(a: str, b: str) -> float:
return SequenceMatcher(None, na, nb).ratio()
def split_artist_credit(credit: str) -> List[str]:
    """Split a multi-value artist credit string into individual names.

    Examples:
    - ``"Okayracer, aldrch & poptropicaslutz!"`` → ``["Okayracer", "aldrch", "poptropicaslutz!"]``
    - ``"Daft Punk feat. Pharrell"`` → ``["Daft Punk", "Pharrell"]``
    - ``"Artist1 / Artist2 / Artist3"`` → ``["Artist1", "Artist2", "Artist3"]``
    - ``"Solo Artist"`` → ``["Solo Artist"]`` (no separators → single-entry list)

    Empty string / whitespace-only entries dropped. Returns at least
    one entry whenever the input contains a name (the single-artist case).
    """
    if not credit:
        return []
    parts = _CREDIT_SPLITTER.split(str(credit))
    return [p.strip() for p in parts if p and p.strip()]
def _coerce_aliases(aliases: Optional[Iterable[str]]) -> Tuple[str, ...]:
"""Normalise the aliases input to a tuple of clean strings.
@ -129,8 +170,27 @@ def artist_names_match(
    if direct_score >= threshold:
        return True, direct_score

    # Multi-value credit compare: AcoustID + media-server clients
    # often surface the FULL credit ("Artist1, Artist2 & Artist3")
    # while the library DB carries only the primary artist. Split
    # `actual` into its constituent contributors and check each against
    # `expected`. Skipped when actual is single-token (no separators
    # present): split_artist_credit returns [actual] in that case, which
    # equals the direct compare we already did, so don't recompute.
    actual_credits = split_artist_credit(actual)
    if len(actual_credits) > 1:
        for credit in actual_credits:
            score = sim(expected, credit)
            if score > best_score:
                best_score = score
            if score >= threshold:
                return True, score
    # Alias compare: each alias is a known alternate spelling of the
    # EXPECTED artist; match it against the ACTUAL name we observed.
    # Also check each alias against each credit token from above so
    # cross-script primary-in-collab cases (e.g. expected='Hiroyuki
    # Sawano', actual='澤野弘之, FeaturedJp') still bridge.
    # Highest score wins.
    for alias in _coerce_aliases(aliases):
        score = sim(alias, actual)
@ -138,6 +198,13 @@ def artist_names_match(
            best_score = score
        if score >= threshold:
            return True, score
        if len(actual_credits) > 1:
            for credit in actual_credits:
                token_score = sim(alias, credit)
                if token_score > best_score:
                    best_score = token_score
                if token_score >= threshold:
                    return True, token_score
    return False, best_score

@ -199,7 +199,29 @@ class AcoustIDScannerJob(RepairJob):
        norm_aid_artist = _normalize(aid_artist)
        title_sim = SequenceMatcher(None, norm_expected_title, norm_aid_title).ratio()
-        artist_sim = SequenceMatcher(None, norm_expected_artist, norm_aid_artist).ratio() if norm_expected_artist else 1.0
        # Issue (Foxxify Discord report): AcoustID returns the FULL artist
        # credit (e.g. `Okayracer, aldrch & poptropicaslutz!`) while the
        # library DB carries only the primary artist (`Okayracer`). Raw
        # similarity scores ~43%, well below threshold, so multi-artist
        # tracks get flagged as Wrong Song even though the primary IS in
        # the credit. Route through the shared `artist_names_match` helper
        # which splits the credit on common separators (comma, ampersand,
        # feat./ft./with/vs., etc.) and checks each token. Primary-in-
        # credit cases now resolve at 100% match instead of 43%.
        #
        # Pass RAW artist strings (not pre-normalised) so the splitter
        # can recognise the separators. The helper applies its own
        # case + whitespace normalisation internally per token.
        if norm_expected_artist:
            from core.matching.artist_aliases import artist_names_match
            _, artist_sim = artist_names_match(
                expected['artist'],
                aid_artist,
                threshold=artist_threshold,
            )
        else:
            artist_sim = 1.0
        if title_sim >= title_threshold and artist_sim >= artist_threshold:
            return

@ -19,6 +19,7 @@ from core.matching.artist_aliases import (
DEFAULT_ARTIST_MATCH_THRESHOLD,
artist_names_match,
best_alias_match,
split_artist_credit,
)
@ -262,3 +263,130 @@ class TestBackwardCompatNoAliases:
def test_no_alias_path_matches_direct_similarity(self, expected, actual, should_match):
matched, _ = artist_names_match(expected, actual)
assert matched is should_match
# ---------------------------------------------------------------------------
# Multi-value artist credit — Discord report from Foxxify
# ---------------------------------------------------------------------------
#
# AcoustID returns the FULL artist credit ("Okayracer, aldrch &
# poptropicaslutz!") while the library DB carries only the primary
# artist ("Okayracer"). Pre-fix raw similarity scored ~43% — well
# below the 0.6 threshold — and the scanner flagged the track as
# Wrong Song. Post-fix the helper splits the credit and the primary
# match wins at near-100%.
class TestSplitArtistCredit:
    @pytest.mark.parametrize('credit,expected', [
        ('Okayracer, aldrch & poptropicaslutz!',
         ['Okayracer', 'aldrch', 'poptropicaslutz!']),
        ('Daft Punk feat. Pharrell',
         ['Daft Punk', 'Pharrell']),
        ('Daft Punk ft. Pharrell',
         ['Daft Punk', 'Pharrell']),
        ('Daft Punk featuring Pharrell',
         ['Daft Punk', 'Pharrell']),
        ('Beyoncé with JAY-Z',
         ['Beyoncé', 'JAY-Z']),
        ('Eminem vs. Jay-Z',
         ['Eminem', 'Jay-Z']),
        ('Artist1 / Artist2 / Artist3',
         ['Artist1', 'Artist2', 'Artist3']),
        ('Artist1; Artist2; Artist3',
         ['Artist1', 'Artist2', 'Artist3']),
        ('Artist1 + Artist2',
         ['Artist1', 'Artist2']),
        ('A x B',
         ['A', 'B']),
        ('Solo Artist',
         ['Solo Artist']),  # single-token = self
        ('',
         []),
    ])
    def test_splits_on_known_separators(self, credit, expected):
        assert split_artist_credit(credit) == expected

    def test_drops_empty_tokens(self):
        # Consecutive separators don't introduce empty entries
        assert split_artist_credit('Artist,, Other') == ['Artist', 'Other']

    def test_strips_whitespace_per_token(self):
        assert split_artist_credit(' A , B ') == ['A', 'B']
class TestMultiValueCreditMatching:
    def test_reporters_exact_case_okayracer(self):
        """Discord report from Foxxify, verbatim from the screenshot:

            Expected: Okayracer
            AcoustID: Okayracer, aldrch & poptropicaslutz!

        Pre-fix: artist match 43% → Wrong Song flag
        Post-fix: primary in credit → 100% match
        """
        matched, score = artist_names_match(
            'Okayracer',
            'Okayracer, aldrch & poptropicaslutz!',
        )
        assert matched is True, (
            f"Expected primary-in-credit match; got matched=False score={score}"
        )
        assert score == 1.0

    def test_primary_in_middle_of_credit(self):
        """Primary artist isn't always first in the credit."""
        matched, score = artist_names_match(
            'Pharrell',
            'Daft Punk feat. Pharrell',
        )
        assert matched is True
        assert score == 1.0

    def test_primary_at_end_of_credit(self):
        matched, score = artist_names_match(
            'JAY-Z',
            'Beyoncé with JAY-Z',
        )
        assert matched is True

    def test_no_match_when_expected_artist_not_in_credit(self):
        """Multi-value path doesn't mask genuine mismatches. If
        expected isn't in the credit, the comparison should still
        fail."""
        matched, _ = artist_names_match(
            'Madonna',
            'Daft Punk feat. Pharrell',
        )
        assert matched is False

    def test_single_token_actual_falls_through_to_direct(self):
        """When actual has no separators, multi-value path is a
        no-op: same as the direct compare."""
        matched, _ = artist_names_match('Foreigner', 'Foreigner')
        assert matched is True
        # And different artists still fail
        matched, _ = artist_names_match('Foreigner', 'Khalil Turk')
        assert matched is False

    def test_multi_value_combines_with_aliases(self):
        """Combination case: expected is romanized, actual credit
        contains the kanji form alongside other artists. Both the
        alias path AND the multi-value path must collaborate."""
        matched, score = artist_names_match(
            'Hiroyuki Sawano',
            '澤野弘之, FeaturedJp Artist',
            aliases=['澤野弘之', 'SawanoHiroyuki'],
        )
        assert matched is True
        assert score == 1.0

    def test_threshold_still_respected(self):
        """Multi-value path doesn't bypass the threshold: fuzzy
        in-credit matches still need to clear it."""
        matched, score = artist_names_match(
            'XXXXXX',
            'YYYYYY, ZZZZZZ',
            threshold=0.99,
        )
        assert matched is False
        assert score < 0.5

@ -86,3 +86,142 @@ def test_scan_handles_mixed_track_id_types(monkeypatch):
assert result.scanned == 1
assert scanned_track_ids == ["42"]
# ---------------------------------------------------------------------------
# Multi-value artist credit — Foxxify Discord report
# ---------------------------------------------------------------------------
#
# AcoustID returns the FULL artist credit while the library DB
# carries only the primary artist. Pre-fix raw SequenceMatcher
# scored 43% — below the 0.6 threshold — and the scanner created a
# Wrong Song finding even though the audio was correct. Post-fix the
# scanner routes through `artist_names_match` which splits the credit
# and finds the primary artist at 100%, suppressing the false flag.
def _make_finding_capturing_context(track_row, captured):
    """Context that captures any create_finding calls into the
    `captured` list. Tests assert against this list to verify whether
    the scanner created a finding (false positive) or correctly
    skipped (multi-value match resolved)."""
    conn = _FakeConnection([track_row])
    config_manager = SimpleNamespace(
        get=lambda key, default=None: default,
        set=lambda *args, **kwargs: None,
    )
    db = SimpleNamespace(_get_connection=lambda: conn)

    def fake_create_finding(**kwargs):
        captured.append(kwargs)
        return True

    return SimpleNamespace(
        db=db,
        transfer_folder="/music",
        config_manager=config_manager,
        acoustid_client=object(),
        create_finding=fake_create_finding,
        report_progress=lambda **kwargs: None,
        update_progress=lambda *args, **kwargs: None,
        check_stop=lambda: False,
        wait_if_paused=lambda: False,
        sleep_or_stop=lambda *args, **kwargs: False,
    )
def test_scanner_no_finding_when_primary_artist_in_acoustid_credit():
    """Reporter's exact case verbatim:

        Library DB: title='Tea Parties With Dale Earnhardt' artist='Okayracer'
        AcoustID:   title='Tea Parties With Dale Earnhardt'
                    artist='Okayracer, aldrch & poptropicaslutz!'

    Pre-fix: artist_sim=43% → Wrong Song finding
    Post-fix: 'Okayracer' found in credit → 100% → no finding
    """
    job = AcoustIDScannerJob()
    captured_findings = []
    context = _make_finding_capturing_context(
        track_row=("69241726", "Tea Parties With Dale Earnhardt", "Okayracer",
                   "/music/track.opus", 1, "Album", None, None),
        captured=captured_findings,
    )
    fake_acoustid = SimpleNamespace(
        fingerprint_and_lookup=lambda fpath: {
            'best_score': 0.99,
            'recordings': [{
                'title': 'Tea Parties With Dale Earnhardt',
                'artist': 'Okayracer, aldrch & poptropicaslutz!',
            }],
        },
    )
    result = JobResultStub()
    job._scan_file(
        '/music/track.opus',
        '69241726',
        {'title': 'Tea Parties With Dale Earnhardt', 'artist': 'Okayracer'},
        fake_acoustid,
        context,
        result,
        fp_threshold=0.85,
        title_threshold=0.85,
        artist_threshold=0.6,
    )
    assert captured_findings == [], (
        f"Expected no finding (primary artist in credit); got {captured_findings}"
    )
def test_scanner_still_flags_genuine_artist_mismatch():
    """Sanity: multi-value path doesn't suppress legitimate
    mismatches. If expected artist is NOT in the credit at all,
    finding still fires."""
    job = AcoustIDScannerJob()
    captured_findings = []
    context = _make_finding_capturing_context(
        track_row=("99", "Some Track", "Foreigner",
                   "/music/track.flac", 1, "Album", None, None),
        captured=captured_findings,
    )
    fake_acoustid = SimpleNamespace(
        fingerprint_and_lookup=lambda fpath: {
            'best_score': 0.99,
            'recordings': [{
                'title': 'Some Track',
                'artist': 'Different Band, Other Person & Random Featuring',
            }],
        },
    )
    result = JobResultStub()
    job._scan_file(
        '/music/track.flac',
        '99',
        {'title': 'Some Track', 'artist': 'Foreigner'},
        fake_acoustid,
        context,
        result,
        fp_threshold=0.85,
        title_threshold=0.85,
        artist_threshold=0.6,
    )
    assert len(captured_findings) == 1, (
        f"Expected a finding for genuine mismatch; got {len(captured_findings)}"
    )
    assert captured_findings[0]['finding_type'] == 'acoustid_mismatch'
class JobResultStub:
    """Minimal JobResult-like stub for the scanner integration tests
    above. The real JobResult tracks scanned/skipped/findings_created
    counters via attribute assignment; the same shape works here."""
    findings_created = 0
    findings_skipped_dedup = 0
    errors = 0
    scanned = 0
    skipped = 0

@ -3416,6 +3416,7 @@ const WHATS_NEW = {
'2.4.3': [
// --- post-release patch work on the 2.4.3 line — entries hidden by _getLatestWhatsNewVersion until the build version bumps ---
{ date: 'Unreleased — 2.4.3 patch work' },
{ title: 'AcoustID Scanner: Multi-Artist Songs No Longer Flagged As Wrong', desc: 'discord report (foxxify): the acoustid scanner repair job was flagging multi-artist tracks as "wrong song" because acoustid returns the full credit ("okayracer, aldrch & poptropicaslutz!") while the library db carries only the primary artist ("okayracer"). raw similarity scored ~43% — well below the 60% threshold — so the scanner created a wrong-song finding even though the audio was correct. user couldn\'t fix without lowering the global artist threshold to ~30% (which would let real mismatches through). cause: scanner used raw `SequenceMatcher` comparison that doesn\'t recognise the primary artist is just one of several contributors in the credit string. fix: extended the shared `core/matching/artist_aliases.py::artist_names_match` helper (lifted in #441) with credit-token splitting on common separators (comma, ampersand, semicolon, slash, plus, "feat.", "ft.", "featuring", "with", "vs.", "x"). when actual artist contains separators, helper splits into individual contributors and checks each against expected — primary-in-credit cases now resolve at 100% instead of 43%. composes with existing alias path so cross-script multi-artist credits ("hiroyuki sawano" expected, "澤野弘之, featured" actual) work too. wired into `core/repair_jobs/acoustid_scanner.py` — replaces the raw similarity call. acoustid post-download verifier already used the helper from #441 so it inherits the same fix automatically. 14 new tests pin: split-by-separator across 12 credit-string formats, primary at start/middle/end of credit, no-mask on genuine mismatches, single-token actual falls through to direct compare, multi-value composes with aliases, threshold still respected, end-to-end scanner integration with reporter\'s exact case (okayracer in okayracer-aldrch-poptropicaslutz credit → no finding), end-to-end scanner still flags genuine mismatches.', page: 'library' },
{ title: 'Deezer Cover Art: Embedded Covers No Longer Look Blurry', desc: 'discord report (tim): downloaded cover art via deezer metadata source came out visibly blurry in navidrome and on phones — particularly noticeable on large displays. cause: deezer\'s api returns `cover_xl` urls at 1000×1000 but the underlying cdn serves up to 1900×1900 by rewriting the size segment in the url path. soulsync wasn\'t doing the rewrite — same as iTunes mzstatic and spotify scdn already get upgraded. now `_upgrade_deezer_cover_url` (mirrors `_upgrade_spotify_image_url` pattern) rewrites the cdn url to request 1900×1900 before download. cdn serves source-native size when source < target so asking for 1900 on smaller-source albums returns the same bytes (no upscaling, no failure). applied at both download sites — auto post-process flow + the enhanced library view\'s "write tags to file" feature. existing `prefer_caa_art` toggle in settings → library → post-processing remains as the orthogonal workaround for users who want even higher quality (musicbrainz cover art archive, often 3000×3000+). 16 new tests pin: standard upgrade, alternate dzcdn host, artist picture urls, custom target sizes, idempotency on already-upgraded urls, defensive on non-deezer urls (spotify/itunes/caa/lastfm/random), empty/none handling.', page: 'settings' },
{ title: 'Cross-Script Artist Names No Longer Quarantine Files (Hiroyuki Sawano / 澤野弘之, Сергей Лазарев / Sergey Lazarev)', desc: 'github issue #442 (afonsog6): files where the artist tag was in one script and the expected metadata was in another — japanese kanji `澤野弘之` for `hiroyuki sawano`, cyrillic `сергей лазарев` for `sergey lazarev`, etc. — got quarantined post-download because acoustid verification scored the artist similarity at 0% (the two scripts share no characters). reporter could not even rescue the file via manual import — the import-modal goes through the same verifier and re-quarantined the same file. cause: verifier compared expected vs actual artist with raw `_similarity` and never consulted musicbrainz aliases, even though MB exposes them on every artist record. fix: new `core/matching/artist_aliases.py` pure helper with alias-aware comparison + new `artists.aliases` JSON column populated by the existing MB enrichment worker on every artist match (one extra `inc=aliases` request per artist) + new multi-tier resolver `MusicBrainzService.lookup_artist_aliases` (library DB → cache → live MB) so the verifier finds aliases even for un-enriched artists without thrashing the MB API. verifier resolves aliases ONCE per `verify_audio_file` call and feeds them through three artist comparison sites (best-match scoring, secondary scan when title matches but artist doesn\'t, final fallback scan). reporter\'s exact two cases reproduced as regression tests with stubbed MB service. backward compat: aliases unavailable / MB unreachable → verifier falls back to direct similarity (identical to pre-fix behaviour — never quarantines stricter than today). 70 new tests pin every layer: pure helper (28), service methods (31), verifier integration (11). audited adjacent artist-comparison sites (auto-import single-track id, discovery scoring, matching engine) — left untouched per scope discipline since they aren\'t the user-reported pain.', page: 'downloads' },
{ title: 'Plex: Library Scan Trigger No Longer Fails On Non-English Section Names', desc: 'github issue #535 (adrigzr): plex servers with the music library named anything other than "music" — Música, Musique, Musik, Musica, etc. — got a `Failed to trigger library scan for "Music": Invalid library section: Music` error after every import cycle, and `wishlist.processing` kept reporting "missing from media server after sync" for tracks that DID import correctly because the post-import scan never fired. cause: `trigger_library_scan` and `is_library_scanning` ignored the auto-detected `self.music_library` (correctly populated by `_find_music_library` filtering by `section.type == "artist"`) and called `self.server.library.section(library_name)` with a hardcoded "music" default — raised NotFound on any non-english server. read methods like `get_artists` already routed through `_get_music_sections` so they didn\'t have the bug; this aligns the scan-trigger path with the same resolution. fix: both single-library branches prefer `self.music_library` first, fall back to literal section lookup only when auto-detection hasn\'t run. activity-feed match in `is_library_scanning` also corrected to use the resolved section\'s actual title instead of the unused `library_name` arg — the prior log line read "triggered scan for music" even on Spanish servers. 13 new tests pin: trigger uses auto-detected section across 6 locale variants (Música / Musique / Musik / Musica / 音乐 / موسيقى), backward-compat fallback when music_library is None, explicit library_name kwarg ignored when auto-detected section exists, log line surfaces correct section title, scan-status check uses auto-detected section\'s `refreshing` attr, activity-feed match filters by resolved title (not library_name).', page: 'settings' },
