AcoustID scanner: multi-candidate match + duration guard + multi-value retag

Closes #587. Three coordinated fixes per codex's diagnosis. AcoustID
verification gate left intact — these fixes target the upstream
scanner false-positive surface plus a separate retag-path gap.

Bug 1 — scanner used recordings[0] as authoritative

`core/repair_jobs/acoustid_scanner.py:_scan_file` only checked the
top fingerprint match's metadata. AcoustID often returns multiple
recordings per fingerprint (sample collisions, multi-MB-record
cases) and the wrong-credited recording can outrank the right-
credited one. Foxxify case 2 (Nana / Nana): top match credited the
wrong artist while a lower-ranked candidate matched the user's
expected metadata exactly.

Lifted the verifier's all-candidates check to a shared pure helper
`core/matching/acoustid_candidates.py:find_matching_recording`. Both
verifier and scanner can now ask "given these candidates, does ANY
of them match expected (title, artist)?" with the same contract.
Scanner suppresses the finding when any candidate matches.

Bug 2 — no duration check guards against fingerprint hash collisions

Foxxify case 3: 17-minute mashup edit fingerprinted to a 5-minute
late-70s Japanese hiphop track (different songs, fingerprint hash
collision on a sampled section). Scanner had no signal to detect
this and would have recommended retagging the 17-min file as the
5-min track.

`duration_mismatches_strongly` in the same helper module flags drifts
beyond max(60s, 35%). Scanner now skips findings when the candidate's
duration disagrees strongly with the file's expected duration. Loaded
duration via the existing tracks SQL (added `t.duration` to the
SELECT). Returns False when either side is unknown — no behavior
change for older rows without duration data.

Bug 3 — scanner retag bypassed multi-value ARTISTS tag setting

`core/repair_worker.py:_fix_wrong_song` called `write_tags_to_file`
with single-string artist updates. The writer only wrote TPE1
(single string) and never read the user's
`metadata_enhancement.tags.write_multi_artist` config. Multi-value
ARTISTS tags got stripped on every retag, contradicting the
post-download enrichment pipeline's behavior.

Per codex's pick (option B over routing through enhance_file_metadata),
extended `write_tags_to_file` with an optional `artists_list`
parameter. Each format-specific writer respects the config flag the
same way enrichment.py does:
- ID3: TPE1 stays as joined display string + TXXX:Artists multi-value
- Vorbis/Opus/FLAC: `artist` display string + `artists` multi-value key
- MP4: \xa9ART as list when on, single string when off

Scanner retag derives the per-artist list by splitting AcoustID's
credit through the existing `split_artist_credit` helper (same
separators the matching layer already uses).

Backward compatible: callers that don't pass `artists_list` get the
exact same single-string write as before. No regression for the
write_artist_image button or any other tag_writer caller.

15 tests on the candidate helper + duration guard.
13 tests on the tag_writer multi-value path (write/skip/single/
no-list cases for FLAC + the config-gate helper).
4 new scanner regression tests pinning lower-ranked candidate
suppression, no-suppression when no candidate matches, duration
mismatch skip, no-skip when duration matches.

Existing scanner tests updated for the new 11-column SQL select
(added duration column to fake schema + test row tuples).

Full suite: 3097 passed. Ruff clean.
pull/598/head
Broque Thomas 1 month ago
parent 1fc4c4313b
commit 9cc09118bf

@ -0,0 +1,143 @@
"""Find a matching AcoustID candidate for an expected (title, artist).
AcoustID returns multiple recordings per fingerprint same audio can
correspond to multiple MusicBrainz recordings (different releases,
different metadata-quality entries, sample / cover-version collisions).
The "top" recording AcoustID returns isn't always the one whose
metadata matches the user's expected track.
Both the post-download verifier (`core/acoustid_verification.py`) and
the AcoustID library scanner (`core/repair_jobs/acoustid_scanner.py`)
need to ask: "given these candidates, does ANY of them match
(expected_title, expected_artist) by title+artist similarity?" The
verifier had its own inline loop; the scanner only checked the top
match false positives whenever the wrong-credited recording out-
ranked the right-credited one.
This module is the single shared boundary for that question.
"""
from __future__ import annotations
from typing import Any, Callable, Dict, Iterable, Optional, Tuple
from utils.logging_config import get_logger
logger = get_logger("matching.acoustid_candidates")
def find_matching_recording(
recordings: Iterable[Dict[str, Any]],
expected_title: str,
expected_artist: str,
*,
title_threshold: float = 0.70,
artist_threshold: float = 0.60,
similarity: Optional[Callable[[str, str], float]] = None,
artist_similarity: Optional[Callable[[str, str], float]] = None,
skip_predicate: Optional[Callable[[Dict[str, Any]], bool]] = None,
) -> Tuple[Optional[Dict[str, Any]], float, float]:
"""Return the first AcoustID candidate whose metadata passes both
title + artist similarity thresholds.
Args:
recordings: AcoustID recording dicts. Each must carry ``title``
and ``artist`` strings; entries without both are skipped.
expected_title: The track title the caller expected.
expected_artist: The artist the caller expected.
title_threshold: Minimum title similarity to accept (default 0.70).
artist_threshold: Minimum artist similarity to accept (default 0.60).
similarity: ``(a, b) -> float`` for title comparison. Defaults
to a lowercase exact-equals stub when not supplied callers
should pass their stricter normaliser (verifier passes its
parenthetical-stripping ``_similarity``; scanner passes
its own).
artist_similarity: ``(expected, actual) -> float`` for artist
comparison. Lets callers supply alias-aware comparison
(verifier wraps ``_alias_aware_artist_sim``; scanner wraps
``artist_names_match``). Defaults to ``similarity`` if
unset.
skip_predicate: Optional ``(recording_dict) -> bool``. When
truthy, the candidate is skipped (used by the verifier to
drop wrong-version recordings instrumental vs vocal etc).
Returns:
``(recording, title_sim, artist_sim)`` for the first matching
candidate, or ``(None, best_title_sim, best_artist_sim)`` when
none match. The non-None ``best_*`` values let callers report
the closest near-miss when they need to log why nothing matched.
Iteration order matches the input order (typically AcoustID's own
fingerprint-confidence ranking). Returns on first match does NOT
score every candidate looking for the highest sim.
"""
if not expected_title or not expected_artist:
return None, 0.0, 0.0
sim = similarity or _default_similarity
asim = artist_similarity or sim
best_title_sim = 0.0
best_artist_sim = 0.0
for rec in recordings or ():
if not isinstance(rec, dict):
continue
rec_title = (rec.get('title') or '').strip()
rec_artist = (rec.get('artist') or '').strip()
if not rec_title or not rec_artist:
continue
if skip_predicate and skip_predicate(rec):
continue
title_sim = sim(expected_title, rec_title)
if title_sim > best_title_sim:
best_title_sim = title_sim
artist_sim = asim(expected_artist, rec_artist)
if artist_sim > best_artist_sim:
best_artist_sim = artist_sim
if title_sim >= title_threshold and artist_sim >= artist_threshold:
return rec, title_sim, artist_sim
return None, best_title_sim, best_artist_sim
def _default_similarity(a: str, b: str) -> float:
if not a or not b:
return 0.0
return 1.0 if a.lower().strip() == b.lower().strip() else 0.0
# ────────────────────────────────────────────────────────────────────
# Duration guard — codex item (5).
# ────────────────────────────────────────────────────────────────────
def duration_mismatches_strongly(
expected_seconds: Optional[float],
candidate_seconds: Optional[float],
*,
abs_tolerance_s: float = 60.0,
rel_tolerance: float = 0.35,
) -> bool:
"""Return True when the candidate's duration is too far from expected
to confidently treat it as the same recording.
Catches fingerprint hash collisions (the reporter's 17-minute
mashup 5-minute Japanese hiphop track case). When EITHER duration
is unknown / non-positive, returns False no behavior change.
Threshold: drift greater than max(``abs_tolerance_s``,
``rel_tolerance * expected``). The relative term scales with track
length so a 20% mismatch on a 3-minute track and a 20% mismatch on
a 30-minute mix are both treated as suspicious.
"""
if not expected_seconds or expected_seconds <= 0:
return False
if not candidate_seconds or candidate_seconds <= 0:
return False
drift = abs(float(candidate_seconds) - float(expected_seconds))
threshold = max(abs_tolerance_s, rel_tolerance * float(expected_seconds))
return drift > threshold

@ -254,6 +254,74 @@ class AcoustIDScannerJob(RepairJob):
if title_sim >= title_threshold and artist_sim >= artist_threshold:
return
# Issue #587 (Foxxify) — top recording's metadata mismatched, but
# AcoustID often returns multiple recordings per fingerprint
# (sample collisions, multi-MB-record cases). Check ALL of them
# before flagging — if any candidate's metadata matches expected
# title + artist, the file IS the right song and AcoustID's top
# match was just a wrong-credited recording.
from core.matching.acoustid_candidates import (
duration_mismatches_strongly,
find_matching_recording,
)
from core.matching.artist_aliases import artist_names_match
def _scanner_title_sim(a, b):
return SequenceMatcher(None, _normalize(a), _normalize(b)).ratio()
def _scanner_artist_sim(expected_a, actual_a):
_, score = artist_names_match(expected_a, actual_a, threshold=artist_threshold)
return score
candidate_match, _, _ = find_matching_recording(
fp_result.get('recordings') or [],
expected['title'],
expected_artist,
title_threshold=title_threshold,
artist_threshold=artist_threshold,
similarity=_scanner_title_sim,
artist_similarity=_scanner_artist_sim,
)
if candidate_match is not None:
# A lower-ranked candidate matched — file IS the right song.
# No finding.
if context.report_progress:
context.report_progress(
log_line=(
f'Resolved (lower-ranked candidate match): {fname}'
f'expected "{expected["title"]}" matched candidate '
f'"{candidate_match.get("title")}" by '
f'"{candidate_match.get("artist")}"'
),
log_type='ok',
)
return
# Issue #587 (Foxxify "17min mashup → 5min track") — duration
# guard against fingerprint hash collisions. When the file's
# actual duration differs from AcoustID's matched recording by
# more than max(60s, 35%), the fingerprint is almost certainly
# a sample/intro collision, not a real recording match. Don't
# produce a confident "Wrong Song" finding.
try:
file_duration_s = (expected.get('duration_ms') or 0) / 1000.0
except Exception:
file_duration_s = 0
candidate_duration_s = best_recording.get('duration')
if candidate_duration_s is None and best_recording.get('length'):
candidate_duration_s = best_recording.get('length')
if duration_mismatches_strongly(file_duration_s, candidate_duration_s):
if context.report_progress:
context.report_progress(
log_line=(
f'Skipped (duration mismatch suggests fingerprint collision): '
f'{fname} — expected {file_duration_s:.0f}s, AcoustID '
f'candidate {candidate_duration_s:.0f}s'
),
log_type='skip',
)
return
# Mismatch detected
if context.report_progress:
context.report_progress(
@ -326,7 +394,8 @@ class AcoustIDScannerJob(RepairJob):
t.file_path, t.track_number,
al.title AS album_title, al.thumb_url, ar.thumb_url,
NULLIF(t.track_artist, '') AS track_artist,
ar.name AS album_artist
ar.name AS album_artist,
t.duration
FROM tracks t
LEFT JOIN artists ar ON ar.id = t.artist_id
LEFT JOIN albums al ON al.id = t.album_id
@ -352,6 +421,11 @@ class AcoustIDScannerJob(RepairJob):
'artist_thumb_url': row[7] or None,
'track_artist': row[8] or '', # raw (may be empty)
'album_artist': row[9] or '',
# Duration in MS (DB stores ms). Used by the
# duration-mismatch guard to spot fingerprint
# collisions where the matched recording is a
# totally different length.
'duration_ms': row[10] or 0,
}
except Exception as e:
logger.error("Error loading tracks from DB: %s", e)

@ -35,6 +35,23 @@ logger = get_logger("repair_worker")
AUDIO_EXTENSIONS = {'.mp3', '.flac', '.ogg', '.opus', '.m4a', '.aac', '.wav', '.wma', '.aiff', '.aif'}
def _split_acoustid_credit(credit: str) -> List[str]:
"""Split an AcoustID artist credit into individual contributor names.
Reuses the matching layer's credit splitter so the AcoustID retag
path tags multi-artist tracks the same way the post-download
enrichment pipeline does (comma / ampersand / feat. / etc).
Returns ``[credit]`` for single-artist credits the writer's
``len > 1`` check is what gates whether the multi-value tag gets
written.
"""
try:
from core.matching.artist_aliases import split_artist_credit
return split_artist_credit(credit)
except Exception:
return [credit] if credit else []
def _resolve_file_path(file_path, transfer_folder, download_folder=None,
config_manager=None, plex_client=None):
"""Resolve a stored DB path to an actual file on disk.
@ -1715,6 +1732,17 @@ class RepairWorker:
tag_updates = {'title': aid_title}
if aid_artist:
tag_updates['artist_name'] = aid_artist
# Issue #587 — derive a per-artist list from
# AcoustID's credit string when it carries
# multiple contributors. The post-download
# enrichment pipeline preserves multi-value
# ARTISTS tags via the user's
# `write_multi_artist` setting; the repair
# path was bypassing that and writing a
# single-string TPE1 only. Now respects the
# same setting via the writer's new
# `artists_list` derivation.
tag_updates['artists_list'] = _split_acoustid_credit(aid_artist)
write_tags_to_file(resolved, tag_updates)
logger.info("Wrote corrected tags to file: %s", resolved)
except Exception as tag_err:

@ -299,18 +299,31 @@ def write_tags_to_file(file_path: str, db_data: Dict[str, Any],
elif isinstance(genres, str):
genre_str = genres
# Multi-value artist support — issue #587. Caller can pass
# `artists_list` (per-track list of contributor names) and the
# writer respects the user's `metadata_enhancement.tags.write_multi_artist`
# config the same way the post-download enrichment pipeline does.
# When the setting is on AND the list has >1 entry:
# - ID3 keeps TPE1 as the joined display string (already in `artist`)
# and writes a separate TXXX:Artists frame with the list
# - Vorbis writes an `artists` multi-value key alongside `artist`
# - MP4 writes \xa9ART as the list when on, single string when off
# When OFF or the list is empty/single — same single-string write
# as before. Backward compatible for callers that don't pass it.
artists_list = _resolve_artists_list_for_write(db_data)
if isinstance(audio.tags, ID3):
written = _write_id3(audio, title, artist, album_artist, album,
year, genre_str, track_num, total_tracks,
disc_num, bpm)
disc_num, bpm, artists_list=artists_list)
elif isinstance(audio, (FLAC, OggVorbis)) or type(audio).__name__ == 'OggOpus':
written = _write_vorbis(audio, title, artist, album_artist, album,
year, genre_str, track_num, total_tracks,
disc_num, bpm)
disc_num, bpm, artists_list=artists_list)
elif isinstance(audio, MP4):
written = _write_mp4(audio, title, artist, album_artist, album,
year, genre_str, track_num, total_tracks,
disc_num, bpm)
disc_num, bpm, artists_list=artists_list)
# Embed cover art if requested
if embed_cover:
@ -343,8 +356,43 @@ def write_tags_to_file(file_path: str, db_data: Dict[str, Any],
# ── Format-specific writers ──
def _resolve_artists_list_for_write(db_data: Dict[str, Any]) -> Optional[List[str]]:
"""Pull a multi-value artists list from db_data when caller supplied one.
Accepts either ``artists_list`` (list of names) or ``artists`` (same
shape kept for symmetry with the post-process pipeline's
``_artists_list`` field). Drops empty / non-string entries. Returns
``None`` when no list was supplied so format writers can branch on
"single-string only" vs "multi-value too".
"""
raw = db_data.get('artists_list') or db_data.get('artists') or db_data.get('_artists_list')
if not raw:
return None
if not isinstance(raw, (list, tuple)):
return None
cleaned = []
for entry in raw:
if isinstance(entry, str):
text = entry.strip()
if text:
cleaned.append(text)
return cleaned or None
def _multi_artist_write_enabled() -> bool:
"""Read the same config flag the enrichment pipeline reads, so the
repair-path retag respects the user's choice."""
try:
from config.settings import config_manager
return bool(config_manager.get('metadata_enhancement.tags.write_multi_artist', False))
except Exception:
return False
def _write_id3(audio, title, artist, album_artist, album, year, genre,
track_num, total_tracks, disc_num, bpm) -> List[str]:
track_num, total_tracks, disc_num, bpm,
artists_list: Optional[List[str]] = None) -> List[str]:
written = []
if title:
audio.tags.delall('TIT2')
@ -354,6 +402,16 @@ def _write_id3(audio, title, artist, album_artist, album, year, genre,
audio.tags.delall('TPE1')
audio.tags.add(TPE1(encoding=3, text=[artist]))
written.append('artist')
# TPE1 stays as the joined display string. When the caller
# supplied a multi-value list AND the user has the
# write_multi_artist setting on, ALSO write the per-artist
# list to a TXXX:Artists frame (Picard convention). Mirrors
# the post-download enrichment writer at
# core/metadata/enrichment.py.
if artists_list and len(artists_list) > 1 and _multi_artist_write_enabled():
audio.tags.delall('TXXX:Artists')
audio.tags.add(TXXX(encoding=3, desc='Artists', text=list(artists_list)))
written.append('artists_multi')
if album_artist:
audio.tags.delall('TPE2')
audio.tags.add(TPE2(encoding=3, text=[album_artist]))
@ -387,7 +445,8 @@ def _write_id3(audio, title, artist, album_artist, album, year, genre,
def _write_vorbis(audio, title, artist, album_artist, album, year, genre,
track_num, total_tracks, disc_num, bpm) -> List[str]:
track_num, total_tracks, disc_num, bpm,
artists_list: Optional[List[str]] = None) -> List[str]:
written = []
if title:
audio['title'] = [title]
@ -395,6 +454,13 @@ def _write_vorbis(audio, title, artist, album_artist, album, year, genre,
if artist:
audio['artist'] = [artist]
written.append('artist')
# Vorbis-style multi-value: write the per-artist list to the
# `artists` key (separate from `artist`, picard convention) when
# the caller supplied a list AND the user has multi-value write
# enabled. Mirrors enrichment.py.
if artists_list and len(artists_list) > 1 and _multi_artist_write_enabled():
audio['artists'] = list(artists_list)
written.append('artists_multi')
if album_artist:
audio['albumartist'] = [album_artist]
written.append('album_artist')
@ -421,14 +487,24 @@ def _write_vorbis(audio, title, artist, album_artist, album, year, genre,
def _write_mp4(audio, title, artist, album_artist, album, year, genre,
track_num, total_tracks, disc_num, bpm) -> List[str]:
track_num, total_tracks, disc_num, bpm,
artists_list: Optional[List[str]] = None) -> List[str]:
written = []
if title:
audio['\xa9nam'] = [title]
written.append('title')
if artist:
audio['\xa9ART'] = [artist]
written.append('artist')
# MP4 \xa9ART can carry a list directly. When caller supplied
# a multi-value list AND user has multi-value write enabled,
# write the list. Otherwise single-string. Mirrors enrichment.py
# MP4 path.
if artists_list and len(artists_list) > 1 and _multi_artist_write_enabled():
audio['\xa9ART'] = list(artists_list)
written.append('artist')
written.append('artists_multi')
else:
audio['\xa9ART'] = [artist]
written.append('artist')
if album_artist:
audio['aART'] = [album_artist]
written.append('album_artist')

@ -0,0 +1,201 @@
"""Tests for the shared AcoustID candidate-matching helper.
Issue #587 / Foxxify report — scanner used to treat ``recordings[0]``
as authoritative, so when AcoustID returned multiple candidates and
the top one was the wrong-credited recording (different MB entry
under the same fingerprint), the scanner created a false-positive
"Wrong download" finding even though a lower-ranked candidate matched
the expected metadata exactly.
"""
from __future__ import annotations
from difflib import SequenceMatcher
import pytest
from core.matching.acoustid_candidates import (
duration_mismatches_strongly,
find_matching_recording,
)
def _ratio_sim(a: str, b: str) -> float:
"""Reasonable test similarity that handles non-trivial differences."""
if not a or not b:
return 0.0
return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
# ──────────────────────────────────────────────────────────────────────
# find_matching_recording
# ──────────────────────────────────────────────────────────────────────
def test_top_recording_matches_returns_immediately():
recordings = [
{'title': 'Nana', 'artist': 'Geoxor'},
{'title': 'Nana', 'artist': 'Edward Vesala Trio'},
]
result, t_sim, a_sim = find_matching_recording(
recordings, 'Nana', 'Geoxor', similarity=_ratio_sim,
)
assert result == {'title': 'Nana', 'artist': 'Geoxor'}
assert t_sim == 1.0
assert a_sim == 1.0
def test_falls_through_to_lower_ranked_match_for_foxxify_nana_case():
"""Reporter case 2: top AcoustID candidate is 'Nana' by 'Edward
Vesala Trio' (97% fingerprint), but the LOWER-ranked candidate
is the expected 'Nana' by 'Geoxor'. Pre-fix scanner saw only [0]
and flagged. Post-fix returns the matching candidate."""
recordings = [
{'title': 'Nana', 'artist': 'Edward Vesala Trio'}, # AcoustID's top match
{'title': 'Nana', 'artist': 'Geoxor'}, # the actual right one
]
result, _, _ = find_matching_recording(
recordings, 'Nana', 'Geoxor', similarity=_ratio_sim,
)
assert result == {'title': 'Nana', 'artist': 'Geoxor'}
def test_no_match_returns_none_with_best_seen_sims():
"""When no candidate passes thresholds, return the best-seen sims
so callers can log the closest near-miss in the finding."""
recordings = [
{'title': 'Different Song', 'artist': 'Different Artist'},
{'title': 'Sort Of Close', 'artist': 'Different Artist'},
]
result, t_sim, a_sim = find_matching_recording(
recordings, 'Different', 'AnotherArtist',
similarity=_ratio_sim,
title_threshold=0.95,
artist_threshold=0.95,
)
assert result is None
# Best seen — even though no candidate passed the threshold
assert t_sim > 0.0
assert a_sim > 0.0
def test_skips_recordings_missing_title_or_artist():
recordings = [
{'title': None, 'artist': 'Geoxor'},
{'title': 'Nana', 'artist': ''},
{'title': 'Nana', 'artist': 'Geoxor'},
]
result, _, _ = find_matching_recording(
recordings, 'Nana', 'Geoxor', similarity=_ratio_sim,
)
assert result == {'title': 'Nana', 'artist': 'Geoxor'}
def test_skips_non_dict_entries():
recordings = [None, 'string', {'title': 'Nana', 'artist': 'Geoxor'}]
result, _, _ = find_matching_recording(
recordings, 'Nana', 'Geoxor', similarity=_ratio_sim,
)
assert result == {'title': 'Nana', 'artist': 'Geoxor'}
def test_empty_inputs_return_none():
assert find_matching_recording([], 'X', 'Y')[0] is None
assert find_matching_recording([{'title': 'X', 'artist': 'Y'}], '', 'Y')[0] is None
assert find_matching_recording([{'title': 'X', 'artist': 'Y'}], 'X', '')[0] is None
def test_separate_artist_similarity_function_is_honored():
"""Verifier passes alias-aware comparison via artist_similarity.
Make sure it's used instead of the generic similarity."""
recordings = [{'title': 'Track', 'artist': '澤野弘之'}]
def alias_aware(expected, actual):
# Pretend our alias chain bridges Hiroyuki Sawano ↔ 澤野弘之
if expected == 'Hiroyuki Sawano' and actual == '澤野弘之':
return 1.0
return 0.0
result, _, a_sim = find_matching_recording(
recordings, 'Track', 'Hiroyuki Sawano',
similarity=_ratio_sim,
artist_similarity=alias_aware,
)
assert result is not None
assert a_sim == 1.0
def test_skip_predicate_drops_unwanted_candidates():
"""Verifier uses skip_predicate to drop wrong-version recordings
(instrumental vs vocal, etc.)."""
recordings = [
{'title': 'Track (Instrumental)', 'artist': 'X'},
{'title': 'Track', 'artist': 'X'},
]
result, _, _ = find_matching_recording(
recordings, 'Track', 'X',
similarity=_ratio_sim,
skip_predicate=lambda r: 'instrumental' in (r.get('title') or '').lower(),
)
assert result == {'title': 'Track', 'artist': 'X'}
def test_title_threshold_can_be_lowered_for_loose_matching():
recordings = [{'title': 'Sort Of Close', 'artist': 'Right Artist'}]
# With strict default threshold this fails
result_strict, _, _ = find_matching_recording(
recordings, 'Different', 'Right Artist', similarity=_ratio_sim,
)
assert result_strict is None
# With a permissive threshold the artist match alone wouldn't help —
# title sim must also pass.
result_loose, _, _ = find_matching_recording(
recordings, 'Different', 'Right Artist',
similarity=_ratio_sim, title_threshold=0.0,
)
assert result_loose is not None
# ──────────────────────────────────────────────────────────────────────
# duration_mismatches_strongly — guard against fingerprint collisions
# ──────────────────────────────────────────────────────────────────────
def test_durations_within_tolerance_pass():
# 3-minute track, 1-second drift — well within tolerance
assert duration_mismatches_strongly(180, 181) is False
# 3-minute vs 4-minute — within the 60s absolute tolerance
assert duration_mismatches_strongly(180, 240) is False
def test_drift_above_absolute_floor_flags():
# 3-minute expected, 5-minute candidate (120s drift > 63s threshold)
assert duration_mismatches_strongly(180, 300) is True
def test_relative_tolerance_scales_with_long_tracks():
# 30-minute expected vs 12-minute candidate (1080s vs 720s) —
# 18-minute drift > 35% of 30min = 10.5min → mismatch
assert duration_mismatches_strongly(1800, 720) is True
# 30-minute expected vs 28-minute candidate — 2min drift = under
# max(60s, 35%*30min) = max(60, 630) = 630s → still safe
assert duration_mismatches_strongly(1800, 1680) is False
def test_reporter_17min_mashup_vs_5min_track_flagged():
"""Foxxify's 17min mashup edit vs 5min late-70s Japanese hiphop —
fingerprint collision. Duration guard should mark this suspicious."""
assert duration_mismatches_strongly(17 * 60, 5 * 60) is True
def test_unknown_duration_returns_false_no_behavior_change():
"""When either side is missing duration, don't change behavior."""
assert duration_mismatches_strongly(None, 300) is False
assert duration_mismatches_strongly(180, None) is False
assert duration_mismatches_strongly(0, 300) is False
assert duration_mismatches_strongly(180, 0) is False
assert duration_mismatches_strongly(-5, 300) is False
def test_string_or_int_durations_handled():
# Defensive — coerce numeric types
assert duration_mismatches_strongly(180.5, 181.0) is False
assert duration_mismatches_strongly(int(180), int(300)) is True

@ -54,11 +54,11 @@ def _make_context(rows):
def test_load_db_tracks_skips_null_ids_and_normalizes_track_ids():
job = AcoustIDScannerJob()
context = _make_context([
# 10 columns: id, title, artist (COALESCE'd), file_path, track_number,
# 11 columns: id, title, artist (COALESCE'd), file_path, track_number,
# album_title, album_thumb, artist_thumb, track_artist (raw, may be ''),
# album_artist.
(None, "Broken Track", "Artist", "/music/broken.flac", 1, "Album", None, None, "", "Artist"),
(42, "Good Track", "Artist", "/music/good.flac", 2, "Album", "album-thumb", "artist-thumb", "", "Artist"),
# album_artist, duration_ms (issue #587 — duration guard).
(None, "Broken Track", "Artist", "/music/broken.flac", 1, "Album", None, None, "", "Artist", 180000),
(42, "Good Track", "Artist", "/music/good.flac", 2, "Album", "album-thumb", "artist-thumb", "", "Artist", 240000),
])
tracks = job._load_db_tracks(context)
@ -66,16 +66,17 @@ def test_load_db_tracks_skips_null_ids_and_normalizes_track_ids():
assert list(tracks.keys()) == ["42"]
assert tracks["42"]["title"] == "Good Track"
assert tracks["42"]["artist"] == "Artist"
assert tracks["42"]["duration_ms"] == 240000
def test_scan_handles_mixed_track_id_types(monkeypatch):
job = AcoustIDScannerJob()
context = _make_context([
# 10 columns: id, title, artist (COALESCE'd), file_path, track_number,
# 11 columns: id, title, artist (COALESCE'd), file_path, track_number,
# album_title, album_thumb, artist_thumb, track_artist (raw, may be ''),
# album_artist.
(None, "Broken Track", "Artist", "/music/broken.flac", 1, "Album", None, None, "", "Artist"),
(42, "Good Track", "Artist", "/music/good.flac", 2, "Album", "album-thumb", "artist-thumb", "", "Artist"),
# album_artist, duration_ms.
(None, "Broken Track", "Artist", "/music/broken.flac", 1, "Album", None, None, "", "Artist", 180000),
(42, "Good Track", "Artist", "/music/good.flac", 2, "Album", "album-thumb", "artist-thumb", "", "Artist", 240000),
])
monkeypatch.setattr(job, "_resolve_path", lambda file_path, _context: file_path)
@ -275,7 +276,8 @@ def _make_real_db_context(tmp_path):
album_id TEXT,
file_path TEXT,
track_number INTEGER,
track_artist TEXT
track_artist TEXT,
duration INTEGER
);
""")
conn.commit()
@ -620,3 +622,160 @@ def test_scanner_file_tag_matches_db_no_behavioral_change(monkeypatch):
)
assert captured_findings == []
# ---------------------------------------------------------------------------
# Issue #587 — multi-candidate scan + duration guard (Foxxify report)
# ---------------------------------------------------------------------------
def test_scanner_no_finding_when_lower_ranked_candidate_matches():
"""Foxxify case 2 — AcoustID returns multiple recordings per
fingerprint; the top match is the wrong-credited recording but a
lower-ranked candidate matches expected metadata exactly. Scanner
should iterate ALL candidates and suppress the finding.
Repro: file is "Nana" by Geoxor, AcoustID top match is "Nana" by
Edward Vesala Trio (different recording sharing similar
fingerprint), AcoustID's second candidate is the actual Geoxor
track. Pre-fix scanner only saw [0] flagged. Post-fix sees [1]
no flag."""
job = AcoustIDScannerJob()
captured_findings = []
context = _make_finding_capturing_context(
track_row=("nana", "Nana", "Geoxor",
"/music/nana.opus", 6, "Stardust", None, None),
captured=captured_findings,
)
fake_acoustid = SimpleNamespace(
fingerprint_and_lookup=lambda fpath: {
'best_score': 0.97,
'recordings': [
# AcoustID's top match — wrong artist for our file
{'title': 'Nana', 'artist': 'Edward Vesala Trio'},
# Lower-ranked candidate — actually matches our expected
{'title': 'Nana', 'artist': 'Geoxor'},
],
},
)
result = JobResultStub()
job._scan_file(
'/music/nana.opus', 'nana',
{'title': 'Nana', 'artist': 'Geoxor'},
fake_acoustid, context, result,
fp_threshold=0.85, title_threshold=0.85, artist_threshold=0.6,
)
assert captured_findings == [], (
f"Expected no finding (lower-ranked candidate matches); got {captured_findings}"
)
def test_scanner_still_flags_when_no_candidate_matches():
"""Confirm the multi-candidate check doesn't accidentally suppress
legitimate mismatches if NO candidate matches expected metadata,
the finding still fires."""
job = AcoustIDScannerJob()
captured_findings = []
context = _make_finding_capturing_context(
track_row=("99", "Expected Title", "Expected Artist",
"/music/track.flac", 1, "Album", None, None),
captured=captured_findings,
)
fake_acoustid = SimpleNamespace(
fingerprint_and_lookup=lambda fpath: {
'best_score': 0.99,
'recordings': [
{'title': 'Wrong Track', 'artist': 'Wrong Artist A'},
{'title': 'Different Wrong', 'artist': 'Wrong Artist B'},
],
},
)
result = JobResultStub()
job._scan_file(
'/music/track.flac', '99',
{'title': 'Expected Title', 'artist': 'Expected Artist'},
fake_acoustid, context, result,
fp_threshold=0.85, title_threshold=0.85, artist_threshold=0.6,
)
assert len(captured_findings) == 1
def test_scanner_skips_finding_on_strong_duration_mismatch():
"""Foxxify case 3 — 17-minute mashup edit fingerprints to a 5-minute
late-70s Japanese hiphop track. Fingerprint matched a sample/intro
section but the recordings are clearly different (drastic length
difference). Scanner should skip the finding rather than recommend
retag of a totally different track length."""
job = AcoustIDScannerJob()
captured_findings = []
context = _make_finding_capturing_context(
track_row=("mashup", "Some Mashup Edit", "Mashup Artist",
"/music/mashup.opus", 1, "Mashups", None, None),
captured=captured_findings,
)
# AcoustID matched a 5-minute Japanese hiphop track via fingerprint
# hash collision. Expected file is 17 minutes — duration guard
# should kick in.
fake_acoustid = SimpleNamespace(
fingerprint_and_lookup=lambda fpath: {
'best_score': 0.98,
'recordings': [
{'title': 'Different Song', 'artist': 'Different Artist',
'duration': 300}, # 5 min — way off from our 17 min file
],
},
)
result = JobResultStub()
# 17 minutes = 1020 sec = 1020000 ms
job._scan_file(
'/music/mashup.opus', 'mashup',
{'title': 'Some Mashup Edit', 'artist': 'Mashup Artist', 'duration_ms': 1020000},
fake_acoustid, context, result,
fp_threshold=0.85, title_threshold=0.85, artist_threshold=0.6,
)
assert captured_findings == [], (
f"Expected no finding (duration mismatch suggests collision); got {captured_findings}"
)
def test_scanner_still_flags_when_duration_matches():
"""Confirm the duration guard only kicks in for STRONG mismatches —
similar-length wrong song still gets flagged."""
job = AcoustIDScannerJob()
captured_findings = []
context = _make_finding_capturing_context(
track_row=("99", "Expected", "Artist",
"/music/track.flac", 1, "Album", None, None),
captured=captured_findings,
)
fake_acoustid = SimpleNamespace(
fingerprint_and_lookup=lambda fpath: {
'best_score': 0.99,
'recordings': [
{'title': 'Wrong Song', 'artist': 'Wrong Artist',
'duration': 180}, # 3 min, matches expected
],
},
)
result = JobResultStub()
# 3-minute file with 3-minute candidate — same length, but title +
# artist clearly mismatch → finding should still fire
job._scan_file(
'/music/track.flac', '99',
{'title': 'Expected', 'artist': 'Artist', 'duration_ms': 180000},
fake_acoustid, context, result,
fp_threshold=0.85, title_threshold=0.85, artist_threshold=0.6,
)
assert len(captured_findings) == 1

@ -0,0 +1,175 @@
"""Tests for the multi-value artist write path in tag_writer.
Issue #587 — the AcoustID scanner's "Apply Match" retag was bypassing
the user's `metadata_enhancement.tags.write_multi_artist` setting and
writing single-string TPE1 only. The repair-path retag now passes an
``artists_list`` to ``write_tags_to_file`` and the writer respects the
config flag the same way the post-download enrichment pipeline does.
"""
from __future__ import annotations
import os
import tempfile
from unittest.mock import patch
import pytest
from mutagen.flac import FLAC, StreamInfo
from core.tag_writer import (
_multi_artist_write_enabled,
_resolve_artists_list_for_write,
write_tags_to_file,
)
# ──────────────────────────────────────────────────────────────────────
# _resolve_artists_list_for_write — derives the multi-value list
# ──────────────────────────────────────────────────────────────────────
def test_resolves_artists_list_field():
assert _resolve_artists_list_for_write({'artists_list': ['A', 'B']}) == ['A', 'B']
def test_resolves_artists_field_alias():
assert _resolve_artists_list_for_write({'artists': ['A', 'B']}) == ['A', 'B']
def test_resolves_underscore_artists_list_field():
# _artists_list — the post-process pipeline's internal field name
assert _resolve_artists_list_for_write({'_artists_list': ['A', 'B']}) == ['A', 'B']
def test_returns_none_when_no_list_supplied():
assert _resolve_artists_list_for_write({'artist_name': 'Solo'}) is None
def test_returns_none_for_non_list_input():
assert _resolve_artists_list_for_write({'artists_list': 'A, B'}) is None
assert _resolve_artists_list_for_write({'artists_list': {'A': 1}}) is None
def test_strips_empty_and_non_string_entries():
out = _resolve_artists_list_for_write({'artists_list': ['A', '', None, ' ', 'B']})
assert out == ['A', 'B']
def test_returns_none_when_list_only_has_empty_entries():
assert _resolve_artists_list_for_write({'artists_list': ['', None, ' ']}) is None
# ──────────────────────────────────────────────────────────────────────
# _multi_artist_write_enabled — config gate
# ──────────────────────────────────────────────────────────────────────
def test_multi_artist_write_reads_config():
with patch('config.settings.config_manager.get', return_value=True):
assert _multi_artist_write_enabled() is True
with patch('config.settings.config_manager.get', return_value=False):
assert _multi_artist_write_enabled() is False
def test_multi_artist_write_swallows_config_error():
with patch('config.settings.config_manager.get', side_effect=RuntimeError('boom')):
assert _multi_artist_write_enabled() is False
# ──────────────────────────────────────────────────────────────────────
# Vorbis end-to-end — write multi-value to a real FLAC file
# ──────────────────────────────────────────────────────────────────────
def _make_minimal_flac(path):
"""Create a tiny but valid FLAC with mutagen's lowest-overhead
stream info so we can write tags and read them back."""
# Empty audio body — just enough to satisfy mutagen's parser.
# 44-byte FLAC header + STREAMINFO block.
minimal = (
b'fLaC'
+ b'\x80\x00\x00\x22' # last STREAMINFO block, length 34
+ b'\x00\x10\x00\x10' # min/max block size
+ b'\x00\x00\x00\x00\x00\x00' # min/max frame size
+ b'\x0a\xc4\x42\xf0\x00\x00\x00\x00' # sample rate / channels / etc
+ b'\x00' * 16 # MD5
)
with open(path, 'wb') as f:
f.write(minimal)
@pytest.fixture
def flac_path():
fd, path = tempfile.mkstemp(suffix='.flac')
os.close(fd)
_make_minimal_flac(path)
yield path
try:
os.remove(path)
except OSError:
pass
def test_multi_value_artists_key_written_when_setting_on(flac_path):
with patch('config.settings.config_manager.get', return_value=True):
result = write_tags_to_file(flac_path, {
'title': 'Track',
'artist_name': 'Artist A, Artist B',
'artists_list': ['Artist A', 'Artist B'],
}, embed_cover=False)
assert result['success'] is True
assert 'artists_multi' in result['written_fields']
audio = FLAC(flac_path)
# Display string preserved as `artist`
assert audio.get('artist') == ['Artist A, Artist B']
# Multi-value list written to `artists` key (Picard convention)
assert audio.get('artists') == ['Artist A', 'Artist B']
def test_multi_value_artists_key_skipped_when_setting_off(flac_path):
with patch('config.settings.config_manager.get', return_value=False):
result = write_tags_to_file(flac_path, {
'title': 'Track',
'artist_name': 'Artist A, Artist B',
'artists_list': ['Artist A', 'Artist B'],
}, embed_cover=False)
assert result['success'] is True
assert 'artists_multi' not in result['written_fields']
audio = FLAC(flac_path)
assert audio.get('artist') == ['Artist A, Artist B']
# No multi-value key when setting is off — backward compat
assert audio.get('artists') is None
def test_single_artist_does_not_write_multi_value_key(flac_path):
"""When list has only one entry (or no list at all), don't
write the multi-value key even if the setting is on. The point
is to differentiate true multi-artist from single-artist tracks."""
with patch('config.settings.config_manager.get', return_value=True):
result = write_tags_to_file(flac_path, {
'title': 'Track',
'artist_name': 'Solo Artist',
'artists_list': ['Solo Artist'],
}, embed_cover=False)
assert result['success'] is True
assert 'artists_multi' not in result['written_fields']
audio = FLAC(flac_path)
assert audio.get('artists') is None
def test_no_artists_list_legacy_callers_unchanged(flac_path):
"""Backward compat — callers that don't supply artists_list get
the same single-string write as before. No regression for the
write_artist_image button or any other tag_writer caller."""
with patch('config.settings.config_manager.get', return_value=True):
result = write_tags_to_file(flac_path, {
'title': 'Track',
'artist_name': 'Solo Artist',
}, embed_cover=False)
assert result['success'] is True
assert 'artists_multi' not in result['written_fields']
audio = FLAC(flac_path)
assert audio.get('artists') is None

@ -3416,6 +3416,7 @@ const WHATS_NEW = {
'2.5.2': [
// --- May 13, 2026 — 2.5.2 release ---
{ date: 'May 13, 2026 — 2.5.2 release' },
{ title: 'AcoustID Scanner: Multi-Candidate Match + Duration Guard + Multi-Value Retag', desc: 'discord report (foxxify) issue #587: scanner produced false-positive "Wrong Song" findings for tracks where AcoustID returned multiple recordings per fingerprint and the top match was a wrong-credited recording (different MB entry sharing the same fingerprint). also: applying an AcoustID match retag stripped multi-value ARTISTS tags. three coordinated fixes per codex diagnosis. fix 1: scanner now iterates ALL AcoustID candidates (not just `recordings[0]`) — if any candidate matches expected title + artist, no finding. lifted to a shared pure helper `core/matching/acoustid_candidates.py:find_matching_recording` so both verifier and scanner use the same logic. fix 2: duration guard catches fingerprint hash collisions (foxxify\'s 17-minute mashup edit fingerprinted to a 5-minute late-70s japanese hiphop track — different songs, same fingerprint hash collision on a sampled section). when file duration vs AcoustID candidate duration differs by more than max(60s, 35%), the scanner skips the finding. fix 3: scanner retag (`_fix_wrong_song`) was bypassing the user\'s `metadata_enhancement.tags.write_multi_artist` setting because `write_tags_to_file` only wrote single-string TPE1. now extended with optional `artists_list` parameter that, when supplied + setting on, writes the multi-value tag (TXXX:Artists for ID3, `artists` key for vorbis/opus/flac, list-form `\xa9ART` for mp4) — exact same behavior as the post-download enrichment pipeline. AcoustID retag derives the per-artist list by splitting AcoustID\'s credit on the same separators (comma / ampersand / feat. / ft. / etc) the matching layer already uses. AcoustID version-mismatch gate left intact (still correctly catches genuinely-wrong files). 15 tests on the candidate helper + duration guard, 13 tests on the multi-value tag write path, 4 new scanner regression tests pinning every shape: lower-ranked candidate match suppression, no-suppression when no candidate matches, duration mismatch skip, no-skip when duration matches.', page: 'tools' },
{ title: 'Cross-Script Artist Aliases: Cyrillic / Kanji Canonical Names Now Bridge', desc: 'github issue #586 (follow-up to #442): "Dmitry Yablonsky" tracks were quarantining as audio mismatch — file identified as "Русская филармония, Дмитрий Яблонский" (4% artist similarity) — even though the Cyrillic spelling is just the russian transliteration. three layered bugs in the alias resolution chain. fix 1: `fetch_artist_aliases` only read `data.aliases` and IGNORED the artist record\'s canonical `name` and `sort-name`. for artists where the canonical form is the cross-script spelling (e.g. MB stores "Дмитрий Яблонский" as canonical, "Dmitry Yablonsky" as alias — or vice versa), the missing direction never made it into the alias list. now both canonical name + sort-name are included alongside the explicit aliases (deduped). fix 2: `lookup_artist_aliases` ran search in strict mode only (`artist:"..."` lucene query), which skips MB\'s alias and sortname indexes. cross-script searches found nothing under strict. now falls back to non-strict (bare query, hits all indexes) when strict returns empty OR all results fail the trust gate. fix 3: trust gate weighted local similarity 70% — cross-script pairs have similarity ~0 → combined score ~0.30 → below the 0.85 threshold → cached as empty even when MB\'s own confidence was 100. new escape: when MB score is ≥ 95 AND the result is unambiguous (top result clearly leads), accept regardless of local similarity. covers cases where MB definitely knows the right artist but our local sim collapses to zero. existing #442 same-script path (Hiroyuki Sawano ↔ 澤野弘之) still passes via combined-score path. 12 new tests pin every layer + the exact reporter scenario end-to-end via `artist_names_match`. existing alias tests updated to reflect canonical-name inclusion + 2-call strict+non-strict pattern.', page: 'downloads' },
{ title: 'MTV Unplugged & Live Albums No Longer False-Quarantine', desc: 'github issue #589: tracks from live / unplugged / concert albums (MTV Unplugged, Live At Wembley, etc) consistently failed AcoustID verification with "Version mismatch: expected (live) but file is (original)". two upstream bugs fed into the false positive — the AcoustID gate itself was correctly catching the wrong file Tidal had selected. fix 1: album-scoped library check at `core/downloads/master.py` was scoring "Shy Away (MTV Unplugged Live)" (source) vs "Shy Away" (local DB) with raw string similarity → ~0.3 → marked missing → re-downloaded even though user already owned it. new pure helper `core/matching/album_context_title.py:strip_redundant_album_suffix` strips suffixes whose tokens are fully subsumed by the album context (live/unplugged/acoustic/session markers + tolerated noise + album-title words). only fires inside the album-confirmed scope so global matching elsewhere is unchanged. fix 2: `core/tidal_download_client.py` qualifier filter only ran on FALLBACK searches — primary search returned all results unfiltered, so a query for "Shy Away (MTV Unplugged Live)" could accept the studio cut if Tidal ranked it first. now applies to both primary and fallback. fix 3: qualifier check now inspects both `track.name` AND `track.album.name` — for concert / unplugged releases the live signal often lives in the album title, not the track. AcoustID version-mismatch gate left intact (still correctly catches genuinely-wrong files). 19 tests on the album-context helper + 13 tests on the tidal qualifier helper pin every shape: MTV Unplugged variants, dash-style suffixes, brackets, year tolerance, plural-form markers, anti-regression cases (instrumental/remix on a studio album must NOT be stripped), defensive non-dict / missing-album inputs.', page: 'downloads' },
{ title: 'Deezer: Contributing Artist Tagging Now Consistent', desc: 'github issue #588: contributors tagging worked for some tracks but silently dropped them for others — most reproducibly for tracks whose ALBUM was fetched before the per-track post-process ran. trace: `core/deezer_client.py:get_track_details` cache check used `track_position` as the "full payload" sentinel, but BOTH `/track/<id>` AND `/album/<id>/tracks` set that field. only `/track/<id>` sets the `contributors` array. when album-tracks data hit the cache first, `get_track_details` returned the partial record → `_build_enhanced_track` found no contributors → the metadata-source contributors-upgrade silently fell back to single-artist. fix: lifted cache-validity to a pure helper `_is_full_track_payload` that requires BOTH `track_position` AND `contributors` key presence (empty list `[]` is valid — single-artist tracks fetched via `/track/<id>` carry it explicitly). partial cache hits now fall through to a fresh `/track/<id>` fetch. 11 boundary tests pin every shape: full payload, single-artist with empty contributors list, partial album-tracks shape, search-result shape, none/non-dict, cache-hit/cache-miss/api-failure paths.', page: 'downloads' },

Loading…
Cancel
Save