AcoustID scanner: prefer track_artist for compilation albums

Discord report (Skowl): downloaded a compilation album ("High Tea
Music: Vol 1") where every track has a different artist (Eclypse,
Andromedik, T & Sugah, Gourski, etc.) and the AcoustID scanner
flagged every single track as Wrong Song. The file tags had the
correct per-track artist (e.g. "Eclypse" for "City Lights"), but
the scanner compared against the album-level artist ("Andromedik",
the curator). Raw similarity 12% → Wrong Song flag.

# Why the prior multi-value fix didn't help

Foxxify's case (just-merged PR): AcoustID returned multi-value
credit "Okayracer, aldrch & poptropicaslutz!" — primary IS in the
credit. Splitting found it.

Skowl's case: both sides single-value but DIFFERENT artists.
Splitter has nothing to find — Eclypse simply isn't in "Andromedik".
Different bug.

# Cause

Scanner SQL at `core/repair_jobs/acoustid_scanner.py:281` joined
the `artists` table via `tracks.artist_id` which points at the
ALBUM artist (the curator/label-name applied to every row in a
compilation). The `tracks.track_artist` column already holds the
correct per-track artist for compilations — populated by every
server-scan path (Plex `originalTitle`, Jellyfin `ArtistItems`,
Navidrome per-track `artist`) AND the auto-import / direct-download
post-process flow (`record_soulsync_library_entry` writes it when
different from album artist). Scanner just wasn't reading it.

# Fix

```sql
SELECT t.id, t.title,
       COALESCE(NULLIF(t.track_artist, ''), ar.name) AS artist,
       ...
```

Prefers per-track artist when populated, falls back to album artist
for legacy rows / single-artist albums where `track_artist` is NULL.
`NULLIF(t.track_artist, '')` handles the empty-string-instead-of-null
case some legacy rows might have.

# Composes with Foxxify's multi-value fix

For the rare compilation track where AcoustID ALSO returns a
multi-value credit (e.g. compilation track has multiple credited
performers), both paths work together — `track_artist` gives the
correct expected primary, then the helper splits the credit and
finds it.

# Tests added (2)

- `test_load_db_tracks_prefers_track_artist_for_compilation` —
  reporter's exact case: track with `track_artist='Eclypse'` AND
  `artist_id` pointing at album artist 'Andromedik' resolves to
  'Eclypse'. Second track with NULL `track_artist` falls back to
  album artist 'Andromedik' (single-artist + legacy compat).
- `test_load_db_tracks_falls_back_when_track_artist_empty_string`
  — empty string in `track_artist` (some legacy rows) → NULLIF
  returns NULL → COALESCE falls back to album artist.

Both use a real SQLite DB so the COALESCE/NULLIF logic + JOIN
runs against actual schema (SimpleNamespace fakes can't simulate
JOINs).

# Verification

- 6/6 scanner tests pass (2 new + 4 existing)
- 2586 full suite passes (+2 from prior commit)
- Ruff clean
pull/544/head
Broque Thomas 3 days ago
parent 08a0b39b91
commit 812db1fbbf

@ -274,8 +274,20 @@ class AcoustIDScannerJob(RepairJob):
try:
conn = context.db._get_connection()
cursor = conn.cursor()
# Discord report (Skowl): compilation albums like "High Tea
# Music: Vol 1" have a different artist per track but the
# `tracks.artist_id` foreign key points at the ALBUM artist
# (curator / label-name applied to every track). AcoustID
# returns the actual per-track artist → 12% similarity →
# Wrong Song flag. Fix: prefer `tracks.track_artist` (the
# per-track artist, populated by every server-scan + auto-
# import path when different from album artist) and fall
# back to the album artist only when the per-track column
# is NULL or empty (legacy rows / single-artist albums).
cursor.execute("""
SELECT t.id, t.title, ar.name, t.file_path, t.track_number,
SELECT t.id, t.title,
COALESCE(NULLIF(t.track_artist, ''), ar.name) AS artist,
t.file_path, t.track_number,
al.title AS album_title, al.thumb_url, ar.thumb_url
FROM tracks t
LEFT JOIN artists ar ON ar.id = t.artist_id

@ -225,3 +225,138 @@ class JobResultStub:
errors = 0
scanned = 0
skipped = 0
# ---------------------------------------------------------------------------
# Compilation albums — Skowl Discord report
# ---------------------------------------------------------------------------
#
# Compilation albums (e.g. "High Tea Music: Vol 1") have different
# artists per track but `tracks.artist_id` points at the ALBUM artist
# (curator / label name applied to every row). The scanner used to
# compare AcoustID's per-track artist against the album artist →
# 12% sim → Wrong Song flag on every track. The `tracks.track_artist`
# column already holds the correct per-track artist for these cases
# (populated by every server-scan + auto-import path) — scanner just
# wasn't reading it. Post-fix `_load_db_tracks` prefers track_artist
# via `COALESCE(NULLIF(t.track_artist, ''), ar.name)`.
def _make_real_db_context(tmp_path):
"""Build a context with a REAL SQLite DB so the scanner's
multi-table JOIN runs against actual schema. SimpleNamespace
fakes can't simulate the JOIN."""
import sqlite3
db_path = tmp_path / "test.db"
conn = sqlite3.connect(str(db_path))
conn.row_factory = sqlite3.Row
cursor = conn.cursor()
cursor.executescript("""
CREATE TABLE artists (
id TEXT PRIMARY KEY,
name TEXT NOT NULL,
thumb_url TEXT
);
CREATE TABLE albums (
id TEXT PRIMARY KEY,
title TEXT NOT NULL,
thumb_url TEXT
);
CREATE TABLE tracks (
id TEXT PRIMARY KEY,
title TEXT,
artist_id TEXT,
album_id TEXT,
file_path TEXT,
track_number INTEGER,
track_artist TEXT
);
""")
conn.commit()
conn.close()
class _RealDB:
def _get_connection(self):
c = sqlite3.connect(str(db_path))
c.row_factory = sqlite3.Row
return c
return _RealDB()
def test_load_db_tracks_prefers_track_artist_for_compilation():
"""Reporter's exact case (Skowl) — compilation album where
every track has a different artist credited via track_artist
column, while artist_id points at the album-level curator."""
import tempfile, pathlib
tmp = pathlib.Path(tempfile.mkdtemp())
db = _make_real_db_context(tmp)
conn = db._get_connection()
cursor = conn.cursor()
cursor.execute("INSERT INTO artists (id, name) VALUES ('andro', 'Andromedik')")
cursor.execute(
"INSERT INTO albums (id, title) VALUES ('hightea', 'High Tea Music: Vol 1')"
)
cursor.execute(
"INSERT INTO tracks (id, title, artist_id, album_id, file_path, track_artist) "
"VALUES (?, ?, ?, ?, ?, ?)",
('city-lights', 'City Lights', 'andro', 'hightea',
'/music/citylights.mp3', 'Eclypse'),
)
cursor.execute(
"INSERT INTO tracks (id, title, artist_id, album_id, file_path, track_artist) "
"VALUES (?, ?, ?, ?, ?, ?)",
('invasion', 'Invasion', 'andro', 'hightea',
'/music/invasion.mp3', None), # NULL track_artist falls back
)
conn.commit()
conn.close()
job = AcoustIDScannerJob()
context = SimpleNamespace(
db=db,
config_manager=SimpleNamespace(get=lambda *a, **k: None),
)
tracks = job._load_db_tracks(context)
# Track with track_artist populated → Eclypse (per-track), NOT
# Andromedik (album-artist via artist_id).
assert tracks['city-lights']['artist'] == 'Eclypse', (
f"Compilation track must use track_artist; got {tracks['city-lights']['artist']!r}"
)
# Track with NULL track_artist → falls back to album artist
# via COALESCE. Backward compat for legacy rows + single-artist
# albums where track_artist isn't populated.
assert tracks['invasion']['artist'] == 'Andromedik'
def test_load_db_tracks_falls_back_when_track_artist_empty_string():
"""Defensive: NULLIF treats empty string as NULL too. Some
legacy rows might have stored '' instead of NULL."""
import tempfile, pathlib
tmp = pathlib.Path(tempfile.mkdtemp())
db = _make_real_db_context(tmp)
conn = db._get_connection()
cursor = conn.cursor()
cursor.execute("INSERT INTO artists (id, name) VALUES ('a', 'Album Artist')")
cursor.execute("INSERT INTO albums (id, title) VALUES ('alb', 'Album')")
cursor.execute(
"INSERT INTO tracks (id, title, artist_id, album_id, file_path, track_artist) "
"VALUES (?, ?, ?, ?, ?, ?)",
('t1', 'T1', 'a', 'alb', '/music/t1.mp3', ''), # empty string
)
conn.commit()
conn.close()
job = AcoustIDScannerJob()
context = SimpleNamespace(
db=db,
config_manager=SimpleNamespace(get=lambda *a, **k: None),
)
tracks = job._load_db_tracks(context)
# Empty string in track_artist → NULLIF returns NULL → COALESCE
# falls back to album artist
assert tracks['t1']['artist'] == 'Album Artist'

@ -3416,6 +3416,7 @@ const WHATS_NEW = {
'2.4.3': [
// --- post-release patch work on the 2.4.3 line — entries hidden by _getLatestWhatsNewVersion until the build version bumps ---
{ date: 'Unreleased — 2.4.3 patch work' },
{ title: 'AcoustID Scanner: Compilation Albums No Longer Flag Every Track', desc: 'discord report (skowl): downloaded a compilation album like "high tea music: vol 1" where every track has a different artist (eclypse, andromedik, t & sugah, gourski, himmes, sektor, lexurus, etc.) and the acoustid scanner flagged every single track as wrong song — the file tag had the correct per-track artist (e.g. "eclypse" for "city lights") but the scanner compared against the album-level artist ("andromedik", the curator). raw similarity 12% → wrong song flag. the multi-value-credit fix from the prior pr (foxxify) didn\'t help because both sides were single-value but DIFFERENT artists. cause: scanner sql joined `artists` table via `tracks.artist_id` which points at the ALBUM artist, not the per-track artist. but `tracks.track_artist` column was already populated with the correct per-track value by every server scan + auto-import path that handles compilations. scanner just wasn\'t reading it. fix: changed the scanner select to `COALESCE(NULLIF(t.track_artist, \'\'), ar.name)` — prefers per-track artist when populated, falls back to album artist for legacy rows / single-artist albums where track_artist is null. NULLIF handles the empty-string-instead-of-null case for legacy data. composes with foxxify\'s multi-value fix — for the rare compilation track where acoustid ALSO returns a multi-value credit, both paths work together. 2 new tests pin: compilation track uses per-track artist (reporter\'s exact case), null/empty track_artist falls back to album artist via coalesce.', page: 'library' },
{ title: 'AcoustID Scanner: Multi-Artist Songs No Longer Flagged As Wrong', desc: 'discord report (foxxify): the acoustid scanner repair job was flagging multi-artist tracks as "wrong song" because acoustid returns the full credit ("okayracer, aldrch & poptropicaslutz!") while the library db carries only the primary artist ("okayracer"). raw similarity scored ~43% — well below the 60% threshold — so the scanner created a wrong-song finding even though the audio was correct. user couldn\'t fix without lowering the global artist threshold to ~30% (which would let real mismatches through). cause: scanner used raw `SequenceMatcher` comparison that doesn\'t recognise the primary artist is just one of several contributors in the credit string. fix: extended the shared `core/matching/artist_aliases.py::artist_names_match` helper (lifted in #441) with credit-token splitting on common separators (comma, ampersand, semicolon, slash, plus, "feat.", "ft.", "featuring", "with", "vs.", "x"). when actual artist contains separators, helper splits into individual contributors and checks each against expected — primary-in-credit cases now resolve at 100% instead of 43%. composes with existing alias path so cross-script multi-artist credits ("hiroyuki sawano" expected, "澤野弘之, featured" actual) work too. wired into `core/repair_jobs/acoustid_scanner.py` — replaces the raw similarity call. acoustid post-download verifier already used the helper from #441 so it inherits the same fix automatically. 14 new tests pin: split-by-separator across 12 credit-string formats, primary at start/middle/end of credit, no-mask on genuine mismatches, single-token actual falls through to direct compare, multi-value composes with aliases, threshold still respected, end-to-end scanner integration with reporter\'s exact case (okayracer in okayracer-aldrch-poptropicaslutz credit → no finding), end-to-end scanner still flags genuine mismatches.', page: 'library' },
{ title: 'Deezer Cover Art: Embedded Covers No Longer Look Blurry', desc: 'discord report (tim): downloaded cover art via deezer metadata source came out visibly blurry in navidrome and on phones — particularly noticeable on large displays. cause: deezer\'s api returns `cover_xl` urls at 1000×1000 but the underlying cdn serves up to 1900×1900 by rewriting the size segment in the url path. soulsync wasn\'t doing the rewrite — same as iTunes mzstatic and spotify scdn already get upgraded. now `_upgrade_deezer_cover_url` (mirrors `_upgrade_spotify_image_url` pattern) rewrites the cdn url to request 1900×1900 before download. cdn serves source-native size when source < target so asking for 1900 on smaller-source albums returns the same bytes (no upscaling, no failure). applied at both download sites — auto post-process flow + the enhanced library view\'s "write tags to file" feature. existing `prefer_caa_art` toggle in settings → library → post-processing remains as the orthogonal workaround for users who want even higher quality (musicbrainz cover art archive, often 3000×3000+). 16 new tests pin: standard upgrade, alternate dzcdn host, artist picture urls, custom target sizes, idempotency on already-upgraded urls, defensive on non-deezer urls (spotify/itunes/caa/lastfm/random), empty/none handling.', page: 'settings' },
{ title: 'Cross-Script Artist Names No Longer Quarantine Files (Hiroyuki Sawano / 澤野弘之, Сергей Лазарев / Sergey Lazarev)', desc: 'github issue #442 (afonsog6): files where the artist tag was in one script and the expected metadata was in another — japanese kanji `澤野弘之` for `hiroyuki sawano`, cyrillic `сергей лазарев` for `sergey lazarev`, etc. — got quarantined post-download because acoustid verification scored the artist similarity at 0% (the two scripts share no characters). reporter could not even rescue the file via manual import — the import-modal goes through the same verifier and re-quarantined the same file. cause: verifier compared expected vs actual artist with raw `_similarity` and never consulted musicbrainz aliases, even though MB exposes them on every artist record. fix: new `core/matching/artist_aliases.py` pure helper with alias-aware comparison + new `artists.aliases` JSON column populated by the existing MB enrichment worker on every artist match (one extra `inc=aliases` request per artist) + new multi-tier resolver `MusicBrainzService.lookup_artist_aliases` (library DB → cache → live MB) so the verifier finds aliases even for un-enriched artists without thrashing the MB API. verifier resolves aliases ONCE per `verify_audio_file` call and feeds them through three artist comparison sites (best-match scoring, secondary scan when title matches but artist doesn\'t, final fallback scan). reporter\'s exact two cases reproduced as regression tests with stubbed MB service. backward compat: aliases unavailable / MB unreachable → verifier falls back to direct similarity (identical to pre-fix behaviour — never quarantines stricter than today). 70 new tests pin every layer: pure helper (28), service methods (31), verifier integration (11). audited adjacent artist-comparison sites (auto-import single-track id, discovery scoring, matching engine) — left untouched per scope discipline since they aren\'t the user-reported pain.', page: 'downloads' },

Loading…
Cancel
Save