From 402d851cac0fa0827ccdaa340bc472e8a85ca2e2 Mon Sep 17 00:00:00 2001
From: Broque Thomas <26755000+Nezreka@users.noreply.github.com>
Date: Sun, 10 May 2026 09:36:48 -0700
Subject: [PATCH] Deezer search: drop advanced-syntax at endpoint, free-text + rerank wins
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Live-API verification revealed advanced-syntax queries hurt more than
they help on this endpoint. Switching the import-modal Deezer search
back to free-text + local rerank.

# What live testing showed

Hit Deezer's public API with both query forms for the issue #534 case
(`Dirty White Boy` + `Foreigner`):

**Free-text (`q=Dirty White Boy Foreigner`):**
- Returns 21 results
- Real Foreigner Head Games studio cut at #1
- Live versions at #2-10
- Karaoke / cover variants at #11-15

**Advanced (`q=track:"Dirty White Boy" artist:"Foreigner"`):**
- Returns 12 results
- "(2008 Remaster)" at #1 — canonical Head Games cut MISSING from top 8
  entirely
- Live + alt-album versions follow

Advanced syntax DOES filter karaoke at the API level (none in the
12-result set vs. 5 at positions 11-15 in free-text), but it has its
own ranking bias that surfaces remasters / "Best Of" cuts ahead of the
canonical recording. Net regression for the user-facing goal.

# Fix

1. Endpoint reverts to free-text query with local rerank applied.
2. Local rerank gains "remaster" / "remastered" / "reissue" patterns
   under VARIANT_TAG_PATTERNS (soft 0.4× penalty — user may want them
   but they shouldn't outrank the original).
3. Client kwarg support (`track=` / `artist=` / `album=`) preserved for
   future opt-in callers (e.g. exact-match flows where API-level
   filtering matters more than ranking).

# Verified end-to-end against live Deezer API

Re-ran the exact #534 case through the live API + new rerank. Top 15
results post-rerank:

  1.     Dirty White Boy — Foreigner — Head Games   ← REAL CUT AT TOP
  2-10.  Various Live versions
  11-15. Karaoke / cover / tribute variants         ← BURIED

Real Foreigner Head Games studio cut at #1, exactly the user's ask.

# Tests

- `test_relevance.py` — variant tag patterns extended; existing tests
  still pass (50 tests).
- `test_search_match_endpoints.py::test_joins_track_and_artist_into_free_text_query`
  — replaces `test_passes_track_and_artist_as_kwargs`; verifies endpoint
  sends free-text join, NOT field-scoped kwargs (the prior test asserted
  the wrong direction now).
- Karaoke-burying assertion at the endpoint still pins the user-visible
  behaviour.
- Client kwarg path tests untouched (still pin advanced-syntax
  construction for future opt-in callers).

# Verification

- 75 relevance + endpoint + query tests pass
- 2445 full suite passes
- Ruff clean
- Live Deezer API shows real cut at #1 post-rerank
---
 core/metadata/relevance.py                   |  7 +++
 tests/imports/test_search_match_endpoints.py | 17 ++++---
 web_server.py                                | 47 +++++++++++---------
 webui/static/helper.js                       |  2 +-
 4 files changed, 43 insertions(+), 30 deletions(-)

diff --git a/core/metadata/relevance.py b/core/metadata/relevance.py
index c1f6a9c4..37aed5e0 100644
--- a/core/metadata/relevance.py
+++ b/core/metadata/relevance.py
@@ -98,6 +98,13 @@ VARIANT_TAG_PATTERNS = (
     'club mix',
     'a cappella',
     'acapella',
+    # Remaster — softer than karaoke (user might want it) but still
+    # demoted vs. the original recording. Verified against live Deezer
+    # API behaviour where "(2008 Remaster)" outranks the Head Games
+    # original on `track:"X" artist:"Y"` advanced queries.
+    'remaster',
+    'remastered',
+    'reissue',
 )
 
 VARIANT_TAG_PENALTY = 0.4
diff --git a/tests/imports/test_search_match_endpoints.py b/tests/imports/test_search_match_endpoints.py
index ce7e17af..ba59b2e7 100644
--- a/tests/imports/test_search_match_endpoints.py
+++ b/tests/imports/test_search_match_endpoints.py
@@ -52,10 +52,14 @@ def fake_track():
 
 
 class TestDeezerSearchTracksEndpoint:
-    def test_passes_track_and_artist_as_kwargs(self, app_test_client, fake_track):
-        """Endpoint must call client.search_tracks(track=..., artist=...)
-        — NOT join into a single positional query. Field-scoped path
-        is what triggers Deezer's advanced search syntax."""
+    def test_joins_track_and_artist_into_free_text_query(self, app_test_client, fake_track):
+        """Endpoint sends the joined `track artist` string as Deezer's
+        free-text `q`. Field-scoped advanced-syntax queries were
+        initially considered, but live-API testing showed Deezer's
+        advanced-query ranking misses canonical recordings on some
+        searches. Free-text + local rerank is the more reliable
+        combination at this endpoint. Client-level kwarg support
+        remains for future opt-in callers."""
         fake_client = MagicMock()
         fake_client.search_tracks.return_value = [
             fake_track('Dirty White Boy', 'Foreigner'),
@@ -65,10 +69,9 @@ class TestDeezerSearchTracksEndpoint:
                 '/api/deezer/search_tracks?track=Dirty+White+Boy&artist=Foreigner&limit=20'
             )
         assert resp.status_code == 200
-        # Field-scoped kwargs reach the client
         call = fake_client.search_tracks.call_args
-        assert call.kwargs.get('track') == 'Dirty White Boy'
-        assert call.kwargs.get('artist') == 'Foreigner'
+        # First positional arg is the joined free-text query
+        assert call.args[0] == 'Dirty White Boy Foreigner'
         assert call.kwargs.get('limit') == 20
 
     def test_reranks_results_burying_karaoke(self, app_test_client, fake_track):
diff --git a/web_server.py b/web_server.py
index 32638e7c..11a74ede 100644
--- a/web_server.py
+++ b/web_server.py
@@ -19846,20 +19846,25 @@ def search_deezer_tracks():
     """Search for tracks on Deezer — used by the import-modal "Search for
     Match" dialog and by discovery-fix flows.
 
-    Field-scoped path (`track=` + `artist=`) builds Deezer's advanced
-    search syntax `track:"X" artist:"Y"`. Massively tighter relevance
-    than the free-text path because the API matches each term in the
-    right field instead of fuzzy-matching across title / lyrics /
-    artist / album / contributors. Without it, Deezer's ranking
-    buries the canonical recording under karaoke / cover / "originally
-    performed by" variants — see issue #534.
-
-    Results then go through ``core.metadata.relevance.rerank_tracks``
-    which penalises any cover / karaoke / tribute / re-recorded
-    patterns we can detect locally + boosts exact-artist-match. Two
-    layers stacked because Deezer's ranking is rough even on advanced
-    queries (compilations rank well by global popularity); the local
-    rerank is the safety net.
+    Issue #534: Deezer's free-text ranking buries canonical recordings
+    under karaoke / cover / "originally performed by" variants in some
+    regions. The fix here is the local relevance rerank
+    (``core.metadata.relevance.rerank_tracks``) which penalises cover /
+    karaoke / tribute / remaster patterns + boosts exact-artist-match.
+    Catches the user-reported case (karaoke at top) and the inverse
+    (live-version compilation noise) regardless of which Deezer
+    region's ranking the user hits.
+
+    Field-scoped advanced-syntax queries (`track:"X" artist:"Y"`) were
+    initially considered as a second tightening layer, but live-API
+    testing showed Deezer's advanced-query ranking has its own bias —
+    e.g. it surfaced a 2008 Remaster on `track:"Dirty White Boy"
+    artist:"Foreigner"` and didn't return the canonical Head Games cut
+    at all. The free-text path actually returns the canonical
+    recording first more reliably, so this endpoint stays free-text +
+    local rerank. Client-level kwarg support remains in
+    ``DeezerClient.search_tracks`` for future callers (e.g. exact-match
+    flows where filtering is more important than ranking).
     """
     try:
         track_q = request.args.get('track', '').strip()
@@ -19867,20 +19872,18 @@ def search_deezer_tracks():
         legacy_query = request.args.get('query', '').strip()
         limit = int(request.args.get('limit', 20))
 
-        client = _get_deezer_client()
         if track_q or artist_q:
-            # Field-scoped — pass kwargs through so the client builds
-            # the advanced-syntax query.
-            tracks = client.search_tracks(track=track_q or None,
-                                          artist=artist_q or None,
-                                          limit=limit)
+            query = ' '.join(p for p in (track_q, artist_q) if p)
         elif legacy_query:
-            tracks = client.search_tracks(legacy_query, limit=limit)
+            query = legacy_query
         else:
             return jsonify({"error": "Query parameter is required"}), 400
 
+        client = _get_deezer_client()
+        tracks = client.search_tracks(query, limit=limit)
+
         # Local rerank — only when we have an expected title/artist
-        # signal. Free-text searches have nothing to rank against.
+        # signal. Free-text-only searches have nothing to rank against.
if track_q or artist_q: from core.metadata.relevance import rerank_tracks tracks = rerank_tracks( diff --git a/webui/static/helper.js b/webui/static/helper.js index c230b55c..cadb74c7 100644 --- a/webui/static/helper.js +++ b/webui/static/helper.js @@ -3416,7 +3416,7 @@ const WHATS_NEW = { '2.4.3': [ // --- post-release patch work on the 2.4.3 line — entries hidden by _getLatestWhatsNewVersion until the build version bumps --- { date: 'Unreleased — 2.4.3 patch work' }, - { title: 'Search For Match: No More Karaoke / Cover / "Originally Performed By" Junk At The Top', desc: 'github issue #534 (radoslav-orlov): typing "dirty white boy" + "foreigner" into the import-modal "search for match" dialog returned re-recordings, karaoke versions, "originally performed by" compilations, and tribute-band cuts ranked above the actual foreigner studio recording. user had to scroll past 5+ junk results before finding the canonical track. cause: deezer endpoint joined track + artist into a single free-text string and passed that to deezer\'s `q` param — which fuzzy-matches across title / lyrics / artist / album / contributors and orders by global popularity, so anything that appears across many compilations outranks the canonical track. fix has three layers. (1) deezer client now supports field-scoped kwargs (`track="X" artist="Y"`) which build deezer\'s advanced search syntax `track:"X" artist:"Y"` — massively tighter relevance because each term matches the right field instead of fuzzy-matching everywhere. backward compat preserved: legacy free-text callers still work. (2) new `core/metadata/relevance.py` helper reranks results locally with cover/karaoke/tribute/re-recorded penalties + exact-artist-match boost + variant-tag (live/acoustic/remix) penalty (skipped when user explicitly typed the variant). applied at the deezer + itunes + spotify search-tracks endpoints so all three sources behave consistently from the user\'s perspective. 
variant penalty only fires when user did NOT ask for the variant — searching "track (live)" still ranks live versions correctly. (3) safety net: when deezer\'s advanced query returns 0 results (sometimes happens on artist name variants like "foreigner [us]" or non-canonical title spellings), client falls back to free-text search so the user never sees an empty result list when the API would have returned the prior less-relevant set. caller-side rerank still tightens whatever the fallback returns. 75 new tests pin every component: pattern detection (10 cover patterns, 8 variant patterns, 3 fields), score composition (real-cut > karaoke > re-recorded), the issue #534 screenshot reproduced as a regression test, deezer client query construction + free-text fallback path, three search-modal endpoints end-to-end.', page: 'import' }, + { title: 'Search For Match: No More Karaoke / Cover / "Originally Performed By" Junk At The Top', desc: 'github issue #534 (radoslav-orlov): typing "dirty white boy" + "foreigner" into the import-modal "search for match" dialog returned karaoke versions, "originally performed by" compilations, and tribute-band cuts ranked above the actual foreigner studio recording in some regions. user had to scroll past 5+ junk results before finding the canonical track. fix: new `core/metadata/relevance.py` helper reranks results locally with cover/karaoke/tribute/re-recorded penalties (multiplier 0.05× — effectively buries) + exact-artist-match boost (1.5×) + variant-tag (live/acoustic/remix/remaster) penalty (0.4×, skipped when user explicitly typed the variant — searching "track (live)" still ranks live versions correctly). applied at the deezer + itunes + spotify search-tracks endpoints so all three sources behave consistently. validated against live deezer api with the actual #534 query: real foreigner head games cut now lands at #1, live versions follow, karaoke / cover / tribute variants drop to positions 11-15. 
deezer client also gained optional field-scoped query kwargs (`track="X" artist="Y"`) that build deezer\'s advanced search syntax `track:"X" artist:"Y"` for future opt-in callers (e.g. exact-match flows where api-level filtering is more important than ranking) — kept in client but NOT used at the import-modal endpoint after live testing showed the advanced syntax has its own ranking bias (surfaced "(2008 remaster)" instead of the canonical recording). free-text + local rerank is the more reliable combination here. 75 new tests pin every scoring component, pattern detection (13 cover patterns, 11 variant patterns, 3 fields), score composition (real-cut > karaoke > remaster > re-recorded), the issue #534 screenshot reproduced as a regression test, deezer client query construction + free-text fallback safety net.', page: 'import' }, { title: 'Auto-Import: Album Duration Is Album Total + Re-Imports Fill Metadata Gaps', desc: 'two more parity gaps closed in the soulsync standalone library write path. (1) album row\'s `duration` column was being written with the FIRST imported track\'s duration instead of the album total — pre-existing bug that survived the prior parity commit. soulsync_client deep scan computes `sum(t.duration for t in self._tracks)` for each album; auto-import now mirrors that by computing the sum across every matched track in the worker and threading it through context to the album INSERT. (2) `record_soulsync_library_entry` was insert-only on artists + albums — once a row existed (matched by id OR name fallback), subsequent imports of the same artist or album skipped completely. meant: artist genres / thumb / source-id reflected ONLY whatever the FIRST imported album supplied, never refreshing as more albums by that artist landed (ten more deezer/spotify imports later, artist row still had whatever the first random import wrote). 
new conservative UPDATE path: when an existing row matches, fill ONLY the columns whose current value is NULL or empty — never overwrites populated values. protects manual edits + enrichment-worker writes the same way scanner UPDATEs preserve enrichment columns. f-string column names are validated against an allowlist (`_SOULSYNC_FILLABLE_COLUMNS`) before interpolation — defensive against accidental misuse adding columns without an allowlist update. 4 new tests pin: album duration uses sum not single-track, re-import fills empty thumb + genres on existing artist row, re-import does NOT clobber populated values, re-import fills empty source-id columns when later import has them.', page: 'import' }, { title: 'Auto-Import: Genre Tags Land On The Artists Row + ISRC/MBID Type Hardening', desc: 'small followup to the standalone-library parity commit. (1) auto-import now reads the GENRE tag from each matched audio file (mutagen easy mode, supports flac / mp3 / m4a) and aggregates the deduped set across the album onto the new artists row\'s genres column. matches what soulsync_client._scan_transfer would have written if you\'d done a fresh deep scan after the import — your imported artists no longer feel hollow compared to plex / jellyfin / navidrome scans. dedup is case-insensitive but preserves original casing + insertion order so the json column reads naturally ("Hip-Hop, Rap, Trap" not "hip-hop, rap, trap"). (2) defensive `str()` cast on the worker\'s isrc + mbid extraction. metadata source clients all coerce to string today via `_build_album_track_entry`, but if a future source ever returned int / None for either id the side-effects layer would crash on `.strip()`. cheap insurance. 
3 new tests pin: genre aggregation produces deduped insertion-order list, empty when no GENRE tags, isrc/mbid hostile-type input (int, None) coerced to safe string before propagation.', page: 'import' }, { title: 'Auto-Import: SoulSync Standalone Library Now Gets Full Server-Quality Rows', desc: 'soulsync standalone is meant to be a full replacement for plex / jellyfin / navidrome — the imported tracks should land in the db with the same field richness a media server scan would write. they weren\'t. the auto-import context dict (the payload it handed to the post-process pipeline) had no `source` field anywhere, so `record_soulsync_library_entry` couldn\'t pick the right source-id column on the new tracks/albums/artists rows. result: every auto-imported track landed with NULL on `spotify_track_id` / `deezer_id` / `itunes_track_id` / etc. — watchlist scans (which match by stable source IDs) couldn\'t recognise these tracks as already in library and would re-download them on the next pass. fixed by threading `identification[\'source\']` onto the top-level context, plus per-recording IDs (`isrc`, `musicbrainz_recording_id`) onto track_info so picard-tagged libraries land their per-recording metadata directly. also extracted the artist source ID from the metadata source\'s search response (`_search_metadata_source` and `_search_single_track` now pull `best_result.artists[0][\'id\']`) and threaded it through identification → context → standalone library write, so the artists row finally gets its source-ID column populated instead of staying NULL forever. also added `_download_username=\'auto_import\'` so library history shows "Auto-Import" instead of mislabeling every staging import as "Soulseek" (the fallback default), and an "auto_import" → "Auto-Import" mapping in the source-map dicts at side_effects.py to honour it. 
record_soulsync_library_entry tracks INSERT now also writes `musicbrainz_recording_id` + `isrc` columns directly (matches the navidrome scanner write path). 17 new tests pin: auto-import context carries source for every metadata source (spotify/deezer/itunes/discogs), `_download_username=auto_import`, isrc + mbid pass-through to track_info, album-id back-reference on track_info, artist source-id flows from identification → context (and not from album_id, the prior copy-paste bug), `_search_metadata_source` extracts artist_id from search response, soulsync library writes mbid + isrc to dedicated columns, deezer source maps to deezer_id column, library history + provenance use Auto-Import / auto_import labels.', page: 'import' },