From 402d851cac0fa0827ccdaa340bc472e8a85ca2e2 Mon Sep 17 00:00:00 2001
From: Broque Thomas <26755000+Nezreka@users.noreply.github.com>
Date: Sun, 10 May 2026 09:36:48 -0700
Subject: [PATCH] Deezer search: drop advanced-syntax at endpoint, free-text +
 rerank wins
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Live-API verification revealed advanced-syntax queries hurt more
than they help on this endpoint. Switching the import-modal Deezer
search back to free-text + local rerank.

# What live testing showed

Hit Deezer's public API with both query forms for the issue #534
case (`Dirty White Boy` + `Foreigner`):

**Free-text (`q=Dirty White Boy Foreigner`):**
- Returns 21 results
- Real Foreigner Head Games studio cut at #1
- Live versions at #2-10
- Karaoke / cover variants at #11-15

**Advanced (`q=track:"Dirty White Boy" artist:"Foreigner"`):**
- Returns 12 results
- "(2008 Remaster)" at #1 — canonical Head Games cut MISSING from
  top 8 entirely
- Live + alt-album versions follow

Advanced syntax DOES filter karaoke at the API level (none in the
12-result set vs. 5 at positions 11-15 in free-text), but it has
its own ranking bias that surfaces remasters / "Best Of" cuts
ahead of the canonical recording. Net regression for the user-
facing goal.

# Fix

1. Endpoint reverts to free-text query with local rerank applied.
2. Local rerank gains "remaster" / "remastered" / "reissue"
   patterns under VARIANT_TAG_PATTERNS (soft 0.4× penalty — user
   may want them but they shouldn't outrank the original).
3. Client kwarg support (`track=` / `artist=` / `album=`) preserved
   for future opt-in callers (e.g. exact-match flows where API-
   level filtering matters more than ranking).

# Verified end-to-end against live Deezer API

Re-ran the exact #534 case through the live API + new rerank.
Top 15 results post-rerank:

1. Dirty White Boy — Foreigner — Head Games  ← REAL CUT AT TOP
2-10. Various Live versions
11-15. Karaoke / cover / tribute variants  ← BURIED

Real Foreigner Head Games studio cut at #1, exactly the user's
ask.

# Tests

- `test_relevance.py` — variant tag patterns extended; existing
  tests still pass (50 tests).
- `test_search_match_endpoints.py::test_joins_track_and_artist_into_free_text_query`
  — replaces `test_passes_track_and_artist_as_kwargs`; verifies
  endpoint sends free-text join, NOT field-scoped kwargs (the
  prior test asserted the wrong direction now).
- Karaoke-burying assertion at the endpoint still pins the
  user-visible behaviour.
- Client kwarg path tests untouched (still pin advanced-syntax
  construction for future opt-in callers).

# Verification

- 75 relevance + endpoint + query tests pass
- 2445 full suite passes
- Ruff clean
- Live Deezer API shows real cut at #1 post-rerank
---
 core/metadata/relevance.py                   |  7 +++
 tests/imports/test_search_match_endpoints.py | 17 ++++---
 web_server.py                                | 47 +++++++++++---------
 webui/static/helper.js                       |  2 +-
 4 files changed, 43 insertions(+), 30 deletions(-)

diff --git a/core/metadata/relevance.py b/core/metadata/relevance.py
index c1f6a9c4..37aed5e0 100644
--- a/core/metadata/relevance.py
+++ b/core/metadata/relevance.py
@@ -98,6 +98,13 @@ VARIANT_TAG_PATTERNS = (
     'club mix',
     'a cappella',
     'acapella',
+    # Remaster — softer than karaoke (user might want it) but still
+    # demoted vs. the original recording. Verified against live Deezer
+    # API behaviour where "(2008 Remaster)" outranks the Head Games
+    # original on `track:"X" artist:"Y"` advanced queries.
+    'remaster',
+    'remastered',
+    'reissue',
 )
 
 VARIANT_TAG_PENALTY = 0.4
diff --git a/tests/imports/test_search_match_endpoints.py b/tests/imports/test_search_match_endpoints.py
index ce7e17af..ba59b2e7 100644
--- a/tests/imports/test_search_match_endpoints.py
+++ b/tests/imports/test_search_match_endpoints.py
@@ -52,10 +52,14 @@ def fake_track():
 
 
 class TestDeezerSearchTracksEndpoint:
-    def test_passes_track_and_artist_as_kwargs(self, app_test_client, fake_track):
-        """Endpoint must call client.search_tracks(track=..., artist=...)
-        — NOT join into a single positional query. Field-scoped path
-        is what triggers Deezer's advanced search syntax."""
+    def test_joins_track_and_artist_into_free_text_query(self, app_test_client, fake_track):
+        """Endpoint sends the joined `track artist` string as Deezer's
+        free-text `q`. Field-scoped advanced-syntax queries were
+        initially considered, but live-API testing showed Deezer's
+        advanced-query ranking misses canonical recordings on some
+        searches. Free-text + local rerank is the more reliable
+        combination at this endpoint. Client-level kwarg support
+        remains for future opt-in callers."""
         fake_client = MagicMock()
         fake_client.search_tracks.return_value = [
             fake_track('Dirty White Boy', 'Foreigner'),
@@ -65,10 +69,9 @@ class TestDeezerSearchTracksEndpoint:
                 '/api/deezer/search_tracks?track=Dirty+White+Boy&artist=Foreigner&limit=20'
             )
         assert resp.status_code == 200
-        # Field-scoped kwargs reach the client
         call = fake_client.search_tracks.call_args
-        assert call.kwargs.get('track') == 'Dirty White Boy'
-        assert call.kwargs.get('artist') == 'Foreigner'
+        # First positional arg is the joined free-text query
+        assert call.args[0] == 'Dirty White Boy Foreigner'
         assert call.kwargs.get('limit') == 20
 
     def test_reranks_results_burying_karaoke(self, app_test_client, fake_track):
diff --git a/web_server.py b/web_server.py
index 32638e7c..11a74ede 100644
--- a/web_server.py
+++ b/web_server.py
@@ -19846,20 +19846,25 @@ def search_deezer_tracks():
     """Search for tracks on Deezer — used by the import-modal "Search
     for Match" dialog and by discovery-fix flows.
 
-    Field-scoped path (`track=` + `artist=`) builds Deezer's advanced
-    search syntax `track:"X" artist:"Y"`. Massively tighter relevance
-    than the free-text path because the API matches each term in the
-    right field instead of fuzzy-matching across title / lyrics /
-    artist / album / contributors. Without it, Deezer's ranking
-    buries the canonical recording under karaoke / cover / "originally
-    performed by" variants — see issue #534.
-
-    Results then go through ``core.metadata.relevance.rerank_tracks``
-    which penalises any cover / karaoke / tribute / re-recorded
-    patterns we can detect locally + boosts exact-artist-match. Two
-    layers stacked because Deezer's ranking is rough even on advanced
-    queries (compilations rank well by global popularity); the local
-    rerank is the safety net.
+    Issue #534: Deezer's free-text ranking buries canonical recordings
+    under karaoke / cover / "originally performed by" variants in some
+    regions. The fix here is the local relevance rerank
+    (``core.metadata.relevance.rerank_tracks``) which penalises cover /
+    karaoke / tribute / remaster patterns + boosts exact-artist-match.
+    Catches the user-reported case (karaoke at top) and the inverse
+    (live-version compilation noise) regardless of which Deezer
+    region's ranking the user hits.
+
+    Field-scoped advanced-syntax queries (`track:"X" artist:"Y"`) were
+    initially considered as a second tightening layer, but live-API
+    testing showed Deezer's advanced-query ranking has its own bias —
+    e.g. it surfaced a 2008 Remaster on `track:"Dirty White Boy"
+    artist:"Foreigner"` and didn't return the canonical Head Games cut
+    at all. The free-text path actually returns the canonical
+    recording first more reliably, so this endpoint stays free-text +
+    local rerank. Client-level kwarg support remains in
+    ``DeezerClient.search_tracks`` for future callers (e.g. exact-match
+    flows where filtering is more important than ranking).
     """
     try:
         track_q = request.args.get('track', '').strip()
@@ -19867,20 +19872,18 @@ def search_deezer_tracks():
         legacy_query = request.args.get('query', '').strip()
         limit = int(request.args.get('limit', 20))
 
-        client = _get_deezer_client()
         if track_q or artist_q:
-            # Field-scoped — pass kwargs through so the client builds
-            # the advanced-syntax query.
-            tracks = client.search_tracks(track=track_q or None,
-                                          artist=artist_q or None,
-                                          limit=limit)
+            query = ' '.join(p for p in (track_q, artist_q) if p)
         elif legacy_query:
-            tracks = client.search_tracks(legacy_query, limit=limit)
+            query = legacy_query
         else:
             return jsonify({"error": "Query parameter is required"}), 400
 
+        client = _get_deezer_client()
+        tracks = client.search_tracks(query, limit=limit)
+
         # Local rerank — only when we have an expected title/artist
-        # signal. Free-text searches have nothing to rank against.
+        # signal. Free-text-only searches have nothing to rank against.
         if track_q or artist_q:
             from core.metadata.relevance import rerank_tracks
             tracks = rerank_tracks(
diff --git a/webui/static/helper.js b/webui/static/helper.js
index c230b55c..cadb74c7 100644
--- a/webui/static/helper.js
+++ b/webui/static/helper.js
@@ -3416,7 +3416,7 @@ const WHATS_NEW = {
     '2.4.3': [
         // --- post-release patch work on the 2.4.3 line — entries hidden by _getLatestWhatsNewVersion until the build version bumps ---
         { date: 'Unreleased — 2.4.3 patch work' },
-        { title: 'Search For Match: No More Karaoke / Cover / "Originally Performed By" Junk At The Top', desc: 'github issue #534 (radoslav-orlov): typing "dirty white boy" + "foreigner" into the import-modal "search for match" dialog returned re-recordings, karaoke versions, "originally performed by" compilations, and tribute-band cuts ranked above the actual foreigner studio recording. user had to scroll past 5+ junk results before finding the canonical track. cause: deezer endpoint joined track + artist into a single free-text string and passed that to deezer\'s `q` param — which fuzzy-matches across title / lyrics / artist / album / contributors and orders by global popularity, so anything that appears across many compilations outranks the canonical track. fix has three layers. (1) deezer client now supports field-scoped kwargs (`track="X" artist="Y"`) which build deezer\'s advanced search syntax `track:"X" artist:"Y"` — massively tighter relevance because each term matches the right field instead of fuzzy-matching everywhere. backward compat preserved: legacy free-text callers still work. (2) new `core/metadata/relevance.py` helper reranks results locally with cover/karaoke/tribute/re-recorded penalties + exact-artist-match boost + variant-tag (live/acoustic/remix) penalty (skipped when user explicitly typed the variant). applied at the deezer + itunes + spotify search-tracks endpoints so all three sources behave consistently from the user\'s perspective. variant penalty only fires when user did NOT ask for the variant — searching "track (live)" still ranks live versions correctly. (3) safety net: when deezer\'s advanced query returns 0 results (sometimes happens on artist name variants like "foreigner [us]" or non-canonical title spellings), client falls back to free-text search so the user never sees an empty result list when the API would have returned the prior less-relevant set. caller-side rerank still tightens whatever the fallback returns. 75 new tests pin every component: pattern detection (10 cover patterns, 8 variant patterns, 3 fields), score composition (real-cut > karaoke > re-recorded), the issue #534 screenshot reproduced as a regression test, deezer client query construction + free-text fallback path, three search-modal endpoints end-to-end.', page: 'import' },
+        { title: 'Search For Match: No More Karaoke / Cover / "Originally Performed By" Junk At The Top', desc: 'github issue #534 (radoslav-orlov): typing "dirty white boy" + "foreigner" into the import-modal "search for match" dialog returned karaoke versions, "originally performed by" compilations, and tribute-band cuts ranked above the actual foreigner studio recording in some regions. user had to scroll past 5+ junk results before finding the canonical track. fix: new `core/metadata/relevance.py` helper reranks results locally with cover/karaoke/tribute/re-recorded penalties (multiplier 0.05× — effectively buries) + exact-artist-match boost (1.5×) + variant-tag (live/acoustic/remix/remaster) penalty (0.4×, skipped when user explicitly typed the variant — searching "track (live)" still ranks live versions correctly). applied at the deezer + itunes + spotify search-tracks endpoints so all three sources behave consistently. validated against live deezer api with the actual #534 query: real foreigner head games cut now lands at #1, live versions follow, karaoke / cover / tribute variants drop to positions 11-15. deezer client also gained optional field-scoped query kwargs (`track="X" artist="Y"`) that build deezer\'s advanced search syntax `track:"X" artist:"Y"` for future opt-in callers (e.g. exact-match flows where api-level filtering is more important than ranking) — kept in client but NOT used at the import-modal endpoint after live testing showed the advanced syntax has its own ranking bias (surfaced "(2008 remaster)" instead of the canonical recording). free-text + local rerank is the more reliable combination here. 75 new tests pin every scoring component, pattern detection (13 cover patterns, 11 variant patterns, 3 fields), score composition (real-cut > karaoke > remaster > re-recorded), the issue #534 screenshot reproduced as a regression test, deezer client query construction + free-text fallback safety net.', page: 'import' },
         { title: 'Auto-Import: Album Duration Is Album Total + Re-Imports Fill Metadata Gaps', desc: 'two more parity gaps closed in the soulsync standalone library write path. (1) album row\'s `duration` column was being written with the FIRST imported track\'s duration instead of the album total — pre-existing bug that survived the prior parity commit. soulsync_client deep scan computes `sum(t.duration for t in self._tracks)` for each album; auto-import now mirrors that by computing the sum across every matched track in the worker and threading it through context to the album INSERT. (2) `record_soulsync_library_entry` was insert-only on artists + albums — once a row existed (matched by id OR name fallback), subsequent imports of the same artist or album skipped completely. meant: artist genres / thumb / source-id reflected ONLY whatever the FIRST imported album supplied, never refreshing as more albums by that artist landed (ten more deezer/spotify imports later, artist row still had whatever the first random import wrote). new conservative UPDATE path: when an existing row matches, fill ONLY the columns whose current value is NULL or empty — never overwrites populated values. protects manual edits + enrichment-worker writes the same way scanner UPDATEs preserve enrichment columns. f-string column names are validated against an allowlist (`_SOULSYNC_FILLABLE_COLUMNS`) before interpolation — defensive against accidental misuse adding columns without an allowlist update. 4 new tests pin: album duration uses sum not single-track, re-import fills empty thumb + genres on existing artist row, re-import does NOT clobber populated values, re-import fills empty source-id columns when later import has them.', page: 'import' },
         { title: 'Auto-Import: Genre Tags Land On The Artists Row + ISRC/MBID Type Hardening', desc: 'small followup to the standalone-library parity commit. (1) auto-import now reads the GENRE tag from each matched audio file (mutagen easy mode, supports flac / mp3 / m4a) and aggregates the deduped set across the album onto the new artists row\'s genres column. matches what soulsync_client._scan_transfer would have written if you\'d done a fresh deep scan after the import — your imported artists no longer feel hollow compared to plex / jellyfin / navidrome scans. dedup is case-insensitive but preserves original casing + insertion order so the json column reads naturally ("Hip-Hop, Rap, Trap" not "hip-hop, rap, trap"). (2) defensive `str()` cast on the worker\'s isrc + mbid extraction. metadata source clients all coerce to string today via `_build_album_track_entry`, but if a future source ever returned int / None for either id the side-effects layer would crash on `.strip()`. cheap insurance. 3 new tests pin: genre aggregation produces deduped insertion-order list, empty when no GENRE tags, isrc/mbid hostile-type input (int, None) coerced to safe string before propagation.', page: 'import' },
         { title: 'Auto-Import: SoulSync Standalone Library Now Gets Full Server-Quality Rows', desc: 'soulsync standalone is meant to be a full replacement for plex / jellyfin / navidrome — the imported tracks should land in the db with the same field richness a media server scan would write. they weren\'t. the auto-import context dict (the payload it handed to the post-process pipeline) had no `source` field anywhere, so `record_soulsync_library_entry` couldn\'t pick the right source-id column on the new tracks/albums/artists rows. result: every auto-imported track landed with NULL on `spotify_track_id` / `deezer_id` / `itunes_track_id` / etc. — watchlist scans (which match by stable source IDs) couldn\'t recognise these tracks as already in library and would re-download them on the next pass. fixed by threading `identification[\'source\']` onto the top-level context, plus per-recording IDs (`isrc`, `musicbrainz_recording_id`) onto track_info so picard-tagged libraries land their per-recording metadata directly. also extracted the artist source ID from the metadata source\'s search response (`_search_metadata_source` and `_search_single_track` now pull `best_result.artists[0][\'id\']`) and threaded it through identification → context → standalone library write, so the artists row finally gets its source-ID column populated instead of staying NULL forever. also added `_download_username=\'auto_import\'` so library history shows "Auto-Import" instead of mislabeling every staging import as "Soulseek" (the fallback default), and an "auto_import" → "Auto-Import" mapping in the source-map dicts at side_effects.py to honour it. record_soulsync_library_entry tracks INSERT now also writes `musicbrainz_recording_id` + `isrc` columns directly (matches the navidrome scanner write path). 17 new tests pin: auto-import context carries source for every metadata source (spotify/deezer/itunes/discogs), `_download_username=auto_import`, isrc + mbid pass-through to track_info, album-id back-reference on track_info, artist source-id flows from identification → context (and not from album_id, the prior copy-paste bug), `_search_metadata_source` extracts artist_id from search response, soulsync library writes mbid + isrc to dedicated columns, deezer source maps to deezer_id column, library history + provenance use Auto-Import / auto_import labels.', page: 'import' },