Deezer search: drop advanced-syntax at endpoint, free-text + rerank wins

Live-API verification revealed advanced-syntax queries hurt more than they help on this endpoint. Switching the import-modal Deezer search back to free-text + local rerank.

# What live testing showed

Hit Deezer's public API with both query forms for the issue #534 case (`Dirty White Boy` + `Foreigner`):

**Free-text (`q=Dirty White Boy Foreigner`):**
- Returns 21 results
- Real Foreigner Head Games studio cut at #1
- Live versions at #2-10
- Karaoke / cover variants at #11-15

**Advanced (`q=track:"Dirty White Boy" artist:"Foreigner"`):**
- Returns 12 results
- "(2008 Remaster)" at #1; canonical Head Games cut MISSING from top 8 entirely
- Live + alt-album versions follow

Advanced syntax DOES filter karaoke at the API level (none in the 12-result set vs. 5 at positions 11-15 in free-text), but it has its own ranking bias that surfaces remasters / "Best Of" cuts ahead of the canonical recording. Net regression for the user-facing goal.

# Fix

1. Endpoint reverts to free-text query with local rerank applied.
2. Local rerank gains "remaster" / "remastered" / "reissue" patterns under VARIANT_TAG_PATTERNS (soft 0.4× penalty: user may want them but they shouldn't outrank the original).
3. Client kwarg support (`track=` / `artist=` / `album=`) preserved for future opt-in callers (e.g. exact-match flows where API-level filtering matters more than ranking).

# Verified end-to-end against live Deezer API

Re-ran the exact #534 case through the live API + new rerank. Top 15 results post-rerank:

  1. Dirty White Boy - Foreigner - Head Games ← REAL CUT AT TOP
  2-10. Various Live versions
  11-15. Karaoke / cover / tribute variants ← BURIED

Real Foreigner Head Games studio cut at #1, exactly the user's ask.

# Tests

- `test_relevance.py`: variant tag patterns extended; existing tests still pass (50 tests).
- `test_search_match_endpoints.py::test_joins_track_and_artist_into_free_text_query`: replaces `test_passes_track_and_artist_as_kwargs`; verifies endpoint sends free-text join, NOT field-scoped kwargs (the prior test asserted the wrong direction now).
- Karaoke-burying assertion at the endpoint still pins the user-visible behaviour.
- Client kwarg path tests untouched (still pin advanced-syntax construction for future opt-in callers).

# Verification

- 75 relevance + endpoint + query tests pass
- 2445 full suite passes
- Ruff clean
- Live Deezer API shows real cut at #1 post-rerank
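
For readers who want the free-text + local rerank idea in one place, here is a minimal sketch. It is not the code in `core/metadata/relevance.py`: the `Track` dataclass, `build_free_text_query`, `sketch_score`, `sketch_rerank`, and the shortened pattern tuples are hypothetical names made up for illustration. The multipliers (0.05× cover, 0.4× variant tag, 1.5× exact artist) are the ones quoted in this commit's changelog entry; combining them by plain multiplication and breaking ties by Deezer's original order are assumptions.

```python
import re
from dataclasses import dataclass

# Hypothetical, shortened pattern lists; the real ones are longer.
COVER_PATTERNS = ("karaoke", "tribute", "originally performed by")
VARIANT_TAG_PATTERNS = ("live", "remaster", "remastered", "reissue")

COVER_PENALTY = 0.05       # multiplier quoted in the changelog entry
VARIANT_TAG_PENALTY = 0.4  # same value as the constant shown in the diff
EXACT_ARTIST_BOOST = 1.5   # multiplier quoted in the changelog entry


@dataclass
class Track:
    title: str
    artist: str
    album: str = ""


def build_free_text_query(track_q: str, artist_q: str) -> str:
    """Join the non-empty parts into one free-text `q` string."""
    return " ".join(p for p in (track_q, artist_q) if p)


def _has(pattern: str, text: str) -> bool:
    """Whole-word, case-insensitive match of a pattern inside text."""
    return re.search(r"\b" + re.escape(pattern) + r"\b", text, re.IGNORECASE) is not None


def sketch_score(track: Track, expected_title: str, expected_artist: str) -> float:
    score = 1.0  # assumption: every result starts equal
    haystack = f"{track.title} {track.artist} {track.album}"
    if any(_has(p, haystack) for p in COVER_PATTERNS):
        score *= COVER_PENALTY
    # Variant penalty is skipped for variants the user explicitly typed.
    unwanted = [p for p in VARIANT_TAG_PATTERNS if not _has(p, expected_title)]
    if any(_has(p, f"{track.title} {track.album}") for p in unwanted):
        score *= VARIANT_TAG_PENALTY
    if expected_artist and track.artist.strip().casefold() == expected_artist.strip().casefold():
        score *= EXACT_ARTIST_BOOST
    return score


def sketch_rerank(tracks, expected_title: str, expected_artist: str):
    """Sort by score, highest first; sorted() is stable, so ties keep API order."""
    return sorted(
        tracks,
        key=lambda t: sketch_score(t, expected_title, expected_artist),
        reverse=True,
    )
```

Under these assumptions, a clean exact-artist hit scores 1.5, a remaster or live cut by the same artist scores 0.6, and a karaoke cut by another artist scores 0.05, which reproduces the ordering described above: studio cut first, variants next, karaoke last.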
2 months ago · 402d851cac
parent 59992d42a8
commit 402d851cac
4 changed files with 43 additions and 30 deletions
--- a/core/metadata/relevance.py
+++ b/core/metadata/relevance.py
@@ -98,6 +98,13 @@ VARIANT_TAG_PATTERNS = (
    'club mix',
    'a cappella',
    'acapella',
+    # Remaster — softer than karaoke (user might want it) but still
+    # demoted vs. the original recording. Verified against live Deezer
+    # API behaviour where "(2008 Remaster)" outranks the Head Games
+    # original on `track:"X" artist:"Y"` advanced queries.
+    'remaster',
+    'remastered',
+    'reissue',
 )

 VARIANT_TAG_PENALTY = 0.4
--- a/tests/imports/test_search_match_endpoints.py
+++ b/tests/imports/test_search_match_endpoints.py
@@ -52,10 +52,14 @@ def fake_track():


 class TestDeezerSearchTracksEndpoint:
-    def test_passes_track_and_artist_as_kwargs(self, app_test_client, fake_track):
-        """Endpoint must call client.search_tracks(track=..., artist=...)
-        — NOT join into a single positional query. Field-scoped path
-        is what triggers Deezer's advanced search syntax."""
+    def test_joins_track_and_artist_into_free_text_query(self, app_test_client, fake_track):
+        """Endpoint sends the joined `track artist` string as Deezer's
+        free-text `q`. Field-scoped advanced-syntax queries were
+        initially considered, but live-API testing showed Deezer's
+        advanced-query ranking misses canonical recordings on some
+        searches. Free-text + local rerank is the more reliable
+        combination at this endpoint. Client-level kwarg support
+        remains for future opt-in callers."""
        fake_client = MagicMock()
        fake_client.search_tracks.return_value = [
            fake_track('Dirty White Boy', 'Foreigner'),
@@ -65,10 +69,9 @@ class TestDeezerSearchTracksEndpoint:
                '/api/deezer/search_tracks?track=Dirty+White+Boy&artist=Foreigner&limit=20'
            )
        assert resp.status_code == 200
-        # Field-scoped kwargs reach the client
        call = fake_client.search_tracks.call_args
-        assert call.kwargs.get('track') == 'Dirty White Boy'
-        assert call.kwargs.get('artist') == 'Foreigner'
+        # First positional arg is the joined free-text query
+        assert call.args[0] == 'Dirty White Boy Foreigner'
        assert call.kwargs.get('limit') == 20

    def test_reranks_results_burying_karaoke(self, app_test_client, fake_track):
--- a/web_server.py
+++ b/web_server.py
@@ -19846,20 +19846,25 @@ def search_deezer_tracks():
    """Search for tracks on Deezer — used by the import-modal "Search
    for Match" dialog and by discovery-fix flows.

-    Field-scoped path (`track=` + `artist=`) builds Deezer's advanced
-    search syntax `track:"X" artist:"Y"`. Massively tighter relevance
-    than the free-text path because the API matches each term in the
-    right field instead of fuzzy-matching across title / lyrics /
-    artist / album / contributors. Without it, Deezer's ranking
-    buries the canonical recording under karaoke / cover / "originally
-    performed by" variants — see issue #534.
-
-    Results then go through ``core.metadata.relevance.rerank_tracks``
-    which penalises any cover / karaoke / tribute / re-recorded
-    patterns we can detect locally + boosts exact-artist-match. Two
-    layers stacked because Deezer's ranking is rough even on advanced
-    queries (compilations rank well by global popularity); the local
-    rerank is the safety net.
+    Issue #534: Deezer's free-text ranking buries canonical recordings
+    under karaoke / cover / "originally performed by" variants in some
+    regions. The fix here is the local relevance rerank
+    (``core.metadata.relevance.rerank_tracks``) which penalises cover /
+    karaoke / tribute / remaster patterns + boosts exact-artist-match.
+    Catches the user-reported case (karaoke at top) and the inverse
+    (live-version compilation noise) regardless of which Deezer
+    region's ranking the user hits.
+
+    Field-scoped advanced-syntax queries (`track:"X" artist:"Y"`) were
+    initially considered as a second tightening layer, but live-API
+    testing showed Deezer's advanced-query ranking has its own bias —
+    e.g. it surfaced a 2008 Remaster on `track:"Dirty White Boy"
+    artist:"Foreigner"` and didn't return the canonical Head Games cut
+    at all. The free-text path actually returns the canonical
+    recording first more reliably, so this endpoint stays free-text +
+    local rerank. Client-level kwarg support remains in
+    ``DeezerClient.search_tracks`` for future callers (e.g. exact-match
+    flows where filtering is more important than ranking).
    """
    try:
        track_q = request.args.get('track', '').strip()
@@ -19867,20 +19872,18 @@ def search_deezer_tracks():
        legacy_query = request.args.get('query', '').strip()
        limit = int(request.args.get('limit', 20))

-        client = _get_deezer_client()
        if track_q or artist_q:
-            # Field-scoped — pass kwargs through so the client builds
-            # the advanced-syntax query.
-            tracks = client.search_tracks(track=track_q or None,
-                                          artist=artist_q or None,
-                                          limit=limit)
+            query = ' '.join(p for p in (track_q, artist_q) if p)
        elif legacy_query:
-            tracks = client.search_tracks(legacy_query, limit=limit)
+            query = legacy_query
        else:
            return jsonify({"error": "Query parameter is required"}), 400

+        client = _get_deezer_client()
+        tracks = client.search_tracks(query, limit=limit)
+
        # Local rerank — only when we have an expected title/artist
-        # signal. Free-text searches have nothing to rank against.
+        # signal. Free-text-only searches have nothing to rank against.
        if track_q or artist_q:
            from core.metadata.relevance import rerank_tracks
            tracks = rerank_tracks(
--- a/webui/static/helper.js
+++ b/webui/static/helper.js
@@ -3416,7 +3416,7 @@ const WHATS_NEW = {
    '2.4.3': [
        // --- post-release patch work on the 2.4.3 line — entries hidden by _getLatestWhatsNewVersion until the build version bumps ---
        { date: 'Unreleased — 2.4.3 patch work' },
-        { title: 'Search For Match: No More Karaoke / Cover / "Originally Performed By" Junk At The Top', desc: 'github issue #534 (radoslav-orlov): typing "dirty white boy" + "foreigner" into the import-modal "search for match" dialog returned re-recordings, karaoke versions, "originally performed by" compilations, and tribute-band cuts ranked above the actual foreigner studio recording. user had to scroll past 5+ junk results before finding the canonical track. cause: deezer endpoint joined track + artist into a single free-text string and passed that to deezer\'s `q` param — which fuzzy-matches across title / lyrics / artist / album / contributors and orders by global popularity, so anything that appears across many compilations outranks the canonical track. fix has three layers. (1) deezer client now supports field-scoped kwargs (`track="X" artist="Y"`) which build deezer\'s advanced search syntax `track:"X" artist:"Y"` — massively tighter relevance because each term matches the right field instead of fuzzy-matching everywhere. backward compat preserved: legacy free-text callers still work. (2) new `core/metadata/relevance.py` helper reranks results locally with cover/karaoke/tribute/re-recorded penalties + exact-artist-match boost + variant-tag (live/acoustic/remix) penalty (skipped when user explicitly typed the variant). applied at the deezer + itunes + spotify search-tracks endpoints so all three sources behave consistently from the user\'s perspective. variant penalty only fires when user did NOT ask for the variant — searching "track (live)" still ranks live versions correctly. (3) safety net: when deezer\'s advanced query returns 0 results (sometimes happens on artist name variants like "foreigner [us]" or non-canonical title spellings), client falls back to free-text search so the user never sees an empty result list when the API would have returned the prior less-relevant set. caller-side rerank still tightens whatever the fallback returns. 
75 new tests pin every component: pattern detection (10 cover patterns, 8 variant patterns, 3 fields), score composition (real-cut > karaoke > re-recorded), the issue #534 screenshot reproduced as a regression test, deezer client query construction + free-text fallback path, three search-modal endpoints end-to-end.', page: 'import' },
+        { title: 'Search For Match: No More Karaoke / Cover / "Originally Performed By" Junk At The Top', desc: 'github issue #534 (radoslav-orlov): typing "dirty white boy" + "foreigner" into the import-modal "search for match" dialog returned karaoke versions, "originally performed by" compilations, and tribute-band cuts ranked above the actual foreigner studio recording in some regions. user had to scroll past 5+ junk results before finding the canonical track. fix: new `core/metadata/relevance.py` helper reranks results locally with cover/karaoke/tribute/re-recorded penalties (multiplier 0.05× — effectively buries) + exact-artist-match boost (1.5×) + variant-tag (live/acoustic/remix/remaster) penalty (0.4×, skipped when user explicitly typed the variant — searching "track (live)" still ranks live versions correctly). applied at the deezer + itunes + spotify search-tracks endpoints so all three sources behave consistently. validated against live deezer api with the actual #534 query: real foreigner head games cut now lands at #1, live versions follow, karaoke / cover / tribute variants drop to positions 11-15. deezer client also gained optional field-scoped query kwargs (`track="X" artist="Y"`) that build deezer\'s advanced search syntax `track:"X" artist:"Y"` for future opt-in callers (e.g. exact-match flows where api-level filtering is more important than ranking) — kept in client but NOT used at the import-modal endpoint after live testing showed the advanced syntax has its own ranking bias (surfaced "(2008 remaster)" instead of the canonical recording). free-text + local rerank is the more reliable combination here. 75 new tests pin every scoring component, pattern detection (13 cover patterns, 11 variant patterns, 3 fields), score composition (real-cut > karaoke > remaster > re-recorded), the issue #534 screenshot reproduced as a regression test, deezer client query construction + free-text fallback safety net.', page: 'import' },
        { title: 'Auto-Import: Album Duration Is Album Total + Re-Imports Fill Metadata Gaps', desc: 'two more parity gaps closed in the soulsync standalone library write path. (1) album row\'s `duration` column was being written with the FIRST imported track\'s duration instead of the album total — pre-existing bug that survived the prior parity commit. soulsync_client deep scan computes `sum(t.duration for t in self._tracks)` for each album; auto-import now mirrors that by computing the sum across every matched track in the worker and threading it through context to the album INSERT. (2) `record_soulsync_library_entry` was insert-only on artists + albums — once a row existed (matched by id OR name fallback), subsequent imports of the same artist or album skipped completely. meant: artist genres / thumb / source-id reflected ONLY whatever the FIRST imported album supplied, never refreshing as more albums by that artist landed (ten more deezer/spotify imports later, artist row still had whatever the first random import wrote). new conservative UPDATE path: when an existing row matches, fill ONLY the columns whose current value is NULL or empty — never overwrites populated values. protects manual edits + enrichment-worker writes the same way scanner UPDATEs preserve enrichment columns. f-string column names are validated against an allowlist (`_SOULSYNC_FILLABLE_COLUMNS`) before interpolation — defensive against accidental misuse adding columns without an allowlist update. 4 new tests pin: album duration uses sum not single-track, re-import fills empty thumb + genres on existing artist row, re-import does NOT clobber populated values, re-import fills empty source-id columns when later import has them.', page: 'import' },
        { title: 'Auto-Import: Genre Tags Land On The Artists Row + ISRC/MBID Type Hardening', desc: 'small followup to the standalone-library parity commit. (1) auto-import now reads the GENRE tag from each matched audio file (mutagen easy mode, supports flac / mp3 / m4a) and aggregates the deduped set across the album onto the new artists row\'s genres column. matches what soulsync_client._scan_transfer would have written if you\'d done a fresh deep scan after the import — your imported artists no longer feel hollow compared to plex / jellyfin / navidrome scans. dedup is case-insensitive but preserves original casing + insertion order so the json column reads naturally ("Hip-Hop, Rap, Trap" not "hip-hop, rap, trap"). (2) defensive `str()` cast on the worker\'s isrc + mbid extraction. metadata source clients all coerce to string today via `_build_album_track_entry`, but if a future source ever returned int / None for either id the side-effects layer would crash on `.strip()`. cheap insurance. 3 new tests pin: genre aggregation produces deduped insertion-order list, empty when no GENRE tags, isrc/mbid hostile-type input (int, None) coerced to safe string before propagation.', page: 'import' },
        { title: 'Auto-Import: SoulSync Standalone Library Now Gets Full Server-Quality Rows', desc: 'soulsync standalone is meant to be a full replacement for plex / jellyfin / navidrome — the imported tracks should land in the db with the same field richness a media server scan would write. they weren\'t. the auto-import context dict (the payload it handed to the post-process pipeline) had no `source` field anywhere, so `record_soulsync_library_entry` couldn\'t pick the right source-id column on the new tracks/albums/artists rows. result: every auto-imported track landed with NULL on `spotify_track_id` / `deezer_id` / `itunes_track_id` / etc. — watchlist scans (which match by stable source IDs) couldn\'t recognise these tracks as already in library and would re-download them on the next pass. fixed by threading `identification[\'source\']` onto the top-level context, plus per-recording IDs (`isrc`, `musicbrainz_recording_id`) onto track_info so picard-tagged libraries land their per-recording metadata directly. also extracted the artist source ID from the metadata source\'s search response (`_search_metadata_source` and `_search_single_track` now pull `best_result.artists[0][\'id\']`) and threaded it through identification → context → standalone library write, so the artists row finally gets its source-ID column populated instead of staying NULL forever. also added `_download_username=\'auto_import\'` so library history shows "Auto-Import" instead of mislabeling every staging import as "Soulseek" (the fallback default), and an "auto_import" → "Auto-Import" mapping in the source-map dicts at side_effects.py to honour it. record_soulsync_library_entry tracks INSERT now also writes `musicbrainz_recording_id` + `isrc` columns directly (matches the navidrome scanner write path). 
17 new tests pin: auto-import context carries source for every metadata source (spotify/deezer/itunes/discogs), `_download_username=auto_import`, isrc + mbid pass-through to track_info, album-id back-reference on track_info, artist source-id flows from identification → context (and not from album_id, the prior copy-paste bug), `_search_metadata_source` extracts artist_id from search response, soulsync library writes mbid + isrc to dedicated columns, deezer source maps to deezer_id column, library history + provenance use Auto-Import / auto_import labels.', page: 'import' },