Deezer search: drop advanced-syntax at endpoint, free-text + rerank wins

Live-API verification revealed advanced-syntax queries hurt more
than they help on this endpoint. Switching the import-modal Deezer
search back to free-text + local rerank.

# What live testing showed

Hit Deezer's public API with both query forms for the issue #534
case (`Dirty White Boy` + `Foreigner`):

**Free-text (`q=Dirty White Boy Foreigner`):**
- Returns 21 results
- Real Foreigner Head Games studio cut at #1
- Live versions at #2-10
- Karaoke / cover variants at #11-15

**Advanced (`q=track:"Dirty White Boy" artist:"Foreigner"`):**
- Returns 12 results
- "(2008 Remaster)" at #1 — canonical Head Games cut MISSING from
  top 8 entirely
- Live + alt-album versions follow

Advanced syntax DOES filter karaoke at the API level (none in the
12-result set vs. 5 at positions 11-15 in free-text), but it has
its own ranking bias that surfaces remasters / "Best Of" cuts
ahead of the canonical recording. Net regression for the user-
facing goal.

# Fix

1. Endpoint reverts to free-text query with local rerank applied.
2. Local rerank gains "remaster" / "remastered" / "reissue"
   patterns under VARIANT_TAG_PATTERNS (soft 0.4× penalty — user
   may want them but they shouldn't outrank the original).
3. Client kwarg support (`track=` / `artist=` / `album=`) preserved
   for future opt-in callers (e.g. exact-match flows where API-
   level filtering matters more than ranking).

# Verified end-to-end against live Deezer API

Re-ran the exact #534 case through the live API + new rerank.
Top 15 results post-rerank:

1. Dirty White Boy — Foreigner — Head Games  ← REAL CUT AT TOP
2-10. Various Live versions
11-15. Karaoke / cover / tribute variants  ← BURIED

Real Foreigner Head Games studio cut at #1, exactly the user's
ask.

# Tests

- `test_relevance.py` — variant tag patterns extended; existing
  tests still pass (50 tests).
- `test_search_match_endpoints.py::test_joins_track_and_artist_into_free_text_query`
  — replaces `test_passes_track_and_artist_as_kwargs`; verifies
  endpoint sends free-text join, NOT field-scoped kwargs (the
  prior test asserted the wrong direction now).
- Karaoke-burying assertion at the endpoint still pins the
  user-visible behaviour.
- Client kwarg path tests untouched (still pin advanced-syntax
  construction for future opt-in callers).

# Verification

- 75 relevance + endpoint + query tests pass
- 2445 full suite passes
- Ruff clean
- Live Deezer API shows real cut at #1 post-rerank
pull/539/head
Broque Thomas 4 days ago
parent 59992d42a8
commit 402d851cac

@ -98,6 +98,13 @@ VARIANT_TAG_PATTERNS = (
'club mix',
'a cappella',
'acapella',
# Remaster — softer than karaoke (user might want it) but still
# demoted vs. the original recording. Verified against live Deezer
# API behaviour where "(2008 Remaster)" outranks the Head Games
# original on `track:"X" artist:"Y"` advanced queries.
'remaster',
'remastered',
'reissue',
)
VARIANT_TAG_PENALTY = 0.4

@ -52,10 +52,14 @@ def fake_track():
class TestDeezerSearchTracksEndpoint:
def test_passes_track_and_artist_as_kwargs(self, app_test_client, fake_track):
"""Endpoint must call client.search_tracks(track=..., artist=...)
NOT join into a single positional query. Field-scoped path
is what triggers Deezer's advanced search syntax."""
def test_joins_track_and_artist_into_free_text_query(self, app_test_client, fake_track):
"""Endpoint sends the joined `track artist` string as Deezer's
free-text `q`. Field-scoped advanced-syntax queries were
initially considered, but live-API testing showed Deezer's
advanced-query ranking misses canonical recordings on some
searches. Free-text + local rerank is the more reliable
combination at this endpoint. Client-level kwarg support
remains for future opt-in callers."""
fake_client = MagicMock()
fake_client.search_tracks.return_value = [
fake_track('Dirty White Boy', 'Foreigner'),
@ -65,10 +69,9 @@ class TestDeezerSearchTracksEndpoint:
'/api/deezer/search_tracks?track=Dirty+White+Boy&artist=Foreigner&limit=20'
)
assert resp.status_code == 200
# Field-scoped kwargs reach the client
call = fake_client.search_tracks.call_args
assert call.kwargs.get('track') == 'Dirty White Boy'
assert call.kwargs.get('artist') == 'Foreigner'
# First positional arg is the joined free-text query
assert call.args[0] == 'Dirty White Boy Foreigner'
assert call.kwargs.get('limit') == 20
def test_reranks_results_burying_karaoke(self, app_test_client, fake_track):

@ -19846,20 +19846,25 @@ def search_deezer_tracks():
"""Search for tracks on Deezer — used by the import-modal "Search
for Match" dialog and by discovery-fix flows.
Field-scoped path (`track=` + `artist=`) builds Deezer's advanced
search syntax `track:"X" artist:"Y"`. Massively tighter relevance
than the free-text path because the API matches each term in the
right field instead of fuzzy-matching across title / lyrics /
artist / album / contributors. Without it, Deezer's ranking
buries the canonical recording under karaoke / cover / "originally
performed by" variants — see issue #534.
Results then go through ``core.metadata.relevance.rerank_tracks``
which penalises any cover / karaoke / tribute / re-recorded
patterns we can detect locally + boosts exact-artist-match. Two
layers stacked because Deezer's ranking is rough even on advanced
queries (compilations rank well by global popularity); the local
rerank is the safety net.
Issue #534: Deezer's free-text ranking buries canonical recordings
under karaoke / cover / "originally performed by" variants in some
regions. The fix here is the local relevance rerank
(``core.metadata.relevance.rerank_tracks``) which penalises cover /
karaoke / tribute / remaster patterns + boosts exact-artist-match.
Catches the user-reported case (karaoke at top) and the inverse
(live-version compilation noise) regardless of which Deezer
region's ranking the user hits.
Field-scoped advanced-syntax queries (`track:"X" artist:"Y"`) were
initially considered as a second tightening layer, but live-API
testing showed Deezer's advanced-query ranking has its own bias —
e.g. it surfaced a 2008 Remaster on `track:"Dirty White Boy"
artist:"Foreigner"` and didn't return the canonical Head Games cut
at all. The free-text path actually returns the canonical
recording first more reliably, so this endpoint stays free-text +
local rerank. Client-level kwarg support remains in
``DeezerClient.search_tracks`` for future callers (e.g. exact-match
flows where filtering is more important than ranking).
"""
try:
track_q = request.args.get('track', '').strip()
@ -19867,20 +19872,18 @@ def search_deezer_tracks():
legacy_query = request.args.get('query', '').strip()
limit = int(request.args.get('limit', 20))
client = _get_deezer_client()
if track_q or artist_q:
# Field-scoped — pass kwargs through so the client builds
# the advanced-syntax query.
tracks = client.search_tracks(track=track_q or None,
artist=artist_q or None,
limit=limit)
query = ' '.join(p for p in (track_q, artist_q) if p)
elif legacy_query:
tracks = client.search_tracks(legacy_query, limit=limit)
query = legacy_query
else:
return jsonify({"error": "Query parameter is required"}), 400
client = _get_deezer_client()
tracks = client.search_tracks(query, limit=limit)
# Local rerank — only when we have an expected title/artist
# signal. Free-text searches have nothing to rank against.
# signal. Free-text-only searches have nothing to rank against.
if track_q or artist_q:
from core.metadata.relevance import rerank_tracks
tracks = rerank_tracks(

@ -3416,7 +3416,7 @@ const WHATS_NEW = {
'2.4.3': [
// --- post-release patch work on the 2.4.3 line — entries hidden by _getLatestWhatsNewVersion until the build version bumps ---
{ date: 'Unreleased — 2.4.3 patch work' },
{ title: 'Search For Match: No More Karaoke / Cover / "Originally Performed By" Junk At The Top', desc: 'github issue #534 (radoslav-orlov): typing "dirty white boy" + "foreigner" into the import-modal "search for match" dialog returned re-recordings, karaoke versions, "originally performed by" compilations, and tribute-band cuts ranked above the actual foreigner studio recording. user had to scroll past 5+ junk results before finding the canonical track. cause: deezer endpoint joined track + artist into a single free-text string and passed that to deezer\'s `q` param — which fuzzy-matches across title / lyrics / artist / album / contributors and orders by global popularity, so anything that appears across many compilations outranks the canonical track. fix has three layers. (1) deezer client now supports field-scoped kwargs (`track="X" artist="Y"`) which build deezer\'s advanced search syntax `track:"X" artist:"Y"` — massively tighter relevance because each term matches the right field instead of fuzzy-matching everywhere. backward compat preserved: legacy free-text callers still work. (2) new `core/metadata/relevance.py` helper reranks results locally with cover/karaoke/tribute/re-recorded penalties + exact-artist-match boost + variant-tag (live/acoustic/remix) penalty (skipped when user explicitly typed the variant). applied at the deezer + itunes + spotify search-tracks endpoints so all three sources behave consistently from the user\'s perspective. variant penalty only fires when user did NOT ask for the variant — searching "track (live)" still ranks live versions correctly. (3) safety net: when deezer\'s advanced query returns 0 results (sometimes happens on artist name variants like "foreigner [us]" or non-canonical title spellings), client falls back to free-text search so the user never sees an empty result list when the API would have returned the prior less-relevant set. caller-side rerank still tightens whatever the fallback returns. 75 new tests pin every component: pattern detection (10 cover patterns, 8 variant patterns, 3 fields), score composition (real-cut > karaoke > re-recorded), the issue #534 screenshot reproduced as a regression test, deezer client query construction + free-text fallback path, three search-modal endpoints end-to-end.', page: 'import' },
{ title: 'Search For Match: No More Karaoke / Cover / "Originally Performed By" Junk At The Top', desc: 'github issue #534 (radoslav-orlov): typing "dirty white boy" + "foreigner" into the import-modal "search for match" dialog returned karaoke versions, "originally performed by" compilations, and tribute-band cuts ranked above the actual foreigner studio recording in some regions. user had to scroll past 5+ junk results before finding the canonical track. fix: new `core/metadata/relevance.py` helper reranks results locally with cover/karaoke/tribute/re-recorded penalties (multiplier 0.05× — effectively buries) + exact-artist-match boost (1.5×) + variant-tag (live/acoustic/remix/remaster) penalty (0.4×, skipped when user explicitly typed the variant — searching "track (live)" still ranks live versions correctly). applied at the deezer + itunes + spotify search-tracks endpoints so all three sources behave consistently. validated against live deezer api with the actual #534 query: real foreigner head games cut now lands at #1, live versions follow, karaoke / cover / tribute variants drop to positions 11-15. deezer client also gained optional field-scoped query kwargs (`track="X" artist="Y"`) that build deezer\'s advanced search syntax `track:"X" artist:"Y"` for future opt-in callers (e.g. exact-match flows where api-level filtering is more important than ranking) — kept in client but NOT used at the import-modal endpoint after live testing showed the advanced syntax has its own ranking bias (surfaced "(2008 remaster)" instead of the canonical recording). free-text + local rerank is the more reliable combination here. 75 new tests pin every scoring component, pattern detection (13 cover patterns, 11 variant patterns, 3 fields), score composition (real-cut > karaoke > remaster > re-recorded), the issue #534 screenshot reproduced as a regression test, deezer client query construction + free-text fallback safety net.', page: 'import' },
{ title: 'Auto-Import: Album Duration Is Album Total + Re-Imports Fill Metadata Gaps', desc: 'two more parity gaps closed in the soulsync standalone library write path. (1) album row\'s `duration` column was being written with the FIRST imported track\'s duration instead of the album total — pre-existing bug that survived the prior parity commit. soulsync_client deep scan computes `sum(t.duration for t in self._tracks)` for each album; auto-import now mirrors that by computing the sum across every matched track in the worker and threading it through context to the album INSERT. (2) `record_soulsync_library_entry` was insert-only on artists + albums — once a row existed (matched by id OR name fallback), subsequent imports of the same artist or album skipped completely. meant: artist genres / thumb / source-id reflected ONLY whatever the FIRST imported album supplied, never refreshing as more albums by that artist landed (ten more deezer/spotify imports later, artist row still had whatever the first random import wrote). new conservative UPDATE path: when an existing row matches, fill ONLY the columns whose current value is NULL or empty — never overwrites populated values. protects manual edits + enrichment-worker writes the same way scanner UPDATEs preserve enrichment columns. f-string column names are validated against an allowlist (`_SOULSYNC_FILLABLE_COLUMNS`) before interpolation — defensive against accidental misuse adding columns without an allowlist update. 4 new tests pin: album duration uses sum not single-track, re-import fills empty thumb + genres on existing artist row, re-import does NOT clobber populated values, re-import fills empty source-id columns when later import has them.', page: 'import' },
{ title: 'Auto-Import: Genre Tags Land On The Artists Row + ISRC/MBID Type Hardening', desc: 'small followup to the standalone-library parity commit. (1) auto-import now reads the GENRE tag from each matched audio file (mutagen easy mode, supports flac / mp3 / m4a) and aggregates the deduped set across the album onto the new artists row\'s genres column. matches what soulsync_client._scan_transfer would have written if you\'d done a fresh deep scan after the import — your imported artists no longer feel hollow compared to plex / jellyfin / navidrome scans. dedup is case-insensitive but preserves original casing + insertion order so the json column reads naturally ("Hip-Hop, Rap, Trap" not "hip-hop, rap, trap"). (2) defensive `str()` cast on the worker\'s isrc + mbid extraction. metadata source clients all coerce to string today via `_build_album_track_entry`, but if a future source ever returned int / None for either id the side-effects layer would crash on `.strip()`. cheap insurance. 3 new tests pin: genre aggregation produces deduped insertion-order list, empty when no GENRE tags, isrc/mbid hostile-type input (int, None) coerced to safe string before propagation.', page: 'import' },
{ title: 'Auto-Import: SoulSync Standalone Library Now Gets Full Server-Quality Rows', desc: 'soulsync standalone is meant to be a full replacement for plex / jellyfin / navidrome — the imported tracks should land in the db with the same field richness a media server scan would write. they weren\'t. the auto-import context dict (the payload it handed to the post-process pipeline) had no `source` field anywhere, so `record_soulsync_library_entry` couldn\'t pick the right source-id column on the new tracks/albums/artists rows. result: every auto-imported track landed with NULL on `spotify_track_id` / `deezer_id` / `itunes_track_id` / etc. — watchlist scans (which match by stable source IDs) couldn\'t recognise these tracks as already in library and would re-download them on the next pass. fixed by threading `identification[\'source\']` onto the top-level context, plus per-recording IDs (`isrc`, `musicbrainz_recording_id`) onto track_info so picard-tagged libraries land their per-recording metadata directly. also extracted the artist source ID from the metadata source\'s search response (`_search_metadata_source` and `_search_single_track` now pull `best_result.artists[0][\'id\']`) and threaded it through identification → context → standalone library write, so the artists row finally gets its source-ID column populated instead of staying NULL forever. also added `_download_username=\'auto_import\'` so library history shows "Auto-Import" instead of mislabeling every staging import as "Soulseek" (the fallback default), and an "auto_import" → "Auto-Import" mapping in the source-map dicts at side_effects.py to honour it. record_soulsync_library_entry tracks INSERT now also writes `musicbrainz_recording_id` + `isrc` columns directly (matches the navidrome scanner write path). 17 new tests pin: auto-import context carries source for every metadata source (spotify/deezer/itunes/discogs), `_download_username=auto_import`, isrc + mbid pass-through to track_info, album-id back-reference on track_info, artist source-id flows from identification → context (and not from album_id, the prior copy-paste bug), `_search_metadata_source` extracts artist_id from search response, soulsync library writes mbid + isrc to dedicated columns, deezer source maps to deezer_id column, library history + provenance use Auto-Import / auto_import labels.', page: 'import' },

Loading…
Cancel
Save