SoulSync

Commit Graph

Author	SHA1	Message	Date
Broque Thomas	24c2d75c6d	Make extract_external_ids recognize all source-tagging conventions	4 weeks ago
Broque Thomas	ecb8939c80	Match library tracks by external IDs before fuzzy in watchlist scan	4 weeks ago

Broque Thomas

24c2d75c6d

Make extract_external_ids recognize all source-tagging conventions

Smoke-testing the just-merged provenance PR against live logs showed
that the new ID-match block was silently doing nothing: no
[ExtID Match] / [Provenance Match] log lines appeared even though the
code path was live. Tracing exposed two related gaps in
extract_external_ids' source detection:

1. **Underscore-prefixed key.** Deezer / Discogs / Hydrabase clients
   tag normalized track dicts with ``_source`` (underscore prefix —
   convention used in 8+ places across core/). The extractor only
   looked for ``provider`` and ``source``, so Deezer-sourced tracks
   silently returned no IDs.

2. **No provider field at all.** Spotify and iTunes raw API responses
   carry ``id`` but no provider/source key of any kind. The extractor
   couldn't disambiguate the native ``id``, so scans would have hit
   the same silent miss once the user switched their primary source
   to Spotify.

Two-part fix:

- ``extract_external_ids`` now recognizes ``_source`` as another
  candidate provider field.
- New optional ``source_hint`` parameter lets the caller supply the
  configured primary source as a fallback when the track dict has no
  provider field of its own. Track-side provider field still wins
  when present (defensive against a wrong hint).

Watchlist scanner now passes ``get_primary_source()`` as the hint so
both naming conventions (Deezer-style _source, Spotify-style no-tag)
get handled uniformly.
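
The provider-resolution order described above can be sketched roughly as follows. This is a minimal illustration, not the code in `core/library/track_identity.py`: the helper names `resolve_provider` and `extract_native_id`, the `PROVIDER_TO_ID_KEY` mapping, and the exact lookup order of the three provider keys are all assumptions made for the example.

```python
from __future__ import annotations

from typing import Any, Optional

# Hypothetical mapping from provider name to the external-ID key it fills.
# "discogs" is left out on purpose to mimic a provider with no library column.
PROVIDER_TO_ID_KEY = {
    "spotify": "spotify_id",
    "itunes": "itunes_id",
    "deezer": "deezer_id",
}

# Keys that may carry the provider name on a track dict (order is assumed).
PROVIDER_FIELDS = ("provider", "source", "_source")


def resolve_provider(track: dict[str, Any], source_hint: Optional[str] = None) -> Optional[str]:
    """Return the provider tagged on the track, falling back to the hint."""
    for field in PROVIDER_FIELDS:
        value = track.get(field)
        if value:
            # A track-side tag always wins over the caller's hint.
            return str(value).lower()
    return source_hint.lower() if source_hint else None


def extract_native_id(track: dict[str, Any], source_hint: Optional[str] = None) -> dict[str, str]:
    """Map the track's native ``id`` to a provider-specific key, if possible."""
    provider = resolve_provider(track, source_hint)
    key = PROVIDER_TO_ID_KEY.get(provider) if provider else None
    native_id = track.get("id")
    if key is None or native_id in (None, ""):
        # Unknown provider, no provider at all, or no usable id: return nothing.
        return {}
    return {key: str(native_id)}
```

With this shape, a Deezer-style dict such as `{"id": 123, "_source": "deezer"}` and a raw Spotify-style dict `{"id": "abc"}` passed with `source_hint="spotify"` both yield an ID, while an untagged track with no hint yields an empty result.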

6 new regression tests cover:
- _source recognized for Deezer
- _source recognized for Hydrabase (cross-provider mapping)
- _source recognized for Discogs (no library column — verifies
  graceful no-crash)
- source_hint disambiguates raw tracks for spotify/itunes/deezer
- track-side provider takes precedence over hint
- None hint defaults safely

Full pytest 1630 passed; ruff clean. After this lands and the server
restarts, watchlist scans should produce [ExtID Match] /
[Provenance Match] log lines for tracks already on disk regardless of
which metadata source the user has configured as primary.

Broque Thomas

ecb8939c80

Match library tracks by external IDs before fuzzy in watchlist scan

Reported case (CAL): a track already on disk got re-downloaded by the
watchlist scanner on every scan. Library DB had stale album metadata
for the file (track tagged on album "Left Alone") while the metadata
source reported it on a different album ("NPC" single). The
title+artist+album fuzzy block correctly said the album names didn't
match and declared the track missing — but the file's stable external
IDs (Spotify ID, ISRC, etc.) unambiguously identified it as the same
recording.

The earlier compilation-album fix (PR #461) handled qualifier drift
("OST" vs "Music From The Motion Picture"). This case is two
genuinely different album names referring to the same song.

Fix: provider-neutral external-ID short-circuit before the fuzzy
block in `is_track_missing_from_library`. Pulls every recognized ID
off the source track (Spotify / iTunes / Deezer / Tidal / Qobuz /
MusicBrainz / AudioDB / Hydrabase / ISRC), runs a single SELECT
against the indexed external-ID columns on the `tracks` table, and
treats any hit as "track exists in library — don't re-download".

If no IDs are available (older imports without enrichment, library
scans that didn't populate external IDs), falls through to the
existing fuzzy logic so the safety net stays intact.
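
The control flow of that short-circuit might look something like the sketch below. The function name `is_missing` and both callables are hypothetical stand-ins, and the sketch assumes that an ID lookup which finds nothing also falls through to the fuzzy check; the commit message only states the behavior for a hit and for the no-IDs case.

```python
from typing import Callable, Mapping, Optional


def is_missing(
    external_ids: Mapping[str, str],
    lookup_by_ids: Callable[[Mapping[str, str]], Optional[object]],
    fuzzy_is_missing: Callable[[], bool],
) -> bool:
    """Return True when the track should be treated as missing."""
    if external_ids:
        # Any external-ID hit means the track is already in the library.
        if lookup_by_ids(external_ids) is not None:
            return False
    # No IDs, or no hit (assumed): defer to the title+artist+album matcher.
    return fuzzy_is_missing()
```

Passing the lookup and the fuzzy matcher as callables keeps the sketch self-contained; the real function presumably calls its own database and matching code directly.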

New `core/library/track_identity.py` module with two helpers:
- `extract_external_ids(track)`: handles dict and object-style track
  shapes, direct-field aliases (spotify_id / spotify_track_id /
  SPOTIFY_TRACK_ID), and provider-disambiguated native `id` fields
  (when track has `provider='deezer'` and `id='X'`, treats X as a
  Deezer ID).
- `find_library_track_by_external_id(db, external_ids,
  server_source)`: builds an OR of indexed column matches with
  IS NOT NULL guards, optional server_source filter that also
  passes legacy NULL rows, single-row LIMIT.
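
As a rough illustration of the lookup helper, the query could be assembled along these lines with the standard-library `sqlite3` module. The function name `find_by_external_id`, the `ID_COLUMNS` allow-list, and the column names (including `server_source`) are assumptions for this sketch; the actual schema and helper signature may differ.

```python
import sqlite3
from typing import Mapping, Optional

# Hypothetical allow-list of external-ID keys and their column names.
# Column names come only from this map, never from caller input.
ID_COLUMNS = {
    "spotify_id": "spotify_id",
    "deezer_id": "deezer_id",
    "isrc": "isrc",
}


def find_by_external_id(
    conn: sqlite3.Connection,
    external_ids: Mapping[str, str],
    server_source: Optional[str] = None,
) -> Optional[tuple]:
    """Return the first row matching any supplied external ID, else None."""
    clauses, params = [], []
    for key, value in external_ids.items():
        column = ID_COLUMNS.get(key)
        if column and value:
            # IS NOT NULL guard plus equality, one clause per known ID.
            clauses.append(f"({column} IS NOT NULL AND {column} = ?)")
            params.append(str(value))
    if not clauses:
        return None

    sql = f"SELECT id FROM tracks WHERE ({' OR '.join(clauses)})"
    if server_source is not None:
        # Legacy rows with a NULL server_source still pass the filter.
        sql += " AND (server_source = ? OR server_source IS NULL)"
        params.append(server_source)
    sql += " LIMIT 1"
    return conn.execute(sql, params).fetchone()
```

Because the clauses are joined with OR, a row imported with only a Deezer ID and an ISRC can still be found by a scan that supplies a Spotify ID alongside the same ISRC.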

ISRC bridges across providers — a library track imported via Deezer
can be matched against a Spotify scan when both sides carry the
same ISRC.

43 regression tests in `tests/test_library_track_identity.py`:
- 9 ID-extraction tests for direct fields (Spotify / iTunes / Deezer /
  ISRC / MBID / AudioDB / Hydrabase)
- 8 ID-extraction tests via the provider field (8 providers + source
  alias + missing-provider-ignored)
- 7 mixed/defensive tests (multiple IDs, object-style, empty strings,
  None track, numeric coercion)
- 8 lookup tests (per-provider + ISRC cross-bridge)
- 3 OR-semantics tests
- 4 server_source filter tests
- 2 ID-column-map sanity tests

Full pytest 1606 passed; ruff clean.

2 Commits (dev)