From 34ba26f5c893cdf3ddd8185cdd116a2a303f4c91 Mon Sep 17 00:00:00 2001
From: Broque Thomas <26755000+Nezreka@users.noreply.github.com>
Date: Sat, 2 May 2026 17:44:10 -0700
Subject: [PATCH] Persist source IDs at download time + backfill onto tracks on
 sync
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Followup to fix/watchlist-external-id-match. The companion PR closed
the demand side — the watchlist scanner asks for tracks by external IDs
before falling back to fuzzy. But for users on Plex / Jellyfin /
Navidrome the supply side was still broken: tracks.spotify_track_id
(and the other ID columns) only got populated by the asynchronous
enrichment workers, sometimes hours after the file was actually
written. During that window the ID match fell through to fuzzy and
the bug returned.

We were already collecting every ID during post-processing — they
live in the `pp` dict in core/metadata/source.py:embed_source_ids and
get embedded into file tags. We just dropped the in-memory copy
afterwards.

This PR persists them and uses them:

- Schema migration adds spotify_track_id / itunes_track_id /
  deezer_track_id / tidal_track_id / qobuz_track_id /
  musicbrainz_recording_id / audiodb_id / soul_id / isrc columns +
  indexes to the existing track_downloads table (already keyed by
  file_path).
- core/metadata/source.py:embed_source_ids exposes pp["id_tags"] and
  the resolved ISRC back to the import context as _embedded_id_tags
  / _isrc.
- core/imports/side_effects.py:record_download_provenance reads those
  context fields and passes them to db.record_track_download, which
  now accepts the new ID kwargs and persists them.
- New db.get_provenance_by_file_path with exact + basename-suffix
  fallback (handles container mount-root differences between
  download-time path and media-server-reported path).
- New db.backfill_track_external_ids_from_provenance copies IDs
  from track_downloads onto a tracks row idempotently — COALESCE on
  every column preserves any value the enrichment worker already
  wrote (enrichment is more authoritative for late binding).
- database/music_database.py:insert_or_update_media_track (the
  single insertion point used by every Plex / Jellyfin / Navidrome
  sync) calls the backfill immediately after each INSERT/UPDATE.
- New core/library/track_identity.py:find_provenance_by_external_id
  used as a second-tier fallback in watchlist_scanner.is_track_missing
  _from_library — catches the window between download and media-server
  sync. Caller checks os.path.exists on the provenance file_path
  before treating it as "already in library" so a deleted file
  doesn't prevent re-download.

Effect: freshly downloaded files become ID-recognizable to the
watchlist on the very next scan, no enrichment-wait window.

19 regression tests in tests/test_provenance_id_persistence.py:
- Schema migration adds expected columns + indexes
- record_track_download persists every ID kwarg
- record_track_download backward-compat (old kwargs still work)
- get_provenance_by_file_path: exact match, basename fallback for
  mount-root differences, multi-record latest-wins, defensive None
- backfill: copies all IDs, preserves existing via COALESCE,
  no-op when no provenance exists
- find_provenance_by_external_id: per-ID lookup, ISRC cross-bridge,
  OR semantics, latest-wins on multiple matches

Out of scope: backfilling provenance for files downloaded BEFORE
this PR (their track_downloads rows don't carry the new IDs). Those
continue to wait for enrichment. Acceptable — only affects historical
files; new downloads benefit immediately.

Full pytest 1625 passed; ruff clean.
---
 core/imports/side_effects.py            |  33 +++
 core/library/track_identity.py          |  81 ++++++
 core/metadata/source.py                 |  19 ++
 core/watchlist_scanner.py               |  21 ++
 database/music_database.py              | 173 +++++++++++-
 tests/test_provenance_id_persistence.py | 358 ++++++++++++++++++++++++
 webui/static/helper.js                  |   1 +
 7 files changed, 681 insertions(+), 5 deletions(-)
 create mode 100644 tests/test_provenance_id_persistence.py

diff --git a/core/imports/side_effects.py b/core/imports/side_effects.py
index 2824dd2c..bfa6855b 100644
--- a/core/imports/side_effects.py
+++ b/core/imports/side_effects.py
@@ -189,6 +189,30 @@ def record_download_provenance(context: Dict[str, Any]) -> None:
         except Exception:
             pass
 
+        # Pull the metadata-source IDs out of context. ``embed_source_ids``
+        # in core/metadata/source.py wrote them to ``_embedded_id_tags``
+        # at the end of post-processing — we persist them here so the
+        # watchlist scanner can recognize freshly downloaded files
+        # without waiting for the async enrichment workers.
+        embedded = context.get("_embedded_id_tags") or {}
+
+        def _embedded(*keys):
+            for k in keys:
+                v = embedded.get(k)
+                if v:
+                    return str(v)
+            return None
+
+        spotify_track_id = _embedded("SPOTIFY_TRACK_ID")
+        itunes_track_id = _embedded("ITUNES_TRACK_ID")
+        deezer_track_id = _embedded("DEEZER_TRACK_ID")
+        tidal_track_id = _embedded("TIDAL_TRACK_ID")
+        qobuz_track_id = _embedded("QOBUZ_TRACK_ID")
+        musicbrainz_recording_id = _embedded("MUSICBRAINZ_RECORDING_ID")
+        audiodb_id = _embedded("AUDIODB_TRACK_ID")
+        soul_id = _embedded("SOUL_ID")
+        isrc = context.get("_isrc")
+
         db = get_database()
         db.record_track_download(
             file_path=file_path,
@@ -203,6 +227,15 @@ def record_download_provenance(context: Dict[str, Any]) -> None:
             bit_depth=bit_depth,
             sample_rate=sample_rate,
             bitrate=bitrate,
+            spotify_track_id=spotify_track_id,
+            itunes_track_id=itunes_track_id,
+            deezer_track_id=deezer_track_id,
+            tidal_track_id=tidal_track_id,
+            qobuz_track_id=qobuz_track_id,
+            musicbrainz_recording_id=musicbrainz_recording_id,
+            audiodb_id=audiodb_id,
+            soul_id=soul_id,
+            isrc=isrc,
         )
     except Exception:
         pass
diff --git a/core/library/track_identity.py b/core/library/track_identity.py
index 8aaad5ff..584fba45 100644
--- a/core/library/track_identity.py
+++ b/core/library/track_identity.py
@@ -199,8 +199,89 @@ def find_library_track_by_external_id(
                 pass
 
 
+# Maps the conceptual ID name to the column on the ``track_downloads``
+# (provenance) table where SoulSync persists the IDs at download time.
+# Naming convention differs from ``tracks``: provenance uses the
+# explicit ``_track_id`` suffix to match the existing column shape.
+PROVENANCE_ID_COLUMNS: Dict[str, str] = {
+    'spotify_id': 'spotify_track_id',
+    'itunes_id': 'itunes_track_id',
+    'deezer_id': 'deezer_track_id',
+    'tidal_id': 'tidal_track_id',
+    'qobuz_id': 'qobuz_track_id',
+    'mbid': 'musicbrainz_recording_id',
+    'audiodb_id': 'audiodb_id',
+    'soul_id': 'soul_id',
+    'isrc': 'isrc',
+}
+
+
+def find_provenance_by_external_id(
+    db: Any,
+    *,
+    external_ids: Dict[str, str],
+) -> Optional[Dict[str, Any]]:
+    """Return a row from the ``track_downloads`` table whose any external
+    ID column matches one of the provided IDs, or None.
+
+    Used as a second-tier fallback by the watchlist scanner: when the
+    primary library tracks-table lookup misses (e.g. the row exists but
+    the enrichment worker hasn't backfilled its IDs yet, or the row
+    doesn't exist yet because the media-server scan hasn't run since the
+    download), this checks whether SoulSync downloaded the file recently
+    enough that the IDs are sitting in the provenance table.
+
+    Caller should typically also confirm the ``file_path`` on the
+    returned row still exists on disk before treating the track as
+    "already in library" — otherwise a deleted file would prevent
+    re-download.
+    """
+    if not external_ids:
+        return None
+
+    clauses: List[str] = []
+    params: List[Any] = []
+    for id_name, id_value in external_ids.items():
+        column = PROVENANCE_ID_COLUMNS.get(id_name)
+        if not column or not id_value:
+            continue
+        clauses.append(f"({column} = ? AND {column} IS NOT NULL AND {column} != '')")
+        params.append(id_value)
+
+    if not clauses:
+        return None
+
+    where_external = " OR ".join(clauses)
+    sql = f"SELECT * FROM track_downloads WHERE ({where_external}) ORDER BY id DESC LIMIT 1"
+
+    conn = None
+    try:
+        conn = db._get_connection()
+        cursor = conn.cursor()
+        cursor.execute(sql, params)
+        row = cursor.fetchone()
+        if row is None:
+            return None
+        try:
+            return dict(row)
+        except (TypeError, ValueError):
+            cols = [c[0] for c in cursor.description]
+            return dict(zip(cols, row, strict=False))
+    except Exception as exc:
+        logger.debug(f"find_provenance_by_external_id query failed: {exc}")
+        return None
+    finally:
+        if conn is not None:
+            try:
+                conn.close()
+            except Exception:
+                pass
+
+
 __all__ = [
     'EXTERNAL_ID_COLUMNS',
+    'PROVENANCE_ID_COLUMNS',
     'extract_external_ids',
     'find_library_track_by_external_id',
+    'find_provenance_by_external_id',
 ]
diff --git a/core/metadata/source.py b/core/metadata/source.py
index 0c101f5e..fc5f2b12 100644
--- a/core/metadata/source.py
+++ b/core/metadata/source.py
@@ -1093,5 +1093,24 @@ def embed_source_ids(audio_file, metadata: dict, context: dict = None, runtime=N
             metadata["musicbrainz_release_id"] = release_id
             _update_album_year_in_database(db, metadata, release_year)
 
+        # Expose the final ID tag set + per-source values back to the
+        # import context so downstream side-effects (notably
+        # ``record_download_provenance``) can persist them to the
+        # ``track_downloads`` table without re-collecting. Without this,
+        # the watchlist scanner would have to wait for the async
+        # enrichment workers to backfill ``tracks.spotify_track_id`` etc.
+        # before recognizing freshly downloaded files.
+        if isinstance(context, dict):
+            try:
+                context["_embedded_id_tags"] = dict(pp.get("id_tags") or {})
+                isrc_value = (
+                    pp.get("isrc") or pp.get("deezer_isrc") or pp.get("tidal_isrc")
+                    or pp.get("hifi_isrc") or pp.get("qobuz_isrc")
+                )
+                if isrc_value:
+                    context["_isrc"] = str(isrc_value)
+            except Exception:
+                pass
+
     except Exception as exc:
         logger.error("Error embedding source IDs (non-fatal): %s", exc)
diff --git a/core/watchlist_scanner.py b/core/watchlist_scanner.py
index fc1dae39..50bda80b 100644
--- a/core/watchlist_scanner.py
+++ b/core/watchlist_scanner.py
@@ -2074,7 +2074,9 @@ class WatchlistScanner:
                 from core.library.track_identity import (
                     extract_external_ids,
                     find_library_track_by_external_id,
+                    find_provenance_by_external_id,
                 )
+                import os as _os_local
                 source_ids = extract_external_ids(track)
                 if source_ids:
                     matched = find_library_track_by_external_id(
@@ -2089,6 +2091,25 @@ class WatchlistScanner:
                             f"(matched on: {', '.join(sorted(source_ids.keys()))})"
                         )
                         return False  # Track exists in library
+
+                    # Second-tier fallback: provenance table. Catches the
+                    # window between "SoulSync downloaded the file" and
+                    # "media-server scan + sync populated the tracks row
+                    # with IDs". File still has to exist on disk —
+                    # otherwise a user who deleted a file would never get
+                    # it back.
+                    prov = find_provenance_by_external_id(
+                        self.database, external_ids=source_ids,
+                    )
+                    if prov is not None:
+                        prov_path = prov.get('file_path')
+                        if prov_path and _os_local.path.exists(prov_path):
+                            logger.info(
+                                f"[Provenance Match] Track found in download provenance: "
+                                f"'{original_title}' by '{artists_to_search[0] if artists_to_search else 'Unknown'}' "
+                                f"(matched on: {', '.join(sorted(source_ids.keys()))})"
+                            )
+                            return False
             except Exception as ext_id_err:
                 logger.debug(f"External-ID match probe failed (falling through to fuzzy): {ext_id_err}")
 
diff --git a/database/music_database.py b/database/music_database.py
index 2fbb8cde..b9ffee38 100644
--- a/database/music_database.py
+++ b/database/music_database.py
@@ -1384,6 +1384,33 @@ class MusicDatabase:
                 cursor.execute("ALTER TABLE track_downloads ADD COLUMN bitrate INTEGER")
                 logger.info("Added audio detail columns (bit_depth, sample_rate, bitrate) to track_downloads")
 
+            # Migration: Add external metadata-source ID columns to
+            # track_downloads. Persists the IDs we already collect at
+            # post-processing time so the watchlist scanner + media-server
+            # sync backfill can read them without waiting for the async
+            # enrichment workers.
+            external_id_cols = [
+                'spotify_track_id', 'itunes_track_id', 'deezer_track_id',
+                'tidal_track_id', 'qobuz_track_id', 'musicbrainz_recording_id',
+                'audiodb_id', 'soul_id', 'isrc',
+            ]
+            added_external = False
+            for _col in external_id_cols:
+                if _col not in td_columns:
+                    cursor.execute(f"ALTER TABLE track_downloads ADD COLUMN {_col} TEXT")
+                    added_external = True
+            if added_external:
+                logger.info(f"Added external-ID columns to track_downloads: {', '.join(external_id_cols)}")
+            cursor.execute("CREATE INDEX IF NOT EXISTS idx_td_spotify_id ON track_downloads (spotify_track_id)")
+            cursor.execute("CREATE INDEX IF NOT EXISTS idx_td_itunes_id ON track_downloads (itunes_track_id)")
+            cursor.execute("CREATE INDEX IF NOT EXISTS idx_td_deezer_id ON track_downloads (deezer_track_id)")
+            cursor.execute("CREATE INDEX IF NOT EXISTS idx_td_tidal_id ON track_downloads (tidal_track_id)")
+            cursor.execute("CREATE INDEX IF NOT EXISTS idx_td_qobuz_id ON track_downloads (qobuz_track_id)")
+            cursor.execute("CREATE INDEX IF NOT EXISTS idx_td_mbid ON track_downloads (musicbrainz_recording_id)")
+            cursor.execute("CREATE INDEX IF NOT EXISTS idx_td_audiodb_id ON track_downloads (audiodb_id)")
+            cursor.execute("CREATE INDEX IF NOT EXISTS idx_td_soul_id ON track_downloads (soul_id)")
+            cursor.execute("CREATE INDEX IF NOT EXISTS idx_td_isrc ON track_downloads (isrc)")
+
             # Discovery artist blacklist — artists users never want to see in discovery
             cursor.execute("""
                 CREATE TABLE IF NOT EXISTS discovery_artist_blacklist (
@@ -5009,6 +5036,20 @@ class MusicDatabase:
 
                 conn.commit()
 
+                # Backfill external metadata-source IDs from track_downloads
+                # provenance. SoulSync collected them at download time but the
+                # media-server scan can't see them — without this hook,
+                # tracks.spotify_track_id / itunes_track_id / etc. stay empty
+                # until the async enrichment workers eventually catch up
+                # (hours later), during which window the watchlist scanner
+                # treats freshly downloaded files as missing and re-downloads
+                # them. Idempotent COALESCE on each column preserves any value
+                # the enrichment worker already wrote.
+                try:
+                    self.backfill_track_external_ids_from_provenance(track_id, file_path)
+                except Exception as backfill_err:
+                    logger.debug(f"Provenance ID backfill skipped for track {track_id}: {backfill_err}")
+
                 # Log new imports to library history
                 if is_new_track:
                     try:
@@ -9914,8 +9955,24 @@ class MusicDatabase:
                                source_filename: str, source_size: int = 0, audio_quality: str = '',
                                track_title: str = '', track_artist: str = '', track_album: str = '',
                                status: str = 'completed', track_id: str = None,
-                               bit_depth: int = None, sample_rate: int = None, bitrate: int = None) -> Optional[int]:
-        """Record a download with full source provenance. Returns the record ID."""
+                               bit_depth: int = None, sample_rate: int = None, bitrate: int = None,
+                               spotify_track_id: Optional[str] = None,
+                               itunes_track_id: Optional[str] = None,
+                               deezer_track_id: Optional[str] = None,
+                               tidal_track_id: Optional[str] = None,
+                               qobuz_track_id: Optional[str] = None,
+                               musicbrainz_recording_id: Optional[str] = None,
+                               audiodb_id: Optional[str] = None,
+                               soul_id: Optional[str] = None,
+                               isrc: Optional[str] = None) -> Optional[int]:
+        """Record a download with full source provenance. Returns the record ID.
+
+        External-ID kwargs (spotify_track_id et al.) capture the metadata-
+        source identity that the user originally asked for — they're written
+        at download time so the watchlist scanner can recognize the file as
+        already present without waiting for the async enrichment workers
+        to backfill them onto the ``tracks`` row.
+        """
         try:
             conn = self._get_connection()
             cursor = conn.cursor()
@@ -9941,17 +9998,123 @@ class MusicDatabase:
                 INSERT INTO track_downloads
                 (track_id, file_path, source_service, source_username, source_filename,
                  source_size, audio_quality, track_title, track_artist, track_album, status,
-                 bit_depth, sample_rate, bitrate)
-                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+                 bit_depth, sample_rate, bitrate,
+                 spotify_track_id, itunes_track_id, deezer_track_id, tidal_track_id,
+                 qobuz_track_id, musicbrainz_recording_id, audiodb_id, soul_id, isrc)
+                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
             """, (track_id, file_path, source_service, source_username, source_filename,
                   source_size, audio_quality, track_title, track_artist, track_album, status,
-                  bit_depth, sample_rate, bitrate))
+                  bit_depth, sample_rate, bitrate,
+                  spotify_track_id, itunes_track_id, deezer_track_id, tidal_track_id,
+                  qobuz_track_id, musicbrainz_recording_id, audiodb_id, soul_id, isrc))
             conn.commit()
             return cursor.lastrowid
         except Exception as e:
             logger.error(f"Error recording track download: {e}")
             return None
 
+    def get_provenance_by_file_path(self, file_path: str) -> Optional[Dict[str, Any]]:
+        """Return the most recent track_downloads row matching ``file_path``.
+
+        Tries exact match first, then a basename-suffix LIKE fallback for
+        cases where the media-server scan reports the file at a slightly
+        different path than what was recorded at download time (Windows
+        separators, symlink resolution, container mount-root differences).
+        """
+        if not file_path:
+            return None
+        try:
+            conn = self._get_connection()
+            cursor = conn.cursor()
+            cursor.execute(
+                "SELECT * FROM track_downloads WHERE file_path = ? ORDER BY id DESC LIMIT 1",
+                (file_path,),
+            )
+            row = cursor.fetchone()
+            if row is None:
+                import os as _os
+                fname = _os.path.basename(file_path.replace('\\', '/'))
+                if fname:
+                    cursor.execute(
+                        "SELECT * FROM track_downloads WHERE file_path LIKE ? OR file_path LIKE ? "
+                        "ORDER BY id DESC LIMIT 1",
+                        (f'%/{fname}', f'%\\{fname}'),
+                    )
+                    row = cursor.fetchone()
+            if row is None:
+                return None
+            try:
+                return dict(row)
+            except (TypeError, ValueError):
+                cols = [c[0] for c in cursor.description]
+                return dict(zip(cols, row, strict=False))
+        except Exception as exc:
+            logger.debug(f"get_provenance_by_file_path failed: {exc}")
+            return None
+
+    def backfill_track_external_ids_from_provenance(self, track_id: str, file_path: Optional[str]) -> int:
+        """Copy external IDs from ``track_downloads`` onto a ``tracks`` row.
+
+        Idempotent: only writes columns that are currently NULL/empty on
+        the tracks row AND have a value in the provenance row. Returns the
+        number of columns updated. Called from
+        ``insert_or_update_media_track`` immediately after the row is
+        inserted/updated so freshly synced media-server rows pick up
+        whatever IDs SoulSync already knew at download time.
+        """
+        if not track_id or not file_path:
+            return 0
+        prov = self.get_provenance_by_file_path(file_path)
+        if not prov:
+            return 0
+
+        # Map provenance column -> tracks column. Different naming
+        # conventions because tracks.* uses shorter names (``deezer_id``,
+        # ``tidal_id``, ``qobuz_id``) while track_downloads uses the
+        # explicit ``_track_id`` suffix to avoid ambiguity.
+        prov_to_tracks = {
+            'spotify_track_id': 'spotify_track_id',
+            'itunes_track_id': 'itunes_track_id',
+            'deezer_track_id': 'deezer_id',
+            'tidal_track_id': 'tidal_id',
+            'qobuz_track_id': 'qobuz_id',
+            'musicbrainz_recording_id': 'musicbrainz_recording_id',
+            'audiodb_id': 'audiodb_id',
+            'soul_id': 'soul_id',
+            'isrc': 'isrc',
+        }
+
+        updates: Dict[str, str] = {}
+        for prov_col, track_col in prov_to_tracks.items():
+            val = prov.get(prov_col)
+            if not val:
+                continue
+            updates[track_col] = str(val)
+        if not updates:
+            return 0
+
+        try:
+            conn = self._get_connection()
+            cursor = conn.cursor()
+            # Coalesce-update: only fill empty columns. Preserves any IDs
+            # the enrichment worker already populated (those are usually
+            # more reliable than provenance for non-primary sources).
+            set_clauses = []
+            params = []
+            for track_col, val in updates.items():
+                set_clauses.append(f"{track_col} = COALESCE(NULLIF({track_col}, ''), ?)")
+                params.append(val)
+            params.append(track_id)
+            cursor.execute(
+                f"UPDATE tracks SET {', '.join(set_clauses)} WHERE id = ?",
+                params,
+            )
+            conn.commit()
+            return cursor.rowcount or 0
+        except Exception as exc:
+            logger.debug(f"backfill_track_external_ids_from_provenance failed: {exc}")
+            return 0
+
     def get_track_downloads(self, track_id: str) -> list:
         """Get all download records for a library track."""
         try:
diff --git a/tests/test_provenance_id_persistence.py b/tests/test_provenance_id_persistence.py
new file mode 100644
index 00000000..19b24b2a
--- /dev/null
+++ b/tests/test_provenance_id_persistence.py
@@ -0,0 +1,358 @@
+"""Regression tests for the post-processing → provenance → tracks ID flow.
+
+Companion to test_library_track_identity.py. The watchlist external-ID
+match (PR #470) closed the demand side: when the watchlist asks "do we
+have this track?", it queries by Spotify/iTunes/Deezer/etc. IDs before
+falling back to fuzzy. But for users on Plex / Jellyfin / Navidrome,
+the ``tracks.spotify_track_id`` column only gets populated by
+asynchronous enrichment workers — sometimes hours after the file is
+written. During that window the ID match falls through to fuzzy and
+the bug returns.
+
+This PR closes the supply side: the IDs we already collect at
+post-processing time get persisted to ``track_downloads``, and the
+media-server sync code copies them onto the new ``tracks`` row
+immediately. These tests pin:
+
+1. Schema migration adds the new ID columns + indexes
+2. ``record_track_download`` accepts and persists the new kwargs
+3. ``get_provenance_by_file_path`` finds rows by exact + suffix match
+4. ``backfill_track_external_ids_from_provenance`` copies IDs onto a
+   tracks row idempotently (COALESCE — preserves existing values)
+5. ``find_provenance_by_external_id`` queries the new columns
+"""
+
+from __future__ import annotations
+
+import os
+import sqlite3
+import tempfile
+from pathlib import Path
+from typing import Any, Dict
+
+import pytest
+
+
+@pytest.fixture
+def db_path(tmp_path: Path):
+    return tmp_path / "test_music.db"
+
+
+@pytest.fixture
+def db(db_path: Path, monkeypatch):
+    """Real MusicDatabase against a tmp SQLite file so the schema
+    migration runs end-to-end (validates the ALTER TABLE additions)."""
+    monkeypatch.setenv('DATABASE_PATH', str(db_path))
+    # MusicDatabase is heavy; isolate to a fresh import each test so
+    # other tests don't get our env-var pollution.
+    import importlib
+    import database.music_database as music_db_module
+    importlib.reload(music_db_module)
+    db = music_db_module.MusicDatabase(str(db_path))
+    yield db
+
+
+# ---------------------------------------------------------------------------
+# Schema migration
+# ---------------------------------------------------------------------------
+
+
+class TestSchemaMigration:
+    def test_track_downloads_has_new_external_id_columns(self, db):
+        conn = db._get_connection()
+        cursor = conn.cursor()
+        cursor.execute("PRAGMA table_info(track_downloads)")
+        cols = {row[1] for row in cursor.fetchall()}
+
+        assert 'spotify_track_id' in cols
+        assert 'itunes_track_id' in cols
+        assert 'deezer_track_id' in cols
+        assert 'tidal_track_id' in cols
+        assert 'qobuz_track_id' in cols
+        assert 'musicbrainz_recording_id' in cols
+        assert 'audiodb_id' in cols
+        assert 'soul_id' in cols
+        assert 'isrc' in cols
+
+    def test_track_downloads_has_external_id_indexes(self, db):
+        conn = db._get_connection()
+        cursor = conn.cursor()
+        cursor.execute(
+            "SELECT name FROM sqlite_master WHERE type='index' AND tbl_name='track_downloads'"
+        )
+        idx_names = {row[0] for row in cursor.fetchall()}
+
+        assert 'idx_td_spotify_id' in idx_names
+        assert 'idx_td_itunes_id' in idx_names
+        assert 'idx_td_deezer_id' in idx_names
+        assert 'idx_td_isrc' in idx_names
+
+
+# ---------------------------------------------------------------------------
+# record_track_download persists IDs
+# ---------------------------------------------------------------------------
+
+
+class TestRecordTrackDownloadPersistsIds:
+    def test_persists_all_external_ids(self, db):
+        rec_id = db.record_track_download(
+            file_path='/lib/Artist/Album/Track.mp3',
+            source_service='soulseek',
+            source_username='user1',
+            source_filename='Track.mp3',
+            track_title='Track',
+            spotify_track_id='sp1',
+            itunes_track_id='it1',
+            deezer_track_id='dz1',
+            tidal_track_id='td1',
+            qobuz_track_id='qb1',
+            musicbrainz_recording_id='mb-uuid-1',
+            audiodb_id='adb1',
+            soul_id='hyd-soul-1',
+            isrc='USRC17607839',
+        )
+        assert rec_id is not None
+
+        conn = db._get_connection()
+        cursor = conn.cursor()
+        cursor.execute(
+            "SELECT spotify_track_id, itunes_track_id, deezer_track_id, "
+            "tidal_track_id, qobuz_track_id, musicbrainz_recording_id, "
+            "audiodb_id, soul_id, isrc FROM track_downloads WHERE id = ?",
+            (rec_id,),
+        )
+        row = tuple(cursor.fetchone())
+        assert row == (
+            'sp1', 'it1', 'dz1', 'td1', 'qb1', 'mb-uuid-1',
+            'adb1', 'hyd-soul-1', 'USRC17607839',
+        )
+
+    def test_omitted_ids_persist_as_null(self, db):
+        """Backward compat — callers that don't pass the new kwargs
+        still work, columns just stay NULL."""
+        rec_id = db.record_track_download(
+            file_path='/lib/Artist/Album/Track.mp3',
+            source_service='soulseek',
+            source_username='user1',
+            source_filename='Track.mp3',
+            track_title='Track',
+        )
+        assert rec_id is not None
+
+        conn = db._get_connection()
+        cursor = conn.cursor()
+        cursor.execute("SELECT spotify_track_id FROM track_downloads WHERE id = ?", (rec_id,))
+        assert cursor.fetchone()[0] is None
+
+
+# ---------------------------------------------------------------------------
+# get_provenance_by_file_path
+# ---------------------------------------------------------------------------
+
+
+class TestGetProvenanceByFilePath:
+    def test_exact_match(self, db):
+        db.record_track_download(
+            file_path='/lib/Artist/Album/Track.mp3',
+            source_service='soulseek', source_username='u', source_filename='Track.mp3',
+            spotify_track_id='sp1',
+        )
+        result = db.get_provenance_by_file_path('/lib/Artist/Album/Track.mp3')
+        assert result is not None
+        assert result['spotify_track_id'] == 'sp1'
+
+    def test_returns_none_when_no_match(self, db):
+        result = db.get_provenance_by_file_path('/nonexistent/path.mp3')
+        assert result is None
+
+    def test_returns_none_for_empty_path(self, db):
+        assert db.get_provenance_by_file_path('') is None
+        assert db.get_provenance_by_file_path(None) is None
+
+    def test_basename_suffix_fallback(self, db):
+        """Recorded path differs from queried path by mount root —
+        common when SoulSync container writes under /app/Transfer
+        but Plex container reports the same file as /media/Music."""
+        db.record_track_download(
+            file_path='/app/Transfer/Artist/Album/Track.mp3',
+            source_service='soulseek', source_username='u', source_filename='Track.mp3',
+            spotify_track_id='sp1',
+        )
+        result = db.get_provenance_by_file_path('/media/Music/Artist/Album/Track.mp3')
+        assert result is not None
+        assert result['spotify_track_id'] == 'sp1'
+
+    def test_returns_most_recent_when_multiple(self, db):
+        """Same file_path can have multiple download records (re-downloads,
+        retries). Most recent wins."""
+        db.record_track_download(
+            file_path='/lib/Track.mp3',
+            source_service='soulseek', source_username='u', source_filename='Track.mp3',
+            spotify_track_id='sp-old',
+        )
+        db.record_track_download(
+            file_path='/lib/Track.mp3',
+            source_service='soulseek', source_username='u', source_filename='Track.mp3',
+            spotify_track_id='sp-new',
+        )
+        result = db.get_provenance_by_file_path('/lib/Track.mp3')
+        assert result['spotify_track_id'] == 'sp-new'
+
+
+# ---------------------------------------------------------------------------
+# backfill_track_external_ids_from_provenance
+# ---------------------------------------------------------------------------
+
+
+class TestBackfillTrackExternalIdsFromProvenance:
+    def _seed_artist_album_and_track(self, db, *, track_id, file_path):
+        """Insert a minimal artists/albums/tracks chain so backfill has
+        a row to update."""
+        conn = db._get_connection()
+        cursor = conn.cursor()
+        cursor.execute(
+            "INSERT INTO artists (id, name, server_source) VALUES (?, ?, 'plex')",
+            ('artist-1', 'Test Artist'),
+        )
+        cursor.execute(
+            "INSERT INTO albums (id, artist_id, title, server_source) VALUES (?, ?, ?, 'plex')",
+            ('album-1', 'artist-1', 'Test Album'),
+        )
+        cursor.execute(
+            "INSERT INTO tracks (id, album_id, artist_id, title, file_path, server_source) "
+            "VALUES (?, ?, ?, ?, ?, 'plex')",
+            (track_id, 'album-1', 'artist-1', 'Test Track', file_path),
+        )
+        conn.commit()
+
+    def test_copies_all_ids_when_tracks_columns_empty(self, db):
+        self._seed_artist_album_and_track(db, track_id='t1', file_path='/lib/Track.mp3')
+        db.record_track_download(
+            file_path='/lib/Track.mp3',
+            source_service='soulseek', source_username='u', source_filename='Track.mp3',
+            spotify_track_id='sp1',
+            deezer_track_id='dz1',
+            isrc='USRC17607839',
+        )
+
+        updated = db.backfill_track_external_ids_from_provenance('t1', '/lib/Track.mp3')
+        assert updated > 0
+
+        conn = db._get_connection()
+        cursor = conn.cursor()
+        cursor.execute(
+            "SELECT spotify_track_id, deezer_id, isrc FROM tracks WHERE id = ?",
+            ('t1',),
+        )
+        assert tuple(cursor.fetchone()) == ('sp1', 'dz1', 'USRC17607839')
+
+    def test_preserves_existing_ids(self, db):
+        """COALESCE-update — if the enrichment worker already wrote a
+        spotify_track_id, the provenance backfill must NOT overwrite it
+        (enrichment is generally more authoritative for late binding)."""
+        self._seed_artist_album_and_track(db, track_id='t1', file_path='/lib/Track.mp3')
+
+        # Pre-populate spotify_track_id with the enrichment-worker value
+        conn = db._get_connection()
+        cursor = conn.cursor()
+        cursor.execute("UPDATE tracks SET spotify_track_id = 'sp-from-enrichment' WHERE id = ?", ('t1',))
+        conn.commit()
+
+        # Provenance has a different value
+        db.record_track_download(
+            file_path='/lib/Track.mp3',
+            source_service='soulseek', source_username='u', source_filename='Track.mp3',
+            spotify_track_id='sp-from-provenance',
+            deezer_track_id='dz1',  # This one IS missing on tracks, should backfill
+        )
+
+        db.backfill_track_external_ids_from_provenance('t1', '/lib/Track.mp3')
+
+        cursor.execute("SELECT spotify_track_id, deezer_id FROM tracks WHERE id = ?", ('t1',))
+        row = cursor.fetchone()
+        assert row[0] == 'sp-from-enrichment', "Existing spotify_track_id must be preserved"
+        assert row[1] == 'dz1', "Empty deezer_id should be filled from provenance"
+
+    def test_returns_zero_when_no_provenance(self, db):
+        self._seed_artist_album_and_track(db, track_id='t1', file_path='/lib/Track.mp3')
+        # No record_track_download call — no provenance row exists
+        updated = db.backfill_track_external_ids_from_provenance('t1', '/lib/Track.mp3')
+        assert updated == 0
+
+    def test_returns_zero_for_empty_inputs(self, db):
+        assert db.backfill_track_external_ids_from_provenance(None, '/lib/Track.mp3') == 0
+        assert db.backfill_track_external_ids_from_provenance('t1', None) == 0
+        assert db.backfill_track_external_ids_from_provenance('t1', '') == 0
+
+
+# ---------------------------------------------------------------------------
+# find_provenance_by_external_id
+# ---------------------------------------------------------------------------
+
+
+class TestFindProvenanceByExternalId:
+    def test_match_by_spotify_id(self, db):
+        db.record_track_download(
+            file_path='/lib/Track.mp3',
+            source_service='soulseek', source_username='u', source_filename='Track.mp3',
+            spotify_track_id='sp1',
+        )
+
+        from core.library.track_identity import find_provenance_by_external_id
+        result = find_provenance_by_external_id(db, external_ids={'spotify_id': 'sp1'})
+        assert result is not None
+        assert result['file_path'] == '/lib/Track.mp3'
+        assert result['spotify_track_id'] == 'sp1'
+
+    def test_match_by_isrc(self, db):
+        db.record_track_download(
+            file_path='/lib/Track.mp3',
+            source_service='soulseek', source_username='u', source_filename='Track.mp3',
+            isrc='USRC17607839',
+        )
+
+        from core.library.track_identity import find_provenance_by_external_id
+        result = find_provenance_by_external_id(db, external_ids={'isrc': 'USRC17607839'})
+        assert result is not None
+
+    def test_returns_none_when_no_match(self, db):
+        from core.library.track_identity import find_provenance_by_external_id
+        result = find_provenance_by_external_id(db, external_ids={'spotify_id': 'sp-other'})
+        assert result is None
+
+    def test_returns_none_for_empty_external_ids(self, db):
+        from core.library.track_identity import find_provenance_by_external_id
+        assert find_provenance_by_external_id(db, external_ids={}) is None
+
+    def test_returns_most_recent_when_multiple_matches(self, db):
+        """Re-downloads create multiple rows. Newest wins."""
+        db.record_track_download(
+            file_path='/lib/Track-v1.mp3',
+            source_service='soulseek', source_username='u', source_filename='Track.mp3',
+            spotify_track_id='sp1',
+        )
+        db.record_track_download(
+            file_path='/lib/Track-v2.mp3',
+            source_service='tidal', source_username='tidal', source_filename='Track.flac',
+            spotify_track_id='sp1',
+        )
+
+        from core.library.track_identity import find_provenance_by_external_id
+        result = find_provenance_by_external_id(db, external_ids={'spotify_id': 'sp1'})
+        assert result['file_path'] == '/lib/Track-v2.mp3'
+
+    def test_or_semantics_across_id_types(self, db):
+        """Provenance has only ISRC; source asks with multiple IDs incl. ISRC.
+        Match should fire on ISRC."""
+        db.record_track_download(
+            file_path='/lib/Track.mp3',
+            source_service='soulseek', source_username='u', source_filename='Track.mp3',
+            isrc='USRC17607839',
+        )
+
+        from core.library.track_identity import find_provenance_by_external_id
+        result = find_provenance_by_external_id(db, external_ids={
+            'spotify_id': 'sp-mismatch',
+            'isrc': 'USRC17607839',
+        })
+        assert result is not None
diff --git a/webui/static/helper.js b/webui/static/helper.js
index ebcda0cf..57cbebae 100644
--- a/webui/static/helper.js
+++ b/webui/static/helper.js
@@ -3447,6 +3447,7 @@ const WHATS_NEW = {
         { title: 'Fix Qobuz Connection Not Sticking After Login', desc: 'logging in via the qobuz connect button on settings showed "connected: <username> (active)" but underneath an error said "qobuz not authenticated...", and the dashboard indicator stayed yellow. cause: two separate qobuz client instances run side by side (one for the auth flow, one for the enrichment worker) and login only updated the first one. now the worker\'s client gets synced from config the moment login / token / logout completes, so the dashboard indicator goes green and connection-test stops yelling.', page: 'settings' },
         { title: 'Fix Lossy Copy Not Deleting Original FLAC', desc: 'with lossy copy enabled and "delete original" turned on (you wanted an mp3-only library), every download still left both the flac and the converted mp3 sitting in the same folder. the setting was being read but never acted on during the conversion step. now the original gets removed right after a successful conversion, with a same-path safety check + graceful handling if the original is already gone or locked.', page: 'settings' },
         { title: 'Watchlist Stops Re-Downloading Tracks That Already Exist', desc: 'a track that was already on disk got re-downloaded by the watchlist on every scan because the library had stale album metadata for it (file tagged on the wrong album by an old import) and the album fuzzy comparison declared the track missing. now the watchlist also matches by stable external IDs (spotify / itunes / deezer / tidal / qobuz / musicbrainz / audiodb / hydrabase / isrc) before falling through to the fuzzy block — so any track whose tags or DB row carry a matching ID is recognized as already present regardless of album drift. provider-neutral, falls through to existing fuzzy logic for older imports without IDs.', page: 'watchlist' },
+        { title: 'Persist Source IDs at Download Time + Backfill on Sync', desc: 'every download already collects spotify/itunes/deezer/tidal/qobuz/musicbrainz/audiodb/hydrabase/isrc IDs during post-processing, but for plex/jellyfin/navidrome users they got dropped on the floor — only enrichment workers eventually wrote them onto the tracks row, hours later. now those IDs persist to the track_downloads table immediately, the media-server sync code copies them onto the new tracks row the moment it gets created, and the watchlist scanner has a second-tier fallback to query provenance directly when the tracks row hasn\'t been synced yet. closes the enrichment-wait window — freshly downloaded files are recognizable on the very next watchlist scan instead of after enrichment catches up.', page: 'library' },
         { title: 'Sidebar Library Button Shows Artist Breadcrumb', desc: 'when you open an artist detail page (from library, search, or the global search popover), the sidebar Library button now lights up and rewrites its label to "Library / Artist Name" — long names truncate with an ellipsis and the full name shows on hover. revertes to plain "Library" when you leave. purely visual, no functionality change.', page: 'library' },
         { title: 'Enrichment Bubble Routes Consolidated', desc: 'internal — every dashboard enrichment bubble (musicbrainz, spotify, itunes, deezer, discogs, audiodb, lastfm, genius, tidal, qobuz) used to hit its own per-service status / pause / resume route in web_server.py. unified them under a single registry-driven endpoint set: /api/enrichment/<service>/<action>. spotify\'s rate-limit guard, lastfm/genius yield-override behavior, and tidal/qobuz extra status fields are encoded as data on the registry. 27 new tests cover the registry behavior.' },
         { title: 'Drop Old Per-Service Enrichment Routes', desc: 'internal — followup to the registry consolidation. now that the dashboard has cut over to /api/enrichment/<service>/<action>, deleted the 30 hand-rolled per-service routes from web_server.py (musicbrainz/audiodb/discogs/deezer/spotify/itunes/lastfm/genius/tidal/qobuz status+pause+resume). ~510 lines gone from the monolith, no behavior change.' },