You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
SoulSync/docs/metadata-types-migration.md

5.2 KiB

Typed Metadata Migration Plan

Why

Right now the metadata pipeline has no real contract about the shape of data flowing between providers and consumers. Each provider (Spotify, iTunes, Deezer, Tidal, Qobuz, MusicBrainz, AudioDB, Discogs, Hydrabase) returns its own response shape, and consumer code defensively extracts every field via fallback chains:

def _build_album_info(album_data, album_id, album_name='', artist_name=''):
    images = _extract_lookup_value(album_data, 'images', default=[]) or []
    ...
    return {
        'id': _extract_lookup_value(album_data, 'id', 'album_id',
                                    'collectionId', 'release_id',
                                    default=album_id) or album_id,
        ...
    }

This pattern works but makes the codebase hard to extend safely:

  • Adding a new provider means adding more keys to the fallback chains in every consumer file (currently ~150 call sites of _extract_lookup_value across the codebase).
  • Fixing a bug in extraction means fixing it in N places.
  • New consumers can't trust the data — they re-run defensive logic.
  • Tests are theatre because the contract is "whatever shape happens to come in."

What this PR adds

core/metadata/types.py defines the canonical typed dataclasses:

  • Album — required fields: id, name, artists, release_date, total_tracks, album_type. Optional: image_url, artist_id, genres, label, barcode, external_ids, external_urls.
  • Track — required fields: id, name, artists, album, duration_ms. Optional: track/disc number, image, ISRC, etc.
  • Artist — required fields: id, name. Optional: image, genres.

Plus per-provider classmethod converters on Album:

  • Album.from_spotify_dict(raw)
  • Album.from_itunes_dict(raw)
  • Album.from_deezer_dict(raw)
  • Album.from_discogs_dict(raw)
  • Album.from_musicbrainz_dict(raw)
  • Album.from_hydrabase_dict(raw)
  • Album.from_qobuz_dict(raw)
  • Album.from_tidal_object(obj) — note: Tidal goes through the tidalapi library which returns Python objects rather than raw dicts, so this converter is named _object not _dict to make the input contract explicit.

Enrichment-only providers (Last.fm, Genius, AcoustID, ListenBrainz, AudioDB) don't return Album-shaped responses — they enrich existing rows with tags, lyrics URLs, fingerprint matches, etc. No Album converter needed for those.

Each converter is the SINGLE place that knows that provider's wire shape. Adding a new provider = adding one classmethod here and nothing else needs to change.

Album.to_context_dict() returns the canonical dict shape SoulSync's existing import / download pipelines expect — the bridge between typed data and the current dict-passing internal API.

What this PR DOES NOT do

This PR does not migrate any consumer. No behavior changes. The new types and converters are pure additive — every existing code path keeps using _extract_lookup_value exactly as before.

The reason: a single big-bang migration would be a 153-call-site refactor with subtle behavior risk. Better to land the foundation in isolation, prove the contract via tests, then migrate consumers one at a time in follow-up PRs that are individually reviewable and revertable.

Migration roadmap

Numbered in suggested order. Each item is its own PR.

  1. Foundation (this PR). Land core/metadata/types.py + converters + tests. Document migration plan.
  2. Migrate _build_album_info in core/metadata/album_tracks.py — accept either a typed Album OR a raw dict. When it gets a typed Album, return album.to_context_dict(). When it gets a raw dict, normalize via the appropriate from_<source>_dict() based on the provided source argument. Reduces from 41 LOC of fallback chains to ~5 LOC of dispatch.
  3. Migrate _build_single_import_context_payload in the same file — same pattern.
  4. Migrate Spotify client. SpotifyClient.get_album() returns Album instead of raw dict. Internal callers update. Public API surface unchanged where it has to be (return both for one release, deprecate dict version).
  5. Migrate iTunes/Deezer/Tidal/Qobuz/Discogs/Hydrabase clients. Same pattern. Each client's get_album() returns Album.
  6. Migrate consumers in core/discovery/quality_scanner.py, core/imports/context.py, etc. Drop their fallback chains in favor of typed access.
  7. Add Track converters and migrate Track-shaped consumers. Same pattern as Album.
  8. Add Artist converters and migrate Artist-shaped consumers.
  9. Deprecate _extract_lookup_value. Once no caller needs it, delete it.

Each PR is independently revertable. Behavior preserved at every step.

Acceptance criteria for this PR

  • All converters produce a fully-populated Album from realistic provider response samples.
  • Every required field is set even when source data is partial.
  • to_context_dict() shape is identical across all six providers (pinned via cross-provider parametrized tests).
  • No existing consumer is changed; existing tests pass unchanged.
  • Cross-provider invariants (release_date format, album_type values, Discogs (N) stripping, iTunes artwork upgrade) are pinned.