mirror of https://github.com/Nezreka/SoulSync.git
11 Commits (048e4e85d5ceb6bbb9f6a6d2b2efbc259cc331ff)
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
b42cafa150 |
AcoustID + quarantine modal: three bug fixes (closes #607, closes #608)
Issue #607 (AfonsoG6) -- two AcoustID problems:

1. Live recordings false-quarantining as "Version mismatch: expected '... (Live at Venue)' (live) but file is '...' (original)" because MusicBrainz often stores the recording entity with a bare title -- the venue / live annotation lives on the release entity, not the recording. The audio fingerprint correctly identifies the live recording, but the title-text comparison flagged it as wrong.

   New pure helper `core/matching/version_mismatch.py:is_acceptable_version_mismatch` accepts the mismatch only when:
   - One-sided AND involves 'live': exactly one side is 'live' and the other is bare 'original'. Two-sided mismatches stay strict.
   - Fingerprint score >= 0.85 (stricter than the existing 0.80 minimum -- escape valve only fires when AcoustID is more confident than its own threshold).
   - Bare title similarity >= 0.70.
   - Artist similarity >= 0.60.

   Other version markers (instrumental, remix, acoustic, demo, etc) stay strict -- those have distinct fingerprints AND MB always annotates them in the recording title. The existing test_acoustid_version_mismatch.py suite passes unchanged.

2. Audio-mismatch failure message reported "identified as '' by '' (artist=100%)" when AcoustID returned multiple recordings -- prior code mixed `recordings[0]`'s strings (which can be empty) with `best_rec`'s scores. Now uses `matched_title` / `matched_artist` consistently in both the high-confidence-skip path and the final fail message.

Issue #608 (AfonsoG6) -- quarantine modal:

3. Approve / Delete buttons silently no-op'd when the filename contained an apostrophe -- the unescaped quote broke the inline JS in the onclick handler. Now wraps the id via `escapeHtml(JSON.stringify(id))`, which round-trips quotes / backslashes / unicode / newlines safely through the HTML attribute to JS string boundary.

4. Bonus UX: quarantine entry expanded view now shows source uploader (username) and original soulseek filename when the sidecar carries that context -- helps trace which uploader the bad file came from. Backend exposes `source_username` + `source_filename` fields from `sidecar.context.original_search_result`. Degrades to '' on legacy thin sidecars.

Tests:
- 23 new boundary tests in tests/matching/test_version_mismatch.py pin every shape: equal versions trivial, one-sided live both directions, threshold floors (each just below default -> reject), two-sided strict, non-live one-sided strict (covers exact test_instrumental_returned_for_vocal_request_fails scenario), custom-threshold overrides.
- 4 existing test_acoustid_version_mismatch.py tests pass unchanged.
- 507 AcoustID / matching / imports tests pass.
|
2 weeks ago |
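The acceptance rule described in commit b42cafa150 can be sketched as a small pure function. This is an illustration, not the project's code: the parameter names, the plain-string version labels, and the choice to treat equal versions as trivially acceptable are assumptions. Only the thresholds (0.85 / 0.70 / 0.60) and the one-sided 'live' rule come from the commit message.

```python
def is_acceptable_version_mismatch(
    expected_version: str,
    actual_version: str,
    *,
    fingerprint_score: float,
    title_similarity: float,
    artist_similarity: float,
    min_fingerprint: float = 0.85,  # stricter than the 0.80 verification minimum
    min_title: float = 0.70,
    min_artist: float = 0.60,
) -> bool:
    """Return True when a version mismatch may be tolerated (hypothetical signature)."""
    if expected_version == actual_version:
        return True  # assumption: equal versions need no excuse
    # One-sided 'live' only: exactly one side is 'live', the other bare 'original'.
    if {expected_version, actual_version} != {"live", "original"}:
        return False
    return (
        fingerprint_score >= min_fingerprint
        and title_similarity >= min_title
        and artist_similarity >= min_artist
    )
```

Under this sketch, an instrumental returned for a vocal request is rejected regardless of scores, because the version pair is not `{"live", "original"}`.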
|
|
9cc09118bf |
AcoustID scanner: multi-candidate match + duration guard + multi-value retag
Closes #587. Three coordinated fixes per codex's diagnosis. AcoustID verification gate left intact — these fixes target the upstream scanner false-positive surface plus a separate retag-path gap.

Bug 1 — scanner used recordings[0] as authoritative

`core/repair_jobs/acoustid_scanner.py:_scan_file` only checked the top fingerprint match's metadata. AcoustID often returns multiple recordings per fingerprint (sample collisions, multi-MB-record cases) and the wrong-credited recording can outrank the right-credited one. Foxxify case 2 (Nana / Nana): top match credited the wrong artist while a lower-ranked candidate matched the user's expected metadata exactly.

Lifted the verifier's all-candidates check to a shared pure helper `core/matching/acoustid_candidates.py:find_matching_recording`. Both verifier and scanner can now ask "given these candidates, does ANY of them match expected (title, artist)?" with the same contract. Scanner suppresses the finding when any candidate matches.

Bug 2 — no duration check guards against fingerprint hash collisions

Foxxify case 3: 17-minute mashup edit fingerprinted to a 5-minute late-70s Japanese hiphop track (different songs, fingerprint hash collision on a sampled section). Scanner had no signal to detect this and would have recommended retagging the 17-min file as the 5-min track.

`duration_mismatches_strongly` in the same helper module flags drifts beyond max(60s, 35%). Scanner now skips findings when the candidate's duration disagrees strongly with the file's expected duration. Loaded duration via the existing tracks SQL (added `t.duration` to the SELECT). Returns False when either side is unknown — no behavior change for older rows without duration data.

Bug 3 — scanner retag bypassed multi-value ARTISTS tag setting

`core/repair_worker.py:_fix_wrong_song` called `write_tags_to_file` with single-string artist updates. The writer only wrote TPE1 (single string) and never read the user's `metadata_enhancement.tags.write_multi_artist` config. Multi-value ARTISTS tags got stripped on every retag, contradicting the post-download enrichment pipeline's behavior.

Per codex's pick (option B over routing through enhance_file_metadata), extended `write_tags_to_file` with an optional `artists_list` parameter. Each format-specific writer respects the config flag the same way enrichment.py does:
- ID3: TPE1 stays as joined display string + TXXX:Artists multi-value
- Vorbis/Opus/FLAC: `artist` display string + `artists` multi-value key
- MP4: \xa9ART as list when on, single string when off

Scanner retag derives the per-artist list by splitting AcoustID's credit through the existing `split_artist_credit` helper (same separators the matching layer already uses).

Backward compatible: callers that don't pass `artists_list` get the exact same single-string write as before. No regression for the write_artist_image button or any other tag_writer caller.

15 tests on the candidate helper + duration guard. 13 tests on the tag_writer multi-value path (write/skip/single/no-list cases for FLAC + the config-gate helper). 4 new scanner regression tests pinning lower-ranked candidate suppression, no-suppression when no candidate matches, duration mismatch skip, no-skip when duration matches. Existing scanner tests updated for the new 11-column SQL select (added duration column to fake schema + test row tuples).

Full suite: 3097 passed. Ruff clean.
|
2 weeks ago |
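The duration guard from commit 9cc09118bf ("drifts beyond max(60s, 35%)") might look roughly like the sketch below. The commit message does not say what the 35% is relative to or which unit the stored duration uses, so this version assumes seconds and applies the ratio to the file's expected duration. The parameter names are also assumptions.

```python
from typing import Optional


def duration_mismatches_strongly(
    expected_seconds: Optional[float],
    candidate_seconds: Optional[float],
    *,
    min_drift_seconds: float = 60.0,
    max_drift_ratio: float = 0.35,
) -> bool:
    """Flag a candidate whose duration is far from the file's expected duration."""
    # Unknown on either side: no signal, so never block (older rows keep old behavior).
    if not expected_seconds or not candidate_seconds:
        return False
    # Assumption: the 35% tolerance is measured against the expected duration.
    allowed = max(min_drift_seconds, max_drift_ratio * expected_seconds)
    return abs(expected_seconds - candidate_seconds) > allowed
```

With the reporter's numbers (a 17-minute file against a 5-minute candidate) the drift is 720 s against an allowance of 357 s, so the finding would be skipped.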
|
|
0aa18b0180 |
Cross-script artist aliases: include canonical name + non-strict fallback
Closes #586. Follow-up to #442 — Cyrillic / kanji canonical names weren't bridging cross-script comparisons. Reporter case: "Dmitry Yablonsky" tracks quarantined as audio mismatch with file identified as "Русская филармония, Дмитрий Яблонский" (4% artist sim) even though the Cyrillic spelling is just the Russian transliteration.

Codex diagnosed three layered bugs in the alias resolution chain. This fixes all three.

Bug 1 — fetch_artist_aliases ignores canonical name + sort-name

`core/musicbrainz_service.py:fetch_artist_aliases` only read `data['aliases']`. For artists where MB's canonical `name` IS the cross-script form (and the Latin spelling lives only in aliases — or vice versa), the missing direction never made it into the returned list. Fix: include both `data['name']` and `data['sort-name']` alongside the explicit alias entries (deduped, also pulls each alias entry's sort-name when present).

Bug 2 — lookup_artist_aliases ran search in strict mode only

Strict mode queries `artist:"..."` only and skips MB's alias and sortname indexes. Cross-script searches found nothing under strict because the user's Latin input never matches a Cyrillic canonical name in the artist index. Fix: lifted the search-and-score logic to a private helper `_search_and_score_artists(name, strict=)` and fall back to non-strict when strict returns empty OR all results fail the trust gate. Non-strict (bare query) hits all indexes.

Bug 3 — trust gate weighted local similarity 70%

Combined score = local_sim * 0.7 + mb_score/100 * 0.3. Cross-script pairs have local sim ~0 → combined ~0.30 → below the 0.85 threshold → cached as empty even when MB's own confidence was 100. Fix: added an MB-only escape — when MB score is >= 95 AND the result is unambiguous (top result's MB score leads the runner-up by >= 5), accept regardless of local similarity. The existing combined-score path stays intact for same-script matches (#442 Hiroyuki Sawano case still passes via that path).

12 new tests pin every layer:
- fetch_artist_aliases canonical-name inclusion + dedup against alias entries + missing-canonical handling + exception path
- strict-then-non-strict fallback (empty-strict + low-strict-score)
- trust gate MB-only escape + low-confidence rejection + ambiguity rejection (two artists same MB score) + same-script regression
- end-to-end reporter scenario with the real `artist_names_match` helper proving the bridge works for "Русская филармония, Дмитрий Яблонский" vs expected "Dmitry Yablonsky"

Existing alias tests in `test_artist_alias_service.py` updated to reflect: canonical name now appears in `fetch_artist_aliases` output, lookup makes 2 search calls (strict + non-strict fallback) on first cache miss instead of 1.

Full suite: 3065 passed.
|
2 weeks ago |
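The arithmetic behind Bug 3 in commit 0aa18b0180 is easy to check by hand, and the sketch below encodes it. The formula, the 0.85 threshold, the MB score floor of 95, and the lead of 5 are taken from the commit message. The function names and the `(local_sim, mb_score)` tuple shape are hypothetical, and the separate ambiguity gate on the combined-score path is deliberately left out.

```python
from typing import Sequence, Tuple


def combined_score(local_sim: float, mb_score: float) -> float:
    """Weighted blend: 70% local name similarity, 30% MusicBrainz relevance (0-100)."""
    return local_sim * 0.7 + (mb_score / 100.0) * 0.3


def trust_top_result(
    results: Sequence[Tuple[float, float]],  # (local_sim, mb_score) per search hit
    *,
    threshold: float = 0.85,
    mb_only_floor: float = 95.0,
    mb_lead: float = 5.0,
) -> bool:
    """Decide whether the best MB search hit is trustworthy enough to pull aliases from."""
    if not results:
        return False
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    local_sim, mb_score = ranked[0]
    # Same-script path: the blended score clears the threshold on its own.
    if combined_score(local_sim, mb_score) >= threshold:
        return True
    # MB-only escape: very high MB score with a clear lead over the runner-up.
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return mb_score >= mb_only_floor and (mb_score - runner_up) >= mb_lead
```

A cross-script pair with local similarity 0 and MB score 100 blends to 0.30, which is why the escape path is needed at all. Two hits that both score 100 have a lead of 0 and are rejected.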
|
|
e7ecaca3fd |
Fix MTV Unplugged & live-album false-quarantine pipeline
Closes #589. Tracks from MTV Unplugged / Live At / unplugged albums consistently failed AcoustID verification with "Version mismatch: expected (live) but file is (original)". Two upstream bugs fed into the false positive — the AcoustID gate itself was correctly catching the wrong file Tidal had selected. Codex diagnosed all three layers, this fixes the two upstream causes and leaves the verifier alone.

Bug 1 — album-scoped library check false-misses owned albums

`core/downloads/master.py:184` scored "Shy Away (MTV Unplugged Live)" (source title from playlist) vs "Shy Away" (local DB stored title) with raw string similarity. Massive length asymmetry → ~0.3 → below the 0.7 threshold → marked missing. Combined with the `allow_duplicates and batch_is_album` short-circuit that disables the global fallback for album downloads, the user's already-owned album re-triggered every track for download. Explains the screenshot showing "0 found / 7 missing" on an album the user manually placed.

New pure helper `core/matching/album_context_title.py:strip_redundant_album_suffix` strips trailing parenthetical / bracket / dash suffixes whose tokens are fully subsumed by the album context — at least one version marker (live / unplugged / acoustic / session / concert / tour) overlapping with the album, and every other token is either a known marker, a year, a tolerated noise word, or a word from the album title. Album-context-implied "live" added when the album mentions unplugged / concert / tour / session.

Wired into the album-confirmed scope ONLY (not global matching). Compares both raw and normalized source titles per album track and takes the max similarity, so the helper returning the input unchanged (when album doesn't imply version context) preserves the pre-fix behavior.

Bug 2 — Tidal qualifier filter only ran on fallback searches

`core/tidal_download_client.py:345` set `is_fallback = attempt_idx > 0` and only filtered when `is_fallback and required_qualifiers`. Primary search returned all results unfiltered, so a query for "Shy Away (MTV Unplugged Live)" could accept the studio cut if Tidal happened to rank it first. Now the qualifier filter applies to BOTH primary and fallback search attempts — log message updated to indicate which path triggered.

Bug 3 — qualifier check ignored album.name

The legacy `_track_name_contains_qualifiers` only inspected the track name. For concert / unplugged releases the live signal typically lives in the album title, not the track title. New `_track_matches_qualifiers` accepts a track object and inspects both `track.name` AND `track.album.name`. Legacy helper preserved to keep its existing test contract.

AcoustID version-mismatch gate at core/acoustid_verification.py left intact — it correctly catches genuinely-wrong files that slip through upstream filters. The In My Feelings (Instrumental) test that pins this behavior continues to pass.

19 tests on the album-context helper covering MTV Unplugged variants, dash/parens/brackets suffix shapes, year tolerance, plural-form markers, the implied-live set, anti-regression cases (instrumental/remix on a studio album must NOT be stripped), empty/none defensive paths.

13 tests on the Tidal qualifier helper covering legacy track-name-only behavior preserved, qualifier in track name alone, qualifier in album name alone (the MTV Unplugged scenario), multi-qualifier requirements, no-qualifiers always passes, defensive against missing track.album, word-boundary avoiding substring false-matches, _extract_qualifiers picking up live + unplugged from the user's exact reporter query.

Full suite: 3053 passed.
|
2 weeks ago |
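The Bug 3 change in commit e7ecaca3fd (check the album name as well as the track name, on word boundaries) could be approximated as follows. This is a sketch under assumptions: the function is written without the leading underscore, tracks are stand-in objects exposing `name` and `album.name`, and each qualifier is allowed to appear in either field.

```python
import re
from types import SimpleNamespace
from typing import Iterable


def track_matches_qualifiers(track, required_qualifiers: Iterable[str]) -> bool:
    """True when every required qualifier appears in the track name or album name."""
    album = getattr(track, "album", None)  # tolerate tracks without an album object
    parts = (getattr(track, "name", None), getattr(album, "name", None))
    haystack = " ".join(part for part in parts if part).lower()
    # \b keeps 'live' from matching inside 'Alive' (substring false-match).
    return all(
        re.search(rf"\b{re.escape(q.lower())}\b", haystack)
        for q in required_qualifiers
    )


# Stand-in objects for illustration only; not the real Tidal track type.
unplugged = SimpleNamespace(name="Shy Away", album=SimpleNamespace(name="MTV Unplugged"))
studio = SimpleNamespace(name="Alive", album=SimpleNamespace(name="Ten"))
```

Here `track_matches_qualifiers(unplugged, ["unplugged"])` passes on the album name alone, while `track_matches_qualifiers(studio, ["live"])` fails because "Alive" contains no word-bounded "live".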
|
|
df304eb016 |
AcoustID scanner: handle multi-value artist credits
Discord report (Foxxify): the AcoustID scanner repair job flagged
multi-artist tracks as Wrong Song because AcoustID returns the
FULL credit ("Okayracer, aldrch & poptropicaslutz!") while the
library DB carries only the primary artist ("Okayracer"). Raw
SequenceMatcher similarity scored ~43% — well below the 60%
threshold — so the scanner created a finding even though the
audio was correct. User couldn't fix without lowering the global
artist threshold to ~30% (which would let real mismatches through).
# Fix
Extended the shared `core/matching/artist_aliases.py::artist_names_match`
helper (originally lifted for #441) with credit-token splitting.
When the actual artist string contains common separators —
- punctuation: `,` `&` `;` `/` `+`
- keywords (whitespace-bounded): `feat.` `ft.` `featuring` `with`
`vs.` `x`
— the helper splits into individual contributors and checks each
against the expected artist. Primary-in-credit cases now resolve
at 100% instead of 43%.
Two pattern groups because punctuation separators don't need
surrounding whitespace, but keyword separators MUST be
whitespace-bounded — otherwise we'd split artists with `x` /
`with` etc. in their names ("JAY-X" → "JAY-" / "" issue).
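A minimal sketch of the two-pattern split described above, assuming the separator sets listed in this message. The regexes and internal names are illustrative and may differ from the real `split_artist_credit`.

```python
import re

# Punctuation separators need no surrounding whitespace.
_PUNCT_SEPARATORS = re.compile(r"[,&;/+]")
# Keyword separators must be whitespace-bounded so names like "JAY-X" survive.
_KEYWORD_SEPARATORS = re.compile(
    r"\s+(?:feat\.|ft\.|featuring|with|vs\.|x)\s+", re.IGNORECASE
)


def split_artist_credit(credit: str) -> list[str]:
    """Split a full artist credit into individual contributor names."""
    if not credit:
        return []
    parts = [credit]
    for pattern in (_KEYWORD_SEPARATORS, _PUNCT_SEPARATORS):
        parts = [piece for part in parts for piece in pattern.split(part)]
    # Drop empty tokens and strip per-token whitespace.
    return [p.strip() for p in parts if p.strip()]
```

With this sketch, `split_artist_credit("Okayracer, aldrch & poptropicaslutz!")` yields three tokens and `split_artist_credit("JAY-X")` stays a single token.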
Composes with the existing alias path: cross-script multi-artist
credits ("Hiroyuki Sawano" expected, "澤野弘之, FeaturedJp"
actual) work via alias-token-against-credit-token compare.
# Wire-in
Scanner at `core/repair_jobs/acoustid_scanner.py:202` replaces
the raw `SequenceMatcher` call with `artist_names_match`. Pass
RAW artist strings (not pre-normalised by `_normalize`) so the
splitter can recognise separators — `_normalize` strips ALL
punctuation, which destroyed the very tokens the splitter needs.
The AcoustID post-download verifier (`core/acoustid_verification.py`)
already routes through `_alias_aware_artist_sim` which calls the
same helper — gets the multi-value benefit automatically without
a separate wire-in.
# New `split_artist_credit` exported helper
Pure-function helper for callers who want token-level access to
the credit list (debugging, UI, future per-token enrichment). Same
splitter logic, exposed as a top-level function.
# Tests added (14)
`tests/matching/test_artist_aliases.py` (+11):
- `TestSplitArtistCredit` — parametrised across 12 credit-string
formats (comma, ampersand, semicolon, slash, plus, feat./ft./
featuring, with, vs., x, single-token, empty), drops empty
tokens, strips per-token whitespace
- `TestMultiValueCreditMatching` — reporter's exact case
(Okayracer in 3-artist credit → 100%), primary in middle/end of
credit, genuine-mismatch still fails, single-token actual falls
through to direct compare, multi-value composes with aliases,
threshold still respected
`tests/test_acoustid_scanner.py` (+3):
- Reporter's case end-to-end through `_scan_file` — fingerprint
99% / title 100% / multi-artist credit → no finding created
- Genuine artist mismatch still creates finding (no false
suppression of real mismatches)
- `JobResultStub` minimal scaffold for the integration tests
# Verification
- 14 new tests pass (49 helper + 5 scanner total in their files)
- 110 matching + scanner tests pass total
- 2584 full suite passes (+25 from baseline 2559)
- Ruff clean
- Reporter's exact case (Okayracer in `Okayracer, aldrch &
poptropicaslutz!`) now scores 100% match → no Wrong Song flag
|
2 weeks ago |
|
|
bc34d39ce9 |
Tighten alias-lookup trust + add ambiguity gate + diagnostic log
Cin pre-review pass on the false-positive risk. Three tightenings:
# 1. Bumped MB-search trust threshold from 0.6 → 0.85
`MusicBrainzService.lookup_artist_aliases` previously trusted any
MB search match scoring ≥ 0.6 combined (name-similarity + MB
relevance). For distinctive cross-script artists the user-reported
case targets (Hiroyuki Sawano, Сергей Лазарев, etc.) real matches
score ~1.0 — well above 0.85. The 0.6 floor was loose enough to
let in moderate matches for ambiguous names, risking aliases for
the wrong artist getting cached + applied.
Bumped to 0.85. Tighter without rejecting any of the legit
cross-script cases the PR is for.
# 2. Ambiguity gate — skip when results within 0.1 of best
When MB search returns multiple results all scoring high (within
0.1 of the best), the artist name is ambiguous — common name with
multiple distinct artists ("John Smith" returning 10 different
John Smiths). Pulling aliases for any one of them risks the wrong
artist's data bridging incorrectly to a file's tag.
Added explicit ambiguity detection: when 2+ results within 0.1,
skip alias lookup entirely + cache empty. Matches Cin's
"explicit > implicit" — the prior code just picked the highest
score blindly.
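One way to picture the ambiguity gate, as a sketch only. The function name is hypothetical, and treating "within 0.1" as inclusive is an assumption.

```python
def is_ambiguous(scores: list[float], *, margin: float = 0.1) -> bool:
    """True when two or more results score within `margin` of the best result."""
    if len(scores) < 2:
        return False
    best = max(scores)
    # The best result counts itself, so a second close score makes the name ambiguous.
    close = [s for s in scores if best - s <= margin]
    return len(close) >= 2
```

For example, scores of `[0.98, 0.95, 0.40]` would skip the alias lookup, while `[1.0, 0.6]` would proceed with the top result.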
# 3. Diagnostic log when alias rescues a comparison
When the alias path triggers a PASS that direct similarity would
have FAILed, emit an INFO log: `Artist alias rescued comparison:
expected='X' vs actual='Y' (direct sim=0.00, alias 'Z' →
score=1.00)`.
Lets future bug reports trace which alias triggered which decision.
Doesn't change behavior — visibility only. Logs ONLY the rescue
case, not happy-path direct matches (no log spam).
# Tests added (5)
`test_artist_alias_service.py` (+3):
- `test_moderate_confidence_match_now_skipped_strict_threshold`
- `test_ambiguous_results_skipped`
- `test_unambiguous_high_confidence_match_succeeds`
`test_acoustid_verification_aliases.py` (+3):
- `test_alias_rescue_emits_info_log` — direct-fail + alias-pass
emits INFO log
- `test_no_log_when_direct_match_succeeds` — happy path quiet
- `test_no_log_when_alias_doesnt_help` — failed path also quiet
# Test infrastructure note
Logging tests use a directly-attached `ListHandler` on
`soulsync.acoustid.verification` (the actual logger name —
dot-separated by `get_logger`), NOT pytest's caplog. Same pattern
as the prior watchdog-test fix — caplog is intermittently flaky
in full-suite runs for soulsync namespace loggers. An owned
handler sidesteps both issues.
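The owned-handler pattern can be reproduced with the standard `logging` module alone. The sketch below is an assumption about its shape: only the handler name `ListHandler` and the logger name come from this message. The `captured` context manager is hypothetical, and the logger is obtained with `logging.getLogger` instead of the project's `get_logger`.

```python
import logging
from contextlib import contextmanager


class ListHandler(logging.Handler):
    """Collects log records in memory so a test can assert on them."""

    def __init__(self) -> None:
        super().__init__(level=logging.DEBUG)
        self.records: list[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


@contextmanager
def captured(logger_name: str, level: int = logging.INFO):
    """Attach a ListHandler directly to one logger, then restore its state."""
    logger = logging.getLogger(logger_name)
    handler = ListHandler()
    previous_level = logger.level
    logger.addHandler(handler)
    logger.setLevel(level)
    try:
        yield handler
    finally:
        logger.removeHandler(handler)
        logger.setLevel(previous_level)


with captured("soulsync.acoustid.verification") as handler:
    logging.getLogger("soulsync.acoustid.verification").info(
        "Artist alias rescued comparison: expected=%r vs actual=%r", "X", "Y"
    )
messages = [record.getMessage() for record in handler.records]
```

Because the handler sits on the named logger itself, the capture does not depend on propagation to the root logger.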
# Verification
- 85/85 matching tests pass (+5 from prior commit)
- 2543 full suite passes (+6 from prior, +85 PR-total)
- Ruff clean
- Reporter's Japanese + Russian regression tests still pass —
legit cross-script case (sim ≈ 1.0) clears the new 0.85
threshold easily
|
2 weeks ago |
|
|
11397307b2 |
Alias resolution polish: lazy-fire on direct-match failure + worker backfill
Two perf gaps that would have failed Cin's review:

# Gap #1: alias lookup fired unconditionally
Pre-fix in this commit, `_resolve_expected_artist_aliases` ran at the top of every `verify_audio_file` call regardless of whether the direct artist match would have passed. For users whose library is mostly same-script (95% of cases), every successful verification was paying for a wasted DB query (and possibly a wasted MB API call for un-enriched artists).

Restructured the helper to accept a callable provider instead of a pre-resolved list. Provider invoked LAZILY only when direct similarity falls below `ARTIST_MATCH_THRESHOLD`. Verifier passes a memoising thunk that resolves once across the 3 comparison sites within one verification.

`_alias_aware_artist_sim` now accepts `aliases` as either:
- iterable of strings (used eagerly — backward compat with tests that already know the aliases)
- callable returning the iterable (resolved on first need within a verification)

Happy path (direct match passes): zero DB queries, zero MB calls. Cross-script case: one resolution shared across 3 sites — same as the prior contract.

# Gap #2: existing-MBID artists never got alias backfill
Worker's `_process_item` artist branch had an `existing_id` short-circuit (line 296) that updated MBID status but skipped alias fetch. Result: every user with an already-enriched library had MBIDs but NULL aliases on day-one of this PR. Live MB lookup at verify-time covered them, but at the cost of N live calls for N artists across the library.

Added one-time backfill: when existing-MBID is found AND `artists.aliases` for that row is empty, fetch + persist aliases. Subsequent re-scan cycles short-circuit on the populated column — no repeated MB calls.

New helper `_artist_aliases_empty(artist_id)` does the cheap NULL check via direct SQL. Best-effort: defensively returns True on errors so backfill happens (a redundant MB call is cheaper than missing the backfill entirely).

# Tests added (9)
`test_acoustid_verification_aliases.py` (+6):
- `TestLazyAliasResolution` (3): no lookup when direct match passes, lookup fires only when direct fails, lookup memoised across the 3 sites within one verification.
- `TestAliasProviderCallable` (3): iterable passed directly, callable resolves lazily, callable returning empty falls back to direct sim.

`test_artist_alias_service.py` (+3):
- `test_existing_mbid_path_backfills_aliases_when_column_empty`
- `test_existing_mbid_path_skips_backfill_when_aliases_already_set`
- `test_existing_mbid_backfill_failure_does_not_break_match`

# Verification
- 79/79 matching tests pass (+9 from prior commit)
- 2537 full suite passes (+9, +79 PR-total)
- Ruff clean
- Backward compat: every prior-commit test still passes (the iterable-shape API still works alongside the new callable shape)
|
2 weeks ago |
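The lazy-provider idea in commit 11397307b2 (Gap #1) can be sketched with a memoising thunk. This is not the verifier's code: the names lack the original underscores, `memoise` is a hypothetical helper, and the similarity function is a plain lowercase `SequenceMatcher` stand-in for the verifier's normaliser.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, Union

ARTIST_MATCH_THRESHOLD = 0.6
AliasSource = Union[Iterable[str], Callable[[], Iterable[str]]]


def similarity(a: str, b: str) -> float:
    """Stand-in normaliser: case-insensitive SequenceMatcher ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def alias_aware_artist_sim(expected: str, actual: str, aliases: AliasSource = ()) -> float:
    """Best similarity between `actual` and the expected name or its aliases."""
    best = similarity(expected, actual)
    if best >= ARTIST_MATCH_THRESHOLD:
        return best  # happy path: the alias provider is never invoked
    resolved = aliases() if callable(aliases) else aliases
    for alias in resolved or ():
        best = max(best, similarity(alias, actual))
    return best


def memoise(provider: Callable[[], Iterable[str]]) -> Callable[[], list[str]]:
    """Wrap a provider so it runs at most once, even when it returns nothing."""
    cache: list[list[str]] = []

    def thunk() -> list[str]:
        if not cache:
            cache.append(list(provider()))
        return cache[0]

    return thunk
```

A verifier built this way would create one `memoise(...)` thunk per verification and pass it to each of the three comparison sites, so a cross-script case costs one lookup and a same-script case costs none.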
|
|
7066233c37 |
Wire alias-aware artist match into AcoustID verifier — fixes #442
This is the user-visible commit. The reporter's exact two cases (Japanese kanji, Russian Cyrillic) now pass verification instead of being quarantined.

# What changed
Verifier's three artist-similarity sites now route through the shared `core.matching.artist_aliases.artist_names_match` helper instead of raw `_similarity`:
- `_find_best_title_artist_match` (per-recording scoring at the best-match stage)
- Secondary scan when title matches but best-match's artist doesn't (line ~355 pre-fix)
- Final fallback scan over all recordings (line ~403 pre-fix)

Aliases for the expected artist are resolved ONCE at the top of `verify_audio_file` via `_resolve_expected_artist_aliases`, which calls the new `MusicBrainzService.lookup_artist_aliases` chain (library DB → cache → live MB). Single resolution per verification regardless of how many AcoustID recordings come back — pinned by test.

New helper `_alias_aware_artist_sim(expected, actual, aliases)` wraps the pure helper with the verifier's normaliser (`_similarity`) and threshold (`ARTIST_MATCH_THRESHOLD`). Returns a single float so existing threshold-comparison code paths keep their shape — minimal diff.

# Reporter's cases — verified
Case 1 (issue #442 verbatim):
- File: YAMANAIAME by 澤野弘之
- Expected: YAMANAIAME by Hiroyuki Sawano
- Pre-fix: Quarantined (artist=0%)
- Post-fix: PASS (alias '澤野弘之' resolved from MB)

Case 2 (issue #442 verbatim):
- File: On the Other Side by Sergey Lazarev
- Expected: On the other side by Сергей Лазарев
- Pre-fix: Quarantined (artist=7%)
- Post-fix: PASS (alias 'Sergey Lazarev' resolved from MB)

Both reproduced as regression tests with stubbed MB service.

# Backward compat
Three test cases pin that no-aliases / failure paths preserve pre-fix behaviour exactly:
- Clear artist mismatch (different artist, same script) still FAILs — aliases bridge synonyms, not unrelated artists.
- Exact title + artist match still PASSes regardless of aliases.
- MB service raise → verifier completes with direct similarity (treats failure as "no aliases available" — same as pre-fix).

Also covers manual import: the import-modal "Search for Match" flow goes through the same verifier, so the reporter's complaint that "manual import simply throws them back in quarantine again" is fixed by the same change.

# Tests added (11)
`tests/matching/test_acoustid_verification_aliases.py`:
- `_alias_aware_artist_sim`: alias bridges score ↑, no-aliases falls back, aliases don't mask genuine mismatches
- `_find_best_title_artist_match` accepts + uses aliases
- Reporter's case 1 (Japanese) end-to-end
- Reporter's case 2 (Russian) end-to-end
- Backward compat: no-aliases mismatch still fails, exact match still passes, MB-service-raise doesn't break verification
- Performance: alias lookup fires ONCE per verification regardless of recording count

# Verification
- 11 new verifier tests pass
- 31 prior service tests pass
- 28 prior helper tests pass
- 294 matching + imports tests pass total (no regression)
- Ruff clean
|
2 weeks ago |
|
|
15244f24cf |
Live MB lookup for un-enriched artists with cache
Previous commit only populated `artists.aliases` for artists the MB worker had enriched. But the AcoustID verifier (next commit) needs aliases for ANY expected artist — including:
- Artists not yet in the user's library (first download)
- Artists in the library where MB enrichment hasn't run yet
- Artists where MB enrichment ran but found no MBID (NULL aliases)

This commit adds a multi-tier resolution helper that fills those gaps without thrashing the MB API.

# Multi-tier resolution
`lookup_artist_aliases(artist_name) -> list[str]`:
1. **Library DB** (fast path): existing `get_artist_aliases` lookup by name. No network. Most common path once the worker has enriched everything.
2. **Cache** (existing `musicbrainz_cache` table, entity_type=`artist_aliases`): a prior live lookup for this name. Empty cache hit is respected (don't re-query when MB previously had nothing).
3. **Live MB**: search artist by name → pick highest-confidence match (combined name-similarity + MB relevance) → fetch aliases for that MBID → cache the result.

Always returns a list (possibly empty), never raises. Empty result on any tier means "no alternate spellings found, fall back to direct match" — identical to the pre-fix behaviour.

# Threshold gate
Live lookup only trusts the MB search result when combined similarity score >= 0.6. Below that, we'd be guessing at the wrong artist — searching `John Smith` returns multiple John Smiths and pulling aliases for one of them could mismatch. Cache the empty result so we don't keep re-searching the same low-confidence name.

# Performance contract
Critical for the verifier path: 100 quarantine candidates with the same expected artist must NOT trigger 100 MB API calls. Cache hit on second + subsequent calls per unique artist name. Verified by test pinning the call counts.

# Tests added (8)
- Tier 1 library DB hit — no MB API call fired
- Tier 3 live MB lookup → search → fetch → returns aliases
- Tier 2 cache hit on second call — no re-query
- Empty input → empty return + no API call
- Network failure on search → empty + cached so we don't retry
- No search results → empty + cached
- Low-confidence match (sim < 0.6) skipped — defends against picking the wrong artist
- Library row exists but aliases NULL → falls through to live lookup (defends against the half-enriched state)

# Verification
- 31/31 service tests pass (8 new + 23 prior)
- Ruff clean
|
2 weeks ago |
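The three-tier chain in commit 15244f24cf can be sketched with plain dictionaries standing in for the library table and the `musicbrainz_cache` table. The keyword parameters (`library`, `cache`, `live_lookup`) are assumptions made for illustration. The real method takes only the artist name, and the confidence gate is assumed to live inside `live_lookup` here.

```python
from typing import Callable, Dict, List, Optional


def lookup_artist_aliases(
    artist_name: Optional[str],
    *,
    library: Dict[str, List[str]],
    cache: Dict[str, List[str]],
    live_lookup: Callable[[str], List[str]],
) -> List[str]:
    """Resolve aliases via library, then cache, then live lookup. Never raises."""
    key = artist_name.strip().lower() if artist_name else ""
    if not key:
        return []  # empty input: no lookups of any kind
    # Tier 1: library row with populated aliases (a NULL/empty row falls through).
    if library.get(key):
        return list(library[key])
    # Tier 2: a cached result, including a cached empty list, is respected.
    if key in cache:
        return list(cache[key])
    # Tier 3: live lookup; failures count as "no aliases" and are cached too.
    try:
        aliases = list(live_lookup(artist_name))
    except Exception:
        aliases = []
    cache[key] = aliases
    return list(aliases)
```

Calling this 100 times for the same uncached artist triggers `live_lookup` once, which is the performance contract the commit message pins with call-count tests.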
|
|
48d848bb74 |
MB worker populates artists.aliases on enrichment
Issue #442 — MusicBrainz exposes alternate-spelling aliases (Japanese kanji `澤野弘之` for `Hiroyuki Sawano`, Cyrillic `Сергей Лазарев` for `Sergey Lazarev`, etc.) on every artist record. SoulSync's MB enrichment worker had access to this data via `get_artist(mbid, includes=['aliases'])` but wasn't reading or persisting it.

This commit wires the alias fetch into the worker's existing artist-match path, persists to the new `artists.aliases` column added in the prior commit, and adds a verifier-friendly read-by-name lookup so the AcoustID verifier (next commit) can resolve aliases without an MB round-trip when the artist is in the library.

# New service methods
- `fetch_artist_aliases(mbid) -> list[str]` — calls `mb_client.get_artist(mbid, includes=['aliases'])`, parses the alias array, dedupes case-insensitively. Returns empty list on any failure (missing key, network error, malformed response) so transient MB outages never trigger stricter quarantine decisions than the pre-fix behaviour. Empty mbid → no API call.
- `update_artist_aliases(artist_id, aliases)` — persists as JSON array to `artists.aliases`. Idempotent — overwrites prior value. Empty list clears the column. None artist_id is a no-op.
- `get_artist_aliases(artist_name) -> list[str]` — reads back by artist NAME (not id), case-insensitive. Used by the verifier where the expected artist comes from track metadata — there's no library row id at quarantine time. Returns empty list for unknown artists, missing data, or corrupt JSON (defensive against legacy rows).

# Worker integration
`MusicBrainzWorker._process_item` artist branch:
- After `update_artist_mbid` succeeds, fetch aliases for the matched MBID and persist via `update_artist_aliases`.
- Best-effort: alias fetch wrapped in try/except, failure logs at debug level, doesn't regress the match outcome.
- No alias call when the artist didn't match an MBID (nothing to enrich).

# Tests (23)
- `fetch_artist_aliases`: extracts names from MB response, case-insensitive dedup, skips empty/null entries, missing-key fallback, network failure → empty, empty mbid no API call, verifies `inc=aliases` request param.
- `update_artist_aliases`: persists as JSON, idempotent overwrite, empty list clears column, None id is no-op.
- `get_artist_aliases`: returns aliases for known artist, case-insensitive lookup, empty for unknown artist / no-aliases row, handles corrupt JSON + non-list shape gracefully.
- Worker integration: matched artist triggers fetch + persist, no alias call when not matched, alias-fetch failure doesn't break the match outcome.

# Verification
- 23/23 new tests pass
- Ruff clean
|
2 weeks ago |
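The parse, persist, and read-back steps from commit 48d848bb74 might be sketched like this with `sqlite3` and `json`. The table layout (`artists(id, name, aliases)`), the explicit `conn` argument, and the helper name `extract_alias_names` are assumptions. The real methods live on the service and talk to the MB client, which is not modelled here.

```python
import json
import sqlite3
from typing import List, Optional


def extract_alias_names(artist_payload: dict) -> List[str]:
    """Pull alias names from an MB-style artist payload, deduped case-insensitively."""
    seen, names = set(), []
    for entry in artist_payload.get("aliases") or []:
        name = entry.get("name") if isinstance(entry, dict) else None
        if not isinstance(name, str) or not name.strip():
            continue  # skip empty / null / malformed entries
        key = name.strip().casefold()
        if key not in seen:
            seen.add(key)
            names.append(name.strip())
    return names


def update_artist_aliases(conn: sqlite3.Connection, artist_id: Optional[int], aliases: List[str]) -> None:
    """Persist aliases as a JSON array; an empty list clears the column."""
    if artist_id is None:
        return
    value = json.dumps(aliases, ensure_ascii=False) if aliases else None
    conn.execute("UPDATE artists SET aliases = ? WHERE id = ?", (value, artist_id))


def get_artist_aliases(conn: sqlite3.Connection, artist_name: str) -> List[str]:
    """Read aliases back by artist name; any bad or missing data yields []."""
    row = conn.execute(
        "SELECT aliases FROM artists WHERE LOWER(name) = LOWER(?)", (artist_name,)
    ).fetchone()
    if not row or not row[0]:
        return []
    try:
        data = json.loads(row[0])
    except (TypeError, ValueError):
        return []  # corrupt JSON in a legacy row
    return [a for a in data if isinstance(a, str)] if isinstance(data, list) else []
```

Note that SQLite's built-in `LOWER` only folds ASCII letters, so this sketch's case-insensitive lookup is limited to Latin names.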
|
|
235ada7e0f |
Add pure artist-name comparison helper with alias awareness
Issue #442 — files tagged with one spelling of an artist's name (Japanese kanji `澤野弘之`) get quarantined when SoulSync expects the romanized spelling (`Hiroyuki Sawano`). Raw similarity comparison scored 0% across scripts. MusicBrainz exposes alternate-spelling aliases on every artist record but the verifier never consulted them.

This commit adds the pure helper that does the alias-aware comparison. No I/O, no DB access, no network. Caller supplies the aliases (looked up from library DB or live MB by later commits in this PR). Default threshold matches the verifier's existing `ARTIST_MATCH_THRESHOLD` (0.6) so wiring this in preserves current pass/fail semantics on the no-alias path.

# API
```
artist_names_match(expected, actual, *, aliases=None,
                   threshold=0.6, similarity=None)
    -> (matched: bool, best_score: float)
```
- Direct compare first (fast path + baseline score)
- If below threshold, score each alias against `actual`
- First alias to clear threshold → match
- Returns the best score across all candidates so callers can log the score they made the decision on

```
best_alias_match(expected, actual, aliases=None, *, similarity=None)
    -> (winner: Optional[str], best_score: float)
```
Companion helper for callers that want to surface WHICH alias triggered the match (debug logs, UI explanations). No threshold — purely informative.

# Architectural choices
- **Pure function**: no I/O. Caller (verifier, future matching-engine consumers) owns alias lookup strategy + threshold tuning.
- **Custom similarity callable**: lets the verifier pass its parenthetical-stripping normaliser without this module having to know about it. Defaults to lowercase + SequenceMatcher (matches the verifier's existing behaviour).
- **Defensive coercion**: aliases input handles None entries, empty strings, non-string types, sets, tuples, lists — caller may feed raw MB response data without cleaning first.
- **Backward compat**: `aliases=None` or empty → behaves identically to a plain similarity check. Paths not yet wired up to alias lookup see no behaviour change.

# Tests (28)
- Direct compare (no aliases): exact / case / whitespace / fuzzy / different
- Cross-script with aliases: Japanese ↔ romanized (reporter's case 1), Cyrillic ↔ Latin (reporter's case 2), symmetric direction, no-match fallthrough so aliases don't mask genuine mismatches
- Aliases input handling: None, empty, set, tuple, None-entries, non-string entries
- Threshold: default matches verifier's 0.6, custom stricter, custom looser
- Custom similarity: applies to both direct + alias compare
- Best-alias-match introspection
- Backward compat parametrised across 5 cases

# What this commit does NOT do
This is the helper module + tests only. Subsequent commits in this PR populate aliases (MB worker), provide live MB lookup with cache for un-enriched artists, and wire the helper into the AcoustID verifier where the quarantine decision actually fires.
|
2 weeks ago |
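A minimal sketch of the `artist_names_match` contract described in commit 235ada7e0f, written from the commit message alone. The signature and return shape follow the message. The default similarity (stripped, lowercased `SequenceMatcher`) and the exact coercion rules are assumptions.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, Optional, Tuple


def _default_similarity(a: str, b: str) -> float:
    """Assumed default: lowercase + SequenceMatcher on stripped strings."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def artist_names_match(
    expected: str,
    actual: str,
    *,
    aliases: Optional[Iterable[object]] = None,
    threshold: float = 0.6,
    similarity: Optional[Callable[[str, str], float]] = None,
) -> Tuple[bool, float]:
    """Return (matched, best_score) for expected vs actual, consulting aliases."""
    sim = similarity or _default_similarity
    best = sim(expected, actual)  # direct compare: fast path and baseline score
    if best >= threshold:
        return True, best
    for alias in aliases or ():
        # Defensive coercion: ignore None, empty strings, and non-string entries.
        if not isinstance(alias, str) or not alias.strip():
            continue
        best = max(best, sim(alias, actual))
        if best >= threshold:
            return True, best  # first alias to clear the threshold wins
    return False, best
```

For the reporter's first case, `artist_names_match("Hiroyuki Sawano", "澤野弘之", aliases=["澤野弘之"])` matches with a score of 1.0 in this sketch, while the same call without aliases does not match.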