Commit Graph

10 Commits

Author SHA1 Message Date
55088f0ccb fix: ZOPK knowledge base image display and data quality issues
Some checks are pending
NordaBiz Tests / Smoke Tests (Production) (push) Blocked by required conditions
NordaBiz Tests / Unit & Integration Tests (push) Waiting to run
NordaBiz Tests / E2E Tests (Playwright) (push) Blocked by required conditions
NordaBiz Tests / Send Failure Notification (push) Blocked by required conditions
- Fix broken news thumbnails by adding og:image extraction during content
  scraping (replaces Brave proxy URLs that block hotlinking)
- Add image onerror fallback in templates showing domain favicon when
  original image fails to load
- Decode Brave proxy image URLs to original source URLs before saving
- Enforce English-only entity types in AI extraction prompt to prevent
  mixed Polish/English type names
- Add migration 083 to normalize 14 existing Polish entity types and
  clean up 5 stale fetch jobs stuck in 'running' status
- Add backfill script for existing articles with broken image URLs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 08:57:41 +01:00
8f393fbe4a fix(zopk): Improve scraper content extraction with domain selectors and empty-match fix
Some checks are pending
NordaBiz Tests / Unit & Integration Tests (push) Waiting to run
NordaBiz Tests / E2E Tests (Playwright) (push) Blocked by required conditions
NordaBiz Tests / Smoke Tests (Production) (push) Blocked by required conditions
NordaBiz Tests / Send Failure Notification (push) Blocked by required conditions
Critical bug: CSS selector pipeline stopped at first match even if element
had 0-94 chars of text (empty <article> tags on wnp.pl, polskieradio24.pl,
portalkomunalny.pl, weekendfm.pl). Now skips elements with <200 chars text.

Added domain-specific selectors for: radiogdansk.pl (Elementor),
nadmorski24.pl (Joomla), portalkomunalny.pl, weekendfm.pl, globenergia.pl,
polskieradio24.pl.

Added 9 domains to SKIP_DOMAINS: wnp.pl (paywall), tvp.pl/tvp.info (JS SPA),
gp24.pl/strefaobrony.pl/dziennikbaltycki.pl (Cloudflare), pap.pl,
obserwatorfinansowy.pl, cire.pl (block bots).

Moved 'article' lower in default selectors to avoid matching empty tags first.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 15:58:05 +01:00
18f9f98f5d fix(zopk): Raise minimum scraped content threshold from 100 to 500 chars
Some checks are pending
NordaBiz Tests / Unit & Integration Tests (push) Waiting to run
NordaBiz Tests / E2E Tests (Playwright) (push) Blocked by required conditions
NordaBiz Tests / Smoke Tests (Production) (push) Blocked by required conditions
NordaBiz Tests / Send Failure Notification (push) Blocked by required conditions
Articles with only 100-458 chars were passing validation but contained
metadata/teasers instead of full article text, causing all knowledge
extraction to fail ("Treść za krótka do ekstrakcji"). The 500-char
minimum better aligns with the 200-token chunking requirement (~800 chars).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 15:50:21 +01:00
3c1f920675 fix(zopk): Translate remaining English messages and unify skip status
Some checks are pending
NordaBiz Tests / Unit & Integration Tests (push) Waiting to run
NordaBiz Tests / E2E Tests (Playwright) (push) Blocked by required conditions
NordaBiz Tests / Smoke Tests (Production) (push) Blocked by required conditions
NordaBiz Tests / Send Failure Notification (push) Blocked by required conditions
- Remaining scraper messages: Domain/Not HTML/Extraction error → Polish
- Embedding failures shown as skipped (yellow) instead of failed (red)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 15:44:34 +01:00
3b3bb7bdd7 fix(zopk): Polish error messages and show failures as skipped, not errors
Some checks are pending
NordaBiz Tests / Unit & Integration Tests (push) Waiting to run
NordaBiz Tests / E2E Tests (Playwright) (push) Blocked by required conditions
NordaBiz Tests / Smoke Tests (Production) (push) Blocked by required conditions
NordaBiz Tests / Send Failure Notification (push) Blocked by required conditions
Admin was confused by red "Błędy: 2" when scraping/extraction had
expected issues (403, content too short). Changes:
- All scraper/extractor messages translated to Polish
- HTTP 403/404/429 get specific descriptive messages
- Expected failures shown as yellow "Pominięte" instead of red "Błędy"
- "No chunks created" → "Treść za krótka do ekstrakcji"
- Summary label "Błędy" → "Pominięte"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 15:36:00 +01:00
081c0d7ec5 fix: Naprawiono dekodowanie URL-i Google News
Zmieniono kolejność metod dekodowania - googlenewsdecoder jest teraz
używany jako pierwsza metoda zamiast ostatniej. Poprzednia kolejność
powodowała wpadanie w pętlę z consent.google.com i wyczerpanie max_depth
przed wywołaniem działającej biblioteki.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-16 23:34:40 +01:00
37af8abc73 feat(admin): Paski postępu dla operacji AI w panelu ZOPK
Dodano Server-Sent Events (SSE) dla śledzenia postępu w czasie rzeczywistym:
- Scraping treści artykułów
- Ekstrakcja wiedzy przez Gemini AI
- Generowanie embeddingów

Funkcje:
- Modal z paskiem postępu i statystykami
- Live log operacji z kolorowaniem statusów
- Podsumowanie na zakończenie (sukces/błędy/czas)
- Możliwość zamknięcia modalu po zakończeniu

Zmiany techniczne:
- 3 nowe SSE endpointy (/stream)
- ProgressUpdate dataclass w scraperze
- Callback pattern w batch_scrape, batch_extract, generate_chunk_embeddings

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-16 23:23:05 +01:00
3e90cdbfc7 feat(scraper): Dekodowanie URL-i Google News do oryginalnych źródeł
- Dodano funkcję decode_google_news_url() z 3 metodami dekodowania:
  1. Base64 decoding (preferowana, bez HTTP request)
  2. HTTP redirect following
  3. googlenewsdecoder library jako fallback
- Scraper automatycznie dekoduje URL-e Google News przed scrapowaniem
- Zaktualizowano news.url i news.source_domain po dekodowaniu
- Dodano news.google.com do SKIP_DOMAINS (wymaga dekodowania)
- Dodano googlenewsdecoder do requirements.txt

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-16 22:12:27 +01:00
1e42c4fbd8 fix(scraper): Dodano domeny paywall do SKIP_DOMAINS
- wyborcza.pl - paywall Gazety Wyborczej
- rp.pl - paywall Rzeczpospolitej
- wykop.pl - agregator bez oryginalnej treści
- reddit.com - agregator

Te domeny zwracają cookie dialog zamiast treści artykułów

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-16 20:48:26 +01:00
1b4cd31c41 feat(zopk): Knowledge Base + NordaGPT integration (FAZY 0-3)
FAZA 0 - Web Scraping:
- Migracja 015: pola full_content, scrape_status w zopk_news
- zopk_content_scraper.py: scraper z rate limiting i selektorami

FAZA 1 - Knowledge Extraction:
- zopk_knowledge_service.py: chunking, facts, entities extraction
- Endpointy /admin/zopk/knowledge/extract

FAZA 2 - Embeddings:
- gemini_service.py: generate_embedding(), generate_embeddings_batch()
- Model text-embedding-004 (768 dimensions)

FAZA 3 - NordaGPT Integration:
- nordabiz_chat.py: _is_zopk_query(), _get_zopk_knowledge_context()
- System prompt z bazą wiedzy ZOPK
- Semantic search w kontekście chatu

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-16 20:15:30 +01:00