nordabiz

Author	SHA1	Message	Date
Maciej Pienczyn	8f393fbe4a	fix(zopk): Improve scraper content extraction with domain selectors and empty-match fix. Critical bug: the CSS selector pipeline stopped at the first match even if the element had only 0-94 chars of text (empty <article> tags on wnp.pl, polskieradio24.pl, portalkomunalny.pl, weekendfm.pl). It now skips elements with <200 chars of text. Added domain-specific selectors for: radiogdansk.pl (Elementor), nadmorski24.pl (Joomla), portalkomunalny.pl, weekendfm.pl, globenergia.pl, polskieradio24.pl. Added 9 domains to SKIP_DOMAINS: wnp.pl (paywall), tvp.pl/tvp.info (JS SPA), gp24.pl/strefaobrony.pl/dziennikbaltycki.pl (Cloudflare), pap.pl, obserwatorfinansowy.pl, cire.pl (block bots). Moved 'article' lower in the default selectors to avoid matching empty tags first. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 15:58:05 +01:00
Maciej Pienczyn	18f9f98f5d	fix(zopk): Raise minimum scraped content threshold from 100 to 500 chars. Articles with only 100-458 chars were passing validation but contained metadata/teasers instead of full article text, causing all knowledge extraction to fail ("Treść za krótka do ekstrakcji", i.e. "Content too short for extraction"). The 500-char minimum better aligns with the 200-token chunking requirement (~800 chars). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 15:50:21 +01:00
Maciej Pienczyn	3c1f920675	fix(zopk): Translate remaining English messages and unify skip status. Changes: remaining scraper messages (Domain/Not HTML/Extraction error) translated to Polish; embedding failures shown as skipped (yellow) instead of failed (red). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 15:44:34 +01:00
Maciej Pienczyn	3b3bb7bdd7	fix(zopk): Polish error messages and show failures as skipped, not errors. The admin was confused by a red "Błędy: 2" ("Errors: 2") when scraping/extraction hit expected issues (403, content too short). Changes: all scraper/extractor messages translated to Polish; HTTP 403/404/429 get specific descriptive messages; expected failures shown as yellow "Pominięte" ("Skipped") instead of red "Błędy" ("Errors"); "No chunks created" → "Treść za krótka do ekstrakcji" ("Content too short for extraction"); summary label "Błędy" → "Pominięte". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 15:36:00 +01:00
Maciej Pienczyn	081c0d7ec5	fix: Fixed Google News URL decoding. Changed the order of the decoding methods: googlenewsdecoder is now used as the first method instead of the last. The previous order caused a loop with consent.google.com and exhausted max_depth before the working library was called. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-16 23:34:40 +01:00
Maciej Pienczyn	37af8abc73	feat(admin): Progress bars for AI operations in the ZOPK panel. Added Server-Sent Events (SSE) for real-time progress tracking of: article content scraping; knowledge extraction by Gemini AI; embedding generation. Features: modal with a progress bar and statistics; live operation log with color-coded statuses; summary on completion (success/errors/time); option to close the modal after completion. Technical changes: 3 new SSE endpoints (/stream); ProgressUpdate dataclass in the scraper; callback pattern in batch_scrape, batch_extract, generate_chunk_embeddings. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-16 23:23:05 +01:00
Maciej Pienczyn	3e90cdbfc7	feat(scraper): Decode Google News URLs to their original sources. Added decode_google_news_url() with 3 decoding methods: (1) Base64 decoding (preferred, no HTTP request); (2) HTTP redirect following; (3) googlenewsdecoder library as a fallback. The scraper automatically decodes Google News URLs before scraping; news.url and news.source_domain are updated after decoding; added news.google.com to SKIP_DOMAINS (requires decoding); added googlenewsdecoder to requirements.txt. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-16 22:12:27 +01:00
Maciej Pienczyn	1e42c4fbd8	fix(scraper): Added paywall domains to SKIP_DOMAINS: wyborcza.pl (Gazeta Wyborcza paywall); rp.pl (Rzeczpospolita paywall); wykop.pl (aggregator without original content); reddit.com (aggregator). These domains return a cookie dialog instead of article content. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-16 20:48:26 +01:00
Maciej Pienczyn	1b4cd31c41	feat(zopk): Knowledge Base + NordaGPT integration (PHASES 0-3). PHASE 0 - Web Scraping: Migration 015 (full_content and scrape_status fields in zopk_news); zopk_content_scraper.py (scraper with rate limiting and selectors). PHASE 1 - Knowledge Extraction: zopk_knowledge_service.py (chunking, facts, entities extraction); endpoints /admin/zopk/knowledge/extract. PHASE 2 - Embeddings: gemini_service.py (generate_embedding(), generate_embeddings_batch()); model text-embedding-004 (768 dimensions). PHASE 3 - NordaGPT Integration: nordabiz_chat.py (_is_zopk_query(), _get_zopk_knowledge_context()); system prompt with the ZOPK knowledge base; semantic search in the chat context. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-16 20:15:30 +01:00

9 Commits
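
The empty-match fix in 8f393fbe4a and the threshold change in 18f9f98f5d can be illustrated with a minimal sketch. This is not the code from zopk_content_scraper.py: the function names `pick_content` and `is_long_enough`, the constant names, and the idea of passing pre-extracted `(selector, text)` pairs are assumptions made for illustration. Only the 200-char and 500-char values come from the commit messages.

```python
from typing import Iterable, Optional, Tuple

# Values taken from the commit messages; the constant names are hypothetical.
MIN_ELEMENT_CHARS = 200  # per-element threshold (commit 8f393fbe4a)
MIN_CONTENT_CHARS = 500  # overall content minimum (commit 18f9f98f5d)


def pick_content(candidates: Iterable[Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """Return the first (selector, text) pair with enough text, else None.

    `candidates` is an ordered sequence of (selector, extracted_text) pairs,
    in the priority order the selectors would be tried.
    """
    for selector, text in candidates:
        cleaned = " ".join(text.split())  # collapse runs of whitespace
        if len(cleaned) < MIN_ELEMENT_CHARS:
            # An empty or near-empty match (e.g. a bare <article> tag):
            # keep trying later selectors instead of stopping here.
            continue
        return selector, cleaned
    return None


def is_long_enough(text: str) -> bool:
    """Validate scraped content against the overall minimum length."""
    return len(text) >= MIN_CONTENT_CHARS


if __name__ == "__main__":
    candidates = [
        ("article", ""),                       # empty tag, skipped
        ("div.entry-content", "word " * 120),  # real content
    ]
    print(pick_content(candidates)[0])
```

The behaviour before the fix would correspond to returning on the first selector that matched at all, which is why an empty `article` element could shadow the real content container.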