
Company Search Flow

Document Version: 1.0
Last Updated: 2026-01-10
Status: Production (LIVE)
Flow Type: Company Search & Discovery


Overview

This document describes the complete company search flow for the Norda Biznes Partner application, covering:

  • User Search Interface (/search route)
  • Search Service Architecture (unified search with multiple strategies)
  • AI Chat Integration (context-aware company discovery)
  • Search Strategies:
    • NIP/REGON direct lookup
    • Synonym expansion
    • PostgreSQL Full-Text Search (FTS)
    • Fuzzy matching (pg_trgm)
    • SQLite keyword scoring fallback

Key Technology:

  • Search Engine: Custom unified SearchService
  • Database: PostgreSQL FTS with tsvector indexing
  • Fuzzy Matching: pg_trgm extension for typo tolerance
  • Synonym Expansion: Domain-specific keyword mappings
  • AI Integration: Used by NordaBiz Chat for context building

Performance Features:

  • Direct identifier lookup (NIP/REGON) bypasses full search
  • Database-level full-text search indexing
  • Synonym expansion increases recall
  • Configurable result limits (default 50)
  • Fallback mechanisms for SQLite compatibility

1. Search Flow Overview

1.1 High-Level Architecture

flowchart TD
    User[User] -->|1. Search query| UI[Search UI<br/>/search route]
    AIUser[AI Chat User] -->|1. Natural language| Chat[AI Chat<br/>/chat route]

    UI -->|2. Call| SearchSvc[Search Service<br/>search_service.py]
    Chat -->|2. Find companies| SearchSvc

    SearchSvc -->|3. Detect query type| QueryType{Query Type?}

    QueryType -->|NIP: 10 digits| NIPLookup[NIP Direct Lookup]
    QueryType -->|REGON: 9/14 digits| REGONLookup[REGON Direct Lookup]
    QueryType -->|Text query| DBCheck{Database<br/>Type?}

    DBCheck -->|PostgreSQL| PGFTS[PostgreSQL FTS<br/>+ Fuzzy Match]
    DBCheck -->|SQLite| SQLiteFallback[SQLite Keyword<br/>Scoring]

    NIPLookup -->|4. Query DB| DB[(PostgreSQL<br/>companies)]
    REGONLookup -->|4. Query DB| DB
    PGFTS -->|4. FTS query| DB
    SQLiteFallback -->|4. LIKE query| DB

    DB -->|5. Results| SearchSvc
    SearchSvc -->|6. SearchResult[]| UI
    SearchSvc -->|6. Company[]| Chat

    UI -->|7. Render| SearchResults[search_results.html]
    Chat -->|7. Build context| AIContext[AI Context Builder]

    SearchResults -->|8. Display| User
    AIContext -->|8. Generate response| AIUser

    style SearchSvc fill:#4CAF50
    style PGFTS fill:#2196F3
    style DB fill:#FF9800
    style NIPLookup fill:#9C27B0
    style REGONLookup fill:#9C27B0

2. Search Strategies

2.1 Strategy Selection Algorithm

flowchart TD
    Start([User Query]) --> Clean[Strip whitespace]
    Clean --> Empty{Empty<br/>query?}

    Empty -->|Yes| AllCompanies[Return all companies<br/>ORDER BY name]
    Empty -->|No| NIPCheck{Is NIP?<br/>10 digits}

    NIPCheck -->|Yes| NIPSearch[Direct NIP lookup<br/>WHERE nip = ?]
    NIPCheck -->|No| REGONCheck{Is REGON?<br/>9 or 14 digits}

    REGONCheck -->|Yes| REGONSearch[Direct REGON lookup<br/>WHERE regon = ?]
    REGONCheck -->|No| DBType{Database<br/>Type?}

    DBType -->|PostgreSQL| PGFlow[PostgreSQL FTS Flow]
    DBType -->|SQLite| SQLiteFlow[SQLite Keyword Flow]

    NIPSearch --> Found{Found?}
    REGONSearch --> Found

    Found -->|Yes| ReturnSingle[Return single result<br/>score=100, match_type='nip/regon']
    Found -->|No| ReturnEmpty[Return empty list]

    PGFlow --> PGSynonym[Expand synonyms]
    PGSynonym --> PGExtCheck{pg_trgm<br/>available?}

    PGExtCheck -->|Yes| FTS_Fuzzy[FTS + Fuzzy search<br/>ts_rank + similarity]
    PGExtCheck -->|No| FTS_Only[FTS only<br/>ts_rank]

    FTS_Fuzzy --> PGResults{Results?}
    FTS_Only --> PGResults

    PGResults -->|Yes| ReturnScored[Return scored results<br/>ORDER BY score DESC]
    PGResults -->|No| Fallback[Execute SQLite fallback]

    SQLiteFlow --> SQLiteSynonym[Expand synonyms]
    SQLiteSynonym --> Fallback

    Fallback --> InMemory[In-memory keyword scoring]
    InMemory --> ReturnScored

    ReturnSingle --> End(["SearchResult[]"])
    ReturnEmpty --> End
    ReturnScored --> End
    AllCompanies --> End

    style NIPSearch fill:#9C27B0
    style REGONSearch fill:#9C27B0
    style FTS_Fuzzy fill:#2196F3
    style FTS_Only fill:#2196F3
    style InMemory fill:#FF9800

2.2 Synonym Expansion

Purpose: Increase search recall by expanding user queries with domain-specific synonyms

Examples:

KEYWORD_SYNONYMS = {
    # IT / Web
    'strony': ['www', 'web', 'internet', 'witryny', 'seo', 'e-commerce', 'sklep', 'portal'],
    'aplikacje': ['software', 'programowanie', 'systemy', 'crm', 'erp', 'app'],
    'it': ['informatyka', 'komputery', 'software', 'systemy', 'serwis'],

    # Construction
    'budowa': ['budownictwo', 'konstrukcje', 'remonty', 'wykończenia', 'dach', 'elewacja'],
    'remont': ['wykończenie', 'naprawa', 'renowacja', 'modernizacja'],

    # Services
    'księgowość': ['rachunkowość', 'finanse', 'podatki', 'biuro rachunkowe', 'kadry'],
    'prawo': ['prawnik', 'adwokat', 'radca', 'kancelaria', 'notariusz'],

    # Production
    'metal': ['stal', 'obróbka', 'spawanie', 'cnc', 'ślusarstwo'],
    'drewno': ['stolarka', 'meble', 'tartak', 'carpentry'],
}

Algorithm:

  1. Tokenize user query (split on whitespace, strip punctuation)
  2. For each word:
    • Direct lookup in KEYWORD_SYNONYMS keys
    • Check if word appears in any synonym list
    • Add matching synonyms to expanded query
  3. Return unique set of keywords

Example Expansion:

Input:  "strony internetowe"
Output: ['strony', 'internetowe', 'www', 'web', 'internet', 'witryny',
         'seo', 'e-commerce', 'ecommerce', 'sklep', 'portal', 'online',
         'cyfrowe', 'marketing']

(The output also contains synonyms for 'internetowe'; that entry is not shown in the dictionary excerpt above.)
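The three-step algorithm above can be sketched as follows. The function name `expand_keywords` and the two-entry synonym map are illustrative stand-ins for the real `_expand_keywords` and the full `KEYWORD_SYNONYMS` dictionary:

```python
import re

# Hypothetical excerpt of the synonym map, for illustration only
KEYWORD_SYNONYMS = {
    'strony': ['www', 'web', 'internet', 'witryny'],
    'remont': ['wykończenie', 'naprawa', 'renowacja'],
}

def expand_keywords(query: str) -> list[str]:
    """Tokenize the query and expand it with domain synonyms (section 2.2 algorithm)."""
    expanded: list[str] = []

    def add(word: str) -> None:
        if word and word not in expanded:   # keep order, drop duplicates
            expanded.append(word)

    for raw in query.split():
        word = re.sub(r'[^\w]', '', raw).lower()   # strip punctuation
        add(word)
        # Direct lookup: word is a synonym-map key
        for syn in KEYWORD_SYNONYMS.get(word, ()):
            add(syn)
        # Reverse lookup: word appears inside some synonym list
        for key, syns in KEYWORD_SYNONYMS.items():
            if word in syns:
                add(key)
                for syn in syns:
                    add(syn)
    return expanded
```

The reverse lookup is what makes a query like "www" also surface companies tagged under the canonical key 'strony'.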

3. PostgreSQL Full-Text Search (FTS)

3.1 FTS Search Sequence

sequenceDiagram
    actor User
    participant Route as Flask Route<br/>/search
    participant SearchSvc as SearchService
    participant PG as PostgreSQL
    participant FTS as Full-Text Engine<br/>(tsvector)
    participant Trgm as pg_trgm Extension<br/>(fuzzy matching)

    User->>Route: GET /search?q=strony www
    Route->>SearchSvc: search("strony www", limit=50)

    Note over SearchSvc: Detect PostgreSQL database
    SearchSvc->>SearchSvc: _expand_keywords("strony www")
    Note over SearchSvc: Expanded: [strony, www, web, internet,<br/>witryny, seo, e-commerce, ...]

    SearchSvc->>SearchSvc: Build tsquery: "strony:* | www:* | web:* | ..."
    SearchSvc->>SearchSvc: Build ILIKE patterns: [%strony%, %www%, %web%, ...]

    SearchSvc->>PG: Check pg_trgm extension available
    PG->>SearchSvc: Extension exists

    SearchSvc->>PG: Execute FTS + Fuzzy query
    Note over PG: SELECT c.id,<br/>ts_rank(search_vector, tsquery) as fts_score,<br/>similarity(name, query) as fuzzy_score,<br/>CASE WHEN founding_history ILIKE ...<br/>FROM companies c<br/>WHERE search_vector @@ tsquery<br/>OR similarity(name, query) > 0.2<br/>OR name/description ILIKE patterns

    PG->>FTS: Match against search_vector
    FTS->>PG: FTS matches with ts_rank scores

    PG->>Trgm: Calculate similarity(name, query)
    Trgm->>PG: Fuzzy match scores (0.0-1.0)

    PG->>SearchSvc: Result rows: [(id, fts_score, fuzzy_score, history_score), ...]

    SearchSvc->>PG: Fetch full Company objects<br/>WHERE id IN (...)
    PG->>SearchSvc: Company objects

    SearchSvc->>SearchSvc: Determine match_type (fts/fuzzy/history)
    SearchSvc->>SearchSvc: Normalize scores (0-100)

    SearchSvc->>Route: SearchResult[] with companies, scores, match_types
    Route->>User: Render search_results.html

3.2 PostgreSQL FTS Implementation

File: search_service.py (lines 251-378)

Database Requirements:

  • Extension: pg_trgm (optional, enables fuzzy matching)
  • Column: companies.search_vector (tsvector, indexed)
  • Index: GIN index on search_vector for fast full-text search

SQL Query Structure (with pg_trgm):

SELECT c.id,
    COALESCE(ts_rank(c.search_vector, to_tsquery('simple', :tsquery)), 0) as fts_score,
    COALESCE(similarity(c.name, :query), 0) as fuzzy_score,
    CASE WHEN c.founding_history ILIKE ANY(:like_patterns) THEN 0.5 ELSE 0 END as history_score
FROM companies c
WHERE c.status = 'active'
AND (
    c.search_vector @@ to_tsquery('simple', :tsquery)          -- FTS match
    OR similarity(c.name, :query) > 0.2                        -- Fuzzy name match
    OR c.name ILIKE ANY(:like_patterns)                        -- Keyword in name
    OR c.description_short ILIKE ANY(:like_patterns)           -- Keyword in description
    OR c.founding_history ILIKE ANY(:like_patterns)            -- Keyword in owners/founders
    OR c.description_full ILIKE ANY(:like_patterns)            -- Keyword in full text
)
ORDER BY GREATEST(
    COALESCE(ts_rank(c.search_vector, to_tsquery('simple', :tsquery)), 0),
    COALESCE(similarity(c.name, :query), 0),
    CASE WHEN c.founding_history ILIKE ANY(:like_patterns) THEN 0.5 ELSE 0 END
) DESC
LIMIT :limit

Parameters:

  • :tsquery - Expanded keywords joined with | (OR), each with :* prefix matching
    • Example: "strony:* | www:* | web:* | internet:*"
  • :query - Original user query for fuzzy matching
  • :like_patterns - Array of ILIKE patterns for direct keyword matches
    • Example: ['%strony%', '%www%', '%web%']
  • :limit - Maximum results (default 50)
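Assembling these bound parameters is mechanical; a minimal sketch (the helper name `build_fts_params` is hypothetical, not the actual code):

```python
def build_fts_params(keywords: list[str], original_query: str, limit: int = 50) -> dict:
    """Assemble the bound parameters for the FTS query above (sketch).

    `keywords` is the synonym-expanded list; `original_query` feeds
    the pg_trgm similarity() comparison.
    """
    return {
        'tsquery': ' | '.join(f'{kw}:*' for kw in keywords),   # prefix-OR tsquery
        'query': original_query,                               # fuzzy-match input
        'like_patterns': [f'%{kw}%' for kw in keywords],       # ILIKE ANY(...) array
        'limit': limit,
    }
```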

Scoring Strategy:

  1. FTS Score: ts_rank() measures how well document matches query (0.0-1.0)
  2. Fuzzy Score: similarity() from pg_trgm measures string similarity (0.0-1.0)
  3. History Score: Fixed 0.5 bonus if founders/owners match (important for people search)
  4. Final Score: GREATEST() of all three scores, normalized to 0-100 scale

Match Types:

  • 'fts' - Full-text search match (highest ts_rank)
  • 'fuzzy' - Fuzzy string similarity match (highest similarity)
  • 'history' - Founding history match (owner/founder keywords)
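Combining the three raw scores into a final score and match type mirrors the SQL `GREATEST()` in Python; a sketch (the helper `combine_scores` is hypothetical):

```python
def combine_scores(fts_score: float, fuzzy_score: float, history_score: float):
    """Pick the winning score and match_type per the scoring strategy above.

    Inputs are the raw 0.0-1.0 values from ts_rank()/similarity() and the
    fixed 0.5 history bonus; the result is normalized to the 0-100 scale.
    """
    best, match_type = max(
        [(fts_score, 'fts'), (fuzzy_score, 'fuzzy'), (history_score, 'history')],
        key=lambda pair: pair[0],
    )
    return best * 100, match_type
```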

Fallback Behavior:

  • If pg_trgm extension not available → Uses FTS only (no fuzzy matching)
  • If FTS returns 0 results → Falls back to SQLite keyword scoring
  • If FTS query fails (exception) → Rollback transaction, use SQLite fallback

4. SQLite Keyword Scoring Fallback

4.1 Fallback Sequence

sequenceDiagram
    participant SearchSvc as SearchService
    participant DB as Database
    participant Scorer as Keyword Scorer<br/>(in-memory)

    SearchSvc->>SearchSvc: _expand_keywords(query)
    Note over SearchSvc: Keywords: [strony, www, web, ...]

    SearchSvc->>DB: SELECT * FROM companies<br/>WHERE status = 'active'
    DB->>SearchSvc: All active companies (in-memory)

    loop For each company
        SearchSvc->>Scorer: Calculate score

        Note over Scorer: Name match: +10<br/>(+5 bonus for exact match)
        Note over Scorer: Description short: +5
        Note over Scorer: Services: +8
        Note over Scorer: Competencies: +7
        Note over Scorer: City: +3
        Note over Scorer: Founding history: +12<br/>(owners/founders)
        Note over Scorer: Description full: +4

        Scorer->>SearchSvc: Total score (0+)
    end

    SearchSvc->>SearchSvc: Filter companies (score > 0)
    SearchSvc->>SearchSvc: Sort by score DESC
    SearchSvc->>SearchSvc: Limit results

    SearchSvc->>SearchSvc: Build SearchResult[]<br/>with scores and match_types

4.2 Keyword Scoring Algorithm

File: search_service.py (lines 162-249)

Scoring Weights:

{
    'name_match': 10,           # Company name contains keyword
    'exact_name_match': +5,     # Exact query appears in name (bonus)
    'description_short': 5,     # Short description contains keyword
    'services': 8,              # Service tag matches
    'competencies': 7,          # Competency tag matches
    'city': 3,                  # City/location matches
    'founding_history': 12,     # Owners/founders match (highest weight)
    'description_full': 4,      # Full description contains keyword
}

Algorithm:

  1. Fetch all active companies from database

  2. For each company, calculate score:

    score = 0
    match_type = 'keyword'
    
    # Name match (highest weight)
    if any(keyword in company.name.lower() for keyword in keywords):
        score += 10
        if original_query.lower() in company.name.lower():
            score += 5  # Exact match bonus
            match_type = 'exact'
    
    # Description match
    if any(keyword in company.description_short.lower() for keyword in keywords):
        score += 5
    
    # Services match
    if any(keyword in service.name.lower() for service in company.services for keyword in keywords):
        score += 8
    
    # Competencies match
    if any(keyword in competency.name.lower() for competency in company.competencies for keyword in keywords):
        score += 7
    
    # City match
    if any(keyword in company.city.lower() for keyword in keywords):
        score += 3
    
    # Founding history match (owners, founders)
    if any(keyword in company.founding_history.lower() for keyword in keywords):
        score += 12
    
    # Full description match
    if any(keyword in company.description_full.lower() for keyword in keywords):
        score += 4
    
  3. Filter companies with score > 0

  4. Sort by score descending

  5. Limit to requested result count

  6. Return as SearchResult[] with scores and match types

Match Types:

  • 'exact' - Original query appears exactly in company name
  • 'keyword' - One or more expanded keywords matched
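A runnable sketch of the weighted scoring above, using a hypothetical `FakeCompany` stand-in for the SQLAlchemy model and a subset of the fields (services and competencies omitted for brevity):

```python
from dataclasses import dataclass

@dataclass
class FakeCompany:
    """Hypothetical stand-in for the SQLAlchemy Company model."""
    name: str
    description_short: str = ''
    city: str = ''
    founding_history: str = ''

def score_company(company, keywords, original_query):
    """Apply the field weights from section 4.2 (subset of fields, sketch)."""
    score = 0
    match_type = 'keyword'
    if any(kw in company.name.lower() for kw in keywords):
        score += 10                                     # name match
        if original_query.lower() in company.name.lower():
            score += 5                                  # exact-name bonus
            match_type = 'exact'
    if any(kw in company.description_short.lower() for kw in keywords):
        score += 5                                      # short description
    if any(kw in company.city.lower() for kw in keywords):
        score += 3                                      # city
    if any(kw in company.founding_history.lower() for kw in keywords):
        score += 12                                     # owners/founders
    return score, match_type
```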

5. Direct Identifier Lookup

5.1 NIP Lookup Flow

sequenceDiagram
    actor User
    participant Route as /search route
    participant SearchSvc as SearchService
    participant DB as PostgreSQL

    User->>Route: GET /search?q=5882436505
    Route->>SearchSvc: search("5882436505")

    SearchSvc->>SearchSvc: _is_nip("5882436505")
    Note over SearchSvc: Regex: ^\d{10}$
    SearchSvc->>SearchSvc: Clean: remove spaces/hyphens

    SearchSvc->>DB: SELECT * FROM companies<br/>WHERE nip = '5882436505'<br/>AND status = 'active'

    alt Company found
        DB->>SearchSvc: Company object
        SearchSvc->>Route: [SearchResult(company, score=100, match_type='nip')]
        Route->>User: Display single company
    else Not found
        DB->>SearchSvc: NULL
        SearchSvc->>Route: []
        Route->>User: "Brak wyników"
    end

Implementation:

  • File: search_service.py (lines 112-131)
  • Input cleaning: Strip spaces and hyphens (e.g., "588-243-65-05" → "5882436505")
  • Validation: Must be exactly 10 digits
  • Score: Always 100.0 (perfect match)
  • Match type: 'nip'

5.2 REGON Lookup Flow

sequenceDiagram
    actor User
    participant Route as /search route
    participant SearchSvc as SearchService
    participant DB as PostgreSQL

    User->>Route: GET /search?q=220825533
    Route->>SearchSvc: search("220825533")

    SearchSvc->>SearchSvc: _is_regon("220825533")
    Note over SearchSvc: Regex: ^\d{9}$ OR ^\d{14}$
    SearchSvc->>SearchSvc: Clean: remove spaces/hyphens

    SearchSvc->>DB: SELECT * FROM companies<br/>WHERE regon = '220825533'<br/>AND status = 'active'

    alt Company found
        DB->>SearchSvc: Company object
        SearchSvc->>Route: [SearchResult(company, score=100, match_type='regon')]
        Route->>User: Display single company
    else Not found
        DB->>SearchSvc: NULL
        SearchSvc->>Route: []
        Route->>User: "Brak wyników"
    end

Implementation:

  • File: search_service.py (lines 117-142)
  • Input cleaning: Strip spaces and hyphens
  • Validation: Must be exactly 9 or 14 digits
  • Score: Always 100.0 (perfect match)
  • Match type: 'regon'
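The identifier checks described in 5.1 and 5.2 reduce to two regexes over a cleaned string; a sketch with hypothetical helper names approximating `_is_nip` / `_is_regon`:

```python
import re

def clean_identifier(query: str) -> str:
    """Strip spaces and hyphens, e.g. '588-243-65-05' -> '5882436505'."""
    return re.sub(r'[\s\-]', '', query)

def is_nip(query: str) -> bool:
    """NIP: exactly 10 digits after cleaning."""
    return bool(re.fullmatch(r'\d{10}', clean_identifier(query)))

def is_regon(query: str) -> bool:
    """REGON: exactly 9 or 14 digits after cleaning."""
    return bool(re.fullmatch(r'\d{9}|\d{14}', clean_identifier(query)))
```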

6. User Search Interface

6.1 Search Route Flow

sequenceDiagram
    actor User
    participant Browser
    participant Flask as Flask App<br/>(app.py /search)
    participant SearchSvc as SearchService
    participant DB as PostgreSQL
    participant Template as search_results.html

    User->>Browser: Navigate to /search
    Browser->>Flask: GET /search?q=strony+www&category=1

    Note over Flask: @login_required<br/>User must be authenticated

    Flask->>Flask: Parse query params<br/>q = "strony www"<br/>category = 1

    Flask->>SearchSvc: search_companies(db, "strony www", category_id=1, limit=50)
    SearchSvc->>SearchSvc: Execute search strategy<br/>(NIP/REGON/FTS/Fallback)
    SearchSvc->>DB: Query companies
    DB->>SearchSvc: Results
    SearchSvc->>Flask: List[SearchResult]

    Flask->>Flask: Extract companies from results<br/>companies = [r.company for r in results]

    Flask->>Flask: Log search analytics<br/>logger.info(f"Search '{query}': {len} results, types: {match_types}")

    Flask->>Template: render_template('search_results.html',<br/>companies=companies,<br/>query=query,<br/>category_id=category_id,<br/>result_count=len)

    Template->>Browser: HTML response
    Browser->>User: Display search results

Route Details:

  • Path: /search
  • Method: GET
  • Authentication: Required (@login_required)
  • File: app.py (lines 718-748)

Query Parameters:

  • q (string, optional) - Search query
  • category (integer, optional) - Category filter (category_id)

Response:

  • Template: search_results.html
  • Context Variables:
    • companies - List of Company objects
    • query - Original search query
    • category_id - Selected category filter
    • result_count - Number of results

Analytics Logging:

if query:
    match_types = {}
    for r in results:
        match_types[r.match_type] = match_types.get(r.match_type, 0) + 1
    logger.info(f"Search '{query}': {len(companies)} results, types: {match_types}")

Example log output:

Search 'strony www': 12 results, types: {'fts': 8, 'fuzzy': 3, 'exact': 1}
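The match-type aggregation behind that log line can be sketched as follows (the `Result` dataclass is an illustrative stand-in for SearchResult):

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Result:
    """Minimal stand-in for SearchResult, for illustration only."""
    match_type: str

def summarize_match_types(results) -> dict:
    """Aggregate match types the way the /search analytics log line does."""
    return dict(Counter(r.match_type for r in results))
```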

7. AI Chat Integration

7.1 AI Chat Search Flow

sequenceDiagram
    actor User
    participant Chat as AI Chat Interface<br/>/chat
    participant ChatSvc as NordaBizChatService<br/>nordabiz_chat.py
    participant SearchSvc as SearchService
    participant DB as PostgreSQL
    participant Gemini as Google Gemini API

    User->>Chat: POST /chat/send<br/>"Szukam firm do stron www"
    Chat->>ChatSvc: send_message(user_message, conversation_id)

    ChatSvc->>ChatSvc: _find_relevant_companies(db, message)
    Note over ChatSvc: Extract search keywords from message

    ChatSvc->>SearchSvc: search_companies(db, message, limit=10)
    Note over SearchSvc: Use same search strategies<br/>(NIP/REGON/FTS/Fallback)

    SearchSvc->>DB: Query companies
    DB->>SearchSvc: Results
    SearchSvc->>ChatSvc: List[SearchResult] (max 10)

    ChatSvc->>ChatSvc: Extract companies from results<br/>companies = [r.company for r in results]

    ChatSvc->>ChatSvc: _build_conversation_context(db, user, conversation, companies)
    Note over ChatSvc: Limit to 8 companies (prevent context overflow)<br/>Include last 10 messages for history

    ChatSvc->>ChatSvc: _company_to_compact_dict(company)
    Note over ChatSvc: Compress company data<br/>(name, desc, services, competencies, etc)

    ChatSvc->>Gemini: POST /generateContent<br/>System prompt + context + user message
    Note over Gemini: Model: gemini-3-flash-preview<br/>Max tokens: 2048

    Gemini->>ChatSvc: AI response text

    ChatSvc->>DB: Save conversation messages<br/>(user message + AI response)
    ChatSvc->>DB: Track API costs<br/>(gemini_cost_tracking)

    ChatSvc->>Chat: AI response with company recommendations
    Chat->>User: Display chat response

Key Differences from User Search:

  1. Result Limit: 10 companies (vs 50 for user search)
  2. Company Limit to AI: 8 companies max (prevents context overflow)
  3. Context Building: Companies converted to compact JSON format
  4. Integration: Seamless - AI doesn't know about search internals
  5. Message History: Last 10 messages included in context
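Limits 1, 2, and 5 above amount to simple slicing; a sketch with a hypothetical `build_chat_context` helper (not the actual `_build_conversation_context` signature):

```python
def build_chat_context(companies: list, messages: list,
                       max_companies: int = 8, max_messages: int = 10) -> dict:
    """Apply the AI-chat context limits listed above (sketch)."""
    return {
        'companies': companies[:max_companies],   # cap to avoid context overflow
        'history': messages[-max_messages:],      # most recent messages only
    }
```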

Implementation:

  • File: nordabiz_chat.py (lines 383-405)
  • Search Call:
    results = search_companies(db, message, limit=10)
    companies = [result.company for result in results]
    return companies
    

Company Data Compression:

compact = {
    'name': company.name,
    'cat': company.category.name,
    'desc': company.description_short,
    'history': company.founding_history,  # Owners, founders
    'svc': [service.name for service in company.services],
    'comp': [competency.name for competency in company.competencies],
    'web': company.website,
    'tel': company.phone,
    'mail': company.email,
    'city': company.address_city,
    'year': company.year_established,
    'cert': [cert.name for cert in company.certifications[:3]]
}

AI System Prompt (includes search context):

Jesteś asystentem bazy firm Norda Biznes z Wejherowa.
Odpowiadaj zwięźle, konkretnie, po polsku.

Oto firmy które mogą być istotne dla pytania użytkownika:
{companies_json}

Historia rozmowy:
{recent_messages}

Odpowiedz na pytanie użytkownika bazując na powyższych danych.

8. Performance Considerations

8.1 Database Indexing

Required Indexes:

-- Full-text search index (PostgreSQL)
CREATE INDEX idx_companies_search_vector ON companies USING gin(search_vector);

-- NIP lookup index
CREATE UNIQUE INDEX idx_companies_nip ON companies(nip) WHERE status = 'active';

-- REGON lookup index
CREATE INDEX idx_companies_regon ON companies(regon) WHERE status = 'active';

-- Status filter index
CREATE INDEX idx_companies_status ON companies(status);

-- Category filter index
CREATE INDEX idx_companies_category ON companies(category_id) WHERE status = 'active';

-- pg_trgm index for fuzzy matching (optional)
CREATE INDEX idx_companies_name_trgm ON companies USING gin(name gin_trgm_ops);

8.2 Search Vector Maintenance

Automatic Updates:

-- Trigger to update search_vector on INSERT/UPDATE
CREATE TRIGGER companies_search_vector_update
BEFORE INSERT OR UPDATE ON companies
FOR EACH ROW EXECUTE FUNCTION
tsvector_update_trigger(
    search_vector, 'pg_catalog.simple',
    name, description_short, description_full, founding_history
);

Manual Rebuild:

-- Rebuild all search vectors
UPDATE companies SET search_vector =
    setweight(to_tsvector('simple', COALESCE(name, '')), 'A') ||
    setweight(to_tsvector('simple', COALESCE(description_short, '')), 'B') ||
    setweight(to_tsvector('simple', COALESCE(description_full, '')), 'C') ||
    setweight(to_tsvector('simple', COALESCE(founding_history, '')), 'B');

8.3 Query Performance

Performance Targets:

  • NIP/REGON lookup: < 10ms (indexed)
  • PostgreSQL FTS: < 100ms (typical)
  • SQLite fallback: < 500ms (in-memory scoring)
  • AI Chat search: < 200ms (limit 10 results)

Optimization Strategies:

  1. Early Exit: NIP/REGON lookup bypasses full search
  2. Result Limiting: Default 50 results (10 for AI chat)
  3. Category Filtering: Reduces search space
  4. Synonym Pre-expansion: Computed once, reused in all clauses
  5. Score-based Ordering: Database-level sorting (not in-memory)

8.4 Fallback Performance

PostgreSQL → SQLite Fallback Triggers:

  1. FTS query returns 0 results
  2. FTS query throws exception (syntax error, missing extension)
  3. pg_trgm extension not available (degrades to FTS-only, not full fallback)

SQLite Fallback Cost:

  • Fetches ALL active companies into memory
  • Scores each company in Python (slower than SQL)
  • Suitable for development/testing; not recommended in production once the catalog exceeds roughly 100 companies

Monitoring:

# Logged in app.py when search executes
logger.info(f"Search '{query}': {len(companies)} results, types: {match_types}")

# Example outputs:
# Search 'strony www': 12 results, types: {'fts': 8, 'fuzzy': 4}
# Search '5882436505': 1 results, types: {'nip': 1}
# Search 'PIXLAB': 1 results, types: {'exact': 1}

9. Search Result Structure

9.1 SearchResult Dataclass

File: search_service.py (lines 20-25)

@dataclass
class SearchResult:
    """Search result with score and match info"""
    company: Company          # Full Company SQLAlchemy object
    score: float              # Relevance score (0.0-100.0)
    match_type: str           # Match type identifier

Match Types:

| Match Type | Description | Score Range |
|------------|-------------|-------------|
| 'nip' | Direct NIP match | 100.0 (fixed) |
| 'regon' | Direct REGON match | 100.0 (fixed) |
| 'exact' | Exact name match (SQLite) | Variable (usually high) |
| 'fts' | PostgreSQL full-text search | 0.0-100.0 (normalized ts_rank) |
| 'fuzzy' | PostgreSQL fuzzy similarity | 0.0-100.0 (normalized similarity) |
| 'history' | Founding history match | 50.0 (fixed bonus) |
| 'keyword' | SQLite keyword scoring | Variable (weighted sum) |
| 'all' | All companies (no filter) | 0.0 (no relevance) |

9.2 Score Normalization

PostgreSQL FTS Scores:

# ts_rank returns 0.0-1.0, normalize to 0-100
fts_score = ts_rank(...) * 100

# similarity returns 0.0-1.0, normalize to 0-100
fuzzy_score = similarity(...) * 100

# history match is a fixed bonus
history_score = 0.5 * 100  # = 50.0

SQLite Keyword Scores:

# Sum of all matching field weights
score = (
    10  # name match
    + 5   # exact match bonus
    + 5   # description_short
    + 8   # services
    + 7   # competencies
    + 3   # city
    + 12  # founding_history
    + 4   # description_full
)
# Maximum possible: 54 points
# Typical: 10-30 points

10. Error Handling & Edge Cases

10.1 PostgreSQL FTS Error Handling

Error Scenarios:

  1. Invalid tsquery syntax - Fallback to SQLite
  2. pg_trgm extension missing - Degrade to FTS-only (no fuzzy)
  3. search_vector column missing - Exception, fallback to SQLite
  4. Database connection error - Propagate exception to route

Implementation:

try:
    result = self.db.execute(sql, params)
    rows = result.fetchall()
    # ... process results
except Exception as e:
    print(f"PostgreSQL FTS error: {e}, falling back to keyword search")
    self.db.rollback()  # CRITICAL: prevent InFailedSqlTransaction
    return self._search_sqlite_fallback(query, category_id, limit)

Critical: db.rollback() is essential before fallback to prevent transaction state errors.

10.2 Empty Results Handling

No Results Scenarios:

  1. NIP/REGON not found - Return empty list []
  2. FTS returns 0 matches - Automatic fallback to SQLite scoring
  3. SQLite scoring returns 0 matches - Return empty list []
  4. Empty query - Return all active companies (ordered by name)

User Interface:

{% if result_count == 0 %}
    <div class="alert alert-info">
        Brak wyników dla zapytania "{{ query }}".
        Spróbuj innych słów kluczowych lub usuń filtry.
    </div>
{% endif %}

10.3 Special Characters & Sanitization

Query Cleaning:

query = query.strip()  # Remove leading/trailing whitespace
clean_nip = re.sub(r'[\s\-]', '', query)  # Remove spaces and hyphens from NIP/REGON

SQL Injection Prevention:

  • All queries use SQLAlchemy parameter binding (:param syntax)
  • No raw string concatenation in SQL
  • ILIKE patterns are passed as array parameters

XSS Prevention:

  • All user input sanitized before display (handled by Jinja2 auto-escaping)
  • Query string displayed in template: {{ query }} (auto-escaped)

11. Testing & Verification

11.1 Test Queries

NIP Lookup:

Query: "5882436505"
Expected: PIXLAB Sp. z o.o. (single result, score=100, match_type='nip')

REGON Lookup:

Query: "220825533"
Expected: Single company with matching REGON (score=100, match_type='regon')

Keyword Search (PostgreSQL FTS):

Query: "strony internetowe"
Expected: Multiple results (IT/Web companies, match_type='fts' or 'fuzzy')
Keywords expanded to: [strony, internetowe, www, web, internet, witryny, seo, ...]

Exact Name Match:

Query: "PIXLAB"
Expected: PIXLAB at top (high score, match_type='exact' or 'fts')

Owner/Founder Search:

Query: "Jan Kowalski"  (example founder name)
Expected: Companies where Jan Kowalski appears in founding_history
Match type: 'history' or high score from founding_history match

Category Filter:

Query: "strony" + category=1 (IT)
Expected: Only IT category companies matching "strony"

Empty Query:

Query: ""
Expected: All active companies, alphabetically sorted

11.2 Performance Testing

Load Testing Scenarios:

# Test 1: Direct lookup performance
for nip in all_nips:
    results = search_companies(db, nip)
    assert len(results) == 1
    assert results[0].match_type == 'nip'

# Test 2: Full-text search performance
queries = ["strony", "budowa", "księgowość", "metal", "transport"]
for query in queries:
    start = time.time()
    results = search_companies(db, query)
    elapsed = time.time() - start
    assert elapsed < 0.1  # < 100ms
    print(f"{query}: {len(results)} results in {elapsed*1000:.1f}ms")

# Test 3: Fallback trigger test (simulate FTS failure)
# Force SQLite fallback by using invalid tsquery syntax
results = search_companies(db, "test:query|with:invalid&syntax")
# Should not crash, should return results via fallback

11.3 Search Quality Metrics

Relevance Testing:

test_cases = [
    {
        'query': 'strony www',
        'expected_top_3': ['PIXLAB', 'Web Agency', 'IT Solutions'],
        'min_results': 5
    },
    {
        'query': 'budownictwo',
        'expected_categories': ['Construction'],
        'min_results': 3
    },
    # ... more test cases
]

for test in test_cases:
    results = search_companies(db, test['query'])
    assert len(results) >= test['min_results']
    # Check if expected companies appear in top results
    top_names = [r.company.name for r in results[:3]]
    for expected in test['expected_top_3']:
        assert expected in top_names

12. Maintenance & Monitoring

12.1 Database Maintenance

Weekly Tasks:

-- Rebuild search vectors (if data quality issues)
UPDATE companies SET search_vector =
    setweight(to_tsvector('simple', COALESCE(name, '')), 'A') ||
    setweight(to_tsvector('simple', COALESCE(description_short, '')), 'B') ||
    setweight(to_tsvector('simple', COALESCE(description_full, '')), 'C') ||
    setweight(to_tsvector('simple', COALESCE(founding_history, '')), 'B')
WHERE updated_at > NOW() - INTERVAL '7 days';

-- Verify index health
SELECT schemaname, tablename, indexname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE tablename = 'companies'
ORDER BY idx_scan DESC;

-- Check for missing indexes
SELECT indexname, indexdef FROM pg_indexes
WHERE tablename = 'companies';

Monthly Tasks:

-- Vacuum and analyze for performance
VACUUM ANALYZE companies;

-- Check for slow queries
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
WHERE query LIKE '%companies%search_vector%'
ORDER BY mean_exec_time DESC
LIMIT 10;

12.2 Search Analytics

Logging Search Patterns:

# Already implemented in app.py /search route
logger.info(f"Search '{query}': {len(companies)} results, types: {match_types}")

Analytics Queries:

-- Top search queries (requires search_logs table - not yet implemented)
SELECT query, COUNT(*) as frequency
FROM search_logs
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY query
ORDER BY frequency DESC
LIMIT 20;

-- Zero-result searches (requires logging)
SELECT query, COUNT(*) as frequency
FROM search_logs
WHERE result_count = 0
AND created_at > NOW() - INTERVAL '30 days'
GROUP BY query
ORDER BY frequency DESC
LIMIT 10;

12.3 Synonym Expansion Tuning

Adding New Synonyms:

# Edit search_service.py KEYWORD_SYNONYMS dictionary
KEYWORD_SYNONYMS = {
    # Add new industry-specific terms
    'cyberbezpieczeństwo': ['security', 'ochrona', 'firewall', 'antywirus'],
    # ... more synonyms
}

Synonym Effectiveness Testing:

# Test query with and without synonym expansion
query = "cyberbezpieczeństwo"

# With expansion
results_with = search_companies(db, query)
print(f"With synonyms: {len(results_with)} results")

# Without expansion (mock)
# ... compare recall/precision

13. Future Enhancements

13.1 Planned Improvements

  1. Search Result Ranking ML Model

    • Learn from user click-through rates
    • Personalized ranking based on user preferences
    • A/B testing of ranking algorithms
  2. Search Autocomplete

    • Suggest company names as user types
    • Suggest common search queries
    • Category-based suggestions
  3. Advanced Filters

    • Location-based search (radius from city)
    • Certification filters (ISO, other)
    • Founding year range
    • Employee count range (if available)
  4. Search Analytics Dashboard

    • Top queries (daily/weekly/monthly)
    • Zero-result queries (opportunities for content)
    • Average result count per query
    • Match type distribution
    • Click-through rates by position
  5. Semantic Search

    • Integrate sentence embeddings (sentence-transformers)
    • Vector similarity search for related companies
    • "More like this" company recommendations
  6. Multi-language Support

    • English query translation
    • German query support (for border region)
    • Auto-detect query language

13.2 Performance Optimization Ideas

  1. Query Result Caching

    • Redis cache for common queries (TTL 5 minutes)
    • Cache key: search:{query}:{category_id}
    • Invalidate on company data updates
  2. Partial Index Optimization

    -- Index only active companies
    CREATE INDEX idx_companies_active_search
    ON companies USING gin(search_vector)
    WHERE status = 'active';
    
  3. Materialized View for Search

    -- Pre-compute search data
    CREATE MATERIALIZED VIEW search_companies_mv AS
    SELECT id, name, search_vector, category_id, status, ...
    FROM companies
    WHERE status = 'active';
    
    -- Refresh daily
    REFRESH MATERIALIZED VIEW search_companies_mv;
    
  4. Connection Pooling

    • Already implemented via SQLAlchemy
    • Monitor pool size and overflow
    • Adjust pool_size/max_overflow if needed
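Idea 1 (query result caching) can be prototyped without Redis; a sketch with an in-process TTL cache using the proposed key format (lowercasing and stripping the query inside the key is an assumption, not part of the proposal):

```python
import time

def cache_key(query: str, category_id) -> str:
    """Build the proposed cache key format: search:{query}:{category_id}."""
    return f"search:{query.strip().lower()}:{category_id}"

class TTLCache:
    """Tiny in-process stand-in for the proposed Redis cache (default 5-minute TTL)."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store: dict = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.monotonic() > expires:   # lazy expiry on read
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)
```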


14. Glossary

| Term | Description |
|------|-------------|
| FTS | Full-Text Search - PostgreSQL text search engine using tsvector |
| tsvector | PostgreSQL data type for full-text search, stores preprocessed text |
| tsquery | PostgreSQL query syntax for full-text search (e.g., "word1 \| word2") |
| ts_rank | PostgreSQL function to score FTS relevance (0.0-1.0) |
| pg_trgm | PostgreSQL extension for trigram-based fuzzy string matching |
| similarity() | pg_trgm function to measure string similarity (0.0-1.0) |
| Synonym Expansion | Expanding the user query with related keywords (e.g., "strony" → "www, web, internet") |
| SearchResult | Dataclass containing Company, score, and match_type |
| Match Type | Identifier for how a company was matched (nip, regon, fts, fuzzy, keyword, etc.) |
| NIP | Polish tax identification number (10 digits) |
| REGON | Polish business registry number (9 or 14 digits) |
| Fallback | Alternative search method when the primary method fails (PostgreSQL FTS → SQLite keyword scoring) |
| SearchService | Unified search service class (search_service.py) |
| Keyword Scoring | In-memory scoring algorithm for the SQLite fallback |

Document Metadata

Created: 2026-01-10
Author: Architecture Documentation (auto-claude)
Related Files:

  • search_service.py (main implementation)
  • app.py (lines 718-748, /search route)
  • nordabiz_chat.py (lines 383-405, AI integration)
  • database.py (Company model)

Version History:

  • v1.0 (2026-01-10) - Initial documentation

End of Document