nordabiz/docs/architecture/flows/02-search-flow.md
# Company Search Flow
**Document Version:** 1.0
**Last Updated:** 2026-01-10
**Status:** Production LIVE
**Flow Type:** Company Search & Discovery
---
## Overview
This document describes the **complete company search flow** for the Norda Biznes Partner application, covering:
- **User Search Interface** (`/search` route)
- **Search Service Architecture** (unified search with multiple strategies)
- **AI Chat Integration** (context-aware company discovery)
- **Search Strategies:**
  - NIP/REGON direct lookup
  - Synonym expansion
  - PostgreSQL Full-Text Search (FTS)
  - Fuzzy matching (pg_trgm)
  - SQLite keyword scoring fallback
**Key Technology:**
- **Search Engine:** Custom unified SearchService
- **Database:** PostgreSQL FTS with tsvector indexing
- **Fuzzy Matching:** pg_trgm extension for typo tolerance
- **Synonym Expansion:** Domain-specific keyword mappings
- **AI Integration:** Used by NordaBiz Chat for context building
**Performance Features:**
- Direct identifier lookup (NIP/REGON) bypasses full search
- Database-level full-text search indexing
- Synonym expansion increases recall
- Configurable result limits (default 50)
- Fallback mechanisms for SQLite compatibility
---
## 1. Search Flow Overview
### 1.1 High-Level Architecture
```mermaid
flowchart TD
User[User] -->|1. Search query| UI[Search UI<br/>/search route]
AIUser[AI Chat User] -->|1. Natural language| Chat[AI Chat<br/>/chat route]
UI -->|2. Call| SearchSvc[Search Service<br/>search_service.py]
Chat -->|2. Find companies| SearchSvc
SearchSvc -->|3. Detect query type| QueryType{Query Type?}
QueryType -->|NIP: 10 digits| NIPLookup[NIP Direct Lookup]
QueryType -->|REGON: 9/14 digits| REGONLookup[REGON Direct Lookup]
QueryType -->|Text query| DBCheck{Database<br/>Type?}
DBCheck -->|PostgreSQL| PGFTS[PostgreSQL FTS<br/>+ Fuzzy Match]
DBCheck -->|SQLite| SQLiteFallback[SQLite Keyword<br/>Scoring]
NIPLookup -->|4. Query DB| DB[(PostgreSQL<br/>companies)]
REGONLookup -->|4. Query DB| DB
PGFTS -->|4. FTS query| DB
SQLiteFallback -->|4. LIKE query| DB
DB -->|5. Results| SearchSvc
SearchSvc -->|"6. SearchResult[]"| UI
SearchSvc -->|"6. Company[]"| Chat
UI -->|7. Render| SearchResults[search_results.html]
Chat -->|7. Build context| AIContext[AI Context Builder]
SearchResults -->|8. Display| User
AIContext -->|8. Generate response| AIUser
style SearchSvc fill:#4CAF50
style PGFTS fill:#2196F3
style DB fill:#FF9800
style NIPLookup fill:#9C27B0
style REGONLookup fill:#9C27B0
```
---
## 2. Search Strategies
### 2.1 Strategy Selection Algorithm
```mermaid
flowchart TD
Start([User Query]) --> Clean[Strip whitespace]
Clean --> Empty{Empty<br/>query?}
Empty -->|Yes| AllCompanies[Return all companies<br/>ORDER BY name]
Empty -->|No| NIPCheck{Is NIP?<br/>10 digits}
NIPCheck -->|Yes| NIPSearch[Direct NIP lookup<br/>WHERE nip = ?]
NIPCheck -->|No| REGONCheck{Is REGON?<br/>9 or 14 digits}
REGONCheck -->|Yes| REGONSearch[Direct REGON lookup<br/>WHERE regon = ?]
REGONCheck -->|No| DBType{Database<br/>Type?}
DBType -->|PostgreSQL| PGFlow[PostgreSQL FTS Flow]
DBType -->|SQLite| SQLiteFlow[SQLite Keyword Flow]
NIPSearch --> Found{Found?}
REGONSearch --> Found
Found -->|Yes| ReturnSingle[Return single result<br/>score=100, match_type='nip/regon']
Found -->|No| ReturnEmpty[Return empty list]
PGFlow --> PGSynonym[Expand synonyms]
PGSynonym --> PGExtCheck{pg_trgm<br/>available?}
PGExtCheck -->|Yes| FTS_Fuzzy[FTS + Fuzzy search<br/>ts_rank + similarity]
PGExtCheck -->|No| FTS_Only[FTS only<br/>ts_rank]
FTS_Fuzzy --> PGResults{Results?}
FTS_Only --> PGResults
PGResults -->|Yes| ReturnScored[Return scored results<br/>ORDER BY score DESC]
PGResults -->|No| Fallback[Execute SQLite fallback]
SQLiteFlow --> SQLiteSynonym[Expand synonyms]
SQLiteSynonym --> Fallback
Fallback --> InMemory[In-memory keyword scoring]
InMemory --> ReturnScored
ReturnSingle --> End(["SearchResult[]"])
ReturnEmpty --> End
ReturnScored --> End
AllCompanies --> End
style NIPSearch fill:#9C27B0
style REGONSearch fill:#9C27B0
style FTS_Fuzzy fill:#2196F3
style FTS_Only fill:#2196F3
style InMemory fill:#FF9800
```
### 2.2 Synonym Expansion
**Purpose:** Increase search recall by expanding user queries with domain-specific synonyms
**Examples:**
```python
KEYWORD_SYNONYMS = {
    # IT / Web
    'strony': ['www', 'web', 'internet', 'witryny', 'seo', 'e-commerce', 'sklep', 'portal'],
    'aplikacje': ['software', 'programowanie', 'systemy', 'crm', 'erp', 'app'],
    'it': ['informatyka', 'komputery', 'software', 'systemy', 'serwis'],
    # Construction
    'budowa': ['budownictwo', 'konstrukcje', 'remonty', 'wykończenia', 'dach', 'elewacja'],
    'remont': ['wykończenie', 'naprawa', 'renowacja', 'modernizacja'],
    # Services
    'księgowość': ['rachunkowość', 'finanse', 'podatki', 'biuro rachunkowe', 'kadry'],
    'prawo': ['prawnik', 'adwokat', 'radca', 'kancelaria', 'notariusz'],
    # Production
    'metal': ['stal', 'obróbka', 'spawanie', 'cnc', 'ślusarstwo'],
    'drewno': ['stolarka', 'meble', 'tartak', 'carpentry'],
}
```
**Algorithm:**
1. Tokenize user query (split on whitespace, strip punctuation)
2. For each word:
   - Direct lookup in KEYWORD_SYNONYMS keys
   - Check if word appears in any synonym list
   - Add matching synonyms to expanded query
3. Return unique set of keywords
**Example Expansion:**
```
Input: "strony internetowe"
Output: ['strony', 'internetowe', 'www', 'web', 'internet', 'witryny',
'seo', 'e-commerce', 'ecommerce', 'sklep', 'portal', 'online',
'cyfrowe', 'marketing']
```
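The three steps above can be sketched as a small helper. This is an illustrative reimplementation, not the literal `_expand_keywords` from `search_service.py`, and it uses a trimmed-down synonym dictionary:

```python
import re

# Trimmed-down stand-in for the full KEYWORD_SYNONYMS dict in search_service.py
KEYWORD_SYNONYMS = {
    'strony': ['www', 'web', 'internet', 'witryny'],
    'remont': ['wykończenie', 'naprawa', 'renowacja'],
}

def expand_keywords(query: str) -> list[str]:
    """Tokenize the query and add domain-specific synonyms."""
    # Step 1: split on whitespace, strip punctuation
    words = [re.sub(r'[^\w]', '', w.lower()) for w in query.split()]
    expanded = set(w for w in words if w)
    # Step 2: direct and reverse synonym lookup for each word
    for word in list(expanded):
        expanded.update(KEYWORD_SYNONYMS.get(word, []))  # word is a key
        for key, synonyms in KEYWORD_SYNONYMS.items():
            if word in synonyms:                          # word is a synonym
                expanded.add(key)
                expanded.update(synonyms)
    # Step 3: return the unique keyword set
    return sorted(expanded)

print(expand_keywords("strony internetowe"))
```

The reverse lookup means a query like `"www"` also pulls in `'strony'` and its sibling synonyms, which is what makes recall symmetric.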
---
## 3. PostgreSQL Full-Text Search (FTS)
### 3.1 FTS Search Sequence
```mermaid
sequenceDiagram
actor User
participant Route as Flask Route<br/>/search
participant SearchSvc as SearchService
participant PG as PostgreSQL
participant FTS as Full-Text Engine<br/>(tsvector)
participant Trgm as pg_trgm Extension<br/>(fuzzy matching)
User->>Route: GET /search?q=strony www
Route->>SearchSvc: search("strony www", limit=50)
Note over SearchSvc: Detect PostgreSQL database
SearchSvc->>SearchSvc: _expand_keywords("strony www")
Note over SearchSvc: Expanded: [strony, www, web, internet,<br/>witryny, seo, e-commerce, ...]
SearchSvc->>SearchSvc: Build tsquery: "strony:* | www:* | web:* | ..."
SearchSvc->>SearchSvc: Build ILIKE patterns: [%strony%, %www%, %web%, ...]
SearchSvc->>PG: Check pg_trgm extension available
PG->>SearchSvc: Extension exists
SearchSvc->>PG: Execute FTS + Fuzzy query
Note over PG: SELECT c.id,<br/>ts_rank(search_vector, tsquery) as fts_score,<br/>similarity(name, query) as fuzzy_score,<br/>CASE WHEN founding_history ILIKE ...<br/>FROM companies c<br/>WHERE search_vector @@ tsquery<br/>OR similarity(name, query) > 0.2<br/>OR name/description ILIKE patterns
PG->>FTS: Match against search_vector
FTS->>PG: FTS matches with ts_rank scores
PG->>Trgm: Calculate similarity(name, query)
Trgm->>PG: Fuzzy match scores (0.0-1.0)
PG->>SearchSvc: Result rows: [(id, fts_score, fuzzy_score, history_score), ...]
SearchSvc->>PG: Fetch full Company objects<br/>WHERE id IN (...)
PG->>SearchSvc: Company objects
SearchSvc->>SearchSvc: Determine match_type (fts/fuzzy/history)
SearchSvc->>SearchSvc: Normalize scores (0-100)
SearchSvc->>Route: SearchResult[] with companies, scores, match_types
Route->>User: Render search_results.html
```
### 3.2 PostgreSQL FTS Implementation
**File:** `search_service.py` (lines 251-378)
**Database Requirements:**
- **Extension:** `pg_trgm` (optional, enables fuzzy matching)
- **Column:** `companies.search_vector` (tsvector, indexed)
- **Index:** GIN index on `search_vector` for fast full-text search
**SQL Query Structure (with pg_trgm):**
```sql
SELECT c.id,
       COALESCE(ts_rank(c.search_vector, to_tsquery('simple', :tsquery)), 0) AS fts_score,
       COALESCE(similarity(c.name, :query), 0) AS fuzzy_score,
       CASE WHEN c.founding_history ILIKE ANY(:like_patterns) THEN 0.5 ELSE 0 END AS history_score
FROM companies c
WHERE c.status = 'active'
  AND (
    c.search_vector @@ to_tsquery('simple', :tsquery)   -- FTS match
    OR similarity(c.name, :query) > 0.2                 -- Fuzzy name match
    OR c.name ILIKE ANY(:like_patterns)                 -- Keyword in name
    OR c.description_short ILIKE ANY(:like_patterns)    -- Keyword in description
    OR c.founding_history ILIKE ANY(:like_patterns)     -- Keyword in owners/founders
    OR c.description_full ILIKE ANY(:like_patterns)     -- Keyword in full text
  )
ORDER BY GREATEST(
    COALESCE(ts_rank(c.search_vector, to_tsquery('simple', :tsquery)), 0),
    COALESCE(similarity(c.name, :query), 0),
    CASE WHEN c.founding_history ILIKE ANY(:like_patterns) THEN 0.5 ELSE 0 END
) DESC
LIMIT :limit
```
**Parameters:**
- `:tsquery` - Expanded keywords joined with `|` (OR), each with `:*` prefix matching
  - Example: `"strony:* | www:* | web:* | internet:*"`
- `:query` - Original user query for fuzzy matching
- `:like_patterns` - Array of ILIKE patterns for direct keyword matches
  - Example: `['%strony%', '%www%', '%web%']`
- `:limit` - Maximum results (default 50)
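A minimal sketch of how these bound parameters can be assembled from the expanded keyword list (the helper name `build_fts_params` is illustrative, not from the source):

```python
def build_fts_params(keywords, query, limit=50):
    """Assemble the bound parameters for the FTS query above."""
    return {
        'tsquery': ' | '.join(f'{kw}:*' for kw in keywords),  # prefix matching, OR'ed
        'query': query,                                       # raw query for similarity()
        'like_patterns': [f'%{kw}%' for kw in keywords],      # for ILIKE ANY(...)
        'limit': limit,
    }

params = build_fts_params(['strony', 'www', 'web'], 'strony www')
print(params['tsquery'])        # strony:* | www:* | web:*
print(params['like_patterns'])  # ['%strony%', '%www%', '%web%']
```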
**Scoring Strategy:**
1. **FTS Score:** `ts_rank()` measures how well document matches query (0.0-1.0)
2. **Fuzzy Score:** `similarity()` from pg_trgm measures string similarity (0.0-1.0)
3. **History Score:** Fixed 0.5 bonus if founders/owners match (important for people search)
4. **Final Score:** `GREATEST()` of all three scores, normalized to 0-100 scale
**Match Types:**
- `'fts'` - Full-text search match (highest ts_rank)
- `'fuzzy'` - Fuzzy string similarity match (highest similarity)
- `'history'` - Founding history match (owner/founder keywords)
**Fallback Behavior:**
- If `pg_trgm` extension not available → Uses FTS only (no fuzzy matching)
- If FTS returns 0 results → Falls back to SQLite keyword scoring
- If FTS query fails (exception) → Rollback transaction, use SQLite fallback
---
## 4. SQLite Keyword Scoring Fallback
### 4.1 Fallback Sequence
```mermaid
sequenceDiagram
participant SearchSvc as SearchService
participant DB as Database
participant Scorer as Keyword Scorer<br/>(in-memory)
SearchSvc->>SearchSvc: _expand_keywords(query)
Note over SearchSvc: Keywords: [strony, www, web, ...]
SearchSvc->>DB: SELECT * FROM companies<br/>WHERE status = 'active'
DB->>SearchSvc: All active companies (in-memory)
loop For each company
SearchSvc->>Scorer: Calculate score
Note over Scorer: Name match: +10<br/>(+5 bonus for exact match)
Note over Scorer: Description short: +5
Note over Scorer: Services: +8
Note over Scorer: Competencies: +7
Note over Scorer: City: +3
Note over Scorer: Founding history: +12<br/>(owners/founders)
Note over Scorer: Description full: +4
Scorer->>SearchSvc: Total score (0+)
end
SearchSvc->>SearchSvc: Filter companies (score > 0)
SearchSvc->>SearchSvc: Sort by score DESC
SearchSvc->>SearchSvc: Limit results
SearchSvc->>SearchSvc: Build SearchResult[]<br/>with scores and match_types
```
### 4.2 Keyword Scoring Algorithm
**File:** `search_service.py` (lines 162-249)
**Scoring Weights:**
```python
{
    'name_match': 10,         # Company name contains keyword
    'exact_name_match': +5,   # Exact query appears in name (bonus)
    'description_short': 5,   # Short description contains keyword
    'services': 8,            # Service tag matches
    'competencies': 7,        # Competency tag matches
    'city': 3,                # City/location matches
    'founding_history': 12,   # Owners/founders match (highest weight)
    'description_full': 4,    # Full description contains keyword
}
```
**Algorithm:**
1. Fetch all active companies from database
2. For each company, calculate score:
```python
score = 0
match_type = 'keyword'

# Name match (highest weight)
if any(keyword in company.name.lower() for keyword in keywords):
    score += 10
    if original_query.lower() in company.name.lower():
        score += 5  # Exact match bonus
        match_type = 'exact'

# Description match
if any(keyword in company.description_short.lower() for keyword in keywords):
    score += 5

# Services match
if any(keyword in service.name.lower() for service in company.services for keyword in keywords):
    score += 8

# Competencies match
if any(keyword in competency.name.lower() for competency in company.competencies for keyword in keywords):
    score += 7

# City match
if any(keyword in company.city.lower() for keyword in keywords):
    score += 3

# Founding history match (owners, founders)
if any(keyword in company.founding_history.lower() for keyword in keywords):
    score += 12

# Full description match
if any(keyword in company.description_full.lower() for keyword in keywords):
    score += 4
```
3. Filter companies with score > 0
4. Sort by score descending
5. Limit to requested result count
6. Return as `SearchResult[]` with scores and match types
**Match Types:**
- `'exact'` - Original query appears exactly in company name
- `'keyword'` - One or more expanded keywords matched
---
## 5. Direct Identifier Lookup
### 5.1 NIP Lookup Flow
```mermaid
sequenceDiagram
actor User
participant Route as /search route
participant SearchSvc as SearchService
participant DB as PostgreSQL
User->>Route: GET /search?q=5882436505
Route->>SearchSvc: search("5882436505")
SearchSvc->>SearchSvc: _is_nip("5882436505")
Note over SearchSvc: Regex: ^\d{10}$
SearchSvc->>SearchSvc: Clean: remove spaces/hyphens
SearchSvc->>DB: SELECT * FROM companies<br/>WHERE nip = '5882436505'<br/>AND status = 'active'
alt Company found
DB->>SearchSvc: Company object
SearchSvc->>Route: [SearchResult(company, score=100, match_type='nip')]
Route->>User: Display single company
else Not found
DB->>SearchSvc: NULL
SearchSvc->>Route: []
Route->>User: "Brak wyników"
end
```
**Implementation:**
- **File:** `search_service.py` (lines 112-131)
- **Input cleaning:** Strip spaces and hyphens (e.g., "588-243-65-05" → "5882436505")
- **Validation:** Must be exactly 10 digits
- **Score:** Always 100.0 (perfect match)
- **Match type:** `'nip'`
### 5.2 REGON Lookup Flow
```mermaid
sequenceDiagram
actor User
participant Route as /search route
participant SearchSvc as SearchService
participant DB as PostgreSQL
User->>Route: GET /search?q=220825533
Route->>SearchSvc: search("220825533")
SearchSvc->>SearchSvc: _is_regon("220825533")
Note over SearchSvc: Regex: ^\d{9}$ OR ^\d{14}$
SearchSvc->>SearchSvc: Clean: remove spaces/hyphens
SearchSvc->>DB: SELECT * FROM companies<br/>WHERE regon = '220825533'<br/>AND status = 'active'
alt Company found
DB->>SearchSvc: Company object
SearchSvc->>Route: [SearchResult(company, score=100, match_type='regon')]
Route->>User: Display single company
else Not found
DB->>SearchSvc: NULL
SearchSvc->>Route: []
Route->>User: "Brak wyników"
end
```
**Implementation:**
- **File:** `search_service.py` (lines 117-142)
- **Input cleaning:** Strip spaces and hyphens
- **Validation:** Must be exactly 9 or 14 digits
- **Score:** Always 100.0 (perfect match)
- **Match type:** `'regon'`
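The detection logic for both identifiers reduces to input cleaning plus two regular expressions. A sketch whose function names mirror the `_is_nip`/`_is_regon` helpers referenced above (this is an illustrative reconstruction, not the literal source):

```python
import re

def _clean_identifier(query: str) -> str:
    """Strip spaces and hyphens, e.g. '588-243-65-05' -> '5882436505'."""
    return re.sub(r'[\s\-]', '', query)

def is_nip(query: str) -> bool:
    """NIP: exactly 10 digits after cleaning."""
    return bool(re.fullmatch(r'\d{10}', _clean_identifier(query)))

def is_regon(query: str) -> bool:
    """REGON: exactly 9 or 14 digits after cleaning."""
    return bool(re.fullmatch(r'\d{9}|\d{14}', _clean_identifier(query)))
```

Order matters: a 10-digit string is ambiguous only in theory, but per the strategy-selection flowchart the NIP check always runs first, so a 10-digit query is treated as a NIP.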
---
## 6. User Search Interface
### 6.1 Search Route Flow
```mermaid
sequenceDiagram
actor User
participant Browser
participant Flask as Flask App<br/>(app.py /search)
participant SearchSvc as SearchService
participant DB as PostgreSQL
participant Template as search_results.html
User->>Browser: Navigate to /search
Browser->>Flask: GET /search?q=strony+www&category=1
Note over Flask: @login_required<br/>User must be authenticated
Flask->>Flask: Parse query params<br/>q = "strony www"<br/>category = 1
Flask->>SearchSvc: search_companies(db, "strony www", category_id=1, limit=50)
SearchSvc->>SearchSvc: Execute search strategy<br/>(NIP/REGON/FTS/Fallback)
SearchSvc->>DB: Query companies
DB->>SearchSvc: Results
SearchSvc->>Flask: List[SearchResult]
Flask->>Flask: Extract companies from results<br/>companies = [r.company for r in results]
Flask->>Flask: Log search analytics<br/>logger.info(f"Search '{query}': {len} results, types: {match_types}")
Flask->>Template: render_template('search_results.html',<br/>companies=companies,<br/>query=query,<br/>category_id=category_id,<br/>result_count=len)
Template->>Browser: HTML response
Browser->>User: Display search results
```
**Route Details:**
- **Path:** `/search`
- **Method:** GET
- **Authentication:** Required (`@login_required`)
- **File:** `app.py` (lines 718-748)
**Query Parameters:**
- `q` (string, optional) - Search query
- `category` (integer, optional) - Category filter (category_id)
**Response:**
- **Template:** `search_results.html`
- **Context Variables:**
- `companies` - List of Company objects
- `query` - Original search query
- `category_id` - Selected category filter
- `result_count` - Number of results
**Analytics Logging:**
```python
if query:
    match_types = {}
    for r in results:
        match_types[r.match_type] = match_types.get(r.match_type, 0) + 1
    logger.info(f"Search '{query}': {len(companies)} results, types: {match_types}")
```
Example log output:
```
Search 'strony www': 12 results, types: {'fts': 8, 'fuzzy': 3, 'exact': 1}
```
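Putting the pieces together, a runnable sketch of the route described above. The real handler additionally carries `@login_required`, calls the actual `search_companies(db, ...)`, and renders `search_results.html`; here a stub search function and a JSON response keep the example self-contained:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def search_companies(query, category_id=None, limit=50):
    """Stub standing in for SearchService; yields (name, score, match_type)."""
    data = [('PIXLAB Sp. z o.o.', 92.0, 'fts'), ('Web Agency', 71.5, 'fuzzy')]
    return data[:limit] if query else []

@app.route('/search')
def search():
    query = request.args.get('q', '').strip()
    category_id = request.args.get('category', type=int)
    results = search_companies(query, category_id=category_id, limit=50)

    # Analytics: count results per match type, as logged in app.py
    match_types = {}
    for _, _, mt in results:
        match_types[mt] = match_types.get(mt, 0) + 1
    app.logger.info(f"Search '{query}': {len(results)} results, types: {match_types}")

    # The real route renders search_results.html; JSON keeps the sketch standalone
    return jsonify(query=query, result_count=len(results), match_types=match_types)

client = app.test_client()
print(client.get('/search?q=strony+www&category=1').get_json())
```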
---
## 7. AI Chat Integration
### 7.1 AI Chat Search Flow
```mermaid
sequenceDiagram
actor User
participant Chat as AI Chat Interface<br/>/chat
participant ChatSvc as NordaBizChatService<br/>nordabiz_chat.py
participant SearchSvc as SearchService
participant DB as PostgreSQL
participant Gemini as Google Gemini API
User->>Chat: POST /chat/send<br/>"Szukam firm do stron www"
Chat->>ChatSvc: send_message(user_message, conversation_id)
ChatSvc->>ChatSvc: _find_relevant_companies(db, message)
Note over ChatSvc: Extract search keywords from message
ChatSvc->>SearchSvc: search_companies(db, message, limit=10)
Note over SearchSvc: Use same search strategies<br/>(NIP/REGON/FTS/Fallback)
SearchSvc->>DB: Query companies
DB->>SearchSvc: Results
SearchSvc->>ChatSvc: List[SearchResult] (max 10)
ChatSvc->>ChatSvc: Extract companies from results<br/>companies = [r.company for r in results]
ChatSvc->>ChatSvc: _build_conversation_context(db, user, conversation, companies)
Note over ChatSvc: Limit to 8 companies (prevent context overflow)<br/>Include last 10 messages for history
ChatSvc->>ChatSvc: _company_to_compact_dict(company)
Note over ChatSvc: Compress company data<br/>(name, desc, services, competencies, etc)
ChatSvc->>Gemini: POST /generateContent<br/>System prompt + context + user message
Note over Gemini: Model: gemini-2.5-flash<br/>Max tokens: 2048
Gemini->>ChatSvc: AI response text
ChatSvc->>DB: Save conversation messages<br/>(user message + AI response)
ChatSvc->>DB: Track API costs<br/>(gemini_cost_tracking)
ChatSvc->>Chat: AI response with company recommendations
Chat->>User: Display chat response
```
**Key Differences from User Search:**
1. **Result Limit:** 10 companies (vs 50 for user search)
2. **Company Limit to AI:** 8 companies max (prevents context overflow)
3. **Context Building:** Companies converted to compact JSON format
4. **Integration:** Seamless - AI doesn't know about search internals
5. **Message History:** Last 10 messages included in context
**Implementation:**
- **File:** `nordabiz_chat.py` (lines 383-405)
- **Search Call:**
```python
results = search_companies(db, message, limit=10)
companies = [result.company for result in results]
return companies
```
**Company Data Compression:**
```python
compact = {
    'name': company.name,
    'cat': company.category.name,
    'desc': company.description_short,
    'history': company.founding_history,  # Owners, founders
    'svc': [service.name for service in company.services],
    'comp': [competency.name for competency in company.competencies],
    'web': company.website,
    'tel': company.phone,
    'mail': company.email,
    'city': company.address_city,
    'year': company.year_established,
    'cert': [cert.name for cert in company.certifications[:3]]
}
```
**AI System Prompt (includes search context):**
```
Jesteś asystentem bazy firm Norda Biznes z Wejherowa.
Odpowiadaj zwięźle, konkretnie, po polsku.
Oto firmy które mogą być istotne dla pytania użytkownika:
{companies_json}
Historia rozmowy:
{recent_messages}
Odpowiedz na pytanie użytkownika bazując na powyższych danych.
```
---
## 8. Performance Considerations
### 8.1 Database Indexing
**Required Indexes:**
```sql
-- Full-text search index (PostgreSQL)
CREATE INDEX idx_companies_search_vector ON companies USING gin(search_vector);
-- NIP lookup index
CREATE UNIQUE INDEX idx_companies_nip ON companies(nip) WHERE status = 'active';
-- REGON lookup index
CREATE INDEX idx_companies_regon ON companies(regon) WHERE status = 'active';
-- Status filter index
CREATE INDEX idx_companies_status ON companies(status);
-- Category filter index
CREATE INDEX idx_companies_category ON companies(category_id) WHERE status = 'active';
-- pg_trgm index for fuzzy matching (optional)
CREATE INDEX idx_companies_name_trgm ON companies USING gin(name gin_trgm_ops);
```
### 8.2 Search Vector Maintenance
**Automatic Updates:**
```sql
-- Trigger to update search_vector on INSERT/UPDATE
CREATE TRIGGER companies_search_vector_update
BEFORE INSERT OR UPDATE ON companies
FOR EACH ROW EXECUTE FUNCTION
tsvector_update_trigger(
search_vector, 'pg_catalog.simple',
name, description_short, description_full, founding_history
);
```
**Manual Rebuild:**
```sql
-- Rebuild all search vectors
UPDATE companies SET search_vector =
    setweight(to_tsvector('simple', COALESCE(name, '')), 'A') ||
    setweight(to_tsvector('simple', COALESCE(description_short, '')), 'B') ||
    setweight(to_tsvector('simple', COALESCE(description_full, '')), 'C') ||
    setweight(to_tsvector('simple', COALESCE(founding_history, '')), 'B');
```
### 8.3 Query Performance
**Performance Targets:**
- **NIP/REGON lookup:** < 10ms (indexed)
- **PostgreSQL FTS:** < 100ms (typical)
- **SQLite fallback:** < 500ms (in-memory scoring)
- **AI Chat search:** < 200ms (limit 10 results)
**Optimization Strategies:**
1. **Early Exit:** NIP/REGON lookup bypasses full search
2. **Result Limiting:** Default 50 results (10 for AI chat)
3. **Category Filtering:** Reduces search space
4. **Synonym Pre-expansion:** Computed once, reused in all clauses
5. **Score-based Ordering:** Database-level sorting (not in-memory)
### 8.4 Fallback Performance
**PostgreSQL → SQLite Fallback Triggers:**
1. FTS query returns 0 results
2. FTS query throws exception (syntax error, missing extension)
3. `pg_trgm` extension not available (degrades to FTS-only, not full fallback)
**SQLite Fallback Cost:**
- Fetches ALL active companies into memory
- Scores each company in Python (slower than SQL)
- Suitable for development/testing, not recommended for production with 100+ companies
**Monitoring:**
```python
# Logged in app.py when search executes
logger.info(f"Search '{query}': {len(companies)} results, types: {match_types}")
# Example outputs:
# Search 'strony www': 12 results, types: {'fts': 8, 'fuzzy': 4}
# Search '5882436505': 1 results, types: {'nip': 1}
# Search 'PIXLAB': 1 results, types: {'exact': 1}
```
---
## 9. Search Result Structure
### 9.1 SearchResult Dataclass
**File:** `search_service.py` (lines 20-25)
```python
@dataclass
class SearchResult:
    """Search result with score and match info"""
    company: Company    # Full Company SQLAlchemy object
    score: float        # Relevance score (0.0-100.0)
    match_type: str     # Match type identifier
```
**Match Types:**
| Match Type | Description | Score Range |
|------------|-------------|-------------|
| `'nip'` | Direct NIP match | 100.0 (fixed) |
| `'regon'` | Direct REGON match | 100.0 (fixed) |
| `'exact'` | Exact name match (SQLite) | Variable (usually high) |
| `'fts'` | PostgreSQL full-text search | 0.0-100.0 (normalized ts_rank) |
| `'fuzzy'` | PostgreSQL fuzzy similarity | 0.0-100.0 (normalized similarity) |
| `'history'` | Founding history match | 50.0 (fixed bonus) |
| `'keyword'` | SQLite keyword scoring | Variable (weighted sum) |
| `'all'` | All companies (no filter) | 0.0 (no relevance) |
### 9.2 Score Normalization
**PostgreSQL FTS Scores:**
```python
# ts_rank returns 0.0-1.0, normalize to 0-100
fts_score = ts_rank(...) * 100

# similarity returns 0.0-1.0, normalize to 0-100
fuzzy_score = similarity(...) * 100

# history match is a fixed bonus
history_score = 0.5 * 100  # = 50.0
```
**SQLite Keyword Scores:**
```python
# Sum of all matching field weights
score = (
    10    # name match
    + 5   # exact match bonus
    + 5   # description_short
    + 8   # services
    + 7   # competencies
    + 3   # city
    + 12  # founding_history
    + 4   # description_full
)
# Maximum possible: 54 points
# Typical: 10-30 points
```
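The normalization and match-type selection described above can be condensed into one helper. This sketch mirrors the `GREATEST(...)` ordering from the SQL in section 3.2; it is illustrative, not the literal code from `search_service.py`:

```python
def normalize_result(fts: float, fuzzy: float, history: float) -> tuple[float, str]:
    """Pick the strongest raw signal (each 0.0-1.0) and scale it to 0-100."""
    scores = {'fts': fts, 'fuzzy': fuzzy, 'history': history}
    match_type = max(scores, key=scores.get)  # same winner GREATEST() would pick
    return round(scores[match_type] * 100, 1), match_type

print(normalize_result(0.34, 0.12, 0.0))  # (34.0, 'fts')
print(normalize_result(0.05, 0.61, 0.5))  # (61.0, 'fuzzy')
```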
---
## 10. Error Handling & Edge Cases
### 10.1 PostgreSQL FTS Error Handling
**Error Scenarios:**
1. **Invalid tsquery syntax** - Fallback to SQLite
2. **pg_trgm extension missing** - Degrade to FTS-only (no fuzzy)
3. **search_vector column missing** - Exception, fallback to SQLite
4. **Database connection error** - Propagate exception to route
**Implementation:**
```python
try:
    result = self.db.execute(sql, params)
    rows = result.fetchall()
    # ... process results
except Exception as e:
    print(f"PostgreSQL FTS error: {e}, falling back to keyword search")
    self.db.rollback()  # CRITICAL: prevent InFailedSqlTransaction
    return self._search_sqlite_fallback(query, category_id, limit)
```
**Critical:** `db.rollback()` is essential before fallback to prevent transaction state errors.
### 10.2 Empty Results Handling
**No Results Scenarios:**
1. **NIP/REGON not found** - Return empty list `[]`
2. **FTS returns 0 matches** - Automatic fallback to SQLite scoring
3. **SQLite scoring returns 0 matches** - Return empty list `[]`
4. **Empty query** - Return all active companies (ordered by name)
**User Interface:**
```html
{% if result_count == 0 %}
  <div class="alert alert-info">
    Brak wyników dla zapytania "{{ query }}".
    Spróbuj innych słów kluczowych lub usuń filtry.
  </div>
{% endif %}
```
### 10.3 Special Characters & Sanitization
**Query Cleaning:**
```python
query = query.strip() # Remove leading/trailing whitespace
clean_nip = re.sub(r'[\s\-]', '', query) # Remove spaces and hyphens from NIP/REGON
```
**SQL Injection Prevention:**
- All queries use SQLAlchemy parameter binding (`:param` syntax)
- No raw string concatenation in SQL
- ILIKE patterns are passed as array parameters
**XSS Prevention:**
- All user input sanitized before display (handled by Jinja2 auto-escaping)
- Query string displayed in template: `{{ query }}` (auto-escaped)
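The parameter-binding claim can be demonstrated in isolation. This self-contained demo uses SQLAlchemy's `:param` syntax against an in-memory SQLite database (`ILIKE ANY(:like_patterns)` is PostgreSQL-only, so a single `LIKE` stands in for it here):

```python
from sqlalchemy import create_engine, text

engine = create_engine('sqlite:///:memory:')
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE companies (name TEXT, status TEXT)"))
    conn.execute(text("INSERT INTO companies VALUES ('PIXLAB', 'active')"))

    # User input is bound, never concatenated into the SQL string
    malicious = "x' OR '1'='1"
    rows = conn.execute(
        text("SELECT name FROM companies WHERE status = 'active' AND name LIKE :pattern"),
        {'pattern': f'%{malicious}%'},
    ).fetchall()

    hit = conn.execute(
        text("SELECT name FROM companies WHERE name LIKE :pattern"),
        {'pattern': '%PIX%'},
    ).fetchall()

print(rows)       # [] - the quote is treated as data, not SQL
print(hit[0][0])  # PIXLAB
```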
---
## 11. Testing & Verification
### 11.1 Test Queries
**NIP Lookup:**
```
Query: "5882436505"
Expected: PIXLAB Sp. z o.o. (single result, score=100, match_type='nip')
```
**REGON Lookup:**
```
Query: "220825533"
Expected: Single company with matching REGON (score=100, match_type='regon')
```
**Keyword Search (PostgreSQL FTS):**
```
Query: "strony internetowe"
Expected: Multiple results (IT/Web companies, match_type='fts' or 'fuzzy')
Keywords expanded to: [strony, internetowe, www, web, internet, witryny, seo, ...]
```
**Exact Name Match:**
```
Query: "PIXLAB"
Expected: PIXLAB at top (high score, match_type='exact' or 'fts')
```
**Owner/Founder Search:**
```
Query: "Jan Kowalski" (example founder name)
Expected: Companies where Jan Kowalski appears in founding_history
Match type: 'history' or high score from founding_history match
```
**Category Filter:**
```
Query: "strony" + category=1 (IT)
Expected: Only IT category companies matching "strony"
```
**Empty Query:**
```
Query: ""
Expected: All active companies, alphabetically sorted
```
### 11.2 Performance Testing
**Load Testing Scenarios:**
```python
# Test 1: Direct lookup performance
for nip in all_nips:
    results = search_companies(db, nip)
    assert len(results) == 1
    assert results[0].match_type == 'nip'

# Test 2: Full-text search performance
queries = ["strony", "budowa", "księgowość", "metal", "transport"]
for query in queries:
    start = time.time()
    results = search_companies(db, query)
    elapsed = time.time() - start
    assert elapsed < 0.1  # < 100ms
    print(f"{query}: {len(results)} results in {elapsed*1000:.1f}ms")

# Test 3: Fallback trigger test (simulate FTS failure)
# Force SQLite fallback by using invalid tsquery syntax
results = search_companies(db, "test:query|with:invalid&syntax")
# Should not crash, should return results via fallback
```
### 11.3 Search Quality Metrics
**Relevance Testing:**
```python
test_cases = [
    {
        'query': 'strony www',
        'expected_top_3': ['PIXLAB', 'Web Agency', 'IT Solutions'],
        'min_results': 5
    },
    {
        'query': 'budownictwo',
        'expected_categories': ['Construction'],
        'min_results': 3
    },
    # ... more test cases
]

for test in test_cases:
    results = search_companies(db, test['query'])
    assert len(results) >= test['min_results']
    # Check if expected companies appear in top results
    # (use .get() - not every test case defines expected_top_3)
    top_names = [r.company.name for r in results[:3]]
    for expected in test.get('expected_top_3', []):
        assert expected in top_names
```
---
## 12. Maintenance & Monitoring
### 12.1 Database Maintenance
**Weekly Tasks:**
```sql
-- Rebuild search vectors (if data quality issues)
UPDATE companies SET search_vector =
    setweight(to_tsvector('simple', COALESCE(name, '')), 'A') ||
    setweight(to_tsvector('simple', COALESCE(description_short, '')), 'B') ||
    setweight(to_tsvector('simple', COALESCE(description_full, '')), 'C') ||
    setweight(to_tsvector('simple', COALESCE(founding_history, '')), 'B')
WHERE updated_at > NOW() - INTERVAL '7 days';
-- Verify index health
SELECT schemaname, tablename, indexname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE tablename = 'companies'
ORDER BY idx_scan DESC;
-- Check for missing indexes
SELECT indexname, indexdef FROM pg_indexes
WHERE tablename = 'companies';
```
**Monthly Tasks:**
```sql
-- Vacuum and analyze for performance
VACUUM ANALYZE companies;
-- Check for slow queries
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
WHERE query LIKE '%companies%search_vector%'
ORDER BY mean_exec_time DESC
LIMIT 10;
```
### 12.2 Search Analytics
**Logging Search Patterns:**
```python
# Already implemented in app.py /search route
logger.info(f"Search '{query}': {len(companies)} results, types: {match_types}")
```
**Analytics Queries:**
```sql
-- Top search queries (requires search_logs table - not yet implemented)
SELECT query, COUNT(*) as frequency
FROM search_logs
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY query
ORDER BY frequency DESC
LIMIT 20;
-- Zero-result searches (requires logging)
SELECT query, COUNT(*) as frequency
FROM search_logs
WHERE result_count = 0
AND created_at > NOW() - INTERVAL '30 days'
GROUP BY query
ORDER BY frequency DESC
LIMIT 10;
```
### 12.3 Synonym Expansion Tuning
**Adding New Synonyms:**
```python
# Edit search_service.py KEYWORD_SYNONYMS dictionary
KEYWORD_SYNONYMS = {
    # Add new industry-specific terms
    'cyberbezpieczeństwo': ['security', 'ochrona', 'firewall', 'antywirus'],
    # ... more synonyms
}
```
**Synonym Effectiveness Testing:**
```python
# Test query with and without synonym expansion
query = "cyberbezpieczeństwo"
# With expansion
results_with = search_companies(db, query)
print(f"With synonyms: {len(results_with)} results")
# Without expansion (mock)
# ... compare recall/precision
```
---
## 13. Future Enhancements
### 13.1 Planned Improvements
1. **Search Result Ranking ML Model**
- Learn from user click-through rates
- Personalized ranking based on user preferences
- A/B testing of ranking algorithms
2. **Search Autocomplete**
- Suggest company names as user types
- Suggest common search queries
- Category-based suggestions
3. **Advanced Filters**
- Location-based search (radius from city)
- Certification filters (ISO, other)
- Founding year range
- Employee count range (if available)
4. **Search Analytics Dashboard**
- Top queries (daily/weekly/monthly)
- Zero-result queries (opportunities for content)
- Average result count per query
- Match type distribution
- Click-through rates by position
5. **Semantic Search**
- Integrate sentence embeddings (sentence-transformers)
- Vector similarity search for related companies
- "More like this" company recommendations
6. **Multi-language Support**
- English query translation
- German query support (for border region)
- Auto-detect query language
### 13.2 Performance Optimization Ideas
1. **Query Result Caching**
- Redis cache for common queries (TTL 5 minutes)
- Cache key: `search:{query}:{category_id}`
- Invalidate on company data updates
2. **Partial Index Optimization**
```sql
-- Index only active companies
CREATE INDEX idx_companies_active_search
ON companies USING gin(search_vector)
WHERE status = 'active';
```
3. **Materialized View for Search**
```sql
-- Pre-compute search data
CREATE MATERIALIZED VIEW search_companies_mv AS
SELECT id, name, search_vector, category_id, status, ...
FROM companies
WHERE status = 'active';
-- Refresh daily
REFRESH MATERIALIZED VIEW search_companies_mv;
```
4. **Connection Pooling**
- Already implemented via SQLAlchemy
- Monitor pool size and overflow
- Adjust pool_size/max_overflow if needed
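The caching scheme proposed in item 1 can be sketched without a live Redis instance. Here a plain dict with timestamps stands in for Redis; in production, `SETEX` with a 300-second TTL and the same `search:{query}:{category_id}` key scheme would play this role (all helper names below are hypothetical):

```python
import time

CACHE_TTL = 300  # seconds (5 minutes)
_cache = {}      # stand-in for Redis: key -> (stored_at, results)

def cache_key(query, category_id):
    return f"search:{query}:{category_id}"

def cached_search(query, category_id, search_fn):
    key = cache_key(query, category_id)
    entry = _cache.get(key)
    if entry is not None and time.monotonic() - entry[0] < CACHE_TTL:
        return entry[1]                      # cache hit
    results = search_fn(query, category_id)  # cache miss: run the real search
    _cache[key] = (time.monotonic(), results)
    return results

def invalidate_cache():
    """Drop all cached entries on company data updates."""
    _cache.clear()

calls = []
def fake_search(q, c):
    calls.append(q)
    return [q.upper()]

print(cached_search('strony', 1, fake_search))  # ['STRONY'] (miss)
print(cached_search('strony', 1, fake_search))  # ['STRONY'] (hit)
print(len(calls))  # 1 - the second call was served from cache
```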
---
## 14. Related Documentation
- **[Flask Application Structure](../analysis/flask-application-structure.md)** - Complete route reference
- **[Database Schema](./05-database-schema.md)** - Company model and indexes
- **[External Integrations](./06-external-integrations.md)** - AI Chat integration details
- **[AI Chat Flow](./03-ai-chat-flow.md)** - How AI uses search service (to be created)
---
## 15. Glossary
| Term | Description |
|------|-------------|
| **FTS** | Full-Text Search - PostgreSQL text search engine using tsvector |
| **tsvector** | PostgreSQL data type for full-text search, stores preprocessed text |
| **tsquery** | PostgreSQL query syntax for full-text search (e.g., "word1 \| word2") |
| **ts_rank** | PostgreSQL function to score FTS relevance (0.0-1.0) |
| **pg_trgm** | PostgreSQL extension for trigram-based fuzzy string matching |
| **similarity()** | pg_trgm function to measure string similarity (0.0-1.0) |
| **Synonym Expansion** | Expanding user query with related keywords (e.g., "strony" → "www, web, internet") |
| **SearchResult** | Dataclass containing Company, score, and match_type |
| **Match Type** | Identifier for how company was matched (nip, regon, fts, fuzzy, keyword, etc.) |
| **NIP** | Polish tax identification number (10 digits) |
| **REGON** | Polish business registry number (9 or 14 digits) |
| **Fallback** | Alternative search method when primary method fails (PostgreSQL FTS → SQLite keyword scoring) |
| **SearchService** | Unified search service class (search_service.py) |
| **Keyword Scoring** | In-memory scoring algorithm for SQLite fallback |
---
## Document Metadata
**Created:** 2026-01-10
**Author:** Architecture Documentation (auto-claude)
**Related Files:**
- `search_service.py` (main implementation)
- `app.py` (lines 718-748, /search route)
- `nordabiz_chat.py` (lines 383-405, AI integration)
- `database.py` (Company model)
**Version History:**
- v1.0 (2026-01-10) - Initial documentation
---
**End of Document**