- Zmiana nazwy: "Norda Biznes Hub" → "Norda Biznes Partner" - Aktualizacja modelu AI: Gemini 2.0 Flash → Gemini 3 Flash - Zachowano historyczne odniesienia w timeline i dokumentacji Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1161 lines
36 KiB
Markdown
1161 lines
36 KiB
Markdown
# Company Search Flow
|
|
|
|
**Document Version:** 1.0
|
|
**Last Updated:** 2026-01-10
|
|
**Status:** Production LIVE
|
|
**Flow Type:** Company Search & Discovery
|
|
|
|
---
|
|
|
|
## Overview
|
|
|
|
This document describes the **complete company search flow** for the Norda Biznes Partner application, covering:
|
|
|
|
- **User Search Interface** (`/search` route)
|
|
- **Search Service Architecture** (unified search with multiple strategies)
|
|
- **AI Chat Integration** (context-aware company discovery)
|
|
- **Search Strategies:**
|
|
- NIP/REGON direct lookup
|
|
- Synonym expansion
|
|
- PostgreSQL Full-Text Search (FTS)
|
|
- Fuzzy matching (pg_trgm)
|
|
- SQLite keyword scoring fallback
|
|
|
|
**Key Technology:**
|
|
- **Search Engine:** Custom unified SearchService
|
|
- **Database:** PostgreSQL FTS with tsvector indexing
|
|
- **Fuzzy Matching:** pg_trgm extension for typo tolerance
|
|
- **Synonym Expansion:** Domain-specific keyword mappings
|
|
- **AI Integration:** Used by NordaBiz Chat for context building
|
|
|
|
**Performance Features:**
|
|
- Direct identifier lookup (NIP/REGON) bypasses full search
|
|
- Database-level full-text search indexing
|
|
- Synonym expansion increases recall
|
|
- Configurable result limits (default 50)
|
|
- Fallback mechanisms for SQLite compatibility
|
|
|
|
---
|
|
|
|
## 1. Search Flow Overview
|
|
|
|
### 1.1 High-Level Architecture
|
|
|
|
```mermaid
|
|
flowchart TD
|
|
User[User] -->|1. Search query| UI[Search UI<br/>/search route]
|
|
AIUser[AI Chat User] -->|1. Natural language| Chat[AI Chat<br/>/chat route]
|
|
|
|
UI -->|2. Call| SearchSvc[Search Service<br/>search_service.py]
|
|
Chat -->|2. Find companies| SearchSvc
|
|
|
|
SearchSvc -->|3. Detect query type| QueryType{Query Type?}
|
|
|
|
QueryType -->|NIP: 10 digits| NIPLookup[NIP Direct Lookup]
|
|
QueryType -->|REGON: 9/14 digits| REGONLookup[REGON Direct Lookup]
|
|
QueryType -->|Text query| DBCheck{Database<br/>Type?}
|
|
|
|
DBCheck -->|PostgreSQL| PGFTS[PostgreSQL FTS<br/>+ Fuzzy Match]
|
|
DBCheck -->|SQLite| SQLiteFallback[SQLite Keyword<br/>Scoring]
|
|
|
|
NIPLookup -->|4. Query DB| DB[(PostgreSQL<br/>companies)]
|
|
REGONLookup -->|4. Query DB| DB
|
|
PGFTS -->|4. FTS query| DB
|
|
SQLiteFallback -->|4. LIKE query| DB
|
|
|
|
DB -->|5. Results| SearchSvc
|
|
SearchSvc -->|6. SearchResult[]| UI
|
|
SearchSvc -->|6. Company[]| Chat
|
|
|
|
UI -->|7. Render| SearchResults[search_results.html]
|
|
Chat -->|7. Build context| AIContext[AI Context Builder]
|
|
|
|
SearchResults -->|8. Display| User
|
|
AIContext -->|8. Generate response| AIUser
|
|
|
|
style SearchSvc fill:#4CAF50
|
|
style PGFTS fill:#2196F3
|
|
style DB fill:#FF9800
|
|
style NIPLookup fill:#9C27B0
|
|
style REGONLookup fill:#9C27B0
|
|
```
|
|
|
|
---
|
|
|
|
## 2. Search Strategies
|
|
|
|
### 2.1 Strategy Selection Algorithm
|
|
|
|
```mermaid
|
|
flowchart TD
|
|
Start([User Query]) --> Clean[Strip whitespace]
|
|
Clean --> Empty{Empty<br/>query?}
|
|
|
|
Empty -->|Yes| AllCompanies[Return all companies<br/>ORDER BY name]
|
|
Empty -->|No| NIPCheck{Is NIP?<br/>10 digits}
|
|
|
|
NIPCheck -->|Yes| NIPSearch[Direct NIP lookup<br/>WHERE nip = ?]
|
|
NIPCheck -->|No| REGONCheck{Is REGON?<br/>9 or 14 digits}
|
|
|
|
REGONCheck -->|Yes| REGONSearch[Direct REGON lookup<br/>WHERE regon = ?]
|
|
REGONCheck -->|No| DBType{Database<br/>Type?}
|
|
|
|
DBType -->|PostgreSQL| PGFlow[PostgreSQL FTS Flow]
|
|
DBType -->|SQLite| SQLiteFlow[SQLite Keyword Flow]
|
|
|
|
NIPSearch --> Found{Found?}
|
|
REGONSearch --> Found
|
|
|
|
Found -->|Yes| ReturnSingle[Return single result<br/>score=100, match_type='nip/regon']
|
|
Found -->|No| ReturnEmpty[Return empty list]
|
|
|
|
PGFlow --> PGSynonym[Expand synonyms]
|
|
PGSynonym --> PGExtCheck{pg_trgm<br/>available?}
|
|
|
|
PGExtCheck -->|Yes| FTS_Fuzzy[FTS + Fuzzy search<br/>ts_rank + similarity]
|
|
PGExtCheck -->|No| FTS_Only[FTS only<br/>ts_rank]
|
|
|
|
FTS_Fuzzy --> PGResults{Results?}
|
|
FTS_Only --> PGResults
|
|
|
|
PGResults -->|Yes| ReturnScored[Return scored results<br/>ORDER BY score DESC]
|
|
PGResults -->|No| Fallback[Execute SQLite fallback]
|
|
|
|
SQLiteFlow --> SQLiteSynonym[Expand synonyms]
|
|
SQLiteSynonym --> Fallback
|
|
|
|
Fallback --> InMemory[In-memory keyword scoring]
|
|
InMemory --> ReturnScored
|
|
|
|
ReturnSingle --> End([SearchResult[]])
|
|
ReturnEmpty --> End
|
|
ReturnScored --> End
|
|
AllCompanies --> End
|
|
|
|
style NIPSearch fill:#9C27B0
|
|
style REGONSearch fill:#9C27B0
|
|
style FTS_Fuzzy fill:#2196F3
|
|
style FTS_Only fill:#2196F3
|
|
style InMemory fill:#FF9800
|
|
```
|
|
|
|
### 2.2 Synonym Expansion
|
|
|
|
**Purpose:** Increase search recall by expanding user queries with domain-specific synonyms
|
|
|
|
**Examples:**
|
|
```python
|
|
KEYWORD_SYNONYMS = {
|
|
# IT / Web
|
|
'strony': ['www', 'web', 'internet', 'witryny', 'seo', 'e-commerce', 'sklep', 'portal'],
|
|
'aplikacje': ['software', 'programowanie', 'systemy', 'crm', 'erp', 'app'],
|
|
'it': ['informatyka', 'komputery', 'software', 'systemy', 'serwis'],
|
|
|
|
# Construction
|
|
'budowa': ['budownictwo', 'konstrukcje', 'remonty', 'wykończenia', 'dach', 'elewacja'],
|
|
'remont': ['wykończenie', 'naprawa', 'renowacja', 'modernizacja'],
|
|
|
|
# Services
|
|
'księgowość': ['rachunkowość', 'finanse', 'podatki', 'biuro rachunkowe', 'kadry'],
|
|
'prawo': ['prawnik', 'adwokat', 'radca', 'kancelaria', 'notariusz'],
|
|
|
|
# Production
|
|
'metal': ['stal', 'obróbka', 'spawanie', 'cnc', 'ślusarstwo'],
|
|
'drewno': ['stolarka', 'meble', 'tartak', 'carpentry'],
|
|
}
|
|
```
|
|
|
|
**Algorithm:**
|
|
1. Tokenize user query (split on whitespace, strip punctuation)
|
|
2. For each word:
|
|
- Direct lookup in KEYWORD_SYNONYMS keys
|
|
- Check if word appears in any synonym list
|
|
- Add matching synonyms to expanded query
|
|
3. Return unique set of keywords
|
|
|
|
**Example Expansion:**
|
|
```
|
|
Input: "strony internetowe"
|
|
Output: ['strony', 'internetowe', 'www', 'web', 'internet', 'witryny',
|
|
'seo', 'e-commerce', 'ecommerce', 'sklep', 'portal', 'online',
|
|
'cyfrowe', 'marketing']
|
|
```
|
|
|
|
---
|
|
|
|
## 3. PostgreSQL Full-Text Search (FTS)
|
|
|
|
### 3.1 FTS Search Sequence
|
|
|
|
```mermaid
|
|
sequenceDiagram
|
|
actor User
|
|
participant Route as Flask Route<br/>/search
|
|
participant SearchSvc as SearchService
|
|
participant PG as PostgreSQL
|
|
participant FTS as Full-Text Engine<br/>(tsvector)
|
|
participant Trgm as pg_trgm Extension<br/>(fuzzy matching)
|
|
|
|
User->>Route: GET /search?q=strony www
|
|
Route->>SearchSvc: search("strony www", limit=50)
|
|
|
|
Note over SearchSvc: Detect PostgreSQL database
|
|
SearchSvc->>SearchSvc: _expand_keywords("strony www")
|
|
Note over SearchSvc: Expanded: [strony, www, web, internet,<br/>witryny, seo, e-commerce, ...]
|
|
|
|
SearchSvc->>SearchSvc: Build tsquery: "strony:* | www:* | web:* | ..."
|
|
SearchSvc->>SearchSvc: Build ILIKE patterns: [%strony%, %www%, %web%, ...]
|
|
|
|
SearchSvc->>PG: Check pg_trgm extension available
|
|
PG->>SearchSvc: Extension exists
|
|
|
|
SearchSvc->>PG: Execute FTS + Fuzzy query
|
|
Note over PG: SELECT c.id,<br/>ts_rank(search_vector, tsquery) as fts_score,<br/>similarity(name, query) as fuzzy_score,<br/>CASE WHEN founding_history ILIKE ...<br/>FROM companies c<br/>WHERE search_vector @@ tsquery<br/>OR similarity(name, query) > 0.2<br/>OR name/description ILIKE patterns
|
|
|
|
PG->>FTS: Match against search_vector
|
|
FTS->>PG: FTS matches with ts_rank scores
|
|
|
|
PG->>Trgm: Calculate similarity(name, query)
|
|
Trgm->>PG: Fuzzy match scores (0.0-1.0)
|
|
|
|
PG->>SearchSvc: Result rows: [(id, fts_score, fuzzy_score, history_score), ...]
|
|
|
|
SearchSvc->>PG: Fetch full Company objects<br/>WHERE id IN (...)
|
|
PG->>SearchSvc: Company objects
|
|
|
|
SearchSvc->>SearchSvc: Determine match_type (fts/fuzzy/history)
|
|
SearchSvc->>SearchSvc: Normalize scores (0-100)
|
|
|
|
SearchSvc->>Route: SearchResult[] with companies, scores, match_types
|
|
Route->>User: Render search_results.html
|
|
```
|
|
|
|
### 3.2 PostgreSQL FTS Implementation
|
|
|
|
**File:** `search_service.py` (lines 251-378)
|
|
|
|
**Database Requirements:**
|
|
- **Extension:** `pg_trgm` (optional, enables fuzzy matching)
|
|
- **Column:** `companies.search_vector` (tsvector, indexed)
|
|
- **Index:** GIN index on `search_vector` for fast full-text search
|
|
|
|
**SQL Query Structure (with pg_trgm):**
|
|
```sql
|
|
SELECT c.id,
|
|
COALESCE(ts_rank(c.search_vector, to_tsquery('simple', :tsquery)), 0) as fts_score,
|
|
COALESCE(similarity(c.name, :query), 0) as fuzzy_score,
|
|
CASE WHEN c.founding_history ILIKE ANY(:like_patterns) THEN 0.5 ELSE 0 END as history_score
|
|
FROM companies c
|
|
WHERE c.status = 'active'
|
|
AND (
|
|
c.search_vector @@ to_tsquery('simple', :tsquery) -- FTS match
|
|
OR similarity(c.name, :query) > 0.2 -- Fuzzy name match
|
|
OR c.name ILIKE ANY(:like_patterns) -- Keyword in name
|
|
OR c.description_short ILIKE ANY(:like_patterns) -- Keyword in description
|
|
OR c.founding_history ILIKE ANY(:like_patterns) -- Keyword in owners/founders
|
|
OR c.description_full ILIKE ANY(:like_patterns) -- Keyword in full text
|
|
)
|
|
ORDER BY GREATEST(
|
|
COALESCE(ts_rank(c.search_vector, to_tsquery('simple', :tsquery)), 0),
|
|
COALESCE(similarity(c.name, :query), 0),
|
|
CASE WHEN c.founding_history ILIKE ANY(:like_patterns) THEN 0.5 ELSE 0 END
|
|
) DESC
|
|
LIMIT :limit
|
|
```
|
|
|
|
**Parameters:**
|
|
- `:tsquery` - Expanded keywords joined with `|` (OR), each with `:*` prefix matching
|
|
- Example: `"strony:* | www:* | web:* | internet:*"`
|
|
- `:query` - Original user query for fuzzy matching
|
|
- `:like_patterns` - Array of ILIKE patterns for direct keyword matches
|
|
- Example: `['%strony%', '%www%', '%web%']`
|
|
- `:limit` - Maximum results (default 50)
|
|
|
|
**Scoring Strategy:**
|
|
1. **FTS Score:** `ts_rank()` measures how well document matches query (0.0-1.0)
|
|
2. **Fuzzy Score:** `similarity()` from pg_trgm measures string similarity (0.0-1.0)
|
|
3. **History Score:** Fixed 0.5 bonus if founders/owners match (important for people search)
|
|
4. **Final Score:** `GREATEST()` of all three scores, normalized to 0-100 scale
|
|
|
|
**Match Types:**
|
|
- `'fts'` - Full-text search match (highest ts_rank)
|
|
- `'fuzzy'` - Fuzzy string similarity match (highest similarity)
|
|
- `'history'` - Founding history match (owner/founder keywords)
|
|
|
|
**Fallback Behavior:**
|
|
- If `pg_trgm` extension not available → Uses FTS only (no fuzzy matching)
|
|
- If FTS returns 0 results → Falls back to SQLite keyword scoring
|
|
- If FTS query fails (exception) → Rollback transaction, use SQLite fallback
|
|
|
|
---
|
|
|
|
## 4. SQLite Keyword Scoring Fallback
|
|
|
|
### 4.1 Fallback Sequence
|
|
|
|
```mermaid
|
|
sequenceDiagram
|
|
participant SearchSvc as SearchService
|
|
participant DB as Database
|
|
participant Scorer as Keyword Scorer<br/>(in-memory)
|
|
|
|
SearchSvc->>SearchSvc: _expand_keywords(query)
|
|
Note over SearchSvc: Keywords: [strony, www, web, ...]
|
|
|
|
SearchSvc->>DB: SELECT * FROM companies<br/>WHERE status = 'active'
|
|
DB->>SearchSvc: All active companies (in-memory)
|
|
|
|
loop For each company
|
|
SearchSvc->>Scorer: Calculate score
|
|
|
|
Note over Scorer: Name match: +10<br/>(+5 bonus for exact match)
|
|
Note over Scorer: Description short: +5
|
|
Note over Scorer: Services: +8
|
|
Note over Scorer: Competencies: +7
|
|
Note over Scorer: City: +3
|
|
Note over Scorer: Founding history: +12<br/>(owners/founders)
|
|
Note over Scorer: Description full: +4
|
|
|
|
Scorer->>SearchSvc: Total score (0+)
|
|
end
|
|
|
|
SearchSvc->>SearchSvc: Filter companies (score > 0)
|
|
SearchSvc->>SearchSvc: Sort by score DESC
|
|
SearchSvc->>SearchSvc: Limit results
|
|
|
|
SearchSvc->>SearchSvc: Build SearchResult[]<br/>with scores and match_types
|
|
```
|
|
|
|
### 4.2 Keyword Scoring Algorithm
|
|
|
|
**File:** `search_service.py` (lines 162-249)
|
|
|
|
**Scoring Weights:**
|
|
```python
|
|
{
|
|
'name_match': 10, # Company name contains keyword
|
|
'exact_name_match': +5, # Exact query appears in name (bonus)
|
|
'description_short': 5, # Short description contains keyword
|
|
'services': 8, # Service tag matches
|
|
'competencies': 7, # Competency tag matches
|
|
'city': 3, # City/location matches
|
|
'founding_history': 12, # Owners/founders match (highest weight)
|
|
'description_full': 4, # Full description contains keyword
|
|
}
|
|
```
|
|
|
|
**Algorithm:**
|
|
1. Fetch all active companies from database
|
|
2. For each company, calculate score:
|
|
```python
|
|
score = 0
|
|
match_type = 'keyword'
|
|
|
|
# Name match (highest weight)
|
|
if any(keyword in company.name.lower() for keyword in keywords):
|
|
score += 10
|
|
if original_query.lower() in company.name.lower():
|
|
score += 5 # Exact match bonus
|
|
match_type = 'exact'
|
|
|
|
# Description match
|
|
if any(keyword in company.description_short.lower() for keyword in keywords):
|
|
score += 5
|
|
|
|
# Services match
|
|
if any(keyword in service.name.lower() for service in company.services for keyword in keywords):
|
|
score += 8
|
|
|
|
# Competencies match
|
|
if any(keyword in competency.name.lower() for competency in company.competencies for keyword in keywords):
|
|
score += 7
|
|
|
|
# City match
|
|
if any(keyword in company.city.lower() for keyword in keywords):
|
|
score += 3
|
|
|
|
# Founding history match (owners, founders)
|
|
if any(keyword in company.founding_history.lower() for keyword in keywords):
|
|
score += 12
|
|
|
|
# Full description match
|
|
if any(keyword in company.description_full.lower() for keyword in keywords):
|
|
score += 4
|
|
```
|
|
|
|
3. Filter companies with score > 0
|
|
4. Sort by score descending
|
|
5. Limit to requested result count
|
|
6. Return as `SearchResult[]` with scores and match types
|
|
|
|
**Match Types:**
|
|
- `'exact'` - Original query appears exactly in company name
|
|
- `'keyword'` - One or more expanded keywords matched
|
|
|
|
---
|
|
|
|
## 5. Direct Identifier Lookup
|
|
|
|
### 5.1 NIP Lookup Flow
|
|
|
|
```mermaid
|
|
sequenceDiagram
|
|
actor User
|
|
participant Route as /search route
|
|
participant SearchSvc as SearchService
|
|
participant DB as PostgreSQL
|
|
|
|
User->>Route: GET /search?q=5882436505
|
|
Route->>SearchSvc: search("5882436505")
|
|
|
|
SearchSvc->>SearchSvc: _is_nip("5882436505")
|
|
Note over SearchSvc: Regex: ^\d{10}$
|
|
SearchSvc->>SearchSvc: Clean: remove spaces/hyphens
|
|
|
|
SearchSvc->>DB: SELECT * FROM companies<br/>WHERE nip = '5882436505'<br/>AND status = 'active'
|
|
|
|
alt Company found
|
|
DB->>SearchSvc: Company object
|
|
SearchSvc->>Route: [SearchResult(company, score=100, match_type='nip')]
|
|
Route->>User: Display single company
|
|
else Not found
|
|
DB->>SearchSvc: NULL
|
|
SearchSvc->>Route: []
|
|
Route->>User: "Brak wyników"
|
|
end
|
|
```
|
|
|
|
**Implementation:**
|
|
- **File:** `search_service.py` (lines 112-131)
|
|
- **Input cleaning:** Strip spaces and hyphens (e.g., "588-243-65-05" → "5882436505")
|
|
- **Validation:** Must be exactly 10 digits
|
|
- **Score:** Always 100.0 (perfect match)
|
|
- **Match type:** `'nip'`
|
|
|
|
### 5.2 REGON Lookup Flow
|
|
|
|
```mermaid
|
|
sequenceDiagram
|
|
actor User
|
|
participant Route as /search route
|
|
participant SearchSvc as SearchService
|
|
participant DB as PostgreSQL
|
|
|
|
User->>Route: GET /search?q=220825533
|
|
Route->>SearchSvc: search("220825533")
|
|
|
|
SearchSvc->>SearchSvc: _is_regon("220825533")
|
|
Note over SearchSvc: Regex: ^\d{9}$ OR ^\d{14}$
|
|
SearchSvc->>SearchSvc: Clean: remove spaces/hyphens
|
|
|
|
SearchSvc->>DB: SELECT * FROM companies<br/>WHERE regon = '220825533'<br/>AND status = 'active'
|
|
|
|
alt Company found
|
|
DB->>SearchSvc: Company object
|
|
SearchSvc->>Route: [SearchResult(company, score=100, match_type='regon')]
|
|
Route->>User: Display single company
|
|
else Not found
|
|
DB->>SearchSvc: NULL
|
|
SearchSvc->>Route: []
|
|
Route->>User: "Brak wyników"
|
|
end
|
|
```
|
|
|
|
**Implementation:**
|
|
- **File:** `search_service.py` (lines 117-142)
|
|
- **Input cleaning:** Strip spaces and hyphens
|
|
- **Validation:** Must be exactly 9 or 14 digits
|
|
- **Score:** Always 100.0 (perfect match)
|
|
- **Match type:** `'regon'`
|
|
|
|
---
|
|
|
|
## 6. User Search Interface
|
|
|
|
### 6.1 Search Route Flow
|
|
|
|
```mermaid
|
|
sequenceDiagram
|
|
actor User
|
|
participant Browser
|
|
participant Flask as Flask App<br/>(app.py /search)
|
|
participant SearchSvc as SearchService
|
|
participant DB as PostgreSQL
|
|
participant Template as search_results.html
|
|
|
|
User->>Browser: Navigate to /search
|
|
Browser->>Flask: GET /search?q=strony+www&category=1
|
|
|
|
Note over Flask: @login_required<br/>User must be authenticated
|
|
|
|
Flask->>Flask: Parse query params<br/>q = "strony www"<br/>category = 1
|
|
|
|
Flask->>SearchSvc: search_companies(db, "strony www", category_id=1, limit=50)
|
|
SearchSvc->>SearchSvc: Execute search strategy<br/>(NIP/REGON/FTS/Fallback)
|
|
SearchSvc->>DB: Query companies
|
|
DB->>SearchSvc: Results
|
|
SearchSvc->>Flask: List[SearchResult]
|
|
|
|
Flask->>Flask: Extract companies from results<br/>companies = [r.company for r in results]
|
|
|
|
Flask->>Flask: Log search analytics<br/>logger.info(f"Search '{query}': {len} results, types: {match_types}")
|
|
|
|
Flask->>Template: render_template('search_results.html',<br/>companies=companies,<br/>query=query,<br/>category_id=category_id,<br/>result_count=len)
|
|
|
|
Template->>Browser: HTML response
|
|
Browser->>User: Display search results
|
|
```
|
|
|
|
**Route Details:**
|
|
- **Path:** `/search`
|
|
- **Method:** GET
|
|
- **Authentication:** Required (`@login_required`)
|
|
- **File:** `app.py` (lines 718-748)
|
|
|
|
**Query Parameters:**
|
|
- `q` (string, optional) - Search query
|
|
- `category` (integer, optional) - Category filter (category_id)
|
|
|
|
**Response:**
|
|
- **Template:** `search_results.html`
|
|
- **Context Variables:**
|
|
- `companies` - List of Company objects
|
|
- `query` - Original search query
|
|
- `category_id` - Selected category filter
|
|
- `result_count` - Number of results
|
|
|
|
**Analytics Logging:**
|
|
```python
|
|
if query:
|
|
match_types = {}
|
|
for r in results:
|
|
match_types[r.match_type] = match_types.get(r.match_type, 0) + 1
|
|
logger.info(f"Search '{query}': {len(companies)} results, types: {match_types}")
|
|
```
|
|
|
|
Example log output:
|
|
```
|
|
Search 'strony www': 12 results, types: {'fts': 8, 'fuzzy': 3, 'exact': 1}
|
|
```
|
|
|
|
---
|
|
|
|
## 7. AI Chat Integration
|
|
|
|
### 7.1 AI Chat Search Flow
|
|
|
|
```mermaid
|
|
sequenceDiagram
|
|
actor User
|
|
participant Chat as AI Chat Interface<br/>/chat
|
|
participant ChatSvc as NordaBizChatService<br/>nordabiz_chat.py
|
|
participant SearchSvc as SearchService
|
|
participant DB as PostgreSQL
|
|
participant Gemini as Google Gemini API
|
|
|
|
User->>Chat: POST /chat/send<br/>"Szukam firm do stron www"
|
|
Chat->>ChatSvc: send_message(user_message, conversation_id)
|
|
|
|
ChatSvc->>ChatSvc: _find_relevant_companies(db, message)
|
|
Note over ChatSvc: Extract search keywords from message
|
|
|
|
ChatSvc->>SearchSvc: search_companies(db, message, limit=10)
|
|
Note over SearchSvc: Use same search strategies<br/>(NIP/REGON/FTS/Fallback)
|
|
|
|
SearchSvc->>DB: Query companies
|
|
DB->>SearchSvc: Results
|
|
SearchSvc->>ChatSvc: List[SearchResult] (max 10)
|
|
|
|
ChatSvc->>ChatSvc: Extract companies from results<br/>companies = [r.company for r in results]
|
|
|
|
ChatSvc->>ChatSvc: _build_conversation_context(db, user, conversation, companies)
|
|
Note over ChatSvc: Limit to 8 companies (prevent context overflow)<br/>Include last 10 messages for history
|
|
|
|
ChatSvc->>ChatSvc: _company_to_compact_dict(company)
|
|
Note over ChatSvc: Compress company data<br/>(name, desc, services, competencies, etc)
|
|
|
|
ChatSvc->>Gemini: POST /generateContent<br/>System prompt + context + user message
|
|
Note over Gemini: Model: gemini-2.5-flash<br/>Max tokens: 2048
|
|
|
|
Gemini->>ChatSvc: AI response text
|
|
|
|
ChatSvc->>DB: Save conversation messages<br/>(user message + AI response)
|
|
ChatSvc->>DB: Track API costs<br/>(gemini_cost_tracking)
|
|
|
|
ChatSvc->>Chat: AI response with company recommendations
|
|
Chat->>User: Display chat response
|
|
```
|
|
|
|
**Key Differences from User Search:**
|
|
1. **Result Limit:** 10 companies (vs 50 for user search)
|
|
2. **Company Limit to AI:** 8 companies max (prevents context overflow)
|
|
3. **Context Building:** Companies converted to compact JSON format
|
|
4. **Integration:** Seamless - AI doesn't know about search internals
|
|
5. **Message History:** Last 10 messages included in context
|
|
|
|
**Implementation:**
|
|
- **File:** `nordabiz_chat.py` (lines 383-405)
|
|
- **Search Call:**
|
|
```python
|
|
results = search_companies(db, message, limit=10)
|
|
companies = [result.company for result in results]
|
|
return companies
|
|
```
|
|
|
|
**Company Data Compression:**
|
|
```python
|
|
compact = {
|
|
'name': company.name,
|
|
'cat': company.category.name,
|
|
'desc': company.description_short,
|
|
'history': company.founding_history, # Owners, founders
|
|
'svc': [service.name for service in company.services],
|
|
'comp': [competency.name for competency in company.competencies],
|
|
'web': company.website,
|
|
'tel': company.phone,
|
|
'mail': company.email,
|
|
'city': company.address_city,
|
|
'year': company.year_established,
|
|
'cert': [cert.name for cert in company.certifications[:3]]
|
|
}
|
|
```
|
|
|
|
**AI System Prompt (includes search context):**
|
|
```
|
|
Jesteś asystentem bazy firm Norda Biznes z Wejherowa.
|
|
Odpowiadaj zwięźle, konkretnie, po polsku.
|
|
|
|
Oto firmy które mogą być istotne dla pytania użytkownika:
|
|
{companies_json}
|
|
|
|
Historia rozmowy:
|
|
{recent_messages}
|
|
|
|
Odpowiedz na pytanie użytkownika bazując na powyższych danych.
|
|
```
|
|
|
|
---
|
|
|
|
## 8. Performance Considerations
|
|
|
|
### 8.1 Database Indexing
|
|
|
|
**Required Indexes:**
|
|
```sql
|
|
-- Full-text search index (PostgreSQL)
|
|
CREATE INDEX idx_companies_search_vector ON companies USING gin(search_vector);
|
|
|
|
-- NIP lookup index
|
|
CREATE UNIQUE INDEX idx_companies_nip ON companies(nip) WHERE status = 'active';
|
|
|
|
-- REGON lookup index
|
|
CREATE INDEX idx_companies_regon ON companies(regon) WHERE status = 'active';
|
|
|
|
-- Status filter index
|
|
CREATE INDEX idx_companies_status ON companies(status);
|
|
|
|
-- Category filter index
|
|
CREATE INDEX idx_companies_category ON companies(category_id) WHERE status = 'active';
|
|
|
|
-- pg_trgm index for fuzzy matching (optional)
|
|
CREATE INDEX idx_companies_name_trgm ON companies USING gin(name gin_trgm_ops);
|
|
```
|
|
|
|
### 8.2 Search Vector Maintenance
|
|
|
|
**Automatic Updates:**
|
|
```sql
|
|
-- Trigger to update search_vector on INSERT/UPDATE
|
|
CREATE TRIGGER companies_search_vector_update
|
|
BEFORE INSERT OR UPDATE ON companies
|
|
FOR EACH ROW EXECUTE FUNCTION
|
|
tsvector_update_trigger(
|
|
search_vector, 'pg_catalog.simple',
|
|
name, description_short, description_full, founding_history
|
|
);
|
|
```
|
|
|
|
**Manual Rebuild:**
|
|
```sql
|
|
-- Rebuild all search vectors
|
|
UPDATE companies SET search_vector =
|
|
setweight(to_tsvector('simple', COALESCE(name, '')), 'A') ||
|
|
setweight(to_tsvector('simple', COALESCE(description_short, '')), 'B') ||
|
|
setweight(to_tsvector('simple', COALESCE(description_full, '')), 'C') ||
|
|
setweight(to_tsvector('simple', COALESCE(founding_history, '')), 'B');
|
|
```
|
|
|
|
### 8.3 Query Performance
|
|
|
|
**Performance Targets:**
|
|
- **NIP/REGON lookup:** < 10ms (indexed)
|
|
- **PostgreSQL FTS:** < 100ms (typical)
|
|
- **SQLite fallback:** < 500ms (in-memory scoring)
|
|
- **AI Chat search:** < 200ms (limit 10 results)
|
|
|
|
**Optimization Strategies:**
|
|
1. **Early Exit:** NIP/REGON lookup bypasses full search
|
|
2. **Result Limiting:** Default 50 results (10 for AI chat)
|
|
3. **Category Filtering:** Reduces search space
|
|
4. **Synonym Pre-expansion:** Computed once, reused in all clauses
|
|
5. **Score-based Ordering:** Database-level sorting (not in-memory)
|
|
|
|
### 8.4 Fallback Performance
|
|
|
|
**PostgreSQL → SQLite Fallback Triggers:**
|
|
1. FTS query returns 0 results
|
|
2. FTS query throws exception (syntax error, missing extension)
|
|
3. `pg_trgm` extension not available (degrades to FTS-only, not full fallback)
|
|
|
|
**SQLite Fallback Cost:**
|
|
- Fetches ALL active companies into memory
|
|
- Scores each company in Python (slower than SQL)
|
|
- Suitable for development/testing, not recommended for production with 100+ companies
|
|
|
|
**Monitoring:**
|
|
```python
|
|
# Logged in app.py when search executes
|
|
logger.info(f"Search '{query}': {len(companies)} results, types: {match_types}")
|
|
|
|
# Example outputs:
|
|
# Search 'strony www': 12 results, types: {'fts': 8, 'fuzzy': 4}
|
|
# Search '5882436505': 1 results, types: {'nip': 1}
|
|
# Search 'PIXLAB': 1 results, types: {'exact': 1}
|
|
```
|
|
|
|
---
|
|
|
|
## 9. Search Result Structure
|
|
|
|
### 9.1 SearchResult Dataclass
|
|
|
|
**File:** `search_service.py` (lines 20-25)
|
|
|
|
```python
|
|
@dataclass
|
|
class SearchResult:
|
|
"""Search result with score and match info"""
|
|
company: Company # Full Company SQLAlchemy object
|
|
score: float # Relevance score (0.0-100.0)
|
|
match_type: str # Match type identifier
|
|
```
|
|
|
|
**Match Types:**
|
|
| Match Type | Description | Score Range |
|
|
|------------|-------------|-------------|
|
|
| `'nip'` | Direct NIP match | 100.0 (fixed) |
|
|
| `'regon'` | Direct REGON match | 100.0 (fixed) |
|
|
| `'exact'` | Exact name match (SQLite) | Variable (usually high) |
|
|
| `'fts'` | PostgreSQL full-text search | 0.0-100.0 (normalized ts_rank) |
|
|
| `'fuzzy'` | PostgreSQL fuzzy similarity | 0.0-100.0 (normalized similarity) |
|
|
| `'history'` | Founding history match | 50.0 (fixed bonus) |
|
|
| `'keyword'` | SQLite keyword scoring | Variable (weighted sum) |
|
|
| `'all'` | All companies (no filter) | 0.0 (no relevance) |
|
|
|
|
### 9.2 Score Normalization
|
|
|
|
**PostgreSQL FTS Scores:**
|
|
```python
|
|
# ts_rank returns 0.0-1.0, normalize to 0-100
|
|
fts_score = ts_rank(...) * 100
|
|
|
|
# similarity returns 0.0-1.0, normalize to 0-100
|
|
fuzzy_score = similarity(...) * 100
|
|
|
|
# history match is fixed bonus
|
|
history_score = 0.5 * 100 = 50.0
|
|
```
|
|
|
|
**SQLite Keyword Scores:**
|
|
```python
|
|
# Sum of all matching field weights
|
|
score = (
|
|
10 # name match
|
|
+ 5 # exact match bonus
|
|
+ 5 # description_short
|
|
+ 8 # services
|
|
+ 7 # competencies
|
|
+ 3 # city
|
|
+ 12 # founding_history
|
|
+ 4 # description_full
|
|
)
|
|
# Maximum possible: 54 points
|
|
# Typical: 10-30 points
|
|
```
|
|
|
|
---
|
|
|
|
## 10. Error Handling & Edge Cases
|
|
|
|
### 10.1 PostgreSQL FTS Error Handling
|
|
|
|
**Error Scenarios:**
|
|
1. **Invalid tsquery syntax** - Fallback to SQLite
|
|
2. **pg_trgm extension missing** - Degrade to FTS-only (no fuzzy)
|
|
3. **search_vector column missing** - Exception, fallback to SQLite
|
|
4. **Database connection error** - Propagate exception to route
|
|
|
|
**Implementation:**
|
|
```python
|
|
try:
|
|
result = self.db.execute(sql, params)
|
|
rows = result.fetchall()
|
|
# ... process results
|
|
except Exception as e:
|
|
print(f"PostgreSQL FTS error: {e}, falling back to keyword search")
|
|
self.db.rollback() # CRITICAL: prevent InFailedSqlTransaction
|
|
return self._search_sqlite_fallback(query, category_id, limit)
|
|
```
|
|
|
|
**Critical:** `db.rollback()` is essential before fallback to prevent transaction state errors.
|
|
|
|
### 10.2 Empty Results Handling
|
|
|
|
**No Results Scenarios:**
|
|
1. **NIP/REGON not found** - Return empty list `[]`
|
|
2. **FTS returns 0 matches** - Automatic fallback to SQLite scoring
|
|
3. **SQLite scoring returns 0 matches** - Return empty list `[]`
|
|
4. **Empty query** - Return all active companies (ordered by name)
|
|
|
|
**User Interface:**
|
|
```html
|
|
{% if result_count == 0 %}
|
|
<div class="alert alert-info">
|
|
Brak wyników dla zapytania "{{ query }}".
|
|
Spróbuj innych słów kluczowych lub usuń filtry.
|
|
</div>
|
|
{% endif %}
|
|
```
|
|
|
|
### 10.3 Special Characters & Sanitization
|
|
|
|
**Query Cleaning:**
|
|
```python
|
|
query = query.strip() # Remove leading/trailing whitespace
|
|
clean_nip = re.sub(r'[\s\-]', '', query) # Remove spaces and hyphens from NIP/REGON
|
|
```
|
|
|
|
**SQL Injection Prevention:**
|
|
- All queries use SQLAlchemy parameter binding (`:param` syntax)
|
|
- No raw string concatenation in SQL
|
|
- ILIKE patterns are passed as array parameters
|
|
|
|
**XSS Prevention:**
|
|
- All user input sanitized before display (handled by Jinja2 auto-escaping)
|
|
- Query string displayed in template: `{{ query }}` (auto-escaped)
|
|
|
|
---
|
|
|
|
## 11. Testing & Verification
|
|
|
|
### 11.1 Test Queries
|
|
|
|
**NIP Lookup:**
|
|
```
|
|
Query: "5882436505"
|
|
Expected: PIXLAB Sp. z o.o. (single result, score=100, match_type='nip')
|
|
```
|
|
|
|
**REGON Lookup:**
|
|
```
|
|
Query: "220825533"
|
|
Expected: Single company with matching REGON (score=100, match_type='regon')
|
|
```
|
|
|
|
**Keyword Search (PostgreSQL FTS):**
|
|
```
|
|
Query: "strony internetowe"
|
|
Expected: Multiple results (IT/Web companies, match_type='fts' or 'fuzzy')
|
|
Keywords expanded to: [strony, internetowe, www, web, internet, witryny, seo, ...]
|
|
```
|
|
|
|
**Exact Name Match:**
|
|
```
|
|
Query: "PIXLAB"
|
|
Expected: PIXLAB at top (high score, match_type='exact' or 'fts')
|
|
```
|
|
|
|
**Owner/Founder Search:**
|
|
```
|
|
Query: "Jan Kowalski" (example founder name)
|
|
Expected: Companies where Jan Kowalski appears in founding_history
|
|
Match type: 'history' or high score from founding_history match
|
|
```
|
|
|
|
**Category Filter:**
|
|
```
|
|
Query: "strony" + category=1 (IT)
|
|
Expected: Only IT category companies matching "strony"
|
|
```
|
|
|
|
**Empty Query:**
|
|
```
|
|
Query: ""
|
|
Expected: All active companies, alphabetically sorted
|
|
```
|
|
|
|
### 11.2 Performance Testing
|
|
|
|
**Load Testing Scenarios:**
|
|
```python
|
|
# Test 1: Direct lookup performance
|
|
for nip in all_nips:
|
|
results = search_companies(db, nip)
|
|
assert len(results) == 1
|
|
assert results[0].match_type == 'nip'
|
|
|
|
# Test 2: Full-text search performance
|
|
queries = ["strony", "budowa", "księgowość", "metal", "transport"]
|
|
for query in queries:
|
|
start = time.time()
|
|
results = search_companies(db, query)
|
|
elapsed = time.time() - start
|
|
assert elapsed < 0.1 # < 100ms
|
|
print(f"{query}: {len(results)} results in {elapsed*1000:.1f}ms")
|
|
|
|
# Test 3: Fallback trigger test (simulate FTS failure)
|
|
# Force SQLite fallback by using invalid tsquery syntax
|
|
results = search_companies(db, "test:query|with:invalid&syntax")
|
|
# Should not crash, should return results via fallback
|
|
```
|
|
|
|
### 11.3 Search Quality Metrics
|
|
|
|
**Relevance Testing:**
|
|
```python
|
|
test_cases = [
|
|
{
|
|
'query': 'strony www',
|
|
'expected_top_3': ['PIXLAB', 'Web Agency', 'IT Solutions'],
|
|
'min_results': 5
|
|
},
|
|
{
|
|
'query': 'budownictwo',
|
|
'expected_categories': ['Construction'],
|
|
'min_results': 3
|
|
},
|
|
# ... more test cases
|
|
]
|
|
|
|
for test in test_cases:
|
|
results = search_companies(db, test['query'])
|
|
assert len(results) >= test['min_results']
|
|
# Check if expected companies appear in top results
|
|
top_names = [r.company.name for r in results[:3]]
|
|
for expected in test['expected_top_3']:
|
|
assert expected in top_names
|
|
```
|
|
|
|
---
|
|
|
|
## 12. Maintenance & Monitoring
|
|
|
|
### 12.1 Database Maintenance
|
|
|
|
**Weekly Tasks:**
|
|
```sql
|
|
-- Rebuild search vectors (if data quality issues)
|
|
UPDATE companies SET search_vector =
|
|
setweight(to_tsvector('simple', COALESCE(name, '')), 'A') ||
|
|
setweight(to_tsvector('simple', COALESCE(description_short, '')), 'B') ||
|
|
setweight(to_tsvector('simple', COALESCE(description_full, '')), 'C') ||
|
|
setweight(to_tsvector('simple', COALESCE(founding_history, '')), 'B')
|
|
WHERE updated_at > NOW() - INTERVAL '7 days';
|
|
|
|
-- Verify index health
|
|
SELECT schemaname, tablename, indexname, idx_scan, idx_tup_read, idx_tup_fetch
|
|
FROM pg_stat_user_indexes
|
|
WHERE tablename = 'companies'
|
|
ORDER BY idx_scan DESC;
|
|
|
|
-- Check for missing indexes
|
|
SELECT indexname, indexdef FROM pg_indexes
|
|
WHERE tablename = 'companies';
|
|
```
|
|
|
|
**Monthly Tasks:**
|
|
```sql
|
|
-- Vacuum and analyze for performance
|
|
VACUUM ANALYZE companies;
|
|
|
|
-- Check for slow queries
|
|
SELECT query, mean_exec_time, calls
|
|
FROM pg_stat_statements
|
|
WHERE query LIKE '%companies%search_vector%'
|
|
ORDER BY mean_exec_time DESC
|
|
LIMIT 10;
|
|
```
|
|
|
|
### 12.2 Search Analytics
|
|
|
|
**Logging Search Patterns:**
|
|
```python
|
|
# Already implemented in app.py /search route
|
|
logger.info(f"Search '{query}': {len(companies)} results, types: {match_types}")
|
|
```
|
|
|
|
**Analytics Queries:**
|
|
```sql
|
|
-- Top search queries (requires search_logs table - not yet implemented)
|
|
SELECT query, COUNT(*) as frequency
|
|
FROM search_logs
|
|
WHERE created_at > NOW() - INTERVAL '30 days'
|
|
GROUP BY query
|
|
ORDER BY frequency DESC
|
|
LIMIT 20;
|
|
|
|
-- Zero-result searches (requires logging)
|
|
SELECT query, COUNT(*) as frequency
|
|
FROM search_logs
|
|
WHERE result_count = 0
|
|
AND created_at > NOW() - INTERVAL '30 days'
|
|
GROUP BY query
|
|
ORDER BY frequency DESC
|
|
LIMIT 10;
|
|
```
|
|
|
|
### 12.3 Synonym Expansion Tuning
|
|
|
|
**Adding New Synonyms:**
|
|
```python
|
|
# Edit search_service.py KEYWORD_SYNONYMS dictionary
|
|
KEYWORD_SYNONYMS = {
|
|
# Add new industry-specific terms
|
|
'cyberbezpieczeństwo': ['security', 'ochrona', 'firewall', 'antywirus'],
|
|
# ... more synonyms
|
|
}
|
|
```
|
|
|
|
**Synonym Effectiveness Testing:**
|
|
```python
|
|
# Test query with and without synonym expansion
|
|
query = "cyberbezpieczeństwo"
|
|
|
|
# With expansion
|
|
results_with = search_companies(db, query)
|
|
print(f"With synonyms: {len(results_with)} results")
|
|
|
|
# Without expansion (mock)
|
|
# ... compare recall/precision
|
|
```
|
|
|
|
---
|
|
|
|
## 13. Future Enhancements
|
|
|
|
### 13.1 Planned Improvements
|
|
|
|
1. **Search Result Ranking ML Model**
|
|
- Learn from user click-through rates
|
|
- Personalized ranking based on user preferences
|
|
- A/B testing of ranking algorithms
|
|
|
|
2. **Search Autocomplete**
|
|
- Suggest company names as user types
|
|
- Suggest common search queries
|
|
- Category-based suggestions
|
|
|
|
3. **Advanced Filters**
|
|
- Location-based search (radius from city)
|
|
- Certification filters (ISO, other)
|
|
- Founding year range
|
|
- Employee count range (if available)
|
|
|
|
4. **Search Analytics Dashboard**
|
|
- Top queries (daily/weekly/monthly)
|
|
- Zero-result queries (opportunities for content)
|
|
- Average result count per query
|
|
- Match type distribution
|
|
- Click-through rates by position
|
|
|
|
5. **Semantic Search**
|
|
- Integrate sentence embeddings (sentence-transformers)
|
|
- Vector similarity search for related companies
|
|
- "More like this" company recommendations
|
|
|
|
6. **Multi-language Support**
|
|
- English query translation
|
|
- German query support (for border region)
|
|
- Auto-detect query language
|
|
|
|
### 13.2 Performance Optimization Ideas
|
|
|
|
1. **Query Result Caching**
|
|
- Redis cache for common queries (TTL 5 minutes)
|
|
- Cache key: `search:{query}:{category_id}`
|
|
- Invalidate on company data updates
|
|
|
|
2. **Partial Index Optimization**
|
|
```sql
|
|
-- Index only active companies
|
|
CREATE INDEX idx_companies_active_search
|
|
ON companies USING gin(search_vector)
|
|
WHERE status = 'active';
|
|
```
|
|
|
|
3. **Materialized View for Search**
|
|
```sql
|
|
-- Pre-compute search data
|
|
CREATE MATERIALIZED VIEW search_companies_mv AS
|
|
SELECT id, name, search_vector, category_id, status, ...
|
|
FROM companies
|
|
WHERE status = 'active';
|
|
|
|
-- Refresh daily
|
|
REFRESH MATERIALIZED VIEW search_companies_mv;
|
|
```
|
|
|
|
4. **Connection Pooling**
|
|
- Already implemented via SQLAlchemy
|
|
- Monitor pool size and overflow
|
|
- Adjust pool_size/max_overflow if needed
|
|
|
|
---
|
|
|
|
## 14. Related Documentation
|
|
|
|
- **[Flask Application Structure](../analysis/flask-application-structure.md)** - Complete route reference
|
|
- **[Database Schema](./05-database-schema.md)** - Company model and indexes
|
|
- **[External Integrations](./06-external-integrations.md)** - AI Chat integration details
|
|
- **[AI Chat Flow](./03-ai-chat-flow.md)** - How AI uses search service (to be created)
|
|
|
|
---
|
|
|
|
## 15. Glossary
|
|
|
|
| Term | Description |
|
|
|------|-------------|
|
|
| **FTS** | Full-Text Search - PostgreSQL text search engine using tsvector |
|
|
| **tsvector** | PostgreSQL data type for full-text search, stores preprocessed text |
|
|
| **tsquery** | PostgreSQL query syntax for full-text search (e.g., "word1 \| word2") |
|
|
| **ts_rank** | PostgreSQL function to score FTS relevance (0.0-1.0) |
|
|
| **pg_trgm** | PostgreSQL extension for trigram-based fuzzy string matching |
|
|
| **similarity()** | pg_trgm function to measure string similarity (0.0-1.0) |
|
|
| **Synonym Expansion** | Expanding user query with related keywords (e.g., "strony" → "www, web, internet") |
|
|
| **SearchResult** | Dataclass containing Company, score, and match_type |
|
|
| **Match Type** | Identifier for how company was matched (nip, regon, fts, fuzzy, keyword, etc.) |
|
|
| **NIP** | Polish tax identification number (10 digits) |
|
|
| **REGON** | Polish business registry number (9 or 14 digits) |
|
|
| **Fallback** | Alternative search method when primary method fails (PostgreSQL FTS → SQLite keyword scoring) |
|
|
| **SearchService** | Unified search service class (search_service.py) |
|
|
| **Keyword Scoring** | In-memory scoring algorithm for SQLite fallback |
|
|
|
|
---
|
|
|
|
## Document Metadata
|
|
|
|
**Created:** 2026-01-10
|
|
**Author:** Architecture Documentation (auto-claude)
|
|
**Related Files:**
|
|
- `search_service.py` (main implementation)
|
|
- `app.py` (lines 718-748, /search route)
|
|
- `nordabiz_chat.py` (lines 383-405, AI integration)
|
|
- `database.py` (Company model)
|
|
|
|
**Version History:**
|
|
- v1.0 (2026-01-10) - Initial documentation
|
|
|
|
---
|
|
|
|
**End of Document**
|