Hopsakee committed on
Commit 5fe3652 · verified · 1 Parent(s): 5b40ec9

Upload folder using huggingface_hub

docs/qdrant_lessons_learned.md ADDED
@@ -0,0 +1,299 @@
+ # Qdrant Integration: Lessons Learned
+
+ ## Introduction
+
+ This document summarizes our experience integrating the Qdrant vector database with FastEmbed for embedding generation. We encountered several challenges related to vector naming conventions, search query formats, and other aspects of working with Qdrant. It outlines the issues we faced and the solutions we implemented to create a robust vector search system.
+
+ ## Problem Statement
+
+ We were experiencing issues with vector name mismatches in our Qdrant integration. Specifically:
+
+ 1. Points were being skipped during processing with the error message "Skipping point as it has no valid vector"
+ 2. The vector names we specified in our configuration did not match the actual vector names used in the Qdrant collection
+ 3. We had implemented unnecessary sanitization of model names
+
+ ## Understanding Vector Names in Qdrant
+
+ ### How Qdrant Handles Vector Names
+
+ According to the [Qdrant documentation](https://qdrant.tech/documentation/concepts/collections/), when creating a collection with vectors, you specify vector names and their configurations. These names are used as keys when inserting and querying vectors.
+
+ However, when using FastEmbed with Qdrant, we discovered that the model names specified in the configuration are transformed before being used as vector names in the collection:
+
+ - Original model name: `"intfloat/multilingual-e5-large"`
+ - Actual vector name in Qdrant: `"fast-multilingual-e5-large"`
+
+ Similarly for sparse vectors:
+ - Original model name: `"prithivida/Splade_PP_en_v1"`
+ - Actual vector name in Qdrant: `"fast-sparse-splade_pp_en_v1"`
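+
+ You can confirm the transformed names directly from a live collection before relying on them; a minimal sketch (assuming a local Qdrant instance and our collection name, `fabric_patterns_hybrid`):
+
+ ```python
+ from qdrant_client import QdrantClient
+
+ client = QdrantClient(url="http://localhost:6333")  # assumed local instance
+ info = client.get_collection("fabric_patterns_hybrid")
+
+ # Named dense vectors live under config.params.vectors,
+ # named sparse vectors under config.params.sparse_vectors.
+ print(list(info.config.params.vectors.keys()))         # e.g. ['fast-multilingual-e5-large']
+ print(list(info.config.params.sparse_vectors.keys()))  # e.g. ['fast-sparse-splade_pp_en_v1']
+ ```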
+
+ ### Initial Approach (Problematic)
+
+ Our initial approach was to manually transform the model names using a `format_vector_name` function:
+
+ ```python
+ def format_vector_name(name: str) -> str:
+     """Format a model name into a valid vector name for Qdrant."""
+     return name.replace('/', '_')
+ ```
+
+ This led to inconsistencies because:
+ 1. We were using one transformation in our code (`replace('/', '_')`)
+ 2. FastEmbed was using a different transformation (prefixing with "fast-" and removing slashes)
+
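+ For example, for the dense model the two transformations disagree, and the second name is what actually exists in the collection:
+
+ ```python
+ # Our manual transformation:
+ format_vector_name("intfloat/multilingual-e5-large")  # -> 'intfloat_multilingual-e5-large'
+
+ # The name FastEmbed actually uses in the collection:
+ # 'fast-multilingual-e5-large'
+ ```
+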
+ ## Solution: Dynamic Vector Name Discovery
+
+ Instead of trying to predict how FastEmbed transforms model names, we implemented a solution that dynamically discovers the actual vector names from the Qdrant collection configuration.
+
+ ### Helper Functions
+
+ We added two helper functions to retrieve the actual vector names:
+
+ ```python
+ def get_dense_vector_name(client: QdrantClient, collection_name: str) -> str:
+     """
+     Get the name of the dense vector from the collection configuration.
+
+     Args:
+         client: Initialized Qdrant client
+         collection_name: Name of the collection
+
+     Returns:
+         Name of the dense vector as used in the collection
+     """
+     try:
+         return list(client.get_collection(collection_name).config.params.vectors.keys())[0]
+     except (IndexError, AttributeError) as e:
+         logger.warning(f"Could not get dense vector name: {e}")
+         # Fallback to a default name
+         return "fast-multilingual-e5-large"
+
+ def get_sparse_vector_name(client: QdrantClient, collection_name: str) -> str:
+     """
+     Get the name of the sparse vector from the collection configuration.
+
+     Args:
+         client: Initialized Qdrant client
+         collection_name: Name of the collection
+
+     Returns:
+         Name of the sparse vector as used in the collection
+     """
+     try:
+         return list(client.get_collection(collection_name).config.params.sparse_vectors.keys())[0]
+     except (IndexError, AttributeError) as e:
+         logger.warning(f"Could not get sparse vector name: {e}")
+         # Fallback to a default name
+         return "fast-sparse-splade_pp_en_v1"
+ ```
+
+ ### Implementation in Vector Creation
+
+ When creating new points or updating existing ones, we now use these helper functions to get the correct vector names:
+
+ ```python
+ # Get vector names from the collection configuration
+ dense_vector_name = get_dense_vector_name(client, collection_name)
+ sparse_vector_name = get_sparse_vector_name(client, collection_name)
+
+ # Create point with the correct vector names
+ point = PointStruct(
+     id=str(uuid.uuid4()),
+     vector={
+         dense_vector_name: get_embedding(payload_new['purpose'])[0],
+         sparse_vector_name: get_embedding(payload_new['purpose'])[1]
+     },
+     payload={
+         # payload fields...
+     }
+ )
+ ```
+
+ ### Implementation in Vector Querying
+
+ Similarly, when querying vectors, we use the same helper functions:
+
+ ```python
+ # Get the actual vector names from the collection configuration
+ dense_vector_name = get_dense_vector_name(client, collection_name)
+
+ # Skip points without vector or without the required vector type
+ if not point.vector or dense_vector_name not in point.vector:
+     logger.debug(f"Skipping point {point_id} as it has no valid vector")
+     continue
+
+ # Find semantically similar points using Qdrant's search
+ similar_points = client.search(
+     collection_name=collection_name,
+     query_vector=(dense_vector_name, point.vector.get(dense_vector_name)),
+     limit=100,
+     score_threshold=SIMILARITY_THRESHOLD
+ )
+ ```
+
+ ## Key Insights
+
+ 1. **Model Names vs. Vector Names**: There's a distinction between the model names you specify in your configuration and the actual vector names used in the Qdrant collection. FastEmbed transforms these names.
+
+ 2. **Dynamic Discovery**: Instead of hardcoding vector names or trying to predict the transformation, it's better to dynamically discover the actual vector names from the collection configuration.
+
+ 3. **Fallback Mechanism**: Always include fallback mechanisms in case the collection information can't be retrieved, making your code more robust.
+
+ 4. **Consistency**: Use the same vector names throughout your system to ensure consistency between vector creation, storage, and retrieval.
+
+ 5. **Correct Search Query Format**: When searching with named vectors in Qdrant, you must use the correct format. Instead of passing a dictionary with vector names as keys, pass a `(vector_name, vector_values)` tuple to the `query_vector` parameter.
+
+ ## Accessing Collection Configuration
+
+ The key to our solution was discovering how to access the collection configuration to get the actual vector names:
+
+ ```python
+ # Get dense vector name
+ dense_vector_name = list(client.get_collection(collection_name).config.params.vectors.keys())[0]
+
+ # Get sparse vector name
+ sparse_vector_name = list(client.get_collection(collection_name).config.params.sparse_vectors.keys())[0]
+ ```
+
+ This approach allows our code to adapt to however FastEmbed decides to name the vectors in the collection, rather than assuming a specific naming convention.
+
+ ## Correct Search Query Format for Named Vectors
+
+ When using named vectors in Qdrant, it's important to use the correct format for search queries. The format depends on the version of the Qdrant client you're using:
+
+ ### Incorrect Format (Causes Validation Error)
+
+ ```python
+ # This format causes a validation error
+ similar_points = client.search(
+     collection_name=collection_name,
+     query_vector={
+         dense_vector_name: point.vector.get(dense_vector_name)
+     },
+     limit=100
+ )
+ ```
+
+ ### Correct Format for Qdrant Client Version 1.12.2
+
+ ```python
+ # This is the correct format for Qdrant client version 1.12.2
+ similar_points = client.search(
+     collection_name=collection_name,
+     query_vector=(dense_vector_name, point.vector.get(dense_vector_name)),  # Tuple of (vector_name, vector_values)
+     limit=100,
+     score_threshold=0.8  # Optional similarity threshold
+ )
+ ```
+
+ In Qdrant client version 1.12.2, the correct way to specify which named vector to use is by providing a tuple to the `query_vector` parameter. The tuple should contain the vector name as the first element and the actual vector values as the second element.
+
+ Using the incorrect format will result in a Pydantic validation error with messages like:
+
+ ```
+ validation errors for SearchRequest
+ vector.list[float]
+   Input should be a valid list [type=list_type, input_value={'fast-multilingual-e5-la...}, input_type=dict]
+ vector.NamedVector.name
+   Field required [type=missing, input_value={'fast-multilingual-e5-la...}, input_type=dict]
+ ```
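+
+ The error above references the client's `NamedVector` model, which can be used instead of the bare tuple; a minimal equivalent sketch:
+
+ ```python
+ from qdrant_client.models import NamedVector
+
+ similar_points = client.search(
+     collection_name=collection_name,
+     query_vector=NamedVector(
+         name=dense_vector_name,                      # which named vector to search against
+         vector=point.vector.get(dense_vector_name),  # the query embedding itself
+     ),
+     limit=100,
+ )
+ ```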
+
+ ## Optimizing Search Parameters for Deduplication
+
+ When using Qdrant for deduplication of similar content, the search parameters play a crucial role in determining the effectiveness of the process. We've found the following parameters to be particularly important:
+
+ ### Similarity Threshold
+
+ The `score_threshold` parameter determines the minimum similarity score required for points to be considered similar:
+
+ ```python
+ similar_points = client.search(
+     collection_name=collection_name,
+     query_vector=(dense_vector_name, point.vector.get(dense_vector_name)),
+     limit=100,
+     score_threshold=0.9  # Only consider points with similarity > 90%
+ )
+ ```
+
+ For deduplication purposes, we found that a higher threshold (0.9) works better than a lower one (0.7) at avoiding false positives: only very similar items are then considered duplicates.
+
+ ### Text Difference Threshold
+
+ In addition to vector similarity, we also check the actual text difference between potential duplicates:
+
+ ```python
+ # Constants for duplicate detection
+ SIMILARITY_THRESHOLD = 0.9   # Minimum semantic similarity to consider as potential duplicate
+ DIFFERENCE_THRESHOLD = 0.05  # Maximum text difference (5%) to consider as duplicate
+ ```
+
+ The `DIFFERENCE_THRESHOLD` of 0.05 means that texts with less than 5% difference will be considered duplicates. This two-step verification (vector similarity + text difference) helps to ensure that only true duplicates are removed.
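+
+ A minimal sketch of that second check, using `difflib` from the standard library (this mirrors the `calculate_text_difference_percentage` helper in our deduplicator module):
+
+ ```python
+ import difflib
+
+ def text_difference(text1: str, text2: str) -> float:
+     """Return the difference between two strings as a fraction (0.0 identical, 1.0 completely different)."""
+     return 1.0 - difflib.SequenceMatcher(None, text1, text2).ratio()
+
+ # A candidate pair only counts as a duplicate if it passed the vector search
+ # (similarity above SIMILARITY_THRESHOLD) AND the texts are nearly identical.
+ is_duplicate = text_difference(point_content, similar_content) <= DIFFERENCE_THRESHOLD
+ ```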
+
+ ## Logging Considerations
+
+ When working with Qdrant, especially during development and debugging, it's helpful to adjust the logging level:
+
+ ```python
+ # Set the log level (pick one depending on the environment)
+ logger.setLevel(logging.DEBUG)  # For development/debugging
+ logger.setLevel(logging.INFO)   # For production
+ ```
+
+ Using `DEBUG` level during development provides detailed information about vector operations, including:
+ - Which points are being processed
+ - Why points are being skipped (e.g., missing vectors)
+ - Similarity scores between points
+ - Deduplication decisions
+
+ However, in production, it's better to use `INFO` level to reduce log volume, especially when processing large collections.
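+
+ For context, a condensed sketch of our `setup_logger` (file plus console handlers sharing one formatter; the handler levels are where DEBUG vs. INFO is chosen):
+
+ ```python
+ import logging
+
+ def setup_logger(log_file: str = 'fabric_to_espanso.log') -> logging.Logger:
+     logger = logging.getLogger('fabric_to_espanso')
+     logger.setLevel(logging.DEBUG)  # handlers filter from here
+
+     file_handler = logging.FileHandler(log_file)
+     console_handler = logging.StreamHandler()
+
+     # DEBUG during development; switch to INFO for production
+     file_handler.setLevel(logging.DEBUG)
+     console_handler.setLevel(logging.DEBUG)
+
+     formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
+     file_handler.setFormatter(formatter)
+     console_handler.setFormatter(formatter)
+
+     logger.addHandler(file_handler)
+     logger.addHandler(console_handler)
+     return logger
+ ```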
+
+ ## Performance Considerations
+
+ ### Batch Operations
+
+ When working with large numbers of points, it's more efficient to use batch operations:
+
+ ```python
+ # Batch upsert example
+ client.upsert(
+     collection_name=collection_name,
+     points=batch_of_points  # List of PointStruct objects
+ )
+ ```
+
+ This reduces network overhead compared to upserting points individually.
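+
+ For large collections, the points can be chunked first; a small sketch (the batch size of 100 is an illustrative choice, not a tuned value):
+
+ ```python
+ BATCH_SIZE = 100  # illustrative; tune to payload size and network conditions
+
+ # points_to_upsert: list of PointStruct objects built earlier
+ for start in range(0, len(points_to_upsert), BATCH_SIZE):
+     client.upsert(
+         collection_name=collection_name,
+         points=points_to_upsert[start:start + BATCH_SIZE],
+     )
+ ```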
+
+ ### Search Limit
+
+ The `limit` parameter in search operations should be set carefully:
+
+ ```python
+ similar_points = client.search(
+     collection_name=collection_name,
+     query_vector=(dense_vector_name, point.vector.get(dense_vector_name)),
+     limit=100,  # Maximum number of similar points to return
+     score_threshold=0.9
+ )
+ ```
+
+ A higher limit increases the chance of finding all duplicates but also increases search time. For deduplication purposes, we found that a limit of 100 provides a good balance between thoroughness and performance.
+
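+ One practical safeguard is to check whether the result list is saturated; a small sketch (illustrative, not taken from our codebase):
+
+ ```python
+ if len(similar_points) == 100:  # the limit we searched with
+     # There may be matches beyond the limit; flag it so the pass can be
+     # re-run or the limit raised.
+     logger.warning("Search limit reached; some similar points may have been missed")
+ ```
+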
+ ## Conclusion
+
+ Our experience with Qdrant has taught us several important lessons:
+
+ 1. **Dynamic Vector Name Discovery**: By retrieving the actual vector names from the Qdrant collection configuration, we've created a robust solution that adapts to the naming conventions used by FastEmbed and Qdrant.
+
+ 2. **Correct Query Format**: Using the proper format for search queries with named vectors is essential, specifically passing a `(vector_name, vector_values)` tuple to the `query_vector` parameter.
+
+ 3. **Optimized Search Parameters**: Fine-tuning similarity thresholds and text difference thresholds is crucial for effective deduplication, with stricter thresholds (0.9 for similarity, 0.05 for text difference) providing better results.
+
+ 4. **Appropriate Logging Levels**: Using DEBUG level during development and INFO in production balances having enough information for troubleshooting against log volume and performance.
+
+ 5. **Batch Operations**: Using batch operations for inserting and updating points significantly improves performance when working with large collections.
+
+ By implementing these lessons, we've created a more efficient and reliable vector search system that properly handles named vectors, effectively identifies duplicates, and maintains good performance even with large collections.
+
+ This solution should work regardless of changes to the naming conventions in future versions of Qdrant or FastEmbed, as it reads the actual names directly from the collection configuration.
logs/fabric_to_espanso.log.1 CHANGED
The diff for this file is too large to render. See raw diff
 
logs/fabric_to_espanso.log.2 ADDED
The diff for this file is too large to render. See raw diff
 
logs/fabric_to_espanso.log.3 ADDED
The diff for this file is too large to render. See raw diff
 
logs/fabric_to_espanso.log.4 ADDED
The diff for this file is too large to render. See raw diff
 
main.py CHANGED
@@ -8,9 +8,10 @@ from contextlib import contextmanager
  from src.fabrics_processor.database import initialize_qdrant_database
  from src.fabrics_processor.file_change_detector import detect_file_changes
  from src.fabrics_processor.database_updater import update_qdrant_database
- from src.fabrics_processor.yaml_file_generator import generate_yaml_file
+ from src.fabrics_processor.output_files_generator import generate_yaml_file
  from src.fabrics_processor.logger import setup_logger
  from src.fabrics_processor.config import config
+ from src.fabrics_processor.deduplicator import remove_duplicates
  from src.fabrics_processor.exceptions import (
      DatabaseConnectionError,
      DatabaseInitializationError
@@ -62,13 +63,27 @@ def process_changes(client) -> bool:
      if deleted_files:
          logger.info(f"Deleted files: {deleted_files}")

+     # Track changes for summary
+     duplicates_removed = 0
+
      # Update database if there are changes
      if any([new_files, modified_files, deleted_files]):
          logger.info("Changes detected. Updating database...")
-         update_qdrant_database(client, new_files, modified_files, deleted_files)
+         update_qdrant_database(client, config.embedding.collection_name, new_files, modified_files, deleted_files)

+         # Deduplicate entries after updating the database
+         logger.info("Checking for and removing duplicate entries...")
+         duplicates_removed = remove_duplicates(client, config.embedding.collection_name)
+         if duplicates_removed > 0:
+             logger.info(f"Removed {duplicates_removed} duplicate entries from the database")
+
      # Always generate output files to ensure consistency
      generate_yaml_file(client, config.yaml_output_folder)
+
+     # Generate summary message
+     total_entries = len(client.scroll(collection_name=config.embedding.collection_name, limit=10000)[0])
+     summary_message = f"Database update summary: {len(new_files)} added, {len(modified_files)} modified, {len(deleted_files)} deleted, {duplicates_removed} duplicates removed. Total entries: {total_entries}"
+     logger.info(summary_message)

      return True

parameters.py CHANGED
@@ -61,5 +61,5 @@ REQUIRED_FIELDS_DEFAULTS = {

  # Embedding Model parameters for Qdrant
  USE_FASTEMBED = True
- EMBED_MODEL_DENSE = 'BAAI/bge-base-en'  # "fast-bge-small-en"
+ EMBED_MODEL_DENSE = "intfloat/multilingual-e5-large"  # 'BAAI/bge-base-en' # "fast-bge-small-en"
  EMBED_MODEL_SPARSE = "prithivida/Splade_PP_en_v1"
src/fabrics_processor/database.py CHANGED
@@ -12,6 +12,42 @@ from .exceptions import DatabaseConnectionError, CollectionError, DatabaseInitializationError

  logger = logging.getLogger('fabric_to_espanso')

+ def get_dense_vector_name(client: QdrantClient, collection_name: str) -> str:
+     """
+     Get the name of the dense vector from the collection configuration.
+
+     Args:
+         client: Initialized Qdrant client
+         collection_name: Name of the collection
+
+     Returns:
+         Name of the dense vector as used in the collection
+     """
+     try:
+         return list(client.get_collection(collection_name).config.params.vectors.keys())[0]
+     except (IndexError, AttributeError) as e:
+         logger.warning(f"Could not get dense vector name: {e}")
+         # Fallback to a default name
+         return "fast-multilingual-e5-large"
+
+ def get_sparse_vector_name(client: QdrantClient, collection_name: str) -> str:
+     """
+     Get the name of the sparse vector from the collection configuration.
+
+     Args:
+         client: Initialized Qdrant client
+         collection_name: Name of the collection
+
+     Returns:
+         Name of the sparse vector as used in the collection
+     """
+     try:
+         return list(client.get_collection(collection_name).config.params.sparse_vectors.keys())[0]
+     except (IndexError, AttributeError) as e:
+         logger.warning(f"Could not get sparse vector name: {e}")
+         # Fallback to a default name
+         return "fast-sparse-splade_pp_en_v1"
+
  def create_database_connection(url: Optional[str] = None, api_key: Optional[str] = None) -> QdrantClient:
      """Create a database connection.

src/fabrics_processor/database_updater.py CHANGED
@@ -7,10 +7,12 @@ import uuid
  from .output_files_generator import generate_yaml_file, generate_markdown_files
  from .config import config
  from .exceptions import ConfigurationError
- from .database import validate_point_payload
+ from .database import validate_point_payload, get_dense_vector_name, get_sparse_vector_name

  logger = logging.getLogger('fabric_to_espanso')

+ # TODO: Make a summary of the prompts using a call to an LLM for every prompt and store that in the purpose field
+ # of the database instead of the extracted purpose from the markdown files and use that summary to create the embeddings
  def get_embedding(text: str) -> list:
      """
      Generate embedding vector for the given text using FastEmbed.
@@ -59,11 +61,16 @@ def update_qdrant_database(client: QdrantClient, collection_name: str, new_files
      for file in new_files:
          try:
              payload_new = validate_point_payload(file)
+             # Get vector names from the collection configuration
+             dense_vector_name = get_dense_vector_name(client, collection_name)
+             sparse_vector_name = get_sparse_vector_name(client, collection_name)
+
+             # Create point with the correct vector names
              point = PointStruct(
                  id=str(uuid.uuid4()),  # Generate a new UUID for each point
                  vector={
-                     'fast-bge-base-en': get_embedding(payload_new['purpose'])[0],
-                     'fast-sparse-splade_pp_en_v1': get_embedding(payload_new['purpose'])[1]
+                     dense_vector_name: get_embedding(payload_new['purpose'])[0],
+                     sparse_vector_name: get_embedding(payload_new['purpose'])[1]
                  },
                  payload={
                      "filename": payload_new['filename'],
@@ -95,11 +102,16 @@ def update_qdrant_database(client: QdrantClient, collection_name: str, new_files
              point_id = scroll_result[0].id
              payload_current = validate_point_payload(file, point_id)
              # Update the existing point with the new file data
+             # Get vector names from the collection configuration
+             dense_vector_name = get_dense_vector_name(client, collection_name)
+             sparse_vector_name = get_sparse_vector_name(client, collection_name)
+
+             # Create point with the correct vector names
              point = PointStruct(
                  id=point_id,
                  vector={
-                     'fast-bge-base-en': get_embedding(payload_current['purpose'])[0],
-                     'fast-sparse-splade_pp_en_v1': get_embedding(payload_current['purpose'])[1]
+                     dense_vector_name: get_embedding(payload_current['purpose'])[0],
+                     sparse_vector_name: get_embedding(payload_current['purpose'])[1]
                  },
                  payload={
                      "filename": payload_current['filename'],
src/fabrics_processor/deduplicator.py ADDED
@@ -0,0 +1,167 @@
+ """Deduplication module for fabric-to-espanso."""
+ import logging
+ from typing import List, Dict, Any, Tuple, Set
+ import difflib
+ from qdrant_client import QdrantClient
+ from qdrant_client.http.models import Filter, PointIdsList
+
+ from .config import config
+ from .database import get_dense_vector_name, get_sparse_vector_name
+
+ logger = logging.getLogger('fabric_to_espanso')
+
+ def calculate_text_difference_percentage(text1: str, text2: str) -> float:
+     """
+     Calculate the percentage difference between two text strings.
+
+     Args:
+         text1: First text string
+         text2: Second text string
+
+     Returns:
+         Percentage difference as a float between 0.0 (identical) and 1.0 (completely different)
+     """
+     # Use difflib's SequenceMatcher to calculate similarity ratio
+     similarity = difflib.SequenceMatcher(None, text1, text2).ratio()
+     # Convert similarity to difference percentage
+     difference_percentage = 1.0 - similarity
+     return difference_percentage
+
+ # TODO: Consider moving the vector similarity search functionality to database_query.py and import it here
+ # This would create a more structured codebase with search functionality centralized in one place
+ def find_duplicates(client: QdrantClient, collection_name: str = config.embedding.collection_name) -> List[Tuple[str, List[str]]]:
+     """
+     Find duplicate entries in the database based on semantic similarity and text difference.
+
+     Args:
+         client: Initialized Qdrant client
+         collection_name: Name of the collection to query
+
+     Returns:
+         List of tuples containing (kept_point_id, [duplicate_point_ids])
+     """
+     # Constants for duplicate detection
+     SIMILARITY_THRESHOLD = 0.85  # Minimum semantic similarity to consider as potential duplicate
+     DIFFERENCE_THRESHOLD = 0.1   # Maximum text difference (10%) to consider as duplicate
+     # Get all points from the database
+     all_points = client.scroll(
+         collection_name=collection_name,
+         with_vectors=True,  # Include vector data, else no vector will be available
+         limit=10000  # Adjust based on expected file count
+     )[0]
+
+     logger.info(f"Checking {len(all_points)} entries for duplicates")
+
+     # Track processed points to avoid redundant comparisons
+     processed_points = set()
+     # Store duplicates as (kept_id, [duplicate_ids])
+     duplicates = []
+
+     # For each point, find semantically similar points
+     for i, point in enumerate(all_points):
+         if point.id in processed_points:
+             continue
+
+         point_id = point.id
+         point_content = point.payload.get('content', '')
+         logger.debug(f"Checking point {point_id} for duplicates")
+         logger.debug(f"Content: {point_content}")
+
+         # Skip if no content
+         if not point_content:
+             logger.debug(f"Skipping point {point_id} as it has no content")
+             continue
+
+         # Get the actual vector names from the collection configuration
+         dense_vector_name = get_dense_vector_name(client, collection_name)
+
+         # Skip points without vector or without the required vector type
+         if not point.vector or dense_vector_name not in point.vector:
+             logger.debug(f"Skipping point {point_id} as it has no valid vector")
+             continue
+
+         # Find semantically similar points using Qdrant's search
+         similar_points = client.search(
+             collection_name=collection_name,
+             query_vector=(dense_vector_name, point.vector.get(dense_vector_name)),
+             limit=100,
+             score_threshold=SIMILARITY_THRESHOLD  # Only consider points with similarity > threshold
+         )
+
+         # Skip the first result (which is the point itself)
+         similar_points = [p for p in similar_points if p.id != point_id]
+
+         if not similar_points:
+             continue
+
+         logger.debug(f"Found {len(similar_points)} semantically similar points for {point.payload.get('filename', 'unknown')}")
+
+         # Check text difference for each similar point
+         duplicate_ids = []
+         for similar_point in similar_points:
+             similar_id = similar_point.id
+
+             # Skip if already processed
+             if similar_id in processed_points:
+                 continue
+
+             # Get content of similar point
+             similar_content = None
+             for p in all_points:
+                 if p.id == similar_id:
+                     similar_content = p.payload.get('content', '')
+                     break
+
+             if not similar_content:
+                 continue
+
+             # Calculate text difference percentage
+             diff_percentage = calculate_text_difference_percentage(point_content, similar_content)
+
+             # If difference is less than threshold, consider it a duplicate
+             if diff_percentage <= DIFFERENCE_THRESHOLD:
+                 duplicate_ids.append(similar_id)
+                 processed_points.add(similar_id)
+                 logger.debug(f"Found duplicate: {similar_id} (diff: {diff_percentage:.2%})")
+
+         if duplicate_ids:
+             duplicates.append((point_id, duplicate_ids))
+             processed_points.add(point_id)
+
+     logger.info(f"Found {sum(len(dups) for _, dups in duplicates)} duplicate entries in {len(duplicates)} groups")
+     return duplicates
+
+ def remove_duplicates(client: QdrantClient, collection_name: str = config.embedding.collection_name) -> int:
+     """
+     Remove duplicate entries from the database based on semantic similarity and text difference.
+     Uses a two-step verification process:
+     1. Find entries with semantic similarity above SIMILARITY_THRESHOLD (using vector search)
+     2. For those entries, keep only those with text difference at or below DIFFERENCE_THRESHOLD
+
+     Args:
+         client: Initialized Qdrant client
+         collection_name: Name of the collection to query
+
+     Returns:
+         Number of removed duplicate entries
+     """
+     # Find duplicates
+     duplicate_groups = find_duplicates(client, collection_name)
+
+     if not duplicate_groups:
+         logger.info("No duplicates found")
+         return 0
+
+     # Count total duplicates
+     total_duplicates = sum(len(dups) for _, dups in duplicate_groups)
+
+     # Remove duplicates
+     for _, duplicate_ids in duplicate_groups:
+         if duplicate_ids:
+             client.delete(
+                 collection_name=collection_name,
+                 points_selector=PointIdsList(points=duplicate_ids)
+             )
+
+     logger.info(f"Removed {total_duplicates} duplicate entries from the database")
+     return total_duplicates
src/fabrics_processor/logger.py CHANGED
@@ -42,8 +42,8 @@ def setup_logger(log_file='fabric_to_espanso.log'):
      console_handler = logging.StreamHandler()

      # Set log levels
-     file_handler.setLevel(logging.INFO)
-     console_handler.setLevel(logging.INFO)
+     file_handler.setLevel(logging.DEBUG)
+     console_handler.setLevel(logging.DEBUG)

      # Create formatters and add it to handlers
      file_format = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
src/fabrics_processor/output_files_generator.py CHANGED
@@ -24,6 +24,7 @@ def repr_block_string(dumper: yaml.Dumper, data: BlockString) -> yaml.ScalarNode

  yaml.add_representer(BlockString, repr_block_string)

+ # TODO: Remove duplicates before exporting the contents of the database to YAML files
  def generate_yaml_file(client: QdrantClient, collection_name: str, yaml_output_folder: str) -> None:
      """Generate a complete YAML file from the Qdrant database.

@@ -78,6 +79,7 @@ def generate_yaml_file(client: QdrantClient, collection_name: str, yaml_output_folder: str) -> None:
          raise
      raise RuntimeError(f"Unexpected error generating YAML: {str(e)}") from e

+ # TODO: Remove duplicates before exporting the contents of the database to markdown files
  def generate_markdown_files(client: QdrantClient, collection_name: str, markdown_output_folder: str) -> None:
      """Generate markdown files from the Qdrant database.

src/search_qdrant/LOG_FILE ADDED
The diff for this file is too large to render. See raw diff
369
+ fabric_to_espanso - INFO - Processed: identify_dsrp_distinctions
370
+ fabric_to_espanso - INFO - Processed: compare_two_documents-own
371
+ fabric_to_espanso - INFO - Processed: extract_controversial_ideas
372
+ fabric_to_espanso - INFO - Processed: create_tags
373
+ fabric_to_espanso - INFO - Processed: review_design
374
+ fabric_to_espanso - WARNING - No sections extracted from /home/jelle/.config/fabric/patterns/start_in_depth_discussion_with_functions-own/system.md
375
+ fabric_to_espanso - INFO - Processed: start_in_depth_discussion_with_functions-own
376
+ fabric_to_espanso - INFO - Processed: create_art_prompt
377
+ fabric_to_espanso - INFO - Processed: analyze_patent
378
+ fabric_to_espanso - INFO - Processed: identify_dsrp_relationships
379
+ fabric_to_espanso - INFO - Processed: analyze_cfp_submission
380
+ fabric_to_espanso - INFO - Processed: create_mermaid_visualization_for_github
381
+ fabric_to_espanso - INFO - Processed: create_graph_from_input
382
+ fabric_to_espanso - INFO - Processed: extract_main_idea
383
+ fabric_to_espanso - INFO - Processed: extract_latest_video
384
+ fabric_to_espanso - INFO - Processed: extract_core_message
385
+ fabric_to_espanso - INFO - Processed: extract_jokes
386
+ fabric_to_espanso - INFO - Processed: create_academic_paper
387
+ fabric_to_espanso - INFO - Processed: create_reading_plan
388
+ fabric_to_espanso - INFO - Processed: analyze_risk
389
+ fabric_to_espanso - INFO - Processed: improve_report_finding
390
+ fabric_to_espanso - INFO - Processed: explain_math
391
+ fabric_to_espanso - INFO - Processed: summarize_git_changes
392
+ fabric_to_espanso - INFO - Processed: recommend_talkpanel_topics
393
+ fabric_to_espanso - INFO - Processed: extract_predictions
394
+ fabric_to_espanso - INFO - Processed: extract_primary_solution
395
+ fabric_to_espanso - INFO - Processed: extract_videoid
396
+ fabric_to_espanso - INFO - Processed: create_show_intro
397
+ fabric_to_espanso - INFO - Processed: summarize_git_diff
398
+ fabric_to_espanso - INFO - Processed: create_quiz
399
+ fabric_to_espanso - INFO - Processed: write_semgrep_rule
400
+ fabric_to_espanso - INFO - Processed: write_hackerone_report
401
+ fabric_to_espanso - INFO - Processed: summarize_micro
402
+ fabric_to_espanso - INFO - Processed: create_ai_jobs_analysis
403
+ fabric_to_espanso - INFO - Processed: create_pattern
404
+ fabric_to_espanso - INFO - Processed: capture_thinkers_work
405
+ fabric_to_espanso - INFO - Processed: analyze_prose_pinker
406
+ fabric_to_espanso - INFO - Processed: create_threat_scenarios
407
+ fabric_to_espanso - INFO - Processed: extract_ctf_writeup
408
+ fabric_to_espanso - INFO - Processed: ai
409
+ fabric_to_espanso - INFO - Processed: rate_ai_response
410
+ fabric_to_espanso - INFO - Processed: create_prd
411
+ fabric_to_espanso - INFO - Processed: clean_text
412
+ fabric_to_espanso - INFO - Processed: create_video_chapters
413
+ fabric_to_espanso - INFO - Processed: summarize_lecture
414
+ fabric_to_espanso - INFO - Processed: identify_dsrp_perspectives
415
+ fabric_to_espanso - INFO - Processed: recommend_artists
416
+ fabric_to_espanso - INFO - Processed: extract_ideas
417
+ fabric_to_espanso - WARNING - No sections extracted from /home/jelle/.config/fabric/patterns/solveitwithcode_review_repl_driven_process-own/system.md
418
+ fabric_to_espanso - INFO - Processed: solveitwithcode_review_repl_driven_process-own
419
+ fabric_to_espanso - INFO - Processed: to_flashcards
420
+ fabric_to_espanso - WARNING - No sections extracted from /home/jelle/.config/fabric/patterns/extract_instructions/system.md
421
+ fabric_to_espanso - INFO - Processed: extract_instructions
422
+ fabric_to_espanso - INFO - Processed: write_micro_essay
423
+ fabric_to_espanso - INFO - Processed: extract_primary_problem
424
+ fabric_to_espanso - INFO - Processed: create_hormozi_offer
425
+ fabric_to_espanso - INFO - Processed: analyze_prose
426
+ fabric_to_espanso - WARNING - No sections extracted from /home/jelle/.config/fabric/patterns/solveitwithcode_review_repl_driven_process_detailed-own/system.md
427
+ fabric_to_espanso - INFO - Processed: solveitwithcode_review_repl_driven_process_detailed-own
428
+ fabric_to_espanso - INFO - Processed: analyze_logs
429
+ fabric_to_espanso - INFO - Processed: create_recursive_outline
430
+ fabric_to_espanso - INFO - Processed: create_image_prompt_from_book_extract-own
431
+ fabric_to_espanso - INFO - Processed: analyze_tech_impact
432
+ fabric_to_espanso - INFO - Processed: find_hidden_message
433
+ fabric_to_espanso - INFO - Processed: create_npc
434
+ fabric_to_espanso - INFO - Processed: provide_guidance
435
+ fabric_to_espanso - INFO - Processed: export_data_as_csv
436
+ fabric_to_espanso - INFO - Processed: show_fabric_options_markmap
437
+ fabric_to_espanso - INFO - Processed: summarize_debate
438
+ fabric_to_espanso - INFO - Processed: answer_interview_question
439
+ fabric_to_espanso - INFO - Processed: extract_poc
440
+ fabric_to_espanso - INFO - Processed: rate_content
441
+ fabric_to_espanso - INFO - Processed: create_diy
442
+ fabric_to_espanso - INFO - Processed: create_idea_compass
443
+ fabric_to_espanso - INFO - Processed: create_security_update
444
+ fabric_to_espanso - INFO - Processed: extract_recommendations
445
+ fabric_to_espanso - WARNING - No sections extracted from /home/jelle/.config/fabric/patterns/md_callout/system.md
446
+ fabric_to_espanso - INFO - Processed: md_callout
447
+ fabric_to_espanso - INFO - Processed: analyze_threat_report
448
+ fabric_to_espanso - INFO - Processed: dialog_with_socrates
449
+ fabric_to_espanso - INFO - Processed: summarize_newsletter
450
+ fabric_to_espanso - INFO - Processed: create_mermaid_visualization
451
+ fabric_to_espanso - INFO - Processed: analyze_comments
452
+ fabric_to_espanso - INFO - Processed: summarize
453
+ fabric_to_espanso - INFO - Processed: compare_and_contrast
454
+ fabric_to_espanso - INFO - Successfully processed 198 files in fabric patterns folder
455
+ fabric_to_espanso - INFO - Changes detected: 0 new, 0 modified, 0 deleted
456
+ fabric_to_espanso - INFO - Database update completed successfully
457
+ fabric_to_espanso - INFO - YAML file generated successfully at /mnt/c/Users/barle/AppData/Roaming/espanso/match/fabric_patterns.yml
458
+ fabric_to_espanso - INFO - Generated 198 Markdown files generated successfully at /mnt/c/Obsidian/BrainCave/Extra/textgenerator/templates/fabric
459
+ fabric_to_espanso - INFO - Collection fabric_patterns_hybrid ready with 198 points
460
+ Generating YAML file...
461
+ Generating markdown files...
462
+ Generating YAML file...
463
+ Generating markdown files...
464
+ Stopping...
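
The run above ends with `Collection fabric_patterns_hybrid ready with 198 points`, matching the 198 processed files. A quick check along these lines can reproduce that count and inspect the collection's actual vector names; this is a minimal sketch, assuming a local Qdrant instance on the default port (the host, port, and verification step are assumptions, not code from this repository):

```python
from qdrant_client import QdrantClient

# Connect to a local Qdrant instance (host/port are assumptions; adjust to your setup)
client = QdrantClient(host="localhost", port=6333)

collection_name = "fabric_patterns_hybrid"  # name taken from the log above

# exact=True forces a full count instead of Qdrant's default approximate estimate
point_count = client.count(collection_name=collection_name, exact=True).count
print(f"Collection {collection_name} contains {point_count} points")

# The same CollectionInfo object exposes the actual vector names used by the
# collection, keyed exactly as FastEmbed registered them
info = client.get_collection(collection_name)
print(info.config.params.vectors)         # dense vector configs, keyed by name
print(info.config.params.sparse_vectors)  # sparse vector configs, keyed by name
```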
src/search_qdrant/database_query.py CHANGED
@@ -5,6 +5,10 @@ from qdrant_client.models import QueryResponse
  import argparse
  from src.fabrics_processor.config import config
 
+ # TODO: Use reranking to get even better search results
+ # TODO: Add an option to monitor the quality of the search responses with thumbs up/down feedback
+ # Store evaluations in an SQLite database with the query, the returned prompt, and the evaluation (up/down)
+ # This will create a database of good and bad examples to improve the search model
  def query_qdrant_database(
      query: str,
      client: QdrantClient,
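
The new TODO block describes a feedback loop: log each query, the prompt it returned, and a thumbs up/down judgment into SQLite, building a corpus of good and bad examples for later evaluation of the search model. A minimal sketch of such a store, following the plan in the comments (the file path, table name, schema, and helper names are hypothetical illustrations, not part of this repository):

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical location for the feedback database described in the TODOs above
DB_PATH = "search_feedback.db"

def init_feedback_db(db_path: str = DB_PATH) -> None:
    """Create the feedback table if it does not exist yet."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            """
            CREATE TABLE IF NOT EXISTS search_feedback (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                query TEXT NOT NULL,
                returned_prompt TEXT NOT NULL,
                evaluation TEXT NOT NULL CHECK (evaluation IN ('up', 'down')),
                created_at TEXT NOT NULL
            )
            """
        )

def record_feedback(query: str, returned_prompt: str, evaluation: str,
                    db_path: str = DB_PATH) -> None:
    """Store one thumbs up/down judgment for a query/result pair."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "INSERT INTO search_feedback (query, returned_prompt, evaluation, created_at) "
            "VALUES (?, ?, ?, ?)",
            (query, returned_prompt, evaluation,
             datetime.now(timezone.utc).isoformat()),
        )
```

Rows collected this way can later be exported as positive/negative examples, which is the "database of good and bad examples" the comment refers to.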
src/search_qdrant/run_streamlit_terminal_visible.sh CHANGED
@@ -1,13 +1,13 @@
  #!/bin/bash
 
  # Add the project root to PYTHONPATH
- export PYTHONPATH="/home/jelle/Tools/pythagora-core/workspace/fabric_to_espanso:$PYTHONPATH"
+ export PYTHONPATH="/home/jelle/code/fabric_to_espanso:$PYTHONPATH"
 
  # Create a log directory if it doesn't exist
- LOG_DIR="/home/jelle/Tools/pythagora-core/workspace/fabric_to_espanso/logs"
+ LOG_DIR="/home/jelle/code/fabric_to_espanso/logs"
  mkdir -p "$LOG_DIR"
  LOG_FILE="$LOG_DIR/streamlit.log"
 
  # Run the streamlit app
  echo "Starting Streamlit app..."
- /home/jelle/Tools/pythagora-core/workspace/fabric_to_espanso/.venv/bin/streamlit run ~/Tools/pythagora-core/workspace/fabric_to_espanso/src/search_qdrant/streamlit_app.py
+ /home/jelle/code/fabric_to_espanso/.venv/bin/streamlit run ~/code/fabric_to_espanso/src/search_qdrant/streamlit_app.py
src/search_qdrant/streamlit_app.py CHANGED
@@ -11,6 +11,7 @@ from src.fabrics_processor.logger import setup_logger
  import logging
  import atexit
  from src.fabrics_processor.config import config
+ from src.fabrics_processor.deduplicator import remove_duplicates
 
  # Configure logging
  logger = setup_logger()
@@ -156,14 +157,32 @@ def update_database():
          fabric_patterns_folder=config.fabric_patterns_folder
      )
 
-     # Update the database
-     update_qdrant_database(
-         client=st.session_state.client,
-         collection_name=config.embedding.collection_name,
-         new_files=new_files,
-         modified_files=modified_files,
-         deleted_files=deleted_files
-     )
+     # Update the database if there are changes
+     if any([new_files, modified_files, deleted_files]):
+         st.info("Changes detected. Updating database...")
+         update_qdrant_database(
+             client=st.session_state.client,
+             collection_name=config.embedding.collection_name,
+             new_files=new_files,
+             modified_files=modified_files,
+             deleted_files=deleted_files
+         )
+     else:
+         st.info("No changes detected in input folders.")
+
+     # Create a separate section for deduplication - ALWAYS run this regardless of file changes
+     st.subheader("Deduplication Process")
+     with st.spinner("Checking for and removing duplicate entries..."):
+         # Run the deduplication process
+         duplicates_removed = remove_duplicates(
+             client=st.session_state.client,
+             collection_name=config.embedding.collection_name
+         )
+
+         if duplicates_removed > 0:
+             st.success(f"Successfully removed {duplicates_removed} duplicate entries from the database")
+         else:
+             st.info("No duplicate entries found in the database")
 
      # Get updated collection info
      collection_info = st.session_state.client.get_collection(config.embedding.collection_name)
@@ -177,6 +196,7 @@ def update_database():
  - {len(new_files)} new files
  - {len(modified_files)} modified files
  - {len(deleted_files)} deleted files
+ - {duplicates_removed} duplicate entries removed
 
  Database entries:
  - Initial: {initial_points}
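
The diff imports `remove_duplicates` from `src.fabrics_processor.deduplicator` and calls it with a client and a collection name, expecting the number of removed entries back. The module itself is not shown in this commit; purely to illustrate that contract, a deduplicator with the same signature might scroll the collection and delete points that share a payload key. This is a sketch assuming duplicates are identified by an identical `filename` payload field, which may not match the real criterion:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import PointIdsList

def remove_duplicates(client: QdrantClient, collection_name: str) -> int:
    """Sketch: delete points whose 'filename' payload was already seen.

    The duplicate criterion ('filename') is an assumption for illustration;
    the real src.fabrics_processor.deduplicator may key on something else.
    """
    seen: set[str] = set()
    duplicate_ids = []
    offset = None

    while True:
        points, offset = client.scroll(
            collection_name=collection_name,
            limit=100,
            offset=offset,
            with_payload=True,
            with_vectors=False,
        )
        for point in points:
            key = (point.payload or {}).get("filename")
            if key is None:
                continue
            if key in seen:
                # Keep the first occurrence, mark the rest for deletion
                duplicate_ids.append(point.id)
            else:
                seen.add(key)
        if offset is None:
            break

    if duplicate_ids:
        client.delete(
            collection_name=collection_name,
            points_selector=PointIdsList(points=duplicate_ids),
        )
    return len(duplicate_ids)
```

Keeping the first occurrence and batching the removals into a single `delete` call keeps the pass at one scroll over the collection, which matches the "always run" placement in the Streamlit flow above.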