Now imagine this as a hashtag generator, so that a RAG search can find great context. :)
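The hashtag idea can be sketched very simply: treat each document's generated tags as retrieval keys and rank by tag overlap with the query's tags. Everything below is a hypothetical toy (the function names, the index, and the tags are all made up for illustration); a real setup would have a model generate the tag sets.

```python
# Toy sketch: hashtags as lightweight retrieval keys for a RAG search.
# In practice an LLM would generate the tag set for each document and query.

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two tag sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def search(query_tags: set[str], index: dict[str, set[str]], k: int = 3) -> list[str]:
    """Rank documents by tag overlap with the query's tags."""
    ranked = sorted(index.items(), key=lambda kv: jaccard(query_tags, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Hypothetical tag index.
index = {
    "merge-notes": {"#modelmerging", "#qwen", "#lamarck"},
    "rag-pipeline": {"#rag", "#embeddings", "#context"},
    "recipe-blog": {"#cooking", "#sourdough"},
}

print(search({"#rag", "#context"}, index, k=1))  # → ['rag-pipeline']
```

Tag overlap is crude next to dense embeddings, but it is cheap, explainable, and works offline, which fits the small-local-cache theme.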
sometimesanotion
Neat! I've transitioned from wanting more from a model's one-shot answers to breaking things down and walking through the problem with cached context. This effectively means simulating most of the thinking block, but via tool use and RAG.
I'm happily using our models from months ago to do it. If anything, even Lamarck 0.7's use of thinking blocks is a bit much. I'm using Lamarck 0.7 Fusion (my best GPQA model, though it didn't break your record; it's best used where modest IFEVAL isn't a blocker) and /nothink with ValiantLab's Qwen3 models in concert.
I suspect I'll try some merges soon to give this toolchain better models, leaderboard or no leaderboard!
I've been using Esper3 8B and 14B for first-pass code review. I am quite pleased.
Have you considered fine-tuning a 1.7B or smaller model for autocomplete?
I've been thinking a lot about using small caches of embeddings for local RAG lately. Have you considered an HTTP caching proxy like Squid as a low-impact source? It would retrieve what a user is reading anyway, and what's in their field of interest. A browser extension to signal some limited ingestion when a page is bookmarked might fit a lot of use cases.
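The bookmark-signal idea can be sketched as a tiny local ingestion path: when the extension fires, the page's text is turned into a vector and stored keyed by URL for later similarity lookups. This is a minimal sketch under stated assumptions: `PageCache`, `ingest`, and the hashed bag-of-words "embedding" are all hypothetical stand-ins (a real pipeline would use an actual embedding model, and would pull page bodies from the proxy or the extension rather than a string argument).

```python
# Sketch of a local embedding cache fed by a bookmark signal.
# The hashed bag-of-words vector below is a toy stand-in for a real
# embedding model; all class and function names are hypothetical.
import math
import re

DIM = 1024  # fixed dimension for the hashed vectors

def embed(text: str) -> list[float]:
    """Hash each token into a fixed-size, L2-normalized vector."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[hash(token) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

class PageCache:
    """Tiny local store keyed by URL, as a bookmark handler might use."""

    def __init__(self) -> None:
        self.pages: dict[str, list[float]] = {}

    def ingest(self, url: str, text: str) -> None:
        """Called when the user bookmarks a page."""
        self.pages[url] = embed(text)

    def query(self, text: str, k: int = 3) -> list[str]:
        """Return the k cached URLs most similar to the query text."""
        q = embed(text)
        scored = sorted(
            self.pages.items(),
            key=lambda kv: sum(a * b for a, b in zip(q, kv[1])),
            reverse=True,
        )
        return [url for url, _ in scored[:k]]
```

The appeal of the proxy/bookmark route is that ingestion stays low-impact: nothing is fetched that the user wasn't already reading, and the cache naturally tracks their field of interest.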
For many reasons, smart management of context windows is my top priority with AI now!
