Harpyx: Open-Source Platform for RAG-Based Document Intelligence

A practical solution to search, analyze, and query large document collections without losing control over data, infrastructure, and LLM providers


Over the last few months, I've worked with several companies that were looking to use AI for something more substantial than text generation: making sense of their documents. Specifications, technical notes, legal material, project documentation, exported tickets, meeting notes, and PDFs collected over the years: the kind of material every team has somewhere, usually scattered across folders, repositories, inboxes, shared drives, and half-forgotten archives.

Tools like NotebookLM made that workflow feel immediately natural: upload a few sources, ask questions, generate summaries, compare documents, and extract useful context. It is one of those ideas that makes sense the first time you try it, because it matches a real need: using an AI model not only to invent answers from its general training data, but also to reason over the material we actually care about. This works well enough with a handful of clean files. The problems start when the collection begins to resemble a real archive: larger, less curated, full of mixed formats, imperfect documents, sensitive material, old exports, duplicates, and all the other things teams actually deal with every day. At that point, the assumptions of a closed, notebook-style product begin to show their limits. The issue is not only technical. In professional environments, documents often come with constraints: personal data, confidential material, retention policies, access rules, audit requirements, data governance processes, and, in many cases, GDPR obligations that cannot be treated as an afterthought.

After running into those limits more than once, I felt I needed a different approach: more control over ingestion, storage, indexing, retrieval, providers, tenants, quotas, and data lifecycle. I wanted a system that could be customized and adapted to the kind of workloads I see in professional environments, where document intelligence is not just a clever interface over a few uploaded files, but part of a broader architecture of security, compliance, governance, and operational control.

That is why I started building Harpyx: an open-source, MIT-licensed platform for document intelligence, retrieval-augmented generation, and AI-assisted knowledge work. It can be used as a self-hosted solution or through a (currently free) SaaS offering, and it is built for teams that need to work with large, varied, and sometimes sensitive document collections without losing control over how data is stored, processed, and connected to LLM providers.

The idea is not to clone a consumer-facing NotebookLM experience feature by feature: Harpyx is meant to give teams a solid, transparent foundation for real-world document intelligence workflows without giving up control over their data and infrastructure.

What Is Harpyx

The main idea behind Harpyx is simple: document-grounded AI becomes much more useful when it can deal with the messy perimeter of real documents.

In many companies, knowledge is not stored in one clean repository. It is scattered across PDFs, Word files, Excel spreadsheets, PowerPoint decks, Markdown notes, HTML exports, JSON files, email messages, compressed archives, and scanned material. Some of it is current, some of it is obsolete, some of it is duplicated, and a surprising amount of it is still valuable if you can retrieve it at the right moment.

I did not want Harpyx to be limited by the assumption that a project contains a small number of curated sources. That assumption works for lightweight research, but it breaks down quickly when you start thinking about technical departments, legal teams, compliance workflows, customer support knowledge bases, internal documentation, or long-running software projects.

Harpyx is built around that wider perimeter. It treats ingestion, extraction, chunking, storage, and retrieval as first-class parts of the system, not as details hidden behind an upload button.

Built for Large Document Collections

Harpyx is organized around workspaces, projects, background ingestion, document chunking, vector indexing, and retrieval. The architecture is meant to support collections that can grow over time, instead of assuming that users will keep their sources small and tidy.

In practical terms, the effective scale of a Harpyx installation depends on the infrastructure behind it: SQL Server for relational data, MinIO-compatible object storage for files, OpenSearch for search workloads, Redis and RabbitMQ for coordination and background processing, plus the worker capacity assigned to ingestion and indexing.

This makes scale an operational decision rather than a product limit. Instead of being capped by a fixed SaaS quota, the operator can scale the services that matter for the installation: storage, search, queues, workers, and model usage.
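
The components above map naturally onto a container stack. As an illustration only, a minimal docker-compose sketch for such a stack might look like the following; the image tags, credentials, and service names are assumptions for this example, not Harpyx's official deployment manifest:

```yaml
# Hypothetical single-node stack mirroring the components named above.
services:
  sqlserver:
    image: mcr.microsoft.com/mssql/server:2022-latest
    environment:
      ACCEPT_EULA: "Y"
      MSSQL_SA_PASSWORD: "ChangeMe_Str0ng!"   # placeholder, set your own
  minio:
    image: minio/minio
    command: server /data
  opensearch:
    image: opensearchproject/opensearch:2
    environment:
      discovery.type: single-node
  redis:
    image: redis:7
  rabbitmq:
    image: rabbitmq:3-management
```

A production deployment would of course split these across hosts, add volumes, networks, health checks, and secrets management, and size each service to its actual workload.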

That also makes Harpyx more honest as a platform. Large-scale document intelligence is never magic: it is the result of storage decisions, indexing strategies, queue management, model costs, chunking quality, and operational monitoring. Harpyx tries to expose those moving parts in a way that developers and administrators can understand and improve.

More Than PDFs

A serious document intelligence platform cannot stop at PDFs. PDF support is important, of course, but most real repositories contain much more than that. Harpyx is designed to handle a broad set of business and technical formats, including:

  • PDF documents, including material intended for OCR-oriented workflows;
  • plain text, Markdown and HTML files;
  • Word, Excel and PowerPoint documents;
  • OpenDocument formats;
  • CSV, JSON, XML and YAML files;
  • email files and exported communication records;
  • compressed archives containing multiple documents;
  • image formats that can be processed through OCR pipelines when needed.

Uploaded documents are validated, stored, processed asynchronously, extracted into text, split into searchable chunks, and prepared for retrieval. Long files can be broken down into smaller semantic units, so that answers are based on the most relevant passages rather than on an oversized and unfocused context.

This part matters more than it may seem at first. In document AI projects, the quality of the final answer depends heavily on everything that happens before the prompt reaches the model. Bad extraction, poor chunking, or weak retrieval will produce weak answers, even with an excellent LLM behind the scenes.
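
To make that pre-model stage concrete, here is a minimal, paragraph-based chunker in Python. It is a sketch of the general technique, not Harpyx's actual implementation; the size limit and overlap value are arbitrary illustration choices.

```python
def chunk_text(text: str, max_chars: int = 1200, overlap: int = 150) -> list[str]:
    """Split extracted text into overlapping chunks, preferring
    paragraph boundaries so each chunk stays semantically coherent."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the limit.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            # Carry a small tail of the previous chunk as overlap,
            # so context is not cut sharply at the boundary.
            current = current[-overlap:] + "\n\n" + para
        else:
            current = (current + "\n\n" + para) if current else para
    if current:
        chunks.append(current)
    return chunks
```

Real pipelines add more: sentence-aware splitting, token (not character) budgets, and per-format handling for tables or code. But even this toy version shows why chunking quality shapes answer quality: whatever ends up in a chunk is the only thing retrieval can ever surface.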

RAG by Design

Harpyx uses retrieval-augmented generation as a core design principle.

When a user asks a question, Harpyx does not simply forward the prompt to a model: it searches the selected project documents, retrieves the most relevant chunks, and sends that context to the configured chat model. The answer is therefore grounded in the uploaded material, instead of relying only on the model's general knowledge.

The RAG pipeline is designed to support the full flow:

  • document upload and background ingestion;
  • text extraction from different file formats;
  • chunking strategies suitable for long and heterogeneous documents;
  • embedding generation, to make extracted content searchable through semantic similarity;
  • vector and keyword retrieval, with support for both semantic search and exact or term-based matches;
  • project-level chat over selected document collections;
  • source-aware context selection, so users can understand where an answer comes from.
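
To make the retrieval step concrete, here is a toy hybrid scorer that blends keyword overlap with vector similarity, in the spirit of the combined vector-and-keyword retrieval listed above. The scoring formula and weights are illustrative placeholders, not the actual OpenSearch queries Harpyx runs.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query: str, chunk: str) -> float:
    """Fraction of query terms that appear verbatim in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def hybrid_search(query, query_vec, chunks, alpha=0.5, top_k=3):
    """Rank chunks by a weighted blend of semantic and keyword relevance.
    `chunks` is a list of (text, embedding) pairs; `alpha` weights the
    semantic side. Real systems usually normalize scores per retriever."""
    scored = []
    for text, vec in chunks:
        score = alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query, text)
        scored.append((score, text))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [text for _, text in scored[:top_k]]
```

The selected chunks, together with their source references, are what actually reaches the chat model as context, which is why source-aware selection falls naturally out of this design: the system always knows which document each passage came from.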

My goal with Harpyx is to keep this pipeline understandable. RAG systems can become opaque very quickly, especially when they are packaged as a feature rather than designed as infrastructure. Harpyx takes the opposite route: ingestion, indexing, and retrieval are explicit parts of the application, because those are the parts developers usually need to tune when the first real dataset arrives.

Flexible LLM Provider Support

Harpyx is not tied to a single AI provider: the platform follows a bring-your-own-key model and can be configured to work with different providers and model capabilities, including chat, embeddings and OCR-oriented workflows. This allows teams to choose the provider that best fits their privacy requirements, cost profile, latency expectations, and compliance constraints.

For organizations with stricter operational needs, hosted models can also be configured centrally by administrators. That distinction is useful in practice: a small team may want maximum flexibility, while a company may prefer a controlled list of approved providers and models.
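
A bring-your-own-key design of this kind usually boils down to a small provider interface plus a registry keyed by configuration. The sketch below shows the general shape in Python; the class and method names are invented for illustration and do not mirror Harpyx's internal API.

```python
from dataclasses import dataclass
from typing import Callable, Protocol

class ChatProvider(Protocol):
    """Anything that can answer a prompt given retrieved context."""
    def chat(self, prompt: str, context: list[str]) -> str: ...

@dataclass
class ProviderConfig:
    name: str      # e.g. "openai", "local" (hypothetical identifiers)
    api_key: str   # supplied by the team, never hard-coded
    model: str

# Registry mapping provider names to factories. A centrally managed
# deployment would simply restrict which names get registered.
_REGISTRY: dict[str, Callable[[ProviderConfig], ChatProvider]] = {}

def register(name: str):
    def wrap(factory):
        _REGISTRY[name] = factory
        return factory
    return wrap

def build_provider(cfg: ProviderConfig) -> ChatProvider:
    try:
        return _REGISTRY[cfg.name](cfg)
    except KeyError:
        raise ValueError(f"provider '{cfg.name}' is not enabled") from None
```

The useful property of this shape is that "which providers are allowed" becomes a configuration and registration question, not a code change, which is exactly the distinction between a flexible team setup and an admin-controlled one.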

This is one of the areas where self-hosting makes a real difference. The AI provider is only one part of the system; document storage, metadata, user access, retention policies, and auditability matter just as much when the platform is used with sensitive or business-critical material.

Self-Hosted and Operator-Controlled

One of the main reasons Harpyx exists is control. With a self-hosted installation, the operator decides where the data lives, how it is backed up, which users can access it, which LLM providers are available, and how the surrounding infrastructure is monitored. That control is essential for teams working with legal documents, internal procedures, customer records, technical repositories, or regulated data.

More specifically, Harpyx is designed so that operators can control:

  • where documents and extracted content are stored;
  • which infrastructure components are used for storage, search, and background processing;
  • which LLM providers and models are enabled;
  • how users, tenants, workspaces, and projects are organized;
  • how document retention, encryption, backups, and disaster recovery are handled;
  • how logs, observability, and audit trails are managed.

In many document AI scenarios, this level of control is the difference between a useful prototype and something that can be evaluated seriously inside an organization.

Open Source, MIT-Licensed

Harpyx is completely open source and released under the MIT License. That choice is intentional: I wanted developers and organizations to be able to inspect the code, deploy the platform independently, adapt it to their workflows, integrate it with internal systems, and contribute improvements back to the project.

Document intelligence platforms should not be mysterious about how documents are stored, processed, indexed, and retrieved. When a system is used to reason over important material, transparency matters: developers need to understand the pipeline, security teams need to evaluate the architecture, and operators need to know what happens when something fails.

The MIT license keeps the project easy to adopt, including in professional and enterprise contexts where licensing uncertainty can stop an otherwise useful tool before it even reaches a technical evaluation.

Who Harpyx Is For

Harpyx is a good fit for teams that like the idea of document-grounded AI, but need more control than a closed SaaS workflow can offer.

Typical use cases include:

  • AI knowledge bases built over large document repositories;
  • technical teams that need to query specifications, manuals, tickets, release notes and architectural documents;
  • legal, compliance and research workflows where source grounding and data location matter;
  • developers building custom RAG applications on top of existing infrastructure;
  • organizations that want to analyze large document collections without sending everything to a black-box platform;
  • open-source communities looking for a transparent foundation for document search and AI-assisted analysis.

Harpyx is especially useful when the document set is too large, too varied, too sensitive, or too operationally important for a lightweight notebook workflow.

Current Status

Harpyx is still a young project, and that is part of the reason I wanted to release it as open source early.

There is a lot of work to do: improving ingestion quality, refining chunking strategies, expanding provider support, strengthening administrative features, polishing the user experience and testing the platform against increasingly complex real-world datasets. At the same time, the foundation is already clear: self-hosted document intelligence, RAG-first architecture, broad file support, and operator control.

Releasing Harpyx now means making that direction visible and inviting feedback from people who have similar problems. I am particularly interested in use cases that stress the system in ways that polished demos usually do not: large archives, mixed formats, imperfect files, long-running projects, strict data boundaries and teams that need to understand why an answer was produced.

Conclusion

Notebook-style AI tools have shown how useful it is to interact with documents through natural language. Harpyx takes that idea into a more open and operator-controlled direction, with an architecture designed for larger collections, broader file support and real deployment needs.

For me, the project grew out of a simple need: I wanted a document intelligence platform that I could run, inspect, extend and adapt without giving up control over the documents themselves. Harpyx is my attempt to build that platform in the open.

If you are working with large document collections, experimenting with RAG, or looking for a self-hosted alternative to closed document AI tools, Harpyx is worth following, testing and, hopefully, improving together.


About Ryan

IT Project Manager, Web Interface Architect and Lead Developer for many high-traffic web sites and services hosted in Italy and Europe. Since 2010 he has also been a lead designer of apps and games for Android, iOS and Windows Phone for a number of Italian companies. Microsoft MVP for Development Technologies since 2018.
