Building an AI Agent System with Hermes: A Remote Architecture Guide

How I set up Hermes AI agents to run on a VPS while connecting to a home LLM backend via Tailscale. Complete setup guide with architecture diagrams.

  • AI Agents
  • Hermes
  • LLM
  • Remote Setup
  • Tailscale
  • Telegram

Note: This article was written using the Hermes agent itself via Telegram — the same system described here.

Introduction

Autonomous AI agents have transformed how I approach development tasks. After evaluating several frameworks, I chose Hermes — an open-source agent framework that combines powerful LLM reasoning with practical tool integrations.

The core challenge: I wanted 24/7 availability and reliable network access from a VPS, while keeping my expensive GPU hardware at home for cost efficiency. This guide documents my complete setup and architecture decisions, and provides step-by-step instructions for replicating it.

System Architecture

This setup combines a reliable VPS for 24/7 availability with cost-effective home GPU inference. Here’s how the components interconnect:

┌─────────────────────────────────────────────────────────────────────────────┐
│                              HOME NETWORK (LAN)                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                        GPU Server (RTX 3090 24GB)                       │  │
│  │  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────────┐    │  │
│  │  │   llama.cpp     │  │   Quantized     │  │   Local Storage &   │    │  │
│  │  │   Server        │◄─┤   LLM Models    │◄─┤   Model Weights     │    │  │
│  │  │  (port 8000)    │  │   (GGUF)        │  │                     │    │  │
│  │  └────────┬────────┘  └─────────────────┘  └─────────────────────┘    │  │
│  │           │              API Keys                               │    │  │
│  │           │                                                            │  │
│  └───────────┼────────────────────────────────────────────────────────────┘  │
│              │                                                                │
└──────────────┼────────────────────────────────────────────────────────────────┘

               │ Tailscale Encrypted Tunnel
               │ (100.64.x.x network)

┌──────────────┼────────────────────────────────────────────────────────────────┐
│              ▼                          VPS (Remote Server)                    │
│  ┌─────────────────────────────────────────────────────────────────────────────┐│
│  │                         Hermes Agent Gateway                                 ││
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌────────────────┐  ││
│  │  │   Telegram   │  │    Tool      │  │    Cron      │  │   Session      │  ││
│  │  │   Bot API    │  │   Executor   │  │   Scheduler  │  │   Manager      │  ││
│  │  │              │  │              │  │              │  │                │  ││
│  │  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘  └───────┬────────┘  ││
│  │         │                 │                  │                  │           ││
│  │         └─────────────────┬┴──────────────────┴──────────────────┘           ││
│  │                           │                                                  ││
│  │                    ┌──────▼──────┐                                         ││
│  │                    │   LLM Client  │─────────────────┐                       ││
│  │                    │  (OpenAI API) │                  │                       ││
│  │                    └──────────────┘                  │                       ││
│  └──────────────────────────────────────────────────────┼───────────────────────┘│
│                                                         │                        │
│  ┌─────────────────┐  ┌─────────────────────────────────┼───────────────────────┐│
│  │  Tools Available│  │                                 │                       ││
│  │  ───────────────│  │        External Services        │                       ││
│  │  • Terminal     │  │  ┌──────────────┐  ┌─────────────▼─────────────┐        ││
│  │  • File I/O     │  │  │   GitHub     │  │         Internet         │        ││
│  │  • Web Search   │  │  │     API      │  │   (Web Scraping/APIs)    │        ││
│  │  • Browser     │  │  └──────────────┘  └───────────────────────────┘        ││
│  │  • Delegate    │  │                                                  ││
│  │  • GitHub      │◄─┤                                                  ││
│  │  • Code Exec   │  │                                                  ││
│  │  • Cron/Sched  │  │                                                  ││
│  └─────────────────┘  └──────────────────────────────────────────────────────────┘│
│                                                                                 │
└─────────────────────────────────────────────────────────────────────────────────┘

Data Flow:

  1. Message arrives at the Telegram bot
  2. Hermes Gateway receives it via the Telegram Bot API
  3. Gateway constructs a prompt with context and available tools
  4. Request routes through Tailscale tunnel to the home GPU
  5. llama.cpp generates the response
  6. Hermes executes any requested tool calls and feeds the results back to the model until it produces a final answer
  7. Final response returns to Telegram
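
To make the flow above concrete, here is a minimal sketch of the request loop in Python. It is not Hermes' actual implementation: `fake_model`, `TOOLS`, and `run_turn` are hypothetical names, and the model is replaced by a stub so the loop runs without a GPU. It assumes the common pattern where tool results are appended to the conversation and the model is called again.

```python
from typing import Callable

# Hypothetical tool registry: names and behaviour are illustrative only.
TOOLS: dict[str, Callable[[str], str]] = {
    "echo": lambda arg: arg,
    "upper": lambda arg: arg.upper(),
}


def fake_model(messages: list[dict]) -> dict:
    """Stand-in for the llama.cpp call: requests one tool, then answers."""
    last = messages[-1]
    if last["role"] == "tool":
        return {"content": f"Tool said: {last['content']}", "tool_call": None}
    return {"content": None, "tool_call": {"name": "upper", "argument": last["content"]}}


def run_turn(user_text: str, model=fake_model, max_steps: int = 5) -> str:
    """Call the model, run any requested tool, feed the result back, repeat."""
    messages = [{"role": "user", "content": user_text}]
    for _ in range(max_steps):
        reply = model(messages)
        call = reply["tool_call"]
        if call is None:
            return reply["content"]  # final answer, sent back to the chat
        result = TOOLS[call["name"]](call["argument"])
        messages.append({"role": "tool", "content": result})
    return "Stopped: too many tool calls"
```

The `max_steps` cap is a design choice of this sketch: it keeps a confused model from looping on tool calls forever.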

Key Components

1. Home GPU Server

  • Hardware: RTX 3090 (24GB VRAM) for running quantized LLMs
  • Software: llama.cpp serving GGUF models via OpenAI-compatible API
  • Benefits: Cost-effective inference with full control over model selection and quantization

2. llama.cpp Server

  • Purpose: Serve local LLMs with an OpenAI-compatible HTTP API
  • Configuration:
    llama-server --model models/qwen2.5-7b-instruct-q4_k_m.gguf \
      --host 0.0.0.0 --port 8000 \
      --ctx-size 8192 --threads 8
  • Why llama.cpp: Excellent GGUF support, minimal memory overhead, straightforward HTTP API

3. Tailscale VPN

  • Purpose: Secure, encrypted tunnel between VPS and home network
  • Configuration:
    • Home server receives a Tailscale IP (e.g., 100.64.1.10)
    • VPS joins the same Tailscale network
    • LLM backend listens on 100.64.1.10:8000
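
As a quick sanity check, a small Python helper can confirm that an address really belongs to the range Tailscale uses for IPv4 (the CGNAT block 100.64.0.0/10). The function name `is_tailscale_ip` is my own; this is a sketch, not part of Tailscale or Hermes.

```python
import ipaddress

# Tailscale assigns IPv4 addresses from the CGNAT block 100.64.0.0/10.
TAILSCALE_V4 = ipaddress.ip_network("100.64.0.0/10")


def is_tailscale_ip(addr: str) -> bool:
    """Return True if addr falls inside the Tailscale IPv4 range."""
    return ipaddress.ip_address(addr) in TAILSCALE_V4
```

This is handy in scripts that should refuse to talk to the LLM backend over anything other than the tunnel, for example when a LAN address such as 192.168.x.x was configured by mistake.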

4. API Gateway (api.lucasnicolas.dev)

  • Purpose: Single, stable OpenAI-compatible endpoint for Hermes agents and Honcho memory services
  • Implementation: Caddy reverse proxy on the VPS, terminating TLS and routing by path
  • Current routing:
    • POST /v1/chat/completions → llama.cpp backend (home GPU, port 8000)
    • GET /v1/models → llama.cpp backend
    • POST /v1/embeddings → embedding backend (intfloat/multilingual-e5-large-instruct, port 8081)
  • Active model: qwen3.5-35b (served via llama.cpp with GGUF quantization)
  • Benefit: Hermes uses one HTTPS URL regardless of backend port or service changes; clean separation between chat and embedding backends
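
The routing rule is simple enough to express in a few lines of Python. This is only an illustration of the decision the proxy makes, not how Caddy is implemented; `pick_backend` is a hypothetical name and the backend addresses are passed in as arguments so the sketch does not assume any particular port.

```python
def pick_backend(path: str, chat_backend: str, embed_backend: str) -> str:
    """Return the upstream for a request path, mirroring the routing list."""
    # Only the embeddings route goes to the embedding server.
    if path == "/v1/embeddings" or path.startswith("/v1/embeddings/"):
        return embed_backend
    # Everything else (chat completions, model listing) goes to llama.cpp.
    return chat_backend
```

Because clients only ever see the gateway URL, either backend address can change without touching the Hermes configuration.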

5. Hermes Agent Gateway

  • Location: VPS (Ubuntu/Debian)
  • Role: Central orchestrator managing:
    • Message routing from Telegram
    • Tool execution and result aggregation
    • Conversation state management
    • Scheduled task coordination

6. Telegram Bot

  • Purpose: Primary interface for agent interaction
  • Benefits: Mobile access, push notifications, persistent chat history
  • Setup: BotFather creates the bot token; stored securely in ~/.hermes/.env

Why This Architecture?

Advantages

Cost Efficiency: Home GPU costs far less than cloud GPU instances
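Privacy: Model inference runs on hardware you control, so prompts are never sent to a third-party LLM provider (messages still pass through Telegram and the VPS)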
Reliability: VPS ensures 24/7 uptime with a public IP
Flexibility: Easy to swap models, add tools, or customize behavior
Security: Tailscale provides encrypted, private networking
Low Latency: llama.cpp has minimal overhead compared to heavier frameworks

Trade-offs

⚠️ Latency: Network hop adds ~10-50ms per API call
⚠️ Complexity: More components to configure and maintain
⚠️ Bandwidth: Model weights are downloaded once to the home server; only prompts and completions cross the tunnel
⚠️ Home Internet: Upload speed affects request/response times

Complete Setup Guide

Step 1: Prepare Your Home GPU Server

Install llama.cpp

Option A: Build from source (Recommended)

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

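# Build with CUDA support (for NVIDIA GPUs) using CMake
# Recent llama.cpp versions use the GGML_CUDA option; older checkouts used LLAMA_CUDA
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j "$(nproc)"

# The binaries, including llama-server, end up in build/bin/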

Option B: Use pre-built release

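# Download a release from GitHub (b3062 is only an example tag; check the releases page for the current one)
# Older tags such as this one may name the server binary "server" instead of "llama-server"
wget https://github.com/ggerganov/llama.cpp/releases/download/b3062/llama-b3062-bin-ubuntu-x64.zip
unzip llama-b3062-bin-ubuntu-x64.zip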

Download and Serve a Model

# Download a quantized model (example: Qwen 2.5 7B)
# Get GGUF models from Hugging Face: https://huggingface.co/models?search=gguf

# Example: Qwen 2.5 7B Instruct (Q4_K_M quantization)
wget https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m.gguf

# Start the server
./llama-server \
  --model qwen2.5-7b-instruct-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --ctx-size 8192 \
  --threads 8 \
  --gpu-layers 35

Configure Tailscale

# Install Tailscale on home server
curl -fsSL https://tailscale.com/install.sh | sh

# Start Tailscale and authenticate
sudo tailscale up

# Note your Tailscale IP (e.g., 100.64.1.10)
tailscale ip -4

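Optional: the setup in this guide only needs the home server's own Tailscale IP. If you also want to reach other devices on your home LAN through it, advertise the LAN subnet (192.168.1.0/24 below is an example; use your own), enable IP forwarding on the server, and approve the route in the Tailscale admin console:

sudo tailscale up --advertise-routes=192.168.1.0/24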

Expose the LLM Backend

If behind a firewall/NAT:

# Option 1: Port forward on router (not recommended: exposes the API to the internet)
# Forward external port 8000 → internal IP:8000

# Option 2: Reach the server over the Tailscale network (recommended, no port forwarding needed)
# Access via: http://100.64.1.10:8000

Step 2: Set Up the VPS

Prerequisites

  • Ubuntu 22.04 LTS or Debian 12+
  • Python 3.10+
  • Node.js 18+ (for some tools)
  • Git
  • Tailscale client

Install Dependencies

# Update system
sudo apt update && sudo apt upgrade -y

# Install Python and pip
sudo apt install -y python3 python3-pip python3-venv

# Install Node.js (if needed for certain tools)
curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
sudo apt install -y nodejs

# Install Git
sudo apt install -y git

# Install Tailscale
curl -fsSL https://tailscale.com/install.sh | sh

Configure Tailscale on VPS

# Start Tailscale and authenticate with same account as home server
sudo tailscale up

# Verify connection to home server
ping 100.64.1.10

# Test LLM backend connectivity
curl http://100.64.1.10:8000/v1/models
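
If you would rather script this check (for a health probe, say), the same reachability test can be sketched in Python with only the standard library. `port_open` is a hypothetical helper name; the host and port you pass in are whatever your own Tailscale IP and llama-server port happen to be.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `port_open("100.64.1.10", 8000)` run on the VPS should return True once Tailscale is up and llama-server is listening. A False result does not say which of the two is at fault, so follow up with `tailscale status`.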

Configure api.lucasnicolas.dev (Caddy Reverse Proxy)

Rather than pointing Hermes directly to a Tailscale IP, I expose a single stable HTTPS endpoint on the VPS that proxies traffic to the appropriate backend. This setup provides:

  • TLS termination with automatic Let’s Encrypt certificates
  • Path-based routing to separate chat and embedding backends
  • Port abstraction so backend changes don’t require client updates

Why llama.cpp over vLLM?

I evaluated switching to vLLM for better throughput, but chose to stay with llama.cpp because:

  1. Unsloth GGUF workflow: I’m using Unsloth-quantized GGUF models, which are native to llama.cpp. Moving to the quantized formats vLLM handles best (AWQ/GPTQ) would mean re-quantizing from the original weights, which discards the Unsloth quantization work.
  2. Single-agent use case: For my primary workflow (single-agent interactive work), the performance difference is negligible.
  3. Simplicity: llama.cpp works directly with GGUF files without additional tooling.

When to consider vLLM:

  • Running multiple concurrent agents (better batching)
  • Production serving with high throughput requirements
  • You have native PyTorch checkpoints (not GGUF)

See the Performance Tuning section for tuning llama.cpp on your RTX 3090.

With the backend choice settled, create /etc/caddy/Caddyfile on the VPS:

api.lucasnicolas.dev {
  encode gzip

  # Embeddings endpoint (CPU-friendly embedding server)
  # handle keeps the /v1/embeddings path; handle_path would strip it before proxying
  handle /v1/embeddings* {
    reverse_proxy 100.64.1.10:8081
  }

  # All other OpenAI-compatible routes (llama.cpp on home GPU)
  handle {
    reverse_proxy 100.64.1.10:8000
  }
}

Apply and validate:

sudo caddy validate --config /etc/caddy/Caddyfile
sudo systemctl reload caddy

# Verify routes
curl https://api.lucasnicolas.dev/v1/models
curl -X POST https://api.lucasnicolas.dev/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"intfloat/multilingual-e5-large-instruct","input":"hello"}'

This gives Hermes, Honcho, and any other OpenAI-compatible client a single base URL to use:

https://api.lucasnicolas.dev/v1

Clone and Install Hermes

# Create working directory
mkdir -p ~/.hermes && cd ~/.hermes

# Clone Hermes repository (if using source)
git clone https://github.com/hermes-agent/hermes-agent.git
cd hermes-agent

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -e .

Step 3: Configure Hermes

Create Configuration File

# Create config directory
mkdir -p ~/.hermes

# Generate default config
hermes setup

Edit ~/.hermes/config.yaml:

# ~/.hermes/config.yaml
display:
  theme: default
  tool_progress_command: kawaii

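model:
  provider: custom
  # Base URL of your HTTPS OpenAI-compatible gateway.
  # Key names here are assumed; confirm them against the file that `hermes setup` generates.
  base_url: "https://api.lucasnicolas.dev/v1"
  model: "qwen3.5-35b"  # Model name the backend serves
  api_key: "${LLM_API_KEY}"  # Set in ~/.hermes/.env
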
tools:
  enabled_toolsets:
    - terminal
    - file
    - web
    - delegate
    - github
    
# Telegram configuration (set via env var)
telegram:
  bot_token: "${HERMES_TELEGRAM_BOT_TOKEN}"
  allowed_users:
    - "your_telegram_username"

Set Environment Variables

Create ~/.hermes/.env:

# ~/.hermes/.env
HERMES_TELEGRAM_BOT_TOKEN=***
GITHUB_TOKEN=***
LLM_API_KEY=***

Get your Telegram Bot Token:

  1. Open Telegram and message @BotFather
  2. Send /newbot and follow prompts
  3. Copy the API token it provides

Get your GitHub Token:

  1. Go to https://github.com/settings/tokens
  2. Create a new token with repo scope
  3. Copy the token

Set your LLM API key:

  1. Use the key expected by your OpenAI-compatible endpoint (api.lucasnicolas.dev); for llama-server this is the value passed to its --api-key option, if you set one
  2. Store it as LLM_API_KEY in ~/.hermes/.env
  3. Keep file permissions strict:
    chmod 600 ~/.hermes/.env
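
If you want a startup script to refuse to run with a world-readable secrets file, the permission check can be sketched in Python as follows. `is_private` is a hypothetical helper, not something Hermes provides.

```python
import os
import stat


def is_private(path: str) -> bool:
    """True if neither group nor others have any permission bits (e.g. mode 600)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
```

A call such as `is_private(os.path.expanduser("~/.hermes/.env"))` returns False for a file left at the default 644, which is the usual reason tokens leak on shared machines.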

Test the Connection

# Activate virtual environment
cd ~/.hermes/hermes-agent
source venv/bin/activate

# Test LLM backend connectivity
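python3 -c "
import requests
# A timeout keeps the check from hanging if the tunnel or backend is down
response = requests.get('https://api.lucasnicolas.dev/v1/models', timeout=10)
print('LLM Backend Status:', response.status_code)
response.raise_for_status()
print('Models:', response.json())
"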

# Optional: test embeddings route
curl -X POST https://api.lucasnicolas.dev/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"intfloat/multilingual-e5-large-instruct","input":"Hermes health check"}'
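
For a fuller end-to-end check, the sketch below builds an OpenAI-style chat completion request using only the standard library. `build_chat_request` is a hypothetical helper; the Bearer header is only added when a key is supplied, on the assumption that your endpoint follows the usual OpenAI convention for authentication.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       api_key: str = "") -> urllib.request.Request:
    """Build a POST request for <base_url>/chat/completions."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: the endpoint expects an OpenAI-style Bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen(req, timeout=30)` and decode the JSON body of the response. Building the request separately from sending it makes the payload easy to inspect when something goes wrong.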

Step 4: Run the Agent

Start the Gateway (Telegram Mode)

# Activate the virtual environment
cd ~/.hermes/hermes-agent
source venv/bin/activate

# Start Telegram gateway
python3 -m hermes.gateway --platform telegram

The gateway will:

  • Connect to your Telegram bot
  • Listen for messages
  • Forward requests to your home LLM backend
  • Execute tools and return responses

Make it Persistent with systemd

Create ~/.hermes/hermes-gateway.service:

[Unit]
Description=Hermes AI Agent Gateway
Wants=network-online.target
After=network-online.target tailscaled.service

[Service]
Type=simple
User=lucas
WorkingDirectory=/home/lucas/.hermes/hermes-agent
Environment="PATH=/home/lucas/.hermes/hermes-agent/venv/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
ExecStart=/home/lucas/.hermes/hermes-agent/venv/bin/python3 -m hermes.gateway --platform telegram
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Enable and start:

# Copy to systemd directory
sudo cp ~/.hermes/hermes-gateway.service /etc/systemd/system/

# Reload systemd
sudo systemctl daemon-reload

# Enable and start
sudo systemctl enable hermes-gateway
sudo systemctl start hermes-gateway

# Check status
sudo systemctl status hermes-gateway

# View logs
sudo journalctl -u hermes-gateway -f

Step 5: Verify Everything Works

Test from Telegram

  1. Open Telegram and find your bot
  2. Send a simple greeting: Hello
  3. You should receive a response from the agent

Test Tool Access

# In Telegram, try these commands:
/git status              # Check git tool access
/web_search "Hermes AI"  # Test web search
/terminal "echo hello"   # Test terminal access

Monitor Logs

# View real-time logs
sudo journalctl -u hermes-gateway -f

# Or check gateway output manually
tail -f ~/.hermes/hermes-agent/logs/gateway.log

Troubleshooting

Common Issues

“Connection refused” when accessing LLM backend

Symptoms: Agent can’t connect to home GPU server

Solutions:

  1. Verify Tailscale is running on both machines:
    tailscale status
  2. Check if you can ping the home IP:
    ping 100.64.1.10
  3. Ensure llama.cpp is listening on 0.0.0.0 not just 127.0.0.1:
    # Correct - listens on all interfaces
    ./llama-server --host 0.0.0.0 --port 8000
    
    # Wrong - only listens locally
    ./llama-server --host 127.0.0.1 --port 8000

“401 Unauthorized” from Telegram

Symptoms: Bot token rejected

Solutions:

  1. Verify bot token in ~/.hermes/.env
  2. Ensure bot is active (not deleted)
  3. If the bot connects but ignores your messages, check that the allowed_users list includes your Telegram username

Model loading fails or times out

Symptoms: llama.cpp returns errors or OOM

Solutions:

  1. Check GPU memory usage on home server:
    nvidia-smi
  2. Use a smaller context size:
    ./llama-server --ctx-size 4096  # instead of 8192
  3. Try a smaller quantization (Q4_K_M instead of Q8_0):
    # Q4_K_M: roughly 4.5-5GB of weights for a 7B model
    # Q8_0: roughly 7.5-8GB of weights for a 7B model

Out of GPU memory

Symptoms: CUDA out of memory errors

Solutions:

  1. Reduce --gpu-layers parameter:
    # Fewer layers on GPU = less VRAM
    ./llama-server --gpu-layers 20
  2. Use CPU offload for some layers:
    ./llama-server --gpu-layers 10 --threads 8
  3. Try a smaller model (7B instead of 14B)

“model not found” on /v1/embeddings

Symptoms: API returns an error like:

The provided model=BAAI/bge-m3 has not been found ... use model=intfloat/multilingual-e5-large-instruct instead.

Root cause: Your embedding backend is serving intfloat/multilingual-e5-large-instruct (1024-d vectors), but configuration files reference a different model.

Solutions:

  1. Update embedding model everywhere to match what your endpoint serves:

    • intfloat/multilingual-e5-large-instruct
    • Vector dimensions: 1024
  2. Check all config locations:

    # Honcho config
    grep -r "BAAI/bge-m3" ~/honcho/
    
    # Hermes config
    grep -r "embedding" ~/.hermes/config.yaml
    
    # Environment files
    grep -r "EMBEDDING_MODEL" ~/.hermes/ ~/honcho/
  3. Restart services that cache config:

    cd ~/honcho && docker compose restart api deriver
    sudo systemctl restart hermes-gateway
  4. Verify the fix:

    curl -X POST https://api.lucasnicolas.dev/v1/embeddings \
      -H "Content-Type: application/json" \
      -d '{"model":"intfloat/multilingual-e5-large-instruct","input":"test"}'

Important: The embedding model name must exactly match what your backend serves. If unsure, check your embedding server logs or test with different model names.
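
A mismatch in vector size is easy to catch programmatically. The sketch below assumes the standard OpenAI-style embeddings response shape (a `data` list whose items carry an `embedding` array); `embedding_dims` and `check_dims` are hypothetical helper names.

```python
def embedding_dims(response: dict) -> int:
    """Return the vector length from an OpenAI-style /v1/embeddings response."""
    return len(response["data"][0]["embedding"])


def check_dims(response: dict, expected: int = 1024) -> bool:
    """True if the returned vector has the dimension the memory store expects."""
    return embedding_dims(response) == expected
```

Run it against the parsed JSON from the curl command above. If `check_dims` returns False, the memory store and the embedding backend disagree about the model, even when the request itself succeeds.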

Debugging Tips

# Test gateway endpoint directly from VPS
curl -X POST https://api.lucasnicolas.dev/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7
  }'

# Test embeddings route and model name
curl -X POST https://api.lucasnicolas.dev/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"intfloat/multilingual-e5-large-instruct","input":"health check"}'

# Check Tailscale network (upstream path from VPS proxy to home server)
tailscale ping 100.64.1.10
tailscale status

# Verify file permissions
ls -la ~/.hermes/.env
chmod 600 ~/.hermes/.env

# Monitor llama.cpp directly (on home server)
# Check for errors in the terminal output

Advanced Configuration

Adding More Tools

Edit ~/.hermes/config.yaml:

tools:
  enabled_toolsets:
    - terminal          # Shell commands
    - file              # File I/O operations
    - web               # Web search and extraction
    - delegate          # Subagent delegation
    - github            # GitHub repository access
    - browser           # Browser automation (requires API key)
    - cron              # Scheduled task management
    
  disabled_toolsets: []

Honcho Memory Integration

Hermes integrates with Honcho, a self-hosted memory system for long-term conversation context and semantic search.

Architecture:

  • Storage: PostgreSQL + Redis on the VPS (~/honcho directory)
  • Embeddings: Routed through api.lucasnicolas.dev/v1/embeddings to your embedding server at port 8081
  • Vector dimensions: 1024 (matches intfloat/multilingual-e5-large-instruct)

Verification:

# Check Hermes Honcho connection
hermes honcho status

# Expected output:
# ✓ Connection OK
#   - API: http://localhost:8000
#   - Embeddings: https://api.lucasnicolas.dev/v1/embeddings
#   - Dimensions: 1024

If Honcho shows connection issues:

  1. Verify Honcho services are running:
    cd ~/honcho && docker compose ps
    # Expected: api, database, redis, deriver all healthy
  2. Check embedding configuration:
    # Ensure EMBEDDING_MODEL matches your backend
    grep "EMBEDDING_MODEL" ~/honcho/.env
    # Should be: intfloat/multilingual-e5-large-instruct
  3. Restart Honcho if needed:
    cd ~/honcho && docker compose restart api deriver

Configuring Scheduled Tasks

Hermes supports cron-like scheduling for automated tasks:

# In your Telegram chat, send:
/cron create "Daily backup" "0 3 * * *" "tar -czf /backup/home.tar.gz /home"
/cron list            # View all scheduled jobs
/cron pause "Daily backup"
/cron resume "Daily backup"
/cron remove "Daily backup"
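
If the five-field schedule syntax is unfamiliar, this small Python sketch splits an expression into its named fields. It only labels the fields and does not validate ranges or compute run times; `split_cron` is a hypothetical helper, unrelated to how Hermes parses schedules.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron(expr: str) -> dict[str, str]:
    """Map a standard five-field cron expression to its field names."""
    parts = expr.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))
```

For the backup job above, `split_cron("0 3 * * *")` gives minute 0 and hour 3 with wildcards elsewhere, meaning the job runs every day at 03:00.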

Custom Skill Sets

Create custom skills in ~/.hermes/skills/:

mkdir -p ~/.hermes/skills/my-skills

Example skill file (~/.hermes/skills/my-skills/deploy.md):

---
name: deploy-to-vps
description: Deploy code to production server
version: 1.0.0
---

## Deployment Workflow

1. Build the project
2. Test locally
3. Push to production branch
4. Restart services
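
To show how a file like this separates metadata from instructions, here is a minimal front-matter parser in Python. It handles only flat `key: value` pairs between two `---` lines and is an assumption about the format, not Hermes' actual skill loader; `parse_front_matter` is a hypothetical name.

```python
def parse_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split a skill file into (metadata, body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front matter present
    end = lines.index("---", 1)  # raises ValueError if the block is unclosed
    meta = {}
    for line in lines[1:end]:
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body
```

A real loader would more likely use a YAML library, which also copes with nested values and quoting.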

Model Selection for llama.cpp

Model         GGUF File                             VRAM (Q4_K_M)  Use Case
Qwen 2.5 7B   qwen2.5-7b-instruct-q4_k_m.gguf       ~6GB           Fast, general purpose
Llama 3.1 8B  llama-3.1-8b-instruct-q4_k_m.gguf     ~6GB           Reasoning, code
Mistral 7B    mistral-7b-instruct-v0.3-q4_k_m.gguf  ~6GB           Code, reasoning
Qwen 2.5 14B  qwen2.5-14b-instruct-q4_k_m.gguf      ~10GB          Complex reasoning
Qwen 2.5 32B  qwen2.5-32b-instruct-q4_k_m.gguf      ~20GB          Research, deep analysis

Recommendation: Start with Qwen 2.5 7B or Llama 3.1 8B for most use cases.
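
To estimate whether a model will fit before downloading it, a back-of-the-envelope calculation helps. The sketch below is a rough approximation, not a measurement: the bits-per-weight figure for Q4_K_M (about 4.85) is approximate, the Qwen 2.5 7B shape values in the usage note (28 layers, 4 KV heads, head dimension 128, about 7.6B parameters) come from its published configuration, and compute buffers are ignored.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx: int, bytes_per_value: int = 2) -> float:
    """KV cache size: one key and one value vector per layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_value / 1e9
```

For Qwen 2.5 7B at Q4_K_M with an 8192-token context, `weights_gb(7.6, 4.85)` is about 4.6 and `kv_cache_gb(28, 4, 128, 8192)` is about 0.47, so roughly 5GB before runtime buffers. The same arithmetic explains why cutting the context size frees VRAM.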

Performance Tuning

# Suggested starting point for llama.cpp on an RTX 3090 (24GB); tune for your model and workload
./llama-server \
  --model qwen2.5-14b-instruct-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --ctx-size 8192 \
  --threads 8 \
  --gpu-layers 40 \
  --batch-size 512 \
  --ubatch-size 128

Performance Considerations

Latency Breakdown

Component                           Typical Latency / Throughput
Telegram API                        50-150ms
Tailscale tunnel                    10-50ms
llama.cpp inference (7B, RTX 3090)  20-50 tokens/s (generation throughput, not latency)
VPS internal processing             <10ms
Total first token latency           ~200-500ms
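
First-token latency and generation speed combine into the time a full reply takes. The sketch below is simple arithmetic under the assumption that tokens are generated at a steady rate once the first one arrives; `response_seconds` is a hypothetical helper name.

```python
def response_seconds(first_token_ms: float, output_tokens: int,
                     tokens_per_second: float) -> float:
    """Estimate wall-clock time for a complete reply."""
    return first_token_ms / 1000 + output_tokens / tokens_per_second
```

With the figures from the table, a 200-token reply takes about 10.5 seconds in the slow case (500ms, 20 tokens/s) and about 4.2 seconds in the fast case (200ms, 50 tokens/s). Generation speed dominates, which is why the network hop matters less than it first appears.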

Cost Analysis

Component             Monthly Cost   Notes
Home GPU electricity  ~$15-30        RTX 3090 idle + active usage
VPS (basic)           $5-20          DigitalOcean, Linode, etc.
Tailscale             Free           Personal plan: up to 3 users; a device limit applies (check current pricing)
Total                 ~$20-50/month  vs. $100-300 for cloud GPU
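
The electricity figure depends heavily on local prices and usage, so it is worth recomputing for your own situation. The sketch below is plain arithmetic; the wattage, hours, and price in the usage note are illustrative assumptions, not measurements from my setup, and `monthly_electricity_cost` is a hypothetical helper name.

```python
def monthly_electricity_cost(avg_watts: float, hours_per_day: float,
                             price_per_kwh: float, days: int = 30) -> float:
    """Cost of running a load of avg_watts for hours_per_day over a month."""
    kwh = avg_watts / 1000 * hours_per_day * days
    return kwh * price_per_kwh
```

For example, a card drawing 350W (the RTX 3090's rated board power) for 4 hours a day at an assumed $0.15 per kWh costs about $6.30 per month. Add a second call for idle draw over the remaining hours to get a total for the whole machine.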

Optimization Tips

  1. Use quantized models: Q4_K_M is a good balance of quality and size for most models
  2. Batch requests: Combine multiple queries when possible
  3. Cache responses: Hermes supports prompt caching for repeated queries
  4. Keep context small: Smaller contexts = faster inference
  5. Use GPU layers wisely: Balance GPU/CPU offload for your hardware

Conclusion

This remote Hermes setup gives you the best of both worlds:

  • Reliability: VPS provides 24/7 availability and public IP
  • Privacy: Inference happens on your own hardware rather than a third-party LLM API
  • Cost: Home GPU is far cheaper than cloud alternatives
  • Flexibility: Easy to modify, extend, and customize
  • Performance: llama.cpp is lean and fast with minimal overhead
  • Memory: Honcho provides semantic search and long-term context

The key ingredients are Tailscale for secure networking, api.lucasnicolas.dev as a unified gateway, and a clear understanding of the trade-offs between latency and cost.

Why This Setup Works

  1. llama.cpp + GGUF: Leverages Unsloth-quantized models without format conversion
  2. Single HTTPS endpoint: api.lucasnicolas.dev/v1 abstracts away backend complexity
  3. Separate embedding route: Keeps CPU-intensive embeddings off the main inference pipeline
  4. Self-hosted memory: Honcho runs on the VPS with 1024-d vectors matching your embedding model

Next Steps

  1. Start small: Begin with a 7B Q4_K_M model and basic tools
  2. Monitor performance: Use the built-in logging to track latency
  3. Iterate: Add features gradually as you become comfortable
  4. Share: Contribute back to the Hermes community with your improvements

Written by: Lucas Nicolas
Last Updated: April 12, 2026
License: MIT
Source Code: github.com/lucas-nicolas-viseo/portfolio

This article was written using the Hermes agent via Telegram, demonstrating exactly the kind of workflow this architecture enables.