Building an AI Agent System with Hermes: A Remote Architecture Guide

Note: This article was written using the Hermes agent itself via Telegram — the same system described here.

Introduction

Autonomous AI agents have transformed how I approach development tasks. After evaluating several frameworks, I chose Hermes — an open-source agent framework that combines powerful LLM reasoning with practical tool integrations.

The core challenge: I wanted 24/7 availability and reliable network access from a VPS while keeping my expensive GPU hardware at home for cost efficiency. This guide documents my complete setup and architecture decisions, and provides step-by-step instructions for replicating it.

System Architecture

This setup combines a reliable VPS for 24/7 availability with cost-effective home GPU inference. Here’s how the components interconnect:

┌─────────────────────────────────────────────────────────────────────────────┐
│                              HOME NETWORK (LAN)                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                        GPU Server (RTX 3090 24GB)                       │  │
│  │  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────────┐    │  │
│  │  │   llama.cpp     │  │   Quantized     │  │   Local Storage &   │    │  │
│  │  │   Server        │◄─┤   LLM Models    │◄─┤   Model Weights     │    │  │
│  │  │  (port 8000)    │  │   (GGUF)        │  │                     │    │  │
│  │  └────────┬────────┘  └─────────────────┘  └─────────────────────┘    │  │
│  │           │                                                            │  │
│  │           │                                                            │  │
│  └───────────┼────────────────────────────────────────────────────────────┘  │
│              │                                                                │
└──────────────┼────────────────────────────────────────────────────────────────┘
               │
               │ Tailscale Encrypted Tunnel
               │ (100.64.x.x network)
               │
┌──────────────┼────────────────────────────────────────────────────────────────┐
│              ▼                          VPS (Remote Server)                    │
│  ┌─────────────────────────────────────────────────────────────────────────────┐│
│  │                         Hermes Agent Gateway                                 ││
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌────────────────┐  ││
│  │  │   Telegram   │  │    Tool      │  │    Cron      │  │   Session      │  ││
│  │  │   Bot API    │  │   Executor   │  │   Scheduler  │  │   Manager      │  ││
│  │  │              │  │              │  │              │  │                │  ││
│  │  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘  └───────┬────────┘  ││
│  │         │                 │                  │                  │           ││
│  │         └─────────────────┬┴──────────────────┴──────────────────┘           ││
│  │                           │                                                  ││
│  │                    ┌──────▼──────┐                                         ││
│  │                    │   LLM Client  │─────────────────┐                       ││
│  │                    │  (OpenAI API) │                  │                       ││
│  │                    └──────────────┘                  │                       ││
│  └──────────────────────────────────────────────────────┼───────────────────────┘│
│                                                         │                        │
│  ┌─────────────────┐  ┌─────────────────────────────────┼───────────────────────┐│
│  │  Tools Available│  │                                 │                       ││
│  │  ───────────────│  │        External Services        │                       ││
│  │  • Terminal     │  │  ┌──────────────┐  ┌─────────────▼─────────────┐        ││
│  │  • File I/O     │  │  │   GitHub     │  │         Internet         │        ││
│  │  • Web Search   │  │  │     API      │  │   (Web Scraping/APIs)    │        ││
│  │  • Browser     │  │  └──────────────┘  └───────────────────────────┘        ││
│  │  • Delegate    │  │                                                  ││
│  │  • GitHub      │◄─┤                                                  ││
│  │  • Code Exec   │  │                                                  ││
│  │  • Cron/Sched  │  │                                                  ││
│  └─────────────────┘  └──────────────────────────────────────────────────────────┘│
│                                                                                 │
└─────────────────────────────────────────────────────────────────────────────────┘

Data Flow:

1. Message arrives at the Telegram bot
2. Hermes Gateway receives it via the Telegram Bot API
3. Gateway constructs a prompt with context and available tools
4. Request routes through Tailscale tunnel to the home GPU
5. llama.cpp generates the response
6. Hermes executes any requested tool calls
7. Final response returns to Telegram
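To make the prompt-construction and routing steps more concrete, here is a minimal sketch of the kind of OpenAI-compatible request body a gateway could assemble before sending it to llama.cpp. The function name `build_chat_request`, the system prompt, and the `run_terminal` tool definition are illustrative assumptions, not Hermes internals; only the general `messages`/`tools` layout follows the OpenAI chat completions format.

```python
import json


def build_chat_request(model, history, user_text, tools):
    """Assemble an OpenAI-style chat completion payload (illustrative only)."""
    # Assumed system prompt; a real agent would build something richer.
    messages = [{"role": "system", "content": "You are a helpful agent."}]
    messages.extend(history)
    messages.append({"role": "user", "content": user_text})
    payload = {"model": model, "messages": messages}
    if tools:
        # Only advertise tools when at least one is enabled.
        payload["tools"] = tools
    return payload


# Hypothetical tool definition in the OpenAI function-calling format.
TERMINAL_TOOL = {
    "type": "function",
    "function": {
        "name": "run_terminal",
        "description": "Run a shell command on the VPS",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}

request_body = build_chat_request("qwen3.5-35b", [], "Hello", [TERMINAL_TOOL])
print(json.dumps(request_body, indent=2))
```

If the model answers with a tool call instead of plain text, the gateway would run the tool, append the result to `history`, and send a follow-up request with the same structure.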

Key Components

1. Home GPU Server

Hardware: RTX 3090 (24GB VRAM) for running quantized LLMs
Software: llama.cpp serving GGUF models via OpenAI-compatible API
Benefits: Cost-effective inference with full control over model selection and quantization

2. llama.cpp Server

Purpose: Serve local LLMs with an OpenAI-compatible HTTP API

Configuration:

llama-server --model models/qwen2.5-7b-instruct-q4_k_m.gguf \
  --host 0.0.0.0 --port 8000 \
  --ctx-size 8192 --threads 8

Why llama.cpp: Excellent GGUF support, minimal memory overhead, straightforward HTTP API

3. Tailscale VPN

Purpose: Secure, encrypted tunnel between VPS and home network
Configuration:
- Home server receives a Tailscale IP (e.g., 100.64.1.10)
- VPS joins the same Tailscale network
- LLM backend listens on 100.64.1.10:8000
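Tailscale hands out IPv4 addresses from the carrier-grade NAT block 100.64.0.0/10, which is why the example address starts with 100.64. As a small sanity check, the sketch below tests whether a configured backend address falls inside that block; the helper name `is_tailscale_ip` is my own.

```python
import ipaddress

# Tailscale assigns node addresses from the CGNAT block 100.64.0.0/10.
TAILSCALE_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_tailscale_ip(address):
    """Return True if the address falls inside Tailscale's IPv4 range."""
    return ipaddress.ip_address(address) in TAILSCALE_RANGE


print(is_tailscale_ip("100.64.1.10"))   # address used in this guide
print(is_tailscale_ip("192.168.1.10"))  # a typical LAN address
```

A check like this helps catch the common mistake of pointing the VPS at the home server's LAN address, which is not reachable from outside the house.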

4. API Gateway (`api.lucasnicolas.dev`)

Purpose: Single, stable OpenAI-compatible endpoint for Hermes agents and Honcho memory services
Implementation: Caddy reverse proxy on the VPS, terminating TLS and routing by path
Current routing:
- POST /v1/chat/completions → llama.cpp backend (home GPU, port 8000, matching the --port passed to llama-server)
- GET /v1/models → llama.cpp backend
- POST /v1/embeddings → embedding backend (intfloat/multilingual-e5-large-instruct, port 8081)
Active model: qwen3.5-35b (served via llama.cpp with GGUF quantization)
Benefit: Hermes uses one HTTPS URL regardless of backend port or service changes; clean separation between chat and embedding backends
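The routing rule boils down to a single decision on the request path. The sketch below expresses it in Python purely for illustration; the real routing is done by Caddy, and the backend labels `"embeddings"` and `"chat"` are placeholder names I chose.

```python
def pick_backend(path):
    """Mirror the routing table: embeddings go to one backend, everything else to llama.cpp."""
    if path == "/v1/embeddings" or path.startswith("/v1/embeddings/"):
        return "embeddings"
    return "chat"


for example in ("/v1/chat/completions", "/v1/models", "/v1/embeddings"):
    print(example, "->", pick_backend(example))
```

Because clients only ever see the public URL, either backend can move to another port or machine without any change on the Hermes side.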

5. Hermes Agent Gateway

Location: VPS (Ubuntu/Debian)
Role: Central orchestrator managing:
- Message routing from Telegram
- Tool execution and result aggregation
- Conversation state management
- Scheduled task coordination

6. Telegram Bot

Purpose: Primary interface for agent interaction
Benefits: Mobile access, push notifications, persistent chat history
Setup: BotFather creates the bot token; stored securely in ~/.hermes/.env

Why This Architecture?

Advantages

✅ Cost Efficiency: Home GPU costs far less than cloud GPU instances
✅ Privacy: Inference runs on your own hardware, so prompts are never sent to a third-party LLM provider (messages still pass through Telegram and the VPS)
✅ Reliability: VPS ensures 24/7 uptime with a public IP
✅ Flexibility: Easy to swap models, add tools, or customize behavior
✅ Security: Tailscale provides encrypted, private networking
✅ Low Latency: llama.cpp has minimal overhead compared to heavier frameworks

Trade-offs

⚠️ Latency: Network hop adds ~10-50ms per API call
⚠️ Complexity: More components to configure and maintain
⚠️ Bandwidth: Model weights are downloaded once to the home server, then served from local disk
⚠️ Home Internet: Upload speed affects request/response times

Complete Setup Guide

Step 1: Prepare Your Home GPU Server

Install llama.cpp

Option A: Build from source (Recommended)

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with CUDA support (for NVIDIA GPUs) using CMake
# Recent llama.cpp versions use GGML_CUDA; older releases used LLAMA_CUDA
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# The server binary is then at build/bin/llama-server

Option B: Use pre-built release

# Download a release build from GitHub (b3062 shown; check the releases page for the latest)
wget https://github.com/ggerganov/llama.cpp/releases/download/b3062/llama-b3062-bin-ubuntu-x64.zip
unzip llama-b3062-bin-ubuntu-x64.zip

Download and Serve a Model

# Download a quantized model (example: Qwen 2.5 7B)
# Get GGUF models from Hugging Face: https://huggingface.co/models?search=gguf

# Example: Qwen 2.5 7B Instruct (Q4_K_M quantization)
wget https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m.gguf

# Start the server
./llama-server \
  --model qwen2.5-7b-instruct-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --ctx-size 8192 \
  --threads 8 \
  --gpu-layers 35

Configure Tailscale

# Install Tailscale on home server
curl -fsSL https://tailscale.com/install.sh | sh

# Start Tailscale and authenticate
sudo tailscale up

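# Note your Tailscale IP (e.g., 100.64.1.10)
tailscale ip -4

Optional: Enable a subnet router only if the VPS must reach other devices on your home LAN. Advertise the LAN subnet (192.168.1.0/24 is an example; use your own), not the Tailscale IP. IP forwarding must be enabled on the home server, and the route must be approved in the Tailscale admin console:

sudo tailscale set --advertise-routes=192.168.1.0/24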

Expose the LLM Backend

If behind a firewall/NAT:

# Option 1: Port forward on router (not recommended: exposes the API to the internet)
# Forward external port 8000 → internal IP:8000

# Option 2: Reach the server over Tailscale (recommended, no port forwarding needed)
# Access via: http://100.64.1.10:8000

Step 2: Set Up the VPS

Prerequisites

Ubuntu 22.04 LTS or Debian 12+
Python 3.10+
Node.js 18+ (for some tools)
Git
Tailscale client

Install Dependencies

# Update system
sudo apt update && sudo apt upgrade -y

# Install Python and pip
sudo apt install -y python3 python3-pip python3-venv

# Install Node.js (if needed for certain tools)
curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
sudo apt install -y nodejs

# Install Git
sudo apt install -y git

# Install Tailscale
curl -fsSL https://tailscale.com/install.sh | sh

Configure Tailscale on VPS

# Start Tailscale and authenticate with same account as home server
sudo tailscale up

# Verify connection to home server
ping 100.64.1.10

# Test LLM backend connectivity
curl http://100.64.1.10:8000/v1/models

Configure `api.lucasnicolas.dev` (Caddy Reverse Proxy)

Rather than pointing Hermes directly to a Tailscale IP, I expose a single stable HTTPS endpoint on the VPS that proxies traffic to the appropriate backend. This setup provides:

TLS termination with automatic Let’s Encrypt certificates
Path-based routing to separate chat and embedding backends
Port abstraction so backend changes don’t require client updates

Why llama.cpp over vLLM?

I evaluated switching to vLLM for better throughput, but chose to stay with llama.cpp because:

Unsloth GGUF workflow: I’m using Unsloth-quantized GGUF models, which are native to llama.cpp. Moving to vLLM-friendly formats (AWQ/GPTQ) would mean re-quantizing from the original weights, which discards the Unsloth quantization work.
Single-agent use case: For my primary workflow (single-agent interactive work), the performance difference is negligible.
Simplicity: llama.cpp works directly with GGUF files without additional tooling.

When to consider vLLM:

Running multiple concurrent agents (better batching)
Production serving with high throughput requirements
You have native PyTorch checkpoints (not GGUF)

See the Performance Tuning section for tuning llama.cpp on your RTX 3090.

Create /etc/caddy/Caddyfile:

api.lucasnicolas.dev {
  encode gzip

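  # Embeddings endpoint (CPU-friendly embedding server)
  # "handle" keeps the request path intact; "handle_path" would strip /v1/embeddings
  handle /v1/embeddings* {
    reverse_proxy 100.64.1.10:8081
  }

  # All other OpenAI-compatible routes (llama.cpp on home GPU)
  # The port must match the --port passed to llama-server
  handle {
    reverse_proxy 100.64.1.10:8000
  }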
}

Apply and validate:

sudo caddy validate --config /etc/caddy/Caddyfile
sudo systemctl reload caddy

# Verify routes
curl https://api.lucasnicolas.dev/v1/models
curl -X POST https://api.lucasnicolas.dev/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"intfloat/multilingual-e5-large-instruct","input":"hello"}'

This gives Hermes, Honcho, and any other OpenAI-compatible client a single base URL to use:

https://api.lucasnicolas.dev/v1

Clone and Install Hermes

# Create working directory
mkdir -p ~/.hermes && cd ~/.hermes

# Clone Hermes repository (if using source)
git clone https://github.com/hermes-agent/hermes-agent.git
cd hermes-agent

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -e .

Step 3: Configure Hermes

Create Configuration File

# Create config directory
mkdir -p ~/.hermes

# Generate default config
hermes setup

Edit ~/.hermes/config.yaml:

# ~/.hermes/config.yaml
display:
  theme: default
  tool_progress_command: kawaii

model:
  provider: custom
  # Point to your HTTPS OpenAI-compatible gateway
  model: "https://api.lucasnicolas.dev/v1"
  api_key: "${LLM_API_KEY}"  # Set in ~/.hermes/.env
  
tools:
  enabled_toolsets:
    - terminal
    - file
    - web
    - delegate
    - github
    
# Telegram configuration (set via env var)
telegram:
  bot_token: "${HERMES_TELEGRAM_BOT_TOKEN}"
  allowed_users:
    - "your_telegram_username"

Set Environment Variables

Create ~/.hermes/.env:

# ~/.hermes/.env
HERMES_TELEGRAM_BOT_TOKEN=***
GITHUB_TOKEN=***
LLM_API_KEY=***

Get your Telegram Bot Token:

Open Telegram and message @BotFather
Send /newbot and follow prompts
Copy the API token it provides

Get your GitHub Token:

Go to https://github.com/settings/tokens
Create a new token with repo scope
Copy the token

Set your LLM API key:

Use the key expected by your OpenAI-compatible endpoint (api.lucasnicolas.dev)
Store it as LLM_API_KEY in ~/.hermes/.env
Keep file permissions strict:
```
chmod 600 ~/.hermes/.env
```
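If you want a startup script to confirm that the secrets file is not readable by other users, a check along these lines would do it. This is a sketch; `is_private_file` is a name I made up, and it only inspects the classic Unix permission bits.

```python
import os
import stat


def is_private_file(path):
    """Return True if group and others have no permissions on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # 0o077 covers the read/write/execute bits for group and others.
    return mode & 0o077 == 0


env_path = os.path.expanduser("~/.hermes/.env")
if os.path.exists(env_path):
    print("private" if is_private_file(env_path) else "too permissive")
```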

Test the Connection

# Activate virtual environment
cd ~/.hermes/hermes-agent
source venv/bin/activate

# Test LLM backend connectivity
python3 -c "
import requests
response = requests.get('https://api.lucasnicolas.dev/v1/models', timeout=10)
print('LLM Backend Status:', response.status_code)
print('Models:', response.json())
"

# Optional: test embeddings route
curl -X POST https://api.lucasnicolas.dev/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"intfloat/multilingual-e5-large-instruct","input":"Hermes health check"}'
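When scripting these checks, it helps to pull the model names out of the `/v1/models` response and to send the API key if your gateway requires one. The helpers below are a sketch: they assume the usual OpenAI-style response layout (`{"data": [{"id": ...}]}`) and a Bearer token header, and both function names are my own.

```python
def auth_headers(api_key):
    """Build request headers for an OpenAI-compatible endpoint using a Bearer token."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }


def model_ids(models_response):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


# Example payload with the same layout as a /v1/models response (values are made up).
sample = {"object": "list", "data": [{"id": "qwen3.5-35b", "object": "model"}]}
print(model_ids(sample))
```

Pass `headers=auth_headers(key)` to `requests.get` in the connectivity test if the endpoint rejects unauthenticated calls.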

Step 4: Run the Agent

Start the Gateway (Telegram Mode)

# In virtual environment
cd ~/.hermes/hermes-agent

# Start Telegram gateway
python3 -m hermes.gateway --platform telegram

The gateway will:

Connect to your Telegram bot
Listen for messages
Forward requests to your home LLM backend
Execute tools and return responses

Make it Persistent with systemd

Create ~/.hermes/hermes-gateway.service:

[Unit]
Description=Hermes AI Agent Gateway
After=network.target tailscaled.service

[Service]
Type=simple
User=lucas
WorkingDirectory=/home/lucas/.hermes/hermes-agent
Environment="PATH=/home/lucas/.hermes/hermes-agent/venv/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
ExecStart=/home/lucas/.hermes/hermes-agent/venv/bin/python3 -m hermes.gateway --platform telegram
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Enable and start:

# Copy to systemd directory
sudo cp ~/.hermes/hermes-gateway.service /etc/systemd/system/

# Reload systemd
sudo systemctl daemon-reload

# Enable and start
sudo systemctl enable hermes-gateway
sudo systemctl start hermes-gateway

# Check status
sudo systemctl status hermes-gateway

# View logs
sudo journalctl -u hermes-gateway -f

Step 5: Verify Everything Works

Test from Telegram

Open Telegram and find your bot
Send a simple greeting: Hello
You should receive a response from the agent

Test Tool Access

# In Telegram, try these commands:
/git status              # Check git tool access
/web_search "Hermes AI"  # Test web search
/terminal "echo hello"   # Test terminal access

Monitor Logs

# View real-time logs
sudo journalctl -u hermes-gateway -f

# Or check gateway output manually
tail -f ~/.hermes/hermes-agent/logs/gateway.log

Troubleshooting

Common Issues

“Connection refused” when accessing LLM backend

Symptoms: Agent can’t connect to home GPU server

Solutions:

Verify Tailscale is running on both machines:
```
tailscale status
```
Check if you can ping the home IP:
```
ping 100.64.1.10
```

Ensure llama.cpp is listening on 0.0.0.0 not just 127.0.0.1:

# Correct - listens on all interfaces
./llama-server --host 0.0.0.0 --port 8000

# Wrong - only listens locally
./llama-server --host 127.0.0.1 --port 8000

“401 Unauthorized” from Telegram

Symptoms: Bot token rejected

Solutions:

Verify bot token in ~/.hermes/.env
Ensure bot is active (not deleted)
If the token is valid but the bot ignores your messages, check that the allowed_users list includes your Telegram username

Model loading fails or times out

Symptoms: llama.cpp returns errors or OOM

Solutions:

Check GPU memory usage on home server:
```
nvidia-smi
```

Use a smaller context size:

./llama-server --ctx-size 4096  # instead of 8192

Try a smaller quantization (Q4_K_M instead of Q8_0):

# Q4_K_M uses ~4.5GB for 7B model
# Q8_0 uses ~7GB for 7B model
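These figures follow from simple arithmetic: weight size is roughly parameter count times bits per weight. The sketch below uses approximate bits-per-weight values (about 4.85 for Q4_K_M, 8.5 for Q8_0); treat them as rough averages, since real GGUF files vary a little by architecture, and remember that the KV cache and compute buffers need VRAM on top of the weights.

```python
# Approximate bits per weight; real files vary slightly by model architecture.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5}


def estimate_weights_gb(params_billions, quant):
    """Estimate GGUF weight size in GB (10^9 bytes), excluding KV cache."""
    total_bits = params_billions * 1e9 * BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 1e9


for quant in ("Q4_K_M", "Q8_0"):
    print(quant, round(estimate_weights_gb(7, quant), 2), "GB")
```

The same function gives a quick feasibility check for larger models against the 24GB of an RTX 3090 before you download anything.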

Out of GPU memory

Symptoms: CUDA out of memory errors

Solutions:

Reduce --gpu-layers parameter:

# Fewer layers on GPU = less VRAM
./llama-server --gpu-layers 20

Use CPU offload for some layers:

./llama-server --gpu-layers 10 --threads 8

Try a smaller model (7B instead of 14B)

“model not found” on `/v1/embeddings`

Symptoms: API returns an error like:

The provided model=BAAI/bge-m3 has not been found ... use model=intfloat/multilingual-e5-large-instruct instead.

Root cause: Your embedding backend is serving intfloat/multilingual-e5-large-instruct (1024-d vectors), but configuration files reference a different model.

Solutions:

Update embedding model everywhere to match what your endpoint serves:
- intfloat/multilingual-e5-large-instruct
- Vector dimensions: 1024

Check all config locations:

# Honcho config
grep -r "BAAI/bge-m3" ~/honcho/

# Hermes config
grep -r "embedding" ~/.hermes/config.yaml

# Environment files
grep -r "EMBEDDING_MODEL" ~/.hermes/ ~/honcho/

Restart services that cache config:

cd ~/honcho && docker compose restart api deriver
sudo systemctl restart hermes-gateway

Verify the fix:

curl -X POST https://api.lucasnicolas.dev/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"intfloat/multilingual-e5-large-instruct","input":"test"}'

Important: The embedding model name must exactly match what your backend serves. If unsure, check your embedding server logs or test with different model names.
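A mismatch in vector dimensions is as damaging as a wrong model name, so it is worth checking the length of what the endpoint returns. The helper below is a sketch that assumes the standard OpenAI-style embeddings response (`data[i].embedding`); `embedding_dimension` is a name I chose, and the sample payload is fabricated for illustration.

```python
def embedding_dimension(embeddings_response):
    """Return the vector length from an OpenAI-style /v1/embeddings response."""
    vectors = [item["embedding"] for item in embeddings_response["data"]]
    if not vectors:
        raise ValueError("response contains no embeddings")
    lengths = {len(vector) for vector in vectors}
    if len(lengths) != 1:
        raise ValueError(f"inconsistent vector lengths: {sorted(lengths)}")
    return lengths.pop()


# Fake response with a 1024-element vector, standing in for a real API reply.
fake_response = {"data": [{"index": 0, "embedding": [0.0] * 1024}]}
print(embedding_dimension(fake_response))
```

Feed it the parsed JSON from the curl call above and compare the result with the dimension configured in your vector store.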

Debugging Tips

# Test gateway endpoint directly from VPS
curl -X POST https://api.lucasnicolas.dev/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7
  }'

# Test embeddings route and model name
curl -X POST https://api.lucasnicolas.dev/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"intfloat/multilingual-e5-large-instruct","input":"health check"}'

# Check Tailscale network (upstream path from VPS proxy to home server)
tailscale ping 100.64.1.10
tailscale status

# Verify file permissions
ls -la ~/.hermes/.env
chmod 600 ~/.hermes/.env

# Monitor llama.cpp directly (on home server)
# Check for errors in the terminal output

Advanced Configuration

Adding More Tools

Edit ~/.hermes/config.yaml:

tools:
  enabled_toolsets:
    - terminal          # Shell commands
    - file              # File I/O operations
    - web               # Web search and extraction
    - delegate          # Subagent delegation
    - github            # GitHub repository access
    - browser           # Browser automation (requires API key)
    - cron              # Scheduled task management
    
  disabled_toolsets: []

Honcho Memory Integration

Hermes integrates with Honcho, a self-hosted memory system for long-term conversation context and semantic search.

Architecture:

Storage: PostgreSQL + Redis on the VPS (~/honcho directory)
Embeddings: Routed through api.lucasnicolas.dev/v1/embeddings to your embedding server at port 8081
Vector dimensions: 1024 (matches intfloat/multilingual-e5-large-instruct)

Verification:

# Check Hermes Honcho connection
hermes honcho status

# Expected output:
# ✓ Connection OK
#   - API: http://localhost:8000
#   - Embeddings: https://api.lucasnicolas.dev/v1/embeddings
#   - Dimensions: 1024

If Honcho shows connection issues:

Verify Honcho services are running:

cd ~/honcho && docker compose ps
# Expected: api, database, redis, deriver all healthy

Check embedding configuration:

# Ensure EMBEDDING_MODEL matches your backend
grep "EMBEDDING_MODEL" ~/honcho/.env
# Should be: intfloat/multilingual-e5-large-instruct

Restart Honcho if needed:

cd ~/honcho && docker compose restart api deriver

Configuring Scheduled Tasks

Hermes supports cron-like scheduling for automated tasks:

# In your Telegram chat, send:
/cron create "Daily backup" "0 3 * * *" "tar -czf /backup/home.tar.gz /home"
/cron list            # View all scheduled jobs
/cron pause "Daily backup"
/cron resume "Daily backup"
/cron remove "Daily backup"
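The schedule string uses the standard five-field cron layout, so `0 3 * * *` means minute 0 of hour 3 on every day. The sketch below only splits an expression into named fields; it does not validate ranges or compute run times, and `parse_cron` is a hypothetical helper rather than part of Hermes.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron(expression):
    """Split a five-field cron expression into a dict of named fields."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


print(parse_cron("0 3 * * *"))
```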

Custom Skill Sets

Create custom skills in ~/.hermes/skills/:

mkdir -p ~/.hermes/skills/my-skills

Example skill file (~/.hermes/skills/my-skills/deploy.md):

---
name: deploy-to-vps
description: Deploy code to production server
version: 1.0.0
---

## Deployment Workflow

1. Build the project
2. Test locally
3. Push to production branch
4. Restart services

Model Selection for llama.cpp

Model	GGUF File	VRAM (Q4_K_M)	Use Case
Qwen 2.5 7B	qwen2.5-7b-instruct-q4_k_m.gguf	~6GB	Fast, general purpose
Llama 3.1 8B	llama-3.1-8b-instruct-q4_k_m.gguf	~6GB	Reasoning, code
Mistral 7B	mistral-7b-instruct-v0.3-q4_k_m.gguf	~6GB	Code, reasoning
Qwen 2.5 14B	qwen2.5-14b-instruct-q4_k_m.gguf	~10GB	Complex reasoning
Qwen 2.5 32B	qwen2.5-32b-instruct-q4_k_m.gguf	~20GB	Research, deep analysis

Recommendation: Start with Qwen 2.5 7B or Llama 3.1 8B for most use cases.

Performance Tuning

# Optimal llama.cpp settings for RTX 3090 (24GB)
./llama-server \
  --model qwen2.5-14b-instruct-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --ctx-size 8192 \
  --threads 8 \
  --gpu-layers 40 \
  --batch-size 512 \
  --ubatch-size 128

Performance Considerations

Latency Breakdown

Component	Typical Value
Telegram API	50-150ms
Tailscale tunnel	10-50ms
VPS internal processing	<10ms
llama.cpp generation speed (7B, RTX 3090)	20-50 tokens/s (throughput, not latency)
Total first token latency	~200-500ms
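For anything longer than a sentence, generation speed matters far more than network overhead. A rough estimate of the full reply time is the fixed overhead plus output tokens divided by tokens per second. The numbers in the example (200 tokens, 35 tokens/s, 300ms overhead) are illustrative picks from within the ranges above, not measurements.

```python
def estimate_reply_seconds(output_tokens, tokens_per_second, overhead_ms):
    """Estimate total reply time: fixed overhead plus generation time."""
    return overhead_ms / 1000 + output_tokens / tokens_per_second


total = estimate_reply_seconds(output_tokens=200, tokens_per_second=35, overhead_ms=300)
print(f"{total:.1f} s")
```

In this example the network and Telegram overhead accounts for well under a tenth of the total, which is why the extra Tailscale hop is rarely the bottleneck.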

Cost Analysis

Component	Monthly Cost	Notes
Home GPU electricity	~$15-30	RTX 3090 idle + active usage
VPS (basic)	$5-20	DigitalOcean, Linode, etc.
Tailscale	Free	Personal plan, up to 3 users (device limits apply)
Total	~$20-50/month	vs. $100-300 for cloud GPU
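The electricity line is easy to recompute for your own situation: average power draw times hours times your tariff. The inputs below (150W average draw, $0.15 per kWh, running all day) are assumptions for illustration, so substitute your own measurements.

```python
def monthly_electricity_cost(avg_watts, hours_per_day, price_per_kwh, days=30):
    """Return the monthly electricity cost for a device with the given average draw."""
    kwh = avg_watts / 1000 * hours_per_day * days
    return kwh * price_per_kwh


cost = monthly_electricity_cost(avg_watts=150, hours_per_day=24, price_per_kwh=0.15)
print(f"${cost:.2f} per month")
```

A wall-socket power meter gives a much better `avg_watts` figure than the GPU's rated power, since the card idles most of the time in a single-user setup.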

Optimization Tips

Use quantized models: Q4_K_M offers best quality/size ratio
Batch requests: Combine multiple queries when possible
Cache responses: Hermes supports prompt caching for repeated queries
Keep context small: Smaller contexts = faster inference
Use GPU layers wisely: Balance GPU/CPU offload for your hardware

Conclusion

This remote Hermes setup gives you the best of both worlds:

Reliability: VPS provides 24/7 availability and public IP
Privacy: Model inference stays on your home network instead of a third-party LLM API
Cost: Home GPU is far cheaper than cloud alternatives
Flexibility: Easy to modify, extend, and customize
Performance: llama.cpp is lean and fast with minimal overhead
Memory: Honcho provides semantic search and long-term context

The key components are Tailscale for secure networking, api.lucasnicolas.dev as a unified gateway, and understanding the trade-offs between latency and cost.

Why This Setup Works

llama.cpp + GGUF: Leverages Unsloth-quantized models without format conversion
Single HTTPS endpoint: api.lucasnicolas.dev/v1 abstracts away backend complexity
Separate embedding route: Keeps CPU-intensive embeddings off the main inference pipeline
Self-hosted memory: Honcho runs locally with 1024-d vectors matching your embedding model

Next Steps

Start small: Begin with a 7B Q4_K_M model and basic tools
Monitor performance: Use the built-in logging to track latency
Iterate: Add features gradually as you become comfortable
Share: Contribute back to the Hermes community with your improvements

Written by: Lucas Nicolas
Last Updated: April 12, 2026
License: MIT
Source Code: github.com/lucas-nicolas-viseo/portfolio

This article was written using the Hermes agent via Telegram, demonstrating exactly the kind of workflow this architecture enables.
