Over the past few years, the cost of training large language models (LLMs) has skyrocketed. Models like GPT-4 are estimated to cost $20M–$100M+ just to train once, with projections of $1B per run by 2027. Even “smaller” foundation models like GPT-3 required roughly $4.6M in compute.
That’s out of reach for nearly every company. But the good news? You don’t need to train a new LLM from scratch to harness AI in your business. Instead, you can run existing models locally and pair them with a vector database to bring in your company’s knowledge.
This approach — Retrieval Augmented Generation (RAG) — is how many startups and internal tools are building practical, affordable AI systems today.
Training vs. Using LLMs
- Training from scratch
  - Requires thousands of GPUs, months of compute, and millions of dollars.
  - Only feasible for major labs (OpenAI, Anthropic, DeepMind, etc.).
- Running + fine-tuning existing models
  - Can be done on commodity cloud servers, or even a laptop.
  - Cost can drop from millions to just hundreds or thousands of dollars.
The trick: instead of teaching a model everything, let it “look things up” in your own database of knowledge.
Ollama: Running LLMs Locally
Ollama makes it easy to run open-source LLMs on your own hardware.
- It supports models like LLaMA, Mistral, and Gemma.
- You can run it on a laptop (Mac/Windows/Linux) or in a Docker container. I like to run it in Docker on my machine; it's the easiest way to control costs while building and testing.
- It exposes a simple HTTP API, so developers can wire it into their own applications (see the Python sketch below the shell commands).
Instead of paying per token to OpenAI or Anthropic, you run the models yourself, with predictable costs.
# Example: pull and run LLaMA 3.2 with Ollama
ollama pull llama3.2
ollama run llama3.2
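Once a model is pulled, Ollama also serves it over a local HTTP API (port 11434 by default). Here's a minimal Python sketch of calling that endpoint with the requests library; the model name and prompt are placeholders, so adapt them to whatever you're running.

import requests

# Ollama serves a local HTTP API on port 11434 by default
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",                 # any model you've pulled
        "prompt": "Explain RAG in one sentence.",
        "stream": False,                     # return one JSON object instead of a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])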
Supabase: Your Vector Database
When you add RAG into the mix, you need somewhere to store embeddings of your documents. That’s where Supabase comes in:
- Supabase is a Postgres-based platform with the pgvector extension built in.
- You can store text embeddings (numerical representations of text meaning).
- With SQL or RPC calls, you can run similarity searches (using the <-> distance operator) to fetch the most relevant chunks of data.
For example, a table to hold your embedded FAQs, plus a query to search them:
-- Enable pgvector (once per database)
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id bigserial PRIMARY KEY,
  content text,
  embedding vector(1536) -- dimension must match your embedding model (OpenAI: 1536, nomic-embed-text: 768)
);

-- Search for relevant documents
-- (query_embedding stands for a one-row table or CTE holding the embedded user question)
SELECT content
FROM documents
ORDER BY embedding <-> (SELECT embedding FROM query_embedding)
LIMIT 5;
This gives your LLM the ability to retrieve your data before generating answers.
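To make the retrieval step concrete, here's a rough Python sketch, assuming the documents table above, a Postgres connection string from your Supabase project settings (the SUPABASE_DB_URL variable is a placeholder), and that you've pulled nomic-embed-text with Ollama. Note that nomic-embed-text produces 768-dimensional vectors, so the column would be declared vector(768) in that setup.

import os

import psycopg2
import requests

# Connection string from your Supabase project settings (placeholder env var)
DATABASE_URL = os.environ["SUPABASE_DB_URL"]

def embed(text: str) -> list[float]:
    """Embed text with the locally running nomic-embed-text model."""
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

def search(query: str, limit: int = 5) -> list[str]:
    """Return the most similar document chunks to the query."""
    # pgvector accepts a bracketed text literal such as '[0.1,0.2,...]'
    vector_literal = "[" + ",".join(str(x) for x in embed(query)) + "]"
    conn = psycopg2.connect(DATABASE_URL)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT content FROM documents "
                "ORDER BY embedding <-> %s::vector LIMIT %s",
                (vector_literal, limit),
            )
            return [row[0] for row in cur.fetchall()]
    finally:
        conn.close()

print(search("What's our refund policy?"))

The same query could also live in a Postgres function and be called through Supabase's RPC interface; raw SQL over a direct connection is simply the shortest way to show the idea.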
RAG in Action: The Flow
- User asks a question → “What’s our refund policy?”
- System embeds the query using nomic-embed-text or OpenAI embeddings.
- Supabase vector search finds the closest matching policy docs.
- Ollama LLM uses both the question + retrieved context to generate a grounded answer.
Result: Instead of the model hallucinating, it answers confidently with your company’s real data.
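In code, the last two steps of that flow are little more than prompt assembly plus one more call to Ollama. This sketch reuses the hypothetical embed()/search() helpers (and the requests import) from the earlier example, so treat it as an illustration rather than a drop-in implementation.

def answer(question: str) -> str:
    """Retrieve relevant chunks, then ask the local LLM to answer from them."""
    context = "\n\n".join(search(question))  # search() from the earlier sketch
    prompt = (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2", "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(answer("What's our refund policy?"))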
Cost Reality Check
- Training GPT-4: $50M+
- Running Ollama with a 7B–13B parameter model: a few hundred dollars per month in cloud compute, or no marginal cost on hardware you already own.
- Using Supabase for vector search: low monthly costs, scales with usage.
For most businesses, this approach costs a tiny fraction of training a model from scratch and is far faster to implement.
Final Thoughts
Building your own GPT-4 is impossible for most organizations. But by combining:
- Ollama (local LLM runtime)
- Supabase + pgvector (semantic search layer)
- RAG pipelines
…you can get the power of custom AI at a fraction of the cost.
The future isn’t about every company training billion-dollar models — it’s about smart teams leveraging open-source LLMs and vector databases to make AI truly useful inside their workflows.
Interested in this for your company? Feel free to reach out on LinkedIn. I've built systems like this for Modere and for a freelance client, and I'd be happy to build one for you.