-
An upcoming feature in ramalama is that when we run a model, a generic proxy server is kicked off, akin to https://github.com/ericcurtin/anythingproxy, which essentially forks a server (like llama-server) with the requested model and returns the result. The goal here is to be more ollama-like. I see this as an extension of that proxy server once implemented.
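For illustration only, here is a minimal sketch of that pattern (not the anythingproxy or ramalama implementation): an HTTP proxy that lazily forks a llama-server for a requested model and forwards requests to it. The ports, the "model" field carrying a GGUF path, and the single-backend handling are all assumptions.

```python
# Minimal sketch only: an on-demand proxy that forks a llama-server for a
# requested model and forwards requests to it. Ports, paths, and the
# single-backend handling are assumptions, not ramalama/anythingproxy code.
import json
import subprocess
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BACKEND_PORT = 8081  # where the forked llama-server listens (assumed)
backend = None       # handle to the forked llama-server process


def ensure_backend(model_path):
    """Fork a llama-server for the requested model if one is not already running."""
    global backend
    if backend is None or backend.poll() is not None:
        backend = subprocess.Popen(
            ["llama-server", "-m", model_path, "--port", str(BACKEND_PORT)]
        )
        time.sleep(5)  # crude wait; a real proxy would poll the server instead


class ProxyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        # Assume the "model" field carries a local GGUF path; a real proxy
        # would resolve model names from its own store instead.
        ensure_backend(json.loads(body).get("model", "model.gguf"))
        req = urllib.request.Request(
            f"http://127.0.0.1:{BACKEND_PORT}{self.path}",
            data=body,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            data = resp.read()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(data)


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), ProxyHandler).serve_forever()
```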
-
Yes, we want to get to the point where we are running separate services, containerized or not: basically have a chatbot in a container talk to a llama-server in a container or on the host, and talk to a RAG service in a container, while keeping the CLI as simple as possible to run multiple services simultaneously or take advantage of services that already exist. How we do this in the CLI is open for comments.
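As a hedged example of that split, here is a small client that a chatbot container could use to talk to a llama-server running elsewhere via its OpenAI-compatible endpoint; the host alias and port are assumptions for illustration.

```python
# Sketch of a chatbot-in-a-container talking to a llama-server in another
# container or on the host. Assumes the server listens on port 8080 and is
# reachable via Podman's host.containers.internal alias; both are assumptions.
import json
import urllib.request

LLAMA_SERVER = "http://host.containers.internal:8080"


def chat(prompt):
    """Send one user message to llama-server's OpenAI-compatible chat endpoint."""
    payload = json.dumps(
        {"messages": [{"role": "user", "content": prompt}]}
    ).encode()
    req = urllib.request.Request(
        f"{LLAMA_SERVER}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("Hello from the chatbot container"))
```

A RAG service in a third container could talk to the same endpoint in the same way.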
-
I have 3 desktop machines with "high"-bandwidth dual-port network cards, and I am looking for a solution that can also manage speculative decoding. I do not really need a p2p solution; specifically, I want to avoid accidentally using the low-bandwidth network interface. If this kind of inference is within the project scope, I can try to look into what is needed to make it work.
-
This is an interesting paper too: https://arxiv.org/abs/2401.10774. Basic PoC for llama.cpp RPC: #1238. We might want to consider two main cluster types, static and dynamic. For a static/dynamic cluster we might have JSON files at <ramalama_config_dir>/cluster/<cluster_name>.json, and we might need a per-engine directory or section. There are many special cases and needs.
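Purely as a hypothetical sketch of that idea (none of these keys or paths exist in ramalama today), a static cluster file and a loader for it might look like:

```python
# Hypothetical only: a static cluster definition stored at
# <ramalama_config_dir>/cluster/<cluster_name>.json and a loader for it.
# All keys below are invented for illustration.
import json
import os

EXAMPLE_CLUSTER = {
    "name": "homelab",
    "engine": "llama.cpp",   # per-engine section, as suggested above
    "type": "static",        # vs. "dynamic", discovered at runtime
    "nodes": [
        {"host": "10.0.0.1", "port": 50052, "role": "rpc-worker"},
        {"host": "10.0.0.2", "port": 50052, "role": "rpc-worker"},
    ],
}


def load_cluster(config_dir, cluster_name):
    """Read a cluster definition from <config_dir>/cluster/<cluster_name>.json."""
    path = os.path.join(config_dir, "cluster", f"{cluster_name}.json")
    with open(path) as f:
        return json.load(f)
```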
-
Hi - I run Ramalama as well as Ollama on my local machines. GPU/CPU detection works well with Ramalama.
As you can imagine, even though I have an RTX on one of my local machines, models above 70b, or even 32b, usually struggle.
I wonder if Ramalama could support a distributed approach like Exo(Labs) or LocalAI do. I could imagine distributing the pods across local machines, but also pushing them towards a public cloud, all monitored by e.g. Podman Desktop. Side note: I run Talos Linux at home on 3 machines, for instance, and it would be very interesting to run the pods locally but also push them to the cloud to run the models there.
Afaik, LocalAI can offer model access to an LLM runner via e.g.
local-ai run ollama://gemma:2b
which is very basic. On the other hand, LocalAI backends are internally implemented as gRPC services, allowing connection to external gRPC services via the --external-grpc-backends parameter. This suggests LocalAI can extend its functionality via third-party gRPC binaries, with examples like vllm. However, Ollama's API is REST-based, not gRPC. This mismatch means Ollama cannot be directly used as an external backend in LocalAI without additional development, such as creating a gRPC wrapper for Ollama's REST API.
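To make that "additional development" concrete, the REST half of such a wrapper is the easy part; the sketch below only shows the call into Ollama's documented /api/generate endpoint, while the surrounding gRPC service that --external-grpc-backends would expect is deliberately omitted.

```python
# Sketch of the REST half a gRPC wrapper for Ollama would need: translate an
# incoming prediction call into Ollama's /api/generate. The gRPC service
# definition LocalAI expects is not shown here.
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # Ollama's default listen address


def ollama_generate(model, prompt):
    """Call Ollama's REST API and return the generated text."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```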