-
An upcoming feature in ramalama is that when we run a model, a generic proxy server is kicked off, akin to https://github.com/ericcurtin/anythingproxy, which essentially forks a server (like llama-server) with the requested model and returns the result. The goal here is to be more ollama-like. I see this as an extension of that proxy server once implemented.
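For illustration only, here is a minimal sketch of that pattern (not the anythingproxy or ramalama implementation): an HTTP proxy that lazily forks a llama-server for a requested model and forwards requests to it. The ports, the "model" field carrying a GGUF path, and the single-backend handling are all assumptions.

```python
# Minimal sketch only: an on-demand proxy that forks a llama-server for a
# requested model and forwards requests to it. Ports, paths, and the
# single-backend handling are assumptions, not ramalama/anythingproxy code.
import json
import subprocess
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BACKEND_PORT = 8081  # where the forked llama-server listens (assumed)
backend = None       # handle to the forked llama-server process


def ensure_backend(model_path):
    """Fork a llama-server for the requested model if one is not already running."""
    global backend
    if backend is None or backend.poll() is not None:
        backend = subprocess.Popen(
            ["llama-server", "-m", model_path, "--port", str(BACKEND_PORT)]
        )
        time.sleep(5)  # crude wait; a real proxy would poll the server instead


class ProxyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        # Assume the "model" field carries a local GGUF path; a real proxy
        # would resolve model names from its own store instead.
        ensure_backend(json.loads(body).get("model", "model.gguf"))
        req = urllib.request.Request(
            f"http://127.0.0.1:{BACKEND_PORT}{self.path}",
            data=body,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            data = resp.read()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(data)


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), ProxyHandler).serve_forever()
```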
-
Yes, we want to get to the point where we are running separate services, containerized or not: basically have a chatbot in a container talk to a llama-server in a container or on the host, and talk to a RAG service in a container, while keeping the CLI as simple as possible to run multiple services simultaneously or take advantage of services that already exist. How we do this in the CLI is open for comments.
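As a hedged example of that split, here is a small client that a chatbot container could use to talk to a llama-server running elsewhere via its OpenAI-compatible endpoint; the host alias and port are assumptions for illustration.

```python
# Sketch of a chatbot-in-a-container talking to a llama-server in another
# container or on the host. Assumes the server listens on port 8080 and is
# reachable via Podman's host.containers.internal alias; both are assumptions.
import json
import urllib.request

LLAMA_SERVER = "http://host.containers.internal:8080"


def chat(prompt):
    """Send one user message to llama-server's OpenAI-compatible chat endpoint."""
    payload = json.dumps(
        {"messages": [{"role": "user", "content": prompt}]}
    ).encode()
    req = urllib.request.Request(
        f"{LLAMA_SERVER}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("Hello from the chatbot container"))
```

A RAG service in a third container could talk to the same endpoint in the same way.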
-
I have 3 desktop machines with "high"-bandwidth dual-port network cards, and I am looking for a solution that can also manage speculative decoding. I do not really need a p2p solution; specifically, I want to avoid accidentally using the low-bandwidth network interface. If this kind of inference is within the project scope, I can try to look into what is needed to make it work.
-
This is an interesting paper too: https://arxiv.org/abs/2401.10774. Basic PoC for llama.cpp RPC: #1238. We might want to consider two main cluster types, static and dynamic. For a static/dynamic cluster we might have JSON files at <ramalama_config_dir>/cluster/<cluster_name>.json, and we might need a per-engine directory or section. There are many special cases and needs.
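Purely as a hypothetical sketch of that idea (none of these keys or paths exist in ramalama today), a static cluster file and a loader for it might look like:

```python
# Hypothetical only: a static cluster definition stored at
# <ramalama_config_dir>/cluster/<cluster_name>.json and a loader for it.
# All keys below are invented for illustration.
import json
import os

EXAMPLE_CLUSTER = {
    "name": "homelab",
    "engine": "llama.cpp",   # per-engine section, as suggested above
    "type": "static",        # vs. "dynamic", discovered at runtime
    "nodes": [
        {"host": "10.0.0.1", "port": 50052, "role": "rpc-worker"},
        {"host": "10.0.0.2", "port": 50052, "role": "rpc-worker"},
    ],
}


def load_cluster(config_dir, cluster_name):
    """Read a cluster definition from <config_dir>/cluster/<cluster_name>.json."""
    path = os.path.join(config_dir, "cluster", f"{cluster_name}.json")
    with open(path) as f:
        return json.load(f)
```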
-
Hi - I run Ramalama as well as Ollama on my local machines. GPU/CPU detection works well with Ramalama.
As you can imagine, even though I have an RTX on one of my local machines, models above 70b, or even 32b, usually struggle.
I wonder if Ramalama could support a distributed approach like Exo(Labs) or LocalAI do. I could imagine distributing the pods across local machines, but also pushing them towards a public cloud, all monitored by e.g. Podman Desktop. Side note: I run Talos Linux at home on 3 machines, for instance, and it would be very interesting to run the pods locally but also push them to the cloud to run the models there.
Afaik, LocalAI can offer model access to an LLM runner via e.g.
local-ai run ollama://gemma:2b
which is very basic. On the other hand, LocalAI backends are internally implemented as gRPC services, allowing connection to external gRPC services via the --external-grpc-backends parameter. This suggests LocalAI can extend its functionality via third-party gRPC binaries, with examples like vllm. However, Ollama's API is REST-based, not gRPC. This mismatch means Ollama cannot be directly used as an external backend in LocalAI without additional development, such as creating a gRPC wrapper for Ollama's REST API.
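To make that "additional development" concrete, the REST half of such a wrapper is the easy part; the sketch below only shows the call into Ollama's documented /api/generate endpoint, while the surrounding gRPC service that --external-grpc-backends would expect is deliberately omitted.

```python
# Sketch of the REST half a gRPC wrapper for Ollama would need: translate an
# incoming prediction call into Ollama's /api/generate. The gRPC service
# definition LocalAI expects is not shown here.
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # Ollama's default listen address


def ollama_generate(model, prompt):
    """Call Ollama's REST API and return the generated text."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```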