Replace cortex.llamacpp with minimalist fork of llama.cpp #1728

Closed · 8 tasks · dan-menlo opened this issue Nov 26, 2024 · 7 comments


@dan-menlo
Contributor

dan-menlo commented Nov 26, 2024

Goal

  • Goal: Can we have a minimalist fork of llama.cpp as llamacpp-engine
    • cortex.cpp's desktop focus means Drogon's features are unused
    • We should contribute our vision and multimodal work upstream as a form of llama.cpp server
    • Very clear Engines abstraction (e.g. to support OpenVINO etc. in the future)
  • Goal: Contribute upwards to llama.cpp
    • Vision, multimodal
    • May not be possible if the vision and audio encoders are Python-runtime-based

Can we consider refactoring llamacpp-engine to use the upstream server implementation and maintaining a fork with our improvements to speech, vision, etc.? This is especially relevant if we do a C++ implementation of whisperVQ in the future.

Potential issues

  • cortex engines llama.cpp update -> updates llama.cpp
    • We still need to build avx-512 variants for janhq/llama.cpp (i.e. build scripts)
    • We should align the janhq/llama.cpp release names with ggml-org/llama.cpp
    • Trigger automatic CI/CD to build
    • We can also ask GG if we can donate compute towards builds
  • Deprecating llava support
  • Handling existing API parameters (logit_bias, n, etc.) either by upstreaming them or implementing them in the Cortex server
  • Update Documentation
  • DevRel @ramonpzg
    • Cortex builds on llamacpp-server (and we will contribute in the future)
    • Why do we need to build so many different variants of llama.cpp (AVX-512, AVX2, etc.)?
    • GG -> can we contribute Menlo Cloud to the llama.cpp project (built on Intel CPUs)?

Key Changes

  • Use llama-server instead of the Drogon server we currently use in cortex.llamacpp
  • Use a spawned llama.cpp process instead of a dylib (better stability and parallelism)
    • However, we will effectively need to build a process manager (a rough sketch follows below)
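A minimal POSIX-only sketch of what the spawn side of such a process manager could look like. The binary and model paths are placeholders, and how cortex resolves them is an assumption; the -m/--host/--port flags are standard llama-server options.

// Minimal POSIX sketch: spawn llama-server as a child process.
// Paths here are placeholders; -m/--host/--port are standard llama-server flags.
#include <stdexcept>
#include <string>
#include <vector>
#include <sys/types.h>
#include <unistd.h>

pid_t SpawnLlamaServer(const std::string& binary,
                       const std::string& model_path,
                       int port) {
  std::vector<std::string> args = {binary, "-m", model_path,
                                   "--host", "127.0.0.1",
                                   "--port", std::to_string(port)};
  std::vector<char*> argv;
  for (auto& a : args) argv.push_back(a.data());
  argv.push_back(nullptr);

  pid_t pid = fork();
  if (pid == 0) {                 // child: replace the process image with llama-server
    execv(binary.c_str(), argv.data());
    _exit(127);                   // only reached if execv failed
  }
  if (pid < 0) throw std::runtime_error("fork() failed");
  return pid;                     // parent keeps the pid for lifecycle control
}

On Windows the same role would fall to CreateProcess; in either case the returned pid or handle is what the process manager has to track.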
@dan-menlo dan-menlo added the type: epic A major feature or initiative label Nov 26, 2024
@dan-menlo dan-menlo added this to Menlo Nov 26, 2024
@github-project-automation github-project-automation bot moved this to Investigating in Menlo Nov 26, 2024
@vansangpfiev
Contributor

I agree that we should align with the llama.cpp upstream, but I have several concerns:

  • Drogon is part of cortex.cpp; we have already removed it from the llama-cpp engine. If we remove Drogon from cortex.cpp as well, we will need to find a replacement, which will be costly.
  • Repository Structure: Forking the server implementation will necessitate changes to our repository structure, since we currently use llama.cpp as a submodule.
  • Our current version differs significantly from the upstream version, which will require considerable time for refactoring.

@gabrielle-ong gabrielle-ong added this to the v1.0.5 milestone Nov 28, 2024
@gabrielle-ong gabrielle-ong removed this from the v1.0.5 milestone Nov 28, 2024
@github-project-automation github-project-automation bot moved this from Investigating to QA in Menlo Dec 15, 2024
@dan-menlo dan-menlo reopened this Dec 15, 2024
@github-project-automation github-project-automation bot moved this from QA to In Progress in Menlo Dec 15, 2024
@dan-menlo dan-menlo changed the title epic: llamacpp-engine to align with llama.cpp upstream roadmap: llamacpp-engine to align with llama.cpp upstream Dec 15, 2024
@vansangpfiev
Contributor

vansangpfiev commented Dec 23, 2024

Tasklist:

  • Fork llama.cpp and try to use llama.cpp server
  • Split vision model flow with chat model flow
  • llama.cpp server should be a new process
  • Test with OpenAI API compatible features that Alex and James added
  • CI
  • Docs

Related tickets need to be tested and verified:

Approach 1: cortex.llamacpp spawns llama.cpp server as a new process

  • pros: can directly use llama.cpp server binary
  • cons: spawning a new process is expensive, and the process lifetime is harder to control (see the liveness-check sketch below)

(diagram omitted)
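On the lifetime-control concern: the parent has to notice when the spawned server exits on its own. A minimal POSIX sketch of a non-blocking liveness check (illustrative only, not the actual cortex.llamacpp code):

// Non-blocking liveness check for a spawned llama-server child (POSIX sketch).
#include <sys/types.h>
#include <sys/wait.h>

bool ChildStillRunning(pid_t pid) {
  int status = 0;
  pid_t r = waitpid(pid, &status, WNOHANG);
  if (r == 0) return true;   // child exists and has not changed state yet
  return false;              // r == pid: exited or was signalled; r == -1: no such child
}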

Approach 2: Build llama.cpp server as a library and load it into cortex.llamacpp process

  • pros: llama.cpp server can be embedded into cortex.llamacpp
  • cons: we will need to apply a patch to build the llama.cpp server as a library (see the loading sketch below)

(diagrams omitted)
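A sketch of what loading a patched, library-built llama.cpp server could look like. The library name and the llama_server_start entry point are hypothetical; upstream does not ship the server as a library, which is exactly why the patch mentioned above would be needed.

// Approach 2 sketch: load a (hypothetical) llama-server shared library into
// the cortex.llamacpp process and resolve its entry point.
#include <dlfcn.h>
#include <stdexcept>
#include <string>

using ServerStartFn = int (*)(int argc, char** argv);

ServerStartFn LoadServerEntryPoint(const std::string& lib_path) {
  void* handle = dlopen(lib_path.c_str(), RTLD_NOW | RTLD_LOCAL);
  if (handle == nullptr) throw std::runtime_error(dlerror());

  // "llama_server_start" is a made-up symbol for illustration.
  auto fn = reinterpret_cast<ServerStartFn>(dlsym(handle, "llama_server_start"));
  if (fn == nullptr) throw std::runtime_error("entry point not found");
  return fn;  // the caller would run this on its own thread inside cortex.llamacpp
}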

@vansangpfiev vansangpfiev moved this from In Progress to Eng Review in Menlo Jan 2, 2025
@dan-menlo dan-menlo moved this from Eng Review to QA in Menlo Jan 10, 2025
@TC117 TC117 added this to the v1.0.9 milestone Jan 13, 2025
@TC117

TC117 commented Jan 16, 2025

Engine variants are not listed on Windows the way they are on Linux:
(screenshot omitted)
Working on Linux:
(screenshot omitted)

Can't load a model with the CUDA variant:

PS C:\WINDOWS\system32> cortex-nightly.exe run tinyllama
Starting server ...
Set log level to INFO
Host: 127.0.0.1 Port: 39281
Server started
API Documentation available at: http://127.0.0.1:39281
Model failed to start: Failed to load model
Error: Failed to start model
PS C:\WINDOWS\system32>

@TC117 TC117 moved this from QA to Completed in Menlo Feb 6, 2025
@vansangpfiev vansangpfiev modified the milestones: v1.0.9, v1.0.12 Mar 11, 2025
@vansangpfiev vansangpfiev moved this from Completed to In Progress in Menlo Mar 11, 2025
@vansangpfiev
Contributor

vansangpfiev commented Mar 11, 2025

Specs changes:

  • Move all OpenAI API compatibility from engine to cortex
  • Spawn llama-server (and maybe llava as well) directly from cortex (a command-line sketch follows below)
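A sketch of the command line cortex could hand to the spawned llama-server. The ModelSettings struct and how it is filled from model.yaml are assumptions; the flags themselves (-m, --host, --port, -c, -ngl) are real llama-server options.

// Sketch: build the argument list for the llama-server process cortex spawns.
#include <string>
#include <vector>

struct ModelSettings {        // assumed shape; real values would come from model.yaml
  std::string model_path;     // path to the .gguf file
  int port;                   // port the child server listens on
  int ctx_size;               // context length
  int gpu_layers;             // layers to offload to the GPU
};

std::vector<std::string> BuildLlamaServerArgs(const std::string& binary,
                                              const ModelSettings& s) {
  return {binary,
          "-m",     s.model_path,
          "--host", "127.0.0.1",
          "--port", std::to_string(s.port),
          "-c",     std::to_string(s.ctx_size),
          "-ngl",   std::to_string(s.gpu_layers)};
}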

Diagram:

(diagram omitted)

Task list:

  • Fork of llama.cpp
  • Engine management (Linux, Windows, macOS)
  • OpenAI API compatibility
  • CIs
  • Verify with QA checklists
  • Docs

Engine variants:

  • ubuntu-arm64
  • linux-avx-cuda-cu11.7-x64
  • linux-avx-cuda-cu12.0-x64
  • linux-avx-x64
  • linux-avx2-cuda-cu11.7-x64
  • linux-avx2-cuda-cu12.0-x64
  • ubuntu-x64
  • linux-avx512-cuda-cu11.7-x64
  • linux-avx512-cuda-cu12.0-x64
  • linux-avx512-x64
  • linux-noavx-cuda-cu11.7-x64
  • linux-noavx-cuda-cu12.0-x64
  • linux-noavx-x64
  • ubuntu-vulkan-x64
  • macos-arm64
  • macos-x64
  • win-avx-cuda-cu11.7-x64
  • win-avx-cuda-cu12.0-x64
  • win-avx-x64
  • win-avx2-cuda-cu11.7-x64
  • win-avx2-cuda-cu12.0-x64
  • win-avx2-x64
  • win-avx512-cuda-cu11.7-x64
  • win-avx512-cuda-cu12.0-x64
  • win-avx512-x64
  • win-noavx-cuda-cu11.7-x64
  • win-noavx-cuda-cu12.0-x64
  • win-noavx-x64
  • win-vulkan-x64

For macOS, we'd like to support macos-12, so we implemented a filter for the llama.cpp server in cortex.

@ramonpzg ramonpzg added this to Jan Mar 13, 2025
@dan-menlo dan-menlo changed the title roadmap: llamacpp-engine to align with llama.cpp upstream epic: Replace cortex.llamacpp with minimalist fork of llama.cpp Mar 13, 2025
@vansangpfiev vansangpfiev moved this to In Progress in Jan Mar 17, 2025
@ramonpzg ramonpzg changed the title epic: Replace cortex.llamacpp with minimalist fork of llama.cpp Replace cortex.llamacpp with minimalist fork of llama.cpp Mar 18, 2025
@ramonpzg ramonpzg removed this from Menlo Mar 18, 2025
@ramonpzg ramonpzg modified the milestones: v1.0.12, Caffeinated Sloth Mar 18, 2025
@github-project-automation github-project-automation bot moved this to Investigating in Menlo Mar 18, 2025
@ramonpzg ramonpzg added epic and removed type: epic A major feature or initiative labels Mar 18, 2025
@dan-menlo
Contributor Author

  • Can ship early next week

@vansangpfiev vansangpfiev moved this from Investigating to In Progress in Menlo Mar 24, 2025
@vansangpfiev
Contributor

vansangpfiev commented Mar 25, 2025

QA-checklist

OS

  • Windows 11
  • Ubuntu 24, 22
  • Mac Silicon OS 14/15
  • Mac Intel

Engine variant:

  • ubuntu-arm64
  • linux-avx-cuda-cu11.7-x64
  • linux-avx-cuda-cu12.0-x64
  • linux-avx-x64
  • linux-avx2-cuda-cu11.7-x64
  • linux-avx2-cuda-cu12.0-x64
  • ubuntu-x64
  • linux-avx512-cuda-cu11.7-x64
  • linux-avx512-cuda-cu12.0-x64
  • linux-avx512-x64
  • linux-noavx-cuda-cu11.7-x64
  • linux-noavx-cuda-cu12.0-x64
  • linux-noavx-x64
  • ubuntu-vulkan-x64
  • macos-arm64
  • macos-x64
  • win-avx-cuda-cu11.7-x64
  • win-avx-cuda-cu12.0-x64
  • win-avx-x64
  • win-avx2-cuda-cu11.7-x64
  • win-avx2-cuda-cu12.0-x64
  • win-avx2-x64
  • win-avx512-cuda-cu11.7-x64
  • win-avx512-cuda-cu12.0-x64
  • win-avx512-x64
  • win-noavx-cuda-cu11.7-x64
  • win-noavx-cuda-cu12.0-x64
  • win-noavx-x64
  • win-vulkan-x64

Scope:

CLI

Installation

  • it should install with local installer (default; no internet required during installation, all dependencies bundled)
  • it should install with network installer
  • it should install 2 binaries (cortex and cortex-server) [mac: binaries in /usr/local/bin]
  • it should install with correct folder permissions
  • it should install with folders: /engines /logs (no /models folder until model pull)
  • It should install with Docker image https://cortex.so/docs/installation/docker/

Engine management:

  • llama.cpp should be installed by default
  • it should run gguf models on llamacpp
  • it should list engines
  • it should get engines
  • it should install engines (latest version if not specified)
  • it should install engines (with specified variant and version)
  • it should get default engine
  • it should set default engine (with specified variant/version)
  • it should load engine
  • it should unload engine
  • it should update engine (to latest version)
  • it should update engine (to specified version)
  • it should uninstall engines
  • it should gracefully continue engine installation if interrupted halfway (partial download)
  • it should gracefully handle when users try to CRUD incompatible engines (No variant found for xxx)

Model Running

  • cortex run <cortexso model> - if no local models detected, shows pull model menu
  • cortex run - if local model detected, runs the local model
  • cortex run - if multiple local models detected, shows a list of local models (from multiple model sources, e.g. cortexso, HF authors) for users to select from (via regex search)
  • cortex run <invalid model id> should gracefully return Model not found!
  • run should autostart server
  • cortex run <model> starts interactive chat (by default)
  • cortex run <model> -d runs in detached mode
  • cortex models start <model>
  • terminating stdin or exit() should exit the interactive chat

API

Engine management

  • List engines: GET /v1/engines
  • Get engine: GET /v1/engines/{name}
  • Install engine: POST /v1/engines/install/{name}
  • Get default engine variant/version: GET /v1/engines/{name}/default
  • Set default engine variant/version: POST /v1/engines/{name}/default
  • Load engine: POST /v1/engines/{name}/load
  • Unload engine: DELETE /v1/engines/{name}/load
  • Update engine: POST /v1/engines/{name}/update
  • Uninstall engine: DELETE /v1/engines/install/{name}

Running Models

  • List models: GET /v1/models
  • Start model: POST /v1/models/start
  • Stop model: POST /v1/models/stop
  • Get model: GET /v1/models/{id}
  • Delete model: DELETE /v1/models/{id}
  • Update model: PATCH /v1/models/{model} updates model.yaml params

Additional requirements

  • Cortex spawns a new child process when starting a model
  • Cortex terminates the child process when stopping a model
  • Cortex terminates all child processes when it stops (see the shutdown sketch below)
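One way the last requirement could be met (POSIX-only sketch; the Windows path would use TerminateProcess): send SIGTERM to every tracked child, wait briefly, then fall back to SIGKILL and reap.

// Sketch: terminate all tracked child processes when cortex shuts down.
#include <chrono>
#include <thread>
#include <vector>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

void TerminateChildren(const std::vector<pid_t>& children) {
  for (pid_t pid : children) kill(pid, SIGTERM);        // ask politely first

  auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(5);
  for (pid_t pid : children) {
    while (waitpid(pid, nullptr, WNOHANG) == 0 &&
           std::chrono::steady_clock::now() < deadline) {
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    if (waitpid(pid, nullptr, WNOHANG) == 0) {          // still alive: force it
      kill(pid, SIGKILL);
      waitpid(pid, nullptr, 0);                         // reap to avoid zombies
    }
  }
}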

@vansangpfiev vansangpfiev moved this from In Progress to Eng Review in Menlo Mar 25, 2025
@vansangpfiev vansangpfiev moved this from Eng Review to QA in Menlo Mar 25, 2025
@david-menloai david-menloai self-assigned this Mar 25, 2025
@david-menloai

Testing completed; this should be closed.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Jan Apr 2, 2025