Different mul_mat results between METAL and CPU backends #1230

Open
emre-synnada opened this issue May 19, 2025 · 5 comments

@emre-synnada

emre-synnada commented May 19, 2025

Hi,

Thank you for the amazing work!
I’m seeing inconsistent results from the METAL and CPU backends when performing a matrix multiplication. I took the simple-backend.cpp example (code below), adapted it to run on both backends, and had it perform a single forward pass—each backend computes the same multiplication, and then the outputs are compared.

After debugging, I confirmed that both the CPU and METAL kernels are using 32-bit floats. While some small floating-point discrepancies are expected, the difference I’m observing after just one pass is far larger than any reasonable tolerance, and it grows even larger as the array sizes increase or when small random values are used. I’d appreciate your help identifying why the METAL backend is producing such a large error.

Thank you!

Result

First discrepancy at index 0:
CPU value: 22.046450
Metal value: 22.047918
Difference: 0.001469
Max difference: 0.001469
Assertion failed: (!found_diff && "CPU and Metal outputs differ!"), function main, file simple-backend.cpp, line 215.

simple-backend.cpp


#include "ggml.h"
#include "ggml-cpu.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"


#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <map>
#include <string>
#include <vector>

enum class DeviceType {
    CPU,
    METAL,
    CUDA
};

static void ggml_log_callback_default(ggml_log_level level, const char * text, void * user_data) {
    (void) level;
    (void) user_data;
    fputs(text, stderr);
    fflush(stderr);
}

// This is a simple model with two tensors a and b
struct simple_model {
    struct ggml_tensor * a;
    struct ggml_tensor * b;

    // the backend to perform the computation (CPU, CUDA, METAL)
    ggml_backend_t backend = NULL;

    // the backend buffer that stores the data of tensors a and b
    ggml_backend_buffer_t buffer;

    // the context to define the tensor information (dimensions, size, memory address)
    struct ggml_context * ctx;

    DeviceType device = DeviceType::CPU;

};

// initialize the tensors of the model, in this case two 2D matrices
void load_model(simple_model & model, float * a, float * b, int rows_A, int cols_A, int rows_B, int cols_B) {
    ggml_log_set(ggml_log_callback_default, nullptr);

    switch(model.device) {
        case DeviceType::METAL:
            fprintf(stderr, "%s: using Metal backend\n", __func__);
            model.backend = ggml_backend_metal_init();
            if (!model.backend) {
                fprintf(stderr, "%s: ggml_backend_metal_init() failed\n", __func__);
            }
            break;

        case DeviceType::CPU:
        default:
            if (!model.backend) {
                model.backend = ggml_backend_cpu_init();
            }
            break;
    }

    int num_tensors = 2;

    struct ggml_init_params params {
            /*.mem_size   =*/ ggml_tensor_overhead() * num_tensors,
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ true,
    };

    // create context
    model.ctx = ggml_init(params);

    // create tensors
    model.a = ggml_new_tensor_2d(model.ctx, GGML_TYPE_F32, cols_A, rows_A);
    model.b = ggml_new_tensor_2d(model.ctx, GGML_TYPE_F32, cols_B, rows_B);

    // create a backend buffer (backend memory) and alloc the tensors from the context
    model.buffer = ggml_backend_alloc_ctx_tensors(model.ctx, model.backend);

    // load data from cpu memory to backend buffer
    ggml_backend_tensor_set(model.a, a, 0, ggml_nbytes(model.a));
    ggml_backend_tensor_set(model.b, b, 0, ggml_nbytes(model.b));
}

// build the compute graph to perform a matrix multiplication
struct ggml_cgraph * build_graph(const simple_model& model) {
    static size_t buf_size = ggml_tensor_overhead()*GGML_DEFAULT_GRAPH_SIZE + ggml_graph_overhead();
    static std::vector<uint8_t> buf(buf_size);

    struct ggml_init_params params0 = {
        /*.mem_size   =*/ buf_size,
        /*.mem_buffer =*/ buf.data(),
        /*.no_alloc   =*/ true, // the tensors will be allocated later by ggml_allocr_alloc_graph()
    };

    // create a temporary context to build the graph
    struct ggml_context * ctx0 = ggml_init(params0);

    struct ggml_cgraph  * gf = ggml_new_graph(ctx0);

    // result = a*b^T
    struct ggml_tensor * result = ggml_mul_mat(ctx0, model.a, model.b);

    // build operations nodes
    ggml_build_forward_expand(gf, result);

    // delete the temporary context used to build the graph
    ggml_free(ctx0);
    return gf;
}

// compute with backend
struct ggml_tensor * compute(const simple_model & model, ggml_gallocr_t allocr) {
    // reset the allocator to free all the memory allocated during the previous inference

    struct ggml_cgraph * gf = build_graph(model);

    // allocate tensors
    ggml_gallocr_alloc_graph(allocr, gf);

    int n_threads = 1; // number of threads to perform some operations with multi-threading

    if (ggml_backend_is_cpu(model.backend)) {
        ggml_backend_cpu_set_n_threads(model.backend, n_threads);
    }

    ggml_backend_graph_compute(model.backend, gf);

    // in this case, the output tensor is the last one in the graph
    return ggml_graph_node(gf, -1);
}

std::vector<float> compute_graph(const simple_model & model){

    // calculate the temporary memory required to compute the graph
    ggml_gallocr_t allocr = NULL;

    {
        allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(model.backend));

        // create the worst case graph for memory usage estimation
        struct ggml_cgraph * gf = build_graph(model);
        ggml_gallocr_reserve(allocr, gf);
        size_t mem_size = ggml_gallocr_get_buffer_size(allocr, 0);

        fprintf(stderr, "%s: compute buffer size: %.4f KB\n", __func__, mem_size/1024.0);
    }

    // perform computation
    struct ggml_tensor * result = compute(model, allocr);

    // create an array to hold the result
    std::vector<float> out_data(ggml_nelements(result));

    // bring the data from the backend memory
    ggml_backend_tensor_get(result, out_data.data(), 0, ggml_nbytes(result));

    // release backend memory used for computation
    ggml_gallocr_free(allocr);

    // free memory
    ggml_free(model.ctx);

    // release backend memory and free backend
    ggml_backend_buffer_free(model.buffer);
    ggml_backend_free(model.backend);

    return out_data;
    
}

int main(void) {
    ggml_time_init();

    const int rows_A = 512, cols_A = 512;
    const int rows_B = 512, cols_B = 512;

    // Allocate matrices dynamically due to their large size
    std::vector<float> matrix_A(rows_A * cols_A,  0.12432f); 
    std::vector<float> matrix_B(rows_B * cols_B, 0.34636f); 

    simple_model model;
    model.device = DeviceType::CPU;
    load_model(model, matrix_A.data(), matrix_B.data(), rows_A, cols_A, rows_B, cols_B);
    std::vector<float> cpu_out = compute_graph(model);

    simple_model model2;
    model2.device = DeviceType::METAL;
    load_model(model2, matrix_A.data(), matrix_B.data(), rows_A, cols_A, rows_B, cols_B);
    std::vector<float> metal_out = compute_graph(model2);

    const float epsilon = 1e-5f;
    assert(cpu_out.size() == metal_out.size());
    bool found_diff = false;
    size_t first_diff_idx = 0;
    float max_diff = 0.0f;
    
    for (size_t i = 0; i < cpu_out.size(); i++) {
        float diff = std::abs(cpu_out[i] - metal_out[i]);
        max_diff = std::max(max_diff, diff);
        
        if (!found_diff && diff >= epsilon) {
            found_diff = true;
            first_diff_idx = i;
            fprintf(stderr, "First discrepancy at index %zu:\n", i);
            fprintf(stderr, "CPU value: %f\n", cpu_out[i]);
            fprintf(stderr, "Metal value: %f\n", metal_out[i]);
            fprintf(stderr, "Difference: %f\n", diff);
        }
    }
    
    fprintf(stderr, "Max difference: %f\n", max_diff);
    assert(!found_diff && "CPU and Metal outputs differ!");
    return 0;
}

@ggerganov
Member

        float diff = std::abs(cpu_out[i] - metal_out[i]);
        max_diff = std::max(max_diff, diff);
        
        if (!found_diff && diff >= epsilon) {

You should be looking at the relative difference, not the absolute. The absolute diff grows unbounded with the matrix sizes, so this check does not give you any information.
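
For a rough sense of scale with the constant inputs used in the example above (a back-of-the-envelope check, not a statement about the kernels themselves):

    // each output element is a dot product of 512 identical terms
    // exact value of an element : 512 * 0.12432 * 0.34636 ≈ 22.0465
    // reported absolute diff    : 0.001469
    // implied relative diff     : 0.001469 / 22.0465 ≈ 6.7e-5

A relative difference of that order is consistent with the two backends simply accumulating the FP32 dot products in different orders.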

@emre-synnada
Author

Thank you for pointing out the need to use a relative-difference check. I’ve updated my validation code accordingly (see below), but I still observe a discrepancy when running with ggml. To rule out a hardware issue, I wrote a minimal PyTorch script that performs the same 512×512 matrix multiplication on both CPU and MPS backends using identical inputs and compares both absolute and relative errors. In that PyTorch test, the CPU and MPS results match exactly, with no measurable difference. I don't understand why ggml shows a relative error in this operation while PyTorch does not.

GGML output

Max absolute diff:  0.001469
Max relative diff:  0.000067

PyTorch output

Max absolute difference: 0.000000e+00
Max relative difference: 0.000000e+00

All values are within the specified tolerances.

Modified tolerance calculation code

    const float abs_tol     = 1e-5f;
    const float rel_tol     = 1e-4f;
    float       max_abs_diff = 0.0f;
    float       max_rel_diff = 0.0f;

    for (size_t i = 0; i < cpu_out.size(); i++) {
        const float a    = cpu_out[i];
        const float b    = metal_out[i];
        const float diff = std::abs(a - b);
        // printf("cpu_out[%zu] = %f\n", i, a);
        // printf("metal_out[%zu] = %f\n", i, b);

        // track max absolute diff
        max_abs_diff = std::max(max_abs_diff, diff);

        // denominator: use the larger magnitude of a and b to normalize
        const float denom = std::max(std::abs(a), std::abs(b));
        // if both are zero, denom == 0, so just treat relative error == 0
        const float rel_diff = denom > 0.0f
            ? diff / denom
            : 0.0f;

        max_rel_diff = std::max(max_rel_diff, rel_diff);

        // check against relative and absolute tolerance 
        if (rel_diff > rel_tol && diff > abs_tol) {
            fprintf(stderr, "First relative tolerance violation at index %zu:\n", i);
            fprintf(stderr, " Abs difference: % .6f\n", diff);
            fprintf(stderr, " Rel difference: % .6f (tolerance = % .6f)\n",
                    rel_diff, rel_tol);
            break;
        }
    }

    fprintf(stderr, "Max absolute diff: % .6f\n", max_abs_diff);
    fprintf(stderr, "Max relative diff: % .6f\n", max_rel_diff);
    assert(max_rel_diff <= rel_tol && "Outputs differ beyond relative tolerance!");

PyTorch Matrix Multiplication (MPS and CPU)

import torch

torch.manual_seed(0)

dtype   = torch.float32
abs_tol = 1e-5
rel_tol = 1e-4

A_cpu = torch.full((512, 512), 0.12432, dtype=dtype)
B_cpu = torch.full((512, 512), 0.34636, dtype=dtype)

# Compute on CPU
C_cpu = A_cpu @ B_cpu

# Move to MPS and compute
A_mps = A_cpu.to('mps')
B_mps = B_cpu.to('mps')
C_mps = A_mps @ B_mps

# Bring MPS result back to CPU
C_mps_cpu = C_mps.to('cpu')

# Compute absolute difference
abs_diff = (C_cpu - C_mps_cpu).abs()

# Compute relative difference: diff / max(|C_cpu|, |C_mps_cpu|)
denom    = torch.max(C_cpu.abs(), C_mps_cpu.abs())
rel_diff = torch.where(denom > 0, abs_diff / denom, torch.zeros_like(abs_diff))

# Track maxima
max_abs_diff = abs_diff.max().item()
max_rel_diff = rel_diff.max().item()

print(f"Max absolute difference: {max_abs_diff:.6e}")
print(f"Max relative difference: {max_rel_diff:.6e}")

# Find first violation
mask = (abs_diff > abs_tol) & (rel_diff > rel_tol)
if mask.any():
    i, j = mask.nonzero(as_tuple=False)[0].tolist()
    a = C_cpu[i, j].item()
    b = C_mps_cpu[i, j].item()
    d = abs_diff[i, j].item()
    rd = rel_diff[i, j].item()
    print(f"\nFirst tolerance violation at index ({i}, {j}):")
    print(f" CPU value:      {a:.6f}")
    print(f" MPS value:      {b:.6f}")
    print(f" Abs difference: {d:.6f} (tol = {abs_tol})")
    print(f" Rel difference: {rd:.6f} (tol = {rel_tol})")
else:
    print("\nAll values are within the specified tolerances.")

@BoruiXu

BoruiXu commented May 29, 2025

I also encountered inconsistent results between GPU and CPU backends when performing matrix multiplication. The first matrix has dimensions (2048, 7168) with all values set to 1, and the second matrix has dimensions (10, 7168) with all values set to 0.1. All inputs are FP32. The maximum relative error in the results is about 2.35e-04. Is this expected?
Thanks.
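
As a point of reference (a standalone sketch, not ggml code, and not a claim about which kernels either backend actually uses): FP32 dot products are sensitive to accumulation order, so some relative error between backends is expected even when every input is FP32. The snippet below evaluates the same 7168-element dot product as above, once with plain sequential accumulation and once with eight interleaved partial sums, and compares both against an FP64 reference.

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int k = 7168;                  // reduction length from the report above
    std::vector<float> a(k, 1.0f);       // row of the (2048, 7168) matrix: all ones
    std::vector<float> b(k, 0.1f);       // row of the (10, 7168) matrix: all 0.1

    // order 1: plain sequential accumulation (typical scalar loop)
    float seq = 0.0f;
    for (int i = 0; i < k; ++i) {
        seq += a[i] * b[i];
    }

    // order 2: eight interleaved partial sums (a common SIMD/GPU-style pattern)
    float part[8] = {0};
    for (int i = 0; i < k; ++i) {
        part[i % 8] += a[i] * b[i];
    }
    float blk = 0.0f;
    for (int i = 0; i < 8; ++i) {
        blk += part[i];
    }

    // FP64 reference for comparison
    double ref = 0.0;
    for (int i = 0; i < k; ++i) {
        ref += (double) a[i] * (double) b[i];
    }

    printf("sequential FP32 : %.9f\n", seq);
    printf("blocked    FP32 : %.9f\n", blk);
    printf("FP64 reference  : %.9f\n", ref);
    printf("rel. diff between the two FP32 results: %.3e\n",
           std::fabs(seq - blk) / ref);
    return 0;
}

Scalar loops, SIMD code, and GPU simdgroup kernels all accumulate in different orders, so bit-identical FP32 results across backends are generally not expected.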

@ngxson
Contributor

ngxson commented Jun 1, 2025

I ran the same test on ggml-easy, but there is no difference between CPU / Metal: https://github.com/ngxson/ggml-easy/blob/0d49b461053035205ca1ab5bc783f849baa2fa0b/demo/random.cpp#L103-L140

Also, looking at your code, there are no ggml_set_input/ggml_set_output calls. These functions were not very important in the past, but in newer versions of GGML they can make a big difference. Can you add ggml_set_input/ggml_set_output and try again?
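
For reference, a minimal way to apply this suggestion to the build_graph above (keeping the rest of simple-backend.cpp unchanged) might look like the following sketch:

// sketch: the original build_graph with the tensors explicitly marked as graph inputs/outputs
struct ggml_cgraph * build_graph(const simple_model & model) {
    static size_t buf_size = ggml_tensor_overhead()*GGML_DEFAULT_GRAPH_SIZE + ggml_graph_overhead();
    static std::vector<uint8_t> buf(buf_size);

    struct ggml_init_params params0 = {
        /*.mem_size   =*/ buf_size,
        /*.mem_buffer =*/ buf.data(),
        /*.no_alloc   =*/ true,
    };

    // create a temporary context to build the graph
    struct ggml_context * ctx0 = ggml_init(params0);
    struct ggml_cgraph  * gf   = ggml_new_graph(ctx0);

    // result = a*b^T
    struct ggml_tensor * result = ggml_mul_mat(ctx0, model.a, model.b);

    // mark a and b as graph inputs and the result as a graph output
    // (the output flag keeps the graph allocator from reusing the result's memory)
    ggml_set_input(model.a);
    ggml_set_input(model.b);
    ggml_set_output(result);

    ggml_build_forward_expand(gf, result);

    ggml_free(ctx0);
    return gf;
}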

@emre-synnada
Author

I added calls to ggml_set_input/ggml_set_output when creating the graph, but no luck there. Then I noticed that, in the example you provided, before the matrix multiply you insert a ggml_scale node that initializes the (random-float) input data inside the graph. I copied that pattern into my own build_graph function (code below), so now I'm scaling a tensor of ones via ggml_scale just before the matrix multiply. With this change, I see no difference between the CPU and Metal backends. So I have two questions:

  1. Why does it matter whether I pass random-float values directly into the matrix multiply versus scaling a tensor of ones inside the graph? Intuitively they should be equivalent, but empirically they produce different behavior.
  2. Why can the ggml_set_input/output functions make a big difference in newer versions of GGML? As far as I can see, they mark the tensors as input/output.

Modified build_graph code:

// build the compute graph to perform a matrix multiplication
struct ggml_cgraph * build_graph(simple_model& model) {
    static size_t buf_size = ggml_tensor_overhead()*GGML_DEFAULT_GRAPH_SIZE + ggml_graph_overhead();
    static std::vector<uint8_t> buf(buf_size);

    struct ggml_init_params params0 = {
        /*.mem_size   =*/ buf_size,
        /*.mem_buffer =*/ buf.data(),
        /*.no_alloc   =*/ true, // the tensors will be allocated later by ggml_allocr_alloc_graph()
    };

    // create a temporary context to build the graph
    struct ggml_context * ctx0 = ggml_init(params0);

    struct ggml_cgraph  * gf = ggml_new_graph(ctx0);

    // result = a*b^T
    model.a = ggml_scale(ctx0, model.a, 0.124374f);
    model.b = ggml_scale(ctx0, model.b, 0.34636f);
    struct ggml_tensor * result = ggml_mul_mat(ctx0, model.a, model.b);
    ggml_set_input(model.a);
    ggml_set_input(model.b);
    ggml_set_output(result);
    // build operations nodes
    ggml_build_forward_expand(gf, result);

    // delete the temporary context used to build the graph
    ggml_free(ctx0);
    return gf;
}
