Draft
63 commits
a8f15fb
Add interface is_model_splitted() to check the c-graph is splited or not
zhaixuejun1993 Mar 6, 2026
c8c3bd4
Infer and propagate dynamic-dimension indices for all tensors in the …
zhaixuejun1993 Mar 17, 2026
6c855e7
Only do this for fallback sub graph
zhaixuejun1993 Mar 19, 2026
c7af12b
Move dynamic dims compute in graph missmatch
zhaixuejun1993 Mar 23, 2026
2a118eb
ggml-openvino: fix tensor data handling for PERMUTE/VIEW ops in split…
zhaixuejun1993 Mar 19, 2026
54fe67e
ggml-openvino:add comments
zhaixuejun1993 Mar 19, 2026
74ba8fd
ggml-openvino: override VIEW op_case to 0 for split model inputs
zhaixuejun1993 Mar 19, 2026
5ec12bd
openvino backend: Handle unsupported VIEW shape-mismatch in OpenVINO …
zhaixuejun1993 Mar 19, 2026
6f3e20f
Enable additional mul_mat tests and add tensor data saving function (…
zhaixuejun1993 Mar 23, 2026
713bcb0
ggml-openvino: fix CONT/TRANSPOSE mapping and improve dynamic-dimensi…
zhaixuejun1993 Mar 26, 2026
4fbc557
OpenVINO: add NORM/TANH support and rework SOFT_MAX translation
zhaixuejun1993 Mar 28, 2026
015b607
ggml-openvino: extend VIEW handling
zhaixuejun1993 Mar 30, 2026
9e0f352
Enable -fa off (#118)
wine99 Apr 2, 2026
8f05691
Enable --context-shift
wine99 Apr 10, 2026
4c9b609
Fix llm param compute error for normal softmax not the softmax in att…
zhaixuejun1993 Apr 13, 2026
1ba5fd8
OpenVINO backend: fix error for attention size compute in llm param
zhaixuejun1993 Apr 13, 2026
644dbea
use tensor->extra in infer_request i/o
wine99 Apr 27, 2026
a979e24
OpenVINO backend: refacter the compute_llm_params() func add get_atte…
zhaixuejun1993 Apr 29, 2026
3f433c5
OpenVINO backend: clean unused code
zhaixuejun1993 Apr 29, 2026
3bc7e76
1to1 match op update (#146)
cavusmustafa May 6, 2026
19c79fd
initial gemma4 support
May 5, 2026
7897870
removed hardcoded names for kv cache slicing
cavusmustafa May 5, 2026
329c4b5
OpenVINO backend: Add new attention pattern for llm parameters compute
zhaixuejun1993 May 6, 2026
f1e32c5
flash attn Q shape static conversion
cavusmustafa May 4, 2026
33a2160
Remove slice in permute translation when n_seq is 1
cavusmustafa May 4, 2026
05c0385
return optional in extract_layer_from_name
wine99 May 7, 2026
bdc858d
OpenVINO backend: refactor VIEW related operation (#148)
zhaixuejun1993 May 7, 2026
51114e5
OpenVINO backend: Add ops l2_norm & pad
zhaixuejun1993 May 6, 2026
05ff7d0
OpenVINO backend does not support CPY with non-contiguous data or mis…
zhaixuejun1993 May 7, 2026
322bb87
add op SSM_CONV GATED_DELTA_NET
wine99 May 7, 2026
8cae14e
OpenVINO backend: fix error for bf16 in OV gpu plugin
zhaixuejun1993 May 7, 2026
f80474c
reverted static Q input shape for attention layer
cavusmustafa May 7, 2026
b61ffd4
OpenVINO backend: remove hardcode name inp_tokens, which ignore some …
zhaixuejun1993 May 8, 2026
8ba38ca
Disable remote tensor due to bug in ov gpu
wine99 May 12, 2026
edc0630
Disable n_token > 1 GATED_DELTA_NET on gpu
wine99 May 12, 2026
9331bb3
OpenVINO backend: fix the view op dynamic handling issue in gemma4 & …
zhaixuejun1993 May 13, 2026
fd0ac6d
OpenVINO backend: clean code
zhaixuejun1993 May 13, 2026
f9c343c
OpenVINO backend: enable view + norm/rms_norm
zhaixuejun1993 May 9, 2026
ebccf37
OpenVINO backend: concat op
zhaixuejun1993 May 9, 2026
0a08624
OpenVINO backend: argsort op
zhaixuejun1993 May 9, 2026
42241a2
OpenVINO backend: enable unary + view & GGML_UNARY_OP_SOFTPLUS
zhaixuejun1993 May 11, 2026
6ed8f78
Fix issue for test-backend-ops in TOPK_MOE, which compare VIEW ops re…
zhaixuejun1993 May 11, 2026
b75e927
OpenVINO backend: enable sum_rows
zhaixuejun1993 May 11, 2026
2f32361
OpenVINO backend: enable clamp
zhaixuejun1993 May 11, 2026
ba3754a
OpenVINO backend: enable DIV
zhaixuejun1993 May 11, 2026
f27b978
OpenVINO backend: enable GGML_OP_MUL_MAT_ID
zhaixuejun1993 May 11, 2026
13b71f0
OpenVINO backend: disable MUL_MAT_ID_FUSION case with large mem needed
zhaixuejun1993 May 11, 2026
9384961
OpenVINO backend: Disable GGML_OP_ARGSORT, cause test_backend-ops failed
zhaixuejun1993 May 13, 2026
833111b
OpenVINO backend: fix issue in mul_mat_id
zhaixuejun1993 May 14, 2026
7f48bc7
OpenVINO backend: Disable DIV with broadcast on GPU
zhaixuejun1993 May 14, 2026
24f2bde
OpenVINO backend: update DIV
zhaixuejun1993 May 15, 2026
952d10a
use ov internal op GatedDeltaNet
wine99 May 19, 2026
5c7fc91
OpenVINO backend: enable llama erch test qwen3next
zhaixuejun1993 May 19, 2026
af9d8c5
OpenVINO backend: enable RMS_NORM + VIEW & remove op_case 2 for rope
zhaixuejun1993 May 7, 2026
4b86839
OpenVINO backend: fix error
zhaixuejun1993 May 7, 2026
2443297
suggested changes, need review
wine99 May 7, 2026
f825020
suggested changes, need review
wine99 May 7, 2026
b86472a
OpenVINO backend: clean unused code & fix build warning
zhaixuejun1993 May 20, 2026
4f247b1
OpenVINO backend: enable minicpm3 for arch test
zhaixuejun1993 May 20, 2026
d81ede3
Disable GDN op (#177)
wine99 May 21, 2026
40e0d19
disable gated_delta_net
wine99 May 22, 2026
9e589ee
update stateful_kv_size correctly in mismatch case
wine99 May 19, 2026
71731f1
add concat ssm_conv in compute_dynamic_dim
wine99 May 21, 2026
811 changes: 727 additions & 84 deletions ggml/src/ggml-openvino/ggml-decoder.cpp

Large diffs are not rendered by default.

74 changes: 55 additions & 19 deletions ggml/src/ggml-openvino/ggml-decoder.h
@@ -1,6 +1,7 @@
#pragma once

#include "ggml-quants.h"
#include "ggml-backend-impl.h"
#include "ggml-backend.h"
#include "ggml.h"
#include "openvino/decoder.h"

@@ -14,21 +15,21 @@

struct ModelParams {
int ctx = -1;
int ctx_swa = -1;
int ctx_per_seq = -1;
int ctx_per_seq_swa = -1;
int n_seq = 1;
int n_heads = -1;
int n_heads_kv = -1;
int head_size = -1;
int32_t rope_params[15];
bool mixed_rope_params = false;
std::vector<int> swa_layers;

std::vector<std::string> kv_names;
size_t kv_buffer_ctx_id = 0;

bool same_rope_params(const ModelParams & other) const {
- return memcmp(rope_params, other.rope_params, sizeof(int32_t) * 15) == 0;
+ return mixed_rope_params == other.mixed_rope_params &&
+ memcmp(rope_params, other.rope_params, sizeof(int32_t) * 15) == 0;
}

bool can_reuse_dynamically(const ModelParams & other) const { return same_rope_params(other); }
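
The effect of comparing `mixed_rope_params` together with the raw `rope_params` bytes can be shown on a reduced stand-in. This is only a sketch: `RopeKey` is a hypothetical type that keeps the two fields involved, and it zero-initializes the array (which the struct in this diff does not) so that the byte comparison is well defined.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical, reduced stand-in for ModelParams: only the fields read by
// same_rope_params() are kept.
struct RopeKey {
    int32_t rope_params[15]   = {};     // zero-initialized for this sketch
    bool    mixed_rope_params = false;

    // Same shape as the updated comparison: the flag first, then the raw bytes.
    bool same_rope_params(const RopeKey & other) const {
        return mixed_rope_params == other.mixed_rope_params &&
               std::memcmp(rope_params, other.rope_params, sizeof(rope_params)) == 0;
    }
};
```

With this shape, two parameter sets whose `rope_params` bytes match but whose `mixed_rope_params` flags differ no longer compare equal, so `can_reuse_dynamically` would reject the reuse.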
@@ -56,6 +57,7 @@ class GgmlOvDecoder : public ov::frontend::ggml::GgmlDecoder {
std::string node_name;
std::string node_op_type;
std::map<std::string, ggml_tensor *> node_inputs;
std::map<std::string, std::vector<std::pair<std::string, ggml_tensor *>>> node_inputs_views;
std::vector<std::string> node_inputs_names;
ggml_tensor * node_output;
std::string node_output_name;
@@ -69,6 +71,7 @@ class GgmlOvDecoder : public ov::frontend::ggml::GgmlDecoder {
std::map<std::string, std::shared_ptr<ov::Node>> & model_weights,
bool is_static,
bool is_stateful = false,
bool model_is_splitted = false,
bool is_prefill = false,
int prefill_chunk_size = 256);

@@ -84,6 +87,28 @@ class GgmlOvDecoder : public ov::frontend::ggml::GgmlDecoder {

virtual std::vector<size_t> get_input_stride(int node_idx, const std::string & name) const override;

virtual size_t get_view_input_size(int node_idx, const std::string & name) const override;

virtual size_t get_view_input_offset(int node_idx, const std::string & name, size_t view_index) const override;

virtual size_t get_view_input_src_offset(int node_idx, const std::string & name, size_t view_index) const override;

virtual std::vector<size_t> get_view_input_stride(int node_idx, const std::string & name, size_t view_index) const override;

virtual std::vector<size_t> get_view_input_src_stride(int node_idx, const std::string & name, size_t view_index) const override;

virtual ov::Shape get_view_input_ggml_shape(int node_idx, const std::string & name, size_t view_index) const override;

virtual ov::Shape get_view_input_src_ggml_shape(int node_idx, const std::string & name, size_t view_index) const override;

virtual ov::PartialShape get_view_input_ov_shape(int node_idx, const std::string & name, size_t view_index) const override;

virtual ov::PartialShape get_view_input_src_ov_shape(int node_idx, const std::string & name, size_t view_index) const override;

virtual std::string get_view_input_name(int node_idx, const std::string & name, size_t view_index) const override;

virtual std::string get_view_input_src_name(int node_idx, const std::string & name, size_t view_index) const override;

virtual ov::element::Type get_input_type(int node_idx, const std::string & name) const override;

virtual size_t get_input_size() const override;
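
The new `get_view_input_*` accessors all take an input name plus a `view_index`, and `NodeInfo` gains a `node_inputs_views` map from input name to a list of (name, tensor) pairs. A minimal sketch of how such a lookup could work is given below. It assumes that `view_index` is a position in that list; `FakeTensor`, `ViewChain`, `ViewMap` and the three helper functions are hypothetical names introduced for illustration and are not part of this PR.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for ggml_tensor; only the byte offset of a view is modeled.
struct FakeTensor {
    size_t view_offs = 0;
};

// Same container shape as node_inputs_views: input name -> list of (view name, tensor).
using ViewChain = std::vector<std::pair<std::string, FakeTensor *>>;
using ViewMap   = std::map<std::string, ViewChain>;

// Number of views recorded for an input; 0 when the input has none.
inline size_t view_input_size(const ViewMap & views, const std::string & name) {
    auto it = views.find(name);
    return it == views.end() ? 0 : it->second.size();
}

// Name of the view at view_index; throws std::out_of_range on an unknown name or index.
inline const std::string & view_input_name(const ViewMap & views, const std::string & name, size_t view_index) {
    return views.at(name).at(view_index).first;
}

// Byte offset of the view at view_index; throws std::out_of_range on an unknown name or index.
inline size_t view_input_offset(const ViewMap & views, const std::string & name, size_t view_index) {
    return views.at(name).at(view_index).second->view_offs;
}
```

Using the bounds-checked `at()` turns a bad name or index into an exception instead of undefined behavior; whether the real accessors do the same is not visible in this diff.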
@@ -106,10 +131,14 @@ class GgmlOvDecoder : public ov::frontend::ggml::GgmlDecoder {

virtual ov::element::Type get_output_type(int node_idx) const override;

virtual std::vector<size_t> get_output_stride(int node_idx) const override;

virtual int32_t * get_input_op_params(int node_idx, const std::string & name) const override;

virtual int32_t * get_output_op_params(int node_idx) const override;

virtual size_t get_output_op_offset(int node_idx) const override;

virtual std::vector<std::string> get_output_names(int node_idx) const override;

virtual const std::string & get_op_type() const override;
@@ -120,6 +149,8 @@ class GgmlOvDecoder : public ov::frontend::ggml::GgmlDecoder {

virtual const std::string & get_op_name(int node_idx) const override;

virtual int32_t get_op_dynamic_dim(int node_idx) const override;

virtual void visit_subgraph(std::function<void(std::shared_ptr<GgmlDecoder>, int node_idx)> node_visitor) const override;

ggml_tensor * get_input_ggml_tensor(const std::string & name) const { return m_inputs.at(name); }
@@ -150,8 +181,6 @@ class GgmlOvDecoder : public ov::frontend::ggml::GgmlDecoder {

virtual int get_ctx_size() const { return m_model_params.ctx; }

- virtual int get_ctx_swa_size() const { return m_model_params.ctx_swa; }

virtual int get_ctx_per_seq() const { return m_model_params.ctx_per_seq; }

virtual int get_ctx_per_seq_swa() const { return m_model_params.ctx_per_seq_swa; }
@@ -169,13 +198,19 @@ class GgmlOvDecoder : public ov::frontend::ggml::GgmlDecoder {

virtual int32_t * get_rope_params() const override { return const_cast<int32_t *>(m_model_params.rope_params); }

virtual bool has_mixed_rope_params() const override { return m_model_params.mixed_rope_params; }

virtual std::map<std::string, std::string> get_kv_param_res_names() const override;

virtual bool is_static() const override { return m_is_static; }

virtual bool is_stateful() const override { return m_is_stateful; }

- ov::PartialShape get_graph_input_shape(const ggml_tensor * op, const ggml_tensor * input) const;
+ virtual bool is_splited_model() const override {
+ return m_model_is_splitted;
+ }
+
+ ov::PartialShape get_graph_input_shape(const ggml_tensor * op, const ggml_tensor * input, int dynamic_dim_index=-1) const;

static void dump_cgraph(const ggml_cgraph * cgraph, std::string & filename);

@@ -205,6 +240,7 @@ class GgmlOvDecoder : public ov::frontend::ggml::GgmlDecoder {
bool m_is_prefill = false;
bool m_naive = false;
int m_prefill_chunk_size = 0;
bool m_model_is_splitted = false; // label the cgraph is splited or not

static ov::Shape get_shape(const ggml_tensor * tensor);
static std::vector<size_t> get_stride(const ggml_tensor * tensor);
@@ -227,39 +263,35 @@ class GgmlOvDecoder : public ov::frontend::ggml::GgmlDecoder {
}

inline static bool is_inp_mask(const ggml_tensor * tensor, const ggml_tensor * op) {
- return op->op == GGML_OP_CPY || (op->op == GGML_OP_FLASH_ATTN_EXT && tensor == op->src[3]);
+ return op->op == GGML_OP_CPY || (op->op == GGML_OP_FLASH_ATTN_EXT && tensor == op->src[3]) ||
+ (op->op == GGML_OP_SOFT_MAX && tensor == op->src[1]);
}

inline static bool is_rope_freqs_weight(const ggml_tensor * tensor, const ggml_tensor * op) {
return op->op == GGML_OP_ROPE && tensor == op->src[2];
}

inline static bool is_kvcache(const ggml_tensor * tensor, const ggml_tensor * op) {
- return op->op == GGML_OP_SET_ROWS && op->src[2] == tensor;
+ return tensor->buffer->usage == GGML_BACKEND_BUFFER_USAGE_ANY ||
+ (op != nullptr && op->op == GGML_OP_SET_ROWS && op->src[2] == tensor);
}

inline static bool is_kv_idx(const ggml_tensor * tensor, const ggml_tensor * op) {
return op->op == GGML_OP_SET_ROWS && op->src[1] == tensor;
}

inline static bool is_output_idx(const ggml_tensor * tensor, const ggml_tensor * op) {
- return op->op == GGML_OP_GET_ROWS && tensor == op->src[1] && op->src[0]->op != GGML_OP_NONE;
+ return op->op == GGML_OP_GET_ROWS && tensor == op->src[1] && op->src[0]->op != GGML_OP_NONE && op->src[1]->op == GGML_OP_NONE;
}

- static std::string get_graph_input_ov_name(const ggml_tensor * tensor, const ggml_tensor * op) {
- if (is_inp_tok(tensor, op)) {
- return "inp_tokens";
- }
+ std::string get_graph_input_ov_name(const ggml_tensor * tensor, const ggml_tensor * op) {
if (is_inp_pos(tensor, op)) {
return "inp_pos";
}
if (is_inp_emb(tensor, op)) {
return "embd";
}
if (is_output_idx(tensor, op)) {
return "inp_out_ids";
}
- if (is_inp_mask(tensor, op)) {
+ if (is_stateful() && is_inp_mask(tensor, op)) {
return std::string(tensor->name).find("swa") == std::string::npos ? "self_kq_mask" : "self_kq_mask_swa";
}
return tensor->name;
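
The mask-naming branch of `get_graph_input_ov_name` can be read in isolation as the small function below. It is a sketch under stated assumptions: `mask_ov_name` is a hypothetical free function, and the two booleans stand in for `is_stateful()` and `is_inp_mask(tensor, op)`.

```cpp
#include <string>

// Hypothetical free-function version of the mask-naming branch.
// is_stateful and is_mask replace the decoder state and the is_inp_mask() check.
inline std::string mask_ov_name(const std::string & tensor_name, bool is_stateful, bool is_mask) {
    if (is_stateful && is_mask) {
        // A name containing "swa" selects the sliding-window mask input.
        return tensor_name.find("swa") == std::string::npos ? "self_kq_mask" : "self_kq_mask_swa";
    }
    // Otherwise the tensor keeps its own ggml name.
    return tensor_name;
}
```

Under this reading, a mask tensor is renamed to one of the two fixed names only when the decoder is stateful. In every other case the ggml tensor name is passed through unchanged.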
@@ -272,6 +304,9 @@ class GgmlOvDecoder : public ov::frontend::ggml::GgmlDecoder {
void compute_model_inputs();
void compute_model_outputs();

// Infer and propagate dynamic-dimension indices for all tensors in the GGML graph.
void compute_node_dynamic_dims();

void validate_cgraph() const;

ggml_cgraph * m_cgraph = nullptr;
@@ -284,11 +319,12 @@ class GgmlOvDecoder : public ov::frontend::ggml::GgmlDecoder {
std::map<std::string, ggml_tensor *> m_model_outputs;
std::vector<std::string> m_model_output_names;
std::vector<NodeInfo> m_node_info_list;
std::map<ggml_tensor *, int> m_node_dynamic_dims;

ModelParams m_model_params;
ComputeParams m_compute_params;
};
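
The header only declares `compute_node_dynamic_dims()` and the `m_node_dynamic_dims` map; the implementation lives in the `ggml-decoder.cpp` diff, which is not rendered here. The toy below sketches one way a per-tensor dynamic-dimension index could be inferred and propagated. Everything in it is an assumption for illustration: `ToyOp`, `ToyNode`, the `perm` convention, and the rule of inheriting from the first dynamic source are hypothetical and not taken from this PR.

```cpp
#include <array>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical op kinds; the real graph has many more.
enum class ToyOp { Input, Elementwise, Permute };

struct ToyNode {
    ToyOp                        op = ToyOp::Input;
    std::vector<const ToyNode *> src;                      // producer nodes
    std::array<int, 4>           perm = {0, 1, 2, 3};      // input axis i moves to output axis perm[i]
    int                          input_dynamic_dim = -1;   // only read for ToyOp::Input; -1 means static
};

// nodes must be in topological order (producers before consumers).
// Returns, for every node, the index of its dynamic axis or -1 if fully static.
inline std::map<const ToyNode *, int> propagate_dynamic_dims(const std::vector<const ToyNode *> & nodes) {
    std::map<const ToyNode *, int> dims;
    for (const ToyNode * node : nodes) {
        int dim = -1;
        if (node->op == ToyOp::Input) {
            dim = node->input_dynamic_dim;
        } else {
            // Inherit the dynamic axis of the first source that has one.
            for (const ToyNode * s : node->src) {
                auto it = dims.find(s);
                if (it != dims.end() && it->second >= 0) {
                    dim = it->second;
                    break;
                }
            }
            // A permute moves the dynamic axis along with the data.
            if (dim >= 0 && node->op == ToyOp::Permute) {
                dim = node->perm[static_cast<size_t>(dim)];
            }
        }
        dims[node] = dim;
    }
    return dims;
}
```

A single forward pass is enough in this toy because the node list is assumed to be topologically ordered, as a ggml compute graph is.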

void print_tensor_address_map(const ggml_cgraph * cgraph);

- int extract_layer_from_name(const std::string & name);
+ std::optional<int> extract_layer_from_name(const std::string & name);
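
Returning `std::optional<int>` lets callers distinguish "no layer index in this name" from a valid index without a sentinel value. The sketch below shows the pattern on an assumed naming scheme in which the layer number follows a trailing `_l` (for example `cache_k_l12`). The function name `layer_from_name_sketch` and the parsing rule are hypothetical; the real implementation is not shown in this diff.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical parser: returns the decimal number that follows the last "_l",
// or std::nullopt when the name has no such all-digit suffix.
// No overflow guard: very long digit runs are out of scope for this sketch.
inline std::optional<int> layer_from_name_sketch(const std::string & name) {
    const size_t pos = name.rfind("_l");
    if (pos == std::string::npos || pos + 2 >= name.size()) {
        return std::nullopt;
    }
    int layer = 0;
    for (size_t i = pos + 2; i < name.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(name[i]);
        if (!std::isdigit(c)) {
            return std::nullopt;
        }
        layer = layer * 10 + (c - '0');
    }
    return layer;
}
```

A caller can then write `if (auto layer = layer_from_name_sketch(name)) { ... }` and skip tensors that carry no layer index, instead of testing for a magic value such as `-1`.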
3 changes: 3 additions & 0 deletions ggml/src/ggml-openvino/ggml-openvino-extra.h
@@ -164,6 +164,9 @@ ggml_openvino_extracted_layout ggml_openvino_get_extracted_layout(const ggml_ten

ggml_openvino_tensor_extra * ggml_openvino_create_tensor_extra(const ggml_tensor * tensor, bool is_remote);

// Check if a tensor's buffer uses remote (device) memory (e.g. GPU USM)
bool ggml_openvino_buffer_is_remote(const ggml_tensor * tensor);

// Register an extra with the tensor's OpenVINO buffer context for proper lifetime management.
// This sets tensor->extra and tracks the extra in the buffer context for cleanup.
void ggml_openvino_buffer_register_extra(ggml_tensor * tensor, ggml_openvino_extra_base * extra);