[BUG] rkvdec-vdpu383-h264: wrong pixels at horizontal de-blocking edges y=4 and y=12

Piotr Oniszczuk piotr.oniszczuk at gmail.com
Tue Jun 9 03:21:01 PDT 2026


Simon,

You have done fantastic work nailing this issue!

In fact, I suspect this issue is exactly the long-standing blocker for all of my RK3576 users who want to use the 3576 as a media player.

I have three different RK3576 devices (nanopi-m5, nanopi-r76s and rock4d) and would really like to verify whether my H.264 HW decoding issues on the 3576 are caused by the issue you discovered.

Do you have a PoC patch for 7.1 that I could use to verify this?



> On 15 May 2026, at 08:20, Simon Wright <Simon at symple.nz> wrote:
> 
> Hi Detlev,
> 
> I'm seeing systematic pixel corruption on VDPU383 H.264 decodes on RK3576 (NanoPi
> R76S).  The decoded luma plane is correct for rows 0–3 and row 8, but wrong for rows
> 4 and 12 (and the corresponding rows in every subsequent macroblock row).  The error
> propagates to all following P-frames.
> 
> I confirmed the mismatch is in the raw V4L2 CAPTURE buffer two independent ways:
> 
>   1. GStreamer v4l2slh264dec output compared to avdec_h264 with no videoconvert step.
>   2. A hand-written Rust V4L2 decoder that submits only SPS+PPS+SCALING_MATRIX+
>      DECODE_PARAMS (SLICE_PARAMS returns EINVAL on VIDIOC_QUERY_EXT_CTRL on this BSP,
>      so the control set is the same as GStreamer's actual submission) — identical
>      20.3% mismatch at the identical first-diff byte.  This rules out any GStreamer
>      post-processing or control-submission effect as the cause.
> 
> Hardware:
>   Board:      NanoPi R76S (RK3576, VDPU383)
>   Kernel:     Linux 7.0.1 (mainline rkvdec-vdpu383-h264.c, unmodified)
>   GStreamer:  1.28.2 (with v4l2slh264dec from gst-plugins-bad)
>   Content:    1920×1080 Baseline H.264, SMPTE colour bars, openh264enc
> 
> 
> MINIMAL REPRODUCER
> ------------------
> 
> Generate a test file (any H.264 Annex-B with visible content works; I used openh264enc
> with SMPTE bars):
> 
>   gst-launch-1.0 videotestsrc num-buffers=60 pattern=smpte \
>     ! video/x-raw,width=1920,height=1080,framerate=30/1 \
>     ! openh264enc ! h264parse ! filesink location=test.h264
> 
> Decode via HW, capture raw NV12:
> 
>   gst-launch-1.0 filesrc location=test.h264 num-buffers=60 \
>     ! h264parse ! v4l2slh264dec ! 'video/x-raw' \
>     ! filesink location=hw.raw
> 
> Decode via SW, capture raw NV12:
> 
>   gst-launch-1.0 filesrc location=test.h264 num-buffers=60 \
>     ! h264parse ! avdec_h264 ! videoconvert ! 'video/x-raw,format=NV12' \
>     ! filesink location=sw.raw
> 
> For a 1920×1080 NV12 frame (frame 0), compare the first 3,110,400 bytes:
> 
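>   cmp -n 3110400 hw.raw sw.raw
> 
> Expected: identical.
> Observed: first mismatch at byte offset 7680, counting from 0 (Y plane, row=4, col=0).
> cmp numbers bytes from 1, so it reports this position as byte 7681.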
> 
> With SMPTE bars (white region at the top), the SW decode gives Y = 0xe9 (the correct
> white-bar luma) on rows 3 and 4.  The HW decode gives 0xe9 on row 3 (correct) but 0xaf
> on row 4.
> Overall mismatch rate: 20.3% of bytes in frame 0.
> 
> 
> QUANTIFIED EVIDENCE (frame 0, IDR)
> -----------------------------------
> 
>   SW decode:  Y bytes [7680..7695] = e9 e9 e9 e9 e9 e9 e9 e9 e9 e9 e9 e9 e9 e9 e9 e9
>   HW decode:  Y bytes [7680..7695] = af af af af af af af af af af af af af af af af
>   First diff: byte 7680 → Y plane row=4, col=0
> 
> Error propagation:
>   Frame 0 (IDR):  20.3% mismatch, first_diff = byte 7680 (Y row=4)
>   Frame 1 (P):    23.0% mismatch, first_diff = byte 253 (error propagated to row=0)
>   Frames 5–30 (P): 25–26% mismatch, stable
> 
> ANALYSIS
> --------
> 
> A diagnostic experiment implicates the filterd_rcb buffer (RCB index 6).  Redirecting
> filterd_rcb buffers 6, 7, 8 to point at the output buffer produced 98.4% corruption
> with first diff at row=1, which indicates the hardware reads p-side pixel context from
> filterd_rcb (rather than from the reconstruction buffer) when applying horizontal
> deblocking.
> 
> Based on the error pattern, our hypothesis is that filterd_rcb uses an 8-row circular
> index (slot = row mod 8).  If so, H.264's 4-row deblocking boundaries within each
> 16-row macroblock row would cause a slot collision that HEVC (whose deblocking edges
> lie on an 8-row grid) does not encounter:
> 
>   Edge y=4:  p0 from row 3  → slot 3  (zero-initialised on IDR → wrong)
>   Edge y=8:  p0 from row 7  → slot 7  (written before this edge is reached → correct)
>   Edge y=12: p0 from row 11 → slot 3  (still holds row-3 data from the y=4 pass → wrong)
> 
> This would explain why y=8 decodes correctly while y=4 and y=12 do not.  We don't have
> hardware documentation for VDPU383, so we can't confirm whether this is the actual
> mechanism.
> 
> We tried several register adjustments hoping to change the filterd_rcb update granularity:
> ctu_align_wr_en (reg027), buf_empty_en (reg009), ref strides (reg083–106), and
> num_views in the SPS table.  None changed the corruption.
> 
> Is there a known configuration difference for H.264's narrower deblocking edges, or a
> BSP-level fix we've missed?
> 
> 
> ATTACHED REPRODUCER
> -------------------
> 
> The C program below (builds against GStreamer on-device, ~140 lines) automates the
> comparison for the first decoded frame and reports the mismatch statistics:
> 
>   gcc -O0 -g -o h264_hw_vs_sw_dump h264_hw_vs_sw_dump.c \
>       $(pkg-config --cflags --libs gstreamer-1.0 gstreamer-video-1.0 gstreamer-app-1.0)
> 
>   ./h264_hw_vs_sw_dump /path/to/test.h264
> 
> --- BEGIN h264_hw_vs_sw_dump.c ---
> /*
>  * H.264 HW vs SW byte-level comparison via GStreamer appsink.
>  *
>  * Decodes one frame of an H.264 Annex-B file via two paths:
>  *   SW:  h264parse ! avdec_h264 ! videoconvert ! NV12 appsink
>  *   HW:  h264parse ! v4l2slh264dec             ! NV12 appsink
>  *
>  * Reports the first divergent byte and the mismatch percentage.  If the HW bytes
>  * differ from the SW bytes, the divergence is already present in the V4L2 CAPTURE
>  * buffer, which points at the kernel rkvdec-vdpu383-h264.c driver or the hardware.
>  *
>  * Build on device:
>  *   gcc -O0 -g -o h264_hw_vs_sw_dump h264_hw_vs_sw_dump.c \
>  *       $(pkg-config --cflags --libs gstreamer-1.0 gstreamer-video-1.0 gstreamer-app-1.0)
>  */
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <stdint.h>
> #include <unistd.h>
> #include <gst/gst.h>
> #include <gst/video/video.h>
> #include <gst/app/gstappsink.h>
> 
> typedef struct {
>     uint8_t *data;
>     int      width, height;
>     size_t   y_size, uv_size, total;
> } DecodedFrame;
> 
> static void free_frame(DecodedFrame *f) { if (f) { free(f->data); f->data = NULL; } }
> 
> static DecodedFrame *run_pipeline(const char *pipeline_str, const char *label)
> {
>     fprintf(stderr, "[%s] pipeline: %s\n", label, pipeline_str);
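>     GError *err = NULL;
>     GstElement *pipeline = gst_parse_launch(pipeline_str, &err);
>     if (!pipeline || err) {
>         fprintf(stderr, "[%s] gst_parse_launch: %s\n", label, err ? err->message : "unknown");
>         /* Release the error and any partially built pipeline before bailing out. */
>         g_clear_error(&err);
>         if (pipeline) gst_object_unref(pipeline);
>         return NULL;
>     }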
>     GstElement *sink = gst_bin_get_by_name(GST_BIN(pipeline), "sink");
>     gst_app_sink_set_emit_signals(GST_APP_SINK(sink), FALSE);
>     gst_app_sink_set_drop(GST_APP_SINK(sink), FALSE);
>     gst_app_sink_set_max_buffers(GST_APP_SINK(sink), 1);
>     gst_element_set_state(pipeline, GST_STATE_PLAYING);
> 
>     GstSample *sample = gst_app_sink_pull_sample(GST_APP_SINK(sink));
>     if (!sample) {
>         fprintf(stderr, "[%s] no sample\n", label);
>         gst_element_set_state(pipeline, GST_STATE_NULL);
>         gst_object_unref(sink); gst_object_unref(pipeline);
>         return NULL;
>     }
>     GstBuffer *buf  = gst_sample_get_buffer(sample);
>     GstCaps   *caps = gst_sample_get_caps(sample);
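>     GstVideoInfo vinfo;
>     GstVideoFrame vframe;
>     /* Bail out if the caps are not raw video or the buffer cannot be mapped. */
>     if (!caps || !gst_video_info_from_caps(&vinfo, caps) ||
>         !gst_video_frame_map(&vframe, &vinfo, buf, GST_MAP_READ)) {
>         fprintf(stderr, "[%s] cannot parse caps or map frame\n", label);
>         gst_sample_unref(sample);
>         gst_element_set_state(pipeline, GST_STATE_NULL);
>         gst_object_unref(sink); gst_object_unref(pipeline);
>         return NULL;
>     }
>     int w = GST_VIDEO_INFO_WIDTH(&vinfo);
>     int h = GST_VIDEO_INFO_HEIGHT(&vinfo);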
> 
>     size_t y_size  = (size_t)w * h;
>     size_t uv_size = (size_t)w * (h / 2);
>     DecodedFrame *frame = calloc(1, sizeof(*frame));
>     frame->data  = malloc(y_size + uv_size);
>     frame->width = w; frame->height = h;
>     frame->y_size = y_size; frame->uv_size = uv_size;
>     frame->total = y_size + uv_size;
> 
>     uint8_t *y_src = GST_VIDEO_FRAME_PLANE_DATA(&vframe, 0);
>     int y_stride   = GST_VIDEO_FRAME_PLANE_STRIDE(&vframe, 0);
>     for (int row = 0; row < h; row++)
>         memcpy(frame->data + row * w, y_src + row * y_stride, w);
> 
>     uint8_t *uv_src = GST_VIDEO_FRAME_PLANE_DATA(&vframe, 1);
>     int uv_stride   = GST_VIDEO_FRAME_PLANE_STRIDE(&vframe, 1);
>     uint8_t *uv_dst = frame->data + y_size;
>     for (int row = 0; row < h / 2; row++)
>         memcpy(uv_dst + row * w, uv_src + row * uv_stride, w);
> 
>     gst_video_frame_unmap(&vframe);
>     gst_sample_unref(sample);
>     gst_element_set_state(pipeline, GST_STATE_NULL);
>     gst_object_unref(sink); gst_object_unref(pipeline);
>     return frame;
> }
> 
> static void compare_frames(DecodedFrame *sw, DecodedFrame *hw)
> {
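>     /* Byte offsets are only comparable when both decoders report the same size. */
>     if (sw->width != hw->width || sw->height != hw->height)
>         fprintf(stderr, "WARNING: size differs: SW %dx%d, HW %dx%d\n",
>                 sw->width, sw->height, hw->width, hw->height);
>     size_t n = sw->total < hw->total ? sw->total : hw->total;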
>     size_t first_diff = (size_t)-1, diffs = 0;
>     for (size_t i = 0; i < n; i++) {
>         if (sw->data[i] != hw->data[i]) {
>             if (first_diff == (size_t)-1) first_diff = i;
>             diffs++;
>         }
>     }
>     if (!diffs) {
>         fprintf(stderr, "MATCH: HW == SW (%zu bytes)\n", n);
>         return;
>     }
>     size_t y_size  = (size_t)sw->width * sw->height;
>     const char *plane = first_diff < y_size ? "Y" : "UV";
>     size_t off = first_diff < y_size ? first_diff : first_diff - y_size;
>     fprintf(stderr, "MISMATCH: %zu/%zu bytes differ (%.1f%%)\n", diffs, n, 100.0*diffs/n);
>     fprintf(stderr, "  First diff: byte %zu -> %s plane offset %zu (row=%zu col=%zu)\n",
>             first_diff, plane, off, off / sw->width, off % sw->width);
>     fprintf(stderr, "  SW[%zu..]: ", first_diff);
>     for (size_t i = first_diff; i < first_diff+16 && i < n; i++)
>         fprintf(stderr, "%02x ", sw->data[i]);
>     fprintf(stderr, "\n  HW[%zu..]: ", first_diff);
>     for (size_t i = first_diff; i < first_diff+16 && i < n; i++)
>         fprintf(stderr, "%02x ", hw->data[i]);
>     fprintf(stderr, "\n");
> }
> 
> int main(int argc, char **argv)
> {
>     if (argc < 2) { fprintf(stderr, "Usage: %s <h264_annex_b>\n", argv[0]); return 1; }
>     gst_init(NULL, NULL);
>     char sw_pipe[1024], hw_pipe[1024];
>     snprintf(sw_pipe, sizeof(sw_pipe),
>         "filesrc location=%s ! h264parse ! avdec_h264 ! videoconvert ! "
>         "video/x-raw,format=NV12 ! appsink name=sink", argv[1]);
>     snprintf(hw_pipe, sizeof(hw_pipe),
>         "filesrc location=%s ! h264parse ! v4l2slh264dec ! "
>         "video/x-raw,format=NV12 ! appsink name=sink", argv[1]);
> 
>     DecodedFrame *sw = run_pipeline(sw_pipe, "SW");
>     DecodedFrame *hw = run_pipeline(hw_pipe, "HW");
>     if (sw && hw) compare_frames(sw, hw);
>     if (sw) { free_frame(sw); free(sw); }
>     if (hw) { free_frame(hw); free(hw); }
>     return 0;
> }
> --- END h264_hw_vs_sw_dump.c ---
> 
> Thanks,
> Simon Wright
> Symple Solutions, Dunedin, New Zealand