Preface
This report describes the conversion and deployment of a face detection model on the RA8P1 development board. The model used is BlazeFace. It is converted with the RUHMI toolchain and then deployed by modifying the official face detection example project; the experimental results are presented at the end.
1. Model Selection and Conversion
The model used in this experiment is BlazeFace, a real-time, lightweight face detection model. It produces four output tensors with shapes [b, 512, 1], [b, 384, 1], [b, 512, 16], and [b, 384, 16]; the first two hold the detection scores, and the last two hold the detection-box coordinates (4 box values plus 6 keypoint coordinate pairs per anchor).
To deploy the model on the RA8P1 board, it must first be converted. The tool used here is RUHMI. On Linux, the RUHMI repository can be cloned with git:
git clone https://github.com/renesas/ruhmi-framework-mcu --depth 1
A new Python environment for the RUHMI model conversion can be created with conda; Python 3.10 is required:
conda create -n ruhmi python=3.10
conda activate ruhmi
Enter the ruhmi-framework-mcu directory and create a virtual environment:
cd ruhmi-framework-mcu
python -m venv mera-env
source mera-env/bin/activate
Then install the mera package and its dependencies:
pip install decorator typing_extensions psutil attrs pybind11 cmake junitparser
pip install ./install/mera-2.5.0+pkg.3577-cp310-cp310-manylinux_2_27_x86_64.whl
Create a folder to store the model:
mkdir models_int8
With the above done, copy the model to be converted into models_int8 and run the conversion script. The model used here is blazeface_front_128_int8.tflite:
cp blazeface_front_128_int8.tflite models_int8
python scripts/mcu_deploy.py --ref_data models_int8 deploy_ethos
After the command finishes, a deploy_ethos folder is generated in the current directory. It contains a folder named after the model, whose build subfolder holds the deployable C source files, as shown in the figure below.

In the src folder, every file except hal_entry.c needs to be ported. With that, the model conversion is essentially complete.
2. Model Deployment
Before deploying to the board, this experiment first loads the tflite file in Python and tests it on a static image.
Before running inference with tflite, the image data must be preprocessed. The preprocessing formula is
y = x / std - mean
where std = 127.5, mean = 1.0, x is the raw RGB pixel value, and y is the normalized value.
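As a quick sanity check, this normalization maps the raw channel range [0, 255] onto [-1, 1] (a minimal standalone sketch, separate from the deployment code):

```python
import numpy as np

# BlazeFace input normalization: y = x / 127.5 - 1.0 maps [0, 255] -> [-1, 1]
def preprocess(x):
    return x.astype(np.float32) / 127.5 - 1.0

print(preprocess(np.array([0, 255])))  # [-1.  1.]
```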
The full test code is as follows:
import cv2
import numpy as np
import tensorflow as tf

img = cv2.imread('data/1face.png')  # note: OpenCV loads images in BGR order
interpreter = tf.lite.Interpreter(model_path='model/blazeface_front_128_int8.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
input_shape = input_details[0]['shape']
img_resize = cv2.resize(img, (input_shape[1], input_shape[2]))
img_resize = img_resize.reshape([1, input_shape[1], input_shape[2], input_shape[3]])
img_input = img_resize.astype(np.float32) / 127.5 - 1.0  # normalize to [-1, 1]
interpreter.set_tensor(input_details[0]['index'], img_input)
interpreter.invoke()
r1 = interpreter.get_tensor(output_details[0]['index'])  # box regressors
c1 = interpreter.get_tensor(output_details[1]['index'])  # scores
c2 = interpreter.get_tensor(output_details[2]['index'])  # scores
r2 = interpreter.get_tensor(output_details[3]['index'])  # box regressors
c = np.concatenate((c1, c2), axis=1)  # [b, 896, 1]
r = np.concatenate((r1, r2), axis=1)  # [b, 896, 16]
# anchors: precomputed SSD anchor boxes [x_center, y_center, w, h] for the
# 896 BlazeFace anchor positions, loaded or generated beforehand
anchors = np.reshape(anchors, [896, 4])
boxes = np.zeros_like(r)
x_center = r[..., 0] / 128.0 * anchors[:, 2] + anchors[:, 0]
y_center = r[..., 1] / 128.0 * anchors[:, 3] + anchors[:, 1]
w = r[..., 2] / 128.0 * anchors[:, 2]
h = r[..., 3] / 128.0 * anchors[:, 3]
boxes[..., 0] = y_center - h / 2
boxes[..., 1] = x_center - w / 2
boxes[..., 2] = y_center + h / 2
boxes[..., 3] = x_center + w / 2
for k in range(6):  # decode the 6 facial keypoints
    offset = 4 + k * 2
    keypoint_x = r[..., offset] / 128.0 * anchors[:, 2] + anchors[:, 0]
    keypoint_y = r[..., offset + 1] / 128.0 * anchors[:, 3] + anchors[:, 1]
    boxes[..., offset] = keypoint_x
    boxes[..., offset + 1] = keypoint_y
thresh = 100.0
scores = np.clip(c, -thresh, thresh).squeeze(axis=-1)  # clamp before sigmoid
scores = 1 / (1 + np.exp(-scores))
mask = scores >= 0.75
output_detections = []
for i in range(boxes.shape[0]):
    # collect every detection above the score threshold (NMS omitted here)
    output_detections.extend(boxes[i, mask[i]])
for i in range(len(output_detections)):
    ymin = output_detections[i][0] * img.shape[0]
    xmin = output_detections[i][1] * img.shape[1]
    ymax = output_detections[i][2] * img.shape[0]
    xmax = output_detections[i][3] * img.shape[1]
    img = cv2.rectangle(img, (int(xmin), int(ymin)), (int(xmax), int(ymax)), (0, 255, 0), 1)
    for j in range(6):
        offset = 4 + 2 * j
        kp_x = output_detections[i][offset] * img.shape[1]
        kp_y = output_detections[i][offset + 1] * img.shape[0]
        img = cv2.circle(img, (int(kp_x), int(kp_y)), 2, (0, 255, 0))
cv2.imwrite('result.png', img)
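The `anchors` array used above is not shown in the listing. It is assumed to be the standard BlazeFace-front SSD anchor set: 2 anchors per cell on a 16×16 feature map (512 anchors) plus 6 per cell on an 8×8 feature map (384 anchors), each stored as [x_center, y_center, w, h] with unit width and height. A sketch of how such an array could be generated, under that assumption:

```python
import numpy as np

def generate_anchors():
    """Generate the 896 BlazeFace-front anchors, assuming the standard
    MediaPipe configuration: 16x16 grid with 2 anchors per cell and
    8x8 grid with 6 anchors per cell, all with w = h = 1.0."""
    anchors = []
    for grid, per_cell in ((16, 2), (8, 6)):
        for row in range(grid):
            for col in range(grid):
                cx = (col + 0.5) / grid  # anchor center, normalized to [0, 1]
                cy = (row + 0.5) / grid
                for _ in range(per_cell):
                    anchors.append([cx, cy, 1.0, 1.0])
    return np.array(anchors, dtype=np.float32)

print(generate_anchors().shape)  # (896, 4)
```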
The result is shown in the figure below.

After these preparations, the C files generated by RUHMI can be ported into the project, compiled, and run, with post-processing code written following the flow above. The base project for this experiment is the template project Titan_npu_ai_face_detection provided by RT-Thread Studio. To match BlazeFace's pre- and post-processing, the rgb565_to_gray_resize_192_and_quantization function and the main-loop processing were rewritten as follows.
void rgb565_to_gray_resize_128_and_quantization(const uint16_t *src, int16_t src_w, int16_t src_h, float *f32_buf)
{
    /* Despite the template-derived name, this converts RGB565 to a
     * 128x128 RGB float tensor normalized to [-1, 1]. */
    const int16_t dst_w = 128;
    const int16_t dst_h = 128;
    for (int16_t y = 0; y < dst_h; y++)
    {
        int16_t sy = (y * src_h) / dst_h;
        const uint16_t *row = src + sy * src_w;
        for (int16_t x = 0; x < dst_w; x++)
        {
            int16_t sx = (x * src_w) / dst_w;
            uint16_t pix = row[sx];
            uint8_t r = (pix >> 11) & 0x1F;
            uint8_t g = (pix >> 5) & 0x3F;
            uint8_t b = (pix) & 0x1F;
            /* expand the 5/6-bit channels to 8 bits by replicating the top bits */
            r = (r << 3) | (r >> 2);
            g = (g << 2) | (g >> 4);
            b = (b << 3) | (b >> 2);
            float f_r = (float)r / 127.5f - 1.0f;
            float f_g = (float)g / 127.5f - 1.0f;
            float f_b = (float)b / 127.5f - 1.0f;
            f32_buf[y * dst_w * 3 + x * 3] = f_r;
            f32_buf[y * dst_w * 3 + x * 3 + 1] = f_g;
            f32_buf[y * dst_w * 3 + x * 3 + 2] = f_b;
        }
    }
}
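The bit-replication trick above expands the 5- and 6-bit RGB565 channels to 8 bits so that the channel extremes map exactly to 0 and 255. A small standalone check of the same arithmetic (written in Python for convenience; the function name is illustrative):

```python
def rgb565_to_rgb888(pix):
    """Unpack an RGB565 pixel and expand each channel to 8 bits by
    replicating the top bits into the low bits."""
    r5 = (pix >> 11) & 0x1F
    g6 = (pix >> 5) & 0x3F
    b5 = pix & 0x1F
    r8 = (r5 << 3) | (r5 >> 2)
    g8 = (g6 << 2) | (g6 >> 4)
    b8 = (b5 << 3) | (b5 >> 2)
    return r8, g8, b8

print(rgb565_to_rgb888(0xFFFF))  # (255, 255, 255)
print(rgb565_to_rgb888(0x0000))  # (0, 0, 0)
```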
The main loop is modified as follows:
void hal_entry(void)
{
    ...
    while (1)
    {
        ...
        rgb565_to_gray_resize_128_and_quantization(g_image_rgb565_sdram_buffer, CAM_WIDTH, CAM_HEIGHT, in_f32);
        memcpy(GetModelInputPtr_net1_serving_default_input_0(), in_f32, INPUT_H * INPUT_W * 3 * sizeof(float));
        RunModel_net1(false);
        float* out_c1 = GetModelOutputPtr_net1_StatefulPartitionedCall_0_70130();
        float* out_r1 = GetModelOutputPtr_net1_StatefulPartitionedCall_2_70133();
        float* out_c2 = GetModelOutputPtr_net1_StatefulPartitionedCall_1_70153();
        float* out_r2 = GetModelOutputPtr_net1_StatefulPartitionedCall_3_70156();
        int32_t n = 0;
        /* Only the 512 anchors of the first head are decoded here;
         * out_c2/out_r2 (the remaining 384 anchors) can be handled the same way. */
        for (int i = 0; i < 512; i++)
        {
            int offset = i * 16;
            int anchor_offset = i * 4;
            float x = out_r1[offset] / 128.0f * g_anchors[anchor_offset + 2] + g_anchors[anchor_offset];
            float y = out_r1[offset + 1] / 128.0f * g_anchors[anchor_offset + 3] + g_anchors[anchor_offset + 1];
            float w = out_r1[offset + 2] / 128.0f * g_anchors[anchor_offset + 2];
            float h = out_r1[offset + 3] / 128.0f * g_anchors[anchor_offset + 3];
            float y1 = (y - h / 2) * CAM_HEIGHT;
            float x1 = (x - w / 2) * CAM_WIDTH;
            float y2 = (y + h / 2) * CAM_HEIGHT;
            float x2 = (x + w / 2) * CAM_WIDTH;
            float score = out_c1[i];
            /* clamp before the sigmoid so expf() cannot overflow */
            if (score <= -100.0f)
            {
                score = -100.0f;
            }
            else if (score >= 100.0f)
            {
                score = 100.0f;
            }
            score = 1.0f / (1.0f + expf(-score));
            if (score >= 0.2f)
            {
                /* discard boxes that extend outside the frame */
                if ((y1 < 0) || (y1 >= CAM_HEIGHT)
                    || (y2 < 0) || (y2 >= CAM_HEIGHT)
                    || (x1 < 0) || (x1 >= CAM_WIDTH)
                    || (x2 < 0) || (x2 >= CAM_WIDTH))
                {
                    continue;
                }
                pool[n].y1 = (int16_t)y1;
                pool[n].x1 = (int16_t)x1;
                pool[n].y2 = (int16_t)y2;
                pool[n].x2 = (int16_t)x2;
                n++;
            }
        }
        ...
        lcd_draw_jpg_with_frame(0, 0, g_image_rgb565_sdram_buffer, CAM_WIDTH, CAM_HEIGHT, 0xff00ff00, 2, pool, n);
    }
}
3. Results
The final running result is shown in the accompanying video.