[Longan Pi 3H Development Board Trial Series] Giving ChatGPT Eyes and a Voice: Part 8, Integrating GPT and Assembling the Finished Project

*Attachment: jarvis.rar

GPT integration

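There is a very well-known open-source project on GitHub, one that has also stirred up a fair amount of controversy: gpt4free. It works much like poe wrapper. Both send web requests to the GPT interfaces exposed by various third-party platforms, so you can use GPT directly without buying an API key.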

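However, because of network conditions in mainland China, most of those third-party platforms are unreachable. In my tests, this library may be the only GPT-3.5 Python API that can be reached directly without special workarounds. Each request goes through many failed attempts and retries, though, so it is fairly slow, and occasionally a request fails outright.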

Enough preamble. Let's look at how it is implemented.

First, install the library with pip:

pip install -U g4f

Once it is installed, we can write a simple chatbot to test it:

from g4f.client import Client

client = Client()
# System prompt (Chinese): "You are an assistant. Your job is to answer
# everyday questions with solutions and useful advice."
messages = [{"role": "system", "content": "你是一名助手，你的任务是回答日常问题的解决方法，并提供有用的建议。"}]

while True:
    content = input("> Say something: ")
    messages.append({"role": "user", "content": content})
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
    )
    # Store the reply so the next turn sees the whole conversation.
    reply = response.choices[0].message
    messages.append({"role": reply.role, "content": reply.content})
    gpt_reply = messages[-1]["content"]
    print(f"GPT reply: {gpt_reply}")

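If you really cannot get a connection while testing, note that client.chat.completions.create accepts an optional proxy argument, which you can use to work around the problem yourself.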
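Because a request through gpt4free can fail now and then, it may help to wrap the call in a small retry helper. The following is only a minimal sketch: call_with_retry and its parameters are hypothetical names made up for illustration, not part of g4f, and the exponential backoff schedule is an assumption rather than anything the library prescribes.

```python
import time


def call_with_retry(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds or `attempts` tries are used up.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between tries.
    `sleep` is injectable so the helper can be exercised without waiting.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as error:  # broad on purpose: provider errors vary
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"all {attempts} attempts failed") from last_error


# Hypothetical usage with the chatbot above:
# response = call_with_retry(
#     lambda: client.chat.completions.create(
#         model="gpt-3.5-turbo", messages=messages))
```

Passing the request as a zero-argument callable keeps the helper independent of g4f, so the same wrapper could also guard the webcam fetch later on.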

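With this step done, every functional module the project needs has been implemented. Next we put them together, starting with the code that runs on the Longan Pi.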

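The code first creates a server with FastAPI. When the server receives a request, it grabs an image from the webcam, runs object recognition on it, and concatenates the recognition result with the question from the request to form the prompt for GPT. It then queries GPT, converts the text GPT returns into speech with edge TTS, and returns the audio bytes to the client that issued the GET request.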

The complete code is as follows:

from fastapi import FastAPI
from fastapi.responses import Response
from PIL import Image
import requests
import io
from ultralytics import YOLO
from g4f.client import Client
import edge_tts

app = FastAPI()
model = YOLO('yolov8n.pt')
client = Client()

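# System prompt (Chinese): act as a human assistant who can see, answer from
# what the user says is in view, and never mention being a program, being a
# language model, or being unable to see.
messages = [{"role": "system", "content": "你是一个人类助手，并且假设你拥有视力，可以看到东西。你的任务是根据user提供的你所看到的内容，回答user的问题。请在接下来的对话中隐藏你是计算机程序这件事，并且模仿的像一个具有视力的人类助手。不要提及你是一个语言模型，不要提及你无法看到任何东西。"}]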

@app.get("/")
async def mainapp(question):
    #   GET IMAGE
    img_raw = requests.get("http://192.168.199.143:8080/photoaf.jpg").content
    img = Image.open(io.BytesIO(img_raw))
    #   RECOGNITION
    detections = []  # defined up front so it exists even if nothing is yielded
    results = model.predict(source=img, stream=True)
    for result in results:
        for box in result.boxes:
            # box.cls holds the class index as a float tensor; names is keyed by int
            class_id = result.names[int(box.cls[0].item())]
            cords = box.xyxy[0].tolist()
            cords = [round(x) for x in cords]
            confi = round(box.conf[0].item(), 2)
            print("Object type:", class_id)
            print("Coordinates:", cords)
            print("Probability:", confi)
            print("---")
            detections.append({"name" : class_id, "cords" : cords, "conf" : confi})
    #   GET QUESTION
    if detections:
        # Join the detected class names with a Chinese comma
        obj = "，".join(d["name"] for d in detections)
    else:
        obj = "什么都没有"  # "nothing at all"
    # Prompt (Chinese): "Suppose you see these things in front of you: <obj>.
    # Answer this question in Chinese, as a human assistant: <question>"
    content = "假如你看到了你的面前有以下这些东西：" + obj + "。请用中文，以人类助手的身份回答这个问题：" + question
    #   GPT
    messages.append({"role": "user", "content": content})
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
    )
    messages.append({"role": response.choices[0].message.role, "content": response.choices[0].message.content})
    gpt_reply = messages[-1]["content"]
    print(f"GPT reply: {gpt_reply}")
    #   TTS
    voices = await edge_tts.VoicesManager.create()
    voices = voices.find(ShortName="zh-CN-XiaoyiNeural")
    communicate = edge_tts.Communicate(gpt_reply, voices[0]["Name"])
    out = bytes()
    async for chunk in communicate.stream():
        if chunk["type"] == "audio":
            out += chunk["data"]
        elif chunk["type"] == "WordBoundary":
            pass

    return Response(out, media_type="audio/mpeg")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080)
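The prompt-building step and the ever-growing messages list are two places where small pure functions might make the server easier to test. The sketch below is an illustration under assumptions, not the code from the attachment: build_prompt and trim_history are hypothetical helper names, dropping repeated labels is a design choice that differs from the listing above (which repeats a name once per detected box), and the limit of 10 messages is arbitrary.

```python
def build_prompt(detections, question):
    """Combine detected object names and the user's question into one prompt."""
    # dict.fromkeys keeps first-seen order while dropping repeated labels
    names = list(dict.fromkeys(d["name"] for d in detections))
    seen = "，".join(names) if names else "什么都没有"
    return (
        "假如你看到了你的面前有以下这些东西：" + seen
        + "。请用中文，以人类助手的身份回答这个问题：" + question
    )


def trim_history(messages, max_messages=10):
    """Return the system prompt plus at most max_messages recent entries."""
    system, rest = messages[:1], messages[1:]
    if max_messages <= 0:
        return system
    return system + rest[-max_messages:]
```

Inside the endpoint, one could call trim_history(messages) before each request so a long-running server does not send an unbounded conversation to GPT.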

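The PC-side code is split into two parts. The first runs image recognition continuously and displays the result, so we can watch in real time what GPT "sees":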

from PIL import Image
import requests
import io
from ultralytics import YOLO

model = YOLO('yolov8n.pt')

while True:
    #   GET IMAGE
    img_raw = requests.get("http://192.168.199.143:8080/photoaf.jpg").content
    img = Image.open(io.BytesIO(img_raw)).resize((1280, 720))
    #   RECOGNITION
    results = model.predict(source=img, stream=True, conf=0.5, show=True)
    for result in results:
        detections = []
        for box in result.boxes:
            class_id = result.names[box.cls[0].item()]
            cords = box.xyxy[0].tolist()
            cords = [round(x) for x in cords]
            confi = round(box.conf[0].item(), 2)
            print("Object type:", class_id)
            print("Coordinates:", cords)
            print("Probability:", confi)
            print("---")
            detections.append({"name" : class_id, "cords" : cords, "conf" : confi})

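The other part is the main program, which captures audio, transcribes it, sends a request to the Longan Pi, and plays back the audio bytes it receives in reply.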

import speech_recognition as sr
import requests
from pydub import AudioSegment
from pydub.playback import play
import io

r = sr.Recognizer()

while True:
    with sr.Microphone() as source:
        r.dynamic_energy_threshold = False
        r.adjust_for_ambient_noise(source)
        print("> Say something:")
        audio_in = r.listen(source)
        print("Processing...")
    try:
        text_input = r.recognize_whisper(audio_in, model="small", language="chinese")
        print("You said: " + text_input)
    except sr.UnknownValueError:
        print("Whisper could not understand audio")
        text_input = None
    except sr.RequestError as _error:
        print("Could not request results from Whisper")
        print(_error)
        text_input = None
    if text_input:
        # Pass the question via params so requests URL-encodes it
        reply = requests.get('http://192.168.199.124:8080/', params={"question": text_input}).content
        audio_out = AudioSegment.from_file(io.BytesIO(reply), format="mp3")
        play(audio_out)
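The recognized question is free-form text, so it can contain spaces, Chinese characters, or symbols such as & and ? that have a special meaning in a URL. The sketch below uses only the standard library to show what encoding the question as a query parameter looks like; build_question_url is a hypothetical helper name and the sample sentence is made up for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_question_url(base_url, question):
    """Return base_url with `question` attached as an encoded query parameter."""
    return base_url + "?" + urlencode({"question": question})


url = build_question_url("http://192.168.199.124:8080/", "你看到了什么 & 为什么?")
# Decoding the query string recovers the original text unchanged.
decoded = parse_qs(urlsplit(url).query)["question"][0]
print(url)
print(decoded)
```

Without the encoding step, everything after the & would be parsed as a second parameter and the server would receive only part of the question.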