Applicant: 莊翔鈞  Project title: Optimizing Large Language Model Inference on RISC-V Platforms Using TVM Compiler
Review comments:
Reviewer 1:
This proposal aims to use the TVM compiler to optimize the inference performance of a large language model (Llama-2-7B) on a RISC-V platform (Banana Pi BPI-F3). The approach is to implement RISC-V RVV-specific computations (e.g., accelerated vectorized math functions) as custom operators and integrate them into the compilation and optimization flow via TVM's BYOC (Bring Your Own Codegen) mechanism.
Accelerating large language models on embedded systems is a research direction of current interest, and work approaching it from the compiler side is comparatively scarce, so the proposal has clear practical value. The application shows that the applicant already has a solid understanding of the TVM compiler architecture and the BYOC mechanism, and the methodology clearly lays out each step of the system implementation; the applicant can be expected to gain hands-on experience in embedded system design and implementation through this project.
The main challenge of this proposal lies in designing a custom codegen that supports the RISC-V RVV instruction set. Comparing custom codegens for other accelerator instruction sets in the literature review, and explaining the specific design difficulties of an RVV custom codegen, would help highlight the novelty of this proposal.
Reviewer 2:
This proposal applies TVM compiler techniques to optimize large language model (LLM) inference performance on a RISC-V platform. The topic is well defined, aligned with the important current trend of embedded AI deployment, and has significant academic and practical value. Targeting resource-constrained environments, the plan uses RVV vector instructions and quantization to improve both computational efficiency and resource utilization. The research content is well structured and clearly planned, demonstrating the applicant's strong system-integration skills and solid theoretical foundation. The advisor has extensive experience and offers a thorough mentoring plan with ample resources, which should help the applicant grow quickly and deepen their technical understanding. In addition, the proposal cites the relevant literature thoroughly and lays out concrete, feasible research steps and performance-evaluation methods, making the expected outcomes well defined and verifiable. Overall, the proposal will effectively build the applicant's practical skills in compilers and embedded AI; it has room for innovation and academic value, and with further refinement of the technical details it shows good development potential.
Award amount as follows:

It seems review grades are only given for humanities projects, not for STEM ones.
Expecting:
[
("<|startoftranscript|>", 50257),
("<|en|>", 50363),
(" Hello", 709),
(" world", 4374),
("<|endoftext|>", 50526)
]
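For cross-checking: some IDs in the expected list above disagree with what the tokenizer actually reports in the decode logs further down. Collected from those logs (standard multilingual Whisper vocabulary), as a small reference table:

```python
# Special-token IDs as reported by the tokenizer in the decode logs below
# (multilingual Whisper vocabulary).
SPECIALS = {
    "<|endoftext|>": 50257,
    "<|startoftranscript|>": 50258,
    "<|en|>": 50259,
    "<|translate|>": 50358,
    "<|transcribe|>": 50359,
    "<|notimestamps|>": 50363,
}
print(SPECIALS["<|startoftranscript|>"])  # 50258, not 50257
```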
fre930727@tvm:~/whisper-tiny$ python3 run_whisper_tvm.py
=== Step 0 (Prefill) ===
⬆️ Next token: 50258 (<|startoftranscript|>)
=== Step 1 ===
⬆️ Next token: 50358 (<|translate|>)
=== Step 2 ===
⬆️ Next token: 50363 (<|notimestamps|>)
=== Step 3 ===
⬆️ Next token: 50714 ()
=== Step 4 ===
⬆️ Next token: 50257 (<|endoftext|>)
🛑 Hit <eos>, ending decoding
📝 Transcription:
Solution tried:
import numpy as np
import tvm
from tvm import relax, runtime
from tvm.relax import VirtualMachine
from transformers import WhisperProcessor, WhisperTokenizer
import torchaudio
# === Initialize 16 empty KV tensors (shared by prefill and step-by-step decoding) ===
def init_zero_past_kv(num_layers=4, num_heads=6, head_dim=64,
                      decoder_seq_len=0, encoder_seq_len=1500, dtype="float32"):
    shape_decoder = (1, num_heads, decoder_seq_len, head_dim)
    shape_encoder = (1, num_heads, encoder_seq_len, head_dim)
    kvs = []
    for _ in range(num_layers):
        kvs += [
            tvm.nd.array(np.zeros(shape_decoder, dtype=dtype)),  # self.key
            tvm.nd.array(np.zeros(shape_decoder, dtype=dtype)),  # self.value
            tvm.nd.array(np.zeros(shape_encoder, dtype=dtype)),  # cross.key
            tvm.nd.array(np.zeros(shape_encoder, dtype=dtype))   # cross.value
        ]
    return kvs
# === Load model and tokenizer ===
processor = WhisperProcessor.from_pretrained("./")
tokenizer = WhisperTokenizer.from_pretrained("./")
# === Convert audio to a mel spectrogram ===
waveform, sr = torchaudio.load("audio.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
waveform = waveform.mean(dim=0, keepdim=True)
inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="np")
mel = inputs.input_features.astype("float32")
# === Encoder ===
encoder_vm = VirtualMachine(runtime.load_module("./onnx/encoder_model_fp16.so"), tvm.cpu())
encoder_out = encoder_vm["main"](tvm.nd.array(mel)) # shape: (1, 1500, 384)
# === Decoder Step 0: Prefill ===
decoder_prefill_vm = VirtualMachine(runtime.load_module("./onnx/decoder_model_fp16.so"), tvm.cpu())
start_token = 50257
eos_token = tokenizer.eos_token_id
tokens = [start_token]
input_ids = np.array([[start_token]], dtype="int64")
# Initialize all-zero KV (self + cross) to pass to the prefill decoder
past_kvs = init_zero_past_kv()
#inputs = [tvm.nd.array(input_ids), encoder_out] + past_kvs
inputs = [tvm.nd.array(input_ids), encoder_out]
# First run (start token)
print("\n=== Step 0 (Prefill) ===")
out = decoder_prefill_vm["main"](*inputs)
print("out:")
print(out)
# Get output and update tokens
logits = out[0].numpy()
next_token = int(np.argmax(logits[0, -1]))
tokens.append(next_token)
print(f"⬆️ Next token: {next_token} ({tokenizer.decode([next_token])})")
print(tokens)
# Now manually add the required special tokens one by one
required_tokens = [
    tokenizer.convert_tokens_to_ids("<|en|>"),
    tokenizer.convert_tokens_to_ids("<|transcribe|>"),
    tokenizer.convert_tokens_to_ids("<|notimestamps|>")
]
for token in required_tokens:
    input_ids = np.array([[token]], dtype="int64")  # Shape: (1, 1)
    #inputs = [tvm.nd.array(input_ids), encoder_out] + past_kvs
    inputs = [tvm.nd.array(input_ids), encoder_out]
    out = decoder_prefill_vm["main"](*inputs)
    print("out:")
    print(out)
    past_kvs = list(out[1:])  # Update KV cache
    tokens.append(token)
    print(f"⬆️ Added required token: {token} ({tokenizer.decode([token])})")
    print(tokens)
# Extract the 16 KV tensors returned by the decoder
decoder_kvs = list(out[1:])  # out[1]~out[16]
if next_token == eos_token:
    print("🛑 Hit <eos>, ending decoding")
    transcript = tokenizer.decode(tokens, skip_special_tokens=True)
    print("\n📝 Transcription:\n", transcript)
    exit()
# === Decoder Steps 1..N: step-by-step decoding ===
decoder_vm = VirtualMachine(runtime.load_module("./onnx/decoder_with_past_model_fp16.so"), tvm.cpu())
max_length = 64
for step in range(1, max_length):
    print(f"\n=== Step {step} ===")
    input_ids = np.array([[tokens[-1]]], dtype="int64")
    inputs = [tvm.nd.array(input_ids)] + decoder_kvs
    out = decoder_vm["main"](*inputs)
    print("out:")
    print(out)
    logits = out[0].numpy()
    next_token = int(np.argmax(logits[0, -1]))
    tokens.append(next_token)
    print(f"⬆️ Next token: {next_token} ({tokenizer.decode([next_token])})")
    if next_token == eos_token:
        print("🛑 Hit <eos>, ending decoding")
        break
    # Only update the self-attention slots (indices 0,1,4,5,8,9,12,13)
    for i, dst_idx in enumerate([0, 1, 4, 5, 8, 9, 12, 13]):
        decoder_kvs[dst_idx] = out[i + 1]

# === Print the final result ===
transcript = tokenizer.decode(tokens, skip_special_tokens=True)
print("\n📝 Transcription:\n", transcript)
print(tokens)
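The hard-coded index list [0, 1, 4, 5, 8, 9, 12, 13] in the update loop above follows from the flat KV layout built by init_zero_past_kv (four tensors per layer: self.key, self.value, cross.key, cross.value). A quick sanity check of that mapping:

```python
# Flat KV layout: [self.key, self.value, cross.key, cross.value] repeated per layer.
num_layers = 4
layout = [name for _ in range(num_layers)
          for name in ("self.key", "self.value", "cross.key", "cross.value")]
# Self-attention KV slots are the ones the with-past decoder actually grows;
# cross-attention KV stays fixed after the encoder pass.
self_attn_idx = [i for i, name in enumerate(layout) if name.startswith("self.")]
print(self_attn_idx)  # [0, 1, 4, 5, 8, 9, 12, 13]
```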
This Whisper model's output goes like:
=== Step 0 (Prefill) ===
⬆️ Next token: 50258 (<|startoftranscript|>)
[50257, 50258]
⬆️ Added required token: 50259 (<|en|>)
[50257, 50258, 50259]
⬆️ Added required token: 50359 (<|transcribe|>)
[50257, 50258, 50259, 50359]
⬆️ Added required token: 50363 (<|notimestamps|>)
[50257, 50258, 50259, 50359, 50363]
=== Step 1 ===
⬆️ Next token: 50258 (<|startoftranscript|>)
=== Step 2 ===
⬆️ Next token: 50258 (<|startoftranscript|>)
=== Step 3 ===
⬆️ Next token: 50257 (<|endoftext|>)
🛑 Hit <eos>, ending decoding
📝 Transcription:
[50257, 50258, 50259, 50359, 50363, 50258, 50258, 50257]
5/21: Compilation succeeded, but there was no output; found the cause was that prefill had not been done.
6/4: After adding prefill it can run to step 4, but there is still no output; the suspected cause is the translate/transcribe token.
I'd like to ask a senior how prefill is actually supposed to be done.
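For reference, a common way to do Whisper prefill is to feed the entire forced decoder prompt (<|startoftranscript|>, language, task, <|notimestamps|>) to the no-past decoder in a single call with shape (1, 4), take the argmax only at the last position, and only then switch to the with-past decoder one token at a time. A minimal sketch of that control flow, with a random-logits stand-in for the TVM decoder module (the token IDs are the standard multilingual Whisper ones; the decoder signature here is an assumption, not the exported model's API):

```python
import numpy as np

# Forced decoder prompt: <|startoftranscript|>, <|en|>, <|transcribe|>, <|notimestamps|>
PROMPT = [50258, 50259, 50359, 50363]
VOCAB = 51865  # multilingual Whisper vocabulary size

def dummy_prefill_decoder(input_ids):
    """Stand-in for the no-past decoder: takes the whole (1, T) prompt,
    returns (1, T, VOCAB) logits (random here; the real TVM module elsewhere)."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((1, input_ids.shape[1], VOCAB)).astype("float32")

# Prefill: ONE call over the full prompt, not one call per prompt token.
input_ids = np.array([PROMPT], dtype="int64")   # shape (1, 4)
logits = dummy_prefill_decoder(input_ids)

# Only the last position's logits choose the first generated token;
# the KV cache returned by this call would seed the with-past decoder.
first_token = int(np.argmax(logits[0, -1]))
tokens = PROMPT + [first_token]
# Subsequent steps feed only tokens[-1] with shape (1, 1) to decoder_with_past.
```

The key difference from the script above is that the prompt tokens are consumed in one batched prefill pass, so the KV cache already covers all four positions before the first real generation step.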