Applicant: 莊翔鈞  Project title: Optimizing Large Language Model Inference on RISC-V Platforms Using TVM Compiler
Review comments:
Reviewer 1:
This proposal aims to use the TVM compiler to optimize the inference performance of a large language model (Llama-2-7B) on a RISC-V platform (Banana Pi BPI-F3). The approach is to implement RISC-V RVV-specific computations (e.g., accelerated vectorized math functions) as custom operators and integrate them into the compilation and optimization flow via TVM's BYOC (Bring Your Own Codegen) mechanism.
Accelerating large language models on embedded systems is a research direction of current interest, and work approaching it from the compiler side is comparatively scarce, so the proposal has clear practical value. The application shows that the applicant already has a solid understanding of the TVM compiler architecture and the BYOC mechanism, and the methodology clearly lays out each step of the system implementation; the applicant can be expected to gain hands-on experience in embedded system design and implementation through this project.
The main challenge of this proposal lies in designing a custom codegen that supports the RISC-V RVV instruction set. Comparing custom codegens for other accelerator instruction sets in the literature review, and explaining the specific design difficulties of an RVV custom codegen, would help highlight the novelty of this proposal.
Reviewer 2:
This proposal applies TVM compiler techniques to optimize large language model (LLM) inference performance on a RISC-V platform. The topic is well defined, aligned with the important current trend of embedded AI deployment, and has significant academic and practical value. Targeting resource-constrained environments, the plan uses RVV vector instructions and quantization to improve both computational efficiency and resource utilization. The research content is well structured and clearly planned, demonstrating the applicant's strong system-integration skills and solid theoretical foundation. The advisor has extensive experience and offers a thorough mentoring plan with ample resources, which should help the applicant grow quickly and deepen their technical understanding. In addition, the proposal cites the relevant literature thoroughly and lays out concrete, feasible research steps and performance-evaluation methods, making the expected outcomes well defined and verifiable. Overall, the proposal will effectively build the applicant's practical skills in compilers and embedded AI; it has room for innovation and academic value, and with further refinement of the technical details it shows good development potential.
Award amount as follows:

It seems review grades are only given for humanities projects, not for STEM ones.
Expecting:
[
("<|startoftranscript|>", 50257),
("<|en|>", 50363),
(" Hello", 709),
(" world", 4374),
("<|endoftext|>", 50526)
]
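For cross-checking: some IDs in the expected list above disagree with what the tokenizer actually reports in the decode logs further down. Collected from those logs (standard multilingual Whisper vocabulary), as a small reference table:

```python
# Special-token IDs as reported by the tokenizer in the decode logs below
# (multilingual Whisper vocabulary).
SPECIALS = {
    "<|endoftext|>": 50257,
    "<|startoftranscript|>": 50258,
    "<|en|>": 50259,
    "<|translate|>": 50358,
    "<|transcribe|>": 50359,
    "<|notimestamps|>": 50363,
}
print(SPECIALS["<|startoftranscript|>"])  # 50258, not 50257
```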
fre930727@tvm:~/whisper-tiny$ python3 run_whisper_tvm.py
=== Step 0 (Prefill) ===
⬆️ Next token: 50258 (<|startoftranscript|>)
=== Step 1 ===
⬆️ Next token: 50358 (<|translate|>)
=== Step 2 ===
⬆️ Next token: 50363 (<|notimestamps|>)
=== Step 3 ===
⬆️ Next token: 50714 ()
=== Step 4 ===
⬆️ Next token: 50257 (<|endoftext|>)
🛑 Hit <eos>, ending decoding
📝 Transcription:
Solution tried:
import numpy as np
import tvm
from tvm import relax, runtime
from tvm.relax import VirtualMachine
from transformers import WhisperProcessor, WhisperTokenizer
import torchaudio
# === Initialize 16 empty KV tensors (shared by prefill and step-by-step decoding) ===
def init_zero_past_kv(num_layers=4, num_heads=6, head_dim=64,
                      decoder_seq_len=0, encoder_seq_len=1500, dtype="float32"):
    shape_decoder = (1, num_heads, decoder_seq_len, head_dim)
    shape_encoder = (1, num_heads, encoder_seq_len, head_dim)
    kvs = []
    for _ in range(num_layers):
        kvs += [
            tvm.nd.array(np.zeros(shape_decoder, dtype=dtype)),  # self.key
            tvm.nd.array(np.zeros(shape_decoder, dtype=dtype)),  # self.value
            tvm.nd.array(np.zeros(shape_encoder, dtype=dtype)),  # cross.key
            tvm.nd.array(np.zeros(shape_encoder, dtype=dtype))   # cross.value
        ]
    return kvs
# === Load model and tokenizer ===
processor = WhisperProcessor.from_pretrained("./")
tokenizer = WhisperTokenizer.from_pretrained("./")
# === Convert audio to a mel spectrogram ===
waveform, sr = torchaudio.load("audio.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
waveform = waveform.mean(dim=0, keepdim=True)
inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="np")
mel = inputs.input_features.astype("float32")
# === Encoder ===
encoder_vm = VirtualMachine(runtime.load_module("./onnx/encoder_model_fp16.so"), tvm.cpu())
encoder_out = encoder_vm["main"](tvm.nd.array(mel)) # shape: (1, 1500, 384)
# === Decoder Step 0: Prefill ===
decoder_prefill_vm = VirtualMachine(runtime.load_module("./onnx/decoder_model_fp16.so"), tvm.cpu())
start_token = 50257
eos_token = tokenizer.eos_token_id
tokens = [start_token]
input_ids = np.array([[start_token]], dtype="int64")
# Initialize all-zero KV (self + cross) to pass to the prefill decoder
past_kvs = init_zero_past_kv()
#inputs = [tvm.nd.array(input_ids), encoder_out] + past_kvs
inputs = [tvm.nd.array(input_ids), encoder_out]
# First run (start token)
print("\n=== Step 0 (Prefill) ===")
out = decoder_prefill_vm["main"](*inputs)
print("out:")
print(out)
# Get output and update tokens
logits = out[0].numpy()
next_token = int(np.argmax(logits[0, -1]))
tokens.append(next_token)
print(f"⬆️ Next token: {next_token} ({tokenizer.decode([next_token])})")
print(tokens)
# Now manually add the required special tokens one by one
required_tokens = [
    tokenizer.convert_tokens_to_ids("<|en|>"),
    tokenizer.convert_tokens_to_ids("<|transcribe|>"),
    tokenizer.convert_tokens_to_ids("<|notimestamps|>")
]
for token in required_tokens:
    input_ids = np.array([[token]], dtype="int64")  # Shape: (1, 1)
    #inputs = [tvm.nd.array(input_ids), encoder_out] + past_kvs
    inputs = [tvm.nd.array(input_ids), encoder_out]
    out = decoder_prefill_vm["main"](*inputs)
    print("out:")
    print(out)
    past_kvs = list(out[1:])  # Update KV cache
    tokens.append(token)
    print(f"⬆️ Added required token: {token} ({tokenizer.decode([token])})")
    print(tokens)
# Extract the 16 KV tensors returned by the decoder
decoder_kvs = list(out[1:])  # out[1]~out[16]
if next_token == eos_token:
    print("🛑 Hit <eos>, ending decoding")
    transcript = tokenizer.decode(tokens, skip_special_tokens=True)
    print("\n📝 Transcription:\n", transcript)
    exit()
# === Decoder Steps 1..N: step-by-step decoding ===
decoder_vm = VirtualMachine(runtime.load_module("./onnx/decoder_with_past_model_fp16.so"), tvm.cpu())
max_length = 64
for step in range(1, max_length):
    print(f"\n=== Step {step} ===")
    input_ids = np.array([[tokens[-1]]], dtype="int64")
    inputs = [tvm.nd.array(input_ids)] + decoder_kvs
    out = decoder_vm["main"](*inputs)
    print("out:")
    print(out)
    logits = out[0].numpy()
    next_token = int(np.argmax(logits[0, -1]))
    tokens.append(next_token)
    print(f"⬆️ Next token: {next_token} ({tokenizer.decode([next_token])})")
    if next_token == eos_token:
        print("🛑 Hit <eos>, ending decoding")
        break
    # Only update the self-attention slots (indices 0,1,4,5,8,9,12,13)
    for i, dst_idx in enumerate([0, 1, 4, 5, 8, 9, 12, 13]):
        decoder_kvs[dst_idx] = out[i + 1]

# === Print the final result ===
transcript = tokenizer.decode(tokens, skip_special_tokens=True)
print("\n📝 Transcription:\n", transcript)
print(tokens)
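The hard-coded index list [0, 1, 4, 5, 8, 9, 12, 13] in the update loop above follows from the flat KV layout built by init_zero_past_kv (four tensors per layer: self.key, self.value, cross.key, cross.value). A quick sanity check of that mapping:

```python
# Flat KV layout: [self.key, self.value, cross.key, cross.value] repeated per layer.
num_layers = 4
layout = [name for _ in range(num_layers)
          for name in ("self.key", "self.value", "cross.key", "cross.value")]
# Self-attention KV slots are the ones the with-past decoder actually grows;
# cross-attention KV stays fixed after the encoder pass.
self_attn_idx = [i for i, name in enumerate(layout) if name.startswith("self.")]
print(self_attn_idx)  # [0, 1, 4, 5, 8, 9, 12, 13]
```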
This Whisper model's output goes like:
=== Step 0 (Prefill) ===
⬆️ Next token: 50258 (<|startoftranscript|>)
[50257, 50258]
⬆️ Added required token: 50259 (<|en|>)
[50257, 50258, 50259]
⬆️ Added required token: 50359 (<|transcribe|>)
[50257, 50258, 50259, 50359]
⬆️ Added required token: 50363 (<|notimestamps|>)
[50257, 50258, 50259, 50359, 50363]
=== Step 1 ===
⬆️ Next token: 50258 (<|startoftranscript|>)
=== Step 2 ===
⬆️ Next token: 50258 (<|startoftranscript|>)
=== Step 3 ===
⬆️ Next token: 50257 (<|endoftext|>)
🛑 Hit <eos>, ending decoding
📝 Transcription:
[50257, 50258, 50259, 50359, 50363, 50258, 50258, 50257]
5/21: Compilation succeeded, but there was no output; found the cause was that prefill had not been done.
6/4: After adding prefill it can run to step 4, but there is still no output; the suspected cause is the translate/transcribe token.
I'd like to ask a senior how prefill is actually supposed to be done.
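For reference, a common way to do Whisper prefill is to feed the entire forced decoder prompt (<|startoftranscript|>, language, task, <|notimestamps|>) to the no-past decoder in a single call with shape (1, 4), take the argmax only at the last position, and only then switch to the with-past decoder one token at a time. A minimal sketch of that control flow, with a random-logits stand-in for the TVM decoder module (the token IDs are the standard multilingual Whisper ones; the decoder signature here is an assumption, not the exported model's API):

```python
import numpy as np

# Forced decoder prompt: <|startoftranscript|>, <|en|>, <|transcribe|>, <|notimestamps|>
PROMPT = [50258, 50259, 50359, 50363]
VOCAB = 51865  # multilingual Whisper vocabulary size

def dummy_prefill_decoder(input_ids):
    """Stand-in for the no-past decoder: takes the whole (1, T) prompt,
    returns (1, T, VOCAB) logits (random here; the real TVM module elsewhere)."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((1, input_ids.shape[1], VOCAB)).astype("float32")

# Prefill: ONE call over the full prompt, not one call per prompt token.
input_ids = np.array([PROMPT], dtype="int64")   # shape (1, 4)
logits = dummy_prefill_decoder(input_ids)

# Only the last position's logits choose the first generated token;
# the KV cache returned by this call would seed the with-past decoder.
first_token = int(np.argmax(logits[0, -1]))
tokens = PROMPT + [first_token]
# Subsequent steps feed only tokens[-1] with shape (1, 1) to decoder_with_past.
```

The key difference from the script above is that the prompt tokens are consumed in one batched prefill pass, so the KV cache already covers all four positions before the first real generation step.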