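A brief introduction to matrix blocking (splitting the matrices into blocks small enough to fit in the cache). The implementation is shown in the figure below.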
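As a rough illustration of the blocking idea, here is a minimal sketch in C. It assumes square, row-major matrices of `double`. The names `matmul_blocked`, `min_size`, and the block-size parameter `bs` are hypothetical and do not come from the original implementation, which may differ in loop order and tile shape.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the smaller of two sizes. */
static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Computes C += A * B for row-major n x n matrices, one bs x bs tile
 * at a time. The caller must zero C first to get a plain product.
 * Assumption: bs is chosen so that the tiles in use fit in cache. */
void matmul_blocked(size_t n, size_t bs,
                    const double *A, const double *B, double *C)
{
    assert(bs > 0);
    for (size_t ii = 0; ii < n; ii += bs) {
        size_t i_end = min_size(ii + bs, n);
        for (size_t kk = 0; kk < n; kk += bs) {
            size_t k_end = min_size(kk + bs, n);
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t j_end = min_size(jj + bs, n);
                /* Multiply one tile of A by one tile of B and
                 * accumulate into the matching tile of C. */
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < j_end; j++) {
                            C[i * n + j] += a * B[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The block size is a tuning parameter. Three `bs` x `bs` tiles of `double` occupy `3 * bs * bs * 8` bytes, so `bs = 32` needs 24 KiB; whether that fits depends on the cache of the target machine. The `min_size` clamps let the routine handle sizes that are not a multiple of `bs`.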
For size 384, we have:
For size 1024, we have: