Abstract
Dynamic binary translation (DBT) is a pivotal technique for instruction set simulation, yet it faces critical challenges in handling the explicit instruction-level parallelism (ILP) and operational latency inherent in very long instruction word (VLIW) architectures. Existing approaches have limited support for nonarithmetic operations, particularly branch and memory access instructions. Translation becomes harder still for software-pipelined loops that use architecture-specific instructions, because of two inherent characteristics: 1) instructions are reordered, overlapped, and masked within loop bodies, and 2) explicit conditional branches and loop-counter manipulation instructions are absent. This work presents three original strategies for VLIW code translation. First, constraining the translation block (TB) length resolves the challenges posed by branch instructions. Second, postponing the processing of store instructions manages store-load dependencies. Third, a novel software-pipelined loop translation methodology ensures correct execution semantics by serializing parallel iterations, generating state-specific TBs, and synchronizing inner- and outer-loop translations. We implemented these techniques in VEMU and evaluated it on the dsplib and Polybench benchmarks. VEMU successfully translates all benchmark programs. The first strategy does not degrade performance, and the second incurs a 0.2% time overhead. Comparative analysis shows a speedup of between $3.25\times$ and $7\times$ over the Texas Instruments simulator on the same benchmarks, validating the efficacy of the proposed techniques.