As reported in #3500, when debug interpreter is enabled, the classic interpreter
performs a lock operation to read `exec_env->current_status->signal_flag` and
do further handling before fetching next opcode, which makes the interpreter
run slower.
This PR atomic loads the `exec_env->current_status->signal_flag` without mutex
lock when 32-bit atomic load is supported, and only adding lock for further
handling when the signal_flag is WAMR_SIG_SINGSTEP, which improves the
performance.