1. Convolve the audio with a rectangular function of a certain width
2. The local maxima of the convolved signal show where the average amplitude is highest, so increase the amplitude of those regions a bit
3. Repeat with smaller rectangular functions to capture finer details
Does this already exist? And if it doesn't, would this idea even work?
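
To make the three steps concrete, here is a rough sketch in Python with NumPy. It is only one possible reading of the idea, not a tested audio effect. The function name `boost_loud_regions`, the default `widths`, and the `gain` value are all made up for illustration. Two assumptions go beyond the steps as written: the audio is rectified with `np.abs` before averaging (a moving average of the raw signed samples would mostly cancel out and act as a low-pass filter rather than measure loudness), and a "region" is taken to be one window width centred on each local maximum.

```python
import numpy as np

def boost_loud_regions(audio, widths=(4096, 1024, 256), gain=1.1):
    """Hypothetical sketch: boost samples around local maxima of a
    moving-average envelope, repeating with narrower windows."""
    out = np.asarray(audio, dtype=float).copy()
    for width in widths:
        # Step 1: rectangular (boxcar) kernel, normalised so the result
        # is an average. Assumption: average |x| instead of x.
        kernel = np.ones(width) / width
        envelope = np.convolve(np.abs(out), kernel, mode="same")

        # Step 2: local maxima of the envelope (strictly above the left
        # neighbour, not below the right one, so a flat top counts once).
        mid = envelope[1:-1]
        peaks = np.flatnonzero((mid > envelope[:-2]) & (mid >= envelope[2:])) + 1

        # Mark one window width around each peak. A boolean mask is used
        # so overlapping regions are boosted only once per pass.
        mask = np.zeros(len(out), dtype=bool)
        half = width // 2
        for p in peaks:
            mask[max(0, p - half):p + half + 1] = True
        out[mask] *= gain
        # Step 3: the loop repeats with the next, smaller width.
    return out
```

For example, `boost_loud_regions(samples, widths=(50,), gain=1.1)` runs a single pass with a 50-sample window. Note that on real audio the envelope is likely to have many small local maxima, so a large share of the signal may end up inside some boosted region unless the peaks are filtered further, and the hard-edged gain steps may be audible.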