Convert Video Streams Into Podcast-Style Audio by Removing Background Music and Normalizing the Volume Levels

As someone who enjoys watching Twitch streamers, VTubers, and livers, I've often found myself immersed in their captivating content. However, there are times when the visual aspect isn't essential and I'd rather listen to a stream as audio, much like a podcast. This led me to explore ways to convert streams into audio files for convenient listening.

One of the primary challenges I encountered was the varying audio quality across different streams. Some streams had background music (BGM) that overshadowed the voices, while others suffered from uneven volume levels, especially during collaborative sessions where multiple voices were present.

To address these issues and create high-quality audio files, I devised a solution built on two powerful tools, Spleeter and FFmpeg, plus SoX to join the processed pieces back together.

You can find the relevant script at the end of this blog entry.

Spleeter: Removing Background Music

Spleeter is a remarkable open-source tool by Deezer for source separation, capable of isolating vocals from background music in audio tracks. By leveraging Spleeter, I could effectively remove distracting background music from Twitch streams, ensuring that the focus remained on the streamers' voices.
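As a rough illustration of this step, here is a minimal sketch of separating a single WAV file. The helper names `vocals_path` and `separate_vocals` are made up for this example. It assumes Spleeter's two-stem model (`spleeter:2stems`) and its default output layout of one folder per input file containing `vocals.wav` and `accompaniment.wav`, so check this against your installed version.

```shell
#!/bin/bash

# Hypothetical helper: where the vocal stem should land, assuming
# Spleeter's default "<output dir>/<input name>/vocals.wav" layout.
vocals_path() {
    local wav_file=$1 out_dir=$2
    local stem
    stem=$(basename "$wav_file" .wav)
    printf '%s/%s/vocals.wav\n' "$out_dir" "$stem"
}

# Hypothetical helper: separate one file and print the path of the vocal stem.
# Assumes the 2-stem model (vocals + accompaniment).
separate_vocals() {
    local wav_file=$1 out_dir=$2
    spleeter separate -p spleeter:2stems -o "$out_dir" "$wav_file" &&
        vocals_path "$wav_file" "$out_dir"
}
```

With Spleeter installed, `separate_vocals talk.wav /tmp/separated` would be expected to print `/tmp/separated/talk/vocals.wav`. The accompaniment stem is simply ignored.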

FFmpeg: Dynamic Loudness Equalization

While Spleeter efficiently removed background music, I still faced challenges with uneven volume levels among different speakers or during collaborative streams. To address this, I turned to FFmpeg, a versatile multimedia framework capable of handling various audio and video processing tasks.

Following a really well-written Medium post by Jud Dagnall, I applied FFmpeg's compand filter to dynamically equalize the loudness of the audio tracks. This filter compresses or expands the audio's dynamic range, which keeps volume levels consistent throughout the stream. The filter parameters are the ones the post's author tuned to their own preferences, and I found them highly effective at achieving balanced loudness without compromising audio quality.
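To get a feel for what those parameters do, the sketch below estimates the output level for a given input level from the points used in the script at the end of this post. Each point is an input/output pair in dB. The helper `compand_output_db` is hypothetical. It assumes straight-line interpolation in dB between neighbouring points and ignores the filter's soft-knee smoothing and its attack and decay timing, so treat the numbers as approximations.

```shell
#!/bin/bash

# Points copied from the script's compand filter (input dB / output dB).
POINTS="-80/-900|-45/-15|-27/-9|0/-7|20/-7"

# Hypothetical helper: approximate output level (dB) for an input level (dB).
# Simplification: linear interpolation between points, no soft knee.
compand_output_db() {
    awk -v in_db="$1" -v points="$POINTS" 'BEGIN {
        n = split(points, pairs, "|")
        for (i = 1; i <= n; i++) {
            split(pairs[i], xy, "/")
            x[i] = xy[1] + 0
            y[i] = xy[2] + 0
        }
        # Outside the listed range, hold the nearest end point (an assumption).
        if (in_db <= x[1]) { printf "%.1f\n", y[1]; exit }
        if (in_db >= x[n]) { printf "%.1f\n", y[n]; exit }
        # Find the segment containing in_db and interpolate along it.
        for (i = 1; i < n; i++) {
            if (in_db <= x[i + 1]) {
                slope = (y[i + 1] - y[i]) / (x[i + 1] - x[i])
                printf "%.1f\n", y[i] + (in_db - x[i]) * slope
                exit
            }
        }
    }'
}
```

Under these assumptions, `compand_output_db -36` prints `-12.0` and `compand_output_db 0` prints `-7.0`. Quiet speech is lifted a lot while loud peaks are pulled down slightly, which is what evens out speakers at different volumes. The first point maps very quiet input far down, so near-silence stays quiet.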

The Script

This script looks for webm files in the current directory; feel free to adapt it to your requirements.
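One possible adaptation is to accept several container formats. This is a minimal sketch, not part of the original script. The helper `output_name` and the extra extensions (`mkv`, `mp4`) are assumptions chosen for illustration, and it only handles files in the current directory.

```shell
#!/bin/bash

# Hypothetical helper: build the output name by stripping whatever
# extension the input has, instead of hard-coding ".webm".
output_name() {
    local file=$1
    printf '%s_processed.opus\n' "${file%.*}"
}

for file in *.webm *.mkv *.mp4; do
    # Skip patterns that matched no file and were left as literal text.
    [ -e "$file" ] || continue
    echo "$file -> $(output_name "$file")"
done
```

In the full script, the loop header and the final output path would be the only two places that need to change.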

This Bash script is provided "as is" without any guarantees or warranties of any kind, expressed or implied. By using this script, you acknowledge that you do so at your own risk.

#!/bin/bash

# Create a temporary directory for separated audio
output_dir=$(mktemp -d /tmp/separated_audio.XXXXXX)

# Iterate over each webm file in the directory
for file in *.webm; do
    # Create a temporary directory
    temp_dir=$(mktemp -d /tmp/spleeter.XXXXXX)

    # Extract the audio to WAV with ffmpeg, split into 540-second segments in the temporary directory
    ffmpeg -i "$file" -f segment -segment_time 540 "$temp_dir/out%03d.wav"

    # Iterate over each WAV file in the temporary directory
    for wav_file in "$temp_dir"/*.wav; do
        # Separate each segment with spleeter; -d sets the maximum duration (in seconds) to process
        spleeter separate -d 7600.0 "$wav_file" -o "$output_dir"
    done

    # Combine all vocals.wav files, in segment order, into one using sox
    sox $(find "$output_dir" -type f -name vocals.wav | sort) "$output_dir/combined_vocals.wav"

    # Apply the compand filter, encode the combined vocals as Opus at 64 kbit/s, and put the result next to the webm file
    ffmpeg -i "$output_dir/combined_vocals.wav" -b:a 64k -filter_complex "compand=attacks=0:points=-80/-900|-45/-15|-27/-9|0/-7|20/-7" "${file%.webm}_processed.opus"

    # Clean up temporary directory
    rm -r "$temp_dir"

    # Remove the directories created by spleeter in output_dir
    rm -r "$output_dir"/*/
    rm "$output_dir/combined_vocals.wav"
done

# Clean up the separated audio directory
rm -r "$output_dir"
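Since the script depends on ffmpeg, spleeter, and sox all being installed, a small pre-flight check can save a failed run halfway through a long stream. The following is a sketch of one way to do that. The helper names `missing_tools` and `check_tools` are made up for this example and are not part of the script above.

```shell
#!/bin/bash

# Hypothetical helper: print each named command that is not on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Hypothetical helper: succeed only when every named command is available.
check_tools() {
    local missing
    missing=$(missing_tools "$@")
    if [ -n "$missing" ]; then
        # Unquoted on purpose so the names print on one line.
        echo "Missing tools:" $missing >&2
        return 1
    fi
}
```

Calling `check_tools ffmpeg spleeter sox || exit 1` near the top of the script would stop it before any temporary directories are created.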

Conclusion

By combining the power of Spleeter and FFmpeg, this script transforms Twitch streams into podcast-like audio files, optimized for an immersive listening experience. Now, I can enjoy the captivating content of my favorite streamers and livers without the distractions of background music or uneven volume levels.