How can I do real-time voice activity detection in Python?


Solution 1

You should try the Python bindings to WebRTC VAD from Google (py-webrtcvad). It's lightweight, fast, and gives very reasonable results based on GMM modelling. Since the decision is provided per frame, latency is minimal.

# Run the VAD on 10 ms of silence. The result should be False.
import webrtcvad
vad = webrtcvad.Vad(2)

sample_rate = 16000
frame_duration = 10  # ms
frame = b'\x00\x00' * int(sample_rate * frame_duration / 1000)
print('Contains speech: %s' % (vad.is_speech(frame, sample_rate)))
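
For a live microphone stream, the same per-frame check can be applied chunk by chunk. Here is a minimal sketch (assuming PyAudio as the input library, 16-bit mono PCM at 16 kHz, and 30 ms frames; the aggressiveness value is illustrative):

import pyaudio
import webrtcvad

RATE = 16000
FRAME_MS = 30
FRAME_SAMPLES = RATE * FRAME_MS // 1000  # 480 samples per 30 ms frame

vad = webrtcvad.Vad(2)  # aggressiveness from 0 (least) to 3 (most)
pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=FRAME_SAMPLES)

try:
    while True:
        frame = stream.read(FRAME_SAMPLES)  # 16-bit mono PCM bytes
        print('speech' if vad.is_speech(frame, RATE) else 'silence')
except KeyboardInterrupt:
    stream.stop_stream()
    stream.close()
    pa.terminate()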

Also, this article might be useful for you.

Solution 2

LibROSA could be one solution to your problem. There is a simple tutorial on Medium that uses microphone streaming to achieve real-time prediction.

Using the Short-Time Fourier Transform (STFT) as the feature extractor, the author explains:

To calculate the STFT, a Fast Fourier Transform window size (n_fft) of 512 is used. According to the equation n_stft = n_fft/2 + 1, 257 frequency bins (n_stft) are calculated over a window size of 512. The window is moved by a hop length of 256 to have better overlapping of the windows when calculating the STFT.

stft = np.abs(librosa.stft(trimmed, n_fft=512, hop_length=256, win_length=512))
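
As a quick sanity check of the bin count described above, you can run the same call on a second of dummy audio (a self-contained snippet; `trimmed` in the line above comes from the tutorial's preprocessing):

import numpy as np
import librosa

y = np.random.randn(22050).astype(np.float32)  # 1 second of dummy audio at 22,050 Hz
S = np.abs(librosa.stft(y, n_fft=512, hop_length=256, win_length=512))
print(S.shape)  # (257, n_frames): n_fft/2 + 1 = 257 frequency bins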

import numpy as np
import matplotlib.pyplot as plt
import librosa
import pyaudio
import noisereduce as nr

# Plot audio with zoomed-in y axis
def plotAudio(output):
    fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(20, 10))
    plt.plot(output, color='blue')
    ax.set_xlim((0, len(output)))
    ax.margins(2, -0.1)
    plt.show()

# Plot audio
def plotAudio2(output):
    fig, ax = plt.subplots(nrows=1,ncols=1, figsize=(20,4))
    plt.plot(output, color='blue')
    ax.set_xlim((0, len(output)))
    plt.show()

def minMaxNormalize(arr):
    mn = np.min(arr)
    mx = np.max(arr)
    return (arr-mn)/(mx-mn)

def predictSound(X):
    # `model` (the trained classifier) and `lb` (its label encoder) are assumed to be
    # defined earlier in the full script linked below.
    clip, index = librosa.effects.trim(X, top_db=20, frame_length=512, hop_length=64)  # Empirically select top_db for every sample
    stfts = np.abs(librosa.stft(clip, n_fft=512, hop_length=256, win_length=512))
    stfts = np.mean(stfts, axis=1)
    stfts = minMaxNormalize(stfts)
    result = model.predict(np.array([stfts]))
    predictions = [np.argmax(y) for y in result]
    print(lb.inverse_transform([predictions[0]])[0])
    plotAudio2(clip)

CHUNKSIZE = 22050 # fixed chunk size
RATE = 22050

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paFloat32, channels=1,
                rate=RATE, input=True, frames_per_buffer=CHUNKSIZE)

# Capture an initial noise window to build a noise profile
# and derive a loudness threshold for the silence/speech decision
data = stream.read(10000)
noise_sample = np.frombuffer(data, dtype=np.float32)
print("Noise Sample")
plotAudio2(noise_sample)
loud_threshold = np.mean(np.abs(noise_sample)) * 10
print("Loud threshold", loud_threshold)
audio_buffer = []
near = 0

while(True):
    # Read chunk and load it into numpy array.
    data = stream.read(CHUNKSIZE)
    current_window = np.frombuffer(data, dtype=np.float32)
    
    # Reduce noise in real time using the captured noise profile
    # (note: this is the noisereduce 1.x API with audio_clip/noise_clip keywords)
    current_window = nr.reduce_noise(audio_clip=current_window, noise_clip=noise_sample, verbose=False)
    
    if len(audio_buffer) == 0:
        audio_buffer = current_window
    else:
        if np.mean(np.abs(current_window)) < loud_threshold:
            print("Inside silence region")
            if near < 10:
                audio_buffer = np.concatenate((audio_buffer, current_window))
                near += 1
            else:
                predictSound(np.array(audio_buffer))
                audio_buffer = []
                near = 0
        else:
            print("Inside loud region")
            near = 0
            audio_buffer = np.concatenate((audio_buffer, current_window))

# close stream
stream.stop_stream()
stream.close()
p.terminate()

Code credit to: Chathuranga Siriwardhana

Full code can be found here.

Solution 3

I think there are two approaches here:

  1. A threshold approach
  2. A small, deployable neural-net approach

The first is fast and feasible, and can be implemented and tested very quickly, while the second is a bit more difficult to implement. I think you are already somewhat familiar with the second option.
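
A minimal sketch of the threshold approach (a hypothetical helper; the 512-sample frame length and the noise-floor estimate are assumptions you should tune on your data):

import numpy as np

def energy_vad(signal, frame_len=512, ratio=3.0):
    # One boolean per frame: True where RMS energy exceeds ratio * estimated noise floor.
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    noise_floor = np.percentile(rms, 10)  # assume the quietest 10% of frames are non-speech
    return rms > ratio * noise_floor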

For the second approach, you will need a speech dataset labeled frame by frame as a binary sequence, like 00000000111111110000000011110000. The neural net should be small and optimized for running on edge devices such as mobile phones.
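
For illustration, such a label string maps to per-frame training targets roughly like this (a hypothetical sketch; the exact framing and features depend on your dataset):

import numpy as np

labels = "00000000111111110000000011110000"
targets = np.array([int(c) for c in labels], dtype=np.float32)  # one 0/1 target per frame
# A small sequence model (e.g. a single GRU layer over per-frame features such as
# log-mel energies) is trained to output one probability per frame, then thresholded
# at inference time.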

You can check this out from TensorFlow

This is a voice activity detector; I think it fits your purpose.

Also, check these out:

https://github.com/eesungkim/Voice_Activity_Detector

https://github.com/pyannote/pyannote-audio

Of course, you should compare the performance of the mentioned toolkits and models, as well as the feasibility of implementing them on mobile devices.

Solution 4

Audio usually has a low bitrate, so I don't see any problem with writing your code completely in NumPy and Python. If you need fast low-level array access, consider numba. Also profile your code, e.g. with line_profiler. And note that scipy.signal exists for more advanced signal processing.

Usually audio processing works in samples. So you define a sample size for your process, and then run a method to decide if that sample contains speech or not.

import numpy as np

def main_loop():
    stream = <create stream with your audio library>
    while True:
        sample = stream.readframes(<define number of samples / time to read>)
        print(is_speech(sample))

def is_speech(sample):
    audio = np.array(sample)

    <do your processing>

    # e.g. simple loudness test
    return np.any(audio > 0.8)

That should get you pretty far.
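
As a concrete (hypothetical) instantiation of that skeleton, using sounddevice as the input library and a simple RMS loudness test (the 0.02 threshold is illustrative and should be tuned to your microphone):

import numpy as np
import sounddevice as sd

RATE = 16000
FRAME = 1600  # 100 ms per decision at 16 kHz

def is_speech(audio, threshold=0.02):
    # Simple loudness test on float32 samples in [-1, 1].
    return np.sqrt(np.mean(audio ** 2)) > threshold

with sd.InputStream(samplerate=RATE, channels=1, dtype='float32') as stream:
    while True:
        frame, overflowed = stream.read(FRAME)  # frame has shape (FRAME, 1)
        print(is_speech(frame[:, 0]))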


Comments

  • Nickil Maveli
    Nickil Maveli about 2 years

    I am performing a voice activity detection on the recorded audio file to detect speech vs non-speech portions in the waveform.

    The output of the classifier looks like (highlighted green regions indicate speech):

    (image: classifier output on the waveform, with speech regions highlighted in green)

    The only issue I face here is making it work for a stream of audio input (e.g. from a microphone) and doing real-time analysis for a stipulated time frame.

    I know PyAudio can be used to record speech from the microphone dynamically, and there are a couple of real-time visualization examples of a waveform, spectrum, spectrogram, etc., but I could not find anything relevant to carrying out feature extraction in a near real-time manner.

  • Nickil Maveli
    Nickil Maveli about 4 years
    Nice find! I completely overlooked the WebRTC module. Although I've got a feasible solution using Auditok, I'll accept yours as it is close to what I am looking for.
  • igrinis
    igrinis about 4 years
    Thank you for Auditok. Wasn't aware of its existence.
  • matanster
    matanster about 4 years
    Thanks for linking to the Google paper. It's quite old now in deep-learning terms, but it's already RNN-based, and knowing this task is called "Voice Activity Detection" (VAD for short) makes it easy to get a grip on the available techniques and/or follow up from that paper to newer ones.
  • matanster
    matanster about 4 years
    I especially liked the < do your processing > bit of this answer ;-)
  • matanster
    matanster about 4 years
    My first attempt at the py-webrtcvad route, using sounddevice as the audio input source. See also stackoverflow.com/questions/51463146/…