Can a computer analyze audio quicker than real time playback?


Solution 1

Yes. Absolutely.

Algorithms can process data as fast as they can read it and push it through the CPU.

If the data is on disk, for example, a modern NVMe drive can read at 5+ GB/s, which is far faster than the bit rates normally used to store voice data. Of course, the actual algorithm being applied can be more or less complex, so we cannot guarantee it will run at the maximum read speed, but there is nothing inherent that limits such analysis to real-time speed.
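As a back-of-the-envelope illustration (a minimal sketch; the 5 GB/s read speed is the figure from the paragraph above, and the 256 kbps bit rate is an assumed typical figure for stored audio):

```python
# Rough comparison: NVMe read throughput vs. the data rate of audio playback.

nvme_throughput = 5e9          # bytes per second (the 5+ GB/s figure above)
audio_bitrate = 256_000 / 8    # assumed 256 kbps audio stream, in bytes per second

speedup = nvme_throughput / audio_bitrate
print(f"Raw reads outpace real-time playback by ~{speedup:,.0f}x")
# -> roughly 156,000x; the analysis algorithm, not the disk, is the bottleneck
```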

The same principle applies to video, but video requires much more throughput due to the huge amount of data in such files. How much obviously depends on resolution, frame rate, and the complexity of the analysis. It is actually difficult to perform sophisticated video analysis in real time because analysis is almost always done on decompressed video, so the processor must decode and analyze quickly enough to keep data flowing: by the time one block is analyzed, the next block of video must already be decoded and in memory. This is something I worked on for almost a decade.
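To illustrate the "keep data flowing" point, here is a minimal producer/consumer sketch (the decode and analysis steps are hypothetical stand-ins; a real pipeline would use hardware decoders and far deeper buffering):

```python
import queue
import threading

buffer = queue.Queue(maxsize=4)  # bounded buffer of decoded video blocks

def decoder():
    """Producer: decodes blocks ahead of the analyzer (stand-in for a real decoder)."""
    for block_id in range(100):
        decoded = f"frame-data-{block_id}"  # pretend this is decompressed video
        buffer.put(decoded)                 # blocks if the analyzer falls behind
    buffer.put(None)                        # end-of-stream marker

def analyzer():
    """Consumer: analyzes whichever block is already decoded and waiting in memory."""
    while (block := buffer.get()) is not None:
        pass  # real analysis (motion detection, scene classification, ...) goes here

threading.Thread(target=decoder).start()
analyzer()
```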

When you play back video faster, the words are unclear to you, but the data is exactly the same. The speed at which audio is processed does not affect the algorithm's ability to understand it: the software knows exactly how much time each audio sample represents.
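A quick way to see this for yourself (a sketch using NumPy; the "analysis" here is just a trivial loudness computation, but note that the wall-clock time bears no relation to the 60 seconds of audio):

```python
import time
import numpy as np

sample_rate = 44_100
duration = 60                                  # seconds of audio
t = np.arange(duration * sample_rate) / sample_rate
audio = np.sin(2 * np.pi * 440 * t)            # one minute of a 440 Hz tone

start = time.perf_counter()
rms = np.sqrt(np.mean(audio ** 2))             # a trivial "analysis"
elapsed = time.perf_counter() - start

print(f"Analyzed {duration} s of audio in {elapsed * 1000:.1f} ms (RMS = {rms:.3f})")
```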

Solution 2

I'd go a bit further than the current answer and contradict the idea that the computer is somehow "playing back" the files at all. That would imply that processing is necessarily a strictly sequential process, starting from the beginning of the file and working towards the end.

In reality, most audio processing algorithms will be somewhat sequential - after all, that's how sound files are meant to be interpreted when playing them for human consumption. But other methods are conceivable. For example, say you want to write a program that determines the average loudness of a sound file. You could go through the whole file and measure the loudness of each snippet; but it would also be a valid (although maybe less accurate) strategy to just sample some snippets at random and measure those. Note that now the file isn't "played back" at all; the algorithm simply looks at some data points that it chose itself, and it is free to do so in any order it likes.
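A sketch of both strategies (NumPy; synthetic noise stands in for a decoded sound file, and the sampled estimate touches only a fraction of the data, in no particular order):

```python
import numpy as np

rng = np.random.default_rng(0)
audio = rng.normal(scale=0.1, size=10 * 44_100)  # stand-in for a decoded sound file

# Strategy 1: measure every sample (the sequential whole-file pass).
exact_rms = np.sqrt(np.mean(audio ** 2))

# Strategy 2: measure a few randomly chosen snippets, in arbitrary order.
starts = rng.integers(0, len(audio) - 1024, size=20)
snippets = np.concatenate([audio[s:s + 1024] for s in starts])
approx_rms = np.sqrt(np.mean(snippets ** 2))

print(f"exact RMS {exact_rms:.4f} vs sampled estimate {approx_rms:.4f}")
```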

This means that "playing back" the file isn't really the right term here at all - even when the processing does happen sequentially, the computer isn't "listening" to sounds, it is simply processing a dataset (an audio file is really nothing more than a list of recorded air pressure values over time). Maybe the better analogy isn't a human listening to the audio, but a human analyzing it by looking at the waveform of the audio file:

[Image: sample audio waveform]

In this case, you aren't constrained by the actual time scale of the audio at all, but can look at whatever part of the waveform you want, for however long you want (and if you are fast enough, you can indeed "read" the waveform in less time than playing the original audio would take). Of course, if it's a very long waveform printout, you might still have to "walk" for a bit to reach the section you are interested in (or, if you are a computer, seek to the right position on the hard drive). But the speed at which you walk or read isn't intrinsically linked to the (imaginary) time labels on the x-axis, i.e. the audio's "real time".
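In code, that "walk to the interesting section" is just a seek. A minimal sketch using Python's standard wave module (the file name is hypothetical):

```python
import wave

with wave.open("example.wav", "rb") as wav:  # hypothetical file
    rate = wav.getframerate()
    wav.setpos(30 * rate)                    # jump straight to t = 30 s
    chunk = wav.readframes(rate)             # read exactly one second of samples

# Nothing before the 30-second mark was ever "played", or even read from disk.
print(f"Read {len(chunk)} bytes from the middle of the file")
```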

Solution 3

Your core question is this:

“Can a computer analyze audio quicker than real time playback?”

There are other great answers here, but here is what I consider to be a very commonplace real-world example of computers analyzing audio faster than real-time playback:

Converting an audio CD to MP3 files on a modern computer system is always faster than real-time playback of the audio on that CD.

It all depends on the speed of your system and hardware, but even 20-ish years ago converting a CD to MP3 files was always faster than real-time playback of the CD audio.

So, for example, how can a 45-minute audio CD be converted to MP3 in less than 45 minutes? How could that occur if the computer were constrained by audio playback limits? On the data side it's all just data; only playback is constrained to human rates.

Think about it: the computer is reading the raw audio data from the CD faster than normal playback speed and running an algorithm against it to convert the raw audio into a compressed audio format.
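Some quick arithmetic makes the point (CD audio's data rate is fixed by the format: 44,100 samples per second, 2 channels, 16 bits per sample; the 40x drive speed is an assumed typical figure):

```python
cd_data_rate = 44_100 * 2 * 2       # CD-DA: 176,400 bytes per second at 1x playback

drive_speed = 40                    # an assumed typical "40x" CD drive
album_seconds = 45 * 60             # the 45-minute CD from the example above

read_time = album_seconds / drive_speed
print(f"45 min of audio read in ~{read_time / 60:.1f} min at {drive_speed}x")
# -> about 1.1 minutes of reading; MP3 encoding adds some CPU time on top
```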

And when it comes to transcribing text from audio, it's a similar digital analysis process, just with different output. It is a far more complex process than merely transcoding audio from one format to another, but it is still digital analysis.


PS: To the seemingly endless stream of commenters who want to point out that pre-1995 PCs could not encode MP3s faster than real time… Yes, I know… That is why I qualified what I posted by saying "…on a modern computer system…" as well as "…but even 20-ish years ago…".

The first MP3 encoder came out on July 7, 1994, and the .mp3 extension was formally chosen on July 14, 1995. The point of this answer is to explain, at a very high level, that on modern PCs analyzing audio quicker than real-time playback already happens in a way we all use: converting an audio CD to MP3 files.

Solution 4

Computers don't experience audio the same way we do.

Recordings played at higher speed become incomprehensible to humans because we're receiving more data than we can process. Our bodies and brains have limits, and once the "data rate" is exceeded even slightly, we start picking up only parts of what is being said. Turn the speed up further and it becomes gibberish.

A computer doesn't experience this phenomenon because its perception of time in the recording isn't based on the actual time that has passed, but on the amount of data processed. A computer will never read data from disk faster than it can process it¹, so it's never overloaded. The data rate always matches the processing speed perfectly.


¹ Unless it's told to do so by a buggy program, but that caveat applies to every question about computers.
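That "pull" model is visible in even the simplest processing loop (a sketch; the file name is hypothetical). The program asks for the next chunk only after it has finished with the previous one, so the read rate automatically matches the processing rate:

```python
def process(chunk: bytes) -> None:
    pass  # any analysis at all; its speed sets the pace

with open("recording.wav", "rb") as f:  # hypothetical file
    while chunk := f.read(4096):
        # The next read happens only after process() returns, so the
        # program can never be fed data faster than it can handle it.
        process(chunk)
```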

Solution 5

In spite of the good answers here, I'm going to have to go for a solid

It depends

Some algorithms depend on brute-force processing power. The more processing power you have, the more processing (or the more accurate processing) you can do. We're at the point now where most audio processing is no longer resource-limited. Video processing still is, though, as the continually advancing state of the art in gaming shows.

After that, though, the issue you have with real-time processing is latency - in this case, the delay between you saying something and the computer putting the text up. All processing algorithms have some delay, but anything based on Fourier transforms is especially limited by this. By a mathematical theorem (the time-frequency uncertainty principle), the lower the frequency you want to recognize, the more data you need to spot it, and hence the longer the delay before the computer can give you a result. So you do hit a point where it doesn't matter how fast you can do the maths: you're always at least that far behind.
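A small demonstration of that theorem (a NumPy sketch): an FFT over a window of T seconds can only resolve frequencies about 1/T Hz apart, so separating two tones 5 Hz apart needs at least 200 ms of audio, and therefore at least 200 ms of latency, no matter how fast the CPU is:

```python
import numpy as np

fs = 8000                             # sample rate in Hz
make = lambda t: np.sin(2 * np.pi * 50 * t) + np.sin(2 * np.pi * 55 * t)

for window in (0.05, 0.4):            # 50 ms vs 400 ms of data
    t = np.arange(int(fs * window)) / fs
    spectrum = np.abs(np.fft.rfft(make(t)))
    freqs = np.fft.rfftfreq(len(t), 1 / fs)
    peaks = freqs[spectrum > spectrum.max() * 0.5]
    print(f"{window * 1000:.0f} ms window: {1 / window:.1f} Hz resolution, "
          f"strong energy at {peaks.min():.0f}-{peaks.max():.0f} Hz")
# Only the longer window resolves 50 Hz and 55 Hz as separate peaks.
```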

The challenge for real-time processing is to find a sweet spot where the processing is reasonably effective and the delay is relatively imperceptible to users. This is always a trade-off between lower delay and higher-quality results, and the optimal algorithm can be as much a matter of personal taste as anything else.

And in the extreme case, some algorithms simply cannot be run in real time. Some very effective filtering algorithms require the data to be run backwards, for example. These can give very good results when post-processing recorded data, but are of course impossible to run on live, real-time data.
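Zero-phase filtering is the classic example: it runs a filter forward and then backward over the data. A minimal SciPy sketch (filtfilt needs the entire recording up front, which is exactly why it cannot run on live data):

```python
import numpy as np
from scipy.signal import butter, lfilter, filtfilt

fs = 1000
t = np.arange(2 * fs) / fs
noisy = np.sin(2 * np.pi * 5 * t) + 0.5 * np.random.randn(t.size)

b, a = butter(4, 20, btype="low", fs=fs)  # 4th-order low-pass at 20 Hz

causal = lfilter(b, a, noisy)       # forward only: works on live data, adds delay
zero_phase = filtfilt(b, a, noisy)  # forward *and* backward: no phase delay,
                                    # but requires the whole signal in advance
```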



Comments

  • Dave
    Dave almost 2 years

    So let’s say that your computer is transcribing audio (of someone speaking) to text. Because it’s looking at the digital values of the audio, does it “render” the transcription quicker than the time it takes to play it in real time? I would imagine that it is not “listening” like a human would, rather it processes it digitally. Am I right in this assumption?

    The same question would apply to analyzing video.

    My confusion is: When playing audio back at a faster rate, the words become unclear, so how does the computer compensate for that? Excuse me if I am missing something basic and fundamental here.


    Edit: When I use the term “real time” in this question, I don't mean at the time of recording, and then transcribing in real time. Rather, I mean playback at 1x speed (or real time playback speed). It seems some people didn't catch what I meant.

    • gronostaj
      gronostaj over 3 years
      "the words become unclear" - that's because it's too fast for you. My blind friend on the other hand is complaining that he's set his screen reader to the max speech speed and he would like it a bit faster, while I already can't understand a word. The computer is even faster and has the advantage of always reading at the optimal speed.
    • Dave
      Dave over 3 years
      @gronostaj I am asking: let's say the computer can replay at 200x speed (at which you couldn't hear the words under any circumstances...) - can the computer still analyze the words digitally, or does there need to be an element of "real time" speech analysis? (I guess I need to understand how it transcribes words digitally...)
    • gronostaj
      gronostaj over 3 years
      I'm addressing exactly that in the comment above.
    • Mast
      Mast over 3 years
      For a computer, there's no such thing as real-time if the data doesn't have to be sent to speakers as sound. There's only computer-time, determined by the clock (the internal clock, not the one on your wall).
    • Askar Kalykov
      Askar Kalykov over 3 years
      For a computer, there is not even such a thing as time, much less "real" time as in "a human's perception of time". For a computer, there are only cycles.
    • Peter - Reinstate Monica
      Peter - Reinstate Monica over 3 years
      How do you think a computer "perceives" time? Do you have a general idea how a modern digital computer works? If so, why can you not answer your own question?
    • Kaithar
      Kaithar over 3 years
      To simplify: If you drive down a road at twice the speed that I did, you will get to the end quicker but we will both have experienced the same section of road. Going faster doesn't skip or change parts of the road, instead it changes how long you spend on each part. The computer doesn't have to slow down for the speed bumps of the speaker and our ears.
    • wizzwizz4
      wizzwizz4 over 3 years
      There is a “too fast for the computer”, but if the computer's working through the file “as fast as possible”, it's not going to go “too fast”. If it's using a slower-than-realtime algorithm and you expect it to process live data, then it might miss things, but that's a special-case (live processing is a different kind of processing to offline processing). The computer is basically reading the audio, not listening to it.
    • AaronD
      AaronD over 3 years
      DSP functions have no perception of time at all. They don't count seconds, they count samples. Given that, they can process a chunk of data in as many or as few seconds as you like, with no change in outcome. (Also, if you change the sample rate, all the frequency-based math needs to change with it, to keep the frequencies the same.)
    • eps
      eps over 3 years
      This really isn't any different from a computer being able to search for text in a document faster than you can read it.
    • Dave
      Dave over 3 years
      @eps searching for text is an exact match of two values... Audio transcription is (I would assume) much more complex with pattern matching etc...
    • somebody
      somebody over 3 years
      @Dave The point is, it doesn't matter if you are reading the file at some speed or at twice that speed; the data is exactly the same to the computer. Reading it faster does not change the data, so for the computer there is zero difference. It may be "more complex", but speed doesn't change how complex it is. If your computer is powerful enough to do it faster, it's powerful enough to do it faster; that is all.
  • Itai
    Itai over 3 years
    Fair example, but most real-time processing is done sequentially because there is no knowledge of how much there is to process. A device has to listen continuously to determine when certain words appear, "Hey Google" for example, which then activates the device; it does not have the luxury of getting the whole audio and doing the analysis later. It's the difference between transcribing a live newscast vs a pre-recorded show.
  • Dave
    Dave over 3 years
    very interesting... but I can't think of why you wouldn't go in order of the track when transcribing the entire thing to text, which was my initial question.
  • ManfP
    ManfP over 3 years
    @Dave The point is not that sequential processing is unusual (you're right that speech recognition would probably be a mostly sequential process) but the point is that software just isn't limited to those constraints, so the audio being "played back" is a faulty model of thinking about it. It's all just a list of data points; if it wants to read them out-of-order, it can; if it wants to go faster than real-time, it can as well.
  • Dave
    Dave over 3 years
    @ManfP I really like that line "contradict the idea that the computer is somehow "playing back" the files at all"... Does the computer transcribe text in any way similar to how humans do it? Are there any analog similarities? Or is it fully digital recognition of patterns?
  • ManfP
    ManfP over 3 years
    @Dave it's really just some form of pattern recognition, though with lots of clever math tricks. Probably the first step would be to chop the audio into very short (a few milliseconds long) windows and perform what's called a Fourier Transform on those. The result contains information about the frequencies, a little similar to the different hairs in our ears reacting to different tones, but it's still a purely digital and mathematical process; then those can be further analyzed for vocal patterns (though I don't know much about the specifics of how speech recognition works).
  • Michael
    Michael over 3 years
    “When you play back video faster, the words are unclear to you, but the data is exactly the same.” Changing playback speed does alter the pitch of the audio. If you increase the playback speed, the frequencies increase, which is especially noticeable for voices (this is called Munchkinization).
  • gronostaj
    gronostaj over 3 years
    @Michael The pitch increases for you because the recording is sped up, but you're processing it as if it was played at a normal speed. This mismatch makes the sine wave perceptibly denser for you. A computer doesn't experience this phenomenon because its perception of time in the recording isn't based on actual time that has passed, but on the amount of data processed. In other words audio processing algorithms don't look at the wall clock, but rather at the in-file clock - which always matches the speed at which file is read perfectly, so there's no mismatch and no pitch change.
  • sawdust
    sawdust over 3 years
    "In other words audio processing algorithms don't look at the wall clock, but rather at the in-file clock" -- The technical term for this "in-file clock" is the sampling rate.
  • jamesqf
    jamesqf over 3 years
    Not only could you sample snippets, you could use massively parallel processors to process the input in chunks. Probably more applicable to video - GPUs are essentially doing the reverse operation.
  • JCRM
    JCRM over 3 years
    While true for analog media, that's not always the case, @Michael. Of course, keeping the pitch the same while increasing playback speed leads to loss of information.
  • Mast
    Mast over 3 years
    Why stop at NVMe speed? Process from DDR4 RAM.
  • NPSF3000
    NPSF3000 over 3 years
    The title is "Yes. Absolutely."... but the answer seems to be "It Depends". Seems a bit weird to bring up NVME speeds... when A) the data might not be on NVME and B) the bottleneck is more likely to computation required to execute the algorithm.
  • Itai
    Itai over 3 years
    @Mast NVMe was just an example, because most analyses are done on files, which then go through memory. When the data does not come from files, it usually comes in through a slower channel like the network (again, it depends; I've worked on systems with 8 fibre-channel loops to bring data in as fast as the CPU FSB supported, at the time, at least).
  • Itai
    Itai over 3 years
    @NPSF3000 - The point is that processing can be done faster than real time, which is what the question asked, but not all processing can be done at that speed. If the question were at which speed processing can be done, it would probably require a 50-page research paper!
  • NPSF3000
    NPSF3000 over 3 years
    "The point is that processing can be done faster than real-time which is what the question asked," Then provide proof of this. For example, the question asks 'does it “render” the transcription quicker than the time it takes to play it in real time?"' Yet the answer lacks any evidence that an algorithm exists that can transcribe faster than real-time on a particular system. This answer seems to rely on the notion that just because computers operate at a high frequency, they can do everything fast. This notion is flawed.
  • Dave
    Dave over 3 years
    I am not asking about real-time in respect to "how fast can the computer react to audio and quickly decode it". I am asking "if you have an audio file that is 2 minutes long, and the computer transcribes it in 15 seconds (not real-time playback, meaning not 1x speed, but much faster), how did the computer decode the audio so quickly, if the words aren't clear at that speed?"
  • Chris H
    Chris H over 3 years
    You could chunk the input in a simple but less parallel way than @jamesqf suggests - split on silence, for example, so you've got a lot of fragments. The speech-to-text aspect isn't simply processing the signal - it will use probabilistic ways to generate the text, both likelihood of words occurring and of sequences - "the independent" is far more likely than "thee in dependent" despite sounding almost identical
  • Mark Morgan Lloyd
    Mark Morgan Lloyd over 3 years
    @NPSF3000 By analogy, you can read a book in a week that took a lifetime to write. All you need is the sequence of symbols that make up the text. Now if you have uncompressed audio or video you've got something precisely comparable: a sequence of symbols that make up the signal, and that can be processed at any speed.
  • NPSF3000
    NPSF3000 over 3 years
    @MarkMorganLloyd By analogy, you could spend a lifetime reading a complex mathematical book... and not understand it. Fermat's Last Theorem probably took only a few seconds to write... yet took hundreds of years to solve. This answer's "Yes. Absolutely." is entirely unwarranted.
  • Mark Morgan Lloyd
    Mark Morgan Lloyd over 3 years
    @NPSF3000 comprehension has nothing at all to do with it, since that implies intelligence. The important thing is being able to process the symbols in sequence, and that can be done irrespective of speed.
  • NPSF3000
    NPSF3000 over 3 years
    @MarkMorganLloyd I'm sorry but that's nonsense. The question provides an example: "transcribing audio ... to text." It's absolutely important to know how quickly (if at all) the computer can perform this operation. Simply reading symbols does not perform this important transformation. Just because a computer can shuffle a few bytes around, does not mean that the computer can process that data to achieve the desired result in real-time.
  • Mark Morgan Lloyd
    Mark Morgan Lloyd over 3 years
    @NPSF3000 You are talking absolute rubbish, and quite clearly have no idea what a symbol is in communications theory.
  • NPSF3000
    NPSF3000 over 3 years
    @MarkMorganLloyd Feel free to specify what you mean by 'symbol'. What I know is that just because you have a bunch of data, and the ability to move that data around, does not mean that every algorithm can process that data in real-time or faster than real-time. To suggest otherwise, would imply a machine with unlimited computational capability.
  • Mark Morgan Lloyd
    Mark Morgan Lloyd over 3 years
    I'm not getting drawn any further into this, my apologies to the community for the "sound and fury" which has already transpired. OP's question has resulted in an answer from Itai which has been accepted, one user appears to have problems with that, and that basically is the end of the story.
  • Dalen
    Dalen over 3 years
    Note: voice-command activation is usually not the same as voice recognition. It works more in the way dogs perceive word commands. What I mean to say is, real speech-to-text is a complex process, while the activation method is simple pattern recognition, usually driven by hardware acceleration like FFT integrated circuits. That's why the probability of somebody else activating Siri on your device using "Hey Siri" is slim, unless the person is your twin or an unusually good imitator. Of course, the system may prepare patterns in advance so that matching is speaker-independent.
  • Dalen
    Dalen over 3 years
    Also, command activation is not a real-time process; it has a windowed delay in order to catch the whole pattern before pushing the audio through the FFT and pattern matching.
  • Dalen
    Dalen over 3 years
    You usually use a 20 ms delay with overlapping windows of the length most likely to contain the pattern, e.g. 0.7 seconds; you run a short-time FFT on chunks of the recorded portion, usually 20 ms, to build a frequency envelope which can then be filtered further and matched against saved patterns. You can also use e.g. LPC instead of the FFT, or a cosine transform, or... So using "hey <digital assistant>" is not the right example for the question. And, yeah, speech recognition can use the same fundamentals, but it can also be just an extremely well designed and trained neural network that accepts time-domain input.