Audio streaming over websockets


I think web sockets are appropriate here. Just make sure that you are using binary transfer. (I use BinaryJS for this myself, allowing me to open up arbitrary streams to the server.)

Getting the data from user media capture is pretty straightforward. What you have is a good start. The tricky part is playback. You will have to buffer the data and play it back using your own script processing node.

This isn't too hard if you use PCM everywhere... the raw samples you get from the Web Audio API. The downside of this is that there is a lot of overhead shoving 32-bit floating point PCM around. This uses a ton of bandwidth which isn't needed for speech alone.
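To put rough numbers on that overhead (assuming a nominal 44.1 kHz mono stream; your AudioContext's actual sample rate may differ):

```javascript
// Back-of-the-envelope bandwidth for raw PCM, per speaking client.
// Assumes a nominal 44.1 kHz mono stream; real contexts typically run at 44.1 or 48 kHz.
const sampleRate = 44100;      // samples per second
const bytesPerFloat32 = 4;     // 32-bit float PCM
const bytesPerInt8 = 1;        // 8-bit PCM

const float32BytesPerSec = sampleRate * bytesPerFloat32; // 176,400 B/s (~172 KiB/s)
const int8BytesPerSec = sampleRate * bytesPerInt8;       // 44,100 B/s (~43 KiB/s)

console.log(float32BytesPerSec, int8BytesPerSec);
```

So even before any codec, just dropping to 8-bit cuts the raw rate by 4x.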

I think the easiest thing to do in your case is to reduce the bit depth to whatever works well for your application. 8-bit samples are plenty for discernible speech and will take up quite a bit less bandwidth. By sticking with PCM, you avoid having to implement a codec in JS and then having to deal with the buffering and framing of data for that codec.

To summarize: once you have the raw sample data in a typed array in your script processing node, write something to convert those samples from 32-bit float to 8-bit signed integers. Send these buffers to your server, in the same-size chunks they arrive in, over your binary web socket. The server will then send them to all the other clients on their binary web sockets. When a client receives audio data, it will buffer it for however long you choose, to prevent dropping audio. Your client code will convert those 8-bit samples back to 32-bit float and put them in a playback buffer. Your script processing node will pick up whatever is in the buffer and start playback as data is available.
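The conversion step above can be sketched like this (function names are my own for illustration, not from any library; the Web Audio plumbing around them is omitted):

```javascript
// Convert 32-bit float PCM (range [-1, 1]) to 8-bit signed integers for transmission.
function floatTo8BitPCM(float32Samples) {
  const out = new Int8Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp to [-1, 1] before scaling so hot signals don't overflow the int8 range.
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    out[i] = Math.round(s * 127);
  }
  return out;
}

// Convert received 8-bit samples back to 32-bit float for the playback buffer.
function int8ToFloatPCM(int8Samples) {
  const out = new Float32Array(int8Samples.length);
  for (let i = 0; i < int8Samples.length; i++) {
    out[i] = int8Samples[i] / 127;
  }
  return out;
}
```

In your onaudioprocess handler, you would run the channel data through floatTo8BitPCM and send the resulting buffer over the binary socket; the receiving side runs int8ToFloatPCM and appends the result to the playback buffer its script processor drains.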

Author: Oskar Kamiński

Updated on June 04, 2022

Comments

  • Oskar Kamiński almost 2 years

    I'm going to create a voice chat. My backend server runs on Node.js and almost every connection between client and server uses socket.io.

    Are websockets appropriate for my use case? I prefer client -> server -> clients communication over P2P because I expect up to 1000 clients connected to one room.

    If websockets are OK, which method is best to send an AudioBuffer to the server and play it back on the other clients? I do it like this:

    navigator.getUserMedia({audio: true}, initializeRecorder, errorCallback);
    function initializeRecorder(MediaStream) {
        var audioCtx = new window.AudioContext();
        var sourceNode = audioCtx.createMediaStreamSource(MediaStream);
    
        var recorder = audioCtx.createScriptProcessor(4096, 1, 1);
        recorder.onaudioprocess = recorderProcess;
    
        sourceNode.connect(recorder);
    
        recorder.connect(audioCtx.destination);
    }
    function recorderProcess(e) {
        var left = e.inputBuffer.getChannelData(0);
    
        io.socket.post('url', left);
    }
    

    But after the other clients receive the data, I don't know how to play back this audio stream from the buffer arrays.

    EDIT

    1) Why isn't the onaudioprocess method fired if I don't connect the ScriptProcessor (the recorder variable) to the destination?

    Documentation info - "although you don't have to provide a destination if you, say, just want to visualise some audio data" - Web Audio concepts and usage

    2) Why don't I hear anything from my speakers after connecting the recorder variable to the destination, while I do if I connect the sourceNode variable directly to the destination? Even when the onaudioprocess method doesn't do anything.

    Can anyone help?

    • Brad almost 9 years
      "I expect even 1000 clients connected to one room." Voice chat with more than 3 people is difficult. Not only is there the practical concern of not being able to understand 1,000 people speaking all at once, but that would take a ton of bandwidth. Are you absolutely sure a "voice chat" application is what you're trying to build? Are you sure you're not trying to build something where only a few people are talking and most are listening?
    • Oskar Kamiński almost 9 years
      Like I said, 1000 clients connected to the room. Only one can speak at the same time. I agree that even 3 people speaking at once would be a mess already ;)
    • Brad almost 9 years
      I'm still not clear on what it is you're trying to build... these details are important in picking the right solution. Only one can speak at a time? Or one throughout the lifetime of the session? And, what are your latency requirements?
    • Oskar Kamiński almost 9 years
      chatonic.com - I want to create a solution like this. But they use Flash on the client side. I would like to use just JavaScript.
    • Brad almost 9 years
      I don't know what Chatonic is and I can't try it without creating an account. Can you just describe in more detail what it is specifically that you are trying to do?
    • Oskar Kamiński almost 9 years
      There is a text and voice chat. On the text chat everybody can speak at the same time. To speak on the voice chat you have to wait until everybody in the queue finishes. Then you have, for example, 30 seconds to speak and everybody in the room is listening to you.
  • Oskar Kamiński almost 9 years
    Thanks for your answer. I will try it, but first I would like to resolve 2 problems that occur in this code. Could you check them? I added the info in the edit section.
  • kiwicomb123 almost 6 years
    I am working on an online teaching platform that needs two-way voice. Websockets seem the way to go. Do you know of any good tutorials that might help piece together the solution Brad is describing? The system uses PHP, and Node.js for the real-time stuff.
  • Brad almost 6 years
    @kiwicomb123 If you're doing voice chat where latency matters, consider using WebRTC instead.
  • Dan Mills about 5 years
    Note that there is the usual network audio elephant in the room here: sample clock synchronisation! Every one of those clients will have a slightly different idea about exactly what sample rate corresponds to whatever nominal rate you are using, so some resampling based on a rate estimator is in order. Professionally we just lock the audio interface to PTP and call it good, but I suspect that the AES67 approach will not work for you.