Compute PTS and DTS correctly to sync audio and video (ffmpeg, C++)

Timestamps (such as dts) should be in AVStream.time_base units. You're requesting a video timebase of 1/90000 and the default audio timebase (1/9000), but you're using a timebase of 1/100000 to write your dts values. I'm also not sure it's guaranteed that the requested timebases are kept during header writing; the muxer might change them and expect you to deal with the new values.
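
For example, after writing the header you can read back whatever timebase the muxer actually kept (a minimal sketch; the Log helper is this question's own and assumed to be printf-style):

// The muxer is free to adjust the requested timebases here.
avformat_write_header(m_pFormatCtx, NULL);
// Read back the values it actually settled on and use those everywhere.
Log("video time_base: %d/%d", m_pVideoStream->time_base.num, m_pVideoStream->time_base.den);
Log("audio time_base: %d/%d", m_pAudioStream->time_base.num, m_pAudioStream->time_base.den);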

So code like this:

int64_t dts = av_gettime();
dts = av_rescale_q(dts, (AVRational){1, 1000000}, (AVRational){1, 90000});
int duration = AUDIO_STREAM_DURATION; // 20
if(m_prevAudioDts > 0LL) {
    duration = dts - m_prevAudioDts;
}

Won't work. Change it to something that uses the audio stream's timebase, and don't set the duration unless you know what you're doing. (The same goes for video.)
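
A minimal sketch of that rescale (assuming avformat_write_header() has already run, so time_base holds the muxer's final value):

// av_gettime() returns wall-clock microseconds (a 1/1000000 timebase).
int64_t dts = av_gettime();
// Rescale into the timebase the stream actually has, instead of
// hardcoding 1/90000 (or 1/100000).
dts = av_rescale_q(dts, (AVRational){1, 1000000}, m_pAudioStream->time_base);

Your code then continues: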

m_prevAudioDts = dts;
pkt.pts = AV_NOPTS_VALUE;
pkt.dts = m_currAudioDts;
m_currAudioDts += duration;
pkt.duration = duration;

This looks creepy, especially combined with the analogous video code. The problem is that the first packet of each stream gets a timestamp of zero regardless of the inter-packet delay between the streams. You need one parent currDts shared between all streams, otherwise your streams will be perpetually out of sync.
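
One way to get that shared clock is a single start time captured before the first packet of either stream (a sketch; m_startTime is a hypothetical member initialized to -1, and pStream is whichever stream the packet belongs to):

// Captured once, for *all* streams, so inter-stream offsets survive.
if (m_startTime < 0) {
    m_startTime = av_gettime();
}
int64_t elapsed = av_gettime() - m_startTime; // microseconds since first packet
pkt.dts = av_rescale_q(elapsed, (AVRational){1, 1000000}, pStream->time_base);
pkt.pts = pkt.dts; // fine as long as there are no B-frames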

[edit]

So, regarding your edit, if you have audio gaps, I think you need to insert silence (zeroed audio sample data) for the duration of the gap.
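
A rough sketch of such a gap filler for raw G.711 (gapBytes is hypothetical, the number of missing samples at one byte per sample; note EDIT 4.0 below, where it turns out the µ-law silence byte is 0xff, not 0x00):

// gapBytes: hypothetical count of missing samples (1 byte/sample in G.711).
uint8_t *pSilence = (uint8_t*) av_malloc(gapBytes);
memset(pSilence, 0xff, gapBytes); // 0xff is silence for mu-law (see EDIT 4.0)

AVPacket pkt = {0};
av_init_packet(&pkt);
pkt.data = pSilence;
pkt.size = gapBytes;
pkt.stream_index = m_pAudioStream->index;
pkt.flags |= AV_PKT_FLAG_KEY;
// pts/dts: same scheme as a normal audio packet, covering the gap.
av_interleaved_write_frame(m_pFormatCtx, &pkt);
av_freep(&pSilence);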

Comments

  • Kaidul
    Kaidul almost 2 years

    I am trying to mux H.264-encoded data and G.711 PCM data into a MOV multimedia container. I am creating an AVPacket from the encoded data, and initially the PTS and DTS values of the video/audio frames are AV_NOPTS_VALUE, so I calculated the DTS from the current time. My code -

    bool AudioVideoRecorder::WriteVideo(const unsigned char *pData, size_t iDataSize, bool const bIFrame) {
        .....................................
        .....................................
        .....................................
        AVPacket pkt = {0};
        av_init_packet(&pkt);
    int64_t dts = av_gettime(); // wall-clock time in microseconds
    dts = av_rescale_q(dts, (AVRational){1, 1000000}, m_pVideoStream->time_base); // microseconds -> video stream timebase
        int duration = 90000 / VIDEO_FRAME_RATE;
        if(m_prevVideoDts > 0LL) {
            duration = dts - m_prevVideoDts;
        }
        m_prevVideoDts = dts;
    
        pkt.pts = AV_NOPTS_VALUE;
        pkt.dts = m_currVideoDts;
        m_currVideoDts += duration;
        pkt.duration = duration;
        if(bIFrame) {
            pkt.flags |= AV_PKT_FLAG_KEY;
        }
        pkt.stream_index = m_pVideoStream->index;
        pkt.data = (uint8_t*) pData;
        pkt.size = iDataSize;
    
        int ret = av_interleaved_write_frame(m_pFormatCtx, &pkt);
    
        if(ret < 0) {
            LogErr("Writing video frame failed.");
            return false;
        }
    
        Log("Writing video frame done.");
    
        av_free_packet(&pkt);
        return true;
    }
    
    bool AudioVideoRecorder::WriteAudio(const unsigned char *pEncodedData, size_t iDataSize) {
        .................................
        .................................
        .................................
        AVPacket pkt = {0};
        av_init_packet(&pkt);
    
    int64_t dts = av_gettime(); // wall-clock time in microseconds
    dts = av_rescale_q(dts, (AVRational){1, 1000000}, (AVRational){1, 90000}); // hardcoded 1/90000 rather than the audio stream's timebase
        int duration = AUDIO_STREAM_DURATION; // 20
        if(m_prevAudioDts > 0LL) {
            duration = dts - m_prevAudioDts;
        }
        m_prevAudioDts = dts;
        pkt.pts = AV_NOPTS_VALUE;
        pkt.dts = m_currAudioDts;
        m_currAudioDts += duration;
        pkt.duration = duration;
    
        pkt.stream_index = m_pAudioStream->index;
        pkt.flags |= AV_PKT_FLAG_KEY;
        pkt.data = (uint8_t*) pEncodedData;
        pkt.size = iDataSize;
    
        int ret = av_interleaved_write_frame(m_pFormatCtx, &pkt);
        if(ret < 0) {
            LogErr("Writing audio frame failed: %d", ret);
            return false;
        }
    
        Log("Writing audio frame done.");
    
        av_free_packet(&pkt);
        return true;
    }
    

    And I added stream like this -

    AVStream* AudioVideoRecorder::AddMediaStream(enum AVCodecID codecID) {
        ................................
        .................................   
        pStream = avformat_new_stream(m_pFormatCtx, codec);
        if (!pStream) {
            LogErr("Could not allocate stream.");
            return NULL;
        }
        pStream->id = m_pFormatCtx->nb_streams - 1;
        pCodecCtx = pStream->codec;
        pCodecCtx->codec_id = codecID;
    
        switch(codec->type) {
        case AVMEDIA_TYPE_VIDEO:
            pCodecCtx->bit_rate = VIDEO_BIT_RATE;
            pCodecCtx->width = PICTURE_WIDTH;
            pCodecCtx->height = PICTURE_HEIGHT;
            pStream->time_base = (AVRational){1, 90000};
            pStream->avg_frame_rate = (AVRational){90000, 1};
            pStream->r_frame_rate = (AVRational){90000, 1}; // though the frame rate is variable and around 15 fps
            pCodecCtx->pix_fmt = STREAM_PIX_FMT;
            m_pVideoStream = pStream;
            break;
    
        case AVMEDIA_TYPE_AUDIO:
            pCodecCtx->sample_fmt = AV_SAMPLE_FMT_S16;
            pCodecCtx->bit_rate = AUDIO_BIT_RATE;
            pCodecCtx->sample_rate = AUDIO_SAMPLE_RATE;
            pCodecCtx->channels = 1;
            m_pAudioStream = pStream;
            break;
    
        default:
            break;
        }
    
        /* Some formats want stream headers to be separate. */
        if (m_pOutputFmt->flags & AVFMT_GLOBALHEADER)
        pCodecCtx->flags |= CODEC_FLAG_GLOBAL_HEADER; // set on the codec context, not the format context
    
        return pStream;
    }
    

    There are several problems with this calculation:

    1. The video is laggy and falls increasingly behind the audio over time.

    2. Suppose an audio frame arrives (WriteAudio(..)) a little late, say by 3 seconds. The late frame should then start playing with a 3-second delay, but it doesn't; the delayed frame is played back-to-back with the previous frame.

    3. Sometimes I record for ~40 seconds, but the file duration comes out much longer, around 2 minutes. However, audio/video plays for only ~40 seconds; the rest of the file contains nothing, and the seek bar jumps to the end right after 40 seconds (tested in VLC).

    EDIT:

    According to Ronald S. Bultje's suggestion, what I've understood:

    m_pAudioStream->time_base = (AVRational){1, 9000}; // actually no need to set this, since 9000 is already the default value for audio, as you said
    m_pVideoStream->time_base = (AVRational){1, 9000};
    

    should be set, since both audio and video streams would then be in the same timebase units.

    And for video:

    ...................
    ...................
    
    int64_t dts = av_gettime(); // get current time in microseconds
    dts *= 9000; 
    dts /= 1000000; // 1 second = 10^6 microseconds
    pkt.pts = AV_NOPTS_VALUE; // is it okay?
    pkt.dts = dts;
    // and no need to set pkt.duration, right?
    

    And for audio (exactly the same as video, right?):

    ...................
    ...................
    
    int64_t dts = av_gettime(); // get current time in microseconds
    dts *= 9000; 
    dts /= 1000000; // 1 second = 10^6 microseconds
    pkt.pts = AV_NOPTS_VALUE; // is it okay?
    pkt.dts = dts;
    // and no need to set pkt.duration, right?
    

    And I think they would now effectively be sharing the same currDts, right? Please correct me if I am wrong anywhere or missing anything.

    Also, if I want to use a video stream timebase of (AVRational){1, frameRate} and an audio stream timebase of (AVRational){1, sampleRate}, what should the correct code look like?

    EDIT 2.0:

    m_pAudioStream->time_base = (AVRational){1, VIDEO_FRAME_RATE};
    m_pVideoStream->time_base = (AVRational){1, VIDEO_FRAME_RATE};
    

    And

    bool AudioVideoRecorder::WriteAudio(const unsigned char *pEncodedData, size_t iDataSize) {
        ...........................
        ......................
        AVPacket pkt = {0};
        av_init_packet(&pkt);
    
    int64_t dts = av_gettime() / 1000; // convert to milliseconds
        dts = dts * VIDEO_FRAME_RATE;
        if(m_dtsOffset < 0) {
            m_dtsOffset = dts;
        }
    
        pkt.pts = AV_NOPTS_VALUE;
        pkt.dts = (dts - m_dtsOffset);
    
        pkt.stream_index = m_pAudioStream->index;
        pkt.flags |= AV_PKT_FLAG_KEY;
        pkt.data = (uint8_t*) pEncodedData;
        pkt.size = iDataSize;
    
        int ret = av_interleaved_write_frame(m_pFormatCtx, &pkt);
        if(ret < 0) {
            LogErr("Writing audio frame failed: %d", ret);
            return false;
        }
    
        Log("Writing audio frame done.");
    
        av_free_packet(&pkt);
        return true;
    }
    
    bool AudioVideoRecorder::WriteVideo(const unsigned char *pData, size_t iDataSize, bool const bIFrame) {
        ........................................
        .................................
        AVPacket pkt = {0};
        av_init_packet(&pkt);
        int64_t dts = av_gettime() / 1000;
        dts = dts * VIDEO_FRAME_RATE;
        if(m_dtsOffset < 0) {
            m_dtsOffset = dts;
        }
        pkt.pts = AV_NOPTS_VALUE;
        pkt.dts = (dts - m_dtsOffset);
    
        if(bIFrame) {
            pkt.flags |= AV_PKT_FLAG_KEY;
        }
        pkt.stream_index = m_pVideoStream->index;
        pkt.data = (uint8_t*) pData;
        pkt.size = iDataSize;
    
        int ret = av_interleaved_write_frame(m_pFormatCtx, &pkt);
    
        if(ret < 0) {
            LogErr("Writing video frame failed.");
            return false;
        }
    
        Log("Writing video frame done.");
    
        av_free_packet(&pkt);
        return true;
    }
    

    Is this last change okay? The video and audio seem synced. The only problem is that the audio plays back with no gap even when a packet arrives late. Like -

    packet arrival: 1 2 3 4... (then next frame arrived after 3 sec) .. 5

    audio played: 1 2 3 4 (no delay) 5
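
    For detecting such a gap, I imagine something like this (a rough sketch; expectedDuration is hypothetical, the nominal per-packet duration in the same units as dts, and m_prevAudioDts is from my first version):

    // If the new timestamp lands more than one packet past the previous
    // one, the difference is missing audio that silence should cover.
    int64_t gap = dts - (m_prevAudioDts + expectedDuration);
    if(gap > 0) {
        // write a silence packet of 'gap' duration (see the answer's [edit])
    }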

    EDIT 3.0:

    zeroed audio sample data:

    AVFrame* pSilentData;
    pSilentData = av_frame_alloc();
    pSilentData->data[0] = (uint8_t*) av_malloc(iDataSize); // av_frame_alloc() allocates no sample buffer, so do it here
    memset(pSilentData->data[0], 0, iDataSize);
    
    pkt.data = pSilentData->data[0]; // point the packet at the sample buffer, not at the AVFrame struct
    pkt.size = iDataSize;
    
    // (av_interleaved_write_frame(..) happens here, before freeing)
    av_freep(&pSilentData->data[0]);
    av_frame_free(&pSilentData);
    

    Is this okay? But after writing this into the file container, there is a dot-dot noise while playing the media. What's the problem?

    EDIT 4.0:

    Well, for µ-law audio the zero (silence) value is represented as 0xff. So -

    memset(pSilentData->data[0], 0xff, iDataSize);
    

    solved my problem.

    • wimh
      wimh over 8 years
      AFAIK audio should generally not have a DTS, just a PTS. Video should only have a DTS if the source frame has a DTS too (i.e. if it is used as a reference by a B-frame).
    • Connor Nee
      Connor Nee over 8 years
      Your timebase for the audio and video should correspond to your sampling frequency. For instance, if you are sampling your video at 25 frames per second, then the rescale is from 1/25 to 1/90000. I'm not sure why you are using 100000 anywhere.
    • Kaidul
      Kaidul over 8 years
      @sipwiz Can you please check the EDIT 2.0?
  • Kaidul
    Kaidul over 8 years
    Sir, thanks for your answer! I have edited my question and written down at the bottom what I understood from your answer so far. Please correct me, and if that doesn't make sense, can you please give me some precise working code? I have been struggling with this issue for 2-3 days and just can't get the intuition.
  • Kaidul
    Kaidul over 8 years
    Can you please check the EDIT 2.0?
  • Kaidul
    Kaidul over 8 years
    Thanks again for suggesting the audio gap solution. Can you please tell me whether my current implementation is correct or not? What changes do I need to make in WriteAudio(..) if I want to set the audio stream timebase to {1, sampleRate}?
  • Kaidul
    Kaidul over 8 years
    Also, by zeroed audio sample data do you mean AVPacket pkt = {0}?
  • Ronald S. Bultje
    Ronald S. Bultje over 8 years
    no, I mean an AVFrame with memset(data[0], 0, size).
  • Kaidul
    Kaidul over 8 years
    Can you please check EDIT 3.0?
  • PR Singh
    PR Singh about 6 years
      @KaidulIslam Hi, I am also facing this issue. Is it resolved?