How to merge PDFs and create a bookmark for each input file in the output file? (Linux)

linux pdf bookmarks merge open-source

5,459

Solution 1

UPDATE: I wasn't satisfied with the result and have since written a version with a nice GUI:

I learned Python and wrote (modified) this program in one hour:

#! /usr/bin/env python
# Original author Nicholas Kim, modified by Yan Pashkovsky
# New license - GPL v3
import sys
import time
from PyPDF2 import utils, PdfFileReader, PdfFileWriter

def get_cmdline_arguments():
    """Retrieve command line arguments."""
    
    from optparse import OptionParser
    
    usage_string = "%prog [-o output_name] file1 file2 [...]"

    parser = OptionParser(usage_string)
    parser.add_option(
        "-o", "--output",
        dest="output_filename",
        default=time.strftime("output_%Y%m%d_%H%M%S"),
        help="specify output filename (exclude .pdf extension); default is current date/time stamp"
    )
    
    options, args = parser.parse_args()
    if len(args) < 2:
        parser.print_help()
        sys.exit(1)
    return options, args
    
def main():
    options, filenames = get_cmdline_arguments()
    output_pdf_name = options.output_filename + ".pdf"
    files_to_merge = []

    # get PDF files
    for f in filenames:
        try:
            next_pdf_file = PdfFileReader(open(f, "rb"))
        except utils.PdfReadError:
            # sys.stderr.write works under both Python 2 and Python 3
            sys.stderr.write("%s is not a valid PDF file.\n" % f)
            sys.exit(1)
        except IOError:
            sys.stderr.write("%s could not be found.\n" % f)
            sys.exit(1)
        else:
            files_to_merge.append(next_pdf_file)

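    # merge page by page, bookmarking the first page of each input file
    output_pdf_stream = PdfFileWriter()
    page_offset = 0  # index in the output of the current file's first page
    for name, f in zip(filenames, files_to_merge):
        for i in range(f.numPages):
            output_pdf_stream.addPage(f.getPage(i))
            if i == 0:
                output_pdf_stream.addBookmark(str(name), page_offset)
        page_offset += f.numPages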
        
    # create output pdf file; "with" closes it even if writing fails
    with open(output_pdf_name, "wb") as output_pdf_file:
        output_pdf_stream.write(output_pdf_file)

    print("%s successfully created." % output_pdf_name)


if __name__ == "__main__":
    main()

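This program requires PyPDF2, which you can install with sudo pip install pypdf2 (install pip first if you don't have it :) ). It was written against the older PyPDF2 interface (PdfFileReader, PdfFileWriter, addBookmark), which was removed in PyPDF2 3.0. Make the script executable, then open a terminal in the folder with your PDFs and enter ./pdfmerger.py *.pdf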
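
The bookkeeping done by the page and file counters in the script above amounts to a running sum of page counts: each file's bookmark points at the output page index where that file begins. Here is a minimal, library-free sketch of just that arithmetic. The helper name `bookmark_targets` and the sample page counts are hypothetical and chosen only for illustration:

```python
def bookmark_targets(names, page_counts):
    """Return (name, zero-based start page) pairs for the merged output.

    Hypothetical helper: page_counts[i] is the number of pages in names[i].
    A file with no pages gets no bookmark, mirroring the script above,
    which only adds a bookmark once a first page has been added.
    """
    targets = []
    offset = 0  # index of the next page to be written to the output
    for name, count in zip(names, page_counts):
        if count > 0:
            targets.append((name, offset))
        offset += count
    return targets


if __name__ == "__main__":
    # Made-up example: three input files with 3, 1 and 5 pages.
    for name, start in bookmark_targets(["a.pdf", "b.pdf", "c.pdf"], [3, 1, 5]):
        print("%s starts at output page index %d" % (name, start))
```

The indices are zero-based, which is what the script passes as the second argument of addBookmark.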

Solution 2

This Bash script gives each PDF in the current directory a single bookmark that points to its first page and is titled with the PDF's filename, and then concatenates them all. Note that it rewrites the input files in place. It can handle non-ASCII filenames.

#!/usr/bin/env bash

cattedPDFname="${1:?Concatenated PDF filename}"

# make each PDF contain a single bookmark to its first page
tempPDF=$(mktemp)
for i in *.pdf
do
    bookmarkTitle=$(basename "$i" .pdf)
    # printf with %s keeps spaces and backslashes in the title intact
    pdftk "$i" update_info_utf8 <(printf 'BookmarkBegin\nBookmarkTitle: %s\nBookmarkLevel: 1\nBookmarkPageNumber: 1\n' "$bookmarkTitle") output "$tempPDF" verbose
    mv "$tempPDF" "$i"
done

# concatenate the PDFs
pdftk *.pdf cat output "$cattedPDFname" verbose
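
Note that `*.pdf` expands in lexicographic order, so `Lecture_10.pdf` comes before `Lecture_2.pdf` and the merged bookmarks follow that order. One way around this is to sort the names by their embedded numbers before handing them to the merge step. A minimal Python sketch follows; `natural_key` is a hypothetical helper name and the file names are made up:

```python
import re


def natural_key(name):
    """Sort key that compares digit runs as integers, so 2 sorts before 10."""
    parts = re.split(r"(\d+)", name)
    # With a capturing group, re.split alternates: text, digits, text, ...
    # so odd positions always hold digit runs and even positions hold text.
    return [int(part) if index % 2 else part.lower()
            for index, part in enumerate(parts)]


if __name__ == "__main__":
    names = ["Lecture_10.pdf", "Lecture_2.pdf", "Lecture_1.pdf"]
    for name in sorted(names, key=natural_key):
        print(name)
```

The sorted names could then be passed, in that order, as the input arguments of the merge command instead of relying on the glob.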

Solution 3

Adapting a good answer ^[1] from tex.stackexchange.com, you can create an itemize list that references the files you include further down (similar to a table of contents). LaTeX takes care of updating the page numbers.

A few more words about LaTeX

A line like the following includes the PDF file MyDoc1.pdf, located in the same directory as the LaTeX file, under the reference name "doc01":
```
\modifiedincludepdf{-}{doc01}{MyDoc1.pdf}
```
A command such as \pageref{doc02.3} prints the number of the third page of the document whose reference key is "doc02", as a link to that page. LaTeX keeps it up to date.
A \begin{itemize} ... \end{itemize} block creates a bulleted list.

The LaTeX file
Below is the modified template, which works with pdflatex:

\documentclass{article}
\usepackage{hyperref}
\usepackage{pdfpages}
\usepackage[russian,english]{babel}

\newcounter{includepdfpage}
\newcounter{currentpagecounter}
\newcommand{\addlabelstoallincludedpages}[1]{
   \refstepcounter{includepdfpage}
   \stepcounter{currentpagecounter}
   \label{#1.\thecurrentpagecounter}}
\newcommand{\modifiedincludepdf}[3]{
    \setcounter{currentpagecounter}{0}
    \includepdf[pages=#1,pagecommand=\addlabelstoallincludedpages{#2}]{#3}}

\begin{document}

You can refer to the beginning or to a specific page: \\
see page \pageref{doc01.1} till \pageref{doc02.3}.\\

\begin{itemize}
  \item Here contribution from Groupmate 1 \pageref{doc01.1}
  \item Here contribution from Groupmate 2 \pageref{doc02.1}
\end{itemize}

\modifiedincludepdf{-}{doc01}{MyDoc1.pdf}
\modifiedincludepdf{-}{doc02}{MyDoc2.pdf}

\end{document}
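
For more than a handful of files, the `\item` and `\modifiedincludepdf` lines of the template can be generated rather than typed by hand. This is a minimal Python sketch under that assumption. The helper name `latex_body`, the `docNN` key scheme and the item wording are hypothetical choices. File names containing LaTeX special characters (such as `_` or `%`) would need escaping in the item text, which is not handled here:

```python
def latex_body(pdf_names):
    """Build the itemize list and the include lines for the template above.

    Hypothetical helper: reference keys are doc01, doc02, ... in input order.
    """
    items = []
    includes = []
    for number, name in enumerate(pdf_names, start=1):
        key = "doc%02d" % number
        # \pageref{key.1} refers to the first included page of that file
        items.append("  \\item %s \\pageref{%s.1}" % (name, key))
        includes.append("\\modifiedincludepdf{-}{%s}{%s}" % (key, name))
    lines = ["\\begin{itemize}"] + items + ["\\end{itemize}", ""] + includes
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up file names matching the template above.
    print(latex_body(["MyDoc1.pdf", "MyDoc2.pdf"]))
```

The printed text is meant to be pasted between \begin{document} and \end{document} in place of the hand-written list and include lines.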

Note

To simply merge and split PDF documents or pages, you can use a tool such as pdftk and take inspiration from other questions ^[3] about it.

References

[1] Unable to link to inserted pages with pdfpages
[2] pdflatex(1) - Linux man page
[3] Answer about pdftk


yanpas

Updated on September 18, 2022

Comments

yanpas over 1 year

I'm using Linux and I would like to have software (or a script or method) that merges several PDFs into a single output PDF containing bookmarks. Each bookmark is named after the filename of one of the merged PDF files and points to the page where that file begins.

Adobe Acrobat offers something similar, but it is non-free and not available for Linux.
- Hastur over 8 years
  
  In okular you can put bookmarks in each part of a PDF and they will be shown in a column of bookmarks, regardless of whether the file is open or not. Then you click and... It's not what you are searching for, but it could work. To physically merge several PDFs into one you can use LaTeX... BTW, your question will probably be closed because software suggestions are off topic. It would be different if you were trying to write a script that finds all the PDFs with their locations, splits basename and dirname, and puts everything in a tex container to be compiled into your file, and you got stuck somewhere. ;)
- NZD over 8 years
  
  Have a look at unix.stackexchange.com/q/17065/121614
- yanpas over 8 years
  
  @Hastur well gs script would be OK for this purpose ) I do not have source files, only pdfs, so I do not understand how latex can help
- Hastur over 8 years
  
  @yanpas: I didn't quite understand: do you want to create, let's say, a book that includes a bunch of PDF files, with an index at the beginning (or at the end) with hyperlinks to the page where each article starts in the book, or do you want to create an index with links that point to the files on the HDD? I suppose the first. Can you confirm?
- yanpas over 8 years
  
  @Hastur the answer is closer to the first. My groupmates and I are preparing about 100 questions for the exam; each of us does his own part in the editor he prefers and sends me the result in PDF format. Then I merge all the PDFs into output.pdf. For easier navigation I would like output.pdf to have a bookmark list (when I click an entry in this list, I am taken to the section of the document that holds the corresponding bunch of answers). Something like i.imgur.com/hQQwp6i.png
- Hastur over 8 years
  
  @yanpas Feel free to add the packages you need and modify it for your purpose :) I tested it and it works on my system. Let me know.
- Xen2050 over 8 years
  
  Why not all just use the same file format, one that's better suited to editing and cut & paste? Like ODF (LibreOffice), Word, etc.? Or, if each person can't be bothered to use the same program, then you open each file in its own format, then cut & paste into your favourite one?
- yanpas over 8 years
  
  @Xen2050 I've described only one case; sometimes I have nothing but PDFs from the internet and I still need structure in the final PDF
Nathan over 3 years

Thanks for this! It's a very useful little script.
Ur Ya'ar about 3 years

My files are named "Lecture_#.pdf" where # is a number, and it does what you intend but the order is not right: instead of going 1,2,3,... it goes 10,11,12,...,1,20,21,... Can this be fixed?
Ur Ya'ar about 3 years

This is exactly what I'm looking for! Could you add more detailed installation instructions?
Ur Ya'ar about 3 years

This also happens when I just want to merge in pdftk, so to get the correct order I use Lecture_{1..27}.pdf instead of Lecture_*.pdf. But that only works because I know the exact names and number of files...
James Wright over 2 years

I updated the python script to be 3.X compatible and put it in the following gist.