Pandoc and foreign characters
Solution 1
Use the --pdf-engine=xelatex option.
Solution 2
By default, Pandoc uses the pdflatex engine when converting Markdown files to PDF. pdflatex does not handle Unicode characters as smoothly as xelatex, so you should try xelatex instead. But merely switching to xelatex is not enough. As is often the case, you also need to choose a font that contains glyphs for the Unicode characters you want to typeset.
I am a Chinese user, so take Chinese as an example. If you have a file test.md which contains the following content:
你好汉字
you can use the following command to compile this markdown file:
pandoc --pdf-engine=xelatex -V CJKmainfont="KaiTi" test.md -o test.pdf
In the above command, --pdf-engine=xelatex selects the LaTeX engine (in Pandoc 2.0 and later, the older --latex-engine option has been replaced by --pdf-engine). -V CJKmainfont="KaiTi" selects a font which supports Chinese. For other languages, you may use the flag -V mainfont="<FONT_NAME>".
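Since the engine option was renamed between Pandoc 1.x and 2.x, a script that has to run on both can pick the option name from the version string. The following is a minimal sketch, not part of Pandoc; the helper name engine_flag is hypothetical, and it assumes the version string starts with a numeric major version such as "1.19.2" or "2.16.2".

```shell
#!/bin/sh
# engine_flag VERSION: print the engine option name that matches a pandoc
# version string. Hypothetical helper; assumes VERSION looks like "2.16.2".
engine_flag() {
  major=${1%%.*}              # keep only the text before the first dot
  if [ "$major" -ge 2 ]; then
    echo "--pdf-engine"       # Pandoc 2.0 and later
  else
    echo "--latex-engine"     # Pandoc 1.x
  fi
}

# Usage (assumes pandoc is installed and its first output line is "pandoc X.Y.Z"):
# version=$(pandoc --version | head -n 1 | awk '{print $2}')
# pandoc "$(engine_flag "$version")=xelatex" -V CJKmainfont="KaiTi" test.md -o test.pdf
```

The function only inspects the major version, so it needs no Pandoc installation to be checked on its own.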
How to find a font which support your language
In order to find a font which supports your language, you need to know your language code. Then, if you are on a Linux system, or on a Windows system with TeX Live installed, you can use the following command to find a valid font for your language:
fc-list :lang=zh #find the font which support Chinese (language code is `zh`)
The command lists the installed fonts that support the language. If you choose, for example, the font Source Han Serif CN, then use the following command to compile your markdown file:
pandoc --pdf-engine=xelatex -V CJKmainfont="Source Han Serif CN" test.md -o test.pdf
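To get only the font names that can be passed to CJKmainfont or mainfont, fc-list can be asked for the family element alone. The sketch below assumes fontconfig's fc-list is available; the helper name font_families is hypothetical. A family line may contain several comma-separated aliases (for example a Latin and a native-script name), and this sketch keeps only the first one.

```shell
#!/bin/sh
# font_families: read the output of `fc-list :lang=XX family` on stdin and
# print one family name per line, without duplicates.
# Hypothetical helper; only the first comma-separated alias of each line is kept.
font_families() {
  cut -d, -f1 | LC_ALL=C sort -u
}

# Usage (assumes fontconfig's fc-list is installed):
# fc-list :lang=zh family | font_families
```

Any name printed this way can then be tried as the value of -V CJKmainfont="..." in the command above.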
Solution 3
UPDATE: the answer below seems to be valid for pandoc 1.x but with later versions the syntax has changed
Coming back to this post five years later, the issue is still there. The command
pandoc -s test.md -t latex -o test.pdf
fails when test.md contains text with non-Latin characters, Greek, Cyrillic, CJK, Hebrew and Arabic included.
LaTeX was designed before Unicode and its support for different character sets is robust in some areas but far from comprehensive, so the advice to use XeLaTeX is valid yet requires one to choose the main font carefully, since there is no automatic choice.
Below is a small taxonomy of possible issues and some solutions. All tested with Pandoc 1.19.
Cyrillic
Support for the Cyrillic alphabet in LaTeX is provided via the T2A font encoding.
Consider a small sample:
# Header
## Subheader
Tetris (Russian: Тетрис) quoting Wikipedia is a tile-matching puzzle
video game
Running this example with pandoc would fail with:
! Package inputenc Error: Unicode char Т (U+422)
(inputenc) not set up for use with LaTeX.
See the inputenc package documentation for explanation.
A fix is available because fontenc is a predefined variable in the default.latex template.
Running this example with
pandoc -t latex -o tetris.pdf -V fontenc=T2A cyrillic.md
would produce the correct rendering.
This however would not handle other language features correctly such as hyphenation. A better way would be to use Babel and have it select the correct font encoding.
pandoc -t latex -o tetris.pdf -V lang -V babel-lang=russian cyrillic.md
Or to switch languages with Babel commands inside Markdown
# Header
## Subheader
Tetris (Russian: \foreignlanguage{russian}{Тетрис}) quoting Wikipedia
is a tile-matching puzzle video game
And run with
pandoc -t latex -o tetris.pdf -V lang -V babel-lang=english \
-V babel-otherlangs=russian cyrillic2.md
Greek
The example in the original post contains characters both from the main and extended Greek Unicode codepages.
Anyway, the widely used LGR greek font encoding is not covered by LaTeX 3 project and is classified as a local encoding, i.e. it may vary from site to site and from system to system according to the LaTeX Encoding Guide.
On TeX Live the following packages need to be installed: texlive-greek-inputenc, texlive-greek-fontenc and texlive-cbfonts. Note that you need Babel 3.9 or later.
However the result of
pandoc -t latex -o anarchy.pdf -V fontenc=LGR greek.md
may appear unexpected.
In order to correct this issue one has to set up the LaTeX Babel package correctly and insert commands to switch between the languages in the original text:
# Header!
## Sub Header
themselves derived respectively from the Greek \textgreek{ἀναρχία}
i.e. 'anarchy'
Compiling this with the following command
pandoc -s greek2.md -t latex -V fontenc=T2A -V lang -V babel-lang=english \
-V babel-otherlangs=greek -o greek.pdf
would produce the output exactly as you would expect it to be.
XeLaTeX
All of this would not be needed if we were using XeLaTeX.
Just running the original example with
pandoc -s greek.md --latex-engine=xelatex -t latex -o greek.pdf
would produce a PDF in which the Greek word is replaced by white space, because the default font does not contain anything in the Greek character positions.
Selecting one of the popular fonts as the new mainfont would help a bit:
pandoc -s greek.md --latex-engine=xelatex \
-V mainfont="Liberation Serif" -t latex -o greek.pdf
However, characters from the extended Greek codepage, such as the small letter alpha with psili accent, are not rendered.
The Font Setup for Greek with XeTeX/LuaTeX guide suggests using the DejaVu, Libertine or Free font families.
Indeed with DejaVu Serif, Linux Libertine O, as well as Tempora and perhaps some other fonts, the result would be as expected. The command for XeLaTeX with the Linux Libertine font is:
pandoc -s greek.md --latex-engine=xelatex -V mainfont="Linux Libertine O" \
-t latex -o greek.pdf
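Before choosing a mainfont, it can help to check which installed fonts actually contain the character that fails to render. The sketch below assumes fontconfig's fc-list is installed and that it accepts a charset element with a hexadecimal code point in its pattern; the helper name charset_pattern is hypothetical. The code point 1f00 is the small letter alpha with psili (ἀ) from the example.

```shell
#!/bin/sh
# charset_pattern HEX: build a fontconfig pattern that matches fonts
# containing the Unicode code point HEX (hexadecimal, without "U+").
# Hypothetical helper name.
charset_pattern() {
  printf ':charset=%s\n' "$1"
}

# Usage (assumes fontconfig's fc-list is installed):
# fc-list "$(charset_pattern 1f00)" family | sort -u
```

If a font such as Liberation Serif is missing from the resulting list, that would be consistent with the blank output described above.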
Solution 4
Works for Cyrillic characters
pandoc myfile.md --pdf-engine=xelatex -V mainfont=Arial
Solution 5
You can use --latex-engine=xelatex, as said before, but the best approach I have found is to use the lang variable to specify the document language in the header, like this: lang: ru-RU. A working example on my Debian workstation:
---
title: Lady Macbeth de Mzensk (Chostakovitch, livret d'Alexandre Preis, 1934)
lang: ru-RU
---
# Acte I / Tableau 1
*[Народ ненадежный]*
Ха, ха, ха, ха, ха, ха, ха. *[...]* Чуыствуем
На кого ты нас покидаешь?
Без хозяина будет скучно,
скучно, тоскливо, безрадостно.
Не работа. Без тебя невеселье. Воз вращайся
Как можно скорей, скорей !
Then you can launch:
$ pandoc -o your-file-output.pdf your-source-file.md
Mike Thomsen
Java developer, inveterate hater of description blurbs.
Updated on May 19, 2020
Comments
-
Mike Thomsen almost 4 years
I've been trying to use Pandoc to convert some Markdown into a PDF file. This is a sample that Pandoc will not convert for me:
# Header!
## Sub Header
themselves derived respectively from the Greek ἀναρχία i.e. 'anarchy'
That's just something I grabbed from the top of the wikipedia database dump. Pandoc doesn't like that at all. This is the error message it gives me:
pandoc: Error producing PDF from TeX source.
! Package inputenc Error: Unicode char \u8:ἀ not set up for use with LaTeX.
See the inputenc package documentation for explanation.
Type H <return> for immediate help.
...
l.53 ...es derived respectively from the Greek ἀ
Is there a command switch I can give it to get around this? I tried following the advice to do something like this, but it failed:
iconv -t utf-8 test.md | pandoc -o test.pdf
Update Before following John's advice below, see this.
Update 2 This is the command that ultimately got it working. Hopefully this will help someone:
pandoc test2.md -o test2.pdf --latex-engine=xelatex --template=my.latex --variable mainfont="DejaVu Serif" --variable sansfont=Arial
And this is the contents of my.latex:
\documentclass[$if(fontsize)$$fontsize$,$endif$$if(lang)$$lang$,$endif$$if(papersize)$$papersize$,$endif$]{$documentclass$}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\usepackage{amssymb,amsmath}
\usepackage{ifxetex,ifluatex}
\usepackage{fixltx2e} % provides \textsubscript
% use microtype if available
\IfFileExists{microtype.sty}{\usepackage{microtype}}{}
% use upquote if available, for straight quotes in verbatim environments
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
  \usepackage[utf]{inputenc}
  \usepackage{ucs}
$if(euro)$
  \usepackage{eurosym}
$endif$
\else % if luatex or xelatex
  \usepackage{fontspec}
  \ifxetex
    \usepackage{xltxtra,xunicode}
  \fi
  \defaultfontfeatures{Mapping=tex-text,Scale=MatchLowercase}
  \setromanfont{TeX Gyre Pagella}
  \newcommand{\euro}{€}
$if(mainfont)$
  \setmainfont{$mainfont$}
$endif$
$if(sansfont)$
  \setsansfont{$sansfont$}
$endif$
$if(monofont)$
  \setmonofont{$monofont$}
$endif$
$if(mathfont)$
  \setmathfont{$mathfont$}
$endif$
\fi
$if(geometry)$
\usepackage[$for(geometry)$$geometry$$sep$,$endfor$]{geometry}
$endif$
$if(natbib)$
\usepackage{natbib}
\bibliographystyle{plainnat}
$endif$
$if(biblatex)$
\usepackage{biblatex}
$if(biblio-files)$
\bibliography{$biblio-files$}
$endif$
$endif$
$if(listings)$
\usepackage{listings}
$endif$
$if(lhs)$
\lstnewenvironment{code}{\lstset{language=Haskell,basicstyle=\small\ttfamily}}{}
$endif$
$if(highlighting-macros)$
$highlighting-macros$
$endif$
$if(verbatim-in-note)$
\usepackage{fancyvrb}
$endif$
$if(tables)$
\usepackage{longtable}
$endif$
$if(graphics)$
\usepackage{graphicx}
% We will generate all images so they have a width \maxwidth. This means
% that they will get their normal width if they fit onto the page, but
% are scaled down if they would overflow the margins.
\makeatletter
\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth
\else\Gin@nat@width\fi}
\makeatother
\let\Oldincludegraphics\includegraphics
\renewcommand{\includegraphics}[1]{\Oldincludegraphics[width=\maxwidth]{#1}}
$endif$
\ifxetex
  \usepackage[setpagesize=false, % page size defined by xetex
              unicode=false, % unicode breaks when used with xetex
              xetex]{hyperref}
\else
  \usepackage[unicode=true]{hyperref}
\fi
\hypersetup{breaklinks=true,
            bookmarks=true,
            pdfauthor={$author-meta$},
            pdftitle={$title-meta$},
            colorlinks=true,
            urlcolor=$if(urlcolor)$$urlcolor$$else$blue$endif$,
            linkcolor=$if(linkcolor)$$linkcolor$$else$magenta$endif$,
            pdfborder={0 0 0}}
\urlstyle{same} % don't use monospace font for urls
$if(links-as-notes)$
% Make links footnotes instead of hotlinks:
\renewcommand{\href}[2]{#2\footnote{\url{#1}}}
$endif$
$if(strikeout)$
\usepackage[normalem]{ulem}
% avoid problems with \sout in headers with hyperref:
\pdfstringdefDisableCommands{\renewcommand{\sout}{}}
$endif$
\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 2pt minus 1pt}
\setlength{\emergencystretch}{3em} % prevent overfull lines
$if(numbersections)$
$else$
\setcounter{secnumdepth}{0}
$endif$
$if(verbatim-in-note)$
\VerbatimFootnotes % allows verbatim text in footnotes
$endif$
$if(lang)$
\ifxetex
  \usepackage{polyglossia}
  \setmainlanguage{$mainlang$}
\else
  \usepackage[$lang$]{babel}
\fi
$endif$
$for(header-includes)$
$header-includes$
$endfor$
$if(title)$
\title{$title$}
$endif$
\author{$for(author)$$author$$sep$ \and $endfor$}
\date{$date$}
\begin{document}
$if(title)$
\maketitle
$endif$
$for(include-before)$
$include-before$
$endfor$
$if(toc)$
{
\hypersetup{linkcolor=black}
\setcounter{tocdepth}{$toc-depth$}
\tableofcontents
}
$endif$
$body$
$if(natbib)$
$if(biblio-files)$
$if(biblio-title)$
$if(book-class)$
\renewcommand\bibname{$biblio-title$}
$else$
\renewcommand\refname{$biblio-title$}
$endif$
$endif$
\bibliography{$biblio-files$}
$endif$
$endif$
$if(biblatex)$
\printbibliography$if(biblio-title)$[title=$biblio-title$]$endif$
$endif$
$for(include-after)$
$include-after$
$endfor$
\end{document}
-
Mike Thomsen over 10 years
That created the document, but now I have a bunch of blank characters where the Greek word is supposed to be. I think it is not recognizing the characters in question.
-
z-- over 9 years
The blank characters show up when you have a font selected which does not contain Greek glyphs. Use the --variable mainfont="..." option on the command line. See johnmacfarlane.net/pandoc/demos.html Example 14 (Xe)Latex
-
J3soon about 6 years
See Pandoc with Chinese for more info.
-
J3soon about 6 years
XD. However, for me I need to use mainfont instead of CJKmainfont.
-
Dmitri Chubarov almost 6 years
It is not true that pdflatex can not handle Unicode characters.
-
jdhao almost 6 years
@DmitriChubarov, is that true? I am not an expert at LaTeX. But xelatex seems to be the option when dealing with Unicode characters.
-
Dmitri Chubarov almost 6 years
LaTeX was designed before Unicode, so it only had support for 256-character coding tables. However, since it was possible to switch coding tables and fonts dynamically under the hood, it was completely normal to supply UTF-8 encoded input to the LaTeX compiler; translation was performed by packages such as inputenc and fontenc.
-
jdhao almost 6 years
@DmitriChubarov, thanks for the information. I will revise my answer.
-
Sardathrion - against SE abuse almost 6 years
--pdf-engine is the new option …
-
Creasixtine about 5 years
You can also use the lang variable as I have described in my answer.
-
Roman Golyshev about 5 years
That 'babel-lang' option is the only thing that worked for me, thanks! It is not directly mentioned in the official pandoc documentation.
-
Merchako about 5 years
This answer is incomplete (doesn't mention fonts) and no longer correct (should say --pdf-engine).
-
Nazar Paruna over 2 years
I don't know why, but after checking for updates and installing packages this command stopped working. I removed Pandoc (v2.16.2) and MikTeX (v21.8) and re-installed them. After that I checked for updates but did not install new packages, and it works again. Maybe my comment will be useful for someone.