iText - Get Font size and family of a text segment

Solution 1

You can adapt the code provided in this answer, in particular this code snippet:

Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
Single curFontSize = rect.Height;

This answer is in C#, but the API is so similar that the conversion to Java should be straightforward.

Solution 2

Thanks to Alexis I could convert his C# solution into Java code:

text = renderInfo.getText();

Vector curBaseline = renderInfo.getBaseline().getStartPoint();
Vector topRight = renderInfo.getAscentLine().getEndPoint();

Rectangle rect = new Rectangle(curBaseline.get(0), curBaseline.get(1), topRight.get(0), topRight.get(1));
float curFontSize = rect.getHeight();

Solution 3

I had some trouble using Alexis' and Prine's solutions, since they don't handle rotated text correctly. So this is what I do (sorry, it's in Scala):

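val x0 = info.getAscentLine.getEndPoint
val x1 = info.getBaseline.getStartPoint
val x2 = info.getBaseline.getEndPoint
// |(x2 - x1) x (x1 - x0)|^2: squared area of the parallelogram spanned by the baseline and the ascent point
val length1 = x2.subtract(x1).cross(x1.subtract(x0)).lengthSquared
// |x2 - x1|^2: squared length of the baseline
val length2 = x2.subtract(x1).lengthSquared
// Perpendicular distance from the ascent point to the baseline (square root of the ratio); 0 for a zero-length baseline
if (length2 == 0) 0f else math.sqrt(length1 / length2).toFloat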
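
For readers working in Java, the rotation-aware idea can be sketched without any iText types. This is a minimal, self-contained illustration rather than library code: the class `GlyphHeight` and its method names are hypothetical, and the points are passed as raw x/y coordinates where a real listener would read them from `getBaseline()` and `getAscentLine()`. It compares the axis-aligned height used in Solutions 1 and 2 with the perpendicular distance from the ascent point to the baseline.

```java
public class GlyphHeight {

    /** Axis-aligned height as in Solutions 1 and 2: only meaningful for unrotated text. */
    public static float axisAlignedHeight(float baseStartY, float ascentEndY) {
        return Math.abs(ascentEndY - baseStartY);
    }

    /**
     * Perpendicular distance from the ascent point (ax, ay) to the baseline
     * running from (x1, y1) to (x2, y2). Returns 0 for a zero-length baseline.
     */
    public static float perpendicularHeight(float x1, float y1, float x2, float y2,
                                            float ax, float ay) {
        float dx = x2 - x1;
        float dy = y2 - y1;
        double baselineLength = Math.hypot(dx, dy);
        if (baselineLength == 0) {
            return 0f;
        }
        // z component of the 2D cross product: area of the parallelogram
        // spanned by the baseline and the vector to the ascent point
        double cross = dx * (ay - y1) - dy * (ax - x1);
        return (float) (Math.abs(cross) / baselineLength);
    }

    public static void main(String[] args) {
        // Horizontal text: baseline (0,0)-(50,0), ascent end point (50,10)
        System.out.println(axisAlignedHeight(0f, 10f));
        System.out.println(perpendicularHeight(0f, 0f, 50f, 0f, 50f, 10f));

        // Same text rotated by 90 degrees: baseline (0,0)-(0,50), ascent end point (-10,50)
        System.out.println(axisAlignedHeight(0f, 50f));
        System.out.println(perpendicularHeight(0f, 0f, 0f, 50f, -10f, 50f));
    }
}
```

In the rotated case the axis-aligned variant reports 50 (the width of the text run), while the perpendicular distance still reports 10.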

Solution 4

If you want the exact font size, use the following code in your renderText method:

float fontsize = renderInfo.getAscentLine().getStartPoint().get(1)
     - renderInfo.getDescentLine().getStartPoint().get(1);

For rotated text, modify this as indicated in the other answers.
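
One possible way to make the ascent-to-descent measurement independent of rotation is to take the Euclidean distance between the two start points instead of their y difference. The sketch below is self-contained and uses plain coordinates; the class `FontHeight` and its method are hypothetical names. It assumes the text is not skewed, so that the ascent and descent start points lie on the same perpendicular to the baseline. With iText 5 vectors the equivalent would presumably be `ascentStart.subtract(descentStart).length()`, but treat that mapping as an assumption to verify against your iText version.

```java
public class FontHeight {

    /**
     * Distance between the start of the ascent line and the start of the
     * descent line, regardless of how the text is rotated on the page.
     */
    public static float ascentToDescent(float ascX, float ascY, float descX, float descY) {
        return (float) Math.hypot(ascX - descX, ascY - descY);
    }

    public static void main(String[] args) {
        // Horizontal text: ascent start (0,8), descent start (0,-2)
        System.out.println(ascentToDescent(0f, 8f, 0f, -2f));

        // Same text rotated by 90 degrees: ascent start (-8,0), descent start (2,0).
        // The plain y difference would be 0 here.
        System.out.println(ascentToDescent(-8f, 0f, 2f, 0f));
    }
}
```

Both calls yield a height of 10, whereas subtracting only the y coordinates gives 0 for the rotated case.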

Author: Prine

Updated on June 11, 2022

Comments

  • Prine, about 2 years ago

    I'm currently trying to automatically extract important keywords from a PDF file. I am able to get the text out of the PDF document, but now I need to know which font size and font family these keywords have.

    This is the code I have so far:

    Main

    public static void main(String[] args) throws IOException {
        String src = "SEM_081145.pdf";
    
        PdfReader reader = new PdfReader(src);
    
        SemTextExtractionStrategy semTextExtractionStrategy = new SemTextExtractionStrategy();
    
        PrintWriter out = new PrintWriter(new FileOutputStream(src + ".txt"));
        Rectangle rect = new Rectangle(70, 80, 490, 580);
        RenderFilter filter = new RegionTextRenderFilter(rect);
    
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            // strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
            out.println(PdfTextExtractor.getTextFromPage(reader, i, semTextExtractionStrategy));
        }
        out.flush();
        out.close();
    }
    

    And I have implemented the TextExtraction Strategy SemTextExtractionStrategy which looks like this:

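    import com.itextpdf.text.pdf.parser.ImageRenderInfo;
    import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
    import com.itextpdf.text.pdf.parser.TextRenderInfo;

    public class SemTextExtractionStrategy implements TextExtractionStrategy {

        // Holds only the most recently rendered chunk
        private String text;

        @Override
        public void beginTextBlock() {
        }

        @Override
        public void renderText(TextRenderInfo renderInfo) {
            text = renderInfo.getText();

            // Prints the numeric font type of the current chunk
            System.out.println(renderInfo.getFont().getFontType());

            System.out.print(text);
        }

        @Override
        public void endTextBlock() {
        }

        @Override
        public void renderImage(ImageRenderInfo renderInfo) {
        }

        @Override
        public String getResultantText() {
            return text;
        }
    }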
    

    I can get the font type, but there is no method to get the font size. Is there another way to get the font size of the current text segment?

    Or are there any other libraries that can extract the font size of text segments? I have already looked into PDFBox and PDFTextStream. The PDF library from Aspose would do the job perfectly, but it's very expensive and I need to use an open source project.
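
On the font family part of the question, which the solutions do not cover: a minimal sketch, assuming the font name is already available as a string. In iText 5 the likely source is `renderInfo.getFont().getPostscriptFontName()`, but that mapping is an assumption here and should be checked against your version. Embedded subset fonts in a PDF carry a tag of six uppercase letters followed by a plus sign in front of the name, so a small helper can strip it. The class `FontNames` and its method are hypothetical names.

```java
import java.util.regex.Pattern;

public class FontNames {

    // Subset fonts are named like "ABCDEF+Arial-BoldMT":
    // six uppercase letters, a plus sign, then the actual font name
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+");

    /** Returns the font name without a leading subset tag, if one is present. */
    public static String stripSubsetTag(String postscriptName) {
        return SUBSET_TAG.matcher(postscriptName).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(stripSubsetTag("ABCDEF+Arial-BoldMT")); // subset font
        System.out.println(stripSubsetTag("Helvetica"));           // no tag, unchanged
    }
}
```

The remaining name (for example `Arial-BoldMT`) still combines family and style; splitting it further depends on the naming conventions of the fonts in your documents.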