Extracting text and images from PDF documents with PDFBox

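In day-to-day development we often need to extract content from PDF documents. Apache PDFBox is an open-source Java library for working with PDFs, and it can extract both the text and the embedded images of a document. This article walks through how to do both.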

Environment setup

Before extracting anything from a PDF with PDFBox, complete the following setup:

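Add the PDFBox dependencies: with Maven, declare the required artifacts in pom.xml. The configuration below is the one used in this article, and the examples target the PDFBox 2.0.x API. thumbnailator is used to scale the extracted images; itextpdf is part of the original configuration, although none of the examples below call it.

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>fontbox</artifactId>
    <version>2.0.1</version>
</dependency>
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.1</version>
</dependency>
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itextpdf</artifactId>
    <version>5.5.13</version>
</dependency>
<dependency>
    <groupId>net.coobird</groupId>
    <artifactId>thumbnailator</artifactId>
    <version>0.4.8</version>
</dependency>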

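Import the required classes: the examples rely on the following PDFBox classes and on Thumbnails from thumbnailator (JDK imports such as java.io.File are omitted):

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.encryption.AccessPermission;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.pdfbox.text.PDFTextStripper;
import net.coobird.thumbnailator.Thumbnails;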

Extracting text

PDFBox reads text through the PDFTextStripper class. Here is a simple text extraction example:

public static void extractText(String pdfFilePath) throws Exception {
    try (PDDocument document = PDDocument.load(new File(pdfFilePath))) {
        // Check that the document's permissions allow content extraction
        AccessPermission ap = document.getCurrentAccessPermission();
        if (!ap.canExtractContent()) {
            throw new IOException("You do not have permission to extract text");
        }

        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setSortByPosition(true);

        // Page numbers passed to the stripper are 1-based
        for (int p = 1; p <= document.getNumberOfPages(); ++p) {
            stripper.setStartPage(p);
            stripper.setEndPage(p);
            String text = stripper.getText(document);

            // Print the extracted text of this page
            System.out.println("Page " + p + ":");
            System.out.println(text.trim());
            System.out.println();
        }
    }
}

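How the code works

Load the document: PDDocument.load opens the target PDF file.

Permission check: getCurrentAccessPermission returns the document's access permissions. If they do not allow content extraction, the method throws an exception.

Stripper setup: a PDFTextStripper object is created to extract the text, and setSortByPosition(true) orders the text by its position on the page.

Page loop: setStartPage and setEndPage restrict the stripper to a single page (page numbers are 1-based), so the text is extracted one page at a time.

Output: the extracted text of each page is printed, which makes it easy to inspect and to process further.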
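Text returned by a stripper tends to carry ragged indentation and runs of blank lines, so some tidying before further processing can help. The following is a minimal sketch that uses only the JDK; PageTextCleaner and clean are hypothetical names, not part of PDFBox, and the rules (trim every line, collapse blank-line runs) are assumptions that may not suit every document.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of PDFBox) for tidying one page of extracted text. */
final class PageTextCleaner {

    /** Trims each line, drops leading and trailing blank lines, collapses blank-line runs into one. */
    static String clean(String pageText) {
        List<String> out = new ArrayList<>();
        boolean previousBlank = true; // starts true so leading blank lines are skipped
        for (String line : pageText.split("\\R")) { // \R matches any line break, including \r\n
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                if (!previousBlank) {
                    out.add("");
                }
                previousBlank = true;
            } else {
                out.add(trimmed);
                previousBlank = false;
            }
        }
        // Remove the single blank line that may remain at the end
        if (!out.isEmpty() && out.get(out.size() - 1).isEmpty()) {
            out.remove(out.size() - 1);
        }
        return String.join("\n", out);
    }

    public static void main(String[] args) {
        String raw = "  Title  \r\n\r\n\r\n  first line \n second line\n\n";
        System.out.println(clean(raw));
    }
}
```

Inside the page loop, the call would take the place of text.trim(), for example System.out.println(PageTextCleaner.clean(text)).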

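Extracting document information

Besides text, PDFBox exposes a document's metadata and page resources. The following example prints the document information and saves every embedded image: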

public static void pdfParse(String pdfPath) throws Exception {
    // try-with-resources closes the document on every exit path
    try (PDDocument document = PDDocument.load(new File(pdfPath))) {
        // Basic document information
        PDDocumentInformation info = document.getDocumentInformation();
        System.out.println("Title: " + info.getTitle());
        System.out.println("Subject: " + info.getSubject());
        System.out.println("Author: " + info.getAuthor());
        System.out.println("Keywords: " + info.getKeywords());
        System.out.println("Created: " + dateFormat(info.getCreationDate()));
        System.out.println("Modified: " + dateFormat(info.getModificationDate()));

        // Walk the resources of every page and save each image XObject
        for (int i = 0; i < document.getNumberOfPages(); ++i) {
            PDPage page = document.getPage(i);
            PDResources res = page.getResources();
            if (res == null) {
                continue; // this page has no resources
            }
            for (COSName cosName : res.getXObjectNames()) {
                if (res.isImageXObject(cosName)) {
                    PDImageXObject pdImageXObject = (PDImageXObject) res.getXObject(cosName);
                    // Scale to 90% and write the result as a JPG under D:\pdf\
                    Thumbnails.of(pdImageXObject.getImage())
                        .scale(0.9)
                        .toFile(new File("D:\\pdf\\" + System.currentTimeMillis() + ".jpg"));
                }
            }
        }
    }
}

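How the code works

Load the document: PDDocument.load opens the target PDF file.

Document information: PDDocumentInformation provides the basic metadata, including the title, subject, author, keywords, and the creation and modification dates.

Page resources: the code visits every page, obtains its PDResources object, and checks each XObject name to see whether it refers to an image.

Image saving: each image found is scaled to 90% of its original size with Thumbnails.of(...).scale(0.9) and written as a JPG file to the target directory.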
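One detail worth a second look is the file name. The example names each image after System.currentTimeMillis(), and two images saved within the same millisecond would receive the same name, so the later one overwrites the earlier one. A minimal sketch of an alternative follows: it derives the name from the page number and the XObject name (available through COSName.getName()). ImageFileNames, forImage, and the page-NNN-name.jpg pattern are illustrative assumptions, not something the original code defines.

```java
import java.io.File;
import java.util.Locale;

/** Hypothetical naming helper; the class, method, and file name pattern are illustrative only. */
final class ImageFileNames {

    /**
     * Builds an output file such as page-003-Im1.jpg from the 1-based page number
     * and the name of the image XObject on that page.
     */
    static File forImage(File outputDir, int pageNumber, String xObjectName) {
        // Keep letters, digits, dot, dash, and underscore; replace everything else
        String safeName = xObjectName.replaceAll("[^A-Za-z0-9._-]", "_");
        String fileName = String.format(Locale.ROOT, "page-%03d-%s.jpg", pageNumber, safeName);
        return new File(outputDir, fileName);
    }

    public static void main(String[] args) {
        System.out.println(forImage(new File("out"), 3, "Im1").getName());
    }
}
```

In pdfParse the call could look like ImageFileNames.forImage(new File("D:\\pdf"), i + 1, cosName.getName()). XObject names are keys of one page's resource dictionary, so they are distinct within a page, although replacing unusual characters could in principle map two names to the same file.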

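Converting PDF pages to JPG images

To render every page of a PDF file as an image file (the code below writes JPG files), use the following method: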

private static boolean pdf2Image(String pdfFilePath, String dstImgFolder, int dpi) {
    File file = new File(pdfFilePath);
    // try-with-resources closes the document even when rendering fails
    try (PDDocument pdDocument = PDDocument.load(file)) {
        PDFRenderer renderer = new PDFRenderer(pdDocument);

        String imgPDFName = file.getName().substring(0, file.getName().lastIndexOf('.'));
        String imgFolderPath = dstImgFolder.isEmpty()
            ? (file.getParent() + File.separator + imgPDFName)
            : (dstImgFolder + File.separator + imgPDFName);

        if (!createDirectory(imgFolderPath)) {
            System.out.println("PDF to image conversion failed: could not create " + imgFolderPath);
            return false;
        }

        // PDFRenderer has no page count of its own, so ask the document
        for (int i = 0; i < pdDocument.getNumberOfPages(); ++i) {
            BufferedImage image = renderer.renderImageWithDPI(i, dpi);
            // formatNumber is a helper that this article does not show
            String imgFilePath = imgFolderPath + File.separator + imgPDFName
                + "_" + formatNumber(i + 1) + ".jpg";

            ImageWriter writer = ImageIO.getImageWritersByFormatName("jpg").next();
            try (ImageOutputStream out = ImageIO.createImageOutputStream(new File(imgFilePath))) {
                writer.setOutput(out);
                ImageWriteParam param = writer.getDefaultWriteParam();
                param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
                param.setCompressionQuality(0.3f); // low quality, small files
                writer.write(null, new IIOImage(image, null, null), param);
            } finally {
                writer.dispose(); // release the writer's resources
            }
        }

        System.out.println("PDF converted to images: " + imgFolderPath);
        return true;
    } catch (IOException e) {
        e.printStackTrace();
        return false;
    }
}

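How the code works

Parameters: the method takes three arguments: the path of the PDF file, the destination folder for the images, and the rendering resolution in DPI.

Directory creation: createDirectory checks for the destination folder and creates it if necessary. When dstImgFolder is empty, a folder named after the PDF file is created next to the PDF.

Page loop: the code visits every page of the document and renders it to an image with PDFRenderer.

Saving: each rendered page is written as a JPG file with a compression quality of 0.3. The file name includes the page number so that the pages can be told apart.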
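The method calls formatNumber, which the article does not define. The sketch below shows one plausible implementation together with the file naming it feeds into. It assumes that formatNumber pads the page number with zeros to three digits; the real helper may behave differently, and PageImagePaths and pageImagePath are hypothetical names.

```java
import java.io.File;
import java.util.Locale;

/** Hypothetical stand-in for the missing helper; the three-digit padding is an assumption. */
final class PageImagePaths {

    /** Left-pads the page number with zeros, e.g. 1 becomes "001". */
    static String formatNumber(int pageNumber) {
        return String.format(Locale.ROOT, "%03d", pageNumber);
    }

    /** Mirrors the naming in pdf2Image: folder, separator, PDF name, underscore, page number, ".jpg". */
    static String pageImagePath(String imgFolderPath, String imgPDFName, int pageIndex) {
        // pageIndex is 0-based (as passed to the renderer); file names use 1-based page numbers
        return imgFolderPath + File.separator + imgPDFName + "_" + formatNumber(pageIndex + 1) + ".jpg";
    }

    public static void main(String[] args) {
        System.out.println(pageImagePath("out", "report", 0));
    }
}
```

Zero padding keeps the files in page order when a directory listing sorts them by name, at least for documents of up to 999 pages with this width.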

Helper methods

The examples above rely on a few helper methods:

Date formatting: dateFormat turns a Calendar value into a string in the yyyy-MM-dd HH:mm:ss format:

private static String dateFormat(Calendar calendar) {
    if (calendar == null) {
        return null; // the PDF does not carry this date
    }

    String pattern = "yyyy-MM-dd HH:mm:ss";
    SimpleDateFormat format = new SimpleDateFormat(pattern);
    return format.format(calendar.getTime());
}
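SimpleDateFormat formats in the JVM's default time zone, so the same PDF can print different times on different machines. As a hedged alternative, the sketch below uses java.time and keeps the time zone stored in the Calendar itself, appending the UTC offset to the output. PdfDates and its pattern are illustrative choices, not part of the original code.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

/** Hypothetical alternative to dateFormat that preserves the Calendar's own time zone. */
final class PdfDates {

    // "xxx" prints the UTC offset as +HH:mm, for example +08:00
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss xxx");

    static String format(Calendar calendar) {
        if (calendar == null) {
            return null; // same null handling as dateFormat
        }
        ZonedDateTime dateTime =
                calendar.toInstant().atZone(calendar.getTimeZone().toZoneId());
        return FORMAT.format(dateTime);
    }

    public static void main(String[] args) {
        Calendar created = new GregorianCalendar(TimeZone.getTimeZone("GMT+08:00"));
        created.clear();
        created.set(2024, Calendar.MARCH, 5, 14, 30, 15);
        System.out.println(format(created)); // prints 2024-03-05 14:30:15 +08:00
    }
}
```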

Directory creation: createDirectory checks whether the given directory exists and creates it if it does not:

private static boolean createDirectory(String folder) {
    File dir = new File(folder);
    if (dir.exists()) {
        return true;
    } else {
        return dir.mkdirs();
    }
}
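File.mkdirs only reports success or failure, and dir.exists() also returns true when the path is a regular file rather than a directory. A minimal sketch of an alternative based on java.nio.file follows; it throws an IOException describing the problem instead of returning false. Directories and ensureDirectory are hypothetical names, and callers would need to handle the exception.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical NIO-based variant of createDirectory. */
final class Directories {

    /**
     * Creates the folder and any missing parents. Succeeds if the directory already
     * exists; throws IOException if the path exists but is not a directory.
     */
    static Path ensureDirectory(String folder) throws IOException {
        return Files.createDirectories(Paths.get(folder));
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("pdf-images");
        Path created = ensureDirectory(base.resolve("report").toString());
        System.out.println(Files.isDirectory(created)); // prints true
    }
}
```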

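Summary

This article showed how to extract content from PDF documents with PDFBox: plain text with PDFTextStripper, document metadata and embedded images through PDDocumentInformation and PDResources, and whole pages rendered to images with PDFRenderer. With the Maven dependencies in place and a few small helper methods, these tasks take only a modest amount of code. We hope it helps with your own PDF processing work.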

Reposted from: http://hovfk.baihongyu.com/
