Friday, November 23, 2012

Convert PDF documents to SVG for Inkscape

There is now a project that may work for you. It was made with Inkscape in mind. You can try it from http://www.cityinthesky.co.uk/pdf2svg.html. If that doesn't work, try the awful method below.

I searched and searched the web for a project that could convert PDF documents into an editable format for some free software on Linux. I use Ubuntu daily at my workplace and only use a single Windows application for accounting software via RDP / VMPlayer. I needed a way to be able to modify PDF documents and I have Inkscape with all the features I need, except a functioning PDF import filter. Inkscape uses pstoedit which doesn't extract embedded raster images and convert them for the SVG format. There is a plugin for pstoedit to convert to SVG for $50, but it caused a segfault on my machine. So no joy with a single tool to automagically convert my documents.

I spent a loooong time trying different formats from pstoedit and converting through various other software to end up with an SVG file that I can use in Inkscape with included raster images. The following is the process I used to create the best results. The steps I have used are for creating one page at a time. Some more automation could be added to make most of this work without all these steps, as well as to do each page in the PDF document automatically.

Step 1
Create a working directory for this project. You will likely be creating hundreds of files during this process and it can get messy if it's mixed in with other documents.

Step 2
Make a Level 1 Postscript file from a page in the PDF document. When extracting pages from a PDF document, pstoedit's intermediate file formats have no support for Level 2 Postscript raster images. We'll create a Level 1 Postscript file from a PDF page that pstoedit will handle correctly.

pdf2ps -dFirstPage=pagenum -dLastPage=pagenum -dLanguageLevel=1 document.pdf page.ps

 

Step 3
Convert the page to the fig format. The pstoedit tool does this job decently, and creates tons of files in the process. This process creates a .fig document with all of the images in separate EPS files. The .fig document simply references the image files to be included. At this point you could use xfig to make modifications, but it would be horribly slow and difficult to work with. This process will also rasterize all text, but I'll show in a later step how to get vectors back in your document.

pstoedit -f fig page.ps page.fig

 

Step 4
Convert from the fig format to the SVG format. When this is done, the SVG document contains only some formatting and placement information, while referencing all of the external EPS files.

fig2dev -L svg page.fig page.svg

 

Step 5
Convert EPS images to PNG images for use with Inkscape. Inkscape can't import those raster EPS files, only vector EPS files (it uses pstoedit to do the conversions, so it is limited by that tool). The EPS images must be converted to an image format that Inkscape can use. I chose PNG because the format is free, standardized, and lossless. Unfortunately, there is a problem that prevents a direct conversion. When pstoedit creates the EPS files with embedded raster images, an EPS file may specify an incorrect image size or formatting. This shows up as white lines surrounding the raster images, and the white lines are inconsistent between images. The fix is to extract the raster image data from the EPS file while ignoring the sizing and formatting the EPS file specifies. The way to do this is rather ugly, but it works. First, the EPS files are converted to PDF files. Second, the pdfimages tool extracts the raster images from the PDF files. Third, the ImageMagick convert tool converts the extracted images to PNG files.

#!/bin/sh
mkdir tmpimages
for epsfile in *.eps
do echo "${epsfile}"
convert "${epsfile}" "${epsfile}.pdf"
pdfimages "${epsfile}.pdf" tmpimages/
convert tmpimages/* "${epsfile}.png"
rm -f "${epsfile}.pdf"
rm -f tmpimages/*
done
rm -rf tmpimages/

 

Step 6
Change all the references in the SVG file from .eps to .eps.png. Open your favorite text editor (or get creative with sed) and change all .eps to .eps.png.
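If you would rather script this step than edit by hand, here is a minimal Java sketch of the same replacement. The class name EpsReferenceRewriter and the sample <image> line are made up for illustration; I have not checked exactly what attributes fig2dev writes. The pattern skips references that already end in .eps.png, so running it twice is harmless.

```java
import java.util.regex.Pattern;

public class EpsReferenceRewriter {
    // Matches ".eps" only when it is not already followed by ".png",
    // so a second run does not produce ".eps.png.png".
    private static final Pattern EPS_REF = Pattern.compile("\\.eps(?!\\.png)");

    /** Returns the SVG text with every ".eps" reference changed to ".eps.png". */
    public static String rewrite(String svgText) {
        return EPS_REF.matcher(svgText).replaceAll(".eps.png");
    }

    public static void main(String[] args) {
        // Hypothetical sample line; a real page.svg would be read from disk.
        String line = "<image xlink:href=\"page001.eps\" width=\"10\" height=\"10\"/>";
        System.out.println(rewrite(line));
    }
}
```

To apply it to a real file, read page.svg into a string, pass it through rewrite, and write the result back.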

Step 7
Add the proper attributes in the <svg> tag to the SVG document to allow Inkscape to open it. For whatever reason, Inkscape won't recognize the images in the SVG document unless the following attributes are added to the <svg> tag:

xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:cc="http://web.resource.org/cc/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
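The same edit can be scripted. The sketch below is only an illustration under simple assumptions: the class name SvgNamespaceAdder is invented, it looks at the first <svg tag only, and it assumes no attribute value in that tag contains a ">" character. It adds each attribute from the list above unless the tag already declares it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SvgNamespaceAdder {
    // Namespace declarations from Step 7, kept in the same order.
    private static final Map<String, String> NAMESPACES = new LinkedHashMap<>();
    static {
        NAMESPACES.put("xmlns:dc", "http://purl.org/dc/elements/1.1/");
        NAMESPACES.put("xmlns:cc", "http://web.resource.org/cc/");
        NAMESPACES.put("xmlns:rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#");
        NAMESPACES.put("xmlns:svg", "http://www.w3.org/2000/svg");
        NAMESPACES.put("xmlns:xlink", "http://www.w3.org/1999/xlink");
        NAMESPACES.put("xmlns:sodipodi", "http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd");
        NAMESPACES.put("xmlns:inkscape", "http://www.inkscape.org/namespaces/inkscape");
    }

    /** Adds every Step 7 attribute that the first <svg ...> tag does not already declare. */
    public static String addNamespaces(String svgText) {
        int start = svgText.indexOf("<svg");
        if (start < 0) {
            return svgText; // no <svg> tag found, leave the text alone
        }
        int end = svgText.indexOf('>', start);
        if (end < 0) {
            return svgText; // unterminated tag, leave the text alone
        }
        String tag = svgText.substring(start, end);
        StringBuilder extra = new StringBuilder();
        for (Map.Entry<String, String> ns : NAMESPACES.entrySet()) {
            if (!tag.contains(ns.getKey() + "=")) {
                extra.append(' ').append(ns.getKey())
                     .append("=\"").append(ns.getValue()).append('"');
            }
        }
        // Insert right after "<svg" so the existing attributes stay untouched.
        int insertAt = start + "<svg".length();
        return svgText.substring(0, insertAt) + extra + svgText.substring(insertAt);
    }

    public static void main(String[] args) {
        System.out.println(addNamespaces("<svg width=\"5\" height=\"5\"></svg>"));
    }
}
```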

 

Step 8
Extract vector information from the PDF document. If the PDF document contains text or vector information, it will have been converted to raster images as described earlier. The pstoedit tool can do this job as well.

pstoedit -f plot-svg -page pagenum -ssp -mergetext document.pdf page_text.svg

 

Step 9
Place the SVG text in the SVG document with the images.

a) Open both SVG documents with Inkscape.
b) Switch to the document that has the text only.
c) Select all the items then group them together as a single object.
d) Copy that object and paste it into the SVG document with the graphics.
e) Send the SVG text below the rasterized text in the document.
f) Delete all of the rasterized text to show only the SVG text below it.

 

Notes
It is my understanding that converting to fig documents may distort your units because the format's resolution is not high enough. I have not confirmed this. You may want to scale the document up by a large factor (10x), do the conversion to the xfig format, then use xfig to scale back down before exporting.

Known Problems
Some raster images may not be clipped properly. This must be fixed manually. This is only a formatting issue and does not affect the quality of the images. Text may be converted to paths and may not be editable.

Warning
This could possibly take a LOT of disk space. Make sure to have at least several hundred megabytes free. It could also take a lot of memory and time for Inkscape to open the document for the first time until all the rasterized text is removed. Please be patient.

pdf2svg

Under Linux there aren't many freely available vector graphics editors and as far as I know there are none that can edit EPS (encapsulated postscript) and PDF (portable document format) files. I produce lots of these files in my day-to-day work and I would like to be able to edit them. The best vector graphics editor I have found so far is Inkscape but it only reads SVG files… (Edit: recent versions can import PDFs but I'm not entirely happy with how text is imported; in particular, that fonts are not imported from the PDF.)

To overcome this problem I have written a very small utility to convert PDF files to SVG files using Poppler and Cairo. Version 0.2.1 is available here (with modifications by Matthew Flaschen and Ed Grace). This appears to work on any PDF document that Poppler can read (try them in XPDF or Evince since they both use Poppler).

So now it is possible to easily edit PDF documents with your favourite SVG editor! One other alternative would be to use pstoedit, but the commercial SVG module costs money (unsurprisingly!) and the free SVG module is not very good at handling text…

To install

tar -zxf pdf2svg-0.2.1.tar.gz
cd pdf2svg-0.2.1
./configure --prefix=/usr/local
make
make install

To use

pdf2svg <input.pdf> <output.svg> [<pdf page no. or "all" >]

Note: if you specify all the pages you must give a filename with %d in it (which will automatically be replaced by the appropriate page number). E.g.

pdf2svg input.pdf output_page%d.svg all
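If you drive pdf2svg from a program, it helps to check the %d rule before launching it. The following is a small Java sketch, not part of pdf2svg itself; the class and method names are invented for illustration, and it assumes pdf2svg is on the PATH when you actually run the command.

```java
import java.util.ArrayList;
import java.util.List;

public class Pdf2SvgCommand {
    /** Builds the argument list for converting a single page. */
    public static List<String> forPage(String pdf, String svg, int page) {
        List<String> command = new ArrayList<>();
        command.add("pdf2svg");
        command.add(pdf);
        command.add(svg);
        command.add(Integer.toString(page));
        return command;
    }

    /** Builds the argument list for "all"; the output name must contain %d. */
    public static List<String> forAllPages(String pdf, String svgPattern) {
        if (!svgPattern.contains("%d")) {
            throw new IllegalArgumentException(
                    "Output name must contain %d when converting all pages: " + svgPattern);
        }
        List<String> command = new ArrayList<>();
        command.add("pdf2svg");
        command.add(pdf);
        command.add(svgPattern);
        command.add("all");
        return command;
    }

    public static void main(String[] args) {
        List<String> command = forAllPages("input.pdf", "output_page%d.svg");
        System.out.println(String.join(" ", command));
        // To run it for real (requires pdf2svg to be installed):
        // int exit = new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```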

convert pdf to svg

I want to convert PDF to SVG. Please suggest some libraries or executables that can do this efficiently. I have written my own Java program using the Apache PDFBox and Batik libraries:

PDDocument document = PDDocument.load(pdfFile);
DOMImplementation domImpl = GenericDOMImplementation.getDOMImplementation();

// Create an instance of org.w3c.dom.Document.
String svgNS = "http://www.w3.org/2000/svg";
Document svgDocument = domImpl.createDocument(svgNS, "svg", null);
SVGGeneratorContext ctx = SVGGeneratorContext.createDefault(svgDocument);
ctx.setEmbeddedFontsOn(true);

// Ask the test to render into the SVG Graphics2D implementation.
for (int i = 0; i < document.getNumberOfPages(); i++) {
    String svgFName = svgDir + "page" + i + ".svg";
    (new File(svgFName)).createNewFile();
    // Create an instance of the SVG Generator.
    SVGGraphics2D svgGenerator = new SVGGraphics2D(ctx, false);
    Printable page = document.getPrintable(i);
    page.print(svgGenerator, document.getPageFormat(i), i);
    svgGenerator.stream(svgFName);
}

This solution works great, but the resulting SVG files are huge (many times larger than the PDF). I have figured out where the problem is by looking at the SVG in a text editor: it encloses every character of the original document in its own block, even if the font properties of the characters are the same. For example, the word hello will appear as five separate text blocks. Is there a way to fix the above code? Or please suggest another solution that will work more efficiently.

 

Answers

Inkscape can also be used to convert PDF to SVG. It's actually remarkably good at this, and although the code that it generates is a bit bloated, at the very least it doesn't seem to have the particular issue that you are encountering in your program. I think it would be challenging to integrate it directly into Java, but Inkscape provides a convenient command-line interface to this functionality, so probably the easiest way to access it would be via a system call.

To use Inkscape's command-line interface to convert a PDF to an SVG, use:

inkscape -l out.svg in.pdf  

Which you can then probably call using:

Runtime.getRuntime().exec("inkscape -l out.svg in.pdf")  

http://download.oracle.com/javase/1.4.2/docs/api/java/lang/Runtime.html#exec%28java.lang.String%29

Note that exec() does not wait for the command to finish: it returns a Process object immediately, so call waitFor() on that Process before reading "out.svg". In any case, Googling "java system call" will yield more info on how to do that part correctly.
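For what it's worth, here is a rough sketch of the system call using ProcessBuilder rather than Runtime.exec(). Starting a process returns right away, so the sketch calls waitFor() to block until Inkscape exits before anyone reads the output file. The class name InkscapeConverter is made up, the arguments simply mirror the command line above, and it assumes inkscape is on the PATH.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class InkscapeConverter {
    /** Arguments mirror the command shown above: inkscape -l out.svg in.pdf */
    public static List<String> command(String pdfPath, String svgPath) {
        return Arrays.asList("inkscape", "-l", svgPath, pdfPath);
    }

    /** Starts Inkscape, blocks until it exits, and returns its exit code. */
    public static int convert(String pdfPath, String svgPath)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(pdfPath, svgPath))
                .inheritIO() // show Inkscape's own messages on our console
                .start();
        return process.waitFor(); // the SVG is only safe to read after this returns
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 2) {
            System.out.println("inkscape exited with " + convert(args[0], args[1]));
        } else {
            // Without arguments, just show the command that would be run.
            System.out.println(String.join(" ", command("in.pdf", "out.svg")));
        }
    }
}
```

Passing the arguments as a list avoids the quoting problems you get when file names contain spaces.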

OR

Take a look at pdf2svg:

To use

pdf2svg <input.pdf> <output.svg> [<pdf page no. or "all" >]  

When using all give a filename with %d in it (which will be replaced by the page number).

pdf2svg input.pdf output_page%d.svg all  

And for some troubleshooting see: http://www.calcmaster.net/personal_projects/pdf2svg/

JavaScript in PDF to HTML5 Conversion: The Event Object

Previously I mentioned that a key component in understanding JavaScript in PDFs was understanding how events work. In this blog article I will go into a little more depth on that subject.

Within the JavaScript for Acrobat API Reference there is a section describing the event object (it's actually one of the smaller objects if you have a look at the others) that explains briefly how events work within a PDF. This is one of the most vital objects used within PDFs containing JavaScript as it controls the output and results of each user driven event.

For instance, if the event in question is a Keystroke it runs the code for the Keystroke event, making use of an event object to store information on the event and ultimately, decide what output occurs.

After or during the Keystroke event a call can be made to the Validate events which in turn can follow down to other events including the Calculate and Format events.

If the event is not a Keystroke event we start at the Validate event. This is actually a gross simplification of how events work, since there are also many other possible events and the Calculate event can generate additional Validate, Blur and Focus events. But for now I will focus on Keystroke events since they are, from what I can tell, the simplest event that can occur on a form field.

In our latest version of the converter we map the keystroke actions onto the onKeyPress attributes of input objects, taking into account how the events work within PDF files. For the most part this works correctly, as the two seem to have a one-to-one relationship with each other. Originally we tried mapping onto the onKeyDown attribute because browsers implement key presses differently, but onKeyDown had some undesirable properties, so eventually we wrote a workaround for the browser differences instead.

One major thing to consider about the conversion is that Calculate events also occur for all Field objects (that have them) whenever you leave a Field (on a blur), which in turn can generate more Validate and Format events. So we move the code from each Calculate event into a global function within our JavaScript that is then run from each field item's onBlur attribute, resulting in similar behaviour to the PDF.
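To make the idea concrete, here is a toy model of that arrangement written in Java rather than the generated JavaScript. It is only a sketch of the dispatch pattern, not our converter's code: the class CalculateOnBlur, its methods and the field names are all invented. Every Calculate script is registered once, and a single shared function reruns all of them whenever any field loses focus.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

public class CalculateOnBlur {
    // Current field values, keyed by field name.
    private final Map<String, Double> values = new LinkedHashMap<>();
    // One Calculate script per field that has one, kept in registration order.
    private final Map<String, Consumer<Map<String, Double>>> calculators = new LinkedHashMap<>();

    public void setValue(String field, double value) {
        values.put(field, value);
    }

    public double getValue(String field) {
        return values.getOrDefault(field, 0.0);
    }

    /** Registers the Calculate script that belongs to the given field. */
    public void addCalculate(String field, Consumer<Map<String, Double>> script) {
        calculators.put(field, script);
    }

    /**
     * The shared "global function": every field's blur handler calls this.
     * The name of the field that lost focus is not needed, because all
     * Calculate scripts are rerun regardless of which field was left.
     */
    public void onBlur(String fieldThatLostFocus) {
        for (Consumer<Map<String, Double>> script : calculators.values()) {
            script.accept(values);
        }
    }

    public static void main(String[] args) {
        CalculateOnBlur form = new CalculateOnBlur();
        form.addCalculate("total", v ->
                v.put("total", v.getOrDefault("price", 0.0) * v.getOrDefault("quantity", 0.0)));
        form.setValue("price", 2.5);
        form.setValue("quantity", 4);
        form.onBlur("quantity");
        System.out.println(form.getValue("total"));
    }
}
```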

It's quite a hard subject to explain in writing, so here is an example:

This is the PDF

And this is the HTML

As you can see, we have yet to fully implement the formatting for the calculated values, but it's well on its way!

http://blog.idrsolutions.com/2012/11/javascript-in-pdf-to-html5-conversion-the-event-object/

TIFF image file output (ImageEncoder)

 


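See the following for how to write a TIFF file.
It uses JAI as the library. Download it from:

http://java.sun.com/javase/technologies/desktop/media/jai/
Edit the part that draws on the graphics to suit your needs.
If you pass encParam.setCompression a scheme with a higher compression ratio than
TIFFEncodeParam.COMPRESSION_PACKBITS, it does not work properly.
It may just be that my implementation is at fault...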



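// Requires the JAI codec classes (com.sun.media.jai.codec) on the classpath.
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.FileOutputStream;
import com.sun.media.jai.codec.ImageCodec;
import com.sun.media.jai.codec.ImageEncoder;
import com.sun.media.jai.codec.TIFFEncodeParam;

// imageWidth, imageHeight and outputFileName are supplied by the caller.
int imageType = BufferedImage.TYPE_BYTE_INDEXED;
BufferedImage map = new BufferedImage(imageWidth, imageHeight, imageType);
Graphics2D g = map.createGraphics();

// Draw on the graphics here.
g.dispose();

// File output
FileOutputStream fos = new FileOutputStream(outputFileName);
TIFFEncodeParam encParam = new TIFFEncodeParam();
encParam.setCompression(TIFFEncodeParam.COMPRESSION_PACKBITS);
ImageEncoder encImage = ImageCodec.createImageEncoder("tiff", fos, encParam);
encImage.encode(map);
fos.close();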
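As a point of comparison, and not the JAI approach shown above, here is a sketch that writes a TIFF with the javax.imageio API. It assumes Java 9 or later, where a TIFF writer is expected to be available through ImageIO, and it asks for "LZW" compression by name; the class name TiffWriterSketch and the tiny test image are made up for illustration. I have not compared its output with what JAI produces.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Iterator;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

public class TiffWriterSketch {
    /** Writes the image as a TIFF file using the named compression type. */
    public static void writeTiff(BufferedImage image, File target, String compressionType)
            throws IOException {
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("tiff");
        if (!writers.hasNext()) {
            throw new IOException("No TIFF writer found (a Java 9+ runtime is assumed)");
        }
        ImageWriter writer = writers.next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionType(compressionType);
        // Remove any old file first so stale bytes cannot remain at the end.
        Files.deleteIfExists(target.toPath());
        try (ImageOutputStream out = ImageIO.createImageOutputStream(target)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(image, null, null), param);
        } finally {
            writer.dispose();
        }
    }

    public static void main(String[] args) throws IOException {
        BufferedImage map = new BufferedImage(64, 64, BufferedImage.TYPE_BYTE_INDEXED);
        Graphics2D g = map.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(8, 8, 48, 48); // stand-in for the real drawing code
        g.dispose();
        File target = File.createTempFile("sketch", ".tif");
        writeTiff(map, target, "LZW");
        System.out.println(target + " (" + target.length() + " bytes)");
    }
}
```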

PDFBox

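PDFBox is an open-source library for working with PDF files. Its main features are:

  • Extracting text from PDF files
  • Merging PDF files
  • Encrypting and decrypting PDF files
  • Integration with the Lucene search engine
  • Embedding FDF data
  • Converting images to PDF and extracting images from PDF

License: Apache License, Version 2.0

Sources

Samples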

Reading a PDF file

import java.io.FileInputStream;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;

String readFile = "xxx.pdf";
PDDocument pdf = null; // document object
FileInputStream pdfStream = null;
try {
    pdfStream = new FileInputStream(readFile);
    PDFParser pdfParser = new PDFParser(pdfStream);
    pdfParser.parse(); // parse the file
    pdf = pdfParser.getPDDocument();
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (pdfStream != null) {
        pdfStream.close();
    }
}

Writing a PDF file

import java.io.FileOutputStream;
import org.apache.pdfbox.pdfwriter.COSWriter;

String writeFile = "xxx.pdf";
COSWriter writer = null;
FileOutputStream stream = null;
try {
    stream = new FileOutputStream(writeFile);
    writer = new COSWriter(stream);
    writer.write(pdf); // write out the document object
} catch (Exception e) {
    e.printStackTrace();
} finally {
    // Close the writer before the stream it wraps.
    if (writer != null) {
        writer.close();
    }
    if (stream != null) {
        stream.close();
    }
}

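Filling in a field

import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

String name = "title"; // name of the field
String value = "タイトルです"; // string to put into the field
PDDocumentCatalog docCatalog = pdf.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
PDField field = acroForm.getField(name); // look up the field
if (field != null) {
    field.setValue(value); // fill in the field
} else {
    System.err.println("Field not found: " + name);
}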

Extracting images from a PDF

import java.io.FileInputStream;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

String readFile = "C:\\tmp\\Antenna_Data_Sheet.pdf";
PDDocument pdf = null; // document object
FileInputStream pdfStream = null;
try {
    pdfStream = new FileInputStream(readFile);
    PDFParser pdfParser = new PDFParser(pdfStream);
    pdfParser.parse(); // parse the file
    pdf = pdfParser.getPDDocument();
    int imageCounter = 1;
    List pages = pdf.getDocumentCatalog().getAllPages();
    Iterator iter = pages.iterator();
    while (iter.hasNext()) { // extract images from every page
        PDPage page = (PDPage) iter.next();
        PDResources resources = page.getResources();
        Map images = resources.getImages();
        if (images != null) {
            Iterator imageIter = images.keySet().iterator();
            while (imageIter.hasNext()) {
                String key = (String) imageIter.next();
                PDXObjectImage image = (PDXObjectImage) images.get(key);
                String name = key + "-" + imageCounter;
                imageCounter++;
                System.out.println("Writing image:" + name);
                image.write2file(name); // write the image to a file
            }
        }
    }
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (pdfStream != null) {
        pdfStream.close();
    }
}