Package org.apache.pdfbox.pdfparser
Class NonSequentialPDFParser
- java.lang.Object
  - org.apache.pdfbox.pdfparser.BaseParser
    - org.apache.pdfbox.pdfparser.PDFParser
      - org.apache.pdfbox.pdfparser.NonSequentialPDFParser
public class NonSequentialPDFParser extends PDFParser
A PDFParser which first reads startxref and the xref tables in order to know the valid objects and parse only these objects. Thus it is closer to a conforming parser than the sequential reading of PDFParser. This class can be used as a PDFParser replacement. parse() must be called first, before page objects can be retrieved, e.g. via getPDDocument(). This class is a much enhanced version of QuickParser, presented in PDFBOX-1104 by Jeremy Villalobos.
-
-
Field Summary
Fields (Modifier and Type, Field, Description):
- protected static int DEFAULT_TRAIL_BYTECOUNT
- protected static char[] EOF_MARKER: EOF-marker.
- protected static char[] OBJ_MARKER: obj-marker.
- protected SecurityHandler securityHandler: The security handler.
- protected static char[] STARTXREF_MARKER: StartXRef-marker.
- static java.lang.String SYSPROP_EOFLOOKUPRANGE
- static java.lang.String SYSPROP_PARSEMINIMAL
- static java.lang.String TMP_FILE_PREFIX
Fields inherited from class org.apache.pdfbox.pdfparser.PDFParser
isFDFDocment, xrefTrailerResolver
-
Fields inherited from class org.apache.pdfbox.pdfparser.BaseParser
DEF, document, ENDOBJ, ENDSTREAM, forceParsing, pdfSource, PROP_PUSHBACK_SIZE
-
-
Constructor Summary
Constructors (Constructor, Description):
- NonSequentialPDFParser(java.io.File file, RandomAccess raBuf): Constructs parser for given file using given buffer for temporary storage.
- NonSequentialPDFParser(java.io.File file, RandomAccess raBuf, java.lang.String decryptionPassword): Constructs parser for given file using given buffer for temporary storage.
- NonSequentialPDFParser(java.io.InputStream input): Constructor.
- NonSequentialPDFParser(java.io.InputStream input, RandomAccess raBuf, java.lang.String decryptionPassword): Constructor.
- NonSequentialPDFParser(java.lang.String filename): Constructs parser for given file using memory buffer.
-
Method Summary
Methods (Modifier and Type, Method, Description):
- protected void decrypt(COSBase pb, int objNr, int objGenNr): Decrypts given object.
- protected void decryptDictionary(COSDictionary dict, long objNr, long objGenNr)
- protected void decryptString(COSString str, long objNr, long objGenNr): Decrypts given COSString.
- protected void deleteTempFile(): Remove the temporary file.
- PDPage getPage(int pageNr): Returns the page requested with all the objects loaded into it.
- int getPageNumber(): Returns the number of pages in a document.
- PDDocument getPDDocument(): This will get the PD document that was parsed.
- protected java.io.File getPdfFile(): Return the pdf file.
- SecurityHandler getSecurityHandler(): Returns security handler of the document or null if document is not encrypted or parse() wasn't called before.
- protected long getStartxrefOffset(): Looks for and parses startxref.
- protected void initialParse(): The initial parse will first parse only the trailer, the xrefstart and all xref tables to have a pointer (offset) to all the pdf's objects.
- boolean isLenient(): Return true if parser is lenient.
- protected int lastIndexOf(char[] pattern, byte[] buf, int endOff): Searches last appearance of pattern within buffer.
- void parse(): This will parse the stream and populate the COSDocument object.
- protected COSStream parseCOSStream(COSDictionary dic, RandomAccess file): This will read a COSStream from the input stream using length attribute within dictionary.
- protected COSBase parseObjectDynamically(int objNr, int objGenNr, boolean requireExistingNotCompressedObj): This will parse the next object from the stream and add it to the local state.
- protected COSBase parseObjectDynamically(COSObject obj, boolean requireExistingNotCompressedObj): This will parse the next object from the stream and add it to the local state.
- protected void readPattern(char[] pattern): Reads given pattern from BaseParser.pdfSource.
- protected void releasePdfSourceInputStream(): Enable handling of alternative pdfSource implementation.
- void setEOFLookupRange(int byteCount): Sets how many trailing bytes of PDF file are searched for EOF marker and 'startxref' marker.
- void setLenient(boolean lenient): Change the parser leniency flag.
- protected void setPdfSource(long fileOffset): Sets BaseParser.pdfSource to start next parsing at given file offset.
Methods inherited from class org.apache.pdfbox.pdfparser.PDFParser
clearResources, getDocument, getFDFDocument, isContinueOnError, parseHeader, parseStartXref, parseTrailer, parseXrefStream, parseXrefStream, parseXrefTable, readVersionInTrailer, setTempDirectory
-
Methods inherited from class org.apache.pdfbox.pdfparser.BaseParser
isClosing, isClosing, isEndOfName, isEOL, isEOL, isWhitespace, isWhitespace, parseBoolean, parseCOSArray, parseCOSDictionary, parseCOSName, parseCOSString, parseCOSString, parseDirObject, readExpectedString, readGenerationNumber, readInt, readLine, readLong, readObjectNumber, readString, readString, readStringNumber, readUntilEndStream, setDocument, skipSpaces
-
-
-
-
Field Detail
-
SYSPROP_PARSEMINIMAL
public static final java.lang.String SYSPROP_PARSEMINIMAL
- See Also:
- Constant Field Values
-
SYSPROP_EOFLOOKUPRANGE
public static final java.lang.String SYSPROP_EOFLOOKUPRANGE
- See Also:
- Constant Field Values
-
DEFAULT_TRAIL_BYTECOUNT
protected static final int DEFAULT_TRAIL_BYTECOUNT
- See Also:
- Constant Field Values
-
EOF_MARKER
protected static final char[] EOF_MARKER
EOF-marker.
-
STARTXREF_MARKER
protected static final char[] STARTXREF_MARKER
StartXRef-marker.
-
OBJ_MARKER
protected static final char[] OBJ_MARKER
obj-marker.
-
securityHandler
protected SecurityHandler securityHandler
The security handler.
-
TMP_FILE_PREFIX
public static final java.lang.String TMP_FILE_PREFIX
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
NonSequentialPDFParser
public NonSequentialPDFParser(java.lang.String filename) throws java.io.IOException
Constructs parser for given file using memory buffer.
- Parameters:
filename - the filename of the pdf to be parsed
- Throws:
java.io.IOException - If something went wrong.
-
NonSequentialPDFParser
public NonSequentialPDFParser(java.io.File file, RandomAccess raBuf) throws java.io.IOException
Constructs parser for given file using given buffer for temporary storage.
- Parameters:
file - the pdf to be parsed
raBuf - the buffer to be used for parsing
- Throws:
java.io.IOException - If something went wrong.
-
NonSequentialPDFParser
public NonSequentialPDFParser(java.io.File file, RandomAccess raBuf, java.lang.String decryptionPassword) throws java.io.IOException
Constructs parser for given file using given buffer for temporary storage.
- Parameters:
file - the pdf to be parsed
raBuf - the buffer to be used for parsing
decryptionPassword - password to be used for decryption
- Throws:
java.io.IOException - If something went wrong.
-
NonSequentialPDFParser
public NonSequentialPDFParser(java.io.InputStream input) throws java.io.IOException
Constructor.
- Parameters:
input - input stream representing the pdf.
- Throws:
java.io.IOException - If something went wrong.
-
NonSequentialPDFParser
public NonSequentialPDFParser(java.io.InputStream input, RandomAccess raBuf, java.lang.String decryptionPassword) throws java.io.IOException
Constructor.
- Parameters:
input - input stream representing the pdf.
raBuf - the buffer to be used for parsing
decryptionPassword - password to be used for decryption.
- Throws:
java.io.IOException - If something went wrong.
-
-
Method Detail
-
setEOFLookupRange
public void setEOFLookupRange(int byteCount)
Sets how many trailing bytes of the PDF file are searched for the EOF marker and the 'startxref' marker. If not set, the default value DEFAULT_TRAIL_BYTECOUNT is used. If the system property SYSPROP_EOFLOOKUPRANGE is defined, this value will be set on initialization but can be overwritten later.
- Parameters:
byteCount - number of trailing bytes
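The precedence described for setEOFLookupRange (default, then system property at initialization, then an explicit call) can be illustrated with a small standalone sketch. This is not the PDFBox implementation: the class name, the property key and the default of 2048 are hypothetical stand-ins, since the actual values of SYSPROP_EOFLOOKUPRANGE and DEFAULT_TRAIL_BYTECOUNT are not given here.

```java
final class EofLookupRangeSketch {

    // Hypothetical stand-ins; the real constant values are not stated in this document.
    static final String HYPOTHETICAL_SYSPROP = "example.pdfparser.eofLookupRange";
    static final int HYPOTHETICAL_DEFAULT = 2048;

    // Starts out at the default value.
    private int lookupRange = HYPOTHETICAL_DEFAULT;

    EofLookupRangeSketch() {
        // Integer.getInteger returns null if the property is undefined or not a number,
        // in which case the default stays in place.
        Integer fromProperty = Integer.getInteger(HYPOTHETICAL_SYSPROP);
        if (fromProperty != null) {
            setEOFLookupRange(fromProperty);
        }
    }

    // An explicit call made after construction overrides the initial value.
    void setEOFLookupRange(int byteCount) {
        lookupRange = byteCount;
    }

    int getLookupRange() {
        return lookupRange;
    }
}
```

With this sketch, a value passed to setEOFLookupRange after construction always wins over the property, which in turn wins over the default.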
-
initialParse
protected void initialParse() throws java.io.IOException
The initial parse will first parse only the trailer, the xrefstart and all xref tables to have a pointer (offset) to all the pdf's objects. It can handle linearized pdfs, which will have an xref at the end pointing to an xref at the beginning of the file. Last, the root object is parsed.
- Throws:
java.io.IOException - If something went wrong.
-
setPdfSource
protected final void setPdfSource(long fileOffset) throws java.io.IOException
Sets BaseParser.pdfSource to start next parsing at given file offset.
- Parameters:
fileOffset - file offset
- Throws:
java.io.IOException - If something went wrong.
-
releasePdfSourceInputStream
protected final void releasePdfSourceInputStream() throws java.io.IOException
Enable handling of alternative pdfSource implementation.
- Throws:
java.io.IOException - If something went wrong.
-
getStartxrefOffset
protected final long getStartxrefOffset() throws java.io.IOException
Looks for and parses startxref. We first look for the last '%%EOF' marker within the last DEFAULT_TRAIL_BYTECOUNT bytes (or the range set via setEOFLookupRange(int)) and then go back to find startxref.
- Returns:
- the offset of StartXref
- Throws:
java.io.IOException - If something went wrong.
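The two-step lookup described for getStartxrefOffset can be sketched on an in-memory byte array. This is a simplified illustration under stated assumptions, not the PDFBox code: the class and method names are hypothetical, the whole file is assumed to be in memory, and string search stands in for whatever buffer scanning the parser actually does.

```java
import java.nio.charset.StandardCharsets;

final class StartxrefLookupSketch {

    /**
     * Looks only at the last trailByteCount bytes, finds the last '%%EOF' marker,
     * goes back to the last 'startxref' keyword before it and parses the number
     * written between the two.
     */
    static long findStartxrefOffset(byte[] pdf, int trailByteCount) {
        int tailLength = Math.max(0, Math.min(trailByteCount, pdf.length));
        int tailStart = pdf.length - tailLength;
        // ISO-8859-1 maps every byte to exactly one char, so string indices match byte offsets.
        String tail = new String(pdf, tailStart, tailLength, StandardCharsets.ISO_8859_1);

        int eof = tail.lastIndexOf("%%EOF");
        if (eof < 0) {
            throw new IllegalArgumentException("missing '%%EOF' marker in trailing bytes");
        }
        // Search backwards, starting at the position of the EOF marker.
        int keyword = tail.lastIndexOf("startxref", eof);
        if (keyword < 0) {
            throw new IllegalArgumentException("missing 'startxref' before '%%EOF'");
        }
        String digits = tail.substring(keyword + "startxref".length(), eof).trim();
        return Long.parseLong(digits);
    }
}
```

If the lookup range is too small to contain the 'startxref' keyword, the sketch fails, which is the situation a larger range passed to setEOFLookupRange(int) is meant to address.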
-
lastIndexOf
protected int lastIndexOf(char[] pattern, byte[] buf, int endOff)
Searches for the last appearance of pattern within buffer. The lookup starts before endOff and goes back until 0.
- Parameters:
pattern - pattern to search for
buf - buffer to search pattern in
endOff - offset (exclusive) where lookup starts at
- Returns:
- start offset of pattern within buffer, or -1 if pattern could not be found
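A backward search with the contract described for lastIndexOf might look like the following standalone sketch. It mirrors the documented signature, but the body is an assumption written for illustration (including the clamping of endOff to the buffer length), not the PDFBox source.

```java
final class LastIndexOfSketch {

    /**
     * Returns the start offset of the last occurrence of pattern that ends
     * at or before endOff (exclusive), or -1 if there is none.
     */
    static int lastIndexOf(char[] pattern, byte[] buf, int endOff) {
        // Clamp so that an endOff beyond the buffer cannot cause out-of-bounds reads.
        int end = Math.min(endOff, buf.length);
        // Try every candidate start position, from the rightmost one back to 0.
        for (int start = end - pattern.length; start >= 0; start--) {
            int i = 0;
            while (i < pattern.length && buf[start + i] == (byte) pattern[i]) {
                i++;
            }
            if (i == pattern.length) {
                return start;
            }
        }
        return -1;
    }
}
```

Because endOff is exclusive, calling the method again with the previously returned offset plus the pattern length minus one as endOff moves the search to the next earlier occurrence.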
-
readPattern
protected final void readPattern(char[] pattern) throws java.io.IOException
Reads given pattern from BaseParser.pdfSource, skipping whitespace at start and end.
- Parameters:
pattern - pattern to be skipped
- Throws:
java.io.IOException - if pattern could not be read
-
parse
public void parse() throws java.io.IOException
This will parse the stream and populate the COSDocument object. This will close the stream when it is done parsing.
-
getPdfFile
protected java.io.File getPdfFile()
Return the pdf file.
- Returns:
- the pdf file
-
isLenient
public boolean isLenient()
Returns true if the parser is lenient, meaning that the auto-healing capabilities of the parser are used.
- Returns:
- true if parser is lenient
-
setLenient
public void setLenient(boolean lenient) throws java.lang.IllegalArgumentException
Change the parser leniency flag. This method can only be called before the parsing of the file.
- Parameters:
lenient - the new value of the leniency flag
- Throws:
java.lang.IllegalArgumentException - if the method is called after parsing.
-
deleteTempFile
protected void deleteTempFile()
Remove the temporary file. A temporary file is created if this class is instantiated with an InputStream.
-
getSecurityHandler
public SecurityHandler getSecurityHandler()
Returns security handler of the document or null if document is not encrypted or parse() wasn't called before.
- Returns:
- the security handler.
-
getPDDocument
public PDDocument getPDDocument() throws java.io.IOException
This will get the PD document that was parsed. When you are done with this document you must call close() on it to release resources. Overriding the super method was necessary in order to set the security handler.
- Overrides:
getPDDocument in class PDFParser
- Returns:
- The document at the PD layer.
- Throws:
java.io.IOException - If there is an error getting the document.
-
getPageNumber
public int getPageNumber() throws java.io.IOException
Returns the number of pages in a document.
- Returns:
- the number of pages.
- Throws:
java.io.IOException - if PAGES or other needed object is missing
-
getPage
public PDPage getPage(int pageNr) throws java.io.IOException
Returns the page requested with all the objects loaded into it.
- Parameters:
pageNr - starts from 0 to the number of pages.
- Returns:
- the page with the given page number.
- Throws:
java.io.IOException - If something went wrong.
-
parseObjectDynamically
protected final COSBase parseObjectDynamically(COSObject obj, boolean requireExistingNotCompressedObj) throws java.io.IOException
This will parse the next object from the stream and add it to the local state. This is taken from PDFParser and reduced to parsing an indirect object.
- Parameters:
obj - object to be parsed (we only take object number and generation number for lookup start offset)
requireExistingNotCompressedObj - if true object to be parsed must not be contained within compressed stream
- Returns:
- the parsed object (which is also added to document object)
- Throws:
java.io.IOException - If an IO error occurs.
-
parseObjectDynamically
protected COSBase parseObjectDynamically(int objNr, int objGenNr, boolean requireExistingNotCompressedObj) throws java.io.IOException
This will parse the next object from the stream and add it to the local state. This is taken from PDFParser and reduced to parsing an indirect object.
- Parameters:
objNr - object number of object to be parsed
objGenNr - object generation number of object to be parsed
requireExistingNotCompressedObj - if true the object to be parsed must be defined in xref (comment: null objects may be missing from xref) and it must not be a compressed object within object stream (this is used to circumvent being stuck in a loop in a malicious PDF)
- Returns:
- the parsed object (which is also added to document object)
- Throws:
java.io.IOException - If an IO error occurs.
-
decryptDictionary
protected final void decryptDictionary(COSDictionary dict, long objNr, long objGenNr) throws java.io.IOException
- Parameters:
dict - the dictionary to be decrypted
objNr - the object number
objGenNr - the object generation number
- Throws:
java.io.IOException - if something went wrong
-
decryptString
protected final void decryptString(COSString str, long objNr, long objGenNr) throws java.io.IOException
Decrypts given COSString.
- Parameters:
str - the string to be decrypted
objNr - the object number
objGenNr - the object generation number
- Throws:
java.io.IOException - if something went wrong
-
decrypt
protected final void decrypt(COSBase pb, int objNr, int objGenNr) throws java.io.IOException
Decrypts given object.
- Parameters:
pb - the object to be decrypted
objNr - the object number
objGenNr - the object generation number
- Throws:
java.io.IOException - if something went wrong
-
parseCOSStream
protected COSStream parseCOSStream(COSDictionary dic, RandomAccess file) throws java.io.IOException
This will read a COSStream from the input stream using the length attribute within the dictionary. If the length attribute is an indirect reference, it is first resolved to get the stream length. This means we copy stream data without testing for 'endstream' or 'endobj', and thus it is no problem if these keywords occur within the stream. We require 'endstream' to be found after the stream data is read.
- Overrides:
parseCOSStream in class BaseParser
- Parameters:
dic - dictionary that goes with this stream.
file - file to write the stream to when reading.
- Returns:
- parsed pdf stream.
- Throws:
java.io.IOException - if an error occurred reading the stream, like problems with reading length attribute, stream does not end with 'endstream' after data read, stream too short etc.
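The length-based copy described for parseCOSStream can be illustrated with a standalone sketch. It is not the PDFBox implementation: the class and method names are hypothetical, the length is assumed to be already resolved to a plain number, and the data is returned as a byte array instead of being written to a RandomAccess file.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class LengthBasedStreamReadSketch {

    private static final String END_STREAM = "endstream";

    /**
     * Copies exactly length bytes without inspecting them, then requires the
     * 'endstream' keyword (after optional whitespace) to follow the data.
     */
    static byte[] readStreamBody(InputStream in, int length) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int remaining = length;
        while (remaining > 0) {
            int n = in.read(chunk, 0, Math.min(chunk.length, remaining));
            if (n < 0) {
                throw new IOException("stream too short, " + remaining + " bytes missing");
            }
            // Keywords inside the data are copied like any other bytes.
            out.write(chunk, 0, n);
            remaining -= n;
        }
        // Skip whitespace between the data and the closing keyword.
        int c;
        do {
            c = in.read();
        } while (c == ' ' || c == '\t' || c == '\r' || c == '\n');
        for (int i = 0; i < END_STREAM.length(); i++) {
            if (i > 0) {
                c = in.read();
            }
            if (c != END_STREAM.charAt(i)) {
                throw new IOException("expected 'endstream' after stream data");
            }
        }
        return out.toByteArray();
    }
}
```

Since the data is never scanned for keywords, a body that itself contains the text 'endstream' or 'endobj' is read correctly as long as the length is right; a wrong length typically shows up as a missing 'endstream' or a too-short stream.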
-
-