From 419949574df4525c61ffe060ad1c63daf66e806c Mon Sep 17 00:00:00 2001 From: Jay Berkenbilt Date: Thu, 21 Jun 2018 11:23:28 -0400 Subject: Add information about helper classes to the documentation --- manual/qpdf-manual.xml | 331 +++++++++++++++++++++++++++++++++++-------------- 1 file changed, 239 insertions(+), 92 deletions(-) diff --git a/manual/qpdf-manual.xml b/manual/qpdf-manual.xml index 0c8a3291..5287243e 100644 --- a/manual/qpdf-manual.xml +++ b/manual/qpdf-manual.xml @@ -1751,53 +1751,54 @@ outfile.pdf In general, one should adhere strictly to a specification when - writing but be liberal in reading. This way, the product of our + writing but be liberal in reading. This way, the product of our software will be accepted by the widest range of other programs, - and we will accept the widest range of input files. This library + and we will accept the widest range of input files. This library attempts to conform to that philosophy whenever possible but also aims to provide strict checking for people who want to validate - PDF files. If you don't want to see warnings and are trying to + PDF files. If you don't want to see warnings and are trying to write something that is tolerant, you can call - setSuppressWarnings(true). If you want to fail + setSuppressWarnings(true). If you want to fail on the first error, you can call - setAttemptRecovery(false). The default - behavior is to generating warnings for recoverable problems. Note - that recovery will not always produce the desired results even if - it is able to get through the file. Unlike most other PDF files - that produce generic warnings such as “This file is + setAttemptRecovery(false). The default behavior + is to generating warnings for recoverable problems. Note that + recovery will not always produce the desired results even if it is + able to get through the file. 
Unlike most other PDF files that + produce generic warnings such as “This file is damaged,”, qpdf generally issues a detailed error message - that would be most useful to a PDF developer. This is by design - as there seems to be a shortage of PDF validation tools out - there. (This was, in fact, one of the major motivations behind - the initial creation of qpdf.) + that would be most useful to a PDF developer. This is by design as + there seems to be a shortage of PDF validation tools out there. + This was, in fact, one of the major motivations behind the initial + creation of qpdf. Design Goals The QPDF package includes support for reading and rewriting PDF - files. It aims to hide from the user details involving object + files. It aims to hide from the user details involving object locations, modified (appended) PDF files, the directness/indirectness of objects, and stream filters including - encryption. It does not aim to hide knowledge of the object - hierarchy or content stream contents. Put another way, a user of + encryption. It does not aim to hide knowledge of the object + hierarchy or content stream contents. Put another way, a user of the qpdf library is expected to have knowledge about how PDF files work, but is not expected to have to keep track of bookkeeping details such as file positions. A user of the library never has to care whether an object is - direct or indirect. All access to objects deals with this - transparently. All memory management details are also handled by - the library. + direct or indirect, though it is possible to determine whether an + object is direct or not if this information is needed. All access + to objects deals with this transparently. All memory management + details are also handled by the library. The PointerHolder object is used internally - by the library to deal with memory management. 
This is basically - a smart pointer object very similar in spirit to the Boost - library's shared_ptr object, but predating - it by several years. This library also makes use of a technique - for giving fine-grained access to methods in one class to other + by the library to deal with memory management. This is basically a + smart pointer object very similar in spirit to C++-11's + std::shared_ptr object, but predating it by + several years. This library also makes use of a technique for + giving fine-grained access to methods in one class to other classes by using public subclasses with friends and only private members that in turn call private methods of the containing class. See QPDFObjectHandle::Factory as an @@ -1810,29 +1811,20 @@ outfile.pdf files. - QPDFObject is the basic PDF Object class. - It is an abstract base class from which are derived classes for - each type of PDF object. Clients do not interact with Objects - directly but instead interact with - QPDFObjectHandle. - - - QPDFObjectHandle contains - PointerHolder<QPDFObject> and - includes accessor methods that are type-safe proxies to the - methods of the derived object classes as well as methods for - querying object types. They can be passed around by value, - copied, stored in containers, etc. with very low overhead. - Instances of QPDFObjectHandle always - contain a reference back to the QPDF object - from which they were created. A + The primary class for interacting with PDF objects is + QPDFObjectHandle. Instances of this class + can be passed around by value, copied, stored in containers, etc. + with very low overhead. Instances of + QPDFObjectHandle created by reading from a + file will always contain a reference back to the + QPDF object from which they were created. A QPDFObjectHandle may be direct or indirect. If indirect, the QPDFObject the PointerHolder initially points to is a null - pointer. In this case, the first attempt to access the underlying + pointer. 
In this case, the first attempt to access the underlying QPDFObject will result in the QPDFObject being resolved via a call to the - referenced QPDF instance. This makes it + referenced QPDF instance. This makes it essentially impossible to make coding errors in which certain things will work for some PDF files and not for others based on which objects are direct and which objects are indirect. @@ -1848,48 +1840,6 @@ outfile.pdf modified in several ways. See comments in QPDFObjectHandle.hh for details. - - When the QPDF class creates a new object, - it dynamically allocates the appropriate type of - QPDFObject and immediately hands the - pointer to an instance of QPDFObjectHandle. - The parser reads a token from the current file position. If the - token is a not either a dictionary or array opener, an object is - immediately constructed from the single token and the parser - returns. Otherwise, the parser is invoked recursively in a - special mode in which it accumulates objects until it finds a - balancing closer. During this process, the - “R” keyword is recognized and an - indirect QPDFObjectHandle may be - constructed. - - - The QPDF::resolve() method, which is used to - resolve an indirect object, may be invoked from the - QPDFObjectHandle class. It first checks a - cache to see whether this object has already been read. If not, - it reads the object from the PDF file and caches it. It the - returns the resulting QPDFObjectHandle. - The calling object handle then replaces its - PointerHolder<QDFObject> with the one - from the newly returned QPDFObjectHandle. - In this way, only a single copy of any direct object need exist - and clients can access objects transparently without knowing - caring whether they are direct or indirect objects. Additionally, - no object is ever read from the file more than once. 
That means - that only the portions of the PDF file that are actually needed - are ever read from the input file, thus allowing the qpdf package - to take advantage of this important design goal of PDF files. - - - If the requested object is inside of an object stream, the object - stream itself is first read into memory. Then the tokenizer reads - objects from the memory stream based on the offset information - stored in the stream. Those individual objects are cached, after - which the temporary buffer holding the object stream contents are - discarded. In this way, the first time an object in an object - stream is requested, all objects in the stream are cached. - An instance of QPDF is constructed by using the class's default constructor. If desired, the @@ -1934,8 +1884,206 @@ outfile.pdf There are some convenience routines for very common operations such as walking the page tree and returning a vector of all page - objects. For full details, please see the header file - QPDF.hh. + objects. For full details, please see the header files + QPDF.hh and + QPDFObjectHandle.hh. There are also some + additional helper classes that provide higher level API functions + for certain document constructions. These are discussed in . + + + + Helper Classes + + QPDF version 8.1 introduced the concept of helper classes. Helper + classes are intended to contain higher level APIs that allow + developers to work with certain document constructs at an + abstraction level above that of + QPDFObjectHandle while staying true to + qpdf's philosophy of not hiding document structure from the + developer. As with qpdf in general, the goal is to take away some of + the more tedious bookkeeping aspects of working with PDF files, + not to remove the need for the developer to understand how the PDF + construction in question works. 
The driving factor behind the + creation of helper classes was to allow the evolution of higher + level interfaces in qpdf without polluting the interfaces of the + main top-level classes QPDF and + QPDFObjectHandle. + + + There are two kinds of helper classes: + document helpers and + object helpers. Document helpers are + constructed with a reference to a QPDF + object and provide methods for working with structures that are at + the document level. Object helpers are constructed with an + instance of a QPDFObjectHandle and provide + methods for working with specific types of objects. + + + Examples of document helpers include + QPDFPageDocumentHelper, which contains + methods for operating on the document's page trees, such as + enumerating all pages of a document and adding and removing pages; + and QPDFAcroFormDocumentHelper, which + contains document-level methods related to interactive forms, such + as enumerating form fields and creating mappings between form + fields and annotations. + + + Examples of object helpers include + QPDFPageObjectHelper for performing + operations on pages such as page rotation and some operations on + content streams, QPDFFormFieldObjectHelper + for performing operations related to interactive form fields, and + QPDFAnnotationObjectHelper for working with + annotations. + + + It is always possible to retrieve the underlying + QPDF reference from a document helper and + the underlying QPDFObjectHandle reference + from an object helper. Helpers are designed to be helpers, not + wrappers. The intention is that, in general, it is safe to freely + intermix operations that use helpers with operations that use the + underlying objects. Document and object helpers do not attempt to + provide a complete interface for working with the things they are + helping with, nor do they attempt to encapsulate underlying + structures. They just provide a few methods to help with + error-prone, repetitive, or complex tasks. 
In some cases, a helper + object may cache some information that is expensive to gather. In + such cases, the helper classes are implemented so that their own + methods keep the cache consistent, and the header file will + provide a method to invalidate the cache and a description of what + kinds of operations would make the cache invalid. If in doubt, you + can always discard a helper object and create a new one with the + same underlying objects, which will ensure that you have discarded + any stale information. + + + By convention, document helpers are called + QPDFSomethingDocumentHelper and are derived + from QPDFDocumentHelper, and object helpers + are called QPDFSomethingObjectHelper and + are derived from QPDFObjectHelper. For + details on specific helpers, please see their header files. You + can find them by looking at + include/qpdf/QPDF*DocumentHelper.hh and + include/qpdf/QPDF*ObjectHelper.hh. + + + In order to avoid creation of circular dependencies, the following + general guidelines are followed with helper classes: + + + + Core class interfaces do not know about helper classes. For + example, no methods of QPDF or + QPDFObjectHandle will include helper + classes in their interfaces. + + + + + Interfaces of object helpers will usually not use document + helpers in their interfaces. This is because it is much more + useful for document helpers to have methods that return object + helpers. Most operations in PDF files start at the document + level and go from there to the object level rather than the + other way around. It can sometimes be useful to map back from + object-level structures to document-level structures. If there + is a desire to do this, it will generally be provided by a + method in the document helper class. + + + + + Most of the time, object helpers don't know about other object + helpers. 
However, in some cases, one type of object may be a + container for another type of object, in which case it may make + sense for the outer object to know about the inner object. For + example, there are methods in the + QPDFPageObjectHelper that know about + QPDFAnnotationObjectHelper because + references to annotations are contained in page dictionaries. + + + + + Any helper or core library class may use helpers in their + implementations. + + + + + + Prior to qpdf version 8.1, higher level interfaces were added as + “convenience functions” in either + QPDF or + QPDFObjectHandle. For compatibility, older + convenience functions for operating with pages will remain in + those classes even as alternatives are provided in helper classes. + Going forward, new higher level interfaces will be provided using + helper classes. + + + + Implementation Notes + + This section contains a few notes about QPDF's internal + implementation, particularly around what it does when it first + processes a file. This section is a bit of a simplification of + what it actually does, but it could serve as a starting point to + someone trying to understand the implementation. There is nothing + in this section that you need to know to use the qpdf library. + + + QPDFObject is the basic PDF Object class. + It is an abstract base class from which are derived classes for + each type of PDF object. Clients do not interact with Objects + directly but instead interact with + QPDFObjectHandle. + + + When the QPDF class creates a new object, + it dynamically allocates the appropriate type of + QPDFObject and immediately hands the + pointer to an instance of QPDFObjectHandle. + The parser reads a token from the current file position. If the + token is neither a dictionary nor an array opener, an object is + immediately constructed from the single token and the parser + returns. Otherwise, the parser iterates in a special mode in which + it accumulates objects until it finds a balancing closer. 
During + this process, the “R” keyword is + recognized and an indirect QPDFObjectHandle + may be constructed. + + + The QPDF::resolve() method, which is used to + resolve an indirect object, may be invoked from the + QPDFObjectHandle class. It first checks a + cache to see whether this object has already been read. If not, + it reads the object from the PDF file and caches it. It then + returns the resulting QPDFObjectHandle. + The calling object handle then replaces its + PointerHolder<QPDFObject> with the one + from the newly returned QPDFObjectHandle. + In this way, only a single copy of any direct object need exist + and clients can access objects transparently without knowing or + caring whether they are direct or indirect objects. Additionally, + no object is ever read from the file more than once. That means + that only the portions of the PDF file that are actually needed + are ever read from the input file, thus allowing the qpdf package + to take advantage of this important design goal of PDF files. + + + If the requested object is inside of an object stream, the object + stream itself is first read into memory. Then the tokenizer reads + objects from the memory stream based on the offset information + stored in the stream. Those individual objects are cached, after + which the temporary buffer holding the object stream contents is + discarded. In this way, the first time an object in an object + stream is requested, all objects in the stream are cached. The following example should clarify how @@ -1951,12 +2099,11 @@ outfile.pdf The QPDF class checks the beginning of - a.pdf for - %!PDF-1.[0-9]+. It then reads the cross - reference table mentioned at the end of the file, ensuring that - it is looking before the last %%EOF. After - getting to trailer keyword, it invokes the - parser. + a.pdf for a PDF header. It then reads the + cross reference table mentioned at the end of the file, + ensuring that it is looking before the last + %%EOF. 
After getting to the + trailer keyword, it invokes the parser. -- cgit v1.2.3-54-g00ecf
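Reviewer sketch, not part of the patch: the lazy resolution described in the Implementation Notes hunk (an indirect handle starts with a null pointer, and the document resolves and caches the object on first access) could be modelled roughly as below. This is a simplified, self-contained illustration under stated assumptions, not qpdf's code. Every name in it (MiniObject, MiniDoc, MiniHandle) is hypothetical, std::shared_ptr stands in for PointerHolder, and a std::map stands in for the PDF file.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a parsed PDF object; not a qpdf class.
struct MiniObject {
    std::string value;
};

// Hypothetical stand-in for the document. It owns the object cache and
// counts how often the "file" (here just a map) is actually read.
class MiniDoc {
public:
    explicit MiniDoc(std::map<int, std::string> file) : file_(std::move(file)) {}

    // Return the cached object if present; otherwise read it once and cache it.
    std::shared_ptr<MiniObject> resolve(int objid) {
        auto hit = cache_.find(objid);
        if (hit != cache_.end()) {
            return hit->second;
        }
        ++reads_;
        auto obj = std::make_shared<MiniObject>(MiniObject{file_.at(objid)});
        cache_[objid] = obj;
        return obj;
    }

    int reads() const { return reads_; }

private:
    std::map<int, std::string> file_;
    std::map<int, std::shared_ptr<MiniObject>> cache_;
    int reads_ = 0;
};

// Hypothetical handle. A direct handle owns its object from the start; an
// indirect handle holds a null pointer until the first access.
class MiniHandle {
public:
    static MiniHandle direct(std::string v) {
        return MiniHandle(nullptr, 0,
                          std::make_shared<MiniObject>(MiniObject{std::move(v)}));
    }
    static MiniHandle indirect(MiniDoc& doc, int objid) {
        return MiniHandle(&doc, objid, nullptr);
    }

    // Callers use the same accessor for direct and indirect handles.
    const std::string& value() {
        if (!obj_) {
            obj_ = doc_->resolve(objid_);  // first access triggers resolution
        }
        return obj_->value;
    }

private:
    MiniHandle(MiniDoc* doc, int objid, std::shared_ptr<MiniObject> obj)
        : doc_(doc), objid_(objid), obj_(std::move(obj)) {}

    MiniDoc* doc_;
    int objid_;
    std::shared_ptr<MiniObject> obj_;
};
```

In this model, creating two indirect handles to the same object id and reading both leaves doc.reads() at 1, which mirrors the statement that no object is read from the file more than once.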