path: root/manual/linearization.rst
author: Jay Berkenbilt <ejb@ql.org> 2021-12-18 15:01:52 +0100
committer: Jay Berkenbilt <ejb@ql.org> 2021-12-18 17:05:51 +0100
commit: 10fb619d3e0618528b7ac6c20cad6262020cf947
tree: c893fedff351e809edead840376e8648f1cc28ff /manual/linearization.rst
parent: f3d1138b8ab64c6a26e1dd5f77a644b19016a30d
Split documentation into multiple pages, change theme
Diffstat (limited to 'manual/linearization.rst')
-rw-r--r-- manual/linearization.rst 197
1 files changed, 197 insertions, 0 deletions
diff --git a/manual/linearization.rst b/manual/linearization.rst
new file mode 100644
index 00000000..abac843a
--- /dev/null
+++ b/manual/linearization.rst
@@ -0,0 +1,197 @@
+.. _ref.linearization:
+
+Linearization
+=============
+
+This chapter describes how ``QPDF`` and ``QPDFWriter`` implement
+creation and processing of linearized PDFs.
+
+.. _ref.linearization-strategy:
+
+Basic Strategy for Linearization
+--------------------------------
+
+To avoid the circularity of having the qpdf library validate its own
+linearized files, we have a special linearized file checking mode,
+which can be invoked via :command:`qpdf --check-linearization` (or
+:command:`qpdf --check`). This mode reads the linearization parameter
+dictionary and the hint streams and validates that object ordering,
+parameters, and hint stream contents are correct. The validation code
+was first tested against linearized files created by external tools
+(Acrobat and pdlin) and then used to validate files created by
+``QPDFWriter`` itself.
+
+.. _ref.linearized.preparation:
+
+Preparing For Linearization
+---------------------------
+
+Before creating a linearized PDF file from any other PDF file, the PDF
+file must be altered such that all page attributes are propagated down
+to the page level (and not inherited from parents in the ``/Pages``
+tree). We also have to know which objects refer to which other objects,
+being concerned with page boundaries and a few other cases. We refer to
+this part of preparing the PDF file as
+*optimization*, discussed in
+:ref:`ref.optimization`. Note that, in this context, the
+term *optimization* is a qpdf term, and the
+term *linearization* is a term from the PDF
+specification. Do not be confused by the fact that many applications
+refer to linearization as optimization or web optimization.
+
+When creating linearized PDF files from optimized PDF files, there are
+really only a few issues that need to be dealt with:
+
+- Creation of hints tables
+
+- Placing objects in the correct order
+
+- Filling in offsets and byte sizes
+
+.. _ref.optimization:
+
+Optimization
+------------
+
+In order to perform various operations such as linearization and
+splitting files into pages, it is necessary to know which objects are
+referenced by which pages, page thumbnails, and root and trailer
+dictionary keys. It is also necessary to ensure that all page-level
+attributes appear directly at the page level and are not inherited from
+parents in the pages tree.
+
+We refer to the process of enforcing these constraints as
+*optimization*. As mentioned above, note
+that some applications refer to linearization as optimization. Although
+this optimization was initially motivated by the need to create
+linearized files, we are using these terms separately.
+
+PDF file optimization is implemented in the
+:file:`QPDF_optimization.cc` source file. That file
+is richly commented and serves as the primary reference for the
+optimization process.
+
+After optimization has been completed, the private member variables
+``obj_user_to_objects`` and ``object_to_obj_users`` in ``QPDF`` have
+been populated. Any object that has more than one value in the
+``object_to_obj_users`` table is shared. Any object that has exactly one
+value in the ``object_to_obj_users`` table is private. To find all the
+private objects in a page or a trailer or root dictionary key, one
+merely has to make this determination for each element in the
+``obj_user_to_objects`` table for the given page or key.
+
+Note that pages and thumbnails have different object user types, so the
+above test on a page will not include objects that are referenced only
+by the page's thumbnail dictionary.
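+
+The shared/private test described above can be sketched as follows. This
+is a minimal illustration, not qpdf's implementation: the type aliases,
+the string labels used for object users, and the function names
+``invert`` and ``private_objects`` are all assumptions made for the
+example.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical stand-ins for qpdf's internal types: an object is identified
// by its object number and an "object user" by a label such as "page:0".
using ObjId = int;
using ObjUser = std::string;
using UserToObjects = std::map<ObjUser, std::set<ObjId>>;
using ObjectToUsers = std::map<ObjId, std::set<ObjUser>>;

// Build the reverse table (object -> users) from the forward table.
ObjectToUsers invert(UserToObjects const& user_to_objects)
{
    ObjectToUsers result;
    for (auto const& entry : user_to_objects) {
        for (ObjId obj : entry.second) {
            result[obj].insert(entry.first);
        }
    }
    return result;
}

// Return the objects used by `user` and by no other object user.
std::set<ObjId> private_objects(
    UserToObjects const& user_to_objects,
    ObjectToUsers const& object_to_users,
    ObjUser const& user)
{
    std::set<ObjId> result;
    auto it = user_to_objects.find(user);
    if (it == user_to_objects.end()) {
        return result; // unknown user: nothing is private to it
    }
    for (ObjId obj : it->second) {
        auto users = object_to_users.find(obj);
        // Exactly one user means the object is private; more means shared.
        if (users != object_to_users.end() && users->second.size() == 1) {
            result.insert(obj);
        }
    }
    return result;
}
```

+Because a page and its thumbnail are separate object users in this
+sketch, an object referenced by both would count as shared.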
+
+.. _ref.linearization.writing:
+
+Writing Linearized Files
+------------------------
+
+We create files with only primary hint streams and never write overflow
+hint streams. (As of PDF version 1.4, Acrobat doesn't either, and they
+are never necessary.) The offsets stored in the hint streams point to
+where objects would be if the hint stream were not present. This means
+that we have to calculate all object positions before we can generate
+and write the hint table, which in turn means that we have to generate
+the file in two passes. To make this reliable, ``QPDFWriter`` in
+linearization mode invokes exactly the same code twice to write the
+file to a pipeline.
+
+In the first pass, the target pipeline is a count pipeline chained to a
+discard pipeline. The count pipeline simply passes its data through to
+the next pipeline in the chain but can return the number of bytes passed
+through it at any intermediate point. The discard pipeline is an
+end-of-the-line pipeline that just throws its data away. The hint stream
+is not written, and dummy values with adequate padding are stored in the
+first cross-reference table, the linearization parameter dictionary, and
+the ``/Prev`` key of the first trailer dictionary. All the offset,
+length, and object renumbering information, and anything else we need
+for the second pass, is stored.
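+
+A minimal sketch of the first-pass pipeline chain follows. The class and
+function names here are hypothetical and are not qpdf's actual pipeline
+API; the sketch only illustrates how a counting stage in front of a
+discarding stage yields offsets without producing any output.

```cpp
#include <cstddef>
#include <string>

// Hypothetical minimal pipeline interface; not qpdf's actual API.
class Pipeline
{
  public:
    virtual ~Pipeline() = default;
    virtual void write(std::string const& data) = 0;
};

// End-of-the-line pipeline that throws its data away.
class DiscardPipeline : public Pipeline
{
  public:
    void write(std::string const&) override {}
};

// Passes data through to the next pipeline and counts the bytes seen so far.
class CountPipeline : public Pipeline
{
  public:
    explicit CountPipeline(Pipeline& next) : next_(next) {}

    void write(std::string const& data) override
    {
        count_ += data.size();
        next_.write(data);
    }

    // May be called at any intermediate point to learn the current offset.
    std::size_t getCount() const { return count_; }

  private:
    Pipeline& next_;
    std::size_t count_ = 0;
};

// First-pass use: learn the offset at which `body` would start in the output.
std::size_t offset_after_header(
    std::string const& header, std::string const& body)
{
    DiscardPipeline discard;
    CountPipeline counter(discard);
    counter.write(header);
    std::size_t body_offset = counter.getCount();
    counter.write(body);
    return body_offset;
}
```

+In a second pass, the same writing code could run against a chain that
+ends in a file instead of ``DiscardPipeline``.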
+
+At the end of the first pass, this information is passed to the ``QPDF``
+class which constructs a compressed hint stream in a memory buffer and
+returns it. ``QPDFWriter`` uses this information to write a complete
+hint stream object into a memory buffer. At this point, the length of
+the hint stream is known.
+
+In the second pass, the end of the pipeline chain is a regular file
+instead of a discard pipeline, and we have known values for all the
+offsets and lengths that we didn't have in the first pass. We have to
+adjust offsets that appear after the start of the hint stream by the
+length of the hint stream, which is known. Anything that is of variable
+length is padded, with the padding code surrounding any writing code
+that differs in the two passes. This ensures that changes to the way
+things are represented never result in offsets that were gathered
+during the first pass becoming incorrect for the second pass.
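+
+The two adjustments described above can be sketched as follows. The
+function names are hypothetical, and padding with trailing spaces is an
+assumption made for illustration; the point is only that a padded field
+occupies the same number of bytes in both passes regardless of the value
+written into it.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Shift a first-pass offset to account for the hint stream, which is not
// written during the first pass. Offsets before the position where the hint
// stream will be inserted are unchanged.
std::size_t adjust_offset(
    std::size_t first_pass_offset,
    std::size_t hint_offset,
    std::size_t hint_length)
{
    if (first_pass_offset >= hint_offset) {
        return first_pass_offset + hint_length;
    }
    return first_pass_offset;
}

// Render a number into a fixed-width field so that a dummy value in the
// first pass and the real value in the second pass take the same space.
std::string pad_number(unsigned long long value, std::size_t width)
{
    std::string digits = std::to_string(value);
    if (digits.size() > width) {
        throw std::length_error("value does not fit in padded field");
    }
    return digits + std::string(width - digits.size(), ' ');
}
```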
+
+Using this strategy, we can write linearized files to a non-seekable
+output stream with only a single pass to disk or wherever the output is
+going.
+
+.. _ref.linearization-data:
+
+Calculating Linearization Data
+------------------------------
+
+Once a file is optimized, we have information about which objects access
+which other objects. We can then process these tables to decide which
+part (as described in "Linearized PDF Document Structure" in the PDF
+specification) each object is contained within. This tells us the exact
+order in which objects are written. The ``QPDFWriter`` class asks for
+this information and enqueues objects for writing in the proper order.
+It also turns on a check that causes an exception to be thrown if an
+object is encountered that has not already been queued. (This could
+happen only if there were a bug in the traversal code used to calculate
+the linearization data.)
+
+.. _ref.linearization-issues:
+
+Known Issues with Linearization
+-------------------------------
+
+There are a handful of known issues with this linearization code. These
+issues do not appear to impact the behavior of linearized files, which
+still work as intended: it is possible for a web browser to begin to
+display them before they are fully downloaded. In fact, it seems that
+various other programs that create linearized files have many of these
+same issues. These items make reference to terminology used in the
+linearization appendix of the PDF specification.
+
+- Thread Dictionary information keys appear in part 4 with the rest of
+  Threads instead of in part 9. Objects in part 9 are not grouped
+  together functionally.
+
+- We are not calculating numerators for shared object positions within
+  content streams or interleaving them within content streams.
+
+- We generate only page offset, shared object, and outline hint tables.
+  It would be relatively easy to add some additional tables. We gather
+  most of the information needed to create thumbnail hint tables. There
+  are comments in the code about this.
+
+.. _ref.linearization-debugging:
+
+Debugging Note
+--------------
+
+The :command:`qpdf --show-linearization` command can show
+the complete contents of linearization hint streams. To look at the raw
+data, you can extract the filtered contents of the linearization hint
+tables using :command:`qpdf --show-object=n
+--filtered-stream-data`, where ``n`` is the object number of the hint
+stream. Then, to convert this into a bit stream (since linearization
+tables are bit streams written without regard to byte boundaries), you
+can pipe the resulting data through the following Perl code:
+
+.. code-block:: perl
+
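+   use strict;
+   use warnings;
+   # Read standard input as raw bytes.
+   binmode STDIN;
+   # Slurp the whole input at once.
+   my $data = do { local $/; <STDIN> };
+   # Print each byte as eight bits, most significant bit first.
+   print unpack("B*", $data), "\n";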
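+
+Once the data is viewed as a bit stream, individual fields can be read
+at arbitrary bit widths. The following is a minimal sketch of such a
+reader; the ``BitReader`` class is hypothetical and is not part of
+qpdf's API. It assumes fields are stored most significant bit first,
+which matches the bit order printed by the Perl code above.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Reads unsigned fields of arbitrary bit width, most significant bit first,
// from a byte string without regard to byte boundaries.
class BitReader
{
  public:
    explicit BitReader(std::string const& data) : data_(data) {}

    std::uint64_t read(unsigned nbits)
    {
        if (nbits > 64) {
            throw std::invalid_argument("at most 64 bits per read");
        }
        if (bit_pos_ + nbits > data_.size() * 8) {
            throw std::out_of_range("read past end of data");
        }
        std::uint64_t value = 0;
        for (unsigned i = 0; i < nbits; ++i) {
            unsigned char byte =
                static_cast<unsigned char>(data_[bit_pos_ / 8]);
            // Pick out the next bit, counting from the high-order end.
            unsigned bit = (byte >> (7 - bit_pos_ % 8)) & 1u;
            value = (value << 1) | bit;
            ++bit_pos_;
        }
        return value;
    }

  private:
    std::string data_;
    std::size_t bit_pos_ = 0;
};
```

+For example, with the two input bytes ``0xA5 0xC0``, reading a 3-bit
+field and then a 7-bit field consumes the bits ``101`` and ``0010111``,
+the second of which spans the byte boundary.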