author | Jay Berkenbilt <ejb@ql.org> | 2021-12-18 15:01:52 +0100 |
---|---|---|
committer | Jay Berkenbilt <ejb@ql.org> | 2021-12-18 17:05:51 +0100 |
commit | 10fb619d3e0618528b7ac6c20cad6262020cf947 (patch) | |
tree | c893fedff351e809edead840376e8648f1cc28ff /manual/object-streams.rst | |
parent | f3d1138b8ab64c6a26e1dd5f77a644b19016a30d (diff) | |
Split documentation into multiple pages, change theme
Diffstat (limited to 'manual/object-streams.rst')
-rw-r--r-- | manual/object-streams.rst | 186 |
1 files changed, 186 insertions, 0 deletions
diff --git a/manual/object-streams.rst b/manual/object-streams.rst
new file mode 100644
index 00000000..6c2b3fc8
--- /dev/null
+++ b/manual/object-streams.rst
@@ -0,0 +1,186 @@
+.. _ref.object-and-xref-streams:
+
+Object and Cross-Reference Streams
+==================================
+
+This chapter provides information about the implementation of object
+stream and cross-reference stream support in qpdf.
+
+.. _ref.object-streams:
+
+Object Streams
+--------------
+
+Object streams can contain any regular object except the following:
+
+- stream objects
+
+- objects with generation > 0
+
+- the encryption dictionary
+
+- objects containing the /Length of another stream
+
+In addition, Adobe Reader (at least as of version 8.0.0) appears to not
+be able to handle having the document catalog appear in an object stream
+if the file is encrypted, though this is not specifically disallowed by
+the specification.
+
+There are additional restrictions for linearized files. See
+:ref:`ref.object-streams-linearization` for details.
+
+The PDF specification refers to objects in object streams as "compressed
+objects" regardless of whether the object stream is compressed.
+
+The generation number of every object in an object stream must be zero.
+It is possible to delete and replace an object in an object stream with
+a regular object.
+
+The object stream dictionary has the following keys:
+
+- ``/N``: number of objects
+
+- ``/First``: byte offset of first object
+
+- ``/Extends``: indirect reference to stream that this extends
+
+Stream collections are formed with ``/Extends``. They must form a
+directed acyclic graph. These can be used for semantic information and
+are not meaningful to the PDF document's syntactic structure. Although
+qpdf preserves stream collections, it never generates them and doesn't
+make use of this information in any way.
+
+The specification recommends limiting the number of objects in an object
+stream for efficiency in reading and decoding. Acrobat 6 uses no more
+than 100 objects per object stream for linearized files and no more than 200
+objects per stream for non-linearized files. ``QPDFWriter``, in object
+stream generation mode, never puts more than 100 objects in an object
+stream.
+
+Object stream contents consist of *N* pairs of integers, each of which
+is the object number and the byte offset of the object relative to the
+first object in the stream, followed by the objects themselves,
+concatenated.
+
+.. _ref.xref-streams:
+
+Cross-Reference Streams
+-----------------------
+
+For non-hybrid files, the value following ``startxref`` is the byte
+offset to the xref stream rather than the word ``xref``.
+
+For hybrid files (files containing both xref tables and cross-reference
+streams), the xref table's trailer dictionary contains the key
+``/XRefStm`` whose value is the byte offset to a cross-reference stream
+that supplements the xref table. A PDF 1.5-compliant application should
+read the xref table first. Then it should replace any object that it has
+already seen with any defined in the xref stream. Then it should follow
+any ``/Prev`` pointer in the original xref table's trailer dictionary.
+The specification is not clear about what should be done, if anything,
+with a ``/Prev`` pointer in the xref stream referenced by an xref table.
+The ``QPDF`` class ignores it, which is probably reasonable since, if
+this case were to appear for any sensible PDF file, the previous xref
+table would probably have a corresponding ``/XRefStm`` pointer of its
+own. For example, if a hybrid file were appended, the appended section
+would have its own xref table and ``/XRefStm``. The appended xref table
+would point to the previous xref table which would point to the
+``/XRefStm``, meaning that the new ``/XRefStm`` doesn't have to point to
+it.
+
+Since xref streams must be read very early, they may not be encrypted,
+and they may not contain indirect objects for keys required to read them,
+which are these:
+
+- ``/Type``: value ``/XRef``
+
+- ``/Size``: value *n+1*: where *n* is the highest object number (same as
+  ``/Size`` in the trailer dictionary)
+
+- ``/Index`` (optional): value
+  ``[:samp:`{n count}` ...]`` used to determine
+  which objects' information is stored in this stream. The default is
+  ``[0 /Size]``.
+
+- ``/Prev``: value :samp:`{offset}`: byte
+  offset of previous xref stream (same as ``/Prev`` in the trailer
+  dictionary)
+
+- ``/W [...]``: sizes of each field in the xref table
+
+The other fields in the xref stream, which may be indirect if desired,
+are the union of those from the xref table's trailer dictionary.
+
+.. _ref.xref-stream-data:
+
+Cross-Reference Stream Data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The stream data is binary and encoded in big-endian byte order. Entries
+are concatenated, and each entry has a length equal to the total of the
+entries in ``/W`` above. Each entry consists of one or more fields, the
+first of which is the type of the field. The number of bytes for each
+field is given by ``/W`` above. A 0 in ``/W`` indicates that the field
+is omitted and has the default value. The default value for the field
+type is "``1``". All other default values are "``0``".
+
+PDF 1.5 has three field types:
+
+- 0: for free objects. Format: ``0 obj next-generation``, same as the
+  free table in a traditional cross-reference table
+
+- 1: regular non-compressed object. Format: ``1 offset generation``
+
+- 2: for objects in object streams. Format: ``2 object-stream-number
+  index``, the number of the object stream containing the object and the
+  index within the object stream of the object.
+
+It seems standard to have the first entry in the table be ``0 0 0``
+instead of ``0 0 ffff`` if there are no deleted objects.
+
+.. _ref.object-streams-linearization:
+
+Implications for Linearized Files
+---------------------------------
+
+For linearized files, the linearization dictionary, document catalog,
+and page objects may not be contained in object streams.
+
+Objects stored within object streams are given the highest range of
+object numbers within the main and first-page cross-reference sections.
+
+It is okay to use cross-reference streams in place of regular xref
+tables. There are no special considerations.
+
+Hint data refers to object streams themselves, not the objects in the
+streams. Shared object references should also be made to the object
+streams. There are no references in any hint tables to the object numbers
+of compressed objects (objects within object streams).
+
+When numbering objects, all shared objects within both the first and
+second halves of the linearized file must be numbered consecutively
+after all normal uncompressed objects in that half.
+
+.. _ref.object-stream-implementation:
+
+Implementation Notes
+--------------------
+
+There are three modes for writing object streams:
+:samp:`disable`, :samp:`preserve`, and
+:samp:`generate`. In disable mode, we do not generate
+any object streams, and we also generate an xref table rather than xref
+streams. This can be used to generate PDF files that are viewable with
+older readers. In preserve mode, we write object streams such that
+written object streams contain the same objects and ``/Extends``
+relationships as in the original file. This is equivalent to disable if the
+file has no object streams. In generate mode, we create object streams
+ourselves by grouping objects that are allowed in object streams
+together in sets of no more than 100 objects. We also ensure that the
+PDF version is at least 1.5 in generate mode, but we preserve the
+version header in the other modes. The default is
+:samp:`preserve`.
+
+We do not support creation of hybrid files. When we write files, even in
+preserve mode, we will lose any xref tables and merge any appended
+sections.
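
Reader's note on the "Cross-Reference Stream Data" section of the file added in this commit: the entry layout it describes (big-endian fields whose widths come from ``/W``, with a default type of 1 and other defaults of 0 when a width is 0) can be illustrated with a minimal Python sketch. This is not qpdf code. The function name ``decode_xref_entries`` is hypothetical, and the sketch assumes that ``/W`` holds exactly three widths and that ``data`` is the stream content after any stream filters have already been removed.

```python
from typing import List, Tuple


def decode_xref_entries(data: bytes, w: List[int]) -> List[Tuple[int, int, int]]:
    """Split xref stream data into (type, field2, field3) tuples.

    Hypothetical helper for illustration only; assumes three widths in /W
    and already-decoded (unfiltered) stream data.
    """
    if len(w) != 3 or any(n < 0 for n in w):
        raise ValueError("/W must hold three non-negative widths")
    entry_len = sum(w)  # each entry is as long as the total of /W
    if entry_len == 0 or len(data) % entry_len != 0:
        raise ValueError("data length is not a multiple of the entry length")

    # An omitted type field defaults to 1; the other fields default to 0.
    defaults = (1, 0, 0)
    entries = []
    for start in range(0, len(data), entry_len):
        pos = start
        fields = []
        for width, default in zip(w, defaults):
            if width == 0:
                fields.append(default)
            else:
                # Fields are stored in big-endian byte order.
                fields.append(int.from_bytes(data[pos:pos + width], "big"))
            pos += width
        entries.append(tuple(fields))
    return entries


if __name__ == "__main__":
    # Three entries with /W [1 2 1]: a free entry, a regular object at
    # offset 15, and index 3 within object stream number 5.
    sample = bytes([0, 0, 0, 0, 1, 0, 15, 0, 2, 0, 5, 3])
    for entry in decode_xref_entries(sample, [1, 2, 1]):
        print(entry)
```

Interpreting each tuple then follows the three field types listed in that section: type 0 carries a next free object number and generation, type 1 an offset and generation, and type 2 an object stream number and an index within it.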