aboutsummaryrefslogtreecommitdiffstats
path: root/manual/object-streams.rst
diff options
context:
space:
mode:
authorJay Berkenbilt <ejb@ql.org>2021-12-18 15:01:52 +0100
committerJay Berkenbilt <ejb@ql.org>2021-12-18 17:05:51 +0100
commit10fb619d3e0618528b7ac6c20cad6262020cf947 (patch)
treec893fedff351e809edead840376e8648f1cc28ff /manual/object-streams.rst
parentf3d1138b8ab64c6a26e1dd5f77a644b19016a30d (diff)
downloadqpdf-10fb619d3e0618528b7ac6c20cad6262020cf947.tar.zst
Split documentation into multiple pages, change theme
Diffstat (limited to 'manual/object-streams.rst')
-rw-r--r--manual/object-streams.rst186
1 files changed, 186 insertions, 0 deletions
diff --git a/manual/object-streams.rst b/manual/object-streams.rst
new file mode 100644
index 00000000..6c2b3fc8
--- /dev/null
+++ b/manual/object-streams.rst
@@ -0,0 +1,186 @@
+.. _ref.object-and-xref-streams:
+
+Object and Cross-Reference Streams
+==================================
+
+This chapter provides information about the implementation of object
+stream and cross-reference stream support in qpdf.
+
+.. _ref.object-streams:
+
+Object Streams
+--------------
+
+Object streams can contain any regular object except the following:
+
+- stream objects
+
+- objects with generation > 0
+
+- the encryption dictionary
+
+- objects containing the /Length of another stream
+
+In addition, Adobe reader (at least as of version 8.0.0) appears to not
+be able to handle having the document catalog appear in an object stream
+if the file is encrypted, though this is not specifically disallowed by
+the specification.
+
+There are additional restrictions for linearized files. See
+:ref:`ref.object-streams-linearization` for details.
+
+The PDF specification refers to objects in object streams as "compressed
+objects" regardless of whether the object stream is compressed.
+
+The generation number of every object in an object stream must be zero.
+It is possible to delete and replace an object in an object stream with
+a regular object.
+
+The object stream dictionary has the following keys:
+
+- ``/N``: number of objects
+
+- ``/First``: byte offset of first object
+
+- ``/Extends``: indirect reference to stream that this extends
+
+Stream collections are formed with ``/Extends``. They must form a
+directed acyclic graph. These can be used for semantic information and
+are not meaningful to the PDF document's syntactic structure. Although
+qpdf preserves stream collections, it never generates them and doesn't
+make use of this information in any way.
+
+The specification recommends limiting the number of objects in object
+stream for efficiency in reading and decoding. Acrobat 6 uses no more
+than 100 objects per object stream for linearized files and no more 200
+objects per stream for non-linearized files. ``QPDFWriter``, in object
+stream generation mode, never puts more than 100 objects in an object
+stream.
+
+Object stream contents consists of *N* pairs of integers, each of which
+is the object number and the byte offset of the object relative to the
+first object in the stream, followed by the objects themselves,
+concatenated.
+
+.. _ref.xref-streams:
+
+Cross-Reference Streams
+-----------------------
+
+For non-hybrid files, the value following ``startxref`` is the byte
+offset to the xref stream rather than the word ``xref``.
+
+For hybrid files (files containing both xref tables and cross-reference
+streams), the xref table's trailer dictionary contains the key
+``/XRefStm`` whose value is the byte offset to a cross-reference stream
+that supplements the xref table. A PDF 1.5-compliant application should
+read the xref table first. Then it should replace any object that it has
+already seen with any defined in the xref stream. Then it should follow
+any ``/Prev`` pointer in the original xref table's trailer dictionary.
+The specification is not clear about what should be done, if anything,
+with a ``/Prev`` pointer in the xref stream referenced by an xref table.
+The ``QPDF`` class ignores it, which is probably reasonable since, if
+this case were to appear for any sensible PDF file, the previous xref
+table would probably have a corresponding ``/XRefStm`` pointer of its
+own. For example, if a hybrid file were appended, the appended section
+would have its own xref table and ``/XRefStm``. The appended xref table
+would point to the previous xref table which would point the
+``/XRefStm``, meaning that the new ``/XRefStm`` doesn't have to point to
+it.
+
+Since xref streams must be read very early, they may not be encrypted,
+and the may not contain indirect objects for keys required to read them,
+which are these:
+
+- ``/Type``: value ``/XRef``
+
+- ``/Size``: value *n+1*: where *n* is highest object number (same as
+ ``/Size`` in the trailer dictionary)
+
+- ``/Index`` (optional): value
+ ``[:samp:`{n count}` ...]`` used to determine
+ which objects' information is stored in this stream. The default is
+ ``[0 /Size]``.
+
+- ``/Prev``: value :samp:`{offset}`: byte
+ offset of previous xref stream (same as ``/Prev`` in the trailer
+ dictionary)
+
+- ``/W [...]``: sizes of each field in the xref table
+
+The other fields in the xref stream, which may be indirect if desired,
+are the union of those from the xref table's trailer dictionary.
+
+.. _ref.xref-stream-data:
+
+Cross-Reference Stream Data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The stream data is binary and encoded in big-endian byte order. Entries
+are concatenated, and each entry has a length equal to the total of the
+entries in ``/W`` above. Each entry consists of one or more fields, the
+first of which is the type of the field. The number of bytes for each
+field is given by ``/W`` above. A 0 in ``/W`` indicates that the field
+is omitted and has the default value. The default value for the field
+type is "``1``". All other default values are "``0``".
+
+PDF 1.5 has three field types:
+
+- 0: for free objects. Format: ``0 obj next-generation``, same as the
+ free table in a traditional cross-reference table
+
+- 1: regular non-compressed object. Format: ``1 offset generation``
+
+- 2: for objects in object streams. Format: ``2 object-stream-number
+ index``, the number of object stream containing the object and the
+ index within the object stream of the object.
+
+It seems standard to have the first entry in the table be ``0 0 0``
+instead of ``0 0 ffff`` if there are no deleted objects.
+
+.. _ref.object-streams-linearization:
+
+Implications for Linearized Files
+---------------------------------
+
+For linearized files, the linearization dictionary, document catalog,
+and page objects may not be contained in object streams.
+
+Objects stored within object streams are given the highest range of
+object numbers within the main and first-page cross-reference sections.
+
+It is okay to use cross-reference streams in place of regular xref
+tables. There are on special considerations.
+
+Hint data refers to object streams themselves, not the objects in the
+streams. Shared object references should also be made to the object
+streams. There are no reference in any hint tables to the object numbers
+of compressed objects (objects within object streams).
+
+When numbering objects, all shared objects within both the first and
+second halves of the linearized files must be numbered consecutively
+after all normal uncompressed objects in that half.
+
+.. _ref.object-stream-implementation:
+
+Implementation Notes
+--------------------
+
+There are three modes for writing object streams:
+:samp:`disable`, :samp:`preserve`, and
+:samp:`generate`. In disable mode, we do not generate
+any object streams, and we also generate an xref table rather than xref
+streams. This can be used to generate PDF files that are viewable with
+older readers. In preserve mode, we write object streams such that
+written object streams contain the same objects and ``/Extends``
+relationships as in the original file. This is equal to disable if the
+file has no object streams. In generate, we create object streams
+ourselves by grouping objects that are allowed in object streams
+together in sets of no more than 100 objects. We also ensure that the
+PDF version is at least 1.5 in generate mode, but we preserve the
+version header in the other modes. The default is
+:samp:`preserve`.
+
+We do not support creation of hybrid files. When we write files, even in
+preserve mode, we will lose any xref tables and merge any appended
+sections.