author Jay Berkenbilt <ejb@ql.org> 2022-09-06 16:00:50 +0200
committer Jay Berkenbilt <ejb@ql.org> 2022-09-06 16:09:26 +0200
commit f95e0549cc6d402ab29f64306560e5677e528dad
tree 4a1dbec89ce3318c1b6a30ef25dfb9ec50ea7723
parent ed04b80caf7400622aa9d12797e221271c4d2016
Update documentation to clarify some limitations of qpdf JSON
-rw-r--r-- TODO 19
-rw-r--r-- manual/json.rst 60
2 files changed, 72 insertions, 7 deletions
diff --git a/TODO b/TODO
index ce962173..49d8be62 100644
--- a/TODO
+++ b/TODO
@@ -11,8 +11,6 @@ Next
Before Release:
* Stay on top of https://github.com/pikepdf/pikepdf/pull/315
-* Consider whether otherwise unreferenced object streams should be
- included in json output. Probably not. Or maybe optionally.
* Support json v2 in the C API. At a minimum, write_json,
create_from_json, and update_from_json need to be there and should
take the same kinds of functions as the C API for logger.
@@ -56,6 +54,20 @@ direct objects, which are always "resolved" in QPDFObjectHandle.
Possible future JSON enhancements
=================================
+* Consider not including unreferenced objects and trimming the trailer
+ in the same way that QPDFWriter does (except don't remove `/ID`).
+ This means excluding the linearization dictionary and hint stream,
+ the encryption dictionary, all keys from trailer that are removed by
+ QPDFWriter::getTrimmedTrailer except `/ID`, any object streams, and
+ the xref stream as long as all those objects are unreferenced. (They
+ always should be, but there could be some bizarre case of someone
+ creating a PDF file that has an indirect reference to one of those,
+ in which case we need to preserve it.) If this is done, make
+ `--preserve-unreferenced` preserve unreference objects and also
+ those extra keys. Search for "linear" and "trailer" in json.rst to
+ update the various places in the documentation that discuss this.
+ Also update the help for --json and --preserve-unreferenced.
+
* Add to JSON output the information available from a few additional
informational options:
@@ -376,7 +388,8 @@ I find it useful to make reference to them in this list.
convertible back to a valid PDF. Since providing the password may
reveal additional details, --show-encryption could potentially retry
with this option if the first time doesn't work. Then, with the file
- open, we can read the encryption dictionary normally.
+ open, we can read the encryption dictionary normally. If this is
+ done, search for "raw, encrypted" in json.rst.
* In libtests, separate executables that need the object library
from those that strictly use public API. Move as many of the test
diff --git a/manual/json.rst b/manual/json.rst
index 85210ee5..b062aacc 100644
--- a/manual/json.rst
+++ b/manual/json.rst
@@ -52,6 +52,22 @@ changes a handful of defaults so that the resulting JSON is as close
as possible to the original input and is ready for being converted
back to PDF.
+The qpdf JSON data includes unreferenced objects. This may be
+addressed in a future version of qpdf. For now, that means that
+certain objects that are not useful in the JSON representation are
+included. This includes linearization and encryption dictionaries,
+linearization hint streams, object streams, and the cross-reference
+(xref) stream associated with the trailer dictionary where applicable.
+For the best experience with qpdf JSON, you can run the file through
+qpdf first to remove encryption, linearization, and object streams.
+For example:
+
+::
+
+ qpdf --decrypt --object-streams=disable in.pdf out.pdf
+ qpdf --json-output out.pdf out.json
+
+
.. _json-terminology:
JSON Terminology
@@ -299,10 +315,46 @@ Object Values
Note that writing JSON output is done by ``QPDF``, not ``QPDFWriter``.
As such, none of the things ``QPDFWriter`` does apply. This includes
recompression of streams, renumbering of objects, removal of
-unreferenced objects, anything to do with object streams (which are
-not represented by qpdf JSON at all since they are PDF syntax, not
-semantics), encryption, decryption, linearization, QDF mode, etc. See
-:ref:`rewriting` for a more in-depth discussion.
+unreferenced objects, encryption, decryption, linearization, QDF
+mode, etc. See :ref:`rewriting` for a more in-depth discussion. This
+has a few noteworthy implications:
+
+- Decryption is handled transparently by qpdf. As there are no QPDF
+ APIs, even internal to the library, that allow retrieval of
+ encrypted data in its raw, encrypted form, qpdf JSON always includes
+ decrypted data. It is possible that a future version of qpdf may
+ allow access to raw, encrypted string and stream data.
+
+- Objects that are related to a PDF file's structure, rather than its
+ content, are included in the JSON output, even though they are not
+ particularly useful. In a future version of qpdf, this may be fixed,
+ and the :qpdf:ref:`--preserve-unreferenced` flag may be able to be
+ used to get the existing behavior. For now, to avoid this, run the
+ file through ``qpdf --decrypt --object-streams=disable in.pdf
+ out.pdf`` to generate a new PDF file that contains no unreferenced
+ or structural objects.
+
+ - Linearized PDF files include a linearization dictionary which is not
+ referenced from any other object and which references the
+ linearization hint stream by offset. The JSON from a linearized PDF
+ file contains both of these objects, even though they are not useful
+ in the JSON. Offset information is not represented in the JSON, so
+ there's no way to find the linearization hint stream from the
+ JSON. If a new PDF is created from JSON that was written, the
+ objects will be read back in but will just be unreferenced objects
+ that will be ignored by ``QPDFWriter`` when the file is rewritten.
+
+ - The JSON from a file with object streams will include the original
+ object stream and will also include all the objects in the stream
+ as top-level objects.
+
+ - In files with object streams, the trailer "dictionary" is a
+ stream. In qpdf JSON files, the ``"trailer"`` key will contain a
+ dictionary with all the keys in it relating to the stream, and the
+ stream will also appear as an unreferenced object.
+
+ - Encrypted files are decrypted, but the encryption dictionary still
+ appears in the JSON output.
.. _json.example:
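
The documentation added in this commit says that qpdf JSON output can contain unreferenced structural objects (linearization dictionaries, object streams, the xref stream). The following is a minimal Python sketch of how a consumer might list such objects. It is not part of qpdf. It assumes the qpdf JSON v2 layout, in which the ``"qpdf"`` key holds a two-element array whose second element maps ``"obj:N G R"`` keys and ``"trailer"`` to object descriptions, and in which indirect references appear as strings of the form ``"N G R"``. The names ``find_refs``, ``unreferenced_objects``, and ``sample`` are hypothetical, and ``sample`` is hand-written illustrative data, not real qpdf output.

```python
import re

# Assumption: indirect references in qpdf JSON v2 are strings like "12 0 R".
REF = re.compile(r"^\d+ \d+ R$")


def find_refs(node):
    """Yield every indirect-reference string found anywhere inside a JSON value."""
    if isinstance(node, str):
        if REF.match(node):
            yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from find_refs(value)
    elif isinstance(node, list):
        for value in node:
            yield from find_refs(value)


def unreferenced_objects(qpdf_json):
    """Return the "obj:N G R" keys that cannot be reached from the trailer."""
    objects = qpdf_json["qpdf"][1]
    reachable = set()
    pending = list(find_refs(objects.get("trailer", {})))
    while pending:
        ref = pending.pop()
        if ref in reachable:
            continue
        reachable.add(ref)
        # Follow references contained in the referenced object, if it exists.
        pending.extend(find_refs(objects.get("obj:" + ref, {})))
    return sorted(
        key for key in objects
        if key.startswith("obj:") and key[len("obj:"):] not in reachable
    )


# Hand-written data mimicking the assumed layout: objects 3 and 4 stand in
# for a linearization dictionary and an object stream that nothing references.
sample = {
    "qpdf": [
        {"jsonversion": 2},
        {
            "obj:1 0 R": {"value": {"/Type": "/Catalog", "/Pages": "2 0 R"}},
            "obj:2 0 R": {"value": {"/Type": "/Pages", "/Kids": [], "/Count": 0}},
            "obj:3 0 R": {"value": {"/Linearized": 1}},
            "obj:4 0 R": {"stream": {"dict": {"/Type": "/ObjStm"}, "data": ""}},
            "trailer": {"value": {"/Root": "1 0 R", "/Size": 5}},
        },
    ]
}

print(unreferenced_objects(sample))
```

With real output, the dictionary would come from ``json.load`` on the file written by ``qpdf --json-output``. Running the file through ``qpdf --decrypt --object-streams=disable`` first, as the added documentation suggests, should leave little or nothing for such a check to report.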