TODO: flesh out JSON v2 details

author: Jay Berkenbilt <ejb@ql.org> 2022-02-25 20:54:25 +0100
committer: Jay Berkenbilt <ejb@ql.org> 2022-02-25 20:54:25 +0100
commit: 905e99a3141edc7d6523e8da47e624b1c1e664a3 (patch)
tree: 20b4aae349733a24561e7da0221627b4861a69ac /TODO
parent: 36794a60cf2a9739d4e1b021c9ba00feef9d42da (diff)
download: qpdf-905e99a3141edc7d6523e8da47e624b1c1e664a3.tar.zst
1 files changed, 152 insertions, 36 deletions
diff --git a/TODO b/TODO
index c658340d..6aeb1ca4 100644
--- a/TODO
+++ b/TODO
@@ -1,3 +1,4 @@
+
 Next
 ====
 
@@ -9,6 +10,7 @@ Priorities for 11:
 * cmake
 * PointerHolder -> shared_ptr
 * ABI
+* --json default is latest
 
 Misc
 * Get rid of "ugly switch statements" in QUtil.cc -- replace with
@@ -17,6 +19,16 @@ Misc
 * Consider exposing get_next_utf8_codepoint in QUtil
 * Add QUtil::is_explicit_utf8 that does what QPDF_String::getUTF8Val
   does to detect UTF-8 encoded strings per PDF 2.0 spec.
+* Add an option --ignore-encryption to ignore encryption information
+  and treat encrypted files as if they weren't encrypted. This should
+  make it possible to solve #598 (--show-encryption without a
+  password). We'll need to make sure we don't try to filter any
+  streams in this mode. Ideally we should be able to combine this with
+  --json so we can look at the raw encrypted strings and streams if we
+  want to. Since providing the password may reveal additional details,
+  --show-encryption could potentially retry with this option if the
+  first time doesn't work. Then, with the file open, we can read the
+  encryption dictionary normally.
 
 Soon: Break ground on "Document-level work"
 
@@ -82,21 +94,17 @@ A .clang-format file can be created at the top of the repository.
 Output JSON v2
 ==============
 
-Output JSON v2 contain enough information to completely recreate a PDF
-file.
-
-This is not an ABI change as long as the default --json version is 1.
+Output JSON v2 will contain enough information to completely recreate
+a PDF file. In other words, qpdf will have full, bidirectional,
+lossless json serialization/deserialization of PDF.
 
 If this is done, update --json option in cli.rst to mention v2. Also
 update QPDFJob::Config::json and of course other parts of the docs
 (json.rst).
 
-Fix the following problems:
+You can't create a PDF from v1 json because
 
-* Include the PDF version header somewhere.
-
-* Using "n n R" as a key in "objects" and "objectinfo" messes up
-  searching for things
+* The PDF version header is not recorded
 
 * Strings cannot be unambiguously encoded/decoded
 
@@ -110,36 +118,83 @@ Fix the following problems:
 * You can't tell a stream from a dictionary except by looking in both
   "object" and "objectinfo". Fix this, and then remove "objectinfo".
 
-* There are differences between information shown in the json format
-  vs. information shown with options like --check, --list-attachments,
+Additionally, using "n n R" as a key in "objects" and "objectinfo"
+messes up searching for things.
+
+For json v2:
+
+* Make sure it is possible to serialize and deserialize a PDF to JSON
+  without loading the whole thing into memory. This is substantial. It
+  means we need sax-style parsing and handling so we can
+  handle/generate objects as we go. We'll have to be able to keep
+  track of keys for dictionary error checking. May want to add json to
+  large file tests.
+
+* Resolve differences between information shown in the json format vs.
+  information shown with options like --check, --list-attachments,
   etc. The json format should be able to completely replace things
-  that write to stdout.
+  that write to stdout. Be sure getAllPages() and other top-level
+  convenience routines are there so people don't need to parse the
+  pages tree themselves. For many workflows, it should be possible for
+  someone to work in the json file based on json metadata rather than
+  calling the QPDF API. (Of course, you still need the QPDF API for
+  higher level helper objects.)
 
 * Consider using camelCase in multi-word key names to be consistent
   with job JSON and with how JSON is often represented in languages
-  that use it more natively
+  that use it more natively.
 
 * Consider changing the contract to allow fields to be absent even
   when present in the schema. It's reasonable for people to check for
   presence of a key. Most languages make this easy to do.
 
+* If we allow --json to be mixed with --ignore-encryption, we must
+  emphasize that the resulting json can't be turned back into a valid
+  PDF.
+
 Most things that are informational can stay the same. We will have to
-go through every item to decide for sure.
+go through every item to decide for sure, especially when camelCase is
+taken into consideration.
+
+New APIs:
 
-To address ambiguity, consider the following:
+QPDFObjectHandle::parseJSON(QPDF* context, JSON);
+QPDFObjectHandle::parseJSON(QPDF* context, std::string const&);
+operator ""_qpdf_json
+C API to create a QPDFObjectHandle from a json string
 
-Whenever a direct PDF object appears, disambiguate things represented
-in JSON as strings as follows:
+JSON::parseFile
+QPDF::parseJSON(JSON) (like parseFile, etc. -- deserializes json)
+QPDF::updateFromJSON(JSON)
 
-* "/Name" -- if it starts with /, it's a name
-* "n n R" -- if it is "n n R", it's an indirect object
-* "u:utf8-encoded" -- a utf8-encoded string
-* "b:<12ab34>" -- a binary string
+CLI: --infile-is-json -- indicate that the input is a qpdf json file
+rather than a PDF file
+CLI: --update-from-json=file.json
 
-In "objects", the key is "obj:o,g", and the value is a dictionary with
-exactly one of "value" or "stream" as its single key.
+Have a "qpdf" key in the output that contains "jsonVersion",
+"pdfVersion", and "objects". This replaces the "objects" field at the
+top level. "objects" and "objectinfo" disappear from the top-level.
+".version" and ".qpdf.jsonVersion" will match. The input to parseJSON
+and updateFromJSON will have to have the "qpdf" key in it. All other
+keys are ignored.
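
A minimal sketch of the top-level check described above, assuming the JSON has already been parsed into a plain Python dict. The function name and error messages are hypothetical and are not part of qpdf; requiring "objects" on every input is also an assumption.

```python
def extract_qpdf_section(doc):
    """Return the "qpdf" section of a v2 JSON document after basic checks."""
    if "qpdf" not in doc:
        raise ValueError('input must contain a "qpdf" key')
    section = doc["qpdf"]
    if "objects" not in section:
        raise ValueError('"qpdf" must contain "objects"')
    # ".version" and ".qpdf.jsonVersion" are expected to match.
    if "version" in doc and "jsonVersion" in section:
        if doc["version"] != section["jsonVersion"]:
            raise ValueError("version and qpdf.jsonVersion differ")
    # Every other top-level key is ignored.
    return section
```

Returning only the "qpdf" section keeps the rest of the reader independent of the informational keys.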
 
-For non-streams, the value of "value" is as described above.
+When creating from a JSON file, the JSON must be complete with data
+for all streams, a trailer, and a pdfVersion. When updating from a
+JSON:
+
+* Any object whose value is null (not "value": null, but just null) is
+  deleted.
+* For any stream that appears without stream data, the stream data is
+  left alone.
+* Otherwise, the object from the JSON completely replaces the input
+  object. No dictionary merges or anything like that are performed.
+  It will call replaceObject.
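
The three update rules can be sketched as follows, with objects modelled as plain dicts keyed by "obj:o,g". This is an illustration only: apply_update is a hypothetical name, and reading "the stream data is left alone" as "take the new dictionary, keep the old data" is an assumption.

```python
def apply_update(objects, update):
    """Apply the update rules to a dict of existing objects, in place."""
    for key, new in update.items():
        if new is None:
            # A bare null (not {"value": null}) deletes the object.
            objects.pop(key, None)
            continue
        old = objects.get(key)
        stream = new.get("stream")
        has_data = stream is not None and ("raw" in stream or "filtered" in stream)
        if stream is not None and not has_data and old is not None and "stream" in old:
            # Stream without data: use the new dictionary, keep the old data.
            merged = dict(stream)
            for field in ("raw", "filtered"):
                if field in old["stream"]:
                    merged[field] = old["stream"][field]
            objects[key] = {"stream": merged}
        else:
            # Everything else replaces the object wholesale; nothing is merged.
            objects[key] = new
    return objects
```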
+
+Within .qpdf.objects, the key is "obj:o,g" or "obj:trailer", and the
+value is a dictionary with exactly one of "value" or "stream" as its
+single key.
+
+For non-streams:
 
 {
   "obj:o,g": {
@@ -149,7 +204,6 @@ For non-streams, the value of "value" is as described above.
 
 For streams:
 
-{
   "obj:o,g": {
     "stream": {
       "dict": { ... stream dictionary ... },
@@ -160,27 +214,89 @@ For streams:
   }
 }
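
The key and entry shapes described for .qpdf.objects can be checked roughly like this. The helper names are hypothetical, and treating "o" and "g" as decimal integers is an assumption.

```python
import re

_KEY_RE = re.compile(r"obj:(\d+),(\d+)")

def parse_object_key(key):
    """Return "trailer" or an (object, generation) tuple for an objects key."""
    if key == "obj:trailer":
        return "trailer"
    match = _KEY_RE.fullmatch(key)
    if match is None:
        raise ValueError(f"bad object key: {key!r}")
    return int(match.group(1)), int(match.group(2))

def entry_kind(entry):
    """Return "value" or "stream"; the entry must hold exactly one of them."""
    if not isinstance(entry, dict) or len(entry) != 1:
        raise ValueError("entry must be a dictionary with a single key")
    (kind,) = entry
    if kind not in ("value", "stream"):
        raise ValueError(f"unexpected key: {kind!r}")
    return kind
```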
 
-Notes about stream data:
+Wherever a PDF object appears in the JSON output, including "value"
+and "stream"."dict" above as well as other places where they might
+appear, objects are represented as follows:
+
+* Arrays, dictionaries, booleans, nulls, integers, and real numbers
+  with no more than six decimal places are represented as their native
+  JSON type.
+* Real numbers with more than six decimal places are represented as
+  "r:{real-value}".
+* Names: "/Name" -- internal/canonical representation (e.g.
+  "/Text/Plain", not #xx quoted)
+* Indirect objects: "n n R"
+* Strings: one of
+  "s:json string treated as Unicode"
+  "b:json string treated as bytes; character > \u00ff is an error"
+  "e:base64-encoded bytes"
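
A rough sketch of the representation rules above: one helper decides how a real number is written, and another names the kind of PDF object a JSON string stands for. Both function names are hypothetical, and counting decimal places on the number as written (trailing zeros included) is an assumption.

```python
import re

_REF_RE = re.compile(r"\d+ \d+ R")

_PREFIXES = {
    "r:": "real",
    "s:": "unicode string",
    "b:": "byte string",
    "e:": "base64 string",
}

def encode_real(text):
    """Encode a PDF real, given as decimal text, for JSON output."""
    fraction = text.partition(".")[2]
    if len(fraction) > 6:
        return "r:" + text  # too many decimal places for a native number
    return float(text)

def classify(value):
    """Name the kind of PDF object that a JSON string represents."""
    if value.startswith("/"):
        return "name"
    if _REF_RE.fullmatch(value):
        return "indirect"
    kind = _PREFIXES.get(value[:2])
    if kind is None:
        raise ValueError(f"unrecognized string form: {value!r}")
    return kind
```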
+
+Test cases: these are the same:
+* "b:\u00cf\u0080", "s:π", "s:\u03c0", and "e:z4A="
+* "b:\u00f0\u009f\u00a5\u0094", "s:🥔", "s:\ud83e\udd54", and "e:8J+llA=="
+
+When creating output from a string:
+* If the string is explicitly unicode (UTF-8 or UTF-16), encode as
+  "s:" without the leading U+FEFF
+* Else if the string can be bidirectionally mapped between pdf-doc and
+  unicode, transcode to unicode and encode as "s:"
+* Else if the string would be decoded as binary, encode as "e:"
+* Else encode as "b:"
+
+When reading a string, any string that doesn't follow the above rules
+is an error. This includes "r:" strings not parseable as a real
+number, "/Name" strings containing a NUL character, "s:" or "b:"
+strings that are not valid JSON strings, "b:" strings containing
+character values > 0xff, or "e:" values that are not valid base64.
+Once the string is read in, if the "s:" string can be bidirectionally
+mapped between pdf-doc and unicode, store as PDFDoc. Otherwise store
+as UTF-16BE. "b:" strings are stored as bytes, and "e:" are decoded
+and stored as bytes.
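
A minimal sketch of reading the three string forms and rejecting malformed input. It compares everything as bytes and encodes "s:" text as UTF-8 for that purpose. That choice is an assumption made for illustration; the PDFDoc vs. UTF-16BE storage decision is not modelled, and decode_string is a hypothetical name.

```python
import base64
import binascii

def decode_string(value):
    """Turn an "s:", "b:", or "e:" JSON string into bytes."""
    prefix, body = value[:2], value[2:]
    if prefix == "s:":
        # Unicode text; UTF-8 is used here only as a comparison encoding.
        return body.encode("utf-8")
    if prefix == "b:":
        # Each character stands for one byte, so nothing above U+00FF.
        if any(ord(ch) > 0xFF for ch in body):
            raise ValueError('"b:" string has a character above U+00FF')
        return body.encode("latin-1")
    if prefix == "e:":
        try:
            return base64.b64decode(body, validate=True)
        except binascii.Error as exc:
            raise ValueError('"e:" string is not valid base64') from exc
    raise ValueError(f"unrecognized string prefix: {value!r}")
```

Under this reading, "s:π", "b:\u00cf\u0080", and "e:z4A=" all decode to the two bytes CF 80.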
+
+Implementing this will require some refactoring of things between
+QUtil and QPDF_String, plus we will need to implement a base64
+encoder/decoder.
+
+This enables a workflow like this:
+
+* qpdf --json=latest infile.pdf > pdf.json
+* modify pdf.json
+* qpdf infile.pdf --update-from-json=pdf.json out.pdf
+
+or
+
+* qpdf --json=latest --json-stream-data=raw|filtered infile.pdf > pdf.json
+* modify pdf.json
+* qpdf pdf.json --infile-is-json out.pdf
+
+Notes about streams and stream data:
+
+* Always include "dict". "/Length" is removed from the stream
+  dictionary.
 
-* Always include "dict".
+* Add new flag --json-stream-data={raw,filtered,none}. At most one of
+  "raw" and "filtered" will appear for each stream. If "filtered"
+  appears, "/Filter" and "/DecodeParms" are removed from the stream
+  dictionary. This makes the stream data and dictionary match for when
+  the file is read back in.
 
 * Always include "filterable" regardless of value of
   --json-stream-data. The value of filterable is influenced by
   --decode-level, which is already in parameters.
 
-* Add new flag --json-stream-data={raw,filtered,none}. At most one of
-  "raw" and "filtered" will appear for each stream.
-
 * Add to parameters: value of json-stream-data, default is none
 
-* If none, omit stream data entirely
+* If --json-stream-data=none, omit stream data entirely
 
-* If raw, include raw stream data as base64
+* If --json-stream-data=raw, include raw stream data as base64. Show
+  the data even for unfiltered streams in "raw".
 
-* If filtered, including the base64-encoded filtered stream data if we
-  can and should decode it based on decode-level. Otherwise, include
-  the base64-encoded raw data. See if we can honor
-  --normalize-content.
+* If --json-stream-data=filtered, include the base64-encoded filtered
+  stream data if we can and should decode it based on decode-level.
+  Otherwise, include the base64-encoded raw data. See if we can honor
+  --normalize-content. If a stream appears unfiltered in the input,
+  still show it as filtered. Remove /DecodeParms and /Filter if
+  filtering.
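
The stream notes above can be sketched as one function that builds the "stream" entry for a given --json-stream-data mode. This is a toy model and not qpdf's logic: stream_entry is a hypothetical name, only unfiltered streams and plain /FlateDecode streams without /DecodeParms count as filterable, and emitting the fallback data under "raw" is an assumption.

```python
import base64
import zlib

def stream_entry(stream_dict, raw, mode):
    """Build the "stream" JSON entry for --json-stream-data=MODE."""
    # "/Length" is always dropped from the stream dictionary.
    d = {k: v for k, v in stream_dict.items() if k != "/Length"}
    filters = d.get("/Filter")
    # Toy rule standing in for the real --decode-level decision.
    filterable = filters is None or (
        filters == "/FlateDecode" and "/DecodeParms" not in d
    )
    entry = {"dict": d, "filterable": filterable}
    if mode == "none":
        return entry
    if mode == "raw" or (mode == "filtered" and not filterable):
        entry["raw"] = base64.b64encode(raw).decode("ascii")
    elif mode == "filtered":
        data = zlib.decompress(raw) if filters else raw
        # Keep the dictionary consistent with the decoded data.
        d.pop("/Filter", None)
        d.pop("/DecodeParms", None)
        entry["filtered"] = base64.b64encode(data).decode("ascii")
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return entry
```

At most one of "raw" and "filtered" appears in the result, and "filterable" is present in every mode.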
 
 Note that --json-stream-data=filtered is different from
 --filtered-stream-data in that --filtered-stream-data implies