Implement JSON v2 output

author: Jay Berkenbilt <ejb@ql.org> 2022-05-07 19:33:45 +0200
committer: Jay Berkenbilt <ejb@ql.org> 2022-05-08 19:45:20 +0200
commit: c76536dd9a150adb71fdcda11ee1a93f25128cc7 (patch)
tree: 03f68965ad1646f643d184b0435bd6706b42fcdc /TODO
parent: bdfc4da5105c86f0dc63ed390da240306e6b4466 (diff)
download: qpdf-c76536dd9a150adb71fdcda11ee1a93f25128cc7.tar.zst
1 files changed, 50 insertions, 72 deletions
diff --git a/TODO b/TODO
index 35cfbd79..b184d81f 100644
--- a/TODO
+++ b/TODO
@@ -50,6 +50,8 @@ Output JSON v2
 
 General things to remember:
 
+* Test inline and file stream data.
+
 * Make sure all the information from --check and other informational
   options (--show-linearization, --show-encryption, --show-xref,
   --list-attachments, --show-npages) is available in the json output.
@@ -58,9 +60,6 @@ General things to remember:
   when present in the schema. It's reasonable for people to check for
   presence of a key. Most languages make this easy to do.
 
-* The choices for json_key (job.yml) will be different for v1 and v2.
-  That information is already duplicated in multiple places.
-
 * Test stream with invalid data
 
 * When we get to full serialization, add json serialization
@@ -76,20 +75,61 @@ General things to remember:
   * "b:cf80", "b:CF80", "u:π", "u:\u03c0"
   * "b:d83edd54", "u:🥔", "u:\ud83e\udd54"
 
+JSON to PDF:
+
 When reading a JSON string, any string that doesn't follow the above rules
 is an error. Just use newUnicodeString on "u:" strings. For "b:"
 strings, decode the bytes with hex_decode and use newString.
 
+For going back from JSON to PDF, we can have
+QPDF::fromJSON(std::shared_ptr<InputSource> which will have logic
+similar to copyForeignObject. Note that this InputSource is not going
+to be this->file. We have to keep it separately.
+
+The backing input source is this memory block:
+
+```
+%PDF-1.3
+xref
+0 1
+0000000000 65535 f 
+trailer << /Size 1 >>
+startxref
+9
+%%EOF
+```
+
+* Ignore all keys except .qpdf.
+* Verify that .qpdf.jsonVersion is 2
+* Set this->m->pdf_version based on the .qpdf.pdfVersion key
+* For each object in .qpdf.objects:
+  * Walk through the object detecting any indirect objects. For each
+    one that is not already known, reserve the object. We can also
+    validate but we should try to do the best we can with invalid JSON
+    so people can get good error messages.
+  * Construct a QPDFObjectHandle from the JSON
+  * If the object is the trailer, update the trailer
+  * Else if the object doesn't exist, reserve it
+  * If the object is reserved, call replaceReserved()
+  * Else the object already exists; this is an error.
+
+For streams, have a stream data provider that, for inline streams,
+does a base64 from the file offsets and for file-based streams, reads
+the file. For the inline case, we have to keep the json InputSource
+around. Otherwise, we don't. It is an error if there is no stream data.
+
+Documentation:
+
 Serialized PDF:
 
 The JSON output will have a "qpdf" key containing
-* jsonVersion
-* pdfVersion
+* jsonversion
+* pdfversion
 * objects
 
 The "qpdf" key replaces "objects" and "objectinfo" in v1 JSON.
 
-Within .qpdf.objects, the key is "obj:o g R" or "obj:trailer", and the
+Within .qpdf.objects, the key is "obj:o g R" or "trailer", and the
 value is a dictionary with exactly one of "value" or "stream" as its
 single key.
 
@@ -113,16 +153,17 @@ For streams:
     "stream": {
       "dict": { ... stream dictionary ... },
       "data": "base64-encoded data",
-      "dataFile": "path to base64-encoded data"
+      "datafile": "path to base64-encoded data"
     }
   }
 }
 
-At most one of "data" or "dataFile" will be present. When serializing,
+At most one of "data" or "datafile" will be present. When serializing,
 stream decode parameters will be obeyed, and the stream dictionary
 will reflect the result. There will be the option to omit stream data.
 
-In the stream dictionary, "/Length" is always removed.
+When data is included, "/Length" is removed from the stream
+dictionary.
 
 Streams are filtered or not based on the --decode-level parameter. If
 a stream is filtered, "/Filter" and "/DecodeParms" are removed from
@@ -131,74 +172,11 @@ for when the file is read back in.
 
 CLI:
 
-* Add new flags
-
-  * --from-json=input.json -- signals reading from a JSON and counts
-    as an input file.
-
-  * --json-streams-omit -- stream data is omitted, the default
-
-  * --json-streams-inline -- stream data is included in the "data"
-    key as base64-encoded
-
-  * --json-streams-file-prefix=prefix -- stream is written to $prefix-$obj
-    where $obj is the object number. The path to the file is stored
-    in the "dataFile" key. A relative path is recommended and will be
-    interpreted as relative to the current directory. If a relative
-    prefix is given, a relative path will stored in "dataFile".
-    Example:
-    mkdir in-streams
-    qpdf in.pdf --json-streams-file-prefix=in-streams/ > out.json
-
-  * --to-json -- changes default to --json-streams-inline implies
-    --json-key=qpdf
-
 Example workflow:
 * qpdf in.pdf --to-json > pdf.json
 * edit pdf.json
 * qpdf --from-json=pdf.json out.pdf
 
-JSON to PDF:
-
-For going back from JSON to PDF, we can have
-QPDF::fromJSON(std::shared_ptr<InputSource> which will have logic
-similar to copyForeignObject. Note that this InputSource is not going
-to be this->file. We have to keep it separately.
-
-The backing input source is this memory block:
-
-```
-%PDF-1.3
-xref
-0 1
-0000000000 65535 f 
-trailer << /Size 1 >>
-startxref
-9
-%%EOF
-```
-
-* Ignore all keys except .qpdf.
-* Verify that .qpdf.jsonVersion is 2
-* Set this->m->pdf_version based on the .qpdf.pdfVersion key
-* For each object in .qpdf.objects:
-  * Walk through the object detecting any indirect objects. For each
-    one that is not already known, reserve the object. We can also
-    validate but we should try to do the best we can with invalid JSON
-    so people can get good error messages.
-  * Construct a QPDFObjectHandle from the JSON
-  * If the object is the trailer, update the trailer
-  * Else if the object doesn't exist, reserve it
-  * If the object is reserved, call replaceReserved()
-  * Else the object already exists; this is an error.
-
-For streams, have a stream data provider that, for inline streams,
-does a base64 from the file offsets and for file-based streams, reads
-the file. For the inline case, we have to keep the json InputSource
-around. Otherwise, we don't. It is an error if there is no stream data.
-
-Documentation:
-
 Update --json option in cli.rst to mention v2 and update json.rst.
 
 Other documentation fodder:
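
Note on the "JSON to PDF" string rule in the hunk above: the TODO says "u:" strings go to newUnicodeString, "b:" strings are hex-decoded and go to newString, and anything else is an error. The following is a minimal standalone sketch of that dispatch, not code from this commit. `DecodedString`, `hex_value`, and `decode_json_string` are hypothetical names introduced here for illustration; the sketch assumes that the text after "u:" is already UTF-8 once the JSON parser has handled escapes, and that odd-length or non-hex "b:" payloads should be rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical result type (not part of qpdf): records which constructor
// the TODO says to call, plus the payload to pass to it.
struct DecodedString
{
    bool is_unicode;     // true: newUnicodeString; false: newString
    std::string payload; // UTF-8 text for "u:", raw bytes for "b:"
};

// Return the value of one hex digit, or -1 if c is not a hex digit.
// Both cases are accepted, matching the "b:cf80" / "b:CF80" examples.
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') { return c - '0'; }
    if (c >= 'a' && c <= 'f') { return c - 'a' + 10; }
    if (c >= 'A' && c <= 'F') { return c - 'A' + 10; }
    return -1;
}

// Classify a JSON string value by its prefix and decode its payload.
DecodedString decode_json_string(std::string const& s)
{
    if (s.compare(0, 2, "u:") == 0) {
        // Assumption: the JSON parser has already resolved \uXXXX escapes.
        return {true, s.substr(2)};
    }
    if (s.compare(0, 2, "b:") == 0) {
        std::string hex = s.substr(2);
        if (hex.size() % 2 != 0) {
            throw std::runtime_error("binary string has odd-length hex");
        }
        std::string bytes;
        for (std::size_t i = 0; i < hex.size(); i += 2) {
            int hi = hex_value(hex[i]);
            int lo = hex_value(hex[i + 1]);
            if (hi < 0 || lo < 0) {
                throw std::runtime_error("binary string has a non-hex digit");
            }
            bytes.push_back(static_cast<char>((hi << 4) | lo));
        }
        return {false, bytes};
    }
    // Per the TODO, a string with neither prefix is an error.
    throw std::runtime_error("string must start with \"u:\" or \"b:\"");
}
```

With this sketch, "b:cf80" and "b:CF80" decode to the same two bytes, which are the UTF-8 encoding of the "u:π" example, while a string such as "x:abc" raises an error instead of being guessed at.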