diff options
author | Jay Berkenbilt <ejb@ql.org> | 2021-12-18 15:01:52 +0100 |
---|---|---|
committer | Jay Berkenbilt <ejb@ql.org> | 2021-12-18 17:05:51 +0100 |
commit | 10fb619d3e0618528b7ac6c20cad6262020cf947 (patch) | |
tree | c893fedff351e809edead840376e8648f1cc28ff /manual/json.rst | |
parent | f3d1138b8ab64c6a26e1dd5f77a644b19016a30d (diff) | |
download | qpdf-10fb619d3e0618528b7ac6c20cad6262020cf947.tar.zst |
Split documentation into multiple pages, change theme
Diffstat (limited to 'manual/json.rst')
-rw-r--r-- | manual/json.rst | 177 |
1 files changed, 177 insertions, 0 deletions
diff --git a/manual/json.rst b/manual/json.rst new file mode 100644 index 00000000..660486ef --- /dev/null +++ b/manual/json.rst @@ -0,0 +1,177 @@ +.. _ref.json: + +QPDF JSON +========= + +.. _ref.json-overview: + +Overview +-------- + +Beginning with qpdf version 8.3.0, the :command:`qpdf` +command-line program can produce a JSON representation of the +non-content data in a PDF file. It includes a dump in JSON format of all +objects in the PDF file excluding the content of streams. This JSON +representation makes it very easy to look in detail at the structure of +a given PDF file, and it also provides a great way to work with PDF +files programmatically from the command-line in languages that can't +call or link with the qpdf library directly. Note that stream data can +be extracted from PDF files using other qpdf command-line options. + +.. _ref.json-guarantees: + +JSON Guarantees +--------------- + +The qpdf JSON representation includes a JSON serialization of the raw +objects in the PDF file as well as some computed information in a more +easily extracted format. QPDF provides some guarantees about its JSON +format. These guarantees are designed to simplify the experience of a +developer working with the JSON format. + +Compatibility + The top-level JSON object output is a dictionary. The JSON output + contains various nested dictionaries and arrays. With the exception + of dictionaries that are populated by the fields of objects from the + file, all instances of a dictionary are guaranteed to have exactly + the same keys. Future versions of qpdf are free to add additional + keys but not to remove keys or change the type of object that a key + points to. The qpdf program validates this guarantee, and in the + unlikely event that a bug in qpdf should cause it to generate data + that doesn't conform to this rule, it will ask you to file a bug + report. + + The top-level JSON structure contains a "``version``" key whose value + is simple integer. The value of the ``version`` key will be + incremented if a non-compatible change is made. A non-compatible + change would be any change that involves removal of a key, a change + to the format of data pointed to by a key, or a semantic change that + requires a different interpretation of a previously existing key. A + strong effort will be made to avoid breaking compatibility. + +Documentation + The :command:`qpdf` command can be invoked with the + :samp:`--json-help` option. This will output a JSON + structure that has the same structure as the JSON output that qpdf + generates, except that each field in the help output is a description + of the corresponding field in the JSON output. The specific + guarantees are as follows: + + - A dictionary in the help output means that the corresponding + location in the actual JSON output is also a dictionary with + exactly the same keys; that is, no keys present in help are absent + in the real output, and no keys will be present in the real output + that are not in help. As a special case, if the dictionary has a + single key whose name starts with ``<`` and ends with ``>``, it + means that the JSON output is a dictionary that can have any keys, + each of which conforms to the value of the special key. This is + used for cases in which the keys of the dictionary are things like + object IDs. + + - A string in the help output is a description of the item that + appears in the corresponding location of the actual output. The + corresponding output can have any format. + + - An array in the help output always contains a single element. It + indicates that the corresponding location in the actual output is + also an array, and that each element of the array has whatever + format is implied by the single element of the help output's + array. + + For example, the help output indicates includes a "``pagelabels``" + key whose value is an array of one element. That element is a + dictionary with keys "``index``" and "``label``". In addition to + describing the meaning of those keys, this tells you that the actual + JSON output will contain a ``pagelabels`` array, each of whose + elements is a dictionary that contains an ``index`` key, a ``label`` + key, and no other keys. + +Directness and Simplicity + The JSON output contains the value of every object in the file, but + it also contains some processed data. This is analogous to how qpdf's + library interface works. The processed data is similar to the helper + functions in that it allows you to look at certain aspects of the PDF + file without having to understand all the nuances of the PDF + specification, while the raw objects allow you to mine the PDF for + anything that the higher-level interfaces are lacking. + +.. _json.limitations: + +Limitations of JSON Representation +---------------------------------- + +There are a few limitations to be aware of with the JSON structure: + +- Strings, names, and indirect object references in the original PDF + file are all converted to strings in the JSON representation. In the + case of a "normal" PDF file, you can tell the difference because a + name starts with a slash (``/``), and an indirect object reference + looks like ``n n R``, but if there were to be a string that looked + like a name or indirect object reference, there would be no way to + tell this from the JSON output. Note that there are certain cases + where you know for sure what something is, such as knowing that + dictionary keys in objects are always names and that certain things + in the higher-level computed data are known to contain indirect + object references. + +- The JSON format doesn't support binary data very well. Mostly the + details are not important, but they are presented here for + information. When qpdf outputs a string in the JSON representation, + it converts the string to UTF-8, assuming usual PDF string semantics. + Specifically, if the original string is UTF-16, it is converted to + UTF-8. Otherwise, it is assumed to have PDF doc encoding, and is + converted to UTF-8 with that assumption. This causes strange things + to happen to binary strings. For example, if you had the binary + string ``<038051>``, this would be output to the JSON as ``\u0003•Q`` + because ``03`` is not a printable character and ``80`` is the bullet + character in PDF doc encoding and is mapped to the Unicode value + ``2022``. Since ``51`` is ``Q``, it is output as is. If you wanted to + convert back from here to a binary string, would have to recognize + Unicode values whose code points are higher than ``0xFF`` and map + those back to their corresponding PDF doc encoding characters. There + is no way to tell the difference between a Unicode string that was + originally encoded as UTF-16 or one that was converted from PDF doc + encoding. In other words, it's best if you don't try to use the JSON + format to extract binary strings from the PDF file, but if you really + had to, it could be done. Note that qpdf's + :samp:`--show-object` option does not have this + limitation and will reveal the string as encoded in the original + file. + +.. _json.considerations: + +JSON: Special Considerations +---------------------------- + +For the most part, the built-in JSON help tells you everything you need +to know about the JSON format, but there are a few non-obvious things to +be aware of: + +- While qpdf guarantees that keys present in the help will be present + in the output, those fields may be null or empty if the information + is not known or absent in the file. Also, if you specify + :samp:`--json-keys`, the keys that are not listed + will be excluded entirely except for those that + :samp:`--json-help` says are always present. + +- In a few places, there are keys with names containing + ``pageposfrom1``. The values of these keys are null or an integer. If + an integer, they point to a page index within the file numbering from + 1. Note that JSON indexes from 0, and you would also use 0-based + indexing using the API. However, 1-based indexing is easier in this + case because the command-line syntax for specifying page ranges is + 1-based. If you were going to write a program that looked through the + JSON for information about specific pages and then use the + command-line to extract those pages, 1-based indexing is easier. + Besides, it's more convenient to subtract 1 from a program in a real + programming language than it is to add 1 from shell code. + +- The image information included in the ``page`` section of the JSON + output includes the key "``filterable``". Note that the value of this + field may depend on the :samp:`--decode-level` that + you invoke qpdf with. The JSON output includes a top-level key + "``parameters``" that indicates the decode level used for computing + whether a stream was filterable. For example, jpeg images will be + shown as not filterable by default, but they will be shown as + filterable if you run :command:`qpdf --json + --decode-level=all`. |