author | Jay Berkenbilt <ejb@ql.org> | 2018-12-23 14:54:59 +0100 |
---|---|---|
committer | Jay Berkenbilt <ejb@ql.org> | 2018-12-23 15:15:40 +0100 |
commit | 76bf863aaa7bd57c2672718804dd334a6c561cfb (patch) | |
tree | 85091cb7f729cb2c2585f9cbdd2c223ba21e4f0b /manual | |
parent | 52a0b767c8b2acb18bbdc076b258092dc122a1c6 (diff) | |
Add page position information to json
Diffstat (limited to 'manual')
-rw-r--r-- | manual/qpdf-manual.xml | 373 |
1 files changed, 215 insertions, 158 deletions
diff --git a/manual/qpdf-manual.xml b/manual/qpdf-manual.xml index 8d8ecaa0..b4b07f4b 100644 --- a/manual/qpdf-manual.xml +++ b/manual/qpdf-manual.xml @@ -1940,178 +1940,235 @@ outfile.pdf</option> </chapter> <chapter id="ref.json"> <title>QPDF JSON</title> - <para> - Beginning with qpdf version 8.3.0, the <command>qpdf</command> - command-line program can produce a json representation of the - non-content data in a PDF file. It includes a dump in json format - of all objects in the PDF file excluding the content of streams. - This json representation makes it very easy to look in detail at - the structure of a given PDF file, and it also provides a great way - to work with PDF files programmatically from the command-line in - languages that can't call or link with the qpdf library directly. - Note that stream data can be extracted from PDF files using other - qpdf command-line options. - </para> - <para> - The qpdf json representation includes a json serialization of the - raw objects in the PDF file as well as some computed information in - a more easily extracted format. QPDF provides some guarantees about - its json format. These guarantees are designed to simplify the - experience of a developer working with the JSON format. - <variablelist> - <varlistentry> - <term>Compatibility</term> + <sect1 id="ref.json-overview"> + <title>Overview</title> + <para> + Beginning with qpdf version 8.3.0, the <command>qpdf</command> + command-line program can produce a json representation of the + non-content data in a PDF file. It includes a dump in json format + of all objects in the PDF file excluding the content of streams. + This json representation makes it very easy to look in detail at + the structure of a given PDF file, and it also provides a great way + to work with PDF files programmatically from the command-line in + languages that can't call or link with the qpdf library directly. 
+ Note that stream data can be extracted from PDF files using other + qpdf command-line options. + </para> + </sect1> + <sect1 id="ref.json-guarantees"> + <title>JSON Guarantees</title> + <para> + The qpdf json representation includes a json serialization of the + raw objects in the PDF file as well as some computed information in + a more easily extracted format. QPDF provides some guarantees about + its json format. These guarantees are designed to simplify the + experience of a developer working with the JSON format. + <variablelist> + <varlistentry> + <term>Compatibility</term> + <listitem> + <para> + The top-level json object output is a dictionary. The json + output contains various nested dictionaries and arrays. With + the exception of dictionaries that are populated by the fields + of objects from the file, all instances of a dictionary are + guaranteed to have exactly the same keys. Future versions of + qpdf are free to add additional keys but not to remove keys or + change the type of object that a key points to. The qpdf + program validates this guarantee, and in the unlikely event + that a bug in qpdf should cause it to generate data that + doesn't conform to this rule, it will ask you to file a bug + report. + </para> + <para> + The top-level json structure contains a + “<literal>version</literal>” key whose value is + simple integer. The value of the <literal>version</literal> key + will be incremented if a non-compatible change is made. A + non-compatible change would be any change that involves removal + of a key, a change to the format of data pointed to by a key, + or a semantic change that requires a different interpretation + of a previously existing key. A strong effort will be made to + avoid breaking compatibility. + </para> + </listitem> + </varlistentry> + <varlistentry> + <term>Documentation</term> + <listitem> + <para> + The <command>qpdf</command> command can be invoked with the + <option>--json-help</option> option. 
This will output a json + structure that has the same structure as the json output that + qpdf generates, except that each field in the help output is a + description of the corresponding field in the json output. The + specific guarantees are as follows: + <itemizedlist> + <listitem> + <para> + A dictionary in the help output means that the corresponding + location in the actual json output is also a dictionary with + exactly the same keys; that is, no keys present in help are + absent in the real output, and no keys will be present in + the real output that are not in help. + </para> + </listitem> + <listitem> + <para> + A string in the help output is a description of the item + that appears in the corresponding location of the actual + output. The corresponding output can have any format. + </para> + </listitem> + <listitem> + <para> + An array in the help output always contains a single + element. It indicates that the corresponding location in the + actual output is also an array, and that each element of the + array has whatever format is implied by the single element + of the help output's array. + </para> + </listitem> + </itemizedlist> + For example, the help output indicates includes a + “<literal>pagelabels</literal>” key whose value is + an array of one element. That element is a dictionary with keys + “<literal>index</literal>” and + “<literal>label</literal>”. In addition to + describing the meaning of those keys, this tells you that the + actual json output will contain a <literal>pagelabels</literal> + array, each of whose elements is a dictionary that contains an + <literal>index</literal> key, a <literal>label</literal> key, + and no other keys. + </para> + </listitem> + </varlistentry> + <varlistentry> + <term>Directness and Simplicity</term> + <listitem> + <para> + The json output contains the value of every object in the file, + but it also contains some processed data. This is analogous to + how qpdf's library interface works. 
The processed data is + similar to the helper functions in that it allows you to look + at certain aspects of the PDF file without having to understand + all the nuances of the PDF specification, while the raw objects + allow you to mine the PDF for anything that the higher-level + interfaces are lacking. + </para> + </listitem> + </varlistentry> + </variablelist> + </para> + </sect1> + <sect1 id="json.limitations"> + <title>Limitations of JSON Representation</title> + <para> + There are a few limitations to be aware of with the json structure: + <itemizedlist> <listitem> <para> - The top-level json object output is a dictionary. The json - output contains various nested dictionaries and arrays. With - the exception of dictionaries that are populated by the fields - of objects from the file, all instances of a dictionary are - guaranteed to have exactly the same keys. Future versions of - qpdf are free to add additional keys but not to remove keys or - change the type of object that a key points to. The qpdf - program validates this guarantee, and in the unlikely event - that a bug in qpdf should cause it to generate data that - doesn't conform to this rule, it will ask you to file a bug - report. + Strings, names, and indirect object references in the original + PDF file are all converted to strings in the json + representation. In the case of a “normal” PDF file, + you can tell the difference because a name starts with a slash + (<literal>/</literal>), and an indirect object reference looks + like <literal>n n R</literal>, but if there were to be a string + that looked like a name or indirect object reference, there + would be no way to tell this from the json output. Note that + there are certain cases where you know for sure what something + is, such as knowing that dictionary keys in objects are always + names and that certain things in the higher-level computed data + are known to contain indirect object references. 
</para> + </listitem> + <listitem> <para> - The top-level json structure contains a - “<literal>version</literal>” key whose value is - simple integer. The value of the <literal>version</literal> key - will be incremented if a non-compatible change is made. A - non-compatible change would be any change that involves removal - of a key, a change to the format of data pointed to by a key, - or a semantic change that requires a different interpretation - of a previously existing key. A strong effort will be made to - avoid breaking compatibility. + The json format doesn't support binary data very well. Mostly + the details are not important, but they are presented here for + information. When qpdf outputs a string in the json + representation, it converts the string to UTF-8, assuming usual + PDF string semantics. Specifically, if the original string is + UTF-16, it is converted to UTF-8. Otherwise, it is assumed to + have PDF doc encoding, and is converted to UTF-8 with that + assumption. This causes strange things to happen to binary + strings. For example, if you had the binary string + <literal>&lt;038051&gt;</literal>, this would be output to the + json as <literal>\u0003•Q</literal> because + <literal>03</literal> is not a printable character and + <literal>80</literal> is the bullet character in PDF doc + encoding and is mapped to the Unicode value + <literal>2022</literal>. Since <literal>51</literal> is + <literal>Q</literal>, it is output as is. If you wanted to + convert back from here to a binary string, would have to + recognize Unicode values whose code points are higher than + <literal>0xFF</literal> and map those back to their + corresponding PDF doc encoding characters. There is no way to + tell the difference between a Unicode string that was originally + encoded as UTF-16 or one that was converted from PDF doc + encoding.
In other words, it's best if you don't try to use the + json format to extract binary strings from the PDF file, but if + you really had to, it could be done. Note that qpdf's + <option>--show-object</option> option does not have this + limitation and will reveal the string as encoded in the original + file. </para> </listitem> - </varlistentry> - <varlistentry> - <term>Documentation</term> + </itemizedlist> + </para> + </sect1> + <sect1 id="json.considerations"> + <title>JSON: Special Considerations</title> + <para> + For the most part, the built-in JSON help tells you everything you + need to know about the JSON format, but there are a few + non-obvious things to be aware of: + <itemizedlist> <listitem> <para> - The <command>qpdf</command> command can be invoked with the - <option>--json-help</option> option. This will output a json - structure that has the same structure as the json output that - qpdf generates, except that each field in the help output is a - description of the corresponding field in the json output. The - specific guarantees are as follows: - <itemizedlist> - <listitem> - <para> - A dictionary in the help output means that the corresponding - location in the actual json output is also a dictionary with - exactly the same keys; that is, no keys present in help are - absent in the real output, and no keys will be present in - the real output that are not in help. - </para> - </listitem> - <listitem> - <para> - A string in the help output is a description of the item - that appears in the corresponding location of the actual - output. The corresponding output can have any format. - </para> - </listitem> - <listitem> - <para> - An array in the help output always contains a single - element. It indicates that the corresponding location in the - actual output is also an array, and that each element of the - array has whatever format is implied by the single element - of the help output's array. 
- </para> - </listitem> - </itemizedlist> - For example, the help output indicates includes a - “<literal>pagelabels</literal>” key whose value is - an array of one element. That element is a dictionary with keys - “<literal>index</literal>” and - “<literal>label</literal>”. In addition to - describing the meaning of those keys, this tells you that the - actual json output will contain a <literal>pagelabels</literal> - array, each of whose elements is a dictionary that contains an - <literal>index</literal> key, a <literal>label</literal> key, - and no other keys. + While qpdf guarantees that keys present in the help will be + present in the output, those fields may be null or empty if the + information is not known or absent in the file. Also, if you + specify <option>--json-keys</option>, the keys that are not + listed will be excluded entirely except for those that + <option>--json-help</option> says are always present. </para> </listitem> - </varlistentry> - <varlistentry> - <term>Directness and Simplicity</term> <listitem> <para> - The json output contains the value of every object in the file, - but it also contains some processed data. This is analogous to - how qpdf's library interface works. The processed data is - similar to the helper functions in that it allows you to look - at certain aspects of the PDF file without having to understand - all the nuances of the PDF specification, while the raw objects - allow you to mine the PDF for anything that the higher-level - interfaces are lacking. + In a few places, there are keys with names containing + <literal>pageposfrom1</literal>. The values of these keys are + null or an integer. If an integer, they point to a page index + within the file numbering from 1. Note that json indexes from + 0, and you would also use 0-based indexing using the API. + However, 1-based indexing is easier in this case because the + command-line syntax for specifying page ranges is 1-based. 
If + you were going to write a program that looked through the json + for information about specific pages and then use the + command-line to extract those pages, 1-based indexing is + easier. Besides, it's more convenient to subtract 1 from a + program in a real programming language than it is to add 1 from + shell code. </para> </listitem> - </varlistentry> - </variablelist> - </para> - <para> - There are a few limitations to be aware of with the json structure: - <itemizedlist> - <listitem> - <para> - Strings, names, and indirect object references in the original - PDF file are all converted to strings in the json - representation. In the case of a “normal” PDF file, - you can tell the difference because a name starts with a slash - (<literal>/</literal>), and an indirect object reference looks - like <literal>n n R</literal>, but if there were to be a string - that looked like a name or indirect object reference, there - would be no way to tell this from the json output. Note that - there are certain cases where you know for sure what something - is, such as knowing that dictionary keys in objects are always - names and that certain things in the higher-level computed data - are known to contain indirect object references. - </para> - </listitem> - <listitem> - <para> - The json format doesn't support binary data very well. Mostly - the details are not important, but they are presented here for - information. When qpdf outputs a string in the json - representation, it converts the string to UTF-8, assuming usual - PDF string semantics. Specifically, if the original string is - UTF-16, it is converted to UTF-8. Otherwise, it is assumed to - have PDF doc encoding, and is converted to UTF-8 with that - assumption. This causes strange things to happen to binary - strings. 
For example, if you had the binary string - <literal>&lt;038051&gt;</literal>, this would be output to the - json as <literal>\u0003•Q</literal> because - <literal>03</literal> is not a printable character and - <literal>80</literal> is the bullet character in PDF doc - encoding and is mapped to the Unicode value - <literal>2022</literal>. Since <literal>51</literal> is - <literal>Q</literal>, it is output as is. If you wanted to - convert back from here to a binary string, would have to - recognize Unicode values whose code points are higher than - <literal>0xFF</literal> and map those back to their - corresponding PDF doc encoding characters. There is no way to - tell the difference between a Unicode string that was originally - encoded as UTF-16 or one that was converted from PDF doc - encoding. In other words, it's best if you don't try to use the - json format to extract binary strings from the PDF file, but if - you really had to, it could be done. Note that qpdf's - <option>--show-object</option> option does not have this - limitation and will reveal the string as encoded in the original - file. - </para> - </listitem> - </itemizedlist> - </para> - <para> - For specific details on the information provided in the json - output, please run <command>qpdf --json-help</command>. - </para> + <listitem> + <para> + The image information included in the <literal>page</literal> + section of the json output includes the key + “<literal>filterable</literal>”. Note that the + value of this field may depend on the + <option>--decode-level</option> that you invoke qpdf with. The + json output includes a top-level key + “<literal>parameters</literal>” that indicates the + decode level used for computing whether a stream was + filterable. For example, jpeg images will be shown as not + filterable by default, but they will be shown as filterable if + you run <command>qpdf --json --decode-level=all</command>.
+ </para> + </listitem> + </itemizedlist> + </para> + </sect1> </chapter> <chapter id="ref.design"> <title>Design and Library Notes</title> |
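The reverse conversion described in the "Limitations of JSON Representation" text of this diff could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not qpdf code: the function name `json_string_to_pdf_bytes` and the table `PDF_DOC_REVERSE` are hypothetical, and the table holds only the one mapping the manual mentions (bullet, U+2022, back to byte `80`). A real tool would need the complete PDF doc encoding table, and, as the manual notes, the result is ambiguous for strings that were originally UTF-16.

```python
import json

# Hypothetical, deliberately partial reverse table: Unicode code point ->
# PDF doc encoding byte. Only the bullet entry comes from the manual text.
PDF_DOC_REVERSE = {0x2022: 0x80}


def json_string_to_pdf_bytes(text: str) -> bytes:
    """Map a string taken from qpdf's json output back to raw bytes.

    Follows the manual's description: code points up to 0xFF are kept as
    single bytes, and higher code points are looked up in the reverse
    PDF doc encoding table. Assumes the string was NOT UTF-16 originally.
    """
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if code_point <= 0xFF:
            out.append(code_point)
        elif code_point in PDF_DOC_REVERSE:
            out.append(PDF_DOC_REVERSE[code_point])
        else:
            # Not in the (partial) table: the string was probably real
            # Unicode text, or the table needs more entries.
            raise ValueError(f"no PDF doc encoding byte for U+{code_point:04X}")
    return bytes(out)


# The manual's example: the binary string <038051> appears in json as \u0003•Q.
decoded = json.loads('"\\u0003\\u2022Q"')
print(json_string_to_pdf_bytes(decoded).hex())  # prints 038051
```

For anything beyond experimentation, the manual's own advice applies: `--show-object` reveals the string as encoded in the original file and avoids this guesswork.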