author    Jay Berkenbilt <ejb@ql.org>  2019-12-30 15:17:05 +0100
committer Jay Berkenbilt <ejb@ql.org>  2020-01-13 15:18:36 +0100
commit    49f4600dd6feae74079ad3a3678f6a390bb4e3a1 (patch)
tree      156a1eec254ab858262d9408227b7f0d3431f39b /TODO
parent    0ae19c375ebc24c303765953ff127ecb6a4664dc (diff)
download  qpdf-49f4600dd6feae74079ad3a3678f6a390bb4e3a1.tar.zst
TODO: Move lexical stuff and add detail
Diffstat (limited to 'TODO')
-rw-r--r--  TODO  41
1 file changed, 18 insertions, 23 deletions
diff --git a/TODO b/TODO
index 9654059d..4e367cae 100644
--- a/TODO
+++ b/TODO
@@ -59,29 +59,6 @@ C++-11
time.
-Lexical
-=======
-
- * Make it possible to run the lexer (tokenizer) over a whole file
- such that the following things would be possible:
-
- * Rewrite fix-qdf in C++ so that there is no longer a runtime perl
- dependency
-
- * Make it possible to replace all strings in a file lexically even
on badly broken files. Ideally this should work with files that are
- lacking xref, have broken links, etc., and ideally it should work
- with encrypted files if possible. This should go through the
- streams and strings and replace them with fixed or random
- characters, preferably, but not necessarily, in a manner that
- works with fonts. One possibility would be to detect whether a
- string contains characters with normal encoding, and if so, use
- 0x41. If the string uses character maps, use 0x01. The output
- should otherwise be unrelated to the input. This could be built
- after the filtering and tokenizer rewrite and should be done in a
- manner that takes advantage of the other lexical features. This
- sanitizer should also clear metadata and replace images.
-
Page splitting/merging
======================
@@ -407,3 +384,21 @@ I find it useful to make reference to them in this list
* If I ever decide to make appearance stream-generation aware of
fonts or font metrics, see email from Tobias with Message-ID
<5C3C9C6C.8000102@thax.hardliners.org> dated 2019-01-14.
+
+ * Consider creating a sanitizer to make it easier for people to send
+ broken files. Now that we have json mode, this is probably no
+ longer worth doing. Here is the previous idea, possibly implemented
+ by making it possible to run the lexer (tokenizer) over a whole
+ file. Make it possible to replace all strings in a file lexically
even on badly broken files. Ideally this should work with files that
+ lacking xref, have broken links, etc., and ideally it should work
+ with encrypted files if possible. This should go through the
+ streams and strings and replace them with fixed or random
+ characters, preferably, but not necessarily, in a manner that works
+ with fonts. One possibility would be to detect whether a string
+ contains characters with normal encoding, and if so, use 0x41. If
+ the string uses character maps, use 0x01. The output should
+ otherwise be unrelated to the input. This could be built after the
+ filtering and tokenizer rewrite and should be done in a manner that
+ takes advantage of the other lexical features. This sanitizer
+ should also clear metadata and replace images.