Hacking with PDF (2022)

JKCalhoun · on Aug 17, 2024

FWIW, ages ago I wrote the PDFKit framework for the Mac (used by Preview and the built-in PDF viewer in Safari).

The only exploit listed here that has a chance of working with Preview/Safari (PDFKit) is the URI one — none of the Javascript exploits will work.

Why? I never implemented Javascript support [1].

Security was extremely important at Apple (there's a whole security team that frequently interacts with the various project owners around the company, writes and deploys file fuzzers, creates must-fix Radars around exploits found in the wild, etc.).

In fact, though, I had no idea how I would hoist a Javascript runtime, and I didn't really have the cycles to implement it even if I had known how. Anyway, we were content to support the 99% of PDFs out there.

[1] In fact there were a few US tax documents that used very simple Javascript snippets to take the values from two fields, add them, and put the result in a third. Some code in PDFKit I added would identify these few very simple patterns and implement them sans JS runtime.
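To illustrate the idea in the footnote (not PDFKit's actual code, which isn't shown here), a minimal sketch of recognizing one "add two fields" pattern without a JS runtime might look like the following. The script shape, the `apply_simple_sum` name, and the plain-dict field store are all assumptions made up for this example; the real patterns PDFKit matched are not given in the thread.

```python
import re

# Hypothetical pattern, e.g.:
#   this.getField("total").value = this.getField("a").value + this.getField("b").value;
_FIELD = r'(?:this\.)?getField\(\s*["\']([^"\']+)["\']\s*\)\.value'
_SUM_ASSIGN = re.compile(rf'^\s*{_FIELD}\s*=\s*{_FIELD}\s*\+\s*{_FIELD}\s*;?\s*$')


def apply_simple_sum(script, fields):
    """Apply a recognized 'c = a + b' script to `fields` in place.

    Returns True if the script matched the pattern, False otherwise.
    """
    match = _SUM_ASSIGN.match(script)
    if match is None:
        return False  # unknown script: ignore it, never execute it
    target, left, right = match.groups()
    # Treat missing or empty fields as 0.
    fields[target] = float(fields.get(left) or 0) + float(fields.get(right) or 0)
    return True


fields = {"wages": "100", "interest": "2.5"}
script = 'this.getField("total").value = this.getField("wages").value + this.getField("interest").value;'
apply_simple_sum(script, fields)
print(fields["total"])
```

Anything that doesn't match the whitelist of patterns is simply dropped, so there is no interpreter to attack.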

felipefar · on Aug 17, 2024

Nice job! I've been wanting to write a PDF parser for learning purposes, but have been put off by the quantity of files that open source PDF parsers have on their repos and the different tech that they need (image formats, compression formats, etc.). I'll probably settle for a reasonable ratio between PDFs supported/learning extracted from the project, so it's useful knowing that PDFs with JS are not very widely used.

Also, I'm the developer of a reference management software, and have naturally been thinking about what it'd take to save in the PDF file metadata fields that are generally useful for advanced readers and academics: original publication dates, ISBNs, DOIs, edition, publisher, etc., instead of just author and title.

gettalong · on Aug 17, 2024

You can get a long way with only implementing the most basic things of the PDF specification, like section 7. And even there you don't need everything. For example, there is no need to implement the CCITTFaxDecode, JBIG2Decode, DCTDecode or JPXDecode filters if you don't want to get at the raw pixels of the images.
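As a rough sketch of what that means in practice: FlateDecode is plain zlib/deflate, so a parser can decode it with a standard library and hand image data back untouched. The `decode_stream` function below is a hypothetical helper, and it assumes a single filter with no /DecodeParms (no predictors).

```python
import zlib

# Image filters whose data can be passed through untouched
# when the raw pixels are not needed.
IMAGE_FILTERS = {"CCITTFaxDecode", "JBIG2Decode", "DCTDecode", "JPXDecode"}


def decode_stream(raw, filter_name=None):
    """Decode the bytes of one PDF stream with at most one filter applied."""
    if filter_name is None:
        return raw
    if filter_name == "FlateDecode":
        return zlib.decompress(raw)
    if filter_name in IMAGE_FILTERS:
        return raw  # keep the encoded image data as-is
    raise NotImplementedError(f"filter not supported in this sketch: {filter_name}")


content = b"BT /F1 12 Tf (Hello) Tj ET"
print(decode_stream(zlib.compress(content), "FlateDecode"))
```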

Once you have parsing and writing of a simple PDF file going (sections 7.2, 7.3, 7.4, 7.5, 7.7), add in support for encryption (section 7.6). Now you are able to at least parse and write nearly all PDF files.

Then implement all the things you need gradually. For example:

* Need support for parsing or creating the contents of a page? -> sections 7.8, 8, and 9. Mind you, start out with only supporting the built-in PDF fonts for creating text and later add support for TrueType (easier) and OpenType (harder if you need to implement the font parser yourself).

* Need support for annotations? -> section 12.5

And so on.

If you just need to store the metadata in the PDF, you only need support for parsing and writing a PDF because this usually also entails that you can modify the PDF object tree which is needed for storing the metadata. However, if you need to store that metadata in a way that is usable for other PDF processors, you would need to store it as an XMP file and creating that is yet another deep dive if you don't have an XMP library available. See section 14.3.2 in the PDF spec for this (btw. the latest PDF spec is available at no cost at https://pdfa.org/resource/iso-32000-2/).
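For a feel of what the XMP side involves, here is a minimal sketch that builds a packet with a few Dublin Core properties using only the Python standard library. `build_xmp` is a hypothetical helper and the DOI is just a sample value. It only covers title, publisher and a single identifier; fields like edition or ISBN would need another schema or a custom namespace, which this sketch does not attempt. The resulting packet would still have to be written as a metadata stream referenced from the document catalog's /Metadata entry.

```python
import xml.etree.ElementTree as ET

NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
for _prefix, _uri in NS.items():
    ET.register_namespace(_prefix, _uri)


def q(prefix, name):
    """Return the qualified tag name ElementTree expects, e.g. {uri}title."""
    return f"{{{NS[prefix]}}}{name}"


def build_xmp(title, publisher, identifier):
    """Build an XMP packet string with dc:title, dc:publisher and dc:identifier."""
    root = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(root, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})
    # dc:title is a language alternative (rdf:Alt with an x-default entry).
    alt = ET.SubElement(ET.SubElement(desc, q("dc", "title")), q("rdf", "Alt"))
    ET.SubElement(alt, q("rdf", "li"), {XML_LANG: "x-default"}).text = title
    # dc:publisher is an unordered list (rdf:Bag).
    bag = ET.SubElement(ET.SubElement(desc, q("dc", "publisher")), q("rdf", "Bag"))
    ET.SubElement(bag, q("rdf", "li")).text = publisher
    # dc:identifier is a single text value.
    ET.SubElement(desc, q("dc", "identifier")).text = identifier
    body = ET.tostring(root, encoding="unicode")
    header = '<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    return f'{header}\n{body}\n<?xpacket end="w"?>'


print(build_xmp("An Example Paper", "Example Press", "doi:10.1000/182"))
```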

bla3 · on Aug 17, 2024

Do you know if it's still maintained? I have a bunch of PDFs where images don't show up in Preview. I filed bugs for them, but they're being ignored.

JKCalhoun · on Aug 18, 2024

No doubt it's still being maintained. Older frameworks, though, end up being a sort of "other thing" that some engineer gets saddled with, probably not the squeaky wheel among their bugs, sadly.

Personally curious though about the PDFs. If you could share a link I'm interested.

jahewson · on Aug 17, 2024

Nice! There was an exploit in iOS Messages found last year due to code that Apple had under license from Xpdf. I’ve wondered why Apple needed that when they already had PDFKit?

lysace · on Aug 17, 2024

PDFKit is awesome to use. Thanks!

jjbinx007 · on Aug 17, 2024

I've always held the opinion that viewing PDFs in something other than Adobe Acrobat gives the user more of a chance of avoiding such attacks... is there any credence to this or is it just wishful thinking?

unanimous · on Aug 17, 2024

I've tried creating a PDF Canarytoken [0] and opening it in a few applications not including Adobe Acrobat. None of them triggered the canary.

[0]: https://canarytokens.org/nest/

agumonkey · on Aug 17, 2024

Acrobat implements more features than, say, MuPDF, I assume. So in terms of attack surface it would be riskier. But maybe they have more security analysts too.

banku_brougham · on Aug 17, 2024

This is a great demo. I've been concerned about all these PDFs I like to read, and this gives me a little more confidence about tools to scan PDFs for attacks.
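For what it's worth, a very naive version of such a scan can be sketched in a few lines: count the PDF names that introduce scripts and actions in the raw bytes. The list of names and the `count_suspicious_names` helper are assumptions for illustration. This will miss anything hidden inside compressed object streams, so treat a zero count as "nothing obvious", not "safe".

```python
import re

# Names that introduce scripts or actions; the selection is an assumption.
SUSPICIOUS = (b"/JavaScript", b"/JS", b"/OpenAction", b"/AA",
              b"/Launch", b"/URI", b"/SubmitForm")


def _unescape_names(data):
    """Undo #xx escapes, which PDF names may use for any character."""
    return re.sub(rb"#([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), data)


def count_suspicious_names(data):
    """Count occurrences of each suspicious name in raw, uncompressed PDF bytes."""
    data = _unescape_names(data)
    counts = {}
    for name in SUSPICIOUS:
        # The lookahead keeps /JS from matching inside a longer name.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        counts[name.decode()] = len(re.findall(pattern, data))
    return counts


sample = b"<< /S /JavaScript /JS (app.alert(1)) >> << /S /URI /URI (http://example.com) >>"
print(count_suspicious_names(sample))
```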

nicolodev · on Aug 18, 2024

I’m writing a little tool for analysing a PDF and its internals. If the author or anyone else is interested, just let me know :)