Download PDFs from websites

Things I've learned trying to download websites as PDFs

2019-12-10

From the site is very straightforward. You just right click, print, and save as PDF. And that’s all the story there.

If you want to automate or get a PDF from a cli, then is a different.

You have puppeteer, the first tool I’ve used for downloading pdfs, as I was somewhat familiared with it. The thing was that I had to run a chrome instance, and wanted a lighter and simpler solution.

One thing I loved about downloading a PDF directly from the website, is that it let you respect the @media print query on css. So I’d avoid the unnecesary information, like sidebars, headers, … Instead, I’d only had the content as pdf, with the styles of the site. Awesome!

So now that I’m using pandoc, I want to use something like that to download the content from a website.

$ pandoc http://localhost:8000/posts/why-using-rss-feed.html -o out.pdf

Not so fast.

It throws me an error to install a pdf engine. Fair enough.

default –pdf-engine

On pandoc you have to download a pdf engine to be able to convert the html to pdf. It doens’t come up installed. So I downloaded MacTex (latex for mac or something like that).

Then I run the command:

$ pandoc http://localhost:8000/posts/why-using-rss-feed.html -o out.pdf

And it works! It was fast but…

It didn’t include the styles nor respect the @media print query to avoid unnecessary content.

Ok, lets try something else.

prince

Searching for a way include the styles on my downloads, I encounter prince. Simple to install and very straight forward on how to use it:

$ prince http://localhost:8000/posts/why-using-rss-feed.html --page-margin=10mm -o out.pdf

The output is as expected. It includes the styles and excludes the unnecessary content according to the @media print query. Awesome.

But there are a couple of things I don’t like.

Not completely free to use. Buying a license costs 500 dollars for 1 user on a desktop 🙃.
It adds an ugly logo on the pdf.
It doesn’t recognize css grid styles.
I had to install another program.

So I kept searching, and I remembered that with pandoc I had other pdf engine options.

wkhtmltopdf

That’s a real name.

Looking at the wkhtmltopdf config options for pandoc it lets you add --css to specify the css file. Great! The thing is that I needed to make a local reference to that file… I couldn’t reference it with http. But I can always curl and then run the command. Ok, another step, but nothing to crazy.

So I installed wkhtmltopdf from their site, and worked well:

$ curl http://localhost:8000/css/styles.css > styles.css
$ pandoc http://localhost:8000/posts/why-using-rss-feed.html --pdf-engine wkhtmltopdf --css styles.css -V margin-left=10 -V margin-top=10 -V margin-right=10 -V margin-bottom=10 -o out.pdf

Something to consider is that I had to specified to margins of every side for it to look ok, so the command got bigger.

The output was ok, it didn’t consider the css grid property but as with prince, specifying a margin was enought.

The last thing that was bothering me about this was that it keep including the extra info hidden by the print media query…

Looking at the wkhtmltopdf config, it specifies that you can give a --print-media-type option to make it load with the print media query.

I tried to use it from pandoc but didn’t work, so I saw to use wkhtmltopdf direclty…

$ wkhtmltopdf http://localhost:8000/posts/why-using-rss-feed.html --print-media-type out.pdf

Wow. Much more simple than using wkhtmltopdf though pandoc, who would say?

And I have all the things I want! The same output as prince but without paying 500 dolars.

Conclution

One thing I’ve learned with this, is to use the tools in the a more direct way, not though another tools that may limit the inferface. Like with pandoc and wkhtmltopdf.

And thats it. Now that I’ve figured out how to download websites as PDFs in a simple way, I’ll probably automate something for myself.

I recommend that if you’re going to use wkhtmltopdf make an alias or script with a better name.

Have a good reading!