lxml is fast enough
Given the blazing speed of Node.js these days, I expected HTML parsing to be faster on Node than on Python. So I compared lxml and htmlparser2, the fastest HTML parsing libraries for Python and JS respectively, on the reddit home page (~700KB).

- lxml took ~8.6 milliseconds
- htmlparser2 took ~14.5 milliseconds

Looks like lxml is much faster. I'm likely to stick with Python for pure HTML parsing (without JavaScript) for a while longer.

```python
In [1]: from lxml.html import parse

In [2]: %timeit tree = parse('reddit.html')
8.69 ms ± 190 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

```js
const { Parser } = require("htmlparser2");
const { DomHandler } = require("domhandler");
const fs = require("fs");

const html = fs.readFileSync("reddit.html", "utf8");

const start = Date.now();
for (let i = 0; i < 10; i++) {
  // Build a DOM with DomHandler so the work is comparable to lxml building a tree.
  const handler = new DomHandler((error, dom) => {
    if (error) throw error;
  });
  const parser = new Parser(handler);
  parser.write(html);
  parser.end();
}
const end = Date.now();

console.log((end - start) / 10); // average ms per parse
```

Note: If I run the htmlparser2 code 100 times instead of 10, it only takes ~7ms per loop. The more iterations I run, the faster each parse gets; I guess Node.js optimizes the repeated work after a warm-up. But I'm only interested in the first iteration, since I'll be parsing each file just once.
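If you want to see what that first iteration costs on its own, here's a minimal sketch (assuming the same reddit.html file) that times a single cold parse with Node's built-in perf_hooks timer:

```js
const { Parser } = require("htmlparser2");
const { DomHandler } = require("domhandler");
const fs = require("fs");
const { performance } = require("perf_hooks");

const html = fs.readFileSync("reddit.html", "utf8");

// Time a single cold parse, before any warm-up from repeated runs.
const handler = new DomHandler((error, dom) => {
  if (error) throw error;
});
const parser = new Parser(handler);

const start = performance.now();
parser.write(html);
parser.end();
const end = performance.now();

console.log(`first parse: ${(end - start).toFixed(1)} ms`);
```

performance.now() gives sub-millisecond resolution, which matters when you're timing a single parse rather than averaging over a loop.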