The Quest for the Fastest GeoIP Lookup
(featuring pledge and unveil)

Written November, 2019

I recently found myself needing to perform GeoIP lookups on huge lists of IP addresses while operating in a memory-constrained cloud environment. After several days of testing, researching, and evaluating numerous options, I believe I've stumbled upon the world's fastest GeoIP lookup solution.

I had a couple of primary objectives to fulfill in finding a suitable GeoIP solution. First and foremost, I was looking for a solution that would allow me to do IPv4 and IPv6 queries against the freely available Maxmind City database without requiring protocol-specific syntax (i.e. the tool shouldn't require different command syntax for looking up an IPv4 or IPv6 address; I should be able to feed it a mixed list of addresses and have the lookups performed without issue). The second objective was performance. I needed to be able to perform lookups on several million or more addresses at least twice an hour.

My operating environment was fairly constrained. I was going to be performing the GeoIP lookups on one of Vultr's $3.50/month 1-core, 512MB cloud instances running OpenBSD. The machine is used to parse the logs from an httpd server cluster and gather analytics from them (information such as website traffic stats, bandwidth usage, hits per country and/or city, etc.).

First thought: check the ports tree

My first thought was to try using the "GeoIP" tool in ports. Since the tool was written in C, I figured it would be blazing fast. The only problem is that the tool will only accept a single address as an argument, and to add insult to injury, it requires protocol-specific syntax. Additionally, it only supports the old (now deprecated) Maxmind database format. This obviously wasn't going to work, as I needed a tool that supports the new GeoIP2 Maxmind databases that come in .mmdb format. Just for fun, I decided to see how well the tool ran when run in a simple for loop in ksh.

Performance was predictably horrible. There was massive overhead in running the tool, opening the database, performing a single lookup, and then closing everything down before repeating for the next address. The reaper kernel process was running hard, and top reported that most of my CPU cycles were being eaten up by kernel processes (busy allocating and freeing memory from repeatedly opening and closing the database and starting and quitting the utility), so no real work was being done. On a lookup test against a list of 100,000 addresses, the script ran for over 20 minutes before I killed it.

Second thought: when in doubt, run it from base

My next thought was to try performing the lookups using the OpenBSD community's favorite Swiss Army knife: Perl. I looked online and the Maxmind folks have an officially supported Perl module for doing lookups called GeoIP2-perl. On Maxmind's GitHub page, they state the module is deprecated and will only receive security and major bug fixes. For fun, I decided to give it a whirl.

The Perl module appears to be able to run an optimized C lookup process using MaxMind::DB::Reader::XS. Unfortunately, it failed to install correctly, complaining about compilation failures. This was attempted a couple of months ago, so my memory of the issue is foggy. All I know is that MaxMind::DB::Reader was forced to fall back to the pure Perl lookup process (while also pulling in a terrifyingly large number of dependencies that took 45 minutes to install). I wasn't really in the mood to do a bunch of debugging, as I was just browsing and evaluating my options; if I couldn't get the module to install easily, I wasn't going to waste time wrestling with it. Since I couldn't get it to work reasonably well out of the box, there was no way I was throwing it into production, especially since the module was deprecated anyway, but I was still interested in giving it a rough benchmark.

For fun, I spent a few minutes cobbling a script together so I could time the lookup against the same 100,000 address list I ran the GeoIP C tool against. After confirming I wasn't thrashing the kernel (top reported >98% userspace CPU time) I decided to let the script run its full course to see just how long the lookup would take. I left to grab some dinner, and upon returning found the script took just under an hour to run (56 minutes). That lookup time was unacceptable, as I was going to need to be doing over a million lookups, twice an hour. This script was only able to perform roughly 29 lookups per second. If I wanted to hit my bare minimum requirement of 1,000,000 lookups every 30 minutes, I was going to need to be able to do at least 600 lookups per second.

A Brave New World

I now started a serious search for a performance-oriented lookup tool. I tried some solutions using Ruby and Go, but they were either slow or required protocol-specific syntax (or both). I was getting somewhat desperate. In my searching I kept hearing about a Maxmind mmdb reader called node-maxmind. It had everything I was looking for: it was able to look up addresses regardless of IP protocol version, and it was fast. Very fast.

I had avoided Node.js like the plague for years, as all I had heard about it was that it was a language for hipsters and a tool to make frontend web devs think they can write backend code. But oh lord was it fast. With a totally unoptimized script (it was the first Node.js script I had ever written) it was able to rip through my list of 100,000 addresses in roughly 45 seconds -- that's over 2,200 lookups per second, i.e. nearly 4x my minimum performance requirement!

I had never used Node before, and I felt reluctant to start, but it was by far the best option available. It had everything I needed, but I was still apprehensive considering npm and the Node.js ecosystem's spotty track record with regards to security and other endemic issues. I kind of felt like this guy.

Over several days of searching, I was unable to find something that came close to the performance levels of node-maxmind, so I decided to put my biases aside and dig into Node and give it a good go.

The first order of business was finding a way to feed node-maxmind a list of addresses without loading the whole list into memory. I was operating in a resource-constrained environment where every megabyte of RAM counted. Unfortunately, my script would load the whole address list into memory, begin swapping, and then promptly lock up the machine. This was no good. I needed to find a way to feed node-maxmind the address list in chunks.
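
My first cut looked roughly like this (a sketch from memory rather than the exact script): fs.readFileSync() pulls the entire address list into memory before the loop even starts, which is exactly what sent the little VPS into swap.

const fs = require('fs');
const maxmind = require('maxmind');

maxmind.open('/var/db/GeoIP/GeoLite2-City.mmdb').then((lookup) => {
        // readFileSync slurps the whole list at once -- fine for small lists,
        // a machine-killer on a 512MB instance
        const lines = fs.readFileSync('/tmp/GeoIP/iplist.txt', 'utf8').split('\n');
        for (const line of lines) {
                if (line.length) console.log(lookup.get(line));
        }
});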

I happened to stumble upon a function called "fileEachLine" that ships in a bundle of utilities called "pixl-tools". "fileEachLine" ended up being exactly what the doctor ordered. It loads the IP address list in chunks and feeds it to node-maxmind line by line using an asynchronous callback function. By default, fileEachLine uses 1KB chunks, but this is tunable. I tried changing the chunk size to a number of different values (all the way up to a 1MB chunk size), but saw no difference in performance, so I decided to run with the default chunk size.
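
For reference, the tuning experiment looked roughly like the sketch below, with the chunk size bumped up to 1MB. I'm going from memory on the option name (an options object with a buffer_size property, passed as an optional second argument), so double-check the pixl-tools documentation before relying on it.

// inside the maxmind.open() callback, with Tools = require('pixl-tools')
Tools.fileEachLine("/tmp/GeoIP/iplist.txt", { buffer_size: 1024 * 1024 },
        function(line, callback) {
                console.log(lookup.get(line));
                callback(); // fire callback for next line
        },
        function(err) {
                if (err) throw err; // all lines are complete
        }
);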

Victory

I now had my lookup solution using less than 100MB of memory, and doing over 2200 lookups per second on a dirt cheap $3.50/month Vultr VPS. I was now cooking with fire.

The node-maxmind author benchmarked his tool against other popular JavaScript GeoIP lookup tools, and he claims to be able to get over 600,000 lookups per second, but he gives no information as to what hardware or operating system the tests were performed on. The only information he provides regarding the benchmark tests is a link to his benchmarking script and a note that he ran the tests on node v8.4.0.

I tried running my script on one of Vultr's "High Frequency" VPSes with 2 CPUs, 4GB of RAM, and NVMe storage, and I was able to get just under 19,000 lookups per second, allowing me to rip through a list of 1,000,000 addresses in roughly 54 seconds. Benchmarking on a VPS is always a silly idea, so all my numbers are merely rough indicators of performance. I'm sure if I were to run my script on bare metal on a fast machine using a performance-oriented Linux, I could get the benchmark numbers far higher.
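
If you want to reproduce rough throughput numbers yourself, a quick way is to wrap a timestamp and a counter around the lookup loop, something like this (a sketch, not a rigorous benchmark):

const maxmind = require('maxmind');
const Tools = require('pixl-tools');

maxmind.open('/var/db/GeoIP/GeoLite2-City.mmdb').then((lookup) => {
        const start = Date.now();
        let count = 0;

        Tools.fileEachLine("/tmp/GeoIP/iplist.txt",
                function(line, callback) {
                        if (line.length) {
                                lookup.get(line); // do the lookup, discard the result
                                count++;
                        }
                        callback();
                },
                function(err) {
                        if (err) throw err;
                        const secs = (Date.now() - start) / 1000;
                        console.log(count + " lookups in " + secs.toFixed(1) +
                                "s (~" + Math.round(count / secs) + "/sec)");
                }
        );
});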

I could have happily stopped here, but I found a page with a list of languages with supported pledge and/or unveil bindings. I was pleasantly surprised to find out that Node.js has pledge and unveil bindings. This got me thinking about the principle of least privilege.

I was already running Node.js as an unprivileged user and blocking its network access using pf (block drop quick log user _nodejs), but now I wanted more. Using the Node.js unveil bindings, I was able to restrict the script to read-only access to "/var/db/GeoIP/", where I keep the Maxmind database, and the file "/tmp/GeoIP/iplist.txt", which contains the actual list of IP addresses I'm performing the lookup on.

The next step was pledging the script. OpenBSD developer Aaron Bieber (who is also the OpenBSD Node.js port maintainer) maintains the "node-pledge" module, which offers pledge bindings for JavaScript. I followed the provided examples and everything worked perfectly. He notes that the module is experimental and should be used at your own risk, but I've found it to be stable and working as advertised.

Interestingly, pledge has killed the script a couple of times for trying to use the "tty" pledge. I'm not sure why it would need tty access. I haven't added the "tty" pledge to my script because it doesn't seem to be necessary; when I have more time to dig into this, I will. At least I know pledge is doing its job. If somebody running this doesn't feel like putting up with that, you can easily add the "tty" pledge to the script.
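
Adding it is just a matter of appending "tty" to the promise strings passed to pledge.init() in the script shown in the next section:

pledge.init("stdio unveil rpath prot_exec tty"); // initial pledge, before the unveil calls
pledge.init("stdio rpath prot_exec tty");        // the re-pledge afterwards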

Show Me the Code!

The JavaScript code is quite simple and straightforward (about 25 lines including whitespace and comments).


$ cat maxmind.js

const maxmind = require('maxmind');
const unveil = require('openbsd-unveil');
const pledge = require('node-pledge');
const Tools = require('pixl-tools');

pledge.init("stdio unveil rpath prot_exec");

unveil('/tmp/GeoIP/iplist.txt', 'r'); // can only read this file
unveil('/var/db/GeoIP', 'r');         // can read files from here down

pledge.init("stdio rpath prot_exec"); // re-pledge to disallow further unveils

maxmind.open('/var/db/GeoIP/GeoLite2-City.mmdb').then((lookup) => {
        Tools.fileEachLine("/tmp/GeoIP/iplist.txt",
                function(line, callback) {
                        // this is fired for each line
                        console.log(lookup.get(line));
                        // fire callback for next line, pass error to abort
                        callback();
                },
                function(err) {
                        // all lines are complete
                        if (err) throw err;
                }
        );
});

A simple way to test it out would be:

Prerequisite: Make sure you download the GeoLite2-City.mmdb and put it in "/var/db/GeoIP/"

The necessary modules can be installed by running:


# npm install maxmind
# npm install pixl-tools
# npm install openbsd-unveil
# npm install node-pledge

A simple example script (assuming you have httpd logs on your machine):


$ cat test.sh

#!/bin/sh
mkdir -m 0700 /tmp/GeoIP
awk '{print $2}' /var/www/logs/access.log > /tmp/GeoIP/iplist.txt 
node maxmind.js

The script will output the database lookup information to stdout. From there, you can parse it using your tools of choice. I'm a minimalist, so I parse the output using a bit of grep and awk.
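
If you'd rather have the script emit something more grep- and awk-friendly than raw objects, you can pull individual fields out of the lookup result directly. The sketch below replaces the console.log(lookup.get(line)) line in maxmind.js; the field names used (country.iso_code, city.names.en) are what the GeoLite2 City database exposes, as far as I'm aware.

// inside the fileEachLine iterator, instead of dumping the whole object:
const result = lookup.get(line);
if (result) {
        const country = result.country ? result.country.iso_code : '-';
        const city = (result.city && result.city.names) ? result.city.names.en : '-';
        console.log(line + ' ' + country + ' ' + city);
}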

Conclusion

That's all there really is to it folks. A blazing fast, simple, easy to use GeoIP lookup solution that runs pledged and unveiled.