We continue the series of interesting blog posts written by industry experts, including marketers, analysts, developers, and other professionals.
NetSPI recently launched its Attack Surface Management offering to continuously monitor and penetration test networks for the world’s most prominent organizations. To help identify new and changing IP addresses across our clients’ global attack surfaces, we partnered with IPinfo to ingest their Geolocation and Company data sets.
After getting our API tokens, we were faced with the challenge of architecting and building an ingestion pipeline capable of handling over 70 million rows of data on a daily basis. In this blog, I’ll outline the initial approach we took, as well as the more performant approach suggested to us by IPinfo’s CEO and Founder, Ben Dowling.
The road not taken
Our goal was simple: Take IPInfo’s Geolocation data set, import it into NetSPI’s Attack Surface Management database, and denormalize the location data onto our own ip_address table. Let’s assume our table schemas look like this:
CREATE TABLE ip_address (
id bigserial primary key,
ip inet,
country text
);
CREATE TABLE ip_location (
start_ip inet,
end_ip inet,
join_key inet,
city text,
region text,
country text,
latitude text,
longitude text,
postal_code text,
timezone text
);
Ingesting the data can be done through a simple bash command:
curl -sL https://ipinfo.io/data/location.csv.gz?token=[IP_INFO_API_TOKEN] 2>&1 | gzip -d | psql postgresql://username:password@database:5432/postgres -c "COPY ip_location FROM STDIN DELIMITER ',' CSV HEADER"
This takes around 5 minutes on a db.m6g.2xlarge AWS RDS instance running Postgres 13.3. That’s a quarter-million writes per second!
Once we had the ip_location table filled with location data, we needed to seed our ip_address table with 100k fake IP addresses to simulate a real environment:
INSERT INTO ip_address (ip)
select CONCAT(
TRUNC(RANDOM() * 250 + 2),'.',
TRUNC(RANDOM() * 250 + 2),'.',
TRUNC(RANDOM() * 250 + 2),'.',
TRUNC(RANDOM() * 250 + 2)
)::INET
FROM generate_series(0, 100000);
I shall be telling this with a sigh
This is where our trouble begins. We wanted to take the data from ip_location and denormalize it onto the ip_address table, so we don’t pay the performance penalty at query time when searching IPs. Postgres optimization and indexing is a fickle beast, and the results depend on many configurations. Assuming a default AWS RDS configuration, the following unindexed query takes around 20 minutes:
UPDATE ip_address ip
SET country=ipl.country
FROM ip_location ipl
WHERE ip.ip BETWEEN ipl.start_ip AND ipl.end_ip;
With some simple indexes, this query will take around 4 minutes, but neither of these times is favorable and have many implications given the ip_address table is locked during the update. Experimenting with different indexing strategies (i.e. gist indexes) did not result in faster execution times.
There was one saving grace, which is the join_key provided as part of the IPinfo dataset. This key represents the Class C network each IP address is part of, allowing you to filter the result set significantly before filtering to the exact IP address you want. Our modified query for that is as follows:
UPDATE ip_address ip
SET country=ipl.country
FROM ip_location ipl
WHERE host(set_masklen(ip::cidr, 16))::inet = ipl.join_key
AND ip.ip BETWEEN ipl.start_ip and ipl.end_ip;
Both indexed and unindexed executions of this query take around 2.5 minutes. The join_key column minimizes the result set so much that there are minor tweaks you can make with indexes to improve performance, but this is about as far as you can take it. While this solution is fast and somewhat concise, it doesn’t allow us to take advantage of some of the cool benefits Postgres provides, like 3rd party extensions.
And that has made all the difference
At this point, we asked Ben if he’d look at our solution. Before he even saw our implementation, he told us, “If you’re using Postgres, you’ll want the IP4R extension (RDS has support) and add gist indexes to iprange data types.”
We had somehow missed the IP4R extension in our research, which is a Postgres extension that adds advanced support for IP address ranges and indexes. This was a perfect fit because all the IPinfo data includes a start_ip and end_ip constituting an IP range.
There were very few changes we had to make to our original schema — mainly converting inet columns to ipaddress columns, and adding a generated ip_range column to the ip_location table:
create extension ip4r;
create table ip_address
(
id bigserial primary key,
ip ipaddress,
country text
);
create table ip_location
(
start_ip ipaddress,
end_ip ipaddress,
join_key ipaddress,
ip_range iprange generated always as (iprange(start_ip, end_ip)) stored,
city text,
region text,
country text,
latitude text,
longitude text,
postal_code text,
timezone text
);
The key here was the ip_range column, which generates an IP4R iprange type that we can index extremely performantly. Let’s take a look at our new baseline query:
UPDATE ip_address ip
SET country=ipl.country
FROM ip_location ipl
WHERE ipl.ip_range && ip.ip;
Unindexed, this query takes an unimpressive handful of minutes, however, IP4R will now take proper advantage of GIST indexes on ipranges:
create index ip_location_ip_range on ip_location using gist (ip_range);
This reduced our execution time to 24 seconds – a 72x performance improvement over our initial query! This also increased single IP lookup time from 20 seconds with our first schema to 4 milliseconds with IP4R.
IP4R has been an exciting new addition to our database and is surely the road less traveled for us, but the benefits are clear. I’d like to thank Ben, Augustin, and the entire IPinfo team for their insight and service — it’s made all the difference. We’re looking forward to all the ways we can gain new insights from this data to deliver the best attack surface management capabilities to NetSPI’s customers.
Try IPinfo’s geolocation data for yourself. Sign up for our free trial - up to 50,000 free geolocation lookups per month.
This blog post is written by Jake Reynolds from NetSPI. Jake is the Head of Emerging Technology at NetSPI, a penetration testing company that is transforming the cyber security testing industry with tech-enabled services and prevention-based cyber security techniques.