Surprise, surprise: my online metadata actually reveals where I’ve been

Here are some of the places that Ars tracked Cyrus Farivar to in February 2014.
Cyrus Farivar

In January 2014, documents provided by Edward Snowden showed that a Canadian spy agency used a unique identifier to follow thousands of Canadians as they moved about the country. The tracking all originated from an unnamed airport.

It got us thinking: how hard would it be to replicate this little experiment, writ small? Could I use one of my own online identifiers as a way to track my own movements through time and space?

Like many other websites, Ars Technica employs a system of voluntary user logins. These logins allow you to do things like leave comments at the bottom of every story and engage in our user forums. Each time you log in to Ars, we record the date, time, and IP address that you logged in from. This is a common practice: nearly every website maintains similar records. Typically, though, Ars keeps only one record per user: the date, time, and IP address of the most recent login. We do not keep any historical records of login data.

However, Ars lead developer Lee Aylward was kind enough to make an exception—me. For 11 days in February 2014, Ars tracked all of my logins. The working theory was that since I’m telling Ars who I am (my login name is the frequently used and obvious “cfarivar”) and loading the site multiple times per day, my logins would actually give Ars a clear idea of my actions and movements.

In turn, I sent this 11-day log along to Nicholas Weaver, a computer security researcher at the International Computer Science Institute in Berkeley, California. It took Weaver only a short time to write a Python script that converted the raw CSV data file (including its Unix timestamps) into a readable summary. The raw file contained lines like this:

1392056430,335607,[IP REDACTED],/wp-admin/post.php?post=410003&action=edit&message=10

And Weaver's creation could turn it into something much more human-readable, like this:

Between Fri Feb 14 07:53:51 2014 and Fri Feb 14 10:58:58 2014 at SecuredServers.com

That means Ars's logs showed I was editing a particular story for about three hours on the morning of February 14, likely while connected through Private Internet Access (PIA), the commercial VPN that I frequently use. Normally, for privacy reasons, I use PIA to obscure my tracks online. While I tried leaving it off for the purposes of this experiment, sometimes I left it on by accident. That turned out to be useful, allowing us to see what it looks like when online origins are obscured.
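
Weaver's actual script was not published, but the conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not his code. It assumes the four CSV columns are a Unix timestamp, a numeric ID, an IP address, and a request path. The `OWNER_BY_IP` table is a hypothetical stand-in for a real WHOIS or GeoIP lookup, the addresses come from reserved documentation ranges, and times are rendered in UTC.

```python
import csv
import io
import time
from itertools import groupby

# Hypothetical mapping from IP address to network owner. A real analysis
# would query a WHOIS or GeoIP source instead of a hard-coded table.
OWNER_BY_IP = {
    "192.0.2.10": "SecuredServers.com",
    "198.51.100.7": "Example Home ISP",
}


def summarize(csv_text):
    """Collapse consecutive requests from the same network into one line each."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        # Assumed column order: timestamp, numeric ID, IP address, path.
        timestamp, _numeric_id, ip, _path = row
        rows.append((int(timestamp), OWNER_BY_IP.get(ip, ip)))
    rows.sort()  # chronological order

    lines = []
    # groupby merges runs of adjacent rows that share the same owner.
    for owner, group in groupby(rows, key=lambda r: r[1]):
        stamps = [ts for ts, _ in group]
        start = time.asctime(time.gmtime(stamps[0]))
        end = time.asctime(time.gmtime(stamps[-1]))
        lines.append(f"Between {start} and {end} at {owner}")
    return lines
```

Feeding it a handful of rows that share one IP address yields a single "Between ... and ... at ..." line for that stretch, which is the shape of the output shown above.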

Home is where the data is

Looking at the raw data and the script's cleaned-up output on my own, a few things seemed obvious. First, the data showed when I started and ended my work day. Some days, I was logged into Ars as early as 4:14am (February 13) and was active as late as 9:30pm (February 16). But generally speaking, I was consistently online by about 7am and ended around 5pm. There was then a gap of a few hours (I knew this was for dinner) and sometimes a check-in again before calling it a night.
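
Pulling a daily rhythm out of such a log takes very little work. The sketch below is an illustration rather than anything Ars or Weaver ran: it takes a list of Unix timestamps and reports the first and last activity on each day. The function name `daily_bounds` is made up for this example, and it defaults to UTC, so a real analysis would pass the subject's local time zone.

```python
from collections import defaultdict
from datetime import datetime, timezone


def daily_bounds(timestamps, tz=timezone.utc):
    """Return {date: (first_time, last_time)} for a list of Unix timestamps."""
    by_day = defaultdict(list)
    for ts in timestamps:
        moment = datetime.fromtimestamp(ts, tz)
        by_day[moment.date()].append(moment.time())
    # The earliest and latest request on each day approximate the start
    # and end of that day's online activity.
    return {day: (min(times), max(times)) for day, times in sorted(by_day.items())}
```

Run over several days of logins, the per-day minimum and maximum are enough to show when someone usually starts and stops working.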

"I am person X at this location."

The precision of the IP addresses was surprising.

In one instance, on Thursday, February 6, at 9:30am, I was logged in at a particular San Francisco IP address. Looking up that IP on myip.ms turned up not only the city but also one of two possible street addresses. The search was again correct: on that particular day at that particular hour, I was conducting an interview with Boxbee CEO Kristoph Matthews at The Hatchery, a co-working space and startup incubator at 645 Harrison Street, in San Francisco’s South of Market district.

“If I was Google doing this analysis or the [National Security Agency], I would already have a large database as to what [building corresponds] to this IP address, or what all the information I know about [that IP] is,” Weaver added. “Once you have that, you have a much richer suite of options; you might even know which building [you were in].” (Lots of companies are already doing this, creating physical maps influenced by the location of known, fixed Wi-Fi networks.)
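
The kind of database Weaver describes can be pictured as a table of network blocks, each tagged with whatever is known about it. The following is a toy sketch under stated assumptions: the `KNOWN_NETWORKS` entries and their labels are invented, and the addresses are reserved documentation ranges rather than real ones. It uses Python's standard `ipaddress` module and prefers the most specific matching block.

```python
import ipaddress

# Hypothetical table of network blocks and what an observer knows about them.
KNOWN_NETWORKS = [
    (ipaddress.ip_network("192.0.2.0/24"), "city-level match (hypothetical)"),
    (ipaddress.ip_network("192.0.2.128/25"), "single building (hypothetical)"),
    (ipaddress.ip_network("203.0.113.0/24"), "commercial VPN exit (hypothetical)"),
]


def locate(ip_text):
    """Return the label of the most specific known block containing the IP."""
    addr = ipaddress.ip_address(ip_text)
    matches = [(net, label) for net, label in KNOWN_NETWORKS if addr in net]
    if not matches:
        return None  # nothing known about this address
    # A longer prefix means a smaller block, so it is the more precise answer.
    return max(matches, key=lambda m: m[0].prefixlen)[1]
```

With this table, `locate("192.0.2.200")` resolves to the building-level entry, while `locate("192.0.2.5")` falls back to the broader city-level one. An address behind a VPN resolves only to the VPN provider's block.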

Weaver explained that a stronger and more persistent adversary, like the NSA, would have a much longer-term and comprehensive data set. Data sets like that would include information from plenty of sites beyond Ars.

“Facebook knows if you hit any page that has a Like button on it,” he said. “Same with TweetThis, unless the site goes out of the way to mask them, then these are specifically reporting them to social networks. This is why NSA loves it, is because they can go along for the ride.

“One thing that we know that the NSA does on their non-US wiretaps is bind usernames to cookies, so if you see a request for LinkedIn or YouTube or Yahoo, these are all sites that have user ID in the clear. All you need to do is see a request, and say I don't know who this is or I know who this is, but then you look at the HTML body and look for the username. This is why the NSA went after Google ad networks; they include user identification [broadcast] in the clear: ‘I am person X at this location.’”
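
The binding step Weaver describes can be sketched roughly as follows. This is a simplified illustration, not a description of any real system: an observer sees pairs of cookie values and unencrypted page bodies, and records a username whenever one appears in the body. The `data-username` attribute and the function names are assumptions invented for this example.

```python
import re

# Hypothetical markup: assume the page embeds the logged-in username
# in an attribute like data-username="...".
USERNAME_PATTERN = re.compile(r'data-username="([^"]+)"')


def bind_cookies(observations):
    """Map cookie values to usernames seen in the corresponding page bodies.

    observations: iterable of (cookie_value, html_body) pairs.
    """
    bindings = {}
    for cookie, body in observations:
        match = USERNAME_PATTERN.search(body)
        if match:
            bindings[cookie] = match.group(1)
    return bindings


def identify(cookie, bindings):
    """Return the username bound to a cookie, or 'unknown' if none was seen."""
    return bindings.get(cookie, "unknown")
```

Once a cookie has been bound to a name, every later request carrying that cookie can be attributed to the same person, even when the page itself contains no username.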

Despite the vast amount of data, it's just as easy to store as it is to interpret. “It works out to only a few kilobytes per person for everyone on the planet,” Weaver added. In other words, if I had the access, it'd cost just a few thousand dollars to have enough consumer-grade storage to keep data on everyone in the United States. It would comfortably fit on my desk.
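
For a rough sense of scale, the arithmetic can be worked through directly. The inputs here are assumptions chosen for illustration and are not figures from Weaver: 3 kilobytes stands in for "a few kilobytes," and the populations are approximate 2014 values. A larger per-person record scales the totals proportionally.

```python
# Illustrative assumptions, not figures from the article:
KB_PER_PERSON = 3                  # stand-in for "a few kilobytes"
WORLD_POPULATION = 7_200_000_000   # approximate, 2014
US_POPULATION = 318_000_000        # approximate, 2014


def total_terabytes(people, kb_per_person):
    """Storage needed, in decimal terabytes (1 TB = 10**12 bytes)."""
    total_bytes = people * kb_per_person * 1000
    return total_bytes / 1e12


if __name__ == "__main__":
    print(f"World: {total_terabytes(WORLD_POPULATION, KB_PER_PERSON):.1f} TB")
    print(f"US:    {total_terabytes(US_POPULATION, KB_PER_PERSON):.2f} TB")
```

Under these assumptions the worldwide total is on the order of tens of terabytes and the US total is about one terabyte, both within reach of consumer-grade drives.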

Metadata is surveillance

There was good news from this exercise. Mainly, the digital obfuscatory tools I normally run did help mask my online trail.

Promoted Comments

Hat Monster (Ars Legatus Legionis)

I don't see how anyone can justify metadata as being somehow less than the data it is associated with. It's more, much more.

I can't readily use the content of your HTTP sessions to work out when you send them, from where and what site to and frankly, it's probably not interesting to me as a snoop. Likewise the content of your phone calls isn't interesting and is simply cumbersome nonsense. I want to know about them, not their content.

The important bit is indeed the metadata, the content is just noise. Where from? Who to? When? For how long? I can build up an entire picture of your life. Your search queries, another metadata term, are also very important. I can work out those little things about you that make you unique, once I have those, you're so much easier to trace.

Through metadata a full picture of your life emerges. I can infer your doctor, and probably your medical conditions. I can work out your routine, and set alerts if you deviate from it. I can tie you in with your social contacts from SMS and phone metadata, as well as rank them in order of importance. 20 minutes talking to a wedding planner service? Congratulations. 45 minutes on the phone to an employment lawyer? Guess things aren't working out too well at work.

I can run a PageRank-like algorithm over your phone records and everyone who you called, and their contacts, and get my own database of everything that makes you, you.

I may never get your name, but I will know everything about you, your job, what's happening in your life, and all your friends, acquaintances and colleagues.

And that's terrifying.


In an attempt to simulate the NSA's capability, Ars tracks its own editor for 11 days.