Surprise, surprise: my online metadata actually reveals where I’ve been

In an attempt to simulate the NSA's capability, Ars tracks its own editor for 11 days.

Here are some of the places that Ars tracked Cyrus Farivar to in February 2014.
Cyrus Farivar

In January 2014, documents provided by Edward Snowden showed that a Canadian spy agency used a unique identifier to follow thousands of Canadians as they moved about the country. The tracking all originated from an unnamed airport.

It got us thinking: how hard would it be to replicate this little experiment, writ small? Could I use one of my own online identifiers as a way to track my own movements through time and space?

The answer, perhaps unsurprisingly, is: yes. It’s easy to do, and it’s revealing about what I do, when I do it, and where I go.

Like many other websites, Ars Technica employs a system of voluntary user logins. These logins allow you to do things like leave comments at the bottom of every story and engage in our user forums. Each time you log in to Ars, we record the date, time, and IP address that you logged in from. This is a common practice: nearly every website maintains similar records. Typically, though, Ars keeps only one record per user: the date, time, and IP address of the most recent login. We do not keep any historical records of login data.

However, Ars lead developer Lee Aylward was kind enough to make an exception—me. For 11 days in February 2014, Ars tracked all of my logins. The working theory was that since I’m telling Ars who I am (my login name is the frequently used and obvious “cfarivar”) and loading the site multiple times per day, my logins would actually give Ars a clear idea of my actions and movements.

In turn, I sent this 11-day log along to Nicholas Weaver, a computer security researcher at the International Computer Science Institute based in Berkeley, California. It took Weaver only a short time to write a Python script that parsed the raw CSV data file (including its Unix timestamps). The file contained lines like this:

1392056430,335607,[IP REDACTED],/wp-admin/post.php?post=410003&action=edit&message=10

And Weaver's creation could turn it into something much more human-readable, like this:

Between Fri Feb 14 07:53:51 2014 and Fri Feb 14 10:58:58 2014 at SecuredServers.com

That means the Ars logs showed I was editing a particular story for about three hours on the morning of February 14, most likely while connected through Private Internet Access (PIA), the commercial VPN that I frequently use. Normally, for privacy reasons, I use PIA to obscure my tracks online. While I tried leaving it off for the purposes of this experiment, sometimes I left it on by accident. That turned out to be useful, allowing us to see what it looks like when online origins are obscured.
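To make the conversion concrete, here is a minimal sketch of how such a script might work. It is not Weaver's actual code. It assumes the four columns are a Unix timestamp, some record or user ID, an IP address, and a URL; it assumes the times should be read in Pacific Standard Time; and it relies on a hypothetical `network_names` table that maps an IP address to the name of the network that owns it (in practice that name would come from a reverse DNS or WHOIS lookup).

```python
import csv
import io
from datetime import datetime, timedelta, timezone

# Assumption: times are shown in US Pacific Standard Time (UTC-8),
# which applies in February when daylight saving is not in effect.
PST = timezone(timedelta(hours=-8))

TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def summarize(csv_text, network_names):
    """Collapse consecutive log rows from the same network into time spans.

    network_names is a hypothetical IP -> network-name lookup table.
    IPs missing from the table are reported as the raw address.
    """
    spans = []  # each entry is [network, first_timestamp, last_timestamp]
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        timestamp = int(row[0])  # column 0: Unix time in seconds
        ip = row[2]              # column 2: IP address (column 1 is unused here)
        network = network_names.get(ip, ip)
        if spans and spans[-1][0] == network:
            spans[-1][2] = timestamp  # same network: extend the current span
        else:
            spans.append([network, timestamp, timestamp])

    lines = []
    for network, start, end in spans:
        start_text = datetime.fromtimestamp(start, PST).strftime(TIME_FORMAT)
        end_text = datetime.fromtimestamp(end, PST).strftime(TIME_FORMAT)
        lines.append("Between {} and {} at {}".format(start_text, end_text, network))
    return lines
```

Fed a few rows that share one IP address, `summarize` returns a single "Between ... and ... at ..." line covering the first and last request from that network, which is the shape of the summary shown above.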

Home is where the data is

Looking at the raw data and the script's cleaned-up output on my own, there were a few things that seemed obvious: first, it showed when I started and ended my work day. Some days, I was logged into Ars as early as 4:14am (February 13) and was active as late as 9:30pm (February 16). But generally speaking, I was consistently online by about 7am and ended around 5pm. There was then a gap of a few hours (I knew this was for dinner) and sometimes a check-in again before calling it a night.
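Spotting the start and end of a work day takes very little code. The sketch below is an illustration, not the analysis Weaver ran: it assumes a plain list of Unix timestamps and a fixed Pacific Standard Time offset, and it reports the first and last activity seen on each calendar day.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-8 offset (Pacific Standard Time in February).
PST = timezone(timedelta(hours=-8))


def daily_bounds(timestamps):
    """Return {date: (first_seen, last_seen)} as HH:MM strings per local day."""
    days = {}
    for ts in timestamps:
        local = datetime.fromtimestamp(ts, PST)
        first, last = days.get(local.date(), (local, local))
        days[local.date()] = (min(first, local), max(last, local))
    return {
        day.isoformat(): (first.strftime("%H:%M"), last.strftime("%H:%M"))
        for day, (first, last) in sorted(days.items())
    }
```

A long gap between two consecutive timestamps on the same day (dinner, in my case) could be found the same way, by sorting the timestamps and comparing neighbors.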

Second, the data showed physical places that I knew I visited in the Bay Area: a particular San Francisco office building, an Oakland café, and the University of California, Berkeley, campus.

But Weaver’s analysis was far cleverer than I expected.

“I assumed you worked at home, because you had a residential Comcast IP address,” Weaver said. (He’s right: like nearly all of us at Ars, I work primarily at home.)

I didn’t realize that Comcast distinguishes business from residential accounts in the hostnames attached to its IP addresses. Anything that shows up as comcast.net is a residence, while anything that shows up as comcastbusiness.net is likely a business. (Of course, anyone can sign up for a “Business-class” account at home, like Ars editor Lee Hutchinson, but most people don’t go that route.)
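That rule of thumb is simple enough to write down. The following is a rough sketch under the assumption that the hostname for an IP address has already been looked up (Python's `socket.gethostbyaddr` can do a reverse DNS lookup); the function name and the hostnames in the usage note are made up for illustration.

```python
def classify_comcast_host(hostname):
    """Guess the account type from a reverse-DNS hostname.

    This is only the heuristic described in the text: a business-class
    account installed at someone's home would still be labeled "business".
    """
    host = hostname.lower().rstrip(".")
    if host == "comcastbusiness.net" or host.endswith(".comcastbusiness.net"):
        return "business"
    if host == "comcast.net" or host.endswith(".comcast.net"):
        return "residential"
    return "unknown"
```

For example, `classify_comcast_host("host1.example.comcast.net")` returns `"residential"`, while a hostname ending in `.comcastbusiness.net` returns `"business"`. The domain check includes the leading dot so that an unrelated domain such as `notcomcast.net` is not misread as Comcast.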

Apparently, the original CSV file he used also contained the URL of each article I was viewing. “I knew what you were reading,” Weaver added. “That tells me what article you were working on, if you're reading old stuff it means you're looking for links.”

(Again he's right. If I’m pulling up the last three stories I wrote about Bitcoin, there’s a high likelihood that I was working on a new story on Bitcoin.)

“When VPN was active I could see that you were active, but not where,” he said.

"I am person X at this location."

The precision of the IP addresses was surprising. 

In one instance, on Thursday, February 6, at 9:30am, I was logged in at a particular San Francisco IP address. Looking up that IP on myip.ms turned up not only the city but also one of two possible street addresses. The lookup was again correct: on that particular day at that particular hour, I was conducting an interview with Boxbee CEO Kristoph Matthews at The Hatchery, a co-working space and startup incubator at 645 Harrison Street, in San Francisco’s South of Market district.

“If I was Google doing this analysis or the [National Security Agency], I would already have a large database as to what [building corresponds] to this IP address, or what all the information I know about [that IP] is,” Weaver added. “Once you have that, you have a much richer suite of options; you might even know which building [you were in].” (Lots of companies are already doing this, creating physical maps influenced by the location of known, fixed Wi-Fi networks.)

Weaver explained that a stronger and more persistent adversary, like the NSA, would have a far longer-term and more comprehensive data set. Data sets like that would include information from plenty of sites beyond Ars.

“Facebook knows if you hit any page that has a Like button on it,” he said. “Same with TweetThis, unless the site goes out of the way to mask them, then these are specifically reporting them to social networks. This is why NSA loves it, is because they can go along for the ride.

“One thing that we know that the NSA does on their non-US wiretaps is bind usernames to cookies, so if you see a request for LinkedIn or YouTube or Yahoo, these are all sites that have user ID in the clear. All you need to do is see a request, and say I don't know who this is or I know who this is, but then you look at the HTML body and look for the username. This is why the NSA went after Google ad networks; they include user identification [broadcast] in the clear: ‘I am person X at this location.’”

Despite the vast amount of data, it's just as easy to store as it is to interpret. “It works out to only a few kilobytes per person for everyone on the planet,” Weaver added. In other words, if I had the access, it'd cost just a few thousand dollars to have enough consumer-grade storage to keep data on everyone in the United States. It would comfortably fit on my desk.
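A back-of-the-envelope calculation shows how small the totals are. The numbers below are assumptions for illustration: "a few kilobytes" is taken as 4,000 bytes per person, and the population figures are rough 2014 estimates (about 318 million for the United States and 7.2 billion for the world).

```python
# Assumed inputs, chosen for illustration only.
BYTES_PER_PERSON = 4_000          # "a few kilobytes"
US_POPULATION = 318_000_000       # rough 2014 estimate
WORLD_POPULATION = 7_200_000_000  # rough 2014 estimate


def storage_terabytes(people, bytes_per_person=BYTES_PER_PERSON):
    """Total storage in decimal terabytes (1 TB = 10**12 bytes)."""
    return people * bytes_per_person / 1e12


us_total = storage_terabytes(US_POPULATION)        # roughly 1.3 TB
world_total = storage_terabytes(WORLD_POPULATION)  # roughly 29 TB
```

Under those assumptions the United States fits in a little over one terabyte and the whole planet in a few dozen, which is well within reach of ordinary desktop hard drives.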

Metadata is surveillance

There was good news from this exercise. Mainly, the digital obfuscatory tools I normally run did help mask my online trail.

Generally speaking, I run all kinds of anti-tracking software on my browser: constant private mode, Ghostery, Disconnect, and my VPN. (I also have Tor and use it occasionally.) The VPN, of course, concealed my location but not my activity: I was still clearly logged in to Ars. And Weaver said, yes, these tools do help to thwart tracking to some degree.

“The biggest reason why the NSA thinks Tor stinks is that it's actually really hard to link user activity to people,” he said. “Because the [Tor browser] bundle operates [by default] not storing cookies and [doesn’t allow Flash]. The browser bundle is allowed to not have linkages across sessions. Every time you exit the tor browser it looks like a new user. Normal browsers are not set with clear all cookies. The real fault lies in the architecture of the Web. The Web is designed [to allow the] business model of tracking. If you have your browser set to clear cookies every time you quit, it really helps. Tor is overkill; your single hop VPN is still bouncing all over the place.”

As many privacy activists and security researchers have long noted, free products turn their customers into the product. Google and Facebook are among the biggest companies that make billions of dollars by tracking their users' behavior and selling ads against that behavior. But even my work account has the potential to be mined for data.

“[Your Ars log] didn't tell me anything new about your site, but it does tell me about your workflow. It tells me where you go and when you're active,” Weaver concluded. “This is why everybody says metadata is surveillance.”
