Mapping Twitter with NodeXL and Gephi

May 9, 2016

Understanding relationships on social media is a key, but largely unexplored, area of digital strategy. Figuring out who matters to the conversations you’re joining online is hard mostly because it’s time-consuming: hundreds of intern-hours spent diving into retweets and research to work out who is important. Turns out, there’s a better way.

Tools

You’ll need NodeXL Pro for data acquisition, and Gephi for visualizing the data. Grab ’em. Yes, NodeXL Pro is a bit expensive unless you’re a student.

Limitations

Twitter doesn’t give out full firehose access, but instead a sample of all tweets for a search. The details on how the sampling mechanism works are sparse, but you work with what you’ve got. Gnip, Twitter’s enterprise API platform, will sell you access to the “decahose,” 10% of all tweets, so scale your assumptions about what the regular search API can see based on that.

Process

This is a two-step process: acquiring the data, then visualizing it. For the first part we’ll use NodeXL, and for the second we’ll send the data into Gephi.

Data Acquisition

Once you’ve downloaded and installed the NodeXL template, open it and you’ll be presented with a screen similar to this:

Click Import in the top left, and select From Twitter Search Network. There are a ton of other options here that are probably tempting, but we’re sticking with Twitter for now. Selecting Twitter Search Network will get you another screen, like so:

Before you get started the first time, you’ll have to allow NodeXL to access your Twitter account in the bottom left. Since it’s pulling straight from the Twitter API, it needs a key to attach to. On later runs, just select the option saying you’ve already authorized it.

This is where you’ll enter the search term you’re interested in analyzing, and what to import. You can use any of the operators listed here in NodeXL as well, because you’ll likely want to start figuring out how to filter out data, rather than capturing everything.

Unless you know you’re looking at a small search term, you’ll almost always want to import the “basic network,” rather than “basic network plus friends.” The second option adds at least an order of magnitude to the amount of data you’re going to pull, and since Twitter rate-limits these requests (15 per account per 15-minute window, counted from the first request), you’ll be adding significant time overhead to the pull.
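To get a feel for that overhead, here is a rough back-of-the-envelope sketch in Python. It assumes one friends-list request per user and the 15-requests-per-15-minute limit mentioned above. The function name and the one-request-per-user simplification are mine for illustration, not a description of what NodeXL actually does under the hood.

```python
import math


def estimated_wait_minutes(num_users, requests_per_window=15, window_minutes=15):
    """Rough lower bound on time spent waiting on rate limits.

    Assumes (hypothetically) one API request per user, and that the
    first window's worth of requests goes through immediately.
    """
    if num_users <= 0:
        return 0
    windows_needed = math.ceil(num_users / requests_per_window)
    # The first window is "free"; every later window costs a full wait.
    return (windows_needed - 1) * window_minutes


if __name__ == "__main__":
    for users in (15, 500, 5000):
        minutes = estimated_wait_minutes(users)
        print(f"{users} users: at least {minutes} minutes ({minutes / 60:.1f} hours)")
```

Under these assumptions, a 5,000-user network works out to more than three days of waiting, which is why the “basic network” is usually the sane choice.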

Short of having a friend at Twitter engineering whitelist your account for rate limits, or hooking up with Gnip (why are you reading this, then?), the “basic network” is overwhelmingly going to be the option to choose, especially on major trending topics.

While NodeXL chugs, grab a cup of coffee. You’ve got some time. Once it’s finished, you’ll have a screen that looks something like this:

Spoilers: @barackobama is central to the #doyourjob hashtag.

There are some options for graphing and summarizing the data as-is, but we’re going to move the data out of NodeXL to take a better look at it. If you’re using the Basic version, this is where you’re on your own; you can make visualizations in the template itself, but exporting is a Pro feature. NodeXL Basic is pretty powerful, so don’t think it’s totally game over, but the rest of this will require NodeXL Pro.

Hit “Export” in the top left, and choose “To GraphML File.” Name it something memorable and put it somewhere you’ll find it. We’re finished with NodeXL at this point. Say goodbye.

Data Visualization

Fire up Gephi, and open your new .graphml file, wherever you put it. Gephi will show an import report that calls out any issues it finds, along with the number of Nodes (users) and Edges (relationships) in the data. You want a directed graph here, since Twitter interactions are generally one-way; if you were mapping a network of Facebook friends, you’d probably use an undirected graph. Once you do that, you’ll be presented with something that looks like this:
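If you want to sanity-check the node and edge counts outside Gephi, a GraphML file is just XML, so Python’s standard library can read it. The sketch below parses a tiny made-up sample rather than a real NodeXL export (real exports carry many more attributes per node and edge, and the account names here are invented), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard GraphML namespace, mapped to a short prefix for lookups.
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Tiny stand-in for an exported file; the account names are made up.
SAMPLE_GRAPHML = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="directed">
    <node id="alice"/>
    <node id="bob"/>
    <node id="carol"/>
    <edge source="alice" target="bob"/>
    <edge source="carol" target="bob"/>
  </graph>
</graphml>
"""


def summarize_graphml(xml_text):
    """Return whether the graph is directed, plus node and edge counts."""
    root = ET.fromstring(xml_text)
    graph = root.find("g:graph", GRAPHML_NS)
    return {
        "directed": graph.get("edgedefault") == "directed",
        "nodes": len(graph.findall("g:node", GRAPHML_NS)),
        "edges": len(graph.findall("g:edge", GRAPHML_NS)),
    }


if __name__ == "__main__":
    print(summarize_graphml(SAMPLE_GRAPHML))
```

For a real file, you would read it from disk with ET.parse and call getroot() instead of using ET.fromstring on a string.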

Not super helpful, but that is all our data there, hiding in one little box. We’re going to change that. In the Layout box in the bottom left corner, select ForceAtlas 2 from the dropdown menu. The options presented in the dropdown menu are different ways of weighting the network into “neighborhoods,” and are largely a matter of taste. The math people are going to get mad at me over that. Hit Run, and magic:

While it’s running, on the right side, press Run next to Modularity and Eigenvector Centrality. This is all the fancy math we’ll have to do here, and the computer does all of it. The short version is that we’re using these tools to determine who is important in these networks in an algorithmic way, similar to Google’s PageRank, but for people.
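For the curious, the idea behind eigenvector centrality fits in a few lines: you are important if important accounts point at you. The sketch below is a plain power iteration over a toy edge list. It is my own minimal illustration, not Gephi’s implementation, which may normalize or iterate differently, and the node names are invented.

```python
import math


def eigenvector_centrality(edges, iterations=100, tolerance=1e-9):
    """Power iteration over a directed edge list of (source, target) pairs.

    A node's score is fed by the scores of the nodes pointing at it.
    """
    nodes = sorted({node for edge in edges for node in edge})
    scores = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Start from the old score (a common trick that helps convergence),
        # then add the score of every node that points at this one.
        new_scores = dict(scores)
        for source, target in edges:
            new_scores[target] += scores[source]
        # Rescale so the scores keep a fixed overall length.
        norm = math.sqrt(sum(value * value for value in new_scores.values()))
        new_scores = {node: value / norm for node, value in new_scores.items()}
        change = sum(abs(new_scores[node] - scores[node]) for node in nodes)
        scores = new_scores
        if change < tolerance:
            break
    return scores


# Toy network: three accounts mention "hub", and mentions loop back around.
TOY_EDGES = [
    ("a", "hub"), ("b", "hub"), ("c", "hub"),
    ("hub", "a"), ("a", "b"), ("b", "c"),
]

if __name__ == "__main__":
    for node, score in sorted(eigenvector_centrality(TOY_EDGES).items()):
        print(f"{node}: {score:.3f}")
```

In this toy network “hub” comes out on top because everyone points at it, and “a” outranks “b” and “c” because it is the only account the hub points back to.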

The last thing we want to do before we start manipulating the graph further is on the left: under the ForceAtlas 2 options, check Prevent Overlap and let it run a bit longer. This lets us see all the users that would otherwise be clustered so tightly as to be hiding under each other.

You’ll have something that looks like this, which is better, but we can still tweak it to be readable and useful. In the top left corner, where it says “appearance,” highlight the palette, as I’ve done here. In the dropdown menu available after clicking Attribute, select “Modularity Class,” and apply. This colors our chart by “neighborhoods,” like so:

In short, users of the same color are mostly related to each other. Our big pink cluster on the left for #doyourjob is, unsurprisingly, centered around @barackobama.
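What does “mostly related to each other” mean in numbers? Modularity is a score for how much more densely connected the members of each group are than you would expect by chance. The sketch below only scores a grouping you hand it, and treats edges as undirected for simplicity. Gephi’s Modularity tool does the harder job of searching for a good grouping (using the Louvain method), so take this as an illustration of the score, not of Gephi’s algorithm. The node and group names are made up.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q for an undirected, unweighted edge list."""
    m = len(edges)
    internal_edges = defaultdict(int)  # edges with both ends in the community
    degree_total = defaultdict(int)    # summed degree of the community's nodes
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        degree_total[ca] += 1
        degree_total[cb] += 1
        if ca == cb:
            internal_edges[ca] += 1
    return sum(
        internal_edges[c] / m - (degree_total[c] / (2 * m)) ** 2
        for c in degree_total
    )


# Two triangles joined by a single edge between "c" and "d".
TRIANGLE_EDGES = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
TWO_GROUPS = {"a": "pink", "b": "pink", "c": "pink",
              "d": "green", "e": "green", "f": "green"}
ONE_GROUP = {node: "all" for node in TWO_GROUPS}

if __name__ == "__main__":
    print(f"two groups: {modularity(TRIANGLE_EDGES, TWO_GROUPS):.3f}")
    print(f"one group:  {modularity(TRIANGLE_EDGES, ONE_GROUP):.3f}")
```

Splitting the two triangles into separate groups scores about 0.357, while lumping everyone into one group scores 0, which matches the intuition that the split captures real structure.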

The other thing we want to do to help understand the state of our network is to rescale size by importance to the network. As we have it visually, @barackobama is as important as I am to the network, and that’s…probably not the case.

To do this, select the icon to the right of the palette under appearance — three circles growing in size. Make sure you have “Nodes” selected under that, and in the dropdown menu under Attribute, select In-Degree. Apply that, and you’ll see some major changes to the size of particular nodes. You can scale this by choosing size minimum and maximum, if you want to really highlight some particular set of major contributors to the network.
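In-Degree is simply the number of edges pointing at a node, and the minimum and maximum size settings stretch those counts across a range of circle sizes. Here is a minimal sketch of both steps, assuming simple linear interpolation between the two sizes (Gephi may offer other interpolation options). The function names and the toy accounts are mine.

```python
from collections import Counter


def in_degrees(edges):
    """Count incoming edges per node for (source, target) pairs."""
    counts = Counter(target for _, target in edges)
    # Make sure nodes that nobody points at still show up, with zero.
    for source, _ in edges:
        counts.setdefault(source, 0)
    return dict(counts)


def scale_sizes(values, min_size=10.0, max_size=100.0):
    """Linearly map each value into the range [min_size, max_size]."""
    lowest, highest = min(values.values()), max(values.values())
    if highest == lowest:
        return {node: min_size for node in values}
    span = highest - lowest
    return {
        node: min_size + (value - lowest) / span * (max_size - min_size)
        for node, value in values.items()
    }


# Toy network: three fans mention "hub", and the hub replies to one of them.
FAN_EDGES = [("fan1", "hub"), ("fan2", "hub"), ("fan3", "hub"), ("hub", "fan1")]

if __name__ == "__main__":
    degrees = in_degrees(FAN_EDGES)
    for node, size in sorted(scale_sizes(degrees).items()):
        print(f"{node}: in-degree {degrees[node]}, size {size:.0f}")
```

Widening the gap between the minimum and maximum size exaggerates the difference between the hub and everyone else, which is the “really highlight” effect described above.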

This is where I should highlight that there are tons of tweaks you can make at this point. You can color relationships by type (e.g., mention, tweet, @reply) by selecting the palette under Appearance, choosing Edges and Attribute, and selecting Relationship from the dropdown. Not interested in “neighborhoods,” but rather in where the Verified accounts are in your network? Go back to the palette, and under Nodes and Attribute, select Verified from the dropdown menu and apply. There are tons of options here, and many of the statistical tools on the right side, where we selected Modularity and Eigenvector Centrality, add further ways to visualize and work with the network: PageRank, Average Path Length, and so on.

Finally, let’s throw some names on this so we can see who is who. Under the visualization there is a row of icons: a lightbulb, a camera, a bold T, and others. Click the T to turn on Node Labels, then in the Size Mode dropdown menu (the black A) select Node Size, and that’s about it.

Now that we have this, how do we read it? Remember how we scaled node size: by In-Degree above, though Eigenvector Centrality is available in the same dropdown once you’ve run it. Either way, “the size of a node is relatively scaled by how important it is to the overall network.”

Naturally, @barackobama is important, as are @senatedems, @chuckgrassley, @moveon, @POTUS, and @scotusnom. Additionally, @kellyayotte, @weneednine, @thedemocrats, @ofa and @speakerryan are all major components of the discussion.

What’s more interesting are the lower-tier entries, where you have random niche users that just love the hell out of a specific topic. These are the people you would focus on and reach out to so they keep carrying your message or, if you were doing this as a proactive exercise, reach out to preemptively to work on messaging for their highly motivated audience.

Like @natureguy101. Love @natureguy101, and in this case, he’s more important to the network than the New York Times or Ted Cruz:

First, and hopefully only edit: my first experience with Gephi and NodeXL came from Clara Guibourg’s great writeup. I’ve tried to update it with the most recent versions of the tools, and explain use cases.