Home » Uncategorized

Python & JSON: Working with large datasets using Pandas

This article was posted by Vik Paruchuri. 

Introduction

Working with large JSON datasets can be a pain, particularly when they are too large to fit into memory. In cases like this, a combination of command line tools and Python can make for an efficient way to explore and analyze the data. In this post, we’ll look at how to leverage tools like Pandas to explore and map out police activity in Montgomery County, Maryland. We’ll start with a look at the JSON data, then segue into exploration and analysis of the JSON with Python.

When data is stored in SQL databases, it tends to follow a rigid structure that looks like a table. Here’s an example from a SQLite database:

id|code|name|area|area_land|area_water|population|population_growth|birth_rate|death_rate|migration_rate|created_at|updated_at 1|af|Afghanistan|652230|652230|0|32564342|2.32|38.57|13.89|1.51|2015-11-01 13:19:49.461734|2015-11-01 13:19:49.461734 2|al|Albania|28748|27398|1350|3029278|0.3|12.92|6.58|3.3|2015-11-01 13:19:54.431082|2015-11-01 13:19:54.431082 3|ag|Algeria|2381741|2381741|0|39542166|1.84|23.67|4.31|0.92|2015-11-01 13:19:59.961286|2015-11-01 13:19:59.961286 

As you can see, the data consists of rows and columns, where each column maps to a defined property, like id, or code. In the dataset above, each row represents a country, and each column represents some fact about that country.

But as the amount of data we capture increases, we often don’t know the exact structure of the data at the time we store it. This is called unstructured data. A good example is a list of events from visitors on a website. Here’s an example of a list of events sent to a server:

2808335550

As you can see, three separate events are listed above. Each event has different fields, and some of the fields are nested within other fields. This type of data is very hard to store in a regular SQL database. This unstructured data is often stored in a format called JavaScript Object Notation (JSON). JSON is a way to encode data structures like lists and dictionaries to strings that ensures that they are easily readable by machines. Even though JSON starts with the word Javascript, it’s actually just a format, and can be read by any language.

Python has great JSON support, with the json library. We can both convert lists and dictionaries to JSON, and convert strings to lists and dictionaries. JSON data looks much like a dictionary would in Python, with keys and values stored.

In this post, we’ll explore a JSON file on the command line, then import it into Python and work with it using Pandas.

Table of Contents:

  • The dataset
  • Exploring the JSON data
  • Extracting information on the columns
  • Extracting the data
  • Reading the data into Pandas
  • Converting columns
  • Subsetting the stops
  • Conclusion

To read the full article, click here

Top DSC Resources

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge