Using R to access 311 service request from NYC Open Data using Socrata Open Data API, and the RSocrata package
Today, I’m going to show you how to access 311 service request from NYC Open Data using Socrata Open Data API, R and the RSocrata package. In our example today, we’re going to look at service requests that were routed to the New York Police Department (NYPD) in 2020.
Before we begin, API stands for Application Programming Interface. From a data analytics viewpoint think about it as a way access and load data. One way we could have accessed the 311 service request data would have to download the data as a .csv file, then import it into R. However, after a quick glance 24.3 million rows and 41 columns means:
- this is a massive file.
2. It’ll contain way more information that we’re interested in (remember, we only want NYPD related requests)
3. Lastly and most importantly, 311 service requests will continue to come in, so if you plan to analyze any new data that will occur in the future, you’ll have to re-download the massive file each and every time you want a data refresh.
This is where API’s come in and shines. First, we’re going head over the 311 service request page and click on the “API” button:
This will give you a URL commonly referred to as a “API endpoint”. An API endpoint, is a URL that you will use to access the data.
Notice, the format defaulted to “JSON”, we’ll talk more about JSONS shortly. Go ahead and copy that URL and paste it into your browser and go to the page. Let’s see what happens.
It should look like a bunch of text and characters appear on the page; this is not an error nor is it random. It’s the 311 service request data we’re interested in in JSON format! Let’s take a closer look:
Your probably familiar with a typical dataframe that is comprised of columns and rows. Where each row would be a 311 service request. Well data that’s in JSON format, each 311 service request starts with a “{“ and ends with a “}”. Inside, it’s the name of the variable and the value that’s inside. In the image above, you’ll see that the first variable is “Unique Key” and the value is 48025021. The second variable is “Created_Date” and the value is “2020–10–31T02:01:41.000”.
In a traditional data fame, the data would like this:
In JSON it looks like this:
[{“unique_key”:”48025021",”created_date”:”2020–10–31T02:01:41.000"}]
Now that we have a better understanding of API and JSONS, let’s fire up R and get some data. We’re going to use the RSocrata package, so go ahead and install the package using the following code:
And loading the package:
Note: we’re using the RSocrata package because it automatically converts JSON data into a traditional data frame with columns and rows.
After the package is loaded up, we’re ready to grab the data and have RSocrata convert it to a dataframe. The code (function) for this is simple:
NYPD — is what I am naming my data frame.
Read.socrata — is the function that will pull the data and convert it to a dataframe.
The URL inside the parenthesis is the API endpoint copied from the 311 Service Request Open Data page
BEFORE YOU RUN THIS CODE:
Remember, this dataset has 24.3 million rows! This would take a very long time to download and process. So let’s take it to the next step and edit our API EndPoint.
Notice how I added a $limit=10 at the end of the API endpoint? This is going to limit the call to the 10 most recent rows. Go ahead, give it a try!
In my case, you’ll see that my data frame has six Department of Sanitation requests, three Department of Transportation requests and one for the Department of Parks and Recreation. But wait, we were interested in looking for NYPD service request.
In this case, all we needed to do was add in Agency=NYPD and you get the following data frame:
Now, let’s say we wanted to get only NYPD 311 service requests within a specific time period.
You’ll see that we changed the request a bit from the previous. Notice after the API endpoint, we add in the following code:
NYPD <- read.socrata(“https://data.cityofnewyork.us/resource/fhrw-4uyv.json?$where=created_date between ‘2020–01–01T12:00:00’ and ‘2020–11–01T14:00:00’ and agency = ‘NYPD’”)
We used the where function as well as the and function. The code above reads, “give me all 311 service requests made from January 1, 2020 to January 11, 2020 that were directed to the NYPD. There are many other functions to query your data such as: $select
, $order
, $having
etc. for a full list, visit the Socrata website for a list of queries.
Hopefully this is helpful in getting your feet wet with Socrata Open Data API, the RSocrata package, JSONS and R.