Calculating public burden using OIRA data -- Part Two

An experiment in using open data to make government better


Yesterday, I published an article about using open government data to hunt for paper-based information requests by the government. Based on the data, it looked like a lot of hours are still being spent filling out paper forms. As I noted, though, I ran out of time to do careful analysis. So today, let's dig deeper.

First, we'll create a histogram to look at how burden is distributed across requests. To do so, we'll load the results into pandas and use its plot.hist method.

In [1]:
# Set up the graphing environment. Because I'm using jupyter notebooks, first I need to tell
# it to show the graphs inline. I also use the `ggplot` style, because it's less hideous. 
%matplotlib inline
import matplotlib
matplotlib.style.use('ggplot')
In [2]:
# Load the results data from yesterday's article and plot the distribution
# of burden hours across all information collection requests.
import pandas as pd
data = pd.read_json('results.json')
data.burden.plot.hist()
Out[2]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fca4799d358>

Wait. Hold on right there. That's not what you'd expect to see: it looks like there's a huge outlier. Let's see what that might be. To do so, we'll look at the ten requests with the largest burdens.

In [3]:
data[["burden", "title"]] .sort_values('burden', ascending=False).head(10)
Out[3]:
         burden                                              title
720  2997500000                   U. S. Business Income Tax Return
729    48731780                       IRA Contribution Information
719    34115874        Form 1099-DIV--Dividends and Distributions
718    24951529  Return of Organization Exempt From Income Tax ...
248    20036012  2017-2018 Free Application for Federal Student...
509    13500230  National Fire Incident Reporting System (NFIRS...
717    10880812  Employer's Annual Tax Return for Agricultural ...
497     9902378                       Arrival and Departure Record
449     7736084  Physician Quality Reporting System (PQRS) (CMS...
713     7041290  Customer Due Diligence Requirements for Financ...

Oh dear. Looks like we've got a pretty obvious mistake here: "U.S. Business Income Tax Return" can definitely be filed electronically. Same with the other things on the list. And that one outlier accounts for 3 billion of the 3.3 billion hours. Oof. So what gives?
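
As a quick sanity check (a sketch, assuming the data frame from the cells above is still loaded), we can confirm how much of the total that single largest request represents:

total = data.burden.sum()
top = data.burden.max()
# Share of all burden hours attributable to the single largest request.
print("{:,.0f} of {:,.0f} hours ({:.0%})".format(top, total, top / total))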

Well, it turns out that the way OIRA displays the burden data is this: if any of the forms that are part of an information collection request is not available electronically, then the burden for all of the request's forms gets reported as one aggregate number. And unfortunately, there doesn't seem to be an obvious way to back the electronically-available forms out of that total. So those top-line numbers aren't very useful for our purposes.
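
To make that concrete, here's a toy illustration; the form names and flags are entirely made up, not real OIRA data:

import pandas as pd

# One hypothetical information collection request (ICR) that bundles three
# forms. OIRA publishes a single burden figure for the whole ICR.
icr_forms = pd.DataFrame({
    'form': ['Form A', 'Form B', 'Form C'],
    'electronic': [True, True, False],  # one paper-only form
})
# Because Form C is paper-only, the ICR's entire aggregated burden shows up
# in our "paper" results, and nothing in the published data lets us split
# that one number back out per form.
print(icr_forms)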

Let's see what the total burden is if you remove the top 20% of information collection requests. The dataset has roughly 1,100 of them, so that works out to dropping the 220 largest.

In [4]:
"{:,} hours".format(data.burden.sum() - data.sort_values('burden', ascending=False).head(220).burden.sum())
Out[4]:
'5,589,316 hours'

So, that feels a lot more sane, and a lot less exciting. There are only 5,589,316 hours of public burden for everything but the top 20% of information collection requests.

In the end, this is a great lesson in how a data schema can lead to incorrect conclusions.

Still, we have some good data at the small end of the distribution. Let's plot a histogram of the 890 requests below that top 20%.

In [5]:
# Histogram of the 890 smallest requests, i.e. everything below the top 220.
data.sort_values('burden').head(890).burden.plot.hist(bins=30)
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fca467f8080>

In other words, there are a lot of information requests that each account for only a couple hundred hours of public burden. Not a surprising result, but perhaps even more useful in the end. It means that the roughly 200 requests in the middle, above the long tail of tiny requests but below the miscounted giants at the top, account for much of the remaining burden hours. Now, that seems like a good place to start.
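
If you wanted to put a number on that, here's a rough sketch; the cutoffs (the 220 largest, then the next 200) are assumptions based on the analysis above, not anything in the OIRA data itself:

# How much of the remaining burden comes from the ~200 requests just below
# the 220 largest? (Assumes the data frame from the cells above.)
ranked = data.sort_values('burden', ascending=False)
middle = ranked.iloc[220:420]   # the next 200 requests
rest = ranked.iloc[220:]        # everything except the top 220
print("{:.0%} of the remaining burden".format(middle.burden.sum() / rest.burden.sum()))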