Musings: Crowdsource RECAP of PACER documents for next to nothing

Last night Lawrence Lessig spoke at Dartmouth College about Rebooting our Government. I’ve read Lessig’s articles and listened to his lectures before, and seeing him speak in person was quite a treat.

Lessig’s lecture highlighted his mission to give control of our government back to the people — to the citizens of the US. Fix Congress First is one of the groups encouraging this reform, and I suggest that you go check out their website right now!

Part of giving the citizenry control is making sure that everyone has free, open access to all of our laws and court case records. Federal court records are in the public domain and are available online through the Public Access to Court Electronic Records (PACER) system; however, access to PACER is billed at a per-page rate.

Because the documents in PACER are public domain, once a document is accessed, it may be distributed without restriction or additional fee. As a result, several groups are currently working on opening the vast archive of documents in PACER so that anyone can access any of them, at any time, with no fees or strings attached.

One of the projects trying to free the PACER documents is called RECAP, and is run by the Center for Information Technology Policy at Princeton University.

RECAP has written an eponymous Firefox extension that augments the interface to the PACER website. The benefits of this extension include (as described by the RECAP website):

  • Helps you give back: Contributes to a public archive hosted by the Internet Archive
  • Saves you money: Shows you when free documents are available
  • Keeps you organized: Gives you better filenames, enables useful headers
  • …and more

I’ll let you visit the RECAP website to see a video of the interface, but the important part of the idea is that once a RECAP user pays for access to a court record, that record will be added to the set of “freed” records and no RECAP user will have to pay to access that record ever again.

I’m sure you all know where I’m going with this… that’s right, monkeys!

So in a hypothetical situation in which we had a million monkeys, all with no-limit credit cards, PACER accounts, and computers (sorry, typewriters only cut it for Shakespeare), we could let the monkeys wail away at the PACER database, and (given enough time) the set of simians would add all files in the PACER system to the RECAP database.

But who am I kidding? I don’t know a single monkey with a no-limit credit card…do you? I guess we’ll have to use humans instead.

According to my totally scientific calculations, the PACER system has about 100 million pages in it.

Calculation of pages in PACER

According to Wikipedia, Aaron Swartz downloaded “about 20% of the entire database” during a fee-free trial.
According to the WIRED article linked to by Wikipedia, Swartz downloaded “19,856,160 pages.”

Doing the simple math, Total_Pages * 0.20 = 19,856,160 pages, so Total_Pages = 5 * 19,856,160 pages = 99,280,800 pages.

Call it 100 million pages — that’s close enough for government work.

If one person were to try to download the entire catalog of PACER files, it would cost them a lot. A whole lot. PACER charges $0.08/page, but only charges for the first 30 pages of a document (capping the total cost of a document at $2.40). There are several files in PACER that are longer than 30 pages, but let’s just go for an upper bound here and assume that we’ll be charged 8 cents for every page.

So 100 million pages * $0.08/page = $8 million. Ouch!
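For anyone who wants to check my arithmetic, here is a minimal Python sketch of the estimate above. It uses only the figures quoted in this post (the 20% sample, $0.08/page, the 30-page billing cap); the function names are my own invention for illustration and are not part of PACER or RECAP.

```python
# Back-of-the-envelope PACER cost estimate, using the figures quoted in this post.
# Money is kept in whole cents to avoid floating-point rounding surprises.

PRICE_CENTS_PER_PAGE = 8   # $0.08 per page
BILLED_PAGE_CAP = 30       # only the first 30 pages of a document are billed


def estimate_total_pages(sample_pages: int, sample_percent: int) -> int:
    """Scale a known sample up to the size of the whole database."""
    return sample_pages * 100 // sample_percent


def document_cost_cents(pages: int) -> int:
    """Cost of one document, capped at 30 billed pages ($2.40)."""
    return min(pages, BILLED_PAGE_CAP) * PRICE_CENTS_PER_PAGE


def upper_bound_cost_dollars(total_pages: int) -> float:
    """Worst case: every single page is billed at the full rate."""
    return total_pages * PRICE_CENTS_PER_PAGE / 100


if __name__ == "__main__":
    print(estimate_total_pages(19_856_160, 20))   # 99280800
    print(document_cost_cents(200) / 100)         # 2.4
    print(upper_bound_cost_dollars(100_000_000))  # 8000000.0
```

The cap is why $8 million is an upper bound: a 200-page filing costs the same $2.40 as a 30-page one.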

Of course, the money doesn’t have to come from just one person. We could ask for donations, but getting people to pay $5 or $10 to free court records doesn’t have the same ring as asking people to donate $5 or $10 to save cute furry animals or cute children.

Luckily for light users, the PACER system has a cutoff point for billing. According to Wikipedia, starting in March 2001 “no fee would be owed until a user accrued more than $10 worth of charges in a calendar year.” Nine years later, in March 2010, “that limit was effectively quadrupled, with users not billed unless their charges exceed $10 in a quarterly billing period.”

What if we were to get a whole bunch of people to sign up for PACER, and ask each one to spend a little less than $10 each quarter RECAPing files from PACER? It would only cost them time, not money.

Hmmm. Let’s crunch some numbers!

We’ve got 100 million pages at $0.08/page. One person could download 120 pages for $9.60 each quarter, so that gives us
100 million pages ÷ 120 pages per person-quarter = 833,333 person-quarters (not to be confused with chicken quarters).

Of course, we don’t actually have that much work to do. Remember that Aaron “the Hoover” Swartz already downloaded 20% of the total database, so we only have 80% of that, or 666,667 person-quarters (Nom nom nom), left to go.

With a total workload of 666,667 person-quarters, that means that we could extract the entire database in:

  • 3 months, given 666,667 people
  • 1 year, given 166,667 people
  • 5 years, given 33,333 people
  • 167 years, given 1,000 people
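The person-quarter numbers can be reproduced with a few lines of Python. This is only a sketch under the assumptions already made in this post (100 million pages, 80% still to fetch, 120 pages per person per quarter); the helper names are hypothetical.

```python
# Person-quarter workload, using this post's assumptions.

PRICE_CENTS_PER_PAGE = 8         # $0.08 per page
PAGES_PER_PERSON_QUARTER = 120   # 120 pages * 8 cents = $9.60, under the $10 threshold
QUARTERS_PER_YEAR = 4


def person_quarters(total_pages: int, percent_remaining: int = 100) -> int:
    """Workload rounded to the nearest whole person-quarter."""
    remaining_pages = total_pages * percent_remaining // 100
    return round(remaining_pages / PAGES_PER_PERSON_QUARTER)


def years_needed(workload: int, people: int) -> float:
    """Years to finish if every person uses their full free quota each quarter."""
    return workload / people / QUARTERS_PER_YEAR


if __name__ == "__main__":
    workload = person_quarters(100_000_000, percent_remaining=80)
    for people in (666_667, 166_667, 33_333, 1_000):
        print(people, "people:", round(years_needed(workload, people), 2), "years")
```

Halving the number of volunteers doubles the time, so the schedule is only as good as the head count.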

Of course, this is a quick-n-dirty upper bound. We made the assumption that all documents were under 30 pages, but there are certainly thousands of files longer than that, each of which will only be billed as $2.40, saving us many person-quarters. Remember, too, that if more lawyers and paralegals can be encouraged to use the RECAP Firefox extension, the RECAP database will grow much faster. Of course, PACER isn’t standing still, either.

PACER is a living system and will continue to grow beyond 100 million pages of records. Absent any big shift toward open access to PACER, even after the initial “catch-up” step is completed, contributors will need to continue RECAPing new records each month as they are added to the system.

It’s possible that PACER or the FBI might not take too kindly to this “distributed hoovering” approach to freeing the documents stored in PACER, and like in the Swartz case, they might raise some ruckus. That being said, as far as I can tell, the approach I describe is entirely legal. I really can’t imagine a legal reason why the feds would want to put a stop to it.

The Future

Looking to the future, PACER has begun to provide digital audio files from court records. In 2008, 7,400 audio files were uploaded to the PACER system. I haven’t found any data on the file format they’re using for the digital audio, but the PACER website prices each audio file at $2.40.

An individual user can download 4 files @ $2.40 per quarter (and stay under $10), so
7,400 audio files ÷ 4 audio files per person-quarter = 1,850 person-quarters
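The same calculation in Python, as a rough sketch. The only inputs are the $2.40 price, the $10 quarterly threshold, and the 7,400 files from 2008 mentioned above; the function names are made up for illustration.

```python
# Audio-file workload: how many files fit under the quarterly free threshold,
# and how many person-quarters the 2008 batch would take.

AUDIO_PRICE_CENTS = 240       # $2.40 per audio file
FREE_THRESHOLD_CENTS = 1000   # stay strictly under $10 per quarter


def files_per_person_quarter() -> int:
    """Largest number of files whose total cost stays under the threshold."""
    return (FREE_THRESHOLD_CENTS - 1) // AUDIO_PRICE_CENTS


def audio_person_quarters(total_files: int) -> int:
    """Person-quarters needed, rounding any partial quarter up."""
    per_quarter = files_per_person_quarter()
    return -(-total_files // per_quarter)  # ceiling division on integers


if __name__ == "__main__":
    print(files_per_person_quarter())    # 4
    print(audio_person_quarters(7_400))  # 1850
```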

Compared to the 666,667 person-quarters for the paper files, the audio files are chump change. For now. It’s very likely that more and more court proceedings will have digital audio recordings, so not only is the total number of audio recordings in PACER increasing, but the rate at which recordings are being added is probably increasing as well.

It’s all so complicated, isn’t it?

In the end, it would make things much easier if all of these public domain data were just available to the public without fee. Factoring the cost of distributing the records of the court into the cost of running the court seems like a wise and simple solution; however, I think that the Federal government is currently happy with the cash cow that is PACER.

So fire up Firefox, get your PACER account set up, and RECAP your $9.60 of files for this quarter. That’s 120 pages down, 80,000,000 pages to go… 🙂

–Q

UPDATE – 2010-06-03:
Oops! Wikipedia is off by at least an order of magnitude, according to Tim Lee of the RECAP project (see comment below). “There are about 500 million documents in PACER,” says Tim, “which translates to several billion pages.”

What does that mean for RECAP? Probably that we’ll need to invite more friends to participate!
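To get a feel for how much the corrected figure changes things, here is a hedged Python sketch that reruns the person-quarter estimate with the page count as a parameter. “Several billion pages” is not an exact number, so the 2 billion used below is purely an illustrative assumption (it is roughly what you get if Swartz’s 19,856,160 pages were about 1% of the total); swap in a better figure if you have one.

```python
# Rerun the workload estimate with a configurable total page count.

PAGES_PER_PERSON_QUARTER = 120   # 120 pages at $0.08 = $9.60 per quarter
ALREADY_DOWNLOADED = 19_856_160  # pages Swartz downloaded, per WIRED


def person_quarters_remaining(total_pages: int) -> int:
    """Person-quarters left, rounded to the nearest whole person-quarter."""
    remaining_pages = max(total_pages - ALREADY_DOWNLOADED, 0)
    return round(remaining_pages / PAGES_PER_PERSON_QUARTER)


if __name__ == "__main__":
    assumed_total = 2_000_000_000  # ASSUMPTION: illustrative stand-in for "several billion"
    print(person_quarters_remaining(assumed_total))
```

Under that assumption the workload grows from hundreds of thousands of person-quarters to tens of millions.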

2 thoughts on “Musings: Crowdsource RECAP of PACER documents for next to nothing”

  1. Thanks for the post! The 20% figure is wrong. According to the AO of the courts, there are about 500 million documents in PACER, which translates to several billion pages. So Swartz got something like 1% of all pages, not 20%.

    The good news is that not all court documents are created equal. Some documents are downloaded far more often than others, and the most popular documents are also likely to be the most important from a public interest perspective. So we don’t need to get all 500 million documents in order to build a really useful public archive. It would be great to see someone build an automated tool that can log people into PACER and download $9 worth of documents each quarter.
