Now Queryable and open linked data: U.S. Census/Congress datasets: 1 billion triples

and it's fast!

As you can’t blog enough about it, I am copying a story from this announcement email:
(the following text is by Josh Tauberer)

Hi, everyone. (This is a revised/combined reannouncement for what was
originally posted on the Linking Open Data list.)

Last November, Chris Bizer wrote, “[T]he DBLP server increases the size
of the Semantic Web by around 10 percent ;-)” [1] Based on the same
logic, I have recently increased the size of the semantic web by 200%!
(in terms of the number of triples; and of course I’m also just joking
here w.r.t. size of the semantic web)

I’m announcing here a new U.S. 2000 Census dataset of 1 billion triples,
accessible over SPARQL and browsable by linked data [2] principles, and
re-announcing my U.S. Congress dataset which is newly browsable with
linked data principles. These two datasets are interconnected, and the
Census dataset is linked up via owl:sameAs to Geonames [3].

I like the Census data set a lot for three reasons: first, if you live
in the U.S. it has something for you, since it has detailed statistics
on geographic entities down to the level of small towns/villages, and
everyone lives somewhere; second, it meshes up with two other data sets;
and third, it’s rich enough on its own to support a wide array of
interesting and real-world useful queries (if, say, you were doing
research).

The OpenLink guys were kind enough to host the data set previously, but
I wanted to push the limits of my own semweb C# library [4] and I wanted
to be able to revise the data set as needed, so I’ve wanted to host it
myself, which I was only recently able to do (even though I’ve had the
triples lying around for nearly a year).

A complete description of the data set and how it was constructed and
exposed is here:

http://www.rdfabout.com/demo/census/

Some features of the data set:

Data on 3,200 U.S. counties, 36,000 “towns”, 16,000 “villages”, 33,000
ZCTAs (something like zip-codes), and 435 congressional districts.

Each of those locations contains around 10 thousand population
statistics, as well as a dc:title, a basic hierarchical structure
between regions, and latitude/longitude.

Very basic geographic/name/lat-lng data (1 million triples) can be
downloaded in N3.

All of the 1 billion triples are accessible via SPARQL. See:
http://www.rdfabout.com/demo/census/sparql.xpd which has a few sample
queries. An example query is “List the states in the United States that
have more students in dorms than prisoners.”
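
For readers who want to script queries like these, here is a minimal Python sketch of building a request URL in the way the SPARQL protocol expects (the query text goes in a `query` parameter of a GET request). The endpoint address below is a placeholder and an assumption, not taken from the announcement; the page linked above is the place to find the real one. The query only asks for dc:title values, which the announcement says every location carries, and it assumes the data set uses the standard Dublin Core elements namespace for the `dc:` prefix.

```python
from urllib.parse import urlencode

# Assumption: placeholder endpoint address, used only for illustration.
ENDPOINT = "http://www.rdfabout.com/sparql"

# Assumption: dc: is bound to the standard Dublin Core elements namespace.
QUERY = """\
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?place ?title
WHERE { ?place dc:title ?title }
LIMIT 10
"""


def build_query_url(endpoint: str, query: str) -> str:
    """Return a GET URL that carries the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    # Print the URL; fetching it is left to the reader's HTTP client of choice.
    print(build_query_url(ENDPOINT, QUERY))
```

The LIMIT clause matters here: with a billion triples behind the endpoint, an unbounded query over every titled location would return a very large result.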

The URIs for the geographic regions are dereferencable http: URIs. (The
URIs for the predicates in the data set will be updated to be
dereferencable in the future.) For example, you can visit the URI for
New York State:

http://www.rdfabout.com/rdf/usgov/geo/us/ny

(Some URIs return very large pages that take Firefox quite a while to
render. That one’s OK.)

The dereferencable URIs return 303’s to SPARQL DESCRIBE pages describing
those URIs.
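
To see the 303 behaviour from a script rather than a browser, a client has to avoid following the redirect automatically. The following is a rough Python sketch using only the standard library; the function name and the default Accept header are illustrative choices, not something specified in the announcement.

```python
import http.client
from urllib.parse import urlsplit


def describe_location(uri: str, accept: str = "application/rdf+xml"):
    """GET `uri` without following redirects.

    Returns (status_code, Location header or None). http.client never
    follows redirects on its own, so a 303 is visible to the caller.
    """
    parts = urlsplit(uri)
    conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        conn.request("GET", parts.path or "/", headers={"Accept": accept})
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


# Example call (requires network access):
# status, target = describe_location("http://www.rdfabout.com/rdf/usgov/geo/us/ny")
```

If the server behaves as described above, the status should be 303 and the Location header should point at the DESCRIBE page, which can then be fetched in a second request.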

There is a sitemap.xml file based on the latest draft circulated [5],
referenced from robots.txt: http://rdfabout.com/robots.txt
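
A crawler can discover the sitemap from robots.txt by reading its Sitemap lines. Here is a small Python sketch using the standard library's robots.txt parser (the `site_maps()` method needs Python 3.8 or later). The sample robots.txt text is made up for illustration and is not the actual content of the file above.

```python
from urllib.robotparser import RobotFileParser

# Assumption: hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow:
Sitemap: http://example.org/sitemap.xml
"""


def sitemaps_from_robots(robots_text: str) -> list[str]:
    """Return the sitemap URLs listed in a robots.txt body (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    # site_maps() returns None when no Sitemap lines were found.
    return parser.site_maps() or []


if __name__ == "__main__":
    print(sitemaps_from_robots(SAMPLE_ROBOTS))
```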

And, source code to generate the triples from the Census download files
is posted. The full RDF is too large for me to provide myself, for
now at least.

The U.S. Congress data set, which I originally made SPARQL-accessible in
December 2005 and which is now revised to follow the new linked data
principles, has 12 million triples containing brief biographical data
for all members of Congress, and mainly data for federal legislation and
voting records going back a number of years. Here are two example
dereferencable URIs:

http://www.rdfabout.com/rdf/usgov/congress/people/M000303
(= Senator John McCain)

http://www.rdfabout.com/rdf/usgov/congress/109/bills/h867
(= a bill in Congress)

Some example Congress-related queries are posted here:
http://www.govtrack.us/sparql.xpd
And dump files are here:
http://www.govtrack.us/data/rdf/

An example I like to use is that one could fairly easily create a table
using SPARQL aligning votes on a particular bill by congressmen with,
for instance, the median commuting time to work of their constituents,
as reported by the Census.

Thanks to those who gave feedback on the LOD list — I haven’t been
able to address all of it yet (like how to deal with backlinks on the
dereferenced pages).

[1] http://lists.w3.org/Archives/Public/semantic-web/2006Nov/0008.html
[2] http://linkeddata.org/
[3] http://www.geonames.org/
[4] http://razor.occams.info/code/semweb
[5] http://sw.deri.org/2007/07/sitemapextension/

--
Josh Tauberer

http://razor.occams.info