SeeItSaveIt is an Addon for Firefox that extracts structured information from web pages, and sends it to services that can use that data.

What's that?

Let's say you are looking at a website, and you see some data you'd like to use. What kind of data? Maybe it's an item that you can purchase: the page will tell you the name and description of the item, its price, maybe some shipping information. If you are doing comparison pricing you might want to compare that side-by-side with product information from another page. The structured data here would be fields for price and name and so on.
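
For instance, the structured version of a product listing might look something like this (the field names are purely illustrative; SeeItSaveIt does not mandate any particular schema):

    // A hypothetical structured-data object for a product listing.
    // Field names are illustrative only, not a fixed SeeItSaveIt schema.
    var product = {
      name: "Acme 12-Cup Coffee Maker",
      description: "Programmable drip coffee maker with thermal carafe.",
      price: {amount: 49.99, currency: "USD"},
      shipping: "Free shipping on orders over $25",
      url: "http://example.com/products/acme-coffee-maker"
    };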

Another example might be a social network like Facebook. You have access to a lot of data, like your list of contacts, people's posts, image galleries you've uploaded, etc. Some of this data can be extracted through private APIs on the social network, but all of the data is available to you: if you can see the information, it's almost certainly possible to extract it in a structured form. SeeItSaveIt helps you get that structured information.

What then? Having "structured data" probably doesn't mean much to you; even for a computer programmer this is a very vague statement. You almost certainly want to do something with the data (and if you don't have anything to do with the data, you probably wouldn't care about it in the first place).

SeeItSaveIt gives that data to a service that you want to use, and that knows how to use the data you have. A service that aggregates your contact data might accept data extracted from Facebook, LinkedIn, your email, or your intranet. A product comparison tool could accept data from thousands of sites (but probably not Facebook).

How to get started?

Warning: SeeItSaveIt is currently an experimental product. It may not protect your privacy as much as we want it to, and it may not work as reliably as we would like. Also, at this time, the process of finding a service to send data to is crude and perhaps unhelpful.

SeeItSaveIt is an Addon for Firefox. To get started you need to install it in your browser (this does not require a restart):

Install the Addon

SeeItSaveIt won't do anything on its own once installed, but it will add a new button in the bottom right of your browser, like this:

If you don't see that button, you might need to display the Add-on bar; you can do this with View > Toolbars > Add-on bar (read more here).

How does this work?

What follows is a technical discussion of what SeeItSaveIt does. The manifesto below describes why.

Crowdsourcing parsing

Attempts to get structured data commonly treat it as a supply-side problem: first we must convince data providers to structure their data. SeeItSaveIt instead seeks to build and satisfy demand.

To do this, SeeItSaveIt tries to parse and understand data wherever it exists, without asking the data provider to change anything. But that requires data extraction scripts for each of these data providers.

With this in mind, SeeItSaveIt asks the public (aka the crowd) to write these scripts. Each script works for a specific website or set of websites, and produces a certain kind of output — a certain structure of data, like contacts or an article. There's just no generalized way to do this. It may be possible to extract meaning statistically or with heuristics, and that remains a possibility with SeeItSaveIt — fallback scripts may use these techniques.

To make script development easier, SeeItSaveIt includes a development site (available at a single click) for working on your extraction scripts. Scripts typically weigh in at only a couple dozen lines. Work like finding and parsing the page is handled either by SeeItSaveIt or by the user themselves, which makes writing a script fairly easy. In the future we hope to include features that make it easy to report failed extractions.
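
As a rough illustration, an extraction script might look something like the following. The scrape() entry point, the selectors, and the shape of the output are all assumptions made for this example, not the actual SeeItSaveIt API:

    // Hypothetical extraction script for a contacts-style page. It runs
    // inside the frozen, read-only copy of the page (described below), so
    // it can only read markup and text; it cannot act on your behalf.
    function scrape() {
      var contacts = [];
      var cards = document.querySelectorAll(".contact-card");
      for (var i = 0; i < cards.length; i++) {
        contacts.push({
          name: cards[i].querySelector(".name").textContent,
          email: cards[i].querySelector(".email").textContent
        });
      }
      // Return a plain object; SeeItSaveIt attaches the declared type and
      // the source URL before handing the result to a consumer.
      return {contacts: contacts};
    }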

Creating read-only versions of pages

Running scripts from random people isn't safe or pleasant, but SeeItSaveIt puts a firewall between these scripts and the page itself. Only the visible parts of a page are serialized, and then this simpler version of the page is passed to the script. This generally removes data like cookies or information that might be held in scripts; it also means the script cannot do anything that would change the page or affect your browsing session — in other words, the script can't pretend to be you.
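
Roughly, the freezing step works like the following sketch (a simplification for illustration, not the Addon's actual serialization code):

    // Simplified sketch of freezing a page into a read-only copy. The real
    // Addon does more (absolute links, static iframes and canvas elements,
    // etc.); this only shows the general idea.
    function freezePage(doc) {
      var copy = doc.documentElement.cloneNode(true);
      // Strip anything executable, so the extraction script sees markup and
      // text but no live code, cookies, or session state.
      var scripts = copy.querySelectorAll("script");
      for (var i = 0; i < scripts.length; i++) {
        scripts[i].parentNode.removeChild(scripts[i]);
      }
      return "<!DOCTYPE html>" + copy.outerHTML;
    }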

Add-ons, bookmarklets, and Greasemonkey scripts (and sometimes even API access) don't necessarily guarantee this. That makes it harder to scale up participation — these systems are legitimate in large part to the degree that they are obscure.

Also in progress is enough sandboxing that a script can take a page in and send data out into the user's control without any risk of information leaking elsewhere. Because this isn't finished, there are still privacy risks in using SeeItSaveIt.

Two-phase extraction and delivery of data

Data is first extracted from the page by a script, and then SeeItSaveIt delivers that parsed data to some consuming site or application. Separating this into two phases has some advantages:

  1. The extraction script can be heavily sandboxed.
  2. There are multiplicative benefits from extraction scripts and consuming sites. Once there are scripts to extract a certain kind of data from quite a few sites, any new consuming site can enter and get that benefit.
  3. There is less business collusion between these two components. Many services are uncomfortable or even aggressive when a competitor tries to get data out of one of their systems. Previously amenable relationships can sour as business units expand or join. Because the extractor script is not a business, the two sites are kept at arm's length, joined only by user action.
  4. Scripts can be repurposed. The extraction scripts aren't particularly specific to SeeItSaveIt, and so other experiments or projects may find them useful.

Future: Web Activities/Intents

Currently one of the least functional pieces of SeeItSaveIt is choosing a site to send your data to. There is simply a global list of consuming sites. This is no good!

Unlike extraction, we assume that sites will be willing to go out of their way to accept data. Everyone likes more data! So we do not expect to need ad hoc ways of inputting data. (Though I would expect a site that, for instance, converted extracted data into an Excel or CSV file you could download.)

Still, which sites does a user use and care about? And what kinds of data are those sites interested in? There are a pair of related up-and-coming protocols for handling just this kind of situation: Web Activities, used by Mozilla for its Firefox OS project, and Web Intents, a more standards-oriented proposal led by Google and being implemented in Chrome.

In both systems a web site can register its ability to handle certain kinds of Intents or Activities. These Intents are typed much like the structured data that SeeItSaveIt outputs. When another site "starts an intent" (or activity), the browser looks at all the sites that have registered the ability to handle that intent and passes it on to the one the user picks.
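
For example, with Mozilla's Web Activities a page can hand typed data off to whichever handler the user picks, roughly like this (the activity name and data shape here are illustrative, and the API may have changed since this was written):

    // Rough sketch using Firefox OS Web Activities (MozActivity).
    var activity = new MozActivity({
      name: "share",
      data: {
        type: "contacts",
        contacts: extractedData  // structured data produced by an extractor
      }
    });
    activity.onsuccess = function () { console.log("data delivered"); };
    activity.onerror = function () { console.log("no handler, or the user cancelled"); };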

In many ways SeeItSaveIt is the ability to extract an Intent from any webpage, without the page's cooperation. As such it should be a good match. However, the technology is still in progress, so we wait. Or perhaps we create a dumb, simplified version of the same thing as a stop-gap.

A Manifesto

I want to tell you why I think this project is important. If you are thinking "should I try this out?", then this is probably a much more long-winded and abstract discussion than you need. But if you are asking "is this important?" or "is this the right approach?", I hope my arguments will help you make those determinations.

Why?

Have you heard the refrain users should own their own data? Unfortunately this advocacy usually takes the form of begging: begging services to open up their data, and begging users to care in large enough numbers to make that happen, and then to exercise those options in ways that will protect and maintain that open data in the future.

But we don't need to be beggars for this data. The data is already there — everything you can see in your browser is data that can be read, extracted, and understood. It's not always easy to extract this data, but it's possible, and usually it's not very hard; it requires some expertise but the results of that expertise can be shared with everyone.

Of course we already extract data regularly from our web browsing. The URL is the easiest piece of data to pass around, and we pass this location identifier around so other people can see what we can see. But things at a URL can change — pages disappear, are edited, represent streams of updates that are always changing, are customized for individual users, are only available to subscribers or inside private networks. Sometimes a URL is exactly the right thing to share, or the right thing to save with a bookmark; but often it's not the address that we want to work with, it's the actual data we see on the screen.

Web sites can protect and change the pages at any one URL, but when you see something on the screen, that data is in your hands, it is literally located on your computer. It's your data — if you are willing to take hold of it!

SeeItSaveIt lets you take hold of that data by extracting structured information. You might see article text, user names, contact information, etc. on your screen, but the meaning of each piece of text is not clear to the computer, or to any other service on the web, and the data isn't nearly as useful if the computer can't understand it. You want to throw away the useless information (ads, hints about how to use the site, branding, etc.) and mark the remaining data according to what it is. Once you have that, there are unlimited ways of using the information.

And what do you do with the data once you've extracted it? You probably don't personally have a way to use structured but abstract data. But the data is given a type, and we can search for other services that can use that type of data; you can also find such services yourself or, if you are a developer, build them. As an example, contact information might go to a contact manager, be used to recreate friend networks on other services, be archived for research purposes, or be correlated with information you've collected elsewhere.

Why Ad Hoc Extraction?

This tool relies on individual users to write and maintain scripts that extract the structure from web pages. The structure of pages on http://facebook.com is different from the structure of pages on http://wikipedia.org. There's no standard way to represent different kinds of structured information. Or maybe there are too many standard ways.

But then, shouldn't we build a standard way? It's been tried before (Microformats, RDF, Open Graph Protocol). If you look at those links, you'll notice I linked to Wikipedia in each case, even though they each have an "official" page. Why is Wikipedia so often a better reference location for a concept than a self-controlled and branded page? Because we (in the largest sense of "we") are usually better at describing each other than we are at describing ourselves. When I describe myself, or when an institution describes itself, it can't help but confuse what it is with what it hopes to be or what it plans to be. It's hard to be objective, to be grounded in the present, and ultimately the description is self-serving rather than serving the audience.

This is why I believe ad hoc scripts are not just the result of flaws in the current system, or of poor adoption of structured markup that should be fixed. Ad hoc scripts are developed by people who want to do something. Their goals are functional: they've found information they want to extract. The authors of these scripts are also the audience. The validity of their process is determined by the usefulness of the result. I would go so far as to say that this should be the preferred foundation of semantics — knowledge structure is coherent in so far as it is useful.

Microformats included an important design decision that I believe this builds on: data is only correct when it is visible and used. If data is hidden (e.g., embedded invisibly into the structure of the page) then no one is checking to make sure it is correct or useful. While Microformats use this in a prescriptive way — coming to the conclusion that semantic markup must wrap visible elements — SeeItSaveIt uses this concept in a descriptive way: the most useful information is already visible and parseable.

The Politics of Modularity

SeeItSaveIt modularizes the process with three basic components:

  1. The SeeItSaveIt Addon: this is what finds extraction scripts, and what sanitizes and freezes the page you are looking at into something we can give to the extraction script.
  2. The Extraction Script: this is what takes the visible page and turns it into structured data with a somewhat formal "type". This is not permitted to do anything with the data, only to extract it.
  3. The Consumer: this is a service that receives the extracted data. It needs to understand the structure, and probably needs a personal relationship with the user.

There are architectural and privacy reasons for this separation, but another significant reason is simple politics. There was a time when it was common to give your credentials to sites, and they would log in as you to extract data or potentially take actions on your behalf. There were quite a few problems with this, including Terms of Service violations, services that took data users didn't realize they were handing over, and services that were just buggy and messed things up. Once you give a service access to log in as you, there's no end to what it can do. Another very substantial reason for pushback against this technique was that services were using it to extract data so they could directly compete (this was particularly true when there was a lot of competition in the social space, something that has since died down) — this offended some site owners, the integration wasn't symmetric, and often the competitors didn't fully clarify to users what they wanted to do with that data.

The technique now for this kind of integration is OAuth, where one service authenticates with both the user and the other site, and uses published APIs to do very particular things. With cooperative sites this can be helpful, but it has some problems:

  1. Services need permission not just from the user, but also from the other site. They can be and are blocked for reasons that have nothing to do with what the user cares about.
  2. The scope of what is made public is restricted to what seems like a positive business move by the site.
  3. Development happens at a pace determined by the site owners and developers.
  4. Integration is in many ways more ad hoc, because the API that one site exposes is usually quite different from the API another site exposes. So while a site might integrate with Twitter or Facebook, it often won't be worth integrating with smaller players.

SeeItSaveIt's modularity addresses many of these concerns:

  1. The Addon prepares the data in a relatively safe way, so that only read access is given to the extracting script. An extraction script flaw will generally just lead to corrupt extracted data, not anything done to the site itself (e.g., you can't post on behalf of the user, just because a script is given access to read data off Facebook's site). This can mean privacy leaks, but there are ways that we can mitigate that problem as well.
  2. The Addon runs as you: it sees exactly what you see. It doesn't send a URL to a server that then tries to reproduce and extract the content. It doesn't have to ask any site's permission. And because you have to explicitly start the extraction process, user consent is concretely implied.
  3. The author of the extraction script is not (or at least plausibly not) in direct competition with the site. The extraction script creates neutral data that can be consumed by anyone. The data might be particularly useful to the extraction script author, but they ultimately have to enable anyone else to do useful things at the same time.
  4. Consumers are the only ones who actually receive any data, and they only get fully extracted and baked content. There are several steps along the way where we can audit and validate the results, because we have neutral documents instead of these components chatting with each other in difficult-to-understand ways.

Because of these features I believe SeeItSaveIt is more politically and technically palatable than previous systems where agents worked on behalf of the user in a way that was indistinguishable from the user.

Development / Contributing

The source is on github. It uses the Addon SDK (aka Jetpack). The source includes both the Addon and the server for registration and querying of extraction scripts, and development of new scripts.

A quick guide to the code: seeit-services/seeitservices/ is a simple server to support the querying and development of new code. The extraction-script development part is in seeit-services/seeitservices/develop.py and seeit-services/seeitservices/static-develop/. You can run seeit-services/run-server.py and you shouldn't need to install anything.

The addon is in lib/ and data/, with lib/main.js doing the bulk of the work. Use the cfx tool in Jetpack/Addon-SDK to build the code into an .xpi addon.

How Does It Work? The Details

The basic steps:

  1. The user indicates they want to extract information from the page, by pressing that button.
  2. The page is copied and serialized. This process eliminates parts of the page that make it interactive, but aren't important to its appearance at that moment. Things like <script> tags are removed. Links are made absolute when it's clear that something is a link, though a <base href> tag is still generally necessary. Things like external CSS files are left as-is, but iframes and canvas elements are made static.
  3. While the serialization is happening, we also ask the server whether there are extraction scripts that work for the given URL, and at the same time ask for consumers that accept the output of those extractors. The user then chooses where they want the data sent, from the subset of consumers that accept data some extractor produces.
  4. We turn that whole serialized page into a data: URL. We also add the extractor script to the page, and any scripts it says it requires. Those are the only scripts that are on the page — any scripts on the original page have been removed, so there can't be conflicts.
  5. We call the Javascript function that performs the scrape. It should return an object (it can do so synchronously or asynchronously). We should sandbox everything and keep the Javascript from accessing external resources, but so far that isn't being done.
  6. The extractor declared its type, so we add that type to the result object, along with some information about the original document (such as its URL).
  7. We send a POST request to the consumer the user selected, with the extracted data, and display the result of that to the user. Alternately we can deliver the data with a kind of postMessage. (Steps 5 through 7 are sketched below.)
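
A rough sketch of steps 5 through 7, with helper names (runExtractor, consumer.postUrl) invented for illustration:

    // Hypothetical sketch of running an extractor and delivering the result.
    function extractAndDeliver(extractor, consumer, pageUrl) {
      var result = runExtractor(extractor);   // step 5: call the scrape function
      result.type = extractor.type;           // step 6: attach the declared type...
      result.location = pageUrl;              //         ...and the source document's URL
      var req = new XMLHttpRequest();         // step 7: POST to the chosen consumer
      req.open("POST", consumer.postUrl);
      req.setRequestHeader("Content-Type", "application/json");
      req.send(JSON.stringify(result));
    }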