SeeItSaveIt is an Addon for Firefox that extracts structured information from web pages, and sends it to services that can use that data.
Let's say you are looking at a website, and you see some data you'd like to use. What kind of data? Maybe it's an item that you can purchase: the page will tell you the name and description of the item, its price, maybe some shipping information. If you are doing comparison pricing you might want to compare that side-by-side with product information from another page. The structured data here would be fields for price and name and so on.
Another example might be a social network like Facebook. You have access to a lot of data, like your list of contacts, people's posts, image galleries you've uploaded, etc. Some of this data can be extracted through private APIs on the social network, but all of the data is available to you: if you can see the information, it's almost certainly possible to extract it in a structured form. SeeItSaveIt helps you get that structured information.
What then? Having "structured data" probably doesn't mean much to you; even for a computer programmer this is a very vague statement. You almost certainly want to do something with the data (and if you don't have anything to do with the data, you probably wouldn't care about it in the first place).
SeeItSaveIt gives that data to a service that you want to use, and that knows how to use the data you have. A service that aggregates your contact data might accept data extracted from Facebook, LinkedIn, your email, or your intranet. A product comparison tool could accept data from thousands of sites (but probably not Facebook).
Warning: SeeItSaveIt is currently an experimental product. It may not protect your privacy as much as we want it to, and it may not work as reliably as we would like. Also, at this time, the process of finding a service to send data to is crude and perhaps unhelpful.
SeeItSaveIt is an Addon for Firefox. To get started you need to install it in your browser (this does not require a restart):
Once installed, SeeItSaveIt won't do anything on its own, but it will add a new button in the bottom right of your browser, like this:
If you don't see it, you might need to display the Add-on bar; you can do this with View > Toolbars > Add-on bar (read more here).
What follows is a technical discussion of what SeeItSaveIt does; the manifesto describes why.
Attempts to get structured data commonly treat it as a supply-side issue: first we must convince data providers to structure their data. SeeItSaveIt instead seeks to build and satisfy demand.
To do this, SeeItSaveIt tries to parse and understand data wherever it exists, without asking the data provider to change anything. But that requires a data extraction script for each of these data providers.
With this in mind, SeeItSaveIt asks the public (aka the crowd) to write these scripts. Each script works for a specific website (or set of websites), and produces a certain kind of output — a certain structure of data, such as contacts or an article. There's just no generalized way to do this. It may be possible to extract meaning statistically or with heuristics, and that remains a possibility within SeeItSaveIt — fallback scripts may use these measures.
To make script development easier, SeeItSaveIt includes a development site (available at a single click) for working on your extraction scripts. Scripts typically weigh in at only a couple dozen lines: work like finding and loading the page is handled either by SeeItSaveIt or by the user themselves, making script writing fairly easy. In the future we hope to include features that make it easy to report failed extractions.
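To make this concrete, here is a hypothetical sketch of what such an extraction script might look like. The function name, CSS classes, and output fields are all illustrative assumptions, not SeeItSaveIt's actual API; a real script would walk the serialized DOM, while this sketch uses a crude regex so it stays self-contained:

```javascript
// Hypothetical extractor sketch: receives the serialized page and returns
// typed, structured data. The "contact" markup and field names here are
// invented for illustration, not part of SeeItSaveIt.
function extractContacts(html) {
  var items = [];
  var re = /<li class="contact">\s*<span class="name">([^<]+)<\/span>\s*<span class="email">([^<]+)<\/span>/g;
  var m;
  while ((m = re.exec(html)) !== null) {
    items.push({ name: m[1].trim(), email: m[2].trim() });
  }
  // The "type" field tells consuming services what structure to expect.
  return { type: 'contacts', items: items };
}

// Example serialized fragment:
var page = '<ul>' +
  '<li class="contact"><span class="name">Ada</span><span class="email">ada@example.org</span></li>' +
  '<li class="contact"><span class="name">Grace</span><span class="email">grace@example.org</span></li>' +
  '</ul>';
var result = extractContacts(page);
```

The important part is the shape of the return value: a type plus the extracted items, so a consuming service knows what it is receiving.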
Running scripts from random people isn't safe or pleasant on its own, so SeeItSaveIt puts a firewall between these scripts and the page itself. Only the visible parts of a page are serialized, and this simpler version of the page is what gets passed to the script. This generally removes data like cookies or information that might live in scripts; it also means the script cannot do anything that would change the page or affect your browsing session — in other words, the script can't pretend to be you.
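A minimal sketch of the serialization idea (not the addon's actual implementation) is to strip out scripts and inline event handlers before the markup ever reaches an extraction script:

```javascript
// Sketch of the "firewall" idea: serialize only safe, static markup before
// handing it to an extraction script. Scripts and inline event handlers
// never cross over. (A simplified illustration, assuming regex-level
// sanitization; the real addon works on the live DOM.)
function sanitizeForExtractor(html) {
  return html
    // Drop <script> elements entirely:
    .replace(/<script\b[\s\S]*?<\/script>/gi, '')
    // Drop inline event handlers like onclick="...":
    .replace(/\son\w+="[^"]*"/gi, '');
}

var raw = '<div onclick="steal()">Hello<script>document.cookie</script></div>';
var safe = sanitizeForExtractor(raw);
```

Because the extractor only ever sees `safe`, it has no access to cookies, page scripts, or anything that could act on your behalf.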
Add-ons, bookmarklets, and Greasemonkey scripts (and sometimes even API access) don't necessarily guarantee this. That makes it harder to scale up participation — those systems are legitimate in large part to the degree that they are obscure.
Also in progress is enough sandboxing that a script can receive a page and hand data back into the user's control without risk of information leaking. Because this isn't finished, there are still privacy risks in using SeeItSaveIt.
Data is first extracted from the page by a script; then SeeItSaveIt delivers that parsed data to some consuming site or application. Separating this into two phases has some advantages:
Currently one of the least functional pieces of SeeItSaveIt is choosing a site to send your data to. There is simply a global list of consuming sites. This is no good!
Unlike extraction, it is assumed that sites will be willing to go out of their way to accept data — everyone likes more data! So we do not expect to need ad hoc ways of inputting data. (But I would expect a site that, for instance, converts extracted data into an Excel or CSV file you can download.)
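A consuming site of that kind might turn typed "contacts" data into a downloadable CSV. This is a hypothetical sketch — the data shape mirrors what an extraction script could plausibly produce, and the function name is invented:

```javascript
// Hypothetical consumer sketch: convert extracted "contacts" data into CSV.
function contactsToCsv(extracted) {
  var rows = [['name', 'email']];
  extracted.items.forEach(function (item) {
    rows.push([item.name, item.email]);
  });
  return rows.map(function (row) {
    return row.map(function (cell) {
      // Quote cells that contain commas, quotes, or newlines:
      return /[",\n]/.test(cell) ? '"' + cell.replace(/"/g, '""') + '"' : cell;
    }).join(',');
  }).join('\n');
}

var csv = contactsToCsv({
  type: 'contacts',
  items: [{ name: 'Ada Lovelace', email: 'ada@example.org' }]
});
```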
Still, what sites does a user use and care about? And what kinds of data are those sites interested in? There is a pair of related up-and-coming protocols for handling just this kind of situation: Web Activities, used by Mozilla for its Firefox OS project, and Web Intents, a more standards-oriented proposal led by Google and being implemented in Chrome.
In both systems a web site can register its ability to handle certain kinds of Intents or Activities. These Intents are typed, much like the structured data that SeeItSaveIt outputs. When another site "starts an intent" (or activity), the browser looks at all the sites that have registered the ability to handle that intent and passes it on to the one the user selects.
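For example, under Web Activities a site declares its handlers in its app manifest. A hypothetical registration for a site that accepts contact-type data might look like the following (the activity name and the "contacts" type string are illustrative assumptions; the manifest keys — activities, filters, href, disposition — come from the Web Activities proposal):

```json
{
  "activities": {
    "import-contacts": {
      "filters": { "type": ["contacts"] },
      "href": "/import.html",
      "disposition": "window"
    }
  }
}
```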
In many ways SeeItSaveIt is the ability to extract an Intent from any webpage without its cooperation. As such it should be a good match. However the technology is still in progress, so we wait. Or perhaps we create a dumb simplified version of the same thing as a stop-gap.
I want to tell you why I think this project is important. If you are thinking "should I try this out" then this is probably a much more long-winded and abstract discussion than would make sense. But if you are asking "is this important" or "is this the right approach" I hope my arguments will help you make those determinations.
Have you heard the refrain users should own their own data? Unfortunately this advocacy tends to take the form of begging: begging services to open up their data, and begging users to care in large enough numbers to make that happen — and then to keep exercising those options so that the data stays open in the future.
But we don't need to be beggars for this data. The data is already there — everything you can see in your browser is data that can be read, extracted, and understood. It's not always easy to extract this data, but it's possible, and usually it's not very hard; it requires some expertise but the results of that expertise can be shared with everyone.
Of course we already extract data regularly from our web browsing. The URL is the easiest piece of data to pass around, and we pass this location identifier around so other people can see what we can see. But things at a URL can change — pages disappear, are edited, represent streams of updates that are always changing, are customized for individual users, are only available to subscribers or inside private networks. Sometimes a URL is exactly the right thing to share, or the right thing to save with a bookmark; but often it's not the address that we want to work with, it's the actual data we see on the screen.
Web sites can protect and change the pages at any one URL, but when you see something on the screen, that data is in your hands, it is literally located on your computer. It's your data — if you are willing to take hold of it!
SeeItSaveIt lets you take hold of that data by extracting structured information. You might see article text, user names, contact information, etc. on your screen, but the meaning of each piece of text is not clear to the computer, or to any other service on the web — and the data isn't nearly as useful if the computer can't understand it. You want to throw away the useless information — ads, hints about how to use the site, branding, etc. — and mark the rest of the data according to what it is. Once you have that, there are unlimited ways of using the information.
And what do you do with the data once you've extracted it? You probably don't personally have a way to use structured but abstract data. But the data is given a type, and we can search for other services that can use that type of data — or you can find such services yourself or, if you are a developer, build them. As an example, contact information might go to a contact manager, be used to recreate friend networks on other services, be archived for research purposes, or be correlated with information you've collected elsewhere.
This tool relies on individual users to write and maintain scripts that extract the structure from web pages. The structure of pages on http://facebook.com is different from the structure of pages on http://wikipedia.org. There's no standard way to represent different kinds of structured information — or maybe there are too many standard ways.
But then, shouldn't we build a standard way? It's been tried before (Microformats, RDF, Open Graph Protocol). If you look at those links, you'll notice I linked to Wikipedia in each case, even though they each have an "official" page. Why is Wikipedia so often a better reference location for a concept than a self-controlled and branded page? Because we (in the largest sense of "we") are usually better at describing each other than we are at describing ourselves. When I describe myself, or when an institution describes itself, it can't help but confuse what it is with what it hopes to be or what it plans to be. It's hard to be objective, to be grounded in the present, and ultimately the description is self-serving rather than serving the audience.
This is why I believe ad hoc scripts are not just the result of flaws in the current system, or of poor adoption of structured markup that should be fixed. Ad hoc scripts are developed by people who want to do something. Their goals are functional: they've found information they want to extract. The authors of these scripts are also the audience. The validity of their process is determined by the usefulness of the result. I would go so far as to say that this should be the preferred foundation of semantics — knowledge structure is coherent in so far as it is useful.
Microformats included an important design decision that I believe this builds on: data is only correct when it is visible and used. If data is hidden (e.g., embedded invisibly into the structure of the page) then no one is checking to make sure it is correct or useful. While Microformats use this in a prescriptive way — coming to the conclusion that semantic markup must wrap visible elements — SeeItSaveIt uses this concept in a descriptive way: the most useful information is already visible and parseable.
SeeItSaveIt modularizes the process with three basic components:
There are architectural and privacy reasons for this separation, but another significant reason is simple politics. There was a time when it was common to give your credentials to sites, which would then log in as you and extract data or potentially do other things. There were quite a few problems with this, including Terms of Service violations and services that took data users didn't realize they were taking, or that were just buggy and messed things up. Once you give a service access to log in as you, there's no end to what it can do. Another substantial reason for pushback against this technique was that services were using it to extract data so they could directly compete (particularly when there was a lot of competition in the social space, something that has since died down) — this offended some site owners, the integration wasn't symmetric, and often the competitors didn't fully clarify to users what they wanted to do with that data.
The technique now used for this kind of integration is OAuth, where one service authenticates with both the user and the other site, and uses published APIs to do very particular things. With cooperative sites this can be helpful, but it has some problems:
SeeItSaveIt's modularity addresses many of these concerns:
Because of these features I believe SeeItSaveIt is more politically and technically palatable than previous systems where agents worked on behalf of the user in a way that was indistinguishable from the user.
A quick guide to the code: seeit-services/seeitservices/ is a simple server to support the querying and development of new code. The extraction script development part is in seeit-services/seeitservices/static-develop/. You can run seeit-services/run-server.py and you shouldn't need to install anything.
The addon is in its own directory, with lib/main.js doing the bulk of the work. Use the cfx tool from the Jetpack/Addon-SDK to build the code into an .xpi addon.
The basic steps: first the page is serialized. <script> tags are removed; links are made absolute when it's clear that something is a link, though a <base href> tag is still generally necessary; things like external CSS files are left as-is, but iframes and canvas elements are made static. The serialized page is then loaded via a data: URL. We also add the extractor script to the page, and any scripts it says it requires. Those are the only scripts on the page — any scripts on the original page have been removed, so there can't be conflicts.
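The last step can be sketched roughly like this — a simplified illustration, assuming a buildSandboxUrl helper that does not exist in the actual code:

```javascript
// Simplified sketch: load the sanitized, serialized page via a data: URL,
// with only the extractor script injected. The helper and its arguments
// are illustrative, not the addon's real API.
function buildSandboxUrl(serializedHtml, extractorSrc) {
  // Inject the extractor as the page's only script; the original page's
  // scripts were already stripped during serialization, so no conflicts.
  var page = serializedHtml.replace(
    '</body>',
    '<script src="' + extractorSrc + '"></script></body>'
  );
  return 'data:text/html;charset=utf-8,' + encodeURIComponent(page);
}

var url = buildSandboxUrl('<html><body><p>Hi</p></body></html>',
                          'https://example.org/extractor.js');
```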