Can RSS & XML Help Us Build the Data Web ...

The Web is full of data: statistics, surveys, and reports can be found on almost any topic you care to search for. It is this very fact that makes the Web the first stop in anyone's research. Want to know the average number of petals on a daisy? Thirty-four. The number of species of whale? Eighty (or thereabouts, depending on your definition of "whale" apparently).

Of course one could spend all day typing random questions into a search engine, but the serious business of research lies in statistical analysis: comparing datasets for trends. One obvious example is comparing the performance of various stock markets around the globe over time. This happens so frequently that it is quite a simple task online. So is comparing currencies' performances against each other. But what if you wanted to do both? What if you wanted to compare the yen against the London Stock Exchange's closing index? At first glance that might seem like a nonsensical comparison, but in financial research trends are key, and trends cannot be found without comparing datasets, however far apart those datasets may seem.

Unfortunately most, if not all, of the data on the Web is embedded in Web pages, and most often only presented in graphical form. There is just no way of looking at a chart of something over time and reliably retrieving the underlying data, at least not across the board.

Now let's step back a moment and look at another existing Web technology: RSS (or Atom if you are so inclined). Really Simple Syndication is a way for Web sites to publish a simplified list of links to articles and items on their sites. So instead of logging into a news site every day to check for new items, you "subscribe" to an RSS feed and just check the feed. What made this possible was a simple, standard, open format that was easy for both humans and computers to create and read.

Over the past few years RSS has caught on in a big way. It is simple to write your own RSS parser and create a news ticker that watches your favorite Web sites, for example. Browser plug-ins have given way to native support, and suddenly the Web started to feel just that little bit more integrated. RSS foretold the Web 2.0 age and should not be overlooked. Before there were JavaScript APIs coming out of the woodwork there were RSS aggregators that built virtual Web sites out of the parts of other Web sites.

Now let's consider the most commonly seen AJAX-powered mashup: modifications of map sites, adding real estate pictures and locations to a map, for example. This sort of thing would be made a lot easier and more accessible if the real estate agents published an RSS-like feed of properties, along with their GPS coordinates and prices. Even in this limited scope the possibilities are endless: a burger chain could publish the locations of its restaurants, or news bulletins could come attached with markers. Planes, trains, and - well, possibly - automobiles could be tracked and tacked onto maps. Want to see where the roadworks are on your journey? Just import the official highways feed of roadworks into any mapping site or software of your choosing.

This problem has already been solved to a certain extent, of course, with Google Earth, but the scope is limited by its narrow focus, purely on coordinates. Wouldn't it be great if a feed could contain not only property locations but also prices, and your browser could detect the presence of both? The browser itself would then let you tack them onto a map, or sort by price in a spreadsheet. This too has limits. I doubt "number of bedrooms" could be generalized into a universal datatype, but price (currency="USD") and coordinates (type="GPS") are easily comparable and transposable across differing datasets. The next step would be to merge the property and burger chain feeds, selecting only those houses within x miles of the nearest drive-thru.
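To make that last step concrete, here is a rough Python sketch of merging the two feeds by distance. It assumes both feeds have already been parsed into plain dictionaries; the field names (`lat`, `lon`, `price_usd`), the sample rows, and the function names are all made up for illustration, and the haversine formula is simply one common way to estimate distance between coordinates.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean radius of the Earth


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def houses_near_restaurants(houses, restaurants, max_miles):
    """Keep only the houses whose nearest restaurant is within max_miles."""
    nearby = []
    for house in houses:
        nearest = min(
            (haversine_miles(house["lat"], house["lon"], r["lat"], r["lon"])
             for r in restaurants),
            default=math.inf,  # no restaurants at all means nothing is nearby
        )
        if nearest <= max_miles:
            nearby.append(house)
    return nearby


# Hypothetical rows, as they might look after parsing the two feeds.
houses = [
    {"id": "A", "price_usd": 250000, "lat": 40.7128, "lon": -74.0060},
    {"id": "B", "price_usd": 180000, "lat": 41.5000, "lon": -74.0060},
]
restaurants = [{"name": "Drive-thru 1", "lat": 40.7300, "lon": -74.0000}]

print([h["id"] for h in houses_near_restaurants(houses, restaurants, 5)])
```

The point is that the filter needs nothing from either feed beyond a coordinate column it can recognize, which is exactly what a typed column would provide.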

So let's get back to our original data problem. This time suppose I want to compare the populations of London, New York, and Tokyo over the past century. Sounds simple enough, and a few minutes of Web searching yielded a handy "Inner London" set of census data I could use, which was encouraging. New York took a few minutes longer before it too yielded data. Tokyo proved too stubborn, however, and I gave up, having only managed to get data for a handful of random years: 2000, 2003, and 1960, hardly an extensive dataset. I got relatively good data for London and New York, one set of figures for each decade. One, however, was embedded in an HTML table and the other in a text file, formatted and aligned using spaces.

The next step is to copy and paste each snippet of data cell by cell into a spreadsheet and finally run the graph wizard. Then you can behold the wonder as Inner London's population stays essentially still while New York's stealthily overtakes it. Now I could be accused at this point of serial laziness, but it just seems like a lot of work, especially the extensive search engine hammering. Suppose that each site had published a standard formatted set of data, and I could run a dataset search to pull it back. Then an aggregator could look at the data and splice it together. Sounds farfetched, but is it really?

Let's now pull RSS back into this discussion and suppose we apply a similar principle to this data too. First, give the dataset a header including its title and source, then add each column of data with a strict set of criteria for defining its type and formatting. This way a piece of aggregator software would look at our population sets and see two columns in each, one for the year and one for the population. Population is just a number, and in the interest of keeping it simple our columns could probably be just "year" and "number," leaving it up to the column title to convey what the number is, well, counting.

Next the aggregator would look across the datasets and, recognizing that both have a major axis of "year", instantly know that they can be combined, even if the years don't match exactly (British population counts are done on years ending in 1, not 0, for example). Provided the definition of "year" is fixed and the numbers representing the year are always four-digit, there can never be any ambiguity between different datasets.
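As a rough illustration of that combining step, the Python sketch below joins any number of series on a shared "year" key and leaves a gap wherever a series has no figure for a given year. The function name and the dictionary layout are assumptions of mine, and the "City B" figures are placeholders rather than real census data; only the Inner London numbers come from the sample in this article.

```python
def merge_on_key(datasets):
    """Outer-join several series on their shared key column.

    `datasets` maps a series title to a {key: value} dictionary.
    Returns a list of (key, {title: value or None}) rows sorted by key.
    """
    all_keys = sorted(set().union(*(series.keys() for series in datasets.values())))
    return [
        (key, {title: series.get(key) for title, series in datasets.items()})
        for key in all_keys
    ]


inner_london = {1901: 6506889, 1911: 7160441}   # from the sample dataset
city_b = {1900: 1000, 1910: 2000}               # placeholder values only

for year, values in merge_on_key({"Inner London": inner_london, "City B": city_b}):
    print(year, values)
```

Because the years interleave rather than collide, a graphing tool can plot each series against the same axis and simply skip the gaps.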

Now the aggregator would see that both have a second column of type "number", and even though it doesn't know what the number pertains to, it would notice that their lowest and highest values fall into a similar range, hardly rocket science. The result is the ability to graph them both together. Even if we were comparing two datasets with differing second columns then, just as we do manually, we would overlay them both, scaled to facilitate easy comparison, with a label to say what units each is in.
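One crude way an aggregator might make that decision is sketched below in Python. The "within a factor of ten" threshold is an arbitrary assumption of mine, as are the function names and the second series of placeholder values; min-max scaling is just one possible way to overlay series measured in different units.

```python
def ranges_similar(a, b, max_ratio=10.0):
    """Crude heuristic: two series can share an axis when their largest
    magnitudes differ by no more than max_ratio (an assumed threshold)."""
    peak_a = max(abs(v) for v in a)
    peak_b = max(abs(v) for v in b)
    if peak_a == 0 or peak_b == 0:
        return peak_a == peak_b
    return max(peak_a, peak_b) / min(peak_a, peak_b) <= max_ratio


def rescale(values):
    """Min-max scale values onto 0..1 so differently sized series can overlay."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


inner_london = [6506889, 7160441, 7386755]  # from the sample dataset
other_series = [12.5, 14.0, 15.5]           # placeholder values only

if ranges_similar(inner_london, other_series):
    plot_a, plot_b = inner_london, other_series  # one shared axis
else:
    plot_a, plot_b = rescale(inner_london), rescale(other_series)  # overlay, scaled

print(plot_b)
```

When the series are rescaled like this, the original units have to be carried along as labels, since the plotted heights no longer mean anything on their own.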

And here is a crude example, thrown together in minutes:

<?xml version="1.0" encoding="utf-8" ?>
<dataset title="Census Population of Inner London 1901-2001" key="year">
  <head cols="2">
    <col id="year" title="Year" format="yyyy" />
    <col title="Population" format="0" />
  </head>
  <body>
    <tr><td>1901</td><td>6506889</td></tr>
    <tr><td>1911</td><td>7160441</td></tr>
    <tr><td>1921</td><td>7386755</td></tr>
    <tr><td>1931</td><td>8110358</td></tr>
    <tr><td>1941</td><td note="Estimate">8160000</td></tr>
    <tr><td>1951</td><td>8196807</td></tr>
    <tr><td>1961</td><td>7992443</td></tr>
    <tr><td>1971</td><td>7368693</td></tr>
    <tr><td>1981</td><td>6608598</td></tr>
    <tr><td>1991</td><td>6679699</td></tr>
    <tr><td>2001</td><td>7172036</td></tr>
  </body>
</dataset>

Basically what we have here is a combination of technologies: the XML is designed to reflect RSS and bears more than a passing resemblance to an HTML table. The latter means that any programmer familiar with browser DOM scripting can easily parse this too. It takes a different approach to a SOAP dataset in that the datatypes and column names are listed in a header block while the data table structure is fixed. It is also more lightweight than a serialized recordset but still contains the important details about the data we are publishing.
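To give a feel for how little code a consumer would need, here is a minimal Python sketch that reads a shortened copy of the sample dataset using the standard library's ElementTree parser. The function name and the shape of the returned dictionary are my own invention, and the sketch assumes every cell holds a whole number, which is true of this sample but would not hold for every dataset.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the sample dataset.
SAMPLE = """<?xml version="1.0" encoding="utf-8" ?>
<dataset title="Census Population of Inner London 1901-2001" key="year">
  <head cols="2">
    <col id="year" title="Year" format="yyyy" />
    <col title="Population" format="0" />
  </head>
  <body>
    <tr><td>1901</td><td>6506889</td></tr>
    <tr><td>1941</td><td note="Estimate">8160000</td></tr>
    <tr><td>2001</td><td>7172036</td></tr>
  </body>
</dataset>"""


def parse_dataset(xml_bytes):
    """Read the header and rows of a dataset document into plain Python values."""
    root = ET.fromstring(xml_bytes)
    columns = [col.get("title") for col in root.find("head").findall("col")]
    rows = []
    for tr in root.find("body").findall("tr"):
        cells = tr.findall("td")
        rows.append({
            # Assumption: every cell is an integer, as in this sample.
            "values": [int(td.text) for td in cells],
            # A cell without a note attribute yields None.
            "notes": [td.get("note") for td in cells],
        })
    return {
        "title": root.get("title"),
        "key": root.get("key"),
        "columns": columns,
        "rows": rows,
    }


dataset = parse_dataset(SAMPLE.encode("utf-8"))
print(dataset["title"], dataset["columns"])
for row in dataset["rows"]:
    print(row["values"], row["notes"])
```

A fuller reader would use each column's format attribute to decide how to convert the cell text instead of assuming integers throughout.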

Because it is a straight table with defined columns we can use it to publish serial data, as above, or discrete data, such as the list of burger restaurants and their locations. Finally, because it is simple XML it is extensible: rows or cells could have individual notes attached. Indeed, notice that in 1941 World War II was going on, so we only have an estimate to work with.

Suddenly comparing global temperature to the numbers of sea-borne pirates over time becomes so much easier - and though I am only speculating out loud here, I for one would welcome such a standard. Imagine what this would do for business practice in general, and democracy at large, if the de facto standard of openness was to publish on one's Web site a syndication of data: incomes, expenditure, tax, contracts - all there for the public to parse, analyze, and compare at the click of a button.

Wait a minute, I hear you say, what about ODF spreadsheets? Yes, they are open, standard, and already XML formatted. But suggesting that a data syndication format would be pointless if you could already download and open a spreadsheet in a spreadsheet application is the same as asking of RSS: Why not just visit the Web site and check for new news items yourself? So ask yourself this instead: If that had remained the attitude for all this time, would we have a Web 2.0?

Posted at 12:23:29 pm | Posted in RSS

© 2006 www.ajaxlines.com. All Rights Reserved. Powered by IRange