<body>

Notes on tech

Notes on technology, business, entrepreneurship, the economy and markets, along with interesting general tidbits.

Notification project update - Comparison Framework

8/28/2004 01:54:00 AM, posted by anand

As I inch my way towards completing the notification project, I have had my own set of challenges. One of them was extracting plain vanilla text from the complex HTML that we usually come across on various sites. The HTML dished out by a typical website contains elements like Flash banners, advertisements, style sheets, JavaScript, etc. If you are out to determine whether the content of a particular page has changed since you last viewed it, you might as well extract just the content and ignore all the other blah blew dooh daah that makes up the page. Also, every site has its own style of presentation as well as its own logic for updating content.

For the past week, I have been busy trying to come up with a pluggable framework that can hold various comparators. The comparators filter the content and determine whether it has been updated since the last visit (crawl?). Some comparators might correspond to certain kinds of site layouts, while others might be responsible for filtering RSS feeds, PDFs, and what not.
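To make the idea of a pluggable comparator framework a bit more concrete, here is a minimal sketch in Python. It is not the project's actual code: the names `Comparator`, `ComparatorRegistry`, `accepts`, `extract` and `has_changed` are all hypothetical, and comparing SHA-256 digests of the filtered content is just one assumed way of detecting an update.

```python
from abc import ABC, abstractmethod
import hashlib


class Comparator(ABC):
    """Hypothetical plug-in interface: filter raw content, then compare it."""

    @abstractmethod
    def accepts(self, content_type: str) -> bool:
        """Return True if this comparator knows how to handle the content type."""

    @abstractmethod
    def extract(self, raw: str) -> str:
        """Strip everything that is not real content."""

    def has_changed(self, old_raw: str, new_raw: str) -> bool:
        # Compare digests of the filtered content, so only the digest of the
        # previous crawl would need to be stored (an assumption of this sketch).
        return self._digest(self.extract(old_raw)) != self._digest(self.extract(new_raw))

    @staticmethod
    def _digest(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()


class PlainTextComparator(Comparator):
    """Toy comparator that ignores differences in whitespace only."""

    def accepts(self, content_type: str) -> bool:
        return content_type == "text/plain"

    def extract(self, raw: str) -> str:
        return " ".join(raw.split())


class ComparatorRegistry:
    """Holds the plugged-in comparators and picks one per content type."""

    def __init__(self) -> None:
        self._comparators: list[Comparator] = []

    def register(self, comparator: Comparator) -> None:
        self._comparators.append(comparator)

    def find(self, content_type: str) -> Comparator:
        # First registered comparator that accepts the type wins.
        for comparator in self._comparators:
            if comparator.accepts(content_type):
                return comparator
        raise LookupError(f"no comparator registered for {content_type!r}")
```

A comparator for RSS feeds or PDFs would then be one more subclass registered with the same registry, leaving the crawling code untouched.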

So the framework is ready, and I have already implemented a simple text comparator that is fairly successful in extracting the real stuff from the fluff that surrounds it. If only we had something like this in our lives.
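For the curious, a simple text comparator along these lines could look roughly like the Python sketch below. This is only an illustration under my own assumptions, not the comparator described above: it uses the standard library's `html.parser`, the helper names `extract_text` and `content_changed` are made up, and it only drops `script`, `style` and `noscript` blocks, so it would not recognise advertisements or banners that are ordinary markup.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of script/style blocks."""

    SKIP_TAGS = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # pieces of text found outside skipped elements

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Join the pieces and collapse all runs of whitespace to one space.
        return " ".join(" ".join(self._chunks).split())


def extract_text(html: str) -> str:
    """Return the plain text of an HTML page, without scripts and styles."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def content_changed(old_html: str, new_html: str) -> bool:
    """True if the extracted text differs between two versions of a page."""
    return extract_text(old_html) != extract_text(new_html)
```

With this, two crawls of a page that differ only in their scripts or style sheets compare as unchanged, while an edit to the visible text is reported as an update.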