Archive for the ‘NewsHeap’ Category

Seeing double

Wednesday, February 12th, 2003

This post was almost titled Humpty Dumpty…. but the pieces are coming back together.

NewsHeap has undergone a major rearrangement to support multiple views onto feeds.

Each category folder now holds references to underlying feeds such that a feed can show up in multiple categories.

If you are tracking through your .NET feeds, you may read Java Morning, C# Afternoon. You might also peruse this feed in your JAVA feeds. Your choice.
In this image you can see multiple groups containing the same feed.
The master list of feeds is now a popup frame, reflected dynamically in the ‘All’ folder.
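The reference-based arrangement can be sketched in a few lines of Ruby. This is only an illustration, not NewsHeap's own code: the SharedFeed and CategoryFolder names and the example.com URL are made up for the example.

```ruby
# Hypothetical sketch: folders hold references to feeds, not copies.
SharedFeed = Struct.new(:title, :url)

class CategoryFolder
  attr_reader :name, :feeds

  def initialize(name)
    @name = name
    @feeds = []            # references to feed objects owned by the master list
  end

  def add(feed)
    @feeds << feed unless @feeds.include?(feed)
    self
  end
end

# The master list owns the feeds; each folder only points at them.
master = [SharedFeed.new("Java Morning, C# Afternoon", "http://example.com/rss.xml")]

dotnet = CategoryFolder.new(".NET").add(master.first)
java   = CategoryFolder.new("Java").add(master.first)

# Both folders see the same object, so a change made once shows up everywhere.
master.first.title = "Renamed feed"
puts dotnet.feeds.first.title
puts java.feeds.first.title
```

Because the folders share one object per feed, removing a feed from a category never touches the master list.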

Category Nodes are next. Each feed will support metadata including user-defined categories. Dynamic folder nodes will coalesce all feeds with the associated categories and will stay updated as new feeds are added or modified.
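A dynamic folder of this kind could be as simple as a filter over the master list that is re-evaluated on every access. The sketch below is a guess at the idea, with invented names (TaggedFeed, DynamicFolder) and invented category labels.

```ruby
# Hypothetical sketch of a dynamic folder driven by feed-level categories.
TaggedFeed = Struct.new(:title, :categories)

class DynamicFolder
  def initialize(name, category, master_list)
    @name = name
    @category = category
    @master_list = master_list   # shared reference to the master list, not a copy
  end

  # Evaluated on every call, so feeds added or recategorized later are picked up.
  def feeds
    @master_list.select { |feed| feed.categories.include?(@category) }
  end
end

master = [
  TaggedFeed.new("Feed A", ["tech", "dotnet"]),
  TaggedFeed.new("Feed B", ["daily"])
]
tech = DynamicFolder.new("Tech", "tech", master)
puts tech.feeds.map(&:title).inspect

# A feed added afterwards shows up without touching the folder.
master << TaggedFeed.new("Feed C", ["tech"])
puts tech.feeds.map(&:title).inspect
```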

You can have your cake and eat it too, just not yet. The destabilizing effects of this exercise are still manifest. It will take a little bit of time to settle down.

NewsHeap

Wednesday, February 5th, 2003

The first alpha is coming along. I have been working on refactoring the MVC code to ease some of the pain of adding higher order functions.

The final thing I need to work on before release is the notion of virtual categories. In the process of organizing feeds into categories, I have noticed that it is very confining to assign a feed to only one category. Should a feed be in my Daily Reads folder, or Tech, or maybe even .NET? The correct answer is that it should be able to be in all three.

In a recent discussion with Coty, we discussed the strategy that is employed in iTunes where you can have playlists or folders that reference feeds. This makes sense.

I am adding this concept now and ultimately it will include auto-cataloging as well. You can establish a folder as dynamic and it will pick through feed level categorizations to assemble a grouping of related feeds.

Later when I add the Intertwingularity Coefficient Calculator, feeds could be related by their connectedness. Or even FOAF folders.

Soon

NewsHeap Redux

Monday, January 27th, 2003

Don’t have much time, but for those following the NewsHeap adventure, it is off the floor and heading for 0.1.

Redeveloped in Python, I am using wxPython which has been surprisingly pleasant.

I am not well versed in Python, but skills translate pretty effectively. The refactoring will get interesting as I grok more of the power of Python.

A Python gotcha: writing if (anObj.isFolder): instead of if (anObj.isFolder()): is perfectly executable, but the condition tests the method object rather than calling it, so it is true whenever the method isFolder is accessible from the class of anObj. Given that my hobby coding happens really early or really late, I’ve been bitten by this type of thing several times.
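A Ruby comparison may explain why this habit carries over so easily: Ruby invokes a method whether or not the parentheses are written. The rough sketch below uses a made-up Node class; the closest Ruby analogue to the Python mistake is testing the Method object itself.

```ruby
# Hypothetical Node class, only for illustration.
class Node
  def initialize(folder)
    @folder = folder
  end

  def folder?
    @folder
  end
end

leaf = Node.new(false)

# Ruby invokes the method with or without parentheses.
puts leaf.folder?          # false
puts leaf.folder?()        # false

# The analogue of Python's `anObj.isFolder` (no call) is the Method object,
# which is always truthy regardless of what the method would return.
method_object = leaf.method(:folder?)
puts(method_object ? "truthy" : "falsy")   # truthy
```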

Given the availability of Mark Pilgrim’s excellent RSS parser, I have been focusing on the gui and infrastructure to manage large feed lists. Although I now use Syndirella daily, the inability to group feeds is frustrating. I know that Dmitry Jemerov is working on this as a priority.

As of tonight, NewsHeap does:

  • Import of OPML
  • Grouping of feeds
  • Caching of feeds
  • Threaded polling
  • Direct manipulation of feeds
  • Browser invocation on feeds and items
  • Display of content:encoded or description
  • etc

Hopefully this weekend, I will have the code cleaned up at which time I’ll post the source and an installer.

Soon I’ll resume posting about some higher order functions and the direction this is going in.

For the impatient, here’s a snapshot of NewsHeap.

Ruby 0, Python 1

Saturday, January 18th, 2003

When I first ventured into the NewsHeap project, I had several goals

  • Converge on a compelling Win32 desktop aggregator
  • Advance the state of the art in the Ruby space
  • Provide a platform for building the features I want in a feedreader

The scorecard to date is:

Ruby The introduction of the Ruby RSS parser advances the state of Ruby by providing a flexible, real-world capable RSS capability. The OPML parser is in the same camp.
The rub with Ruby is the extant challenge with providing a compelling Win32 GUI. My foray into Qt remains blocked by the functionality of the QTextBrowser. Its display of HTML is acceptable, but the intersection between Qt2/Ruby and the underlying library prevents method overloading. You’re left with the default behavior of the Qt library and no facility to alter it. The challenge of creating a display widget that allows programmatic handling of links remains orthogonal to the goal of creating a compelling Win32 app.
In all other respects, Ruby remains a fun, powerful and expressive vehicle for development.


Win32 Reader Given the unfulfilled promises on the GUI front, NewsHeap is lying on the floor. Not exactly where I wanted to be. The arrival of NewsGator and RSSBandit raises the ante on getting the job done.


Mitigation Strategy Over a four hour period, I coded up a Python version of NewsHeap and am rapidly converging on a base set of functionality. Although my colon hurts and I’ve been bitten by tabs vs spaces several times, I’m much further along.

I won’t ascribe this to the strength of Python the language, but rather the completeness of the environment surrounding it. wxPython is rich, its demos are easily extractable and it seems very capable.

I’m going to continue drilling down into the Python version and see where it goes. More on which later.

GUI Choices

Tuesday, January 14th, 2003

In the world of GUI programming with Ruby, a plethora of choices exist. In an ideal world, cross platform support, moderately native GUI support and specifically for NewsHeap the ability to render HTML are all requirements. One could bend on the first and even the second, but to pull off the three pane browser, the ability to display HTML credibly is a real necessity.
Ruby for Windows provides TK, VisualuRuby, and FXRuby. FXRuby sits on top of FOX, a cross platform GUI toolkit. The Ruby bindings are rich and the programming model fairly simple. The rub with all these choices is that there is no HTML display widget. wxWindows is a great toolkit, but the Ruby crew is just getting off the ground with the wxRuby bindings. I’m quite confident that wxRuby will rock when it emerges, but in the near term the choices are limited.

Qt offers a Windows version that is free for non-commercial applications only, and an excellent set of Ruby bindings exists: Ruby/Qt2. Qt provides a credible HTML control and a powerful widget set.

The choice for NewsHeap is clear. Qt is the only viable choice for the near term. See Ruby GUI Comparison for a more in depth look at GUI offerings in the Ruby space.

I’m playing around with Ruby/Qt2 now and include this screenshot. The code is rough and experimental, so indulge me for a few time units while I clean up.

Take Control

Thursday, January 9th, 2003

The test driver snippet from the last entry hints at the core processing loop that we need for NewsHeap - Iterate over all channels, retrieve their content, parse the result. Rinse, Lather, Repeat.

This post begins to layer some behavior on top of our current classes, pushing NewsHeap towards useful.

A feed reader should be able to

  • Maintain a list of subscriptions
  • Routinely query each subscription to see if updated content is present
  • Maintain a record of when the content was last checked
  • Provide per channel update frequency

The OPML file serves as the list of subscriptions. We can parse it, we can update it, alas we can’t insert or move outlines around yet, but that will come.

To keep track of when channels were last updated, etags if present, update frequency etc, we need to introduce a Subscription. We also need something to manage a list of Subscriptions. A SubscriptionList.

One design issue to wrestle with is the representation of Subscription organization. Foreshadowing our UI requirements to organize our subscription list hierarchically, do we replicate the hierarchy between the subscription list and the OPML? Do we eliminate the OPML altogether and make it available as an import/export format?

Having gone back and forth on this, it strikes me that we can borrow the OPML parser and create a control file format by extending OPML and Outline.

A simple change to the OPML SAX Handler facilitates us setting the class for new outline instances.


def initialize(handlerClass=Outline)
  ...
  @handlerClass = handlerClass
end

def start_element uri, localname, qname, attributes
  if (!@opml)                     # ensure that first element is 'OPML'
    ...
  elsif (localname == 'outline')
    o = @handlerClass.new()       # was: Outline.new()

We make a corresponding change to OPML.readFrom so that we can control the type of Outline.
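The idea of handing the parser a class to instantiate can be shown in isolation. The sketch below is not the NewsHeap handler: MiniOutline, MiniSubscription, and build_outlines are stand-ins invented for the example, and the "parsing" is reduced to walking an array of attribute hashes.

```ruby
# Stand-in for Outline: just a bag of attributes.
class MiniOutline
  attr_reader :attributes

  def initialize
    @attributes = {}
  end
end

# Stand-in for Subscription: same storage, friendlier accessors.
class MiniSubscription < MiniOutline
  def etag
    @attributes.fetch('etag', nil)
  end
end

# The caller chooses which class gets instantiated for each outline element.
def build_outlines(rows, handler_class = MiniOutline)
  rows.map do |attrs|
    outline = handler_class.new
    outline.attributes.update(attrs)
    outline
  end
end

rows = [{ 'text' => 'manicwave', 'etag' => 'abc123' }]
plain = build_outlines(rows)
subs  = build_outlines(rows, MiniSubscription)

puts plain.first.class
puts subs.first.class
puts subs.first.etag
```

The parsing code never names the subclass, so new outline flavors can be added without touching it.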

We can create some subclasses now to hide these details from users of the classes. We need to create Subscription and SubscriptionList.


class Subscription < Outline
  def initialize()
    @etag = nil
    @lastModified = nil
    @xmlURL = nil
    @lastGet = nil
    super()
  end

  def etag=(anEtag)
    @attributes['etag'] = anEtag
  end

  def etag()
    @attributes.fetch('etag', nil)
  end

You will note that we harden the interface for the Subscription class, adding accessors for elements such as etag.

SubscriptionList is even easier


class SubscriptionList < OPML
    ...

    def readFrom(aSource)
      super(aSource,Subscription)
    end

This will ensure that when we do


subList = SubscriptionList.new
subList.readFrom(File.new("somecontrolfile.xml"))

that all instances in the subscription list will be of class Subscription

A few unit tests and the addition of an iterator in OPML, and we’re done.
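The iterator itself isn't shown in the post. One way it might look, assuming each outline exposes its nested outlines through a children array (TreeOutline and TreeOPML are invented names, not the real classes), is a depth-first walk that yields every outline:

```ruby
class TreeOutline
  attr_reader :attributes, :children

  def initialize(attributes = {}, children = [])
    @attributes = attributes
    @children = children
  end
end

class TreeOPML
  attr_reader :roots

  def initialize(roots = [])
    @roots = roots
  end

  # Yield every outline, parents before their children.
  def each_outline(nodes = @roots, &block)
    nodes.each do |outline|
      block.call(outline)
      each_outline(outline.children, &block)
    end
  end
end

leaf_a = TreeOutline.new({ 'text' => 'manicwave3.1' })
leaf_b = TreeOutline.new({ 'text' => 'manicwave3.2' })
group  = TreeOutline.new({ 'text' => 'Sites of interest' }, [leaf_a, leaf_b])
tree   = TreeOPML.new([group])

seen = []
tree.each_outline { |outline| seen << outline.attributes['text'] }
puts seen.inspect
```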

Driver

Let’s bring this together with a driver program.
First, some top matter and initialization


  opmlFile = ARGV[0] || "myChannels.opml"

  # read the control file
  subscriptionList = NewsHeap::SubscriptionList.new()
  subscriptionList.readFrom(File.new("./control/control.xml", File::CREAT|File::RDONLY))

  # read the opml
  opml = OPML.new
  opml.readFrom(File.new(opmlFile))
  parser = RssParser::new()
  fetcher = HttpGetter::new()

The preceding checks for a control file in ./control/control.xml and creates it if not found.
Start looping over every entry in the OPML file, checking the SubscriptionList for an entry.


  # for every outline
  begin
    opml.each_outline { |outline|
      url = outline.attributes["xmlUrl"]
      # check control file
      subscription = getSubscription(subscriptionList, url)

      result = {}
      etag = subscription.etag
      lastModified = subscription.lastModified

      data = fetcher.readData(url, result, etag, lastModified, NewsHeap::Control::AGENT)

      if (newData?(result, subscription))
        result = parser.parse(data, result)
      end

      printIt(result)
      # update the control entry
      subscription.etag = result['etag']
      subscription.lastModified = result['modified']
      subscription.lastGet = Time.now()
    }
  rescue => bang
    print_backtrace(bang)
  end
  # update control data
  subscriptionList.persist("./control/control.xml")

We grab the etag and lastModified attributes from the subscription and pass them to our HttpGetter. The fetcher may return a 304 Not Modified if we’ve already got current content. The fetcher always updates the etag and last modified, so we can check if a new entry is present. If so, we’ll pass it to the parser. We then update the control entries and, when we’re all done, persist the control file for use in the next go around.
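Neither readData nor newData? appears in the post, so the following is only a guess at the shape of the logic, not the real code. conditional_headers and new_data? are hypothetical helpers, the 'status' key in the result hash is an assumption, and the agent string is invented.

```ruby
# Build the request headers for a conditional GET from what we remembered.
def conditional_headers(etag, last_modified, agent)
  headers = { 'User-Agent' => agent }
  headers['If-None-Match'] = etag if etag
  headers['If-Modified-Since'] = last_modified if last_modified
  headers
end

# Decide whether the response carries content we have not parsed yet.
# Assumes the fetcher records the HTTP status under result['status'].
def new_data?(result, previous_etag)
  return false if result['status'] == 304      # server says nothing changed
  result['etag'].nil? || result['etag'] != previous_etag
end

headers = conditional_headers('"abc"', 'Thu, 09 Jan 2003 08:00:00 GMT', 'NewsHeap/0.1')
puts headers.inspect

puts new_data?({ 'status' => 304, 'etag' => '"abc"' }, '"abc"')
puts new_data?({ 'status' => 200, 'etag' => '"def"' }, '"abc"')
```

A server that honors If-None-Match or If-Modified-Since answers 304 with an empty body, which is where the bandwidth saving comes from.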

One final enhancement makes this actually usable


  ...
  etag = lastModified = nil if !cached?(url)  #  <=== Check the cache for this URL - don't modify headers if no cache hit
  data = fetcher.readData(url,result,etag,lastModified,NewsHeap::Control::AGENT)

  # update and cache results
  cache(data,url) if data # <=== cache the data so if we get a 304, we still have content to display

  if (newData?(result,subscription))
    result = parser.parse(data,result)
  end
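The cached? and cache helpers are not shown in the post. A minimal file-based version might look like the sketch below; the cache directory, the MD5-based file naming, and the read_cache helper are all assumptions made for illustration.

```ruby
require 'digest/md5'
require 'fileutils'
require 'tmpdir'

# Hypothetical cache location; the real NewsHeap layout is not shown in the post.
CACHE_DIR = File.join(Dir.tmpdir, 'newsheap-cache-sketch')

# One file per feed, named after a digest of its URL.
def cache_path(url)
  File.join(CACHE_DIR, Digest::MD5.hexdigest(url) + '.xml')
end

def cached?(url)
  File.exist?(cache_path(url))
end

# Argument order (data, url) follows the call in the driver.
def cache(data, url)
  FileUtils.mkdir_p(CACHE_DIR)
  File.open(cache_path(url), 'wb') { |f| f.write(data) }
end

def read_cache(url)
  File.open(cache_path(url), 'rb') { |f| f.read }
end

url = 'http://www.manicwave.com/rss.xml'
cache('<rss version="0.91"></rss>', url)
puts cached?(url)
puts read_cache(url)
```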

Summary

We have a control file to store details of our subscriptions and support for bandwidth friendly behavior.

What we’re missing is per-channel update frequency and some behavior to rearrange the control file to reflect hierarchies and groups.
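Per-channel update frequency is not implemented at this point. One possible shape for it, sketched with an invented ChannelState struct, an invented due? helper, and an assumed 60 minute default, is a check made before each fetch:

```ruby
# Hypothetical: a subscription remembers when it was last fetched and how
# often (in minutes) it wants to be polled.
ChannelState = Struct.new(:last_get, :update_minutes)

DEFAULT_UPDATE_MINUTES = 60   # assumed default, not from NewsHeap

def due?(state, now = Time.now)
  return true if state.last_get.nil?            # never fetched
  minutes = state.update_minutes || DEFAULT_UPDATE_MINUTES
  now - state.last_get >= minutes * 60
end

now = Time.at(1_042_000_000)
fresh = ChannelState.new(now - 10 * 60, 30)   # fetched 10 minutes ago
stale = ChannelState.new(now - 45 * 60, 30)   # fetched 45 minutes ago
never = ChannelState.new(nil, nil)

puts due?(fresh, now)   # false
puts due?(stale, now)   # true
puts due?(never, now)   # true
```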

In the next round, we’ll examine some GUI options, make a decision and implement the basic 3-pane viewer.

The existing code still needs to be cleaned up - camelCase vs canonical_ruby, organization etc can all be better. I’m adding unit tests as we go, so we can layer in some good feelings.

Download file

Got OPML?

Tuesday, January 7th, 2003

RSS is in the bag (more or less). Now we need to start getting a list of feeds, consume them and keep track of what we’ve seen and when.

Regardless of what Amr E. Malik thinks of OPML, it seems to be the default format for storing feed lists in RSS aggregators. See OPML Loader for some interesting stuff.
The following contrived example of OPML includes a list of sites organized as a hierarchy.


<?xml version="1.0"?>
<!-- OPML pooped by NewsHeap -->
<opml version="1.1">
  <head>
     <title>mySubscriptions</title>
  </head>
  <body>
    <outline text="Sites of interest">
      <outline text="manicwave" description="Surfing the Wave" title="manicwave" type="rss" version="RSS"
                      htmlUrl="http://www.manicwave.com" xmlUrl="http://www.manicwave.com/rss.xml"/>
      <outline text="manicwave3" description="Surfing the Wave" title="manicwave" type="rss" version="RSS"
                      htmlUrl="http://www.manicwave.com" xmlUrl="http://www.manicwave.com/rss.xml">
         <outline text="manicwave3.1" description="Surfing the Wave" title="manicwave" type="rss" version="RSS"
                         htmlUrl="http://www.manicwave.com" xmlUrl="http://www.manicwave.com/rss.xml"/>
         <outline text="manicwave3.2" description="Surfing the Wave" title="manicwave" type="rss" version="RSS"
                         htmlUrl="http://www.manicwave.com" xmlUrl="http://www.manicwave.com/rss.xml"/>
      </outline>
    </outline>
    <outline text="manicwave4" description="Surfing the Wave" title="manicwave" type="rss" version="RSS"
                    htmlUrl="http://www.manicwave.com" xmlUrl="http://www.manicwave.com/rss.xml"/>
  </body>
</opml>

OPML is simply a format. We need a basic class that can consume and produce OPML. A simple internal representation would be attributes, a hash of the elements from the <head> section, plus the outline items: an array of Outline instances, each of which can contain a list of children.

Typical usage would be


  opml = OPML.new
  opml.readFrom(source)            # source is a String or an IO
  opml.roots.each { |outline|
    print "Has children\n" if (outline.has_children?)
    print "No children\n" if (!outline.has_children?)
  }

To access the attributes of each outline, use outline.attributes['attrName'].
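The "produce" half of the class is not shown here. As a rough sketch of how an outline tree could be written back out, the SketchOutline class below (an invented name, not the real Outline) keeps an attributes hash and a children array and serializes itself recursively. The escaping is deliberately minimal.

```ruby
class SketchOutline
  attr_reader :attributes, :children

  def initialize(attributes = {})
    @attributes = attributes
    @children = []
  end

  def has_children?
    !@children.empty?
  end

  # Emit this outline and its descendants as OPML <outline> elements.
  def to_xml(indent = 0)
    pad = ' ' * indent
    attrs = @attributes.map { |k, v| " #{k}=\"#{escape(v)}\"" }.join
    return "#{pad}<outline#{attrs}/>\n" unless has_children?
    inner = @children.map { |child| child.to_xml(indent + 2) }.join
    "#{pad}<outline#{attrs}>\n#{inner}#{pad}</outline>\n"
  end

  private

  # Minimal attribute escaping, enough for this sketch.
  def escape(value)
    value.to_s.gsub('&', '&amp;').gsub('<', '&lt;').gsub('"', '&quot;')
  end
end

group = SketchOutline.new({ 'text' => 'Sites of interest' })
group.children << SketchOutline.new({ 'text' => 'manicwave',
                                      'xmlUrl' => 'http://www.manicwave.com/rss.xml' })
puts group.to_xml
```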

OPML is done. We could imagine the intersection of the OPML code and the RSS code thusly:


require 'rss-parser'
require 'OPML'
require 'pp'

rssParser = RssParser::new()
fetcher = HttpGetter::new()

opml = OPML.new
opml.readFrom(File.new('test.opml'))  # <==Change this appropriately
opml.roots.each { |outline|
  next unless url = outline.attributes['xmlUrl']
  data = fetcher.readData(url)
  result = rssParser.parse(data)
  pp result
}

Download is here

In the next installment, we’ll add a control file to keep track of feeds we’ve read, when and associated etags if any. Then we can start caching content and doing net-friendly updates.

Premature Release

Tuesday, January 7th, 2003

It’s difficult to talk about, especially in mixed company.

The RSS parser from yesterday is shite. My fault. It worked for most 0.91 feeds, but incompletely parsed xml namespace qualified tags and attributes. In fact, the list of what didn’t work is longer than the list of what did.

Several issues surfaced when retesting the parser. First, the ruby sgml-parser modeled after the python sgmllib (which does by the way) didn’t deal with namespace qualifications at all. I’ve made the changes and posted a patched copy of sgml-parser.rb and an updated rss-parser.

The OPML code is done; this post should have been that post.

A few more tweaks and we’ll talk about OPML, and a driver program.

NewsHeap - Parsing RSS in Ruby

Monday, January 6th, 2003

So you want to build an RSS Feed Reader.
First we need to be able to parse RSS. Not to spec (0.91, 1.0, or 2.0), but the real world stuff that is distributed as RSS.

What we need is a liberal parser that gets RSS enough to extract content and normalize it into a usable form. Mark Pilgrim has done some great work in Python with his ultra-liberal RSS parser.

Pilgrim’s RSS parser builds upon a python module for SGML Parsing. The Ruby world provides a port of this as HTML/SGML Parser.

Good. It’s fairly easy to port Python to Ruby. My colon hurt a little, but the experience was all good until I got to the open_resource method. The python code was providing a uniform method to access data to parse. The comment said:


   This function lets you define parsers that take any input source
    (URL, pathname to local or network file, or actual data as a string)
    and deal with it in a uniform manner.  Returned object is guaranteed
    to have all the basic stdio read methods (read, readline, readlines).
    Just .close() the object when you're done with it.

Ruby 1.6 doesn’t provide a uniform stream interface across URLs, files, and strings. Ruby 1.7 introduces StringIO, but I made the decision to factor out the acquisition of data from the parsing of data. The Python interface is:
def parse(uri, etag=None, modified=None, agent=None, referrer=None):
and the new Ruby interface is simply
def parse(uri):
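Splitting acquisition from parsing means the parsing side only needs something it can read from. A small sketch of that idea follows; as_stream and count_items are invented helpers, not part of the RSS parser, and the "parsing" is just a count of item tags.

```ruby
require 'stringio'

# Hypothetical helper: accept anything IO-like as is, wrap a String in StringIO.
def as_stream(source)
  source.respond_to?(:read) ? source : StringIO.new(source.to_s)
end

# Stand-in for a parser: it only ever calls read, so it does not care
# whether the data came from a file, a socket, or a string.
def count_items(source)
  as_stream(source).read.scan(/<item[\s>]/).length
end

xml = "<rss><channel><item><title>a</title></item>" \
      "<item><title>b</title></item></channel></rss>"

puts count_items(xml)                 # 2
puts count_items(StringIO.new(xml))   # 2
```

Fetching over HTTP, with its etag and last-modified handling, then lives entirely in the getter.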

Here’s the RSS Parser - it contains a class called HttpGetter that does etag, last-modified and gzip handling. A typical usage of the RSS parser would be:


    urls = ['http://www.pocketsoap.com/rssTests/rss1.0withModules.xml',
            'http://www.pocketsoap.com/rssTests/rss1.0withModulesNoDefNS.xml',
            'http://www.pocketsoap.com/rssTests/rss1.0withModulesNoDefNSLocalNameClash.xml',
            'http://www.pocketsoap.com/rssTests/rss2.0noNSwithModules.xml',
            'http://www.pocketsoap.com/rssTests/rss2.0noNSwithModulesLocalNameClash.xml',
            'http://www.pocketsoap.com/rssTests/rss2.0NSwithModules.xml',
            'http://www.pocketsoap.com/rssTests/rss2.0NSwithModulesNoDefNS.xml',
            'http://www.pocketsoap.com/rssTests/rss2.0NSwithModulesNoDefNSLocalNameClash.xml']

    r = RssParser::new()
    getter = HttpGetter.new()
    urls.each { |url|
      print "#{url}\n"
      result = {}
      data = getter.readData(url, result)
      result = r.parse(data, result)
      pp(result)
    }

Of course readData supports a variety of parameters to make it a nice RSS netizen e.g. def readData(source, result, etag=nil, modified=nil, agent=nil, referrer=nil )

If you just run ruby rss-parser.rb it will run a series of tests from Simon Fell’s RSS Tests. If you want to test a single feed, ruby rss-parser.rb http://some.url/rss.xml or such.

The parser has been tested with Ruby 1.6.7 and 1.7.3 on Windows. There are some differences between 1.6 and 1.7 - the notable ones are the introduction of StringIO and pp (pretty printing) in 1.7. Both of these are available with the shim, a library of post 1.6 enhancements backported to 1.6.

I’ve decided to support 1.6 natively, so you will see code like:


  # abstract the differences between 1.7 and 1.6 w/o requiring the shim library
  begin
    require "stringio" if not defined? StringIO
    body = StringIO.new(data)
  rescue LoadError
    require "tempfile"
    body = Tempfile.new("CGI")
    body.binmode
    body.write(data)
    body.flush
    body.pos = 0
  end
  stream = body

  gzReader = Zlib::GzipReader.new(stream)

Step 1 is complete. We can parse RSS and get back a hash with several entries: items, an array of items, each of which is a hash containing a title, description and link; channel, a hash of channel information including the title, description and link; modified, the last modification timestamp; and possibly an etag.
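To make the shape concrete, here is what such a result might look like. The values are invented, and the use of string keys for items and channel is an assumption based on how the driver code reads result['etag'] and result['modified'].

```ruby
# Illustrative values only; the key names follow the description above.
result = {
  'channel'  => { 'title' => 'manicwave',
                  'description' => 'Surfing the Wave',
                  'link' => 'http://www.manicwave.com' },
  'items'    => [
    { 'title' => 'First post',
      'description' => 'Hello',
      'link' => 'http://www.manicwave.com/1' }
  ],
  'modified' => 'Mon, 06 Jan 2003 08:00:00 GMT',   # assumed timestamp format
  'etag'     => '"abc123"'                         # may be absent
}

puts result['channel']['title']
result['items'].each { |item| puts "#{item['title']} -> #{item['link']}" }
```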

In the next installment, we need to add some support for OPML. What better way to test our parser than to consume the OPML file from your current aggregator. We’ll wrap some kind of command-line driver around OPML, add a control file to maintain etags and modified timestamps, paving the way to slap an initial UI on NewsHeap.

No Sleep til Brooklyn

Friday, January 3rd, 2003

Even when the manicwave is in full tilt, I can’t proclaim things like 30 days to a more accessible web site. Units involving time aren’t flexible enough for the undulation. Several recent wave entries have alluded to the insufferable state of Windows RSS aggregators. My routine use of AmphetaDesk, my admiration for HEP and my longing for NetNewsWire on my daily driver notwithstanding, the RSS aggregator situation on Windows is appalling.

Rather than snipe and carp, I’m going to crank up the manicwave and do something about it.

The balance between my aversion to time units and my desire to develop in the round results in this catchy tag line No Sleep ’til NewsHeap. Not literally of course, but when the wave rolls, good things happen.

The approach: each entry will present a goal, a proposed solution and some code to bring it together. I’ll present a running list of things to do and check them off as I go.

I’ll start off with a random list of requirements and desires:

  • Ruby - this stuff is orthogonal to my day job. Ruby is about the most fun you can have these days. This will present some challenges relative to Python for instance. It’s a journey for which several destinations exist
  • Groks RSS - we’re all tired of getting This channel has no items to display. We’ll use my Ruby port of Pilgrim’s liberal RSS parser
  • Does OPML - Every time I want to try an aggregator, I’m forced to reenter my feeds. Tiring and now quite irritating.
  • Aggregates OPML - Have a local list of feeds, blend it with a blogroll or two. Shake, Serve over ice
  • blogging capabilities

and the list goes on. Feedback welcome.

Next I’ll lay out some of the prerequisites and some initial scaffolding. We’ll start with a quick look at Ruby and make some tough versioning calls.

Stay tuned for NewsHeap, the next installment.