Friday, September 19, 2008

Semantic-Content Management

We've been developing a tool set, or framework, or whatever, for several years. We call it "Tractare," which is Latin for "to handle, manage, perform." I'm a sucker for that kind of naming.

Anyway, what's it all about? Well, as the Internet becomes more saturated with raw information, keyword search engines really aren't enough to locate that needle in the haystack. We (content providers) need to describe our content in a way that users "get" -- we need to describe the "aboutness" of our content.

An example I like to use when speaking on this subject is this: if you were to use Google to search for "retarded", you would get a gazillion hits. But few, if any, of those hits would come up using the politically correct phrase "intellectually challenged". This is because a keyword engine like Google depends on the actual presence of the keyword, either as text in the content or as metadata. Now, you could encode both forms of this concept as metadata on your web page and it would be found. If you are a psychologist or someone working in mental health, you'd probably be getting the results you want. But if you are a firefighter, the word "retarded" has a whole different meaning. How do we express that? The answer has several parts, of course. But first, we need to capture the meaning of the content; the "aboutness". We need to associate the "firefighter" concept with the content that pertains to fighting fires. This is what Tractare permits us to do.

Tractare is a framework. It's not an off-the-shelf product. It is built on the idea of topic maps -- organizing content around indexes and concepts. Its true power lies in a combination of searching and navigation tools that allow the user to narrow the scope of their work to a set of concepts. We build custom CMS and delivery solutions on top of it.
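To make the topic-map idea a little more concrete, here is a minimal sketch -- not Tractare itself; every class name, topic, and URI below is invented for illustration -- of how concepts can be modeled independently of keywords and then associated with content:

import java.util.*;

// A minimal, hypothetical sketch of the topic-map idea: concepts ("topics")
// are first-class objects, and content is linked to them by association
// rather than by keywords embedded in the text.
public class TopicMapSketch {

    // A topic is a named concept, independent of any wording in the content.
    record Topic(String id, String name, Set<String> synonyms) {}

    // An occurrence ties a topic to a piece of content, identified by URI.
    record Occurrence(String topicId, String contentUri) {}

    private final Map<String, Topic> topics = new HashMap<>();
    private final List<Occurrence> occurrences = new ArrayList<>();

    public void addTopic(Topic t)               { topics.put(t.id(), t); }
    public void tag(String topicId, String uri) { occurrences.add(new Occurrence(topicId, uri)); }

    // Find content by concept, not by keyword: asking for the
    // "mental-disability" topic never returns firefighting content,
    // no matter which words the documents happen to use.
    public List<String> contentFor(String topicId) {
        List<String> uris = new ArrayList<>();
        for (Occurrence o : occurrences)
            if (o.topicId().equals(topicId)) uris.add(o.contentUri());
        return uris;
    }

    public static void main(String[] args) {
        TopicMapSketch map = new TopicMapSketch();
        map.addTopic(new Topic("fire-retardant", "Fire retardants",
                Set.of("retardant", "flame retardant")));
        map.addTopic(new Topic("mental-disability", "Intellectual disability",
                Set.of("retarded", "intellectually challenged")));
        map.tag("fire-retardant", "http://example.com/docs/foam-systems");
        map.tag("mental-disability", "http://example.com/docs/special-education");
        System.out.println(map.contentFor("fire-retardant"));
    }
}

The point of the sketch is the separation: the "retarded"/"intellectually challenged" problem above goes away because retrieval is driven by the topic, not by whichever word happens to appear in the text.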

The CMS systems we build usually include features found in social networking, such as folksonomies (as well as traditional taxonomy and classification support) and ranking/commenting. These features allow content providers to apply semantics to content in a number of new and different ways.

The delivery systems we build often include a number of search and navigation interfaces that web users have come to love, including mashups, classification searching, semantic browsing and so on.

Tuesday, September 9, 2008

Advertising-Based Publishing

In a previous post, I was exploring new trends in electronic publishing (April 30, 2008). One of the key points was that publishing revenues are shifting away from traditional subscription fees towards "free" information supported by advertising revenues.

"One well known revenue shift is towards advertising revenues. This is an old model, of course. Newspapers have been at it forever, and online search engines for years. But more recently, we're seeing a larger shift of revenue away from subscription sales for information resources and towards advertising."
I had this reinforced recently when I realized that even publishers of traditional reference works such as dictionaries and encyclopedias were increasingly garnering revenue from advertising rather than from subscription sales. Wow -- that was an eye-opener. It makes sense, though -- if I'm looking for an informative article on "Chocolate", am I going to pay for access to Britannica or would I go to the free Wikipedia? I guess the answer (for me at least) depends on how authoritative my answer needs to be. But in general, I would go to the free site. And I suspect I'm not alone. So how is a traditional reference publisher to compete in the age of Wiki-whatever? Product quality alone isn't enough. It has to be free too. Enter advertising.

On the surface of it, it seems easy to generate advertising revenues, especially if your area of publishing is targeted -- in fact, the more specialized your content, the more valuable you may be to advertisers. I'm not sure that this is true, but it sure looks that way. Virtually anyone can establish a Google AdSense account and tie it into their content publishing operation (as I've done with this blog). But who's actually making money at this? That's the hard question.

It always comes back to the same basic principle -- supply and demand. I loosely translate the supply side to "timely, quality content". Timely doesn't necessarily mean frequent; it means "frequent enough". Demand is partly driven by the content and partly by your sales/marketing operation -- partly the number of visitors to your web site and partly the number of visitors who pay attention to your advertising. So, we need quality content and we need to advertise its presence.

These sound like the same principles that have driven us forever.

Thursday, July 3, 2008

Gilbane, San Francisco, 2008

I spoke at the Gilbane Conference in San Francisco in June (http://gilbanesf.com/) on the subject of using folksonomies as a publisher's tool for classifying content. The session moderator had warned me that attendance was a little thin and that we shouldn't expect too many in the audience, but my rough estimate was that we had about 45 people there. I was pleased. And there were lots of questions and a lot of discussion.

Here's the abstract from my presentation:

Folksonomies, Just Good Enough For All Kinds of Things
Extending Folksonomies to Describe All Kinds of Content

Folksonomies are gaining popularity in the content delivery world as a "good enough" way to classify content; tag clouds are becoming commonplace on content-driven websites. More and more, folksonomies are working their way back into the content creation and management process. Huge volumes of content and increasingly diverse user requirements are pushing content management operations to look to folksonomies as a Web 2.0 way to describe the "aboutness" of their content.

With this simple concept in hand, why not use this technology to describe more than just content subjects? "Aboutness" is just one aspect of your content. What about order, threads, or related content? This talk will explore the extended use of folksonomies as a technology for enhancing content in new and different ways.
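(This is my own illustration, not part of the presentation abstract.) To give a flavor of what "more than aboutness" could look like in practice, here is a small sketch built on an invented tag convention -- a facet prefix such as subject:, order:, thread:, or related: on each tag -- just one plausible encoding:

import java.util.*;

// A hypothetical tag convention for describing more than "aboutness":
// a namespace prefix turns a free-form tag into a faceted statement
// about ordering, threading, or related content.
public class FacetedTags {

    // Split "thread:fire-safety" into facet "thread" and value "fire-safety";
    // a bare tag like "chocolate" defaults to the "subject" facet.
    static Map.Entry<String, String> parse(String tag) {
        int i = tag.indexOf(':');
        return (i < 0)
                ? Map.entry("subject", tag)
                : Map.entry(tag.substring(0, i), tag.substring(i + 1));
    }

    public static void main(String[] args) {
        List<String> tags = List.of(
                "chocolate",                                      // plain aboutness
                "order:3",                                        // position within a sequence
                "thread:fire-safety",                             // membership in a thread
                "related:http://example.com/docs/foam-systems");  // pointer to related content

        // Group the tags by facet so a delivery system can treat each facet
        // differently (subject navigation, sequencing, "see also" links).
        Map<String, List<String>> byFacet = new TreeMap<>();
        for (String t : tags) {
            var e = parse(t);
            byFacet.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        byFacet.forEach((facet, values) -> System.out.println(facet + " -> " + values));
    }
}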

I can send the PowerPoint to anyone interested; please email me or leave a comment on this blog.

Wednesday, April 30, 2008

New Trends in Electronic Publishing

I'm a consultant (sounds like an admission) in the publishing industry, and throughout my career I've developed and integrated new technologies to help publishers stay competitive -- everything from back-room content management to user-facing "wow" tools. And I'm noticing some trends in the publishing business lately, especially in electronic publishing. These are some things I'm observing from my clients specifically, and from public media (annual reports, news, and so on). I think we're seeing a big shift in the way publishers generate revenue, away from charging for the information itself.

One well known revenue shift is towards advertising revenues. This is an old model, of course. Newspapers have been at it forever, and online search engines for years. But more recently, we're seeing a larger shift of revenue away from subscription sales for information resources and towards advertising. I think this is driven by a couple of factors.

First is the nature of the internet and social networking. These are creating an underlying expectation of free information on the web. No news there! But this is creating new revenue streams in the form of advertising that are replacing traditional subscription revenues. And with tools like Google's AdSense, this is easier than ever to do, making it possible for just about anyone to create an ad-driven information portal.

The second factor is the rise of information portals repurposing and rebranding content for their own purposes. Tools like RSS and aggregators like FeedBurner make it easy to brand your own content, creating a demand for and an expectation of specialized information sources. In a sense, this is dividing the traditional publishing industry into two parts -- information creation and information dissemination. Publishers used to integrate both these functions under one roof, but in the repurposing/rebranding scenario the lines between these functions become clear. New revenue streams are being created and recognized by selling content for aggregation, repurposing and rebranding.

Related to this is the increasing role public entities are playing in the information dissemination arena. A large market used to exist for private publishing of government information. In the old days, these publishers took government data, added value through indexing or other finding aids, and sold the improved access. As technology tools have improved, government agencies are more easily and readily able to perform this publishing themselves. So adding value to the content itself has become more important to generating revenue than using technology to improve access.

And this is exacerbated by the economy. With more choices and less budget, customers will opt for the cheapest product that reduces their workload. User expectations are indeed greater. Users now expect to access content with minimum wasted time and maximum accuracy. To stay competitive, publishers need more sophisticated user-facing technology to meet these expectations. In turn, new revenue streams are coming from sales of integrated technology and content or from the sale of the technology itself (software products).

So, do I have a conclusion from this? Not really. I don't think anyone can accurately predict the course of publishing over the next few years. I do think publishers need to keep their tools sharp. My personal belief is that investments in the right technology will win out in the end, but mostly because it will enable the publisher to respond to changes more rapidly. Of course, I'm a techy saying this, so my view is slanted. I'm reminded of a joke by Emo Phillips: "I used to think the brain was the most important organ in the human body, then I realized what was telling me this!" But I can't help thinking that technology investment, especially in these tough economic times, will win out.

Getting Some Real Offshore Experience, Redux

I posted about getting some real experience working with an offshore development group. As I said, we entered into this arrangement with a short-term trial to see how it would work out. We ended that trial today, and I guess I still have mixed results to report.

Working with an offshore team in an Agile world is annoying. A lot of time is lost because the offshore developers are not co-located (or even within the same time zone). I would say that my initial conclusion about the TCO for the code developed still holds -- about the same as, or even more than, the cost of doing it with our regular development team.

We had some problems with the developer initially assigned to work with us. To their credit, the offshore company took care of that, replacing him with a more mature, more professional developer. We lost some time and money in that transition, but the startup cost of this kind of relationship should not be underestimated.

I guess the conclusion I've come to is this: write good specs and way overestimate the cost. But the skills are there. I will (probably) work with this group again on the next project we get.

Saturday, April 12, 2008

Getting Some Real Offshore Experience

In a former post (see "Offshore Development Woes"), I wrote some personal conclusions from reading Martin Fowler on "Using an Agile Software Process with Offshore Development". It was a summary of what I got from Mr. Fowler's experiences. Interestingly, though I have worked with many developers from many Asian countries over the years, I have never actually worked with an offshore software development company performing development for hire under my direction. It's easy to be an armchair critic.

But that's changed, at least a little. Faced with a recent influx of development projects and a lack of staff to get them done in a timely manner, I decided to develop a relationship with a firm based in India. Being the cautious type I am, our initial foray into this arrangement was for a limited time and scope -- I really wanted to see how well it would work out. And I was especially intrigued given Mr. Fowler's experiences because we too are an agile development shop. I had misgivings based on the rapidity of our development and the time and language differences. But the hourly rates proffered were downright seductive. I figured that at the $25 per hour rate we were being charged, we could afford to lose some time to these difficulties and still come out ahead. And I really want it to work.

We're now 7 weeks into this trial and I guess I have mixed results to report. First, as Mr. Fowler observed, there are no real cost savings in spite of the seductive hourly rate. Between the language barriers and the time differences, things just plain take longer. A lot longer. And that costs. It's not like the developer in India isn't charging me for those differences. We use email and Skype to get our communication done, since written English at least removes accents from the equation. And our developers in India work odd hours, so in spite of the 9.5-hour time difference, I can usually talk to them throughout the morning.

But I've now worked with the group on three development tasks. The initial effort was to get the existing project development infrastructure (Eclipse, Tomcat, CVS, MySQL, and the project code) up and running. I have to say, I was impressed with how quickly and smoothly that happened. It showed that the Indian team was well acquainted with these tools. So we moved on to actual development tasks. I made an effort to be more clear and concise than I usually am in providing task writeups. We had several email exchanges asking for clarification, and I got my first code delivery in about a week. I was very disappointed. The code was nowhere near complete, was undocumented, did not follow style guidelines, and was generally unprofessional (a Java package name of "jesus.xml"?). But the real problem was the lack of some basic computer science -- a huge switch construct instead of arrays, a single large method instead of encapsulation, and so on. I rewrote the code and sent it back with further instructions on what I was looking for.
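To give a sense of the kind of rewrite involved -- this is an invented illustration, not the delivered code -- compare the pattern we kept receiving with the lookup-table version I kept sending back:

import java.util.Map;

// An invented illustration of the rework: a long switch that maps codes to
// labels, versus a small lookup table that does the same job.
public class SwitchVsLookup {

    // The pattern we kept receiving: one branch per value, growing forever.
    static String labelWithSwitch(int code) {
        switch (code) {
            case 1:  return "Draft";
            case 2:  return "Review";
            case 3:  return "Published";
            // ... dozens more cases ...
            default: return "Unknown";
        }
    }

    // The rewrite: the data moves into a table, the logic stays one line.
    private static final Map<Integer, String> LABELS =
            Map.of(1, "Draft", 2, "Review", 3, "Published");

    static String labelWithLookup(int code) {
        return LABELS.getOrDefault(code, "Unknown");
    }

    public static void main(String[] args) {
        System.out.println(labelWithSwitch(2));   // Review
        System.out.println(labelWithLookup(2));   // Review
    }
}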

To make a long story short(er), this seems to be the pattern -- delivered code is under-par, I rewrite it and return it, and they then follow my pattern. This put the first several tasks way over budget.

We did make a complaint to the management team. Their solution was to add another resource (a more senior programmer) to the team -- at their own cost, to their credit. But now, every detail is being questioned and designed out in advance. We've lost our agile approach. And the costs are higher still, since I spend so much time trying to figure out the answers to their questions.

So, in all, this effort has proven to be more costly than we had hoped and it pushed our project costs over budget. It may be that a different team from India could perform better, or that if we developed using a traditional "waterfall" life cycle, things would be better. But this has not been a success for us.

Friday, March 28, 2008

Using Folksonomies in Content

I've been thinking more and more about folksonomies as a replacement for traditional taxonomies. We can create all the tools we need, or work off of existing tools such as del.icio.us. But in the end, size does matter. For a folksonomy to work, a *lot* of people have to look at and tag the content. To me, this means content has to be exposed to readers -- and lots of them -- who will do the tagging.

Traditionally, classification has been a controlled, back-room content management operation performed by a small, select group of indexers. So first, we have to decide to relinquish some control. While this may seem scary, in reality the volume of tagging makes up for the lack of specific control. And we can build the tagging tools so that editors and indexers can follow behind the taggers and clean things up. If we can achieve a large volume of tagging, its sheer repetition will surface common tags, threads and relationships.
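Here is a minimal sketch of what that editorial "follow behind" pass might look like; the synonym table and the frequency threshold are assumptions for illustration, not features of any particular system:

import java.util.*;

// A minimal sketch of an editorial cleanup pass over raw, user-applied tags:
// normalize them, fold obvious variants together, and keep only the tags
// enough people have applied -- volume standing in for editorial control.
public class TagCleanup {

    // A tiny, hand-maintained synonym table an indexer might curate over time.
    private static final Map<String, String> SYNONYMS = Map.of(
            "choc", "chocolate",
            "chocolates", "chocolate");

    static String normalize(String raw) {
        String t = raw.trim().toLowerCase().replaceAll("\\s+", "-");
        return SYNONYMS.getOrDefault(t, t);
    }

    // Count normalized tags and drop anything applied fewer than minCount times.
    static Map<String, Integer> commonTags(List<String> rawTags, int minCount) {
        Map<String, Integer> counts = new HashMap<>();
        for (String raw : rawTags)
            counts.merge(normalize(raw), 1, Integer::sum);
        counts.values().removeIf(n -> n < minCount);
        return counts;
    }

    public static void main(String[] args) {
        List<String> raw = List.of("Chocolate", "chocolates", "choc", "Baking", "baking", "typo");
        // Prints something like {chocolate=3, baking=2}; the stray "typo" tag drops out.
        System.out.println(commonTags(raw, 2));
    }
}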

Second, we need to find a large group of taggers. Depending on the nature of the content, this can be accomplished in a couple of ways. First, and perhaps easiest, is to expose content to the web. Perhaps through incentives, taggers can be enticed to tag. And, of course, staff of the publisher should be encouraged to participate as well. Failing that, a publisher could look at a human-automation engine such as Amazon's Mechanical Turk, where large numbers of minuscule tasks that are best done by people are spread out over many people for a small fee.

Thursday, March 27, 2008

Jobs (or Maybe Just Job)

Retrieval Systems Corporation (my employer) needs an experienced developer to serve as a project lead, closing out an ongoing project. The project uses SQL Server, IIS, Exchange/Outlook and VB.NET to track document filings and enforcement activities through a legal process. We are currently wrapping up the initial data entry and enforcement status components of the application. The next phases will involve integration with QuickBooks and SharePoint services.

The individual we are looking for will be a professional software developer with a minimum of three years' experience. They will need a solid knowledge of SQL, .NET, and development with Microsoft web technologies. The job is for a project lead who will be expected to work on all aspects of the project life cycle.

Retrieval Systems Corporation is a developer of custom software. Our developers work in small teams and are involved in most aspects of a project. We are located at 2071 Chain Bridge Rd., Suite 510, Vienna, VA 22182. Work on this project will be performed on site at our offices.

Thursday, March 20, 2008

Agile for Consultant Development Teams

I'm a consultant (sounds like an admission at the start of a 12-step program). I love the Agile approach to software development but there seems to be a fundamental conundrum I haven't been able to solve -- software project pricing. Seems like my clients want to know, in advance, how much the development of software is going to cost (the nerve of some people!). I just can't see how to get a price together for a project using an approach that essentially leaves out the up-front design effort that is characteristic of the traditional waterfall approach. And the Agile experts with whom I've spoken have no answer to this either.

Now, I don't actually believe that the waterfall approach gives any better or more realistic cost estimate, except perhaps on the most simplistic of projects. If anything, I believe it is a more expensive method that produces a false sense of cost security. But, I'm not getting into a comparison of these approaches. I'm just trying to figure out how to give a customer a price for an Agile project. And I've tried to say things like "what is your budget? We'll do as many of your priorities as possible within that." There is a disconnect between what clients want to know (how much for the whole thing) and the reality of what the "whole" thing needs to be and how much it will really cost.

One approach would be to simply lie. Tell the client what they want to hear. And then manage the consequences in the middle of the project. Doesn't seem like the right way, though.

So instead, we've been trying to educate our clients on the tension between cost and features. In essence, we've been using a variation on blind Wideband Delphi to create estimates. We take the full list of features, attempt to designate tasks for implementing them, and then send that around to as many senior developers as we can -- developers with the right skill set to create an estimate for that type of project. Each estimator does his estimate alone and sends the results back to a coordinator, and then we go through the estimates collectively to resolve the differences. Using this estimate of all features, we then attempt to explain to our client that the estimate is under their control -- choose high-priority features to implement first and then work through the list until either features or budget are exhausted.

This is not too different from Planning Poker (http://www.planningpoker.com/). As I understand that technique, several estimators each make a bid for how long each task or feature will take. The actual bids are not really hours; they are units. After several rounds of bidding and discussion of the differences between bids, the winning bids are turned into dollars by applying the development team's velocity to the bids and figuring out real hours. This means that we have to know that velocity -- not an easy problem in its own right.
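For what it's worth, the arithmetic itself is simple once you have (or guess at) a velocity. Here it is as a tiny sketch -- every number below (points, velocity, cost per iteration) is made up for illustration:

// A worked sketch of turning point estimates into a price:
// points / velocity = iterations, iterations * team cost per iteration = price.
// None of these figures come from a real project.
public class EstimateSketch {

    public static void main(String[] args) {
        int[] featurePoints = {3, 5, 8, 2, 13, 5};   // winning bids, in points
        double velocity = 12.0;                      // points the team completes per iteration
        double costPerIteration = 16_000.0;          // fully loaded team cost per iteration

        int totalPoints = 0;
        for (int p : featurePoints) totalPoints += p;

        double iterations = totalPoints / velocity;
        double estimate = iterations * costPerIteration;

        System.out.printf("Total points: %d%n", totalPoints);       // 36
        System.out.printf("Iterations needed: %.1f%n", iterations); // 3.0
        System.out.printf("Estimated cost: $%,.0f%n", estimate);    // $48,000
    }
}

The hard part, of course, is not the multiplication; it's knowing the velocity and getting honest point estimates in the first place.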

But we end up in the same place -- with an estimate for the whole project which we then have to "sell" to the client. Is there a better way?

Thursday, March 13, 2008

Bill Gates at the NVTC Breakfast Forum

I got the opportunity to attend the Northern Virginia Technology Council (NVTC) breakfast this morning, in DC at the Capitol Hilton. The speaker was Bill Gates. You know, that Microsoft guy? It was kind of cool to be a couple of hundred feet away from the richest man in the world (or maybe the second richest, now that Warren Buffett's star has been rising). Bill spoke at length on the direction computing is headed and mentioned everything from surface computing (your desk working for you?) to intention recognition through visualization. Very cool stuff, though I think I'll make sure I stay dressed in front of the PC-CAM.

And, I am fascinated by the concepts of surface interfaces -- as an active sailor, I'm envisioning a future sailboat that has a surface navigation computer that ties together mapping/charting, GPS, depth sounding (and underwater 3-D profiling), routing and guidance, with timely updates on chart data from the NOAA Notices to Mariners, coupled with a finger-driven interface similar to what Apple now uses on their iPods and iPhones (oh, wait, that's not from Microsoft).

But, I have to say, I was not "wow-ed" by Bill Gates' talk. I've heard him speak before and it's pretty much the same spiel I heard 20 years ago at a Microsoft CD-ROM Multimedia conference -- better human interaction with machines.

There were a couple of questions from the audience after the talk, and one questioner pounced on Microsoft for treating security as an afterthought (hear, hear). But my real disappointment is that Bill Gates' vision did not account for any of the real-world problems we seem to be facing: global warming, rising energy costs, environmental issues and so on. Where is computing really going to be in 20 years if, as Al Gore is suggesting, sea level has risen enough to flood some of our coastal cities? Or if we can no longer tolerate the environmental impact of hardware disposal?

So, I throw out a challenge: "What will computing look like in an age where we have higher sea levels, unaffordable energy, and intolerable environmental issues?" Bill?

Wednesday, March 12, 2008

Folksonomies Applied

In a previous blog entry, I was starting to think about folksonomies as they might apply to content management. Having lots of experience in topical classification, thesauri, and related tools, I figured this would be a simple discussion, but it isn't.

What is a Folksonomy? "Folksonomy (also known as collaborative tagging, social classification, social indexing, and social tagging) is the practice and method of collaboratively creating and managing tags to annotate and categorize content. In contrast to traditional subject indexing, metadata is not only generated by experts but also by creators and consumers of the content. Usually, freely chosen keywords are used instead of a controlled vocabulary" (Wikipedia).

By their nature, folksonomies are created by the people. I think the key to a successful folksonomy is participation by many people -- we make up for the lack of a controlled, standardized vocabulary and its application to content by sheer volume and enthusiasm from a widespread user community. In fact, complaints about this approach usually center around the imprecise nature of the tagging. Since users typically apply tags to content, the tags are often ambiguous, overly personalised and inexact. But Guy and Tonkin make a persuasive argument that user-applied tags are in fact converging -- that the overall universe of applied tags is becoming self-limiting. If so, then the universe of tags being created in services like Flickr and Del.icio.us is becoming a useful basis for classifications.

And in other, related developments, these services are beginning to categorize their tags, especially at Del.icio.us, where there are now classification tags and action tags, among others. To me these seem like what we gray-beards call facets.

But that's not really my point, though I think it is important. I think we need some tools that can work with content management to allow tagging, maybe even super tagging, wherein the tags are members of controlled facets. This isn't really hard. Virtually every content management system "knows" about content by a URI. And there are some very cool features in the Del.icio.us service, including keeping my bookmarks and tags private, and retrieving them via an API later. So, we can set a bookmark in Del.icio.us containing the URI of the content we want to tag with the tags we want for that content.

I tried this very simply by registering at Del.icio.us, turning on the "private bookmarks" setting, and putting the Del.icio.us buttons on my browser toolbar. Then I pointed my browser at a content item in a CMS (Alfresco) and clicked the Del.icio.us "Tag" button. Added tags and saved it. I can see the tags and the URIs in my items on the Del.icio.us website.

But, to actually use this information, we need to pull the tags and content URIs back out of Del.icio.us. XML to the rescue. Or, rather, XML and the Del.icio.us API. We can fetch the tags we're using with this URL: https://api.del.icio.us/v1/tags/get. We can also see all of our content using this URL: https://api.del.icio.us/v1/posts/get. And we can retrieve by tag: https://api.del.icio.us/v1/posts/get?tag=C
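As a bare-bones sketch of how that retrieval might look from code -- using only the URLs above, with the account's username and password sent via HTTP basic authentication, and with error handling and XML parsing left out -- something like this would print the raw XML responses:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

// A bare-bones sketch of pulling our tags and bookmarked content URIs back
// out of Del.icio.us via the v1 API URLs mentioned above. The raw XML is
// simply printed; a real integration would parse the tag and post elements
// and feed them back into the CMS.
public class DeliciousFetch {

    static String fetch(String urlString, String user, String password) throws Exception {
        URL url = new URL(urlString);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String credentials = Base64.getEncoder()
                .encodeToString((user + ":" + password).getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + credentials);

        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) body.append(line).append('\n');
        }
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        String user = args[0], password = args[1];
        // All of our tags, then all of our bookmarked content.
        System.out.println(fetch("https://api.del.icio.us/v1/tags/get", user, password));
        System.out.println(fetch("https://api.del.icio.us/v1/posts/get", user, password));
    }
}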

Pretty cool stuff.

Friday, March 7, 2008

Library Takes 'Talking Books' Digital, Washington Post, March 5, 2008

I love this article in the Washington Post. For a couple of reasons. First, I have several friends who are legally blind (which I think is different from "illegally blind"). These kinds of tools help them enormously. It makes me feel like, as a society, we're doing the right things at least sometimes.

But, and perhaps more importantly, I'm thrilled to see this because WE DEVELOPED IT. That's right, folks: the excellent programming staff at Retrieval Systems (who pay me) wrote the underlying software for the digital talking book for the Library of Congress National Library Service for the Blind and Physically Handicapped (NLS). So this makes us proud.

We've been doing work in the background for the blind community for many years. Mostly this has involved library data exchange. For both NLS and the Recording for the Blind and Dyslexic (RFB&D) we wrote bunches of code to help them exchange bibliographic records, supporting creation of a union catalog so that blind patrons could more readily find books. And for NLS we've also developed some production tools to help in the creation of digital talking books.

But the DTB player is something of which we're especially proud. Not only was it technically challenging, it was socially responsive.

Wednesday, February 13, 2008

A New Kind of Content Management

I've been enamored with classification of content through my entire career (which is certainly long enough). Early on, I was working on library automation systems and in particular with Library of Congress Subject Headings (LCSH) data and I always thought classification of documents and other content types was a panacea we should try to achieve. Librarians used these and other tools as finding aids to locating specific information, usually books.

Along came the internet, and the amount of information available grew exponentially. Search engines provide keyword access to text-based content. Publishing was redefined to include anyone putting information out on the internet. Published content became a thing to be managed, with buzzwords like single-sourcing, XML, multimedia, syndication, blogs, wikis, and so on.

Well, the world is very different now and with everything from books and documentation to audio and video as well as thought streams all published on the web, finding a particular bit of information can be a nightmare. Even traditional textual content is hard to find in the vast archive that is the internet.

So how do we find things now? What are our modern finding aids starting to look like?

Some very cool new technologies have come along recently: folksonomies and mashups, to name two.

Folksonomies in particular offer a new approach to an old problem -- how to capture the "aboutness" of content. Classifying content is an age-old process that has spawned many fields, including taxonomy and indexing. These, in turn, have created an arcane set of rules and procedures wherein maintaining the indexes or thesauri becomes a big effort on its own -- at times larger than the content management effort it is supporting.

Mashups are a cool way to combine content from multiple sources into a single presentation. Kind of like federated search and portals in a social-networking environment.

I'll be exploring these more over the next several posts.