Computer Science

I’ve included a lot of stuff here that doesn’t really deserve to be here, such as programming and Web 2.0, but it’s how I think about the whole thing, so I guess it counts.

Automata-Based Programming With Petri Nets – Part 1

Petri Nets are extremely powerful and expressive, but they are not as widely used in the software development community as deterministic state machines. That’s a pity – they allow us to solve problems beyond the reach of conventional state machines. This is the first in a mini-series on software development with Petri Nets. All of the code for a full feature-complete Petri Net library is available online at GitHub. You’re welcome to take a copy, play with it and use it in your own projects. Code for this and subsequent articles can be found at http://github.com/aabs/PetriNets.


Functional Programming in C# – Higher-Order Functions

  1. Functional Programming – Is it worth your time?
  2. Functional Programming in C# – Higher-Order Functions

This is the second in a series on the basics of functional programming using C#. My topic today is one I touched on last time, when I described the rights and privileges of a function as a first class citizen. I’m going to explore Higher-Order Functions this time. Higher-Order Functions are functions that themselves take or return functions. Meta-functions, if you like.

As I explained last time, my programming heritage is firmly in the object-oriented camp. For me, the construction, composition and manipulation of composite data structures is second nature. A higher-order function is the equivalent from the functional paradigm. You can compose, order and recurse a tree of functions in just the same way as you manipulate your data. I’m going to describe a few of the techniques for doing that using an example of pretty printing some source code for display on a web site.

I’ve just finished a little project at Readify allowing us to conduct code reviews whenever an interesting code change gets checked into our TFS servers. A key feature of that is pretty-printing the source before rendering it. Obviously, if you’re displaying XHTML source on an XHTML page, your browser will get confused pretty quickly unless you take steps to HTML-escape all the markup that might corrupt the display. The examples I’ll show will highlight the difference between the procedural and functional approaches.

This example shows a fairly typical implementation that takes a file that’s been split into lines:

public static string[] RenderLinesProcedural(string[] lines)
{
    for (int i = 0; i < lines.Count(); i++)
    {
      lines[i] = EscapeLine(lines[i]);
    }
    return lines;
}

public static string EscapeLine(string line)
{
  Debug.WriteLine("converting " + line);
  return line.Replace("  ", "&nbsp;&nbsp;")
      .Replace("\t", "&nbsp;&nbsp;")
      .Replace("<", "&lt;")
      .Replace(">", "&gt;");
}

There are a few things worth noticing here. In C#, strings are immutable. That means that whenever you think you are changing a string, you’re not; in the background, the CLR is constructing a modified copy of the string for you. The array of strings, on the other hand, is not immutable, so a legitimate procedural approach is to make an in-place modification of the original collection and pass that back. The EscapeLine method repeatedly makes modified copies of the line string, passing back the last copy.

Despite C# not being a pure functional programming language[1], it’s still doing a lot of copying in this little example. My early impression was that pure functional programming (where all values are immutable) would be inefficient because of all the copying going on. Yet here is a common-or-garden object-oriented language that uses exactly the same approach to managing data, and we all use it without a qualm. In case you didn’t know, StringBuilder is what you should be using if you need to make in-place modifications to strings.
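
For comparison, here’s what an in-place, StringBuilder-based version of the escaping routine might look like. This is a sketch of mine, not part of the original article, and the replacement strings are illustrative:

using System.Text;

public static string EscapeLineWithBuilder(string line)
{
    // Builds the escaped line in a single buffer instead of allocating a new
    // string for every Replace call. The replacements mirror EscapeLine above.
    var sb = new StringBuilder(line.Length);
    foreach (char c in line)
    {
        switch (c)
        {
            case '<': sb.Append("&lt;"); break;
            case '>': sb.Append("&gt;"); break;
            case '\t': sb.Append("&nbsp;&nbsp;"); break;
            case ' ': sb.Append("&nbsp;"); break;
            default: sb.Append(c); break;
        }
    }
    return sb.ToString();
}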

Let’s run the procedural code and record what happens:

private static void TestProcedural()
{
   string[] originalLines = new string[] { "<head>", "</head>" };
   Debug.WriteLine("Converting the lines");
   IEnumerable<string> convertedStrings = RenderLinesProcedural(originalLines);
   Debug.WriteLine("Converted the lines?");

   foreach (string s in convertedStrings)
   {
       Debug.WriteLine(s);
   }
}

Here’s the output:

[Screenshot: debug output from TestProcedural]

As you can see, the lines all got converted before we even got to the "Converted the lines?" statement. That’s called ‘Eager Evaluation’, and it certainly has its place in some applications. Now let’s use Higher-Order Functions:

public static IEnumerable<string> RenderLinesFunctional(IEnumerable<string> lines)
{
    return lines.Map(s => EscapeString(s));
}

static IEnumerable<R> Map<T, R>(this IEnumerable<T> seq, Func<T, R> f)
{
   foreach (var t in seq)
     yield return f(t);
}

static string EscapeString(string s)
{
   Debug.WriteLine("converting " + s);
   return s.Replace("  ", "&nbsp;&nbsp;")
     .Replace("\t", "&nbsp;&nbsp;")
     .Replace("<", "&lt;")
     .Replace(">", "&gt;");
}

private static void TestFunctional()
{
   string[] originalLines = new string[] { "<head>", "</head>" };
   Debug.WriteLine("Converting the lines");
   IEnumerable<string> convertedStrings = RenderLinesFunctional(originalLines);
   Debug.WriteLine("Converted the lines?");

   foreach (string s in convertedStrings)
   {
       Debug.WriteLine(s);
   }
}

This time the output looks different:

[Screenshot: debug output from TestFunctional]

At the time that the "Converted the Lines?" statement gets run, the lines have not yet been converted. This is called ‘Lazy Evaluation’[2], and it’s a powerful weapon in the functional armamentarium. For the simple array that I’m showing here, the technique looks like overkill, but imagine that you were using a paged control on a big TFS installation like Readify’s TFSNow. You might have countless code reviews going on. If you rendered every line of code in all the files being viewed, you would waste both processor and bandwidth resources needlessly.
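
To make the pay-off concrete, here’s a hypothetical paging snippet (mine, not from the original post; it assumes a using System.Linq directive and an allLines array holding the whole file). Because the functional version is lazy, only the lines you actually enumerate ever get escaped:

// Only the first 20 lines are escaped; EscapeString never runs for the rest,
// no matter how large allLines is.
IEnumerable<string> firstPage = RenderLinesFunctional(allLines).Take(20);
foreach (string line in firstPage)
    Debug.WriteLine(line);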

So what did I do to change the way this program worked so fundamentally? The main thing was to opt for the IEnumerable<string> interface, which gave me the scope to provide an alternative way of representing the collection. In the procedural example, the return type was a string array, so I was bound to create and populate the array before returning from the function. That’s a point worth highlighting: use iterators as return types where possible – they allow you to mix paradigms. Converting the signatures to IEnumerable is not enough on its own, though; I could change RenderLinesProcedural to return IEnumerable<string>, but it would still use eager evaluation.

The next thing I did was use the Map function to return a functional iterator rather than a concrete object graph, as was done in the procedural example. I created Map here to demonstrate that there is no funny LINQ business going on in the background; in most cases I would use the Enumerable.Select() extension method from LINQ to do the same thing. Map is a function that is common in functional programming: it allows the lazy transformation of a stream or collection into something more useful. Map is the crux of the transformation – it lets you insert a function into the simple process of iterating a collection.
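
For reference, a LINQ-only version of RenderLinesFunctional would look something like this (a sketch; it needs a using System.Linq directive and behaves identically to the Map-based version):

public static IEnumerable<string> RenderLinesWithSelect(IEnumerable<string> lines)
{
    // Enumerable.Select is LINQ's built-in Map: it defers execution and applies
    // EscapeString to each element only as the sequence is enumerated.
    return lines.Select(s => EscapeString(s));
}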

Map is a Higher-Order Function: it accepts a function as a parameter and applies it to a collection on demand. Eventually you will need to deal with the raw data – such as when you bind it to a GridView – but until that point you can hold off committing resources that may not get used. Map is not the only HOF we can use in this scenario. We’re repeatedly calling String.Replace in our functions; perhaps we can generalise the idea of repeatedly calling a function with different parameters.

// Note: as an extension method, On must be declared inside a static class.
public static Func<T, T> On<T>(this Func<T, T> f, Func<T, T> g)
{
    return t => g(f(t));
}

This method encapsulates the idea of composing functions. I’m creating a function that applies the inner function f to an input value of type T, and then applies the outer function g to the result. In mathematical notation this would be written "g ∘ f", meaning apply f first, then g. Composition is a key way of building up more complex functions. It’s the linked list of the functional world – well it would be if the functional world were denied normal data structures, which it isn’t. :P

Notice that I’m using an extension method here, to make it nicer to deal with functions in your code. The next example is just a test method to introduce the new technique.

private static void TestComposition()
{
    var seq1 = new int[] { 1, 3, 5, 7, 11, 13, 19 };
    var g = ((Func<int, int>)(a => a + 2)).On(b => b * b).On(c => c + 1);
    foreach (var i in seq1.Map(g))
    {
      Debug.WriteLine(i);
    }
}

TestComposition uses the ‘On’ extension to compose functions into more complex functions. The actual function is not really that important, the point is that I packaged up a group of functions to be applied in order to an input value and then stored that function for later use. You might think that that’s no big deal, since the function could be achieved by even the most trivial procedure. But this is dynamically composing functions – think about what you could do with dynamically composable functions that don’t require complex control logic to make them work properly. Our next example shows how this can be applied to escaping strings for display on a web page.

void TestComposition2()
{
   var strTest = @"<html><body>hello world</body></html>";
   string[][] replacements = new[]
     {
       new[]{"&", "&amp;"},
       new[]{"  ", "&nbsp;&nbsp;"},
       new[]{"\t", "&nbsp;&nbsp;"},
       new[]{"<", "&lt;"},
       new[]{">", "&gt;"}
     };

  Func<string, string> f = x => x;
  foreach (string[] strings in replacements)
  {
     var s0 = strings[0];
     var s1 = strings[1];
     f = f.On(s => s.Replace(s0, s1));
  }

  Debug.WriteLine(f(strTest));
}

This procedure is again doing something quite significant – it’s taking a data structure and using that to guide the construction of a function that performs some data-driven processing on other data. Imagine that you took this from config data or a database somewhere. The function that gets composed is a fast, directly executable, encapsulated, interface free, type safe, dynamically generated unit of functionality. It has many of the benefits of the Gang Of Four Strategy Pattern[3].

The techniques I’ve shown in this post demonstrate some of the power of the functional paradigm. I described how you can combine higher order functions with iterators to give a form of lazy evaluation. I also showed how you can compose functions to build up fast customised functions that can be data-driven. I’ve also shown a simple implementation of the common Map method that allows a function to be applied to each of the elements of a collection. Lastly I provided a generic implementation of a function composition mechanism that allows you to build up complex functions within a domain.

Next time I’ll introduce the concept of closure, which we’ve seen here at work in the ‘On’ composition function.
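
To give a taste of what that means (this example is mine, not from the original post): a lambda ‘closes over’ the local variables it mentions, keeping them alive after the scope that created them has gone – which is exactly what happens to f and g inside On, and to s0 and s1 in TestComposition2.

// MakeGreeter is a made-up illustration: the returned lambda captures the
// greeting parameter, so the value survives long after MakeGreeter returns.
static Func<string, string> MakeGreeter(string greeting)
{
    return name => greeting + ", " + name;
}

// ...
Func<string, string> greet = MakeGreeter("Hello");
Debug.WriteLine(greet("world"));   // prints "Hello, world"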

Some references:

1. Wikipedia: Pure Functions

2. Wikipedia: Lazy Evaluation

3. Wikipedia: Strategy Pattern

The Future of Search

The future of search lies in finding ways to bypass search engines altogether. Where information exists locally to make sense of what you’re after, it will be used to create better searches. Data mining will be used to form a picture of which results go together, and what meanings a user attaches to a word. Alternative, unambiguous entry points to the web, like spatial search, will provide quicker routes to data for compatible hand-held devices.

When the mass of data finally overwhelms the conventional search engines, something will have to give. If we want our data out there being used, we will have to transition our content over to semantic web technologies. These technologies provide solutions to developers seeking smarter software, and provide search engines with a way to give better results. New user interface idioms will have to be invented, but in the end the web will be a better place to live.

Crisis? What Crisis?

Search engine technology faces a pressing challenge. The challenge is to quickly yield relevant results from the avalanche of data we are producing. It’s not enough to find the data eventually – search engines must find the right data immediately. They must deduce what ‘right‘ means for a query, and that requires a level of understanding they don’t currently have. Every three years the data we produce doubles – in the next three years we will have three times as much data as we created during the whole of history up to 2007. By 2010 search engines will have to work three times harder than today, or be three times less discriminating. Surely, incremental fixes to search algorithms can’t keep up?

Perhaps ‘avalanche’ is not the right term – ‘lahar’ is more appropriate. A lahar is a colossal down-pouring of mud and water. Along with the water comes a variety of unwelcome dross. How do you distinguish worthwhile content from dross? Sometimes it is not easy. The continued success of Nigerian con-artists seems to show that many people naively believe everything they read. Deducing authority, accuracy and relevance is hard to automate. Search engines assume that if many others link to a page then it must be valuable and so rank it higher. That approach cannot hold up under the relentless lahar of dubious data.

When working on the web you have to filter data manually. As the web grows, your role in the search filtration process will grow all-consuming, unless a revolutionary change is brought about. Search engines that rely on reputation face the dilution of the very data they work with. Web users can’t keep up with (or find) the flow of worthy data, so that data doesn’t get linked to, and the ranking algorithms the search engines use become useless as a judge of truth. Tim Berners-Lee has identified ‘trust’ assessment as a major component of his vision for Web 3.0. It’s hard to see how current search engines can contribute to his vision.

The difficulty with performing searches based on natural language search terms is that natural language allows many meanings for words, homonyms, synonyms and a wealth of nuance for each individual word in a search. There are twelve synonyms of the word ‘search’. Can they all be used interchangeably? The answer is no, they all mean something slightly different, and so probably should return a different result set. The space of all possible meanings grows enormously as the complexity of the search increases. The average relevance of each search result goes down relative to the complexity of a query. Automatically clarifying a search query is a challenge that will not be solved by 2010, but just stating the problem points to some interesting solutions based on present day technologies.

Why is search important?

Consider the trend – we are conducting more and more of our professional and private lives online. We have the communications protocols to allow any connected individual to converse with any other. We have the means to share our thoughts with everyone else in the world who cares to know them. As a computer programmer, I spend a significant portion of my working life searching for information relevant to the problem I am trying to solve. Things that prevent me from achieving my goals are therefore a direct cost to my clients and to me. They reduce my productivity and increase the cost to my clients of getting new software developed. It is hard to overestimate the cost to an economy of search engine inefficiencies.

Inefficient search technologies engender cognitive dissonance – just at the point that we need information we are forced to context switch into a battle of wits with a search engine. It prevents us from entering the flow. The knock-on effect of stopping us from being able to smoothly transition between the data we have and the data we need is another economic cost. It reduces our job satisfaction, forces us to perform below peak and frustrates us. Our morale is weakened by the poor search technologies we have to work with!

Semantic Search

So much for the problems. What of the solutions? How can search be improved to lessen or eliminate the costs to our productivity and fun? The solution exists in three parts – preparing a query, performing it and presenting the results. In addition we need to solve the problem of how to prepare (or reveal) data so that search engines can query it. The issue of data and query formats falls broadly within the category of “semantic web” technologies. Next-generation data formats are undoubtedly a necessary component of future search, but they are not sufficient – semantic web query languages are beyond the reach of everyday users.

I have endeavored to bring semantic search into the domain of the commercial programmer with my LinqToRdf project. What I have done, though, is to replace one formal language with another. For semantic search to become a commercial reality on the desktop, we must devise user interfaces that hide or simplify the searching experience. Google already provides ways to build complex queries that filter my results better. Yet, for most searches (of which I must perform dozens every day) I stick to keyword searches. Most users will never use the advanced features of an engine like Google. To reach them, we need a way to produce semantically meaningful queries to match the semantically meaningful data that will be out there on the web. How can we do that? I think the answer lies in context.

Assuming that the search engine doesn’t understand a word that you say when you type in a sequence of search keywords, it makes sense that the relevance of the results will decline as the number of possible misinterpretations increases. The English language, or any other natural language for that matter, is full of so many nuances and alternate meanings for each significant word that the space of possible interpretations for each word increases dramatically. As a consequence, the chances of a search engine being able to deliver relevant results are bound to decline at a similar rate. In fact, the problem is worse than that. The rate of decline of relevance is going to be the product of the probabilities of getting the sense of each word right for each word in the query. If there are five ‘senses’ to each of two words in a search string, then the chances of you getting the right result are going to be one in twenty five. A long shot. There are various heuristics that can be used to shorten the odds. But in the end, you’re never going to consistently get decent results out of a machine that doesn’t understand you or the documents that it has indexed. Enter the semantic web.

The semantic web is all about providing a means to be completely unambiguous about what you are talking about. The markup languages devised by the W3C and others are aimed at allowing content writers to attach clear and unambiguous meanings to terms. Specifically, they allow one to distinguish between the uses of a word by attaching a URL to the intended sense; because that URL is unique, the search term acquires a unique sense. Obviously, if terms are unique then queries against them can also be unique by referencing the term of interest. The adoption rates for semantic web technologies have not so far been meteoric, and I think that is as much to do with tool support as with any innate pessimism about reaching a consensus on how to model your data.

I think that HTML would have taken much longer to reach saturation point if there had been no WYSIWYG editors available. What is needed to make data on the web (and the queries about them) more meaningful is a revolutionary new interface to the semantic web schemas that are available for querying.

Perhaps what is required is a predictive-lookup extension to the good old textbox: an IntelliSense-style lookup that, as you type a term, offers up the available senses of that term from the search engine. Examples of this are on display in the text-entry boxes of Freebase. These senses can be derived from ontologies known to the search engine. If the context of what you’re working on allows it, the right ontology type can be preselected as you perform a search. Just as context provides the way to isolate the specific meaning of a word in a sentence or a query, so too does it provide a way for future search tools to generate intelligible queries.

The price of being able to make meaningful queries and get meaningful answers is that someone somewhere has to compose their data in a meaningful way. Meaning is a tricky word – here I mean complex structured data designed to adequately describe some domain. In other words someone has to write and populate an ontology for each domain that the users want to ask questions about. It’s painstaking, specialized work that not just anyone can do. Not even a computer scientist – whilst they may have the required analysis and design skills, they don’t have the domain knowledge or the data. Hence the pace of forward progress has been slow as those with the knowledge are unaware of the value of an ontology or the methods to produce and exploit it.

Compare this to the modus operandi of the big search companies. Without fail they all use some variant on full-text indexing. It should be fairly clear why as well – they require no understanding of the domain of a document, nor do their users get any guarantees of relevance in the result sets. Users aren’t even particularly surprised when they get spurious results. It just goes with the territory.

Companies that hope or expect to maintain a monopoly in the search space have to use a mechanism that provides broad coverage across any domain, even if that breadth is at the expense of accuracy or meaningfulness. Clearly, the semantic web and monolithic search engines are incompatible. Not surprising then that for the likes of Microsoft and Google the semantic web is not on their radar. They can’t do it. They haven’t got the skills, time, money or incentive to do it.

If the semantic web is to get much of a toehold in the world of search engines it is going to have to be as a confederation of small search engines produced by specialized groups that are formed and run by domain experts. In a few short years Wikipedia has come to rival the likes of Encyclopedia Britannica. The value of its crowd-sourced content is obvious. This amazing resource came about through the distributed efforts of thousands across the web, with no thought of profit. Likewise, it will be a democratized, decentralized, grass-roots movement that will yield up the meaningful information we all need to get a better web experience.

One promising direction in structured data is being taken by Metaweb (founded by Danny Hillis, of Thinking Machines fame) whose Freebase collaborative database is just becoming available to the general public. It uses types in a way similar to OWL, and has a very simple and intuitive search interface. It provides a straightforward means for creating new types and adding data based on them – critical for its data to flourish. My own experience in creating types to describe Algorithms and Data Structures was encouraging. I seeded the database with a few entries for the most common examples and was able to link those entries to textual descriptions found in Wikipedia easily. Within a few weeks the entries had been fleshed out and many more obscure algorithms had been added by others. If this group of Freebase types continues to gain attention, it may become a definitive list of algorithms on the web acting as a main entry point for those wishing to research potential solutions to their programming problems.

Perhaps islands of high quality structured information may become the favored web entry points. Users may trace links out of an entry in Freebase to areas where HTML anchors are all that they need to follow. Perhaps structured data can be used to disambiguate queries on conventional search engines – there may be many homonymous entries in Freebase, and reference to the specific Freebase type may help the search engine to filter its own results more effectively.

This crowd-sourced approach to creating structured data on the web is obviously going to help in the creation of isolated areas of excellence like Freebase and Wikipedia. Can it help to displace the mass of unstructured data on the web? I don’t think so. Our exponential outpouring of data will mostly be in the form of video and audio. Without sophisticated image and voice recognition systems, search engines of any type will not be able to index them. Metadata in RDF or OWL format is being used to describe the binary data out there on the web, but that is equivalent to indexing only the titles of documents while ignoring their content.

Microsoft and Google both seem ambivalent towards the Semantic Web. Unofficially, they claim that they don’t believe in the semantic web, but in recent months Microsoft has released media management products that are based on the W3C’s RDF and OWL standards. Microsoft claims that semantic web technologies overcome problems it could not solve using conventional techniques.

Embedded Search Tools

At the moment search engines are stand-alone. They are cut-off from the tools we are using them to gather data for. In my case, I spend most of my day inside a programmer’s IDE (integrated development environment) – like a glorified text editor for source code. When I’m in there I’m constantly faced with little puzzles that crop up to do with the program I’m writing. When I need to get a piece of reference information I have to come out of the IDE and call up a browser and perform the search in that before switching back to what I was doing. That context switching is distracting and I’d prefer the whole experience to be smoother. In addition to that, there is a whole bunch of contextual information that exists inside of the application that is (potentially) germane to the search I’m doing.

Embedding the internet search facilities inside of general purpose applications such as my IDE provides a wealth of information to the search engine with which to automatically filter the results or to target the search. This extra context enhances the accuracy of results but in the final accounting it is just extending the lifespan of the free text search paradigm, without properly bridging the gap between meaningful and meaningless queries and data. To go beyond the limiting technologies in use at the moment we will have to search for new representations of data and use those as the context in which we ask our questions.

For example, the current documentation that we write for our source code is rendered as HTML and contains links to other classes and systems that we use. If the documentation that came with code were derived from an ontology, and it linked to other ontologies describing the systems our code uses and interacts with, then the documentation could be mined as a source of context information with which to produce a semantic search query.

Imagine that you are an engineer and are producing a quote for one of your clients. Imagine that you needed to put the price for a gross of sprockets into the invoice. To do that you need the price from one of your supplier’s online catalogues. The fact that your suppliers are referenced in previous invoices should allow an embedded search engine to query the supplier’s catalogue (an ontology) using semantic web technologies. The context is derived from a mix of internal knowledge of what task you are performing, and what pattern of relationships you have with suppliers that helps you to fulfil orders.

Embedded search engines will rely on the pervasiveness of structured data both inside and outside of the enterprise. Without it, they will have just as hard a time making sense of a query as any general-purpose search engine.

Statistical Models and Search

A viable short term solution to the problem of injecting meaning into a search query and matching it with the meaning in a document is to exploit data mining. Data mining can be used to help make predictions of the meaning that is likely to be attached to a term by a user. In just the same way that Amazon looks at what books a user has, and what books they click on, and how they purchase books, so too can a modern search engine gather data about what you search for, and what results you clicked on from what was offered. In essence, there is very little difference between the task performed by the Amazon web site and that performed by an Internet search engine.

A store like Amazon is a vast collection of data items that can be searched against. Each user that searches has a particular ‘best result’ in mind, and when they find it they select it (by buying it, viewing it or adding it to a wish list or shopping cart). A search engine is used in the same way – in this case the results are given away, but the dynamics are the same: users determine from the metadata provided whether a given result is a close enough match for what they were after. If it is, then they will click on it. Each time a user clicks on a link, they are providing some information back to the search engine that can be used to tune that user’s searches in future. Search engines can build models from this feedback to prioritize the results that are offered the next time that user (or one like them) searches.

Needless to say, the data generated in this kind of system would be truly gigantic. As the internet grows, this metadata will grow at an exponential rate. The only way to avoid this kind of exponential curve is to increase the information density of the data on the net, or to increase the hit rate and the efficiency of the process of finding results. Again, the practicalities of dealing with data sets as big as the web point to the need for some fundamental revolution in data storage and search.

Geospatial Search

When you perform a search on the internet you are diving headfirst into a wide sea of data. The data you are after is like an island in that sea, but you might dive in a long way from where it is. The place where you dive in is critical to your success, and geospatial search may provide the solution by providing a natural and unambiguous point of entry into the data space. Much of the data that we place on the web has a natural home location, and that can be used to bring relevant data to the fore.

Rolf Landauer once said ‘information is physical’. Information cannot be had about anything other than physical things, and it cannot be represented on anything other than physical objects. Information itself is the structure of the universe. Although we seldom think of it that way, search technology is really an extension of the senses. It allows us to peer into other places and times. All information is tied to specific geospatial and temporal locations. Mostly, we are not in the habit of taking much notice of these ambient contexts, which are often the only concrete links between facts in the world.

Some years ago, I was sat daydreaming in a coffee shop near where I lived. While looking out of the window at some old shops across the road I wondered what would happen if I was able to call up the information related to those shops just because I was sat near them. Each shop was inhabited by a business, each business was but one of a long line of inhabitants, and each inhabitant would have a huge network of public and private relationships that in some sense defines them as an entity.

I imagined that I had a GPS enabled device that made use of a protocol similar to the DNS domain naming service. DNS divides web sites into a partly spatial and partly logical or commercial hierarchy. The process of searching for the web site of one of those businesses would involve sending a query up the line from the domain I was on, till the query got to a DNS server that knew how to route the query to the DNS servers for the host of the website I was after. A spatial query engine would add another level of searching over the top of DNS, mapping spatio-temporal coordinates onto the conventional DNS naming system.

I wondered whether a means of search based on geographical location might not be more sensible. Imagine that the search results were prioritized according to physical and temporal distance from where and when I was. This would be an alternative name resolution strategy that would give me the locations on record for the shops across the road. From there I could navigate the local patterns of the publicly accessible relations for the entities that I had found.

If I requested the spatial search, the first result on the list should be the coffee shop I was in (plus any other companies or individuals on other floors of the building). Following that would come any entities nearby, in order of distance. Perhaps I could point my handheld device at the shop (say it’s an art gallery) and a digital compass might deduce the direction I was pointing and filter the results to just that sector, out to a certain distance. The gallery is a business, and as such it is (in the UK at least) obliged by law to file its accounts with the government organization called Companies House. These links should be accessible, since the gallery would (by virtue of renting space in the real world) have a claim on the virtual space of the search engine. All that is required to get the company accounts would be for them to be online and indexed to a specific company that has a registry entry in the spatial search engine.

I then imagined that getting the data for the gallery opened up a path to a whole bunch of other information that flowed out into other sites. Perhaps images were available for all the works of art on sale. Perhaps from there I could navigate to the sites of the artists that I liked. Perhaps I liked one of the pictures the gallery had on display. Perhaps I finished my coffee, and crossed the road and took a look? This kind of entry point might make a difference to the number of passers-by that went into the gallery. The spatial search entry point to the web gives way to conventional web browsing, without any of the intermediate head scratching.

This spatial search can of course tie in with the new generation of GPS-enabled multi-purpose mobile devices that are now coming onto the market. It could also dovetail nicely with existing spatial search engines such as Google Maps or Live Earth. Spatial search creates a direct and meaningful structure in the virtual space that easily correlates to our real space. Spatial addresses may well provide a more powerful entry point into the mass of data that we are trying to negotiate than a mere keyword search. The problem with text indexing and keyword searches is that they carry too many possible meanings for what you were after, and the relevance of the results decreases in proportion to the number of possible interpretations. GPS at least is precise and unambiguous, and soon to be available in every phone…

Object Modeling is Vocabulary Design

Andrew Cantos raised some interesting philosophical points in reply to my partially tongue-in-cheek post The Great Domain Model Debate – Solved the other day. As ever, my short reply turned into a blog post, and this is it. Andrew’s point was that there is a metaphorical link between objects in a domain model and elementary particles in some physical system. The ability of these elements to take part in the wider system is often a function of their sheer simplicity rather than their being loaded with complex properties. He used oxygen as an example: something that can take part in many reactions, but which does not define or characterize those reactions. I extended the metaphor to observe that the same holds true when comparing Anemic Domain Models with their Rich brethren.

I like his metaphor. The metaphor I tend to use when I think about this issue is related to human languages. Words are poor carriers of meaning on their own, in the same sense that rich objects are poor carriers of business functionality. A word’s specific value comes within the dynamic context of a sentence – that is, its exact meaning and value can only be resolved when it is composed together with others in a richer context.

Likewise, the same happens in an OO system – the analogue of the ‘sentence’ here is the thread of execution, or the transaction script, or what have you. They give meaning to the data carried by the anemic object. Without that context the object is worthless. What an RDM seeks to do is carry with the object the full set of possible contexts. It also seeks to restrict that set of contexts to a manageable size.

I can sympathize with that aim – ADMs do little to guarantee that they get used right; RDMs do. However, I think that, as with a science, an enterprise system needs to have a commonly agreed shared vocabulary. With that, a greater richness of communication becomes possible. If, however, you were restricted in the ways you could use those words, you might have greater precision, but communication would become a chore, and you probably wouldn’t bother.

You can extend this whole ‘enterprise vocabulary’ metaphor even further. If you look at a typical, poorly coded or governed system, you will often see a situation where there is a multitude of little DTOs, all of which contain roughly the same data, but just those fields that are needed to service a given page or screen. This is analogous to the situation in an immature science where there are many words freighted with slightly different meanings. Confusion results when a speaker intends something different from what the listener understands. So too in software, where the lack of a commonly agreed object model serves to add confusion to the development process, and to increase the likelihood of errors creeping into a system.

What does this imply? It seems to me that the right approach (assuming the metaphor holds true) is that there ought to be a well-defined, definitive and widely shared object model within an organization. All systems should use it, and the organization should mint new classes with great care and forethought. This of course ties in with the efforts of various groups in the Semantic Web area, who are attempting to do just that in domains as widely flung as life sciences and disaster relief. The fact that the efforts are inter-organizational means that there will be less tolerance for poorly advised ‘pragmatism’.

Which can only be a good thing in the long run. Right?

LinqToRdf – Designing a Query Provider

When I started implementing the SPARQL support in LINQ to RDF, I decided that I needed to implement as many of the standard query operators as possible. SPARQL is a very rich query language that bears a passing syntactic resemblance to SQL. It didn’t seem unreasonable to expect most of the operators of LINQ to SQL to be usable with SPARQL. With this post I thought I’d pass on a few design notes that I have gathered during the work to date on the SPARQL query provider.

The different components of a LINQ query get converted into successive calls to your query class. My class is called SparqlQuery<T>. If you had a query like this:

[TestMethod]
public void SparqlQueryOrdered()
{
    string urlToRemoteSparqlEndpoint = @"http://someUri";
    TripleStore ts = new TripleStore();
    ts.EndpointUri = urlToRemoteSparqlEndpoint;
    ts.QueryType = QueryType.RemoteSparqlStore;
    IRdfQuery<Track> qry = new RDF(ts).ForType<Track>(); 
    var q = from t in qry
        where t.Year == 2006 &&
        t.GenreName == "History 5 | Fall 2006 | UC Berkeley" 
        orderby t.FileLocation
        select new {t.Title, t.FileLocation};
    foreach(var track in q){
        Trace.WriteLine(track.Title + ": " + track.FileLocation);
    }        
}

This would roughly equate to the following code, using the extension method syntax:

[TestMethod]
public void SparqlQueryOrdered()
{
    ParameterExpression t;
    string urlToRemoteSparqlEndpoint = @"http://someUri";
    TripleStore ts = new TripleStore();
    ts.EndpointUri = urlToRemoteSparqlEndpoint;
    ts.QueryType = QueryType.RemoteSparqlStore;
    var q = new RDF(ts).ForType<Track>()
        .Where<Track>(/*create expression tree*/)
        .OrderBy<Track, string>(/*create  expression tree*/)
        .Select(/*create expression tree*/);
    foreach (var track in q)
    {
        Trace.WriteLine(track.Title + ": " + track.FileLocation);
    }
}

The Where, OrderBy and Select method calls in the listing above (shown in bold red in the original post) are the succession of calls to the query’s CreateQuery method. That might not be immediately obvious from looking at the code – in fact it’s downright unobvious! There’s compiler magic going on here that you don’t see. Anyway, what happens is that you end up getting a succession of calls into your IQueryable<T>.CreateQuery() method, and that’s what we are mostly concerned with when creating a query provider.

The last time I blogged about the CreateQuery method, I gave a method with a switch statement that identified the origin of the call (i.e. Where, OrderBy, Select or whatever) and dispatched the call off to be immediately processed. I now realise that that is not the best way to do it. If I try to create my SPARQL query in one pass, I will not have much of a chance to perform optimization or disambiguation. If I generated the projection before I knew which fields were important, I would probably end up having to get everything back and filter on receipt of all the data. I think Bart De Smet was faced with that problem with LINQ to LDAP (LDAP doesn’t support projections), so he had to get everything back. SPARQL does support projections, and that means I can’t generate the SPARQL query string till after I know, from the Select call, what to get back.

My solution to this is to keep all the expressions passed into CreateQuery in a dictionary, so that I can use them all together in the call to GetEnumerator. That gives me the chance to do any amount of analysis on the expression trees I’ve got stored before I convert them into a SPARQL query. The CreateQuery method now looks like this:

protected Dictionary<string, MethodCallExpression> expressions;
public IQueryable<S> CreateQuery<S>(Expression expression)
{
    SparqlQuery<S> newQuery = CloneQueryForNewType<S>();

    MethodCallExpression call = expression as MethodCallExpression;
    if (call != null)
    {
        Expressions[call.Method.Name] = call;
    }
    return newQuery;
}

This approach helps because it makes it much simpler to start adding the other query operators.

I’ve also been doing a fair bit of tidying up as I go along. The query construction code called from GetEnumerator now reflects the major grammatical components of the SPARQL spec for SELECT queries.

private void CreateQuery(StringBuilder sb)
{
    if(Expressions.ContainsKey("Where"))
    {
        // first parse the where expression to get the list of parameters to/from the query.
        StringBuilder sbTmp = new StringBuilder();
        ParseQuery(Expressions["Where"].Parameters[1], sbTmp);
        //sbTmp now contains the FILTER clause so save it somewhere useful.
        Query = sbTmp.ToString();
        // now store the parameters where they can be used later on.
        queryGraphParameters.AddAll(Parser.Parameters);
    }
    CreateProlog(sb);
    CreateDataSetClause(sb);
    CreateProjection(sb);
    CreateWhereClause(sb);
    CreateSolutionModifier(sb);
}

The if statement checks whether the query had a where clause. If it did, it parses it, creating the FILTER expression and, in the process, gathering some valuable information about which members of T were referenced in the where clause. This information is useful for other tasks, so it gets done in advance of creating the WHERE clause.
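
For completeness, the enumeration side fits together roughly like this – a simplified sketch rather than the library’s actual code, in which ExecuteQuery and DeserialiseResults are hypothetical stand-ins for the plumbing that presents the SPARQL string to the endpoint and turns the result bindings back into objects:

// Sketch only: ExecuteQuery and DeserialiseResults are invented names.
public IEnumerator<T> GetEnumerator()
{
    StringBuilder sb = new StringBuilder();
    CreateQuery(sb);  // assembles prolog, dataset clause, projection, WHERE and solution modifiers
    foreach (T item in DeserialiseResults(ExecuteQuery(sb.ToString())))
        yield return item;
}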

Creating A LINQ Query Provider

As promised last time, I have extended the query mechanism of my little application with a LINQ Query Provider. I based my initial design on the method published by Bart De Smet, but have extended that framework, cleaned it up and tied it in with the original object deserialiser for SemWeb (a semantic web library by Joshua Tauberer).

In this post I’ll give you some edited highlights of what was involved. You may recall that last post I provided some unit tests that I was working with. For the sake of initial simplicity (and to make it easy to produce queries with SemWeb’s GraphMatch algorithm) I restricted my query language to conjunction and equality. Here’s the unit test that I worked with to drive the development process. What I produced last time was a simple scanner that went through my podcasts extracting metadata and creating objects of type Track.

[TestMethod]
public void QueryWithProjection()
{
    CreateMemoryStore();
    IRdfQuery<Track> qry = new RdfContext(store).ForType<Track>();
    var q = from t in qry
            where t.Year == 2006 &&
                  t.GenreName == "History 5 | Fall 2006 | UC Berkeley"
            select new { t.Title, t.FileLocation };
    foreach (var track in q)
    {
        Trace.WriteLine(track.Title + ": " + track.FileLocation);
    }
}

This method queries the Tracks collection in an in-memory triple store loaded from a file in N3 format. It searches for any UC Berkeley podcasts produced in 2006, and performs a projection to create a new anonymous type containing the title and location of the files.

I took a leaf from the book of LINQ to SQL to create the query object. In LINQ to SQL you indicate the type you are working with using a Table<T> class. In my query context class, you identify the type you are working with using a ForType<T>() method. This method instantiates a query object for you, and (in future) will act as an object registry to keep track of object updates.

The RdfContext class is very simple:

public class RdfContext : IRdfContext
{
    public Store Store
    {
        get { return store; }
        set { store = value; }
    }
    protected Store store;

    public RdfContext(Store store)
    {
        this.store = store;
    }

    public void AcceptChanges()
    {
        throw new NotImplementedException();
    }

    public IRdfQuery<T> ForType<T>()
    {
        return new RdfN3Query<T>(store);
    }
}

As you can see, it is pretty bare at the moment. It maintains a reference to the store, and instantiates query objects for you. But in future this would be the place to create transactional support, and perhaps maintain connections to triple stores. By and large, though, this class will be pretty simple in comparison to the query class that is to follow.

I won’t repeat all of what Bart De Smet said in his excellent series of articles on the production of LINQ to LDAP. I’ll confine myself to this implementation, and how it works. So we have to start by creating our Query Object:

public class RdfN3Query<T> : IRdfQuery<T>
{
    public RdfN3Query(Store store)
    {
        this.store = store;
        this.originalType = typeof (T);
        parser = new ExpressionNodeParser<T>();
    }

First it stores a reference to the triple store for later use. In a more real-world implementation this might be a URL or connection string, but for the sake of this implementation we can be happy with the memory store that is used in the unit test. Next we keep a record of the original type that is being queried against. This is important because later on you may also be dealing with a new anonymous type created by the projection, and that type will not have any of the Owl*Attribute classes with which to work out URLs for properties and to perform deserialisation.

The two most important methods in IQueryable<T> are CreateQuery and GetEnumerator. CreateQuery is the place where LINQ feeds you the expression tree that it has built from your initial query. You must parse this expression tree and store the resultant query somewhere for later use. I created a string called Query to keep that in, and created a class called ExpressionNodeParser to walk the expression tree to build the query string. This is equivalent to the stage where the SQL SELECT query gets created in DLINQ. My CreateQuery looks like this:

public IQueryable<TElement> CreateQuery<TElement>(Expression expression)
{
    RdfN3Query<TElement> newQuery = new RdfN3Query<TElement>(store);
    newQuery.OriginalType = originalType;
    newQuery.Project = project;
    newQuery.Properties = properties;
    newQuery.Query = Query;
    newQuery.Logger = logger;
    newQuery.Parser = new ExpressionNodeParser<TElement>(
        new StringBuilder(parser.StringBuilder.ToString()));
    MethodCallExpression call = expression as MethodCallExpression;
    if (call != null)
    {
        switch (call.Method.Name)
        {
            case "Where":
                Log("Processing the where expression");
                newQuery.BuildQuery(call.Parameters[1]);
                break;
            case "Select":
                Log("Processing the select expression");
                newQuery.BuildProjection(call);
                break;
        }
    }
    return newQuery;
}

You create a new query because you may be doing a projection, in which case the type you are enumerating over will not be the original type that you put into ForType<T>(); instead it may be the anonymous type from the projection. You transfer the vital information over to the new query object, and then handle the expression that has been passed in. I am handling two methods here: Where and Select. There are others I could handle, such as OrderBy or Take, but that will have to wait for a future post.

Where is the part where the expression representing the query is passed in. Select is passed the tree representing the projection (if there is one). The work is passed off to BuildQuery and BuildProjection accordingly. These names were gratefully stolen from LINQ to LDAP.

BuildQuery in LINQ to LDAP is a fairly complicated affair, but in LINQ to RDF I have pared it right down to the bone.

private void BuildQuery(Expression q)
{
    StringBuilder sb = new StringBuilder();
    ParseQuery(q, sb);
    Query = Parser.StringBuilder.ToString();
    Trace.WriteLine(Query);
}

We create a StringBuilder that can be passed down into the recursive-descent tree walker to gather the fragments of the query as each expression gets parsed. The result is then stored in the Query property of the query object. BuildProjection looks like this:

private void BuildProjection(Expression expression)
{
    LambdaExpression le = ((MethodCallExpression)expression).Parameters[1] as LambdaExpression;
    if (le == null)
        throw new ApplicationException("Incompatible expression type found when building a projection");
    project = le.Compile();
    MemberInitExpression mie = le.Body as MemberInitExpression;
    if (mie != null)
        foreach (Binding b in mie.Bindings)
            FindProperties(b);
    else
        foreach (PropertyInfo i in originalType.GetProperties())
            properties.Add(i.Name, null);
}

Much of it is taken directly from LINQ to LDAP. I have adapted it slightly because I am targeting the May 2007 CTP of LINQ. I’ve done this only because I have to use VS 2005 during the day, so I can’t use the March 2007 version of Orcas.

ParseQuery is used by BuildQuery to handle the walking of the expression tree. Again that is very simple since most of the work is now done in ExpressionNodeParser. It looks like this:

private void ParseQuery(Expression expression, StringBuilder sb)
{
    Parser.Dispatch(expression);
}

Parser.Dispatch is a gigantic switch statement that passes off the expression tree to handler methods:

public void Dispatch(Expression expression)
{
    switch (expression.NodeType)
    {
        case ExpressionType.Add:
            Add(expression);
            break;
        case ExpressionType.AddChecked:
            AddChecked(expression);
            break;
        case ExpressionType.And:
            And(expression);
            break;
        case ExpressionType.AndAlso:
            AndAlso(expression);
            //...

Each handler method then handles the root of the expression tree, breaking it up and passing on what it can’t handle itself. For example, the method AndAlso just takes the left and right side of the operator and recursively dispatches them:

public void AndAlso(Expression e)
{
    BinaryExpression be = e as BinaryExpression;
    if (be != null)
    {
        Dispatch(be.Left);
        Dispatch(be.Right);
    }
}

The equality operator is the only operator that currently gets any special effort.

public void EQ(Expression e)
{
    BinaryExpression be = e as BinaryExpression;
    if (be != null)
    {
        MemberExpression me = be.Left as MemberExpression;
        ConstantExpression ce = be.Right as ConstantExpression;
        QueryAppend(tripleFormatStringLiteral,
            InstancePlaceholderName,
            OwlClassSupertype.GetPropertyUri(typeof(T), me.Member.Name),
            ce.Value.ToString());
    }
    MethodCallExpression mce = e as MethodCallExpression;
    if (mce != null && mce.Method.Name == "op_Equality")
    {
        MemberExpression me = mce.Parameters[0] as MemberExpression;
        ConstantExpression ce = mce.Parameters[1] as ConstantExpression;
        QueryAppend(tripleFormatStringLiteral,
            InstancePlaceholderName,
            OwlClassSupertype.GetPropertyUri(typeof(T), me.Member.Name),
            ce.Value.ToString());
    }
}

The equality expression can be formed either through the use of a binary expression with NodeType.EQ or as a MethodCallExpression on op_Equality for type string. If the handler for the MethodCallExpression spots op_Equality it passes the expression off to the EQ method for it to render instead. EQ therefore needs to spot which type of Node it’s dealing with to know how to get the left and right sides of the operation. In a BinaryExpression there are Right and Left properties, whereas in a MethodCallExpression these will be found in a Parameters collection. In our example they get the same treatment.

You’ll note that we assume that the left operand is a MemberExpression and the right is a ConstantExpression. That allows us to form clauses like this:

where t.Year == 2006

but it would fail on all of the following:

where t.Name.ToUpper() == "SOME STRING"
where t.Name == t.Other
where t.Year.ToString() == "2006"

Each of these cases will have to be handled individually, so the number of cases we need to handle can grow. As Bart De Smet pointed out, some of the operations might have to be performed after retrieval of the results since semantic web query languages are unlikely to have complex string manipulation and arithmetic functions. Or at least, not yet.

The QueryAppend forms an N3 Triple out of its parameters and appends it to the StringBuilder that was passed to the Parser initially. At the end of the recursive tree walk, this string builder is harvested and preprocessed to make it ready to pass to the triple store. In my previous post I described an ObjectDeserialisationsSink that was passed to SemWeb during the query process to harvest the results. This has been reused to gather the results of the query from within our query.
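
QueryAppend itself is little more than a formatted append onto that shared StringBuilder. My reconstruction of it would be something along these lines (not the library’s actual code; the triple format shown in the comment is a guess):

// Hypothetical reconstruction: formats an N3 triple and appends it to the
// StringBuilder shared with the parser. tripleFormatStringLiteral might look
// something like "?{0} <{1}> \"{2}\" ."
private void QueryAppend(string tripleFormat, params object[] args)
{
    this.StringBuilder.AppendFormat(tripleFormat, args);
    this.StringBuilder.AppendLine();
}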

I mentioned earlier that the GetEnumerator method was important to IQueryable. An IQueryable is a class that can defer execution of its query till someone attempts to enumerate its results. Since that’s done using GetEnumerator the query must be performed in GetEnumerator. My implementation of GetEnumerator looks like this:

IEnumerator<T> IEnumerable<T>.GetEnumerator()
{
    if (result != null)
        return result.GetEnumerator();
    query = ConstructQuery();
    PrepareQueryAndConnection();
    PresentQuery(query);
    return result.GetEnumerator();
}

result is the List<TElement> variable where I cache the results for later use. What that means is that the query only gets run once. Next time the GetEnumerator gets called, result is returned directly. This reduces the cost of repeatedly enumerating the same query. Currently the methods ConstructQuery, PrepareQueryAndConnection, and PresentQuery are all fairly simple affairs that exist more as placeholders so that I can reuse much of this code for a LINQ to SPARQL implementation that is to follow.

As you’ve probably worked out, there is a huge amount of detail that has to be attended to, but the basic concepts are simple. The reason why more people haven’t written LINQ query providers before now is simply the fact that there is no documentation about how to do it. When you try, though, you may find it easier than you thought.

There is a great deal more to do to LINQ to RDF before it is ready for production use, but as a proof of concept that semantic web technologies can be brought into the mainstream it serves well. The reason why we use ORM systems such as LINQ to SQL is to help us overcome the impedance mismatch that exists between the object and relational domains. An equally large mismatch exists between the object and semantic domains. Tools like LINQ to RDF will have to overcome that mismatch in order to be used outside of basic domain models.

Using RDF and C# to create an MP3 Manager – Part 3

Last time I hurriedly showed you how you can perform the next step of converting a triple store into an ORM system of sorts. The purpose of all this activity, and the reason I left off blogging about LINQ, was that I am working on a system to allow me to use LINQ with a triple store and reasoner. The benefit of doing so is that we should get a lot of the benefits of the modern relational world with the power and connectivity of the new world of the semantic web. In this post I shall outline the steps required to start working with LINQ to RDF (as I’ve chosen to call it through lack of imagination).

I’ve been using test driven development throughout, so I already have a few ‘integration’ unit tests to show you:

[TestMethod]
public void Query()
{
  string urlToRemoteSparqlEndpoint = "http://localhost/MyMusicService/SparqlQuery.ashx";
  RdfContext<Track> ctx = new RdfSparqlContext<Track>(urlToRemoteSparqlEndpoint);
  var titles = from t in ctx
    where t.Year > 1998 &&
    t.GenreName == "Ambient Techno" ||
    t.GenreName == "Chillout"
    select t.Title;
  foreach(string title in titles)
    Console.WriteLine(title);
}

In English, this means that rather than manipulating a local triple store, I want the RdfSparqlContext to compose a SPARQL query and present it to the query endpoint found at location urlToRemoteSparqlEndpoint. I then want it to deserialise the results returned and store them in titles. This is a nice mixture of new features from .NET 3.5, combined with some of the features I’ve already developed for object deserialisation.
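
To make that concrete, the LINQ query above would need to come out the other end as a SPARQL SELECT along roughly these lines (illustrative only – the mus: prefix and predicate names are invented; the real URIs come from the Owl* attributes on the Track class):

// Illustrative only: roughly the SPARQL the provider needs to emit for the query above.
string expectedSparql = @"
PREFIX mus: <http://example.org/music#>
SELECT ?Title WHERE {
    ?t mus:year ?Year ;
       mus:genreName ?GenreName ;
       mus:title ?Title .
    FILTER (?Year > 1998 && ?GenreName = 'Ambient Techno' || ?GenreName = 'Chillout')
}";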

With the example above I am concerned more with the querying aspect of LINQ (it being a query language n’ all!) but that is not much use to me in the world of transactional web development where I find myself mired for the moment, so we need full CRUD behaviour from this system. Here’s a unit test for object update.

[TestMethod]
public void Update()
{
  string urlToRemoteSparqlEndpoint = @"http://localhost/MyMusicService/SparqlQuery.ashx";
  RdfContext<Track> ctx = new RdfSparqlContext<Track>(urlToRemoteSparqlEndpoint);
  var q = from t in ctx
    where t.Year > 1998 &&
    t.GenreName == "Ambient Techno" ||
    t.GenreName == "Chillout"
    select t;
  foreach (Track t in q)
    t.Rating = 5;
  ctx.AcceptChanges();
}

Here, I’m getting a bunch of objects back from the triple store, modifying their Rating property and then asking for those changes to be stored. This follows the usage patterns for LINQ to SQL.

Satisfying these unit tests (plus ones for the rest of the CRUD behaviour, with support for N3, in-memory and RDBMS-based local triple stores) is what I’m aiming to complete – eventually. Here’s the general scheme for implementing querying using LINQ. It assumes that RDF data structures are taken unmodified from the SemWeb library.

  • Create a Query Object (I guess this would be our RdfContext class above)
  • Implement IQueryable<T> on it.
  • When the enumerable is requested, convert the stored expression tree into the target language
  • Present it to whatever store or endpoint is available,
  • Deserialise the results into objects
  • Yield the results (probably as they are being deserialised)

This is a very broad outline of the tasks. I’ll explain in a lot more depth in subsequent posts, as I tackle each step.