LINQ

Not another mapping markup language!

Kingsley Idehen has again graciously given LinqToRdf some much-needed link-love. He mentioned it in a post that was primarily concerned with the issues of mapping between the ontology, relational and object domains. His assertion is that LinqToRdf, being an offshoot of an ORM-related initiative, is reversing the natural order of mappings. He believes that in the world of ORM systems, the emphasis should be on mapping from the relational to the object domain.

I think that he has a point, but not for the reason he’s putting forward. I think that the natural direction of mapping stems from the relative richness of the domains being mapped. The impedance mismatch between the relational and object domains stems from (1) the implicitness of meaning in the relationships of relational systems, (2) the representation of relationships, and (3) type mismatches.

Because the object domain has greater expressiveness and explicit meaning in relationships, it has a ‘larger’ language than that expressible using relational databases. Relationships are still representable in the relational domain, but their meaning is implicit. For that reason you have to confine your mappings to those that can be represented in the target (relational) domain. In that sense you get a priority inversion that forces the lowest-common-denominator language to control what gets mapped.

The same form of inversion occurs between the ontological and object domains, only this time it is the object domain that is the lowest common denominator. OWL is able to represent such things as restriction classes, multiple inheritance and sub-properties that are hard or impossible to represent in languages like C# or Java. When I heard of the RDB2RDF working group at the W3C, I suggested (to thunderous silence) that they direct their attention to coming up with a general-purpose mapping ontology that could be used for performing any kind of mapping.

I felt that it would have been extremely valuable to have a standard language for defining mappings. Just off the top of my head I can think of the following places where it would be useful:

  1. Object/Relational Mapping Systems (O/R or ORM)
  2. Ontology/Object Mappings (such as in LinqToRdf)
  3. Mashups (merging disparate data sources)
  4. Ontology Reconciliation – finding intersects between two sets of concepts
  5. Data cleansing
  6. General purpose data access layer automation
  7. Data export systems
  8. Synchronization Systems (e.g. keeping systems like CRM and AD in sync)
  9. mapping objects/tables onto UIs
  10. etc

You can see that most of these are perennial real-world problems that programmers are ALWAYS having to contend with. Having a standard language (and API?) would really help with all of these cases.

I think such an ontology would be a nice addition to OWL or RDF Schema, allowing a much richer definition of equivalence between classes (or groups or parts of classes). Right now one can define a one-to-one relationship using the owl:equivalentClass property. It’s easy to imagine that two ontology designers might approach a domain from such orthogonal directions that they find it hard to define any conceptual overlap between entities in their ontologies. A much more complex language is required to allow the reconciliation of widely divergent models.

I understand that by focusing their attentions on a single domain they increase their chances of success, but what the world needs from an organization like the W3C is the kind of abstract thinking that gave rise to RDF, not another mapping markup language!


Here’s a nice picture of how LinqToRdf interacts with Virtuoso (thanks to Kingsley’s blog).

How LINQ uses LinqToRdf to talk to SPARQL stores


Wanted: Volunteers for .NET semantic web framework project

LinqToRdf* is a full-featured LINQ** query provider for .NET written in C#. It provides developers with an intuitive way to make queries on semantic web databases. The project has been going for over a year and it’s starting to be noticed by semantic web early adopters and semantic web product vendors***. LINQ provides a standardised query language, and through LinqToRdf it gives any developer a way into systems built on semantic web technologies. It will help those who don’t have the time to ascend the semantic web learning curve to become productive quickly.

The project’s progress and momentum need to be sustained to help it become the standard API for semantic web development on the .NET platform. For that reason I’m appealing for volunteers to help with the development, testing, documentation and promotion of the project.

Please don’t be concerned that all the best parts of the project are done. Far from it! It’s more like the foundations are in place, and now the system can be used as a platform to add new features. There are many cool things that you could take on. Here are just a few:

Reverse engineering tool
This tool will use SPARQL to interrogate a remote store to get metadata to build an entity model.

Tutorials and Documentation
The documentation desperately needs the work of a skilled technical writer. I’ve worked hard to make LinqToRdf an easy tool to work with, but the semantic web is not a simple field – if it were, there’d be no need for LinqToRdf after all. This task will require an understanding of LINQ, ASP.NET, C#, SPARQL, RDF, Turtle, and SemWeb.NET. It won’t be a walk in the park.

 

Supporting SQL Server
The SemWeb.NET API has recently added support for SQL Server, which has not yet been exploited inside LinqToRdf (although it may be easy to do). This task would also involve thinking about robust, scalable architectures for semantic web applications in the .NET space.

 

Porting LinqToRdf to Mono
LINQ and C# 3.0 support in Mono is now mature enough to make this a desirable prospect. Nobody’s had the courage yet to tackle it. Clearly, this would massively extend the reach of LinqToRdf, and it would be helped by the fact that some of the underlying components are developed for Mono by default.

 

SPARQL Update (SPARUL) Support
LinqToRdf provides round-tripping only for locally stored RDF. Support of SPARQL Update would allow data round-tripping on remote stores. This is not a fully ratified standard, but it’s only a matter of time.

 

Demonstrators using large scale web endpoints
There are now quite a few large scale systems on the web with SPARQL endpoints. It would be a good demonstration of LinqToRdf to be able to mine them for useful data.

 

These are just some of the things that need to be done on the project. I’ve been hoping to tackle them all for some time, but there’s just too much for one man to do alone. If you have some time free and you want to learn more about LINQ or the Semantic Web, there is not a better project on the web for you to join.  If you’re interested, reply to this letting me know how you could contribute, or what you want to tackle. Alternatively join the LinqToRdf discussion group and reply to this message there.

 

Thanks,

 

Andrew Matthews

 

* http://code.google.com/p/linqtordf

** http://msdn.microsoft.com/en-us/netframework/aa904594.aspx

*** http://virtuoso.openlinksw.com/Whitepapers/html/linqtordf/linqtordf1.htm

Functional Programming – lessons from high-school arithmetic

I’ll start out with an apology – it was only by writing this post, that I worked out how to write a shorter post on the same topic. Sometime I’ll follow this up with something less full of digressions, explorations or justifications. The topic of the post started out as ‘Closure‘. It then became ‘Closure plus Rules of Composition‘ and finally ended up as ‘functional programming – lessons from high school arithmetic‘. The point of this post is to explore the API design principles we can learn from rudimentary high school arithmetic.  You already know all the mathematics I’m going to talk about, so don’t be put off by any terminology I introduce – it’s only in there to satisfy any passing mathematicians. ;^]

The topic of this post is much more wide-ranging than the previous ones. The post will eventually get around to talking about API design, but I really started out just wanting to dwell on some neat ideas from philosophy or mathematics that just appealed to me. The first idea is ‘Closure‘.

Closure has many meanings, but the two most relevant to this blog are:

  • A function that is evaluated together with an environment of bound variables
  • A set that is closed under some operation.

It’s this second meaning that I want to explore today – it’s one of those wonderful philosophical rabbit holes that lead from the world of the mundane into a wonderland of deeply related concepts. As you’ll already know if you’ve read any of my earlier posts on functional programming, I am not a mathematician. This won’t be a deep exposition on category theory, but I do want to give you a flavour so that hopefully you get a sense of the depth of the concepts I’m talking about.

First let’s start with two little equations that seem to bear almost no information of interest:

(1)     1 + 1 = 2

and

(2)     2 – 1 = 1

(1) involves adding two natural numbers to get another natural number. (2) involves subtracting one natural number from another to get a third natural number. They seem to be very similar, except for the fact that if you keep repeating (2) you eventually get a negative number, which is not a natural number. If you repeatedly perform addition, you can go on forever. That property is called ‘closure‘. It means that if you perform addition on any natural numbers you are guaranteed to get a valid result – another natural number. That closure guarantee for some operations is one of the first things I want you to ponder – some operations give you guarantees of object validity, while others don’t. We need to learn how to spot those ideas.

Another interesting thing that some introspection reveals about equation (2) is that the set from which it takes its values is bounded in one direction, and that at the lower bound is a value that is idempotent for the operation. That term idempotent is daunting to us non-mathematicians, but what it means is simply that when the operation is performed the result remains unchanged, no matter how many times it gets performed. Here’s another thing that is worth pondering – some operations are stable because they guarantee not to change your state.

Digression. Why on earth would anyone ever waste their time writing code that was designed at the outset to do nothing? It seems like the ultimate exercise in futility. The answer is that idempotent operations are not doing nothing in the presence of ‘rules of combination’. With rules of combination (of which more later), idempotent operations become a useful tool in composing functions.

SubDigression: A rule of combination is a feature of a system allowing you to combine distinct entities of a domain to form a new entity. You can see how this relates to closure – on two levels, in fact. For example, when adding two integers:

  • The result of adding two integers is an integer. That’s closure on the set of integers.
  • The composition of two closed functions is itself closed. That’s closure at the level of functions on integers.

In other words, you can choose to provide closure at the level of domain objects, or on the functions that manipulate them. LINQ queries of type IQueryable<T> are a good example. You can combine two queries to get a sequence of T, thus providing domain-level closure. You can also combine IQueryables to create new IQueryables that also yield sequences of T. That’s functional closure. LINQ is closed on both levels: it’s closed at the level of the entities that it is retrieving, but it’s also closed at the level of the functions it uses to represent queries.

It’s that level of composability that gives LINQ its power. And finding those design principles that we can apply to our own APIs is the purpose of this post. Ponder this: we don’t often provide rules of combination in our object models. If we did, our systems would probably be more flexible. End of SubDigression
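To make those two levels concrete, here’s a minimal sketch using plain LINQ to Objects (the data and names are illustrative – nothing to do with LinqToRdf):

using System;
using System.Linq;

class ClosureDemo
{
    static void Main()
    {
        IQueryable<int> numbers = Enumerable.Range(1, 100).AsQueryable();

        // Functional closure: applying Where to an IQueryable<int> yields
        // another IQueryable<int>, which can be composed further.
        IQueryable<int> evens = numbers.Where(n => n % 2 == 0);
        IQueryable<int> bigEvens = evens.Where(n => n > 50);

        // Domain closure: enumerating the composed query still yields ints.
        foreach (int n in bigEvens.Take(3))
            Console.WriteLine(n);   // 52, 54, 56
    }
}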

Several years ago I produced a graphics library for producing montages in a telepathology app. The system used a scripted generator to produce a tree of graphics operations. Each node on the tree manipulated an image then passed it on to its children. Without an idempotent operation it would have been extremely difficult to add orthogonal operations  (like comms, or microscope operations) or to bind together trees, or to create a default root of an operation graph.

The point of this outer digression is that there are plenty of cases where, at first sight, idempotence seems like architectural overkill. When you have rules of combination, you find that idempotent operations complete the puzzle, making everything just click together. While the idempotent operation does nothing, it creates a framework on which other operations can be composed. Ponder this: sometimes targeting an architectural abstraction might seem overkill, but if done wisely it can yield great simplicity and flexibility. If you don’t believe this – play with LINQ a little. End of Digression.

If these were properties that only applied to natural numbers under addition or subtraction then they wouldn’t be worth a blog post. It’s the fact that this is a pattern that can be observed in other places that makes them worth my time writing about, and your time reading. Let’s stay with integers a bit longer, though:

(3)     2 * 2 = 4

(4)     1 * 2 = 2

You probably noticed right away that the number 1 is idempotent in (4). We could keep multiplying by 1 till the cows come home and we’d always get 2. Now, I’m not setting out to explore the idea of idempotence. The reason I’m mentioning it is that it is an important property of an algebraic system. Closure is another. When you multiply two integers together you get another integer – that’s closure.

Just as addition has its inverse in the form of subtraction, so too does multiplication have an inverse in the form of division. Take a look at this:

(5)     4 / 2 = 2

(6)     1 / 2 = 0.5

In (6), the result is not an integer. As an interesting aside – the history of mathematics is littered with examples where new branches of mathematics were formed when non-closed operations were performed that led to awkward results. The process of creating a closed version of an operation’s inverse led mathematicians to create new mathematical structures with new capabilities, thus extending mathematics’ reach. The non-closure of subtraction (the inverse of addition) led to the introduction of the integers over the natural numbers. The non-closure of the division operation (the inverse of multiplication) led to the introduction of the rational numbers over the integers. And the non-closure of the square root operation (the inverse of squaring) led to the introduction of the irrational numbers. On many occasions through history the inverse of an everyday closed operation has led to the expansion of the space of possible data types. Ponder that – attempting to produce data structures on which the inverses of closed operations are also closed can lead to greater expressivity and power. A bit of a mouthful, that, and probably not universally true, but it’s something to ponder.
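If you want to see that non-closure in a couple of lines of C#, here’s a tiny illustration (nothing to do with LinqToRdf, just the arithmetic):

using System;

class ClosureOfInverses
{
    static void Main()
    {
        // Division, the inverse of multiplication, is not closed over the integers:
        int i = 1 / 2;          // 0   - the true answer falls outside the set of integers
        decimal d = 1m / 2m;    // 0.5 - widening to a 'larger' set of numbers closes the operation
        Console.WriteLine("{0}, {1}", i, d);
    }
}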

Again, if that were all there were to the idea, I (as a programmer) probably wouldn’t bother to post about it – I’d leave it to a mathematician. But that is not the limit of closure. Closure has been recognized in many places other than mathematics – from physics to philosophy, and from API design to language design. Let’s describe an algebraic system in the abstract to isolate what it means to be closed. The simplest mathematical structure that fits my description is called a Magma:

(7)     A Magma is any set M equipped with a binary function M × M → M

This kind of thing is known to mathematicians as an Algebraic Structure. There are LOTS of different kinds, but that’s one digression I’m not going to make. One thing to notice is that closure is built into this most basic of algebraic structures. What M × M → M means is that if you apply the operation ‘×’ to two values from M you get another value from M. By that definition, division doesn’t qualify as a Magma if the set M is the integers, but it does if the set is the rational numbers (setting aside division by zero).

(8)     2 + 3 + 5 = 10

(9)     (2 + 3) + 5 = 10

(10)     2 + (3 + 5) = 10

Taken together, these equations demonstrate what is known as associativity: how you group the additions makes no difference to the result. If you add that requirement to the definition of a Magma, you get what is called a semigroup. The integers with addition have that property of associativity, so they count as a semigroup.

(11)     2 – 3 – 5 = -6

(12)     (2 – 3) – 5 = -6

(13)     2 – (3 – 5) = 4

Clearly the subtraction operation on the integers is not associative, so it doesn’t qualify to be called a semigroup.  Try this on for size – associative operations are inherently flexible and composable. Abelson and Sussman even went so far as to say that designing systems with such properties was a better alternative to the traditional top-down techniques of software engineering.

We saw earlier that some operations have an element that leaves the result unchanged – an identity element. If the Magma has an identity element, then it is called a unital Magma. The point here is to highlight the other properties that operations can have (and how they contribute to membership of an algebraic structure). The key properties are:

  • Closure
  • Associativity
  • Identity
  • Invertibility

I’m going to throw a code snippet in at this point. If you’re a programmer with no particular interest in algebra, you might be wondering what on earth this has to do with programming:

var q = from u in MyDataContext.Users
        where u.Name.StartsWith("And")
        select u;

var q2 = from v in q
         where v.Name.EndsWith("rew")
         select v;

Here’s an example modelled on LINQ to SQL. Take a look at the ‘where’ keyword. It is clearly closed, since applying where to a query yields another query (regardless of whether it gives you any useful results). The example is also associative, since you can reverse the order of the clauses and the resulting set will be the same. LINQ has an identity as well – a filter like “.Where(t => true)” does nothing. LINQ lacks an inversion operation, so you can’t add a clause and then cancel it out with another – if you tried to do that, you’d get either no results or everything. Here’s something to ponder – would LINQ be more or less powerful if it had the property of invertibility? It’s clearly possible (though probably extremely difficult to implement).

I started thinking about these ideas because I wanted to understand why LINQ is powerful. It’s flexible and easy to understand because of the  mathematical ‘structure’ of the standard query operations. Ponderable: is any API going to be more powerful and flexible (and less brittle) if it displays some of the properties of an algebraic structure?

What are the advantages of creating APIs that have group structure? Just because we could design an API that has a group structure does not mean that we must. There must be an advantage to doing so. So far I have only hinted at those advantages. I now want to state them directly. If “we can regard almost any program as the evaluator for some language“[r], then we can also regard some languages as a more effective representation of a domain than others.  For many years, I’ve felt that the object oriented paradigm was the most direct and transparent representation of a domain. At the same time, I also felt there was something lacking in the way operations on an OO class work (in a purely procedural approach).

To cleanly transition the state of a class instance to another state, you (should) frequently go to extreme lengths[r] to keep the object in a consistent state. This kind of practice is avoided in those cases where it is feasible to use immutable objects, or more importantly to design your API so that the objects passed around might as well be immutable. Consider a class in C++ that implements the + operator. You could implement the operator in two ways:

  1. add the value to the right to this, and then return this.
  2. create a temporary object, add the value on the right to it and return the temporary object.

The following pseudo-code illustrates the point by imagining a class that supports “operator +”:

MyClass a = new MyClass();
MyClass b = new MyClass();
MyClass c = new MyClass();
MyClass d = a + b + a + c;

If you implement ‘+’ using technique 1, the result in d is (2a + 2b + c), whereas if you implement it using technique 2 the result in d is correctly (2a + b + c). Can you work out where the extra a and b come from? The state, being mutable, is modified incorrectly as the addition operators execute – and a itself gets corrupted along the way. The operands of an operation like ‘+’ should be unaffected by the fact that they took part in the assignment of a value to d. Something else to ponder: immutable objects or operations can make it easier to produce clean APIs that work with the language to create a correct answer.
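Here’s a minimal C# sketch of technique 2 – the operator returns a new instance and never mutates its operands (MyClass here is an illustrative stand-in, not code from any of the libraries discussed):

// A minimal sketch of technique 2: the operator never mutates its operands,
// it returns a new value instead.
public sealed class MyClass
{
    public int Value { get; }
    public MyClass(int value) { Value = value; }

    // Create and return a new instance; neither operand changes.
    public static MyClass operator +(MyClass left, MyClass right)
    {
        return new MyClass(left.Value + right.Value);
    }
}

With this in place, a + b + a + c leaves a, b and c untouched, and d gets the correct value of 2a + b + c.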

You might complain that what I’m aiming for here is a programming style that uses mathematical operators to implement what would be otherwise done using normal methods. But you’d be missing the point. Whether your method is called ‘+’ or if it’s called ‘AddPreTaxBenefits’ is irrelevant. The structure of the operation, at the mathematical level, is the same. And the same principles can apply.

The method signature of a closed method is T × T → T. There are plenty of methods that don’t fit this model. Let’s pick one that pops into my mind quite readily – bank account transactions:

void Transfer(Account debit, Account credit, decimal sumToTransfer); 

There is an entity in here that does fit the bill for such group-like operations – Money. There are endless complexities in financial transactions between currencies – currency conversion, exchange rates, and catching rounding errors – but the point is that it makes sense to implement group operators on currency values. That ability allows you to define a language of currencies that can be exploited by a higher-level item of functionality – the Account. BTW: I’m not talking about the algebraic structure of addition on decimals; I’m talking about adding locale-specific money values – a different thing.

void Transfer(Account debit, Account credit, Currency sumToTransfer)
{
    debit.Balance = debit.Balance - sumToTransfer;
    credit.Balance = credit.Balance + sumToTransfer;
}

Would it be better to define the operation on the Account class itself? The operator might actually be updating the internal balance property, but we don’t care about that.

void Transfer(Account debit, Account credit, Currency sumToTransfer)
{
    debit = debit - sumToTransfer;
    credit = credit + sumToTransfer;
}
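Before checking those criteria, here’s a minimal sketch of what such a Currency value type might look like (illustrative only – not LinqToRdf code, and real money handling would need currency codes, rounding rules and so on):

// Illustrative sketch: a single-currency value type with group-like operators.
public struct Currency
{
    public decimal Amount { get; }
    public Currency(decimal amount) { Amount = amount; }

    // the identity element for '+'
    public static readonly Currency Zero = new Currency(0m);

    // closed: Currency + Currency -> Currency
    public static Currency operator +(Currency a, Currency b)
    {
        return new Currency(a.Amount + b.Amount);
    }

    // the inverse of '+'
    public static Currency operator -(Currency a, Currency b)
    {
        return new Currency(a.Amount - b.Amount);
    }
}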

Let’s take a look and see whether operator ‘+’ fits the criteria we defined earlier for group-like structures:

Closed If you take an account and you add a value to it, you get another valid account, so yes, this is closed.

Associative Yes – though I’m not sure what that would mean in terms of bank accounts. Double entry bookkeeping is kinda binary…

Identity OK - The identity element of the ‘+’ operation on accounts is the zero currency value.

Inverse operation Easy – subtraction, or adding the negative currency value. Do you allow negative currency values? That’s incompatible with double-entry bookkeeping, so it might not be possible to provide an inverse operator for some systems. There’s an example of where trying to get an inverse could lead you to change the design of your system.

This approach passes the criteria, but it also highlights a conceptual layer difference between money types and bank account types that makes for an awkward API if you try to treat them as equivalent. From a design perspective you can see that if there are non-obvious rules about how you can combine the elements of your class, you’re no better off than with a conventional API design. One thing that does occur to me, though, is that the inconclusive group structure here pointed to a mismatch of levels. The addition operator applies at the level of quantities of cash – account balances. Accounts are more than just balances, and attempting to make them behave like they were nothing more than a number highlights the pointlessness of doing so. Ponder this: the concept of ‘levels’ may be something that arises naturally out of the algebraic structure of the entities in a system? I’m not sure about this yet, but it’s an intriguing idea, don’t you think?

Obviously, we could have expected group structure at the level of balances, since we’re dealing with real numbers that are a known group under addition and subtraction. But what about higher level classes, like bank accounts? What are the operations we can perform on them that fits this structure?

I wasn’t sure whether I’d come away with any conclusions from this post, but I did come away with some very suggestive ideas to ponder:

  • Some operations give you guarantees of object validity. As a programmer, you need to learn how to spot them.
  • Some operations are preferable because they guarantee not to change your state.
  • Providing rules of combination in our object models would probably make them more flexible.
  • Sometimes abstraction might seem overkill, but if used wisely it can yield great simplicity and flexibility. If you don’t believe this – play with LINQ a little.
  • Producing data structures on which the inverses of closed operations are also closed can lead to greater expressivity and power.
  • Associative operations are inherently flexible and composable.
  • Maybe all APIs would be more expressive and flexible (and less brittle) if they displayed some of the properties of an algebraic structure?
  • Immutable objects or operations can make it easier to produce clean APIs that work with the language to create a correct answer.
  • Trying to get an inverse for an operation could lead you to change the design of your system.
  • The concept of ‘levels’ may be something that arises naturally out of the algebraic structure of the entities in a system.

It’s funny that these ideas flow simply from looking at high-school algebra, especially since some of them read like a functional-programming manifesto. But, hopefully, you’ll agree that some of them have merit. They’re just thoughts that have occurred to me from trying to understand an offhand comment by Erik Meijer about the relationship between LINQ and monads. Perhaps I’ll pursue that idea some more in future posts, but for now I’ll try to keep the posts coming more frequently.

Announcing LinqToRdf 0.3 and LinqToRdf Designer 0.3

The third release of LinqToRdf has been uploaded to GoogleCode. Go to the project web site for links to the latest release.

LinqToRdf Changes:
- support for SPARQL type casting
- numerous bug fixes
- better support for identity projections
- more SPARQL relational operators
- latest versions of SemWeb & SPARQL Engine, incorporating recent bug fixes and enhancements of each of them

I have also released a new graphical designer to auto-generate C# entity models as well as N3 ontology specifications from UML-like designs. This new download is an extension to Visual Studio 2008 Beta 2, and should make working with LinqToRdf easier for those who are not that familiar with the W3C Semantic Web specifications.

Please let me know how you get on with them.

Page Rank 1 for LINQ

After about 17 months and about 32 posts (or 33 if you count this, which I don’t :) I finally got my LINQ postings to the top slot on Google. Thanks to Paul Stovell for letting me know. I’m not sure what made the difference – only a few months ago, I was on page 1,000,001 for queries like this. I’m sure it won’t last, especially when the LINQ documentation team start releasing their content out onto MSDN, so I’m indulging myself in a little back slapping while I can. Thanks to all those who visited, and linked to my posts – I hope I’ve helped to promote such an incredibly cool, elegant and worthwhile system.

Long May It Rule!

pagerankone

I shall be coming back to LINQ fairly soon with a series on creating a semantic web design tool (a Domain Specific Language, or DSL) for Visual Studio. I’ll be creating a DSL to allow me to create an OWL ontology and all the LinqToRdf code needed to work with it. I’ve just gotta read the book first. I may also be giving a short talk on the architecture of  LINQ query providers at the Victoria.NET User Group in the near future.

LinqToRdf now works on the Visual Studio 2008 Beta 2

I should have brought the code up to date weeks back – but other things got in the way. Still – all the unit tests are in the green.  And the code has been minimally converted over to the new .NET 3.5 framework. I say ‘minimally’ because with the introduction of beta 2 there is now an interface for IQueryProvider that seems to be a dispenser for objects that support IQueryable. I suspect that with IQueryProvider, there is now a canonical architecture that is recommended by the LINQ team. Probably that will mean moving more responsibility into the RDF<T> class away from the QuerySupertype.  Time (and more documentation from MS) will tell.

There are several new expression types that are not yet supported (such as the coalescing operator on nullable types) – it remains to be seen whether they are supportable in SPARQL at all. Further research required. The solution doesn’t currently support WIX – I’m not sure whether WIX 3 will work with 2008 yet. Again, more research required.  What that means is that there will not be any MSI releases produced till WIX supports the latest drop of VS.NET.

Enjoy – and don’t forget to give us plenty of feedback on your experiences.

To go to the Google Code project, click here.

Using Mock Objects When Testing LINQ Code

I was wondering the other day whether LINQ could be used with NMock easily. One problem with testing code that has not been written to work with unit tests is that if you test business logic, you often end up making multiple round-trips to the database for each test run. With a very large test suite, that can turn a few minutes’ work into hours. The best approach to this is to use mock data access components to dispense canned results, rather than going all the way through to the database.

After a little thought it became clear that all you have to do is override the IOrderedQueryable<T>.GetEnumerator() method to return an enumerator to a set of canned results and you could pretty much impersonate a LINQ to SQL Table (which is the IOrderedQueryable implementation for LINQ to SQL). I had a spare few minutes the other day while the kids were going to sleep and I decided to give it a go, to see what was involved.

I’m a great believer in the medicinal uses of mock objects. Making your classes testable using mocking enforces a level of encapsulation that adds good structure to your code. I find that the end results are often much cleaner if you design your systems with mocking in mind.

Let’s start with the class that you’ll be querying over in your code. This is the type that you are expecting to get back from your query.

public class MyEntity
{
    public string Name
    {
        get { return name; }
        set { name = value; }
    }

    public int Age
    {
        get { return age; }
        set { age = value; }
    }

    public string Desc
    {
        get { return desc; }
        set { desc = value; }
    }

    private string name;
    private int age;
    private string desc;
}

Now you need to create a new context object derived from the DLINQ DataContext class, but providing a new constructor function. You can create other ways to insert the data you want your query to return, but the constructor is all that is necessary for this simple example.

public class MockContext : DataContext
{
    #region constructors

    public MockContext(IEnumerable col):base("")
    {
        User = new MockQuery<MyEntity>(col);
    }
    // other constructors removed for readability
    #endregion
    public MockQuery<MyEntity> User;
}

Note that you are passing in an untyped IEnumerable rather than an IEnumerable<T> or a concrete collection class. The reason is that when you make use of projections in LINQ, the type gets transformed along the way. Consider the following:

var q = from u in db.User
        where u.Name.Contains("Andrew") && u.Age < 40
        select new {u.Age};

The result of db.User is an IOrderedQueryable<User> query class, which is derived from IEnumerable<User>. But the result that goes into q is an IEnumerable of some anonymous type created specially for the occasion. There is a step along the way where the IQueryable<User> gets replaced with an IQueryable<AnonType>. If I set the type on the enumerator of the canned results, I would have to keep track of them with each call to CreateQuery in my mock query class. By using IEnumerable, I can just pass it around till I need it, then enumerate the collection with a custom iterator, casting the items to whatever type I ultimately need as I go.

The query object also has a constructor that takes an IEnumerable, and it keeps that till GetEnumerator() gets called later on. CreateQuery and CloneQueryForNewType just pass the IEnumerable around till the time is right. GetEnumerator just iterates the collection in the cannedResponse iterator casting them to the return type expected for the resulting query.

public class MockQuery<T> : IOrderedQueryable<T>
{
    private readonly IEnumerable cannedResponse;

    public MockQuery(IEnumerable cannedResponse)
    {
        this.cannedResponse = cannedResponse;
    }

    private Expression expression;
    private Type elementType;

    #region IQueryable<T> Members

    IQueryable<S> IQueryable<T>.CreateQuery<S>(Expression expression)
    {
        MockQuery<S> newQuery = CloneQueryForNewType<S>();
        newQuery.expression = expression;
        return newQuery;
    }

    private MockQuery<S> CloneQueryForNewType<S>()
    {
        return new MockQuery<S>(cannedResponse);
    }
    #endregion

    #region IEnumerable<T> Members
    IEnumerator<T> IEnumerable<T>.GetEnumerator()
    {
        foreach (T t in cannedResponse)
        {
            yield return t;
        }
    }
    #endregion

    #region IQueryable Members
    Expression IQueryable.Expression
    {
        get { return System.Expressions.Expression.Constant(this); }
    }

    Type IQueryable.ElementType
    {
        get { return elementType; }
    }
    #endregion
}

For the sake of readability I have left out the required interface methods that were not implemented, since they play no part in this solution. Now let’s look at a little test harness:

class Program
{
    static void Main(string[] args)
    {
        MockContext db = new MockContext(GetMockResults());

        var q = from u in db.User
                where u.Name.Contains("Andrew") && u.Age < 40
                select u;
        foreach (MyEntity u in q)
        {
            Debug.WriteLine(string.Format("entity {0}, {1}, {2}", u.Name, u.Age, u.Desc));
        }
    }

    private static IEnumerable GetMockResults()
    {
        for (int i = 0; i < 20; i++)
        {
            MyEntity r = new MyEntity();
            r.Name = "name " + i;
            r.Age = 30 + i;
            r.Desc = "desc " + i;
            yield return r;
        }
    }
}

The only intrusion here is the explicit use of MockContext. In the production code that is to be tested, you can’t just go inserting MockContext where you would have used the SqlMetal generated context. You need to use a class factory that will allow you to provide the MockContext on demand in a unit test, but dispense a true LINQ to SQL context when in production. That way, all client code will just use mock data without knowing it.

Here’s the pattern that I generally follow. I got it from the Java community, but I can’t remember where:

class DbContextClassFactory
{
    // Made public and static so that test code can plant the mock context
    // before the code under test asks for one.
    public static class Environment
    {
        private static bool inUnitTest = false;

        public static bool InUnitTest
        {
            get { return inUnitTest; }
            set { inUnitTest = value; }
        }

        private static DataContext objectToDispense = null;

        public static DataContext ObjectToDispense
        {
            get { return objectToDispense; }
            set { objectToDispense = value; }
        }
    }

    // Dispense the planted mock when running under test, otherwise the real
    // LINQ to SQL context.
    public static DataContext GetDB()
    {
        if (Environment.InUnitTest)
            return Environment.ObjectToDispense;
        return new TheRealContext() as DataContext;
    }
}

Now you can create your query like this:

DbContextClassFactory.Environment.ObjectToDispense = new MockContext(GetMockResults());
var q = from u in DbContextClassFactory.GetDB() where ...

And your client code will use the MockContext if there is one, otherwise it will use a LINQ to SQL context to talk to the real database. Perhaps we should call this Mockeries rather than Mock Queries. What do you think?

GroupJoins in LINQ

OWL defines two types of property: DatatypeProperty and ObjectProperty. An object property links instances from two Classes, just like a reference in .NET between two objects. In OWL you define it like this:

<owl:ObjectProperty rdf:ID="isOnAlbum">
  <rdfs:domain rdf:resource="#Track"/>
  <rdfs:range rdf:resource="#Album"/>
</owl:ObjectProperty>

A DatatypeProperty is similar to a .NET property that stores some kind of primitive type like a string or an int. In OWL it looks like this:

<owl:DatatypeProperty rdf:ID="fileLocation">
  <rdfs:domain rdf:resource="#Track" />
  <rdfs:range  rdf:resource="&xsd;string"/>
</owl:DatatypeProperty>

The format is very much the same, but the task of querying for primitive types in LINQ and SPARQL is easy compared to performing a one-to-many query like a SQL join. So far, I have confined my efforts to DatatypeProperties, and tried not to think about ObjectProperties too much. But the time of reckoning has come – I’ve not got much else left to do on LinqToRdf except ObjectProperties.

Here’s the kind of LINQ join I plan to implement:

[TestMethod]
public void TestJoin()
{
    TestContext db = new TestContext(CreateSparqlTripleStore());
    var q = from a in db.Album 
            join t in db.Track on a.Name equals t.AlbumName into tracks
            select new Album{Name = a.Name, Tracks = tracks};
    foreach(var album in q){
        Console.WriteLine(album.Name);
        foreach (Track track in album.Tracks)
        {
            Console.WriteLine(track.Title);
        }
    }
}

This uses a GroupJoin to let me collect matching tracks and store them in a temporary variable called tracks. I then insert the tracks into the Tracks property on the album I’m newing up in the projection. I need to come up with a SPARQL equivalent syntax, and convert the expression passed for the join into that. SPARQL is a graph based query language, so I am going to be converting my requests into the usual SPARQL triple format, and then using the details from the NewExpression on the query to work out where to put the data when I get it back.

With the non-join queries I have been testing my query provider on, I have observed that for each syntactical component of the query I was passed an Expression tree, representing its contents. With a GroupJoin, you get one, and it contains everything you need to perform the query. My first quandary is over the process of converting this new expression structure into a format that my existing framework can understand. Below is a snapshot of the expression tree created for the join I showed above.

GroupJoin Expression contents

There are five parameters in the expression:

  1. The query object on the Album. That’s the “a in db.Album” part.
  2. The query object on the Track. The “t in db.Track” part.
  3. A lambda function from an album to its Name.
  4. A lambda function from a track to its AlbumName.
  5. A projection creating a new Album, and assigning the tracks collected to the Tracks collection on the newly created Album.

Parameters 1 & 2 are LinqToRdf queries that don’t need to be parsed and converted. I can’t just ask them to render a query for me, since they don’t have any information of value to offer me other than the OriginalType that they were created with. They have received no expressions filtering the kind of data that they return, and they’ll never have their results enumerated. These query objects are just a kind of clue for the GroupJoin about how to compose the query. They can tell it where the data that it’s looking for is to be found.

Here’s how I would guess the SPARQL query would look:

SELECT ?Name ?Title ?GenreName <snip> 
WHERE {
    _:a a a:Album .
    _:t a a:Track .
    _:a a:name ?Name.
    _:t a:albumName ?Name .
    OPTIONAL {_:t a: ?Title}
    OPTIONAL {_:t a: ?GenreName}
    <snip>
}

We can get the names for the blank nodes _:a and _:t from the parameter collections of the GroupJoin’s parameters 3 and 4 respectively. We know that we will be equating ?Name on _:a and ?Name on _:t, since those are the lambda functions provided and that’s the format of the join. The rest of the properties are included in OPTIONAL sections so that, if they are not present, they won’t stop the details of the OWL instance coming back. By using

    _:a a:name ?Name.
    _:t a:albumName ?Name .

we achieve the same effect as an equality constraint, since two things that are equal to the same thing are equal to each other. That restricts the tracks to those that are part of an album at the same time.

I’m not sure yet what I will do about the projection, since there is an intermediate task that needs to be performed: to insert the temporary variable ‘tracks’ into the Album object after it has been instantiated. More on that once I’ve found out more.

Designing a LINQ Query Provider

The process of creating a LINQ query provider is reasonably straightforward. Had it been documented earlier, there would have doubtless been dozens of providers written by now. Here’s the broad outline of what you have to do.

  1. Find the best API to talk to your target data store.
  2. Create a factory or context object to build your queries.
  3. Create a class for the query object(s).
  4. Choose between IQueryable<T> and IOrderedQueryable<T>.
  5. Implement this interface on the query class.
  6. Decide how to present queries to the data store.
  7. Create an Expression Parser class.
  8. Create a type converter.
  9. Create a place to store the LINQ expressions.
  10. Wrap the connecting to and querying of the data store.
  11. Create a result deserialiser.
  12. Create a result cache.
  13. Return the results to the caller.

What It Means

These steps provide you with a high-level guide to the problems you have to solve when creating a query provider for the first time. In the sections below I’ve tried to expand on how you will solve the problem. In many cases I’ve explained from the viewpoint I took when implementing LINQ to RDF. Specifically, that means my problem was to create a query provider that supported a rich textual query language communicated via an SDK, and retrieved results in a format that needed subsequent conversion back into .NET objects.

Find the best API to talk to your target data store.

Normally there is going to be some kind of API for you to request data from your data store. The main reason for creating a LINQ query provider is that the API reflects the underlying technology too much, and you want a fuller encapsulation of that technology. For instance, standard APIs in the semantic web space deal with triples and URIs. When you’re an object oriented developer, you want to be dealing with objects, not triples. That almost definitely means that there will be some kind of conversion process needed to deal with the entities of the underlying data store. In many cases there will be several APIs to choose between, and the choice you make will probably be due to performance or ease of interfacing with LINQ. If there is no overall winner, then prepare to provide multiple query types for all the ways you want to talk to the data store. :-)

Create a factory or context object to build your queries.

This class will perform various duties for you to help you keep track of the objects you’ve retrieved, and to write them back to the data store (assuming you choose to provide round-trip persistence). This class is equivalent to the Context class in LINQ to SQL. It can provide you with an abstract class factory to perform the other tasks, like creating type converters, expression translators, connections, command objects etc. It doesn’t have to be very complex, but it IS useful to have around.

In the case of LinqToRdf, I pass the class factory a structure that tells it where the triple store is located (local or remote, in-memory or persistent) and what query language to use to query it.
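As a very rough sketch of the sort of thing I mean (the names here are invented for illustration – they are not LinqToRdf’s actual types), a context object can start out as little more than this:

using System.Collections;
using System.Collections.Generic;

// A rough sketch of a context/factory object; names are illustrative only.
public class StoreContext
{
    // Where the store lives and how to talk to it (simple strings stand in
    // for real configuration types).
    public string StoreLocation { get; set; }
    public string QueryLanguage { get; set; }

    // Results cached per query, so repeated enumerations can be served locally.
    public Dictionary<string, IList> ResultsCache { get; } = new Dictionary<string, IList>();

    // Hand out query objects that keep a reference back to this context.
    public StoreQuery<T> ForType<T>()
    {
        return new StoreQuery<T>(this);
    }
}

// Placeholder query class so the sketch compiles; its real responsibilities
// are described in the following sections.
public class StoreQuery<T>
{
    public StoreContext Context { get; }
    public StoreQuery(StoreContext context) { Context = context; }
}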

Create a class for the query object(s).

This class is the brains of the operation, and is where the bulk of your work will be.

This is the first main step in the process of creating a query provider. You will have to implement one of the standard LINQ query interfaces on it, and either perform the query from this class, or use it to coordinate those components that will do the querying.

LINQ talks to this query class directly, via the CreateQuery method, so this is the class that will have to implement the IQueryable or IOrderedQueryable interface to allow LINQ to pass in the expression trees. Each grammatical component of the query is passed into CreateQuery in turn, and you can store that somewhere for later processing.

Choose between IQueryable<T> and IOrderedQueryable<T>.

This is a simple choice. Do you want to be able to order the results that you will be passing back? If so use IOrderedQueryable, and you will then be able to write queries using the orderby keyword. Declare your query class to implement the chosen interface.

Implement this interface on the query class.

Now you’ve decided which interface to use, you have to implement this interface on the query class  from point 3. Most of the work is in the CreateQuery and GetEnumerator methods.

CreateQuery gets called once for each of the major components of the query. So for a query like this:

var q = (from t in qry
    where t.Year == "2006" &&
    t.GenreName == "History 5 | Fall 2006 | UC Berkeley" 
    orderby t.FileLocation
    select new {t.Title, t.FileLocation}).Skip(10).Take(5);

Your query class will get called five times. Once each for the extension methods that are doing the work behind the scenes: Where, OrderBy, Select, Skip and Take. If you’re not aware of the use of Extension methods in the design of LINQ, go over to the LINQ project site on Microsoft and peruse the documents on the Standard Query Operators. The integrated part of LINQ is a kind of syntactic sugar that masks the use of extension methods to make successive calls on an object in a way which is more attractive than plain static calls.

My initial attempt at handling the expressions passed in through CreateQuery was to treat the whole process like a Recursive Descent compiler. Later on I found that to optimize the queries a little, I needed to wait till I had all of the expressions before I started processing them. The reason I did this is that I needed to know what parameters were going to be used in the projection (The Select part) before I could generate the body of the graph specification that is mostly based on the where expression.

Decide how to present queries to the data store.

Does the API use a textual query language, a query API or its own expression tree system? This will determine what you do with the expressions that get sent to you by LINQ. If it is a textual query language, then you will need to produce some kind of text from the expression trees in the syntax supported by the data store (like SPARQL or SQL). If it is an API, then you will need to interpret the expression trees and convert them into API calls on the data store. Lastly, if the data store has its own expression tree system, then you need to create a tree out of the LINQ expression tree that the data store will be able to convert or interpret on its own (like NHibernate).

SPARQL is a textual query language so my job was to produce SPARQL from a set of expression trees. Yours may be to drive an API, in which case you will have to work out how to invoke the methods on your API appropriately in response to the nodes of the expression tree.

Create an Expression interpreter class.

I found it easier to break off various responsibilities into separate classes. I did this for filter clause generation, type conversion, connections, and commands. I described that in my previous post, so I won’t go into much depth here. Most people would call this a Visitor class, although I think in terms of recursive descent (since that’s not patented). I passed down a StringBuilder with each recursive call to the Dispatch method on the expression translator. The interpreter inserts textual representations of the properties you reference in the query and the constant values they are compared against, and it appends textual representations of the operators supported by the target query language. If necessary, this is where you will use a type converter class to convert the format of any literals in your expressions.
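To give a feel for what that recursive Dispatch looks like, here’s a heavily simplified sketch. It handles only a couple of node types, and the output format is illustrative rather than LinqToRdf’s actual SPARQL generation:

using System.Linq.Expressions;
using System.Text;

// A simplified, illustrative expression interpreter in the recursive-descent style.
public class FilterWriter
{
    public void Dispatch(Expression e, StringBuilder sb)
    {
        switch (e)
        {
            case BinaryExpression b when b.NodeType == ExpressionType.AndAlso:
                sb.Append("(");
                Dispatch(b.Left, sb);
                sb.Append(" && ");
                Dispatch(b.Right, sb);
                sb.Append(")");
                break;
            case BinaryExpression b when b.NodeType == ExpressionType.Equal:
                sb.Append("(");
                Dispatch(b.Left, sb);
                sb.Append(" = ");
                Dispatch(b.Right, sb);
                sb.Append(")");
                break;
            case MemberExpression m:
                sb.Append("?" + m.Member.Name);      // a property becomes a query variable
                break;
            case ConstantExpression c:
                sb.Append("\"" + c.Value + "\"");    // a literal - a real provider would use the type converter here
                break;
        }
    }
}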

Create a type converter.

I had to create a type converter because there are a few syntactic conventions about use of type names in SPARQL. In addition, DateTime types are represented differently between SPARQL and .NET. You may not have this problem (although I bet you will) and if that’s so, then you can get away with a bit less complexity.

My type converter is just a hash table mapping from .NET primitives to XML Schema data types. In addition I made use of some custom attributes to allow me to add extra information about how the types should be handled. Here’s what the lookup table works with:

public enum XsdtPrimitiveDataType : int
{
    [Xsdt(true, "string")]
    XsdtString,
    [Xsdt(false, "boolean")]
    XsdtBoolean,
    [Xsdt(false, "short")]
    XsdtShort,
    [Xsdt(false, "int")]
    XsdtInt,

The XsdtAttribute is very simple, but provides a means, if I need it, to add more sophistication at a later date:

[AttributeUsage(AttributeTargets.Field)]
public class XsdtAttribute : Attribute
{
    public XsdtAttribute(bool isQuoted, string name)
    {
        this.isQuoted = isQuoted;
        this.name = name;
    }

    private readonly bool isQuoted;
    private readonly string name;
}

isQuoted allows me to tell the type converter whether to wrap a piece of data in double quotes, and the name parameter indicates what the type name is in the XML Schema data types specification. Your types will be different, but the principle will be the same, unless you are dealing directly with .NET types.

I set up the lookup table like this:

public XsdtTypeConverter()
{
    typeLookup.Add(typeof(string), XsdtPrimitiveDataType.XsdtString);
    typeLookup.Add(typeof(Char), XsdtPrimitiveDataType.XsdtString);
    typeLookup.Add(typeof(Boolean), XsdtPrimitiveDataType.XsdtBoolean);
    typeLookup.Add(typeof(Single), XsdtPrimitiveDataType.XsdtFloat);

That is enough for me to be able to do a one-way conversion of literals when creating the query.
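As an illustration of the kind of literal formatting involved (this is a self-contained sketch, not the converter class described above), the conversion boils down to something like this:

using System;
using System.Globalization;

// Illustrative only: turning .NET literals into typed SPARQL literals.
// The xsd:dateTime lexical form is the W3C one; the real wiring in
// LinqToRdf differs from this sketch.
public static class LiteralWriter
{
    public static string ToSparqlLiteral(object value)
    {
        switch (value)
        {
            case string s:   return "\"" + s + "\"^^xsd:string";
            case bool b:     return b ? "true" : "false";
            case int i:      return "\"" + i + "\"^^xsd:int";
            case DateTime d: return "\"" + d.ToString("yyyy-MM-ddTHH:mm:ss", CultureInfo.InvariantCulture) + "\"^^xsd:dateTime";
            default:         return "\"" + value + "\"";
        }
    }
}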

Create a place to store the LINQ expressions.

As I mentioned above, you may need to keep the expressions around until all calls into CreateQuery have been made. I used another lookup table to allow me to store them till the call to GetEnumerator.

protected Dictionary<string, MethodCallExpression> expressions;
public IQueryable<S> CreateQuery<S>(Expression expression){
    SparqlQuery<S> newQuery = CloneQueryForNewType<S>();
    MethodCallExpression call = expression as MethodCallExpression;
    if (call != null){
        newQuery.Expressions[call.Method.Name] = call;
    }
    return newQuery;
}

You may prefer to have named variables for each source of expression. I just wanted to have the option to gather everything easily, before I had provided explicit support for it.

Wrap the connecting to and querying of the data store.

This is a matter of choice, but if you wrap the process of connecting and presenting queries to your data store inside a standardized API, then you will find it easier to port your code to new types of data store later on. I found this when I decided that I wanted to support at least 4 different types of connectivity and syntax in LinqToRdf. I also chose to (superficially) emulate the ADO.NET model (Connections, Commands, CommandText etc). There was no real need to do this; I just thought it would be more familiar to those from an ADO.NET background. The emulation is totally skin-deep though, there being no need for transactions etc, and with LINQ providing a much neater way to present parameters than ADO.NET will ever have.
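The shape of that wrapper, as it appears in the RunQuery method below, is roughly this (a trimmed sketch – the real LinqToRdf interfaces carry more members):

using System.Collections.Generic;

// A sketch of the ADO.NET-flavoured wrapper used by RunQuery below.
public interface IRdfCommand<T>
{
    string CommandText { get; set; }
    IEnumerator<T> ExecuteQuery();
}

public interface IRdfConnection<T>
{
    IRdfCommand<T> CreateCommand();
}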

When you implement the IQueryable interface, you will find that you have two versions of GetEnumerator, a generic version and an untyped version. Both of these can be served by the same code. I abstracted this into a method called RunQuery.

protected IEnumerator<T> RunQuery()
{
    if (Context.ResultsCache.ContainsKey(GetHashCode().ToString()))
        return (IEnumerator<T>)Context.ResultsCache[GetHashCode().ToString()].GetEnumerator();
    StringBuilder sb = new StringBuilder();
    CreateQuery(sb);
    IRdfConnection<T> conn = QueryFactory.CreateConnection(this);
    IRdfCommand<T> cmd = conn.CreateCommand();
    cmd.CommandText = sb.ToString();
    return cmd.ExecuteQuery();
}

The first thing it does is look to see whether it’s been run before. If it has, then any results will have been stored in the Context object (see point 2) and they can be returned directly.

If there are no cached results, then it passes a StringBuilder into the CreateQuery method, which encapsulates the process of creating a textual SPARQL query. The query class also has a reference to a class called QueryFactory, created for it by the Context object. This factory allows the query to just ask for a service and get one that will work for the query type being produced. This is the Abstract Factory pattern at work, which is common in ORM systems and the like.

The IRdfConnection class that this gets from the QueryFactory encapsulates the connection method that will be used to talk to the triple store. The IRdfCommand does the same for the process of asking for the results using the SPARQL communications protocol.

ExecuteQuery does exactly what you would expect. One extra facility that is exploited is the ability of the IRdfCommand to store the results directly in the context so that we can check next time round whether to go to all this trouble.

I wrote my implementation of CreateQuery(sb) to conform fairly closely to the grammar spec of the SPARQL query language. Here’s what it looks like:

private void CreateQuery(StringBuilder sb)
{
    if (Expressions.ContainsKey("Where"))
    {
        // first parse the where expression to get the list
        // of parameters to/from the query.
        StringBuilder sbTmp = new StringBuilder();
        ParseQuery(Expressions["Where"].Parameters[1], sbTmp);
        // sbTmp now contains the FILTER clause so save it
        // somewhere useful.
        FilterClause = sbTmp.ToString();
        // now store the parameters where they can be used later on.
        if (Parser.Parameters != null)
            queryGraphParameters.AddAll(Parser.Parameters);
        // we need to add the original type to the prolog to allow
        // elements of the where clause to be optimised
        namespaceManager.RegisterType(OriginalType);
    }
    CreateProlog(sb);
    CreateDataSetClause(sb);
    CreateProjection(sb);
    CreateWhereClause(sb);
    CreateSolutionModifier(sb);
}

I’ve described this in more detail in my previous post, so I’ll not pursue it any further. The point is that this is the hard part of the provider, where you have to make sense of the expressions and convert them into something meaningful. For example the CreateWhereClause looks like this:

private void CreateWhereClause(StringBuilder sb)
{
    string instanceName = GetInstanceName();
    sb.Append("WHERE {\n");
    List<MemberInfo> parameters = new List<MemberInfo>(
        queryGraphParameters.Union(projectionParameters));
    if (parameters.Count > 0)
    {
        sb.AppendFormat("_:{0} ", instanceName);
    }
    for (int i = 0; i < parameters.Count; i++)
    {
        MemberInfo info = parameters[i];
        sb.AppendFormat("{1}{2} ?{3} ", instanceName,
            namespaceManager.typeMappings[originalType] + ":",
            OwlClassSupertype.GetPropertyUri(originalType, info.Name, true),
            info.Name);
        sb.AppendFormat((i < parameters.Count - 1) ? ";\n" : ".\n");
    }
    if (FilterClause != null && FilterClause.Length > 0)
    {
        sb.AppendFormat("FILTER(\n{0}\n)\n", FilterClause);
    }
    sb.Append("}\n");
}

 The meaning of most of this is specific to SPARQL and won’t matter to you, but you should take note of how the query in the string builder is getting built up piece by piece, based on the grammar of the target query language.
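To give a feel for the end product, the text assembled by these methods for a simple filtered, ordered query would come out roughly like this (purely illustrative; the prefix, blank node label and property names are invented for the example and are not what LinqToRdf necessarily emits):

PREFIX a: <http://example.org/music#>

SELECT ?Title ?FileLocation ?Year
WHERE {
_:t0 a:title ?Title ;
     a:fileLocation ?FileLocation ;
     a:year ?Year .
FILTER(
?Year = 2006
)
}
ORDER BY ?FileLocation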

Create a Result Deserialiser.

Whatever format you get your results back in, one thing is certain: you need to convert them back into .NET objects. SemWeb exposes the SPARQL result set as a sequence of VariableBindings, each of which maps the query's variables to their values. The deserialiser handles each one in its Add method:

public override bool Add(VariableBindings result)
{
    if (originalType == null)
        throw new ApplicationException("need a type to create");
    object t = Activator.CreateInstance(instanceType);
    foreach (PropertyInfo pi in instanceType.GetProperties())
    {
        try
        {
            string vn = OwlClassSupertype.GetPropertyUri(OriginalType, pi.Name).Split('#')[1];
            string vVal = result[pi.Name].ToString();
            pi.SetValue(t, Convert.ChangeType(vVal, pi.PropertyType), null);
        }
        catch (Exception e)
        {
            Console.WriteLine(e);
            return false;
        }
    }
    DeserialisedObjects.Add(t);
    return true;
}

InstanceType is the type defined by the projection in the Select expression. Luckily, LINQ will have created this type for you, and you can pass it (as a generic type parameter) to the deserialiser. The process is quite simple. In LinqToRdf, the following steps are performed:

  1. create an instance of the projected type (or the original type if using an identity projection)
  2. for each public property on the projected type
    1. Get the matching property from the original type (which has the OwlAttributes on each property)
    2. Lookup the RDFS property name used for the property we’re attempting to fill
    3. Lookup the value for that property from the result set
    4. Assign it to the newly created instance
  3. Add the instance to the DeserialisedObjects collection

The exact format your results come back in will be different, but again the principle remains the same: create the result object using the Activator, then fill each of its public properties with values from the result set. Repeat until all results have been converted to .NET objects.
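Stripped of the SemWeb specifics, the core of that loop looks something like this (a generic sketch that assumes the results arrive as a dictionary of variable names to string values, which is not how SemWeb actually hands them over):

private object CreateInstanceFrom(IDictionary<string, string> row, Type instanceType)
{
    // create the projected (or original) type via its default constructor
    object instance = Activator.CreateInstance(instanceType);
    foreach (PropertyInfo pi in instanceType.GetProperties())
    {
        string rawValue;
        if (!row.TryGetValue(pi.Name, out rawValue))
            continue; // no binding for this property in the result row
        // coerce the string value to the property's CLR type and assign it
        pi.SetValue(instance, Convert.ChangeType(rawValue, pi.PropertyType), null);
    }
    return instance;
}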

Create a Result Cache.

One advantage of being able to intercept calls to GetEnumerator is that you have the option to cache the results of the query, or to cache the intermediate query strings you used to get them. This is one of the great features of LINQ (and of ORM-style object queries generally).

In the case of semantic web applications we don't necessarily expect the data in the store to change frequently, so I have opted to store the .NET objects returned from the previous query (if there is one). I suspect I will revisit this decision, since with active data stores there is no guarantee that cached results will remain consistent. Even so, it is still a major time saving to be able to reuse the query string generated the first time round: in LinqToRdf using SPARQL, generating the query takes around 67ms. Admittedly, running it, including connection handling and deserialisation, takes a further 500ms against a small database, but there are further optimizations that can be added at a later date.
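As a rough illustration of the mechanics (the value type and fragment below are my assumptions, not necessarily what LinqToRdf does), the context can hold a dictionary keyed on the query's hash code, and the command stores the deserialised objects into it once the query has run:

// on the context: one entry per previously executed query
public Dictionary<string, System.Collections.IList> ResultsCache =
    new Dictionary<string, System.Collections.IList>();

// in the command, after deserialisation has completed
string cacheKey = query.GetHashCode().ToString();
context.ResultsCache[cacheKey] = deserialisedObjects;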

Return the Results to the Caller.

This is the last stage. Just get the results that you stored in the Context and return an enumerator from the collection. If you have the luxury of using cursors or some other kind of incremental retrieval from the data store, then you will want to consider using a custom iterator to deserialise objects on the fly.
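If you do go the incremental route, C# iterators make it painless. A sketch, assuming hypothetical FetchRows and Deserialise helpers that are not part of LinqToRdf:

public IEnumerator<T> GetEnumerator()
{
    // deserialise each row as the caller asks for it, rather than all up front
    foreach (VariableBindings row in FetchRows())
    {
        yield return Deserialise(row);
    }
}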

LinqToRdf – Designing a Query Provider

When I started implementing the SPARQL support in LinqToRdf, I decided that I needed to implement as many of the standard query operators as possible. SPARQL is a very rich query language that bears a passing syntactic resemblance to SQL, so it didn't seem unreasonable to expect most of the operators of LINQ to SQL to be usable with SPARQL. With this post I thought I'd pass on a few design notes that I have gathered during the work to date on the SPARQL query provider.

The different components of a LINQ query get converted into successive calls to your query class. My class is called SparqlQuery<T>. If you had a query like this:

[TestMethod]
public void SparqlQueryOrdered()
{
    string urlToRemoteSparqlEndpoint = @"http://someUri";
    TripleStore ts = new TripleStore();
    ts.EndpointUri = urlToRemoteSparqlEndpoint;
    ts.QueryType = QueryType.RemoteSparqlStore;
    IRdfQuery<Track> qry = new RDF(ts).ForType<Track>(); 
    var q = from t in qry
        where t.Year == 2006 &&
        t.GenreName == "History 5 | Fall 2006 | UC Berkeley" 
        orderby t.FileLocation
        select new {t.Title, t.FileLocation};
    foreach(var track in q){
        Trace.WriteLine(track.Title + ": " + track.FileLocation);
    }        
}

This would roughly equate to the following code, using the extension method syntax:

[TestMethod]
public void SparqlQueryOrdered()
{
    ParameterExpression t;
    string urlToRemoteSparqlEndpoint = @"http://someUri";
    TripleStore ts = new TripleStore();
    ts.EndpointUri = urlToRemoteSparqlEndpoint;
    ts.QueryType = QueryType.RemoteSparqlStore;
    var q = new RDF(ts).ForType<Track>()
        .Where<Track>(/*create expression tree*/)
        .OrderBy<Track, string>(/*create expression tree*/)
        .Select(/*create expression tree*/);
    foreach (var track in q)
    {
        Trace.WriteLine(track.Title + ": " + track.FileLocation);
    }
}

The Where, OrderBy and Select method calls in that chain become a succession of calls to the query's CreateQuery method. That might not be immediately obvious from looking at the code; in fact it's downright unobvious! There's compiler magic going on here that you don't see. What happens is that you end up getting a succession of calls into your IQueryable<T>.CreateQuery() method, and that's what we are mostly concerned with when creating a query provider.

The last time I blogged about the CreateQuery method, I showed a method with a switch statement that identified the origin of the call (i.e. Where, OrderBy, Select or whatever) and dispatched it off to be processed immediately. I now realise that that is not the best way to do it. If I try to create my SPARQL query in one pass, I won't have much of a chance to perform optimisation or disambiguation. If I generated the projection before I knew which fields were important, I would probably end up having to fetch everything back and filter on receipt of the data. I think Bart De Smet was faced with that problem with LINQ to LDAP (LDAP doesn't support projections), so he had to get everything back. SPARQL does support projections, and that means I shouldn't generate the SPARQL query string until after I know, from the Select call, what needs to come back.

My solution to this is to keep all of the calls into CreateQuery in a dictionary, keyed by the query operator name, so that I can use them all together in the call to GetEnumerator. That gives me the chance to do any amount of analysis on the stored expression trees before I convert them into a SPARQL query. The CreateQuery method now looks like this:

protected Dictionary<string, MethodCallExpression> expressions;
public IQueryable<S> CreateQuery<S>(Expression expression)
{
    SparqlQuery<S> newQuery = CloneQueryForNewType<S>();

    MethodCallExpression call = expression as MethodCallExpression;
    if (call != null)
    {
        Expressions[call.Method.Name] = call;
    }
    return newQuery;
}

This approach helps because it makes it much simpler to start adding the other query operators.

I've also been doing a fair bit of tidying up as I go along. My GetEnumerator method now reflects the major grammatical components of the SPARQL spec for SELECT queries.

private void CreateQuery(StringBuilder sb)
{
    if(Expressions.ContainsKey("Where"))
    {
        // first parse the where expression to get the list of parameters to/from the query.
        StringBuilder sbTmp = new StringBuilder();
        ParseQuery(Expressions["Where"].Parameters[1], sbTmp);
        //sbTmp now contains the FILTER clause so save it somewhere useful.
        Query = sbTmp.ToString();
        // now store the parameters where they can be used later on.
        queryGraphParameters.AddAll(Parser.Parameters);
    }
    CreateProlog(sb);
    CreateDataSetClause(sb);
    CreateProjection(sb);
    CreateWhereClause(sb);
    CreateSolutionModifier(sb);
}

The if clause checks whether the query had a Where clause. If it did, it parses it, creating the FILTER expression and, in the process, gathering some valuable information about which members of T were referenced in the where clause. That information is useful for other tasks, so it gets collected in advance of creating the WHERE clause.
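For the example query near the top of this post, the FILTER text gathered at this point would end up as something like the following (illustrative only; the exact spelling and bracketing produced by the parser may differ):

FILTER(
(?Year = 2006) && (?GenreName = "History 5 | Fall 2006 | UC Berkeley")
)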