Month: July 2005

Truly abstracting a persistence mechanism

The initial design that I used when I made the ancestor of norm was based upon designs by Scott Ambler. The intent of Ambler's designs was to provide an abstraction around the logic of a relational database; what I want to do with norm is to abstract the very notion of a data structure. When we persist an object into a row in an RDB, it is almost irrelevant that the data is persistent. Of course it's not truly irrelevant, else why bother at all? What I mean is that the persistence of the data in an RDB is the irrelevant part; it's the concepts in use that make ORM such a complicated enterprise, because the relational notion of a relationship is subtly different from that used in the object domain. Therein lies the "object/relational impedance mismatch" that Ambler identifies in his hugely influential paper, "The Design of a Robust Persistence Layer for Relational Databases".

As you can see, the persistence mechanism is deliberately kept very simple, since there is little conceptual overlap between the APIs for a flat-file persistence store and a relational database. Even the notion of a connection means something different in each mechanism.
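As a sketch of how small that conceptual overlap is, something like the following interface might be all that the two kinds of store genuinely share. The names here are illustrative assumptions of mine, not norm's real API:

```python
# Hypothetical sketch: the lowest common denominator of a flat file
# and a relational database. Anything richer (joins, cursors, schemas)
# belongs to one mechanism but not the other.
from abc import ABC, abstractmethod
from typing import Any


class PersistenceMechanism(ABC):
    """Minimal operations shared by any persistence store."""

    @abstractmethod
    def open(self) -> None:
        """'Connection' means different things per store; here it is just open/close."""

    @abstractmethod
    def close(self) -> None: ...

    @abstractmethod
    def save(self, key: Any, obj: Any) -> None: ...

    @abstractmethod
    def load(self, key: Any) -> Any: ...

    @abstractmethod
    def delete(self, key: Any) -> None: ...
```

Everything store-specific (SQL generation, file layout) would live behind these five operations rather than leak into the mapping layer.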

So what I'm after is a way to bridge the gap between persistence stores so that mappings can be made between two different object models as easily as they can between object models and relational databases. What I'm wondering is whether Ambler's model is the right one to use for such an abstraction. My first task is to purge any domain pollution from the mapping system, the transaction manager and the command system.

My initial system was a very close parallel to Ambler's designs. But now I'm looking to diverge, in the hope of a cleaner conceptual model. What most ORM systems do is define an invertible function between the object domain and the relational domain. I propose to do the same thing, but I want to do it in a non-explicit way.

Normally the mapping is done by enumerating the domain set (usually the object domain), enumerating the range set (the relational model), and then defining the mappings between them. If you look closely at the mapping file formats for persistence mechanisms such as Torque, Hibernate, ObjectSpaces and norm's predecessor, they all follow this idiom: XML configuration files define the mapping, and an in-memory model serves as a runtime aid to the persistence broker as it builds its SQL commands.

That has to be the way of doing it ultimately, but I wonder whether we can't define the mapping in another way, rather like the difference between defining a set by enumerating all of its members and defining it through a generative function that maps onto the set. That is, rather than saying:

x = {2, 4, 6, 8}

we could say

x = {2i where i > 0 & i < 5}

It's definitely more complicated than explicitly enumerating the mappings, but it might make mappings easy to solve in cases like inheritance, where there are several candidate mappings that all work.
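To make the contrast concrete, here is a hypothetical sketch of the two styles side by side. The property names and the CamelCase-to-snake_case convention are my own illustrative assumptions, not norm's actual configuration format:

```python
import re

# Enumerated style: every property/column pair is listed explicitly,
# as the XML mapping files of Hibernate, Torque et al. do.
explicit_mapping = {
    "FirstName": "first_name",
    "LastName": "last_name",
    "DateOfBirth": "date_of_birth",
}


def conventional_mapping(property_name: str) -> str:
    """Generative style: a single rule (CamelCase -> snake_case)
    that produces any pair on demand, like the set {2i : 0 < i < 5}."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", property_name).lower()


# The rule generates the same mapping without enumerating it:
assert all(conventional_mapping(p) == c for p, c in explicit_mapping.items())
```

The appeal is that a rule covers properties that haven't been listed yet, and several competing rules could be tried against an inheritance hierarchy to find one that fits.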

To do this conceptual mapping we need to work out the key abstractions that define the mapping functions:

  • whole/part relationship
  • complex type
  • association
  • CRUD operation
  • is-a relationship

Each of these things is present in every representation that I am considering. They exist in RDBMSs, object databases, and XML documents (i.e. a flat file, kinda:) But how they are represented and realised is vastly different between each of these technologies.
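To make that concrete, here is a rough (and certainly not exhaustive) sketch of how each abstraction is conventionally realised in the three technologies. The realisations listed are the textbook ones, not a claim about any particular product:

```python
# Same five abstractions, different realisations per technology.
realisations = {
    "whole/part": {
        "rdbms": "foreign key with cascading delete",
        "oodb": "contained object",
        "xml": "nested child element",
    },
    "complex type": {
        "rdbms": "column group or separate table",
        "oodb": "class",
        "xml": "element with children",
    },
    "association": {
        "rdbms": "foreign key or join table",
        "oodb": "object reference",
        "xml": "ID/IDREF pair",
    },
    "CRUD operation": {
        "rdbms": "INSERT / SELECT / UPDATE / DELETE",
        "oodb": "persist and retrieve calls",
        "xml": "DOM node operations",
    },
    "is-a": {
        "rdbms": "one of several table-per-hierarchy/class schemes",
        "oodb": "class inheritance",
        "xml": "xsi:type or substitution group",
    },
}
```

A generative mapping system would presumably pick one realisation per concept per store, and derive the rest of the mapping from those choices.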

I wonder, and that's all I've done so far, whether, if we defined how the underlying concept is laid out in each case, we could do the mapping by specifying how that meaning is projected onto the concepts of the problem domain. Maybe I could perform the mapping by saying that this group of concepts is realised using this kind of mapping, and maybe the ORM can deduce the rest.

Of course, proper naming strategies in each domain dictate different names, and they are seldom adhered to consistently, so short of naming the attributes exhaustively there is no way of completing the mapping. So is it worth my time? Or am I just proposing a slight change of terminology so as not to give away the format of the persistence mechanism?

A sign of things to come…

Finding that we didn't have room for a drying machine, and figuring that Melbourne summers are long and hot and dry, we decided that a clothes horse or two would be a better short term investment. And, keen as she is, Kerry decided to give one a try.

Lots of baby clothes

It gives me the shivers – I see a future littered with drying, stinky babies' romper suits. Our neat and minimalist (and even quite stylishly modern, on a good day) existence is about to become very biological. Not that I mind – I also see myself becoming virtually American in my capacity for maudlin sentimentality about fatherhood.

Like I said – very biological.

The power of advertising

I've always been a believer in the power of good advertising, which is why I was so pleased to see this fantastic example of the art in Melbourne this week. It has it all – it has the doleful situation of the foolish dolt who has so willfully left himself in a state of baldness that is, as we can see by the "after" photo, so obviously reversible. I salute this masterpiece of modern advertising, and hope that television advertisers can learn from it.

Kerry has a blog (or two)

Kerry has entered the fray, and will soon be blurting forth her every whimsical thought for your delectation and amusement.

Support her – she's not just a mum, you know, she's a blogwife! Besides, when you're as big as she is, you can't divert yourself with gymnastics…

Brain modeling – first steps

The following appeared in KurzweilAI:

IBM and Switzerland's Ecole Polytechnique Federale de Lausanne (EPFL) have teamed up to create the most ambitious project in the field of neuroscience: to simulate a mammalian brain on the world's most powerful supercomputer, IBM's Blue Gene. They plan to simulate the brain at every level of detail, even going down to molecular and gene expression levels of processing.

Several things come to mind after the initial "coooooooooooooool!!!!!". The first is that they are contemplating a truly vast undertaking here. Imagine the kind of data storage and transmission capacity that would be required to run that sort of model. Normally when considering this sort of thing, AI researchers produce an idealised model where the physical structure of a neuron is abstracted into a cell in a matrix that is able to represent the flow of information in the brain in a simplified way. What these researchers are suggesting is that they will model the brain in a physiologically authentic way. That would mean that rather than idealising their models at the cellular level they would have to model the behaviour of individual synapses. They would have to model the timing of the signals within the brain asynchronously, which would increase both the processing required and the memory footprint of the model.

Remember the success of the model of auditory perception that produced super-human recognition a few years ago? That was based on a more realistic neural network model, and had huge success. From what I can tell it never made it into mainstream voice-recognition software because it was too processor intensive. This primate model would be orders of magnitude more expensive to run, and even though Blue Gene can make calculations faster than light can cross the room, it will have an awful lot of them to do. I wonder how slow this would be compared to the brain being modelled.

I also wonder how they will quality check their model. How do you check that your model is working in the same way as a primate brain? Would this have to be matched with a similarly ambitious brain scanning project?

Another thing that this makes me wonder (after saying cool a few more times) is what sort of data storage capacity they would have to expend to produce such a model. Let's do a little thumbnail sketch to work out what the minimum value might be, based on a model of a small primate like a squirrel monkey, with a cellular brain density similar to a human's but a brain weighing only 22 grams (say about 2% of the mass of a human brain).

  • Average weight of adult human brain = 1,300 – 1,400gm
  • Number of synapses for a "typical" neuron = 1,000 to 10,000
  • Number of molecules of neurotransmitter in one synaptic vesicle = 10,000-100,000
  • Average number of glial cells in brain = 10-50 times the number of neurons
  • Average number of neurons in the human brain = 10 billion to 100 billion

If we extrapolate these figures for a squirrel monkey, the number of neurons would be something like 1 billion cells, each with (say) 5,000 synapses, each with 50,000 neurotransmitter molecules. Now if we stored some sort of data for the 3D location of each of those neurotransmitters we would need a reasonably high-precision location, maybe a double-precision float per dimension. That would be 8 bytes * 3 dimensions * 50,000 * 5,000 * 1,000,000,000, which comes out at 6,000,000,000,000,000,000 bytes, or about 6 million terabytes. Obviously the neurotransmitters are just a part of the model. The pattern of connections in the synapses would have to be modelled as well: if there are a billion neurons with 5,000 synapses each, there would have to be at least 5 terabytes of data just to record the synaptic connections. Each of those synapses would also have its own state and timing information; at maybe 100 bytes or more each, that is another 500 terabytes.
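The arithmetic above can be checked mechanically; this snippet just replays the estimates from the paragraph:

```python
# Thumbnail storage estimate for a squirrel-monkey-scale brain model.
neurons = 1_000_000_000               # ~1 billion cells
synapses_per_neuron = 5_000
transmitters_per_synapse = 50_000
bytes_per_location = 8 * 3            # double-precision float per dimension

# 3D location for every neurotransmitter molecule:
transmitter_bytes = (neurons * synapses_per_neuron
                     * transmitters_per_synapse * bytes_per_location)
print(transmitter_bytes / 1e12)       # terabytes: 6,000,000 (6e18 bytes)

# Per-synapse state and timing, ~100 bytes each:
synapses = neurons * synapses_per_neuron          # 5e12 synapses
state_bytes = synapses * 100
print(state_bytes / 1e12)             # terabytes: 500
```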

I suspect that the value of modelling at this level is marginal; if they represented the densities of neurotransmitters over time instead, they could cut the cost of data storage hugely. I wonder if there is 6 million terabytes of storage in the world! If each human on earth contributed a gigabyte of storage then we might be able to store that sort of data.

Let's assume that they were able to compress the data storage requirements through abstraction to a millionth of the total I just described, or around 6 terabytes. I assume that at all times the synapses would be visited to update their status. That means that if the synapses were updated once every millisecond (which rings a bell, but may be too fast) then the system would have to perform 6*10^15 operations per second. Assume also that the software has numerous housekeeping and structural tasks to perform, so that it might be no more than 25% efficient; in that case we are talking 2.4*10^16 operations per second. The Blue Gene/L system they will be using is able to perform about 2.28*10^13 flops, so each second of simulation would take them around 1,000 seconds of real time.
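Replaying that arithmetic (and taking the EPFL Blue Gene/L's peak as roughly 22.8 teraflops, i.e. 2.28*10^13 flop/s, which is the figure that makes the thousand-second conclusion come out):

```python
# Throughput estimate for the compressed model.
model_bytes = 6e12                    # ~6 TB of synapse state after compression
updates_per_second = 1_000            # one update per state byte per millisecond
raw_ops = model_bytes * updates_per_second      # 6e15 ops/s required
with_overhead = raw_ops / 0.25                  # 25% efficiency -> 2.4e16 ops/s

peak_flops = 2.28e13                  # assumed EPFL Blue Gene/L peak
slowdown = with_overhead / peak_flops
print(round(slowdown))                # ~1053: each simulated second costs ~1000 real seconds
```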

They will be restricting their models to cortical columns, which would restrict the model to about 100 million synapses and be much more manageable in the short term. I wonder how long it will take before they are able to produce a machine that can process a full model of the human brain?

I am a charitable NGO and I didn't know it

I have been receiving increasingly desperate messages from various scions of once elevated families of the Ivory Coast, Kenya and South Africa. They have all these funds tied up and they need ME to help them release it (in confidence of course, but I know you, dear reader, will not tell a soul)!

I feel honoured to be considered for the noble task of helping these courteous young people out of their dire circumstances. I just need a donation from you dear reader to allow me to free up the funds to send to them to free up their funds to send my cut back to me so that I can pass it on to you. Contact me for my bank account details.

And we’re off

I have set up the sourceforge project. You can find it here. I've also classified all of the work, and split it up into releases.

Here's what will go into release one:

  • Configuration – Use native .NET configuration
  • Configuration – Remove existing config assembly
  • Installers – WiX installers
  • Runtime Control – Add transactional support from COM+
  • Runtime Control – Extend reverse engineering to examine stored procedures and create wrappers for them
  • Runtime Control – Configurable ID strategy
  • Runtime Control – Configurable transaction isolation policy
  • Runtime Control – Divide system between runtime and development projects
  • Runtime Control – Standardise all names to CamelCase
  • Templates – Move core templates into resource assembly
  • Testing – Create a proper test database
I think the highest priority is the configuration rework. Configuration in the previous system was way too complicated. What we need is a very simple, very reliable system that can easily be expanded to accommodate something like the configuration application block at a later date. As soon as that is done, the key task will be converting the system from its current broken state to a working state, and then splitting it into runtime and development arms. I will also do some work towards creating WiX installers for the runtime and development systems, including an installer for packaging source releases, which will allow the easy setup of a development environment for new volunteers on the project.

This is of course based on the "if you build it, they will come" model of open source development.

Meta-evolution – evolving the capacity to learn

The real value of a language-learning organ (or any other kind of learning organ), as Chomsky called it, is that its most valuable output is the _capacity_ to be so sensitive to the environment that mental processes grow to represent it. That is, the diversity of environments that humans find themselves in is so rich and varied that a hard-wired, inflexible competence would be of limited value compared to a "meta-learning" faculty that develops to represent whatever environment the organism finds itself in.

Meta-evolution would be of more evolutionary value than plain evolution: a learning capacity that can adapt within the real time of an organism's life seems more valuable than a set of skills and competences that must be evolved over generations as environments change.

First Task – What to do?

I've posted a whole bunch (well, 45) of bugs on the GotDotNet bug tracker for enhancements that nORM should not have to live too long without. Some of them ought to be fairly easy to deliver, like changing method names to have a clean and consistent format. Others are a bit more of a challenge – like adding persistence support for XML documents and reverse engineering an entity model from a schema file.

I think that delivering these bits of functionality will take quite a while. If each of these things took an average of a week to do (which is optimistic!) it would take me about a year to deliver all of the enhancements. But by the end of that, the system will kick some serious arse!

The BugList/WishList page is here. Feel free to make suggestions if you can think of anything I've missed.