Truly abstracting a persistence mechanism

The initial design that I used when I made the ancestor of norm was based upon designs by Scott Ambler, whose intent was definitely to provide an abstraction around the logic of a relational database. What I want to do with norm is to abstract the very notion of a data structure. When we persist an object into a row on an RDB, it is almost irrelevant that the data is persistent. Of course it's not truly irrelevant, else why bother at all? What I mean is that the fact that the data store is an RDB is irrelevant; it's the concepts involved that make ORM such a complicated enterprise. The relational notion of a relationship is subtly different from that used in the object domain, and therein lies the "object/relational impedance mismatch" that Ambler identifies in his hugely influential paper, "The Design of a Robust Persistence Layer for Relational Databases".

The persistence mechanism itself is deliberately kept very simple, since there is little conceptual overlap between the APIs for a flat-file persistence store and a relational database. In fact, even the notion of a connection means something different in each mechanism.
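A minimal sketch of what such a deliberately thin abstraction might look like (the class and method names here are my own illustration, not norm's actual API):

```python
from abc import ABC, abstractmethod


class PersistenceStore(ABC):
    """Hypothetical lowest common denominator between storage back ends.

    'Connection' means something different in each mechanism, so the
    interface only promises that a session can be opened and closed.
    """

    @abstractmethod
    def open(self):
        """Acquire whatever a 'connection' means here: a socket, a file handle..."""

    @abstractmethod
    def close(self):
        """Release the underlying resource."""


class FlatFileStore(PersistenceStore):
    """For a flat file, the 'connection' is nothing more than a file handle."""

    def __init__(self, path):
        self.path = path
        self.handle = None

    def open(self):
        self.handle = open(self.path, "a+")

    def close(self):
        self.handle.close()
```

Anything richer than open/close (transactions, cursors, locking) already fails to be common to both worlds, which is the point.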

So what I'm after is a way to bridge the gap between persistence stores so that mappings can be made between two different object models as easily as they can between object models and relational databases. What I'm wondering is whether Ambler's model is the right one to use for such an abstraction. My first task is to purge any domain pollution from the mapping system, the transaction manager and the command system.

My initial system was a very close parallel to Ambler's designs, but now I'm looking to diverge in the hope of a cleaner conceptual model. What most ORM systems do is define an invertible function between the object domain and the relational domain. I propose to do the same thing, but in a non-explicit way.

Normally the mapping is done by enumerating the domain set (usually the object domain), enumerating the range set (the relational model), and then defining the mappings between them. If you look closely at the mapping file formats of persistence mechanisms such as Torque, Hibernate, ObjectSpaces and norm's predecessor, they all follow this idiom: XML configuration files define the mapping, and an in-memory model serves as a runtime aid to the persistence broker as it builds its SQL commands.
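Stripped of the XML, the enumeration idiom reduces to a lookup table that a broker walks to emit commands. A toy version (all names here are invented for illustration, not taken from any of those tools):

```python
# An explicit, exhaustively enumerated mapping: every attribute on the
# domain side is paired by hand with a column on the relational side.
PERSON_MAPPING = {
    "class": "Person",
    "table": "PERSON",
    "attributes": {
        "firstName": "FIRST_NAME",
        "lastName": "LAST_NAME",
        "dateOfBirth": "DOB",
    },
}


def build_insert(mapping, obj):
    """Walk the in-memory mapping model to build an SQL command,
    roughly the way a persistence broker does at runtime."""
    cols = ", ".join(mapping["attributes"].values())
    vals = ", ".join(repr(obj[attr]) for attr in mapping["attributes"])
    return f"INSERT INTO {mapping['table']} ({cols}) VALUES ({vals})"
```

So `build_insert(PERSON_MAPPING, {"firstName": "Ada", ...})` yields an `INSERT INTO PERSON ...` statement; every pair in the table had to be written down by somebody.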

Ultimately that has to be how it's executed, but I wonder whether we can't define the mapping in another way, rather like the difference between defining a set by enumerating all of its members and defining it through a generative function that maps onto the set. That is, rather than saying:

x = {2, 4, 6, 8}

we could say

x = {2i where 0 < i < 5}

It's definitely more complicated than explicitly enumerating the mappings, but it might make mappings easy to solve in cases like inheritance, where several mapping strategies (one table per class, one table per hierarchy, and so on) all work.
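The contrast between the two styles of definition can be sketched directly. Here the "generative function" is a naming convention, purely my own illustration of the idea:

```python
import re

# Enumerated: list every pair explicitly, like x = {2, 4, 6, 8}.
EXPLICIT = {
    "firstName": "FIRST_NAME",
    "dateOfBirth": "DATE_OF_BIRTH",
}


def generative(attr: str) -> str:
    """Generative mapping, like x = {2i where 0 < i < 5}: derive the
    column name from the attribute name by rule instead of enumerating
    the pairs. (The camelCase-to-UPPER_SNAKE convention is an assumed
    example, not a rule from any real mapping tool.)"""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", attr).upper()
```

The rule covers attributes nobody has listed yet, which is exactly what an exhaustive enumeration cannot do.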

To do this kind of conceptual mapping we need to work out the key abstractions that define the mapping functions:

  • whole/part relationship
  • complex type
  • association
  • CRUD operation
  • is-a relationship

Each of these things is present in every representation that I am considering. They exist in RDBMSs, object databases, and XML documents (i.e. a flat file, kinda :). But how they are represented and realized differs vastly between these technologies.
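One way to make that concrete (purely my own sketch, not norm's design) is to name the abstractions once and let each technology declare how it realises them:

```python
from enum import Enum


class Concept(Enum):
    """The shared abstractions, named independently of any back end."""
    WHOLE_PART = "whole/part relationship"
    COMPLEX_TYPE = "complex type"
    ASSOCIATION = "association"
    CRUD = "CRUD operation"
    IS_A = "is-a relationship"


# How each technology realises the same abstraction: the concepts are
# shared, but the representations differ wildly.
REALISATIONS = {
    "rdbms": {
        Concept.WHOLE_PART: "foreign key with cascading delete",
        Concept.ASSOCIATION: "foreign key or join table",
        Concept.IS_A: "one table per class, or one per hierarchy",
    },
    "xml": {
        Concept.WHOLE_PART: "element nesting",
        Concept.ASSOCIATION: "ID/IDREF attributes",
        Concept.IS_A: "xsi:type or substitution groups",
    },
    "object_db": {
        Concept.WHOLE_PART: "direct object containment",
        Concept.ASSOCIATION: "object reference",
        Concept.IS_A: "native class inheritance",
    },
}
```

A mapping between any two stores would then be phrased at the level of `Concept`, with each side supplying its own realisation.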

I wonder, and that's all I've done so far, whether, if we defined how the underlying concept is laid out in each case, we could do the mapping by specifying how that meaning is projected onto the concepts of the problem domain. Maybe I could perform the mapping by saying that this group of concepts is achieved using this kind of mapping, and maybe the ORM could deduce the rest.

Of course, proper naming strategies in each domain dictate different names, and they are seldom held to consistently, so short of naming the attributes exhaustively there is no way of doing the mapping. So is it worth my time? Or am I just proposing a slight change of terminology so as not to give away the format of the persistence mechanism?

A sign of things to come…

Finding that we didn't have room for a drying machine, and figuring that Melbourne summers are long and hot and dry, we decided that a clothes horse or two would be a better short term investment. And, keen as she is, Kerry decided to give one a try.

Lots of baby clothes

It gives me the shivers: I see a future littered with drying stinky babies' romper suits. Our neat and minimalist (and, on a good day, even quite stylishly modern) existence is about to become very biological. Not that I mind; I also see myself becoming virtually American in my capacity for maudlin sentimentality about fatherhood.

Like I said – very biological.

The power of advertising

I've always been a believer in the power of good advertising, which is why I was so pleased to see this fantastic example of the art in Melbourne this week. It has it all – it has the doleful situation of the foolish dolt who has so willfully left himself in a state of baldness that is, as we can see by the "after" photo, so obviously reversible. I salute this masterpiece of modern advertising, and hope that television advertisers can learn from it.

Kerry has a blog (or two)

Kerry has entered the fray, and will soon be blurting forth her every whimsical thought for your delectation and amusement.

Support her – she's not just a mum, you know, she's a blogwife! Besides, when you're as big as she is, you can't divert yourself with gymnastics…

Brain modeling – first steps

The following appeared in Kurzweil AI:

IBM and Switzerland's Ecole Polytechnique Federale de Lausanne (EPFL) have teamed up to create the most ambitious project in the field of neuroscience: to simulate a mammalian brain on the world's most powerful supercomputer, IBM's Blue Gene. They plan to simulate the brain at every level of detail, even going down to molecular and gene expression levels of processing.

Several things come to mind after the initial "coooooooooooooool!!!!!". The first is that they are considering a truly vast undertaking here. Imagine the kind of data storage and transmission capacity that would be required to run that sort of model. Normally when considering this sort of thing, AI researchers produce an idealised model in which the physical structure of a neuron is abstracted into a cell in a matrix, able to represent the flow of information in the brain in a simplified way. What these researchers are suggesting is that they will model the brain in a physiologically authentic way. Rather than idealising their models at the cellular level, they would have to model the behaviour of individual synapses, and model the timing of signals within the brain asynchronously, which would increase both the processing required and the memory footprint of the model.

Remember the model of auditory perception that produced super-human recognition a few years ago? It was based on a more realistic neural network model and had huge success, but from what I can tell it never made it into mainstream voice-recognition software because it was too processor-intensive. This primate model would be orders of magnitude more expensive to run, and even though Blue Gene can make a couple of calculations in the time light travels a micrometre, it will have a lot of them to do. I wonder how slow the simulation would be compared to the brain being modeled.

I also wonder how they will quality check their model. How do you check that your model is working in the same way as a primate brain? Would this have to be matched with a similarly ambitious brain scanning project?

Another thing that this makes me wonder (after saying cool a few more times) is what sort of data storage capacity they would have to expend to produce such a model. Let's do a little thumbnail sketch to work out a minimum value, based on a small primate like a squirrel monkey, with a cellular brain density similar to humans but a brain weighing only 22 grams (say about 2% of the mass of the human brain).

  • Average weight of adult human brain = 1,300–1,400 g
  • Number of synapses for a "typical" neuron = 1,000 to 10,000
  • Number of molecules of neurotransmitter in one synaptic vesicle = 10,000-100,000
  • Average number of glial cells in brain = 10-50 times the number of neurons
  • Average number of neurons in the human brain = 10 billion to 100 billion

If we extrapolate these figures to a squirrel monkey, the number of neurons would be something like 1 billion, each with (say) 5,000 synapses, each with 50,000 neurotransmitter molecules. Now if we stored some sort of data for the 3D location of each of those molecules, we would need a reasonably high-precision location, say a double-precision float per coordinate. That would be 8 bytes * 3 dimensions * 50,000 * 5,000 * 1,000,000,000, which comes out at 6,000,000,000,000,000,000 bytes, or about 6 million terabytes. Obviously the neurotransmitters are just a part of the model. The patterns of connections in the synapses would have to be modeled as well: with a billion neurons of 5,000 synapses each, there would have to be at least 5 terabytes of data for the synaptic connections alone. Each of those synapses would also have its own state and timing information, maybe another 100 bytes or more, or another 500 terabytes.
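The thumbnail arithmetic can be checked with a quick script (the counts are the extrapolations just made, not measured values):

```python
NEURONS = 10**9                 # squirrel monkey extrapolation
SYNAPSES_PER_NEURON = 5_000     # assumed mid-range figure
MOLECULES_PER_SYNAPSE = 50_000  # neurotransmitter molecules
COORD_BYTES = 8 * 3             # double-precision float per dimension

# Storing a 3D location for every neurotransmitter molecule:
nt_storage = NEURONS * SYNAPSES_PER_NEURON * MOLECULES_PER_SYNAPSE * COORD_BYTES
# nt_storage == 6 * 10**18 bytes, i.e. ~6 million terabytes

# State and timing for every synapse, at ~100 bytes each:
synapse_state = NEURONS * SYNAPSES_PER_NEURON * 100
# synapse_state == 5 * 10**14 bytes, i.e. ~500 terabytes
```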

I suspect that the value of modeling at this level is marginal; if they represented the densities of neurotransmitters over time instead, they could cut the cost of data storage hugely. I wonder if there is 6 million terabytes of storage in the world! If each human on earth contributed a gigabyte of storage, then we might be able to store that sort of data.

Let's assume that they were able to compress the data storage requirements through abstraction to a millionth of the total I just described, or around 6 terabytes, and that the synapses are visited at all times to update their status. If the synapses were updated once every millisecond (which rings a bell, but may be too fast), the system would have to perform 6*10^15 operations per second. Assuming also that the software has numerous housekeeping and structural tasks to perform, it might be no more than 25% efficient, in which case we are talking 2.4*10^16 operations per second. The Blue Gene/L system they will be using is able to perform 2.28*10^12 flops, so the simulation would run roughly 10,000 times slower than real time: around 10 seconds for each one-millisecond cycle of updates.
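Redoing that rate arithmetic as a script, under the same assumptions (one touch per stored byte per millisecond, 25% efficiency):

```python
compressed_bytes = 6 * 10**12   # ~6 TB after millionfold compression
updates_per_second = 1_000      # one full pass per millisecond (assumed)

raw_rate = compressed_bytes * updates_per_second  # 6e15 ops/s needed
needed_rate = raw_rate / 0.25                     # 2.4e16 ops/s at 25% efficiency

blue_gene_flops = 2.28 * 10**12
slowdown = needed_rate / blue_gene_flops   # ~10,500x slower than real time
cycle_seconds = slowdown / 1_000           # ~10.5 s per 1 ms update cycle
```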

They will be limiting their models to cortical columns, which would restrict the model to about 100 million synapses, something much more manageable in the short term. I wonder how long it will take before they are able to produce a machine that can process a full model of the human brain?

I am a charitable NGO and I didn't know it.

I have been receiving increasingly desperate messages from various scions of once-elevated families of the Ivory Coast, Kenya and South Africa. They have all these funds tied up, and they need ME to help them release them (in confidence of course, but I know you, dear reader, will not tell a soul)!

I feel honoured to be considered for the noble task of helping these courteous young people out of their dire circumstances. I just need a donation from you dear reader to allow me to free up the funds to send to them to free up their funds to send my cut back to me so that I can pass it on to you. Contact me for my bank account details.