Thoughts on Repository Requirements for the Portal

We had a long meeting to resolve some of the requirements, storage, and workflow issues. We spent most of the time on the script repository and git and didn't really come to a full conclusion.

Let me try to write down a few of the issues and let's see if this will help us close the loop.

What's the problem?

To run an experiment we need to provide a script to the Experiment Controller (EC). When executing the 'top-level' script, the EC will require other scripts. In our context this splits into two problems:

  1. Finding the scripts
  2. Recording a persistent link to the used scripts

Right now we have a simple search algorithm to locate scripts:

  1. Look in local directory
  2. Look in repository (which is currently also in the file system)
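
As a rough sketch of that lookup order (not the EC's actual code), the search boils down to trying an ordered list of directories. The function name `find_script` and the idea of passing the directories in as a list are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional, Sequence


def find_script(name: str, search_dirs: Sequence[Path]) -> Optional[Path]:
    """Return the first existing file called `name`, trying each directory in order.

    Hypothetical helper: the local directory would come first in `search_dirs`,
    followed by the file-system repository.
    """
    for directory in search_dirs:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None  # not found in any of the search locations
```

A call would look like `find_script("basic_tutorial.oedl", [Path.cwd(), repository_dir])`, where `repository_dir` stands for wherever the repository currently lives on disk.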

The 'persistent link' issue is probably most easily resolved with an HTTP URL. Keeping the repository in a local directory has bothered me for a long time; we should really put it behind a web server, which would also provide the necessary, and currently missing, access control.

However, this also means that if one of the scripts used by the EC is on the local filesystem, it needs to be pushed into the repository, and the resulting URL needs to be recorded in the list of scripts used for a specific experiment run (that list is also what is being captured by the portal).

Now, the scripts are not immutable; most of them, especially the top-level one, will be modified regularly. In other words, we have a versioning problem, and in a multi-person project we also have a concurrency problem. These are essentially the problems that code repository systems such as svn and git are tackling. Not surprisingly, we picked git for what I have written about so far.

However, yesterday's long discussion may have been an indication that things aren't as clear-cut as we may have thought. One of the reasons is that we really want to hide the repository issues from the user, and ideally we also want to avoid a dependency on git for the EC if possible (as it requires more software to be installed on the user's machine).

Strawman Proposal

Let me now float a strawman proposal. Essentially, I believe that we really don't need a sophisticated repository management system such as svn or git. We don't need branching or elaborate support for merging, nor do we need to be too concerned about storage efficiency. What should suffice is storage in a directory with some naming convention, a simple mechanism to detect update conflicts, and deferring management of all additional metadata to the associated portal.

Support for versions

We want to support versions of scripts and we want to support an experiment referring to the 'newest version 2.1' script. Now, what an experiment refers to is a prototype or an application definition, and not necessarily a file.

To keep complexity limited, we should adopt the Java convention of linking a single 'module' to a single file (in fact, we are doing this already). In Java, the definition for class org.acme.Foo can be found in the file Foo.java in the org/acme directory. For us, this means mapping a module's URI to a file path, or more generally, to a URL.

Let's look at a few examples (this is really just off the top of my head):

(urn:omf:portal:)project:sub_project(opt):app|proto|exp:name:major(opt):minor(opt)

omf:tutorial:exp:basic_tutorial
omf:tutorial:exp:basic_tutorial:2
omf:tutorial:exp:basic_tutorial:2:2

tempo:nada:xbmc:exp:two_boxes
tempo:nada:xbmc:app:xbmc:2
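
To check that the pattern is unambiguous, here is a minimal parsing sketch. It rests on several assumptions: 'urn:omf:portal:' is treated as a literal optional prefix, the first app|proto|exp component after the project name is taken as the separator (so a sub-project could not be called 'exp'), and, as in the 'tempo:nada:xbmc' example, more than one sub-project level is allowed. `ScriptRef` and `parse_uri` are made-up names.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

KINDS = ("app", "proto", "exp")
PREFIX = "urn:omf:portal:"  # assumed to be a literal, optional prefix


@dataclass(frozen=True)
class ScriptRef:
    project: Tuple[str, ...]      # project name followed by any sub-projects
    kind: str                     # one of KINDS
    name: str
    major: Optional[int] = None
    minor: Optional[int] = None


def parse_uri(uri: str) -> ScriptRef:
    """Split a script URI into project path, kind, name and optional version."""
    if uri.startswith(PREFIX):
        uri = uri[len(PREFIX):]
    parts = uri.split(":")
    # The first app|proto|exp component after the project name separates
    # the project path from the script name.
    kind_at = next((i for i, p in enumerate(parts) if i > 0 and p in KINDS), None)
    if kind_at is None or kind_at + 1 >= len(parts):
        raise ValueError(f"not a script URI: {uri!r}")
    tail = parts[kind_at + 1:]
    if len(tail) > 3:
        raise ValueError(f"too many components after the name: {uri!r}")
    version = [int(v) for v in tail[1:]]  # int() rejects non-numeric versions
    major = version[0] if len(version) > 0 else None
    minor = version[1] if len(version) > 1 else None
    return ScriptRef(tuple(parts[:kind_at]), parts[kind_at], tail[0], major, minor)
```

For instance, `parse_uri("tempo:nada:xbmc:app:xbmc:2")` yields the project path `("tempo", "nada", "xbmc")`, kind `app`, name `xbmc` and major version 2, with the minor left unset.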

Now this could be translated to the following assuming the default portal's repository is 'rep.mytestbed.net'

omf:tutorial:exp:basic_tutorial    => http://rep.mytestbed.net/omf/tutorial/exp/basic_tutorial
                            => http://rep.mytestbed.net/omf/tutorial/exp/basic_tutorial_02_03_01231AE...02.oedl

The first URL is what the EC produces; the second one is the redirect from the portal to the most recent version '2.3'. The trailing number could be a SHA indicating the revision - I'll get back to that in a minute. Having a redirect is a relatively simple way of letting the EC know which version is really being used, as it needs to record this for a complete trail of an experiment. And as we can assume that the content behind each 'real' URL will never change, we can locally cache them.
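
A minimal sketch of the URI-to-URL step on the EC side could look as follows. It is one reading of the examples, not a settled format: the zero-padded '_02_03' suffix is copied from the URLs used in this note, `uri_to_url` is a made-up name, and the sketch treats up to two trailing numeric components as the version, so a script whose name is purely numeric would be misread. It also does not handle the optional 'urn:omf:portal:' prefix.

```python
DEFAULT_REPOSITORY = "rep.mytestbed.net"  # default repository host named in the text


def uri_to_url(uri: str, host: str = DEFAULT_REPOSITORY) -> str:
    """Map a script URI to the URL the EC would request from the portal."""
    parts = uri.split(":")
    version = []
    # Up to two trailing numeric components are taken as major and minor.
    while parts and parts[-1].isdigit() and len(version) < 2:
        version.insert(0, int(parts.pop()))
    path = "/".join(parts)
    suffix = "".join(f"_{v:02d}" for v in version)  # 2, 3 becomes "_02_03"
    return f"http://{host}/{path}{suffix}"
```

With this, 'omf:tutorial:exp:basic_tutorial' maps to the first URL above, and adding ':2:3' appends '_02_03' to the last path component. The revision part of the final URL is deliberately left to the portal's redirect.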

Now, what about local files? Again, we need a naming convention on how to translate URIs into file names and vice versa. We could use the Java convention of nested directories, but I always found them quite cumbersome if I was just using a normal editor. There are a few options:

  • Flatten the name - e.g. use '__' to indicate a ':' in the URI (could shorten that for the trailing version)

omf:tutorial:exp:basic_tutorial:2:3    => omf__tutorial__exp__basic_tutorial_2_3.oedl

  • Use defaults for the project name to shorten name

omf:tutorial:exp:basic_tutorial:2:3    => basic_tutorial_2_3.oedl   ... assuming that most local files are exp. files

% omf exec -p omf:tutorial basic_tutorial

or

% export OMF_PROJECT=omf:tutorial
% omf exec basic_tutorial

  • Use directory tree, but add some convenience functions to 'omf'

% export OMF_EDITOR=emacs
% omf edit exp:basic_tutorial
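
The first two options can be sketched in a few lines. This is only an illustration under assumptions: `flat_filename` is a made-up name, the '.oedl' extension and the single '_' before version numbers follow the examples above, and the default-project shortcut is only applied to 'exp' scripts.

```python
def flat_filename(uri: str, default_project: str = "") -> str:
    """Turn a script URI into a flat local file name.

    With `default_project` set (e.g. from OMF_PROJECT), the project prefix
    and the 'exp' component are dropped from the name.
    """
    parts = uri.split(":")
    version = []
    # Up to two trailing numeric components are the major and minor version.
    while parts and parts[-1].isdigit() and len(version) < 2:
        version.insert(0, parts.pop())
    stem = ":".join(parts)
    prefix = default_project + ":exp:"
    if default_project and stem.startswith(prefix):
        stem = stem[len(prefix):]
    # '__' stands for ':'; the version is appended with single underscores.
    return stem.replace(":", "__") + "".join(f"_{v}" for v in version) + ".oedl"
```

Going back from a file name to a URI is less clear-cut, since a script name may itself end in '_2'; that ambiguity is one argument for keeping the version shortening optional.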

OK, that - or a version of it - would solve finding local files from a URI. But we need to persist any version we are using in a repository. That means that any local file first needs to be uploaded to a portal and given a persistent URI. For that we have the following issues to resolve:

Resolve portal and project name

We shouldn't assume that there will only be a single portal, but we can simplify the problem by assuming that all scripts for a SPECIFIC project reside on ONE portal.

The project name associated with a script is necessary as it will be linked to access control. Please note that the experiment itself may be executed in the context of a different project.

Determining version

Again, to simplify things, we should restrict ourselves to a major and minor version which can be controlled by the user, and a revision automatically assigned by the repository/portal.

If no major version is given, it defaults to the highest major and the highest minor within the context of that major. If only the major is given, it defaults to the highest minor in that context. If this is the first time a script is uploaded it gets assigned major version 1, minor 0.
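
These defaulting rules are small enough to write down directly. The sketch below assumes the portal can list the existing (major, minor) pairs for a script; `resolve_version` is a made-up name, and raising an error for an unknown major, or for a minor without a major, is my assumption since the rules above don't cover those cases.

```python
from typing import Iterable, Optional, Tuple

Version = Tuple[int, int]  # (major, minor)


def resolve_version(existing: Iterable[Version],
                    major: Optional[int] = None,
                    minor: Optional[int] = None) -> Version:
    """Apply the defaulting rules to a possibly incomplete version request."""
    if major is None and minor is not None:
        raise ValueError("a minor version needs a major version")
    if major is not None and minor is not None:
        return (major, minor)          # fully specified, nothing to default
    versions = sorted(set(existing))
    if major is None:
        # Highest major and its highest minor, or 1.0 for a first upload.
        return versions[-1] if versions else (1, 0)
    minors = [mi for (ma, mi) in versions if ma == major]
    if not minors:
        raise LookupError(f"no version with major {major}")  # assumption
    return (major, max(minors))
```

Tuples sort by major first and minor second, which is why the last element of the sorted list is the 'highest major, highest minor' version.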

Determining 'update conflicts'

We spent most of the meeting discussing the situation where two experimenters change the same script. In the 'do nothing' case, each upload would simply create a new version. One solution I can think of is to remember the SHA value of the file when it was last synced with the repository. Adding that SHA as an optional argument to the upload would allow the repository to detect a conflict by simply comparing it with the SHA of the latest version within the specific major/minor.

POST http://rep.mytestbed.net/omf/tutorial/exp/basic_tutorial_02_03?ref=01231AE...02

which either results in an HTTP error indicating a conflict, or returns the full URI of the newly stored script. The SHA needs to be stored locally. For instance, for every file uploaded, a file with the same name is created in a .omf directory containing the SHA for the latest upload.
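
The local bookkeeping could be as small as the following sketch. It assumes SHA-1 (the text only says 'SHA'), a .omf directory next to the script, and made-up function names; the actual upload call is left out.

```python
import hashlib
from pathlib import Path
from typing import Optional


def file_sha(path: Path) -> str:
    """Hex digest of the file content (SHA-1 is an assumption)."""
    return hashlib.sha1(path.read_bytes()).hexdigest()


def sha_record(path: Path) -> Path:
    """Location of the bookkeeping file: .omf/<same name> next to the script."""
    return path.parent / ".omf" / path.name


def remember_upload(path: Path) -> str:
    """Record the SHA of `path` after a successful upload and return it."""
    sha = file_sha(path)
    record = sha_record(path)
    record.parent.mkdir(exist_ok=True)
    record.write_text(sha + "\n")
    return sha


def last_synced_sha(path: Path) -> Optional[str]:
    """SHA recorded at the last upload, or None if the file was never uploaded."""
    record = sha_record(path)
    return record.read_text().strip() if record.is_file() else None
```

The value returned by `last_synced_sha` is what would go into the 'ref' argument of the next upload; comparing it with `file_sha` also tells the EC cheaply whether the local file has changed at all.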

The obvious remaining issue is what to do when there is a conflict. The simplest solution is to fail the experiment with an appropriate error message, and/or to allow the experimenter to force a new revision.

POST http://rep.mytestbed.net/omf/tutorial/exp/basic_tutorial_02_03?ref=01231AE...02&force=true
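
On the repository side, the check behind 'ref' and 'force' might look like this in-memory sketch. The class and method names are made up, SHA-1 is an assumption, and a real portal would answer with an HTTP error (409 Conflict would be the natural status code) where this sketch raises an exception.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, int, int]  # (script URI, major, minor)


class Conflict(Exception):
    """The caller's ref does not match the latest stored revision."""


class ScriptStore:
    """In-memory stand-in for the repository, one revision list per version."""

    def __init__(self) -> None:
        self._revisions: Dict[Key, List[bytes]] = {}

    def latest_sha(self, key: Key) -> Optional[str]:
        revisions = self._revisions.get(key)
        return hashlib.sha1(revisions[-1]).hexdigest() if revisions else None

    def upload(self, key: Key, content: bytes,
               ref: Optional[str] = None, force: bool = False) -> str:
        """Store a new revision and return its SHA.

        Without `ref` every upload simply creates a new revision. With `ref`,
        the upload is rejected if someone else stored a revision in between,
        unless `force` is set.
        """
        latest = self.latest_sha(key)
        if ref is not None and latest is not None and ref != latest and not force:
            raise Conflict(f"latest revision is {latest}, upload was based on {ref}")
        self._revisions.setdefault(key, []).append(content)
        return hashlib.sha1(content).hexdigest()
```

Note that forcing does not merge anything; it only records the new content as the latest revision, which matches the 'no elaborate merging' stance of the proposal.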

We most likely need some more sophisticated mechanisms for long-running experiments to be able to resolve this more flexibly. One potential strategy is to use the OIDL programming model where a conflict would create an event which can be handled by a state specifically and by a default in general.

Implementation

Implementation concerns fall into the following categories:

  1. Maintaining latest revision
  2. Resolving conflicts
  3. Performance

The first two, and with them the entire mechanism outlined above, can easily be realised with a Redmine controller. The scripts themselves are stored in a directory tree, and as there is a corresponding portal record for every script stored, the latest revision can be found there. Resolving conflicts is essentially serialising revisions, which can be guaranteed through a database transaction.
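
To illustrate the 'serialising through a transaction' point, here is a sketch using Python's sqlite3 module as a stand-in for the portal's database; it is not Redmine code. The table layout and function names are invented for the example.

```python
import sqlite3


def open_portal_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the (hypothetical) portal database and create the revision table."""
    # isolation_level=None: no implicit transactions, we issue BEGIN/COMMIT ourselves.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS script_revisions ("
        " uri TEXT NOT NULL, major INTEGER NOT NULL, minor INTEGER NOT NULL,"
        " revision INTEGER NOT NULL, sha TEXT NOT NULL,"
        " PRIMARY KEY (uri, major, minor, revision))"
    )
    return conn


def add_revision(conn: sqlite3.Connection, uri: str,
                 major: int, minor: int, sha: str) -> int:
    """Assign the next revision number for uri/major/minor inside one transaction."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        (latest,) = conn.execute(
            "SELECT COALESCE(MAX(revision), 0) FROM script_revisions"
            " WHERE uri = ? AND major = ? AND minor = ?",
            (uri, major, minor),
        ).fetchone()
        revision = latest + 1
        conn.execute(
            "INSERT INTO script_revisions VALUES (?, ?, ?, ?, ?)",
            (uri, major, minor, revision, sha),
        )
        conn.execute("COMMIT")
        return revision
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Because the read of the latest revision and the insert of the new one happen under the same write lock, two concurrent uploads cannot be handed the same revision number. The SHA comparison for conflict detection would sit between the SELECT and the INSERT.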

Now that brings us to performance. Is going through Rails a real performance hog? Worrying about performance upfront is not always the best strategy, especially as we have little information on how much of an impact it would really have. Most likely the better question to ask is - is there a fundamental performance problem? Most of the portal requests will be for downloading a script of a specific version. The above proposal calls for a REDIRECT. The first request comes to the portal for the actual URL of the script. That can be cached for the most popular ones. The REDIRECT can point to a fast & simple web server as it is simply fetching a local file. Again, client side caching may take care of most of it.
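
The client-side caching argument can be made concrete with a small sketch. The network is abstracted away: `resolve` stands for following the portal's REDIRECT and `fetch` for downloading from the final URL; both, like `ScriptCache` itself, are made-up names.

```python
from typing import Callable, Dict, Tuple

Resolve = Callable[[str], str]   # alias URL -> final, versioned URL
Fetch = Callable[[str], bytes]   # final URL -> script content


class ScriptCache:
    """Cache script content by its final URL, which is assumed never to change."""

    def __init__(self, resolve: Resolve, fetch: Fetch) -> None:
        self._resolve = resolve
        self._fetch = fetch
        self._content: Dict[str, bytes] = {}

    def get(self, alias_url: str) -> Tuple[str, bytes]:
        """Return (final URL, content); the final URL is what the EC records."""
        # Always ask the portal where the alias points: 'latest' may have moved.
        final_url = self._resolve(alias_url)
        if final_url not in self._content:
            self._content[final_url] = self._fetch(final_url)
        return final_url, self._content[final_url]
```

In this arrangement the only request that always reaches the portal is the small redirect lookup; the download itself happens once per revision and client.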

Anything I'm missing?