Samizdat Storage Implementation
===============================

Query Pattern Translation
-------------------------

First, the WHERE section of a query is parsed into a list of triples:

    pattern: position -> p, s, o.
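
As an illustration of the parsing step, the following is a minimal
sketch that assumes each WHERE clause is written as "(predicate subject
object)" with whitespace-separated terms, as in the example query later
in this document. The names Triple, CLAUSE and parse_where are
hypothetical and are not taken from the Samizdat source.

```python
import re
from typing import List, NamedTuple


class Triple(NamedTuple):
    p: str  # predicate
    s: str  # subject
    o: str  # object


# Hypothetical helper: one "(predicate subject object)" clause.
CLAUSE = re.compile(r"\(\s*([^\s()]+)\s+([^\s()]+)\s+([^\s()]+)\s*\)")


def parse_where(where: str) -> List[Triple]:
    """Return the pattern as a list; the list index is the position."""
    return [Triple(*m.groups()) for m in CLAUSE.finditer(where)]


pattern = parse_where("""
    (dc::title ?msg ?title)
    (s::author ?msg ?author)
""")
```

Here pattern[0].s and pattern[1].s are both "?msg", which is what the
later stages rely on when they look for repeated nodes.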

Stage 1: Predicate Mapping

For each position i, the predicate uriref is looked up in the map of
internal resource properties. All possible mappings are recorded in
c[i].table (same for subject and object) and c[i].field (for the object
only: the subject is always mapped to the "id" field); all positions of
nodes in triples are recorded in pm[node]:

    c: position -> table, field
    pm: node -> positions*

Each ambiguous property mapping is compared with mappings for other
occurrences of the same subject and object nodes in the pattern graph;
whenever a non-empty intersection of mappings for the same node is
found, both subject and object mappings for the ambiguous property are
refined.
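
The mapping and refinement steps above can be sketched as follows. This
is only an illustration under stated assumptions: PROPERTY_MAP is a
hypothetical stand-in for the site-specific resource map, c[i] is kept
as a list of candidate (table, field) pairs, and pm records a role
("s" or "o") next to each position.

```python
from collections import defaultdict

# Hypothetical resource map: predicate -> candidate (table, field) pairs.
PROPERTY_MAP = {
    "dc::title": [("Message", "title"), ("Item", "title")],
    "s::author": [("Message", "author")],
    "s::fullName": [("Member", "full_name")],
}


def map_predicates(pattern):
    """pattern is a list of (p, s, o) tuples; returns (c, pm)."""
    c = [list(PROPERTY_MAP[p]) for p, _, _ in pattern]
    pm = defaultdict(list)
    for i, (_, s, o) in enumerate(pattern):
        pm[s].append((i, "s"))
        pm[o].append((i, "o"))
    return c, pm


def column(table, field, role):
    # A subject always binds to "id"; an object binds to the property field.
    return (table, "id" if role == "s" else field)


def node_columns(c, i, role):
    """Columns the node in the given role may bind to at position i."""
    return {column(t, f, role) for t, f in c[i]}


def refine(pattern, c, pm):
    """Narrow ambiguous mappings using other occurrences of the same node."""
    for i, (_, s, o) in enumerate(pattern):
        for node, role in ((s, "s"), (o, "o")):
            for j, other_role in pm[node]:
                if j == i or len(c[i]) < 2:
                    continue
                common = node_columns(c, i, role) & node_columns(c, j, other_role)
                if common:
                    c[i] = [(t, f) for t, f in c[i]
                            if column(t, f, role) in common]
    return c


pattern = [
    ("dc::title", "?msg", "?title"),
    ("s::author", "?msg", "?author"),
    ("s::fullName", "?author", "?name"),
]
c, pm = map_predicates(pattern)
refine(pattern, c, pm)
```

After refinement c[0] holds only the Message mapping, because ?msg also
occurs as the subject of s::author, which maps to Message alone.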

Stage 2: Relation Aliases and Join Conditions

Relation aliases c[i].alias are determined for each position i, such
that for all subject occurrences of pattern[i].subject that were mapped
to the same table c[i].table, the alias is the same, and for all
positions with a differing table mapping or subject node, the alias is
different:

    c: position -> table, field, alias.

For all nodes that are mapped to more than one {alias, field} pair in
different positions, join conditions are generated. Additionally, for
each external resource, the "Resource" table is joined by uriref.
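
A minimal sketch of this stage, assuming Stage 1 has already reduced
every position to a single (table, field) pair. The function names are
hypothetical, aliases are handed out in position order (so the letters
differ from those in the worked example below), and the extra
"Resource" joins for uriref nodes are left out.

```python
from collections import defaultdict


def assign_aliases(pattern, c):
    """One alias per distinct (subject node, table) pair."""
    seen = {}
    aliases = []
    for (_, s, _), (table, _) in zip(pattern, c):
        key = (s, table)
        if key not in seen:
            # Sketch only: single letters, so at most 26 aliases.
            seen[key] = chr(ord("a") + len(seen))
        aliases.append(seen[key])
    return aliases


def join_conditions(pattern, c, aliases):
    """Equate all distinct alias.field columns bound to the same node."""
    bindings = defaultdict(list)
    for (_, s, o), (_, field), alias in zip(pattern, c, aliases):
        for node, col in ((s, f"{alias}.id"), (o, f"{alias}.{field}")):
            if col not in bindings[node]:
                bindings[node].append(col)
    conditions = []
    for cols in bindings.values():
        conditions.extend(f"{cols[0]} = {other}" for other in cols[1:])
    return conditions


pattern = [
    ("dc::title", "?msg", "?title"),
    ("s::author", "?msg", "?author"),
    ("s::fullName", "?author", "?name"),
    ("s::publishedDate", "?msg", "?date"),
]
c = [
    ("Message", "title"),
    ("Message", "author"),
    ("Member", "full_name"),
    ("Resource", "published_date"),
]
aliases = assign_aliases(pattern, c)
conditions = join_conditions(pattern, c, aliases)
```

The two Message positions share one alias because they have the same
subject node and table, while the Resource position gets its own alias
and is tied back to it by a join on id.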

Problems

Some RDF properties (such as dc:title) can be mapped to more than one
internal resource table, and queries on such ambiguous properties are
intended to select all classes of resources that match this property in
conjunction with the rest of the query.

The algorithm described above assumes that other query clauses refine
each ambiguous property mapping to a single internal resource table.
Queries that fail this assumption will be translated incorrectly: only
the resource class from the first remaining mapping will be matched.
This should be taken into account in site-specific resource maps:
ambiguous properties should be avoided where possible, and their
mappings should be listed in descending order of resource class
probability.

It is possible to solve this problem, but any precise solution will add
significant complexity to the resulting query. Solutions that would not
adversely affect performance are still being sought.

Example

    SELECT ?msg, ?title, ?name, ?date, ?rating
    WHERE (dc::title ?msg ?title)
          (s::author ?msg ?author)
          (s::fullName ?author ?name)
          (s::publishedDate ?msg ?date)
          (rdf::subject ?stmt ?msg)
          (rdf::predicate ?stmt s::tag)
          (rdf::object ?stmt s::Quality)
          (s::rating ?stmt ?rating)
    LITERAL ?rating >= -1
    ORDER BY ?rating
    USING rdf FOR http://www.w3.org/1999/02/22-rdf-syntax-ns#
          dc FOR http://purl.org/dc/elements/1.1/
          s FOR http://www.nongnu.org/samizdat/rdf/schema#

The following is a schematic explanation of the query translation
algorithm for the example query above. c0-c7 are clauses representing
query pattern triples. ci.s is the subject of triple ci and ci.o is its
object; both ci.s and ci.o store a list of possible mappings of the
triple property into internal resource tables and field names. ci.a is
the table alias that represents the property in the resulting SQL join.

    c0 = (dc::title ?msg ?title)
      dc::title ->  Message.title, Item.title
      ?msg ->  c0.s => Message.id, Item.id
      ?title ->  c0.o => Message.title, Item.title

    c1 = (s::author ?msg ?author)
      s::author ->  Message.author
      ?msg ->  c1.s => Message.id
      ?author ->  c1.o => Message.author

    c2 = (s::fullName ?author ?name)
      s::fullName ->  Member.full_name
      ?author ->  c2.s => Member.id
      ?name ->  c2.o => Member.full_name

    c3 = (s::publishedDate ?msg ?date)
      s::publishedDate ->  Resource.published_date
      ?msg ->  c3.s => Resource.id
      ?date ->  c3.o => Resource.published_date

    c4 = (rdf::subject ?stmt ?msg)
      rdf::subject ->  Statement.subject
      ?stmt ->  c4.s => Statement.id
      ?msg ->  c4.o => Statement.subject

    c5 = (rdf::predicate ?stmt s::tag)
      rdf::predicate ->  Statement.predicate
      ?stmt ->  c5.s => Statement.id
      s::tag ->  c5.o => Statement.predicate

    c6 = (rdf::object ?stmt s::Quality)
      rdf::object ->  Statement.object
      ?stmt ->  c6.s => Statement.id
      s::Quality ->  c6.o => Statement.object

    c7 = (s::rating ?stmt ?rating)
      s::rating ->  Proposition.rating
      ?stmt ->  c7.s => Proposition.id
      ?rating ->  c7.o => Proposition.rating

    ?msg = c0.s, c1.s, c3.s, c4.o
      -- reverse mapping of the node occurrences
      c0.s => Message.id, Item.id
      c1.s => Message.id
      c3.s => Resource.id
      c4.o => Statement.subject
      -- refine ambiguous properties
      c0.s & c1.s > 0  ->  c0.s => Message.id  ->  c0.o => Message.title (1)
      -- define relation aliases (in subject positions)
      c0.a => Message a
      c1.a => Message a
      c3.a => Resource b
      -- join conditions and result binding
      a.id = b.id AS msg

    ?title = c0.o
      c0.o => Message.title, Item.title
      (as refined from ?msg (1): c0.o => Message.title)
      a.title AS title

    ?author = c1.o, c2.s
      c1.o => Message.author
      c2.s => Member.id
        c2.a => Member d
      a.author = d.id

    ?name = c2.o
      c2.o => Member.full_name
      d.full_name AS name

    ?date = c3.o
      c3.o => Resource.published_date
      b.published_date AS date

    ?stmt = c4.s, c5.s, c6.s, c7.s
      c4.s => Statement.id
      c5.s => Statement.id
      c6.s => Statement.id
      c7.s => Proposition.id
        c4.a => Statement c
        c5.a => Statement c
        c6.a => Statement c
        c7.a => Proposition e
      a.id = c.subject
      c.id = e.id

    s::tag = c5.o
      c5.o => Statement.predicate
      -- ground uriref nodes to resources
      Resource f
      c.predicate = f.id AND f.uriref = true AND f.label = 's::tag'

    s::Quality = c6.o
      c6.o => Statement.object
        Resource g
        c.object = g.id AND g.uriref = true AND g.label = 's::Quality'

    ?rating = c7.o
      c7.o => Proposition.rating
      e.rating AS rating

Resulting SQL Query

    SELECT a.id AS msg, a.title AS title, d.full_name AS name,
           b.published_date AS date, e.rating AS rating
    FROM Message a, Resource b, Statement c, Member d, Proposition e,
         Resource f, Resource g
    WHERE a.id = b.id
      AND a.author = d.id
      AND a.id = c.subject AND c.id = e.id
      AND c.predicate = f.id AND f.uriref = true AND f.label = 's::tag'
      AND c.object = g.id AND g.uriref = true AND g.label = 's::Quality'
      AND e.rating >= -1
    ORDER BY e.rating


Assertion Translation
---------------------

Initially, RDF clauses in the assertion are translated using the same
procedure as for the query pattern. The following mappings are
generated:

    pattern: position -> p, s, o
    c: position -> table, field, alias
    pm: node -> positions*
    jc: join-conditions*

Stage 1: Resources

Insert missing resources from the assertion and all blank nodes from the
INSERT section, and define the mapping v: node -> value.

    if internal: v[node] = Resource(id),
        if missing, error: fake internal resource id;
    elsif literal, v[node] = node;
    elsif blank node and only in object position,
        v[node] = update[node] (update is generated from UPDATE section);
    elsif blank node: v[node] = (select node where subgraph) or
    external: v[node] = Resource(uriref, label=node),
        if missing or in INSERT section,
            v[node] = (insert blank resource);

To detect the presence of resource blank nodes in the database, the
subgraph of all query nodes and properties reachable from this node is
matched against the database. The subgraph is generated using the
following algorithm:

    subgraph = [node], w = [];
    do 
        stop = true;
        for each pattern triple,
            if subject in subgraph and triple not in w,
                subgraph.push object;
                w.push triple;
                stop = false;
    until stop.
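
The same fixed-point loop can be written in Python as a sketch. The
function name reachable_subgraph is hypothetical, triples are assumed to
be (p, s, o) tuples, and, unlike the pseudocode above, an object that is
already in the subgraph is not pushed a second time.

```python
def reachable_subgraph(node, pattern):
    """Collect nodes and triples reachable from node by following
    subject -> object links until a pass adds nothing new."""
    subgraph = [node]
    w = []  # triples already walked
    stop = False
    while not stop:
        stop = True
        for triple in pattern:
            _, s, o = triple
            if s in subgraph and triple not in w:
                if o not in subgraph:  # assumption: skip duplicate nodes
                    subgraph.append(o)
                w.append(triple)
                stop = False
    return subgraph, w


pattern = [
    ("rdf::subject", "?stmt", "?msg"),
    ("s::rating", "?stmt", "?rating"),
    ("dc::title", "?msg", "?title"),
    ("s::fullName", "?author", "?name"),
]
nodes, walked = reachable_subgraph("?stmt", pattern)
```

The s::fullName triple is never walked here, since ?author cannot be
reached from ?stmt in this pattern.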

Stage 2: Properties

Find reverse mapping of aliases to clauses that refer to them:

    a: alias -> positions*.

Note: a will not include additional aliases that are defined for uriref
nodes; they are not necessary since their ids are already recorded in v.

    for each alias,
        (all positions in a[alias] share the same subject and table)
        key_node = pattern[position].subject;
        table = c[position].table;
        for each position in a[alias],
            node = pattern[position].object;
            field = c[position].field;
            value = v[node];
            if new[key_node] (if key_node was inserted at Stage 1) or
            if update[node],
                push [field, value]
        if new[key_node],
            insert into table (id, field, ...) values (v[key_node], value, ...);
        else update table set field = value, ... where id = v[key_node].
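
One way to sketch this stage in Python is shown below. It is an
illustration only: property_statements is a hypothetical name, new and
update are modelled as sets of node names, statements are returned as
(sql, parameters) pairs with "?" placeholders instead of being executed,
and an UPDATE with no changed fields is skipped, which the pseudocode
above does not spell out.

```python
def property_statements(pattern, c, aliases, v, new, update):
    """Build one INSERT or UPDATE per alias from the Stage 1 values."""
    a = {}  # alias -> positions
    for position, alias in enumerate(aliases):
        a.setdefault(alias, []).append(position)

    statements = []
    for positions in a.values():
        # All positions of one alias share the subject node and table.
        key_node = pattern[positions[0]][1]
        table = c[positions[0]][0]
        fields, values = [], []
        for position in positions:
            node = pattern[position][2]
            if key_node in new or node in update:
                fields.append(c[position][1])
                values.append(v[node])
        if key_node in new:
            cols = ", ".join(["id"] + fields)
            marks = ", ".join("?" * (len(fields) + 1))
            sql = f"INSERT INTO {table} ({cols}) VALUES ({marks})"
            statements.append((sql, [v[key_node]] + values))
        elif fields:  # assumption: nothing to update means no statement
            sets = ", ".join(f"{f} = ?" for f in fields)
            sql = f"UPDATE {table} SET {sets} WHERE id = ?"
            statements.append((sql, values + [v[key_node]]))
    return statements


pattern = [("dc::title", "?msg", "?title"), ("s::author", "?msg", "?author")]
c = [("Message", "title"), ("Message", "author")]
aliases = ["a", "a"]
v = {"?msg": 42, "?title": "Hello", "?author": 7}

inserted = property_statements(pattern, c, aliases, v, {"?msg"}, set())
updated = property_statements(pattern, c, aliases, v, set(), {"?title"})
```

When ?msg is new, every property of the alias goes into a single INSERT;
when it already exists, only the fields whose object node is in update
appear in the UPDATE.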

