PWanDB

Progress on the Persistent WAN Database (PWanDB)

First of all, my apologies to all contributors to GPU, all volunteers keeping permanodes up, and all users: I have not been working on GPU for over a year now. I needed a break and still do, but I am slowly picking up my hobby projects again. One of the first is the last (sub)project I started, PWanDB.

What is PWanDB

It is a name-value database system that synchronizes among multiple nodes. The data is not scattered among nodes; once fully synchronized, all nodes hold identical copies of the database. Discrepancies only occur when a node updates its local database and the other nodes have not yet synchronized. Using the GPU gnutella layer, we can synchronize data with 99% of the nodes almost instantly. Nodes that are or have been offline get the full dataset after syncing with another node that has it.


  GPU  <==> gnutella network(1) <==> GPU
    ||                              ||
  NODE <==> sync procedure(2) <==> NODE

As this diagram shows, there are two complementary ways for nodes to synchronize. Method 1, the GPU network, propagates changes in real time. Method 2, the node-to-node synchronization procedure, makes both nodes exchange the most recent dataset. This is useful for nodes that have been offline, and/or to correct 'packet loss' in the gnutella network. So, data in a node can be considered up to date as long as GPU is running, but there is no guarantee of accuracy. As a GPU developer, you get used to that, and it's no issue as long as _most_ nodes have _most_ data.
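
A minimal sketch of what the node-to-node procedure (method 2) could look like. The page does not say how "most recent" is decided, so this sketch assumes every entry carries a timestamp and that the newer timestamp wins; the names `merge` and `sync` and the example variable names are made up for illustration.

```python
from typing import Dict, Tuple

# Each entry maps a name to (timestamp, value). The timestamp is an
# assumption of this sketch, not something PWanDB is known to store.
Dataset = Dict[str, Tuple[float, str]]


def merge(local: Dataset, remote: Dataset) -> Dataset:
    """Return a dataset holding the newest entry for every name."""
    merged = dict(local)
    for name, (stamp, value) in remote.items():
        # Take the remote entry if the name is unknown or strictly newer.
        if name not in merged or stamp > merged[name][0]:
            merged[name] = (stamp, value)
    return merged


def sync(node_a: Dataset, node_b: Dataset) -> None:
    """Leave both nodes with the same merged dataset."""
    merged = merge(node_a, node_b)
    for node in (node_a, node_b):
        node.clear()
        node.update(merged)


# Node a missed an update while offline; node b lacks a name only a has.
a = {"gpu/motd": (1.0, "hello"), "gpu/only_a": (2.0, "x")}
b = {"gpu/motd": (5.0, "hello again")}
sync(a, b)
```

After `sync`, both dictionaries are equal and hold the newer `gpu/motd` value. Entries with equal timestamps keep whatever the first node had, which is an arbitrary choice of this sketch.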

The database itself, and its API, most resemble the Windows registry: name-value pairs in a hierarchical name tree structure.
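
To make the registry comparison concrete, here is a small in-memory stand-in. It is not the PWanDB API: the class name `NameTree`, the `get`/`set` methods, and the use of `/` as path separator are all assumptions made for this sketch.

```python
class NameTree:
    """Hypothetical stand-in for a registry-like name-value store."""

    def __init__(self):
        self._values = {}

    @staticmethod
    def _normalize(name):
        # Drop empty segments so "/gpu//chat/" and "gpu/chat" are one name.
        return "/".join(part for part in name.split("/") if part)

    def set(self, name, value):
        """Store a value under a hierarchical name."""
        self._values[self._normalize(name)] = value

    def get(self, name, default=None):
        """Read a value, or return `default` if the name is unknown."""
        return self._values.get(self._normalize(name), default)


db = NameTree()
db.set("gpu/chat/topic", "PWanDB progress")
topic = db.get("/gpu/chat/topic/")
```

The point is only that names form paths in a tree while values stay plain name-value pairs.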

What's new

New ideas

I plan an extension to this database. Next to data, it will also contain code (which is no more than data in another namespace). This will not affect the database or its interface; it is rather a matter of how the data inside the database is used.

A GPU client can be asked to execute this code. You point to a PWanDB variable, state that its content is code, and run a GPU command that launches a script interpreter to execute it. As a result, a developer can write (scripted) code that multiple nodes on the distributed network can execute within seconds after the data is stored.

For the script language to be used there are several options, and they are not mutually exclusive. I suggest simply supporting a number of script languages and seeing what fits best. Candidates are:

My suggestion for this new, to-be-developed language is: break a script up into functions, store each function as a PWanDB variable, and let scripts access those functions and variables without additional effort.


New code

In a recent change (Feb 2008), I altered the database a bit to better support the tree-like structure of the names in the database. This allows a client to browse the names, starting at level 0 and gradually going deeper into a subtree.
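
The browsing described above can be sketched as a function that lists only the names directly below a given parent. This is an illustration, not the actual implementation: the function name `children`, the `/` separator, and the sample names are assumptions.

```python
def children(names, parent=""):
    """Return the sorted child names directly below `parent`."""
    prefix = [part for part in parent.split("/") if part]
    found = set()
    for name in names:
        parts = [part for part in name.split("/") if part]
        # Keep names that lie strictly below the parent path.
        if len(parts) > len(prefix) and parts[:len(prefix)] == prefix:
            found.add(parts[len(prefix)])
    return sorted(found)


names = ["gpu/chat/topic", "gpu/chat/motd", "gpu/jobs/last", "scripts/hello"]
level0 = children(names)            # top of the tree
level1 = children(names, "gpu")     # one step deeper
level2 = children(names, "gpu/chat")
```

A browser would call this once per level as the user opens a subtree, so it never needs the whole name list expanded at once.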

I've created a client-side component that allows applications to communicate very easily with a PWanDB database on localhost. For GPU frontends, plugins, and standalone apps alike, reading or setting a variable in the database takes no more than a single line of code.

I've created a database browser GUI that allows browsing, viewing, and editing of values. I also plan to extend this GUI with a 'run' button, so scripts can be executed straight after editing. This GUI uses the above component to communicate with the db. image:Pwandbbrowserv001.PNG


Brainstorming

I've been investigating the possibilities for integrating a scripting language. The idea is as follows: a programmer edits his code locally, stores it in the distributed database, then sends out a GPU job requesting other nodes to run this code.

Technical requirements for such an interpreter:

1. This code must be strictly sandboxed. No interaction with the OS, filesystem or network is allowed except by a well defined API.

2. The interpreter must be 'lightweight' in the sense of being easy to compile or link; cross-platform support is a big plus.

3. The interpreter must allow integration of custom-defined functions.

4. (Not strictly a requirement, but desired) The script language is based on a well-known language, or at least looks like one, to avoid unnecessary learning curves. Numerous languages have been developed already; in my opinion we should stick to them. Options: basic, javascript, c, c++, python, pascal, sql(...) etc.
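
To illustrate requirements 1 and 3 together, here is a toy evaluator that accepts only arithmetic and calls to functions from an explicit whitelist. It is a sketch of the idea, not a proposal for the actual interpreter: the function name `evaluate` and the `api` dictionary are made up, and it uses Python's `ast` module only as a convenient parser. It does not limit CPU or memory use, so it is not a complete sandbox.

```python
import ast
import operator

# The only binary operators a script may use in this sketch.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(source, api):
    """Evaluate an arithmetic expression; only functions in `api` may be called."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in api and not node.keywords):
            return api[node.func.id](*[walk(arg) for arg in node.args])
        # Anything else (names, attributes, imports, strings) is rejected.
        raise ValueError("not allowed in sandbox: " + type(node).__name__)

    return walk(ast.parse(source, mode="eval"))


# The well-defined API: one custom function exposed to scripts.
api = {"square": lambda x: x * x}
result = evaluate("square(3) + 2 * 4", api)
```

Because the evaluator walks the parsed tree itself and never hands the text to the host language, a script calling `open('x')` is rejected with a `ValueError` instead of touching the filesystem.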

Discussing each requirement

1. The reason for sandboxing is obvious: security. We don't want a distributed script engine to compromise computers. All a script is allowed to do is use (some) CPU and memory resources. This requirement eliminates a number of possibilities, like the Windows Script Host, Python, and a number of free and commercial script interpreters I found. Most tend to allow file I/O, or worse. There are two candidates though: pascalscript and janscript. The latter has the disadvantage of not conforming to any existing language, but the speed and functionality we seek are present. Pascalscript has the functionality too. Therefore, I tend to choose pascalscript. Our issue here is the non-OSI license: it is open source, but under a custom license. Todo: contact authors.

2. Besides the arguments in #1, this rules out the Windows Script Host and a number of JavaScript implementations, which tend to be written in Java.

3. Both janscript and pascalscript offer support for custom functions. However, a newly written language would be able to integrate some of this functionality into the language itself.

4. A newly developed script interpreter might or might not ignore this requirement.

Functional requirements and wishes:

  • Easy integration with variables and code/functions in the pwandb.
  • Local node variable store
  • Local node SQL store
  • Debug output
  • Console output, both to localhost and distributed.

Concept: Viewport

In order for code to present output to the user, I suggest the concept of a 'viewport'. A viewport can be seen as console output: textual, graphical, or a mix of both, and local _as well as_ distributed.

  script-on-random-client:console-data ==> network ==> console-viewer-on-random-client.

It is too early for technical details, but basically there are two approaches:

1. There might be one node managing a single viewport: all scripts forward 'console' data to this node, and all 'console viewers' contact this node. Drawbacks: single point of failure; potentially high network load for a single host. Advantage: least network load overall.

2. Another way is to broadcast all viewport data on the GPU network. Drawbacks: memory use on each node to cache the content of all (active) viewports in use; high network load, which is incurred even if the user is not interested in the viewport. Advantages: distributed, no single point of failure.
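
A rough back-of-envelope comparison of the two approaches. Everything here is an assumption made for illustration: the function names, the idea of counting one message per delivered console line, and the example numbers. Real gnutella flooding also delivers duplicates, so the broadcast figure is a lower bound.

```python
def central_messages(lines, viewers):
    """Approach 1: each line goes to the managing node, then to every viewer."""
    return lines * (1 + viewers)


def broadcast_messages(lines, nodes):
    """Approach 2: each line must reach every other node at least once."""
    return lines * (nodes - 1)


# Hypothetical numbers: 100 console lines, 3 interested viewers, 1000 nodes.
central = central_messages(100, 3)
broadcast = broadcast_messages(100, 1000)
```

With these made-up numbers the central approach moves 400 messages, all through one host, while the broadcast approach moves at least 99900 spread over the whole network. That matches the trade-off listed above: least total load versus no single point of failure.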


I'll try to write down my ideas on viewports in a more structured manner. Meanwhile, if you understand the concept, please help me brainstorm on it.

Reuse. There are a number of elements in GPU that are already strong candidates to act as viewports. The GPU chat channels are well suited for both log and console data. The GPU whiteboard could act as graphical output. Both already exist and are implemented using the broadcast approach (#2 above). Integrating them into scripts should require the least work.


Update

Whilst Pascal seemed such a nice solution, the pascal script interpreter turned out to be anything but *Bullet Proof*. One of my experiments made the compiler crash with an AV (access violation; I checked, and it appeared to be a double-free issue). Now, I have no intention at all of debugging a Pascal interpreter, and we do have a strict need for a _SAFE AND SANDBOXED_ environment. So bye bye pascal script.

Another thought is the concept of events, somewhat more extended than GPU jobs. Chat line events, but also GPU job returns etc., could generate an event that triggers a script to process such data. Work in progress.
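
Since this is still work in progress, the following is only a sketch of how events could trigger scripts. The class name `EventBus`, its methods, and the event names `chat_line` and `job_result` are hypothetical; plain Python functions stand in for scripts.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical dispatcher that runs handlers when an event occurs."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        """Register a handler (standing in for a script) for an event type."""
        self._handlers[event_type].append(handler)

    def emit(self, event_type, payload):
        """Run every handler for this event type and collect the results."""
        return [handler(payload) for handler in self._handlers.get(event_type, [])]


bus = EventBus()
bus.subscribe("chat_line", lambda text: text.upper())
replies = bus.emit("chat_line", "hello gpu")
unhandled = bus.emit("job_result", {"job": 1})
```

An event type without subscribers simply produces an empty result list, so nodes that have no script registered for an event do no extra work.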

Continued thoughts on this in Brainstorm