Project

General

Profile

Actions

Feature #67

open

RR Data Storage Topology of Concern to Rules Data Published by RM, and Rules Data Received by RT

Added by Joseph Potvin almost 2 years ago. Updated almost 2 years ago.

Status:
Feedback
Priority:
High
Assignee:
Start date:
03/10/2022
Due date:
% Done:

0%

Estimated time:

Description

The DWDS 1.0.0 specification (Section 6.2) describes the data storage topology of RuleReserve service as follows:

The RuleReserve network performs three functions: storage, sifting, and messaging:
• Distributed storage of [rule.dwd] records on a [rulereserve.dwd] n x m table, one record per row, maintained online via the decentralized IPFS (Benet, 2014);
• Efficient sifting to reduce [rulereserve.dwd] to a set of [rule.dwd] rows that are deemed by their authors to be ‘in effect’ and ‘applicable’;
• High-speed on-demand messaging that receives [is.dwd] requests, and sends [ought1.dwd] responses;
...
All [rule.dwd] and [lookup.dwd] data is stored and addressed on the participating RuleReserve nodes across a distributed and deliberately redundant [m x n] matrix (i.e. m rows x n columns), referred to as [rulereserve.dwd]. The data of each indexed row is arranged like a long telex tape on which every [rule.dwd] and [lookup.dwd] record is splayed out horizontally. The sifting process may seem to be an enormous task, but it is done in massive parallel fashion across the large decentralized [rulereserve.dwd] array that is distributed on the IPFS network. The [is.dwd] message is pre-configured to function as a [sieve1.dwd] upon the [rulereserve.dwd] collection.

Further, Section 5.4.6.4 describes the "Horizontal Tape Topology":

The DWDS uses a horizontal tape topology for storage and retrieval with a large distributed table of [rule.dwd] and [lookup.dwd] records. This is essentially a virtual telex tape variant of Alan Turing’s horizontally-splayed table, which he first described as follows:
“Computing is normally done by writing certain symbols on paper. We may suppose this paper is divided into squares like a child's arithmetic book. In elementary arithmetic the two-dimensional character of the paper is sometimes used. But such a use is always avoidable, and I think that it will be agreed that the two-dimensional character of paper is no essential of computation. I assume then that the computation is carried out on one-dimensional paper, i.e. on a tape divided into squares. I shall also suppose that the number of symbols which may be printed is finite. (Turing, 1937, p. 249)”
Current-generation ‘wide-column’ data fabrics employ this topology for distributed storage and querying of structured data (Chang et al., 2008) (Carpenter & Hewitt, 2016) (Mahgoub et al., 2017). These are virtual manifestations of an unlimited number of physical Turing tapes laid out side-to-side. ... This is not easy to illustrate, but that’s to be expected because this topology is optimized for machine storage efficiency and processing speed. ...
This Horizontal Tape Topology enables compact storage, fast discovery and high-volume processing. It is optimized for machines, while the other two topologies discussed above trade off machine performance for human readability. Information was defined earlier as data with context. The [rule.dwd] and [lookup.dwd] records are stored in a [rulereserve.dwd] table, across a decentralized, distributed network of RuleReserve (RR) nodes, one row per record. An [is.dwd] message is also pre-formatted horizontally, so that it can be used directly as a sieve, with no need for reconfiguration during run-time.
Signals representing elements of a context can be recognized though primitive signal-matching. Data sifting can proceed immediately upon arrival of each [is.dwd] record. An [is.dwd] message contains (GIVEN) context data, which is used by an RR node to sift out all the rows from a consolidated [m x n] matrix of DWDS structured data which contain the signals of rules that are ‘in effect’ for that jurisdiction, time/date, and identity. Once sifting for context is completed, the (WHEN) particulars data from the same [is.dwd] message is used to sift for rows that contain signals that are ‘applicable’.

Efficiently placing any number of logic gates of diverse dimensions into the single wide-column RR database requires that the column headers have a generic and automatically-extensible naming structure. Shown below are suggested 'input condition' and 'output assertion' segments from a draft spreadsheet mock-up https://docs.google.com/spreadsheets/d/1V76YYPWhiNy71pvgQsdJP5yaZoFArx_3vSSlTuCasOg/edit#gid=916651765 ( See sheet entitled: Database-ALL-Examples ).

Sample section of "conditions" columns for logic gate data in the RR table (using the letter 'c'):

c.3.parameter   c.3.value   c3.notes                        c.3.1   c.3.2   c.3.3   c.3.4   c.3.5   c.3.6   ... 
buyer_profile   Sgc-n       Singapore Citizen, Addtl Prpty  00      00      00      00      00      00

Sample section of "assertions" columns for logic gate data in the RR table (using the letter 'a'):

a.3.parameter   a.3.value   a.3.notes                       a.3.1   a.3.2   a.3.3   a.3.4   a.3.5   a.3.6 ...
absd_rate       0.1*(price) Price from purchase order       00      00      01      00      00      00
Actions #1

Updated by Joseph Potvin almost 2 years ago

  • Assignee changed from Joseph Potvin to Ted Kim

Here are DRAFT details on the RR storage of logic gate data in a wide-column db or even a flat file, into which RM would be committing rules-as-data.

Note that the stored data should not contain any JSON syntax -- for re-constitution from the db or file back into JSON, the syntax notation can be automatically applied.

I do not know the correct 'data science' terminology for the structure or pattern shown here -- pls advise if this structure has a name.

~joseph

Actions #2

Updated by Ted Kim almost 2 years ago

  • Assignee changed from Ted Kim to Joseph Potvin

If you suspect that 00 is the default value for most of these cells, I think treating the data as a sparse matrix would be a good optimization strategy to consider.
Consider the below example of using a triple representation of a sparse matrix for your logic to look up the data from.

E.g.

Imagine you have a 5x5 logic table that looks like this (for simplicity i didn't write leading 0s):

A B C D E
10 0 0 0 0
0 10 0 0 0
10 0 0 0 11
1 0 0 0 0
0 0 0 1 0

for just 6 cells of data, you have 24 other cells that are 0 values. For about 12 bytes of data you're using 48 bytes in space.
Imagine now you have 10 rows of the full 256 logic columns for just 20 non-zero values. For 40 bytes of non-zero values, you're using 5000+ bytes.

Using a triplet representation of the above table, the same information can be represented like this:

Row Column Value
0 0 10
1 1 10
2 0 10
2 4 11
3 0 1
4 3 1

6 x 3 x 2 bytes = 36 bytes, down from 48 bytes representing all the 0 values like above. The saving is of course much greater if you consider that the first table has 251 additional columns beyond column 'E' with 0 values.

Does this make sense and do you think this is appropriate for your use case?

Actions #3

Updated by Ted Kim almost 2 years ago

for just 6 cells of data, you have 24 other cells that are 0 values. For about 12 bytes of data you're using 48 bytes in space.
I just noticed it's 5x5 which would mean 19 cells of 0 values (25 total cells). 5x5x2 bytes - 6x2bytes = 50-12 = 38 bytes.

Actions #4

Updated by Joseph Potvin almost 2 years ago

  • Assignee changed from Joseph Potvin to Ted Kim

RE: "If you suspect that 00 is the default value for most of these cells"

I would not make such an assumption, but we could make 00 a design default.

RE: "(for simplicity i didn't write leading 0s)"

Thought I'm not sticky on the matter, my preference is to retain the leading zeros to reduce uncertainty that a 0 should really have been a 01 or a 10, but suffered a typo. That's to say, '00' more definitively asserts that the value is NOT 01 and NOT 10.

RE: "Using a triplet representation of the above table, the same information can be represented like this"

Hm, good observation. I'd be fine with that IFF this triplet data structure can be place onto a Turing tape-like one-rule-per-row which meets the condition (or equivalent): "any number of logic gates of diverse dimensions into the single wide-column RR database requires that the column headers have a generic and automatically-extensible naming structure".

This one-rule-per-row is needed for the RR sifting operation (Sections 6.2 & 6.3 of the dissertation).

This something that Don, Bill, Nhamo and Wayne would have optimization vs trade-off opinions about.

Actions #5

Updated by Ted Kim almost 2 years ago

  • Assignee changed from Ted Kim to Joseph Potvin

In that case why dont you imagine those are just binary representations of integers?
i.e. 00 = 0, 01 = 1, 10 = 2, 11 = 3

For your consideration, triple representation of a matrix is really only useful if you suspect that a proportionately small non-zero values are in the table vs zero values. In the case of a 256 column table, a full representation beats out triple representation when you have 86 or more columns of non-zero values per row.

Actions #6

Updated by Joseph Potvin almost 2 years ago

  • Assignee changed from Joseph Potvin to Ted Kim

RE: "just binary representations of integers"

In the RR distributed data store, I'd go with whichever is faster to sift. From a position of maximum ignorance* I assumed {00,01,10,11} would be faster to sift than {0,1,2,3}. That may be wrong -- but what I have in mind is parsing latency in high volume data processing scenarios.


Actions #7

Updated by Ted Kim almost 2 years ago

  • Assignee changed from Ted Kim to Joseph Potvin

I'm all out of brain power for today..sorry I can't quite understand how the maximum ignorance principle fits in this context.

In general speed is down to size. The difference in the reading speed of "00" (string type) vs 0 (int type) is quite negligible IMO and not why I would suggest "0,1,2,3", although theoretically int type should be faster. I'm merely mentioning it from a database perspective since it gives enhanced clarity (greater probability to make a typo using combination of 2 characters over 4 distinct digits) with an added feature of easy constraints that can be applied at the database layer (minvalue=0 maxvalue=3). Not recommending for any specific implementation but I think it'd be useful to at least think about.

Actions #8

Updated by Joseph Potvin almost 2 years ago

RE: " how the maximum ignorance principle fits in this context"

Apologies -- that was just to show that "maximum ignorance" is a thing, and I was just saying that I didn't really have a clue about this point. ;-)

RE: "theoretically int type should be faster"

It would be interesting to bench test.

RE: " enhanced clarity (greater probability to make a typo using combination of 2 characters over 4 distinct digits)"

Agreed.

RE: "easy constraints that can be applied at the database layer (minvalue=0 maxvalue=3)"

My intention was to place processing (sifting) speed as the top priority -- on that metric maybe I am wrong. Maybe that is not the correct priority: perhaps data storage density should be the top priority.

Actions #9

Updated by Joseph Potvin almost 2 years ago

  • Assignee changed from Joseph Potvin to Ted Kim

Re-assigning...

Actions #10

Updated by Ted Kim almost 2 years ago

  • Status changed from New to Feedback

Based on the call earlier this evening I'm understanding the context a little better. I had thought that all the rows of rule+logic will coexist in a single table and you would have to perform a query on the table directly. If you've got a distributed set of tables of rules and you only look at a few rules at a time this doesn't seem critical to focus at the moment as it works fine with the way things already are.
Let's close the thread for now and revisit if we need to reconsider another degree of scaling in the future.

Actions

Also available in: Atom PDF