Bug #358
open
Use normative terms instead of binary mapping in DWD pipe-separated text files.
0%
Description
Currently, in pipe-separated text files produced by rulemaker v4, the normative output assertions "MUST", "MUST NOT", "MAY", "SHOULD" are encoded as strings "00", "01", "10", "11".
This offers no meaningful machine-processing benefit (a 2-byte string instead of 3-6 bytes) but hinders human readability, which is a major design goal of the file format.
I propose using those keywords directly to represent themselves, whether in user interfaces, file formats, or implementation code, to avoid unnecessary conversion steps.
Or to simplify/shorten:
- MUST
- MAY
- NOT (for "MUST NOT")
- SHOULD
Files
Updated by Joseph Potvin 4 months ago
- File Saudi Arabian Monetary Authority_CyberSecurityFramework_DEMO+ARABIC_rule_aa8e54c7-fedd-49dd-ab44-ff3fa68fbf5d.txt added
- File Saudi Arabian Monetary Authority_CyberSecurityFramework_DEMO+ARABIC_ruleGRAPHICAL_aa8e54c7-fedd-49dd-ab44-ff3fa68fbf5d.png added
- Assignee set to Charles Langlois
There are several reasons for DWDS's use of 00, 01, 10, 11. The functionally critical reason (the fourth below) is the most important, but the UI rationales are 'reason enough' for this design decision.
- The data model is consistent with any natural language (sample rule record in Arabic attached);
- The GUI icons are easy to represent consistently (same sample rule's RM GUI attached);
- Consistency with general literature on four-value mathematical logic, thus facilitating re-purposing of DWD for entirely different use cases while maintaining data consistency (thus reducing potential forking);
- DWDS is designed for Internet-context scalability and high-frequency use cases where nanoseconds matter.
https://www.cs.uaf.edu/courses/cs441/notes/network-performance/index.html
https://www.reddit.com/r/todayilearned/comments/2wccwf/til_high_frequency_traders_pay_up_to_6000month_to/
The following detail on reason 4 was generated with Perplexity.ai:
At 100 million messages per second, shaving 1 byte off every message saves 100 MB/s of bandwidth, which is 0.8 Gbit/s; shaving 4 bytes saves 400 MB/s, or 3.2 Gbit/s.
Messages per second: 100,000,000 msg/s.
- 1 byte saved per message: 100,000,000 bytes/s = 100 MB/s = 0.8 Gbit/s.
- 2 bytes saved per message: 200 MB/s = 1.6 Gbit/s.
- 4 bytes saved per message: 400 MB/s = 3.2 Gbit/s.

Bytes saved/msg | Bytes/s saved | MB/s saved | Gbit/s saved
1 | 100,000,000 | 100 | 0.8
2 | 200,000,000 | 200 | 1.6
4 | 400,000,000 | 400 | 3.2

On a 10 Gbit/s link, 0.8–3.2 Gbit/s is a substantial fraction of capacity, especially once you include TCP/IP and Ethernet framing overhead.
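The arithmetic above can be checked with a minimal Python sketch. It assumes decimal units (1 MB = 10^6 bytes, 1 Gbit = 10^9 bits), and the function name `bandwidth_saved` is purely illustrative, not part of any DWDS tooling:

```python
def bandwidth_saved(msgs_per_sec: int, bytes_saved: int) -> tuple[int, float, float]:
    """Return (bytes/s, MB/s, Gbit/s) saved, using decimal units."""
    bytes_per_sec = msgs_per_sec * bytes_saved
    return bytes_per_sec, bytes_per_sec / 1e6, bytes_per_sec * 8 / 1e9


RATE = 100_000_000   # messages per second, as in the figures above
LINK_GBIT = 10.0     # link capacity used for comparison

for saved in (1, 2, 4):
    b, mb, gbit = bandwidth_saved(RATE, saved)
    share = gbit / LINK_GBIT
    print(f"{saved} byte(s)/msg: {b:,} B/s = {mb:.0f} MB/s = {gbit:.1f} Gbit/s "
          f"({share:.0%} of a {LINK_GBIT:.0f} Gbit/s link)")
```

The same function can be reused with a lower message rate to see how quickly the saving shrinks when throughput is not in the hundred-million range.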
Here are concrete scenarios where 1 byte per message at 100M msg/s can be material:
- Multi-venue market data fan-out: Some market data distribution platforms demonstrate 100M updates per second with outgoing bandwidth around 149 Gbit/s, aiming for sub-10 ms end-to-end latency at the 99th percentile. In such an environment, even a 0.8 Gbit/s reduction from shaving 1 byte per update can be the difference between staying under physical link limits or needing additional 100G ports or nodes.
- Horizontal scaling thresholds: High-volume equity markets can produce feed rates that burst into “millions of messages per second,” forcing designers to add more feed-handler cores or servers once queues grow and latency degrades. If each message is slightly smaller, more messages fit into a given NIC and CPU cache footprint, which delays the point at which an additional core or another server is required, and can avoid queue buildup during bursts.
- Ultra-low-latency co-location links: HFT shops pay significant premiums for ultra-low-latency connectivity and co-located infrastructure, where microseconds and link utilization are both tightly engineered. If you are running near the usable capacity of a 10G or 25G link with very small messages, a 0.8 Gbit/s reduction can provide enough headroom to reduce retransmissions, avoid buffering in network devices, and keep latency tails within microsecond-level service targets.
- Massive real-time pub/sub clusters: Pub/sub platforms used for financial tick distribution have reported configurations capable of delivering 100 million updates per second to subscribers with all updates in order and under 10 ms at p99 latency. At that envelope, any per-message overhead, including a single byte, directly inflates required aggregate bandwidth and can mandate extra network fabric or more nodes; trimming a byte can lower the cost or complexity of the cluster while keeping the same latency SLOs.
In all of these cases, the key is that messages are very small, rates are in the multi-million to hundred-million messages per second range, and systems are close enough to bandwidth or cache limits that a sub-gigabit saving is operationally meaningful rather than merely cosmetic.
Updated by Charles Langlois 4 months ago
See my email response, but I'll add another concern.
Both the input conditions and the output assertions use a four-valued logic, but with different meanings.
An input condition can be "true/present", "false/absent", "maybe/unknown", "both/contradiction". This is another set of four possible values that are distinct in meaning from the MUST/MAY/NOT/SHOULD of the output assertions.
This makes it even more important to avoid an abstract representation that may carry different meanings depending on where it is used. This is likely to be a source of confusion and errors for users and auditors.
For the sake of auditability and reliability, I strongly recommend explicit self-describing values over abstract encodings.
In any case, any implementation that has to deal with that format will have to parse those values based on context into a semantically relevant representation. The code should be designed so as to expose as much meaning as possible for the intended behavior to be clear and verifiable. Whether that concern remains in the code or is extended to the file format is up to the designers.
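To illustrate the point about context-dependent parsing, here is a minimal Python sketch. The code-to-value assignments are an assumption (values are simply numbered in the order they are listed in this thread; the actual rulemaker mapping may differ), and `parse_code` and the `"input"`/`"output"` column kinds are hypothetical names:

```python
from enum import Enum


class InputCondition(Enum):
    TRUE = "true/present"
    FALSE = "false/absent"
    UNKNOWN = "maybe/unknown"
    CONTRADICTION = "both/contradiction"


class OutputAssertion(Enum):
    MUST = "MUST"
    MUST_NOT = "MUST NOT"
    MAY = "MAY"
    SHOULD = "SHOULD"


# ASSUMPTION: illustrative assignments only, not the real rulemaker mapping.
INPUT_CODES = {
    "00": InputCondition.TRUE,
    "01": InputCondition.FALSE,
    "10": InputCondition.UNKNOWN,
    "11": InputCondition.CONTRADICTION,
}
OUTPUT_CODES = {
    "00": OutputAssertion.MUST,
    "01": OutputAssertion.MUST_NOT,
    "10": OutputAssertion.MAY,
    "11": OutputAssertion.SHOULD,
}


def parse_code(code: str, column_kind: str):
    """Decode a two-character code; its meaning depends on the column kind."""
    tables = {"input": INPUT_CODES, "output": OUTPUT_CODES}
    if column_kind not in tables:
        raise ValueError(f"unknown column kind: {column_kind!r}")
    if code not in tables[column_kind]:
        raise ValueError(f"unknown code: {code!r}")
    return tables[column_kind][code]


# The same code decodes to unrelated meanings depending on context.
print(parse_code("01", "input"))    # InputCondition.FALSE
print(parse_code("01", "output"))   # OutputAssertion.MUST_NOT

# A self-describing keyword needs no lookup table and no context.
print(OutputAssertion("MUST NOT"))  # OutputAssertion.MUST_NOT
```

With abstract codes, the reader (human or program) must know which table applies before a value means anything; with keywords, the value alone is enough.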
Updated by Charles Langlois 4 months ago
My original email response:
- "The data model is consistent with any natural language (sample rule record in Arabic attached)"
The rest of the file format uses English and Latin characters (e.g. for metadata keys and headers). The rest of the world is used to English being the technical lingua franca. Binary is not a technical lingua franca for humans. The rules themselves can be expressed in any language (that's not part of the file format specification, as I understand it), but a file format has to agree on a standard representation and encoding for shared concepts.
- "The GUI icons are easy to represent consistently (same sample rule's RM GUI attached)"
An icon is an icon, not data, and not relevant to the file format. The UI can use whatever compact visual representation it needs for its specific concerns of visual space, but it is devoid of meaning unless a legend is present (or assumed available somewhere) to map it to meaningful concepts.
A UI should not dictate the fundamentals of the system, especially not in terms of "effort" required. The file format must endure independently of the UI. The data in the file format does not depend on the UI, but on a shared understanding of that data, and according to your own requirements it should be independently auditable in its raw form, without special software interfaces.
The overall usability and understandability of the system for users goes beyond the visuals of a specific part of the system. The UI will evolve, and should be expected to evolve more flexibly (and be less durable) than the file format.
Since we care about the human auditability of the file format, the concerns of the file format should be its own.
- "Consistency with general literature on four-value mathematical logic, thus facilitating re-purposing of DWD for entirely different use cases while maintaining data consistency (thus, reducing potential forking)"
I'm somewhat more sensitive to this kind of argument as a programmer, and seeking generality is something I empathise with, but I'm concerned about reaching for such a high level of universality at the expense of current practical concerns.
This risks rendering the core design too abstract to be approachable for people who care about a concrete use of the system today.
Note that it is certainly possible to evolve the format to support new equivalent values and alternative (but functionally equivalent) representations if new use cases would prefer them. I would, though, challenge you to present a real use case, material to the project and not purely speculative, that would not understand that four-valued logic in the terms in which the rest of the system design is presented everywhere else.
- "DWDS is designed for Internet-context scalability and high-frequency use cases where nanoseconds matter."
I don't believe that's relevant for the current design.
If you want the format to be maximally efficient so as to support use cases with such high throughput requirements, I can advise on other existing design choices that go against those requirements.
First and foremost, a text format oriented towards human auditability is not optimally designed for such high throughput. A binary encoding which does away with textual representation of structure would likely be required (or very beneficial) for that purpose. Instead of a two-byte string, a real optimization would use a 1-byte encoding (256 possible values). Indeed, a single-character encoding would also be possible and marginally more useful to human understanding: "y", "n", "m", "b" (yes, no, maybe, both) and "s", "m", "n", "t" (should, may, not, must). Avoiding any reuse of values would require choosing two distinct sets of arbitrary characters, the principle remaining the same.
I believe the "INDEX" and other header rows are irrelevant for machine processing (only useful when displayed in a spreadsheet arrangement for human audits) and would need to go away in that scenario as wasteful.
I think this argument is putting the cart before the horse. One should design for actual needs and current use cases before designing for all possible use cases. At worst, a different file format will be needed to correctly address some use case once that use case actually arises. This is a technical aspect of the system (file format representation), not a fundamental design decision being made forever, though there is a cost tradeoff between understandability for current use cases and cost of change for future potential use cases. Design priorities need to be clarified for useful implementations to result. From my point of view, the design considerations or arguments for current design choices are contradictory.
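As a rough illustration of the single-character alternative mentioned above, the following Python sketch compares field sizes and shows which letters would be reused across the two value sets. The letter sets are the ones suggested in this comment and are illustrative only, not a specification; `encoded_size` is a hypothetical helper:

```python
# Letters taken from the comment above; illustrative, not a specification.
INPUT_LETTERS = {"y": "yes", "n": "no", "m": "maybe", "b": "both"}
OUTPUT_LETTERS = {"s": "should", "m": "may", "n": "not", "t": "must"}

# Letters that appear in both sets, i.e. whose meaning depends on context.
shared = sorted(INPUT_LETTERS.keys() & OUTPUT_LETTERS.keys())


def encoded_size(value: str) -> int:
    """Size in bytes of a field value when written as ASCII text."""
    return len(value.encode("ascii"))


sizes = {
    "two-character code": encoded_size("00"),
    "single letter": encoded_size("t"),
    "keywords": sorted(encoded_size(k) for k in ("MUST", "MAY", "NOT", "SHOULD")),
}

print(shared)  # ['m', 'n']
print(sizes)   # {'two-character code': 2, 'single letter': 1, 'keywords': [3, 3, 4, 6]}
```

A single letter is smaller than the current two-character code, so byte count alone does not favour "00"-style codes over a more readable alternative.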
Updated by Joseph Potvin 4 months ago
All useful reflections for discussion with several people (incl. Nhamo Mtetwa, Bill Olders, Wayne Cunneyworth, probably also Dave Thomas).
Updated by Joseph Potvin 4 months ago
- File DNA-Based_Molecular_Computing_Storage_and_Communications_Liu-Yang-Xie-Sun-2022.pdf added
- "the rest of the world is used to English being the technical lingua franca. binary is not a technical lingua franca for humans" & "but a file format has to agree on a standard representation and encoding for shared concept"
Not sure why the {00,01,10,11} is a problem. It's equivalent to using {0,1} which lets implementers interpret that as {F,T} or {N,Y}. There are numerous linguistic reasons for the {00,01,10,11} choice which are deeper than mere preference, and certainly not pseudo-tech.
- "an icon is an icon"
I think you under-estimate this. For example, here is Calvin's working document -- not the final after user testing, but it illustrates the sort of effort: https://communications.xalgorithms.org/rule-maker/logic-icons/
- "I'm concerned about reaching for such high level of universality at the expense of current practical concerns. This risks rendering the core design too abstract to be approachable for people who care about a concrete use of the system today."
For me, universality is paramount, and it has been instrumental to the success of several free/libre projects I have led in the past. When a free/libre application is designed for a general class of problem, it can outlive the problem. I have always designed for the "meso" layer. This does not prevent anyone from forking for a "micro" scenario (what you refer to as "concrete/today"). But I have conversations with people about using DWDS for (a) legislated rules (e.g. trade); (b) machine-to-machine control systems (e.g. routers); and (c) gene-based programming ("Quaternary nucleobase coding is the most efficient approach, since each kind of nucleobase carries specific information. For an instance, the data bits “00,” “01,” “10,” “11” could be encoded as “A,” “G,” “C,” “T,” respectively." Source: https://ieeexplore.ieee.org/document/9440535, copy attached). It is possible that my own motivating use case (monetary systems design) may not use the default RuleMaker semantics.
... rather than changing from {00,01,10,11}, it is more likely that RuleMaker will eventually make its six grammatical elements more abstract, say, by adopting (or minimally extending) the standard first-order predicate logic (FOL) symbol set, as our proposed symbols are already a close subset.
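For readers unfamiliar with the quaternary coding quoted above, a minimal Python sketch of that bit-pair-to-nucleobase mapping follows. Only the 00/01/10/11 to A/G/C/T table comes from the quoted source; the `encode` and `decode` helpers are hypothetical and ignore any practical constraints of real DNA storage:

```python
# Mapping as quoted from the cited article: 00->A, 01->G, 10->C, 11->T.
BITS_TO_BASE = {"00": "A", "01": "G", "10": "C", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Encode bytes as a nucleobase string, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(bases: str) -> bytes:
    """Inverse of encode(); expects a multiple of four bases."""
    bits = "".join(BASE_TO_BITS[base] for base in bases)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


print(encode(b"A"))            # GAAG  (0x41 = 01 00 00 01)
print(decode(encode(b"DWD")))  # b'DWD'
```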
- "a text format oriented towards human auditability is not optimally designed for such high throughput"
However, we have been trying to optimize among many design objectives.
RE: "I believe the "INDEX" and other header rows are irrelevant for machine processing (only useful when displayed in a spreadsheet arrangement for human audits) and would need to go away in that scenario as wasteful."
That is also what I explored with Nhamo and Don, but after thorough consideration we decided to retain it, for both processing and audit reasons. We can discuss on a call together with Don and Nhamo.
RE: "One should design for actual needs and current use cases before designing for all possible use cases."
Opposite to my core design instincts. When there's need for a strong cylinder with a grippable handle on one end and a strong flat part on the other, then Screwdriver = Paint Can Opener.