Bi Framework

%matplotlib inline

import sys
sys.path.append('../lib')

After this overview, the next step is to dive into the basic theoretical aspects before we start implementing anything.

I said that Bi is based on large distributed vectors. This layer sits just below the Symbolic level, which is why it is dubbed the SUB-SYMBOLIC level.

Most programming languages are based on discrete alphabetic symbols. For Bi, at the moment, that is the second layer, where most of the implementation resides. In the future I want to move as much as possible down to the SUB-SYMBOLIC layer.

REPRESENTATION

I call my atomic structure a Semantic Distributed Pointer (SDP), a 10_000-bit binary vector, to differentiate it from the Nengo/NEF SP. (Still, they resemble SPs in many ways.)

Rather than just calling them HyperVectors (as per Kanerva), I use the name SDP because it more closely describes their use in Bi. But no worry, both are OK.

SDP should also be distinguished from Numenta's binary Sparse Distributed Representation (SDR), which is normally 2048 bits and sparse; SDP is not sparse.

Plus, I reserve the right for SDP to some day also be real-valued, if the need arises ( search : HRR - Holographic Reduced Representations, for more info ).

Local vs Distributed

Information in computers is stored locally, that is, in records with fields. For example, let's say we want to encode the colors : red, violet, orange and green. If we use a local representation we have to use 4 bits, one for each ‘feature’.

In a Local representation one unit is used per concept. This is also common in neural nets.

The alternative is to distribute information from many sources over shared units. Thus, in our previous color example, we could use 2 or 3 bits to represent those colors, using multiple overlapping bits to represent each ‘feature’.
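Here is a toy illustration (plain Python, not part of bi) of the two encodings for those four colors :

# local : one dedicated bit per color
local = {'red': '1000', 'violet': '0100', 'orange': '0010', 'green': '0001'}

# distributed : 3 shared bits, each color encoded as an overlapping pattern
distributed = {'red': '110', 'violet': '101', 'orange': '011', 'green': '010'}

# in the local scheme no two colors share a set bit; in the distributed scheme
# 'red' and 'violet' overlap in one bit, which already hints at graded similarity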

Dense vs Sparse

A representation can also be Dense or Sparse.

For example, Numenta's Grok SDR uses a 2048-bit binary, 2% sparse representation, i.e. 2% of the bits are ONE and the rest are ZERO.

SDP, on the other hand, is Dense : a 10_000-bit binary vector in which roughly 50% of the bits are ONE.

SDPs are sparse in a different sense : they sparsely occupy the space available to them, as you will see in a minute.

Similarity

The other feature of utmost importance is the similarity between vectors. Real-valued vectors normally employ geometric measures like Euclidean distance. In our case we are interested in binary vectors, and here again SDR and SDP differ.

SDR uses “Overlap”, i.e. count_ones(A AND B).

SDP uses “Hamming distance”, i.e. count_ones(A XOR B).
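Here is a minimal sketch of the two measures over plain 0/1 lists (just for illustration, not part of bi) :

def overlap(a, b):     # SDR-style similarity : count of shared ONE bits
    return sum(x & y for x, y in zip(a, b))

def hamming(a, b):     # SDP-style distance : count of differing bits
    return sum(x ^ y for x, y in zip(a, b))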

Kanerva : BSC (Binary Spatter Code)

In Bi, at the base of it all are binary hyper-vectors, or in Bi lingo, Semantic Distributed Pointers (SDP). In the current implementation an SDP hyper-vector is a 10_000-bit binary vector (the size can be changed easily, but according to Kanerva that is the optimal size, and who am I to opine).

Why binary ?

I’ve been pursuing a binary approach to machine intelligence (I make a distinction between MI and AI) for a long time, because it is a no-brainer. It simplifies implementation, and when people finally awake from the “real-valued” hegemony ;), we will see tremendous CPU acceleration in MI programs. Can you imagine there are not even CPU-optimized software libraries for large bitarrays in almost any computer language !!! I’ve checked ;(. BTW, FYI, computers are machines built on binary !! go figure …

Why so big ?

The reason is that at those sizes we get the required characteristics for doing clever stuff.

The main characteristic of large vector spaces is ORTHOGONALITY. (Very important). (If you don’t know what orthogonal means, think “perpendicular” in 2D space).

If we randomly create a 10_000-bit vector, where every bit has a 50/50 chance of being 0 or 1, we are guaranteed that every time we do that the new vector will be nearly orthogonal to all the previously created vectors. Wow, isn’t that amazing ? ( Kanerva mentioned somewhere that this holds even if you create millions of vectors )

(Or we can talk about normally distributed real-valued vectors, with 100-300 dimensions : HRR).

This is HUGE, think about that for a minute ?!!

Every time we create a new SDP, it is guaranteed to be far away from any previously created one. That is perfect for defining new terms (data atoms, symbols), to which we can attach new meaning.
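Here is a quick plain-Python check of that claim : freshly generated random vectors land ~5000 bits from one another (the framework does the same thing via sdp.rand() and sdp.dist(), on packed bitarrays).

import random

N = 10000
vecs = [[random.getrandbits(1) for _ in range(N)] for _ in range(5)]
for i in range(1, 5):
    print sum(a ^ b for a, b in zip(vecs[0], vecs[i]))   # each one is ~5000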

Let's do the following thought experiment …

Imagine a sphere with you standing at the north pole, and assign that point the meaning of the word “car”. Points closer to the north pole will be more similar to “car”. F.e. “sports car”, “mazda”, “blue car”, “car with 3 wheels”, “bus”, rather than concepts like “chair”, “sky”, “grass” or “computer”, which will be further away.

The majority of the points on the sphere will lie near the equator, i.e. all the car-unrelated terms. In fact the points around the equator are 99.XX% of all the points; that is why they are so orthogonal. But not to worry : even less than 1% of 2^10_000 points is a huge number, i.e. you can still have billions of them to represent things similar to “car”.

In 10_000-bit space only about a billionth of the points are within 4700 bits of a given point, and another billionth are further than 5300 bits away, i.e. the vast majority of points lie between 4700 and 5300 bits Hamming distance (in fact 99.7% lie within three standard deviations of the mean, roughly between 4850 and 5150 bits).

Make any point/concept your north pole and you will see the same distribution relative to this new position.

From this metaphor we can see the benefit of highly orthogonal spaces. (BTW, in reality the space is a hypercube.)

Creating the knowledge “point” (SDP) is the most basic operation we can do to instantiate new meaning/knowledge/data-point/symbol. It is our starting point, from which everything blossoms. At the same time, being distributed and large guarantees that it will be noise resistant.

In traditional programming languages, when two pointers point to the same place we say they are identical, but if we change even a single bit they become totally different things. Not so with SDP : similar vectors imply similar meanings. The SDP is both the pointer/address and the content/meaning at the same time.

So Semantic Distributed Pointer (SDP) it is.


from bi import *

Now that we are aware of the properties of large vector spaces, let's see what we can do with them.

Note: Once you become familiar with the framework, I implore you to also play with sdp.nbits (in the bi.py file) by making it a smaller number. This will speed up processing by a lot. I’ve been able to make my tests work even with sdp.nbits = 1000, though 2000 and up is probably better. One drawback is that tiny-tuples will no longer allow 15 terms (i.e. lower noise resistance, but maybe the speedup is worth it), which will be a problem if you use large predicates. So test it and give me feedback.

Basic operations

Instantiating a new Symbol, Binding(*) and Bundling(+).

Instantiating a new symbol, as we already said, is done by randomly generating a vector in 10_000-bit space. We are guaranteed by the properties of the space that this new vector will be at an approximate distance of ~5000 bits from all the other existing vectors, ergo a new concept. We do that by calling :

car = sdp.rand()
print car
0101110011000101011010101110011011101111100101101001100111111110001111101001010101001101011111000011...

The second basic operation is Binding, where we combine two vectors into a new one :

color = sdp.rand()
red = sdp.rand()
color_red = sdp.bind(color, red)

or we can just do :

color_red = color * red

Binding in SDP is implemented as an Exclusive-OR (XOR) operation.

(If you write modules, I recommend using the sdp.xxx() interface.)
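For the curious, here is a minimal sketch of XOR binding over plain 0/1 lists (the real sdp.bind() works on packed bitarrays, but the math is the same) :

def bind(a, b):
    return [x ^ y for x, y in zip(a, b)]

# probing/unbinding is the very same XOR : bind(a, bind(a, b)) gives back b exactly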

One consequence of binding is that the new SDP is again ~50% away from all the existing vectors.

Let's check :

print sdp.dist(red, color_red)
print sdp.dist(color, color_red)
print sdp.dist(car, color_red)
4995
4926
5011

See, all symbols are ~50% away from each other … orthogonal.

Now we can inspect the content of the bound vector by probing it :

rv = sdp.probe(color, color_red)

guess which vector ‘rv’ will be close to : color, red or car ?

We do that by checking the Hamming distance to each candidate vector :

print sdp.dist(rv, car) 
print sdp.dist(rv, color) 
print sdp.dist(rv, red) # i.e. color * color_red => red 
4978
4999
0

i.e. rv is similar to the vector red, in this case identical (that won't be the case most of the time).

Vectors at a distance below sdp.true_thresh = 42% of 10_000 bits are assumed similar, i.e. any vector at most 4200 bits away is considered close. (This threshold can be changed across the system if needed.)
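Presumably the similarity predicate boils down to this threshold check (a sketch; the exact attribute names inside bi.py are an assumption) :

def similar(a, b, nbits=10000, true_thresh=0.42):
    return sdp.dist(a, b) < true_thresh * nbits    # below ~4200 bits => similar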

You can also check the distance this way, let’s do it with ‘red’ this time :

rv = sdp.probe(red, color_red)
print rv % car # % same as sdp.dist()
print rv % color # i.e. red * color_red => color 
print rv % red
5011
0
4999

What we see from this is that the Bind operation is bidirectional, i.e. if we think of the pair as variable-value, we can extract the value by providing the variable and vice versa. (In the literature these are also called role-filler pairs.) In computer languages you cannot get the variable name from the value.

That was Bind; what about Bundle ?

Bundling, also called superposition, set composition, merge, union … etc., is another operation that combines vectors. The difference this time is that the resulting SDP is closer to the operands, i.e. not orthogonal to them. Bundling is implemented as a thresholded sum.
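Here is a minimal sketch of the thresholded sum over plain 0/1 lists (the real sdp.bundle() works on packed bitarrays). Note the random tie-break, which is the source of the “bundle-even anomaly” discussed a bit later :

import random

def bundle(vectors):
    out = []
    for bits in zip(*vectors):                  # go column by column
        s = sum(bits)
        if 2 * s > len(bits):   out.append(1)   # majority of ONEs
        elif 2 * s < len(bits): out.append(0)   # majority of ZEROs
        else:                   out.append(random.getrandbits(1))  # tie : pick at random
    return out

And here it is in action with the real sdp.bundle() :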

car_color = sdp.bundle([car, color])
print car_color % car #close
print car_color % color #close
print car_color % red #far-away
2486
2525
4968

See, ‘car_color’ is now closer to ‘car’ and ‘color’, but still far away from ‘red’. The main goal of using Bundle, as the name implies, is to bundle items together and create STRUCTURE. F.e. :

vechicle = sdp.rand() #new symbol
machine = vechicle * car + color * red #create bundle of bind-pairs i.e. struct

This very much resembles how we build structures (with fields) in other languages. The difference, though, is that the resulting SDP is the same size as any of its components, instead of growing as we add more fields and data (up to a point, after which it gets too “noisy”). It is a sort of on-the-fly compression and behaves like a holographic image.

You can stuff up to 15 items in this way before you start having problems recognizing the constituents. (Empirically tested rather than mathematically proven. I call this a tiny-tuple; you will hear more about it later.)

Let's poke the bear :

# machine = vechicle * car + color * red
is_car = sdp.probe(vechicle, machine) #extract "car" via "vechicle" from "machine"

# now that this is a structure, the distance is no longer 0, but still < 42%
print sdp.dist(is_car, car) 

print red % (color * machine) #in one go, this time using "color" to get "red".
print sdp.sim(red, color * machine) # boolean : are the items similar?
print sdp.sim(car, color * machine) # nope
2533
2485
True
False

So we can still extract data from a structure with a single operation. This is very important.

The interesting part of the whole exercise is that in all the manipulations we did, the size of the vectors (SDP) did not change as we built structures. In comparison, in traditional programming languages structures like dicts/hashes and arrays need ever more space as we add elements, up to a point of course. The other difference is that what we are storing is symbols, which forces us to think in a different way. We don't store Strings and Numbers and Dictionaries in our brains. Storing and using SYMBOLS, on the other hand, is a more plausible approximation of how the brain works.

BSC, bundle-even anomaly ?

Because of the way Bundling is implemented, i.e. as a “Thresholded Sum”, bundling an even number of symbols is “noisy”. When we threshold an even number of bits and the number of ZEROs equals the number of ONEs, the process is non-deterministic : we randomly select 0 or 1.

For this reason it is preferable, if you have the option, to bundle an odd number of items. BTW, this is a problem of the binary representation; real-valued vectors do not have it.

Next we have :

Mapping (*) and Permutation

Other important operations are Mapping and Permutation.

Mapping maps a vector/symbol to a different part of SDP space while preserving SIMILARITY.

# machine = vechicle * car + color * red
print "original : ", color_red % machine #how far away are the symbols

mmap = sdp.rand()  #create the map
m_machine = sdp.mapit(machine, mmap)
m_color_red = sdp.mapit(color_red, mmap)

print "mapped : ", m_color_red % m_machine #how far away are the mapped symbols
original :  2485
mapped :  2485

Internally, sdp.bind(), sdp.probe(), sdp.unbind(), sdp.mapit() and sdp.unmap() are all implemented using XOR, but that is true only for the binary representation; it is not the case for other representations such as Holographic Reduced Representations (HRR). So it is better to use the correct function, to make your intentions semantically clear.

Permutation simply swaps bits around. As with mapping, we have to generate the permutation matrix first.

pmx = sdp.rand_perm_mx() #create the permutation matrix 
vechicle_car = vechicle * car #bind-pair used in machine, so that they are similar

p_machine = sdp.permute(machine, pmx)
p_vcar = sdp.permute(vechicle_car, pmx)

print "original : ", vechicle_car % machine
print "permuted : ", p_vcar % p_machine #permutaton also keeps the similarity
original :  2533
permuted :  2533
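In case you wonder what the permutation “matrix” might look like, here is a plain-Python sketch where it is just a list of source positions (this is an assumption about the mechanism; the real sdp.rand_perm_mx()/sdp.permute() in bi.py may store and apply it differently) :

import random

def rand_perm_mx(n=10000):
    mx = range(n)                 # Python 2 : a list of positions
    random.shuffle(mx)
    return mx

def permute(v, pmx):              # bit at position pmx[i] moves to position i
    return [v[i] for i in pmx]

def inv_permute(v, pmx):          # put every bit back where it came from
    out = [0] * len(v)
    for i, j in enumerate(pmx):
        out[j] = v[i]
    return out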

A cheaper way to do permutation is to simply shift the bits.

print car
print sdp.roll(car, 2)
print car % ( sdp.roll(sdp.roll(car, 2), -2) ) # permute forth&back
0101110011000101011010101110011011101111100101101001100111111110001111101001010101001101011111000011...
0001011100110001010110101011100110111011111001011010011001111111100011111010010101010011010111110000...
0

Did you see that you can reverse the roll ? The same goes for permutation, via sdp.inv_permute().

machine % sdp.inv_permute(p_machine, pmx)
0

NOTE: A big benefit of what we discussed so far is that you can reason about these operations algebraically, because for the most part they support associativity, commutativity … etc. Read the papers on the Internet for details.


Now for the drawback of VSA from a performance standpoint :

Cleanup memory

For VSA (Vector Symbolic Architecture) to work, we need one more crucial element, normally not needed in other architectures.

The so-called Cleanup memory.

Why would you need such a thing, and what is it ?

Like we said higher cognition requires us to be able to build structures by composing symbols. In the absence of any other information it is not possible to tell whether some arbitrary vector represents a composite entity, an atomic entity, or nothing at all.

Also, when we want to peek into those structures, because we are dealing with distributed information we have to pry out the constituents. As a result we get noisy vectors, like we already saw. We have to clean up those vectors by comparing them to a memory that holds pristine representations.

So the purpose of Cleanup memory (CUP) is to provide clean reference vectors, so we can de-noise the vectors that result from operations on SDPs.

There could be many types of cleanup memories.

The iconic Cleanup memory is Kanerva's SDM (Sparse Distributed Memory). Back in the day I implemented SDM with Perl/PDL, but I'm not planning to do that in the Bi project until I get a more robust system working. BTW, it should not be hard to integrate if needed.

In the current Bi framework I built a more prosaic CUP, namely a Lexicon. The Lexicon is a one-to-one cleanup memory, where each data item corresponds to a separate item in the memory. It is not very plausible as a model of how the brain stores information, but its simpler implementation allows me to start implementing VSA.

The Lexicon behaves like a normal dictionary when accessing symbols (with efficient SDP storage) : you give it a symbol name and you get back an SDP.

But that is not what cleanup memory is about. What is expected from a CUP is, given a noisy vector, to get back the nearest clean SDP.

The Lexicon extends that by returning the symbol name of the nearest SDP instead. You can always fetch the SDP by name if you need it. The operation I just described is called .best_match().

We need the symbol name instead of the SDP so that we can integrate it into upper-level algorithms, as you will see later. My goal over time, as I understand the system better, is to move as much as possible to the lower level.
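Conceptually, .best_match() is just a nearest-neighbour scan over the stored clean SDPs. A simplified sketch (hypothetical names; the real Lexicon keeps its SDPs packed in a bitarray and is organized differently) :

def best_match(memory, noisy):
    # memory : dict of symbol-name -> clean SDP ; return the name of the closest one
    return min(memory, key=lambda name: sdp.dist(memory[name], noisy))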

Let's see it in action :

from lexicon import *

d = lex()
d.add('car')
0

Here are different ways to access the SDP by symbol name :

print d.get('car')
print d.g('car')
print d.g(0)
print d['car']
1010101100100001010111010010110010100100111110111000101110001100101000011000111111001111111110001011...
1010101100100001010111010010110010100100111110111000101110001100101000011000111111001111111110001011...
1010101100100001010111010010110010100100111110111000101110001100101000011000111111001111111110001011...
1010101100100001010111010010110010100100111110111000101110001100101000011000111111001111111110001011...

Let's add more symbols and build a structure out of them :

d.add_items(['color', 'red', 'blue', 'shape', 'circle', 'square'])
d.lex
{'blue': 3,
 'car': 0,
 'circle': 5,
 'color': 1,
 'red': 2,
 'shape': 4,
 'square': 6}
obj = d['shape'] * d['circle'] + d['color'] * d['red'] #create struct
d.add('object1', obj) #add the struct to lexicon for future use
7

Now that we have several symbols in the lexicon, let's access/extract information from the struct by cleaning it up against the lexicon, the same thing we did with the SDP variables.

print d.best_match(d['shape'] * obj)
print d.bm(d['color'] * obj)
print d.bm(obj)
circle
red
object1

Best-match is a very expensive operation; that is why, as you will see later, I employ different tactics to improve overall system speed.

d
            car : 1010101100100001010111010010110010100100111110111000101110001100101000011000111111001111111110001011... ...
          color : 0010110100010001000101010100001010100011001100100001001100001011100000010000010001011000010111101110... ...
            red : 0001000001100011110000000011010101111001011010010100010001111111110011100110111001110101101011011101... ...
           blue : 0001011000110110011101110010001000001110110011111110010010011111010101001100001111111111001100110010... ...
          shape : 0010100011011111101101001111010100100101001111001001101111010000101000010001111010110101100100010001... ...
         circle : 1101010001001100100000010010010001111110111101001110100001010000000011011101001111011001001010110011... ...
         square : 1010100010111101111101110111010101001101110001110000100001011011001000000010001100001110101001100101... ...
        object1 : 0111110011110010100101010111011111011010010110010111011101100000101011110110110101101100101110110011... ...

As I already mentioned, all the SDPs are stored compactly in a bitarray, which grows dynamically to accommodate new entries.


Structures

With the operations we learned so far and the knowledge of how to use CUPs, we have the basic process of building structures by composition and decomposition. Let's see some examples :

Tiny tuple

I tested experimentally the maximum number of bind-pairs that can be bundled into a single SDP while still extracting the items with 100% success. The number is 15.

So I call this structure a tiny-tuple. Here is how it looks visually.

a1 * b1 + a2 * b2 + a3 * b3 + .... + a15 * b15

Version 0.1 of the Bi language is based wholly on the tiny-tuple; no other structure is used yet.
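You can repeat the capacity experiment yourself with the sdp API from earlier : bundle n role-filler pairs and check whether every filler is still recoverable through its role.

n = 15
roles   = [sdp.rand() for _ in range(n)]
fillers = [sdp.rand() for _ in range(n)]
tup = sdp.bundle([r * f for r, f in zip(roles, fillers)])

print all(sdp.sim(sdp.probe(r, tup), f) for r, f in zip(roles, fillers))
# expected True for n = 15; push n higher and extraction starts to fail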

Sequences

There are many ways to represent sequences.

If we use the symbol *> to signify permutation and <* inverse permutation, then here is one way to encode a sequence.

seq = a *> b + b *> c + c *> d + d *> e + … ; X <* seq returns the element following X in seq

You can do this chaining manually, but for easy testing I have built a simple extension for ipython that you can use to quickly experiment with ideas before coding them down. (By default this extension creates the symbols @a to @z for you. To create a new symbol use a double at-sign, i.e. @@new_sym.) So we will first load the extension, then create the sequence :

%load_ext bi_ip_ext
%do @seq = @a *> @b + @b *> @c + @c *> @d
@seq = @a *> @b + @b *> @c + @c *> @d
Value
 +- val True
 `- vtype 'bool'
Value(...)

Now we can query it, again using best-match to clean up/find the symbol.

%do bm: (@a <* @seq)
bm: (@a <* @seq)
Value
 +- val 'b'
 `- vtype 'string'
Value(...)
%do bm: (@b <* @seq)
bm: (@b <* @seq)
Value
 +- val 'c'
 `- vtype 'string'
Value(...)
%do bm: (@c <* @seq)
bm: (@c <* @seq)
Value
 +- val 'd'
 `- vtype 'string'
Value(...)
%do bm: (@d <* @seq)
bm: (@d <* @seq)
Value
 +- val None
 `- vtype 'string'
Value(...)

This example illustrates something you have to keep in mind when building structures. If you were observant you may have asked yourself : why on earth would I need to use permutation, couldn't we just use normal binding ? We can't ! Because if you look again, symbols in the sequence repeat; for example we have the bind pairs a * b and b * c, where b is in both pairs. So when we probe with b, how do we know which pair we are querying ? To emulate one-directional binding we use the permutation a *> b; this way the bound pair behaves more like a variable-value binding.

You can also use sdp.roll() to do the same thing. Because the ipython extension uses one and the same permutation matrix for all *> cases, you are better off experimenting with roll.

%do @seq2 = @a 5> @b + @b 5> @c + @c 5> @d
@seq2 = @a 5> @b + @b 5> @c + @c 5> @d
Value
 +- val True
 `- vtype 'bool'
Value(...)
%do bm: (@a <5 @seq2)
bm: (@a <5 @seq2)
Value
 +- val 'b'
 `- vtype 'string'
Value(...)
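Outside the extension, the same chaining can be sketched directly with the sdp API from earlier, using roll as the (cheap) permutation. The convention assumed here is “permute the predecessor, then bind it to its successor”; the extension may differ in detail, but the idea is the same :

a, b, c, d2 = sdp.rand(), sdp.rand(), sdp.rand(), sdp.rand()
seq = sdp.bundle([sdp.roll(a, 1) * b,
                  sdp.roll(b, 1) * c,
                  sdp.roll(c, 1) * d2])

nxt = sdp.probe(sdp.roll(a, 1), seq)   # noisy version of the element after 'a'
print sdp.sim(nxt, b)                  # True : 'b' follows 'a'
print sdp.sim(nxt, c)                  # False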

Here are other ways to represent a sequence :

  • seq = a + a * b + a * b * c …; ==> probe(a), probe(ab) …

  • seq = 1 * a + 2 * b + 3 * c + 4 * d + … ; can be seen as tiny-tuple too

Of course all these representations suffer from the same limitation on how many bind pairs you can stuff into a single SDP … 15 ! To solve this problem we need to build hierarchies.


Hierarchical and relational structures

Instead of trying to put all our eggs into a single SDP, we can spread the structure over many SDPs. We do that by building interdependent SDPs, i.e. a hierarchy.

For example a sequence may look like this :

seq = ab + ab * cd; ab = a + a * b ; cd = c + c * d
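Written with the sdp operations from earlier, this hierarchical encoding looks roughly like the sketch below (assuming chunks like 'ab' can themselves be bundled and bound, which is exactly the point of fixed-size SDPs) :

a, b, c, d2 = sdp.rand(), sdp.rand(), sdp.rand(), sdp.rand()
ab  = sdp.bundle([a, a * b])         # first chunk  : a + a*b
cd  = sdp.bundle([c, c * d2])        # second chunk : c + c*d
seq = sdp.bundle([ab, ab * cd])      # top level    : ab + ab*cd

print sdp.sim(sdp.probe(ab, seq), cd)   # True : the second chunk is reachable via the first
print sdp.sim(sdp.probe(a, ab), b)      # True : descend into a chunk the same way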

One additional decision we have to make when building a hierarchical structure is whether to use a single Cleanup memory (CUP) or multiple ones. If we use multiple, how do we split them : based on hierarchy level, scope, both … something else ?

Lists can also be represented as a hierarchy.

Version 0.2 will be all about hierarchical structures. One of the reasons I decided to go with a Prolog-like interpreter is its intrinsic handling of relational and hierarchical structures; languages like Python do not handle them natively.

Currently the Bi engine's Unification process is capable of handling hierarchical structures; the parser may require some small modifications to handle expressions; the problem is the Knowledge DB. There are many options for how to build up hierarchical structures, which is why I postponed it for later.