What is salza?
salza is a practical implementation of two universal algorithmic information measures on sequences, based on LZ77^{1,2} and relative LZ^{3,4} coding.
Its rationale is described in this preprint. We thank the anonymous reviewers of ISIT'19 for spotting an inconsistency in the DPI (for reference, here are the ISIT'19 slides).
salza comes with built-in computation of two universal, normalized semidistances (one in the spirit of dissimilarity^{5}, the other in the spirit of similarity^{6}), and an implementation of causality inference using the (stable) PC algorithm^{7}. salza is also multithreaded^{8}.
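For a quick feel of what such measures capture, the compression-based similarity idea^{6} can be approximated with any off-the-shelf compressor. The sketch below uses Python's zlib instead of salza's LZ77-based estimators, so the numbers differ from salza's output, but the qualitative behaviour is the same:

```python
import zlib

def c(data: bytes) -> int:
    """Crude complexity estimate: size of the zlib-compressed data."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance (Cilibrasi & Vitanyi, 2005)."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"aabbababbaabbbabbabaaababababababab"
b = b"the quick brown fox jumps over the lazy dog"
print(ncd(a, a))  # small: a string carries no new information about itself
print(ncd(a, b))  # much larger: unrelated strings
```

salza replaces the generic compressor with principled LZ77 and relative-LZ parsings, which is what makes its measures universal on individual sequences.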
Corrigendum
As of salza v0.2.8 (OOPS v0.2.21), monotonicity over prefixes is now correctly enforced. Preprint updated.
Companion software
In case it is needed, drpt will convert a (salza) semidistance matrix into a true distance matrix.
To produce strings out of a numerical attributes matrix, use mattr2str.
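We do not document drpt's internals here, but the general shape of such a conversion can be sketched: symmetrize the matrix, then enforce the triangle inequality. The helper below is a hypothetical illustration, not drpt's actual algorithm:

```python
def to_true_distance(m):
    """Turn a square semidistance matrix into a true distance matrix:
    symmetrize, then close under the triangle inequality via
    all-pairs shortest paths (Floyd-Warshall).
    Illustrative only -- drpt's actual construction may differ."""
    n = len(m)
    # Symmetrize: d(i, j) = max(m[i][j], m[j][i]).
    d = [[max(m[i][j], m[j][i]) for j in range(n)] for i in range(n)]
    # Shortest paths guarantee d(i, j) <= d(i, k) + d(k, j).
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

semi = [[0.0, 0.9, 0.2],
        [0.8, 0.0, 0.3],
        [0.2, 0.1, 0.0]]
d = to_true_distance(semi)
```

After the repair, the matrix can be fed to any clustering or embedding method that expects a metric.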
Licensing information
salza is released as is, without any warranty, under a dual licensing scheme.
By default, salza is distributed under the GNU Affero General Public License, version 3.
If you cannot comply with AGPLv3, please contact us.
Installing salza
salza is written in C for GNU/Linux. We provide both precompiled binaries for Debian/Ubuntu amd64 architectures and instructions to compile from source.
Debian/Ubuntu repository
Please follow these instructions to add the repository to your system.
Once the repository is available on your system:
sudo apt install salza
To install the companion programs:
sudo apt install drpt mattr2str
Source code
Requirements
salza makes use of the following software: clang, make, cmake, doxygen, git, and the oops library.
Cloning the source repository
Once oops is compiled and installed, clone the git tree:
git clone https://forge.uvolante.org/code/salza.git
mkdir -p salza/build && cd salza/build
cmake -DCMAKE_BUILD_TYPE=Release ..
make
sudo make install
The last step can be replaced with building a Debian package (type: cpack) to be installed with sudo dpkg -i salza-xxxxx.deb.
Known issues
salza eats a lot of memory. Launch it with the verbose option (-v) to monitor its behaviour.
If it seems too slow, your machine may have had to free memory; killing the process and launching it again can help.
That proved useful in LXD containers where the guests do not have the same view of memory as the host (you think you have plenty of free memory, but you don't).
In any case, all strings have to fit in memory with their associated search structures.
Getting help
salza -h
Playing with salza
Basic properties
$ cat x
aabbababbaabbbabbabaaababababababab
Computing self-similarity...
$ salza -s -i x
9.853782e+00
... is the same as computing the relative similarity given no prior:
$ salza -s -t relative -i x
9.853782e+00
which is the same as computing the joint similarity of a string alone:
$ salza -s -t joint -i x
9.853782e+00
The joint similarity of a string with copies of itself is the same as that of the string alone:
$ salza -s -t joint -i x,x
9.853782e+00
But this is not the case for concatenation:
$ cat x x | salza -s -i -
1.084182e+01
Yet, as intuition suggests, more concatenations only slightly increase the complexity:
$ cat x x x | salza -s -i -
1.089478e+01
$ cat x x x x | salza -s -i -
1.092117e+01
$ cat x x x x x | salza -s -i -
1.093698e+01
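This saturation is generic to LZ-style measures and can be mimicked with any dictionary compressor. A sketch with Python's zlib (again, not salza's estimator):

```python
import zlib

x = b"aabbababbaabbbabbabaaababababababab"
# Compressed size of x, xx, xxx, ...: each extra copy is almost free,
# because it compresses to one long back-reference into the previous copies.
sizes = [len(zlib.compress(x * k, 9)) for k in range(1, 6)]
print(sizes)
```

The increments shrink quickly, mirroring the slow growth of the similarity values above.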
Zeroes of salza
The empty string has similarity zero:
$ echo -n "" | salza -s -i -
0.000000e+00
$ echo -n "" | salza -s -i - -p x
0.000000e+00
A single character has similarity zero:
$ echo -n "a" | salza -s -i -
0.000000e+00
The relative similarity of a string given itself is zero:
$ salza -s -i x -p x
0.000000e+00
Monotonicity over prefixes
$ echo -n "ab" | salza -s -i -
5.000000e-01
$ echo -n "abb" | salza -s -i -
1.555556e+00
$ echo -n "abbb" | salza -s -i -
1.687500e+00
$ echo -n "abbbb" | salza -s -i -
1.760000e+00
$ echo -n "abbbbb" | salza -s -i -
1.805556e+00
$ echo -n "abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb" | salza -s -i -
1.970612e+00
$ echo -n "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb" | salza -s -i -
9.714286e-01
$ echo -n "abbbbbc" | salza -s -i -
2.405248e+00
2.405248e+00
Printing factorizations
$ echo -n "abbbbbbbbbbbbb" | salza -s -i - -o -
6162Z(1->12)
Groups of two hex numbers indicate a byte to be inserted, while Z(dist->len) reads: "copy len bytes located dist bytes in the past".
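To make the format concrete, here is a toy greedy LZ77 parser in Python that emits the same notation. salza's factorizer is far more efficient (and may parse other inputs differently), so treat this as an illustration only:

```python
def lz77_factors(s: bytes):
    """Greedy LZ77 factorization: at each position, take the longest
    match starting anywhere earlier in the string, else emit a literal."""
    out, i = [], 0
    while i < len(s):
        best_len, best_dist = 0, 0
        for j in range(i):                       # candidate match start
            l = 0
            while i + l < len(s) and s[j + l] == s[i + l]:
                l += 1                           # matches may overlap the cursor
            if l > best_len:
                best_len, best_dist = l, i - j
        if best_len >= 2:
            out.append(f"Z({best_dist}->{best_len})")
            i += best_len
        else:
            out.append(f"{s[i]:02x}")            # literal byte, two hex digits
            i += 1
    return out

print("".join(lz77_factors(b"abbbbbbbbbbbbb")))
```

On the example string it prints 6162Z(1->12): two literal bytes (61 62, i.e. "ab") followed by a 12-byte copy at distance 1, which overlaps itself to reproduce the run of b's.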
Reproducing the figures in the preprint
The figures have been produced using:
OOPS v0.2.18, build 525660de
salza v0.2.6, build aeb66512
Getting the datasets
Clone the repository:
git clone https://forge.uvolante.org/dataset/strings.git
Optional: In case you want to use the NCD, finish the installation by creating lists of strings with their absolute path:
cd strings && ./finishinstall.sh
Reproducing the asymmetry heatmap
This Jupyter notebook uses these utility functions to produce the heatmap.
Reproducing the classification table
The NCD distance matrices are provided with the datasets.
To compute the NSD matrices:
- make sure you have installed drpt;
- download the script computenxdmatrices;
- update the DATASET variable in computenxdmatrices with the path to the datasets;
- run the script on the four datasets:
./computenxdmatrices markov
./computenxdmatrices gaussian
./computenxdmatrices wine
./computenxdmatrices iris
This Jupyter notebook uses these utility functions to provide 2D and 3D visualisation, and to compute classification performance metrics.
Reproducing the dendrograms
First, compute the NSD semidistance matrices:
salza -d -i /path/to/datasets/languages/strings -v > ~/languagesnsdsim.csv
salza -d -i /path/to/datasets/languages/strings --cross -v > ~/languagesnsd.csv
This Jupyter notebook uses these utility functions to plot the dendrograms.
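Alternatively, once the matrix has been made symmetric (see drpt above), the dendrogram can be drawn directly with SciPy. A minimal sketch, with a made-up 4x4 distance matrix standing in for the CSV produced by salza:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Hypothetical distance matrix -- replace with the matrix loaded from the CSV.
labels = ["en", "fr", "de", "it"]
d = np.array([
    [0.0, 0.4, 0.6, 0.5],
    [0.4, 0.0, 0.7, 0.3],
    [0.6, 0.7, 0.0, 0.8],
    [0.5, 0.3, 0.8, 0.0],
])
z = linkage(squareform(d), method="average")       # linkage wants condensed form
tree = dendrogram(z, labels=labels, no_plot=True)  # structure without plotting
print(tree["ivl"])                                 # leaf order of the dendrogram
```

Drop no_plot=True (with matplotlib installed) to render the figure itself.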
Reproducing the causality (CP)DAGs
Make sure you have installed neato (provided by graphviz).
Computing a CPDAG from the Toussaint drafts:
salza -g -i /path/to/datasets/toussaint/strings --skel stable --eta 0.008 --dag -v | neato -Tpdf > toussaint.pdf
Computing a stable skeleton from the languages dataset:
salza -g -i /path/to/datasets/languages/strings --skel stable -v | neato -Tpdf > languages.pdf
References

1. Jacob Ziv and Abraham Lempel, "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, vol. 23, No. 3, pp. 337-343, May 1977.
2. Gang Chen, Simon J. Puglisi and William F. Smyth, "Lempel-Ziv Factorization Using Less Time and Space", Mathematics in Computer Science, vol. 1, issue 4, pp. 605-623, June 2008.
3. Abraham D. Wyner and Jacob Ziv, "Fixed Data Base Version of the Lempel-Ziv Compression Algorithm", IEEE Transactions on Information Theory, vol. 37, No. 3, pp. 878-880, May 1991.
4. Christopher Hoobin, Simon J. Puglisi and Justin Zobel, "Relative Lempel-Ziv Factorization for Efficient Storage and Retrieval of Web Collections", Proceedings of the VLDB Endowment, vol. 5, issue 3, pp. 265-273, November 2011.
5. Jacob Ziv and Neri Merhav, "A Measure of Relative Entropy between Individual Sequences with Application to Universal Classification", IEEE Transactions on Information Theory, vol. 39, No. 4, pp. 1270-1279, July 1993.
6. Rudi Cilibrasi and Paul M. B. Vitányi, "Clustering by Compression", IEEE Transactions on Information Theory, vol. 51, No. 4, pp. 1523-1545, April 2005.
7. Diego Colombo and Marloes H. Maathuis, "Order-Independent Constraint-Based Causal Structure Learning", Journal of Machine Learning Research, vol. 15, pp. 3921-3962, November 2014 (link).
8. Bil Lewis and Daniel J. Berg, "Threads Primer: A Guide to Multithreaded Programming", ISBN-13 978-0134436982, Prentice Hall, October 1995.