... | ... | @@ -4,9 +4,9 @@ |
|
|
|
|
|
Its rationale are described in this [preprint](https://cloud.uvolante.org/index.php/s/AiGAcapjgiygcfH/download). We thank the anonymous reviewers of ISIT'19 for spotting an inconsistency in the DPI (for reference, here are the [ISIT'19 slides](https://cloud.uvolante.org/index.php/s/7oSeZKy9GKAMKKg/download)).
|
|
|
|
|
|
`salza` comes with built-in computation of two universal, normalized semi-distances (one in the spirit of dissimilarity[^3], the other in the spirit of similarity[^5]), and an implementation of causality inference using the (stable) PC algorithm[^6].
|
|
|
`salza` comes with built-in computation of two universal, normalized semi-distances (one in the spirit of dissimilarity[^5], the other in the spirit of similarity[^6]), and an implementation of causality inference using the (stable) PC algorithm[^7].
|
|
|
|
|
|
`salza` is multithreaded[^7], too.
|
|
|
`salza` is multithreaded[^8], too.
|
|
|
|
|
|
# Companion software
|
|
|
|
... | ... | @@ -51,16 +51,69 @@ git clone https://forge.uvolante.org/code/salza.git |
|
|
|
|
|
## Getting the datasets
|
|
|
|
|
|
Clone the repository:
|
|
|
```
|
|
|
git clone https://forge.uvolante.org/dataset/strings.git
|
|
|
```
|
|
|
|
|
|
Finish the installation (create lists of strings with absolute paths):
|
|
|
```
|
|
|
cd strings && ./finish-install.sh
|
|
|
```
|
|
|
|
|
|
## Reproducing the asymmetry heatmap
|
|
|
|
|
|
Download this Jupyter [notebook](https://cloud.uvolante.org/index.php/s/WtepD8fWZPd5QDN/download) along with these utility [functions](https://cloud.uvolante.org/index.php/s/NDXFB8CzxsAjk7p/download).
|
|
|
|
|
|
## Reproducing the classification table
|
|
|
|
|
|
The NCD distance matrices are provided with the datasets.
|
|
|
|
|
|
To compute the NSD matrices:
|
|
|
1. make sure you have installed `drpt` (see above) ;
|
|
|
2. download the script [compute-nxd-matrices](https://cloud.uvolante.org/index.php/s/gqPCRCKwFKwf887/download) ;
|
|
|
3. update the variable `DATASET` in `compute-nxd-matrices` with the path to the datasets ;
|
|
|
4. run the script on the four datasets:
|
|
|
```
|
|
|
./compute-nxd-matrices markov
|
|
|
./compute-nxd-matrices gaussian
|
|
|
./compute-nxd-matrices wine
|
|
|
./compute-nxd-matrices iris
|
|
|
```
|
|
|
|
|
|
This Jupyter [notebook](https://cloud.uvolante.org/index.php/s/jB4qgyMZx6PxttD/download) uses these utility [functions](https://cloud.uvolante.org/index.php/s/nMxXkzzED7GJweW/download) to provide 2D and 3D vizualisation, and computation of classification performance metrics.
|
|
|
|
|
|
## Reproducing the dendrograms
|
|
|
|
|
|
First, compute the NSD semi-distance matrices:
|
|
|
```
|
|
|
salza -d -i /path/to/dataset/languages/strings -v > ~/languages-nsd-sim.csv
|
|
|
salza -d -i /path/to/dataset/languages/strings --cross -v > ~/languages-nsd.csv
|
|
|
```
|
|
|
|
|
|
This Jupyter [notebook](https://cloud.uvolante.org/index.php/s/N7xNkTJsYF5sc5T/download) uses these utility [functions](https://cloud.uvolante.org/index.php/s/ERs88CFipPxZZaB/download) to plot the dendrograms.
|
|
|
|
|
|
## Reproducing the causality (CP)DAGs
|
|
|
|
|
|
Make sure you have installed `neato` (provided by `graphviz`).
|
|
|
|
|
|
Computing a CPDAG from the Toussaint drafts:
|
|
|
```
|
|
|
salza - g -i /path/to/dataset/toussaint/strings --skel stable --eta 0.008 --dag -v | neato -Tpdf > toussaint.pdf
|
|
|
```
|
|
|
|
|
|
Computing a stable skeleton from the `languages` dataset:
|
|
|
```
|
|
|
salza - g -i /path/to/dataset/dataset/strings --skel stable -v | neato -Tpdf > languages.pdf
|
|
|
```
|
|
|
|
|
|
# References
|
|
|
|
|
|
[^1]: Jacob Ziv and Abraham Lempel, _"A Universal Algorithm for Sequential Data Compression"_, IEEE Transactions on Information Theory, vol. 23, No. 3, pp. 337--343, May 1977.
|
|
|
[^2]: Maxime Crochemore and Lucian Ilie, _"Computing Longest Previous Factor in Linear Time and Applications"_, Information Processing Letters, vol. 160, issue 2, pp. 75--80, April 2008.
|
|
|
[^3]: Jacob Ziv and Neri Merhav, _"A Measure of Relative Entropy between Individual Sequences with Application to Universal Classification"_, IEEE Transactions on Information Theory, vol. 39, No. 4, pp. 1270--1279, July 1993.
|
|
|
[^2]: Gang Chen, Simon J. Puglisi and William F. Smyth, _"Lempel-Ziv Factorization Using Less Time and Space"_, Mathematics in Computer Science, vol. 1, issue 4, pp. 605--623, June 2008.
|
|
|
[^3]: Abraham D. Wyner and Jacob Ziv, _"Fixed Data Base Version of the Lempel-Ziv Compression Algorithm"_, IEEE Transactions on Information Theory, vol. 37, No. 3, pp. 878--880, May 1991.
|
|
|
[^4]: Christopher Hoobin, Simon J. Puglisi and Justin Zobel, _"Relative Lempel-Ziv Factorization for Efficient Storage and Retrieval of Web Collections"_, Proceedings of the VLDB Endowment, vol. 5, issue 3, pp. 265--273, November 2011.
|
|
|
[^5]: Rudi Cilibrasi and Paul M. B. Vitányi, _"Clustering by Compression"_, IEEE Transactions on Information Theory, vol. 51, No. 4, pp. 1523--1545, April 2005.
|
|
|
[^6]: Diego Colombo and Marloes H. Maathuis, _"Order-Independent Constraint-Based Causal Structure Learning"_, Journal of Machine Learning Research, vol. 15, pp. 3921--3962, November 2014 ([link](http://jmlr.org/papers/v15/colombo14a.html)).
|
|
|
[^7]: Bil Lewis and Daniel J. Berg, _"Threads Primer: A Guide to Multithreaded Programming"_, ISBN-13 978-0134436982, Prentice Hall, October 1995. |
|
|
\ No newline at end of file |
|
|
[^5]: Jacob Ziv and Neri Merhav, _"A Measure of Relative Entropy between Individual Sequences with Application to Universal Classification"_, IEEE Transactions on Information Theory, vol. 39, No. 4, pp. 1270--1279, July 1993.
|
|
|
[^6]: Rudi Cilibrasi and Paul M. B. Vitányi, _"Clustering by Compression"_, IEEE Transactions on Information Theory, vol. 51, No. 4, pp. 1523--1545, April 2005.
|
|
|
[^7]: Diego Colombo and Marloes H. Maathuis, _"Order-Independent Constraint-Based Causal Structure Learning"_, Journal of Machine Learning Research, vol. 15, pp. 3921--3962, November 2014 ([link](http://jmlr.org/papers/v15/colombo14a.html)).
|
|
|
[^8]: Bil Lewis and Daniel J. Berg, _"Threads Primer: A Guide to Multithreaded Programming"_, ISBN-13 978-0134436982, Prentice Hall, October 1995. |
|
|
\ No newline at end of file |