OnlineCA.jl Documentation

Overview

OnlineCA.jl performs out-of-core Correspondence Analysis (CA) on extremely large matrices without loading the whole dataset into memory.

CA decomposes the standardized residual matrix

S = D_r^{-1/2} (P - r_p c_p') D_c^{-1/2}

where P = X / n is the relative-frequency matrix, r_p = P 1 is the row-mass vector, c_p = P' 1 is the column-mass vector, and D_r = diag(r_p), D_c = diag(c_p) are the corresponding diagonal mass matrices. The matrix S is never explicitly formed. Instead, matrix–vector products S * v and S' * u are computed by streaming through the data, and the top singular triplets are extracted with a Halko-style randomized SVD.
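The matrix-free products described above can be sketched in a few lines of Julia. This is only an illustration under stated assumptions, not the package's implementation: the function names S_mul, St_mul, and rand_svd are hypothetical, the counts in X are made up, and the loop over the rows of a small in-memory matrix stands in for streaming rows from a binary file.

```julia
using LinearAlgebra, Random

# Toy contingency table (made-up counts); a real run would stream rows from disk.
X = Float64[10 2 3; 4 8 1; 2 3 9; 5 5 5]
n = sum(X)
r = vec(sum(X, dims=2)) ./ n    # row masses r_p = P 1
c = vec(sum(X, dims=1)) ./ n    # column masses c_p = P' 1

# S * v, computed one row of X at a time without forming S.
function S_mul(X, n, r, c, v)
    w = v ./ sqrt.(c)            # D_c^{-1/2} v
    cw = dot(c, w)               # c_p' w, shared by all rows
    return [(dot(view(X, i, :), w) / n - r[i] * cw) / sqrt(r[i]) for i in axes(X, 1)]
end

# S' * u, accumulating P' z row by row.
function St_mul(X, n, r, c, u)
    z = u ./ sqrt.(r)            # D_r^{-1/2} u
    acc = zeros(length(c))
    for i in axes(X, 1)
        acc .+= view(X, i, :) .* (z[i] / n)
    end
    return (acc .- c .* dot(r, z)) ./ sqrt.(c)
end

# Halko-style randomized SVD that touches S only through the two products above.
function rand_svd(X, n, r, c, k; oversample=1, rng=Random.default_rng())
    Ω = randn(rng, length(c), k + oversample)
    Y = reduce(hcat, [S_mul(X, n, r, c, Ω[:, j]) for j in axes(Ω, 2)])    # Y = S Ω
    Q = Matrix(qr(Y).Q)                                                    # orthonormal basis of range(Y)
    Bt = reduce(hcat, [St_mul(X, n, r, c, Q[:, j]) for j in axes(Q, 2)])  # S' Q = B'
    F = svd(Matrix(Bt'))                                                   # small SVD of B = Q' S
    return (Q * F.U)[:, 1:k], F.S[1:k], F.V[:, 1:k]
end

U, σ, V = rand_svd(X, n, r, c, 2)
```

Each product needs one pass over the rows of X plus the precomputed masses, which is why S itself never has to be stored. On a table this small the result can be checked against a dense SVD of S; on real data only the streamed products are affordable.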

Online CA is performed in the following two steps.

  • Step 1, Binarization: The input is assumed to be a contingency table of non-negative integer counts, saved as a CSV, Matrix Market (MM), or Binary COO (BinCOO) file. Using the OnlinePCA package, these files are converted to Julia binary files by csv2bin, mm2bin, or bincoo2bin, respectively. This step greatly speeds up I/O.
  • Step 2, Online CA: Run ca on the binary file generated by csv2bin, sparse_ca on the binary file generated by mm2bin, or bincoo_ca on the binary file generated by bincoo2bin.

In addition, Multiple Correspondence Analysis (MCA) is supported via mca. It builds the indicator (complete-disjunctive) matrix from a categorical table and reuses bincoo_ca, with optional Benzécri / Greenacre eigenvalue corrections. Supplementary rows or columns can be projected into a trained CA / MCA space with project_rows / project_columns.
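To make the indicator matrix and the eigenvalue correction concrete, here is a minimal Julia sketch. It is not the code behind mca: the helper names indicator_matrix and benzecri are hypothetical, the categorical table is made up, and the correction shown is the standard Benzécri formula applied to the principal inertias of the indicator matrix, with Q the number of categorical variables. Dimensions at or below the 1/Q threshold are set to zero here.

```julia
# Made-up categorical table: 4 individuals (rows), Q = 3 variables (columns).
table = ["red"   "small" "yes";
         "blue"  "large" "no";
         "red"   "large" "no";
         "green" "small" "yes"]

# Build the indicator (complete-disjunctive) matrix:
# one 0/1 column per level of each variable, blocks concatenated side by side.
function indicator_matrix(table)
    nrows, Q = size(table)
    blocks = Matrix{Int}[]
    for q in 1:Q
        levels = sort(unique(table[:, q]))
        push!(blocks, [Int(table[i, q] == l) for i in 1:nrows, l in levels])
    end
    return reduce(hcat, blocks)
end

# Benzécri correction of indicator-matrix principal inertias λ:
# λ_adj = (Q / (Q - 1) * (λ - 1/Q))^2 when λ > 1/Q, otherwise 0.
benzecri(λ, Q) = [l > 1 / Q ? (Q / (Q - 1) * (l - 1 / Q))^2 : 0.0 for l in λ]

Z = indicator_matrix(table)      # 4 × 7 matrix; every row sums to Q = 3
adjusted = benzecri([0.5, 0.2], 3)
```

Because every row of Z contains exactly one 1 per variable, Z is a valid table of non-negative counts and can be analysed with the same CA machinery as any other contingency table.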

ca, sparse_ca, and bincoo_ca are available both through the Julia API and as command line tools (see OnlineCA.jl (Julia API) and OnlineCA.jl (Command line tool)). mca is available through the Julia API only.

Reference