OnlineCA.jl Documentation
Overview
OnlineCA.jl performs out-of-core Correspondence Analysis (CA) on extremely large matrices without loading the whole dataset into memory.
CA decomposes the standardized residual matrix
S = D_r^{-1/2} (P - r_p c_p') D_c^{-1/2}
where P = X / n is the relative-frequency matrix (n is the grand total of the counts in X), r_p = P 1 is the row-mass vector, c_p = P' 1 is the column-mass vector, and D_r = diag(r_p), D_c = diag(c_p) are the corresponding diagonal mass matrices. The matrix S is never formed explicitly. Instead, the matrix–vector products S * v and S' * u are computed by streaming through the data, and the top singular triplets are extracted with a Halko-style randomized SVD.
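The implicit products can be illustrated on a toy table. The following is a minimal sketch, not the package's implementation: the helper names `S_times`, `St_times`, `mapcols`, and `rsvd`, the toy matrix, and the oversampling and power-iteration settings are all assumptions made for illustration. It visits X one row at a time to mimic streaming, checks the implicit products against a dense S (feasible only because the example is tiny), and then runs a basic randomized SVD that touches S only through those two products.

```julia
using LinearAlgebra, Random

# Toy contingency table (illustrative data, not from the package).
X = [10 2 3;
      4 8 1;
      2 3 9;
      6 1 5]

n   = sum(X)                      # grand total
r_p = vec(sum(X, dims=2)) ./ n    # row masses,    r_p = P 1
c_p = vec(sum(X, dims=1)) ./ n    # column masses, c_p = P' 1

# S * v = D_r^{-1/2} (P w - r_p (c_p' w)) with w = D_c^{-1/2} v,
# computed one row of X at a time (as if each row were read from disk).
function S_times(X, n, r_p, c_p, v)
    w  = v ./ sqrt.(c_p)
    cw = dot(c_p, w)
    [(dot(view(X, i, :), w) / n - r_p[i] * cw) / sqrt(r_p[i]) for i in axes(X, 1)]
end

# S' * u = D_c^{-1/2} (P' z - c_p (r_p' z)) with z = D_r^{-1/2} u,
# accumulated row by row.
function St_times(X, n, r_p, c_p, u)
    z   = u ./ sqrt.(r_p)
    acc = zeros(size(X, 2))
    for i in axes(X, 1)
        acc .+= (z[i] / n) .* view(X, i, :)
    end
    (acc .- c_p .* dot(r_p, z)) ./ sqrt.(c_p)
end

# Apply a vector-valued function to every column of A.
mapcols(f, A) = reduce(hcat, [f(A[:, j]) for j in axes(A, 2)])

# Basic randomized SVD that only needs the products S*v and S'*u.
function rsvd(Smul, Stmul, ncols, k; oversample=1, niter=2, rng=MersenneTwister(0))
    Ω = randn(rng, ncols, k + oversample)      # random test matrix
    Q = Matrix(qr(mapcols(Smul, Ω)).Q)         # orthonormal basis for range(S Ω)
    for _ in 1:niter                           # power iterations with re-orthonormalization
        Z = Matrix(qr(mapcols(Stmul, Q)).Q)
        Q = Matrix(qr(mapcols(Smul, Z)).Q)
    end
    F = svd(Matrix(mapcols(Stmul, Q)'))        # small matrix B = Q' S = (S' Q)'
    (U = Q * F.U[:, 1:k], S = F.S[1:k], V = F.V[:, 1:k])
end

# Dense reference, only possible because the example is tiny.
S_dense = Diagonal(1 ./ sqrt.(r_p)) * (X ./ n - r_p * c_p') * Diagonal(1 ./ sqrt.(c_p))

v = [1.0, -2.0, 0.5]
u = [0.3, 1.0, -1.0, 2.0]
println(S_times(X, n, r_p, c_p, v) ≈ S_dense * v)
println(St_times(X, n, r_p, c_p, u) ≈ S_dense' * u)

res = rsvd(v -> S_times(X, n, r_p, c_p, v),
           u -> St_times(X, n, r_p, c_p, u),
           size(X, 2), 2)
println(res.S)
```

Because S * sqrt.(c_p) = 0, the toy S has rank at most 2, so the two leading triplets reproduce it up to rounding. On real data the result is an approximation whose quality depends on the oversampling and the number of power iterations.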
Online CA is performed in the following two steps.
- Step.1 Binarization : We assume that the data is a contingency table of non-negative integer counts, saved as a comma-separated CSV, Matrix Market (MM), or Binary COO (BinCOO) file. Using the OnlinePCA package, these files are converted to Julia binary files by csv2bin, mm2bin, or bincoo2bin, respectively. This step greatly accelerates I/O.
- Step.2 Online CA :
  - ca can be performed against the binary file generated by csv2bin.
  - sparse_ca can be performed against the binary file generated by mm2bin.
  - bincoo_ca can be performed against the binary file generated by bincoo2bin.
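To make the motivation for the binarization step concrete, here is a rough sketch of the idea under loudly stated assumptions: the functions `toy_csv2bin` and `stream_masses` and the fixed-width layout (two Int64 header fields followed by UInt32 counts) are hypothetical and are not the format written by OnlinePCA's csv2bin. The sketch converts a small CSV to binary line by line, then streams the binary file one row at a time to accumulate the row and column masses without ever holding the full table in memory.

```julia
# Hypothetical layout: header (nrow, ncol) as Int64, then counts row by row as UInt32.
# This is NOT the on-disk format used by OnlinePCA; it only illustrates the idea.
function toy_csv2bin(csvfile, binfile)
    nrow, ncol = 0, 0
    open(binfile, "w") do io
        write(io, Int64(0), Int64(0))              # header placeholder
        for line in eachline(csvfile)              # one text row at a time
            row = parse.(UInt32, split(line, ','))
            ncol = length(row)
            nrow += 1
            write(io, row)                         # raw fixed-width integers
        end
        seekstart(io)
        write(io, Int64(nrow), Int64(ncol))        # fill in the real header
    end
    return nrow, ncol
end

# One pass over the binary file: no text parsing, one reusable row buffer.
function stream_masses(binfile)
    open(binfile, "r") do io
        nrow, ncol = read(io, Int64), read(io, Int64)
        row    = Vector{UInt32}(undef, ncol)
        rowsum = zeros(Int, nrow)
        colsum = zeros(Int, ncol)
        for i in 1:nrow
            read!(io, row)                         # overwrite the buffer in place
            rowsum[i] = sum(row)
            colsum  .+= row
        end
        n = sum(rowsum)
        (r_p = rowsum ./ n, c_p = colsum ./ n, n = n)
    end
end

dir = mktempdir()
csv = joinpath(dir, "toy.csv")
bin = joinpath(dir, "toy.bin")
write(csv, "10,2,3\n4,8,1\n2,3,9\n6,1,5\n")

dims = toy_csv2bin(csv, bin)
m    = stream_masses(bin)
println(dims, " ", m.n)
```

The binary pass avoids parsing text on every iteration, which matters because the data is streamed repeatedly, once for each pass of matrix–vector products.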
In addition, Multiple Correspondence Analysis (MCA) is supported via mca. It builds the indicator (complete-disjunctive) matrix from a categorical table and reuses bincoo_ca, with optional Benzécri / Greenacre eigenvalue corrections. Supplementary rows or columns can be projected into a trained CA / MCA space with project_rows / project_columns.
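As an illustration of what building the indicator matrix and applying the Benzécri correction involve, the sketch below works on a toy categorical table with a dense SVD. It is not how mca is implemented: the names `indicator_matrix`, `ca_inertias`, and `benzecri` and the toy data are assumptions, and the correction shown is the textbook form for indicator-matrix inertias (keep λ > 1/Q, then rescale by (Q/(Q-1))^2 (λ - 1/Q)^2, where Q is the number of categorical variables).

```julia
using LinearAlgebra

# Toy categorical table (illustrative): 6 respondents, Q = 3 variables.
table = ["a" "x" "p";
         "b" "y" "p";
         "a" "y" "q";
         "c" "x" "q";
         "b" "x" "p";
         "c" "y" "q"]

# One 0/1 column per (variable, category) pair; every row sums to Q.
function indicator_matrix(table)
    blocks = map(eachcol(table)) do col
        levels = sort(unique(col))
        Int[col[i] == l for i in eachindex(col), l in levels]
    end
    reduce(hcat, blocks)
end

# Principal inertias (squared singular values of S) from a dense CA.
# Dense computation is only reasonable at toy sizes.
function ca_inertias(Z)
    P = Z ./ sum(Z)
    r = vec(sum(P, dims=2))
    c = vec(sum(P, dims=1))
    S = (P .- r * c') ./ sqrt.(r) ./ sqrt.(c')
    svdvals(S) .^ 2
end

# Benzécri correction: drop inertias at or below 1/Q and rescale the rest.
function benzecri(λ, Q)
    kept = filter(>(1 / Q), λ)
    (Q / (Q - 1))^2 .* (kept .- 1 / Q) .^ 2
end

Z     = indicator_matrix(table)
Q     = size(table, 2)
λ     = ca_inertias(Z)
λ_adj = benzecri(λ, Q)

println(size(Z))
println(sum(λ))      # total inertia of an indicator matrix is J/Q - 1
println(λ_adj)
```

The uncorrected inertias of an indicator matrix always sum to J/Q - 1 (J being the total number of categories), which understates how much structure the leading axes capture. Corrections of this kind are meant to compensate for that.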
ca / sparse_ca / bincoo_ca are available both through the Julia API and as command line tools (see OnlineCA.jl (Julia API) and OnlineCA.jl (Command line tool)). mca is available through the Julia API only.
Reference
- Halko-style Randomized SVD on the implicit standardized residual matrix : Halko, N. et al., 2011, Halko, N. et al., 2011