Pooling Data (Representing Factors)
Often, we have to deal with factors that take on a small number of levels:
dv = @data(["Group A", "Group A", "Group A", "Group B", "Group B", "Group B"])
The naive encoding used in a DataArray represents every entry of this vector as a full string. In contrast, we can represent the data more efficiently by replacing the strings with indices into a small pool of levels. This is what the PooledDataArray does:
pdv = @pdata(["Group A", "Group A", "Group A", "Group B", "Group B", "Group B"])
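To make the idea of "indices into a small pool of levels" concrete, here is a minimal sketch in plain Julia that uses no packages. The helper name `pool_sketch` is hypothetical and exists only for illustration. This is not how PooledDataArray is implemented, only the general technique under the assumption that the pool is a plain vector of distinct values.

```julia
# Conceptual sketch only: `pool_sketch` is a hypothetical helper, not a package function.
function pool_sketch(values::Vector{String})
    # Distinct levels, kept in order of first appearance (an arbitrary choice for this sketch).
    pool = unique(values)
    # Map each level to its position in the pool.
    index = Dict(level => UInt32(i) for (i, level) in enumerate(pool))
    # Store one small integer per entry instead of one full string per entry.
    refs = [index[v] for v in values]
    return refs, pool
end

original = ["Group A", "Group A", "Group A", "Group B", "Group B", "Group B"]
refs, pool = pool_sketch(original)

# Indexing the pool with the stored references recovers the original data.
restored = pool[refs]
```

Each string is stored once in `pool`, and the vector itself holds only integer references, which is where the memory saving for repeated values comes from.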
In addition to representing repeated data efficiently, the PooledDataArray allows us to determine the levels of the factor at any time using the levels function:
levels(pdv)
By default, a PooledDataArray is able to represent 2^32 different levels. You can use less memory by calling the compact function:
pdv = compact(pdv)
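One plausible way to picture the saving is that the integer references can be stored in a narrower type when there are only a few levels. The sketch below illustrates that idea in plain Julia. The helper `smallest_ref_type` is hypothetical, and the sketch assumes references are unsigned integers; it is not a description of what compact does internally.

```julia
# Hypothetical helper: pick the narrowest unsigned integer type that can index `nlevels` levels.
function smallest_ref_type(nlevels::Integer)
    for T in (UInt8, UInt16, UInt32, UInt64)
        nlevels <= typemax(T) && return T
    end
    error("too many levels to index")
end

refs = UInt32[1, 1, 1, 2, 2, 2]       # references stored with a 32-bit default
T = smallest_ref_type(2)              # only two levels are in use
compacted = convert(Vector{T}, refs)  # same references, narrower element type

bytes_before = sizeof(refs)
bytes_after = sizeof(compacted)
```

With two levels, every reference fits in a single byte, so the reference vector shrinks to a quarter of its 32-bit size.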
Often, you will have factors encoded inside a DataFrame with DataArray columns instead of PooledDataArray columns. You can convert a single column using the pool function:
pdv = pool(dv)
Or you can convert the columns of a DataFrame in place using the pool! function:
df = DataFrame(A = [1, 1, 1, 2, 2, 2], B = ["X", "X", "X", "Y", "Y", "Y"])
pool!(df, [:A, :B])
Pooling columns is important for working with the GLM package. When fitting regression models, PooledDataArray columns in the input are translated into 0/1 indicator columns in the ModelMatrix, with one column for each of the levels of the PooledDataArray. This allows one to analyze categorical data efficiently.
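The translation into 0/1 indicator columns can be sketched in plain Julia as follows. The helper `indicator_columns` is hypothetical and only illustrates the encoding described above, with one column per level. It makes no claim about how GLM or ModelMatrix build their columns.

```julia
# Hypothetical helper: build one 0/1 indicator column per level of a categorical vector.
function indicator_columns(values::Vector{String})
    lvls = sort(unique(values))
    # Rows follow the entries of `values`; columns follow the sorted levels.
    X = [Int(v == l) for v in values, l in lvls]
    return lvls, X
end

lvls, X = indicator_columns(["X", "X", "X", "Y", "Y", "Y"])
# X[i, j] is 1 when entry i has level lvls[j], and 0 otherwise.
```

Each row of `X` contains exactly one 1, marking the level of the corresponding entry.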