Pooling Data (Representing Factors)

Often, we have to deal with factors that take on a small number of levels:

dv = @data(["Group A", "Group A", "Group A",
            "Group B", "Group B", "Group B"])

The naive encoding used in a DataArray represents every entry of this vector as a full string. In contrast, we can represent the data more efficiently by replacing the strings with indices into a small pool of levels. This is what the PooledDataArray does:

pdv = @pdata(["Group A", "Group A", "Group A",
              "Group B", "Group B", "Group B"])

In addition to representing repeated data efficiently, the PooledDataArray allows us to determine the levels of the factor at any time using the levels function:

levels(pdv)
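To make the pooled representation concrete, here is a minimal sketch in plain Julia of the idea described above: a small pool of distinct levels plus a vector of integer references into it. The helper `pool_sketch` is hypothetical and is not part of the package. The sorted level order and the `UInt32` reference type are assumptions made for illustration, not a description of PooledDataArray internals.

```julia
# Hypothetical helper (not a package function): split a vector of strings
# into a pool of distinct levels and integer references into that pool.
function pool_sketch(data::Vector{String})
    pool = sort(unique(data))                 # distinct levels (sorted here by choice)
    index = Dict(level => i for (i, level) in enumerate(pool))
    refs = UInt32[index[v] for v in data]     # one small integer per entry
    return refs, pool
end

groups = ["Group A", "Group A", "Group A",
          "Group B", "Group B", "Group B"]
refs, pool = pool_sketch(groups)

# Indexing the pool by the references recovers the original vector.
restored = pool[refs]
```

In this sketch the pool plays the role of the levels, and each entry costs one integer rather than one full string.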

By default, a PooledDataArray is able to represent 2³² different levels. If your factor has far fewer levels, you can use less memory by calling the compact function:

pdv = compact(pdv)
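The memory saving comes from storing the references in a narrower integer type. The following sketch in plain Julia shows the idea under the assumption that references are unsigned integers. The helper `smallest_ref_type` is hypothetical and only illustrates the principle, not how compact is actually implemented.

```julia
# Hypothetical helper: pick the narrowest unsigned integer type
# that can index `nlevels` distinct levels.
function smallest_ref_type(nlevels::Integer)
    for T in (UInt8, UInt16, UInt32, UInt64)
        nlevels <= typemax(T) && return T
    end
    error("too many levels")
end

refs = UInt32[1, 1, 1, 2, 2, 2]    # references stored with 32 bits each
pool = ["Group A", "Group B"]      # only two levels are in use

T = smallest_ref_type(length(pool))
compact_refs = convert(Vector{T}, refs)

# Compare the storage used by the two reference vectors, in bytes.
sizes = (sizeof(refs), sizeof(compact_refs))
```

With only two levels, an 8-bit reference is enough, so each entry needs a quarter of the space used by a 32-bit reference.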

Often, you will have factors encoded inside a DataFrame with DataArray columns instead of PooledDataArray columns. You can convert a single column using the pool function:

pdv = pool(dv)

Or you can convert several columns of a DataFrame in place using the pool! function:

df = DataFrame(A = [1, 1, 1, 2, 2, 2],
               B = ["X", "X", "X", "Y", "Y", "Y"])
pool!(df, [:A, :B])

Pooling columns is important for working with the GLM package. When fitting regression models, PooledDataArray columns in the input are translated into 0/1 indicator columns in the ModelMatrix, with one column for each of the levels of the PooledDataArray. This allows one to analyze categorical data efficiently.
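As a rough illustration of the indicator encoding described above, the sketch below builds one 0/1 column per level from a vector of integer references. It is written in plain Julia, and the helper `indicator_columns` is hypothetical: it is not the ModelMatrix code. Model-fitting code may also drop one reference level when an intercept is present, which this sketch does not attempt.

```julia
# Hypothetical helper: expand integer references into 0/1 indicator columns,
# one column per level.
function indicator_columns(refs::Vector{<:Integer}, nlevels::Integer)
    X = zeros(Int, length(refs), nlevels)
    for (row, r) in enumerate(refs)
        X[row, r] = 1          # mark the level that this row belongs to
    end
    return X
end

refs = [1, 1, 1, 2, 2, 2]      # references for "X", "X", "X", "Y", "Y", "Y"
X = indicator_columns(refs, 2)
```

Each row of `X` contains exactly one 1, in the column of that row's level.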