Data Manipulation

Joins

#Base.join — Function.

join(strings, delim, [last])

Join an array of strings into a single string, inserting the given delimiter between adjacent strings. If last is given, it will be used instead of delim between the last two strings. For example

join(["apples", "bananas", "pineapples"], ", ", " and ") == "apples, bananas and pineapples"

strings can be any iterable over elements x which are convertible to strings via print(io::IOBuffer, x).

Join two DataFrames

join(df1::AbstractDataFrame,
     df2::AbstractDataFrame;
     on::Union{Symbol, Vector{Symbol}} = Symbol[],
     kind::Symbol = :inner)

Arguments

df1, df2 : the two AbstractDataFrames to be joined

Keyword Arguments

on : a Symbol or Vector{Symbol}, the column(s) used as keys when joining; required argument except for kind = :cross
kind : the type of join, options include:
:inner : only include rows with keys that match in both df1 and df2, the default
:outer : include all rows from df1 and df2
:left : include all rows from df1
:right : include all rows from df2
:semi : return rows of df1 that match with the keys in df2
:anti : return rows of df1 that do not match with the keys in df2
:cross : a full Cartesian product of the key combinations; every row of df1 is matched with every row of df2

NAs are filled in where needed to complete joins.

Result

::DataFrame : the joined DataFrame

Examples

name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])

join(name, job, on = :ID)
join(name, job, on = :ID, kind = :outer)
join(name, job, on = :ID, kind = :left)
join(name, job, on = :ID, kind = :right)
join(name, job, on = :ID, kind = :semi)
join(name, job, on = :ID, kind = :anti)
join(name, job, kind = :cross)

Reshaping

#DataFrames.melt — Function.

Stacks a DataFrame; convert from a wide to long format; see stack.

#DataFrames.stack — Function.

Stacks a DataFrame; convert from a wide to long format

stack(df::AbstractDataFrame, measure_vars, id_vars)
stack(df::AbstractDataFrame, measure_vars)
stack(df::AbstractDataFrame)
melt(df::AbstractDataFrame, id_vars, measure_vars)
melt(df::AbstractDataFrame, id_vars)

Arguments

df : the AbstractDataFrame to be stacked
measure_vars : the columns to be stacked (the measurement variables), a normal column indexing type, like a Symbol, Vector{Symbol}, Int, etc.; for melt, defaults to all variables that are not id_vars
id_vars : the identifier columns that are repeated during stacking, a normal column indexing type; for stack defaults to all variables that are not measure_vars

If neither measure_vars or id_vars are given, measure_vars defaults to all floating point columns.

Result

::DataFrame : the long-format dataframe with column :value holding the values of the stacked columns (measure_vars), with column :variable a Vector of Symbols with the measure_vars name, and with columns for each of the id_vars.

See also stackdf and meltdf for stacking methods that return a view into the original DataFrame. See unstack for converting from long to wide format.

Examples

d1 = DataFrame(a = repeat([1:3;], inner = [4]),
               b = repeat([1:4;], inner = [3]),
               c = randn(12),
               d = randn(12),
               e = map(string, 'a':'l'))

d1s = stack(d1, [:c, :d])
d1s2 = stack(d1, [:c, :d], [:a])
d1m = melt(d1, [:a, :b, :e])

#DataFrames.unstack — Function.

Unstacks a DataFrame; convert from a long to wide format

unstack(df::AbstractDataFrame, rowkey, colkey, value)
unstack(df::AbstractDataFrame, colkey, value)
unstack(df::AbstractDataFrame)

Arguments

df : the AbstractDataFrame to be unstacked
rowkey : the column with a unique key for each row, if not given, find a key by grouping on anything not a colkey or value
colkey : the column holding the column names in wide format, defaults to :variable
value : the value column, defaults to :value

Result

::DataFrame : the wide-format dataframe

Examples

wide = DataFrame(id = 1:12,
                 a  = repeat([1:3;], inner = [4]),
                 b  = repeat([1:4;], inner = [3]),
                 c  = randn(12),
                 d  = randn(12))

long = stack(wide)
wide0 = unstack(long)
wide1 = unstack(long, :variable, :value)
wide2 = unstack(long, :id, :variable, :value)

Note that there are some differences between the widened results above.

#DataFrames.stackdf — Function.

A stacked view of a DataFrame (long format)

Like stack and melt, but a view is returned rather than data copies.

stackdf(df::AbstractDataFrame, measure_vars, id_vars)
stackdf(df::AbstractDataFrame, measure_vars)
meltdf(df::AbstractDataFrame, id_vars, measure_vars)
meltdf(df::AbstractDataFrame, id_vars)

Arguments

df : the wide AbstractDataFrame
measure_vars : the columns to be stacked (the measurement variables), a normal column indexing type, like a Symbol, Vector{Symbol}, Int, etc.; for melt, defaults to all variables that are not id_vars
id_vars : the identifier columns that are repeated during stacking, a normal column indexing type; for stack defaults to all variables that are not measure_vars

Result

::DataFrame : the long-format dataframe with column :value holding the values of the stacked columns (measure_vars), with column :variable a Vector of Symbols with the measure_vars name, and with columns for each of the id_vars.

The result is a view because the columns are special AbstractVectors that return indexed views into the original DataFrame.

Examples

d1 = DataFrame(a = repeat([1:3;], inner = [4]),
               b = repeat([1:4;], inner = [3]),
               c = randn(12),
               d = randn(12),
               e = map(string, 'a':'l'))

d1s = stackdf(d1, [:c, :d])
d1s2 = stackdf(d1, [:c, :d], [:a])
d1m = meltdf(d1, [:a, :b, :e])

#DataFrames.meltdf — Function.

A stacked view of a DataFrame (long format); see stackdf