Data Manipulation

Joins

#Base.joinFunction.


join(strings, delim, [last])

Join an array of strings into a single string, inserting the given delimiter between adjacent strings. If last is given, it will be used instead of delim between the last two strings. For example

join(["apples", "bananas", "pineapples"], ", ", " and ") == "apples, bananas and pineapples"

strings can be any iterable over elements x which are convertible to strings via print(io::IOBuffer, x).

Join two DataFrames

join(df1::AbstractDataFrame,
     df2::AbstractDataFrame;
     on::Union{Symbol, Vector{Symbol}} = Symbol[],
     kind::Symbol = :inner)

Arguments

  • df1, df2 : the two AbstractDataFrames to be joined

Keyword Arguments

  • on : a Symbol or Vector{Symbol}, the column(s) used as keys when joining; required argument except for kind = :cross

  • kind : the type of join, options include:

  • :inner : only include rows with keys that match in both df1 and df2, the default

  • :outer : include all rows from df1 and df2
  • :left : include all rows from df1
  • :right : include all rows from df2
  • :semi : return rows of df1 that match with the keys in df2
  • :anti : return rows of df1 that do not match with the keys in df2
  • :cross : a full Cartesian product of the key combinations; every row of df1 is matched with every row of df2

NAs are filled in where needed to complete joins.

Result

  • ::DataFrame : the joined DataFrame

Examples

name = DataFrame(ID = [1, 2, 3], Name = ["John Doe", "Jane Doe", "Joe Blogs"])
job = DataFrame(ID = [1, 2, 4], Job = ["Lawyer", "Doctor", "Farmer"])

join(name, job, on = :ID)
join(name, job, on = :ID, kind = :outer)
join(name, job, on = :ID, kind = :left)
join(name, job, on = :ID, kind = :right)
join(name, job, on = :ID, kind = :semi)
join(name, job, on = :ID, kind = :anti)
join(name, job, kind = :cross)

Reshaping

#DataFrames.meltFunction.


Stacks a DataFrame; convert from a wide to long format; see stack.

#DataFrames.stackFunction.


Stacks a DataFrame; convert from a wide to long format

stack(df::AbstractDataFrame, measure_vars, id_vars)
stack(df::AbstractDataFrame, measure_vars)
stack(df::AbstractDataFrame)
melt(df::AbstractDataFrame, id_vars, measure_vars)
melt(df::AbstractDataFrame, id_vars)

Arguments

  • df : the AbstractDataFrame to be stacked

  • measure_vars : the columns to be stacked (the measurement variables), a normal column indexing type, like a Symbol, Vector{Symbol}, Int, etc.; for melt, defaults to all variables that are not id_vars

  • id_vars : the identifier columns that are repeated during stacking, a normal column indexing type; for stack defaults to all variables that are not measure_vars

If neither measure_vars or id_vars are given, measure_vars defaults to all floating point columns.

Result

  • ::DataFrame : the long-format dataframe with column :value holding the values of the stacked columns (measure_vars), with column :variable a Vector of Symbols with the measure_vars name, and with columns for each of the id_vars.

See also stackdf and meltdf for stacking methods that return a view into the original DataFrame. See unstack for converting from long to wide format.

Examples

d1 = DataFrame(a = repeat([1:3;], inner = [4]),
               b = repeat([1:4;], inner = [3]),
               c = randn(12),
               d = randn(12),
               e = map(string, 'a':'l'))

d1s = stack(d1, [:c, :d])
d1s2 = stack(d1, [:c, :d], [:a])
d1m = melt(d1, [:a, :b, :e])

#DataFrames.unstackFunction.


Unstacks a DataFrame; convert from a long to wide format

unstack(df::AbstractDataFrame, rowkey, colkey, value)
unstack(df::AbstractDataFrame, colkey, value)
unstack(df::AbstractDataFrame)

Arguments

  • df : the AbstractDataFrame to be unstacked

  • rowkey : the column with a unique key for each row, if not given, find a key by grouping on anything not a colkey or value

  • colkey : the column holding the column names in wide format, defaults to :variable

  • value : the value column, defaults to :value

Result

  • ::DataFrame : the wide-format dataframe

Examples

wide = DataFrame(id = 1:12,
                 a  = repeat([1:3;], inner = [4]),
                 b  = repeat([1:4;], inner = [3]),
                 c  = randn(12),
                 d  = randn(12))

long = stack(wide)
wide0 = unstack(long)
wide1 = unstack(long, :variable, :value)
wide2 = unstack(long, :id, :variable, :value)

Note that there are some differences between the widened results above.

#DataFrames.stackdfFunction.


A stacked view of a DataFrame (long format)

Like stack and melt, but a view is returned rather than data copies.

stackdf(df::AbstractDataFrame, measure_vars, id_vars)
stackdf(df::AbstractDataFrame, measure_vars)
meltdf(df::AbstractDataFrame, id_vars, measure_vars)
meltdf(df::AbstractDataFrame, id_vars)

Arguments

  • df : the wide AbstractDataFrame

  • measure_vars : the columns to be stacked (the measurement variables), a normal column indexing type, like a Symbol, Vector{Symbol}, Int, etc.; for melt, defaults to all variables that are not id_vars

  • id_vars : the identifier columns that are repeated during stacking, a normal column indexing type; for stack defaults to all variables that are not measure_vars

Result

  • ::DataFrame : the long-format dataframe with column :value holding the values of the stacked columns (measure_vars), with column :variable a Vector of Symbols with the measure_vars name, and with columns for each of the id_vars.

The result is a view because the columns are special AbstractVectors that return indexed views into the original DataFrame.

Examples

d1 = DataFrame(a = repeat([1:3;], inner = [4]),
               b = repeat([1:4;], inner = [3]),
               c = randn(12),
               d = randn(12),
               e = map(string, 'a':'l'))

d1s = stackdf(d1, [:c, :d])
d1s2 = stackdf(d1, [:c, :d], [:a])
d1m = meltdf(d1, [:a, :b, :e])

#DataFrames.meltdfFunction.


A stacked view of a DataFrame (long format); see stackdf