Title: | R Bindings for the 'polars' Rust Library |
---|---|
Description: | Lightning-fast 'DataFrame' library written in 'Rust'. Convert R data to 'Polars' data and vice versa. Perform fast, lazy, larger-than-memory and optimized data queries. 'Polars' is interoperable with the package 'arrow', as both are based on the 'Apache Arrow' Columnar Format. |
Authors: | Tatsuya Shima [aut, cre], Authors of the dependency Rust crates [aut] |
Maintainer: | Tatsuya Shima <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.0.0.9000 |
Built: | 2025-01-19 06:23:49 UTC |
Source: | https://github.com/eitsupi/neo-r-polars |
The as_polars_df() function creates a polars DataFrame from various R objects.
A polars DataFrame is based on a sequence of polars Series, so the input object
is first converted to a list of polars Series by as_polars_series(), and then
a polars DataFrame is created from that list.
as_polars_df(x, ...)

## Default S3 method:
as_polars_df(x, ...)

## S3 method for class 'polars_series'
as_polars_df(x, ..., column_name = NULL, from_struct = TRUE)

## S3 method for class 'polars_data_frame'
as_polars_df(x, ...)

## S3 method for class 'polars_group_by'
as_polars_df(x, ...)

## S3 method for class 'polars_lazy_frame'
as_polars_df(
  x,
  ...,
  type_coercion = TRUE,
  predicate_pushdown = TRUE,
  projection_pushdown = TRUE,
  simplify_expression = TRUE,
  slice_pushdown = TRUE,
  comm_subplan_elim = TRUE,
  comm_subexpr_elim = TRUE,
  cluster_with_columns = TRUE,
  no_optimization = FALSE,
  streaming = FALSE
)

## S3 method for class 'list'
as_polars_df(x, ...)

## S3 method for class 'data.frame'
as_polars_df(x, ...)

## S3 method for class 'NULL'
as_polars_df(x, ...)
x |
An R object. |
... |
Additional arguments passed to the methods. |
column_name |
A character or NULL. If not NULL, it is passed to the name argument of the $to_frame() method. |
from_struct |
A logical. If TRUE (default) and the Series has a struct data type, the $struct$unnest() method is used instead of $to_frame(). |
type_coercion |
A logical, indicating type coercion optimization. |
predicate_pushdown |
A logical, indicating predicate pushdown optimization. |
projection_pushdown |
A logical, indicating projection pushdown optimization. |
simplify_expression |
A logical, indicating expression simplification optimization. |
slice_pushdown |
A logical, indicating slice pushdown optimization. |
comm_subplan_elim |
A logical, indicating whether to try to cache branching subplans that occur on self-joins or unions. |
comm_subexpr_elim |
A logical, indicating whether to try to cache common subexpressions. |
cluster_with_columns |
A logical, indicating whether to combine sequential independent calls to with_columns. |
no_optimization |
A logical. If TRUE, turn off the optimizations above. |
streaming |
A logical. If TRUE, run parts of the query in a streaming fashion. |
The default method of as_polars_df() throws an error, so we need to define
methods for the classes we want to support.

The argument ... (except name) is passed to as_polars_series() for each
element of the list. All elements of the list must be converted to Series of
the same length by as_polars_series(). The name of each element is used as
the column name of the DataFrame. For unnamed elements, the column name will
be an empty string "", or, if the element is a Series, the column name will
be the name of the Series.
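A minimal sketch of these naming rules (the values are illustrative):

# Named element -> column "a"; unnamed Series -> column "b";
# unnamed plain vector -> column ""
as_polars_df(list(a = 1:2, as_polars_series(3:4, "b"), c("x", "y")))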
The argument ... (except name) is passed to as_polars_series() for each
column. All columns must be converted to Series of the same length by
as_polars_series().
This is a shortcut for <Series>$to_frame() or <Series>$struct$unnest(),
depending on the from_struct argument and the Series data type.
The column_name argument is passed to the name argument of the
$to_frame() method.
This is a shortcut for <LazyFrame>$collect().
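For example, the following sketch collects a LazyFrame built with as_polars_lf():

lf <- as_polars_lf(data.frame(a = 1:2))
as_polars_df(lf) # equivalent to lf$collect()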
A polars DataFrame
as.list(<polars_data_frame>): Export the DataFrame as an R list.
as.data.frame(<polars_data_frame>): Export the DataFrame as an R data frame.
# list
as_polars_df(list(a = 1:2, b = c("foo", "bar")))

# data.frame
as_polars_df(data.frame(a = 1:2, b = c("foo", "bar")))

# polars_series
s_int <- as_polars_series(1:2, "a")
s_struct <- as_polars_series(
  data.frame(a = 1:2, b = c("foo", "bar")),
  "struct"
)

## Use the Series as a column
as_polars_df(s_int)
as_polars_df(s_struct, column_name = "values", from_struct = FALSE)

## Unnest the struct data
as_polars_df(s_struct)
The as_polars_expr() function creates a polars expression from various R
objects. This function is used internally by various polars functions that
accept expressions. In most cases, users should use pl$lit() instead of this
function, which is a shorthand for as_polars_expr(x, as_lit = TRUE).
(In other words, this function can be considered an internal implementation
to realize the lit function of the Polars API in other languages.)
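For example, the following two calls produce the same literal expression
(per the shorthand noted above):

pl$lit("a")
as_polars_expr("a", as_lit = TRUE)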
as_polars_expr(x, ...)

## Default S3 method:
as_polars_expr(x, ...)

## S3 method for class 'polars_expr'
as_polars_expr(x, ..., structify = FALSE)

## S3 method for class 'polars_series'
as_polars_expr(x, ...)

## S3 method for class 'character'
as_polars_expr(x, ..., as_lit = FALSE)

## S3 method for class 'logical'
as_polars_expr(x, ...)

## S3 method for class 'integer'
as_polars_expr(x, ...)

## S3 method for class 'double'
as_polars_expr(x, ...)

## S3 method for class 'raw'
as_polars_expr(x, ...)

## S3 method for class 'NULL'
as_polars_expr(x, ...)
x |
An R object. |
... |
Additional arguments passed to the methods. |
structify |
A logical. If TRUE, convert multi-column expressions to a single struct expression. |
as_lit |
A logical value indicating whether to treat the vector as literal values or not.
This argument is always set to TRUE when this function is called from pl$lit(). |
Because R objects are typically mapped to Series, this function often calls
as_polars_series() internally. However, unlike R, Polars has scalars of
length 1, so if an R object is converted to a Series of length 1, this
function gets the first value of the Series and converts it to a scalar
literal. If you want to implement your own conversion from an R class to a
Polars object, define an S3 method for as_polars_series() instead of this
function.
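A hypothetical sketch of that advice, for a made-up class "my_class" whose
data is an atomic vector underneath (the class name and the unclass()
conversion are illustrative, not part of the package):

# Hypothetical S3 method: convert the underlying data and keep the name
as_polars_series.my_class <- function(x, name = NULL, ...) {
  as_polars_series(unclass(x), name = name, ...)
}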
Create a Series by calling as_polars_series() and then convert that Series
to an Expr. If the length of the Series is 1, it will be converted to a
scalar value. Additional arguments ... are passed to as_polars_series().
If the as_lit argument is FALSE (default), this function will call pl$col()
and the character vector is treated as column names.
A polars expression
Since R has no scalar class, each of the following types of length 1 is
specially converted to a scalar literal:

character: String
logical: Boolean
integer: Int32
double: Float64

The NA of these types is converted to a null literal, cast to the
corresponding Polars type.

A raw vector is converted to a Binary scalar:

raw: Binary

NULL is converted to a null literal of the Null type:

NULL: Null

For other R classes, the default S3 method is called and the R object is
converted via as_polars_series(). So the type mapping is defined by
as_polars_series().

as_polars_series(): R -> Polars type mapping is mostly defined by this function.
# character
## as_lit = FALSE (default)
as_polars_expr("a") # Same as `pl$col("a")`
as_polars_expr(c("a", "b")) # Same as `pl$col("a", "b")`

## as_lit = TRUE
as_polars_expr(character(0), as_lit = TRUE)
as_polars_expr("a", as_lit = TRUE)
as_polars_expr(NA_character_, as_lit = TRUE)
as_polars_expr(c("a", "b"), as_lit = TRUE)

# logical
as_polars_expr(logical(0))
as_polars_expr(TRUE)
as_polars_expr(NA)
as_polars_expr(c(TRUE, FALSE))

# integer
as_polars_expr(integer(0))
as_polars_expr(1L)
as_polars_expr(NA_integer_)
as_polars_expr(c(1L, 2L))

# double
as_polars_expr(double(0))
as_polars_expr(1)
as_polars_expr(NA_real_)
as_polars_expr(c(1, 2))

# raw
as_polars_expr(raw(0))
as_polars_expr(charToRaw("foo"))

# NULL
as_polars_expr(NULL)

# default method (for list)
as_polars_expr(list())
as_polars_expr(list(1))
as_polars_expr(list(1, 2))

# default method (for Date)
as_polars_expr(as.Date(integer(0)))
as_polars_expr(as.Date("2021-01-01"))
as_polars_expr(as.Date(c("2021-01-01", "2021-01-02")))

# polars_series
## Unlike the default method, this method does not extract the first value
as_polars_series(1) |> as_polars_expr()

# polars_expr
as_polars_expr(pl$col("a", "b"))
as_polars_expr(pl$col("a", "b"), structify = TRUE)
The as_polars_lf() function creates a LazyFrame from various R objects.
It is basically a shortcut for as_polars_df(x, ...) followed by the
$lazy() method.
as_polars_lf(x, ...)

## Default S3 method:
as_polars_lf(x, ...)

## S3 method for class 'polars_lazy_frame'
as_polars_lf(x, ...)
x |
An R object. |
... |
Additional arguments passed to the methods. |
Create a DataFrame by calling as_polars_df() and then create a LazyFrame
from the DataFrame. Additional arguments ... are passed to as_polars_df().
A polars LazyFrame
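A minimal usage sketch, mirroring the as_polars_df() examples above:

lf <- as_polars_lf(data.frame(a = 1:2, b = c("foo", "bar")))
lf
lf$collect()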
The as_polars_series() function creates a polars Series from various R
objects. The data type of the Series is determined by the class of the input
object.
as_polars_series(x, name = NULL, ...)

## Default S3 method:
as_polars_series(x, name = NULL, ...)

## S3 method for class 'polars_series'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'polars_data_frame'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'double'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'integer'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'character'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'logical'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'raw'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'factor'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'Date'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'POSIXct'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'POSIXlt'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'difftime'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'hms'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'blob'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'array'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'NULL'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'list'
as_polars_series(x, name = NULL, ..., strict = FALSE)

## S3 method for class 'AsIs'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'data.frame'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'integer64'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'ITime'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'vctrs_unspecified'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'vctrs_rcrd'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'clock_time_point'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'clock_sys_time'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'clock_zoned_time'
as_polars_series(x, name = NULL, ...)

## S3 method for class 'clock_duration'
as_polars_series(x, name = NULL, ...)
x |
An R object. |
name |
A single string or NULL. The name of the Series; if NULL, an empty string ("") is used. |
... |
Additional arguments passed to the methods. |
strict |
A logical value indicating whether to throw an error if the input list's
elements have different data types. If FALSE (default), all elements are
cast to a common type; elements that fail to cast become null. |
The default method of as_polars_series() throws an error, so we need to
define S3 methods for the classes we want to support.

In R, a list can contain elements of different types, but in Polars
(Apache Arrow), all elements must have the same type. So the
as_polars_series() function automatically casts all elements to the same
type or throws an error, depending on the strict argument. If you want to
create a list with all elements of the same type in R, consider using the
vctrs::list_of() function.

Since a list can contain another list, the strict argument is also used
when creating Series from the inner list in the case of classes constructed
on top of a list, such as data.frame or vctrs_rcrd.
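A small sketch of that suggestion: vctrs::list_of() enforces a common
element type on the R side before conversion (assuming the list method
handles the vctrs_list_of subclass via ordinary S3 dispatch):

if (requireNamespace("vctrs", quietly = TRUE)) {
  as_polars_series(vctrs::list_of(1:2, 3L))
}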
Sub-day values will be ignored (floored to the day).
Sub-millisecond values will be ignored (floored to the millisecond).
If the tzone attribute is not present or is an empty string (""),
the Series' dtype will be Datetime without timezone.
Sub-nanosecond values will be ignored (floored to the nanosecond).
Sub-millisecond values will be rounded to milliseconds.
Sub-nanosecond values will be ignored (floored to the nanosecond).
If the hms vector contains values greater than or equal to 24:00:00 or less than 00:00:00, an error will be thrown.
Calendrical durations (years, quarters, months) are treated chronologically, based on their internal representation in seconds. Please check the clock_duration documentation for more details.
This method is a shortcut for <DataFrame>$to_struct().
<Series>$to_r_vector(): Export the Series as an R vector.
as_polars_df(): Create a Polars DataFrame from an R object.
# double
as_polars_series(c(NA, 1, 2))

# integer
as_polars_series(c(NA, 1:2))

# character
as_polars_series(c(NA, "foo", "bar"))

# logical
as_polars_series(c(NA, TRUE, FALSE))

# raw
as_polars_series(charToRaw("foo"))

# factor
as_polars_series(factor(c(NA, "a", "b")))

# Date
as_polars_series(as.Date(c(NA, "2021-01-01")))

## Sub-day precision will be ignored
as.Date(c(-0.5, 0, 0.5)) |> as_polars_series()

# POSIXct with timezone
as_polars_series(as.POSIXct(c(NA, "2021-01-01 00:00:00.123456789"), "UTC"))

# POSIXct without timezone
as_polars_series(as.POSIXct(c(NA, "2021-01-01 00:00:00.123456789")))

# POSIXlt
as_polars_series(as.POSIXlt(c(NA, "2021-01-01 00:00:00.123456789"), "UTC"))

# difftime
as_polars_series(as.difftime(c(NA, 1), units = "days"))

## Sub-millisecond values will be rounded to milliseconds
as.difftime(c(0.0005, 0.0010, 0.0015, 0.0020), units = "secs") |> as_polars_series()
as.difftime(c(0.0005, 0.0010, 0.0015, 0.0020), units = "weeks") |> as_polars_series()

# NULL
as_polars_series(NULL)

# list
as_polars_series(list(NA, NULL, list(), 1, "foo", TRUE))

## 1st element will be `null` due to the casting failure
as_polars_series(list(list("bar"), "foo"))

# data.frame
as_polars_series(
  data.frame(x = 1:2, y = c("foo", "bar"), z = I(list(1, 2)))
)

# vctrs_unspecified
if (requireNamespace("vctrs", quietly = TRUE)) {
  as_polars_series(vctrs::unspecified(3L))
}

# hms
if (requireNamespace("hms", quietly = TRUE)) {
  as_polars_series(hms::as_hms(c(NA, "01:00:00")))
}

# blob
if (requireNamespace("blob", quietly = TRUE)) {
  as_polars_series(blob::as_blob(c(NA, "foo", "bar")))
}

# integer64
if (requireNamespace("bit64", quietly = TRUE)) {
  as_polars_series(bit64::as.integer64(c(NA, "9223372036854775807")))
}

# clock_naive_time
if (requireNamespace("clock", quietly = TRUE)) {
  as_polars_series(clock::naive_time_parse(c(
    NA,
    "1900-01-01T12:34:56.123456789",
    "2020-01-01T12:34:56.123456789"
  ), precision = "nanosecond"))
}

# clock_duration
if (requireNamespace("clock", quietly = TRUE)) {
  as_polars_series(clock::duration_nanoseconds(c(NA, 1)))
}

## Calendrical durations are treated chronologically
if (requireNamespace("clock", quietly = TRUE)) {
  as_polars_series(clock::duration_years(c(NA, 1)))
}
This S3 method is basically a shortcut for
as_polars_df(x, ...)$to_struct()$to_r_vector(ensure_vector = FALSE, struct = "tibble").
Additionally, you can check or repair the column names by specifying the
.name_repair argument, because a polars DataFrame allows empty column names,
which are not generally valid column names in an R data frame.
## S3 method for class 'polars_data_frame'
as_tibble(
  x,
  ...,
  .name_repair = c("check_unique", "unique", "universal", "minimal"),
  int64 = c("double", "character", "integer", "integer64"),
  date = c("Date", "IDate"),
  time = c("hms", "ITime"),
  decimal = c("double", "character"),
  as_clock_class = FALSE,
  ambiguous = c("raise", "earliest", "latest", "null"),
  non_existent = c("raise", "null")
)

## S3 method for class 'polars_lazy_frame'
as_tibble(
  x,
  ...,
  .name_repair = c("check_unique", "unique", "universal", "minimal"),
  int64 = c("double", "character", "integer", "integer64"),
  date = c("Date", "IDate"),
  time = c("hms", "ITime"),
  decimal = c("double", "character"),
  as_clock_class = FALSE,
  ambiguous = c("raise", "earliest", "latest", "null"),
  non_existent = c("raise", "null")
)
x |
A polars object |
... |
Passed to as_polars_df(). |
.name_repair |
Treatment of problematic column names:
This argument is passed on as repair to vctrs::vec_as_names(). |
int64 |
Determine how to convert Polars' Int64, UInt32, or UInt64 type values to an R type. One of the following:
|
date |
Determine how to convert Polars' Date type values to an R class. One of the following:
|
time |
Determine how to convert Polars' Time type values to an R class. One of the following:
|
decimal |
Determine how to convert Polars' Decimal type values to an R type. One of the following: |
as_clock_class |
A logical value indicating whether to export datetimes and durations as the clock package's classes.
|
ambiguous |
Determine how to deal with ambiguous datetimes.
Only applicable when
|
non_existent |
Determine how to deal with non-existent datetimes.
Only applicable when
|
A tibble
as.data.frame(<polars_object>): Export the polars object as a basic data frame.
# Polars DataFrame may have empty column name
df <- pl$DataFrame(x = 1:2, c("a", "b"))
df

# Without checking or repairing the column names
tibble::as_tibble(df, .name_repair = "minimal")
tibble::as_tibble(df$lazy(), .name_repair = "minimal")

# You can make that unique
tibble::as_tibble(df, .name_repair = "unique")
tibble::as_tibble(df$lazy(), .name_repair = "unique")
This S3 method is a shortcut for
as_polars_df(x, ...)$to_struct()$to_r_vector(ensure_vector = FALSE, struct = "dataframe").
## S3 method for class 'polars_data_frame'
as.data.frame(
  x,
  ...,
  int64 = c("double", "character", "integer", "integer64"),
  date = c("Date", "IDate"),
  time = c("hms", "ITime"),
  decimal = c("double", "character"),
  as_clock_class = FALSE,
  ambiguous = c("raise", "earliest", "latest", "null"),
  non_existent = c("raise", "null")
)

## S3 method for class 'polars_lazy_frame'
as.data.frame(
  x,
  ...,
  int64 = c("double", "character", "integer", "integer64"),
  date = c("Date", "IDate"),
  time = c("hms", "ITime"),
  decimal = c("double", "character"),
  as_clock_class = FALSE,
  ambiguous = c("raise", "earliest", "latest", "null"),
  non_existent = c("raise", "null")
)
x |
A polars object |
... |
Passed to as_polars_df(). |
int64 |
Determine how to convert Polars' Int64, UInt32, or UInt64 type values to an R type. One of the following:
|
date |
Determine how to convert Polars' Date type values to an R class. One of the following:
|
time |
Determine how to convert Polars' Time type values to an R class. One of the following:
|
decimal |
Determine how to convert Polars' Decimal type values to an R type. One of the following: |
as_clock_class |
A logical value indicating whether to export datetimes and durations as the clock package's classes.
|
ambiguous |
Determine how to deal with ambiguous datetimes.
Only applicable when
|
non_existent |
Determine how to deal with non-existent datetimes.
Only applicable when
|
An R data frame
df <- as_polars_df(list(a = 1:3, b = 4:6))

as.data.frame(df)
as.data.frame(df$lazy())
This S3 method calls as_polars_df(x, ...)$get_columns() or
as_polars_df(x, ...)$to_struct()$to_r_vector(ensure_vector = TRUE),
depending on the as_series argument.
## S3 method for class 'polars_data_frame'
as.list(
  x,
  ...,
  as_series = FALSE,
  int64 = c("double", "character", "integer", "integer64"),
  date = c("Date", "IDate"),
  time = c("hms", "ITime"),
  struct = c("dataframe", "tibble"),
  decimal = c("double", "character"),
  as_clock_class = FALSE,
  ambiguous = c("raise", "earliest", "latest", "null"),
  non_existent = c("raise", "null")
)

## S3 method for class 'polars_lazy_frame'
as.list(
  x,
  ...,
  as_series = FALSE,
  int64 = c("double", "character", "integer", "integer64"),
  date = c("Date", "IDate"),
  time = c("hms", "ITime"),
  struct = c("dataframe", "tibble"),
  decimal = c("double", "character"),
  as_clock_class = FALSE,
  ambiguous = c("raise", "earliest", "latest", "null"),
  non_existent = c("raise", "null")
)
x |
A polars object |
... |
Passed to as_polars_df(). |
as_series |
Whether to convert each column to an R vector or a polars Series.
If TRUE, each column is returned as a polars Series; otherwise, as an R vector. |
int64 |
Determine how to convert Polars' Int64, UInt32, or UInt64 type values to an R type. One of the following:
|
date |
Determine how to convert Polars' Date type values to an R class. One of the following:
|
time |
Determine how to convert Polars' Time type values to an R class. One of the following:
|
struct |
Determine how to convert Polars' Struct type values to an R class. One of the following:
|
decimal |
Determine how to convert Polars' Decimal type values to an R type. One of the following: |
as_clock_class |
A logical value indicating whether to export datetimes and durations as the clock package's classes.
|
ambiguous |
Determine how to deal with ambiguous datetimes.
Only applicable when
|
non_existent |
Determine how to deal with non-existent datetimes.
Only applicable when
|
Arguments other than x and as_series are passed to <Series>$to_r_vector(),
so they are ignored when as_series = TRUE.
A list
df <- as_polars_df(list(a = 1:3, b = 4:6))

as.list(df, as_series = TRUE)
as.list(df, as_series = FALSE)

as.list(df$lazy(), as_series = TRUE)
as.list(df$lazy(), as_series = FALSE)
Functions to check if the object is a polars object.

is_* functions return TRUE or FALSE depending on the class of the object.

check_* functions throw an informative error if the object is not the
correct class. Suffixes correspond to the polars object classes:

*_dtype: For polars data types.
*_df: For polars data frames.
*_expr: For polars expressions.
*_lf: For polars lazy frames.
*_selector: For polars selectors.
*_series: For polars series.
is_polars_dtype(x)

is_polars_df(x)

is_polars_expr(x, ...)

is_polars_lf(x)

is_polars_selector(x, ...)

is_polars_series(x)

is_list_of_polars_dtype(x, n = NULL)

check_polars_dtype(
  x,
  ...,
  allow_null = FALSE,
  arg = caller_arg(x),
  call = caller_env()
)

check_polars_df(
  x,
  ...,
  allow_null = FALSE,
  arg = caller_arg(x),
  call = caller_env()
)

check_polars_expr(
  x,
  ...,
  allow_null = FALSE,
  arg = caller_arg(x),
  call = caller_env()
)

check_polars_lf(
  x,
  ...,
  allow_null = FALSE,
  arg = caller_arg(x),
  call = caller_env()
)

check_polars_selector(
  x,
  ...,
  allow_null = FALSE,
  arg = caller_arg(x),
  call = caller_env()
)

check_polars_series(
  x,
  ...,
  allow_null = FALSE,
  arg = caller_arg(x),
  call = caller_env()
)

check_list_of_polars_dtype(
  x,
  ...,
  allow_null = FALSE,
  arg = caller_arg(x),
  call = caller_env()
)
x |
An object to check. |
... |
Arguments passed to |
n |
Expected length of a vector. |
allow_null |
If TRUE, NULL is allowed as a valid input. |
arg |
An argument name as a string. This argument will be mentioned in error messages as the input that is at the origin of a problem. |
call |
The execution environment of a currently running function, e.g. caller_env(). |
check_polars_* functions are derived from the standalone-types-check
functions from the rlang package (can be installed with
usethis::use_standalone("r-lib/rlang", file = "types-check")).
is_polars_* functions return TRUE or FALSE.
check_polars_* functions return NULL invisibly if the input is valid.
is_polars_df(as_polars_df(mtcars))
is_polars_df(mtcars)

# Use `check_polars_*` functions in a function
# to ensure the input is a polars object
sample_func <- function(x) {
  check_polars_df(x)
  TRUE
}
sample_func(as_polars_df(mtcars))
try(sample_func(mtcars))
cs is an environment class object that stores all selector functions of the
R Polars API, which mimics the Python Polars API. It is intended to work the
same way as in Python when you import Python Polars Selectors with
import polars.selectors as cs.

cs

An object of class polars_object of length 29.
There are 4 supported operators for selectors:

& to combine conditions with AND, e.g. select columns that contain "oo" and end with "t" with cs$contains("oo") & cs$ends_with("t");
| to combine conditions with OR, e.g. select columns that contain "oo" or end with "t" with cs$contains("oo") | cs$ends_with("t");
- to subtract conditions, e.g. select all columns that have alphanumeric names except those that contain "a" with cs$alphanumeric() - cs$contains("a");
! to invert the selection, e.g. select all columns that are not of data type String with !cs$string().

Note that Python Polars uses ~ instead of ! to invert selectors.
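A minimal sketch of these operators on a toy DataFrame (column names chosen
for illustration):

df <- pl$DataFrame(
  foo = c("x", "y"),
  bat = c(1, 2),
  boot = c(TRUE, FALSE),
  zap = c(0.5, 1.5)
)

# AND: contains "oo" and ends with "t" -> boot
df$select(cs$contains("oo") & cs$ends_with("t"))

# OR: contains "oo" or ends with "t" -> foo, bat, boot
df$select(cs$contains("oo") | cs$ends_with("t"))

# Subtraction: alphanumeric names except those containing "a" -> foo, boot
df$select(cs$alphanumeric() - cs$contains("a"))

# Inversion: everything that does not contain "oo" -> bat, zap
df$select(!cs$contains("oo"))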
cs

# How many members are in the `cs` environment?
length(cs)
Select all columns
cs__all()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(dt = as.Date(c("2000-1-1")), value = 10)

# Select all columns, casting them to string:
df$select(cs$all()$cast(pl$String))

# Select all columns except for those matching the given dtypes:
df$select(cs$all() - cs$numeric())
Select all columns with alphabetic names (i.e. only letters)
cs__alpha(ascii_only = FALSE, ..., ignore_spaces = FALSE)
ascii_only |
Indicate whether to consider only ASCII alphabetic characters, or the full Unicode range of valid letters (accented, ideographic, etc.). |
... |
These dots are for future extensions and must be empty. |
ignore_spaces |
Indicate whether to ignore the presence of spaces in column names; if so, only the other (non-space) characters are considered. |
Matching column names cannot contain any non-alphabetic characters. Note
that the definition of "alphabetic" consists of all valid Unicode alphabetic
characters (\p{Alphabetic}) by default; this can be changed by setting
ascii_only = TRUE.
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  no1 = c(100, 200, 300),
  café = c("espresso", "latte", "mocha"),
  `t or f` = c(TRUE, FALSE, NA),
  hmm = c("aaa", "bbb", "ccc"),
  都市 = c("東京", "大阪", "京都")
)

# Select columns with alphabetic names; note that accented characters and
# kanji are recognised as alphabetic here:
df$select(cs$alpha())

# Constrain the definition of "alphabetic" to ASCII characters only:
df$select(cs$alpha(ascii_only = TRUE))
df$select(cs$alpha(ascii_only = TRUE, ignore_spaces = TRUE))

# Select all columns except for those with alphabetic names:
df$select(!cs$alpha())
df$select(!cs$alpha(ignore_spaces = TRUE))
Select all columns with alphanumeric names (i.e. only letters and the digits 0-9)
cs__alphanumeric(ascii_only = FALSE, ..., ignore_spaces = FALSE)
ascii_only |
Indicate whether to consider only ASCII alphabetic characters, or the full Unicode range of valid letters (accented, ideographic, etc.). |
... |
These dots are for future extensions and must be empty. |
ignore_spaces |
Indicate whether to ignore the presence of spaces in column names; if so, only the other (non-space) characters are considered. |
Matching column names cannot contain any non-alphanumeric characters. Note
that the definition of "alphanumeric" consists of all valid Unicode alphabetic
characters (\p{Alphabetic}) and digit characters (\d) by default; this can
be changed by setting ascii_only = TRUE.
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  `1st_col` = c(100, 200, 300),
  flagged = c(TRUE, FALSE, TRUE),
  `00prefix` = c("01:aa", "02:bb", "03:cc"),
  `last col` = c("x", "y", "z")
)

# Select columns with alphanumeric names:
df$select(cs$alphanumeric())
df$select(cs$alphanumeric(ignore_spaces = TRUE))

# Select all columns except for those with alphanumeric names:
df$select(!cs$alphanumeric())
df$select(!cs$alphanumeric(ignore_spaces = TRUE))
Select all binary columns
cs__binary()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  a = charToRaw("hello"),
  b = "world",
  c = charToRaw("!"),
  d = ":"
)

# Select binary columns:
df$select(cs$binary())

# Select all columns except for those that are binary:
df$select(!cs$binary())
Select all boolean columns
cs__boolean()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  a = 1:4,
  b = c(FALSE, TRUE, FALSE, TRUE)
)

# Select and invert boolean columns:
df$with_columns(inverted = cs$boolean()$not())

# Select all columns except for those that are boolean:
df$select(!cs$boolean())
Select all columns matching the given dtypes
cs__by_dtype(...)
... |
< |
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  dt = as.Date(c("1999-12-31", "2024-1-1", "2010-7-5")),
  value = c(1234500, 5000555, -4500000),
  other = c("foo", "bar", "foo")
)

# Select all columns with date or string dtypes:
df$select(cs$by_dtype(pl$Date, pl$String))

# Select all columns that are not of date or string dtype:
df$select(!cs$by_dtype(pl$Date, pl$String))

# Group by string columns and sum the numeric columns:
df$group_by(cs$string())$agg(cs$numeric()$sum())$sort("other")
Select all columns matching the given indices (or range objects)
cs__by_index(indices)
indices |
One or more column indices (or ranges). Negative indexing is supported. |
Matching columns are returned in the order in which their indexes appear in the selector, not the underlying schema order.
A Polars selector
cs for the documentation on operators supported by Polars selectors.
vals <- as.list(0.5 * 0:100)
names(vals) <- paste0("c", 0:100)
df <- pl$DataFrame(!!!vals)
df

# Select columns by index (the two first/last columns):
df$select(cs$by_index(c(0, 1, -2, -1)))

# Use seq()
df$select(cs$by_index(c(0, seq(1, 101, 20))))
df$select(cs$by_index(c(0, seq(101, 0, -25))))

# Select only odd-indexed columns:
df$select(!cs$by_index(seq(0, 100, 2)))
Select all columns matching the given names
cs__by_name(..., require_all = TRUE)
... |
< |
require_all |
Whether to match all names (the default) or any of the names. |
Matching columns are returned in the order in which their indexes appear in the selector, not the underlying schema order.
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123, 456),
  baz = c(2.0, 5.5),
  zap = c(FALSE, TRUE)
)

# Select columns by name:
df$select(cs$by_name("foo", "bar"))

# Match any of the given columns by name:
df$select(cs$by_name("baz", "moose", "foo", "bear", require_all = FALSE))

# Match all columns except for those given:
df$select(!cs$by_name("foo", "bar"))
Select all categorical columns
cs__categorical()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("xx", "yy"),
  bar = c(123, 456),
  baz = c(2.0, 5.5),
  .schema_overrides = list(foo = pl$Categorical()),
)

# Select categorical columns:
df$select(cs$categorical())

# Select all columns except for those that are categorical:
df$select(!cs$categorical())
Select columns whose names contain the given literal substring(s)
cs__contains(...)
... |
< |
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123, 456),
  baz = c(2.0, 5.5),
  zap = c(FALSE, TRUE)
)

# Select columns that contain the substring "ba":
df$select(cs$contains("ba"))

# Select columns that contain the substring "ba" or the letter "z":
df$select(cs$contains("ba", "z"))

# Select all columns except for those that contain the substring "ba":
df$select(!cs$contains("ba"))
Select all date columns
cs__date()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  dtm = as.POSIXct(c("2001-5-7 10:25", "2031-12-31 00:30")),
  dt = as.Date(c("1999-12-31", "2024-8-9"))
)

# Select date columns:
df$select(cs$date())

# Select all columns except for those that are dates:
df$select(!cs$date())
Select all datetime columns
cs__datetime(time_unit = c("ms", "us", "ns"), time_zone = list("*", NULL))
time_unit |
One (or more) of the allowed time unit precision strings,
|
time_zone |
One of the following. The value or each element of the vector
will be passed to the
|
A Polars selector
cs for the documentation on operators supported by Polars selectors.
chr_vec <- c("1999-07-21 05:20:16.987654", "2000-05-16 06:21:21.123456")
df <- pl$DataFrame(
  tstamp_tokyo = as.POSIXlt(chr_vec, tz = "Asia/Tokyo"),
  tstamp_utc = as.POSIXct(chr_vec, tz = "UTC"),
  tstamp = as.POSIXct(chr_vec),
  dt = as.Date(chr_vec),
)

# Select all datetime columns:
df$select(cs$datetime())

# Select all datetime columns that have "ms" precision:
df$select(cs$datetime("ms"))

# Select all datetime columns that have any timezone:
df$select(cs$datetime(time_zone = "*"))

# Select all datetime columns that have a specific timezone:
df$select(cs$datetime(time_zone = "UTC"))

# Select all datetime columns that have NO timezone:
df$select(cs$datetime(time_zone = NULL))

# Select all columns except for datetime columns:
df$select(!cs$datetime())
Select all decimal columns
cs__decimal()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123, 456),
  baz = c("2.0005", "-50.5555"),
  .schema_overrides = list(
    bar = pl$Decimal(),
    baz = pl$Decimal(scale = 5, precision = 10)
  )
)

# Select decimal columns:
df$select(cs$decimal())

# Select all columns except for those that are decimal:
df$select(!cs$decimal())
Select all columns having names consisting only of digits
cs__digit(ascii_only = FALSE)
ascii_only |
Indicate whether to consider only ASCII digit characters, or the full Unicode range of valid digits. |
Matching column names cannot contain any non-digit characters. Note that the
definition of "digit" consists of all valid Unicode digit characters (\d)
by default; this can be changed by setting ascii_only = TRUE.
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  key = c("aaa", "bbb"),
  `2001` = 1:2,
  `2025` = 3:4
)

# Select columns with digit names:
df$select(cs$digit())

# Select all columns except for those with digit names:
df$select(!cs$digit())

# Demonstrate use of ascii_only flag (by default all valid unicode digits
# are considered, but this can be constrained to ascii 0-9):
df <- pl$DataFrame(`१९९९` = 1999, `२०७७` = 2077, `3000` = 3000)
df$select(cs$digit())
df$select(cs$digit(ascii_only = TRUE))
Select all duration columns, optionally filtering by time unit
cs__duration(time_unit = c("ms", "us", "ns"))
time_unit |
One (or more) of the allowed time unit precision strings,
|
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  dtm = as.POSIXct(c("2001-5-7 10:25", "2031-12-31 00:30")),
  dur_ms = clock::duration_milliseconds(1:2),
  dur_us = clock::duration_microseconds(1:2),
  dur_ns = clock::duration_nanoseconds(1:2),
)

# Select duration columns:
df$select(cs$duration())

# Select all duration columns that have "ms" precision:
df$select(cs$duration("ms"))

# Select all duration columns that have "ms" OR "ns" precision:
df$select(cs$duration(c("ms", "ns")))

# Select all columns except for those that are duration:
df$select(!cs$duration())
Select columns that end with the given substring(s)
cs__ends_with(...)
... |
< |
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123, 456),
  baz = c(2.0, 5.5),
  zap = c(FALSE, TRUE)
)

# Select columns that end with the substring "z":
df$select(cs$ends_with("z"))

# Select columns that end with either the letter "z" or "r":
df$select(cs$ends_with("z", "r"))

# Select all columns except for those that end with the substring "z":
df$select(!cs$ends_with("z"))
Select all columns except those matching the given columns, datatypes, or selectors
cs__exclude(...)
... |
< |
If excluding a single selector, it is simpler to write it as !selector instead.
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  aa = 1:3,
  ba = c("a", "b", NA),
  cc = c(NA, 2.5, 1.5)
)

# Exclude by column name(s):
df$select(cs$exclude("ba", "xx"))

# Exclude using a column name, a selector, and a dtype:
df$select(cs$exclude("aa", cs$string(), pl$Int32))
Select the first column in the current scope
cs__first()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123L, 456L),
  baz = c(2.0, 5.5),
  zap = c(FALSE, TRUE)
)

# Select the first column:
df$select(cs$first())

# Select everything except for the first column:
df$select(!cs$first())
Select all float columns.
cs__float()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123L, 456L),
  baz = c(2.0, 5.5),
  zap = c(FALSE, TRUE),
  .schema_overrides = list(baz = pl$Float32, zap = pl$Float64),
)

# Select all float columns:
df$select(cs$float())

# Select all columns except for those that are float:
df$select(!cs$float())
Select all integer columns.
cs__integer()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123L, 456L),
  baz = c(2.0, 5.5),
  zap = 0:1
)

# Select all integer columns:
df$select(cs$integer())

# Select all columns except for those that are integer:
df$select(!cs$integer())
Select the last column in the current scope
cs__last()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123L, 456L),
  baz = c(2.0, 5.5),
  zap = c(FALSE, TRUE)
)

# Select the last column:
df$select(cs$last())

# Select everything except for the last column:
df$select(!cs$last())
Select all columns that match the given regex pattern
cs__matches(pattern)
pattern |
A valid regular expression pattern, compatible with the
|
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame(
  foo = c("x", "y"),
  bar = c(123, 456),
  baz = c(2.0, 5.5),
  zap = c(0, 1)
)

# Match column names containing an "a", preceded by a character that is not
# "z":
df$select(cs$matches("[^z]a"))

# Do not match column names ending in "R" or "z" (case-insensitively):
df$select(!cs$matches(r"((?i)R|z$)"))
Select all numeric columns.
cs__numeric()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame( foo = c("x", "y"), bar = c(123L, 456L), baz = c(2.0, 5.5), zap = 0:1, .schema_overrides = list(bar = pl$Int16, baz = pl$Float32, zap = pl$UInt8), ) # Select all numeric columns: df$select(cs$numeric()) # Select all columns except for those that are numeric: df$select(!cs$numeric())
Select all signed integer columns
cs__signed_integer()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame( foo = c(-123L, -456L), bar = c(3456L, 6789L), baz = c(7654L, 4321L), zap = c("ab", "cd"), .schema_overrides = list(bar = pl$UInt32, baz = pl$UInt64), ) # Select signed integer columns: df$select(cs$signed_integer()) # Select all columns except for those that are signed integer: df$select(!cs$signed_integer()) # Select all integer columns (both signed and unsigned): df$select(cs$integer())
Select columns that start with the given substring(s)
cs__starts_with(...)
... |
<dynamic-dots> Substring(s) that matching column names should start with. |
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame( foo = c("x", "y"), bar = c(123, 456), baz = c(2.0, 5.5), zap = c(FALSE, TRUE) ) # Select columns that start with the substring "b": df$select(cs$starts_with("b")) # Select columns that start with either the letter "b" or "z": df$select(cs$starts_with("b", "z")) # Select all columns except for those that start with the substring "b": df$select(!cs$starts_with("b"))
Select all String (and, optionally, Categorical) string columns.
cs__string(..., include_categorical = FALSE)
... |
These dots are for future extensions and must be empty. |
include_categorical |
If TRUE, also select Categorical columns. |
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame( w = c("xx", "yy", "xx", "yy", "xx"), x = c(1, 2, 1, 4, -2), y = c(3.0, 4.5, 1.0, 2.5, -2.0), z = c("a", "b", "a", "b", "b") )$with_columns( z = pl$col("z")$cast(pl$Categorical()) ) # Group by all string columns, sum the numeric columns, then sort by the # string cols: df$group_by(cs$string())$agg(cs$numeric()$sum())$sort(cs$string()) # Group by all string and categorical columns: df$ group_by(cs$string(include_categorical = TRUE))$ agg(cs$numeric()$sum())$ sort(cs$string(include_categorical = TRUE))
Select all temporal columns
cs__temporal()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame( dtm = as.POSIXct(c("2001-5-7 10:25", "2031-12-31 00:30")), dt = as.Date(c("1999-12-31", "2024-8-9")), value = 1:2 ) # Match all temporal columns: df$select(cs$temporal()) # Match all temporal columns except for time columns: df$select(cs$temporal() - cs$datetime()) # Match all columns except for temporal columns: df$select(!cs$temporal())
Select all time columns
cs__time()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame( dtm = as.POSIXct(c("2001-5-7 10:25", "2031-12-31 00:30")), dt = as.Date(c("1999-12-31", "2024-8-9")), tm = hms::parse_hms(c("0:0:0", "23:59:59")) ) # Select time columns: df$select(cs$time()) # Select all columns except for those that are time: df$select(!cs$time())
Select all unsigned integer columns
cs__unsigned_integer()
A Polars selector
cs for the documentation on operators supported by Polars selectors.
df <- pl$DataFrame( foo = c(-123L, -456L), bar = c(3456L, 6789L), baz = c(7654L, 4321L), zap = c("ab", "cd"), .schema_overrides = list(bar = pl$UInt32, baz = pl$UInt64), ) # Select unsigned integer columns: df$select(cs$unsigned_integer()) # Select all columns except for those that are unsigned integer: df$select(!cs$unsigned_integer()) # Select all integer columns (both signed and unsigned): df$select(cs$integer())
Cast DataFrame column(s) to the specified dtype
dataframe__cast(..., .strict = TRUE)
A polars DataFrame
df <- pl$DataFrame( foo = 1:3, bar = c(6, 7, 8), ham = as.Date(c("2020-01-02", "2020-03-04", "2020-05-06")) ) # Cast only some columns df$cast(foo = pl$Float32, bar = pl$UInt8) # Cast all columns to the same type df$cast(pl$String)
Clone a DataFrame
This is a cheap operation that does not copy data. Assigning does not copy the DataFrame (environment object), because environment objects have reference semantics. Calling $clone() creates a new environment, which can be useful when dealing with attributes (see examples).
dataframe__clone()
A polars DataFrame
df1 <- as_polars_df(iris) # Assigning does not copy the DataFrame (environment object), calling # $clone() creates a new environment. df2 <- df1 df3 <- df1$clone() rlang::env_label(df1) rlang::env_label(df2) rlang::env_label(df3) # Cloning can be useful to add attributes to data used in a function without # adding those attributes to the original object. # Make a function to take a DataFrame, add an attribute, and return a # DataFrame: give_attr <- function(data) { attr(data, "created_on") <- "2024-01-29" data } df2 <- give_attr(df1) # Problem: the original DataFrame also gets the attribute while it shouldn't attributes(df1) # Use $clone() inside the function to avoid that give_attr <- function(data) { data <- data$clone() attr(data, "created_on") <- "2024-01-29" data } df1 <- as_polars_df(iris) df2 <- give_attr(df1) # now, the original DataFrame doesn't get this attribute attributes(df1)
Drop columns of a DataFrame
dataframe__drop(..., strict = TRUE)
... |
<dynamic-dots> Name(s) of the column(s) to drop, as characters or column selectors. |
strict |
Validate that all column names exist in the schema and throw an exception if any do not. |
A polars DataFrame
as_polars_df(mtcars)$drop(c("mpg", "hp")) # equivalent as_polars_df(mtcars)$drop("mpg", "hp")
Check whether the DataFrame is equal to another DataFrame
dataframe__equals(other, ..., null_equal = TRUE)
other |
DataFrame to compare with. |
A logical value
dat1 <- as_polars_df(iris) dat2 <- as_polars_df(iris) dat3 <- as_polars_df(mtcars) dat1$equals(dat2) dat1$equals(dat3)
Filter rows of a DataFrame
dataframe__filter(...)
A polars DataFrame
df <- as_polars_df(iris) df$filter(pl$col("Sepal.Length") > 5) # This is equivalent to # df$filter(pl$col("Sepal.Length") > 5 & pl$col("Petal.Width") < 1) df$filter(pl$col("Sepal.Length") > 5, pl$col("Petal.Width") < 1) # rows where condition is NA are dropped iris2 <- iris iris2[c(1, 3, 5), "Species"] <- NA df <- as_polars_df(iris2) df$filter(pl$col("Species") == "setosa")
Get the DataFrame as a list of Series
dataframe__get_columns()
A list of Series
df <- pl$DataFrame(foo = c(1, 2, 3), bar = c(4, 5, 6)) df$get_columns() df <- pl$DataFrame( a = 1:4, b = c(0.5, 4, 10, 13), c = c(TRUE, TRUE, FALSE, TRUE) ) df$get_columns()
Group a DataFrame
dataframe__group_by(..., .maintain_order = FALSE)
Within each group, the order of the rows is always preserved, regardless of the maintain_order argument.
GroupBy (a DataFrame with special groupby methods like $agg())
<DataFrame>$partition_by()
df <- pl$DataFrame( a = c("a", "b", "a", "b", "c"), b = c(1, 2, 1, 3, 3), c = c(5, 4, 3, 2, 1) ) df$group_by("a")$agg(pl$col("b")$sum()) # Set `maintain_order = TRUE` to ensure the order of the groups is # consistent with the input. df$group_by("a", maintain_order = TRUE)$agg(pl$col("c")) # Group by multiple columns by passing a list of column names. df$group_by(c("a", "b"))$agg(pl$max("c")) # Or pass some arguments to group by multiple columns in the same way. # Expressions are also accepted. df$group_by("a", pl$col("b") %/% 2)$agg( pl$col("c")$mean() ) # The columns will be renamed to the argument names. df$group_by(d = "a", e = pl$col("b") %/% 2)$agg( pl$col("c")$mean() )
Start a new lazy query from a DataFrame.
dataframe__lazy()
A polars LazyFrame
pl$DataFrame(a = 1:2, b = c(NA, "a"))$lazy()
Get number of chunks used by the ChunkedArrays of this DataFrame
dataframe__n_chunks(strategy = c("first", "all"))
strategy |
Return the number of chunks of the first column only ("first", default) or of all columns ("all"). |
An integer vector.
df <- pl$DataFrame( a = c(1, 2, 3, 4), b = c(0.5, 4, 10, 13), c = c(TRUE, TRUE, FALSE, TRUE) ) df$n_chunks() df$n_chunks(strategy = "all")
Rechunk the data in this DataFrame to a contiguous allocation. This will make sure all subsequent operations have optimal and predictable performance.
dataframe__rechunk()
A polars DataFrame
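No example is shown for this method in this extract; below is a minimal sketch of how rechunking consolidates chunks, assuming pl$concat() accepts a list of DataFrames (as in Python Polars).
df <- pl$concat(list(pl$DataFrame(a = 1:3), pl$DataFrame(a = 4:6)))
df$n_chunks()            # data concatenated from two frames typically has more than one chunk
df$rechunk()$n_chunks()  # after rechunking, a single contiguous chunk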
Select and perform operations on a subset of columns only. This discards unmentioned columns (like .() in data.table, and unlike dplyr::mutate()).
One cannot use new variables in subsequent expressions in the same $select() call. For instance, if you create a variable x, you will only be able to use it in another $select() or $with_columns() call (see the sketch after the example below).
dataframe__select(...)
... |
<dynamic-dots> Column(s) or expression(s) to select. Strings are parsed as column names; other non-expression inputs are parsed as literals. |
A polars DataFrame
as_polars_df(iris)$select( abs_SL = pl$col("Sepal.Length")$abs(), add_2_SL = pl$col("Sepal.Length") + 2 )
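As noted above, a column created in a $select() call can only be referenced in a subsequent call; a minimal sketch of the two-step pattern:
as_polars_df(iris)$select(x = pl$col("Sepal.Length") + 2)$select(y = pl$col("x") * 2)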
Get a slice of the DataFrame.
dataframe__slice(offset, length = NULL)
offset |
Start index, can be a negative value. This is 0-indexed, so offset = 1 starts from the second row. |
length |
Length of the slice. If NULL (default), the slice extends to the end of the DataFrame. |
A polars DataFrame
# skip the first 2 rows and take the 4 following rows as_polars_df(mtcars)$slice(2, 4) # this is equivalent to: mtcars[3:6, ]
Sort a DataFrame
dataframe__sort( ..., descending = FALSE, nulls_last = FALSE, multithreaded = TRUE, maintain_order = FALSE )
A polars DataFrame
df <- mtcars df$mpg[1] <- NA df <- as_polars_df(df) df$sort("mpg") df$sort("mpg", nulls_last = TRUE) df$sort("cyl", "mpg") df$sort(c("cyl", "mpg")) df$sort(c("cyl", "mpg"), descending = TRUE) df$sort(c("cyl", "mpg"), descending = c(TRUE, FALSE)) df$sort(pl$col("cyl"), pl$col("mpg"))
Select column as Series at index location
dataframe__to_series(index = 0)
index |
Index of the column to return as Series. Defaults to 0, which is the first column. |
Series or NULL
df <- as_polars_df(iris[1:10, ]) # default is to extract the first column df$to_series() # Polars is 0-indexed, so we use index = 1 to extract the *2nd* column df$to_series(index = 1) # doesn't error if the column isn't there df$to_series(index = 8)
Convert a DataFrame to a Series of type Struct
dataframe__to_struct(name = "")
name |
A character. Name for the struct Series. |
A Series of the struct type
df <- pl$DataFrame( a = 1:5, b = c("one", "two", "three", "four", "five"), ) df$to_struct("nums")
Add columns or modify existing ones with expressions. This is similar to dplyr::mutate() as it keeps unmentioned columns (unlike $select()). However, unlike dplyr::mutate(), one cannot use new variables in subsequent expressions in the same $with_columns() call. For instance, if you create a variable x, you will only be able to use it in another $with_columns() or $select() call.
dataframe__with_columns(...)
... |
<dynamic-dots> Column(s) or expression(s) to add. Strings are parsed as column names; other non-expression inputs are parsed as literals. |
A polars DataFrame
as_polars_df(iris)$with_columns( abs_SL = pl$col("Sepal.Length")$abs(), add_2_SL = pl$col("Sepal.Length") + 2 ) # same query l_expr <- list( pl$col("Sepal.Length")$abs()$alias("abs_SL"), (pl$col("Sepal.Length") + 2)$alias("add_2_SL") ) as_polars_df(iris)$with_columns(l_expr) as_polars_df(iris)$with_columns( SW_add_2 = (pl$col("Sepal.Width") + 2), # unnamed expr will keep name "Sepal.Length" pl$col("Sepal.Length")$abs() )
Compute absolute values
expr__abs()
A polars expression
df <- pl$DataFrame(a = -1:2) df$with_columns(abs = pl$col("a")$abs())
Method equivalent of addition operator expr + other.
expr__add(other)
other |
Element to add. Can be a string (only if the expression is a string as well), a numeric value, or an expression. |
A polars expression
Arithmetic operators
df <- pl$DataFrame(x = 1:5) df$with_columns( `x+int` = pl$col("x")$add(2L), `x+expr` = pl$col("x")$add(pl$col("x")$cum_prod()) ) df <- pl$DataFrame( x = c("a", "d", "g"), y = c("b", "e", "h"), z = c("c", "f", "i") ) df$with_columns( pl$col("x")$add(pl$col("y"))$add(pl$col("z"))$alias("xyz") )
Get the group indexes of the group by operation. Should be used in aggregation context only.
expr__agg_groups()
A polars expression
df <- pl$DataFrame( group = rep(c("one", "two"), each = 3), value = c(94, 95, 96, 97, 97, 99) ) df$group_by("group", maintain_order = TRUE)$agg(pl$col("value")$agg_groups())
Rename the expression
expr__alias(name)
name |
The new name. |
A polars expression
# Rename an expression to avoid overwriting an existing column df <- pl$DataFrame(a = 1:3, b = c("x", "y", "z")) df$with_columns( pl$col("a") + 10, pl$col("b")$str$to_uppercase()$alias("c") ) # Overwrite the default name of literal columns to prevent errors due to # duplicate column names. df$with_columns( pl$lit(TRUE)$alias("c"), pl$lit(4)$alias("d") )
Check if all boolean values in a column are true
This method is an expression - not to be confused with pl$all() which is a function to select all columns.
expr__all(..., ignore_nulls = TRUE)
... |
These dots are for future extensions and must be empty. |
ignore_nulls |
If TRUE (default), ignore null values. If FALSE, Kleene logic is used to deal with nulls: if the column contains any null values and no FALSE values, the output is null. |
A polars expression
df <- pl$DataFrame( a = c(TRUE, TRUE), b = c(TRUE, FALSE), c = c(NA, TRUE), d = c(NA, NA) ) # By default, ignore null values. If there are only nulls, then all() returns # TRUE. df$select(pl$col("*")$all()) # If we set ignore_nulls = FALSE, then we don't know if all values in column # "c" are TRUE, so it returns null df$select(pl$col("*")$all(ignore_nulls = FALSE))
Combine two boolean expressions with AND.
expr__and(other)
other |
A boolean literal or expression to combine with. |
A polars expression
pl$lit(TRUE) & TRUE pl$lit(TRUE)$and(pl$lit(TRUE))
Check if any boolean value in a column is true
expr__any(..., ignore_nulls = TRUE)
... |
These dots are for future extensions and must be empty. |
ignore_nulls |
If TRUE (default), ignore null values. If FALSE, Kleene logic is used to deal with nulls: if the column contains any null values and no TRUE values, the output is null. |
A polars expression
df <- pl$DataFrame( a = c(TRUE, FALSE), b = c(FALSE, FALSE), c = c(NA, FALSE) ) df$select(pl$col("*")$any()) # If we set ignore_nulls = FALSE, then we don't know if any values in column # "c" is TRUE, so it returns null df$select(pl$col("*")$any(ignore_nulls = FALSE))
Append expressions
expr__append(other, ..., upcast = TRUE)
other |
Expression to append. |
... |
These dots are for future extensions and must be empty. |
upcast |
If TRUE (default), cast both Series to the same supertype if needed. |
A polars expression
df <- pl$DataFrame(a = 8:10, b = c(NA, 4, 4)) df$select(pl$all()$head(1)$append(pl$all()$tail(1)))
Approximate count of unique values
This is done using the HyperLogLog++ algorithm for cardinality estimation.
expr__approx_n_unique()
A polars expression
df <- pl$DataFrame(n = c(1, 1, 2)) df$select(pl$col("n")$approx_n_unique()) df <- pl$DataFrame(n = 0:1000) df$select( exact = pl$col("n")$n_unique(), approx = pl$col("n")$approx_n_unique() )
Compute inverse cosine
expr__arccos()
A polars expression
pl$DataFrame(a = c(-1, cos(0.5), 0, 1, NA))$ with_columns(arccos = pl$col("a")$arccos())
Compute inverse hyperbolic cosine
expr__arccosh()
A polars expression
pl$DataFrame(a = c(-1, cosh(0.5), 0, 1, NA))$ with_columns(arccosh = pl$col("a")$arccosh())
Compute inverse sine
expr__arcsin()
A polars expression
pl$DataFrame(a = c(-1, sin(0.5), 0, 1, NA))$ with_columns(arcsin = pl$col("a")$arcsin())
Compute inverse hyperbolic sine
expr__arcsinh()
A polars expression
pl$DataFrame(a = c(-1, sinh(0.5), 0, 1, NA))$ with_columns(arcsinh = pl$col("a")$arcsinh())
Compute inverse tangent
expr__arctan()
A polars expression
pl$DataFrame(a = c(-1, tan(0.5), 0, 1, NA_real_))$ with_columns(arctan = pl$col("a")$arctan())
Compute inverse hyperbolic tangent
expr__arctanh()
A polars expression
pl$DataFrame(a = c(-1, tanh(0.5), 0, 1, NA))$ with_columns(arctanh = pl$col("a")$arctanh())
Get the index of the maximal value
expr__arg_max()
A polars expression
df <- pl$DataFrame(a = c(20, 10, 30)) df$select(pl$col("a")$arg_max())
Get the index of the minimal value
expr__arg_min()
A polars expression
df <- pl$DataFrame(a = c(20, 10, 30)) df$select(pl$col("a")$arg_min())
Get the index values that would sort this column.
expr__arg_sort(..., descending = FALSE, nulls_last = FALSE)
... |
These dots are for future extensions and must be empty. |
descending |
Sort in descending order. |
nulls_last |
Place null values last. |
A polars expression
pl$arg_sort_by() to find the row indices that would sort multiple columns.
pl$DataFrame( a = c(6, 1, 0, NA, Inf, NaN) )$with_columns(arg_sorted = pl$col("a")$arg_sort())
Return indices where expression is true
expr__arg_true()
A polars expression
df <- pl$DataFrame(a = c(1, 1, 2, 1)) df$select((pl$col("a") == 1)$arg_true())
Get the index of the first unique value
expr__arg_unique()
A polars expression
df <- pl$DataFrame(a = 1:3, b = c(NA, 4, 4)) df$select(pl$col("a")$arg_unique()) df$select(pl$col("b")$arg_unique())
Fill missing values with the next non-null value
expr__backward_fill(limit = NULL)
limit |
The number of consecutive null values to backward fill. |
A polars expression
df <- pl$DataFrame( a = c(1, 2, NA), b = c(4, NA, 6), c = c(NA, NA, 2) ) df$select(pl$all()$backward_fill()) df$select(pl$all()$backward_fill(limit = 1))
Return the k smallest elements
Non-null elements are always preferred over null elements. The output is not guaranteed to be in any particular order; call $sort() after this function if you wish the output to be sorted. This has time complexity O(n + k log n).
expr__bottom_k(k = 5)
k |
Number of elements to return. |
A polars expression
df <- pl$DataFrame(value = c(1, 98, 2, 3, 99, 4)) df$select( top_k = pl$col("value")$top_k(k = 3), bottom_k = pl$col("value")$bottom_k(k = 3) )
Return the k smallest elements of the by column(s)
Non-null elements are always preferred over null elements. The output is not guaranteed to be in any particular order; call $sort() after this function if you wish the output to be sorted. This has time complexity O(n + k log n).
expr__bottom_k_by(by, k = 5, ..., reverse = FALSE)
by |
Column(s) used to determine the smallest elements. Accepts expression input. Strings are parsed as column names. |
k |
Number of elements to return. |
A polars expression
df <- pl$DataFrame( a = 1:6, b = 6:1, c = c("Apple", "Orange", "Apple", "Apple", "Banana", "Banana") ) # Get the bottom 2 rows by column a or b: df$select( pl$all()$bottom_k_by("a", 2)$name$suffix("_btm_by_a"), pl$all()$bottom_k_by("b", 2)$name$suffix("_btm_by_b") ) # Get the bottom 2 rows by multiple columns with given order. df$select( pl$all()$ bottom_k_by(c("c", "a"), 2, reverse = c(FALSE, TRUE))$ name$suffix("_btm_by_ca"), pl$all()$ bottom_k_by(c("c", "b"), 2, reverse = c(FALSE, TRUE))$ name$suffix("_btm_by_cb"), ) # Get the bottom 2 rows by column a in each group df$group_by("c", maintain_order = TRUE)$agg( pl$all()$bottom_k_by("a", 2) )$explode(pl$all()$exclude("c"))
Cast between DataType
expr__cast(dtype, ..., strict = TRUE, wrap_numerical = FALSE)
dtype |
DataType to cast to. |
... |
These dots are for future extensions and must be empty. |
strict |
If TRUE (default), an invalid cast raises an error when the query is executed. If FALSE, values that cannot be cast are converted to null values. |
wrap_numerical |
If TRUE, numeric casts wrap overflowing values instead of marking the cast as invalid. |
A polars expression
df <- pl$DataFrame(a = 1:3, b = c(1, 2, 3)) df$with_columns( pl$col("a")$cast(pl$dtypes$Float64), pl$col("b")$cast(pl$dtypes$Int32) ) # strict FALSE, inserts null for any cast failure pl$lit(c(100, 200, 300))$cast(pl$dtypes$UInt8, strict = FALSE)$to_series() # strict TRUE, raise any failure as an error when query is executed. tryCatch( { pl$lit("a")$cast(pl$dtypes$Float64, strict = TRUE)$to_series() }, error = function(e) e )
Compute cube root
expr__cbrt()
A polars expression
pl$DataFrame(a = c(1, 2, 4))$ with_columns(cbrt = pl$col("a")$cbrt())
Rounds up to the nearest integer value. This only works on floating point Series.
expr__ceil()
A polars expression
df <- pl$DataFrame(a = c(0.3, 0.5, 1.0, 1.1)) df$with_columns( ceil = pl$col("a")$ceil() )
This method only works for numeric and temporal columns. To clip other data types, consider writing a when-then-otherwise expression.
expr__clip(lower_bound = NULL, upper_bound = NULL)
lower_bound |
Lower bound. Accepts expression input. Non-expression inputs are parsed as literals. |
upper_bound |
Upper bound. Accepts expression input. Non-expression inputs are parsed as literals. |
A polars expression
df <- pl$DataFrame(a = c(-50, 5, 50, NA)) # Specifying both a lower and upper bound: df$with_columns( clip = pl$col("a")$clip(1, 10) ) # Specifying only a single bound: df$with_columns( clip = pl$col("a")$clip(upper_bound = 10) )
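As noted above, $clip() supports only numeric and temporal columns; for other data types a when-then-otherwise expression can emulate clipping. A minimal sketch on hypothetical string data (clipping everything lexically above "m" down to "m"):
df <- pl$DataFrame(x = c("apple", "mango", "zebra"))
df$with_columns(
  clipped = pl$when(pl$col("x") > pl$lit("m"))$then(pl$lit("m"))$otherwise(pl$col("x"))
)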
Compute cosine
expr__cos()
A polars expression
pl$DataFrame(a = c(0, pi / 2, pi, NA))$ with_columns(cosine = pl$col("a")$cos())
Compute hyperbolic cosine
expr__cosh()
A polars expression
pl$DataFrame(a = c(-1, acosh(2), 0, 1, NA))$ with_columns(cosh = pl$col("a")$cosh())
Compute cotangent
expr__cot()
A polars expression
pl$DataFrame(a = c(0, pi / 2, -5, NA))$ with_columns(cotangent = pl$col("a")$cot())
Get the number of non-null elements in the column
expr__count()
A polars expression
df <- pl$DataFrame(a = 1:3, b = c(NA, 4, 4)) df$select(pl$all()$count())
Return the cumulative count of the non-null values in the column
expr__cum_count(..., reverse = FALSE)
... |
These dots are for future extensions and must be empty. |
reverse |
If |
A polars expression
pl$DataFrame(a = 1:4)$with_columns( cum_count = pl$col("a")$cum_count(), cum_count_reversed = pl$col("a")$cum_count(reverse = TRUE) )
Return the cumulative max computed at every element.
expr__cum_max(..., reverse = FALSE)
... |
These dots are for future extensions and must be empty. |
reverse |
If |
The Dtypes Int8, UInt8, Int16 and UInt16 are cast to Int64 before computing the result to prevent overflow issues.
A polars expression
pl$DataFrame(a = c(1:4, 2L))$with_columns( cum_max = pl$col("a")$cum_max(), cum_max_reversed = pl$col("a")$cum_max(reverse = TRUE) )
Return the cumulative min computed at every element.
expr__cum_min(..., reverse = FALSE)
... |
These dots are for future extensions and must be empty. |
reverse |
If |
The Dtypes Int8, UInt8, Int16 and UInt16 are cast to Int64 before computing the result to prevent overflow issues.
A polars expression
pl$DataFrame(a = c(1:4, 2L))$with_columns( cum_min = pl$col("a")$cum_min(), cum_min_reversed = pl$col("a")$cum_min(reverse = TRUE) )
Return the cumulative product computed at every element.
expr__cum_prod(..., reverse = FALSE)
... |
These dots are for future extensions and must be empty. |
reverse |
If |
The Dtypes Int8, UInt8, Int16 and UInt16 are cast to Int64 before multiplying to prevent overflow issues.
A polars expression
pl$DataFrame(a = 1:4)$with_columns( cum_prod = pl$col("a")$cum_prod(), cum_prod_reversed = pl$col("a")$cum_prod(reverse = TRUE) )
Return the cumulative sum computed at every element.
expr__cum_sum(..., reverse = FALSE)
... |
These dots are for future extensions and must be empty. |
reverse |
If |
The Dtypes Int8, UInt8, Int16 and UInt16 are cast to Int64 before summing to prevent overflow issues.
A polars expression
pl$DataFrame(a = 1:4)$with_columns( cum_sum = pl$col("a")$cum_sum(), cum_sum_reversed = pl$col("a")$cum_sum(reverse = TRUE) )
Run an expression over a sliding window that increases by 1 slot every iteration
expr__cumulative_eval(expr, ..., min_periods = 1, parallel = FALSE)
expr |
Expression to evaluate. |
... |
These dots are for future extensions and must be empty. |
min_periods |
Number of valid (non-null) values there should be in the window before the expression is evaluated. |
parallel |
Run in parallel. Don’t do this in a group by or another operation that already has much parallelization. |
This can be really slow as it can have O(n^2) complexity. Don't use this for operations that visit all elements.
A polars expression
df <- pl$DataFrame(values = 1:5) df$with_columns( pl$col("values")$cumulative_eval( pl$element()$first() - pl$element()$last()**2 ) )
Bin continuous values into discrete categories
expr__cut( breaks, ..., labels = NULL, left_closed = FALSE, include_breaks = FALSE )
breaks |
List of unique cut points. |
... |
These dots are for future extensions and must be empty. |
labels |
Names of the categories. The number of labels must be equal to the number of cut points plus one. |
left_closed |
Set the intervals to be left-closed instead of right-closed. |
include_breaks |
Include a column with the right endpoint of the bin each observation falls in. This will change the data type of the output from a Categorical to a Struct. |
A polars expression
# Divide a column into three categories. df <- pl$DataFrame(foo = -2:2) df$with_columns( cut = pl$col("foo")$cut(c(-1, 1), labels = c("a", "b", "c")) ) # Add both the category and the breakpoint. df$with_columns( cut = pl$col("foo")$cut(c(-1, 1), include_breaks = TRUE) )$unnest()
Convert from radians to degrees
expr__degrees()
A polars expression
pl$DataFrame(a = c(1, 2, 4) * pi)$ with_columns(degrees = pl$col("a")$degrees())
Calculate the n-th discrete difference between elements
expr__diff(n = 1, null_behavior = c("ignore", "drop"))
n |
Integer indicating the number of slots to shift. |
null_behavior |
How to handle null values. Must be "ignore" (default) or "drop". |
A polars expression
pl$DataFrame(a = c(20, 10, 30, 25, 35))$with_columns( diff_default = pl$col("a")$diff(), diff_2_ignore = pl$col("a")$diff(2, "ignore") )
Compute the dot/inner product between two Expressions
expr__dot(expr)
expr |
Expression to compute dot product with. |
A polars expression
df <- pl$DataFrame(a = c(1, 3, 5), b = c(2, 4, 6)) df$select(pl$col("a")$dot(pl$col("b")))
Drop all floating point NaN values
The original order of the remaining elements is preserved. A NaN value is not the same as a null value. To drop null values, use $drop_nulls().
expr__drop_nans()
A polars expression
df <- pl$DataFrame(a = c(1, NA, 3, NaN)) df$select(pl$col("a")$drop_nans())
Drop all null values
The original order of the remaining elements is preserved. A null value is not the same as a NaN value. To drop NaN values, use $drop_nans().
expr__drop_nulls()
A polars expression
df <- pl$DataFrame(a = c(1, NA, 3, NaN)) df$select(pl$col("a")$drop_nulls())
Compute the entropy
Uses the formula -sum(pk * log(pk)), where pk are discrete probabilities.
expr__entropy(base = exp(1), ..., normalize = TRUE)
base |
Numeric value used as base, defaults to exp(1), the natural logarithm. |
... |
These dots are for future extensions and must be empty. |
normalize |
Normalize pk if they don't sum to 1. |
A polars expression
df <- pl$DataFrame(a = 1:3) df$select(pl$col("a")$entropy(base = 2)) df$select(pl$col("a")$entropy(base = 2, normalize = FALSE))
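For intuition, the first result above can be reproduced in base R by normalizing the values to probabilities and applying the formula directly (a quick sanity check, not part of the polars API):
p <- (1:3) / sum(1:3)
-sum(p * log(p, base = 2))  # ~1.459, matching entropy(base = 2) above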
Method equivalent of equality operator expr == other
This propagates null values, i.e. any comparison involving null will return null. Use $eq_missing() to consider null values as equal.
expr__eq(other)
other |
A literal or expression value to compare with. |
A polars expression
df <- pl$DataFrame(x = c(NA, FALSE, TRUE), y = c(TRUE, TRUE, TRUE)) df$with_columns( eq = pl$col("x")$eq(pl$col("y")), eq_missing = pl$col("x")$eq_missing(pl$col("y")) )
Check equality without null propagation
This considers that null values are equal. It differs from $eq() where null values are propagated.
expr__eq_missing(other)
other |
A literal or expression value to compare with. |
A polars expression
df <- pl$DataFrame(x = c(NA, FALSE, TRUE), y = c(TRUE, TRUE, TRUE)) df$with_columns( eq = pl$col("x")$eq("y"), eq_missing = pl$col("x")$eq_missing("y") )
Compute exponentially-weighted moving mean
expr__ewm_mean( ..., com, span, half_life, alpha, adjust = TRUE, min_periods = 1, ignore_nulls = FALSE )
... |
These dots are for future extensions and must be empty. |
com |
Specify decay in terms of center of mass, gamma, with alpha = 1 / (1 + gamma) for gamma >= 0. |
span |
Specify decay in terms of span, theta, with alpha = 2 / (theta + 1) for theta >= 1. |
half_life |
Specify decay in terms of half-life, tau, with alpha = 1 - exp(-ln(2) / tau) for tau > 0. |
alpha |
Specify smoothing factor alpha directly, 0 < alpha <= 1. |
adjust |
Divide by decaying adjustment factor in beginning periods to account for imbalance in relative weightings: when adjust = TRUE, the EW function is calculated using weights w_i = (1 - alpha)^i; when adjust = FALSE, the EW function is calculated recursively by y_0 = x_0 and y_t = (1 - alpha) * y_(t-1) + alpha * x_t. |
min_periods |
The number of values in the window that should be non-null before computing a result. If fewer than min_periods non-null values are present, the result is null. |
ignore_nulls |
Ignore missing values when calculating weights. If FALSE (default), weights are based on absolute positions; if TRUE, weights are based on relative positions. |
A polars expression
df <- pl$DataFrame(a = 1:3) df$select(pl$col("a")$ewm_mean(com = 1, ignore_nulls = FALSE))
Given observations x_0, x_1, ..., x_(n-1) at times t_0, t_1, ..., t_(n-1), the EWMA is calculated as
y_0 = x_0
alpha_i = exp(-ln(2) * (t_i - t_(i-1)) / tau)
y_i = alpha_i * y_(i-1) + (1 - alpha_i) * x_i
where tau is the half_life.
expr__ewm_mean_by(by, ..., half_life)
by |
Times to calculate average by. Should be DateTime, Date, UInt64, UInt32, Int64, or Int32 data type. |
half_life |
Unit over which observation decays to half its value. Can be created either from a timedelta, or by using the following string language: "1ns" (1 nanosecond), "1us" (1 microsecond), "1ms" (1 millisecond), "1s" (1 second), "1m" (1 minute), "1h" (1 hour), "1d" (1 calendar day), "1w" (1 calendar week), "1mo" (1 calendar month), "1q" (1 calendar quarter), "1y" (1 calendar year). Or combine them: "3d12h4m25s" (3 days, 12 hours, 4 minutes, and 25 seconds). By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year". |
A polars expression
df <- pl$DataFrame( values = c(0, 1, 2, NA, 4), times = as.Date( c("2020-01-01", "2020-01-03", "2020-01-10", "2020-01-15", "2020-01-17") ) ) df$with_columns( result = pl$col("values")$ewm_mean_by("times", half_life = "4d") )
Compute exponentially-weighted moving standard deviation
expr__ewm_std( ..., com, span, half_life, alpha, adjust = TRUE, bias = FALSE, min_periods = 1, ignore_nulls = FALSE )
... |
These dots are for future extensions and must be empty. |
com |
Specify decay in terms of center of mass, gamma, with alpha = 1 / (1 + gamma) for gamma >= 0. |
span |
Specify decay in terms of span, theta, with alpha = 2 / (theta + 1) for theta >= 1. |
half_life |
Specify decay in terms of half-life, tau, with alpha = 1 - exp(-ln(2) / tau) for tau > 0. |
alpha |
Specify smoothing factor alpha directly, 0 < alpha <= 1. |
adjust |
Divide by decaying adjustment factor in beginning periods to account for imbalance in relative weightings: when adjust = TRUE, the EW function is calculated using weights w_i = (1 - alpha)^i; when adjust = FALSE, the EW function is calculated recursively by y_0 = x_0 and y_t = (1 - alpha) * y_(t-1) + alpha * x_t. |
bias |
If FALSE (default), the result is corrected for statistical bias. |
min_periods |
The number of values in the window that should be non-null before computing a result. If fewer than min_periods non-null values are present, the result is null. |
ignore_nulls |
Ignore missing values when calculating weights. If FALSE (default), weights are based on absolute positions; if TRUE, weights are based on relative positions. |
A polars expression
df <- pl$DataFrame(a = 1:3) df$select(pl$col("a")$ewm_std(com = 1, ignore_nulls = FALSE))
Compute exponentially-weighted moving variance
expr__ewm_var( ..., com, span, half_life, alpha, adjust = TRUE, bias = FALSE, min_periods = 1, ignore_nulls = FALSE )
... |
These dots are for future extensions and must be empty. |
com |
Specify decay in terms of center of mass, gamma, with alpha = 1 / (1 + gamma) for gamma >= 0. |
span |
Specify decay in terms of span, theta, with alpha = 2 / (theta + 1) for theta >= 1. |
half_life |
Specify decay in terms of half-life, tau, with alpha = 1 - exp(-ln(2) / tau) for tau > 0. |
alpha |
Specify smoothing factor alpha directly, 0 < alpha <= 1. |
adjust |
Divide by decaying adjustment factor in beginning periods to account for imbalance in relative weightings: when adjust = TRUE, the EW function is calculated using weights w_i = (1 - alpha)^i; when adjust = FALSE, the EW function is calculated recursively by y_0 = x_0 and y_t = (1 - alpha) * y_(t-1) + alpha * x_t. |
bias |
If FALSE (default), the result is corrected for statistical bias. |
min_periods |
The number of values in the window that should be non-null before computing a result. If fewer than min_periods non-null values are present, the result is null. |
ignore_nulls |
Ignore missing values when calculating weights. If FALSE (default), weights are based on absolute positions; if TRUE, weights are based on relative positions. |
A polars expression
df <- pl$DataFrame(a = 1:3) df$select(pl$col("a")$ewm_var(com = 1, ignore_nulls = FALSE))
Exclude columns from a multi-column expression.
expr__exclude(...)
... |
The name or datatype of the column(s) to exclude. Accepts regular
expression input. Regular expressions should start with ^ and end with $. |
A polars expression
df <- pl$DataFrame(aa = 1:2, ba = c("a", NA), cc = c(NA, 2.5)) df # Exclude by column name(s): df$select(pl$all()$exclude("ba")) # Exclude by regex, e.g. removing all columns whose names end with the # letter "a": df$select(pl$all()$exclude("^.*a$")) # Exclude by dtype(s), e.g. removing all columns of type Int64 or Float64: df$select(pl$all()$exclude(pl$Int64, pl$Float64))
Compute the exponential
expr__exp()
A polars expression
pl$DataFrame(a = c(1, 2, 4))$ with_columns(exp = pl$col("a")$exp())
This means that every item is expanded to a new row.
expr__explode()
A polars expression
df <- pl$DataFrame( groups = c("a", "b"), values = list(1:2, 3:4) ) df$select(pl$col("values")$explode())
Extend the Series with n copies of a value
expr__extend_constant(value, n)
value |
A constant literal value or a unit expression with which to extend the expression result Series. This can be NULL to extend with null values. |
n |
The number of additional values that will be added. |
A polars expression
df <- pl$DataFrame(values = 1:3) df$select(pl$col("values")$extend_constant(99, n = 2))
Fill floating point NaN values with a fill value
expr__fill_nan(value)
value |
Value used to fill NaN values. |
A polars expression
df <- pl$DataFrame(a = c(1, NA, 2, NaN)) df$with_columns( filled_nan = pl$col("a")$fill_nan(99) )
Fill null values with a value or strategy
expr__fill_null(value, strategy = NULL, limit = NULL)
value |
Value used to fill null values. Can be missing if strategy is specified. |
strategy |
Strategy used to fill null values. Must be one of "forward", "backward", "min", "max", "mean", "zero" or "one". |
limit |
Number of consecutive null values to fill when using the "forward" or "backward" strategy. |
A polars expression
df <- pl$DataFrame(a = c(1, NA, 2, NaN)) df$with_columns( filled_null_zero = pl$col("a")$fill_null(strategy = "zero"), filled_null_99 = pl$col("a")$fill_null(99), filled_null_forward = pl$col("a")$fill_null(strategy = "forward"), filled_null_expr = pl$col("a")$fill_null(pl$col("a")$median()) )
Filter elements by a boolean expression
Elements where the filter does not evaluate to TRUE are discarded, including nulls. This is mostly useful in an aggregation context. If you want to filter on a DataFrame level, use DataFrame$filter() or LazyFrame$filter().
expr__filter(...)
... |
<dynamic-dots> Expression(s) that evaluate to a boolean Series. |
A polars expression
df <- pl$DataFrame( group_col = c("g1", "g1", "g2"), b = c(1, 2, 3) ) df df$group_by("group_col")$agg( lt = pl$col("b")$filter(pl$col("b") < 2), gte = pl$col("b")$filter(pl$col("b") >= 2) )
Get the first value
expr__first()
A polars expression
pl$DataFrame(x = 3:1)$with_columns(first = pl$col("x")$first())
Flatten a list or string column. This is an alias for $explode().
expr__flatten()
A polars expression
df <- pl$DataFrame( group = c("a", "b", "b"), values = list(1:2, 2:3, 4) ) df$group_by("group")$agg(pl$col("values")$flatten())
Rounds down to the nearest integer value. This only works on floating point Series.
expr__floor()
A polars expression
df <- pl$DataFrame(a = c(0.3, 0.5, 1.0, 1.1)) df$with_columns( floor = pl$col("a")$floor() )
Method equivalent of floor division operator expr %/% other.
$floordiv() is an alias for $floor_div(), which exists for compatibility with Python Polars.
expr__floor_div(other) expr__floordiv(other)
other |
Numeric literal or expression value. |
A polars expression
Arithmetic operators
df <- pl$DataFrame(x = 1:5) df$with_columns( `x/2` = pl$col("x")$true_div(2), `x%/%2` = pl$col("x")$floor_div(2) )
Fill missing values with the last non-null value
expr__forward_fill(limit = NULL)
limit |
The number of consecutive null values to forward fill. |
A polars expression
df <- pl$DataFrame( a = c(1, 2, NA), b = c(4, NA, 6), c = c(2, NA, NA) ) df$select(pl$all()$forward_fill()) df$select(pl$all()$forward_fill(limit = 1))
Take values by index
expr__gather(indices)
indices |
An expression that leads to a UInt32 dtyped Series. |
A polars expression
df <- pl$DataFrame( group = c("one", "one", "one", "two", "two", "two"), value = c(1, 98, 2, 3, 99, 4) ) df$group_by("group", maintain_order = TRUE)$agg( pl$col("value")$gather(c(2, 1)) )
Take every n-th value in the Series and return as a new Series
expr__gather_every(n, offset = 0)
n |
Gather every n-th row. |
offset |
Starting index. |
A polars expression
df <- pl$DataFrame(foo = 1:9) df$select(pl$col("foo")$gather_every(3)) df$select(pl$col("foo")$gather_every(3, offset = 1))
Check greater or equal inequality
expr__ge(other)
other |
A literal or expression value to compare with. |
A polars expression
df <- pl$DataFrame(x = 1:3) df$with_columns( with_ge = pl$col("x")$ge(pl$lit(2)), with_symbol = pl$col("x") >= pl$lit(2) )
Return a single value by index
expr__get(index)
index |
An expression that leads to a UInt32 dtyped Series. |
A polars expression
df <- pl$DataFrame( group = c("one", "one", "one", "two", "two", "two"), value = c(1, 98, 2, 3, 99, 4) ) df$group_by("group", maintain_order = TRUE)$agg( pl$col("value")$get(1) )
Check greater than inequality
expr__gt(other)
other |
A literal or expression value to compare with. |
A polars expression
df <- pl$DataFrame(x = 1:3) df$with_columns( with_gt = pl$col("x")$gt(pl$lit(2)), with_symbol = pl$col("x") > pl$lit(2) )
Check whether the expression contains one or more null values
expr__has_nulls()
A polars expression
df <- pl$DataFrame( a = c(NA, 1, NA), b = c(10, NA, 300), c = c(350, 650, 850) ) df$select(pl$all()$has_nulls())
Hash elements
expr__hash(seed = 0, seed_1 = NULL, seed_2 = NULL, seed_3 = NULL)
seed |
Integer, random seed parameter. Defaults to 0. |
seed_1, seed_2, seed_3 |
Integer, random seed parameters. Default to the value of seed if not set. |
This implementation of hash does not guarantee stable results across different Polars versions. Its stability is only guaranteed within a single version.
A polars expression
df <- pl$DataFrame(a = c(1, 2, NA), b = c("x", NA, "z")) df$with_columns(pl$all()$hash(10, 20, 30, 40))
Get the first n elements
expr__head(n = 10)
n |
Number of elements to take. |
A polars expression
pl$DataFrame(x = 1:11)$select(pl$col("x")$head(3))
Bin values into buckets and count their occurrences
expr__hist( bins = NULL, ..., bin_count = NULL, include_category = FALSE, include_breakpoint = FALSE )
bins |
Discretizations to make. If NULL, the bin boundaries are determined from the data. |
... |
These dots are for future extensions and must be empty. |
bin_count |
If no bins provided, this will be used to determine the distance of the bins. |
include_category |
Include a column that shows the intervals as categories. |
include_breakpoint |
Include a column that indicates the upper breakpoint. |
A polars expression
df <- pl$DataFrame(a = c(1, 3, 8, 8, 2, 1, 3)) df$select(pl$col("a")$hist(bins = 1:3)) df$select( pl$col("a")$hist( bins = 1:3, include_category = TRUE, include_breakpoint = TRUE ) )
Aggregate values into a list
expr__implode()
A polars expression
df <- pl$DataFrame(a = 1:3, b = 4:6) df$with_columns(pl$col("a")$implode())
Fill null values using interpolation
expr__interpolate(method = c("linear", "nearest"))
method |
Interpolation method. Must be one of "linear" or "nearest". |
A polars expression
df <- pl$DataFrame(a = c(1, NA, 3), b = c(1, NaN, 3)) df$with_columns( a_interpolated = pl$col("a")$interpolate(), b_interpolated = pl$col("b")$interpolate() )
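The method argument switches between an evenly spaced fill and snapping to the nearest non-null neighbour. A minimal sketch contrasting the two:
df <- pl$DataFrame(a = c(1, NA, NA, 10))
df$with_columns(
  linear = pl$col("a")$interpolate(),          # evenly spaced: 4 and 7
  nearest = pl$col("a")$interpolate("nearest") # closest neighbour: 1 and 10
)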
Fill null values using interpolation based on another column
expr__interpolate_by(by)
by |
Column to interpolate values based on. |
A polars expression
df <- pl$DataFrame(a = c(1, NA, NA, 3), b = c(1, 2, 7, 8)) df$with_columns( a_interpolated = pl$col("a")$interpolate_by("b") )
Check if an expression is between the given lower and upper bounds
expr__is_between( lower_bound, upper_bound, closed = c("both", "left", "right", "none") )
lower_bound |
Lower bound value. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals. |
upper_bound |
Upper bound value. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals. |
closed |
Define which sides of the interval are closed (inclusive). Must be one of "both", "left", "right", or "none". |
If the value of the lower_bound is greater than that of the upper_bound, then the result will be FALSE, as no value can satisfy the condition.
A polars expression
df <- pl$DataFrame(num = 1:5) df$with_columns( is_between = pl$col("num")$is_between(2, 4) ) # Use the closed argument to include or exclude the values at the bounds: df$with_columns( is_between = pl$col("num")$is_between(2, 4, closed = "left") ) # You can also use strings as well as numeric/temporal values (note: ensure # that string literals are wrapped with lit so as not to conflate them with # column names): df <- pl$DataFrame(a = letters[1:5]) df$with_columns( is_between = pl$col("a")$is_between(pl$lit("a"), pl$lit("c")) ) # Use column expressions as lower/upper bounds, comparing to a literal value: df <- pl$DataFrame(a = 1:5, b = 5:1) df$with_columns( between_ab = pl$lit(3)$is_between(pl$col("a"), pl$col("b")) )
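As the note above says, inverted bounds cannot be satisfied, so the mask is FALSE everywhere. A minimal sketch:
df <- pl$DataFrame(num = 1:5)
# lower_bound (4) exceeds upper_bound (2): every row is FALSE
df$with_columns(is_between = pl$col("num")$is_between(4, 2))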
Return a boolean mask indicating duplicated values
expr__is_duplicated()
A polars expression
df <- pl$DataFrame(a = c(1, 1, 2, 3, 2)) df$select(pl$col("a")$is_duplicated())
Check if elements are finite
expr__is_finite()
A polars expression
df <- pl$DataFrame(a = c(1, 2), b = c(3, Inf)) df$with_columns( a_finite = pl$col("a")$is_finite(), b_finite = pl$col("b")$is_finite() )
Return a boolean mask indicating the first occurrence of each distinct value
expr__is_first_distinct()
A polars expression
df <- pl$DataFrame(a = c(1, 1, 2, 3, 2)) df$with_columns( is_first_distinct = pl$col("a")$is_first_distinct() )
Check if elements of an expression are present in another expression
expr__is_in(other)
other |
Accepts expression input. Strings are parsed as column names. |
A polars expression
df <- pl$DataFrame( sets = list(1:3, 1:2, 9:10), optional_members = 1:3 ) df$with_columns( contains = pl$col("optional_members")$is_in("sets") )
Check if elements are infinite
expr__is_infinite()
A polars expression
df <- pl$DataFrame(a = c(1, 2), b = c(3, Inf)) df$with_columns( a_infinite = pl$col("a")$is_infinite(), b_infinite = pl$col("b")$is_infinite() )
Return a boolean mask indicating the last occurrence of each distinct value
expr__is_last_distinct()
A polars expression
df <- pl$DataFrame(a = c(1, 1, 2, 3, 2)) df$with_columns( is_last_distinct = pl$col("a")$is_last_distinct() )
Check if elements are NaN. Floating point NaN (Not A Number) should not be confused with missing data represented as NA (in R) or null (in Polars).
expr__is_nan()
A polars expression
df <- pl$DataFrame( a = c(1, 2, NA, 1, 5), b = c(1, 2, NaN, 1, 5) ) df$with_columns( a_nan = pl$col("a")$is_nan(), b_nan = pl$col("b")$is_nan() )
Check if elements are not NaN. Floating point NaN (Not A Number) should not be confused with missing data represented as NA (in R) or null (in Polars).
expr__is_not_nan()
A polars expression
df <- pl$DataFrame( a = c(1, 2, NA, 1, 5), b = c(1, 2, NaN, 1, 5) ) df$with_columns( a_not_nan = pl$col("a")$is_not_nan(), b_not_nan = pl$col("b")$is_not_nan() )
Check if elements are not NULL
expr__is_not_null()
A polars expression
df <- pl$DataFrame( a = c(1, 2, NA, 1, 5), b = c(1, 2, NaN, 1, 5) ) df$with_columns( a_not_null = pl$col("a")$is_not_null(), b_not_null = pl$col("b")$is_not_null() )
Check if elements are NULL
expr__is_null()
A polars expression
df <- pl$DataFrame( a = c(1, 2, NA, 1, 5), b = c(1, 2, NaN, 1, 5) ) df$with_columns( a_null = pl$col("a")$is_null(), b_null = pl$col("b")$is_null() )
Return a boolean mask indicating unique values
expr__is_unique()
A polars expression
df <- pl$DataFrame(a = c(1, 1, 2, 3, 2)) df$select(pl$col("a")$is_unique())
Kurtosis is the fourth central moment divided by the square of the variance. If Fisher's definition is used, then 3.0 is subtracted from the result to give 0.0 for a normal distribution. If bias is FALSE, then the kurtosis is calculated using k statistics to eliminate bias coming from biased moment estimators.
expr__kurtosis(..., fisher = TRUE, bias = TRUE)
... |
These dots are for future extensions and must be empty. |
fisher |
If TRUE (default), Fisher's definition is used (a normal distribution gives 0.0); if FALSE, Pearson's definition is used (a normal distribution gives 3.0). |
bias |
If FALSE, the calculations are corrected for statistical bias. |
A polars expression
df <- pl$DataFrame(x = c(1, 2, 3, 2, 1)) df$select(pl$col("x")$kurtosis())
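For reference, with the defaults fisher = TRUE and bias = TRUE the statistic is the fourth central moment divided by the squared second central moment, minus 3. A hand-rolled check in base R:
x <- c(1, 2, 3, 2, 1)
m2 <- mean((x - mean(x))^2) # biased second central moment (the variance)
m4 <- mean((x - mean(x))^4) # biased fourth central moment
m4 / m2^2 - 3               # same value as pl$col("x")$kurtosis() above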
Get the last value
expr__last()
A polars expression
pl$DataFrame(x = 3:1)$with_columns(last = pl$col("x")$last())
Check lower or equal inequality
expr__le(other)
other |
A literal or expression value to compare with. |
A polars expression
df <- pl$DataFrame(x = 1:3) df$with_columns( with_le = pl$col("x")$le(pl$lit(2)), with_symbol = pl$col("x") <= pl$lit(2) )
Return the number of elements in the column. Null values are counted in the total.
expr__len()
A polars expression
df <- pl$DataFrame(a = 1:3, b = c(NA, 4, 4)) df$select(pl$all()$len())
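For contrast, a minimal sketch comparing $len() with $count(), which (assuming it behaves as in upstream Polars) excludes null values:
df <- pl$DataFrame(b = c(NA, 4, 4))
df$select(
  len = pl$col("b")$len(),    # 3: nulls are counted
  count = pl$col("b")$count() # 2: nulls are excluded
)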
This is an alias for $head().
expr__limit(n = 10)
n |
Number of rows to return. |
A polars expression
df <- pl$DataFrame(a = 1:9) df$select(pl$col("a")$limit(3))
Compute the logarithm
expr__log(base = exp(1))
base |
Numeric value used as base, defaults to exp(1), i.e. the natural logarithm. |
A polars expression
pl$DataFrame(a = c(1, 2, 4))$ with_columns( log = pl$col("a")$log(), log_base_2 = pl$col("a")$log(base = 2) )
Compute the base-10 logarithm
expr__log10()
A polars expression
pl$DataFrame(a = c(1, 2, 4))$ with_columns(log10 = pl$col("a")$log10())
This computes log(1 + x) but is more numerically stable for x close to zero.
expr__log1p()
A polars expression
pl$DataFrame(a = c(1, 2, 4))$ with_columns(log1p = pl$col("a")$log1p())
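The stability gain is easiest to see in base R, which offers the same pair of functions:
x <- 1e-15
log(1 + x) # 1.110223e-15: precision is lost when 1 + x is rounded
log1p(x)   # 1e-15: computed without forming 1 + x first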
Returns a unit Series with the lowest value possible for the dtype of this expression.
expr__lower_bound()
A polars expression
df <- pl$DataFrame(a = 1:3) df$select(pl$col("a")$lower_bound())
Check strictly lower inequality
expr__lt(other)
other |
A literal or expression value to compare with. |
A polars expression
df <- pl$DataFrame(x = 1:3) df$with_columns( with_lt = pl$col("x")$lt(pl$lit(2)), with_symbol = pl$col("x") < pl$lit(2) )
Get the maximum value
expr__max()
A polars expression
pl$DataFrame(x = c(1, NaN, 3))$ with_columns(max = pl$col("x")$max())
Get mean value
expr__mean()
A polars expression
pl$DataFrame(x = c(1, 3, 4, NA))$ with_columns(mean = pl$col("x")$mean())
Get median value
expr__median()
A polars expression
pl$DataFrame(x = c(1, 3, 4, NA))$ with_columns(median = pl$col("x")$median())
Get the minimum value
expr__min()
A polars expression
pl$DataFrame(x = c(1, NaN, 3))$ with_columns(min = pl$col("x")$min())
Method equivalent of the modulus operator expr %% other.
expr__mod(other)
other |
Numeric literal or expression value. |
A polars expression
Arithmetic operators
df <- pl$DataFrame(x = -5L:5L) df$with_columns( `x%%2` = pl$col("x")$mod(2) )
Compute the most occurring value(s)
expr__mode()
A polars expression
df <- pl$DataFrame(a = c(1, 1, 2, 3), b = c(1, 1, 2, 2)) df$select(pl$col("a")$mode()) df$select(pl$col("b")$mode())
Method equivalent of the multiplication operator expr * other.
expr__mul(other)
other |
Numeric literal or expression value. |
A polars expression
Arithmetic operators
df <- pl$DataFrame(x = c(1, 2, 4, 8, 16)) df$with_columns( `x*2` = pl$col("x")$mul(2), `x * xlog2` = pl$col("x")$mul(pl$col("x")$log(2)) )
Count unique values. null is considered to be a unique value for the purposes of this operation.
expr__n_unique()
A polars expression
df <- pl$DataFrame( x = c(1, 1, 2, 2, 3), y = c(1, 1, 1, NA, NA) ) df$select( x_unique = pl$col("x")$n_unique(), y_unique = pl$col("y")$n_unique() )
Get the maximum value, but propagate NaN values: this returns NaN if there are any.
expr__nan_max()
A polars expression
pl$DataFrame(x = c(1, NA, 3, NaN, Inf))$ with_columns(nan_max = pl$col("x")$nan_max())
Get the minimum value, but propagate NaN values: this returns NaN if there are any.
expr__nan_min()
A polars expression
pl$DataFrame(x = c(1, NA, 3, NaN, Inf))$ with_columns(nan_min = pl$col("x")$nan_min())
Check inequality. This propagates null values, i.e. any comparison involving null will return null. Use $ne_missing() to consider null values as equal.
expr__ne(other)
other |
A literal or expression value to compare with. |
A polars expression
df <- pl$DataFrame(x = c(NA, FALSE, TRUE), y = c(TRUE, TRUE, TRUE)) df$with_columns( ne = pl$col("x")$ne(pl$col("y")), ne_missing = pl$col("x")$ne_missing(pl$col("y")) )
Check inequality without null propagation. Unlike $ne(), null values are treated as equal, so comparisons involving null return TRUE or FALSE instead of null.
expr__ne_missing(other)
other |
A literal or expression value to compare with. |
A polars expression
df <- pl$DataFrame(x = c(NA, FALSE, TRUE), y = c(TRUE, TRUE, TRUE)) df$with_columns( ne = pl$col("x")$ne("y"), ne_missing = pl$col("x")$ne_missing("y") )
Negate a boolean expression
expr__not()
A polars expression
df <- pl$DataFrame(a = c(TRUE, FALSE, FALSE, NA)) df$with_columns(a_not = pl$col("a")$not()) # Same result with "!" df$with_columns(a_not = !pl$col("a"))
Count null values
expr__null_count()
A polars expression
df <- pl$DataFrame( a = c(NA, 1, NA), b = c(10, NA, 300), c = c(1, 2, 2) ) df$select(pl$all()$null_count())
Combine two boolean expressions with OR.
expr__or(other)
other |
A boolean literal or expression value to combine with. |
A polars expression
pl$lit(TRUE) | FALSE pl$lit(TRUE)$or(pl$lit(TRUE))
This expression is similar to performing a group by aggregation and joining the result back into the original DataFrame. The outcome is similar to how window functions work in PostgreSQL.
expr__over( ..., order_by = NULL, mapping_strategy = c("group_to_rows", "join", "explode") )
... |
Column(s) to partition the window over. Accepts expression input; strings are parsed as column names. |
order_by |
Order the window functions/aggregations within the partitioned groups by the result of the expression passed to order_by. |
mapping_strategy |
One of the following: "group_to_rows" (default): if the aggregation results in multiple values, assign them back to their positions in the DataFrame; "join": aggregate the values in a list and repeat the list for every row in the group (this can be memory intensive); "explode": explode the grouped data into new rows, similar to group_by + agg + explode. |
A polars expression
# Pass the name of a column to compute the expression over that column. df <- pl$DataFrame( a = c("a", "a", "b", "b", "b"), b = c(1, 2, 3, 5, 3), c = c(5, 4, 2, 1, 3) ) df$with_columns( pl$col("c")$max()$over("a")$name$suffix("_max") ) # Expression input is supported. df$with_columns( pl$col("c")$max()$over(pl$col("b") %/% 2)$name$suffix("_max") ) # Group by multiple columns by passing several column names a or list of # expressions. df$with_columns( pl$col("c")$min()$over("a", "b")$name$suffix("_min") ) group_vars <- list(pl$col("a"), pl$col("b")) df$with_columns( pl$col("c")$min()$over(!!!group_vars)$name$suffix("_min") ) # Or use positional arguments to group by multiple columns in the same way. df$with_columns( pl$col("c")$min()$over("a", pl$col("b") %% 2)$name$suffix("_min") ) # Alternative mapping strategy: join values in a list output df$with_columns( top_2 = pl$col("c")$top_k(2)$over("a", mapping_strategy = "join") ) # order_by specifies how values are sorted within a group, which is # essential when the operation depends on the order of values df <- pl$DataFrame( g = c(1, 1, 1, 1, 2, 2, 2, 2), t = c(1, 2, 3, 4, 4, 1, 2, 3), x = c(10, 20, 30, 40, 10, 20, 30, 40) ) # without order_by, the first and second values in the second group would # be inverted, which would be wrong df$with_columns( x_lag = pl$col("x")$shift(1)$over("g", order_by = "t") )
Computes the percentage change (as fraction) between the current element and the most-recent non-null element at least n period(s) before the current element. By default it computes the change from the previous row.
expr__pct_change(n = 1)
n |
Integer or Expr indicating the number of periods to shift for forming percent change. |
A polars expression
df <- pl$DataFrame(a = c(10:12, NA, 12)) df$with_columns( pct_change = pl$col("a")$pct_change() )
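Concretely, each output value is (current - previous) / previous, where "previous" is the most recent non-null value at least n rows back, and the first n rows are null. A worked check:
df <- pl$DataFrame(a = c(10, 11, 12))
df$with_columns(pct_change = pl$col("a")$pct_change())
# row 2: (11 - 10) / 10 = 0.1; row 3: (12 - 11) / 11 ~= 0.0909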
Get a boolean mask of the local maximum peaks
expr__peak_max()
A polars expression
df <- pl$DataFrame(x = c(1, 2, 3, 2, 3, 4, 5, 2)) df$with_columns(peak_max = pl$col("x")$peak_max())
Get a boolean mask of the local minimum peaks
expr__peak_min()
A polars expression
df <- pl$DataFrame(x = c(1, 2, 3, 2, 3, 4, 5, 2)) df$with_columns(peak_min = pl$col("x")$peak_min())
Method equivalent of the exponentiation operator expr ^ exponent.
expr__pow(exponent)
exponent |
Numeric literal or expression value. |
A polars expression
Arithmetic operators
df <- pl$DataFrame(x = c(1, 2, 4, 8)) df$with_columns( cube = pl$col("x")$pow(3), `x^xlog2` = pl$col("x")$pow(pl$col("x")$log(2)) )
Compute the product of an expression.
expr__product()
A polars expression
pl$DataFrame(a = 1:3, b = c(NA, 4, 4))$ select(pl$all()$product())
Bin continuous values into discrete categories based on their quantiles
expr__qcut( quantiles, ..., labels = NULL, left_closed = FALSE, allow_duplicates = FALSE, include_breaks = FALSE )
quantiles |
Either a vector of quantile probabilities between 0 and 1 or a positive integer determining the number of bins with uniform probability. |
... |
These dots are for future extensions and must be empty. |
labels |
Names of the categories. The number of labels must be equal to the number of categories. |
left_closed |
Set the intervals to be left-closed instead of right-closed. |
allow_duplicates |
If TRUE, duplicates in the resulting quantiles are dropped instead of raising an error. |
include_breaks |
Include a column with the right endpoint of the bin each observation falls in. This will change the data type of the output from a Categorical to a Struct. |
A polars expression
# Divide a column into three categories according to pre-defined quantile # probabilities. df <- pl$DataFrame(foo = -2:2) df$with_columns( qcut = pl$col("foo")$qcut(c(0.25, 0.75), labels = c("a", "b", "c")) ) # Divide a column into two categories using uniform quantile probabilities. df$with_columns( qcut = pl$col("foo")$qcut(2, labels = c("low", "high"), left_closed = TRUE) ) # Add both the category and the breakpoint. df$with_columns( qcut = pl$col("foo")$qcut(c(0.25, 0.75), include_breaks = TRUE) )$unnest()
Get quantile value(s)
expr__quantile( quantile, interpolation = c("nearest", "higher", "lower", "midpoint", "linear") )
quantile |
Quantile between 0.0 and 1.0. |
interpolation |
Interpolation method. Must be one of "nearest", "higher", "lower", "midpoint", or "linear". |
A polars expression
df <- pl$DataFrame(a = 0:5) df$select(pl$col("a")$quantile(0.3)) df$select(pl$col("a")$quantile(0.3, interpolation = "higher")) df$select(pl$col("a")$quantile(0.3, interpolation = "lower")) df$select(pl$col("a")$quantile(0.3, interpolation = "midpoint")) df$select(pl$col("a")$quantile(0.3, interpolation = "linear"))
Convert from degrees to radians
expr__radians()
A polars expression
pl$DataFrame(a = c(-720, -540, -360, -180, 0, 180, 360, 540, 720))$ with_columns(radians = pl$col("a")$radians())
Assign ranks to data, dealing with ties appropriately
expr__rank( method = c("average", "min", "max", "dense", "ordinal", "random"), ..., descending = FALSE, seed = NULL )
method |
The method used to assign ranks to tied elements. Must be one of the following: "average" (default): the average of the ranks that would have been assigned to all tied values; "min": the minimum of those ranks (also known as competition ranking); "max": the maximum of those ranks; "dense": like "min", but the next rank after a tie is incremented by 1 rather than skipping ranks; "ordinal": all values are given a distinct rank in order of appearance; "random": like "ordinal", but ties are broken randomly. |
... |
These dots are for future extensions and must be empty. |
descending |
Rank in descending order. |
seed |
Integer. Only used if method = "random". |
A polars expression
# Default is to use the "average" method to break ties df <- pl$DataFrame(a = c(3, 6, 1, 1, 6)) df$with_columns(rank = pl$col("a")$rank()) # Ordinal method df$with_columns(rank = pl$col("a")$rank("ordinal")) # Use "rank" with "over" to rank within groups: df <- pl$DataFrame( a = c(1, 1, 2, 2, 2), b = c(6, 7, 5, 14, 11) ) df$with_columns( rank = pl$col("b")$rank()$over("a") )
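The tie-breaking methods differ only in how shared values map to ranks. A minimal sketch contrasting "min" and "dense":
df <- pl$DataFrame(a = c(3, 6, 1, 1, 6))
df$with_columns(
  rank_min = pl$col("a")$rank("min"),    # ties share the lowest rank: 3, 4, 1, 1, 4
  rank_dense = pl$col("a")$rank("dense") # no gaps after ties: 2, 3, 1, 1, 3
)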
Create a single chunk of memory for this Series
expr__rechunk()
A polars expression
df <- pl$DataFrame(a = c(1, 1, 2)) # Create a Series with 3 nulls, append column a then rechunk df$select(pl$repeat_(NA, 3)$append(pl$col("a"))$rechunk())
This operation is only allowed for 64-bit integers. For integers with fewer bits, you can safely use the $cast() operation.
expr__reinterpret(..., signed = TRUE)
... |
These dots are for future extensions and must be empty. |
signed |
If TRUE (default), reinterpret into Int64; otherwise, reinterpret into UInt64. |
A polars expression
df <- pl$DataFrame(a = c(1, 1, 2))$cast(pl$UInt64) # Reinterpret the UInt64 column as Int64 df$with_columns( reinterpreted = pl$col("a")$reinterpret() )
The repeated elements are expanded into a List dtype.
expr__repeat_by(by)
by |
Numeric column that determines how often the values will be repeated. The column will be coerced to UInt32. Give this dtype to make the coercion a no-op. Accepts expression input, strings are parsed as column names. |
A polars expression
df <- pl$DataFrame(a = c("x", "y", "z"), n = 1:3) df$with_columns( repeated = pl$col("a")$repeat_by("n") )
df <- pl$DataFrame(a = c("x", "y", "z"), n = 1:3) df$with_columns( repeated = pl$col("a")$repeat_by("n") )
This allows one to recode values in a column, leaving all other values unchanged. See $replace_strict() to give a default value to all other values and to specify the output datatype.
expr__replace(old, new)
old |
Value or vector of values to replace. Accepts expression input. Vectors are parsed as Series, other non-expression inputs are parsed as literals. Also accepts a named list like list(`2` = 100, `3` = 200), where names are the values to replace and values are the replacements. |
new |
Value or vector of values to replace by. Accepts expression input. Vectors are parsed as Series, other non-expression inputs are parsed as literals. Length must match the length of old or have length 1. |
The global string cache must be enabled when replacing categorical values.
A polars expression
df <- pl$DataFrame(a = c(1, 2, 2, 3)) # "old" and "new" can take vectors of length 1 or of same length df$with_columns(replaced = pl$col("a")$replace(2, 100)) df$with_columns(replaced = pl$col("a")$replace(c(2, 3), c(100, 200))) # "old" can be a named list where names are values to replace, and values are # the replacements mapping <- list(`2` = 100, `3` = 200) df$with_columns(replaced = pl$col("a")$replace(mapping)) # The original data type is preserved when replacing by values of a # different data type. Use $replace_strict() to replace and change the # return data type. df <- pl$DataFrame(a = c("x", "y", "z")) mapping <- list(x = 1, y = 2, z = 3) df$with_columns(replaced = pl$col("a")$replace(mapping)) # "old" and "new" can take Expr df <- pl$DataFrame(a = c(1, 2, 2, 3), b = c(1.5, 2.5, 5, 1)) df$with_columns( replaced = pl$col("a")$replace( old = pl$col("a")$max(), new = pl$col("b")$sum() ) )
This changes all the values in a column, either using a specific replacement or a default one. See $replace() to replace only a subset of values.
expr__replace_strict(old, new, ..., default = NULL, return_dtype = NULL)
old |
Value or vector of values to replace. Accepts expression input. Vectors are parsed as Series, other non-expression inputs are parsed as literals. Also accepts a named list like list(`2` = 100, `3` = 200), where names are the values to replace and values are the replacements. |
new |
Value or vector of values to replace by. Accepts expression input. Vectors are parsed as Series, other non-expression inputs are parsed as literals. Length must match the length of old or have length 1. |
... |
These dots are for future extensions and must be empty. |
default |
Set values that were not replaced to this value. If NULL (default), an error is raised if any non-null values were not replaced. |
return_dtype |
The data type of the resulting expression. If NULL (default), the data type is determined automatically from the other inputs. |
The global string cache must be enabled when replacing categorical values.
A polars expression
df <- pl$DataFrame(a = c(1, 2, 2, 3)) # "old" and "new" can take vectors of length 1 or of same length df$with_columns(replaced = pl$col("a")$replace_strict(2, 100, default = 1)) df$with_columns( replaced = pl$col("a")$replace_strict(c(2, 3), c(100, 200), default = 1) ) # "old" can be a named list where names are values to replace, and values are # the replacements mapping <- list(`2` = 100, `3` = 200) df$with_columns(replaced = pl$col("a")$replace_strict(mapping, default = -1)) # By default, an error is raised if any non-null values were not replaced. # Specify a default to set all values that were not matched. tryCatch( df$with_columns(replaced = pl$col("a")$replace_strict(mapping)), error = function(e) print(e) ) # one can specify the data type to return instead of automatically # inferring it df$with_columns( replaced = pl$col("a")$replace_strict( mapping, default = 1, return_dtype = pl$Int32 ) ) # "old", "new", and "default" can take Expr df <- pl$DataFrame(a = c(1, 2, 2, 3), b = c(1.5, 2.5, 5, 1)) df$with_columns( replaced = pl$col("a")$replace_strict( old = pl$col("a")$max(), new = pl$col("b")$sum(), default = pl$col("b"), ) )
Reshape this Expr to a flat Series or a Series of Lists
expr__reshape(dimensions)
dimensions |
An integer vector whose length is the number of dimensions, giving the size of each dimension. A single -1 can be used to infer that dimension from the length of the data. |
nested_type |
The nested data type to create. List only supports 2 dimensions, whereas Array supports an arbitrary number of dimensions. |
If a single dimension is given, results in an expression of the original data type. If multiple dimensions are given, results in an expression of data type List with shape equal to the dimensions.
A polars expression
df <- pl$DataFrame(foo = 1:9) df$select(pl$col("foo")$reshape(9)) df$select(pl$col("foo")$reshape(c(3, 3))) # Use `-1` to infer the other dimension df$select(pl$col("foo")$reshape(c(-1, 3))) df$select(pl$col("foo")$reshape(c(3, -1))) # One can specify more than 2 dimensions by using the Array type df <- pl$DataFrame(foo = 1:12) df$select( pl$col("foo")$reshape(c(3, 2, 2), nested_type = pl$Array(pl$Float32, 2)) )
Reverse an expression
expr__reverse()
A polars expression
df <- pl$DataFrame( a = 1:5, fruits = c("banana", "banana", "apple", "apple", "banana"), b = 5:1 ) df$with_columns( pl$all()$reverse()$name$suffix("_reverse") )
Run-length encoding (RLE) encodes data by storing each run of identical values as a single value and its length.
expr__rle()
A polars expression
df <- pl$DataFrame(a = c(1, 1, 2, 1, NA, 1, 3, 3)) df$select(pl$col("a")$rle())$unnest("a")
The ID starts at 0 and increases by one each time the value of the column changes.
expr__rle_id()
This functionality is especially useful for defining a new group for every time a column’s value changes, rather than for every distinct value of that column.
A polars expression
df <- pl$DataFrame( a = c(1, 2, 1, 1, 1), b = c("x", "x", NA, "y", "y") ) df$with_columns( rle_id_a = pl$col("a")$rle_id(), rle_id_ab = pl$struct("a", "b")$rle_id() )
If you have a time series <t_0, t_1, ..., t_n>, then by default the windows created will be:
(t_0 - period, t_0]
(t_1 - period, t_1]
…
(t_n - period, t_n]
whereas if you pass a non-default offset, then the windows will be:
(t_0 + offset, t_0 + offset + period]
(t_1 + offset, t_1 + offset + period]
…
(t_n + offset, t_n + offset + period]
expr__rolling(index_column, ..., period, offset = NULL, closed = "right")
index_column |
Character. Name of the column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order. In case of a rolling group by on indices, dtype needs to be one of UInt32, UInt64, Int32, Int64. Note that the first three get cast to Int64, so if performance matters use an Int64 column. |
... |
These dots are for future extensions and must be empty. |
period |
Length of the window - must be non-negative. |
offset |
Offset of the window. Default is -period. |
closed |
Define which sides of the range are closed (inclusive). One of the following: "left", "right", "both", "none". Default is "right". |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling() - this method can cache the window size computation.
A polars expression
dates <- as.POSIXct( c( "2020-01-01 13:45:48", "2020-01-01 16:42:13", "2020-01-01 16:45:09", "2020-01-02 18:12:48", "2020-01-03 19:45:32","2020-01-08 23:16:43" ) ) df <- pl$DataFrame(dt = dates, a = c(3, 7, 5, 9, 2, 1)) df$with_columns( sum_a = pl$col("a")$sum()$rolling(index_column = "dt", period = "2d"), min_a = pl$col("a")$min()$rolling(index_column = "dt", period = "2d"), max_a = pl$col("a")$max()$rolling(index_column = "dt", period = "2d") )
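A non-default offset shifts each window as described above: for example, offset = "0d" turns the default look-back window (t - 2d, t] into the look-ahead window (t, t + 2d]. A minimal sketch reusing df from the example, assuming offset accepts the same duration strings as period:
df$with_columns(
  sum_a_ahead = pl$col("a")$sum()$rolling(
    index_column = "dt", period = "2d", offset = "0d"
  )
)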
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated. The window at a given row will include the row itself, and the window_size - 1 elements before it.
expr__rolling_max( window_size, weights = NULL, ..., min_periods = NULL, center = FALSE )
window_size |
The length of the window in number of elements. |
weights |
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window. |
min_periods |
The number of values in the window that should be non-null before computing a result. If NULL, it will be set equal to window_size. |
center |
If TRUE, set the labels at the center of the window. |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling() - this method can cache the window size computation.
A polars expression
df <- pl$DataFrame(a = 1:6) df$with_columns( rolling_max = pl$col("a")$rolling_max(window_size = 2) ) # Specify weights to multiply the values in the window with: df$with_columns( rolling_max = pl$col("a")$rolling_max( window_size = 2, weights = c(0.25, 0.75) ) ) # Center the values in the window df$with_columns( rolling_max = pl$col("a")$rolling_max(window_size = 3, center = TRUE) )
Given a by column <t_0, t_1, ..., t_n>, then closed = "right" (the default) means the windows will be:
(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
expr__rolling_max_by( by, window_size, ..., min_periods = 1, closed = c("right", "both", "left", "none") )
by |
Should be DateTime, Date, UInt64, UInt32, Int64, or Int32 data
type after conversion by |
window_size |
The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language: 1ns (1 nanosecond), 1us (1 microsecond), 1ms (1 millisecond), 1s (1 second), 1m (1 minute), 1h (1 hour), 1d (1 calendar day), 1w (1 calendar week), 1mo (1 calendar month), 1q (1 calendar quarter), 1y (1 calendar year). Or combine them: "3d12h4m25s" for 3 days, 12 hours, 4 minutes, and 25 seconds. By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year". |
min_periods |
The number of values in the window that should be
non-null before computing a result. If |
closed |
Define which sides of the interval are closed (inclusive). Default is "right". |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling() - this method can cache the window size computation.
A polars expression
df_temporal <- pl$select( index = 0:24, date = pl$datetime_range( as.POSIXct("2001-01-01"), as.POSIXct("2001-01-02"), "1h" ) ) # Compute the rolling max with the temporal windows closed on the right # (default) df_temporal$with_columns( rolling_row_max = pl$col("index")$rolling_max_by( "date", window_size = "2h" ) ) # Compute the rolling max with the closure of windows on both sides df_temporal$with_columns( rolling_row_max = pl$col("index")$rolling_max_by( "date", window_size = "2h", closed = "both" ) )
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated. The window at a given row will include the row itself, and the window_size - 1 elements before it.
expr__rolling_mean( window_size, weights = NULL, ..., min_periods = NULL, center = FALSE )
window_size |
The length of the window in number of elements. |
weights |
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window. |
min_periods |
The number of values in the window that should be non-null before computing a result. If NULL, it will be set equal to window_size. |
center |
If TRUE, set the labels at the center of the window. |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling() - this method can cache the window size computation.
A polars expression
df <- pl$DataFrame(a = 1:6) df$with_columns( rolling_mean = pl$col("a")$rolling_mean(window_size = 2) ) # Specify weights to multiply the values in the window with: df$with_columns( rolling_mean = pl$col("a")$rolling_mean( window_size = 2, weights = c(0.25, 0.75) ) ) # Center the values in the window df$with_columns( rolling_mean = pl$col("a")$rolling_mean(window_size = 3, center = TRUE) )
Given a by column <t_0, t_1, ..., t_n>, then closed = "right" (the default) means the windows will be:
(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
expr__rolling_mean_by( by, window_size, ..., min_periods = 1, closed = c("right", "both", "left", "none") )
by |
Should be DateTime, Date, UInt64, UInt32, Int64, or Int32 data
type after conversion by |
window_size |
The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language: 1ns (1 nanosecond), 1us (1 microsecond), 1ms (1 millisecond), 1s (1 second), 1m (1 minute), 1h (1 hour), 1d (1 calendar day), 1w (1 calendar week), 1mo (1 calendar month), 1q (1 calendar quarter), 1y (1 calendar year). Or combine them: "3d12h4m25s" for 3 days, 12 hours, 4 minutes, and 25 seconds. By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year". |
min_periods |
The number of values in the window that should be
non-null before computing a result. If |
closed |
Define which sides of the interval are closed (inclusive). Default is "right". |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling() - this method can cache the window size computation.
A polars expression
df_temporal <- pl$select( index = 0:24, date = pl$datetime_range( as.POSIXct("2001-01-01"), as.POSIXct("2001-01-02"), "1h" ) ) # Compute the rolling mean with the temporal windows closed on the right # (default) df_temporal$with_columns( rolling_row_mean = pl$col("index")$rolling_mean_by( "date", window_size = "2h" ) ) # Compute the rolling mean with the closure of windows on both sides df_temporal$with_columns( rolling_row_mean = pl$col("index")$rolling_mean_by( "date", window_size = "2h", closed = "both" ) )
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated. The window at a given row will include the row itself, and the window_size - 1 elements before it.
expr__rolling_median( window_size, weights = NULL, ..., min_periods = NULL, center = FALSE )
window_size |
The length of the window in number of elements. |
weights |
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window. |
min_periods |
The number of values in the window that should be non-null before computing a result. If NULL, it will be set equal to window_size. |
center |
If TRUE, set the labels at the center of the window. |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling() - this method can cache the window size computation.
A polars expression
df <- pl$DataFrame(a = 1:6) df$with_columns( rolling_median = pl$col("a")$rolling_median(window_size = 2) ) # Specify weights to multiply the values in the window with: df$with_columns( rolling_median = pl$col("a")$rolling_median( window_size = 2, weights = c(0.25, 0.75) ) ) # Center the values in the window df$with_columns( rolling_median = pl$col("a")$rolling_median(window_size = 3, center = TRUE) )
Given a by column <t_0, t_1, ..., t_n>, then closed = "right" (the default) means the windows will be:
(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
expr__rolling_median_by( by, window_size, ..., min_periods = 1, closed = c("right", "both", "left", "none") )
by |
Should be DateTime, Date, UInt64, UInt32, Int64, or Int32 data
type after conversion by |
window_size |
The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language: 1ns (1 nanosecond), 1us (1 microsecond), 1ms (1 millisecond), 1s (1 second), 1m (1 minute), 1h (1 hour), 1d (1 calendar day), 1w (1 calendar week), 1mo (1 calendar month), 1q (1 calendar quarter), 1y (1 calendar year). Or combine them: "3d12h4m25s" for 3 days, 12 hours, 4 minutes, and 25 seconds. By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year". |
min_periods |
The number of values in the window that should be
non-null before computing a result. If |
closed |
Define which sides of the interval are closed (inclusive). Default is "right". |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling() - this method can cache the window size computation.
A polars expression
df_temporal <- pl$select( index = 0:24, date = pl$datetime_range( as.POSIXct("2001-01-01"), as.POSIXct("2001-01-02"), "1h" ) ) # Compute the rolling median with the temporal windows closed on the right # (default) df_temporal$with_columns( rolling_row_median = pl$col("index")$rolling_median_by( "date", window_size = "2h" ) ) # Compute the rolling median with the closure of windows on both sides df_temporal$with_columns( rolling_row_median = pl$col("index")$rolling_median_by( "date", window_size = "2h", closed = "both" ) )
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated. The window at a given row will include the row itself, and the window_size - 1 elements before it.
expr__rolling_min( window_size, weights = NULL, ..., min_periods = NULL, center = FALSE )
window_size |
The length of the window in number of elements. |
weights |
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window. |
min_periods |
The number of values in the window that should be non-null before computing a result. If NULL, it will be set equal to window_size. |
center |
If TRUE, set the labels at the center of the window. |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling() - this method can cache the window size computation.
A polars expression
df <- pl$DataFrame(a = 1:6) df$with_columns( rolling_min = pl$col("a")$rolling_min(window_size = 2) ) # Specify weights to multiply the values in the window with: df$with_columns( rolling_min = pl$col("a")$rolling_min( window_size = 2, weights = c(0.25, 0.75) ) ) # Center the values in the window df$with_columns( rolling_min = pl$col("a")$rolling_min(window_size = 3, center = TRUE) )
Given a by column <t_0, t_1, ..., t_n>, then closed = "right" (the default) means the windows will be:
(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
expr__rolling_min_by( by, window_size, ..., min_periods = 1, closed = c("right", "both", "left", "none") )
by |
Should be DateTime, Date, UInt64, UInt32, Int64, or Int32 data
type after conversion by |
window_size |
The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language: 1ns (1 nanosecond), 1us (1 microsecond), 1ms (1 millisecond), 1s (1 second), 1m (1 minute), 1h (1 hour), 1d (1 calendar day), 1w (1 calendar week), 1mo (1 calendar month), 1q (1 calendar quarter), 1y (1 calendar year). Or combine them: "3d12h4m25s" for 3 days, 12 hours, 4 minutes, and 25 seconds. By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year". |
min_periods |
The number of values in the window that should be
non-null before computing a result. If |
closed |
Define which sides of the interval are closed (inclusive). Default is "right". |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling() - this method can cache the window size computation.
A polars expression
df_temporal <- pl$select( index = 0:24, date = pl$datetime_range( as.POSIXct("2001-01-01"), as.POSIXct("2001-01-02"), "1h" ) ) # Compute the rolling min with the temporal windows closed on the right # (default) df_temporal$with_columns( rolling_row_min = pl$col("index")$rolling_min_by( "date", window_size = "2h" ) ) # Compute the rolling min with the closure of windows on both sides df_temporal$with_columns( rolling_row_min = pl$col("index")$rolling_min_by( "date", window_size = "2h", closed = "both" ) )
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated. The window at a given row will include the row itself, and the window_size - 1 elements before it.
expr__rolling_quantile( quantile, interpolation = c("nearest", "higher", "lower", "midpoint", "linear"), window_size, weights = NULL, ..., min_periods = NULL, center = FALSE )
quantile |
Quantile between 0.0 and 1.0. |
interpolation |
Interpolation method. Must be one of "nearest", "higher", "lower", "midpoint", or "linear". |
window_size |
The length of the window in number of elements. |
weights |
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window. |
min_periods |
The number of values in the window that should be non-null before computing a result. If NULL, it will be set equal to window_size. |
center |
If TRUE, set the labels at the center of the window. |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling() - this method can cache the window size computation.
A polars expression
df <- pl$DataFrame(a = 1:6) df$with_columns( rolling_quantile = pl$col("a")$rolling_quantile( quantile = 0.25, window_size = 4 ) ) # Specify weights to multiply the values in the window with: df$with_columns( rolling_quantile = pl$col("a")$rolling_quantile( quantile = 0.25, window_size = 4, weights = c(0.2, 0.4, 0.4, 0.2) ) ) # Specify weights and interpolation method: df$with_columns( rolling_quantile = pl$col("a")$rolling_quantile( quantile = 0.25, window_size = 4, weights = c(0.2, 0.4, 0.4, 0.2), interpolation = "linear" ) ) # Center the values in the window df$with_columns( rolling_quantile = pl$col("a")$rolling_quantile( quantile = 0.25, window_size = 5, center = TRUE ) )
Given a by column <t_0, t_1, ..., t_n>, then closed = "right" (the default) means the windows will be:
(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
expr__rolling_quantile_by( by, window_size, ..., quantile, interpolation = c("nearest", "higher", "lower", "midpoint", "linear"), min_periods = 1, closed = c("right", "both", "left", "none") )
by |
Should be DateTime, Date, UInt64, UInt32, Int64, or Int32 data
type after conversion by |
window_size |
The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language:

- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 calendar day)
- 1w (1 calendar week)
- 1mo (1 calendar month)
- 1q (1 calendar quarter)
- 1y (1 calendar year)

Or combine them: "3d12h4m25s", i.e. 3 days, 12 hours, 4 minutes, and 25 seconds. By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year". |
quantile |
Quantile between 0.0 and 1.0. |
interpolation |
Interpolation method. Must be one of "nearest", "higher", "lower", "midpoint", or "linear". |
min_periods |
The number of values in the window that should be
non-null before computing a result. |
closed |
Define which sides of the interval are closed (inclusive).
Default is "right". |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), as this method can cache the window size computation.
A polars expression
df_temporal <- pl$select(
  index = 0:24,
  date = pl$datetime_range(
    as.POSIXct("2001-01-01"),
    as.POSIXct("2001-01-02"),
    "1h"
  )
)

# Compute the rolling quantile with the temporal windows closed on the right
# (default)
df_temporal$with_columns(
  rolling_row_quantile = pl$col("index")$rolling_quantile_by(
    "date",
    window_size = "2h"
  )
)

# Compute the rolling quantile with the closure of windows on both sides
df_temporal$with_columns(
  rolling_row_quantile = pl$col("index")$rolling_quantile_by(
    "date",
    window_size = "2h",
    closed = "both"
  )
)
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated. The window at a given row will include the row itself, and the window_size - 1 elements before it.
expr__rolling_skew(window_size, ..., bias = TRUE)
window_size |
The length of the window in number of elements. |
... |
These dots are for future extensions and must be empty. |
bias |
If FALSE, the calculations are corrected for statistical bias. |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), as this method can cache the window size computation.
A polars expression
df <- pl$DataFrame(a = c(1, 4, 2, 9))
df$with_columns(
  rolling_skew = pl$col("a")$rolling_skew(3)
)
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated. The window at a given row will include the row itself, and the window_size - 1 elements before it.
expr__rolling_std(
  window_size,
  weights = NULL,
  ...,
  min_periods = NULL,
  center = FALSE,
  ddof = 1
)
window_size |
The length of the window in number of elements. |
weights |
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window. |
min_periods |
The number of values in the window that should be
non-null before computing a result. If NULL, it will be set equal to window_size. |
center |
If TRUE, set the labels at the center of the window. |
ddof |
"Delta Degrees of Freedom": the divisor used in the calculation is N - ddof, where N is the number of elements in the window. |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), as this method can cache the window size computation.
A polars expression
df <- pl$DataFrame(a = 1:6)
df$with_columns(
  rolling_std = pl$col("a")$rolling_std(window_size = 2)
)

# Specify weights to multiply the values in the window with:
df$with_columns(
  rolling_std = pl$col("a")$rolling_std(
    window_size = 2,
    weights = c(0.25, 0.75)
  )
)

# Center the values in the window
df$with_columns(
  rolling_std = pl$col("a")$rolling_std(window_size = 3, center = TRUE)
)
Given a by column <t_0, t_1, ..., t_n>, then closed = "right" (the default) means the windows will be:

(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
expr__rolling_std_by(
  by,
  window_size,
  ...,
  min_periods = 1,
  closed = c("right", "both", "left", "none"),
  ddof = 1
)
by |
Should be DateTime, Date, UInt64, UInt32, Int64, or Int32 data
type after conversion by |
window_size |
The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language:

- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 calendar day)
- 1w (1 calendar week)
- 1mo (1 calendar month)
- 1q (1 calendar quarter)
- 1y (1 calendar year)

Or combine them: "3d12h4m25s", i.e. 3 days, 12 hours, 4 minutes, and 25 seconds. By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year". |
min_periods |
The number of values in the window that should be
non-null before computing a result. |
closed |
Define which sides of the interval are closed (inclusive).
Default is "right". |
ddof |
"Delta Degrees of Freedom": the divisor used in the calculation is N - ddof, where N is the number of elements in the window. |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), as this method can cache the window size computation.
A polars expression
df_temporal <- pl$select(
  index = 0:24,
  date = pl$datetime_range(
    as.POSIXct("2001-01-01"),
    as.POSIXct("2001-01-02"),
    "1h"
  )
)

# Compute the rolling std with the temporal windows closed on the right
# (default)
df_temporal$with_columns(
  rolling_row_std = pl$col("index")$rolling_std_by(
    "date",
    window_size = "2h"
  )
)

# Compute the rolling std with the closure of windows on both sides
df_temporal$with_columns(
  rolling_row_std = pl$col("index")$rolling_std_by(
    "date",
    window_size = "2h",
    closed = "both"
  )
)
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated. The window at a given row will include the row itself, and the window_size - 1 elements before it.
expr__rolling_sum(
  window_size,
  weights = NULL,
  ...,
  min_periods = NULL,
  center = FALSE
)
window_size |
The length of the window in number of elements. |
weights |
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window. |
min_periods |
The number of values in the window that should be
non-null before computing a result. If NULL, it will be set equal to window_size. |
center |
If TRUE, set the labels at the center of the window. |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), as this method can cache the window size computation.
A polars expression
df <- pl$DataFrame(a = 1:6)
df$with_columns(
  rolling_sum = pl$col("a")$rolling_sum(window_size = 2)
)

# Specify weights to multiply the values in the window with:
df$with_columns(
  rolling_sum = pl$col("a")$rolling_sum(
    window_size = 2,
    weights = c(0.25, 0.75)
  )
)

# Center the values in the window
df$with_columns(
  rolling_sum = pl$col("a")$rolling_sum(window_size = 3, center = TRUE)
)
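The weighted variant above can be checked by hand; the following is plain base R arithmetic illustrating the window formula, not polars API:

# With window_size = 2 and weights c(0.25, 0.75), the value at row i
# aggregates the row itself and the window_size - 1 = 1 element before it.
a <- c(1, 2, 3, 4, 5, 6)
w <- c(0.25, 0.75)
sum(w * a[1:2]) # 1.75, matching the second value of rolling_sum above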
Given a by column <t_0, t_1, ..., t_n>, then closed = "right" (the default) means the windows will be:

(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
expr__rolling_sum_by(
  by,
  window_size,
  ...,
  min_periods = 1,
  closed = c("right", "both", "left", "none")
)
by |
Should be DateTime, Date, UInt64, UInt32, Int64, or Int32 data
type after conversion by |
window_size |
The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language:

- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 calendar day)
- 1w (1 calendar week)
- 1mo (1 calendar month)
- 1q (1 calendar quarter)
- 1y (1 calendar year)

Or combine them: "3d12h4m25s", i.e. 3 days, 12 hours, 4 minutes, and 25 seconds. By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year". |
min_periods |
The number of values in the window that should be
non-null before computing a result. |
closed |
Define which sides of the interval are closed (inclusive).
Default is "right". |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), as this method can cache the window size computation.
A polars expression
df_temporal <- pl$select(
  index = 0:24,
  date = pl$datetime_range(
    as.POSIXct("2001-01-01"),
    as.POSIXct("2001-01-02"),
    "1h"
  )
)

# Compute the rolling sum with the temporal windows closed on the right
# (default)
df_temporal$with_columns(
  rolling_row_sum = pl$col("index")$rolling_sum_by(
    "date",
    window_size = "2h"
  )
)

# Compute the rolling sum with the closure of windows on both sides
df_temporal$with_columns(
  rolling_row_sum = pl$col("index")$rolling_sum_by(
    "date",
    window_size = "2h",
    closed = "both"
  )
)
A window of length window_size will traverse the array. The values that fill this window will (optionally) be multiplied with the weights given by the weights vector. The resulting values will be aggregated. The window at a given row will include the row itself, and the window_size - 1 elements before it.
expr__rolling_var(
  window_size,
  weights = NULL,
  ...,
  min_periods = NULL,
  center = FALSE,
  ddof = 1
)
window_size |
The length of the window in number of elements. |
weights |
An optional slice with the same length as the window that will be multiplied elementwise with the values in the window. |
min_periods |
The number of values in the window that should be
non-null before computing a result. If NULL, it will be set equal to window_size. |
center |
If TRUE, set the labels at the center of the window. |
ddof |
"Delta Degrees of Freedom": the divisor used in the calculation is N - ddof, where N is the number of elements in the window. |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), as this method can cache the window size computation.
A polars expression
df <- pl$DataFrame(a = 1:6)
df$with_columns(
  rolling_var = pl$col("a")$rolling_var(window_size = 2)
)

# Specify weights to multiply the values in the window with:
df$with_columns(
  rolling_var = pl$col("a")$rolling_var(
    window_size = 2,
    weights = c(0.25, 0.75)
  )
)

# Center the values in the window
df$with_columns(
  rolling_var = pl$col("a")$rolling_var(window_size = 3, center = TRUE)
)
Given a by column <t_0, t_1, ..., t_n>, then closed = "right" (the default) means the windows will be:

(t_0 - window_size, t_0]
(t_1 - window_size, t_1]
…
(t_n - window_size, t_n]
expr__rolling_var_by(
  by,
  window_size,
  ...,
  min_periods = 1,
  closed = c("right", "both", "left", "none"),
  ddof = 1
)
by |
Should be DateTime, Date, UInt64, UInt32, Int64, or Int32 data
type after conversion by |
window_size |
The length of the window. Can be a dynamic temporal size indicated by a timedelta or the following string language:

- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 calendar day)
- 1w (1 calendar week)
- 1mo (1 calendar month)
- 1q (1 calendar quarter)
- 1y (1 calendar year)

Or combine them: "3d12h4m25s", i.e. 3 days, 12 hours, 4 minutes, and 25 seconds. By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year". |
min_periods |
The number of values in the window that should be
non-null before computing a result. |
closed |
Define which sides of the interval are closed (inclusive).
Default is "right". |
ddof |
"Delta Degrees of Freedom": the divisor used in the calculation is N - ddof, where N is the number of elements in the window. |
If you want to compute multiple aggregation statistics over the same dynamic window, consider using $rolling(), as this method can cache the window size computation.
A polars expression
df_temporal <- pl$select(
  index = 0:24,
  date = pl$datetime_range(
    as.POSIXct("2001-01-01"),
    as.POSIXct("2001-01-02"),
    "1h"
  )
)

# Compute the rolling var with the temporal windows closed on the right
# (default)
df_temporal$with_columns(
  rolling_row_var = pl$col("index")$rolling_var_by(
    "date",
    window_size = "2h"
  )
)

# Compute the rolling var with the closure of windows on both sides
df_temporal$with_columns(
  rolling_row_var = pl$col("index")$rolling_var_by(
    "date",
    window_size = "2h",
    closed = "both"
  )
)
Round underlying floating point data to the given number of decimal digits
expr__round(decimals)
decimals |
Number of decimals to round by. |
A polars expression
df <- pl$DataFrame(a = c(0.33, 0.52, 1.02, 1.17))
df$with_columns(
  rounded = pl$col("a")$round(1)
)
Round to a number of significant figures
expr__round_sig_figs(digits)
digits |
Number of significant figures to round to. |
A polars expression
df <- pl$DataFrame(a = c(0.01234, 3.333, 1234))
df$with_columns(
  rounded = pl$col("a")$round_sig_figs(2)
)
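For intuition, base R's signif() performs the same significant-figures rounding and can be used to cross-check the output:

# Cross-check with base R:
signif(c(0.01234, 3.333, 1234), 2) # 0.012 3.3 1200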
Sample from this expression
expr__sample(
  n = NULL,
  ...,
  fraction = NULL,
  with_replacement = FALSE,
  shuffle = FALSE,
  seed = NULL
)
n |
Number of items to return. Cannot be used with fraction. |
... |
These dots are for future extensions and must be empty. |
fraction |
Fraction of items to return. Cannot be used with n. |
with_replacement |
Allow values to be sampled more than once. |
shuffle |
Shuffle the order of sampled data points. |
seed |
Seed for the random number generator. If NULL (default), a random seed is generated each time. |
A polars expression
df <- pl$DataFrame(a = 1:3)
df$select(pl$col("a")$sample(
  fraction = 1,
  with_replacement = TRUE,
  seed = 1
))
Find indices where elements should be inserted to maintain order.
expr__search_sorted(element, side = c("any", "left", "right"))
element |
Expression or scalar value. |
side |
Must be one of the following: "any" (the index of any suitable location is given), "left" (the index of the leftmost suitable location is given), "right" (the index of the rightmost suitable location is given). |
A polars expression
df <- pl$DataFrame(values = c(1, 2, 3, 5))
df$select(
  zero = pl$col("values")$search_sorted(0),
  three = pl$col("values")$search_sorted(3),
  six = pl$col("values")$search_sorted(6)
)
Enables downstream code to use fast paths for sorted arrays.
Warning: this can lead to incorrect results if the data is NOT sorted! Use with care!
expr__set_sorted(..., descending = FALSE)
... |
These dots are for future extensions and must be empty. |
descending |
Whether the Series order is descending. |
A polars expression
df <- pl$DataFrame(a = 1:3)
df$select(pl$col("a")$set_sorted()$max())
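A hedged illustration of why the warning matters: the sorted flag is a promise, not a check, so flagging unsorted data as sorted can let fast paths return wrong answers.

# Illustrative only: on unsorted data, max() may take a fast path
# (e.g. reading the last element) and silently return the wrong value.
df_bad <- pl$DataFrame(a = c(3L, 1L, 2L))
df_bad$select(pl$col("a")$set_sorted()$max()) # may return 2 instead of 3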
Shift values by the given number of indices
expr__shift(n = 1, ..., fill_value = NULL)
n |
Number of indices to shift forward. If a negative value is passed, values are shifted in the opposite direction instead. |
... |
These dots are for future extensions and must be empty. |
fill_value |
Fill the resulting null values with this value. |
A polars expression
# By default, values are shifted forward by one index.
df <- pl$DataFrame(a = 1:4)
df$with_columns(shift = pl$col("a")$shift())

# Pass a negative value to shift in the opposite direction instead.
df$with_columns(shift = pl$col("a")$shift(-2))

# Specify fill_value to fill the resulting null values.
df$with_columns(shift = pl$col("a")$shift(-2, fill_value = 100))
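For readers coming from base R, the default shift corresponds to a one-step lag; the following equivalence is illustrative base R, not polars API:

# Base R analogue of pl$col("a")$shift(): prepend NA and drop the last value.
a <- 1:4
c(NA, a[-length(a)]) # NA 1 2 3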
Shrink to the dtype needed to fit the extrema of this Series. This can be used to reduce memory pressure.
expr__shrink_dtype()
A polars expression
df <- pl$DataFrame(a = c(-112, 2, 112))$cast(pl$Int64)
df$with_columns(
  shrunk = pl$col("a")$shrink_dtype()
)
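To see the effect, compare the column dtypes before and after; this sketch assumes the $schema accessor on data frames behaves as elsewhere in this package:

# "a" stays Int64, while "shrunk" should fit the extrema -112..112 in a
# smaller integer type (Int8 for this range).
df$with_columns(shrunk = pl$col("a")$shrink_dtype())$schema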
Note that this is shuffled independently of any other column or Expression. If you want whole rows to be kept together, use df$sample(shuffle = TRUE) instead.
expr__shuffle(seed = NULL)
seed |
Integer indicating the seed for the random number generator. If
NULL (default), a random seed is generated each time. |
A polars expression
df <- pl$DataFrame(a = 1:3)
df$with_columns(
  shuffled = pl$col("a")$shuffle(seed = 1)
)
This returns -1 if x is less than 0, 0 if x == 0, and 1 if x is greater than 0.
expr__sign()
A polars expression
df <- pl$DataFrame(a = c(-9, 0, 0, 4, NA))
df$with_columns(sign = pl$col("a")$sign())
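Base R's sign() follows the same convention, which makes cross-checking easy:

# Same -1 / 0 / 1 convention as the polars expression above.
sign(c(-9, 0, 0, 4, NA)) # -1 0 0 1 NA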
Compute sine
expr__sin()
A polars expression
pl$DataFrame(a = c(0, pi / 2, pi, NA))$
  with_columns(sine = pl$col("a")$sin())