Series.cast/2 silently converts invalid values to nil #1136

Description

@thbar

While working on large CSV aggregations (currently more than 1.8k input files, all grouped into the same destination dataset), it took me a while to realize the following: load_csv! raises as soon as it meets a value that does not match the provided dtype, whereas cast silently converts such a value to nil and continues (which is also useful, of course).

Here is an up-to-date reproduction to showcase this:

Mix.install([
  {:explorer, "~> 0.11.1"}
])

ExUnit.start()

defmodule Repro do
  use ExUnit.Case
  alias Explorer.DataFrame, as: DF

  @incoming_data "field\nABC\n12.4"

  test "loading from CSV seems strict" do
    assert_raise RuntimeError, ~r/could not parse `ABC` as dtype `f64` at column 'field'/, fn ->
      DF.load_csv!(@incoming_data, dtypes: [{:field, {:f, 64}}])
    end
  end

  test "but casting from string, not strict" do
    result =
      @incoming_data
      |> DF.load_csv!(dtypes: [{:field, :string}])
      |> DF.mutate_with(fn df ->
        [field: Explorer.Series.cast(df[:field], {:f, 64})]
      end)
      |> Access.get(:field)
      |> Explorer.Series.to_list()

    # non-castable data has been silently translated to `nil`,
    # which can easily catch you off guard
    assert result == [nil, 12.4]
  end
end

Current notes from my exploration

  • Polars has both strict & non-strict ways of doing things
  • the Explorer codebase uses strict_cast in at least two places
  • but it does not currently expose strictness as an option to the end user
  • I could not find (so far) any mention of this silent behaviour in the cast documentation
  • my understanding is that exposing this could be a bit involved (not to mention defaulting to strict if we wanted to)
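
In the meantime, a strict variant can be approximated in user land. The following is only a minimal sketch: `StrictCast` and `cast!/2` are hypothetical names that are not part of Explorer, and it assumes that an eager `Explorer.Series.cast/2` turns non-castable strings into nil in the same way as the lazy cast in the reproduction above. It casts, then checks whether any value that was non-nil before the cast has become nil afterwards.

```elixir
Mix.install([
  {:explorer, "~> 0.11.1"}
])

defmodule StrictCast do
  # Hypothetical helper (not part of Explorer): casts like
  # Explorer.Series.cast/2, but raises if the cast turned any
  # previously non-nil value into nil.
  alias Explorer.Series

  def cast!(series, dtype) do
    casted = Series.cast(series, dtype)

    # true where the input had a value but the cast result is nil
    lost = Series.and(Series.is_nil(casted), Series.is_not_nil(series))

    # keep only the original values that could not be cast
    offending = series |> Series.mask(lost) |> Series.to_list()

    if offending != [] do
      sample = Enum.take(offending, 5)

      raise ArgumentError,
            "could not cast #{length(offending)} value(s) to " <>
              "#{inspect(dtype)}, for example: #{inspect(sample)}"
    end

    casted
  end
end

alias Explorer.Series

# All values are castable, so this returns the casted series.
good = Series.from_list(["1.5", "12.4"])
IO.inspect(Series.to_list(StrictCast.cast!(good, {:f, 64})))

# "ABC" is not castable, so this raises instead of yielding nil.
bad = Series.from_list(["ABC", "12.4"])

try do
  StrictCast.cast!(bad, {:f, 64})
rescue
  e in ArgumentError -> IO.puts(Exception.message(e))
end
```

Values that were already nil in the input are left alone, since only newly introduced nils count as failures. This check runs on eager series only and costs an extra pass over the data, so it is a workaround rather than a replacement for a strict option in the library itself.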

I thought it would be useful to open a discussion on this, since it could very easily catch other people off guard (especially in Elixir, where things are usually stricter and more typing is being introduced).
