Inconsistencies with the `==` operator in R

I found a bug with the == operator in R!

Published: August 6, 2019

One of the cool things about working on gradethis (grader needed to be renamed) is that we end up doing things that aren’t common in R (i.e., grading and comparing code).

I discovered an inconsistency with the == operator when comparing (long) R expressions.

A quick primer on expressions

In R, you can create an expression using the quote() function. An expression is essentially the code that R will execute, captured without being run. It looks like the “string” of code you would type, but an actual string in R just evaluates to a string, not to a command or set of instructions that R can execute.

Compare:

3 + 3

[1] 6

which returns the evaluated result of 3 + 3, and

"3 + 3"

[1] "3 + 3"

which will return the string "3 + 3"

with:

quote(3 + 3)

3 + 3

which returns the expression 3 + 3 that is the instruction to R without actually evaluating it.

If we want to evaluate the expression, we can call eval().

eval(quote(3 + 3))

[1] 6

You can read more about expressions in the Expressions Chapter in Advanced R.
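To make the distinction concrete, here is a small sketch (not one of the original examples) that inspects a quoted expression using only base R. It shows that a quoted call is a code object rather than a string, and that it can be taken apart like a list, with the function first and its arguments after.

```r
e <- quote(3 + 3)

class(e)        # "call": code, not a character string
class("3 + 3")  # "character"

# A call can be taken apart like a list: the function first, then its arguments
as.list(e)

# deparse() turns the expression back into text; eval() runs it
deparse(e)      # "3 + 3"
eval(e)         # 6
```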

The “bug”

The bug was detected in gradethis, where we want to compare student-submitted code with the instructor solution. There are multiple steps in the comparison process, but the first step is to simply check whether the two bits of code are the same. If they are, we can stop there and skip the work of detecting where the actual differences are.

The comparison code was originally written to use == to compare the expressions.

user <- quote(3 + 3)
solution <- quote(3 + 3)

user == solution

[1] TRUE

Garrett Grolemund put in a bunch of examples that showed some weird behaviour. I initially thought it had to do with namespacing the function name (tidyr::), or with using the : notation to select columns in a dataframe via tidyselect.

When the two expressions are the same, we get TRUE as expected:

# supposed to return TRUE
u <- quote(tidyr::gather(key = key, value = value, new_sp_m014:newrel_f65, na.rm = TRUE))
s <- quote(tidyr::gather(key = key, value = value, new_sp_m014:newrel_f65, na.rm = TRUE))
u == s

[1] TRUE

But when we change the value of na.rm, we still get TRUE even though the expressions are no longer the same.

# supposed to return FALSE
u <- quote(tidyr::gather(key = key, value = value, new_sp_m014:newrel_f65, na.rm = TRUE))
s <- quote(tidyr::gather(key = key, value = value, new_sp_m014:newrel_f65, na.rm = FALSE))
u == s

[1] TRUE

But if we get rid of the tidyselect column selector, we seem to get the correct result.

# If we remove the third argument the error goes away
u <- quote(tidyr::gather(key = key, value = value, na.rm = TRUE))
s <- quote(tidyr::gather(key = key, value = value, na.rm = FALSE))
u == s

[1] FALSE

I brought this up at one of our daily shiny-core stand-ups, and Winston Chang thought it might have something to do with the deparse function, since it doesn’t actually matter what the expressions being compared are.

u <- quote(f(x123456789012345678901234567890123456789012345678901234567890, 1))
s <- quote(f(x123456789012345678901234567890123456789012345678901234567890, 2))
u == s

[1] TRUE

You can see Winston’s comment and link to R code in question here.

Pretty much, when == is used to compare expressions, the expressions are first passed through deparse. When deparse is run on a long expression, it breaks the text up into a character vector whose elements are about 60 characters wide (60L is the cutoff), which is fine. The R bug is that the comparison is only performed on the first element of that vector. That’s why differences near the end of a long expression seem to “not matter”.
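The mechanism can be imitated at the R level with deparse() directly. This is only a sketch of the behaviour described above, not the actual C code behind ==; the names du and ds are made up for illustration.

```r
u <- quote(f(x123456789012345678901234567890123456789012345678901234567890, 1))
s <- quote(f(x123456789012345678901234567890123456789012345678901234567890, 2))

# deparse() uses width.cutoff = 60L by default, so a long call is split
# across several strings
du <- deparse(u)
ds <- deparse(s)

length(du)  # more than one element

# Looking only at the first chunk misses the difference in the last argument
du[1] == ds[1]

# Comparing every chunk sees the difference
identical(du, ds)
```

The first argument alone already pushes the line past the cutoff, so the second argument (the only part that differs) lands in a later element that a first-element comparison never looks at.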

Reporting the bug

I reported the findings to the r-devel mailing list where, even after botching my first listserv submission, I got a response from Martin Maechler (R-core):

Looking at that and its context, I think we (R core) should reconsider that implementation of ‘==’ which indeed does about the same thing as deparse {which also truncates at some point by default; something very very reasonable for error messages, but undesirable in other cases}.

But I think it’s fair expectation that comparing calls [“language”] with ‘==’ should compare the full call’s syntax even if that may occasionally be very long.

So it is actually a behavior that will get patched one day.

The fix

We ended up changing gradethis to use identical() when comparing quoted expressions.

u <- quote(f(x123456789012345678901234567890123456789012345678901234567890, 1))
s <- quote(f(x123456789012345678901234567890123456789012345678901234567890, 2))
identical(u, s)

[1] FALSE

Using identical() is a much better choice when we are comparing code and results. == compares element by element, so comparing 2 dataframes returns a matrix of logical values, and collapsing that result with all() has problems when there are missing (NA) values.
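As a quick illustration of the dataframe case, here is a minimal sketch with two made-up dataframes (df_user and df_solution are hypothetical names, not objects from gradethis).

```r
df_user <- data.frame(x = c(1, 2, NA), y = c("a", "b", "c"))
df_solution <- data.frame(x = c(1, 2, NA), y = c("a", "b", "c"))

# == works cell by cell and gives back a logical matrix, not a single answer
res <- df_user == df_solution
is.matrix(res)  # TRUE
dim(res)        # 3 rows, 2 columns

# NA == NA is NA, so collapsing with all() does not give a clean TRUE
all(res)        # NA

# identical() answers the question we actually care about
identical(df_user, df_solution)  # TRUE
```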

We want to check whether the 2 vectors are the same:

u <- c(1, 2, 3)
s <- c(1, 2, NA)
all(u == s)

[1] NA

We can remove missing values, but then an NA in either the user code or the solution code gets ignored.

u <- c(1, 2, 3)
s <- c(1, 2, NA)
all(u == s, na.rm = TRUE)

[1] TRUE

u <- c(1, 2, NA)
s <- c(1, 2, 3)
all(u == s, na.rm = TRUE)

[1] TRUE

Now, we nudge toward using identical and raise a warning when we detect ==.

u <- c(1, 2, NA)
s <- c(1, 2, 3)
identical(u, s)

[1] FALSE
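One way such a check could look is sketched below. This is a hypothetical illustration and not the actual gradethis implementation: the helper names uses_double_equals() and check_code() are invented here, and the sketch relies only on base R's all.names(), which lists every name (including function names) that appears in an expression.

```r
# Hypothetical helper, not part of gradethis:
# TRUE if `==` is called anywhere inside a quoted expression
uses_double_equals <- function(expr) {
  "==" %in% all.names(expr)
}

# Hypothetical wrapper: warn and nudge toward identical()
check_code <- function(expr) {
  if (uses_double_equals(expr)) {
    warning("`==` detected; consider identical() when comparing objects")
  }
  invisible(expr)
}

uses_double_equals(quote(all(u == s)))      # TRUE
uses_double_equals(quote(identical(u, s)))  # FALSE
```

Because all.names() walks the whole expression, the check also catches == nested inside other calls, such as all(u == s, na.rm = TRUE).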

Does Donald Knuth owe me a dollar now?