Skip to content
Snippets Groups Projects
Commit b94fb284 authored by Nattavut Sutyanyong's avatar Nattavut Sutyanyong Committed by Herman van Hovell
Browse files

[SPARK-19017][SQL] NOT IN subquery with more than one column may return incorrect results


## What changes were proposed in this pull request?

This PR fixes the code in Optimizer phase where the NULL-aware expression of a NOT IN query is expanded in Rule `RewritePredicateSubquery`.

Example:
The query

 select a1,b1
 from   t1
 where  (a1,b1) not in (select a2,b2
                        from   t2);

has the (a1, b1) = (a2, b2) rewritten from (before this fix):

Join LeftAnti, ((isnull((_1#2 = a2#16)) || isnull((_2#3 = b2#17))) || ((_1#2 = a2#16) && (_2#3 = b2#17)))

to (after this fix):

Join LeftAnti, (((_1#2 = a2#16) || isnull((_1#2 = a2#16))) && ((_2#3 = b2#17) || isnull((_2#3 = b2#17))))

## How was this patch tested?

sql/test, catalyst/test and new test cases in SQLQueryTestSuite.

Author: Nattavut Sutyanyong <nsy.can@gmail.com>

Closes #16467 from nsyca/19017.

(cherry picked from commit cdb691eb)
Signed-off-by: default avatarHerman van Hovell <hvanhovell@databricks.com>
parent d128b6a3
No related branches found
No related tags found
No related merge requests found
Loading
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment