    f2faa7af
    [SPARK-7251] Perform sequential scan when iterating over BytesToBytesMap · f2faa7af
    Josh Rosen authored
    This patch modifies `BytesToBytesMap.iterator()` to iterate through records in the order that they appear in the data pages rather than iterating through the hashtable pointer arrays. This results in fewer random memory accesses, significantly improving performance for scan-and-copy operations.
    
    This is possible because our data pages are laid out as sequences of `[keyLength][data][valueLength][data]` entries. To mark the end of a partially-filled data page, we write `-1` as a special end-of-page length (BytesToBytesMap supports empty/zero-length keys and values, so zero cannot serve as the marker and a negative length is used instead).
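
The page layout described above can be illustrated with a minimal, self-contained sketch. It is not the actual `BytesToBytesMap` code: the class `PageScanSketch`, its methods, the use of `ByteBuffer` as a stand-in for a data page, and the 4-byte length fields are all assumptions made for illustration. It writes `[keyLength][data][valueLength][data]` entries followed by a `-1` end-of-page length, then reads them back in the order they were written.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; names and layout details are assumptions.
final class PageScanSketch {
    // Negative length that marks the end of a partially filled page.
    static final int END_OF_PAGE = -1;

    // Writes each {key, value} pair as [keyLength][key][valueLength][value],
    // then appends the end-of-page marker if at least 4 bytes remain.
    static ByteBuffer writePage(int pageSize, String[][] entries) {
        ByteBuffer page = ByteBuffer.allocate(pageSize);
        for (String[] entry : entries) {
            byte[] key = entry[0].getBytes(StandardCharsets.UTF_8);
            byte[] value = entry[1].getBytes(StandardCharsets.UTF_8);
            page.putInt(key.length).put(key).putInt(value.length).put(value);
        }
        if (page.remaining() >= Integer.BYTES) {
            page.putInt(END_OF_PAGE);
        }
        // Reset the position but keep the limit at the page capacity, so the
        // unused (zero-filled) tail of the page stays visible to the reader.
        page.rewind();
        return page;
    }

    // Reads entries front to back and stops at the marker or the page end.
    static List<String> scan(ByteBuffer page) {
        List<String> records = new ArrayList<>();
        ByteBuffer in = page.duplicate();
        in.rewind();
        while (in.remaining() >= Integer.BYTES) {
            int keyLength = in.getInt();
            if (keyLength == END_OF_PAGE) {
                break; // the rest of the page holds no records
            }
            byte[] key = new byte[keyLength];
            in.get(key);
            int valueLength = in.getInt();
            byte[] value = new byte[valueLength];
            in.get(value);
            records.add(new String(key, StandardCharsets.UTF_8)
                + "=" + new String(value, StandardCharsets.UTF_8));
        }
        return records;
    }

    public static void main(String[] args) {
        String[][] entries = {{"a", "1"}, {"", ""}, {"key", "value"}};
        System.out.println(scan(writePage(64, entries)));
    }
}
```

Because the unused tail of the page is zero-filled in this sketch, a reader without the marker would interpret those zeros as a run of zero-length keys and values. This is why a zero length cannot double as the end-of-page signal when empty keys and values are legal.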
    
    This patch incorporates / closes #5836.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes #6159 from JoshRosen/SPARK-7251 and squashes the following commits:
    
    05bd90a [Josh Rosen] Compare capacity, not size, to MAX_CAPACITY
    2a20d71 [Josh Rosen] Fix maximum BytesToBytesMap capacity
    bc4854b [Josh Rosen] Guard against overflow when growing BytesToBytesMap
    f5feadf [Josh Rosen] Add test for iterating over an empty map
    273b842 [Josh Rosen] [SPARK-7251] Perform sequential scan when iterating over entries in BytesToBytesMap