  1. Dec 04, 2015
    • [SPARK-6990][BUILD] Add Java linting script; fix minor warnings · d0d82227
      Dmitry Erastov authored
      This replaces https://github.com/apache/spark/pull/9696
      
      Invoke Checkstyle and print any errors to the console, failing the step.
      Use Google's style rules modified according to
      https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
      Some important checks are disabled (see TODOs in `checkstyle.xml`) due to
      multiple violations being present in the codebase.
      
      I suggest fixing those TODOs in separate PRs.
      
      More on Checkstyle can be found on the [official website](http://checkstyle.sourceforge.net/).
      
      Sample output (from [build 46345](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46345/consoleFull)) (the errors are duplicated because I ran the build twice with different profiles):
      
      > Checkstyle checks failed at following occurrences:
      > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
      > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
      > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
      > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
      > [error] running /home/jenkins/workspace/SparkPullRequestBuilder2/dev/lint-java ; received return code 1
      
      Also fix some of the minor violations that didn't require sweeping changes.
      
      Apologies for the previous botched PRs - I finally figured out the issue.
      
      cr: JoshRosen, pwendell
      
      > I state that the contribution is my original work, and I license the work to the project under the project's open source license.
      
      Author: Dmitry Erastov <derastov@gmail.com>
      
      Closes #9867 from dskrvk/master.
  2. Dec 01, 2015
    • [SPARK-12030] Fix Platform.copyMemory to handle overlapping regions. · 2cef1cdf
      Nong Li authored
      This bug was exposed as memory corruption in Timsort, which uses copyMemory to copy
      large regions that can overlap. The prior implementation always copied forward, which
      corrupts the data whenever the destination overlaps and follows the source (about half
      of the overlapping cases).
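
      To make the fix concrete, here is a minimal sketch of a direction-aware overlapping copy over a plain byte[] (illustrative only: the real fix operates on Platform/Unsafe memory and copies in larger chunks):

      ```java
      // Hedged sketch: memmove-style copy that picks a direction based on
      // whether the destination precedes or follows the source within one buffer.
      public final class OverlapSafeCopy {
          static void copy(byte[] buf, int srcOff, int dstOff, int length) {
              if (dstOff < srcOff) {
                  // Destination precedes source: a forward copy never overwrites
                  // bytes that still need to be read.
                  for (int i = 0; i < length; i++) buf[dstOff + i] = buf[srcOff + i];
              } else {
                  // Destination follows source: copy backward so overlapping bytes
                  // are read before they are overwritten.
                  for (int i = length - 1; i >= 0; i--) buf[dstOff + i] = buf[srcOff + i];
              }
          }

          public static void main(String[] args) {
              byte[] b = {1, 2, 3, 4, 5, 6};
              copy(b, 0, 2, 4); // overlapping regions: [0,4) -> [2,6)
              System.out.println(java.util.Arrays.toString(b)); // [1, 2, 1, 2, 3, 4]
          }
      }
      ```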
      
      Author: Nong Li <nong@databricks.com>
      
      Closes #10068 from nongli/spark-12030.
  3. Nov 05, 2015
    • [SPARK-7542][SQL] Support off-heap index/sort buffer · eec74ba8
      Davies Liu authored
      This adds support for off-heap memory for the arrays inside BytesToBytesMap and InMemorySorter, so that all execution memory can be allocated off-heap.
      
      Closes #8068
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9477 from davies/unsafe_timsort.
  4. Nov 04, 2015
    • [SPARK-11493] remove bitset from BytesToBytesMap · 1b6a5d4a
      Davies Liu authored
      Since each page begins with 4 bytes holding its number of records, a record address can never be zero, so we do not need the bitset: a zero entry in the long array already means the slot is empty.
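
      A small sketch of that invariant, assuming the long array stores (address, hashcode) pairs as in BytesToBytesMap (illustrative, not the exact layout):

      ```java
      // Hedged sketch: because every data page starts with a 4-byte record count,
      // a packed record address can never be 0, so 0 itself marks an empty slot.
      final class SlotProbe {
          static final long EMPTY = 0L;

          // Address for slot `pos` lives at 2*pos; its hashcode at 2*pos + 1.
          static boolean isDefined(long[] longArray, int pos) {
              return longArray[2 * pos] != EMPTY; // replaces the old bitset test
          }
      }
      ```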
      
      As for performance, the bitset could speed up failed lookups when a slot is empty (the bitset is smaller than the long array, so its cache-hit rate is higher). In practice the map is 35%–70% full (call it 50% on average), so only half of the failed lookups can benefit from it; all the others pay the cost of loading the bitset and still need to access the long array anyway.
      
      For aggregation, we always need to access the long array (we insert a new key after a failed lookup), which a benchmark also confirmed.
      
      For broadcast hash join there could be a regression, but a simple benchmark suggests there may not be one (most lookups are failed lookups):
      
      ```
      import time

      sqlContext.range(1 << 20).write.parquet("small")
      df = sqlContext.read.parquet("small")
      for i in range(3):
          t = time.time()
          df2 = sqlContext.range(1 << 26).selectExpr("id * 1111111111 % 987654321 as id2")
          df2.join(df, df.id == df2.id2).count()
          print time.time() - t
      ```
      
      With the bitset (time in seconds):
      ```
      17.5404241085
      10.2758829594
      10.5786800385
      ```
      After removing the bitset (time in seconds):
      ```
      21.8939979076
      12.4132959843
      9.97224712372
      ```
      
      cc rxin nongli
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9452 from davies/remove_bitset.
  5. Oct 30, 2015
    • [SPARK-10342] [SPARK-10309] [SPARK-10474] [SPARK-10929] [SQL] Cooperative memory management · 56419cf1
      Davies Liu authored
      This PR introduces a mechanism to call spill() on SQL operators that support spilling (for example BytesToBytesMap, UnsafeExternalSorter and ShuffleExternalSorter) when there is not enough memory for execution. The preserved first page is not needed anymore, so it was removed.
      
      Other Spillable objects in Spark core (ExternalSorter and AppendOnlyMap) are not included in this PR, but they could benefit from this mechanism (having other operators trigger their spilling).
      
      The PrepareRDD may not be needed anymore and could be removed in a follow-up PR.
      
      The following script fails with OOM before this PR; with it, the script finishes in 150 seconds with a 2G heap (it also works on the 1.5 branch, with a similar duration).
      
      ```python
      sqlContext.setConf("spark.sql.shuffle.partitions", "1")
      df = sqlContext.range(1<<25).selectExpr("id", "repeat(id, 2) as s")
      df2 = df.select(df.id.alias('id2'), df.s.alias('s2'))
      j = df.join(df2, df.id == df2.id2).groupBy(df.id).max("id", "id2")
      j.explain()
      print j.count()
      ```
      
      Regarding thread-safety, here is what I've got:
      
      1) Without calling spill(), the operators are only used by a single thread, so there are no safety problems.
      
      2) spill() can be triggered in two ways: by the operator itself, or by other operators. We can check trigger == this in spill(); in that case we are still on the same thread, so there are no safety problems (see the sketch after this list).
      
      3) If it's triggered by other operators (right now, caching will not trigger spill()), we only spill data to disk during the scanning stage (after building has finished), so the in-memory sorter and memory pages are read-only; we only need to synchronize the iterator and update it.
      
      4) During scanning, the iterator only uses one record in one page at a time, and we can't free that page because the downstream may still be using it (through an UnsafeRow or other objects). In BytesToBytesMap, we just skip the current page and dump all the others to disk. In UnsafeExternalSorter, we keep the page used by the current record (the one with the same baseObject) and free it when loading the next record. In ShuffleExternalSorter, spill() is never triggered during scanning.
      
      5) To avoid deadlock, we don't call acquireMemory during spill (so we reuse the pointer array in InMemorySorter).
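
      A minimal sketch of the trigger == this check from (2) and (3), with illustrative names rather than Spark's exact MemoryConsumer API:

      ```java
      // Hedged sketch: spill() distinguishes self-triggered spills (same thread,
      // safe to spill everything) from spills requested by other operators.
      abstract class SpillableOperator {
          private final Object lock = new Object();

          long spill(long requestedBytes, SpillableOperator trigger) throws java.io.IOException {
              if (trigger == this) {
                  // Self-triggered: we are on the operator's own thread.
                  return spillEverything();
              }
              synchronized (lock) {
                  // Triggered by another operator: only read-only state (the
                  // scanning stage) may be spilled, with the iterator synchronized.
                  return spillReadOnlyState();
              }
          }

          abstract long spillEverything() throws java.io.IOException;
          abstract long spillReadOnlyState() throws java.io.IOException;
      }
      ```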
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9241 from davies/force_spill.
  6. Oct 25, 2015
    • [SPARK-10984] Simplify *MemoryManager class structure · 85e654c5
      Josh Rosen authored
      This patch refactors the MemoryManager class structure. After #9000, Spark had the following classes:
      
      - MemoryManager
      - StaticMemoryManager
      - ExecutorMemoryManager
      - TaskMemoryManager
      - ShuffleMemoryManager
      
      This is fairly confusing. To simplify things, this patch consolidates several of these classes:
      
      - ShuffleMemoryManager and ExecutorMemoryManager were merged into MemoryManager.
      - TaskMemoryManager is moved into Spark Core.
      
      **Key changes and tasks**:
      
      - [x] Merge ExecutorMemoryManager into MemoryManager.
        - [x] Move pooling logic into Allocator.
      - [x] Move TaskMemoryManager from `spark-unsafe` to `spark-core`.
      - [x] Refactor the existing Tungsten TaskMemoryManager interactions so Tungsten code uses only TaskMemoryManager, not both it and ShuffleMemoryManager.
      - [x] Refactor non-Tungsten code to use the TaskMemoryManager instead of ShuffleMemoryManager.
      - [x] Merge ShuffleMemoryManager into MemoryManager.
        - [x] Move code
        - [x] ~~Simplify 1/n calculation.~~ **Will defer to followup, since this needs more work.**
      - [x] Port ShuffleMemoryManagerSuite tests.
      - [x] Move classes from `unsafe` package to `memory` package.
      - [ ] Figure out how to handle the hacky use of the memory managers in HashedRelation's broadcast variable construction.
      - [x] Test porting and cleanup: several tests relied on mock functionality (such as `TestShuffleMemoryManager.markAsOutOfMemory`) which has been changed or broken during the memory manager consolidation
        - [x] AbstractBytesToBytesMapSuite
        - [x] UnsafeExternalSorterSuite
        - [x] UnsafeFixedWidthAggregationMapSuite
        - [x] UnsafeKVExternalSorterSuite
      
      **Compatibility notes**:
      
      - This patch introduces breaking changes in `ExternalAppendOnlyMap`, which is marked as `DeveloperApi` (likely for legacy reasons): this class can no longer be used outside of a task.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9127 from JoshRosen/SPARK-10984.
  7. Oct 12, 2015
    • [SPARK-10990] [SPARK-11018] [SQL] improve unrolling of complex types · c4da5345
      Davies Liu authored
      This PR improves the unrolling and reading of complex types in the columnar cache:
      1) Use UnsafeProjection to serialize complex types, so they are not serialized three times (two of those for actualSize)
      2) Copy the bytes from UnsafeRow/UnsafeArrayData to the ByteBuffer directly, avoiding the intermediate byte[]
      3) Use the underlying array in the ByteBuffer to create UTF8String/UnsafeRow/UnsafeArrayData without copying.
      
      Combining these optimizations, we can reduce the unrolling time from 25s to 21s (16% less) and the scanning time from 3.5s to 2.5s (28% less).
      
      ```
      import time

      df = sqlContext.read.parquet(path)
      t = time.time()
      df.cache()
      df.count()
      print 'unrolling', time.time() - t

      for i in range(10):
          t = time.time()
          print df.select("*")._jdf.queryExecution().toRdd().count()
          print time.time() - t
      ```
      
      The schema is
      ```
      root
       |-- a: struct (nullable = true)
       |    |-- b: long (nullable = true)
       |    |-- c: string (nullable = true)
       |-- d: array (nullable = true)
       |    |-- element: long (containsNull = true)
       |-- e: map (nullable = true)
       |    |-- key: long
       |    |-- value: string (valueContainsNull = true)
      ```
      
      The columnar cache now depends on UnsafeProjection supporting all the data types (including UDTs), so this PR also fixes that.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9016 from davies/complex2.
  8. Aug 18, 2015
    • [SPARK-10095] [SQL] use public API of BigInteger · 270ee677
      Davies Liu authored
      In UnsafeRow, we used a private field of BigInteger for better performance, but it actually didn't contribute much (3% in one benchmark) to end-to-end runtime, and it made the code non-portable (it may fail on other JVM implementations).
      
      So we should use the public API instead.
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8286 from davies/portable_decimal.
  9. Aug 14, 2015
    • [SPARK-9946] [SPARK-9589] [SQL] fix NPE and thread-safety in TaskMemoryManager · 3bc55287
      Davies Liu authored
      Currently, we access `page.pageNumber` after the page is freed; since it can be modified by another thread, this could cause an NPE.
      
      The same TaskMemoryManager can be used by multiple threads (for example, Python UDF and TransportScript), so allocating and freeing memory and pages must be thread-safe. The underlying BitSet and HashSet are not thread-safe, so we should put accesses to them inside a synchronized block.
      
      cc JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8177 from davies/memory_manager.
  10. Aug 11, 2015
    • [SPARK-9815] Rename PlatformDependent.UNSAFE -> Platform. · d378396f
      Reynold Xin authored
      PlatformDependent.UNSAFE is way too verbose.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8094 from rxin/SPARK-9815 and squashes the following commits:
      
      229b603 [Reynold Xin] [SPARK-9815] Rename PlatformDependent.UNSAFE -> Platform.
  11. Aug 07, 2015
    • [SPARK-9700] Pick default page size more intelligently. · 4309262e
      Reynold Xin authored
      Previously, we used 64MB as the default page size, which was far too large for many Spark applications (especially on a single node).
      
      This patch changes it so that the default page size, if unset by the user, is determined by the number of cores available and the total execution memory available.
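
      A sketch of that heuristic under stated assumptions (the constants, names and power-of-two rounding below are illustrative; the real logic lives in Spark's memory manager):

      ```java
      // Hedged sketch: derive a default page size from execution memory and cores,
      // clamped to [1MB, 64MB], where 64MB was the old fixed default.
      public final class DefaultPageSize {
          static final long MIN_PAGE = 1L << 20;  // 1MB floor (assumed)
          static final long MAX_PAGE = 64L << 20; // 64MB ceiling (the old default)
          static final int SAFETY_FACTOR = 16;    // room for several pages per core (assumed)

          static long nextPowerOf2(long n) {
              long high = Long.highestOneBit(n);
              return high == n ? n : high << 1;
          }

          static long defaultPageSize(long maxExecutionMemory, int cores) {
              long perCore = maxExecutionMemory / cores / SAFETY_FACTOR;
              return Math.min(MAX_PAGE, Math.max(MIN_PAGE, nextPowerOf2(perCore)));
          }

          public static void main(String[] args) {
              // e.g. 1GB of execution memory on 4 cores -> 16MB pages
              System.out.println(defaultPageSize(1L << 30, 4));
          }
      }
      ```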
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8012 from rxin/pagesize and squashes the following commits:
      
      16f4756 [Reynold Xin] Fixed failing test.
      5afd570 [Reynold Xin] private...
      0d5fb98 [Reynold Xin] Update default value.
      674a6cd [Reynold Xin] Address review feedback.
      dc00e05 [Reynold Xin] Merge with master.
      73ebdb6 [Reynold Xin] [SPARK-9700] Pick default page size more intelligently.
  12. Aug 06, 2015
    • [SPARK-9644] [SQL] Support update DecimalType with precision > 18 in UnsafeRow · 5b965d64
      Davies Liu authored
      In order to support updating a variable-length (actually fixed-length) object, its space must be reserved even when it's null. We also can't call setNullAt(i) for it anymore, because setNullAt(i) would remove the offset of the reserved space; we should call setDecimal(i, null, precision) instead.
      
      After this, we can do hash-based aggregation on DecimalType with precision > 18. In one test, this decreased the end-to-end runtime of an aggregation query from 37 seconds (sort-based) to 24 seconds (hash-based).
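
      A sketch of the resulting update rule (`UnsafeRowLike` is a stand-in interface, not Spark's class; the row layout details are glossed over):

      ```java
      // Hedged sketch: nulls for precision > 18 must go through setDecimal so the
      // reserved fixed-length space (and its offset) is kept.
      interface UnsafeRowLike {
          void setNullAt(int ordinal);
          void setDecimal(int ordinal, java.math.BigDecimal value, int precision);
      }

      final class DecimalNullDemo {
          static void clearDecimal(UnsafeRowLike row, int ordinal, int precision) {
              if (precision > 18) {
                  row.setDecimal(ordinal, null, precision); // keeps the reserved space
              } else {
                  row.setNullAt(ordinal); // fits the fixed 8-byte slot, safe to null out
              }
          }
      }
      ```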
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7978 from davies/update_decimal and squashes the following commits:
      
      bed8100 [Davies Liu] isSettable -> isMutable
      923c9eb [Davies Liu] address comments and fix bug
      385891d [Davies Liu] Merge branch 'master' of github.com:apache/spark into update_decimal
      36a1872 [Davies Liu] fix tests
      cd6c524 [Davies Liu] support set decimal with precision > 18
    • [SPARK-8266] [SQL] add function translate · aead18ff
      zhichao.li authored
      ![translate](http://www.w3resource.com/PostgreSQL/postgresql-translate-function.png)
      
      Author: zhichao.li <zhichao.li@intel.com>
      
      Closes #7709 from zhichao-li/translate and squashes the following commits:
      
      9418088 [zhichao.li] refine checking condition
      f2ab77a [zhichao.li] clone string
      9d88f2d [zhichao.li] fix indent
      6aa2962 [zhichao.li] style
      e575ead [zhichao.li] add python api
      9d4bab0 [zhichao.li] add special case for fodable and refactor unittest
      eda7ad6 [zhichao.li] update to use TernaryExpression
      cdfd4be [zhichao.li] add function translate
  13. Aug 04, 2015
    • [SPARK-9452] [SQL] Support records larger than page size in UnsafeExternalSorter · ab8ee1a3
      Josh Rosen authored
      This patch extends UnsafeExternalSorter to support records larger than the page size. The basic strategy is the same as in #7762: store large records in their own overflow pages.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7891 from JoshRosen/large-records-in-sql-sorter and squashes the following commits:
      
      967580b [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-records-in-sql-sorter
      948c344 [Josh Rosen] Add large records tests for KV sorter.
      3c17288 [Josh Rosen] Combine memory and disk cleanup into general cleanupResources() method
      380f217 [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-records-in-sql-sorter
      27eafa0 [Josh Rosen] Fix page size in PackedRecordPointerSuite
      a49baef [Josh Rosen] Address initial round of review comments
      3edb931 [Josh Rosen] Remove accidentally-committed debug statements.
      2b164e2 [Josh Rosen] Support large records in UnsafeExternalSorter.
    • [SPARK-8244] [SQL] string function: find in set · b1f88a38
      Tarek Auel authored
      This PR is based on #7186 (just fixing the conflict), thanks to tarekauel.
      
      find_in_set(string str, string strList): int
      
      Returns the position of the first occurrence of str in strList, where strList is a comma-delimited string. Returns null if either argument is null. Returns 0 if the first argument contains any commas. For example, find_in_set('ab', 'abc,b,ab,c,def') returns 3.
      
      This is only added to SQL, not to the DataFrame API.
      
      Closes #7186
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7900 from davies/find_in_set and squashes the following commits:
      
      4334209 [Davies Liu] Merge branch 'master' of github.com:apache/spark into find_in_set
      8f00572 [Davies Liu] Merge branch 'master' of github.com:apache/spark into find_in_set
      243ede4 [Tarek Auel] [SPARK-8244][SQL] hive compatibility
      1aaf64e [Tarek Auel] [SPARK-8244][SQL] unit test fix
      e4093a4 [Tarek Auel] [SPARK-8244][SQL] final modifier for COMMA_UTF8
      0d05df5 [Tarek Auel] Merge branch 'master' into SPARK-8244
      208d710 [Tarek Auel] [SPARK-8244] address comments & bug fix
      71b2e69 [Tarek Auel] [SPARK-8244] find_in_set
      66c7fda [Tarek Auel] Merge branch 'master' into SPARK-8244
      61b8ca2 [Tarek Auel] [SPARK-8224] removed loop and split; use unsafe String comparison
      4f75a65 [Tarek Auel] Merge branch 'master' into SPARK-8244
      e3b20c8 [Tarek Auel] [SPARK-8244] added type check
      1c2bbb7 [Tarek Auel] [SPARK-8244] findInSet
  14. Aug 03, 2015
    • [SPARK-9483] Fix UTF8String.getPrefix for big-endian. · b79b4f5f
      Matthew Brandyberry authored
      Previous code assumed little-endian.
      
      Author: Matthew Brandyberry <mbrandy@us.ibm.com>
      
      Closes #7902 from mtbrandy/SPARK-9483 and squashes the following commits:
      
      ec31df8 [Matthew Brandyberry] [SPARK-9483] Changes from review comments.
      17d54c6 [Matthew Brandyberry] [SPARK-9483] Fix UTF8String.getPrefix for big-endian.
    • [SPARK-9404][SPARK-9542][SQL] unsafe array data and map data · 608353c8
      Wenchen Fan authored
      This PR adds UnsafeArrayData; currently we encode it this way:
      
      the first 4 bytes are the number of elements
      then each 4 bytes is the start offset of an element, unless it is negative, in which case the element is null
      followed by the elements themselves
      
      an example: [10, 11, 12, 13, null, 14] will be encoded as:
      6, 28, 32, 36, 40, -44, 44, 10, 11, 12, 13, 14
      
      Note that when we read an UnsafeArrayData from bytes, we can read the first 4 bytes as numElements and take the rest (the first 4 bytes skipped) as the value region.
      
      Unsafe map data just uses two unsafe arrays: the first 4 bytes are the number of elements, the next 4 bytes are the numBytes of the key array, followed by the key array data and the value array data.
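
      To illustrate the array encoding, here is a sketch using a ByteBuffer instead of Spark's Unsafe-based accessors (illustrative only; UnsafeArrayData itself works on raw memory):

      ```java
      import java.nio.ByteBuffer;
      import java.nio.ByteOrder;

      // Hedged sketch: encode [10, 11, 12, 13, null, 14] in the layout above and
      // decode two elements back.
      public final class ArrayEncodingDemo {
          public static void main(String[] args) {
              Integer[] values = {10, 11, 12, 13, null, 14};
              int header = 4 + 4 * values.length; // numElements + one offset per element
              ByteBuffer buf = ByteBuffer.allocate(header + 4 * values.length)
                                         .order(ByteOrder.LITTLE_ENDIAN);
              buf.putInt(values.length);          // # elements
              int dataOffset = header;
              for (Integer v : values) {
                  if (v == null) {
                      buf.putInt(-dataOffset);    // negative offset marks null
                  } else {
                      buf.putInt(dataOffset);     // relative put: fills the offset region
                      buf.putInt(dataOffset, v);  // absolute put: fills the value region
                      dataOffset += 4;
                  }
              }
              int off4 = buf.getInt(4 + 4 * 4);   // offset slot of the null element
              int off5 = buf.getInt(4 + 5 * 4);   // offset slot of the last element
              System.out.println(off4 < 0 ? "null" : buf.getInt(off4)); // null
              System.out.println(buf.getInt(off5));                     // 14
          }
      }
      ```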
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7752 from cloud-fan/unsafe-array and squashes the following commits:
      
      3269bd7 [Wenchen Fan] fix a bug
      6445289 [Wenchen Fan] add unit tests
      49adf26 [Wenchen Fan] add unsafe map
      20d1039 [Wenchen Fan] add comments and unsafe converter
      821b8db [Wenchen Fan] add unsafe array
  15. Aug 02, 2015
    • [SPARK-9543][SQL] Add randomized testing for UnsafeKVExternalSorter. · 9d03ad91
      Reynold Xin authored
      The detailed approach is documented in UnsafeKVExternalSorterSuite.testKVSorter(), working as follows:
      
      1. Create input by generating data randomly based on the given key/value schema (which is also randomly drawn from a list of candidate types)
      2. Run UnsafeKVExternalSorter on the generated data
      3. Collect the output from the sorter, and make sure the keys are sorted in ascending order
      4. Sort the input by both key and value, and sort the sorter output also by both key and value. Compare the sorted input and sorted output together to make sure all the key/values match.
      5. Check memory allocation to make sure there is no memory leak.
      
      There is also a spill flag. When set to true, the sorter will spill probabilistically roughly every 100 records.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7873 from rxin/kvsorter-randomized-test and squashes the following commits:
      
      a08c251 [Reynold Xin] Resource cleanup.
      0488b5c [Reynold Xin] [SPARK-9543][SQL] Add randomized testing for UnsafeKVExternalSorter.
  16. Aug 01, 2015
    • [SPARK-9459] [SQL] use generated FromUnsafeProjection to do deep copy for UTF8String and struct · 57084e0c
      Davies Liu authored
      When accessing a column in an UnsafeRow, it's good to avoid the copy; the deep copy should instead happen when turning the UnsafeRow into a generic Row. This PR introduces a generated FromUnsafeProjection to do that.
      
      This PR also fixes the expressions that cache a UTF8String, which should copy it as well.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7840 from davies/avoid_copy and squashes the following commits:
      
      230c8a1 [Davies Liu] address comment
      fd797c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into avoid_copy
      e095dd0 [Davies Liu] rollback rename
      8ef5b0b [Davies Liu] copy String in Columnar
      81360b8 [Davies Liu] fix class name
      9aecb88 [Davies Liu] use FromUnsafeProjection to do deep copy for UTF8String and struct
    • [SPARK-8269] [SQL] string function: initcap · 00cd92f3
      HuJiayin authored
      This PR is based on #7208, thanks to HuJiayin
      
      Closes #7208
      
      Author: HuJiayin <jiayin.hu@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7850 from davies/initcap and squashes the following commits:
      
      54472e9 [Davies Liu] fix python test
      17ffe51 [Davies Liu] Merge branch 'master' of github.com:apache/spark into initcap
      ca46390 [Davies Liu] Merge branch 'master' of github.com:apache/spark into initcap
      3a906e4 [Davies Liu] implement title case in UTF8String
      8b2506a [HuJiayin] Update functions.py
      2cd43e5 [HuJiayin] fix python style check
      b616c0e [HuJiayin] add python api
      1f5a0ef [HuJiayin] add codegen
      7e0c604 [HuJiayin] Merge branch 'master' of https://github.com/apache/spark into initcap
      6a0b958 [HuJiayin] add column
      c79482d [HuJiayin] support soundex
      7ce416b [HuJiayin] support initcap rebase code
    • [SPARK-9520] [SQL] Support in-place sort in UnsafeFixedWidthAggregationMap · 3d1535d4
      Reynold Xin authored
      This pull request adds a sortedIterator method to UnsafeFixedWidthAggregationMap that sorts its data in-place by the grouping key.
      
      This is needed so we can fall back to external sorting for aggregation.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7849 from rxin/bytes2bytes-sorting and squashes the following commits:
      
      75018c6 [Reynold Xin] Updated documentation.
      81a8694 [Reynold Xin] [SPARK-9520][SQL] Support in-place sort in UnsafeFixedWidthAggregationMap.
    • [SPARK-8263] [SQL] substr/substring should also support binary type · c5166f7a
      zhichao.li authored
      This is based on #7641, thanks to zhichao-li
      
      Closes #7641
      
      Author: zhichao.li <zhichao.li@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7848 from davies/substr and squashes the following commits:
      
      461b709 [Davies Liu] remove bytearry from tests
      b45377a [Davies Liu] Merge branch 'master' of github.com:apache/spark into substr
      01d795e [zhichao.li] scala style
      99aa130 [zhichao.li] add substring to dataframe
      4f68bfe [zhichao.li] add binary type support for substring
    • [SPARK-9517][SQL] BytesToBytesMap should encode data the same way as UnsafeExternalSorter · d90f2cf7
      Reynold Xin authored
      BytesToBytesMap currently encodes key/value data in the following format:
      ```
      8B key length, key data, 8B value length, value data
      ```
      
      UnsafeExternalSorter, on the other hand, encodes data this way:
      ```
      4B record length, data
      ```
      
      As a result, we cannot pass records encoded by BytesToBytesMap directly into UnsafeExternalSorter for sorting. However, if we rearrange data slightly, we can then pass the key/value records directly into UnsafeExternalSorter:
      ```
      4B key+value length, 4B key length, key data, value data
      ```
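
      A sketch of writing that combined layout, taking the field list above literally (illustrative; Spark writes these records with Unsafe rather than a ByteBuffer):

      ```java
      import java.nio.ByteBuffer;
      import java.nio.ByteOrder;

      // Hedged sketch: one record that BytesToBytesMap can split into key/value
      // and UnsafeExternalSorter can treat as length-prefixed data.
      final class KvRecordLayout {
          static ByteBuffer write(byte[] key, byte[] value) {
              ByteBuffer buf = ByteBuffer.allocate(8 + key.length + value.length)
                                         .order(ByteOrder.LITTLE_ENDIAN);
              buf.putInt(key.length + value.length); // 4B key+value length
              buf.putInt(key.length);                // 4B key length
              buf.put(key).put(value);               // key data, then value data
              return buf;
          }
      }
      ```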
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7845 from rxin/kvsort-rebase and squashes the following commits:
      
      5716b59 [Reynold Xin] Fixed test.
      2e62ccb [Reynold Xin] Updated BytesToBytesMap's data encoding to put the key first.
      a51b641 [Reynold Xin] Added a KV sorter interface.
  17. Jul 31, 2015
    • [SPARK-9464][SQL] Property checks for UTF8String · 14f26344
      Josh Rosen authored
      This PR is based on the original work by JoshRosen in #7780, which adds ScalaCheck property-based tests for UTF8String.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #7830 from yjshen/utf8-property-checks and squashes the following commits:
      
      593da3a [Yijie Shen] resolve comments
      c0800e6 [Yijie Shen] Finish all todos in suite
      52f51a0 [Josh Rosen] Add some more failing tests
      49ed0697 [Josh Rosen] Rename suite
      9209c64 [Josh Rosen] UTF8String Property Checks.
    • [SPARK-8264][SQL]add substring_index function · 6996bd2e
      zhichao.li authored
      This PR is based on #7533, thanks to zhichao-li
      
      Closes #7533
      
      Author: zhichao.li <zhichao.li@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7843 from davies/str_index and squashes the following commits:
      
      391347b [Davies Liu] add python api
      3ce7802 [Davies Liu] fix substringIndex
      f2d29a1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into str_index
      515519b [zhichao.li] add foldable and remove null checking
      9546991 [zhichao.li] scala style
      67c253a [zhichao.li] hide some apis and clean code
      b19b013 [zhichao.li] add codegen and clean code
      ac863e9 [zhichao.li] reduce the calling of numChars
      12e108f [zhichao.li] refine unittest
      d92951b [zhichao.li] add lastIndexOf
      52d7b03 [zhichao.li] add substring_index function
    • [SPARK-9451] [SQL] Support entries larger than default page size in BytesToBytesMap & integrate with ShuffleMemoryManager · 8cb415a4
      Josh Rosen authored
      
      This patch adds support for entries larger than the default page size in BytesToBytesMap.  These large rows are handled by allocating special overflow pages to hold individual entries.
      
      In addition, this patch integrates BytesToBytesMap with the ShuffleMemoryManager:
      
      - Move BytesToBytesMap from `unsafe` to `core` so that it can import `ShuffleMemoryManager`.
      - Before allocating new data pages, ask the ShuffleMemoryManager to reserve the memory:
        - `putNewKey()` now returns a boolean to indicate whether the insert succeeded or failed due to a lack of memory. The caller can use this value to respond to memory pressure (e.g. by spilling).
      - `UnsafeFixedWidthAggregationMap.getAggregationBuffer()` now returns `null` to signal failure due to a lack of memory.
      - Updated all uses of these classes to handle these error conditions.
      - Added new tests for allocating large records and for allocations which fail due to memory pressure.
      - Extended the `afterAll()` test teardown methods to detect ShuffleMemoryManager leaks.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7762 from JoshRosen/large-rows and squashes the following commits:
      
      ae7bc56 [Josh Rosen] Fix compilation
      82fc657 [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-rows
      34ab943 [Josh Rosen] Remove semi
      31a525a [Josh Rosen] Integrate BytesToBytesMap with ShuffleMemoryManager.
      626b33c [Josh Rosen] Move code to sql/core and spark/core packages so that ShuffleMemoryManager can be integrated
      ec4484c [Josh Rosen] Move BytesToBytesMap from unsafe package to core.
      642ed69 [Josh Rosen] Rename size to numElements
      bea1152 [Josh Rosen] Add basic test.
      2cd3570 [Josh Rosen] Remove accidental duplicated code
      07ff9ef [Josh Rosen] Basic support for large rows in BytesToBytesMap.
    • [SPARK-8271][SQL]string function: soundex · 4d5a6e7b
      HuJiayin authored
      This PR brings SQL function soundex(), see https://issues.apache.org/jira/browse/HIVE-9738
      
      It's based on #7115 , thanks to HuJiayin
      
      Author: HuJiayin <jiayin.hu@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7812 from davies/soundex and squashes the following commits:
      
      fa75941 [Davies Liu] Merge branch 'master' of github.com:apache/spark into soundex
      a4bd6d8 [Davies Liu] fix soundex
      2538908 [HuJiayin] add codegen soundex
      d15d329 [HuJiayin] add back ut
      ded1a14 [HuJiayin] Merge branch 'master' of https://github.com/apache/spark
      e2dec2c [HuJiayin] support soundex rebase code
  18. Jul 30, 2015
    • [SPARK-9460] Fix prefix generation for UTF8String. · a20e743f
      Reynold Xin authored
      Previously we could get garbage data if the number of bytes was 0, on JVMs that are 4-byte aligned, or when compressed oops are enabled.
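
      A sketch of length-masked prefix generation, assuming a simple byte-at-a-time loop (the real UTF8String.getPrefix reads whole words via Unsafe and byte-swaps on little-endian machines, which is where the alignment and compressed-oops issues arose):

      ```java
      import java.nio.charset.StandardCharsets;

      // Hedged sketch: build an 8-byte prefix where bytes beyond the string's
      // length stay zero, i.e. data outside the valid range is masked out.
      public final class PrefixDemo {
          static long prefix(byte[] utf8) {
              long p = 0;
              int n = Math.min(utf8.length, 8);
              for (int i = 0; i < n; i++) {
                  // First byte goes to the most significant position so unsigned
                  // long comparison matches byte-wise comparison.
                  p |= (utf8[i] & 0xFFL) << (56 - 8 * i);
              }
              return p;
          }

          public static void main(String[] args) {
              long a = prefix("abc".getBytes(StandardCharsets.UTF_8));
              long b = prefix("abd".getBytes(StandardCharsets.UTF_8));
              System.out.println(Long.compareUnsigned(a, b) < 0); // true: "abc" < "abd"
          }
      }
      ```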
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7789 from rxin/utf8string and squashes the following commits:
      
      86ffa3e [Reynold Xin] Mask out data outside of valid range.
      4d647ed [Reynold Xin] Mask out data.
      c6e8794 [Reynold Xin] [SPARK-9460] Fix prefix generation for UTF8String.
  19. Jul 29, 2015
    • [SPARK-9460] Avoid byte array allocation in StringPrefixComparator. · 07fd7d36
      Reynold Xin authored
      As of today, StringPrefixComparator converts the long values back to byte arrays in order to compare them. This patch optimizes this to compare the longs directly, rather than turning the longs into byte arrays and comparing them byte by byte (unsigned).
      
      This only works on little-endian architecture right now.
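
      The unsigned comparison itself can be done without any allocation; a minimal sketch (equivalent to `Long.compareUnsigned` on Java 8+):

      ```java
      // Hedged sketch: flipping the sign bit turns unsigned ordering of two
      // prefix longs into signed ordering, so no byte[] round-trip is needed.
      final class UnsignedPrefixCompare {
          static int compare(long a, long b) {
              return Long.compare(a + Long.MIN_VALUE, b + Long.MIN_VALUE);
          }
      }
      ```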
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7765 from rxin/SPARK-9460 and squashes the following commits:
      
      e4908cc [Reynold Xin] Stricter randomized tests.
      4c8d094 [Reynold Xin] [SPARK-9460] Avoid byte array allocation in StringPrefixComparator.