  1. Jul 21, 2015
    • [SPARK-9173][SQL]UnionPushDown should also support Intersect and Except · ae230596
      Yijie Shen authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9173
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #7540 from yjshen/union_pushdown and squashes the following commits:
      
      278510a [Yijie Shen] rename UnionPushDown to SetOperationPushDown
      91741c1 [Yijie Shen] Add UnionPushDown support for intersect and except
    • [SPARK-8230][SQL] Add array/map size method · 560c658a
      Pedro Rodriguez authored
      Pull Request for: https://issues.apache.org/jira/browse/SPARK-8230
      
      The primary issue resolved is implementing array/map size for Spark SQL. The code is ready for review by a committer. Cheng Hao is on the JIRA ticket, but I don't know his GitHub username; rxin is also on the JIRA ticket.
      
      Things to review:
      1. Where to put the added functions, namespace-wise. They seem to belong with a few operations on collections, which include `sort_array` and `array_contains`; hence the names `collectionOperations.scala` and `_collection_functions` in Python.
      2. In the Python code, should it be in a `1.5.0` function array or in a collections array?
      3. Are there any missing methods on the `Size` case class? It looks like many of these functions have generated Java code; is that also needed in this case?
      4. Something else?
      
      Author: Pedro Rodriguez <ski.rodriguez@gmail.com>
      Author: Pedro Rodriguez <prodriguez@trulia.com>
      
      Closes #7462 from EntilZha/SPARK-8230 and squashes the following commits:
      
      9a442ae [Pedro Rodriguez] fixed functions and sorted __all__
      9aea3bb [Pedro Rodriguez] removed imports from python docs
      15d4bf1 [Pedro Rodriguez] Added null test case and changed to nullSafeCodeGen
      d88247c [Pedro Rodriguez] removed python code
      bd5f0e4 [Pedro Rodriguez] removed duplicate function from rebase/merge
      59931b4 [Pedro Rodriguez] fixed compile bug instroduced when merging
      c187175 [Pedro Rodriguez] updated code to add size to __all__ directly and removed redundent pretty print
      130839f [Pedro Rodriguez] fixed failing test
      aa9bade [Pedro Rodriguez] fix style
      e093473 [Pedro Rodriguez] updated python code with docs, switched classes/traits implemented, added (failing) expression tests
      0449377 [Pedro Rodriguez] refactored code to use better abstract classes/traits and implementations
      9a1a2ff [Pedro Rodriguez] added unit tests for map size
      2bfbcb6 [Pedro Rodriguez] added unit test for size
      20df2b4 [Pedro Rodriguez] Finished working version of size function and added it to python
      b503e75 [Pedro Rodriguez] First attempt at implementing size for maps and arrays
      99a6a5c [Pedro Rodriguez] fixed failing test
      cac75ac [Pedro Rodriguez] fix style
      933d843 [Pedro Rodriguez] updated python code with docs, switched classes/traits implemented, added (failing) expression tests
      42bb7d4 [Pedro Rodriguez] refactored code to use better abstract classes/traits and implementations
      f9c3b8a [Pedro Rodriguez] added unit tests for map size
      2515d9f [Pedro Rodriguez] added documentation
      0e60541 [Pedro Rodriguez] added unit test for size
      acf9853 [Pedro Rodriguez] Finished working version of size function and added it to python
      84a5d38 [Pedro Rodriguez] First attempt at implementing size for maps and arrays
    • [SPARK-8255] [SPARK-8256] [SQL] Add regex_extract/regex_replace · 8c8f0ef5
      Cheng Hao authored
      Adds the expressions `regexp_extract` and `regexp_replace`.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #7468 from chenghao-intel/regexp and squashes the following commits:
      
      e5ea476 [Cheng Hao] minor update for documentation
      ef96fd6 [Cheng Hao] update the code gen
      72cf28f [Cheng Hao] Add more log for compilation error
      4e11381 [Cheng Hao] Add regexp_replace / regexp_extract support
    • [SPARK-9100] [SQL] Adds DataFrame reader/writer shortcut methods for ORC · d38c5029
      Cheng Lian authored
      This PR adds DataFrame reader/writer shortcut methods for ORC in both Scala and Python.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7444 from liancheng/spark-9100 and squashes the following commits:
      
      284d043 [Cheng Lian] Fixes PySpark test cases and addresses PR comments
      e0b09fb [Cheng Lian] Adds DataFrame reader/writer shortcut methods for ORC
    • [SPARK-9161][SQL] codegen FormatNumber · 1ddd0f2f
      Tarek Auel authored
      Jira https://issues.apache.org/jira/browse/SPARK-9161
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7545 from tarekauel/SPARK-9161 and squashes the following commits:
      
      21425c8 [Tarek Auel] [SPARK-9161][SQL] codegen FormatNumber
    • [SPARK-9179] [BUILD] Use default primary author if unspecified · 228ab65a
      Shivaram Venkataraman authored
      Fixes the feature introduced in #7508 so that it uses the default value if nothing is specified on the command line.
      
      cc liancheng rxin pwendell
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #7558 from shivaram/merge-script-fix and squashes the following commits:
      
      7092141 [Shivaram Venkataraman] Use default primary author if unspecified
    • [SPARK-9023] [SQL] Followup for #7456 (Efficiency improvements for UnsafeRows in Exchange) · 48f8fd46
      Josh Rosen authored
      This patch addresses code review feedback from #7456.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7551 from JoshRosen/unsafe-exchange-followup and squashes the following commits:
      
      76dbdf8 [Josh Rosen] Add comments + more methods to UnsafeRowSerializer
      3d7a1f2 [Josh Rosen] Add writeToStream() method to UnsafeRow
    • [SPARK-9208][SQL] Remove variant of DataFrame string functions that accept column names. · 67570bee
      Reynold Xin authored
      It can be ambiguous whether that is a string literal or a column name.
      
      cc marmbrus
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7556 from rxin/str-exprs and squashes the following commits:
      
      92afa83 [Reynold Xin] [SPARK-9208][SQL] Remove variant of DataFrame string functions that accept column names.
    • [SPARK-9157] [SQL] codegen substring · 560b355c
      Tarek Auel authored
      https://issues.apache.org/jira/browse/SPARK-9157
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7534 from tarekauel/SPARK-9157 and squashes the following commits:
      
      e65e3e9 [Tarek Auel] [SPARK-9157] indent fix
      44e89f8 [Tarek Auel] [SPARK-9157] use EMPTY_UTF8
      37d54c4 [Tarek Auel] Merge branch 'master' into SPARK-9157
      60732ea [Tarek Auel] [SPARK-9157] created substringSQL in UTF8String
      18c3576 [Tarek Auel] [SPARK-9157][SQL] remove slice pos
      1a2e611 [Tarek Auel] [SPARK-9157][SQL] codegen substring
    • [SPARK-8797] [SPARK-9146] [SPARK-9145] [SPARK-9147] Support NaN ordering and... · c032b0bf
      Josh Rosen authored
      [SPARK-8797] [SPARK-9146] [SPARK-9145] [SPARK-9147] Support NaN ordering and equality comparisons in Spark SQL
      
      This patch addresses an issue where queries that sorted float or double columns containing NaN values could fail with "Comparison method violates its general contract!" errors from TimSort.  The root of this problem is that `NaN > anything`, `NaN == anything`, and `NaN < anything` all return `false`.
      
      Per the design specified in SPARK-9079, we have decided that `NaN = NaN` should return true and that NaN should appear last when sorting in ascending order (i.e. it is larger than any other numeric value).
      
      In addition to implementing these semantics, this patch also adds canonicalization of NaN values in UnsafeRow, which is necessary in order to be able to do binary equality comparisons on equal NaNs that might have different bit representations (see SPARK-9147).
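
      The semantics described above can be illustrated with a small, self-contained Python sketch. It is not the Spark implementation: the function names are hypothetical, and the choice of `0x7FF8000000000000` as the canonical NaN bit pattern is an assumption made for illustration.

```python
import math

CANONICAL_NAN_BITS = 0x7FF8000000000000  # assumed canonical quiet-NaN pattern


def nan_last_key(x):
    """Sort key under which NaN is larger than every other value."""
    return (1, 0.0) if math.isnan(x) else (0, x)


def nan_aware_equals(a, b):
    """Equality where NaN = NaN is true, unlike plain IEEE 754 comparison."""
    return (math.isnan(a) and math.isnan(b)) or a == b


def canonicalize_double_bits(bits):
    """Map every 64-bit NaN pattern to a single pattern.

    A double is NaN when all exponent bits are set and the mantissa is
    non-zero. Collapsing these patterns lets byte-wise comparison treat
    all NaNs as equal.
    """
    exponent_all_ones = (bits & 0x7FF0000000000000) == 0x7FF0000000000000
    mantissa_non_zero = (bits & 0x000FFFFFFFFFFFFF) != 0
    if exponent_all_ones and mantissa_non_zero:
        return CANONICAL_NAN_BITS
    return bits


values = [3.0, float("nan"), -1.0, float("inf"), 0.5]
ordered = sorted(values, key=nan_last_key)  # NaN ends up after +Infinity
```

      Sorting through an explicit key gives NaN a consistent position in the ordering, which plain `<` and `>` comparisons cannot provide.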
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7194 from JoshRosen/nan and squashes the following commits:
      
      983d4fc [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan
      88bd73c [Josh Rosen] Fix Row.equals()
      a702e2e [Josh Rosen] normalization -> canonicalization
      a7267cf [Josh Rosen] Normalize NaNs in UnsafeRow
      fe629ae [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan
      fbb2a29 [Josh Rosen] Fix NaN comparisons in BinaryComparison expressions
      c1fd4fe [Josh Rosen] Fold NaN test into existing test framework
      b31eb19 [Josh Rosen] Uncomment failing tests
      7fe67af [Josh Rosen] Support NaN == NaN (SPARK-9145)
      58bad2c [Josh Rosen] Revert "Compare rows' string representations to work around NaN incomparability."
      fc6b4d2 [Josh Rosen] Update CodeGenerator
      3998ef2 [Josh Rosen] Remove unused code
      a2ba2e7 [Josh Rosen] Fix prefix comparision for NaNs
      a30d371 [Josh Rosen] Compare rows' string representations to work around NaN incomparability.
      6f03f85 [Josh Rosen] Fix bug in Double / Float ordering
      42a1ad5 [Josh Rosen] Stop filtering NaNs in UnsafeExternalSortSuite
      bfca524 [Josh Rosen] Change ordering so that NaN is maximum value.
      8d7be61 [Josh Rosen] Update randomized test to use ScalaTest's assume()
      b20837b [Josh Rosen] Add failing test for new NaN comparision ordering
      5b88b2b [Josh Rosen] Fix compilation of CodeGenerationSuite
      d907b5b [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan
      630ebc5 [Josh Rosen] Specify an ordering for NaN values.
      9bf195a [Josh Rosen] Re-enable NaNs in CodeGenerationSuite to produce more regression tests
      13fc06a [Josh Rosen] Add regression test for NaN sorting issue
      f9efbb5 [Josh Rosen] Fix ORDER BY NULL
      e7dc4fb [Josh Rosen] Add very generic test for ordering
      7d5c13e [Josh Rosen] Add regression test for SPARK-8782 (ORDER BY NULL)
      b55875a [Josh Rosen] Generate doubles and floats over entire possible range.
      5acdd5c [Josh Rosen] Infinity and NaN are interesting.
      ab76cbd [Josh Rosen] Move code to Catalyst package.
      d2b4a4a [Josh Rosen] Add random data generator test utilities to Spark SQL.
    • [SPARK-9204][ML] Add default params test for linearyregression suite · 4d97be95
      Holden Karau authored
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #7553 from holdenk/SPARK-9204-add-default-params-test-to-linear-regression and squashes the following commits:
      
      630ba19 [Holden Karau] style fix
      faa08a3 [Holden Karau] Add default params test for linearyregression suite
    • [SPARK-9132][SPARK-9163][SQL] codegen conv · a3c7a3ce
      Tarek Auel authored
      Jira: https://issues.apache.org/jira/browse/SPARK-9132
      https://issues.apache.org/jira/browse/SPARK-9163
      
      rxin, as you proposed in the Jira ticket, I just moved the logic to a separate object. I haven't changed any of the logic of `NumberConverter`.
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7552 from tarekauel/SPARK-9163 and squashes the following commits:
      
      40dcde9 [Tarek Auel] [SPARK-9132][SPARK-9163][SQL] style fix
      fa985bd [Tarek Auel] [SPARK-9132][SPARK-9163][SQL] codegen conv
  2. Jul 20, 2015
    • [SPARK-9201] [ML] Initial integration of MLlib + SparkR using RFormula · 1cbdd899
      Eric Liang authored
      This exposes the SparkR:::glm() and SparkR:::predict() APIs. It was necessary to change RFormula to silently drop the label column if it was missing from the input dataset, which is kind of a hack but necessary to integrate with the Pipeline API.
      
      The umbrella design doc for MLlib + SparkR integration can be viewed here: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit
      
      mengxr
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #7483 from ericl/spark-8774 and squashes the following commits:
      
      3dfac0c [Eric Liang] update
      17ef516 [Eric Liang] more comments
      1753a0f [Eric Liang] make glm generic
      b0f50f8 [Eric Liang] equivalence test
      550d56d [Eric Liang] export methods
      c015697 [Eric Liang] second pass
      117949a [Eric Liang] comments
      5afbc67 [Eric Liang] test label columns
      6b7f15f [Eric Liang] Fri Jul 17 14:20:22 PDT 2015
      3a63ae5 [Eric Liang] Fri Jul 17 13:41:52 PDT 2015
      ce61367 [Eric Liang] Fri Jul 17 13:41:17 PDT 2015
      0299c59 [Eric Liang] Fri Jul 17 13:40:32 PDT 2015
      e37603f [Eric Liang] Fri Jul 17 12:15:03 PDT 2015
      d417d0c [Eric Liang] Merge remote-tracking branch 'upstream/master' into spark-8774
      29a2ce7 [Eric Liang] Merge branch 'spark-8774-1' into spark-8774
      d1959d2 [Eric Liang] clarify comment
      2db68aa [Eric Liang] second round of comments
      dc3c943 [Eric Liang] address comments
      5765ec6 [Eric Liang] fix style checks
      1f361b0 [Eric Liang] doc
      d33211b [Eric Liang] r support
      fb0826b [Eric Liang] [SPARK-8774] Add R model formula with basic support as a transformer
    • [SPARK-9052] [SPARKR] Fix comments after curly braces · 2bdf9914
      Yu ISHIKAWA authored
      [[SPARK-9052] Fix comments after curly braces - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9052)
      
      This is the full result of lintr at the revision:01155162.
      [[SPARK-9052] the result of lint-r at the revision:01155162](https://gist.github.com/yu-iskw/e7246041b173a3f29482)
      
      This is the difference in the results before and after the change.
      https://gist.github.com/yu-iskw/e7246041b173a3f29482/revisions
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #7440 from yu-iskw/SPARK-9052 and squashes the following commits:
      
      015d738 [Yu ISHIKAWA] Fix the indentations and move the placement of commna
      5cc30fe [Yu ISHIKAWA] Fix the indentation in a condition
      4ead0e5 [Yu ISHIKAWA] [SPARK-9052][SparkR] Fix comments after curly braces
    • [SPARK-9164] [SQL] codegen hex/unhex · 936a96cb
      Tarek Auel authored
      Jira: https://issues.apache.org/jira/browse/SPARK-9164
      
      The diff looks heavy, but I just moved the `hex` and `unhex` methods to `object Hex`. This allows me to call them from both `eval` and `codeGen`.
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7548 from tarekauel/SPARK-9164 and squashes the following commits:
      
      dd91c57 [Tarek Auel] [SPARK-9164][SQL] codegen hex/unhex
    • [SPARK-9142][SQL] Removing unnecessary self types in expressions. · e90543e5
      Reynold Xin authored
      Also added documentation to expressions to explain the important traits and abstract classes.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7550 from rxin/remove-self-types and squashes the following commits:
      
      b2a3ec1 [Reynold Xin] [SPARK-9142][SQL] Removing unnecessary self types in expressions.
    • [SPARK-9156][SQL] codegen StringSplit · 6853ac7c
      Tarek Auel authored
      Jira: https://issues.apache.org/jira/browse/SPARK-9156
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7547 from tarekauel/SPARK-9156 and squashes the following commits:
      
      0be2700 [Tarek Auel] [SPARK-9156][SQL] indention fix
      b860eaf [Tarek Auel] [SPARK-9156][SQL] codegen StringSplit
      5ad6a1f [Tarek Auel] [SPARK-9156] codegen StringSplit
    • [SPARK-9178][SQL] Add an empty string constant to UTF8String · 047ccc8c
      Tarek Auel authored
      Jira: https://issues.apache.org/jira/browse/SPARK-9178
      
      To avoid calls to `UTF8String.fromString("")`, this PR adds an `EMPTY_STRING` constant to `UTF8String` (renamed to `EMPTY_UTF8` in a later commit of this PR). A `UTF8String` is immutable, so we can use a constant, can't we?
      
      I searched for current usage of `UTF8String.fromString("")` with
      `grep -R  "UTF8String.fromString(\"\")" .`
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7509 from tarekauel/SPARK-9178 and squashes the following commits:
      
      8d6c405 [Tarek Auel] [SPARK-9178] revert intellij indents
      3627b80 [Tarek Auel] [SPARK-9178] revert concat tests changes
      3f5fbf5 [Tarek Auel] [SPARK-9178] rebase and add final to UTF8String.EMPTY_UTF8
      47cda68 [Tarek Auel] Merge branch 'master' into SPARK-9178
      4a37344 [Tarek Auel] [SPARK-9178] changed name to EMPTY_UTF8, added tests
      748b87a [Tarek Auel] [SPARK-9178] Add empty string constant to UTF8String
    • [SPARK-9187] [WEBUI] Timeline view may show negative value for running tasks · 66bb8003
      Carson Wang authored
      For running tasks, the executorRunTime metric is 0, which causes a negative executorComputingTime in the timeline. It also causes an incorrect SchedulerDelay time.
      ![timelinenegativevalue](https://cloud.githubusercontent.com/assets/9278199/8770953/f4362378-2eec-11e5-81e6-a06a07c04794.png)
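
      A minimal Python sketch of the problem follows. It assumes the computing time is derived as run time minus shuffle read and write time, and that a running task's run time can be estimated from its launch time. Both are assumptions for illustration, and the function and parameter names are hypothetical rather than taken from the web UI code.

```python
def executor_computing_time_ms(run_time_ms, shuffle_read_ms, shuffle_write_ms,
                               is_running=False, launch_ms=0, now_ms=0):
    """Computing time = run time minus shuffle time, never negative."""
    if is_running:
        # The reported run-time metric is still 0 for a running task, so
        # estimate it from the wall-clock time since launch (an assumption).
        run_time_ms = now_ms - launch_ms
    return max(run_time_ms - shuffle_read_ms - shuffle_write_ms, 0)


# Naive subtraction with the unreported metric would give 0 - 30 - 20 = -50.
naive = 0 - 30 - 20
estimated = executor_computing_time_ms(0, 30, 20, is_running=True,
                                       launch_ms=1000, now_ms=1500)
```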
      
      Author: Carson Wang <carson.wang@intel.com>
      
      Closes #7526 from carsonwang/timeline-negValue and squashes the following commits:
      
      7b17db2 [Carson Wang] Fix negative value in timeline view
    • [SPARK-9175] [MLLIB] BLAS.gemm fails to update matrix C when alpha==0 and beta!=1 · ff3c72db
      Meihua Wu authored
      Fix BLAS.gemm to update matrix C when alpha==0 and beta!=1
      Also include unit tests to verify the fix.
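
      For context, gemm computes `C := alpha * A * B + beta * C`. The sketch below is a plain-Python illustration of why the `alpha == 0` case still has to scale C by beta. It uses lists of lists and a hypothetical `gemm` function, not the MLlib code.

```python
def gemm(alpha, A, B, beta, C):
    """In-place C := alpha * A * B + beta * C for row-major lists of lists."""
    rows, inner, cols = len(A), len(B), len(B[0])
    if alpha == 0.0 and beta == 1.0:
        return C  # the update is a no-op only in this case
    if alpha == 0.0:
        # The product term vanishes, but C must still be scaled by beta.
        for i in range(rows):
            for j in range(cols):
                C[i][j] *= beta
        return C
    for i in range(rows):
        for j in range(cols):
            acc = sum(A[i][k] * B[k][j] for k in range(inner))
            C[i][j] = alpha * acc + beta * C[i][j]
    return C


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
scaled = gemm(0.0, A, B, 2.0, [[1.0, 1.0], [1.0, 1.0]])
```

      Returning early whenever `alpha == 0`, without checking beta, would leave C unchanged, which is only correct when `beta == 1`.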
      
      mengxr brkyvz
      
      Author: Meihua Wu <meihuawu@umich.edu>
      
      Closes #7503 from rotationsymmetry/fix_BLAS_gemm and squashes the following commits:
      
      fce199c [Meihua Wu] Fix BLAS.gemm to update C when alpha==0 and beta!=1
    • [SPARK-9198] [MLLIB] [PYTHON] Fixed typo in pyspark sparsevector doc tests · a5d05819
      Joseph K. Bradley authored
      Several places in the PySpark SparseVector docs have one defined as:
      ```
      SparseVector(4, [2, 4], [1.0, 2.0])
      ```
      The index 4 goes out of bounds (but this is not checked).
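
      To make the bounds issue concrete: a vector of size 4 only has valid indices 0 through 3. The helper below is a hypothetical check written for illustration. It is not part of the PySpark API.

```python
def out_of_bounds_indices(size, indices):
    """Return the indices that fall outside the valid range [0, size)."""
    return [i for i in indices if not 0 <= i < size]


bad = out_of_bounds_indices(4, [2, 4])   # the example from the docs
good = out_of_bounds_indices(4, [1, 3])  # a valid alternative
```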
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #7541 from jkbradley/sparsevec-doc-typo-fix and squashes the following commits:
      
      c806a65 [Joseph K. Bradley] fixed doc test
      e2dcb23 [Joseph K. Bradley] Fixed typo in pyspark sparsevector doc tests
    • [SPARK-8125] [SQL] Accelerates Parquet schema merging and partition discovery · a1064df0
      Cheng Lian authored
      This PR tries to accelerate Parquet schema discovery and `HadoopFsRelation` partition discovery.  The acceleration is done by the following means:
      
      - Turning off schema merging by default
      
        Schema merging is not the most common case, but requires reading footers of all Parquet part-files and can be very slow.
      
      - Avoiding `FileSystem.globStatus()` call when possible
      
        `FileSystem.globStatus()` may issue multiple synchronous RPC calls, and can be very slow (esp. on S3).  This PR adds `SparkHadoopUtil.globPathIfNecessary()`, which only issues RPC calls when the path contains glob-pattern specific character(s) (`{}[]*?\`).
      
        This is especially useful when converting a metastore Parquet table with lots of partitions, since Spark SQL adds all partition directories as the input paths, and currently we do a `globStatus` call on each input path sequentially.
      
      - Listing leaf files in parallel when the number of input paths exceeds a threshold
      
        Listing leaf files is required by partition discovery.  Currently it is done on the driver side, and can be slow when there are lots of (nested) directories, since each `FileSystem.listStatus()` call issues an RPC.  In this PR, we list leaf files in a BFS style, and resort to a Spark job once the number of directories that need to be listed exceeds a threshold.
      
        The threshold is controlled by `SQLConf` option `spark.sql.sources.parallelPartitionDiscovery.threshold`, which defaults to 32.
      
      - Discovering Parquet schema in parallel
      
        Currently, schema merging is also done on driver side, and needs to read footers of all part-files.  This PR uses a Spark job to do schema merging.  Together with task side metadata reading in Parquet 1.7.0, we never read any footers on driver side now.
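
      Two of the ideas above, the glob-character check and the BFS listing with a threshold, can be sketched in plain Python. The names `is_glob_path`, `list_leaf_files`, and the `list_dir` callback are hypothetical stand-ins for illustration. Only the character set and the default threshold of 32 come from the description.

```python
from collections import deque

GLOB_CHARS = set("{}[]*?\\")


def is_glob_path(path):
    """True only if the path contains a glob-pattern specific character."""
    return any(ch in GLOB_CHARS for ch in path)


def list_leaf_files(roots, list_dir, threshold=32):
    """Breadth-first listing of leaf files.

    `list_dir(d)` is assumed to return `(subdirs, files)` for directory `d`.
    Returns `(files, pending_dirs)`; `pending_dirs` is non-empty when the
    frontier grew past `threshold`, at which point a caller could hand the
    remaining directories to a parallel job instead.
    """
    files, frontier = [], deque(roots)
    while frontier:
        if len(frontier) > threshold:
            return files, list(frontier)
        subdirs, leaf_files = list_dir(frontier.popleft())
        frontier.extend(subdirs)
        files.extend(leaf_files)
    return files, []


# A tiny in-memory directory tree standing in for a real file system.
tree = {
    "/t": (["/t/a=1", "/t/a=2"], []),
    "/t/a=1": ([], ["/t/a=1/part-0"]),
    "/t/a=2": ([], ["/t/a=2/part-0"]),
}
serial = list_leaf_files(["/t"], lambda d: tree[d])
handed_off = list_leaf_files(["/t"], lambda d: tree[d], threshold=1)
```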
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7396 from liancheng/accel-parquet and squashes the following commits:
      
      5598efc [Cheng Lian] Uses ParquetInputFormat[InternalRow] instead of ParquetInputFormat[Row]
      ff32cd0 [Cheng Lian] Excludes directories while listing leaf files
      3c580f1 [Cheng Lian] Fixes test failure caused by making "mergeSchema" default to "false"
      b1646aa [Cheng Lian] Should allow empty input paths
      32e5f0d [Cheng Lian] Moves schema merging to executor side
    • [SPARK-9160][SQL] codegen encode, decode · dac7dbf5
      Tarek Auel authored
      Jira: https://issues.apache.org/jira/browse/SPARK-9160
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7543 from tarekauel/SPARK-9160 and squashes the following commits:
      
      7528f0e [Tarek Auel] [SPARK-9160][SQL] codegen encode, decode
    • [SPARK-9159][SQL] codegen ascii, base64, unbase64 · c9db8eaa
      Tarek Auel authored
      Jira: https://issues.apache.org/jira/browse/SPARK-9159
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7542 from tarekauel/SPARK-9159 and squashes the following commits:
      
      772e6bc [Tarek Auel] [SPARK-9159][SQL] codegen ascii, base64, unbase64
    • [SPARK-9155][SQL] codegen StringSpace · 4863c11e
      Tarek Auel authored
      Jira https://issues.apache.org/jira/browse/SPARK-9155
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7531 from tarekauel/SPARK-9155 and squashes the following commits:
      
      423c426 [Tarek Auel] [SPARK-9155] language typo fix
      e34bd1b [Tarek Auel] [SPARK-9155] moved creation of blank string to UTF8String
      4bc33e6 [Tarek Auel] [SPARK-9155] codegen StringSpace
    • [SPARK-6910] [SQL] Support for pushing predicates down to metastore for partition pruning · dde0e12f
      Cheng Lian authored
      This PR forks PR #7421 authored by piaozhexiu and adds [a workaround] [1] to fix the occasional test failures that occurred in PR #7421. Please refer to these [two] [2] [comments] [3] for details.
      
      [1]: https://github.com/liancheng/spark/commit/536ac41a7e6b2abeb1f6ec1a6491bbf09ed3e591
      [2]: https://github.com/apache/spark/pull/7421#issuecomment-122527391
      [3]: https://github.com/apache/spark/pull/7421#issuecomment-122528059
      
      Author: Cheolsoo Park <cheolsoop@netflix.com>
      Author: Cheng Lian <lian@databricks.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #7492 from liancheng/pr-7421-workaround and squashes the following commits:
      
      5599cc4 [Cheolsoo Park] Predicate pushdown to hive metastore
      536ac41 [Cheng Lian] Sets hive.metastore.integral.jdo.pushdown to true to workaround test failures caused by in #7421
    • [SPARK-9114] [SQL] [PySpark] convert returned object from UDF into internal type · 9f913c4f
      Davies Liu authored
      This PR also removes the duplicated code between registerFunction and UserDefinedFunction.
      
      cc JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7450 from davies/fix_return_type and squashes the following commits:
      
      e80bf9f [Davies Liu] remove debugging code
      f94b1f6 [Davies Liu] fix mima
      8f9c58b [Davies Liu] convert returned object from UDF into internal type
    • [SPARK-9101] [PySpark] Add missing NullType · 02181fb6
      Mateusz Buśkiewicz authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9101
      
      Author: Mateusz Buśkiewicz <mateusz.buskiewicz@getbase.com>
      
      Closes #7499 from sixers/spark-9101 and squashes the following commits:
      
      dd75aa6 [Mateusz Buśkiewicz] [SPARK-9101] [PySpark] Test for selecting null literal
      97e3f2f [Mateusz Buśkiewicz] [SPARK-9101] [PySpark] Add missing NullType to _atomic_types in pyspark.sql.types
    • [SPARK-8103][core] DAGScheduler should not submit multiple concurrent attempts for a stage · 80e2568b
      Imran Rashid authored
      https://issues.apache.org/jira/browse/SPARK-8103
      
      cc kayousterhout (thanks for the extra test case)
      
      Author: Imran Rashid <irashid@cloudera.com>
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      Author: Imran Rashid <squito@users.noreply.github.com>
      
      Closes #6750 from squito/SPARK-8103 and squashes the following commits:
      
      fb3acfc [Imran Rashid] fix log msg
      e01b7aa [Imran Rashid] fix some comments, style
      584acd4 [Imran Rashid] simplify going from taskId to taskSetMgr
      e43ac25 [Imran Rashid] Merge branch 'master' into SPARK-8103
      6bc23af [Imran Rashid] update log msg
      4470fa1 [Imran Rashid] rename
      c04707e [Imran Rashid] style
      88b61cc [Imran Rashid] add tests to make sure that TaskSchedulerImpl schedules correctly with zombie attempts
      d7f1ef2 [Imran Rashid] get rid of activeTaskSets
      a21c8b5 [Imran Rashid] Merge branch 'master' into SPARK-8103
      906d626 [Imran Rashid] fix merge
      109900e [Imran Rashid] Merge branch 'master' into SPARK-8103
      c0d4d90 [Imran Rashid] Revert "Index active task sets by stage Id rather than by task set id"
      f025154 [Imran Rashid] Merge pull request #2 from kayousterhout/imran_SPARK-8103
      baf46e1 [Kay Ousterhout] Index active task sets by stage Id rather than by task set id
      19685bb [Imran Rashid] switch to using latestInfo.attemptId, and add comments
      a5f7c8c [Imran Rashid] remove comment for reviewers
      227b40d [Imran Rashid] style
      517b6e5 [Imran Rashid] get rid of SparkIllegalStateException
      b2faef5 [Imran Rashid] faster check for conflicting task sets
      6542b42 [Imran Rashid] remove extra stageAttemptId
      ada7726 [Imran Rashid] reviewer feedback
      d8eb202 [Imran Rashid] Merge branch 'master' into SPARK-8103
      46bc26a [Imran Rashid] more cleanup of debug garbage
      cb245da [Imran Rashid] finally found the issue ... clean up debug stuff
      8c29707 [Imran Rashid] Merge branch 'master' into SPARK-8103
      89a59b6 [Imran Rashid] more printlns ...
      9601b47 [Imran Rashid] more debug printlns
      ecb4e7d [Imran Rashid] debugging printlns
      b6bc248 [Imran Rashid] style
      55f4a94 [Imran Rashid] get rid of more random test case since kays tests are clearer
      7021d28 [Imran Rashid] update test since listenerBus.waitUntilEmpty now throws an exception instead of returning a boolean
      883fe49 [Kay Ousterhout] Unit tests for concurrent stages issue
      6e14683 [Imran Rashid] unit test just to make sure we fail fast on concurrent attempts
      06a0af6 [Imran Rashid] ignore for jenkins
      c443def [Imran Rashid] better fix and simpler test case
      28d70aa [Imran Rashid] wip on getting a better test case ...
      a9bf31f [Imran Rashid] wip
    • [SQL] Remove space from DataFrame Scala/Java API. · c6fe9b4a
      Reynold Xin authored
      I don't think this function is useful at all in Scala/Java, since users can easily compute n * space.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7530 from rxin/remove-space and squashes the following commits:
      
      c147873 [Reynold Xin] [SQL] Remove space from DataFrame Scala/Java API.
    • [SPARK-9186][SQL] make deterministic describing the tree rather than the expression · 04db58ae
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7525 from cloud-fan/deterministic and squashes the following commits:
      
      4189bfa [Wenchen Fan] make deterministic describing the tree rather than the expression
    • [SPARK-9177][SQL] Reuse of calendar object in WeekOfYear · a15ecd05
      Tarek Auel authored
      https://issues.apache.org/jira/browse/SPARK-9177
      
      rxin Are we sure that this is thread safe? chenghao-intel explained in another PR that every partition (if I remember correctly) uses one expression instance. This instance isn't used by multiple threads, is it? If not, we are fine.
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7516 from tarekauel/SPARK-9177 and squashes the following commits:
      
      0c1313a [Tarek Auel] [SPARK-9177] utilize more powerful addMutableState
      6e2f03f [Tarek Auel] Merge branch 'master' into SPARK-9177
      a69ec92 [Tarek Auel] [SPARK-9177] address comment
      6cfb180 [Tarek Auel] [SPARK-9177] calendar as lazy transient val
      ff97b09 [Tarek Auel] [SPARK-9177] Reuse calendar object in interpreted code and codegen
    • [SPARK-9153][SQL] codegen StringLPad/StringRPad · 5112b7f5
      Tarek Auel authored
      Jira: https://issues.apache.org/jira/browse/SPARK-9153
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7527 from tarekauel/SPARK-9153 and squashes the following commits:
      
      3840c6b [Tarek Auel] [SPARK-9153] removed codegen fallback
      92b6a5d [Tarek Auel] [SPARK-9153] codegen lpad/rpad
    • [SPARK-8996] [MLLIB] [PYSPARK] Python API for Kolmogorov-Smirnov Test · d0b4e93f
      MechCoder authored
      Python API for the KS-test
      
      Statistics.kolmogorovSmirnovTest(data, distName, *params)
      I'm not quite sure how to support the callable function since it is not serializable.
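
      For readers unfamiliar with the test, here is a rough pure-Python sketch of a one-sample KS statistic with the same call shape (`data, distName, *params`). Looking the distribution up by name is one way to sidestep passing a callable. The `CDFS` table, the `"norm"` key, and the function names are assumptions for illustration, and no p-value is computed.

```python
import math


def normal_cdf(x, mean=0.0, std=1.0):
    """CDF of the normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


# Distributions are looked up by name, so no callable has to be serialized.
CDFS = {"norm": normal_cdf}


def ks_statistic(data, dist_name, *params):
    """Largest gap between the empirical CDF and the named theoretical CDF."""
    cdf = CDFS[dist_name]
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x, *params)
        # Compare against the empirical CDF just after and just before x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


stat = ks_statistic([-1.0, 0.0, 1.0], "norm")
```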
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #7430 from MechCoder/spark-8996 and squashes the following commits:
      
      2dd009d [MechCoder] minor
      021d233 [MechCoder] Remove one wrapper and other minor stuff
      49d07ab [MechCoder] [SPARK-8996] [MLlib] Python API for Kolmogorov-Smirnov Test
    • [SPARK-7422] [MLLIB] Add argmax to Vector, SparseVector · 3f7de7db
      George Dittmar authored
      Modifies Vector, DenseVector, and SparseVector to implement argmax functionality. This work sets the stage for the changes planned in SPARK-7423.
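
      The subtle part of argmax on a sparse vector is that positions without a stored value are implicit zeros, which can beat negative stored values. The following Python sketch shows one way to handle that. It assumes sorted indices, returns -1 for an empty vector, and breaks ties toward the smallest index. These conventions are chosen for illustration and may differ from the merged implementation.

```python
def sparse_argmax(size, indices, values):
    """Index of the first maximal element of a sparse vector.

    `indices` are assumed to be sorted ascending; positions not listed
    hold an implicit 0.0.
    """
    if size == 0:
        return -1  # assumed convention for an empty vector
    best_idx, best_val = None, None
    for idx, val in zip(indices, values):
        if best_val is None or val > best_val:
            best_idx, best_val = idx, val
    has_implicit_zero = len(indices) < size
    if best_val is None or (has_implicit_zero and best_val <= 0.0):
        # An implicit zero is at least as large as every stored value.
        stored = set(indices)
        first_zero = next(i for i in range(size) if i not in stored)
        if best_val is None or best_val < 0.0 or first_zero < best_idx:
            return first_zero
    return best_idx


all_negative = sparse_argmax(5, [1, 3], [-1.0, -2.0])  # implicit zero wins
positive = sparse_argmax(4, [1, 3], [2.0, 5.0])
```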
      
      Author: George Dittmar <georgedittmar@gmail.com>
      Author: George <dittmar@Georges-MacBook-Pro.local>
      Author: dittmarg <george.dittmar@webtrends.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6112 from GeorgeDittmar/SPARK-7422 and squashes the following commits:
      
      3e0a939 [George Dittmar] Merge pull request #1 from mengxr/SPARK-7422
      127dec5 [Xiangrui Meng] update argmax impl
      2ea6a55 [George Dittmar] Added MimaExcludes for Vectors.argmax
      98058f4 [George Dittmar] Merge branch 'master' of github.com:apache/spark into SPARK-7422
      5fd9380 [George Dittmar] fixing style check error
      42341fb [George Dittmar] refactoring arg max check to better handle zero values
      b22af46 [George Dittmar] Fixing spaces between commas in unit test
      f2eba2f [George Dittmar] Cleaning up unit tests to be fewer lines
      aa330e3 [George Dittmar] Fixing some last if else spacing issues
      ac53c55 [George Dittmar] changing dense vector argmax unit test to be one line call vs 2
      d5b5423 [George Dittmar] Fixing code style and updating if logic on when to check for zero values
      ee1a85a [George Dittmar] Cleaning up unit tests a bit and modifying a few cases
      3ee8711 [George Dittmar] Fixing corner case issue with zeros in the active values of the sparse vector. Updated unit tests
      b1f059f [George Dittmar] Added comment before we start arg max calculation. Updated unit tests to cover corner cases
      f21dcce [George Dittmar] commit
      af17981 [dittmarg] Initial work fixing bug that was made clear in pr
      eeda560 [George] Fixing SparseVector argmax function to ignore zero values while doing the calculation.
      4526acc [George] Merge branch 'master' of github.com:apache/spark into SPARK-7422
      df9538a [George] Added argmax to sparse vector and added unit test
      3cffed4 [George] Adding unit tests for argmax functions for Dense and Sparse vectors
      04677af [George] initial work on adding argmax to Vector and SparseVector
      3f7de7db
    • Josh Rosen's avatar
      [SPARK-9023] [SQL] Efficiency improvements for UnsafeRows in Exchange · 79ec0729
      Josh Rosen authored
      This pull request aims to improve the performance of SQL's Exchange operator when shuffling UnsafeRows.  It also makes several general efficiency improvements to Exchange.
      
      Key changes:
      
      - When performing hash partitioning, the old Exchange projected the partitioning columns into a new row and then passed a `(partitioningColumRow: InternalRow, row: InternalRow)` pair into the shuffle. This is very inefficient because it redundantly serializes the partitioning columns only to discard them immediately after the shuffle. After this patch's changes, Exchange shuffles `(partitionId: Int, row: InternalRow)` pairs instead. This still isn't optimal, since we're still shuffling extra data that we don't need, but it's significantly more efficient than the old implementation. In the future, we may be able to optimize this further once we implement a new shuffle write interface that accepts non-key-value-pair inputs.
      - Exchange's `compute()` method has been significantly simplified; the new code has less duplication and thus is easier to understand.
      - When the Exchange's input operator produces UnsafeRows, Exchange will use a specialized `UnsafeRowSerializer` to serialize these rows.  This serializer is significantly more efficient since it simply copies the UnsafeRow's underlying bytes.  Note that this approach does not work for UnsafeRows that use the ObjectPool mechanism; I did not add support for this because we are planning to remove ObjectPool in the next few weeks.
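
      The first change above can be sketched in Python as follows. This only illustrates the idea and is not the Exchange code: rows are modelled as plain tuples, and `old_shuffle_pairs`, `new_shuffle_pairs`, and the use of Python's built-in `hash` are assumptions made for the example.

```python
def old_shuffle_pairs(rows, key_columns):
    """Old shape: (partitioning-column row, row).

    The key columns travel twice, once in the key and once in the row.
    """
    return [(tuple(row[c] for c in key_columns), row) for row in rows]


def new_shuffle_pairs(rows, key_columns, num_partitions):
    """New shape: (partition id, row).

    The partition id is computed before the shuffle, so only one small
    integer accompanies each row instead of a copy of its key columns.
    """
    pairs = []
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        # Python's % with a positive modulus never returns a negative value.
        partition_id = hash(key) % num_partitions
        pairs.append((partition_id, row))
    return pairs


# Hypothetical usage: partition three rows on their first column.
rows = [(1, "a"), (2, "b"), (1, "c")]
pairs = new_shuffle_pairs(rows, key_columns=[0], num_partitions=4)
```

      Rows with equal partitioning columns still land in the same partition, which is all the shuffle needs from the key.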
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7456 from JoshRosen/unsafe-exchange and squashes the following commits:
      
      7e75259 [Josh Rosen] Fix cast in SparkSqlSerializer2Suite
      0082515 [Josh Rosen] Some additional comments + small cleanup to remove an unused parameter
      a27cfc1 [Josh Rosen] Add missing newline
      741973c [Josh Rosen] Add simple test of UnsafeRow shuffling in Exchange.
      359c6a4 [Josh Rosen] Remove println() and add comments
      93904e7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-exchange
      8dd3ff2 [Josh Rosen] Exchange outputs UnsafeRows when its child outputs them
      dd9c66d [Josh Rosen] Fix for copying logic
      035af21 [Josh Rosen] Add logic for choosing when to use UnsafeRowSerializer
      7876f31 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-shuffle
      cbea80b [Josh Rosen] Add UnsafeRowSerializer
      0f2ac86 [Josh Rosen] Import ordering
      3ca8515 [Josh Rosen] Big code simplification in Exchange
      3526868 [Josh Rosen] Iniitial cut at removing shuffle on KV pairs
      79ec0729
    • Jacky Li's avatar
      [SQL][DOC] Minor document fix in HadoopFsRelationProvider · 972d8900
      Jacky Li authored
      Caught this while reading the code.
      
      Author: Jacky Li <lee.unreal@gmail.com>
      Author: Jacky Li <jackylk@users.noreply.github.com>
      
      Closes #7524 from jackylk/patch-11 and squashes the following commits:
      
      b679011 [Jacky Li] fix doc
      e10e211 [Jacky Li] [SQL] Minor document fix in HadoopFsRelationProvider
      972d8900
    • Reynold Xin's avatar
      5bdf16da
    • Wenchen Fan's avatar
      [SPARK-9185][SQL] improve code gen for mutable states to support complex initialization · 930253e0
      Wenchen Fan authored
      Sometimes we need more than one step to initialize the mutable states in code gen, as in https://github.com/apache/spark/pull/7516
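
      The idea can be sketched with a toy code generator in Python. It is not Spark's code gen context: the `CodeGenContext` class, its method names, and the generated Java strings are hypothetical and only show why registering a block of initialization code is more flexible than registering a single initial-value expression.

```python
class CodeGenContext:
    """Toy context that collects mutable state for a generated class."""

    def __init__(self):
        # Each entry is (java_type, variable_name, init_code).
        self.mutable_states = []

    def add_mutable_state(self, java_type, name, init_code):
        """Register a field together with the statements that initialize it.

        `init_code` may hold several statements, which a single
        initial-value expression could not express.
        """
        self.mutable_states.append((java_type, name, init_code))

    def declarations(self):
        """Field declarations for the generated class body."""
        return "\n".join(f"private {t} {n};" for t, n, _ in self.mutable_states)

    def initializations(self):
        """Statements to place in the generated constructor."""
        return "\n".join(code for _, _, code in self.mutable_states)


# Hypothetical usage: a calendar that needs two steps to set up.
ctx = CodeGenContext()
ctx.add_mutable_state(
    "java.util.Calendar",
    "cal",
    'cal = java.util.Calendar.getInstance(java.util.TimeZone.getTimeZone("UTC"));\n'
    "cal.setFirstDayOfWeek(java.util.Calendar.MONDAY);",
)
```

      Keeping declarations and initialization statements separate lets the generator emit the fields once and run any number of setup steps in the constructor.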
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7521 from cloud-fan/init and squashes the following commits:
      
      2106445 [Wenchen Fan] improve code gen for mutable states
      930253e0
  3. Jul 19, 2015