  1. Jul 20, 2015
• [SPARK-9164] [SQL] codegen hex/unhex · 936a96cb
      Tarek Auel authored
      Jira: https://issues.apache.org/jira/browse/SPARK-9164
      
The diff looks heavy, but I just moved the `hex` and `unhex` methods to `object Hex`. This allows me to call them from both `eval` and `codeGen`.
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7548 from tarekauel/SPARK-9164 and squashes the following commits:
      
      dd91c57 [Tarek Auel] [SPARK-9164][SQL] codegen hex/unhex
      936a96cb
• [SPARK-9142][SQL] Removing unnecessary self types in expressions. · e90543e5
      Reynold Xin authored
      Also added documentation to expressions to explain the important traits and abstract classes.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7550 from rxin/remove-self-types and squashes the following commits:
      
      b2a3ec1 [Reynold Xin] [SPARK-9142][SQL] Removing unnecessary self types in expressions.
      e90543e5
• [SPARK-9156][SQL] codegen StringSplit · 6853ac7c
      Tarek Auel authored
      Jira: https://issues.apache.org/jira/browse/SPARK-9156
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7547 from tarekauel/SPARK-9156 and squashes the following commits:
      
      0be2700 [Tarek Auel] [SPARK-9156][SQL] indention fix
      b860eaf [Tarek Auel] [SPARK-9156][SQL] codegen StringSplit
      5ad6a1f [Tarek Auel] [SPARK-9156] codegen StringSplit
      6853ac7c
• [SPARK-9178][SQL] Add an empty string constant to UTF8String · 047ccc8c
      Tarek Auel authored
      Jira: https://issues.apache.org/jira/browse/SPARK-9178
      
In order to avoid calls of `UTF8String.fromString("")`, this PR adds an `EMPTY_STRING` constant to `UTF8String`. Since a `UTF8String` is immutable, we can safely share a single constant.
      
      I searched for current usage of `UTF8String.fromString("")` with
      `grep -R  "UTF8String.fromString(\"\")" .`
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7509 from tarekauel/SPARK-9178 and squashes the following commits:
      
      8d6c405 [Tarek Auel] [SPARK-9178] revert intellij indents
      3627b80 [Tarek Auel] [SPARK-9178] revert concat tests changes
      3f5fbf5 [Tarek Auel] [SPARK-9178] rebase and add final to UTF8String.EMPTY_UTF8
      47cda68 [Tarek Auel] Merge branch 'master' into SPARK-9178
      4a37344 [Tarek Auel] [SPARK-9178] changed name to EMPTY_UTF8, added tests
      748b87a [Tarek Auel] [SPARK-9178] Add empty string constant to UTF8String
      047ccc8c
• [SPARK-9187] [WEBUI] Timeline view may show negative value for running tasks · 66bb8003
      Carson Wang authored
For running tasks, the executorRunTime metric is 0, which causes a negative executorComputingTime in the timeline. It also causes an incorrect SchedulerDelay time.
      ![timelinenegativevalue](https://cloud.githubusercontent.com/assets/9278199/8770953/f4362378-2eec-11e5-81e6-a06a07c04794.png)
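For running tasks the executorRunTime metric has no final value yet, so any formula that subtracts it from wall-clock quantities can go negative. A minimal sketch of the fix's direction, with hypothetical names (not the actual Spark UI code): derive the run time of an in-flight task from its launch time instead of from the zero-valued metric.

```python
import time

# Hypothetical helper, illustration only: a non-negative run time for a
# task that may still be running.
def estimated_run_time_ms(launch_time_ms, finish_time_ms=None, executor_run_time_ms=0):
    if finish_time_ms is None:  # still running: the metric is 0, use the wall clock
        return max(0, int(time.time() * 1000) - launch_time_ms)
    return executor_run_time_ms  # finished: the reported metric is valid
```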
      
      Author: Carson Wang <carson.wang@intel.com>
      
      Closes #7526 from carsonwang/timeline-negValue and squashes the following commits:
      
      7b17db2 [Carson Wang] Fix negative value in timeline view
      66bb8003
• [SPARK-9175] [MLLIB] BLAS.gemm fails to update matrix C when alpha==0 and beta!=1 · ff3c72db
      Meihua Wu authored
Fix BLAS.gemm to update matrix C when alpha==0 and beta!=1.
Also include unit tests to verify the fix.
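For reference, gemm computes C := alpha * A * B + beta * C, so alpha == 0 must still scale C by beta rather than leave C untouched. A plain-Python sketch of the corrected control flow (toy row-major dense matrices, illustration only, not the MLlib code):

```python
# C := alpha * A * B + beta * C for A (m x k), B (k x n), C (m x n),
# each stored as a list of row lists.
def gemm(alpha, A, B, beta, C):
    m, k, n = len(A), len(B), len(B[0])
    if alpha == 0.0:
        # The bug: returning here without touching C skips the required
        # scaling. Even with alpha == 0, C must still become beta * C.
        for i in range(m):
            for j in range(n):
                C[i][j] *= beta
        return
    for i in range(m):
        for j in range(n):
            s = sum(A[i][p] * B[p][j] for p in range(k))
            C[i][j] = alpha * s + beta * C[i][j]
```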
      
      mengxr brkyvz
      
      Author: Meihua Wu <meihuawu@umich.edu>
      
      Closes #7503 from rotationsymmetry/fix_BLAS_gemm and squashes the following commits:
      
      fce199c [Meihua Wu] Fix BLAS.gemm to update C when alpha==0 and beta!=1
      ff3c72db
• [SPARK-9198] [MLLIB] [PYTHON] Fixed typo in pyspark sparsevector doc tests · a5d05819
      Joseph K. Bradley authored
Several places in the PySpark SparseVector docs define one as:
      ```
      SparseVector(4, [2, 4], [1.0, 2.0])
      ```
      The index 4 goes out of bounds (but this is not checked).
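A corrected definition keeps every index strictly below the size, e.g. (hedged, not necessarily the exact doctest used in the fix):

```python
from pyspark.mllib.linalg import SparseVector

# Size 4 allows indices 0..3 only; the buggy doctest's [2, 4] is out of bounds.
v = SparseVector(4, [1, 3], [1.0, 2.0])
print(v.toArray())  # [0.0, 1.0, 0.0, 2.0]
```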
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #7541 from jkbradley/sparsevec-doc-typo-fix and squashes the following commits:
      
      c806a65 [Joseph K. Bradley] fixed doc test
      e2dcb23 [Joseph K. Bradley] Fixed typo in pyspark sparsevector doc tests
      a5d05819
• [SPARK-8125] [SQL] Accelerates Parquet schema merging and partition discovery · a1064df0
      Cheng Lian authored
      This PR tries to accelerate Parquet schema discovery and `HadoopFsRelation` partition discovery.  The acceleration is done by the following means:
      
      - Turning off schema merging by default
      
        Schema merging is not the most common case, but requires reading footers of all Parquet part-files and can be very slow.
      
      - Avoiding `FileSystem.globStatus()` call when possible
      
`FileSystem.globStatus()` may issue multiple synchronous RPC calls, and can be very slow (esp. on S3).  This PR adds `SparkHadoopUtil.globPathIfNecessary()`, which only issues RPC calls when the path contains glob-pattern-specific character(s) (`{}[]*?\`).
      
        This is especially useful when converting a metastore Parquet table with lots of partitions, since Spark SQL adds all partition directories as the input paths, and currently we do a `globStatus` call on each input path sequentially.
      
      - Listing leaf files in parallel when the number of input paths exceeds a threshold
      
Listing leaf files is required by partition discovery.  Currently it is done on the driver side, and can be slow when there are lots of (nested) directories, since each `FileSystem.listStatus()` call issues an RPC.  In this PR, we list leaf files in BFS style, and resort to a Spark job once we find that the number of directories to be listed exceeds a threshold.

  The threshold is controlled by the `SQLConf` option `spark.sql.sources.parallelPartitionDiscovery.threshold`, which defaults to 32 (see the configuration sketch after this list).
      
      - Discovering Parquet schema in parallel
      
Currently, schema merging is also done on the driver side, and needs to read footers of all part-files.  This PR uses a Spark job to do schema merging.  Together with task-side metadata reading in Parquet 1.7.0, we never read any footers on the driver side now.
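A hedged sketch of how these knobs are exercised from PySpark (the option and conf key are the ones named above; the exact API surface of your Spark version may differ):

```python
# Assumes an existing SQLContext named sqlContext and a hypothetical path.
# Schema merging is now off by default; re-enable it per read when needed.
df = sqlContext.read.option("mergeSchema", "true").parquet("/path/to/parquet_table")

# Raise the parallel-listing threshold (default 32) if driver-side BFS
# listing should handle more directories before a Spark job kicks in.
sqlContext.setConf("spark.sql.sources.parallelPartitionDiscovery.threshold", "64")
```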
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7396 from liancheng/accel-parquet and squashes the following commits:
      
      5598efc [Cheng Lian] Uses ParquetInputFormat[InternalRow] instead of ParquetInputFormat[Row]
      ff32cd0 [Cheng Lian] Excludes directories while listing leaf files
      3c580f1 [Cheng Lian] Fixes test failure caused by making "mergeSchema" default to "false"
      b1646aa [Cheng Lian] Should allow empty input paths
      32e5f0d [Cheng Lian] Moves schema merging to executor side
      a1064df0
• [SPARK-9160][SQL] codegen encode, decode · dac7dbf5
      Tarek Auel authored
      Jira: https://issues.apache.org/jira/browse/SPARK-9160
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7543 from tarekauel/SPARK-9160 and squashes the following commits:
      
      7528f0e [Tarek Auel] [SPARK-9160][SQL] codegen encode, decode
      dac7dbf5
• [SPARK-9159][SQL] codegen ascii, base64, unbase64 · c9db8eaa
      Tarek Auel authored
      Jira: https://issues.apache.org/jira/browse/SPARK-9159
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7542 from tarekauel/SPARK-9159 and squashes the following commits:
      
      772e6bc [Tarek Auel] [SPARK-9159][SQL] codegen ascii, base64, unbase64
      c9db8eaa
• [SPARK-9155][SQL] codegen StringSpace · 4863c11e
      Tarek Auel authored
      Jira https://issues.apache.org/jira/browse/SPARK-9155
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7531 from tarekauel/SPARK-9155 and squashes the following commits:
      
      423c426 [Tarek Auel] [SPARK-9155] language typo fix
      e34bd1b [Tarek Auel] [SPARK-9155] moved creation of blank string to UTF8String
      4bc33e6 [Tarek Auel] [SPARK-9155] codegen StringSpace
      4863c11e
• [SPARK-6910] [SQL] Support for pushing predicates down to metastore for partition pruning · dde0e12f
      Cheng Lian authored
This PR forks PR #7421 authored by piaozhexiu and adds [a workaround] [1] for the occasional test failures that occurred in PR #7421. Please refer to these [two] [2] [comments] [3] for details.
      
      [1]: https://github.com/liancheng/spark/commit/536ac41a7e6b2abeb1f6ec1a6491bbf09ed3e591
      [2]: https://github.com/apache/spark/pull/7421#issuecomment-122527391
      [3]: https://github.com/apache/spark/pull/7421#issuecomment-122528059
      
      Author: Cheolsoo Park <cheolsoop@netflix.com>
      Author: Cheng Lian <lian@databricks.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #7492 from liancheng/pr-7421-workaround and squashes the following commits:
      
      5599cc4 [Cheolsoo Park] Predicate pushdown to hive metastore
      536ac41 [Cheng Lian] Sets hive.metastore.integral.jdo.pushdown to true to workaround test failures caused by in #7421
      dde0e12f
• [SPARK-9114] [SQL] [PySpark] convert returned object from UDF into internal type · 9f913c4f
      Davies Liu authored
This PR also removes the duplicated code between registerFunction and UserDefinedFunction.
      
      cc JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7450 from davies/fix_return_type and squashes the following commits:
      
      e80bf9f [Davies Liu] remove debugging code
      f94b1f6 [Davies Liu] fix mima
      8f9c58b [Davies Liu] convert returned object from UDF into internal type
      9f913c4f
• [SPARK-9101] [PySpark] Add missing NullType · 02181fb6
      Mateusz Buśkiewicz authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9101
      
      Author: Mateusz Buśkiewicz <mateusz.buskiewicz@getbase.com>
      
      Closes #7499 from sixers/spark-9101 and squashes the following commits:
      
      dd75aa6 [Mateusz Buśkiewicz] [SPARK-9101] [PySpark] Test for selecting null literal
      97e3f2f [Mateusz Buśkiewicz] [SPARK-9101] [PySpark] Add missing NullType to _atomic_types in pyspark.sql.types
      02181fb6
• [SPARK-8103][core] DAGScheduler should not submit multiple concurrent attempts for a stage · 80e2568b
      Imran Rashid authored
      https://issues.apache.org/jira/browse/SPARK-8103
      
      cc kayousterhout (thanks for the extra test case)
      
      Author: Imran Rashid <irashid@cloudera.com>
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      Author: Imran Rashid <squito@users.noreply.github.com>
      
      Closes #6750 from squito/SPARK-8103 and squashes the following commits:
      
      fb3acfc [Imran Rashid] fix log msg
      e01b7aa [Imran Rashid] fix some comments, style
      584acd4 [Imran Rashid] simplify going from taskId to taskSetMgr
      e43ac25 [Imran Rashid] Merge branch 'master' into SPARK-8103
      6bc23af [Imran Rashid] update log msg
      4470fa1 [Imran Rashid] rename
      c04707e [Imran Rashid] style
      88b61cc [Imran Rashid] add tests to make sure that TaskSchedulerImpl schedules correctly with zombie attempts
      d7f1ef2 [Imran Rashid] get rid of activeTaskSets
      a21c8b5 [Imran Rashid] Merge branch 'master' into SPARK-8103
      906d626 [Imran Rashid] fix merge
      109900e [Imran Rashid] Merge branch 'master' into SPARK-8103
      c0d4d90 [Imran Rashid] Revert "Index active task sets by stage Id rather than by task set id"
      f025154 [Imran Rashid] Merge pull request #2 from kayousterhout/imran_SPARK-8103
      baf46e1 [Kay Ousterhout] Index active task sets by stage Id rather than by task set id
      19685bb [Imran Rashid] switch to using latestInfo.attemptId, and add comments
      a5f7c8c [Imran Rashid] remove comment for reviewers
      227b40d [Imran Rashid] style
      517b6e5 [Imran Rashid] get rid of SparkIllegalStateException
      b2faef5 [Imran Rashid] faster check for conflicting task sets
      6542b42 [Imran Rashid] remove extra stageAttemptId
      ada7726 [Imran Rashid] reviewer feedback
      d8eb202 [Imran Rashid] Merge branch 'master' into SPARK-8103
      46bc26a [Imran Rashid] more cleanup of debug garbage
      cb245da [Imran Rashid] finally found the issue ... clean up debug stuff
      8c29707 [Imran Rashid] Merge branch 'master' into SPARK-8103
      89a59b6 [Imran Rashid] more printlns ...
      9601b47 [Imran Rashid] more debug printlns
      ecb4e7d [Imran Rashid] debugging printlns
      b6bc248 [Imran Rashid] style
      55f4a94 [Imran Rashid] get rid of more random test case since kays tests are clearer
      7021d28 [Imran Rashid] update test since listenerBus.waitUntilEmpty now throws an exception instead of returning a boolean
      883fe49 [Kay Ousterhout] Unit tests for concurrent stages issue
      6e14683 [Imran Rashid] unit test just to make sure we fail fast on concurrent attempts
      06a0af6 [Imran Rashid] ignore for jenkins
      c443def [Imran Rashid] better fix and simpler test case
      28d70aa [Imran Rashid] wip on getting a better test case ...
      a9bf31f [Imran Rashid] wip
      80e2568b
• [SQL] Remove space from DataFrame Scala/Java API. · c6fe9b4a
      Reynold Xin authored
I don't think this function is useful at all in Scala/Java, since users can easily compute n * space themselves.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7530 from rxin/remove-space and squashes the following commits:
      
      c147873 [Reynold Xin] [SQL] Remove space from DataFrame Scala/Java API.
      c6fe9b4a
• [SPARK-9186][SQL] make deterministic describing the tree rather than the expression · 04db58ae
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7525 from cloud-fan/deterministic and squashes the following commits:
      
      4189bfa [Wenchen Fan] make deterministic describing the tree rather than the expression
      04db58ae
• [SPARK-9177][SQL] Reuse of calendar object in WeekOfYear · a15ecd05
      Tarek Auel authored
      https://issues.apache.org/jira/browse/SPARK-9177
      
rxin Are we sure that this is thread-safe? chenghao-intel explained in another PR that (if I remember correctly) every partition uses one expression instance. This instance isn't used by multiple threads, is it? If not, we are fine.
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7516 from tarekauel/SPARK-9177 and squashes the following commits:
      
      0c1313a [Tarek Auel] [SPARK-9177] utilize more powerful addMutableState
      6e2f03f [Tarek Auel] Merge branch 'master' into SPARK-9177
      a69ec92 [Tarek Auel] [SPARK-9177] address comment
      6cfb180 [Tarek Auel] [SPARK-9177] calendar as lazy transient val
      ff97b09 [Tarek Auel] [SPARK-9177] Reuse calendar object in interpreted code and codegen
      a15ecd05
• [SPARK-9153][SQL] codegen StringLPad/StringRPad · 5112b7f5
      Tarek Auel authored
      Jira: https://issues.apache.org/jira/browse/SPARK-9153
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7527 from tarekauel/SPARK-9153 and squashes the following commits:
      
      3840c6b [Tarek Auel] [SPARK-9153] removed codegen fallback
      92b6a5d [Tarek Auel] [SPARK-9153] codegen lpad/rpad
      5112b7f5
• [SPARK-8996] [MLLIB] [PYSPARK] Python API for Kolmogorov-Smirnov Test · d0b4e93f
      MechCoder authored
      Python API for the KS-test
      
`Statistics.kolmogorovSmirnovTest(data, distName, *params)`

I'm not quite sure how to support a callable function, since it is not serializable.
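A usage sketch matching that signature (assuming an existing SparkContext `sc`; the trailing parameters are the mean and standard deviation of the reference normal distribution):

```python
from pyspark.mllib.stat import Statistics

data = sc.parallelize([0.1, 0.15, 0.2, 0.3, 0.25])
# One-sample, two-sided KS test against N(0, 1).
result = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0)
print(result.statistic, result.pValue)
```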
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #7430 from MechCoder/spark-8996 and squashes the following commits:
      
      2dd009d [MechCoder] minor
      021d233 [MechCoder] Remove one wrapper and other minor stuff
      49d07ab [MechCoder] [SPARK-8996] [MLlib] Python API for Kolmogorov-Smirnov Test
      d0b4e93f
• [SPARK-7422] [MLLIB] Add argmax to Vector, SparseVector · 3f7de7db
      George Dittmar authored
Modifying Vector, DenseVector, and SparseVector to implement argmax functionality. This work sets the stage for the changes planned in SPARK-7423.
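The sparse case has the subtlety the commit log hints at ("better handle zero values"): if every stored value is negative and the vector has implicit zeros, the argmax is an index that is not stored at all. A plain-Python sketch of that logic (tie-breaking simplified; not the actual MLlib code):

```python
def sparse_argmax(size, indices, values):
    """Index of the largest element of a sparse vector; illustration only."""
    if size == 0:
        raise ValueError("argmax of an empty vector")
    if not indices:  # nothing stored: the vector is all zeros
        return 0
    best_i, best_v = indices[0], values[0]
    for i, v in zip(indices, values):
        if v > best_v:
            best_i, best_v = i, v
    if best_v < 0 and len(indices) < size:
        # All stored values are negative but implicit zeros exist, so the
        # first unstored index holds the maximum (0.0).
        stored = set(indices)
        return next(j for j in range(size) if j not in stored)
    return best_i
```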
      
      Author: George Dittmar <georgedittmar@gmail.com>
      Author: George <dittmar@Georges-MacBook-Pro.local>
      Author: dittmarg <george.dittmar@webtrends.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6112 from GeorgeDittmar/SPARK-7422 and squashes the following commits:
      
      3e0a939 [George Dittmar] Merge pull request #1 from mengxr/SPARK-7422
      127dec5 [Xiangrui Meng] update argmax impl
      2ea6a55 [George Dittmar] Added MimaExcludes for Vectors.argmax
      98058f4 [George Dittmar] Merge branch 'master' of github.com:apache/spark into SPARK-7422
      5fd9380 [George Dittmar] fixing style check error
      42341fb [George Dittmar] refactoring arg max check to better handle zero values
      b22af46 [George Dittmar] Fixing spaces between commas in unit test
      f2eba2f [George Dittmar] Cleaning up unit tests to be fewer lines
      aa330e3 [George Dittmar] Fixing some last if else spacing issues
      ac53c55 [George Dittmar] changing dense vector argmax unit test to be one line call vs 2
      d5b5423 [George Dittmar] Fixing code style and updating if logic on when to check for zero values
      ee1a85a [George Dittmar] Cleaning up unit tests a bit and modifying a few cases
      3ee8711 [George Dittmar] Fixing corner case issue with zeros in the active values of the sparse vector. Updated unit tests
      b1f059f [George Dittmar] Added comment before we start arg max calculation. Updated unit tests to cover corner cases
      f21dcce [George Dittmar] commit
      af17981 [dittmarg] Initial work fixing bug that was made clear in pr
      eeda560 [George] Fixing SparseVector argmax function to ignore zero values while doing the calculation.
      4526acc [George] Merge branch 'master' of github.com:apache/spark into SPARK-7422
      df9538a [George] Added argmax to sparse vector and added unit test
      3cffed4 [George] Adding unit tests for argmax functions for Dense and Sparse vectors
      04677af [George] initial work on adding argmax to Vector and SparseVector
      3f7de7db
• [SPARK-9023] [SQL] Efficiency improvements for UnsafeRows in Exchange · 79ec0729
      Josh Rosen authored
      This pull request aims to improve the performance of SQL's Exchange operator when shuffling UnsafeRows.  It also makes several general efficiency improvements to Exchange.
      
      Key changes:
      
- When performing hash partitioning, the old Exchange projected the partitioning columns into a new row and then passed a `(partitioningColumnRow: InternalRow, row: InternalRow)` pair into the shuffle. This is very inefficient because it ends up redundantly serializing the partitioning columns, only to immediately discard them after the shuffle.  After this patch's changes, Exchange shuffles `(partitionId: Int, row: InternalRow)` pairs (see the sketch after this list).  This still isn't optimal, since we're still shuffling extra data that we don't need, but it's significantly more efficient than the old implementation; in the future, we may be able to optimize this further once we implement a new shuffle write interface that accepts non-key-value-pair inputs.
      - Exchange's `compute()` method has been significantly simplified; the new code has less duplication and thus is easier to understand.
      - When the Exchange's input operator produces UnsafeRows, Exchange will use a specialized `UnsafeRowSerializer` to serialize these rows.  This serializer is significantly more efficient since it simply copies the UnsafeRow's underlying bytes.  Note that this approach does not work for UnsafeRows that use the ObjectPool mechanism; I did not add support for this because we are planning to remove ObjectPool in the next few weeks.
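A conceptual sketch of the first change (hypothetical names; Spark's actual Exchange operates on InternalRow and a shuffle writer, not Python tuples): precompute the target partition id so only a small integer travels with each row, instead of a serialized projection of the key columns.

```python
# Illustration only: hash partitioning without duplicating the key columns.
def to_shuffle_pairs(rows, key_indices, num_partitions):
    for row in rows:
        key = tuple(row[i] for i in key_indices)
        partition_id = hash(key) % num_partitions
        # Old scheme: yield (key_row, row) -- the projected key row duplicates
        # columns already in the row and is thrown away after the shuffle.
        yield (partition_id, row)
```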
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7456 from JoshRosen/unsafe-exchange and squashes the following commits:
      
      7e75259 [Josh Rosen] Fix cast in SparkSqlSerializer2Suite
      0082515 [Josh Rosen] Some additional comments + small cleanup to remove an unused parameter
      a27cfc1 [Josh Rosen] Add missing newline
      741973c [Josh Rosen] Add simple test of UnsafeRow shuffling in Exchange.
      359c6a4 [Josh Rosen] Remove println() and add comments
      93904e7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-exchange
      8dd3ff2 [Josh Rosen] Exchange outputs UnsafeRows when its child outputs them
      dd9c66d [Josh Rosen] Fix for copying logic
      035af21 [Josh Rosen] Add logic for choosing when to use UnsafeRowSerializer
      7876f31 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-shuffle
      cbea80b [Josh Rosen] Add UnsafeRowSerializer
      0f2ac86 [Josh Rosen] Import ordering
      3ca8515 [Josh Rosen] Big code simplification in Exchange
      3526868 [Josh Rosen] Iniitial cut at removing shuffle on KV pairs
      79ec0729
• [SQL][DOC] Minor document fix in HadoopFsRelationProvider · 972d8900
      Jacky Li authored
Caught this while reading the code.
      
      Author: Jacky Li <lee.unreal@gmail.com>
      Author: Jacky Li <jackylk@users.noreply.github.com>
      
      Closes #7524 from jackylk/patch-11 and squashes the following commits:
      
      b679011 [Jacky Li] fix doc
      e10e211 [Jacky Li] [SQL] Minor document fix in HadoopFsRelationProvider
      972d8900
• 5bdf16da
  Reynold Xin authored
• [SPARK-9185][SQL] improve code gen for mutable states to support complex initialization · 930253e0
      Wenchen Fan authored
Sometimes we need more than one step to initialize the mutable states in code gen, as in https://github.com/apache/spark/pull/7516
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7521 from cloud-fan/init and squashes the following commits:
      
      2106445 [Wenchen Fan] improve code gen for mutable states
      930253e0
  2. Jul 19, 2015
• [SPARK-9172][SQL] Make DecimalPrecision support for Intersect and Except · d743bec6
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9172
      
Simply make `DecimalPrecision` support `Intersect` and `Except` in addition to `Union`.

Also add unit tests for `DecimalPrecision`.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #7511 from viirya/more_decimalprecieion and squashes the following commits:
      
      4d29d10 [Liang-Chi Hsieh] Fix code comment.
      9fb0d49 [Liang-Chi Hsieh] Make DecimalPrecision support for Intersect and Except.
      d743bec6
• [SPARK-9030] [STREAMING] [HOTFIX] Make sure that no attempts to create Kinesis... · 93eb2acf
      Tathagata Das authored
[SPARK-9030] [STREAMING] [HOTFIX] Make sure that no attempts to create Kinesis streams are made when not enabled
      
Problem: Even when the environment variable that enables the tests is not set, the `beforeAll` of KinesisStreamSuite attempted to find AWS credentials to create a Kinesis stream, and failed.
      
Solution: Made sure all accesses to KinesisTestUtils that create streams are under `testOrIgnore`.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #7519 from tdas/kinesis-tests and squashes the following commits:
      
      64d6d06 [Tathagata Das] Removed empty lines.
      7c18473 [Tathagata Das] Putting all access to KinesisTestUtils inside testOrIgnore
      93eb2acf
• [SPARK-8241][SQL] string function: concat_ws. · 163e3f1d
      Reynold Xin authored
      I also changed the semantics of concat w.r.t. null back to the same behavior as Hive.
      That is to say, concat now returns null if any input is null.
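To see the two behaviors side by side (a hedged PySpark/SQL sketch, assuming an existing SQLContext `sqlContext`): `concat` now propagates NULL like Hive, while `concat_ws` skips NULL inputs.

```python
row = sqlContext.sql(
    "SELECT concat('a', NULL) AS c, concat_ws(',', 'a', NULL, 'b') AS cw"
).first()
print(row.c)   # None: any NULL input nullifies concat
print(row.cw)  # 'a,b': concat_ws drops NULL inputs entirely
```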
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7504 from rxin/concat_ws and squashes the following commits:
      
      83fd950 [Reynold Xin] Fixed type casting.
      3ae85f7 [Reynold Xin] Write null better.
      cdc7be6 [Reynold Xin] Added code generation for pure string mode.
      a61c4e4 [Reynold Xin] Updated comments.
      2d51406 [Reynold Xin] [SPARK-8241][SQL] string function: concat_ws.
      163e3f1d
• [SPARK-8638] [SQL] Window Function Performance Improvements - Cleanup · 7a812453
      Herman van Hovell authored
      This PR contains a few clean-ups that are a part of SPARK-8638: a few style issues got fixed, and a few tests were moved.
      
The Git commit message is wrong, BTW :(...
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #7513 from hvanhovell/SPARK-8638-cleanup and squashes the following commits:
      
      4e69d08 [Herman van Hovell] Fixed Perfomance Regression for Shrinking Window Frames (+Rebase)
      7a812453
• [SPARK-9021] [PYSPARK] Change RDD.aggregate() to do reduce(mapPartitions())... · a803ac3e
      Nicholas Hwang authored
      [SPARK-9021] [PYSPARK] Change RDD.aggregate() to do reduce(mapPartitions()) instead of mapPartitions.fold()
      
      I'm relatively new to Spark and functional programming, so forgive me if this pull request is just a result of my misunderstanding of how Spark should be used.
      
Currently, if one happens to use a mutable object as `zeroValue` for `RDD.aggregate()`, unexpected behavior can occur.
      
      This is because pyspark's current implementation of `RDD.aggregate()` does not serialize or make a copy of `zeroValue` before handing it off to `RDD.mapPartitions(...).fold(...)`. This results in a single reference to `zeroValue` being used for both `RDD.mapPartitions()` and `RDD.fold()` on each partition. This can result in strange accumulator values being fed into each partition's call to `RDD.fold()`, as the `zeroValue` may have been changed in-place during the `RDD.mapPartitions()` call.
      
      As an illustrative example, submit the following to `spark-submit`:
      ```
      from pyspark import SparkConf, SparkContext
      import collections
      
      def updateCounter(acc, val):
          print 'update acc:', acc
          print 'update val:', val
          acc[val] += 1
          return acc
      
      def comboCounter(acc1, acc2):
          print 'combo acc1:', acc1
          print 'combo acc2:', acc2
          acc1.update(acc2)
          return acc1
      
      def main():
          conf = SparkConf().setMaster("local").setAppName("Aggregate with Counter")
          sc = SparkContext(conf = conf)
      
          print '======= AGGREGATING with ONE PARTITION ======='
          print sc.parallelize(range(1,10), 1).aggregate(collections.Counter(), updateCounter, comboCounter)
      
          print '======= AGGREGATING with TWO PARTITIONS ======='
          print sc.parallelize(range(1,10), 2).aggregate(collections.Counter(), updateCounter, comboCounter)
      
      if __name__ == "__main__":
          main()
      ```
      
      One probably expects this to output the following:
      ```
      Counter({1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1})
      ```
      
      But it instead outputs this (regardless of the number of partitions):
      ```
      Counter({1: 2, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2})
      ```
      
      This is because (I believe) `zeroValue` gets passed correctly to each partition, but after `RDD.mapPartitions()` completes, the `zeroValue` object has been updated and is then passed to `RDD.fold()`, which results in all items being double-counted within each partition before being finally reduced at the calling node.
      
      I realize that this type of calculation is typically done by `RDD.mapPartitions(...).reduceByKey(...)`, but hopefully this illustrates some potentially confusing behavior. I also noticed that other `RDD` methods use this `deepcopy` approach to creating unique copies of `zeroValue` (i.e., `RDD.aggregateByKey()` and `RDD.foldByKey()`), and that the Scala implementations do seem to serialize the `zeroValue` object appropriately to prevent this type of behavior.
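A stripped-down sketch of the fix's shape (plain Python over a list of partitions, not the actual PySpark patch): give each partition its own deep copy of `zeroValue`, then combine the collected partials with a reduce instead of folding with the same shared object.

```python
import copy
from functools import reduce

def aggregate(partitions, zero_value, seq_op, comb_op):
    def run_partition(items):
        acc = copy.deepcopy(zero_value)  # a fresh accumulator per partition
        for x in items:
            acc = seq_op(acc, x)
        return acc
    partials = [run_partition(p) for p in partitions]      # "mapPartitions"
    # "reduce" over the partials; the deep copies keep the caller's
    # zero_value pristine, so nothing is double-counted.
    return reduce(comb_op, partials, copy.deepcopy(zero_value))
```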
      
      Author: Nicholas Hwang <moogling@gmail.com>
      
      Closes #7378 from njhwang/master and squashes the following commits:
      
      659bb27 [Nicholas Hwang] Fixed RDD.aggregate() to perform a reduce operation on collected mapPartitions results, similar to how fold currently is implemented. This prevents an initial combOp being performed on each partition with zeroValue (which leads to unexpected behavior if zeroValue is a mutable object) before being combOp'ed with other partition results.
      8d8d694 [Nicholas Hwang] Changed dict construction to be compatible with Python 2.6 (cannot use list comprehensions to make dicts)
      56eb2ab [Nicholas Hwang] Fixed whitespace after colon to conform with PEP8
      391de4a [Nicholas Hwang] Removed used of collections.Counter from RDD tests for Python 2.6 compatibility; used defaultdict(int) instead. Merged treeAggregate test with mutable zero value into aggregate test to reduce code duplication.
      2fa4e4b [Nicholas Hwang] Merge branch 'master' of https://github.com/njhwang/spark
      ba528bd [Nicholas Hwang] Updated comments regarding protection of zeroValue from mutation in RDD.aggregate(). Added regression tests for aggregate(), fold(), aggregateByKey(), foldByKey(), and treeAggregate(), all with both 1 and 2 partition RDDs. Confirmed that aggregate() is the only problematic implementation as of commit 257236c3. Also replaced some parallelizations of ranges with xranges, per the documentation's recommendations of preferring xrange over range.
      7820391 [Nicholas Hwang] Updated comments regarding protection of zeroValue from mutation in RDD.aggregate(). Added regression tests for aggregate(), fold(), aggregateByKey(), foldByKey(), and treeAggregate(), all with both 1 and 2 partition RDDs. Confirmed that aggregate() is the only problematic implementation as of commit 257236c3.
      90d1544 [Nicholas Hwang] Made sure RDD.aggregate() makes a deepcopy of zeroValue for all partitions; this ensures that the mapPartitions call works with unique copies of zeroValue in each partition, and prevents a single reference to zeroValue being used for both map and fold calls on each partition (resulting in possibly unexpected behavior).
      a803ac3e
• [HOTFIX] [SQL] Fixes compilation error introduced by PR #7506 · 34ed82bb
      Cheng Lian authored
PR #7506 breaks the master build because of a compilation error. Note that #7506 itself looks good, but it seems that `git merge` did something stupid.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7510 from liancheng/hotfix-for-pr-7506 and squashes the following commits:
      
      7ea7e89 [Cheng Lian] Fixes compilation error
      34ed82bb
• [SPARK-9179] [BUILD] Allows committers to specify primary author of the PR to be merged · bc24289f
      Cheng Lian authored
It's a common case that a contributor submits an initial version of a feature/bugfix, and later on other people (mostly committers) fork it and add more improvements. When merging these PRs, we probably want to specify the original author as the primary author. Currently we can only do this by running
      
      ```
      $ git commit --amend --author="name <email>"
      ```
      
manually right before the merge script pushes to the Apache Git repo. It would be nice if the script accepted user-specified primary author information.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7508 from liancheng/spark-9179 and squashes the following commits:
      
      218d88e [Cheng Lian] Allows committers to specify primary author of the PR to be merged
      bc24289f
• [SQL] Make date/time functions more consistent with other database systems. · 3427937e
      Reynold Xin authored
      This pull request fixes some of the problems in #6981.
      
      - Added date functions to `__all__` so they get exposed
      - Rename day_of_month -> dayofmonth
      - Rename day_in_year -> dayofyear
      - Rename week_of_year -> weekofyear
      - Removed "day" from Scala/Python API since it is ambiguous. Only leaving the alias in SQL.
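A quick usage sketch of the renamed functions (hedged; assumes a DataFrame `df` with a date or timestamp column `ts`):

```python
from pyspark.sql.functions import dayofmonth, dayofyear, weekofyear

df.select(dayofmonth("ts"), dayofyear("ts"), weekofyear("ts")).show()
```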
      
      Author: Reynold Xin <rxin@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Reynold Xin <rxin@databricks.com>
      
      Closes #7506 from rxin/datetime and squashes the following commits:
      
      0cb24d9 [Reynold Xin] Export all functions in Python.
      e44a4a0 [Reynold Xin] Removed day function from Scala and Python.
      9c08fdc [Reynold Xin] [SQL] Make date/time functions more consistent with other database systems.
      3427937e
• [SPARK-8199][SQL] follow up; revert change in test · a53d13f7
      Tarek Auel authored
      rxin / davies
      
      Sorry for that unnecessary change. And thanks again for all your support!
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7505 from tarekauel/SPARK-8199-FollowUp and squashes the following commits:
      
      d09321c [Tarek Auel] [SPARK-8199] follow up; revert change in test
      c17397f [Tarek Auel] [SPARK-8199] follow up; revert change in test
      67acfe6 [Tarek Auel] [SPARK-8199] follow up; revert change in test
      a53d13f7
• [SPARK-9094] [PARENT] Increased io.dropwizard.metrics from 3.1.0 to 3.1.2 · 344d1567
      Carl Anders Düvel authored
We are running Spark 1.4.0 in production and ran into problems because, after a network hiccup (which happens often in our current environment), no more metrics were reported to Graphite, leaving us blind to the current state of our Spark applications. [This problem](https://github.com/dropwizard/metrics/commit/70559816f1fc3a0a0122b5263d5478ff07396991) was fixed in the current version of the metrics library. We now run Spark with this change in production and have seen no problems. We also had a look at the commit history since 3.1.0 and did not detect any potentially incompatible changes, but found many fixes that could help other users as well.
      
      Author: Carl Anders Düvel <c.a.duevel@gmail.com>
      
      Closes #7493 from hackbert/bump-metrics-lib-version and squashes the following commits:
      
      6677565 [Carl Anders Düvel] [SPARK-9094] [PARENT] Increased io.dropwizard.metrics from 3.1.0 to 3.1.2 in order to get this fix https://github.com/dropwizard/metrics/commit/70559816f1fc3a0a0122b5263d5478ff07396991
      344d1567
• [SPARK-9166][SQL][PYSPARK] Capture and hide IllegalArgumentException in Python API · 9b644c41
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9166
      
Simply capture and hide `IllegalArgumentException` in the Python API.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #7497 from viirya/hide_illegalargument and squashes the following commits:
      
      8324dce [Liang-Chi Hsieh] Fix python style.
      9ace67d [Liang-Chi Hsieh] Also check exception message.
      8b2ce5c [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into hide_illegalargument
      7be016a [Liang-Chi Hsieh] Capture and hide IllegalArgumentException in Python.
      9b644c41
• 89d13585
  Reynold Xin authored
• [SPARK-8638] [SQL] Window Function Performance Improvements · a9a0d0ce
      Herman van Hovell authored
      ## Description
      Performance improvements for Spark Window functions. This PR will also serve as the basis for moving away from Hive UDAFs to Spark UDAFs. See JIRA tickets SPARK-8638 and SPARK-7712 for more information.
      
      ## Improvements
* Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and UNBOUNDED FOLLOWING cases. The current implementation in Spark uses a sliding-window approach in these cases: an aggregate is maintained for every row, so space usage is N (N being the number of rows), and all these aggregates need to be updated separately, which takes N*(N-1)/2 updates. The running case differs from the sliding case because we only add data to an aggregate function (no reset is required), so we need to maintain just one aggregate (as in the UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING case), update it for each row, and read the aggregate value after each update. This is what the new implementation does (see the sketch after this list). This approach uses only one buffer and requires only N updates; I am currently working on data with window sizes of 500-1000 doing running sums, and this saves a lot of time. The CURRENT ROW AND UNBOUNDED FOLLOWING case also uses this approach, together with the fact that aggregate operations are commutative; there is one twist, though: it processes the input buffer in reverse.
      * Fewer comparisons in the sliding case. The current implementation determines frame boundaries for every input row. The new implementation makes more use of the fact that the window is sorted, maintains the boundaries, and only moves them when the current row order changes. This is a minor improvement.
      * A single Window node is able to process all types of Frames for the same Partitioning/Ordering. This saves a little time/memory spent buffering and managing partitions. This will be enabled in a follow-up PR.
      * A lot of the staging code is moved from the execution phase to the initialization phase. Minor performance improvement, and improves readability of the execution code.
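A toy sketch of the running-frame idea from the first bullet (plain Python, illustration only): one accumulator per partition and one update per row, instead of one aggregate per row.

```python
def running_sum(partition_values):
    """UNBOUNDED PRECEDING AND CURRENT ROW over one sorted partition."""
    acc = 0.0
    out = []
    for v in partition_values:  # rows already sorted within the partition
        acc += v                # exactly one update per row: N updates total
        out.append(acc)         # emit the frame's aggregate for this row
    return out

print(running_sum([3.0, 1.0, 2.0]))  # [3.0, 4.0, 6.0]
```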
      
      ## Benchmarking
I have done a small benchmark using [on-time performance](http://www.transtats.bts.gov) data for the month of April. I used the origin as a partitioning key; as a result there is quite some variation in window sizes. The code for the benchmark can be found in the JIRA ticket. These are the results per frame type:
      
      Frame | Master | SPARK-8638
      ----- | ------ | ----------
      Entire Frame | 2 s | 1 s
      Sliding | 18 s | 1 s
      Growing | 14 s | 0.9 s
      Shrinking | 13 s | 1 s
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #7057 from hvanhovell/SPARK-8638 and squashes the following commits:
      
      3bfdc49 [Herman van Hovell] Fixed Perfomance Regression for Shrinking Window Frames (+Rebase)
      2eb3b33 [Herman van Hovell] Corrected reverse range frame processing.
      2cd2d5b [Herman van Hovell] Corrected reverse range frame processing.
      b0654d7 [Herman van Hovell] Tests for exotic frame specifications.
      e75b76e [Herman van Hovell] More docs, added support for reverse sliding range frames, and some reorganization of code.
      1fdb558 [Herman van Hovell] Changed Data In HiveDataFrameWindowSuite.
      ac2f682 [Herman van Hovell] Added a few more comments.
      1938312 [Herman van Hovell] Added Documentation to the createBoundOrdering methods.
      bb020e6 [Herman van Hovell] Major overhaul of Window operator.
      a9a0d0ce
• Fixed test cases. · 04c1b49f
      Reynold Xin authored
      04c1b49f
• [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-8182][SPARK-8181][SPARK-8180][SPARK... · 83b682be
      Tarek Auel authored
      [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-8182][SPARK-8181][SPARK-8180][SPARK-8179][SPARK-8177][SPARK-8178][SPARK-9115][SQL] date functions
      
      Jira:
      https://issues.apache.org/jira/browse/SPARK-8199
      https://issues.apache.org/jira/browse/SPARK-8184
      https://issues.apache.org/jira/browse/SPARK-8183
      https://issues.apache.org/jira/browse/SPARK-8182
      https://issues.apache.org/jira/browse/SPARK-8181
      https://issues.apache.org/jira/browse/SPARK-8180
      https://issues.apache.org/jira/browse/SPARK-8179
      https://issues.apache.org/jira/browse/SPARK-8177
https://issues.apache.org/jira/browse/SPARK-8178
      https://issues.apache.org/jira/browse/SPARK-9115
      
Regarding `day` and `dayofmonth`: are both necessary?
      
      ~~I am going to add `Quarter` to this PR as well.~~ Done.
      
      ~~As soon as the Scala coding is reviewed and discussed, I'll add the python api.~~ Done
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      Author: Tarek Auel <tarek.auel@gmail.com>
      
      Closes #6981 from tarekauel/SPARK-8199 and squashes the following commits:
      
      f7b4c8c [Tarek Auel] [SPARK-8199] fixed bug in tests
      bb567b6 [Tarek Auel] [SPARK-8199] fixed test
      3e095ba [Tarek Auel] [SPARK-8199] style and timezone fix
      256c357 [Tarek Auel] [SPARK-8199] code cleanup
      5983dcc [Tarek Auel] [SPARK-8199] whitespace fix
      6e0c78f [Tarek Auel] [SPARK-8199] removed setTimeZone in tests, according to cloud-fans comment in #7488
      4afc09c [Tarek Auel] [SPARK-8199] concise leap year handling
      ea6c110 [Tarek Auel] [SPARK-8199] fix after merging master
      70238e0 [Tarek Auel] Merge branch 'master' into SPARK-8199
      3c6ae2e [Tarek Auel] [SPARK-8199] removed binary search
      fb98ba0 [Tarek Auel] [SPARK-8199] python docstring fix
      cdfae27 [Tarek Auel] [SPARK-8199] cleanup & python docstring fix
      746b80a [Tarek Auel] [SPARK-8199] build fix
      0ad6db8 [Tarek Auel] [SPARK-8199] minor fix
      523542d [Tarek Auel] [SPARK-8199] address comments
      2259299 [Tarek Auel] [SPARK-8199] day_of_month alias
      d01b977 [Tarek Auel] [SPARK-8199] python underscore
      56c4a92 [Tarek Auel] [SPARK-8199] update python docu
      e223bc0 [Tarek Auel] [SPARK-8199] refactoring
      d6aa14e [Tarek Auel] [SPARK-8199] fixed Hive compatibility
      b382267 [Tarek Auel] [SPARK-8199] fixed bug in day calculation; removed set TimeZone in HiveCompatibilitySuite for test purposes; removed Hive tests for second and minute, because we can cast '2015-03-18' to a timestamp and extract a minute/second from it
      1b2e540 [Tarek Auel] [SPARK-8119] style fix
      0852655 [Tarek Auel] [SPARK-8119] changed from ExpectsInputTypes to implicit casts
      ec87c69 [Tarek Auel] [SPARK-8119] bug fixing and refactoring
      1358cdc [Tarek Auel] Merge remote-tracking branch 'origin/master' into SPARK-8199
      740af0e [Tarek Auel] implement date function using a calculation based on days
      4fb66da [Tarek Auel] WIP: date functions on calculation only
      1a436c9 [Tarek Auel] wip
      f775f39 [Tarek Auel] fixed return type
      ad17e96 [Tarek Auel] improved implementation
      c42b444 [Tarek Auel] Removed merge conflict file
      ccb723c [Tarek Auel] [SPARK-8199] style and fixed merge issues
      10e4ad1 [Tarek Auel] Merge branch 'master' into date-functions-fast
      7d9f0eb [Tarek Auel] [SPARK-8199] git renaming issue
      f3e7a9f [Tarek Auel] [SPARK-8199] revert change in DataFrameFunctionsSuite
      6f5d95c [Tarek Auel] [SPARK-8199] fixed year interval
      d9f8ac3 [Tarek Auel] [SPARK-8199] implement fast track
      7bc9d93 [Tarek Auel] Merge branch 'master' into SPARK-8199
      5a105d9 [Tarek Auel] [SPARK-8199] rebase after #6985 got merged
      eb6760d [Tarek Auel] Merge branch 'master' into SPARK-8199
      f120415 [Tarek Auel] improved runtime
      a8edebd [Tarek Auel] use Calendar instead of SimpleDateFormat
      5fe74e1 [Tarek Auel] fixed python style
      3bfac90 [Tarek Auel] fixed style
      356df78 [Tarek Auel] rely on cast mechanism of Spark. Simplified implementation
      02efc5d [Tarek Auel] removed doubled code
      a5ea120 [Tarek Auel] added python api; changed test to be more meaningful
      b680db6 [Tarek Auel] added codegeneration to all functions
      c739788 [Tarek Auel] added support for quarter SPARK-8178
      849fb41 [Tarek Auel] fixed stupid test
      638596f [Tarek Auel] improved codegen
      4d8049b [Tarek Auel] fixed tests and added type check
      5ebb235 [Tarek Auel] resolved naming conflict
      d0e2f99 [Tarek Auel] date functions
      83b682be