  1. Jul 31, 2015
    • zsxwing's avatar
      [SPARK-8564] [STREAMING] Add the Python API for Kinesis · 3afc1de8
      zsxwing authored
      This PR adds the Python API for Kinesis, including a Python example and a simple unit test.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6955 from zsxwing/kinesis-python and squashes the following commits:
      
      e42e471 [zsxwing] Merge branch 'master' into kinesis-python
      455f7ea [zsxwing] Remove streaming_kinesis_asl_assembly module and simply add the source folder to streaming_kinesis_asl module
      32e6451 [zsxwing] Merge remote-tracking branch 'origin/master' into kinesis-python
      5082d28 [zsxwing] Fix the syntax error for Python 2.6
      fca416b [zsxwing] Fix wrong comparison
      96670ff [zsxwing] Fix the compilation error after merging master
      756a128 [zsxwing] Merge branch 'master' into kinesis-python
      6c37395 [zsxwing] Print stack trace for debug
      7c5cfb0 [zsxwing] RUN_KINESIS_TESTS -> ENABLE_KINESIS_TESTS
      cc9d071 [zsxwing] Fix the python test errors
      466b425 [zsxwing] Add python tests for Kinesis
      e33d505 [zsxwing] Merge remote-tracking branch 'origin/master' into kinesis-python
      3da2601 [zsxwing] Fix the kinesis folder
      687446b [zsxwing] Fix the error message and the maven output path
      add2beb [zsxwing] Merge branch 'master' into kinesis-python
      4957c0b [zsxwing] Add the Python API for Kinesis
      3afc1de8
    • Herman van Hovell's avatar
      [SPARK-8640] [SQL] Enable Processing of Multiple Window Frames in a Single Window Operator · 39ab199a
      Herman van Hovell authored
       This PR enables the processing of multiple window frames in a single window operator. This should improve the performance of processing multiple window expressions which share partition by/order by clauses, because it is more efficient with respect to memory use and group processing.
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #7515 from hvanhovell/SPARK-8640 and squashes the following commits:
      
      f0e1c21 [Herman van Hovell] Changed Window Logical/Physical plans to use partition by/order by specs directly instead of using WindowSpec.
      e1711c2 [Herman van Hovell] Enabled the processing of multiple window frames in a single Window operator.
      39ab199a
    • Iulian Dragos's avatar
      [SPARK-8979] Add a PID based rate estimator · 0a1d2ca4
      Iulian Dragos authored
      Based on #7600
      
      /cc tdas
      
      Author: Iulian Dragos <jaguarul@gmail.com>
      Author: François Garillot <francois@garillot.net>
      
      Closes #7648 from dragos/topic/streaming-bp/pid and squashes the following commits:
      
      aa5b097 [Iulian Dragos] Add more comments, made all PID constant parameters positive, a couple more tests.
      93b74f8 [Iulian Dragos] Better explanation of historicalError.
      7975b0c [Iulian Dragos] Add configuration for PID.
      26cfd78 [Iulian Dragos] A couple of variable renames.
      d0bdf7c [Iulian Dragos] Update to latest version of the code, various style and name improvements.
      d58b845 [François Garillot] [SPARK-8979][Streaming] Implements a PIDRateEstimator
      0a1d2ca4
    • Yanbo Liang's avatar
      [SPARK-6885] [ML] decision tree support predict class probabilities · e8bdcdea
      Yanbo Liang authored
       Add support for predicting class probabilities in decision trees.
       Implement the prediction-probability function, referring to the old DecisionTree API and the [sklearn API](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py#L593).
       DecisionTreeClassificationModel now inherits from ProbabilisticClassificationModel, predictRaw returns the raw counts vector, and raw2probabilityInPlace/predictProbability return the probabilities for each prediction.
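       
       As a rough illustration of what raw2probabilityInPlace is described to do, here is a minimal sketch (not the actual Spark implementation) that normalizes a raw per-class count vector into class probabilities; the uniform fallback is only one possible way to handle the "sum = 0" case mentioned in the commits below:
       
       ```
       // Minimal sketch, not the Spark code: turn raw per-class counts into probabilities.
       def raw2probability(rawCounts: Array[Double]): Array[Double] = {
         val total = rawCounts.sum
         if (total == 0.0) {
           // One possible handling of the "sum = 0" case noted in the commit log.
           Array.fill(rawCounts.length)(1.0 / rawCounts.length)
         } else {
           rawCounts.map(_ / total)
         }
       }
       
       // A leaf that saw 3 samples of class 0 and 1 sample of class 1:
       raw2probability(Array(3.0, 1.0))  // Array(0.75, 0.25)
       ```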
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #7694 from yanboliang/spark-6885 and squashes the following commits:
      
      08d5b7f [Yanbo Liang] fix ImpurityStats null parameters and raw2probabilityInPlace sum = 0 issue
      2174278 [Yanbo Liang] solve merge conflicts
      7e90ba8 [Yanbo Liang] fix typos
      33ae183 [Yanbo Liang] fix annotation
      ff043d3 [Yanbo Liang] raw2probabilityInPlace should operate in-place
      c32d6ce [Yanbo Liang] optimize calculateImpurityStats function again
      6167fb0 [Yanbo Liang] optimize calculateImpurityStats function
      fbbe2ec [Yanbo Liang] eliminate duplicated struct and code
      beb1634 [Yanbo Liang] try to eliminate impurityStats for each LearningNode
      99e8943 [Yanbo Liang] code optimization
      5ec3323 [Yanbo Liang] implement InformationGainAndImpurityStats
      227c91b [Yanbo Liang] refactor LearningNode to store ImpurityCalculator
      d746ffc [Yanbo Liang] decision tree support predict class probabilities
      e8bdcdea
    • Yuhao Yang's avatar
      [SPARK-9231] [MLLIB] DistributedLDAModel method for top topics per document · 4011a947
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-9231
      
      Helper method in DistributedLDAModel of this form:
      ```
      /**
       * For each document, return the top k weighted topics for that document.
       * return RDD of (doc ID, topic indices, topic weights)
       */
      def topTopicsPerDocument(k: Int): RDD[(Long, Array[Int], Array[Double])]
      ```
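       
       A hypothetical usage sketch of the signature above (the helper name and the `model` value are assumptions; any trained DistributedLDAModel would do):
       
       ```
       import org.apache.spark.mllib.clustering.DistributedLDAModel
       import org.apache.spark.rdd.RDD
       
       // `model` is assumed to be an already-trained DistributedLDAModel.
       def showTopTopics(model: DistributedLDAModel, k: Int = 3): Unit = {
         val top: RDD[(Long, Array[Int], Array[Double])] = model.topTopicsPerDocument(k)
         top.take(5).foreach { case (docId, topics, weights) =>
           println(s"doc $docId -> topics ${topics.mkString(",")}, weights ${weights.mkString(",")}")
         }
       }
       ```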
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #7785 from hhbyyh/topTopicsPerdoc and squashes the following commits:
      
      30ad153 [Yuhao Yang] small fix
      fd24580 [Yuhao Yang] add topTopics per document to DistributedLDAModel
      4011a947
    • Alexander Ulanov's avatar
      [SPARK-9471] [ML] Multilayer Perceptron · 6add4edd
      Alexander Ulanov authored
      This pull request contains the following feature for ML:
         - Multilayer Perceptron classifier
      
      This implementation is based on our initial pull request with bgreeven: https://github.com/apache/spark/pull/1290 and inspired by very insightful suggestions from mengxr and witgo (I would like to thank all other people from the mentioned thread for useful discussions). The original code was extensively tested and benchmarked. Since then, I've addressed two main requirements that prevented the code from merging into the main branch:
         - Extensible interface, so it will be easy to implement new types of networks
            - The main building blocks are the traits `Layer` and `LayerModel`, which are used for constructing the layers of the ANN. New layer types can be added by extending these traits. They are kept private in this release in order to keep a path open to improve them based on community feedback
           - Back propagation is implemented in general form, so there is no need to change it (optimization algorithm) when new layers are implemented
         - Speed and scalability: this implementation has to be comparable in terms of speed to the state of the art single node implementations.
           - The developed benchmark for large ANN shows that the proposed code is on par with C++ CPU implementation and scales nicely with the number of workers. Details can be found here: https://github.com/avulanov/ann-benchmark
      
         - DBN and RBM by witgo https://github.com/witgo/spark/tree/ann-interface-gemm-dbn
         - Dropout https://github.com/avulanov/spark/tree/ann-interface-gemm
      
      mengxr and dbtsai kindly agreed to perform code review.
      
      Author: Alexander Ulanov <nashb@yandex.ru>
      Author: Bert Greevenbosch <opensrc@bertgreevenbosch.nl>
      
      Closes #7621 from avulanov/SPARK-2352-ann and squashes the following commits:
      
      4806b6f [Alexander Ulanov] Addressing reviewers comments.
      a7e7951 [Alexander Ulanov] Default blockSize: 100. Added documentation to blockSize parameter and DataStacker class
      f69bb3d [Alexander Ulanov] Addressing reviewers comments.
      374bea6 [Alexander Ulanov] Moving ANN to ML package. GradientDescent constructor is now spark private.
      43b0ae2 [Alexander Ulanov] Addressing reviewers comments. Adding multiclass test.
      9d18469 [Alexander Ulanov] Addressing reviewers comments: unnecessary copy of data in predict
      35125ab [Alexander Ulanov] Style fix in tests
      e191301 [Alexander Ulanov] Apache header
      a226133 [Alexander Ulanov] Multilayer Perceptron regressor and classifier
      6add4edd
    • Davies Liu's avatar
      [SQL] address comments for to_date/trunc · 0024da91
      Davies Liu authored
       This PR addresses the comments in #7805
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7817 from davies/trunc and squashes the following commits:
      
      f729d5f [Davies Liu] rollback
      cb7f7832 [Davies Liu] genCode() is protected
      31e52ef [Davies Liu] fix style
      ed1edc7 [Davies Liu] address comments for #7805
      0024da91
    • tedyu's avatar
      [SPARK-9446] Clear Active SparkContext in stop() method · 27ae851c
      tedyu authored
       In the mailing-list thread 'stopped SparkContext remaining active', Andres observed the following in the driver log:
      ```
      15/07/29 15:17:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: <address removed>
      15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Shutting down all executors
      Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors
              at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261)
              at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266)
              at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
              at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
              at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
              at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
              at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139)
      Caused by: java.lang.InterruptedException
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1325)
              at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
              at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
              at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
              at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
              at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
              at scala.concurrent.Await$.result(package.scala:190)15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Asking each executor to shut down
      
              at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
              at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
              at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:257)
              ... 6 more
      ```
       The effect of the above exception is that a stopped SparkContext is returned to the user, since SparkContext.clearActiveContext() is not called.
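       
       The squash commits below describe the fix as clearing the active context in a finally block while guarding resource release with tryLogNonFatalError. A hypothetical, heavily simplified sketch of that pattern (all names here are stand-ins, not the real SparkContext internals):
       
       ```
       // Hypothetical sketch of the described pattern; not the real SparkContext.stop().
       object StopSketch {
         private var activeContext: Option[String] = Some("sc")  // stand-in for the active SparkContext
       
         private def tryLogNonFatalError(block: => Unit): Unit =
           try block catch { case e: Exception => println(s"Ignoring error during stop: $e") }
       
         def stop(): Unit = {
           try {
             // Resource release may fail (e.g. the InterruptedException seen above) ...
             tryLogNonFatalError { throw new InterruptedException("shutdown race") }
           } finally {
             // ... but the active context is always cleared, so a stopped context
             // is never left registered as the active one.
             activeContext = None
           }
         }
       }
       ```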
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #7756 from tedyu/master and squashes the following commits:
      
      7339ff2 [tedyu] Move null assignment out of tryLogNonFatalError block
      6e02cd9 [tedyu] Use Utils.tryLogNonFatalError to guard resource release
      f5fb519 [tedyu] Clear Active SparkContext in stop() method using finally
      27ae851c
    • zsxwing's avatar
      [SPARK-9497] [SPARK-9509] [CORE] Use ask instead of askWithRetry · 04a49edf
      zsxwing authored
       `RpcEndpointRef.askWithRetry` throws `SparkException` rather than `TimeoutException`. Use `ask` instead, because we don't need to retry here.
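       
       For illustration, a plain scala.concurrent sketch (not Spark's internal RpcEndpointRef API) of why a Future-based ask surfaces a TimeoutException when awaited with a timeout, instead of a wrapped SparkException:
       
       ```
       import scala.concurrent.{Await, Future, Promise}
       import scala.concurrent.duration._
       
       // A reply that never arrives.
       val never: Future[String] = Promise[String]().future
       try {
         Await.result(never, 100.millis)
       } catch {
         // Await.result throws java.util.concurrent.TimeoutException on timeout.
         case e: java.util.concurrent.TimeoutException => println(s"timed out: $e")
       }
       ```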
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #7824 from zsxwing/SPARK-9497 and squashes the following commits:
      
      7bfc2b4 [zsxwing] Use ask instead of askWithRetry
      04a49edf
    • Yu ISHIKAWA's avatar
      [SPARK-9053] [SPARKR] Fix spaces around parens, infix operators etc. · fc0e57e5
      Yu ISHIKAWA authored
      ### JIRA
      [[SPARK-9053] Fix spaces around parens, infix operators etc. - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9053)
      
      ### The Result of `lint-r`
       [The result of lint-r at revision a4c83cb1](https://gist.github.com/yu-iskw/d253d7f8ef351f86443d)
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #7584 from yu-iskw/SPARK-9053 and squashes the following commits:
      
      613170f [Yu ISHIKAWA] Ignore a warning about a space before a left parentheses
      ede61e1 [Yu ISHIKAWA] Ignores two warnings about a space before a left parentheses. TODO: After updating `lintr`, we will remove the ignores
      de3e0db [Yu ISHIKAWA] Add '## nolint start' & '## nolint end' statement to ignore infix space warnings
      e233ea8 [Yu ISHIKAWA] [SPARK-9053][SparkR] Fix spaces around parens, infix operators etc.
      fc0e57e5
    • Davies Liu's avatar
      [SPARK-9500] add TernaryExpression to simplify ternary expressions · 6bba7509
      Davies Liu authored
       There is a lot of duplicated code among ternary expressions; create a TernaryExpression base class for them to reduce the duplication.
      
      cc chenghao-intel
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7816 from davies/ternary and squashes the following commits:
      
      ed2bf76 [Davies Liu] add TernaryExpression
      6bba7509
    • WangTaoTheTonic's avatar
      [SPARK-9496][SQL]do not print the password in config · a3a85d73
      WangTaoTheTonic authored
      https://issues.apache.org/jira/browse/SPARK-9496
      
       We should not print the password in the log.
      
      Author: WangTaoTheTonic <wangtao111@huawei.com>
      
      Closes #7815 from WangTaoTheTonic/master and squashes the following commits:
      
      c7a5145 [WangTaoTheTonic] do not print the password in config
      a3a85d73
    • Liang-Chi Hsieh's avatar
      [SPARK-9152][SQL] Implement code generation for Like and RLike · 0244170b
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9152
      
      This PR implements code generation for `Like` and `RLike`.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #7561 from viirya/like_rlike_codegen and squashes the following commits:
      
      fe5641b [Liang-Chi Hsieh] Add test for NonFoldableLiteral.
      ccd1b43 [Liang-Chi Hsieh] For comments.
      0086723 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into like_rlike_codegen
      50df9a8 [Liang-Chi Hsieh] Use nullSafeCodeGen.
      8092a68 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into like_rlike_codegen
      696d451 [Liang-Chi Hsieh] Check expression foldable.
      48e5536 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into like_rlike_codegen
      aea58e0 [Liang-Chi Hsieh] For comments.
      46d946f [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into like_rlike_codegen
      a0fb76e [Liang-Chi Hsieh] For comments.
      6cffe3c [Liang-Chi Hsieh] For comments.
      69f0fb6 [Liang-Chi Hsieh] Add code generation for Like and RLike.
      0244170b
    • Yanbo Liang's avatar
      [SPARK-9214] [ML] [PySpark] support ml.NaiveBayes for Python · 69b62f76
      Yanbo Liang authored
      support ml.NaiveBayes for Python
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #7568 from yanboliang/spark-9214 and squashes the following commits:
      
      5ee3fd6 [Yanbo Liang] fix typos
      3ecd046 [Yanbo Liang] fix typos
      f9c94d1 [Yanbo Liang] change lambda_ to smoothing and fix other issues
      180452a [Yanbo Liang] fix typos
      7dda1f4 [Yanbo Liang] support ml.NaiveBayes for Python
      69b62f76
    • Ram Sriharsha's avatar
      [SPARK-7690] [ML] Multiclass classification Evaluator · 4e5919bf
      Ram Sriharsha authored
      Multiclass Classification Evaluator for ML Pipelines. F1 score, precision, recall, weighted precision and weighted recall are supported as available metrics.
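       
       A usage sketch under the assumption that `predictions` is a DataFrame with "label" and "prediction" columns; the metric names follow the description above:
       
       ```
       import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
       import org.apache.spark.sql.DataFrame
       
       // `predictions` is assumed to contain "label" and "prediction" columns.
       def f1Score(predictions: DataFrame): Double = {
         new MulticlassClassificationEvaluator()
           .setLabelCol("label")
           .setPredictionCol("prediction")
           .setMetricName("f1")  // also supported: "precision", "recall", "weightedPrecision", "weightedRecall"
           .evaluate(predictions)
       }
       ```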
      
      Author: Ram Sriharsha <rsriharsha@hw11853.local>
      
      Closes #7475 from harsha2010/SPARK-7690 and squashes the following commits:
      
      9bf4ec7 [Ram Sriharsha] fix indentation
      3f09a85 [Ram Sriharsha] cleanup doc
      16115ae [Ram Sriharsha] code review fixes
      032d2a3 [Ram Sriharsha] fix test
      eec9865 [Ram Sriharsha] Fix Python Indentation
      1dbeffd [Ram Sriharsha] Merge branch 'master' into SPARK-7690
      68cea85 [Ram Sriharsha] Merge branch 'master' into SPARK-7690
      54c03de [Ram Sriharsha] [SPARK-7690][ml][WIP] Multiclass Evaluator for ML Pipeline
      4e5919bf
  2. Jul 30, 2015
    • Daoyuan Wang's avatar
      [SPARK-8176] [SPARK-8197] [SQL] function to_date/ trunc · 83670fc9
      Daoyuan Wang authored
       This PR is based on #6988, thanks to adrian-wang.
      
      This brings two SQL functions: to_date() and trunc().
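       
       A minimal usage sketch, assuming `sqlCtx` is an existing SQLContext; the literals and expected results are illustrative and depend on the session time zone:
       
       ```
       import org.apache.spark.sql.SQLContext
       
       def toDateAndTruncExamples(sqlCtx: SQLContext): Unit = {
         sqlCtx.sql("SELECT to_date('2015-07-30 10:00:00')").show()  // -> 2015-07-30
         sqlCtx.sql("SELECT trunc('2015-07-30', 'MM')").show()       // -> 2015-07-01 (first day of the month)
         sqlCtx.sql("SELECT trunc('2015-07-30', 'YEAR')").show()     // -> 2015-01-01 (first day of the year)
       }
       ```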
      
      Closes #6988
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7805 from davies/to_date and squashes the following commits:
      
      2c7beba [Davies Liu] Merge branch 'master' of github.com:apache/spark into to_date
      310dd55 [Daoyuan Wang] remove dup test in rebase
      980b092 [Daoyuan Wang] resolve rebase conflict
      a476c5a [Daoyuan Wang] address comments from davies
      d44ea5f [Daoyuan Wang] function to_date, trunc
      83670fc9
    • cody koeninger's avatar
      [SPARK-9472] [STREAMING] consistent hadoop configuration, streaming only · 9307f565
      cody koeninger authored
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #7772 from koeninger/streaming-hadoop-config and squashes the following commits:
      
      5267284 [cody koeninger] [SPARK-4229][Streaming] consistent hadoop configuration, streaming only
      9307f565
    • Josh Rosen's avatar
      [SPARK-9489] Remove unnecessary compatibility and requirements checks from Exchange · 3c66ff72
      Josh Rosen authored
      While reviewing yhuai's patch for SPARK-2205 (#7773), I noticed that Exchange's `compatible` check may be incorrectly returning `false` in many cases.  As far as I know, this is not actually a problem because the `compatible`, `meetsRequirements`, and `needsAnySort` checks are serving only as short-circuit performance optimizations that are not necessary for correctness.
      
      In order to reduce code complexity, I think that we should remove these checks and unconditionally rewrite the operator's children.  This should be safe because we rewrite the tree in a single bottom-up pass.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7807 from JoshRosen/SPARK-9489 and squashes the following commits:
      
      9d76ce9 [Josh Rosen] [SPARK-9489] Remove compatibleWith, meetsRequirements, and needsAnySort checks from Exchange
      3c66ff72
    • Sean Owen's avatar
      [SPARK-9077] [MLLIB] Improve error message for decision trees when numExamples... · 65fa4181
      Sean Owen authored
      [SPARK-9077] [MLLIB] Improve error message for decision trees when numExamples < maxCategoriesPerFeature
      
      Improve error message when number of examples is less than arity of high-arity categorical feature
      
      CC jkbradley is this about what you had in mind? I know it's a starter, but was on my list to close out in the short term.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #7800 from srowen/SPARK-9077 and squashes the following commits:
      
      b8f6cdb [Sean Owen] Improve error message when number of examples is less than arity of high-arity categorical feature
      65fa4181
    • Liang-Chi Hsieh's avatar
      [SPARK-6319][SQL] Throw AnalysisException when using BinaryType on Join and Aggregate · 351eda0e
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-6319
      
       Spark SQL uses plain byte arrays to represent binary values. However, the arrays are compared by reference rather than by value. Thus, we should not use BinaryType on Join and Aggregate in the current implementation.
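       
       A small self-contained example of the reference-equality problem described above:
       
       ```
       // JVM arrays use reference equality and identity hash codes, so two equal
       // binary values do not behave as equal grouping or join keys.
       val a = Array[Byte](1, 2, 3)
       val b = Array[Byte](1, 2, 3)
       println(a == b)             // false: compared by reference
       println(a.sameElements(b))  // true: compared element-wise
       ```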
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #7787 from viirya/agg_no_binary_type and squashes the following commits:
      
      4f76cac [Liang-Chi Hsieh] Throw AnalysisException when using BinaryType on Join and Aggregate.
      351eda0e
    • Davies Liu's avatar
      [SPARK-9425] [SQL] support DecimalType in UnsafeRow · 0b1a464b
      Davies Liu authored
       This PR brings support for DecimalType in UnsafeRow: for precision <= 18 the value is settable; otherwise it is not settable.
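       
       A sketch of the idea behind the precision <= 18 cutoff (this mirrors the reasoning, not the UnsafeRow code itself): an 18-digit unscaled value still fits in a 64-bit Long, so such decimals can live in a fixed 8-byte slot and be updated in place:
       
       ```
       // Sketch only: a decimal with precision <= 18 has an unscaled value that fits in a Long.
       def fitsInLong(precision: Int): Boolean = precision <= 18  // 10^18 - 1 < Long.MaxValue
       
       val d = BigDecimal("12345678901234.5678")  // precision 18, scale 4
       val unscaled: Long = d.underlying.unscaledValue.longValueExact
       println(s"unscaled = $unscaled, settable = ${fitsInLong(d.precision)}")
       ```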
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7758 from davies/unsafe_decimal and squashes the following commits:
      
      478b1ba [Davies Liu] address comments
      536314c [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_decimal
      7c2e77a [Davies Liu] fix JoinedRow
      76d6fa4 [Davies Liu] fix tests
      99d3151 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_decimal
      d49c6ae [Davies Liu] support DecimalType in UnsafeRow
      0b1a464b
    • Reynold Xin's avatar
      [SPARK-9458][SPARK-9469][SQL] Code generate prefix computation in sorting &... · e7a0976e
      Reynold Xin authored
      [SPARK-9458][SPARK-9469][SQL] Code generate prefix computation in sorting & moves unsafe conversion out of TungstenSort.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7803 from rxin/SPARK-9458 and squashes the following commits:
      
      5b032dc [Reynold Xin] Fix string.
      b670dbb [Reynold Xin] [SPARK-9458][SPARK-9469][SQL] Code generate prefix computation in sorting & moves unsafe conversion out of TungstenSort.
      e7a0976e
    • Xiangrui Meng's avatar
      [SPARK-7157][SQL] add sampleBy to DataFrame · df326695
      Xiangrui Meng authored
      This was previously committed but then reverted due to test failures (see #6769).
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #7755 from rxin/SPARK-7157 and squashes the following commits:
      
      fbf9044 [Xiangrui Meng] fix python test
      542bd37 [Xiangrui Meng] update test
      604fe6d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7157
      f051afd [Xiangrui Meng] use udf instead of building expression
      f4e9425 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7157
      8fb990b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7157
      103beb3 [Xiangrui Meng] add Java-friendly sampleBy
      991f26f [Xiangrui Meng] fix seed
      4a14834 [Xiangrui Meng] move sampleBy to stat
      832f7cc [Xiangrui Meng] add sampleBy to DataFrame
      df326695
    • Xiangrui Meng's avatar
      [SPARK-9408] [PYSPARK] [MLLIB] Refactor linalg.py to /linalg · ca71cc8c
      Xiangrui Meng authored
       This is based on MechCoder's PR https://github.com/apache/spark/pull/7731. Hopefully it will pass the tests. MechCoder, I tried to make minimal changes. If this passes Jenkins, we can merge this one first and then try to move `__init__.py` to `local.py` in a separate PR.
      
      Closes #7731
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #7746 from mengxr/SPARK-9408 and squashes the following commits:
      
      0e05a3b [Xiangrui Meng] merge master
      1135551 [Xiangrui Meng] add a comment for str(...)
      c48cae0 [Xiangrui Meng] update tests
      173a805 [Xiangrui Meng] move linalg.py to linalg/__init__.py
      ca71cc8c
    • Tathagata Das's avatar
      [STREAMING] [TEST] [HOTFIX] Fixed Kinesis test to not throw weird errors when... · 1afdeb7b
      Tathagata Das authored
      [STREAMING] [TEST] [HOTFIX] Fixed Kinesis test to not throw weird errors when Kinesis tests are enabled without AWS keys
      
       If Kinesis tests are enabled by env ENABLE_KINESIS_TESTS = 1 but no AWS credentials are found, the desired behavior is to fail the test with
      ```
      Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kinesis.KinesisBackedBlockRDDSuite *** ABORTED *** (3 seconds, 5 milliseconds)
      [info]   java.lang.Exception: Kinesis tests enabled, but could get not AWS credentials
      ```
      
      Instead KinesisStreamSuite fails with
      
      ```
      [info] - basic operation *** FAILED *** (3 seconds, 35 milliseconds)
      [info]   java.lang.IllegalArgumentException: requirement failed: Stream not yet created, call createStream() to create one
      [info]   at scala.Predef$.require(Predef.scala:233)
      [info]   at org.apache.spark.streaming.kinesis.KinesisTestUtils.streamName(KinesisTestUtils.scala:77)
      [info]   at org.apache.spark.streaming.kinesis.KinesisTestUtils$$anonfun$deleteStream$1.apply(KinesisTestUtils.scala:150)
      [info]   at org.apache.spark.streaming.kinesis.KinesisTestUtils$$anonfun$deleteStream$1.apply(KinesisTestUtils.scala:150)
      [info]   at org.apache.spark.Logging$class.logWarning(Logging.scala:71)
      [info]   at org.apache.spark.streaming.kinesis.KinesisTestUtils.logWarning(KinesisTestUtils.scala:39)
      [info]   at org.apache.spark.streaming.kinesis.KinesisTestUtils.deleteStream(KinesisTestUtils.scala:150)
      [info]   at org.apache.spark.streaming.kinesis.KinesisStreamSuite$$anonfun$3.apply$mcV$sp(KinesisStreamSuite.scala:111)
      [info]   at org.apache.spark.streaming.kinesis.KinesisStreamSuite$$anonfun$3.apply(KinesisStreamSuite.scala:86)
      [info]   at org.apache.spark.streaming.kinesis.KinesisStreamSuite$$anonfun$3.apply(KinesisStreamSuite.scala:86)
      ```
       This is because attempting to delete a non-existent Kinesis stream throws an uncaught exception. This PR fixes it.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #7809 from tdas/kinesis-test-hotfix and squashes the following commits:
      
      7c372e6 [Tathagata Das] Fixed test
      1afdeb7b
    • Calvin Jia's avatar
      [SPARK-9199] [CORE] Update Tachyon dependency from 0.6.4 -> 0.7.0 · 04c84091
      Calvin Jia authored
      No new dependencies are added. The exclusion changes are due to the change in tachyon-client 0.7.0's project structure.
      
      There is no client side API change in Tachyon 0.7.0 so no code changes are required.
      
      Author: Calvin Jia <jia.calvin@gmail.com>
      
      Closes #7577 from calvinjia/SPARK-9199 and squashes the following commits:
      
      4e81e40 [Calvin Jia] Update Tachyon dependency from 0.6.4 -> 0.7.0
      04c84091
    • Hossein's avatar
      [SPARK-8742] [SPARKR] Improve SparkR error messages for DataFrame API · 157840d1
      Hossein authored
       This patch improves SparkR error message reporting, especially with the DataFrame API. When there is a user error (e.g., a malformed SQL query), the message of the cause is sent back through the RPC and the R client reads it and returns it to the user.
      
      cc shivaram
      
      Author: Hossein <hossein@databricks.com>
      
      Closes #7742 from falaki/SPARK-8742 and squashes the following commits:
      
      4f643c9 [Hossein] Not logging exceptions in RBackendHandler
      4a8005c [Hossein] Returning stack track of causing exception from RBackendHandler
      5cf17f0 [Hossein] Adding unit test for error messages from SQLContext
      2af75d5 [Hossein] Reading error message in case of failure and stoping with that message
      f479c99 [Hossein] Wrting exception cause message in JVM
      157840d1
    • Eric Liang's avatar
      [SPARK-9463] [ML] Expose model coefficients with names in SparkR RFormula · e7905a93
      Eric Liang authored
      Preview:
      
      ```
      > summary(m)
                  features coefficients
      1        (Intercept)    1.6765001
      2       Sepal_Length    0.3498801
      3 Species.versicolor   -0.9833885
      4  Species.virginica   -1.0075104
      
      ```
      
      Design doc from umbrella task: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit
      
      cc mengxr
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #7771 from ericl/summary and squashes the following commits:
      
      ccd54c3 [Eric Liang] second pass
      a5ca93b [Eric Liang] comments
      2772111 [Eric Liang] clean up
      70483ef [Eric Liang] fix test
      7c247d4 [Eric Liang] Merge branch 'master' into summary
      3c55024 [Eric Liang] working
      8c539aa [Eric Liang] first pass
      e7905a93
    • Joseph K. Bradley's avatar
      [SPARK-6684] [MLLIB] [ML] Add checkpointing to GBTs · be7be6d4
      Joseph K. Bradley authored
      Add checkpointing to GradientBoostedTrees, GBTClassifier, GBTRegressor
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #7804 from jkbradley/gbt-checkpoint3 and squashes the following commits:
      
      3fbd7ba [Joseph K. Bradley] tiny fix
      b3e160c [Joseph K. Bradley] unset checkpoint dir after test
      9cc3a04 [Joseph K. Bradley] added checkpointing to GBTs
      be7be6d4
    • martinzapletal's avatar
      [SPARK-8671] [ML] Added isotonic regression to the pipeline API. · 7f7a319c
      martinzapletal authored
      Author: martinzapletal <zapletal-martin@email.cz>
      
      Closes #7517 from zapletal-martin/SPARK-8671-isotonic-regression-api and squashes the following commits:
      
      8c435c1 [martinzapletal] Review https://github.com/apache/spark/pull/7517 feedback update.
      bebbb86 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-8671-isotonic-regression-api
      b68efc0 [martinzapletal] Added tests for param validation.
      07c12bd [martinzapletal] Comments and refactoring.
      834fcf7 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-8671-isotonic-regression-api
      b611fee [martinzapletal] SPARK-8671. Added first version of isotonic regression to pipeline API
      7f7a319c
    • zsxwing's avatar
      [SPARK-9479] [STREAMING] [TESTS] Fix ReceiverTrackerSuite failure for maven... · 0dbd6963
      zsxwing authored
      [SPARK-9479] [STREAMING] [TESTS] Fix ReceiverTrackerSuite failure for maven build and other potential test failures in Streaming
      
      See https://issues.apache.org/jira/browse/SPARK-9479 for the failure cause.
      
      The PR includes the following changes:
      1. Make ReceiverTrackerSuite create StreamingContext in the test body.
      2. Fix places that don't stop StreamingContext. I verified no SparkContext was stopped in the shutdown hook locally after this fix.
      3. Fix an issue that `ReceiverTracker.endpoint` may be null.
      4. Make sure stopping SparkContext in non-main thread won't fail other tests.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #7797 from zsxwing/fix-ReceiverTrackerSuite and squashes the following commits:
      
      3a4bb98 [zsxwing] Fix another potential NPE
      d7497df [zsxwing] Fix ReceiverTrackerSuite; make sure StreamingContext in tests is closed
      0dbd6963
    • Feynman Liang's avatar
      [SPARK-9454] Change LDASuite tests to use vector comparisons · 89cda69e
      Feynman Liang authored
       jkbradley This changes the current hacky string comparison for vectors to proper vector comparisons.
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #7775 from feynmanliang/SPARK-9454-ldasuite-vector-compare and squashes the following commits:
      
      bd91a82 [Feynman Liang] Remove println
      905c76e [Feynman Liang] Fix string compare in distributed EM
      2f24c13 [Feynman Liang] Improve LDASuite tests
      89cda69e
    • Daoyuan Wang's avatar
      [SPARK-8186] [SPARK-8187] [SPARK-8194] [SPARK-8198] [SPARK-9133] [SPARK-9290]... · 1abf7dc1
      Daoyuan Wang authored
      [SPARK-8186] [SPARK-8187] [SPARK-8194] [SPARK-8198] [SPARK-9133] [SPARK-9290] [SQL] functions: date_add, date_sub, add_months, months_between, time-interval calculation
      
       This PR is based on #7589, thanks to adrian-wang.
      
       Added the SQL functions date_add, date_sub, add_months, and months_between, and also added a rule for
       add/subtract of date/timestamp and interval.
      
      Closes #7589
      
      cc rxin
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7754 from davies/date_add and squashes the following commits:
      
      e8c633a [Davies Liu] Merge branch 'master' of github.com:apache/spark into date_add
      9e8e085 [Davies Liu] Merge branch 'master' of github.com:apache/spark into date_add
      6224ce4 [Davies Liu] fix conclict
      bd18cd4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into date_add
      e47ff2c [Davies Liu] add python api, fix date functions
      01943d0 [Davies Liu] Merge branch 'master' into date_add
      522e91a [Daoyuan Wang] fix
      e8a639a [Daoyuan Wang] fix
      42df486 [Daoyuan Wang] fix style
      87c4b77 [Daoyuan Wang] function add_months, months_between and some fixes
      1a68e03 [Daoyuan Wang] poc of time interval calculation
      c506661 [Daoyuan Wang] function date_add , date_sub
      1abf7dc1
    • Feynman Liang's avatar
      [SPARK-5567] [MLLIB] Add predict method to LocalLDAModel · d8cfd531
      Feynman Liang authored
      jkbradley hhbyyh
      
      Adds `topicDistributions` to LocalLDAModel. Please review after #7757 is merged.
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #7760 from feynmanliang/SPARK-5567-predict-in-LDA and squashes the following commits:
      
      0ad1134 [Feynman Liang] Remove println
      27b3877 [Feynman Liang] Code review fixes
      6bfb87c [Feynman Liang] Remove extra newline
      476f788 [Feynman Liang] Fix checks and doc for variationalInference
      061780c [Feynman Liang] Code review cleanup
      3be2947 [Feynman Liang] Rename topicDistribution -> topicDistributions
      2a821a6 [Feynman Liang] Add predict methods to LocalLDAModel
      d8cfd531
    • Reynold Xin's avatar
      [SPARK-9460] Fix prefix generation for UTF8String. · a20e743f
      Reynold Xin authored
       Previously we could get garbage data if the number of bytes is 0, or on JVMs that are 4-byte aligned, or when compressed oops is on.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7789 from rxin/utf8string and squashes the following commits:
      
      86ffa3e [Reynold Xin] Mask out data outside of valid range.
      4d647ed [Reynold Xin] Mask out data.
      c6e8794 [Reynold Xin] [SPARK-9460] Fix prefix generation for UTF8String.
      a20e743f
    • Daoyuan Wang's avatar
      [SPARK-8174] [SPARK-8175] [SQL] function unix_timestamp, from_unixtime · 6d94bf6a
      Daoyuan Wang authored
       unix_timestamp(): long
       Gets the current Unix timestamp in seconds.
       
       unix_timestamp(string|date): long
       Converts a time string in the format yyyy-MM-dd HH:mm:ss to a Unix timestamp (in seconds), using the default time zone and the default locale; returns null on failure: unix_timestamp('2009-03-20 11:30:01') = 1237573801
       
       unix_timestamp(string date, string pattern): long
       Converts a time string with the given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to a Unix timestamp (in seconds); returns null on failure: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400.
       
       from_unixtime(bigint unixtime[, string format]): string
       Converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00".
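       
       A usage sketch mirroring the examples above, assuming `sqlCtx` is an existing SQLContext; the numeric results depend on the session time zone:
       
       ```
       import org.apache.spark.sql.SQLContext
       
       def unixTimeExamples(sqlCtx: SQLContext): Unit = {
         sqlCtx.sql("SELECT unix_timestamp()").show()                            // current Unix time in seconds
         sqlCtx.sql("SELECT unix_timestamp('2009-03-20 11:30:01')").show()       // e.g. 1237573801
         sqlCtx.sql("SELECT unix_timestamp('2009-03-20', 'yyyy-MM-dd')").show()  // e.g. 1237532400
         sqlCtx.sql("SELECT from_unixtime(1237573801, 'yyyy-MM-dd HH:mm:ss')").show()
       }
       ```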
      
      Jira:
      https://issues.apache.org/jira/browse/SPARK-8174
      https://issues.apache.org/jira/browse/SPARK-8175
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #7644 from adrian-wang/udfunixtime and squashes the following commits:
      
      2fe20c4 [Daoyuan Wang] util.Date
      ea2ec16 [Daoyuan Wang] use util.Date for better performance
      a2cf929 [Daoyuan Wang] doc return null instead of 0
      f6f070a [Daoyuan Wang] address comments from davies
      6a4cbb3 [Daoyuan Wang] temp
      56ded53 [Daoyuan Wang] rebase and address comments
      14a8b37 [Daoyuan Wang] function unix_timestamp, from_unixtime
      6d94bf6a
    • Imran Rashid's avatar
      [SPARK-9437] [CORE] avoid overflow in SizeEstimator · 06b6a074
      Imran Rashid authored
      https://issues.apache.org/jira/browse/SPARK-9437
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #7750 from squito/SPARK-9437_size_estimator_overflow and squashes the following commits:
      
      29493f1 [Imran Rashid] prevent another potential overflow
      bc1cb82 [Imran Rashid] avoid overflow
      06b6a074
    • Josh Rosen's avatar
      [SPARK-8850] [SQL] Enable Unsafe mode by default · 520ec0ff
      Josh Rosen authored
      This pull request enables Unsafe mode by default in Spark SQL. In order to do this, we had to fix a number of small issues:
      
      **List of fixed blockers**:
      
      - [x] Make some default buffer sizes configurable so that HiveCompatibilitySuite can run properly (#7741).
       - [x] Memory leak on grouped aggregation of empty input (fixed by #7560)
      - [x] Update planner to also check whether codegen is enabled before planning unsafe operators.
      - [x] Investigate failing HiveThriftBinaryServerSuite test.  This turns out to be caused by a ClassCastException that occurs when Exchange tries to apply an interpreted RowOrdering to an UnsafeRow when range partitioning an RDD.  This could be fixed by #7408, but a shorter-term fix is to just skip the Unsafe exchange path when RangePartitioner is used.
      - [x] Memory leak exceptions masking exceptions that actually caused tasks to fail (will be fixed by #7603).
      - [x]  ~~https://issues.apache.org/jira/browse/SPARK-9162, to implement code generation for ScalaUDF.  This is necessary for `UDFSuite` to pass.  For now, I've just ignored this test in order to try to find other problems while we wait for a fix.~~ This is no longer necessary as of #7682.
      - [x] Memory leaks from Limit after UnsafeExternalSort cause the memory leak detector to fail tests. This is a huge problem in the HiveCompatibilitySuite (fixed by f4ac642a4e5b2a7931c5e04e086bb10e263b1db6).
      - [x] Tests in `AggregationQuerySuite` are failing due to NaN-handling issues in UnsafeRow, which were fixed in #7736.
      - [x] `org.apache.spark.sql.ColumnExpressionSuite.rand` needs to be updated so that the planner check also matches `TungstenProject`.
      - [x] After having lowered the buffer sizes to 4MB so that most of HiveCompatibilitySuite runs:
        - [x] Wrong answer in `join_1to1` (fixed by #7680)
        - [x] Wrong answer in `join_nulls` (fixed by #7680)
        - [x] Managed memory OOM / leak in `lateral_view`
        - [x] Seems to hang indefinitely in `partcols1`.  This might be a deadlock in script transformation or a bug in error-handling code? The hang was fixed by #7710.
        - [x] Error while freeing memory in `partcols1`: will be fixed by #7734.
      - [x] After fixing the `partcols1` hang, it appears that a number of later tests have issues as well.
      - [x] Fix thread-safety bug in codegen fallback expression evaluation (#7759).
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7564 from JoshRosen/unsafe-by-default and squashes the following commits:
      
      83c0c56 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-by-default
      f4cc859 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-by-default
      963f567 [Josh Rosen] Reduce buffer size for R tests
      d6986de [Josh Rosen] Lower page size in PySpark tests
      013b9da [Josh Rosen] Also match TungstenProject in checkNumProjects
      5d0b2d3 [Josh Rosen] Add task completion callback to avoid leak in limit after sort
      ea250da [Josh Rosen] Disable unsafe Exchange path when RangePartitioning is used
      715517b [Josh Rosen] Enable Unsafe by default
      520ec0ff
    • Marcelo Vanzin's avatar
      [SPARK-9388] [YARN] Make executor info log messages easier to read. · ab78b1d2
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #7706 from vanzin/SPARK-9388 and squashes the following commits:
      
      028b990 [Marcelo Vanzin] Single log statement.
      3c5fb6a [Marcelo Vanzin] YARN not Yarn.
      5bcd7a0 [Marcelo Vanzin] [SPARK-9388] [yarn] Make executor info log messages easier to read.
      ab78b1d2
    • Mridul Muralidharan's avatar
      [SPARK-8297] [YARN] Scheduler backend is not notified in case node fails in YARN · e5353465
      Mridul Muralidharan authored
      This change adds code to notify the scheduler backend when a container dies in YARN.
      
      Author: Mridul Muralidharan <mridulm@yahoo-inc.com>
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #7431 from vanzin/SPARK-8297 and squashes the following commits:
      
      471e4a0 [Marcelo Vanzin] Fix unit test after merge.
      d4adf4e [Marcelo Vanzin] Merge branch 'master' into SPARK-8297
      3b262e8 [Marcelo Vanzin] Merge branch 'master' into SPARK-8297
      537da6f [Marcelo Vanzin] Make an expected log less scary.
      04dc112 [Marcelo Vanzin] Use driver <-> AM communication to send "remove executor" request.
      8855b97 [Marcelo Vanzin] Merge remote-tracking branch 'mridul/fix_yarn_scheduler_bug' into SPARK-8297
      687790f [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
      e1b0067 [Mridul Muralidharan] Fix failing testcase, fix merge issue from our 1.3 -> master
      9218fcc [Mridul Muralidharan] Fix failing testcase
      362d64a [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
      62ad0cc [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
      bbf8811 [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
      9ee1307 [Mridul Muralidharan] Fix SPARK-8297
      a3a0f01 [Mridul Muralidharan] Fix SPARK-8297
      e5353465