  1. May 22, 2015
    • Josh Rosen's avatar
      [SPARK-7766] KryoSerializerInstance reuse is unsafe when auto-reset is disabled · eac00691
      Josh Rosen authored
      SPARK-3386 / #5606 modified the shuffle write path to re-use serializer instances across multiple calls to DiskBlockObjectWriter. It turns out that this introduced a very rare bug when using `KryoSerializer`: if auto-reset is disabled and reference-tracking is enabled, then we'll end up re-using the same serializer instance to write multiple output streams without calling `reset()` between write calls, which can lead to cases where objects in one file may contain references to objects that are in previous files, causing errors during deserialization.
      
      This patch fixes this bug by calling `reset()` at the start of `serialize()` and `serializeStream()`. I also added a regression test which demonstrates that this problem only occurs when auto-reset is disabled and reference-tracking is enabled.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6293 from JoshRosen/kryo-instance-reuse-bug and squashes the following commits:
      
      e19726d [Josh Rosen] Add fix for SPARK-7766.
      71845e3 [Josh Rosen] Add failing regression test to trigger Kryo re-use bug
      eac00691
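The failure mode described above can be sketched with a toy serializer in plain Python (this is not Spark's actual Kryo code, just an illustration): with reference tracking on and auto-reset off, a reused instance emits back-references that point into an earlier stream, so the later stream cannot be deserialized on its own.

```python
class RefTrackingSerializer:
    """Toy serializer with reference tracking and optional auto-reset."""

    def __init__(self, auto_reset=True):
        self.auto_reset = auto_reset
        self.refs = {}  # id(obj) -> index of first occurrence

    def reset(self):
        self.refs.clear()

    def serialize_stream(self, objs):
        """Serialize a batch of objects into one 'stream' (a list of records)."""
        stream = []
        for o in objs:
            if id(o) in self.refs:
                # Already seen: emit a back-reference instead of the value.
                stream.append(("ref", self.refs[id(o)]))
            else:
                self.refs[id(o)] = len(self.refs)
                stream.append(("val", o))
        if self.auto_reset:
            self.reset()
        return stream


def deserialize_stream(stream):
    seen, out = [], []
    for tag, payload in stream:
        if tag == "val":
            seen.append(payload)
            out.append(payload)
        else:
            # A back-reference must resolve within this stream;
            # raises IndexError if it points into a previous file.
            out.append(seen[payload])
    return out
```

Calling `reset()` before each new stream (the fix this patch applies at the start of `serialize()`/`serializeStream()`) makes every stream self-contained.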
    • Ram Sriharsha's avatar
      [SPARK-7574] [ML] [DOC] User guide for OneVsRest · 509d55ab
      Ram Sriharsha authored
      Includes the Iris dataset (after shuffling and relabeling 3 -> 0 to conform to the 0 -> numClasses-1 labeling). Could not find an existing dataset in data/mllib for multiclass classification.
      
      Author: Ram Sriharsha <rsriharsha@hw11853.local>
      
      Closes #6296 from harsha2010/SPARK-7574 and squashes the following commits:
      
      645427c [Ram Sriharsha] cleanup
      46c41b1 [Ram Sriharsha] cleanup
      2f76295 [Ram Sriharsha] Code Review Fixes
      ebdf103 [Ram Sriharsha] Java Example
      c026613 [Ram Sriharsha] Code Review fixes
      4b7d1a6 [Ram Sriharsha] minor cleanup
      13bed9c [Ram Sriharsha] add wikipedia link
      bb9dbfa [Ram Sriharsha] Clean up naming
      6f90db1 [Ram Sriharsha] [SPARK-7574][ml][doc] User guide for OneVsRest
      509d55ab
    • Patrick Wendell's avatar
      Revert "[BUILD] Always run SQL tests in master build." · c63036cd
      Patrick Wendell authored
      This reverts commit 147b6be3.
      c63036cd
    • Ram Sriharsha's avatar
      [SPARK-7404] [ML] Add RegressionEvaluator to spark.ml · f490b3b4
      Ram Sriharsha authored
      Author: Ram Sriharsha <rsriharsha@hw11853.local>
      
      Closes #6344 from harsha2010/SPARK-7404 and squashes the following commits:
      
      16b9d77 [Ram Sriharsha] consistent naming
      7f100b6 [Ram Sriharsha] cleanup
      c46044d [Ram Sriharsha] Merge with Master + Code Review Fixes
      188fa0a [Ram Sriharsha] Merge branch 'master' into SPARK-7404
      f5b6a4c [Ram Sriharsha] cleanup doc
      97beca5 [Ram Sriharsha] update test to use R packages
      32dd310 [Ram Sriharsha] fix indentation
      f93b812 [Ram Sriharsha] fix test
      1b6ebb3 [Ram Sriharsha] [SPARK-7404][ml] Add RegressionEvaluator to spark.ml
      f490b3b4
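The metrics a regression evaluator typically exposes can be computed in a few lines of plain Python (a conceptual sketch, not the spark.ml implementation; the metric names mirror the usual rmse/mse/r2/mae conventions):

```python
import math


def regression_metrics(labels, predictions):
    """Compute common regression metrics over paired labels and predictions."""
    n = len(labels)
    errs = [p - y for y, p in zip(labels, predictions)]
    mse = sum(e * e for e in errs) / n           # mean squared error
    mae = sum(abs(e) for e in errs) / n          # mean absolute error
    mean_y = sum(labels) / n
    ss_tot = sum((y - mean_y) ** 2 for y in labels)
    r2 = 1.0 - (mse * n) / ss_tot                # coefficient of determination
    return {"rmse": math.sqrt(mse), "mse": mse, "mae": mae, "r2": r2}
```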
    • Michael Armbrust's avatar
      [SPARK-6743] [SQL] Fix empty projections of cached data · 3b68cb04
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #6165 from marmbrus/wrongColumn and squashes the following commits:
      
      4fad158 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into wrongColumn
      aad7eab [Michael Armbrust] rxins comments
      f1e8df1 [Michael Armbrust] [SPARK-6743][SQL] Fix empty projections of cached data
      3b68cb04
    • Cheng Lian's avatar
      [MINOR] [SQL] Ignores Thrift server UISeleniumSuite · 4e5220c3
      Cheng Lian authored
      This Selenium test case has been flaky for a while and led to frequent Jenkins build failure. Let's disable it temporarily until we figure out a proper solution.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6345 from liancheng/ignore-selenium-test and squashes the following commits:
      
      09996fe [Cheng Lian] Ignores Thrift server UISeleniumSuite
      4e5220c3
    • Cheng Hao's avatar
      [SPARK-7322][SQL] Window functions in DataFrame · f6f2eeb1
      Cheng Hao authored
      This closes #6104.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6343 from rxin/window-df and squashes the following commits:
      
      026d587 [Reynold Xin] Address code review feedback.
      dc448fe [Reynold Xin] Fixed Hive tests.
      9794d9d [Reynold Xin] Moved Java test package.
      9331605 [Reynold Xin] Refactored API.
      3313e2a [Reynold Xin] Merge pull request #6104 from chenghao-intel/df_window
      d625a64 [Cheng Hao] Update the dataframe window API as suggested
      c141fb1 [Cheng Hao] hide all of properties of the WindowFunctionDefinition
      3b1865f [Cheng Hao] scaladoc typos
      f3fd2d0 [Cheng Hao] polish the unit test
      6847825 [Cheng Hao] Add additional analytics functions
      57e3bc0 [Cheng Hao] typos
      24a08ec [Cheng Hao] scaladoc
      28222ed [Cheng Hao] fix bug of range/row Frame
      1d91865 [Cheng Hao] style issue
      53f89f2 [Cheng Hao] remove the over from the functions.scala
      964c013 [Cheng Hao] add more unit tests and window functions
      64e18a7 [Cheng Hao] Add Window Function support for DataFrame
      f6f2eeb1
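A window function like `rank() over (partition by ... order by ...)` can be modeled without Spark to show what the DataFrame API computes (a toy sketch in plain Python; function and column names here are hypothetical, not the spark.sql API):

```python
from itertools import groupby


def with_rank(rows, partition_by, order_by):
    """Attach a 'rank' column: rank within each partition, ordered by order_by.

    Ties share a rank and the next distinct value skips ahead, matching
    standard SQL RANK() semantics.
    """
    ordered = sorted(rows, key=lambda r: (r[partition_by], r[order_by]))
    out = []
    for _, grp in groupby(ordered, key=lambda r: r[partition_by]):
        prev, rank = object(), 0
        for i, row in enumerate(grp, start=1):
            if row[order_by] != prev:
                rank, prev = i, row[order_by]
            out.append({**row, "rank": rank})
    return out
```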
    • Joseph K. Bradley's avatar
      [SPARK-7578] [ML] [DOC] User guide for spark.ml Normalizer, IDF, StandardScaler · 2728c3df
      Joseph K. Bradley authored
      Added user guide sections with code examples.
      Also added small Java unit tests to test Java example in guide.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6127 from jkbradley/feature-guide-2 and squashes the following commits:
      
      cd47f4b [Joseph K. Bradley] Updated based on code review
      f16bcec [Joseph K. Bradley] Fixed merge issues and update Python examples print calls for Python 3
      0a862f9 [Joseph K. Bradley] Added Normalizer, StandardScaler to ml-features doc, plus small Java unit tests
      a21c2d6 [Joseph K. Bradley] Updated ml-features.md with IDF
      2728c3df
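The two simplest transformers covered by this guide section can be sketched in plain Python to show what they compute (a conceptual illustration, not the spark.ml code; Normalizer here is fixed to p=2, and StandardScaler uses the sample standard deviation):

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm (cf. Normalizer with p=2)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def standard_scale(column):
    """Center a feature column to zero mean and scale to unit sample std."""
    n = len(column)
    mean = sum(column) / n
    var = sum((x - mean) ** 2 for x in column) / (n - 1)
    std = math.sqrt(var)
    return [(x - mean) / std for x in column] if std else [0.0] * n
```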
    • Xiangrui Meng's avatar
      [SPARK-7535] [.0] [MLLIB] Audit the pipeline APIs for 1.4 · 8f11c611
      Xiangrui Meng authored
      Some changes to the pipeline APIs:
      
      1. Estimator/Transformer doesn’t need to extend Params since PipelineStage already does.
      2. Move Evaluator to ml.evaluation.
      3. Mention larger metric values are better.
      4. PipelineModel doc. “compiled” -> “fitted”
      5. Hide object PolynomialExpansion.
      6. Hide object VectorAssembler.
      7. Word2Vec.minCount (and others) -> group param
      8. ParamValidators -> DeveloperApi
      9. Hide MetadataUtils/SchemaUtils.
      
      jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6322 from mengxr/SPARK-7535.0 and squashes the following commits:
      
      9e9c7da [Xiangrui Meng] move JavaEvaluator to ml.evaluation as well
      e179480 [Xiangrui Meng] move Evaluation to ml.evaluation in PySpark
      08ef61f [Xiangrui Meng] update pipeline APIs
      8f11c611
  2. May 21, 2015
    • Mike Dusenberry's avatar
      [DOCS] [MLLIB] Fixing broken link in MLlib Linear Methods documentation. · e4136ea6
      Mike Dusenberry authored
      Just a small change: fixed a broken link in the MLlib Linear Methods documentation by removing a newline character between the link title and link address.
      
      Author: Mike Dusenberry <dusenberrymw@gmail.com>
      
      Closes #6340 from dusenberrymw/Fix_MLlib_Linear_Methods_link and squashes the following commits:
      
      0a57818 [Mike Dusenberry] Fixing broken link in MLlib Linear Methods documentation.
      e4136ea6
    • Hari Shreedharan's avatar
      [SPARK-7657] [YARN] Add driver logs links in application UI, in cluster mode. · 956c4c91
      Hari Shreedharan authored
      This PR adds the URLs to the driver logs to `SparkListenerApplicationStarted` event, which is later used by the `ExecutorsListener` to populate the URLs to the driver logs in its own state. This info is then used when the UI is rendered to display links to the logs.
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #6166 from harishreedharan/am-log-link and squashes the following commits:
      
      943fc4f [Hari Shreedharan] Merge remote-tracking branch 'asf/master' into am-log-link
      9e5c04b [Hari Shreedharan] Merge remote-tracking branch 'asf/master' into am-log-link
      b3f9b9d [Hari Shreedharan] Updated comment based on feedback.
      0840a95 [Hari Shreedharan] Move the result and sc.stop back to original location, minor import changes.
      537a2f7 [Hari Shreedharan] Add test to ensure the log urls are populated and valid.
      4033725 [Hari Shreedharan] Adding comments explaining how node reports are used to get the log urls.
      6c5c285 [Hari Shreedharan] Import order.
      346f4ea [Hari Shreedharan] Review feedback fixes.
      629c1dc [Hari Shreedharan] Cleanup.
      99fb1a3 [Hari Shreedharan] Send the log urls in App start event, to ensure that other listeners are not affected.
      c0de336 [Hari Shreedharan] Ensure new unit test cleans up after itself.
      50cdae3 [Hari Shreedharan] Added unit test, made the approach generic.
      402e8e4 [Hari Shreedharan] Use `NodeReport` to get the URL for the logs. Also, make the environment variables generic so other cluster managers can use them as well.
      1cf338f [Hari Shreedharan] [SPARK-7657][YARN] Add driver link in application UI, in cluster mode.
      956c4c91
    • Xiangrui Meng's avatar
      [SPARK-7219] [MLLIB] Output feature attributes in HashingTF · 85b96372
      Xiangrui Meng authored
      This PR updates `HashingTF` to output ML attributes that tell the number of features in the output column. We need to expand `UnaryTransformer` to support output metadata. A `def outputMetadata: Metadata` is not sufficient because the metadata may also depend on the input data. Though this is not true for `HashingTF`, I think it is reasonable to update `UnaryTransformer` in a separate PR. `checkParams` is added to verify common requirements for params. I will send a separate PR to use it in other test suites. jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6308 from mengxr/SPARK-7219 and squashes the following commits:
      
      9bd2922 [Xiangrui Meng] address comments
      e82a68a [Xiangrui Meng] remove sqlContext from test suite
      995535b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7219
      2194703 [Xiangrui Meng] add test for attributes
      178ae23 [Xiangrui Meng] update HashingTF with tests
      91a6106 [Xiangrui Meng] WIP
      85b96372
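The hashing trick `HashingTF` relies on can be shown in a few lines of plain Python (an illustration only; the real implementation differs, and `zlib.crc32` is used here simply as a deterministic hash):

```python
import zlib


def hashing_tf(tokens, num_features=1 << 10):
    """Map token counts into a fixed-size vector via the hashing trick.

    The output dimension is num_features regardless of vocabulary size,
    which is exactly the attribute this PR exposes on the output column.
    """
    vec = [0] * num_features
    for t in tokens:
        idx = zlib.crc32(t.encode("utf-8")) % num_features
        vec[idx] += 1
    return vec
```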
    • Xiangrui Meng's avatar
      [SPARK-7794] [MLLIB] update RegexTokenizer default settings · f5db4b41
      Xiangrui Meng authored
      The previous default is `{gaps: false, pattern: "\\p{L}+|[^\\p{L}\\s]+"}`. The default pattern is hard to understand. This PR changes the default to `{gaps: true, pattern: "\\s+"}`. jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6330 from mengxr/SPARK-7794 and squashes the following commits:
      
      5ee7cde [Xiangrui Meng] update RegexTokenizer default settings
      f5db4b41
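The two modes can be illustrated with Python's `re` module (a sketch of the semantics, not the Scala implementation; Python regexes also lack `\p{L}`, which is part of why the old default pattern was hard to follow):

```python
import re


def regex_tokenize(text, gaps=True, pattern=r"\s+", min_token_length=1):
    """gaps=True splits on the pattern; gaps=False matches tokens with it."""
    tokens = re.split(pattern, text) if gaps else re.findall(pattern, text)
    return [t for t in tokens if len(t) >= min_token_length]
```

With the new default (`gaps=True`, `pattern=r"\s+"`) the input is simply split on whitespace, keeping punctuation attached to words.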
    • Davies Liu's avatar
      [SPARK-7783] [SQL] [PySpark] add DataFrame.rollup/cube in Python · 17791a58
      Davies Liu authored
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6311 from davies/rollup and squashes the following commits:
      
      0261db1 [Davies Liu] use @since
      a51ca6b [Davies Liu] Merge branch 'master' of github.com:apache/spark into rollup
      8ad5af4 [Davies Liu] Update dataframe.py
      ade3841 [Davies Liu] add DataFrame.rollup/cube in Python
      17791a58
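The semantics of `rollup` and `cube` can be sketched without Spark as grouping-sets aggregation over plain dicts (a toy model with sum as the only aggregate; names are hypothetical, not the DataFrame API):

```python
from itertools import combinations


def grouping_sets(rows, cols, sets, agg_col):
    """Sum agg_col once per grouping set; None marks a rolled-up column."""
    out = []
    for gset in sets:
        totals = {}
        for r in rows:
            key = tuple(r[c] if c in gset else None for c in cols)
            totals[key] = totals.get(key, 0) + r[agg_col]
        out.extend(totals.items())
    return out


def rollup(rows, cols, agg_col):
    """GROUP BY ROLLUP: full grouping, then each prefix, down to the grand total."""
    sets = [tuple(cols[:i]) for i in range(len(cols), -1, -1)]
    return grouping_sets(rows, cols, sets, agg_col)


def cube(rows, cols, agg_col):
    """GROUP BY CUBE: every subset of the grouping columns."""
    sets = [s for k in range(len(cols), -1, -1) for s in combinations(cols, k)]
    return grouping_sets(rows, cols, sets, agg_col)
```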
    • Tathagata Das's avatar
      [SPARK-7776] [STREAMING] Added shutdown hook to StreamingContext · d68ea24d
      Tathagata Das authored
      A shutdown hook to stop the SparkContext was added recently. This results in ugly errors when a streaming application is terminated with Ctrl-C.
      
      ```
      Exception in thread "Thread-27" org.apache.spark.SparkException: Job cancelled because SparkContext was shut down
      	at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:736)
      	at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:735)
      	at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
      	at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:735)
      	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1468)
      	at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
      	at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1403)
      	at org.apache.spark.SparkContext.stop(SparkContext.scala:1642)
      	at org.apache.spark.SparkContext$$anonfun$3.apply$mcV$sp(SparkContext.scala:559)
      	at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2266)
      	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Utils.scala:2236)
      	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2236)
      	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2236)
      	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1764)
      	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(Utils.scala:2236)
      	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2236)
      	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2236)
      	at scala.util.Try$.apply(Try.scala:161)
      	at org.apache.spark.util.SparkShutdownHookManager.runAll(Utils.scala:2236)
      	at org.apache.spark.util.SparkShutdownHookManager$$anon$6.run(Utils.scala:2218)
      	at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
      ```
      
      This is because Spark's shutdown hook stops the context, and the streaming jobs fail in the middle. The correct solution is to stop the streaming context before the Spark context. This PR adds a shutdown hook to do so, with a priority higher than that of the SparkContext's shutdown hook.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6307 from tdas/SPARK-7776 and squashes the following commits:
      
      e3d5475 [Tathagata Das] Added conf to specify graceful shutdown
      4c18652 [Tathagata Das] Added shutdown hook to StreamingContext.
      d68ea24d
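The ordering trick, running the higher-priority hook first so streaming stops before the context, can be sketched in a few lines (a toy model in Python; Spark's actual `SparkShutdownHookManager` is in Scala and the names below are hypothetical):

```python
class ShutdownHookManager:
    """Run registered shutdown hooks in descending priority order, so a
    higher-priority hook (stop streaming) runs before a lower one (stop Spark)."""

    def __init__(self):
        self._hooks = []

    def add_hook(self, priority, fn):
        self._hooks.append((priority, fn))

    def run_all(self):
        for _, fn in sorted(self._hooks, key=lambda h: -h[0]):
            fn()
```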
    • Yin Huai's avatar
      [SPARK-7737] [SQL] Use leaf dirs having data files to discover partitions. · 347b5010
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-7737
      
      cc liancheng
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #6329 from yhuai/spark-7737 and squashes the following commits:
      
      7e0dfc7 [Yin Huai] Use leaf dirs having data files to discover partitions.
      347b5010
    • Yin Huai's avatar
      [BUILD] Always run SQL tests in master build. · 147b6be3
      Yin Huai authored
      It seems our master build does not run HiveCompatibilitySuite (because _RUN_SQL_TESTS is not set). This PR introduces a property `AMP_JENKINS_PRB` to differentiate a PR build from a regular build. If a build is a regular one, we always set _RUN_SQL_TESTS to true.
      
      cc JoshRosen nchammas
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #5955 from yhuai/runSQLTests and squashes the following commits:
      
      3d399bc [Yin Huai] Always run SQL tests in master build.
      147b6be3
    • Liang-Chi Hsieh's avatar
      [SPARK-7800] isDefined should not marked too early in putNewKey · 5a3c04bb
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7800
      
      `isDefined` is marked as true twice in `Location.putNewKey`. The first one is unnecessary and causes problems because it happens too early, before some assert checks. E.g., if an attempt with an incorrect `keyLengthBytes` marks `isDefined` as true, the location cannot be used later.
      
      ping JoshRosen
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #6324 from viirya/dup_isdefined and squashes the following commits:
      
      cbfe03b [Liang-Chi Hsieh] isDefined should not marked too early in putNewKey.
      5a3c04bb
    • Andrew Or's avatar
      [SPARK-7718] [SQL] Speed up partitioning by avoiding closure cleaning · 5287eec5
      Andrew Or authored
      According to yhuai we spent 6-7 seconds cleaning closures in a partitioning job that takes 12 seconds. Since we provide these closures in Spark we know for sure they are serializable, so we can bypass the cleaning.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6256 from andrewor14/sql-partition-speed-up and squashes the following commits:
      
      a82b451 [Andrew Or] Fix style
      10f7e3e [Andrew Or] Avoid getting call sites and cleaning closures
      17e2943 [Andrew Or] Merge branch 'master' of github.com:apache/spark into sql-partition-speed-up
      523f042 [Andrew Or] Skip unnecessary Utils.getCallSites too
      f7fe143 [Andrew Or] Avoid unnecessary closure cleaning
      5287eec5
    • Holden Karau's avatar
      [SPARK-7711] Add a startTime property to match the corresponding one in Scala · 6b18cdc1
      Holden Karau authored
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #6275 from holdenk/SPARK-771-startTime-is-missing-from-pyspark and squashes the following commits:
      
      06662dc [Holden Karau] add missing blank line for style checks
      7a87410 [Holden Karau] add back missing newline
      7a7876b [Holden Karau] Add a startTime property to match the corresponding one in the Scala SparkContext
      6b18cdc1
    • Tathagata Das's avatar
      [SPARK-7478] [SQL] Added SQLContext.getOrCreate · 3d0cccc8
      Tathagata Das authored
      Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like:
      
      1. In a REPL/notebook environment, rerunning the line `val sqlContext = new SQLContext` multiple times creates different contexts while overriding the reference to the previous context, leading to issues like registered temp tables going missing.
      
      2. In Streaming, creating a SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. To get around this problem I had to suggest creating a singleton instance - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala
      
      This can be solved by `SQLContext.getOrCreate`, which gets or creates a singleton instance of SQLContext using either a given SparkContext or a given SparkConf.
      
      rxin marmbrus
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6006 from tdas/SPARK-7478 and squashes the following commits:
      
      25f4da9 [Tathagata Das] Addressed comments.
      79fe069 [Tathagata Das] Added comments.
      c66ca76 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7478
      48adb14 [Tathagata Das] Removed HiveContext.getOrCreate
      bf8cf50 [Tathagata Das] Fix more bug
      dec5594 [Tathagata Das] Fixed bug
      b4e9721 [Tathagata Das] Remove unnecessary import
      4ef513b [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7478
      d3ea8e4 [Tathagata Das] Added HiveContext
      83bc950 [Tathagata Das] Updated tests
      f82ae81 [Tathagata Das] Fixed test
      bc72868 [Tathagata Das] Added SQLContext.getOrCreate
      3d0cccc8
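The get-or-create singleton pattern described above looks roughly like this (a minimal Python sketch of the idea, not Spark's Scala implementation; the class and argument names are hypothetical):

```python
import threading


class SQLContextLike:
    """Lazily instantiated shared instance, in the spirit of SQLContext.getOrCreate."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self, spark_context):
        self.spark_context = spark_context

    @classmethod
    def get_or_create(cls, spark_context):
        # Create the singleton on first use; later calls return the same object,
        # so re-running this line in a REPL no longer produces a fresh context.
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls(spark_context)
            return cls._instance
```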
    • Yin Huai's avatar
      [SPARK-7763] [SPARK-7616] [SQL] Persists partition columns into metastore · 30f3f556
      Yin Huai authored
      Author: Yin Huai <yhuai@databricks.com>
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6285 from liancheng/spark-7763 and squashes the following commits:
      
      bb2829d [Yin Huai] Fix hashCode.
      d677f7d [Cheng Lian] Fixes Scala style issue
      44b283f [Cheng Lian] Adds test case for SPARK-7616
      6733276 [Yin Huai] Fix a bug that potentially causes https://issues.apache.org/jira/browse/SPARK-7616.
      6cabf3c [Yin Huai] Update unit test.
      7e02910 [Yin Huai] Use metastore partition columns and do not hijack maybePartitionSpec.
      e9a03ec [Cheng Lian] Persists partition columns into metastore
      30f3f556
    • Tathagata Das's avatar
      [SPARK-7722] [STREAMING] Added Kinesis to style checker · 311fab6f
      Tathagata Das authored
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6325 from tdas/SPARK-7722 and squashes the following commits:
      
      9ab35b2 [Tathagata Das] Fixed styles in Kinesis
      311fab6f
    • Xiangrui Meng's avatar
      [SPARK-7498] [MLLIB] add varargs back to setDefault · cdc7c055
      Xiangrui Meng authored
      We removed `varargs` due to Java compilation issues. That was a false alarm because I didn't run `build/sbt clean`. So this PR reverts the changes. jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6320 from mengxr/SPARK-7498 and squashes the following commits:
      
      74a7259 [Xiangrui Meng] add varargs back to setDefault
      cdc7c055
    • Joseph K. Bradley's avatar
      [SPARK-7585] [ML] [DOC] VectorIndexer user guide section · 6d75ed7e
      Joseph K. Bradley authored
      Added VectorIndexer section to ML user guide.  Also added javaCategoryMaps() method and Java unit test for it.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6255 from jkbradley/vector-indexer-guide and squashes the following commits:
      
      dbb8c4c [Joseph K. Bradley] simplified VectorIndexerModel.javaCategoryMaps
      f692084 [Joseph K. Bradley] Added VectorIndexer section to ML user guide.  Also added javaCategoryMaps() method and Java unit test for it.
      6d75ed7e
    • Andrew Or's avatar
      [SPARK-7775] YARN AM negative sleep exception · 15680aee
      Andrew Or authored
      ```
      SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
      SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
      Exception in thread "Reporter" java.lang.IllegalArgumentException: timeout value is negative
        at java.lang.Thread.sleep(Native Method)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:356)
      ```
      This kills the reporter thread. This is caused by #6082 (merged into master branch only).
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6305 from andrewor14/yarn-negative-sleep and squashes the following commits:
      
      b970770 [Andrew Or] Use existing cap
      56d6e5e [Andrew Or] Avoid negative sleep
      15680aee
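The fix pattern for this class of bug is simply clamping the computed timeout before sleeping (a sketch of the idea in Python; the parameter names and cap are illustrative, not Spark's actual ApplicationMaster code):

```python
def next_sleep_interval(base_interval_ms, elapsed_ms, cap_ms=60_000):
    """Return a sleep timeout that is never negative and never exceeds the cap.

    Thread.sleep raises IllegalArgumentException on a negative timeout, which
    is what killed the reporter thread here.
    """
    return min(max(0, base_interval_ms - elapsed_ms), cap_ms)
```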
    • scwf's avatar
      [SQL] [TEST] udf_java_method failed due to jdk version · f6c486aa
      scwf authored
      java.lang.Math.exp(1.0) gives different results across JDK versions, so do not use createQueryTest; write a separate test for it instead.
      ```
      jdk version     result
      1.7.0_11        2.7182818284590455
      1.7.0_05        2.7182818284590455
      1.7.0_71        2.718281828459045
      ```
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #6274 from scwf/java_method and squashes the following commits:
      
      3dd2516 [scwf] address comments
      5fa1459 [scwf] style
      df46445 [scwf] fix test error
      fcb6d22 [scwf] fix udf_java_method
      f6c486aa
    • Shuo Xiang's avatar
      [SPARK-7793] [MLLIB] Use getOrElse for getting the threshold of SVM model · 4f572008
      Shuo Xiang authored
      Same issue and fix as in SPARK-7694.
      
      Author: Shuo Xiang <shuoxiangpub@gmail.com>
      
      Closes #6321 from coderxiang/nb and squashes the following commits:
      
      a5e6de4 [Shuo Xiang] use getOrElse for svmmodel.tostring
      2cb0177 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into nb
      5f109b4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
      c5c5bfe [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
      98804c9 [Shuo Xiang] fix bug in topBykey and update test
      4f572008
    • kaka1992's avatar
      [SPARK-7394][SQL] Add Pandas style cast (astype) · 699906e5
      kaka1992 authored
      Author: kaka1992 <kaka_1992@163.com>
      
      Closes #6313 from kaka1992/astype and squashes the following commits:
      
      73dfd0b [kaka1992] [SPARK-7394] Add Pandas style cast (astype)
      ad8feb2 [kaka1992] [SPARK-7394] Add Pandas style cast (astype)
      4f328b7 [kaka1992] [SPARK-7394] Add Pandas style cast (astype)
      699906e5
    • Sean Owen's avatar
      [SPARK-6416] [DOCS] RDD.fold() requires the operator to be commutative · 6e534026
      Sean Owen authored
      Document current limitation of rdd.fold.
      
      This does not resolve SPARK-6416 but just documents the issue.
      CC JoshRosen
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #6231 from srowen/SPARK-6416 and squashes the following commits:
      
      9fef39f [Sean Owen] Add comment to other languages; reword to highlight the difference from non-distributed collections and to not suggest it is a bug that is to be fixed
      da40d84 [Sean Owen] Document current limitation of rdd.fold.
      6e534026
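The caveat being documented can be simulated without Spark: `fold` applies the zero value once per partition and once more when merging the per-partition results, so a non-neutral zero (or a non-commutative operator) gives a partition-dependent answer. A minimal sketch (the function name is hypothetical):

```python
from functools import reduce


def distributed_fold(partitions, zero, op):
    """Mimic RDD.fold: fold each partition with the zero value, then fold
    the per-partition results, again starting from the zero value."""
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)
```

With `zero=0` and addition this matches a local fold; with `zero=1` the zero gets applied three times over two partitions (once per partition plus once in the merge), unlike the single application of a local fold.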
    • Tathagata Das's avatar
      [SPARK-7787] [STREAMING] Fix serialization issue of SerializableAWSCredentials · 4b7ff309
      Tathagata Das authored
      The lack of a default constructor causes deserialization to fail. This occurs only when the AWS credentials are explicitly specified through KinesisUtils.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6316 from tdas/SPARK-7787 and squashes the following commits:
      
      248ca5c [Tathagata Das] Fixed serializability
      4b7ff309
    • Cheng Lian's avatar
      [SPARK-7749] [SQL] Fixes partition discovery for non-partitioned tables · 8730fbb4
      Cheng Lian authored
      When no partition columns can be found, we should have an empty `PartitionSpec`, rather than a `PartitionSpec` with empty partition columns.
      
      This PR together with #6285 should fix SPARK-7749.
      
      Author: Cheng Lian <lian@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #6287 from liancheng/spark-7749 and squashes the following commits:
      
      a799ff3 [Cheng Lian] Adds test cases for SPARK-7749
      c4949be [Cheng Lian] Minor refactoring, and tolerant _TEMPORARY directory name
      5aa87ea [Yin Huai] Make parsePartitions more robust.
      fc56656 [Cheng Lian] Returns empty PartitionSpec if no partition columns can be inferred
      19ae41e [Cheng Lian] Don't list base directory as leaf directory
      8730fbb4
    • Xiangrui Meng's avatar
      [SPARK-7752] [MLLIB] Use lowercase letters for NaiveBayes.modelType · 13348e21
      Xiangrui Meng authored
      to be consistent with other string names in MLlib. This PR also updates the implementation to use vals instead of hardcoded strings. jkbradley leahmcguire
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6277 from mengxr/SPARK-7752 and squashes the following commits:
      
      f38b662 [Xiangrui Meng] add another case _ back in test
      ae5c66a [Xiangrui Meng] model type -> modelType
      711d1c6 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7752
      40ae53e [Xiangrui Meng] fix Java test suite
      264a814 [Xiangrui Meng] add case _ back
      3c456a8 [Xiangrui Meng] update NB user guide
      17bba53 [Xiangrui Meng] update naive Bayes to use lowercase model type strings
      13348e21
    • Davies Liu's avatar
      [SPARK-7565] [SQL] fix MapType in JsonRDD · a25c1ab8
      Davies Liu authored
      The keys of a Map in JsonRDD should be converted into UTF8String (as should failed records). Thanks to yhuai and viirya.
      
      Closes #6084
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6299 from davies/string_in_json and squashes the following commits:
      
      0dbf559 [Davies Liu] improve test, fix corrupt record
      6836a80 [Davies Liu] move unit tests into Scala
      b97af11 [Davies Liu] fix MapType in JsonRDD
      a25c1ab8
    • Cheng Hao's avatar
      [SPARK-7320] [SQL] [Minor] Move the testData into beforeAll() · feb3a9d3
      Cheng Hao authored
      Follow-up of #6340, to avoid the test report going missing when it fails.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #6312 from chenghao-intel/rollup_minor and squashes the following commits:
      
      b03a25f [Cheng Hao] simplify the testData instantiation
      09b7e8b [Cheng Hao] move the testData into beforeAll()
      feb3a9d3
    • Burak Yavuz's avatar
      [SPARK-7745] Change asserts to requires for user input checks in Spark Streaming · 1ee8eb43
      Burak Yavuz authored
      Assertions can be turned off. `require` throws an `IllegalArgumentException`, which makes more sense when validating a user-set variable.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #6271 from brkyvz/streaming-require and squashes the following commits:
      
      d249484 [Burak Yavuz] fix merge conflict
      264adb8 [Burak Yavuz] addressed comments v1.0
      6161350 [Burak Yavuz] fix tests
      16aa766 [Burak Yavuz] changed more assertions to more meaningful errors
      afd923d [Burak Yavuz] changed some assertions to require
      1ee8eb43
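The same distinction exists in Python: `assert` statements are stripped under `python -O`, so user-input checks should raise an explicit exception instead. A sketch of the pattern (the `require` helper and the streaming-flavored example are illustrative, not PySpark API):

```python
def require(condition, message):
    """Validate user input with an exception that survives optimization flags,
    unlike `assert`, which is removed when Python runs with -O."""
    if not condition:
        raise ValueError(message)


def set_batch_interval(seconds):
    require(seconds > 0, f"batch interval must be positive, got {seconds}")
    return seconds
```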
    • Xiangrui Meng's avatar
      [SPARK-7753] [MLLIB] Update KernelDensity API · 947ea1cf
      Xiangrui Meng authored
      Update `KernelDensity` API to make it extensible to different kernels in the future. `bandwidth` is used instead of `standardDeviation`. The static `kernelDensity` method is removed from `Statistics`. The implementation is updated using BLAS, while the algorithm remains the same. sryza srowen
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6279 from mengxr/SPARK-7753 and squashes the following commits:
      
      4cdfadc [Xiangrui Meng] add example code in the doc
      767fd5a [Xiangrui Meng] update KernelDensity API
      947ea1cf
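A Gaussian kernel density estimate parameterized by `bandwidth` is short enough to write out directly (a plain-Python sketch of the math, not MLlib's BLAS-based implementation):

```python
import math


def kernel_density(samples, points, bandwidth):
    """Gaussian KDE: average of a Gaussian kernel of width `bandwidth`
    centered at each sample, evaluated at each query point."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in points
    ]
```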
    • Davies Liu's avatar
      [SPARK-7606] [SQL] [PySpark] add version to Python SQL API docs · 8ddcb25b
      Davies Liu authored
      Add version info for public Python SQL API.
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6295 from davies/versions and squashes the following commits:
      
      cfd91e6 [Davies Liu] add more version for DataFrame API
      600834d [Davies Liu] add version to SQL API docs
      8ddcb25b
    • Mingfei's avatar
      [SPARK-7389] [CORE] Tachyon integration improvement · 04940c49
      Mingfei authored
      Two main changes:
      
      Add two functions to ExternalBlockManager, putValues and getValues, because the implementation may not rely on putBytes and getBytes.
      
      Improve Tachyon integration. Currently, when putting data into Tachyon, Spark first serializes all data in one partition into a ByteBuffer and then writes it into Tachyon; this uses a lot of memory and increases GC overhead.
      
      When getting data from Tachyon, getValues depends on getBytes, which also reads all data into an on-heap byte array, resulting in high memory usage. This PR changes the approach of the two functions, making them read/write data via streams to reduce memory usage.
      
      In our testing, when the data size is huge, this patch reduces GC time by about 30% and full GC time by about 70%, and total execution time by about 10%.
      
      Author: Mingfei <mingfei.shi@intel.com>
      
      Closes #5908 from shimingfei/Tachyon-integration-rebase and squashes the following commits:
      
      033bc57 [Mingfei] modify according to comments
      747c69a [Mingfei] modify according to comments - format changes
      ce52c67 [Mingfei] put close() in a finally block
      d2c60bb [Mingfei] modify according to comments, some code style change
      4c11591 [Mingfei] modify according to comments split putIntoExternalBlockStore into two functions add default implementation for getValues and putValues
      cc0a32e [Mingfei] Make getValues read data from Tachyon by stream Make putValues write data to Tachyon by stream
      017593d [Mingfei] add getValues and putValues in ExternalBlockManager's Interface
      04940c49
    • Liang-Chi Hsieh's avatar
      [SPARK-7746][SQL] Add FetchSize parameter for JDBC driver · d0eb9ffe
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7746
      
      This looks like an easy-to-add parameter, but it can yield a significant performance improvement if the JDBC driver supports it.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #6283 from viirya/jdbc_fetchsize and squashes the following commits:
      
      de47f94 [Liang-Chi Hsieh] Don't keep fetchSize as single parameter.
      b7bff2f [Liang-Chi Hsieh] Add FetchSize parameter for JDBC driver.
      d0eb9ffe