  1. May 21, 2015
    • Xiangrui Meng's avatar
      [SPARK-7219] [MLLIB] Output feature attributes in HashingTF · 85b96372
      Xiangrui Meng authored
      This PR updates `HashingTF` to output ML attributes that tell the number of features in the output column. We need to expand `UnaryTransformer` to support output metadata. A `def outputMetadata: Metadata` is not sufficient because the metadata may also depend on the input data. Though this is not true for `HashingTF`, I think it is reasonable to update `UnaryTransformer` in a separate PR. `checkParams` is added to verify common requirements for params. I will send a separate PR to use it in other test suites. jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6308 from mengxr/SPARK-7219 and squashes the following commits:
      
      9bd2922 [Xiangrui Meng] address comments
      e82a68a [Xiangrui Meng] remove sqlContext from test suite
      995535b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7219
      2194703 [Xiangrui Meng] add test for attributes
      178ae23 [Xiangrui Meng] update HashingTF with tests
      91a6106 [Xiangrui Meng] WIP
      85b96372
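
The hashing trick behind `HashingTF` can be sketched in a few lines of plain Python. The function name and the use of Python's built-in `hash()` are illustrative stand-ins, not Spark's actual implementation:

```python
# Minimal sketch of hashing-trick term frequencies. Python's hash()
# stands in for Spark's hash function and is not identical to it.
def hashing_tf(terms, num_features=16):
    """Map terms to a sparse {index: count} vector of size num_features."""
    counts = {}
    for term in terms:
        idx = hash(term) % num_features  # nonnegative bucket index
        counts[idx] = counts.get(idx, 0) + 1
    return counts

vec = hashing_tf(["a", "b", "a"], num_features=8)
# num_features (8 here) is the piece of information this PR attaches
# to the output column as ML attribute metadata.
```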
    • Xiangrui Meng's avatar
      [SPARK-7794] [MLLIB] update RegexTokenizer default settings · f5db4b41
      Xiangrui Meng authored
      The previous default is `{gaps: false, pattern: "\\p{L}+|[^\\p{L}\\s]+"}`. The default pattern is hard to understand. This PR changes the default to `{gaps: true, pattern: "\\s+"}`. jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6330 from mengxr/SPARK-7794 and squashes the following commits:
      
      5ee7cde [Xiangrui Meng] update RegexTokenizer default settings
      f5db4b41
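
The difference between the old and new defaults can be sketched with plain Python `re`. Python's `re` lacks `\p{L}`, so `[A-Za-z]` stands in for it here; nothing below is Spark's actual API:

```python
import re

# gaps=True: the pattern matches separators (the new default "\\s+").
# gaps=False: the pattern matches the tokens themselves (old default).
def tokenize(text, pattern, gaps):
    if gaps:
        return [t for t in re.split(pattern, text) if t]
    return re.findall(pattern, text)

old_style = tokenize("Don't panic!", r"[A-Za-z]+|[^A-Za-z\s]+", gaps=False)
new_style = tokenize("Don't panic!", r"\s+", gaps=True)
# old_style -> ["Don", "'", "t", "panic", "!"]; new_style -> ["Don't", "panic!"]
```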
    • Davies Liu's avatar
      [SPARK-7783] [SQL] [PySpark] add DataFrame.rollup/cube in Python · 17791a58
      Davies Liu authored
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6311 from davies/rollup and squashes the following commits:
      
      0261db1 [Davies Liu] use @since
      a51ca6b [Davies Liu] Merge branch 'master' of github.com:apache/spark into rollup
      8ad5af4 [Davies Liu] Update dataframe.py
      ade3841 [Davies Liu] add DataFrame.rollup/cube in Python
      17791a58
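
A minimal sketch of the grouping sets that `rollup` and `cube` conceptually expand to, in plain Python (column names are illustrative; nothing here calls Spark):

```python
from itertools import combinations

def rollup_sets(cols):
    # rollup(a, b) groups by (a, b), (a), and ()
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

def cube_sets(cols):
    # cube(a, b) groups by every subset of the columns
    return [subset for r in range(len(cols), -1, -1)
            for subset in combinations(cols, r)]

rollup_sets(["a", "b"])  # [("a", "b"), ("a",), ()]
cube_sets(["a", "b"])    # [("a", "b"), ("a",), ("b",), ()]
```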
    • Tathagata Das's avatar
      [SPARK-7776] [STREAMING] Added shutdown hook to StreamingContext · d68ea24d
      Tathagata Das authored
      A shutdown hook to stop the SparkContext was added recently. This results in ugly errors when a streaming application is terminated by Ctrl-C.
      
      ```
      Exception in thread "Thread-27" org.apache.spark.SparkException: Job cancelled because SparkContext was shut down
      	at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:736)
      	at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:735)
      	at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
      	at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:735)
      	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1468)
      	at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
      	at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1403)
      	at org.apache.spark.SparkContext.stop(SparkContext.scala:1642)
      	at org.apache.spark.SparkContext$$anonfun$3.apply$mcV$sp(SparkContext.scala:559)
      	at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2266)
      	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Utils.scala:2236)
      	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2236)
      	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2236)
      	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1764)
      	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(Utils.scala:2236)
      	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2236)
      	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2236)
      	at scala.util.Try$.apply(Try.scala:161)
      	at org.apache.spark.util.SparkShutdownHookManager.runAll(Utils.scala:2236)
      	at org.apache.spark.util.SparkShutdownHookManager$$anon$6.run(Utils.scala:2218)
      	at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
      ```
      
      This happens because Spark's shutdown hook stops the context, and the streaming jobs fail in the middle. The correct solution is to stop the streaming context before the Spark context. This PR adds a shutdown hook to do so, with a priority higher than that of the SparkContext's shutdown hook.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6307 from tdas/SPARK-7776 and squashes the following commits:
      
      e3d5475 [Tathagata Das] Added conf to specify graceful shutdown
      4c18652 [Tathagata Das] Added shutdown hook to StreamingContxt.
      d68ea24d
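
The priority ordering the PR relies on can be sketched with a toy hook registry in Python; the class name and priority values below are illustrative, not Spark's:

```python
# Toy priority-ordered shutdown hooks: higher priority runs first,
# so the streaming context stops before the Spark context.
class ShutdownHookManager:
    def __init__(self):
        self.hooks = []  # list of (priority, callable)

    def add(self, priority, fn):
        self.hooks.append((priority, fn))

    def run_all(self):
        for _, fn in sorted(self.hooks, key=lambda h: -h[0]):
            fn()

order = []
mgr = ShutdownHookManager()
mgr.add(50, lambda: order.append("stop SparkContext"))
mgr.add(51, lambda: order.append("stop StreamingContext"))
mgr.run_all()
# order == ["stop StreamingContext", "stop SparkContext"]
```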
    • Yin Huai's avatar
      [SPARK-7737] [SQL] Use leaf dirs having data files to discover partitions. · 347b5010
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-7737
      
      cc liancheng
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #6329 from yhuai/spark-7737 and squashes the following commits:
      
      7e0dfc7 [Yin Huai] Use leaf dirs having data files to discover partitions.
      347b5010
    • Yin Huai's avatar
      [BUILD] Always run SQL tests in master build. · 147b6be3
      Yin Huai authored
      It seems our master build does not run `HiveCompatibilitySuite` (because `_RUN_SQL_TESTS` is not set). This PR introduces a property `AMP_JENKINS_PRB` to differentiate a PR build from a regular build. If a build is a regular one, we always set `_RUN_SQL_TESTS` to true.
      
      cc JoshRosen nchammas
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #5955 from yhuai/runSQLTests and squashes the following commits:
      
      3d399bc [Yin Huai] Always run SQL tests in master build.
      147b6be3
    • Liang-Chi Hsieh's avatar
      [SPARK-7800] isDefined should not marked too early in putNewKey · 5a3c04bb
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7800
      
      `isDefined` is marked as true twice in `Location.putNewKey`. The first one is unnecessary and will cause problems because it happens too early, before some assert checks. E.g., if an attempt with an incorrect `keyLengthBytes` marks `isDefined` as true, the location cannot be used later.
      
      ping JoshRosen
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #6324 from viirya/dup_isdefined and squashes the following commits:
      
      cbfe03b [Liang-Chi Hsieh] isDefined should not marked too early in putNewKey.
      5a3c04bb
    • Andrew Or's avatar
      [SPARK-7718] [SQL] Speed up partitioning by avoiding closure cleaning · 5287eec5
      Andrew Or authored
      According to yhuai we spent 6-7 seconds cleaning closures in a partitioning job that takes 12 seconds. Since we provide these closures in Spark we know for sure they are serializable, so we can bypass the cleaning.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6256 from andrewor14/sql-partition-speed-up and squashes the following commits:
      
      a82b451 [Andrew Or] Fix style
      10f7e3e [Andrew Or] Avoid getting call sites and cleaning closures
      17e2943 [Andrew Or] Merge branch 'master' of github.com:apache/spark into sql-partition-speed-up
      523f042 [Andrew Or] Skip unnecessary Utils.getCallSites too
      f7fe143 [Andrew Or] Avoid unnecessary closure cleaning
      5287eec5
    • Holden Karau's avatar
      [SPARK-7711] Add a startTime property to match the corresponding one in Scala · 6b18cdc1
      Holden Karau authored
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #6275 from holdenk/SPARK-771-startTime-is-missing-from-pyspark and squashes the following commits:
      
      06662dc [Holden Karau] add mising blank line for style checks
      7a87410 [Holden Karau] add back missing newline
      7a7876b [Holden Karau] Add a startTime property to match the corresponding one in the Scala SparkContext
      6b18cdc1
    • Tathagata Das's avatar
      [SPARK-7478] [SQL] Added SQLContext.getOrCreate · 3d0cccc8
      Tathagata Das authored
      Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like:
      
      1. In REPL/notebook environments, rerunning the line `val sqlContext = new SQLContext` multiple times created different contexts while overriding the reference to the previous context, leading to issues like registered temp tables going missing.
      
      2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. To get around this problem I previously had to suggest creating a singleton instance - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala
      
      This can be solved by `SQLContext.getOrCreate`, which gets or creates a singleton instance of SQLContext using either a given SparkContext or a given SparkConf.
      
      rxin marmbrus
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6006 from tdas/SPARK-7478 and squashes the following commits:
      
      25f4da9 [Tathagata Das] Addressed comments.
      79fe069 [Tathagata Das] Added comments.
      c66ca76 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7478
      48adb14 [Tathagata Das] Removed HiveContext.getOrCreate
      bf8cf50 [Tathagata Das] Fix more bug
      dec5594 [Tathagata Das] Fixed bug
      b4e9721 [Tathagata Das] Remove unnecessary import
      4ef513b [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7478
      d3ea8e4 [Tathagata Das] Added HiveContext
      83bc950 [Tathagata Das] Updated tests
      f82ae81 [Tathagata Das] Fixed test
      bc72868 [Tathagata Das] Added SQLContext.getOrCreate
      3d0cccc8
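
The getOrCreate singleton pattern the PR adds can be sketched in pure Python (the class name and locking here are illustrative, not Spark's API):

```python
import threading

# Reuse one lazily created instance per process; a lock guards
# concurrent first-time creation.
class Context:
    _instance = None
    _lock = threading.Lock()

    @classmethod
    def get_or_create(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

a = Context.get_or_create()
b = Context.get_or_create()
# a is b: rerunning the line in a REPL no longer creates a new context
```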
    • Yin Huai's avatar
      [SPARK-7763] [SPARK-7616] [SQL] Persists partition columns into metastore · 30f3f556
      Yin Huai authored
      Author: Yin Huai <yhuai@databricks.com>
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6285 from liancheng/spark-7763 and squashes the following commits:
      
      bb2829d [Yin Huai] Fix hashCode.
      d677f7d [Cheng Lian] Fixes Scala style issue
      44b283f [Cheng Lian] Adds test case for SPARK-7616
      6733276 [Yin Huai] Fix a bug that potentially causes https://issues.apache.org/jira/browse/SPARK-7616.
      6cabf3c [Yin Huai] Update unit test.
      7e02910 [Yin Huai] Use metastore partition columns and do not hijack maybePartitionSpec.
      e9a03ec [Cheng Lian] Persists partition columns into metastore
      30f3f556
    • Tathagata Das's avatar
      [SPARK-7722] [STREAMING] Added Kinesis to style checker · 311fab6f
      Tathagata Das authored
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6325 from tdas/SPARK-7722 and squashes the following commits:
      
      9ab35b2 [Tathagata Das] Fixed styles in Kinesis
      311fab6f
    • Xiangrui Meng's avatar
      [SPARK-7498] [MLLIB] add varargs back to setDefault · cdc7c055
      Xiangrui Meng authored
      We removed `varargs` due to Java compilation issues. That was a false alarm because I didn't run `build/sbt clean`. So this PR reverts the changes. jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6320 from mengxr/SPARK-7498 and squashes the following commits:
      
      74a7259 [Xiangrui Meng] add varargs back to setDefault
      cdc7c055
    • Joseph K. Bradley's avatar
      [SPARK-7585] [ML] [DOC] VectorIndexer user guide section · 6d75ed7e
      Joseph K. Bradley authored
      Added VectorIndexer section to ML user guide.  Also added javaCategoryMaps() method and Java unit test for it.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6255 from jkbradley/vector-indexer-guide and squashes the following commits:
      
      dbb8c4c [Joseph K. Bradley] simplified VectorIndexerModel.javaCategoryMaps
      f692084 [Joseph K. Bradley] Added VectorIndexer section to ML user guide.  Also added javaCategoryMaps() method and Java unit test for it.
      6d75ed7e
    • Andrew Or's avatar
      [SPARK-7775] YARN AM negative sleep exception · 15680aee
      Andrew Or authored
      ```
      SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
      SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
      Exception in thread "Reporter" java.lang.IllegalArgumentException: timeout value is negative
        at java.lang.Thread.sleep(Native Method)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:356)
      ```
      This kills the reporter thread. This is caused by #6082 (merged into master branch only).
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6305 from andrewor14/yarn-negative-sleep and squashes the following commits:
      
      b970770 [Andrew Or] Use existing cap
      56d6e5e [Andrew Or] Avoid negative sleep
      15680aee
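
The fix amounts to clamping the computed sleep interval so it can never go negative when the reporter thread falls behind schedule; a minimal sketch (names illustrative):

```python
# Clamp the remaining interval instead of passing a negative value
# to sleep(), which raises IllegalArgumentException on the JVM.
def next_sleep_ms(interval_ms, elapsed_ms):
    return max(0, interval_ms - elapsed_ms)

next_sleep_ms(1000, 1500)  # 0, instead of attempting sleep(-500)
```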
    • scwf's avatar
      [SQL] [TEST] udf_java_method failed due to jdk version · f6c486aa
      scwf authored
      `java.lang.Math.exp(1.0)` returns different results across JDK versions, so do not use `createQueryTest`; write a separate test for it.
      ```
      jdk version   	result
      1.7.0_11		2.7182818284590455
      1.7.0_05        2.7182818284590455
      1.7.0_71		2.718281828459045
      ```
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #6274 from scwf/java_method and squashes the following commits:
      
      3dd2516 [scwf] address comments
      5fa1459 [scwf] style
      df46445 [scwf] fix test error
      fcb6d22 [scwf] fix udf_java_method
      f6c486aa
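
The underlying lesson is that floating-point results should be compared with a tolerance rather than exact string equality; a quick check in Python using the values from the table above:

```python
import math

# Both JDK results differ from math.e only in the last ulp, so a
# relative-tolerance comparison accepts either.
jdk_results = [2.7182818284590455, 2.718281828459045]
for r in jdk_results:
    assert math.isclose(r, math.e, rel_tol=1e-15)
```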
    • Shuo Xiang's avatar
      [SPARK-7793] [MLLIB] Use getOrElse for getting the threshold of SVM model · 4f572008
      Shuo Xiang authored
      Same issue and fix as in SPARK-7694.
      
      Author: Shuo Xiang <shuoxiangpub@gmail.com>
      
      Closes #6321 from coderxiang/nb and squashes the following commits:
      
      a5e6de4 [Shuo Xiang] use getOrElse for svmmodel.tostring
      2cb0177 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into nb
      5f109b4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
      c5c5bfe [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
      98804c9 [Shuo Xiang] fix bug in topBykey and update test
      4f572008
    • kaka1992's avatar
      [SPARK-7394][SQL] Add Pandas style cast (astype) · 699906e5
      kaka1992 authored
      Author: kaka1992 <kaka_1992@163.com>
      
      Closes #6313 from kaka1992/astype and squashes the following commits:
      
      73dfd0b [kaka1992] [SPARK-7394] Add Pandas style cast (astype)
      ad8feb2 [kaka1992] [SPARK-7394] Add Pandas style cast (astype)
      4f328b7 [kaka1992] [SPARK-7394] Add Pandas style cast (astype)
      699906e5
    • Sean Owen's avatar
      [SPARK-6416] [DOCS] RDD.fold() requires the operator to be commutative · 6e534026
      Sean Owen authored
      Document current limitation of rdd.fold.
      
      This does not resolve SPARK-6416 but just documents the issue.
      CC JoshRosen
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #6231 from srowen/SPARK-6416 and squashes the following commits:
      
      9fef39f [Sean Owen] Add comment to other languages; reword to highlight the difference from non-distributed collections and to not suggest it is a bug that is to be fixed
      da40d84 [Sean Owen] Document current limitation of rdd.fold.
      6e534026
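
A pure-Python model of why the combining operator must be commutative: per-partition results are merged in whatever order partitions finish, unlike the strict left-to-right fold on a plain list (nothing here uses Spark; names are illustrative):

```python
from functools import reduce

def distributed_fold(partitions, zero, op, partition_order):
    # fold each partition, then combine partials in the given order
    partials = [reduce(op, part, zero) for part in partitions]
    acc = zero
    for i in partition_order:  # simulates nondeterministic completion
        acc = op(acc, partials[i])
    return acc

parts = [[1, 2], [3, 4]]
add = lambda a, b: a + b          # commutative: order never matters
noncomm = lambda a, b: 2 * a + b  # not commutative: order matters
assert distributed_fold(parts, 0, add, [0, 1]) == distributed_fold(parts, 0, add, [1, 0])
assert distributed_fold(parts, 0, noncomm, [0, 1]) != distributed_fold(parts, 0, noncomm, [1, 0])
```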
    • Tathagata Das's avatar
      [SPARK-7787] [STREAMING] Fix serialization issue of SerializableAWSCredentials · 4b7ff309
      Tathagata Das authored
      Lack of default constructor causes deserialization to fail. This occurs only when the AWS credentials are explicitly specified through KinesisUtils.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6316 from tdas/SPARK-7787 and squashes the following commits:
      
      248ca5c [Tathagata Das] Fixed serializability
      4b7ff309
    • Cheng Lian's avatar
      [SPARK-7749] [SQL] Fixes partition discovery for non-partitioned tables · 8730fbb4
      Cheng Lian authored
      When no partition columns can be found, we should have an empty `PartitionSpec`, rather than a `PartitionSpec` with empty partition columns.
      
      This PR together with #6285 should fix SPARK-7749.
      
      Author: Cheng Lian <lian@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #6287 from liancheng/spark-7749 and squashes the following commits:
      
      a799ff3 [Cheng Lian] Adds test cases for SPARK-7749
      c4949be [Cheng Lian] Minor refactoring, and tolerant _TEMPORARY directory name
      5aa87ea [Yin Huai] Make parsePartitions more robust.
      fc56656 [Cheng Lian] Returns empty PartitionSpec if no partition columns can be inferred
      19ae41e [Cheng Lian] Don't list base directory as leaf directory
      8730fbb4
    • Xiangrui Meng's avatar
      [SPARK-7752] [MLLIB] Use lowercase letters for NaiveBayes.modelType · 13348e21
      Xiangrui Meng authored
      to be consistent with other string names in MLlib. This PR also updates the implementation to use vals instead of hardcoded strings. jkbradley leahmcguire
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6277 from mengxr/SPARK-7752 and squashes the following commits:
      
      f38b662 [Xiangrui Meng] add another case _ back in test
      ae5c66a [Xiangrui Meng] model type -> modelType
      711d1c6 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7752
      40ae53e [Xiangrui Meng] fix Java test suite
      264a814 [Xiangrui Meng] add case _ back
      3c456a8 [Xiangrui Meng] update NB user guide
      17bba53 [Xiangrui Meng] update naive Bayes to use lowercase model type strings
      13348e21
    • Davies Liu's avatar
      [SPARK-7565] [SQL] fix MapType in JsonRDD · a25c1ab8
      Davies Liu authored
      The keys of a Map in JsonRDD should be converted into UTF8String (as should failed records). Thanks to yhuai viirya
      
      Closes #6084
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6299 from davies/string_in_json and squashes the following commits:
      
      0dbf559 [Davies Liu] improve test, fix corrupt record
      6836a80 [Davies Liu] move unit tests into Scala
      b97af11 [Davies Liu] fix MapType in JsonRDD
      a25c1ab8
    • Cheng Hao's avatar
      [SPARK-7320] [SQL] [Minor] Move the testData into beforeAll() · feb3a9d3
      Cheng Hao authored
      Follow-up of #6340, to avoid the test report going missing when it fails.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #6312 from chenghao-intel/rollup_minor and squashes the following commits:
      
      b03a25f [Cheng Hao] simplify the testData instantiation
      09b7e8b [Cheng Hao] move the testData into beforeAll()
      feb3a9d3
    • Burak Yavuz's avatar
      [SPARK-7745] Change asserts to requires for user input checks in Spark Streaming · 1ee8eb43
      Burak Yavuz authored
      Assertions can be turned off. `require` throws an `IllegalArgumentException`, which makes more sense when it's a user-set variable.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #6271 from brkyvz/streaming-require and squashes the following commits:
      
      d249484 [Burak Yavuz] fix merge conflict
      264adb8 [Burak Yavuz] addressed comments v1.0
      6161350 [Burak Yavuz] fix tests
      16aa766 [Burak Yavuz] changed more assertions to more meaningful errors
      afd923d [Burak Yavuz] changed some assertions to require
      1ee8eb43
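
A Python analogue of the change: `assert` statements vanish under `python -O`, much as JVM assertions can be disabled, so user-input checks should raise a real exception (the function name below is illustrative):

```python
# was: assert seconds > 0, "batch duration must be positive"
# The explicit raise survives even when assertions are disabled.
def set_batch_duration(seconds):
    if seconds <= 0:
        raise ValueError("batch duration must be positive")
    return seconds

try:
    set_batch_duration(-1)
except ValueError:
    pass  # the user gets a meaningful error instead of silent breakage
```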
    • Xiangrui Meng's avatar
      [SPARK-7753] [MLLIB] Update KernelDensity API · 947ea1cf
      Xiangrui Meng authored
      Update `KernelDensity` API to make it extensible to different kernels in the future. `bandwidth` is used instead of `standardDeviation`. The static `kernelDensity` method is removed from `Statistics`. The implementation is updated using BLAS, while the algorithm remains the same. sryza srowen
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6279 from mengxr/SPARK-7753 and squashes the following commits:
      
      4cdfadc [Xiangrui Meng] add example code in the doc
      767fd5a [Xiangrui Meng] update KernelDensity API
      947ea1cf
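
A minimal Gaussian kernel density estimate illustrating the renamed `bandwidth` parameter; the real class lives in MLlib and uses BLAS, while this is plain Python with illustrative names:

```python
import math

# Average of Gaussian kernels centered at the samples; bandwidth is
# the kernel's standard deviation (the parameter this PR renames).
def kernel_density(samples, bandwidth, x):
    norm = 1.0 / (math.sqrt(2 * math.pi) * bandwidth * len(samples))
    return norm * sum(math.exp(-((x - s) ** 2) / (2 * bandwidth ** 2))
                      for s in samples)

d = kernel_density([0.0, 1.0], bandwidth=1.0, x=0.5)
```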
    • Davies Liu's avatar
      [SPARK-7606] [SQL] [PySpark] add version to Python SQL API docs · 8ddcb25b
      Davies Liu authored
      Add version info for public Python SQL API.
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6295 from davies/versions and squashes the following commits:
      
      cfd91e6 [Davies Liu] add more version for DataFrame API
      600834d [Davies Liu] add version to SQL API docs
      8ddcb25b
    • Mingfei's avatar
      [SPARK-7389] [CORE] Tachyon integration improvement · 04940c49
      Mingfei authored
      Two main changes:
      
      1. Add two functions, `putValues` and `getValues`, to `ExternalBlockManager`, because an implementation may not rely on `putBytes` and `getBytes`.
      
      2. Improve the Tachyon integration. Currently, when putting data into Tachyon, Spark first serializes all data in one partition into a ByteBuffer and then writes it to Tachyon; this uses much memory and increases GC overhead. When getting data from Tachyon, `getValues` depends on `getBytes`, which also reads all data into an on-heap byte array, resulting in high memory usage. This PR changes the approach of these two functions, making them read/write data via streams to reduce memory usage.
      
      In our testing, when the data size is huge, this patch reduces GC time by about 30% and full GC time by about 70%, and reduces total execution time by about 10%.
      
      Author: Mingfei <mingfei.shi@intel.com>
      
      Closes #5908 from shimingfei/Tachyon-integration-rebase and squashes the following commits:
      
      033bc57 [Mingfei] modify accroding to comments
      747c69a [Mingfei] modify according to comments - format changes
      ce52c67 [Mingfei] put close() in a finally block
      d2c60bb [Mingfei] modify according to comments, some code style change
      4c11591 [Mingfei] modify according to comments split putIntoExternalBlockStore into two functions add default implementation for getValues and putValues
      cc0a32e [Mingfei] Make getValues read data from Tachyon by stream Make putValues write data to Tachyon by stream
      017593d [Mingfei] add getValues and putValues in ExternalBlockManager's Interface
      04940c49
    • Liang-Chi Hsieh's avatar
      [SPARK-7746][SQL] Add FetchSize parameter for JDBC driver · d0eb9ffe
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7746
      
      Looks like an easy-to-add parameter, but it can show significant performance improvement if the JDBC driver accepts it.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #6283 from viirya/jdbc_fetchsize and squashes the following commits:
      
      de47f94 [Liang-Chi Hsieh] Don't keep fetchSize as single parameter.
      b7bff2f [Liang-Chi Hsieh] Add FetchSize parameter for JDBC driver.
      d0eb9ffe
  2. May 20, 2015
    • Xiangrui Meng's avatar
      [SPARK-7774] [MLLIB] add sqlContext to MLlibTestSparkContext · ddec173c
      Xiangrui Meng authored
      to simplify test suites that require a SQLContext.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6303 from mengxr/SPARK-7774 and squashes the following commits:
      
      0622b5a [Xiangrui Meng] update some other test suites
      e1f9b8d [Xiangrui Meng] add sqlContext to MLlibTestSparkContext
      ddec173c
    • Cheng Hao's avatar
      [SPARK-7320] [SQL] Add Cube / Rollup for dataframe · 42c592ad
      Cheng Hao authored
      This is a follow-up to #6257, which broke the Maven test.
      
      Add cube & rollup for DataFrame
      For example:
      ```scala
      testData.rollup($"a" + $"b", $"b").agg(sum($"a" - $"b"))
      testData.cube($"a" + $"b", $"b").agg(sum($"a" - $"b"))
      ```
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #6304 from chenghao-intel/rollup and squashes the following commits:
      
      04bb1de [Cheng Hao] move the table register/unregister into beforeAll/afterAll
      a6069f1 [Cheng Hao] cancel the implicit keyword
      ced4b8f [Cheng Hao] remove the unnecessary code changes
      9959dfa [Cheng Hao] update the code as comments
      e1d88aa [Cheng Hao] update the code as suggested
      03bc3d9 [Cheng Hao] Remove the CubedData & RollupedData
      5fd62d0 [Cheng Hao] hiden the CubedData & RollupedData
      5ffb196 [Cheng Hao] Add Cube / Rollup for dataframe
      42c592ad
    • zsxwing's avatar
      [SPARK-7777] [STREAMING] Fix the flaky test in org.apache.spark.streaming.BasicOperationsSuite · 895baf8f
      zsxwing authored
      Just added a guard to make sure a batch has completed before moving to the next batch.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6306 from zsxwing/SPARK-7777 and squashes the following commits:
      
      ecee529 [zsxwing] Fix the failure message
      58634fe [zsxwing] Fix the flaky test in org.apache.spark.streaming.BasicOperationsSuite
      895baf8f
    • Hari Shreedharan's avatar
      [SPARK-7750] [WEBUI] Rename endpoints from `json` to `api` to allow further extension to non-json outputs too. · a70bf06b
      Hari Shreedharan authored
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #6273 from harishreedharan/json-to-api and squashes the following commits:
      
      e14b73b [Hari Shreedharan] Rename `getJsonServlet` to `getServletHandler` i
      42f8acb [Hari Shreedharan] Import order fixes.
      2ef852f [Hari Shreedharan] [SPARK-7750][WebUI] Rename endpoints from `json` to `api` to allow further extension to non-json outputs too.
      a70bf06b
    • Josh Rosen's avatar
      [SPARK-7719] Re-add UnsafeShuffleWriterSuite test that was removed for Java 6 compat · 5196efff
      Josh Rosen authored
      This patch re-adds a test which was removed in 9ebb44f8 due to a Java 6 compatibility issue.  We now use Guava's `Iterators.emptyIterator()` in place of `Collections.emptyIterator()`, which isn't present in all Java 6 versions.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6298 from JoshRosen/SPARK-7719-fix-java-6-test-code and squashes the following commits:
      
      5c9bd85 [Josh Rosen] Re-add UnsafeShuffleWriterSuite.emptyIterator() test which was removed due to Java 6 issue
      5196efff
    • Xiangrui Meng's avatar
      [SPARK-7762] [MLLIB] set default value for outputCol · c330e52d
      Xiangrui Meng authored
      Set a default value for `outputCol` instead of forcing users to name it. This is useful for intermediate transformers in the pipeline. jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6289 from mengxr/SPARK-7762 and squashes the following commits:
      
      54edebc [Xiangrui Meng] merge master
      bff8667 [Xiangrui Meng] update unit test
      171246b [Xiangrui Meng] add unit test for outputCol
      a4321bd [Xiangrui Meng] set default value for outputCol
      c330e52d
    • Josh Rosen's avatar
      [SPARK-7251] Perform sequential scan when iterating over BytesToBytesMap · f2faa7af
      Josh Rosen authored
      This patch modifies `BytesToBytesMap.iterator()` to iterate through records in the order that they appear in the data pages rather than iterating through the hashtable pointer arrays. This results in fewer random memory accesses, significantly improving performance for scan-and-copy operations.
      
      This is possible because our data pages are laid out as sequences of `[keyLength][data][valueLength][data]` entries. In order to mark the end of a partially-filled data page, we write `-1` as a special end-of-page length (BytesToBytesMap supports empty/zero-length keys and values, which is why we had to use a negative length).
      
      This patch incorporates / closes #5836.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6159 from JoshRosen/SPARK-7251 and squashes the following commits:
      
      05bd90a [Josh Rosen] Compare capacity, not size, to MAX_CAPACITY
      2a20d71 [Josh Rosen] Fix maximum BytesToBytesMap capacity
      bc4854b [Josh Rosen] Guard against overflow when growing BytesToBytesMap
      f5feadf [Josh Rosen] Add test for iterating over an empty map
      273b842 [Josh Rosen] [SPARK-7251] Perform sequential scan when iterating over entries in BytesToBytesMap
      f2faa7af
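
The page layout described above can be sketched as a scanner in plain Python using `struct`; this is illustrative, not Spark's off-heap code:

```python
import struct

# Scan a page laid out as [keyLength][key][valueLength][value]...,
# terminated by -1 (needed because zero-length keys are legal).
def scan_page(page):
    records, offset = [], 0
    while True:
        (key_len,) = struct.unpack_from(">i", page, offset)
        if key_len == -1:  # special end-of-page length
            return records
        offset += 4
        key = page[offset:offset + key_len]
        offset += key_len
        (val_len,) = struct.unpack_from(">i", page, offset)
        offset += 4
        value = page[offset:offset + val_len]
        offset += val_len
        records.append((key, value))

page = (struct.pack(">i", 1) + b"k" + struct.pack(">i", 2) + b"vv"
        + struct.pack(">i", -1))
scan_page(page)  # [(b"k", b"vv")]
```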
    • Josh Rosen's avatar
      [SPARK-7698] Cache and reuse buffers in ExecutorMemoryAllocator when using heap allocation · 7956dd7a
      Josh Rosen authored
      When on-heap memory allocation is used, ExecutorMemoryManager should maintain a cache/pool of buffers for reuse by tasks. This will significantly improve the performance of the new Tungsten sort-shuffle for jobs with many short-lived tasks by eliminating a major source of GC.
      
      This pull request is a minimum-viable-implementation of this idea.  In its current form, this patch significantly improves performance on a stress test which launches huge numbers of short-lived shuffle map tasks back-to-back in the same JVM.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6227 from JoshRosen/SPARK-7698 and squashes the following commits:
      
      fd6cb55 [Josh Rosen] SoftReference -> WeakReference
      b154e86 [Josh Rosen] WIP sketch of pooling in ExecutorMemoryManager
      7956dd7a
    • Tathagata Das's avatar
      [SPARK-7767] [STREAMING] Added test for checkpoint serialization in StreamingContext.start() · 3c434cbf
      Tathagata Das authored
      Currently, the background checkpointing thread fails silently if the checkpoint is not serializable. This is hard to debug, and therefore it's best to fail fast at `start()` when checkpointing is enabled and the checkpoint is not serializable.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6292 from tdas/SPARK-7767 and squashes the following commits:
      
      51304e6 [Tathagata Das] Addressed comments.
      c35237b [Tathagata Das] Added test for checkpoint serialization in StreamingContext.start()
      3c434cbf
    • Andrew Or's avatar
      [SPARK-7237] [SPARK-7741] [CORE] [STREAMING] Clean more closures that need cleaning · 9b84443d
      Andrew Or authored
      SPARK-7741 is the equivalent of SPARK-7237 in streaming. This is an alternative to #6268.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6269 from andrewor14/clean-moar and squashes the following commits:
      
      c51c9ab [Andrew Or] Add periods (trivial)
      6c686ac [Andrew Or] Merge branch 'master' of github.com:apache/spark into clean-moar
      79a435b [Andrew Or] Fix tests
      d18c9f9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into clean-moar
      65ef07b [Andrew Or] Fix tests?
      4b487a3 [Andrew Or] Add tests for closures passed to DStream operations
      328139b [Andrew Or] Do not forget foreachRDD
      5431f61 [Andrew Or] Clean streaming closures
      72b7b73 [Andrew Or] Clean core closures
      9b84443d
    • Holden Karau's avatar
      [SPARK-7511] [MLLIB] pyspark ml seed param should be random by default or 42... · 191ee474
      Holden Karau authored
      [SPARK-7511] [MLLIB] pyspark ml seed param should be random by default or 42 is quite funny but not very random
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #6139 from holdenk/SPARK-7511-pyspark-ml-seed-param-should-be-random-by-default-or-42-is-quite-funny-but-not-very-random and squashes the following commits:
      
      591f8e5 [Holden Karau] specify old seed for doc tests
      2470004 [Holden Karau] Fix a bunch of seeds with default values to have None as the default which will then result in using the hash of the class name
      cbad96d [Holden Karau] Add the setParams function that is used in the real code
      423b8d7 [Holden Karau] Switch the test code to behave slightly more like production code. also don't check the param map value only check for key existence
      140d25d [Holden Karau] remove extra space
      926165a [Holden Karau] Add some missing newlines for pep8 style
      8616751 [Holden Karau] merge in master
      58532e6 [Holden Karau] its the __name__ method, also treat None values as not set
      56ef24a [Holden Karau] fix test and regenerate base
      afdaa5c [Holden Karau] make sure different classes have different results
      68eb528 [Holden Karau] switch default seed to hash of type of self
      89c4611 [Holden Karau] Merge branch 'master' into SPARK-7511-pyspark-ml-seed-param-should-be-random-by-default-or-42-is-quite-funny-but-not-very-random
      31cd96f [Holden Karau] specify the seed to randomforestregressor test
      e1b947f [Holden Karau] Style fixes
      ce90ec8 [Holden Karau] merge in master
      bcdf3c9 [Holden Karau] update docstring seeds to none and some other default seeds from 42
      65eba21 [Holden Karau] pep8 fixes
      0e3797e [Holden Karau] Make seed default to random in more places
      213a543 [Holden Karau] Simplify the generated code to only include set default if there is a default rather than having None is note None in the generated code
      1ff17c2 [Holden Karau] Make the seed random for HasSeed in python
      191ee474
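
The chosen default can be sketched in plain Python: when no seed is given, derive one from the hash of the class name, so different estimators get different but per-class-stable defaults (class names here are illustrative, not PySpark's):

```python
# Default the seed to the hash of the class name when none is given.
class HasSeed:
    def __init__(self, seed=None):
        self.seed = seed if seed is not None else hash(type(self).__name__)

class RandomForest(HasSeed):
    pass

class KMeans(HasSeed):
    pass

RandomForest().seed == RandomForest().seed  # stable within a process
RandomForest().seed != KMeans().seed        # differs across classes
```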