  1. Sep 23, 2015
    • [SPARK-10769] [STREAMING] [TESTS] Fix o.a.s.streaming.CheckpointSuite.maintains rate controller · 50e46342
      zsxwing authored
      Fixed the following failure in https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1787/testReport/junit/org.apache.spark.streaming/CheckpointSuite/recovery_maintains_rate_controller/
      ```
      sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 660 times over 10.000044392000001 seconds. Last failure message: 9223372036854775807 did not equal 200.
      	at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
      	at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
      	at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
      	at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:336)
      	at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
      	at org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply$mcV$sp(CheckpointSuite.scala:413)
      	at org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply(CheckpointSuite.scala:396)
      	at org.apache.spark.streaming.CheckpointSuite$$anonfun$15.apply(CheckpointSuite.scala:396)
      	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
      	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
      	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
      	at org.scalatest.Transformer.apply(Transformer.scala:22)
      ```
      
      In this test, it calls `advanceTimeWithRealDelay(ssc, 2)` to run two batch jobs. However, there is a race condition: these two jobs can finish before the receiver is registered. In that case, `UpdateRateLimit` won't be sent to the receiver and `getDefaultBlockGeneratorRateLimit` cannot be updated.
      
      Here are the logs related to this issue:
      ```
      15/09/22 19:28:26.154 pool-1-thread-1-ScalaTest-running-CheckpointSuite INFO CheckpointSuite: Manual clock before advancing = 2500
      
      15/09/22 19:28:26.869 JobScheduler INFO JobScheduler: Finished job streaming job 3000 ms.0 from job set of time 3000 ms
      15/09/22 19:28:26.869 JobScheduler INFO JobScheduler: Total delay: 1442975303.869 s for time 3000 ms (execution: 0.711 s)
      
      15/09/22 19:28:26.873 JobScheduler INFO JobScheduler: Finished job streaming job 3500 ms.0 from job set of time 3500 ms
      15/09/22 19:28:26.873 JobScheduler INFO JobScheduler: Total delay: 1442975303.373 s for time 3500 ms (execution: 0.004 s)
      
      15/09/22 19:28:26.879 sparkDriver-akka.actor.default-dispatcher-3 INFO ReceiverTracker: Registered receiver for stream 0 from localhost:57749
      
      15/09/22 19:28:27.154 pool-1-thread-1-ScalaTest-running-CheckpointSuite INFO CheckpointSuite: Manual clock after advancing = 3500
      ```
      `advanceTimeWithRealDelay(ssc, 2)` triggered the 3000 ms and 3500 ms jobs, but the receiver was registered only after both of them had finished.
      
      So we should make sure the receiver is online before running `advanceTimeWithRealDelay(ssc, 2)`.
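
      The general shape of such a guard, as a minimal self-contained sketch (the `receiverRegistered` flag and the timing values are illustrative stand-ins for whatever signal the suite actually uses to detect registration, not the exact code in this PR):
      ```
      import java.util.concurrent.atomic.AtomicBoolean
      import org.scalatest.concurrent.Eventually._
      import org.scalatest.time.SpanSugar._

      object WaitForReceiverSketch {
        // Stand-in for the real "receiver registered" signal used by the suite.
        val receiverRegistered = new AtomicBoolean(false)

        def main(args: Array[String]): Unit = {
          // Simulate the receiver coming online on another thread after a short delay.
          new Thread(() => { Thread.sleep(200); receiverRegistered.set(true) }).start()

          // The guard: block until the receiver is online before advancing the
          // manual clock with advanceTimeWithRealDelay(ssc, 2).
          eventually(timeout(10.seconds), interval(10.millis)) {
            assert(receiverRegistered.get(), "receiver not yet registered")
          }
          println("receiver registered; safe to advance the clock")
        }
      }
      ```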
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #8877 from zsxwing/SPARK-10769.
      50e46342
    • [SPARK-10224] [STREAMING] Fix the issue that blockIntervalTimer won't call... · 44c28abf
      zsxwing authored
      [SPARK-10224] [STREAMING] Fix the issue that blockIntervalTimer won't call updateCurrentBuffer when stopping
      
      `blockIntervalTimer.stop(interruptTimer = false)` doesn't guarantee that `updateCurrentBuffer` is called. So it's possible that `blockIntervalTimer` exits while `currentBuffer` is not empty, and then the data in `currentBuffer` will be lost.
      
      To reproduce it, you can add `Thread.sleep(200)` at this line (https://github.com/apache/spark/blob/69c9c177160e32a2fbc9b36ecc52156077fca6fc/streaming/src/main/scala/org/apache/spark/streaming/util/RecurringTimer.scala#L100) and run `StreamingContextSuite`.
      I cannot write a unit test to reproduce it because I cannot find a way to force `RecurringTimer` to suspend at this line for a few milliseconds.
      
      There was a failure in Jenkins here: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41455/console
      
      This PR updates RecurringTimer to make sure `stop(interruptTimer = false)` will call `callback` at least once after the `stop` method is called.
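
      A minimal, self-contained sketch of that guarantee (an illustration of the behaviour, not the actual `RecurringTimer` implementation): a non-interrupting `stop` lets the loop finish and then invokes the callback one last time, so anything buffered since the previous tick is still flushed.
      ```
      class SimpleRecurringTimer(periodMs: Long, callback: Long => Unit) {
        @volatile private var stopped = false

        private val thread = new Thread(() => {
          var time = System.currentTimeMillis()
          try {
            while (!stopped) {
              Thread.sleep(periodMs)
              time += periodMs
              callback(time)
            }
            // After a non-interrupting stop(), run the callback once more so that
            // data added between the last tick and stop() is still handed over.
            callback(time + periodMs)
          } catch {
            case _: InterruptedException => // interrupted stop: no final callback is guaranteed
          }
        })

        def start(): Unit = thread.start()

        def stop(interruptTimer: Boolean): Unit = {
          stopped = true
          if (interruptTimer) thread.interrupt()
          thread.join()
        }
      }
      ```
      With this shape, `stop(interruptTimer = false)` always results in at least one callback invocation after `stop` is requested, which is the property the PR adds to `RecurringTimer`.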
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #8417 from zsxwing/SPARK-10224.
      44c28abf
    • [SPARK-10652] [SPARK-10742] [STREAMING] Set meaningful job descriptions for all streaming jobs · 5548a254
      Tathagata Das authored
      Here is the screenshot after adding the job descriptions to threads that run receivers and the scheduler thread running the batch jobs.
      
      ## All jobs page
      * Added job descriptions with links to relevant batch details page
      ![image](https://cloud.githubusercontent.com/assets/663212/9924165/cda4a372-5cb1-11e5-91ca-d43a32c699e9.png)
      
      ## All stages page
      * Added stage descriptions with links to relevant batch details page
      ![image](https://cloud.githubusercontent.com/assets/663212/9923814/2cce266a-5cae-11e5-8a3f-dad84d06c50e.png)
      
      ## Streaming batch details page
      * Added the +details link
      ![image](https://cloud.githubusercontent.com/assets/663212/9921977/24014a32-5c98-11e5-958e-457b6c38065b.png)
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #8791 from tdas/SPARK-10652.
      5548a254
  2. Sep 22, 2015
    • [SPARK-10663] Removed unnecessary invocation of DataFrame.toDF method. · 558e9c7e
      Matt Hagen authored
      The Scala example under the "Example: Pipeline" heading in this
      document initializes the "test" variable to a DataFrame. Because test
      is already a DataFrame, there is no need to call test.toDF, as the example
      does in a subsequent line: model.transform(test.toDF). So, I removed
      the extraneous toDF invocation.
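
      For illustration, a minimal standalone example of the point (this uses the modern `SparkSession` entry point rather than the 1.5-era example's `SQLContext`, and made-up data):
      ```
      import org.apache.spark.sql.SparkSession

      object ToDfRedundancyDemo {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().master("local[1]").appName("toDF-demo").getOrCreate()
          import spark.implicits._

          // `test` is already a DataFrame, so calling .toDF on it is a no-op conversion.
          val test = Seq((1L, "spark"), (2L, "hadoop")).toDF("id", "text")
          assert(test.toDF.schema == test.schema)

          spark.stop()
        }
      }
      ```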
      
      Author: Matt Hagen <anonz3000@gmail.com>
      
      Closes #8875 from hagenhaus/SPARK-10663.
      558e9c7e
    • [SPARK-10310] [SQL] Fixes script transformation field/line delimiters · 84f81e03
      Zhichao Li authored
      **Please attribute this PR to `Zhichao Li <zhichao.li@intel.com>`.**
      
      This PR is based on PR #8476 authored by zhichao-li. It fixes SPARK-10310 by adding the field delimiter SerDe property to the default `LazySimpleSerDe` and enabling the default record reader/writer classes.

      Currently, we only support `LazySimpleSerDe` used together with `TextRecordReader` and `TextRecordWriter`, and don't support customizing the record reader/writer via `RECORDREADER`/`RECORDWRITER` clauses. This should be addressed in separate PR(s).
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #8860 from liancheng/spark-10310/fix-script-trans-delimiters.
      84f81e03
    • [SPARK-10640] History server fails to parse TaskCommitDenied · 61d4c07f
      Andrew Or authored
      ... simply because the code is missing!
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #8828 from andrewor14/task-end-reason-json.
      61d4c07f
    • [SPARK-10714] [SPARK-8632] [SPARK-10685] [SQL] Refactor Python UDF handling · a96ba40f
      Reynold Xin authored
      This patch refactors Python UDF handling:
      
      1. Extract the per-partition Python UDF calling logic from PythonRDD into a PythonRunner. PythonRunner itself expects iterators as input/output, and thus has no dependency on RDD. This way, we can use PythonRunner directly in a mapPartitions call, or in the future in an environment without RDDs (see the sketch below).
      2. Use PythonRunner in Spark SQL's BatchPythonEvaluation.
      3. Update BatchPythonEvaluation to only use its input once, rather than twice. This should fix the Python UDF performance regression in Spark 1.5.
      
      There are a number of small cleanups I wanted to do when I looked at the code, but I kept most of those out so the diff looks small.
      
      This basically implements the approach in https://github.com/apache/spark/pull/8833, but with some code moving around so the correctness doesn't depend on the inner workings of Spark serialization and task execution.
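
      A schematic of that decoupling (the names here are illustrative, not Spark's actual classes): the runner operates purely on iterators, so it can be driven from an RDD's mapPartitions or from any other partition-like source.
      ```
      // Illustrative trait: per-partition computation expressed as Iterator => Iterator.
      trait IteratorRunner[IN, OUT] {
        def compute(input: Iterator[IN], partitionIndex: Int): Iterator[OUT]
      }

      // A trivial runner; a real one would feed the input to a Python worker process.
      class UppercaseRunner extends IteratorRunner[String, String] {
        def compute(input: Iterator[String], partitionIndex: Int): Iterator[String] =
          input.map(_.toUpperCase)
      }

      object IteratorRunnerDemo {
        def main(args: Array[String]): Unit = {
          val runner = new UppercaseRunner
          // No RDD needed: any Iterator can be pushed through the runner, which is
          // what makes it reusable from mapPartitions or from SQL's batch evaluation.
          println(runner.compute(Iterator("a", "b", "c"), partitionIndex = 0).toList)
        }
      }
      ```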
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8835 from rxin/python-iter-refactor.
      a96ba40f
    • [SPARK-10737] [SQL] When using UnsafeRows, SortMergeJoin may return wrong results · 5aea987c
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-10737
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #8854 from yhuai/SMJBug.
      5aea987c
    • [SPARK-10672] [SQL] Do not fail when we cannot save the metadata of a data... · 2204cdb2
      Yin Huai authored
      [SPARK-10672] [SQL] Do not fail when we cannot save the metadata of a data source table in a hive compatible way
      
      https://issues.apache.org/jira/browse/SPARK-10672
      
      With the changes in this PR, we will fall back to saving the metadata of a table in a Spark SQL-specific way if we fail to save it in a Hive-compatible way (Hive throws an exception because of its internal restrictions, e.g. binary and decimal types cannot be saved to Parquet if the metastore is running Hive 0.13). I manually tested the fix with the following test in `DataSourceWithHiveMetastoreCatalogSuite` (`spark.sql.hive.metastore.version=0.13` and `spark.sql.hive.metastore.jars`=`maven`).
      
      ```
          test(s"fail to save metadata of a parquet table in hive 0.13") {
            withTempPath { dir =>
              withTable("t") {
                val path = dir.getCanonicalPath
      
                sql(
                  s"""CREATE TABLE t USING $provider
                     |OPTIONS (path '$path')
                     |AS SELECT 1 AS d1, cast("val_1" as binary) AS d2
                   """.stripMargin)
      
                sql(
                  s"""describe formatted t
                   """.stripMargin).collect.foreach(println)
      
                sqlContext.table("t").show
              }
            }
          }
      ```
      
      Without this fix, we will fail with the following error.
      ```
      org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.UnsupportedOperationException: Unknown field type: binary
      	at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:619)
      	at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:576)
      	at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply$mcV$sp(ClientWrapper.scala:359)
      	at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply(ClientWrapper.scala:357)
      	at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply(ClientWrapper.scala:357)
      	at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
      	at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
      	at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
      	at org.apache.spark.sql.hive.client.ClientWrapper.createTable(ClientWrapper.scala:357)
      	at org.apache.spark.sql.hive.HiveMetastoreCatalog.createDataSourceTable(HiveMetastoreCatalog.scala:358)
      	at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:285)
      	at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
      	at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
      	at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
      	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
      	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:58)
      	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:58)
      	at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:144)
      	at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:129)
      	at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
      	at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:725)
      	at org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:56)
      	at org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:56)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcV$sp$2$$anonfun$apply$2.apply$mcV$sp(HiveMetastoreCatalogSuite.scala:165)
      	at org.apache.spark.sql.test.SQLTestUtils$class.withTable(SQLTestUtils.scala:150)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite.withTable(HiveMetastoreCatalogSuite.scala:52)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcV$sp$2.apply(HiveMetastoreCatalogSuite.scala:162)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$4$$anonfun$apply$1$$anonfun$apply$mcV$sp$2.apply(HiveMetastoreCatalogSuite.scala:161)
      	at org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:125)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite.withTempPath(HiveMetastoreCatalogSuite.scala:52)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$4$$anonfun$apply$1.apply$mcV$sp(HiveMetastoreCatalogSuite.scala:161)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$4$$anonfun$apply$1.apply(HiveMetastoreCatalogSuite.scala:161)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite$$anonfun$4$$anonfun$apply$1.apply(HiveMetastoreCatalogSuite.scala:161)
      	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
      	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
      	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
      	at org.scalatest.Transformer.apply(Transformer.scala:22)
      	at org.scalatest.Transformer.apply(Transformer.scala:20)
      	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
      	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
      	at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
      	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
      	at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
      	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
      	at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
      	at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
      	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
      	at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
      	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
      	at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
      	at scala.collection.immutable.List.foreach(List.scala:318)
      	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
      	at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
      	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
      	at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
      	at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
      	at org.scalatest.Suite$class.run(Suite.scala:1424)
      	at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
      	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
      	at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
      	at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
      	at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite.org$scalatest$BeforeAndAfterAll$$super$run(HiveMetastoreCatalogSuite.scala:52)
      	at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
      	at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
      	at org.apache.spark.sql.hive.DataSourceWithHiveMetastoreCatalogSuite.run(HiveMetastoreCatalogSuite.scala:52)
      	at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
      	at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
      	at sbt.ForkMain$Run$2.call(ForkMain.java:294)
      	at sbt.ForkMain$Run$2.call(ForkMain.java:284)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: java.lang.UnsupportedOperationException: Unknown field type: binary
      	at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.getObjectInspector(ArrayWritableObjectInspector.java:108)
      	at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.<init>(ArrayWritableObjectInspector.java:60)
      	at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:113)
      	at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:339)
      	at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:288)
      	at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:194)
      	at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:597)
      	... 76 more
      ```
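
      The shape of the fallback, as a hedged sketch with hypothetical helper names (not the actual `HiveMetastoreCatalog` code): try the Hive-compatible path first and, if the metastore rejects it, persist the Spark SQL-specific metadata instead of failing the whole command.
      ```
      import scala.util.control.NonFatal

      object MetadataFallbackSketch {
        // Hypothetical stand-ins for the two persistence paths.
        def saveHiveCompatible(table: String): Unit =
          throw new UnsupportedOperationException("Unknown field type: binary")
        def saveSparkSqlSpecific(table: String): Unit =
          println(s"saved '$table' with Spark SQL specific metadata only")

        def createDataSourceTable(table: String): Unit = {
          try {
            saveHiveCompatible(table)
          } catch {
            case NonFatal(e) =>
              // Fall back instead of failing the CREATE TABLE command.
              println(s"Could not persist '$table' in a Hive compatible way (${e.getMessage}); falling back")
              saveSparkSqlSpecific(table)
          }
        }

        def main(args: Array[String]): Unit = createDataSourceTable("t")
      }
      ```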
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #8824 from yhuai/datasourceMetadata.
      2204cdb2
    • [SPARK-10740] [SQL] handle nondeterministic expressions correctly for set operations · 5017c685
      Wenchen Fan authored
      https://issues.apache.org/jira/browse/SPARK-10740
      
      Author: Wenchen Fan <cloud0fan@163.com>
      
      Closes #8858 from cloud-fan/non-deter.
      5017c685
    • [SPARK-10704] Rename HashShuffleReader to BlockStoreShuffleReader · 1ca5e2e0
      Josh Rosen authored
      The current shuffle code has an interface named ShuffleReader with only one implementation, HashShuffleReader. This naming is confusing, since the same read path code is used for both sort- and hash-based shuffle. This patch addresses this by renaming HashShuffleReader to BlockStoreShuffleReader.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #8825 from JoshRosen/shuffle-reader-cleanup.
      1ca5e2e0
    • [SPARK-10593] [SQL] fix resolve output of Generate · 22d40159
      Davies Liu authored
      The output of Generate should not be resolved as Reference.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8755 from davies/view.
      22d40159
    • [SPARK-9585] Delete the input format caching because some input format are non thread safe · 2ea0f2e1
      xutingjun authored
      If we cache the InputFormat, all tasks on the same executor will share it.
      Some InputFormats are thread-safe, but some, such as HiveHBaseTableInputFormat, are not. If tasks share a non-thread-safe InputFormat, unexpected errors may occur.
      To avoid this, I think we should remove the InputFormat caching.
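
      A self-contained sketch of the before/after (the class and helper names below are made up for illustration, not the actual HadoopRDD code):
      ```
      import java.util.concurrent.ConcurrentHashMap

      object InputFormatCachingSketch {
        trait FakeInputFormat { def read(): String }
        class NonThreadSafeFormat extends FakeInputFormat {
          private var state = 0                      // unsynchronized mutable state
          def read(): String = { state += 1; s"record-$state" }
        }

        // Before: one shared instance per class, which every task on the executor reuses.
        private val cache = new ConcurrentHashMap[String, FakeInputFormat]()
        def cachedFormat(): FakeInputFormat =
          cache.computeIfAbsent("fake", (_: String) => new NonThreadSafeFormat)

        // After: every caller gets its own instance, so tasks never share mutable state.
        def freshFormat(): FakeInputFormat = new NonThreadSafeFormat

        def main(args: Array[String]): Unit = {
          println(cachedFormat() eq cachedFormat()) // true: shared, unsafe if not thread-safe
          println(freshFormat() eq freshFormat())   // false: per-task instances
        }
      }
      ```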
      
      Author: xutingjun <xutingjun@huawei.com>
      Author: meiyoula <1039320815@qq.com>
      Author: Xutingjun <xutingjun@huawei.com>
      
      Closes #7918 from XuTingjun/cached_inputFormat.
      2ea0f2e1
    • [SPARK-10750] [ML] ML Param validate should print better error information · 7104ee0e
      Yanbo Liang authored
      Currently, when you set an illegal value for a param of array type (such as IntArrayParam, DoubleArrayParam, or StringArrayParam), it throws an IllegalArgumentException with incomprehensible error information.
      Take ```VectorSlicer.setNames``` as an example:
      ```scala
      val vectorSlicer = new VectorSlicer().setInputCol("features").setOutputCol("result")
      // The value of setNames must contain distinct elements, so the next line will throw an exception.
      vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1"))
      ```
      It will throw an IllegalArgumentException as follows:
      ```
      vectorSlicer_b3b4d1a10f43 parameter names given invalid value [Ljava.lang.String;798256c5.
      java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names given invalid value [Ljava.lang.String;798256c5.
      ```
      We should distinguish array-type values from primitive-type values in Param.validate(value: T), so that we get better error information:
      ```
      vectorSlicer_3b744ea277b2 parameter names given invalid value [f1,f4,f1].
      java.lang.IllegalArgumentException: vectorSlicer_3b744ea277b2 parameter names given invalid value [f1,f4,f1].
      ```
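
      A minimal sketch of the formatting change behind that improvement (an illustration, not the actual `Param.validate` code): render arrays element by element instead of relying on `Array.toString`, which prints the unhelpful `[Ljava.lang.String;...` form.
      ```
      object ParamErrorMessageSketch {
        def describeValue(value: Any): String = value match {
          case arr: Array[_] => arr.mkString("[", ",", "]")
          case other         => other.toString
        }

        def main(args: Array[String]): Unit = {
          // The array now renders as readable content, matching the improved message above.
          println(describeValue(Array("f1", "f4", "f1")))  // [f1,f4,f1]
          println(describeValue(3))                        // 3
        }
      }
      ```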
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #8863 from yanboliang/spark-10750.
      7104ee0e
    • [SPARK-9962] [ML] Decision Tree training: prevNodeIdsForInstances.unpersist() at end of training · f4a3c4e3
      Holden Karau authored
      NodeIdCache: prevNodeIdsForInstances.unpersist() needs to be called at the end of training.
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #8541 from holdenk/SPARK-9962-decission-tree-training-prevNodeIdsForiNstances-unpersist-at-end-of-training.
      f4a3c4e3
    • [SPARK-10706] [MLLIB] Add java wrapper for random vector rdd · 870b8a2e
      Meihua Wu authored
      Add java wrapper for random vector rdd
      
      holdenk srowen
      
      Author: Meihua Wu <meihuawu@umich.edu>
      
      Closes #8841 from rotationsymmetry/SPARK-10706.
      870b8a2e
    • [SPARK-10718] [BUILD] Update License on conf files and corresponding excludes file update · 7278f792
      Rekha Joshi authored
      Update the license header on conf files and update the corresponding excludes file accordingly.
      
      Author: Rekha Joshi <rekhajoshm@gmail.com>
      Author: Joshi <rekhajoshm@gmail.com>
      
      Closes #8842 from rekhajoshm/SPARK-10718.
      7278f792
    • [SPARK-10695] [DOCUMENTATION] [MESOS] Fixing incorrect value informati… · 0bd0e5be
      Akash Mishra authored
      Fixing incorrect value information for the spark.mesos.constraints parameter.
      
      Author: Akash Mishra <akash.mishra20@gmail.com>
      
      Closes #8816 from SleepyThread/constraint-fix.
      0bd0e5be
    • [SQL] [MINOR] map -> foreach. · f3b727c8
      Reynold Xin authored
      DataFrame.explain should use foreach to print the explain content.
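
      The general point, as a tiny generic illustration (plain Scala collections, not the `DataFrame.explain` source): when only the side effect matters, `foreach` expresses the intent, while `map` builds a result that is immediately discarded.
      ```
      object MapVsForeach {
        def main(args: Array[String]): Unit = {
          val lines = Seq("== Physical Plan ==", "Scan ...")
          lines.map(println)      // works, but allocates a Seq[Unit] that is thrown away
          lines.foreach(println)  // side effect only, no useless result
        }
      }
      ```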
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8862 from rxin/map-foreach.
      f3b727c8
    • [SPARK-8567] [SQL] Increase the timeout of o.a.s.sql.hive.HiveSparkSubmitSuite to 5 minutes. · 4da32bc0
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-8567
      
      Looks like "SPARK-8368: includes jars passed in through --jars" is pretty flaky now. Based on recent run history, a successful run may take anywhere from 1.5 minutes to almost 3 minutes. Let's try increasing the timeout and see if that fixes this test.
      
      https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.5-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/385/testReport/junit/org.apache.spark.sql.hive/HiveSparkSubmitSuite/SPARK_8368__includes_jars_passed_in_through___jars/history/?start=25
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #8850 from yhuai/SPARK-8567-anotherTry.
      4da32bc0
    • fd61b004
    • [SPARK-10458] [SPARK CORE] Added isStopped() method in SparkContext · f24316e6
      Madhusudanan Kandasamy authored
      Added isStopped() method in SparkContext
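
      Illustrative usage of the new accessor (the app name and master below are made up; this assumes Spark is on the classpath): guard cleanup code against acting on a context that has already been shut down.
      ```
      import org.apache.spark.{SparkConf, SparkContext}

      object IsStoppedDemo {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf().setMaster("local[1]").setAppName("isStopped-demo")
          val sc = new SparkContext(conf)

          println(sc.isStopped)  // false while the context is live
          sc.stop()
          println(sc.isStopped)  // true after stop(), so cleanup code can avoid a double stop
        }
      }
      ```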
      
      Author: Madhusudanan Kandasamy <madhusudanan@in.ibm.com>
      
      Closes #8749 from kmadhugit/SPARK-10458.
      f24316e6
    • [SPARK-10446][SQL] Support to specify join type when calling join with usingColumns · 1fcefef0
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-10446
      
      Currently, the method `join(right: DataFrame, usingColumns: Seq[String])` only supports inner joins. It would be more convenient to have it support other join types as well.
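
      Illustrative usage of the extended API (with made-up data, and the modern `SparkSession` entry point rather than the 1.5-era one):
      ```
      import org.apache.spark.sql.SparkSession

      object JoinTypeWithUsingColumns {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().master("local[1]").appName("join-demo").getOrCreate()
          import spark.implicits._

          val left  = Seq((1, "a"), (2, "b")).toDF("id", "l")
          val right = Seq((1, "x")).toDF("id", "r")

          // The new overload: the same usingColumns join, but with an explicit join type.
          left.join(right, Seq("id"), "left_outer").show()
          spark.stop()
        }
      }
      ```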
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #8600 from viirya/usingcolumns_df.
      1fcefef0
    • [SPARK-10419] [SQL] Adding SQLServer support for datetimeoffset types to JdbcDialects · 781b21ba
      Ewan Leith authored
      Reading from Microsoft SQL Server over JDBC fails when the table contains datetimeoffset columns.

      This patch registers a SQL Server JDBC dialect that maps datetimeoffset to a String, as Microsoft suggests.
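
      A hedged sketch of such a dialect (not the exact code merged here): map SQL Server's DATETIMEOFFSET columns to `StringType` by matching on the reported type name.
      ```
      import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
      import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}

      object MsSqlServerDatetimeOffsetDialect extends JdbcDialect {
        override def canHandle(url: String): Boolean =
          url.startsWith("jdbc:sqlserver")

        override def getCatalystType(
            sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
          // DATETIMEOFFSET has no java.sql.Types constant, so match on the type name.
          if (typeName.equalsIgnoreCase("datetimeoffset")) Some(StringType) else None
        }
      }

      // Registration (e.g. before reading the table):
      // JdbcDialects.registerDialect(MsSqlServerDatetimeOffsetDialect)
      ```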
      
      Author: Ewan Leith <ewan.leith@realitymine.com>
      
      Closes #8575 from realitymine-coordinator/sqlserver.
      781b21ba
    • [SPARK-10577] [PYSPARK] DataFrame hint for broadcast join · 0180b849
      Jian Feng authored
      https://issues.apache.org/jira/browse/SPARK-10577
      
      Author: Jian Feng <jzhang.chs@gmail.com>
      
      Closes #8801 from Jianfeng-chs/master.
      0180b849
    • [SPARK-10716] [BUILD] spark-1.5.0-bin-hadoop2.6.tgz file doesn't uncompress on... · bf20d6c9
      Sean Owen authored
      [SPARK-10716] [BUILD] spark-1.5.0-bin-hadoop2.6.tgz file doesn't uncompress on OS X due to hidden file
      
      Remove the hidden ._SUCCESS.crc file, which can cause problems in the distribution tar archive and is not used.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #8846 from srowen/SPARK-10716.
      bf20d6c9
    • [SPARK-9821] [PYSPARK] pyspark-reduceByKey-should-take-a-custom-partitioner · 1cd67415
      Holden Karau authored
      From the issue:

      In Scala, I can supply a custom partitioner to reduceByKey (and other aggregation/repartitioning methods like aggregateByKey and combineByKey), but as far as I can tell from the PySpark API, there's no way to do the same in Python.
      Here's an example of my code in Scala:
      weblogs.map(s => (getFileType(s), 1)).reduceByKey(new FileTypePartitioner(), _ + _)
      But I can't figure out how to do the same in Python. The closest I can get is to call repartition before reduceByKey, like so:
      weblogs.map(lambda s: (getFileType(s), 1)).partitionBy(3, hash_filetype).reduceByKey(lambda v1, v2: v1 + v2).collect()
      But that defeats the purpose, because I'm shuffling twice instead of once, so my performance is worse instead of better.
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #8569 from holdenk/SPARK-9821-pyspark-reduceByKey-should-take-a-custom-partitioner.
      1cd67415
  3. Sep 21, 2015