  1. Dec 07, 2016
    • Michael Armbrust's avatar
      [SPARK-18754][SS] Rename recentProgresses to recentProgress · 70b2bf71
      Michael Armbrust authored
      Based on an informal survey, users find this option easier to understand / remember.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #16182 from marmbrus/renameRecentProgress.
      70b2bf71
    • Shixiong Zhu's avatar
      [SPARK-18588][TESTS] Fix flaky test: KafkaSourceStressForDontFailOnDataLossSuite · edc87e18
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Fixed the following failures:
      
      ```
      org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 3745 times over 1.0000790851666665 minutes. Last failure message: assertion failed: failOnDataLoss-0 not deleted after timeout.
      ```
      
      ```
      sbt.ForkMain$ForkError: org.apache.spark.sql.streaming.StreamingQueryException: Query query-66 terminated with exception: null
      	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:252)
      	at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:146)
      Caused by: sbt.ForkMain$ForkError: java.lang.NullPointerException: null
      	at java.util.ArrayList.addAll(ArrayList.java:577)
      	at org.apache.kafka.clients.Metadata.getClusterForCurrentTopics(Metadata.java:257)
      	at org.apache.kafka.clients.Metadata.update(Metadata.java:177)
      	at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleResponse(NetworkClient.java:605)
      	at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeHandleCompletedReceive(NetworkClient.java:582)
      	at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:450)
      	at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:269)
      	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:360)
      	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:224)
      	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:192)
      	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitPendingRequests(ConsumerNetworkClient.java:260)
      	at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:222)
      	at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.ensurePartitionAssignment(ConsumerCoordinator.java:366)
      	at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:978)
      	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938)
      	at
      ...
      ```
      
      ## How was this patch tested?
      
      Tested in #16048 by running it many times.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16109 from zsxwing/fix-kafka-flaky-test.
      edc87e18
    • sarutak's avatar
      [SPARK-18762][WEBUI] Web UI should be http:4040 instead of https:4040 · bb94f61a
      sarutak authored
      ## What changes were proposed in this pull request?
      
      When SSL is enabled, the Spark shell shows:
      ```
      Spark context Web UI available at https://192.168.99.1:4040
      ```
      This is wrong because 4040 is http, not https. It redirects to the https port.
      More importantly, this introduces several broken links in the UI. For example, in the master UI, the worker link is https:8081 instead of http:8081 or https:8481.
      
      CC: mengxr liancheng
      
      I manually tested by accessing MasterPage, WorkerPage and HistoryServer with SSL enabled.
      
      Author: sarutak <sarutak@oss.nttdata.co.jp>
      
      Closes #16190 from sarutak/SPARK-18761.
      bb94f61a
    • Shixiong Zhu's avatar
      [SPARK-18764][CORE] Add a warning log when skipping a corrupted file · dbf3e298
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      It's better to add a warning log when skipping a corrupted file. It will be helpful when we want to finish the job first, then find the corrupted files in the log and fix them.
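      
      A minimal sketch of the skip-and-warn pattern (the wrapper below is illustrative, not the actual Spark internals):
      
      ```scala
      import java.io.IOException
      import org.slf4j.LoggerFactory
      
      // Illustrative wrapper: when corrupted-file skipping is enabled, the skip now
      // leaves a trace in the logs instead of happening silently.
      object SkippingFileReader {
        private val log = LoggerFactory.getLogger(getClass)
      
        def readLines(path: String, ignoreCorruptFiles: Boolean): Iterator[String] =
          try {
            scala.io.Source.fromFile(path).getLines()
          } catch {
            case e: IOException if ignoreCorruptFiles =>
              log.warn(s"Skipping the corrupted file: $path", e)
              Iterator.empty
          }
      }
      ```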
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16192 from zsxwing/SPARK-18764.
      dbf3e298
    • Andrew Ray's avatar
      [SPARK-17760][SQL] AnalysisException with dataframe pivot when groupBy column is not attribute · f1fca81b
      Andrew Ray authored
      ## What changes were proposed in this pull request?
      
      Fixes the AnalysisException for pivot queries whose group-by columns are expressions rather than attributes, by substituting the expression's output attribute in the second aggregation and the final projection.
      
      ## How was this patch tested?
      
      existing and additional unit tests
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #16177 from aray/SPARK-17760.
      f1fca81b
    • Jie Xiong's avatar
      [SPARK-18208][SHUFFLE] Executor OOM due to a growing LongArray in BytesToBytesMap · c496d03b
      Jie Xiong authored
      ## What changes were proposed in this pull request?
      
      BytesToBytesMap currently does not release the in-memory storage (the longArray variable) after it spills to disk. This is typically not a problem during aggregation because the longArray should be much smaller than the pages, and because we grow the longArray at a conservative rate.
      
      However, this can lead to an OOM when an already running task is allocated more than its fair share of memory (which can happen because of a scheduling delay). In this case the longArray can grow beyond the task's fair share. This becomes problematic when the task spills and the longArray is not freed: subsequent memory allocation requests are denied by the memory manager, resulting in an OOM.
      
      This PR fixes this issue by freeing the longArray when the BytesToBytesMap spills.
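      
      A rough sketch of the idea (hypothetical names; the real BytesToBytesMap lives in Spark's Java unsafe-map code):
      
      ```scala
      // Hypothetical sketch of "free the in-memory index when spilling": once the
      // contents are on disk, keeping the large array alive only blocks other
      // memory allocation requests.
      class SpillableMap(private var longArray: Array[Long]) {
        def spill(writeToDisk: Array[Long] => Unit): Unit = {
          writeToDisk(longArray) // persist the current contents
          longArray = null       // drop the in-memory index so its memory can be reclaimed
        }
      }
      ```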
      
      ## How was this patch tested?
      
      Existing tests, plus testing on real-world workloads.
      
      Author: Jie Xiong <jiexiong@fb.com>
      Author: jiexiong <jiexiong@gmail.com>
      
      Closes #15722 from jiexiong/jie_oom_fix.
      c496d03b
    • Sean Owen's avatar
      [SPARK-18678][ML] Skewed reservoir sampling in SamplingUtils · 79f5f281
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high -- k/(l-1) after the l-th element instead of k/l, which matters for small k.
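      
      For reference, a minimal reservoir-sampling sketch using the corrected replacement probability k/l for the l-th element (1-indexed); this is the invariant the fix restores, not the actual SamplingUtils code:
      
      ```scala
      import scala.reflect.ClassTag
      import scala.util.Random
      
      // Reservoir sampling of k items from an input of unknown length. Once the
      // reservoir is full, the l-th element (1-indexed) replaces a uniformly chosen
      // slot with probability k / l -- not k / (l - 1).
      def reservoirSample[T: ClassTag](input: Iterator[T], k: Int, rng: Random = new Random): Array[T] = {
        val reservoir = new Array[T](k)
        var l = 0L
        input.foreach { item =>
          l += 1
          if (l <= k) {
            reservoir((l - 1).toInt) = item
          } else if (rng.nextDouble() < k.toDouble / l) {
            reservoir(rng.nextInt(k)) = item
          }
        }
        reservoir
      }
      ```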
      
      ## How was this patch tested?
      
      Existing test plus new test case.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16129 from srowen/SPARK-18678.
      Unverified
      79f5f281
    • actuaryzhang's avatar
      [SPARK-18701][ML] Fix Poisson GLM failure due to wrong initialization · b8280271
      actuaryzhang authored
      Poisson GLM fails for many standard data sets (see the example in the test or the JIRA). The issue is incorrect initialization leading to almost-zero probabilities and weights. Specifically, the mean is initialized as the response, which could be zero. Applying the log link then results in very negative numbers (protected against -Inf), which again leads to probabilities and weights close to zero in the weighted least squares. Fix and test are included in the commits.
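      
      A hedged sketch of the kind of initialization guard described above (the constant and exact placement are assumptions, not the patch itself):
      
      ```scala
      // Keep the initial Poisson mean strictly positive so that the log link never
      // produces -Inf and the IRLS weights never collapse to zero.
      def initializePoissonMean(y: Double, delta: Double = 0.1): Double = {
        require(y >= 0.0, "The response variable of a Poisson family model must be non-negative.")
        math.max(y, delta) // a zero response starts at a small positive mean instead of 0
      }
      
      // The working response for the log link is then finite: eta0 = log(mu0).
      ```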
      
      ## What changes were proposed in this pull request?
      Update initialization in Poisson GLM
      
      ## How was this patch tested?
      Add test in GeneralizedLinearRegressionSuite
      
      srowen sethah yanboliang HyukjinKwon mengxr
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #16131 from actuaryzhang/master.
      Unverified
      b8280271
    • Yanbo Liang's avatar
      [SPARK-18686][SPARKR][ML] Several cleanup and improvements for spark.logit. · 90b59d1b
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Several cleanups and improvements for ```spark.logit```:
      * ```summary``` should return the coefficients matrix, and should output labels for each class if the model is a multinomial logistic regression model.
      * ```summary``` should not return ```areaUnderROC, roc, pr, ...```, since most of them are DataFrames, which are less important for R users. Meanwhile, these metrics ignore instance weights (setting all to 1.0), which will change in a later Spark version. To avoid introducing breaking changes later, we do not expose them currently.
      * SparkR test improvement: comparing the training result with native R glmnet.
      * Remove the argument ```aggregationDepth``` from ```spark.logit```, since it's an expert Param (related to Spark architecture and job execution) that would rarely be used by R users.
      
      ## How was this patch tested?
      Unit tests.
      
      The ```summary``` output after this change:
      multinomial logistic regression:
      ```
      > df <- suppressWarnings(createDataFrame(iris))
      > model <- spark.logit(df, Species ~ ., regParam = 0.5)
      > summary(model)
      $coefficients
                   versicolor  virginica   setosa
      (Intercept)  1.514031    -2.609108   1.095077
      Sepal_Length 0.02511006  0.2649821   -0.2900921
      Sepal_Width  -0.5291215  -0.02016446 0.549286
      Petal_Length 0.03647411  0.1544119   -0.190886
      Petal_Width  0.000236092 0.4195804   -0.4198165
      ```
      binomial logistic regression:
      ```
      > df <- suppressWarnings(createDataFrame(iris))
      > training <- df[df$Species %in% c("versicolor", "virginica"), ]
      > model <- spark.logit(training, Species ~ ., regParam = 0.5)
      > summary(model)
      $coefficients
                   Estimate
      (Intercept)  -6.053815
      Sepal_Length 0.2449379
      Sepal_Width  0.1648321
      Petal_Length 0.4730718
      Petal_Width  1.031947
      ```
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16117 from yanboliang/spark-18686.
      90b59d1b
  2. Dec 06, 2016
    • Tathagata Das's avatar
      [SPARK-18671][SS][TEST-MAVEN] Follow up PR to fix test for Maven · 5c6bcdbd
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Maven compilation does not seem to allow a resource in sql/test to be easily referred to from the kafka-0-10-sql tests, so the kafka-source-offset-version-2.1.0 file was moved from the sql test resources to the kafka-0-10-sql test resources.
      
      ## How was this patch tested?
      
      Manually ran the Maven tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16183 from tdas/SPARK-18671-1.
      5c6bcdbd
    • Reynold Xin's avatar
      Closes stale & invalid pull requests. · 08d64412
      Reynold Xin authored
      Closes #14537.
      Closes #16181.
      Closes #8318.
      Closes #6848.
      Closes #7265.
      Closes #9543.
      08d64412
    • c-sahuja's avatar
      Update Spark documentation to provide information on how to create External Table · 01c7c6b8
      c-sahuja authored
      ## What changes were proposed in this pull request?
      Although saveAsTable does not currently provide an API to save a DataFrame as an external table, the same functionality can be achieved through options on DataFrameWriter: set the option key "path" to the location of the external table before calling saveAsTable.
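      
      A minimal sketch of the approach this documents (the table name and location below are placeholders):
      
      ```scala
      import org.apache.spark.sql.SparkSession
      
      val spark = SparkSession.builder().appName("external-table-example").enableHiveSupport().getOrCreate()
      val df = spark.range(10).toDF("id")
      
      // Supplying a "path" option before calling saveAsTable makes the saved table
      // external (unmanaged), backed by the given location.
      df.write
        .option("path", "/user/hive/external/my_table") // placeholder location
        .saveAsTable("my_external_table")
      ```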
      
      ## How was this patch tested?
      Documentation was reviewed for formatting and content after the push was performed on the branch.
      ![updated documentation](https://cloud.githubusercontent.com/assets/15376052/20953147/4cfcf308-bc57-11e6-807c-e21fb774a760.PNG)
      
      Author: c-sahuja <sahuja@cloudera.com>
      
      Closes #16185 from c-sahuja/createExternalTable.
      01c7c6b8
    • Tathagata Das's avatar
      [SPARK-18734][SS] Represent timestamp in StreamingQueryProgress as formatted... · 539bb3cf
      Tathagata Das authored
      [SPARK-18734][SS] Represent timestamp in StreamingQueryProgress as formatted string instead of millis
      
      ## What changes were proposed in this pull request?
      
      Timestamps are easier to read while debugging as formatted strings (in ISO8601 format) than as millis.
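      
      For illustration, formatting millis-since-epoch as an ISO8601 string (the exact pattern used in StreamingQueryProgress is an assumption here):
      
      ```scala
      import java.text.SimpleDateFormat
      import java.util.{Date, TimeZone}
      
      // Assumed ISO8601 pattern: millis since the epoch become a readable UTC string.
      val formatter = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
      formatter.setTimeZone(TimeZone.getTimeZone("UTC"))
      
      val formatted = formatter.format(new Date(1481068800000L))
      // "2016-12-07T00:00:00.000Z" -- easier to scan in progress events than raw millis
      ```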
      
      ## How was this patch tested?
      Updated unit tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16166 from tdas/SPARK-18734.
      539bb3cf
    • Sean Owen's avatar
      Revert "[SPARK-18697][BUILD] Upgrade sbt plugins" · 4cc8d890
      Sean Owen authored
      This reverts commit 7f31d378.
      Unverified
      4cc8d890
    • Anirudh's avatar
      [SPARK-18662] Move resource managers to separate directory · 81e5619c
      Anirudh authored
      ## What changes were proposed in this pull request?
      
      * Moves yarn and mesos scheduler backends to resource-managers/ sub-directory (in preparation for https://issues.apache.org/jira/browse/SPARK-18278)
      * Corresponding change in top-level pom.xml.
      
      Ref: https://github.com/apache/spark/pull/16061#issuecomment-263649340
      
      ## How was this patch tested?
      
      * Manual tests
      
      /cc rxin
      
      Author: Anirudh <ramanathana@google.com>
      
      Closes #16092 from foxish/fix-scheduler-structure-2.
      81e5619c
    • Shuai Lin's avatar
      [SPARK-18171][MESOS] Show correct framework address in mesos master web ui... · a8ced76f
      Shuai Lin authored
      [SPARK-18171][MESOS] Show correct framework address in mesos master web ui when the advertised address is used
      
      ## What changes were proposed in this pull request?
      
      In [SPARK-4563](https://issues.apache.org/jira/browse/SPARK-4563) we added support for the driver to advertise a different hostname/ip (`spark.driver.host`) to the executors than the hostname/ip the driver actually binds to (`spark.driver.bindAddress`). But in the mesos web UI's frameworks page, it still shows the driver's bind hostname/ip (though the web UI link is correct). We should fix it to make them consistent.
      
      Before:
      
      ![mesos-spark](https://cloud.githubusercontent.com/assets/717363/19835148/4dffc6d4-9eb8-11e6-8999-f80f10e4c3f7.png)
      
      After:
      
      ![mesos-spark2](https://cloud.githubusercontent.com/assets/717363/19835149/54596ae4-9eb8-11e6-896c-230426acd1a1.png)
      
      This PR doesn't affect the behavior on the spark side, only makes the display on the mesos master side more consistent.
      ## How was this patch tested?
      
      Manual test.
      - Build the package and build a docker image (spark:2.1-test)
      
      ``` sh
      ./dev/make-distribution.sh -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn -Pmesos
      ```
      -  Then run the spark driver inside a docker container.
      
      ``` sh
      docker run --rm -it \
        --name=spark \
        -p 30000-30010:30000-30010 \
        -e LIBPROCESS_ADVERTISE_IP=172.17.42.1 \
        -e LIBPROCESS_PORT=30000 \
        -e MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos-1.0.0.so \
        -e MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos-1.0.0.so \
        -e SPARK_HOME=/opt/dist \
        spark:2.1-test
      ```
      - Inside the container, launch the spark driver, making use of the advertised address:
      
      ``` sh
      /opt/dist/bin/spark-shell \
        --master mesos://zk://172.17.42.1:2181/mesos \
        --conf spark.driver.host=172.17.42.1 \
        --conf spark.driver.bindAddress=172.17.0.1 \
        --conf spark.driver.port=30001 \
        --conf spark.driver.blockManager.port=30002 \
        --conf spark.ui.port=30003 \
        --conf spark.mesos.coarse=true \
        --conf spark.cores.max=2 \
        --conf spark.executor.cores=1 \
        --conf spark.executor.memory=1g \
        --conf spark.mesos.executor.docker.image=spark:2.1-test
      ```
      - Run several spark jobs to ensure everything is running fine.
      
      ``` scala
      val rdd = sc.textFile("file:///opt/dist/README.md")
      rdd.cache().count
      ```
      
      Author: Shuai Lin <linshuai2012@gmail.com>
      
      Closes #15684 from lins05/spark-18171-show-correct-host-name-in-mesos-master-web-ui.
      Unverified
      a8ced76f
    • Weiqing Yang's avatar
      [SPARK-18697][BUILD] Upgrade sbt plugins · 7f31d378
      Weiqing Yang authored
      ## What changes were proposed in this pull request?
      
      This PR is to upgrade sbt plugins. The following sbt plugins will be upgraded:
      ```
      sbt-assembly: 0.11.2 -> 0.14.3
      sbteclipse-plugin: 4.0.0 -> 5.0.1
      sbt-mima-plugin: 0.1.11 -> 0.1.12
      org.ow2.asm/asm: 5.0.3 -> 5.1
      org.ow2.asm/asm-commons: 5.0.3 -> 5.1
      ```
      All other plugins are up-to-date.
      
      ## How was this patch tested?
      Pass the Jenkins build.
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #16159 from weiqingy/SPARK-18697.
      Unverified
      7f31d378
    • Shuai Lin's avatar
      [SPARK-18652][PYTHON] Include the example data and third-party licenses in pyspark package. · bd9a4a5a
      Shuai Lin authored
      ## What changes were proposed in this pull request?
      
      Since we already include the python examples in the pyspark package, we should include the example data with it as well.
      
      We should also include the third-party licenses since we distribute their jars with the pyspark package.
      
      ## How was this patch tested?
      
      Manually tested with python2.7 and python3.4
      ```sh
      $ ./build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Pmesos clean package
      $ cd python
      $ python setup.py sdist
      $ pip install  dist/pyspark-2.1.0.dev0.tar.gz
      
      $ ls -1 /usr/local/lib/python2.7/dist-packages/pyspark/data/
      graphx
      mllib
      streaming
      
      $ du -sh /usr/local/lib/python2.7/dist-packages/pyspark/data/
      600K    /usr/local/lib/python2.7/dist-packages/pyspark/data/
      
      $ ls -1  /usr/local/lib/python2.7/dist-packages/pyspark/licenses/|head -5
      LICENSE-AnchorJS.txt
      LICENSE-DPark.txt
      LICENSE-Mockito.txt
      LICENSE-SnapTree.txt
      LICENSE-antlr.txt
      ```
      
      Author: Shuai Lin <linshuai2012@gmail.com>
      
      Closes #16082 from lins05/include-data-in-pyspark-dist.
      Unverified
      bd9a4a5a
    • Shixiong Zhu's avatar
      [SPARK-18744][CORE] Remove workaround for Netty memory leak · eeed38ea
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      We added some codes in https://github.com/apache/spark/pull/14961 because of https://github.com/netty/netty/issues/5833
      
      Now we can remove them as it's fixed in Netty 4.0.42.Final.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16167 from zsxwing/remove-netty-workaround.
      eeed38ea
    • Yuhao's avatar
      [SPARK-18374][ML] Incorrect words in StopWords/english.txt · fac5b75b
      Yuhao authored
      ## What changes were proposed in this pull request?
      
      Currently the English stop words list in MLlib contains only the truncated forms left after removing the apostrophes, so "wouldn't" becomes "wouldn" and "t". Yet by default Tokenizer and RegexTokenizer don't split on apostrophes or quotes.
      
      This adds the original forms to the stop words list to match the behavior of Tokenizer and StopWordsRemover, and also removes "won" from the list.
      
      see more discussion in the jira: https://issues.apache.org/jira/browse/SPARK-18374
      
      ## How was this patch tested?
      existing ut
      
      Author: Yuhao <yuhao.yang@intel.com>
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #16103 from hhbyyh/addstopwords.
      Unverified
      fac5b75b
    • Tathagata Das's avatar
      [SPARK-18671][SS][TEST] Added tests to ensure stability of that all Structured... · 1ef6b296
      Tathagata Das authored
      [SPARK-18671][SS][TEST] Added tests to ensure stability of all Structured Streaming log formats
      
      ## What changes were proposed in this pull request?
      
      To be able to restart StreamingQueries across Spark versions, we have already made the logs (offset log, file source log, file sink log) use JSON. We should add tests with actual JSON files checked into Spark so that any incompatible change in reading the logs is caught immediately. This PR adds tests for FileStreamSourceLog, FileStreamSinkLog, and OffsetSeqLog.
      
      ## How was this patch tested?
      new unit tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16128 from tdas/SPARK-18671.
      1ef6b296
    • Reynold Xin's avatar
      [SPARK-18714][SQL] Add a simple time function to SparkSession · cb1f10b4
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      Many Spark developers often want to test the runtime of some function in interactive debugging and testing. This patch adds a simple time function to SparkSession:
      
      ```
      scala> spark.time { spark.range(1000).count() }
      Time taken: 77 ms
      res1: Long = 1000
      ```
      
      ## How was this patch tested?
      I tested this interactively in spark-shell.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #16140 from rxin/SPARK-18714.
      cb1f10b4
    • Peter Ableda's avatar
      [SPARK-18740] Log spark.app.name in driver logs · 05d416ff
      Peter Ableda authored
      ## What changes were proposed in this pull request?
      
      Added a simple logInfo line to print the `spark.app.name` in the driver logs.
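      
      A minimal sketch of the added line (its exact placement inside SparkContext is assumed), producing output like the "Submitted application" entry in the example log below:
      
      ```scala
      import org.slf4j.LoggerFactory
      
      // Sketch: log the application name once on driver startup.
      val log = LoggerFactory.getLogger("org.apache.spark.SparkContext")
      val appName = "Spark Pi" // would come from spark.app.name
      log.info(s"Submitted application: $appName")
      ```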
      
      ## How was this patch tested?
      
      Spark was built and tested with SparkPi app. Example log:
      ```
      16/12/06 05:49:50 INFO spark.SparkContext: Running Spark version 2.0.0
      16/12/06 05:49:52 INFO spark.SparkContext: Submitted application: Spark Pi
      16/12/06 05:49:52 INFO spark.SecurityManager: Changing view acls to: root
      16/12/06 05:49:52 INFO spark.SecurityManager: Changing modify acls to: root
      ```
      
      Author: Peter Ableda <peter.ableda@cloudera.com>
      
      Closes #16172 from peterableda/feature/print_appname.
      05d416ff
    • Herman van Hovell's avatar
      [SPARK-18634][SQL][TRIVIAL] Touch-up Generate · 381ef4ea
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      I jumped the gun on merging https://github.com/apache/spark/pull/16120, and missed a tiny potential problem. This PR fixes that by changing a val into a def; this should prevent potential serialization/initialization weirdness from happening.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #16170 from hvanhovell/SPARK-18634.
      381ef4ea
  3. Dec 05, 2016
    • Shixiong Zhu's avatar
      [SPARK-18721][SS] Fix ForeachSink with watermark + append · 7863c623
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Right now ForeachSink creates a new physical plan, so StreamExecution cannot retrieve metrics and the watermark.
      
      This PR changes ForeachSink to manually convert InternalRows to objects without creating a new plan.
      
      ## How was this patch tested?
      
      `test("foreach with watermark: append")`.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16160 from zsxwing/SPARK-18721.
      7863c623
    • hyukjinkwon's avatar
      [SPARK-18672][CORE] Close recordwriter in SparkHadoopMapReduceWriter before committing · b8c7b8d3
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      It seems some APIs such as `PairRDDFunctions.saveAsHadoopDataset()` do not close the record writer before issuing the commit for the task.
      
      On Windows, the output file in the temp directory is still open when the output committer tries to rename it from the temp directory to the output directory after writing has finished.
      
      So it fails to move the file. It seems we should close the writer before committing the task, like other writers such as `FileFormatWriter` do.
      
      Identified failure was as below:
      
      ```
      FAILURE! - in org.apache.spark.JavaAPISuite
      writeWithNewAPIHadoopFile(org.apache.spark.JavaAPISuite)  Time elapsed: 0.25 sec  <<< ERROR!
      org.apache.spark.SparkException: Job aborted.
      	at org.apache.spark.JavaAPISuite.writeWithNewAPIHadoopFile(JavaAPISuite.java:1231)
      Caused by: org.apache.spark.SparkException:
      Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows
      	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.org$apache$spark$internal$io$SparkHadoopMapReduceWriter$$executeTask(SparkHadoopMapReduceWriter.scala:182)
      	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$3.apply(SparkHadoopMapReduceWriter.scala:100)
      	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$3.apply(SparkHadoopMapReduceWriter.scala:99)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
      	at org.apache.spark.scheduler.Task.run(Task.scala:108)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: java.io.IOException: Could not rename file:/C:/projects/spark/core/target/tmp/1480553515529-0/output/_temporary/0/_temporary/attempt_20161201005155_0000_r_000000_0 to file:/C:/projects/spark/core/target/tmp/1480553515529-0/output/_temporary/0/task_20161201005155_0000_r_000000
      	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:436)
      	at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:415)
      	at org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:50)
      	at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:76)
      	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitTask(HadoopMapReduceCommitProtocol.scala:153)
      	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$4.apply(SparkHadoopMapReduceWriter.scala:167)
      	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$4.apply(SparkHadoopMapReduceWriter.scala:156)
      	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
      	at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.org$apache$spark$internal$io$SparkHadoopMapReduceWriter$$executeTask(SparkHadoopMapReduceWriter.scala:168)
      	... 8 more
      Driver stacktrace:
      	at org.apache.spark.JavaAPISuite.writeWithNewAPIHadoopFile(JavaAPISuite.java:1231)
      Caused by: org.apache.spark.SparkException: Task failed while writing rows
      Caused by: java.io.IOException: Could not rename file:/C:/projects/spark/core/target/tmp/1480553515529-0/output/_temporary/0/_temporary/attempt_20161201005155_0000_r_000000_0 to file:/C:/projects/spark/core/target/tmp/1480553515529-0/output/_temporary/0/task_20161201005155_0000_r_000000
      ```
      
      This PR proposes to close the writer before committing the task.
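      
      A hedged sketch of the ordering change (the helper below is illustrative, not the actual SparkHadoopMapReduceWriter code):
      
      ```scala
      import org.apache.hadoop.mapreduce.{RecordWriter, TaskAttemptContext}
      
      // Close the record writer *before* committing the task output, so the
      // committer never tries to rename a file that is still open (which fails on Windows).
      def writeAndCommit[K, V](
          writer: RecordWriter[K, V],
          records: Iterator[(K, V)],
          context: TaskAttemptContext,
          commit: TaskAttemptContext => Unit): Unit = {
        try {
          records.foreach { case (k, v) => writer.write(k, v) }
        } finally {
          writer.close(context) // close first ...
        }
        commit(context)         // ... then commit
      }
      ```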
      
      ## How was this patch tested?
      
      Manually tested via AppVeyor.
      
      **Before**
      
      https://ci.appveyor.com/project/spark-test/spark/build/94-scala-tests
      
      **After**
      
      https://ci.appveyor.com/project/spark-test/spark/build/93-scala-tests
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16098 from HyukjinKwon/close-wirter-first.
      Unverified
      b8c7b8d3
    • Michael Allman's avatar
      [SPARK-18572][SQL] Add a method `listPartitionNames` to `ExternalCatalog` · 772ddbea
      Michael Allman authored
      (Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-18572)
      
      ## What changes were proposed in this pull request?
      
      Currently Spark answers the `SHOW PARTITIONS` command by fetching all of the table's partition metadata from the external catalog and constructing partition names therefrom. The Hive client has a `getPartitionNames` method which is many times faster for this purpose, with the performance improvement scaling with the number of partitions in a table.
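      
      The new catalog method has roughly this shape (a sketch inferred from the description, not copied from the patch):
      
      ```scala
      // Answer SHOW PARTITIONS from partition *names* only, instead of fetching the
      // full partition metadata for every partition.
      trait ExternalCatalogSketch {
        /** Returns partition names of the given table, e.g. "ds=2016-12-05/hr=11". */
        def listPartitionNames(
            db: String,
            table: String,
            partialSpec: Option[Map[String, String]] = None): Seq[String]
      }
      ```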
      
      To test the performance impact of this PR, I ran the `SHOW PARTITIONS` command on two Hive tables with large numbers of partitions. One table has ~17,800 partitions, and the other has ~95,000 partitions. For the purposes of this PR, I'll call the former table `table1` and the latter table `table2`. I ran 5 trials for each table with before-and-after versions of this PR. The results are as follows:
      
      Spark at bdc8153e, `SHOW PARTITIONS table1`, times in seconds:
      7.901
      3.983
      4.018
      4.331
      4.261
      
      Spark at bdc8153e, `SHOW PARTITIONS table2`
      (Timed out after 10 minutes with a `SocketTimeoutException`.)
      
      Spark at this PR, `SHOW PARTITIONS table1`, times in seconds:
      3.801
      0.449
      0.395
      0.348
      0.336
      
      Spark at this PR, `SHOW PARTITIONS table2`, times in seconds:
      5.184
      1.63
      1.474
      1.519
      1.41
      
      Taking the best times from each trial, we get a 12x performance improvement for a table with ~17,800 partitions and at least a 426x improvement for a table with ~95,000 partitions. More significantly, the latter command doesn't even complete with the current code in master.
      
      This is actually a patch we've been using in-house at VideoAmp since Spark 1.1. It's made all the difference in the practical usability of our largest tables. Even with tables with about 1,000 partitions there's a performance improvement of about 2-3x.
      
      ## How was this patch tested?
      
      I added a unit test to `VersionsSuite` which tests that the Hive client's `getPartitionNames` method returns the correct number of partitions.
      
      Author: Michael Allman <michael@videoamp.com>
      
      Closes #15998 from mallman/spark-18572-list_partition_names.
      772ddbea
    • Shixiong Zhu's avatar
      [SPARK-18722][SS] Move no data rate limit from StreamExecution to ProgressReporter · 4af142f5
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Move no data rate limit from StreamExecution to ProgressReporter to make `recentProgresses` and listener events consistent.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16155 from zsxwing/SPARK-18722.
      4af142f5
    • root's avatar
      [SPARK-18555][SQL] DataFrameNaFunctions.fill messes up original values in long integers · 508de38c
      root authored
      ## What changes were proposed in this pull request?
      
      When `DataSet.na.fill(0)` is used on a Dataset that has a long-valued column, it changes the original long values.
      
      The reason is that the type of fill's parameter is Double, so numeric columns are always cast to double (`fillCol[Double](f, value)`).
      ```
        def fill(value: Double, cols: Seq[String]): DataFrame = {
          val columnEquals = df.sparkSession.sessionState.analyzer.resolver
          val projections = df.schema.fields.map { f =>
            // Only fill if the column is part of the cols list.
            if (f.dataType.isInstanceOf[NumericType] && cols.exists(col => columnEquals(f.name, col))) {
              fillCol[Double](f, value)
            } else {
              df.col(f.name)
            }
          }
          df.select(projections : _*)
        }
      ```
      
       For example:
      ```
      scala> val df = Seq[(Long, Long)]((1, 2), (-1, -2), (9123146099426677101L, 9123146560113991650L)).toDF("a", "b")
      df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint]
      
      scala> df.show
      +-------------------+-------------------+
      |                  a|                  b|
      +-------------------+-------------------+
      |                  1|                  2|
      |                 -1|                 -2|
      |9123146099426677101|9123146560113991650|
      +-------------------+-------------------+
      
      scala> df.na.fill(0).show
      +-------------------+-------------------+
      |                  a|                  b|
      +-------------------+-------------------+
      |                  1|                  2|
      |                 -1|                 -2|
      |9123146099426676736|9123146560113991680|
      +-------------------+-------------------+
       ```
      
      the original values changed (which is not the expected result):
      ```
       9123146099426677101 -> 9123146099426676736
       9123146560113991650 -> 9123146560113991680
      ```
      
      ## How was this patch tested?
      
      unit test added.
      
      Author: root <root@iZbp1gsnrlfzjxh82cz80vZ.(none)>
      
      Closes #15994 from windpiger/nafillMissupOriginalValue.
      508de38c
    • gatorsmile's avatar
      [SPARK-18720][SQL][MINOR] Code Refactoring of withColumn · 2398fde4
      gatorsmile authored
      ### What changes were proposed in this pull request?
      Our existing withColumn for adding metadata can simply use the existing public withColumn API.
      
      ### How was this patch tested?
      The existing test cases cover it.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16152 from gatorsmile/withColumnRefactoring.
      2398fde4
    • Tathagata Das's avatar
      [SPARK-18657][SPARK-18668] Make StreamingQuery.id persists across restart and... · bb57bfe9
      Tathagata Das authored
      [SPARK-18657][SPARK-18668] Make StreamingQuery.id persists across restart and not auto-generate StreamingQuery.name
      
      ## What changes were proposed in this pull request?
      Here are the major changes in this PR.
      - Added the ability to recover `StreamingQuery.id` from checkpoint location, by writing the id to `checkpointLoc/metadata`.
      - Added `StreamingQuery.runId` which is unique for every query started and does not persist across restarts. This is to identify each restart of a query separately (same as earlier behavior of `id`).
      - Removed auto-generation of `StreamingQuery.name`. The purpose of name was to have the ability to define an identifier across restarts, but since id is precisely that, there is no need for an auto-generated name. This means name becomes purely cosmetic, and is null by default.
      - Added `runId` to `StreamingQueryListener` events and `StreamingQueryProgress`.
      
      Implementation details
      - Renamed existing `StreamExecutionMetadata` to `OffsetSeqMetadata`, and moved it to the file `OffsetSeq.scala`, because that is what this metadata is tied to. Also did some refactoring to make the code cleaner (got rid of a lot of `.json` and `.getOrElse("{}")`).
      - Added the `id` as the new `StreamMetadata`.
      - When a StreamingQuery is created it gets or writes the `StreamMetadata` from `checkpointLoc/metadata`.
      - All internal logging in `StreamExecution` uses `(name, id, runId)` instead of just `name`
      
      TODO
      - [x] Test handling of name=null in json generation of StreamingQueryProgress
      - [x] Test handling of name=null in json generation of StreamingQueryListener events
      - [x] Test python API of runId
      
      ## How was this patch tested?
      Updated unit tests and new unit tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16113 from tdas/SPARK-18657.
      bb57bfe9
    • Shixiong Zhu's avatar
      [SPARK-18729][SS] Move DataFrame.collect out of synchronized block in MemorySink · 1b2785c3
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Move DataFrame.collect out of the synchronized block so that we can query the content of MemorySink while `DataFrame.collect` is running.
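      
      The pattern is the usual one of keeping slow work outside the lock; a simplified sketch (not the actual MemorySink code):
      
      ```scala
      import scala.collection.mutable.ArrayBuffer
      import org.apache.spark.sql.{DataFrame, Row}
      
      // Run the expensive collect() outside the lock and only mutate shared state
      // inside the synchronized block, so readers are not blocked during collection.
      class MemorySinkSketch {
        private val batches = new ArrayBuffer[Array[Row]]()
      
        def addBatch(data: DataFrame): Unit = {
          val rows = data.collect()        // potentially slow; no lock held
          synchronized { batches += rows } // quick update under the lock
        }
      
        def allData: Seq[Row] = synchronized { batches.flatten }
      }
      ```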
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16162 from zsxwing/SPARK-18729.
      1b2785c3
    • Liang-Chi Hsieh's avatar
      [SPARK-18634][PYSPARK][SQL] Corruption and Correctness issues with exploding Python UDFs · 3ba69b64
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      As reported in the Jira, there are some weird issues with exploding Python UDFs in SparkSQL.
      
      The following test code can reproduce it. Note: the test code is reported to return wrong results in the Jira; however, when I tested it on the master branch, it throws an exception and so can't return any result.
      
          >>> from pyspark.sql.functions import *
          >>> from pyspark.sql.types import *
          >>>
          >>> df = spark.range(10)
          >>>
          >>> def return_range(value):
          ...   return [(i, str(i)) for i in range(value - 1, value + 1)]
          ...
          >>> range_udf = udf(return_range, ArrayType(StructType([StructField("integer_val", IntegerType()),
          ...                                                     StructField("string_val", StringType())])))
          >>>
          >>> df.select("id", explode(range_udf(df.id))).show()
          Traceback (most recent call last):
            File "<stdin>", line 1, in <module>
            File "/spark/python/pyspark/sql/dataframe.py", line 318, in show
              print(self._jdf.showString(n, 20))
            File "/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
            File "/spark/python/pyspark/sql/utils.py", line 63, in deco
              return f(*a, **kw)
            File "/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o126.showString.: java.lang.AssertionError: assertion failed
              at scala.Predef$.assert(Predef.scala:156)
              at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:120)
              at org.apache.spark.sql.execution.GenerateExec.consume(GenerateExec.scala:57)
      
      The cause of this issue is, in `ExtractPythonUDFs` we insert `BatchEvalPythonExec` to run PythonUDFs in batch. `BatchEvalPythonExec` will add extra outputs (e.g., `pythonUDF0`) to original plan. In above case, the original `Range` only has one output `id`. After `ExtractPythonUDFs`, the added `BatchEvalPythonExec` has two outputs `id` and `pythonUDF0`.
      
      Because the output of `GenerateExec` is fixed after the analysis phase, in the above case it is the combination of `id` (i.e., the output of `Range`) and `col`. But in the planning phase we change `GenerateExec`'s child plan to `BatchEvalPythonExec`, which has additional output attributes.
      
      This causes no problem without whole-stage codegen, because during evaluation the additional attributes are projected out of the final output of `GenerateExec`.
      
      However, as `GenerateExec` now supports whole-stage codegen, the framework feeds all of the child plan's outputs into `GenerateExec`. Then, when consuming `GenerateExec`'s output data (i.e., calling `consume`), the number of output attributes differs from the number of output variables in whole-stage codegen.
      
      To solve this issue, this patch gives only the generator's output to `GenerateExec` after the analysis phase. `GenerateExec`'s output is then the combination of its child plan's output and the generator's output, so when we change `GenerateExec`'s child, its output is still correct.
      
      ## How was this patch tested?
      
      Added test cases to PySpark.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16120 from viirya/fix-py-udf-with-generator.
      3ba69b64
    • Nicholas Chammas's avatar
      [SPARK-18719] Add spark.ui.showConsoleProgress to configuration docs · 18eaabb7
      Nicholas Chammas authored
      This PR adds `spark.ui.showConsoleProgress` to the configuration docs.
      
      I tested this PR by building the docs locally and confirming that this change shows up as expected.
      
      Relates to #3029.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #16151 from nchammas/ui-progressbar-doc.
      18eaabb7
    • Nicholas Chammas's avatar
      [DOCS][MINOR] Update location of Spark YARN shuffle jar · 5a92dc76
      Nicholas Chammas authored
      Looking at the distributions provided on spark.apache.org, I see that the Spark YARN shuffle jar is under `yarn/` and not `lib/`.
      
      This change is so minor I'm not sure it needs a JIRA. But let me know if so and I'll create one.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #16130 from nchammas/yarn-doc-fix.
      5a92dc76
    • Wenchen Fan's avatar
      [SPARK-18711][SQL] should disable subexpression elimination for LambdaVariable · 01a7d33d
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This is kind of a long-standing bug; it was hidden until https://github.com/apache/spark/pull/15780, which may add `AssertNotNull` on top of `LambdaVariable` and thus enables subexpression elimination.
      
      However, subexpression elimination will evaluate the common expressions at the beginning, which is invalid for `LambdaVariable`. `LambdaVariable` usually represents a loop variable, which can't be evaluated ahead of the loop.
      
      This PR skips expressions containing `LambdaVariable` when doing subexpression elimination.
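      
      A hedged sketch of the guard this describes (the exact hook inside `EquivalentExpressions` may differ):
      
      ```scala
      import org.apache.spark.sql.catalyst.expressions.Expression
      import org.apache.spark.sql.catalyst.expressions.objects.LambdaVariable
      
      // An expression that (transitively) references a loop variable must not be
      // hoisted out of the loop by subexpression elimination.
      def canBeEliminated(expr: Expression): Boolean =
        expr.find(_.isInstanceOf[LambdaVariable]).isEmpty
      ```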
      
      ## How was this patch tested?
      
      updated test in `DatasetAggregatorSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16143 from cloud-fan/aggregator.
      01a7d33d
    • Shixiong Zhu's avatar
      [SPARK-18694][SS] Add StreamingQuery.explain and exception to Python and fix... · 24601285
      Shixiong Zhu authored
      [SPARK-18694][SS] Add StreamingQuery.explain and exception to Python and fix StreamingQueryException
      
      ## What changes were proposed in this pull request?
      
      - Add StreamingQuery.explain and exception to Python.
      - Fix StreamingQueryException to not expose `OffsetSeq`.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16125 from zsxwing/py-streaming-explain.
      24601285
    • Dongjoon Hyun's avatar
      [MINOR][DOC] Use SparkR `TRUE` value and add default values for `StructField` in SQL Guide. · 410b7898
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      In the `SQL Programming Guide`, this PR uses `TRUE` instead of `True` in SparkR and adds the default value of `nullable` for `StructField` in Scala/Python/R (i.e., "Note: The default value of nullable is true."). In the Java API, `nullable` is not optional.
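      
      For instance, in Scala the default is visible directly in the `StructField` constructor:
      
      ```scala
      import org.apache.spark.sql.types._
      
      // nullable defaults to true when omitted (Scala/Python/R); the Java API requires it explicitly.
      val withDefault = StructField("name", StringType)               // nullable = true
      val explicit    = StructField("age", IntegerType, nullable = false)
      val schema      = StructType(Seq(withDefault, explicit))
      ```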
      
      **BEFORE**
      * SPARK 2.1.0 RC1
      http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-docs/sql-programming-guide.html#data-types
      
      **AFTER**
      
      * R
      <img width="916" alt="screen shot 2016-12-04 at 11 58 19 pm" src="https://cloud.githubusercontent.com/assets/9700541/20877443/abba19a6-ba7d-11e6-8984-afbe00333fb0.png">
      
      * Scala
      <img width="914" alt="screen shot 2016-12-04 at 11 57 37 pm" src="https://cloud.githubusercontent.com/assets/9700541/20877433/99ce734a-ba7d-11e6-8bb5-e8619041b09b.png">
      
      * Python
      <img width="914" alt="screen shot 2016-12-04 at 11 58 04 pm" src="https://cloud.githubusercontent.com/assets/9700541/20877440/a5c89338-ba7d-11e6-8f92-6c0ae9388d7e.png">
      
      ## How was this patch tested?
      
      Manual.
      
      ```
      cd docs
      SKIP_API=1 jekyll build
      open _site/index.html
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16141 from dongjoon-hyun/SPARK-SQL-GUIDE.
      410b7898
    • Yanbo Liang's avatar
      [SPARK-18279][DOC][ML][SPARKR] Add R examples to ML programming guide. · eb8dd681
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Add R examples to ML programming guide for the following algorithms as POC:
      * spark.glm
      * spark.survreg
      * spark.naiveBayes
      * spark.kmeans
      
      The four algorithms have been in SparkR since 2.0.0; more docs for algorithms added during the 2.1 release cycle will be addressed in a separate follow-up PR.
      
      ## How was this patch tested?
      This is the screenshots of generated ML programming guide for ```GeneralizedLinearRegression```:
      ![image](https://cloud.githubusercontent.com/assets/1962026/20866403/babad856-b9e1-11e6-9984-62747801e8c4.png)
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16136 from yanboliang/spark-18279.
      eb8dd681
    • Zheng RuiFeng's avatar
      [SPARK-18625][ML] OneVsRestModel should support setFeaturesCol and setPredictionCol · bdfe7f67
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Add `setFeaturesCol` and `setPredictionCol` for `OneVsRestModel`.
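      
      A short usage sketch of the new setters (the DataFrames and column names below are placeholders):
      
      ```scala
      import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
      
      // Fit a one-vs-rest model, then re-point it at different input/output columns
      // when scoring (trainingData and testData are assumed DataFrames).
      val ovr = new OneVsRest().setClassifier(new LogisticRegression())
      val model = ovr.fit(trainingData)
      val scored = model
        .setFeaturesCol("scaledFeatures")   // placeholder column names
        .setPredictionCol("ovrPrediction")
        .transform(testData)
      ```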
      
      ## How was this patch tested?
      added tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #16059 from zhengruifeng/ovrm_setCol.
      bdfe7f67