  1. Aug 06, 2017
  2. Aug 05, 2017
    • [SPARK-20963][SQL] Support column aliases for join relations in FROM clause · 990efad1
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR adds parsing rules to support column aliases for join relations in the FROM clause.
      This PR is a sub-task of #18079.
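
      A minimal sketch of the new syntax (table and column names invented here; assumes a `spark` session, e.g. in spark-shell):

      ```scala
      spark.sql("CREATE TEMPORARY VIEW t1 AS SELECT 1 AS id, 'a' AS v1")
      spark.sql("CREATE TEMPORARY VIEW t2 AS SELECT 1 AS id, 'b' AS v2")

      // Alias the join relation as `t` and rename its four output columns.
      spark.sql("""
        SELECT t.c1, t.c4
        FROM (t1 JOIN t2 ON t1.id = t2.id) AS t(c1, c2, c3, c4)
      """).show()
      ```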
      
      ## How was this patch tested?
      Added tests in `AnalysisSuite`, `PlanParserSuite`, and `SQLQueryTestSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #18772 from maropu/SPARK-20963-2.
    • [SPARK-21637][SPARK-21451][SQL] get `spark.hadoop.*` properties from sysProps to hiveconf · 41568e9a
      hzyaoqin authored
      ## What changes were proposed in this pull request?
      When the `bin/spark-sql` command is run with `--conf spark.hadoop.foo=bar`, the `SparkSQLCliDriver` initializes a hiveconf instance but does not add `foo -> bar` to it.
      This PR copies the `spark.hadoop.*` properties from sysProps into that hiveconf.
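
      A minimal sketch of the idea (not the actual Spark source): copy every `spark.hadoop.*` system property into a Hadoop configuration, stripping the prefix.

      ```scala
      import org.apache.hadoop.conf.Configuration

      val hadoopConf = new Configuration()
      sys.props.foreach { case (k, v) =>
        if (k.startsWith("spark.hadoop.")) {
          // spark.hadoop.foo=bar becomes foo=bar in the Hadoop/Hive conf
          hadoopConf.set(k.stripPrefix("spark.hadoop."), v)
        }
      }
      ```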
      
      ## How was this patch tested?
      Unit tests.
      
      Author: hzyaoqin <hzyaoqin@corp.netease.com>
      Author: Kent Yao <yaooqinn@hotmail.com>
      
      Closes #18668 from yaooqinn/SPARK-21451.
    • [SPARK-21640] Add errorifexists as a valid string for ErrorIfExists save mode · dcac1d57
      arodriguez authored
      ## What changes were proposed in this pull request?
      
      This PR makes the string "errorifexists" a valid alias for the ErrorIfExists save mode.
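
      A short usage sketch (output paths invented; assumes a `spark` session): after this change, both spellings select the same error-if-exists behavior.

      ```scala
      val df = spark.range(10).toDF("id")
      df.write.mode("error").parquet("/tmp/out1")
      df.write.mode("errorifexists").parquet("/tmp/out2") // newly accepted alias
      ```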
      
      ## How was this patch tested?
      
      Unit tests and manual tests
      
      Author: arodriguez <arodriguez@arodriguez.stratio>
      
      Closes #18844 from ardlema/SPARK-21640.
    • [SPARK-21485][FOLLOWUP][SQL][DOCS] Describes examples and arguments... · ba327ee5
      hyukjinkwon authored
      [SPARK-21485][FOLLOWUP][SQL][DOCS] Describes examples and arguments separately, and note/since in SQL built-in function documentation
      
      ## What changes were proposed in this pull request?
      
      This PR proposes to separate `extended` into `examples` and `arguments` internally so that both can be documented separately, and to add `since` and `note` for additional information.
      
      For `since`, it appears users are sometimes confused by missing version information. For example, see https://www.mail-archive.com/userspark.apache.org/msg64798.html
      
      For a few good examples of the built documentation, please see both:
      `from_json` - https://spark-test.github.io/sparksqldoc/#from_json
      `like` - https://spark-test.github.io/sparksqldoc/#like
      
      For `DESCRIBE FUNCTION`, `note` and `since` are added as below:
      
      ```
      > DESCRIBE FUNCTION EXTENDED rlike;
      ...
      Extended Usage:
          Arguments:
            ...
      
          Examples:
            ...
      
          Note:
            Use LIKE to match with simple string pattern
      ```
      
      ```
      > DESCRIBE FUNCTION EXTENDED to_json;
      ...
          Examples:
            ...
      
          Since: 2.2.0
      ```
      
      For the complete documentation, see https://spark-test.github.io/sparksqldoc/
      
      ## How was this patch tested?
      
      Manual tests and existing tests. Please see https://spark-test.github.io/sparksqldoc
      
      Jenkins tests are needed to double-check.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18749 from HyukjinKwon/followup-sql-doc-gen.
    • [INFRA] Close stale PRs · 3a45c7fe
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to close stale PRs, mostly the same instances as in #18017.
      
      Closes #14085 - [SPARK-16408][SQL] SparkSQL Added file get Exception: is a directory …
      Closes #14239 - [SPARK-16593] [CORE] [WIP] Provide a pre-fetch mechanism to accelerate shuffle stage.
      Closes #14567 - [SPARK-16992][PYSPARK] Python Pep8 formatting and import reorganisation
      Closes #14579 - [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() should return Python context managers
      Closes #14601 - [SPARK-13979][Core] Killed executor is re spawned without AWS key…
      Closes #14830 - [SPARK-16992][PYSPARK][DOCS] import sort and autopep8 on Pyspark examples
      Closes #14963 - [SPARK-16992][PYSPARK] Virtualenv for Pylint and pep8 in lint-python
      Closes #15227 - [SPARK-17655][SQL]Remove unused variables declarations and definations in a WholeStageCodeGened stage
      Closes #15240 - [SPARK-17556] [CORE] [SQL] Executor side broadcast for broadcast joins
      Closes #15405 - [SPARK-15917][CORE] Added support for number of executors in Standalone [WIP]
      Closes #16099 - [SPARK-18665][SQL] set statement state to "ERROR" after user cancel job
      Closes #16445 - [SPARK-19043][SQL]Make SparkSQLSessionManager more configurable
      Closes #16618 - [SPARK-14409][ML][WIP] Add RankingEvaluator
      Closes #16766 - [SPARK-19426][SQL] Custom coalesce for Dataset
      Closes #16832 - [SPARK-19490][SQL] ignore case sensitivity when filtering hive partition columns
      Closes #17052 - [SPARK-19690][SS] Join a streaming DataFrame with a batch DataFrame which has an aggregation may not work
      Closes #17267 - [SPARK-19926][PYSPARK] Make pyspark exception more user-friendly
      Closes #17371 - [SPARK-19903][PYSPARK][SS] window operator miss the `watermark` metadata of time column
      Closes #17401 - [SPARK-18364][YARN] Expose metrics for YarnShuffleService
      Closes #17519 - [SPARK-15352][Doc] follow-up: add configuration docs for topology-aware block replication
      Closes #17530 - [SPARK-5158] Access kerberized HDFS from Spark standalone
      Closes #17854 - [SPARK-20564][Deploy] Reduce massive executor failures when executor count is large (>2000)
      Closes #17979 - [SPARK-19320][MESOS][WIP]allow specifying a hard limit on number of gpus required in each spark executor when running on mesos
      Closes #18127 - [SPARK-6628][SQL][Branch-2.1] Fix ClassCastException when executing sql statement 'insert into' on hbase table
      Closes #18236 - [SPARK-21015] Check field name is not null and empty in GenericRowWit…
      Closes #18269 - [SPARK-21056][SQL] Use at most one spark job to list files in InMemoryFileIndex
      Closes #18328 - [SPARK-21121][SQL] Support changing storage level via the spark.sql.inMemoryColumnarStorage.level variable
      Closes #18354 - [SPARK-18016][SQL][CATALYST][BRANCH-2.1] Code Generation: Constant Pool Limit - Class Splitting
      Closes #18383 - [SPARK-21167][SS] Set kafka clientId while fetch messages
      Closes #18414 - [SPARK-21169] [core] Make sure to update application status to RUNNING if executors are accepted and RUNNING after recovery
      Closes #18432 - resolve com.esotericsoftware.kryo.KryoException
      Closes #18490 - [SPARK-21269][Core][WIP] Fix FetchFailedException when enable maxReqSizeShuffleToMem and KryoSerializer
      Closes #18585 - SPARK-21359
      Closes #18609 - Spark SQL merge small files to big files Update InsertIntoHiveTable.scala
      
      Added:
      Closes #18308 - [SPARK-21099][Spark Core] INFO Log Message Using Incorrect Executor I…
      Closes #18599 - [SPARK-21372] spark writes one log file even I set the number of spark_rotate_log to 0
      Closes #18619 - [SPARK-21397][BUILD]Maven shade plugin adding dependency-reduced-pom.xml to …
      Closes #18667 - Fix the simpleString used in error messages
      Closes #18782 - Branch 2.1
      
      Added:
      Closes #17694 - [SPARK-12717][PYSPARK] Resolving race condition with pyspark broadcasts when using multiple threads
      
      Added:
      Closes #16456 - [SPARK-18994] clean up the local directories for application in future by annother thread
      Closes #18683 - [SPARK-21474][CORE] Make number of parallel fetches from a reducer configurable
      Closes #18690 - [SPARK-21334][CORE] Add metrics reporting service to External Shuffle Server
      
      Added:
      Closes #18827 - Merge pull request 1 from apache/master
      
      ## How was this patch tested?
      
      N/A
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18780 from HyukjinKwon/close-prs.
    • [SPARK-21580][SQL] Integers in aggregation expressions are wrongly taken as group-by ordinal · 894d5a45
      liuxian authored
      ## What changes were proposed in this pull request?
      
      ```sql
      create temporary view data as select * from values
      (1, 1),
      (1, 2),
      (2, 1),
      (2, 2),
      (3, 1),
      (3, 2)
      as data(a, b);
      ```
      
      `select 3, 4, sum(b) from data group by 1, 2;`
      `select 3 as c, 4 as d, sum(b) from data group by c, d;`
      
      When running these two cases, the following exception occurs:
      `Error in query: GROUP BY position 4 is not in select list (valid range is [1, 3]); line 1 pos 10`
      
      The cause of this failure:
      If an aggregate expression is an integer literal, then after the group expression is replaced with this aggregate expression, the result is still treated as a group-by ordinal.
      
      The solution:
      This bug is due to re-entrance of an analyzed plan. We can solve it by using `resolveOperators` in `SubstituteUnresolvedOrdinals`.
      
      ## How was this patch tested?
      Added unit test case
      
      Author: liuxian <liu.xian3@zte.com.cn>
      
      Closes #18779 from 10110346/groupby.
    • [SPARK-21374][CORE] Fix reading globbed paths from S3 into DF with disabled FS cache · 6cbd18c9
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR replaces #18623 to do some cleanup.
      
      Closes #18623
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      Author: Andrey Taptunov <taptunov@amazon.com>
      
      Closes #18848 from zsxwing/review-pr18623.
  3. Aug 04, 2017
    • [SPARK-21634][SQL] Change OneRowRelation from a case object to case class · 5ad1796b
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      OneRowRelation is the only plan that is a case object, which causes some issues with makeCopy using a 0-arg constructor. This patch changes it from a case object to a case class.
      
      This blocks SPARK-21619.
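
      A minimal illustration of the issue (names invented; assumes `makeCopy` reflectively invokes the node's constructor): a case object exposes no public constructor for reflection to call, while a 0-arg case class does.

      ```scala
      case object AsObject
      case class AsClass()

      // In the Scala REPL:
      println(AsObject.getClass.getConstructors.length) // 0 - nothing to invoke
      println(classOf[AsClass].getConstructors.length)  // 1 - a 0-arg constructor
      ```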
      
      ## How was this patch tested?
      Should be covered by existing test cases.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #18839 from rxin/SPARK-21634.
    • [SPARK-21205][SQL] pmod(number, 0) should be null. · 231f6724
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      Hive `pmod(3.13, 0)`:
      ```:sql
      hive> select pmod(3.13, 0);
      OK
      NULL
      Time taken: 2.514 seconds, Fetched: 1 row(s)
      hive>
      ```
      
      Spark `mod(3.13, 0)`:
      ```:sql
      spark-sql> select mod(3.13, 0);
      NULL
      spark-sql>
      ```
      
      But the Spark `pmod(3.13, 0)`:
      ```:sql
      spark-sql> select pmod(3.13, 0);
      17/06/25 09:35:58 ERROR SparkSQLDriver: Failed in [select pmod(3.13, 0)]
      java.lang.NullPointerException
      	at org.apache.spark.sql.catalyst.expressions.Pmod.pmod(arithmetic.scala:504)
      	at org.apache.spark.sql.catalyst.expressions.Pmod.nullSafeEval(arithmetic.scala:432)
      	at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:419)
      	at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:323)
      ...
      ```
      This PR makes `pmod(number, 0)` return null.
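
      A simplified sketch of the null-safe guard (not the actual Spark source):

      ```scala
      def pmod(a: Double, n: Double): java.lang.Double =
        if (n == 0) null // guard: pmod(number, 0) is null
        else { val r = a % n; if (r < 0) r + n else r }

      pmod(3.13, 0)   // null instead of a NullPointerException
      pmod(-7.0, 3.0) // 2.0
      ```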
      
      ## How was this patch tested?
      unit tests
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #18413 from wangyum/SPARK-21205.
    • [SPARK-21633][ML][PYTHON] UnaryTransformer in Python · 1347b2a6
      Ajay Saini authored
      ## What changes were proposed in this pull request?
      
      Implemented UnaryTransformer in Python.
      
      ## How was this patch tested?
      
      This patch was tested by creating a MockUnaryTransformer class in the unit tests that extends UnaryTransformer and testing that the transform function produced correct output.
      
      Author: Ajay Saini <ajays725@gmail.com>
      
      Closes #18746 from ajaysaini725/AddPythonUnaryTransformer.
    • [SPARK-21330][SQL] Bad partitioning does not allow to read a JDBC table with... · 25826c77
      Andrew Ray authored
      [SPARK-21330][SQL] Bad partitioning does not allow to read a JDBC table with extreme values on the partition column
      
      ## What changes were proposed in this pull request?
      
      An overflow of the difference of bounds on the partitioning column leads to no data being read. This
      patch checks for this overflow.
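
      An illustrative check (not the actual Spark source) of how the bounds difference can overflow a Long when the partition column spans extreme values:

      ```scala
      val lowerBound = Long.MinValue
      val upperBound = Long.MaxValue

      // The Long subtraction wraps around, so the stride computation breaks.
      println(upperBound - lowerBound) // -1

      // Widening to BigInt reveals the overflow.
      println(BigInt(upperBound) - BigInt(lowerBound) > Long.MaxValue) // true
      ```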
      
      ## How was this patch tested?
      
      New unit test.
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #18800 from aray/SPARK-21330.
    • [SPARK-21254][WEBUI] History UI performance fixes · e3967dc5
      Dmitry Parfenchik authored
      ## What changes were proposed in this pull request?
      
      As described in the JIRA ticket, the History page takes about a minute to load when there are 10k+ jobs.
      Most of that time is currently spent on DOM manipulations and the costs they imply (browser repaints and reflows).
      The PR's goal is not to change any behavior but to optimize the rendering time of the History UI:
      
      1. The most costly operation is setting `innerHTML` for the `duration` column within a loop, which is [extremely unperformant](https://jsperf.com/jquery-append-vs-html-list-performance/24). [Refactoring](https://github.com/criteo-forks/spark/commit/114943b21a730092aa3249b7a905b240bd46e531) this helped to get page load time **down to 10-15s**.
      
      2. The second big gain, bringing page load time **down to 4s**, [was achieved](https://github.com/criteo-forks/spark/commit/f35fdcd5f129339fce75996e9242c88085a9b8ab) by detaching the table's DOM before parsing it with the DataTables jQuery plugin.
      
      3. Another chunk of improvements ([1](https://github.com/criteo-forks/spark/commit/332b398db7eb3052484d436919185cb0b62b2385), [2](https://github.com/criteo-forks/spark/commit/0af596a547e3a1f2b594a83cbda1f6ef559de86b), [3](https://github.com/criteo-forks/spark/commit/235f164178a09e22306f05090ee1ff5f314a6710)) focused on removing unnecessary DOM manipulations that in total contributed ~250ms to page load time.
      
      ## How was this patch tested?
      
      Tested by existing Selenium tests in `org.apache.spark.deploy.history.HistoryServerSuite`.
      
      Changes were also tested on Criteo's spark-2.1 fork with 20k+ rows in the table, reducing load time to 4s.
      
      Author: Dmitry Parfenchik <d.parfenchik@criteo.com>
      Author: Anna Savarin <a.savarin@criteo.com>
      
      Closes #18783 from 2ooom/history-ui-perf-fix-upstream-master.
  4. Aug 03, 2017
    • Fix Java SimpleApp spark application · dd72b10a
      Christiam Camacho authored
      ## What changes were proposed in this pull request?
      
      Add missing import and missing parentheses to invoke `SparkSession::text()`.
      
      ## How was this patch tested?
      
      Built and ran the code for this application, and ran jekyll locally per docs/README.md.
      
      Author: Christiam Camacho <camacho@ncbi.nlm.nih.gov>
      
      Closes #18795 from christiam/master.
    • [SPARK-20713][SPARK CORE] Convert CommitDenied to TaskKilled. · bb7afb4e
      louis lyu authored
      ## What changes were proposed in this pull request?
      
      In the executor, `toTaskFailedReason` is converted to `toTaskCommitDeniedReason` to avoid inconsistent task state. In `JobProgressListener`, a case for `TaskCommitDenied` is added so that the stage's killed count, rather than its failed count, is incremented.
      This pull request is picked up from https://github.com/apache/spark/pull/18070, using commit ff93ade0248baf3793ab55659042f9d7b8efbdef.
      The case match for `TaskCommitDenied`, incrementing the correct number of killed tasks, was added after pull/18070.
      
      ## How was this patch tested?
      
      Ran a normal speculative job and checked the Stage UI page; no failed tasks should be displayed.
      
      Author: louis lyu <llyu@c02tk24rg8wl-lm.champ.corp.yahoo.com>
      
      Closes #18819 from nlyu/SPARK-20713.
    • [SPARK-21599][SQL] Collecting column statistics for datasource tables may fail... · 13785daa
      Dilip Biswal authored
      [SPARK-21599][SQL] Collecting column statistics for datasource tables may fail with java.util.NoSuchElementException
      
      ## What changes were proposed in this pull request?
      For datasource tables (when they are stored in a non-Hive-compatible way), the schema information is recorded as table properties in the Hive metastore. The `alterTableStats` method needs to read the schema from these table properties before recording column-level statistics. Currently, we don't get the correct schema information and fail with a `java.util.NoSuchElementException`.
      
      ## How was this patch tested?
      A new test case is added in StatisticsSuite.
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #18804 from dilipbiswal/datasource_stats.
    • [SPARK-21602][R] Add map_keys and map_values functions to R · 97ba4918
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR adds `map_values` and `map_keys` to R API.
      
      ```r
      > df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))
      > tmp <- mutate(df, v = create_map(df$model, df$cyl))
      > head(select(tmp, map_keys(tmp$v)))
      ```
      ```
              map_keys(v)
      1         Mazda RX4
      2     Mazda RX4 Wag
      3        Datsun 710
      4    Hornet 4 Drive
      5 Hornet Sportabout
      6           Valiant
      ```
      ```r
      > head(select(tmp, map_values(tmp$v)))
      ```
      ```
        map_values(v)
      1             6
      2             6
      3             4
      4             6
      5             8
      6             6
      ```
      
      ## How was this patch tested?
      
      Manual tests and unit tests in `R/pkg/tests/fulltests/test_sparkSQL.R`
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18809 from HyukjinKwon/map-keys-values-r.
    • [SPARK-21605][BUILD] Let IntelliJ IDEA correctly detect Language level and Target byte code version · e7c59b41
      Chang chen authored
      With SPARK-21592, removing the source and target properties from maven-compiler-plugin lets IntelliJ IDEA fall back to its default language level and target bytecode version, which are 1.4.
      
      This change adds the source, target, and encoding properties back to fix the issue. As tested, it doesn't increase compile time.
      
      Author: Chang chen <baibaichen@gmail.com>
      
      Closes #18808 from baibaichen/feature/idea-fix.
    • [SPARK-21611][SQL] Error class name for log in several classes. · 32214706
      zuotingbing authored
      ## What changes were proposed in this pull request?
      
      The wrong class name is used for logging in several classes. For example:
      `2017-08-02 16:43:37,695 INFO CompositeService: Operation log root directory is created: /tmp/mr/operation_logs`
      The message `Operation log root directory is created ...` is actually logged from `SessionManager.java`.
      
      ## How was this patch tested?
      
      manual tests
      
      Author: zuotingbing <zuo.tingbing9@zte.com.cn>
      
      Closes #18816 from zuotingbing/SPARK-21611.
    • [SPARK-21604][SQL] if the object extends Logging, i suggest to remove the var LOG which is useless. · f13dbb3a
      zuotingbing authored
      ## What changes were proposed in this pull request?
      
      If an object already extends `Logging`, the separate `LOG` variable is redundant and should be removed.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: zuotingbing <zuo.tingbing9@zte.com.cn>
      
      Closes #18811 from zuotingbing/SPARK-21604.
    • [SPARK-21615][ML][MLLIB][DOCS] Fix broken redirect in collaborative filtering... · 7c206dd3
      Ayush Singh authored
      [SPARK-21615][ML][MLLIB][DOCS] Fix broken redirect in collaborative filtering docs to databricks training repo
      
      ## What changes were proposed in this pull request?
      * The current [MLlib Collaborative Filtering tutorial](https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html) points to broken links to the old Databricks website.
      * Databricks moved all their content to a [git repo](https://github.com/databricks/spark-training).
      * Two links need to be fixed:
        * [training exercises](https://databricks-training.s3.amazonaws.com/index.html)
        * [personalized movie recommendation with spark.mllib](https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html)
      
      ## How was this patch tested?
      Generated docs locally
      
      Author: Ayush Singh <singhay@ccs.neu.edu>
      
      Closes #18821 from singhay/SPARK-21615.
  5. Aug 02, 2017
    • [SPARK-21546][SS] dropDuplicates should ignore watermark when it's not a key · 0d26b3aa
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      When the watermark column is not one of the `dropDuplicates` keys, the query currently crashes. This PR fixes that issue.
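
      An illustrative query for this situation (source and column names are assumptions): the watermark column `eventTime` is not part of the `dropDuplicates` key.

      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().master("local[*]").getOrCreate()

      // Hypothetical streaming source with columns `id` and `eventTime`.
      val events = spark.readStream.format("rate").load()
        .withColumnRenamed("value", "id")
        .withColumnRenamed("timestamp", "eventTime")

      // The watermark column is not a dropDuplicates key; this used to crash.
      val deduped = events
        .withWatermark("eventTime", "10 seconds")
        .dropDuplicates("id")
      ```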
      
      ## How was this patch tested?
      
      The new unit test.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #18822 from zsxwing/SPARK-21546.
    • [SPARK-21490][CORE] Make sure SparkLauncher redirects needed streams. · 9456176d
      Marcelo Vanzin authored
      The code was failing to account for some cases when setting up log
      redirection. For example, if a user redirected only stdout to a file,
      the launcher code would leave stderr without redirection, which could
      lead to child processes getting stuck because stderr wasn't being
      read.
      
      So detect cases where only one of the streams is redirected, and
      redirect the other stream to the log as appropriate.
      
      For the old "launch()" API, redirection of the unconfigured stream
      only happens if the user has explicitly requested for log redirection.
      Log redirection is on by default with "startApplication()".
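
      A sketch of the affected usage (paths and class names invented): only stdout is redirected explicitly, and with this fix the launcher also takes care of stderr instead of leaving it unread.

      ```scala
      import java.io.File
      import org.apache.spark.launcher.SparkLauncher

      val handle = new SparkLauncher()
        .setAppResource("/path/to/app.jar")
        .setMainClass("com.example.MyApp")
        .redirectOutput(new File("/tmp/app-stdout.log")) // stderr now handled too
        .startApplication()
      ```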
      
      Most of the change is actually adding new unit tests to make sure the
      different cases work as expected. As part of that, I moved some tests
      that were in the core/ module to the launcher/ module instead, since
      they don't depend on spark-submit.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18696 from vanzin/SPARK-21490.
    • [SPARK-21597][SS] Fix a potential overflow issue in EventTimeStats · 7f63e85b
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR fixed a potential overflow issue in EventTimeStats.
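
      A minimal sketch of the idea (not the actual Spark source): summing epoch-millisecond event times in a `Long` can overflow, while maintaining a running average avoids the large sum entirely.

      ```scala
      final case class EventTimeStats(var max: Long, var min: Long,
                                      var avg: Double, var count: Long) {
        def add(eventTime: Long): Unit = {
          max = math.max(max, eventTime)
          min = math.min(min, eventTime)
          count += 1
          avg += (eventTime - avg) / count // incremental mean, no Long sum
        }
      }
      ```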
      
      ## How was this patch tested?
      
      The new unit tests
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #18803 from zsxwing/avg.
    • [SPARK-20601][ML] Python API for Constrained Logistic Regression · 845c039c
      zero323 authored
      ## What changes were proposed in this pull request?
      Python API for Constrained Logistic Regression based on #17922; thanks to zero323 for the original contribution.
      
      ## How was this patch tested?
      Unit tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #18759 from yanboliang/SPARK-20601.
  6. Aug 01, 2017
    • [SPARK-21578][CORE] Add JavaSparkContextSuite · 14e75758
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Due to SI-8479, [SPARK-1093](https://issues.apache.org/jira/browse/SPARK-1093) introduced redundant [SparkContext constructors](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L148-L181). However, [SI-8479](https://issues.scala-lang.org/browse/SI-8479) is already fixed in Scala 2.10.5 and Scala 2.11.1.
      
      The real reason to provide these constructors is that Java code can access `SparkContext` directly; that is Scala behavior SI-4278. So, this PR adds an explicit test suite, `JavaSparkContextSuite`, to prevent future regressions, and fixes the outdated comment too.
      
      ## How was this patch tested?
      
      Pass the Jenkins with a new test suite.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #18778 from dongjoon-hyun/SPARK-21578.
    • [CORE][MINOR] Improve the error message of checkpoint RDD verification · 4cc704b1
      gatorsmile authored
      ### What changes were proposed in this pull request?
      The original error message is pretty confusing: one cannot tell which number is the `number of partitions` and which is the `RDD ID`. This PR improves the checkpoint checking.
      
      ### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18796 from gatorsmile/improveErrMsgForCheckpoint.
    • [SPARK-12717][PYTHON] Adding thread-safe broadcast pickle registry · 77cc0d67
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      When using PySpark broadcast variables in a multi-threaded environment, `SparkContext._pickled_broadcast_vars` becomes a shared resource. A race condition can occur when broadcast variables pickled from one thread get added to the shared `_pickled_broadcast_vars` and become part of the python command from another thread. This PR introduces a thread-safe pickled registry using thread-local storage, so that when the python command is pickled (causing the broadcast variable to be pickled and added to the registry), each thread has its own view of the pickle registry from which to retrieve and clear the broadcast variables used.
      
      ## How was this patch tested?
      
      Added a unit test that causes this race condition using another thread.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #18695 from BryanCutler/pyspark-bcast-threadsafe-SPARK-12717.
    • [SPARK-21339][CORE] spark-shell --packages option does not add jars to classpath on windows · 58da1a24
      Devaraj K authored
      The --packages option jars are added to the classpath with the "file:///" scheme. On Unix this causes no problem, since the scheme contains the Unix path separator, which separates the jar name from its location in the classpath. On Windows, the jar file is not resolved from the classpath because of the scheme.
      
      Windows : file:///C:/Users/<user>/.ivy2/jars/<jar-name>.jar
      Unix : file:///home/<user>/.ivy2/jars/<jar-name>.jar
      
      With this PR, the 'file://' scheme is no longer added to the packages' jar paths.
      
      I have verified this manually in Windows and Unix environments; with the change, the jar is added to the classpath as below:
      
      Windows : C:\Users\<user>\.ivy2\jars\<jar-name>.jar
      Unix : /home/<user>/.ivy2/jars/<jar-name>.jar
      
      Author: Devaraj K <devaraj@apache.org>
      
      Closes #18708 from devaraj-kavali/SPARK-21339.
    • [SPARK-21593][DOCS] Fix 2 rendering errors on configuration page · b1d59e60
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fix 2 rendering errors on configuration doc page, due to SPARK-21243 and SPARK-15355.
      
      ## How was this patch tested?
      
      Manually built and viewed docs with jekyll
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18793 from srowen/SPARK-21593.
    • [SPARK-21592][BUILD] Skip maven-compiler-plugin main and test compilations in Maven build · 74cda94c
      Grzegorz Slowikowski authored
      `scala-maven-plugin` in `incremental` mode compiles both `Scala` and `Java` classes, so there is no need to execute `maven-compiler-plugin` goals to compile (in fact, recompile) the `Java` sources.
      
      This change reduces compilation time by over 10% on my machine.
      
      Author: Grzegorz Slowikowski <gslowikowski@gmail.com>
      
      Closes #18750 from gslowikowski/remove-redundant-compilation-from-maven.
    • [SPARK-20079][YARN] Fix client AM not allocating executors after restart. · 6735433c
      Marcelo Vanzin authored
      The main goal of this change is to avoid the situation described
      in the bug, where an AM restart in the middle of a job may cause
      no new executors to be allocated because of faulty logic in the
      reset path.
      
      The change does two things:
      
      - fixes the executor alloc manager's reset() so that it does not
        stop allocation after a reset() in the middle of a job
      - re-orders the initialization of the YarnAllocator class so that
        it fetches the current executor ID before triggering the reset()
        above.
      
      This ensures both that the new allocator gets new requests for executors,
      and that it starts from the correct executor id.
      
      Tested with unit tests and by manually causing AM restarts while
      running jobs using spark-shell in YARN mode.
      
      Closes #17882
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      Author: Guoqiang Li <witgo@qq.com>
      
      Closes #18663 from vanzin/SPARK-20079.
    • [SPARK-21522][CORE] Fix flakiness in LauncherServerSuite. · b1335018
      Marcelo Vanzin authored
      Handle the case where the server closes the socket before the full message
      has been written by the client.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18727 from vanzin/SPARK-21522.
    • [SPARK-21585] Application Master marking application status as Failed for Client Mode · 97ccc63f
      pgandhi authored
      The fix deployed for SPARK-21541 resulted in the Application Master setting the final status of a Spark application to Failed in client mode, because the 'registered' flag was not being set to true for client mode. To fix the issue, the 'registered' flag is now set to true in client mode upon successfully registering the Application Master.
      
      ## How was this patch tested?
      Tested the patch manually.
      
      Before:
      <img width="1275" alt="screen shot-before2" src="https://user-images.githubusercontent.com/22228190/28799641-02b5ed78-760f-11e7-9eb0-bf8407dad0ad.png">
      
      After:
      <img width="1221" alt="screen shot-after2" src="https://user-images.githubusercontent.com/22228190/28799646-0ac9ef14-760f-11e7-8bf5-9dfd743d0f2f.png">
      
      Author: pgandhi <pgandhi@yahoo-inc.com>
      Author: pgandhi999 <parthkgandhi9@gmail.com>
      
      Closes #18788 from pgandhi999/SPARK-21585.
    • [SPARK-21388][ML][PYSPARK] GBTs inherit from HasStepSize & LInearSVC from HasThreshold · 253a07e4
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      GBTs now inherit from `HasStepSize`, and `LinearSVC`/`Binarizer` inherit from `HasThreshold`.
      
      ## How was this patch tested?
      existing tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      Author: Ruifeng Zheng <ruifengz@foxmail.com>
      
      Closes #18612 from zhengruifeng/override_HasXXX.
    • [SPARK-21475][CORE] Use NIO's Files API to replace... · 5fd0294f
      jerryshao authored
      [SPARK-21475][CORE] Use NIO's Files API to replace FileInputStream/FileOutputStream in some critical paths
      
      ## What changes were proposed in this pull request?
      
      Java's `FileInputStream` and `FileOutputStream` override `finalize()`; even when such a stream is closed correctly and promptly, it still leaves a memory footprint that only gets cleaned up in a full GC. This introduces two side effects:
      
      1. Lots of Finalizer references are kept in memory, which increases memory overhead. In our external shuffle service use case, a busy shuffle service accumulates many of these objects, potentially leading to OOM.
      2. Finalizers are only called during a full GC, which increases the overhead of full GC and leads to long GC pauses.
      
      https://bugs.openjdk.java.net/browse/JDK-8080225
      
      https://www.cloudbees.com/blog/fileinputstream-fileoutputstream-considered-harmful
      
      So, to fix this potential issue, this PR proposes to use NIO's `Files#newInputStream`/`newOutputStream` instead in some critical paths like shuffle.
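
      An illustrative before/after (file path invented): the NIO factory methods return streams that carry no `finalize()` overhead.

      ```scala
      import java.io.{File, InputStream, OutputStream}
      import java.nio.file.Files

      val file = new File("/tmp/shuffle.data")

      // Before: new FileInputStream(file) / new FileOutputStream(file)
      // After: NIO factory methods with no finalize() burden.
      val out: OutputStream = Files.newOutputStream(file.toPath)
      out.write(42); out.close()

      val in: InputStream = Files.newInputStream(file.toPath)
      println(in.read()) // 42
      in.close()
      ```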
      
      `FileInputStream` usages left unchanged in core, which I think are not so critical:
      
      ```
      ./core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala:467:    val file = new DataInputStream(new FileInputStream(filename))
      ./core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala:942:    val in = new FileInputStream(new File(path))
      ./core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala:76:    val fileIn = new FileInputStream(file)
      ./core/src/main/scala/org/apache/spark/deploy/RPackageUtils.scala:248:        val fis = new FileInputStream(file)
      ./core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:910:                input = new FileInputStream(new File(t))
      ./core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala:20:import java.io.{FileInputStream, InputStream}
      ./core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala:132:        case Some(f) => new FileInputStream(f)
      ./core/src/main/scala/org/apache/spark/scheduler/SchedulableBuilder.scala:20:import java.io.{FileInputStream, InputStream}
      ./core/src/main/scala/org/apache/spark/scheduler/SchedulableBuilder.scala:77:        val fis = new FileInputStream(f)
      ./core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala:27:import org.apache.spark.io.NioBufferedFileInputStream
      ./core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala:94:      new DataInputStream(new NioBufferedFileInputStream(index))
      ./core/src/main/scala/org/apache/spark/storage/DiskStore.scala:111:        val channel = new FileInputStream(file).getChannel()
      ./core/src/main/scala/org/apache/spark/storage/DiskStore.scala:219:    val channel = new FileInputStream(file).getChannel()
      ./core/src/main/scala/org/apache/spark/TestUtils.scala:20:import java.io.{ByteArrayInputStream, File, FileInputStream, FileOutputStream}
      ./core/src/main/scala/org/apache/spark/TestUtils.scala:106:      val in = new FileInputStream(file)
      ./core/src/main/scala/org/apache/spark/util/logging/RollingFileAppender.scala:89:        inputStream = new FileInputStream(activeFile)
      ./core/src/main/scala/org/apache/spark/util/Utils.scala:329:      if (in.isInstanceOf[FileInputStream] && out.isInstanceOf[FileOutputStream]
      ./core/src/main/scala/org/apache/spark/util/Utils.scala:332:        val inChannel = in.asInstanceOf[FileInputStream].getChannel()
      ./core/src/main/scala/org/apache/spark/util/Utils.scala:1533:      gzInputStream = new GZIPInputStream(new FileInputStream(file))
      ./core/src/main/scala/org/apache/spark/util/Utils.scala:1560:      new GZIPInputStream(new FileInputStream(file))
      ./core/src/main/scala/org/apache/spark/util/Utils.scala:1562:      new FileInputStream(file)
      ./core/src/main/scala/org/apache/spark/util/Utils.scala:2090:    val inReader = new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8)
      ```
      
      `FileOutputStream` usages left unchanged in core:
      
      ```
      ./core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala:957:    val out = new FileOutputStream(file)
      ./core/src/main/scala/org/apache/spark/api/r/RBackend.scala:20:import java.io.{DataOutputStream, File, FileOutputStream, IOException}
      ./core/src/main/scala/org/apache/spark/api/r/RBackend.scala:131:      val dos = new DataOutputStream(new FileOutputStream(f))
      ./core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala:62:    val fileOut = new FileOutputStream(file)
      ./core/src/main/scala/org/apache/spark/deploy/RPackageUtils.scala:160:          val outStream = new FileOutputStream(outPath)
      ./core/src/main/scala/org/apache/spark/deploy/RPackageUtils.scala:239:    val zipOutputStream = new ZipOutputStream(new FileOutputStream(zipFile, false))
      ./core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:949:        val out = new FileOutputStream(tempFile)
      ./core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala:20:import java.io.{File, FileOutputStream, InputStream, IOException}
      ./core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala:106:    val out = new FileOutputStream(file, true)
      ./core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala:109:     * Therefore, for local files, use FileOutputStream instead. */
      ./core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala:112:        new FileOutputStream(uri.getPath)
      ./core/src/main/scala/org/apache/spark/storage/DiskBlockObjectWriter.scala:20:import java.io.{BufferedOutputStream, File, FileOutputStream, OutputStream}
      ./core/src/main/scala/org/apache/spark/storage/DiskBlockObjectWriter.scala:71:  private var fos: FileOutputStream = null
      ./core/src/main/scala/org/apache/spark/storage/DiskBlockObjectWriter.scala:102:    fos = new FileOutputStream(file, true)
      ./core/src/main/scala/org/apache/spark/storage/DiskBlockObjectWriter.scala:213:      var truncateStream: FileOutputStream = null
      ./core/src/main/scala/org/apache/spark/storage/DiskBlockObjectWriter.scala:215:        truncateStream = new FileOutputStream(file, true)
      ./core/src/main/scala/org/apache/spark/storage/DiskStore.scala:153:    val out = new FileOutputStream(file).getChannel()
      ./core/src/main/scala/org/apache/spark/TestUtils.scala:20:import java.io.{ByteArrayInputStream, File, FileInputStream, FileOutputStream}
      ./core/src/main/scala/org/apache/spark/TestUtils.scala:81:    val jarStream = new JarOutputStream(new FileOutputStream(jarFile))
      ./core/src/main/scala/org/apache/spark/TestUtils.scala:96:    val jarFileStream = new FileOutputStream(jarFile)
      ./core/src/main/scala/org/apache/spark/util/logging/FileAppender.scala:20:import java.io.{File, FileOutputStream, InputStream, IOException}
      ./core/src/main/scala/org/apache/spark/util/logging/FileAppender.scala:31:  volatile private var outputStream: FileOutputStream = null
      ./core/src/main/scala/org/apache/spark/util/logging/FileAppender.scala:97:    outputStream = new FileOutputStream(file, true)
      ./core/src/main/scala/org/apache/spark/util/logging/RollingFileAppender.scala:90:        gzOutputStream = new GZIPOutputStream(new FileOutputStream(gzFile))
      ./core/src/main/scala/org/apache/spark/util/Utils.scala:329:      if (in.isInstanceOf[FileInputStream] && out.isInstanceOf[FileOutputStream]
      ./core/src/main/scala/org/apache/spark/util/Utils.scala:333:        val outChannel = out.asInstanceOf[FileOutputStream].getChannel()
      ./core/src/main/scala/org/apache/spark/util/Utils.scala:527:      val out = new FileOutputStream(tempFile)
      ```
      
      In `DiskBlockObjectWriter`, a `FileDescriptor` is used, so it is not easy to change to the NIO Files API.
      
      All the `FileInputStream` and `FileOutputStream` usages in common/shuffle* were changed.
      
      ## How was this patch tested?
      
      Existing tests and manual verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #18684 from jerryshao/SPARK-21475.