  1. Aug 25, 2017
    • [SPARK-17742][CORE] Fail launcher app handle if child process exits with error. · 628bdeab
      Marcelo Vanzin authored
      This is a follow-up to cba826d0; that commit set the app handle state
      to "LOST" when the child process exited, but that can be ambiguous. This
      change sets the state to "FAILED" if the exit code was non-zero and the
      handle state wasn't already a failure state, or to "LOST" if the exit
      status was zero.
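
      A minimal sketch of that transition logic, with an illustrative `State` type rather than the actual launcher API:

      ```scala
      // Illustrative only: a simplified stand-in for the launcher's handle states.
      sealed trait State { def isFailure: Boolean; def isFinal: Boolean }
      case object Running extends State { val isFailure = false; val isFinal = false }
      case object Failed  extends State { val isFailure = true;  val isFinal = true  }
      case object Lost    extends State { val isFailure = false; val isFinal = true  }

      // Pick the handle state to report when the child process exits.
      def stateOnChildExit(exitCode: Int, current: State): State =
        if (exitCode != 0 && !current.isFailure) Failed   // non-zero exit => FAILED
        else if (exitCode == 0 && !current.isFinal) Lost  // clean exit, but the handle never reached a final state
        else current                                      // otherwise keep the current state
      ```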
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #19012 from vanzin/SPARK-17742.
    • [SPARK-21714][CORE][YARN] Avoiding re-uploading remote resources in yarn client mode · 1813c4a8
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      With SPARK-10643, Spark supports downloading resources from remote storage in client deploy mode. But the implementation overwrites the variables that represent the added resources (like `args.jars`, `args.pyFiles`) with local paths, and the YARN client then uses those local paths to re-upload the resources to the distributed cache. This is unnecessary and breaks the semantics of putting resources in a shared FS, so this PR fixes it.
      
      ## How was this patch tested?
      
      This was manually verified with jars and pyFiles in both local and remote storage, in both client and cluster mode.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #18962 from jerryshao/SPARK-21714.
    • [SPARK-21832][TEST] Merge SQLBuilderTest into ExpressionSQLBuilderSuite · 1f24ceee
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      After [SPARK-19025](https://github.com/apache/spark/pull/16869), there is no need to keep SQLBuilderTest;
      ExpressionSQLBuilderSuite is the only place that uses it.
      This PR removes SQLBuilderTest.
      
      ## How was this patch tested?
      
      Pass the updated `ExpressionSQLBuilderSuite`.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #19044 from dongjoon-hyun/SPARK-21832.
    • [MINOR][BUILD] Fix build warnings and Java lint errors · de7af295
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fix build warnings and Java lint errors. This just helps a bit in evaluating (new) warnings in another PR I have open.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19051 from srowen/JavaWarnings.
    • [SPARK-21527][CORE] Use buffer limit in order to use JAVA NIO Util's buffercache · 574ef6c9
      zhoukang authored
      ## What changes were proposed in this pull request?
      
      Right now, `ChunkedByteBuffer#writeFully` does not slice the bytes first. Consider the code in Java NIO's `Util#getTemporaryDirectBuffer` below:
      
              BufferCache cache = bufferCache.get();
              ByteBuffer buf = cache.get(size);
              if (buf != null) {
                  return buf;
              } else {
                  // No suitable buffer in the cache so we need to allocate a new
                  // one. To avoid the cache growing then we remove the first
                  // buffer from the cache and free it.
                  if (!cache.isEmpty()) {
                      buf = cache.removeFirst();
                      free(buf);
                  }
                  return ByteBuffer.allocateDirect(size);
              }
      
      If we slice the bytes into fixed-size chunks first, we can reuse the buffer cache and only need to allocate on the first write call.
      Since we otherwise allocate a new buffer, we cannot control when that buffer is freed; this once caused a memory issue in our production cluster.
      In this patch, I supply a new API that slices the bytes into fixed-size chunks for buffer writing.
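
      A sketch of the slicing idea (the chunk-size parameter name here is illustrative, not necessarily the one used in the patch):

      ```scala
      import java.nio.ByteBuffer
      import java.nio.channels.WritableByteChannel

      // Write each chunk in bounded slices so NIO's temporary direct-buffer cache can be
      // reused, instead of forcing one huge direct-buffer allocation per write.
      def writeFully(
          channel: WritableByteChannel,
          chunks: Array[ByteBuffer],
          bufferWriteChunkSize: Int = 64 * 1024): Unit = {
        for (bytes <- chunks) {
          val originalLimit = bytes.limit()
          while (bytes.hasRemaining()) {
            val ioSize = math.min(bytes.remaining(), bufferWriteChunkSize)
            bytes.limit(bytes.position() + ioSize)  // expose only a fixed-size window
            channel.write(bytes)                    // position advances by the amount written
            bytes.limit(originalLimit)              // restore so the next slice is visible
          }
        }
      }
      ```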
      
      ## How was this patch tested?
      
      Unit test and test in production.
      
      Author: zhoukang <zhoukang199191@gmail.com>
      Author: zhoukang <zhoukang@xiaomi.com>
      
      Closes #18730 from caneGuy/zhoukang/improve-chunkwrite.
    • [SPARK-21255][SQL][WIP] Fixed NPE when creating encoder for enum · 7d16776d
      mike authored
      ## What changes were proposed in this pull request?
      
      Fixed NPE when creating encoder for enum.
      
      When you try to create an encoder for an enum type (or a bean with an enum property) via Encoders.bean(...), it fails with a NullPointerException at TypeToken:495.
      I did a little research and it turns out that in JavaTypeInference the following code
      ```
        def getJavaBeanReadableProperties(beanClass: Class[_]): Array[PropertyDescriptor] = {
          val beanInfo = Introspector.getBeanInfo(beanClass)
          beanInfo.getPropertyDescriptors.filterNot(_.getName == "class")
            .filter(_.getReadMethod != null)
        }
      ```
      filters out properties named "class", because we wouldn't want to serialize that. But enum types have another property of type Class named "declaringClass", which we try to inspect recursively. Eventually we try to inspect the ClassLoader class, which has a property "defaultAssertionStatus" with no read method, which leads to the NPE at TypeToken:495.
      
      I added the property name "declaringClass" to the filter to resolve this.
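
      A sketch of what the extended filter could look like (assumed shape, mirroring the snippet quoted above):

      ```scala
      import java.beans.{Introspector, PropertyDescriptor}

      def getJavaBeanReadableProperties(beanClass: Class[_]): Array[PropertyDescriptor] = {
        val beanInfo = Introspector.getBeanInfo(beanClass)
        beanInfo.getPropertyDescriptors
          .filterNot(_.getName == "class")
          .filterNot(_.getName == "declaringClass")  // enums expose this Class-typed property
          .filter(_.getReadMethod != null)
      }
      ```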
      
      ## How was this patch tested?
      Unit test in JavaDatasetSuite which creates an encoder for enum
      
      Author: mike <mike0sv@gmail.com>
      Author: Mikhail Sveshnikov <mike0sv@gmail.com>
      
      Closes #18488 from mike0sv/enum-support.
  2. Aug 24, 2017
    • [SPARK-21108][ML] convert LinearSVC to aggregator framework · f3676d63
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      Convert LinearSVC to the new aggregator framework.
      
      ## How was this patch tested?
      
      existing unit test.
      
      Author: Yuhao Yang <yuhao.yang@intel.com>
      
      Closes #18315 from hhbyyh/svcAggregator.
    • [SPARK-21830][SQL] Bump ANTLR version and fix a few issues. · 05af2de0
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      This PR bumps the ANTLR version to 4.7, and fixes a number of small parser related issues uncovered by the bump.
      
      The main reason for upgrading is that in some cases the current version of ANTLR (4.5) can exhibit exponential slowdowns if it needs to parse boolean predicates. For example the following query will take forever to parse:
      ```sql
      SELECT *
      FROM RANGE(1000)
      WHERE
      TRUE
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      AND NOT upper(DESCRIPTION) LIKE '%FOO%'
      ```
      
      This is caused by a known bug in ANTLR (https://github.com/antlr/antlr4/issues/994), which was fixed in version 4.6.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #19042 from hvanhovell/SPARK-21830.
    • [SPARK-21701][CORE] Enable RPC client to use `SO_RCVBUF` and `SO_SNDBUF` in SparkConf. · 763b83ee
      xu.zhang authored
      ## What changes were proposed in this pull request?
      
      TCP parameters like SO_RCVBUF and SO_SNDBUF can be set in SparkConf, and `org.apache.spark.network.server.TransportServer` can use those parameters to build the server by leveraging Netty. But for `TransportClientFactory`, there is no way to set those parameters from SparkConf. This can make the server and client sides inconsistent when people set these parameters in SparkConf. So this PR enables the RPC client to use those TCP parameters as well.
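
      For context, this is roughly how such socket options are applied to a Netty client bootstrap (a sketch only; the actual SparkConf key names and plumbing are not shown here):

      ```scala
      import io.netty.bootstrap.Bootstrap
      import io.netty.channel.ChannelOption

      // Apply receive/send buffer sizes to the client bootstrap only when configured (> 0).
      def applySocketBufferConf(bootstrap: Bootstrap, receiveBuf: Int, sendBuf: Int): Bootstrap = {
        if (receiveBuf > 0) {
          bootstrap.option[Integer](ChannelOption.SO_RCVBUF, receiveBuf)
        }
        if (sendBuf > 0) {
          bootstrap.option[Integer](ChannelOption.SO_SNDBUF, sendBuf)
        }
        bootstrap
      }
      ```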
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: xu.zhang <xu.zhang@hulu.com>
      
      Closes #18964 from neoremind/add_client_param.
    • [SPARK-21788][SS] Handle more exceptions when stopping a streaming query · d3abb369
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Add more cases we should view as a normal query stop rather than a failure.
      
      ## How was this patch tested?
      
      The new unit tests.
      
      Author: Shixiong Zhu <zsxwing@gmail.com>
      
      Closes #18997 from zsxwing/SPARK-21788.
    • [SPARK-21826][SQL] outer broadcast hash join should not throw NPE · 2dd37d82
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This is a bug introduced by https://github.com/apache/spark/pull/11274/files#diff-7adb688cbfa583b5711801f196a074bbL274 .
      
      Non-equal join condition should only be applied when the equal-join condition matches.
      
      ## How was this patch tested?
      
      regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #19036 from cloud-fan/bug.
    • [SPARK-21759][SQL] In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery · 183d4cb7
      Liang-Chi Hsieh authored
      
      ## What changes were proposed in this pull request?
      
      With the check for structural integrity proposed in SPARK-21726, it is found that the optimization rule `PullupCorrelatedPredicates` can produce unresolved plans.
      
      For a correlated IN query that looks like:
      
          SELECT t1.a FROM t1
          WHERE
          t1.a IN (SELECT t2.c
                  FROM t2
                  WHERE t1.b < t2.d);
      
      The query plan might look like:
      
          Project [a#0]
          +- Filter a#0 IN (list#4 [b#1])
             :  +- Project [c#2]
             :     +- Filter (outer(b#1) < d#3)
             :        +- LocalRelation <empty>, [c#2, d#3]
             +- LocalRelation <empty>, [a#0, b#1]
      
      After `PullupCorrelatedPredicates`, it produces query plan like:
      
          'Project [a#0]
          +- 'Filter a#0 IN (list#4 [(b#1 < d#3)])
             :  +- Project [c#2, d#3]
             :     +- LocalRelation <empty>, [c#2, d#3]
             +- LocalRelation <empty>, [a#0, b#1]
      
      Because the correlated predicate involves another attribute `d#3` in subquery, it has been pulled out and added into the `Project` on the top of the subquery.
      
      When `list` in `In` contains just one `ListQuery`, `In.checkInputDataTypes` checks whether the number of `value` expressions matches the output size of the subquery. In the above example, there is only one `value` expression while the subquery output has two attributes, `c#2` and `d#3`, so it fails the check and `In.resolved` returns `false`.
      
      We should not let `In.checkInputDataTypes` wrongly report unresolved plans to fail the structural integrity check.
      
      ## How was this patch tested?
      
      Added test.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18968 from viirya/SPARK-21759.
    • [SPARK-21745][SQL] Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce WritableColumnVector. · 9e33954d
      Takuya UESHIN authored
      
      ## What changes were proposed in this pull request?
      
      This is a refactoring of `ColumnVector` hierarchy and related classes.
      
      1. make `ColumnVector` read-only
      2. introduce `WritableColumnVector` with write interface
      3. remove `ReadOnlyColumnVector`
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #18958 from ueshin/issues/SPARK-21745.
    • [SPARK-19165][PYTHON][SQL] PySpark APIs using columns as arguments should validate input types for column · dc5d34d8
      hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
      While preparing to take over https://github.com/apache/spark/pull/16537, I realised a (I think) better approach: putting the exception handling in one place.
      
      This PR proposes to fix `_to_java_column` in `pyspark.sql.column`, which most of the functions in `functions.py` and some other APIs use. This `_to_java_column` basically does not work with types other than `pyspark.sql.column.Column` or strings (`str` and `unicode`).
      
      If this is not `Column`, then it calls `_create_column_from_name` which calls `functions.col` within JVM:
      
      https://github.com/apache/spark/blob/42b9eda80e975d970c3e8da4047b318b83dd269f/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L76
      
      And it looks like we only have a `String` variant of `col`.
      
      So, these should work:
      
      ```python
      >>> from pyspark.sql.column import _to_java_column, Column
      >>> _to_java_column("a")
      JavaObject id=o28
      >>> _to_java_column(u"a")
      JavaObject id=o29
      >>> _to_java_column(spark.range(1).id)
      JavaObject id=o33
      ```
      
      whereas these do not:
      
      ```python
      >>> _to_java_column(1)
      ```
      ```
      ...
      py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
      py4j.Py4JException: Method col([class java.lang.Integer]) does not exist
          ...
      ```
      
      ```python
      >>> _to_java_column([])
      ```
      ```
      ...
      py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
      py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist
          ...
      ```
      
      ```python
      >>> class A(): pass
      >>> _to_java_column(A())
      ```
      ```
      ...
      AttributeError: 'A' object has no attribute '_get_object_id'
      ```
      
      This means that most functions using `_to_java_column`, such as `udf` or `to_json`, and some other APIs, throw an exception as below:
      
      ```python
      >>> from pyspark.sql.functions import udf
      >>> udf(lambda x: x)(None)
      ```
      
      ```
      ...
      py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.col.
      : java.lang.NullPointerException
          ...
      ```
      
      ```python
      >>> from pyspark.sql.functions import to_json
      >>> to_json(None)
      ```
      
      ```
      ...
      py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.col.
      : java.lang.NullPointerException
          ...
      ```
      
      **After this PR**:
      
      ```python
      >>> from pyspark.sql.functions import udf
      >>> udf(lambda x: x)(None)
      ...
      ```
      
      ```
      TypeError: Invalid argument, not a string or column: None of type <type 'NoneType'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' functions.
      ```
      
      ```python
      >>> from pyspark.sql.functions import to_json
      >>> to_json(None)
      ```
      
      ```
      ...
      TypeError: Invalid argument, not a string or column: None of type <type 'NoneType'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' functions.
      ```
      
      ## How was this patch tested?
      
      Unit tests added in `python/pyspark/sql/tests.py` and manual tests.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #19027 from HyukjinKwon/SPARK-19165.
    • [SPARK-21804][SQL] json_tuple returns null values within repeated columns except the first one · 95713eb4
      Jen-Ming Chung authored
      ## What changes were proposed in this pull request?
      
      When `json_tuple` extracts values from JSON, it returns null for repeated columns except the first one, as below:
      
      ``` scala
      scala> spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', 'b', 'a')""").show()
      +---+---+----+
      | c0| c1|  c2|
      +---+---+----+
      |  1|  2|null|
      +---+---+----+
      ```
      
      I think this should be consistent with Hive's implementation:
      ```
      hive> SELECT json_tuple('{"a": 1, "b": 2}', 'a', 'a');
      ...
      1    1
      ```
      
      In this PR, we locate all the matched indices in `fieldNames` instead of returning only the first matched index (i.e., `indexOf`).
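
      A small sketch of the idea, showing how every output slot of a repeated field can be located (illustrative code, not the actual implementation):

      ```scala
      // The requested output fields, in order; "a" is requested twice (c0 and c2).
      val fieldNames = Seq("a", "b", "a")

      // Map each field name to *all* of its output positions rather than only the first.
      val matchedIndices: Map[String, Seq[Int]] =
        fieldNames.zipWithIndex
          .groupBy { case (name, _) => name }
          .map { case (name, pairs) => name -> pairs.map(_._2) }

      // matchedIndices("a") == Seq(0, 2): the value parsed for "a" fills both c0 and c2.
      ```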
      
      ## How was this patch tested?
      
      Added test in JsonExpressionsSuite.
      
      Author: Jen-Ming Chung <jenmingisme@gmail.com>
      
      Closes #19017 from jmchung/SPARK-21804.
    • [MINOR][SQL] Fix a typo and context error in the comment of class ExchangeCoordinator · 846bc61c
      lufei authored
      ## What changes were proposed in this pull request?
      
      The example given in the comment of class ExchangeCoordinator has four post-shuffle partitions, but the current comment says “three”.
      
      ## How was this patch tested?
      
      Author: lufei <lu.fei80@zte.com.cn>
      
      Closes #19028 from figo77/SPARK-21816.
    • [SPARK-21694][MESOS] Support Mesos CNI network labels · ce0d3bb3
      Susan X. Huynh authored
      JIRA ticket: https://issues.apache.org/jira/browse/SPARK-21694
      
      ## What changes were proposed in this pull request?
      
      Spark already supports launching containers attached to a given CNI network by specifying it via the config `spark.mesos.network.name`.
      
      This PR adds support to pass in network labels to CNI plugins via a new config option `spark.mesos.network.labels`. These network labels are key-value pairs that are set in the `NetworkInfo` of both the driver and executor tasks. More details in the related Mesos documentation:  http://mesos.apache.org/documentation/latest/cni/#mesos-meta-data-to-cni-plugins
      
      ## How was this patch tested?
      
      Unit tests, for both driver and executor tasks.
      Manual integration test to submit a job with the `spark.mesos.network.labels` option, hit the mesos/state.json endpoint, and check that the labels are set in the driver and executor tasks.
      
      ArtRand skonto
      
      Author: Susan X. Huynh <xhuynh@mesosphere.com>
      
      Closes #18910 from susanxhuynh/sh-mesos-cni-labels.
  3. Aug 23, 2017
    • [SPARK-21805][SPARKR] Disable R vignettes code on Windows · 43cbfad9
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Code in the vignettes requires winutils to run on Windows. When publishing to CRAN or building from source, winutils might not be available, so it's better to disable running the code (the resulting vignettes will not have output from the code, but the text and the code are still there).
      
      This fixes "* checking re-building of vignette outputs ... WARNING"
      and
      > %LOCALAPPDATA% not found. Please define the environment variable or restart and enter an installation path in localDir.
      
      ## How was this patch tested?
      
      jenkins, appveyor, r-hub
      
      before: https://artifacts.r-hub.io/SparkR_2.2.0.tar.gz-49cecef3bb09db1db130db31604e0293/SparkR.Rcheck/00check.log
      after: https://artifacts.r-hub.io/SparkR_2.2.0.tar.gz-86a066c7576f46794930ad114e5cff7c/SparkR.Rcheck/00check.log
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #19016 from felixcheung/rvigwind.
    • [SPARK-21807][SQL] Override ++ operation in ExpressionSet to reduce clone time · b8aaef49
      10129659 authored
      ## What changes were proposed in this pull request?
      The `getAliasedConstraints` function in LogicalPlan.scala clones the expression set each time an element is added,
      which can take a long time. This PR adds a function that adds multiple elements at once to reduce the cloning time.
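
      A minimal sketch of the idea (illustrative types, not the actual `ExpressionSet` code): `+` clones the backing collections once per element, so adding n elements clones n times, while an overridden `++` clones once and appends every element into that single copy.

      ```scala
      import scala.collection.mutable

      class ExprSet(
          private val baseSet: mutable.HashSet[String] = new mutable.HashSet[String],
          private val originals: mutable.ArrayBuffer[String] = new mutable.ArrayBuffer[String]) {

        private def add(e: String): Unit = {
          if (!baseSet.contains(e)) {
            baseSet += e
            originals += e
          }
        }

        // Adding elements one by one clones the backing collections once per element.
        def +(e: String): ExprSet = {
          val copy = new ExprSet(baseSet.clone(), originals.clone())
          copy.add(e)
          copy
        }

        // Adding a whole batch clones the backing collections only once.
        def ++(es: Iterable[String]): ExprSet = {
          val copy = new ExprSet(baseSet.clone(), originals.clone())
          es.foreach(copy.add)
          copy
        }
      }
      ```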
      
      Before the change, the cost of getAliasedConstraints is:
      100 expressions:  41 seconds
      150 expressions:  466 seconds
      
      After the change, the cost of getAliasedConstraints is:
      100 expressions:  1.8 seconds
      150 expressions:  6.5 seconds
      
      The test is like this:
      
          test("getAliasedConstraints") {
            val expressionNum = 150
            val aggExpression = (1 to expressionNum).map(i => Alias(Count(Literal(1)), s"cnt$i")())
            val aggPlan = Aggregate(Nil, aggExpression, LocalRelation())
      
            val beginTime = System.currentTimeMillis()
            val expressions = aggPlan.validConstraints
            println(s"validConstraints cost: ${System.currentTimeMillis() - beginTime}ms")
            // The size of the aliased expression set is n * (n - 1) / 2 + n
            assert(expressions.size === expressionNum * (expressionNum - 1) / 2 + expressionNum)
          }
      
      
      ## How was this patch tested?
      
      
      Run the newly added test.
      
      
      Author: 10129659 <chen.yanshan@zte.com.cn>
      
      Closes #19022 from eatoncys/getAliasedConstraints.
    • [SPARK-21603][SQL][FOLLOW-UP] Change the default value of maxLinesPerFunction into 4000 · 6942aeeb
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR changes the default value of `maxLinesPerFunction` to `4000`. In #18810, we added this new option to disable code generation for overly long functions, and I found that the option only affected `Q17` and `Q66` in TPC-DS. But `Q66` had a performance regression:
      
      ```
      Q17 w/o #18810, 3224ms --> q17 w/#18810, 2627ms (improvement)
      Q66 w/o #18810, 1712ms --> q66 w/#18810, 3032ms (regression)
      ```
      
      To keep the previous performance in TPC-DS, we had better set a higher default value for `maxLinesPerFunction`.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #19021 from maropu/SPARK-21603-FOLLOWUP-1.
    • [SPARK-21501] Change CacheLoader to limit entries based on memory footprint · 1662e931
      Sanket Chintapalli authored
      Right now the Spark shuffle service has a cache for index files. It is based on the number of files cached (spark.shuffle.service.index.cache.entries). This can cause issues if people have a lot of reducers, because the size of each entry can fluctuate based on the number of reducers.
      We saw an issue with a job that had 170000 reducers: it caused the NM with the Spark shuffle service to use 700-800MB of memory in the NM by itself.
      We should change this cache to be memory based and only allow a certain amount of memory to be used; by memory based I mean the cache should have a limit of, say, 100MB.
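
      A minimal sketch of a memory-bounded Guava cache using a weigher (the entry type and sizing below are assumptions for illustration, not the shuffle service's actual classes):

      ```scala
      import java.nio.file.{Files, Paths}
      import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache, Weigher}

      // Hypothetical value type holding the bytes of one loaded shuffle index file.
      case class IndexInfo(data: Array[Byte])

      val weigher = new Weigher[String, IndexInfo] {
        override def weigh(indexFilePath: String, info: IndexInfo): Int = info.data.length
      }

      val loader = new CacheLoader[String, IndexInfo] {
        override def load(indexFilePath: String): IndexInfo =
          IndexInfo(Files.readAllBytes(Paths.get(indexFilePath)))
      }

      // Bound the cache by total weight (~100MB of index data) instead of by entry count.
      val indexCache: LoadingCache[String, IndexInfo] =
        CacheBuilder.newBuilder()
          .maximumWeight(100L * 1024 * 1024)
          .weigher[String, IndexInfo](weigher)
          .build[String, IndexInfo](loader)
      ```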
      
      https://issues.apache.org/jira/browse/SPARK-21501
      
      Manual testing with 170000 reducers has been performed with the cache loaded up to the default 100MB limit, with each shuffle index file about 1.3MB in size. Eviction takes place as soon as the total cache size reaches the 100MB limit, and the evicted objects become eligible for garbage collection, thereby preventing the NM from crashing. No notable difference in runtime has been observed.
      
      Author: Sanket Chintapalli <schintap@yahoo-inc.com>
      
      Closes #18940 from redsanket/SPARK-21501.
  4. Aug 22, 2017
    • [SPARK-12664][ML] Expose probability in mlp model · d6b30edd
      Weichen Xu authored
      ## What changes were proposed in this pull request?
      
      Modify the MLP model to inherit from `ProbabilisticClassificationModel` so that it can expose the probability column when transforming data.
      
      ## How was this patch tested?
      
      Test added.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #17373 from WeichenXu123/expose_probability_in_mlp_model.
    • [SPARK-19326] Speculated task attempts do not get launched in few scenarios · d58a3507
      Jane Wang authored
      ## What changes were proposed in this pull request?
      
      Add a new listener event when a speculative task is created, and notify the ExecutorAllocationManager so it can request more executors.
      
      ## How was this patch tested?
      
      - Added unit tests.
      - For the test snippet in the JIRA:
      
            val n = 100
            val someRDD = sc.parallelize(1 to n, n)
            someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
              if (index == 1) {
                Thread.sleep(Long.MaxValue) // fake long running task(s)
              }
              it.toList.map(x => index + ", " + x).iterator
            }).collect
      
      With this code change, Spark indicates 101 jobs are running (99 succeeded, 2 running, and 1 is a speculative job).
      
      Author: Jane Wang <janewang@fb.com>
      
      Closes #18492 from janewangfb/speculated_task_not_launched.
    • [ML][MINOR] Make sharedParams update. · 34296190
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      `sharedParams.scala` is generated by `SharedParamsCodeGen`, but it is out of date in master. Maybe someone manually updated `sharedParams.scala`; this PR fixes that.
      
      ## How was this patch tested?
      Offline check.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #19011 from yanboliang/sharedParams.
    • [SPARK-21765] Set isStreaming on leaf nodes for streaming plans. · 3c0c2d09
      Jose Torres authored
      ## What changes were proposed in this pull request?
      All streaming logical plans will now have isStreaming set. This involved adding isStreaming as a case class arg in a few cases, since a node might be logically streaming depending on where it came from.
      
      ## How was this patch tested?
      
      Existing unit tests - no functional change is intended in this PR.
      
      Author: Jose Torres <joseph-torres@databricks.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #18973 from joseph-torres/SPARK-21765.
    • [SPARK-10931][ML][PYSPARK] PySpark Models Copy Param Values from Estimator · 41bb1ddc
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      Added a call to copy the values of Params from the Estimator to the Model after fit in PySpark ML. This copies values for any params that are also defined in the Model. Since currently most Models do not define the same params as the Estimator, this also adds a method to create new Params by looking at the Java object if they do not exist in the Python object. This is a temporary fix that can be removed once the PySpark models properly define the params themselves.
      
      ## How was this patch tested?
      
      Refactored the `check_params` test to optionally check if the model params for Python and Java match and added this check to an existing fitted model that shares params between Estimator and Model.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #17849 from BryanCutler/pyspark-models-own-params-SPARK-10931.
    • [SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd contains zero · d56c2621
      Weichen Xu authored
      ## What changes were proposed in this pull request?
      
      Fix a bug where MLOR does not work correctly when featureStd contains zero.
      
      We can reproduce the bug with a dataset like the one below (a feature with zero variance), which generates a wrong result (all coefficients become 0):
      ```
          val multinomialDatasetWithZeroVar = {
            val nPoints = 100
            val coefficients = Array(
              -0.57997, 0.912083, -0.371077,
              -0.16624, -0.84355, -0.048509)
      
            val xMean = Array(5.843, 3.0)
            val xVariance = Array(0.6856, 0.0)  // including zero variance
      
            val testData = generateMultinomialLogisticInput(
              coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
      
            val df = sc.parallelize(testData, 4).toDF().withColumn("weight", lit(1.0))
            df.cache()
            df
          }
      ```
      ## How was this patch tested?
      
      testcase added.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #18896 from WeichenXu123/fix_mlor_stdvalue_zero_bug.
    • [SPARK-21769][SQL] Add a table-specific option for always respecting schemas inferred/controlled by Spark SQL · 01a8e462
      gatorsmile authored
      
      ## What changes were proposed in this pull request?
      For Hive-serde tables, we always respect the schema stored in Hive metastore, because the schema could be altered by the other engines that share the same metastore. Thus, we always trust the metastore-controlled schema for Hive-serde tables when the schemas are different (without considering the nullability and cases). However, in some scenarios, Hive metastore also could INCORRECTLY overwrite the schemas when the serde and Hive metastore built-in serde are different.
      
      The proposed solution is to introduce a table-specific option for such scenarios. For a specific table, users can make Spark always respect Spark-inferred/controlled schema instead of trusting metastore-controlled schema. By default, we trust Hive metastore-controlled schema.
      
      ## How was this patch tested?
      Added a cross-version test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19003 from gatorsmile/respectSparkSchema.
    • [SPARK-21499][SQL] Support creating persistent function for Spark UDAF(UserDefinedAggregateFunction) · 43d71d96
      gatorsmile authored
      
      ## What changes were proposed in this pull request?
      This PR is to enable users to create persistent Scala UDAF (that extends UserDefinedAggregateFunction).
      
      ```SQL
      CREATE FUNCTION myDoubleAvg AS 'test.org.apache.spark.sql.MyDoubleAvg'
      ```
      
      Before this PR, a Spark UDAF could only be registered through the API `spark.udf.register(...)`.
      
      ## How was this patch tested?
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18700 from gatorsmile/javaUDFinScala.
    • [SPARK-20641][CORE] Add missing kvstore module in Launcher and SparkSubmit code · 3ed1ae10
      jerryshao authored
      There are two places in Launcher and SparkSubmit that explicitly list all the Spark submodules; the newly added kvstore module is missing from both, so this minor PR fixes that.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #19014 from jerryshao/missing-kvstore.
    • [SPARK-21803][TEST] Remove the HiveDDLCommandSuite · be72b157
      gatorsmile authored
      ## What changes were proposed in this pull request?
      We do not have any Hive-specific parser. It does not make sense to keep a parser-specific test suite `HiveDDLCommandSuite.scala` in the Hive package. This PR is to remove it.
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #19015 from gatorsmile/combineDDL.
    • [SPARK-21584][SQL][SPARKR] Update R method for summary to call new implementation · 5c9b3017
      Andrew Ray authored
      ## What changes were proposed in this pull request?
      
      SPARK-21100 introduced a new `summary` method to the Scala/Java Dataset API that included expanded statistics (vs `describe`) and control over which statistics to compute. Currently, in the R API, `summary` acts as an alias for `describe`. This patch updates the R API to call the new `summary` method in the JVM, which includes additional statistics and the ability to select which ones to compute.
      
      This does not break the current interface, as the present `summary` method does not take additional arguments like `describe` does, and the output was never meant to be used programmatically.
      
      ## How was this patch tested?
      
      Modified and additional unit tests.
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #18786 from aray/summary-r.
  5. Aug 21, 2017
    • [SPARK-21070][PYSPARK] Attempt to update cloudpickle again · 751f5133
      Kyle Kelley authored
      ## What changes were proposed in this pull request?
      
      Based on https://github.com/apache/spark/pull/18282 by rgbkrk, this PR attempts to update to the currently released cloudpickle and minimize the difference between Spark's cloudpickle and "stock" cloudpickle, with the goal of eventually using the stock cloudpickle.
      
      Some notable changes:
      * Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80)
      * Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90)
      * Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88)
      * Assume modules with __file__ attribute are not dynamic (cloudpipe/cloudpickle#85)
      * Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72)
      * Allow pickling of builtin methods (cloudpipe/cloudpickle#57)
      * Add ability to pickle dynamically created modules (cloudpipe/cloudpickle#52)
      * Support method descriptor (cloudpipe/cloudpickle#46)
      * No more pickling of closed files, was broken on Python 3 (cloudpipe/cloudpickle#32)
      * **Remove non-standard `__transient__` check (cloudpipe/cloudpickle#110)** -- while we don't use this internally, and have no tests or documentation for its use, downstream code may use `__transient__`, although it has never been part of the API. If we merge this we should include a note about it in the release notes.
      * Support for pickling loggers (yay!) (cloudpipe/cloudpickle#96)
      * BUG: Fix crash when pickling dynamic class cycles. (cloudpipe/cloudpickle#102)
      
      ## How was this patch tested?
      
      Existing PySpark unit tests + the unit tests from the cloudpickle project on their own.
      
      Author: Holden Karau <holden@us.ibm.com>
      Author: Kyle Kelley <rgbkrk@gmail.com>
      
      Closes #18734 from holdenk/holden-rgbkrk-cloudpickle-upgrades.
    • [SPARK-19762][ML][FOLLOWUP] Add necessary comments to L2Regularization. · c108a5d3
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      MLlib's `LinearRegression`/`LogisticRegression`/`LinearSVC` always standardize the data during training to improve the rate of convergence, regardless of whether _standardization_ is true or false. If _standardization_ is false, we perform reverse standardization by penalizing each component differently, to obtain effectively the same objective function as when the training dataset is not standardized. We should keep these comments in the code so developers understand how we handle this correctly.
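
      A short worked form of that statement (the notation is assumed here for illustration, not taken from the code): training divides each feature by its standard deviation, so a coefficient learned in the scaled space equals the original coefficient times that feature's standard deviation. Penalizing each scaled component by one over the squared standard deviation therefore recovers the plain, unstandardized L2 penalty:

      ```latex
      \frac{\lambda}{2} \sum_j \frac{\tilde{w}_j^{2}}{\sigma_j^{2}}
        = \frac{\lambda}{2} \sum_j \left( \frac{w_j \,\sigma_j}{\sigma_j} \right)^{2}
        = \frac{\lambda}{2} \sum_j w_j^{2},
        \qquad \tilde{w}_j = w_j \,\sigma_j .
      ```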
      
      ## How was this patch tested?
      Existing tests, only adding some comments in code.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #18992 from yanboliang/SPARK-19762.
    • [SPARK-21617][SQL] Store correct table metadata when altering schema in Hive metastore. · 84b5b16e
      Marcelo Vanzin authored
      For Hive tables, the current "replace the schema" code is the correct
      path, except that an exception in that path should result in an error, and
      not in retrying in a different way.
      
      For data source tables, Spark may generate a non-compatible Hive table;
      but for that to work with Hive 2.1, the detection of data source tables needs
      to be fixed in the Hive client, to also consider the raw tables used by code
      such as `alterTableSchema`.
      
      Tested with existing and added unit tests (plus internal tests with a 2.1 metastore).
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18849 from vanzin/SPARK-21617.
    • [SPARK-21790][TESTS][FOLLOW-UP] Add filter pushdown verification back. · ba843292
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      
      The previous PR (https://github.com/apache/spark/pull/19000) removed the filter pushdown verification; this PR adds it back.
      
      ## How was this patch tested?
      manual tests
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #19002 from wangyum/SPARK-21790-follow-up.
    • [SPARK-21468][PYSPARK][ML] Python API for FeatureHasher · 988b84d7
      Nick Pentreath authored
      Add Python API for `FeatureHasher` transformer.
      
      ## How was this patch tested?
      
      New doc test.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #18970 from MLnick/SPARK-21468-pyspark-hasher.
    • [SPARK-21718][SQL] Heavy log of type: "Skipping partition based on stats ..." · b3a07526
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Reduce 'Skipping partitions' message to debug
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19010 from srowen/SPARK-21718.
    • [SPARK-21782][CORE] Repartition creates skews when numPartitions is a power of 2 · 77d046ec
      Sergey Serebryakov authored
      ## Problem
      When an RDD (particularly one with a low item-per-partition ratio) is repartitioned to numPartitions = a power of 2, the resulting partitions are very unevenly sized, due to using a fixed seed to initialize the PRNG and using the PRNG only once. See details in https://issues.apache.org/jira/browse/SPARK-21782
      
      ## What changes were proposed in this pull request?
      Instead of directly using `0, 1, 2,...` seeds to initialize `Random`, hash them with `scala.util.hashing.byteswap32()`.
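
      A minimal sketch of that seeding idea (illustrative, not the exact code in the patch):

      ```scala
      import scala.util.Random
      import scala.util.hashing.byteswap32

      // Hashing the partition index decorrelates the per-partition seeds, so consecutive
      // indices no longer produce correlated first draws from the PRNG.
      def rngForPartition(partitionIndex: Int): Random = new Random(byteswap32(partitionIndex))

      val firstDraws = (0 until 8).map(i => rngForPartition(i).nextInt(8))
      ```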
      
      ## How was this patch tested?
      `build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.rdd.RDDSuite test`
      
      Author: Sergey Serebryakov <sserebryakov@tesla.com>
      
      Closes #18990 from megaserg/repartition-skew.