  1. Apr 07, 2017
    • 郭小龙 10207633's avatar
      [SPARK-20218][DOC][APP-ID] 'applications/[app-id]/stages' in REST API, add description. · 77911201
      郭小龙 10207633 authored
      ## What changes were proposed in this pull request?
      
      1. For '/applications/[app-id]/stages' in the REST API, the status parameter should be documented: '?status=[active|complete|pending|failed] list only stages in the state.'

      Without this description, users of this API do not know that the status parameter can be used to filter the stage list (see the example after the code excerpt below).

      2. For '/applications/[app-id]/stages/[stage-id]' in the REST API, remove the redundant description '?status=[active|complete|pending|failed] list only stages in the state.', because a single stage is already determined by its stage-id.
      
      code (excerpt):
        @GET
        def stageList(@QueryParam("status") statuses: JList[StageStatus]): Seq[StageData] = {
          val listener = ui.jobProgressListener
          val stageAndStatus = AllStagesResource.stagesAndStatus(ui)
          val adjStatuses = {
            if (statuses.isEmpty()) {
              Arrays.asList(StageStatus.values(): _*)
            } else {
              statuses
            }
          }
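      For illustration, a hedged example of hitting this endpoint with the status filter from Python (the host, port, and app-id below are placeholders; the monitoring REST API root /api/v1 is assumed):

      ```python
      import requests

      # Placeholder UI address and application id; adjust for your cluster.
      base = "http://localhost:4040/api/v1/applications/<app-id>"

      all_stages = requests.get(base + "/stages").json()
      # With the documented parameter, only stages in the requested state are returned.
      active_stages = requests.get(base + "/stages", params={"status": "active"}).json()
      ```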
      
      ## How was this patch tested?
      
      manual tests
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>
      
      Closes #17534 from guoxiaolongzte/SPARK-20218.
      
      (cherry picked from commit 9e0893b5)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      77911201
  2. Apr 05, 2017
    • Liang-Chi Hsieh's avatar
      [SPARK-20214][ML] Make sure converted csc matrix has sorted indices · fb81a412
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      `_convert_to_vector` converts a scipy sparse matrix to a csc matrix for initializing `SparseVector`. However, it doesn't guarantee that the converted csc matrix has sorted indices, so a failure like the following can happen:
      
          from scipy.sparse import lil_matrix
          lil = lil_matrix((4, 1))
          lil[1, 0] = 1
          lil[3, 0] = 2
          _convert_to_vector(lil.todok())
      
          File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector
            return SparseVector(l.shape[0], csc.indices, csc.data)
          File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__
            % (self.indices[i], self.indices[i + 1]))
          TypeError: Indices 3 and 1 are not strictly increasing
      
      A simple test can confirm that `dok_matrix.tocsc()` won't guarantee sorted indices:
      
          >>> from scipy.sparse import lil_matrix
          >>> lil = lil_matrix((4, 1))
          >>> lil[1, 0] = 1
          >>> lil[3, 0] = 2
          >>> dok = lil.todok()
          >>> csc = dok.tocsc()
          >>> csc.has_sorted_indices
          0
          >>> csc.indices
          array([3, 1], dtype=int32)
      
      I checked the scipy source code. The only conversions that guarantee sorted indices are `csc_matrix.tocsr()` and `csr_matrix.tocsc()`.
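      A quick follow-up check (a sketch, assuming scipy is installed) showing that routing the conversion through CSR, or calling sort_indices(), yields sorted indices:

      ```python
      from scipy.sparse import lil_matrix

      lil = lil_matrix((4, 1))
      lil[1, 0] = 1
      lil[3, 0] = 2

      csc = lil.todok().tocsc()
      print(csc.has_sorted_indices)      # may print 0: sorted indices are not guaranteed

      fixed = csc.tocsr().tocsc()        # round-trip through CSR, as described above
      print(fixed.has_sorted_indices)    # 1
      print(fixed.indices)               # array([1, 3], dtype=int32)

      csc.sort_indices()                 # in-place alternative
      ```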
      
      ## How was this patch tested?
      
      Existing tests.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #17532 from viirya/make-sure-sorted-indices.
      
      (cherry picked from commit 12206058)
      Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
      fb81a412
    • wangzhenhua's avatar
      [SPARK-20223][SQL] Fix typo in tpcds q77.sql · 2b85e059
      wangzhenhua authored
      
      ## What changes were proposed in this pull request?
      
      Fix typo in tpcds q77.sql
      
      ## How was this patch tested?
      
      N/A
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      
      Closes #17538 from wzhfy/typoQ77.
      
      (cherry picked from commit a2d8d767)
      Signed-off-by: Xiao Li <gatorsmile@gmail.com>
      2b85e059
    • Oliver Köth's avatar
      [SPARK-20042][WEB UI] Fix log page buttons for reverse proxy mode · efc72dcc
      Oliver Köth authored
      
      With spark.ui.reverseProxy=true, full-path URLs like /log point to the master web endpoint, which serves the worker UI as a reverse proxy. To access a REST endpoint on the worker in reverse proxy mode, the leading /proxy/"target"/ part of the base URI must be retained.

      Added logic to log-view.js to handle this, similar to executorspage.js (illustrated below).
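      A rough illustration of the path handling (Python here rather than the actual log-view.js; the worker target name is made up):

      ```python
      import re

      def base_prefix(page_path):
          # When the worker UI is served through the master's reverse proxy, keep the
          # leading /proxy/<target>/ part of the URI; otherwise use no prefix.
          match = re.match(r"^(/proxy/[^/]+)/", page_path)
          return match.group(1) if match else ""

      print(base_prefix("/proxy/worker-20170405-1234/logPage/"))  # /proxy/worker-20170405-1234
      print(base_prefix("/logPage/"))                             # "" (no reverse proxy)
      ```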
      
      Patch was tested manually
      
      Author: Oliver Köth <okoeth@de.ibm.com>
      
      Closes #17370 from okoethibm/master.
      
      (cherry picked from commit 6f09dc70)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      efc72dcc
  3. Apr 04, 2017
    • Marcelo Vanzin's avatar
      [SPARK-20191][YARN] Create wrapper for RackResolver so tests can override it. · 00c12488
      Marcelo Vanzin authored
      
      Current test code tries to override the RackResolver by setting configuration params, but because the YARN libs statically initialize the resolver the first time it is used, those configs don't really take effect during Spark tests.
      
      This change adds a wrapper class that easily allows tests to override the
      behavior of the resolver for the Spark code that uses it.
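      A minimal sketch of the wrapper idea (hypothetical Python, not the actual Scala/YARN code): production code goes through the wrapper, and tests inject a fake resolver instead of fighting the static initialization.

      ```python
      class RackResolverWrapper:
          """Hypothetical stand-in for a statically initialized resolver."""

          def __init__(self, resolver=None):
              self._resolver = resolver              # tests inject a fake here

          def resolve(self, host):
              if self._resolver is not None:
                  return self._resolver(host)
              return self._default_resolve(host)

          def _default_resolve(self, host):
              # placeholder for the real, statically initialized lookup
              return "/default-rack"

      # In a test, override the behavior without touching any static state:
      wrapper = RackResolverWrapper(resolver=lambda host: "/rack-1")
      assert wrapper.resolve("host1") == "/rack-1"
      ```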
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #17508 from vanzin/SPARK-20191.
      
      (cherry picked from commit 0736980f)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      00c12488
    • guoxiaolongzte's avatar
      [SPARK-20190][APP-ID] 'applications/[app-id]/jobs' in REST API, status should be [running|succeeded|failed|unknown] · f9546dac
      guoxiaolongzte authored
      
      ## What changes were proposed in this pull request?
      
      For '/applications/[app-id]/jobs' in the REST API, the documented status values should be '[running|succeeded|failed|unknown]'.
      The docs currently say '[complete|succeeded|failed]', but '/applications/[app-id]/jobs?status=complete' makes the server return 'HTTP ERROR 404'.
      Added '?status=running' and '?status=unknown' (see the example after the enum below).
      code (JobExecutionStatus):
        public enum JobExecutionStatus {
          RUNNING,
          SUCCEEDED,
          FAILED,
          UNKNOWN;
        }
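      For illustration (placeholders for host and app-id; REST root /api/v1 assumed), the added filter values can be used like this:

      ```python
      import requests

      base = "http://localhost:4040/api/v1/applications/<app-id>"
      running_jobs = requests.get(base + "/jobs", params={"status": "running"}).json()
      unknown_jobs = requests.get(base + "/jobs", params={"status": "unknown"}).json()
      ```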
      
      ## How was this patch tested?
      
       manual tests
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn>
      
      Closes #17507 from guoxiaolongzte/SPARK-20190.
      
      (cherry picked from commit c95fbea6)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      f9546dac
  4. Apr 03, 2017
    • hyukjinkwon's avatar
      [MINOR][DOCS] Replace non-breaking space to normal spaces that breaks rendering markdown · 77700ea3
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      It seems several non-breaking spaces were inserted into several `.md` files, and they appear to break the markdown rendering.

      A non-breaking space and a normal space are different characters. For example, this can be checked via `python` as below:
      
      ```python
      >>> " "
      '\xc2\xa0'
      >>> " "
      ' '
      ```
      
      _Note that it seems this PR description automatically replaces non-breaking spaces into normal spaces. Please open a `vi` and copy and paste it into `python` to verify this (do not copy the characters here)._
      
      I checked the output below in Safari and Chrome on macOS, and Internet Explorer on Windows 10.
      
      **Before**
      
      ![2017-04-03 12 37 17](https://cloud.githubusercontent.com/assets/6477701/24594655/50aaba02-186a-11e7-80bb-d34b17a3398a.png)
      ![2017-04-03 12 36 57](https://cloud.githubusercontent.com/assets/6477701/24594654/50a855e6-186a-11e7-94e2-661e56544b0f.png)
      
      **After**
      
      ![2017-04-03 12 36 46](https://cloud.githubusercontent.com/assets/6477701/24594657/53c2545c-186a-11e7-9a73-00529afbfd75.png)
      ![2017-04-03 12 36 31](https://cloud.githubusercontent.com/assets/6477701/24594658/53c286c0-186a-11e7-99c9-e66b1f510fe7.png)
      
      ## How was this patch tested?
      
      Manually checking.
      
      These instances were found via
      
      ```
      grep --include=*.scala --include=*.python --include=*.java --include=*.r --include=*.R --include=*.md --include=*.r -r -I " " .
      ```
      
      on macOS.
      
      It seems there are several more instances, as below:
      
      ```
      ./docs/sql-programming-guide.md:        │   ├── ...
      ./docs/sql-programming-guide.md:        │   │
      ./docs/sql-programming-guide.md:        │   ├── country=US
      ./docs/sql-programming-guide.md:        │   │   └── data.parquet
      ./docs/sql-programming-guide.md:        │   ├── country=CN
      ./docs/sql-programming-guide.md:        │   │   └── data.parquet
      ./docs/sql-programming-guide.md:        │   └── ...
      ./docs/sql-programming-guide.md:            ├── ...
      ./docs/sql-programming-guide.md:            │
      ./docs/sql-programming-guide.md:            ├── country=US
      ./docs/sql-programming-guide.md:            │   └── data.parquet
      ./docs/sql-programming-guide.md:            ├── country=CN
      ./docs/sql-programming-guide.md:            │   └── data.parquet
      ./docs/sql-programming-guide.md:            └── ...
      ./sql/core/src/test/README.md:│   ├── *.avdl                  # Testing Avro IDL(s)
      ./sql/core/src/test/README.md:│   └── *.avpr                  # !! NO TOUCH !! Protocol files generated from Avro IDL(s)
      ./sql/core/src/test/README.md:│   ├── gen-avro.sh             # Script used to generate Java code for Avro
      ./sql/core/src/test/README.md:│   └── gen-thrift.sh           # Script used to generate Java code for Thrift
      ```
      
      These seem to have been generated via the `tree` command, which inserts non-breaking spaces. They do not appear to cause any rendering problems within code blocks, so I did not fix them, to avoid the overhead of manually replacing them whenever the output is regenerated via `tree` in the future.
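      A rough Python equivalent of the grep above for finding files that contain the non-breaking space character (the extension list is assumed; adjust as needed):

      ```python
      import pathlib

      EXTENSIONS = {".scala", ".py", ".java", ".r", ".md"}

      for path in pathlib.Path(".").rglob("*"):
          if path.is_file() and path.suffix.lower() in EXTENSIONS:
              text = path.read_text(encoding="utf-8", errors="ignore")
              if "\u00a0" in text:    # U+00A0, the non-breaking space
                  print(path)
      ```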
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17517 from HyukjinKwon/non-breaking-space.
      
      (cherry picked from commit 364b0db7)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      77700ea3
  5. Apr 02, 2017
  6. Mar 31, 2017
    • Ryan Blue's avatar
      [SPARK-20084][CORE] Remove internal.metrics.updatedBlockStatuses from history files. · e3cec18e
      Ryan Blue authored
      
      ## What changes were proposed in this pull request?
      
      Remove accumulator updates for internal.metrics.updatedBlockStatuses from SparkListenerTaskEnd entries in the history file. These can cause history files to grow to hundreds of GB because the value of the accumulator contains all tracked blocks.
      
      ## How was this patch tested?
      
      Current History UI tests cover use of the history file.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #17412 from rdblue/SPARK-20084-remove-block-accumulator-info.
      
      (cherry picked from commit c4c03eed)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      e3cec18e
    • Kunal Khamar's avatar
      [SPARK-20164][SQL] AnalysisException not tolerant of null query plan. · 6a1b2eb4
      Kunal Khamar authored
      
      The query plan in an `AnalysisException` may be `null` when an `AnalysisException` object is serialized and then deserialized, since `plan` is marked `transient`. Or when someone throws an `AnalysisException` with a null query plan (which should not happen).
      `def getMessage` is not tolerant of this and throws a `NullPointerException`, leading to loss of information about the original exception.
      The fix is to add a `null` check in `getMessage`.
      
      - Unit test
      
      Author: Kunal Khamar <kkhamar@outlook.com>
      
      Closes #17486 from kunalkhamar/spark-20164.
      
      (cherry picked from commit 254877c2)
      Signed-off-by: Xiao Li <gatorsmile@gmail.com>
      6a1b2eb4
  7. Mar 29, 2017
    • jerryshao's avatar
      [SPARK-20059][YARN] Use the correct classloader for HBaseCredentialProvider · 103ff54d
      jerryshao authored
      
      ## What changes were proposed in this pull request?
      
      Currently we use the system classloader to find HBase jars; if they are specified with `--jars`, this fails with a ClassNotFound issue. So here we change to use the child classloader.
      
      Also put the added jars and the main jar into the classpath of the submitted application in yarn cluster mode; otherwise HBase jars specified with `--jars` will never be honored in cluster mode, and fetching tokens on the client side will always fail.
      
      ## How was this patch tested?
      
      Unit test and local verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17388 from jerryshao/SPARK-20059.
      
      (cherry picked from commit c622a87c)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      103ff54d
    • Reynold Xin's avatar
      [SPARK-20134][SQL] SQLMetrics.postDriverMetricUpdates to simplify driver side metric updates · f8c1b3e2
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      It is not super intuitive how to update SQLMetric on the driver side. This patch introduces a new SQLMetrics.postDriverMetricUpdates function to do that, and adds documentation to make it more obvious.
      
      ## How was this patch tested?
      Updated a test case to use this method.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #17464 from rxin/SPARK-20134.
      
      (cherry picked from commit 9712bd39)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      f8c1b3e2
  8. Mar 28, 2017
    • 颜发才(Yan Facai)'s avatar
      [SPARK-20043][ML] DecisionTreeModel: ImpurityCalculator builder fails for... · 30954806
      颜发才(Yan Facai) authored
      [SPARK-20043][ML] DecisionTreeModel: ImpurityCalculator builder fails for uppercase impurity type Gini
      
      Fix bug: DecisionTreeModel can't recognize Impurity "Gini" when loading
      
      TODO:
      + [x] add unit test
      + [x] fix the bug
      
      Author: 颜发才(Yan Facai) <facai.yan@gmail.com>
      
      Closes #17407 from facaiy/BUG/decision_tree_loader_failer_with_Gini_impurity.
      
      (cherry picked from commit 7d432af8)
      Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
      30954806
    • Patrick Wendell's avatar
      4964dbed
    • Patrick Wendell's avatar
      Preparing Spark release v2.1.1-rc2 · 02b165dc
      Patrick Wendell authored
      02b165dc
    • sureshthalamati's avatar
      [SPARK-14536][SQL][BACKPORT-2.1] fix to handle null value in array type column for postgres. · e669dd7e
      sureshthalamati authored
      ## What changes were proposed in this pull request?
      JDBC read is failing with an NPE due to a missing null-value check for the array data type when the source table has null values in an array-type column. For null values, ResultSet.getArray() returns null.
      This PR adds a null-safe check on the ResultSet.getArray() value before invoking methods on the Array object.
      
      ## How was this patch tested?
      Updated the PostgresIntegration test suite to test null values. Ran docker integration tests on my laptop.
      
      Author: sureshthalamati <suresh.thalamati@gmail.com>
      
      Closes #17460 from sureshthalamati/jdbc_array_null_fix_spark_2.1-SPARK-14536.
      e669dd7e
    • Wenchen Fan's avatar
      [SPARK-20125][SQL] Dataset of type option of map does not work · fd2e4061
      Wenchen Fan authored
      
      When we build the deserializer expression for map type, we use `StaticInvoke` to call `ArrayBasedMapData.toScalaMap`, and declare the return type as `scala.collection.immutable.Map`. If the map is inside an Option, we wrap this `StaticInvoke` with `WrapOption`, which requires the input to be `scala.collection.Map`. Ideally this should be fine, as `scala.collection.immutable.Map` extends `scala.collection.Map`, but our `ObjectType` is too strict about this; this PR fixes it.
      
      new regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17454 from cloud-fan/map.
      
      (cherry picked from commit d4fac410)
      Signed-off-by: Cheng Lian <lian@databricks.com>
      fd2e4061
    • jerryshao's avatar
      [SPARK-19995][YARN] Register tokens to current UGI to avoid re-issuing of... · 4bcb7d67
      jerryshao authored
      [SPARK-19995][YARN] Register tokens to current UGI to avoid re-issuing of tokens in yarn client mode
      
      ## What changes were proposed in this pull request?
      
      In the current Spark on YARN code, we obtain tokens from the provided services, but we do not add these tokens to the current user's credentials. This makes all subsequent operations against these services still require a TGT rather than delegation tokens. This is unnecessary since we already got the tokens, and it also leads to failures in the user-impersonation scenario, because the TGT is granted to the real user, not the proxy user.
      
      So here we change to register all the tokens with the current UGI, so that subsequent operations against these services honor the tokens rather than the TGT; this also handles the proxy-user issue mentioned above.
      
      ## How was this patch tested?
      
      Local verified in secure cluster.
      
      vanzin tgravescs mridulm  dongjoon-hyun please help to review, thanks a lot.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17335 from jerryshao/SPARK-19995.
      
      (cherry picked from commit 17eddb35)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      4bcb7d67
  9. Mar 27, 2017
    • Josh Rosen's avatar
      [SPARK-20102] Fix nightly packaging and RC packaging scripts w/ two minor build fixes · 4056191d
      Josh Rosen authored
      
      ## What changes were proposed in this pull request?
      
      The master snapshot publisher builds are currently broken due to two minor build issues:
      
      1. For unknown reasons, the LFTP `mkdir -p` command began throwing errors when the remote directory already exists. This change of behavior might have been caused by configuration changes in the ASF's SFTP server, but I'm not entirely sure of that. To work around this problem, this patch updates the script to ignore errors from the `lftp mkdir -p` commands.
      2. The PySpark `setup.py` file references a non-existent `pyspark.ml.stat` module, causing Python packaging to fail by complaining about a missing directory. The fix is to simply drop that line from the setup script.
      
      ## How was this patch tested?
      
      The LFTP fix was tested by manually running the failing commands on AMPLab Jenkins against the ASF SFTP server. The PySpark fix was tested locally.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #17437 from JoshRosen/spark-20102.
      
      (cherry picked from commit 314cf51d)
      Signed-off-by: Josh Rosen <joshrosen@databricks.com>
      4056191d
  10. Mar 26, 2017
    • Herman van Hovell's avatar
      [SPARK-20086][SQL] CollapseWindow should not collapse dependent adjacent windows · b6d348ee
      Herman van Hovell authored
      
      ## What changes were proposed in this pull request?
      The `CollapseWindow` rule is currently too aggressive when collapsing adjacent windows. It also collapses windows in which the parent produces a column that is consumed by the child; this creates an invalid window which will fail at runtime.
      
      This PR fixes this by adding a check for dependent adjacent windows to the `CollapseWindow` rule.
      
      ## How was this patch tested?
      Added a new test case to `CollapseWindowSuite`
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #17432 from hvanhovell/SPARK-20086.
      
      (cherry picked from commit 617ab644)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      b6d348ee
  11. Mar 25, 2017
    • Carson Wang's avatar
      [SPARK-19674][SQL] Ignore driver accumulator updates don't belong to … · d989434e
      Carson Wang authored
      [SPARK-19674][SQL] Ignore driver accumulator updates that don't belong to the execution when merging all accumulator updates
      
      N.B. This is a backport to branch-2.1 of #17009.
      
      ## What changes were proposed in this pull request?
      In SQLListener.getExecutionMetrics, driver accumulator updates that don't belong to the execution should be ignored when merging all accumulator updates, to prevent a NoSuchElementException.
      
      ## How was this patch tested?
      Updated unit test.
      
      Author: Carson Wang <carson.wang@intel.com>
      
      Closes #17418 from mallman/spark-19674-backport_2.1.
      d989434e
  12. Mar 23, 2017
    • Kazuaki Ishizaki's avatar
      [SPARK-19959][SQL] Fix to throw NullPointerException in df[java.lang.Long].collect · 92f0b012
      Kazuaki Ishizaki authored
      
      ## What changes were proposed in this pull request?
      
      This PR fixes a `NullPointerException` in the code generated by Catalyst. When we run the following code, we get the `NullPointerException` below. This is because there is no null check for `inputadapter_value`, while `java.lang.Long inputadapter_value` at line 30 may be `null`.
      
      This happens when a DataFrame column has a nullable boxed primitive type such as `java.lang.Long` and whole-stage codegen is used. While the physical plan keeps `nullable=true` in `input[0, java.lang.Long, true].longValue`, `BoundReference.doGenCode` ignores `nullable=true`. Thus, the null-check code is not generated and a `NullPointerException` occurs.
      
      This PR checks the nullability and correctly generates nullcheck if needed.
      ```scala
      sparkContext.parallelize(Seq[java.lang.Long](0L, null, 2L), 1).toDF.collect
      ```
      
      ```java
      Caused by: java.lang.NullPointerException
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:37)
      	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
      	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:393)
      ...
      ```
      
      Generated code without this PR
      ```java
      /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
      /* 006 */   private Object[] references;
      /* 007 */   private scala.collection.Iterator[] inputs;
      /* 008 */   private scala.collection.Iterator inputadapter_input;
      /* 009 */   private UnsafeRow serializefromobject_result;
      /* 010 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder;
      /* 011 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter;
      /* 012 */
      /* 013 */   public GeneratedIterator(Object[] references) {
      /* 014 */     this.references = references;
      /* 015 */   }
      /* 016 */
      /* 017 */   public void init(int index, scala.collection.Iterator[] inputs) {
      /* 018 */     partitionIndex = index;
      /* 019 */     this.inputs = inputs;
      /* 020 */     inputadapter_input = inputs[0];
      /* 021 */     serializefromobject_result = new UnsafeRow(1);
      /* 022 */     this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 0);
      /* 023 */     this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1);
      /* 024 */
      /* 025 */   }
      /* 026 */
      /* 027 */   protected void processNext() throws java.io.IOException {
      /* 028 */     while (inputadapter_input.hasNext() && !stopEarly()) {
      /* 029 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
      /* 030 */       java.lang.Long inputadapter_value = (java.lang.Long)inputadapter_row.get(0, null);
      /* 031 */
      /* 032 */       boolean serializefromobject_isNull = true;
      /* 033 */       long serializefromobject_value = -1L;
      /* 034 */       if (!false) {
      /* 035 */         serializefromobject_isNull = false;
      /* 036 */         if (!serializefromobject_isNull) {
      /* 037 */           serializefromobject_value = inputadapter_value.longValue();
      /* 038 */         }
      /* 039 */
      /* 040 */       }
      /* 041 */       serializefromobject_rowWriter.zeroOutNullBytes();
      /* 042 */
      /* 043 */       if (serializefromobject_isNull) {
      /* 044 */         serializefromobject_rowWriter.setNullAt(0);
      /* 045 */       } else {
      /* 046 */         serializefromobject_rowWriter.write(0, serializefromobject_value);
      /* 047 */       }
      /* 048 */       append(serializefromobject_result);
      /* 049 */       if (shouldStop()) return;
      /* 050 */     }
      /* 051 */   }
      /* 052 */ }
      ```
      
      Generated code with this PR
      
      ```java
      /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
      /* 006 */   private Object[] references;
      /* 007 */   private scala.collection.Iterator[] inputs;
      /* 008 */   private scala.collection.Iterator inputadapter_input;
      /* 009 */   private UnsafeRow serializefromobject_result;
      /* 010 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder;
      /* 011 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter;
      /* 012 */
      /* 013 */   public GeneratedIterator(Object[] references) {
      /* 014 */     this.references = references;
      /* 015 */   }
      /* 016 */
      /* 017 */   public void init(int index, scala.collection.Iterator[] inputs) {
      /* 018 */     partitionIndex = index;
      /* 019 */     this.inputs = inputs;
      /* 020 */     inputadapter_input = inputs[0];
      /* 021 */     serializefromobject_result = new UnsafeRow(1);
      /* 022 */     this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 0);
      /* 023 */     this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1);
      /* 024 */
      /* 025 */   }
      /* 026 */
      /* 027 */   protected void processNext() throws java.io.IOException {
      /* 028 */     while (inputadapter_input.hasNext() && !stopEarly()) {
      /* 029 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
      /* 030 */       boolean inputadapter_isNull = inputadapter_row.isNullAt(0);
      /* 031 */       java.lang.Long inputadapter_value = inputadapter_isNull ? null : ((java.lang.Long)inputadapter_row.get(0, null));
      /* 032 */
      /* 033 */       boolean serializefromobject_isNull = true;
      /* 034 */       long serializefromobject_value = -1L;
      /* 035 */       if (!inputadapter_isNull) {
      /* 036 */         serializefromobject_isNull = false;
      /* 037 */         if (!serializefromobject_isNull) {
      /* 038 */           serializefromobject_value = inputadapter_value.longValue();
      /* 039 */         }
      /* 040 */
      /* 041 */       }
      /* 042 */       serializefromobject_rowWriter.zeroOutNullBytes();
      /* 043 */
      /* 044 */       if (serializefromobject_isNull) {
      /* 045 */         serializefromobject_rowWriter.setNullAt(0);
      /* 046 */       } else {
      /* 047 */         serializefromobject_rowWriter.write(0, serializefromobject_value);
      /* 048 */       }
      /* 049 */       append(serializefromobject_result);
      /* 050 */       if (shouldStop()) return;
      /* 051 */     }
      /* 052 */   }
      /* 053 */ }
      ```
      
      ## How was this patch tested?
      
      Added new test suites in `DataFrameSuites`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #17302 from kiszk/SPARK-19959.
      
      (cherry picked from commit bb823ca4)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      92f0b012
    • Dongjoon Hyun's avatar
      [SPARK-19970][SQL][BRANCH-2.1] Table owner should be USER instead of PRINCIPAL... · af960e86
      Dongjoon Hyun authored
      [SPARK-19970][SQL][BRANCH-2.1] Table owner should be USER instead of PRINCIPAL in kerberized clusters
      
      ## What changes were proposed in this pull request?
      
      In a kerberized Hadoop cluster, when Spark creates tables, the owner of the table is filled with the PRINCIPAL string instead of the USER name. This is inconsistent with Hive and causes problems when using [ROLE](https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization) in Hive. We had better fix this.
      
      **BEFORE**
      ```scala
      scala> sql("create table t(a int)").show
      scala> sql("desc formatted t").show(false)
      ...
      |Owner:                      |spark@EXAMPLE.COM                                         |       |
      ```
      
      **AFTER**
      ```scala
      scala> sql("create table t(a int)").show
      scala> sql("desc formatted t").show(false)
      ...
      |Owner:                      |spark                                         |       |
      ```
      
      ## How was this patch tested?
      
      Manually ran `create table` and `desc formatted`, because this happens in kerberized clusters.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #17363 from dongjoon-hyun/SPARK-19970-2.
      af960e86
  13. Mar 22, 2017
  14. Mar 21, 2017
    • Patrick Wendell's avatar
      c4d2b833
    • Patrick Wendell's avatar
      Preparing Spark release v2.1.1-rc1 · 30abb95c
      Patrick Wendell authored
      30abb95c
    • Takeshi Yamamuro's avatar
      [SPARK-19980][SQL][BACKPORT-2.1] Add NULL checks in Bean serializer · a04428fe
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      A Bean serializer in `ExpressionEncoder` could change values when a Bean has NULL fields. A concrete example is as follows:
      ```
      scala> :paste
      class Outer extends Serializable {
        private var cls: Inner = _
        def setCls(c: Inner): Unit = cls = c
        def getCls(): Inner = cls
      }
      
      class Inner extends Serializable {
        private var str: String = _
        def setStr(s: String): Unit = str = s
        def getStr(): String = str
      }
      
      scala> Seq("""{"cls":null}""", """{"cls": {"str":null}}""").toDF().write.text("data")
      scala> val encoder = Encoders.bean(classOf[Outer])
      scala> val schema = encoder.schema
      scala> val df = spark.read.schema(schema).json("data").as[Outer](encoder)
      scala> df.show
      +------+
      |   cls|
      +------+
      |[null]|
      |  null|
      +------+
      
      scala> df.map(x => x)(encoder).show()
      +------+
      |   cls|
      +------+
      |[null]|
      |[null]|     // <-- Value changed
      +------+
      ```
      
      This is because the Bean serializer does not have the NULL-check expressions that the serializer for Scala's product types has. In fact, this value change does not happen with Scala's product types:
      
      ```
      scala> :paste
      case class Outer(cls: Inner)
      case class Inner(str: String)
      
      scala> val encoder = Encoders.product[Outer]
      scala> val schema = encoder.schema
      scala> val df = spark.read.schema(schema).json("data").as[Outer](encoder)
      scala> df.show
      +------+
      |   cls|
      +------+
      |[null]|
      |  null|
      +------+
      
      scala> df.map(x => x)(encoder).show()
      +------+
      |   cls|
      +------+
      |[null]|
      |  null|
      +------+
      ```
      
      This PR adds the NULL-check expressions to the Bean serializer, in line with the serializer for Scala's product types.
      
      ## How was this patch tested?
      Added tests in `JavaDatasetSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #17372 from maropu/SPARK-19980-BACKPORT2.1.
      a04428fe
    • Will Manning's avatar
      clarify array_contains function description · 9dfdd2ad
      Will Manning authored
      ## What changes were proposed in this pull request?
      
      The description in the comment for array_contains is vague/incomplete (i.e., doesn't mention that it returns `null` if the array is `null`); this PR fixes that.
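      A small PySpark illustration of the documented behavior (a sketch; the toy DataFrame is made up):

      ```python
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import array_contains

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([([1, 2, 3],), (None,)], ["data"])

      # true for the first row; null (not false) for the row whose array is null
      df.select(array_contains("data", 1).alias("contains_1")).show()
      ```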
      
      ## How was this patch tested?
      
      No testing, since it merely changes a comment.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Will Manning <lwwmanning@gmail.com>
      
      Closes #17380 from lwwmanning/patch-1.
      
      (cherry picked from commit a04dcde8)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      9dfdd2ad
    • Felix Cheung's avatar
      [SPARK-19237][SPARKR][CORE] On Windows spark-submit should handle when java is not installed · 5c18b6c3
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
      When SparkR is installed as an R package, there might not be any Java runtime.
      If it is not present, SparkR's `sparkR.session()` will block waiting for the connection timeout, hanging the R IDE/shell without any notification or message.
      
      ## How was this patch tested?
      
      manually
      
      - [x] need to test on Windows
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16596 from felixcheung/rcheckjava.
      
      (cherry picked from commit a8877bdb)
      Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      5c18b6c3
    • zhaorongsheng's avatar
      [SPARK-20017][SQL] change the nullability of function 'StringToMap' from 'false' to 'true' · a88c88aa
      zhaorongsheng authored
      
      ## What changes were proposed in this pull request?
      
      Change the nullability of function `StringToMap` from `false` to `true`.
      
      Author: zhaorongsheng <334362872@qq.com>
      
      Closes #17350 from zhaorongsheng/bug-fix_strToMap_NPE.
      
      (cherry picked from commit 7dbc162f)
      Signed-off-by: Xiao Li <gatorsmile@gmail.com>
      a88c88aa
  15. Mar 20, 2017
    • Dongjoon Hyun's avatar
      [SPARK-19912][SQL] String literals should be escaped for Hive metastore partition pruning · c4c7b185
      Dongjoon Hyun authored
      
      ## What changes were proposed in this pull request?
      
      Since the current `HiveShim.convertFilters` does not escape string literals, the following correctness issues exist. This PR aims to return the correct result and also show a clearer exception message.
      
      **BEFORE**
      
      ```scala
      scala> Seq((1, "p1", "q1"), (2, "p1\" and q=\"q1", "q2")).toDF("a", "p", "q").write.partitionBy("p", "q").saveAsTable("t1")
      
      scala> spark.table("t1").filter($"p" === "p1\" and q=\"q1").select($"a").show
      +---+
      |  a|
      +---+
      +---+
      
      scala> spark.table("t1").filter($"p" === "'\"").select($"a").show
      java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from ...
      ```
      
      **AFTER**
      
      ```scala
      scala> spark.table("t1").filter($"p" === "p1\" and q=\"q1").select($"a").show
      +---+
      |  a|
      +---+
      |  2|
      +---+
      
      scala> spark.table("t1").filter($"p" === "'\"").select($"a").show
      java.lang.UnsupportedOperationException: Partition filter cannot have both `"` and `'` characters
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins test with new test cases.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #17266 from dongjoon-hyun/SPARK-19912.
      
      (cherry picked from commit 21e366ae)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      c4c7b185
    • Michael Allman's avatar
      [SPARK-17204][CORE] Fix replicated off heap storage · d205d40a
      Michael Allman authored
      (Jira: https://issues.apache.org/jira/browse/SPARK-17204)
      
      ## What changes were proposed in this pull request?
      
      There are a couple of bugs in the `BlockManager` with respect to support for replicated off-heap storage. First, the locally-stored off-heap byte buffer is disposed of when it is replicated. It should not be. Second, the replica byte buffers are stored as heap byte buffers instead of direct byte buffers even when the storage level memory mode is off-heap. This PR addresses both of these problems.
      
      ## How was this patch tested?
      
      `BlockManagerReplicationSuite` was enhanced to fill in the coverage gaps. It now fails if either of the bugs in this PR exist.
      
      Author: Michael Allman <michael@videoamp.com>
      
      Closes #16499 from mallman/spark-17204-replicated_off_heap_storage.
      
      (cherry picked from commit 7fa116f8)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      d205d40a
    • wangzhenhua's avatar
      [SPARK-19994][SQL] Wrong outputOrdering for right/full outer smj · af8bf218
      wangzhenhua authored
      
      ## What changes were proposed in this pull request?
      
      For right outer join, values of the left key will be filled with nulls if it can't match the value of the right key, so `nullOrdering` of the left key can't be guaranteed. We should output right key order instead of left key order.
      
      For full outer join, neither left key nor right key guarantees `nullOrdering`. We should not output any ordering.
      
      In tests, besides adding three test cases for left/right/full outer sort merge join, this patch also reorganizes code in `PlannerSuite` by putting together tests for `Sort`, and also extracts common logic in Sort tests into a method.
      
      ## How was this patch tested?
      
      Corresponding test cases are added.
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      Author: Zhenhua Wang <wzh_zju@163.com>
      
      Closes #17331 from wzhfy/wrongOrdering.
      
      (cherry picked from commit 965a5abc)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      af8bf218
  16. Mar 19, 2017
    • Felix Cheung's avatar
      [SPARK-18817][SPARKR][SQL] change derby log output to temp dir · b60f6902
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
      Passes R `tempdir()` (the R session temp dir, shared with other temp files/dirs) to the JVM and sets the system property for the Derby home dir, so that derby.log is moved there.
      
      ## How was this patch tested?
      
      Manually, unit tests
      
      With this, these are relocated to under /tmp
      ```
      # ls /tmp/RtmpG2M0cB/
      derby.log
      ```
      And they are removed automatically when the R session is ended.
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16330 from felixcheung/rderby.
      
      (cherry picked from commit 422aa67d)
      Signed-off-by: Felix Cheung <felixcheung@apache.org>
      b60f6902
  17. Mar 17, 2017
    • Jacek Laskowski's avatar
      [SQL][MINOR] Fix scaladoc for UDFRegistration · 780f6060
      Jacek Laskowski authored
      
      ## What changes were proposed in this pull request?
      
      Fix scaladoc for UDFRegistration
      
      ## How was this patch tested?
      
      local build
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #17337 from jaceklaskowski/udfregistration-scaladoc.
      
      (cherry picked from commit 6326d406)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      780f6060
    • Shixiong Zhu's avatar
      [SPARK-19986][TESTS] Make pyspark.streaming.tests.CheckpointTests more stable · 5fb70831
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      Sometimes CheckpointTests will hang on a busy machine because the streaming jobs are too slow and cannot catch up. I observed the scheduling delay kept increasing for dozens of seconds locally.
      
      This PR increases the batch interval from 0.5 seconds to 2 seconds to generate less Spark jobs. It should make `pyspark.streaming.tests.CheckpointTests` more stable. I also replaced `sleep` with `awaitTerminationOrTimeout` so that if the streaming job fails, it will also fail the test.
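      A hedged sketch of the pattern (a toy stream via queueStream; the durations are illustrative, not the test's actual values):

      ```python
      from pyspark import SparkContext
      from pyspark.streaming import StreamingContext

      sc = SparkContext(appName="checkpoint-test-sketch")
      ssc = StreamingContext(sc, batchDuration=2)   # 2-second batches -> fewer Spark jobs
      stream = ssc.queueStream([sc.parallelize(range(10)) for _ in range(5)])
      stream.count().pprint()

      ssc.start()
      # Unlike a bare sleep(), this re-raises a failure from the streaming job and
      # returns as soon as the context stops or the timeout elapses.
      ssc.awaitTerminationOrTimeout(10)
      ssc.stop(stopSparkContext=True, stopGraceFully=True)
      ```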
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17323 from zsxwing/SPARK-19986.
      
      (cherry picked from commit 376d7821)
      Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
      5fb70831
    • Liwei Lin's avatar
      [SPARK-19721][SS][BRANCH-2.1] Good error message for version mismatch in log files · 710b5554
      Liwei Lin authored
      ## Problem
      
      There are several places where we write out version identifiers in various logs for structured streaming (usually `v1`). However, in the places where we check for this, we throw a confusing error message.
      
      ## What changes were proposed in this pull request?
      
      This patch made two major changes:
      1. added a `parseVersion(...)` method and, based on it, fixed the way the following places do version checking (no other place needed this check):
      ```
      HDFSMetadataLog
        - CompactibleFileStreamLog  ------------> fixed with this patch
          - FileStreamSourceLog  ---------------> inherited the fix of `CompactibleFileStreamLog`
          - FileStreamSinkLog  -----------------> inherited the fix of `CompactibleFileStreamLog`
        - OffsetSeqLog  ------------------------> fixed with this patch
        - anonymous subclass in KafkaSource  ---> fixed with this patch
      ```
      
      2. changed the type of `FileStreamSinkLog.VERSION`, `FileStreamSourceLog.VERSION` etc. from `String` to `Int`, so that we can identify newer versions via `version > 1` instead of `version != "v1"`
          - note this didn't break any backwards compatibility -- we are still writing out `"v1"` and reading back `"v1"`
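      As a language-agnostic illustration of the check in item 1 (Python here, not the actual Scala parseVersion), parse a 'v<N>' marker and reject anything newer than this build supports:

      ```python
      def parse_version(text, max_supported=1):
          """Parse a log version marker such as 'v1' (illustrative sketch)."""
          if text.startswith("v"):
              try:
                  version = int(text[1:])
              except ValueError:
                  version = -1
              if version > 0:
                  if version > max_supported:
                      raise ValueError(
                          "maximum supported log version is v%d, but encountered %s; the log "
                          "was written by a newer version of Spark" % (max_supported, text))
                  return version
          raise ValueError("unrecognized log version marker: %r" % text)

      assert parse_version("v1") == 1
      ```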
      
      ## Exception message with this patch
      ```
      java.lang.IllegalStateException: Failed to read log file /private/var/folders/nn/82rmvkk568sd8p3p8tb33trw0000gn/T/spark-86867b65-0069-4ef1-b0eb-d8bd258ff5b8/0. UnsupportedLogVersion: maximum supported log version is v1, but encountered v99. The log file was produced by a newer version of Spark and cannot be read by this version. Please upgrade.
      	at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:202)
      	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3$$anonfun$apply$mcV$sp$2.apply(OffsetSeqLogSuite.scala:78)
      	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3$$anonfun$apply$mcV$sp$2.apply(OffsetSeqLogSuite.scala:75)
      	at org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:133)
      	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite.withTempDir(OffsetSeqLogSuite.scala:26)
      	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3.apply$mcV$sp(OffsetSeqLogSuite.scala:75)
      	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3.apply(OffsetSeqLogSuite.scala:75)
      	at org.apache.spark.sql.execution.streaming.OffsetSeqLogSuite$$anonfun$3.apply(OffsetSeqLogSuite.scala:75)
      	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
      	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
      ```
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #17327 from lw-lin/good-msg-2.1.
      710b5554
  18. Mar 16, 2017
    • Xiao Li's avatar
      [SPARK-19765][SPARK-18549][SPARK-19093][SPARK-19736][BACKPORT-2.1][SQL]... · 4b977ff0
      Xiao Li authored
      [SPARK-19765][SPARK-18549][SPARK-19093][SPARK-19736][BACKPORT-2.1][SQL] Backport Three Cache-related PRs to Spark 2.1
      
      ### What changes were proposed in this pull request?
      
      Backport a few cache related PRs:
      
      ---
      [[SPARK-19093][SQL] Cached tables are not used in SubqueryExpression](https://github.com/apache/spark/pull/16493)
      
      Consider the plans inside subquery expressions while looking up cache manager to make
      use of cached data. Currently CacheManager.useCachedData does not consider the
      subquery expressions in the plan.
      
      ---
      [[SPARK-19736][SQL] refreshByPath should clear all cached plans with the specified path](https://github.com/apache/spark/pull/17064)
      
      Catalog.refreshByPath can refresh the cache entry and the associated metadata for all dataframes (if any), that contain the given data source path.
      
      However, CacheManager.invalidateCachedPath doesn't clear all cached plans with the specified path. It causes some strange behaviors reported in SPARK-15678.
      
      ---
      [[SPARK-19765][SPARK-18549][SQL] UNCACHE TABLE should un-cache all cached plans that refer to this table](https://github.com/apache/spark/pull/17097)
      
      When un-cache a table, we should not only remove the cache entry for this table, but also un-cache any other cached plans that refer to this table. The following commands trigger the table uncache: `DropTableCommand`, `TruncateTableCommand`, `AlterTableRenameCommand`, `UncacheTableCommand`, `RefreshTable` and `InsertIntoHiveTable`
      
      This PR also includes some refactors:
      - use java.util.LinkedList to store the cache entries, so that it's safer to remove elements while iterating
      - rename invalidateCache to recacheByPlan, which is more obvious about what it does.
      
      ### How was this patch tested?
      N/A
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17319 from gatorsmile/backport-17097.
      4b977ff0