Skip to content
Snippets Groups Projects
  1. Nov 22, 2016
  2. Nov 21, 2016
    • Liwei Lin's avatar
      [SPARK-18425][STRUCTURED STREAMING][TESTS] Test `CompactibleFileStreamLog` directly · ebeb0830
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      Right now we are testing the most of `CompactibleFileStreamLog` in `FileStreamSinkLogSuite` (because `FileStreamSinkLog` once was the only subclass of `CompactibleFileStreamLog`, but now it's not the case any more).
      
      Let's refactor the tests so that `CompactibleFileStreamLog` is directly tested, making future changes (like https://github.com/apache/spark/pull/15828, https://github.com/apache/spark/pull/15827) to `CompactibleFileStreamLog` much easier to test and much easier to review.
      
      ## How was this patch tested?
      
      the PR itself is about tests
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #15870 from lw-lin/test-compact-1113.
      ebeb0830
    • Burak Yavuz's avatar
      [SPARK-18493] Add missing python APIs: withWatermark and checkpoint to dataframe · 97a8239a
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      This PR adds two of the newly added methods of `Dataset`s to Python:
      `withWatermark` and `checkpoint`
      
      ## How was this patch tested?
      
      Doc tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #15921 from brkyvz/py-watermark.
      97a8239a
    • hyukjinkwon's avatar
      [SPARK-17765][SQL] Support for writing out user-defined type in ORC datasource · a2d46477
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR adds the support for `UserDefinedType` when writing out instead of throwing `ClassCastException` in ORC data source.
      
      In more details, `OrcStruct` is being created based on string from`DataType.catalogString`. For user-defined type, it seems it returns `sqlType.simpleString` for `catalogString` by default[1]. However, during type-dispatching to match the output with the schema, it tries to cast to, for example, `StructType`[2].
      
      So, running the codes below (`MyDenseVector` was borrowed[3]) :
      
      ``` scala
      val data = Seq((1, new UDT.MyDenseVector(Array(0.25, 2.25, 4.25))))
      val udtDF = data.toDF("id", "vectors")
      udtDF.write.orc("/tmp/test.orc")
      ```
      
      ends up throwing an exception as below:
      
      ```
      java.lang.ClassCastException: org.apache.spark.sql.UDT$MyDenseVectorUDT cannot be cast to org.apache.spark.sql.types.ArrayType
          at org.apache.spark.sql.hive.HiveInspectors$class.wrapperFor(HiveInspectors.scala:381)
          at org.apache.spark.sql.hive.orc.OrcSerializer.wrapperFor(OrcFileFormat.scala:164)
      ...
      ```
      
      So, this PR uses `UserDefinedType.sqlType` during finding the correct converter when writing out in ORC data source.
      
      [1]https://github.com/apache/spark/blob/dfdcab00c7b6200c22883baa3ebc5818be09556f/sql/catalyst/src/main/scala/org/apache/spark/sql/types/UserDefinedType.scala#L95
      [2]https://github.com/apache/spark/blob/d2dc8c4a162834818190ffd82894522c524ca3e5/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L326
      [3]https://github.com/apache/spark/blob/2bfed1a0c5be7d0718fd574a4dad90f4f6b44be7/sql/core/src/test/scala/org/apache/spark/sql/UserDefinedTypeSuite.scala#L38-L70
      ## How was this patch tested?
      
      Unit tests in `OrcQuerySuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15361 from HyukjinKwon/SPARK-17765.
      a2d46477
    • Dongjoon Hyun's avatar
      [SPARK-18517][SQL] DROP TABLE IF EXISTS should not warn for non-existing tables · ddd02f50
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, `DROP TABLE IF EXISTS` shows warning for non-existing tables. However, it had better be quiet for this case by definition of the command.
      
      **BEFORE**
      ```scala
      scala> sql("DROP TABLE IF EXISTS nonexist")
      16/11/20 20:48:26 WARN DropTableCommand: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'nonexist' not found in database 'default';
      ```
      
      **AFTER**
      ```scala
      scala> sql("DROP TABLE IF EXISTS nonexist")
      res0: org.apache.spark.sql.DataFrame = []
      ```
      
      ## How was this patch tested?
      
      Manual because this is related to the warning messages instead of exceptions.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15953 from dongjoon-hyun/SPARK-18517.
      ddd02f50
    • Gabriel Huang's avatar
      [SPARK-18361][PYSPARK] Expose RDD localCheckpoint in PySpark · 70176871
      Gabriel Huang authored
      ## What changes were proposed in this pull request?
      
      Expose RDD's localCheckpoint() and associated functions in PySpark.
      
      ## How was this patch tested?
      
      I added a UnitTest in python/pyspark/tests.py which passes.
      
      I certify that this is my original work, and I license it to the project under the project's open source license.
      
      Gabriel HUANG
      Developer at Cardabel (http://cardabel.com/)
      
      Author: Gabriel Huang <gabi.xiaohuang@gmail.com>
      
      Closes #15811 from gabrielhuang/pyspark-localcheckpoint.
      70176871
    • Dongjoon Hyun's avatar
      [SPARK-18413][SQL] Add `maxConnections` JDBCOption · 07beb5d2
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR adds a new JDBCOption `maxConnections` which means the maximum number of simultaneous JDBC connections allowed. This option applies only to writing with coalesce operation if needed. It defaults to the number of partitions of RDD. Previously, SQL users cannot cannot control this while Scala/Java/Python users can use `coalesce` (or `repartition`) API.
      
      **Reported Scenario**
      
      For the following cases, the number of connections becomes 200 and database cannot handle all of them.
      
      ```sql
      CREATE OR REPLACE TEMPORARY VIEW resultview
      USING org.apache.spark.sql.jdbc
      OPTIONS (
        url "jdbc:oracle:thin:10.129.10.111:1521:BKDB",
        dbtable "result",
        user "HIVE",
        password "HIVE"
      );
      -- set spark.sql.shuffle.partitions=200
      INSERT OVERWRITE TABLE resultview SELECT g, count(1) AS COUNT FROM tnet.DT_LIVE_INFO GROUP BY g
      ```
      
      ## How was this patch tested?
      
      Manual. Do the followings and see Spark UI.
      
      **Step 1 (MySQL)**
      ```
      CREATE TABLE t1 (a INT);
      CREATE TABLE data (a INT);
      INSERT INTO data VALUES (1);
      INSERT INTO data VALUES (2);
      INSERT INTO data VALUES (3);
      ```
      
      **Step 2 (Spark)**
      ```scala
      SPARK_HOME=$PWD bin/spark-shell --driver-memory 4G --driver-class-path mysql-connector-java-5.1.40-bin.jar
      scala> sql("SET spark.sql.shuffle.partitions=3")
      scala> sql("CREATE OR REPLACE TEMPORARY VIEW data USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 'data', user 'root', password '')")
      scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '1')")
      scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
      scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '2')")
      scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
      scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '3')")
      scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
      scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '4')")
      scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
      ```
      
      ![maxconnections](https://cloud.githubusercontent.com/assets/9700541/20287987/ed8409c2-aa84-11e6-8aab-ae28e63fe54d.png)
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15868 from dongjoon-hyun/SPARK-18413.
      Unverified
      07beb5d2
    • Takuya UESHIN's avatar
      [SPARK-18398][SQL] Fix nullabilities of MapObjects and ExternalMapToCatalyst. · 9f262ae1
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      The nullabilities of `MapObject` can be made more strict by relying on `inputObject.nullable` and `lambdaFunction.nullable`.
      
      Also `ExternalMapToCatalyst.dataType` can be made more strict by relying on `valueConverter.nullable`.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #15840 from ueshin/issues/SPARK-18398.
      9f262ae1
    • sethah's avatar
      [SPARK-18282][ML][PYSPARK] Add python clustering summaries for GMM and BKM · e811fbf9
      sethah authored
      ## What changes were proposed in this pull request?
      
      Add model summary APIs for `GaussianMixtureModel` and `BisectingKMeansModel` in pyspark.
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15777 from sethah/pyspark_cluster_summaries.
      e811fbf9
  3. Nov 20, 2016
    • Takuya UESHIN's avatar
      [SPARK-18467][SQL] Extracts method for preparing arguments from StaticInvoke,... · 65854797
      Takuya UESHIN authored
      [SPARK-18467][SQL] Extracts method for preparing arguments from StaticInvoke, Invoke and NewInstance and modify to short circuit if arguments have null when `needNullCheck == true`.
      
      ## What changes were proposed in this pull request?
      
      This pr extracts method for preparing arguments from `StaticInvoke`, `Invoke` and `NewInstance` and modify to short circuit if arguments have `null` when `propageteNull == true`.
      
      The steps are as follows:
      
      1. Introduce `InvokeLike` to extract common logic from `StaticInvoke`, `Invoke` and `NewInstance` to prepare arguments.
      `StaticInvoke` and `Invoke` had a risk to exceed 64kb JVM limit to prepare arguments but after this patch they can handle them because they share the preparing code of NewInstance, which handles the limit well.
      
      2. Remove unneeded null checking and fix nullability of `NewInstance`.
      Avoid some of nullabilty checking which are not needed because the expression is not nullable.
      
      3. Modify to short circuit if arguments have `null` when `needNullCheck == true`.
      If `needNullCheck == true`, preparing arguments can be skipped if we found one of them is `null`, so modified to short circuit in the case.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #15901 from ueshin/issues/SPARK-18467.
      65854797
    • Reynold Xin's avatar
      [HOTFIX][SQL] Fix DDLSuite failure. · b625a36e
      Reynold Xin authored
      b625a36e
    • Reynold Xin's avatar
      Fix Mesos build break for Scala 2.10. · 6659ae55
      Reynold Xin authored
      6659ae55
    • hyukjinkwon's avatar
      [SPARK-3359][BUILD][DOCS] Print examples and disable group and tparam tags in javadoc · c528812c
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes/fixes two things.
      
      - Remove many errors to generate javadoc with Java8 from unrecognisable tags, `tparam` and `group`.
      
        ```
        [error] .../spark/mllib/target/java/org/apache/spark/ml/classification/Classifier.java:18: error: unknown tag: group
        [error]   /** group setParam */
        [error]       ^
        [error] .../spark/mllib/target/java/org/apache/spark/ml/classification/Classifier.java:8: error: unknown tag: tparam
        [error]  * tparam FeaturesType  Type of input features.  E.g., <code>Vector</code>
        [error]    ^
        ...
        ```
      
        It does not fully resolve the problem but remove many errors. It seems both `group` and `tparam` are unrecognisable in javadoc. It seems we can't print them pretty in javadoc in a way of `example` here because they appear differently (both examples can be found in http://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.ml.classification.Classifier).
      
      - Print `example` in javadoc.
        Currently, there are few `example` tag in several places.
      
        ```
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This operation might be used to evaluate a graph
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example We might use this operation to change the vertex values
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This function might be used to initialize edge
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This function might be used to initialize edge
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This function might be used to initialize edge
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example We can use this function to compute the in-degree of each
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This function is used to update the vertices with new values based on external data.
        ./graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala:   * example Loads a file in the following format:
        ./graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala:   * example This function is used to update the vertices with new
        ./graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala:   * example This function can be used to filter the graph based on some property, without
        ./graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala: * example We can use the Pregel abstraction to implement PageRank:
        ./graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala: * example Construct a `VertexRDD` from a plain RDD:
        ./repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkCommandLine.scala: * example new SparkCommandLine(Nil).settings
        ./repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkIMain.scala:   * example addImports("org.apache.spark.SparkContext")
        ./sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/LiteralGenerator.scala: * example {{{
        ```
      
      **Before**
      
        <img width="505" alt="2016-11-20 2 43 23" src="https://cloud.githubusercontent.com/assets/6477701/20457285/26f07e1c-aecb-11e6-9ae9-d9dee66845f4.png">
      
      **After**
        <img width="499" alt="2016-11-20 1 27 17" src="https://cloud.githubusercontent.com/assets/6477701/20457240/409124e4-aeca-11e6-9a91-0ba514148b52.png">
      
      ## How was this patch tested?
      
      Maunally tested by `jekyll build` with Java 7 and 8
      
      ```
      java version "1.7.0_80"
      Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
      Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
      ```
      
      ```
      java version "1.8.0_45"
      Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
      Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
      ```
      
      Note: this does not make sbt unidoc suceed with Java 8 yet but it reduces the number of errors with Java 8.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15939 from HyukjinKwon/SPARK-3359-javadoc.
      Unverified
      c528812c
    • Herman van Hovell's avatar
      [SPARK-15214][SQL] Code-generation for Generate · 7ca7a635
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      
      This PR adds code generation to `Generate`. It supports two code paths:
      - General `TraversableOnce` based iteration. This used for regular `Generator` (code generation supporting) expressions. This code path expects the expression to return a `TraversableOnce[InternalRow]` and it will iterate over the returned collection. This PR adds code generation for the `stack` generator.
      - Specialized `ArrayData/MapData` based iteration. This is used for the `explode`, `posexplode` & `inline` functions and operates directly on the `ArrayData`/`MapData` result that the child of the generator returns.
      
      ### Benchmarks
      I have added some benchmarks and it seems we can create a nice speedup for explode:
      #### Environment
      ```
      Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
      Intel(R) Core(TM) i7-4980HQ CPU  2.80GHz
      ```
      #### Explode Array
      ##### Before
      ```
      generate explode array:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      generate explode array wholestage off         7377 / 7607          2.3         439.7       1.0X
      generate explode array wholestage on          6055 / 6086          2.8         360.9       1.2X
      ```
      ##### After
      ```
      generate explode array:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      generate explode array wholestage off         7432 / 7696          2.3         443.0       1.0X
      generate explode array wholestage on           631 /  646         26.6          37.6      11.8X
      ```
      #### Explode Map
      ##### Before
      ```
      generate explode map:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      generate explode map wholestage off         12792 / 12848          1.3         762.5       1.0X
      generate explode map wholestage on          11181 / 11237          1.5         666.5       1.1X
      ```
      ##### After
      ```
      generate explode map:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      generate explode map wholestage off         10949 / 10972          1.5         652.6       1.0X
      generate explode map wholestage on             870 /  913         19.3          51.9      12.6X
      ```
      #### Posexplode
      ##### Before
      ```
      generate posexplode array:               Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      generate posexplode array wholestage off      7547 / 7580          2.2         449.8       1.0X
      generate posexplode array wholestage on       5786 / 5838          2.9         344.9       1.3X
      ```
      ##### After
      ```
      generate posexplode array:               Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      generate posexplode array wholestage off      7535 / 7548          2.2         449.1       1.0X
      generate posexplode array wholestage on        620 /  624         27.1          37.0      12.1X
      ```
      #### Inline
      ##### Before
      ```
      generate inline array:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      generate inline array wholestage off          6935 / 6978          2.4         413.3       1.0X
      generate inline array wholestage on           6360 / 6400          2.6         379.1       1.1X
      ```
      ##### After
      ```
      generate inline array:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      generate inline array wholestage off          6940 / 6966          2.4         413.6       1.0X
      generate inline array wholestage on           1002 / 1012         16.7          59.7       6.9X
      ```
      #### Stack
      ##### Before
      ```
      generate stack:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      generate stack wholestage off               12980 / 13104          1.3         773.7       1.0X
      generate stack wholestage on                11566 / 11580          1.5         689.4       1.1X
      ```
      ##### After
      ```
      generate stack:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      generate stack wholestage off               12875 / 12949          1.3         767.4       1.0X
      generate stack wholestage on                   840 /  845         20.0          50.0      15.3X
      ```
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #13065 from hvanhovell/SPARK-15214.
      7ca7a635
  4. Nov 19, 2016
    • Reynold Xin's avatar
      a64f25d8
    • Reynold Xin's avatar
      [SPARK-18508][SQL] Fix documentation error for DateDiff · bce9a036
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      The previous documentation and example for DateDiff was wrong.
      
      ## How was this patch tested?
      Doc only change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15937 from rxin/datediff-doc.
      bce9a036
    • Kazuaki Ishizaki's avatar
      [SPARK-18458][CORE] Fix signed integer overflow problem at an expression in RadixSort.java · d93b6552
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR avoids that a result of an expression is negative due to signed integer overflow (e.g. 0x10?????? * 8 < 0). This PR casts each operand to `long` before executing a calculation. Since the result is interpreted as long, the result of the expression is positive.
      
      ## How was this patch tested?
      
      Manually executed query82 of TPC-DS with 100TB
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #15907 from kiszk/SPARK-18458.
      d93b6552
    • sethah's avatar
      [SPARK-18456][ML][FOLLOWUP] Use matrix abstraction for coefficients in LogisticRegression training · 856e0042
      sethah authored
      ## What changes were proposed in this pull request?
      
      This is a follow up to some of the discussion [here](https://github.com/apache/spark/pull/15593). During LogisticRegression training, we store the coefficients combined with intercepts as a flat vector, but a more natural abstraction is a matrix. Here, we refactor the code to use matrix where possible, which makes the code more readable and greatly simplifies the indexing.
      
      Note: We do not use a Breeze matrix for the cost function as was mentioned in the linked PR. This is because LBFGS/OWLQN require an implicit `MutableInnerProductModule[DenseMatrix[Double], Double]` which is not natively defined in Breeze. We would need to extend Breeze in Spark to define it ourselves. Also, we do not modify the `regParamL1Fun` because OWLQN in Breeze requires a `MutableEnumeratedCoordinateField[(Int, Int), DenseVector[Double]]` (since we still use a dense vector for coefficients). Here again we would have to extend Breeze inside Spark.
      
      ## How was this patch tested?
      
      This is internal code refactoring - the current unit tests passing show us that the change did not break anything. No added functionality in this patch.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15893 from sethah/logreg_refactor.
      Unverified
      856e0042
    • Stavros Kontopoulos's avatar
      [SPARK-17062][MESOS] add conf option to mesos dispatcher · ea77c81e
      Stavros Kontopoulos authored
      Adds --conf option to set spark configuration properties in mesos dispacther.
      Properties provided with --conf take precedence over properties within the properties file.
      The reason for this PR is that for simple configuration or testing purposes we need to provide a property file (ideally a shared one for a cluster) even if we just provide a single property.
      
      Manually tested.
      
      Author: Stavros Kontopoulos <st.kontopoulos@gmail.com>
      Author: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com>
      
      Closes #14650 from skonto/dipatcher_conf.
      ea77c81e
    • Sean Owen's avatar
      [SPARK-18448][CORE] Fix @since 2.1.0 on new SparkSession.close() method · ded5fefb
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fix since 2.1.0 on new SparkSession.close() method. I goofed in https://github.com/apache/spark/pull/15932 because it was back-ported to 2.1 instead of just master as originally planned.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15938 from srowen/SPARK-18448.2.
      Unverified
      ded5fefb
    • Sean Owen's avatar
      [SPARK-18353][CORE] spark.rpc.askTimeout defalut value is not 120s · 8b1e1088
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Avoid hard-coding spark.rpc.askTimeout to non-default in Client; fix doc about spark.rpc.askTimeout default
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15833 from srowen/SPARK-18353.
      Unverified
      8b1e1088
    • hyukjinkwon's avatar
      [SPARK-18445][BUILD][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note... · d5b1d5fc
      hyukjinkwon authored
      [SPARK-18445][BUILD][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that`/`'''Note:'''` across Scala/Java API documentation
      
      ## What changes were proposed in this pull request?
      
      It seems in Scala/Java,
      
      - `Note:`
      - `NOTE:`
      - `Note that`
      - `'''Note:'''`
      - `note`
      
      This PR proposes to fix those to `note` to be consistent.
      
      **Before**
      
      - Scala
        ![2016-11-17 6 16 39](https://cloud.githubusercontent.com/assets/6477701/20383180/1a7aed8c-acf2-11e6-9611-5eaf6d52c2e0.png)
      
      - Java
        ![2016-11-17 6 14 41](https://cloud.githubusercontent.com/assets/6477701/20383096/c8ffc680-acf1-11e6-914a-33460bf1401d.png)
      
      **After**
      
      - Scala
        ![2016-11-17 6 16 44](https://cloud.githubusercontent.com/assets/6477701/20383167/09940490-acf2-11e6-937a-0d5e1dc2cadf.png)
      
      - Java
        ![2016-11-17 6 13 39](https://cloud.githubusercontent.com/assets/6477701/20383132/e7c2a57e-acf1-11e6-9c47-b849674d4d88.png)
      
      ## How was this patch tested?
      
      The notes were found via
      
      ```bash
      grep -r "NOTE: " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// NOTE: " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \ # note that this is a regular expression. So actual matches were mostly `org/apache/spark/api/java/functions ...`
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "Note that " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// Note that " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "Note: " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// Note: " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "'''Note:'''" . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// '''Note:''' " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      And then fixed one by one comparing with API documentation/access modifiers.
      
      After that, manually tested via `jekyll build`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15889 from HyukjinKwon/SPARK-18437.
      Unverified
      d5b1d5fc
    • Sean Owen's avatar
      [SPARK-18448][CORE] SparkSession should implement java.lang.AutoCloseable like JavaSparkContext · db9fb9ba
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Just adds `close()` + `Closeable` as a synonym for `stop()`. This makes it usable in Java in try-with-resources, as suggested by ash211  (`Closeable` extends `AutoCloseable` BTW)
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15932 from srowen/SPARK-18448.
      Unverified
      db9fb9ba
  5. Nov 18, 2016
    • Shixiong Zhu's avatar
      [SPARK-18497][SS] Make ForeachSink support watermark · 2a40de40
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      The issue in ForeachSink is the new created DataSet still uses the old QueryExecution. When `foreachPartition` is called, `QueryExecution.toString` will be called and then fail because it doesn't know how to plan EventTimeWatermark.
      
      This PR just replaces the QueryExecution with IncrementalExecution to fix the issue.
      
      ## How was this patch tested?
      
      `test("foreach with watermark")`.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15934 from zsxwing/SPARK-18497.
      2a40de40
    • Reynold Xin's avatar
      [SPARK-18505][SQL] Simplify AnalyzeColumnCommand · 6f7ff750
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      I'm spending more time at the design & code level for cost-based optimizer now, and have found a number of issues related to maintainability and compatibility that I will like to address.
      
      This is a small pull request to clean up AnalyzeColumnCommand:
      
      1. Removed warning on duplicated columns. Warnings in log messages are useless since most users that run SQL don't see them.
      2. Removed the nested updateStats function, by just inlining the function.
      3. Renamed a few functions to better reflect what they do.
      4. Removed the factory apply method for ColumnStatStruct. It is a bad pattern to use a apply method that returns an instantiation of a class that is not of the same type (ColumnStatStruct.apply used to return CreateNamedStruct).
      5. Renamed ColumnStatStruct to just AnalyzeColumnCommand.
      6. Added more documentation explaining some of the non-obvious return types and code blocks.
      
      In follow-up pull requests, I'd like to address the following:
      
      1. Get rid of the Map[String, ColumnStat] map, since internally we should be using Attribute to reference columns, rather than strings.
      2. Decouple the fields exposed by ColumnStat and internals of Spark SQL's execution path. Currently the two are coupled because ColumnStat takes in an InternalRow.
      3. Correctness: Remove code path that stores statistics in the catalog using the base64 encoding of the UnsafeRow format, which is not stable across Spark versions.
      4. Clearly document the data representation stored in the catalog for statistics.
      
      ## How was this patch tested?
      Affected test cases have been updated.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15933 from rxin/SPARK-18505.
      6f7ff750
    • Shixiong Zhu's avatar
      [SPARK-18477][SS] Enable interrupts for HDFS in HDFSMetadataLog · e5f5c29e
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      HDFS `write` may just hang until timeout if some network error happens. It's better to enable interrupts to allow stopping the query fast on HDFS.
      
      This PR just changes the logic to only disable interrupts for local file system, as HADOOP-10622 only happens for local file system.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15911 from zsxwing/interrupt-on-dfs.
      e5f5c29e
    • hyukjinkwon's avatar
      [SPARK-18422][CORE] Fix wholeTextFiles test to pass on Windows in JavaAPISuite · 40d59ff5
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the test `wholeTextFiles` in `JavaAPISuite.java`. This is failed due to the different path format on Windows.
      
      For example, the path in `container` was
      
      ```
      C:\projects\spark\target\tmp\1478967560189-0/part-00000
      ```
      
      whereas `new URI(res._1()).getPath()` was as below:
      
      ```
      /C:/projects/spark/target/tmp/1478967560189-0/part-00000
      ```
      
      ## How was this patch tested?
      
      Tests in `JavaAPISuite.java`.
      
      Tested via AppVeyor.
      
      **Before**
      Build: https://ci.appveyor.com/project/spark-test/spark/build/63-JavaAPISuite-1
      Diff: https://github.com/apache/spark/compare/master...spark-test:JavaAPISuite-1
      
      ```
      [info] Test org.apache.spark.JavaAPISuite.wholeTextFiles started
      [error] Test org.apache.spark.JavaAPISuite.wholeTextFiles failed: java.lang.AssertionError: expected:<spark is easy to use.
      [error] > but was:<null>, took 0.578 sec
      [error]     at org.apache.spark.JavaAPISuite.wholeTextFiles(JavaAPISuite.java:1089)
      ...
      ```
      
      **After**
      Build started: [CORE] `org.apache.spark.JavaAPISuite` [![PR-15866](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=198DDA52-F201-4D2B-BE2F-244E0C1725B2&svg=true)](https://ci.appveyor.com/project/spark-test/spark/branch/198DDA52-F201-4D2B-BE2F-244E0C1725B2)
      Diff: https://github.com/apache/spark/compare/master...spark-test:198DDA52-F201-4D2B-BE2F-244E0C1725B2
      
      ```
      [info] Test org.apache.spark.JavaAPISuite.wholeTextFiles started
      ...
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15866 from HyukjinKwon/SPARK-18422.
      Unverified
      40d59ff5
    • Andrew Ray's avatar
      [SPARK-18457][SQL] ORC and other columnar formats using HiveShim read all... · 795e9fc9
      Andrew Ray authored
      [SPARK-18457][SQL] ORC and other columnar formats using HiveShim read all columns when doing a simple count
      
      ## What changes were proposed in this pull request?
      
      When reading zero columns (e.g., count(*)) from ORC or any other format that uses HiveShim, actually set the read column list to empty for Hive to use.
      
      ## How was this patch tested?
      
      Query correctness is handled by existing unit tests. I'm happy to add more if anyone can point out some case that is not covered.
      
      Reduction in data read can be verified in the UI when built with a recent version of Hadoop say:
      ```
      build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0 -Phive -DskipTests clean package
      ```
      However the default Hadoop 2.2 that is used for unit tests does not report actual bytes read and instead just full file sizes (see FileScanRDD.scala line 80). Therefore I don't think there is a good way to add a unit test for this.
      
      I tested with the following setup using above build options
      ```
      case class OrcData(intField: Long, stringField: String)
      spark.range(1,1000000).map(i => OrcData(i, s"part-$i")).toDF().write.format("orc").save("orc_test")
      
      sql(
            s"""CREATE EXTERNAL TABLE orc_test(
               |  intField LONG,
               |  stringField STRING
               |)
               |STORED AS ORC
               |LOCATION '${System.getProperty("user.dir") + "/orc_test"}'
             """.stripMargin)
      ```
      
      ## Results
      
      query | Spark 2.0.2 | this PR
      ---|---|---
      `sql("select count(*) from orc_test").collect`|4.4 MB|199.4 KB
      `sql("select intField from orc_test").collect`|743.4 KB|743.4 KB
      `sql("select * from orc_test").collect`|4.4 MB|4.4 MB
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #15898 from aray/sql-orc-no-col.
      795e9fc9
    • Tyson Condie's avatar
      [SPARK-18187][SQL] CompactibleFileStreamLog should not use "compactInterval"... · 51baca22
      Tyson Condie authored
      [SPARK-18187][SQL] CompactibleFileStreamLog should not use "compactInterval" direcly with user setting.
      
      ## What changes were proposed in this pull request?
      CompactibleFileStreamLog relys on "compactInterval" to detect a compaction batch. If the "compactInterval" is reset by user, CompactibleFileStreamLog will return wrong answer, resulting data loss. This PR procides a way to check the validity of 'compactInterval', and calculate an appropriate value.
      
      ## How was this patch tested?
      When restart a stream, we change the 'spark.sql.streaming.fileSource.log.compactInterval' different with the former one.
      
      The primary solution to this issue was given by uncleGen
      Added extensions include an additional metadata field in OffsetSeq and CompactibleFileStreamLog APIs. zsxwing
      
      Author: Tyson Condie <tcondie@gmail.com>
      Author: genmao.ygm <genmao.ygm@genmaoygmdeMacBook-Air.local>
      
      Closes #15852 from tcondie/spark-18187.
      51baca22
  6. Nov 17, 2016
    • Josh Rosen's avatar
      [SPARK-18462] Fix ClassCastException in SparkListenerDriverAccumUpdates event · d9dd979d
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      This patch fixes a `ClassCastException: java.lang.Integer cannot be cast to java.lang.Long` error which could occur in the HistoryServer while trying to process a deserialized `SparkListenerDriverAccumUpdates` event.
      
      The problem stems from how `jackson-module-scala` handles primitive type parameters (see https://github.com/FasterXML/jackson-module-scala/wiki/FAQ#deserializing-optionint-and-other-primitive-challenges for more details). This was causing a problem where our code expected a field to be deserialized as a `(Long, Long)` tuple but we got an `(Int, Int)` tuple instead.
      
      This patch hacks around this issue by registering a custom `Converter` with Jackson in order to deserialize the tuples as `(Object, Object)` and perform the appropriate casting.
      
      ## How was this patch tested?
      
      New regression tests in `SQLListenerSuite`.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #15922 from JoshRosen/SPARK-18462.
      d9dd979d
    • Wenchen Fan's avatar
      [SPARK-18360][SQL] default table path of tables in default database should... · ce13c267
      Wenchen Fan authored
      [SPARK-18360][SQL] default table path of tables in default database should depend on the location of default database
      
      ## What changes were proposed in this pull request?
      
      The current semantic of the warehouse config:
      
      1. it's a static config, which means you can't change it once your spark application is launched.
      2. Once a database is created, its location won't change even the warehouse path config is changed.
      3. default database is a special case, although its location is fixed, but the locations of tables created in it are not. If a Spark app starts with warehouse path B(while the location of default database is A), then users create a table `tbl` in default database, its location will be `B/tbl` instead of `A/tbl`. If uses change the warehouse path config to C, and create another table `tbl2`, its location will still be `B/tbl2` instead of `C/tbl2`.
      
      rule 3 doesn't make sense and I think we made it by mistake, not intentionally. Data source tables don't follow rule 3 and treat default database like normal ones.
      
      This PR fixes hive serde tables to make it consistent with data source tables.
      
      ## How was this patch tested?
      
      HiveSparkSubmitSuite
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15812 from cloud-fan/default-db.
      ce13c267
    • root's avatar
      [SPARK-18490][SQL] duplication nodename extrainfo for ShuffleExchange · b0aa1aa1
      root authored
      ## What changes were proposed in this pull request?
      
         In ShuffleExchange, the nodename's extraInfo are the same when exchangeCoordinator.isEstimated
       is true or false.
      
      Merge the two situation in the PR.
      
      Author: root <root@iZbp1gsnrlfzjxh82cz80vZ.(none)>
      
      Closes #15920 from windpiger/DupNodeNameShuffleExchange.
      Unverified
      b0aa1aa1
    • Zheng RuiFeng's avatar
      [SPARK-18480][DOCS] Fix wrong links for ML guide docs · cdaf4ce9
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      1, There are two `[Graph.partitionBy]` in `graphx-programming-guide.md`, the first one had no effert.
      2, `DataFrame`, `Transformer`, `Pipeline` and `Parameter`  in `ml-pipeline.md` were linked to `ml-guide.html` by mistake.
      3, `PythonMLLibAPI` in `mllib-linear-methods.md` was not accessable, because class `PythonMLLibAPI` is private.
      4, Other link updates.
      ## How was this patch tested?
       manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #15912 from zhengruifeng/md_fix.
      Unverified
      cdaf4ce9
    • VinceShieh's avatar
      [SPARK-17462][MLLIB]use VersionUtils to parse Spark version strings · de77c677
      VinceShieh authored
      ## What changes were proposed in this pull request?
      
      Several places in MLlib use custom regexes or other approaches to parse Spark versions.
      Those should be fixed to use the VersionUtils. This PR replaces custom regexes with
      VersionUtils to get Spark version numbers.
      ## How was this patch tested?
      
      Existing tests.
      
      Signed-off-by: VinceShieh vincent.xieintel.com
      
      Author: VinceShieh <vincent.xie@intel.com>
      
      Closes #15055 from VinceShieh/SPARK-17462.
      Unverified
      de77c677
Loading