  1. Dec 01, 2016
    • Reynold Xin's avatar
      [SPARK-18639] Build only a single pip package · 37e52f87
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      We currently build 5 separate pip binary tarballs, doubling the release script runtime. It'd be better to build just one, especially for use cases that only run Spark locally. In the long run, it would make more sense to have Hadoop support be pluggable.
      
      ## How was this patch tested?
      N/A - this is a release build script that doesn't have any automated test coverage. We will know if it goes wrong when we prepare releases.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #16072 from rxin/SPARK-18639.
      37e52f87
    • Shixiong Zhu's avatar
      [SPARK-18617][SPARK-18560][TESTS] Fix flaky test: StreamingContextSuite.... · 086b0c8f
      Shixiong Zhu authored
      [SPARK-18617][SPARK-18560][TESTS] Fix flaky test: StreamingContextSuite. Receiver data should be deserialized properly
      
      ## What changes were proposed in this pull request?
      
      Avoid creating multiple threads to stop StreamingContext. Otherwise, the latch added in #16091 can be passed too early.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16105 from zsxwing/SPARK-18617-2.
      086b0c8f
    • Sandeep Singh's avatar
      [SPARK-18274][ML][PYSPARK] Memory leak in PySpark JavaWrapper · 78bb7f80
      Sandeep Singh authored
      ## What changes were proposed in this pull request?
      In `JavaWrapper`'s destructor, make the Java Gateway dereference the wrapped object, using `SparkContext._active_spark_context._gateway.detach`.
      Also fix the parameter-copying bug by moving the `copy` method from `JavaModel` to `JavaParams`.
      
      ## How was this patch tested?
      ```python
      import random, string
      from pyspark.ml.feature import StringIndexer
      
      l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))]  # 700000 random strings of 10 characters
      df = spark.createDataFrame(l, ['string'])
      
      for i in range(50):
          indexer = StringIndexer(inputCol='string', outputCol='index')
          indexer.fit(df)
      ```
      * Before: a strong reference to each StringIndexer was kept, causing GC issues, and the run halted midway.
      * After: garbage collection works because the object is dereferenced, and the computation completes.
      * Memory footprint was tested using a profiler.
      * Added a parameter-copy related test which was failing before.
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      Author: jkbradley <joseph.kurata.bradley@gmail.com>
      
      Closes #15843 from techaddict/SPARK-18274.
      78bb7f80
    • Wenchen Fan's avatar
      [SPARK-18674][SQL] improve the error message of using join · e6534847
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      The current error message for a USING join is quite confusing. For example:
      ```
      scala> val df1 = List(1,2,3).toDS.withColumnRenamed("value", "c1")
      df1: org.apache.spark.sql.DataFrame = [c1: int]
      
      scala> val df2 = List(1,2,3).toDS.withColumnRenamed("value", "c2")
      df2: org.apache.spark.sql.DataFrame = [c2: int]
      
      scala> df1.join(df2, usingColumn = "c1")
      org.apache.spark.sql.AnalysisException: using columns ['c1] can not be resolved given input columns: [c1, c2] ;;
      'Join UsingJoin(Inner,List('c1))
      :- Project [value#1 AS c1#3]
      :  +- LocalRelation [value#1]
      +- Project [value#7 AS c2#9]
         +- LocalRelation [value#7]
      ```
      
      After this PR, it becomes:
      ```
      scala> val df1 = List(1,2,3).toDS.withColumnRenamed("value", "c1")
      df1: org.apache.spark.sql.DataFrame = [c1: int]
      
      scala> val df2 = List(1,2,3).toDS.withColumnRenamed("value", "c2")
      df2: org.apache.spark.sql.DataFrame = [c2: int]
      
      scala> df1.join(df2, usingColumn = "c1")
      org.apache.spark.sql.AnalysisException: USING column `c1` can not be resolved with the right join side, the right output is: [c2];
      ```
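
      For contrast, here is a minimal sketch of a USING join that does resolve, assuming a Spark shell session with implicits in scope (the column values are illustrative):

      ```scala
      // Both sides expose a column named "c1", so the USING join resolves.
      val left = List(1, 2, 3).toDS.withColumnRenamed("value", "c1")
      val right = List(2, 3, 4).toDS.withColumnRenamed("value", "c1")
      left.join(right, usingColumn = "c1").show()
      ```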
      
      ## How was this patch tested?
      
      updated tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16100 from cloud-fan/natural.
      e6534847
    • Yuming Wang's avatar
      [SPARK-18645][DEPLOY] Fix spark-daemon.sh arguments error lead to throws Unrecognized option · 2ab8551e
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      
      spark-daemon.sh loses the single quotes around arguments after #15338, as follows:
      ```
      execute_command nice -n 0 bash /opt/cloudera/parcels/SPARK-2.1.0-cdh5.4.3.d20161129-21.04.38/lib/spark/bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --name Thrift JDBC/ODBC Server --conf spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:-HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp
      ```
      With this fix, it becomes:
      ```
      execute_command nice -n 0 bash /opt/cloudera/parcels/SPARK-2.1.0-cdh5.4.3.d20161129-21.04.38/lib/spark/bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --name 'Thrift JDBC/ODBC Server' --conf 'spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:-HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp'
      ```
      
      ## How was this patch tested?
      
      - Manual tests
      - Build the package and start-thriftserver.sh with `--conf 'spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:-HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp'`
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #16079 from wangyum/SPARK-18645.
      2ab8551e
    • Liang-Chi Hsieh's avatar
      [SPARK-18666][WEB UI] Remove the codes checking deprecated config spark.sql.unsafe.enabled · dbf842b7
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      `spark.sql.unsafe.enabled` has been deprecated since 1.6, but there is still code in the UI that checks it. We should remove the check and clean up the code.
      
      ## How was this patch tested?
      
      Changed the related existing unit test.
      
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16095 from viirya/remove-deprecated-config-code.
      dbf842b7
    • Eric Liang's avatar
      [SPARK-18635][SQL] Partition name/values not escaped correctly in some cases · 88f559f2
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      Due to confusion between URIs and paths, in certain cases we escape partition values too many times, which causes some Hive client operations to fail or write data to the wrong location. This PR fixes at least some of these cases.
      
      To my understanding, this is how values, filesystem paths, and URIs interact:
      - Hive stores raw (unescaped) partition values that are returned to you directly when you call listPartitions.
      - Internally, we convert these raw values to filesystem paths via `ExternalCatalogUtils.[un]escapePathName`.
      - In some circumstances we store URIs instead of filesystem paths. When a path is converted to a URI via `path.toURI`, the escaped partition values are further URI-encoded. This means that to get a path back from a URI, you must call `new Path(new URI(uriTxt))` in order to decode the URI-encoded string.
      - In `CatalogStorageFormat` we store URIs as strings. This makes it easy to forget to URI-decode the value before converting it into a path.
      - Finally, the Hive client itself uses mostly Paths for representing locations, and only URIs occasionally.
      
      In the future we should probably clean this up, perhaps by dropping use of URIs when unnecessary. We should also try fixing escaping for partition names as well as values, though names are unlikely to contain special characters.
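
      A minimal sketch of the double-escaping pitfall described above (paths and partition values are illustrative):

      ```scala
      import java.net.URI
      import org.apache.hadoop.fs.Path

      // A partition value containing a space is first escaped for the filesystem path.
      val escapedDir = "p=a%20b"                         // escapePathName("a b"), illustrative
      val path = new Path("/warehouse/tbl/" + escapedDir)

      // Converting the path to a URI encodes the '%' again ("%20" becomes "%2520").
      val uriText = path.toUri.toString

      // To get the original path back, the URI must be decoded explicitly.
      val decoded = new Path(new URI(uriText))           // "/warehouse/tbl/p=a%20b" again
      ```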
      
      cc mallman cloud-fan yhuai
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #16071 from ericl/spark-18635.
      88f559f2
    • gatorsmile's avatar
      [SPARK-18538][SQL] Fix Concurrent Table Fetching Using DataFrameReader JDBC APIs · b28fe4a4
      gatorsmile authored
      ### What changes were proposed in this pull request?
      The following two `DataFrameReader` JDBC APIs ignore the user-specified parallelism parameters.
      
      ```Scala
        def jdbc(
            url: String,
            table: String,
            columnName: String,
            lowerBound: Long,
            upperBound: Long,
            numPartitions: Int,
            connectionProperties: Properties): DataFrame
      ```
      
      ```Scala
        def jdbc(
            url: String,
            table: String,
            predicates: Array[String],
            connectionProperties: Properties): DataFrame
      ```
      
      This PR fixes these issues. To verify the behavior, we improve the plan output of the `EXPLAIN` command by adding `numPartitions` to the `JDBCRelation` node.
      
      Before the fix,
      ```
      == Physical Plan ==
      *Scan JDBCRelation(TEST.PEOPLE) [NAME#1896,THEID#1897] ReadSchema: struct<NAME:string,THEID:int>
      ```
      
      After the fix,
      ```
      == Physical Plan ==
      *Scan JDBCRelation(TEST.PEOPLE) [numPartitions=3] [NAME#1896,THEID#1897] ReadSchema: struct<NAME:string,THEID:int>
      ```
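
      A hedged usage sketch of the column-partitioned variant (the JDBC URL, credentials, and bounds are hypothetical), which after this fix produces the `numPartitions=3` shown above:

      ```scala
      import java.util.Properties

      val props = new Properties()
      props.setProperty("user", "test")          // hypothetical credentials
      props.setProperty("password", "test")

      // Reads TEST.PEOPLE in 3 concurrent partitions, splitting THEID over [1, 4).
      val df = spark.read.jdbc(
        url = "jdbc:h2:mem:testdb",              // hypothetical JDBC URL
        table = "TEST.PEOPLE",
        columnName = "THEID",
        lowerBound = 1L,
        upperBound = 4L,
        numPartitions = 3,
        connectionProperties = props)

      df.explain()                               // JDBCRelation(TEST.PEOPLE) [numPartitions=3] ...
      ```
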
      ### How was this patch tested?
      Added the verification logics on all the test cases for JDBC concurrent fetching.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15975 from gatorsmile/jdbc.
      b28fe4a4
  2. Nov 30, 2016
    • wm624@hotmail.com's avatar
      [SPARK-18476][SPARKR][ML] SparkR Logistic Regression should support output original label. · 2eb6764f
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      Similar to SPARK-18401, as a classification algorithm, logistic regression should support outputting the original label instead of the index label.
      
      In this PR, original label output is supported, test cases are modified and added, and the documentation is updated.
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #15910 from wangmiao1981/audit.
      2eb6764f
    • Shixiong Zhu's avatar
      [SPARK-18617][SPARK-18560][TEST] Fix flaky test: StreamingContextSuite.... · 0a811210
      Shixiong Zhu authored
      [SPARK-18617][SPARK-18560][TEST] Fix flaky test: StreamingContextSuite. Receiver data should be deserialized properly
      
      ## What changes were proposed in this pull request?
      
      Fixed the potential SparkContext leak in `StreamingContextSuite.SPARK-18560 Receiver data should be deserialized properly` which was added in #16052. I also removed FakeByteArrayReceiver and used TestReceiver directly.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16091 from zsxwing/SPARK-18617-follow-up.
      0a811210
    • Shixiong Zhu's avatar
      [SPARK-18655][SS] Ignore Structured Streaming 2.0.2 logs in history server · c4979f6e
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      As `queryStatus` in StreamingQueryListener events was removed in #15954, parsing 2.0.2 Structured Streaming logs will throw the following error:
      
      ```
      [info]   com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "queryStatus" (class org.apache.spark.sql.streaming.StreamingQueryListener$QueryTerminatedEvent), not marked as ignorable (2 known properties: "id", "exception"])
      [info]  at [Source: {"Event":"org.apache.spark.sql.streaming.StreamingQueryListener$QueryTerminatedEvent","queryStatus":{"name":"query-1","id":1,"timestamp":1480491532753,"inputRate":0.0,"processingRate":0.0,"latency":null,"sourceStatuses":[{"description":"FileStreamSource[file:/Users/zsx/stream]","offsetDesc":"#0","inputRate":0.0,"processingRate":0.0,"triggerDetails":{"latency.getOffset.source":"1","triggerId":"1"}}],"sinkStatus":{"description":"FileSink[/Users/zsx/stream2]","offsetDesc":"[#0]"},"triggerDetails":{}},"exception":null}; line: 1, column: 521] (through reference chain: org.apache.spark.sql.streaming.QueryTerminatedEvent["queryStatus"])
      [info]   at com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:51)
      [info]   at com.fasterxml.jackson.databind.DeserializationContext.reportUnknownProperty(DeserializationContext.java:839)
      [info]   at com.fasterxml.jackson.databind.deser.std.StdDeserializer.handleUnknownProperty(StdDeserializer.java:1045)
      [info]   at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperty(BeanDeserializerBase.java:1352)
      [info]   at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperties(BeanDeserializerBase.java:1306)
      [info]   at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:453)
      [info]   at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1099)
      ...
      ```
      
      This PR just ignores such errors and adds a test to make sure we can read 2.0.2 logs.
      
      ## How was this patch tested?
      
      `query-event-logs-version-2.0.2.txt` has all types of events generated by Structured Streaming in Spark 2.0.2. `testQuietly("ReplayListenerBus should ignore broken event jsons generated in 2.0.2")` verified we can load them without any error.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16085 from zsxwing/SPARK-18655.
      c4979f6e
    • Marcelo Vanzin's avatar
      [SPARK-18546][CORE] Fix merging shuffle spills when using encryption. · 93e9d880
      Marcelo Vanzin authored
      The problem exists because it's not possible to just concatenate encrypted
      partition data from different spill files; currently each partition would
      have its own initial vector to set up encryption, and the final merged file
      should contain a single initial vector for each merged partition, otherwise
      iterating over each record becomes really hard.
      
      To fix that, UnsafeShuffleWriter now decrypts the partitions when merging,
      so that the merged file contains a single initial vector at the start of
      the partition data.
      
      Because it's not possible to do that using the fast transferTo path, when
      encryption is enabled UnsafeShuffleWriter will revert back to using file
      streams when merging. It may be possible to use a hybrid approach when
      using encryption, using an intermediate direct buffer when reading from
      files and encrypting the data, but that's better left for a separate patch.
      
      As part of the change I made DiskBlockObjectWriter take a SerializerManager
      instead of a "wrap stream" closure, since that makes it easier to test the
      code without having to mock SerializerManager functionality.
      
      Tested with newly added unit tests (UnsafeShuffleWriterSuite for the write
      side and ExternalAppendOnlyMapSuite for integration), and by running some
      apps that failed without the fix.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #15982 from vanzin/SPARK-18546.
      93e9d880
    • Wenchen Fan's avatar
      [SPARK-18251][SQL] the type of Dataset can't be Option of non-flat type · f135b70f
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      For an input object of non-flat type, we can't encode it to a row if it's null, as Spark SQL doesn't allow the entire row to be null; only its columns can be null. That's the reason we forbid users from using top-level null objects in https://github.com/apache/spark/pull/13469
      
      However, if users wrap a non-flat type with `Option`, then we may still encode a top-level null object to a row, which is not allowed.
      
      This PR fixes this case, and suggests that users wrap their type with `Tuple1` if they really want top-level null objects.
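
      A minimal sketch of the suggested workaround, assuming a Spark shell session with implicits in scope:

      ```scala
      // Option of a non-flat (tuple) type at the top level is rejected after this PR:
      // Seq(Some(1 -> "a"), None).toDS()   // analysis error

      // Wrapping with Tuple1 turns the Option into a nullable column instead of a null row.
      val ds = Seq(Tuple1(Some(1 -> "a")), Tuple1(None: Option[(Int, String)])).toDS()
      ds.show()
      ```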
      
      ## How was this patch tested?
      
      new test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15979 from cloud-fan/option.
      f135b70f
    • Yanbo Liang's avatar
      [SPARK-18318][ML] ML, Graph 2.1 QA: API: New Scala APIs, docs · 60022bfd
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      API review for 2.1, except `LSH`-related classes which are still under development.
      
      ## How was this patch tested?
      Only doc changes, no new tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16009 from yanboliang/spark-18318.
      60022bfd
    • Josh Rosen's avatar
      [SPARK-18640] Add synchronization to TaskScheduler.runningTasksByExecutors · c51c7725
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      The method `TaskSchedulerImpl.runningTasksByExecutors()` accesses the mutable `executorIdToRunningTaskIds` map without proper synchronization. In addition, as markhamstra pointed out in #15986, the signature's use of parentheses is a little odd given that this is a pure getter method.
      
      This patch fixes both issues.
      
      ## How was this patch tested?
      
      Covered by existing tests.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #16073 from JoshRosen/runningTasksByExecutors-thread-safety.
      c51c7725
    • manishAtGit's avatar
      [SPARK][EXAMPLE] Added missing semicolon in quick-start-guide example · bc95ea0b
      manishAtGit authored
      ## What changes were proposed in this pull request?
      
      Added a missing semicolon in the quick-start-guide Java example code, which wasn't compiling before.
      
      ## How was this patch tested?
      Tested locally by running and generating the docs site. You can see that the last line contains ";" in the snapshot below.
      ![image](https://cloud.githubusercontent.com/assets/10628224/20751760/9a7e0402-b723-11e6-9aa8-3b6ca2d92ebf.png)
      
      Author: manishAtGit <manish@knoldus.com>
      
      Closes #16081 from manishatGit/fixed-quick-start-guide.
      bc95ea0b
    • Wenchen Fan's avatar
      [SPARK-18220][SQL] read Hive orc table with varchar column should not fail · 3f03c90a
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Spark SQL only has `StringType`, so when reading a Hive table with a varchar column, we read that column as `StringType`. However, we still need to use the varchar `ObjectInspector` to read the varchar column in the Hive table, which means we need to know the actual column type on the Hive side.
      
      In Spark 2.1, after https://github.com/apache/spark/pull/14363, we parse the Hive type string into a catalyst type, which means the actual column type on the Hive side is erased. Then we may use the string `ObjectInspector` to read a varchar column and fail.
      
      This PR keeps the original Hive column type string in the metadata of `StructField`, and uses it when converting back to a Hive column.
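
      A minimal sketch of the idea, assuming a metadata key named `HIVE_TYPE_STRING` (the exact key name is an assumption):

      ```scala
      import org.apache.spark.sql.types.{MetadataBuilder, StringType, StructField}

      // Keep the original Hive type string alongside the catalyst StringType.
      val hiveTypeKey = "HIVE_TYPE_STRING"                     // assumed key name
      val field = StructField(
        "name",
        StringType,
        nullable = true,
        metadata = new MetadataBuilder().putString(hiveTypeKey, "varchar(100)").build())

      // When converting back to a Hive column, prefer the preserved Hive type string.
      val hiveColumnType =
        if (field.metadata.contains(hiveTypeKey)) field.metadata.getString(hiveTypeKey)
        else field.dataType.catalogString                      // fall back to the catalyst type
      ```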
      
      ## How was this patch tested?
      
      newly added regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16060 from cloud-fan/varchar.
      3f03c90a
    • jiangxingbo's avatar
      [SPARK-17932][SQL] Support SHOW TABLES EXTENDED LIKE 'identifier_with_wildcards' statement · c24076dc
      jiangxingbo authored
      ## What changes were proposed in this pull request?
      
      We currently haven't implemented `SHOW TABLE EXTENDED` in Spark 2.0. This PR implements the statement.
      Goals:
      1. Support `SHOW TABLES EXTENDED LIKE 'identifier_with_wildcards'` (a usage sketch follows this list);
      2. Explicitly output an unsupported error message for `SHOW TABLES [EXTENDED] ... PARTITION` statement;
      3. Improve test cases for `SHOW TABLES` statement.
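
      A hedged usage sketch of goal 1 (the table name is hypothetical):

      ```scala
      spark.sql("CREATE TABLE IF NOT EXISTS sales_2016(id INT) USING parquet")

      // List matching tables together with their extended information.
      spark.sql("SHOW TABLES EXTENDED LIKE 'sales_*'").show(truncate = false)

      // Goal 2: the PARTITION variant is rejected with an explicit unsupported-operation error.
      // spark.sql("SHOW TABLES EXTENDED LIKE 'sales_*' PARTITION (year=2016)")
      ```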
      
      ## How was this patch tested?
      1. Add new test cases in file `show-tables.sql`.
      2. Modify tests for `SHOW TABLES` in `DDLSuite`.
      
      Author: jiangxingbo <jiangxb1987@gmail.com>
      
      Closes #15958 from jiangxb1987/show-table-extended.
      c24076dc
    • gatorsmile's avatar
      [SPARK-17897][SQL] Fixed IsNotNull Constraint Inference Rule · 2eb093de
      gatorsmile authored
      ### What changes were proposed in this pull request?
      The `constraints` of an operator are the expressions that evaluate to `true` for all the rows it produces. That means the expression result should be neither `false` nor `unknown` (NULL). Thus, we can infer `IsNotNull` on all the constraints, which are generated by the operator's own predicates or propagated from its children. A constraint can be a complex expression. For better usage of these constraints, we try to push `IsNotNull` down to the lowest-level expressions (i.e., `Attribute`). `IsNotNull` can be pushed through an expression when the expression is null-intolerant. (When the input is NULL, a null-intolerant expression always evaluates to NULL.)
      
      Below is the existing code we have for `IsNotNull` pushdown.
      ```Scala
        private def scanNullIntolerantExpr(expr: Expression): Seq[Attribute] = expr match {
          case a: Attribute => Seq(a)
          case _: NullIntolerant | IsNotNull(_: NullIntolerant) =>
            expr.children.flatMap(scanNullIntolerantExpr)
          case _ => Seq.empty[Attribute]
        }
      ```
      
      **`IsNotNull` itself is not null-intolerant.** It converts `null` to `false`. If the expression does not include any `Not`-like expression, this works; otherwise, it could generate a wrong result. This PR fixes the above function by removing `IsNotNull` from the inference. After the fix, when a constraint contains an `IsNotNull` expression, we infer new attribute-specific `IsNotNull` constraints if and only if `IsNotNull` appears at the root.
      
      Without the fix, the following test case returns an empty result.
      ```Scala
      val data = Seq[java.lang.Integer](1, null).toDF("key")
      data.filter("not key is not null").show()
      ```
      Before the fix, the optimized plan is like
      ```
      == Optimized Logical Plan ==
      Project [value#1 AS key#3]
      +- Filter (isnotnull(value#1) && NOT isnotnull(value#1))
         +- LocalRelation [value#1]
      ```
      
      After the fix, the optimized plan is like
      ```
      == Optimized Logical Plan ==
      Project [value#1 AS key#3]
      +- Filter NOT isnotnull(value#1)
         +- LocalRelation [value#1]
      ```
      
      ### How was this patch tested?
      Added a test
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16067 from gatorsmile/isNotNull2.
      2eb093de
    • Anthony Truchet's avatar
      [SPARK-18612][MLLIB] Delete broadcasted variable in LBFGS CostFun · c5a64d76
      Anthony Truchet authored
      ## What changes were proposed in this pull request?
      
      Fix a broadcasted variable leak occurring at each invocation of CostFun in L-BFGS.
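
      A minimal sketch of the general pattern (a toy loss, not the actual CostFun code):

      ```scala
      import org.apache.spark.SparkContext
      import org.apache.spark.rdd.RDD

      // Broadcast the current weights for one cost evaluation, then destroy the
      // broadcast so each L-BFGS iteration does not leak a broadcast variable.
      def evaluateCost(sc: SparkContext,
                       data: RDD[(Double, Array[Double])],
                       weights: Array[Double]): Double = {
        val bcWeights = sc.broadcast(weights)
        try {
          data.map { case (label, features) =>
            val margin = bcWeights.value.zip(features).map { case (w, x) => w * x }.sum
            math.log1p(math.exp(-label * margin))   // toy logistic loss, for illustration only
          }.sum()
        } finally {
          bcWeights.destroy()                       // the fix: release the broadcast each invocation
        }
      }
      ```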
      
      ## How was this patch tested?
      
      Unit tests, plus verification that the fix removes the fatal memory consumption seen in Criteo's use cases.
      
      This contribution is made on behalf of Criteo S.A.
      (http://labs.criteo.com/) under the terms of the Apache v2 License.
      
      Author: Anthony Truchet <a.truchet@criteo.com>
      
      Closes #16040 from AnthonyTruchet/SPARK-18612-lbfgs-cost-fun.
      c5a64d76
    • Sandeep Singh's avatar
      [SPARK-18366][PYSPARK][ML] Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer · fe854f2e
      Sandeep Singh authored
      ## What changes were proposed in this pull request?
      Added the new handleInvalid param for these transformers to Python to maintain API parity.
      
      ## How was this patch tested?
      Existing tests; additional testing is done with the new doctests.
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      
      Closes #15817 from techaddict/SPARK-18366.
      fe854f2e
    • uncleGen's avatar
      [SPARK-18617][CORE][STREAMING] Close "kryo auto pick" feature for Spark Streaming · 56c82eda
      uncleGen authored
      ## What changes were proposed in this pull request?
      
      #15992 provided a solution to fix the bug, i.e. **receiver data cannot be deserialized properly**. As zsxwing said, it is a critical bug, but we should not break APIs between maintenance releases. It may be a rational choice to disable the "kryo auto pick" feature for Spark Streaming as a first step. I will continue #15992 to optimize the solution.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #16052 from uncleGen/SPARK-18617.
      56c82eda
    • Herman van Hovell's avatar
      [SPARK-18622][SQL] Fix the datatype of the Sum aggregate function · 879ba711
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      The result of a `sum` aggregate function is typically a Decimal, Double or a Long. Currently the output dataType is based on the input's dataType.
      
      The `FunctionArgumentConversion` rule will make sure that the input is promoted to the largest type, and that also ensures that the output uses a (hopefully) sufficiently large output dataType. The issue is that `sum` is in a resolved state when we cast the input type; this means that rules which assume the dataType of the expression no longer changes could have been applied in the meantime. This is what happens if we apply `WidenSetOperationTypes` before applying the casts, and this breaks analysis.
      
      The most straightforward and future-proof solution is to make `sum` always output the widest dataType in its type class (Long for IntegralTypes, Decimal for DecimalTypes, and Double for FloatType and DoubleType). This PR implements that solution.
      
      We should move expression-specific type casting rules into the given Expression at some point.
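
      A minimal sketch of the widening rule described above (illustrative, not the actual `Sum` implementation; the decimal headroom is an assumption):

      ```scala
      import org.apache.spark.sql.types._

      def sumResultType(inputType: DataType): DataType = inputType match {
        case d: DecimalType => DecimalType(math.min(38, d.precision + 10), d.scale)  // assumed headroom
        case ByteType | ShortType | IntegerType | LongType => LongType               // widest integral type
        case FloatType | DoubleType => DoubleType                                    // widest fractional type
        case other => other
      }
      ```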
      
      ## How was this patch tested?
      Added (regression) tests to SQLQueryTestSuite's `union.sql`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #16063 from hvanhovell/SPARK-18622.
      879ba711
    • gatorsmile's avatar
      [SPARK-17680][SQL][TEST] Added a Testcase for Verifying Unicode Character... · a1d9138a
      gatorsmile authored
      [SPARK-17680][SQL][TEST] Added a Testcase for Verifying Unicode Character Support for Column Names and Comments
      
      ### What changes were proposed in this pull request?
      
      Spark SQL supports Unicode characters for column names when they are specified within backticks (`). When Hive support is enabled, the version of the Hive metastore must be higher than 0.12, because the Hive metastore supports Unicode characters for column names only since 0.13 (see the JIRA: https://issues.apache.org/jira/browse/HIVE-6013).
      
      In Spark SQL, table comments and view comments always allow Unicode characters without backticks.
      
      BTW, a separate PR has been submitted for database and table name validation because we do not support Unicode characters in these two cases.
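
      A minimal sketch of what the test covers (the table, column, and comment values are illustrative):

      ```scala
      // Unicode column names work when quoted with backticks; comments need no backticks.
      spark.sql("CREATE TABLE unicode_test (`列名` STRING COMMENT '说明') USING parquet")
      spark.sql("SELECT `列名` FROM unicode_test").show()
      ```
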
      ### How was this patch tested?
      
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15255 from gatorsmile/unicodeSupport.
      a1d9138a
    • Tathagata Das's avatar
      [SPARK-18516][STRUCTURED STREAMING] Follow up PR to add StreamingQuery.status to Python · bc09a2b8
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      - Add StreamingQueryStatus.json
      - Make it not a case class (to avoid unnecessarily exposing the implicit object StreamingQueryStatus, consistent with StreamingQueryProgress)
      - Add StreamingQuery.status to Python
      - Fix post-termination status
      
      ## How was this patch tested?
      New unit tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16075 from tdas/SPARK-18516-1.
      bc09a2b8
  3. Nov 29, 2016
    • Jeff Zhang's avatar
      [SPARK-15819][PYSPARK][ML] Add KMeanSummary in KMeans of PySpark · 4c82ca86
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      
      Add a Python API for KMeansSummary.
      ## How was this patch tested?
      
      Unit test added.
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #13557 from zjffdu/SPARK-15819.
      4c82ca86
    • Eric Liang's avatar
      [SPARK-18145] Update documentation for hive partition management in 2.1 · 489845f3
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      This documents the partition handling changes for Spark 2.1 and how to migrate existing tables.
      
      ## How was this patch tested?
      
      Built docs locally.
      
      rxin
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #16074 from ericl/spark-18145.
      489845f3
    • Herman van Hovell's avatar
      [SPARK-18632][SQL] AggregateFunction should not implement ImplicitCastInputTypes · af9789a4
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      `AggregateFunction` currently implements `ImplicitCastInputTypes` (which enables implicit input type casting). There are actually quite a few situations in which we don't need this, or require more control over our input. A recent example is the aggregate for `CountMinSketch` which should only take string, binary or integral types inputs.
      
      This PR removes `ImplicitCastInputTypes` from the `AggregateFunction` and makes a case-by-case decision on what kind of input validation we should use.
      
      ## How was this patch tested?
      Refactoring only. Existing tests.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #16066 from hvanhovell/SPARK-18632.
      af9789a4
    • Yuhao's avatar
      [SPARK-18319][ML][QA2.1] 2.1 QA: API: Experimental, DeveloperApi, final, sealed audit · 9b670bca
      Yuhao authored
      ## What changes were proposed in this pull request?
      Make a pass through the items marked as Experimental or DeveloperApi and see if any are stable enough to be unmarked. Also check items marked final or sealed to see if they are stable enough to be opened up as APIs.
      
      Some discussions in the jira: https://issues.apache.org/jira/browse/SPARK-18319
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Yuhao <yuhao.yang@intel.com>
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #15972 from hhbyyh/experimental21.
      9b670bca
    • Tathagata Das's avatar
      [SPARK-18516][SQL] Split state and progress in streaming · c3d08e2f
      Tathagata Das authored
      This PR separates the status of a `StreamingQuery` into two separate APIs:
       - `status` - describes the status of a `StreamingQuery` at this moment, including what phase of processing is currently happening and if data is available.
       - `recentProgress` - an array of statistics about the most recent microbatches that have executed.
      
      A recent progress contains the following information:
      ```
      {
        "id" : "2be8670a-fce1-4859-a530-748f29553bb6",
        "name" : "query-29",
        "timestamp" : 1479705392724,
        "inputRowsPerSecond" : 230.76923076923077,
        "processedRowsPerSecond" : 10.869565217391303,
        "durationMs" : {
          "triggerExecution" : 276,
          "queryPlanning" : 3,
          "getBatch" : 5,
          "getOffset" : 3,
          "addBatch" : 234,
          "walCommit" : 30
        },
        "currentWatermark" : 0,
        "stateOperators" : [ ],
        "sources" : [ {
          "description" : "KafkaSource[Subscribe[topic-14]]",
          "startOffset" : {
            "topic-14" : {
              "2" : 0,
              "4" : 1,
              "1" : 0,
              "3" : 0,
              "0" : 0
            }
          },
          "endOffset" : {
            "topic-14" : {
              "2" : 1,
              "4" : 2,
              "1" : 0,
              "3" : 0,
              "0" : 1
            }
          },
          "numRecords" : 3,
          "inputRowsPerSecond" : 230.76923076923077,
          "processedRowsPerSecond" : 10.869565217391303
        } ]
      }
      ```
      
      Additionally, in order to make it possible to correlate progress updates across restarts, we change the `id` field from an integer that is unique with in the JVM to a `UUID` that is globally unique.
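
      A hedged usage sketch of the two APIs (the source path and query name are hypothetical):

      ```scala
      val query = spark.readStream
        .format("text")
        .load("/tmp/stream-input")                          // hypothetical input directory
        .writeStream
        .format("console")
        .queryName("query-29")
        .start()

      println(query.status)                                 // current phase and data availability
      query.recentProgress.foreach(p => println(p.json))    // stats for the most recent micro-batches
      println(query.id)                                     // now a globally unique UUID across restarts
      ```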
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #15954 from marmbrus/queryProgress.
      c3d08e2f
    • Josh Rosen's avatar
      [SPARK-18553][CORE] Fix leak of TaskSetManager following executor loss · 9a02f682
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      _This is the master branch version of #15986; the original description follows:_
      
      This patch fixes a critical resource leak in the TaskScheduler which could cause RDDs and ShuffleDependencies to be kept alive indefinitely if an executor with running tasks is permanently lost and the associated stage fails.
      
      This problem was originally identified by analyzing the heap dump of a driver belonging to a cluster that had run out of shuffle space. This dump contained several `ShuffleDependency` instances that were retained by `TaskSetManager`s inside the scheduler but were not otherwise referenced. Each of these `TaskSetManager`s was considered a "zombie" but had no running tasks and therefore should have been cleaned up. However, these zombie task sets were still referenced by the `TaskSchedulerImpl.taskIdToTaskSetManager` map.
      
      Entries are added to the `taskIdToTaskSetManager` map when tasks are launched and are removed inside of `TaskScheduler.statusUpdate()`, which is invoked by the scheduler backend while processing `StatusUpdate` messages from executors. The problem with this design is that a completely dead executor will never send a `StatusUpdate`. There is [some code](https://github.com/apache/spark/blob/072f4c518cdc57d705beec6bcc3113d9a6740819/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L338) in `statusUpdate` which handles tasks that exit with the `TaskState.LOST` state (which is supposed to correspond to a task failure triggered by total executor loss), but this state only seems to be used in Mesos fine-grained mode. There doesn't seem to be any code which performs per-task state cleanup for tasks that were running on an executor that completely disappears without sending any sort of final death message. The `executorLost` and [`removeExecutor`](https://github.com/apache/spark/blob/072f4c518cdc57d705beec6bcc3113d9a6740819/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L527) methods don't appear to perform any cleanup of the `taskId -> *` mappings, causing the leaks observed here.
      
      This patch's fix is to maintain an `executorId -> running task ids` mapping so that these `taskId -> *` maps can be properly cleaned up following an executor loss.
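
      A simplified sketch of that bookkeeping (not the actual TaskSchedulerImpl code; the value types are reduced to strings):

      ```scala
      import scala.collection.mutable

      class SchedulerState {
        private val taskIdToTaskSetManager = mutable.HashMap[Long, String]()
        private val executorIdToRunningTaskIds = mutable.HashMap[String, mutable.HashSet[Long]]()

        def taskLaunched(taskId: Long, execId: String, tsm: String): Unit = synchronized {
          taskIdToTaskSetManager(taskId) = tsm
          executorIdToRunningTaskIds.getOrElseUpdate(execId, mutable.HashSet.empty) += taskId
        }

        // On total executor loss, drop every taskId -> * entry for tasks that were
        // still running there, so zombie TaskSetManagers can be garbage collected.
        def removeExecutor(execId: String): Unit = synchronized {
          executorIdToRunningTaskIds.remove(execId).foreach { taskIds =>
            taskIds.foreach(taskIdToTaskSetManager.remove)
          }
        }
      }
      ```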
      
      There are some potential corner-case interactions that I'm concerned about here, especially some details in [the comment](https://github.com/apache/spark/blob/072f4c518cdc57d705beec6bcc3113d9a6740819/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L523) in `removeExecutor`, so I'd appreciate a very careful review of these changes.
      
      ## How was this patch tested?
      
      I added a new unit test to `TaskSchedulerImplSuite`.
      
      /cc kayousterhout and markhamstra, who reviewed #15986.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #16045 from JoshRosen/fix-leak-following-total-executor-loss-master.
      9a02f682
    • Nattavut Sutyanyong's avatar
      [SPARK-18614][SQL] Incorrect predicate pushdown from ExistenceJoin · 36006352
      Nattavut Sutyanyong authored
      ## What changes were proposed in this pull request?
      
      ExistenceJoin should be treated the same as LeftOuter and LeftAnti, not InnerLike and LeftSemi. This is not currently exposed because the rewrite of [NOT] EXISTS OR ... to ExistenceJoin happens in rule RewritePredicateSubquery, which is in a separate rule set and placed after the rule PushPredicateThroughJoin. During the transformation in the rule PushPredicateThroughJoin, an ExistenceJoin never exists.
      
      The semantics of ExistenceJoin say we need to preserve all the rows from the left table through the join operation, as if it were a regular LeftOuter join. The ExistenceJoin augments the LeftOuter operation with a new column called exists, set to true when the join condition in the ON clause is true and false otherwise. Any filtering of rows happens in the Filter operation above the ExistenceJoin.
      
      Example:
      
      A(c1, c2): { (1, 1), (1, 2) }
      // B can be any value as it is irrelevant in this example
      B(c1): { (NULL) }
      
      select A.*
      from   A
      where  exists (select 1 from B where A.c1 = A.c2)
             or A.c2=2
      
      In this example, the correct result is all the rows from A. If the pattern ExistenceJoin around line 935 in Optimizer.scala is indeed active, the code will push down the predicate A.c1 = A.c2 to be a Filter on relation A, which will incorrectly filter the row (1,2) from A.
      
      ## How was this patch tested?
      
      Since this is not an exposed case, no new test cases are added. The scenario was discovered via a code review of another PR and confirmed to be valid with a peer.
      
      Author: Nattavut Sutyanyong <nsy.can@gmail.com>
      
      Closes #16044 from nsyca/spark-18614.
      36006352
    • Mark Hamstra's avatar
      [SPARK-18631][SQL] Changed ExchangeCoordinator re-partitioning to avoid more data skew · f8878a4c
      Mark Hamstra authored
      ## What changes were proposed in this pull request?
      
      The re-partitioning logic in ExchangeCoordinator was changed so that another pre-shuffle partition is not added to a post-shuffle partition if doing so would cause the size of the post-shuffle partition to exceed the target partition size.
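
      A minimal sketch of the changed rule (illustrative only, not the actual ExchangeCoordinator code):

      ```scala
      import scala.collection.mutable.ArrayBuffer

      // Greedily pack pre-shuffle partitions into post-shuffle partitions, but do not
      // add another pre-shuffle partition once doing so would exceed the target size.
      def coalescePartitions(preShuffleSizes: Array[Long], targetSize: Long): Seq[Seq[Int]] = {
        val groups = ArrayBuffer(ArrayBuffer.empty[Int])
        var currentSize = 0L
        preShuffleSizes.zipWithIndex.foreach { case (size, i) =>
          if (groups.last.nonEmpty && currentSize + size > targetSize) {
            groups += ArrayBuffer.empty[Int]   // start a new post-shuffle partition instead
            currentSize = 0L
          }
          groups.last += i
          currentSize += size
        }
        groups.map(_.toSeq).toSeq
      }

      // e.g. coalescePartitions(Array(30L, 30L, 60L, 10L), targetSize = 64L)
      //      => Seq(Seq(0, 1), Seq(2), Seq(3))
      ```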
      
      ## How was this patch tested?
      
      Existing tests updated to reflect new expectations.
      
      Author: Mark Hamstra <markhamstra@gmail.com>
      
      Closes #16065 from markhamstra/SPARK-17064.
      f8878a4c
    • wangzhenhua's avatar
      [SPARK-18429][SQL] implement a new Aggregate for CountMinSketch · d57a594b
      wangzhenhua authored
      ## What changes were proposed in this pull request?
      
      This PR implements a new Aggregate that generates a count-min sketch, as a wrapper of CountMinSketch.
      
      ## How was this patch tested?
      
      add test cases
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      
      Closes #15877 from wzhfy/cms.
      d57a594b
    • Tyson Condie's avatar
      [SPARK-18498][SQL] Revise HDFSMetadataLog API for better testing · f643fe47
      Tyson Condie authored
      
      Revise the HDFSMetadataLog API such that metadata object serialization and the final batch file write are separated. This will allow serialization checks without worrying about batch file name formats. marmbrus zsxwing
      
      Existing tests already ensure this API faithfully support core functionality i.e., creation of batch files.
      
      Author: Tyson Condie <tcondie@gmail.com>
      
      Closes #15924 from tcondie/SPARK-18498.
      
      Signed-off-by: Michael Armbrust <michael@databricks.com>
      f643fe47
    • Yanbo Liang's avatar
      [SPARK-18592][ML] Move DT/RF/GBT Param setter methods to subclasses · 95f79850
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Mainly two changes:
      * Move DT/RF/GBT Param setter methods to subclasses.
      * Deprecate corresponding setter methods in the model classes.
      
      See discussion here https://github.com/apache/spark/pull/15913#discussion_r89662469.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16017 from yanboliang/spark-18592.
      95f79850
    • hyukjinkwon's avatar
      [SPARK-18615][DOCS] Switch to multi-line doc to avoid a genjavadoc bug for backticks · 1a870090
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, a single-line comment does not render backticks as `<code>..</code>` but prints them as they are (`` `..` ``). For example, the line below:
      
      ```scala
      /** Return an RDD with the pairs from `this` whose keys are not in `other`. */
      ```
      
      So, we could work around this as below:
      
      ```scala
      /**
       * Return an RDD with the pairs from `this` whose keys are not in `other`.
       */
      ```
      
      - javadoc
      
        - **Before**
          ![2016-11-29 10 39 14](https://cloud.githubusercontent.com/assets/6477701/20693606/e64c8f90-b622-11e6-8dfc-4a029216e23d.png)
      
        - **After**
          ![2016-11-29 10 39 08](https://cloud.githubusercontent.com/assets/6477701/20693607/e7280d36-b622-11e6-8502-d2e21cd5556b.png)
      
      - scaladoc (this one looks fine either way)
      
        - **Before**
          ![2016-11-29 10 38 22](https://cloud.githubusercontent.com/assets/6477701/20693640/12c18aa8-b623-11e6-901a-693e2f6f8066.png)
      
        - **After**
          ![2016-11-29 10 40 05](https://cloud.githubusercontent.com/assets/6477701/20693642/14eb043a-b623-11e6-82ac-7cd0000106d1.png)
      
      I suspect this is related to SPARK-16153 and the genjavadoc issue in `typesafehub/genjavadoc#85`.
      
      ## How was this patch tested?
      
      I found them via
      
      ```
      grep -r "\/\*\*.*\`" . | grep .scala
      ```
      
      and then checked if each is in the public API documentation with manually built docs (`jekyll build`) with Java 7.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16050 from HyukjinKwon/javadoc-markdown.
      1a870090
    • aokolnychyi's avatar
      [MINOR][DOCS] Updates to the Accumulator example in the programming guide.... · f045d9da
      aokolnychyi authored
      [MINOR][DOCS] Updates to the Accumulator example in the programming guide. Fixed typos, AccumulatorV2 in Java
      
      ## What changes were proposed in this pull request?
      
      This pull request contains updates to Scala and Java Accumulator code snippets in the programming guide.
      
      - For Scala, the pull request fixes the signature of the `add()` method in the custom Accumulator, which contained two params (as in the old AccumulatorParam) instead of one (as in AccumulatorV2); a sketch of the AccumulatorV2 shape follows this list.
      
      - The Java example was updated to use the AccumulatorV2 class since AccumulatorParam is marked as deprecated.
      
      - Scala and Java examples are more consistent now.
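
      A minimal sketch of the AccumulatorV2 shape referred to in the first item (the accumulator itself is hypothetical, not the guide's example):

      ```scala
      import org.apache.spark.util.AccumulatorV2

      // A custom accumulator collecting distinct Long values; note that add() takes a
      // single element, unlike AccumulatorParam's two-argument addInPlace.
      class LongSetAccumulator extends AccumulatorV2[Long, Set[Long]] {
        private var set = Set.empty[Long]
        override def isZero: Boolean = set.isEmpty
        override def copy(): LongSetAccumulator = {
          val c = new LongSetAccumulator
          c.set = set
          c
        }
        override def reset(): Unit = { set = Set.empty[Long] }
        override def add(v: Long): Unit = { set += v }
        override def merge(other: AccumulatorV2[Long, Set[Long]]): Unit = { set ++= other.value }
        override def value: Set[Long] = set
      }
      ```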
      
      ## How was this patch tested?
      
      This patch was tested manually by building the docs locally.
      
      ![image](https://cloud.githubusercontent.com/assets/6235869/20652099/77d98d18-b4f3-11e6-8565-a995fe8cf8e5.png)
      
      Author: aokolnychyi <okolnychyyanton@gmail.com>
      
      Closes #16024 from aokolnychyi/fixed_accumulator_example.
      f045d9da
    • hyukjinkwon's avatar
      [SPARK-3359][DOCS] Make javadoc8 working for unidoc/genjavadoc compatibility... · f830bb91
      hyukjinkwon authored
      [SPARK-3359][DOCS] Make javadoc8 working for unidoc/genjavadoc compatibility in Java API documentation
      
      ## What changes were proposed in this pull request?
      
      This PR make `sbt unidoc` complete with Java 8.
      
      This PR roughly includes several fixes as below:
      
      - Fix unrecognisable class and method links in javadoc by changing them from `[[..]]` to `` `...` ``
      
        ```diff
        - * A column that will be computed based on the data in a [[DataFrame]].
        + * A column that will be computed based on the data in a `DataFrame`.
        ```
      
      - Fix throws annotations so that they are recognisable in javadoc
      
      - Fix URL links to `<a href="http..."></a>`.
      
        ```diff
        - * [[http://en.wikipedia.org/wiki/Decision_tree_learning Decision tree]] model for regression.
        + * <a href="http://en.wikipedia.org/wiki/Decision_tree_learning">
        + * Decision tree (Wikipedia)</a> model for regression.
        ```
      
        ```diff
        -   * see http://en.wikipedia.org/wiki/Receiver_operating_characteristic
        +   * see <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic">
        +   * Receiver operating characteristic (Wikipedia)</a>
        ```
      
      - Fix `<` and `>` characters by changing them to
      
        - `greater than`/`greater than or equal to` or `less than`/`less than or equal to` where applicable.
      
        - Or wrap them with `{{{...}}}` to print them in javadoc, or use `{@code ...}` or `{@literal ..}`. Please refer to https://github.com/apache/spark/pull/16013#discussion_r89665558
      
      - Fix `</p>` complaint
      
      ## How was this patch tested?
      
      Manually tested by `jekyll build` with Java 7 and 8
      
      ```
      java version "1.7.0_80"
      Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
      Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
      ```
      
      ```
      java version "1.8.0_45"
      Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
      Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16013 from HyukjinKwon/SPARK-3359-errors-more.
      f830bb91
    • Davies Liu's avatar
      [SPARK-18188] add checksum for blocks of broadcast · 7d5cb3af
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      A TorrentBroadcast is serialized and compressed first, then split into fixed-size blocks. If any block is corrupt when fetched from a remote executor, the decompression/deserialization fails without knowing which block is corrupt. Also, the corrupt block is kept in the block manager and reported to the driver, so other tasks (in the same executor or from different executors) will also fail because of it.
      
      This PR adds a checksum for each block and checks it after fetching a block from a remote executor, because it's very likely that the corruption happens in the network. When corruption is detected, the block is thrown away and an exception is thrown to fail the task, which will then be retried.
      
      Added a config for it: `spark.broadcast.checksum`, which is true by default.
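
      An illustrative sketch of the per-block checksum idea (not the exact TorrentBroadcast code):

      ```scala
      import java.nio.ByteBuffer
      import java.util.zip.Adler32

      // Compute a checksum per block when the broadcast is created.
      def checksum(block: ByteBuffer): Int = {
        val adler = new Adler32()
        adler.update(block.duplicate())   // duplicate() so the caller's buffer position is untouched
        adler.getValue.toInt
      }

      // Verify the checksum after fetching a block from a remote executor; on mismatch,
      // throw the block away and fail the task so it gets retried.
      def verifyBlock(index: Int, fetched: ByteBuffer, expected: Array[Int]): Unit = {
        if (checksum(fetched) != expected(index)) {
          throw new java.io.IOException(s"Broadcast block $index is corrupt (checksum mismatch)")
        }
      }
      ```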
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #15935 from davies/broadcast_checksum.
      7d5cb3af