  1. Jun 23, 2016
    • Sameer Agarwal's avatar
      [SPARK-16123] Avoid NegativeArraySizeException while reserving additional... · cc71d4fa
      Sameer Agarwal authored
      [SPARK-16123] Avoid NegativeArraySizeException while reserving additional capacity in VectorizedColumnReader
      
      ## What changes were proposed in this pull request?
      
      This patch fixes an overflow bug in the vectorized Parquet reader where both the off-heap and on-heap variants of `ColumnVector.reserve()` can overflow while reserving additional capacity during reads.
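      For illustration, a minimal standalone sketch (not the actual `VectorizedColumnReader` code) of how naive capacity doubling overflows `Int` and how a guarded reserve avoids the `NegativeArraySizeException`:
      ```scala
      // Naively doubling a large Int capacity wraps around to a negative value,
      // which new Array[...](n) rejects with NegativeArraySizeException.
      val requiredCapacity = 1500000000
      val doubled = requiredCapacity * 2            // overflows to -1294967296
      // A guarded reserve grows in Long space and clamps at Int.MaxValue:
      val safeCapacity = math.min(Int.MaxValue.toLong, requiredCapacity.toLong * 2).toInt
      println(s"naive = $doubled, guarded = $safeCapacity")
      ```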
      
      ## How was this patch tested?
      
      Manual Tests
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #13832 from sameeragarwal/negative-array.
      cc71d4fa
    • Dongjoon Hyun's avatar
      [SPARK-16165][SQL] Fix the update logic for InMemoryTableScanExec.readBatches · 264bc636
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, the `readBatches` accumulator of `InMemoryTableScanExec` is updated only when `spark.sql.inMemoryColumnarStorage.partitionPruning` is true. Although this metric is used only for testing purposes, it should be correct regardless of SQL options.
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including a new test case).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13870 from dongjoon-hyun/SPARK-16165.
      264bc636
    • Shixiong Zhu's avatar
      [SPARK-15443][SQL] Fix 'explain' for streaming Dataset · 0e4bdebe
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      - Fix the `explain` command for streaming Dataset/DataFrame. E.g.,
      ```
      == Parsed Logical Plan ==
      'SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7]
      +- 'MapElements <function1>, obj#6: java.lang.String
         +- 'DeserializeToObject unresolveddeserializer(createexternalrow(getcolumnbyordinal(0, StringType).toString, StructField(value,StringType,true))), obj#5: org.apache.spark.sql.Row
            +- Filter <function1>.apply
               +- StreamingRelation FileSource[/Users/zsx/stream], [value#0]
      
      == Analyzed Logical Plan ==
      value: string
      SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7]
      +- MapElements <function1>, obj#6: java.lang.String
         +- DeserializeToObject createexternalrow(value#0.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row
            +- Filter <function1>.apply
               +- StreamingRelation FileSource[/Users/zsx/stream], [value#0]
      
      == Optimized Logical Plan ==
      SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7]
      +- MapElements <function1>, obj#6: java.lang.String
         +- DeserializeToObject createexternalrow(value#0.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row
            +- Filter <function1>.apply
               +- StreamingRelation FileSource[/Users/zsx/stream], [value#0]
      
      == Physical Plan ==
      *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7]
      +- *MapElements <function1>, obj#6: java.lang.String
         +- *DeserializeToObject createexternalrow(value#0.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row
            +- *Filter <function1>.apply
               +- StreamingRelation FileSource[/Users/zsx/stream], [value#0]
      ```
      
      - Add `StreamingQuery.explain` to display the last execution plan. E.g.,
      ```
      == Parsed Logical Plan ==
      SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7]
      +- MapElements <function1>, obj#6: java.lang.String
         +- DeserializeToObject createexternalrow(value#12.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row
            +- Filter <function1>.apply
               +- Relation[value#12] text
      
      == Analyzed Logical Plan ==
      value: string
      SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7]
      +- MapElements <function1>, obj#6: java.lang.String
         +- DeserializeToObject createexternalrow(value#12.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row
            +- Filter <function1>.apply
               +- Relation[value#12] text
      
      == Optimized Logical Plan ==
      SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7]
      +- MapElements <function1>, obj#6: java.lang.String
         +- DeserializeToObject createexternalrow(value#12.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row
            +- Filter <function1>.apply
               +- Relation[value#12] text
      
      == Physical Plan ==
      *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7]
      +- *MapElements <function1>, obj#6: java.lang.String
         +- *DeserializeToObject createexternalrow(value#12.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row
            +- *Filter <function1>.apply
               +- *Scan text [value#12] Format: org.apache.spark.sql.execution.datasources.text.TextFileFormat@1836ab91, InputPaths: file:/Users/zsx/stream/a.txt, file:/Users/zsx/stream/b.txt, file:/Users/zsx/stream/c.txt, PushedFilters: [], ReadSchema: struct<value:string>
      ```
      
      ## How was this patch tested?
      
      The added unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13815 from zsxwing/sdf-explain.
      0e4bdebe
    • Dongjoon Hyun's avatar
      [SPARK-16164][SQL] Update `CombineFilters` to try to construct predicates with... · 91b1ef28
      Dongjoon Hyun authored
      [SPARK-16164][SQL] Update `CombineFilters` to try to construct predicates with child predicate first
      
      ## What changes were proposed in this pull request?
      
      This PR changes `CombineFilters` to compose the final predicate condition by using (`child predicate` AND `parent predicate`) instead of (`parent predicate` AND `child predicate`). This is a best-effort approach; other optimization rules may still destroy this order by reorganizing conjunctive predicates.
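      Schematically, the rule now builds the conjunction in the other order (a hedged sketch with hypothetical condition types, not Catalyst's actual expression classes):
      ```scala
      // Hypothetical condition type standing in for Catalyst expressions.
      sealed trait Cond
      case class Pred(name: String) extends Cond
      case class And(left: Cond, right: Cond) extends Cond

      // Before: And(parentPredicate, childPredicate)
      // After:  And(childPredicate, parentPredicate), so the child's predicate,
      // which appeared first in the user's query, is also constructed first.
      def combineFilters(parent: Cond, child: Cond): Cond = And(child, parent)
      ```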
      
      **Reported Error Scenario**
      Chris McCubbin reported a bug when he used StringIndexer in an ML pipeline with additional filters. It seems that during filter pushdown, we changed the ordering in the logical plan.
      ```scala
      import org.apache.spark.ml.feature._
      val df1 = (0 until 3).map(_.toString).toDF
      val indexer = new StringIndexer()
        .setInputCol("value")
        .setOutputCol("idx")
        .setHandleInvalid("skip")
        .fit(df1)
      val df2 = (0 until 5).map(_.toString).toDF
      val predictions = indexer.transform(df2)
      predictions.show() // this is okay
      predictions.where('idx > 2).show() // this will throw an exception
      ```
      
      Please see the notebook at https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/2159162931615821/588180/latest.html for error messages.
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including a new test case).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13872 from dongjoon-hyun/SPARK-16164.
      91b1ef28
    • Ryan Blue's avatar
      [SPARK-13723][YARN] Change behavior of --num-executors with dynamic allocation. · 738f134b
      Ryan Blue authored
      ## What changes were proposed in this pull request?
      
      This changes the behavior of --num-executors and spark.executor.instances when using dynamic allocation. Instead of turning dynamic allocation off, the value is used as the initial number of executors.
      
      This change was discussed on [SPARK-13723](https://issues.apache.org/jira/browse/SPARK-13723). I highly recommend making it while we can still change the behavior for 2.0.0. In practice, the 1.x behavior causes unexpected results for users (it is not clear that it disables dynamic allocation) and wastes cluster resources, because users rarely notice the log message.
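      A sketch of the resulting precedence, assuming `Utils.getDynamicAllocationInitialExecutors` (mentioned in the tests below) simply picks the largest of the relevant settings — an assumption based on the description above:
      ```scala
      // Initial executor count under dynamic allocation: the maximum of the
      // configured minimum, the explicit initial value, and --num-executors.
      def dynamicAllocationInitialExecutors(
          minExecutors: Int,      // spark.dynamicAllocation.minExecutors
          initialExecutors: Int,  // spark.dynamicAllocation.initialExecutors
          numExecutors: Int       // spark.executor.instances / --num-executors
      ): Int = Seq(minExecutors, initialExecutors, numExecutors).max
      ```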
      
      ## How was this patch tested?
      
      This patch updates tests and adds a test for Utils.getDynamicAllocationInitialExecutors.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #13338 from rdblue/SPARK-13723-num-executors-with-dynamic-allocation.
      738f134b
    • Ryan Blue's avatar
      [SPARK-15725][YARN] Ensure ApplicationMaster sleeps for the min interval. · a410814c
      Ryan Blue authored
      ## What changes were proposed in this pull request?
      
      Update `ApplicationMaster` to sleep for at least the minimum allocation interval before calling `allocateResources`. The allocation thread is triggered whenever an executor is killed and its connections die, which can overload the `YarnAllocator`; enforcing the minimum sleep prevents the app from overloading the allocator and becoming unstable.
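      A minimal sketch of the idea (names hypothetical, not the actual `ApplicationMaster` code): the allocation loop enforces a floor on how soon it calls allocate again, even when woken early:
      ```scala
      // Enforce a minimum interval between consecutive allocation calls.
      class AllocationLoop(minIntervalMs: Long, allocate: () => Unit) {
        private var lastCallMs = 0L

        def run(): Unit = {
          val elapsed = System.currentTimeMillis() - lastCallMs
          if (elapsed < minIntervalMs) {
            Thread.sleep(minIntervalMs - elapsed) // sleep at least the remainder
          }
          allocate()
          lastCallMs = System.currentTimeMillis()
        }
      }
      ```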
      
      ## How was this patch tested?
      
      Tested that this allows an app to recover instead of hanging. It is still possible for the YarnAllocator to be overwhelmed by requests, but this prevents the issue for the most common cause.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #13482 from rdblue/SPARK-15725-am-sleep-work-around.
      a410814c
    • Davies Liu's avatar
      [SPARK-16163] [SQL] Cache the statistics for logical plans · 10396d95
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      The calculation of statistics is no longer trivial; it can be very slow for a large query (for example, TPC-DS Q64 took several minutes to plan).
      
      During the planning of a query, the statistics of any logical plan should not change (even InMemoryRelation), so we should use `lazy val` to cache the statistics.
      
      For InMemoryRelation, the statistics can be updated after materialization, but that is only useful when the relation is reused in another query (before planning); once planning is finished, the statistics are not used anymore.
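      The mechanism is just Scala's `lazy val`; a minimal sketch (not the actual `LogicalPlan` API) of caching a tree-walking computation so it runs at most once per node:
      ```scala
      // Statistics are computed on first access and cached for the rest of planning.
      case class Stats(sizeInBytes: BigInt)

      class Plan(children: Seq[Plan], rowSize: BigInt) {
        lazy val statistics: Stats =
          Stats(children.map(_.statistics.sizeInBytes).sum + rowSize)
      }
      ```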
      
      ## How was this patch tested?
      
      Tested with TPC-DS Q64; it can be planned in about a second after the patch.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13871 from davies/fix_statistics.
      10396d95
    • Yuhao Yang's avatar
      [SPARK-16130][ML] Model loading backward compatibility for ml.classification.LogisticRegression · 60398dab
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      jira: https://issues.apache.org/jira/browse/SPARK-16130
      Model loading backward compatibility for `ml.classification.LogisticRegression`.
      
      ## How was this patch tested?
      Existing unit tests and a manual test for loading old models.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #13841 from hhbyyh/lrcomp.
      60398dab
    • Shixiong Zhu's avatar
      [SPARK-16116][SQL] ConsoleSink should not require checkpointLocation · d85bb10c
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      When the user uses `ConsoleSink`, we should use a temp location if `checkpointLocation` is not specified.
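      A hedged sketch of the fallback (helper name hypothetical): use a temporary directory when the user omits `checkpointLocation`, which is acceptable because a console sink has no recoverable state:
      ```scala
      import java.nio.file.Files

      // Fall back to a throwaway checkpoint directory for the console sink.
      def resolveCheckpointLocation(userSpecified: Option[String]): String =
        userSpecified.getOrElse(
          Files.createTempDirectory("temporary-checkpoint").toString)
      ```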
      
      ## How was this patch tested?
      
      The added unit test.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13817 from zsxwing/console-checkpoint.
      d85bb10c
    • Felix Cheung's avatar
      [SPARK-16088][SPARKR] update setJobGroup, cancelJobGroup, clearJobGroup · b5a99766
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Updated setJobGroup, cancelJobGroup, clearJobGroup to not require sc/SparkContext as parameter.
      Also updated roxygen2 doc and R programming guide on deprecations.
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #13838 from felixcheung/rjobgroup.
      b5a99766
    • Xiangrui Meng's avatar
      [SPARK-16154][MLLIB] Update spark.ml and spark.mllib package docs · 65d1f0f7
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      Since we decided to switch the spark.mllib package into maintenance mode in 2.0, it would be nice to update the package docs to reflect this change.
      
      ## How was this patch tested?
      
      Manually checked generated APIs.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #13859 from mengxr/SPARK-16154.
      65d1f0f7
    • Peter Ableda's avatar
      [SPARK-16138] Try to cancel executor requests only if we have at least 1 · 5bf2889b
      Peter Ableda authored
      ## What changes were proposed in this pull request?
      Add an additional check to the `if` statement so that cancelling executor requests is attempted only when at least one request is pending.
      
      ## How was this patch tested?
      I built and deployed to an internal cluster to observe the behaviour. After the change, the invalid logging is gone:
      
      ```
      16/06/22 08:46:36 INFO yarn.YarnAllocator: Driver requested a total number of 1 executor(s).
      16/06/22 08:46:36 INFO yarn.YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 1 executors.
      16/06/22 08:46:36 INFO yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
      16/06/22 08:47:36 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 1.
      ```
      
      Author: Peter Ableda <abledapeter@gmail.com>
      
      Closes #13850 from peterableda/patch-2.
      5bf2889b
    • Dongjoon Hyun's avatar
      [SPARK-15660][CORE] Update RDD `variance/stdev` description and add popVariance/popStdev · 5eef1e6c
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      In SPARK-11490, `variance/stdev` were redefined as the **sample** `variance/stdev` instead of the population ones. This PR updates the remaining old documentation to prevent users from misunderstanding. This will update the following Scala/Java API docs.
      
      - http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.api.java.JavaDoubleRDD
      - http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.rdd.DoubleRDDFunctions
      - http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.util.StatCounter
      - http://spark.apache.org/docs/2.0.0-preview/api/java/org/apache/spark/api/java/JavaDoubleRDD.html
      - http://spark.apache.org/docs/2.0.0-preview/api/java/org/apache/spark/rdd/DoubleRDDFunctions.html
      - http://spark.apache.org/docs/2.0.0-preview/api/java/org/apache/spark/util/StatCounter.html
      
      Also, this PR explicitly adds the population variants as `popVariance` and `popStdev` functions.
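      For reference, the sample/population distinction is Bessel's correction; a small self-contained illustration:
      ```scala
      // Sample variance divides by n - 1; population variance divides by n.
      def mean(xs: Seq[Double]): Double = xs.sum / xs.size

      def sampleVariance(xs: Seq[Double]): Double = {
        val m = mean(xs)
        xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
      }

      def popVariance(xs: Seq[Double]): Double = {
        val m = mean(xs)
        xs.map(x => (x - m) * (x - m)).sum / xs.size
      }
      ```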
      
      ## How was this patch tested?
      
      Pass the updated Jenkins tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13403 from dongjoon-hyun/SPARK-15660.
      5eef1e6c
    • Brian Cho's avatar
      [SPARK-16162] Remove dead code OrcTableScan. · 4374a46b
      Brian Cho authored
      ## What changes were proposed in this pull request?
      
      SPARK-14535 removed all calls to class OrcTableScan. This removes the dead code.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: Brian Cho <bcho@fb.com>
      
      Closes #13869 from dafrista/clean-up-orctablescan.
      4374a46b
    • Cheng Lian's avatar
      [SQL][MINOR] Fix minor formatting issues in SHOW CREATE TABLE output · f34b5c62
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR fixes two minor formatting issues appearing in `SHOW CREATE TABLE` output.
      
      Before:
      
      ```
      CREATE EXTERNAL TABLE ...
      ...
      WITH SERDEPROPERTIES ('serialization.format' = '1'
      )
      ...
      TBLPROPERTIES ('avro.schema.url' = '/tmp/avro/test.avsc',
        'transient_lastDdlTime' = '1466638180')
      ```
      
      After:
      
      ```
      CREATE EXTERNAL TABLE ...
      ...
      WITH SERDEPROPERTIES (
        'serialization.format' = '1'
      )
      ...
      TBLPROPERTIES (
        'avro.schema.url' = '/tmp/avro/test.avsc',
        'transient_lastDdlTime' = '1466638180'
      )
      ```
      
      ## How was this patch tested?
      
      Manually tested.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #13864 from liancheng/show-create-table-format-fix.
      f34b5c62
  2. Jun 22, 2016
    • bomeng's avatar
      [SPARK-15230][SQL] distinct() does not handle column name with dot properly · 925884a6
      bomeng authored
      ## What changes were proposed in this pull request?
      
      When a table is created with a column name containing a dot, distinct() will fail to run. For example,
      ```scala
      val rowRDD = sparkContext.parallelize(Seq(Row(1), Row(1), Row(2)))
      val schema = StructType(Array(StructField("column.with.dot", IntegerType, nullable = false)))
      val df = spark.createDataFrame(rowRDD, schema)
      ```
      running the following will have no problem:
      ```scala
      df.select(new Column("`column.with.dot`"))
      ```
      but running the query with an additional distinct() will cause an exception:
      ```scala
      df.select(new Column("`column.with.dot`")).distinct()
      ```
      
      The issue is that distinct() will try to resolve the column name, but the column name in the schema does not have backticks around it. The solution is to add the backticks before passing the column name to resolve().
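      A minimal sketch of the fix idea (helper name hypothetical): quote the schema's raw column name before resolution so the dots are treated literally rather than as nested-field accessors:
      ```scala
      // Wrap the raw schema column name in backticks so resolution treats
      // "column.with.dot" as a single name, not nested field accesses.
      def quoteColumnName(name: String): String = s"`$name`"

      // e.g. resolve(quoteColumnName("column.with.dot")) instead of resolve("column.with.dot")
      ```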
      
      ## How was this patch tested?
      
      Added a new test case.
      
      Author: bomeng <bmeng@us.ibm.com>
      
      Closes #13140 from bomeng/SPARK-15230.
      925884a6
    • Reynold Xin's avatar
      [SPARK-16159][SQL] Move RDD creation logic from FileSourceStrategy.apply · 37f3be5d
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      We embed the partitioning logic in FileSourceStrategy.apply, making the function very long. This is a small refactoring to move it into its own functions. Eventually we should be able to move the partitioning functions into a physical operator, rather than doing it during physical planning.
      
      ## How was this patch tested?
      This is a simple code move.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13862 from rxin/SPARK-16159.
      37f3be5d
    • gatorsmile's avatar
      [SPARK-16024][SQL][TEST] Verify Column Comment for Data Source Tables · 9f990fa3
      gatorsmile authored
      #### What changes were proposed in this pull request?
      This PR is to improve test coverage. It verifies whether the `Comment` of a `Column` can be appropriately handled.
      
      The test cases verify the related parts in Parser, both SQL and DataFrameWriter interface, and both Hive Metastore catalog and In-memory catalog.
      
      #### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13764 from gatorsmile/dataSourceComment.
      9f990fa3
    • Brian Cho's avatar
      [SPARK-15956][SQL] When unwrapping ORC avoid pattern matching at runtime · 4f869f88
      Brian Cho authored
      ## What changes were proposed in this pull request?
      
      Extend the returning of unwrapper functions from primitive types to all types.
      
      This PR is based on https://github.com/apache/spark/pull/13676. It only fixes a bug with scala-2.10 compilation. All credit should go to dafrista.
      
      ## How was this patch tested?
      
      The patch should pass all unit tests. Reading ORC files with non-primitive types with this change reduced the read time by ~15%.
      
      Author: Brian Cho <bcho@fb.com>
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #13854 from hvanhovell/SPARK-15956-scala210.
      4f869f88
    • Prajwal Tuladhar's avatar
      [SPARK-16131] initialize internal logger lazily in Scala preferred way · 044971ec
      Prajwal Tuladhar authored
      ## What changes were proposed in this pull request?
      
      Initialize the logger instance lazily, in the Scala-preferred way.
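      The "Scala-preferred way" here is a `lazy val`; a minimal sketch of the pattern (a simplified stand-in for Spark's internal `Logging` trait, assuming slf4j on the classpath):
      ```scala
      import org.slf4j.{Logger, LoggerFactory}

      trait Logging {
        // Created on first use rather than at construction time; @transient
        // keeps the logger out of serialized closures.
        @transient protected lazy val log: Logger =
          LoggerFactory.getLogger(getClass.getName)
      }
      ```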
      
      ## How was this patch tested?
      
      By running `./build/mvn clean test` locally
      
      Author: Prajwal Tuladhar <praj@infynyxx.com>
      
      Closes #13842 from infynyxx/spark_internal_logger.
      044971ec
    • Xiangrui Meng's avatar
      [SPARK-16155][DOC] remove package grouping in Java docs · 857ecff1
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      In 1.4 and earlier releases, we have package grouping in the generated Java API docs. See http://spark.apache.org/docs/1.4.0/api/java/index.html. However, this disappeared in 1.5.0: http://spark.apache.org/docs/1.5.0/api/java/index.html.
      
      Rather than fixing it, I'd suggest removing the grouping, because it might take some time to fix and it is a manual process to update the grouping in `SparkBuild.scala`. I didn't find anyone complaining about the missing groups since 1.5.0 on Google.
      
      Manually checked the generated Java API docs and confirmed that they are the same as in master.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #13856 from mengxr/SPARK-16155.
      857ecff1
    • Xiangrui Meng's avatar
      [SPARK-16153][MLLIB] switch to multi-line doc to avoid a genjavadoc bug · 00cc5cca
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      We recently deprecated setLabelCol in ChiSqSelectorModel (#13823):
      
      ~~~scala
        /** @group setParam */
        @Since("1.6.0")
        @deprecated("labelCol is not used by ChiSqSelectorModel.", "2.0.0")
        def setLabelCol(value: String): this.type = set(labelCol, value)
      ~~~
      
      This unfortunately hit a genjavadoc bug and broke doc generation. This is the generated Java code:
      
      ~~~java
        /** group setParam */
        public  org.apache.spark.ml.feature.ChiSqSelectorModel setOutputCol (java.lang.String value)  { throw new RuntimeException(); }
         *
         * deprecated labelCol is not used by ChiSqSelectorModel. Since 2.0.0.
        */
        public  org.apache.spark.ml.feature.ChiSqSelectorModel setLabelCol (java.lang.String value)  { throw new RuntimeException(); }
      ~~~
      
      Switching to a multi-line doc comment is a workaround.
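      Concretely, the workaround keeps the same tags but spreads the doc comment over multiple lines (a sketch):
      ~~~scala
      // Single-line form that triggered the genjavadoc bug:
      /** @group setParam */

      // Multi-line form that generates valid Javadoc:
      /**
       * @group setParam
       */
      ~~~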
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #13855 from mengxr/SPARK-16153.
      00cc5cca
    • Davies Liu's avatar
      [SPARK-16078][SQL] from_utc_timestamp/to_utc_timestamp should not depends on local timezone · 20d411bc
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Currently, we use the local timezone to parse or format a timestamp (TimestampType), then use a Long for the microseconds since the epoch in UTC.
      
      In from_utc_timestamp() and to_utc_timestamp(), we did not consider the local timezone, so they could return different results under different local timezones.
      
      This PR does the conversion based on human time (in the local timezone), so it should return the same result in any timezone. But because the mapping from an absolute timestamp to human time is not exactly one-to-one, it will still return a wrong result in some timezones (and at the beginning or end of DST).
      
      This PR is a best-effort fix. In the long term, we should make TimestampType timezone-aware to fix this completely.
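      A hedged sketch of the timezone-independent conversion (simplified; the real implementation must handle the DST edge cases mentioned above more carefully):
      ```scala
      import java.util.TimeZone

      // Shift a UTC instant (microseconds since epoch) by the target zone's
      // offset so the wall-clock reading matches that zone, independent of
      // the JVM's default timezone.
      def fromUtcTimestamp(micros: Long, tz: String): Long = {
        val millis = micros / 1000L
        val offsetMillis = TimeZone.getTimeZone(tz).getOffset(millis)
        micros + offsetMillis * 1000L
      }
      ```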
      
      ## How was this patch tested?
      
      Tested these functions in all timezones.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13784 from davies/convert_tz.
      20d411bc
    • Kai Jiang's avatar
      [SPARK-15672][R][DOC] R programming guide update · 43b04b7e
      Kai Jiang authored
      ## What changes were proposed in this pull request?
      Guide for
      - UDFs with dapply, dapplyCollect
      - spark.lapply for running parallel R functions
      
      ## How was this patch tested?
      Built locally.
      <img width="654" alt="screen shot 2016-06-14 at 03 12 56" src="https://cloud.githubusercontent.com/assets/3419881/16039344/12a3b6a0-31de-11e6-8d77-fe23308075c0.png">
      
      Author: Kai Jiang <jiangkai@gmail.com>
      
      Closes #13660 from vectorijk/spark-15672-R-guide-update.
      43b04b7e
    • Eric Liang's avatar
      [SPARK-16003] SerializationDebugger runs into infinite loop · 6f915c9e
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      This fixes SerializationDebugger so it does not recurse forever when `writeReplace` returns an object of the same class, which is the case for at least the `SQLMetrics` class.
      
      See also the OpenJDK unit tests on the behavior of recursive `writeReplace()`:
      https://github.com/openjdk-mirror/jdk7u-jdk/blob/f4d80957e89a19a29bb9f9807d2a28351ed7f7df/test/java/io/Serializable/nestedReplace/NestedReplace.java
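      A minimal sketch of the failure mode and the assumed guard (names hypothetical):
      ```scala
      import java.io.Serializable

      // writeReplace returning an instance of the same class: following the
      // replacement naively recurses forever.
      class Metric extends Serializable {
        private def writeReplace(): Object = new Metric()
      }

      // Assumed fix: only follow writeReplace when it changes the class.
      def shouldFollowReplacement(original: AnyRef, replaced: AnyRef): Boolean =
        replaced.getClass != original.getClass
      ```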
      
      cc davies cloud-fan
      
      ## How was this patch tested?
      
      Unit tests for SerializationDebugger.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #13814 from ericl/spark-16003.
      6f915c9e
    • Herman van Hovell's avatar
      [SPARK-15956][SQL] Revert "[] When unwrapping ORC avoid pattern matching… · 472d611a
      Herman van Hovell authored
      This reverts commit 0a9c0275. It breaks the 2.10 build; I'll fix this in a different PR.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #13853 from hvanhovell/SPARK-15956-revert.
      472d611a
    • Ahmed Mahran's avatar
      [SPARK-16120][STREAMING] getCurrentLogFiles in ReceiverSuite WAL generating... · c2cebdb7
      Ahmed Mahran authored
      [SPARK-16120][STREAMING] getCurrentLogFiles in ReceiverSuite WAL generating and cleaning case uses external variable instead of the passed parameter
      
      ## What changes were proposed in this pull request?
      
      In `ReceiverSuite.scala`, in the test case "write ahead log - generating and cleaning", the inner method `getCurrentLogFiles` uses the external variable `logDirectory1` instead of the passed parameter `logDirectory`. This PR fixes this by using the passed method argument instead of the variable from the outer scope.
      
      ## How was this patch tested?
      
      The unit test was re-run and the output logs were checked for the correct paths used.
      
      tdas
      
      Author: Ahmed Mahran <ahmed.mahran@mashin.io>
      
      Closes #13825 from ahmed-mahran/b-receiver-suite-wal-gen-cln.
      c2cebdb7
    • Brian Cho's avatar
      [SPARK-15956][SQL] When unwrapping ORC avoid pattern matching at runtime · 0a9c0275
      Brian Cho authored
      ## What changes were proposed in this pull request?
      
      Extend the returning of unwrapper functions from primitive types to all types.
      
      ## How was this patch tested?
      
      The patch should pass all unit tests. Reading ORC files with non-primitive types with this change reduced the read time by ~15%.
      
      ===
      
      The github diff is very noisy. Attaching the screenshots below for improved readability:
      
      ![screen shot 2016-06-14 at 5 33 16 pm](https://cloud.githubusercontent.com/assets/1514239/16064580/4d6f7a98-3257-11e6-9172-65e4baff948b.png)
      
      ![screen shot 2016-06-14 at 5 33 28 pm](https://cloud.githubusercontent.com/assets/1514239/16064587/5ae6c244-3257-11e6-8460-69eee70de219.png)
      
      Author: Brian Cho <bcho@fb.com>
      
      Closes #13676 from dafrista/improve-orc-master.
      0a9c0275
    • Xiangrui Meng's avatar
      [MINOR][MLLIB] DefaultParamsReadable/Writable should be DeveloperApi · 6a6010f0
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      `DefaultParamsReadable/Writable` are not user-facing. Only developers who implement `Transformer/Estimator` would use them. So this PR changes the annotation to `DeveloperApi`.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #13828 from mengxr/default-readable-should-be-developer-api.
      6a6010f0
    • Nick Pentreath's avatar
      [SPARK-16127][ML][PYSPARK] Audit @Since annotations related to ml.linalg · 18faa588
      Nick Pentreath authored
      [SPARK-14615](https://issues.apache.org/jira/browse/SPARK-14615) and #12627 changed `spark.ml` pipelines to use the new `ml.linalg` classes for `Vector`/`Matrix`. Some `Since` annotations for public methods/vals were not updated accordingly to `2.0.0`. This PR updates them.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #13840 from MLnick/SPARK-16127-ml-linalg-since.
      18faa588
    • Junyang Qian's avatar
      [SPARK-16107][R] group glm methods in documentation · ea3a12b0
      Junyang Qian authored
      ## What changes were proposed in this pull request?
      
      This groups GLM methods (spark.glm, summary, print, predict and write.ml) in the documentation. The example code was updated.
      
      ## How was this patch tested?
      
      N/A
      
      
      ![screen shot 2016-06-21 at 2 31 37 pm](https://cloud.githubusercontent.com/assets/15318264/16247077/f6eafc04-37bc-11e6-89a8-7898ff3e4078.png)
      ![screen shot 2016-06-21 at 2 31 45 pm](https://cloud.githubusercontent.com/assets/15318264/16247078/f6eb1c16-37bc-11e6-940a-2b595b10617c.png)
      
      Author: Junyang Qian <junyangq@databricks.com>
      Author: Junyang Qian <junyangq@Junyangs-MacBook-Pro.local>
      
      Closes #13820 from junyangq/SPARK-16107.
      ea3a12b0
    • Imran Rashid's avatar
      [SPARK-15783][CORE] Fix Flakiness in BlacklistIntegrationSuite · cf1995a9
      Imran Rashid authored
      ## What changes were proposed in this pull request?
      
      A handful of changes here -- the first two were causing failures with BlacklistIntegrationSuite:
      
      1. The testing framework didn't include the reviveOffers thread, so the test which involved delay scheduling might never submit offers late enough for the delay scheduling to kick in.  So I added in the periodic revive offers, just like the real scheduler.
      
      2. `assertEmptyDataStructures` would occasionally fail, because it appeared there was still an active job.  This is because in DAGScheduler, the jobWaiter is notified of the job completion before the data structures are cleaned up.  Most of the time the test code that is waiting on the jobWaiter won't become active until after the data structures are cleared, but occasionally the race goes the other way, and the assertions fail.
      
      3. `DAGSchedulerSuite` was not stopping all the inner parts it was setting up, so each test was leaking a number of threads.  So we stop those parts too.
      
      4. Turns out that `assertMapOutputAvailable` is not terribly useful in this framework -- most of the places I was trying to use it suffer from some race.
      
      5. When there is an exception in the backend, try to improve the error msg a little bit.  Before, the exception was printed to the console, the test would fail with a timeout, and the logs wouldn't show anything.
      
      ## How was this patch tested?
      
      I ran all the tests in `BlacklistIntegrationSuite` 5k times and everything in `DAGSchedulerSuite` 1k times on my laptop.  Also I ran a full jenkins build with `BlacklistIntegrationSuite` 500 times and `DAGSchedulerSuite` 50 times, see https://github.com/apache/spark/pull/13548.  (I tried more times but jenkins timed out.)
      
      To check for more leaked threads, I added some code to dump the list of all threads at the end of each test in DAGSchedulerSuite, which is how I discovered the mapOutputTracker and eventLoop were leaking threads.  (I removed that code from the final pr, just part of the testing.)
      
      And I'll run Jenkins on this a couple of times to do one more check.
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #13565 from squito/blacklist_extra_tests.
      cf1995a9
    • Wenchen Fan's avatar
      [SPARK-16097][SQL] Encoders.tuple should handle null object correctly · 01277d4b
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Although the top-level input object cannot be null, when we use `Encoders.tuple` to combine two encoders, their input objects are no longer top-level and can be null. We should handle this case.
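      For example (a sketch using the public `Encoders` API), the second element of a tuple may be null even though a top-level object cannot be:
      ```scala
      import org.apache.spark.sql.{Encoder, Encoders}

      // Combining two encoders: each component is no longer a top-level
      // object, so its value can legitimately be null inside the tuple.
      val enc: Encoder[(String, java.lang.Integer)] =
        Encoders.tuple(Encoders.STRING, Encoders.INT)
      // A row like ("a", null) must round-trip without a NullPointerException.
      ```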
      
      ## How was this patch tested?
      
      new test in DatasetSuite
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13807 from cloud-fan/bug.
      01277d4b
    • Yin Huai's avatar
      [SPARK-16121] ListingFileCatalog does not list in parallel anymore · 39ad53f7
      Yin Huai authored
      ## What changes were proposed in this pull request?
      It seems the fix for SPARK-14959 broke parallel partition discovery. This PR fixes the problem.
      
      ## How was this patch tested?
      Tested manually. (This PR also adds a proper test for SPARK-14959.)
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #13830 from yhuai/SPARK-16121.
      39ad53f7
    • Holden Karau's avatar
      [SPARK-15162][SPARK-15164][PYSPARK][DOCS][ML] update some pydocs · d281b0ba
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Mark ml.classification algorithms as experimental to match the Scala algorithms, update the PyDoc for thresholds on `LogisticRegression` to have the same level of info as Scala, and enable MathJax for PyDoc.
      
      ## How was this patch tested?
      
      Built docs locally & PySpark SQL tests
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #12938 from holdenk/SPARK-15162-SPARK-15164-update-some-pydocs.
      d281b0ba
    • gatorsmile's avatar
      [SPARK-15644][MLLIB][SQL] Replace SQLContext with SparkSession in MLlib · 0e3ce753
      gatorsmile authored
      #### What changes were proposed in this pull request?
      This PR is to use the latest `SparkSession` to replace the existing `SQLContext` in `MLlib`. `SQLContext` is removed from `MLlib`.
      
      Also fix a test case issue in `BroadcastJoinSuite`.
      
      BTW, `SQLContext` is not being used in the `MLlib` test suites.
      #### How was this patch tested?
      Existing test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #13380 from gatorsmile/sqlContextML.
      0e3ce753
  3. Jun 21, 2016
    • hyukjinkwon's avatar
      [SPARK-16104] [SQL] Do not create CSV writer object for every flush when writing · 7580f304
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR makes the `CsvWriter` object reusable instead of being created each time. The approach is taken from the JSON data source.
      
      The original `CsvWriter` was created for each row, but this was improved in https://github.com/apache/spark/pull/13229. However, a `CsvWriter` object is still created for each `flush()` in `LineCsvWriter`. It does not seem necessary to close the object and re-create it for every flush.
      
      It follows the original logic as is, but the `CsvWriter` is reused by resetting the `CharArrayWriter`.
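      A hedged sketch of the reuse pattern (class and method names hypothetical, modeled on the description above):
      ```scala
      import java.io.CharArrayWriter

      // Keep one buffer (and, in the real code, one CsvWriter wrapping it)
      // alive across flushes instead of closing and recreating them each time.
      class ReusableLineWriter {
        private val buffer = new CharArrayWriter()

        def write(line: String): Unit = buffer.append(line).append('\n')

        def flush(): String = {
          val content = buffer.toString
          buffer.reset() // reuse the buffer rather than rebuilding the writer
          content
        }
      }
      ```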
      
      ## How was this patch tested?
      
      Existing tests should cover this.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #13809 from HyukjinKwon/write-perf.
      7580f304
    • Xiangrui Meng's avatar
      [MINOR][MLLIB] deprecate setLabelCol in ChiSqSelectorModel · d77c4e6e
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      Deprecate `labelCol`, which is not used by ChiSqSelectorModel.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #13823 from mengxr/deprecate-setLabelCol-in-ChiSqSelectorModel.
      d77c4e6e
    • Felix Cheung's avatar
      [SQL][DOC] SQL programming guide add deprecated methods in 2.0.0 · 79aa1d82
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Doc changes
      
      ## How was this patch tested?
      
      manual
      
      liancheng
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #13827 from felixcheung/sqldocdeprecate.
      79aa1d82
    • Xiangrui Meng's avatar
      [SPARK-16118][MLLIB] add getDropLast to OneHotEncoder · 9493b079
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      We forgot the getter for `dropLast` in `OneHotEncoder`.
      
      ## How was this patch tested?
      
      unit test
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #13821 from mengxr/SPARK-16118.
      9493b079