  1. May 07, 2017
    • [SPARK-20557][SQL] Support JDBC data type Time with Time Zone · cafca54c
      Xiao Li authored
      ### What changes were proposed in this pull request?
      
      This PR adds support for the JDBC data type TIME WITH TIME ZONE, which can be converted to TIMESTAMP.
      
      In addition, before this PR, for unsupported data types, we simply output the type number instead of the type name.
      
      ```
      java.sql.SQLException: Unsupported type 2014
      ```
      After this PR, the message is like
      ```
      java.sql.SQLException: Unsupported type TIMESTAMP_WITH_TIMEZONE
      ```
      
      - Also upgrade the H2 version to `1.4.195`, which has the type fix for "TIMESTAMP WITH TIMEZONE". H2 still does not fully support the type, so we capture the exception; we keep the test because it partially exercises "TIMESTAMP WITH TIMEZONE" support, and the Docker tests are not run regularly.
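
      For reference, the readable name in the new message can be derived from the raw `java.sql.Types` constant via `java.sql.JDBCType`; a minimal sketch of that mapping (not the exact patch code):

      ```scala
      import java.sql.{JDBCType, Types}

      object UnsupportedTypeMessage {
        // Map the raw java.sql.Types integer to a readable name, falling back
        // to the bare number when the constant is unknown to JDBCType.
        def message(sqlType: Int): String = {
          val name =
            try JDBCType.valueOf(sqlType).getName
            catch { case _: IllegalArgumentException => sqlType.toString }
          s"Unsupported type $name"
        }

        def main(args: Array[String]): Unit =
          println(message(Types.TIMESTAMP_WITH_TIMEZONE)) // Unsupported type TIMESTAMP_WITH_TIMEZONE
      }
      ```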
      
      ### How was this patch tested?
      Added test cases.
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17835 from gatorsmile/h2.
  2. May 05, 2017
    • [SPARK-20614][PROJECT INFRA] Use the same log4j configuration with Jenkins in AppVeyor · b433acae
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, AppVeyor floods the console with logs. This has been fine because we can download all the logs; however, from my observations so far, the logs are truncated when there are too many. The output has grown recently and has started to get truncated. For example, see https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1209-master
      
      Even after the log is downloaded, it looks truncated as below:
      
      ```
      [00:44:21] 17/05/04 18:56:18 INFO TaskSetManager: Finished task 197.0 in stage 601.0 (TID 9211) in 0 ms on localhost (executor driver) (194/200)
      [00:44:21] 17/05/04 18:56:18 INFO Executor: Running task 199.0 in stage 601.0 (TID 9213)
      [00:44:21] 17/05/04 18:56:18 INFO Executor: Finished task 198.0 in stage 601.0 (TID 9212). 2473 bytes result sent to driver
      ...
      ```
      
      It probably looks better to use the same log4j configuration that we use for the SparkR tests in Jenkins (see https://github.com/apache/spark/blob/fc472bddd1d9c6a28e57e31496c0166777af597e/R/run-tests.sh#L26 and https://github.com/apache/spark/blob/fc472bddd1d9c6a28e57e31496c0166777af597e/R/log4j.properties):
      ```
      # Set everything to be logged to the file target/unit-tests.log
      log4j.rootCategory=INFO, file
      log4j.appender.file=org.apache.log4j.FileAppender
      log4j.appender.file.append=true
      log4j.appender.file.file=R/target/unit-tests.log
      log4j.appender.file.layout=org.apache.log4j.PatternLayout
      log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n
      
      # Ignore messages below warning level from Jetty, because it's a bit verbose
      log4j.logger.org.eclipse.jetty=WARN
      org.eclipse.jetty.LEVEL=WARN
      ```
      
      ## How was this patch tested?
      
      Manually tested with spark-test account
        - https://ci.appveyor.com/project/spark-test/spark/build/672-r-log4j (there is an example for flaky test here)
        - https://ci.appveyor.com/project/spark-test/spark/build/673-r-log4j (I re-ran the build).
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17873 from HyukjinKwon/appveyor-reduce-logs.
    • [SPARK-20616] RuleExecutor logDebug of batch results should show diff to start of batch · 5d75b14b
      Juliusz Sompolski authored
      ## What changes were proposed in this pull request?
      
      Due to a likely typo, the logDebug message printing the diff of query plans shows a diff against the initial plan rather than against the plan at the start of the batch.
      
      ## How was this patch tested?
      
      The debug message now prints the diff between the start and the end of the batch.
      
      Author: Juliusz Sompolski <julek@databricks.com>
      
      Closes #17875 from juliuszsompolski/SPARK-20616.
    • [SPARK-20557][SQL] Support for db column type TIMESTAMP WITH TIME ZONE · b31648c0
      Jannik Arndt authored
      ## What changes were proposed in this pull request?
      
      SparkSQL can now read from a database table with column type [TIMESTAMP WITH TIME ZONE](https://docs.oracle.com/javase/8/docs/api/java/sql/Types.html#TIMESTAMP_WITH_TIMEZONE).
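
      A minimal usage sketch (URL, table, and credentials are placeholders, not from the PR): after this change such a column is read as a Spark `TimestampType` column instead of failing with an "Unsupported type" error.

      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("tz-read").getOrCreate()
      val df = spark.read
        .format("jdbc")
        .option("url", "jdbc:oracle:thin:@//dbhost:1521/service") // placeholder URL
        .option("dbtable", "events") // hypothetical table with a TIMESTAMP WITH TIME ZONE column
        .option("user", "user")
        .option("password", "password")
        .load()
      df.printSchema() // the column now appears as timestamp
      ```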
      
      ## How was this patch tested?
      
      Tested against Oracle database.
      
      JoshRosen, you seem to know the class, would you look at this? Thanks!
      
      Author: Jannik Arndt <jannik@jannikarndt.de>
      
      Closes #17832 from JannikArndt/spark-20557-timestamp-with-timezone.
    • [SPARK-20603][SS][TEST] Set default number of topic partitions to 1 to reduce the load · bd578828
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      I checked the logs of https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.2-test-maven-hadoop-2.7/47/ and found it took several seconds to create the Kafka internal topic `__consumer_offsets`. Since Kafka creates this topic lazily, the creation happens in the first test, `deserialization of initial offset with Spark 2.1.0`, and causes it to time out.
      
      This PR changes `offsets.topic.num.partitions` from the default value 50 to 1 to make creating `__consumer_offsets` (50 partitions -> 1 partition) much faster.
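
      A minimal sketch of the broker-property override this amounts to (the test-harness wiring is omitted; the property key is standard Kafka):

      ```scala
      import java.util.Properties

      // Make the lazily created __consumer_offsets topic cheap to set up.
      val brokerProps = new Properties()
      brokerProps.put("offsets.topic.num.partitions", "1") // broker default is 50
      ```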
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17863 from zsxwing/fix-kafka-flaky-test.
    • [SPARK-20381][SQL] Add SQL metrics of numOutputRows for ObjectHashAggregateExec · 41439fd5
      Yucai authored
      ## What changes were proposed in this pull request?
      
      ObjectHashAggregateExec is missing the numOutputRows metric; this PR adds it.
      
      ## How was this patch tested?
      
      Added unit tests for the new metrics.
      
      Author: Yucai <yucai.yu@intel.com>
      
      Closes #17678 from yucai/objectAgg_numOutputRows.
    • [SPARK-20613] Remove excess quotes in Windows executable · b9ad2d19
      Jarrett Meyer authored
      ## What changes were proposed in this pull request?
      
      Quotes are already added to the RUNNER variable on line 54. There is no need to put quotes on line 67. If you do, you will get an error when launching Spark:

      ```
      '""C:\Program' is not recognized as an internal or external command, operable program or batch file.
      ```
      
      ## How was this patch tested?
      
      Tested manually on Windows 10.
      
      Author: Jarrett Meyer <jarrettmeyer@gmail.com>
      
      Closes #17861 from jarrettmeyer/fix-windows-cmd.
    • [SPARK-20495][SQL][CORE] Add StorageLevel to cacheTable API · 9064f1b0
      madhu authored
      ## What changes were proposed in this pull request?
      Currently the cacheTable API only supports MEMORY_AND_DISK. This PR adds an additional API that takes a user-specified storage level.
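
      A usage sketch of the extended API (assuming the new overload takes a `StorageLevel`, as described; the table name is a placeholder):

      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.storage.StorageLevel

      val spark = SparkSession.builder().master("local[*]").appName("cache").getOrCreate()
      // Existing API: caches with the default storage level.
      spark.catalog.cacheTable("my_table")
      // New API per this PR: pick the storage level explicitly.
      spark.catalog.cacheTable("my_table", StorageLevel.MEMORY_ONLY)
      ```
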
      ## How was this patch tested?
      unit tests
      
      Author: madhu <phatak.dev@gmail.com>
      
      Closes #17802 from phatak-dev/cacheTableAPI.
    • [SPARK-20546][DEPLOY] spark-class gets syntax error in posix mode · 5773ab12
      jyu00 authored
      ## What changes were proposed in this pull request?
      
      Updated spark-class to turn off posix mode so the process substitution doesn't cause a syntax error.
      
      ## How was this patch tested?
      
      Existing unit tests, manual spark-shell testing with posix mode on
      
      Author: jyu00 <jessieyu@us.ibm.com>
      
      Closes #17852 from jyu00/master.
    • [SPARK-19660][SQL] Replace the deprecated property name fs.default.name to fs.defaultFS that newly introduced · 37cdf077
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      
      Replace the deprecated property name `fs.default.name` with the newly introduced `fs.defaultFS`.
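
      For illustration at the Hadoop configuration level (the namenode URI is a placeholder):

      ```scala
      import org.apache.hadoop.conf.Configuration

      val conf = new Configuration()
      // Deprecated key, still honored by Hadoop but warned about:
      //   conf.set("fs.default.name", "hdfs://namenode:8020")
      // Newly introduced replacement, used after this change:
      conf.set("fs.defaultFS", "hdfs://namenode:8020")
      ```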
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #17856 from wangyum/SPARK-19660.
    • [INFRA] Close stale PRs · 4411ac70
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to close a stale PR, several PRs that a committer suggested closing, and obviously inappropriate PRs.
      
      Closes #11119
      Closes #17853
      Closes #17732
      Closes #17456
      Closes #17410
      Closes #17314
      Closes #17362
      Closes #17542
      
      ## How was this patch tested?
      
      N/A
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17855 from HyukjinKwon/close-pr.
  3. May 04, 2017
    • [SPARK-20574][ML] Allow Bucketizer to handle non-Double numeric column · 0d16faab
      Wayne Zhang authored
      ## What changes were proposed in this pull request?
      Bucketizer currently requires the input column to be Double, but the logic should work on any numeric data type. Many practical problems have integer/float data, and it can get very tedious to manually cast these columns to Double before calling Bucketizer. This PR extends Bucketizer to handle all numeric types.
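
      A usage sketch under the new behavior (column name and splits are illustrative): the integer column is bucketized directly, with no manual cast to Double.

      ```scala
      import org.apache.spark.ml.feature.Bucketizer
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().master("local[*]").appName("bucketizer").getOrCreate()
      val data = spark.createDataFrame(Seq(Tuple1(1), Tuple1(5), Tuple1(10))).toDF("age") // IntegerType

      val bucketizer = new Bucketizer()
        .setInputCol("age") // previously this had to be DoubleType
        .setOutputCol("ageBucket")
        .setSplits(Array(Double.NegativeInfinity, 3.0, 7.0, Double.PositiveInfinity))

      bucketizer.transform(data).show()
      ```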
      
      ## How was this patch tested?
      New test.
      
      Author: Wayne Zhang <actuaryzhang@uber.com>
      
      Closes #17840 from actuaryzhang/bucketizer.
    • [SPARK-20566][SQL] ColumnVector should support `appendFloats` for array · bfc8c79c
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR aims to add a missing `appendFloats` API for arrays to the **ColumnVector** class. For the double type, there is an `appendDoubles` for arrays [here](https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java#L818-L824).
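
      A self-contained sketch of the append-array pattern such an API follows (illustrative, not the `ColumnVector` code itself): reserve capacity, bulk-copy the source range, and advance the element count.

      ```scala
      final class FloatVector(initialCapacity: Int = 16) {
        private var data = new Array[Float](initialCapacity)
        private var elementsAppended = 0

        // Append `length` floats from `src` starting at `offset`; returns the
        // index at which the first appended element landed.
        def appendFloats(src: Array[Float], offset: Int, length: Int): Int = {
          reserve(elementsAppended + length)
          System.arraycopy(src, offset, data, elementsAppended, length)
          val start = elementsAppended
          elementsAppended += length
          start
        }

        private def reserve(required: Int): Unit =
          if (required > data.length) {
            val grown = new Array[Float](math.max(required, data.length * 2))
            System.arraycopy(data, 0, grown, 0, elementsAppended)
            data = grown
          }
      }
      ```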
      
      ## How was this patch tested?
      
      Pass the Jenkins with a newly added test case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #17836 from dongjoon-hyun/SPARK-20566.
    • [SPARK-20047][FOLLOWUP][ML] Constrained Logistic Regression follow up · c5dceb8c
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Address some minor comments for #17715:
      * Put bound-constrained optimization params under expertParams.
      * Update some docs.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #17829 from yanboliang/spark-20047-followup.
    • [SPARK-20571][SPARKR][SS] Flaky Structured Streaming tests · 57b64703
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Make the tests more reliable by having them wait until the data is processed. Increasing the timeout value might help, but ultimately the flakiness from processing delay when Jenkins is under load is hard to account for. The waiting mechanism used here isn't an actual supported public API.
      
      ## How was this patch tested?
      unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17857 from felixcheung/rsstestrelia.
    • [SPARK-20544][SPARKR] R wrapper for input_file_name · f21897fc
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Adds wrapper for `o.a.s.sql.functions.input_file_name`
      
      ## How was this patch tested?
      
      Existing unit tests, additional unit tests, `check-cran.sh`.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17818 from zero323/SPARK-20544.
    • [SPARK-20585][SPARKR] R generic hint support · 9c36aa27
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Adds support for generic hints on `SparkDataFrame`
      
      ## How was this patch tested?
      
      Unit tests, `check-cran.sh`
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17851 from zero323/SPARK-20585.
    • [SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example · b8302ccd
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add
      - R vignettes
      - R programming guide
      - SS programming guide
      - R example
      
      Also disable spark.als in vignettes for now since it's failing (SPARK-20402)
      
      ## How was this patch tested?
      
      manually
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17814 from felixcheung/rdocss.
  4. May 03, 2017
    • [SPARK-20543][SPARKR] skip tests when running on CRAN · fc472bdd
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      General rule on whether to skip:
      skip if
      - RDD tests
      - tests that could run long or are complicated (streaming, hivecontext)
      - tests of error conditions
      - tests that won't likely change/break
      
      ## How was this patch tested?
      
      unit tests, `R CMD check --as-cran`, `R CMD check`
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17817 from felixcheung/rskiptest.
    • [SPARK-20584][PYSPARK][SQL] Python generic hint support · 02bbe731
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Adds `hint` method to PySpark `DataFrame`.
      
      ## How was this patch tested?
      
      Unit tests, doctests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17850 from zero323/SPARK-20584.
    • [MINOR][SQL] Fix the test title from =!= to <=>, remove a duplicated test and add a test for =!= · 13eb37c8
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes three things as below:
      
      - This test does not appear to test `<=>` and is identical to the `===` test above it, so this PR removes it.
      
        ```diff
        -   test("<=>") {
        -     checkAnswer(
        -      testData2.filter($"a" === 1),
        -      testData2.collect().toSeq.filter(r => r.getInt(0) == 1))
        -
        -    checkAnswer(
        -      testData2.filter($"a" === $"b"),
        -      testData2.collect().toSeq.filter(r => r.getInt(0) == r.getInt(1)))
        -   }
        ```
      
      - Rename the test title from `=!=` to `<=>`, since the test appears to actually test `<=>`.
      
        ```diff
        +  private lazy val nullData = Seq(
        +    (Some(1), Some(1)), (Some(1), Some(2)), (Some(1), None), (None, None)).toDF("a", "b")
        +
          ...
        -  test("=!=") {
        +  test("<=>") {
        -    val nullData = spark.createDataFrame(sparkContext.parallelize(
        -      Row(1, 1) ::
        -      Row(1, 2) ::
        -      Row(1, null) ::
        -      Row(null, null) :: Nil),
        -      StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType))))
        -
               checkAnswer(
                 nullData.filter($"b" <=> 1),
          ...
        ```
      
      - Add tests for `=!=`, which do not appear to exist.
      
        ```diff
        +  test("=!=") {
        +    checkAnswer(
        +      nullData.filter($"b" =!= 1),
        +      Row(1, 2) :: Nil)
        +
        +    checkAnswer(nullData.filter($"b" =!= null), Nil)
        +
        +    checkAnswer(
        +      nullData.filter($"a" =!= $"b"),
        +      Row(1, 2) :: Nil)
        +  }
        ```
      
      ## How was this patch tested?
      
      Manually running the tests.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17842 from HyukjinKwon/minor-test-fix.
    • [SPARK-19965][SS] DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output · 6b9e49d1
      Liwei Lin authored
      ## The Problem
      
      Right now DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output:
      
      ```
      [info] - partitioned writing and batch reading with 'basePath' *** FAILED *** (3 seconds, 928 milliseconds)
      [info]   java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
      [info] 	***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637
      [info] 	***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637/_spark_metadata
      [info]
      [info] If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
      [info]   at scala.Predef$.assert(Predef.scala:170)
      [info]   at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:133)
      [info]   at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:98)
      [info]   at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:156)
      [info]   at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:54)
      [info]   at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:55)
      [info]   at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:133)
      [info]   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
      [info]   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:160)
      [info]   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:536)
      [info]   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:520)
      [info]   at org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply$mcV$sp(FileStreamSinkSuite.scala:292)
      [info]   at org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268)
      [info]   at org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268)
      ```
      
      ## What changes were proposed in this pull request?
      
      This patch alters `InMemoryFileIndex` to filter out `basePath`s that have the streaming metadata dir (`_spark_metadata`) as an ancestor (see the sketch after this list). E.g., the following and other similar dirs or files will be filtered out:
      - (introduced by globbing `basePath/*`)
         - `basePath/_spark_metadata`
      - (introduced by globbing `basePath/*/*`)
         - `basePath/_spark_metadata/0`
         - `basePath/_spark_metadata/1`
         - ...
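
      A hedged sketch of the ancestor check (the helper name is illustrative, not the actual `InMemoryFileIndex` code):

      ```scala
      import org.apache.hadoop.fs.Path

      // Walk up the path; any component named _spark_metadata disqualifies it.
      def hasMetadataAncestor(path: Path, metadataDir: String = "_spark_metadata"): Boolean = {
        var current = path
        while (current != null) {
          if (current.getName == metadataDir) return true
          current = current.getParent
        }
        false
      }

      hasMetadataAncestor(new Path("/base/output/_spark_metadata/0")) // true
      hasMetadataAncestor(new Path("/base/output/part=1"))            // false
      ```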
      
      ## How was this patch tested?
      
      Added unit tests
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #17346 from lw-lin/filter-metadata.
    • [SPARK-20576][SQL] Support generic hint function in Dataset/DataFrame · 527fc5d0
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      We allow users to specify hints (currently only "broadcast" is supported) in SQL and DataFrame. However, while SQL has a standard hint format (`/*+ ... */`), DataFrame doesn't have one, and users are sometimes confused because they can't find how to apply a broadcast hint. This ticket adds a generic hint function on DataFrame that allows using the same hints on DataFrames as in SQL.
      
      As an example, after this patch, the following will apply a broadcast hint on a DataFrame using the new hint function:
      
      ```
      df1.join(df2.hint("broadcast"))
      ```
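
      A usage sketch contrasting the two forms (assumes a `spark` session and DataFrames `df1`/`df2` with an `id` column; `BROADCAST` is one of the SQL hint names supported at the time):

      ```scala
      df1.createOrReplaceTempView("t1")
      df2.createOrReplaceTempView("t2")

      // SQL: the standard hint comment syntax.
      spark.sql("SELECT /*+ BROADCAST(t2) */ * FROM t1 JOIN t2 ON t1.id = t2.id")

      // DataFrame: the new generic hint function, same effect.
      df1.join(df2.hint("broadcast"), "id")
      ```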
      
      ## How was this patch tested?
      Added a test case in DataFrameJoinSuite.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #17839 from rxin/SPARK-20576.
    • [SPARK-20441][SPARK-20432][SS] Within the same streaming query, one StreamingRelation should only be transformed to one StreamingExecutionRelation · 27f543b1
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      Within the same streaming query, when one `StreamingRelation` is referred to multiple times – e.g. `df.union(df)` – we should transform it to only one `StreamingExecutionRelation`, instead of two or more different `StreamingExecutionRelation`s (each of which would have a separate set of sources, source logs, ...).
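
      A hedged, type-simplified sketch of the dedup idea (not the actual `StreamExecution` code): memoize the rewrite so repeated occurrences map to a single instance.

      ```scala
      import scala.collection.mutable

      val memo = mutable.Map.empty[String, String]
      def rewrite(relation: String): String =
        memo.getOrElseUpdate(relation, s"StreamingExecutionRelation($relation)")

      rewrite("source-A") // first occurrence: creates the execution relation
      rewrite("source-A") // second occurrence: reuses it, no extra source/log
      ```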
      
      ## How was this patch tested?
      
      Added two test cases, each of which would fail without this patch.
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #17735 from lw-lin/SPARK-20441.
    • [SPARK-16957][MLLIB] Use midpoints for split values. · 7f96f2d7
      Yan Facai (颜发才) authored
      ## What changes were proposed in this pull request?
      
      Use midpoints between adjacent distinct feature values as split values for now; later this could be made weighted.
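
      A worked sketch of the idea (values illustrative): candidate splits become midpoints between adjacent distinct feature values rather than the values themselves.

      ```scala
      val distinctValues = Array(1.0, 2.0, 4.0, 8.0) // sorted distinct feature values
      val splits = distinctValues.sliding(2).map { case Array(lo, hi) => (lo + hi) / 2.0 }.toArray
      // splits: Array(1.5, 3.0, 6.0)
      ```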
      
      ## How was this patch tested?
      
      + [x] add unit test.
      + [x] revise Split's unit test.
      
      Author: Yan Facai (颜发才) <facai.yan@gmail.com>
      Author: 颜发才(Yan Facai) <facai.yan@gmail.com>
      
      Closes #17556 from facaiy/ENH/decision_tree_overflow_and_precision_in_aggregation.
    • [SPARK-20523][BUILD] Clean up build warnings for 2.2.0 release · 16fab6b0
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fix build warnings, primarily related to Breeze 0.13 operator changes and Java style problems.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17803 from srowen/SPARK-20523.
    • [SPARK-6227][MLLIB][PYSPARK] Implement PySpark wrappers for SVD and PCA (v2) · db2fb84b
      MechCoder authored
      Add PCA and SVD to PySpark's wrappers for `RowMatrix` and `IndexedRowMatrix` (SVD only).
      
      Based on #7963, updated.
      
      ## How was this patch tested?
      
      New doc tests and unit tests. Ran all examples locally.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #17621 from MLnick/SPARK-6227-pyspark-svd-pca.
    • [SPARK-20567] Lazily bind in GenerateExec · 6235132a
      Michael Armbrust authored
      It is not valid to eagerly bind with the child's output as this causes failures when we attempt to canonicalize the plan (replacing the attribute references with dummies).
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #17838 from marmbrus/fixBindExplode.
  5. May 02, 2017
    • [SPARK-20558][CORE] clear InheritableThreadLocal variables in SparkContext when stopping it · b946f316
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      To better understand this problem, let's take a look at an example first:
      ```
      object Main {
        def main(args: Array[String]): Unit = {
          var t = new Test
          new Thread(new Runnable {
            override def run() = {}
          }).start()
          println("first thread finished")
      
          t.a = null
          t = new Test
          new Thread(new Runnable {
            override def run() = {}
          }).start()
        }
      
      }
      
      class Test {
        var a = new InheritableThreadLocal[String] {
          override protected def childValue(parent: String): String = {
            println("parent value is: " + parent)
            parent
          }
        }
        a.set("hello")
      }
      ```
      The result is:
      ```
      parent value is: hello
      first thread finished
      parent value is: hello
      parent value is: hello
      ```
      
      Once an `InheritableThreadLocal` has been given a value, child threads will inherit that value as long as it has not been GCed, so setting the variable that holds the `InheritableThreadLocal` to `null` doesn't work as we expected.

      In `SparkContext`, we have an `InheritableThreadLocal` for local properties; we should clear it when stopping `SparkContext`, or all future child threads will still inherit it, copy the properties, and waste memory.

      This is the root cause of https://issues.apache.org/jira/browse/SPARK-20548, which creates/stops `SparkContext` many times, eventually leaves a lot of `InheritableThreadLocal`s alive, and causes OOM when starting new threads in the internal thread pools.
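
      A hedged sketch of the remedy (names illustrative, not the actual `SparkContext` code): remove the value from the stopping thread's inheritable map, so threads started afterwards no longer inherit it.

      ```scala
      class PropertiesHolder {
        private val localProperties = new InheritableThreadLocal[java.util.Properties] {
          override def initialValue(): java.util.Properties = new java.util.Properties()
        }

        def stop(): Unit = {
          // Nulling a field would not help (see the example above); the entry
          // must be removed from the thread's inheritable map.
          localProperties.remove()
        }
      }
      ```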
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17833 from cloud-fan/core.
    • [SPARK-20421][CORE] Add a missing deprecation tag. · ef3df912
      Marcelo Vanzin authored
      In the previous patch I deprecated StorageStatus, but not the
      method in SparkContext that exposes that class publicly. So deprecate
      the method too.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #17824 from vanzin/SPARK-20421.
    • [SPARK-20490][SPARKR][DOC] add family tag for not function · 13f47dc5
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      doc only
      
      ## How was this patch tested?
      
      manual
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17828 from felixcheung/rnotfamily.
    • [SPARK-19235][SQL][TEST][FOLLOW-UP] Enable Test Cases in DDLSuite with Hive Metastore · b1e639ab
      Xiao Li authored
      ### What changes were proposed in this pull request?
      This is a follow-up of enabling test cases in DDLSuite with Hive Metastore. It consists of the following remaining tasks:
      - Run all the `alter table` and `drop table` DDL tests against data source tables when using Hive metastore.
      - Do not run any `alter table` and `drop table` DDL test against Hive serde tables when using InMemoryCatalog.
      - Reenable `alter table: set serde partition` and `alter table: set serde` tests for Hive serde tables.
      
      ### How was this patch tested?
      N/A
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17524 from gatorsmile/cleanupDDLSuite.
    • [SPARK-20300][ML][PYSPARK] Python API for ALSModel.recommendForAllUsers,Items · e300a5a1
      Nick Pentreath authored
      Add Python API for `ALSModel` methods `recommendForAllUsers`, `recommendForAllItems`
      
      ## How was this patch tested?
      
      New doc tests.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #17622 from MLnick/SPARK-20300-pyspark-recall.
    • [SPARK-20549] java.io.CharConversionException: Invalid UTF-32' in JsonToStructs · 86174ea8
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      A fix for the same problem was made in #17693 but ignored `JsonToStructs`. This PR uses the same fix for `JsonToStructs`.
      
      ## How was this patch tested?
      
      Regression test
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #17826 from brkyvz/SPARK-20549.
    • [SPARK-20537][CORE] Fixing OffHeapColumnVector reallocation · afb21bf2
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      As #17773 revealed, `OnHeapColumnVector` may copy only part of the original storage on reallocation.

      `OffHeapColumnVector` reallocation likewise copies data to the new storage only up to `elementsAppended`. This variable is only updated by the `ColumnVector.appendX` APIs, while `ColumnVector.putX` is more commonly used. This PR makes `OffHeapColumnVector` copy data to the new storage up to the previously allocated size.
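
      A hedged sketch of the fix's idea (illustrative, not the actual vectorized code): on reallocation, copy the previously allocated capacity rather than only `elementsAppended`, since `putX`-style writes do not advance that counter.

      ```scala
      // Grow a buffer, preserving everything that may have been written via putX.
      def grow(old: Array[Byte], usedCapacity: Int, newCapacity: Int): Array[Byte] = {
        val fresh = new Array[Byte](newCapacity)
        System.arraycopy(old, 0, fresh, 0, usedCapacity) // was: only elementsAppended bytes
        fresh
      }
      ```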
      
      ## How was this patch tested?
      
      Existing test suites
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #17811 from kiszk/SPARK-20537.