  1. Aug 15, 2016
  2. Aug 14, 2016
    • Zhenglai Zhang's avatar
      [WIP][MINOR][TYPO] Fix several trivial typos · 2a3d286f
      Zhenglai Zhang authored
      ## What changes were proposed in this pull request?
      
      * Fixed the typo `"overriden"` to `"overridden"`, and made sure there are no other occurrences of the same typo.
      * Fixed the typo `"lowcase"` to `"lowercase"`, and made sure there are no other occurrences of the same typo.
      
      ## How was this patch tested?
      
      Since the change is very small, I just made sure the compilation is successful.
      I am new to the Spark community, so please feel free to let me know if there are other necessary steps.
      
      Thanks in advance!
      
      ----
      Updated: found another `lowcase` typo later and fixed it in the same patch.
      
      Author: Zhenglai Zhang <zhenglaizhang@hotmail.com>
      
      Closes #14622 from zhenglaizhang/fixtypo.
      2a3d286f
    • zero323's avatar
      [SPARK-17027][ML] Avoid integer overflow in PolynomialExpansion.getPolySize · 0ebf7c1b
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Replaces custom choose function with o.a.commons.math3.CombinatoricsUtils.binomialCoefficient
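
      A hedged sketch (not the actual PolynomialExpansion code; the helper names are made up) of why the replacement matters: a naive `Int`-based choose silently wraps around for large feature counts, while `CombinatoricsUtils.binomialCoefficient` returns an exact `Long` and throws `MathArithmeticException` instead of overflowing silently.

      ```scala
      import org.apache.commons.math3.util.CombinatoricsUtils

      // Naive Int accumulator: exact while it fits, then silently wraps around.
      def naivePolySize(numFeatures: Int, degree: Int): Int = {
        var result = 1
        for (i <- 1 to degree) {
          result = result * (numFeatures + i) / i  // stepwise exact, but Int overflows
        }
        result
      }

      // Overflow-safe: exact Long result, throws rather than wrapping.
      def safePolySize(numFeatures: Int, degree: Int): Long =
        CombinatoricsUtils.binomialCoefficient(numFeatures + degree, degree)

      // e.g. naivePolySize(3000, 3) overflows into a wrong value,
      // while safePolySize(3000, 3) returns the exact count.
      ```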
      
      ## How was this patch tested?
      
      Spark unit tests
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #14614 from zero323/SPARK-17027.
      0ebf7c1b
  3. Aug 13, 2016
    • Sean Owen's avatar
      [SPARK-16966][SQL][CORE] App Name is a randomUUID even when "spark.app.name" exists · cdaa562c
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Don't override app name specified in `SparkConf` with a random app name. Only set it if the conf has no app name even after options have been applied.
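
      A minimal sketch of the intended behavior (illustrative; not the exact `SparkSession` builder code, and the helper name is made up): only fall back to a generated name when no app name was supplied at all.

      ```scala
      import java.util.UUID
      import org.apache.spark.SparkConf

      // Respect an explicitly configured app name; generate one only as a last resort.
      def ensureAppName(conf: SparkConf): SparkConf = {
        if (!conf.contains("spark.app.name")) {
          conf.setAppName("spark-app-" + UUID.randomUUID().toString)
        }
        conf
      }
      ```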
      
      See also https://github.com/apache/spark/pull/14602
      This is similar to Sherry302 's original proposal in https://github.com/apache/spark/pull/14556
      
      ## How was this patch tested?
      
      Jenkins test, with new case reproducing the bug
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14630 from srowen/SPARK-16966.2.
      cdaa562c
    • Luciano Resende's avatar
      [SPARK-17023][BUILD] Upgrade to Kafka 0.10.0.1 release · 67f025d9
      Luciano Resende authored
      ## What changes were proposed in this pull request?
      Update Kafka streaming connector to use Kafka 0.10.0.1 release
      
      ## How was this patch tested?
      Tested via Spark unit and integration tests
      
      Author: Luciano Resende <lresende@apache.org>
      
      Closes #14606 from lresende/kafka-upgrade.
      67f025d9
    • GraceH's avatar
      [SPARK-16968] Add additional options in jdbc when creating a new table · 8c8acdec
      GraceH authored
      ## What changes were proposed in this pull request?
      
      In this PR, we allow the user to add additional options when creating a new table with the JDBC writer.
      The options can be table options or partition options,
      e.g. "CREATE TABLE t (name string) ENGINE=InnoDB DEFAULT CHARSET=utf8".
      
      Here is the usage example:
      ```
      df.write.option("createTableOptions", "ENGINE=InnoDB DEFAULT CHARSET=utf8").jdbc(...)
      ```
      ## How was this patch tested?
      
      Test results will be added soon.
      
      Author: GraceH <93113783@qq.com>
      
      Closes #14559 from GraceH/jdbc_options.
      8c8acdec
    • Xin Ren's avatar
      [MINOR][CORE] fix warnings on deprecated methods in... · 7f7133bd
      Xin Ren authored
      [MINOR][CORE] fix warnings on deprecated methods in MesosClusterSchedulerSuite and DiskBlockObjectWriterSuite
      
      ## What changes were proposed in this pull request?
      
      Fixed warnings below after scanning through warnings during build:
      
      ```
      [warn] /home/jenkins/workspace/SparkPullRequestBuilder/core/src/test/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterSchedulerSuite.scala:34: imported `Utils' is permanently hidden by definition of object Utils in package mesos
      [warn] import org.apache.spark.scheduler.cluster.mesos.Utils
      [warn]                                                 ^
      ```
      
      and
      ```
      [warn] /home/jenkins/workspace/SparkPullRequestBuilder/core/src/test/scala/org/apache/spark/storage/DiskBlockObjectWriterSuite.scala:113: method shuffleBytesWritten in class ShuffleWriteMetrics is deprecated: use bytesWritten instead
      [warn]     assert(writeMetrics.shuffleBytesWritten === file.length())
      [warn]                         ^
      [warn] /home/jenkins/workspace/SparkPullRequestBuilder/core/src/test/scala/org/apache/spark/storage/DiskBlockObjectWriterSuite.scala:119: method shuffleBytesWritten in class ShuffleWriteMetrics is deprecated: use bytesWritten instead
      [warn]     assert(writeMetrics.shuffleBytesWritten === file.length())
      [warn]                         ^
      [warn] /home/jenkins/workspace/SparkPullRequestBuilder/core/src/test/scala/org/apache/spark/storage/DiskBlockObjectWriterSuite.scala:131: method shuffleBytesWritten in class ShuffleWriteMetrics is deprecated: use bytesWritten instead
      [warn]     assert(writeMetrics.shuffleBytesWritten === file.length())
      [warn]                         ^
      [warn] /home/jenkins/workspace/SparkPullRequestBuilder/core/src/test/scala/org/apache/spark/storage/DiskBlockObjectWriterSuite.scala:135: method shuffleBytesWritten in class ShuffleWriteMetrics is deprecated: use bytesWritten instead
      [warn]     assert(writeMetrics.shuffleBytesWritten === file.length())
      [warn]                         ^
      ```
      
      ## How was this patch tested?
      
      Tested manually on local laptop.
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #14609 from keypointt/suiteWarnings.
      7f7133bd
    • Jagadeesan's avatar
      [SPARK-12370][DOCUMENTATION] Documentation should link to examples … · e46cb78b
      Jagadeesan authored
      ## What changes were proposed in this pull request?
      
      When documentation is built, it should reference examples from the same build. There are times when the docs have links that point to files at the GitHub head, which may not be valid for the current release. Changed the URLs to point to the right tag in git using ```SPARK_VERSION_SHORT```.
      
      …from its own release version] [Streaming programming guide]
      
      Author: Jagadeesan <as2@us.ibm.com>
      
      Closes #14596 from jagadeesanas2/SPARK-12370.
      e46cb78b
  4. Aug 12, 2016
    • WeichenXu's avatar
      [DOC] add config option spark.ui.enabled into document · 91f2735a
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      The configuration doc is missing the config option `spark.ui.enabled` (default value is `true`).
      I think this option is important because in many cases we would like to turn the UI off,
      so I added it.
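
      A minimal usage sketch for reference (assuming a plain `SparkConf`; equivalently, pass `--conf spark.ui.enabled=false` to `spark-submit`):

      ```scala
      import org.apache.spark.SparkConf

      // Disable the web UI, e.g. for short-lived batch jobs on a busy cluster.
      val conf = new SparkConf()
        .setAppName("no-ui-app")
        .set("spark.ui.enabled", "false")
      ```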
      
      ## How was this patch tested?
      
      N/A
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14604 from WeichenXu123/add_doc_param_spark_ui_enabled.
      91f2735a
    • Dongjoon Hyun's avatar
      [SPARK-16771][SQL] WITH clause should not fall into infinite loop. · 2a105134
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR changes the CTE resolving rule to use only **forward-declared** tables in order to prevent infinite loops. More specifically, the new logic is as follows.
      
      * Resolve CTEs in `WITH` clauses first before replacing the main SQL body.
      * When resolving CTEs, only forward-declared CTEs or base tables are referenced.
        - Self-referencing is not allowed any more.
        - Cross-referencing is not allowed any more.
      
      **Reported Error Scenarios**
      ```scala
      scala> sql("WITH t AS (SELECT 1 FROM t) SELECT * FROM t")
      java.lang.StackOverflowError
      ...
      scala> sql("WITH t1 AS (SELECT * FROM t2), t2 AS (SELECT 2 FROM t1) SELECT * FROM t1, t2")
      java.lang.StackOverflowError
      ...
      ```
      Note that `t`, `t1`, and `t2` are not declared in the database. Spark falls into an infinite loop before resolving the table names.
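
      For contrast, an illustrative query (not from the PR) that only references the earlier-declared CTE and therefore still resolves under the new rule:

      ```scala
      // t2 refers only to t1, which is declared before it, so resolution terminates.
      scala> sql("WITH t1 AS (SELECT 1 AS id), t2 AS (SELECT id + 1 AS id FROM t1) SELECT * FROM t2").show()
      ```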
      
      ## How was this patch tested?
      
      Pass the Jenkins tests with two new test cases.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14397 from dongjoon-hyun/SPARK-16771-TREENODE.
      2a105134
    • Yanbo Liang's avatar
      [SPARK-17033][ML][MLLIB] GaussianMixture should use treeAggregate to improve performance · bbae20ad
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      ```GaussianMixture``` should use ```treeAggregate``` rather than ```aggregate``` to improve performance and scalability. In my test on a dataset with 200 features and 1M instances, I found a 20% performance improvement.
      We should also destroy the broadcast variable ```compute``` at the end of each iteration.
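
      A minimal sketch of the `aggregate` -> `treeAggregate` switch (not the GaussianMixture internals): `treeAggregate` has the same `(zeroValue)(seqOp, combOp)` shape but merges partial results in a multi-level tree, so less data is pulled back to the driver for wide datasets.

      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().master("local[*]").appName("treeAggregate-sketch").getOrCreate()
      val data = spark.sparkContext.parallelize(1 to 1000000, numSlices = 100)

      val sum = data.treeAggregate(0L)(
        (acc, x) => acc + x,   // seqOp: fold records into a per-partition accumulator
        (a, b) => a + b,       // combOp: merge accumulators pairwise up the tree
        depth = 2              // levels of intermediate aggregation before the driver
      )
      ```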
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #14621 from yanboliang/spark-17033.
      bbae20ad
    • gatorsmile's avatar
      [SPARK-16598][SQL][TEST] Added a test case for verifying the table identifier parsing · 79e2caa1
      gatorsmile authored
      #### What changes were proposed in this pull request?
      So far, the test cases of `TableIdentifierParserSuite` do not cover the quoted cases. We should add one to avoid regressions.
      
      #### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14244 from gatorsmile/quotedIdentifiers.
      79e2caa1
    • hyukjinkwon's avatar
      [MINOR][DOC] Fix style in examples across documentation · f4482225
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the documentation as below:
      
        -  Python has 4 spaces and Java and Scala have 2 spaces (See https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide).
      
        - Avoid excessive parentheses and curly braces for anonymous functions. (See https://github.com/databricks/scala-style-guide#anonymous)
      
      ## How was this patch tested?
      
      N/A
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14593 from HyukjinKwon/minor-documentation.
      f4482225
    • hongshen's avatar
      [SPARK-16985] Change dataFormat from yyyyMMddHHmm to yyyyMMddHHmmss · 993923c8
      hongshen authored
      ## What changes were proposed in this pull request?
      
      In our cluster, the SQL output is sometimes overwritten. When I submit several SQL jobs that all insert into the same table and each takes less than one minute, here is what happens:
      1. sql1, 11:03, insert into table.
      2. sql2, 11:04:11, insert into table.
      3. sql3, 11:04:48, insert into table.
      4. sql4, 11:05, insert into table.
      5. sql5, 11:06, insert into table.
      sql3's output file overwrites sql2's output file. Here is the log:
      ```
      16/05/04 11:04:11 INFO hive.SparkHiveHadoopWriter: XXfinalPath=hdfs://tl-sng-gdt-nn-tdw.tencent-distribute.com:54310/tmp/assorz/tdw-tdwadmin/20160504/04559505496526517_-1_1204544348/10000/_tmp.p_20160428/attempt_201605041104_0001_m_000000_1
      
      16/05/04 11:04:48 INFO hive.SparkHiveHadoopWriter: XXfinalPath=hdfs://tl-sng-gdt-nn-tdw.tencent-distribute.com:54310/tmp/assorz/tdw-tdwadmin/20160504/04559505496526517_-1_212180468/10000/_tmp.p_20160428/attempt_201605041104_0001_m_000000_1
      
      ```
      
      The reason is that the output file name uses SimpleDateFormat("yyyyMMddHHmm"): if two SQL jobs insert into the same table within the same minute, the output is overwritten. I think we should change the dateFormat to "yyyyMMddHHmmss"; in our cluster, we cannot finish a SQL job within one second.
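
      An illustrative check of the collision (the timestamps are arbitrary, 37 seconds apart; this is not the SparkHiveHadoopWriter code): with minute resolution the two stamps collide, with second resolution they do not.

      ```scala
      import java.text.SimpleDateFormat
      import java.util.Date

      val t1 = new Date(0L)           // some instant
      val t2 = new Date(37 * 1000L)   // 37 seconds later, within the same minute

      val byMinute = new SimpleDateFormat("yyyyMMddHHmm")
      val bySecond = new SimpleDateFormat("yyyyMMddHHmmss")

      byMinute.format(t1) == byMinute.format(t2)  // true  -> later job overwrites the earlier one
      bySecond.format(t1) == bySecond.format(t2)  // false -> distinct output paths
      ```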
      
      ## How was this patch tested?
      
      
      Author: hongshen <shenh062326@126.com>
      
      Closes #14574 from shenh062326/SPARK-16985.
      993923c8
    • petermaxlee's avatar
      [SPARK-17013][SQL] Parse negative numeric literals · 00e103a6
      petermaxlee authored
      ## What changes were proposed in this pull request?
      This patch updates the SQL parser to parse negative numeric literals as numeric literals, instead of unary minus of positive literals.
      
      This allows the parser to parse the minimal value for each data type, e.g. "-32768S".
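
      An illustrative note on why this matters for the minimal values (not parser code): `-32768` is a valid `Short`, but `+32768` is not, so treating `-32768S` as unary minus applied to the positive literal `32768S` cannot work because the positive literal is already out of range.

      ```scala
      // -32768 fits in a Short only when parsed as a single negative literal.
      val ok: Short = -32768        // Short.MinValue, representable
      // val bad: Short = 32768     // does not compile: value is out of Short range
      ```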
      
      ## How was this patch tested?
      Updated test cases.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #14608 from petermaxlee/SPARK-17013.
      00e103a6
    • Dongjoon Hyun's avatar
      [SPARK-16975][SQL] Column-partition path starting '_' should be handled correctly · abff92bf
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, Spark ignores path names starting with an underscore `_` or a dot `.`. This causes read failures for column-partitioned file data sources whose partition column names start with '_', e.g. `_col`.
      
      **Before**
      ```scala
      scala> spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
      scala> spark.read.parquet("/tmp/parquet")
      org.apache.spark.sql.AnalysisException: Unable to infer schema for ParquetFormat at /tmp/parquet20. It must be specified manually;
      ```
      
      **After**
      ```scala
      scala> spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
      scala> spark.read.parquet("/tmp/parquet")
      res2: org.apache.spark.sql.DataFrame = [id: bigint, _locality_code: int]
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins with a new test case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14585 from dongjoon-hyun/SPARK-16975-PARQUET.
      abff92bf
    • Yanbo Liang's avatar
      [MINOR][ML] Rename TreeEnsembleModels to TreeEnsembleModel for PySpark · ccc6dc0f
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Fix the name ```TreeEnsembleModels``` in PySpark; it should be ```TreeEnsembleModel```, which is consistent with Scala. Moreover, it represents a tree ensemble model, so ```TreeEnsembleModel``` is more reasonable. This class is not used publicly, so the change does not involve a breaking change.
      
      ## How was this patch tested?
      No new tests, should pass existing ones.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #14454 from yanboliang/TreeEnsembleModel.
      ccc6dc0f
  5. Aug 11, 2016
    • hyukjinkwon's avatar
      [SPARK-16434][SQL] Avoid per-record type dispatch in JSON when reading · ac84fb64
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, `JacksonParser.parse` does type-based dispatch for each row to convert the tokens into appropriate values for Spark.
      This does not have to be done per row because the schema is already known.
      
      So the appropriate converters can be created once, according to the schema, and then applied to each row.
      
      This PR corrects `JacksonParser` so that it creates all converters for the schema once and then applies them to each row rather than type dispatching for every row.
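
      A hedged sketch of the "build converters once per schema" idea (names are illustrative and this is not the actual `JacksonParser` implementation): type dispatch happens once per schema field, and parsing each record just applies the prebuilt converters.

      ```scala
      import com.fasterxml.jackson.core.JsonParser
      import org.apache.spark.sql.types._

      object ConverterSketch {
        type Converter = JsonParser => Any

        // Per-type dispatch happens here, once per schema field...
        def makeConverter(dataType: DataType): Converter = dataType match {
          case BooleanType => p => p.getBooleanValue
          case IntegerType => p => p.getIntValue
          case LongType    => p => p.getLongValue
          case DoubleType  => p => p.getDoubleValue
          case StringType  => p => p.getText
          case _           => p => p.getText   // simplified fallback for the sketch
        }

        // ...so converting a record is just applying prebuilt functions,
        // with no pattern matching on the data type per row.
        def makeRootConverters(schema: StructType): Array[Converter] =
          schema.fields.map(f => makeConverter(f.dataType))
      }
      ```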
      
      The benchmark was run with the code below:
      
      #### Parser tests
      
      **Before**
      
      ```scala
      test("Benchmark for JSON converter") {
        val N = 500 << 8
        val row =
          """{"struct":{"field1": true, "field2": 92233720368547758070},
          "structWithArrayFields":{"field1":[4, 5, 6], "field2":["str1", "str2"]},
          "arrayOfString":["str1", "str2"],
          "arrayOfInteger":[1, 2147483647, -2147483648],
          "arrayOfLong":[21474836470, 9223372036854775807, -9223372036854775808],
          "arrayOfBigInteger":[922337203685477580700, -922337203685477580800],
          "arrayOfDouble":[1.2, 1.7976931348623157E308, 4.9E-324, 2.2250738585072014E-308],
          "arrayOfBoolean":[true, false, true],
          "arrayOfNull":[null, null, null, null],
          "arrayOfStruct":[{"field1": true, "field2": "str1"}, {"field1": false}, {"field3": null}],
          "arrayOfArray1":[[1, 2, 3], ["str1", "str2"]],
          "arrayOfArray2":[[1, 2, 3], [1.1, 2.1, 3.1]]
         }"""
        val data = List.fill(N)(row)
        val dummyOption = new JSONOptions(Map.empty[String, String])
        val schema =
          InferSchema.infer(spark.sparkContext.parallelize(Seq(row)), "", dummyOption)
        val factory = new JsonFactory()
      
        val benchmark = new Benchmark("JSON converter", N)
        benchmark.addCase("convert JSON file", 10) { _ =>
          data.foreach { input =>
            val parser = factory.createParser(input)
            parser.nextToken()
            JacksonParser.convertRootField(factory, parser, schema)
          }
        }
        benchmark.run()
      }
      ```
      
      ```
      JSON converter:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      convert JSON file                             1697 / 1807          0.1       13256.9       1.0X
      ```
      
      **After**
      
      ```scala
      test("Benchmark for JSON converter") {
        val N = 500 << 8
        val row =
          """{"struct":{"field1": true, "field2": 92233720368547758070},
          "structWithArrayFields":{"field1":[4, 5, 6], "field2":["str1", "str2"]},
          "arrayOfString":["str1", "str2"],
          "arrayOfInteger":[1, 2147483647, -2147483648],
          "arrayOfLong":[21474836470, 9223372036854775807, -9223372036854775808],
          "arrayOfBigInteger":[922337203685477580700, -922337203685477580800],
          "arrayOfDouble":[1.2, 1.7976931348623157E308, 4.9E-324, 2.2250738585072014E-308],
          "arrayOfBoolean":[true, false, true],
          "arrayOfNull":[null, null, null, null],
          "arrayOfStruct":[{"field1": true, "field2": "str1"}, {"field1": false}, {"field3": null}],
          "arrayOfArray1":[[1, 2, 3], ["str1", "str2"]],
          "arrayOfArray2":[[1, 2, 3], [1.1, 2.1, 3.1]]
         }"""
        val data = List.fill(N)(row)
        val dummyOption = new JSONOptions(Map.empty[String, String], new SQLConf())
        val schema =
          InferSchema.infer(spark.sparkContext.parallelize(Seq(row)), dummyOption)
      
        val benchmark = new Benchmark("JSON converter", N)
        benchmark.addCase("convert JSON file", 10) { _ =>
          val parser = new JacksonParser(schema, dummyOption)
          data.foreach { input =>
            parser.parse(input)
          }
        }
        benchmark.run()
      }
      ```
      
      ```
      JSON converter:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      convert JSON file                             1401 / 1461          0.1       10947.4       1.0X
      ```
      
      Parsing time seems to be improved by roughly 20%.
      
      #### End-to-End test
      
      ```scala
      test("Benchmark for JSON reader") {
        val N = 500 << 8
        val row =
          """{"struct":{"field1": true, "field2": 92233720368547758070},
          "structWithArrayFields":{"field1":[4, 5, 6], "field2":["str1", "str2"]},
          "arrayOfString":["str1", "str2"],
          "arrayOfInteger":[1, 2147483647, -2147483648],
          "arrayOfLong":[21474836470, 9223372036854775807, -9223372036854775808],
          "arrayOfBigInteger":[922337203685477580700, -922337203685477580800],
          "arrayOfDouble":[1.2, 1.7976931348623157E308, 4.9E-324, 2.2250738585072014E-308],
          "arrayOfBoolean":[true, false, true],
          "arrayOfNull":[null, null, null, null],
          "arrayOfStruct":[{"field1": true, "field2": "str1"}, {"field1": false}, {"field3": null}],
          "arrayOfArray1":[[1, 2, 3], ["str1", "str2"]],
          "arrayOfArray2":[[1, 2, 3], [1.1, 2.1, 3.1]]
         }"""
        val df = spark.sqlContext.read.json(spark.sparkContext.parallelize(List.fill(N)(row)))
        withTempPath { path =>
          df.write.format("json").save(path.getCanonicalPath)
      
          val benchmark = new Benchmark("JSON reader", N)
          benchmark.addCase("reading JSON file", 10) { _ =>
            spark.read.format("json").load(path.getCanonicalPath).collect()
          }
          benchmark.run()
        }
      }
      ```
      
      **Before**
      
      ```
      JSON reader:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      reading JSON file                             6485 / 6924          0.0       50665.0       1.0X
      ```
      
      **After**
      
      ```
      JSON reader:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      reading JSON file                             6350 / 6529          0.0       49609.3       1.0X
      ```
      
      ## How was this patch tested?
      
      Existing test cases should cover this.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14102 from HyukjinKwon/SPARK-16434.
      ac84fb64
    • Jeff Zhang's avatar
      [SPARK-13081][PYSPARK][SPARK_SUBMIT] Allow set pythonExec of driver and executor through conf… · 7a9e25c3
      Jeff Zhang authored
      Before this PR, users had to export environment variables to specify the Python executable of the driver and executor, which is not very convenient. This PR allows users to specify Python through the configurations "--pyspark-driver-python" and "--pyspark-executor-python".
      
      Manually tested in local and yarn mode, for pyspark-shell and pyspark batch mode.
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #13146 from zjffdu/SPARK-13081.
      7a9e25c3
    • WangTaoTheTonic's avatar
      [SPARK-17022][YARN] Handle potential deadlock in driver handling messages · ea0bf91b
      WangTaoTheTonic authored
      ## What changes were proposed in this pull request?
      
      We send RequestExecutors directly to the AM instead of transferring it to the yarnSchedulerBackend first, to avoid a potential deadlock.
      
      ## How was this patch tested?
      
      manual tests
      
      Author: WangTaoTheTonic <wangtao111@huawei.com>
      
      Closes #14605 from WangTaoTheTonic/lock.
      ea0bf91b
    • huangzhaowei's avatar
      [SPARK-16868][WEB UI] Fix executor showing as both dead and alive on the executor UI. · 4ec5c360
      huangzhaowei authored
      ## What changes were proposed in this pull request?
      Under heavy pressure on the Spark application, since the executor registers itself with the driver block manager twice (because of heartbeats), the executor will show up as the picture shows:
      ![image](https://cloud.githubusercontent.com/assets/7404824/17467245/c1359094-5d4e-11e6-843a-f6d6347e1bf6.png)
      
      ## How was this patch tested?
      NA
      
      Details in: [SPARK-16868](https://issues.apache.org/jira/browse/SPARK-16868)
      
      Author: huangzhaowei <carlmartinmax@gmail.com>
      
      Closes #14530 from SaintBacchus/SPARK-16868.
      4ec5c360
    • Bryan Cutler's avatar
      [SPARK-13602][CORE] Add shutdown hook to DriverRunner to prevent driver process leak · 1c9a386c
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      Added shutdown hook to DriverRunner to kill the driver process in case the Worker JVM exits suddenly and the `WorkerWatcher` was unable to properly catch this.  Did some cleanup to consolidate driver state management and setting of finalized vars within the running thread.
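
      A minimal sketch of the shutdown-hook idea (not the actual `DriverRunner` code; the process handling here is simplified): if the Worker JVM exits before `WorkerWatcher` can react, the hook still destroys the spawned driver process so it does not leak.

      ```scala
      var driverProcess: Option[Process] = None   // set once the driver is launched

      val hook = sys.addShutdownHook {
        driverProcess.foreach(_.destroy())        // kill the child even on abrupt JVM exit
      }

      // e.g. driverProcess = Some(new ProcessBuilder("sleep", "3600").start())
      // On a clean shutdown the hook can be removed with hook.remove().
      ```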
      
      ## How was this patch tested?
      
      Added unit tests to verify that the final state and exception variables are set accordingly for successful, failed, and errored driver processes.  Retrofitted an existing test to verify that killing a mocked process ends with the correct state and stops properly.
      
      Manually tested (with deploy-mode=cluster) that the shutdown hook is called by forcibly exiting the `Worker` at various points in the code, with the `WorkerWatcher` both disabled and enabled.  Also, manually killed the driver through the UI and verified that the `DriverRunner` interrupted, killed the process and exited properly.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #11746 from BryanCutler/DriverRunner-shutdown-hook-SPARK-13602.
      1c9a386c
    • petermaxlee's avatar
      [SPARK-17018][SQL] literals.sql for testing literal parsing · cf936782
      petermaxlee authored
      ## What changes were proposed in this pull request?
      This patch adds literals.sql for testing literal parsing end-to-end in SQL.
      
      ## How was this patch tested?
      The patch itself is only about adding test cases.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #14598 from petermaxlee/SPARK-17018-2.
      cf936782
    • Wenchen Fan's avatar
      [SPARK-17021][SQL] simplify the constructor parameters of QuantileSummaries · acaf2a81
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      1. `sampled` doesn't need to be an `ArrayBuffer`; we never update it in place, we only assign a new value
      2. `count` doesn't need to be `var`, we never mutate it.
      3. `headSampled` doesn't need to be in constructor, we never pass a non-empty `headSampled` to constructor
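
      A hedged before/after sketch of the three points above (field types are illustrative; the real `QuantileSummaries` has more parameters and logic):

      ```scala
      import scala.collection.mutable.ArrayBuffer

      // Before: mutable buffer, mutable count, and headSampled exposed in the constructor.
      class SummariesBefore(
          val sampled: ArrayBuffer[Double],
          var count: Long,
          val headSampled: ArrayBuffer[Double])

      // After: immutable members; headSampled is internal state, not a constructor parameter.
      class SummariesAfter(val sampled: Array[Double], val count: Long) {
        private val headSampled = ArrayBuffer.empty[Double]
      }
      ```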
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14603 from cloud-fan/simply.
      acaf2a81
    • Davies Liu's avatar
      [SPARK-16958] [SQL] Reuse subqueries within the same query · 0f72e4f0
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      There can be multiple subqueries that generate the same results; we can re-use the result instead of running them multiple times.
      
      This PR also cleans up how we run subqueries.
      
      For SQL query
      ```sql
      select id,(select avg(id) from t) from t where id > (select avg(id) from t)
      ```
      The explain is
      ```
      == Physical Plan ==
      *Project [id#15L, Subquery subquery29 AS scalarsubquery()#35]
      :  +- Subquery subquery29
      :     +- *HashAggregate(keys=[], functions=[avg(id#15L)])
      :        +- Exchange SinglePartition
      :           +- *HashAggregate(keys=[], functions=[partial_avg(id#15L)])
      :              +- *Range (0, 1000, splits=4)
      +- *Filter (cast(id#15L as double) > Subquery subquery29)
         :  +- Subquery subquery29
         :     +- *HashAggregate(keys=[], functions=[avg(id#15L)])
         :        +- Exchange SinglePartition
         :           +- *HashAggregate(keys=[], functions=[partial_avg(id#15L)])
         :              +- *Range (0, 1000, splits=4)
         +- *Range (0, 1000, splits=4)
      ```
      The visualized plan:
      
      ![reuse-subquery](https://cloud.githubusercontent.com/assets/40902/17573229/e578d93c-5f0d-11e6-8a3c-0150d81d3aed.png)
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #14548 from davies/subq.
      0f72e4f0
    • Michael Gummelt's avatar
      [SPARK-16952] don't lookup spark home directory when executor uri is set · 4d496802
      Michael Gummelt authored
      ## What changes were proposed in this pull request?
      
      remove requirement to set spark.mesos.executor.home when spark.executor.uri is used
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #14552 from mgummelt/fix-spark-home.
      4d496802
    • hyukjinkwon's avatar
      [SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation · 7186e8c3
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Originally this PR was based on #14491, but I realised that fixing the examples is more sensible than fixing the comments.
      
      This PR fixes three things below:
      
       - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.
      
      - Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.
      
      - Fix `StructuredNetworkWordCountWindowed` and `StructuredNetworkWordCount` in Java and Scala to initially load a `DataFrame` and `Dataset<Row>`, to be consistent with the comments and some examples in `structured-streaming-programming-guide.md`, and to match the Scala and Java examples with the Python one (the Python one loads it as a `DataFrame` initially).
      
      ## How was this patch tested?
      
      N/A
      
      Closes https://github.com/apache/spark/pull/14491
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>
      
      Closes #14564 from HyukjinKwon/SPARK-16886.
      7186e8c3
    • huangzhaowei's avatar
      [SPARK-16941] Use concurrentHashMap instead of scala Map in SparkSQLOperationManager. · a45fefd1
      huangzhaowei authored
      ## What changes were proposed in this pull request?
      ThriftServer has thread-safety problems in **SparkSQLOperationManager**.
      Add a SynchronizedMap trait for the maps in it to avoid this problem.
      
      Details in [SPARK-16941](https://issues.apache.org/jira/browse/SPARK-16941)
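
      The title names `ConcurrentHashMap` while the body mentions a `SynchronizedMap` trait; here is a hedged sketch of the `ConcurrentHashMap` form (illustrative only; the real manager maps session and operation handles rather than plain strings):

      ```scala
      import java.util.concurrent.ConcurrentHashMap

      // Before: a plain mutable.HashMap, unsafe under concurrent ThriftServer sessions.
      // val sessionToContext = new scala.collection.mutable.HashMap[String, String]()

      // After: a ConcurrentHashMap, safe for concurrent reads and writes.
      val sessionToContext = new ConcurrentHashMap[String, String]()
      sessionToContext.put("session-1", "context-1")
      val ctx = sessionToContext.get("session-1")
      ```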
      
      ## How was this patch tested?
      NA
      
      Author: huangzhaowei <carlmartinmax@gmail.com>
      
      Closes #14534 from SaintBacchus/SPARK-16941.
      a45fefd1
    • Andrew Ash's avatar
      Correct example value for spark.ssl.YYY.XXX settings · 8a6b7037
      Andrew Ash authored
      Docs adjustment to:
      - link to other relevant section of docs
      - correct a statement that claimed only one value is supported when other values are actually supported
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #14581 from ash211/patch-10.
      8a6b7037
    • petermaxlee's avatar
      [SPARK-17015][SQL] group-by/order-by ordinal and arithmetic tests · a7b02db4
      petermaxlee authored
      ## What changes were proposed in this pull request?
      This patch adds three test files:
      1. arithmetic.sql.out
      2. order-by-ordinal.sql
      3. group-by-ordinal.sql
      
      This includes https://github.com/apache/spark/pull/14594.
      
      ## How was this patch tested?
      This is a test case change.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #14595 from petermaxlee/SPARK-17015.
      a7b02db4
    • petermaxlee's avatar
      [SPARK-17011][SQL] Support testing exceptions in SQLQueryTestSuite · 0db373aa
      petermaxlee authored
      ## What changes were proposed in this pull request?
      This patch adds exception testing to SQLQueryTestSuite. When there is an exception in query execution, the query result contains the exception class along with the exception message.
      
      As part of this, I moved some additional test cases for limit from SQLQuerySuite over to SQLQueryTestSuite.
      
      ## How was this patch tested?
      This is a test harness change.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #14592 from petermaxlee/SPARK-17011.
      0db373aa
    • Tao Wang's avatar
      [SPARK-17010][MINOR][DOC] Wrong description in memory management document · 7a6a3c3f
      Tao Wang authored
      ## What changes were proposed in this pull request?
      
      Change the remaining percentage to the right one.
      
      ## How was this patch tested?
      
      Manual review
      
      Author: Tao Wang <wangtao111@huawei.com>
      
      Closes #14591 from WangTaoTheTonic/patch-1.
      7a6a3c3f
  6. Aug 10, 2016
    • petermaxlee's avatar
      [SPARK-17007][SQL] Move test data files into a test-data folder · 665e1753
      petermaxlee authored
      ## What changes were proposed in this pull request?
      This patch moves all the test data files in sql/core/src/test/resources to sql/core/src/test/resources/test-data, so we don't clutter the top level sql/core/src/test/resources. Also deleted sql/core/src/test/resources/old-repeated.parquet since it is no longer used.
      
      The change will make it easier to spot the sql-tests directory.
      
      ## How was this patch tested?
      This is a test-only change.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #14589 from petermaxlee/SPARK-17007.
      665e1753
    • petermaxlee's avatar
      [SPARK-17008][SPARK-17009][SQL] Normalization and isolation in SQLQueryTestSuite. · 425c7c2d
      petermaxlee authored
      ## What changes were proposed in this pull request?
      This patch enhances SQLQueryTestSuite in two ways:
      
      1. SPARK-17009: Use a new SparkSession for each test case to provide stronger isolation (e.g. config changes in one test case do not impact another). That said, we do not currently isolate catalog changes.
      2. SPARK-17008: Normalize query output using sorting, inspired by HiveComparisonTest.
      
      I also ported a few new test cases over from SQLQuerySuite.
      
      ## How was this patch tested?
      This is a test harness update.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #14590 from petermaxlee/SPARK-17008.
      425c7c2d
    • jerryshao's avatar
      [SPARK-14743][YARN] Add a configurable credential manager for Spark running on YARN · ab648c00
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      Add a configurable token manager for Spark running on YARN.
      
      ### Current Problems ###
      
      1. The supported token providers are hard-coded; currently only hdfs, hbase and hive are supported, and it is impossible for users to add a new token provider without code changes.
      2. The same problem exists in the timely token renewer and updater.
      
      ### Changes In This Proposal ###
      
      To address the problems mentioned above and make the current code cleaner and easier to understand, this proposal mainly has 3 changes:
      
      1. Abstract a `ServiceTokenProvider` interface, as well as a `ServiceTokenRenewable` interface, for token providers. Each service that wants to communicate with Spark through tokens needs to implement this interface.
      2. Provide a `ConfigurableTokenManager` to manage all the registered token providers, as well as the token renewer and updater. This class also offers the API for other modules to obtain tokens, get the renewal interval and so on.
      3. Implement 3 built-in token providers `HDFSTokenProvider`, `HiveTokenProvider` and `HBaseTokenProvider` to keep the same semantics as supported today. Whether to load these built-in token providers is controlled by the configuration "spark.yarn.security.tokens.${service}.enabled"; by default all the built-in token providers are loaded.
      
      ### Behavior Changes ###
      
      For the end user there is no behavior change; we still use the same configuration `spark.yarn.security.tokens.${service}.enabled` to decide which token provider is enabled (hbase or hive).
      
      A user-implemented token provider (assume the provider name is "test") that needs to be added to this class should have two configurations (see the sketch after the list):
      
      1. `spark.yarn.security.tokens.test.enabled` set to true
      2. `spark.yarn.security.tokens.test.class` set to the fully qualified class name.
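
      An illustrative configuration sketch for such a hypothetical provider named "test" (the class name is made up for the example):

      ```scala
      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .set("spark.yarn.security.tokens.test.enabled", "true")
        .set("spark.yarn.security.tokens.test.class", "com.example.security.TestTokenProvider")
      ```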
      
      So we still keep the same semantics as the current code while adding one new configuration.
      
      ### Current Status ###
      
      - [x] token provider interface and management framework.
      - [x] implement built-in token providers (hdfs, hbase, hive).
      - [x] Coverage of unit test.
      - [x] Integrated test with security cluster.
      
      ## How was this patch tested?
      
      Unit test and integrated test.
      
      Please suggest and review, any comment is greatly appreciated.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #14065 from jerryshao/SPARK-16342.
      ab648c00
    • Rajesh Balamohan's avatar
      [SPARK-12920][CORE] Honor "spark.ui.retainedStages" to reduce mem-pressure · bd2c12fb
      Rajesh Balamohan authored
      When a large number of jobs are run concurrently with the Spark Thrift Server, the Thrift Server starts running at high CPU due to GC pressure. Job UI retention causes memory pressure with large jobs. https://issues.apache.org/jira/secure/attachment/12783302/SPARK-12920.profiler_job_progress_listner.png has the profiler snapshot. This PR honors `spark.ui.retainedStages` strictly to reduce memory pressure.
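
      A hedged sketch of the trimming idea (not the actual listener code; names and values are illustrative): once the retained stages exceed `spark.ui.retainedStages`, the oldest entries are dropped.

      ```scala
      import scala.collection.mutable.ListBuffer

      val retainedStages = 1000                                         // value of spark.ui.retainedStages
      val completedStages = ListBuffer.tabulate(1500)(i => s"stage-$i") // stand-in for tracked stage data

      if (completedStages.size > retainedStages) {
        completedStages.trimStart(completedStages.size - retainedStages) // evict the oldest stages
      }
      ```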
      
      Manual and unit tests
      
      Author: Rajesh Balamohan <rbalamohan@apache.org>
      
      Closes #10846 from rajeshbalamohan/SPARK-12920.
      bd2c12fb
    • Qifan Pu's avatar
      [SPARK-16928] [SQL] Recursive call of ColumnVector::getInt() breaks JIT inlining · bf5cb8af
      Qifan Pu authored
      ## What changes were proposed in this pull request?
      
      In both `OnHeapColumnVector` and `OffHeapColumnVector`, we implemented `getInt()` with the following code pattern:
      ```java
      public int getInt(int rowId) {
        if (dictionary == null) {
          return intData[rowId];
        } else {
          return dictionary.decodeToInt(dictionaryIds.getInt(rowId));
        }
      }
      ```
      As `dictionaryIds` is also a `ColumnVector`, this results in a recursive call of `getInt()` and breaks JIT inlining. As a result, `getInt()` will not get inlined.
      
      We fix this by adding a separate method `getDictId()` specific for `dictionaryIds` to use.
      
      ## How was this patch tested?
      
      We tested the difference with the following aggregate query on a TPCDS dataset (with scale factor = 5):
      ```
      select
        max(ss_sold_date_sk) as max_ss_sold_date_sk
      from store_sales
      ```
      The query runtime is improved, from 202ms (before) to 159ms (after).
      
      Author: Qifan Pu <qifan.pu@gmail.com>
      
      Closes #14513 from ooq/SPARK-16928.
      bf5cb8af