  1. Jul 30, 2016
    • Sean Owen's avatar
      [SPARK-16694][CORE] Use for/foreach rather than map for Unit expressions whose... · 0dc4310b
      Sean Owen authored
      [SPARK-16694][CORE] Use for/foreach rather than map for Unit expressions whose side effects are required
      
      ## What changes were proposed in this pull request?
      
      Use foreach/for instead of map where the operation requires execution of the body for its side effects rather than defining a transformation.
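      A hedged illustration of the pattern (generic example, not taken from this commit's diff):
      
      ```scala
      val items = Seq(1, 2, 3)
      
      // Before: map used only for its side effects builds and discards a Seq[Unit]
      items.map(i => println(i))
      
      // After: foreach (or a for loop) states the intent and allocates nothing
      items.foreach(i => println(i))
      for (i <- items) println(i)
      ```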
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14332 from srowen/SPARK-16694.
      0dc4310b
  2. Jul 29, 2016
    • Tathagata Das's avatar
      [SPARK-16748][SQL] SparkExceptions during planning should not be wrapped in TreeNodeException · bbc24754
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      We do not want SparkExceptions from job failures in the planning phase to create TreeNodeException. Hence do not wrap SparkException in TreeNodeException.
      
      ## How was this patch tested?
      New unit test
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #14395 from tdas/SPARK-16748.
      bbc24754
    • Nicholas Chammas's avatar
      [SPARK-16772][PYTHON][DOCS] Restore "datatype string" to Python API docstrings · 2182e432
      Nicholas Chammas authored
      ## What changes were proposed in this pull request?
      
      This PR corrects [an error made in an earlier PR](https://github.com/apache/spark/pull/14393/files#r72843069).
      
      ## How was this patch tested?
      
      ```sh
      $ ./dev/lint-python
      PEP8 checks passed.
      rm -rf _build/*
      pydoc checks passed.
      ```
      
      I also built the docs and confirmed that they looked good in my browser.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #14408 from nchammas/SPARK-16772.
      2182e432
    • Sun Dapeng's avatar
      [SPARK-16761][DOC][ML] Fix doc link in docs/ml-guide.md · 2c15323a
      Sun Dapeng authored
      ## What changes were proposed in this pull request?
      
      Fix the link at http://spark.apache.org/docs/latest/ml-guide.html.
      
      ## How was this patch tested?
      
      None
      
      Author: Sun Dapeng <sdp@apache.org>
      
      Closes #14386 from sundapeng/doclink.
      2c15323a
    • Michael Gummelt's avatar
      [SPARK-16637] Unified containerizer · 266b92fa
      Michael Gummelt authored
      ## What changes were proposed in this pull request?
      
      New config var: spark.mesos.docker.containerizer={"mesos","docker" (default)}
      
      This adds support for running docker containers via the Mesos unified containerizer: http://mesos.apache.org/documentation/latest/container-image/
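      A minimal configuration sketch of the new setting (the containerizer key is from this PR; the image name is a placeholder):
      
      ```scala
      import org.apache.spark.SparkConf
      
      // Sketch only: select the Mesos unified containerizer instead of the default "docker"
      val conf = new SparkConf()
        .set("spark.mesos.executor.docker.image", "my-org/spark:latest")  // placeholder image
        .set("spark.mesos.docker.containerizer", "mesos")                 // default is "docker"
      ```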
      
      The benefit is losing the dependency on `dockerd`, and all the costs which it incurs.
      
      I've also updated the supported Mesos version to 0.28.2 for support of the required protobufs.
      
      This is blocked on: https://github.com/apache/spark/pull/14167
      
      ## How was this patch tested?
      
      - manually testing jobs submitted with both "mesos" and "docker" settings for the new config var.
      - spark/mesos integration test suite
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #14275 from mgummelt/unified-containerizer.
      266b92fa
    • Adam Roberts's avatar
      [SPARK-16751] Upgrade derby to 10.12.1.1 · 04a2c072
      Adam Roberts authored
      ## What changes were proposed in this pull request?
      
      The version of derby is upgraded based on important security info at VersionEye. Test scope is added so we don't include it in our final package anyway. NB: I think this should be backported to all previous releases as it is a security problem: https://www.versioneye.com/java/org.apache.derby:derby/10.11.1.1
      
      The CVE number is 2015-1832. I also suggest we add a SECURITY tag for JIRAs.
      
      ## How was this patch tested?
      Existing tests with the change, making sure that we see no new failures. I checked that derby 10.12.x, and not derby 10.11.x, is downloaded to our ~/.m2 folder.
      
      I then used dev/make-distribution.sh and checked the dist/jars folder for Spark 2.0: no derby jar is present.
      
      I don't know if this would also remove it from the assembly jar in our 1.x branches.
      
      Author: Adam Roberts <aroberts@uk.ibm.com>
      
      Closes #14379 from a-roberts/patch-4.
      04a2c072
    • Yanbo Liang's avatar
      [SPARK-16750][ML] Fix GaussianMixture training failed due to feature column type mistake · 0557a454
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      ML ```GaussianMixture``` training failed due to a feature column type mistake. The feature column type should be ```ml.linalg.VectorUDT``` but was ```mllib.linalg.VectorUDT``` by mistake.
      See [SPARK-16750](https://issues.apache.org/jira/browse/SPARK-16750) for how to reproduce this bug.
      Why did the unit tests not catch this error? Because some estimators/transformers missed calling ```transformSchema(dataset.schema)``` first during ```fit``` or ```transform```. This PR also adds that call to all estimators/transformers that were missing it.
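      To make the type requirement concrete, a small usage sketch with the standard Spark 2.0 ML API (not code from this PR): the features column must hold `ml.linalg` vectors, and with this fix a wrong column type is reported by `transformSchema` during `fit`.
      
      ```scala
      import org.apache.spark.ml.clustering.GaussianMixture
      import org.apache.spark.ml.linalg.Vectors
      import org.apache.spark.sql.SparkSession
      
      val spark = SparkSession.builder().master("local[2]").appName("gmm-schema").getOrCreate()
      import spark.implicits._
      
      // Features built with ml.linalg (not the old mllib.linalg) vectors
      val data = Seq(
        Vectors.dense(0.0, 0.1), Vectors.dense(0.2, 0.0),
        Vectors.dense(5.0, 5.1), Vectors.dense(5.2, 5.0)
      ).map(Tuple1.apply).toDF("features")
      
      val model = new GaussianMixture().setK(2).fit(data)  // schema is validated up front
      ```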
      
      ## How was this patch tested?
      No new tests, should pass existing ones.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #14378 from yanboliang/spark-16750.
      0557a454
    • Wesley Tang's avatar
      [SPARK-16664][SQL] Fix persist call on Data frames with more than 200… · d1d5069a
      Wesley Tang authored
      ## What changes were proposed in this pull request?
      
      Commit f12f11e5 introduced this bug; a `foreach` vs `map` usage was missed.
      
      ## How was this patch tested?
      
      Test added
      
      Author: Wesley Tang <tangmingjun@mininglamp.com>
      
      Closes #14324 from breakdawn/master.
      d1d5069a
  3. Jul 28, 2016
    • Nicholas Chammas's avatar
      [SPARK-16772] Correct API doc references to PySpark classes + formatting fixes · 274f3b9e
      Nicholas Chammas authored
      ## What's Been Changed
      
      The PR corrects several broken or missing class references in the Python API docs. It also corrects formatting problems.
      
      For example, you can see [here](http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#pyspark.sql.SQLContext.registerFunction) how Sphinx is not picking up the reference to `DataType`. That's because the reference is relative to the current module, whereas `DataType` is in a different module.
      
      You can also see [here](http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame) how the formatting for byte, tinyint, and so on is italic instead of monospace. That's because in ReST single backticks just make things italic, unlike in Markdown.
      
      ## Testing
      
      I tested this PR by [building the Python docs](https://github.com/apache/spark/tree/master/docs#generating-the-documentation-html) and reviewing the results locally in my browser. I confirmed that the broken or missing class references were resolved, and that the formatting was corrected.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #14393 from nchammas/python-docstring-fixes.
      274f3b9e
    • Sameer Agarwal's avatar
      [SPARK-16764][SQL] Recommend disabling vectorized parquet reader on OutOfMemoryError · 3fd39b87
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      We currently don't bound or manage the data array size used by column vectors in the vectorized reader (they're just bounded by INT.MAX), which may lead to OOMs while reading data. As a short-term fix, this patch intercepts the OutOfMemoryError and suggests that the user disable the vectorized parquet reader.
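      A hedged sketch of the suggested workaround (the config key is the standard switch for the vectorized reader; verify it against your Spark version):
      
      ```scala
      import org.apache.spark.sql.SparkSession
      
      val spark = SparkSession.builder().master("local[2]").appName("parquet-fallback").getOrCreate()
      // Workaround sketch: fall back to the non-vectorized parquet reader if the
      // vectorized one runs out of memory on wide or very large columns.
      spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
      ```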
      
      ## How was this patch tested?
      
      Existing Tests
      
      Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
      
      Closes #14387 from sameeragarwal/oom.
      3fd39b87
    • Sylvain Zimmer's avatar
      [SPARK-16740][SQL] Fix Long overflow in LongToUnsafeRowMap · 1178d61e
      Sylvain Zimmer authored
      ## What changes were proposed in this pull request?
      
      Avoid an overflow of the Long type that causes a NegativeArraySizeException a few lines later.
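      To make the failure mode concrete, a self-contained illustration of the arithmetic (made-up key values, not the actual LongToUnsafeRowMap code):
      
      ```scala
      // Subtracting a very small key from a very large one overflows Long and goes
      // negative; sizing an array from such a value throws NegativeArraySizeException.
      val minKey = Long.MinValue + 10
      val maxKey = Long.MaxValue - 10
      val range  = maxKey - minKey        // overflows: result is negative
      println(range)                      // -21
      // new Array[Long](range.toInt)     // NegativeArraySizeException if attempted
      ```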
      
      ## How was this patch tested?
      
      Unit tests for HashedRelationSuite still pass.
      
      I can confirm the python script I included in https://issues.apache.org/jira/browse/SPARK-16740 works fine with this patch. Unfortunately I don't have the knowledge/time to write a Scala test case for HashedRelationSuite right now. As the patch is pretty obvious I hope it can be included without this.
      
      Thanks!
      
      Author: Sylvain Zimmer <sylvain@sylvainzimmer.com>
      
      Closes #14373 from sylvinus/master.
      1178d61e
    • Liang-Chi Hsieh's avatar
      [SPARK-16639][SQL] The query with having condition that contains grouping by column should work · 9ade77c3
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      A query whose having condition refers to a grouping column fails during analysis. E.g.,
      
          create table tbl(a int, b string);
          select count(b) from tbl group by a + 1 having a + 1 = 2;
      
      The having condition should be able to use the grouping column.
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #14296 from viirya/having-contains-grouping-column.
      9ade77c3
    • gatorsmile's avatar
      [SPARK-16552][SQL] Store the Inferred Schemas into External Catalog Tables when Creating Tables · 762366fd
      gatorsmile authored
      #### What changes were proposed in this pull request?
      
      Currently, in Spark SQL, the initial creation of schema can be classified into two groups. It is applicable to both Hive tables and Data Source tables:
      
      **Group A. Users specify the schema.**
      
      _Case 1 CREATE TABLE AS SELECT_: the schema is determined by the result schema of the SELECT clause. For example,
      ```SQL
      CREATE TABLE tab STORED AS TEXTFILE
      AS SELECT * from input
      ```
      
      _Case 2 CREATE TABLE_: users explicitly specify the schema. For example,
      ```SQL
      CREATE TABLE jsonTable (_1 string, _2 string)
      USING org.apache.spark.sql.json
      ```
      
      **Group B. Spark SQL infers the schema at runtime.**
      
      _Case 3 CREATE TABLE_. Users do not specify the schema but the path to the file location. For example,
      ```SQL
      CREATE TABLE jsonTable
      USING org.apache.spark.sql.json
      OPTIONS (path '${tempDir.getCanonicalPath}')
      ```
      
      Before this PR, Spark SQL does not store the inferred schema in the external catalog for the cases in Group B. When users refresh the metadata cache or access the table for the first time after (re-)starting Spark, Spark SQL infers the schema and stores it in the metadata cache to improve the performance of subsequent metadata requests. However, this runtime schema inference can cause undesirable schema changes after each restart of Spark.
      
      This PR is to store the inferred schema in the external catalog when creating the table. When users intend to refresh the schema after possible changes on external files (table location), they issue `REFRESH TABLE`. Spark SQL will infer the schema again based on the previously specified table location and update/refresh the schema in the external catalog and metadata cache.
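      A small usage sketch of that flow (table name from the Case 3 example above; assumes a running `SparkSession`):
      
      ```scala
      import org.apache.spark.sql.SparkSession
      
      val spark = SparkSession.builder().master("local[2]").appName("refresh-example").getOrCreate()
      // After the files at the table location change, re-infer the schema and
      // update it in the external catalog and metadata cache.
      spark.sql("REFRESH TABLE jsonTable")
      ```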
      
      In this PR, we do not use the inferred schema to replace the user-specified schema, to avoid external behavior changes. Based on the design, user-specified schemas (as described in Group A) can be changed by ALTER TABLE commands, although we do not support them yet.
      
      #### How was this patch tested?
      TODO: add more cases to cover the changes.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14207 from gatorsmile/userSpecifiedSchema.
      762366fd
    • Dongjoon Hyun's avatar
      [SPARK-15232][SQL] Add subquery SQL building tests to LogicalPlanToSQLSuite · 5c2ae79b
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      We currently test subquery SQL building using the `HiveCompatibilitySuite`. This is not desirable since SQL building is actually a part of `sql/core` and because we are slowly reducing our dependency on Hive. This PR adds the same tests from the whitelist of `HiveCompatibilitySuite` into `LogicalPlanToSQLSuite`.
      
      ## How was this patch tested?
      
      This adds more testcases. Pass the Jenkins tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14383 from dongjoon-hyun/SPARK-15232.
      5c2ae79b
    • petermaxlee's avatar
      [SPARK-16730][SQL] Implement function aliases for type casts · 11d427c9
      petermaxlee authored
      ## What changes were proposed in this pull request?
      Spark 1.x supports using the Hive type name as function names for doing casts, e.g.
      ```sql
      SELECT int(1.0);
      SELECT string(2.0);
      ```
      
      The above queries work in Spark 1.x because Spark 1.x falls back to Hive for unimplemented functions, but break in Spark 2.0 because that fallback was removed.
      
      This patch implements function aliases using an analyzer rule for the following cast functions:
      - boolean
      - tinyint
      - smallint
      - int
      - bigint
      - float
      - double
      - decimal
      - date
      - timestamp
      - binary
      - string
      
      ## How was this patch tested?
      Added end-to-end tests in SQLCompatibilityFunctionSuite.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #14364 from petermaxlee/SPARK-16730-2.
      11d427c9
  4. Jul 27, 2016
    • KevinGrealish's avatar
      [SPARK-16110][YARN][PYSPARK] Fix allowing python version to be specified per... · b14d7b5c
      KevinGrealish authored
      [SPARK-16110][YARN][PYSPARK] Fix allowing python version to be specified per submit for cluster mode.
      
      ## What changes were proposed in this pull request?
      
      This fix allows a pyspark job submission to specify python 2 or 3.
      
      Change the ordering in the setup of the application master environment so the env vars PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON can be overridden by spark.yarn.appMasterEnv.* conf settings. This applies to YARN in cluster mode. This allows them to be set per submission without needing to unset the env vars (which is not always possible - e.g. batch submit with LIVY only exposes the arguments to spark-submit).
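      A hedged sketch of the per-submission override this enables (python paths are placeholders):
      
      ```scala
      import org.apache.spark.SparkConf
      
      // Sketch only: with this fix, on YARN in cluster mode these conf entries win over
      // any PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON env vars already set on the machine.
      val conf = new SparkConf()
        .set("spark.yarn.appMasterEnv.PYSPARK_PYTHON", "/usr/bin/python3")         // placeholder path
        .set("spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON", "/usr/bin/python3")  // placeholder path
      ```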
      
      ## How was this patch tested?
      Manual and existing unit tests.
      
      Author: KevinGrealish <KevinGre@microsoft.com>
      
      Closes #13824 from KevinGrealish/SPARK-16110.
      b14d7b5c
    • Bartek Wiśniewski's avatar
      [MINOR][DOC] missing keyword new · bc4851ad
      Bartek Wiśniewski authored
      ## What changes were proposed in this pull request?
      
      Added the missing `new` keyword to a Java example.
      
      ## How was this patch tested?
      
      wasn't
      
      Author: Bartek Wiśniewski <wedi@Ava.local>
      
      Closes #14381 from wedi-dev/quickfix/missing_keyword.
      bc4851ad
    • Mark Grover's avatar
      [SPARK-5847][CORE] Allow for configuring MetricsSystem's use of app ID to namespace all metrics · 70f846a3
      Mark Grover authored
      ## What changes were proposed in this pull request?
      Adding a new property to SparkConf called spark.metrics.namespace that allows users to
      set a custom namespace for executor and driver metrics in the metrics systems.
      
      By default, the root namespace used for driver or executor metrics is
      the value of `spark.app.id`. However, users often want to track metrics
      across apps for driver and executor metrics, which is hard to do with the application ID
      (i.e. `spark.app.id`) since it changes with every invocation of the app. For such use cases,
      users can set the `spark.metrics.namespace` property to another spark configuration key like
      `spark.app.name`, which is then used to populate the root namespace of the metrics system
      (with the app name in our example). The `spark.metrics.namespace` property can be set to any
      arbitrary spark property key, whose value is used as the root namespace of the
      metrics system. Non-driver and non-executor metrics are never prefixed with `spark.app.id`, nor
      does the `spark.metrics.namespace` property have any such effect on such metrics.
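      A hedged configuration sketch of the use case described above (the exact value syntax is an assumption; check the docs for your Spark version):
      
      ```scala
      import org.apache.spark.SparkConf
      
      // Sketch only: point the metrics namespace at a stable property such as the app
      // name instead of the per-invocation app id.
      val conf = new SparkConf()
        .setAppName("my-nightly-etl")                      // placeholder name
        .set("spark.metrics.namespace", "spark.app.name")
      ```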
      
      ## How was this patch tested?
      Added new unit tests, modified existing unit tests.
      
      Author: Mark Grover <mark@apache.org>
      
      Closes #14270 from markgrover/spark-5847.
      70f846a3
    • krishnakalyan3's avatar
      [SPARK-15254][DOC] Improve ML pipeline Cross Validation Scaladoc & PyDoc · 7e8279fd
      krishnakalyan3 authored
      ## What changes were proposed in this pull request?
      Updated ML pipeline Cross Validation Scaladoc & PyDoc.
      
      ## How was this patch tested?
      
      Documentation update
      
      Author: krishnakalyan3 <krishnakalyan3@gmail.com>
      
      Closes #13894 from krishnakalyan3/kfold-cv.
      7e8279fd
    • Liang-Chi Hsieh's avatar
      [MINOR][DOC][SQL] Fix two documents regarding size in bytes · 045fc360
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Fix two places in SQLConf documents regarding size in bytes and statistics.
      
      ## How was this patch tested?
      No tests. Documentation change only.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #14341 from viirya/fix-doc-size-in-bytes.
      045fc360
    • Yanbo Liang's avatar
      [MINOR][ML] Fix some mistake in LinearRegression formula. · 3c3371bb
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Fix some mistakes in the ```LinearRegression``` formula.
      
      ## How was this patch tested?
      Documentation change, no tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #14369 from yanboliang/LiR-formula.
      3c3371bb
    • petermaxlee's avatar
      [SPARK-16729][SQL] Throw analysis exception for invalid date casts · ef0ccbcb
      petermaxlee authored
      ## What changes were proposed in this pull request?
      Spark currently throws exceptions for invalid casts for all data types except the date type, which somehow returns null. It should be consistent and throw an analysis exception as well.
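      A hedged illustration of the behavior change (the specific cast is chosen for illustration only):
      
      ```scala
      import org.apache.spark.sql.SparkSession
      
      val spark = SparkSession.builder().master("local[2]").appName("cast-example").getOrCreate()
      // Before this change an unsupported cast to date quietly returned null;
      // with it, the cast should fail analysis like casts to other types do.
      spark.sql("SELECT CAST(true AS date)")
      ```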
      
      ## How was this patch tested?
      Added a unit test case in CastSuite.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #14358 from petermaxlee/SPARK-16729.
      ef0ccbcb
    • Dongjoon Hyun's avatar
      [SPARK-16621][SQL] Generate stable SQLs in SQLBuilder · 5b8e848b
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, the generated SQL has unstable IDs for generated attributes.
      Stable generated SQL makes the queries easier to understand and test.
      This PR provides stable SQL generation as follows.
      
       - Provide unique ids for generated subqueries, `gen_subquery_xxx`.
       - Provide unique and stable ids for generated attributes, `gen_attr_xxx`.
      
      **Before**
      ```scala
      scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
      res0: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
      scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
      res1: String = SELECT `gen_attr_4` AS `1` FROM (SELECT 1 AS `gen_attr_4`) AS gen_subquery_0
      ```
      
      **After**
      ```scala
      scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
      res1: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
      scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
      res2: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
      ```
      
      ## How was this patch tested?
      
      Pass the existing Jenkins tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14257 from dongjoon-hyun/SPARK-16621.
      5b8e848b
  5. Jul 26, 2016
    • Qifan Pu's avatar
      [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGenerator · 738b4cc5
      Qifan Pu authored
      ## What changes were proposed in this pull request?
      
      This PR is the first step for the following feature:
      
      For hash aggregation in Spark SQL, we use a fast aggregation hashmap to act as a "cache" in order to boost aggregation performance. Previously, the hashmap is backed by a `ColumnarBatch`. This has performance issues when we have wide schema for the aggregation table (large number of key fields or value fields).
      In this JIRA, we support another implementation of fast hashmap, which is backed by a `RowBasedKeyValueBatch`. We then automatically pick between the two implementations based on certain knobs.
      
      In this first-step PR, implementations for `RowBasedKeyValueBatch` and `RowBasedHashMapGenerator` are added.
      
      ## How was this patch tested?
      
      Unit tests: `RowBasedKeyValueBatchSuite`
      
      Author: Qifan Pu <qifan.pu@gmail.com>
      
      Closes #14349 from ooq/SPARK-16524.
      738b4cc5
    • Dhruve Ashar's avatar
      [SPARK-15703][SCHEDULER][CORE][WEBUI] Make ListenerBus event queue size configurable · 0b71d9ae
      Dhruve Ashar authored
      ## What changes were proposed in this pull request?
      This change adds a new configuration entry to specify the size of the Spark listener bus event queue. The value for this config ("spark.scheduler.listenerbus.eventqueue.size") defaults to 10000.
      
      Note:
      I haven't currently documented the configuration entry. We can decide whether it would be appropriate to make it a public configuration or keep it as an undocumented one. Refer to the JIRA for more details.
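      A minimal configuration sketch (key and default taken from this change; the value here is illustrative):
      
      ```scala
      import org.apache.spark.SparkConf
      
      // Raise the listener bus event queue size above its default of 10000
      val conf = new SparkConf()
        .set("spark.scheduler.listenerbus.eventqueue.size", "20000")
      ```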
      
      ## How was this patch tested?
      Ran existing jobs and verified the event queue size with debug logs and from the Spark WebUI Environment tab.
      
      Author: Dhruve Ashar <dhruveashar@gmail.com>
      
      Closes #14269 from dhruve/bug/SPARK-15703.
      0b71d9ae
    • Philipp Hoffmann's avatar
      [SPARK-15271][MESOS] Allow force pulling executor docker images · 0869b3a5
      Philipp Hoffmann authored
      ## What changes were proposed in this pull request?
      
      Mesos agents by default will not pull docker images which are already cached
      locally. In order to run Spark executors from mutable tags like `:latest`,
      this commit introduces a Spark setting (`spark.mesos.executor.docker.forcePullImage`).
      Setting this flag to true tells the Mesos agent to force pull the docker image
      (the default is `false`, which is consistent with the previous implementation
      and Mesos' default behaviour).
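      A minimal configuration sketch of the new flag (the image name is a placeholder):
      
      ```scala
      import org.apache.spark.SparkConf
      
      // Sketch only: force the Mesos agent to re-pull the executor image so a mutable
      // tag such as :latest is refreshed on every run (default is false).
      val conf = new SparkConf()
        .set("spark.mesos.executor.docker.image", "my-org/spark:latest")  // placeholder image
        .set("spark.mesos.executor.docker.forcePullImage", "true")
      ```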
      
      Author: Philipp Hoffmann <mail@philipphoffmann.de>
      
      Closes #14348 from philipphoffmann/force-pull-image.
      0869b3a5
    • Wenchen Fan's avatar
      [SPARK-16663][SQL] desc table should be consistent between data source and hive serde tables · a2abb583
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Currently there are 2 inconsistencies:
      
      1. For a data source table, we only print partition names; for a hive table, we also print the partition schema. After this PR, we will always print the schema.
      2. If a column doesn't have a comment, a data source table will print an empty string while a hive table will print null. After this PR, we will always print null.
      
      ## How was this patch tested?
      
      new test in `HiveDDLSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14302 from cloud-fan/minor3.
      a2abb583
    • WeichenXu's avatar
      [SPARK-16697][ML][MLLIB] improve LDA submitMiniBatch method to avoid redundant RDD computation · 4c969559
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      In `LDAOptimizer.submitMiniBatch`, persist `stats: RDD[(BDM[Double], List[BDV[Double]])]`,
      move the unpersisting of the `expElogbetaBc` broadcast variable so that it is
      not unpersisted too early, and change the previous `expElogbetaBc.unpersist()`
      into `expElogbetaBc.destroy(false)`.
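      A generic, self-contained sketch of the persist/destroy pattern (hypothetical RDD and broadcast, not the actual LDAOptimizer code; the PR itself uses the internal `destroy(false)` inside MLlib):
      
      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.storage.StorageLevel
      
      val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("persist-sketch"))
      val expElogbetaBc = sc.broadcast(Array(1.0, 2.0, 3.0))  // stand-in for the real broadcast
      
      // Persist the intermediate RDD so the two actions below don't recompute it
      // (and don't re-read the broadcast after it is gone).
      val stats = sc.parallelize(1 to 100)
        .map(i => i * expElogbetaBc.value.sum)
        .persist(StorageLevel.MEMORY_AND_DISK)
      
      val total = stats.sum()    // first action reusing the persisted data
      val count = stats.count()  // second action, no recomputation
      
      stats.unpersist()
      expElogbetaBc.destroy()    // destroy only after every job that reads it has run
      ```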
      
      ## How was this patch tested?
      
      Existing test.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14335 from WeichenXu123/improve_LDA.
      4c969559
    • hyukjinkwon's avatar
      [SPARK-16675][SQL] Avoid per-record type dispatch in JDBC when writing · 3b2b785e
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, `JdbcUtils.savePartition` does type-based dispatch for each row to write the appropriate values.
      
      Instead, appropriate setters for `PreparedStatement` can be created once according to the schema, and then applied to each row. This approach is similar to `CatalystWriteSupport`.
      
      This PR simply introduces such setters to avoid the per-record dispatch.
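      A hedged, self-contained sketch of the idea (hypothetical helper names, not the actual `JdbcUtils` code): build one setter per column from the schema up front, then apply them to every row instead of dispatching on type per value.
      
      ```scala
      import java.sql.PreparedStatement
      import org.apache.spark.sql.Row
      import org.apache.spark.sql.types._
      
      type JDBCValueSetter = (PreparedStatement, Row, Int) => Unit
      
      // One pattern match per column, done once per partition rather than once per value
      def makeSetter(dataType: DataType): JDBCValueSetter = dataType match {
        case IntegerType => (stmt, row, pos) => stmt.setInt(pos + 1, row.getInt(pos))
        case LongType    => (stmt, row, pos) => stmt.setLong(pos + 1, row.getLong(pos))
        case DoubleType  => (stmt, row, pos) => stmt.setDouble(pos + 1, row.getDouble(pos))
        case StringType  => (stmt, row, pos) => stmt.setString(pos + 1, row.getString(pos))
        case _           => (stmt, row, pos) => stmt.setObject(pos + 1, row.get(pos))
      }
      
      def writeRows(stmt: PreparedStatement, schema: StructType, rows: Iterator[Row]): Unit = {
        val setters = schema.fields.map(f => makeSetter(f.dataType))  // computed once
        rows.foreach { row =>
          var i = 0
          while (i < setters.length) { setters(i)(stmt, row, i); i += 1 }
          stmt.addBatch()
        }
      }
      ```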
      
      ## How was this patch tested?
      
      Existing tests should cover this.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14323 from HyukjinKwon/SPARK-16675.
      3b2b785e
    • Tathagata Das's avatar
      [TEST][STREAMING] Fix flaky Kafka rate controlling test · 03c27435
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      The current test is incorrect, because
      - The expected number of messages does not take into account that the topic has 2 partitions, and rate is set per partition.
      - Also in some cases, the test ran out of data in Kafka while waiting for the right amount of data per batch.
      
      The PR
      - Reduces the number of partitions to 1
      - Adds more data to Kafka
      - Runs with 0.5 second so that batches are created slowly
      
      ## How was this patch tested?
      Ran many times locally, going to run it many times in Jenkins
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #14361 from tdas/kafka-rate-test-fix.
      03c27435
    • Wenchen Fan's avatar
      [SPARK-16706][SQL] support java map in encoder · 6959061f
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Finish the TODO: create a new expression, `ExternalMapToCatalyst`, to iterate the map directly.
      
      ## How was this patch tested?
      
      new test in `JavaDatasetSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14344 from cloud-fan/java-map.
      6959061f
  6. Jul 25, 2016
    • Liang-Chi Hsieh's avatar
      [SPARK-16686][SQL] Remove PushProjectThroughSample since it is handled by ColumnPruning · 7b06a894
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      We push `Project` down through `Sample` in the `Optimizer` with the rule `PushProjectThroughSample`. However, if the projected columns produce new output, they will see the whole data instead of the sampled data. This brings some inconsistency between the original plan (Sample then Project) and the optimized plan (Project then Sample). In the extreme case, such as the one attached in the JIRA, if the projected column is a UDF which is supposed not to see the sampled-out data, the result of the UDF will be incorrect.
      
      Since the rule `ColumnPruning` already handles general `Project` pushdown, we don't need `PushProjectThroughSample` anymore. The rule `ColumnPruning` also avoids the described issue.
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #14327 from viirya/fix-sample-pushdown.
      7b06a894
    • Yin Huai's avatar
      [SPARK-16633][SPARK-16642][SPARK-16721][SQL] Fixes three issues related to lead and lag functions · 815f3eec
      Yin Huai authored
      ## What changes were proposed in this pull request?
      This PR contains three changes.
      
      First, this PR changes the behavior of lead/lag back to Spark 1.6's behavior, which is described as below:
      1. lead/lag respect null input values, which means that if the offset row exists and the input value is null, the result will be null instead of the default value.
      2. If the offset row does not exist, the default value will be used.
      3. OffsetWindowFunction's nullable setting also considers the nullability of its input (because of the first change).
      
      Second, this PR fixes the evaluation of lead/lag when the input expression is a literal. This fix is a result of the first change. In current master, if a literal is used as the input expression of a lead or lag function, the result will be this literal even if the offset row does not exist.
      
      Third, this PR makes ResolveWindowFrame not fire if a window function is not resolved.
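      A hedged illustration of the first two behaviors above (table and column names are made up):
      
      ```scala
      import org.apache.spark.sql.SparkSession
      
      val spark = SparkSession.builder().master("local[2]").appName("lead-lag-example").getOrCreate()
      import spark.implicits._
      
      Seq((1, Some(10)), (2, None), (3, Some(30))).toDF("id", "value").createOrReplaceTempView("vals")
      
      // For id = 1 the offset row (id = 2) exists but its value is NULL, so lead returns
      // NULL rather than the default -1; for id = 3 the offset row does not exist, so -1 is used.
      spark.sql("SELECT id, value, lead(value, 1, -1) OVER (ORDER BY id) AS next_value FROM vals").show()
      ```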
      
      ## How was this patch tested?
      New tests in SQLWindowFunctionSuite
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #14284 from yhuai/lead-lag.
      815f3eec
    • Michael Armbrust's avatar
      [SPARK-16724] Expose DefinedByConstructorParams · f99e34e8
      Michael Armbrust authored
      We don't generally make things in catalyst/execution private.  Instead they are just undocumented due to their lack of stability guarantees.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #14356 from marmbrus/patch-1.
      f99e34e8
    • Dongjoon Hyun's avatar
      [SPARK-16672][SQL] SQLBuilder should not raise exceptions on EXISTS queries · 8a8d26f1
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, `SQLBuilder` raises `empty.reduceLeft` exceptions on *unoptimized* `EXISTS` queries. We had better prevent this.
      ```scala
      scala> sql("CREATE TABLE t1(a int)")
      scala> val df = sql("select * from t1 b where exists (select * from t1 a)")
      scala> new org.apache.spark.sql.catalyst.SQLBuilder(df).toSQL
      java.lang.UnsupportedOperationException: empty.reduceLeft
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests with a new test suite.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14307 from dongjoon-hyun/SPARK-16672.
      8a8d26f1
    • Nicholas Brown's avatar
      Fix description of spark.speculation.quantile · ba0aade6
      Nicholas Brown authored
      ## What changes were proposed in this pull request?
      
      Minor doc fix regarding the spark.speculation.quantile configuration parameter. The documentation incorrectly states that it should be a percentage, when it should be a fraction.
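      A small configuration sketch reflecting the corrected wording (values are illustrative):
      
      ```scala
      import org.apache.spark.SparkConf
      
      // A fraction of tasks, not a percentage: speculate once 75% of tasks in a stage finish
      val conf = new SparkConf()
        .set("spark.speculation", "true")
        .set("spark.speculation.quantile", "0.75")
      ```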
      
      ## How was this patch tested?
      
      I tried building the documentation but got some unidoc errors.  I also got them when building off origin/master, so I don't think I caused that problem.  I did run the web app and saw the changes reflected as expected.
      
      Author: Nicholas Brown <nbrown@adroitdigital.com>
      
      Closes #14352 from nwbvt/master.
      ba0aade6
    • gatorsmile's avatar
      [SPARK-16678][SPARK-16677][SQL] Fix two View-related bugs · 3fc45669
      gatorsmile authored
      ## What changes were proposed in this pull request?
      **Issue 1: Disallow Creating/Altering a View when the same-name Table Exists (without IF NOT EXISTS)**
      When we create OR alter a view, we check whether the view already exists. In the current implementation, if a table with the same name exists, we treat it as a view. However, this is not the right behavior. We should follow what Hive does. For example,
      ```
      hive> CREATE TABLE tab1 (id int);
      OK
      Time taken: 0.196 seconds
      hive> CREATE OR REPLACE VIEW tab1 AS SELECT * FROM t1;
      FAILED: SemanticException [Error 10218]: Existing table is not a view
       The following is an existing table, not a view: default.tab1
      hive> ALTER VIEW tab1 AS SELECT * FROM t1;
      FAILED: SemanticException [Error 10218]: Existing table is not a view
       The following is an existing table, not a view: default.tab1
      hive> CREATE VIEW IF NOT EXISTS tab1 AS SELECT * FROM t1;
      OK
      Time taken: 0.678 seconds
      ```
      
      **Issue 2: Strange Error when Issuing Load Table Against A View**
      Users should not be allowed to issue LOAD DATA against a view. Currently, when users do this, they get a very strange runtime error. For example,
      ```SQL
      LOAD DATA LOCAL INPATH "$testData" INTO TABLE $viewName
      ```
      ```
      java.lang.reflect.InvocationTargetException was thrown.
      java.lang.reflect.InvocationTargetException
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:680)
      ```
      ## How was this patch tested?
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14314 from gatorsmile/tableDDLAgainstView.
      3fc45669
    • Shixiong Zhu's avatar
      [SPARK-16722][TESTS] Fix a StreamingContext leak in StreamingContextSuite when eventually fails · e164a04b
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR moves `ssc.stop()` into `finally` for `StreamingContextSuite.createValidCheckpoint` to avoid leaking a StreamingContext, since a leaked StreamingContext will fail a lot of tests and make it hard to find the real failure.
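      A minimal sketch of the pattern (simplified, not the exact suite code):
      
      ```scala
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      
      def createValidCheckpoint(dir: String): String = {
        val ssc = new StreamingContext(
          new SparkConf().setMaster("local[2]").setAppName("ckpt"), Seconds(1))
        try {
          ssc.checkpoint(dir)
          // ... set up a dummy stream and run long enough to write a checkpoint ...
          dir
        } finally {
          ssc.stop()  // always stop, even on failure, so no StreamingContext leaks into later tests
        }
      }
      ```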
      
      ## How was this patch tested?
      
      Jenkins unit tests
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #14354 from zsxwing/ssc-leak.
      e164a04b
    • Tao Lin's avatar
      [SPARK-15590][WEBUI] Paginate Job Table in Jobs tab · db36e1e7
      Tao Lin authored
      ## What changes were proposed in this pull request?
      
      This patch adds pagination support for the Job Tables in the Jobs tab. Pagination is provided for all of the three Job Tables (active, completed, and failed). Interactions (jumping, sorting, and setting page size) for paged tables are also included.
      
      The diff didn't keep track of some lines based on the original ones. The function `makeRow` of the original `AllJobsPage.scala` is reused. They are separated at the beginning of the function `jobRow` (L427-439) and the function `row` (L594-618) in the new `AllJobsPage.scala`.
      
      ## How was this patch tested?
      
      Tested manually by checking the Web UI after completing and failing hundreds of jobs.
      Generate completed jobs by:
      ```scala
      val d = sc.parallelize(Array(1,2,3,4,5))
      for(i <- 1 to 255){ var b = d.collect() }
      ```
      Generate failed jobs by calling the following code multiple times:
      ```scala
      var b = d.map(_/0).collect()
      ```
      Interactions like jumping, sorting, and setting page size are all tested.
      
      This shows the pagination for completed jobs:
      ![paginate success jobs](https://cloud.githubusercontent.com/assets/5558370/15986498/efa12ef6-303b-11e6-8b1d-c3382aeb9ad0.png)
      
      This shows the sorting works in job tables:
      ![sorting](https://cloud.githubusercontent.com/assets/5558370/15986539/98c8a81a-303c-11e6-86f2-8d2bc7924ee9.png)
      
      This shows the pagination for failed jobs and the effect of jumping and setting page size:
      ![paginate failed jobs](https://cloud.githubusercontent.com/assets/5558370/15986556/d8c1323e-303c-11e6-8e4b-7bdb030ea42b.png)
      
      Author: Tao Lin <nblintao@gmail.com>
      
      Closes #13620 from nblintao/dev.
      db36e1e7
    • Tathagata Das's avatar
      [SPARK-14131][STREAMING] SQL Improved fix for avoiding potential deadlocks in HDFSMetadataLog · c979c8bb
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      The current fix for the deadlock disables interrupts in the StreamExecution while getting offsets for all sources and when writing to any metadata log, to avoid potential deadlocks in HDFSMetadataLog (see the JIRA for more details). However, disabling interrupts can have unintended consequences in other sources. So I am making the fix narrower, by disabling interrupts only in HDFSMetadataLog. This is a narrower fix for something risky like disabling interrupts.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #14292 from tdas/SPARK-14131.
      c979c8bb