  1. Jul 08, 2016
    • cody koeninger's avatar
      [SPARK-13569][STREAMING][KAFKA] pattern based topic subscription · fd6e8f0e
      cody koeninger authored
      ## What changes were proposed in this pull request?
      Allow for kafka topic subscriptions based on a regex pattern.
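      A minimal usage sketch (not from the PR itself): subscribing to every topic matching a regex via the new `SubscribePattern` consumer strategy. The `streamingContext`, broker address, group id and pattern below are illustrative placeholders.
      ```scala
      import java.util.regex.Pattern
      import org.apache.kafka.common.serialization.StringDeserializer
      import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

      // Kafka consumer configuration; all values here are placeholders.
      val kafkaParams = Map[String, Object](
        "bootstrap.servers" -> "localhost:9092",
        "key.deserializer" -> classOf[StringDeserializer],
        "value.deserializer" -> classOf[StringDeserializer],
        "group.id" -> "example-group")

      // Subscribe to all topics whose names match "events-.*".
      val stream = KafkaUtils.createDirectStream[String, String](
        streamingContext,
        LocationStrategies.PreferConsistent,
        ConsumerStrategies.SubscribePattern[String, String](
          Pattern.compile("events-.*"), kafkaParams))
      ```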
      
      ## How was this patch tested?
      Unit tests, manual tests
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #14026 from koeninger/SPARK-13569.
      fd6e8f0e
    • Dongjoon Hyun's avatar
      [SPARK-16387][SQL] JDBC Writer should use dialect to quote field names. · 3b22291b
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, the JDBC writer uses dialects to get data types, but does not use them to quote field names. This PR uses dialects to quote the field names, too.
      
      **Reported Error Scenario (MySQL case)**
      ```scala
      scala> val url="jdbc:mysql://localhost:3306/temp"
      scala> val prop = new java.util.Properties
      scala> prop.setProperty("user","root")
      scala> val df = spark.createDataset(Seq("a","b","c")).toDF("order")
      scala> df.write.mode("overwrite").jdbc(url, "temptable", prop)
      ...MySQLSyntaxErrorException: ... near 'order TEXT )
      ```
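      A hedged sketch of the mechanism (standard `JdbcDialects` API, not the exact writer code): the dialect registered for the JDBC URL is looked up and used to quote each column name, so reserved words such as `order` no longer break the generated DDL.
      ```scala
      import org.apache.spark.sql.jdbc.JdbcDialects

      // Look up the dialect for this URL and quote a reserved word with it.
      val dialect = JdbcDialects.get("jdbc:mysql://localhost:3306/temp")
      val quoted = dialect.quoteIdentifier("order")  // `order` (backticks) for MySQL
      ```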
      
      ## How was this patch tested?
      
      Pass the Jenkins tests and manually do the above case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14107 from dongjoon-hyun/SPARK-16387.
      3b22291b
    • Yin Huai's avatar
      [SPARK-16453][BUILD] release-build.sh is missing hive-thriftserver for scala 2.10 · 60ba436b
      Yin Huai authored
      ## What changes were proposed in this pull request?
      This PR adds hive-thriftserver profile to scala 2.10 build created by release-build.sh.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #14108 from yhuai/SPARK-16453.
      60ba436b
    • wujian's avatar
      [SPARK-16281][SQL] Implement parse_url SQL function · f5fef691
      wujian authored
      ## What changes were proposed in this pull request?
      
      This PR adds the `parse_url` SQL function in order to remove the Hive fallback.
      
      A new implementation of #13999
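      A usage sketch, assuming the Hive-compatible signature `parse_url(url, partToExtract[, key])`:
      ```scala
      // Extract the host and a single query parameter from a URL.
      spark.sql("""
        SELECT
          parse_url('http://spark.apache.org/path?query=1', 'HOST') AS host,
          parse_url('http://spark.apache.org/path?query=1', 'QUERY', 'query') AS q
      """).show()
      // host = spark.apache.org, q = 1
      ```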
      
      ## How was this patch tested?
      
      Pass the existing tests, including new test cases.
      
      Author: wujian <jan.chou.wu@gmail.com>
      
      Closes #14008 from janplus/SPARK-16281.
      f5fef691
    • Dongjoon Hyun's avatar
      [SPARK-16429][SQL] Include `StringType` columns in `describe()` · 142df483
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, Spark `describe` supports `StringType` when columns are specified explicitly. However, `describe()` without arguments returns a dataset containing only the numeric columns. This PR includes `StringType` columns in no-argument `describe()` as well.
      
      **Background**
      ```scala
      scala> spark.read.json("examples/src/main/resources/people.json").describe("age", "name").show()
      +-------+------------------+-------+
      |summary|               age|   name|
      +-------+------------------+-------+
      |  count|                 2|      3|
      |   mean|              24.5|   null|
      | stddev|7.7781745930520225|   null|
      |    min|                19|   Andy|
      |    max|                30|Michael|
      +-------+------------------+-------+
      ```
      
      **Before**
      ```scala
      scala> spark.read.json("examples/src/main/resources/people.json").describe().show()
      +-------+------------------+
      |summary|               age|
      +-------+------------------+
      |  count|                 2|
      |   mean|              24.5|
      | stddev|7.7781745930520225|
      |    min|                19|
      |    max|                30|
      +-------+------------------+
      ```
      
      **After**
      ```scala
      scala> spark.read.json("examples/src/main/resources/people.json").describe().show()
      +-------+------------------+-------+
      |summary|               age|   name|
      +-------+------------------+-------+
      |  count|                 2|      3|
      |   mean|              24.5|   null|
      | stddev|7.7781745930520225|   null|
      |    min|                19|   Andy|
      |    max|                30|Michael|
      +-------+------------------+-------+
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests with an updated test case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14095 from dongjoon-hyun/SPARK-16429.
      142df483
    • Ryan Blue's avatar
      [SPARK-16420] Ensure compression streams are closed. · 67e085ef
      Ryan Blue authored
      ## What changes were proposed in this pull request?
      
      This uses the try/finally pattern to ensure streams are closed after use. `UnsafeShuffleWriter` wasn't closing compression streams, causing them to leak resources until garbage collected. This was a problem for codecs that use off-heap memory.
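      A minimal sketch of the pattern, with `codec`, `file` and `bytes` as illustrative placeholders rather than the actual `UnsafeShuffleWriter` internals:
      ```scala
      // Wrap the file output stream in a compression stream and always close it,
      // even if the write fails, so the codec's (possibly off-heap) buffers are released.
      val out = codec.compressedOutputStream(new java.io.FileOutputStream(file))
      try {
        out.write(bytes)
      } finally {
        out.close()
      }
      ```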
      
      ## How was this patch tested?
      
      Current tests are sufficient. This should not change behavior.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #14093 from rdblue/SPARK-16420-unsafe-shuffle-writer-leak.
      67e085ef
    • Jurriaan Pruis's avatar
      [SPARK-13638][SQL] Add quoteAll option to CSV DataFrameWriter · 38cf8f2a
      Jurriaan Pruis authored
      ## What changes were proposed in this pull request?
      
      Adds a `quoteAll` option for writing CSV which will quote all fields.
      See https://issues.apache.org/jira/browse/SPARK-13638
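      A usage sketch of the new option (the output path is a placeholder):
      ```scala
      // Quote every field in the written CSV, not just the ones that need escaping.
      df.write
        .option("quoteAll", "true")
        .csv("/tmp/quoted-output")
      ```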
      
      ## How was this patch tested?
      
      Added a test to verify that the output columns are quoted for all fields in the DataFrame.
      
      Author: Jurriaan Pruis <email@jurriaanpruis.nl>
      
      Closes #13374 from jurriaan/csv-quote-all.
      38cf8f2a
    • Xusen Yin's avatar
      [SPARK-16369][MLLIB] tallSkinnyQR of RowMatrix should aware of empty partition · 255d74fe
      Xusen Yin authored
      ## What changes were proposed in this pull request?
      
      tallSkinnyQR of RowMatrix should be aware of empty partitions, which could otherwise cause an exception in the Breeze QR decomposition.
      
      See the [archived dev mail](https://mail-archives.apache.org/mod_mbox/spark-dev/201510.mbox/%3CCAF7ADNrycvPL3qX-VZJhq4OYmiUUhoscut_tkOm63Cm18iK1tQmail.gmail.com%3E) for more details.
      
      ## How was this patch tested?
      
      Scala unit test.
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #14049 from yinxusen/SPARK-16369.
      255d74fe
    • Dongjoon Hyun's avatar
      [SPARK-16285][SQL] Implement sentences SQL functions · a54438cb
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR implements `sentences` SQL function.
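      A usage sketch, assuming Hive-compatible semantics where `sentences(str[, lang, country])` splits text into sentences, each returned as an array of words:
      ```scala
      spark.sql("SELECT sentences('Hi there! Good morning.')").show(false)
      // [[Hi, there], [Good, morning]]
      ```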
      
      ## How was this patch tested?
      
      Pass the Jenkins tests with a new testcase.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14004 from dongjoon-hyun/SPARK_16285.
      a54438cb
    • petermaxlee's avatar
      [SPARK-16436][SQL] checkEvaluation should support NaN · 8228b063
      petermaxlee authored
      ## What changes were proposed in this pull request?
      This small patch modifies `ExpressionEvalHelper.checkEvaluation` to support comparing NaN values in floating point comparisons.
      
      ## How was this patch tested?
      This is a test harness change.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #14103 from petermaxlee/SPARK-16436.
      8228b063
    • Dongjoon Hyun's avatar
      [SPARK-16052][SQL] Improve `CollapseRepartition` optimizer for Repartition/RepartitionBy · dff73bfa
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR improves `CollapseRepartition` to optimize the adjacent combinations of **Repartition** and **RepartitionBy**. Also, this PR adds a testsuite for this optimizer.
      
      **Target Scenario**
      ```scala
      scala> val dsView1 = spark.range(8).repartition(8, $"id")
      scala> dsView1.createOrReplaceTempView("dsView1")
      scala> sql("select id from dsView1 distribute by id").explain(true)
      ```
      
      **Before**
      ```scala
      scala> sql("select id from dsView1 distribute by id").explain(true)
      == Parsed Logical Plan ==
      'RepartitionByExpression ['id]
      +- 'Project ['id]
         +- 'UnresolvedRelation `dsView1`
      
      == Analyzed Logical Plan ==
      id: bigint
      RepartitionByExpression [id#0L]
      +- Project [id#0L]
         +- SubqueryAlias dsview1
            +- RepartitionByExpression [id#0L], 8
               +- Range (0, 8, splits=8)
      
      == Optimized Logical Plan ==
      RepartitionByExpression [id#0L]
      +- RepartitionByExpression [id#0L], 8
         +- Range (0, 8, splits=8)
      
      == Physical Plan ==
      Exchange hashpartitioning(id#0L, 200)
      +- Exchange hashpartitioning(id#0L, 8)
         +- *Range (0, 8, splits=8)
      ```
      
      **After**
      ```scala
      scala> sql("select id from dsView1 distribute by id").explain(true)
      == Parsed Logical Plan ==
      'RepartitionByExpression ['id]
      +- 'Project ['id]
         +- 'UnresolvedRelation `dsView1`
      
      == Analyzed Logical Plan ==
      id: bigint
      RepartitionByExpression [id#0L]
      +- Project [id#0L]
         +- SubqueryAlias dsview1
            +- RepartitionByExpression [id#0L], 8
               +- Range (0, 8, splits=8)
      
      == Optimized Logical Plan ==
      RepartitionByExpression [id#0L]
      +- Range (0, 8, splits=8)
      
      == Physical Plan ==
      Exchange hashpartitioning(id#0L, 200)
      +- *Range (0, 8, splits=8)
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including a new testsuite).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13765 from dongjoon-hyun/SPARK-16052.
      dff73bfa
    • Tathagata Das's avatar
      [SPARK-16430][SQL][STREAMING] Add option maxFilesPerTrigger · 5bce4580
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      An option that limits the file stream source to read one file at a time enables rate limiting. It has the additional convenience that a static set of files can be used like a stream for testing, as those files are then considered one at a time.
      
      This PR adds option `maxFilesPerTrigger`.
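      A usage sketch of the new option on a file stream source (the path is a placeholder):
      ```scala
      // Consider at most one new file per trigger, effectively rate-limiting the source.
      val lines = spark.readStream
        .format("text")
        .option("maxFilesPerTrigger", "1")
        .load("/data/incoming")
      ```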
      
      ## How was this patch tested?
      
      New unit test
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #14094 from tdas/SPARK-16430.
      5bce4580
  2. Jul 07, 2016
    • Dongjoon Hyun's avatar
      [SPARK-16425][R] `describe()` should not fail with non-numeric columns · 6aa7d09f
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR prevents ERRORs when `summary(df)` is called for a `SparkDataFrame` with non-numeric columns. This failure happens only in `SparkR`.
      
      **Before**
      ```r
      > df <- createDataFrame(faithful)
      > df <- withColumn(df, "boolean", df$waiting==79)
      > summary(df)
      16/07/07 14:15:16 ERROR RBackendHandler: describe on 34 failed
      Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
        org.apache.spark.sql.AnalysisException: cannot resolve 'avg(`boolean`)' due to data type mismatch: function average requires numeric types, not BooleanType;
      ```
      
      **After**
      ```r
      > df <- createDataFrame(faithful)
      > df <- withColumn(df, "boolean", df$waiting==79)
      > summary(df)
      SparkDataFrame[summary:string, eruptions:string, waiting:string]
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests with an updated test case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14096 from dongjoon-hyun/SPARK-16425.
      6aa7d09f
    • Felix Cheung's avatar
      [SPARK-16310][SPARKR] R na.string-like default for csv source · f4767bcc
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Apply "NA" as the default null string for R, like the `na.strings` parameter of R's `read.csv`.
      
      https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
      na.strings = "NA"
      
      A user passing a csv file with NA values should get the same behavior with SparkR `read.df(..., source = "csv")`.
      
      (couldn't open JIRA, will do that later)
      
      ## How was this patch tested?
      
      unit tests
      
      shivaram
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #13984 from felixcheung/rcsvnastring.
      f4767bcc
    • Daoyuan Wang's avatar
      [SPARK-16415][SQL] fix catalog string error · 28710b42
      Daoyuan Wang authored
      ## What changes were proposed in this pull request?
      
      In #13537 we truncate `simpleString` if it is a long `StructType`. But sometimes we need `catalogString` to reconstruct `TypeInfo`, for example in the description of [SPARK-16415](https://issues.apache.org/jira/browse/SPARK-16415). So we need to keep the implementation of `catalogString` unaffected by our truncation.
      
      ## How was this patch tested?
      
      added a test case.
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #14089 from adrian-wang/catalogstring.
      28710b42
    • Liwei Lin's avatar
      [SPARK-16350][SQL] Fix support for incremental planning in writeStream.foreach() · 0f7175de
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      There are cases where `complete` output mode does not output updated aggregated values; for details please refer to [SPARK-16350](https://issues.apache.org/jira/browse/SPARK-16350).
      
      The cause is that, as we do `data.as[T].foreachPartition { iter => ... }` in `ForeachSink.addBatch()`, `foreachPartition()` does not support incremental planning for now.
      
      This patch makes `foreachPartition()` support incremental planning in `ForeachSink`, by making a special version of `Dataset` whose `rdd()` method supports incremental planning.
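      A hedged sketch of the kind of query affected: a complete-mode aggregation written through `foreach()`, which with this change sees the re-planned (updated) results on every batch. The `counts` Dataset is a placeholder for any streaming aggregation.
      ```scala
      import org.apache.spark.sql.{ForeachWriter, Row}

      val query = counts.writeStream
        .outputMode("complete")
        .foreach(new ForeachWriter[Row] {
          def open(partitionId: Long, version: Long): Boolean = true
          def process(row: Row): Unit = println(row)   // side-effecting per-row sink
          def close(errorOrNull: Throwable): Unit = ()
        })
        .start()
      ```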
      
      ## How was this patch tested?
      
      Added a unit test which failed before the change
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #14030 from lw-lin/fix-foreach-complete.
      0f7175de
    • Dongjoon Hyun's avatar
      [SPARK-16174][SQL] Improve `OptimizeIn` optimizer to remove literal repetitions · a04cab8f
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR improves `OptimizeIn` optimizer to remove the literal repetitions from SQL `IN` predicates. This optimizer prevents user mistakes and also can optimize some queries like [TPCDS-36](https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q36.sql#L19).
      
      **Before**
      ```scala
      scala> sql("select state from (select explode(array('CA','TN')) state) where state in ('TN','TN','TN','TN','TN','TN','TN')").explain
      == Physical Plan ==
      *Filter state#6 IN (TN,TN,TN,TN,TN,TN,TN)
      +- Generate explode([CA,TN]), false, false, [state#6]
         +- Scan OneRowRelation[]
      ```
      
      **After**
      ```scala
      scala> sql("select state from (select explode(array('CA','TN')) state) where state in ('TN','TN','TN','TN','TN','TN','TN')").explain
      == Physical Plan ==
      *Filter state#6 IN (TN)
      +- Generate explode([CA,TN]), false, false, [state#6]
         +- Scan OneRowRelation[]
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including a new testcase).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13876 from dongjoon-hyun/SPARK-16174.
      a04cab8f
    • MechCoder's avatar
      [SPARK-16399][PYSPARK] Force PYSPARK_PYTHON to python · 6343f665
      MechCoder authored
      ## What changes were proposed in this pull request?
      
      I would like to change
      
      ```bash
      if hash python2.7 2>/dev/null; then
        # Attempt to use Python 2.7, if installed:
        DEFAULT_PYTHON="python2.7"
      else
        DEFAULT_PYTHON="python"
      fi
      ```
      
      to just ```DEFAULT_PYTHON="python"```
      
      I'm not sure if it is a great assumption that python2.7 is used by default, when python points to something else.
      
      ## How was this patch tested?
      
      (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
      
      (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
      
      Author: MechCoder <mks542@nyu.edu>
      
      Closes #14016 from MechCoder/followup.
      6343f665
    • Xusen Yin's avatar
      [SPARK-16372][MLLIB] Retag RDD to tallSkinnyQR of RowMatrix · 4c6f00d0
      Xusen Yin authored
      ## What changes were proposed in this pull request?
      
      The following Java code fails because of type erasure:
      
      ```Java
      JavaRDD<Vector> rows = jsc.parallelize(...);
      RowMatrix mat = new RowMatrix(rows.rdd());
      QRDecomposition<RowMatrix, Matrix> result = mat.tallSkinnyQR(true);
      ```
      
      We should use retag to restore the type to prevent the following exception:
      
      ```Java
      java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.mllib.linalg.Vector;
      ```
      
      ## How was this patch tested?
      
      Java unit test
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #14051 from yinxusen/SPARK-16372.
      4c6f00d0
    • Reynold Xin's avatar
      [SPARK-16400][SQL] Remove InSet filter pushdown from Parquet · 986b2514
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes InSet filter pushdown from Parquet data source, since row-based pushdown is not beneficial to Spark and brings extra complexity to the code base.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14076 from rxin/SPARK-16400.
      986b2514
    • gatorsmile's avatar
      [SPARK-16368][SQL] Fix Strange Errors When Creating View With Unmatched Column Num · ab05db0b
      gatorsmile authored
      #### What changes were proposed in this pull request?
      When creating a view, a common user error is that the number of columns produced by the `SELECT` clause does not match the number of column names specified by `CREATE VIEW`.
      
      For example, given that table `t1` has only 3 columns:
      ```SQL
      create view v1(col2, col4, col3, col5) as select * from t1
      ```
      Currently, Spark SQL reports the following error:
      ```
      requirement failed
      java.lang.IllegalArgumentException: requirement failed
      	at scala.Predef$.require(Predef.scala:212)
      	at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:90)
      ```
      
      This error message is very confusing. This PR is to detect the error and issue a meaningful error message.
      
      #### How was this patch tested?
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14047 from gatorsmile/viewMismatchedColumns.
      ab05db0b
    • Tom Magrino's avatar
      [SPARK-15885][WEB UI] Provide links to executor logs from stage details page in UI · ce3ea969
      Tom Magrino authored
      ## What changes were proposed in this pull request?
      
      This moves over old PR https://github.com/apache/spark/pull/13664 to target master rather than branch-1.6.
      
      Added links to logs (or an indication that there are no logs) for entries which list an executor in the stage details page of the UI.
      
      This helps streamline the workflow where a user views a stage details page and determines that they would like to see the associated executor log for further examination.  Previously, a user would have to cross reference the executor id listed on the stage details page with the corresponding entry on the executors tab.
      
      Link to the JIRA: https://issues.apache.org/jira/browse/SPARK-15885
      
      ## How was this patch tested?
      
      Ran existing unit tests.
      Ran test queries on a platform which did not record executor logs and again on a platform which did record executor logs and verified that the new table column was empty and links to the logs (which were verified as linking to the appropriate files), respectively.
      
      Attached is a screenshot of the UI page with no links, with the new columns highlighted.  Additional screenshot of these columns with the populated links.
      
      Without links:
      ![updated without logs](https://cloud.githubusercontent.com/assets/1450821/16059721/2b69dbaa-3239-11e6-9eed-e539764ca159.png)
      
      With links:
      ![updated with logs](https://cloud.githubusercontent.com/assets/1450821/16059725/32c6e316-3239-11e6-90bd-2553f43f7779.png)
      
      This contribution is my original work and I license the work to the project under the Apache Spark project's open source license.
      
      Author: Tom Magrino <tmagrino@fb.com>
      
      Closes #13861 from tmagrino/uilogstweak.
      ce3ea969
    • Shixiong Zhu's avatar
      [SPARK-16021][TEST-MAVEN] Fix the maven build · 4b5a72c7
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Fixed the maven build for #13983
      
      ## How was this patch tested?
      
      The existing tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #14084 from zsxwing/fix-maven.
      4b5a72c7
    • MasterDDT's avatar
      [SPARK-16398][CORE] Make cancelJob and cancelStage APIs public · 69f53914
      MasterDDT authored
      ## What changes were proposed in this pull request?
      
      Make SparkContext `cancelJob` and `cancelStage` APIs public. This allows applications to use `SparkListener` to do their own management of jobs via events, but without using the REST API.
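      A hedged sketch of the intended usage (identifiers are placeholders): a `SparkListener` observes job starts and the application later cancels a job or stage by id.
      ```scala
      import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

      sc.addSparkListener(new SparkListener {
        override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
          // e.g. stash jobStart.jobId so it can be cancelled on a timeout
        }
      })

      val jobIdToCancel = 0        // placeholder id recorded from a listener event
      sc.cancelJob(jobIdToCancel)  // now public
      sc.cancelStage(0)            // now public; stage id also from listener events
      ```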
      
      ## How was this patch tested?
      
      Existing tests (dev/run-tests)
      
      Author: MasterDDT <miteshp@live.com>
      
      Closes #14072 from MasterDDT/SPARK-16398.
      69f53914
  3. Jul 06, 2016
    • gatorsmile's avatar
      [SPARK-16374][SQL] Remove Alias from MetastoreRelation and SimpleCatalogRelation · 42279bff
      gatorsmile authored
      #### What changes were proposed in this pull request?
      Different from the other leaf nodes, `MetastoreRelation` and `SimpleCatalogRelation` have a pre-defined `alias`, which is used to change the qualifier of the node. However, based on the existing alias handling, alias should be put in `SubqueryAlias`.
      
      This PR is to separate alias handling from `MetastoreRelation` and `SimpleCatalogRelation` to make it consistent with the other nodes. It simplifies the signature and conversion to a `BaseRelation`.
      
      For example, below is an example query for `MetastoreRelation`,  which is converted to a `LogicalRelation`:
      ```SQL
      SELECT tmp.a + 1 FROM test_parquet_ctas tmp WHERE tmp.a > 2
      ```
      
      Before changes, the analyzed plan is
      ```
      == Analyzed Logical Plan ==
      (a + 1): int
      Project [(a#951 + 1) AS (a + 1)#952]
      +- Filter (a#951 > 2)
         +- SubqueryAlias tmp
            +- Relation[a#951] parquet
      ```
      After changes, the analyzed plan becomes
      ```
      == Analyzed Logical Plan ==
      (a + 1): int
      Project [(a#951 + 1) AS (a + 1)#952]
      +- Filter (a#951 > 2)
         +- SubqueryAlias tmp
            +- SubqueryAlias test_parquet_ctas
               +- Relation[a#951] parquet
      ```
      
      **Note: the optimized plans are the same.**
      
      For `SimpleCatalogRelation`, the existing code always generates two Subqueries. Thus, no change is needed.
      
      #### How was this patch tested?
      Added test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14053 from gatorsmile/removeAliasFromMetastoreRelation.
      42279bff
    • hyukjinkwon's avatar
      [SPARK-14839][SQL] Support for other types for `tableProperty` rule in SQL syntax · 34283de1
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, the Scala API supports options with the types `String`, `Long`, `Double` and `Boolean`, and the Python API also supports other types.
      
      This PR corrects the `tableProperty` rule to support other types (string, boolean, double and integer) so that options for data sources are supported in a consistent way. This will affect other rules such as DBPROPERTIES and TBLPROPERTIES (allowing other types as values).
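      A hedged sketch of what the relaxed rule allows (option names follow the usual json source; the path is the example file used elsewhere in this log): non-string values such as a double and a boolean can now be written without quotes.
      ```scala
      spark.sql("""
        CREATE TEMPORARY VIEW people
        USING json
        OPTIONS (
          path 'examples/src/main/resources/people.json',
          samplingRatio 0.9,
          primitivesAsString true)
      """)
      ```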
      
      Also, `TODO add bucketing and partitioning.` was removed because it was resolved in https://github.com/apache/spark/commit/24bea000476cdd0b43be5160a76bc5b170ef0b42
      
      ## How was this patch tested?
      
      Unit test in `MetastoreDataSourcesSuite.scala`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #13517 from HyukjinKwon/SPARK-14839.
      34283de1
    • Eric Liang's avatar
      [SPARK-16021] Fill freed memory in test to help catch correctness bugs · 44c7c62b
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      This patches `MemoryAllocator` to fill clean and freed memory with known byte values, similar to https://github.com/jemalloc/jemalloc/wiki/Use-Case:-Find-a-memory-corruption-bug . Memory filling is controlled by a flag that is enabled only in tests by default.
      
      ## How was this patch tested?
      
      Unit test verifying that the flag is enabled in tests.
      
      cc sameeragarwal
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #13983 from ericl/spark-16021.
      44c7c62b
    • cody koeninger's avatar
      [SPARK-16212][STREAMING][KAFKA] apply test tweaks from 0-10 to 0-8 as well · b8ebf63c
      cody koeninger authored
      ## What changes were proposed in this pull request?
      
      Bring the kafka-0-8 subproject up to date with some test modifications from development on 0-10.
      
      Main changes are
      - eliminating waits on concurrent queue in favor of an assert on received results,
      - atomics instead of volatile (although this probably doesn't matter)
      - increasing uniqueness of topic names
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #14073 from koeninger/kafka-0-8-test-direct-cleanup.
      b8ebf63c
    • Reynold Xin's avatar
      [SPARK-16371][SQL] Two follow-up tasks · 8e3e4ed6
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This is a small follow-up for SPARK-16371:
      
      1. Hide removeMetadata from public API.
      2. Add JIRA ticket number to test case name.
      
      ## How was this patch tested?
      Updated a test comment.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14074 from rxin/parquet-filter.
      8e3e4ed6
    • Michael Gummelt's avatar
      [MESOS] expand coarse-grained mode docs · 9c041990
      Michael Gummelt authored
      ## What changes were proposed in this pull request?
      
      docs
      
      ## How was this patch tested?
      
      viewed the docs in github
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #14059 from mgummelt/coarse-grained.
      9c041990
    • Sean Owen's avatar
      [SPARK-16379][CORE][MESOS] Spark on mesos is broken due to race condition in Logging · a8f89df3
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      The commit https://github.com/apache/spark/commit/044971eca0ff3c2ce62afa665dbd3072d52cbbec introduced a lazy val to simplify code in Logging. Simple enough, though one side effect is that accessing log now means grabbing the instance's lock. This in turn turned up a form of deadlock in the Mesos code. It was arguably a bit of a problem in how this code is structured, but, in any event the safest thing to do seems to be to revert the commit, and that's 90% of the change here; it's just not worth the risk of similar more subtle issues.
      
      What I didn't revert here was the removal of this odd override of log in the Mesos code. In retrospect it might have been put in place at some stage as a defense against this type of problem. After all the Logging code still involved a lock at initialization before the change in question.
      
      Even after the revert, it doesn't seem like it does anything, given how Logging works now, so I left it removed. However, I also removed the particular log message that ended up playing a part in this problem anyway, maybe being paranoid, to make sure this type of problem can't happen even with how the current locking works in logging initialization.
      
      ## How was this patch tested?
      
      Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14069 from srowen/SPARK-16379.
      a8f89df3
    • tmnd1991's avatar
      [SPARK-15740][MLLIB] Word2VecSuite "big model load / save" caused OOM in maven jenkins builds · 040f6f9f
      tmnd1991 authored
      ## What changes were proposed in this pull request?
      "test big model load / save" in Word2VecSuite, lately resulted into OOM.
      Therefore we decided to make the partitioning adaptive (not based on spark default "spark.kryoserializer.buffer.max" conf) and then testing it using a small buffer size in order to trigger partitioning without allocating too much memory for the test.
      
      ## How was this patch tested?
      It was tested running the following unit test:
      org.apache.spark.mllib.feature.Word2VecSuite
      
      Author: tmnd1991 <antonio.murgia2@studio.unibo.it>
      
      Closes #13509 from tmnd1991/SPARK-15740.
      040f6f9f
    • hyukjinkwon's avatar
      [SPARK-16371][SQL] Do not push down filters incorrectly when inner name and... · 4f8ceed5
      hyukjinkwon authored
      [SPARK-16371][SQL] Do not push down filters incorrectly when inner name and outer name are the same in Parquet
      
      ## What changes were proposed in this pull request?
      
      Currently, if there is a schema as below:
      
      ```
      root
        |-- _1: struct (nullable = true)
        |    |-- _1: integer (nullable = true)
      ```
      
      and if we execute the code below:
      
      ```scala
      df.filter("_1 IS NOT NULL").count()
      ```
      
      This pushes down a filter although the filter is being applied to a `StructType`. (If my understanding is correct, Spark does not push down filters for those.)
      
      The reason is, `ParquetFilters.getFieldMap` produces results below:
      
      ```
      (_1,StructType(StructField(_1,IntegerType,true)))
      (_1,IntegerType)
      ```
      
      and then it becomes a `Map`
      
      ```
      (_1,IntegerType)
      ```
      
      Now, because of ` ....lift(dataTypeOf(name)).map(_(name, value))`, this pushes down filters for `_1` which Parquet thinks is `IntegerType`. However, it is actually `StructType`.
      
      So, Parquet filter2 produces incorrect results; for example, the code below:
      
      ```
      df.filter("_1 IS NOT NULL").count()
      ```
      
      always produces 0.
      
      This PR prevents this by not finding nested fields.
      
      ## How was this patch tested?
      
      Unit test in `ParquetFilterSuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14067 from HyukjinKwon/SPARK-16371.
      4f8ceed5
    • petermaxlee's avatar
      [SPARK-16304] LinkageError should not crash Spark executor · 480357cc
      petermaxlee authored
      ## What changes were proposed in this pull request?
      This patch updates the failure handling logic so Spark executor does not crash when seeing LinkageError.
      
      ## How was this patch tested?
      Added an end-to-end test in FailureSuite.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #13982 from petermaxlee/SPARK-16304.
      480357cc
    • hyukjinkwon's avatar
      [MINOR][PYSPARK][DOC] Fix wrongly formatted examples in PySpark documentation · 4e14199f
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR fixes wrongly formatted examples in PySpark documentation as below:
      
      - **`SparkSession`**
      
        - **Before**
      
          ![2016-07-06 11 34 41](https://cloud.githubusercontent.com/assets/6477701/16605847/ae939526-436d-11e6-8ab8-6ad578362425.png)
      
        - **After**
      
          ![2016-07-06 11 33 56](https://cloud.githubusercontent.com/assets/6477701/16605845/ace9ee78-436d-11e6-8923-b76d4fc3e7c3.png)
      
      - **`Builder`**
      
        - **Before**
          ![2016-07-06 11 34 44](https://cloud.githubusercontent.com/assets/6477701/16605844/aba60dbc-436d-11e6-990a-c87bc0281c6b.png)
      
        - **After**
          ![2016-07-06 1 26 37](https://cloud.githubusercontent.com/assets/6477701/16607562/586704c0-437d-11e6-9483-e0af93d8f74e.png)
      
      This PR also fixes several similar instances across the documentation in `sql` PySpark module.
      
      ## How was this patch tested?
      
      N/A
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14063 from HyukjinKwon/minor-pyspark-builder.
      4e14199f
    • WeichenXu's avatar
      [DOC][SQL] update out-of-date code snippets using SQLContext in all documents. · b1310425
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      I searched the whole docs directory for SQLContext and updated the following places:
      
      - docs/configuration.md, sparkR code snippets.
      - docs/streaming-programming-guide.md, several code examples.
      
      ## How was this patch tested?
      
      N/A
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14025 from WeichenXu123/WIP_SQLContext_update.
      b1310425
    • Cheng Lian's avatar
      [SPARK-15979][SQL] Renames CatalystWriteSupport to ParquetWriteSupport · 23eff5e5
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      PR #13696 renamed various Parquet support classes but left `CatalystWriteSupport` behind. This PR is renames it as a follow-up.
      
      ## How was this patch tested?
      
      N/A.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #14070 from liancheng/spark-15979-follow-up.
      23eff5e5
    • Tao Lin's avatar
      [SPARK-15591][WEBUI] Paginate Stage Table in Stages tab · 478b71d0
      Tao Lin authored
      ## What changes were proposed in this pull request?
      
      This patch adds pagination support for the Stage Tables in the Stages tab. Pagination is provided for all four Stage Tables (active, pending, completed, and failed). Besides, the paged stage tables are also used in JobPage (the detail page for one job) and PoolPage.
      
      Interactions (jumping, sorting, and setting page size) for paged tables are also included.
      
      ## How was this patch tested?
      
      Tested manually by checking the Web UI after completing and failing hundreds of jobs. Same as the testing for [Paginate Job Table in Jobs tab](https://github.com/apache/spark/pull/13620).
      
      This shows the pagination for completed stages:
      ![paged stage table](https://cloud.githubusercontent.com/assets/5558370/16125696/5804e35e-3427-11e6-8923-5c5948982648.png)
      
      Author: Tao Lin <nblintao@gmail.com>
      
      Closes #13708 from nblintao/stageTable.
      478b71d0
    • gatorsmile's avatar
      [SPARK-16229][SQL] Drop Empty Table After CREATE TABLE AS SELECT fails · 21eadd1d
      gatorsmile authored
      #### What changes were proposed in this pull request?
      In `CREATE TABLE AS SELECT`, if the `SELECT` query failed, the table should not exist. For example,
      
      ```SQL
      CREATE TABLE tab
      STORED AS TEXTFILE
      SELECT 1 AS a, (SELECT a FROM (SELECT 1 AS a UNION ALL SELECT 2 AS a) t) AS b
      ```
      The above query failed as expected but an empty table `tab` was created.
      
      This PR is to drop the created table when hitting any non-fatal exception.
      
      #### How was this patch tested?
      Added a test case to verify the behavior
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13926 from gatorsmile/dropTableAfterException.
      21eadd1d
    • MechCoder's avatar
      [SPARK-16307][ML] Add test to verify the predicted variances of a DT on toy data · 909c6d81
      MechCoder authored
      ## What changes were proposed in this pull request?
      
      The current tests assume that `impurity.calculate()` returns the variance correctly. It would be better to make the tests independent of this assumption; in other words, verify that the computed variance equals the variance computed manually on a small tree.
      
      ## How was this patch tested?
      
      The patch is a test....
      
      Author: MechCoder <mks542@nyu.edu>
      
      Closes #13981 from MechCoder/dt_variance.
      909c6d81