  1. Jul 12, 2016
    • Takuya UESHIN's avatar
      [SPARK-16189][SQL] Add ExternalRDD logical plan for input with RDD to have a... · 5b28e025
      Takuya UESHIN authored
      [SPARK-16189][SQL] Add ExternalRDD logical plan for input with RDD to have a chance to eliminate serialize/deserialize.
      
      ## What changes were proposed in this pull request?
      
      Currently the input `RDD` of a `Dataset` is always serialized to `RDD[InternalRow]` before being used as a `Dataset`, but there are cases where we call `map` or `mapPartitions` right after converting to a `Dataset`.
      In such cases a serialize-then-deserialize round trip happens even though it is not needed.
      
      This PR adds an `ExternalRDD` logical plan for `RDD` input so that the serialize/deserialize pair has a chance to be eliminated.
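      
      For illustration, a minimal sketch of the pattern this targets (hypothetical snippet, not code from the PR): the typed `map` runs immediately after the conversion, so the intermediate `InternalRow` encoding is pure overhead.
      
      ```scala
      // Hypothetical illustration: an RDD converted to a Dataset and mapped right away.
      // Without a dedicated logical plan for the external RDD, Spark serializes the input
      // to InternalRow and immediately deserializes it again for the map function.
      import spark.implicits._
      val rdd = spark.sparkContext.parallelize(1 to 10)
      val ds  = rdd.toDS().map(_ + 1)   // serialize/deserialize pair that could be eliminated
      ```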
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #13890 from ueshin/issues/SPARK-16189.
      5b28e025
    • WeichenXu's avatar
      [MINOR][ML] update comment where is inconsistent with code in ml.regression.LinearRegression · fc11c509
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      In the `train` method of `ml.regression.LinearRegression`, when handling the case `std(label) == 0`,
      the code replaces `std(label)` with `mean(label)`, but the corresponding comment is inconsistent with the code; this PR updates it.
      
      ## How was this patch tested?
      
      N/A
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14121 from WeichenXu123/update_lr_comment.
      fc11c509
    • petermaxlee's avatar
      [SPARK-16199][SQL] Add a method to list the referenced columns in data source Filter · c9a67621
      petermaxlee authored
      ## What changes were proposed in this pull request?
      It would be useful to support listing the columns that are referenced by a filter. This can help simplify data source planning, because with this we would be able to implement the `unhandledFilters` method in `HadoopFsRelation`.
      
      This is based on rxin's patch (#13901) and adds unit tests.
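      
      A rough usage sketch, assuming the new method is exposed as `references` on `org.apache.spark.sql.sources.Filter`:
      
      ```scala
      import org.apache.spark.sql.sources._
      
      // Columns referenced by a composite filter; useful when a relation decides
      // which filters it can handle on its own (e.g. unhandledFilters).
      val f: Filter = And(EqualTo("a", 1), GreaterThan("b", 10))
      val cols: Array[String] = f.references   // Array("a", "b")
      ```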
      
      ## How was this patch tested?
      Added a new suite FiltersSuite.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14120 from petermaxlee/SPARK-16199.
      c9a67621
  2. Jul 11, 2016
    • Russell Spitzer's avatar
      [SPARK-12639][SQL] Mark Filters Fully Handled By Sources with * · b1e5281c
      Russell Spitzer authored
      ## What changes were proposed in this pull request?
      
      In order to make it clear which filters are fully handled by the
      underlying datasource we will mark them with an *. This will give a
      clear visual cue to users that the filter is being treated differently
      by catalyst than filters which are just presented to the underlying
      DataSource.
      
      Examples from the FilteredScanSuite; in this example `c IN (...)` is handled by the source while `b < ...` is not.
      ### Before
      ```
      //SELECT a FROM oneToTenFiltered WHERE a + b > 9 AND b < 16 AND c IN ('bbbbbBBBBB', 'cccccCCCCC', 'dddddDDDDD', 'foo')
      == Physical Plan ==
      Project [a#0]
      +- Filter (((a#0 + b#1) > 9) && (b#1 < 16))
         +- Scan SimpleFilteredScan(1,10)[a#0,b#1] PushedFilters: [LessThan(b,16), In(c, [bbbbbBBBBB,cccccCCCCC,dddddDDDDD,foo]]
      ```
      
      ### After
      ```
      == Physical Plan ==
      Project [a#0]
      +- Filter (((a#0 + b#1) > 9) && (b#1 < 16))
         +- Scan SimpleFilteredScan(1,10)[a#0,b#1] PushedFilters: [LessThan(b,16), *In(c, [bbbbbBBBBB,cccccCCCCC,dddddDDDDD,foo]]
      ```
      
      ## How was this patch tested?
      
      Manually tested with the Spark Cassandra Connector, a source which fully handles underlying filters. Now fully handled filters appear with an * next to their names. I can add an automated test as well if requested.
      
      Post 1.6.1
      Tested by modifying the FilteredScanSuite to run explains.
      
      Author: Russell Spitzer <Russell.Spitzer@gmail.com>
      
      Closes #11317 from RussellSpitzer/SPARK-12639-Star.
      b1e5281c
    • Sameer Agarwal's avatar
      [SPARK-16488] Fix codegen variable namespace collision in pmod and partitionBy · 9cc74f95
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This patch fixes a variable namespace collision bug in pmod and partitionBy
      
      ## How was this patch tested?
      
      Regression test for one possible occurrence. A more general fix in `ExpressionEvalHelper.checkEvaluation` will be in a subsequent PR.
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #14144 from sameeragarwal/codegen-bug.
      9cc74f95
    • Tathagata Das's avatar
      [SPARK-16430][SQL][STREAMING] Fixed bug in the maxFilesPerTrigger in FileStreamSource · e50efd53
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      An incorrect list of files was being allocated to a batch. This caused a file to be read multiple times across multiple batches.
      
      ## How was this patch tested?
      
      Added unit tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #14143 from tdas/SPARK-16430-1.
      e50efd53
    • Shixiong Zhu's avatar
      [SPARK-16433][SQL] Improve StreamingQuery.explain when no data arrives · 91a443b8
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Display `No physical plan. Waiting for data.` instead of `N/A`  for StreamingQuery.explain when no data arrives because `N/A` doesn't provide meaningful information.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #14100 from zsxwing/SPARK-16433.
      91a443b8
    • Xin Ren's avatar
      [MINOR][STREAMING][DOCS] Minor changes on kinesis integration · 05d7151c
      Xin Ren authored
      ## What changes were proposed in this pull request?
      
      Some minor changes for documentation page "Spark Streaming + Kinesis Integration".
      
      Moved "streaming-kinesis-arch.png" before the bullet list, not in between the bullets.
      
      ## How was this patch tested?
      
      Tested manually, on my local machine.
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #14097 from keypointt/kinesisDoc.
      05d7151c
    • James Thomas's avatar
      [SPARK-16114][SQL] structured streaming event time window example · 9e2c763d
      James Thomas authored
      ## What changes were proposed in this pull request?
      
      A structured streaming example with event time windowing.
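      
      For context, a minimal sketch of event-time windowing in Structured Streaming (a hypothetical snippet, not the example added here), assuming a streaming DataFrame `words` with columns `timestamp: Timestamp` and `word: String`:
      
      ```scala
      import org.apache.spark.sql.functions._
      
      // Count words per 10-minute event-time window, sliding every 5 minutes.
      val windowedCounts = words
        .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("word"))
        .count()
      ```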
      
      ## How was this patch tested?
      
      Run locally
      
      Author: James Thomas <jamesjoethomas@gmail.com>
      
      Closes #13957 from jjthomas/current.
      9e2c763d
    • Marcelo Vanzin's avatar
      [SPARK-16349][SQL] Fall back to isolated class loader when classes not found. · b4fbe140
      Marcelo Vanzin authored
      Some Hadoop classes needed by the Hive metastore client jars are not present
      in Spark's packaging (for example, "org/apache/hadoop/mapred/MRVersion"). So
      if the parent class loader fails to find a class, try to load it from the
      isolated class loader, in case it's available there.
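      
      A minimal sketch of the fallback idea (illustrative only, not the actual isolated class loader code):
      
      ```scala
      // Try the parent class loader first; if the class is missing there,
      // fall back to the isolated class loader, in case it is available there.
      def loadWithFallback(name: String, parent: ClassLoader, isolated: ClassLoader): Class[_] =
        try parent.loadClass(name)
        catch { case _: ClassNotFoundException => isolated.loadClass(name) }
      ```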
      
      Tested by setting spark.sql.hive.metastore.jars to local paths with Hive/Hadoop
      libraries and verifying that Spark can talk to the metastore.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #14020 from vanzin/SPARK-16349.
      b4fbe140
    • Felix Cheung's avatar
      [SPARK-16144][SPARKR] update R API doc for mllib · 7f38b9d5
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      From SPARK-16140/PR #13921 - the issue is we left write.ml doc empty:
      ![image](https://cloud.githubusercontent.com/assets/8969467/16481934/856dd0ea-3e62-11e6-9474-e4d57d1ca001.png)
      
      Here's what I meant as the fix:
      ![image](https://cloud.githubusercontent.com/assets/8969467/16481943/911f02ec-3e62-11e6-9d68-17363a9f5628.png)
      
      ![image](https://cloud.githubusercontent.com/assets/8969467/16481950/9bc057aa-3e62-11e6-8127-54870701c4b1.png)
      
      I didn't realize there was already a JIRA on this. mengxr yanboliang
      
      ## How was this patch tested?
      
      Checked the generated docs.
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #13993 from felixcheung/rmllibdoc.
      7f38b9d5
    • Yanbo Liang's avatar
      [SPARKR][DOC] SparkR ML user guides update for 2.0 · 2ad031be
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      * Update SparkR ML section to make them consistent with SparkR API docs.
      * #13972 adds labelling support for the ```include_example``` Jekyll plugin, so we can split the single ```ml.R``` example file into multiple line blocks with different labels and include them in different algorithms/models in the generated HTML page.
      
      ## How was this patch tested?
      Only docs update, manually check the generated docs.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #14011 from yanboliang/r-user-guide-update.
      2ad031be
    • Dongjoon Hyun's avatar
      [SPARK-16458][SQL] SessionCatalog should support `listColumns` for temporary tables · 840853ed
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Temporary tables are used frequently, but `spark.catalog.listColumns` does not support those tables. This PR makes `SessionCatalog` support temporary table column listing.
      
      **Before**
      ```scala
      scala> spark.range(10).createOrReplaceTempView("t1")
      
      scala> spark.catalog.listTables().collect()
      res1: Array[org.apache.spark.sql.catalog.Table] = Array(Table[name=`t1`, tableType=`TEMPORARY`, isTemporary=`true`])
      
      scala> spark.catalog.listColumns("t1").collect()
      org.apache.spark.sql.AnalysisException: Table `t1` does not exist in database `default`.;
      ```
      
      **After**
      ```scala
      scala> spark.catalog.listColumns("t1").collect()
      res2: Array[org.apache.spark.sql.catalog.Column] = Array(Column[name='id', description='id', dataType='bigint', nullable='false', isPartition='false', isBucket='false'])
      ```
      ## How was this patch tested?
      
      Pass the Jenkins tests including a new testcase.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14114 from dongjoon-hyun/SPARK-16458.
      840853ed
    • Reynold Xin's avatar
      [SPARK-16477] Bump master version to 2.1.0-SNAPSHOT · ffcb6e05
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      After SPARK-16476 (committed earlier today as #14128), we can finally bump the version number.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14130 from rxin/SPARK-16477.
      ffcb6e05
    • Dongjoon Hyun's avatar
      [SPARK-16459][SQL] Prevent dropping current database · 7ac79da0
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR prevents dropping the current database to avoid errors like the followings.
      
      ```scala
      scala> sql("create database delete_db")
      scala> sql("use delete_db")
      scala> sql("drop database delete_db")
      scala> sql("create table t as select 1")
      org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database `delete_db` not found;
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests including an updated testcase.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14115 from dongjoon-hyun/SPARK-16459.
      7ac79da0
    • Xin Ren's avatar
      [SPARK-16381][SQL][SPARKR] Update SQL examples and programming guide for R language binding · 9cb1eb7a
      Xin Ren authored
      https://issues.apache.org/jira/browse/SPARK-16381
      
      ## What changes were proposed in this pull request?
      
      Update SQL examples and programming guide for R language binding.
      
      Here I just follow example https://github.com/apache/spark/compare/master...liancheng:example-snippet-extraction, created a separate R file to store all the example code.
      
      ## How was this patch tested?
      
      Manual test on my local machine.
      Screenshot as below:
      
      ![screen shot 2016-07-06 at 4 52 25 pm](https://cloud.githubusercontent.com/assets/3925641/16638180/13925a58-439a-11e6-8d57-8451a63dcae9.png)
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #14082 from keypointt/SPARK-16381.
      9cb1eb7a
    • gatorsmile's avatar
      [SPARK-16355][SPARK-16354][SQL] Fix Bugs When LIMIT/TABLESAMPLE is Non-foldable, Zero or Negative · e2262789
      gatorsmile authored
      #### What changes were proposed in this pull request?
      **Issue 1:** When a query contains LIMIT/TABLESAMPLE 0, the statistics could be zero. The results are correct, but this could cause a huge performance regression. For example,
      ```Scala
      Seq(("one", 1), ("two", 2), ("three", 3), ("four", 4)).toDF("k", "v")
        .createOrReplaceTempView("test")
      val df1 = spark.table("test")
      val df2 = spark.table("test").limit(0)
      val df = df1.join(df2, Seq("k"), "left")
      ```
      The statistics of both `df` and `df2` are zero. The statistics values should never be zero; otherwise `sizeInBytes` of `BinaryNode` will also be zero (the product of its children). This PR increases the value to `1` when the number of rows is equal to 0.
      
      **Issue 2:** When a query contains a negative LIMIT/TABLESAMPLE, we should issue an exception. Negative values could break implementation assumptions in multiple places, for example, statistics calculation. Below is an example query.
      ```SQL
      SELECT * FROM testData TABLESAMPLE (-1 rows)
      SELECT * FROM testData LIMIT -1
      ```
      This PR is to issue an appropriate exception in this case.
      
      **Issue 3:** Spark SQL follows the restriction of the LIMIT clause in Hive: the argument to the LIMIT clause must evaluate to a constant value. It can be a numeric literal, or another kind of numeric expression involving operators, casts, and function return values. You cannot refer to a column or use a subquery. Currently, we do not detect whether the expression in the LIMIT clause is foldable or not. If it is non-foldable, we might issue a strange error message. For example,
      ```SQL
      SELECT * FROM testData LIMIT rand() > 0.2
      ```
      Then, a misleading error message is issued, like
      ```
      assertion failed: No plan for GlobalLimit (_nondeterministic#203 > 0.2)
      +- Project [key#11, value#12, rand(-1441968339187861415) AS _nondeterministic#203]
         +- LocalLimit (_nondeterministic#202 > 0.2)
            +- Project [key#11, value#12, rand(-1308350387169017676) AS _nondeterministic#202]
               +- LogicalRDD [key#11, value#12]
      
      java.lang.AssertionError: assertion failed: No plan for GlobalLimit (_nondeterministic#203 > 0.2)
      +- Project [key#11, value#12, rand(-1441968339187861415) AS _nondeterministic#203]
         +- LocalLimit (_nondeterministic#202 > 0.2)
            +- Project [key#11, value#12, rand(-1308350387169017676) AS _nondeterministic#202]
               +- LogicalRDD [key#11, value#12]
      ```
      This PR detects it and then issues a meaningful error message.
      
      #### How was this patch tested?
      Added test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14034 from gatorsmile/limit.
      e2262789
    • petermaxlee's avatar
      [SPARK-16318][SQL] Implement all remaining xpath functions · 82f08744
      petermaxlee authored
      ## What changes were proposed in this pull request?
      This patch implements all remaining xpath functions that Hive supports but that are not natively supported in Spark: xpath_int, xpath_short, xpath_long, xpath_float, xpath_double, xpath_string, and xpath.
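      
      A quick usage sketch, assuming the usual Hive semantics for these functions (illustrative values):
      
      ```scala
      // xpath_* extract a single typed value; xpath returns an array of matching strings.
      spark.sql("SELECT xpath_int('<a><b>3</b></a>', 'a/b')").show()               // 3
      spark.sql("SELECT xpath_string('<a><b>bb</b></a>', 'a/b')").show()           // bb
      spark.sql("SELECT xpath('<a><b>b1</b><b>b2</b></a>', 'a/b/text()')").show()  // [b1, b2]
      ```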
      
      ## How was this patch tested?
      Added unit tests and end-to-end tests.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #13991 from petermaxlee/SPARK-16318.
      82f08744
    • Reynold Xin's avatar
      [SPARK-16476] Restructure MimaExcludes for easier union excludes · 52b5bb0b
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      It is currently fairly difficult to have proper mima excludes when we cut a version branch. I'm proposing a small change to take the exclude list out of the exclude function, and put it in a variable so we can easily union excludes.
      
      After this change, we can bump pom.xml version to 2.1.0-SNAPSHOT, without bumping the diff base version. Note that I also deleted all the exclude rules for version 1.x, to cut down the size of the file.
      
      ## How was this patch tested?
      N/A - this is a build infra change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14128 from rxin/SPARK-16476.
      52b5bb0b
  3. Jul 10, 2016
  4. Jul 09, 2016
    • gatorsmile's avatar
      [SPARK-16401][SQL] Data Source API: Enable Extending RelationProvider and... · 7374e518
      gatorsmile authored
      [SPARK-16401][SQL] Data Source API: Enable Extending RelationProvider and CreatableRelationProvider without Extending SchemaRelationProvider
      
      #### What changes were proposed in this pull request?
      When users try to implement a data source API with extending only `RelationProvider` and `CreatableRelationProvider`, they will hit an error when resolving the relation.
      ```Scala
      spark.read
        .format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema")
        .load()
        .write
        .format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema")
        .save()
      ```
      
      The error they hit is like
      ```
      org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema does not allow user-specified schemas.;
      org.apache.spark.sql.AnalysisException: org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema does not allow user-specified schemas.;
      	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:319)
      	at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:494)
      	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
      ```
      
      Actually, the bug fix is simple. [`DataSource.createRelation(sparkSession.sqlContext, mode, options, data)`](https://github.com/gatorsmile/spark/blob/dd644f8117e889cebd6caca58702a7c7e3d88bef/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L429) already returns a BaseRelation. We should not assign schema to `userSpecifiedSchema`. That schema assignment only makes sense for the data sources that extend `FileFormat`.
      
      #### How was this patch tested?
      Added a test case.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14075 from gatorsmile/dataSource.
      7374e518
  5. Jul 08, 2016
    • Michael Gummelt's avatar
      [SPARK-11857][MESOS] Deprecate fine grained · b1db26ac
      Michael Gummelt authored
      ## What changes were proposed in this pull request?
      
      Documentation changes to indicate that fine-grained mode is now deprecated.  No code changes were made, and all fine-grained mode instructions were left in place.  We can remove all of that once the deprecation cycle completes (Does Spark have a standard deprecation cycle?  One major version?)
      
      Blocked on https://github.com/apache/spark/pull/14059
      
      ## How was this patch tested?
      
      Viewed in Github
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #14078 from mgummelt/deprecate-fine-grained.
      b1db26ac
    • Eric Liang's avatar
      [SPARK-16432] Empty blocks fail to serialize due to assert in ChunkedByteBuffer · d8b06f18
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      It's possible to also change the callers to not pass in empty chunks, but it seems cleaner to just allow `ChunkedByteBuffer` to handle empty arrays. cc JoshRosen
      
      ## How was this patch tested?
      
      Unit tests, also checked that the original reproduction case in https://github.com/apache/spark/pull/11748#issuecomment-230760283 is resolved.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #14099 from ericl/spark-16432.
      d8b06f18
    • Sean Owen's avatar
      [SPARK-16376][WEBUI][SPARK WEB UI][APP-ID] HTTP ERROR 500 when using rest api... · 6cef0183
      Sean Owen authored
      [SPARK-16376][WEBUI][SPARK WEB UI][APP-ID] HTTP ERROR 500 when using rest api "/applications//jobs" if array "stageIds" is empty
      
      ## What changes were proposed in this pull request?
      
      Avoid the error from taking the max of an empty Seq when stageIds is empty. This fixes the immediate problem; I don't know if it results in meaningful output, but at least it is no longer an error.
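      
      For reference, a minimal sketch of the failure mode and a guarded alternative (not the exact code change):
      
      ```scala
      val stageIds: Seq[Int] = Seq.empty
      // stageIds.max                // throws UnsupportedOperationException: empty.max
      val lastStageId = if (stageIds.isEmpty) -1 else stageIds.max   // guard against empty input
      ```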
      
      ## How was this patch tested?
      
      Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14105 from srowen/SPARK-16376.
      6cef0183
    • cody koeninger's avatar
      [SPARK-13569][STREAMING][KAFKA] pattern based topic subscription · fd6e8f0e
      cody koeninger authored
      ## What changes were proposed in this pull request?
      Allow Kafka topic subscriptions based on a regex pattern.
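      
      A hedged usage sketch with the Kafka 0.10 direct stream, assuming a `StreamingContext` `ssc` and a `kafkaParams` map are already set up:
      
      ```scala
      import java.util.regex.Pattern
      import org.apache.spark.streaming.kafka010._
      
      // Subscribe to every topic whose name matches the pattern, instead of a fixed list.
      val stream = KafkaUtils.createDirectStream[String, String](
        ssc,
        LocationStrategies.PreferConsistent,
        ConsumerStrategies.SubscribePattern[String, String](Pattern.compile("events-.*"), kafkaParams))
      ```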
      
      ## How was this patch tested?
      Unit tests, manual tests
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #14026 from koeninger/SPARK-13569.
      fd6e8f0e
    • Dongjoon Hyun's avatar
      [SPARK-16387][SQL] JDBC Writer should use dialect to quote field names. · 3b22291b
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, the JDBC writer uses dialects to get datatypes, but doesn't use them to quote field names. This PR uses dialects to quote the field names, too.
      
      **Reported Error Scenario (MySQL case)**
      ```scala
      scala> val url="jdbc:mysql://localhost:3306/temp"
      scala> val prop = new java.util.Properties
      scala> prop.setProperty("user","root")
      scala> val df = spark.createDataset(Seq("a","b","c")).toDF("order")
      scala> df.write.mode("overwrite").jdbc(url, "temptable", prop)
      ...MySQLSyntaxErrorException: ... near 'order TEXT )
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests and manually do the above case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14107 from dongjoon-hyun/SPARK-16387.
      3b22291b
    • Yin Huai's avatar
      [SPARK-16453][BUILD] release-build.sh is missing hive-thriftserver for scala 2.10 · 60ba436b
      Yin Huai authored
      ## What changes were proposed in this pull request?
      This PR adds hive-thriftserver profile to scala 2.10 build created by release-build.sh.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #14108 from yhuai/SPARK-16453.
      60ba436b
    • wujian's avatar
      [SPARK-16281][SQL] Implement parse_url SQL function · f5fef691
      wujian authored
      ## What changes were proposed in this pull request?
      
      This PR adds the parse_url SQL function in order to remove the Hive fallback.
      
      A new implementation of #13999
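      
      A quick usage sketch, assuming the Hive-compatible semantics of parse_url (illustrative values):
      
      ```scala
      // parse_url(url, partToExtract[, key]) extracts a component such as HOST, PATH, or QUERY.
      spark.sql("SELECT parse_url('http://spark.apache.org/path?query=1', 'HOST')").show()            // spark.apache.org
      spark.sql("SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY', 'query')").show()  // 1
      ```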
      
      ## How was this patch tested?
      
      Pass the existing tests including new testcases.
      
      Author: wujian <jan.chou.wu@gmail.com>
      
      Closes #14008 from janplus/SPARK-16281.
      f5fef691
    • Dongjoon Hyun's avatar
      [SPARK-16429][SQL] Include `StringType` columns in `describe()` · 142df483
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, Spark's `describe` supports `StringType` columns when they are given explicitly. However, `describe()` without arguments returns a dataset covering only the numeric columns. This PR aims to include `StringType` columns in `describe()` without arguments.
      
      **Background**
      ```scala
      scala> spark.read.json("examples/src/main/resources/people.json").describe("age", "name").show()
      +-------+------------------+-------+
      |summary|               age|   name|
      +-------+------------------+-------+
      |  count|                 2|      3|
      |   mean|              24.5|   null|
      | stddev|7.7781745930520225|   null|
      |    min|                19|   Andy|
      |    max|                30|Michael|
      +-------+------------------+-------+
      ```
      
      **Before**
      ```scala
      scala> spark.read.json("examples/src/main/resources/people.json").describe().show()
      +-------+------------------+
      |summary|               age|
      +-------+------------------+
      |  count|                 2|
      |   mean|              24.5|
      | stddev|7.7781745930520225|
      |    min|                19|
      |    max|                30|
      +-------+------------------+
      ```
      
      **After**
      ```scala
      scala> spark.read.json("examples/src/main/resources/people.json").describe().show()
      +-------+------------------+-------+
      |summary|               age|   name|
      +-------+------------------+-------+
      |  count|                 2|      3|
      |   mean|              24.5|   null|
      | stddev|7.7781745930520225|   null|
      |    min|                19|   Andy|
      |    max|                30|Michael|
      +-------+------------------+-------+
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests with an updated testcase.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14095 from dongjoon-hyun/SPARK-16429.
      142df483
    • Ryan Blue's avatar
      [SPARK-16420] Ensure compression streams are closed. · 67e085ef
      Ryan Blue authored
      ## What changes were proposed in this pull request?
      
      This uses the try/finally pattern to ensure streams are closed after use. `UnsafeShuffleWriter` wasn't closing compression streams, causing them to leak resources until garbage collected. This was causing a problem with codecs that use off-heap memory.
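      
      A minimal sketch of the try/finally pattern described above (illustrative only, not the actual `UnsafeShuffleWriter` code):
      
      ```scala
      import java.io.OutputStream
      
      // Ensure the compression stream is closed even if writing fails, so resources
      // held by the codec (possibly off-heap) are released promptly.
      def writeCompressed(wrap: OutputStream => OutputStream, out: OutputStream)(write: OutputStream => Unit): Unit = {
        val compressed = wrap(out)
        try {
          write(compressed)
        } finally {
          compressed.close()
        }
      }
      ```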
      
      ## How was this patch tested?
      
      Current tests are sufficient. This should not change behavior.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #14093 from rdblue/SPARK-16420-unsafe-shuffle-writer-leak.
      67e085ef
    • Jurriaan Pruis's avatar
      [SPARK-13638][SQL] Add quoteAll option to CSV DataFrameWriter · 38cf8f2a
      Jurriaan Pruis authored
      ## What changes were proposed in this pull request?
      
      Adds a quoteAll option for writing CSV which will quote all fields.
      See https://issues.apache.org/jira/browse/SPARK-13638
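      
      A hedged usage sketch of the new writer option (hypothetical DataFrame `df` and output path):
      
      ```scala
      // Quote every field in the output, not just the ones that need quoting.
      df.write
        .option("quoteAll", "true")
        .csv("/tmp/quoted-output")
      ```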
      
      ## How was this patch tested?
      
      Added a test to verify that the output columns are quoted for all fields in the DataFrame.
      
      Author: Jurriaan Pruis <email@jurriaanpruis.nl>
      
      Closes #13374 from jurriaan/csv-quote-all.
      38cf8f2a
    • Xusen Yin's avatar
      [SPARK-16369][MLLIB] tallSkinnyQR of RowMatrix should aware of empty partition · 255d74fe
      Xusen Yin authored
      ## What changes were proposed in this pull request?
      
      tallSkinnyQR of RowMatrix should be aware of empty partitions, which could otherwise cause an exception from the Breeze QR decomposition.
      
      See the [archived dev mail](https://mail-archives.apache.org/mod_mbox/spark-dev/201510.mbox/%3CCAF7ADNrycvPL3qX-VZJhq4OYmiUUhoscut_tkOm63Cm18iK1tQmail.gmail.com%3E) for more details.
      
      ## How was this patch tested?
      
      Scala unit test.
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #14049 from yinxusen/SPARK-16369.
      255d74fe
    • Dongjoon Hyun's avatar
      [SPARK-16285][SQL] Implement sentences SQL functions · a54438cb
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR implements the `sentences` SQL function.
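      
      A quick usage sketch, assuming the Hive-compatible behavior of `sentences` (it splits text into an array of sentences, each an array of words):
      
      ```scala
      spark.sql("SELECT sentences('Hi there! How are you?')").show(false)
      // [[Hi, there], [How, are, you]]
      ```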
      
      ## How was this patch tested?
      
      Pass the Jenkins tests with a new testcase.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14004 from dongjoon-hyun/SPARK_16285.
      a54438cb
    • petermaxlee's avatar
      [SPARK-16436][SQL] checkEvaluation should support NaN · 8228b063
      petermaxlee authored
      ## What changes were proposed in this pull request?
      This small patch modifies `ExpressionEvalHelper.checkEvaluation` to support comparing NaN values in floating point comparisons.
      
      ## How was this patch tested?
      This is a test harness change.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #14103 from petermaxlee/SPARK-16436.
      8228b063
    • Dongjoon Hyun's avatar
      [SPARK-16052][SQL] Improve `CollapseRepartition` optimizer for Repartition/RepartitionBy · dff73bfa
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR improves `CollapseRepartition` to optimize the adjacent combinations of **Repartition** and **RepartitionBy**. Also, this PR adds a testsuite for this optimizer.
      
      **Target Scenario**
      ```scala
      scala> val dsView1 = spark.range(8).repartition(8, $"id")
      scala> dsView1.createOrReplaceTempView("dsView1")
      scala> sql("select id from dsView1 distribute by id").explain(true)
      ```
      
      **Before**
      ```scala
      scala> sql("select id from dsView1 distribute by id").explain(true)
      == Parsed Logical Plan ==
      'RepartitionByExpression ['id]
      +- 'Project ['id]
         +- 'UnresolvedRelation `dsView1`
      
      == Analyzed Logical Plan ==
      id: bigint
      RepartitionByExpression [id#0L]
      +- Project [id#0L]
         +- SubqueryAlias dsview1
            +- RepartitionByExpression [id#0L], 8
               +- Range (0, 8, splits=8)
      
      == Optimized Logical Plan ==
      RepartitionByExpression [id#0L]
      +- RepartitionByExpression [id#0L], 8
         +- Range (0, 8, splits=8)
      
      == Physical Plan ==
      Exchange hashpartitioning(id#0L, 200)
      +- Exchange hashpartitioning(id#0L, 8)
         +- *Range (0, 8, splits=8)
      ```
      
      **After**
      ```scala
      scala> sql("select id from dsView1 distribute by id").explain(true)
      == Parsed Logical Plan ==
      'RepartitionByExpression ['id]
      +- 'Project ['id]
         +- 'UnresolvedRelation `dsView1`
      
      == Analyzed Logical Plan ==
      id: bigint
      RepartitionByExpression [id#0L]
      +- Project [id#0L]
         +- SubqueryAlias dsview1
            +- RepartitionByExpression [id#0L], 8
               +- Range (0, 8, splits=8)
      
      == Optimized Logical Plan ==
      RepartitionByExpression [id#0L]
      +- Range (0, 8, splits=8)
      
      == Physical Plan ==
      Exchange hashpartitioning(id#0L, 200)
      +- *Range (0, 8, splits=8)
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including a new testsuite).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13765 from dongjoon-hyun/SPARK-16052.
      dff73bfa
    • Tathagata Das's avatar
      [SPARK-16430][SQL][STREAMING] Add option maxFilesPerTrigger · 5bce4580
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      An option that limits the file stream source to reading a limited number of files per trigger enables rate limiting. It has the additional convenience that a static set of files can be used like a stream for testing, as this allows those files to be considered one at a time.
      
      This PR adds option `maxFilesPerTrigger`.
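      
      A hedged usage sketch (hypothetical schema and input path):
      
      ```scala
      // Consider at most one new file per trigger; useful for rate limiting and for
      // replaying a static directory of files as if it were a stream.
      val fileStream = spark.readStream
        .format("json")
        .schema(inputSchema)               // hypothetical; file sources need an explicit schema
        .option("maxFilesPerTrigger", 1)
        .load("/data/incoming")
      ```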
      
      ## How was this patch tested?
      
      New unit test
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #14094 from tdas/SPARK-16430.
      5bce4580
  6. Jul 07, 2016
    • Dongjoon Hyun's avatar
      [SPARK-16425][R] `describe()` should not fail with non-numeric columns · 6aa7d09f
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR prevents ERRORs when `summary(df)` is called for a `SparkDataFrame` with non-numeric columns. This failure happens only in `SparkR`.
      
      **Before**
      ```r
      > df <- createDataFrame(faithful)
      > df <- withColumn(df, "boolean", df$waiting==79)
      > summary(df)
      16/07/07 14:15:16 ERROR RBackendHandler: describe on 34 failed
      Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
        org.apache.spark.sql.AnalysisException: cannot resolve 'avg(`boolean`)' due to data type mismatch: function average requires numeric types, not BooleanType;
      ```
      
      **After**
      ```r
      > df <- createDataFrame(faithful)
      > df <- withColumn(df, "boolean", df$waiting==79)
      > summary(df)
      SparkDataFrame[summary:string, eruptions:string, waiting:string]
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests with an updated testcase.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14096 from dongjoon-hyun/SPARK-16425.
      6aa7d09f
    • Felix Cheung's avatar
      [SPARK-16310][SPARKR] R na.string-like default for csv source · f4767bcc
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Apply "NA" as the default null string for R, like R's read.csv na.strings parameter.
      
      https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
      na.strings = "NA"
      
      A user passing a CSV file with NA values should get the same behavior with SparkR read.df(... source = "csv").
      
      (couldn't open JIRA, will do that later)
      
      ## How was this patch tested?
      
      unit tests
      
      shivaram
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #13984 from felixcheung/rcsvnastring.
      f4767bcc
    • Daoyuan Wang's avatar
      [SPARK-16415][SQL] fix catalog string error · 28710b42
      Daoyuan Wang authored
      ## What changes were proposed in this pull request?
      
      In #13537 we truncate `simpleString` if it is a long `StructType`. But sometimes we need `catalogString` to reconstruct `TypeInfo`, for example in the description of [SPARK-16415](https://issues.apache.org/jira/browse/SPARK-16415). So we need to keep the implementation of `catalogString` unaffected by the truncation.
      
      ## How was this patch tested?
      
      added a test case.
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #14089 from adrian-wang/catalogstring.
      28710b42
    • Liwei Lin's avatar
      [SPARK-16350][SQL] Fix support for incremental planning in writeStream.foreach() · 0f7175de
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      There are cases where `complete` output mode does not output updated aggregated value; for details please refer to [SPARK-16350](https://issues.apache.org/jira/browse/SPARK-16350).
      
      The cause is that, as we do `data.as[T].foreachPartition { iter => ... }` in `ForeachSink.addBatch()`, `foreachPartition()` does not support incremental planning for now.
      
      This patch makes `foreachPartition()` support incremental planning in `ForeachSink`, by making a special version of `Dataset` whose `rdd()` method supports incremental planning.
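      
      For context, a hedged sketch of the `foreach` sink this affects (a hypothetical writer, not code from the PR):
      
      ```scala
      import org.apache.spark.sql.{ForeachWriter, Row}
      
      // Each micro-batch is pushed row by row to a user-supplied writer;
      // in `complete` mode the full updated aggregate should be delivered every batch.
      val query = aggregatedCounts.writeStream
        .outputMode("complete")
        .foreach(new ForeachWriter[Row] {
          def open(partitionId: Long, version: Long): Boolean = true
          def process(row: Row): Unit = println(row)
          def close(errorOrNull: Throwable): Unit = ()
        })
        .start()
      ```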
      
      ## How was this patch tested?
      
      Added a unit test which failed before the change
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #14030 from lw-lin/fix-foreach-complete.
      0f7175de