  1. Jan 08, 2017
    • Dilip Biswal's avatar
      [SPARK-19093][SQL] Cached tables are not used in SubqueryExpression · 4351e622
      Dilip Biswal authored
      ## What changes were proposed in this pull request?
      Consider the plans inside subquery expressions while looking up cache manager to make
      use of cached data. Currently CacheManager.useCachedData does not consider the
      subquery expressions in the plan.
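
      The idea can be sketched outside Spark as a tree rewrite that also descends into plans nested in subquery expressions. This is a minimal illustration, not the actual `CacheManager` code: the `Plan` class, its `subqueries` field and the `use_cached_data` function below are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Plan:
    """Hypothetical, minimal stand-in for a logical plan node."""
    name: str
    children: Tuple["Plan", ...] = ()
    # Plans nested inside subquery expressions (e.g. EXISTS, IN).
    subqueries: Tuple["Plan", ...] = ()


def use_cached_data(plan: Plan, cache: Dict[Plan, Plan]) -> Plan:
    """Replace cached fragments, also descending into subquery plans."""
    if plan in cache:
        return cache[plan]
    return Plan(
        plan.name,
        tuple(use_cached_data(c, cache) for c in plan.children),
        tuple(use_cached_data(s, cache) for s in plan.subqueries),
    )


rows = Plan("Relation rows")
cached_rows = Plan("InMemoryRelation rows")
# Rough shape of: select * from rows where not exists (select * from rows)
query = Plan(
    "Filter NOT EXISTS",
    children=(rows,),
    subqueries=(Plan("Project", (rows,)),),
)
result = use_cached_data(query, {rows: cached_rows})
print(result.subqueries[0].children[0].name)
```

      Without the recursion over `subqueries`, only the outer `rows` relation would be replaced, which corresponds to the "Before the fix" plan.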
      
      SQL
      ```
      select * from rows where not exists (select * from rows)
      ```
      Before the fix
      ```
      == Optimized Logical Plan ==
      Join LeftAnti
      :- InMemoryRelation [_1#3775, _2#3776], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
      :     +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
      +- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
         +- Relation[_1#3775,_2#3776] parquet
      ```
      
      After
      ```
      == Optimized Logical Plan ==
      Join LeftAnti
      :- InMemoryRelation [_1#256, _2#257], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
      :     +- *FileScan parquet [_1#256,_2#257] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/rows], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
      +- Project [_1#256 AS _1#256#298, _2#257 AS _2#257#299]
         +- InMemoryRelation [_1#256, _2#257], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
               +- *FileScan parquet [_1#256,_2#257] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/rows], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
      ```
      
      Query2
      ```
       SELECT * FROM t1
       WHERE
       c1 IN (SELECT c1 FROM t2 WHERE c1 IN (SELECT c1 FROM t3 WHERE c1 = 1))
      ```
      Before
      ```
      == Analyzed Logical Plan ==
      c1: int
      Project [c1#3]
      +- Filter predicate-subquery#47 [(c1#3 = c1#10)]
         :  +- Project [c1#10]
         :     +- Filter predicate-subquery#46 [(c1#10 = c1#17)]
         :        :  +- Project [c1#17]
         :        :     +- Filter (c1#17 = 1)
         :        :        +- SubqueryAlias t3, `t3`
         :        :           +- Project [value#15 AS c1#17]
         :        :              +- LocalRelation [value#15]
         :        +- SubqueryAlias t2, `t2`
         :           +- Project [value#8 AS c1#10]
         :              +- LocalRelation [value#8]
         +- SubqueryAlias t1, `t1`
            +- Project [value#1 AS c1#3]
               +- LocalRelation [value#1]
      
      == Optimized Logical Plan ==
      Join LeftSemi, (c1#3 = c1#10)
      :- InMemoryRelation [c1#3], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), t1
      :     +- LocalTableScan [c1#3]
      +- Project [value#8 AS c1#10]
         +- Join LeftSemi, (value#8 = c1#17)
            :- LocalRelation [value#8]
            +- Project [value#15 AS c1#17]
               +- Filter (value#15 = 1)
                  +- LocalRelation [value#15]
      
      ```
      After
      ```
      == Analyzed Logical Plan ==
      c1: int
      Project [c1#3]
      +- Filter predicate-subquery#47 [(c1#3 = c1#10)]
         :  +- Project [c1#10]
         :     +- Filter predicate-subquery#46 [(c1#10 = c1#17)]
         :        :  +- Project [c1#17]
         :        :     +- Filter (c1#17 = 1)
         :        :        +- SubqueryAlias t3, `t3`
         :        :           +- Project [value#15 AS c1#17]
         :        :              +- LocalRelation [value#15]
         :        +- SubqueryAlias t2, `t2`
         :           +- Project [value#8 AS c1#10]
         :              +- LocalRelation [value#8]
         +- SubqueryAlias t1, `t1`
            +- Project [value#1 AS c1#3]
               +- LocalRelation [value#1]
      
      == Optimized Logical Plan ==
      Join LeftSemi, (c1#3 = c1#10)
      :- InMemoryRelation [c1#3], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), t1
      :     +- LocalTableScan [c1#3]
      +- Join LeftSemi, (c1#10 = c1#17)
         :- InMemoryRelation [c1#10], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), t2
         :     +- LocalTableScan [c1#10]
         +- Filter (c1#17 = 1)
            +- InMemoryRelation [c1#17], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), t1
                  +- LocalTableScan [c1#3]
      ```
      ## How was this patch tested?
      Added new tests in CachedTableSuite.
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #16493 from dilipbiswal/SPARK-19093.
      4351e622
    • zuotingbing's avatar
      [SPARK-19026] SPARK_LOCAL_DIRS(multiple directories on different disks) cannot be deleted · cd1d00ad
      zuotingbing authored
      JIRA Issue: https://issues.apache.org/jira/browse/SPARK-19026
      
      SPARK_LOCAL_DIRS (Standalone) can be a comma-separated list of multiple directories on different disks, e.g. SPARK_LOCAL_DIRS=/dir1,/dir2,/dir3. If an IOException is thrown while creating the sub directory on dir3, the sub directories that were already created successfully on dir1 and dir2 can no longer be deleted when the application finishes.
      So we should catch the IOException at Utils.createDirectory; otherwise the variable "appDirectories(appId)", which the function maybeCleanupApplication relies on, will not be set, and dir1 and dir2 will not be cleaned up.
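
      A rough sketch of the pattern, assuming the fix amounts to catching the error per directory so that the directories created so far are still recorded for cleanup. The function name `create_app_dirs` and the paths are hypothetical; the real change is in Spark's Scala code.

```python
import os
import tempfile
from typing import List


def create_app_dirs(local_dirs: List[str], app_id: str) -> List[str]:
    """Create one sub-directory per local dir, skipping the ones that fail."""
    created = []
    for root in local_dirs:
        path = os.path.join(root, app_id)
        try:
            os.makedirs(path)
            created.append(path)
        except OSError as err:
            # Log and continue so that earlier successes are still tracked.
            print(f"Failed to create {path}: {err}")
    return created


base = tempfile.mkdtemp()
dir1 = os.path.join(base, "dir1")
dir2 = os.path.join(base, "dir2")
os.mkdir(dir1)
os.mkdir(dir2)
# dir3 is a regular file, so creating a sub-directory under it fails.
dir3 = os.path.join(base, "dir3")
open(dir3, "w").close()

app_dirs = create_app_dirs([dir1, dir2, dir3], "app-1")
print(app_dirs)
```

      Because the failure on dir3 no longer aborts the loop, the sub directories under dir1 and dir2 remain known to the caller and can be removed later.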
      
      Author: zuotingbing <zuo.tingbing9@zte.com.cn>
      
      Closes #16439 from zuotingbing/master.
      cd1d00ad
    • Yanbo Liang's avatar
      [SPARK-18862][SPARKR][ML] Split SparkR mllib.R into multiple files · 6b6b555a
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      SparkR ```mllib.R``` is getting bigger as we add more ML wrappers. I'd like to split it into multiple files to make it easier to maintain:
      * mllib_classification.R
      * mllib_clustering.R
      * mllib_recommendation.R
      * mllib_regression.R
      * mllib_stat.R
      * mllib_tree.R
      * mllib_utils.R
      
      Note: Only reorg, no actual code change.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16312 from yanboliang/spark-18862.
      6b6b555a
  2. Jan 07, 2017
    • Dongjoon Hyun's avatar
      [SPARK-18941][SQL][DOC] Add a new behavior document on `CREATE/DROP TABLE` with `LOCATION` · 923e5948
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR adds a new behavior change description on `CREATE TABLE ... LOCATION` at `sql-programming-guide.md` clearly under `Upgrading From Spark SQL 1.6 to 2.0`. This change is introduced at Apache Spark 2.0.0 as [SPARK-15276](https://issues.apache.org/jira/browse/SPARK-15276).
      
      ## How was this patch tested?
      
      ```
      SKIP_API=1 jekyll build
      ```
      
      **Newly Added Description**
      <img width="913" alt="new" src="https://cloud.githubusercontent.com/assets/9700541/21743606/7efe2b12-d4ba-11e6-8a0d-551222718ea2.png">
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16400 from dongjoon-hyun/SPARK-18941.
      923e5948
    • Sean Owen's avatar
      [SPARK-19106][DOCS] Styling for the configuration docs is broken · 54138f6e
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      configuration.html section headings were not specified correctly in markdown, so they weren't being recognized and rendered correctly. Removed extra p tags and pulled level 4 titles up to level 3, since level 3 had been skipped. This improves the TOC.
      
      ## How was this patch tested?
      
      Doc build, manual check.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16490 from srowen/SPARK-19106.
      54138f6e
    • wm624@hotmail.com's avatar
      [SPARK-19110][ML][MLLIB] DistributedLDAModel returns different logPrior for... · 036b5034
      wm624@hotmail.com authored
      [SPARK-19110][ML][MLLIB] DistributedLDAModel returns different logPrior for original and loaded model
      
      ## What changes were proposed in this pull request?
      
      While adding DistributedLDAModel training summary for SparkR, I found that the logPrior for original and loaded model is different.
      For example, in the test("read/write DistributedLDAModel"), I add the test:
      val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
      val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
      assert(logPrior === logPrior2)
      The test fails:
      -4.394180878889078 did not equal -4.294290536919573
      
      The reason is that `graph.vertices.aggregate(0.0)(seqOp, _ + _)` only returns the value of a single vertex instead of the aggregation of all vertices. Therefore, when the loaded model does the aggregation in a different order, it returns different `logPrior`.
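
      The order dependence can be reproduced with a plain fold. The sketch below is not the LDA code; it assumes, purely for illustration, a `seqOp` that ignores its accumulator, which is one way an aggregation can end up returning a single element's value.

```python
from functools import reduce


def aggregate(values, zero, seq_op):
    """Sequential fold, standing in for an aggregate over one partition."""
    return reduce(seq_op, values, zero)


def buggy_seq_op(acc, value):
    # Assumed bug shape: the accumulator is dropped, so only the last
    # visited value survives.
    return value


def fixed_seq_op(acc, value):
    return acc + value


values = [1.0, 2.0, 3.0]
buggy_forward = aggregate(values, 0.0, buggy_seq_op)
buggy_reversed = aggregate(list(reversed(values)), 0.0, buggy_seq_op)
fixed_forward = aggregate(values, 0.0, fixed_seq_op)
fixed_reversed = aggregate(list(reversed(values)), 0.0, fixed_seq_op)
print(buggy_forward, buggy_reversed, fixed_forward, fixed_reversed)
```

      With the buggy operator the result depends on visiting order, while the fixed operator gives the same sum either way.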
      
      Please refer to #16464 for details.
      ## How was this patch tested?
      Add a new unit test for testing logPrior.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16491 from wangmiao1981/ldabug.
      036b5034
    • Wenchen Fan's avatar
      [SPARK-19085][SQL] cleanup OutputWriterFactory and OutputWriter · b3d39620
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      `OutputWriterFactory`/`OutputWriter` are internal interfaces and we can remove some unnecessary APIs:
      1. `OutputWriterFactory.newWriter(path: String)`: no one calls it and no one implements it.
      2. `OutputWriter.write(row: Row)`: during execution we only call `writeInternal`, which is weird as `OutputWriter` is already an internal interface. We should rename `writeInternal` to `write` and remove `def write(row: Row)` and its related converter code. All implementations should just implement `def write(row: InternalRow)`.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16479 from cloud-fan/hive-writer.
      b3d39620
    • Yanbo Liang's avatar
      [MINOR] Bump R version to 2.2.0. · cdda3372
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      #16126 bumps master branch version to 2.2.0-SNAPSHOT, but it seems R version was omitted.
      
      ## How was this patch tested?
      N/A
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16488 from yanboliang/r-version.
      cdda3372
    • hyukjinkwon's avatar
      [SPARK-13748][PYSPARK][DOC] Add the description for explicitly setting None for... · 68ea290b
      hyukjinkwon authored
      [SPARK-13748][PYSPARK][DOC] Add the description for explicitly setting None for a named argument for a Row
      
      ## What changes were proposed in this pull request?
      
      It seems that a dict may omit a key and value to indicate that the value is `None` or missing, as below:
      
      ``` python
      spark.createDataFrame([{"x": 1}, {"y": 2}]).show()
      ```
      
      ```
      +----+----+
      |   x|   y|
      +----+----+
      |   1|null|
      |null|   2|
      +----+----+
      ```
      
      However, this does not seem to be allowed for `Row`, as below:
      
      ``` python
      spark.createDataFrame([Row(x=1), Row(y=2)]).show()
      ```
      
      ```
      16/06/19 16:25:56 ERROR Executor: Exception in task 6.0 in stage 66.0 (TID 316)
      java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 2 fields are required while 1 values are provided.
          at org.apache.spark.sql.execution.python.EvaluatePython$.fromJava(EvaluatePython.scala:147)
          at org.apache.spark.sql.SparkSession$$anonfun$7.apply(SparkSession.scala:656)
          at org.apache.spark.sql.SparkSession$$anonfun$7.apply(SparkSession.scala:656)
          at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
          at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
          at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
          at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
          at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
      ```
      
      The behaviour seems right, but it might confuse users, as the report in this JIRA shows.
      
      This PR adds the explanation for `Row` class.
      ## How was this patch tested?
      
      N/A
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #13771 from HyukjinKwon/SPARK-13748.
      68ea290b
  3. Jan 06, 2017
    • sueann's avatar
      [SPARK-18194][ML] Log instrumentation in OneVsRest, CrossValidator, TrainValidationSplit · d60f6f62
      sueann authored
      ## What changes were proposed in this pull request?
      
      Added instrumentation logging for OneVsRest classifier, CrossValidator, TrainValidationSplit fit() functions.
      
      ## How was this patch tested?
      
      Ran unit tests and checked the log file (see output in comments).
      
      Author: sueann <sueann@databricks.com>
      
      Closes #16480 from sueann/SPARK-18194.
      d60f6f62
    • Tathagata Das's avatar
      [SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for... · b59cddab
      Tathagata Das authored
      [SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options
      
      ## What changes were proposed in this pull request?
      
      Updates
      - Updated Late Data Handling section by adding a figure for Update Mode. It's more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
      - Updated Output Modes section with Update mode
      - Added options for all the sources and sinks
      
      ---------------------------
      ---------------------------
      
      ![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)
      
      ---------------------------
      ---------------------------
      <img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
      <img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">
      
      ---------------------------
      ---------------------------
      ![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
      ![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
      ![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16468 from tdas/SPARK-19074.
      b59cddab
    • zuotingbing's avatar
      [SPARK-19083] sbin/start-history-server.sh script use of $@ without quotes · a9a13737
      zuotingbing authored
      JIRA Issue: https://issues.apache.org/jira/browse/SPARK-19083#
      
      The sbin/start-history-server.sh script uses $@ without quotes, which affects the number of args passed to HistoryServerArguments::parse(args: List[String]).
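
      The effect of the missing quotes can be approximated in a few lines. This is only a simulation of shell word splitting on whitespace (it ignores globbing and a custom IFS), and the argument values are made up for the example.

```python
from typing import List


def expand_unquoted(args: List[str]) -> List[str]:
    """Approximate unquoted $@: each argument is split again on whitespace."""
    return " ".join(args).split()


def expand_quoted(args: List[str]) -> List[str]:
    """Approximate quoted "$@": each argument is passed through unchanged."""
    return list(args)


# Hypothetical arguments; the second one contains a space.
script_args = ["--dir", "/tmp/spark events"]
unquoted = expand_unquoted(script_args)
quoted = expand_quoted(script_args)
print(len(unquoted), len(quoted))
```

      The unquoted form hands the parser three arguments instead of two, which is the change in argument count described above.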
      
      Author: zuotingbing <zuo.tingbing9@zte.com.cn>
      
      Closes #16484 from zuotingbing/sh.
      a9a13737
    • Kay Ousterhout's avatar
      [SPARK-17931] Eliminate unnecessary task (de) serialization · 2e139eed
      Kay Ousterhout authored
      In the existing code, there are three layers of serialization
          involved in sending a task from the scheduler to an executor:
              - A Task object is serialized
              - The Task object is copied to a byte buffer that also
                contains serialized information about any additional JARs,
                files, and Properties needed for the task to execute. This
                byte buffer is stored as the member variable serializedTask
                in the TaskDescription class.
              - The TaskDescription is serialized (in addition to the serialized
                task + JARs, the TaskDescription class contains the task ID and
                other metadata) and sent in a LaunchTask message.
      
      While it *is* necessary to have two layers of serialization, so that
      the JAR, file, and Property info can be deserialized prior to
      deserializing the Task object, the third layer of serialization is
      unnecessary.  This commit eliminates a layer of serialization by moving
      the JARs, files, and Properties into the TaskDescription class.
      
      This commit also serializes the Properties manually (by traversing the map),
      as is done with the JARs and files, which reduces the final serialized size.
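
      As an illustration of serializing a string map "manually (by traversing the map)", the sketch below writes an entry count followed by length-prefixed UTF-8 keys and values. The wire format and the function names are assumptions made for this example and are not the format used by `TaskDescription`.

```python
import io
import struct
from typing import Dict


def write_props(props: Dict[str, str]) -> bytes:
    """Encode a string map as: count, then length-prefixed key/value pairs."""
    out = io.BytesIO()
    out.write(struct.pack(">i", len(props)))
    for key, value in props.items():
        for text in (key, value):
            data = text.encode("utf-8")
            out.write(struct.pack(">i", len(data)))
            out.write(data)
    return out.getvalue()


def read_props(buf: bytes) -> Dict[str, str]:
    """Decode the format produced by write_props."""
    inp = io.BytesIO(buf)
    (count,) = struct.unpack(">i", inp.read(4))
    props = {}
    for _ in range(count):
        pair = []
        for _ in range(2):
            (length,) = struct.unpack(">i", inp.read(4))
            pair.append(inp.read(length).decode("utf-8"))
        props[pair[0]] = pair[1]
    return props


original = {"spark.app.name": "demo"}
encoded = write_props(original)
decoded = read_props(encoded)
print(len(encoded), decoded)
```

      Only the raw strings and their lengths are written, with no per-object class metadata, which is the kind of saving the message attributes to manual serialization.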
      
      Unit tests
      
      This is a simpler alternative to the approach proposed in #15505.
      
      shivaram and I did some benchmarking of this and #15505 on 20 m2.4xlarge EC2 machines (160 cores). We ran ~30 trials of code [1] (a very simple job with 10K tasks per stage) and measured the average time per stage:
      
      Before this change: 2490ms
      With this change: 2345 ms (so ~6% improvement over the baseline)
      With witgo's approach in #15505: 2046 ms (~18% improvement over baseline)
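
      The quoted percentages follow directly from the per-stage averages. A small check, using only the numbers stated in this message (the helper name is arbitrary):

```python
def improvement_pct(baseline_ms: float, new_ms: float) -> float:
    """Relative reduction in time per stage, as a percentage of the baseline."""
    return (baseline_ms - new_ms) / baseline_ms * 100


baseline = 2490
this_change = improvement_pct(baseline, 2345)
pr_15505 = improvement_pct(baseline, 2046)
print(f"this change: {this_change:.1f}%, #15505: {pr_15505:.1f}%")
```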
      
      The reason that #15505 has a more significant improvement is that it also moves the serialization from the TaskSchedulerImpl thread to the CoarseGrainedSchedulerBackend thread. I added that functionality on top of this change, and got almost the same improvement [1] as #15505 (average of 2103ms). I think we should decouple these two changes, both so we have some record of the improvement from each individual improvement, and because this change is more about simplifying the code base (the improvement is negligible) while the other is about performance improvement.  The plan, currently, is to merge this PR and then merge the remaining part of #15505 that moves serialization.
      
      [1] The reason the improvement wasn't quite as good as with #15505 when we ran the benchmarks is almost certainly because, at the point when we ran the benchmarks, I hadn't updated the code to manually serialize the Properties (instead the code was using Java's default serialization for the Properties object, whereas #15505 manually serialized the Properties).  This PR has since been updated to manually serialize the Properties, just like the other maps.
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #16053 from kayousterhout/SPARK-17931.
      2e139eed
    • jerryshao's avatar
      [SPARK-19033][CORE] Add admin acls for history server · 4a4c3dc9
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      The current HistoryServer's ACLs are derived from the application event log, which means newly changed ACLs cannot be applied to old data. This becomes a problem when a newly added admin cannot access the old application history UI; only new applications are affected.

      So this PR proposes adding admin ACLs for the history server: any configured user/group has view access to all applications, while the view ACLs derived from application run-time still take effect.
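
      The resulting access rule can be sketched as a simple union of two checks. The function and parameter names below are hypothetical, and groups are left out for brevity.

```python
from typing import Set


def can_view(user: str, history_admin_acls: Set[str], app_view_acls: Set[str]) -> bool:
    """A user may view an app if they are a history-server admin or are
    listed in the view ACLs recorded in that app's event log."""
    return user in history_admin_acls or user in app_view_acls


# ACLs recorded when an old application ran.
old_app_view_acls = {"alice"}
# Admin configured on the history server afterwards.
admins = {"carol"}

alice_ok = can_view("alice", admins, old_app_view_acls)
carol_ok = can_view("carol", admins, old_app_view_acls)
bob_ok = can_view("bob", admins, old_app_view_acls)
print(alice_ok, carol_ok, bob_ok)
```

      Here `carol` gains access to the old application without its event log being rewritten, while `alice` keeps the access granted at run-time.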
      
      ## How was this patch tested?
      
      Unit test added.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #16470 from jerryshao/SPARK-19033.
      4a4c3dc9
    • Michal Senkyr's avatar
      [SPARK-16792][SQL] Dataset containing a Case Class with a List type causes a... · 903bb8e8
      Michal Senkyr authored
      [SPARK-16792][SQL] Dataset containing a Case Class with a List type causes a CompileException (converting sequence to list)
      
      ## What changes were proposed in this pull request?
      
      Added a `to` call at the end of the code generated by `ScalaReflection.deserializerFor` if the requested type is not a supertype of `WrappedArray[_]` that uses `CanBuildFrom[_, _, _]` to convert result into an arbitrary subtype of `Seq[_]`.
      
      Care was taken to preserve the original deserialization where it is possible to avoid the overhead of conversion in cases where it is not needed
      
      `ScalaReflection.serializerFor` could already be used to serialize any `Seq[_]` so it was not altered
      
      `SQLImplicits` had to be altered and new implicit encoders added to permit serialization of other sequence types
      
      Also fixes [SPARK-16815] Dataset[List[T]] leads to ArrayStoreException
      
      ## How was this patch tested?
      ```bash
      ./build/mvn -DskipTests clean package && ./dev/run-tests
      ```
      
      Also manual execution of the following sets of commands in the Spark shell:
      ```scala
      case class TestCC(key: Int, letters: List[String])
      
      val ds1 = sc.makeRDD(Seq(
      (List("D")),
      (List("S","H")),
      (List("F","H")),
      (List("D","L","L"))
      )).map(x=>(x.length,x)).toDF("key","letters").as[TestCC]
      
      val test1=ds1.map{_.key}
      test1.show
      ```
      
      ```scala
      case class X(l: List[String])
      spark.createDataset(Seq(List("A"))).map(X).show
      ```
      
      ```scala
      spark.sqlContext.createDataset(sc.parallelize(List(1) :: Nil)).collect
      ```
      
      After adding arbitrary sequence support also tested with the following commands:
      
      ```scala
      case class QueueClass(q: scala.collection.immutable.Queue[Int])
      
      spark.createDataset(Seq(List(1,2,3))).map(x => QueueClass(scala.collection.immutable.Queue(x: _*))).map(_.q.dequeue).collect
      ```
      
      Author: Michal Senkyr <mike.senkyr@gmail.com>
      
      Closes #16240 from michalsenkyr/sql-caseclass-list-fix.
      903bb8e8
  4. Jan 05, 2017
    • Kevin Yu's avatar
      [SPARK-18871][SQL] New test cases for IN/NOT IN subquery · bcc510b0
      Kevin Yu authored
      ## What changes were proposed in this pull request?
      This PR extends the existing IN/NOT IN subquery test coverage by adding more test cases to the IN subquery test suite.

      Based on the discussion, we will create a `subquery/in-subquery` sub structure under the `sql/core/src/test/resources/sql-tests/inputs` directory.
      
      This is the high level grouping for IN subquery:
      
      `subquery/in-subquery/`
      `subquery/in-subquery/simple-in.sql`
      `subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)`
      `subquery/in-subquery/not-in-group-by.sql`
      `subquery/in-subquery/in-order-by.sql`
      `subquery/in-subquery/in-limit.sql`
      `subquery/in-subquery/in-having.sql`
      `subquery/in-subquery/in-joins.sql`
      `subquery/in-subquery/not-in-joins.sql`
      `subquery/in-subquery/in-set-operations.sql`
      `subquery/in-subquery/in-with-cte.sql`
      `subquery/in-subquery/not-in-with-cte.sql`
      `subquery/in-subquery/in-multiple-columns.sql`
      
      We will deliver it through multiple PRs. This is the first PR for the IN subquery; it contains:
      
      `subquery/in-subquery/simple-in.sql`
      `subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)`
      
      These are the results from running on DB2.
      [Modified test file of in-group-by.sql used to run on DB2](https://github.com/apache/spark/files/683367/in-group-by.sql.db2.txt)
      [Output of the run result on DB2](https://github.com/apache/spark/files/683362/in-group-by.sql.db2.out.txt)
      [Modified test file of simple-in.sql used to run on DB2](https://github.com/apache/spark/files/683378/simple-in.sql.db2.txt)
      [Output of the run result on DB2](https://github.com/apache/spark/files/683379/simple-in.sql.db2.out.txt)
      
      ## How was this patch tested?
      
      This patch is adding tests.
      
      Author: Kevin Yu <qyu@us.ibm.com>
      
      Closes #16337 from kevinyu98/spark-18871.
      bcc510b0
    • Yanbo Liang's avatar
      [MINOR] Correct LogisticRegression test case for probability2prediction. · dfc4c935
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Set correct column names for ```force to use probability2prediction``` in ```LogisticRegressionSuite```.
      
      ## How was this patch tested?
      Change unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16477 from yanboliang/lor-pred.
      dfc4c935
    • Wenchen Fan's avatar
      [SPARK-18885][SQL] unify CREATE TABLE syntax for data source and hive serde tables · cca945b6
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Today we have different syntax for creating data source and hive serde tables. We should unify them to avoid confusing users and to step forward to making hive a data source.

      Please read https://issues.apache.org/jira/secure/attachment/12843835/CREATE-TABLE.pdf for details.
      
      TODO(for follow-up PRs):
      1. TBLPROPERTIES is not added to the new syntax, we should decide if we wanna add it later.
      2. `SHOW CREATE TABLE` should be updated to use the new syntax.
      3. we should decide if we wanna change the behavior of `SET LOCATION`.
      
      ## How was this patch tested?
      
      new tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16296 from cloud-fan/create-table.
      cca945b6
    • Rui Li's avatar
      [SPARK-14958][CORE] Failed task not handled when there's error deserializing failure reason · f5d18af6
      Rui Li authored
      ## What changes were proposed in this pull request?
      
      TaskResultGetter tries to deserialize the TaskEndReason before handling the failed task. If an error is thrown during deserialization, the failed task won't be handled, which leaves the job hanging.
      The PR proposes to handle the failed task in a finally block.
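
      The control flow can be sketched as follows. This is a simplified stand-in for the scheduler code: the function names, the `"UnknownReason"` default and the callbacks are assumptions for illustration.

```python
from typing import Callable, List


def handle_failure(
    serialized_reason: bytes,
    deserialize: Callable[[bytes], str],
    handle_failed_task: Callable[[str], None],
) -> None:
    """Always report the failed task, even if the reason cannot be decoded."""
    reason = "UnknownReason"  # fallback used when deserialization fails
    try:
        reason = deserialize(serialized_reason)
    finally:
        # Runs whether or not deserialize raised, so the task is never lost.
        handle_failed_task(reason)


def broken_deserializer(_: bytes) -> str:
    raise RuntimeError("class needed to decode the failure reason is missing")


handled: List[str] = []
try:
    handle_failure(b"payload", broken_deserializer, handled.append)
except RuntimeError:
    pass  # the error still propagates, but the task was handled first
print(handled)
```
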
      ## How was this patch tested?
      
      In my case I hit a NoClassDefFoundError and the job hangs. Manually verified the patch can fix it.
      
      Author: Rui Li <rui.li@intel.com>
      Author: Rui Li <lirui@apache.org>
      Author: Rui Li <shlr@cn.ibm.com>
      
      Closes #12775 from lirui-intel/SPARK-14958.
      f5d18af6
    • Wenchen Fan's avatar
      [SPARK-19058][SQL] fix partition related behaviors with DataFrameWriter.saveAsTable · 30345c43
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      When we append data to a partitioned table with `DataFrameWriter.saveAsTable`, there are 2 issues:
      1. doesn't work when the partition has custom location.
      2. will recover all partitions
      
      This PR fixes them by moving the special partition handling code from `DataSourceAnalysis` to `InsertIntoHadoopFsRelationCommand`, so that the `DataFrameWriter.saveAsTable` code path can also benefit from it.
      
      ## How was this patch tested?
      
      newly added regression tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16460 from cloud-fan/append.
      30345c43
  5. Jan 04, 2017
    • uncleGen's avatar
      [SPARK-19009][DOC] Add streaming rest api doc · 6873430c
      uncleGen authored
      ## What changes were proposed in this pull request?
      
      add streaming rest api doc
      
      related to pr #16253
      
      cc saturday-shi srowen
      
      ## How was this patch tested?
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #16414 from uncleGen/SPARK-19009.
      6873430c
    • Kay Ousterhout's avatar
      [SPARK-19062] Utils.writeByteBuffer bug fix · 00074b57
      Kay Ousterhout authored
      This commit changes Utils.writeByteBuffer so that it does not change
      the position of the ByteBuffer that it writes out, and adds a unit test for
      this functionality.
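
      The same contract can be illustrated with a seekable stream: write out everything from the current position, then restore the position. `io.BytesIO` is used here as a rough analogue of a `ByteBuffer`, and `write_remaining` is a hypothetical name, not the Spark helper.

```python
import io


def write_remaining(src: io.BytesIO, out: io.BytesIO) -> None:
    """Copy the bytes after src's current position without moving it."""
    start = src.tell()
    try:
        out.write(src.read())
    finally:
        # Restore the caller's position so the source can be read again.
        src.seek(start)


source = io.BytesIO(b"abcdef")
source.seek(2)
sink = io.BytesIO()
write_remaining(source, sink)
print(sink.getvalue(), source.tell())
```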
      
      cc mridulm
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #16462 from kayousterhout/SPARK-19062.
      00074b57
    • Herman van Hovell's avatar
      [SPARK-19070] Clean-up dataset actions · 4262fb0d
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      Dataset actions currently spin off a new `DataFrame` only to track query execution. This PR simplifies this code path by using the `Dataset.queryExecution` directly. This PR also merges the typed and untyped action evaluation paths.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #16466 from hvanhovell/SPARK-19070.
      4262fb0d
    • Niranjan Padmanabhan's avatar
      [MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo · a1e40b1f
      Niranjan Padmanabhan authored
      ## What changes were proposed in this pull request?
      There are many locations in the Spark repo where the same word occurs consecutively. Sometimes they are appropriately placed, but many times they are not. This PR removes the inappropriately duplicated words.
      
      ## How was this patch tested?
      N/A since only docs or comments were updated.
      
      Author: Niranjan Padmanabhan <niranjan.padmanabhan@gmail.com>
      
      Closes #16455 from neurons/np.structure_streaming_doc.
      a1e40b1f
    • Zheng RuiFeng's avatar
      [SPARK-19054][ML] Eliminate extra pass in NB · 7a825058
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      eliminate unnecessary extra pass in NB's train
      
      ## How was this patch tested?
      existing tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #16453 from zhengruifeng/nb_getNC.
      7a825058
    • Wenchen Fan's avatar
      [SPARK-19060][SQL] remove the supportsPartial flag in AggregateFunction · 101556d0
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Now that all aggregation functions support partial aggregation, we can remove the `supportsPartial` flag in `AggregateFunction`.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16461 from cloud-fan/partial.
      101556d0
    • mingfei's avatar
      [SPARK-19073] LauncherState should be only set to SUBMITTED after the application is submitted · fe1c895e
      mingfei authored
      ## What changes were proposed in this pull request?
      LauncherState should be only set to SUBMITTED after the application is submitted.
      Currently the state is set before the application is actually submitted.
      
      ## How was this patch tested?
      no test is added in this patch
      
      Author: mingfei <mingfei.smf@alipay.com>
      
      Closes #16459 from shimingfei/fixLauncher.
      fe1c895e
    • Wenchen Fan's avatar
      [SPARK-19072][SQL] codegen of Literal should not output boxed value · cbd11d23
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      In https://github.com/apache/spark/pull/16402 we made a mistake: when a double/float is infinity, the `Literal` codegen outputs a boxed value, which causes a wrong result.

      This PR fixes this by special-casing infinity so that no boxed value is output.
      
      ## How was this patch tested?
      
      new regression test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16469 from cloud-fan/literal.
      cbd11d23
  6. Jan 03, 2017
    • gatorsmile's avatar
      [SPARK-19048][SQL] Delete Partition Location when Dropping Managed Partitioned... · b67b35f7
      gatorsmile authored
      [SPARK-19048][SQL] Delete Partition Location when Dropping Managed Partitioned Tables in InMemoryCatalog
      
      ### What changes were proposed in this pull request?
      The data in a managed table should be deleted after the table is dropped. However, if a partition location is not under the location of the partitioned table, it is not deleted as expected. Users can specify any location for a partition when adding it.
      
      This PR is to delete partition location when dropping managed partitioned tables stored in `InMemoryCatalog`.
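
As a rough illustration of the behavior described above (not the actual `InMemoryCatalog` code), the sketch below drops a managed table by deleting the table directory and then any partition directory that lives outside it. The function name `drop_managed_table` and its arguments are hypothetical.

```python
import os
import shutil
import tempfile


def drop_managed_table(table_location, partition_locations):
    """Delete a managed table's data, including partitions stored elsewhere.

    Hypothetical helper for illustration only; this is not a Spark API.
    """
    # Removing the table directory also removes partitions nested under it.
    shutil.rmtree(table_location, ignore_errors=True)
    # Partitions added with a custom location are not under the table
    # directory, so they have to be removed explicitly.
    for location in partition_locations:
        if os.path.exists(location):
            shutil.rmtree(location)


# Usage: one partition under the table directory, one at an unrelated path.
table_dir = tempfile.mkdtemp()
inside = os.path.join(table_dir, "p=1")
outside = tempfile.mkdtemp()
os.mkdir(inside)

drop_managed_table(table_dir, [inside, outside])
leftovers = [path for path in (table_dir, inside, outside) if os.path.exists(path)]
```

Without the explicit loop, the `outside` directory would survive the drop, which is the leak this change addresses.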
      
      ### How was this patch tested?
      Added test cases for both HiveExternalCatalog and InMemoryCatalog
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16448 from gatorsmile/unsetSerdeProp.
      b67b35f7
    • Devaraj K's avatar
      [SPARK-15555][MESOS] Driver with --supervise option cannot be killed in Mesos mode · 89bf370e
      Devaraj K authored
      ## What changes were proposed in this pull request?
      
      Killed applications are no longer added for retry.
      ## How was this patch tested?
      
      I have verified this manually in a Mesos cluster: with the changes, killed applications move to the Finished Drivers section and are not retried.
      
      Author: Devaraj K <devaraj@apache.org>
      
      Closes #13323 from devaraj-kavali/SPARK-15555.
      89bf370e
    • Dongjoon Hyun's avatar
      [SPARK-18877][SQL] `CSVInferSchema.inferField` on DecimalType should find a... · 7a2b5f93
      Dongjoon Hyun authored
      [SPARK-18877][SQL] `CSVInferSchema.inferField` on DecimalType should find a common type with `typeSoFar`
      
      ## What changes were proposed in this pull request?
      
      CSV type inference throws `IllegalArgumentException` on decimal numbers with heterogeneous precisions and scales because the current logic uses the last decimal type in a **partition**. Specifically, `inferRowType`, the **seqOp** of **aggregate**, returns the last decimal type. This PR fixes it to use `findTightestCommonType`.
      
      **decimal.csv**
      ```
      9.03E+12
      1.19E+11
      ```
      
      **BEFORE**
      ```scala
      scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema
      root
       |-- _c0: decimal(3,-9) (nullable = true)
      
      scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show
      16/12/16 14:32:49 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
      java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 exceeds max precision 3
      ```
      
      **AFTER**
      ```scala
      scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema
      root
       |-- _c0: decimal(4,-9) (nullable = true)
      
      scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show
      +---------+
      |      _c0|
      +---------+
      |9.030E+12|
      | 1.19E+11|
      +---------+
      ```
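
The arithmetic behind the two schemas can be sketched in plain Python. This is only an illustration, assuming the common type takes the larger scale and enough integer digits for both inputs. The helper names `decimal_type` and `widen` are made up here, and Spark's real `findTightestCommonType` covers other type combinations and precision limits that are omitted.

```python
from decimal import Decimal


def decimal_type(text):
    """Return (precision, scale) for a decimal literal.

    Precision is the number of unscaled digits; scale is the negated exponent.
    """
    _, digits, exponent = Decimal(text).as_tuple()
    return len(digits), -exponent


def widen(t1, t2):
    """Smallest (precision, scale) able to hold values of both types."""
    (p1, s1), (p2, s2) = t1, t2
    scale = max(s1, s2)
    int_digits = max(p1 - s1, p2 - s2)  # digits needed left of the point
    return int_digits + scale, scale


first = decimal_type("9.03E+12")   # precision 3, scale -10
second = decimal_type("1.19E+11")  # precision 3, scale -9
last_wins = second                 # old behavior: keep the last type seen
common = widen(first, second)      # widened type that fits both rows
```

Here `last_wins` is `(3, -9)`, which cannot hold `9.03E+12` (it needs four digits at scale -9), while `common` is `(4, -9)`, matching the schema printed after the fix.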
      
      ## How was this patch tested?
      
      Pass the newly added test case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16320 from dongjoon-hyun/SPARK-18877.
      7a2b5f93
    • Liang-Chi Hsieh's avatar
      [SPARK-18932][SQL] Support partial aggregation for collect_set/collect_list · 52636226
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Currently the collect_set/collect_list aggregate expressions don't support partial aggregation. This patch enables partial aggregation for them.
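
For readers unfamiliar with the term, partial aggregation means each partition first builds its own intermediate result, and only those intermediate results are merged afterwards. A minimal sketch for `collect_set` in plain Python follows; the function names are hypothetical and this is not how Spark implements it internally.

```python
from functools import reduce


def partial_collect_set(partition):
    """Partial step: build one set per partition."""
    return set(partition)


def merge_sets(left, right):
    """Final step: merge two partial results."""
    return left | right


partitions = [[1, 2, 2], [2, 3], [3, 4, 4]]
partials = [partial_collect_set(p) for p in partitions]
result = reduce(merge_sets, partials, set())
```

Only the small per-partition sets need to be moved between stages, instead of every input row.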
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16371 from viirya/collect-partial-support.
      52636226
    • Weiqing Yang's avatar
      [MINOR] Add missing sc.stop() to end of examples · e5c307c5
      Weiqing Yang authored
      ## What changes were proposed in this pull request?
      
      Add `finally` clause for `sc.stop()` in the `test("register and deregister Spark listener from SparkContext")`.
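
The pattern is the usual try/finally cleanup. The sketch below shows it in Python with a hypothetical `FakeContext` standing in for a real SparkContext, so it runs without Spark. It is not the test code changed by this patch.

```python
class FakeContext:
    """Hypothetical stand-in for a SparkContext; it only tracks stop()."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


def run_with_context(body):
    """Run body(sc) and always stop the context afterwards."""
    sc = FakeContext()
    try:
        body(sc)
    finally:
        # Runs whether body() returns or raises, so the context never leaks.
        sc.stop()


seen = []


def failing_body(sc):
    seen.append(sc)
    raise ValueError("simulated test failure")


try:
    run_with_context(failing_body)
except ValueError:
    pass
```

Even though `failing_body` raises, the context recorded in `seen` has been stopped.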
      
      ## How was this patch tested?
      Pass the build and unit tests.
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #16426 from weiqingy/testIssue.
      e5c307c5
  7. Jan 02, 2017
    • Zhenhua Wang's avatar
      [SPARK-18998][SQL] Add a cbo conf to switch between default statistics and estimated statistics · ae83c211
      Zhenhua Wang authored
      ## What changes were proposed in this pull request?
      
      We add a CBO configuration to switch between default stats and estimated stats.
      We also define a new statistics method `planStats` in LogicalPlan with conf as its parameter, in order to pass the CBO switch and other estimation-related configurations in the future. `planStats` is used on the caller side (i.e. in Optimizer and Strategies) to make transformation decisions based on stats.
      
      ## How was this patch tested?
      
      Add a test case using a dummy LogicalPlan.
      
      Author: Zhenhua Wang <wzh_zju@163.com>
      
      Closes #16401 from wzhfy/cboSwitch.
      ae83c211
    • gatorsmile's avatar
      [SPARK-19029][SQL] Remove databaseName from SimpleCatalogRelation · a6cd9dbc
      gatorsmile authored
      ### What changes were proposed in this pull request?
      Remove the useless `databaseName` from `SimpleCatalogRelation`.
      
      ### How was this patch tested?
      Existing test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16438 from gatorsmile/removeDBFromSimpleCatalogRelation.
      a6cd9dbc
    • hyukjinkwon's avatar
      [SPARK-19002][BUILD][PYTHON] Check pep8 against all Python scripts · 46b21260
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to check pep8 against all other Python scripts and fix the errors as below:
      
      ```bash
      ./dev/create-release/generate-contributors.py
      ./dev/create-release/releaseutils.py
      ./dev/create-release/translate-contributors.py
      ./dev/lint-python
      ./python/docs/epytext.py
      ./examples/src/main/python/mllib/decision_tree_classification_example.py
      ./examples/src/main/python/mllib/decision_tree_regression_example.py
      ./examples/src/main/python/mllib/gradient_boosting_classification_example.py
      ./examples/src/main/python/mllib/gradient_boosting_regression_example.py
      ./examples/src/main/python/mllib/linear_regression_with_sgd_example.py
      ./examples/src/main/python/mllib/logistic_regression_with_lbfgs_example.py
      ./examples/src/main/python/mllib/naive_bayes_example.py
      ./examples/src/main/python/mllib/random_forest_classification_example.py
      ./examples/src/main/python/mllib/random_forest_regression_example.py
      ./examples/src/main/python/mllib/svm_with_sgd_example.py
      ./examples/src/main/python/streaming/network_wordjoinsentiments.py
      ./sql/hive/src/test/resources/data/scripts/cat.py
      ./sql/hive/src/test/resources/data/scripts/cat_error.py
      ./sql/hive/src/test/resources/data/scripts/doubleescapedtab.py
      ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py
      ./sql/hive/src/test/resources/data/scripts/escapedcarriagereturn.py
      ./sql/hive/src/test/resources/data/scripts/escapednewline.py
      ./sql/hive/src/test/resources/data/scripts/escapedtab.py
      ./sql/hive/src/test/resources/data/scripts/input20_script.py
      ./sql/hive/src/test/resources/data/scripts/newline.py
      ```
      
      ## How was this patch tested?
      
      - `./python/docs/epytext.py`
      
        ```bash
        cd ./python/docs && make html
        ```
      
      - pep8 check (Python 2.7 / Python 3.3.6)
      
        ```
        ./dev/lint-python
        ```
      
      - `./dev/merge_spark_pr.py` (Python 2.7 only / Python 3.3.6 not working)
      
        ```bash
        python -m doctest -v ./dev/merge_spark_pr.py
        ```
      
      - `./dev/create-release/releaseutils.py` `./dev/create-release/generate-contributors.py` `./dev/create-release/translate-contributors.py` (Python 2.7 only / Python 3.3.6 not working)
      
        ```bash
        python generate-contributors.py
        python translate-contributors.py
        ```
      
      - Examples (Python 2.7 / Python 3.3.6)
      
        ```bash
        ./bin/spark-submit examples/src/main/python/mllib/decision_tree_classification_example.py
        ./bin/spark-submit examples/src/main/python/mllib/decision_tree_regression_example.py
        ./bin/spark-submit examples/src/main/python/mllib/gradient_boosting_classification_example.py
        ./bin/spark-submit examples/src/main/python/mllib/gradient_boosting_regression_example.p
        ./bin/spark-submit examples/src/main/python/mllib/random_forest_classification_example.py
        ./bin/spark-submit examples/src/main/python/mllib/random_forest_regression_example.py
        ```
      
      - Examples (Python 2.7 only / Python 3.3.6 not working)
        ```
        ./bin/spark-submit examples/src/main/python/mllib/linear_regression_with_sgd_example.py
        ./bin/spark-submit examples/src/main/python/mllib/logistic_regression_with_lbfgs_example.py
        ./bin/spark-submit examples/src/main/python/mllib/naive_bayes_example.py
        ./bin/spark-submit examples/src/main/python/mllib/svm_with_sgd_example.py
        ```
      
      - `sql/hive/src/test/resources/data/scripts/*.py` (Python 2.7 / Python 3.3.6 within suggested changes)
      
        Manually tested only changed ones.
      
      - `./dev/github_jira_sync.py` (Python 2.7 only / Python 3.3.6 not working)
      
        Manually tested this after disabling the code that actually adds comments and links.
      
      And also via Jenkins tests.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16405 from HyukjinKwon/minor-pep8.
      46b21260
    • hyukjinkwon's avatar
      [SPARK-19022][TESTS] Fix tests dependent on OS due to different newline characters · f1330b1d
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      There are two tests failing on Windows due to the different newlines.
      
      ```
       - StreamingQueryProgress - prettyJson *** FAILED *** (0 milliseconds)
       "{
          "id" : "39788670-6722-48b7-a248-df6ba08722ac",
          "runId" : "422282f1-3b81-4b47-a15d-82dda7e69390",
          "name" : "myName",
          ...
        }" did not equal "{
          "id" : "39788670-6722-48b7-a248-df6ba08722ac",
          "runId" : "422282f1-3b81-4b47-a15d-82dda7e69390",
          "name" : "myName",
          ...
        }"
        ...
      ```
      
      ```
       - StreamingQueryStatus - prettyJson *** FAILED *** (0 milliseconds)
       "{
          "message" : "active",
          "isDataAvailable" : true,
          "isTriggerActive" : false
        }" did not equal "{
          "message" : "active",
          "isDataAvailable" : true,
          "isTriggerActive" : false
        }"
        ...
      ```
      
      The reason is that `pretty` in `org.json4s.pretty` writes OS-dependent newlines, but the strings defined in the tests use `\n`. This causes the test failures.
      
      This PR proposes to compare them independently of the newline style.
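
One way to make such a comparison newline-agnostic is to normalize line endings on both sides before asserting equality. The sketch below shows the idea in Python. The helper name `normalize_newlines` is made up, and the actual patch may take a different approach.

```python
def normalize_newlines(text):
    """Convert CRLF and bare CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


# The expected string is written with LF; the Windows output uses CRLF.
expected = '{\n  "message" : "active"\n}'
actual_on_windows = '{\r\n  "message" : "active"\r\n}'

raw_equal = expected == actual_on_windows
normalized_equal = normalize_newlines(expected) == normalize_newlines(actual_on_windows)
```

The raw comparison fails while the normalized one succeeds, which mirrors the failures shown above.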
      
      ## How was this patch tested?
      
      Manually tested via AppVeyor.
      
      **Before**
      https://ci.appveyor.com/project/spark-test/spark/build/417-newlines-fix-before
      
      **After**
      https://ci.appveyor.com/project/spark-test/spark/build/418-newlines-fix
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16433 from HyukjinKwon/tests-StreamingQueryStatusAndProgressSuite.
      f1330b1d
    • Liang-Chi Hsieh's avatar
      [MINOR][DOC] Minor doc change for YARN credential providers · 0ac2f1e7
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      The configuration `spark.yarn.security.tokens.{service}.enabled` is deprecated. Now we should use `spark.yarn.security.credentials.{service}.enabled`. Some places in the docs have not been updated yet.
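
For anyone updating their own settings, the rename is mechanical. The sketch below rewrites the deprecated key pattern to the new one in a plain dictionary of settings. The helper `migrate_key` is hypothetical, and `hive` is used only as an example service name.

```python
import re

# Matches the deprecated pattern and captures the {service} part.
_OLD_KEY = re.compile(r"^spark\.yarn\.security\.tokens\.([^.]+)\.enabled$")


def migrate_key(key):
    """Return the new-style key for a deprecated one; leave others unchanged."""
    match = _OLD_KEY.match(key)
    if match is None:
        return key
    return "spark.yarn.security.credentials.{}.enabled".format(match.group(1))


conf = {
    "spark.yarn.security.tokens.hive.enabled": "false",
    "spark.app.name": "demo",
}
migrated = {migrate_key(key): value for key, value in conf.items()}
```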
      
      ## How was this patch tested?
      
      N/A. Just doc change.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16444 from viirya/minor-credential-provider-doc.
      0ac2f1e7
    • Liwei Lin's avatar
      [SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide · 808b84e2
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      Currently some code snippets in the programming guide do not compile. We should fix them.
      
      ## How was this patch tested?
      
      ```
      SKIP_API=1 jekyll build
      ```
      
      ## Screenshot from part of the change:
      
      ![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #16442 from lw-lin/ss-pro-guide-.
      808b84e2
    • Sean Owen's avatar
      [BUILD] Close stale PRs · ba488126
      Sean Owen authored
      Closes #12968
      Closes #16215
      Closes #16212
      Closes #16086
      Closes #15713
      Closes #16413
      Closes #16396
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16447 from srowen/CloseStalePRs.
      ba488126