  1. Apr 28, 2016
    • Ergin Seyfe's avatar
      [SPARK-14576][WEB UI] Spark console should display Web UI url · 23256be0
      Ergin Seyfe authored
      ## What changes were proposed in this pull request?
      This is a proposal to print the Spark Driver UI link when spark-shell is launched.
      
      ## How was this patch tested?
      Launched spark-shell in local mode and cluster mode. Spark-shell console output included following line:
      "Spark context Web UI available at <Spark web url>"
      
      Author: Ergin Seyfe <eseyfe@fb.com>
      
      Closes #12341 from seyfe/spark_console_display_webui_link.
      23256be0
    • Liang-Chi Hsieh's avatar
      [SPARK-14487][SQL] User Defined Type registration without SQLUserDefinedType annotation · 7c6937a8
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Currently we use `SQLUserDefinedType` annotation to register UDTs for user classes. However, by doing this, we add Spark dependency to user classes.
      
      For some user classes, it is unnecessary to add such dependency that will increase deployment difficulty.
      
      We should provide alternative approach to register UDTs for user classes without `SQLUserDefinedType` annotation.
      
      ## How was this patch tested?
      
      `UserDefinedTypeSuite`
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #12259 from viirya/improve-sql-usertype.
      7c6937a8
    • Wenchen Fan's avatar
      [SPARK-14654][CORE] New accumulator API · bf5496db
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR introduces a new accumulator API which is much simpler than before:

      1. The type hierarchy is simplified; now we only have an `Accumulator` class.
      2. The `initialValue` and `zeroValue` concepts are combined into a single concept: `zeroValue`.
      3. There is only one `register` method; accumulator registration and cleanup registration are combined.
      4. The `id`, `name`, and `countFailedValues` fields are combined into an `AccumulatorMetadata`, which is provided during registration.
      
      `SQLMetric` is a good example to show the simplicity of this new API.
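
      As a rough illustration of what the simpler API looks like for end users, here is a sketch of a user-defined accumulator written against the `AccumulatorV2` class this work evolved into in the released 2.0 API; the exact class and method names in this PR may differ, and `SetAccumulator` is a made-up example:

      ```scala
      import org.apache.spark.util.AccumulatorV2

      // Collects distinct values; isZero/copy/reset/add/merge/value is the whole contract.
      class SetAccumulator[T] extends AccumulatorV2[T, java.util.Set[T]] {
        private val _set = java.util.Collections.synchronizedSet(new java.util.HashSet[T]())
        override def isZero: Boolean = _set.isEmpty
        override def copy(): SetAccumulator[T] = {
          val newAcc = new SetAccumulator[T]
          newAcc._set.addAll(_set)
          newAcc
        }
        override def reset(): Unit = _set.clear()
        override def add(v: T): Unit = _set.add(v)
        override def merge(other: AccumulatorV2[T, java.util.Set[T]]): Unit = _set.addAll(other.value)
        override def value: java.util.Set[T] = _set
      }

      // Registration is a single call on SparkContext:
      //   val acc = new SetAccumulator[String]
      //   sc.register(acc, "distinctErrors")
      ```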
      
      What we break:
      
      1. There is no `setValue` anymore. In the new API, the intermediate type can be different from the result type, so it is very hard to implement a general `setValue`.
      2. An accumulator can't be serialized before it is registered.
      
      Problems need to be addressed in follow-ups:
      
      1. With this new API, `AccumulatorInfo` doesn't make a lot of sense: the partial output is no longer partial updates, so we need to expose the intermediate value.
      2. `ExceptionFailure` should not carry the accumulator updates. Why do users care about accumulator updates for failed cases? It looks like we only use this feature to update the internal metrics; how about sending a heartbeat to update internal metrics after the failure event?
      3. The public event `SparkListenerTaskEnd` carries a `TaskMetrics`. Ideally this `TaskMetrics` shouldn't need to carry external accumulators, as the only method of `TaskMetrics` that can access external accumulators is `private[spark]`. However, `SQLListener` uses it to retrieve SQL metrics.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12612 from cloud-fan/acc.
      bf5496db
    • Jakob Odersky's avatar
      [SPARK-10001][CORE] Don't short-circuit actions in signal handlers · be317d4a
      Jakob Odersky authored
      ## What changes were proposed in this pull request?
      The current signal handlers have a subtle bug: they stop evaluating registered actions as soon as one of them returns true, because `forall` is short-circuited.
      This PR adds a strict mapping stage before evaluating the returned results.
      There are no known occurrences of the bug, and this is a preemptive fix.
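
      A tiny illustration of the difference (this is not the actual signal-handler code; `actions` is a made-up stand-in for the registered handlers, which return true when they handle the signal):

      ```scala
      val actions: Seq[() => Boolean] = Seq(
        () => true,                                   // "handled": forall stops right here
        () => { println("never runs"); false })

      // Buggy shape: forall over the negated results short-circuits at the first
      // action that returns true, so later actions are silently skipped.
      val escalate1 = actions.forall(action => !action())

      // Fixed shape: map strictly first so every action runs, then combine.
      val escalate2 = actions.map(action => !action()).forall(identity)
      ```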
      
      ## How was this patch tested?
      As with the original introduction of signal handlers, this was tested manually (unit testing with signals is not straightforward).
      
      Author: Jakob Odersky <jakob@odersky.com>
      
      Closes #12745 from jodersky/SPARK-10001-hotfix.
      be317d4a
  2. Apr 27, 2016
    • Davies Liu's avatar
      [SPARK-14961] Build HashedRelation larger than 1G · ae4e3def
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Currently, LongToUnsafeRowMap uses a byte array as the underlying page, which can't be larger than 1G.

      This PR improves LongToUnsafeRowMap to scale up to 8G bytes by using an array of Long instead of an array of bytes.
      
      ## How was this patch tested?
      
      Manually ran a test to confirm that both UnsafeHashedRelation and LongHashedRelation could build a map larger than 2G.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12740 from davies/larger_broadcast.
      ae4e3def
    • hyukjinkwon's avatar
      [SPARK-12143][SQL] Binary type support for Hive thrift server · f5da592f
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-12143
      
      This PR adds support for conversion between `SparkRow` in Spark and `RowSet` in Hive for `BinaryType` as `Array[Byte]` (JDBC).

      ## How was this patch tested?
      
      Unit tests in `HiveThriftBinaryServerSuite` (regression test).
      
      Closes #10139
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #12733 from HyukjinKwon/SPARK-12143.
      f5da592f
    • Arun Allamsetty's avatar
      [MINOR][MAINTENANCE] Sort the entries in .gitignore. · b0ce0d13
      Arun Allamsetty authored
      ## What changes were proposed in this pull request?
      
      The contents of `.gitignore` have been sorted to make it more readable. The actual contents of the file have not been changed.
      
      ## How was this patch tested?
      
      Does not require any tests.
      
      Author: Arun Allamsetty <arun@instructure.com>
      
      Closes #12742 from aa8y/gitignore.
      b0ce0d13
    • Josh Rosen's avatar
      [SPARK-14966] SizeEstimator should ignore classes in the scala.reflect package · 8c49cebc
      Josh Rosen authored
      In local profiling, I noticed SizeEstimator spending tons of time estimating the size of objects which contain TypeTag or ClassTag fields. The problem with these tags is that they reference global Scala reflection objects, which, in turn, reference many singletons, such as TestHive. This throws off the accuracy of the size estimation and wastes tons of time traversing a huge object graph.
      
      As a result, I think that SizeEstimator should ignore any classes in the `scala.reflect` package.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #12741 from JoshRosen/ignore-scala-reflect-in-size-estimator.
      8c49cebc
    • Joseph K. Bradley's avatar
      [SPARK-14671][ML] Pipeline setStages should handle subclasses of PipelineStage · f5ebb18c
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Pipeline.setStages failed for some code examples that worked in 1.5 but fail in 1.6.  This tends to occur when using a mix of transformers from ml.feature. It happens because `Array` is invariant in Scala, and the addition of MLWritable to some transformers means the inferred stage arrays in user code are no longer of type Array[PipelineStage].  This PR modifies the following to accept subclasses of PipelineStage (see the sketch after this list):
      * Pipeline.setStages()
      * Params.w()
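
      A minimal sketch of the failing pattern and the relaxed signature (the transformers chosen here are illustrative, and the signature shown is the released one, which may differ in detail from this PR):

      ```scala
      import org.apache.spark.ml.Pipeline
      import org.apache.spark.ml.feature.{Binarizer, HashingTF}

      // The inferred element type is a common subtype of the two transformers,
      // not PipelineStage itself, so a strict Array[PipelineStage] parameter rejects it.
      val stages = Array(new Binarizer(), new HashingTF())

      // Accepting subclasses, e.g. via a signature along the lines of
      //   def setStages(value: Array[_ <: PipelineStage]): this.type
      // lets the call compile again.
      val pipeline = new Pipeline().setStages(stages)
      ```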
      
      ## How was this patch tested?
      
      Unit test which fails to compile before this fix.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12430 from jkbradley/pipeline-setstages.
      f5ebb18c
    • Josh Rosen's avatar
      6466d6c8
    • Oscar D. Lara Yejas's avatar
      [SPARK-13436][SPARKR] Added parameter drop to subsetting operator [ · e4bfb4aa
      Oscar D. Lara Yejas authored
      Added parameter drop to subsetting operator [. This is useful to get a Column from a DataFrame, given its name. R supports it.
      
      In R:
      ```
      > name <- "Sepal_Length"
      > class(iris[, name])
      [1] "numeric"
      ```
      Currently, in SparkR:
      ```
      > name <- "Sepal_Length"
      > class(irisDF[, name])
      [1] "DataFrame"
      ```
      
      The previous code returns a DataFrame, which is inconsistent with R's behavior; SparkR should return a Column instead. Currently, the only way for the user to get a Column given a column name as a character variable is through `eval(parse(x))`, where x is the string `"irisDF$Sepal_Length"`. That itself is pretty hacky. `SparkR:::getColumn()` is another choice, but I don't see why this method should be externalized. Instead, following R's way of doing things, the proposed implementation allows this:
      
      ```
      > name <- "Sepal_Length"
      > class(irisDF[, name, drop=T])
      [1] "Column"
      
      > class(irisDF[, name, drop=F])
      [1] "DataFrame"
      ```
      
      This is consistent with R:
      
      ```
      > name <- "Sepal_Length"
      > class(iris[, name])
      [1] "numeric"
      > class(iris[, name, drop=F])
      [1] "data.frame"
      ```
      
      Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
      Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
      
      Closes #11318 from olarayej/SPARK-13436.
      e4bfb4aa
    • Andrew Or's avatar
      [SPARK-14940][SQL] Move ExternalCatalog to own file · 37575115
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      `interfaces.scala` was getting big. This just moves the biggest class in there to a new file for cleanliness.
      
      ## How was this patch tested?
      
      Just moving things around.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12721 from andrewor14/move-external-catalog.
      37575115
    • Yanbo Liang's avatar
      [SPARK-14899][ML][PYSPARK] Remove spark.ml HashingTF hashingAlg option · 4672e983
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Since [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574) breaks the behavior of `HashingTF`, we should try to enforce good practice by removing the "native" hashAlgorithm option in spark.ml and pyspark.ml. We can leave spark.mllib and pyspark.mllib alone.
      
      ## How was this patch tested?
      Unit tests.
      
      cc jkbradley
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #12702 from yanboliang/spark-14899.
      4672e983
    • Cheng Lian's avatar
      [SPARK-14954] [SQL] Add PARTITION BY and BUCKET BY clause for data source CTAS syntax · 24bea000
      Cheng Lian authored
      Currently, we can only create persisted partitioned and/or bucketed data source tables using the Dataset API but not using SQL DDL. This PR implements the following syntax to add partitioning and bucketing support to the SQL DDL:
      
      ```
      CREATE TABLE <table-name>
      USING <provider> [OPTIONS (<key1> <value1>, <key2> <value2>, ...)]
      [PARTITIONED BY (col1, col2, ...)]
      [CLUSTERED BY (col1, col2, ...) [SORTED BY (col1, col2, ...)] INTO <n> BUCKETS]
      AS SELECT ...
      ```
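
      As a concrete (made-up) instance of this syntax, issued from Scala through the SQL interface (assuming a `sqlContext` as in spark-shell; table, provider, and column names are hypothetical):

      ```scala
      sqlContext.sql(
        """CREATE TABLE sales
          |USING parquet
          |OPTIONS (path '/tmp/spark-sales')
          |PARTITIONED BY (country)
          |CLUSTERED BY (customer_id) SORTED BY (order_date) INTO 8 BUCKETS
          |AS SELECT * FROM staging_sales
        """.stripMargin)
      ```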
      
      Test cases are added in `MetastoreDataSourcesSuite` to check the newly added syntax.
      
      Author: Cheng Lian <lian@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #12734 from liancheng/spark-14954.
      24bea000
    • Dongjoon Hyun's avatar
      [SPARK-14867][BUILD] Remove `--force` option in `build/mvn` · f405de87
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, `build/mvn` provides a convenient option, `--force`, in order to use the recommended version of Maven without changing the PATH environment variable. However, there are two problems.

      - `dev/lint-java` does not use the newly installed Maven.

        ```bash
        $ ./build/mvn --force clean
        $ ./dev/lint-java
        Using `mvn` from path: /usr/local/bin/mvn
        ```
      - It's not easy to always type the `--force` option.

      Once `--force` has been used, we should prefer the Maven installation recommended by Spark.
      This PR makes `build/mvn` check for the Maven installed by the `--force` option first.

      According to the comments, this PR aims to do the following:
      - Detect the Maven version from `pom.xml`.
      - Install Maven if there is none or the existing one is too old.
      - Remove the `--force` option.
      
      ## How was this patch tested?
      
      Manual.
      
      ```bash
      $ ./build/mvn --force clean
      $ ./dev/lint-java
      Using `mvn` from path: /Users/dongjoon/spark/build/apache-maven-3.3.9/bin/mvn
      ...
      $ rm -rf ./build/apache-maven-3.3.9/
      $ ./dev/lint-java
      Using `mvn` from path: /usr/local/bin/mvn
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12631 from dongjoon-hyun/SPARK-14867.
      f405de87
    • Dongjoon Hyun's avatar
      [SPARK-14664][SQL] Implement DecimalAggregates optimization for Window queries · af92299f
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR aims to implement decimal aggregation optimization for window queries by improving the existing `DecimalAggregates`. Historically, the `DecimalAggregates` optimizer was designed to transform general `sum/avg(decimal)`, but it breaks recently added window queries like the following, which work well without the current `DecimalAggregates` optimizer.
      
      **Sum**
      ```scala
      scala> sql("select sum(a) over () from (select explode(array(1.0,2.0)) a) t").head
      java.lang.RuntimeException: Unsupported window function: MakeDecimal((sum(UnscaledValue(a#31)),mode=Complete,isDistinct=false),12,1)
      scala> sql("select sum(a) over () from (select explode(array(1.0,2.0)) a) t").explain()
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [sum(a) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#23]
      :     +- INPUT
      +- Window [MakeDecimal((sum(UnscaledValue(a#21)),mode=Complete,isDistinct=false),12,1) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum(a) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#23]
         +- Exchange SinglePartition, None
            +- Generate explode([1.0,2.0]), false, false, [a#21]
               +- Scan OneRowRelation[]
      ```
      
      **Average**
      ```scala
      scala> sql("select avg(a) over () from (select explode(array(1.0,2.0)) a) t").head
      java.lang.RuntimeException: Unsupported window function: cast(((avg(UnscaledValue(a#40)),mode=Complete,isDistinct=false) / 10.0) as decimal(6,5))
      scala> sql("select avg(a) over () from (select explode(array(1.0,2.0)) a) t").explain()
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [avg(a) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#44]
      :     +- INPUT
      +- Window [cast(((avg(UnscaledValue(a#42)),mode=Complete,isDistinct=false) / 10.0) as decimal(6,5)) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS avg(a) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#44]
         +- Exchange SinglePartition, None
            +- Generate explode([1.0,2.0]), false, false, [a#42]
               +- Scan OneRowRelation[]
      ```
      
      After this PR, those queries work fine and new optimized physical plans look like the followings.
      
      **Sum**
      ```scala
      scala> sql("select sum(a) over () from (select explode(array(1.0,2.0)) a) t").explain()
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [sum(a) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#35]
      :     +- INPUT
      +- Window [MakeDecimal((sum(UnscaledValue(a#33)),mode=Complete,isDistinct=false) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),12,1) AS sum(a) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#35]
         +- Exchange SinglePartition, None
            +- Generate explode([1.0,2.0]), false, false, [a#33]
               +- Scan OneRowRelation[]
      ```
      
      **Average**
      ```scala
      scala> sql("select avg(a) over () from (select explode(array(1.0,2.0)) a) t").explain()
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [avg(a) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#47]
      :     +- INPUT
      +- Window [cast(((avg(UnscaledValue(a#45)),mode=Complete,isDistinct=false) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) / 10.0) as decimal(6,5)) AS avg(a) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#47]
         +- Exchange SinglePartition, None
            +- Generate explode([1.0,2.0]), false, false, [a#45]
               +- Scan OneRowRelation[]
      ```
      
      In this PR, the *SUM over window* pattern matching is based on the code of hvanhovell; he should be credited for the work he did.
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (with newly added testcases)
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12421 from dongjoon-hyun/SPARK-14664.
      af92299f
    • wm624@hotmail.com's avatar
      [SPARK-14937][ML][DOCUMENT] spark.ml LogisticRegression sqlCtx in scala is... · c74fd1e5
      wm624@hotmail.com authored
      [SPARK-14937][ML][DOCUMENT] spark.ml LogisticRegression sqlCtx in scala is inconsistent with java and python
      
      ## What changes were proposed in this pull request?
      In the spark.ml documentation, the LogisticRegression Scala example uses sqlCtx. This is inconsistent with the Java and Python examples, which use sqlContext. In addition, a user can't copy & paste the example to run it in spark-shell, as sqlCtx doesn't exist in spark-shell while sqlContext does.

      This PR changes the Scala example referenced by the spark.ml documentation.
      
      ## How was this patch tested?
      
      Compiled the example Scala file; it passes compilation.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #12717 from wangmiao1981/doc.
      c74fd1e5
    • Josh Rosen's avatar
      [SPARK-14930][SPARK-13693] Fix race condition in CheckpointWriter.stop() · 450136ec
      Josh Rosen authored
      CheckpointWriter.stop() is prone to a race condition: if one thread calls `stop()` right as a checkpoint write task begins to execute, that write task may become blocked when trying to access `fs`, the shared Hadoop FileSystem, since both the `fs` getter and `stop` method synchronize on the same lock. Here's a thread-dump excerpt which illustrates the problem:
      
      ```java
      "pool-31-thread-1" #156 prio=5 os_prio=31 tid=0x00007fea02cd2000 nid=0x5c0b waiting for monitor entry [0x000000013bc4c000]
         java.lang.Thread.State: BLOCKED (on object monitor)
          at org.apache.spark.streaming.CheckpointWriter.org$apache$spark$streaming$CheckpointWriter$$fs(Checkpoint.scala:302)
          - waiting to lock <0x00000007bf53ee78> (a org.apache.spark.streaming.CheckpointWriter)
          at org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler.run(Checkpoint.scala:224)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          at java.lang.Thread.run(Thread.java:745)
      
      "pool-1-thread-1-ScalaTest-running-MapWithStateSuite" #11 prio=5 os_prio=31 tid=0x00007fe9ff879800 nid=0x5703 waiting on condition [0x000000012e54c000]
         java.lang.Thread.State: TIMED_WAITING (parking)
          at sun.misc.Unsafe.park(Native Method)
          - parking to wait for  <0x00000007bf564568> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
          at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
          at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
          at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1465)
          at org.apache.spark.streaming.CheckpointWriter.stop(Checkpoint.scala:291)
          - locked <0x00000007bf53ee78> (a org.apache.spark.streaming.CheckpointWriter)
          at org.apache.spark.streaming.scheduler.JobGenerator.stop(JobGenerator.scala:159)
          - locked <0x00000007bf53ea90> (a org.apache.spark.streaming.scheduler.JobGenerator)
          at org.apache.spark.streaming.scheduler.JobScheduler.stop(JobScheduler.scala:115)
          - locked <0x00000007bf53d3f0> (a org.apache.spark.streaming.scheduler.JobScheduler)
          at org.apache.spark.streaming.StreamingContext$$anonfun$stop$1.apply$mcV$sp(StreamingContext.scala:680)
          at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1219)
          at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:679)
          - locked <0x00000007bf516a70> (a org.apache.spark.streaming.StreamingContext)
          at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:644)
          - locked <0x00000007bf516a70> (a org.apache.spark.streaming.StreamingContext)
      [...]
      ```
      
      We can fix this problem by having `stop` and `fs` be synchronized on different locks: the synchronization on `stop` only needs to guard against multiple threads calling `stop` at the same time, whereas the synchronization on `fs` is only necessary for cross-thread visibility. There's only ever a single active checkpoint writer thread at a time, so we don't need to guard against concurrent access to `fs`. Thus, `fs` can simply become a `volatile` var, similar to `lastCheckpointTime`.
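
      A minimal sketch of the locking pattern described above (illustrative only, not the actual `CheckpointWriter` code):

      ```scala
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.{FileSystem, Path}

      class Writer(checkpointDir: String, hadoopConf: Configuration) {
        // Cross-thread visibility only; there is a single writer thread, so no lock is needed.
        @volatile private var fs: FileSystem = _

        private def fileSystem: FileSystem = {
          if (fs == null) {
            fs = new Path(checkpointDir).getFileSystem(hadoopConf)
          }
          fs
        }

        // Guards only against concurrent stop() calls; it no longer blocks the
        // writer thread that is reading `fs`.
        def stop(): Unit = synchronized {
          // shut down the executor, await termination, etc.
        }
      }
      ```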
      
      This change should fix [SPARK-13693](https://issues.apache.org/jira/browse/SPARK-13693), a flaky `MapWithStateSuite` test suite which has recently been failing several times per day. It also results in a huge test speedup: prior to this patch, `MapWithStateSuite` took about 80 seconds to run, whereas it now runs in less than 10 seconds. For the `streaming` project's tests as a whole, they now run in ~220 seconds vs. ~354 before.
      
      /cc zsxwing and tdas for review.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #12712 from JoshRosen/fix-checkpoint-writer-race.
      450136ec
    • Hemant Bhanawat's avatar
      [SPARK-14729][SCHEDULER] Refactored YARN scheduler creation code to use newly... · e4d439c8
      Hemant Bhanawat authored
      [SPARK-14729][SCHEDULER] Refactored YARN scheduler creation code to use newly added ExternalClusterManager
      
      ## What changes were proposed in this pull request?
      With the addition of the ExternalClusterManager (ECM) interface in PR #11723, any cluster manager can now be integrated with Spark. It was suggested in the ExternalClusterManager PR that one of the existing cluster managers should start using the new interface to ensure that the API is correct. Ideally, all the existing cluster managers should eventually use the ECM interface, but as a first step YARN will now use it. This PR refactors the YARN code from the SparkContext.createTaskScheduler function into a YarnClusterManager that implements the ECM interface.
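
      For reference, a skeleton of what an ECM implementation looks like (the method signatures follow the ExternalClusterManager trait added in PR #11723 as I understand it, and `MyClusterManager` is a made-up name):

      ```scala
      import org.apache.spark.SparkContext
      import org.apache.spark.scheduler.{ExternalClusterManager, SchedulerBackend, TaskScheduler}

      private[spark] class MyClusterManager extends ExternalClusterManager {
        // Claim master URLs this cluster manager understands.
        override def canCreate(masterURL: String): Boolean = masterURL == "myCluster"

        override def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler = ???

        override def createSchedulerBackend(
            sc: SparkContext,
            masterURL: String,
            scheduler: TaskScheduler): SchedulerBackend = ???

        override def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit = ()
      }
      ```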
      
      ## How was this patch tested?
      Since this is a refactoring, no new tests have been added. Existing tests have been run, and basic manual testing with YARN was done too.
      
      Author: Hemant Bhanawat <hemant@snappydata.io>
      
      Closes #12641 from hbhanawat/yarnClusterMgr.
      e4d439c8
    • Mike Dusenberry's avatar
      [SPARK-9656][MLLIB][PYTHON] Add missing methods to PySpark's Distributed Linear Algebra Classes · 607f5034
      Mike Dusenberry authored
      This PR adds the remaining group of methods to PySpark's distributed linear algebra classes as follows:
      
      * `RowMatrix` <sup>**[1]**</sup>
        1. `computeGramianMatrix`
        2. `computeCovariance`
        3. `computeColumnSummaryStatistics`
        4. `columnSimilarities`
        5. `tallSkinnyQR` <sup>**[2]**</sup>
      * `IndexedRowMatrix` <sup>**[3]**</sup>
        1. `computeGramianMatrix`
      * `CoordinateMatrix`
        1. `transpose`
      * `BlockMatrix`
        1. `validate`
        2. `cache`
        3. `persist`
        4. `transpose`
      
      **[1]**: Note: `multiply`, `computeSVD`, and `computePrincipalComponents` are already part of PR #7963 for SPARK-6227.
      **[2]**: Implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor.  As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark.  Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`.  Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`.  As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type.  Thus, this PR currently contains that fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`.  `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types.  However, this fix may be out of scope for this single PR, and it may be better suited in a separate JIRA/PR.  Therefore, I have marked this PR as WIP and am open to discussion.
      **[3]**: Note: `multiply` and `computeSVD` are already part of PR #7963 for SPARK-6227.
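
      The `retag` fix mentioned in note **[2]** boils down to re-attaching the erased element type before constructing the matrix. A simplified sketch (not the exact `PythonMLlibAPI` code; note that `RDD.retag` is `private[spark]`, so this only compiles inside Spark itself):

      ```scala
      import org.apache.spark.mllib.linalg.Vector
      import org.apache.spark.mllib.linalg.distributed.RowMatrix
      import org.apache.spark.rdd.RDD

      // Re-attach the Vector element type that type erasure dropped on the way from Python.
      def createRowMatrix(rows: RDD[Vector], numRows: Long, numCols: Int): RowMatrix =
        new RowMatrix(rows.retag(classOf[Vector]), numRows, numCols)
      ```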
      
      Author: Mike Dusenberry <mwdusenb@us.ibm.com>
      
      Closes #9441 from dusenberrymw/SPARK-9656_Add_Missing_Methods_to_PySpark_Distributed_Linear_Algebra.
      607f5034
    • Liwei Lin's avatar
      [SPARK-14874][SQL][STREAMING] Remove the obsolete Batch representation · a234cc61
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      The `Batch` class, which had been used to indicate progress in a stream, was abandoned by [[SPARK-13985][SQL] Deterministic batches with ids](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b) and then became useless.
      
      This patch:
      - removes the `Batch` class
      - ~~does some related renaming~~ (update: this has been reverted)
      - fixes some related comments
      
      ## How was this patch tested?
      
      N/A
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #12638 from lw-lin/remove-batch.
      a234cc61
    • Herman van Hovell's avatar
      [SPARK-14950][SQL] Fix BroadcastHashJoin's unique key Anti-Joins · 7dd01d9c
      Herman van Hovell authored
      ### What changes were proposed in this pull request?
      Anti-joins using BroadcastHashJoin's unique-key code path are broken; that path currently returns semi-join results. This PR fixes this bug.
      
      ### How was this patch tested?
      Added tests cases to `ExistenceJoinSuite`.
      
      cc davies gatorsmile
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #12730 from hvanhovell/SPARK-14950.
      7dd01d9c
    • Reynold Xin's avatar
      [SPARK-14949][SQL] Remove HiveConf dependency from InsertIntoHiveTable · ea017b55
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes the use of HiveConf from InsertIntoHiveTable. I think this is the last major use of HiveConf and after this we can try to remove the execution HiveConf.
      
      ## How was this patch tested?
      Internal refactoring and should be covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12728 from rxin/SPARK-14949.
      ea017b55
    • Victor Chima's avatar
      Unintentional white spaces in kryo classes configuration parameters · 08dc8936
      Victor Chima authored
      ## What changes were proposed in this pull request?
      
      Pruned whitespace from the user-provided comma-separated list of classes for **spark.kryo.classesToRegister** and **spark.kryo.registrator**.
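
      Conceptually, the trimming amounts to something like this (illustrative only, not the exact SparkConf code; the class names are made up):

      ```scala
      // Split the comma-separated setting, drop surrounding whitespace and empty entries.
      val classes = " com.example.Foo, com.example.Bar ,com.example.Baz"
        .split(',')
        .map(_.trim)
        .filter(_.nonEmpty)
      // -> Array("com.example.Foo", "com.example.Bar", "com.example.Baz")
      ```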
      
      ## How was this patch tested?
      
      Manual tests
      
      Author: Victor Chima <blazy2k9@gmail.com>
      
      Closes #12701 from blazy2k9/master.
      08dc8936
    • Dongjoon Hyun's avatar
      [MINOR][BUILD] Enable RAT checking on `LZ4BlockInputStream.java`. · c5443560
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Since `LZ4BlockInputStream.java` is not licensed to the Apache Software Foundation (ASF), the Apache License header of that file has not been monitored until now.
      This PR aims to enable RAT checking on `LZ4BlockInputStream.java` by removing it from `dev/.rat-excludes`.
      This will prevent accidental removal of the Apache License header from that file.
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (Specifically, RAT check stage).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12677 from dongjoon-hyun/minor_rat_exclusion_file.
      c5443560
    • Yin Huai's avatar
      [SPARK-14130][SQL] Throw exceptions for ALTER TABLE ADD/REPLACE/CHANGE COLUMN,... · 54a3eb83
      Yin Huai authored
      [SPARK-14130][SQL] Throw exceptions for ALTER TABLE ADD/REPLACE/CHANGE COLUMN, ALTER TABLE SET FILEFORMAT, DFS, and transaction related commands
      
      ## What changes were proposed in this pull request?
      This PR will make Spark SQL not allow ALTER TABLE ADD/REPLACE/CHANGE COLUMN, ALTER TABLE SET FILEFORMAT, DFS, and transaction related commands.
      
      ## How was this patch tested?
      Existing tests. For those tests that I put in the blacklist, I am adding the useful parts back to SQLQuerySuite.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #12714 from yhuai/banNativeCommand.
      54a3eb83
    • Reynold Xin's avatar
      [SPARK-14944][SPARK-14943][SQL] Remove HiveConf from HiveTableScanExec,... · d73d67f6
      Reynold Xin authored
      [SPARK-14944][SPARK-14943][SQL] Remove HiveConf from HiveTableScanExec, HiveTableReader, and ScriptTransformation
      
      ## What changes were proposed in this pull request?
      This patch removes HiveConf from HiveTableScanExec and HiveTableReader and instead just uses our own configuration system. I'm splitting the large change of removing HiveConf into multiple independent pull requests because it is very difficult to debug test failures when they are all combined in one giant one.
      
      ## How was this patch tested?
      Should be covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12727 from rxin/SPARK-14944.
      d73d67f6
    • Liwei Lin's avatar
      [SPARK-14911] [CORE] Fix a potential data race in TaskMemoryManager · b2a45606
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      [[SPARK-13210][SQL] catch OOM when allocate memory and expand array](https://github.com/apache/spark/commit/37bc203c8dd5022cb11d53b697c28a737ee85bcc) introduced an `acquiredButNotUsed` field, but it might not be correctly synchronized:
      - the write `acquiredButNotUsed += acquired` is guarded by `this` lock (see [here](https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L271));
      - the read `memoryManager.releaseExecutionMemory(acquiredButNotUsed, taskAttemptId, tungstenMemoryMode)` (see [here](https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L400)) might not be correctly synchronized, and thus might not see `acquiredButNotUsed`'s most recent value.
      
      This patch makes `acquiredButNotUsed` volatile to fix this.
      
      ## How was this patch tested?
      
      This should be covered by existing suites.
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #12681 from lw-lin/fix-acquiredButNotUsed.
      b2a45606
    • Reynold Xin's avatar
      [SPARK-14913][SQL] Simplify configuration API · 8fda5a73
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      We currently expose both Hadoop configuration and Spark SQL configuration in RuntimeConfig. I think we can remove the Hadoop configuration part, and simply generate Hadoop Configuration on the fly by passing all the SQL configurations into it. This way, there is a single interface (in Java/Scala/Python/SQL) for end-users.
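
      Hypothetical usage of the single configuration surface (assuming `spark` is a `SparkSession`; the names follow the `RuntimeConfig` API this work targets and may differ in detail at the time of this patch):

      ```scala
      // Set and read a SQL configuration through one interface; the Hadoop Configuration
      // is generated from these settings on the fly rather than being exposed separately.
      spark.conf.set("spark.sql.shuffle.partitions", "400")
      val numShufflePartitions = spark.conf.get("spark.sql.shuffle.partitions")
      ```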
      
      As part of this patch, I also removed some config options deprecated in Spark 1.x.
      
      ## How was this patch tested?
      Updated relevant tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12689 from rxin/SPARK-14913.
      8fda5a73
  3. Apr 26, 2016
    • Andrew Or's avatar
      [SPARK-13477][SQL] Expose new user-facing Catalog interface · d8a83a56
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      #12625 exposed a new user-facing conf interface in `SparkSession`. This patch adds a catalog interface.
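
      A few hypothetical calls against the new interface (assuming `spark` is a `SparkSession`; the method names follow the Catalog API as released in 2.0 and may differ slightly from this patch):

      ```scala
      spark.catalog.listDatabases().show()
      spark.catalog.listTables("default").show()
      spark.catalog.listColumns("default", "my_table").show()
      ```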
      
      ## How was this patch tested?
      
      See `CatalogSuite`.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12713 from andrewor14/user-facing-catalog.
      d8a83a56
    • Dilip Biswal's avatar
      [SPARK-14445][SQL] Support native execution of SHOW COLUMNS and SHOW PARTITIONS · d93976d8
      Dilip Biswal authored
      ## What changes were proposed in this pull request?
      This PR adds native execution of the SHOW COLUMNS and SHOW PARTITIONS commands.
      
      Command Syntax:
      ``` SQL
      SHOW COLUMNS (FROM | IN) table_identifier [(FROM | IN) database]
      ```
      ``` SQL
      SHOW PARTITIONS [db_name.]table_name [PARTITION(partition_spec)]
      ```
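
      Concrete examples of the new commands, issued from Scala (the database and table names are made up):

      ```scala
      sqlContext.sql("SHOW COLUMNS IN sales FROM salesdb")
      sqlContext.sql("SHOW PARTITIONS salesdb.sales PARTITION (country='US')")
      ```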
      
      ## How was this patch tested?
      
      Added test cases in HiveCommandSuite to verify execution, and in DDLCommandSuite to verify plans.
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #12222 from dilipbiswal/dkb_show_columns.
      d93976d8
    • Joseph K. Bradley's avatar
      [SPARK-14732][ML] spark.ml GaussianMixture should use MultivariateGaussian in mllib-local · bd2c9a6d
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Before, the spark.ml GaussianMixtureModel used the spark.mllib MultivariateGaussian in its public API. That API was added after 1.6, so we can modify it without breaking released APIs.
      
      This PR copies MultivariateGaussian to mllib-local in spark.ml, with a few changes:
      * Renamed fields to match numpy, scipy: mu => mean, sigma => cov
      
      This PR then uses the spark.ml MultivariateGaussian in the spark.ml GaussianMixtureModel, which involves:
      * Modifying the constructor
      * Adding a computeProbabilities method
      
      Also:
      * Added EPSILON to mllib-local for use in MultivariateGaussian
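
      A small usage sketch with the renamed fields (the constructor shown follows the mllib-local class as released, so details may differ from this PR):

      ```scala
      import org.apache.spark.ml.linalg.{Matrices, Vectors}
      import org.apache.spark.ml.stat.distribution.MultivariateGaussian

      // 2-D standard normal: mean/cov instead of the old mu/sigma field names.
      val gaussian = new MultivariateGaussian(
        mean = Vectors.dense(0.0, 0.0),
        cov = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0)))

      val density = gaussian.pdf(Vectors.dense(0.5, -0.5))
      ```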
      
      ## How was this patch tested?
      
      Existing unit tests
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12593 from jkbradley/sparkml-gmm-fix.
      bd2c9a6d
    • Oscar D. Lara Yejas's avatar
      [SPARK-13734][SPARKR] Added histogram function · 0c99c23b
      Oscar D. Lara Yejas authored
      ## What changes were proposed in this pull request?
      
      Added method histogram() to compute the histogram of a Column
      
      Usage:
      
      ```
      ## Create a DataFrame from the Iris dataset
      irisDF <- createDataFrame(sqlContext, iris)
      
      ## Render a histogram for the Sepal_Length column
      histogram(irisDF, "Sepal_Length", nbins=12)
      
      ```
      ![histogram](https://cloud.githubusercontent.com/assets/13985649/13588486/e1e751c6-e484-11e5-85db-2fc2115c4bb2.png)
      
      Note: Usage will change once SPARK-9325 is figured out so that histogram() only takes a Column as a parameter, as opposed to a DataFrame and a name
      
      ## How was this patch tested?
      
      All unit tests pass. I added specific unit cases for different scenarios.
      
      Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
      Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
      
      Closes #11569 from olarayej/SPARK-13734.
      0c99c23b
    • Josh Rosen's avatar
      [SPARK-14925][BUILD] Re-introduce 'unused' dependency so that published POMs are flattened · 75879ac3
      Josh Rosen authored
      Spark's published POMs are supposed to be flattened and not contain variable substitution (see SPARK-3812), but the dummy dependency that was required for this was accidentally removed. We should re-introduce this dependency in order to fix an issue where the un-flattened POMs cause the wrong dependencies to be included in Scala 2.10 published POMs.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #12706 from JoshRosen/SPARK-14925-published-poms-should-be-flattened.
      75879ac3
    • Sameer Agarwal's avatar
      [SPARK-14929] [SQL] Disable vectorized map for wide schemas & high-precision decimals · 9797cc20
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      While the vectorized hash map in `TungstenAggregate` is currently supported for all primitive data types during partial aggregation, this patch only enables the hash map for a subset of cases that have been verified to show performance improvements on our benchmarks, subject to an internal conf that sets an upper limit on the maximum length of the aggregate key/value schema. This list of supported use cases should be expanded over time.
      
      ## How was this patch tested?
      
      There is no new change in functionality, so existing tests should suffice. Performance tests were done on TPCDS benchmarks.
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #12710 from sameeragarwal/vectorized-enable.
      9797cc20
    • Joseph K. Bradley's avatar
      [SPARK-12301][ML] Made all tree and ensemble classes not final · 6c5a837c
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      There have been continuing requests (e.g., SPARK-7131) for allowing users to extend and modify MLlib models and algorithms.
      
      This PR makes tree and ensemble classes, Node types, and Split types in spark.ml no longer final.  This matches most other spark.ml algorithms.
      
      Constructors for models are still private since we may need to refactor how stats are maintained in tree nodes.
      
      ## How was this patch tested?
      
      Existing unit tests
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12711 from jkbradley/final-trees.
      6c5a837c
    • Zheng RuiFeng's avatar
      [SPARK-14514][DOC] Add python example for VectorSlicer · e88476c8
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Add the missing python example for VectorSlicer
      
      ## How was this patch tested?
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #12282 from zhengruifeng/vecslicer_pe.
      e88476c8
    • Dongjoon Hyun's avatar
      [SPARK-14907][MLLIB] Use repartition in GLMRegressionModel.save · e4f3eec5
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR changes the `GLMRegressionModel.save` function as shown in the following code, which is similar to other algorithms' Parquet writes.
      ```
      - val dataRDD: DataFrame = sc.parallelize(Seq(data), 1).toDF()
      - // TODO: repartition with 1 partition after SPARK-5532 gets fixed
      - dataRDD.write.parquet(Loader.dataPath(path))
      + sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(Loader.dataPath(path))
      ```
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12676 from dongjoon-hyun/SPARK-14907.
      e4f3eec5
    • Davies Liu's avatar
      [SPARK-14853] [SQL] Support LeftSemi/LeftAnti in SortMergeJoinExec · 7131b03b
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR updates SortMergeJoinExec to support LeftSemi/LeftAnti, so it now supports all the join types, the same as the other three join implementations: BroadcastHashJoinExec, ShuffledHashJoinExec, and BroadcastNestedLoopJoinExec.
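
      For example (the DataFrame names are made up, and whether a sort-merge join is actually chosen still depends on the planner):

      ```scala
      // Rows of `orders` that have, or don't have, a matching customer.
      val matched   = orders.join(customers, Seq("customer_id"), "leftsemi")
      val unmatched = orders.join(customers, Seq("customer_id"), "leftanti")
      ```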
      
      This PR also simplifies the join selection in SparkStrategy.
      
      ## How was this patch tested?
      
      Added new tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12668 from davies/smj_semi.
      7131b03b
    • Joseph K. Bradley's avatar
      [SPARK-14903][SPARK-14071][ML][PYTHON] Revert : MLWritable.write property · 89f082de
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      SPARK-14071 changed MLWritable.write to be a property.  This reverts that change since there was not a good way to make MLReadable.read appear to be a property.
      
      ## How was this patch tested?
      
      existing unit tests
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12671 from jkbradley/revert-MLWritable-write-py.
      89f082de