  1. May 01, 2015
    • Sandy Ryza's avatar
      [SPARK-4550] In sort-based shuffle, store map outputs in serialized form · 0a2b15ce
      Sandy Ryza authored
      Refer to the JIRA for the design doc and some perf results.
      
      I wanted to call out some of the possibly more controversial changes up front:
      * Map outputs are only stored in serialized form when Kryo is in use.  I'm still unsure whether Java-serialized objects can be relocated.  At the very least, Java serialization writes out a stream header which causes problems with the current approach, so I decided to leave investigating this to future work.
      * The shuffle now explicitly operates on key-value pairs instead of arbitrary objects.  Data is written to shuffle files as alternating keys and values rather than as key-value tuples.  `BlockObjectWriter.write` now accepts a key argument and a value argument instead of a single arbitrary object (see the sketch after this list).
      * The map output buffer can hold a max of Integer.MAX_VALUE bytes, though this wouldn't be terribly difficult to change.
      * When spilling occurs, the objects that are still in memory at merge time end up serialized and deserialized an extra time.
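
      To make the alternating key/value layout concrete, here is a hypothetical plain-Scala sketch (`PairWriter` is illustrative only, not the actual `BlockObjectWriter`):

      ```scala
      import java.io.{ByteArrayOutputStream, DataOutputStream}
      import java.nio.charset.StandardCharsets

      // Toy writer: records are laid out as alternating length-prefixed keys
      // and values, so each serialized record is a contiguous byte range that
      // can be relocated without deserializing it.
      class PairWriter(out: DataOutputStream) {
        def write(key: String, value: String): Unit = {
          val k = key.getBytes(StandardCharsets.UTF_8)
          val v = value.getBytes(StandardCharsets.UTF_8)
          out.writeInt(k.length); out.write(k)   // key bytes
          out.writeInt(v.length); out.write(v)   // value bytes, immediately after
        }
      }

      val buf = new ByteArrayOutputStream()
      new PairWriter(new DataOutputStream(buf)).write("k1", "v1")
      ```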
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #4450 from sryza/sandy-spark-4550 and squashes the following commits:
      
      8c70dd9 [Sandy Ryza] Fix serialization
      9c16fe6 [Sandy Ryza] Fix a couple tests and move getAutoReset to KryoSerializerInstance
      6c54e06 [Sandy Ryza] Fix scalastyle
      d8462d8 [Sandy Ryza] SPARK-4550
      0a2b15ce
    • Patrick Wendell's avatar
      a9fc5055
    • Zhan Zhang's avatar
      [SPARK-6479] [BLOCK MANAGER] Create off-heap block storage API · 36a7a680
      Zhan Zhang authored
      These are the classes for the off-heap block storage API. It also includes the migration from Tachyon. The diff looks big, but it mostly just renames tachyon to offheap. A new implementation for HDFS will be submitted for review in SPARK-6112.
      
      Author: Zhan Zhang <zhazhan@gmail.com>
      
      Closes #5430 from zhzhan/SPARK-6479 and squashes the following commits:
      
      60acd84 [Zhan Zhang] minor change to kickoff the test
      12f54c9 [Zhan Zhang] solve merge conflicts
      a54132c [Zhan Zhang] solve review comments
      ffb8e00 [Zhan Zhang] rebase to sparkcontext change
      6e121e0 [Zhan Zhang] resolve review comments and restructure blockmanasger code
      a7aed6c [Zhan Zhang] add Tachyon migration code
      186de31 [Zhan Zhang] initial commit for off-heap block storage api
      36a7a680
  2. Apr 30, 2015
    • Burak Yavuz's avatar
      [SPARK-7248] implemented random number generators for DataFrames · b5347a46
      Burak Yavuz authored
      Adds the functions `rand` (uniform distribution) and `randn` (normal distribution) as expressions to DataFrames.
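
      A hedged usage sketch (local-mode setup for illustration; seeded so results are reproducible):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.SQLContext
      import org.apache.spark.sql.functions.{rand, randn}

      val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("rng"))
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      val df = sc.parallelize(1 to 3).toDF("id")
      // One uniform and one normal column, both seeded.
      df.select($"id", rand(42L).as("uniform"), randn(42L).as("normal")).show()
      ```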
      
      cc mengxr rxin
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #5819 from brkyvz/df-rng and squashes the following commits:
      
      50d69d4 [Burak Yavuz] add seed for test that failed
      4234c3a [Burak Yavuz] fix Rand expression
      13cad5c [Burak Yavuz] couple fixes
      7d53953 [Burak Yavuz] waiting for hive tests
      b453716 [Burak Yavuz] move radn with seed down
      03637f0 [Burak Yavuz] fix broken hive func
      c5909eb [Burak Yavuz] deleted old implementation of Rand
      6d43895 [Burak Yavuz] implemented random generators
      b5347a46
    • zsxwing's avatar
      [SPARK-7282] [STREAMING] Fix the race conditions in StreamingListenerSuite · 69a739c7
      zsxwing authored
      Fixed the following flaky test
      ```Scala
      [info] StreamingListenerSuite:
      [info] - batch info reporting (782 milliseconds)
      [info] - receiver info reporting *** FAILED *** (3 seconds, 911 milliseconds)
      [info]   The code passed to eventually never returned normally. Attempted 10 times over 3.4735783689999997 seconds. Last failure message: 0 did not equal 1. (StreamingListenerSuite.scala:104)
      [info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:
      [info]   at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
      [info]   at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
      [info]   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
      [info]   at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
      [info]   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
      [info]   at org.apache.spark.streaming.StreamingListenerSuite$$anonfun$2.apply$mcV$sp(StreamingListenerSuite.scala:104)
      [info]   at org.apache.spark.streaming.StreamingListenerSuite$$anonfun$2.apply(StreamingListenerSuite.scala:94)
      [info]   at org.apache.spark.streaming.StreamingListenerSuite$$anonfun$2.apply(StreamingListenerSuite.scala:94)
      [info]   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
      [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
      [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
      [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
      [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
      [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
      [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
      [info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
      [info]   at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
      [info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
      [info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
      [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
      [info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
      [info]   at org.apache.spark.streaming.StreamingListenerSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingListenerSuite.scala:34)
      [info]   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
      [info]   at org.apache.spark.streaming.StreamingListenerSuite.runTest(StreamingListenerSuite.scala:34)
      [info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
      [info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
      [info]   at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
      [info]   at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
      [info]   at scala.collection.immutable.List.foreach(List.scala:318)
      [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
      [info]   at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
      [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
      [info]   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
      [info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
      [info]   at org.scalatest.Suite$class.run(Suite.scala:1424)
      [info]   at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
      [info]   at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
      [info]   at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
      [info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
      [info]   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
      [info]   at org.apache.spark.streaming.StreamingListenerSuite.org$scalatest$BeforeAndAfter$$super$run(StreamingListenerSuite.scala:34)
      [info]   at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
      [info]   at org.apache.spark.streaming.StreamingListenerSuite.run(StreamingListenerSuite.scala:34)
      [info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
      [info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
      [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
      [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
      [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      [info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      [info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      [info]   at java.lang.Thread.run(Thread.java:745)
      [info]   Cause: org.scalatest.exceptions.TestFailedException: 0 did not equal 1
      [info]   at org.scalatest.MatchersHelper$.newTestFailedException(MatchersHelper.scala:160)
      [info]   at org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6231)
      [info]   at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6277)
      [info]   at org.apache.spark.streaming.StreamingListenerSuite$$anonfun$2$$anonfun$apply$mcV$sp$1.apply$mcV$sp(StreamingListenerSuite.scala:105)
      [info]   at org.apache.spark.streaming.StreamingListenerSuite$$anonfun$2$$anonfun$apply$mcV$sp$1.apply(StreamingListenerSuite.scala:104)
      [info]   at org.apache.spark.streaming.StreamingListenerSuite$$anonfun$2$$anonfun$apply$mcV$sp$1.apply(StreamingListenerSuite.scala:104)
      [info]   at org.scalatest.concurrent.Eventually$class.makeAValiantAttempt$1(Eventually.scala:394)
      [info]   at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:408)
      [info]   at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
      [info]   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
      [info]   at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
      [info]   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
      [info]   at org.apache.spark.streaming.StreamingListenerSuite$$anonfun$2.apply$mcV$sp(StreamingListenerSuite.scala:104)
      [info]   at org.apache.spark.streaming.StreamingListenerSuite$$anonfun$2.apply(StreamingListenerSuite.scala:94)
      [info]   at org.apache.spark.streaming.StreamingListenerSuite$$anonfun$2.apply(StreamingListenerSuite.scala:94)
      [info]   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
      [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
      [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
      [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
      [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
      [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
      [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
      [info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
      [info]   at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
      [info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
      [info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
      [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
      [info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
      [info]   at org.apache.spark.streaming.StreamingListenerSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingListenerSuite.scala:34)
      [info]   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
      [info]   at org.apache.spark.streaming.StreamingListenerSuite.runTest(StreamingListenerSuite.scala:34)
      [info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
      [info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
      [info]   at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
      [info]   at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
      [info]   at scala.collection.immutable.List.foreach(List.scala:318)
      [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
      [info]   at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
      [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
      [info]   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
      [info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
      [info]   at org.scalatest.Suite$class.run(Suite.scala:1424)
      [info]   at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
      [info]   at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
      [info]   at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
      [info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
      [info]   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
      [info]   at org.apache.spark.streaming.StreamingListenerSuite.org$scalatest$BeforeAndAfter$$super$run(StreamingListenerSuite.scala:34)
      [info]   at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
      [info]   at org.apache.spark.streaming.StreamingListenerSuite.run(StreamingListenerSuite.scala:34)
      [info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
      [info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
      [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
      [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
      [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      [info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      [info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      [info]   at java.lang.Thread.run(Thread.java:745)
      ```
      
      The original code didn't have a memory barrier in the `eventually` closure, which could make the test fail, because the JVM doesn't guarantee memory consistency between different threads without a memory barrier.
      
      This PR uses `ConcurrentLinkedQueue` to set up the memory barrier.
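
      A minimal sketch of the idea behind the fix, in plain Scala (no ScalaTest):

      ```scala
      import java.util.concurrent.ConcurrentLinkedQueue

      // Events are published through a concurrent collection, so the asserting
      // thread is guaranteed to see writes made by the listener thread.
      val received = new ConcurrentLinkedQueue[String]()

      val listenerThread = new Thread(new Runnable {
        def run(): Unit = received.add("receiverStarted")
      })
      listenerThread.start()
      listenerThread.join()

      // Safe: ConcurrentLinkedQueue establishes a happens-before edge between
      // add() in the listener thread and this read.
      assert(received.size == 1)
      ```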
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5812 from zsxwing/SPARK-7282 and squashes the following commits:
      
      59115ef [zsxwing] Use SynchronizedBuffer
      014dd2b [zsxwing] Fix the race conditions in StreamingListenerSuite
      69a739c7
    • Patrick Wendell's avatar
      Revert "[SPARK-5213] [SQL] Pluggable SQL Parser Support" · beeafcfd
      Patrick Wendell authored
      This reverts commit 3ba5aaab.
      beeafcfd
    • scwf's avatar
      [SPARK-7123] [SQL] support table.star in sqlcontext · 473552fa
      scwf authored
      Running the following SQL produced an error:
      `SELECT r.*
      FROM testData l join testData2 r on (l.key = r.a)`
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #5690 from scwf/tablestar and squashes the following commits:
      
      3b2e2b6 [scwf] support table.star
      473552fa
    • Cheng Hao's avatar
      [SPARK-5213] [SQL] Pluggable SQL Parser Support · 3ba5aaab
      Cheng Hao authored
      This PR aims to make the SQL parser pluggable, so users can register their own parsers via the Spark SQL CLI.
      
      ```
      # add the jar into the classpath
      $hchengmydesktop:spark>bin/spark-sql --jars sql99.jar
      
      -- switch to "hiveql" dialect
         spark-sql>SET spark.sql.dialect=hiveql;
         spark-sql>SELECT * FROM src LIMIT 1;
      
      -- switch to "sql" dialect
         spark-sql>SET spark.sql.dialect=sql;
         spark-sql>SELECT * FROM src LIMIT 1;
      
      -- switch to a custom dialect
         spark-sql>SET spark.sql.dialect=com.xxx.xxx.SQL99Dialect;
         spark-sql>SELECT * FROM src LIMIT 1;
      
      -- register a non-existent SQL dialect
         spark-sql> SET spark.sql.dialect=NotExistedClass;
         spark-sql> SELECT * FROM src LIMIT 1;
      -- An exception will be thrown, and the dialect falls back to the default ("sql" for SQLContext and "hiveql" for HiveContext)
      ```
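
      A hedged sketch of the same dialect switch done programmatically, assuming the Spark 1.x `SQLContext.setConf` API:

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.SQLContext

      val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("dialect"))
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      sc.parallelize(Seq((1, "a"))).toDF("key", "value").registerTempTable("src_like")
      // Equivalent of the CLI's SET spark.sql.dialect=sql above.
      sqlContext.setConf("spark.sql.dialect", "sql")
      sqlContext.sql("SELECT * FROM src_like LIMIT 1").show()
      ```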
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #4015 from chenghao-intel/sqlparser and squashes the following commits:
      
      493775c [Cheng Hao] update the code as feedback
      81a731f [Cheng Hao] remove the unecessary comment
      aab0b0b [Cheng Hao] polish the code a little bit
      49b9d81 [Cheng Hao] shrink the comment for rebasing
      3ba5aaab
    • Vyacheslav Baranov's avatar
      [SPARK-6913][SQL] Fixed "java.sql.SQLException: No suitable driver found" · e991255e
      Vyacheslav Baranov authored
      Fixed `java.sql.SQLException: No suitable driver found` when loading a DataFrame into Spark SQL if the driver is supplied with the `--jars` argument.
      
      The problem is that the `java.sql.DriverManager` class can't access drivers loaded by the Spark ClassLoader.
      
      Wrappers that forward requests are created for these drivers.
      
      Also, it's no longer necessary to include JDBC drivers in `--driver-class-path` in local mode; specifying them in the `--jars` argument is sufficient.
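
      A hedged sketch of the forwarding-wrapper idea (class name illustrative):

      ```scala
      import java.sql.{Connection, Driver, DriverPropertyInfo}
      import java.util.Properties
      import java.util.logging.Logger

      // A Driver that can be registered with java.sql.DriverManager from the
      // system classloader while delegating every call to a driver loaded by
      // another classloader (e.g. one added via --jars).
      class DriverWrapper(wrapped: Driver) extends Driver {
        def connect(url: String, info: Properties): Connection = wrapped.connect(url, info)
        def acceptsURL(url: String): Boolean = wrapped.acceptsURL(url)
        def getPropertyInfo(url: String, info: Properties): Array[DriverPropertyInfo] =
          wrapped.getPropertyInfo(url, info)
        def getMajorVersion: Int = wrapped.getMajorVersion
        def getMinorVersion: Int = wrapped.getMinorVersion
        def jdbcCompliant(): Boolean = wrapped.jdbcCompliant()
        def getParentLogger: Logger = wrapped.getParentLogger
      }
      // Usage: DriverManager.registerDriver(new DriverWrapper(underlyingDriver))
      ```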
      
      Author: Vyacheslav Baranov <slavik.baranov@gmail.com>
      
      Closes #5782 from SlavikBaranov/SPARK-6913 and squashes the following commits:
      
      510c43f [Vyacheslav Baranov] [SPARK-6913] Fixed review comments
      b2a727c [Vyacheslav Baranov] [SPARK-6913] Fixed thread race on driver registration
      c8294ae [Vyacheslav Baranov] [SPARK-6913] Fixed "No suitable driver found" when using using JDBC driver added with SparkContext.addJar
      e991255e
    • wangfei's avatar
      [SPARK-7109] [SQL] Push down left side filter for left semi join · a0d8a61a
      wangfei authored
      Currently the Spark SQL optimizer only pushes down the right-side filter for a left semi join; we can also push down the left-side filter, because a left semi join essentially filters the left table.
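
      To see why this is safe, a tiny plain-Scala analogy (collections standing in for tables; names illustrative):

      ```scala
      // A left semi join keeps left rows whose key appears on the right; it
      // only ever filters the left table, so a left-side filter commutes with it.
      case class Rec(key: Int, value: String)
      val left = Seq(Rec(1, "a"), Rec(2, "b"), Rec(3, "c"))
      val rightKeys = Set(1, 3)

      val semiJoinThenFilter = left.filter(r => rightKeys(r.key)).filter(_.key > 1)
      val filterThenSemiJoin = left.filter(_.key > 1).filter(r => rightKeys(r.key))
      assert(semiJoinThenFilter == filterThenSemiJoin)
      ```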
      
      Author: wangfei <wangfei1@huawei.com>
      Author: scwf <wangfei1@huawei.com>
      
      Closes #5677 from scwf/leftsemi and squashes the following commits:
      
      483d205 [wangfei] update with master to fix compile issue
      82df0e1 [wangfei] Merge branch 'master' of https://github.com/apache/spark into leftsemi
      d68a053 [wangfei] added apply
      8f48a3d [scwf] added test
      ebadaa9 [wangfei] left filter push down for left semi join
      a0d8a61a
    • scwf's avatar
      [SPARK-7093] [SQL] Using newPredicate in NestedLoopJoin to enable code generation · 07973381
      scwf authored
      Use newPredicate in NestedLoopJoin instead of InterpretedPredicate so that it can make use of code generation.
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #5665 from scwf/NLP and squashes the following commits:
      
      d19dd31 [scwf] improvement
      a887c02 [scwf] improve for NLP boundCondition
      07973381
    • rakeshchalasani's avatar
      [SPARK-7280][SQL] Add "drop" column/s on a data frame · ee044139
      rakeshchalasani authored
      Takes one or more column names and returns a new DataFrame with those columns dropped.
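
      A hedged usage sketch of the new API (local-mode setup for illustration):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.SQLContext

      val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("drop"))
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "name")
      df.drop("name").show()  // returns a new DataFrame containing only "id"
      ```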
      
      Author: rakeshchalasani <vnit.rakesh@gmail.com>
      
      Closes #5818 from rakeshchalasani/SPARK-7280 and squashes the following commits:
      
      ce2ec09 [rakeshchalasani] Minor edit
      45c06f1 [rakeshchalasani] Change withColumnRename and format changes
      f68945a [rakeshchalasani] Minor fix
      0b9104d [rakeshchalasani] Drop one column at a time
      289afd2 [rakeshchalasani] [SPARK-7280][SQL] Add "drop" column/s on a data frame
      ee044139
    • Burak Yavuz's avatar
      [SPARK-7242][SQL][MLLIB] Frequent items for DataFrames · 149b3ee2
      Burak Yavuz authored
      Finding frequent items with possibly false positives, using the algorithm described in `http://www.cs.umd.edu/~samir/498/karp.pdf`.
      public API under:
      ```
      df.stat.freqItems(cols: Array[String], support: Double = 0.001): DataFrame
      ```
      
      The output is a local DataFrame with the input column names and `-freqItems` appended to them. This is a single-pass algorithm that may return false positives, but no false negatives.
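
      A hedged usage sketch (local-mode setup for illustration):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.SQLContext

      val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("freq"))
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      val df = sc.parallelize(Seq((1, 2), (1, 3), (1, 4), (5, 6))).toDF("a", "b")
      // Items appearing in at least 40% of rows; false positives are possible.
      df.stat.freqItems(Array("a", "b"), 0.4).show()
      ```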
      
      cc mengxr rxin
      
      Let's get the implementations in; I can add the Python API in a follow-up PR.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #5799 from brkyvz/freq-items and squashes the following commits:
      
      a6ec82c [Burak Yavuz] addressed comments v?
      39b1bba [Burak Yavuz] removed toSeq
      0915e23 [Burak Yavuz] addressed comments v2.1
      3a5c177 [Burak Yavuz] addressed comments v2.0
      482e741 [Burak Yavuz] removed old import
      38e784d [Burak Yavuz] addressed comments v1.0
      8279d4d [Burak Yavuz] added default value for support
      3d82168 [Burak Yavuz] made base implementation
      149b3ee2
    • DB Tsai's avatar
      [SPARK-7279] Removed diffSum, which is theoretically zero, in LinearRegression, and fixed code formatting · 1c3e402e
      DB Tsai authored
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #5809 from dbtsai/format and squashes the following commits:
      
      6904eed [DB Tsai] triger jenkins
      9146e19 [DB Tsai] initial commit
      1c3e402e
    • Josh Rosen's avatar
      [Build] Enable MiMa checks for SQL · fa01bec4
      Josh Rosen authored
      Now that 1.3 has been released, we should enable MiMa checks for the `sql` subproject.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5727 from JoshRosen/enable-more-mima-checks and squashes the following commits:
      
      3ad302b [Josh Rosen] Merge remote-tracking branch 'origin/master' into enable-more-mima-checks
      0c48e4d [Josh Rosen] Merge remote-tracking branch 'origin/master' into enable-more-mima-checks
      e276cee [Josh Rosen] Fix SQL MiMa checks via excludes and private[sql]
      44d0d01 [Josh Rosen] Add back 'launcher' exclude
      1aae027 [Josh Rosen] Enable MiMa checks for launcher and sql projects.
      fa01bec4
    • Zhongshuai Pei's avatar
      [SPARK-7267][SQL]Push down Project when it's child is Limit · 77cc25fb
      Zhongshuai Pei authored
      SQL
      ```
      select key from (select key,value from t1 limit 100) t2 limit 10
      ```
      Optimized Logical Plan before modifying
      ```
      == Optimized Logical Plan ==
      Limit 10
        Project key#228
          Limit 100
            MetastoreRelation default, t1, None
      ```
      Optimized Logical Plan after modifying
      ```
      == Optimized Logical Plan ==
      Limit 10
        Limit 100
          Project key#228
            MetastoreRelation default, t1, None
      ```
      After this, we can combine the limits; a simplified sketch of the rewrite follows.
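
      A hedged sketch of the rewrite in Catalyst terms (simplified; the actual rule in Optimizer.scala may differ):

      ```scala
      import org.apache.spark.sql.catalyst.plans.logical.{Limit, LogicalPlan, Project}
      import org.apache.spark.sql.catalyst.rules.Rule

      // Push a Project that sits directly on a Limit beneath it, so the two
      // Limit nodes become adjacent and a later rule can combine them.
      object PushProjectThroughLimit extends Rule[LogicalPlan] {
        def apply(plan: LogicalPlan): LogicalPlan = plan transform {
          case Project(projectList, Limit(limitExpr, child)) =>
            Limit(limitExpr, Project(projectList, child))
        }
      }
      ```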
      
      Author: Zhongshuai Pei <799203320@qq.com>
      Author: DoingDone9 <799203320@qq.com>
      
      Closes #5797 from DoingDone9/ProjectLimit and squashes the following commits:
      
      70d0fca [Zhongshuai Pei] Update FilterPushdownSuite.scala
      dc83ae9 [Zhongshuai Pei] Update FilterPushdownSuite.scala
      485c61c [Zhongshuai Pei] Update Optimizer.scala
      f03fe7f [Zhongshuai Pei] Merge pull request #12 from apache/master
      f12fa50 [Zhongshuai Pei] Merge pull request #10 from apache/master
      f61210c [Zhongshuai Pei] Merge pull request #9 from apache/master
      34b1a9a [Zhongshuai Pei] Merge pull request #8 from apache/master
      802261c [DoingDone9] Merge pull request #7 from apache/master
      d00303b [DoingDone9] Merge pull request #6 from apache/master
      98b134f [DoingDone9] Merge pull request #5 from apache/master
      161cae3 [DoingDone9] Merge pull request #4 from apache/master
      c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
      cb1852d [DoingDone9] Merge pull request #2 from apache/master
      c3f046f [DoingDone9] Merge pull request #1 from apache/master
      77cc25fb
    • Josh Rosen's avatar
      [SPARK-7288] Suppress compiler warnings due to use of sun.misc.Unsafe; add... · 07a86205
      Josh Rosen authored
      [SPARK-7288] Suppress compiler warnings due to use of sun.misc.Unsafe; add facade in front of Unsafe; remove use of Unsafe.setMemory
      
      This patch suppresses compiler warnings due to our use of `sun.misc.Unsafe` (introduced in #5725).  These warnings can only be suppressed via the `-XDignore.symbol.file` javac flag; the `SuppressWarnings` annotation won't work for these.
      
      In order to restrict uses of this compiler flag to the `unsafe` module, I placed a facade in front of `Unsafe` so that other modules won't call it directly. This facade will also help us avoid accidental usage of deprecated Unsafe methods or methods that aren't supported in Java 6.
      
      I also removed an unnecessary use of `Unsafe.setMemory`, which isn't present in certain versions of Java 6, and excluded the new `unsafe` module from Javadoc.
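
      A hedged sketch of the facade idea (names illustrative, not the actual class in the `unsafe` module):

      ```scala
      // One object owns the sun.misc.Unsafe instance, so no other module ever
      // references sun.misc.Unsafe directly (and only this module needs the
      // -XDignore.symbol.file flag).
      object UnsafeFacade {
        private val unsafe: sun.misc.Unsafe = {
          val f = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
          f.setAccessible(true)
          f.get(null).asInstanceOf[sun.misc.Unsafe]
        }
        def getLong(obj: AnyRef, offset: Long): Long = unsafe.getLong(obj, offset)
        def putLong(obj: AnyRef, offset: Long, value: Long): Unit =
          unsafe.putLong(obj, offset, value)
        // Deprecated or Java-6-unsupported methods (e.g. setMemory) are
        // deliberately not exposed.
      }
      ```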
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5814 from JoshRosen/unsafe-compiler-warnings-fixes and squashes the following commits:
      
      9e8c483 [Josh Rosen] Exclude new unsafe module from Javadoc
      ba75ecf [Josh Rosen] Only apply -XDignore.symbol.file flag in unsafe project.
      7403345 [Josh Rosen] Put facade in front of Unsafe.
      50230c0 [Josh Rosen] Remove usage of Unsafe.setMemory
      96d41c9 [Josh Rosen] Use -XDignore.symbol.file to suppress warnings about sun.misc.Unsafe usage
      07a86205
    • Liang-Chi Hsieh's avatar
      [SPARK-7196][SQL] Support precision and scale of decimal type for JDBC · 6702324b
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7196
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5777 from viirya/jdbc_precision and squashes the following commits:
      
      f40f5e6 [Liang-Chi Hsieh] Support precision and scale for NUMERIC type.
      49acbf9 [Liang-Chi Hsieh] Add unit test.
      a509e19 [Liang-Chi Hsieh] Support precision and scale of decimal type for JDBC.
      6702324b
    • Joseph K. Bradley's avatar
      [SPARK-7207] [ML] [BUILD] Added ml.recommendation, ml.regression to SparkBuild · adbdb19a
      Joseph K. Bradley authored
      Added ml.recommendation, ml.regression to SparkBuild
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5758 from jkbradley/SPARK-7207 and squashes the following commits:
      
      a28158a [Joseph K. Bradley] Added ml.recommendation, ml.regression to SparkBuild
      adbdb19a
    • Hari Shreedharan's avatar
      [SPARK-5342] [YARN] Allow long running Spark apps to run on secure YARN/HDFS · 6c65da6b
      Hari Shreedharan authored
      Currently, Spark apps running on secure YARN/HDFS cannot write data to HDFS after 7 days, since delegation tokens cannot be renewed beyond that. This means Spark Streaming apps cannot run on secure YARN.
      
      This commit adds basic functionality to fix this issue. In this patch:
      - new parameters are added - principal and keytab, which can be used to login to a KDC
      - the client logs in, and then get tokens to start the AM
      - the keytab is copied to the staging directory
      - the AM waits until 60% of the tokens' lifetime has elapsed and then logs in using the keytab
      - thereafter, each time 60% of the renewal interval elapses, new tokens are created and sent to the executors
      
      Currently, to avoid complicating the architecture, we set the keytab and principal in the
      SparkHadoopUtil singleton, and schedule a login. Once the login is completed, a callback is scheduled.
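
      A hedged sketch of the scheduling idea in plain Scala (interval and hooks illustrative; the real code lives in SparkHadoopUtil and uses Hadoop's login APIs):

      ```scala
      import java.util.concurrent.{Executors, TimeUnit}

      val scheduler = Executors.newSingleThreadScheduledExecutor()
      val renewalIntervalMs = 24L * 60 * 60 * 1000  // illustrative token interval

      // Wake up after 60% of the interval, refresh credentials, then re-arm.
      def scheduleLogin(): Unit = {
        scheduler.schedule(new Runnable {
          def run(): Unit = {
            // Logging in from the keytab and distributing fresh tokens to the
            // executors would happen here.
            scheduleLogin()
          }
        }, (renewalIntervalMs * 0.6).toLong, TimeUnit.MILLISECONDS)
      }
      scheduleLogin()
      ```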
      
      This is being posted so I can gather feedback on the general implementation.
      
      There are currently a bunch of things to do:
      - [x] logging
      - [x] testing - I plan to manually test this soon. If you have ideas of how to add unit tests, comment.
      - [x] add code to ensure that if these params are set in non-YARN cluster mode, we complain
      - [x] documentation
      - [x] Have the executors request for credentials from the AM, so that retries are possible.
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #4688 from harishreedharan/kerberos-longrunning and squashes the following commits:
      
      36eb8a9 [Hari Shreedharan] Change the renewal interval config param. Fix a bunch of comments.
      611923a [Hari Shreedharan] Make sure the namenodes are listed correctly for creating tokens.
      09fe224 [Hari Shreedharan] Use token.renew to get token's renewal interval rather than using hdfs-site.xml
      6963bbc [Hari Shreedharan] Schedule renewal in AM before starting user class. Else, a restarted AM cannot access HDFS if the user class tries to.
      072659e [Hari Shreedharan] Fix build failure caused by thread factory getting moved to ThreadUtils.
      f041dd3 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
      42eead4 [Hari Shreedharan] Remove RPC part. Refactor and move methods around, use renewal interval rather than max lifetime to create new tokens.
      ebb36f5 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
      bc083e3 [Hari Shreedharan] Overload RegisteredExecutor to send tokens. Minor doc updates.
      7b19643 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
      8a4f268 [Hari Shreedharan] Added docs in the security guide. Changed some code to ensure that the renewer objects are created only if required.
      e800c8b [Hari Shreedharan] Restore original RegisteredExecutor message, and send new tokens via NewTokens message.
      0e9507e [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
      7f1bc58 [Hari Shreedharan] Minor fixes, cleanup.
      bcd11f9 [Hari Shreedharan] Refactor AM and Executor token update code into separate classes, also send tokens via akka on executor startup.
      f74303c [Hari Shreedharan] Move the new logic into specialized classes. Add cleanup for old credentials files.
      2f9975c [Hari Shreedharan] Ensure new tokens are written out immediately on AM restart. Also, pikc up the latest suffix from HDFS if the AM is restarted.
      61b2b27 [Hari Shreedharan] Account for AM restarts by making sure lastSuffix is read from the files on HDFS.
      62c45ce [Hari Shreedharan] Relogin from keytab periodically.
      fa233bd [Hari Shreedharan] Adding logging, fixing minor formatting and ordering issues.
      42813b4 [Hari Shreedharan] Remove utils.sh, which was re-added due to merge with master.
      0de27ee [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
      55522e3 [Hari Shreedharan] Fix failure caused by Preconditions ambiguity.
      9ef5f1b [Hari Shreedharan] Added explanation of how the credentials refresh works, some other minor fixes.
      f4fd711 [Hari Shreedharan] Fix SparkConf usage.
      2debcea [Hari Shreedharan] Change the file structure for credentials files. I will push a followup patch which adds a cleanup mechanism for old credentials files. The credentials files are small and few enough for it to cause issues on HDFS.
      af6d5f0 [Hari Shreedharan] Cleaning up files where changes weren't required.
      f0f54cb [Hari Shreedharan] Be more defensive when updating the credentials file.
      f6954da [Hari Shreedharan] Got rid of Akka communication to renew, instead the executors check a known file's modification time to read the credentials.
      5c11c3e [Hari Shreedharan] Move tests to YarnSparkHadoopUtil to fix compile issues.
      b4cb917 [Hari Shreedharan] Send keytab to AM via DistributedCache rather than directly via HDFS
      0985b4e [Hari Shreedharan] Write tokens to HDFS and read them back when required, rather than sending them over the wire.
      d79b2b9 [Hari Shreedharan] Make sure correct credentials are passed to FileSystem#addDelegationTokens()
      8c6928a [Hari Shreedharan] Fix issue caused by direct creation of Actor object.
      fb27f46 [Hari Shreedharan] Make sure principal and keytab are set before CoarseGrainedSchedulerBackend is started. Also schedule re-logins in CoarseGrainedSchedulerBackend#start()
      41efde0 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
      d282d7a [Hari Shreedharan] Fix ClientSuite to set YARN mode, so that the correct class is used in tests.
      bcfc374 [Hari Shreedharan] Fix Hadoop-1 build by adding no-op methods in SparkHadoopUtil, with impl in YarnSparkHadoopUtil.
      f8fe694 [Hari Shreedharan] Handle None if keytab-login is not scheduled.
      2b0d745 [Hari Shreedharan] [SPARK-5342][YARN] Allow long running Spark apps to run on secure YARN/HDFS.
      ccba5bc [Hari Shreedharan] WIP: More changes wrt kerberos
      77914dd [Hari Shreedharan] WIP: Add kerberos principal and keytab to YARN client.
      6c65da6b
    • Burak Yavuz's avatar
      [SPARK-7224] added mock repository generator for --packages tests · 7dacc08a
      Burak Yavuz authored
      This patch contains an `IvyTestUtils` file, which dynamically generates jars and pom files to test the `--packages` feature without having to rely on the internet or Maven Central.
      
      cc pwendell I know that there already existed utility functions to create jars and such, but they didn't really serve my purposes, as they appended random prefixes that were breaking things.
      
      I also added the local repository tests. Notice that they work without passing the `repo` to `resolveMavenCoordinates`.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #5790 from brkyvz/maven-utils and squashes the following commits:
      
      3ec79b7 [Burak Yavuz] addressed comments v0.2
      a39151b [Burak Yavuz] address comments v0.1
      172dfef [Burak Yavuz] use Ivy format
      7476d06 [Burak Yavuz] added mock repository generator
      7dacc08a
    • Vincenzo Selvaggio's avatar
      [SPARK-1406] Mllib pmml model export · 254e0509
      Vincenzo Selvaggio authored
      See the PDF attached to JIRA issue SPARK-1406.
      
      The contribution is my original work and I license the work to the project under the project's open source license.
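
      A hedged usage sketch of the PMMLExportable API this adds, as it surfaces on a supported model (paths and parameters illustrative):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.mllib.clustering.KMeans
      import org.apache.spark.mllib.linalg.Vectors

      val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("pmml"))
      val data = sc.parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0)))
      val model = KMeans.train(data, k = 2, maxIterations = 10)

      println(model.toPMML())               // PMML document as a String
      model.toPMML(sc, "/tmp/kmeans-pmml")  // or write via the SparkContext
      ```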
      
      Author: Vincenzo Selvaggio <vselvaggio@hotmail.it>
      Author: Xiangrui Meng <meng@databricks.com>
      Author: selvinsource <vselvaggio@hotmail.it>
      
      Closes #3062 from selvinsource/mllib_pmml_model_export_SPARK-1406 and squashes the following commits:
      
      852aac6 [Vincenzo Selvaggio] [SPARK-1406] Update JPMML version to 1.1.15 in LICENSE file
      085cf42 [Vincenzo Selvaggio] [SPARK-1406] Added Double Min and Max Fixed scala style
      30165c4 [Vincenzo Selvaggio] [SPARK-1406] Fixed extreme cases for logit
      7a5e0ec [Vincenzo Selvaggio] [SPARK-1406] Binary classification for SVM and Logistic Regression
      cfcb596 [Vincenzo Selvaggio] [SPARK-1406] Throw IllegalArgumentException when exporting a multinomial logistic regression
      25dce33 [Vincenzo Selvaggio] [SPARK-1406] Update code to latest pmml model
      dea98ca [Vincenzo Selvaggio] [SPARK-1406] Exclude transitive dependency for pmml model
      66b7c12 [Vincenzo Selvaggio] [SPARK-1406] Updated pmml model lib to 1.1.15, latest Java 6 compatible
      a0a55f7 [Vincenzo Selvaggio] Merge pull request #2 from mengxr/SPARK-1406
      3c22f79 [Xiangrui Meng] more code style
      e2313df [Vincenzo Selvaggio] Merge pull request #1 from mengxr/SPARK-1406
      472d757 [Xiangrui Meng] fix code style
      1676e15 [Vincenzo Selvaggio] fixed scala issue
      e2ffae8 [Vincenzo Selvaggio] fixed scala style
      b8823b0 [Vincenzo Selvaggio] Merge remote-tracking branch 'upstream/master' into mllib_pmml_model_export_SPARK-1406
      b25bbf7 [Vincenzo Selvaggio] [SPARK-1406] Added export of pmml to distributed file system using the spark context
      7a949d0 [Vincenzo Selvaggio] [SPARK-1406] Fixed scala style
      f46c75c [Vincenzo Selvaggio] [SPARK-1406] Added PMMLExportable to supported models
      7b33b4e [Vincenzo Selvaggio] [SPARK-1406] Added a PMMLExportable interface Restructured code in a new package mllib.pmml Supported models implements the new PMMLExportable interface: LogisticRegression, SVM, KMeansModel, LinearRegression, RidgeRegression, Lasso
      d559ec5 [Vincenzo Selvaggio] Merge remote-tracking branch 'upstream/master' into mllib_pmml_model_export_SPARK-1406
      8fe12bb [Vincenzo Selvaggio] [SPARK-1406] Adjusted logistic regression export description and target categories
      03bc3a5 [Vincenzo Selvaggio] added logistic regression
      da2ec11 [Vincenzo Selvaggio] [SPARK-1406] added linear SVM PMML export
      82f2131 [Vincenzo Selvaggio] Merge remote-tracking branch 'upstream/master' into mllib_pmml_model_export_SPARK-1406
      19adf29 [Vincenzo Selvaggio] [SPARK-1406] Fixed scala style
      1faf985 [Vincenzo Selvaggio] [SPARK-1406] Added target field to the regression model for completeness Adjusted unit test to deal with this change
      3ae8ae5 [Vincenzo Selvaggio] [SPARK-1406] Adjusted imported order according to the guidelines
      c67ce81 [Vincenzo Selvaggio] Merge remote-tracking branch 'upstream/master' into mllib_pmml_model_export_SPARK-1406
      78515ec [Vincenzo Selvaggio] [SPARK-1406] added pmml export for LinearRegressionModel, RidgeRegressionModel and LassoModel
      e29dfb9 [Vincenzo Selvaggio] removed version, by default is set to 4.2 (latest from jpmml) removed copyright
      ae8b993 [Vincenzo Selvaggio] updated some commented tests to use the new ModelExporter object reordered the imports
      df8a89e [Vincenzo Selvaggio] added pmml version to pmml model changed the copyright to spark
      a1b4dc3 [Vincenzo Selvaggio] updated imports
      834ca44 [Vincenzo Selvaggio] reordered the import accordingly to the guidelines
      349a76b [Vincenzo Selvaggio] new helper object to serialize the models to pmml format
      c3ef9b8 [Vincenzo Selvaggio] set it to private
      6357b98 [Vincenzo Selvaggio] set it to private
      e1eb251 [Vincenzo Selvaggio] removed serialization part, this will be part of the ModelExporter helper object
      aba5ee1 [Vincenzo Selvaggio] fixed cluster export
      cd6c07c [Vincenzo Selvaggio] fixed scala style to run tests
      f75b988 [Vincenzo Selvaggio] Merge remote-tracking branch 'origin/master' into mllib_pmml_model_export_SPARK-1406
      07a29bf [selvinsource] Update LICENSE
      8841439 [Vincenzo Selvaggio] adjust scala style in order to compile
      1433b11 [Vincenzo Selvaggio] complete suite tests
      8e71b8d [Vincenzo Selvaggio] kmeans pmml export implementation
      9bc494f [Vincenzo Selvaggio] added scala suite tests added saveLocalFile to ModelExport trait
      226e184 [Vincenzo Selvaggio] added javadoc and export model type in case there is a need to support other types of export (not just PMML)
      a0e3679 [Vincenzo Selvaggio] export and pmml export traits kmeans test implementation
      254e0509
    • Zhongshuai Pei's avatar
      [SPARK-7225][SQL] CombineLimits optimizer does not work · 44595144
      Zhongshuai Pei authored
      SQL
      ```
      select key from (select key from src limit 100) t2 limit 10
      ```
      Optimized Logical Plan before modifying
      ```
      == Optimized Logical Plan ==
      Limit 10
      Limit 100
      Project key#3
      MetastoreRelation default, src, None
      ```
      Optimized Logical Plan after modifying
      ```
      == Optimized Logical Plan ==
      Limit 10
       Project [key#1]
        MetastoreRelation default, src, None
      ```
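
      A hedged sketch of what the repaired CombineLimits rule needs to do once the limits are adjacent (simplified; the actual code in Optimizer.scala may differ):

      ```scala
      import org.apache.spark.sql.catalyst.expressions.{If, LessThan}
      import org.apache.spark.sql.catalyst.plans.logical.{Limit, LogicalPlan}
      import org.apache.spark.sql.catalyst.rules.Rule

      // Collapse directly nested Limits into one Limit with the smaller bound.
      object CombineLimits extends Rule[LogicalPlan] {
        def apply(plan: LogicalPlan): LogicalPlan = plan transform {
          case Limit(le, Limit(ne, grandChild)) =>
            Limit(If(LessThan(ne, le), ne, le), grandChild)
        }
      }
      ```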
      
      Author: Zhongshuai Pei <799203320@qq.com>
      Author: DoingDone9 <799203320@qq.com>
      
      Closes #5770 from DoingDone9/limitOptimizer and squashes the following commits:
      
      c68eaa7 [Zhongshuai Pei] Update CombiningLimitsSuite.scala
      97e18cf [Zhongshuai Pei] Update Optimizer.scala
      19ab875 [Zhongshuai Pei] Update CombiningLimitsSuite.scala
      7db4566 [Zhongshuai Pei] Update CombiningLimitsSuite.scala
      e2a491d [Zhongshuai Pei] Update Optimizer.scala
      f03fe7f [Zhongshuai Pei] Merge pull request #12 from apache/master
      f12fa50 [Zhongshuai Pei] Merge pull request #10 from apache/master
      f61210c [Zhongshuai Pei] Merge pull request #9 from apache/master
      34b1a9a [Zhongshuai Pei] Merge pull request #8 from apache/master
      802261c [DoingDone9] Merge pull request #7 from apache/master
      d00303b [DoingDone9] Merge pull request #6 from apache/master
      98b134f [DoingDone9] Merge pull request #5 from apache/master
      161cae3 [DoingDone9] Merge pull request #4 from apache/master
      c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
      cb1852d [DoingDone9] Merge pull request #2 from apache/master
      c3f046f [DoingDone9] Merge pull request #1 from apache/master
      44595144
  3. Apr 29, 2015
    • DB Tsai's avatar
      Some code clean up. · ba49eb16
      DB Tsai authored
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #5794 from dbtsai/clean and squashes the following commits:
      
      ad639dd [DB Tsai] Indentation
      834d527 [DB Tsai] Some code clean up.
      ba49eb16
    • Burak Yavuz's avatar
      [SPARK-7156][SQL] Addressed follow up comments for randomSplit · 5553198f
      Burak Yavuz authored
      small fixes regarding comments in PR #5761
      
      cc rxin
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #5795 from brkyvz/split-followup and squashes the following commits:
      
      369c522 [Burak Yavuz] changed wording a little
      1ea456f [Burak Yavuz] Addressed follow up comments
      5553198f
    • 云峤's avatar
      [SPARK-7234][SQL] Fix DateType mismatch when codegen on. · 7143f6e9
      云峤 authored
      Author: 云峤 <chensong.cs@alibaba-inc.com>
      
      Closes #5778 from kaka1992/fix_codegenon_datetype_mismatch and squashes the following commits:
      
      1ad4cff [云峤] SPARK-7234 fix dateType mismatch
      7143f6e9
    • zsxwing's avatar
      [SPARK-6862] [STREAMING] [WEBUI] Add BatchPage to display details of a batch · 1b7106b8
      zsxwing authored
      This is an initial commit for SPARK-6862. Once SPARK-6796 is merged, I will add the links to StreamingPage so that the user can jump to BatchPage.
      
      Screenshots:
      ![success](https://cloud.githubusercontent.com/assets/1000778/7102439/bbe75406-e0b3-11e4-84fe-3e6de629a49a.png)
      ![failure](https://cloud.githubusercontent.com/assets/1000778/7102440/bc124454-e0b3-11e4-921a-c8b39d6b61bc.png)
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5473 from zsxwing/SPARK-6862 and squashes the following commits:
      
      0727d35 [zsxwing] Change BatchUIData to a case class
      b380cfb [zsxwing] Add createJobStart to eliminate duplicate codes
      9a3083d [zsxwing] Rename XxxDatas -> XxxData
      087ba98 [zsxwing] Refactor BatchInfo to store only necessary fields
      cb62e4f [zsxwing] Use Seq[(OutputOpId, SparkJobId)] to store the id relations
      72f8e7e [zsxwing] Add unit tests for BatchPage
      1282b10 [zsxwing] Handle some corner cases and add tests for StreamingJobProgressListener
      77a69ae [zsxwing] Refactor codes as per TD's comments
      35ffd80 [zsxwing] Merge branch 'master' into SPARK-6862
      15bdf9b [zsxwing] Add batch links and unit tests
      4bf66b6 [zsxwing] Merge branch 'master' into SPARK-6862
      7168807 [zsxwing] Limit the max width of the error message and fix nits in the UI
      0b226f9 [zsxwing] Change 'Last Error' to 'Error'
      fc98a43 [zsxwing] Put clearing local properties to finally and remove redundant private[streaming]
      0c7b2eb [zsxwing] Add BatchPage to display details of a batch
      1b7106b8
    • Joseph K. Bradley's avatar
      [SPARK-7176] [ML] Add validation functionality to Param · 114bad60
      Joseph K. Bradley authored
      Main change: Added isValid field to Param.  Modified all usages to use isValid when relevant.  Added helper methods in ParamValidate.
      
      Also overrode Params.validate() in:
      * CrossValidator + model
      * Pipeline + model
      
      I made a few updates for the elastic net patch:
      * I changed "tol" to "convergenceTol"
      * I added some documentation
      
      This PR is Scala + Java only.  Python will be in a follow-up PR.
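
      A hedged, stripped-down sketch of the isValid idea in plain Scala (not the actual ml.param code; the helper mirrors the PR's ParamValidators):

      ```scala
      // A param carries a validation predicate that is checked on set.
      class DoubleParam(val name: String, val isValid: Double => Boolean) {
        def validate(value: Double): Unit =
          require(isValid(value), s"Parameter $name was given invalid value $value")
      }

      // Analogous to a ParamValidators helper (name assumed for illustration).
      def gtEq(lowerBound: Double): Double => Boolean = _ >= lowerBound

      val convergenceTol = new DoubleParam("convergenceTol", gtEq(0.0))
      convergenceTol.validate(1e-6)  // passes; validate(-1.0) would throw
      ```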
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5740 from jkbradley/enforce-validate and squashes the following commits:
      
      ad9c6c1 [Joseph K. Bradley] re-generated sharedParams after merging with current master
      76415e8 [Joseph K. Bradley] reverted convergenceTol to tol
      af62f4b [Joseph K. Bradley] Removed changes to SparkBuild, python linalg.  Fixed test failures.  Renamed ParamValidate to ParamValidators.  Removed explicit type from ParamValidators calls where possible.
      bb2665a [Joseph K. Bradley] merged with elastic net pr
      ecda302 [Joseph K. Bradley] fix rat tests, plus add a little doc
      6895dfc [Joseph K. Bradley] small cleanups
      069ac6d [Joseph K. Bradley] many cleanups
      928fb84 [Joseph K. Bradley] Maybe done
      a910ac7 [Joseph K. Bradley] still workin
      6d60e2e [Joseph K. Bradley] Still workin
      b987319 [Joseph K. Bradley] Partly done with adding checks, but blocking on adding checking functionality to Param
      dbc9fb2 [Joseph K. Bradley] merged with master.  enforcing Params.validate
      114bad60
    • wangfei's avatar
      [SQL] [Minor] Print detail query execution info when spark answer is not right · 1fdfdb47
      wangfei authored
      Print detailed query execution info, including the parsed/analyzed/optimized/physical plans, when the Spark answer is not right.
      
      ```
      Results do not match for query:
      == Parsed Logical Plan ==
      'Aggregate ['x.str], ['x.str,SUM('x.strCount) AS c1#46]
       'Join Inner, Some(('x.str = 'y.str))
        'UnresolvedRelation [df], Some(x)
        'UnresolvedRelation [df], Some(y)
      
      == Analyzed Logical Plan ==
      Aggregate [str#44], [str#44,SUM(strCount#45L) AS c1#46L]
       Join Inner, Some((str#44 = str#51))
        Subquery x
         Subquery df
          Aggregate [str#44], [str#44,COUNT(str#44) AS strCount#45L]
           Project [_1#41 AS int#43,_2#42 AS str#44]
            LocalRelation [_1#41,_2#42], [[1,1],[2,2],[3,3]]
        Subquery y
         Subquery df
          Aggregate [str#51], [str#51,COUNT(str#51) AS strCount#47L]
           Project [_1#41 AS int#50,_2#42 AS str#51]
            LocalRelation [_1#41,_2#42], [[1,1],[2,2],[3,3]]
      
      == Optimized Logical Plan ==
      Aggregate [str#44], [str#44,SUM(strCount#45L) AS c1#46L]
       Project [str#44,strCount#45L]
        Join Inner, Some((str#44 = str#51))
         Aggregate [str#44], [str#44,COUNT(str#44) AS strCount#45L]
          LocalRelation [str#44], [[1],[2],[3]]
         Aggregate [str#51], [str#51]
          LocalRelation [str#51], [[1],[2],[3]]
      
      == Physical Plan ==
      Aggregate false, [str#44], [str#44,CombineSum(PartialSum#53L) AS c1#46L]
       Aggregate true, [str#44], [str#44,SUM(strCount#45L) AS PartialSum#53L]
        Project [str#44,strCount#45L]
         BroadcastHashJoin [str#44], [str#51], BuildRight
          Aggregate false, [str#44], [str#44,Coalesce(SUM(PartialCount#55L),0) AS strCount#45L]
           Exchange (HashPartitioning [str#44], 5), []
            Aggregate true, [str#44], [str#44,COUNT(str#44) AS PartialCount#55L]
             LocalTableScan [str#44], [[1],[2],[3]]
          Aggregate false, [str#51], [str#51]
           Exchange (HashPartitioning [str#51], 5), []
            Aggregate true, [str#51], [str#51]
             LocalTableScan [str#51], [[1],[2],[3]]
      
      Code Generation: false
      == RDD ==
      == Results ==
      !== Correct Answer - 3 ==   == Spark Answer - 3 ==
       [1,1]                      [1,1]
      ![2,3]                      [2,1]
       [3,1]                      [3,1]
      ```
      
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #5774 from scwf/checkanswer and squashes the following commits:
      
      5be6f78 [wangfei] print detail query execution info when Spark Answer is not right
      1fdfdb47
    • Joseph K. Bradley's avatar
      [SPARK-7259] [ML] VectorIndexer: do not copy non-ML metadata to output column · b1ef6a60
      Joseph K. Bradley authored
      Changed VectorIndexer so it does not carry non-ML metadata from the input to the output column.  Removed ml.util.TestingUtils since VectorIndexer was the only use.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5789 from jkbradley/vector-indexer-metadata and squashes the following commits:
      
      b28e159 [Joseph K. Bradley] Changed VectorIndexer so it does not carry non-ML metadata from the input to the output column.  Removed ml.util.TestingUtils since VectorIndexer was the only use.
      b1ef6a60
    • Cheng Hao's avatar
      [SPARK-7229] [SQL] SpecificMutableRow should take integer type as internal representation for Date · f8cbb0a4
      Cheng Hao authored
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #5772 from chenghao-intel/specific_row and squashes the following commits:
      
      2cd064d [Cheng Hao] scala style issue
      60347a2 [Cheng Hao] SpecificMutableRow should take integer type as internal representation for DateType
      f8cbb0a4
    • yongtang's avatar
      [SPARK-7155] [CORE] Allow newAPIHadoopFile to support comma-separated list of files as input · 3fc6cfd0
      yongtang authored
      See JIRA: https://issues.apache.org/jira/browse/SPARK-7155
      
      SparkContext's newAPIHadoopFile() does not support a comma-separated list of files. For example, the following:
      ```scala
      sc.newAPIHadoopFile("/root/file1.txt,/root/file2.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
      ```
      will throw
      ```
      org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/file1.txt,/root/file2.txt
      ```
      However, the other API, hadoopFile(), is able to process a comma-separated list of files correctly. In addition, since sc.textFile() uses hadoopFile(), it is also able to process a comma-separated list of files correctly.
      
      That means the behaviors of hadoopFile() and newAPIHadoopFile() are not aligned.
      
      This pull request fixes this issue and allows newAPIHadoopFile() to accept a comma-separated list of files as input.
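
      A hedged sketch of the call once the fix is in (paths reused from the example above, so they must exist for the job to run):

      ```scala
      import org.apache.hadoop.io.{LongWritable, Text}
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
      import org.apache.spark.{SparkConf, SparkContext}

      val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("multi"))
      // With the fix, the new-API call accepts the same comma-separated list
      // that hadoopFile() and textFile() already did.
      val rdd = sc.newAPIHadoopFile(
        "/root/file1.txt,/root/file2.txt",
        classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
      println(rdd.count())
      ```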
      
      A unit test has also been added in SparkContextSuite.scala. It creates two temporary text files as input and tests against sc.textFile(), sc.hadoopFile(), and sc.newAPIHadoopFile().
      
      Note: The contribution is my original work and that I license the work to the project under the project's open source license.
      
      Author: yongtang <yongtang@users.noreply.github.com>
      
      Closes #5708 from yongtang/SPARK-7155 and squashes the following commits:
      
      654c80c [yongtang] [SPARK-7155] [CORE] Remove unneeded temp file deletion in unit test as parent dir is already temporary.
      26faa6a [yongtang] [SPARK-7155] [CORE] Support comma-separated list of files as input for newAPIHadoopFile, wholeTextFiles, and binaryFiles. Use setInputPaths for consistency.
      73e1f16 [yongtang] [SPARK-7155] [CORE] Allow newAPIHadoopFile to support comma-separated list of files as input.
      3fc6cfd0
    • Qiping Li's avatar
      [SPARK-7181] [CORE] fix infinite loop in ExternalSorter's mergeWithAggregation · 7f4b5837
      Qiping Li authored
      see [SPARK-7181](https://issues.apache.org/jira/browse/SPARK-7181).
      
      Author: Qiping Li <liqiping1991@gmail.com>
      
      Closes #5737 from chouqin/externalsorter and squashes the following commits:
      
      2924b93 [Qiping Li] fix inifite loop in Externalsorter's mergeWithAggregation
      7f4b5837
    • Burak Yavuz's avatar
      [SPARK-7156][SQL] support RandomSplit in DataFrames · d7dbce8f
      Burak Yavuz authored
      This is built on top of kaka1992's PR #5711, using logical plans.
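
      A hedged usage sketch of the new API (local-mode setup for illustration):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.SQLContext

      val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("split"))
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      val df = sc.parallelize(1 to 100).toDF("id")
      // Weights are normalized; the seed makes the split reproducible.
      val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42L)
      println(s"${train.count()} train rows, ${test.count()} test rows")
      ```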
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #5761 from brkyvz/random-sample and squashes the following commits:
      
      a1fb0aa [Burak Yavuz] remove unrelated file
      69669c3 [Burak Yavuz] fix broken test
      1ddb3da [Burak Yavuz] copy base
      6000328 [Burak Yavuz] added python api and fixed test
      3c11d1b [Burak Yavuz] fixed broken test
      f400ade [Burak Yavuz] fix build errors
      2384266 [Burak Yavuz] addressed comments v0.1
      e98ebac [Burak Yavuz] [SPARK-7156][SQL] support RandomSplit in DataFrames
      d7dbce8f
    • Xusen Yin's avatar
      [SPARK-6529] [ML] Add Word2Vec transformer · c9d530e2
      Xusen Yin authored
      See JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-6529).
      
      There are some notes:
      
      1. I added `learningRate` to sharedParams since it is a common parameter for ML algorithms.
      2. We will not support finding synonyms from a `Vector` yet; that will be supported in further JIRA issues.
      3. Word2Vec differs from other ML models in that its training set and transformed set are different. Its training set is an `RDD[Iterable[String]]` representing documents, but the set we want to transform is an `RDD[String]` of unique words, so you have to switch your `inputCol` between these two stages (see the sketch after these notes).
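
      A hedged usage sketch of the new transformer (column names and parameter values illustrative):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.ml.feature.Word2Vec
      import org.apache.spark.sql.SQLContext

      val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("w2v"))
      val sqlContext = new SQLContext(sc)

      // Each row is a document, i.e. a sequence of words.
      val documentDF = sqlContext.createDataFrame(Seq(
        "Hi I heard about Spark".split(" "),
        "I wish Java could use case classes".split(" ")
      ).map(Tuple1.apply)).toDF("text")

      val word2Vec = new Word2Vec()
        .setInputCol("text").setOutputCol("result")
        .setVectorSize(3).setMinCount(0)
      val model = word2Vec.fit(documentDF)
      model.transform(documentDF).show()
      ```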
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #5596 from yinxusen/SPARK-6529 and squashes the following commits:
      
      ee2b37a [Xusen Yin] merge with former HEAD
      4945462 [Xusen Yin] merge with #5626
      3bc2cbd [Xusen Yin] change foldLeft to for loop and use blas
      5dd4ee7 [Xusen Yin] fix scala style
      743e0d5 [Xusen Yin] fix comments and code style
      04c48e9 [Xusen Yin] ensure the functionality
      a190f2c [Xusen Yin] fix code style and refine the transform function of word2vec
      02848fa [Xusen Yin] refine comments
      34a55c0 [Xusen Yin] fix errors
      109d124 [Xusen Yin] add test suite and pass it
      04dde06 [Xusen Yin] add shared params
      c594095 [Xusen Yin] add word2vec transformer
      23d77fa [Xusen Yin] merge with #5626
      e8cfaf7 [Xusen Yin] fix conflict with master
      66e7bd3 [Xusen Yin] change foldLeft to for loop and use blas
      566ec20 [Xusen Yin] fix scala style
      b54399f [Xusen Yin] fix comments and code style
      1211e86 [Xusen Yin] ensure the functionality
      6b97ec8 [Xusen Yin] fix code style and refine the transform function of word2vec
      7cde18f [Xusen Yin] rm sharedParams
      618abd0 [Xusen Yin] refine comments
      e29680a [Xusen Yin] fix errors
      fe3afe9 [Xusen Yin] add test suite and pass it
      02767fb [Xusen Yin] add shared params
      6a514f1 [Xusen Yin] add word2vec transformer
      c9d530e2
    • DB Tsai's avatar
      [SPARK-7222] [ML] Added mathematical derivation in comment and compressed the... · 15995c88
      DB Tsai authored
      [SPARK-7222] [ML] Added mathematical derivation in comment and compressed the model, removed the correction terms in LinearRegression with ElasticNet
      
      Added a detailed mathematical derivation of how scaling and LeastSquaresAggregator work. Refactored the code so the model is compressed based on storage. We may try compression based on prediction time.
      
      Also, I found that diffSum will always be zero mathematically, so no corrections are required.
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #5767 from dbtsai/lir-doc and squashes the following commits:
      
      5e346c9 [DB Tsai] refactoring
      fc9f582 [DB Tsai] doc
      58456d8 [DB Tsai] address feedback
      69757b8 [DB Tsai] actually diffSum is mathematically zero! No correction is needed.
      5929e49 [DB Tsai] typo
      63f7d1e [DB Tsai] Added compression to the model based on storage
      203a295 [DB Tsai] Add more documentation to LinearRegression in new ML framework.
      15995c88
    • Josh Rosen's avatar
      [SPARK-6629] cancelJobGroup() may not work for jobs whose job groups are... · 3a180c19
      Josh Rosen authored
      [SPARK-6629] cancelJobGroup() may not work for jobs whose job groups are inherited from parent threads
      
      When a job is submitted with a job group and that job group is inherited from a parent thread, there are multiple bugs that may prevent this job from being cancelable via `SparkContext.cancelJobGroup()`:
      
      - When filtering jobs based on their job group properties, DAGScheduler calls `get()` instead of `getProperty()`, which does not respect inheritance, so it will skip over jobs whose job group properties were inherited.
      - `Properties` objects are mutable, but we do not make defensive copies / snapshots, so modifications of the parent thread's job group will cause running jobs' groups to change; this also breaks cancellation.
      
      Both of these issues are easy to fix: use `getProperty()` and perform defensive copying.
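
      A hedged sketch of both fixes in plain Scala (helper name illustrative):

      ```scala
      import java.util.Properties

      // Snapshot the Properties so later parent-thread mutations cannot affect
      // a running job; read with getProperty, which (unlike get) consults the
      // inherited defaults.
      def cloneProperties(props: Properties): Properties = {
        val snapshot = new Properties()
        val names = props.stringPropertyNames().iterator()  // includes defaults
        while (names.hasNext) {
          val key = names.next()
          snapshot.setProperty(key, props.getProperty(key))
        }
        snapshot
      }
      ```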
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5288 from JoshRosen/localProperties-mutability-race and squashes the following commits:
      
      9e29654 [Josh Rosen] Fix style issue
      5d90750 [Josh Rosen] Merge remote-tracking branch 'origin/master' into localProperties-mutability-race
      3f7b9e8 [Josh Rosen] Add JIRA reference; move clone into DAGScheduler
      707e417 [Josh Rosen] Clone local properties to prevent mutations from breaking job cancellation.
      b376114 [Josh Rosen] Fix bug that prevented jobs with inherited job group properties from being cancelled.
      3a180c19
    • Tathagata Das's avatar
      [SPARK-6752] [STREAMING] [REOPENED] Allow StreamingContext to be recreated... · a9c4e299
      Tathagata Das authored
      [SPARK-6752] [STREAMING] [REOPENED] Allow StreamingContext to be recreated from checkpoint and existing SparkContext
      
      Original PR #5428 got reverted due to issues between MutableBoolean and Hadoop 1.0.4 (see JIRA). This replaces MutableBoolean with AtomicBoolean.
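
      A hedged usage sketch of the recreated-from-checkpoint flow (checkpoint path and batch interval illustrative):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("cp"))
      val checkpointDir = "/tmp/streaming-checkpoint"

      // Recover from the checkpoint if it exists; otherwise build a fresh
      // StreamingContext on top of the existing SparkContext.
      val ssc = StreamingContext.getOrCreate(checkpointDir, () => {
        val newSsc = new StreamingContext(sc, Seconds(1))
        newSsc.checkpoint(checkpointDir)
        newSsc
      })
      ```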
      
      srowen pwendell
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #5773 from tdas/SPARK-6752 and squashes the following commits:
      
      a0c0ead [Tathagata Das] Fix for hadoop 1.0.4
      70ae85b [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-6752
      94db63c [Tathagata Das] Fix long line.
      524f519 [Tathagata Das] Many changes based on PR comments.
      eabd092 [Tathagata Das] Added Function0, Java API and unit tests for StreamingContext.getOrCreate
      36a7823 [Tathagata Das] Minor changes.
      204814e [Tathagata Das] Added StreamingContext.getOrCreate with existing SparkContext
      a9c4e299