  1. Jun 21, 2014
  2. Jun 20, 2014
    • Marcelo Vanzin's avatar
      Fix some tests. · 648553d4
      Marcelo Vanzin authored
      - JavaAPISuite was trying to compare a bare path with a URI. Fix by
        extracting the path from the URI, since we know it should be a
    local path anyway.
      
      - b9be1609 excluded the ASM dependency everywhere, but easymock needs
        it (because cglib needs it). So re-add the dependency, with test
        scope this time.
      
      The second one above actually uncovered a weird situation: the maven
      test target works, even though I can't find the class sbt complains
      about in its classpath. sbt complains with:
      
        [error] Uncaught exception when running org.apache.spark.util
        .random.RandomSamplerSuite: java.lang.NoClassDefFoundError:
        org/objectweb/asm/Type
      
      To avoid more weirdness caused by that, I explicitly added the asm
      dependency to both maven and sbt (for tests only), and verified
      the classes don't end up in the final assembly.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #917 from vanzin/flaky-tests and squashes the following commits:
      
      d022320 [Marcelo Vanzin] Fix some tests.
      648553d4
    • Anant's avatar
      [SPARK-2061] Made splits deprecated in JavaRDDLike · 010c460d
      Anant authored
      The jira for the issue can be found at: https://issues.apache.org/jira/browse/SPARK-2061
Most of Spark has moved over to consistently using `partitions` instead of `splits`. We should do likewise and add a `partitions` method to JavaRDDLike and have `splits` just call that. We should also go through all cases where other APIs (e.g. Python) call `splits` and change those to use the newer API.
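
As a rough illustration (not the exact JavaRDDLike code), the deprecated method can simply forward to the new one; the trait name and deprecation version below are placeholders:
```scala
import java.util.{List => JList}
import org.apache.spark.Partition

// Hypothetical simplified trait; the real change lives in JavaRDDLike.
trait JavaRDDLikeSketch {
  /** Preferred API: the set of partitions in this RDD. */
  def partitions: JList[Partition]

  /** Kept for compatibility; simply delegates to partitions(). */
  @deprecated("Use partitions() instead.", "version placeholder")
  def splits: JList[Partition] = partitions
}
```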
      
      Author: Anant <anant.asty@gmail.com>
      
      Closes #1062 from anantasty/SPARK-2061 and squashes the following commits:
      
      b83ce6b [Anant] Fixed syntax issue
      21f9210 [Anant] Fixed version number in deprecation string
      9315b76 [Anant] made related changes to use partitions in python api
      8c62dd1 [Anant] Made splits deprecated in JavaRDDLike
      010c460d
    • Patrick Wendell's avatar
      a6786424
    • Doris Xin's avatar
      [SPARK-1970] Update unit test in XORShiftRandomSuite to use ChiSquareTest from commons-math3 · e99903b8
      Doris Xin authored
      Updating the chisquare unit test in XORShiftRandomSuite to use the ChiSquareTest in commons-math3 instead of hardcoding the chisquare statistic for the desired confidence interval.
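
For illustration, a minimal sketch of the commons-math3 API in question (the bin counts below are made up, not the suite's actual data):
```scala
import org.apache.commons.math3.stat.inference.ChiSquareTest

// Expected counts per bin under a uniform distribution, and hypothetical observed counts.
val expected = Array.fill(10)(1000.0)
val observed = Array[Long](998, 1012, 990, 1004, 989, 1010, 1001, 995, 1003, 998)

// chiSquareTest returns true iff the null hypothesis (uniformity) is rejected at the given alpha.
val rejected = new ChiSquareTest().chiSquareTest(expected, observed, 0.05)
assert(!rejected, "XORShiftRandom output does not look uniformly distributed")
```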
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1073 from dorx/math3Unit and squashes the following commits:
      
      da0e891 [Doris Xin] remove math3 from common pom
      9954143 [Doris Xin] merge master
      c19948f [Doris Xin] Merge branch 'master' into math3Unit
      8f84f19 [Doris Xin] [SPARK-1970] unit test in XORShiftRandomSuite
      ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
      1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
      e99903b8
    • Andrew Ash's avatar
      SPARK-1902 Silence stacktrace from logs when doing port failover to port n+1 · 08d0aca7
      Andrew Ash authored
      Before:
      
      ```
      14/06/08 23:58:23 WARN AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use
      java.net.BindException: Address already in use
      	at sun.nio.ch.Net.bind0(Native Method)
      	at sun.nio.ch.Net.bind(Net.java:444)
      	at sun.nio.ch.Net.bind(Net.java:436)
      	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
      	at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
      	at org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
      	at org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
      	at org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
      	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
      	at org.eclipse.jetty.server.Server.doStart(Server.java:293)
      	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
      	at org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:192)
      	at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
      	at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
      	at scala.util.Try$.apply(Try.scala:161)
      	at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:191)
      	at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:205)
      	at org.apache.spark.ui.WebUI.bind(WebUI.scala:99)
      	at org.apache.spark.SparkContext.<init>(SparkContext.scala:223)
      	at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:957)
      	at $line3.$read$$iwC$$iwC.<init>(<console>:8)
      	at $line3.$read$$iwC.<init>(<console>:14)
      	at $line3.$read.<init>(<console>:16)
      	at $line3.$read$.<init>(<console>:20)
      	at $line3.$read$.<clinit>(<console>)
      	at $line3.$eval$.<init>(<console>:7)
      	at $line3.$eval$.<clinit>(<console>)
      	at $line3.$eval.$print(<console>)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
      	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
      	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
      	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
      	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
      	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
      	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
      	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
      	at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:121)
      	at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:120)
      	at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:263)
      	at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:120)
      	at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:56)
      	at org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:913)
      	at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:142)
      	at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:56)
      	at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:104)
      	at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:56)
      	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:930)
      	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
      	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
      	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
      	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
      	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
      	at org.apache.spark.repl.Main$.main(Main.scala:31)
      	at org.apache.spark.repl.Main.main(Main.scala)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
      	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
      	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
      14/06/08 23:58:23 WARN AbstractLifeCycle: FAILED org.eclipse.jetty.server.Server@7439e55a: java.net.BindException: Address already in use
      java.net.BindException: Address already in use
      	at sun.nio.ch.Net.bind0(Native Method)
      	at sun.nio.ch.Net.bind(Net.java:444)
      	at sun.nio.ch.Net.bind(Net.java:436)
      	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
      	at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
      	at org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
      	at org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
      	at org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
      	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
      	at org.eclipse.jetty.server.Server.doStart(Server.java:293)
      	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
      	at org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:192)
      	at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
      	at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
      	at scala.util.Try$.apply(Try.scala:161)
      	at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:191)
      	at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:205)
      	at org.apache.spark.ui.WebUI.bind(WebUI.scala:99)
      	at org.apache.spark.SparkContext.<init>(SparkContext.scala:223)
      	at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:957)
      	at $line3.$read$$iwC$$iwC.<init>(<console>:8)
      	at $line3.$read$$iwC.<init>(<console>:14)
      	at $line3.$read.<init>(<console>:16)
      	at $line3.$read$.<init>(<console>:20)
      	at $line3.$read$.<clinit>(<console>)
      	at $line3.$eval$.<init>(<console>:7)
      	at $line3.$eval$.<clinit>(<console>)
      	at $line3.$eval.$print(<console>)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
      	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
      	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
      	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
      	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
      	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
      	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
      	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
      	at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:121)
      	at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:120)
      	at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:263)
      	at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:120)
      	at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:56)
      	at org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:913)
      	at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:142)
      	at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:56)
      	at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:104)
      	at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:56)
      	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:930)
      	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
      	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
      	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
      	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
      	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
      	at org.apache.spark.repl.Main$.main(Main.scala:31)
      	at org.apache.spark.repl.Main.main(Main.scala)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
      	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
      	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
      14/06/08 23:58:23 INFO JettyUtils: Failed to create UI at port, 4040. Trying again.
      14/06/08 23:58:23 INFO JettyUtils: Error was: Failure(java.net.BindException: Address already in use)
      14/06/08 23:58:23 INFO SparkUI: Started SparkUI at http://aash-mbp.local:4041
```
      
      After:
      ```
      14/06/09 00:04:12 INFO JettyUtils: Failed to create UI at port, 4040. Trying again.
      14/06/09 00:04:12 INFO JettyUtils: Error was: Failure(java.net.BindException: Address already in use)
      14/06/09 00:04:12 INFO Server: jetty-8.y.z-SNAPSHOT
      14/06/09 00:04:12 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4041
      14/06/09 00:04:12 INFO SparkUI: Started SparkUI at http://aash-mbp.local:4041
      ```
      
      Lengthy logging comes from this line of code in Jetty: http://grepcode.com/file/repo1.maven.org/maven2/org.eclipse.jetty.aggregate/jetty-all/9.1.3.v20140225/org/eclipse/jetty/util/component/AbstractLifeCycle.java#210
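
For context, a minimal sketch of the port-failover idea: retry on the next port and emit a single info line rather than the full BindException stacktrace. This is illustrative, not Spark's actual JettyUtils code, and the helper name is hypothetical:
```scala
import java.net.BindException

// Try `bind` on successive ports, logging one line per failure.
// If the retries are exhausted, the BindException simply propagates.
def startOnAvailablePort(startPort: Int, maxRetries: Int)(bind: Int => Unit): Int = {
  var port = startPort
  var bound = false
  while (!bound) {
    try { bind(port); bound = true }
    catch {
      case e: BindException if port - startPort < maxRetries =>
        // One concise line instead of a full stacktrace.
        println(s"Failed to create UI at port $port. Trying again. Error was: ${e.getMessage}")
        port += 1
    }
  }
  port
}
```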
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #1019 from ash211/SPARK-1902 and squashes the following commits:
      
      0dd02f7 [Andrew Ash] Leave old org.eclipse.jetty silencing in place
      1e2866b [Andrew Ash] Address CR comments
      9d85eed [Andrew Ash] SPARK-1902 Silence stacktrace from logs when doing port failover to port n+1
      08d0aca7
    • Aaron Davidson's avatar
      [SQL] Use hive.SessionState, not the thread local SessionState · 20447849
      Aaron Davidson authored
Note that this is simply mimicking lookupRelation(). I do not have a concrete notion of why this solution is necessarily more correct than SessionState.get, but SessionState.get is returning null, which is bad.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #1148 from aarondav/createtable and squashes the following commits:
      
      37c3e7c [Aaron Davidson] [SQL] Use hive.SessionState, not the thread local SessionState
      20447849
    • Reynold Xin's avatar
      Move ScriptTransformation into the appropriate place. · d4c7572d
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1162 from rxin/script and squashes the following commits:
      
      2c836b9 [Reynold Xin] Move ScriptTransformation into the appropriate place.
      d4c7572d
    • Andrew Or's avatar
      Clean up CacheManager et al. · 01125a11
      Andrew Or authored
      **UPDATE**
      
      I have removed the special handling for `StorageLevel.MEMORY_*_SER` for now, because it introduces a potential performance regression. With the latest changes, this PR should include mainly style (code readability) fixes. The only functionality change is the update in `MemoryStore#putBytes` to actually return updated blocks, though this is a minor bug fix.
      
      Now this is mainly a precursor to another PR (once again).
      
      ---------
      *Old comment*
      
      The deserialized version of a partition may occupy much more space than the serialized version. Therefore, if a partition is to be cached with `StorageLevel.MEMORY_*_SER`, we don't need to fully unroll it into an `ArrayBuffer`, but instead we can unroll it into a potentially much smaller `ByteBuffer`. This may save us from OOMs in this case.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1083 from andrewor14/unroll-them-partitions and squashes the following commits:
      
      7048aa0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into unroll-them-partitions
      3d9a366 [Andrew Or] Minor change for readability
      d12b95f [Andrew Or] Remove unused imports (minor)
      a4c387b [Andrew Or] Merge branch 'master' of github.com:apache/spark into unroll-them-partitions
      cf5f565 [Andrew Or] Remove special handling for MEM_*_SER
      0091ec0 [Andrew Or] Address review feedback
      44ef282 [Andrew Or] Actually return updated blocks in putBytes
      2941c89 [Andrew Or] Clean up BlockStore (minor)
      a8f181d [Andrew Or] Add special handling for StorageLevel.MEMORY_*_SER
      01125a11
    • Reynold Xin's avatar
      [SPARK-2225] Turn HAVING without GROUP BY into WHERE. · 0ac71d12
      Reynold Xin authored
      @willb
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1161 from rxin/having-filter and squashes the following commits:
      
      fa8359a [Reynold Xin] [SPARK-2225] Turn HAVING without GROUP BY into WHERE.
      0ac71d12
    • William Benton's avatar
      SPARK-2180: support HAVING clauses in Hive queries · 171ebb3a
      William Benton authored
      This PR extends Spark's HiveQL support to handle HAVING clauses in aggregations.  The HAVING test from the Hive compatibility suite doesn't appear to be runnable from within Spark, so I added a simple comparable test to `HiveQuerySuite`.
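
For example, a query along these lines should now be handled (a hedged sketch, assuming a HiveContext's `hql` is in scope, as in the other examples in this log; the table and threshold are illustrative):
```scala
// Aggregation with a HAVING clause against the standard Hive test table `src`.
hql("SELECT key, count(*) AS cnt FROM src GROUP BY key HAVING count(*) > 5")
  .collect()
  .foreach(println)
```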
      
      Author: William Benton <willb@redhat.com>
      
      Closes #1136 from willb/SPARK-2180 and squashes the following commits:
      
      3bbaf26 [William Benton] Added casts to HAVING expressions
      83f1340 [William Benton] scalastyle fixes
      18387f1 [William Benton] Add test for HAVING without GROUP BY
      b880bef [William Benton] Added semantic error for HAVING without GROUP BY
      942428e [William Benton] Added test coverage for SPARK-2180.
      56084cc [William Benton] Add support for HAVING clauses in Hive queries.
      171ebb3a
    • Allan Douglas R. de Oliveira's avatar
      SPARK-1868: Users should be allowed to cogroup at least 4 RDDs · 6a224c31
      Allan Douglas R. de Oliveira authored
      Adds cogroup for 4 RDDs.
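
A small usage sketch of the 4-RDD variant (the RDD contents here are made up):
```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("local", "cogroup4-example")
val a = sc.parallelize(Seq(1 -> "a1", 2 -> "a2"))
val b = sc.parallelize(Seq(1 -> "b1"))
val c = sc.parallelize(Seq(2 -> "c1"))
val d = sc.parallelize(Seq(1 -> "d1", 1 -> "d2"))

// Each result value is a tuple of four Iterables, one per input RDD.
a.cogroup(b, c, d).collect().foreach(println)
```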
      
      Author: Allan Douglas R. de Oliveira <allandouglas@gmail.com>
      
      Closes #813 from douglaz/more_cogroups and squashes the following commits:
      
      f8d6273 [Allan Douglas R. de Oliveira] Test python groupWith for one more case
      0e9009c [Allan Douglas R. de Oliveira] Added scala tests
      c3ffcdd [Allan Douglas R. de Oliveira] Added java tests
      517a67f [Allan Douglas R. de Oliveira] Added tests for python groupWith
      2f402d5 [Allan Douglas R. de Oliveira] Removed TODO
      17474f4 [Allan Douglas R. de Oliveira] Use new cogroup function
      7877a2a [Allan Douglas R. de Oliveira] Fixed code
      ba02414 [Allan Douglas R. de Oliveira] Added varargs cogroup to pyspark
      c4a8a51 [Allan Douglas R. de Oliveira] Added java cogroup 4
      e94963c [Allan Douglas R. de Oliveira] Fixed spacing
      f1ee57b [Allan Douglas R. de Oliveira] Fixed scala style issues
      d7196f1 [Allan Douglas R. de Oliveira] Allow the cogroup of 4 RDDs
      6a224c31
    • Gang Bai's avatar
      [SPARK-2163] class LBFGS optimize with Double tolerance instead of Int · d484ddef
      Gang Bai authored
      https://issues.apache.org/jira/browse/SPARK-2163
      
      This pull request includes the change for **[SPARK-2163]**:
      
      * Changed the convergence tolerance parameter from type `Int` to type `Double`.
      * Added types for vars in `class LBFGS`, making the style consistent with `class GradientDescent`.
      * Added associated test to check that optimizing via `class LBFGS` produces the same results as via calling `runLBFGS` from `object LBFGS`.
      
      This is a very minor change but it will solve the problem in my implementation of a regression model for count data, where I make use of LBFGS for parameter estimation.
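
As a usage sketch (the gradient/updater choices below are arbitrary), the tolerance can now be set as a `Double`:
```scala
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}

val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
  .setConvergenceTol(1e-4)  // previously an Int, now a Double
  .setNumCorrections(10)
```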
      
      Author: Gang Bai <me@baigang.net>
      
      Closes #1104 from BaiGang/fix_int_tol and squashes the following commits:
      
      cecf02c [Gang Bai] Changed setConvergenceTol'' to specify tolerance with a parameter of type Double. For the reason and the problem caused by an Int parameter, please check https://issues.apache.org/jira/browse/SPARK-2163. Added a test in LBFGSSuite for validating that optimizing via class LBFGS produces the same results as calling runLBFGS from object LBFGS. Keep the indentations and styles correct.
      d484ddef
    • Reynold Xin's avatar
      [SPARK-2218] rename Equals to EqualTo in Spark SQL expressions. · 2f6a835e
      Reynold Xin authored
      Due to the existence of scala.Equals, it is very error prone to name the expression Equals, especially because we use a lot of partial functions and pattern matching in the optimizer.
      
      Note that this sits on top of #1144.
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1146 from rxin/equals and squashes the following commits:
      
      f8583fd [Reynold Xin] Merge branch 'master' of github.com:apache/spark into equals
      326b388 [Reynold Xin] Merge branch 'master' of github.com:apache/spark into equals
      bd19807 [Reynold Xin] Rename EqualsTo to EqualTo.
      81148d1 [Reynold Xin] [SPARK-2218] rename Equals to EqualsTo in Spark SQL expressions.
      c4e543d [Reynold Xin] [SPARK-2210] boolean cast on boolean value should be removed.
      2f6a835e
    • Takuya UESHIN's avatar
      [SPARK-2196] [SQL] Fix nullability of CaseWhen. · 32495289
      Takuya UESHIN authored
      `CaseWhen` should use `branches.length` to check if `elseValue` is provided or not.
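
To make the rule concrete, here is a hedged stand-alone sketch of the nullability logic described above (using a placeholder `Expr` trait, not the actual Catalyst `Expression`):
```scala
trait Expr { def nullable: Boolean }

// branches = Seq(cond1, value1, cond2, value2, ..., [elseValue])
def caseWhenNullable(branches: Seq[Expr]): Boolean = {
  val hasElse = branches.length % 2 == 1  // odd length => an elseValue is present
  val values = branches.zipWithIndex.collect { case (e, i) if i % 2 == 1 => e }
  val elseNullable = if (hasElse) branches.last.nullable else true  // a missing else yields NULL
  values.exists(_.nullable) || elseNullable
}
```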
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #1133 from ueshin/issues/SPARK-2196 and squashes the following commits:
      
      510f12d [Takuya UESHIN] Add some tests.
      dc25e8d [Takuya UESHIN] Fix nullable of CaseWhen to be nullable if the elseValue is nullable.
      4f049cc [Takuya UESHIN] Fix nullability of CaseWhen.
      32495289
    • Aaron Davidson's avatar
      SPARK-2203: PySpark defaults to use same num reduce partitions as map side · f46e02fc
      Aaron Davidson authored
For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(), PySpark will always assume that the default parallelism to use for the reduce side is ctx.defaultParallelism, which is a constant typically determined by the number of cores in the cluster.
      
      In contrast, Spark's Partitioner#defaultPartitioner will use the same number of reduce partitions as map partitions unless the defaultParallelism config is explicitly set. This tends to be a better default in order to avoid OOMs, and should also be the behavior of PySpark.
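
For reference, a simplified sketch of that Scala-side behavior (modeled on `Partitioner.defaultPartitioner`, not a verbatim copy; the function name is hypothetical):
```scala
import org.apache.spark.{HashPartitioner, Partitioner}
import org.apache.spark.rdd.RDD

def defaultPartitionerSketch(rdd: RDD[_], others: RDD[_]*): Partitioner = {
  val rdds = rdd +: others
  // Reuse an existing partitioner from the largest input if one is defined.
  val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))
  if (hasPartitioner.nonEmpty) {
    hasPartitioner.maxBy(_.partitions.length).partitioner.get
  } else if (rdd.context.getConf.contains("spark.default.parallelism")) {
    new HashPartitioner(rdd.context.defaultParallelism)
  } else {
    // Otherwise match the map side: use the max number of upstream partitions.
    new HashPartitioner(rdds.map(_.partitions.length).max)
  }
}
```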
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-2203
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #1138 from aarondav/pyfix and squashes the following commits:
      
      1bd5751 [Aaron Davidson] SPARK-2203: PySpark defaults to use same num reduce partitions as map partitions
      f46e02fc
    • Reynold Xin's avatar
      [SPARK-2209][SQL] Cast shouldn't do null check twice. · c55bbb49
      Reynold Xin authored
      Also took the chance to clean up cast a little bit. Too many arrows on each line before!
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1143 from rxin/cast and squashes the following commits:
      
      dd006cb [Reynold Xin] Code review feedback.
      c2b88ae [Reynold Xin] [SPARK-2209][SQL] Cast shouldn't do null check twice.
      c55bbb49
    • Reynold Xin's avatar
      [SPARK-2210] cast to boolean on boolean value gets turned into NOT((boolean_condition) = 0) · 61756409
      Reynold Xin authored
      ```
      explain select cast(cast(key=0 as boolean) as boolean) aaa from src
      ```
      should be
      ```
      [Physical execution plan:]
      [Project [(key#10:0 = 0) AS aaa#7]]
      [ HiveTableScan [key#10], (MetastoreRelation default, src, None), None]
      ```
      
      However, it is currently
      ```
      [Physical execution plan:]
      [Project [NOT((key#10=0) = 0) AS aaa#7]]
      [ HiveTableScan [key#10], (MetastoreRelation default, src, None), None]
      ```
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1144 from rxin/booleancast and squashes the following commits:
      
      c4e543d [Reynold Xin] [SPARK-2210] boolean cast on boolean value should be removed.
      61756409
    • Andre Schumacher's avatar
      SPARK-1293 [SQL] Parquet support for nested types · f479cf37
      Andre Schumacher authored
      It should be possible to import and export data stored in Parquet's columnar format that contains nested types. For example:
      ```java
      message AddressBook {
         required binary owner;
         optional group ownerPhoneNumbers {
            repeated binary array;
         }
         optional group contacts {
            repeated group array {
               required binary name;
               optional binary phoneNumber;
            }
         }
         optional group nameToApartmentNumber {
            repeated group map {
               required binary key;
               required int32 value;
            }
         }
      }
      ```
The example could model a type (AddressBook) that contains records made of strings (owner), lists (ownerPhoneNumbers) and a table of contacts (e.g., a list of pairs or a map that can contain null values but keys must not be null). The list of tasks is as follows:
      
      <h6>Implement support for converting nested Parquet types to Spark/Catalyst types:</h6>
      - [x] Structs
      - [x] Lists
      - [x] Maps (note: currently keys need to be Strings)
      
      <h6>Implement import (via ``parquetFile``) of nested Parquet types (first version in this PR)</h6>
      - [x] Initial version
      
      <h6>Implement export (via ``saveAsParquetFile``)</h6>
      - [x] Initial version
      
      <h6>Test support for AvroParquet, etc.</h6>
      - [x] Initial testing of import of avro-generated Parquet data (simple + nested)
      
      Example:
      ```scala
      val data = TestSQLContext
        .parquetFile("input.dir")
        .toSchemaRDD
      data.registerAsTable("data")
      sql("SELECT owner, contacts[1].name, nameToApartmentNumber['John'] FROM data").collect()
      ```
      
      Author: Andre Schumacher <andre.schumacher@iki.fi>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #360 from AndreSchumacher/nested_parquet and squashes the following commits:
      
      30708c8 [Andre Schumacher] Taking out AvroParquet test for now to remove Avro dependency
      95c1367 [Andre Schumacher] Changes to ParquetRelation and its metadata
      7eceb67 [Andre Schumacher] Review feedback
      94eea3a [Andre Schumacher] Scalastyle
      403061f [Andre Schumacher] Fixing some issues with tests and schema metadata
      b8a8b9a [Andre Schumacher] More fixes to short and byte conversion
      63d1b57 [Andre Schumacher] Cleaning up and Scalastyle
      88e6bdb [Andre Schumacher] Attempting to fix loss of schema
      37e0a0a [Andre Schumacher] Cleaning up
      14c3fd8 [Andre Schumacher] Attempting to fix Spark-Parquet schema conversion
      3e1456c [Michael Armbrust] WIP: Directly serialize catalyst attributes.
      f7aeba3 [Michael Armbrust] [SPARK-1982] Support for ByteType and ShortType.
      3104886 [Michael Armbrust] Nested Rows should be Rows, not Seqs.
      3c6b25f [Andre Schumacher] Trying to reduce no-op changes wrt master
      31465d6 [Andre Schumacher] Scalastyle: fixing commented out bottom
      de02538 [Andre Schumacher] Cleaning up ParquetTestData
      2f5a805 [Andre Schumacher] Removing stripMargin from test schemas
      191bc0d [Andre Schumacher] Changing to Seq for ArrayType, refactoring SQLParser for nested field extension
      cbb5793 [Andre Schumacher] Code review feedback
      32229c7 [Andre Schumacher] Removing Row nested values and placing by generic types
      0ae9376 [Andre Schumacher] Doc strings and simplifying ParquetConverter.scala
      a6b4f05 [Andre Schumacher] Cleaning up ArrayConverter, moving classTag to NativeType, adding NativeRow
      431f00f [Andre Schumacher] Fixing problems introduced during rebase
      c52ff2c [Andre Schumacher] Adding native-array converter
      619c397 [Andre Schumacher] Completing Map testcase
      79d81d5 [Andre Schumacher] Replacing field names for array and map in WriteSupport
      f466ff0 [Andre Schumacher] Added ParquetAvro tests and revised Array conversion
      adc1258 [Andre Schumacher] Optimizing imports
      e99cc51 [Andre Schumacher] Fixing nested WriteSupport and adding tests
      1dc5ac9 [Andre Schumacher] First version of WriteSupport for nested types
      d1911dc [Andre Schumacher] Simplifying ArrayType conversion
      f777b4b [Andre Schumacher] Scalastyle
      824500c [Andre Schumacher] Adding attribute resolution for MapType
      b539fde [Andre Schumacher] First commit for MapType
      a594aed [Andre Schumacher] Scalastyle
      4e25fcb [Andre Schumacher] Adding resolution of complex ArrayTypes
      f8f8911 [Andre Schumacher] For primitive rows fall back to more efficient converter, code reorg
      6dbc9b7 [Andre Schumacher] Fixing some problems intruduced during rebase
      b7fcc35 [Andre Schumacher] Documenting conversions, bugfix, wrappers of Rows
      ee70125 [Andre Schumacher] fixing one problem with arrayconverter
      98219cf [Andre Schumacher] added struct converter
      5d80461 [Andre Schumacher] fixing one problem with nested structs and breaking up files
      1b1b3d6 [Andre Schumacher] Fixing one problem with nested arrays
      ddb40d2 [Andre Schumacher] Extending tests for nested Parquet data
      745a42b [Andre Schumacher] Completing testcase for nested data (Addressbook(
      6125c75 [Andre Schumacher] First working nested Parquet record input
      4d4892a [Andre Schumacher] First commit nested Parquet read converters
      aa688fe [Andre Schumacher] Adding conversion of nested Parquet schemas
      f479cf37
    • Yin Huai's avatar
      [SPARK-2177][SQL] describe table result contains only one column · f397e92e
      Yin Huai authored
      ```
      scala> hql("describe src").collect().foreach(println)
      
      [key                 	string              	None                ]
      [value               	string              	None                ]
      ```
      
      The result should contain 3 columns instead of one. This screws up JDBC or even the downstream consumer of the Scala/Java/Python APIs.
      
      I am providing a workaround. We handle a subset of describe commands in Spark SQL, which are defined by ...
      ```
      DESCRIBE [EXTENDED] [db_name.]table_name
      ```
      All other cases are treated as Hive native commands.
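
As a hedged illustration of the splitting involved (a stand-alone sketch, not the actual command implementation), each Hive-formatted line can be broken into up to three fields:
```scala
// Hypothetical helper; assumes Hive's tab-separated "name  type  comment" formatting.
def splitDescribeLine(line: String): Seq[String] =
  line.split("\t", 3).map(_.trim).toSeq

splitDescribeLine("key                 \tstring              \tNone                ")
// -> Seq("key", "string", "None")
```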
      
      Also, if we upgrade Hive to 0.13, we need to check the results of context.sessionState.isHiveServerQuery() to determine how to split the result. This method is introduced by https://issues.apache.org/jira/browse/HIVE-4545. We may want to set Hive to use JsonMetaDataFormatter for the output of a DDL statement (`set hive.ddl.output.format=json` introduced by https://issues.apache.org/jira/browse/HIVE-2822).
      
      The link to JIRA: https://issues.apache.org/jira/browse/SPARK-2177
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1118 from yhuai/SPARK-2177 and squashes the following commits:
      
      fd2534c [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177
      b9b9aa5 [Yin Huai] rxin's comments.
      e7c4e72 [Yin Huai] Fix unit test.
      656b068 [Yin Huai] 100 characters.
      6387217 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177
      8003cf3 [Yin Huai] Generate strings with the format like Hive for unit tests.
      9787fff [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177
      440c5af [Yin Huai] rxin's comments.
      f1a417e [Yin Huai] Update doc.
      83adb2f [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177
      366f891 [Yin Huai] Add describe command.
      74bd1d4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177
      342fdf7 [Yin Huai] Split to up to 3 parts.
      725e88c [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177
      bb8bbef [Yin Huai] Split every string in the result of a describe command.
      f397e92e
    • Michael Armbrust's avatar
      [SQL] Improve Speed of InsertIntoHiveTable · d3b7671c
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1130 from marmbrus/noFunctional and squashes the following commits:
      
      ccdb68c [Michael Armbrust] Remove functional programming and Array allocations from fast path in InsertIntoHiveTable.
      d3b7671c
    • Reynold Xin's avatar
      More minor scaladoc cleanup for Spark SQL. · 278ec8a2
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1142 from rxin/sqlclean and squashes the following commits:
      
      67a789e [Reynold Xin] More minor scaladoc cleanup for Spark SQL.
      278ec8a2
  3. Jun 19, 2014
    • Patrick Wendell's avatar
      HOTFIX: SPARK-2208 local metrics tests can fail on fast machines · e5514790
      Patrick Wendell authored
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #1141 from pwendell/hotfix and squashes the following commits:
      
      83e4c79 [Patrick Wendell] HOTFIX: SPARK-2208 local metrics tests can fail on fast machines
      e5514790
    • Reynold Xin's avatar
      A few minor Spark SQL Scaladoc fixes. · 5464e791
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1139 from rxin/sparksqldoc and squashes the following commits:
      
      c3049d8 [Reynold Xin] Fixed line length.
      66dc72c [Reynold Xin] A few minor Spark SQL Scaladoc fixes.
      5464e791
    • nravi's avatar
      [SPARK-2151] Recognize memory format for spark-submit · f14b00a9
      nravi authored
Previously, only a plain int value was accepted for the input memory parameter when spark-submit is invoked in standalone cluster mode. Make it consistent with the rest of Spark by also accepting values like "30g" or "512M".
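
A self-contained sketch of the kind of parsing this implies (not Spark's exact internal helper; the function name is a placeholder):
```scala
// Hypothetical parser accepting "512M", "30g", etc., returning megabytes.
def memoryStringToMb(str: String): Int = {
  val lower = str.trim.toLowerCase
  if (lower.endsWith("k")) (lower.dropRight(1).toLong / 1024).toInt
  else if (lower.endsWith("m")) lower.dropRight(1).toInt
  else if (lower.endsWith("g")) lower.dropRight(1).toInt * 1024
  else if (lower.endsWith("t")) lower.dropRight(1).toInt * 1024 * 1024
  else lower.toInt  // assume a bare number is already in MB
}

memoryStringToMb("30g")   // 30720
memoryStringToMb("512M")  // 512
```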
      
      Author: nravi <nravi@c1704.halxg.cloudera.com>
      
      Closes #1095 from nishkamravi2/master and squashes the following commits:
      
      2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
      3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
      5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
      eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
      df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
      6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
      5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
      681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
      f14b00a9
    • Michael Armbrust's avatar
      [SPARK-2191][SQL] Make sure InsertIntoHiveTable doesn't execute more than once. · 777c5958
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1129 from marmbrus/doubleCreateAs and squashes the following commits:
      
      9c6d9e4 [Michael Armbrust] Fix typo.
      5128fe2 [Michael Armbrust] Make sure InsertIntoHiveTable doesn't execute each time you ask for its result.
      777c5958
    • witgo's avatar
      [SPARK-2051]In yarn.ClientBase spark.yarn.dist.* do not work · bce0897b
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #969 from witgo/yarn_ClientBase and squashes the following commits:
      
      8117765 [witgo] review commit
      3bdbc52 [witgo] Merge branch 'master' of https://github.com/apache/spark into yarn_ClientBase
      5261b6c [witgo] fix sys.props.get("SPARK_YARN_DIST_FILES")
      e3c1107 [witgo] update docs
      b6a9aa1 [witgo] merge master
      c8b4554 [witgo] review commit
      2f48789 [witgo] Merge branch 'master' of https://github.com/apache/spark into yarn_ClientBase
      8d7b82f [witgo] Merge branch 'master' of https://github.com/apache/spark into yarn_ClientBase
      1048549 [witgo] remove Utils.resolveURIs
      871f1db [witgo] add spark.yarn.dist.* documentation
      41bce59 [witgo] review commit
      35d6fa0 [witgo] move to ClientArguments
      55d72fc [witgo] Merge branch 'master' of https://github.com/apache/spark into yarn_ClientBase
      9cdff16 [witgo] review commit
      8bc2f4b [witgo] review commit
      20e667c [witgo] Merge branch 'master' into yarn_ClientBase
      0961151 [witgo] merge master
      ce609fc [witgo] Merge branch 'master' into yarn_ClientBase
      8362489 [witgo] yarn.ClientBase spark.yarn.dist.* do not work
      bce0897b
    • WangTao's avatar
      Minor fix · 67fca189
      WangTao authored
      The value "env" is never used in SparkContext.scala.
Add a detailed comment for the method setDelaySeconds in MetadataCleaner.scala to replace the unclear existing one.
      
      Author: WangTao <barneystinson@aliyun.com>
      
      Closes #1105 from WangTaoTheTonic/master and squashes the following commits:
      
      688358e [WangTao] Minor fix
      67fca189
    • Reynold Xin's avatar
      [SPARK-2187] Explain should not run the optimizer twice. · 640c2943
      Reynold Xin authored
      @yhuai @marmbrus @concretevitamin
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1123 from rxin/explain and squashes the following commits:
      
      def83b0 [Reynold Xin] Update unit tests for explain.
      a9d3ba8 [Reynold Xin] [SPARK-2187] Explain should not run the optimizer twice.
      640c2943
    • Doris Xin's avatar
      Squishing a typo bug before it causes real harm · 566f70f2
      Doris Xin authored
      in updateNumRows method in RowMatrix
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1125 from dorx/updateNumRows and squashes the following commits:
      
      8564aef [Doris Xin] Squishing a typo bug before it causes real harm
      566f70f2
  4. Jun 18, 2014
    • Michael Armbrust's avatar
      [SPARK-2184][SQL] AddExchange isn't idempotent · 5ff75c74
      Michael Armbrust authored
      ...redPartitioning.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1122 from marmbrus/fixAddExchange and squashes the following commits:
      
      3417537 [Michael Armbrust] Don't bind partitioning expressions as that breaks comparison with requiredPartitioning.
      5ff75c74
    • Doris Xin's avatar
      Remove unicode operator from RDD.scala · 45a95f82
      Doris Xin authored
      Some IDEs don’t support unicode characters in source code. Check if this breaks binary compatibility.
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1119 from dorx/unicode and squashes the following commits:
      
      05618c3 [Doris Xin] Remove unicode operator from RDD.scala
      45a95f82
    • Mark Hamstra's avatar
      SPARK-2158 Clean up core/stdout file from FileAppenderSuite · 4cbeea83
      Mark Hamstra authored
      @tdas
      
      Author: Mark Hamstra <markhamstra@gmail.com>
      
      Closes #1100 from markhamstra/SPARK-2158 and squashes the following commits:
      
      ae8e069 [Mark Hamstra] Response to TD's review
      2f1e201 [Mark Hamstra] Cleanup 'stdout' file within FileAppenderSuite
      4cbeea83
    • Kay Ousterhout's avatar
      [SPARK-1466] Raise exception if pyspark Gateway process doesn't start. · 38702487
      Kay Ousterhout authored
      If the gateway process fails to start correctly (e.g., because JAVA_HOME isn't set correctly, there's no Spark jar, etc.), right now pyspark fails because of a very difficult-to-understand error, where we try to parse stdout to get the port where Spark started and there's nothing there. This commit properly catches the error and throws an exception that includes the stderr output for much easier debugging.
      
      Thanks to @shivaram and @stogers for helping to fix this issue!
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #383 from kayousterhout/pyspark and squashes the following commits:
      
      36dd54b [Kay Ousterhout] [SPARK-1466] Raise exception if Gateway process doesn't start.
      38702487
    • Reynold Xin's avatar
      Updated the comment for SPARK-2162. · dd96fcda
      Reynold Xin authored
      A follow up on #1103
      
      @andrewor14
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1117 from rxin/SPARK-2162 and squashes the following commits:
      
      a4231de [Reynold Xin] Updated the comment for SPARK-2162.
      dd96fcda
    • Raymond Liu's avatar
      [SPARK-2162] Double check in doGetLocal to avoid read on removed block. · 5ad5e348
      Raymond Liu authored
Otherwise, it will either read in vain in the memory-level case, or throw an exception in the disk-level case when it believes the block is there while it has actually been removed.
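
A toy sketch of the double-check pattern (the map-based store below is a stand-in, not BlockManager's actual data structures):
```scala
import scala.collection.concurrent.TrieMap

val memoryStore = TrieMap[String, Array[Byte]]()

def doGetLocalSketch(blockId: String): Option[Array[Byte]] = {
  if (memoryStore.contains(blockId)) {
    // Re-check at read time: the block may have been removed concurrently.
    memoryStore.get(blockId) match {
      case some @ Some(_) => some
      case None =>
        println(s"Block $blockId was removed between checks; skipping the read")
        None
    }
  } else {
    None
  }
}
```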
      
      Author: Raymond Liu <raymond.liu@intel.com>
      
      Closes #1103 from colorant/bm and squashes the following commits:
      
      daac114 [Raymond Liu] Address comments
      d1ea287 [Raymond Liu] Double check in doGetLocal to avoid read on removed block.
      5ad5e348
    • Yin Huai's avatar
      [SPARK-2176][SQL] Extra unnecessary exchange operator in the result of an explain command · 587d3201
      Yin Huai authored
      ```
      hql("explain select * from src group by key").collect().foreach(println)
      
      [ExplainCommand [plan#27:0]]
      [ Aggregate false, [key#25], [key#25,value#26]]
      [  Exchange (HashPartitioning [key#25:0], 200)]
      [   Exchange (HashPartitioning [key#25:0], 200)]
      [    Aggregate true, [key#25], [key#25]]
      [     HiveTableScan [key#25,value#26], (MetastoreRelation default, src, None), None]
      ```
      
      There are two exchange operators.
      
      However, if we do not use explain...
      ```
      hql("select * from src group by key")
      
      res4: org.apache.spark.sql.SchemaRDD =
      SchemaRDD[8] at RDD at SchemaRDD.scala:100
      == Query Plan ==
      Aggregate false, [key#8], [key#8,value#9]
       Exchange (HashPartitioning [key#8:0], 200)
        Aggregate true, [key#8], [key#8]
         HiveTableScan [key#8,value#9], (MetastoreRelation default, src, None), None
      ```
      The plan is fine.
      
      The cause of this bug is explained below.
      
      When we create an `execution.ExplainCommand`, we use the `executedPlan` as the child of this `ExplainCommand`. But, this `executedPlan` is prepared for execution again when we generate the `executedPlan` for the `ExplainCommand`. Basically, `prepareForExecution` is called twice on a physical plan. Because after `prepareForExecution` we have already bounded those references (in `BoundReference`s), `AddExchange` cannot figure out we are using the same partitioning (we use `AttributeReference`s to create an `ExchangeOperator` and then those references will be changed to `BoundReference`s after `prepareForExecution` is called). So, an extra `ExchangeOperator` is inserted.
      
      I think in `CommandStrategy`, we should just use the `sparkPlan` (`sparkPlan` is the input of `prepareForExecution`) to initialize the `ExplainCommand` instead of using `executedPlan`.
      
      The link to JIRA: https://issues.apache.org/jira/browse/SPARK-2176
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1116 from yhuai/SPARK-2176 and squashes the following commits:
      
      197c19c [Yin Huai] Use sparkPlan to initialize a Physical Explain Command instead of using executedPlan.
      587d3201
    • Vadim Chekan's avatar
      [STREAMING] SPARK-2009 Key not found exception when slow receiver starts · 889f7b76
      Vadim Chekan authored
      I got "java.util.NoSuchElementException: key not found: 1401756085000 ms" exception when using kafka stream and 1 sec batchPeriod.
      
      Investigation showed that the reason is that ReceiverLauncher.startReceivers is asynchronous (started in a thread).
      https://github.com/vchekan/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala#L206
      
      
      
In the case of a slow-starting receiver, such as Kafka, it can easily take more than 2 seconds to start. As a result, no "compute" will have been called on ReceiverInputDStream before the first batch job is executed, so receivedBlockInfo remains empty. The batch job then calls ReceiverInputDStream.getReceivedBlockInfo and hits the "key not found" exception.
      
      The patch makes getReceivedBlockInfo more robust by tolerating missing values.
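
A minimal sketch of that tolerance (types simplified; not the actual ReceiverTracker code):
```scala
import scala.collection.mutable

// Hypothetical stand-in for the per-batch block info map keyed by batch time (ms).
val receivedBlockInfo = new mutable.HashMap[Long, Array[String]]()

def getReceivedBlockInfoSketch(batchTimeMs: Long): Array[String] =
  // Return an empty array for a batch that has no entry yet, instead of
  // throwing "java.util.NoSuchElementException: key not found".
  receivedBlockInfo.getOrElse(batchTimeMs, Array.empty[String])
```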
      
      Author: Vadim Chekan <kot.begemot@gmail.com>
      
      Closes #961 from vchekan/branch-1.0 and squashes the following commits:
      
      e86f82b [Vadim Chekan] Fixed indentation
      4609563 [Vadim Chekan] Key not found exception: if receiver is slow to start, it is possible that getReceivedBlockInfo will be called before compute has been called
      (cherry picked from commit 26f6b989)
      
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
      889f7b76