- Jun 21, 2014
-
-
Patrick Wendell authored
-
Reynold Xin authored
The single file was getting very long (500+ loc). Author: Reynold Xin <rxin@apache.org> Closes #1166 from rxin/hiveOperators and squashes the following commits: 5b43068 [Reynold Xin] [SQL] Break hiveOperators.scala into multiple files.
-
Reynold Xin authored
This makes it easier to use config options in operators. Author: Reynold Xin <rxin@apache.org> Closes #1164 from rxin/sqlcontext and squashes the following commits: 797b2fd [Reynold Xin] Pass SQLContext instead of SparkContext into physical operators.
-
- Jun 20, 2014
-
-
Marcelo Vanzin authored
- JavaAPISuite was trying to compare a bare path with a URI. Fix by extracting the path from the URI, since we know it should be a local path anyway/ - b9be1609 excluded the ASM dependency everywhere, but easymock needs it (because cglib needs it). So re-add the dependency, with test scope this time. The second one above actually uncovered a weird situation: the maven test target works, even though I can't find the class sbt complains about in its classpath. sbt complains with: [error] Uncaught exception when running org.apache.spark.util .random.RandomSamplerSuite: java.lang.NoClassDefFoundError: org/objectweb/asm/Type To avoid more weirdness caused by that, I explicitly added the asm dependency to both maven and sbt (for tests only), and verified the classes don't end up in the final assembly. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #917 from vanzin/flaky-tests and squashes the following commits: d022320 [Marcelo Vanzin] Fix some tests.
-
Anant authored
The jira for the issue can be found at: https://issues.apache.org/jira/browse/SPARK-2061 Most of spark has used over to consistently using `partitions` instead of `splits`. We should do likewise and add a `partitions` method to JavaRDDLike and have `splits` just call that. We should also go through all cases where other API's (e.g. Python) call `splits` and we should change those to use the newer API. Author: Anant <anant.asty@gmail.com> Closes #1062 from anantasty/SPARK-2061 and squashes the following commits: b83ce6b [Anant] Fixed syntax issue 21f9210 [Anant] Fixed version number in deprecation string 9315b76 [Anant] made related changes to use partitions in python api 8c62dd1 [Anant] Made splits deprecated in JavaRDDLike
-
Patrick Wendell authored
-
Doris Xin authored
Updating the chisquare unit test in XORShiftRandomSuite to use the ChiSquareTest in commons-math3 instead of hardcoding the chisquare statistic for the desired confidence interval. Author: Doris Xin <doris.s.xin@gmail.com> Closes #1073 from dorx/math3Unit and squashes the following commits: da0e891 [Doris Xin] remove math3 from common pom 9954143 [Doris Xin] merge master c19948f [Doris Xin] Merge branch 'master' into math3Unit 8f84f19 [Doris Xin] [SPARK-1970] unit test in XORShiftRandomSuite ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD 1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
-
Andrew Ash authored
Before: ``` 14/06/08 23:58:23 WARN AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use java.net.BindException: Address already in use at sun.nio.ch.Net.bind0(Native Method) at sun.nio.ch.Net.bind(Net.java:444) at sun.nio.ch.Net.bind(Net.java:436) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) at org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187) at org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316) at org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64) at org.eclipse.jetty.server.Server.doStart(Server.java:293) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64) at org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:192) at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192) at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:191) at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:205) at org.apache.spark.ui.WebUI.bind(WebUI.scala:99) at org.apache.spark.SparkContext.<init>(SparkContext.scala:223) at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:957) at $line3.$read$$iwC$$iwC.<init>(<console>:8) at $line3.$read$$iwC.<init>(<console>:14) at $line3.$read.<init>(<console>:16) at $line3.$read$.<init>(<console>:20) at $line3.$read$.<clinit>(<console>) at $line3.$eval$.<init>(<console>:7) at $line3.$eval$.<clinit>(<console>) at $line3.$eval.$print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753) at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:121) at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:120) at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:263) at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:120) at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:56) at org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:913) at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:142) at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:56) at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:104) at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:56) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:930) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 14/06/08 23:58:23 WARN AbstractLifeCycle: FAILED org.eclipse.jetty.server.Server@7439e55a: java.net.BindException: Address already in use java.net.BindException: Address already in use at sun.nio.ch.Net.bind0(Native Method) at sun.nio.ch.Net.bind(Net.java:444) at sun.nio.ch.Net.bind(Net.java:436) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) at org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187) at org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316) at org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64) at org.eclipse.jetty.server.Server.doStart(Server.java:293) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64) at org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:192) at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192) at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:191) at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:205) at org.apache.spark.ui.WebUI.bind(WebUI.scala:99) at org.apache.spark.SparkContext.<init>(SparkContext.scala:223) at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:957) at $line3.$read$$iwC$$iwC.<init>(<console>:8) at $line3.$read$$iwC.<init>(<console>:14) at $line3.$read.<init>(<console>:16) at $line3.$read$.<init>(<console>:20) at $line3.$read$.<clinit>(<console>) at $line3.$eval$.<init>(<console>:7) at $line3.$eval$.<clinit>(<console>) at $line3.$eval.$print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753) at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:121) at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:120) at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:263) at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:120) at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:56) at org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:913) at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:142) at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:56) at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:104) at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:56) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:930) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 14/06/08 23:58:23 INFO JettyUtils: Failed to create UI at port, 4040. Trying again. 14/06/08 23:58:23 INFO JettyUtils: Error was: Failure(java.net.BindException: Address already in use) 14/06/08 23:58:23 INFO SparkUI: Started SparkUI at http://aash-mbp.local:4041 ```` After: ``` 14/06/09 00:04:12 INFO JettyUtils: Failed to create UI at port, 4040. Trying again. 14/06/09 00:04:12 INFO JettyUtils: Error was: Failure(java.net.BindException: Address already in use) 14/06/09 00:04:12 INFO Server: jetty-8.y.z-SNAPSHOT 14/06/09 00:04:12 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4041 14/06/09 00:04:12 INFO SparkUI: Started SparkUI at http://aash-mbp.local:4041 ``` Lengthy logging comes from this line of code in Jetty: http://grepcode.com/file/repo1.maven.org/maven2/org.eclipse.jetty.aggregate/jetty-all/9.1.3.v20140225/org/eclipse/jetty/util/component/AbstractLifeCycle.java#210 Author: Andrew Ash <andrew@andrewash.com> Closes #1019 from ash211/SPARK-1902 and squashes the following commits: 0dd02f7 [Andrew Ash] Leave old org.eclipse.jetty silencing in place 1e2866b [Andrew Ash] Address CR comments 9d85eed [Andrew Ash] SPARK-1902 Silence stacktrace from logs when doing port failover to port n+1
-
Aaron Davidson authored
Note that this is simply mimicing lookupRelation(). I do not have a concrete notion of why this solution is necessarily right-er than SessionState.get, but SessionState.get is returning null, which is bad. Author: Aaron Davidson <aaron@databricks.com> Closes #1148 from aarondav/createtable and squashes the following commits: 37c3e7c [Aaron Davidson] [SQL] Use hive.SessionState, not the thread local SessionState
-
Reynold Xin authored
Author: Reynold Xin <rxin@apache.org> Closes #1162 from rxin/script and squashes the following commits: 2c836b9 [Reynold Xin] Move ScriptTransformation into the appropriate place.
-
Andrew Or authored
**UPDATE** I have removed the special handling for `StorageLevel.MEMORY_*_SER` for now, because it introduces a potential performance regression. With the latest changes, this PR should include mainly style (code readability) fixes. The only functionality change is the update in `MemoryStore#putBytes` to actually return updated blocks, though this is a minor bug fix. Now this is mainly a precursor to another PR (once again). --------- *Old comment* The deserialized version of a partition may occupy much more space than the serialized version. Therefore, if a partition is to be cached with `StorageLevel.MEMORY_*_SER`, we don't need to fully unroll it into an `ArrayBuffer`, but instead we can unroll it into a potentially much smaller `ByteBuffer`. This may save us from OOMs in this case. Author: Andrew Or <andrewor14@gmail.com> Closes #1083 from andrewor14/unroll-them-partitions and squashes the following commits: 7048aa0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into unroll-them-partitions 3d9a366 [Andrew Or] Minor change for readability d12b95f [Andrew Or] Remove unused imports (minor) a4c387b [Andrew Or] Merge branch 'master' of github.com:apache/spark into unroll-them-partitions cf5f565 [Andrew Or] Remove special handling for MEM_*_SER 0091ec0 [Andrew Or] Address review feedback 44ef282 [Andrew Or] Actually return updated blocks in putBytes 2941c89 [Andrew Or] Clean up BlockStore (minor) a8f181d [Andrew Or] Add special handling for StorageLevel.MEMORY_*_SER
-
Reynold Xin authored
@willb Author: Reynold Xin <rxin@apache.org> Closes #1161 from rxin/having-filter and squashes the following commits: fa8359a [Reynold Xin] [SPARK-2225] Turn HAVING without GROUP BY into WHERE.
-
William Benton authored
This PR extends Spark's HiveQL support to handle HAVING clauses in aggregations. The HAVING test from the Hive compatibility suite doesn't appear to be runnable from within Spark, so I added a simple comparable test to `HiveQuerySuite`. Author: William Benton <willb@redhat.com> Closes #1136 from willb/SPARK-2180 and squashes the following commits: 3bbaf26 [William Benton] Added casts to HAVING expressions 83f1340 [William Benton] scalastyle fixes 18387f1 [William Benton] Add test for HAVING without GROUP BY b880bef [William Benton] Added semantic error for HAVING without GROUP BY 942428e [William Benton] Added test coverage for SPARK-2180. 56084cc [William Benton] Add support for HAVING clauses in Hive queries.
-
Allan Douglas R. de Oliveira authored
Adds cogroup for 4 RDDs. Author: Allan Douglas R. de Oliveira <allandouglas@gmail.com> Closes #813 from douglaz/more_cogroups and squashes the following commits: f8d6273 [Allan Douglas R. de Oliveira] Test python groupWith for one more case 0e9009c [Allan Douglas R. de Oliveira] Added scala tests c3ffcdd [Allan Douglas R. de Oliveira] Added java tests 517a67f [Allan Douglas R. de Oliveira] Added tests for python groupWith 2f402d5 [Allan Douglas R. de Oliveira] Removed TODO 17474f4 [Allan Douglas R. de Oliveira] Use new cogroup function 7877a2a [Allan Douglas R. de Oliveira] Fixed code ba02414 [Allan Douglas R. de Oliveira] Added varargs cogroup to pyspark c4a8a51 [Allan Douglas R. de Oliveira] Added java cogroup 4 e94963c [Allan Douglas R. de Oliveira] Fixed spacing f1ee57b [Allan Douglas R. de Oliveira] Fixed scala style issues d7196f1 [Allan Douglas R. de Oliveira] Allow the cogroup of 4 RDDs
-
Gang Bai authored
https://issues.apache.org/jira/browse/SPARK-2163 This pull request includes the change for **[SPARK-2163]**: * Changed the convergence tolerance parameter from type `Int` to type `Double`. * Added types for vars in `class LBFGS`, making the style consistent with `class GradientDescent`. * Added associated test to check that optimizing via `class LBFGS` produces the same results as via calling `runLBFGS` from `object LBFGS`. This is a very minor change but it will solve the problem in my implementation of a regression model for count data, where I make use of LBFGS for parameter estimation. Author: Gang Bai <me@baigang.net> Closes #1104 from BaiGang/fix_int_tol and squashes the following commits: cecf02c [Gang Bai] Changed setConvergenceTol'' to specify tolerance with a parameter of type Double. For the reason and the problem caused by an Int parameter, please check https://issues.apache.org/jira/browse/SPARK-2163. Added a test in LBFGSSuite for validating that optimizing via class LBFGS produces the same results as calling runLBFGS from object LBFGS. Keep the indentations and styles correct.
-
Reynold Xin authored
Due to the existence of scala.Equals, it is very error prone to name the expression Equals, especially because we use a lot of partial functions and pattern matching in the optimizer. Note that this sits on top of #1144. Author: Reynold Xin <rxin@apache.org> Closes #1146 from rxin/equals and squashes the following commits: f8583fd [Reynold Xin] Merge branch 'master' of github.com:apache/spark into equals 326b388 [Reynold Xin] Merge branch 'master' of github.com:apache/spark into equals bd19807 [Reynold Xin] Rename EqualsTo to EqualTo. 81148d1 [Reynold Xin] [SPARK-2218] rename Equals to EqualsTo in Spark SQL expressions. c4e543d [Reynold Xin] [SPARK-2210] boolean cast on boolean value should be removed.
-
Takuya UESHIN authored
`CaseWhen` should use `branches.length` to check if `elseValue` is provided or not. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #1133 from ueshin/issues/SPARK-2196 and squashes the following commits: 510f12d [Takuya UESHIN] Add some tests. dc25e8d [Takuya UESHIN] Fix nullable of CaseWhen to be nullable if the elseValue is nullable. 4f049cc [Takuya UESHIN] Fix nullability of CaseWhen.
-
Aaron Davidson authored
For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(), PySpark will always assume that the default parallelism to use for the reduce side is ctx.defaultParallelism, which is a constant typically determined by the number of cores in cluster. In contrast, Spark's Partitioner#defaultPartitioner will use the same number of reduce partitions as map partitions unless the defaultParallelism config is explicitly set. This tends to be a better default in order to avoid OOMs, and should also be the behavior of PySpark. JIRA: https://issues.apache.org/jira/browse/SPARK-2203 Author: Aaron Davidson <aaron@databricks.com> Closes #1138 from aarondav/pyfix and squashes the following commits: 1bd5751 [Aaron Davidson] SPARK-2203: PySpark defaults to use same num reduce partitions as map partitions
-
Reynold Xin authored
Also took the chance to clean up cast a little bit. Too many arrows on each line before! Author: Reynold Xin <rxin@apache.org> Closes #1143 from rxin/cast and squashes the following commits: dd006cb [Reynold Xin] Code review feedback. c2b88ae [Reynold Xin] [SPARK-2209][SQL] Cast shouldn't do null check twice.
-
Reynold Xin authored
``` explain select cast(cast(key=0 as boolean) as boolean) aaa from src ``` should be ``` [Physical execution plan:] [Project [(key#10:0 = 0) AS aaa#7]] [ HiveTableScan [key#10], (MetastoreRelation default, src, None), None] ``` However, it is currently ``` [Physical execution plan:] [Project [NOT((key#10=0) = 0) AS aaa#7]] [ HiveTableScan [key#10], (MetastoreRelation default, src, None), None] ``` Author: Reynold Xin <rxin@apache.org> Closes #1144 from rxin/booleancast and squashes the following commits: c4e543d [Reynold Xin] [SPARK-2210] boolean cast on boolean value should be removed.
-
Andre Schumacher authored
It should be possible to import and export data stored in Parquet's columnar format that contains nested types. For example: ```java message AddressBook { required binary owner; optional group ownerPhoneNumbers { repeated binary array; } optional group contacts { repeated group array { required binary name; optional binary phoneNumber; } } optional group nameToApartmentNumber { repeated group map { required binary key; required int32 value; } } } ``` The example could model a type (AddressBook) that contains records made of strings (owner), lists (ownerPhoneNumbers) and a table of contacts (e.g., a list of pairs or a map that can contain null values but keys must not be null). The list of tasks are as follows: <h6>Implement support for converting nested Parquet types to Spark/Catalyst types:</h6> - [x] Structs - [x] Lists - [x] Maps (note: currently keys need to be Strings) <h6>Implement import (via ``parquetFile``) of nested Parquet types (first version in this PR)</h6> - [x] Initial version <h6>Implement export (via ``saveAsParquetFile``)</h6> - [x] Initial version <h6>Test support for AvroParquet, etc.</h6> - [x] Initial testing of import of avro-generated Parquet data (simple + nested) Example: ```scala val data = TestSQLContext .parquetFile("input.dir") .toSchemaRDD data.registerAsTable("data") sql("SELECT owner, contacts[1].name, nameToApartmentNumber['John'] FROM data").collect() ``` Author: Andre Schumacher <andre.schumacher@iki.fi> Author: Michael Armbrust <michael@databricks.com> Closes #360 from AndreSchumacher/nested_parquet and squashes the following commits: 30708c8 [Andre Schumacher] Taking out AvroParquet test for now to remove Avro dependency 95c1367 [Andre Schumacher] Changes to ParquetRelation and its metadata 7eceb67 [Andre Schumacher] Review feedback 94eea3a [Andre Schumacher] Scalastyle 403061f [Andre Schumacher] Fixing some issues with tests and schema metadata b8a8b9a [Andre Schumacher] More fixes to short and byte conversion 63d1b57 [Andre Schumacher] Cleaning up and Scalastyle 88e6bdb [Andre Schumacher] Attempting to fix loss of schema 37e0a0a [Andre Schumacher] Cleaning up 14c3fd8 [Andre Schumacher] Attempting to fix Spark-Parquet schema conversion 3e1456c [Michael Armbrust] WIP: Directly serialize catalyst attributes. f7aeba3 [Michael Armbrust] [SPARK-1982] Support for ByteType and ShortType. 3104886 [Michael Armbrust] Nested Rows should be Rows, not Seqs. 3c6b25f [Andre Schumacher] Trying to reduce no-op changes wrt master 31465d6 [Andre Schumacher] Scalastyle: fixing commented out bottom de02538 [Andre Schumacher] Cleaning up ParquetTestData 2f5a805 [Andre Schumacher] Removing stripMargin from test schemas 191bc0d [Andre Schumacher] Changing to Seq for ArrayType, refactoring SQLParser for nested field extension cbb5793 [Andre Schumacher] Code review feedback 32229c7 [Andre Schumacher] Removing Row nested values and placing by generic types 0ae9376 [Andre Schumacher] Doc strings and simplifying ParquetConverter.scala a6b4f05 [Andre Schumacher] Cleaning up ArrayConverter, moving classTag to NativeType, adding NativeRow 431f00f [Andre Schumacher] Fixing problems introduced during rebase c52ff2c [Andre Schumacher] Adding native-array converter 619c397 [Andre Schumacher] Completing Map testcase 79d81d5 [Andre Schumacher] Replacing field names for array and map in WriteSupport f466ff0 [Andre Schumacher] Added ParquetAvro tests and revised Array conversion adc1258 [Andre Schumacher] Optimizing imports e99cc51 [Andre Schumacher] Fixing nested WriteSupport and adding tests 1dc5ac9 [Andre Schumacher] First version of WriteSupport for nested types d1911dc [Andre Schumacher] Simplifying ArrayType conversion f777b4b [Andre Schumacher] Scalastyle 824500c [Andre Schumacher] Adding attribute resolution for MapType b539fde [Andre Schumacher] First commit for MapType a594aed [Andre Schumacher] Scalastyle 4e25fcb [Andre Schumacher] Adding resolution of complex ArrayTypes f8f8911 [Andre Schumacher] For primitive rows fall back to more efficient converter, code reorg 6dbc9b7 [Andre Schumacher] Fixing some problems intruduced during rebase b7fcc35 [Andre Schumacher] Documenting conversions, bugfix, wrappers of Rows ee70125 [Andre Schumacher] fixing one problem with arrayconverter 98219cf [Andre Schumacher] added struct converter 5d80461 [Andre Schumacher] fixing one problem with nested structs and breaking up files 1b1b3d6 [Andre Schumacher] Fixing one problem with nested arrays ddb40d2 [Andre Schumacher] Extending tests for nested Parquet data 745a42b [Andre Schumacher] Completing testcase for nested data (Addressbook( 6125c75 [Andre Schumacher] First working nested Parquet record input 4d4892a [Andre Schumacher] First commit nested Parquet read converters aa688fe [Andre Schumacher] Adding conversion of nested Parquet schemas
-
Yin Huai authored
``` scala> hql("describe src").collect().foreach(println) [key string None ] [value string None ] ``` The result should contain 3 columns instead of one. This screws up JDBC or even the downstream consumer of the Scala/Java/Python APIs. I am providing a workaround. We handle a subset of describe commands in Spark SQL, which are defined by ... ``` DESCRIBE [EXTENDED] [db_name.]table_name ``` All other cases are treated as Hive native commands. Also, if we upgrade Hive to 0.13, we need to check the results of context.sessionState.isHiveServerQuery() to determine how to split the result. This method is introduced by https://issues.apache.org/jira/browse/HIVE-4545. We may want to set Hive to use JsonMetaDataFormatter for the output of a DDL statement (`set hive.ddl.output.format=json` introduced by https://issues.apache.org/jira/browse/HIVE-2822). The link to JIRA: https://issues.apache.org/jira/browse/SPARK-2177 Author: Yin Huai <huai@cse.ohio-state.edu> Closes #1118 from yhuai/SPARK-2177 and squashes the following commits: fd2534c [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177 b9b9aa5 [Yin Huai] rxin's comments. e7c4e72 [Yin Huai] Fix unit test. 656b068 [Yin Huai] 100 characters. 6387217 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177 8003cf3 [Yin Huai] Generate strings with the format like Hive for unit tests. 9787fff [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177 440c5af [Yin Huai] rxin's comments. f1a417e [Yin Huai] Update doc. 83adb2f [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177 366f891 [Yin Huai] Add describe command. 74bd1d4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177 342fdf7 [Yin Huai] Split to up to 3 parts. 725e88c [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177 bb8bbef [Yin Huai] Split every string in the result of a describe command.
-
Michael Armbrust authored
Author: Michael Armbrust <michael@databricks.com> Closes #1130 from marmbrus/noFunctional and squashes the following commits: ccdb68c [Michael Armbrust] Remove functional programming and Array allocations from fast path in InsertIntoHiveTable.
-
Reynold Xin authored
Author: Reynold Xin <rxin@apache.org> Closes #1142 from rxin/sqlclean and squashes the following commits: 67a789e [Reynold Xin] More minor scaladoc cleanup for Spark SQL.
-
- Jun 19, 2014
-
-
Patrick Wendell authored
Author: Patrick Wendell <pwendell@gmail.com> Closes #1141 from pwendell/hotfix and squashes the following commits: 83e4c79 [Patrick Wendell] HOTFIX: SPARK-2208 local metrics tests can fail on fast machines
-
Reynold Xin authored
Author: Reynold Xin <rxin@apache.org> Closes #1139 from rxin/sparksqldoc and squashes the following commits: c3049d8 [Reynold Xin] Fixed line length. 66dc72c [Reynold Xin] A few minor Spark SQL Scaladoc fixes.
-
nravi authored
int format expected for input memory parameter when spark-submit is invoked in standalone cluster mode. Make it consistent with rest of Spark. Author: nravi <nravi@c1704.halxg.cloudera.com> Closes #1095 from nishkamravi2/master and squashes the following commits: 2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark 3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark 5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456) 6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed) 5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456) 681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
-
Michael Armbrust authored
Author: Michael Armbrust <michael@databricks.com> Closes #1129 from marmbrus/doubleCreateAs and squashes the following commits: 9c6d9e4 [Michael Armbrust] Fix typo. 5128fe2 [Michael Armbrust] Make sure InsertIntoHiveTable doesn't execute each time you ask for its result.
-
witgo authored
Author: witgo <witgo@qq.com> Closes #969 from witgo/yarn_ClientBase and squashes the following commits: 8117765 [witgo] review commit 3bdbc52 [witgo] Merge branch 'master' of https://github.com/apache/spark into yarn_ClientBase 5261b6c [witgo] fix sys.props.get("SPARK_YARN_DIST_FILES") e3c1107 [witgo] update docs b6a9aa1 [witgo] merge master c8b4554 [witgo] review commit 2f48789 [witgo] Merge branch 'master' of https://github.com/apache/spark into yarn_ClientBase 8d7b82f [witgo] Merge branch 'master' of https://github.com/apache/spark into yarn_ClientBase 1048549 [witgo] remove Utils.resolveURIs 871f1db [witgo] add spark.yarn.dist.* documentation 41bce59 [witgo] review commit 35d6fa0 [witgo] move to ClientArguments 55d72fc [witgo] Merge branch 'master' of https://github.com/apache/spark into yarn_ClientBase 9cdff16 [witgo] review commit 8bc2f4b [witgo] review commit 20e667c [witgo] Merge branch 'master' into yarn_ClientBase 0961151 [witgo] merge master ce609fc [witgo] Merge branch 'master' into yarn_ClientBase 8362489 [witgo] yarn.ClientBase spark.yarn.dist.* do not work
-
WangTao authored
The value "env" is never used in SparkContext.scala. Add detailed comment for method setDelaySeconds in MetadataCleaner.scala instead of the unsure one. Author: WangTao <barneystinson@aliyun.com> Closes #1105 from WangTaoTheTonic/master and squashes the following commits: 688358e [WangTao] Minor fix
-
Reynold Xin authored
@yhuai @marmbrus @concretevitamin Author: Reynold Xin <rxin@apache.org> Closes #1123 from rxin/explain and squashes the following commits: def83b0 [Reynold Xin] Update unit tests for explain. a9d3ba8 [Reynold Xin] [SPARK-2187] Explain should not run the optimizer twice.
-
Doris Xin authored
in updateNumRows method in RowMatrix Author: Doris Xin <doris.s.xin@gmail.com> Closes #1125 from dorx/updateNumRows and squashes the following commits: 8564aef [Doris Xin] Squishing a typo bug before it causes real harm
-
- Jun 18, 2014
-
-
Michael Armbrust authored
...redPartitioning. Author: Michael Armbrust <michael@databricks.com> Closes #1122 from marmbrus/fixAddExchange and squashes the following commits: 3417537 [Michael Armbrust] Don't bind partitioning expressions as that breaks comparison with requiredPartitioning.
-
Doris Xin authored
Some IDEs don’t support unicode characters in source code. Check if this breaks binary compatibility. Author: Doris Xin <doris.s.xin@gmail.com> Closes #1119 from dorx/unicode and squashes the following commits: 05618c3 [Doris Xin] Remove unicode operator from RDD.scala
-
Mark Hamstra authored
@tdas Author: Mark Hamstra <markhamstra@gmail.com> Closes #1100 from markhamstra/SPARK-2158 and squashes the following commits: ae8e069 [Mark Hamstra] Response to TD's review 2f1e201 [Mark Hamstra] Cleanup 'stdout' file within FileAppenderSuite
-
Kay Ousterhout authored
If the gateway process fails to start correctly (e.g., because JAVA_HOME isn't set correctly, there's no Spark jar, etc.), right now pyspark fails because of a very difficult-to-understand error, where we try to parse stdout to get the port where Spark started and there's nothing there. This commit properly catches the error and throws an exception that includes the stderr output for much easier debugging. Thanks to @shivaram and @stogers for helping to fix this issue! Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #383 from kayousterhout/pyspark and squashes the following commits: 36dd54b [Kay Ousterhout] [SPARK-1466] Raise exception if Gateway process doesn't start.
-
Reynold Xin authored
A follow up on #1103 @andrewor14 Author: Reynold Xin <rxin@apache.org> Closes #1117 from rxin/SPARK-2162 and squashes the following commits: a4231de [Reynold Xin] Updated the comment for SPARK-2162.
-
Raymond Liu authored
other wise, it will either read in vain in memory level case, or throw exception in disk level case when it believe the block is there while actually it had been removed. Author: Raymond Liu <raymond.liu@intel.com> Closes #1103 from colorant/bm and squashes the following commits: daac114 [Raymond Liu] Address comments d1ea287 [Raymond Liu] Double check in doGetLocal to avoid read on removed block.
-
Yin Huai authored
``` hql("explain select * from src group by key").collect().foreach(println) [ExplainCommand [plan#27:0]] [ Aggregate false, [key#25], [key#25,value#26]] [ Exchange (HashPartitioning [key#25:0], 200)] [ Exchange (HashPartitioning [key#25:0], 200)] [ Aggregate true, [key#25], [key#25]] [ HiveTableScan [key#25,value#26], (MetastoreRelation default, src, None), None] ``` There are two exchange operators. However, if we do not use explain... ``` hql("select * from src group by key") res4: org.apache.spark.sql.SchemaRDD = SchemaRDD[8] at RDD at SchemaRDD.scala:100 == Query Plan == Aggregate false, [key#8], [key#8,value#9] Exchange (HashPartitioning [key#8:0], 200) Aggregate true, [key#8], [key#8] HiveTableScan [key#8,value#9], (MetastoreRelation default, src, None), None ``` The plan is fine. The cause of this bug is explained below. When we create an `execution.ExplainCommand`, we use the `executedPlan` as the child of this `ExplainCommand`. But, this `executedPlan` is prepared for execution again when we generate the `executedPlan` for the `ExplainCommand`. Basically, `prepareForExecution` is called twice on a physical plan. Because after `prepareForExecution` we have already bounded those references (in `BoundReference`s), `AddExchange` cannot figure out we are using the same partitioning (we use `AttributeReference`s to create an `ExchangeOperator` and then those references will be changed to `BoundReference`s after `prepareForExecution` is called). So, an extra `ExchangeOperator` is inserted. I think in `CommandStrategy`, we should just use the `sparkPlan` (`sparkPlan` is the input of `prepareForExecution`) to initialize the `ExplainCommand` instead of using `executedPlan`. The link to JIRA: https://issues.apache.org/jira/browse/SPARK-2176 Author: Yin Huai <huai@cse.ohio-state.edu> Closes #1116 from yhuai/SPARK-2176 and squashes the following commits: 197c19c [Yin Huai] Use sparkPlan to initialize a Physical Explain Command instead of using executedPlan.
-
Vadim Chekan authored
I got "java.util.NoSuchElementException: key not found: 1401756085000 ms" exception when using kafka stream and 1 sec batchPeriod. Investigation showed that the reason is that ReceiverLauncher.startReceivers is asynchronous (started in a thread). https://github.com/vchekan/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala#L206 In case of slow starting receiver, such as Kafka, it easily takes more than 2sec to start. In result, no single "compute" will be called on ReceiverInputDStream before first batch job is executed and receivedBlockInfo remains empty (obviously). Batch job will cause ReceiverInputDStream.getReceivedBlockInfo call and "key not found" exception. The patch makes getReceivedBlockInfo more robust by tolerating missing values. Author: Vadim Chekan <kot.begemot@gmail.com> Closes #961 from vchekan/branch-1.0 and squashes the following commits: e86f82b [Vadim Chekan] Fixed indentation 4609563 [Vadim Chekan] Key not found exception: if receiver is slow to start, it is possible that getReceivedBlockInfo will be called before compute has been called (cherry picked from commit 26f6b989) Signed-off-by:
Patrick Wendell <pwendell@gmail.com>
-