  1. Jun 11, 2014
    • Prashant Sharma's avatar
      [SPARK-2069] MIMA false positives · 5b754b45
      Prashant Sharma authored
      Fixes SPARK-2070 and SPARK-2071
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #1021 from ScrapCodes/SPARK-2070/package-private-methods and squashes the following commits:
      
      7979a57 [Prashant Sharma] addressed code review comments
      558546d [Prashant Sharma] A little fancy error message.
      59275ab [Prashant Sharma] SPARK-2071 Mima ignores classes and its members from previous versions too.
      0c4ff2b [Prashant Sharma] SPARK-2070 Ignore methods along with annotated classes.
      5b754b45
    • Sandy Ryza's avatar
      SPARK-1639. Tidy up some Spark on YARN code · 2a4225dd
      Sandy Ryza authored
      This contains a bunch of small tidyings of the Spark on YARN code.
      
      I focused on the yarn stable code.  @tgravescs, let me know if you'd like me to make these for the alpha code as well.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #561 from sryza/sandy-spark-1639 and squashes the following commits:
      
      72b6a02 [Sandy Ryza] Fix comment and set name on driver thread
      c2190b2 [Sandy Ryza] SPARK-1639. Tidy up some Spark on YARN code
      2a4225dd
    • Qiuzhuang.Lian's avatar
      SPARK-2107: FilterPushdownSuite doesn't need Junit jar. · 6e119303
      Qiuzhuang.Lian authored
      Author: Qiuzhuang.Lian <Qiuzhuang.Lian@gmail.com>
      
      Closes #1046 from Qiuzhuang/master and squashes the following commits:
      
      0a9921a [Qiuzhuang.Lian] SPARK-2107: FilterPushdownSuite doesn't need Junit jar.
      6e119303
    • Xiangrui Meng's avatar
      [SPARK-2091][MLLIB] use numpy.dot instead of ndarray.dot · 0f1dc3a7
      Xiangrui Meng authored
      `ndarray.dot` is not available in numpy 1.4. This PR makes pyspark/mllib compatible with numpy 1.4.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1035 from mengxr/numpy-1.4 and squashes the following commits:
      
      7ad2f0c [Xiangrui Meng] use numpy.dot instead of ndarray.dot
      0f1dc3a7
    • Cheng Lian's avatar
      [SPARK-1968][SQL] SQL/HiveQL command for caching/uncaching tables · 0266a0c8
      Cheng Lian authored
      JIRA issue: [SPARK-1968](https://issues.apache.org/jira/browse/SPARK-1968)
      
      This PR added support for SQL/HiveQL commands for caching/uncaching tables:
      
      ```
      scala> sql("CACHE TABLE src")
      ...
      res0: org.apache.spark.sql.SchemaRDD =
      SchemaRDD[0] at RDD at SchemaRDD.scala:98
      == Query Plan ==
      CacheCommandPhysical src, true
      
      scala> table("src")
      ...
      res1: org.apache.spark.sql.SchemaRDD =
      SchemaRDD[3] at RDD at SchemaRDD.scala:98
      == Query Plan ==
      InMemoryColumnarTableScan [key#0,value#1], (HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None), false
      
      scala> isCached("src")
      res2: Boolean = true
      
      scala> sql("CACHE TABLE src")
      ...
      res3: org.apache.spark.sql.SchemaRDD =
      SchemaRDD[4] at RDD at SchemaRDD.scala:98
      == Query Plan ==
      CacheCommandPhysical src, false
      
      scala> table("src")
      ...
      res4: org.apache.spark.sql.SchemaRDD =
      SchemaRDD[11] at RDD at SchemaRDD.scala:98
      == Query Plan ==
      HiveTableScan [key#2,value#3], (MetastoreRelation default, src, None), None
      
      scala> isCached("src")
      res5: Boolean = false
      ```
      
      Things also work for `hql`.
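
      A minimal sketch of the same flow through `hql()`, assuming a spark-shell session with a `HiveContext` bound as `hiveContext` and a Hive table named `src`:

      ```scala
      // Hedged sketch: caching and uncaching via HiveQL statements on a HiveContext.
      import hiveContext._

      hql("CACHE TABLE src")
      isCached("src")          // Boolean = true; scans now use the in-memory columnar format
      hql("UNCACHE TABLE src")
      isCached("src")          // Boolean = false; scans fall back to HiveTableScan
      ```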
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1038 from liancheng/sqlCacheTable and squashes the following commits:
      
      ecb7194 [Cheng Lian] Trimmed the SQL string before parsing special commands
      6f4ce42 [Cheng Lian] Moved logical command classes to a separate file
      3458a24 [Cheng Lian] Added comment for public API
      f0ffacc [Cheng Lian] Added isCached() predicate
      15ec6d2 [Cheng Lian] Added "(UN)CACHE TABLE" SQL/HiveQL statements
      0266a0c8
    • Takuya UESHIN's avatar
      [SPARK-2093] [SQL] NullPropagation should use exact type value. · 0402bd77
      Takuya UESHIN authored
      `NullPropagation` should use the exact type value when transforming `Count` or `Sum`.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #1034 from ueshin/issues/SPARK-2093 and squashes the following commits:
      
      65b6ff1 [Takuya UESHIN] Modify the literal value of the result of transformation from Sum to long value.
      830c20b [Takuya UESHIN] Add Cast to the result of transformation from Count.
      9314806 [Takuya UESHIN] Fix NullPropagation to use exact type value.
      0402bd77
  2. Jun 10, 2014
    • Zongheng Yang's avatar
      HOTFIX: clear() configs in SQLConf-related unit tests. · 601032f5
      Zongheng Yang authored
      Thanks goes to @liancheng, who pointed out that `sql/test-only *.SQLConfSuite *.SQLQuerySuite` passed but `sql/test-only *.SQLQuerySuite *.SQLConfSuite` failed. The reason is that some tests use the same test keys and without clear()'ing, they get carried over to other tests. This hotfix simply adds some `clear()` calls.
      
      This problem was not evident on Jenkins before, probably because `parallelExecution` is not set to `false` for `sqlCoreSettings`.
      
      Author: Zongheng Yang <zongheng.y@gmail.com>
      
      Closes #1040 from concretevitamin/sqlconf-tests and squashes the following commits:
      
      6d14ceb [Zongheng Yang] HOTFIX: clear() confs in SQLConf related unit tests.
      601032f5
    • Nicholas Chammas's avatar
      [SPARK-2065] give launched instances names · a2052a44
      Nicholas Chammas authored
      This update resolves [SPARK-2065](https://issues.apache.org/jira/browse/SPARK-2065). It gives launched EC2 instances descriptive names by using instance tags. Launched instances now show up in the EC2 console with these names.
      
      I used `format()` with named parameters, which I believe is the recommended practice for string formatting in Python, but which doesn’t seem to be used elsewhere in the script.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      Author: nchammas <nicholas.chammas@gmail.com>
      
      Closes #1043 from nchammas/master and squashes the following commits:
      
      69f6e22 [Nicholas Chammas] PEP8 fixes
      2627247 [Nicholas Chammas] broke up lines before they hit 100 chars
      6544b7e [Nicholas Chammas] [SPARK-2065] give launched instances names
      69da6cf [nchammas] Merge pull request #1 from apache/master
      a2052a44
    • witgo's avatar
      Resolve scalatest warnings during build · c48b6222
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #1032 from witgo/ShouldMatchers and squashes the following commits:
      
      7ebf34c [witgo] Resolve scalatest warnings during build
      c48b6222
    • Tathagata Das's avatar
      [SPARK-1940] Enabling rolling of executor logs, and automatic cleanup of old executor logs · 4823bf47
      Tathagata Das authored
      Currently, in the default log4j configuration, all the executor logs get sent to the file <code>[executor-working-dir]/stderr</code>. This does not allow log files to be rolled, so old logs cannot be removed.
      
      Using log4j's RollingFileAppender allows log4j logs to be rolled, but all the logs get sent to a different set of files, other than <code>stdout</code> and <code>stderr</code>. So the logs are no longer visible in the Spark web UI, as the web UI only reads the files <code>stdout</code> and <code>stderr</code>. Furthermore, it still does not allow stdout and stderr to be cleared periodically in case a large amount of output gets written to them (e.g. by an explicit `println` inside a map function).
      
      This PR solves this by implementing a simple `RollingFileAppender` within Spark (disabled by default). When enabled (using the configuration parameter `spark.executor.rollingLogs.enabled`), the logs can get rolled over either by time interval (set with `spark.executor.rollingLogs.interval`, daily by default) or by log size (set with `spark.executor.rollingLogs.size`). Finally, old logs can be automatically deleted by specifying how many of the latest log files to keep (set with `spark.executor.rollingLogs.keepLastN`). The web UI has also been modified to show the logs across the rolled-over files.
      
      You can test this locally (without waiting a whole day) by setting the configuration `spark.executor.rollingLogs.enabled=true` and `spark.executor.rollingLogs.interval=minutely`. Continuously generate logs by running Spark jobs, and the generated log files will look like this (`stderr` and `stdout` are the most recent log files, the ones currently being written to).
      
      ```
      stderr
      stderr--2014-05-27--14-37
      stderr--2014-05-27--14-47
      stderr--2014-05-27--15-05
      stdout
      stdout--2014-05-27--14-47
      ```
      
      The web UI should show logs across these files.
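
      A minimal sketch (not taken from this PR) of that local test setup expressed through `SparkConf`, using the configuration keys quoted above; note the commit list below mentions the parameter names were changed during review, so treat these keys as illustrative:

      ```scala
      // Hedged sketch: rolling executor logs configured for quick local testing.
      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .set("spark.executor.rollingLogs.enabled", "true")      // turn the rolling appender on
        .set("spark.executor.rollingLogs.interval", "minutely") // roll every minute instead of daily
        .set("spark.executor.rollingLogs.keepLastN", "5")       // keep only the five newest rolled files
      ```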
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #895 from tdas/rolling-logs and squashes the following commits:
      
      fd8f87f [Tathagata Das] Minor change.
      d326aee [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs
      ad956c1 [Tathagata Das] Scala style fix.
      1f0a6ec [Tathagata Das] Some more changes based on Patrick's PR comments.
      c8bfe4e [Tathagata Das] Refactore FileAppender to a package spark.util.logging and broke up the file into multiple files. Changed configuration parameter names.
      4224409 [Tathagata Das] Style fix.
      108a9f8 [Tathagata Das] Added better constraint handling for rolling policies.
      f7da977 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs
      9134495 [Tathagata Das] Simplified rolling logs by removing Daily/Hourly/MinutelyRollingFileAppender, and removing the setting rollingLogs.enabled
      312d874 [Tathagata Das] Minor fixes based on PR comments.
      8a67d83 [Tathagata Das] Fixed comments.
      b36cfd6 [Tathagata Das] Implemented RollingPolicy, TimeBasedRollingPolicy and SizeBasedRollingPolicy, and changed RollingFileAppender accordingly.
      b7e8272 [Tathagata Das] Style fix,
      374c9a9 [Tathagata Das] Added missing license.
      24354ea [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs
      6cc09c7 [Tathagata Das] Fixed bugs in rolling logs, and added more debug statements.
      adf4910 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs
      931f8fb [Tathagata Das] Changed log viewer in Spark web UI to handle rolling log files.
      cb4fb6d [Tathagata Das] Added FileAppender and RollingFileAppender to generate rolling executor logs.
      4823bf47
    • joyyoj's avatar
      [SPARK-1998] SparkFlumeEvent with body bigger than 1020 bytes are not re... · 29660443
      joyyoj authored
      A Flume event sent to Spark will fail if the body is too large and numHeaders is greater than zero.
      
      Author: joyyoj <sunshch@gmail.com>
      
      Closes #951 from joyyoj/master and squashes the following commits:
      
      f4660c5 [joyyoj] [SPARK-1998] SparkFlumeEvent with body bigger than 1020 bytes are not read properly
      29660443
    • egraldlo's avatar
      [SQL] Add average overflow test case from #978 · 1abbde0e
      egraldlo authored
      By @egraldlo.
      
      Author: egraldlo <egraldlo@gmail.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1033 from marmbrus/pr/978 and squashes the following commits:
      
      e228c5e [Michael Armbrust] Remove "test".
      762aeaf [Michael Armbrust] Remove unneeded rule. More descriptive name for test table.
      d414cd7 [egraldlo] fommatting issues
      1153f75 [egraldlo] do best to avoid overflowing in function avg().
      1abbde0e
    • Ankur Dave's avatar
      HOTFIX: Increase time limit for Bagel test · 55a0e87e
      Ankur Dave authored
      The test was timing out on some slow EC2 workers.
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #1037 from ankurdave/bagel-test-time-limit and squashes the following commits:
      
      67fd487 [Ankur Dave] Increase time limit for Bagel test
      55a0e87e
    • Patrick Wendell's avatar
      HOTFIX: Fix Python tests on Jenkins. · fb499be1
      Patrick Wendell authored
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #1036 from pwendell/jenkins-test and squashes the following commits:
      
      9c99856 [Patrick Wendell] Better output during tests
      71e7b74 [Patrick Wendell] Removing incorrect python path
      74984db [Patrick Wendell] HOTFIX: Allow PySpark tests to run on Jenkins.
      fb499be1
    • Cheng Hao's avatar
      [SPARK-2076][SQL] Pushdown the join filter & predication for outer join · db0c038a
      Cheng Hao authored
      Following the rule described in https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior, we can optimize the SQL join by pushing down the join predicate and the WHERE predicate.
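
      A hedged illustration (table and column names are made up, not from this PR): for a left outer join, a WHERE predicate on the row-preserving left side can be evaluated before the join, and the pushed-down plan should reflect that.

      ```scala
      // Illustrative only: the printed "== Query Plan ==" should show the filter on `a`
      // evaluated below the outer join rather than on the joined result.
      val joined = sql(
        "SELECT a.key, b.value FROM a LEFT OUTER JOIN b ON a.key = b.key WHERE a.key > 10")
      println(joined)
      ```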
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #1015 from chenghao-intel/join_predicate_push_down and squashes the following commits:
      
      10feff9 [Cheng Hao] fix bug of changing the join type in PredicatePushDownThroughJoin
      44c6700 [Cheng Hao] Add logical to support pushdown the join filter
      0bce426 [Cheng Hao] Pushdown the join filter & predicate for outer join
      db0c038a
    • witgo's avatar
      [SPARK-1978] In some cases, spark-yarn does not automatically restart the failed container · 884ca718
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #921 from witgo/allocateExecutors and squashes the following commits:
      
      bc3aa66 [witgo] review commit
      8800eba [witgo] Merge branch 'master' of https://github.com/apache/spark into allocateExecutors
      32ac7af [witgo] review commit
      056b8c7 [witgo] Merge branch 'master' of https://github.com/apache/spark into allocateExecutors
      04c6f7e [witgo] Merge branch 'master' into allocateExecutors
      aff827c [witgo] review commit
      5c376e0 [witgo] Merge branch 'master' of https://github.com/apache/spark into allocateExecutors
      1faf4f4 [witgo] Merge branch 'master' into allocateExecutors
      3c464bd [witgo] add time limit to allocateExecutors
      e00b656 [witgo] In some cases, yarn does not automatically restart the container
      884ca718
    • Cheng Lian's avatar
      Moved hiveOperators.scala to the right package folder · a9a461c5
      Cheng Lian authored
      The package is `org.apache.spark.sql.hive.execution`, while the file was placed under `sql/hive/src/main/scala/org/apache/spark/sql/hive/`.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1029 from liancheng/moveHiveOperators and squashes the following commits:
      
      d632eb8 [Cheng Lian] Moved hiveOperators.scala to the right package folder
      a9a461c5
    • Zongheng Yang's avatar
      [SPARK-1508][SQL] Add SQLConf to SQLContext. · 08ed9ad8
      Zongheng Yang authored
      This PR (1) introduces a new class, SQLConf, that stores key-value properties for a SQLContext, and (2) cleans up the semantics of various forms of SET commands.
      
      The SQLConf class unlocks user-controllable optimization opportunities; for example, a user can now override the number of partitions used during an Exchange. A SQLConf can be accessed and modified programmatically through its getters and setters. It can also be modified through SET commands executed by `sql()` or `hql()`. Note that users now have the ability to change a particular property for different queries inside the same Spark job, unlike settings configured in SparkConf.
      
      For SET commands: "SET" will return all properties currently set in a SQLConf, "SET key" will return the key-value pair (if set) or an undefined message, and "SET key=value" will call the setter on SQLConf; if a HiveContext is used, the command is also executed in Hive.
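
      A minimal sketch of the three SET forms described above, as run from spark-shell (the property key is just an example):

      ```scala
      sql("SET spark.sql.shuffle.partitions=10")  // setter: updates the SQLConf (and Hive, for a HiveContext)
      sql("SET spark.sql.shuffle.partitions")     // single key: returns the key=value pair, or an undefined message
      sql("SET")                                  // no argument: returns all properties currently set in the SQLConf
      ```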
      
      Author: Zongheng Yang <zongheng.y@gmail.com>
      
      Closes #956 from concretevitamin/sqlconf and squashes the following commits:
      
      4968c11 [Zongheng Yang] Very minor cleanup.
      d74dde5 [Zongheng Yang] Remove the redundant mkQueryExecution() method.
      c129b86 [Zongheng Yang] Merge remote-tracking branch 'upstream/master' into sqlconf
      26c40eb [Zongheng Yang] Make SQLConf a trait and have SQLContext mix it in.
      dd19666 [Zongheng Yang] Update a comment.
      baa5d29 [Zongheng Yang] Remove default param for shuffle partitions accessor.
      5f7e6d8 [Zongheng Yang] Add default num partitions.
      22d9ed7 [Zongheng Yang] Fix output() of Set physical. Add SQLConf param accessor method.
      e9856c4 [Zongheng Yang] Use java.util.Collections.synchronizedMap on a Java HashMap.
      88dd0c8 [Zongheng Yang] Remove redundant SET Keyword.
      271f0b1 [Zongheng Yang] Minor change.
      f8983d1 [Zongheng Yang] Minor changes per review comments.
      1ce8a5e [Zongheng Yang] Invoke runSqlHive() in SQLConf#get for the HiveContext case.
      b766af9 [Zongheng Yang] Remove a test.
      d52e1bd [Zongheng Yang] De-hardcode number of shuffle partitions for BasicOperators (read from SQLConf).
      555599c [Zongheng Yang] Bullet-proof (relatively) parsing SET per review comment.
      c2067e8 [Zongheng Yang] Mark SQLContext transient and put it in a second param list.
      2ea8cdc [Zongheng Yang] Wrap long line.
      41d7f09 [Zongheng Yang] Fix imports.
      13279e6 [Zongheng Yang] Refactor the logic of eagerly processing SET commands.
      b14b83e [Zongheng Yang] In a HiveContext, make SQLConf a subset of HiveConf.
      6983180 [Zongheng Yang] Move a SET test to SQLQuerySuite and make it complete.
      5b67985 [Zongheng Yang] New line at EOF.
      c651797 [Zongheng Yang] Add commands.scala.
      efd82db [Zongheng Yang] Clean up semantics of several cases of SET.
      c1017c2 [Zongheng Yang] WIP in changing SetCommand to take two Options (for different semantics of SETs).
      0f00d86 [Zongheng Yang] Add a test for singleton set command in SQL.
      41acd75 [Zongheng Yang] Add a test for hql() in HiveQuerySuite.
      2276929 [Zongheng Yang] Fix default hive result for set commands in HiveComparisonTest.
      3b0c71b [Zongheng Yang] Remove Parser for set commands. A few other fixes.
      d0c4578 [Zongheng Yang] Tmux typo.
      0ecea46 [Zongheng Yang] Changes for HiveQl and HiveContext.
      ce22d80 [Zongheng Yang] Fix parsing issues.
      cb722c1 [Zongheng Yang] Finish up SQLConf patch.
      4ebf362 [Zongheng Yang] First cut at SQLConf inside SQLContext.
      08ed9ad8
    • Nick Pentreath's avatar
      SPARK-1416: PySpark support for SequenceFile and Hadoop InputFormats · f971d6cb
      Nick Pentreath authored
      So I finally resurrected this PR. It seems the old one against the incubator mirror is no longer available, so I cannot reference it.
      
      This adds initial support for reading Hadoop ```SequenceFile```s, as well as arbitrary Hadoop ```InputFormat```s, in PySpark.
      
      # Overview
      The basics are as follows:
      1. The ```PythonRDD``` object contains the relevant methods, which are in turn invoked by ```SparkContext``` in PySpark
      2. The SequenceFile or InputFormat is read on the Scala side and converted from ```Writable``` instances to the relevant Scala classes (in the case of primitives)
      3. Pyrolite is used to serialize Java objects. If this fails, the fallback is ```toString```
      4. ```PickleSerializer``` on the Python side deserializes.
      
      This works "out the box" for simple ```Writable```s:
      * ```Text```
      * ```IntWritable```, ```DoubleWritable```, ```FloatWritable```
      * ```NullWritable```
      * ```BooleanWritable```
      * ```BytesWritable```
      * ```MapWritable```
      
      It also works for simple, "struct-like" classes. Due to the way Pyrolite works, this requires that the classes satisfy the JavaBeans conventions (i.e. with fields and a no-arg constructor and getters/setters). (Perhaps in the future some sugar for case classes and reflection could be added.)
      
      I've tested it out with ```ESInputFormat``` as an example and it works very nicely:
      ```python
      conf = {"es.resource" : "index/type" }
      rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat", "org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
      rdd.first()
      ```
      
      I suspect that for things like HBase/Cassandra it will be a bit trickier to get it to work out of the box.
      
      # Some things still outstanding:
      1. ~~Requires ```msgpack-python``` and will fail without it. As originally discussed with Josh, add a ```as_strings``` argument that defaults to ```False```, that can be used if ```msgpack-python``` is not available~~
      2. ~~I see from https://github.com/apache/spark/pull/363 that Pyrolite is being used there for SerDe between Scala and Python. @ahirreddy @mateiz what is the plan behind this - is Pyrolite preferred? It seems from a cursory glance that adapting the ```msgpack```-based SerDe here to use Pyrolite wouldn't be too hard~~
      3. ~~Support the key and value "wrapper" that would allow a Scala/Java function to be plugged in that would transform whatever the key/value Writable class is into something that can be serialized (e.g. convert some custom Writable to a JavaBean or ```java.util.Map``` that can be easily serialized)~~
      4. Support ```saveAsSequenceFile``` and ```saveAsHadoopFile``` etc. This would require SerDe in the reverse direction, that can be handled by Pyrolite. Will work on this as a separate PR
      
      Author: Nick Pentreath <nick.pentreath@gmail.com>
      
      Closes #455 from MLnick/pyspark-inputformats and squashes the following commits:
      
      268df7e [Nick Pentreath] Documentation changes mer @pwendell comments
      761269b [Nick Pentreath] Address @pwendell comments, simplify default writable conversions and remove registry.
      4c972d8 [Nick Pentreath] Add license headers
      d150431 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      cde6af9 [Nick Pentreath] Parameterize converter trait
      5ebacfa [Nick Pentreath] Update docs for PySpark input formats
      a985492 [Nick Pentreath] Move Converter examples to own package
      365d0be [Nick Pentreath] Make classes private[python]. Add docs and @Experimental annotation to Converter interface.
      eeb8205 [Nick Pentreath] Fix path relative to SPARK_HOME in tests
      1eaa08b [Nick Pentreath] HBase -> Cassandra app name oversight
      3f90c3e [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      2c18513 [Nick Pentreath] Add examples for reading HBase and Cassandra InputFormats from Python
      b65606f [Nick Pentreath] Add converter interface
      5757f6e [Nick Pentreath] Default key/value classes for sequenceFile asre None
      085b55f [Nick Pentreath] Move input format tests to tests.py and clean up docs
      43eb728 [Nick Pentreath] PySpark InputFormats docs into programming guide
      94beedc [Nick Pentreath] Clean up args in PythonRDD. Set key/value converter defaults to None for PySpark context.py methods
      1a4a1d6 [Nick Pentreath] Address @mateiz style comments
      01e0813 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      15a7d07 [Nick Pentreath] Remove default args for key/value classes. Arg names to camelCase
      9fe6bd5 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      84fe8e3 [Nick Pentreath] Python programming guide space formatting
      d0f52b6 [Nick Pentreath] Python programming guide
      7caa73a [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      93ef995 [Nick Pentreath] Add back context.py changes
      9ef1896 [Nick Pentreath] Recover earlier changes lost in previous merge for serializers.py
      077ecb2 [Nick Pentreath] Recover earlier changes lost in previous merge for context.py
      5af4770 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      35b8e3a [Nick Pentreath] Another fix for test ordering
      bef3afb [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      e001b94 [Nick Pentreath] Fix test failures due to ordering
      78978d9 [Nick Pentreath] Add doc for SequenceFile and InputFormat support to Python programming guide
      64eb051 [Nick Pentreath] Scalastyle fix
      e7552fa [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      44f2857 [Nick Pentreath] Remove msgpack dependency and switch serialization to Pyrolite, plus some clean up and refactoring
      c0ebfb6 [Nick Pentreath] Change sequencefile test data generator to easily be called from PySpark tests
      1d7c17c [Nick Pentreath] Amend tests to auto-generate sequencefile data in temp dir
      17a656b [Nick Pentreath] remove binary sequencefile for tests
      f60959e [Nick Pentreath] Remove msgpack dependency and serializer from PySpark
      450e0a2 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      31a2fff [Nick Pentreath] Scalastyle fixes
      fc5099e [Nick Pentreath] Add Apache license headers
      4e08983 [Nick Pentreath] Clean up docs for PySpark context methods
      b20ec7e [Nick Pentreath] Clean up merge duplicate dependencies
      951c117 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      f6aac55 [Nick Pentreath] Bring back msgpack
      9d2256e [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      1bbbfb0 [Nick Pentreath] Clean up SparkBuild from merge
      a67dfad [Nick Pentreath] Clean up Msgpack serialization and registering
      7237263 [Nick Pentreath] Add back msgpack serializer and hadoop file code lost during merging
      25da1ca [Nick Pentreath] Add generator for nulls, bools, bytes and maps
      65360d5 [Nick Pentreath] Adding test SequenceFiles
      0c612e5 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      d72bf18 [Nick Pentreath] msgpack
      dd57922 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      e67212a [Nick Pentreath] Add back msgpack dependency
      f2d76a0 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      41856a5 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      97ef708 [Nick Pentreath] Remove old writeToStream
      2beeedb [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      795a763 [Nick Pentreath] Change name to WriteInputFormatTestDataGenerator. Cleanup some var names. Use SPARK_HOME in path for writing test sequencefile data.
      174f520 [Nick Pentreath] Add back graphx settings
      703ee65 [Nick Pentreath] Add back msgpack
      619c0fa [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      1c8efbc [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      eb40036 [Nick Pentreath] Remove unused comment lines
      4d7ef2e [Nick Pentreath] Fix indentation
      f1d73e3 [Nick Pentreath] mergeConfs returns a copy rather than mutating one of the input arguments
      0f5cd84 [Nick Pentreath] Remove unused pair UTF8 class. Add comments to msgpack deserializer
      4294cbb [Nick Pentreath] Add old Hadoop api methods. Clean up and expand comments. Clean up argument names
      818a1e6 [Nick Pentreath] Add seqencefile and Hadoop InputFormat support to PythonRDD
      4e7c9e3 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      c304cc8 [Nick Pentreath] Adding supporting sequncefiles for tests. Cleaning up
      4b0a43f [Nick Pentreath] Refactoring utils into own objects. Cleaning up old commented-out code
      d86325f [Nick Pentreath] Initial WIP of PySpark support for SequenceFile and arbitrary Hadoop InputFormat
      f971d6cb
    • DB Tsai's avatar
      Make sure that empty string is filtered out when we get the secondary jars from conf · 6f2db8c2
      DB Tsai authored
      Author: DB Tsai <dbtsai@dbtsai.com>
      
      Closes #1027 from dbtsai/dbtsai-classloader and squashes the following commits:
      
      9ac6be3 [DB Tsai] Fixed line too long
      c9c7ad7 [DB Tsai] Make sure that empty string is filtered out when we get the secondary jars from conf.
      6f2db8c2
  3. Jun 09, 2014
    • Zongheng Yang's avatar
      [SPARK-1704][SQL] Fully support EXPLAIN commands as SchemaRDD. · a9ec033c
      Zongheng Yang authored
      This PR attempts to resolve [SPARK-1704](https://issues.apache.org/jira/browse/SPARK-1704) by introducing a physical plan for EXPLAIN commands, which just prints out the debug string (containing various SparkSQL's plans) of the corresponding QueryExecution for the actual query.
      
      Author: Zongheng Yang <zongheng.y@gmail.com>
      
      Closes #1003 from concretevitamin/explain-cmd and squashes the following commits:
      
      5b7911f [Zongheng Yang] Add a regression test.
      1bfa379 [Zongheng Yang] Modify output().
      719ada9 [Zongheng Yang] Override otherCopyArgs for ExplainCommandPhysical.
      4318fd7 [Zongheng Yang] Make all output one Row.
      439c6ab [Zongheng Yang] Minor cleanups.
      408f574 [Zongheng Yang] SPARK-1704: Add CommandStrategy and ExplainCommandPhysical.
      a9ec033c
    • Michael Armbrust's avatar
      [SQL] Simple framework for debugging query execution · c6e041d1
      Michael Armbrust authored
      Only records the number of tuples and the unique dataTypes output right now...
      
      Example:
      ```scala
      scala> import org.apache.spark.sql.execution.debug._
      scala> hql("SELECT value FROM src WHERE key > 10").debug(sparkContext)
      
      Results returned: 489
      == Project [value#1:0] ==
      Tuples output: 489
       value StringType: {java.lang.String}
      == Filter (key#0:1 > 10) ==
      Tuples output: 489
       value StringType: {java.lang.String}
       key IntegerType: {java.lang.Integer}
      == HiveTableScan [value#1,key#0], (MetastoreRelation default, src, None), None ==
      Tuples output: 500
       value StringType: {java.lang.String}
       key IntegerType: {java.lang.Integer}
      ```
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1005 from marmbrus/debug and squashes the following commits:
      
      dcc3ca6 [Michael Armbrust] Add comments.
      c9dded2 [Michael Armbrust] Simple framework for debugging query execution
      c6e041d1
    • Bernardo Gomez Palacio's avatar
      [SPARK-1522] : YARN ClientBase throws a NPE if there is no YARN Application CP · e2734476
      Bernardo Gomez Palacio authored
      The current implementation of ClientBase.getDefaultYarnApplicationClasspath inspects
      the MRJobConfig class for the field DEFAULT_YARN_APPLICATION_CLASSPATH when it should
      really be looking into YarnConfiguration. If the application configuration has no
      yarn.application.classpath defined, an NPE will be thrown.
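
      A minimal sketch of the intended defaulting behaviour, assuming a Hadoop 2.x YarnConfiguration that exposes DEFAULT_YARN_APPLICATION_CLASSPATH (the PR's fix looks the field up on YarnConfiguration rather than MRJobConfig):

      ```scala
      import org.apache.hadoop.yarn.conf.YarnConfiguration

      // Fall back to YARN's own default classpath when yarn.application.classpath
      // is not set, instead of throwing an NPE.
      def applicationClasspath(conf: YarnConfiguration): Seq[String] =
        Option(conf.getStrings(YarnConfiguration.YARN_APPLICATION_CLASSPATH)) // null when unset
          .map(_.toSeq)
          .getOrElse(YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH.toSeq)
      ```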
      
      Additional Changes include:
      * Test Suite for ClientBase added
      
      [ticket: SPARK-1522] : https://issues.apache.org/jira/browse/SPARK-1522
      
      Author      : bernardo.gomezpalacio@gmail.com
      Testing     : SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true ./sbt/sbt test
      
      Author: Bernardo Gomez Palacio <bernardo.gomezpalacio@gmail.com>
      
      Closes #433 from berngp/feature/SPARK-1522 and squashes the following commits:
      
      2c2e118 [Bernardo Gomez Palacio] [SPARK-1522]: YARN ClientBase throws a NPE if there is no YARN Application specific CP
      e2734476
    • Kay Ousterhout's avatar
      Added a TaskSetManager unit test. · 6cf335d7
      Kay Ousterhout authored
      This test ensures that when there are no
      alive executors that satisfy a particular locality level,
      the TaskSetManager doesn't ever use that as the maximum
      allowed locality level (this optimization ensures that a
      job doesn't wait extra time in an attempt to satisfy
      a scheduling locality level that is impossible).
      
      @mateiz and @lirui-intel this unit test illustrates an issue
      with #892 (it fails with that patch).
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #1024 from kayousterhout/scheduler_unit_test and squashes the following commits:
      
      de6a08f [Kay Ousterhout] Added a TaskSetManager unit test.
      6cf335d7
    • Daoyuan's avatar
      [SPARK-1495][SQL]add support for left semi join · 0cf60028
      Daoyuan authored
      Just submit another solution for #395
      
      Author: Daoyuan <daoyuan.wang@intel.com>
      Author: Michael Armbrust <michael@databricks.com>
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #837 from adrian-wang/left-semi-join-support and squashes the following commits:
      
      d39cd12 [Daoyuan Wang] Merge pull request #1 from marmbrus/pr/837
      6713c09 [Michael Armbrust] Better debugging for failed query tests.
      035b73e [Michael Armbrust] Add test for left semi that can't be done with a hash join.
      5ec6fa4 [Michael Armbrust] Add left semi to SQL Parser.
      4c726e5 [Daoyuan] improvement according to Michael
      8d4a121 [Daoyuan] add golden files for leftsemijoin
      83a3c8a [Daoyuan] scala style fix
      14cff80 [Daoyuan] add support for left semi join
      0cf60028
    • Andrew Ash's avatar
      SPARK-1944 Document --verbose in spark-shell -h · 35630c86
      Andrew Ash authored
      https://issues.apache.org/jira/browse/SPARK-1944
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #1020 from ash211/SPARK-1944 and squashes the following commits:
      
      a831c4d [Andrew Ash] SPARK-1944 Document --verbose in spark-shell -h
      35630c86
    • Syed Hashmi's avatar
      [SPARK-1308] Add getNumPartitions to pyspark RDD · 6113ac15
      Syed Hashmi authored
      Add getNumPartitions to pyspark RDD to provide an intuitive way to get the number of partitions of an RDD, as we can do in Scala today.
      
      Author: Syed Hashmi <shashmi@cloudera.com>
      
      Closes #995 from syedhashmi/master and squashes the following commits:
      
      de0ed5e [Syed Hashmi] [SPARK-1308] Add getNumPartitions to pyspark RDD
      6113ac15
    • Andrew Ash's avatar
      Grammar: read -> reads · 32ee9f06
      Andrew Ash authored
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #1016 from ash211/patch-6 and squashes the following commits:
      
      e3865c8 [Andrew Ash] Grammar: read -> reads
      32ee9f06
    • Neville Li's avatar
      [SPARK-2067] use relative path for Spark logo in UI · 15ddbef4
      Neville Li authored
      Author: Neville Li <neville@spotify.com>
      
      Closes #1006 from nevillelyh/gh/SPARK-2067 and squashes the following commits:
      
      9ee64cf [Neville Li] [SPARK-2067] use relative path for Spark logo in UI
      15ddbef4
  4. Jun 08, 2014
    • Reynold Xin's avatar
      SPARK-1628 follow up: Improve RangePartitioner's documentation. · 219dc00b
      Reynold Xin authored
      Adding a paragraph clarifying a weird behavior in RangePartitioner.
      
      See also #549.
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1012 from rxin/partitioner-doc and squashes the following commits:
      
      6f0109e [Reynold Xin] SPARK-1628 follow up: Improve RangePartitioner's documentation.
      219dc00b
    • maji2014's avatar
      Update run-example · e9261d08
      maji2014 authored
      The old code can only be run under spark_home using "bin/run-example".
      The error "./run-example: line 55: ./bin/spark-submit: No such file or directory" appears when running from any other location, so change this.
      
      Author: maji2014 <maji3@asiainfo-linkage.com>
      
      Closes #1011 from maji2014/master and squashes the following commits:
      
      2cc1af6 [maji2014] Update run-example
      
      Closes #988.
      e9261d08
    • zsxwing's avatar
      SPARK-1628: Add missing hashCode methods in Partitioner subclasses · a71c6d1c
      zsxwing authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-1628
      
      Added `hashCode` in HashPartitioner, RangePartitioner, PythonPartitioner and PageRankUtils.CustomPartitioner.
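
      A minimal sketch (not Spark's actual HashPartitioner) of the equals/hashCode contract this change enforces: partitioners that compare equal must also hash identically.

      ```scala
      import org.apache.spark.Partitioner

      class SimpleHashPartitioner(val partitions: Int) extends Partitioner {
        def numPartitions: Int = partitions

        def getPartition(key: Any): Int = {
          val h = if (key == null) 0 else key.hashCode
          ((h % partitions) + partitions) % partitions // non-negative modulo
        }

        override def equals(other: Any): Boolean = other match {
          case p: SimpleHashPartitioner => p.partitions == partitions
          case _ => false
        }

        // Without this override, two equal partitioners could produce different
        // hash codes, violating the hashCode/equals contract.
        override def hashCode: Int = partitions
      }
      ```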
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #549 from zsxwing/SPARK-1628 and squashes the following commits:
      
      2620936 [zsxwing] SPARK-1628: Add missing hashCode methods in Partitioner subclasses
      a71c6d1c
    • Colin Patrick McCabe's avatar
      SPARK-1898: In deploy.yarn.Client, use YarnClient not YarnClientImpl · ee96e940
      Colin Patrick McCabe authored
      https://issues.apache.org/jira/browse/SPARK-1898
      
      Author: Colin Patrick McCabe <cmccabe@cloudera.com>
      
      Closes #850 from cmccabe/master and squashes the following commits:
      
      d66eddc [Colin Patrick McCabe] SPARK-1898: In deploy.yarn.Client, use YarnClient rather than YarnClientImpl
      ee96e940
    • Bernardo Gomez Palacio's avatar
      SPARK-2026: Maven Hadoop Profiles Should Set The Hadoop Version · a338834f
      Bernardo Gomez Palacio authored
      The Maven Profiles that refer to hadoopX, e.g. `hadoop2.4`, should set the expected
      `hadoop.version` and `yarn.version`.
      
      e.g.
      
      ```
      <profile>
            <id>hadoop-2.4</id>
            <properties>
              <hadoop.version>2.4.0</hadoop.version>
              <yarn.version>${hadoop.version}</yarn.version>
              <protobuf.version>2.5.0</protobuf.version>
              <jets3t.version>0.9.0</jets3t.version>
            </properties>
      </profile>
      ```
      
      Builds can still define the `-Dhadoop.version` option, but this will correctly default the
      Hadoop version to the one expected according to the selected profile.
      
      e.g.
      
      ```$ mvn -P hadoop-2.4,yarn clean install```
      or
      
      ```$ mvn -P hadoop-0.23,yarn clean install```
      
      [ticket] : https://issues.apache.org/jira/browse/SPARK-2026
      
      Author      : berngp
      Reviewer    : ?
      
      Author: Bernardo Gomez Palacio <bernardo.gomezpalacio@gmail.com>
      
      Closes #998 from berngp/feature/SPARK-2026 and squashes the following commits:
      
      07ba4f7 [Bernardo Gomez Palacio] SPARK-2026: Maven Hadoop Profiles Should Set The Hadoop Version
      a338834f
  5. Jun 07, 2014
    • Neville Li's avatar
      SPARK-2056 Set RDD name to input path · 7b877b27
      Neville Li authored
      Author: Neville Li <neville@spotify.com>
      
      Closes #992 from nevillelyh/master and squashes the following commits:
      
      3011739 [Neville Li] [SPARK-2056] Set RDD name to input path
      7b877b27
    • Patrick Wendell's avatar
      HOTFIX: Support empty body in merge script · 3ace10dc
      Patrick Wendell authored
      Discovered in #992
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #1007 from pwendell/hotfix and squashes the following commits:
      
      af90aa0 [Patrick Wendell] HOTFIX: Support empty body in merge script
      3ace10dc
    • Michael Armbrust's avatar
      [SPARK-1994][SQL] Weird data corruption bug when running Spark SQL on data in HDFS · a6c72ab1
      Michael Armbrust authored
      Basically there is a race condition (possibly a scala bug?) when these values are recomputed on all of the slaves that results in an incorrect projection being generated (possibly because the GUID uniqueness contract is broken?).
      
      In general we should probably enforce that all expression planning occurs on the driver, as is now occurring here.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1004 from marmbrus/fixAggBug and squashes the following commits:
      
      e0c116c [Michael Armbrust] Compute aggregate expression during planning instead of lazily on workers.
      a6c72ab1
  6. Jun 06, 2014
    • witgo's avatar
      [SPARK-1841]: update scalatest to version 2.1.5 · 41c4a331
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #713 from witgo/scalatest and squashes the following commits:
      
      b627a6a [witgo] merge master
      51fb3d6 [witgo] merge master
      3771474 [witgo] fix RDDSuite
      996d6f9 [witgo] fix TimeStampedWeakValueHashMap test
      9dfa4e7 [witgo] merge bug
      1479b22 [witgo] merge master
      29b9194 [witgo] fix code style
      022a7a2 [witgo] fix test dependency
      a52c0fa [witgo] fix test dependency
      cd8f59d [witgo] Merge branch 'master' of https://github.com/apache/spark into scalatest
      046540d [witgo] fix RDDSuite.scala
      2c543b9 [witgo] fix ReplSuite.scala
      c458928 [witgo] update scalatest to version 2.1.5
      41c4a331
    • Michael Armbrust's avatar
      [SPARK-2050 - 2][SQL] DIV and BETWEEN should not be case sensitive. · 8d210560
      Michael Armbrust authored
      Followup: #989
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #994 from marmbrus/caseSensitiveFunctions2 and squashes the following commits:
      
      9d9c8ed [Michael Armbrust] Fix DIV and BETWEEN.
      8d210560
    • Ankur Dave's avatar
      [SPARK-1552] Fix type comparison bug in {map,outerJoin}Vertices · 8d85359f
      Ankur Dave authored
      In GraphImpl, mapVertices and outerJoinVertices use a more efficient implementation when the map function conserves vertex attribute types. This is implemented by comparing the ClassTags of the old and new vertex attribute types. However, ClassTags store erased types, so the comparison will return a false positive for types with different type parameters, such as Option[Int] and Option[Double].
      
      This PR resolves the problem by requesting that the compiler generate evidence of equality between the old and new vertex attribute types, and providing a default value for the evidence parameter if the two types are not equal. The methods can then check the value of the evidence parameter to see whether the types are equal.
      
      It also adds a test called "mapVertices changing type with same erased type" that failed before the PR and succeeds now.
      
      Callers of mapVertices and outerJoinVertices can no longer use a wildcard for a graph's VD type. To avoid "Error occurred in an application involving default arguments," they must bind VD to a type parameter, as this PR does for ShortestPaths and LabelPropagation.
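
      A minimal sketch of the evidence-with-default-argument trick described above (the method name and bodies are illustrative, not GraphX's actual code):

      ```scala
      // If the compiler can prove VD =:= VD2, `ev` is the implicit evidence;
      // otherwise the default null is used and the general path is taken.
      def mapValues[VD, VD2](f: VD => VD2)(implicit ev: VD =:= VD2 = null): String =
        if (ev != null) "fast path: vertex attribute type unchanged"
        else "general path: vertex attribute type changed"

      mapValues((x: Int) => x * 2)      // "fast path: vertex attribute type unchanged"
      mapValues((x: Int) => x.toString) // "general path: vertex attribute type changed"
      ```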
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #967 from ankurdave/SPARK-1552 and squashes the following commits:
      
      68a4fff [Ankur Dave] Undo conserve naming
      7388705 [Ankur Dave] Remove unnecessary ClassTag for VD parameters
      a704e5f [Ankur Dave] Use type equality constraint with default argument
      29a5ab7 [Ankur Dave] Add failing test
      f458c83 [Ankur Dave] Revert "[SPARK-1552] Fix type comparison bug in mapVertices and outerJoinVertices"
      16d6af8 [Ankur Dave] [SPARK-1552] Fix type comparison bug in mapVertices and outerJoinVertices
      8d85359f