  1. Jun 12, 2014
• SPARK-1843: Replace assemble-deps with env variable. · 1c04652c
      Patrick Wendell authored
      (This change is actually small, I moved some logic into
      compute-classpath that was previously in spark-class).
      
      Assemble deps has existed for a while to allow developers to
      run local code with new changes quickly. When I'm developing I
      typically use a simpler approach which just prepends the Spark
      classes to the classpath before the assembly jar. This is well
      defined in the JVM and the Spark classes take precedence over those
      in the assembly.
      
This approach is portable across both builds, which is the main reason I'd
like to switch to it. It's also a bit easier to toggle on and off quickly.
      
      The way you use this is the following:
      ```
      $ ./bin/spark-shell # Use spark with the normal assembly
      $ export SPARK_PREPEND_CLASSES=true
      $ ./bin/spark-shell # Now it's using compiled classes
      $ unset SPARK_PREPEND_CLASSES
      $ ./bin/spark-shell # Back to normal
      ```
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #877 from pwendell/assemble-deps and squashes the following commits:
      
      8a11345 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into assemble-deps
      faa3168 [Patrick Wendell] Adding a warning for compatibility
      3f151a7 [Patrick Wendell] Small fix
      bbfb73c [Patrick Wendell] Review feedback
      328e9f8 [Patrick Wendell] SPARK-1843: Replace assemble-deps with env variable.
      1c04652c
• [SPARK-2080] Yarn: report HS URL in client mode, correct user in cluster mode. · ecde5b83
      Marcelo Vanzin authored
      Yarn client mode was not setting the app's tracking URL to the
      History Server's URL when configured by the user. Now client mode
      behaves the same as cluster mode.
      
      In SparkContext.scala, the "user.name" system property had precedence
      over the SPARK_USER environment variable. This means that SPARK_USER
      was never used, since "user.name" is always set by the JVM. In Yarn
      cluster mode, this means the application always reported itself as
      being run by user "yarn" (or whatever user was running the Yarn NM).
      One could argue that the correct fix would be to use UGI.getCurrentUser()
      here, but at least for Yarn that will match what SPARK_USER is set
      to.
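
The fix boils down to checking the environment variable before the JVM system property. A minimal sketch of that ordering (illustrative only, not the exact SparkContext code):

```scala
// Prefer SPARK_USER when it is set and non-empty; otherwise fall back to the
// JVM-provided user.name system property.
val sparkUser: String =
  Option(System.getenv("SPARK_USER"))
    .filter(_.nonEmpty)
    .getOrElse(System.getProperty("user.name"))
```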
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Thomas Graves <tgraves@apache.org>
      
      Closes #1002 from vanzin/yarn-client-url and squashes the following commits:
      
      4046e04 [Marcelo Vanzin] Set HS link in yarn-alpha also.
      4c692d9 [Marcelo Vanzin] Yarn: report HS URL in client mode, correct user in cluster mode.
      ecde5b83
• [SPARK-2088] fix NPE in toString · 83c226d4
      Doris Xin authored
      After deserialization, the transient field creationSiteInfo does not get backfilled with the default value, but the toString method, which is invoked by the serializer, expects the field to always be non-null. An NPE is thrown when toString is called by the serializer when creationSiteInfo is null.
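
A hypothetical illustration of the pattern (not the actual RDD code): after Java deserialization the `@transient` field comes back as `null`, so `toString` has to be null-safe.

```scala
// Hypothetical class showing the bug pattern: a @transient field is not restored by
// Java serialization, so a toString that dereferences it blindly can throw an NPE.
class Tracked extends Serializable {
  @transient val creationSiteInfo: String = "created at ..."   // null after deserialization

  override def toString: String =
    Option(creationSiteInfo).getOrElse("<unknown creation site>")   // null-safe, the gist of the fix
}
```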
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1028 from dorx/toStringNPE and squashes the following commits:
      
f20021e [Doris Xin] unit test for toString after deserialization
      6f0a586 [Doris Xin] Merge branch 'master' into toStringNPE
      f47fecf [Doris Xin] Merge branch 'master' into toStringNPE
      76199c6 [Doris Xin] [SPARK-2088] fix NPE in toString
      83c226d4
• SPARK-554. Add aggregateByKey. · ce92a9c1
      Sandy Ryza authored
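A small usage sketch of the new API (assuming an existing SparkContext `sc`): computing a per-key sum and count in one pass.

```scala
import org.apache.spark.SparkContext._   // brings in the pair-RDD functions (Spark 1.x style)

// Assumes `sc` is an existing SparkContext.
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val sumCount = pairs.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: fold a value into the per-partition accumulator
  (l, r) => (l._1 + r._1, l._2 + r._2))   // combOp: merge accumulators across partitions
// sumCount contains ("a", (3, 2)) and ("b", (3, 1))
```
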
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #705 from sryza/sandy-spark-554 and squashes the following commits:
      
      2302b8f [Sandy Ryza] Add MIMA exclude
      f52e0ad [Sandy Ryza] Fix Python tests for real
      2f3afa3 [Sandy Ryza] Fix Python test
      0b735e9 [Sandy Ryza] Fix line lengths
      ae56746 [Sandy Ryza] Fix doc (replace T with V)
      c2be415 [Sandy Ryza] Java and Python aggregateByKey
      23bf400 [Sandy Ryza] SPARK-554.  Add aggregateByKey.
      ce92a9c1
• fixed typo in docstring for min() · 43d53d51
      Jeff Thompson authored
Hi, I found this typo while learning Spark and thought I'd do a pull request.
      
      Author: Jeff Thompson <jeffreykeatingthompson@gmail.com>
      
      Closes #1065 from jkthompson/docstring-typo-minmax and squashes the following commits:
      
      29b6a26 [Jeff Thompson] fixed typo in docstring for min()
      43d53d51
• Cleanup on Connection and ConnectionManager · 4d8ae709
      Henry Saputra authored
Simple cleanup on Connection and ConnectionManager to make the IDE happy while working on the issue (see the sketch below):
1. Replace var with val where the reference is never reassigned.
2. Add parentheses to Queue#dequeue to be consistent with its side effects.
3. Remove return on the final line of a method.
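
A self-contained sketch of what those three items look like in Scala (illustrative names, not the Connection code):

```scala
import java.nio.ByteBuffer
import scala.collection.mutable

object CleanupSketch {
  // 1. A reference that is never reassigned should be a val, not a var.
  val capacity = 1024                                      // was: var capacity = 1024

  // 2. Keep parentheses on calls that have side effects.
  val queue = mutable.Queue(ByteBuffer.allocate(capacity))
  val next = queue.dequeue()                               // was: queue.dequeue

  // 3. The last expression is the return value; no explicit `return` needed.
  def remaining(buf: ByteBuffer): Int = buf.remaining()    // was: return buf.remaining()
}
```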
      
      Author: Henry Saputra <henry.saputra@gmail.com>
      
      Closes #1060 from hsaputra/cleanup_connection_classes and squashes the following commits:
      
245fd09 [Henry Saputra] Cleanup on Connection and ConnectionManager to make the IDE happy while working on the issue: 1. Replace var with val 2. Add parentheses to Queue#dequeue to be consistent with side-effects. 3. Remove return on final line of a method.
      4d8ae709
  2. Jun 11, 2014
• 'killFuture' is never used · e056320c
      Yadong authored
      Author: Yadong <qiyadong2010@gmail.com>
      
      Closes #1052 from watermen/bug-fix1 and squashes the following commits:
      
      409d09a [Yadong] 'killFuture' is never used
      e056320c
• [SPARK-2044] Pluggable interface for shuffles · 508fd371
      Matei Zaharia authored
      This is a first cut at moving shuffle logic behind a pluggable interface, as described at https://issues.apache.org/jira/browse/SPARK-2044, to let us more easily experiment with new shuffle implementations. It moves the existing shuffle code to a class HashShuffleManager behind a general ShuffleManager interface.
      
      Two things are still missing to make this complete:
      * MapOutputTracker needs to be hidden behind the ShuffleManager interface; this will also require adding methods to ShuffleManager that will let the DAGScheduler interact with it as it does with the MapOutputTracker today
* The code to do map-side and reduce-side combine in ShuffledRDD, PairRDDFunctions, etc. needs to be moved into the ShuffleManager's readers and writers
      
      However, some of these may also be done later after we merge the current interface.
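
For orientation, a hedged sketch of roughly what the pluggable interface looks like (simplified signatures; the real trait in org.apache.spark.shuffle also threads through task context and dependency information):

```scala
// Approximate shape of the pluggable shuffle interface; simplified, not the exact Spark signatures.
trait ShuffleHandle { def shuffleId: Int }

trait ShuffleWriter[K, V] {
  def write(records: Iterator[(K, V)]): Unit
  def stop(success: Boolean): Unit
}

trait ShuffleReader[K, C] {
  def read(): Iterator[(K, C)]
}

trait ShuffleManager {
  def registerShuffle(shuffleId: Int, numMaps: Int): ShuffleHandle
  def getWriter[K, V](handle: ShuffleHandle, mapId: Int): ShuffleWriter[K, V]
  def getReader[K, C](handle: ShuffleHandle, startPartition: Int, endPartition: Int): ShuffleReader[K, C]
  def unregisterShuffle(shuffleId: Int): Unit
  def stop(): Unit
}
```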
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1009 from mateiz/pluggable-shuffle and squashes the following commits:
      
      7a09862 [Matei Zaharia] review comments
      be33d3f [Matei Zaharia] review comments
      1513d4e [Matei Zaharia] Add ASF header
      ac56831 [Matei Zaharia] Bug fix and better error message
      4f681ba [Matei Zaharia] Move write part of ShuffleMapTask to ShuffleManager
      f6f011d [Matei Zaharia] Move hash shuffle reader behind ShuffleManager interface
      55c7717 [Matei Zaharia] Changed RDD code to use ShuffleReader
      75cc044 [Matei Zaharia] Partial work to move hash shuffle in
      508fd371
• [SPARK-1672][MLLIB] Separate user and product partitioning in ALS · d9203350
      Tor Myklebust authored
      Some clean up work following #593.
      
1. Allow setting a different number of user blocks and product blocks in `ALS`.
      2. Update `MovieLensALS` to reflect the change.
      
      Author: Tor Myklebust <tmyklebu@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1014 from mengxr/SPARK-1672 and squashes the following commits:
      
      0e910dd [Xiangrui Meng] change private[this] to private[recommendation]
      36420c7 [Xiangrui Meng] set exclusion rules for ALS
      9128b77 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-1672
      294efe9 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-1672
      9bab77b [Xiangrui Meng] clean up add numUserBlocks and numProductBlocks to MovieLensALS
      84c8e8c [Xiangrui Meng] Merge branch 'master' into SPARK-1672
      d17a8bf [Xiangrui Meng] merge master
      a4925fd [Tor Myklebust] Style.
      bd8a75c [Tor Myklebust] Merge branch 'master' of github.com:apache/spark into alsseppar
      021f54b [Tor Myklebust] Separate user and product blocks.
      dcf583a [Tor Myklebust] Remove the partitioner member variable; instead, thread that needle everywhere it needs to go.
      23d6f91 [Tor Myklebust] Stop making the partitioner configurable.
      495784f [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark
      674933a [Tor Myklebust] Fix style.
      40edc23 [Tor Myklebust] Fix missing space.
      f841345 [Tor Myklebust] Fix daft bug creating 'pairs', also for -> foreach.
      5ec9e6c [Tor Myklebust] Clean a couple of things up using 'map'.
      36a0f43 [Tor Myklebust] Make the partitioner private.
      d872b09 [Tor Myklebust] Add negative id ALS test.
      df27697 [Tor Myklebust] Support custom partitioners.  Currently we use the same partitioner for users and products.
      c90b6d8 [Tor Myklebust] Scramble user and product ids before bucketing.
      c774d7d [Tor Myklebust] Make the partitioner a member variable and use it instead of modding directly.
      d9203350
• [SPARK-2052] [SQL] Add optimization for CaseConversionExpression's. · 9a2448da
      Takuya UESHIN authored
      Add optimization for `CaseConversionExpression`'s.
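
The optimization collapses nested case conversions, since only the outermost conversion decides the final case of each character. A standalone Scala model of the idea (toy types, not the Catalyst rule itself):

```scala
// Toy expression tree: nested Upper/Lower calls collapse to the outermost one.
sealed trait Expr
case class Col(name: String) extends Expr
case class Upper(child: Expr) extends Expr
case class Lower(child: Expr) extends Expr

def simplify(e: Expr): Expr = e match {
  case Upper(Upper(c)) => simplify(Upper(c))
  case Upper(Lower(c)) => simplify(Upper(c))
  case Lower(Upper(c)) => simplify(Lower(c))
  case Lower(Lower(c)) => simplify(Lower(c))
  case other => other
}

// simplify(Upper(Lower(Col("name")))) == Upper(Col("name"))
```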
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #990 from ueshin/issues/SPARK-2052 and squashes the following commits:
      
      2568666 [Takuya UESHIN] Move some rules back.
      dde7ede [Takuya UESHIN] Add tests to check if ConstantFolding can handle null literals and remove the unneeded rules from NullPropagation.
      c4eea67 [Takuya UESHIN] Fix toString methods.
      23e2363 [Takuya UESHIN] Make CaseConversionExpressions foldable if the child is foldable.
      0ff7568 [Takuya UESHIN] Add tests for collapsing case statements.
      3977d80 [Takuya UESHIN] Add optimization for CaseConversionExpression's.
      9a2448da
• HOTFIX: PySpark tests should be order insensitive. · 14e6dc94
      Patrick Wendell authored
      This has been messing up the SQL PySpark tests on Jenkins.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #1054 from pwendell/pyspark and squashes the following commits:
      
      1eb5487 [Patrick Wendell] False change
      06f062d [Patrick Wendell] HOTFIX: PySpark tests should be order insensitive
      14e6dc94
• HOTFIX: A few PySpark tests were not actually run · fe78b8b6
      Andrew Or authored
      This is a hot fix for the hot fix in fb499be1. The changes in that commit did not actually cause the `doctest` module in python to be loaded for the following tests:
      - pyspark/broadcast.py
      - pyspark/accumulators.py
      - pyspark/serializers.py
      
      (@pwendell I might have told you the wrong thing)
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1053 from andrewor14/python-test-fix and squashes the following commits:
      
      d2e5401 [Andrew Or] Explain why these tests are handled differently
      0bd6fdd [Andrew Or] Fix 3 pyspark tests not being invoked
      fe78b8b6
• [SQL] Code Cleanup: Left Semi Hash Join · ce6deb1e
      Daoyuan authored
      
Some improvements for PR #837: add another case to the white list and use `filter` to build the result iterator.
      
      Author: Daoyuan <daoyuan.wang@intel.com>
      
      Closes #1049 from adrian-wang/clean-LeftSemiJoinHash and squashes the following commits:
      
      b314d5a [Daoyuan] change hashSet name
      27579a9 [Daoyuan] add semijoin to white list and use filter to create new iterator in LeftSemiJoinBNL
      
Signed-off-by: Michael Armbrust <michael@databricks.com>
      ce6deb1e
• [SPARK-2042] Prevent unnecessary shuffle triggered by take() · 4107cce5
      Sameer Agarwal authored
      This PR implements `take()` on a `SchemaRDD` by inserting a logical limit that is followed by a `collect()`. This is also accompanied by adding a catalyst optimizer rule for collapsing adjacent limits. Doing so prevents an unnecessary shuffle that is sometimes triggered by `take()`.
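
The limit-collapsing idea can be pictured with a toy plan type: two stacked limits are equivalent to a single limit with the smaller row count (a sketch of the concept, not the actual Catalyst rule):

```scala
// Toy logical plan: adjacent limits fold into one.
sealed trait Plan
case class Scan(table: String) extends Plan
case class Limit(n: Int, child: Plan) extends Plan

def combineLimits(plan: Plan): Plan = plan match {
  case Limit(outer, Limit(inner, child)) => combineLimits(Limit(math.min(outer, inner), child))
  case other => other
}

// combineLimits(Limit(10, Limit(5, Scan("src")))) == Limit(5, Scan("src"))
```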
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #1048 from sameeragarwal/master and squashes the following commits:
      
      3eeb848 [Sameer Agarwal] Fixing Tests
      1b76ff1 [Sameer Agarwal] Deprecating limit(limitExpr: Expression) in v1.1.0
      b723ac4 [Sameer Agarwal] Added limit folding tests
      a0ff7c4 [Sameer Agarwal] Adding catalyst rule to fold two consecutive limits
      8d42d03 [Sameer Agarwal] Implement trigger() as limit() followed by collect()
      4107cce5
• SPARK-2113: awaitTermination() after stop() will hang in Spark Streaming · 4d5c12aa
      Lars Albertsson authored
      Author: Lars Albertsson <lalle@spotify.com>
      
      Closes #1001 from lallea/contextwaiter_stopped and squashes the following commits:
      
      93cd314 [Lars Albertsson] Mend StreamingContext stop() followed by awaitTermination().
      4d5c12aa
• [SPARK-2108] Mark SparkContext methods that return block information as developer API's · e508f599
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #1047 from ScrapCodes/SPARK-2108/mark-as-dev-api and squashes the following commits:
      
      073ee34 [Prashant Sharma] [SPARK-2108] Mark SparkContext methods that return block information as developer API's
      e508f599
• [SPARK-2069] MIMA false positives · 5b754b45
      Prashant Sharma authored
Fixes SPARK-2070 and SPARK-2071
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #1021 from ScrapCodes/SPARK-2070/package-private-methods and squashes the following commits:
      
      7979a57 [Prashant Sharma] addressed code review comments
      558546d [Prashant Sharma] A little fancy error message.
      59275ab [Prashant Sharma] SPARK-2071 Mima ignores classes and its members from previous versions too.
      0c4ff2b [Prashant Sharma] SPARK-2070 Ignore methods along with annotated classes.
      5b754b45
• SPARK-1639. Tidy up some Spark on YARN code · 2a4225dd
      Sandy Ryza authored
      This contains a bunch of small tidyings of the Spark on YARN code.
      
I focused on the YARN stable code. @tgravescs, let me know if you'd like me to make these changes for the alpha code as well.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #561 from sryza/sandy-spark-1639 and squashes the following commits:
      
      72b6a02 [Sandy Ryza] Fix comment and set name on driver thread
      c2190b2 [Sandy Ryza] SPARK-1639. Tidy up some Spark on YARN code
      2a4225dd
• SPARK-2107: FilterPushdownSuite doesn't need Junit jar. · 6e119303
      Qiuzhuang.Lian authored
      Author: Qiuzhuang.Lian <Qiuzhuang.Lian@gmail.com>
      
      Closes #1046 from Qiuzhuang/master and squashes the following commits:
      
      0a9921a [Qiuzhuang.Lian] SPARK-2107: FilterPushdownSuite doesn't need Junit jar.
      6e119303
• [SPARK-2091][MLLIB] use numpy.dot instead of ndarray.dot · 0f1dc3a7
      Xiangrui Meng authored
      `ndarray.dot` is not available in numpy 1.4. This PR makes pyspark/mllib compatible with numpy 1.4.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1035 from mengxr/numpy-1.4 and squashes the following commits:
      
      7ad2f0c [Xiangrui Meng] use numpy.dot instead of ndarray.dot
      0f1dc3a7
• [SPARK-1968][SQL] SQL/HiveQL command for caching/uncaching tables · 0266a0c8
      Cheng Lian authored
      JIRA issue: [SPARK-1968](https://issues.apache.org/jira/browse/SPARK-1968)
      
      This PR added support for SQL/HiveQL command for caching/uncaching tables:
      
      ```
      scala> sql("CACHE TABLE src")
      ...
      res0: org.apache.spark.sql.SchemaRDD =
      SchemaRDD[0] at RDD at SchemaRDD.scala:98
      == Query Plan ==
      CacheCommandPhysical src, true
      
      scala> table("src")
      ...
      res1: org.apache.spark.sql.SchemaRDD =
      SchemaRDD[3] at RDD at SchemaRDD.scala:98
      == Query Plan ==
      InMemoryColumnarTableScan [key#0,value#1], (HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None), false
      
      scala> isCached("src")
      res2: Boolean = true
      
      scala> sql("CACHE TABLE src")
      ...
      res3: org.apache.spark.sql.SchemaRDD =
      SchemaRDD[4] at RDD at SchemaRDD.scala:98
      == Query Plan ==
      CacheCommandPhysical src, false
      
      scala> table("src")
      ...
      res4: org.apache.spark.sql.SchemaRDD =
      SchemaRDD[11] at RDD at SchemaRDD.scala:98
      == Query Plan ==
      HiveTableScan [key#2,value#3], (MetastoreRelation default, src, None), None
      
      scala> isCached("src")
      res5: Boolean = false
      ```
      
      Things also work for `hql`.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1038 from liancheng/sqlCacheTable and squashes the following commits:
      
      ecb7194 [Cheng Lian] Trimmed the SQL string before parsing special commands
      6f4ce42 [Cheng Lian] Moved logical command classes to a separate file
      3458a24 [Cheng Lian] Added comment for public API
      f0ffacc [Cheng Lian] Added isCached() predicate
      15ec6d2 [Cheng Lian] Added "(UN)CACHE TABLE" SQL/HiveQL statements
      0266a0c8
• [SPARK-2093] [SQL] NullPropagation should use exact type value. · 0402bd77
      Takuya UESHIN authored
      `NullPropagation` should use exact type value when transform `Count` or `Sum`.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #1034 from ueshin/issues/SPARK-2093 and squashes the following commits:
      
      65b6ff1 [Takuya UESHIN] Modify the literal value of the result of transformation from Sum to long value.
      830c20b [Takuya UESHIN] Add Cast to the result of transformation from Count.
      9314806 [Takuya UESHIN] Fix NullPropagation to use exact type value.
      0402bd77
  3. Jun 10, 2014
• HOTFIX: clear() configs in SQLConf-related unit tests. · 601032f5
      Zongheng Yang authored
      Thanks goes to @liancheng, who pointed out that `sql/test-only *.SQLConfSuite *.SQLQuerySuite` passed but `sql/test-only *.SQLQuerySuite *.SQLConfSuite` failed. The reason is that some tests use the same test keys and without clear()'ing, they get carried over to other tests. This hotfix simply adds some `clear()` calls.
      
      This problem was not evident on Jenkins before probably because `parallelExecution` is not set to `false` for `sqlCoreSettings`.
      
      Author: Zongheng Yang <zongheng.y@gmail.com>
      
      Closes #1040 from concretevitamin/sqlconf-tests and squashes the following commits:
      
      6d14ceb [Zongheng Yang] HOTFIX: clear() confs in SQLConf related unit tests.
      601032f5
• [SPARK-2065] give launched instances names · a2052a44
      Nicholas Chammas authored
      This update resolves [SPARK-2065](https://issues.apache.org/jira/browse/SPARK-2065). It gives launched EC2 instances descriptive names by using instance tags. Launched instances now show up in the EC2 console with these names.
      
      I used `format()` with named parameters, which I believe is the recommended practice for string formatting in Python, but which doesn’t seem to be used elsewhere in the script.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      Author: nchammas <nicholas.chammas@gmail.com>
      
      Closes #1043 from nchammas/master and squashes the following commits:
      
      69f6e22 [Nicholas Chammas] PEP8 fixes
      2627247 [Nicholas Chammas] broke up lines before they hit 100 chars
      6544b7e [Nicholas Chammas] [SPARK-2065] give launched instances names
      69da6cf [nchammas] Merge pull request #1 from apache/master
      a2052a44
• Resolve scalatest warnings during build · c48b6222
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #1032 from witgo/ShouldMatchers and squashes the following commits:
      
      7ebf34c [witgo] Resolve scalatest warnings during build
      c48b6222
• [SPARK-1940] Enabling rolling of executor logs, and automatic cleanup of old executor logs · 4823bf47
      Tathagata Das authored
Currently, in the default log4j configuration, all the executor logs get sent to the file <code>[executor-working-dir]/stderr</code>. This does not allow log files to be rolled, so old logs cannot be removed.
      
Using log4j RollingFileAppender allows log4j logs to be rolled, but all the logs get sent to a different set of files, other than the files <code>stdout</code> and <code>stderr</code>. So the logs are no longer visible in the Spark web UI, as the web UI only reads the files <code>stdout</code> and <code>stderr</code>. Furthermore, it still does not allow stdout and stderr to be cleared periodically in case a large amount of output gets written to them (e.g., by an explicit `println` inside a map function).
      
      This PR solves this by implementing a simple `RollingFileAppender` within Spark (disabled by default). When enabled (using configuration parameter `spark.executor.rollingLogs.enabled`), the logs can get rolled over either by time interval (set with `spark.executor.rollingLogs.interval`, set to daily by default), or by size of logs (set with  `spark.executor.rollingLogs.size`). Finally, old logs can be automatically deleted by specifying how many of the latest log files to keep (set with `spark.executor.rollingLogs.keepLastN`).  The web UI has also been modified to show the logs across the rolled-over files.
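
For illustration, a hypothetical size-based rolling check along the lines of the policies described above (sketch only, not the Spark implementation):

```scala
// Hypothetical size-based rolling decision: roll over once the active log file
// would exceed the configured size, then start counting from zero again.
class SizeBasedRollover(maxBytes: Long) {
  private var bytesSinceRollover = 0L

  def shouldRollover(nextWriteSize: Long): Boolean =
    bytesSinceRollover + nextWriteSize > maxBytes

  def recordWrite(size: Long): Unit = { bytesSinceRollover += size }

  def rolledOver(): Unit = { bytesSinceRollover = 0L }
}
```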
      
You can test this locally (without waiting a whole day) by setting configuration `spark.executor.rollingLogs.enabled=true` and `spark.executor.rollingLogs.interval=minutely`. Continuously generate logs by running Spark jobs, and the generated log files will look like this (`stderr` and `stdout` are the most recent log files being written to).
      
      ```
      stderr
      stderr--2014-05-27--14-37
      stderr--2014-05-27--14-47
      stderr--2014-05-27--15-05
      stdout
      stdout--2014-05-27--14-47
      ```
      
The web UI should show logs across these files.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #895 from tdas/rolling-logs and squashes the following commits:
      
      fd8f87f [Tathagata Das] Minor change.
      d326aee [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs
      ad956c1 [Tathagata Das] Scala style fix.
      1f0a6ec [Tathagata Das] Some more changes based on Patrick's PR comments.
      c8bfe4e [Tathagata Das] Refactore FileAppender to a package spark.util.logging and broke up the file into multiple files. Changed configuration parameter names.
      4224409 [Tathagata Das] Style fix.
      108a9f8 [Tathagata Das] Added better constraint handling for rolling policies.
      f7da977 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs
      9134495 [Tathagata Das] Simplified rolling logs by removing Daily/Hourly/MinutelyRollingFileAppender, and removing the setting rollingLogs.enabled
      312d874 [Tathagata Das] Minor fixes based on PR comments.
      8a67d83 [Tathagata Das] Fixed comments.
      b36cfd6 [Tathagata Das] Implemented RollingPolicy, TimeBasedRollingPolicy and SizeBasedRollingPolicy, and changed RollingFileAppender accordingly.
      b7e8272 [Tathagata Das] Style fix,
      374c9a9 [Tathagata Das] Added missing license.
      24354ea [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs
      6cc09c7 [Tathagata Das] Fixed bugs in rolling logs, and added more debug statements.
      adf4910 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs
      931f8fb [Tathagata Das] Changed log viewer in Spark web UI to handle rolling log files.
      cb4fb6d [Tathagata Das] Added FileAppender and RollingFileAppender to generate rolling executor logs.
      4823bf47
• [SPARK-1998] SparkFlumeEvent with body bigger than 1020 bytes are not re... · 29660443
      joyyoj authored
A Flume event sent to Spark will fail if the body is too large and numHeaders is greater than zero.
      
      Author: joyyoj <sunshch@gmail.com>
      
      Closes #951 from joyyoj/master and squashes the following commits:
      
      f4660c5 [joyyoj] [SPARK-1998] SparkFlumeEvent with body bigger than 1020 bytes are not read properly
      29660443
• [SQL] Add average overflow test case from #978 · 1abbde0e
      egraldlo authored
      By @egraldlo.
      
      Author: egraldlo <egraldlo@gmail.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1033 from marmbrus/pr/978 and squashes the following commits:
      
      e228c5e [Michael Armbrust] Remove "test".
      762aeaf [Michael Armbrust] Remove unneeded rule. More descriptive name for test table.
      d414cd7 [egraldlo] fommatting issues
      1153f75 [egraldlo] do best to avoid overflowing in function avg().
      1abbde0e
• HOTFIX: Increase time limit for Bagel test · 55a0e87e
      Ankur Dave authored
      The test was timing out on some slow EC2 workers.
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #1037 from ankurdave/bagel-test-time-limit and squashes the following commits:
      
      67fd487 [Ankur Dave] Increase time limit for Bagel test
      55a0e87e
• HOTFIX: Fix Python tests on Jenkins. · fb499be1
      Patrick Wendell authored
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #1036 from pwendell/jenkins-test and squashes the following commits:
      
      9c99856 [Patrick Wendell] Better output during tests
      71e7b74 [Patrick Wendell] Removing incorrect python path
      74984db [Patrick Wendell] HOTFIX: Allow PySpark tests to run on Jenkins.
      fb499be1
• [SPARK-2076][SQL] Pushdown the join filter & predicate for outer join · db0c038a
      Cheng Hao authored
Following the rules described in https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior, we can optimize a SQL join by pushing down the join predicate and the WHERE predicate.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #1015 from chenghao-intel/join_predicate_push_down and squashes the following commits:
      
      10feff9 [Cheng Hao] fix bug of changing the join type in PredicatePushDownThroughJoin
      44c6700 [Cheng Hao] Add logical to support pushdown the join filter
      0bce426 [Cheng Hao] Pushdown the join filter & predicate for outer join
      db0c038a
• [SPARK-1978] In some cases, spark-yarn does not automatically restart the failed container · 884ca718
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #921 from witgo/allocateExecutors and squashes the following commits:
      
      bc3aa66 [witgo] review commit
      8800eba [witgo] Merge branch 'master' of https://github.com/apache/spark into allocateExecutors
      32ac7af [witgo] review commit
      056b8c7 [witgo] Merge branch 'master' of https://github.com/apache/spark into allocateExecutors
      04c6f7e [witgo] Merge branch 'master' into allocateExecutors
      aff827c [witgo] review commit
      5c376e0 [witgo] Merge branch 'master' of https://github.com/apache/spark into allocateExecutors
      1faf4f4 [witgo] Merge branch 'master' into allocateExecutors
      3c464bd [witgo] add time limit to allocateExecutors
      e00b656 [witgo] In some cases, yarn does not automatically restart the container
      884ca718
• Moved hiveOperators.scala to the right package folder · a9a461c5
      Cheng Lian authored
      The package is `org.apache.spark.sql.hive.execution`, while the file was placed under `sql/hive/src/main/scala/org/apache/spark/sql/hive/`.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1029 from liancheng/moveHiveOperators and squashes the following commits:
      
      d632eb8 [Cheng Lian] Moved hiveOperators.scala to the right package folder
      a9a461c5
• [SPARK-1508][SQL] Add SQLConf to SQLContext. · 08ed9ad8
      Zongheng Yang authored
This PR (1) introduces a new class SQLConf that stores key-value properties for a SQLContext and (2) cleans up the semantics of various forms of SET commands.
      
      The SQLConf class unlocks user-controllable optimization opportunities; for example, user can now override the number of partitions used during an Exchange. A SQLConf can be accessed and modified programmatically through its getters and setters. It can also be modified through SET commands executed by `sql()` or `hql()`. Note that users now have the ability to change a particular property for different queries inside the same Spark job, unlike settings configured in SparkConf.
      
      For SET commands: "SET" will return all properties currently set in a SQLConf, "SET key" will return the key-value pair (if set) or an undefined message, and "SET key=value" will call the setter on SQLConf, and if a HiveContext is used, it will be executed in Hive as well.
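
A standalone sketch of those three SET forms (helper name and conf type assumed, not the actual SQLConf code):

```scala
import scala.collection.mutable

// "SET" lists all properties, "SET key" looks one up, "SET key=value" updates the conf.
def interpretSet(command: String, conf: mutable.Map[String, String]): Seq[String] = {
  val body = command.trim.stripPrefix("SET").trim
  if (body.isEmpty) {
    conf.map { case (k, v) => s"$k=$v" }.toSeq
  } else if (body.contains("=")) {
    val Array(k, v) = body.split("=", 2).map(_.trim)
    conf(k) = v
    Seq(s"$k=$v")
  } else {
    Seq(conf.get(body).map(v => s"$body=$v").getOrElse(s"$body is undefined"))
  }
}
```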
      
      Author: Zongheng Yang <zongheng.y@gmail.com>
      
      Closes #956 from concretevitamin/sqlconf and squashes the following commits:
      
      4968c11 [Zongheng Yang] Very minor cleanup.
      d74dde5 [Zongheng Yang] Remove the redundant mkQueryExecution() method.
      c129b86 [Zongheng Yang] Merge remote-tracking branch 'upstream/master' into sqlconf
      26c40eb [Zongheng Yang] Make SQLConf a trait and have SQLContext mix it in.
      dd19666 [Zongheng Yang] Update a comment.
      baa5d29 [Zongheng Yang] Remove default param for shuffle partitions accessor.
      5f7e6d8 [Zongheng Yang] Add default num partitions.
      22d9ed7 [Zongheng Yang] Fix output() of Set physical. Add SQLConf param accessor method.
      e9856c4 [Zongheng Yang] Use java.util.Collections.synchronizedMap on a Java HashMap.
      88dd0c8 [Zongheng Yang] Remove redundant SET Keyword.
      271f0b1 [Zongheng Yang] Minor change.
      f8983d1 [Zongheng Yang] Minor changes per review comments.
      1ce8a5e [Zongheng Yang] Invoke runSqlHive() in SQLConf#get for the HiveContext case.
      b766af9 [Zongheng Yang] Remove a test.
      d52e1bd [Zongheng Yang] De-hardcode number of shuffle partitions for BasicOperators (read from SQLConf).
      555599c [Zongheng Yang] Bullet-proof (relatively) parsing SET per review comment.
      c2067e8 [Zongheng Yang] Mark SQLContext transient and put it in a second param list.
      2ea8cdc [Zongheng Yang] Wrap long line.
      41d7f09 [Zongheng Yang] Fix imports.
      13279e6 [Zongheng Yang] Refactor the logic of eagerly processing SET commands.
      b14b83e [Zongheng Yang] In a HiveContext, make SQLConf a subset of HiveConf.
      6983180 [Zongheng Yang] Move a SET test to SQLQuerySuite and make it complete.
      5b67985 [Zongheng Yang] New line at EOF.
      c651797 [Zongheng Yang] Add commands.scala.
      efd82db [Zongheng Yang] Clean up semantics of several cases of SET.
      c1017c2 [Zongheng Yang] WIP in changing SetCommand to take two Options (for different semantics of SETs).
      0f00d86 [Zongheng Yang] Add a test for singleton set command in SQL.
      41acd75 [Zongheng Yang] Add a test for hql() in HiveQuerySuite.
      2276929 [Zongheng Yang] Fix default hive result for set commands in HiveComparisonTest.
      3b0c71b [Zongheng Yang] Remove Parser for set commands. A few other fixes.
      d0c4578 [Zongheng Yang] Tmux typo.
      0ecea46 [Zongheng Yang] Changes for HiveQl and HiveContext.
      ce22d80 [Zongheng Yang] Fix parsing issues.
      cb722c1 [Zongheng Yang] Finish up SQLConf patch.
      4ebf362 [Zongheng Yang] First cut at SQLConf inside SQLContext.
      08ed9ad8
• SPARK-1416: PySpark support for SequenceFile and Hadoop InputFormats · f971d6cb
      Nick Pentreath authored
      So I finally resurrected this PR. It seems the old one against the incubator mirror is no longer available, so I cannot reference it.
      
      This adds initial support for reading Hadoop ```SequenceFile```s, as well as arbitrary Hadoop ```InputFormat```s, in PySpark.
      
      # Overview
      The basics are as follows:
1. The ```PythonRDD``` object contains the relevant methods, which are in turn invoked by ```SparkContext``` in PySpark
      2. The SequenceFile or InputFormat is read on the Scala side and converted from ```Writable``` instances to the relevant Scala classes (in the case of primitives)
      3. Pyrolite is used to serialize Java objects. If this fails, the fallback is ```toString```
      4. ```PickleSerializer``` on the Python side deserializes.
      
This works "out of the box" for simple ```Writable```s:
      * ```Text```
      * ```IntWritable```, ```DoubleWritable```, ```FloatWritable```
      * ```NullWritable```
      * ```BooleanWritable```
      * ```BytesWritable```
      * ```MapWritable```
      
It also works for simple, "struct-like" classes. Due to the way Pyrolite works, this requires that the classes satisfy the JavaBeans conventions (i.e. with fields, a no-arg constructor and getters/setters). (Perhaps in the future some sugar for case classes and reflection could be added.)
      
      I've tested it out with ```ESInputFormat```  as an example and it works very nicely:
      ```python
      conf = {"es.resource" : "index/type" }
      rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat", "org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
      rdd.first()
      ```
      
I suspect for things like HBase/Cassandra it will be a bit trickier to get it to work out of the box.
      
      # Some things still outstanding:
      1. ~~Requires ```msgpack-python``` and will fail without it. As originally discussed with Josh, add a ```as_strings``` argument that defaults to ```False```, that can be used if ```msgpack-python``` is not available~~
      2. ~~I see from https://github.com/apache/spark/pull/363 that Pyrolite is being used there for SerDe between Scala and Python. @ahirreddy @mateiz what is the plan behind this - is Pyrolite preferred? It seems from a cursory glance that adapting the ```msgpack```-based SerDe here to use Pyrolite wouldn't be too hard~~
      3. ~~Support the key and value "wrapper" that would allow a Scala/Java function to be plugged in that would transform whatever the key/value Writable class is into something that can be serialized (e.g. convert some custom Writable to a JavaBean or ```java.util.Map``` that can be easily serialized)~~
      4. Support ```saveAsSequenceFile``` and ```saveAsHadoopFile``` etc. This would require SerDe in the reverse direction, that can be handled by Pyrolite. Will work on this as a separate PR
      
      Author: Nick Pentreath <nick.pentreath@gmail.com>
      
      Closes #455 from MLnick/pyspark-inputformats and squashes the following commits:
      
      268df7e [Nick Pentreath] Documentation changes mer @pwendell comments
      761269b [Nick Pentreath] Address @pwendell comments, simplify default writable conversions and remove registry.
      4c972d8 [Nick Pentreath] Add license headers
      d150431 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      cde6af9 [Nick Pentreath] Parameterize converter trait
      5ebacfa [Nick Pentreath] Update docs for PySpark input formats
      a985492 [Nick Pentreath] Move Converter examples to own package
      365d0be [Nick Pentreath] Make classes private[python]. Add docs and @Experimental annotation to Converter interface.
      eeb8205 [Nick Pentreath] Fix path relative to SPARK_HOME in tests
      1eaa08b [Nick Pentreath] HBase -> Cassandra app name oversight
      3f90c3e [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      2c18513 [Nick Pentreath] Add examples for reading HBase and Cassandra InputFormats from Python
      b65606f [Nick Pentreath] Add converter interface
      5757f6e [Nick Pentreath] Default key/value classes for sequenceFile asre None
      085b55f [Nick Pentreath] Move input format tests to tests.py and clean up docs
      43eb728 [Nick Pentreath] PySpark InputFormats docs into programming guide
      94beedc [Nick Pentreath] Clean up args in PythonRDD. Set key/value converter defaults to None for PySpark context.py methods
      1a4a1d6 [Nick Pentreath] Address @mateiz style comments
      01e0813 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      15a7d07 [Nick Pentreath] Remove default args for key/value classes. Arg names to camelCase
      9fe6bd5 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      84fe8e3 [Nick Pentreath] Python programming guide space formatting
      d0f52b6 [Nick Pentreath] Python programming guide
      7caa73a [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      93ef995 [Nick Pentreath] Add back context.py changes
      9ef1896 [Nick Pentreath] Recover earlier changes lost in previous merge for serializers.py
      077ecb2 [Nick Pentreath] Recover earlier changes lost in previous merge for context.py
      5af4770 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      35b8e3a [Nick Pentreath] Another fix for test ordering
      bef3afb [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      e001b94 [Nick Pentreath] Fix test failures due to ordering
      78978d9 [Nick Pentreath] Add doc for SequenceFile and InputFormat support to Python programming guide
      64eb051 [Nick Pentreath] Scalastyle fix
      e7552fa [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      44f2857 [Nick Pentreath] Remove msgpack dependency and switch serialization to Pyrolite, plus some clean up and refactoring
      c0ebfb6 [Nick Pentreath] Change sequencefile test data generator to easily be called from PySpark tests
      1d7c17c [Nick Pentreath] Amend tests to auto-generate sequencefile data in temp dir
      17a656b [Nick Pentreath] remove binary sequencefile for tests
      f60959e [Nick Pentreath] Remove msgpack dependency and serializer from PySpark
      450e0a2 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      31a2fff [Nick Pentreath] Scalastyle fixes
      fc5099e [Nick Pentreath] Add Apache license headers
      4e08983 [Nick Pentreath] Clean up docs for PySpark context methods
      b20ec7e [Nick Pentreath] Clean up merge duplicate dependencies
      951c117 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      f6aac55 [Nick Pentreath] Bring back msgpack
      9d2256e [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      1bbbfb0 [Nick Pentreath] Clean up SparkBuild from merge
      a67dfad [Nick Pentreath] Clean up Msgpack serialization and registering
      7237263 [Nick Pentreath] Add back msgpack serializer and hadoop file code lost during merging
      25da1ca [Nick Pentreath] Add generator for nulls, bools, bytes and maps
      65360d5 [Nick Pentreath] Adding test SequenceFiles
      0c612e5 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      d72bf18 [Nick Pentreath] msgpack
      dd57922 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      e67212a [Nick Pentreath] Add back msgpack dependency
      f2d76a0 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      41856a5 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      97ef708 [Nick Pentreath] Remove old writeToStream
      2beeedb [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      795a763 [Nick Pentreath] Change name to WriteInputFormatTestDataGenerator. Cleanup some var names. Use SPARK_HOME in path for writing test sequencefile data.
      174f520 [Nick Pentreath] Add back graphx settings
      703ee65 [Nick Pentreath] Add back msgpack
      619c0fa [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      1c8efbc [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      eb40036 [Nick Pentreath] Remove unused comment lines
      4d7ef2e [Nick Pentreath] Fix indentation
      f1d73e3 [Nick Pentreath] mergeConfs returns a copy rather than mutating one of the input arguments
      0f5cd84 [Nick Pentreath] Remove unused pair UTF8 class. Add comments to msgpack deserializer
      4294cbb [Nick Pentreath] Add old Hadoop api methods. Clean up and expand comments. Clean up argument names
      818a1e6 [Nick Pentreath] Add seqencefile and Hadoop InputFormat support to PythonRDD
      4e7c9e3 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      c304cc8 [Nick Pentreath] Adding supporting sequncefiles for tests. Cleaning up
      4b0a43f [Nick Pentreath] Refactoring utils into own objects. Cleaning up old commented-out code
      d86325f [Nick Pentreath] Initial WIP of PySpark support for SequenceFile and arbitrary Hadoop InputFormat
      f971d6cb
• Make sure that empty string is filtered out when we get the secondary jars from conf · 6f2db8c2
      DB Tsai authored
      Author: DB Tsai <dbtsai@dbtsai.com>
      
      Closes #1027 from dbtsai/dbtsai-classloader and squashes the following commits:
      
      9ac6be3 [DB Tsai] Fixed line too long
      c9c7ad7 [DB Tsai] Make sure that empty string is filtered out when we get the secondary jars from conf.
      6f2db8c2
  4. Jun 09, 2014
• [SPARK-1704][SQL] Fully support EXPLAIN commands as SchemaRDD. · a9ec033c
      Zongheng Yang authored
      This PR attempts to resolve [SPARK-1704](https://issues.apache.org/jira/browse/SPARK-1704) by introducing a physical plan for EXPLAIN commands, which just prints out the debug string (containing various SparkSQL's plans) of the corresponding QueryExecution for the actual query.
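
Usage then amounts to running an EXPLAIN statement through sql() or hql(); a hedged example assuming an existing SQLContext and a registered table `src`:

```scala
// Assumes `sqlContext` is an existing org.apache.spark.sql.SQLContext with a "src" table registered.
val explained = sqlContext.sql("EXPLAIN SELECT key, value FROM src WHERE key > 10")
explained.collect().foreach(println)   // rows contain the debug string of the query's plans
```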
      
      Author: Zongheng Yang <zongheng.y@gmail.com>
      
      Closes #1003 from concretevitamin/explain-cmd and squashes the following commits:
      
      5b7911f [Zongheng Yang] Add a regression test.
      1bfa379 [Zongheng Yang] Modify output().
      719ada9 [Zongheng Yang] Override otherCopyArgs for ExplainCommandPhysical.
      4318fd7 [Zongheng Yang] Make all output one Row.
      439c6ab [Zongheng Yang] Minor cleanups.
      408f574 [Zongheng Yang] SPARK-1704: Add CommandStrategy and ExplainCommandPhysical.
      a9ec033c
• [SQL] Simple framework for debugging query execution · c6e041d1
      Michael Armbrust authored
Only records the number of tuples and the unique dataTypes output right now...
      
      Example:
      ```scala
      scala> import org.apache.spark.sql.execution.debug._
      scala> hql("SELECT value FROM src WHERE key > 10").debug(sparkContext)
      
      Results returned: 489
      == Project [value#1:0] ==
      Tuples output: 489
       value StringType: {java.lang.String}
      == Filter (key#0:1 > 10) ==
      Tuples output: 489
       value StringType: {java.lang.String}
       key IntegerType: {java.lang.Integer}
      == HiveTableScan [value#1,key#0], (MetastoreRelation default, src, None), None ==
      Tuples output: 500
       value StringType: {java.lang.String}
       key IntegerType: {java.lang.Integer}
      ```
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1005 from marmbrus/debug and squashes the following commits:
      
      dcc3ca6 [Michael Armbrust] Add comments.
      c9dded2 [Michael Armbrust] Simple framework for debugging query execution
      c6e041d1
• [SPARK-1522]: YARN ClientBase throws an NPE if there is no YARN Application CP · e2734476
      Bernardo Gomez Palacio authored
The current implementation of ClientBase.getDefaultYarnApplicationClasspath inspects
the MRJobConfig class for the field DEFAULT_YARN_APPLICATION_CLASSPATH when it should
really be looking into YarnConfiguration. If the application configuration has no
yarn.application.classpath defined, an NPE is thrown.
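
A hedged sketch of the null-safe lookup (Hadoop API constants, usage approximate): prefer the configured classpath and only then fall back to YarnConfiguration's default.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.yarn.conf.YarnConfiguration

// Returns the configured yarn.application.classpath if present, otherwise the
// YarnConfiguration default, and never dereferences a null array.
def yarnAppClasspath(conf: Configuration): Seq[String] =
  Option(conf.getStrings(YarnConfiguration.YARN_APPLICATION_CLASSPATH))
    .map(_.toSeq)
    .orElse(Option(YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH).map(_.toSeq))
    .getOrElse(Seq.empty)
```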
      
      Additional Changes include:
      * Test Suite for ClientBase added
      
      [ticket: SPARK-1522] : https://issues.apache.org/jira/browse/SPARK-1522
      
      Author      : bernardo.gomezpalacio@gmail.com
      Testing     : SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true ./sbt/sbt test
      
      Author: Bernardo Gomez Palacio <bernardo.gomezpalacio@gmail.com>
      
      Closes #433 from berngp/feature/SPARK-1522 and squashes the following commits:
      
      2c2e118 [Bernardo Gomez Palacio] [SPARK-1522]: YARN ClientBase throws a NPE if there is no YARN Application specific CP
      e2734476