  1. May 29, 2014
• initial version of LPA · b7e28fa4
      Ankur Dave authored
A straightforward implementation of the label propagation algorithm (LPA) for detecting graph communities using the Pregel framework. Amongst the growing literature on community detection algorithms in networks, LPA is perhaps the most elementary, and despite its flaws it remains a nice and simple approach.
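
For reference, a minimal sketch of LPA on GraphX's Pregel API (method names follow the GraphX 1.0-era API; this is an illustration, not necessarily the committed code):

```scala
import scala.reflect.ClassTag

import org.apache.spark.graphx._

def run[VD, ED: ClassTag](graph: Graph[VD, ED], maxSteps: Int): Graph[VertexId, ED] = {
  // Every vertex starts in its own community, labelled by its own vertex ID.
  val lpaGraph = graph.mapVertices { case (vid, _) => vid }
  // Each edge tells both endpoints about the other side's current label.
  def sendMessage(e: EdgeTriplet[VertexId, ED]) =
    Iterator((e.srcId, Map(e.dstAttr -> 1L)), (e.dstId, Map(e.srcAttr -> 1L)))
  // Sum the per-label counts arriving from all neighbours.
  def mergeMessage(c1: Map[VertexId, Long], c2: Map[VertexId, Long]) =
    (c1.keySet ++ c2.keySet).map { k =>
      k -> (c1.getOrElse(k, 0L) + c2.getOrElse(k, 0L))
    }.toMap
  // Adopt the most frequent label among the incoming ones.
  def vertexProgram(vid: VertexId, attr: VertexId, message: Map[VertexId, Long]) =
    if (message.isEmpty) attr else message.maxBy(_._2)._1
  Pregel(lpaGraph, Map.empty[VertexId, Long], maxIterations = maxSteps)(
    vertexProgram, sendMessage, mergeMessage)
}
```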
      
      Author: Ankur Dave <ankurdave@gmail.com>
      Author: haroldsultan <haroldsultan@gmail.com>
      Author: Harold Sultan <haroldsultan@gmail.com>
      
      Closes #905 from haroldsultan/master and squashes the following commits:
      
      327aee0 [haroldsultan] Merge pull request #2 from ankurdave/label-propagation
      227a4d0 [Ankur Dave] Untabify
      0ac574c [haroldsultan] Merge pull request #1 from ankurdave/label-propagation
      0e24303 [Ankur Dave] Add LabelPropagationSuite
      84aa061 [Ankur Dave] LabelPropagation: Fix compile errors and style; rename from LPA
      9830342 [Harold Sultan] initial version of LPA
      b7e28fa4
• [SPARK-1368][SQL] Optimized HiveTableScan · 8f7141fb
      Cheng Lian authored
      JIRA issue: [SPARK-1368](https://issues.apache.org/jira/browse/SPARK-1368)
      
      This PR introduces two major updates:
      
- Replaced FP-style code with a `while` loop and a reusable `GenericMutableRow` object in the critical path of `HiveTableScan` (a sketch of this pattern follows the list).
- Used `ColumnProjectionUtils` to help optimize RCFile and ORC column pruning.
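
A minimal, self-contained sketch of the first optimization, where `ReusableRow` is a hypothetical stand-in for `GenericMutableRow`:

```scala
// Instead of allocating a fresh immutable row per record (FP style),
// iterate with a while loop and overwrite one reusable row buffer.
final class ReusableRow(width: Int) {
  private val values = new Array[Any](width)
  def update(i: Int, v: Any): Unit = values(i) = v
  def apply(i: Int): Any = values(i)
}

def scan(raw: Iterator[Array[Any]], width: Int): Iterator[ReusableRow] = {
  val row = new ReusableRow(width)  // allocated once for the whole scan
  raw.map { fields =>
    var i = 0
    while (i < width) {             // no per-element closures in the hot loop
      row(i) = fields(i)
      i += 1
    }
    row                             // callers must consume the row before next()
  }
}
```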
      
My quick micro benchmark suggests these two optimizations make the optimized version 2x faster when scanning the CSV table and 2.5x faster when scanning the RCFile table:
      
      ```
      Original:
      
      [info] CSV: 27676 ms, RCFile: 26415 ms
      [info] CSV: 27703 ms, RCFile: 26029 ms
      [info] CSV: 27511 ms, RCFile: 25962 ms
      
      Optimized:
      
      [info] CSV: 13820 ms, RCFile: 10402 ms
      [info] CSV: 14158 ms, RCFile: 10691 ms
      [info] CSV: 13606 ms, RCFile: 10346 ms
      ```
      
The micro benchmark loads a 609MB CSV file (structurally similar to the `src` test table) into a normal Hive table with `LazySimpleSerDe` and into an RCFile table, then scans each table in turn.
      
      Preparation code:
      
      ```scala
      package org.apache.spark.examples.sql.hive
      
      import org.apache.spark.sql.hive.LocalHiveContext
      import org.apache.spark.{SparkConf, SparkContext}
      
      object HiveTableScanPrepare extends App {
        val sparkContext = new SparkContext(
          new SparkConf()
            .setMaster("local")
            .setAppName(getClass.getSimpleName.stripSuffix("$")))
      
        val hiveContext = new LocalHiveContext(sparkContext)
      
        import hiveContext._
      
        hql("drop table scan_csv")
        hql("drop table scan_rcfile")
      
        hql("""create table scan_csv (key int, value string)
              |  row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
              |  with serdeproperties ('field.delim'=',')
            """.stripMargin)
      
        hql(s"""load data local inpath "${args(0)}" into table scan_csv""")
      
        hql("""create table scan_rcfile (key int, value string)
              |  row format serde 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
              |stored as
              |  inputformat 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
              |  outputformat 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
            """.stripMargin)
      
        hql(
          """
            |from scan_csv
            |insert overwrite table scan_rcfile
            |select scan_csv.key, scan_csv.value
          """.stripMargin)
      }
      ```
      
      Benchmark code:
      
      ```scala
      package org.apache.spark.examples.sql.hive
      
      import org.apache.spark.sql.hive.LocalHiveContext
      import org.apache.spark.{SparkConf, SparkContext}
      
      object HiveTableScanBenchmark extends App {
        val sparkContext = new SparkContext(
          new SparkConf()
            .setMaster("local")
            .setAppName(getClass.getSimpleName.stripSuffix("$")))
      
        val hiveContext = new LocalHiveContext(sparkContext)
      
        import hiveContext._
      
        val scanCsv = hql("select key from scan_csv")
        val scanRcfile = hql("select key from scan_rcfile")
      
        val csvDuration = benchmark(scanCsv.count())
        val rcfileDuration = benchmark(scanRcfile.count())
      
        println(s"CSV: $csvDuration ms, RCFile: $rcfileDuration ms")
      
        def benchmark(f: => Unit) = {
          val begin = System.currentTimeMillis()
          f
          val end = System.currentTimeMillis()
          end - begin
        }
      }
      ```
      
      @marmbrus Please help review, thanks!
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #758 from liancheng/fastHiveTableScan and squashes the following commits:
      
      4241a19 [Cheng Lian] Distinguishes sorted and possibly not sorted operations more accurately in HiveComparisonTest
      cf640d8 [Cheng Lian] More HiveTableScan optimisations:
      bf0e7dc [Cheng Lian] Added SortedOperation pattern to match *some* definitely sorted operations and avoid some sorting cost in HiveComparisonTest.
      6d1c642 [Cheng Lian] Using ColumnProjectionUtils to optimise RCFile and ORC column pruning
      eb62fd3 [Cheng Lian] [SPARK-1368] Optimized HiveTableScan
      8f7141fb
• SPARK-1935: Explicitly add commons-codec 1.5 as a dependency. · 60b89fe6
      Yin Huai authored
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #889 from yhuai/SPARK-1935 and squashes the following commits:
      
      7d50ef1 [Yin Huai] Explicitly add commons-codec 1.5 as a dependency.
      60b89fe6
• Added doctest and method description in context.py · 9cff1dd2
      Jyotiska NK authored
      Added doctest for method textFile and description for methods _initialize_context and _ensure_initialized in context.py
      
      Author: Jyotiska NK <jyotiska123@gmail.com>
      
      Closes #187 from jyotiska/pyspark_context and squashes the following commits:
      
      356f945 [Jyotiska NK] Added doctest for textFile method in context.py
      5b23686 [Jyotiska NK] Updated context.py with method descriptions
      9cff1dd2
  2. May 28, 2014
• [SPARK-1712]: TaskDescription instance is too big causes Spark to hang · 4dbb27b0
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #694 from witgo/SPARK-1712_new and squashes the following commits:
      
      0f52483 [witgo] review commit
      83ce29b [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
      52e6752 [witgo] reset test SparkContext
      63636b6 [witgo] review commit
      44a59ee [witgo] review commit
      3b6d48c [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
      926bd6a [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
      9a5cfad [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
      03cc562 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
      b0930b0 [witgo] review commit
      b1174bd [witgo] merge master
      f76679b [witgo] merge master
      689495d [witgo] fix scala style bug
      1d35c3c [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
      062c182 [witgo] fix small bug for code style
      0a428cf [witgo] add unit tests
      158b2dc [witgo] review commit
      4afe71d [witgo] review commit
      9e4ffa7 [witgo] review commit
      1d35c7d [witgo] fix hang
      7965580 [witgo] fix Statement order
      0e29eac [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
      3ea1ca1 [witgo] remove duplicate serialize
      743a7ad [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
      86e2048 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
      2a89adc [witgo] SPARK-1712: TaskDescription instance is too big causes Spark to hang
      4dbb27b0
• Spark 1916 · 4312cf0b
      David Lemieux authored
      
The changes could be ported back to 0.9 as well.
This changes `in.read` to `in.readFully` to read the whole input stream rather than just the first 1020 bytes.
This should be OK considering that Flume caps the body size to 32K by default.
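
A minimal Scala sketch of the difference (synthetic in-memory stream, illustrative only):

```scala
import java.io.{ByteArrayInputStream, DataInputStream}

val bytes = Array.fill[Byte](4096)(1)
val buf = new Array[Byte](4096)

// read() may return after filling only part of the buffer:
val n = new DataInputStream(new ByteArrayInputStream(bytes)).read(buf)

// readFully() keeps reading until the buffer is completely filled
// (or throws EOFException if the stream ends first):
new DataInputStream(new ByteArrayInputStream(bytes)).readFully(buf)
```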
      
      Author: David Lemieux <david.lemieux@radialpoint.com>
      
      Closes #865 from lemieud/SPARK-1916 and squashes the following commits:
      
      a265673 [David Lemieux] Updated SparkFlumeEvent to read the whole stream rather than the first X bytes.
      (cherry picked from commit 0b769b73)
      
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
      4312cf0b
• Organize configuration docs · 7801d44f
      Patrick Wendell authored
      This PR improves and organizes the config option page
      and makes a few other changes to config docs. See a preview here:
      http://people.apache.org/~pwendell/config-improvements/configuration.html
      
      The biggest changes are:
      1. The configs for the standalone master/workers were moved to the
      standalone page and out of the general config doc.
      2. SPARK_LOCAL_DIRS was missing from the standalone docs.
      3. Expanded discussion of injecting configs with spark-submit, including an
      example.
      4. Config options were organized into the following categories:
      - Runtime Environment
      - Shuffle Behavior
      - Spark UI
      - Compression and Serialization
      - Execution Behavior
      - Networking
      - Scheduling
      - Security
      - Spark Streaming
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #880 from pwendell/config-cleanup and squashes the following commits:
      
      93f56c3 [Patrick Wendell] Feedback from Matei
      6f66efc [Patrick Wendell] More feedback
      16ae776 [Patrick Wendell] Adding back header section
      d9c264f [Patrick Wendell] Small fix
      e0c1728 [Patrick Wendell] Response to Matei's review
      27d57db [Patrick Wendell] Reverting changes to index.html (covered in #896)
      e230ef9 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup
      a374369 [Patrick Wendell] Line wrapping fixes
      fdff7fc [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup
      3289ea4 [Patrick Wendell] Pulling in changes from #856
      106ee31 [Patrick Wendell] Small link fix
      f7e79bc [Patrick Wendell] Re-organizing config options.
      54b184d [Patrick Wendell] Adding standalone configs to the standalone page
      592e94a [Patrick Wendell] Stash
      29b5446 [Patrick Wendell] Better discussion of spark-submit in configuration docs
      2d719ef [Patrick Wendell] Small fix
      4af9e07 [Patrick Wendell] Adding SPARK_LOCAL_DIRS docs
      204b248 [Patrick Wendell] Small fixes
      7801d44f
• Fix doc about NetworkWordCount/JavaNetworkWordCount usage of spark streaming · 82eadc3b
      jmu authored
      Usage: NetworkWordCount <master> <hostname> <port>
      -->
      Usage: NetworkWordCount <hostname> <port>
      
      Usage: JavaNetworkWordCount <master> <hostname> <port>
      -->
      Usage: JavaNetworkWordCount <hostname> <port>
      
      Author: jmu <jmujmu@gmail.com>
      
      Closes #826 from jmu/master and squashes the following commits:
      
      9fb7980 [jmu] Merge branch 'master' of https://github.com/jmu/spark
      b9a6b02 [jmu] Fix doc for NetworkWordCount/JavaNetworkWordCount Usage: NetworkWordCount <master> <hostname> <port> --> Usage: NetworkWordCount <hostname> <port>
      82eadc3b
• [SPARK-1938] [SQL] ApproxCountDistinctMergeFunction should return Int value. · 9df86835
      Takuya UESHIN authored
`ApproxCountDistinctMergeFunction` should return an `Int` value because the `dataType` of `ApproxCountDistinct` is `IntegerType`.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #893 from ueshin/issues/SPARK-1938 and squashes the following commits:
      
      3970e88 [Takuya UESHIN] Remove a superfluous line.
      5ad7ec1 [Takuya UESHIN] Make dataType for each of CountDistinct, ApproxCountDistinctMerge and ApproxCountDistinct LongType.
      cbe7c71 [Takuya UESHIN] Revert a change.
      fc3ac0f [Takuya UESHIN] Fix evaluated value type of ApproxCountDistinctMergeFunction to Int.
      9df86835
  3. May 27, 2014
  4. May 26, 2014
• Updated dev Python scripts to make them PEP8 compliant. · 9ed37190
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #875 from rxin/pep8-dev-scripts and squashes the following commits:
      
      04b084f [Reynold Xin] Made dev Python scripts PEP8 compliant.
      9ed37190
• SPARK-1929 DAGScheduler suspended by local task OOM · 8d271c90
      Zhen Peng authored
      DAGScheduler does not handle local task OOM properly, and will wait for the job result forever.
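
A loose, self-contained model of the failure-propagation idea (all names here are made up; the real fix lives inside DAGScheduler's local-task path):

```scala
import scala.concurrent.Promise
import scala.util.Failure

// The scheduler effectively waits on a job-result promise. An OutOfMemoryError
// thrown by a local task must complete that promise, or the waiter blocks forever.
val jobResult = Promise[Long]()

def runLocalTask(): Long = throw new OutOfMemoryError("simulated local task OOM")

try {
  jobResult.success(runLocalTask())
} catch {
  case e: Throwable => jobResult.tryComplete(Failure(e)) // propagate instead of hanging
}
```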
      
      Author: Zhen Peng <zhenpeng01@baidu.com>
      
      Closes #883 from zhpengg/bugfix-dag-scheduler-oom and squashes the following commits:
      
      76f7eda [Zhen Peng] remove redundant memory allocations
      aa63161 [Zhen Peng] SPARK-1929 DAGScheduler suspended by local task OOM
      8d271c90
• [SPARK-1931] Reconstruct routing tables in Graph.partitionBy · 56c771cb
      Ankur Dave authored
      905173df introduced a bug in partitionBy where, after repartitioning the edges, it reuses the VertexRDD without updating the routing tables to reflect the new edge layout. Subsequent accesses of the triplets contain nulls for many vertex properties.
      
      This commit adds a test for this bug and fixes it by introducing `VertexRDD#withEdges` and calling it in `partitionBy`.
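
A minimal usage sketch of the scenario being tested (GraphX 1.0-era API; the graph here is made up):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx._

// After repartitioning the edges, triplets should still see vertex attributes.
def checkPartitionBy(sc: SparkContext): Unit = {
  val edges = sc.parallelize(Seq((0L, 1L), (1L, 2L), (2L, 0L)))
  val graph = Graph.fromEdgeTuples(edges, defaultValue = "attr")
  val repartitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D)
  // Before this fix, stale routing tables surfaced here as null attributes.
  assert(repartitioned.triplets.collect().forall(t => t.srcAttr != null && t.dstAttr != null))
}
```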
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #885 from ankurdave/SPARK-1931 and squashes the following commits:
      
      3930cdd [Ankur Dave] Note how to set up VertexRDD for efficient joins
      9bdbaa4 [Ankur Dave] [SPARK-1931] Reconstruct routing tables in Graph.partitionBy
      56c771cb
• SPARK-1925: Replace '&' with '&&' · cb7fe503
      zsxwing authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-1925
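
Why the change matters, in two lines (illustrative):

```scala
val arr = Array.empty[Int]

val safe = arr.nonEmpty && arr(0) > 0 // && short-circuits: arr(0) is never evaluated
// val boom = arr.nonEmpty & arr(0) > 0 // & evaluates both sides: ArrayIndexOutOfBoundsException
```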
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #879 from zsxwing/SPARK-1925 and squashes the following commits:
      
      5cf5a6d [zsxwing] SPARK-1925: Replace '&' with '&&'
      cb7fe503
• Fix scalastyle warnings in yarn alpha · bee6c4f4
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #884 from witgo/scalastyle and squashes the following commits:
      
      4b08ae4 [witgo] Fix scalastyle warnings in yarn alpha
      bee6c4f4
• [SPARK-1914] [SQL] Simplify CountFunction not to traverse to evaluate all child expressions. · d6395d86
      Takuya UESHIN authored
      `CountFunction` should count up only if the child's evaluated value is not null.
      
Because it traverses and evaluates all child expressions, it counts up if any of the children is not null, even when the child being counted evaluates to null.
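
A tiny model of the corrected behaviour (not the Catalyst code itself): given the counted expression's per-row evaluations, count only the non-null ones.

```scala
// Count a row only when the counted expression's value is non-null.
def countNonNull(evaluatedValues: Seq[Any]): Long =
  evaluatedValues.count(_ != null).toLong

assert(countNonNull(Seq("a", null, "b")) == 2L)
```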
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #861 from ueshin/issues/SPARK-1914 and squashes the following commits:
      
      3b37315 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-1914
      2afa238 [Takuya UESHIN] Simplify CountFunction not to traverse to evaluate all child expressions.
      d6395d86
  5. May 25, 2014
• HOTFIX: Add no-arg SparkContext constructor in Java · b6d22af0
      Patrick Wendell authored
      Self explanatory.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #878 from pwendell/java-constructor and squashes the following commits:
      
      2cc1605 [Patrick Wendell] HOTFIX: Add no-arg SparkContext constructor in Java
      b6d22af0
• [SQL] Minor: Introduce SchemaRDD#aggregate() for simple aggregations · c3576ffc
      Aaron Davidson authored
      ```scala
      rdd.aggregate(Sum('val))
      ```
      is just shorthand for
      
      ```scala
      rdd.groupBy()(Sum('val))
      ```
      
but seems more natural than doing a groupBy with no grouping expressions when you really just want an aggregation over all rows.
      
      Did not add a JavaSchemaRDD or Python API, as these seem to be lacking several other methods like groupBy() already -- leaving that cleanup for future patches.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #874 from aarondav/schemardd and squashes the following commits:
      
      e9e68ee [Aaron Davidson] Add comment
      db6afe2 [Aaron Davidson] Introduce SchemaRDD#aggregate() for simple aggregations
      c3576ffc
• SPARK-1903 Document Spark's network connections · 06595296
      Andrew Ash authored
      https://issues.apache.org/jira/browse/SPARK-1903
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #856 from ash211/SPARK-1903 and squashes the following commits:
      
      6e7782a [Andrew Ash] Add the technology used on each port
      1d9b5d3 [Andrew Ash] Document port for history server
      56193ee [Andrew Ash] spark.ui.port becomes worker.ui.port and master.ui.port
      a774c07 [Andrew Ash] Wording in network section
      90e8237 [Andrew Ash] Use real :toc instead of the hand-written one
      edaa337 [Andrew Ash] Master -> Standalone Cluster Master
      57e8869 [Andrew Ash] Port -> Default Port
      3d4d289 [Andrew Ash] Title to title case
      c7d42d9 [Andrew Ash] [WIP] SPARK-1903 Add initial port listing for documentation
      a416ae9 [Andrew Ash] Word wrap to 100 lines
      06595296
• Fix PEP8 violations in Python mllib. · d33d3c61
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #871 from rxin/mllib-pep8 and squashes the following commits:
      
      848416f [Reynold Xin] Fixed a typo in the previous cleanup (c -> sc).
      a8db4cd [Reynold Xin] Fix PEP8 violations in Python mllib.
      d33d3c61
• Python docstring update for sql.py. · 14f0358b
      Reynold Xin authored
      Mostly related to the following two rules in PEP8 and PEP257:
      - Line length < 72 chars.
      - First line should be a concise description of the function/class.
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #869 from rxin/docstring-schemardd and squashes the following commits:
      
      7cf0cbc [Reynold Xin] Updated sql.py for pep8 docstring.
      0a4aef9 [Reynold Xin] Merge branch 'master' into docstring-schemardd
      6678937 [Reynold Xin] Python docstring update for sql.py.
      14f0358b
• Fix PEP8 violations in examples/src/main/python. · d79c2b28
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #870 from rxin/examples-python-pep8 and squashes the following commits:
      
      2829e84 [Reynold Xin] Fix PEP8 violations in examples/src/main/python.
      d79c2b28
• Added license header for tox.ini. · 55fddf9c
      Reynold Xin authored
      
      (cherry picked from commit fa541f32c5b92e6868a9c99cbb2c87115d624d23)
Signed-off-by: Reynold Xin <rxin@apache.org>
      55fddf9c
• SPARK-1822: Some minor cleanup work on SchemaRDD.count() · d66642e3
      Reynold Xin authored
      Minor cleanup following #841.
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #868 from rxin/schema-count and squashes the following commits:
      
      5442651 [Reynold Xin] SPARK-1822: Some minor cleanup work on SchemaRDD.count()
      d66642e3
• Added PEP8 style configuration file. · 5c7faecd
      Reynold Xin authored
      This sets the max line length to 100 as a PEP8 exception.
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #872 from rxin/pep8 and squashes the following commits:
      
      2f26029 [Reynold Xin] Added PEP8 style configuration file.
      5c7faecd
• [SPARK-1822] SchemaRDD.count() should use query optimizer · 6052db9d
      Kan Zhang authored
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #841 from kanzhang/SPARK-1822 and squashes the following commits:
      
      2f8072a [Kan Zhang] [SPARK-1822] Minor style update
      cf4baa4 [Kan Zhang] [SPARK-1822] Adding Scaladoc
      e67c910 [Kan Zhang] [SPARK-1822] SchemaRDD.count() should use optimizer
      6052db9d
• spark-submit: add exec at the end of the script · 6e9fb632
      Colin Patrick Mccabe authored
      Add an 'exec' at the end of the spark-submit script, to avoid keeping a
      bash process hanging around while it runs.  This makes ps look a little
      bit nicer.
      
      Author: Colin Patrick Mccabe <cmccabe@cloudera.com>
      
      Closes #858 from cmccabe/SPARK-1907 and squashes the following commits:
      
      7023b64 [Colin Patrick Mccabe] spark-submit: add exec at the end of the script
      6e9fb632
  6. May 24, 2014
• [SPARK-1913][SQL] Bug fix: column pruning error in Parquet support · 5afe6af0
      Cheng Lian authored
      JIRA issue: [SPARK-1913](https://issues.apache.org/jira/browse/SPARK-1913)
      
When scanning Parquet tables, attributes referenced only in predicates that are pushed down are not passed to the `ParquetTableScan` operator, which causes an exception.
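
A hypothetical reproduction (the table and path are made up): `key` appears only in the pushed-down predicate, not in the projection, so before the fix it was dropped from the column-pruned scan schema.

```scala
import org.apache.spark.sql.SQLContext

def reproduce(sqlContext: SQLContext): Unit = {
  sqlContext.parquetFile("/tmp/some_table.parquet").registerAsTable("t")
  sqlContext.sql("SELECT value FROM t WHERE key = 1").collect()
}
```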
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #863 from liancheng/spark-1913 and squashes the following commits:
      
      f976b73 [Cheng Lian] Addessed the readability issue commented by @rxin
      f5b257d [Cheng Lian] Added back comments deleted by mistake
      ae60ab3 [Cheng Lian] [SPARK-1913] Attributes referenced only in predicates pushed down should remain in ParquetTableScan operator
      5afe6af0
• [SPARK-1886] check executor id existence when executor exit · 4e4831b8
      Zhen Peng authored
      Author: Zhen Peng <zhenpeng01@baidu.com>
      
      Closes #827 from zhpengg/bugfix-executor-id-not-found and squashes the following commits:
      
      cd8bb65 [Zhen Peng] bugfix: check executor id existence when executor exit
      4e4831b8
• SPARK-1911: Emphasize that Spark jars should be built with Java 6. · 75a03277
      Patrick Wendell authored
This commit requires the user to manually say "yes" when building Spark
      without Java 6. The prompt can be bypassed with a flag (e.g. if the user
      is scripting around make-distribution).
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #859 from pwendell/java6 and squashes the following commits:
      
      4921133 [Patrick Wendell] Adding Pyspark Notice
      fee8c9e [Patrick Wendell] SPARK-1911: Emphasize that Spark jars should be built with Java 6.
      75a03277
• [SPARK-1900 / 1918] PySpark on YARN is broken · 5081a0a9
      Andrew Or authored
      If I run the following on a YARN cluster
      ```
      bin/spark-submit sheep.py --master yarn-client
      ```
      it fails because of a mismatch in paths: `spark-submit` thinks that `sheep.py` resides on HDFS, and balks when it can't find the file there. A natural workaround is to add the `file:` prefix to the file:
      ```
      bin/spark-submit file:/path/to/sheep.py --master yarn-client
      ```
However, this also fails. This time it is because Python does not understand URI schemes.
      
      This PR fixes this by automatically resolving all paths passed as command line argument to `spark-submit` properly. This has the added benefit of keeping file and jar paths consistent across different cluster modes. For python, we strip the URI scheme before we actually try to run it.
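
A hedged sketch of the resolution idea (not the actual `Utils` code): leave paths that already carry a scheme alone, and make bare local paths explicit `file:` URIs.

```scala
import java.net.URI

def resolve(path: String): String = {
  val uri = new URI(path)
  if (uri.getScheme != null) uri.toString                    // e.g. hdfs://..., file:/...
  else new java.io.File(path).getAbsoluteFile.toURI.toString // assume a local file
}
```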
      
Much of the code was originally written by @mengxr. Tested on a YARN cluster. More tests pending.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #853 from andrewor14/submit-paths and squashes the following commits:
      
      0bb097a [Andrew Or] Format path correctly before adding it to PYTHONPATH
      323b45c [Andrew Or] Include --py-files on PYTHONPATH for pyspark shell
      3c36587 [Andrew Or] Improve error messages (minor)
      854aa6a [Andrew Or] Guard against NPE if user gives pathological paths
      6638a6b [Andrew Or] Fix spark-shell jar paths after #849 went in
      3bb0359 [Andrew Or] Update more comments (minor)
      2a1f8a0 [Andrew Or] Update comments (minor)
      6af2c77 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-paths
      a68c4d1 [Andrew Or] Handle Windows python file path correctly
      427a250 [Andrew Or] Resolve paths properly for Windows
      a591a4a [Andrew Or] Update tests for resolving URIs
      6c8621c [Andrew Or] Move resolveURIs to Utils
      db8255e [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-paths
      f542dce [Andrew Or] Fix outdated tests
      691c4ce [Andrew Or] Ignore special primary resource names
      5342ac7 [Andrew Or] Add missing space in error message
      02f77f3 [Andrew Or] Resolve command line arguments to spark-submit properly
      5081a0a9
  7. May 23, 2014
  8. May 22, 2014
• Updated scripts for auditing releases · b2bdd0e5
      Tathagata Das authored
      - Added script to automatically generate change list CHANGES.txt
      - Added test for verifying linking against maven distributions of `spark-sql` and `spark-hive`
      - Added SBT projects for testing functionality of `spark-sql` and `spark-hive`
      - Fixed issues in existing tests that might have come up because of changes in Spark 1.0
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #844 from tdas/update-dev-scripts and squashes the following commits:
      
      25090ba [Tathagata Das] Added missing license
      e2e20b3 [Tathagata Das] Updated tests for auditing releases.
      b2bdd0e5
• [SPARK-1896] Respect spark.master (and --master) before MASTER in spark-shell · cce77457
      Andrew Or authored
      The hierarchy for configuring the Spark master in the shell is as follows:
      ```
      MASTER > --master > spark.master (spark-defaults.conf)
      ```
      This is inconsistent with the way we run normal applications, which is:
      ```
      --master > spark.master (spark-defaults.conf) > MASTER
      ```
      
      I was trying to run a shell locally on a standalone cluster launched through the ec2 scripts, which automatically set `MASTER` in spark-env.sh. It was surprising to me that `--master` didn't take effect, considering that this is the way we tell users to set their masters [here](http://people.apache.org/~pwendell/spark-1.0.0-rc7-docs/scala-programming-guide.html#initializing-spark).
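
An illustrative precedence resolver matching the application-mode ordering above (names are made up):

```scala
// --master wins, then spark.master from spark-defaults.conf, then MASTER.
def resolveMaster(cliMaster: Option[String],
                  confMaster: Option[String],
                  envMaster: Option[String]): String =
  cliMaster.orElse(confMaster).orElse(envMaster).getOrElse("local[*]")

assert(resolveMaster(Some("yarn"), Some("spark://a:7077"), Some("local")) == "yarn")
```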
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #846 from andrewor14/shell-master and squashes the following commits:
      
      2cb81c9 [Andrew Or] Respect spark.master before MASTER in REPL
      cce77457