  2. Mar 11, 2015
• [SPARK-6128][Streaming][Documentation] Updates to Spark Streaming Programming Guide · cd3b68d9
      Tathagata Das authored
      Updates to the documentation are as follows:
      
- Added information on the Kafka Direct API and Kafka Python API (see the sketch below)
      - Added joins to the main streaming guide
      - Improved details on the fault-tolerance semantics
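
As a reference for the new coverage, here is a minimal sketch of the receiver-less Kafka Direct API (the broker address and topic are placeholders, and `ssc` is assumed to be an existing StreamingContext):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Direct stream: no receivers; Spark tracks the Kafka offsets itself.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic"))
```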
      
      Generated docs located here
      http://people.apache.org/~tdas/spark-1.3.0-temp-docs/streaming-programming-guide.html#fault-tolerance-semantics
      
      More things to add:
      - Configuration for Kafka receive rate
- Maybe add concurrentJobs
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #4956 from tdas/streaming-guide-update-1.3 and squashes the following commits:
      
      819408c [Tathagata Das] Minor fixes.
      debe484 [Tathagata Das] Added DataFrames and MLlib
      380cf8d [Tathagata Das] Fix link
      04167a6 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into streaming-guide-update-1.3
      0b77486 [Tathagata Das] Updates based on Josh's comments.
      86c4c2a [Tathagata Das] Updated streaming guides
      82de92a [Tathagata Das] Add Kafka to Python api docs
      cd3b68d9
• [SPARK-6274][Streaming][Examples] Added streaming + SQL examples. · 51a79a77
      Tathagata Das authored
      Added Scala, Java and Python streaming examples showing DataFrame and SQL operations within streaming.
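
The examples all follow roughly the same pattern: convert each micro-batch to a DataFrame inside `foreachRDD` and query it with SQL. A hedged sketch in Scala (the socket source, table name, and `sc` are placeholders for an existing setup; the shipped examples use a singleton SQLContext rather than creating one per batch):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

case class Record(word: String)

val ssc = new StreamingContext(sc, Seconds(2))
val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
words.foreachRDD { rdd =>
  // Turn the micro-batch into a DataFrame and run SQL over it.
  val sqlContext = new SQLContext(rdd.sparkContext)
  import sqlContext.implicits._
  rdd.map(Record(_)).toDF().registerTempTable("words")
  sqlContext.sql("select word, count(*) as total from words group by word").show()
}
ssc.start()
ssc.awaitTermination()
```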
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #4975 from tdas/streaming-sql-examples and squashes the following commits:
      
      705cba1 [Tathagata Das] Fixed python lint error
      75a3fad [Tathagata Das] Fixed python lint error
      5fbf789 [Tathagata Das] Removed empty lines at the end
      874b943 [Tathagata Das] Added examples streaming + sql examples.
      51a79a77
• SPARK-6245 [SQL] jsonRDD() of empty RDD results in exception · 55c4831d
      Sean Owen authored
      Avoid `UnsupportedOperationException` from JsonRDD.inferSchema on empty RDD.
      
It's unclear whether this is supposed to be an error (just a better one), but it seems this case can come up if the input is down-sampled so much that nothing is sampled.
      
Now a call like this:
      ```
      sqlContext.jsonRDD(sc.parallelize(List[String]()))
      ```
      just results in
      ```
      org.apache.spark.sql.DataFrame = []
      ```
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4971 from srowen/SPARK-6245 and squashes the following commits:
      
      3699964 [Sean Owen] Set() -> Set.empty
      3c619e1 [Sean Owen] Avoid UnsupportedOperationException from JsonRDD.inferSchema on empty RDD
      55c4831d
• SPARK-3642. Document the nuances of shared variables. · 2d87a415
      Sandy Ryza authored
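The core nuance, as a minimal sketch (assuming an existing SparkContext `sc`):

```scala
// A broadcast variable is a read-only snapshot shipped once per executor;
// tasks read it via .value, and mutating it on a worker propagates nowhere.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val total = sc.parallelize(Seq("a", "b", "a"))
  .map(k => lookup.value.getOrElse(k, 0))
  .sum()
```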
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #2490 from sryza/sandy-spark-3642 and squashes the following commits:
      
      aae3340 [Sandy Ryza] SPARK-3642. Document the nuances of broadcast variables
      2d87a415
• [SPARK-4423] Improve foreach() documentation to avoid confusion between local- and cluster-mode behavior · 548643a9
Ilya Ganelin authored
      
      Hi all - I've added a writeup on how closures work within Spark to help clarify the general case for this problem and similar problems. I hope this addresses the issue and would love any feedback.
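
The canonical pitfall the writeup covers, sketched here for reference (assuming an existing SparkContext `sc`):

```scala
var counter = 0
val rdd = sc.parallelize(1 to 100)

// Each task mutates its own serialized copy of `counter`; in cluster mode
// the driver's variable is never updated, so this prints 0.
rdd.foreach(x => counter += x)
println(counter)

// An accumulator is the supported way to aggregate into the driver.
val acc = sc.accumulator(0)
rdd.foreach(x => acc += x)
println(acc.value) // 5050
```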
      
      Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
      
      Closes #4696 from ilganeli/SPARK-4423 and squashes the following commits:
      
      c5dc498 [Ilya Ganelin] Fixed typo
      07b78e8 [Ilya Ganelin] Updated to fix capitalization
      48c1983 [Ilya Ganelin] Updated to fix capitalization and clarify wording
2fd2a07 [Ilya Ganelin] Incorporated a few more minor fixes. Fixed a bug in python code. Added semicolons for java
      4772f99 [Ilya Ganelin] Incorporated latest feedback
448bd79 [Ilya Ganelin] Updated some verbiage and added section links
      5dbbda5 [Ilya Ganelin] Improved some wording
      d374d3a [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-4423
      2600668 [Ilya Ganelin] Minor edits
      c768ab2 [Ilya Ganelin] Updated documentation to add a section on closures. This helps understand confusing behavior of foreach and map functions when attempting to modify variables outside of the scope of an RDD action or transformation
      548643a9
• [SPARK-6228] [network] Move SASL classes from network/shuffle to network/common. · 5b335bdd
Marcelo Vanzin authored
      
      No code changes. Left the shuffle-related files in the shuffle module.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #4953 from vanzin/SPARK-6228 and squashes the following commits:
      
      664ef30 [Marcelo Vanzin] [SPARK-6228] [network] Move SASL classes from network/shuffle to network/common.
      5b335bdd
• SPARK-6225 [CORE] [SQL] [STREAMING] Resolve most build warnings, 1.3.0 edition · 6e94c4ea
      Sean Owen authored
      Resolve javac, scalac warnings of various types -- deprecations, Scala lang, unchecked cast, etc.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4950 from srowen/SPARK-6225 and squashes the following commits:
      
      3080972 [Sean Owen] Ordered imports: Java, Scala, 3rd party, Spark
      c67985b [Sean Owen] Resolve javac, scalac warnings of various types -- deprecations, Scala lang, unchecked cast, etc.
      6e94c4ea
• [SPARK-6279][Streaming] Missing interpolation flag "s" in KafkaRDD.scala logging string · ec30c178
zzcclp authored
A logging string in KafkaRDD.scala is missing the interpolation flag `s`, so the log file shows the literal text `Beginning offset ${part.fromOffset} is the same as ending offset` instead of, e.g., `Beginning offset 111 is the same as ending offset`.
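
The fix is just the `s` prefix on the logged string; a standalone illustration:

```scala
val fromOffset = 111

// Without the `s` prefix, the expression is printed literally:
println("Beginning offset ${fromOffset} is the same as ending offset")
// => Beginning offset ${fromOffset} is the same as ending offset

// With the `s` interpolator, the value is substituted:
println(s"Beginning offset ${fromOffset} is the same as ending offset")
// => Beginning offset 111 is the same as ending offset
```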
      
      Author: zzcclp <xm_zzc@sina.com>
      
      Closes #4979 from zzcclp/SPARK-6279 and squashes the following commits:
      
      768f88e [zzcclp] Miss expressions flag "s"
      ec30c178
• [SQL][Minor] fix typo in comments · 40f49795
      Hongbo Liu authored
Removed a repeated "from" in the comments.
      
      Author: Hongbo Liu <liuhb86@gmail.com>
      
      Closes #4976 from liuhb86/mine and squashes the following commits:
      
      e280e7c [Hongbo Liu] [SQL][Minor] fix typo in comments
      40f49795
• [MINOR] [DOCS] Fix map -> mapToPair in Streaming Java example · 35b25640
      Sean Owen authored
      Fix map -> mapToPair in Java example. (And zap some unneeded "throws Exception" while here)
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4967 from srowen/MapToPairFix and squashes the following commits:
      
      ded2bc0 [Sean Owen] Fix map -> mapToPair in Java example. (And zap some unneeded "throws Exception" while here)
      35b25640
• [SPARK-4924] Add a library for launching Spark jobs programmatically. · 517975d8
      Marcelo Vanzin authored
      This change encapsulates all the logic involved in launching a Spark job
      into a small Java library that can be easily embedded into other applications.
      
      The overall goal of this change is twofold, as described in the bug:
      
      - Provide a public API for launching Spark processes. This is a common request
        from users and currently there's no good answer for it.
      
      - Remove a lot of the duplicated code and other coupling that exists in the
        different parts of Spark that deal with launching processes.
      
      A lot of the duplication was due to different code needed to build an
      application's classpath (and the bootstrapper needed to run the driver in
      certain situations), and also different code needed to parse spark-submit
      command line options in different contexts. The change centralizes those
      as much as possible so that all code paths can rely on the library for
      handling those appropriately.
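
Launching an application through the new library looks roughly like this (a sketch based on the `SparkLauncher` class this change introduces; the jar path, main class, and config are placeholders):

```scala
import org.apache.spark.launcher.SparkLauncher

// Starts spark-submit as a child process and waits for it to finish.
val app: Process = new SparkLauncher()
  .setAppResource("/path/to/my-app.jar")
  .setMainClass("com.example.MyApp")
  .setMaster("local[2]")
  .setConf("spark.executor.memory", "1g")
  .launch()
app.waitFor()
```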
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #3916 from vanzin/SPARK-4924 and squashes the following commits:
      
      18c7e4d [Marcelo Vanzin] Fix make-distribution.sh.
      2ce741f [Marcelo Vanzin] Add lots of quotes.
      3b28a75 [Marcelo Vanzin] Update new pom.
      a1b8af1 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      897141f [Marcelo Vanzin] Review feedback.
      e2367d2 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      28cd35e [Marcelo Vanzin] Remove stale comment.
      b1d86b0 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      00505f9 [Marcelo Vanzin] Add blurb about new API in the programming guide.
      5f4ddcc [Marcelo Vanzin] Better usage messages.
      92a9cfb [Marcelo Vanzin] Fix Win32 launcher, usage.
      6184c07 [Marcelo Vanzin] Rename field.
      4c19196 [Marcelo Vanzin] Update comment.
      7e66c18 [Marcelo Vanzin] Fix pyspark tests.
      0031a8e [Marcelo Vanzin] Review feedback.
      c12d84b [Marcelo Vanzin] Review feedback. And fix spark-submit on Windows.
      e2d4d71 [Marcelo Vanzin] Simplify some code used to launch pyspark.
      43008a7 [Marcelo Vanzin] Don't make builder extend SparkLauncher.
      b4d6912 [Marcelo Vanzin] Use spark-submit script in SparkLauncher.
      28b1434 [Marcelo Vanzin] Add a comment.
      304333a [Marcelo Vanzin] Fix propagation of properties file arg.
      bb67b93 [Marcelo Vanzin] Remove unrelated Yarn change (that is also wrong).
      8ec0243 [Marcelo Vanzin] Add missing newline.
      95ddfa8 [Marcelo Vanzin] Fix handling of --help for spark-class command builder.
      72da7ec [Marcelo Vanzin] Rename SparkClassLauncher.
      62978e4 [Marcelo Vanzin] Minor cleanup of Windows code path.
      9cd5b44 [Marcelo Vanzin] Make all non-public APIs package-private.
      e4c80b6 [Marcelo Vanzin] Reorganize the code so that only SparkLauncher is public.
      e50dc5e [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      de81da2 [Marcelo Vanzin] Fix CommandUtils.
      86a87bf [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      2061967 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      46d46da [Marcelo Vanzin] Clean up a test and make it more future-proof.
      b93692a [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      ad03c48 [Marcelo Vanzin] Revert "Fix a thread-safety issue in "local" mode."
      0b509d0 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      23aa2a9 [Marcelo Vanzin] Read java-opts from conf dir, not spark home.
      7cff919 [Marcelo Vanzin] Javadoc updates.
      eae4d8e [Marcelo Vanzin] Fix new unit tests on Windows.
      e570fb5 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      44cd5f7 [Marcelo Vanzin] Add package-info.java, clean up javadocs.
      f7cacff [Marcelo Vanzin] Remove "launch Spark in new thread" feature.
      7ed8859 [Marcelo Vanzin] Some more feedback.
      54cd4fd [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      61919df [Marcelo Vanzin] Clean leftover debug statement.
      aae5897 [Marcelo Vanzin] Use launcher classes instead of jars in non-release mode.
      e584fc3 [Marcelo Vanzin] Rework command building a little bit.
      525ef5b [Marcelo Vanzin] Rework Unix spark-class to handle argument with newlines.
      8ac4e92 [Marcelo Vanzin] Minor test cleanup.
      e946a99 [Marcelo Vanzin] Merge PySparkLauncher into SparkSubmitCliLauncher.
      c617539 [Marcelo Vanzin] Review feedback round 1.
      fc6a3e2 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      f26556b [Marcelo Vanzin] Fix a thread-safety issue in "local" mode.
      2f4e8b4 [Marcelo Vanzin] Changes needed to make this work with SPARK-4048.
      799fc20 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      bb5d324 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      53faef1 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      a7936ef [Marcelo Vanzin] Fix pyspark tests.
      656374e [Marcelo Vanzin] Mima fixes.
      4d511e7 [Marcelo Vanzin] Fix tools search code.
      7a01e4a [Marcelo Vanzin] Fix pyspark on Yarn.
      1b3f6e9 [Marcelo Vanzin] Call SparkSubmit from spark-class launcher for unknown classes.
      25c5ae6 [Marcelo Vanzin] Centralize SparkSubmit command line parsing.
      27be98a [Marcelo Vanzin] Modify Spark to use launcher lib.
6f70eea [Marcelo Vanzin] [SPARK-4924] Add a library for launching Spark jobs programmatically.
      517975d8
• [SPARK-5986][MLLib] Add save/load for k-means · 2d4e00ef
      Xusen Yin authored
      This PR adds save/load for K-means as described in SPARK-5986. Python version will be added in another PR.
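
The round trip looks like the following sketch (the path and training data are placeholders; `sc` is an existing SparkContext):

```scala
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

val data = sc.parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0)))
val model = KMeans.train(data, k = 2, maxIterations = 10)

// Persist the fitted model and read it back.
model.save(sc, "/tmp/kmeans-model")
val loaded = KMeansModel.load(sc, "/tmp/kmeans-model")
```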
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #4951 from yinxusen/SPARK-5986 and squashes the following commits:
      
      6dd74a0 [Xusen Yin] rewrite some functions and classes
      cd390fd [Xusen Yin] add indexed point
      b144216 [Xusen Yin] remove invalid comments
      dce7055 [Xusen Yin] add save/load for k-means for SPARK-5986
      2d4e00ef
  3. Mar 10, 2015
• [SPARK-5183][SQL] Update SQL Docs with JDBC and Migration Guide · 26723741
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4958 from marmbrus/sqlDocs and squashes the following commits:
      
      9351dbc [Michael Armbrust] fix parquet example
      6877e13 [Michael Armbrust] add sql examples
      d81b7e7 [Michael Armbrust] rxins comments
      e393528 [Michael Armbrust] fix order
      19c2735 [Michael Armbrust] more on data source load/store
      00d5914 [Michael Armbrust] Update SQL Docs with JDBC and Migration Guide
      26723741
• Minor doc: Remove the extra blank line in data types javadoc. · 74fb4337
      Reynold Xin authored
      The extra blank line is preventing the first lines from showing up in the package summary page.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4955 from rxin/datatype-docs and squashes the following commits:
      
      1621114 [Reynold Xin] Minor doc: Remove the extra blank line in data types javadoc.
      74fb4337
• [SPARK-6186] [EC2] Make Tachyon version configurable in EC2 deployment script · 7c7d2d5e
      cheng chang authored
This PR comes from the Tachyon community to solve the issue:
      https://tachyon.atlassian.net/browse/TACHYON-11
      
      An accompanying PR is in mesos/spark-ec2:
      https://github.com/mesos/spark-ec2/pull/101
      
      Author: cheng chang <myairia@gmail.com>
      
      Closes #4901 from uronce-cc/master and squashes the following commits:
      
      313aa36 [cheng chang] minor re-wording
      fd2a48e [cheng chang] Remove Tachyon when deploying through git hash
      1d53c5c [cheng chang] add default value to --tachyon-version
      6f8887e [cheng chang] make tachyon version configurable
      7c7d2d5e
• [SPARK-6191] [EC2] Generalize ability to download libs · d14df06c
      Nicholas Chammas authored
      Right now we have a method to specifically download boto. This PR generalizes it so it's easy to download additional libraries if we want.
      
      For example, adding new external libraries for spark-ec2 is now as simple as:
      
      ```python
      external_libs = [
          {
               "name": "boto",
               "version": "2.34.0",
               "md5": "5556223d2d0cc4d06dd4829e671dcecd"
          },
          {
              "name": "PyYAML",
              "version": "3.11",
              "md5": "f50e08ef0fe55178479d3a618efe21db"
          },
          {
              "name": "argparse",
              "version": "1.3.0",
              "md5": "9bcf7f612190885c8c85e30ba41db3c7"
          }
      ]
      ```
      Likely use cases:
      * Downloading PyYAML to allow spark-ec2 configs to be persisted as a YAML file. ([SPARK-925](https://issues.apache.org/jira/browse/SPARK-925))
      * Downloading argparse to clean up / modernize our option parsing.
      
      First run output, with PyYAML and argparse added just for demonstration purposes:
      
      ```shell
      $ ./spark-ec2 --version
      Downloading external libraries that spark-ec2 needs from PyPI to /path/to/spark/ec2/lib...
      This should be a one-time operation.
       - Downloading boto...
       - Finished downloading boto.
       - Downloading PyYAML...
       - Finished downloading PyYAML.
       - Downloading argparse...
       - Finished downloading argparse.
      spark-ec2 1.2.1
      ```
      
      Output thereafter:
      
      ```shell
      $ ./spark-ec2 --version
      spark-ec2 1.2.1
      ```
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #4919 from nchammas/setup-ec2-libs and squashes the following commits:
      
      a077955 [Nicholas Chammas] print default region
      c95fb7d [Nicholas Chammas] to docstring
      5448845 [Nicholas Chammas] remove libs added for demo purposes
      60d8c23 [Nicholas Chammas] generalize ability to download libs
      d14df06c
• [SPARK-6087][CORE] Provide actionable exception if Kryo buffer is not large enough · c4c4b07b
      Lev Khomich authored
      A simple try-catch wrapping KryoException to be more informative.
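
The pattern, as a hedged sketch (the wrapper name is hypothetical; the config key `spark.kryoserializer.buffer.max.mb` is the real knob to turn):

```scala
import com.esotericsoftware.kryo.KryoException
import org.apache.spark.SparkException

// Hypothetical helper: rethrow Kryo buffer overflows with an actionable hint.
def withBufferHint[T](serialize: => T): T =
  try serialize
  catch {
    case e: KryoException if e.getMessage.startsWith("Buffer overflow") =>
      throw new SparkException(
        s"Kryo serialization failed: ${e.getMessage}. " +
          "To avoid this, increase spark.kryoserializer.buffer.max.mb.", e)
  }
```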
      
      Author: Lev Khomich <levkhomich@gmail.com>
      
      Closes #4947 from levkhomich/master and squashes the following commits:
      
      0f7a947 [Lev Khomich] [SPARK-6087][CORE] Provide actionable exception if Kryo buffer is not large enough
      c4c4b07b
• [SPARK-6177][MLlib] Add a note to the LDA example about coalescing partitions · 9a0272fb
      Yuhao Yang authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-6177
Add a comment introducing coalesce in the LDA example, to avoid the potentially massive number of partitions created by `sc.textFile`.

`sc.textFile` creates an RDD with one partition per file, and a massive number of partitions degrades LDA performance.
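
The suggested adjustment, sketched (the input path and target partition count are placeholders):

```scala
// One partition per input file can mean thousands of tiny partitions;
// coalesce to something proportional to the cluster before running LDA.
val corpus = sc.textFile("hdfs:///path/to/docs")
  .coalesce(sc.defaultParallelism)
```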
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #4899 from hhbyyh/adjustPartition and squashes the following commits:
      
      a499630 [Yuhao Yang] update comment
      9a2d7b6 [Yuhao Yang] move to comment
      f7fd5d4 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into adjustPartition
      26a564a [Yuhao Yang] add coalesce to LDAExample
      9a0272fb
  4. Mar 09, 2015
• [SPARK-6194] [SPARK-677] [PySpark] fix memory leak in collect() · 8767565c
      Davies Liu authored
Because of a circular reference between JavaObject and JavaMember, a Java object cannot be released until Python's GC kicks in, which causes a memory leak in collect() that may consume lots of memory in the JVM.

This PR changes the way collected data is sent back into Python, from a local file to a socket, which avoids any disk IO during collect and avoids holding any referrers to the Java object in Python.
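
The transfer pattern, reduced to a rough standalone sketch (the payload stands in for the serialized partition data):

```scala
import java.net.{InetAddress, ServerSocket}

// Serve the collected bytes over an ephemeral localhost socket instead of a
// temp file: no disk IO, and nothing lingers once the client disconnects.
val payload: Array[Byte] = Array[Byte](1, 2, 3)
val server = new ServerSocket(0, 1, InetAddress.getByName("localhost"))
new Thread("serve-collect") {
  override def run(): Unit = {
    val sock = server.accept()
    try sock.getOutputStream.write(payload)
    finally { sock.close(); server.close() }
  }
}.start()
println(s"Python side connects to 127.0.0.1:${server.getLocalPort}")
```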
      
      cc JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4923 from davies/fix_collect and squashes the following commits:
      
      d730286 [Davies Liu] address comments
      24c92a4 [Davies Liu] fix style
      ba54614 [Davies Liu] use socket to transfer data from JVM
      9517c8f [Davies Liu] fix memory leak in collect()
      8767565c
• [SPARK-5310][Doc] Update SQL Programming Guide to include DataFrames. · 3cac1991
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4954 from rxin/df-docs and squashes the following commits:
      
      c592c70 [Reynold Xin] [SPARK-5310][Doc] Update SQL Programming Guide to include DataFrames.
      3cac1991
• [Docs] Replace references to SchemaRDD with DataFrame · 70f88148
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4952 from rxin/schemardd-df-reference and squashes the following commits:
      
      b2b1dbe [Reynold Xin] [Docs] Replace references to SchemaRDD with DataFrame
      70f88148
• [EC2] [SPARK-6188] Instance types can be mislabeled when re-starting cluster with default arguments · f7c79920
      Theodore Vasiloudis authored
      As described in https://issues.apache.org/jira/browse/SPARK-6188 and discovered in https://issues.apache.org/jira/browse/SPARK-5838.
      
When re-starting a cluster, if the user does not provide the instance types (which is currently the recommended behavior in the docs), the instances will be assigned the default type, m1.large. This then affects the setup of the machines.

This PR solves the problem by getting the instance types from the existing instances and overwriting the default options.
      
      EDIT: Further clarification of the issue:
      
In short, while the instances themselves are the same as launched, their setup is done assuming the default instance type, m1.large.

This means the machines are assumed to have 2 disks, which leads to the problems described in issue [5838](https://issues.apache.org/jira/browse/SPARK-5838): machines that have only one disk end up writing shuffle spills to the small (8GB) snapshot partition, which quickly fills up and results in failing jobs due to "No space left on device" errors.
      
      Other instance specific settings that are set in the spark_ec2.py script are likely to be wrong as well.
      
      Author: Theodore Vasiloudis <thvasilo@users.noreply.github.com>
      Author: Theodore Vasiloudis <tvas@sics.se>
      
      Closes #4916 from thvasilo/SPARK-6188]-Instance-types-can-be-mislabeled-when-re-starting-cluster-with-default-arguments and squashes the following commits:
      
      6705b98 [Theodore Vasiloudis] Added comment to clarify setting master instance type to the empty string.
      a3d29fe [Theodore Vasiloudis] More trailing whitespace
      7b32429 [Theodore Vasiloudis] Removed trailing whitespace
      3ebd52a [Theodore Vasiloudis] Make sure that the instance type is correct when relaunching a cluster.
      f7c79920
  5. Mar 08, 2015
• [GraphX] Improve LiveJournalPageRank example · 55b1b32d
      Jacky Li authored
      1. Removed unnecessary import
2. Modified the usage printout, since the user must specify the --numEPart parameter, which is required by Analytics.main
      
      Author: Jacky Li <jacky.likun@huawei.com>
      
      Closes #4917 from jackylk/import and squashes the following commits:
      
      6c07682 [Jacky Li] fix comment
      c0df8f2 [Jacky Li] fix scalastyle
      b6235e6 [Jacky Li] fix for comment
      87be83b [Jacky Li] remove default value description
      5caae76 [Jacky Li] remove import and modify usage
      55b1b32d
• SPARK-6205 [CORE] UISeleniumSuite fails for Hadoop 2.x test with NoClassDefFoundError · f16b7b03
Sean Owen authored
Add xml-apis to core test deps to work around the UISeleniumSuite classpath issue.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4933 from srowen/SPARK-6205 and squashes the following commits:
      
ddd4d32 [Sean Owen] Add xml-apis to core test deps to work around UISeleniumSuite classpath issue
      f16b7b03
• [SPARK-6193] [EC2] Push group filter up to EC2 · 52ed7da1
      Nicholas Chammas authored
      When looking for a cluster, spark-ec2 currently pulls down [info for all instances](https://github.com/apache/spark/blob/eb48fd6e9d55fb034c00e61374bb9c2a86a82fb8/ec2/spark_ec2.py#L620) and filters locally. When working on an AWS account with hundreds of active instances, this step alone can take over 10 seconds.
      
      This PR improves how spark-ec2 searches for clusters by pushing the filter up to EC2.
      
      Basically, the problem (and solution) look like this:
      
      ```python
      >>> timeit.timeit('blah = conn.get_all_reservations()', setup='from __main__ import conn', number=10)
      116.96390509605408
      >>> timeit.timeit('blah = conn.get_all_reservations(filters={"instance.group-name": ["my-cluster-master"]})', setup='from __main__ import conn', number=10)
      4.629754066467285
      ```
      
      Translated to a user-visible action, this looks like (against an AWS account with ~200 active instances):
      
      ```shell
      # master
      $ python -m timeit -n 3 --setup 'import subprocess' 'subprocess.call("./spark-ec2 get-master my-cluster --region us-west-2", shell=True)'
      ...
      3 loops, best of 3: 9.83 sec per loop
      
      # this PR
      $ python -m timeit -n 3 --setup 'import subprocess' 'subprocess.call("./spark-ec2 get-master my-cluster --region us-west-2", shell=True)'
      ...
      3 loops, best of 3: 1.47 sec per loop
      ```
      
      This PR also refactors `get_existing_cluster()` to make it, I hope, simpler.
      
      Finally, this PR fixes some minor grammar issues related to printing status to the user. :tophat: :clap:
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #4922 from nchammas/get-existing-cluster-faster and squashes the following commits:
      
      18802f1 [Nicholas Chammas] ignore shutting-down
      f2a5b9f [Nicholas Chammas] fix grammar
      d96a489 [Nicholas Chammas] push group filter up to EC2
      52ed7da1
  6. Mar 07, 2015
• [SPARK-5641] [EC2] Allow spark_ec2.py to copy arbitrary files to cluster · 334c5bd1
      Florian Verhein authored
      Give users an easy way to rcp a directory structure to the master's / as part of the cluster launch, at a useful point in the workflow (before setup.sh is called on the master).
      
      This is an alternative approach to meeting requirements discussed in https://github.com/apache/spark/pull/4487
      
      Author: Florian Verhein <florian.verhein@gmail.com>
      
      Closes #4583 from florianverhein/master and squashes the following commits:
      
      49dee88 [Florian Verhein] removed addition of trailing / in rsync to give user this option, added documentation in help
      7b8e3d8 [Florian Verhein] remove unused args
      87d922c [Florian Verhein] [SPARK-5641] [EC2] implement --deploy-root-dir
      334c5bd1
• [Minor] Fix the wrong description · 729c05bd
      WangTaoTheTonic authored
Found it by accident. I'm not going to file a JIRA for this, as it is a very tiny fix.
      
      Author: WangTaoTheTonic <wangtao111@huawei.com>
      
      Closes #4936 from WangTaoTheTonic/wrongdesc and squashes the following commits:
      
      fb8a8ec [WangTaoTheTonic] fix the wrong description
      aca5596 [WangTaoTheTonic] fix the wrong description
      729c05bd
• [EC2] Reorder print statements on termination · 2646794f
      Nicholas Chammas authored
      The PR reorders some print statements slightly on cluster termination so that they read better.
      
      For example, from this:
      
      ```
      Are you sure you want to destroy the cluster spark-cluster-test?
      The following instances will be terminated:
      Searching for existing cluster spark-cluster-test in region us-west-2...
      Found 1 master(s), 2 slaves
      > ...
      ALL DATA ON ALL NODES WILL BE LOST!!
      Destroy cluster spark-cluster-test (y/N):
      ```
      
      To this:
      
      ```
      Searching for existing cluster spark-cluster-test in region us-west-2...
      Found 1 master(s), 2 slaves
      The following instances will be terminated:
      > ...
      ALL DATA ON ALL NODES WILL BE LOST!!
      Are you sure you want to destroy the cluster spark-cluster-test? (y/N)
      ```
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #4932 from nchammas/termination-print-order and squashes the following commits:
      
      c23711d [Nicholas Chammas] reorder prints on termination
      2646794f
  7. Mar 06, 2015
• Fix python typo (+ Scala, Java typos) · 48a723c9
      RobertZK authored
      Author: RobertZK <technoguyrob@gmail.com>
      Author: Robert Krzyzanowski <technoguyrob@gmail.com>
      
      Closes #4840 from robertzk/patch-1 and squashes the following commits:
      
      d286215 [RobertZK] lambda fix per @laserson
      5937989 [Robert Krzyzanowski] Fix python typo
      48a723c9
• [SPARK-6178][Shuffle] Removed unused imports · dba0b2ea
Vinod K C authored
Author: Vinod K C <vinod.kc@huawei.com>
      
      Closes #4900 from vinodkc/unused_imports and squashes the following commits:
      
      5373456 [Vinod K C] Removed empty lines
      9da7438 [Vinod K C] Changed order of import
      594d471 [Vinod K C] Removed unused imports
      dba0b2ea
• [Minor] Resolve sbt warnings: postfix operator second should be enabled · 05cb6b34
      GuoQiang Li authored
      Resolve sbt warnings:
      
      ```
      [warn] spark/streaming/src/main/scala/org/apache/spark/streaming/util/WriteAheadLogManager.scala:155: postfix operator second should be enabled
      [warn] by making the implicit value scala.language.postfixOps visible.
      [warn] This can be achieved by adding the import clause 'import scala.language.postfixOps'
      [warn] or by setting the compiler option -language:postfixOps.
      [warn] See the Scala docs for value scala.language.postfixOps for a discussion
      [warn] why the feature should be explicitly enabled.
      [warn]         Await.ready(f, 1 second)
      [warn]                          ^
      ```
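
The fix amounts to bringing the language feature into scope; a minimal sketch:

```scala
import scala.concurrent.duration._
import scala.language.postfixOps

// With postfixOps visible, the postfix form compiles without the warning:
val timeout: FiniteDuration = 1 second
```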
      
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #4908 from witgo/sbt_warnings and squashes the following commits:
      
      0629af4 [GuoQiang Li] Resolve sbt warnings: postfix operator second should be enabled
      05cb6b34
• [core] [minor] Don't pollute source directory when running UtilsSuite. · cd7594ca
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #4921 from vanzin/utils-suite and squashes the following commits:
      
      7795dd4 [Marcelo Vanzin] [core] [minor] Don't pollute source directory when running UtilsSuite.
      cd7594ca
• [CORE, DEPLOY][minor] align arguments order with docs of worker · d8b3da9d
      Zhang, Liye authored
The help message for starting `worker` is `Usage: Worker [options] <master>`, but the argument order in `start-slaves.sh` does not align with that, which is confusing at first glance.
      
      Author: Zhang, Liye <liye.zhang@intel.com>
      
      Closes #4924 from liyezhang556520/startSlaves and squashes the following commits:
      
      7fd5deb [Zhang, Liye] align arguments order with docs of worker
      d8b3da9d