  1. Mar 11, 2015
    • [SPARK-4924] Add a library for launching Spark jobs programmatically. · 517975d8
      Marcelo Vanzin authored
      This change encapsulates all the logic involved in launching a Spark job
      into a small Java library that can be easily embedded into other applications.
      
      The overall goal of this change is twofold, as described in the bug:
      
      - Provide a public API for launching Spark processes. This is a common request
        from users and currently there's no good answer for it.
      
      - Remove a lot of the duplicated code and other coupling that exists in the
        different parts of Spark that deal with launching processes.
      
      A lot of the duplication was due to different code needed to build an
      application's classpath (and the bootstrapper needed to run the driver in
      certain situations), and also different code needed to parse spark-submit
      command line options in different contexts. The change centralizes those
      as much as possible so that all code paths can rely on the library for
      handling those appropriately.
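
      A minimal usage sketch of the new library from Scala (the jar path and main class below are placeholders, not from this change):

      ```scala
      import org.apache.spark.launcher.SparkLauncher

      // Launch a Spark application as a child process and wait for it to exit.
      // "/path/to/app.jar" and "my.example.App" are hypothetical.
      object LaunchExample extends App {
        val process = new SparkLauncher()
          .setAppResource("/path/to/app.jar")
          .setMainClass("my.example.App")
          .setMaster("local[2]")
          .setConf(SparkLauncher.DRIVER_MEMORY, "1g")
          .launch()
        process.waitFor()
      }
      ```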
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #3916 from vanzin/SPARK-4924 and squashes the following commits:
      
      18c7e4d [Marcelo Vanzin] Fix make-distribution.sh.
      2ce741f [Marcelo Vanzin] Add lots of quotes.
      3b28a75 [Marcelo Vanzin] Update new pom.
      a1b8af1 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      897141f [Marcelo Vanzin] Review feedback.
      e2367d2 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      28cd35e [Marcelo Vanzin] Remove stale comment.
      b1d86b0 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      00505f9 [Marcelo Vanzin] Add blurb about new API in the programming guide.
      5f4ddcc [Marcelo Vanzin] Better usage messages.
      92a9cfb [Marcelo Vanzin] Fix Win32 launcher, usage.
      6184c07 [Marcelo Vanzin] Rename field.
      4c19196 [Marcelo Vanzin] Update comment.
      7e66c18 [Marcelo Vanzin] Fix pyspark tests.
      0031a8e [Marcelo Vanzin] Review feedback.
      c12d84b [Marcelo Vanzin] Review feedback. And fix spark-submit on Windows.
      e2d4d71 [Marcelo Vanzin] Simplify some code used to launch pyspark.
      43008a7 [Marcelo Vanzin] Don't make builder extend SparkLauncher.
      b4d6912 [Marcelo Vanzin] Use spark-submit script in SparkLauncher.
      28b1434 [Marcelo Vanzin] Add a comment.
      304333a [Marcelo Vanzin] Fix propagation of properties file arg.
      bb67b93 [Marcelo Vanzin] Remove unrelated Yarn change (that is also wrong).
      8ec0243 [Marcelo Vanzin] Add missing newline.
      95ddfa8 [Marcelo Vanzin] Fix handling of --help for spark-class command builder.
      72da7ec [Marcelo Vanzin] Rename SparkClassLauncher.
      62978e4 [Marcelo Vanzin] Minor cleanup of Windows code path.
      9cd5b44 [Marcelo Vanzin] Make all non-public APIs package-private.
      e4c80b6 [Marcelo Vanzin] Reorganize the code so that only SparkLauncher is public.
      e50dc5e [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      de81da2 [Marcelo Vanzin] Fix CommandUtils.
      86a87bf [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      2061967 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      46d46da [Marcelo Vanzin] Clean up a test and make it more future-proof.
      b93692a [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      ad03c48 [Marcelo Vanzin] Revert "Fix a thread-safety issue in "local" mode."
      0b509d0 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      23aa2a9 [Marcelo Vanzin] Read java-opts from conf dir, not spark home.
      7cff919 [Marcelo Vanzin] Javadoc updates.
      eae4d8e [Marcelo Vanzin] Fix new unit tests on Windows.
      e570fb5 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      44cd5f7 [Marcelo Vanzin] Add package-info.java, clean up javadocs.
      f7cacff [Marcelo Vanzin] Remove "launch Spark in new thread" feature.
      7ed8859 [Marcelo Vanzin] Some more feedback.
      54cd4fd [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      61919df [Marcelo Vanzin] Clean leftover debug statement.
      aae5897 [Marcelo Vanzin] Use launcher classes instead of jars in non-release mode.
      e584fc3 [Marcelo Vanzin] Rework command building a little bit.
      525ef5b [Marcelo Vanzin] Rework Unix spark-class to handle argument with newlines.
      8ac4e92 [Marcelo Vanzin] Minor test cleanup.
      e946a99 [Marcelo Vanzin] Merge PySparkLauncher into SparkSubmitCliLauncher.
      c617539 [Marcelo Vanzin] Review feedback round 1.
      fc6a3e2 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      f26556b [Marcelo Vanzin] Fix a thread-safety issue in "local" mode.
      2f4e8b4 [Marcelo Vanzin] Changes needed to make this work with SPARK-4048.
      799fc20 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      bb5d324 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      53faef1 [Marcelo Vanzin] Merge branch 'master' into SPARK-4924
      a7936ef [Marcelo Vanzin] Fix pyspark tests.
      656374e [Marcelo Vanzin] Mima fixes.
      4d511e7 [Marcelo Vanzin] Fix tools search code.
      7a01e4a [Marcelo Vanzin] Fix pyspark on Yarn.
      1b3f6e9 [Marcelo Vanzin] Call SparkSubmit from spark-class launcher for unknown classes.
      25c5ae6 [Marcelo Vanzin] Centralize SparkSubmit command line parsing.
      27be98a [Marcelo Vanzin] Modify Spark to use launcher lib.
      6f70eea [Marcelo Vanzin] [SPARK-4924] Add a library for launching Spark jobs programmatically.
      517975d8
    • [SPARK-5986][MLLib] Add save/load for k-means · 2d4e00ef
      Xusen Yin authored
      This PR adds save/load for k-means as described in SPARK-5986. A Python version will be added in another PR.
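
      A short usage sketch in Scala (paths are placeholders; `sc` is an existing SparkContext):

      ```scala
      import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
      import org.apache.spark.mllib.linalg.Vectors

      // Train, persist, and reload a k-means model.
      val data = sc.parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(9.0, 9.0)))
      val model = KMeans.train(data, k = 2, maxIterations = 10)
      model.save(sc, "/tmp/kmeans-model")
      val sameModel = KMeansModel.load(sc, "/tmp/kmeans-model")
      ```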
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #4951 from yinxusen/SPARK-5986 and squashes the following commits:
      
      6dd74a0 [Xusen Yin] rewrite some functions and classes
      cd390fd [Xusen Yin] add indexed point
      b144216 [Xusen Yin] remove invalid comments
      dce7055 [Xusen Yin] add save/load for k-means for SPARK-5986
      2d4e00ef
  2. Mar 10, 2015
    • [SPARK-5183][SQL] Update SQL Docs with JDBC and Migration Guide · 26723741
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4958 from marmbrus/sqlDocs and squashes the following commits:
      
      9351dbc [Michael Armbrust] fix parquet example
      6877e13 [Michael Armbrust] add sql examples
      d81b7e7 [Michael Armbrust] rxins comments
      e393528 [Michael Armbrust] fix order
      19c2735 [Michael Armbrust] more on data source load/store
      00d5914 [Michael Armbrust] Update SQL Docs with JDBC and Migration Guide
      26723741
    • Minor doc: Remove the extra blank line in data types javadoc. · 74fb4337
      Reynold Xin authored
      The extra blank line is preventing the first lines from showing up in the package summary page.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4955 from rxin/datatype-docs and squashes the following commits:
      
      1621114 [Reynold Xin] Minor doc: Remove the extra blank line in data types javadoc.
      74fb4337
    • [SPARK-6186] [EC2] Make Tachyon version configurable in EC2 deployment script · 7c7d2d5e
      cheng chang authored
      This PR comes from Tachyon community to solve the issue:
      https://tachyon.atlassian.net/browse/TACHYON-11
      
      An accompanying PR is in mesos/spark-ec2:
      https://github.com/mesos/spark-ec2/pull/101
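
      A hypothetical invocation using the new flag (the key pair, identity file, and version value are illustrative):

      ```shell
      ./spark-ec2 --key-pair=awskey --identity-file=awskey.pem \
          --tachyon-version=0.5.0 launch my-cluster
      ```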
      
      Author: cheng chang <myairia@gmail.com>
      
      Closes #4901 from uronce-cc/master and squashes the following commits:
      
      313aa36 [cheng chang] minor re-wording
      fd2a48e [cheng chang] Remove Tachyon when deploying through git hash
      1d53c5c [cheng chang] add default value to --tachyon-version
      6f8887e [cheng chang] make tachyon version configurable
      7c7d2d5e
    • [SPARK-6191] [EC2] Generalize ability to download libs · d14df06c
      Nicholas Chammas authored
      Right now we have a method to specifically download boto. This PR generalizes it so it's easy to download additional libraries if we want.
      
      For example, adding new external libraries for spark-ec2 is now as simple as:
      
      ```python
      external_libs = [
          {
              "name": "boto",
              "version": "2.34.0",
              "md5": "5556223d2d0cc4d06dd4829e671dcecd"
          },
          {
              "name": "PyYAML",
              "version": "3.11",
              "md5": "f50e08ef0fe55178479d3a618efe21db"
          },
          {
              "name": "argparse",
              "version": "1.3.0",
              "md5": "9bcf7f612190885c8c85e30ba41db3c7"
          }
      ]
      ```
      Likely use cases:
      * Downloading PyYAML to allow spark-ec2 configs to be persisted as a YAML file. ([SPARK-925](https://issues.apache.org/jira/browse/SPARK-925))
      * Downloading argparse to clean up / modernize our option parsing.
      
      First run output, with PyYAML and argparse added just for demonstration purposes:
      
      ```shell
      $ ./spark-ec2 --version
      Downloading external libraries that spark-ec2 needs from PyPI to /path/to/spark/ec2/lib...
      This should be a one-time operation.
       - Downloading boto...
       - Finished downloading boto.
       - Downloading PyYAML...
       - Finished downloading PyYAML.
       - Downloading argparse...
       - Finished downloading argparse.
      spark-ec2 1.2.1
      ```
      
      Output thereafter:
      
      ```shell
      $ ./spark-ec2 --version
      spark-ec2 1.2.1
      ```
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #4919 from nchammas/setup-ec2-libs and squashes the following commits:
      
      a077955 [Nicholas Chammas] print default region
      c95fb7d [Nicholas Chammas] to docstring
      5448845 [Nicholas Chammas] remove libs added for demo purposes
      60d8c23 [Nicholas Chammas] generalize ability to download libs
      d14df06c
    • [SPARK-6087][CORE] Provide actionable exception if Kryo buffer is not large enough · c4c4b07b
      Lev Khomich authored
      A simple try-catch that wraps KryoException to produce a more informative error.
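
      A sketch of the pattern (assumed shape, not the exact patch): catch Kryo's buffer overflow and rethrow with a pointer to the setting that fixes it.

      ```scala
      import com.esotericsoftware.kryo.{Kryo, KryoException}
      import com.esotericsoftware.kryo.io.Output
      import org.apache.spark.SparkException

      // Turn Kryo's opaque buffer-overflow error into an actionable one.
      def serialize(kryo: Kryo, output: Output, obj: AnyRef): Unit =
        try {
          kryo.writeClassAndObject(output, obj)
        } catch {
          case e: KryoException if e.getMessage.startsWith("Buffer overflow") =>
            throw new SparkException("Kryo serialization failed: " + e.getMessage +
              ". To avoid this, increase the spark.kryoserializer.buffer.max.mb value.", e)
        }
      ```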
      
      Author: Lev Khomich <levkhomich@gmail.com>
      
      Closes #4947 from levkhomich/master and squashes the following commits:
      
      0f7a947 [Lev Khomich] [SPARK-6087][CORE] Provide actionable exception if Kryo buffer is not large enough
      c4c4b07b
    • [SPARK-6177][MLlib] Add note in LDA example to suggest a possible coalesce · 9a0272fb
      Yuhao Yang authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-6177
      Add a comment introducing coalesce to the LDA example, to avoid the potentially massive number of partitions created by `sc.textFile`.

      `sc.textFile` creates an RDD with one partition per file, and a huge number of partitions degrades LDA performance.
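
      The pattern the comment recommends, sketched (the path and partition count are placeholders):

      ```scala
      // Each small file becomes its own partition; coalesce before running LDA.
      val corpus = sc.textFile("hdfs:///data/docs").coalesce(16, shuffle = false)
      ```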
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #4899 from hhbyyh/adjustPartition and squashes the following commits:
      
      a499630 [Yuhao Yang] update comment
      9a2d7b6 [Yuhao Yang] move to comment
      f7fd5d4 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into adjustPartition
      26a564a [Yuhao Yang] add coalesce to LDAExample
      9a0272fb
  3. Mar 09, 2015
    • [SPARK-6194] [SPARK-677] [PySpark] fix memory leak in collect() · 8767565c
      Davies Liu authored
      Because of a circular reference between JavaObject and JavaMember, a Java object cannot be released until Python GC kicks in, which causes a memory leak in collect() that may consume lots of memory in the JVM.

      This PR changes the way we send collected data back into Python from a local file to a socket, which avoids any disk I/O during collect and avoids holding any references to Java objects in Python.
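
      A toy, self-contained sketch of the idea (not the actual Spark code): one thread stands in for the JVM writing bytes to an ephemeral local socket, and the main thread stands in for Python draining it, with no temp file and no lingering Java references involved.

      ```python
      import socket
      import threading

      def serve(data, server_sock):
          # "JVM side": write the serialized partitions and close.
          conn, _ = server_sock.accept()
          conn.sendall(data)
          conn.close()

      server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      server.bind(("127.0.0.1", 0))  # ephemeral port
      server.listen(1)
      port = server.getsockname()[1]
      threading.Thread(target=serve, args=(b"collected-bytes", server)).start()

      # "Python side": connect and drain the socket.
      client = socket.create_connection(("127.0.0.1", port))
      chunks = []
      while True:
          chunk = client.recv(65536)
          if not chunk:
              break
          chunks.append(chunk)
      client.close()
      print(b"".join(chunks))
      ```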
      
      cc JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4923 from davies/fix_collect and squashes the following commits:
      
      d730286 [Davies Liu] address comments
      24c92a4 [Davies Liu] fix style
      ba54614 [Davies Liu] use socket to transfer data from JVM
      9517c8f [Davies Liu] fix memory leak in collect()
      8767565c
    • [SPARK-5310][Doc] Update SQL Programming Guide to include DataFrames. · 3cac1991
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4954 from rxin/df-docs and squashes the following commits:
      
      c592c70 [Reynold Xin] [SPARK-5310][Doc] Update SQL Programming Guide to include DataFrames.
      3cac1991
    • [Docs] Replace references to SchemaRDD with DataFrame · 70f88148
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4952 from rxin/schemardd-df-reference and squashes the following commits:
      
      b2b1dbe [Reynold Xin] [Docs] Replace references to SchemaRDD with DataFrame
      70f88148
    • [EC2] [SPARK-6188] Instance types can be mislabeled when re-starting cluster with default arguments · f7c79920
      Theodore Vasiloudis authored
      As described in https://issues.apache.org/jira/browse/SPARK-6188 and discovered in https://issues.apache.org/jira/browse/SPARK-5838.
      
      When re-starting a cluster, if the user does not provide the instance types (which is currently the recommended behavior in the docs), the instances will be assigned the default type, m1.large. This then affects the setup of the machines.
      
      This PR solves the problem by getting the instance types from the existing instances and overwriting the default options.
      
      EDIT: Further clarification of the issue:
      
      In short, while the instances themselves are the same as launched, their setup is done assuming the default instance type, m1.large.
      
      This means that the machines are assumed to have 2 disks, which leads to the problems described in issue [5838](https://issues.apache.org/jira/browse/SPARK-5838): machines that have one disk end up having shuffle spills in the small (8GB) snapshot partition, which quickly fills up and results in failing jobs due to "No space left on device" errors.
      
      Other instance specific settings that are set in the spark_ec2.py script are likely to be wrong as well.
      
      Author: Theodore Vasiloudis <thvasilo@users.noreply.github.com>
      Author: Theodore Vasiloudis <tvas@sics.se>
      
      Closes #4916 from thvasilo/SPARK-6188]-Instance-types-can-be-mislabeled-when-re-starting-cluster-with-default-arguments and squashes the following commits:
      
      6705b98 [Theodore Vasiloudis] Added comment to clarify setting master instance type to the empty string.
      a3d29fe [Theodore Vasiloudis] More trailing whitespace
      7b32429 [Theodore Vasiloudis] Removed trailing whitespace
      3ebd52a [Theodore Vasiloudis] Make sure that the instance type is correct when relaunching a cluster.
      f7c79920
  4. Mar 08, 2015
    • [GraphX] Improve LiveJournalPageRank example · 55b1b32d
      Jacky Li authored
      1. Removed unnecessary import
      2. Modified usage print since user must specify the --numEPart parameter as it is required in Analytics.main
      
      Author: Jacky Li <jacky.likun@huawei.com>
      
      Closes #4917 from jackylk/import and squashes the following commits:
      
      6c07682 [Jacky Li] fix comment
      c0df8f2 [Jacky Li] fix scalastyle
      b6235e6 [Jacky Li] fix for comment
      87be83b [Jacky Li] remove default value description
      5caae76 [Jacky Li] remove import and modify usage
      55b1b32d
    • SPARK-6205 [CORE] UISeleniumSuite fails for Hadoop 2.x test with NoClassDefFoundError · f16b7b03
      Sean Owen authored
      Add xml-apis to core test deps to work around the UISeleniumSuite classpath issue
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4933 from srowen/SPARK-6205 and squashes the following commits:
      
      ddd4d32 [Sean Owen] Add xml-apis to core test deps to work around UISeleniumSuite classpath issue
      f16b7b03
    • [SPARK-6193] [EC2] Push group filter up to EC2 · 52ed7da1
      Nicholas Chammas authored
      When looking for a cluster, spark-ec2 currently pulls down [info for all instances](https://github.com/apache/spark/blob/eb48fd6e9d55fb034c00e61374bb9c2a86a82fb8/ec2/spark_ec2.py#L620) and filters locally. When working on an AWS account with hundreds of active instances, this step alone can take over 10 seconds.
      
      This PR improves how spark-ec2 searches for clusters by pushing the filter up to EC2.
      
      Basically, the problem (and solution) look like this:
      
      ```python
      >>> timeit.timeit('blah = conn.get_all_reservations()', setup='from __main__ import conn', number=10)
      116.96390509605408
      >>> timeit.timeit('blah = conn.get_all_reservations(filters={"instance.group-name": ["my-cluster-master"]})', setup='from __main__ import conn', number=10)
      4.629754066467285
      ```
      
      Translated to a user-visible action, this looks like (against an AWS account with ~200 active instances):
      
      ```shell
      # master
      $ python -m timeit -n 3 --setup 'import subprocess' 'subprocess.call("./spark-ec2 get-master my-cluster --region us-west-2", shell=True)'
      ...
      3 loops, best of 3: 9.83 sec per loop
      
      # this PR
      $ python -m timeit -n 3 --setup 'import subprocess' 'subprocess.call("./spark-ec2 get-master my-cluster --region us-west-2", shell=True)'
      ...
      3 loops, best of 3: 1.47 sec per loop
      ```
      
      This PR also refactors `get_existing_cluster()` to make it, I hope, simpler.
      
      Finally, this PR fixes some minor grammar issues related to printing status to the user. :tophat: :clap:
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #4922 from nchammas/get-existing-cluster-faster and squashes the following commits:
      
      18802f1 [Nicholas Chammas] ignore shutting-down
      f2a5b9f [Nicholas Chammas] fix grammar
      d96a489 [Nicholas Chammas] push group filter up to EC2
      52ed7da1
  5. Mar 07, 2015
    • [SPARK-5641] [EC2] Allow spark_ec2.py to copy arbitrary files to cluster · 334c5bd1
      Florian Verhein authored
      Give users an easy way to copy (via rsync) a directory structure to the master's / as part of the cluster launch, at a useful point in the workflow (before setup.sh is called on the master).
      
      This is an alternative approach to meeting requirements discussed in https://github.com/apache/spark/pull/4487
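
      A hypothetical launch using the new option (key pair, identity file, and local path are placeholders):

      ```shell
      # Contents of /local/overlay are rsynced to the master's / before setup.sh runs.
      ./spark-ec2 --key-pair=awskey --identity-file=awskey.pem \
          --deploy-root-dir=/local/overlay launch my-cluster
      ```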
      
      Author: Florian Verhein <florian.verhein@gmail.com>
      
      Closes #4583 from florianverhein/master and squashes the following commits:
      
      49dee88 [Florian Verhein] removed addition of trailing / in rsync to give user this option, added documentation in help
      7b8e3d8 [Florian Verhein] remove unused args
      87d922c [Florian Verhein] [SPARK-5641] [EC2] implement --deploy-root-dir
      334c5bd1
    • [Minor] Fix the wrong description · 729c05bd
      WangTaoTheTonic authored
      Found it by accident. I'm not going to file a JIRA for this, as it is a very tiny fix.
      
      Author: WangTaoTheTonic <wangtao111@huawei.com>
      
      Closes #4936 from WangTaoTheTonic/wrongdesc and squashes the following commits:
      
      fb8a8ec [WangTaoTheTonic] fix the wrong description
      aca5596 [WangTaoTheTonic] fix the wrong description
      729c05bd
    • [EC2] Reorder print statements on termination · 2646794f
      Nicholas Chammas authored
      The PR reorders some print statements slightly on cluster termination so that they read better.
      
      For example, from this:
      
      ```
      Are you sure you want to destroy the cluster spark-cluster-test?
      The following instances will be terminated:
      Searching for existing cluster spark-cluster-test in region us-west-2...
      Found 1 master(s), 2 slaves
      > ...
      ALL DATA ON ALL NODES WILL BE LOST!!
      Destroy cluster spark-cluster-test (y/N):
      ```
      
      To this:
      
      ```
      Searching for existing cluster spark-cluster-test in region us-west-2...
      Found 1 master(s), 2 slaves
      The following instances will be terminated:
      > ...
      ALL DATA ON ALL NODES WILL BE LOST!!
      Are you sure you want to destroy the cluster spark-cluster-test? (y/N)
      ```
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #4932 from nchammas/termination-print-order and squashes the following commits:
      
      c23711d [Nicholas Chammas] reorder prints on termination
      2646794f
  6. Mar 06, 2015
    • Fix python typo (+ Scala, Java typos) · 48a723c9
      RobertZK authored
      Author: RobertZK <technoguyrob@gmail.com>
      Author: Robert Krzyzanowski <technoguyrob@gmail.com>
      
      Closes #4840 from robertzk/patch-1 and squashes the following commits:
      
      d286215 [RobertZK] lambda fix per @laserson
      5937989 [Robert Krzyzanowski] Fix python typo
      48a723c9
    • [SPARK-6178][Shuffle] Removed unused imports · dba0b2ea
      Vinod K C authored
      Author: Vinod K C <vinod.kc@huawei.com>
      
      Closes #4900 from vinodkc/unused_imports and squashes the following commits:
      
      5373456 [Vinod K C] Removed empty lines
      9da7438 [Vinod K C] Changed order of import
      594d471 [Vinod K C] Removed unused imports
      dba0b2ea
    • [Minor] Resolve sbt warnings: postfix operator second should be enabled · 05cb6b34
      GuoQiang Li authored
      Resolve sbt warnings:
      
      ```
      [warn] spark/streaming/src/main/scala/org/apache/spark/streaming/util/WriteAheadLogManager.scala:155: postfix operator second should be enabled
      [warn] by making the implicit value scala.language.postfixOps visible.
      [warn] This can be achieved by adding the import clause 'import scala.language.postfixOps'
      [warn] or by setting the compiler option -language:postfixOps.
      [warn] See the Scala docs for value scala.language.postfixOps for a discussion
      [warn] why the feature should be explicitly enabled.
      [warn]         Await.ready(f, 1 second)
      [warn]                          ^
      ```
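
      The fix the warning itself prescribes, in a self-contained form:

      ```scala
      import scala.concurrent.{Await, Future}
      import scala.concurrent.ExecutionContext.Implicits.global
      import scala.concurrent.duration._
      import scala.language.postfixOps  // enables the `1 second` postfix syntax

      object PostfixDemo extends App {
        val f = Future { 42 }
        Await.ready(f, 1 second)  // no longer triggers the warning
      }
      ```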
      
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #4908 from witgo/sbt_warnings and squashes the following commits:
      
      0629af4 [GuoQiang Li] Resolve sbt warnings: postfix operator second should be enabled
      05cb6b34
    • [core] [minor] Don't pollute source directory when running UtilsSuite. · cd7594ca
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #4921 from vanzin/utils-suite and squashes the following commits:
      
      7795dd4 [Marcelo Vanzin] [core] [minor] Don't pollute source directory when running UtilsSuite.
      cd7594ca
    • [CORE, DEPLOY][minor] align arguments order with docs of worker · d8b3da9d
      Zhang, Liye authored
      The help message for starting a `worker` is `Usage: Worker [options] <master>`, while in `start-slaves.sh` the arguments are not aligned with that format, which is confusing at first glance.
      
      Author: Zhang, Liye <liye.zhang@intel.com>
      
      Closes #4924 from liyezhang556520/startSlaves and squashes the following commits:
      
      7fd5deb [Zhang, Liye] align arguments order with docs of worker
      d8b3da9d
  7. Mar 05, 2015
    • [SQL] Make Strategies a public developer API · eb48fd6e
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4920 from marmbrus/openStrategies and squashes the following commits:
      
      cbc35c0 [Michael Armbrust] [SQL] Make Strategies a public developer API
      eb48fd6e
    • [SPARK-6163][SQL] jsonFile should be backed by the data source API · 1b4bb25c
      Yin Huai authored
      jira: https://issues.apache.org/jira/browse/SPARK-6163
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #4896 from yhuai/SPARK-6163 and squashes the following commits:
      
      45e023e [Yin Huai] Address @chenghao-intel's comment.
      2e8734e [Yin Huai] Use JSON data source for jsonFile.
      92a4a33 [Yin Huai] Test.
      1b4bb25c
    • [SPARK-6145][SQL] fix ORDER BY on nested fields · 5873c713
      Wenchen Fan authored
      Based on #4904 with style errors fixed.
      
      `LogicalPlan#resolve` will not only produce `Attribute`s, but also "`GetField` chains".
      So in `ResolveSortReferences`, after resolving the ordering expressions, we should collect not just the `Attribute` results, but also the `Attribute`s at the bottom of the "`GetField` chains".
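
      A minimal illustration of the query shape this fixes (Spark 1.3-era `SQLContext` API; the names are hypothetical):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.SQLContext

      case class Inner(c: Int)
      case class Record(a: String, b: Inner)

      object NestedOrderBy extends App {
        val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("demo"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // ORDER BY b.c resolves through a GetField chain to the underlying Attribute.
        val df = sc.parallelize(Seq(Record("x", Inner(2)), Record("y", Inner(1)))).toDF()
        df.registerTempTable("t")
        sqlContext.sql("SELECT a FROM t ORDER BY b.c").show()
      }
      ```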
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4918 from marmbrus/pr/4904 and squashes the following commits:
      
      997f84e [Michael Armbrust] fix style
      3eedbfc [Wenchen Fan] fix 6145
      5873c713
    • [SPARK-6175] Fix standalone executor log links when ephemeral ports or SPARK_PUBLIC_DNS are used · 424a86a1
      Josh Rosen authored
      This patch fixes two issues with the executor log viewing links added in Spark 1.3.  In standalone mode, the log URLs might include a port value of 0 rather than the actual bound port of the UI, which broke the ability to view logs from workers whose web UIs had been configured to bind to ephemeral ports.  In addition, the URLs used workers' local hostnames instead of respecting SPARK_PUBLIC_DNS, which prevented this feature from working properly on Spark EC2 clusters because the links would point to internal DNS names instead of external ones.
      
      I included tests for both of these bugs:
      
      - We now browse to the URLs and verify that they point to the expected pages.
      - To test SPARK_PUBLIC_DNS, I changed the code that reads the environment variable to do so via `SparkConf.getenv`, then used a custom SparkConf subclass to mock the environment variable (this pattern is used elsewhere in Spark's tests).
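
      A sketch of that mocking pattern (assumed shape; `getenv` is package-private on `SparkConf`, so the real test lives inside the `org.apache.spark` package tree):

      ```scala
      package org.apache.spark

      // Fake environment variables by overriding SparkConf.getenv, so a test can
      // set SPARK_PUBLIC_DNS without touching the real process environment.
      class MockEnvSparkConf(env: Map[String, String])
          extends SparkConf(loadDefaults = false) {
        override def getenv(name: String): String =
          env.getOrElse(name, super.getenv(name))
      }

      // val conf = new MockEnvSparkConf(Map("SPARK_PUBLIC_DNS" -> "public.example.com"))
      ```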
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #4903 from JoshRosen/SPARK-6175 and squashes the following commits:
      
      5577f41 [Josh Rosen] Remove println
      cfec135 [Josh Rosen] Use webUi.boundPort and publicAddress in log links
      27918c7 [Josh Rosen] Add failing unit tests for standalone log URL viewing
      c250fbe [Josh Rosen] Respect SparkConf in local-cluster Workers.
      422a2ef [Josh Rosen] Use conf.getenv to read SPARK_PUBLIC_DNS
      424a86a1
    • [SPARK-6090][MLLIB] add a basic BinaryClassificationMetrics to PySpark/MLlib · 0bfacd5c
      Xiangrui Meng authored
      A simple wrapper around the Scala implementation. `DataFrame` is used for serialization/deserialization. Methods that return `RDD`s are not supported in this PR.
      
      davies If we recognize Scala's `Product`s in Py4J, we can easily add wrappers for Scala methods that return `RDD[(Double, Double)]`. Is it easy to register a serializer for `Product` in PySpark?
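
      A minimal usage sketch of the new wrapper (toy data; API as added here):

      ```python
      from pyspark import SparkContext
      from pyspark.mllib.evaluation import BinaryClassificationMetrics

      sc = SparkContext("local", "metrics-demo")
      # RDD of (score, label) pairs.
      score_and_labels = sc.parallelize([(0.1, 0.0), (0.4, 0.0), (0.8, 1.0), (0.9, 1.0)])
      metrics = BinaryClassificationMetrics(score_and_labels)
      print(metrics.areaUnderROC)  # 1.0 for this perfectly separable toy data
      print(metrics.areaUnderPR)
      ```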
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4863 from mengxr/SPARK-6090 and squashes the following commits:
      
      009a3a3 [Xiangrui Meng] provide schema
      dcddab5 [Xiangrui Meng] add a basic BinaryClassificationMetrics to PySpark/MLlib
      0bfacd5c
    • SPARK-6182 [BUILD] spark-parent pom needs to be published for both 2.10 and 2.11 · c9cfba0c
      Sean Owen authored
      Option 1 of 2: Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4912 from srowen/SPARK-6182.1 and squashes the following commits:
      
      eff60de [Sean Owen] Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11
      c9cfba0c
    • [SPARK-6153] [SQL] promote guava dep for hive-thriftserver · e06c7dfb
      Daoyuan Wang authored
      For the thriftserver package, Guava is used at runtime.
      
      /cc pwendell
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #4884 from adrian-wang/test and squashes the following commits:
      
      4600ae7 [Daoyuan Wang] only promote for thriftserver
      44dda18 [Daoyuan Wang] promote guava dep for hive
      e06c7dfb
  8. Mar 04, 2015
    • SPARK-5143 [BUILD] [WIP] spark-network-yarn 2.11 depends on spark-network-shuffle 2.10 · 7ac072f7
      Sean Owen authored
      Update `<scala.binary.version>` prop in POM when switching between Scala 2.10/2.11
      
      ScrapCodes for review. This `sed` command is supposed to just replace the first occurrence, but it replaces them all. Are you more of a `sed` wizard than I? It may be a GNU/BSD thing that is throwing me off. Really, just the first instance should be replaced, hence the `[WIP]`.
      
      NB on OS X the original `sed` command here will create files like `pom.xml-e` through the source tree though it otherwise works. It's like `-e` is also the arg to `-i`. I couldn't get rid of that even with `-i""`. No biggie.
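
      For reference, one GNU-only way to replace just the first match (the `0,/regex/` address form is a GNU extension and won't work with BSD/OS X sed):

      ```shell
      sed -i '0,/<scala\.binary\.version>2\.10<\/scala\.binary\.version>/s//<scala.binary.version>2.11<\/scala.binary.version>/' pom.xml
      ```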
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4876 from srowen/SPARK-5143 and squashes the following commits:
      
      b060c44 [Sean Owen] Oops, fixed reversed version numbers!
      e875d4a [Sean Owen] Add note about non-GNU sed; fix new pom.xml update to work as intended on GNU sed
      703e1eb [Sean Owen] Update scala.binary.version prop in POM when switching between Scala 2.10/2.11
      7ac072f7
    • [SPARK-6149] [SQL] [Build] Excludes Guava 15 referenced by jackson-module-scala_2.10 · 1aa90e39
      Cheng Lian authored
      This PR excludes Guava 15.0 from the SBT build, to make Spark SQL CLI (`bin/spark-sql`) work when compiled against Hive 0.12.0.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4890 from liancheng/exclude-guava-15 and squashes the following commits:
      
      91ae9fa [Cheng Lian] Moves Guava 15 exclusion from SBT build to POM
      282bd2a [Cheng Lian] Excludes Guava 15 referenced by jackson-module-scala_2.10
      1aa90e39
    • [SPARK-6144] [core] Fix addFile when source files are on "hdfs:" · 3a35a0df
      Marcelo Vanzin authored
      The code failed in two modes: it complained when it tried to re-create a directory that already existed, and it was placing some files in the wrong parent directory. The patch fixes both issues.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      Author: trystanleftwich <trystan@atscale.com>
      
      Closes #4894 from vanzin/SPARK-6144 and squashes the following commits:
      
      100b3a1 [Marcelo Vanzin] Style fix.
      58266aa [Marcelo Vanzin] Fix fetchHcfs file for directories.
      91733b7 [trystanleftwich] [SPARK-6144]When in cluster mode using ADD JAR with a hdfs:// sourced jar will fail
      3a35a0df
    • [SPARK-6107][CORE] Display inprogress application information for event log... · f6773edc
      Zhang, Liye authored
      [SPARK-6107][CORE] Display inprogress application information for event log history for standalone mode
      
      When an application finishes running abnormally (Ctrl-C, for example), the history event log file still ends with the `.inprogress` suffix, and the application state cannot be shown on the web UI; users can only see "*Application history not found xxxx, Application xxx is still in progress*".

      For an application that did not finish normally, the history will show:
      ![image](https://cloud.githubusercontent.com/assets/4716022/6437137/184f9fc0-c0f5-11e4-88cc-a2eb087e4561.png)
      
      Author: Zhang, Liye <liye.zhang@intel.com>
      
      Closes #4848 from liyezhang556520/showLogInprogress and squashes the following commits:
      
      03589ac [Zhang, Liye] change inprogress to in progress
      b55f19f [Zhang, Liye] scala modify after rebase
      8aa66a2 [Zhang, Liye] use softer wording
      b030bd4 [Zhang, Liye] clean code
      79c8cb1 [Zhang, Liye] fix some mistakes
      11cdb68 [Zhang, Liye] add a missing space
      c29205b [Zhang, Liye] refine code according to sean owen's comments
      e9952a7 [Zhang, Liye] scala style fix again
      150502d [Zhang, Liye] scala style fix
      f11a5da [Zhang, Liye] small fix for file path
      22e878b [Zhang, Liye] enable in progress eventlog file
      f6773edc
    • [SPARK-6134][SQL] Fix wrong datatype for casting FloatType and default... · aef8a84e
      Liang-Chi Hsieh authored
      [SPARK-6134][SQL] Fix wrong datatype for casting FloatType and default LongType value in defaultPrimitive
      
      In `CodeGenerator`, the casting on `FloatType` should use `FloatType` instead of `IntegerType`.
      
      Besides, `defaultPrimitive` for `LongType` should be `-1L` instead of `1L`.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #4870 from viirya/codegen_type and squashes the following commits:
      
      76311dd [Liang-Chi Hsieh] Fix wrong datatype for casting on FloatType. Fix the wrong value for LongType in defaultPrimitive.
      aef8a84e
    • [SPARK-6136] [SQL] Removed JDBC integration tests which depend on docker-client · 76b472f1
      Cheng Lian authored
      Integration test suites in the JDBC data source (`MySQLIntegration` and `PostgresIntegration`) depend on docker-client 2.7.5, which transitively depends on Guava 17.0. Unfortunately, Guava 17.0 is causing test runtime binary compatibility issues when Spark is compiled against Hive 0.12.0, or Hadoop 2.4.
      
      Considering `MySQLIntegration` and `PostgresIntegration` are ignored right now, I'd suggest moving them from the Spark project to the [Spark integration tests] [1] project. This PR removes both the JDBC data source integration tests and the docker-client test dependency.
      
      [1]: https://github.com/databricks/spark-integration-tests
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4872 from liancheng/remove-docker-client and squashes the following commits:
      
      1f4169e [Cheng Lian] Removes DockerHacks
      159b24a [Cheng Lian] Removed JDBC integration tests which depend on docker-client
      76b472f1
    • [SPARK-3355][Core]: Allow running maven tests in run-tests · 418f38d9
      Brennon York authored
      Added an AMPLAB_JENKINS_BUILD_TOOL env variable to allow differentiation between maven and sbt build / test suites. The only issue I found with this is that, when running maven builds, I wasn't able to get individual package tests running without running a `mvn install` first. Not sure what Jenkins is doing wrt its env, but figured it's much better to just test everything than install packages in the "~/.m2/" directory and only test individual items, esp. if this is predominantly for the Jenkins build. Thoughts / comments would be great!
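
      A hypothetical invocation selecting the build tool for a test run:

      ```shell
      AMPLAB_JENKINS_BUILD_TOOL=maven ./dev/run-tests
      ```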
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #4734 from brennonyork/SPARK-3355 and squashes the following commits:
      
      c813d32 [Brennon York] changed mvn call from 'clean compile
      616ce30 [Brennon York] fixed merge conflicts
      3540de9 [Brennon York] added an AMPLAB_JENKINS_BUILD_TOOL env. variable to allow differentiation between maven and sbt build / test suites
      418f38d9
    • SPARK-6085 Increase default value for memory overhead · 8d3e2414
      tedyu authored
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #4836 from tedyu/master and squashes the following commits:
      
      d65b495 [tedyu] SPARK-6085 Increase default value for memory overhead
      1fdd4df [tedyu] SPARK-6085 Increase default value for memory overhead
      8d3e2414
    • [SPARK-6141][MLlib] Upgrade Breeze from 0.10 to 0.11 to fix convergence bug · 76e20a0a
      Xiangrui Meng authored
      LBFGS and OWLQN in Breeze 0.10 have a convergence-check bug.
      This is fixed in 0.11; see the description in the Breeze project for details:
      
      https://github.com/scalanlp/breeze/pull/373#issuecomment-76879760
      
      Author: Xiangrui Meng <meng@databricks.com>
      Author: DB Tsai <dbtsai@alpinenow.com>
      Author: DB Tsai <dbtsai@dbtsai.com>
      
      Closes #4879 from dbtsai/breeze and squashes the following commits:
      
      d848f65 [DB Tsai] Merge pull request #1 from mengxr/AlpineNow-breeze
      c2ca6ac [Xiangrui Meng] upgrade to breeze-0.11.1
      35c2f26 [Xiangrui Meng] fix LRSuite
      397a208 [DB Tsai] upgrade breeze
      76e20a0a
  9. Mar 03, 2015
    • [SPARK-6132][HOTFIX] ContextCleaner InterruptedException should be quiet · d334bfbc
      Andrew Or authored
      If the cleaner is stopped, we shouldn't print a huge stack trace when the cleaner thread is interrupted because we purposefully did this.
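
      The pattern, sketched (assumed shape, not the exact patch): swallow the interrupt only when it was caused by a deliberate stop.

      ```scala
      object CleanerSketch {
        @volatile private var stopped = false

        def stop(cleaner: Thread): Unit = { stopped = true; cleaner.interrupt() }

        def keepCleaning(): Unit =
          try {
            while (!stopped) Thread.sleep(100)  // stand-in for the real cleaning loop
          } catch {
            case _: InterruptedException if stopped =>  // deliberate stop: exit quietly
          }
      }
      ```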
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #4882 from andrewor14/cleaner-interrupt and squashes the following commits:
      
      8652120 [Andrew Or] Just a hot fix
      d334bfbc