  1. Apr 01, 2015
    • Kousuke Saruta's avatar
      [SPARK-6597][Minor] Replace `input:checkbox` with `input[type="checkbox"]` in additional-metrics.js · d824c11c
      Kousuke Saruta authored
      In additional-metrics.js, some selectors are written as `input:checkbox`, but jQuery's official documentation says `input[type="checkbox"]` is better.
      
      https://api.jquery.com/checkbox-selector/
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #5254 from sarutak/SPARK-6597 and squashes the following commits:
      
      a253bc4 [Kousuke Saruta] Replaced input:checkbox with input[type="checkbox"]
      d824c11c
    • Florian Verhein's avatar
      [EC2] [SPARK-6600] Open ports in ec2/spark_ec2.py to allow HDFS NFS gateway · 41226234
      Florian Verhein authored
      Authorizes incoming access to the master on the ports required to use the Hadoop HDFS NFS gateway from outside the cluster.
      
      Author: Florian Verhein <florian.verhein@gmail.com>
      
      Closes #5257 from florianverhein/master and squashes the following commits:
      
      72a586a [Florian Verhein] [EC2] [SPARK-6600] initial impl
      41226234
    • Ilya Ganelin's avatar
      [SPARK-4655][Core] Split Stage into ShuffleMapStage and ResultStage subclasses · ff1915e1
      Ilya Ganelin authored
      Hi all - this patch changes the Stage class to an abstract class and introduces two new classes that extend it: ShuffleMapStage and ResultStage - with the goal of increasing readability of the DAGScheduler class. Their usage is updated within DAGScheduler.
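
      For illustration, a minimal sketch of the shape of this split; the fields and members shown are simplified assumptions, not Spark's actual Stage internals:

      ```scala
      // Hypothetical, simplified shapes of the new class hierarchy.
      abstract class Stage(val id: Int, val name: String) {
        def isShuffleMap: Boolean
      }

      // Intermediate stage: produces map output for a downstream shuffle.
      class ShuffleMapStage(id: Int, name: String) extends Stage(id, name) {
        override def isShuffleMap: Boolean = true
      }

      // Final stage: computes the result of an action.
      class ResultStage(id: Int, name: String) extends Stage(id, name) {
        override def isShuffleMap: Boolean = false
      }
      ```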
      
      Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
      Author: Ilya Ganelin <ilganeli@gmail.com>
      
      Closes #4708 from ilganeli/SPARK-4655 and squashes the following commits:
      
      c248924 [Ilya Ganelin] Merge branch 'SPARK-4655' of github.com:ilganeli/spark into SPARK-4655
      d930385 [Ilya Ganelin] Fixed merge conflict from
      a9a765f [Ilya Ganelin] Update DAGScheduler.scala
      c03563c [Ilya Ganelin] Minor fixes
      c39e971 [Ilya Ganelin] Added return typing for public methods
      845bc87 [Ilya Ganelin] Merge branch 'SPARK-4655' of github.com:ilganeli/spark into SPARK-4655
      e8031d8 [Ilya Ganelin] Minor string fixes
      4ec53ac [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-4655
      c004f62 [Ilya Ganelin] Update DAGScheduler.scala
      a2cb03f [Ilya Ganelin] [SPARK-4655] Replaced usages of Nil and eliminated some code reuse
      3d5cf20 [Ilya Ganelin] [SPARK-4655] Moved mima exclude to 1.4
      6912c55 [Ilya Ganelin] Resolved merge conflict
      4bff208 [Ilya Ganelin] Minor stylistic fixes
      c6fffbb [Ilya Ganelin] newline
      41402ad [Ilya Ganelin] Style fixes
      02c6981 [Ilya Ganelin] Merge branch 'SPARK-4655' of github.com:ilganeli/spark into SPARK-4655
      c755a09 [Ilya Ganelin] Some more stylistic updates and minor refactoring
      b6257a0 [Ilya Ganelin] Update MimaExcludes.scala
      0f0c624 [Ilya Ganelin] Fixed merge conflict
      2eba262 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-4655
      6b43d7b [Ilya Ganelin] Got rid of some spaces
      6f1a5db [Ilya Ganelin] Revert "More minor formatting and refactoring"
      1b3471b [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-4655
      c9288e2 [Ilya Ganelin] More minor formatting and refactoring
      d548caf [Ilya Ganelin] Formatting fix
      c3ae5c2 [Ilya Ganelin] Explicit typing
      0dacaf3 [Ilya Ganelin] Got rid of stale import
      6da3a71 [Ilya Ganelin] Trailing whitespace
      b85c5fe [Ilya Ganelin] Added minor fixes
      a57dfcd [Ilya Ganelin] Added MiMA exclusion to get around binary compatibility check
      83ed849 [Ilya Ganelin] moved braces for consistency
      96dd161 [Ilya Ganelin] Fixed minor style error
      cfd6f10 [Ilya Ganelin] Updated DAGScheduler to use new ResultStage and ShuffleMapStage classes
      83494e9 [Ilya Ganelin] Added new Stage classes
      ff1915e1
  2. Mar 31, 2015
    • Reynold Xin's avatar
      [Doc] Improve Python DataFrame documentation · 305abe1e
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5287 from rxin/pyspark-df-doc-cleanup-context and squashes the following commits:
      
      1841b60 [Reynold Xin] Lint.
      f2007f1 [Reynold Xin] functions and types.
      bc3b72b [Reynold Xin] More improvements to DataFrame Python doc.
      ac1d4c0 [Reynold Xin] Bug fix.
      b163365 [Reynold Xin] Python fix. Added Experimental flag to DataFrameNaFunctions.
      608422d [Reynold Xin] [Doc] Cleanup context.py Python docs.
      305abe1e
    • Josh Rosen's avatar
      [SPARK-6614] OutputCommitCoordinator should clear authorized committer only... · 37326079
      Josh Rosen authored
      [SPARK-6614] OutputCommitCoordinator should clear authorized committer only after authorized committer fails, not after any failure
      
      In OutputCommitCoordinator, there is some logic to clear the authorized committer's lock on committing in case that task fails.  However, it looks like the current code also clears this lock if other non-authorized tasks fail, which is an obvious bug.
      
      In theory, it's possible that this could allow a new committer to start, run to completion, and commit output before the authorized committer finished, but it's unlikely that this race occurs often in practice due to the complex combination of failure and timing conditions that would be required to expose it.
      
      This patch addresses this issue and adds a regression test.
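
      A minimal sketch of the corrected check, assuming a simplified map of authorized committers (illustrative, not the actual OutputCommitCoordinator code):

      ```scala
      import scala.collection.mutable

      object CommitLockSketch {
        // (stageId, partitionId) -> task attempt currently authorized to commit
        private val authorizedCommitters = mutable.Map[(Int, Int), Long]()

        def onTaskFailure(stage: Int, partition: Int, attempt: Long): Unit = {
          // Only clear the lock when the failed attempt *is* the authorized committer;
          // before the fix, any failure for this partition cleared it.
          if (authorizedCommitters.get((stage, partition)).contains(attempt)) {
            authorizedCommitters -= ((stage, partition))
          }
        }
      }
      ```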
      
      Thanks to aarondav for spotting this issue.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5276 from JoshRosen/SPARK-6614 and squashes the following commits:
      
      d532ba7 [Josh Rosen] Check whether failed task was authorized committer
      cbb3784 [Josh Rosen] Add regression test for SPARK-6614
      37326079
    • MechCoder's avatar
      [SPARK-5692] [MLlib] Word2Vec save/load · 0e00f12d
      MechCoder authored
      Word2Vec model now supports saving and loading.
      
      a] The metadata, stored in JSON format, consists of "version", "classname", "vectorSize" and "numWords".
      b] The data, stored in Parquet format, consists of an array of rows, each row holding 2 columns: the word (a String) and its vector (an Array of Floats).
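
      A usage sketch of the new API; the SparkContext `sc`, the training corpus, and the save path are assumptions, while `save`/`load` follow MLlib's Saveable/Loader convention:

      ```scala
      import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

      // sc: SparkContext and corpus: RDD[Seq[String]] are assumed to be defined.
      val model = new Word2Vec().fit(corpus)
      model.save(sc, "/models/word2vec")                        // JSON metadata + Parquet data
      val restored = Word2VecModel.load(sc, "/models/word2vec")
      val synonyms = restored.findSynonyms("spark", 5)          // sanity check after reload
      ```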
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #5291 from MechCoder/spark-5692 and squashes the following commits:
      
      1142f3a [MechCoder] Add numWords to metaData
      bfe4c39 [MechCoder] [SPARK-5692] Word2Vec save/load
      0e00f12d
    • Liang-Chi Hsieh's avatar
      [SPARK-6633][SQL] Should be "Contains" instead of "EndsWith" when constructing... · 2036bc59
      Liang-Chi Hsieh authored
      [SPARK-6633][SQL] Should be "Contains" instead of "EndsWith" when constructing sources.StringContains
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5299 from viirya/stringcontains and squashes the following commits:
      
      c1ece4c [Liang-Chi Hsieh] Should be Contains instead of EndsWith.
      2036bc59
    • Michael Armbrust's avatar
      [SPARK-5371][SQL] Propagate types after function conversion, before further resolution · beebb7ff
      Michael Armbrust authored
      Previously, it was possible for a query to flip back and forth out of a resolved state, allowing resolution to propagate up before coercion had stabilized.  The issue was that `ResolvedReferences` would run after `FunctionArgumentConversion`, but before `PropagateTypes` had run.  This PR ensures we correctly run `PropagateTypes` after any coercion has been applied.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #5278 from marmbrus/unionNull and squashes the following commits:
      
      dc3581a [Michael Armbrust] [SPARK-5371][SQL] Propagate types after function conversion / before further resolution
      beebb7ff
    • Yanbo Liang's avatar
      [SPARK-6255] [MLLIB] Support multiclass classification in Python API · b5bd75d9
      Yanbo Liang authored
      Python API parity check for classification and multiclass classification support; the following major gaps needed to be filled in for Python:
      ```scala
      LogisticRegressionWithLBFGS
          setNumClasses
          setValidateData
      LogisticRegressionModel
          getThreshold
          numClasses
          numFeatures
      SVMWithSGD
          setValidateData
      SVMModel
          getThreshold
      ```
      For users, the greatest benefit of this PR is that multiclass classification is now supported by the Python API.
      Users can train a multiclass classification model and use it to predict in pyspark.
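
      For reference, a sketch of the Scala API being mirrored; `trainingData` is an assumed RDD[LabeledPoint]:

      ```scala
      import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

      // Train a 10-class multinomial logistic regression model.
      val model = new LogisticRegressionWithLBFGS()
        .setNumClasses(10)
        .run(trainingData)
      println(s"classes: ${model.numClasses}, features: ${model.numFeatures}")
      ```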
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #5137 from yanboliang/spark-6255 and squashes the following commits:
      
      0bd531e [Yanbo Liang] address comments
      444d5e2 [Yanbo Liang] LogisticRegressionModel.predict() optimization
      fc7990b [Yanbo Liang] address comments
      b0d9c63 [Yanbo Liang] Support Multinomial LR model predict in Python API
      ded847c [Yanbo Liang] Python API parity check for classification (support multiclass classification)
      b5bd75d9
    • lewuathe's avatar
      [SPARK-6598][MLLIB] Python API for IDFModel · 46de6c05
      lewuathe authored
      This is a sub-task of SPARK-6254.
      It wraps the IDFModel `idf` member function for pyspark.
      
      Author: lewuathe <lewuathe@me.com>
      
      Closes #5264 from Lewuathe/SPARK-6598 and squashes the following commits:
      
      1dc522c [lewuathe] [SPARK-6598] Python API for IDFModel
      46de6c05
    • Michael Armbrust's avatar
      [SPARK-6145][SQL] fix ORDER BY on nested fields · cd48ca50
      Michael Armbrust authored
      This PR is based on work by cloud-fan in #4904, but with two differences:
       - We isolate the logic for Sort's special handling into `ResolveSortReferences`
       - We avoid creating UnresolvedGetField expressions during resolution.  Instead we either resolve GetField or we return None.  This avoids us going down the wrong path early on.
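
      A sketch of the query shape this fixes; `sqlContext` and the table are assumptions:

      ```scala
      // Table 't' with a nested column a: struct<b: int> is assumed registered.
      val sorted = sqlContext.sql("SELECT a.b FROM t ORDER BY a.b")
      sorted.show()
      ```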
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #5189 from marmbrus/nestedOrderBy and squashes the following commits:
      
      b8cae45 [Michael Armbrust] fix another test
      0f36a11 [Michael Armbrust] WIP
      91820cd [Michael Armbrust] Fix bug.
      cd48ca50
    • Cheng Lian's avatar
      [SPARK-6575] [SQL] Adds configuration to disable schema merging while... · 81020144
      Cheng Lian authored
      [SPARK-6575] [SQL] Adds configuration to disable schema merging while converting metastore Parquet tables
      
      Consider a metastore Parquet table that
      
      1. doesn't have schema evolution issue
      2. has lots of data files and/or partitions
      
      In this case, driver schema merging can be both slow and unnecessary. It would be good to have a configuration to let the user disable schema merging when converting such a metastore Parquet table.
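
      A usage sketch, assuming the configuration key introduced here is `spark.sql.hive.convertMetastoreParquet.mergeSchema`:

      ```scala
      // hiveContext is an assumed org.apache.spark.sql.hive.HiveContext.
      // Skip driver-side schema merging for converted metastore Parquet tables.
      hiveContext.setConf("spark.sql.hive.convertMetastoreParquet.mergeSchema", "false")
      ```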
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #5231 from liancheng/spark-6575 and squashes the following commits:
      
      cd96159 [Cheng Lian] Adds configuration to disable schema merging while converting metastore Parquet tables
      81020144
    • Cheng Lian's avatar
      [SPARK-6555] [SQL] Overrides equals() and hashCode() for MetastoreRelation · a7992ffa
      Cheng Lian authored
      Also removes temporary workarounds made in #5183 and #5251.
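
      The general pattern at work, shown on a hypothetical relation class rather than the actual MetastoreRelation code:

      ```scala
      class RelationLike(val databaseName: String, val tableName: String) {
        // Two instances referring to the same metastore table compare equal...
        override def equals(other: Any): Boolean = other match {
          case that: RelationLike =>
            databaseName == that.databaseName && tableName == that.tableName
          case _ => false
        }
        // ...and must therefore also hash identically.
        override def hashCode(): Int = (databaseName, tableName).hashCode()
      }
      ```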
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #5289 from liancheng/spark-6555 and squashes the following commits:
      
      d0095ac [Cheng Lian] Removes unused imports
      cfafeeb [Cheng Lian] Removes outdated comment
      75a2746 [Cheng Lian] Overrides equals() and hashCode() for MetastoreRelation
      a7992ffa
    • leahmcguire's avatar
      [SPARK-4894][mllib] Added Bernoulli option to NaiveBayes model in mllib · d01a6d8c
      leahmcguire authored
      Added an optional model type parameter for NaiveBayes training. It can be either Multinomial or Bernoulli.
      
      When Bernoulli is given, Bernoulli smoothing is used for fitting and for prediction, as per: http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html.
      
      The default model type remains the original Multinomial fit and predict.
      
      Added additional testing for Bernoulli and Multinomial models.
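
      A usage sketch of the string-based parameter this PR settles on; `data` is an assumed RDD[LabeledPoint] and the exact string values are an assumption:

      ```scala
      import org.apache.spark.mllib.classification.NaiveBayes

      val multinomialModel = NaiveBayes.train(data, lambda = 1.0)  // default model type
      val bernoulliModel = NaiveBayes.train(data, lambda = 1.0, modelType = "bernoulli")
      ```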
      
      Author: leahmcguire <lmcguire@salesforce.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      Author: Leah McGuire <lmcguire@salesforce.com>
      
      Closes #4087 from leahmcguire/master and squashes the following commits:
      
      f3c8994 [leahmcguire] changed checks on model type to requires
      acb69af [leahmcguire] removed enum type and replaces all modelType parameters with strings
      2224b15 [Leah McGuire] Merge pull request #2 from jkbradley/leahmcguire-master
      9ad89ca [Joseph K. Bradley] removed old code
      6a8f383 [Joseph K. Bradley] Added new model save/load format 2.0 for NaiveBayesModel after modelType parameter was added.  Updated tests.  Also updated ModelType enum-like type.
      852a727 [leahmcguire] merged with upstream master
      a22d670 [leahmcguire] changed NaiveBayesModel modelType parameter back to NaiveBayes.ModelType, made NaiveBayes.ModelType serializable, fixed getter method in NaiveBayes
      18f3219 [leahmcguire] removed private from naive bayes constructor for lambda only
      bea62af [leahmcguire] put back in constructor for NaiveBayes
      01baad7 [leahmcguire] made fixes from code review
      fb0a5c7 [leahmcguire] removed typo
      e2d925e [leahmcguire] fixed nonserializable error that was causing naivebayes test failures
      2d0c1ba [leahmcguire] fixed typo in NaiveBayes
      c298e78 [leahmcguire] fixed scala style errors
      b85b0c9 [leahmcguire] Merge remote-tracking branch 'upstream/master'
      900b586 [leahmcguire] fixed model call so that uses type argument
      ea09b28 [leahmcguire] Merge remote-tracking branch 'upstream/master'
      e016569 [leahmcguire] updated test suite with model type fix
      85f298f [leahmcguire] Merge remote-tracking branch 'upstream/master'
      dc65374 [leahmcguire] integrated model type fix
      7622b0c [leahmcguire] added comments and fixed style as per rb
      b93aaf6 [Leah McGuire] Merge pull request #1 from jkbradley/nb-model-type
      3730572 [Joseph K. Bradley] modified NB model type to be more Java-friendly
      b61b5e2 [leahmcguire] added back compatable constructor to NaiveBayesModel to fix MIMA test failure
      5a4a534 [leahmcguire] fixed scala style error in NaiveBayes
      3891bf2 [leahmcguire] synced with apache spark and resolved merge conflict
      d9477ed [leahmcguire] removed old inaccurate comment from test suite for mllib naive bayes
      76e5b0f [leahmcguire] removed unnecessary sort from test
      0313c0c [leahmcguire] fixed style error in NaiveBayes.scala
      4a3676d [leahmcguire] Updated changes re-comments. Got rid of verbose populateMatrix method. Public api now has string instead of enumeration. Docs are updated.
      ce73c63 [leahmcguire] added Bernoulli option to naive bayes model in mllib, added optional model type parameter for training. When Bernoulli is given the Bernoulli smoothing is used for fitting and for prediction http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html
      d01a6d8c
    • Xiangrui Meng's avatar
      [SPARK-6542][SQL] add CreateStruct · a05835b8
      Xiangrui Meng authored
      Similar to `CreateArray`, we can add `CreateStruct` to create nested columns. marmbrus
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5195 from mengxr/SPARK-6542 and squashes the following commits:
      
      3795c57 [Xiangrui Meng] update error message
      ae7ac3e [Xiangrui Meng] move unit test to a separate suite
      85dd559 [Xiangrui Meng] use NamedExpr
      c78e31a [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-6542
      85f3106 [Xiangrui Meng] add CreateStruct
      a05835b8
    • Yin Huai's avatar
      [SPARK-6618][SQL] HiveMetastoreCatalog.lookupRelation should use fine-grained lock · 314afd0e
      Yin Huai authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-6618
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #5281 from yhuai/lookupRelationLock and squashes the following commits:
      
      591b4be [Yin Huai] A test?
      b3a9625 [Yin Huai] Just protect client.
      314afd0e
    • Reynold Xin's avatar
      [SPARK-6623][SQL] Alias DataFrame.na.drop and DataFrame.na.fill in Python. · b80a030e
      Reynold Xin authored
      To maintain consistency with the Scala API.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5284 from rxin/df-na-alias and squashes the following commits:
      
      19f46b7 [Reynold Xin] Show DataFrameNaFunctions in docs.
      6618118 [Reynold Xin] [SPARK-6623][SQL] Alias DataFrame.na.drop and DataFrame.na.fill in Python.
      b80a030e
    • Reynold Xin's avatar
      [SPARK-6625][SQL] Add common string filters to data sources. · f07e7140
      Reynold Xin authored
      Filters such as startsWith, endsWith, contains will be very useful for data sources that provide search functionality, e.g. Succinct, Elastic Search, Solr.
      
      I also took this chance to improve documentation for the data source filters.
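
      A sketch of the filter shapes this adds, using simplified stand-ins for the classes in org.apache.spark.sql.sources:

      ```scala
      // Simplified stand-ins; a data source implementing filtered scans can push
      // these predicates down into its own search index.
      case class StringStartsWith(attribute: String, value: String)
      case class StringEndsWith(attribute: String, value: String)
      case class StringContains(attribute: String, value: String)
      ```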
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5285 from rxin/ds-string-filters and squashes the following commits:
      
      f021727 [Reynold Xin] Fixed grammar.
      7695a52 [Reynold Xin] [SPARK-6625][SQL] Add common string filters to data sources.
      f07e7140
    • zsxwing's avatar
      [SPARK-5124][Core] Move StopCoordinator to the receive method since it does not require a reply · 56775571
      zsxwing authored
      Hotfix for #4588
      
      cc rxin
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5283 from zsxwing/hotfix and squashes the following commits:
      
      cf3e5a7 [zsxwing] Move StopCoordinator to the receive method since it does not require a reply
      56775571
  3. Mar 30, 2015
    • Reynold Xin's avatar
      [SPARK-6119][SQL] DataFrame support for missing data handling · b8ff2bc6
      Reynold Xin authored
      This pull request adds variants of DataFrame.na.drop and DataFrame.na.fill to the Scala/Java API, and DataFrame.fillna and DataFrame.dropna to the Python API.
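
      A usage sketch of the Scala side; `df` and its column names are hypothetical:

      ```scala
      df.na.drop()                                      // drop rows containing any null
      df.na.drop(Seq("age", "height"))                  // consider only these columns
      df.na.fill(0.0)                                   // fill nulls in numeric columns
      df.na.fill(Map("age" -> 18, "name" -> "unknown")) // per-column replacement values
      ```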
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5274 from rxin/df-missing-value and squashes the following commits:
      
      4ee1b98 [Reynold Xin] Improve error reporting in Python.
      33a330c [Reynold Xin] Remove replace for now.
      bc4fdbb [Reynold Xin] Added documentation for replace.
      d56f5a5 [Reynold Xin] Added replace for Scala/Java.
      2385d00 [Reynold Xin] Feedback from Xiangrui on "how".
      914a374 [Reynold Xin] fill with map.
      185c67e [Reynold Xin] Allow specifying column subsets in fill.
      749eb47 [Reynold Xin] fillna
      249b94e [Reynold Xin] Removing undefined functions.
      6a73c68 [Reynold Xin] Missing file.
      67d7003 [Reynold Xin] [SPARK-6119][SQL] DataFrame.na.drop (Scala/Java) and DataFrame.dropna (Python)
      b8ff2bc6
    • Cheng Lian's avatar
      [SPARK-6369] [SQL] Uses commit coordinator to help committing Hive and Parquet tables · fde69454
      Cheng Lian authored
      This PR leverages the output commit coordinator introduced in #4066 to help committing Hive and Parquet tables.
      
      This PR extracts output commit code in `SparkHadoopWriter.commit` to `SparkHadoopMapRedUtil.commitTask`, and reuses it for committing Parquet and Hive tables on executor side.
      
      TODO
      
      - [ ] Add tests
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #5139 from liancheng/spark-6369 and squashes the following commits:
      
      72eb628 [Cheng Lian] Fixes typo in javadoc
      9a4b82b [Cheng Lian] Adds javadoc and addresses @aarondav's comments
      dfdf3ef [Cheng Lian] Uses commit coordinator to help committing Hive and Parquet tables
      fde69454
    • Davies Liu's avatar
      [SPARK-6603] [PySpark] [SQL] add SQLContext.udf and deprecate inferSchema() and applySchema · f76d2e55
      Davies Liu authored
      This PR creates an alias for `registerFunction` as `udf.register`, to be consistent with the Scala API.
      
      It also deprecates inferSchema() and applySchema(), showing a warning for them.
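
      For comparison, a sketch of the Scala API this aligns with; `sqlContext` and the `people` table are assumptions:

      ```scala
      sqlContext.udf.register("strLen", (s: String) => s.length)
      sqlContext.sql("SELECT strLen(name) FROM people")
      ```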
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5273 from davies/udf and squashes the following commits:
      
      476e947 [Davies Liu] address comments
      c096fdb [Davies Liu] add SQLContext.udf and deprecate inferSchema() and applySchema
      f76d2e55
    • Brennon York's avatar
      [HOTFIX][SPARK-4123]: Updated to fix bug where multiple dependencies added breaks GitHub output · df355008
      Brennon York authored
      Currently there is a bug whereby if a new patch introduces more than one new dependency (or removes more than one), it breaks the GitHub post output (see [this build](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29399/consoleFull)). This hotfix replaces `awk` `print` statements with `printf` so as not to automatically add the newline character, which is instead escaped and added explicitly at the end of the `awk` statement. This should take a failed build output such as:
      
      ```json
      data: {"body": "  [Test build #29400 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29400/consoleFull) for   PR 5266 at commit [`2aa4be0`](https://github.com/apache/spark/commit/2aa4be0e1d7ce052f8c901c6d9462c611c3a920a).\n * This patch **passes all tests**.\n * This patch merges cleanly.\n * This patch adds the following public classes _(experimental)_:\n  * `class IDF extends Estimator[IDFModel] with IDFParams `\n  * `class Normalizer extends UnaryTransformer[Vector, Vector, Normalizer] `\n\n * This patch **adds the following new dependencies:**\n   * `avro-1.7.7.jar`
         * `breeze-macros_2.10-0.11.2.jar`
         * `breeze_2.10-0.11.2.jar`\n * This patch **removes the following dependencies:**\n   * `avro-1.7.6.jar`
         * `breeze-macros_2.10-0.11.1.jar`
         * `breeze_2.10-0.11.1.jar`"}
      ```
      
      and turn it into:
      
      ```json
      data: {"body": "  [Test build #29400 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29400/consoleFull) for   PR 5266 at commit [`2aa4be0`](https://github.com/apache/spark/commit/2aa4be0e1d7ce052f8c901c6d9462c611c3a920a).\n * This patch **passes all tests**.\n * This patch merges cleanly.\n * This patch adds the following public classes _(experimental)_:\n  * `class IDF extends Estimator[IDFModel] with IDFParams `\n  * `class Normalizer extends UnaryTransformer[Vector, Vector, Normalizer] `\n\n * This patch **adds the following new dependencies:**\n   * `avro-1.7.7.jar`\n   * `breeze-macros_2.10-0.11.2.jar`\n   * `breeze_2.10-0.11.2.jar`\n * This patch **removes the following dependencies:**\n   * `avro-1.7.6.jar`\n   * `breeze-macros_2.10-0.11.1.jar`\n   * `breeze_2.10-0.11.1.jar`"}
      ```
      
      I've tested this locally and all worked.
      
      /cc srowen pwendell nchammas
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #5269 from brennonyork/HOTFIX-SPARK-4123 and squashes the following commits:
      
      a441068 [Brennon York] Updated awk to use printf and to manually insert newlines so that the JSON github string when posted is corrected
      df355008
    • CodingCat's avatar
      [SPARK-6592][SQL] fix filter for scaladoc to generate API doc for Row class under catalyst dir · 32259c67
      CodingCat authored
      https://issues.apache.org/jira/browse/SPARK-6592
      
      The current implementation in SparkBuild.scala filters out all classes under the catalyst directory; however, there is a corner case: the Row class is a public API under that directory.
      
      We need to include Row in the scaladoc while still excluding the other classes of the catalyst project.
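
      Conceptually, the filter ends up behaving like this predicate (a plain Scala sketch, not the actual sbt setting):

      ```scala
      // Keep Row.scala in the scaladoc while excluding the rest of the catalyst project.
      def keepForScaladoc(path: String): Boolean =
        !path.contains("sql/catalyst") || path.endsWith("Row.scala")
      ```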
      
      Thanks for the help on this patch from rxin and liancheng
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #5252 from CodingCat/SPARK-6592 and squashes the following commits:
      
      02098a4 [CodingCat] ignore collection, enable types (except those protected classes)
      f7af2cb [CodingCat] commit
      3ab4403 [CodingCat] fix filter for scaladoc to generate API doc for Row.scala under catalyst directory
      32259c67
    • Michael Armbrust's avatar
      [SPARK-6595][SQL] MetastoreRelation should be a MultiInstanceRelation · fe81f6c7
      Michael Armbrust authored
      Now that we have `DataFrame`s it is possible to have multiple copies in a single query plan.  As such, it needs to inherit from `MultiInstanceRelation` or self joins will break.  I also add better debugging errors when our self join handling fails in case there are future bugs.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #5251 from marmbrus/multiMetaStore and squashes the following commits:
      
      4272f6d [Michael Armbrust] [SPARK-6595][SQL] MetastoreRelation should be MultiInstanceRelation
      fe81f6c7
    • Jose Manuel Gomez's avatar
      [HOTFIX] Update start-slave.sh · 19d4c392
      Jose Manuel Gomez authored
      Without this change, the error below happens when I execute sbin/start-all.sh:
      
      localhost: /spark-1.3/sbin/start-slave.sh: line 32: unexpected EOF while looking for matching `"'
      localhost: /spark-1.3/sbin/start-slave.sh: line 33: syntax error: unexpected end of file
      
      My operating system is Linux Mint 17.1 Rebecca.
      
      Author: Jose Manuel Gomez <jmgomez@stratio.com>
      
      Closes #5262 from josegom/patch-2 and squashes the following commits:
      
      453af8b [Jose Manuel Gomez] Update start-slave.sh
      2c456bd [Jose Manuel Gomez] Update start-slave.sh
      19d4c392
    • Ilya Ganelin's avatar
      [SPARK-5750][SPARK-3441][SPARK-5836][CORE] Added documentation explaining shuffle · 4bdfb7ba
      Ilya Ganelin authored
      I've updated the Spark Programming Guide to add a section on the shuffle operation providing some background on what it does. I've also addressed some of its performance impacts.
      
      I've included documentation to address the following issues:
      https://issues.apache.org/jira/browse/SPARK-5836
      https://issues.apache.org/jira/browse/SPARK-3441
      https://issues.apache.org/jira/browse/SPARK-5750
      
      https://issues.apache.org/jira/browse/SPARK-4227 is related but can be addressed in a separate PR since it involves updates to the Spark Configuration Guide.
      
      Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
      Author: Ilya Ganelin <ilganeli@gmail.com>
      
      Closes #5074 from ilganeli/SPARK-5750 and squashes the following commits:
      
      6178e24 [Ilya Ganelin] Update programming-guide.md
      7a0b96f [Ilya Ganelin] Update programming-guide.md
      2c5df08 [Ilya Ganelin] Merge branch 'SPARK-5750' of github.com:ilganeli/spark into SPARK-5750
      dffbd2d [Ilya Ganelin] [SPARK-5750] Slight wording update
      1ff4eb4 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-5750
      85f9c6e [Ilya Ganelin] Update programming-guide.md
      349d1fa [Ilya Ganelin] Added cross link for configuration page
      eeb5a7a [Ilya Ganelin] [SPARK-5750] Added some minor fixes
      dd5cc9d [Ilya Ganelin] [SPARK-5750] Fixed some factual inaccuracies with regards to shuffle internals.
      a8adb57 [Ilya Ganelin] [SPARK-5750] Incorporated feedback from Sean Owen
      9954bbe [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-5750
      159dd1c [Ilya Ganelin] [SPARK-5750] Style fixes from rxin.
      75ef67b [Ilya Ganelin] [SPARK-5750][SPARK-3441][SPARK-5836] Added documentation explaining the shuffle operation and included errata from a number of other JIRAs
      4bdfb7ba
    • CodingCat's avatar
      [SPARK-6596] fix the instruction on building scaladoc · de673303
      CodingCat authored
      In README.md under the docs/ directory, it says:
      
      > You can build just the Spark scaladoc by running build/sbt doc from the SPARK_PROJECT_ROOT directory.
      
      I guess the right approach is build/sbt unidoc
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #5253 from CodingCat/SPARK-6596 and squashes the following commits:
      
      af379ed [CodingCat] fix the instruction on building scaladoc
      de673303
    • Eran Medan's avatar
      [spark-sql] a better exception message than "scala.MatchError" for unsupported... · 17b13c53
      Eran Medan authored
      [spark-sql] a better exception message than "scala.MatchError" for unsupported types in Schema creation
      
      Currently, trying to register an RDD (or DataFrame in 1.3) as a table with types that have no supported schema representation (e.g. type "Any") throws a match error, e.g. scala.MatchError: Any (of class scala.reflect.internal.Types$ClassNoArgsTypeRef).
      
      This fix just provides a nicer error message than a MatchError.
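
      The pattern of the fix, on a hypothetical schema-inference match; `isSupported` and `toCatalystType` are illustrative placeholders:

      ```scala
      def isSupported(t: String): Boolean = Set("Int", "String", "Double")(t)
      def toCatalystType(t: String): String = t.toLowerCase

      def schemaFor(tpe: String): String = tpe match {
        case t if isSupported(t) => toCatalystType(t)
        case other =>
          // Fail with an actionable message instead of an opaque scala.MatchError.
          throw new UnsupportedOperationException(s"Schema for type $other is not supported")
      }
      ```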
      
      Author: Eran Medan <ehrann.mehdan@gmail.com>
      
      Closes #5235 from eranation/patch-2 and squashes the following commits:
      
      af4b1a2 [Eran Medan] Line should be under 100 chars
      0c69e9d [Eran Medan] Change from sys.error UnsupportedOperationException
      524be86 [Eran Medan] better exception than scala.MatchError: Any
      17b13c53
  4. Mar 29, 2015
    • Li Zhihui's avatar
      Fix string interpolator error in HeartbeatReceiver · 01dc9f50
      Li Zhihui authored
      Error log before the fix:
      <code>15/03/29 10:07:25 ERROR YarnScheduler: Lost an executor 24 (already removed): Executor heartbeat timed out after ${now - lastSeenMs} ms</code>
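
      A minimal reproduction of this bug class:

      ```scala
      val now = System.currentTimeMillis()
      val lastSeenMs = now - 120000
      println("timed out after ${now - lastSeenMs} ms")  // bug: no 's' prefix, printed literally
      println(s"timed out after ${now - lastSeenMs} ms") // fix: interpolator substitutes the value
      ```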
      
      Author: Li Zhihui <zhihui.li@intel.com>
      
      Closes #5255 from li-zhihui/fixstringinterpolator and squashes the following commits:
      
      c93f2b7 [Li Zhihui] Fix string interpolator error in HeartbeatReceiver
      01dc9f50
    • zsxwing's avatar
      [SPARK-5124][Core] A standard RPC interface and an Akka implementation · a8d53afb
      zsxwing authored
      This PR added a standard internal RPC interface for Spark and an Akka implementation. See [the design document](https://issues.apache.org/jira/secure/attachment/12698710/Pluggable%20RPC%20-%20draft%202.pdf) for more details.
      
      I will split the whole work into multiple PRs to make code review easier. This is the first PR, and it avoids touching too many files.
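
      A conceptual sketch of the endpoint abstraction described in the design document; the trait shapes are simplified assumptions built from names in the commit log below:

      ```scala
      trait RpcCallContext {
        def reply(response: Any): Unit
        def sendFailure(e: Throwable): Unit
      }

      trait RpcEndpoint {
        // Fire-and-forget messages.
        def receive: PartialFunction[Any, Unit]
        // Ask-style messages that answer through the context.
        def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit]
      }
      ```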
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #4588 from zsxwing/rpc-part1 and squashes the following commits:
      
      fe3df4c [zsxwing] Move registerEndpoint and use actorSystem.dispatcher in asyncSetupEndpointRefByURI
      f6f3287 [zsxwing] Remove RpcEndpointRef.toURI
      8bd1097 [zsxwing] Fix docs and the code style
      f459380 [zsxwing] Add RpcAddress.fromURI and rename urls to uris
      b221398 [zsxwing] Move send methods above ask methods
      15cfd7b [zsxwing] Merge branch 'master' into rpc-part1
      9ffa997 [zsxwing] Fix MiMa tests
      78a1733 [zsxwing] Merge remote-tracking branch 'origin/master' into rpc-part1
      385b9c3 [zsxwing] Fix the code style and add docs
      2cc3f78 [zsxwing] Add an asynchronous version of setupEndpointRefByUrl
      e8dfec3 [zsxwing] Remove 'sendWithReply(message: Any, sender: RpcEndpointRef): Unit'
      08564ae [zsxwing] Add RpcEnvFactory to create RpcEnv
      e5df4ca [zsxwing] Handle AkkaFailure(e) in Actor
      ec7c5b0 [zsxwing] Fix docs
      7fc95e1 [zsxwing] Implement askWithReply in RpcEndpointRef
      9288406 [zsxwing] Document thread-safety for setupThreadSafeEndpoint
      3007c09 [zsxwing] Move setupDriverEndpointRef to RpcUtils and rename to makeDriverRef
      c425022 [zsxwing] Fix the code style
      5f87700 [zsxwing] Move the logical of processing message to a private function
      3e56123 [zsxwing] Use lazy to eliminate CountDownLatch
      07f128f [zsxwing] Remove ActionScheduler.scala
      4d34191 [zsxwing] Remove scheduler from RpcEnv
      7cdd95e [zsxwing] Add docs for RpcEnv
      51e6667 [zsxwing] Add 'sender' to RpcCallContext and rename the parameter of receiveAndReply to 'context'
      ffc1280 [zsxwing] Rename 'fail' to 'sendFailure' and other minor code style changes
      28e6d0f [zsxwing] Add onXXX for network events and remove the companion objects of network events
      3751c97 [zsxwing] Rename RpcResponse to RpcCallContext
      fe7d1ff [zsxwing] Add explicit reply in rpc
      7b9e0c9 [zsxwing] Fix the indentation
      04a106e [zsxwing] Remove NopCancellable and add a const NOP in object SettableCancellable
      2a579f4 [zsxwing] Remove RpcEnv.systemName
      155b987 [zsxwing] Change newURI to uriOf and add some comments
      45b2317 [zsxwing] A standard RPC interface and An Akka implementation
      a8d53afb
    • June.He's avatar
      [SPARK-6585][Tests]Fix FileServerSuite testcase in some Env. · 0e2753ff
      June.He authored
      Changes FileServerSuite.test("HttpFileServer should not work with SSL when the server is untrusted") to catch SSLException.
      
      Author: June.He <jun.hejun@huawei.com>
      
      Closes #5239 from sisihj/SPARK-6585 and squashes the following commits:
      
      cb19ae3 [June.He] Change FileServerSuite.test("HttpFileServer should not work with SSL when the server is untrusted") catch SSLException
      0e2753ff
    • Thomas Graves's avatar
      [SPARK-6558] Utils.getCurrentUserName returns the full principal name instead of login name · 52ece26b
      Thomas Graves authored
      Utils.getCurrentUserName returns UserGroupInformation.getCurrentUser().getUserName() when SPARK_USER isn't set. It should return UserGroupInformation.getCurrentUser().getShortUserName().
      getUserName() returns the user's full principal name (i.e. user1@CORP.COM). getShortUserName() returns just the user's login name (user1).
      
      This just happens to work on YARN because the Client code sets:
      env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName()
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #5229 from tgravescs/SPARK-6558 and squashes the following commits:
      
      24830bf [Thomas Graves] Utils.getCurrentUserName returns the full principal name instead of login name
      52ece26b
    • Nishkam Ravi's avatar
      [SPARK-6406] Launch Spark using assembly jar instead of a separate launcher jar · e3eb3939
      Nishkam Ravi authored
      Author: Nishkam Ravi <nravi@cloudera.com>
      Author: nishkamravi2 <nishkamravi@gmail.com>
      Author: nravi <nravi@c1704.halxg.cloudera.com>
      
      Closes #5085 from nishkamravi2/master_nravi and squashes the following commits:
      
      bad4349 [nishkamravi2] Update Main.java
      36a6f87 [Nishkam Ravi] Minor changes and bug fixes
      b7f4ae7 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      4a45d6a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      458af39 [Nishkam Ravi] Locate the jar using getLocation, obviates the need to pass assembly path as an argument
      d9658d6 [Nishkam Ravi] Changes for SPARK-6406
      ccdc334 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      3faa7a4 [Nishkam Ravi] Launcher library changes (SPARK-6406)
      345206a [Nishkam Ravi] spark-class merge Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
      ac58975 [Nishkam Ravi] spark-class changes
      06bfeb0 [nishkamravi2] Update spark-class
      35af990 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      32c3ab3 [nishkamravi2] Update AbstractCommandBuilder.java
      4bd4489 [nishkamravi2] Update AbstractCommandBuilder.java
      746f35b [Nishkam Ravi] "hadoop" string in the assembly name should not be mandatory (everywhere else in spark we mandate spark-assembly*hadoop*.jar)
      bfe96e0 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      ee902fa [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      d453197 [nishkamravi2] Update NewHadoopRDD.scala
      6f41a1d [nishkamravi2] Update NewHadoopRDD.scala
      0ce2c32 [nishkamravi2] Update HadoopRDD.scala
      f7e33c2 [Nishkam Ravi] Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
      ba1eb8b [Nishkam Ravi] Try-catch block around the two occurrences of removeShutDownHook. Deletion of semi-redundant occurrences of expensive operation inShutDown.
      71d0e17 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      494d8c0 [nishkamravi2] Update DiskBlockManager.scala
      3c5ddba [nishkamravi2] Update DiskBlockManager.scala
      f0d12de [Nishkam Ravi] Workaround for IllegalStateException caused by recent changes to BlockManager.stop
      79ea8b4 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      b446edc [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      5c9a4cb [nishkamravi2] Update TaskSetManagerSuite.scala
      535295a [nishkamravi2] Update TaskSetManager.scala
      3e1b616 [Nishkam Ravi] Modify test for maxResultSize
      9f6583e [Nishkam Ravi] Changes to maxResultSize code (improve error message and add condition to check if maxResultSize > 0)
      5f8f9ed [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      636a9ff [nishkamravi2] Update YarnAllocator.scala
      8f76c8b [Nishkam Ravi] Doc change for yarn memory overhead
      35daa64 [Nishkam Ravi] Slight change in the doc for yarn memory overhead
      5ac2ec1 [Nishkam Ravi] Remove out
      dac1047 [Nishkam Ravi] Additional documentation for yarn memory overhead issue
      42c2c3d [Nishkam Ravi] Additional changes for yarn memory overhead issue
      362da5e [Nishkam Ravi] Additional changes for yarn memory overhead
      c726bd9 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      f00fa31 [Nishkam Ravi] Improving logging for AM memoryOverhead
      1cf2d1e [nishkamravi2] Update YarnAllocator.scala
      ebcde10 [Nishkam Ravi] Modify default YARN memory_overhead-- from an additive constant to a multiplier (redone to resolve merge conflicts)
      2e69f11 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      efd688a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark
      2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
      3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
      5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
      eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
      df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
      6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
      5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
      681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
      e3eb3939
    • Brennon York's avatar
      [SPARK-4123][Project Infra]: Show new dependencies added in pull requests · 55153f5c
      Brennon York authored
      Starting work on this, but need to find a way to ensure that, after doing a checkout from `apache/master`, we can successfully return to the current checkout. I believe that `git rev-parse HEAD` will get me what I want, but pushing this PR up to test what the Jenkins boxes are seeing.
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #5093 from brennonyork/SPARK-4123 and squashes the following commits:
      
      42e243e [Brennon York] moved starting test output to before pr tests, fixed indentation, changed mvn call to build/mvn
      dadd941 [Brennon York] reverted assembly pom, put the regular test suite back in play
      7aa1dee [Brennon York] set new dependencies into a <code> block, removed the bash debugging flag
      0074566 [Brennon York] fixed minor echo issue with quotes
      e229802 [Brennon York] updated to print the new dependency found
      27bb9b5 [Brennon York] changed the assembly pom to test whether the pr test will pick up new deps
      5375ad8 [Brennon York] git output to dev null
      9bce980 [Brennon York] ensure both gate files exist
      8f3c4b4 [Brennon York] updated to reflect the correct pushed in HEAD variable
      2bc7b27 [Brennon York] added a pom gate check
      a18db71 [Brennon York] full test of new deps script
      ea170de [Brennon York] dont let mvn execute tests
      f70d8cd [Brennon York] testing mvn with package
      62ffd65 [Brennon York] updated dependency output message and changed compile to package given the jenkins failure output
      04747e4 [Brennon York] adding simple mvn statement to see if command executes and prints compile output
      87f9bea [Brennon York] added -x flag with bash to get insight into what is executing and what isnt
      9e87208 [Brennon York] added set blocks to catch any non-zero exit codes and updated output
      6b3042b [Brennon York] removed excess git checkout print statements
      4077d46 [Brennon York] Merge remote-tracking branch 'upstream/master' into SPARK-4123
      2bb5527 [Brennon York] added echo statement so jenkins logs which pr tests are running
      d027f8f [Brennon York] proper piping of unnecessary stderr and stdout
      6e2890d [Brennon York] updated test output newlines
      d9f6f7f [Brennon York] removed echo
      bad9a3a [Brennon York] added back the new deps test
      e9e3ad1 [Brennon York] removed escapes for quotes
      97e5cfb [Brennon York] commenting out new deps script
      17379a5 [Brennon York] Merge remote-tracking branch 'upstream/master' into SPARK-4123
      56f74a8 [Brennon York] updated the unop for ensuring a test is available
      f2abc8c [Brennon York] removed the git checkout
      6912584 [Brennon York] added this_mssg echo output
      c610d42 [Brennon York] removed the error to dev/null
      b98f78c [Brennon York] added the removed deps and echo output for jenkins testing
      291a8fe [Brennon York] updated location of maven binary
      126ce61 [Brennon York] removing new deps test to isolate why jenkins isn't posting messages
      f8011d8 [Brennon York] minor updates and style changes
      63a35c9 [Brennon York] updated new dependencies test
      dae7ba8 [Brennon York] Capturing output directly from dependency builds
      94d3547 [Brennon York] adding the new dependencies script into the test mix
      2bca3c3 [Brennon York] added a git checkout 'git rev-parse HEAD' to the end of each pr test
      ae83b90 [Brennon York] removed jenkins tests to grab some values from the jenkins box
      4110993 [Brennon York] beginning work on pr test to add new dependencies
      55153f5c
    • Reynold Xin's avatar
      [DOC] Improvements to Python docs. · 5eef00d0
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5238 from rxin/pyspark-docs and squashes the following commits:
      
      c285951 [Reynold Xin] Reset deprecation warning.
      8c1031e [Reynold Xin] inferSchema
      dd91b1a [Reynold Xin] [DOC] Improvements to Python docs.
      5eef00d0
  5. Mar 28, 2015
  6. Mar 27, 2015
    • Adam Budde's avatar
      [SPARK-6538][SQL] Add missing nullable Metastore fields when merging a Parquet schema · 5909f097
      Adam Budde authored
      Opening to replace #5188.
      
      When Spark SQL infers a schema for a DataFrame, it will take the union of all field types present in the structured source data (e.g. an RDD of JSON data). When the source data for a row doesn't define a particular field on the DataFrame's schema, a null value will simply be assumed for this field. This workflow makes it very easy to construct tables and query over a set of structured data with a nonuniform schema. However, this behavior is not consistent in some cases when dealing with Parquet files and an external table managed by an external Hive metastore.
      
      In our particular use case, we use Spark Streaming to parse and transform our input data and then apply a window function to save an arbitrary-sized batch of data as a Parquet file, which itself will be added as a partition to an external Hive table via an *"ALTER TABLE... ADD PARTITION..."* statement. Since our input data is nonuniform, it is expected that not every partition batch will contain every field present in the table's schema obtained from the Hive metastore. As such, we expect that the schema of some of our Parquet files may not contain the same set of fields present in the full metastore schema.
      
      In such cases, it seems natural that Spark SQL would simply assume null values for any missing fields in the partition's Parquet file, assuming these fields are specified as nullable by the metastore schema. This is not the case in the current implementation of ParquetRelation2. The **mergeMetastoreParquetSchema()** method used to reconcile differences between a Parquet file's schema and a schema retrieved from the Hive metastore will raise an exception if the Parquet file doesn't match the same set of fields specified by the metastore.
      
      This pull request alters the behavior of **mergeMetastoreParquetSchema()** by having it first add any nullable fields from the metastore schema to the Parquet file schema if they aren't already present there.
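
      Conceptually, the new reconciliation step looks like this; a simplified StructType sketch, not the actual mergeMetastoreParquetSchema() code:

      ```scala
      import org.apache.spark.sql.types.{StructField, StructType}

      def addMissingNullableFields(metastore: StructType, parquet: StructType): StructType = {
        val missing = metastore.fields.filter { f =>
          f.nullable && !parquet.fieldNames.contains(f.name)
        }
        StructType(parquet.fields ++ missing)
      }
      ```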
      
      Author: Adam Budde <budde@amazon.com>
      
      Closes #5214 from budde/nullable-fields and squashes the following commits:
      
      a52d378 [Adam Budde] Refactor ParquetSchemaSuite.scala for cases now permitted by SPARK-6471 and SPARK-6538
      9041bfa [Adam Budde] Add missing nullable Metastore fields when merging a Parquet schema
      5909f097
    • Reynold Xin's avatar
      [SPARK-6564][SQL] SQLContext.emptyDataFrame should contain 0 row, not 1 row · 3af73343
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5226 from rxin/empty-df and squashes the following commits:
      
      1306d88 [Reynold Xin] Proper fix.
      e135bb9 [Reynold Xin] [SPARK-6564][SQL] SQLContext.emptyDataFrame should contain 0 rows, not 1 row.
      3af73343