  1. Mar 30, 2015
    • Brennon York's avatar
      [HOTFIX][SPARK-4123]: Updated to fix bug where multiple dependencies added breaks Github output · df355008
      Brennon York authored
      Currently there is a bug whereby if a new patch introduces (or removes) more than one dependency, it breaks the GitHub post output (see [this build](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29399/consoleFull)). This hotfix replaces the `awk` `print` statements with `printf` so that the newline character is not appended automatically; instead it is escaped and added explicitly at the end of the `awk` statement. This should take a failed build output such as:
      
      ```json
      data: {"body": "  [Test build #29400 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29400/consoleFull) for   PR 5266 at commit [`2aa4be0`](https://github.com/apache/spark/commit/2aa4be0e1d7ce052f8c901c6d9462c611c3a920a).\n * This patch **passes all tests**.\n * This patch merges cleanly.\n * This patch adds the following public classes _(experimental)_:\n  * `class IDF extends Estimator[IDFModel] with IDFParams `\n  * `class Normalizer extends UnaryTransformer[Vector, Vector, Normalizer] `\n\n * This patch **adds the following new dependencies:**\n   * `avro-1.7.7.jar`
         * `breeze-macros_2.10-0.11.2.jar`
         * `breeze_2.10-0.11.2.jar`\n * This patch **removes the following dependencies:**\n   * `avro-1.7.6.jar`
         * `breeze-macros_2.10-0.11.1.jar`
         * `breeze_2.10-0.11.1.jar`"}
      ```
      
      and turn it into:
      
      ```json
      data: {"body": "  [Test build #29400 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29400/consoleFull) for   PR 5266 at commit [`2aa4be0`](https://github.com/apache/spark/commit/2aa4be0e1d7ce052f8c901c6d9462c611c3a920a).\n * This patch **passes all tests**.\n * This patch merges cleanly.\n * This patch adds the following public classes _(experimental)_:\n  * `class IDF extends Estimator[IDFModel] with IDFParams `\n  * `class Normalizer extends UnaryTransformer[Vector, Vector, Normalizer] `\n\n * This patch **adds the following new dependencies:**\n   * `avro-1.7.7.jar`\n   * `breeze-macros_2.10-0.11.2.jar`\n   * `breeze_2.10-0.11.2.jar`\n * This patch **removes the following dependencies:**\n   * `avro-1.7.6.jar`\n   * `breeze-macros_2.10-0.11.1.jar`\n   * `breeze_2.10-0.11.1.jar`"}
      ```
      
      I've tested this locally and all worked.
      
      /cc srowen pwendell nchammas
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #5269 from brennonyork/HOTFIX-SPARK-4123 and squashes the following commits:
      
      a441068 [Brennon York] Updated awk to use printf and to manually insert newlines so that the JSON github string when posted is corrected
      df355008
    • CodingCat's avatar
      [SPARK-6592][SQL] fix filter for scaladoc to generate API doc for Row class under catalyst dir · 32259c67
      CodingCat authored
      https://issues.apache.org/jira/browse/SPARK-6592
      
      The current impl in SparkBuild.scala filters out all classes under the catalyst directory; however, we have a corner case: the Row class is a public API under that directory.
      
      We need to include Row in the scaladoc while still excluding the other classes of the catalyst project.
      
      Thanks for the help on this patch from rxin and liancheng
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #5252 from CodingCat/SPARK-6592 and squashes the following commits:
      
      02098a4 [CodingCat] ignore collection, enable types (except those protected classes)
      f7af2cb [CodingCat] commit
      3ab4403 [CodingCat] fix filter for scaladoc to generate API doc for Row.scala under catalyst directory
      32259c67
    • Michael Armbrust's avatar
      [SPARK-6595][SQL] MetastoreRelation should be a MultiInstanceRelation · fe81f6c7
      Michael Armbrust authored
      Now that we have `DataFrame`s it is possible to have multiple copies in a single query plan.  As such, it needs to inherit from `MultiInstanceRelation` or self joins will break.  I also add better debugging errors when our self join handling fails in case there are future bugs.
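
      As a brief, hypothetical illustration (the table name is made up), a self join puts the same metastore-backed relation into one plan twice, so it must be able to hand out distinct instances:

      ```scala
      import sqlContext.implicits._

      val people = sqlContext.table("people")
      // Self join: the MetastoreRelation behind "people" appears twice in this plan.
      val joined = people.as("a").join(people.as("b"), $"a.id" === $"b.id")
      ```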
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #5251 from marmbrus/multiMetaStore and squashes the following commits:
      
      4272f6d [Michael Armbrust] [SPARK-6595][SQL] MetastoreRelation should be MuliInstanceRelation
      fe81f6c7
    • Jose Manuel Gomez's avatar
      [HOTFIX] Update start-slave.sh · 19d4c392
      Jose Manuel Gomez authored
      Without this change, the error below happens when I execute sbin/start-all.sh:
      
      localhost: /spark-1.3/sbin/start-slave.sh: line 32: unexpected EOF while looking for matching `"'
      localhost: /spark-1.3/sbin/start-slave.sh: line 33: syntax error: unexpected end of file
      
      My operating system is Linux Mint 17.1 Rebecca.
      
      Author: Jose Manuel Gomez <jmgomez@stratio.com>
      
      Closes #5262 from josegom/patch-2 and squashes the following commits:
      
      453af8b [Jose Manuel Gomez] Update start-slave.sh
      2c456bd [Jose Manuel Gomez] Update start-slave.sh
      19d4c392
    • Ilya Ganelin's avatar
      [SPARK-5750][SPARK-3441][SPARK-5836][CORE] Added documentation explaining shuffle · 4bdfb7ba
      Ilya Ganelin authored
      I've updated the Spark Programming Guide to add a section on the shuffle operation providing some background on what it does. I've also addressed some of its performance impacts.
      
      I've included documentation to address the following issues:
      https://issues.apache.org/jira/browse/SPARK-5836
      https://issues.apache.org/jira/browse/SPARK-3441
      https://issues.apache.org/jira/browse/SPARK-5750
      
      https://issues.apache.org/jira/browse/SPARK-4227 is related but can be addressed in a separate PR since it involves updates to the Spark Configuration Guide.
      
      Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
      Author: Ilya Ganelin <ilganeli@gmail.com>
      
      Closes #5074 from ilganeli/SPARK-5750 and squashes the following commits:
      
      6178e24 [Ilya Ganelin] Update programming-guide.md
      7a0b96f [Ilya Ganelin] Update programming-guide.md
      2c5df08 [Ilya Ganelin] Merge branch 'SPARK-5750' of github.com:ilganeli/spark into SPARK-5750
      dffbd2d [Ilya Ganelin] [SPARK-5750] Slight wording update
      1ff4eb4 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-5750
      85f9c6e [Ilya Ganelin] Update programming-guide.md
      349d1fa [Ilya Ganelin] Added cross linkf or configuration page
      eeb5a7a [Ilya Ganelin] [SPARK-5750] Added some minor fixes
      dd5cc9d [Ilya Ganelin] [SPARK-5750] Fixed some factual inaccuracies with regards to shuffle internals.
      a8adb57 [Ilya Ganelin] [SPARK-5750] Incoporated feedback from Sean Owen
      9954bbe [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-5750
      159dd1c [Ilya Ganelin] [SPARK-5750] Style fixes from rxin.
      75ef67b [Ilya Ganelin] [SPARK-5750][SPARK-3441][SPARK-5836] Added documentation explaining the shuffle operation and included errata from a number of other JIRAs
      4bdfb7ba
    • CodingCat's avatar
      [SPARK-6596] fix the instruction on building scaladoc · de673303
      CodingCat authored
      In README.md under the docs/ directory, it says:
      
      > You can build just the Spark scaladoc by running build/sbt doc from the SPARK_PROJECT_ROOT directory.
      
      I guess the right approach is `build/sbt unidoc`.
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #5253 from CodingCat/SPARK-6596 and squashes the following commits:
      
      af379ed [CodingCat] fix the instruction on building scaladoc
      de673303
    • Eran Medan's avatar
      [spark-sql] a better exception message than "scala.MatchError" for unsupported... · 17b13c53
      Eran Medan authored
      [spark-sql] a better exception message than "scala.MatchError" for unsupported types in Schema creation
      
      Currently, trying to register an RDD (or DataFrame in 1.3) as a table when it has types with no supported Schema representation (e.g. type `Any`) throws a match error, e.g. `scala.MatchError: Any (of class scala.reflect.internal.Types$ClassNoArgsTypeRef)`.
      
      This fix just provides a nicer error message than a MatchError.
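
      A hypothetical illustration of the failure mode (the class and field names are made up): a field of type `Any` has no Catalyst schema representation, so schema inference has nothing to match it against:

      ```scala
      case class Unsupported(id: Int, payload: Any)

      // Before this change, registering such an RDD as a table failed with
      //   scala.MatchError: Any (of class scala.reflect.internal.Types$ClassNoArgsTypeRef)
      // After it, the failure is an UnsupportedOperationException with a readable message.
      // val df = sqlContext.createDataFrame(sc.parallelize(Seq(Unsupported(1, "x"))))
      ```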
      
      Author: Eran Medan <ehrann.mehdan@gmail.com>
      
      Closes #5235 from eranation/patch-2 and squashes the following commits:
      
      af4b1a2 [Eran Medan] Line should be under 100 chars
      0c69e9d [Eran Medan] Change from sys.error UnsupportedOperationException
      524be86 [Eran Medan] better exception than scala.MatchError: Any
      17b13c53
  2. Mar 29, 2015
    • Li Zhihui's avatar
      Fix string interpolator error in HeartbeatReceiver · 01dc9f50
      Li Zhihui authored
      Error log before the fix:
      <code>15/03/29 10:07:25 ERROR YarnScheduler: Lost an executor 24 (already removed): Executor heartbeat timed out after ${now - lastSeenMs} ms</code>
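
      The root cause (a minimal sketch reproducing the log line above) is a format string missing the `s` interpolator prefix, so the `${...}` placeholder is emitted literally instead of being evaluated:

      ```scala
      val now = 100000L
      val lastSeenMs = 40000L
      // Without the `s` prefix the placeholder is printed verbatim:
      println("Executor heartbeat timed out after ${now - lastSeenMs} ms")
      // => Executor heartbeat timed out after ${now - lastSeenMs} ms
      // With the `s` prefix the expression is evaluated as intended:
      println(s"Executor heartbeat timed out after ${now - lastSeenMs} ms")
      // => Executor heartbeat timed out after 60000 ms
      ```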
      
      Author: Li Zhihui <zhihui.li@intel.com>
      
      Closes #5255 from li-zhihui/fixstringinterpolator and squashes the following commits:
      
      c93f2b7 [Li Zhihui] Fix string interpolator error in HeartbeatReceiver
      01dc9f50
    • zsxwing's avatar
      [SPARK-5124][Core] A standard RPC interface and an Akka implementation · a8d53afb
      zsxwing authored
      This PR added a standard internal RPC interface for Spark and an Akka implementation. See [the design document](https://issues.apache.org/jira/secure/attachment/12698710/Pluggable%20RPC%20-%20draft%202.pdf) for more details.
      
      I will split the whole work into multiple PRs to make it easier for code review. This is the first PR; it avoids touching too many files.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #4588 from zsxwing/rpc-part1 and squashes the following commits:
      
      fe3df4c [zsxwing] Move registerEndpoint and use actorSystem.dispatcher in asyncSetupEndpointRefByURI
      f6f3287 [zsxwing] Remove RpcEndpointRef.toURI
      8bd1097 [zsxwing] Fix docs and the code style
      f459380 [zsxwing] Add RpcAddress.fromURI and rename urls to uris
      b221398 [zsxwing] Move send methods above ask methods
      15cfd7b [zsxwing] Merge branch 'master' into rpc-part1
      9ffa997 [zsxwing] Fix MiMa tests
      78a1733 [zsxwing] Merge remote-tracking branch 'origin/master' into rpc-part1
      385b9c3 [zsxwing] Fix the code style and add docs
      2cc3f78 [zsxwing] Add an asynchronous version of setupEndpointRefByUrl
      e8dfec3 [zsxwing] Remove 'sendWithReply(message: Any, sender: RpcEndpointRef): Unit'
      08564ae [zsxwing] Add RpcEnvFactory to create RpcEnv
      e5df4ca [zsxwing] Handle AkkaFailure(e) in Actor
      ec7c5b0 [zsxwing] Fix docs
      7fc95e1 [zsxwing] Implement askWithReply in RpcEndpointRef
      9288406 [zsxwing] Document thread-safety for setupThreadSafeEndpoint
      3007c09 [zsxwing] Move setupDriverEndpointRef to RpcUtils and rename to makeDriverRef
      c425022 [zsxwing] Fix the code style
      5f87700 [zsxwing] Move the logical of processing message to a private function
      3e56123 [zsxwing] Use lazy to eliminate CountDownLatch
      07f128f [zsxwing] Remove ActionScheduler.scala
      4d34191 [zsxwing] Remove scheduler from RpcEnv
      7cdd95e [zsxwing] Add docs for RpcEnv
      51e6667 [zsxwing] Add 'sender' to RpcCallContext and rename the parameter of receiveAndReply to 'context'
      ffc1280 [zsxwing] Rename 'fail' to 'sendFailure' and other minor code style changes
      28e6d0f [zsxwing] Add onXXX for network events and remove the companion objects of network events
      3751c97 [zsxwing] Rename RpcResponse to RpcCallContext
      fe7d1ff [zsxwing] Add explicit reply in rpc
      7b9e0c9 [zsxwing] Fix the indentation
      04a106e [zsxwing] Remove NopCancellable and add a const NOP in object SettableCancellable
      2a579f4 [zsxwing] Remove RpcEnv.systemName
      155b987 [zsxwing] Change newURI to uriOf and add some comments
      45b2317 [zsxwing] A standard RPC interface and An Akka implementation
      a8d53afb
    • June.He's avatar
      [SPARK-6585][Tests]Fix FileServerSuite testcase in some Env. · 0e2753ff
      June.He authored
      Change FileServerSuite.test("HttpFileServer should not work with SSL when the server is untrusted") to catch SSLException.
      
      Author: June.He <jun.hejun@huawei.com>
      
      Closes #5239 from sisihj/SPARK-6585 and squashes the following commits:
      
      cb19ae3 [June.He] Change FileServerSuite.test("HttpFileServer should not work with SSL when the server is untrusted") catch SSLException
      0e2753ff
    • Thomas Graves's avatar
      [SPARK-6558] Utils.getCurrentUserName returns the full principal name instead of login name · 52ece26b
      Thomas Graves authored
      Utils.getCurrentUserName returns UserGroupInformation.getCurrentUser().getUserName() when SPARK_USER isn't set. It should return UserGroupInformation.getCurrentUser().getShortUserName().
      getUserName() returns the user's full principal name (i.e. user1@CORP.COM). getShortUserName() returns just the user's login name (user1).
      
      This just happens to work on YARN because the Client code sets:
      env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName()
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #5229 from tgravescs/SPARK-6558 and squashes the following commits:
      
      24830bf [Thomas Graves] Utils.getCurrentUserName returns the full principal name instead of login name
      52ece26b
    • Nishkam Ravi's avatar
      [SPARK-6406] Launch Spark using assembly jar instead of a separate launcher jar · e3eb3939
      Nishkam Ravi authored
      Author: Nishkam Ravi <nravi@cloudera.com>
      Author: nishkamravi2 <nishkamravi@gmail.com>
      Author: nravi <nravi@c1704.halxg.cloudera.com>
      
      Closes #5085 from nishkamravi2/master_nravi and squashes the following commits:
      
      bad4349 [nishkamravi2] Update Main.java
      36a6f87 [Nishkam Ravi] Minor changes and bug fixes
      b7f4ae7 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      4a45d6a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      458af39 [Nishkam Ravi] Locate the jar using getLocation, obviates the need to pass assembly path as an argument
      d9658d6 [Nishkam Ravi] Changes for SPARK-6406
      ccdc334 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      3faa7a4 [Nishkam Ravi] Launcher library changes (SPARK-6406)
      345206a [Nishkam Ravi] spark-class merge Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
      ac58975 [Nishkam Ravi] spark-class changes
      06bfeb0 [nishkamravi2] Update spark-class
      35af990 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      32c3ab3 [nishkamravi2] Update AbstractCommandBuilder.java
      4bd4489 [nishkamravi2] Update AbstractCommandBuilder.java
      746f35b [Nishkam Ravi] "hadoop" string in the assembly name should not be mandatory (everywhere else in spark we mandate spark-assembly*hadoop*.jar)
      bfe96e0 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      ee902fa [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      d453197 [nishkamravi2] Update NewHadoopRDD.scala
      6f41a1d [nishkamravi2] Update NewHadoopRDD.scala
      0ce2c32 [nishkamravi2] Update HadoopRDD.scala
      f7e33c2 [Nishkam Ravi] Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
      ba1eb8b [Nishkam Ravi] Try-catch block around the two occurrences of removeShutDownHook. Deletion of semi-redundant occurrences of expensive operation inShutDown.
      71d0e17 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      494d8c0 [nishkamravi2] Update DiskBlockManager.scala
      3c5ddba [nishkamravi2] Update DiskBlockManager.scala
      f0d12de [Nishkam Ravi] Workaround for IllegalStateException caused by recent changes to BlockManager.stop
      79ea8b4 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      b446edc [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      5c9a4cb [nishkamravi2] Update TaskSetManagerSuite.scala
      535295a [nishkamravi2] Update TaskSetManager.scala
      3e1b616 [Nishkam Ravi] Modify test for maxResultSize
      9f6583e [Nishkam Ravi] Changes to maxResultSize code (improve error message and add condition to check if maxResultSize > 0)
      5f8f9ed [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      636a9ff [nishkamravi2] Update YarnAllocator.scala
      8f76c8b [Nishkam Ravi] Doc change for yarn memory overhead
      35daa64 [Nishkam Ravi] Slight change in the doc for yarn memory overhead
      5ac2ec1 [Nishkam Ravi] Remove out
      dac1047 [Nishkam Ravi] Additional documentation for yarn memory overhead issue
      42c2c3d [Nishkam Ravi] Additional changes for yarn memory overhead issue
      362da5e [Nishkam Ravi] Additional changes for yarn memory overhead
      c726bd9 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      f00fa31 [Nishkam Ravi] Improving logging for AM memoryOverhead
      1cf2d1e [nishkamravi2] Update YarnAllocator.scala
      ebcde10 [Nishkam Ravi] Modify default YARN memory_overhead-- from an additive constant to a multiplier (redone to resolve merge conflicts)
      2e69f11 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      efd688a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark
      2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
      3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
      5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
      eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
      df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
      6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
      5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
      681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
      e3eb3939
    • Brennon York's avatar
      [SPARK-4123][Project Infra]: Show new dependencies added in pull requests · 55153f5c
      Brennon York authored
      Starting work on this, but I need to find a way to ensure that, after doing a checkout from `apache/master`, we can successfully return to the current checkout. I believe that `git rev-parse HEAD` will get me what I want, but I'm pushing this PR up to test what the Jenkins boxes are seeing.
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #5093 from brennonyork/SPARK-4123 and squashes the following commits:
      
      42e243e [Brennon York] moved starting test output to before pr tests, fixed indentation, changed mvn call to build/mvn
      dadd941 [Brennon York] reverted assembly pom, put the regular test suite back in play
      7aa1dee [Brennon York] set new dendencies into a <code> block, removed the bash debugging flag
      0074566 [Brennon York] fixed minor echo issue with quotes
      e229802 [Brennon York] updated to print the new dependency found
      27bb9b5 [Brennon York] changed the assembly pom to test whether the pr test will pick up new deps
      5375ad8 [Brennon York] git output to dev null
      9bce980 [Brennon York] ensure both gate files exist
      8f3c4b4 [Brennon York] updated to reflect the correct pushed in HEAD variable
      2bc7b27 [Brennon York] added a pom gate check
      a18db71 [Brennon York] full test of new deps script
      ea170de [Brennon York] dont let mvn execute tests
      f70d8cd [Brennon York] testing mvn with package
      62ffd65 [Brennon York] updated dependency output message and changed compile to package given the jenkins failure output
      04747e4 [Brennon York] adding simple mvn statement to see if command executes and prints compile output
      87f9bea [Brennon York] added -x flag with bash to get insight into what is executing and what isnt
      9e87208 [Brennon York] added set blocks to catch any non-zero exit codes and updated output
      6b3042b [Brennon York] removed excess git checkout print statements
      4077d46 [Brennon York] Merge remote-tracking branch 'upstream/master' into SPARK-4123
      2bb5527 [Brennon York] added echo statement so jenkins logs which pr tests are running
      d027f8f [Brennon York] proper piping of unnecessary stderr and stdout
      6e2890d [Brennon York] updated test output newlines
      d9f6f7f [Brennon York] removed echo
      bad9a3a [Brennon York] added back the new deps test
      e9e3ad1 [Brennon York] removed escapes for quotes
      97e5cfb [Brennon York] commenting out new deps script
      17379a5 [Brennon York] Merge remote-tracking branch 'upstream/master' into SPARK-4123
      56f74a8 [Brennon York] updated the unop for ensuring a test is available
      f2abc8c [Brennon York] removed the git checkout
      6912584 [Brennon York] added this_mssg echo output
      c610d42 [Brennon York] removed the error to dev/null
      b98f78c [Brennon York] added the removed deps and echo output for jenkins testing
      291a8fe [Brennon York] updated location of maven binary
      126ce61 [Brennon York] removing new deps test to isolate why jenkins isn't posting messages
      f8011d8 [Brennon York] minor updates and style changes
      63a35c9 [Brennon York] updated new dependencies test
      dae7ba8 [Brennon York] Capturing output directly from dependency builds
      94d3547 [Brennon York] adding the new dependencies script into the test mix
      2bca3c3 [Brennon York] added a git checkout 'git rev-parse HEAD' to the end of each pr test
      ae83b90 [Brennon York] removed jenkins tests to grab some values from the jenkins box
      4110993 [Brennon York] beginning work on pr test to add new dependencies
      55153f5c
    • Reynold Xin's avatar
      [DOC] Improvements to Python docs. · 5eef00d0
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5238 from rxin/pyspark-docs and squashes the following commits:
      
      c285951 [Reynold Xin] Reset deprecation warning.
      8c1031e [Reynold Xin] inferSchema
      dd91b1a [Reynold Xin] [DOC] Improvements to Python docs.
      5eef00d0
  3. Mar 28, 2015
  4. Mar 27, 2015
    • Adam Budde's avatar
      [SPARK-6538][SQL] Add missing nullable Metastore fields when merging a Parquet schema · 5909f097
      Adam Budde authored
      Opening to replace #5188.
      
      When Spark SQL infers a schema for a DataFrame, it will take the union of all field types present in the structured source data (e.g. an RDD of JSON data). When the source data for a row doesn't define a particular field on the DataFrame's schema, a null value will simply be assumed for this field. This workflow makes it very easy to construct tables and query over a set of structured data with a nonuniform schema. However, this behavior is not consistent in some cases when dealing with Parquet files and an external table managed by an external Hive metastore.
      
      In our particular use case, we use Spark Streaming to parse and transform our input data and then apply a window function to save an arbitrary-sized batch of data as a Parquet file, which itself will be added as a partition to an external Hive table via an *"ALTER TABLE... ADD PARTITION..."* statement. Since our input data is nonuniform, it is expected that not every partition batch will contain every field present in the table's schema obtained from the Hive metastore. As such, we expect that the schema of some of our Parquet files may not contain the same set of fields present in the full metastore schema.
      
      In such cases, it seems natural that Spark SQL would simply assume null values for any missing fields in the partition's Parquet file, assuming these fields are specified as nullable by the metastore schema. This is not the case in the current implementation of ParquetRelation2. The **mergeMetastoreParquetSchema()** method used to reconcile differences between a Parquet file's schema and a schema retrieved from the Hive metastore will raise an exception if the Parquet file doesn't match the same set of fields specified by the metastore.
      
      This pull request alters the behavior of **mergeMetastoreParquetSchema()** by having it first add any nullable fields from the metastore schema to the Parquet file schema if they aren't already present there.
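
      An illustrative example of the scenario (the field names are hypothetical): a nullable metastore column that is absent from one partition's Parquet file should simply read back as null rather than causing the merge to fail:

      ```scala
      import org.apache.spark.sql.types._

      // Schema obtained from the Hive metastore
      val metastoreSchema = StructType(Seq(
        StructField("id", LongType, nullable = false),
        StructField("optionalCol", StringType, nullable = true)))

      // Schema of one partition's Parquet file, written from nonuniform input data
      val parquetFileSchema = StructType(Seq(
        StructField("id", LongType, nullable = false)))

      // With this change, merging first adds the missing nullable field
      // ("optionalCol") to the Parquet schema, so rows from this file yield null
      // for it instead of the merge raising an exception.
      ```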
      
      Author: Adam Budde <budde@amazon.com>
      
      Closes #5214 from budde/nullable-fields and squashes the following commits:
      
      a52d378 [Adam Budde] Refactor ParquetSchemaSuite.scala for cases now permitted by SPARK-6471 and SPARK-6538
      9041bfa [Adam Budde] Add missing nullable Metastore fields when merging a Parquet schema
      5909f097
    • Reynold Xin's avatar
      [SPARK-6564][SQL] SQLContext.emptyDataFrame should contain 0 row, not 1 row · 3af73343
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5226 from rxin/empty-df and squashes the following commits:
      
      1306d88 [Reynold Xin] Proper fix.
      e135bb9 [Reynold Xin] [SPARK-6564][SQL] SQLContext.emptyDataFrame should contain 0 rows, not 1 row.
      3af73343
    • Xusen Yin's avatar
      [SPARK-6526][ML] Add Normalizer transformer in ML package · d5497ab1
      Xusen Yin authored
      See [SPARK-6526](https://issues.apache.org/jira/browse/SPARK-6526).
      
      mengxr Should we add a test suite for this transformer? There are currently no test suites for the feature transformers in the ML package.
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #5181 from yinxusen/SPARK-6526 and squashes the following commits:
      
      6faa7bf [Xusen Yin] fix style
      8a462da [Xusen Yin] remove duplications
      ab35ab0 [Xusen Yin] add test suite
      bc8cd0f [Xusen Yin] fix comment
      79774c9 [Xusen Yin] add Normalizer transformer in ML package
      d5497ab1
    • Davies Liu's avatar
      [SPARK-6574] [PySpark] fix sql example · 887e1b72
      Davies Liu authored
      Fix the import in sql example.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5230 from davies/fix_sql_example and squashes the following commits:
      
      7ecc5f4 [Davies Liu] fix sql example
      887e1b72
    • Michael Armbrust's avatar
      [SPARK-6550][SQL] Use analyzed plan in DataFrame · 5d9c37c2
      Michael Armbrust authored
      This is based on a bug and test case proposed by viirya. See #5203 for an excellent description of the problem.
      
      TLDR; The problem occurs because the function `groupBy(String)` calls `resolve`, which returns an `AttributeReference`.  However, this `AttributeReference` is based on an analyzed plan which is thrown away.  At execution time, we once again analyze the plan.  However, in the case of self-joins, each call to analyze will produce a new tree for the left side of the join, rendering the previously returned `AttributeReference` invalid.
      
      As a fix, I propose we keep the analyzed plan instead of the unresolved plan inside of a `DataFrame`.
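
      A hedged sketch of the failure mode (the table and column names are hypothetical): the column handle returned by `groupBy` is tied to one analyzed copy of the plan, which a later self join re-analyzes into a different tree:

      ```scala
      val df = sqlContext.table("events")

      // groupBy("key") resolves "key" against an analyzed plan and returns an
      // AttributeReference bound to that particular tree.
      val counts = df.groupBy("key").count()

      // A self join re-analyzes one side, producing a fresh tree, so the previously
      // resolved reference no longer lines up without this fix.
      val joined = counts.join(df, counts("key") === df("key"))
      ```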
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #5217 from marmbrus/preanalyzer and squashes the following commits:
      
      1f98e2d [Michael Armbrust] revert change
      dd4dec1 [Michael Armbrust] Use the analyzed plan in DataFrame
      089c52e [Michael Armbrust] WIP
      5d9c37c2
    • Dean Chen's avatar
      [SPARK-6544][build] Increment Avro version from 1.7.6 to 1.7.7 · aa2b9917
      Dean Chen authored
      Fixes bug causing Kryo serialization to fail with Avro files in between stages.
      
      https://issues.apache.org/jira/browse/AVRO-1476?focusedCommentId=13999249&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13999249
      
      Author: Dean Chen <deanchen5@gmail.com>
      
      Closes #5193 from deanchen/SPARK-6544 and squashes the following commits:
      
      813d4c5 [Dean Chen] [SPARK-6544][build] Increment Avro version from 1.7.6 to 1.7.7
      aa2b9917
    • zsxwing's avatar
      [SPARK-6556][Core] Fix wrong parsing logic of executorTimeoutMs and... · da546b7b
      zsxwing authored
      [SPARK-6556][Core] Fix wrong parsing logic of executorTimeoutMs and checkTimeoutIntervalMs in HeartbeatReceiver
      
      The current reading logic of `executorTimeoutMs` is:
      ```Scala
      private val executorTimeoutMs = sc.conf.getLong("spark.network.timeout",
          sc.conf.getLong("spark.storage.blockManagerSlaveTimeoutMs", 120)) * 1000
      ```
      So if `spark.storage.blockManagerSlaveTimeoutMs` is 10000 and `spark.network.timeout` is not set, executorTimeoutMs will be 10000 * 1000. But the correct value should have been 10000.
      
      `checkTimeoutIntervalMs` has the same issue.
      
      This PR fixes them.
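
      A minimal sketch of the corrected intent (not necessarily the exact code in the PR): only multiply by 1000 when converting the seconds-based `spark.network.timeout`, and read the millisecond-based `spark.storage.blockManagerSlaveTimeoutMs` (default 120000) without rescaling it:

      ```scala
      private val executorTimeoutMs = sc.conf.getOption("spark.network.timeout")
        .map(_.toLong * 1000)
        .getOrElse(sc.conf.getLong("spark.storage.blockManagerSlaveTimeoutMs", 120000))
      ```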
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5209 from zsxwing/SPARK-6556 and squashes the following commits:
      
      6a0a411 [zsxwing] Fix docs
      c7d5422 [zsxwing] Add comments for executorTimeoutMs and checkTimeoutIntervalMs
      ccd5147 [zsxwing] Fix wrong parsing logic of executorTimeoutMs and checkTimeoutIntervalMs in HeartbeatReceiver
      da546b7b
    • Yu ISHIKAWA's avatar
      [SPARK-6341][mllib] Upgrade breeze from 0.11.1 to 0.11.2 · f43a6103
      Yu ISHIKAWA authored
      There are some bugs in breeze's SparseVector in 0.11.1, and Spark 1.3 depends on breeze 0.11.1, so I think we should upgrade it to 0.11.2.
      https://issues.apache.org/jira/browse/SPARK-6341
      
      And thank you for your great cooperation, David Hall (dlwh).
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #5222 from yu-iskw/upgrade-breeze and squashes the following commits:
      
      ad8a688 [Yu ISHIKAWA] Upgrade breeze from 0.11.1 to 0.11.2 because of a bug of SparseVector. Thanks you for your great cooperation, David Hall(@dlwh)
      f43a6103
    • mcheah's avatar
      [SPARK-6405] Limiting the maximum Kryo buffer size to be 2GB. · 49d2ec63
      mcheah authored
      Kryo buffers are backed by byte arrays, but primitive arrays can only be
      up to 2GB in size. It is misleading to allow users to set buffers past
      this size.
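
      A rough sketch of the kind of guard this implies (assuming the 1.x property name `spark.kryoserializer.buffer.max.mb`; the message text is illustrative):

      ```scala
      val maxBufferSizeMb = conf.getInt("spark.kryoserializer.buffer.max.mb", 64)
      // Kryo buffers are backed by byte arrays, and JVM arrays top out near 2 GB,
      // so reject configurations that could never be honored.
      require(maxBufferSizeMb < 2048,
        s"spark.kryoserializer.buffer.max.mb must be less than 2048 (2 GB), but got $maxBufferSizeMb")
      ```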
      
      Author: mcheah <mcheah@palantir.com>
      
      Closes #5218 from mccheah/feature/limit-kryo-buffer and squashes the following commits:
      
      1d6d1be [mcheah] Fixing numeric typo
      e2e30ce [mcheah] Removing explicit int and double type to match style
      09fd80b [mcheah] Should be >= not >. Slightly more consistent error message.
      60634f9 [mcheah] [SPARK-6405] Limiting the maximum Kryo buffer size to be 2GB.
      49d2ec63
  5. Mar 26, 2015
    • Brennon York's avatar
      [SPARK-6510][GraphX]: Add Graph#minus method to act as Set#difference · 39fb5796
      Brennon York authored
      Adds a `Graph#minus` method which will return only unique `VertexId`'s from the calling `VertexRDD`.
      
      To demonstrate a basic example with pseudocode:
      
      ```
      Set((0L,0),(1L,1)).minus(Set((1L,1),(2L,2)))
      > Set((0L,0))
      ```
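
      In GraphX terms, a hedged sketch of the same example (assuming the `minus(other: VertexRDD[VD])` shape this PR adds):

      ```scala
      import org.apache.spark.graphx._

      val a: VertexRDD[Int] = VertexRDD(sc.parallelize(Seq((0L, 0), (1L, 1))))
      val b: VertexRDD[Int] = VertexRDD(sc.parallelize(Seq((1L, 1), (2L, 2))))

      // Keep only the vertices whose ids appear in `a` but not in `b`.
      a.minus(b).collect()   // expected: Array((0L, 0))
      ```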
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #5175 from brennonyork/SPARK-6510 and squashes the following commits:
      
      248d5c8 [Brennon York] added minus(VertexRDD[VD]) method to avoid createUsingIndex and updated the mask operations to simplify with andNot call
      3fb7cce [Brennon York] updated graphx doc to reflect the addition of minus method
      6575d92 [Brennon York] updated mima exclude
      aaa030b [Brennon York] completed graph#minus functionality
      7227c0f [Brennon York] beginning work on minus functionality
      39fb5796
    • Michael Armbrust's avatar
      [DOCS][SQL] Fix JDBC example · aad00322
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #5192 from marmbrus/fixJDBCDocs and squashes the following commits:
      
      b48a33d [Michael Armbrust] [DOCS][SQL] Fix JDBC example
      aad00322
    • Cheng Lian's avatar
      [SPARK-6554] [SQL] Don't push down predicates which reference partition column(s) · 71a0d40e
      Cheng Lian authored
      There are two cases for the new Parquet data source:
      
      1. Partition columns exist in the Parquet data files
      
         We don't need to push down these predicates since partition pruning already handles them.
      
      2. Partition columns don't exist in the Parquet data files

         We can't push down these predicates since they are considered invalid columns by Parquet.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #5210 from liancheng/spark-6554 and squashes the following commits:
      
      4f7ec03 [Cheng Lian] Adds comments
      e134ced [Cheng Lian] Don't push down predicates which reference partition column(s)
      71a0d40e
    • Reynold Xin's avatar
      [SPARK-6117] [SQL] Improvements to DataFrame.describe() · 784fcd53
      Reynold Xin authored
      1. Slight modifications to the code to make it more readable.
      2. Added Python implementation.
      3. Updated the documentation to state that we don't guarantee the output schema for this function and it should only be used for exploratory data analysis.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5201 from rxin/df-describe and squashes the following commits:
      
      25a7834 [Reynold Xin] Reset run-tests.
      6abdfee [Reynold Xin] [SPARK-6117] [SQL] Improvements to DataFrame.describe()
      784fcd53
    • Sean Owen's avatar
      SPARK-6532 [BUILD] LDAModel.scala fails scalastyle on Windows · c3a52a08
      Sean Owen authored
      Use standard UTF-8 source / report encoding for scalastyle
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #5211 from srowen/SPARK-6532 and squashes the following commits:
      
      16a33e5 [Sean Owen] Use standard UTF-8 source / report encoding for scalastyle
      c3a52a08
    • Sean Owen's avatar
      SPARK-6480 [CORE] histogram() bucket function is wrong in some simple edge cases · fe15ea97
      Sean Owen authored
      Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #5148 from srowen/SPARK-6480 and squashes the following commits:
      
      974a0a0 [Sean Owen] Additional test of huge ranges, and a few more comments (and comment fixes)
      23ec01e [Sean Owen] Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly
      fe15ea97
    • Yuhao Yang's avatar
      [MLlib]remove unused import · 3ddb975f
      Yuhao Yang authored
      Minor thing. Let me know if a JIRA is required.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #5207 from hhbyyh/adjustImport and squashes the following commits:
      
      2240121 [Yuhao Yang] remove unused import
      3ddb975f
    • Yash Datta's avatar
      [SQL][SPARK-6471]: Metastore schema should only be a subset of parquet schema... · 1c05027a
      Yash Datta authored
      [SQL][SPARK-6471]: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns
      
      Currently, in the parquet relation 2 implementation, an error is thrown when the merged schema is not exactly the same as the metastore schema.
      But to support cases like deleting a column using the replace columns command, we can relax the restriction so that the query still works even when the metastore schema is only a subset of the merged Parquet schema.
      
      Author: Yash Datta <Yash.Datta@guavus.com>
      
      Closes #5141 from saucam/replace_col and squashes the following commits:
      
      e858d5b [Yash Datta] SPARK-6471: Fix test cases, add a new test case for metastore schema to be subset of parquet schema
      5f2f467 [Yash Datta] SPARK-6471: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns
      1c05027a
    • zsxwing's avatar
      [SPARK-6468][Block Manager] Fix the race condition of subDirs in DiskBlockManager · 0c88ce54
      zsxwing authored
      There are two race conditions of `subDirs` in `DiskBlockManager`:
      
      1. `getAllFiles` does not use correct locks to read the contents in `subDirs`. Although it's designed for testing, it's still worth adding correct locks to eliminate the race condition.
      2. The double-check has a race condition in `getFile(filename: String)`. If a thread finds `subDirs(dirId)(subDirId)` is not null out of the `synchronized` block, it may not be able to see the correct content of the File instance pointed by `subDirs(dirId)(subDirId)` according to the Java memory model (there is no volatile variable here).
      
      This PR fixed the above race conditions.
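
      For the second point, a self-contained sketch of the safer pattern (the names and path are simplified placeholders, not the actual DiskBlockManager code): perform both the check and the publication under the same lock so readers always observe a fully initialized `File`:

      ```scala
      import java.io.File

      class DirCache(numDirs: Int, numSubDirs: Int) {
        private val subDirs = Array.fill(numDirs)(new Array[File](numSubDirs))

        def getDir(dirId: Int, subDirId: Int): File =
          subDirs(dirId).synchronized {
            val old = subDirs(dirId)(subDirId)
            if (old != null) old
            else {
              val created = new File(s"/tmp/dir-$dirId-$subDirId") // placeholder path
              subDirs(dirId)(subDirId) = created
              created
            }
          }
      }
      ```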
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5136 from zsxwing/SPARK-6468 and squashes the following commits:
      
      cbb872b [zsxwing] Fix the race condition of subDirs in DiskBlockManager
      0c88ce54
    • Michael Armbrust's avatar
      [SPARK-6465][SQL] Fix serialization of GenericRowWithSchema using kryo · f88f51bb
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #5191 from marmbrus/kryoRowsWithSchema and squashes the following commits:
      
      bb83522 [Michael Armbrust] Fix serialization of GenericRowWithSchema using kryo
      f914f16 [Michael Armbrust] Add no arg constructor to GenericRowWithSchema
      f88f51bb
    • DoingDone9's avatar
      [SPARK-6546][Build] Using the wrong code that will make spark compile failed!! · 855cba8f
      DoingDone9 authored
      Wrong code: `val tmpDir = Files.createTempDir()`
      It should use `Utils`, not `Files`.
      
      Author: DoingDone9 <799203320@qq.com>
      
      Closes #5198 from DoingDone9/FilesBug and squashes the following commits:
      
      6e0140d [DoingDone9] Update InsertIntoHiveTableSuite.scala
      e57d23f [DoingDone9] Update InsertIntoHiveTableSuite.scala
      802261c [DoingDone9] Merge pull request #7 from apache/master
      d00303b [DoingDone9] Merge pull request #6 from apache/master
      98b134f [DoingDone9] Merge pull request #5 from apache/master
      161cae3 [DoingDone9] Merge pull request #4 from apache/master
      c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
      cb1852d [DoingDone9] Merge pull request #2 from apache/master
      c3f046f [DoingDone9] Merge pull request #1 from apache/master
      855cba8f
    • azagrebin's avatar
      [SPARK-6117] [SQL] add describe function to DataFrame for summary statis... · 5bbcd130
      azagrebin authored
      Please review my solution for SPARK-6117
      
      Author: azagrebin <azagrebin@gmail.com>
      
      Closes #5073 from azagrebin/SPARK-6117 and squashes the following commits:
      
      f9056ac [azagrebin] [SPARK-6117] [SQL] create one aggregation and split it locally into resulting DF, colocate test data with test case
      ddb3950 [azagrebin] [SPARK-6117] [SQL] simplify implementation, add test for DF without numeric columns
      9daf31e [azagrebin] [SPARK-6117] [SQL] add describe function to DataFrame for summary statistics
      5bbcd130
    • Davies Liu's avatar
      [SPARK-6536] [PySpark] Column.inSet() in Python · f5358029
      Davies Liu authored
      ```
      >>> df[df.name.inSet("Bob", "Mike")].collect()
      [Row(age=5, name=u'Bob')]
      >>> df[df.age.inSet([1, 2, 3])].collect()
      [Row(age=2, name=u'Alice')]
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5190 from davies/in and squashes the following commits:
      
      6b73a47 [Davies Liu] Column.inSet() in Python
      f5358029
  6. Mar 25, 2015
    • Michael Armbrust's avatar
      [SPARK-6463][SQL] AttributeSet.equal should compare size · 276ef1c3
      Michael Armbrust authored
      Previously this could result in sets comparing as equal when in fact the right-hand set was a subset of the left.
      
      Based on #5133 by sisihj
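
      A minimal, generic illustration of this bug class (not the actual AttributeSet code): an equality check that only verifies containment in one direction treats a strict subset as equal unless the sizes are also compared:

      ```scala
      def brokenEquals[A](left: Set[A], right: Set[A]): Boolean =
        right.forall(left.contains)                      // misses the size check

      def fixedEquals[A](left: Set[A], right: Set[A]): Boolean =
        left.size == right.size && right.forall(left.contains)

      brokenEquals(Set(1, 2, 3), Set(1, 2))   // true  -- incorrectly "equal"
      fixedEquals(Set(1, 2, 3), Set(1, 2))    // false -- correct
      ```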
      
      Author: sisihj <jun.hejun@huawei.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #5194 from marmbrus/pr/5133 and squashes the following commits:
      
      5ed4615 [Michael Armbrust] fix imports
      d4cbbc0 [Michael Armbrust] Add test cases
      0a0834f [sisihj]  AttributeSet.equal should compare size
      276ef1c3
    • KaiXinXiaoLei's avatar
      The UT test of spark is failed. Because there is a test in SQLQuerySuite about... · e87bf371
      KaiXinXiaoLei authored
      The Spark unit tests fail because there is a test in SQLQuerySuite that creates a table named "test".
      
      If the tests in "sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala" run before CachedTableSuite.scala, the test("Drop cached table") will fail, because the table "test" is created in SQLQuerySuite.scala and never dropped. So when "Drop cached table" runs, the table "test" already exists.
      
      The error info is:
      01:18:35.738 ERROR hive.ql.exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: AlreadyExistsException(message:Table test already exists)
      at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:616)
      at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4189)
      at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:281)
      at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
      at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
      at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503)
      at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270)
      at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
      at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
      at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
      
      And the test that creates the table "test" in "sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala" is:
      
        test("SPARK-4825 save join to table") {
          val testData = sparkContext.parallelize(1 to 10).map(i => TestData(i, i.toString)).toDF()
          sql("CREATE TABLE test1 (key INT, value STRING)")
          testData.insertInto("test1")
          sql("CREATE TABLE test2 (key INT, value STRING)")
          testData.insertInto("test2")
          testData.insertInto("test2")
          sql("CREATE TABLE test AS SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key =   b.key")
          checkAnswer(
            table("test"),
            sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").collect().toSeq)
        }
      
      Author: KaiXinXiaoLei <huleilei1@huawei.com>
      
      Closes #5150 from KaiXinXiaoLei/testFailed and squashes the following commits:
      
      7534b02 [KaiXinXiaoLei] The UT test of spark is failed.
      e87bf371