  1. Aug 04, 2015
    • [SPARK-6485] [MLLIB] [PYTHON] Add CoordinateMatrix/RowMatrix/IndexedRowMatrix to PySpark. · 571d5b53
      Mike Dusenberry authored
      This PR adds the RowMatrix, IndexedRowMatrix, and CoordinateMatrix distributed matrices to PySpark.  Each distributed matrix class acts as a wrapper around the Scala/Java counterpart by maintaining a reference to the Java object.  New distributed matrices can be created using factory methods added to DistributedMatrices, which create the Java distributed matrix and then wrap it with the corresponding PySpark class.  This design allows for simple conversion between the various distributed matrices, and lets us re-use the Scala code.  For simplicity, serialization between Python and Java is implemented using DataFrames as needed for IndexedRowMatrix and CoordinateMatrix.  Associated documentation and unit tests have also been added.  To facilitate code review, this PR implements only access to the rows/entries as RDDs, the number of rows & columns, and conversions between the various distributed matrices (not including BlockMatrix).  It does not implement the other linear algebra functions of the matrices, although these will be very simple to add now.
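      As a rough illustration of the conversions mentioned above, the following pure-Python sketch regroups coordinate entries into indexed rows, which is conceptually what converting a CoordinateMatrix to an IndexedRowMatrix involves. Plain lists stand in for RDDs, and the helper names `entries_to_indexed_rows` and `indexed_rows_to_entries` are hypothetical; they are not part of the PySpark API added by this PR.

```python
from collections import defaultdict


def entries_to_indexed_rows(entries, num_cols=None):
    """Group (row, col, value) entries into sorted (row_index, dense_row) pairs."""
    entries = list(entries)
    if num_cols is None:
        # Infer the column count from the largest column index seen.
        num_cols = max((j for _, j, _ in entries), default=-1) + 1
    rows = defaultdict(lambda: [0.0] * num_cols)
    for i, j, value in entries:
        rows[i][j] = float(value)
    return sorted(rows.items())


def indexed_rows_to_entries(indexed_rows):
    """Flatten (row_index, row) pairs back into non-zero (row, col, value) entries."""
    return [(i, j, v)
            for i, row in indexed_rows
            for j, v in enumerate(row)
            if v != 0.0]


# Hypothetical sample data: three non-zero entries of a sparse matrix.
entries = [(0, 0, 1.2), (1, 0, 2.1), (6, 1, 3.7)]
indexed_rows = entries_to_indexed_rows(entries)
```

      In the PR itself the conversion runs on the Scala/Java side and the Python classes only wrap the resulting Java object; the sketch shows the data reshaping, not that delegation.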
      
      Author: Mike Dusenberry <mwdusenb@us.ibm.com>
      
      Closes #7554 from dusenberrymw/SPARK-6485_Add_CoordinateMatrix_RowMatrix_IndexedMatrix_to_PySpark and squashes the following commits:
      
      bb039cb [Mike Dusenberry] Minor documentation update.
      b887c18 [Mike Dusenberry] Updating the matrix conversion logic again to make it even cleaner.  Now, we allow the 'rows' parameter in the constructors to be either an RDD or the Java matrix object. If 'rows' is an RDD, we create a Java matrix object, wrap it, and then store that.  If 'rows' is a Java matrix object of the correct type, we just wrap and store that directly.  This is only for internal usage, and publicly, we still require 'rows' to be an RDD.  We no longer store the 'rows' RDD, and instead just compute it from the Java object when needed.  The point of this is that when we do matrix conversions, we do the conversion on the Scala/Java side, which returns a Java object, so we should use that directly, but exposing 'java_matrix' parameter in the public API is not ideal. This non-public feature of allowing 'rows' to be a Java matrix object is documented in the '__init__' constructor docstrings, which are not part of the generated public API, and doctests are also included.
      7f0dcb6 [Mike Dusenberry] Updating module docstring.
      cfc1be5 [Mike Dusenberry] Use 'new SQLContext(matrix.rows.sparkContext)' rather than 'SQLContext.getOrCreate', as the latter doesn't guarantee that the SparkContext will be the same as for the matrix.rows data.
      687e345 [Mike Dusenberry] Improving conversion performance.  This adds an optional 'java_matrix' parameter to the constructors, and pulls the conversion logic out into a '_create_from_java' function. Now, if the constructors are given a valid Java distributed matrix object as 'java_matrix', they will store those internally, rather than create a new one on the Scala/Java side.
      3e50b6e [Mike Dusenberry] Moving the distributed matrices to pyspark.mllib.linalg.distributed.
      308f197 [Mike Dusenberry] Using properties for better documentation.
      1633f86 [Mike Dusenberry] Minor documentation cleanup.
      f0c13a7 [Mike Dusenberry] CoordinateMatrix should inherit from DistributedMatrix.
      ffdd724 [Mike Dusenberry] Updating doctests to make documentation cleaner.
      3fd4016 [Mike Dusenberry] Updating docstrings.
      27cd5f6 [Mike Dusenberry] Simplifying input conversions in the constructors for each distributed matrix.
      a409cf5 [Mike Dusenberry] Updating doctests to be less verbose by using lists instead of DenseVectors explicitly.
      d19b0ba [Mike Dusenberry] Updating code and documentation to note that a vector-like object (numpy array, list, etc.) can be used in place of explicit Vector object, and adding conversions when necessary to RowMatrix construction.
      4bd756d [Mike Dusenberry] Adding param documentation to IndexedRow and MatrixEntry.
      c6bded5 [Mike Dusenberry] Move conversion logic from tuples to IndexedRow or MatrixEntry types from within the IndexedRowMatrix and CoordinateMatrix constructors to separate _convert_to_indexed_row and _convert_to_matrix_entry functions.
      329638b [Mike Dusenberry] Moving the Experimental tag to the top of each docstring.
      0be6826 [Mike Dusenberry] Simplifying doctests by removing duplicated rows/entries RDDs within the various tests.
      c0900df [Mike Dusenberry] Adding the colons that were accidentally not inserted.
      4ad6819 [Mike Dusenberry] Documenting the  and  parameters.
      3b854b9 [Mike Dusenberry] Minor updates to documentation.
      10046e8 [Mike Dusenberry] Updating documentation to use class constructors instead of the removed DistributedMatrices factory methods.
      119018d [Mike Dusenberry] Adding static  methods to each of the distributed matrix classes to consolidate conversion logic.
      4d7af86 [Mike Dusenberry] Adding type checks to the constructors.  Although it is slightly verbose, it is better for the user to have a good error message than a cryptic stacktrace.
      93b6a3d [Mike Dusenberry] Pulling the DistributedMatrices Python class out of this pull request.
      f6f3c68 [Mike Dusenberry] Pulling the DistributedMatrices Scala class out of this pull request.
      6a3ecb7 [Mike Dusenberry] Updating pattern matching.
      08f287b [Mike Dusenberry] Slight reformatting of the documentation.
      a245dc0 [Mike Dusenberry] Updating Python doctests for compatibility between Python 2 & 3. Since Python 3 removed the idea of a separate 'long' type, all values that would have been output as a 'long' (ex: '4L') will now be treated as an 'int' and output as one (ex: '4').  The doctests now explicitly convert to ints so that both Python 2 and 3 will have the same output.  This is fine since the values are all small, and thus can be easily represented as ints.
      4d3a37e [Mike Dusenberry] Reformatting a few long Python doctest lines.
      7e3ca16 [Mike Dusenberry] Fixing long lines.
      f721ead [Mike Dusenberry] Updating documentation for each of the distributed matrices.
      ab0e8b6 [Mike Dusenberry] Updating unit test to be more useful.
      dda2f89 [Mike Dusenberry] Added wrappers for the conversions between the various distributed matrices.  Added logic to be able to access the rows/entries of the distributed matrices, which requires serialization through DataFrames for IndexedRowMatrix and CoordinateMatrix types. Added unit tests.
      0cd7166 [Mike Dusenberry] Implemented the CoordinateMatrix API in PySpark, following the idea of the IndexedRowMatrix API, including using DataFrames for serialization.
      3c369cb [Mike Dusenberry] Updating the architecture a bit to make conversions between the various distributed matrix types easier.  The different distributed matrix classes are now only wrappers around the Java objects, and take the Java object as an argument during construction.  This way, we can call  for example on an , which returns a reference to a Java RowMatrix object, and then construct a PySpark RowMatrix object wrapped around the Java object.  This is analogous to the behavior of PySpark RDDs and DataFrames.  We now delegate creation of the various distributed matrices from scratch in PySpark to the factory methods on .
      4bdd09b [Mike Dusenberry] Implemented the IndexedRowMatrix API in PySpark, following the idea of the RowMatrix API.  Note that for the IndexedRowMatrix, we use DataFrames to serialize the data between Python and Scala/Java, so we accept PySpark RDDs, then convert to a DataFrame, then convert back to RDDs on the Scala/Java side before constructing the IndexedRowMatrix.
      23bf1ec [Mike Dusenberry] Updating documentation to add PySpark RowMatrix. Inserting newline above doctest so that it renders properly in API docs.
      b194623 [Mike Dusenberry] Updating design to have a PySpark RowMatrix simply create and keep a reference to a wrapper over a Java RowMatrix.  Updating DistributedMatrices factory methods to accept numRows and numCols with default values.  Updating PySpark DistributedMatrices factory method to simply create a PySpark RowMatrix. Adding additional doctests for numRows and numCols parameters.
      bc2d220 [Mike Dusenberry] Adding unit tests for RowMatrix methods.
      d7e316f [Mike Dusenberry] Implemented the RowMatrix API in PySpark by doing the following: Added a DistributedMatrices class to contain factory methods for creating the various distributed matrices.  Added a factory method for creating a RowMatrix from an RDD of Vectors.  Added a createRowMatrix function to the PythonMLlibAPI to interface with the factory method.  Added DistributedMatrix, DistributedMatrices, and RowMatrix classes to the pyspark.mllib.linalg api.
    • [SPARK-9582] [ML] LDA cleanups · 1833d9c0
      Joseph K. Bradley authored
      Small cleanups to recent LDA additions and docs.
      
      CC: feynmanliang
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #7916 from jkbradley/lda-cleanups and squashes the following commits:
      
      f7021d9 [Joseph K. Bradley] broadcasting large matrices for LDA in local model and online learning
      97947aa [Joseph K. Bradley] a few more cleanups
      5b03f88 [Joseph K. Bradley] reverted split of lda log likelihood
      c566915 [Joseph K. Bradley] small edit to make review easier
      63f6c7d [Joseph K. Bradley] clarified log likelihood for lda models
    • [SPARK-9447] [ML] [PYTHON] Added HasRawPredictionCol, HasProbabilityCol to RandomForestClassifier · e3754560
      Joseph K. Bradley authored
      Added HasRawPredictionCol, HasProbabilityCol to RandomForestClassifier, plus doc tests for those columns.
      
      CC: holdenk yanboliang
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #7903 from jkbradley/rf-prob-python and squashes the following commits:
      
      c62a83f [Joseph K. Bradley] made unit test more robust
      14eeba2 [Joseph K. Bradley] added HasRawPredictionCol, HasProbabilityCol to RandomForestClassifier in PySpark
    • [SPARK-9602] remove "Akka/Actor" words from comments · 9d668b73
      CodingCat authored
      https://issues.apache.org/jira/browse/SPARK-9602
      
      Although we have hidden Akka behind the RPC interface, I found that Akka/Actor-related comments are still scattered throughout the code. For consistency, we should remove the "actor"/"akka" words from the comments...
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #7936 from CodingCat/SPARK-9602 and squashes the following commits:
      
      e8296a3 [CodingCat] remove actor words from comments
    • [SPARK-9452] [SQL] Support records larger than page size in UnsafeExternalSorter · ab8ee1a3
      Josh Rosen authored
      This patch extends UnsafeExternalSorter to support records larger than the page size. The basic strategy is the same as in #7762: store large records in their own overflow pages.
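      The overflow-page strategy described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the sorter's actual memory manager: the class `PagedRecordStore` and its methods are hypothetical, and records are plain byte strings.

```python
class PagedRecordStore:
    """Toy allocator: fixed-size pages plus dedicated pages for oversized records."""

    def __init__(self, page_size):
        self.page_size = page_size
        self.pages = []          # each page is a bytearray
        self.page_capacity = []  # capacity of each page, in bytes
        self._current = None     # index of the regular page being filled

    def _new_page(self, capacity):
        self.pages.append(bytearray())
        self.page_capacity.append(capacity)
        return len(self.pages) - 1

    def insert(self, record):
        """Store a record and return its (page_index, offset) address."""
        if len(record) > self.page_size:
            # Oversized record: give it a dedicated overflow page of exactly its size.
            page = self._new_page(len(record))
        else:
            cur = self._current
            if cur is None or len(self.pages[cur]) + len(record) > self.page_capacity[cur]:
                # The current regular page is missing or full: start a new one.
                self._current = self._new_page(self.page_size)
            page = self._current
        offset = len(self.pages[page])
        self.pages[page] += record
        return page, offset


store = PagedRecordStore(page_size=8)
small_a = store.insert(b"abcd")
large = store.insert(b"x" * 20)   # larger than a page
small_b = store.insert(b"efgh")   # still fits in the first regular page
small_c = store.insert(b"z")      # first page is full, so a new page is opened
```

      Keeping the oversized record on its own page means the regular page that was being filled stays usable for later small records.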
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7891 from JoshRosen/large-records-in-sql-sorter and squashes the following commits:
      
      967580b [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-records-in-sql-sorter
      948c344 [Josh Rosen] Add large records tests for KV sorter.
      3c17288 [Josh Rosen] Combine memory and disk cleanup into general cleanupResources() method
      380f217 [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-records-in-sql-sorter
      27eafa0 [Josh Rosen] Fix page size in PackedRecordPointerSuite
      a49baef [Josh Rosen] Address initial round of review comments
      3edb931 [Josh Rosen] Remove accidentally-committed debug statements.
      2b164e2 [Josh Rosen] Support large records in UnsafeExternalSorter.
    • [SPARK-9553][SQL] remove the no-longer-necessary createCode and createStructCode, and replace the usage · f4b1ac08
      Wenchen Fan authored
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7890 from cloud-fan/minor and squashes the following commits:
      
      c3b1be3 [Wenchen Fan] fix style
      b0cbe2e [Wenchen Fan] remove the createCode and createStructCode, and replace the usage of them by createStructCode
    • [SPARK-9606] [SQL] Ignore flaky thrift server tests · a0cc0175
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #7939 from marmbrus/turnOffThriftTests and squashes the following commits:
      
      80d618e [Michael Armbrust] [SPARK-9606][SQL] Ignore flaky thrift server tests
    • [SPARK-8069] [ML] Add multiclass thresholds for ProbabilisticClassifier · 5a23213c
      Holden Karau authored
      This PR replaces the old "threshold" with a generalized "thresholds" Param.  We keep getThreshold and setThreshold for backwards compatibility for binary classification.
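      To make the generalization concrete, here is a minimal Python sketch of per-class thresholding. It assumes the prediction rule is "pick the class with the largest p / t" and that a binary threshold t corresponds to the pair [1 - t, t]; neither detail is stated in the description above, and both function names are hypothetical.

```python
def predict_with_thresholds(probabilities, thresholds):
    """Return the index of the class with the largest p / t ratio.

    Assumes every threshold is strictly positive.
    """
    if len(probabilities) != len(thresholds):
        raise ValueError("need exactly one threshold per class")
    scaled = [p / t for p, t in zip(probabilities, thresholds)]
    return max(range(len(scaled)), key=scaled.__getitem__)


def binary_threshold_to_thresholds(threshold):
    """Express a single binary threshold t as the pair [1 - t, t] (assumed mapping)."""
    return [1.0 - threshold, threshold]


probabilities = [0.6, 0.4]
default_prediction = predict_with_thresholds(probabilities, [0.5, 0.5])
lowered_prediction = predict_with_thresholds(
    probabilities, binary_threshold_to_thresholds(0.3))
```

      Under these assumptions the binary case reduces to the familiar rule: class 1 wins exactly when its probability exceeds t.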
      
      Note that the primary author of this PR is holdenk
      
      Author: Holden Karau <holden@pigscanfly.ca>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #7909 from jkbradley/holdenk-SPARK-8069-add-cutoff-aka-threshold-to-random-forest and squashes the following commits:
      
      3952977 [Joseph K. Bradley] fixed pyspark doc test
      85febc8 [Joseph K. Bradley] made python unit tests a little more robust
      7eb1d86 [Joseph K. Bradley] small cleanups
      6cc2ed8 [Joseph K. Bradley] Fixed remaining merge issues.
      0255e44 [Joseph K. Bradley] Many cleanups for thresholds, some more tests
      7565a60 [Holden Karau] fix pep8 style checks, add a getThreshold method similar to our LogisticRegression.scala one for API compat
      be87f26 [Holden Karau] Convert threshold to thresholds in the python code, add specialized support for Array[Double] to shared params codegen, etc.
      6747dad [Holden Karau] Override raw2prediction for ProbabilisticClassifier, fix some tests
      25df168 [Holden Karau] Fix handling of thresholds in LogisticRegression
      c02d6c0 [Holden Karau] No default for thresholds
      5e43628 [Holden Karau] CR feedback and fixed the renamed test
      f3fbbd1 [Holden Karau] revert the changes to random forest :(
      51f581c [Holden Karau] Add explicit types to public methods, fix long line
      f7032eb [Holden Karau] Fix a java test bug, remove some unnecessary changes
      adf15b4 [Holden Karau] rename the classifier suite test to ProbabilisticClassifierSuite now that we only have it in Probabilistic
      398078a [Holden Karau] move the thresholding around a bunch based on the design doc
      4893bdc [Holden Karau] Use numtrees of 3 since previous result was tied (one tree for each) and the switch from different max methods picked a different element (since they were equal I think this is ok)
      638854c [Holden Karau] Add a scala RandomForestClassifierSuite test based on corresponding python test
      e09919c [Holden Karau] Fix return type, I need more coffee....
      8d92cac [Holden Karau] Use ClassifierParams as the head
      3456ed3 [Holden Karau] Add explicit return types even though just test
      a0f3b0c [Holden Karau] scala style fixes
      6f14314 [Holden Karau] Since hasthreshold/hasthresholds is in root classifier now
      ffc8dab [Holden Karau] Update the sharedParams
      0420290 [Holden Karau] Allow us to override the get methods selectively
      978e77a [Holden Karau] Move HasThreshold into classifier params and start defining the overloaded getThreshold/getThresholds functions
      1433e52 [Holden Karau] Revert "try and hide threshold but changes the API so no dice there"
      1f09a2e [Holden Karau] try and hide threshold but changes the API so no dice there
      efb9084 [Holden Karau] move setThresholds only to where it's used
      6b34809 [Holden Karau] Add a test with thresholding for the RFCS
      74f54c3 [Holden Karau] Fix creation of vote array
      1986fa8 [Holden Karau] Setting the thresholds only makes sense if the underlying class hasn't overridden predict, so let's push it down.
      2f44b18 [Holden Karau] Add a global default of null for thresholds param
      f338cfc [Holden Karau] Wait that wasn't a good idea, Revert "Some progress towards unifying threshold and thresholds"
      634b06f [Holden Karau] Some progress towards unifying threshold and thresholds
      85c9e01 [Holden Karau] Test passes again... little fnur
      099c0f3 [Holden Karau] Move thresholds around some more (set on model not trainer)
      0f46836 [Holden Karau] Start adding a classifiersuite
      f70eb5e [Holden Karau] Fix test compile issues
      a7d59c8 [Holden Karau] Move thresholding into Classifier trait
      5d999d2 [Holden Karau] Some more progress, start adding a test (maybe try and see if we can find a better thing to use for the base of the test)
      1fed644 [Holden Karau] Use thresholds to scale scores in random forest classification
      31d6bf2 [Holden Karau] Start threading the threshold info through
      0ef228c [Holden Karau] Add hasthresholds
    • [SPARK-9512][SQL] Revert SPARK-9251, Allow evaluation while sorting · 34a0eb2e
      Michael Armbrust authored
      The analysis rule has a bug, and we ended up making the sorter still capable of doing evaluation, so let's revert this for now.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #7906 from marmbrus/revertSortProjection and squashes the following commits:
      
      2da6972 [Michael Armbrust] unrevert unrelated changes
      4f2b00c [Michael Armbrust] Revert "[SPARK-9251][SQL] do not order by expressions which still need evaluation"
    • [SPARK-9562] Change reference to amplab/spark-ec2 from mesos/ · 6a0f8b99
      Shivaram Venkataraman authored
      cc srowen pwendell nchammas
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #7899 from shivaram/spark-ec2-move and squashes the following commits:
      
      7cc22c9 [Shivaram Venkataraman] Change reference to amplab/spark-ec2 from mesos/
    • [SPARK-9541] [SQL] DataTimeUtils cleanup · b5034c9c
      Yijie Shen authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9541
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #7870 from yjshen/datetime_cleanup and squashes the following commits:
      
      9203e33 [Yijie Shen] revert getMonth & getDayOfMonth
      5cad119 [Yijie Shen] rebase code
      7d62a74 [Yijie Shen] remove tmp tuple inside split date
      e98aaac [Yijie Shen] DataTimeUtils cleanup
    • [SPARK-8246] [SQL] Implement get_json_object · 73dedb58
      Davies Liu authored
      This is based on #7485, thanks to NathanHowell.
      
      Tests were copied from Hive, but do not seem to be super comprehensive. I've generally replicated Hive's unusual behavior rather than following a JSONPath reference, except for one case (as noted in the comments). I don't know if there is a way of fully replicating Hive's behavior without a slower TreeNode implementation, so I've erred on the side of performance instead.
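      For readers unfamiliar with the function, the sketch below shows the general idea of get_json_object: evaluate a simple path against a JSON string and return the match as a string, or null when nothing matches. It supports only `$`, `.field`, and `[index]` steps, makes no attempt to replicate the Hive-specific behavior discussed above, and is not the implementation from this PR; the regular expression and return conventions are assumptions made for illustration.

```python
import json
import re

# One path step: either ".name" or "[index]".
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def get_json_object(json_str, path):
    """Evaluate a tiny JSONPath subset; return a string, or None if nothing matches."""
    if json_str is None or path is None or not path.startswith("$"):
        return None
    try:
        node = json.loads(json_str)
    except ValueError:
        return None  # malformed JSON yields null rather than an error
    pos = 1
    while pos < len(path):
        match = _STEP.match(path, pos)
        if match is None:
            return None  # unsupported path syntax
        key, index = match.groups()
        if key is not None:
            if not isinstance(node, dict) or key not in node:
                return None
            node = node[key]
        else:
            i = int(index)
            if not isinstance(node, list) or i >= len(node):
                return None
            node = node[i]
        pos = match.end()
    if node is None:
        return None
    # Strings come back unquoted; everything else is re-serialized compactly.
    return node if isinstance(node, str) else json.dumps(node, separators=(",", ":"))


doc = '{"store": {"fruit": [{"type": "apple"}, {"type": "pear"}], "open": true}}'
second_type = get_json_object(doc, "$.store.fruit[1].type")
```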
      
      Author: Davies Liu <davies@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      Author: Nathan Howell <nhowell@godaddy.com>
      
      Closes #7901 from davies/get_json_object and squashes the following commits:
      
      3ace9b9 [Davies Liu] Merge branch 'get_json_object' of github.com:davies/spark into get_json_object
      98766fc [Davies Liu] Merge branch 'master' of github.com:apache/spark into get_json_object
      a7dc6d0 [Davies Liu] Update JsonExpressionsSuite.scala
      c818519 [Yin Huai] new results.
      18ce26b [Davies Liu] fix tests
      6ac29fb [Yin Huai] Golden files.
      25eebef [Davies Liu] use HiveQuerySuite
      e0ac6ec [Yin Huai] Golden answer files.
      940c060 [Davies Liu] tweak code style
      44084c5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into get_json_object
      9192d09 [Nathan Howell] Match Hive’s behavior for unwrapping arrays of one element
      8dab647 [Nathan Howell] [SPARK-8246] [SQL] Implement get_json_object
    • [SPARK-8244] [SQL] string function: find in set · b1f88a38
      Tarek Auel authored
      This PR is based on #7186 (just fix the conflict), thanks to tarekauel.
      
      find_in_set(string str, string strList): int
      
      Returns the 1-based position of the first occurrence of str in strList, where strList is a comma-delimited string. Returns null if either argument is null. Returns 0 if the first argument contains any commas. For example, find_in_set('ab', 'abc,b,ab,c,def') returns 3.
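      The semantics described above fit in a short Python reference sketch, with None standing in for SQL null. Returning 0 when str is not found is an assumption taken from Hive's behavior, since the description does not spell out that case; the PR's own implementation works on UTF-8 bytes rather than Python strings.

```python
def find_in_set(s, str_list):
    """Return the 1-based position of s in the comma-delimited str_list."""
    if s is None or str_list is None:
        return None  # null in, null out
    if "," in s:
        return 0     # a needle containing a comma can never match
    for position, item in enumerate(str_list.split(","), start=1):
        if item == s:
            return position
    return 0         # assumption: not found yields 0, as in Hive


example = find_in_set("ab", "abc,b,ab,c,def")
```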
      
      Only add this to SQL, not DataFrame.
      
      Closes #7186
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7900 from davies/find_in_set and squashes the following commits:
      
      4334209 [Davies Liu] Merge branch 'master' of github.com:apache/spark into find_in_set
      8f00572 [Davies Liu] Merge branch 'master' of github.com:apache/spark into find_in_set
      243ede4 [Tarek Auel] [SPARK-8244][SQL] hive compatibility
      1aaf64e [Tarek Auel] [SPARK-8244][SQL] unit test fix
      e4093a4 [Tarek Auel] [SPARK-8244][SQL] final modifier for COMMA_UTF8
      0d05df5 [Tarek Auel] Merge branch 'master' into SPARK-8244
      208d710 [Tarek Auel] [SPARK-8244] address comments & bug fix
      71b2e69 [Tarek Auel] [SPARK-8244] find_in_set
      66c7fda [Tarek Auel] Merge branch 'master' into SPARK-8244
      61b8ca2 [Tarek Auel] [SPARK-8224] removed loop and split; use unsafe String comparison
      4f75a65 [Tarek Auel] Merge branch 'master' into SPARK-8244
      e3b20c8 [Tarek Auel] [SPARK-8244] added type check
      1c2bbb7 [Tarek Auel] [SPARK-8244] findInSet
    • [SPARK-9583] [BUILD] Do not print mvn debug messages to stdout. · d702d537
      Marcelo Vanzin authored
      This allows build/mvn to be used by make-distribution.sh.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #7915 from vanzin/SPARK-9583 and squashes the following commits:
      
      6469e60 [Marcelo Vanzin] [SPARK-9583] [build] Do not print mvn debug messages to stdout.
    • [SPARK-2016] [WEBUI] RDD partition table pagination for the RDD Page · cb7fa0aa
      Carson Wang authored
      Add pagination for the RDD page to avoid an unresponsive UI when the number of RDD partitions is large.
      Before:
      ![rddpagebefore](https://cloud.githubusercontent.com/assets/9278199/8951533/3d9add54-3601-11e5-99d0-5653b473c49b.png)
      After:
      ![rddpageafter](https://cloud.githubusercontent.com/assets/9278199/8951536/439d66e0-3601-11e5-9cee-1b380fe6620d.png)
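      The core of table pagination is just slicing the row list by page. The sketch below is a generic illustration, not the web UI code from this PR; the function name `paginate` and its 1-based page numbering are assumptions.

```python
import math


def paginate(rows, page, page_size):
    """Return (rows_on_page, total_pages) for a 1-based page number."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # An empty table still has one (empty) page.
    total_pages = max(1, math.ceil(len(rows) / page_size))
    if not 1 <= page <= total_pages:
        raise IndexError("page %d is out of range 1..%d" % (page, total_pages))
    start = (page - 1) * page_size
    return rows[start:start + page_size], total_pages


partitions = list(range(250))  # stand-in for 250 RDD partition rows
last_page_rows, total_pages = paginate(partitions, page=3, page_size=100)
```

      Only the rows of the requested page are rendered, so the page size bounds the amount of HTML the browser has to handle.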
      
      Author: Carson Wang <carson.wang@intel.com>
      
      Closes #7692 from carsonwang/SPARK-2016 and squashes the following commits:
      
      03c7168 [Carson Wang] Fix style issues
      612c18c [Carson Wang] RDD partition table pagination for the RDD Page
    • [SPARK-8064] [BUILD] Follow-up. Undo change from SPARK-9507 that was accidentally reverted · b211cbc7
      tedyu authored
      This PR removes the dependency reduced POM hack brought back by #7191
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #7919 from tedyu/master and squashes the following commits:
      
      1bfbd7b [tedyu] [BUILD] Remove dependency reduced POM hack
    • [SPARK-9534] [BUILD] Enable javac lint for scalac parity; fix a lot of build warnings, 1.5.0 edition · 76d74090
      Sean Owen authored
      
      Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.
      
      I'll explain several of the changes inline in comments.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #7862 from srowen/SPARK-9534 and squashes the following commits:
      
      ea51618 [Sean Owen] Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.
    • [SPARK-3190] [GRAPHX] Fix VertexRDD.count() overflow regression · 9e952ecb
      Ankur Dave authored
      SPARK-3190 was originally fixed by 96df9290, but a5ef5811 introduced a regression during refactoring. This commit fixes the regression.
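      The kind of overflow at issue can be reproduced in miniature. The sketch below assumes the regression came from summing per-partition sizes as 32-bit integers instead of widening them to 64-bit first; the commit message does not say so explicitly, and the helper names are hypothetical. Python integers do not overflow, so `to_int32` emulates JVM Int wraparound.

```python
def to_int32(value):
    """Wrap an integer to a signed 32-bit value, as JVM Int arithmetic would."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def count_with_int_sum(partition_sizes):
    """Sum partition sizes with 32-bit wraparound (the buggy pattern)."""
    total = 0
    for size in partition_sizes:
        total = to_int32(total + size)
    return total


def count_with_long_sum(partition_sizes):
    """Sum partition sizes without wraparound (stands in for widening to long)."""
    return sum(partition_sizes)


# Two partitions whose sizes each fit in an Int but whose sum does not.
partition_sizes = [1500000000, 1500000000]
wrapped = count_with_int_sum(partition_sizes)
correct = count_with_long_sum(partition_sizes)
```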
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #7923 from ankurdave/SPARK-3190-reopening and squashes the following commits:
      
      a3e1b23 [Ankur Dave] Fix VertexRDD.count() overflow regression
  2. Aug 03, 2015
    • [SPARK-9521] [DOCS] Addendum. Require Maven 3.3.3+ in the build · 0afa6fbf
      Sean Owen authored
      Follow on for #7852: Building Spark doc needs to refer to new Maven requirement too
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #7905 from srowen/SPARK-9521.2 and squashes the following commits:
      
      73285df [Sean Owen] Follow on for #7852: Building Spark doc needs to refer to new Maven requirement too
    • [SPARK-9577][SQL] Surface concrete iterator types in various sort classes. · 5eb89f67
      Reynold Xin authored
      We often return abstract iterator types in various sort-related classes (e.g. UnsafeKVExternalSorter). It is actually better to return a more concrete type, so the callsite uses that type and JIT can inline the iterator calls.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7911 from rxin/surface-concrete-type and squashes the following commits:
      
      0422add [Reynold Xin] [SPARK-9577][SQL] Surface concrete iterator types in various sort classes.
    • [SPARK-8416] highlight and topping the executor threads in thread dumping page · 3b0e4449
      CodingCat authored
      https://issues.apache.org/jira/browse/SPARK-8416
      
      To facilitate debugging, I made this patch with three changes:
      
      * render the executor-thread and non executor-thread entries with different background colors
      
      * put the executor threads on the top of the list
      
      * sort the threads alphabetically
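      The ordering described by the last two bullets can be expressed as a single sort key. This is a sketch only: the prefix "Executor task launch" used to recognize executor threads is an assumption, the function name is hypothetical, and the real change lives in the Scala web UI code.

```python
def sort_threads_for_dump(thread_names, executor_prefix="Executor task launch"):
    """Put executor threads first, then sort each group case-insensitively by name."""
    def sort_key(name):
        is_executor = name.startswith(executor_prefix)
        # False sorts before True, so negate via 0/1 to put executor threads on top.
        return (0 if is_executor else 1, name.lower())
    return sorted(thread_names, key=sort_key)


threads = [
    "main",
    "Executor task launch worker-1",
    "dispatcher",
    "Executor task launch worker-0",
]
ordered = sort_threads_for_dump(threads)
```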
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #7808 from CodingCat/SPARK-8416 and squashes the following commits:
      
      34fc708 [CodingCat] fix className
      d7b79dd [CodingCat] lowercase threadName
      d032882 [CodingCat] sort alphabetically and change the css class name
      f0513b1 [CodingCat] change the color & group threads by name
      2da6e06 [CodingCat] small fix
      3fc9f36 [CodingCat] define classes in webui.css
      8ee125e [CodingCat] highlight and put on top the executor threads in thread dumping page
    • [SPARK-9263] Added flags to exclude dependencies when using --packages · 1633d0a2
      Burak Yavuz authored
      While the functionality is there to exclude packages, there are no flags that allow users to exclude dependencies, in case of dependency conflicts. We should provide users with a flag to add dependency exclusions in case the packages are not resolved properly (or not available due to licensing).
      
      The flag I added was --packages-exclude, but I'm open to renaming it. I also added property flags in case people would like to use a conf file to provide dependencies, which is possible if there is a long list of dependencies or exclusions.
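      Conceptually, an exclusion flag filters the resolved dependency list by groupId:artifactId. The sketch below illustrates that filtering in Python; the coordinate formats (`group:artifact:version` for dependencies, comma-separated `group:artifact` for exclusions) and the function name are assumptions, and the actual resolution in this PR is handled by the Scala submit code.

```python
def apply_exclusions(resolved, exclusions):
    """Drop resolved 'group:artifact:version' coordinates named in the exclusion list.

    exclusions is a comma-separated string of 'group:artifact' pairs.
    """
    excluded = {item.strip() for item in exclusions.split(",") if item.strip()}
    kept = []
    for coordinate in resolved:
        group, artifact, _version = coordinate.split(":")
        if group + ":" + artifact not in excluded:
            kept.append(coordinate)
    return kept


# Hypothetical resolved dependency list.
resolved = [
    "com.example:core:1.0",
    "org.slf4j:slf4j-api:1.7.10",
    "com.example:extras:1.0",
]
remaining = apply_exclusions(resolved, "org.slf4j:slf4j-api, com.example:extras")
```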
      
      cc andrewor14 vanzin pwendell
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #7599 from brkyvz/packages-exclusions and squashes the following commits:
      
      636f410 [Burak Yavuz] addressed nits
      6e54ede [Burak Yavuz] is this the culprit
      b5e508e [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into packages-exclusions
      154f5db [Burak Yavuz] addressed initial comments
      1536d7a [Burak Yavuz] Added flags to exclude packages using --packages-exclude
    • [SPARK-9483] Fix UTF8String.getPrefix for big-endian. · b79b4f5f
      Matthew Brandyberry authored
      Previous code assumed little-endian.
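      Why byte order matters here can be shown with a small Python sketch. It assumes getPrefix packs the first 8 bytes of a string into a 64-bit value used for ordering comparisons, which the commit message does not state; the function names are hypothetical. For the prefix to sort like the bytes themselves, byte 0 must end up most significant, so a word loaded on a little-endian machine needs its bytes reversed while one loaded on a big-endian machine does not.

```python
import struct


def prefix_big_endian(data):
    """Pack the first 8 bytes (zero-padded) so that byte 0 is most significant."""
    return int.from_bytes(data[:8].ljust(8, b"\x00"), "big")


def prefix_from_native_word(data, byte_order):
    """Emulate loading an 8-byte word in the given byte order, then normalizing it."""
    padded = data[:8].ljust(8, b"\x00")
    fmt = "<Q" if byte_order == "little" else ">Q"
    (word,) = struct.unpack(fmt, padded)
    if byte_order == "little":
        # A little-endian load puts byte 0 in the least significant position,
        # so the bytes must be reversed before the word can be compared.
        word = int.from_bytes(word.to_bytes(8, "little"), "big")
    return word


def naive_little_endian_prefix(data):
    """The raw little-endian load with no reversal; it does not preserve byte order."""
    return int.from_bytes(data[:8].ljust(8, b"\x00"), "little")
```

      Applying the reversal unconditionally would scramble the prefix on big-endian hardware, which is the class of bug a little-endian assumption produces.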
      
      Author: Matthew Brandyberry <mbrandy@us.ibm.com>
      
      Closes #7902 from mtbrandy/SPARK-9483 and squashes the following commits:
      
      ec31df8 [Matthew Brandyberry] [SPARK-9483] Changes from review comments.
      17d54c6 [Matthew Brandyberry] [SPARK-9483] Fix UTF8String.getPrefix for big-endian.
    • Add a prerequisites section for building docs · 7abaaad5
      Shivaram Venkataraman authored
      This puts all the install commands that need to be run in one section instead of being spread over many paragraphs
      
      cc rxin
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #7912 from shivaram/docs-setup-readme and squashes the following commits:
      
      cf7a204 [Shivaram Venkataraman] Add a prerequisites section for building docs
    • [SPARK-8874] [ML] Add missing methods in Word2Vec · 13675c74
      MechCoder authored
      Add missing methods
      
      1. getVectors
      2. findSynonyms
      
      to W2Vec scala and python API
      
      mengxr
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #7263 from MechCoder/missing_methods_w2vec and squashes the following commits:
      
      149d5ca [MechCoder] minor doc
      69d91b7 [MechCoder] [SPARK-8874] [ML] Add missing methods in Word2Vec
    • [SPARK-8064] [SQL] Build against Hive 1.2.1 · a2409d1c
      Steve Loughran authored
      Cherry picked the parts of the initial SPARK-8064 WiP branch needed to get sql/hive to compile against hive 1.2.1. That's the ASF release packaged under org.apache.hive, not any fork.
      
      Tests not run yet: that's what the machines are for
      
      Author: Steve Loughran <stevel@hortonworks.com>
      Author: Cheng Lian <lian@databricks.com>
      Author: Michael Armbrust <michael@databricks.com>
      Author: Patrick Wendell <patrick@databricks.com>
      
      Closes #7191 from steveloughran/stevel/feature/SPARK-8064-hive-1.2-002 and squashes the following commits:
      
      7556d85 [Cheng Lian] Updates .q files and corresponding golden files
      ef4af62 [Steve Loughran] Merge commit '6a92bb09f46a04d6cd8c41bdba3ecb727ebb9030' into stevel/feature/SPARK-8064-hive-1.2-002
      6a92bb0 [Cheng Lian] Overrides HiveConf time vars
      dcbb391 [Cheng Lian] Adds com.twitter:parquet-hadoop-bundle:1.6.0 for Hive Parquet SerDe
      0bbe475 [Steve Loughran] SPARK-8064 scalastyle rejects the standard Hadoop ASF license header...
      fdf759b [Steve Loughran] SPARK-8064 classpath dependency suite to be in sync with shading in final (?) hive-exec spark
      7a6c727 [Steve Loughran] SPARK-8064 switch to second staging repo of the spark-hive artifacts. This one has the protobuf-shaded hive-exec jar
      376c003 [Steve Loughran] SPARK-8064 purge duplicate protobuf declaration
      2c74697 [Steve Loughran] SPARK-8064 switch to the protobuf shaded hive-exec jar with tests to chase it down
      cc44020 [Steve Loughran] SPARK-8064 remove hadoop.version from runtest.py, as profile will fix that automatically.
      6901fa9 [Steve Loughran] SPARK-8064 explicit protobuf import
      da310dc [Michael Armbrust] Fixes for Hive tests.
      a775a75 [Steve Loughran] SPARK-8064 cherry-pick-incomplete
      7404f34 [Patrick Wendell] Add spark-hive staging repo
      832c164 [Steve Loughran] SPARK-8064 try to suppress compiler warnings on Complex.java pasted-thrift-code
      312c0d4 [Steve Loughran] SPARK-8064  maven/ivy dependency purge; calcite declaration needed
      fa5ae7b [Steve Loughran] HIVE-8064 fix up hive-thriftserver dependencies and cut back on evicted references in the hive- packages; this keeps mvn and ivy resolution compatible, as the reconciliation policy is "by hand"
      c188048 [Steve Loughran] SPARK-8064 manage the Hive dependencies so that -things that aren't needed are excluded -sql/hive built with ivy is in sync with the maven reconciliation policy, rather than latest-first
      4c8be8d [Cheng Lian] WIP: Partial fix for Thrift server and CLI tests
      314eb3c [Steve Loughran] SPARK-8064 deprecation warning  noise in one of the tests
      17b0341 [Steve Loughran] SPARK-8064 IDE-hinted cleanups of Complex.java to reduce compiler warnings. It's all autogenerated code, so still ugly.
      d029b92 [Steve Loughran] SPARK-8064 rely on unescaping to have already taken place, so go straight to map of serde options
      23eca7e [Steve Loughran] HIVE-8064 handle raw and escaped property tokens
      54d9b06 [Steve Loughran] SPARK-8064 fix compilation regression surfacing from rebase
      0b12d5f [Steve Loughran] HIVE-8064 use subset of hive complex type whose types deserialize
      fce73b6 [Steve Loughran] SPARK-8064 poms rely implicitly on the version of kryo chill provides
      fd3aa5d [Steve Loughran] SPARK-8064 version of hive to d/l from ivy is 1.2.1
      dc73ece [Steve Loughran] SPARK-8064 revert to master's deterministic pushdown strategy
      d3c1e4a [Steve Loughran] SPARK-8064 purge UnionType
      051cc21 [Steve Loughran] SPARK-8064 switch to an unshaded version of hive-exec-core, which must have been built with Kryo 2.21. This currently looks for a (locally built) version 1.2.1.spark
      6684c60 [Steve Loughran] SPARK-8064 ignore RTE raised in blocking process.exitValue() call
      e6121e5 [Steve Loughran] SPARK-8064 address review comments
      aa43dc6 [Steve Loughran] SPARK-8064  more robust teardown on JavaMetastoreDatasourcesSuite
      f2bff01 [Steve Loughran] SPARK-8064 better takeup of asynchronously caught error text
      8b1ef38 [Steve Loughran] SPARK-8064: on failures executing spark-submit in HiveSparkSubmitSuite, print command line and all logged output.
      5a9ce6b [Steve Loughran] SPARK-8064 add explicit reason for kv split failure, rather than array OOB. *does not address the issue*
      642b63a [Steve Loughran] SPARK-8064 reinstate something cut briefly during rebasing
      97194dc [Steve Loughran] SPARK-8064 add extra logging to the YarnClusterSuite classpath test. There should be no reason why this is failing on jenkins, but as it is (and presumably its CP-related), improve the logging including any exception raised.
      335357f [Steve Loughran] SPARK-8064 fail fast on thrive process spawning tests on exit codes and/or error string patterns seen in log.
      3ed872f [Steve Loughran] SPARK-8064 rename field double to  dbl
      bca55e5 [Steve Loughran] SPARK-8064 missed one of the `date` escapes
      41d6479 [Steve Loughran] SPARK-8064 wrap tests with withTable() calls to avoid table-exists exceptions
      2bc29a4 [Steve Loughran] SPARK-8064 ParquetSuites to escape `date` field name
      1ab9bc4 [Steve Loughran] SPARK-8064 TestHive to use sered2.thrift.test.Complex
      bf3a249 [Steve Loughran] SPARK-8064: more resubmit than fix; tighten startup timeout to 60s. Still no obvious reason why jersey server code in spark-assembly isn't being picked up -it hasn't been shaded
      c829b8f [Steve Loughran] SPARK-8064: reinstate yarn-rm-server dependencies to hive-exec to ensure that jersey server is on classpath on hadoop versions < 2.6
      0b0f738 [Steve Loughran] SPARK-8064: thrift server startup to fail fast on any exception in the main thread
      13abaf1 [Steve Loughran] SPARK-8064 Hive compatibility tests in sync with explain/show output from Hive 1.2.1
      d14d5ea [Steve Loughran] SPARK-8064: DATE is now a predicate; you can't use it as a field in select ops
      26eef1c [Steve Loughran] SPARK-8064: HIVE-9039 renamed TOK_UNION => TOK_UNIONALL while adding TOK_UNIONDISTINCT
      3d64523 [Steve Loughran] SPARK-8064 improve diagnostics on unknown token; fix scalastyle failure
      d0360f6 [Steve Loughran] SPARK-8064: delicate merge in of the branch vanzin/hive-1.1
      1126e5a [Steve Loughran] SPARK-8064: name of unrecognized file format wasn't appearing in error text
      8cb09c4 [Steve Loughran] SPARK-8064: test resilience/assertion improvements. Independent of the rest of the work; can be backported to earlier versions
      dec12cb [Steve Loughran] SPARK-8064: when a CLI suite test fails include the full output text in the raised exception; this ensures that the stdout/stderr is included in jenkins reports, so it becomes possible to diagnose the cause.
      463a670 [Steve Loughran] SPARK-8064 run-tests.py adds a hadoop-2.6 profile, and changes info messages to say "w/Hive 1.2.1" in console output
      2531099 [Steve Loughran] SPARK-8064 successful attempt to get rid of pentaho as a transitive dependency of hive-exec
      1d59100 [Steve Loughran] SPARK-8064 (unsuccessful) attempt to get rid of pentaho as a transitive dependency of hive-exec
      75733fc [Steve Loughran] SPARK-8064 change thrift binary startup message to "Starting ThriftBinaryCLIService on port"
      3ebc279 [Steve Loughran] SPARK-8064 move strings used to check for http/bin thrift services up into constants
      c80979d [Steve Loughran] SPARK-8064: SparkSQLCLIDriver drops remote mode support. CLISuite Tests pass instead of timing out: undetected regression?
      27e8370 [Steve Loughran] SPARK-8064 fix some style & IDE warnings
      00e50d6 [Steve Loughran] SPARK-8064 stop excluding hive shims from dependency (commented out , for now)
      cb4f142 [Steve Loughran] SPARK-8054 cut pentaho dependency from calcite
      f7aa9cb [Steve Loughran] SPARK-8064 everything compiles with some commenting and moving of classes into a hive package
      6c310b4 [Steve Loughran] SPARK-8064 subclass  Hive ServerOptionsProcessor to make it public again
      f61a675 [Steve Loughran] SPARK-8064 thrift server switched to Hive 1.2.1, though it doesn't compile everywhere
      4890b9d [Steve Loughran] SPARK-8064, build against Hive 1.2.1
      a2409d1c
    • Reynold Xin's avatar
      Revert "[SPARK-9372] [SQL] Filter nulls in join keys" · b2e4b85d
      Reynold Xin authored
      This reverts commit 687c8c37.
      b2e4b85d
    • Andrew Or's avatar
      [SPARK-8735] [SQL] Expose memory usage for shuffles, joins and aggregations · 702aa9d7
      Andrew Or authored
      This patch exposes the memory used by internal data structures on the SparkUI. This tracks memory used by all spilling operations and SQL operators backed by Tungsten, e.g. `BroadcastHashJoin`, `ExternalSort`, `GeneratedAggregate` etc. The metric exposed is "peak execution memory", which broadly refers to the peak in-memory size of each of these data structures.
      
      A separate patch will extend this by linking the new information to the SQL operators themselves.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #7770 from andrewor14/expose-memory-metrics and squashes the following commits:
      
      9abecb9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
      f5b0d68 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
      d7df332 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
      8eefbc5 [Andrew Or] Fix non-failing tests
      9de2a12 [Andrew Or] Fix tests due to another logical merge conflict
      876bfa4 [Andrew Or] Fix failing test after logical merge conflict
      361a359 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
      40b4802 [Andrew Or] Fix style?
      d0fef87 [Andrew Or] Fix tests?
      b3b92f6 [Andrew Or] Address comments
      0625d73 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
      c00a197 [Andrew Or] Fix potential NPEs
      10da1cd [Andrew Or] Fix compile
      17f4c2d [Andrew Or] Fix compile?
      a87b4d0 [Andrew Or] Fix compile?
      d70874d [Andrew Or] Fix test compile + address comments
      2840b7d [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
      6aa2f7a [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
      b889a68 [Andrew Or] Minor changes: comments, spacing, style
      663a303 [Andrew Or] UnsafeShuffleWriter: update peak memory before close
      d090a94 [Andrew Or] Fix style
      2480d84 [Andrew Or] Expand test coverage
      5f1235b [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
      1ecf678 [Andrew Or] Minor changes: comments, style, unused imports
      0b6926c [Andrew Or] Oops
      111a05e [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
      a7a39a5 [Andrew Or] Strengthen presence check for accumulator
      a919eb7 [Andrew Or] Add tests for unsafe shuffle writer
      23c845d [Andrew Or] Add tests for SQL operators
      a757550 [Andrew Or] Address comments
      b5c51c1 [Andrew Or] Re-enable test in JavaAPISuite
      5107691 [Andrew Or] Add tests for internal accumulators
      59231e4 [Andrew Or] Fix tests
      9528d09 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
      5b5e6f3 [Andrew Or] Add peak execution memory to summary table + tooltip
      92b4b6b [Andrew Or] Display peak execution memory on the UI
      eee5437 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
      d9b9015 [Andrew Or] Track execution memory in unsafe shuffles
      770ee54 [Andrew Or] Track execution memory in broadcast joins
      9c605a4 [Andrew Or] Track execution memory in GeneratedAggregate
      9e824f2 [Andrew Or] Add back execution memory tracking for *ExternalSort
      4ef4cb1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics
      e6c3e2f [Andrew Or] Move internal accumulators creation to Stage
      a417592 [Andrew Or] Expose memory metrics in UnsafeExternalSorter
      3c4f042 [Andrew Or] Track memory usage in ExternalAppendOnlyMap / ExternalSorter
      bd7ab3f [Andrew Or] Add internal accumulators to TaskContext
      702aa9d7
    • Xiangrui Meng's avatar
      [SPARK-9544] [MLLIB] add Python API for RFormula · e4765a46
      Xiangrui Meng authored
      Add a Python API for RFormula. Like the other feature transformers in Python, it is just a thin wrapper over the Scala implementation. ericl MechCoder
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #7879 from mengxr/SPARK-9544 and squashes the following commits:
      
      3d5ff03 [Xiangrui Meng] add an doctest for . and -
      5e969a5 [Xiangrui Meng] fix pydoc
      1cd41f8 [Xiangrui Meng] organize imports
      3c18b10 [Xiangrui Meng] add Python API for RFormula
      e4765a46
    • Yanbo Liang's avatar
      [SPARK-9191] [ML] [Doc] Add ml.PCA user guide and code examples · 8ca287eb
      Yanbo Liang authored
      Add ml.PCA user guide document and code examples for Scala/Java/Python.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #7522 from yanboliang/ml-pca-md and squashes the following commits:
      
      60dec05 [Yanbo Liang] address comments
      f992abe [Yanbo Liang] Add ml.PCA doc and examples
      8ca287eb
    • Kousuke Saruta's avatar
      [SPARK-9558][DOCS]Update docs to follow the increase of memory defaults. · ba1c4e13
      Kousuke Saruta authored
      The default memory for the master and slave daemons in Standalone mode and for the History Server is now 1g, not 512m, so the docs need to be updated to match.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #7896 from sarutak/update-doc-for-daemon-memory and squashes the following commits:
      
      a77626c [Kousuke Saruta] Fix docs to follow the update of increase of memory defaults
      ba1c4e13
    • Joseph K. Bradley's avatar
      [SPARK-5133] [ML] Added featureImportance to RandomForestClassifier and Regressor · ff9169a0
      Joseph K. Bradley authored
      Added featureImportance to RandomForestClassifier and Regressor.
      
      This follows the scikit-learn implementation here: [https://github.com/scikit-learn/scikit-learn/blob/a95203b249c1cf392f86d001ad999e29b2392739/sklearn/tree/_tree.pyx#L3341]
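
As a rough illustration of the impurity-based approach referenced here, the sketch below computes per-tree importances from weighted impurity decreases, normalizes them, and then combines trees. It is not the Spark or scikit-learn code: the `Node` class, the function names, and the choice to sum normalized per-tree vectors before renormalizing are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Hypothetical tree node used only for this illustration.
    n_samples: float
    impurity: float
    feature: Optional[int] = None   # None marks a leaf
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def tree_importances(root, n_features):
    """Sum the weighted impurity decrease per split feature, then normalize."""
    totals = [0.0] * n_features
    stack = [root]
    while stack:
        node = stack.pop()
        if node.feature is None:
            continue  # leaves contribute nothing
        gain = (node.n_samples * node.impurity
                - node.left.n_samples * node.left.impurity
                - node.right.n_samples * node.right.impurity)
        totals[node.feature] += gain
        stack.extend([node.left, node.right])
    total = sum(totals)
    return [t / total for t in totals] if total > 0 else totals


def forest_importances(roots, n_features):
    """Combine normalized per-tree importances and renormalize (assumed scheme)."""
    summed = [0.0] * n_features
    for root in roots:
        for i, value in enumerate(tree_importances(root, n_features)):
            summed[i] += value
    total = sum(summed)
    return [s / total for s in summed] if total > 0 else summed
```

Returning a plain list indexed by feature mirrors the later commit in this PR that switched the result from a Map to a Vector.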
      
      CC: yanboliang  Would you mind taking a look?  Thanks!
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #7838 from jkbradley/dt-feature-importance and squashes the following commits:
      
      72a167a [Joseph K. Bradley] fixed unit test
      86cea5f [Joseph K. Bradley] Modified RF featuresImportances to return Vector instead of Map
      5aa74f0 [Joseph K. Bradley] finally fixed unit test for real
      33df5db [Joseph K. Bradley] fix unit test
      42a2d3b [Joseph K. Bradley] fix unit test
      fe94e72 [Joseph K. Bradley] modified feature importance unit tests
      cc693ee [Feynman Liang] Add classifier tests
      79a6f87 [Feynman Liang] Compare dense vectors in test
      21d01fc [Feynman Liang] Added failing SKLearn test
      ac0b254 [Joseph K. Bradley] Added featureImportance to RandomForestClassifier/Regressor.  Need to add unit tests
      ff9169a0
    • Cheng Lian's avatar
      [SPARK-9554] [SQL] Enables in-memory partition pruning by default · 703e44bf
      Cheng Lian authored
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7895 from liancheng/spark-9554/enable-in-memory-partition-pruning and squashes the following commits:
      
      67c403e [Cheng Lian] Enables in-memory partition pruning by default
      703e44bf
    • Reynold Xin's avatar
      [SQL][minor] Simplify UnsafeRow.calculateBitSetWidthInBytes. · 7a9d09f0
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7897 from rxin/calculateBitSetWidthInBytes and squashes the following commits:
      
      2e73b3a [Reynold Xin] [SQL][minor] Simplify UnsafeRow.calculateBitSetWidthInBytes.
      7a9d09f0
    • Joseph Batchik's avatar
      [SPARK-9511] [SQL] Fixed Table Name Parsing · dfe7bd16
      Joseph Batchik authored
      The issue was that the tokenizer was parsing "1one" as the numeric literal 1, using the code on line 110. I added another case that accepts strings that start with a digit and contain a letter somewhere later in the string.
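
The parsing rule described in this commit can be sketched in a few lines of Python. This is not Spark's lexer: the pattern table, token kinds, and `tokenize` function are hypothetical, and only illustrate why the digit-led identifier case has to be tried before the plain numeric case.

```python
import re

# Hypothetical token patterns, tried in order. The rule for identifiers that
# start with digits must come before the numeric rule; otherwise "1one" would
# be split into the number 1 followed by the identifier "one".
TOKEN_PATTERNS = [
    ("IDENT", re.compile(r"[0-9]+[A-Za-z_][A-Za-z0-9_]*")),  # e.g. 1one
    ("NUMBER", re.compile(r"[0-9]+")),
    ("IDENT", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("SYMBOL", re.compile(r"[.,*()=]")),
]


def tokenize(text):
    """Return a list of (kind, lexeme) pairs for a small SQL-like input."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for kind, pattern in TOKEN_PATTERNS:
            match = pattern.match(text, pos)
            if match:
                tokens.append((kind, match.group()))
                pos = match.end()
                break
        else:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
    return tokens
```

With this ordering, `tokenize("SELECT * FROM 1one")` keeps `1one` as a single identifier, while a string made only of digits still becomes a number.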
      
      Author: Joseph Batchik <joseph.batchik@cloudera.com>
      
      Closes #7844 from JDrit/parse_error and squashes the following commits:
      
      b8ca12f [Joseph Batchik] fixed parsing issue by adding another case
      dfe7bd16
    • Andrew Or's avatar
      [SPARK-1855] Local checkpointing · b41a3271
      Andrew Or authored
      Certain use cases of Spark involve RDDs with long lineages that must be truncated periodically (e.g. GraphX). The existing way of doing it is through `rdd.checkpoint()`, which is expensive because it writes to HDFS. This patch provides an alternative to truncate lineages cheaply *without providing the same level of fault tolerance*.
      
      **Local checkpointing** writes checkpointed data to the local file system through the block manager. It is much faster than replicating to reliable storage and provides the same semantics as long as executors do not fail. It is accessible through a new operator `rdd.localCheckpoint()` and leaves the old one unchanged. Users may even decide to combine the two and call the reliable one less frequently.
      
      The bulk of this patch involves refactoring the checkpointing interface to accept custom implementations of checkpointing. [Design doc](https://issues.apache.org/jira/secure/attachment/12741708/SPARK-7292-design.pdf).
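
To make the idea of lineage truncation concrete, here is a toy model in plain Python. It does not use Spark: `ToyRDD` and its methods are hypothetical, the "local storage" is just an in-memory list, and the checkpoint is taken eagerly for simplicity. The point is only that materializing the data locally lets the chain of parents be dropped, at the cost of losing the ability to recompute if that local copy is lost.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD whose data is recomputed from its parent."""

    def __init__(self, parent=None, fn=None, data=None):
        self.parent = parent
        self.fn = fn
        self._data = data  # materialized values, if any

    def map(self, fn):
        # Each transformation adds one link to the lineage chain.
        return ToyRDD(parent=self, fn=fn)

    def collect(self):
        if self._data is not None:
            return list(self._data)
        return [self.fn(x) for x in self.parent.collect()]

    def lineage_depth(self):
        depth, node = 0, self
        while node.parent is not None:
            depth += 1
            node = node.parent
        return depth

    def local_checkpoint(self):
        # Materialize once into local storage (here, memory), then cut the
        # reference to the parent chain. Nothing is written to reliable
        # storage, so a lost copy could not be rebuilt.
        self._data = self.collect()
        self.parent = None
        self.fn = None
        return self
```

For example, after five chained `map` calls the lineage depth is 5; calling `local_checkpoint()` resets it to 0 while `collect()` returns the same values.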
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #7279 from andrewor14/local-checkpoint and squashes the following commits:
      
      729600f [Andrew Or] Oops, fix tests
      34bc059 [Andrew Or] Avoid computing all partitions in local checkpoint
      e43bbb6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
      3be5aea [Andrew Or] Address comments
      bf846a6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
      ab003a3 [Andrew Or] Fix compile
      c2e111b [Andrew Or] Address comments
      33f167a [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
      e908a42 [Andrew Or] Fix tests
      f5be0f3 [Andrew Or] Use MEMORY_AND_DISK as the default local checkpoint level
      a92657d [Andrew Or] Update a few comments
      e58e3e3 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
      4eb6eb1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
      1bbe154 [Andrew Or] Simplify LocalCheckpointRDD
      48a9996 [Andrew Or] Avoid traversing dependency tree + rewrite tests
      62aba3f [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
      db70dc2 [Andrew Or] Express local checkpointing through caching the original RDD
      87d43c6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint
      c449b38 [Andrew Or] Fix style
      4a182f3 [Andrew Or] Add fine-grained tests for local checkpointing
      53b363b [Andrew Or] Rename a few more awkwardly named methods (minor)
      e4cf071 [Andrew Or] Simplify LocalCheckpointRDD + docs + clean ups
      4880deb [Andrew Or] Fix style
      d096c67 [Andrew Or] Fix mima
      172cb66 [Andrew Or] Fix mima?
      e53d964 [Andrew Or] Fix style
      56831c5 [Andrew Or] Add a few warnings and clear exception messages
      2e59646 [Andrew Or] Add local checkpoint clean up tests
      4dbbab1 [Andrew Or] Refactor CheckpointSuite to test local checkpointing
      4514dc9 [Andrew Or] Clean local checkpoint files through RDD cleanups
      0477eec [Andrew Or] Rename a few methods with awkward names (minor)
      2e902e5 [Andrew Or] First implementation of local checkpointing
      8447454 [Andrew Or] Fix tests
      4ac1896 [Andrew Or] Refactor checkpoint interface for modularity
      b41a3271
    • Joseph K. Bradley's avatar
      [SPARK-9528] [ML] Changed RandomForestClassifier to extend ProbabilisticClassifier · 69f5a7c9
      Joseph K. Bradley authored
      RandomForestClassifier now outputs rawPrediction based on tree probabilities, plus probability column computed from normalized rawPrediction.
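
A minimal sketch of that relationship, assuming each tree contributes a class-probability vector for the instance being scored: the raw prediction is taken as the element-wise sum over trees, and the probability column is that sum normalized to 1. The function names are hypothetical and this is not the Spark implementation.

```python
def raw_prediction(tree_probabilities):
    """Element-wise sum of the per-tree class probability vectors."""
    n_classes = len(tree_probabilities[0])
    raw = [0.0] * n_classes
    for probs in tree_probabilities:
        for k, p in enumerate(probs):
            raw[k] += p
    return raw


def normalize(raw):
    """Scale a raw prediction vector so that its entries sum to 1."""
    total = sum(raw)
    if total == 0:
        return list(raw)  # nothing to normalize
    return [r / total for r in raw]
```

With two trees voting `[0.5, 0.5]` and `[1.0, 0.0]`, the raw prediction is `[1.5, 0.5]` and the normalized probability is `[0.75, 0.25]`.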
      
      CC: holdenk
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #7859 from jkbradley/rf-prob and squashes the following commits:
      
      6c28f51 [Joseph K. Bradley] Changed RandomForestClassifier to extend ProbabilisticClassifier
      69f5a7c9
    • Reynold Xin's avatar
      8be198c8
    • Davies Liu's avatar
      [SPARK-9518] [SQL] cleanup generated UnsafeRowJoiner and fix bug · 191bf268
      Davies Liu authored
      Previously, when copying the bitsets, we did not consider that row1 may not sit at the beginning of the byte array.
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7892 from davies/clean_join and squashes the following commits:
      
      14cce9e [Davies Liu] cleanup generated UnsafeRowJoiner and fix bug
      191bf268
    • Wenchen Fan's avatar
      [SPARK-9551][SQL] add a cheap version of copy for UnsafeRow to reuse a copy buffer · 137f4786
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7885 from cloud-fan/cheap-copy and squashes the following commits:
      
      0900ca1 [Wenchen Fan] replace == with ===
      73f4ada [Wenchen Fan] add tests
      07b865a [Wenchen Fan] add a cheap version of copy
      137f4786