- Aug 04, 2015
-
-
Mike Dusenberry authored
This PR adds the RowMatrix, IndexedRowMatrix, and CoordinateMatrix distributed matrices to PySpark. Each distributed matrix class acts as a wrapper around the Scala/Java counterpart by maintaining a reference to the Java object. New distributed matrices can be created using factory methods added to DistributedMatrices, which create the Java distributed matrix and then wrap it with the corresponding PySpark class. This design allows for simple conversion between the various distributed matrices, and lets us re-use the Scala code. Serialization between Python and Java is implemented using DataFrames as needed for IndexedRowMatrix and CoordinateMatrix for simplicity. Associated documentation and unit tests have also been added. To facilitate code review, this PR implements access to the rows/entries as RDDs, the number of rows & columns, and conversions between the various distributed matrices (not including BlockMatrix); it does not implement the other linear algebra functions of the matrices, although these will be very simple to add now. Author: Mike Dusenberry <mwdusenb@us.ibm.com> Closes #7554 from dusenberrymw/SPARK-6485_Add_CoordinateMatrix_RowMatrix_IndexedMatrix_to_PySpark and squashes the following commits: bb039cb [Mike Dusenberry] Minor documentation update. b887c18 [Mike Dusenberry] Updating the matrix conversion logic again to make it even cleaner. Now, we allow the 'rows' parameter in the constructors to be either an RDD or the Java matrix object. If 'rows' is an RDD, we create a Java matrix object, wrap it, and then store that. If 'rows' is a Java matrix object of the correct type, we just wrap and store that directly. This is only for internal usage; publicly, we still require 'rows' to be an RDD. We no longer store the 'rows' RDD, and instead just compute it from the Java object when needed. The point of this is that when we do matrix conversions, we do the conversion on the Scala/Java side, which returns a Java object, so we should use that directly, but exposing a 'java_matrix' parameter in the public API is not ideal. This non-public feature of allowing 'rows' to be a Java matrix object is documented in the '__init__' constructor docstrings, which are not part of the generated public API, and doctests are also included. 7f0dcb6 [Mike Dusenberry] Updating module docstring. cfc1be5 [Mike Dusenberry] Use 'new SQLContext(matrix.rows.sparkContext)' rather than 'SQLContext.getOrCreate', as the latter doesn't guarantee that the SparkContext will be the same as for the matrix.rows data. 687e345 [Mike Dusenberry] Improving conversion performance. This adds an optional 'java_matrix' parameter to the constructors, and pulls the conversion logic out into a '_create_from_java' function. Now, if the constructors are given a valid Java distributed matrix object as 'java_matrix', they will store it internally, rather than create a new one on the Scala/Java side. 3e50b6e [Mike Dusenberry] Moving the distributed matrices to pyspark.mllib.linalg.distributed. 308f197 [Mike Dusenberry] Using properties for better documentation. 1633f86 [Mike Dusenberry] Minor documentation cleanup. f0c13a7 [Mike Dusenberry] CoordinateMatrix should inherit from DistributedMatrix. ffdd724 [Mike Dusenberry] Updating doctests to make documentation cleaner. 3fd4016 [Mike Dusenberry] Updating docstrings. 27cd5f6 [Mike Dusenberry] Simplifying input conversions in the constructors for each distributed matrix.
a409cf5 [Mike Dusenberry] Updating doctests to be less verbose by using lists instead of DenseVectors explicitly. d19b0ba [Mike Dusenberry] Updating code and documentation to note that a vector-like object (numpy array, list, etc.) can be used in place of an explicit Vector object, and adding conversions when necessary to RowMatrix construction. 4bd756d [Mike Dusenberry] Adding param documentation to IndexedRow and MatrixEntry. c6bded5 [Mike Dusenberry] Move conversion logic from tuples to IndexedRow or MatrixEntry types from within the IndexedRowMatrix and CoordinateMatrix constructors to separate _convert_to_indexed_row and _convert_to_matrix_entry functions. 329638b [Mike Dusenberry] Moving the Experimental tag to the top of each docstring. 0be6826 [Mike Dusenberry] Simplifying doctests by removing duplicated rows/entries RDDs within the various tests. c0900df [Mike Dusenberry] Adding the colons that were accidentally not inserted. 4ad6819 [Mike Dusenberry] Documenting the and parameters. 3b854b9 [Mike Dusenberry] Minor updates to documentation. 10046e8 [Mike Dusenberry] Updating documentation to use class constructors instead of the removed DistributedMatrices factory methods. 119018d [Mike Dusenberry] Adding static methods to each of the distributed matrix classes to consolidate conversion logic. 4d7af86 [Mike Dusenberry] Adding type checks to the constructors. Although it is slightly verbose, it is better for the user to have a good error message than a cryptic stacktrace. 93b6a3d [Mike Dusenberry] Pulling the DistributedMatrices Python class out of this pull request. f6f3c68 [Mike Dusenberry] Pulling the DistributedMatrices Scala class out of this pull request. 6a3ecb7 [Mike Dusenberry] Updating pattern matching. 08f287b [Mike Dusenberry] Slight reformatting of the documentation. a245dc0 [Mike Dusenberry] Updating Python doctests for compatibility between Python 2 & 3. Since Python 3 removed the idea of a separate 'long' type, all values that would have been outputted as a 'long' (ex: '4L') will now be treated as an 'int' and outputted as one (ex: '4'). The doctests now explicitly convert to ints so that both Python 2 and 3 will have the same output. This is fine since the values are all small, and thus can be easily represented as ints. 4d3a37e [Mike Dusenberry] Reformatting a few long Python doctest lines. 7e3ca16 [Mike Dusenberry] Fixing long lines. f721ead [Mike Dusenberry] Updating documentation for each of the distributed matrices. ab0e8b6 [Mike Dusenberry] Updating unit test to be more useful. dda2f89 [Mike Dusenberry] Added wrappers for the conversions between the various distributed matrices. Added logic to be able to access the rows/entries of the distributed matrices, which requires serialization through DataFrames for IndexedRowMatrix and CoordinateMatrix types. Added unit tests. 0cd7166 [Mike Dusenberry] Implemented the CoordinateMatrix API in PySpark, following the idea of the IndexedRowMatrix API, including using DataFrames for serialization. 3c369cb [Mike Dusenberry] Updating the architecture a bit to make conversions between the various distributed matrix types easier. The different distributed matrix classes are now only wrappers around the Java objects, and take the Java object as an argument during construction. This way, we can call, for example, toRowMatrix() on an IndexedRowMatrix, which returns a reference to a Java RowMatrix object, and then construct a PySpark RowMatrix object wrapped around the Java object. This is analogous to the behavior of PySpark RDDs and DataFrames.
We now delegate creation of the various distributed matrices from scratch in PySpark to the factory methods on DistributedMatrices. 4bdd09b [Mike Dusenberry] Implemented the IndexedRowMatrix API in PySpark, following the idea of the RowMatrix API. Note that for the IndexedRowMatrix, we use DataFrames to serialize the data between Python and Scala/Java, so we accept PySpark RDDs, then convert to a DataFrame, then convert back to RDDs on the Scala/Java side before constructing the IndexedRowMatrix. 23bf1ec [Mike Dusenberry] Updating documentation to add PySpark RowMatrix. Inserting newline above doctest so that it renders properly in API docs. b194623 [Mike Dusenberry] Updating design to have a PySpark RowMatrix simply create and keep a reference to a wrapper over a Java RowMatrix. Updating DistributedMatrices factory methods to accept numRows and numCols with default values. Updating PySpark DistributedMatrices factory method to simply create a PySpark RowMatrix. Adding additional doctests for numRows and numCols parameters. bc2d220 [Mike Dusenberry] Adding unit tests for RowMatrix methods. d7e316f [Mike Dusenberry] Implemented the RowMatrix API in PySpark by doing the following: Added a DistributedMatrices class to contain factory methods for creating the various distributed matrices. Added a factory method for creating a RowMatrix from an RDD of Vectors. Added a createRowMatrix function to the PythonMLlibAPI to interface with the factory method. Added DistributedMatrix, DistributedMatrices, and RowMatrix classes to the pyspark.mllib.linalg API.
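A minimal PySpark sketch of the wrapper design described above (assuming a live SparkContext named sc; the classes live in pyspark.mllib.linalg.distributed as of this PR):

```python
from pyspark.mllib.linalg.distributed import (
    RowMatrix, IndexedRowMatrix, IndexedRow, CoordinateMatrix, MatrixEntry)

# Each PySpark class wraps a Java counterpart; rows/entries round-trip
# through DataFrames for IndexedRowMatrix and CoordinateMatrix.
row_mat = RowMatrix(sc.parallelize([[1, 2, 3], [4, 5, 6]]))  # vector-like rows
print(int(row_mat.numRows()), int(row_mat.numCols()))        # 2 3

indexed = IndexedRowMatrix(sc.parallelize([IndexedRow(0, [1, 2, 3]),
                                           IndexedRow(1, [4, 5, 6])]))
coord = CoordinateMatrix(sc.parallelize([MatrixEntry(0, 0, 1.0),
                                         MatrixEntry(1, 2, 6.0)]))

# Conversions run on the Scala/Java side and return wrapped Java objects.
as_row = indexed.toRowMatrix()
as_indexed = coord.toIndexedRowMatrix()
```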
-
Joseph K. Bradley authored
Small cleanups to recent LDA additions and docs. CC: feynmanliang Author: Joseph K. Bradley <joseph@databricks.com> Closes #7916 from jkbradley/lda-cleanups and squashes the following commits: f7021d9 [Joseph K. Bradley] broadcasting large matrices for LDA in local model and online learning 97947aa [Joseph K. Bradley] a few more cleanups 5b03f88 [Joseph K. Bradley] reverted split of lda log likelihood c566915 [Joseph K. Bradley] small edit to make review easier 63f6c7d [Joseph K. Bradley] clarified log likelihood for lda models
-
Joseph K. Bradley authored
Added HasRawPredictionCol, HasProbabilityCol to RandomForestClassifier, plus doc tests for those columns. CC: holdenk yanboliang Author: Joseph K. Bradley <joseph@databricks.com> Closes #7903 from jkbradley/rf-prob-python and squashes the following commits: c62a83f [Joseph K. Bradley] made unit test more robust 14eeba2 [Joseph K. Bradley] added HasRawPredictionCol, HasProbabilityCol to RandomForestClassifier in PySpark
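A hedged sketch of what the new columns look like from Python (assumes a SQLContext named sqlContext and a toy training set):

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.mllib.linalg import Vectors

df = sqlContext.createDataFrame(
    [(0.0, Vectors.dense(0.0)), (1.0, Vectors.dense(1.0))], ["label", "features"])
model = RandomForestClassifier(numTrees=3, maxDepth=2, seed=42).fit(df)

# rawPredictionCol and probabilityCol are the columns this change wires up
model.transform(df).select("rawPrediction", "probability", "prediction").show()
```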
-
CodingCat authored
https://issues.apache.org/jira/browse/SPARK-9602 Although we have hidden Akka behind the RPC interface, I found that Akka/Actor-related comments are still scattered everywhere. To make it consistent, we shall remove the "actor"/"akka" words from the comments... Author: CodingCat <zhunansjtu@gmail.com> Closes #7936 from CodingCat/SPARK-9602 and squashes the following commits: e8296a3 [CodingCat] remove actor words from comments
-
Josh Rosen authored
This patch extends UnsafeExternalSorter to support records larger than the page size. The basic strategy is the same as in #7762: store large records in their own overflow pages. Author: Josh Rosen <joshrosen@databricks.com> Closes #7891 from JoshRosen/large-records-in-sql-sorter and squashes the following commits: 967580b [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-records-in-sql-sorter 948c344 [Josh Rosen] Add large records tests for KV sorter. 3c17288 [Josh Rosen] Combine memory and disk cleanup into general cleanupResources() method 380f217 [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-records-in-sql-sorter 27eafa0 [Josh Rosen] Fix page size in PackedRecordPointerSuite a49baef [Josh Rosen] Address initial round of review comments 3edb931 [Josh Rosen] Remove accidentally-committed debug statements. 2b164e2 [Josh Rosen] Support large records in UnsafeExternalSorter.
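A hedged Python sketch of the overflow-page strategy (illustrative names and page size, not the actual UnsafeExternalSorter code):

```python
PAGE_SIZE = 64 * 1024  # illustrative page size

class Page:
    def __init__(self, size):
        self.data = bytearray(size)
        self.cursor = 0

def insert(pages, overflow_pages, record: bytes):
    if len(record) > PAGE_SIZE:
        # a record larger than the page size gets its own exactly-sized page
        page = Page(len(record))
        overflow_pages.append(page)
    else:
        if not pages or pages[-1].cursor + len(record) > PAGE_SIZE:
            pages.append(Page(PAGE_SIZE))
        page = pages[-1]
    page.data[page.cursor:page.cursor + len(record)] = record
    page.cursor += len(record)
```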
-
Wenchen Fan authored
[SPARK-9553][SQL] remove the no-longer-necessary createCode and createStructCode, and replace the usage Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7890 from cloud-fan/minor and squashes the following commits: c3b1be3 [Wenchen Fan] fix style b0cbe2e [Wenchen Fan] remove the createCode and createStructCode, and replace the usage of them by createStructCode
-
Michael Armbrust authored
Author: Michael Armbrust <michael@databricks.com> Closes #7939 from marmbrus/turnOffThriftTests and squashes the following commits: 80d618e [Michael Armbrust] [SPARK-9606][SQL] Ignore flaky thrift server tests
-
Holden Karau authored
This PR replaces the old "threshold" with a generalized "thresholds" Param. We keep getThreshold/setThreshold for backwards compatibility for binary classification. Note that the primary author of this PR is holdenk. Author: Holden Karau <holden@pigscanfly.ca> Author: Joseph K. Bradley <joseph@databricks.com> Closes #7909 from jkbradley/holdenk-SPARK-8069-add-cutoff-aka-threshold-to-random-forest and squashes the following commits: 3952977 [Joseph K. Bradley] fixed pyspark doc test 85febc8 [Joseph K. Bradley] made python unit tests a little more robust 7eb1d86 [Joseph K. Bradley] small cleanups 6cc2ed8 [Joseph K. Bradley] Fixed remaining merge issues. 0255e44 [Joseph K. Bradley] Many cleanups for thresholds, some more tests 7565a60 [Holden Karau] fix pep8 style checks, add a getThreshold method similar to our LogisticRegression.scala one for API compat be87f26 [Holden Karau] Convert threshold to thresholds in the python code, add specialized support for Array[Double] to shared params codegen, etc. 6747dad [Holden Karau] Override raw2prediction for ProbabilisticClassifier, fix some tests 25df168 [Holden Karau] Fix handling of thresholds in LogisticRegression c02d6c0 [Holden Karau] No default for thresholds 5e43628 [Holden Karau] CR feedback and fixed the renamed test f3fbbd1 [Holden Karau] revert the changes to random forest :( 51f581c [Holden Karau] Add explicit types to public methods, fix long line f7032eb [Holden Karau] Fix a java test bug, remove some unnecessary changes adf15b4 [Holden Karau] rename the classifier suite test to ProbabilisticClassifierSuite now that we only have it in Probabilistic 398078a [Holden Karau] move the thresholding around a bunch based on the design doc 4893bdc [Holden Karau] Use numtrees of 3 since previous result was tied (one tree for each) and the switch from different max methods picked a different element (since they were equal I think this is ok) 638854c [Holden Karau] Add a scala RandomForestClassifierSuite test based on corresponding python test e09919c [Holden Karau] Fix return type, I need more coffee.... 8d92cac [Holden Karau] Use ClassifierParams as the head 3456ed3 [Holden Karau] Add explicit return types even though just test a0f3b0c [Holden Karau] scala style fixes 6f14314 [Holden Karau] Since hasthreshold/hasthresholds is in root classifier now ffc8dab [Holden Karau] Update the sharedParams 0420290 [Holden Karau] Allow us to override the get methods selectively 978e77a [Holden Karau] Move HasThreshold into classifier params and start defining the overloaded getThreshold/getThresholds functions 1433e52 [Holden Karau] Revert "try and hide threshold but changes the API so no dice there" 1f09a2e [Holden Karau] try and hide threshold but changes the API so no dice there efb9084 [Holden Karau] move setThresholds only to where it's used 6b34809 [Holden Karau] Add a test with thresholding for the RFCS 74f54c3 [Holden Karau] Fix creation of vote array 1986fa8 [Holden Karau] Setting the thresholds only makes sense if the underlying class hasn't overridden predict, so let's push it down. 2f44b18 [Holden Karau] Add a global default of null for thresholds param f338cfc [Holden Karau] Wait that wasn't a good idea, Revert "Some progress towards unifying threshold and thresholds" 634b06f [Holden Karau] Some progress towards unifying threshold and thresholds 85c9e01 [Holden Karau] Test passes again... little fnur 099c0f3 [Holden Karau] Move thresholds around some more (set on model not trainer) 0f46836 [Holden Karau] Start adding a classifiersuite f70eb5e [Holden Karau] Fix test compile issues a7d59c8 [Holden Karau] Move thresholding into Classifier trait 5d999d2 [Holden Karau] Some more progress, start adding a test (maybe try and see if we can find a better thing to use for the base of the test) 1fed644 [Holden Karau] Use thresholds to scale scores in random forest classification 31d6bf2 [Holden Karau] Start threading the threshold info through 0ef228c [Holden Karau] Add hasthresholds
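A hedged sketch of the resulting Python API (exact wrapper signatures assumed): with thresholds t, the predicted class is the k maximizing probability[k] / t[k], so an old binary threshold of t corresponds to thresholds = [1 - t, t].

```python
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression()
lr.setThreshold(0.6)          # binary shorthand, kept for backwards compatibility
lr.setThresholds([0.4, 0.6])  # generalized Param: one entry per class
```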
-
Michael Armbrust authored
The analysis rule has a bug and we ended up making the sorter still capable of doing evaluation, so let's revert this for now. Author: Michael Armbrust <michael@databricks.com> Closes #7906 from marmbrus/revertSortProjection and squashes the following commits: 2da6972 [Michael Armbrust] unrevert unrelated changes 4f2b00c [Michael Armbrust] Revert "[SPARK-9251][SQL] do not order by expressions which still need evaluation"
-
Shivaram Venkataraman authored
cc srowen pwendell nchammas Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #7899 from shivaram/spark-ec2-move and squashes the following commits: 7cc22c9 [Shivaram Venkataraman] Change reference to amplab/spark-ec2 from mesos/
-
Yijie Shen authored
JIRA: https://issues.apache.org/jira/browse/SPARK-9541 Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #7870 from yjshen/datetime_cleanup and squashes the following commits: 9203e33 [Yijie Shen] revert getMonth & getDayOfMonth 5cad119 [Yijie Shen] rebase code 7d62a74 [Yijie Shen] remove tmp tuple inside split date e98aaac [Yijie Shen] DateTimeUtils cleanup
-
Davies Liu authored
This is based on #7485, thanks to NathanHowell. Tests were copied from Hive, but do not seem to be super comprehensive. I've generally replicated Hive's unusual behavior rather than following a JSONPath reference, except for one case (as noted in the comments). I don't know if there is a way of fully replicating Hive's behavior without a slower TreeNode implementation, so I've erred on the side of performance instead. Author: Davies Liu <davies@databricks.com> Author: Yin Huai <yhuai@databricks.com> Author: Nathan Howell <nhowell@godaddy.com> Closes #7901 from davies/get_json_object and squashes the following commits: 3ace9b9 [Davies Liu] Merge branch 'get_json_object' of github.com:davies/spark into get_json_object 98766fc [Davies Liu] Merge branch 'master' of github.com:apache/spark into get_json_object a7dc6d0 [Davies Liu] Update JsonExpressionsSuite.scala c818519 [Yin Huai] new results. 18ce26b [Davies Liu] fix tests 6ac29fb [Yin Huai] Golden files. 25eebef [Davies Liu] use HiveQuerySuite e0ac6ec [Yin Huai] Golden answer files. 940c060 [Davies Liu] tweak code style 44084c5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into get_json_object 9192d09 [Nathan Howell] Match Hive’s behavior for unwrapping arrays of one element 8dab647 [Nathan Howell] [SPARK-8246] [SQL] Implement get_json_object
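A brief usage sketch (assumes a SQLContext named sqlContext; the result shown is illustrative):

```python
from pyspark.sql import Row

df = sqlContext.createDataFrame(
    [Row(json='{"store": {"book": [{"title": "spark"}]}}')])
df.selectExpr("get_json_object(json, '$.store.book[0].title') AS title").collect()
# [Row(title=u'spark')]
```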
-
Tarek Auel authored
This PR is based on #7186 (just fixing the conflict), thanks to tarekauel. find_in_set(string str, string strList): int Returns the first occurrence of str in strList, where strList is a comma-delimited string. Returns null if either argument is null. Returns 0 if the first argument contains any commas. For example, find_in_set('ab', 'abc,b,ab,c,def') returns 3. Only add this to SQL, not DataFrame. Closes #7186 Author: Tarek Auel <tarek.auel@googlemail.com> Author: Davies Liu <davies@databricks.com> Closes #7900 from davies/find_in_set and squashes the following commits: 4334209 [Davies Liu] Merge branch 'master' of github.com:apache/spark into find_in_set 8f00572 [Davies Liu] Merge branch 'master' of github.com:apache/spark into find_in_set 243ede4 [Tarek Auel] [SPARK-8244][SQL] hive compatibility 1aaf64e [Tarek Auel] [SPARK-8244][SQL] unit test fix e4093a4 [Tarek Auel] [SPARK-8244][SQL] final modifier for COMMA_UTF8 0d05df5 [Tarek Auel] Merge branch 'master' into SPARK-8244 208d710 [Tarek Auel] [SPARK-8244] address comments & bug fix 71b2e69 [Tarek Auel] [SPARK-8244] find_in_set 66c7fda [Tarek Auel] Merge branch 'master' into SPARK-8244 61b8ca2 [Tarek Auel] [SPARK-8224] removed loop and split; use unsafe String comparison 4f75a65 [Tarek Auel] Merge branch 'master' into SPARK-8244 e3b20c8 [Tarek Auel] [SPARK-8244] added type check 1c2bbb7 [Tarek Auel] [SPARK-8244] findInSet
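The semantics above as a hedged SQL sketch (sqlContext assumed; the results in comments follow the description):

```python
sqlContext.sql("SELECT find_in_set('ab', 'abc,b,ab,c,def')").collect()    # 3
sqlContext.sql("SELECT find_in_set('ab,c', 'abc,b,ab,c,def')").collect()  # 0: str has a comma
sqlContext.sql("SELECT find_in_set(NULL, 'abc,b,ab')").collect()          # NULL
```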
-
Marcelo Vanzin authored
This allows build/mvn to be used by make-distribution.sh. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #7915 from vanzin/SPARK-9583 and squashes the following commits: 6469e60 [Marcelo Vanzin] [SPARK-9583] [build] Do not print mvn debug messages to stdout.
-
Carson Wang authored
Add pagination for the RDD page to avoid an unresponsive UI when the number of RDD partitions is large. (Before/after screenshots are attached to the PR.) Author: Carson Wang <carson.wang@intel.com> Closes #7692 from carsonwang/SPARK-2016 and squashes the following commits: 03c7168 [Carson Wang] Fix style issues 612c18c [Carson Wang] RDD partition table pagination for the RDD Page
-
tedyu authored
This PR removes the dependency reduced POM hack brought back by #7191 Author: tedyu <yuzhihong@gmail.com> Closes #7919 from tedyu/master and squashes the following commits: 1bfbd7b [tedyu] [BUILD] Remove dependency reduced POM hack
-
Sean Owen authored
[SPARK-9534] [BUILD] Enable javac lint for scalac parity; fix a lot of build warnings, 1.5.0 edition Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process. I'll explain several of the changes inline in comments. Author: Sean Owen <sowen@cloudera.com> Closes #7862 from srowen/SPARK-9534 and squashes the following commits: ea51618 [Sean Owen] Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.
-
Ankur Dave authored
SPARK-3190 was originally fixed by 96df9290, but a5ef5811 introduced a regression during refactoring. This commit fixes the regression. Author: Ankur Dave <ankurdave@gmail.com> Closes #7923 from ankurdave/SPARK-3190-reopening and squashes the following commits: a3e1b23 [Ankur Dave] Fix VertexRDD.count() overflow regression
-
- Aug 03, 2015
-
-
Sean Owen authored
Follow on for #7852: Building Spark doc needs to refer to new Maven requirement too Author: Sean Owen <sowen@cloudera.com> Closes #7905 from srowen/SPARK-9521.2 and squashes the following commits: 73285df [Sean Owen] Follow on for #7852: Building Spark doc needs to refer to new Maven requirement too
-
Reynold Xin authored
We often return abstract iterator types in various sort-related classes (e.g. UnsafeKVExternalSorter). It is actually better to return a more concrete type, so the callsite uses that type and JIT can inline the iterator calls. Author: Reynold Xin <rxin@databricks.com> Closes #7911 from rxin/surface-concrete-type and squashes the following commits: 0422add [Reynold Xin] [SPARK-9577][SQL] Surface concrete iterator types in various sort classes.
-
CodingCat authored
https://issues.apache.org/jira/browse/SPARK-8416 To facilitate debugging, I made this patch with three changes: * render the executor-thread and non-executor-thread entries with different background colors * put the executor threads at the top of the list * sort the threads alphabetically Author: CodingCat <zhunansjtu@gmail.com> Closes #7808 from CodingCat/SPARK-8416 and squashes the following commits: 34fc708 [CodingCat] fix className d7b79dd [CodingCat] lowercase threadName d032882 [CodingCat] sort alphabetically and change the css class name f0513b1 [CodingCat] change the color & group threads by name 2da6e06 [CodingCat] small fix 3fc9f36 [CodingCat] define classes in webui.css 8ee125e [CodingCat] highlight and put on top the executor threads in thread dumping page
-
Burak Yavuz authored
While the functionality to exclude packages is there, there are no flags that allow users to exclude dependencies in case of dependency conflicts. We should provide users with a flag to add dependency exclusions in case the packages are not resolved properly (or not available due to licensing). The flag I added was --packages-exclude, but I'm open to renaming it. I also added property flags in case people would like to use a conf file to provide dependencies, which is possible if there is a long list of dependencies or exclusions. cc andrewor14 vanzin pwendell Author: Burak Yavuz <brkyvz@gmail.com> Closes #7599 from brkyvz/packages-exclusions and squashes the following commits: 636f410 [Burak Yavuz] addressed nits 6e54ede [Burak Yavuz] is this the culprit b5e508e [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into packages-exclusions 154f5db [Burak Yavuz] addressed initial comments 1536d7a [Burak Yavuz] Added flags to exclude packages using --packages-exclude
-
Matthew Brandyberry authored
Previous code assumed little-endian. Author: Matthew Brandyberry <mbrandy@us.ibm.com> Closes #7902 from mtbrandy/SPARK-9483 and squashes the following commits: ec31df8 [Matthew Brandyberry] [SPARK-9483] Changes from review comments. 17d54c6 [Matthew Brandyberry] [SPARK-9483] Fix UTF8String.getPrefix for big-endian.
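Why byte order matters here: the 8-byte prefix must compare (as an integer) the same way the underlying bytes compare lexicographically. A hedged Python illustration of the two interpretations:

```python
import struct

data = b"abcdefgh"
big    = struct.unpack(">Q", data)[0]  # big-endian: first byte most significant
little = struct.unpack("<Q", data)[0]  # little-endian: last byte most significant

# only the big-endian interpretation preserves lexicographic byte order,
# so the byte-swap needed to build the prefix differs by platform
assert big != little
assert struct.unpack(">Q", b"abcdefgi")[0] > big  # 'i' > 'h' => larger prefix
```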
-
Shivaram Venkataraman authored
This puts all the install commands that need to be run in one section instead of being spread over many paragraphs. cc rxin Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #7912 from shivaram/docs-setup-readme and squashes the following commits: cf7a204 [Shivaram Venkataraman] Add a prerequisites section for building docs
-
MechCoder authored
Add the missing methods getVectors and findSynonyms to the Word2Vec Scala and Python APIs. mengxr Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #7263 from MechCoder/missing_methods_w2vec and squashes the following commits: 149d5ca [MechCoder] minor doc 69d91b7 [MechCoder] [SPARK-8874] [ML] Add missing methods in Word2Vec
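A hedged sketch of the two methods from Python (SparkContext sc assumed; the corpus is repeated so words clear Word2Vec's default minimum count):

```python
from pyspark.mllib.feature import Word2Vec

corpus = sc.parallelize([["spark", "is", "fast"], ["spark", "is", "scalable"]] * 50)
model = Word2Vec().setVectorSize(10).setSeed(42).fit(corpus)

vectors = model.getVectors()               # word -> vector map (added here)
synonyms = model.findSynonyms("spark", 2)  # [(word, cosine similarity), ...]
```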
-
Steve Loughran authored
Cherry-picked the parts of the initial SPARK-8064 WIP branch needed to get sql/hive to compile against Hive 1.2.1. That's the ASF release packaged under org.apache.hive, not any fork. Tests not run yet: that's what the machines are for. Author: Steve Loughran <stevel@hortonworks.com> Author: Cheng Lian <lian@databricks.com> Author: Michael Armbrust <michael@databricks.com> Author: Patrick Wendell <patrick@databricks.com> Closes #7191 from steveloughran/stevel/feature/SPARK-8064-hive-1.2-002 and squashes the following commits: 7556d85 [Cheng Lian] Updates .q files and corresponding golden files ef4af62 [Steve Loughran] Merge commit '6a92bb09f46a04d6cd8c41bdba3ecb727ebb9030' into stevel/feature/SPARK-8064-hive-1.2-002 6a92bb0 [Cheng Lian] Overrides HiveConf time vars dcbb391 [Cheng Lian] Adds com.twitter:parquet-hadoop-bundle:1.6.0 for Hive Parquet SerDe 0bbe475 [Steve Loughran] SPARK-8064 scalastyle rejects the standard Hadoop ASF license header... fdf759b [Steve Loughran] SPARK-8064 classpath dependency suite to be in sync with shading in final (?) hive-exec spark 7a6c727 [Steve Loughran] SPARK-8064 switch to second staging repo of the spark-hive artifacts. This one has the protobuf-shaded hive-exec jar 376c003 [Steve Loughran] SPARK-8064 purge duplicate protobuf declaration 2c74697 [Steve Loughran] SPARK-8064 switch to the protobuf shaded hive-exec jar with tests to chase it down cc44020 [Steve Loughran] SPARK-8064 remove hadoop.version from runtest.py, as the profile will fix that automatically. 6901fa9 [Steve Loughran] SPARK-8064 explicit protobuf import da310dc [Michael Armbrust] Fixes for Hive tests. a775a75 [Steve Loughran] SPARK-8064 cherry-pick-incomplete 7404f34 [Patrick Wendell] Add spark-hive staging repo 832c164 [Steve Loughran] SPARK-8064 try to suppress compiler warnings on Complex.java pasted-thrift-code 312c0d4 [Steve Loughran] SPARK-8064 maven/ivy dependency purge; calcite declaration needed fa5ae7b [Steve Loughran] HIVE-8064 fix up hive-thriftserver dependencies and cut back on evicted references in the hive- packages; this keeps mvn and ivy resolution compatible, as the reconciliation policy is "by hand" c188048 [Steve Loughran] SPARK-8064 manage the Hive dependencies so that things that aren't needed are excluded, and sql/hive built with ivy is in sync with the maven reconciliation policy, rather than latest-first 4c8be8d [Cheng Lian] WIP: Partial fix for Thrift server and CLI tests 314eb3c [Steve Loughran] SPARK-8064 deprecation warning noise in one of the tests 17b0341 [Steve Loughran] SPARK-8064 IDE-hinted cleanups of Complex.java to reduce compiler warnings. It's all autogenerated code, so still ugly. d029b92 [Steve Loughran] SPARK-8064 rely on unescaping to have already taken place, so go straight to map of serde options 23eca7e [Steve Loughran] HIVE-8064 handle raw and escaped property tokens 54d9b06 [Steve Loughran] SPARK-8064 fix compilation regression surfacing from rebase 0b12d5f [Steve Loughran] HIVE-8064 use subset of hive complex type whose types deserialize fce73b6 [Steve Loughran] SPARK-8064 poms rely implicitly on the version of kryo chill provides fd3aa5d [Steve Loughran] SPARK-8064 version of hive to d/l from ivy is 1.2.1 dc73ece [Steve Loughran] SPARK-8064 revert to master's deterministic pushdown strategy d3c1e4a [Steve Loughran] SPARK-8064 purge UnionType 051cc21 [Steve Loughran] SPARK-8064 switch to an unshaded version of hive-exec-core, which must have been built with Kryo 2.21.
This currently looks for a (locally built) version 1.2.1.spark. 6684c60 [Steve Loughran] SPARK-8064 ignore RTE raised in blocking process.exitValue() call e6121e5 [Steve Loughran] SPARK-8064 address review comments aa43dc6 [Steve Loughran] SPARK-8064 more robust teardown on JavaMetastoreDatasourcesSuite f2bff01 [Steve Loughran] SPARK-8064 better takeup of asynchronously caught error text 8b1ef38 [Steve Loughran] SPARK-8064: on failures executing spark-submit in HiveSparkSubmitSuite, print command line and all logged output. 5a9ce6b [Steve Loughran] SPARK-8064 add explicit reason for kv split failure, rather than array OOB. *does not address the issue* 642b63a [Steve Loughran] SPARK-8064 reinstate something cut briefly during rebasing 97194dc [Steve Loughran] SPARK-8064 add extra logging to the YarnClusterSuite classpath test. There should be no reason why this is failing on jenkins, but as it is (and presumably it's CP-related), improve the logging including any exception raised. 335357f [Steve Loughran] SPARK-8064 fail fast on thrift process spawning tests on exit codes and/or error string patterns seen in log. 3ed872f [Steve Loughran] SPARK-8064 rename field double to dbl bca55e5 [Steve Loughran] SPARK-8064 missed one of the `date` escapes 41d6479 [Steve Loughran] SPARK-8064 wrap tests with withTable() calls to avoid table-exists exceptions 2bc29a4 [Steve Loughran] SPARK-8064 ParquetSuites to escape `date` field name 1ab9bc4 [Steve Loughran] SPARK-8064 TestHive to use serde2.thrift.test.Complex bf3a249 [Steve Loughran] SPARK-8064: more resubmit than fix; tighten startup timeout to 60s. Still no obvious reason why jersey server code in spark-assembly isn't being picked up - it hasn't been shaded c829b8f [Steve Loughran] SPARK-8064: reinstate yarn-rm-server dependencies to hive-exec to ensure that jersey server is on classpath on hadoop versions < 2.6 0b0f738 [Steve Loughran] SPARK-8064: thrift server startup to fail fast on any exception in the main thread 13abaf1 [Steve Loughran] SPARK-8064 Hive compatibility tests in sync with explain/show output from Hive 1.2.1 d14d5ea [Steve Loughran] SPARK-8064: DATE is now a predicate; you can't use it as a field in select ops 26eef1c [Steve Loughran] SPARK-8064: HIVE-9039 renamed TOK_UNION => TOK_UNIONALL while adding TOK_UNIONDISTINCT 3d64523 [Steve Loughran] SPARK-8064 improve diagnostics on unknown token; fix scalastyle failure d0360f6 [Steve Loughran] SPARK-8064: delicate merge in of the branch vanzin/hive-1.1 1126e5a [Steve Loughran] SPARK-8064: name of unrecognized file format wasn't appearing in error text 8cb09c4 [Steve Loughran] SPARK-8064: test resilience/assertion improvements. Independent of the rest of the work; can be backported to earlier versions dec12cb [Steve Loughran] SPARK-8064: when a CLI suite test fails include the full output text in the raised exception; this ensures that the stdout/stderr is included in jenkins reports, so it becomes possible to diagnose the cause.
463a670 [Steve Loughran] SPARK-8064 run-tests.py adds a hadoop-2.6 profile, and changes info messages to say "w/Hive 1.2.1" in console output 2531099 [Steve Loughran] SPARK-8064 successful attempt to get rid of pentaho as a transitive dependency of hive-exec 1d59100 [Steve Loughran] SPARK-8064 (unsuccessful) attempt to get rid of pentaho as a transitive dependency of hive-exec 75733fc [Steve Loughran] SPARK-8064 change thrift binary startup message to "Starting ThriftBinaryCLIService on port" 3ebc279 [Steve Loughran] SPARK-8064 move strings used to check for http/bin thrift services up into constants c80979d [Steve Loughran] SPARK-8064: SparkSQLCLIDriver drops remote mode support. CLISuite tests pass instead of timing out: undetected regression? 27e8370 [Steve Loughran] SPARK-8064 fix some style & IDE warnings 00e50d6 [Steve Loughran] SPARK-8064 stop excluding hive shims from dependency (commented out, for now) cb4f142 [Steve Loughran] SPARK-8054 cut pentaho dependency from calcite f7aa9cb [Steve Loughran] SPARK-8064 everything compiles with some commenting and moving of classes into a hive package 6c310b4 [Steve Loughran] SPARK-8064 subclass Hive ServerOptionsProcessor to make it public again f61a675 [Steve Loughran] SPARK-8064 thrift server switched to Hive 1.2.1, though it doesn't compile everywhere 4890b9d [Steve Loughran] SPARK-8064, build against Hive 1.2.1
-
Reynold Xin authored
This reverts commit 687c8c37.
-
Andrew Or authored
This patch exposes the memory used by internal data structures on the SparkUI. This tracks memory used by all spilling operations and SQL operators backed by Tungsten, e.g. `BroadcastHashJoin`, `ExternalSort`, `GeneratedAggregate`, etc. The metric exposed is "peak execution memory", which broadly refers to the peak in-memory size of each of these data structures. A separate patch will extend this by linking the new information to the SQL operators themselves. (Screenshots of the new UI metric are attached to the PR.) Author: Andrew Or <andrew@databricks.com> Closes #7770 from andrewor14/expose-memory-metrics and squashes the following commits: 9abecb9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics f5b0d68 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics d7df332 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics 8eefbc5 [Andrew Or] Fix non-failing tests 9de2a12 [Andrew Or] Fix tests due to another logical merge conflict 876bfa4 [Andrew Or] Fix failing test after logical merge conflict 361a359 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics 40b4802 [Andrew Or] Fix style? d0fef87 [Andrew Or] Fix tests? b3b92f6 [Andrew Or] Address comments 0625d73 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics c00a197 [Andrew Or] Fix potential NPEs 10da1cd [Andrew Or] Fix compile 17f4c2d [Andrew Or] Fix compile? a87b4d0 [Andrew Or] Fix compile?
d70874d [Andrew Or] Fix test compile + address comments 2840b7d [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics 6aa2f7a [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics b889a68 [Andrew Or] Minor changes: comments, spacing, style 663a303 [Andrew Or] UnsafeShuffleWriter: update peak memory before close d090a94 [Andrew Or] Fix style 2480d84 [Andrew Or] Expand test coverage 5f1235b [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics 1ecf678 [Andrew Or] Minor changes: comments, style, unused imports 0b6926c [Andrew Or] Oops 111a05e [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics a7a39a5 [Andrew Or] Strengthen presence check for accumulator a919eb7 [Andrew Or] Add tests for unsafe shuffle writer 23c845d [Andrew Or] Add tests for SQL operators a757550 [Andrew Or] Address comments b5c51c1 [Andrew Or] Re-enable test in JavaAPISuite 5107691 [Andrew Or] Add tests for internal accumulators 59231e4 [Andrew Or] Fix tests 9528d09 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics 5b5e6f3 [Andrew Or] Add peak execution memory to summary table + tooltip 92b4b6b [Andrew Or] Display peak execution memory on the UI eee5437 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics d9b9015 [Andrew Or] Track execution memory in unsafe shuffles 770ee54 [Andrew Or] Track execution memory in broadcast joins 9c605a4 [Andrew Or] Track execution memory in GeneratedAggregate 9e824f2 [Andrew Or] Add back execution memory tracking for *ExternalSort 4ef4cb1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics e6c3e2f [Andrew Or] Move internal accumulators creation to Stage a417592 [Andrew Or] Expose memory metrics in UnsafeExternalSorter 3c4f042 [Andrew Or] Track memory usage in ExternalAppendOnlyMap / ExternalSorter bd7ab3f [Andrew Or] Add internal accumulators to TaskContext
-
Xiangrui Meng authored
Add Python API for RFormula. Similar to other feature transformers in Python, this is just a thin wrapper over the Scala implementation. ericl MechCoder Author: Xiangrui Meng <meng@databricks.com> Closes #7879 from mengxr/SPARK-9544 and squashes the following commits: 3d5ff03 [Xiangrui Meng] add a doctest for '.' and '-' 5e969a5 [Xiangrui Meng] fix pydoc 1cd41f8 [Xiangrui Meng] organize imports 3c18b10 [Xiangrui Meng] add Python API for RFormula
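A hedged usage sketch (sqlContext assumed): an R-style formula compiled into features and label columns.

```python
from pyspark.ml.feature import RFormula

df = sqlContext.createDataFrame(
    [(1.0, 2.0, "a"), (0.0, 3.0, "b")], ["y", "x", "s"])
rf = RFormula(formula="y ~ x + s", featuresCol="features", labelCol="label")
rf.fit(df).transform(df).select("features", "label").show()
```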
-
Yanbo Liang authored
Add ml.PCA user guide document and code examples for Scala/Java/Python. Author: Yanbo Liang <ybliang8@gmail.com> Closes #7522 from yanboliang/ml-pca-md and squashes the following commits: 60dec05 [Yanbo Liang] address comments f992abe [Yanbo Liang] Add ml.PCA doc and examples
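In the spirit of the new examples, a hedged Python sketch (sqlContext assumed):

```python
from pyspark.ml.feature import PCA
from pyspark.mllib.linalg import Vectors

df = sqlContext.createDataFrame(
    [(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
     (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)], ["features"])
pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
pca.fit(df).transform(df).select("pcaFeatures").show()
```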
-
Kousuke Saruta authored
Now the memory defaults of master and slave in Standalone mode and the History Server are 1g, not 512m, so let's update the docs. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #7896 from sarutak/update-doc-for-daemon-memory and squashes the following commits: a77626c [Kousuke Saruta] Fix docs to follow the increase of the memory defaults
-
Joseph K. Bradley authored
Added featureImportance to RandomForestClassifier and Regressor. This follows the scikit-learn implementation here: [https://github.com/scikit-learn/scikit-learn/blob/a95203b249c1cf392f86d001ad999e29b2392739/sklearn/tree/_tree.pyx#L3341] CC: yanboliang Would you mind taking a look? Thanks! Author: Joseph K. Bradley <joseph@databricks.com> Author: Feynman Liang <fliang@databricks.com> Closes #7838 from jkbradley/dt-feature-importance and squashes the following commits: 72a167a [Joseph K. Bradley] fixed unit test 86cea5f [Joseph K. Bradley] Modified RF featuresImportances to return Vector instead of Map 5aa74f0 [Joseph K. Bradley] finally fixed unit test for real 33df5db [Joseph K. Bradley] fix unit test 42a2d3b [Joseph K. Bradley] fix unit test fe94e72 [Joseph K. Bradley] modified feature importance unit tests cc693ee [Feynman Liang] Add classifier tests 79a6f87 [Feynman Liang] Compare dense vectors in test 21d01fc [Feynman Liang] Added failing SKLearn test ac0b254 [Joseph K. Bradley] Added featureImportance to RandomForestClassifier/Regressor. Need to add unit tests
-
Cheng Lian authored
Author: Cheng Lian <lian@databricks.com> Closes #7895 from liancheng/spark-9554/enable-in-memory-partition-pruning and squashes the following commits: 67c403e [Cheng Lian] Enables in-memory partition pruning by default
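The setting being flipped (key name per SQLConf); it can still be turned off per session:

```python
# in-memory columnar partition pruning is now on by default; disable if needed
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.partitionPruning", "false")
```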
-
Reynold Xin authored
Author: Reynold Xin <rxin@databricks.com> Closes #7897 from rxin/calculateBitSetWidthInBytes and squashes the following commits: 2e73b3a [Reynold Xin] [SQL][minor] Simplify UnsafeRow.calculateBitSetWidthInBytes.
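For context, a hedged Python restatement of the quantity being computed (UnsafeRow reserves one 8-byte null-bit word per 64 fields):

```python
def calculate_bitset_width_in_bytes(num_fields):
    # one 8-byte word of null-tracking bits per 64 fields, rounded up
    return ((num_fields + 63) // 64) * 8

assert calculate_bitset_width_in_bytes(0) == 0
assert calculate_bitset_width_in_bytes(1) == 8
assert calculate_bitset_width_in_bytes(65) == 16
```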
-
Joseph Batchik authored
The issue was that the tokenizer was parsing "1one" into the numeric 1 using the code on line 110. I added another case to accept strings that start with a number and contain a letter somewhere else in them as well. Author: Joseph Batchik <joseph.batchik@cloudera.com> Closes #7844 from JDrit/parse_error and squashes the following commits: b8ca12f [Joseph Batchik] fixed parsing issue by adding another case
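A minimal illustration of the lexing rule (the regexes are illustrative, not Spark's actual lexer):

```python
import re

# an all-digit token is numeric; a token that starts with digits but contains
# a letter later ("1one") is identifier-like and must not parse as the number 1
IDENTIFIER_LIKE = re.compile(r"\d+[A-Za-z_][A-Za-z0-9_]*")
NUMERIC = re.compile(r"\d+")

assert IDENTIFIER_LIKE.fullmatch("1one")
assert not IDENTIFIER_LIKE.fullmatch("123")
assert NUMERIC.fullmatch("123")
```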
-
Andrew Or authored
Certain use cases of Spark involve RDDs with long lineages that must be truncated periodically (e.g. GraphX). The existing way of doing it is through `rdd.checkpoint()`, which is expensive because it writes to HDFS. This patch provides an alternative to truncate lineages cheaply *without providing the same level of fault tolerance*. **Local checkpointing** writes checkpointed data to the local file system through the block manager. It is much faster than replicating to reliable storage and provides the same semantics as long as executors do not fail. It is accessible through a new operator `rdd.localCheckpoint()` and leaves the old one unchanged. Users may even decide to combine the two and call the reliable one less frequently. The bulk of this patch involves refactoring the checkpointing interface to accept custom implementations of checkpointing. [Design doc](https://issues.apache.org/jira/secure/attachment/12741708/SPARK-7292-design.pdf). Author: Andrew Or <andrew@databricks.com> Closes #7279 from andrewor14/local-checkpoint and squashes the following commits: 729600f [Andrew Or] Oops, fix tests 34bc059 [Andrew Or] Avoid computing all partitions in local checkpoint e43bbb6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint 3be5aea [Andrew Or] Address comments bf846a6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint ab003a3 [Andrew Or] Fix compile c2e111b [Andrew Or] Address comments 33f167a [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint e908a42 [Andrew Or] Fix tests f5be0f3 [Andrew Or] Use MEMORY_AND_DISK as the default local checkpoint level a92657d [Andrew Or] Update a few comments e58e3e3 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint 4eb6eb1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint 1bbe154 [Andrew Or] Simplify LocalCheckpointRDD 48a9996 [Andrew Or] Avoid traversing dependency tree + rewrite tests 62aba3f [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint db70dc2 [Andrew Or] Express local checkpointing through caching the original RDD 87d43c6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint c449b38 [Andrew Or] Fix style 4a182f3 [Andrew Or] Add fine-grained tests for local checkpointing 53b363b [Andrew Or] Rename a few more awkwardly named methods (minor) e4cf071 [Andrew Or] Simplify LocalCheckpointRDD + docs + clean ups 4880deb [Andrew Or] Fix style d096c67 [Andrew Or] Fix mima 172cb66 [Andrew Or] Fix mima? e53d964 [Andrew Or] Fix style 56831c5 [Andrew Or] Add a few warnings and clear exception messages 2e59646 [Andrew Or] Add local checkpoint clean up tests 4dbbab1 [Andrew Or] Refactor CheckpointSuite to test local checkpointing 4514dc9 [Andrew Or] Clean local checkpoint files through RDD cleanups 0477eec [Andrew Or] Rename a few methods with awkward names (minor) 2e902e5 [Andrew Or] First implementation of local checkpointing 8447454 [Andrew Or] Fix tests 4ac1896 [Andrew Or] Refactor checkpoint interface for modularity
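A minimal sketch of the new operator, shown through the Python RDD API on the assumption that the wrapper mirrors the new Scala one (sc assumed):

```python
rdd = sc.parallelize(range(1000))
for _ in range(100):
    rdd = rdd.map(lambda x: x + 1)  # lineage grows with every transformation

rdd.localCheckpoint()  # cheap truncation via the block manager; not fault-tolerant
rdd.count()            # an action materializes the checkpoint
```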
-
Joseph K. Bradley authored
RandomForestClassifier now outputs rawPrediction based on tree probabilities, plus probability column computed from normalized rawPrediction. CC: holdenk Author: Joseph K. Bradley <joseph@databricks.com> Closes #7859 from jkbradley/rf-prob and squashes the following commits: 6c28f51 [Joseph K. Bradley] Changed RandomForestClassifier to extend ProbabilisticClassifier
-
Reynold Xin authored
-
Davies Liu authored
Currently, when copying the bitsets, we didn't consider that row1 may not sit at the beginning of the byte array. cc rxin Author: Davies Liu <davies@databricks.com> Closes #7892 from davies/clean_join and squashes the following commits: 14cce9e [Davies Liu] cleanup generated UnsafeRowJoiner and fix bug
-
Wenchen Fan authored
Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7885 from cloud-fan/cheap-copy and squashes the following commits: 0900ca1 [Wenchen Fan] replace == with === 73f4ada [Wenchen Fan] add tests 07b865a [Wenchen Fan] add a cheap version of copy
-