  1. Aug 05, 2015
    • Tathagata Das's avatar
      [SPARK-9217] [STREAMING] Make the kinesis receiver reliable by recording sequence numbers · c2a71f07
      Tathagata Das authored
This PR is the second one in the larger issue of making the Kinesis integration reliable and providing a WAL-free at-least-once guarantee. It is based on the design doc - https://docs.google.com/document/d/1k0dl270EnK7uExrsCE7jYw7PYx0YC935uBcxn3p0f58/edit
      
      In this PR, I have updated the Kinesis Receiver to do the following.
- Control block generation by creating its own BlockGenerator with its own callback methods, and use it to keep track of the ranges of sequence numbers that go into each block (a sketch of this range tracking follows the list).
- More specifically, as the KinesisRecordProcessor provides small batches of records, each batch is atomically inserted into the block (that is, either the whole batch is in the block, or none of it is), and the sequence-number range of the batch is recorded. Since many batches may be added to a block, the receiver tracks all the ranges of sequence numbers added to each block.
- When the block is ready to be pushed, it is pushed with the ranges reported as block metadata. In addition, the ranges are used to find the latest sequence number for each shard that can be checkpointed through DynamoDB.
- Periodically, each KinesisRecordProcessor checkpoints the latest successfully stored sequence number for its own shard.
- The array of ranges in the block metadata is used to create KinesisBackedBlockRDDs. The ReceiverInputDStream has been slightly refactored to allow the creation of KinesisBackedBlockRDDs instead of the WALBackedBlockRDDs.
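
For illustration, a minimal Scala sketch of the range-tracking idea; `SequenceNumberRange`, `RangeTrackingReceiver`, and the method names here are expository assumptions, not the PR's actual classes:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch: each stored block carries the Kinesis sequence-number
// ranges of the record batches that went into it.
case class SequenceNumberRange(shardId: String, from: String, to: String)

class RangeTrackingReceiver {
  private val rangesInCurrentBlock = ArrayBuffer.empty[SequenceNumberRange]

  // Each small batch from the KinesisRecordProcessor is added atomically:
  // the records go into the block together with their sequence-number range.
  def addRecords(shardId: String, records: Seq[Array[Byte]],
                 first: String, last: String): Unit = synchronized {
    // ... store the records into the current block here ...
    rangesInCurrentBlock += SequenceNumberRange(shardId, first, last)
  }

  // When the block is pushed, the accumulated ranges become its metadata;
  // the highest range per shard is what a record processor may checkpoint.
  def onBlockPushed(): Seq[SequenceNumberRange] = synchronized {
    val ranges = rangesInCurrentBlock.toList
    rangesInCurrentBlock.clear()
    ranges
  }
}
```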
      
      Things to be done
      - [x] Add new test to verify that the sequence numbers are recovered.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #7825 from tdas/kinesis-receiver and squashes the following commits:
      
      2159be9 [Tathagata Das] Fixed bug
      569be83 [Tathagata Das] Fix scala style issue
      bf31e22 [Tathagata Das] Added more documentation to make the kinesis test endpoint more configurable
      3ad8361 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into kinesis-receiver
      c693a63 [Tathagata Das] Removed unnecessary constructor params from KinesisTestUtils
      e1f1d0a [Tathagata Das] Addressed PR comments
      b9fa6bf [Tathagata Das] Fix serialization issues
      f8b7680 [Tathagata Das] Updated doc
      33fe43a [Tathagata Das] Added more tests
      7997138 [Tathagata Das] Fix style errors
      a806710 [Tathagata Das] Fixed unit test and use KinesisInputDStream
      40a1709 [Tathagata Das] Fixed KinesisReceiverSuite tests
      7e44df6 [Tathagata Das] Added documentation and fixed checkpointing
      096383f [Tathagata Das] Added test, and addressed some of the comments.
      84a7892 [Tathagata Das] fixed scala style issue
      e19e37d [Tathagata Das] Added license
      1cd7b66 [Tathagata Das] Updated kinesis receiver
      c2a71f07
    • Davies Liu's avatar
      [SPARK-9119] [SPARK-8359] [SQL] match Decimal.precision/scale with DecimalType · 781c8d71
      Davies Liu authored
      Let Decimal carry the correct precision and scale with DecimalType.
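
For reference, the precision/scale semantics in question, shown with Scala's BigDecimal rather than Spark's Decimal (illustrative only):

```scala
// precision = total number of significant digits,
// scale     = number of digits to the right of the decimal point.
val d = BigDecimal("123.45")
assert(d.precision == 5) // "12345" has five significant digits
assert(d.scale == 2)     // two digits after the point
// A DecimalType(precision, scale) column should only hold values
// whose precision and scale fit within those bounds.
```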
      
      cc rxin yhuai
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7925 from davies/decimal_scale and squashes the following commits:
      
      e19701a [Davies Liu] some tweaks
      57d78d2 [Davies Liu] fix tests
      5d5bc69 [Davies Liu] match precision and scale with DecimalType
      781c8d71
    • Pedro Rodriguez's avatar
      [SPARK-8231] [SQL] Add array_contains · d3454858
      Pedro Rodriguez authored
This PR is based on #7580, thanks to EntilZha.
      
      PR for work on https://issues.apache.org/jira/browse/SPARK-8231
      
Currently, I have an initial implementation for contains. Based on discussion on JIRA, it should behave the same as Hive: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFArrayContains.java#L102-L128
      
Main points are (see the sketch after this list):
      1. If the array is empty, null, or the value is null, return false
      2. If there is a type mismatch, throw error
      3. If comparison is not supported, throw error
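
A minimal sketch of the value semantics listed above (my paraphrase, not the actual Catalyst expression; the type checks of points 2 and 3 happen at analysis time in the real code, so they are not modeled here):

```scala
// Point 1 above: a null or empty array, or a null value, yields false.
def arrayContains[T](array: Seq[T], value: T): Boolean =
  array != null && value != null && array.contains(value)

// e.g. arrayContains(Seq(1, 2, 3), 2)      == true
//      arrayContains(Seq.empty[Int], 2)    == false
//      arrayContains(null: Seq[Int], 2)    == false
```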
      
      Closes #7580
      
      Author: Pedro Rodriguez <prodriguez@trulia.com>
      Author: Pedro Rodriguez <ski.rodriguez@gmail.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7949 from davies/array_contains and squashes the following commits:
      
      d3c08bc [Davies Liu] use foreach() to avoid copy
      bc3d1fe [Davies Liu] fix array_contains
      719e37d [Davies Liu] Merge branch 'master' of github.com:apache/spark into array_contains
      e352cf9 [Pedro Rodriguez] fixed diff from master
      4d5b0ff [Pedro Rodriguez] added docs and another type check
      ffc0591 [Pedro Rodriguez] fixed unit test
      7a22deb [Pedro Rodriguez] Changed test to use strings instead of long/ints which are different between python 2 an 3
      b5ffae8 [Pedro Rodriguez] fixed pyspark test
      4e7dce3 [Pedro Rodriguez] added more docs
      3082399 [Pedro Rodriguez] fixed unit test
      46f9789 [Pedro Rodriguez] reverted change
      d3ca013 [Pedro Rodriguez] Fixed type checking to match hive behavior, then added tests to insure this
      8528027 [Pedro Rodriguez] added more tests
      686e029 [Pedro Rodriguez] fix scala style
      d262e9d [Pedro Rodriguez] reworked type checking code and added more tests
      2517a58 [Pedro Rodriguez] removed unused import
      28b4f71 [Pedro Rodriguez] fixed bug with type conversions and re-added tests
      12f8795 [Pedro Rodriguez] fix scala style checks
      e8a20a9 [Pedro Rodriguez] added python df (broken atm)
      65b562c [Pedro Rodriguez] made array_contains nullable false
      33b45aa [Pedro Rodriguez] reordered test
      9623c64 [Pedro Rodriguez] fixed test
      4b4425b [Pedro Rodriguez] changed Arrays in tests to Seqs
      72cb4b1 [Pedro Rodriguez] added checkInputTypes and docs
      69c46fb [Pedro Rodriguez] added tests and codegen
      9e0bfc4 [Pedro Rodriguez] initial attempt at implementation
      d3454858
    • Xiangrui Meng's avatar
      [SPARK-9540] [MLLIB] optimize PrefixSpan implementation · a02bcf20
      Xiangrui Meng authored
      This is a major refactoring of the PrefixSpan implementation. It contains the following changes:
      
1. Expand the prefix one item at a time. The existing implementation generates all subsets for each itemset, which can have scalability issues when the itemset is large.
2. Use a new internal format. `<(12)(31)>` is represented by `[0, 1, 2, 0, 1, 3, 0]` internally. We use `0` as the delimiter because negative numbers are used to indicate partial prefix items, e.g., `_2` is represented by `-2` (see the decoding sketch after this list).
3. Remember the start indices of all partial projections in the projected postfix to help the next projection.
4. Reuse the original sequence array for projected postfixes.
5. Use `Prefix` IDs in aggregation rather than their content.
6. Use `ArrayBuilder` for building primitive arrays.
7. Expose `maxLocalProjDBSize`.
8. Tests are not changed except using `0` instead of `-1` as the delimiter.
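
To make the internal format concrete, a small decoding sketch (an assumed helper for exposition, not part of the implementation); partial prefix items, which the format stores as negative numbers, are not handled here:

```scala
// Itemsets are runs of positive ints separated by 0 delimiters, so
// [0, 1, 2, 0, 1, 3, 0] decodes to <(12)(31)>.
def decode(encoded: Array[Int]): Seq[Seq[Int]] = {
  val itemsets = Seq.newBuilder[Seq[Int]]
  var current = List.empty[Int]
  for (x <- encoded) {
    if (x == 0) {
      if (current.nonEmpty) itemsets += current.reverse
      current = Nil
    } else {
      current ::= x
    }
  }
  itemsets.result()
}

// decode(Array(0, 1, 2, 0, 1, 3, 0)) == Seq(Seq(1, 2), Seq(1, 3))
```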
      
      `Postfix`'s API doc should be a good place to start.
      
      Closes #7594
      
      feynmanliang zhangjiajin
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #7937 from mengxr/SPARK-9540 and squashes the following commits:
      
      2d0ec31 [Xiangrui Meng] address more comments
      48f450c [Xiangrui Meng] address comments from Feynman; fixed a bug in project and added a test
      65f90e8 [Xiangrui Meng] naming and documentation
      8afc86a [Xiangrui Meng] refactor impl
      a02bcf20
    • Reynold Xin's avatar
      Update docs/README.md to put all prereqs together. · f7abd6be
      Reynold Xin authored
This pull request groups all the prerequisites into a single section.
      
      cc srowen shivaram
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7951 from rxin/readme-docs and squashes the following commits:
      
      ab7ded0 [Reynold Xin] Updated docs/README.md to put all prereqs together.
      f7abd6be
  2. Aug 04, 2015
    • zsxwing's avatar
      [SPARK-9504] [STREAMING] [TESTS] Fix o.a.s.streaming.StreamingContextSuite.stop gracefully again · d34bac0e
      zsxwing authored
      The test failure is here: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3150/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/testReport/junit/org.apache.spark.streaming/StreamingContextSuite/stop_gracefully/
      
There is a race condition in TestReceiver: it may add 1 record and increase `TestReceiver.counter` after `BlockGenerator` has been stopped. This PR just adds a `join` to wait for the pushing thread.
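
The shape of the fix, as a sketch (hypothetical names; the real change is inside TestReceiver):

```scala
var counter = 0 // stands in for TestReceiver.counter

val pushingThread = new Thread(new Runnable {
  def run(): Unit = { counter += 1 } // stands in for pushing one record
})
pushingThread.start()
// ... stop the BlockGenerator here ...
pushingThread.join() // the added join: counter can no longer change past this point
assert(counter == 1) // safe to read now
```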
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #7934 from zsxwing/SPARK-9504-2 and squashes the following commits:
      
      cfd7973 [zsxwing] Wait for the thread to make sure we won't change TestReceiver.counter after stopping BlockGenerator
      d34bac0e
    • Davies Liu's avatar
      [SPARK-9513] [SQL] [PySpark] Add python API for DataFrame functions · 2b67fdb6
      Davies Liu authored
This adds a Python API for the DataFrame functions introduced in 1.5.

There is an issue with serializing byte_array in Python 3, so some of the functions (for BinaryType) do not have tests.
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7922 from davies/python_functions and squashes the following commits:
      
      8ad942f [Davies Liu] fix test
      5fb6ec3 [Davies Liu] fix bugs
      3495ed3 [Davies Liu] fix issues
      ea5f7bb [Davies Liu] Add python API for DataFrame functions
      2b67fdb6
    • zhichao.li's avatar
      [SPARK-7119] [SQL] Give script a default serde with the user specific types · 6f8f0e26
      zhichao.li authored
This addresses the incompatible-type exception thrown when running:
      `from (from src select transform(key, value) using 'cat' as (thing1 int, thing2 string)) t select thing1 + 2;`
      
      15/04/24 00:58:55 ERROR CliDriver: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ClassCastException: org.apache.spark.sql.types.UTF8String cannot be cast to java.lang.Integer
      	at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
      	at scala.math.Numeric$IntIsIntegral$.plus(Numeric.scala:57)
      	at org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:127)
      	at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:118)
      	at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68)
      	at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52)
      	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
      	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
      	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
      	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
      	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
      	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
      	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
      	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
      	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
      	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
      	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
      	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
      	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
      	at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819)
      	at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819)
      	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1618)
      	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1618)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
      	at org.apache.spark.scheduler.Task.run(Task.scala:64)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:209)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
      	at java.lang.Thread.run(Thread.java:722)
      
      chenghao-intel marmbrus
      
      Author: zhichao.li <zhichao.li@intel.com>
      
      Closes #6638 from zhichao-li/transDataType2 and squashes the following commits:
      
      a36cc7c [zhichao.li] style
      b9252a8 [zhichao.li] delete cacheRow
      f6968a4 [zhichao.li] give script a default serde
      6f8f0e26
    • Burak Yavuz's avatar
      [SPARK-8313] R Spark packages support · c9a4c36d
      Burak Yavuz authored
shivaram cafreeman Could you please help me test this out? Exposing and running `rPackageBuilder` from inside the shell works, but for some reason, I can't get it to work during Spark Submit. It just starts relaunching Spark Submit.
      
      For testing, you may use the R branch with [sbt-spark-package](https://github.com/databricks/sbt-spark-package). You can call spPackage, and then pass the jar using `--jars`.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #7139 from brkyvz/r-submit and squashes the following commits:
      
      0de384f [Burak Yavuz] remove unused imports 2
      d253708 [Burak Yavuz] removed unused imports
      6603d0d [Burak Yavuz] addressed comments
      4258ffe [Burak Yavuz] merged master
      ddfcc06 [Burak Yavuz] added zipping test
      3a1be7d [Burak Yavuz] don't zip
      77995df [Burak Yavuz] fix URI
      ac45527 [Burak Yavuz] added zipping of all libs
      e6bf7b0 [Burak Yavuz] add println ignores
      1bc5554 [Burak Yavuz] add assumes for tests
      9778e03 [Burak Yavuz] addressed comments
      b42b300 [Burak Yavuz] merged master
      ffd134e [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into r-submit
      d867756 [Burak Yavuz] add apache header
      eff5ba1 [Burak Yavuz] ready for review
      8838edb [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into r-submit
      e5b5a06 [Burak Yavuz] added doc
      bb751ce [Burak Yavuz] fix null bug
      0226768 [Burak Yavuz] fixed issues
      8810beb [Burak Yavuz] R packages support
      c9a4c36d
    • Yijie Shen's avatar
      [SPARK-9432][SQL] Audit expression unit tests to make sure we pass the proper numeric ranges · a7fe48f6
      Yijie Shen authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9432
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #7933 from yjshen/numeric_ranges and squashes the following commits:
      
      e719f78 [Yijie Shen] proper integral range check
      a7fe48f6
    • Holden Karau's avatar
      [SPARK-8601] [ML] Add an option to disable standardization for linear regression · d92fa141
      Holden Karau authored
All compressed-sensing applications, and some regression use cases, get better results with feature scaling turned off. However, implementing this naively by training on the dataset without any standardization gives a poor rate of convergence. Instead, we can still standardize the training dataset but penalize each component differently, obtaining effectively the same objective function as a better-conditioned numerical problem. As a result, columns with high variance are penalized less, and vice versa. Without this, all features are standardized and therefore penalized equally.

In R, there is an option for this:

standardize: Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is standardize=TRUE. If variables are in the same units already, you might not wish to standardize. See details below for y standardization with family="gaussian".
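
Loosely, the trick being described (my notation, not the PR's): train on standardized features but scale each coefficient's L2 penalty by its feature's standard deviation, so the objective matches the unstandardized problem while staying well-conditioned:

```latex
% Equivalence sketch: with \tilde{x}_j = x_j / \sigma_j and \beta_j = w_j / \sigma_j,
% the unstandardized penalty \lambda \sum_j \beta_j^2 becomes, in standardized space,
\min_{w}\; L\big(w;\, \tilde{X}, y\big) \;+\; \lambda \sum_j \frac{w_j^2}{\sigma_j^2}
```

Since large-variance columns have large \sigma_j, their coefficients are penalized less, exactly as the description says.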
      
      Note that the primary author for this PR is holdenk
      
      Author: Holden Karau <holden@pigscanfly.ca>
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #7875 from dbtsai/SPARK-8522 and squashes the following commits:
      
      e856036 [DB Tsai] scala doc
      596e96c [DB Tsai] minor
      bbff347 [DB Tsai] naming
      baa0805 [DB Tsai] touch up
      d6234ba [DB Tsai] Merge branch 'master' into SPARK-8522-Disable-Linear_featureScaling-Spark-8601-in-Linear_regression
      6b1dc09 [Holden Karau] Merge branch 'master' into SPARK-8522-Disable-Linear_featureScaling-Spark-8601-in-Linear_regression
      332f140 [Holden Karau] Merge in master
      eebe10a [Holden Karau] Use same comparision operator throughout the test
      3f92935 [Holden Karau] merge
      b83a41e [Holden Karau] Expand the tests and make them similar to the other PR also providing an option to disable standardization (but for LoR).
      0c334a2 [Holden Karau] Remove extra line
      99ce053 [Holden Karau] merge in master
      e54a8a9 [Holden Karau] Fix long line
      e47c574 [Holden Karau] Add support for L2 without standardization.
      55d3a66 [Holden Karau] Add standardization param for linear regression
      00a1dc5 [Holden Karau] Add the param to the linearregression impl
      d92fa141
    • Feynman Liang's avatar
      [SPARK-9609] [MLLIB] Fix spelling of Strategy.defaultStrategy · 629e26f7
      Feynman Liang authored
      jkbradley
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #7941 from feynmanliang/SPARK-9609-stategy-spelling and squashes the following commits:
      
      d2aafb1 [Feynman Liang] Add deprecated backwards compatibility
      aa090a8 [Feynman Liang] Fix spelling
      629e26f7
    • Wenchen Fan's avatar
      [SPARK-9598][SQL] do not expose generic getter in internal row · 7c8fc1f7
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7932 from cloud-fan/generic-getter and squashes the following commits:
      
      c60de4c [Wenchen Fan] do not expose generic getter in internal row
      7c8fc1f7
    • Joseph K. Bradley's avatar
      [SPARK-9586] [ML] Update BinaryClassificationEvaluator to use setRawPredictionCol · b77d3b96
      Joseph K. Bradley authored
      Update BinaryClassificationEvaluator to use setRawPredictionCol, rather than setScoreCol. Deprecated setScoreCol.
      
      I don't think setScoreCol was actually used anywhere (based on search).
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #7921 from jkbradley/binary-eval-rawpred and squashes the following commits:
      
      e5d7dfa [Joseph K. Bradley] Update BinaryClassificationEvaluator to use setRawPredictionCol
      b77d3b96
    • Mike Dusenberry's avatar
      [SPARK-6485] [MLLIB] [PYTHON] Add CoordinateMatrix/RowMatrix/IndexedRowMatrix to PySpark. · 571d5b53
      Mike Dusenberry authored
      This PR adds the RowMatrix, IndexedRowMatrix, and CoordinateMatrix distributed matrices to PySpark.  Each distributed matrix class acts as a wrapper around the Scala/Java counterpart by maintaining a reference to the Java object.  New distributed matrices can be created using factory methods added to DistributedMatrices, which creates the Java distributed matrix and then wraps it with the corresponding PySpark class.  This design allows for simple conversion between the various distributed matrices, and lets us re-use the Scala code.  Serialization between Python and Java is implemented using DataFrames as needed for IndexedRowMatrix and CoordinateMatrix for simplicity.  Associated documentation and unit-tests have also been added.  To facilitate code review, this PR implements access to the rows/entries as RDDs, the number of rows & columns, and conversions between the various distributed matrices (not including BlockMatrix), and does not implement the other linear algebra functions of the matrices, although this will be very simple to add now.
      
      Author: Mike Dusenberry <mwdusenb@us.ibm.com>
      
      Closes #7554 from dusenberrymw/SPARK-6485_Add_CoordinateMatrix_RowMatrix_IndexedMatrix_to_PySpark and squashes the following commits:
      
      bb039cb [Mike Dusenberry] Minor documentation update.
      b887c18 [Mike Dusenberry] Updating the matrix conversion logic again to make it even cleaner.  Now, we allow the 'rows' parameter in the constructors to be either an RDD or the Java matrix object. If 'rows' is an RDD, we create a Java matrix object, wrap it, and then store that.  If 'rows' is a Java matrix object of the correct type, we just wrap and store that directly.  This is only for internal usage, and publicly, we still require 'rows' to be an RDD.  We no longer store the 'rows' RDD, and instead just compute it from the Java object when needed.  The point of this is that when we do matrix conversions, we do the conversion on the Scala/Java side, which returns a Java object, so we should use that directly, but exposing 'java_matrix' parameter in the public API is not ideal. This non-public feature of allowing 'rows' to be a Java matrix object is documented in the '__init__' constructor docstrings, which are not part of the generated public API, and doctests are also included.
      7f0dcb6 [Mike Dusenberry] Updating module docstring.
      cfc1be5 [Mike Dusenberry] Use 'new SQLContext(matrix.rows.sparkContext)' rather than 'SQLContext.getOrCreate', as the later doesn't guarantee that the SparkContext will be the same as for the matrix.rows data.
      687e345 [Mike Dusenberry] Improving conversion performance.  This adds an optional 'java_matrix' parameter to the constructors, and pulls the conversion logic out into a '_create_from_java' function. Now, if the constructors are given a valid Java distributed matrix object as 'java_matrix', they will store those internally, rather than create a new one on the Scala/Java side.
      3e50b6e [Mike Dusenberry] Moving the distributed matrices to pyspark.mllib.linalg.distributed.
      308f197 [Mike Dusenberry] Using properties for better documentation.
      1633f86 [Mike Dusenberry] Minor documentation cleanup.
      f0c13a7 [Mike Dusenberry] CoordinateMatrix should inherit from DistributedMatrix.
      ffdd724 [Mike Dusenberry] Updating doctests to make documentation cleaner.
      3fd4016 [Mike Dusenberry] Updating docstrings.
      27cd5f6 [Mike Dusenberry] Simplifying input conversions in the constructors for each distributed matrix.
      a409cf5 [Mike Dusenberry] Updating doctests to be less verbose by using lists instead of DenseVectors explicitly.
      d19b0ba [Mike Dusenberry] Updating code and documentation to note that a vector-like object (numpy array, list, etc.) can be used in place of explicit Vector object, and adding conversions when necessary to RowMatrix construction.
      4bd756d [Mike Dusenberry] Adding param documentation to IndexedRow and MatrixEntry.
      c6bded5 [Mike Dusenberry] Move conversion logic from tuples to IndexedRow or MatrixEntry types from within the IndexedRowMatrix and CoordinateMatrix constructors to separate _convert_to_indexed_row and _convert_to_matrix_entry functions.
      329638b [Mike Dusenberry] Moving the Experimental tag to the top of each docstring.
      0be6826 [Mike Dusenberry] Simplifying doctests by removing duplicated rows/entries RDDs within the various tests.
      c0900df [Mike Dusenberry] Adding the colons that were accidentally not inserted.
      4ad6819 [Mike Dusenberry] Documenting the  and  parameters.
      3b854b9 [Mike Dusenberry] Minor updates to documentation.
      10046e8 [Mike Dusenberry] Updating documentation to use class constructors instead of the removed DistributedMatrices factory methods.
      119018d [Mike Dusenberry] Adding static  methods to each of the distributed matrix classes to consolidate conversion logic.
      4d7af86 [Mike Dusenberry] Adding type checks to the constructors.  Although it is slightly verbose, it is better for the user to have a good error message than a cryptic stacktrace.
      93b6a3d [Mike Dusenberry] Pulling the DistributedMatrices Python class out of this pull request.
      f6f3c68 [Mike Dusenberry] Pulling the DistributedMatrices Scala class out of this pull request.
      6a3ecb7 [Mike Dusenberry] Updating pattern matching.
      08f287b [Mike Dusenberry] Slight reformatting of the documentation.
      a245dc0 [Mike Dusenberry] Updating Python doctests for compatability between Python 2 & 3. Since Python 3 removed the idea of a separate 'long' type, all values that would have been outputted as a 'long' (ex: '4L') will now be treated as an 'int' and outputed as one (ex: '4').  The doctests now explicitly convert to ints so that both Python 2 and 3 will have the same output.  This is fine since the values are all small, and thus can be easily represented as ints.
      4d3a37e [Mike Dusenberry] Reformatting a few long Python doctest lines.
      7e3ca16 [Mike Dusenberry] Fixing long lines.
      f721ead [Mike Dusenberry] Updating documentation for each of the distributed matrices.
      ab0e8b6 [Mike Dusenberry] Updating unit test to be more useful.
      dda2f89 [Mike Dusenberry] Added wrappers for the conversions between the various distributed matrices.  Added logic to be able to access the rows/entries of the distributed matrices, which requires serialization through DataFrames for IndexedRowMatrix and CoordinateMatrix types. Added unit tests.
      0cd7166 [Mike Dusenberry] Implemented the CoordinateMatrix API in PySpark, following the idea of the IndexedRowMatrix API, including using DataFrames for serialization.
      3c369cb [Mike Dusenberry] Updating the architecture a bit to make conversions between the various distributed matrix types easier.  The different distributed matrix classes are now only wrappers around the Java objects, and take the Java object as an argument during construction.  This way, we can call  for example on an , which returns a reference to a Java RowMatrix object, and then construct a PySpark RowMatrix object wrapped around the Java object.  This is analogous to the behavior of PySpark RDDs and DataFrames.  We now delegate creation of the various distributed matrices from scratch in PySpark to the factory methods on .
      4bdd09b [Mike Dusenberry] Implemented the IndexedRowMatrix API in PySpark, following the idea of the RowMatrix API.  Note that for the IndexedRowMatrix, we use DataFrames to serialize the data between Python and Scala/Java, so we accept PySpark RDDs, then convert to a DataFrame, then convert back to RDDs on the Scala/Java side before constructing the IndexedRowMatrix.
      23bf1ec [Mike Dusenberry] Updating documentation to add PySpark RowMatrix. Inserting newline above doctest so that it renders properly in API docs.
      b194623 [Mike Dusenberry] Updating design to have a PySpark RowMatrix simply create and keep a reference to a wrapper over a Java RowMatrix.  Updating DistributedMatrices factory methods to accept numRows and numCols with default values.  Updating PySpark DistributedMatrices factory method to simply create a PySpark RowMatrix. Adding additional doctests for numRows and numCols parameters.
      bc2d220 [Mike Dusenberry] Adding unit tests for RowMatrix methods.
      d7e316f [Mike Dusenberry] Implemented the RowMatrix API in PySpark by doing the following: Added a DistributedMatrices class to contain factory methods for creating the various distributed matrices.  Added a factory method for creating a RowMatrix from an RDD of Vectors.  Added a createRowMatrix function to the PythonMLlibAPI to interface with the factory method.  Added DistributedMatrix, DistributedMatrices, and RowMatrix classes to the pyspark.mllib.linalg api.
      571d5b53
    • Joseph K. Bradley's avatar
      [SPARK-9582] [ML] LDA cleanups · 1833d9c0
      Joseph K. Bradley authored
      Small cleanups to recent LDA additions and docs.
      
      CC: feynmanliang
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #7916 from jkbradley/lda-cleanups and squashes the following commits:
      
      f7021d9 [Joseph K. Bradley] broadcasting large matrices for LDA in local model and online learning
      97947aa [Joseph K. Bradley] a few more cleanups
      5b03f88 [Joseph K. Bradley] reverted split of lda log likelihood
      c566915 [Joseph K. Bradley] small edit to make review easier
      63f6c7d [Joseph K. Bradley] clarified log likelihood for lda models
      1833d9c0
    • Joseph K. Bradley's avatar
      [SPARK-9447] [ML] [PYTHON] Added HasRawPredictionCol, HasProbabilityCol to RandomForestClassifier · e3754560
      Joseph K. Bradley authored
      Added HasRawPredictionCol, HasProbabilityCol to RandomForestClassifier, plus doc tests for those columns.
      
      CC: holdenk yanboliang
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #7903 from jkbradley/rf-prob-python and squashes the following commits:
      
      c62a83f [Joseph K. Bradley] made unit test more robust
      14eeba2 [Joseph K. Bradley] added HasRawPredictionCol, HasProbabilityCol to RandomForestClassifier in PySpark
      e3754560
    • CodingCat's avatar
      [SPARK-9602] remove "Akka/Actor" words from comments · 9d668b73
      CodingCat authored
      https://issues.apache.org/jira/browse/SPARK-9602
      
Although we have hidden Akka behind the RPC interface, I found that the Akka/Actor-related comments are still spread everywhere. To make it consistent, we shall remove the "actor"/"akka" words from the comments...
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #7936 from CodingCat/SPARK-9602 and squashes the following commits:
      
      e8296a3 [CodingCat] remove actor words from comments
      9d668b73
    • Josh Rosen's avatar
      [SPARK-9452] [SQL] Support records larger than page size in UnsafeExternalSorter · ab8ee1a3
      Josh Rosen authored
      This patch extends UnsafeExternalSorter to support records larger than the page size. The basic strategy is the same as in #7762: store large records in their own overflow pages.
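
A minimal sketch of the overflow-page strategy (assumed names, and simplified to objects; the real sorter manages raw memory pages):

```scala
case class Page(capacity: Long)

// Records that fit share the current fixed-size page; a record larger than
// the page size gets a dedicated page sized exactly to the record.
class PagedWriter(pageSize: Long) {
  private var current = Page(pageSize)
  private var used = 0L

  def pageFor(recordSize: Long): Page =
    if (recordSize > pageSize) {
      Page(recordSize)                    // overflow page owned by this one record
    } else {
      if (used + recordSize > pageSize) { // current page full: start a new one
        current = Page(pageSize)
        used = 0L
      }
      used += recordSize
      current
    }
}
```

Sizing the overflow page to the record keeps normal pages uniform while still accepting arbitrarily large records.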
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7891 from JoshRosen/large-records-in-sql-sorter and squashes the following commits:
      
      967580b [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-records-in-sql-sorter
      948c344 [Josh Rosen] Add large records tests for KV sorter.
      3c17288 [Josh Rosen] Combine memory and disk cleanup into general cleanupResources() method
      380f217 [Josh Rosen] Merge remote-tracking branch 'origin/master' into large-records-in-sql-sorter
      27eafa0 [Josh Rosen] Fix page size in PackedRecordPointerSuite
      a49baef [Josh Rosen] Address initial round of review comments
      3edb931 [Josh Rosen] Remove accidentally-committed debug statements.
      2b164e2 [Josh Rosen] Support large records in UnsafeExternalSorter.
      ab8ee1a3
    • Wenchen Fan's avatar
      [SPARK-9553][SQL] remove the no-longer-necessary createCode and... · f4b1ac08
      Wenchen Fan authored
      [SPARK-9553][SQL] remove the no-longer-necessary createCode and createStructCode, and replace the usage
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7890 from cloud-fan/minor and squashes the following commits:
      
      c3b1be3 [Wenchen Fan] fix style
      b0cbe2e [Wenchen Fan] remove the createCode and createStructCode, and replace the usage of them by createStructCode
      f4b1ac08
    • Michael Armbrust's avatar
      [SPARK-9606] [SQL] Ignore flaky thrift server tests · a0cc0175
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #7939 from marmbrus/turnOffThriftTests and squashes the following commits:
      
      80d618e [Michael Armbrust] [SPARK-9606][SQL] Ignore flaky thrift server tests
      a0cc0175
    • Holden Karau's avatar
      [SPARK-8069] [ML] Add multiclass thresholds for ProbabilisticClassifier · 5a23213c
      Holden Karau authored
This PR replaces the old "threshold" with a generalized "thresholds" Param. We keep getThreshold/setThreshold for backwards compatibility for binary classification.
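
As a sketch of how a generalized thresholds Param is typically applied (my reading of the semantics, not a quote of the implementation): the predicted class is the one whose probability, scaled by its threshold, is largest.

```scala
// probabilities and thresholds are indexed by class; a smaller threshold
// makes its class easier to predict.
def predictWithThresholds(probabilities: Array[Double],
                          thresholds: Array[Double]): Int =
  probabilities.zip(thresholds)
    .map { case (p, t) => p / t }
    .zipWithIndex
    .maxBy(_._1)._2

// A single binary threshold t can then be encoded as thresholds = Array(1 - t, t).
```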
      
      Note that the primary author of this PR is holdenk
      
      Author: Holden Karau <holden@pigscanfly.ca>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #7909 from jkbradley/holdenk-SPARK-8069-add-cutoff-aka-threshold-to-random-forest and squashes the following commits:
      
      3952977 [Joseph K. Bradley] fixed pyspark doc test
      85febc8 [Joseph K. Bradley] made python unit tests a little more robust
      7eb1d86 [Joseph K. Bradley] small cleanups
      6cc2ed8 [Joseph K. Bradley] Fixed remaining merge issues.
      0255e44 [Joseph K. Bradley] Many cleanups for thresholds, some more tests
      7565a60 [Holden Karau] fix pep8 style checks, add a getThreshold method similar to our LogisticRegression.scala one for API compat
      be87f26 [Holden Karau] Convert threshold to thresholds in the python code, add specialized support for Array[Double] to shared parems codegen, etc.
      6747dad [Holden Karau] Override raw2prediction for ProbabilisticClassifier, fix some tests
      25df168 [Holden Karau] Fix handling of thresholds in LogisticRegression
      c02d6c0 [Holden Karau] No default for thresholds
      5e43628 [Holden Karau] CR feedback and fixed the renamed test
      f3fbbd1 [Holden Karau] revert the changes to random forest :(
      51f581c [Holden Karau] Add explicit types to public methods, fix long line
      f7032eb [Holden Karau] Fix a java test bug, remove some unecessary changes
      adf15b4 [Holden Karau] rename the classifier suite test to ProbabilisticClassifierSuite now that we only have it in Probabilistic
      398078a [Holden Karau] move the thresholding around a bunch based on the design doc
      4893bdc [Holden Karau] Use numtrees of 3 since previous result was tied (one tree for each) and the switch from different max methods picked a different element (since they were equal I think this is ok)
      638854c [Holden Karau] Add a scala RandomForestClassifierSuite test based on corresponding python test
      e09919c [Holden Karau] Fix return type, I need more coffee....
      8d92cac [Holden Karau] Use ClassifierParams as the head
      3456ed3 [Holden Karau] Add explicit return types even though just test
      a0f3b0c [Holden Karau] scala style fixes
      6f14314 [Holden Karau] Since hasthreshold/hasthresholds is in root classifier now
      ffc8dab [Holden Karau] Update the sharedParams
      0420290 [Holden Karau] Allow us to override the get methods selectively
      978e77a [Holden Karau] Move HasThreshold into classifier params and start defining the overloaded getThreshold/getThresholds functions
      1433e52 [Holden Karau] Revert "try and hide threshold but chainges the API so no dice there"
      1f09a2e [Holden Karau] try and hide threshold but chainges the API so no dice there
      efb9084 [Holden Karau] move setThresholds only to where its used
      6b34809 [Holden Karau] Add a test with thresholding for the RFCS
      74f54c3 [Holden Karau] Fix creation of vote array
      1986fa8 [Holden Karau] Setting the thresholds only makes sense if the underlying class hasn't overridden predict, so lets push it down.
      2f44b18 [Holden Karau] Add a global default of null for thresholds param
      f338cfc [Holden Karau] Wait that wasn't a good idea, Revert "Some progress towards unifying threshold and thresholds"
      634b06f [Holden Karau] Some progress towards unifying threshold and thresholds
      85c9e01 [Holden Karau] Test passes again... little fnur
      099c0f3 [Holden Karau] Move thresholds around some more (set on model not trainer)
      0f46836 [Holden Karau] Start adding a classifiersuite
      f70eb5e [Holden Karau] Fix test compile issues
      a7d59c8 [Holden Karau] Move thresholding into Classifier trait
      5d999d2 [Holden Karau] Some more progress, start adding a test (maybe try and see if we can find a better thing to use for the base of the test)
      1fed644 [Holden Karau] Use thresholds to scale scores in random forest classifcation
      31d6bf2 [Holden Karau] Start threading the threshold info through
      0ef228c [Holden Karau] Add hasthresholds
      5a23213c
    • Michael Armbrust's avatar
      [SPARK-9512][SQL] Revert SPARK-9251, Allow evaluation while sorting · 34a0eb2e
      Michael Armbrust authored
The analysis rule has a bug and we ended up making the sorter still capable of doing evaluation, so let's revert this for now.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #7906 from marmbrus/revertSortProjection and squashes the following commits:
      
      2da6972 [Michael Armbrust] unrevert unrelated changes
      4f2b00c [Michael Armbrust] Revert "[SPARK-9251][SQL] do not order by expressions which still need evaluation"
      34a0eb2e
    • Shivaram Venkataraman's avatar
      [SPARK-9562] Change reference to amplab/spark-ec2 from mesos/ · 6a0f8b99
      Shivaram Venkataraman authored
      cc srowen pwendell nchammas
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #7899 from shivaram/spark-ec2-move and squashes the following commits:
      
      7cc22c9 [Shivaram Venkataraman] Change reference to amplab/spark-ec2 from mesos/
      6a0f8b99
    • Yijie Shen's avatar
      [SPARK-9541] [SQL] DataTimeUtils cleanup · b5034c9c
      Yijie Shen authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9541
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #7870 from yjshen/datetime_cleanup and squashes the following commits:
      
      9203e33 [Yijie Shen] revert getMonth & getDayOfMonth
      5cad119 [Yijie Shen] rebase code
      7d62a74 [Yijie Shen] remove tmp tuple inside split date
      e98aaac [Yijie Shen] DataTimeUtils cleanup
      b5034c9c
    • Davies Liu's avatar
      [SPARK-8246] [SQL] Implement get_json_object · 73dedb58
      Davies Liu authored
      This is based on #7485 , thanks to NathanHowell
      
      Tests were copied from Hive, but do not seem to be super comprehensive. I've generally replicated Hive's unusual behavior rather than following a JSONPath reference, except for one case (as noted in the comments). I don't know if there is a way of fully replicating Hive's behavior without a slower TreeNode implementation, so I've erred on the side of performance instead.
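
A usage sketch of the SQL function added here; the expected result follows the Hive semantics described above:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("gjo-example").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Extract a nested field with a Hive-style JSON path rooted at '$'.
sqlContext.sql("""SELECT get_json_object('{"a": {"b": 1}}', '$.a.b')""").show()
// expected: a single row containing 1
```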
      
      Author: Davies Liu <davies@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      Author: Nathan Howell <nhowell@godaddy.com>
      
      Closes #7901 from davies/get_json_object and squashes the following commits:
      
      3ace9b9 [Davies Liu] Merge branch 'get_json_object' of github.com:davies/spark into get_json_object
      98766fc [Davies Liu] Merge branch 'master' of github.com:apache/spark into get_json_object
      a7dc6d0 [Davies Liu] Update JsonExpressionsSuite.scala
      c818519 [Yin Huai] new results.
      18ce26b [Davies Liu] fix tests
      6ac29fb [Yin Huai] Golden files.
      25eebef [Davies Liu] use HiveQuerySuite
      e0ac6ec [Yin Huai] Golden answer files.
      940c060 [Davies Liu] tweat code style
      44084c5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into get_json_object
      9192d09 [Nathan Howell] Match Hive’s behavior for unwrapping arrays of one element
      8dab647 [Nathan Howell] [SPARK-8246] [SQL] Implement get_json_object
      73dedb58
    • Tarek Auel's avatar
      [SPARK-8244] [SQL] string function: find in set · b1f88a38
      Tarek Auel authored
This PR is based on #7186 (just fixing the conflict), thanks to tarekauel.
      
      find_in_set(string str, string strList): int
      
Returns the first occurrence of str in strList, where strList is a comma-delimited string. Returns null if either argument is null. Returns 0 if the first argument contains any commas. For example, find_in_set('ab', 'abc,b,ab,c,def') returns 3.
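
The semantics above, restated as a reference sketch (not the Catalyst implementation, which operates on UTF8String; returning 0 when str is absent follows Hive's find_in_set and is an assumption beyond the text above):

```scala
def findInSet(str: String, strList: String): Integer =
  if (str == null || strList == null) null     // null if either argument is null
  else if (str.contains(",")) 0                // 0 if str itself contains a comma
  else strList.split(",", -1).indexOf(str) + 1 // 1-based position, 0 if absent

// findInSet("ab", "abc,b,ab,c,def") == 3
```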
      
      Only add this to SQL, not DataFrame.
      
      Closes #7186
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7900 from davies/find_in_set and squashes the following commits:
      
      4334209 [Davies Liu] Merge branch 'master' of github.com:apache/spark into find_in_set
      8f00572 [Davies Liu] Merge branch 'master' of github.com:apache/spark into find_in_set
      243ede4 [Tarek Auel] [SPARK-8244][SQL] hive compatibility
      1aaf64e [Tarek Auel] [SPARK-8244][SQL] unit test fix
      e4093a4 [Tarek Auel] [SPARK-8244][SQL] final modifier for COMMA_UTF8
      0d05df5 [Tarek Auel] Merge branch 'master' into SPARK-8244
      208d710 [Tarek Auel] [SPARK-8244] address comments & bug fix
      71b2e69 [Tarek Auel] [SPARK-8244] find_in_set
      66c7fda [Tarek Auel] Merge branch 'master' into SPARK-8244
      61b8ca2 [Tarek Auel] [SPARK-8224] removed loop and split; use unsafe String comparison
      4f75a65 [Tarek Auel] Merge branch 'master' into SPARK-8244
      e3b20c8 [Tarek Auel] [SPARK-8244] added type check
      1c2bbb7 [Tarek Auel] [SPARK-8244] findInSet
      b1f88a38
    • Marcelo Vanzin's avatar
      [SPARK-9583] [BUILD] Do not print mvn debug messages to stdout. · d702d537
      Marcelo Vanzin authored
      This allows build/mvn to be used by make-distribution.sh.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #7915 from vanzin/SPARK-9583 and squashes the following commits:
      
      6469e60 [Marcelo Vanzin] [SPARK-9583] [build] Do not print mvn debug messages to stdout.
      d702d537
    • Carson Wang's avatar
      [SPARK-2016] [WEBUI] RDD partition table pagination for the RDD Page · cb7fa0aa
      Carson Wang authored
Add pagination to the RDD page to avoid an unresponsive UI when the number of RDD partitions is large.
      Before:
      ![rddpagebefore](https://cloud.githubusercontent.com/assets/9278199/8951533/3d9add54-3601-11e5-99d0-5653b473c49b.png)
      After:
      ![rddpageafter](https://cloud.githubusercontent.com/assets/9278199/8951536/439d66e0-3601-11e5-9cee-1b380fe6620d.png)
      
      Author: Carson Wang <carson.wang@intel.com>
      
      Closes #7692 from carsonwang/SPARK-2016 and squashes the following commits:
      
      03c7168 [Carson Wang] Fix style issues
      612c18c [Carson Wang] RDD partition table pagination for the RDD Page
      cb7fa0aa
    • tedyu's avatar
      [SPARK-8064] [BUILD] Follow-up. Undo change from SPARK-9507 that was accidentally reverted · b211cbc7
      tedyu authored
      This PR removes the dependency reduced POM hack brought back by #7191
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #7919 from tedyu/master and squashes the following commits:
      
      1bfbd7b [tedyu] [BUILD] Remove dependency reduced POM hack
      b211cbc7
    • Sean Owen's avatar
[SPARK-9534] [BUILD] Enable javac lint for scalac parity; fix a lot of build warnings, 1.5.0 edition · 76d74090
Sean Owen authored
      
      Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.
      
      I'll explain several of the changes inline in comments.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #7862 from srowen/SPARK-9534 and squashes the following commits:
      
      ea51618 [Sean Owen] Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.
      76d74090
    • Ankur Dave's avatar
      [SPARK-3190] [GRAPHX] Fix VertexRDD.count() overflow regression · 9e952ecb
      Ankur Dave authored
      SPARK-3190 was originally fixed by 96df9290, but a5ef5811 introduced a regression during refactoring. This commit fixes the regression.
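
The class of bug involved, illustratively (not the actual GraphX code): per-partition counts must be accumulated as Long, since an Int sum overflows past 2^31 - 1 vertices.

```scala
val partitionCounts = Seq(1500000000L, 1500000000L)

val overflowed: Int = partitionCounts.map(_.toInt).sum // wraps around: negative result
val correct: Long   = partitionCounts.sum              // 3000000000

assert(overflowed < 0)
assert(correct == 3000000000L)
```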
      
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #7923 from ankurdave/SPARK-3190-reopening and squashes the following commits:
      
      a3e1b23 [Ankur Dave] Fix VertexRDD.count() overflow regression
      9e952ecb
  3. Aug 03, 2015
    • Sean Owen's avatar
      [SPARK-9521] [DOCS] Addendum. Require Maven 3.3.3+ in the build · 0afa6fbf
      Sean Owen authored
Follow-on for #7852: the Building Spark doc needs to refer to the new Maven requirement too.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #7905 from srowen/SPARK-9521.2 and squashes the following commits:
      
      73285df [Sean Owen] Follow on for #7852: Building Spark doc needs to refer to new Maven requirement too
      0afa6fbf
    • Reynold Xin's avatar
      [SPARK-9577][SQL] Surface concrete iterator types in various sort classes. · 5eb89f67
      Reynold Xin authored
We often return abstract iterator types in various sort-related classes (e.g. UnsafeKVExternalSorter). It is actually better to return a more concrete type, so the call site uses that type and the JIT can inline the iterator calls.
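
The pattern, illustratively (hypothetical class; the PR applies this to classes like UnsafeKVExternalSorter):

```scala
final class SortedIterator(data: Array[Int]) extends Iterator[Int] {
  private var i = 0
  override def hasNext: Boolean = i < data.length
  override def next(): Int = { val v = data(i); i += 1; v }
}

// Declaring the concrete return type (rather than Iterator[Int]) lets the JIT
// devirtualize and inline hasNext/next at the call site.
def sortedIterator(data: Array[Int]): SortedIterator =
  new SortedIterator(data.sorted)
```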
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7911 from rxin/surface-concrete-type and squashes the following commits:
      
      0422add [Reynold Xin] [SPARK-9577][SQL] Surface concrete iterator types in various sort classes.
      5eb89f67
    • CodingCat's avatar
      [SPARK-8416] highlight and topping the executor threads in thread dumping page · 3b0e4449
      CodingCat authored
      https://issues.apache.org/jira/browse/SPARK-8416
      
      To facilitate debugging, I made this patch with three changes:
      
      * render the executor-thread and non executor-thread entries with different background colors
      
      * put the executor threads on the top of the list
      
      * sort the threads alphabetically
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #7808 from CodingCat/SPARK-8416 and squashes the following commits:
      
      34fc708 [CodingCat] fix className
      d7b79dd [CodingCat] lowercase threadName
      d032882 [CodingCat] sort alphabetically and change the css class name
      f0513b1 [CodingCat] change the color & group threads by name
      2da6e06 [CodingCat] small fix
      3fc9f36 [CodingCat] define classes in webui.css
      8ee125e [CodingCat] highlight and put on top the executor threads in thread dumping page
      3b0e4449
    • Burak Yavuz's avatar
      [SPARK-9263] Added flags to exclude dependencies when using --packages · 1633d0a2
      Burak Yavuz authored
      While the functionality is there to exclude packages, there are no flags that allow users to exclude dependencies, in case of dependency conflicts. We should provide users with a flag to add dependency exclusions in case the packages are not resolved properly (or not available due to licensing).
      
The flag I added was --packages-exclude, but I'm open to renaming it. I also added property flags in case people would like to use a conf file to provide dependencies, which is useful when there is a long list of dependencies or exclusions.
      
      cc andrewor14 vanzin pwendell
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #7599 from brkyvz/packages-exclusions and squashes the following commits:
      
      636f410 [Burak Yavuz] addressed nits
      6e54ede [Burak Yavuz] is this the culprit
      b5e508e [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into packages-exclusions
      154f5db [Burak Yavuz] addressed initial comments
      1536d7a [Burak Yavuz] Added flags to exclude packages using --packages-exclude
      1633d0a2
    • Matthew Brandyberry's avatar
      [SPARK-9483] Fix UTF8String.getPrefix for big-endian. · b79b4f5f
      Matthew Brandyberry authored
The previous code assumed little-endian byte order.
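
A sketch of the kind of normalization such a fix needs (assumed shape, not the actual patch): a sort prefix packed from raw bytes must be byte-order independent.

```scala
import java.nio.ByteOrder

val isLittleEndian: Boolean =
  ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN

// Normalize a 64-bit prefix read from memory so that unsigned comparison
// of prefixes matches lexicographic comparison of the underlying bytes.
def toBigEndianBits(bits: Long): Long =
  if (isLittleEndian) java.lang.Long.reverseBytes(bits) else bits
```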
      
      Author: Matthew Brandyberry <mbrandy@us.ibm.com>
      
      Closes #7902 from mtbrandy/SPARK-9483 and squashes the following commits:
      
      ec31df8 [Matthew Brandyberry] [SPARK-9483] Changes from review comments.
      17d54c6 [Matthew Brandyberry] [SPARK-9483] Fix UTF8String.getPrefix for big-endian.
      b79b4f5f
    • Shivaram Venkataraman's avatar
      Add a prerequisites section for building docs · 7abaaad5
      Shivaram Venkataraman authored
This puts all the install commands that need to be run into one section instead of spreading them over many paragraphs.
      
      cc rxin
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #7912 from shivaram/docs-setup-readme and squashes the following commits:
      
      cf7a204 [Shivaram Venkataraman] Add a prerequisites section for building docs
      7abaaad5
    • MechCoder's avatar
      [SPARK-8874] [ML] Add missing methods in Word2Vec · 13675c74
      MechCoder authored
Add the missing methods

1. getVectors
2. findSynonyms

to the Word2Vec Scala and Python APIs (usage sketch below).
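
A usage sketch of the two methods on the Scala side (signatures as I understand the MLlib API; hedged, not quoted from this PR):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.Word2Vec

val sc = new SparkContext(new SparkConf().setAppName("w2v-example").setMaster("local[*]"))
val corpus = sc.parallelize(Seq(
  "spark is fast", "spark is scalable", "hadoop is mapreduce"
).map(_.split(" ").toSeq))

val model = new Word2Vec().setMinCount(1).fit(corpus)

// getVectors: the learned word -> vector map
val vectors: Map[String, Array[Float]] = model.getVectors

// findSynonyms: top-N nearest words by cosine similarity
val synonyms: Array[(String, Double)] = model.findSynonyms("spark", 5)
```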
      
      mengxr
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #7263 from MechCoder/missing_methods_w2vec and squashes the following commits:
      
      149d5ca [MechCoder] minor doc
      69d91b7 [MechCoder] [SPARK-8874] [ML] Add missing methods in Word2Vec
      13675c74
    • Steve Loughran's avatar
      [SPARK-8064] [SQL] Build against Hive 1.2.1 · a2409d1c
      Steve Loughran authored
Cherry-picked the parts of the initial SPARK-8064 WIP branch needed to get sql/hive to compile against Hive 1.2.1. That's the ASF release packaged under org.apache.hive, not any fork.
      
      Tests not run yet: that's what the machines are for
      
      Author: Steve Loughran <stevel@hortonworks.com>
      Author: Cheng Lian <lian@databricks.com>
      Author: Michael Armbrust <michael@databricks.com>
      Author: Patrick Wendell <patrick@databricks.com>
      
      Closes #7191 from steveloughran/stevel/feature/SPARK-8064-hive-1.2-002 and squashes the following commits:
      
      7556d85 [Cheng Lian] Updates .q files and corresponding golden files
      ef4af62 [Steve Loughran] Merge commit '6a92bb09f46a04d6cd8c41bdba3ecb727ebb9030' into stevel/feature/SPARK-8064-hive-1.2-002
      6a92bb0 [Cheng Lian] Overrides HiveConf time vars
      dcbb391 [Cheng Lian] Adds com.twitter:parquet-hadoop-bundle:1.6.0 for Hive Parquet SerDe
      0bbe475 [Steve Loughran] SPARK-8064 scalastyle rejects the standard Hadoop ASF license header...
      fdf759b [Steve Loughran] SPARK-8064 classpath dependency suite to be in sync with shading in final (?) hive-exec spark
      7a6c727 [Steve Loughran] SPARK-8064 switch to second staging repo of the spark-hive artifacts. This one has the protobuf-shaded hive-exec jar
      376c003 [Steve Loughran] SPARK-8064 purge duplicate protobuf declaration
      2c74697 [Steve Loughran] SPARK-8064 switch to the protobuf shaded hive-exec jar with tests to chase it down
      cc44020 [Steve Loughran] SPARK-8064 remove hadoop.version from runtest.py, as profile will fix that automatically.
      6901fa9 [Steve Loughran] SPARK-8064 explicit protobuf import
      da310dc [Michael Armbrust] Fixes for Hive tests.
      a775a75 [Steve Loughran] SPARK-8064 cherry-pick-incomplete
      7404f34 [Patrick Wendell] Add spark-hive staging repo
      832c164 [Steve Loughran] SPARK-8064 try to supress compiler warnings on Complex.java pasted-thrift-code
      312c0d4 [Steve Loughran] SPARK-8064  maven/ivy dependency purge; calcite declaration needed
      fa5ae7b [Steve Loughran] HIVE-8064 fix up hive-thriftserver dependencies and cut back on evicted references in the hive- packages; this keeps mvn and ivy resolution compatible, as the reconciliation policy is "by hand"
      c188048 [Steve Loughran] SPARK-8064 manage the Hive depencencies to that -things that aren't needed are excluded -sql/hive built with ivy is in sync with the maven reconciliation policy, rather than latest-first
      4c8be8d [Cheng Lian] WIP: Partial fix for Thrift server and CLI tests
      314eb3c [Steve Loughran] SPARK-8064 deprecation warning  noise in one of the tests
      17b0341 [Steve Loughran] SPARK-8064 IDE-hinted cleanups of Complex.java to reduce compiler warnings. It's all autogenerated code, so still ugly.
      d029b92 [Steve Loughran] SPARK-8064 rely on unescaping to have already taken place, so go straight to map of serde options
      23eca7e [Steve Loughran] HIVE-8064 handle raw and escaped property tokens
      54d9b06 [Steve Loughran] SPARK-8064 fix compilation regression surfacing from rebase
      0b12d5f [Steve Loughran] HIVE-8064 use subset of hive complex type whose types deserialize
      fce73b6 [Steve Loughran] SPARK-8064 poms rely implicitly on the version of kryo chill provides
      fd3aa5d [Steve Loughran] SPARK-8064 version of hive to d/l from ivy is 1.2.1
      dc73ece [Steve Loughran] SPARK-8064 revert to master's determinstic pushdown strategy
      d3c1e4a [Steve Loughran] SPARK-8064 purge UnionType
      051cc21 [Steve Loughran] SPARK-8064 switch to an unshaded version of hive-exec-core, which must have been built with Kryo 2.21. This currently looks for a (locally built) version 1.2.1.spark
      6684c60 [Steve Loughran] SPARK-8064 ignore RTE raised in blocking process.exitValue() call
      e6121e5 [Steve Loughran] SPARK-8064 address review comments
      aa43dc6 [Steve Loughran] SPARK-8064  more robust teardown on JavaMetastoreDatasourcesSuite
      f2bff01 [Steve Loughran] SPARK-8064 better takeup of asynchronously caught error text
      8b1ef38 [Steve Loughran] SPARK-8064: on failures executing spark-submit in HiveSparkSubmitSuite, print command line and all logged output.
      5a9ce6b [Steve Loughran] SPARK-8064 add explicit reason for kv split failure, rather than array OOB. *does not address the issue*
      642b63a [Steve Loughran] SPARK-8064 reinstate something cut briefly during rebasing
      97194dc [Steve Loughran] SPARK-8064 add extra logging to the YarnClusterSuite classpath test. There should be no reason why this is failing on jenkins, but as it is (and presumably its CP-related), improve the logging including any exception raised.
      335357f [Steve Loughran] SPARK-8064 fail fast on thrive process spawning tests on exit codes and/or error string patterns seen in log.
      3ed872f [Steve Loughran] SPARK-8064 rename field double to  dbl
      bca55e5 [Steve Loughran] SPARK-8064 missed one of the `date` escapes
      41d6479 [Steve Loughran] SPARK-8064 wrap tests with withTable() calls to avoid table-exists exceptions
      2bc29a4 [Steve Loughran] SPARK-8064 ParquetSuites to escape `date` field name
      1ab9bc4 [Steve Loughran] SPARK-8064 TestHive to use sered2.thrift.test.Complex
      bf3a249 [Steve Loughran] SPARK-8064: more resubmit than fix; tighten startup timeout to 60s. Still no obvious reason why jersey server code in spark-assembly isn't being picked up -it hasn't been shaded
      c829b8f [Steve Loughran] SPARK-8064: reinstate yarn-rm-server dependencies to hive-exec to ensure that jersey server is on classpath on hadoop versions < 2.6
      0b0f738 [Steve Loughran] SPARK-8064: thrift server startup to fail fast on any exception in the main thread
      13abaf1 [Steve Loughran] SPARK-8064 Hive compatibilty tests sin sync with explain/show output from Hive 1.2.1
      d14d5ea [Steve Loughran] SPARK-8064: DATE is now a predicate; you can't use it as a field in select ops
      26eef1c [Steve Loughran] SPARK-8064: HIVE-9039 renamed TOK_UNION => TOK_UNIONALL while adding TOK_UNIONDISTINCT
      3d64523 [Steve Loughran] SPARK-8064 improve diagns on uknown token; fix scalastyle failure
      d0360f6 [Steve Loughran] SPARK-8064: delicate merge in of the branch vanzin/hive-1.1
      1126e5a [Steve Loughran] SPARK-8064: name of unrecognized file format wasn't appearing in error text
      8cb09c4 [Steve Loughran] SPARK-8064: test resilience/assertion improvements. Independent of the rest of the work; can be backported to earlier versions
      dec12cb [Steve Loughran] SPARK-8064: when a CLI suite test fails include the full output text in the raised exception; this ensures that the stdout/stderr is included in jenkins reports, so it becomes possible to diagnose the cause.
      463a670 [Steve Loughran] SPARK-8064 run-tests.py adds a hadoop-2.6 profile, and changes info messages to say "w/Hive 1.2.1" in console output
      2531099 [Steve Loughran] SPARK-8064 successful attempt to get rid of pentaho as a transitive dependency of hive-exec
      1d59100 [Steve Loughran] SPARK-8064 (unsuccessful) attempt to get rid of pentaho as a transitive dependency of hive-exec
      75733fc [Steve Loughran] SPARK-8064 change thrift binary startup message to "Starting ThriftBinaryCLIService on port"
      3ebc279 [Steve Loughran] SPARK-8064 move strings used to check for http/bin thrift services up into constants
      c80979d [Steve Loughran] SPARK-8064: SparkSQLCLIDriver drops remote mode support. CLISuite Tests pass instead of timing out: undetected regression?
      27e8370 [Steve Loughran] SPARK-8064 fix some style & IDE warnings
      00e50d6 [Steve Loughran] SPARK-8064 stop excluding hive shims from dependency (commented out , for now)
      cb4f142 [Steve Loughran] SPARK-8054 cut pentaho dependency from calcite
      f7aa9cb [Steve Loughran] SPARK-8064 everything compiles with some commenting and moving of classes into a hive package
      6c310b4 [Steve Loughran] SPARK-8064 subclass  Hive ServerOptionsProcessor to make it public again
      f61a675 [Steve Loughran] SPARK-8064 thrift server switched to Hive 1.2.1, though it doesn't compile everywhere
      4890b9d [Steve Loughran] SPARK-8064, build against Hive 1.2.1
      a2409d1c