  1. Jun 04, 2015
    • [SPARK-7743] [SQL] Parquet 1.7 · cd3176bd
      Thomas Omans authored
      Resolves [SPARK-7743](https://issues.apache.org/jira/browse/SPARK-7743).
      
      Trivial changes to versions and package names, plus a small fix in `ParquetTableOperations.scala`:
      
      ```diff
      -    val readContext = getReadSupport(configuration).init(
      +    val readContext = ParquetInputFormat.getReadSupportInstance(configuration).init(
      ```
      
      This change is needed because `ParquetInputFormat.getReadSupport` was made package-private in the latest release.
      
      Thanks
      -- Thomas Omans
      
      Author: Thomas Omans <tomans@cj.com>
      
      Closes #6597 from eggsby/SPARK-7743 and squashes the following commits:
      
      2df0d1b [Thomas Omans] [SPARK-7743] [SQL] Upgrading parquet version to 1.7.0
  2. Jun 03, 2015
    • [SPARK-7801] [BUILD] Updating versions to SPARK 1.5.0 · 2c4d550e
      Patrick Wendell authored
      Author: Patrick Wendell <patrick@databricks.com>
      
      Closes #6328 from pwendell/spark-1.5-update and squashes the following commits:
      
      2f42d02 [Patrick Wendell] A few more excludes
      4bebcf0 [Patrick Wendell] Update to RC4
      61aaf46 [Patrick Wendell] Using new release candidate
      55f1610 [Patrick Wendell] Another exclude
      04b4f04 [Patrick Wendell] More issues with transient 1.4 changes
      36f549b [Patrick Wendell] [SPARK-7801] [BUILD] Updating versions to SPARK 1.5.0
  3. Jun 02, 2015
    • [SPARK-7547] [ML] Scala Example code for ElasticNet · a86b3e9b
      DB Tsai authored
      This is Scala example code for both linear and logistic regression. Python and Java versions are to be added.
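
      For reference, a minimal sketch of the kind of usage such an example exercises (the toy data and parameter values here are illustrative assumptions, not the committed example):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.ml.regression.LinearRegression
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.sql.SQLContext

      object ElasticNetSketch {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("ElasticNetSketch"))
          val sqlContext = new SQLContext(sc)

          // Toy (label, features) rows, just to keep the sketch self-contained.
          val training = sqlContext.createDataFrame(Seq(
            (1.0, Vectors.dense(0.0, 1.1, 0.1)),
            (0.0, Vectors.dense(2.0, 1.0, -1.0)),
            (0.0, Vectors.dense(2.0, 1.3, 1.0)),
            (1.0, Vectors.dense(0.0, 1.2, -0.5))
          )).toDF("label", "features")

          // elasticNetParam mixes the penalties: 0.0 is pure L2, 1.0 is pure L1.
          val lr = new LinearRegression()
            .setMaxIter(100)
            .setRegParam(0.01)
            .setElasticNetParam(0.5)
          val model = lr.fit(training)
          println(s"Weights: ${model.weights} Intercept: ${model.intercept}")
          sc.stop()
        }
      }
      ```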
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #6576 from dbtsai/elasticNetExample and squashes the following commits:
      
      e7ca406 [DB Tsai] fix test
      6bb6d77 [DB Tsai] fix suite and remove duplicated setMaxIter
      136e0dd [DB Tsai] address feedback
      1ec29d4 [DB Tsai] fix style
      9462f5f [DB Tsai] add example
    • [SPARK-7387] [ML] [DOC] CrossValidator example code in Python · c3f4c325
      Ram Sriharsha authored
      Author: Ram Sriharsha <rsriharsha@hw11853.local>
      
      Closes #6358 from harsha2010/SPARK-7387 and squashes the following commits:
      
      63efda2 [Ram Sriharsha] more examples for classifier to distinguish mapreduce from spark properly
      aeb6bb6 [Ram Sriharsha] Python Style Fix
      54a500c [Ram Sriharsha] Merge branch 'master' into SPARK-7387
      615e91c [Ram Sriharsha] cleanup
      204c4e3 [Ram Sriharsha] Merge branch 'master' into SPARK-7387
      7246d35 [Ram Sriharsha] [SPARK-7387][ml][doc] CrossValidator example code in Python
  4. May 31, 2015
    • [SPARK-7979] Enforce structural type checker. · 4b5f12ba
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6536 from rxin/structural-type-checker and squashes the following commits:
      
      f833151 [Reynold Xin] Fixed compilation.
      633f9a1 [Reynold Xin] Fixed typo.
      d1fa804 [Reynold Xin] [SPARK-7979] Enforce structural type checker.
    • [SPARK-3850] Trim trailing spaces for examples/streaming/yarn. · 564bc11e
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6530 from rxin/trim-whitespace-1 and squashes the following commits:
      
      7b7b3a0 [Reynold Xin] Reset again.
      dc14597 [Reynold Xin] Reset scalastyle.
      cd556c4 [Reynold Xin] YARN, Kinesis, Flume.
      4223fe1 [Reynold Xin] [SPARK-3850] Trim trailing spaces for examples/streaming.
  5. May 29, 2015
    • [SPARK-6013] [ML] Add more Python ML examples for spark.ml · dbf8ff38
      Ram Sriharsha authored
      Author: Ram Sriharsha <rsriharsha@hw11853.local>
      
      Closes #6443 from harsha2010/SPARK-6013 and squashes the following commits:
      
      732506e [Ram Sriharsha] Code Review Feedback
      121c211 [Ram Sriharsha] python style fix
      5f9b8c3 [Ram Sriharsha] python style fixes
      925ca86 [Ram Sriharsha] Simple Params Example
      8b372b1 [Ram Sriharsha] GBT Example
      965ec14 [Ram Sriharsha] Random Forest Example
  6. May 28, 2015
    • [SPARK-7929] Remove Bagel examples & whitespace fix for examples. · 2881d14c
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6480 from rxin/whitespace-example and squashes the following commits:
      
      8a4a3d4 [Reynold Xin] [SPARK-7929] Remove Bagel examples & whitespace fix for examples.
    • [MINOR] Fix a minor bug in the PageRank example. · c771589c
      Li Yao authored
      Fixes a bug where passing only one argument caused an array-out-of-bounds exception in the PageRank example.
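
      The guard follows this pattern (a sketch of the fix's shape, not the exact patch):

      ```scala
      // Validate the argument count up front instead of indexing blindly.
      if (args.length < 1) {
        System.err.println("Usage: SparkPageRank <file> <iter>")
        System.exit(1)
      }
      // The iteration count is optional; fall back to a default when absent.
      val iters = if (args.length > 1) args(1).toInt else 10
      ```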
      
      Author: Li Yao <hnkfliyao@gmail.com>
      
      Closes #6455 from lastland/patch-1 and squashes the following commits:
      
      de06128 [Li Yao] Fix the bug that entering only 1 arg will cause array out of bounds exception.
    • [SPARK-7895] [STREAMING] [EXAMPLES] Move Kafka examples from scala-2.10/src to src · 000df2f0
      zsxwing authored
      Since `spark-streaming-kafka` now is published for both Scala 2.10 and 2.11, we can move `KafkaWordCount` and `DirectKafkaWordCount` from `examples/scala-2.10/src/` to `examples/src/` so that they will appear in `spark-examples-***-jar` for Scala 2.11.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6436 from zsxwing/SPARK-7895 and squashes the following commits:
      
      c6052f1 [zsxwing] Update examples/pom.xml
      0bcfa87 [zsxwing] Fix the sleep time
      b9d1256 [zsxwing] Move Kafka examples from scala-2.10/src to src
  7. May 25, 2015
    • Close HBaseAdmin at the end of HBaseTest · 23bea97d
      tedyu authored
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #6381 from ted-yu/master and squashes the following commits:
      
      e2f0ea1 [tedyu] Close HBaseAdmin at the end of HBaseTest
  8. May 23, 2015
    • [SPARK-5090] [EXAMPLES] The improvement of python converter for hbase · 4583cf4b
      GenTang authored
      Hi,
      
      Following the discussion in http://apache-spark-developers-list.1001551.n3.nabble.com/python-converter-in-HBaseConverter-scala-spark-examples-td10001.html, I made some modifications to three files in the examples package:
      1. HBaseConverters.scala: the new converter converts all the records in an HBase Result into a single string (see the sketch below)
      2. hbase_input.py: since the value string may contain several records, we can use the ast package to convert the string into a dict
      3. HBaseTest.scala: since the examples package uses HBase 0.98.7, the original HTableDescriptor constructor is deprecated, so the code is updated to the new constructor
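
      A sketch of the converter idea in point 1 (the class name is hypothetical; the shape follows the examples' Python Converter trait):

      ```scala
      import scala.util.parsing.json.JSONObject
      import org.apache.hadoop.hbase.CellUtil
      import org.apache.hadoop.hbase.client.Result
      import org.apache.hadoop.hbase.util.Bytes
      import org.apache.spark.api.python.Converter

      // Serialize every cell of an HBase Result into one string, one JSON
      // record per cell, so the Python side can parse it back into dicts.
      class ResultToStringSketch extends Converter[Any, String] {
        override def convert(obj: Any): String = {
          val result = obj.asInstanceOf[Result]
          result.rawCells().map { cell =>
            JSONObject(Map(
              "row" -> Bytes.toStringBinary(CellUtil.cloneRow(cell)),
              "columnFamily" -> Bytes.toStringBinary(CellUtil.cloneFamily(cell)),
              "qualifier" -> Bytes.toStringBinary(CellUtil.cloneQualifier(cell)),
              "value" -> Bytes.toStringBinary(CellUtil.cloneValue(cell))
            )).toString()
          }.mkString(" ")
        }
      }
      ```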
      
      Author: GenTang <gen.tang86@gmail.com>
      
      Closes #3920 from GenTang/master and squashes the following commits:
      
      d2153df [GenTang] import JSONObject precisely
      4802481 [GenTang] dump the result into a singl String
      62df7f0 [GenTang] remove the comment
      21de653 [GenTang] return the string in json format
      15b1fe3 [GenTang] the modification of comments
      5cbbcfc [GenTang] the improvement of pythonconverter
      ceb31c5 [GenTang] the modification for adapting updation of hbase
      3253b61 [GenTang] the modification accompanying the improvement of pythonconverter
  9. May 17, 2015
    • [SPARK-7654][SQL] Move JDBC into DataFrame's reader/writer interface. · 517eb37a
      Reynold Xin authored
      Also moved all the deprecated functions into one place for SQLContext and DataFrame, and updated tests to use the new API.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6210 from rxin/df-writer-reader-jdbc and squashes the following commits:
      
      7465c2c [Reynold Xin] Fixed unit test.
      118e609 [Reynold Xin] Updated tests.
      3441b57 [Reynold Xin] Updated javadoc.
      13cdd1c [Reynold Xin] [SPARK-7654][SQL] Move JDBC into DataFrame's reader/writer interface.
  10. May 16, 2015
    • [SPARK-7654][MLlib] Migrate MLlib to the DataFrame reader/writer API. · 161d0b4a
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6211 from rxin/mllib-reader and squashes the following commits:
      
      79a2cb9 [Reynold Xin] [SPARK-7654][MLlib] Migrate MLlib to the DataFrame reader/writer API.
    • [SPARK-7654][SQL] DataFrameReader and DataFrameWriter for input/output API · 578bfeef
      Reynold Xin authored
      This patch introduces DataFrameWriter and DataFrameReader.
      
      DataFrameReader interface, accessible through SQLContext.read, contains methods that create DataFrames. These methods used to reside in SQLContext. Example usage:
      ```scala
      sqlContext.read.json("...")
      sqlContext.read.parquet("...")
      ```
      
      DataFrameWriter interface, accessible through DataFrame.write, implements a builder pattern to avoid the proliferation of options when writing a DataFrame out. It currently implements:
      - mode
      - format (e.g. "parquet", "json")
      - options (generic options passed down into data sources)
      - partitionBy (partitioning columns)
      Example usage:
      ```scala
      df.write.mode("append").format("json").partitionBy("date").saveAsTable("myJsonTable")
      ```
      
      TODO:
      
      - [ ] Documentation update
      - [ ] Move JDBC into reader / writer?
      - [ ] Deprecate the old interfaces
      - [ ] Move the generic load interface into reader.
      - [ ] Update example code and documentation
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6175 from rxin/reader-writer and squashes the following commits:
      
      b146c95 [Reynold Xin] Deprecation of old APIs.
      bd8abdf [Reynold Xin] Fixed merge conflict.
      26abea2 [Reynold Xin] Added general load methods.
      244fbec [Reynold Xin] Added equivalent to example.
      4f15d92 [Reynold Xin] Added documentation for partitionBy.
      7e91611 [Reynold Xin] [SPARK-7654][SQL] DataFrameReader and DataFrameWriter for input/output API.
  11. May 15, 2015
    • [SPARK-7575] [ML] [DOC] Example code for OneVsRest · cc12a86f
      Ram Sriharsha authored
      Java and Scala examples for OneVsRest. Fixes the base classifier to be Logistic Regression and accepts the configuration parameters of the base classifier.
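
      The Scala shape of the example is roughly the following sketch (the training DataFrame is an assumed input):

      ```scala
      import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest, OneVsRestModel}
      import org.apache.spark.sql.DataFrame

      // The base classifier is fixed to logistic regression, and its
      // parameters double as the example's configuration knobs.
      def fitOneVsRest(train: DataFrame): OneVsRestModel = {
        val classifier = new LogisticRegression().setMaxIter(100).setRegParam(0.1)
        new OneVsRest().setClassifier(classifier).fit(train)
      }
      ```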
      
      Author: Ram Sriharsha <rsriharsha@hw11853.local>
      
      Closes #6115 from harsha2010/SPARK-7575 and squashes the following commits:
      
      87ad3c7 [Ram Sriharsha] extra line
      f5d9891 [Ram Sriharsha] Merge branch 'master' into SPARK-7575
      7076084 [Ram Sriharsha] cleanup
      dfd660c [Ram Sriharsha] cleanup
      8703e4f [Ram Sriharsha] update doc
      cb23995 [Ram Sriharsha] fix commandline options for JavaOneVsRestExample
      69e91f8 [Ram Sriharsha] cleanup
      7f4e127 [Ram Sriharsha] cleanup
      d4c40d0 [Ram Sriharsha] Code Review fixes
      461eb38 [Ram Sriharsha] cleanup
      e0106d9 [Ram Sriharsha] Fix typo
      935cf56 [Ram Sriharsha] Try to match Java and Scala Example Commandline options
      5323ff9 [Ram Sriharsha] cleanup
      196a59a [Ram Sriharsha] cleanup
      6adfa0c [Ram Sriharsha] Style Fix
      8cfc5d5 [Ram Sriharsha] [SPARK-7575] Example code for OneVsRest
  12. May 14, 2015
    • [SPARK-7568] [ML] ml.LogisticRegression doesn't output the right prediction · c1080b6f
      DB Tsai authored
      The difference arises because we previously didn't fit the intercept in Spark 1.3. Here, we change the input `String` so that instance 6 is classified as `1.0` without any ambiguity.
      
      with lambda = 0.001 in current LOR implementation, the prediction is
      ```
      (4, spark i j k) --> prob=[0.1596407738787411,0.8403592261212589], prediction=1.0
      (5, l m n) --> prob=[0.8378325685476612,0.16216743145233883], prediction=0.0
      (6, spark hadoop spark) --> prob=[0.0692663313297627,0.9307336686702373], prediction=1.0
      (7, apache hadoop) --> prob=[0.9821575333444208,0.01784246665557917], prediction=0.0
      ```
      and the predictions on the training set are
      ```
      (0, a b c d e spark) --> prob=[0.0021342419881406746,0.9978657580118594], prediction=1.0
      (1, b d) --> prob=[0.9959176174854043,0.004082382514595685], prediction=0.0
      (2, spark f g h) --> prob=[0.0014541569986711233,0.9985458430013289], prediction=1.0
      (3, hadoop mapreduce) --> prob=[0.9982978367343561,0.0017021632656438518], prediction=0.0
      ```
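
      For context, numbers like these come from a text-classification pipeline along the following lines (a sketch assuming the usual tokenizer/TF/LOR stages, not the exact example source):

      ```scala
      import org.apache.spark.ml.Pipeline
      import org.apache.spark.ml.classification.LogisticRegression
      import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

      // Tokenize the text, hash tokens into term-frequency vectors, then fit
      // logistic regression with a small regularization parameter (lambda).
      val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
      val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
      val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)
      val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
      ```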
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #6109 from dbtsai/lor-example and squashes the following commits:
      
      ac63ce4 [DB Tsai] first commit
    • [SPARK-7407] [MLLIB] use uid + name to identify parameters · 1b8625f4
      Xiangrui Meng authored
      A param instance is strongly attached to a parent in the current implementation, so if we make a copy of an estimator or a transformer in pipelines and other meta-algorithms, it becomes error-prone to copy the params to the copied instances. In this PR, a param is identified by its parent's UID and the param name, so it becomes loosely attached to its parent and all its derivatives. The UID is preserved during copying or fitting. All components now have a default constructor and a constructor that takes a UID as input. I kept the constructors for Param in this PR to reduce the size of the diff, and made `parent` a mutable field.
      
      This PR still needs some clean-ups, and there are several spark.ml PRs pending. I'll try to get them merged first and then update this PR.
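
      In code terms, the new identification scheme behaves like this (an illustration under the scheme described above, not code from the patch):

      ```scala
      import org.apache.spark.ml.classification.LogisticRegression
      import org.apache.spark.ml.param.ParamMap

      val lr = new LogisticRegression().setMaxIter(25)
      val lrCopy = lr.copy(ParamMap.empty)

      // A param is identified by (parent uid, name); the uid is preserved
      // across copies, and param values travel with it.
      assert(lr.maxIter.parent == lr.uid)
      assert(lrCopy.getMaxIter == 25)
      ```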
      
      jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6019 from mengxr/SPARK-7407 and squashes the following commits:
      
      c4c8120 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
      520f0a2 [Xiangrui Meng] address comments
      2569168 [Xiangrui Meng] fix tests
      873caca [Xiangrui Meng] fix tests in OneVsRest; fix a racing condition in shouldOwn
      409ea08 [Xiangrui Meng] minor updates
      83a163c [Xiangrui Meng] update JavaDeveloperApiExample
      5db5325 [Xiangrui Meng] update OneVsRest
      7bde7ae [Xiangrui Meng] merge master
      697fdf9 [Xiangrui Meng] update Bucketizer
      7b4f6c2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
      629d402 [Xiangrui Meng] fix LRSuite
      154516f [Xiangrui Meng] merge master
      aa4a611 [Xiangrui Meng] fix examples/compile
      a4794dd [Xiangrui Meng] change Param to use  to reduce the size of diff
      fdbc415 [Xiangrui Meng] all tests passed
      c255f17 [Xiangrui Meng] fix tests in ParamsSuite
      818e1db [Xiangrui Meng] merge master
      e1160cf [Xiangrui Meng] fix tests
      fbc39f0 [Xiangrui Meng] pass test:compile
      108937e [Xiangrui Meng] pass compile
      8726d39 [Xiangrui Meng] use parent uid in Param
      eaeed35 [Xiangrui Meng] update Identifiable
  13. May 11, 2015
    • [SPARK-7522] [EXAMPLES] Removed angle brackets from dataFormat option · 4f8a1551
      Bryan Cutler authored
      As is, to specify this option on the command line, you have to escape the angle brackets.
      
      Author: Bryan Cutler <bjcutler@us.ibm.com>
      
      Closes #6049 from BryanCutler/dataFormat-option-7522 and squashes the following commits:
      
      b34afb4 [Bryan Cutler] [SPARK-7522] Removed angle brackets from dataFormat option
  14. May 09, 2015
  15. May 06, 2015
    • [SPARK-7396] [STREAMING] [EXAMPLE] Update KafkaWordCountProducer to use new Producer API · 316a5c04
      jerryshao authored
      Otherwise it will throw an exception:
      
      ```
      Exception in thread "main" kafka.common.FailedToSendMessageException: Failed to send messages after 3 tries.
      	at kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:90)
      	at kafka.producer.Producer.send(Producer.scala:77)
      	at org.apache.spark.examples.streaming.KafkaWordCountProducer$.main(KafkaWordCount.scala:96)
      	at org.apache.spark.examples.streaming.KafkaWordCountProducer.main(KafkaWordCount.scala)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:623)
      	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
      	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
      	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
      	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
      ```
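
      The example now targets the new producer client, along these lines (the broker address and topic are illustrative):

      ```scala
      import java.util.HashMap

      import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

      // New org.apache.kafka.clients producer API, replacing the old
      // kafka.producer.Producer used by the example before this change.
      val props = new HashMap[String, Object]()
      props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
      props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer")
      props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer")

      val producer = new KafkaProducer[String, String](props)
      producer.send(new ProducerRecord[String, String]("wordcount", "hello world"))
      producer.close()
      ```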
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #5936 from jerryshao/SPARK-7396 and squashes the following commits:
      
      270bbe2 [jerryshao] Fix Kafka Produce throw Exception issue
    • [SPARK-6799] [SPARKR] Remove SparkR RDD examples, add dataframe examples · 4e930420
      Shivaram Venkataraman authored
      This PR also makes some of the DataFrame-to-RDD methods private, as the RDD class is private in 1.4.
      
      cc rxin pwendell
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #5949 from shivaram/sparkr-examples and squashes the following commits:
      
      6c42fdc [Shivaram Venkataraman] Remove SparkR RDD examples, add dataframe examples
  16. May 05, 2015
    • [SPARK-6612] [MLLIB] [PYSPARK] Python KMeans parity · 5995ada9
      Hrishikesh Subramonian authored
      The following items are added to the Python KMeans API:

      * KMeans: setEpsilon, setInitializationSteps
      * KMeansModel: computeCost, k
      
      Author: Hrishikesh Subramonian <hrishikesh.subramonian@flytxt.com>
      
      Closes #5647 from FlytxtRnD/newPyKmeansAPI and squashes the following commits:
      
      b9e451b [Hrishikesh Subramonian] set seed to fixed value in doc test
      5fd3ced [Hrishikesh Subramonian] doc test corrections
      20b3c68 [Hrishikesh Subramonian] python 3 fixes
      4d4e695 [Hrishikesh Subramonian] added arguments in python tests
      21eb84c [Hrishikesh Subramonian] Python Kmeans - setEpsilon, setInitializationSteps, k and computeCost added.
    • [SPARK-7357] Improving HBaseTest example · 51f46200
      Jihong MA authored
      Author: Jihong MA <linlin200605@gmail.com>
      
      Closes #5904 from JihongMA/SPARK-7357 and squashes the following commits:
      
      7d6153a [Jihong MA] SPARK-7357 Improving HBaseTest example
    • [MINOR] Renamed variables in SparkKMeans.scala, LocalKMeans.scala and kmeans.py to simplify readability · da738cff
      Niccolo Becchi authored
      
      With the previous names it could look as though the reduceByKey sums the abscissas and ordinates of some 2D points separately. Renaming the variables should make the example easier to understand, especially for those who, like me, are just starting out with functional programming.
      
      Author: Niccolo Becchi <niccolo.becchi@gmail.com>
      Author: pippobaudos <niccolo.becchi@gmail.com>
      
      Closes #5875 from pippobaudos/patch-1 and squashes the following commits:
      
      3bb3a47 [pippobaudos] renamed variables in LocalKMeans.scala and kmeans.py to simplify readability
      2c2a7a2 [Niccolo Becchi] Update SparkKMeans.scala
  17. May 04, 2015
    • [SPARK-5956] [MLLIB] Pipeline components should be copyable. · e0833c59
      Xiangrui Meng authored
      This PR added `copy(extra: ParamMap): Params` to `Params`, which makes a copy of the current instance with a randomly generated uid and some extra param values. With this change, we only need to implement `fit` and `transform` without extra param values given the default implementation of `fit(dataset, extra)`:
      
      ~~~scala
      def fit(dataset: DataFrame, extra: ParamMap): Model = {
        copy(extra).fit(dataset)
      }
      ~~~
      
      Inside `fit` and `transform`, since only the embedded values are used, I added `$` as an alias for `getOrDefault` to make the code easier to read. For example, in `LinearRegression.fit` we have:
      
      ~~~scala
      val effectiveRegParam = $(regParam) / yStd
      val effectiveL1RegParam = $(elasticNetParam) * effectiveRegParam
      val effectiveL2RegParam = (1.0 - $(elasticNetParam)) * effectiveRegParam
      ~~~
      
      Meta-algorithms like `Pipeline` implement their own `copy(extra)`. So the fitted pipeline model stores all copied stages (no matter whether a stage is a transformer or a model).
      
      Other changes:
      * `Params$.inheritValues` is moved to `Params.copyValues` and returns the target instance.
      * `fittingParamMap` was removed because the `parent` carries this information.
      * `validate` was renamed to `validateParams` to be more precise.
      
      TODOs:
      * [x] add tests for newly added methods
      * [ ] update documentation
      
      jkbradley dbtsai
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5820 from mengxr/SPARK-5956 and squashes the following commits:
      
      7bef88d [Xiangrui Meng] address comments
      05229c3 [Xiangrui Meng] assert -> assertEquals
      b2927b1 [Xiangrui Meng] organize imports
      f14456b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956
      93e7924 [Xiangrui Meng] add tests for hasParam & copy
      463ecae [Xiangrui Meng] merge master
      2b954c3 [Xiangrui Meng] update Binarizer
      465dd12 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5956
      282a1a8 [Xiangrui Meng] fix test
      819dd2d [Xiangrui Meng] merge master
      b642872 [Xiangrui Meng] example code runs
      5a67779 [Xiangrui Meng] examples compile
      c76b4d1 [Xiangrui Meng] fix all unit tests
      0f4fd64 [Xiangrui Meng] fix some tests
      9286a22 [Xiangrui Meng] copyValues to trained models
      53e0973 [Xiangrui Meng] move inheritValues to Params and rename it to copyValues
      9ee004e [Xiangrui Meng] merge copy and copyWith; rename validate to validateParams
      d882afc [Xiangrui Meng] test compile
      f082a31 [Xiangrui Meng] make Params copyable and simply handling of extra params in all spark.ml components
  18. Apr 29, 2015
    • [SPARK-7176] [ML] Add validation functionality to Param · 114bad60
      Joseph K. Bradley authored
      Main change: Added isValid field to Param.  Modified all usages to use isValid when relevant.  Added helper methods in ParamValidate.
      
      Also overrode Params.validate() in:
      * CrossValidator + model
      * Pipeline + model
      
      I made a few updates for the elastic net patch:
      * I changed "tol" to "convergenceTol"
      * I added some documentation
      
      This PR is Scala + Java only.  Python will be in a follow-up PR.
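
      The shape of the new hook, as a sketch (using the post-rename ParamValidators name; the trait itself is hypothetical):

      ```scala
      import org.apache.spark.ml.param.{IntParam, ParamValidators, Params}

      // A shared-param-style trait: the fourth argument is the isValid
      // predicate, checked whenever the param is set.
      trait HasMaxIterSketch extends Params {
        val maxIter: IntParam = new IntParam(this, "maxIter",
          "maximum number of iterations (>= 0)", ParamValidators.gtEq(0))
        def getMaxIter: Int = $(maxIter)
      }
      ```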
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5740 from jkbradley/enforce-validate and squashes the following commits:
      
      ad9c6c1 [Joseph K. Bradley] re-generated sharedParams after merging with current master
      76415e8 [Joseph K. Bradley] reverted convergenceTol to tol
      af62f4b [Joseph K. Bradley] Removed changes to SparkBuild, python linalg.  Fixed test failures.  Renamed ParamValidate to ParamValidators.  Removed explicit type from ParamValidators calls where possible.
      bb2665a [Joseph K. Bradley] merged with elastic net pr
      ecda302 [Joseph K. Bradley] fix rat tests, plus add a little doc
      6895dfc [Joseph K. Bradley] small cleanups
      069ac6d [Joseph K. Bradley] many cleanups
      928fb84 [Joseph K. Bradley] Maybe done
      a910ac7 [Joseph K. Bradley] still workin
      6d60e2e [Joseph K. Bradley] Still workin
      b987319 [Joseph K. Bradley] Partly done with adding checks, but blocking on adding checking functionality to Param
      dbc9fb2 [Joseph K. Bradley] merged with master.  enforcing Params.validate
  19. Apr 28, 2015
    • [SPARK-5932] [CORE] Use consistent naming for size properties · 2d222fb3
      Ilya Ganelin authored
      I've added an interface to JavaUtils to do byte conversion and added hooks within Utils.scala to handle conversion within Spark code (like for time strings). I've added matching tests for size conversion, and then updated all deprecated configs and documentation as per SPARK-5933.
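
      A quick sketch of what the byte-conversion interface enables (method names per the PR's own commit notes; treat the specifics as assumptions):

      ```scala
      import org.apache.spark.network.util.JavaUtils

      // Suffixed size strings parse uniformly, in binary units (kibi/mebi/gibi).
      val asBytes = JavaUtils.byteStringAsBytes("1g")  // 1073741824
      val asMb = JavaUtils.byteStringAsMb("512k")      // 0 -- rounds down to whole MiB
      ```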
      
      Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
      
      Closes #5574 from ilganeli/SPARK-5932 and squashes the following commits:
      
      11f6999 [Ilya Ganelin] Nit fixes
      49a8720 [Ilya Ganelin] Whitespace fix
      2ab886b [Ilya Ganelin] Scala style
      fc85733 [Ilya Ganelin] Got rid of floating point math
      852a407 [Ilya Ganelin] [SPARK-5932] Added much improved overflow handling. Can now handle sizes up to Long.MAX_VALUE Petabytes instead of being capped at Long.MAX_VALUE Bytes
      9ee779c [Ilya Ganelin] Simplified fraction matches
      22413b1 [Ilya Ganelin] Made MAX private
      3dfae96 [Ilya Ganelin] Fixed some nits. Added automatic conversion of old paramter for kryoserializer.mb to new values.
      e428049 [Ilya Ganelin] resolving merge conflict
      8b43748 [Ilya Ganelin] Fixed error in pattern matching for doubles
      84a2581 [Ilya Ganelin] Added smoother handling of fractional values for size parameters. This now throws an exception and added a warning for old spark.kryoserializer.buffer
      d3d09b6 [Ilya Ganelin] [SPARK-5932] Fixing error in KryoSerializer
      fe286b4 [Ilya Ganelin] Resolved merge conflict
      c7803cd [Ilya Ganelin] Empty lines
      54b78b4 [Ilya Ganelin] Simplified byteUnit class
      69e2f20 [Ilya Ganelin] Updates to code
      f32bc01 [Ilya Ganelin] [SPARK-5932] Fixed error in API in SparkConf.scala where Kb conversion wasn't being done properly (was Mb). Added test cases for both timeUnit and ByteUnit conversion
      f15f209 [Ilya Ganelin] Fixed conversion of kryo buffer size
      0f4443e [Ilya Ganelin]     Merge remote-tracking branch 'upstream/master' into SPARK-5932
      35a7fa7 [Ilya Ganelin] Minor formatting
      928469e [Ilya Ganelin] [SPARK-5932] Converted some longs to ints
      5d29f90 [Ilya Ganelin] [SPARK-5932] Finished documentation updates
      7a6c847 [Ilya Ganelin] [SPARK-5932] Updated spark.shuffle.file.buffer
      afc9a38 [Ilya Ganelin] [SPARK-5932] Updated spark.broadcast.blockSize and spark.storage.memoryMapThreshold
      ae7e9f6 [Ilya Ganelin] [SPARK-5932] Updated spark.io.compression.snappy.block.size
      2d15681 [Ilya Ganelin] [SPARK-5932] Updated spark.executor.logs.rolling.size.maxBytes
      1fbd435 [Ilya Ganelin] [SPARK-5932] Updated spark.broadcast.blockSize
      eba4de6 [Ilya Ganelin] [SPARK-5932] Updated spark.shuffle.file.buffer.kb
      b809a78 [Ilya Ganelin] [SPARK-5932] Updated spark.kryoserializer.buffer.max
      0cdff35 [Ilya Ganelin] [SPARK-5932] Updated to use bibibytes in method names. Updated spark.kryoserializer.buffer.mb and spark.reducer.maxMbInFlight
      475370a [Ilya Ganelin] [SPARK-5932] Simplified ByteUnit code, switched to using longs. Updated docs to clarify that we use kibi, mebi etc instead of kilo, mega
      851d691 [Ilya Ganelin] [SPARK-5932] Updated memoryStringToMb to use new interfaces
      a9f4fcf [Ilya Ganelin] [SPARK-5932] Added unit tests for unit conversion
      747393a [Ilya Ganelin] [SPARK-5932] Added unit tests for ByteString conversion
      09ea450 [Ilya Ganelin] [SPARK-5932] Added byte string conversion to Jav utils
      5390fd9 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-5932
      db9a963 [Ilya Ganelin] Closing second spark context
      1dc0444 [Ilya Ganelin] Added ref equality check
      8c884fa [Ilya Ganelin] Made getOrCreate synchronized
      cb0c6b7 [Ilya Ganelin] Doc updates and code cleanup
      270cfe3 [Ilya Ganelin] [SPARK-6703] Documentation fixes
      15e8dea [Ilya Ganelin] Updated comments and added MiMa Exclude
      0e1567c [Ilya Ganelin] Got rid of unecessary option for AtomicReference
      dfec4da [Ilya Ganelin] Changed activeContext to AtomicReference
      733ec9f [Ilya Ganelin] Fixed some bugs in test code
      8be2f83 [Ilya Ganelin] Replaced match with if
      e92caf7 [Ilya Ganelin] [SPARK-6703] Added test to ensure that getOrCreate both allows creation, retrieval, and a second context if desired
      a99032f [Ilya Ganelin] Spacing fix
      d7a06b8 [Ilya Ganelin] Updated SparkConf class to add getOrCreate method. Started test suite implementation
    • [SPARK-5946] [STREAMING] Add Python API for direct Kafka stream · 9e4e82b7
      jerryshao authored
      Currently this only adds the `createDirectStream` API; I'm not sure if `createRDD` is also needed, since some Java objects need to be wrapped in Python. Please help to review, thanks a lot.
      
      Author: jerryshao <saisai.shao@intel.com>
      Author: Saisai Shao <saisai.shao@intel.com>
      
      Closes #4723 from jerryshao/direct-kafka-python-api and squashes the following commits:
      
      a1fe97c [jerryshao] Fix rebase issue
      eebf333 [jerryshao] Address the comments
      da40f4e [jerryshao] Fix Python 2.6 Syntax error issue
      5c0ee85 [jerryshao] Style fix
      4aeac18 [jerryshao] Fix bug in example code
      7146d86 [jerryshao] Add unit test
      bf3bdd6 [jerryshao] Add more APIs and address the comments
      f5b3801 [jerryshao] Small style fix
      8641835 [Saisai Shao] Rebase and update the code
      589c05b [Saisai Shao] Fix the style
      d6fcb6a [Saisai Shao] Address the comments
      dfda902 [Saisai Shao] Style fix
      0f7d168 [Saisai Shao] Add the doc and fix some style issues
      67e6880 [Saisai Shao] Fix test bug
      917b0db [Saisai Shao] Add Python createRDD API for Kakfa direct stream
      c3fc11d [jerryshao] Modify the docs
      2c00936 [Saisai Shao] address the comments
      3360f44 [jerryshao] Fix code style
      e0e0f0d [jerryshao] Code clean and bug fix
      338c41f [Saisai Shao] Add python API and example for direct kafka stream
  20. Apr 27, 2015
    • [SPARK-7090] [MLLIB] Introduce LDAOptimizer to LDA to further improve extensibility · 4d9e560b
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-7090
      
      LDA was implemented with extensibility in mind. With the development of OnlineLDA and Gibbs sampling, we are collecting more detailed requirements from different algorithms.
      As Joseph Bradley (jkbradley) proposed in https://github.com/apache/spark/pull/4807, and after some further discussion, we'd like to adjust the code structure a little to present the common interface and extension point clearly.
      Basically, class LDA is the common entry point for LDA computation, and each LDA object refers to an LDAOptimizer for the concrete algorithm implementation. Users can customize an LDAOptimizer with specific parameters and assign it to the LDA.
      
      Concrete changes:
      
      1. Add a trait `LDAOptimizer`, which defines the common interface for concrete implementations. Each subclass is a wrapper for a specific LDA algorithm.
      
      2. Move EMOptimizer to the file LDAOptimizer, make it inherit from LDAOptimizer, and rename it to EMLDAOptimizer (in case a more generic EMOptimizer comes along in the future).
              - Adjust the constructor of EMOptimizer, since all the parameters should be passed in through the initialState method. This avoids unwanted confusion or overwrites.
              - Move the code from LDA.initialState to the initialState of EMLDAOptimizer.
      
      3. Add property ldaOptimizer to LDA and its getter/setter, and EMLDAOptimizer is the default Optimizer.
      
      4. Change the return type of LDA.run from DistributedLDAModel to LDAModel.
      
      Further work:
      add OnlineLDAOptimizer and other possible Optimizers once ready.
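
      The resulting API shape, as a sketch with a toy corpus:

      ```scala
      import org.apache.spark.SparkContext
      import org.apache.spark.mllib.clustering.{EMLDAOptimizer, LDA, LDAModel}
      import org.apache.spark.mllib.linalg.Vectors

      def sketch(sc: SparkContext): LDAModel = {
        // Toy corpus of (docId, termCount vector) pairs.
        val corpus = sc.parallelize(Seq(
          (0L, Vectors.dense(1.0, 2.0, 0.0)),
          (1L, Vectors.dense(0.0, 3.0, 1.0))
        ))
        // Pick a concrete optimizer and hand it to the common LDA entry
        // point; run now returns the generic LDAModel.
        new LDA().setK(2).setOptimizer(new EMLDAOptimizer).run(corpus)
      }
      ```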
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #5661 from hhbyyh/ldaRefactor and squashes the following commits:
      
      0e2e006 [Yuhao Yang] respond to review comments
      08a45da [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
      e756ce4 [Yuhao Yang] solve mima exception
      d74fd8f [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
      0bb8400 [Yuhao Yang] refactor LDA with Optimizer
      ec2f857 [Yuhao Yang] protoptype for discussion
    • SPARK-7107 Add parameter for zookeeper.znode.parent to hbase_inputformat.py · ef82bddc
      tedyu authored
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #5673 from tedyu/master and squashes the following commits:
      
      ab7c72b [tedyu] SPARK-7107 Adjust indentation to pass Python style tests
      6e25939 [tedyu] Adjust line length to be shorter than 100 characters
      18d172a [tedyu] SPARK-7107 Add parameter for zookeeper.znode.parent to hbase_inputformat.py
  21. Apr 25, 2015
    • [SPARK-6113] [ML] Tree ensembles for Pipelines API · a7160c4e
      Joseph K. Bradley authored
      This is a continuation of [https://github.com/apache/spark/pull/5530] (which was for Decision Trees), but for ensembles: Random Forests and Gradient-Boosted Trees.  Please refer to the JIRA [https://issues.apache.org/jira/browse/SPARK-6113], the design doc linked from the JIRA, and the previous PR linked above for design discussions.
      
      This PR follows the example set by the previous PR for Decision Trees.  It includes a few cleanups to Decision Trees.
      
      Note: There is one issue which will be addressed in a separate PR: Ensembles' component Models have no parent or fittingParamMap.  I plan to submit a separate PR which makes those values in Model be Options.  It does not matter much which PR gets merged first.
      
      CC: mengxr manishamde codedeft chouqin
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5626 from jkbradley/dt-api-ensembles and squashes the following commits:
      
      729167a [Joseph K. Bradley] small cleanups based on code review
      bbae2a2 [Joseph K. Bradley] Updated per all comments in code review
      855aa9a [Joseph K. Bradley] scala style fix
      ea3d901 [Joseph K. Bradley] Added GBT to spark.ml, with tests and examples
      c0f30c1 [Joseph K. Bradley] Added random forests and test suites to spark.ml.  Not tested yet.  Need to add example as well
      d045ebd [Joseph K. Bradley] some more updates, but far from done
      ee1a10b [Joseph K. Bradley] Added files from old PR and did some initial updates.
    • update the deprecated CountMinSketchMonoid function to TopPctCMS function · cca9905b
      KeheCAI authored
      http://twitter.github.io/algebird/index.html#com.twitter.algebird.legacy.CountMinSketchMonoid$
      The CountMinSketchMonoid has been deprecated since 0.8.1. Newer code should use TopPctCMS.monoid().
      
      ![image](https://cloud.githubusercontent.com/assets/1327396/7269619/d8b48b92-e8d5-11e4-8902-087f630e6308.png)
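
      The replacement call looks roughly like this (parameter values are illustrative):

      ```scala
      import com.twitter.algebird.TopPctCMS
      import com.twitter.algebird.CMSHasherImplicits._

      // TopPctCMS.monoid supersedes the deprecated CountMinSketchMonoid in
      // Algebird 0.9: eps/delta control accuracy, the last argument sets the
      // heavy-hitters percentage.
      val cmsMonoid = TopPctCMS.monoid[Long](0.001, 1E-10, 1, 0.01)
      ```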
      
      Author: KeheCAI <caikehe@gmail.com>
      
      Closes #5629 from caikehe/master and squashes the following commits:
      
      e8aa06f [KeheCAI] update algebird-core to version 0.9.0 from 0.8.1
      5653351 [KeheCAI] change scala code style
      4c0dfd1 [KeheCAI] update the deprecated CountMinSketchMonoid function to TopPctCMS function
  22. Apr 24, 2015
    • [PySpark][Minor] Update sql example, so that can read file correctly · d874f8b5
      linweizhong authored
      By default, Spark will read the file from HDFS if we don't set the URI scheme.
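
      The same pitfall sketched in Scala terms (the path is illustrative; `sc` is a SparkContext):

      ```scala
      // Without an explicit URI scheme the path resolves against fs.defaultFS,
      // which is HDFS on a typical cluster; local example files need file://.
      val lines = sc.textFile("file:///opt/spark/examples/src/main/resources/people.json")
      ```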
      
      Author: linweizhong <linweizhong@huawei.com>
      
      Closes #5684 from Sephiroth-Lin/pyspark_example_minor and squashes the following commits:
      
      19fe145 [linweizhong] Update example sql.py, so that can read file correctly
    • [SPARK-6122] [CORE] Upgrade tachyon-client version to 0.6.3 · 438859eb
      Calvin Jia authored
      This is a reopening of #4867.
      A short summary of the issues resolved from the previous PR:
      
      1. HTTPClient version mismatch: Selenium (used for UI tests) requires version 4.3.x, and Tachyon included 4.2.5 through a transitive dependency of its shaded thrift jar. To address this, Tachyon 0.6.3 will promote the transitive dependencies of the shaded jar so they can be excluded in Spark.
      
      2. Jackson-Mapper-ASL version mismatch: in lower versions of hadoop-client (i.e. 1.0.4), version 1.0.1 is included. The parquet library used in Spark SQL requires version 1.8+. It's unclear to me why upgrading tachyon-client would cause this dependency to break. The solution was to exclude jackson-mapper-asl from hadoop-client.
      
      It seems that the dependency management in spark-parent will not work on transitive dependencies; one way to make sure jackson-mapper-asl is included with the correct version is to add it as a top-level dependency. The best solution would be to exclude the dependency in the modules which require a higher version, but that did not fix the unit tests. Any suggestions on the best way to solve this would be appreciated!
      
      Author: Calvin Jia <jia.calvin@gmail.com>
      
      Closes #5354 from calvinjia/upgrade_tachyon_0.6.3 and squashes the following commits:
      
      0eefe4d [Calvin Jia] Handle httpclient version in maven dependency management. Remove httpclient version setting from profiles.
      7c00dfa [Calvin Jia] Set httpclient version to 4.3.2 for selenium. Specify version of httpclient for sql/hive (previously 4.2.5 transitive dependency of libthrift).
      9263097 [Calvin Jia] Merge master to test latest changes
      dbfc1bd [Calvin Jia] Use Tachyon 0.6.4 for cleaner dependencies.
      e2ff80a [Calvin Jia] Exclude the jetty and curator promoted dependencies from tachyon-client.
      a3a29da [Calvin Jia] Update tachyon-client exclusions.
      0ae6c97 [Calvin Jia] Change tachyon version to 0.6.3
      a204df9 [Calvin Jia] Update make distribution tachyon version.
      a93c94f [Calvin Jia] Exclude jackson-mapper-asl from hadoop client since it has a lower version than spark's expected version.
      a8a923c [Calvin Jia] Exclude httpcomponents from Tachyon
      910fabd [Calvin Jia] Update to master
      eed9230 [Calvin Jia] Update tachyon version to 0.6.1.
      11907b3 [Calvin Jia] Use TachyonURI for tachyon paths instead of strings.
      71bf441 [Calvin Jia] Upgrade Tachyon client version to 0.6.0.
  23. Apr 23, 2015
  24. Apr 21, 2015
    • [SPARK-6113] [ML] Small cleanups after original tree API PR · 607eff0e
      Joseph K. Bradley authored
      This does a few clean-ups.  With this PR, all spark.ml tree components have ```private[ml]``` constructors.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5567 from jkbradley/dt-api-dt2 and squashes the following commits:
      
      2263b5b [Joseph K. Bradley] Added note about tree example issue.
      bb9f610 [Joseph K. Bradley] Small cleanups after original tree API PR
  25. Apr 20, 2015
  26. Apr 17, 2015
    • [SPARK-6113] [ml] Stabilize DecisionTree API · a83571ac
      Joseph K. Bradley authored
      This is a PR for cleaning up and finalizing the DecisionTree API.  PRs for ensembles will follow once this is merged.
      
      ### Goal
      
      Here is the description copied from the JIRA (for both trees and ensembles):
      
      > **Issue**: The APIs for DecisionTree and ensembles (RandomForests and GradientBoostedTrees) have been experimental for a long time. The API has become very convoluted because trees and ensembles have many, many variants, some of which we have added incrementally without a long-term design.
      > **Proposal**: This JIRA is for discussing changes required to finalize the APIs. After we discuss, I will make a PR to update the APIs and make them non-Experimental. This will require making many breaking changes; see the design doc for details.
      > **[Design doc](https://docs.google.com/document/d/1rJ_DZinyDG3PkYkAKSsQlY0QgCeefn4hUv7GsPkzBP4)** : This outlines current issues and the proposed API.
      
      Overall code layout:
      * The old API in mllib.tree.* will remain the same.
      * The new API will reside in ml.classification.* and ml.regression.*
      
      ### Summary of changes
      
      Old API
      * Exactly the same, except I made 1 method in Loss private (but that is not a breaking change since that method was introduced after the Spark 1.3 release).
      
      New APIs
      * Under Pipeline API
      * The new API preserves functionality, except:
        * New API does NOT store prob (probability of label in classification).  I want to have it store the full vector of probabilities but feel that should be in a later PR.
      * Use abstractions for parameters, estimators, and models to avoid code duplication
      * Limit parameters to relevant algorithms
      * For enum-like types, only expose Strings
        * We can make these pluggable later on by adding new parameters.  That is a far-future item.
      
      Test suites
      * I organized DecisionTreeSuite, but I made absolutely no changes to the tests themselves.
      * The test suites for the new API only test (a) similarity with the results of the old API and (b) elements of the new API.
        * After code is moved to this new API, we should move the tests from the old suites which test the internals.
      
      ### Details
      
      #### Changed names
      
      Parameters
      * useNodeIdCache -> cacheNodeIds
      
      #### Other changes
      
      * Split: Changed categories to set instead of list
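
      A minimal sketch of how the renamed pieces read under the new API (the parameter names follow the notes above; the rest is assumption):

      ```scala
      import org.apache.spark.ml.classification.DecisionTreeClassifier

      // cacheNodeIds replaces useNodeIdCache, and enum-like parameters such
      // as impurity are exposed as plain Strings.
      val dt = new DecisionTreeClassifier()
        .setMaxDepth(5)
        .setCacheNodeIds(true)
        .setImpurity("gini")
      ```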
      
      #### Non-decision tree changes
      * AttributeGroup
        * Added parentheses to toMetadata, toStructField methods (These were removed in a previous PR, but I ran into 1 issue with the Scala compiler not being able to disambiguate between a toMetadata method with no parentheses and a toMetadata method which takes 1 argument.)
      * Attributes
        * Renamed: toMetadata -> toMetadataImpl
        * Added toMetadata methods which return ML metadata (keyed with “ML_ATTR”)
        * NominalAttribute: Added getNumValues method which examines both numValues and values.
      * Params.inheritValues: Checks whether the parent param really belongs to the child (to allow Estimator-Model pairs with different sets of parameters)
      
      ### Questions for reviewers
      
      * Is "DecisionTreeClassificationModel" too long a name?
      * Is this OK in the docs?
      ```
      class DecisionTreeRegressor extends TreeRegressor[DecisionTreeRegressionModel] with DecisionTreeParams[DecisionTreeRegressor] with TreeRegressorParams[DecisionTreeRegressor]
      ```
      
      ### Future
      
      We should open up the abstractions at some point.  E.g., it would be useful to be able to set tree-related parameters in 1 place and then pass those to multiple tree-based algorithms.
      
      Follow-up JIRAs will be (in this order):
      * Tree ensembles
      * Deprecate old tree code
      * Move DecisionTree implementation code to new API.
      * Move tests from the old suites which test the internals.
      * Update programming guide
      * Python API
      * Change RandomForest* to always use bootstrapping, even when numTrees = 1
      * Provide the probability of the predicted label for classification.  After we move code to the new API and update it to maintain probabilities for all labels, then we can add the probabilities to the new API.
      
      CC: mengxr  manishamde  codedeft  chouqin  MechCoder
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5530 from jkbradley/dt-api-dt and squashes the following commits:
      
      6aae255 [Joseph K. Bradley] Changed tree abstractions not to take type parameters, and for setters to return this.type instead
      ec17947 [Joseph K. Bradley] Updates based on code review.  Main changes were: moving public types from ml.impl.tree to ml.tree, modifying CategoricalSplit to take an Array of categories but store a Set internally, making more types sealed or final
      5626c81 [Joseph K. Bradley] style fixes
      f8fbd24 [Joseph K. Bradley] imported reorg of DecisionTreeSuite from old PR.  small cleanups
      7ef63ed [Joseph K. Bradley] Added DecisionTreeRegressor, test suites, and example (for real this time)
      e11673f [Joseph K. Bradley] Added DecisionTreeRegressor, test suites, and example
      119f407 [Joseph K. Bradley] added DecisionTreeClassifier example
      0bdc486 [Joseph K. Bradley] fixed issues after param PR was merged
      f9fbb60 [Joseph K. Bradley] Done with DecisionTreeClassifier, but no save/load yet.  Need to add example as well
      2532c9a [Joseph K. Bradley] partial move to spark.ml API, not done yet
      c72c1a0 [Joseph K. Bradley] Copied changes for common items, plus DecisionTreeClassifier from original PR
  27. Apr 16, 2015
    • [SPARK-4897] [PySpark] Python 3 support · 04e44b37
      Davies Liu authored
      This PR updates PySpark to support Python 3 (tested with 3.4).
      
      Known issue: unpickling arrays from Pyrolite is broken in Python 3, so those tests are skipped.
      
      TODO: ec2/spark-ec2.py is not fully tested with Python 3.
      
      Author: Davies Liu <davies@databricks.com>
      Author: twneale <twneale@gmail.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5173 from davies/python3 and squashes the following commits:
      
      d7d6323 [Davies Liu] fix tests
      6c52a98 [Davies Liu] fix mllib test
      99e334f [Davies Liu] update timeout
      b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      cafd5ec [Davies Liu] adddress comments from @mengxr
      bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      179fc8d [Davies Liu] tuning flaky tests
      8c8b957 [Davies Liu] fix ResourceWarning in Python 3
      5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      4006829 [Davies Liu] fix test
      2fc0066 [Davies Liu] add python3 path
      71535e9 [Davies Liu] fix xrange and divide
      5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ed498c8 [Davies Liu] fix compatibility with python 3
      820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ad7c374 [Davies Liu] fix mllib test and warning
      ef1fc2f [Davies Liu] fix tests
      4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      59bb492 [Davies Liu] fix tests
      1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ca0fdd3 [Davies Liu] fix code style
      9563a15 [Davies Liu] add imap back for python 2
      0b1ec04 [Davies Liu] make python examples work with Python 3
      d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      a716d34 [Davies Liu] test with python 3.4
      f1700e8 [Davies Liu] fix test in python3
      671b1db [Davies Liu] fix test in python3
      692ff47 [Davies Liu] fix flaky test
      7b9699f [Davies Liu] invalidate import cache for Python 3.3+
      9c58497 [Davies Liu] fix kill worker
      309bfbf [Davies Liu] keep compatibility
      5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
      8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      f53e1f0 [Davies Liu] fix tests
      70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
      a39167e [Davies Liu] support customize class in __main__
      814c77b [Davies Liu] run unittests with python 3
      7f4476e [Davies Liu] mllib tests passed
      d737924 [Davies Liu] pass ml tests
      375ea17 [Davies Liu] SQL tests pass
      6cc42a9 [Davies Liu] rename
      431a8de [Davies Liu] streaming tests pass
      78901a7 [Davies Liu] fix hash of serializer in Python 3
      24b2f2e [Davies Liu] pass all RDD tests
      35f48fe [Davies Liu] run future again
      1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
      6e3c21d [Davies Liu] make cloudpickle work with Python3
      2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
      1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
      7354371 [twneale] buffer --> memoryview  I'm not super sure if this a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
      b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
      f40d925 [twneale] xrange --> range
      e104215 [twneale] Replaces 2.7 types.InstsanceType with 3.4 `object`....could be horribly wrong depending on how types.InstanceType is used elsewhere in the package--see http://bugs.python.org/issue8206
      79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
      2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
      854be27 [Josh Rosen] Run `futurize` on Python code:
      7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.