  1. Aug 01, 2015
    • [SPARK-8269] [SQL] string function: initcap · 00cd92f3
      HuJiayin authored
      This PR is based on #7208, thanks to HuJiayin.
      
      Closes #7208
      
      Author: HuJiayin <jiayin.hu@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7850 from davies/initcap and squashes the following commits:
      
      54472e9 [Davies Liu] fix python test
      17ffe51 [Davies Liu] Merge branch 'master' of github.com:apache/spark into initcap
      ca46390 [Davies Liu] Merge branch 'master' of github.com:apache/spark into initcap
      3a906e4 [Davies Liu] implement title case in UTF8String
      8b2506a [HuJiayin] Update functions.py
      2cd43e5 [HuJiayin] fix python style check
      b616c0e [HuJiayin] add python api
      1f5a0ef [HuJiayin] add codegen
      7e0c604 [HuJiayin] Merge branch 'master' of https://github.com/apache/spark into initcap
      6a0b958 [HuJiayin] add column
      c79482d [HuJiayin] support soundex
      7ce416b [HuJiayin] support initcap rebase code
      00cd92f3
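      A minimal PySpark sketch of the new function (hypothetical data; assumes a `sqlContext` built from a version that includes this patch):

      ```python
      from pyspark.sql import functions as F

      df = sqlContext.createDataFrame([("hello world",)], ["s"])
      df.select(F.initcap(df.s).alias("t")).collect()
      # [Row(t=u'Hello World')] -- first letter of each word upper-cased
      ```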
    • [SPARK-8263] [SQL] substr/substring should also support binary type · c5166f7a
      zhichao.li authored
      This is based on #7641, thanks to zhichao-li
      
      Closes #7641
      
      Author: zhichao.li <zhichao.li@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7848 from davies/substr and squashes the following commits:
      
      461b709 [Davies Liu] remove bytearry from tests
      b45377a [Davies Liu] Merge branch 'master' of github.com:apache/spark into substr
      01d795e [zhichao.li] scala style
      99aa130 [zhichao.li] add substring to dataframe
      4f68bfe [zhichao.li] add binary type support for substring
      c5166f7a
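      A short sketch of the newly supported binary case (hypothetical data; `substring` on strings worked before this patch):

      ```python
      from pyspark.sql import functions as F

      df = sqlContext.createDataFrame([(bytearray(b"abcdef"),)], ["b"])
      # With binary input, substring slices bytes instead of characters.
      df.select(F.substring(df.b, 1, 3).alias("t")).collect()  # first three bytes, 'abc'
      ```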
    • [SPARK-8232] [SQL] Add sort_array support · cf6c9ca3
      Cheng Hao authored
      This PR is based on #7581, just fixing the conflict.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7851 from davies/sort_array and squashes the following commits:
      
      a80ef66 [Davies Liu] fix conflict
      7cfda65 [Davies Liu] Merge branch 'master' of github.com:apache/spark into sort_array
      664c960 [Cheng Hao] update the sort_array by using the ArrayData
      276d2d5 [Cheng Hao] add empty line
      0edab9c [Cheng Hao] Add asending/descending support for sort_array
      80fc0f8 [Cheng Hao] Add type checking
      a42b678 [Cheng Hao] Add sort_array support
      cf6c9ca3
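      A minimal sketch of the new expression, including the ascending/descending flag added above (hypothetical data):

      ```python
      from pyspark.sql import functions as F

      df = sqlContext.createDataFrame([([2, 1, 3],)], ["a"])
      df.select(F.sort_array(df.a).alias("asc"),
                F.sort_array(df.a, asc=False).alias("desc")).collect()
      # [Row(asc=[1, 2, 3], desc=[3, 2, 1])]
      ```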
    • Revert "[SPARK-8232] [SQL] Add sort_array support" · 60ea7ab4
      Davies Liu authored
      This reverts commit 67ad4e21.
      60ea7ab4
    • [SPARK-8232] [SQL] Add sort_array support · 67ad4e21
      Cheng Hao authored
      Add expression `sort_array` support.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Davies Liu <davies.liu@gmail.com>
      
      Closes #7581 from chenghao-intel/sort_array and squashes the following commits:
      
      664c960 [Cheng Hao] update the sort_array by using the ArrayData
      276d2d5 [Cheng Hao] add empty line
      0edab9c [Cheng Hao] Add asending/descending support for sort_array
      80fc0f8 [Cheng Hao] Add type checking
      a42b678 [Cheng Hao] Add sort_array support
      67ad4e21
  2. Jul 31, 2015
    • [SPARK-8264][SQL]add substring_index function · 6996bd2e
      zhichao.li authored
      This PR is based on #7533, thanks to zhichao-li.
      
      Closes #7533
      
      Author: zhichao.li <zhichao.li@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7843 from davies/str_index and squashes the following commits:
      
      391347b [Davies Liu] add python api
      3ce7802 [Davies Liu] fix substringIndex
      f2d29a1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into str_index
      515519b [zhichao.li] add foldable and remove null checking
      9546991 [zhichao.li] scala style
      67c253a [zhichao.li] hide some apis and clean code
      b19b013 [zhichao.li] add codegen and clean code
      ac863e9 [zhichao.li] reduce the calling of numChars
      12e108f [zhichao.li] refine unittest
      d92951b [zhichao.li] add lastIndexOf
      52d7b03 [zhichao.li] add substring_index function
      6996bd2e
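      A minimal sketch of the new function (hypothetical data; MySQL-style semantics: everything before the count-th occurrence of the delimiter):

      ```python
      from pyspark.sql import functions as F

      df = sqlContext.createDataFrame([("a.b.c.d",)], ["s"])
      df.select(F.substring_index(df.s, ".", 2).alias("t")).collect()
      # [Row(t=u'a.b')]; a negative count would instead count delimiters from the right
      ```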
    • [SPARK-8271][SQL]string function: soundex · 4d5a6e7b
      HuJiayin authored
      This PR brings the SQL function soundex(); see https://issues.apache.org/jira/browse/HIVE-9738

      It's based on #7115, thanks to HuJiayin.
      
      Author: HuJiayin <jiayin.hu@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7812 from davies/soundex and squashes the following commits:
      
      fa75941 [Davies Liu] Merge branch 'master' of github.com:apache/spark into soundex
      a4bd6d8 [Davies Liu] fix soundex
      2538908 [HuJiayin] add codegen soundex
      d15d329 [HuJiayin] add back ut
      ded1a14 [HuJiayin] Merge branch 'master' of https://github.com/apache/spark
      e2dec2c [HuJiayin] support soundex rebase code
      4d5a6e7b
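      A minimal sketch of the new function (hypothetical names):

      ```python
      from pyspark.sql import functions as F

      df = sqlContext.createDataFrame([("Miller",), ("Muller",)], ["name"])
      # Similar-sounding names map to the same Soundex code (here M460 for both).
      df.select(F.soundex(df.name).alias("code")).collect()
      ```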
    • [SPARK-8564] [STREAMING] Add the Python API for Kinesis · 3afc1de8
      zsxwing authored
      This PR adds the Python API for Kinesis, including a Python example and a simple unit test.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6955 from zsxwing/kinesis-python and squashes the following commits:
      
      e42e471 [zsxwing] Merge branch 'master' into kinesis-python
      455f7ea [zsxwing] Remove streaming_kinesis_asl_assembly module and simply add the source folder to streaming_kinesis_asl module
      32e6451 [zsxwing] Merge remote-tracking branch 'origin/master' into kinesis-python
      5082d28 [zsxwing] Fix the syntax error for Python 2.6
      fca416b [zsxwing] Fix wrong comparison
      96670ff [zsxwing] Fix the compilation error after merging master
      756a128 [zsxwing] Merge branch 'master' into kinesis-python
      6c37395 [zsxwing] Print stack trace for debug
      7c5cfb0 [zsxwing] RUN_KINESIS_TESTS -> ENABLE_KINESIS_TESTS
      cc9d071 [zsxwing] Fix the python test errors
      466b425 [zsxwing] Add python tests for Kinesis
      e33d505 [zsxwing] Merge remote-tracking branch 'origin/master' into kinesis-python
      3da2601 [zsxwing] Fix the kinesis folder
      687446b [zsxwing] Fix the error message and the maven output path
      add2beb [zsxwing] Merge branch 'master' into kinesis-python
      4957c0b [zsxwing] Add the Python API for Kinesis
      3afc1de8
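      A rough sketch of the resulting Python API (assumes a running SparkContext `sc`; the app name, stream name, endpoint, and region below are hypothetical placeholders):

      ```python
      from pyspark.streaming import StreamingContext
      from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

      ssc = StreamingContext(sc, 2)  # 2-second batches
      stream = KinesisUtils.createStream(
          ssc, "myKinesisApp", "myStream",            # hypothetical app/stream names
          "https://kinesis.us-east-1.amazonaws.com",  # hypothetical endpoint URL
          "us-east-1", InitialPositionInStream.LATEST,
          checkpointInterval=2)
      stream.pprint()
      ```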
    • [SPARK-9214] [ML] [PySpark] support ml.NaiveBayes for Python · 69b62f76
      Yanbo Liang authored
      support ml.NaiveBayes for Python
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #7568 from yanboliang/spark-9214 and squashes the following commits:
      
      5ee3fd6 [Yanbo Liang] fix typos
      3ecd046 [Yanbo Liang] fix typos
      f9c94d1 [Yanbo Liang] change lambda_ to smoothing and fix other issues
      180452a [Yanbo Liang] fix typos
      7dda1f4 [Yanbo Liang] support ml.NaiveBayes for Python
      69b62f76
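      A small sketch of the new Python API (assumes hypothetical `training`/`test` DataFrames with label/features columns; note the `smoothing` parameter that replaced `lambda_` in the commits above):

      ```python
      from pyspark.ml.classification import NaiveBayes

      nb = NaiveBayes(smoothing=1.0)
      model = nb.fit(training)
      model.transform(test).select("prediction").show()
      ```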
    • [SPARK-7690] [ML] Multiclass classification Evaluator · 4e5919bf
      Ram Sriharsha authored
      Multiclass Classification Evaluator for ML Pipelines. F1 score, precision, recall, weighted precision and weighted recall are supported as available metrics.
      
      Author: Ram Sriharsha <rsriharsha@hw11853.local>
      
      Closes #7475 from harsha2010/SPARK-7690 and squashes the following commits:
      
      9bf4ec7 [Ram Sriharsha] fix indentation
      3f09a85 [Ram Sriharsha] cleanup doc
      16115ae [Ram Sriharsha] code review fixes
      032d2a3 [Ram Sriharsha] fix test
      eec9865 [Ram Sriharsha] Fix Python Indentation
      1dbeffd [Ram Sriharsha] Merge branch 'master' into SPARK-7690
      68cea85 [Ram Sriharsha] Merge branch 'master' into SPARK-7690
      54c03de [Ram Sriharsha] [SPARK-7690][ml][WIP] Multiclass Evaluator for ML Pipeline
      4e5919bf
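      A minimal sketch of the evaluator (assumes a hypothetical `predictions` DataFrame containing prediction and label columns):

      ```python
      from pyspark.ml.evaluation import MulticlassClassificationEvaluator

      # metricName: "f1", "precision", "recall", "weightedPrecision",
      # or "weightedRecall", per the description above
      evaluator = MulticlassClassificationEvaluator(metricName="f1")
      f1 = evaluator.evaluate(predictions)
      ```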
  3. Jul 30, 2015
    • [SPARK-8176] [SPARK-8197] [SQL] function to_date/ trunc · 83670fc9
      Daoyuan Wang authored
      This PR is based on #6988, thanks to adrian-wang.
      
      This brings two SQL functions: to_date() and trunc().
      
      Closes #6988
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7805 from davies/to_date and squashes the following commits:
      
      2c7beba [Davies Liu] Merge branch 'master' of github.com:apache/spark into to_date
      310dd55 [Daoyuan Wang] remove dup test in rebase
      980b092 [Daoyuan Wang] resolve rebase conflict
      a476c5a [Daoyuan Wang] address comments from davies
      d44ea5f [Daoyuan Wang] function to_date, trunc
      83670fc9
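      A minimal sketch of both functions (hypothetical data):

      ```python
      from pyspark.sql import functions as F

      df = sqlContext.createDataFrame([("2015-07-30 10:00:00",)], ["t"])
      df.select(F.to_date(df.t).alias("d")).collect()        # -> 2015-07-30
      df.select(F.trunc(F.to_date(df.t), "year")).collect()  # -> 2015-01-01
      # trunc also accepts "month"/"mon"/"mm" to truncate to the first of the month
      ```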
    • [SPARK-7157][SQL] add sampleBy to DataFrame · df326695
      Xiangrui Meng authored
      This was previously committed but then reverted due to test failures (see #6769).
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #7755 from rxin/SPARK-7157 and squashes the following commits:
      
      fbf9044 [Xiangrui Meng] fix python test
      542bd37 [Xiangrui Meng] update test
      604fe6d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7157
      f051afd [Xiangrui Meng] use udf instead of building expression
      f4e9425 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7157
      8fb990b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7157
      103beb3 [Xiangrui Meng] add Java-friendly sampleBy
      991f26f [Xiangrui Meng] fix seed
      4a14834 [Xiangrui Meng] move sampleBy to stat
      832f7cc [Xiangrui Meng] add sampleBy to DataFrame
      df326695
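      A minimal sketch of the new stratified-sampling helper (hypothetical data):

      ```python
      from pyspark.sql import functions as F

      df = sqlContext.range(0, 100).select((F.col("id") % 3).alias("key"))
      # Keep ~10% of key 0, ~20% of key 1, and drop keys absent from `fractions`.
      sampled = df.stat.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=0)
      sampled.groupBy("key").count().show()
      ```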
    • [SPARK-9408] [PYSPARK] [MLLIB] Refactor linalg.py to /linalg · ca71cc8c
      Xiangrui Meng authored
      This is based on MechCoder's PR https://github.com/apache/spark/pull/7731. Hopefully it can pass tests. MechCoder I tried to make minimal changes. If this passes Jenkins, we can merge this one first and then try to move `__init__.py` to `local.py` in a separate PR.
      
      Closes #7731
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #7746 from mengxr/SPARK-9408 and squashes the following commits:
      
      0e05a3b [Xiangrui Meng] merge master
      1135551 [Xiangrui Meng] add a comment for str(...)
      c48cae0 [Xiangrui Meng] update tests
      173a805 [Xiangrui Meng] move linalg.py to linalg/__init__.py
      ca71cc8c
    • [SPARK-8186] [SPARK-8187] [SPARK-8194] [SPARK-8198] [SPARK-9133] [SPARK-9290]... · 1abf7dc1
      Daoyuan Wang authored
      [SPARK-8186] [SPARK-8187] [SPARK-8194] [SPARK-8198] [SPARK-9133] [SPARK-9290] [SQL] functions: date_add, date_sub, add_months, months_between, time-interval calculation
      
      This PR is based on #7589, thanks to adrian-wang.
      
      Added the SQL functions date_add, date_sub, add_months, and months_between, and a rule for
      add/subtract of date/timestamp and interval.
      
      Closes #7589
      
      cc rxin
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7754 from davies/date_add and squashes the following commits:
      
      e8c633a [Davies Liu] Merge branch 'master' of github.com:apache/spark into date_add
      9e8e085 [Davies Liu] Merge branch 'master' of github.com:apache/spark into date_add
      6224ce4 [Davies Liu] fix conclict
      bd18cd4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into date_add
      e47ff2c [Davies Liu] add python api, fix date functions
      01943d0 [Davies Liu] Merge branch 'master' into date_add
      522e91a [Daoyuan Wang] fix
      e8a639a [Daoyuan Wang] fix
      42df486 [Daoyuan Wang] fix style
      87c4b77 [Daoyuan Wang] function add_months, months_between and some fixes
      1a68e03 [Daoyuan Wang] poc of time interval calculation
      c506661 [Daoyuan Wang] function date_add , date_sub
      1abf7dc1
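      A minimal sketch of the new date arithmetic (hypothetical data):

      ```python
      from pyspark.sql import functions as F

      df = sqlContext.createDataFrame([("2015-07-30", "2015-09-15")], ["d1", "d2"])
      df.select(F.date_add(df.d1, 1),           # 2015-07-31
                F.date_sub(df.d1, 1),           # 2015-07-29
                F.add_months(df.d1, 1),         # 2015-08-30
                F.months_between(df.d2, df.d1)  # fractional months between the dates
                ).collect()
      ```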
    • [SPARK-8850] [SQL] Enable Unsafe mode by default · 520ec0ff
      Josh Rosen authored
      This pull request enables Unsafe mode by default in Spark SQL. In order to do this, we had to fix a number of small issues:
      
      **List of fixed blockers**:
      
      - [x] Make some default buffer sizes configurable so that HiveCompatibilitySuite can run properly (#7741).
      - [x] Memory leak on grouped aggregation of empty input (fixed by #7560)
      - [x] Update planner to also check whether codegen is enabled before planning unsafe operators.
      - [x] Investigate failing HiveThriftBinaryServerSuite test.  This turns out to be caused by a ClassCastException that occurs when Exchange tries to apply an interpreted RowOrdering to an UnsafeRow when range partitioning an RDD.  This could be fixed by #7408, but a shorter-term fix is to just skip the Unsafe exchange path when RangePartitioner is used.
      - [x] Memory leak exceptions masking exceptions that actually caused tasks to fail (will be fixed by #7603).
      - [x]  ~~https://issues.apache.org/jira/browse/SPARK-9162, to implement code generation for ScalaUDF.  This is necessary for `UDFSuite` to pass.  For now, I've just ignored this test in order to try to find other problems while we wait for a fix.~~ This is no longer necessary as of #7682.
      - [x] Memory leaks from Limit after UnsafeExternalSort cause the memory leak detector to fail tests. This is a huge problem in the HiveCompatibilitySuite (fixed by f4ac642a4e5b2a7931c5e04e086bb10e263b1db6).
      - [x] Tests in `AggregationQuerySuite` are failing due to NaN-handling issues in UnsafeRow, which were fixed in #7736.
      - [x] `org.apache.spark.sql.ColumnExpressionSuite.rand` needs to be updated so that the planner check also matches `TungstenProject`.
      - [x] After having lowered the buffer sizes to 4MB so that most of HiveCompatibilitySuite runs:
        - [x] Wrong answer in `join_1to1` (fixed by #7680)
        - [x] Wrong answer in `join_nulls` (fixed by #7680)
        - [x] Managed memory OOM / leak in `lateral_view`
        - [x] Seems to hang indefinitely in `partcols1`.  This might be a deadlock in script transformation or a bug in error-handling code? The hang was fixed by #7710.
        - [x] Error while freeing memory in `partcols1`: will be fixed by #7734.
      - [x] After fixing the `partcols1` hang, it appears that a number of later tests have issues as well.
      - [x] Fix thread-safety bug in codegen fallback expression evaluation (#7759).
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7564 from JoshRosen/unsafe-by-default and squashes the following commits:
      
      83c0c56 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-by-default
      f4cc859 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-by-default
      963f567 [Josh Rosen] Reduce buffer size for R tests
      d6986de [Josh Rosen] Lower page size in PySpark tests
      013b9da [Josh Rosen] Also match TungstenProject in checkNumProjects
      5d0b2d3 [Josh Rosen] Add task completion callback to avoid leak in limit after sort
      ea250da [Josh Rosen] Disable unsafe Exchange path when RangePartitioning is used
      715517b [Josh Rosen] Enable Unsafe by default
      520ec0ff
    • [MINOR] [MLLIB] fix doc for RegexTokenizer · 81464f2a
      Xiangrui Meng authored
      This is #7791 for Python. hhbyyh
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #7798 from mengxr/regex-tok-py and squashes the following commits:
      
      baa2dcd [Xiangrui Meng] fix doc for RegexTokenizer
      81464f2a
    • [SPARK-9116] [SQL] [PYSPARK] support Python only UDT in __main__ · e044705b
      Davies Liu authored
      This also makes it possible to create a Python UDT without having a Scala one, which is important for Python users.
      
      cc mengxr JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7453 from davies/class_in_main and squashes the following commits:
      
      4dfd5e1 [Davies Liu] add tests for Python and Scala UDT
      793d9b2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
      dc65f19 [Davies Liu] address comment
      a9a3c40 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
      a86e1fc [Davies Liu] fix serialization
      ad528ba [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
      63f52ef [Davies Liu] fix pylint check
      655b8a9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
      316a394 [Davies Liu] support Python UDT with UTF
      0bcb3ef [Davies Liu] fix bug in mllib
      de986d6 [Davies Liu] fix test
      83d65ac [Davies Liu] fix bug in StructType
      55bb86e [Davies Liu] support Python UDT in __main__ (without Scala one)
      e044705b
    • Fix reference to self.names in StructType · f5dd1133
      Alex Angelini authored
      `names` is not defined in this context, I think you meant `self.names`.
      
      davies
      
      Author: Alex Angelini <alex.louis.angelini@gmail.com>
      
      Closes #7766 from angelini/fix_struct_type_names and squashes the following commits:
      
      01543a1 [Alex Angelini] Fix reference to self.names in StructType
      f5dd1133
  4. Jul 29, 2015
    • [SPARK-9016] [ML] make random forest classifiers implement classification trait · 37c2d192
      Holden Karau authored
      Implement the classification trait for RandomForestClassifiers. The plan is to use this in the future to provide thresholding for RandomForestClassifiers (as well as other classifiers that implement that trait).
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #7432 from holdenk/SPARK-9016-make-random-forest-classifiers-implement-classification-trait and squashes the following commits:
      
      bf22fa6 [Holden Karau] Add missing imports for testing suite
      e948f0d [Holden Karau] Check the prediction generation from rawprediciton
      25320c3 [Holden Karau] Don't supply numClasses when not needed, assert model classes are as expected
      1a67e04 [Holden Karau] Use old decission tree stuff instead
      673e0c3 [Holden Karau] Merge branch 'master' into SPARK-9016-make-random-forest-classifiers-implement-classification-trait
      0d15b96 [Holden Karau] FIx typo
      5eafad4 [Holden Karau] add a constructor for rootnode + num classes
      fc6156f [Holden Karau] scala style fix
      2597915 [Holden Karau] take num classes in constructor
      3ccfe4a [Holden Karau] Merge in master, make pass numClasses through randomforest for training
      222a10b [Holden Karau] Increase numtrees to 3 in the python test since before the two were equal and the argmax was selecting the last one
      16aea1c [Holden Karau] Make tests match the new models
      b454a02 [Holden Karau] Make the Tree classifiers extends the Classifier base class
      77b4114 [Holden Karau] Import vectors lib
      37c2d192
  5. Jul 28, 2015
    • [SPARK-7105] [PYSPARK] [MLLIB] Support model save/load in GMM · 198d181d
      MechCoder authored
      This PR introduces save/load for GMMs in the Python API.

      I also refactored `GaussianMixtureModel` to inherit from `JavaModelWrapper`, with the underlying model being `GaussianMixtureModelWrapper`, a wrapper that provides convenience methods for `GaussianMixtureModel` (due to serialization and deserialization issues), and moved the creation of the Gaussians to the Scala backend.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #7617 from MechCoder/python_gmm_save_load and squashes the following commits:
      
      9c305aa [MechCoder] [SPARK-7105] [PySpark] [MLlib] Support model save/load in GMM
      198d181d
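      A minimal sketch of the resulting save/load round trip (hypothetical path; `data` would be an RDD of feature vectors):

      ```python
      from pyspark.mllib.clustering import GaussianMixture, GaussianMixtureModel

      model = GaussianMixture.train(data, k=2)
      model.save(sc, "/tmp/gmm")                             # hypothetical path
      sameModel = GaussianMixtureModel.load(sc, "/tmp/gmm")
      ```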
  6. Jul 25, 2015
    • [Spark-8668][SQL] Adding expr to functions · 723db13e
      JD authored
      Author: JD <jd@csh.rit.edu>
      Author: Joseph Batchik <josephbatchik@gmail.com>
      
      Closes #7606 from JDrit/expr and squashes the following commits:
      
      ad7f607 [Joseph Batchik] fixing python linter error
      9d6daea [Joseph Batchik] removed order by per @rxin's comment
      707d5c6 [Joseph Batchik] Added expr to fuctions.py
      79df83c [JD] added example to the docs
      b89eec8 [JD] moved function up as per @rxin's comment
      4960909 [JD] updated per @JoshRosen's comment
      2cb329c [JD] updated per @rxin's comment
      9a9ad0c [JD] removing unused import
      6dc26d0 [JD] removed split
      7f2222c [JD] Adding expr function as per SPARK-8668
      723db13e
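      A minimal sketch of `expr`, which parses a SQL expression string into a Column (hypothetical data):

      ```python
      from pyspark.sql.functions import expr

      df = sqlContext.createDataFrame([("Alice", 2)], ["name", "age"])
      df.select(expr("length(name)"), expr("age + 1")).collect()
      ```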
  7. Jul 24, 2015
    • [SPARK-9270] [PYSPARK] allow --name option in pyspark · 9a113961
      Cheolsoo Park authored
      This is a continuation of #7512, which added the `--name` option to spark-shell. This PR adds the same option to pyspark.
      
      Note that `--conf spark.app.name` on the command line has no effect in spark-shell and pyspark; instead, `--name` must be used. This is in fact an inconsistency with spark-sql, which doesn't accept the `--name` option while it accepts `--conf spark.app.name`. I am not fixing this inconsistency in this PR. IMO, one of `--name` and `--conf spark.app.name` is needed, not both. But since I cannot decide which to choose, I am not making any change here.
      
      Author: Cheolsoo Park <cheolsoop@netflix.com>
      
      Closes #7610 from piaozhexiu/SPARK-9270 and squashes the following commits:
      
      763e86d [Cheolsoo Park] Update windows script
      400b7f9 [Cheolsoo Park] Allow --name option to pyspark
      9a113961
  8. Jul 23, 2015
    • [SPARK-9122] [MLLIB] [PySpark] spark.mllib regression support batch predict · 52de3acc
      Yanbo Liang authored
      spark.mllib now supports batch predict for LinearRegressionModel, RidgeRegressionModel, and LassoModel.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #7614 from yanboliang/spark-9122 and squashes the following commits:
      
      4e610c0 [Yanbo Liang] spark.mllib regression support batch predict
      52de3acc
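      A short sketch of the batch form (assumes a hypothetical trained `model` and `testFeatures`, an RDD of feature vectors):

      ```python
      # Previously predict() was applied per vector; with this change the model
      # accepts an RDD directly and returns an RDD of predictions.
      predictions = model.predict(testFeatures)
      print(predictions.take(3))
      ```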
    • [SPARK-9069] [SPARK-9264] [SQL] remove unlimited precision support for DecimalType · 8a94eb23
      Davies Liu authored
      Remove Decimal.Unlimited (changed to support precision up to 38, to match Hive and other databases).

      In order to keep backward source compatibility, Decimal.Unlimited is still there, but it is changed to Decimal(38, 18).

      If no precision and scale are provided, it is Decimal(10, 0) as before.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7605 from davies/decimal_unlimited and squashes the following commits:
      
      aa3f115 [Davies Liu] fix tests and style
      fb0d20d [Davies Liu] address comments
      bfaae35 [Davies Liu] fix style
      df93657 [Davies Liu] address comments and clean up
      06727fd [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_unlimited
      4c28969 [Davies Liu] fix tests
      8d783cc [Davies Liu] fix tests
      788631c [Davies Liu] fix double with decimal in Union/except
      1779bde [Davies Liu] fix scala style
      c9c7c78 [Davies Liu] remove Decimal.Unlimited
      8a94eb23
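      A minimal sketch of the resulting type bounds:

      ```python
      from pyspark.sql.types import DecimalType

      DecimalType()        # precision 10, scale 0 -- the default described above
      DecimalType(38, 18)  # the maximum precision now supported
      ```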
    • [SPARK-9243] [Documentation] null -> zero in crosstab doc · ecfb3127
      Xiangrui Meng authored
      We forgot to update the doc. brkyvz
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #7608 from mengxr/SPARK-9243 and squashes the following commits:
      
      0ea3236 [Xiangrui Meng] null -> zero in crosstab doc
      ecfb3127
  9. Jul 22, 2015
    • [SPARK-9144] Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled · b217230f
      Josh Rosen authored
      Spark has an option called spark.localExecution.enabled; according to the docs:
      
      > Enables Spark to run certain jobs, such as first() or take() on the driver, without sending tasks to the cluster. This can make certain jobs execute very quickly, but may require shipping a whole partition of data to the driver.
      
      This feature ends up adding quite a bit of complexity to DAGScheduler, especially in the runLocallyWithinThread method, but as far as I know nobody uses this feature (I searched the mailing list and haven't seen any recent mentions of the configuration or stack traces including the runLocally method). As a step towards scheduler complexity reduction, I propose that we remove this feature and all code related to it for Spark 1.5.
      
      This pull request simply brings #7484 up to date.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7585 from rxin/remove-local-exec and squashes the following commits:
      
      84bd10e [Reynold Xin] Python fix.
      1d9739a [Reynold Xin] Merge pull request #7484 from JoshRosen/remove-localexecution
      eec39fa [Josh Rosen] Remove allowLocal(); deprecate user-facing uses of it.
      b0835dc [Josh Rosen] Remove local execution code in DAGScheduler
      8975d96 [Josh Rosen] Remove local execution tests.
      ffa8c9b [Josh Rosen] Remove documentation for configuration
      b217230f
    • [SPARK-9223] [PYSPARK] [MLLIB] Support model save/load in LDA · 5307c9d3
      MechCoder authored
      Since save / load has been merged in LDA, it takes no time to write the wrappers in Python as well.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #7587 from MechCoder/python_lda_save_load and squashes the following commits:
      
      c8e4ea7 [MechCoder] [SPARK-9223] [PySpark] Support model save/load in LDA
      5307c9d3
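      A minimal sketch of the resulting wrappers (hypothetical path; assumes a trained `model`):

      ```python
      from pyspark.mllib.clustering import LDAModel

      model.save(sc, "/tmp/lda")
      sameModel = LDAModel.load(sc, "/tmp/lda")
      ```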
    • [SPARK-9244] Increase some memory defaults · fe26584a
      Matei Zaharia authored
      There are a few memory limits that people hit often and that we could
      make higher, especially now that memory sizes have grown.
      
      - spark.akka.frameSize: This defaults to 10 but is often hit for map
        output statuses in large shuffles. This memory is not fully allocated
        up-front, so we can just make this larger and still not affect jobs
        that never sent a status that large. We increase it to 128.
      
      - spark.executor.memory: Defaults to 512m, which is really small. We
        increase it to 1g.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #7586 from mateiz/configs and squashes the following commits:
      
      ce0038a [Matei Zaharia] [SPARK-9244] Increase some memory defaults
      fe26584a
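      Jobs that depended on the old values can still pin them explicitly; a sketch:

      ```python
      from pyspark import SparkConf

      conf = (SparkConf()
              .set("spark.executor.memory", "512m")  # old default; new default is 1g
              .set("spark.akka.frameSize", "10"))    # old default (MB); new default is 128
      ```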
  10. Jul 21, 2015
    • [SPARK-8230][SQL] Add array/map size method · 560c658a
      Pedro Rodriguez authored
      Pull Request for: https://issues.apache.org/jira/browse/SPARK-8230
      
      Primary issue resolved is to implement array/map size for Spark SQL. Code is ready for review by a committer. Cheng Hao is on the JIRA ticket, but I don't know his username on GitHub; rxin is also on the JIRA ticket.
      
      Things to review:
      1. Where to put the added functions namespace-wise: they seem to be part of a few operations on collections, which include `sort_array` and `array_contains`. Hence the names `collectionOperations.scala` and `_collection_functions` in Python.
      2. In the Python code, should it be in a `1.5.0` function array or in a collections array?
      3. Are there any missing methods on the `Size` case class? Looks like many of these functions have generated Java code; is that also needed in this case?
      4. Something else?
      
      Author: Pedro Rodriguez <ski.rodriguez@gmail.com>
      Author: Pedro Rodriguez <prodriguez@trulia.com>
      
      Closes #7462 from EntilZha/SPARK-8230 and squashes the following commits:
      
      9a442ae [Pedro Rodriguez] fixed functions and sorted __all__
      9aea3bb [Pedro Rodriguez] removed imports from python docs
      15d4bf1 [Pedro Rodriguez] Added null test case and changed to nullSafeCodeGen
      d88247c [Pedro Rodriguez] removed python code
      bd5f0e4 [Pedro Rodriguez] removed duplicate function from rebase/merge
      59931b4 [Pedro Rodriguez] fixed compile bug instroduced when merging
      c187175 [Pedro Rodriguez] updated code to add size to __all__ directly and removed redundent pretty print
      130839f [Pedro Rodriguez] fixed failing test
      aa9bade [Pedro Rodriguez] fix style
      e093473 [Pedro Rodriguez] updated python code with docs, switched classes/traits implemented, added (failing) expression tests
      0449377 [Pedro Rodriguez] refactored code to use better abstract classes/traits and implementations
      9a1a2ff [Pedro Rodriguez] added unit tests for map size
      2bfbcb6 [Pedro Rodriguez] added unit test for size
      20df2b4 [Pedro Rodriguez] Finished working version of size function and added it to python
      b503e75 [Pedro Rodriguez] First attempt at implementing size for maps and arrays
      99a6a5c [Pedro Rodriguez] fixed failing test
      cac75ac [Pedro Rodriguez] fix style
      933d843 [Pedro Rodriguez] updated python code with docs, switched classes/traits implemented, added (failing) expression tests
      42bb7d4 [Pedro Rodriguez] refactored code to use better abstract classes/traits and implementations
      f9c3b8a [Pedro Rodriguez] added unit tests for map size
      2515d9f [Pedro Rodriguez] added documentation
      0e60541 [Pedro Rodriguez] added unit test for size
      acf9853 [Pedro Rodriguez] Finished working version of size function and added it to python
      84a5d38 [Pedro Rodriguez] First attempt at implementing size for maps and arrays
      560c658a
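      A minimal sketch of the new `size` function (hypothetical data):

      ```python
      from pyspark.sql import functions as F

      df = sqlContext.createDataFrame([([1, 2, 3],), ([1],)], ["a"])
      df.select(F.size(df.a)).collect()  # lengths 3 and 1; also works on map columns
      ```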
    • [SPARK-8255] [SPARK-8256] [SQL] Add regex_extract/regex_replace · 8c8f0ef5
      Cheng Hao authored
      Add expressions `regex_extract` & `regex_replace`
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #7468 from chenghao-intel/regexp and squashes the following commits:
      
      e5ea476 [Cheng Hao] minor update for documentation
      ef96fd6 [Cheng Hao] update the code gen
      72cf28f [Cheng Hao] Add more log for compilation error
      4e11381 [Cheng Hao] Add regexp_replace / regexp_extract support
      8c8f0ef5
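      A minimal sketch of both expressions (hypothetical data):

      ```python
      from pyspark.sql import functions as F

      df = sqlContext.createDataFrame([("100-200",)], ["s"])
      df.select(F.regexp_extract(df.s, r"(\d+)-(\d+)", 1).alias("first"),
                F.regexp_replace(df.s, "-", ":").alias("replaced")).collect()
      # [Row(first=u'100', replaced=u'100:200')]
      ```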
    • [SPARK-9100] [SQL] Adds DataFrame reader/writer shortcut methods for ORC · d38c5029
      Cheng Lian authored
      This PR adds DataFrame reader/writer shortcut methods for ORC in both Scala and Python.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7444 from liancheng/spark-9100 and squashes the following commits:
      
      284d043 [Cheng Lian] Fixes PySpark test cases and addresses PR comments
      e0b09fb [Cheng Lian] Adds DataFrame reader/writer shortcut methods for ORC
      d38c5029
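      A sketch of the shortcuts (hypothetical path; ORC support in this era generally requires Hive support, so assume a `HiveContext` as `sqlContext`):

      ```python
      df.write.orc("/tmp/orc_output")                  # shortcut for .format("orc").save(...)
      loaded = sqlContext.read.orc("/tmp/orc_output")  # shortcut for .format("orc").load(...)
      ```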
  11. Jul 20, 2015
    • [SPARK-9198] [MLLIB] [PYTHON] Fixed typo in pyspark sparsevector doc tests · a5d05819
      Joseph K. Bradley authored
      Several places in the PySpark SparseVector docs have one defined as:
      ```
      SparseVector(4, [2, 4], [1.0, 2.0])
      ```
      The index 4 goes out of bounds (but this is not checked).
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #7541 from jkbradley/sparsevec-doc-typo-fix and squashes the following commits:
      
      c806a65 [Joseph K. Bradley] fixed doc test
      e2dcb23 [Joseph K. Bradley] Fixed typo in pyspark sparsevector doc tests
      a5d05819
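      For reference, a corrected form of the doctest (a size-4 vector has valid indices 0..3):

      ```python
      from pyspark.mllib.linalg import SparseVector

      SparseVector(4, [1, 3], [1.0, 2.0])
      ```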
    • [SPARK-9114] [SQL] [PySpark] convert returned object from UDF into internal type · 9f913c4f
      Davies Liu authored
      This PR also removes the duplicated code between registerFunction and UserDefinedFunction.
      
      cc JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7450 from davies/fix_return_type and squashes the following commits:
      
      e80bf9f [Davies Liu] remove debugging code
      f94b1f6 [Davies Liu] fix mima
      8f9c58b [Davies Liu] convert returned object from UDF into internal type
      9f913c4f
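      A minimal sketch of the behavior being fixed: the declared return type drives the conversion of whatever the Python function returns (hypothetical data):

      ```python
      from pyspark.sql.functions import udf
      from pyspark.sql.types import IntegerType

      slen = udf(lambda s: len(s), IntegerType())
      df = sqlContext.createDataFrame([("Alice",)], ["name"])
      df.select(slen(df.name)).collect()
      ```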
    • [SPARK-9101] [PySpark] Add missing NullType · 02181fb6
      Mateusz Buśkiewicz authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9101
      
      Author: Mateusz Buśkiewicz <mateusz.buskiewicz@getbase.com>
      
      Closes #7499 from sixers/spark-9101 and squashes the following commits:
      
      dd75aa6 [Mateusz Buśkiewicz] [SPARK-9101] [PySpark] Test for selecting null literal
      97e3f2f [Mateusz Buśkiewicz] [SPARK-9101] [PySpark] Add missing NullType to _atomic_types in pyspark.sql.types
      02181fb6
    • [SPARK-8996] [MLLIB] [PYSPARK] Python API for Kolmogorov-Smirnov Test · d0b4e93f
      MechCoder authored
      Python API for the KS-test
      
      Statistics.kolmogorovSmirnovTest(data, distName, *params)
      I'm not quite sure how to support the callable function since it is not serializable.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #7430 from MechCoder/spark-8996 and squashes the following commits:
      
      2dd009d [MechCoder] minor
      021d233 [MechCoder] Remove one wrapper and other minor stuff
      49d07ab [MechCoder] [SPARK-8996] [MLlib] Python API for Kolmogorov-Smirnov Test
      d0b4e93f
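      A minimal sketch of the API described above (hypothetical sample; "norm" with mean 0.0 and stddev 1.0):

      ```python
      from pyspark.mllib.stat import Statistics

      data = sc.parallelize([0.1, 0.15, 0.2, 0.3, 0.25])
      result = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0)
      print(result)  # test statistic, p-value, and null-hypothesis summary
      ```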
  12. Jul 19, 2015
    • [SPARK-9021] [PYSPARK] Change RDD.aggregate() to do reduce(mapPartitions())... · a803ac3e
      Nicholas Hwang authored
      [SPARK-9021] [PYSPARK] Change RDD.aggregate() to do reduce(mapPartitions()) instead of mapPartitions.fold()
      
      I'm relatively new to Spark and functional programming, so forgive me if this pull request is just a result of my misunderstanding of how Spark should be used.
      
      Currently, if one happens to use a mutable object as `zeroValue` for `RDD.aggregate()`, possibly unexpected behavior can occur.
      
      This is because pyspark's current implementation of `RDD.aggregate()` does not serialize or make a copy of `zeroValue` before handing it off to `RDD.mapPartitions(...).fold(...)`. This results in a single reference to `zeroValue` being used for both `RDD.mapPartitions()` and `RDD.fold()` on each partition. This can result in strange accumulator values being fed into each partition's call to `RDD.fold()`, as the `zeroValue` may have been changed in-place during the `RDD.mapPartitions()` call.
      
      As an illustrative example, submit the following to `spark-submit`:
      ```
      from pyspark import SparkConf, SparkContext
      import collections
      
      def updateCounter(acc, val):
          print 'update acc:', acc
          print 'update val:', val
          acc[val] += 1
          return acc
      
      def comboCounter(acc1, acc2):
          print 'combo acc1:', acc1
          print 'combo acc2:', acc2
          acc1.update(acc2)
          return acc1
      
      def main():
          conf = SparkConf().setMaster("local").setAppName("Aggregate with Counter")
          sc = SparkContext(conf = conf)
      
          print '======= AGGREGATING with ONE PARTITION ======='
          print sc.parallelize(range(1,10), 1).aggregate(collections.Counter(), updateCounter, comboCounter)
      
          print '======= AGGREGATING with TWO PARTITIONS ======='
          print sc.parallelize(range(1,10), 2).aggregate(collections.Counter(), updateCounter, comboCounter)
      
      if __name__ == "__main__":
          main()
      ```
      
      One probably expects this to output the following:
      ```
      Counter({1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1})
      ```
      
      But it instead outputs this (regardless of the number of partitions):
      ```
      Counter({1: 2, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2})
      ```
      
      This is because (I believe) `zeroValue` gets passed correctly to each partition, but after `RDD.mapPartitions()` completes, the `zeroValue` object has been updated and is then passed to `RDD.fold()`, which results in all items being double-counted within each partition before being finally reduced at the calling node.
      
      I realize that this type of calculation is typically done by `RDD.mapPartitions(...).reduceByKey(...)`, but hopefully this illustrates some potentially confusing behavior. I also noticed that other `RDD` methods use this `deepcopy` approach to creating unique copies of `zeroValue` (i.e., `RDD.aggregateByKey()` and `RDD.foldByKey()`), and that the Scala implementations do seem to serialize the `zeroValue` object appropriately to prevent this type of behavior.
      
      Author: Nicholas Hwang <moogling@gmail.com>
      
      Closes #7378 from njhwang/master and squashes the following commits:
      
      659bb27 [Nicholas Hwang] Fixed RDD.aggregate() to perform a reduce operation on collected mapPartitions results, similar to how fold currently is implemented. This prevents an initial combOp being performed on each partition with zeroValue (which leads to unexpected behavior if zeroValue is a mutable object) before being combOp'ed with other partition results.
      8d8d694 [Nicholas Hwang] Changed dict construction to be compatible with Python 2.6 (cannot use list comprehensions to make dicts)
      56eb2ab [Nicholas Hwang] Fixed whitespace after colon to conform with PEP8
      391de4a [Nicholas Hwang] Removed used of collections.Counter from RDD tests for Python 2.6 compatibility; used defaultdict(int) instead. Merged treeAggregate test with mutable zero value into aggregate test to reduce code duplication.
      2fa4e4b [Nicholas Hwang] Merge branch 'master' of https://github.com/njhwang/spark
      ba528bd [Nicholas Hwang] Updated comments regarding protection of zeroValue from mutation in RDD.aggregate(). Added regression tests for aggregate(), fold(), aggregateByKey(), foldByKey(), and treeAggregate(), all with both 1 and 2 partition RDDs. Confirmed that aggregate() is the only problematic implementation as of commit 257236c3. Also replaced some parallelizations of ranges with xranges, per the documentation's recommendations of preferring xrange over range.
      7820391 [Nicholas Hwang] Updated comments regarding protection of zeroValue from mutation in RDD.aggregate(). Added regression tests for aggregate(), fold(), aggregateByKey(), foldByKey(), and treeAggregate(), all with both 1 and 2 partition RDDs. Confirmed that aggregate() is the only problematic implementation as of commit 257236c3.
      90d1544 [Nicholas Hwang] Made sure RDD.aggregate() makes a deepcopy of zeroValue for all partitions; this ensures that the mapPartitions call works with unique copies of zeroValue in each partition, and prevents a single reference to zeroValue being used for both map and fold calls on each partition (resulting in possibly unexpected behavior).
      a803ac3e
    • [SQL] Make date/time functions more consistent with other database systems. · 3427937e
      Reynold Xin authored
      This pull request fixes some of the problems in #6981.
      
      - Added date functions to `__all__` so they get exposed
      - Rename day_of_month -> dayofmonth
      - Rename day_in_year -> dayofyear
      - Rename week_of_year -> weekofyear
      - Removed "day" from Scala/Python API since it is ambiguous. Only leaving the alias in SQL.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Reynold Xin <rxin@databricks.com>
      
      Closes #7506 from rxin/datetime and squashes the following commits:
      
      0cb24d9 [Reynold Xin] Export all functions in Python.
      e44a4a0 [Reynold Xin] Removed day function from Scala and Python.
      9c08fdc [Reynold Xin] [SQL] Make date/time functions more consistent with other database systems.
      3427937e
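      A minimal sketch using the renamed functions (hypothetical data):

      ```python
      from pyspark.sql import functions as F

      df = sqlContext.createDataFrame([("2015-07-19",)], ["d"])
      df.select(F.dayofmonth(df.d), F.dayofyear(df.d), F.weekofyear(df.d)).collect()
      # 19, 200, and 29 for this date
      ```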
    • [SPARK-9166][SQL][PYSPARK] Capture and hide IllegalArgumentException in Python API · 9b644c41
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9166
      
      Simply capture and hide `IllegalArgumentException` in Python API.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #7497 from viirya/hide_illegalargument and squashes the following commits:
      
      8324dce [Liang-Chi Hsieh] Fix python style.
      9ace67d [Liang-Chi Hsieh] Also check exception message.
      8b2ce5c [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into hide_illegalargument
      7be016a [Liang-Chi Hsieh] Capture and hide IllegalArgumentException in Python.
      9b644c41
    • [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-8182][SPARK-8181][SPARK-8180][SPARK... · 83b682be
      Tarek Auel authored
      [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-8182][SPARK-8181][SPARK-8180][SPARK-8179][SPARK-8177][SPARK-8178][SPARK-9115][SQL] date functions
      
      Jira:
      https://issues.apache.org/jira/browse/SPARK-8199
      https://issues.apache.org/jira/browse/SPARK-8184
      https://issues.apache.org/jira/browse/SPARK-8183
      https://issues.apache.org/jira/browse/SPARK-8182
      https://issues.apache.org/jira/browse/SPARK-8181
      https://issues.apache.org/jira/browse/SPARK-8180
      https://issues.apache.org/jira/browse/SPARK-8179
      https://issues.apache.org/jira/browse/SPARK-8177
      https://issues.apache.org/jira/browse/SPARK-8179
      https://issues.apache.org/jira/browse/SPARK-9115
      
      Regarding `day` and `dayofmonth`: are both necessary?
      
      ~~I am going to add `Quarter` to this PR as well.~~ Done.
      
      ~~As soon as the Scala coding is reviewed and discussed, I'll add the python api.~~ Done
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      Author: Tarek Auel <tarek.auel@gmail.com>
      
      Closes #6981 from tarekauel/SPARK-8199 and squashes the following commits:
      
      f7b4c8c [Tarek Auel] [SPARK-8199] fixed bug in tests
      bb567b6 [Tarek Auel] [SPARK-8199] fixed test
      3e095ba [Tarek Auel] [SPARK-8199] style and timezone fix
      256c357 [Tarek Auel] [SPARK-8199] code cleanup
      5983dcc [Tarek Auel] [SPARK-8199] whitespace fix
      6e0c78f [Tarek Auel] [SPARK-8199] removed setTimeZone in tests, according to cloud-fans comment in #7488
      4afc09c [Tarek Auel] [SPARK-8199] concise leap year handling
      ea6c110 [Tarek Auel] [SPARK-8199] fix after merging master
      70238e0 [Tarek Auel] Merge branch 'master' into SPARK-8199
      3c6ae2e [Tarek Auel] [SPARK-8199] removed binary search
      fb98ba0 [Tarek Auel] [SPARK-8199] python docstring fix
      cdfae27 [Tarek Auel] [SPARK-8199] cleanup & python docstring fix
      746b80a [Tarek Auel] [SPARK-8199] build fix
      0ad6db8 [Tarek Auel] [SPARK-8199] minor fix
      523542d [Tarek Auel] [SPARK-8199] address comments
      2259299 [Tarek Auel] [SPARK-8199] day_of_month alias
      d01b977 [Tarek Auel] [SPARK-8199] python underscore
      56c4a92 [Tarek Auel] [SPARK-8199] update python docu
      e223bc0 [Tarek Auel] [SPARK-8199] refactoring
      d6aa14e [Tarek Auel] [SPARK-8199] fixed Hive compatibility
      b382267 [Tarek Auel] [SPARK-8199] fixed bug in day calculation; removed set TimeZone in HiveCompatibilitySuite for test purposes; removed Hive tests for second and minute, because we can cast '2015-03-18' to a timestamp and extract a minute/second from it
      1b2e540 [Tarek Auel] [SPARK-8119] style fix
      0852655 [Tarek Auel] [SPARK-8119] changed from ExpectsInputTypes to implicit casts
      ec87c69 [Tarek Auel] [SPARK-8119] bug fixing and refactoring
      1358cdc [Tarek Auel] Merge remote-tracking branch 'origin/master' into SPARK-8199
      740af0e [Tarek Auel] implement date function using a calculation based on days
      4fb66da [Tarek Auel] WIP: date functions on calculation only
      1a436c9 [Tarek Auel] wip
      f775f39 [Tarek Auel] fixed return type
      ad17e96 [Tarek Auel] improved implementation
      c42b444 [Tarek Auel] Removed merge conflict file
      ccb723c [Tarek Auel] [SPARK-8199] style and fixed merge issues
      10e4ad1 [Tarek Auel] Merge branch 'master' into date-functions-fast
      7d9f0eb [Tarek Auel] [SPARK-8199] git renaming issue
      f3e7a9f [Tarek Auel] [SPARK-8199] revert change in DataFrameFunctionsSuite
      6f5d95c [Tarek Auel] [SPARK-8199] fixed year interval
      d9f8ac3 [Tarek Auel] [SPARK-8199] implement fast track
      7bc9d93 [Tarek Auel] Merge branch 'master' into SPARK-8199
      5a105d9 [Tarek Auel] [SPARK-8199] rebase after #6985 got merged
      eb6760d [Tarek Auel] Merge branch 'master' into SPARK-8199
      f120415 [Tarek Auel] improved runtime
      a8edebd [Tarek Auel] use Calendar instead of SimpleDateFormat
      5fe74e1 [Tarek Auel] fixed python style
      3bfac90 [Tarek Auel] fixed style
      356df78 [Tarek Auel] rely on cast mechanism of Spark. Simplified implementation
      02efc5d [Tarek Auel] removed doubled code
      a5ea120 [Tarek Auel] added python api; changed test to be more meaningful
      b680db6 [Tarek Auel] added codegeneration to all functions
      c739788 [Tarek Auel] added support for quarter SPARK-8178
      849fb41 [Tarek Auel] fixed stupid test
      638596f [Tarek Auel] improved codegen
      4d8049b [Tarek Auel] fixed tests and added type check
      5ebb235 [Tarek Auel] resolved naming conflict
      d0e2f99 [Tarek Auel] date functions
      83b682be
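      A minimal sketch of the field-extraction functions this series adds (hypothetical data; strings are implicitly cast to timestamps, per the commits above):

      ```python
      from pyspark.sql import functions as F

      df = sqlContext.createDataFrame([("2015-07-19 10:30:45",)], ["t"])
      df.select(F.year(df.t), F.quarter(df.t), F.month(df.t),
                F.hour(df.t), F.minute(df.t), F.second(df.t)).collect()
      ```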
  13. Jul 17, 2015
    • [SPARK-7879] [MLLIB] KMeans API for spark.ml Pipelines · 34a889db
      Yu ISHIKAWA authored
      I implemented the KMeans API for spark.ml Pipelines, but it doesn't include clustering abstractions for spark.ml (SPARK-7610); that would fit better in another issue, and I'll try it later, since we are trying to add the hierarchical clustering algorithms in another issue. Thanks.
      
      [SPARK-7879] KMeans API for spark.ml Pipelines - ASF JIRA https://issues.apache.org/jira/browse/SPARK-7879
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #6756 from yu-iskw/SPARK-7879 and squashes the following commits:
      
      be752de [Yu ISHIKAWA] Add assertions
      a14939b [Yu ISHIKAWA] Fix the dashed line's length in pyspark.ml.rst
      4c61693 [Yu ISHIKAWA] Remove the test about whether "features" and "prediction" columns exist or not in Python
      fb2417c [Yu ISHIKAWA] Use getInt, instead of get
      f397be4 [Yu ISHIKAWA] Switch the comparisons.
      ca78b7d [Yu ISHIKAWA] Add the Scala docs about the constraints of each parameter.
      effc650 [Yu ISHIKAWA] Using expertSetParam and expertGetParam
      c8dc6e6 [Yu ISHIKAWA] Remove an unnecessary test
      19a9d63 [Yu ISHIKAWA] Include spark.ml.clustering to python tests
      1abb19c [Yu ISHIKAWA] Add the statements about spark.ml.clustering into pyspark.ml.rst
      f8338bc [Yu ISHIKAWA] Add the placeholders in Python
      4a03003 [Yu ISHIKAWA] Test for contains in Python
      6566c8b [Yu ISHIKAWA] Use `get`, instead of `apply`
      288e8d5 [Yu ISHIKAWA] Using `contains` to check the column names
      5a7d574 [Yu ISHIKAWA] Renamce `validateInitializationMode` to `validateInitMode` and remove throwing exception
      97cfae3 [Yu ISHIKAWA] Fix the type of return value of `KMeans.copy`
      e933723 [Yu ISHIKAWA] Remove the default value of seed from the Model class
      978ee2c [Yu ISHIKAWA] Modify the docs of KMeans, according to mllib's KMeans
      2ec80bc [Yu ISHIKAWA] Fit on 1 line
      e186be1 [Yu ISHIKAWA] Make a few variables, setters and getters be expert ones
      b2c205c [Yu ISHIKAWA] Rename the method `getInitializationSteps` to `getInitSteps` and `setInitializationSteps` to `setInitSteps` in Scala and Python
      f43f5b4 [Yu ISHIKAWA] Rename the method `getInitializationMode` to `getInitMode` and `setInitializationMode` to `setInitMode` in Scala and Python
      3cb5ba4 [Yu ISHIKAWA] Modify the description about epsilon and the validation
      4fa409b [Yu ISHIKAWA] Add a comment about the default value of epsilon
      2f392e1 [Yu ISHIKAWA] Make some variables `final` and Use `IntParam` and `DoubleParam`
      19326f8 [Yu ISHIKAWA] Use `udf`, instead of callUDF
      4d2ad1e [Yu ISHIKAWA] Modify the indentations
      0ae422f [Yu ISHIKAWA] Add a test for `setParams`
      4ff7913 [Yu ISHIKAWA] Add "ml.clustering" to `javacOptions` in SparkBuild.scala
      11ffdf1 [Yu ISHIKAWA] Use `===` and the variable
      220a176 [Yu ISHIKAWA] Set a random seed in the unit testing
      92c3efc [Yu ISHIKAWA] Make the points for a test be fewer
      c758692 [Yu ISHIKAWA] Modify the parameters of KMeans in Python
      6aca147 [Yu ISHIKAWA] Add some unit testings to validate the setter methods
      687cacc [Yu ISHIKAWA] Alias mllib.KMeans as MLlibKMeans in KMeansSuite.scala
      a4dfbef [Yu ISHIKAWA] Modify the last brace and indentations
      5bedc51 [Yu ISHIKAWA] Remve an extra new line
      444c289 [Yu ISHIKAWA] Add the validation for `runs`
      e41989c [Yu ISHIKAWA] Modify how to validate `initStep`
      7ea133a [Yu ISHIKAWA] Change how to validate `initMode`
      7991e15 [Yu ISHIKAWA] Add a validation for `k`
      c2df35d [Yu ISHIKAWA] Make `predict` private
      93aa2ff [Yu ISHIKAWA] Use `withColumn` in `transform`
      d3a79f7 [Yu ISHIKAWA] Remove the inhefited docs
      e9532e1 [Yu ISHIKAWA] make `parentModel` of KMeansModel private
      8559772 [Yu ISHIKAWA] Remove the `paramMap` parameter of KMeans
      6684850 [Yu ISHIKAWA] Rename `initializationSteps` to `initSteps`
      99b1b96 [Yu ISHIKAWA] Rename `initializationMode` to `initMode`
      79ea82b [Yu ISHIKAWA] Modify the parameters of KMeans docs
      6569bcd [Yu ISHIKAWA] Change how to set the default values with `setDefault`
      20a795a [Yu ISHIKAWA] Change how to set the default values with `setDefault`
      11c2a12 [Yu ISHIKAWA] Limit the imports
      badb481 [Yu ISHIKAWA] Alias spark.mllib.{KMeans, KMeansModel}
      f80319a [Yu ISHIKAWA] Rebase mater branch and add copy methods
      85d92b1 [Yu ISHIKAWA] Add `KMeans.setPredictionCol`
      aa9469d [Yu ISHIKAWA] Fix a python test suite error caused by python 3.x
      c2d6bcb [Yu ISHIKAWA] ADD Java test suites of the KMeans API for spark.ml Pipeline
      598ed2e [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Python
      63ad785 [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Scala
      34a889db
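      A minimal sketch of the new Pipelines API (hypothetical data; note the shortened `initMode`/`initSteps` parameter names adopted in the commits above):

      ```python
      from pyspark.ml.clustering import KMeans
      from pyspark.mllib.linalg import Vectors

      df = sqlContext.createDataFrame([(Vectors.dense([0.0, 0.0]),),
                                       (Vectors.dense([10.0, 10.0]),)], ["features"])
      model = KMeans(k=2, seed=1).fit(df)
      model.transform(df).select("features", "prediction").show()
      ```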