  1. Nov 28, 2016
  2. Sep 19, 2016
    • sureshthalamati's avatar
      [SPARK-17473][SQL] fixing docker integration tests error due to different versions of jars. · cdea1d13
      sureshthalamati authored
      ## What changes were proposed in this pull request?
      Docker tests are using an older version of the Jersey jars (1.19), which was used in older releases of Spark. In the 2.0 releases Spark was upgraded to the 2.x version of Jersey, and after the upgrade the docker tests fail with AbstractMethodError. Now that Spark is on the 2.x Jersey version, the shaded docker jars may no longer be required. Removed the exclusions/overrides of Jersey-related classes from the pom file, and changed docker-client to use the regular jar instead of the shaded one.
      
      ## How was this patch tested?
      
      Tested using the existing docker-integration-tests.
      
      Author: sureshthalamati <suresh.thalamati@gmail.com>
      
      Closes #15114 from sureshthalamati/docker_testfix-spark-17473.
      cdea1d13
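The change described above amounts to a small pom edit. A hedged sketch of what it looks like (the coordinates are those of Spotify's docker-client; the exact version and surrounding configuration in Spark's pom may differ):

```xml
<!-- Illustrative sketch only: depend on the regular docker-client jar
     rather than the shaded artifact, now that Spark itself is on Jersey 2.x.
     Previously a shaded classifier was used to hide the Jersey 1.x conflict. -->
<dependency>
  <groupId>com.spotify</groupId>
  <artifactId>docker-client</artifactId>
  <!-- removed: <classifier>shaded</classifier> -->
  <scope>test</scope>
</dependency>
```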
  3. Jul 11, 2016
    • Reynold Xin's avatar
      [SPARK-16477] Bump master version to 2.1.0-SNAPSHOT · ffcb6e05
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      After SPARK-16476 (committed earlier today as #14128), we can finally bump the version number.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14130 from rxin/SPARK-16477.
      ffcb6e05
  4. May 17, 2016
  5. May 15, 2016
    • Sean Owen's avatar
      [SPARK-12972][CORE] Update org.apache.httpcomponents.httpclient · f5576a05
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      (Retry of https://github.com/apache/spark/pull/13049)
      
      - update to httpclient 4.5 / httpcore 4.4
      - remove some defunct exclusions
      - manage httpmime version to match
      - update selenium / httpunit to support 4.5 (possible now that Jetty 9 is used)
      
      ## How was this patch tested?
      
      Jenkins tests. Also, locally running the same test command of one Jenkins profile that failed: `mvn -Phadoop-2.6 -Pyarn -Phive -Phive-thriftserver -Pkinesis-asl ...`
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #13117 from srowen/SPARK-12972.2.
      f5576a05
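The "manage httpmime version to match" bullet above can be sketched as a `dependencyManagement` entry that pins the three httpcomponents artifacts together. The versions are the ones named in the commit; the layout is illustrative, not a copy of Spark's actual parent pom:

```xml
<!-- Sketch: manage the httpcomponents versions in one place so that
     httpclient, httpcore, and httpmime cannot drift apart. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>4.5</version>
    </dependency>
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpcore</artifactId>
      <version>4.4</version>
    </dependency>
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpmime</artifactId>
      <version>4.5</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```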
  6. May 13, 2016
  7. Apr 28, 2016
  8. Apr 18, 2016
    • Luciano Resende's avatar
      [SPARK-14504][SQL] Enable Oracle docker tests · 68450c8c
      Luciano Resende authored
      ## What changes were proposed in this pull request?
      
      Enable Oracle docker tests
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Luciano Resende <lresende@apache.org>
      
      Closes #12270 from lresende/oracle.
      68450c8c
  9. Apr 11, 2016
  10. Mar 09, 2016
    • Sean Owen's avatar
      [SPARK-13595][BUILD] Move docker, extras modules into external · 256704c7
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Move `docker` dirs out of top level into `external/`; move `extras/*` into `external/`
      
      ## How was this patch tested?
      
      This is tested with Jenkins tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11523 from srowen/SPARK-13595.
      256704c7
  11. Feb 26, 2016
    • thomastechs's avatar
      [SPARK-12941][SQL][MASTER] Spark-SQL JDBC Oracle dialect fails to map string... · 8afe4914
      thomastechs authored
      [SPARK-12941][SQL][MASTER] Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR datatype
      
      ## What changes were proposed in this pull request?
      
      This pull request implements the fix for SPARK-12941, creating a data type mapping to Oracle for the corresponding DataFrame data type StringType. This PR is the fix for the master branch, whereas another PR has already been tested with branch 1.4.
      
      ## How was this patch tested?
      
      This patch was tested using the Oracle Docker image; a new integration suite was created for it. The oracle.jdbc jar was supposed to be downloaded from the Maven repository, but since no JDBC jar is available there, it was downloaded manually from the Oracle site and installed in the local repository, and the tests were run against that. So, for a SparkQA test run, the ojdbc jar might need to be manually placed in the local Maven repository (com/oracle/ojdbc6/11.2.0.2.0).
      
      Author: thomastechs <thomas.sebastian@tcs.com>
      
      Closes #11306 from thomastechs/master.
      8afe4914
  12. Jan 30, 2016
    • Josh Rosen's avatar
      [SPARK-6363][BUILD] Make Scala 2.11 the default Scala version · 289373b2
      Josh Rosen authored
      This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds).
      
      The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2X faster for the Maven compile-only builds, for instance).
      
      After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10608 from JoshRosen/SPARK-6363.
      289373b2
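As noted above, Scala 2.10 remains buildable by following the "Building Spark" page. A command-fragment sketch of what that looked like in this era (script name and flag are assumed from those instructions; run inside a Spark source checkout):

```shell
# Switch the poms over to Scala 2.10, then build with the 2.10 profile.
./dev/change-scala-version.sh 2.10
./build/mvn -Dscala-2.10 -DskipTests clean package
```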
  13. Dec 19, 2015
  14. Dec 09, 2015
  15. Nov 10, 2015
    • Josh Rosen's avatar
      [SPARK-9818] Re-enable Docker tests for JDBC data source · 1dde39d7
      Josh Rosen authored
      This patch re-enables tests for the Docker JDBC data source. These tests were reverted in #4872 due to transitive dependency conflicts introduced by the `docker-client` library. This patch should avoid those problems by using a version of `docker-client` which shades its transitive dependencies and by performing some build-magic to work around problems with that shaded JAR.
      
      In addition, I significantly refactored the tests to simplify the setup and teardown code and to fix several Docker networking issues which caused problems when running in `boot2docker`.
      
      Closes #8101.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #9503 from JoshRosen/docker-jdbc-tests.
      1dde39d7
  16. Oct 07, 2015
  17. Sep 15, 2015
  18. Jun 28, 2015
    • Josh Rosen's avatar
      [SPARK-8683] [BUILD] Depend on mockito-core instead of mockito-all · f5100451
      Josh Rosen authored
      Spark's tests currently depend on `mockito-all`, which bundles Hamcrest and Objenesis classes. Instead, it should depend on `mockito-core`, which declares those libraries as Maven dependencies. This is necessary in order to fix a dependency conflict that leads to a NoSuchMethodError when using certain Hamcrest matchers.
      
      See https://github.com/mockito/mockito/wiki/Declaring-mockito-dependency for more details.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7061 from JoshRosen/mockito-core-instead-of-all and squashes the following commits:
      
      70eccbe [Josh Rosen] Depend on mockito-core instead of mockito-all.
      f5100451
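The swap described above is essentially a one-line pom change; a sketch (version omitted, since it would be managed in the parent pom):

```xml
<!-- Sketch: mockito-core declares Hamcrest and Objenesis as ordinary Maven
     dependencies instead of bundling their classes, which lets Maven resolve
     a single consistent Hamcrest version. -->
<dependency>
  <groupId>org.mockito</groupId>
  <artifactId>mockito-core</artifactId>
  <!-- was: <artifactId>mockito-all</artifactId> -->
  <scope>test</scope>
</dependency>
```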
  19. Jun 03, 2015
    • Patrick Wendell's avatar
      [SPARK-7801] [BUILD] Updating versions to SPARK 1.5.0 · 2c4d550e
      Patrick Wendell authored
      Author: Patrick Wendell <patrick@databricks.com>
      
      Closes #6328 from pwendell/spark-1.5-update and squashes the following commits:
      
      2f42d02 [Patrick Wendell] A few more excludes
      4bebcf0 [Patrick Wendell] Update to RC4
      61aaf46 [Patrick Wendell] Using new release candidate
      55f1610 [Patrick Wendell] Another exclude
      04b4f04 [Patrick Wendell] More issues with transient 1.4 changes
      36f549b [Patrick Wendell] [SPARK-7801] [BUILD] Updating versions to SPARK 1.5.0
      2c4d550e
  20. May 29, 2015
    • Andrew Or's avatar
      [HOT FIX] [BUILD] Fix maven build failures · a4f24123
      Andrew Or authored
      This patch fixes a build break in maven caused by #6441.
      
      Note that this patch reverts the changes in flume-sink because
      this module does not currently depend on Spark core, but the
      tests require it. There is not an easy way to make this work
      because mvn test dependencies are not transitive (MNG-1378).
      
      For now, we will leave the one test suite in flume-sink out
      until we figure out a better solution. This patch is mainly
      intended to unbreak the maven build.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6511 from andrewor14/fix-build-mvn and squashes the following commits:
      
      3d53643 [Andrew Or] [HOT FIX #6441] Fix maven build failures
      a4f24123
  21. May 12, 2015
    • Marcelo Vanzin's avatar
      [SPARK-7485] [BUILD] Remove pyspark files from assembly. · 82e890fb
      Marcelo Vanzin authored
      The sbt part of the build is hacky; it basically tricks sbt
      into generating the zip by using a generator, but returns
      an empty list for the generated files so that nothing is
      actually added to the assembly.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #6022 from vanzin/SPARK-7485 and squashes the following commits:
      
      22c1e04 [Marcelo Vanzin] Remove unneeded code.
      4893622 [Marcelo Vanzin] [SPARK-7485] [build] Remove pyspark files from assembly.
      82e890fb
  22. Apr 30, 2015
    • Vincenzo Selvaggio's avatar
      [SPARK-1406] Mllib pmml model export · 254e0509
      Vincenzo Selvaggio authored
      See PDF attached to the JIRA issue 1406.
      
      The contribution is my original work and I license the work to the project under the project's open source license.
      
      Author: Vincenzo Selvaggio <vselvaggio@hotmail.it>
      Author: Xiangrui Meng <meng@databricks.com>
      Author: selvinsource <vselvaggio@hotmail.it>
      
      Closes #3062 from selvinsource/mllib_pmml_model_export_SPARK-1406 and squashes the following commits:
      
      852aac6 [Vincenzo Selvaggio] [SPARK-1406] Update JPMML version to 1.1.15 in LICENSE file
      085cf42 [Vincenzo Selvaggio] [SPARK-1406] Added Double Min and Max Fixed scala style
      30165c4 [Vincenzo Selvaggio] [SPARK-1406] Fixed extreme cases for logit
      7a5e0ec [Vincenzo Selvaggio] [SPARK-1406] Binary classification for SVM and Logistic Regression
      cfcb596 [Vincenzo Selvaggio] [SPARK-1406] Throw IllegalArgumentException when exporting a multinomial logistic regression
      25dce33 [Vincenzo Selvaggio] [SPARK-1406] Update code to latest pmml model
      dea98ca [Vincenzo Selvaggio] [SPARK-1406] Exclude transitive dependency for pmml model
      66b7c12 [Vincenzo Selvaggio] [SPARK-1406] Updated pmml model lib to 1.1.15, latest Java 6 compatible
      a0a55f7 [Vincenzo Selvaggio] Merge pull request #2 from mengxr/SPARK-1406
      3c22f79 [Xiangrui Meng] more code style
      e2313df [Vincenzo Selvaggio] Merge pull request #1 from mengxr/SPARK-1406
      472d757 [Xiangrui Meng] fix code style
      1676e15 [Vincenzo Selvaggio] fixed scala issue
      e2ffae8 [Vincenzo Selvaggio] fixed scala style
      b8823b0 [Vincenzo Selvaggio] Merge remote-tracking branch 'upstream/master' into mllib_pmml_model_export_SPARK-1406
      b25bbf7 [Vincenzo Selvaggio] [SPARK-1406] Added export of pmml to distributed file system using the spark context
      7a949d0 [Vincenzo Selvaggio] [SPARK-1406] Fixed scala style
      f46c75c [Vincenzo Selvaggio] [SPARK-1406] Added PMMLExportable to supported models
      7b33b4e [Vincenzo Selvaggio] [SPARK-1406] Added a PMMLExportable interface Restructured code in a new package mllib.pmml Supported models implements the new PMMLExportable interface: LogisticRegression, SVM, KMeansModel, LinearRegression, RidgeRegression, Lasso
      d559ec5 [Vincenzo Selvaggio] Merge remote-tracking branch 'upstream/master' into mllib_pmml_model_export_SPARK-1406
      8fe12bb [Vincenzo Selvaggio] [SPARK-1406] Adjusted logistic regression export description and target categories
      03bc3a5 [Vincenzo Selvaggio] added logistic regression
      da2ec11 [Vincenzo Selvaggio] [SPARK-1406] added linear SVM PMML export
      82f2131 [Vincenzo Selvaggio] Merge remote-tracking branch 'upstream/master' into mllib_pmml_model_export_SPARK-1406
      19adf29 [Vincenzo Selvaggio] [SPARK-1406] Fixed scala style
      1faf985 [Vincenzo Selvaggio] [SPARK-1406] Added target field to the regression model for completeness Adjusted unit test to deal with this change
      3ae8ae5 [Vincenzo Selvaggio] [SPARK-1406] Adjusted imported order according to the guidelines
      c67ce81 [Vincenzo Selvaggio] Merge remote-tracking branch 'upstream/master' into mllib_pmml_model_export_SPARK-1406
      78515ec [Vincenzo Selvaggio] [SPARK-1406] added pmml export for LinearRegressionModel, RidgeRegressionModel and LassoModel
      e29dfb9 [Vincenzo Selvaggio] removed version, by default is set to 4.2 (latest from jpmml) removed copyright
      ae8b993 [Vincenzo Selvaggio] updated some commented tests to use the new ModelExporter object reordered the imports
      df8a89e [Vincenzo Selvaggio] added pmml version to pmml model changed the copyright to spark
      a1b4dc3 [Vincenzo Selvaggio] updated imports
      834ca44 [Vincenzo Selvaggio] reordered the import accordingly to the guidelines
      349a76b [Vincenzo Selvaggio] new helper object to serialize the models to pmml format
      c3ef9b8 [Vincenzo Selvaggio] set it to private
      6357b98 [Vincenzo Selvaggio] set it to private
      e1eb251 [Vincenzo Selvaggio] removed serialization part, this will be part of the ModelExporter helper object
      aba5ee1 [Vincenzo Selvaggio] fixed cluster export
      cd6c07c [Vincenzo Selvaggio] fixed scala style to run tests
      f75b988 [Vincenzo Selvaggio] Merge remote-tracking branch 'origin/master' into mllib_pmml_model_export_SPARK-1406
      07a29bf [selvinsource] Update LICENSE
      8841439 [Vincenzo Selvaggio] adjust scala style in order to compile
      1433b11 [Vincenzo Selvaggio] complete suite tests
      8e71b8d [Vincenzo Selvaggio] kmeans pmml export implementation
      9bc494f [Vincenzo Selvaggio] added scala suite tests added saveLocalFile to ModelExport trait
      226e184 [Vincenzo Selvaggio] added javadoc and export model type in case there is a need to support other types of export (not just PMML)
      a0e3679 [Vincenzo Selvaggio] export and pmml export traits kmeans test implementation
      254e0509
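A sketch of how the PMMLExportable interface added here is used from the caller's side (requires a running Spark context; `sc` and the `RDD[Vector]` named `data` are assumed to be in scope, and the paths are made up):

```scala
// Models that mix in PMMLExportable (KMeansModel, LinearRegressionModel,
// SVMModel, LogisticRegressionModel, ...) gain toPMML methods.
import org.apache.spark.mllib.clustering.KMeans

val model = KMeans.train(data, 2, 10)

val pmml: String = model.toPMML()        // PMML document as a string
model.toPMML("/tmp/kmeans.xml")          // write to a local file
model.toPMML(sc, "hdfs:///models/kmeans") // write to a distributed filesystem
```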
  23. Mar 27, 2015
    • Yu ISHIKAWA's avatar
      [SPARK-6341][mllib] Upgrade breeze from 0.11.1 to 0.11.2 · f43a6103
      Yu ISHIKAWA authored
      There are bugs in Breeze's SparseVector at 0.11.1. Spark 1.3 depends on Breeze 0.11.1, so I think we should upgrade it to 0.11.2.
      https://issues.apache.org/jira/browse/SPARK-6341
      
      Thank you for your great cooperation, David Hall (dlwh).
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #5222 from yu-iskw/upgrade-breeze and squashes the following commits:
      
      ad8a688 [Yu ISHIKAWA] Upgrade breeze from 0.11.1 to 0.11.2 because of a bug of SparseVector. Thanks you for your great cooperation, David Hall(@dlwh)
      f43a6103
  24. Mar 20, 2015
    • Marcelo Vanzin's avatar
      [SPARK-6371] [build] Update version to 1.4.0-SNAPSHOT. · a7456459
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5056 from vanzin/SPARK-6371 and squashes the following commits:
      
      63220df [Marcelo Vanzin] Merge branch 'master' into SPARK-6371
      6506f75 [Marcelo Vanzin] Use more fine-grained exclusion.
      178ba71 [Marcelo Vanzin] Oops.
      75b2375 [Marcelo Vanzin] Exclude VertexRDD in MiMA.
      a45a62c [Marcelo Vanzin] Work around MIMA warning.
      1d8a670 [Marcelo Vanzin] Re-group jetty exclusion.
      0e8e909 [Marcelo Vanzin] Ignore ml, don't ignore graphx.
      cef4603 [Marcelo Vanzin] Indentation.
      296cf82 [Marcelo Vanzin] [SPARK-6371] [build] Update version to 1.4.0-SNAPSHOT.
      a7456459
  25. Mar 12, 2015
    • Xiangrui Meng's avatar
      [SPARK-5814][MLLIB][GRAPHX] Remove JBLAS from runtime · 0cba802a
      Xiangrui Meng authored
      The issue is discussed in https://issues.apache.org/jira/browse/SPARK-5669. Replacing all JBLAS usage by netlib-java gives us a simpler dependency tree and less license issues to worry about. I didn't touch the test scope in this PR. The user guide is not modified to avoid merge conflicts with branch-1.3. srowen ankurdave pwendell
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4699 from mengxr/SPARK-5814 and squashes the following commits:
      
      48635c6 [Xiangrui Meng] move netlib-java version to parent pom
      ca21c74 [Xiangrui Meng] remove jblas from ml-guide
      5f7767a [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5814
      c5c4183 [Xiangrui Meng] merge master
      0f20cad [Xiangrui Meng] add mima excludes
      e53e9f4 [Xiangrui Meng] remove jblas from mllib runtime
      ceaa14d [Xiangrui Meng] replace jblas by netlib-java in graphx
      fa7c2ca [Xiangrui Meng] move jblas to test scope
      0cba802a
  26. Mar 05, 2015
  27. Mar 04, 2015
  28. Jan 30, 2015
    • sboeschhuawei's avatar
      [SPARK-4259][MLlib]: Add Power Iteration Clustering Algorithm with Gaussian Similarity Function · f377431a
      sboeschhuawei authored
      Adds single pseudo-eigenvector PIC.
      Includes documentation and an updated pom.xml, along with the following code files:
      mllib/src/main/scala/org/apache/spark/mllib/clustering/PIClustering.scala
      mllib/src/test/scala/org/apache/spark/mllib/clustering/PIClusteringSuite.scala
      
      Author: sboeschhuawei <stephen.boesch@huawei.com>
      Author: Fan Jiang <fanjiang.sc@huawei.com>
      Author: Jiang Fan <fjiang6@gmail.com>
      Author: Stephen Boesch <stephen.boesch@huawei.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4254 from fjiang6/PIC and squashes the following commits:
      
      4550850 [sboeschhuawei] Removed pic test data
      f292f31 [Stephen Boesch] Merge pull request #44 from mengxr/SPARK-4259
      4b78aaf [Xiangrui Meng] refactor PIC
      24fbf52 [sboeschhuawei] Updated API to be similar to KMeans plus other changes requested by Xiangrui on the PR
      c12dfc8 [sboeschhuawei] Removed examples files and added pic_data.txt. Revamped testcases yet to come
      92d4752 [sboeschhuawei] Move the Guassian/ Affinity matrix calcs out of PIC. Presently in the test suite
      7ebd149 [sboeschhuawei] Incorporate Xiangrui's first set of PR comments except restructure PIC.run to take Graph but do not remove Gaussian
      121e4d5 [sboeschhuawei] Remove unused testing data files
      1c3a62e [sboeschhuawei] removed matplot.py and reordered all private methods to bottom of PIC
      218a49d [sboeschhuawei] Applied Xiangrui's comments - especially removing RDD/PICLinalg classes and making noncritical methods private
      43ab10b [sboeschhuawei] Change last two println's to log4j logger
      88aacc8 [sboeschhuawei] Add assert to testcase on cluster sizes
      24f438e [sboeschhuawei] fixed incorrect markdown in clustering doc
      060e6bf [sboeschhuawei] Added link to PIC doc from the main clustering md doc
      be659e3 [sboeschhuawei] Added mllib specific log4j
      90e7fa4 [sboeschhuawei] Converted from custom Linalg routines to Breeze: added JavaDoc comments; added Markdown documentation
      bea48ea [sboeschhuawei] Converted custom Linear Algebra datatypes/routines to use Breeze.
      b29c0db [Fan Jiang] Update PIClustering.scala
      ace9749 [Fan Jiang] Update PIClustering.scala
      a112f38 [sboeschhuawei] Added graphx main and test jars as dependencies to mllib/pom.xml
      f656c34 [sboeschhuawei] Added iris dataset
      b7dbcbe [sboeschhuawei] Added axes and combined into single plot for matplotlib
      a2b1e57 [sboeschhuawei] Revert inadvertent update to KMeans
      9294263 [sboeschhuawei] Added visualization/plotting of input/output data
      e5df2b8 [sboeschhuawei] First end to end working PIC
      0700335 [sboeschhuawei] First end to end working version: but has bad performance issue
      32a90dc [sboeschhuawei] Update circles test data values
      0ef163f [sboeschhuawei] Added ConcentricCircles data generation and KMeans clustering
      3fd5bc8 [sboeschhuawei] PIClustering is running in new branch (up to the pseudo-eigenvector convergence step)
      d5aae20 [Jiang Fan] Adding Power Iteration Clustering and Suite test
      a3c5fbe [Jiang Fan] Adding Power Iteration Clustering
      f377431a
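For orientation, a usage sketch of the Power Iteration Clustering API as it ended up after the refactoring in this PR (requires a Spark runtime; `sc` is an assumed SparkContext and the similarity triples are made-up data):

```scala
import org.apache.spark.mllib.clustering.PowerIterationClustering

// (srcId, dstId, similarity) entries of a sparse affinity matrix
val similarities = sc.parallelize(Seq(
  (0L, 1L, 0.9), (1L, 2L, 0.9), (2L, 3L, 0.1)))

val model = new PowerIterationClustering()
  .setK(2)              // number of clusters
  .setMaxIterations(10)
  .run(similarities)

model.assignments.foreach(a => println(s"${a.id} -> ${a.cluster}"))
```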
  29. Jan 29, 2015
    • Xiangrui Meng's avatar
      [SPARK-5477] refactor stat.py · a3dc6184
      Xiangrui Meng authored
      There is only a single `stat.py` file for the `mllib.stat` package. We recently added `MultivariateGaussian` under `mllib.stat.distribution` in Scala/Java. It would be nice to refactor `stat.py` and make it easy to expand. Note that `ChiSqTestResult` is moved from `mllib.stat` to `mllib.stat.test`. The latter is used in Scala/Java. It is only used in the return value of `Statistics.chiSqTest`, so this should be an okay change.
      
      davies
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4266 from mengxr/py-stat-refactor and squashes the following commits:
      
      1a5e1db [Xiangrui Meng] refactor stat.py
      a3dc6184
  30. Jan 28, 2015
    • Xiangrui Meng's avatar
      [SPARK-4586][MLLIB] Python API for ML pipeline and parameters · e80dc1c5
      Xiangrui Meng authored
      This PR adds Python API for ML pipeline and parameters. The design doc can be found on the JIRA page. It includes transformers and an estimator to demo the simple text classification example code.
      
      TODO:
      - [x] handle parameters in LRModel
      - [x] unit tests
      - [x] missing some docs
      
      CC: davies jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4151 from mengxr/SPARK-4586 and squashes the following commits:
      
      415268e [Xiangrui Meng] remove inherit_doc from __init__
      edbd6fe [Xiangrui Meng] move Identifiable to ml.util
      44c2405 [Xiangrui Meng] Merge pull request #2 from davies/ml
      dd1256b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586
      14ae7e2 [Davies Liu] fix docs
      54ca7df [Davies Liu] fix tests
      78638df [Davies Liu] Merge branch 'SPARK-4586' of github.com:mengxr/spark into ml
      fc59a02 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586
      1dca16a [Davies Liu] refactor
      090b3a3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into ml
      0882513 [Xiangrui Meng] update doc style
      a4f4dbf [Xiangrui Meng] add unit test for LR
      7521d1c [Xiangrui Meng] add unit tests to HashingTF and Tokenizer
      ba0ba1e [Xiangrui Meng] add unit tests for pipeline
      0586c7b [Xiangrui Meng] add more comments to the example
      5153cff [Xiangrui Meng] simplify java models
      036ca04 [Xiangrui Meng] gen numFeatures
      46fa147 [Xiangrui Meng] update mllib/pom.xml to include python files in the assembly
      1dcc17e [Xiangrui Meng] update code gen and make param appear in the doc
      f66ba0c [Xiangrui Meng] make params a property
      d5efd34 [Xiangrui Meng] update doc conf and move embedded param map to instance attribute
      f4d0fe6 [Xiangrui Meng] use LabeledDocument and Document in example
      05e3e40 [Xiangrui Meng] update example
      d3e8dbe [Xiangrui Meng] more docs optimize pipeline.fit impl
      56de571 [Xiangrui Meng] fix style
      d0c5bb8 [Xiangrui Meng] a working copy
      bce72f4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586
      17ecfb9 [Xiangrui Meng] code gen for shared params
      d9ea77c [Xiangrui Meng] update doc
      c18dca1 [Xiangrui Meng] make the example working
      dadd84e [Xiangrui Meng] add base classes and docs
      a3015cf [Xiangrui Meng] add Estimator and Transformer
      46eea43 [Xiangrui Meng] a pipeline in python
      33b68e0 [Xiangrui Meng] a working LR
      e80dc1c5
  31. Jan 08, 2015
    • Marcelo Vanzin's avatar
      [SPARK-4048] Enhance and extend hadoop-provided profile. · 48cecf67
      Marcelo Vanzin authored
      This change does a few things to make the hadoop-provided profile more useful:
      
      - Create new profiles for other libraries / services that might be provided by the infrastructure
      - Simplify and fix the poms so that the profiles are only activated while building assemblies.
      - Fix tests so that they're able to run when the profiles are activated
      - Add a new env variable to be used by distributions that use these profiles to provide the runtime
        classpath for Spark jobs and daemons.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #2982 from vanzin/SPARK-4048 and squashes the following commits:
      
      82eb688 [Marcelo Vanzin] Add a comment.
      eb228c0 [Marcelo Vanzin] Fix borked merge.
      4e38f4e [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
      9ef79a3 [Marcelo Vanzin] Alternative way to propagate test classpath to child processes.
      371ebee [Marcelo Vanzin] Review feedback.
      52f366d [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
      83099fc [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
      7377e7b [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
      322f882 [Marcelo Vanzin] Fix merge fail.
      f24e9e7 [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
      8b00b6a [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
      9640503 [Marcelo Vanzin] Cleanup child process log message.
      115fde5 [Marcelo Vanzin] Simplify a comment (and make it consistent with another pom).
      e3ab2da [Marcelo Vanzin] Fix hive-thriftserver profile.
      7820d58 [Marcelo Vanzin] Fix CliSuite with provided profiles.
      1be73d4 [Marcelo Vanzin] Restore flume-provided profile.
      d1399ed [Marcelo Vanzin] Restore jetty dependency.
      82a54b9 [Marcelo Vanzin] Remove unused profile.
      5c54a25 [Marcelo Vanzin] Fix HiveThriftServer2Suite with *-provided profiles.
      1fc4d0b [Marcelo Vanzin] Update dependencies for hive-thriftserver.
      f7b3bbe [Marcelo Vanzin] Add snappy to hadoop-provided list.
      9e4e001 [Marcelo Vanzin] Remove duplicate hive profile.
      d928d62 [Marcelo Vanzin] Redirect child stderr to parent's log.
      4d67469 [Marcelo Vanzin] Propagate SPARK_DIST_CLASSPATH on Yarn.
      417d90e [Marcelo Vanzin] Introduce "SPARK_DIST_CLASSPATH".
      2f95f0d [Marcelo Vanzin] Propagate classpath to child processes during testing.
      1adf91c [Marcelo Vanzin] Re-enable maven-install-plugin for a few projects.
      284dda6 [Marcelo Vanzin] Rework the "hadoop-provided" profile, add new ones.
      48cecf67
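The "new env variable" mentioned above is `SPARK_DIST_CLASSPATH`. A config-fragment sketch of how a distribution built with the hadoop-provided profile typically supplies the runtime classpath (assumes a `hadoop` CLI on the PATH):

```shell
# In spark-env.sh (or the environment of Spark daemons/jobs):
# let the installed Hadoop provide its own jars at runtime.
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
```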
  32. Jan 06, 2015
    • Sean Owen's avatar
      SPARK-4159 [CORE] Maven build doesn't run JUnit test suites · 4cba6eb4
      Sean Owen authored
      This PR:
      
      - Reenables `surefire`, and copies config from `scalatest` (which is itself an old fork of `surefire`, so similar)
      - Tells `surefire` to test only Java tests
      - Enables `surefire` and `scalatest` for all children, and in turn eliminates some duplication.
      
      For me this causes the Scala and Java tests to be run once each, it seems, as desired. It doesn't affect the SBT build but works for Maven. I still need to verify that all of the Scala tests and Java tests are being run.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3651 from srowen/SPARK-4159 and squashes the following commits:
      
      2e8a0af [Sean Owen] Remove specialized SPARK_HOME setting for REPL, YARN tests as it appears to be obsolete
      12e4558 [Sean Owen] Append to unit-test.log instead of overwriting, so that both surefire and scalatest output is preserved. Also standardize/correct comments a bit.
      e6f8601 [Sean Owen] Reenable Java tests by reenabling surefire with config cloned from scalatest; centralize test config in the parent
      4cba6eb4
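A hypothetical sketch of the division of labor described above: `surefire` restricted to Java tests via include patterns (the patterns here are illustrative, not Spark's exact configuration), while `scalatest` keeps running the Scala suites:

```xml
<!-- Sketch: surefire runs only Java tests; Scala suites stay with the
     scalatest-maven-plugin, so each test runs exactly once. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <includes>
      <include>**/Test*.java</include>
      <include>**/*Test.java</include>
      <include>**/*Suite.java</include>
    </includes>
  </configuration>
</plugin>
```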
  33. Nov 18, 2014
    • Marcelo Vanzin's avatar
      Bumping version to 1.3.0-SNAPSHOT. · 397d3aae
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #3277 from vanzin/version-1.3 and squashes the following commits:
      
      7c3c396 [Marcelo Vanzin] Added temp repo to sbt build.
      5f404ff [Marcelo Vanzin] Add another exclusion.
      19457e7 [Marcelo Vanzin] Update old version to 1.2, add temporary 1.2 repo.
      3c8d705 [Marcelo Vanzin] Workaround for MIMA checks.
      e940810 [Marcelo Vanzin] Bumping version to 1.3.0-SNAPSHOT.
      397d3aae
  34. Nov 12, 2014
    • Xiangrui Meng's avatar
      [SPARK-3530][MLLIB] pipeline and parameters with examples · 4b736dba
      Xiangrui Meng authored
      This PR adds package "org.apache.spark.ml" with pipeline and parameters, as discussed on the JIRA. This is a joint work of jkbradley etrain shivaram and many others who helped on the design, also with help from  marmbrus and liancheng on the Spark SQL side. The design doc can be found at:
      
      https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
      
      **org.apache.spark.ml**
      
      This is a new package with a new set of ML APIs that address practical machine learning pipelines. (Sorry for taking so long!) It will be an alpha component, so this is definitely not something set in stone. The new set of APIs, inspired by the MLI project from AMPLab and by scikit-learn, leverages Spark SQL's schema support and execution plan optimization. It introduces the following components that help build a practical pipeline:
      
      1. Transformer, which transforms a dataset into another
      2. Estimator, which fits models to data, where models are transformers
      3. Evaluator, which evaluates model output and returns a scalar metric
      4. Pipeline, a simple pipeline that consists of transformers and estimators
      
      Parameters could be supplied at fit/transform or embedded with components.
      
      1. Param: a strongly-typed parameter key with self-contained doc
      2. ParamMap: a param -> value map
      3. Params: trait for components with parameters
      
      For any component that implements `Params`, user can easily check the doc by calling `explainParams`:
      
      ~~~
      > val lr = new LogisticRegression
      > lr.explainParams
      maxIter: max number of iterations (default: 100)
      regParam: regularization constant (default: 0.1)
      labelCol: label column name (default: label)
      featuresCol: features column name (default: features)
      ~~~
      
      or user can check individual param:
      
      ~~~
      > lr.maxIter
      maxIter: max number of iterations (default: 100)
      ~~~
      
      **Please start with the example code in test suites and under `org.apache.spark.examples.ml`, where I put several examples:**
      
      1. run a simple logistic regression job
      
      ~~~
          val lr = new LogisticRegression()
            .setMaxIter(10)
            .setRegParam(1.0)
          val model = lr.fit(dataset)
          model.transform(dataset, model.threshold -> 0.8) // overwrite threshold
            .select('label, 'score, 'prediction).collect()
            .foreach(println)
      ~~~
      
      2. run logistic regression with cross-validation and grid search using areaUnderROC (default) as the metric
      
      ~~~
          val lr = new LogisticRegression
          val lrParamMaps = new ParamGridBuilder()
            .addGrid(lr.regParam, Array(0.1, 100.0))
            .addGrid(lr.maxIter, Array(0, 5))
            .build()
          val eval = new BinaryClassificationEvaluator
          val cv = new CrossValidator()
            .setEstimator(lr)
            .setEstimatorParamMaps(lrParamMaps)
            .setEvaluator(eval)
            .setNumFolds(3)
          val bestModel = cv.fit(dataset)
      ~~~
      
      3. run a pipeline that consists of a standard scaler and a logistic regression component
      
      ~~~
          val scaler = new StandardScaler()
            .setInputCol("features")
            .setOutputCol("scaledFeatures")
          val lr = new LogisticRegression()
            .setFeaturesCol(scaler.getOutputCol)
          val pipeline = new Pipeline()
            .setStages(Array(scaler, lr))
          val model = pipeline.fit(dataset)
          val predictions = model.transform(dataset)
            .select('label, 'score, 'prediction)
            .collect()
            .foreach(println)
      ~~~
      
      4. a simple text classification pipeline, which recognizes "spark":
      
      ~~~
          val training = sparkContext.parallelize(Seq(
            LabeledDocument(0L, "a b c d e spark", 1.0),
            LabeledDocument(1L, "b d", 0.0),
            LabeledDocument(2L, "spark f g h", 1.0),
            LabeledDocument(3L, "hadoop mapreduce", 0.0)))
          val tokenizer = new Tokenizer()
            .setInputCol("text")
            .setOutputCol("words")
          val hashingTF = new HashingTF()
            .setInputCol(tokenizer.getOutputCol)
            .setOutputCol("features")
          val lr = new LogisticRegression()
            .setMaxIter(10)
          val pipeline = new Pipeline()
            .setStages(Array(tokenizer, hashingTF, lr))
          val model = pipeline.fit(training)
          val test = sparkContext.parallelize(Seq(
            Document(4L, "spark i j k"),
            Document(5L, "l m"),
            Document(6L, "mapreduce spark"),
            Document(7L, "apache hadoop")))
          model.transform(test)
            .select('id, 'text, 'prediction, 'score)
            .collect()
            .foreach(println)
      ~~~
      
      Java examples are very similar. I put example code that creates a simple text classification pipeline in Scala and Java, where a simple tokenizer is defined as a transformer outside `org.apache.spark.ml`.
      
**What is missing now and will be added soon:**
      
      1. ~~Runtime check of schemas. So before we touch the data, we will go through the schema and make sure column names and types match the input parameters.~~
      2. ~~Java examples.~~
      3. ~~Store training parameters in trained models.~~
      4. (later) Serialization and Python API.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3099 from mengxr/SPARK-3530 and squashes the following commits:
      
      2cc93fd [Xiangrui Meng] hide APIs as much as I can
      34319ba [Xiangrui Meng] use local instead local[2] for unit tests
      2524251 [Xiangrui Meng] rename PipelineStage.transform to transformSchema
      c9daab4 [Xiangrui Meng] remove mockito version
      1397ab5 [Xiangrui Meng] use sqlContext from LocalSparkContext instead of TestSQLContext
      6ffc389 [Xiangrui Meng] try to fix unit test
      a59d8b7 [Xiangrui Meng] doc updates
      977fd9d [Xiangrui Meng] add scala ml package object
      6d97fe6 [Xiangrui Meng] add AlphaComponent annotation
      731f0e4 [Xiangrui Meng] update package doc
      0435076 [Xiangrui Meng] remove ;this from setters
      fa21d9b [Xiangrui Meng] update extends indentation
      f1091b3 [Xiangrui Meng] typo
      228a9f4 [Xiangrui Meng] do not persist before calling binary classification metrics
      f51cd27 [Xiangrui Meng] rename default to defaultValue
      b3be094 [Xiangrui Meng] refactor schema transform in lr
      8791e8e [Xiangrui Meng] rename copyValues to inheritValues and make it do the right thing
      51f1c06 [Xiangrui Meng] remove leftover code in Transformer
      494b632 [Xiangrui Meng] compure score once
      ad678e9 [Xiangrui Meng] more doc for Transformer
      4306ed4 [Xiangrui Meng] org imports in text pipeline
      6e7c1c7 [Xiangrui Meng] update pipeline
      4f9e34f [Xiangrui Meng] more doc for pipeline
      aa5dbd4 [Xiangrui Meng] fix typo
      11be383 [Xiangrui Meng] fix unit tests
      3df7952 [Xiangrui Meng] clean up
      986593e [Xiangrui Meng] re-org java test suites
      2b11211 [Xiangrui Meng] remove external data deps
      9fd4933 [Xiangrui Meng] add unit test for pipeline
      2a0df46 [Xiangrui Meng] update tests
      2d52e4d [Xiangrui Meng] add @AlphaComponent to package-info
      27582a4 [Xiangrui Meng] doc changes
      73a000b [Xiangrui Meng] add schema transformation layer
      6736e87 [Xiangrui Meng] more doc / remove HasMetricName trait
      80a8b5e [Xiangrui Meng] rename SimpleTransformer to UnaryTransformer
      62ca2bb [Xiangrui Meng] check param parent in set/get
      1622349 [Xiangrui Meng] add getModel to PipelineModel
      a0e0054 [Xiangrui Meng] update StandardScaler to use SimpleTransformer
      d0faa04 [Xiangrui Meng] remove implicit mapping from ParamMap
      c7f6921 [Xiangrui Meng] move ParamGridBuilder test to ParamGridBuilderSuite
      e246f29 [Xiangrui Meng] re-org:
      7772430 [Xiangrui Meng] remove modelParams add a simple text classification pipeline
      b95c408 [Xiangrui Meng] remove implicits add unit tests to params
      bab3e5b [Xiangrui Meng] update params
      fe0ee92 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3530
      6e86d98 [Xiangrui Meng] some code clean-up
      2d040b3 [Xiangrui Meng] implement setters inside each class, add Params.copyValues [ci skip]
      fd751fc [Xiangrui Meng] add java-friendly versions of fit and tranform
      3f810cd [Xiangrui Meng] use multi-model training api in cv
      5b8f413 [Xiangrui Meng] rename model to modelParams
      9d2d35d [Xiangrui Meng] test varargs and chain model params
      f46e927 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3530
      1ef26e0 [Xiangrui Meng] specialize methods/types for Java
      df293ed [Xiangrui Meng] switch to setter/getter
      376db0a [Xiangrui Meng] pipeline and parameters
      4b736dba
  35. Nov 04, 2014
    • Xiangrui Meng's avatar
      [SPARK-3573][MLLIB] Make MLlib's Vector compatible with SQL's SchemaRDD · 1a9c6cdd
      Xiangrui Meng authored
Register MLlib's Vector as a SQL user-defined type (UDT) in both Scala and Python. With this PR, we can easily map an RDD[LabeledPoint] to a SchemaRDD, and then select columns or save it to a Parquet file. Examples in Scala/Python are attached. The Scala code was copied from jkbradley.
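
For context, a rough sketch of what this enables (using the 1.x `SchemaRDD` API; `sc` and `sqlContext` are assumed to be in scope, and the implicit-conversion details may differ slightly from the attached examples):

~~~
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val points = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.1, 0.2)),
  LabeledPoint(0.0, Vectors.dense(0.3, 0.4))))
// With Vector registered as a UDT, the implicit conversion to SchemaRDD
// works even though the features column holds a vector type.
import sqlContext.createSchemaRDD
points.select('label, 'features).saveAsParquetFile("points.parquet")
~~~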
      
      ~~This PR contains the changes from #3068 . I will rebase after #3068 is merged.~~
      
      marmbrus jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3070 from mengxr/SPARK-3573 and squashes the following commits:
      
      3a0b6e5 [Xiangrui Meng] organize imports
      236f0a0 [Xiangrui Meng] register vector as UDT and provide dataset examples
      1a9c6cdd
  36. Nov 01, 2014
    • Xiangrui Meng's avatar
      [SPARK-4121] Set commons-math3 version based on hadoop profiles, instead of shading · d8176b1c
      Xiangrui Meng authored
In #2928, we shaded commons-math3 to prevent future conflicts with Hadoop. This caused problems with our Jenkins master build with Maven: some tests used local-cluster mode, where the assembly jar contains relocated math3 classes, while the MLlib test code still compiles against core and the untouched math3 classes.
      
      This PR sets commons-math3 version based on hadoop profiles.
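
The pattern looks roughly like the following pom fragment (the profile ids and version numbers here are illustrative, not the exact values from the PR):

~~~
<properties>
  <!-- default commons-math3 version -->
  <commons.math3.version>3.1.1</commons.math3.version>
</properties>
<profiles>
  <profile>
    <id>hadoop-2.3</id>
    <properties>
      <!-- override to match the commons-math3 this Hadoop line ships -->
      <commons.math3.version>3.1.1</commons.math3.version>
    </properties>
  </profile>
</profiles>
~~~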
      
      pwendell JoshRosen srowen
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3023 from mengxr/SPARK-4121-alt and squashes the following commits:
      
      580f6d9 [Xiangrui Meng] replace tab by spaces
      7f71f08 [Xiangrui Meng] revert changes to PoissonSampler to avoid conflicts
      d3353d9 [Xiangrui Meng] do not shade commons-math3
      b4180dc [Xiangrui Meng] temp work
      d8176b1c