  1. Dec 09, 2014
    • Sandy Ryza's avatar
      SPARK-4338. [YARN] Ditch yarn-alpha. · 912563aa
      Sandy Ryza authored
      Sorry if this is a little premature with 1.2 still not out the door, but it will make other work like SPARK-4136 and SPARK-2089 a lot easier.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3215 from sryza/sandy-spark-4338 and squashes the following commits:
      
      1c5ac08 [Sandy Ryza] Update building Spark docs and remove unnecessary newline
      9c1421c [Sandy Ryza] SPARK-4338. Ditch yarn-alpha.
      912563aa
    • Cheng Hao's avatar
      [SPARK-4785][SQL] Initialize Hive UDFs on the driver and serialize them with a wrapper · 383c5555
      Cheng Hao authored
      Unlike Hive 0.12.0, in Hive 0.13.1 UDF/UDAF/UDTF (aka Hive function) objects should only be initialized once on the driver side and then serialized to executors. However, not all function objects are serializable (e.g. GenericUDF doesn't implement Serializable). Hive 0.13.1 solves this issue with its Kryo or XML serializers; several utility ser/de methods are provided in class o.a.h.h.q.e.Utilities for this purpose. In this PR we chose Kryo for efficiency (the pattern is sketched after this entry). The Kryo serializer used here is created in Hive; Spark's Kryo serializer wasn't used because there's no SparkConf instance available at that point.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3640 from chenghao-intel/udf_serde and squashes the following commits:
      
      8e13756 [Cheng Hao] Update the comment
      74466a3 [Cheng Hao] refactor as feedbacks
      396c0e1 [Cheng Hao] avoid Simple UDF to be serialized
      e9c3212 [Cheng Hao] update the comment
      19cbd46 [Cheng Hao] support udf instance ser/de after initialization
      383c5555
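      A minimal sketch of the initialize-on-the-driver / ship-with-Kryo pattern this commit describes, written against the plain Kryo API rather than the o.a.h.h.q.e.Utilities helpers the PR actually uses; the FakeGenericUDF and UdfSerde names are illustrative only.

      ```
      import java.io.ByteArrayOutputStream

      import com.esotericsoftware.kryo.Kryo
      import com.esotericsoftware.kryo.io.{Input, Output}

      // Hypothetical stand-in for a Hive function object that is not java.io.Serializable.
      class FakeGenericUDF { def evaluate(x: Int): Int = x + 1 }

      object UdfSerde {
        // Serialize an already-initialized function object on the driver.
        def serialize(udf: AnyRef): Array[Byte] = {
          val kryo = new Kryo()
          val bos  = new ByteArrayOutputStream()
          val out  = new Output(bos)
          kryo.writeClassAndObject(out, udf)
          out.close()
          bos.toByteArray
        }

        // Rebuild the function object on an executor without re-initializing it.
        def deserialize(bytes: Array[Byte]): AnyRef = {
          val kryo = new Kryo()
          kryo.readClassAndObject(new Input(bytes))
        }
      }
      ```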
    • zsxwing's avatar
      [SPARK-3154][STREAMING] Replace ConcurrentHashMap with mutable.HashMap and... · bcb5cdad
      zsxwing authored
      [SPARK-3154][STREAMING] Replace ConcurrentHashMap with mutable.HashMap and remove @volatile from 'stopped'
      
      Since `sequenceNumberToProcessor` and `stopped` are both protected by the lock on `sequenceNumberToProcessor`, the `ConcurrentHashMap` and the `@volatile` annotation are unnecessary, so this PR updates them accordingly (see the sketch after this entry).
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3634 from zsxwing/SPARK-3154 and squashes the following commits:
      
      0d087ac [zsxwing] Replace ConcurrentHashMap with mutable.HashMap and remove @volatile from 'stopped'
      bcb5cdad
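      A small sketch of the locking pattern the commit describes: once every access to the map and the flag goes through a `synchronized` block on the map itself, the concurrent collection and `@volatile` add nothing. The class and method names here are illustrative, not the actual receiver code.

      ```
      import scala.collection.mutable

      class TrackerLike {
        // Both fields are only touched while holding the lock on the map,
        // so a plain HashMap and a plain Boolean are sufficient.
        private val sequenceNumberToProcessor = mutable.HashMap.empty[Long, String]
        private var stopped = false

        def register(seq: Long, processor: String): Boolean =
          sequenceNumberToProcessor.synchronized {
            if (stopped) false
            else { sequenceNumberToProcessor(seq) = processor; true }
          }

        def stop(): Unit = sequenceNumberToProcessor.synchronized {
          stopped = true
          sequenceNumberToProcessor.clear()
        }
      }
      ```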
  2. Dec 08, 2014
    • Cheng Hao's avatar
      [SPARK-4769] [SQL] CTAS does not work when reading from temporary tables · 51b1fe14
      Cheng Hao authored
      This is the code refactor and follow ups for #2570
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3336 from chenghao-intel/createtbl and squashes the following commits:
      
      3563142 [Cheng Hao] remove the unused variable
      e215187 [Cheng Hao] eliminate the compiling warning
      4f97f14 [Cheng Hao] fix bug in unittest
      5d58812 [Cheng Hao] revert the API changes
      b85b620 [Cheng Hao] fix the regression of temp tabl not found in CTAS
      51b1fe14
    • Jacky Li's avatar
      [SQL] remove unnecessary import in spark-sql · 94438436
      Jacky Li authored
      Author: Jacky Li <jacky.likun@huawei.com>
      
      Closes #3630 from jackylk/remove and squashes the following commits:
      
      150e7e0 [Jacky Li] remove unnecessary import
      94438436
    • Sandy Ryza's avatar
      SPARK-4770. [DOC] [YARN] spark.scheduler.minRegisteredResourcesRatio documented default is incorrect for YARN · cda94d15
      Sandy Ryza authored
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3624 from sryza/sandy-spark-4770 and squashes the following commits:
      
      bd81a3a [Sandy Ryza] SPARK-4770. [DOC] [YARN] spark.scheduler.minRegisteredResourcesRatio documented default is incorrect for YARN
      cda94d15
    • Sean Owen's avatar
      SPARK-3926 [CORE] Reopened: result of JavaRDD collectAsMap() is not serializable · e829bfa1
      Sean Owen authored
      My original 'fix' didn't fix at all. Now, there's a unit test to check whether it works. Of the two options to really fix it -- copy the `Map` to a `java.util.HashMap`, or copy and modify Scala's implementation in `Wrappers.MapWrapper`, I went with the latter.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3587 from srowen/SPARK-3926 and squashes the following commits:
      
      8586bb9 [Sean Owen] Remove unneeded no-arg constructor, and add additional note about copied code in LICENSE
      7bb0e66 [Sean Owen] Make SerializableMapWrapper actually serialize, and add unit test
      e829bfa1
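      The actual fix copies Scala's `Wrappers.MapWrapper` into a serializable `SerializableMapWrapper`; as a user-side illustration of the problem and a workaround (not Spark's fix), the helpers below round-trip a map through Java serialization and, if needed, copy it into a plain `java.util.HashMap`.

      ```
      import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

      // Rough shape of what a unit test for this can check: the map survives Java serialization.
      def isJavaSerializable(m: AnyRef): Boolean =
        try {
          val oos = new ObjectOutputStream(new ByteArrayOutputStream())
          oos.writeObject(m)
          oos.close()
          true
        } catch {
          case _: NotSerializableException => false
        }

      // Pre-fix workaround: copy the wrapper returned by collectAsMap() into a plain HashMap.
      def toSerializableCopy[K, V](m: java.util.Map[K, V]): java.util.HashMap[K, V] =
        new java.util.HashMap[K, V](m)
      ```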
    • Andrew Or's avatar
      [SPARK-4750] Dynamic allocation - synchronize kills · 65f929d5
      Andrew Or authored
      Simple omission on my part.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3612 from andrewor14/dynamic-allocation-synchronization and squashes the following commits:
      
      1f03b60 [Andrew Or] Synchronize kills
      65f929d5
    • Kostas Sakellis's avatar
      [SPARK-4774] [SQL] Makes HiveFromSpark more portable · d6a972b3
      Kostas Sakellis authored
      HiveFromSpark read the kv1.txt file from SPARK_HOME/examples/src/main/resources/kv1.txt, which assumed you had a source tree checked out. Now we copy the kv1.txt file to a temporary file and delete it when the JVM shuts down (see the sketch after this entry). This allows us to run this example outside of a Spark source tree.
      
      Author: Kostas Sakellis <kostas@cloudera.com>
      
      Closes #3628 from ksakellis/kostas-spark-4774 and squashes the following commits:
      
      6770f83 [Kostas Sakellis] [SPARK-4774] [SQL] Makes HiveFromSpark more portable
      d6a972b3
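      A hedged sketch of the approach described above: read the file off the classpath, copy it to a temp file, and let the JVM delete it on shutdown. The resource path and helper name are illustrative, not necessarily what HiveFromSpark ended up with.

      ```
      import java.io.File
      import java.nio.file.{Files, StandardCopyOption}

      def copyResourceToTempFile(resourcePath: String): File = {
        val in = getClass.getResourceAsStream(resourcePath)  // e.g. "/kv1.txt"
        require(in != null, s"resource not found: $resourcePath")
        val tmp = File.createTempFile("kv1", ".txt")
        tmp.deleteOnExit()                                   // removed when the JVM shuts down
        Files.copy(in, tmp.toPath, StandardCopyOption.REPLACE_EXISTING)
        in.close()
        tmp
      }
      ```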
    • Christophe Préaud's avatar
      [SPARK-4764] Ensure that files are fetched atomically · ab2abcb5
      Christophe Préaud authored
      tempFile is created in the same directory as targetFile, so that the
      move from tempFile to targetFile is always atomic (see the sketch after this entry).
      
      Author: Christophe Préaud <christophe.preaud@kelkoo.com>
      
      Closes #2855 from preaudc/master and squashes the following commits:
      
      9ba89ca [Christophe Préaud] Ensure that files are fetched atomically
      54419ae [Christophe Préaud] Merge remote-tracking branch 'upstream/master'
      c6a5590 [Christophe Préaud] Revert commit 8ea871f8130b2490f1bad7374a819bf56f0ccbbd
      7456a33 [Christophe Préaud] Merge remote-tracking branch 'upstream/master'
      8ea871f [Christophe Préaud] Ensure that files are fetched atomically
      ab2abcb5
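      A small sketch of the fetch-then-rename pattern the commit describes: the temporary file is created in the same directory as the target, so the final move never crosses filesystems and can be atomic. Helper and variable names are illustrative.

      ```
      import java.io.File
      import java.nio.file.{Files, StandardCopyOption}

      // Create the temp file next to the target so the final rename stays on one filesystem.
      def fetchAtomically(writeContents: File => Unit, targetFile: File): Unit = {
        val tempFile = File.createTempFile("fetch", ".tmp", targetFile.getParentFile)
        try {
          writeContents(tempFile)
          Files.move(tempFile.toPath, targetFile.toPath, StandardCopyOption.ATOMIC_MOVE)
        } finally {
          tempFile.delete() // no-op if the move already succeeded
        }
      }
      ```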
  3. Dec 07, 2014
    • Takeshi Yamamuro's avatar
      [SPARK-4620] Add unpersist in Graph and GraphImpl · 8817fc7f
      Takeshi Yamamuro authored
      Add an interface to uncache both vertices and edges of Graph/GraphImpl.
      This interface is useful when iterative graph operations build a new graph in each iteration and the vertices and edges of previous iterations are no longer needed in later iterations (a sketch of that pattern follows this entry).
      
      Author: Takeshi Yamamuro <linguin.m.s@gmail.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Ankur Dave <ankurdave@gmail.com>
      
      Closes #3476 from maropu/UnpersistInGraphSpike and squashes the following commits:
      
      77a006a [Takeshi Yamamuro] Add unpersist in Graph and GraphImpl
      8817fc7f
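      A sketch of the iterative pattern the commit targets: keep a handle on the previous graph and uncache its vertices and edges once the new graph is materialized. This uses the pre-existing RDD-level `unpersist` rather than the Graph-level method the PR adds, and the loop body is illustrative.

      ```
      import org.apache.spark.graphx.Graph

      def iterate[VD, ED](initial: Graph[VD, ED], numIters: Int)(
          step: Graph[VD, ED] => Graph[VD, ED]): Graph[VD, ED] = {
        var g = initial.cache()
        var i = 0
        while (i < numIters) {
          val prev = g
          g = step(prev).cache()
          g.edges.count()                        // force materialization of the new graph
          prev.vertices.unpersist(blocking = false)
          prev.edges.unpersist(blocking = false)
          i += 1
        }
        g
      }
      ```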
    • Takeshi Yamamuro's avatar
      [SPARK-4646] Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark · 2e6b736b
      Takeshi Yamamuro authored
      This patch just replaces a native quick sorter with Sorter(TimSort) in Spark.
      It yielded performance gains of roughly 8% in my quick experiments (a rough before/after illustration follows this entry).
      
      Author: Takeshi Yamamuro <linguin.m.s@gmail.com>
      
      Closes #3507 from maropu/TimSortInEdgePartitionBuilderSpike and squashes the following commits:
      
      8d4e5d2 [Takeshi Yamamuro] Remove a wildcard import
      3527e00 [Takeshi Yamamuro] Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark
      2e6b736b
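      The change itself lives in EdgePartitionBuilder; as a rough illustration of the difference, sorting an object array with `java.util.Arrays.sort` goes through TimSort (adaptive, good on partially ordered input), whereas `scala.util.Sorting.quickSort` is a plain quicksort. The sample array is just example input.

      ```
      import java.util.Arrays

      val keys: Array[String] = Array("e", "a", "d", "b", "c")

      // Before: Scala's in-place quicksort.
      val byQuickSort = keys.clone()
      scala.util.Sorting.quickSort(byQuickSort)

      // After: Arrays.sort, which uses TimSort for object arrays.
      val byTimSort = keys.clone()
      Arrays.sort(byTimSort, Ordering.String)
      ```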
  4. Dec 06, 2014
  5. Dec 05, 2014
    • CrazyJvm's avatar
      Streaming doc: do you mean inadvertently? · 6eb1b6f6
      CrazyJvm authored
      Author: CrazyJvm <crazyjvm@gmail.com>
      
      Closes #3620 from CrazyJvm/streaming-foreachRDD and squashes the following commits:
      
      b72886b [CrazyJvm] do you mean inadvertently?
      6eb1b6f6
    • Zhang, Liye's avatar
      [SPARK-4005][CORE] handle message replies in receive instead of in the individual private methods · 98a7d099
      Zhang, Liye authored
      In BlockManagerMasterActor, when handling the UpdateBlockInfo message type, replies are currently sent from individual private methods; they should instead be sent from the actor's Akka `receive` method (see the sketch after this entry).
      
      Author: Zhang, Liye <liye.zhang@intel.com>
      
      Closes #2853 from liyezhang556520/akkaRecv and squashes the following commits:
      
      9b06f0a [Zhang, Liye] remove the unreachable code
      bf518cd [Zhang, Liye] change the indent
      242166b [Zhang, Liye] modified accroding to the comments
      d4b929b [Zhang, Liye] [SPARK-4005][CORE] handle message replies in receive instead of in the individual private methods
      98a7d099
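      A sketch of the shape the commit asks for, with the reply sent from `receive` itself rather than from the private helper; the message type and actor here are simplified stand-ins, not the real BlockManagerMasterActor code.

      ```
      import akka.actor.Actor
      import scala.collection.mutable

      case class UpdateBlockInfo(blockId: String, size: Long) // simplified stand-in

      class BlockManagerMasterLikeActor extends Actor {
        private val blockSizes = mutable.HashMap.empty[String, Long]

        def receive: Receive = {
          case UpdateBlockInfo(blockId, size) =>
            updateBlockInfo(blockId, size)
            // Reply here, in receive, so the request/response flow is visible in one place.
            sender() ! true
        }

        private def updateBlockInfo(blockId: String, size: Long): Unit = {
          blockSizes(blockId) = size
        }
      }
      ```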
    • Cheng Lian's avatar
      [SPARK-4761][SQL] Enables Kryo by default in Spark SQL Thrift server · 6f61e1f9
      Cheng Lian authored
      Enables Kryo and disables reference tracking by default in Spark SQL Thrift server. Configurations explicitly defined by users in `spark-defaults.conf` are respected (the Thrift server is started by `spark-submit`, which handles configuration properties properly).
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3621 from liancheng/kryo-by-default and squashes the following commits:
      
      70c2775 [Cheng Lian] Enables Kryo by default in Spark SQL Thrift server
      6f61e1f9
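      A sketch of the "default unless the user set it" behaviour described above, using SparkConf.setIfMissing so anything from spark-defaults.conf wins; the surrounding helper is illustrative, not the Thrift server's actual startup code.

      ```
      import org.apache.spark.SparkConf

      def applyThriftServerDefaults(conf: SparkConf): SparkConf = {
        // Only applied when the user has not configured these keys themselves,
        // e.g. via spark-defaults.conf passed through spark-submit.
        conf.setIfMissing("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        conf.setIfMissing("spark.kryo.referenceTracking", "false")
        conf
      }
      ```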
    • Michael Armbrust's avatar
      [SPARK-4753][SQL] Use catalyst for partition pruning in newParquet. · f5801e81
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3613 from marmbrus/parquetPartitionPruning and squashes the following commits:
      
      4f138f8 [Michael Armbrust] Use catalyst for partition pruning in newParquet.
      f5801e81
  6. Dec 04, 2014
    • Andrew Or's avatar
      fd852533
    • Andrew Or's avatar
      87437df0
    • Masayoshi TSUZUKI's avatar
      [SPARK-4464] Description about configuration options need to be modified in docs. · ca379039
      Masayoshi TSUZUKI authored
      Added description about -h and -host.
      Modified description about -i and -ip which are now deprecated.
      Added description about --properties-file.
      
      Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
      
      Closes #3329 from tsudukim/feature/SPARK-4464 and squashes the following commits:
      
      6c07caf [Masayoshi TSUZUKI] [SPARK-4464] Description about configuration options need to be modified in docs.
      ca379039
    • Andy Konwinski's avatar
      Fix typo in Spark SQL docs. · 15cf3b01
      Andy Konwinski authored
      Author: Andy Konwinski <andykonwinski@gmail.com>
      
      Closes #3611 from andyk/patch-3 and squashes the following commits:
      
      7bab333 [Andy Konwinski] Fix typo in Spark SQL docs.
      15cf3b01
    • Masayoshi TSUZUKI's avatar
      [SPARK-4421] Wrong link in spark-standalone.html · ddfc09c3
      Masayoshi TSUZUKI authored
      Modified the link of building Spark.
      
      Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
      
      Closes #3279 from tsudukim/feature/SPARK-4421 and squashes the following commits:
      
      56e31c1 [Masayoshi TSUZUKI] Modified the link of building Spark.
      ddfc09c3
    • Reynold Xin's avatar
      [SPARK-4397] Move object RDD to the front of RDD.scala. · ed92b47e
      Reynold Xin authored
      I ran into multiple cases where the SBT/Scala compiler was confused by the implicits in continuous compilation mode. Adding explicit return types fixes the problem (illustrated after this entry).
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #3580 from rxin/rdd-implicit and squashes the following commits:
      
      ee32fcd [Reynold Xin] Move object RDD to the end of the file.
      b8562c9 [Reynold Xin] Merge branch 'master' of github.com:apache/spark into rdd-implicit
      d4e9f85 [Reynold Xin] Code review.
      a836a37 [Reynold Xin] Move object RDD to the front of RDD.scala.
      ed92b47e
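      The fix is organizational (moving `object RDD` up) plus explicit return types on the implicit conversions; a hedged illustration of the latter, with a made-up conversion rather than the real ones in RDD.scala.

      ```
      import scala.language.implicitConversions

      class RichWords(val words: Seq[String]) {
        def longest: String = words.maxBy(_.length)
      }

      object Conversions {
        // Explicit return type: incremental (SBT continuous) compilation can get
        // confused when an implicit def's type must be inferred from its body.
        implicit def seqToRichWords(words: Seq[String]): RichWords = new RichWords(words)
      }
      ```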
    • lewuathe's avatar
      [SPARK-4652][DOCS] Add docs about spark-git-repo option · ab8177da
      lewuathe authored
      There are cases when a work-in-progress Spark version needs to be run
      on an EC2 cluster. To make setting up this type of cluster easier,
      this adds a description of the --spark-git-repo option to the EC2 documentation.
      
      Author: lewuathe <lewuathe@me.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3513 from Lewuathe/doc-for-development-spark-cluster and squashes the following commits:
      
      6dae8ee [lewuathe] Wrap consistent with other descriptions
      cfaf9be [lewuathe] Add docs about spark-git-repo option
      
      (Editing / cleanup by Josh Rosen)
      ab8177da
    • Saldanha's avatar
      [SPARK-4459] Change groupBy type parameter from K to U · 743a889d
      Saldanha authored
      Please see https://issues.apache.org/jira/browse/SPARK-4459
      
      Author: Saldanha <saldaal1@phusca-l24858.wlan.na.novartis.net>
      
      Closes #3327 from alokito/master and squashes the following commits:
      
      54b1095 [Saldanha] [SPARK-4459] changed type parameter for keyBy from K to U
      d5f73c3 [Saldanha] [SPARK-4459] added keyBy test
      316ad77 [Saldanha] SPARK-4459 changed type parameter for groupBy from K to U.
      62ddd4b [Saldanha] SPARK-4459 added failing unit test
      743a889d
    • alexdebrie's avatar
      [SPARK-4745] Fix get_existing_cluster() function with multiple security groups · 794f3aec
      alexdebrie authored
      The current get_existing_cluster() function would only find an instance belonging to a cluster if the instance's security groups == cluster_name + "-master" (or "-slaves"). This fix allows for multiple security groups by checking whether the cluster_name + "-master" security group is in the list of groups for a particular instance.
      
      Author: alexdebrie <alexdebrie1@gmail.com>
      
      Closes #3596 from alexdebrie/master and squashes the following commits:
      
      9d51232 [alexdebrie] Fix get_existing_cluster() function with multiple security groups
      794f3aec
    • Patrick Wendell's avatar
      [HOTFIX] Fixing two issues with the release script. · 8dae26f8
      Patrick Wendell authored
      1. The version replacement was still producing some false changes.
      2. Uploads to the staging repo specifically.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #3608 from pwendell/release-script and squashes the following commits:
      
      3c63294 [Patrick Wendell] Fixing two issues with the release script:
      8dae26f8
    • WangTaoTheTonic's avatar
      [SPARK-4253] Ignore spark.driver.host in yarn-cluster and standalone-cluster modes · 8106b1e3
      WangTaoTheTonic authored
      In yarn-cluster and standalone-cluster modes, we don't know where driver will run until it is launched.  If the `spark.driver.host` property is set on the submitting machine and propagated to the driver through SparkConf then this will lead to errors when the driver launches.
      
      This patch fixes this issue by dropping the `spark.driver.host` property in SparkSubmit when running in a cluster deploy mode.
      
      Author: WangTaoTheTonic <barneystinson@aliyun.com>
      Author: WangTao <barneystinson@aliyun.com>
      
      Closes #3112 from WangTaoTheTonic/SPARK4253 and squashes the following commits:
      
      ed1a25c [WangTaoTheTonic] revert unrelated formatting issue
      02c4e49 [WangTao] add comment
      32a3f3f [WangTaoTheTonic] ingore it in SparkSubmit instead of SparkContext
      667cf24 [WangTaoTheTonic] document fix
      ff8d5f7 [WangTaoTheTonic] also ignore it in standalone cluster mode
      2286e6b [WangTao] ignore spark.driver.host in yarn-cluster mode
      8106b1e3
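      A minimal sketch of the SparkSubmit-side behaviour described above; the helper name and the way the deploy mode is passed in are illustrative.

      ```
      import org.apache.spark.SparkConf

      def dropDriverHostIfClusterMode(conf: SparkConf, deployMode: String): SparkConf = {
        // In yarn-cluster / standalone-cluster mode the driver's host is not known at
        // submission time, so a value inherited from the submitting machine would be wrong.
        if (deployMode == "cluster") {
          conf.remove("spark.driver.host")
        }
        conf
      }
      ```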
    • Cheng Lian's avatar
      [SPARK-4683][SQL] Add a beeline.cmd to run on Windows · 28c7acac
      Cheng Lian authored
      Tested locally with a Win7 VM. Connected to a Spark SQL Thrift server instance running on Mac OS X with the following command line:
      
      ```
      bin\beeline.cmd -u jdbc:hive2://10.0.2.2:10000 -n lian
      ```
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3599 from liancheng/beeline.cmd and squashes the following commits:
      
      79092e7 [Cheng Lian] Windows script for BeeLine
      28c7acac
    • Xiangrui Meng's avatar
      [FIX][DOC] Fix broken links in ml-guide.md · 7e758d70
      Xiangrui Meng authored
      and some minor changes in ScalaDoc.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3601 from mengxr/SPARK-4575-fix and squashes the following commits:
      
      c559768 [Xiangrui Meng] minor code update
      ce94da8 [Xiangrui Meng] Java Bean -> JavaBean
      0b5c182 [Xiangrui Meng] fix links in ml-guide
      7e758d70
    • Joseph K. Bradley's avatar
      [SPARK-4575] [mllib] [docs] spark.ml pipelines doc + bug fixes · 469a6e5f
      Joseph K. Bradley authored
      Documentation:
      * Added ml-guide.md, linked from mllib-guide.md
      * Updated mllib-guide.md with small section pointing to ml-guide.md
      
      Examples:
      * CrossValidatorExample
      * SimpleParamsExample
      * (I copied these + the SimpleTextClassificationPipeline example into the ml-guide.md)
      
      Bug fixes:
      * PipelineModel: did not use ParamMaps correctly
      * UnaryTransformer: issues with TypeTag serialization (Thanks to mengxr for that fix!)
      
      CC: mengxr shivaram  etrain  Documentation for Pipelines: I know the docs are not complete, but the goal is to have enough to let interested people get started using spark.ml and to add more docs once the package is more established/complete.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      Author: jkbradley <joseph.kurata.bradley@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3588 from jkbradley/ml-package-docs and squashes the following commits:
      
      d393b5c [Joseph K. Bradley] fixed bug in Pipeline (typo from last commit).  updated examples for CV and Params for spark.ml
      c38469c [Joseph K. Bradley] Updated ml-guide with CV examples
      99f88c2 [Joseph K. Bradley] Fixed bug in PipelineModel.transform* with usage of params.  Updated CrossValidatorExample to use more training examples so it is less likely to get a 0-size fold.
      ea34dc6 [jkbradley] Merge pull request #4 from mengxr/ml-package-docs
      3b83ec0 [Xiangrui Meng] replace TypeTag with explicit datatype
      41ad9b1 [Joseph K. Bradley] Added examples for spark.ml: SimpleParamsExample + Java version, CrossValidatorExample + Java version.  CrossValidatorExample not working yet.  Added programming guide for spark.ml, but need to add CrossValidatorExample to it once CrossValidatorExample works.
      469a6e5f
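      For context, a minimal sketch of the kind of pipeline these docs and examples cover, along the lines of the SimpleTextClassificationPipeline example; column names and parameter values are illustrative, and fitting it requires a labeled-text dataset not shown here.

      ```
      import org.apache.spark.ml.Pipeline
      import org.apache.spark.ml.classification.LogisticRegression
      import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

      // Three-stage text classification pipeline: tokenize -> hash to features -> logistic regression.
      val tokenizer = new Tokenizer()
        .setInputCol("text")
        .setOutputCol("words")
      val hashingTF = new HashingTF()
        .setInputCol(tokenizer.getOutputCol)
        .setOutputCol("features")
      val lr = new LogisticRegression()
        .setMaxIter(10)
      val pipeline = new Pipeline()
        .setStages(Array(tokenizer, hashingTF, lr))

      // pipeline.fit(trainingData) returns a PipelineModel whose transform()
      // appends prediction columns to a test dataset.
      ```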
    • Joseph K. Bradley's avatar
      [docs] Fix outdated comment in tuning guide · 529439bd
      Joseph K. Bradley authored
      When you use the SPARK_JAVA_OPTS env variable, Spark complains:
      
      ```
      SPARK_JAVA_OPTS was detected (set to ' -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps ').
      This is deprecated in Spark 1.0+.
      
      Please instead use:
       - ./spark-submit with conf/spark-defaults.conf to set defaults for an application
       - ./spark-submit with --driver-java-options to set -X options for a driver
       - spark.executor.extraJavaOptions to set -X options for executors
       - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker)
      ```
      
      This updates the docs to redirect the user to the relevant part of the configuration docs.
      
      CC: mengxr  but please CC someone else as needed
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #3592 from jkbradley/tuning-doc and squashes the following commits:
      
      0760ce1 [Joseph K. Bradley] fixed outdated comment in tuning guide
      529439bd
    • Aaron Davidson's avatar
      [SQL] Minor: Avoid calling Seq#size in a loop · c6c7165e
      Aaron Davidson authored
      Just found this instance while doing some jstack-based profiling of a Spark SQL job. It is very unlikely that this is causing much of a perf issue anywhere, but it is unnecessarily suboptimal.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3593 from aarondav/seq-opt and squashes the following commits:
      
      962cdfc [Aaron Davidson] [SQL] Minor: Avoid calling Seq#size in a loop
      c6c7165e
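      The general shape of the fix, hoisting the `size` call out of the loop instead of re-evaluating it on every iteration; the data and method names are illustrative, not the Spark SQL code in question.

      ```
      def sumFirstColumns(rows: IndexedSeq[Array[Double]]): Double = {
        var total = 0.0
        var i = 0
        val n = rows.size          // hoisted: evaluated once instead of on every loop test
        while (i < n) {
          total += rows(i)(0)
          i += 1
        }
        total
      }
      ```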
    • lewuathe's avatar
      [SPARK-4685] Include all spark.ml and spark.mllib packages in JavaDoc's MLlib group · 20bfea4a
      lewuathe authored
      This is #3554 from Lewuathe except that I put both `spark.ml` and `spark.mllib` in the `MLlib` group.
      
      Closes #3554
      
      jkbradley
      
      Author: lewuathe <lewuathe@me.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3598 from mengxr/Lewuathe-modify-javadoc-setting and squashes the following commits:
      
      184609a [Xiangrui Meng] merge spark.ml and spark.mllib into the same group in javadoc
      f7535e6 [lewuathe] [SPARK-4685] Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections
      20bfea4a
    • Reynold Xin's avatar
      [SPARK-4719][API] Consolidate various narrow dep RDD classes with MapPartitionsRDD · c3ad4860
      Reynold Xin authored
      MappedRDD, MappedValuesRDD, FlatMappedValuesRDD, FilteredRDD, GlommedRDD, FlatMappedRDD are not necessary. They can be implemented trivially using MapPartitionsRDD.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #3578 from rxin/SPARK-4719 and squashes the following commits:
      
      eed9853 [Reynold Xin] Preserve partitioning for filter.
      eb1a89b [Reynold Xin] [SPARK-4719][API] Consolidate various narrow dep RDD classes with MapPartitionsRDD.
      c3ad4860
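      The consolidation rests on the fact that each of those narrow-dependency RDD classes is just a per-partition iterator transform. A user-level illustration (not the internal MapPartitionsRDD code) of map, filter, and glom written via mapPartitions; note the preservesPartitioning flag for filter, matching the "Preserve partitioning for filter" commit above.

      ```
      import scala.reflect.ClassTag

      import org.apache.spark.rdd.RDD

      // Each removed RDD subclass corresponds to a simple iterator transform.
      def mapVia[T, U: ClassTag](rdd: RDD[T])(f: T => U): RDD[U] =
        rdd.mapPartitions(iter => iter.map(f))

      def filterVia[T: ClassTag](rdd: RDD[T])(p: T => Boolean): RDD[T] =
        rdd.mapPartitions(iter => iter.filter(p), preservesPartitioning = true)

      def glomVia[T: ClassTag](rdd: RDD[T]): RDD[Array[T]] =
        rdd.mapPartitions(iter => Iterator(iter.toArray))
      ```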
    • Jacky Li's avatar
      [SQL] remove unnecessary import · ed88db4c
      Jacky Li authored
      Author: Jacky Li <jacky.likun@huawei.com>
      
      Closes #3585 from jackylk/remove and squashes the following commits:
      
      045423d [Jacky Li] remove unnecessary import
      ed88db4c
    • Patrick Wendell's avatar
      MAINTENANCE: Automated closing of pull requests. · 3cdae038
      Patrick Wendell authored
      This commit exists to close the following pull requests on Github:
      
      Closes #1875 (close requested by 'marmbrus')
      Closes #3566 (close requested by 'andrewor14')
      Closes #3487 (close requested by 'pwendell')
      3cdae038
  7. Dec 03, 2014
    • Andrew Or's avatar
      [Release] Correctly translate contributors name in release notes · a4dfb4ef
      Andrew Or authored
      This commit involves three main changes:
      
      (1) It separates the translation of contributor names from the
      generation of the contributors list. This is largely motivated
      by the Github API limit; even if we exceed this limit, we should
      at least be able to proceed manually as before. This is why the
      translation logic is abstracted into its own script
      translate-contributors.py.
      
      (2) When we look for candidate replacements for invalid author
      names, we should look for the assignees of the associated JIRAs
      too. As a result, the intermediate file must keep track of these.
      
      (3) This provides an interactive mode with which the user can
      sit at the terminal and manually pick the candidate replacement
      that he/she thinks makes the most sense. As before, there is a
      non-interactive mode that picks the first candidate that the
      script considers "valid."
      
      TODO: We should have a known_contributors file that stores
      known mappings so we don't have to go through all of this
      translation every time. This is also valuable because some
      contributors simply cannot be automatically translated.
      a4dfb4ef
    • Joseph K. Bradley's avatar
      [SPARK-4580] [SPARK-4610] [mllib] [docs] Documentation for tree ensembles + DecisionTree API fix · 657a8883
      Joseph K. Bradley authored
      Major changes:
      * Added programming guide sections for tree ensembles
      * Added examples for tree ensembles
      * Updated DecisionTree programming guide with more info on parameters
      * **API change**: Standardized the tree parameter for the number of classes (for classification)
      
      Minor changes:
      * Updated decision tree documentation
      * Updated existing tree and tree ensemble examples
       * Use train/test split, and compute test error instead of training error.
       * Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix)
      
      Note: I know this is a lot of lines, but most is covered by:
      * Programming guide sections for gradient boosting and random forests.  (The changes are probably best viewed by generating the docs locally.)
      * New examples (which were copied from the programming guide)
      * The "numClasses" renaming
      
      I have run all examples and relevant unit tests.
      
      CC: mengxr manishamde codedeft
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #3461 from jkbradley/ensemble-docs and squashes the following commits:
      
      70a75f3 [Joseph K. Bradley] updated forest vs boosting comparison
      d1de753 [Joseph K. Bradley] Added note about toString and toDebugString for DecisionTree to migration guide
      8e87f8f [Joseph K. Bradley] Combined GBT and RandomForest guides into one ensembles guide
      6fab846 [Joseph K. Bradley] small fixes based on review
      b9f8576 [Joseph K. Bradley] updated decision tree doc
      375204c [Joseph K. Bradley] fixed python style
      2b60b6e [Joseph K. Bradley] merged Java RandomForest examples into 1 file.  added header.  Fixed small bug in same example in the programming guide.
      706d332 [Joseph K. Bradley] updated python DT runner to print full model if it is small
      c76c823 [Joseph K. Bradley] added migration guide for mllib
      abe5ed7 [Joseph K. Bradley] added examples for random forest in Java and Python to examples folder
      07fc11d [Joseph K. Bradley] Renamed numClassesForClassification to numClasses everywhere in trees and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification).
      cdfdfbc [Joseph K. Bradley] added examples for GBT
      6372a2b [Joseph K. Bradley] updated decision tree examples to use random split.  tested all of them.
      ad3e695 [Joseph K. Bradley] added gbt and random forest to programming guide.  still need to update their examples
      657a8883
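      Since the guide changes center on the ensemble APIs and the numClasses rename, here is a minimal spark.mllib random forest call using the renamed parameter; the LIBSVM file path is a placeholder and the evaluation is a bare-bones sketch, not the guide's full example.

      ```
      import org.apache.spark.SparkContext
      import org.apache.spark.mllib.tree.RandomForest
      import org.apache.spark.mllib.util.MLUtils

      def trainForest(sc: SparkContext): Unit = {
        val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt") // placeholder path
        val Array(training, test) = data.randomSplit(Array(0.7, 0.3))

        val model = RandomForest.trainClassifier(
          training,
          numClasses = 2,                      // renamed from numClassesForClassification
          categoricalFeaturesInfo = Map[Int, Int](),
          numTrees = 10,
          featureSubsetStrategy = "auto",
          impurity = "gini",
          maxDepth = 4,
          maxBins = 32)

        // Compute test error rather than training error, as the updated examples do.
        val testErr = test.map { p =>
          if (model.predict(p.features) == p.label) 0.0 else 1.0
        }.mean()
        println(s"Test error = $testErr")
      }
      ```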
    • Joseph K. Bradley's avatar
      [SPARK-4711] [mllib] [docs] Programming guide advice on choosing optimizer · 27ab0b8a
      Joseph K. Bradley authored
      I have heard requests for the docs to include advice about choosing an optimization method. The programming guide could include a brief statement about this (so the user does not have to read the whole optimization section).
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #3569 from jkbradley/lr-doc and squashes the following commits:
      
      654aeb5 [Joseph K. Bradley] updated section header for mllib-optimization
      5035ad0 [Joseph K. Bradley] updated based on review
      94f6dec [Joseph K. Bradley] Updated linear methods and optimization docs with quick advice on choosing an optimization method
      27ab0b8a