  1. Apr 21, 2015
    • [SPARK-7011] Build(compilation) fails with scala 2.11 option, because a... · 04bf34e3
      Prashant Sharma authored
      [SPARK-7011] Build(compilation) fails with scala 2.11 option, because a protected[sql] type is accessed in ml package.
      
      [This](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L58) is where it is used and where compilation fails.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #5593 from ScrapCodes/SPARK-7011/build-fix and squashes the following commits:
      
      e6d57a3 [Prashant Sharma] [SPARK-7011] Build fails with scala 2.11 option, because a protected[sql] type is accessed in ml package.
      04bf34e3
    • [SPARK-6845] [MLlib] [PySpark] Add isTranposed flag to DenseMatrix · 45c47fa4
      MechCoder authored
      Since sparse matrices now support an isTransposed flag for row-major data, DenseMatrices should do the same.
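
As a rough illustration of what such a flag changes, the sketch below maps an entry (i, j) to its position in the flat values array under both layouts. The function and variable names are hypothetical, and this is plain Python rather than the MLlib or PySpark code.

```python
def dense_index(i, j, num_rows, num_cols, is_transposed):
    """Return the position of entry (i, j) in the flat values array."""
    if is_transposed:
        # Row-major layout: the entries of one row are contiguous.
        return i * num_cols + j
    # Column-major layout: the entries of one column are contiguous.
    return i + j * num_rows


# The 2x3 matrix [[1, 2, 3], [4, 5, 6]] stored in both layouts.
col_major = [1, 4, 2, 5, 3, 6]
row_major = [1, 2, 3, 4, 5, 6]


def entry(values, i, j, is_transposed):
    """Read entry (i, j) of the 2x3 example matrix from a flat array."""
    return values[dense_index(i, j, 2, 3, is_transposed)]
```

With the flag set, the same logical matrix can be read from row-major data without copying it into column-major order first.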
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #5455 from MechCoder/spark-6845 and squashes the following commits:
      
      525c370 [MechCoder] minor
      004a37f [MechCoder] Cast boolean to int
      151f3b6 [MechCoder] [WIP] Add isTransposed to pickle DenseMatrix
      cc0b90a [MechCoder] [SPARK-6845] Add isTranposed flag to DenseMatrix
      45c47fa4
    • SPARK-3276 Added a new configuration spark.streaming.minRememberDuration · c25ca7c5
      emres authored
      SPARK-3276 Added a new configuration parameter ``spark.streaming.minRememberDuration``, with a default value of 1 minute.
      
      This way, when a Spark Streaming application is started, an arbitrary number of minutes can be used as the threshold for remembering.
      
      Author: emres <emre.sevinc@gmail.com>
      
      Closes #5438 from emres/SPARK-3276 and squashes the following commits:
      
      766f938 [emres] SPARK-3276 Switched to using newly added getTimeAsSeconds method.
      affee1d [emres] SPARK-3276 Changed the property name and variable name for minRememberDuration
      c9d58ca [emres] SPARK-3276 Minor code re-formatting.
      1c53ba9 [emres] SPARK-3276 Started to use ssc.conf rather than ssc.sparkContext.getConf,  and also getLong method directly.
      bfe0acb [emres] SPARK-3276 Moved the minRememberDurationMin to the class
      daccc82 [emres] SPARK-3276 Changed the property name to reflect the unit of value and reduced number of fields.
      43cc1ce [emres] SPARK-3276 Added a new configuration parameter spark.streaming.minRemember duration, with a default value of 1 minute.
      c25ca7c5
    • [SPARK-5360] [SPARK-6606] Eliminate duplicate objects in serialized CoGroupedRDD · c035c0f2
      Kay Ousterhout authored
      CoGroupPartition, part of CoGroupedRDD, includes references to each RDD that the CoGroupedRDD narrowly depends on, and a reference to the ShuffleHandle. The partition is serialized separately from the RDD, so when the RDD and partition arrive on the worker, the references in the partition and in the RDD no longer point to the same object.
      
      This is a relatively minor performance issue (the closure can be 2x larger than it needs to be because the rdds and partitions are serialized twice; see numbers below) but is more annoying as a developer issue (this is where I ran into it): if any state is stored in the RDD or ShuffleHandle on the worker side, subtle bugs can appear because the references to the RDD / ShuffleHandle in the RDD and in the partition point to separate objects. I'm not sure if this is enough of a potential future problem to justify changing this old and central part of the code, so I'm hoping to get input from others here.
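
The aliasing problem described above can be reproduced outside Spark. The sketch below is a minimal stand-in, assuming hypothetical `Handle`, `Rdd`, and `Partition` classes and using `copy.deepcopy` in place of a real serialize/deserialize round trip.

```python
import copy


class Handle:
    """Stand-in for a shared object such as a shuffle handle."""

    def __init__(self):
        self.state = 0


class Rdd:
    def __init__(self, handle):
        self.handle = handle


class Partition:
    def __init__(self, handle):
        self.handle = handle


handle = Handle()
rdd, partition = Rdd(handle), Partition(handle)

# Copied separately: each copy gets its own private Handle.
rdd_a = copy.deepcopy(rdd)
part_a = copy.deepcopy(partition)
separate_shared = rdd_a.handle is part_a.handle

# Copied together: the shared reference survives the round trip.
rdd_b, part_b = copy.deepcopy((rdd, partition))
together_shared = rdd_b.handle is part_b.handle

# State written through one copy is invisible through the other.
rdd_a.handle.state = 42
stale_state = part_a.handle.state
```

In the separate case the two objects silently diverge, which is the kind of subtle worker-side bug the description warns about.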
      
      I did some simple experiments to see how much this affects closure size. For this example:
      $ val a = sc.parallelize(1 to 10).map((_, 1))
      $ val b = sc.parallelize(1 to 2).map(x => (x, 2*x))
      $ a.cogroup(b).collect()
      the closure was 1902 bytes with current Spark, and 1129 bytes after my change. The difference comes from eliminating duplicate serialization of the shuffle handle.
      
      For this example:
      $ val sortedA = a.sortByKey()
      $ val sortedB = b.sortByKey()
      $ sortedA.cogroup(sortedB).collect()
      the closure was 3491 bytes with current Spark, and 1333 bytes after my change. Here, the difference comes from eliminating duplicate serialization of the two RDDs for the narrow dependencies.
      
      The ShuffleHandle includes the ShuffleDependency, so this difference will get larger if a ShuffleDependency includes a serializer, a key ordering, or an aggregator (all set to None by default). It would also get bigger for a big RDD -- although I can't think of any examples where the RDD object gets large.  The difference is not affected by the size of the function the user specifies, which (based on my understanding) is typically the source of large task closures.
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #4145 from kayousterhout/SPARK-5360 and squashes the following commits:
      
      85156c3 [Kay Ousterhout] Better comment the narrowDeps parameter
      cff0209 [Kay Ousterhout] Fixed spelling issue
      658e1af [Kay Ousterhout] [SPARK-5360] Eliminate duplicate objects in serialized CoGroupedRDD
      c035c0f2
    • [SPARK-6985][streaming] Receiver maxRate over 1000 causes a StackOverflowError · 5fea3e5c
      David McGuire authored
      A simple truncation in integer division (on rates over 1000 messages / second) causes the existing implementation to sleep for 0 milliseconds, then call itself recursively; this causes what is essentially an infinite recursion, since the base case of the calculated amount of time having elapsed can't be reached before available stack space is exhausted. A fix to this truncation error is included in this patch.
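
The truncation can be seen with a couple of lines of arithmetic. This is a simplified sketch with hypothetical function names, not the original RateLimiter code: it only shows how a wait time computed in whole milliseconds collapses to zero once the rate exceeds 1000 per second, and how a finer unit avoids that.

```python
def target_millis(messages, rate_per_sec):
    """Time budget for `messages` at the given rate, in whole milliseconds."""
    return messages * 1000 // rate_per_sec


def target_nanos(messages, rate_per_sec):
    """Same budget in nanoseconds, which keeps sub-millisecond precision."""
    return messages * 1_000_000_000 // rate_per_sec


slow = target_millis(1, 500)        # 2 ms per message
fast = target_millis(1, 1500)       # truncates to 0 ms
fast_nanos = target_nanos(1, 1500)  # still a positive wait
```

A zero wait followed by an immediate retry makes no progress toward the time-based exit condition, which matches the recursion described above.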
      
      However, even with the defect patched, the accuracy of the existing implementation is abysmal (the error bounds of the original test were effectively [-30%, +10%], although this fact was obscured by hard-coded error margins); as such, when the error bounds were tightened down to [-5%, +5%], the existing implementation failed to meet the new, tightened, requirements. Therefore, an industry-vetted solution (from Guava) was used to get the adapted tests to pass.
      
      Author: David McGuire <david.mcguire2@nike.com>
      
      Closes #5559 from dmcguire81/master and squashes the following commits:
      
      d29d2e0 [David McGuire] Back out to +/-5% error margins, for flexibility in timing
      8be6934 [David McGuire] Fix spacing per code review
      90e98b9 [David McGuire] Address scalastyle errors
      29011bd [David McGuire] Further ratchet down the error margins
      b33b796 [David McGuire] Eliminate dependency on even distribution by BlockGenerator
      8f2934b [David McGuire] Remove arbitrary thread timing / cooperation code
      70ee310 [David McGuire] Use Thread.yield(), since Thread.sleep(0) is system-dependent
      82ee46d [David McGuire] Replace guard clause with nested conditional
      2794717 [David McGuire] Replace the RateLimiter with the Guava implementation
      38f3ca8 [David McGuire] Ratchet down the error rate to +/- 5%; tests fail
      24b1bc0 [David McGuire] Fix truncation in integer division causing infinite recursion
      d6e1079 [David McGuire] Stack overflow error in RateLimiter on rates over 1000/s
      5fea3e5c
    • [SPARK-5990] [MLLIB] Model import/export for IsotonicRegression · 1f2f723b
      Yanbo Liang authored
      Model import/export for IsotonicRegression
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #5270 from yanboliang/spark-5990 and squashes the following commits:
      
      872028d [Yanbo Liang] fix code style
      f80ec1b [Yanbo Liang] address comments
      49600cc [Yanbo Liang] address comments
      429ff7d [Yanbo Liang] store each interval as a record
      2b2f5a1 [Yanbo Liang] Model import/export for IsotonicRegression
      1f2f723b
    • [SPARK-6949] [SQL] [PySpark] Support Date/Timestamp in Column expression · ab9128fb
      Davies Liu authored
      This PR enables auto_convert in JavaGateway, so that we can register a converter for given types, for example, date and datetime.
      
      There are two bugs related to auto_convert (see [1] and [2]); we work around them in this PR.
      
      [1]  https://github.com/bartdag/py4j/issues/160
      [2] https://github.com/bartdag/py4j/issues/161
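
The idea of registering converters per type can be sketched without Py4J. Everything below is hypothetical (the class names, the registry list, and the tuple output format); it only illustrates type-based dispatch and one subtlety: `datetime.datetime` is a subclass of `datetime.date`, so the date converter has to exclude it explicitly.

```python
import datetime


class DateConverter:
    def can_convert(self, obj):
        # datetime is a subclass of date, so rule it out here.
        return isinstance(obj, datetime.date) and not isinstance(
            obj, datetime.datetime
        )

    def convert(self, obj):
        return ("date", obj.isoformat())


class DatetimeConverter:
    def can_convert(self, obj):
        return isinstance(obj, datetime.datetime)

    def convert(self, obj):
        return ("timestamp", obj.isoformat(sep=" "))


# Hypothetical registry; converters are tried in order.
CONVERTERS = [DateConverter(), DatetimeConverter()]


def auto_convert(obj):
    """Convert obj with the first matching converter, else pass it through."""
    for converter in CONVERTERS:
        if converter.can_convert(obj):
            return converter.convert(obj)
    return obj
```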
      
      cc rxin JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5570 from davies/py4j_date and squashes the following commits:
      
      eb4fa53 [Davies Liu] fix tests in python 3
      d17d634 [Davies Liu] rollback changes in mllib
      2e7566d [Davies Liu] convert tuple into ArrayList
      ceb3779 [Davies Liu] Update rdd.py
      3c373f3 [Davies Liu] support date and datetime by auto_convert
      cb094ff [Davies Liu] enable auto convert
      ab9128fb
    • [SPARK-6490][Core] Add spark.rpc.* and deprecate spark.akka.* · 8136810d
      zsxwing authored
      Deprecated `spark.akka.num.retries`, `spark.akka.retry.wait`, `spark.akka.askTimeout`,  `spark.akka.lookupTimeout`, and added `spark.rpc.num.retries`, `spark.rpc.retry.wait`, `spark.rpc.askTimeout`, `spark.rpc.lookupTimeout`.
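
One common way to handle such a rename is to read the new key first and fall back to the deprecated one. The sketch below assumes a plain dictionary as the configuration and a hypothetical `lookup` helper; the key names are the ones listed above, but the fallback logic is an illustration, not SparkConf's implementation.

```python
# New key -> deprecated key it replaces.
RENAMED_KEYS = {
    "spark.rpc.num.retries": "spark.akka.num.retries",
    "spark.rpc.retry.wait": "spark.akka.retry.wait",
    "spark.rpc.askTimeout": "spark.akka.askTimeout",
    "spark.rpc.lookupTimeout": "spark.akka.lookupTimeout",
}


def lookup(conf, key, default=None):
    """Return conf[key], falling back to the deprecated key if only that is set."""
    if key in conf:
        return conf[key]
    old_key = RENAMED_KEYS.get(key)
    if old_key is not None and old_key in conf:
        return conf[old_key]
    return default
```

With this shape, existing deployments that still set the `spark.akka.*` names keep working, while a value set under the new name always wins.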
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5595 from zsxwing/SPARK-6490 and squashes the following commits:
      
      e0d80a9 [zsxwing] Use getTimeAsMs and getTimeAsSeconds and other minor fixes
      31dbe69 [zsxwing] Add spark.rpc.* and deprecate spark.akka.*
      8136810d
  2. Apr 20, 2015
    • [SPARK-6635][SQL] DataFrame.withColumn should replace columns with identical column names · c736220d
      Liang-Chi Hsieh authored
      JIRA https://issues.apache.org/jira/browse/SPARK-6635
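
The replace-or-append behaviour named in the title can be sketched on a plain list of (name, values) pairs. The `with_column` helper below is hypothetical and is not the DataFrame implementation; it only shows the intended semantics.

```python
def with_column(columns, name, values):
    """Return a new column list where `name` is replaced in place or appended."""
    replaced = False
    result = []
    for col_name, col_values in columns:
        if col_name == name:
            # Same name: keep the position, swap the data.
            result.append((name, values))
            replaced = True
        else:
            result.append((col_name, col_values))
    if not replaced:
        result.append((name, values))
    return result
```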
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5541 from viirya/replace_with_column and squashes the following commits:
      
      b539c7b [Liang-Chi Hsieh] For comment.
      72f35b1 [Liang-Chi Hsieh] DataFrame.withColumn can replace original column with identical column name.
      c736220d
    • [SPARK-6368][SQL] Build a specialized serializer for Exchange operator. · ce7ddabb
      Yin Huai authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-6368
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #5497 from yhuai/serializer2 and squashes the following commits:
      
      da562c5 [Yin Huai] Merge remote-tracking branch 'upstream/master' into serializer2
      50e0c3d [Yin Huai] When no filed is emitted to shuffle, use SparkSqlSerializer for now.
      9f1ed92 [Yin Huai] Merge remote-tracking branch 'upstream/master' into serializer2
      6d07678 [Yin Huai] Address comments.
      4273b8c [Yin Huai] Enabled SparkSqlSerializer2.
      09e587a [Yin Huai] Remove TODO.
      791b96a [Yin Huai] Use UTF8String.
      60a1487 [Yin Huai] Merge remote-tracking branch 'upstream/master' into serializer2
      3e09655 [Yin Huai] Use getAs for Date column.
      43b9fb4 [Yin Huai] Test.
      8297732 [Yin Huai] Fix test.
      c9373c8 [Yin Huai] Support DecimalType.
      2379eeb [Yin Huai] ASF header.
      39704ab [Yin Huai] Specialized serializer for Exchange.
      ce7ddabb
    • [doc][streaming] Fixed broken link in mllib section · 517bdf36
      BenFradet authored
      The commit message is pretty self-explanatory.
      
      Author: BenFradet <benjamin.fradet@gmail.com>
      
      Closes #5600 from BenFradet/master and squashes the following commits:
      
      108492d [BenFradet] [doc][streaming] Fixed broken link in mllib section
      517bdf36
    • fixed doc · 97fda73d
      Eric Chiang authored
      The contribution is my original work. I license the work to the project under the project's open source license.
      
      Small typo in the programming guide.
      
      Author: Eric Chiang <eric.chiang.m@gmail.com>
      
      Closes #5599 from ericchiang/docs-typo and squashes the following commits:
      
      1177942 [Eric Chiang] fixed doc
      97fda73d
    • [Minor][MLlib] Incorrect path to test data is used in DecisionTreeExample · 1ebceaa5
      Liang-Chi Hsieh authored
      It should load from `testInput` instead of `input` for test data.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5594 from viirya/use_testinput and squashes the following commits:
      
      5e8b174 [Liang-Chi Hsieh] Fix style.
      b60b475 [Liang-Chi Hsieh] Use testInput.
      1ebceaa5
    • [SPARK-6661] Python type errors should print type, not object · 77176619
      Elisey Zanko authored
      Author: Elisey Zanko <elisey.zanko@gmail.com>
      
      Closes #5361 from 31z4/spark-6661 and squashes the following commits:
      
      73c5d79 [Elisey Zanko] Python type errors should print type, not object
      77176619
    • [SPARK-7003] Improve reliability of connection failure detection between Netty... · 968ad972
      Aaron Davidson authored
      [SPARK-7003] Improve reliability of connection failure detection between Netty block transfer service endpoints
      
      Currently we rely on the assumption that an exception will be raised and the channel closed if two endpoints cannot communicate over a Netty TCP channel. However, this guarantee does not hold in all network environments, and [SPARK-6962](https://issues.apache.org/jira/browse/SPARK-6962) seems to point to a case where only the server side of the connection detected a fault.
      
      This patch improves robustness of fetch/rpc requests by having an explicit timeout in the transport layer which closes the connection if there is a period of inactivity while there are outstanding requests.
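
The rule described above (close only when the channel has been idle too long while requests are still outstanding) can be sketched as a small state tracker. The class and method names are hypothetical, and timestamps are passed in explicitly so the logic can be exercised without real sockets or clocks.

```python
class IdleMonitor:
    """Tracks outstanding requests and the time of the last channel activity."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.outstanding = 0
        self.last_activity = 0.0

    def on_request_sent(self, now):
        self.outstanding += 1
        self.last_activity = now

    def on_response_received(self, now):
        self.outstanding -= 1
        self.last_activity = now

    def should_close(self, now):
        # An idle channel with nothing pending is fine; only a stalled
        # channel with pending requests is treated as a failure.
        idle_for = now - self.last_activity
        return self.outstanding > 0 and idle_for > self.timeout_s
```

Requiring outstanding requests avoids closing connections that are merely quiet, which would otherwise force needless reconnects.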
      
      NB: This patch is actually only around 50 lines added if you exclude the testing-related code.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #5584 from aarondav/timeout and squashes the following commits:
      
      8699680 [Aaron Davidson] Address Reynold's comments
      37ce656 [Aaron Davidson] [SPARK-7003] Improve reliability of connection failure detection between Netty block transfer service endpoints
      968ad972
    • [SPARK-5924] Add the ability to specify withMean or withStd parameters with StandarScaler · 1be20707
      jrabary authored
      The current implementation calls the default constructor of mllib.feature.StandardScaler without the possibility of specifying the withMean or withStd options.
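
To make the two options concrete, here is a minimal single-column sketch in plain Python. The `standardize` function is hypothetical; its defaults (no centering, scaling on) and its use of the sample standard deviation are assumptions made for the illustration, not statements about the Spark API.

```python
import statistics


def standardize(values, with_mean=False, with_std=True):
    """Optionally center by the mean and/or scale by the sample std deviation."""
    mean = statistics.mean(values) if with_mean else 0.0
    std = statistics.stdev(values) if with_std else 1.0
    if std == 0.0:
        # Constant column: leave the scale unchanged instead of dividing by zero.
        std = 1.0
    return [(v - mean) / std for v in values]


data = [2.0, 4.0, 6.0]
scaled_only = standardize(data)
centered_only = standardize(data, with_mean=True, with_std=False)
both = standardize(data, with_mean=True, with_std=True)
```

Centering turns sparse inputs dense, which is one reason a caller may want to control the two options independently.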
      
      Author: jrabary <Jaonary@gmail.com>
      
      Closes #4704 from jrabary/master and squashes the following commits:
      
      fae8568 [jrabary] style fix
      8896b0e [jrabary] Comments fix
      ef96d73 [jrabary] style fix
      8e52607 [jrabary] style fix
      edd9d48 [jrabary] Fix default param initialization
      17e1a76 [jrabary] Fix default param initialization
      298f405 [jrabary] Typo fix
      45ed914 [jrabary] Add withMean and withStd params to StandarScaler
      1be20707
  3. Apr 19, 2015
  4. Apr 18, 2015
    • SPARK-6993 : Add default min, max methods for JavaDoubleRDD · 8fbd45c7
      Olivier Girardot authored
      The default method will use Guava's Ordering instead of
      java.util.Comparator.naturalOrder() because it's not available
      in Java 7, only in Java 8.
      
      Author: Olivier Girardot <o.girardot@lateral-thoughts.com>
      
      Closes #5571 from ogirardot/master and squashes the following commits:
      
      7fe2e9e [Olivier Girardot] SPARK-6993 : Add default min, max methods for JavaDoubleRDD
      8fbd45c7
    • Fixed doc · 729885ec
      Gaurav Nanda authored
      Just fixed a doc.
      
      Author: Gaurav Nanda <gaurav324@gmail.com>
      
      Closes #5576 from gaurav324/master and squashes the following commits:
      
      8a7323f [Gaurav Nanda] Fixed doc
      729885ec
    • [SPARK-6219] Reuse pep8.py · 28683b4d
      Nicholas Chammas authored
      Per the discussion in the comments on [this commit](https://github.com/apache/spark/commit/f17d43b033d928dbc46aef8e367aa08902e698ad#commitcomment-10780649), this PR allows the Python lint script to reuse `pep8.py` when possible.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #5561 from nchammas/save-dem-pep8-bytes and squashes the following commits:
      
      b7c91e6 [Nicholas Chammas] reuse pep8.py
      28683b4d
    • [core] [minor] Make sure ConnectionManager stops. · 327ebf0c
      Marcelo Vanzin authored
      My previous fix (force a selector wakeup) didn't seem to work since
      I ran into the hang again. So change the code a bit to be more
      explicit about the condition when the selector thread should exit.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5566 from vanzin/conn-mgr-hang and squashes the following commits:
      
      ddb2c03 [Marcelo Vanzin] [core] [minor] Make sure ConnectionManager stops.
      327ebf0c
    • SPARK-6992 : Fix documentation example for Spark SQL on StructType · 5f095d56
      Olivier Girardot authored
      
      This patch is fixing the Java examples for Spark SQL when defining
      programmatically a Schema and mapping Rows.
      
      Author: Olivier Girardot <o.girardot@lateral-thoughts.com>
      
      Closes #5569 from ogirardot/branch-1.3 and squashes the following commits:
      
      c29e58d [Olivier Girardot] SPARK-6992 : Fix documentation example for Spark SQL on StructType
      
      (cherry picked from commit c9b1ba4b16a7afe93d45bf75b128cc0dd287ded0)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      5f095d56
  5. Apr 17, 2015
    • [SPARK-6975][Yarn] Fix argument validation error · d850b4bd
      jerryshao authored
      The `numExecutors` check fails when dynamic allocation is enabled with the default configuration. Details can be seen in [SPARK-6975](https://issues.apache.org/jira/browse/SPARK-6975). sryza, please help me review this; I'm not sure this is the correct way, and I think you previously changed this part :)
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #5551 from jerryshao/SPARK-6975 and squashes the following commits:
      
      4335da1 [jerryshao] Change according to the comments
      77bdcbd [jerryshao] Fix argument validation error
      d850b4bd
    • [SPARK-5933] [core] Move config deprecation warnings to SparkConf. · 19913373
      Marcelo Vanzin authored
      I didn't find many deprecated configs after a grep-based search,
      but the ones I could find were moved to the centralized location
      in SparkConf.
      
      While there, I deprecated a couple more HS configs that mentioned
      time units.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5562 from vanzin/SPARK-5933 and squashes the following commits:
      
      dcb617e7 [Marcelo Vanzin] [SPARK-5933] [core] Move config deprecation warnings to SparkConf.
      19913373
    • [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode · 6fbeb82e
      Jongyoul Lee authored
      - Defined executorCores from "spark.mesos.executor.cores"
      - Changed the amount of mesosExecutor's cores to executorCores.
      - Added new configuration option on running-on-mesos.md
      
      Author: Jongyoul Lee <jongyoul@gmail.com>
      
      Closes #5063 from jongyoul/SPARK-6350 and squashes the following commits:
      
      9238d6e [Jongyoul Lee] [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode - Fixed docs - Changed configuration name - Made mesosExecutorCores private
      2d41241 [Jongyoul Lee] [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode - Fixed docs
      89edb4f [Jongyoul Lee] [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode - Fixed docs
      8ba7694 [Jongyoul Lee] [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode - Fixed docs
      7549314 [Jongyoul Lee] [SPARK-6453][Mesos] Some Mesos*Suite have a different package with their classes - Fixed docs
      4ae7b0c [Jongyoul Lee] [SPARK-6453][Mesos] Some Mesos*Suite have a different package with their classes - Removed TODO
      c27efce [Jongyoul Lee] [SPARK-6453][Mesos] Some Mesos*Suite have a different package with their classes - Fixed Mesos*Suite for supporting integer WorkerOffers - Fixed Documentation
      1fe4c03 [Jongyoul Lee] [SPARK-6453][Mesos] Some Mesos*Suite have a different package with their classes - Change available resources of cpus to integer value beacuse WorkerOffer support the amount cpus as integer value
      5f3767e [Jongyoul Lee] Revert "[SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode"
      4b7c69e [Jongyoul Lee] [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode - Changed configruation name and description from "spark.mesos.executor.cores" to "spark.executor.frameworkCores"
      0556792 [Jongyoul Lee] [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode - Defined executorCores from "spark.mesos.executor.cores" - Changed the amount of mesosExecutor's cores to executorCores. - Added new configuration option on running-on-mesos.md
      6fbeb82e
    • [SPARK-6703][Core] Provide a way to discover existing SparkContext's · c5ed5101
      Ilya Ganelin authored
      I've added a static getOrCreate method to the static SparkContext object that allows one to either retrieve a previously created SparkContext or to instantiate a new one with the provided config. The method accepts an optional SparkConf to make usage intuitive.
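
The get-or-create pattern itself is small. The sketch below uses a hypothetical `Context` class with a class-level slot guarded by a lock; it is a plain Python analogue of the idea, not the SparkContext code.

```python
import threading


class Context:
    """Hypothetical context type with a single shared active instance."""

    _active = None
    _lock = threading.Lock()

    def __init__(self, conf=None):
        self.conf = dict(conf or {})

    @classmethod
    def get_or_create(cls, conf=None):
        # The lock keeps two threads from both creating an instance.
        with cls._lock:
            if cls._active is None:
                cls._active = cls(conf)
            return cls._active
```

Note that in this sketch a config passed on a later call is ignored once an instance exists, so callers cannot rely on it taking effect.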
      
      Still working on a test for this, basically want to create a new context from scratch, then ensure that subsequent calls don't overwrite that.
      
      Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
      
      Closes #5501 from ilganeli/SPARK-6703 and squashes the following commits:
      
      db9a963 [Ilya Ganelin] Closing second spark context
      1dc0444 [Ilya Ganelin] Added ref equality check
      8c884fa [Ilya Ganelin] Made getOrCreate synchronized
      cb0c6b7 [Ilya Ganelin] Doc updates and code cleanup
      270cfe3 [Ilya Ganelin] [SPARK-6703] Documentation fixes
      15e8dea [Ilya Ganelin] Updated comments and added MiMa Exclude
      0e1567c [Ilya Ganelin] Got rid of unecessary option for AtomicReference
      dfec4da [Ilya Ganelin] Changed activeContext to AtomicReference
      733ec9f [Ilya Ganelin] Fixed some bugs in test code
      8be2f83 [Ilya Ganelin] Replaced match with if
      e92caf7 [Ilya Ganelin] [SPARK-6703] Added test to ensure that getOrCreate both allows creation, retrieval, and a second context if desired
      a99032f [Ilya Ganelin] Spacing fix
      d7a06b8 [Ilya Ganelin] Updated SparkConf class to add getOrCreate method. Started test suite implementation
      c5ed5101
    • Minor fix to SPARK-6958: Improve Python docstring for DataFrame.sort. · a452c592
      Reynold Xin authored
      As a follow up PR to #5544.
      
      cc davies
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5558 from rxin/sort-doc-improvement and squashes the following commits:
      
      f4c276f [Reynold Xin] Review feedback.
      d2dcf24 [Reynold Xin] Minor fix to SPARK-6958: Improve Python docstring for DataFrame.sort.
      a452c592
    • SPARK-6988 : Fix documentation regarding DataFrames using the Java API · d305e686
      Olivier Girardot authored
      
      This patch includes :
       * adding how to use map after an sql query using javaRDD
       * fixing the first few java examples that were written in Scala
      
      Thank you for your time,
      
      Olivier.
      
      Author: Olivier Girardot <o.girardot@lateral-thoughts.com>
      
      Closes #5564 from ogirardot/branch-1.3 and squashes the following commits:
      
      9f8d60e [Olivier Girardot] SPARK-6988 : Fix documentation regarding DataFrames using the Java API
      
      (cherry picked from commit 6b528dc139da594ef2e651d84bd91fe0f738a39d)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      d305e686
    • [SPARK-6807] [SparkR] Merge recent SparkR-pkg changes · 59e206de
      cafreeman authored
      This PR pulls in recent changes in SparkR-pkg, including
      
      cartesian, intersection, sampleByKey, subtract, subtractByKey, except, and some API for StructType and StructField.
      
      Author: cafreeman <cfreeman@alteryx.com>
      Author: Davies Liu <davies@databricks.com>
      Author: Zongheng Yang <zongheng.y@gmail.com>
      Author: Shivaram Venkataraman <shivaram.venkataraman@gmail.com>
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #5436 from davies/R3 and squashes the following commits:
      
      c2b09be [Davies Liu] SQLTypes -> schema
      a5a02f2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R3
      168b7fe [Davies Liu] sort generics
      b1fe460 [Davies Liu] fix conflict in README.md
      e74c04e [Davies Liu] fix schema.R
      4f5ac09 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R5
      41f8184 [Davies Liu] rm man
      ae78312 [Davies Liu] Merge pull request #237 from sun-rui/SPARKR-154_3
      1bdcb63 [Zongheng Yang] Updates to README.md.
      5a553e7 [cafreeman] Use object attribute instead of argument
      71372d9 [cafreeman] Update docs and examples
      8526d2e71 [cafreeman] Remove `tojson` functions
      6ef5f2d [cafreeman] Fix spacing
      7741d66 [cafreeman] Rename the SQL DataType function
      141efd8 [Shivaram Venkataraman] Merge pull request #245 from hqzizania/upstream
      9387402 [Davies Liu] fix style
      40199eb [Shivaram Venkataraman] Move except into sorted position
      07d0dbc [Sun Rui] [SPARKR-244] Fix test failure after integration of subtract() and subtractByKey() for RDD.
      7e8caa3 [Shivaram Venkataraman] Merge pull request #246 from hlin09/fixCombineByKey
      ed66c81 [cafreeman] Update `subtract` to work with `generics.R`
      f3ba785 [cafreeman] Fixed duplicate export
      275deb4 [cafreeman] Update `NAMESPACE` and tests
      1a3b63d [cafreeman] new version of `CreateDF`
      836c4bf [cafreeman] Update `createDataFrame` and `toDF`
      be5d5c1 [cafreeman] refactor schema functions
      40338a4 [Zongheng Yang] Merge pull request #244 from sun-rui/SPARKR-154_5
      20b97a6 [Zongheng Yang] Merge pull request #234 from hqzizania/assist
      ba54e34 [Shivaram Venkataraman] Merge pull request #238 from sun-rui/SPARKR-154_4
      c9497a3 [Shivaram Venkataraman] Merge pull request #208 from lythesia/master
      b317aa7 [Zongheng Yang] Merge pull request #243 from hqzizania/master
      136a07e [Zongheng Yang] Merge pull request #242 from hqzizania/stats
      cd66603 [cafreeman] new line at EOF
      8b76e81 [Shivaram Venkataraman] Merge pull request #233 from redbaron/fail-early-on-missing-dep
      7dd81b7 [cafreeman] Documentation
      0e2a94f [cafreeman] Define functions for schema and fields
      59e206de
    • [SPARK-6113] [ml] Stabilize DecisionTree API · a83571ac
      Joseph K. Bradley authored
      This is a PR for cleaning up and finalizing the DecisionTree API.  PRs for ensembles will follow once this is merged.
      
      ### Goal
      
      Here is the description copied from the JIRA (for both trees and ensembles):
      
      > **Issue**: The APIs for DecisionTree and ensembles (RandomForests and GradientBoostedTrees) have been experimental for a long time. The API has become very convoluted because trees and ensembles have many, many variants, some of which we have added incrementally without a long-term design.
      > **Proposal**: This JIRA is for discussing changes required to finalize the APIs. After we discuss, I will make a PR to update the APIs and make them non-Experimental. This will require making many breaking changes; see the design doc for details.
      > **[Design doc](https://docs.google.com/document/d/1rJ_DZinyDG3PkYkAKSsQlY0QgCeefn4hUv7GsPkzBP4)** : This outlines current issues and the proposed API.
      
      Overall code layout:
      * The old API in mllib.tree.* will remain the same.
      * The new API will reside in ml.classification.* and ml.regression.*
      
      ### Summary of changes
      
      Old API
      * Exactly the same, except I made 1 method in Loss private (but that is not a breaking change since that method was introduced after the Spark 1.3 release).
      
      New APIs
      * Under Pipeline API
      * The new API preserves functionality, except:
        * New API does NOT store prob (probability of label in classification).  I want to have it store the full vector of probabilities but feel that should be in a later PR.
      * Use abstractions for parameters, estimators, and models to avoid code duplication
      * Limit parameters to relevant algorithms
      * For enum-like types, only expose Strings
        * We can make these pluggable later on by adding new parameters.  That is a far-future item.
      
      Test suites
      * I organized DecisionTreeSuite, but I made absolutely no changes to the tests themselves.
      * The test suites for the new API only test (a) similarity with the results of the old API and (b) elements of the new API.
        * After code is moved to this new API, we should move the tests from the old suites which test the internals.
      
      ### Details
      
      #### Changed names
      
      Parameters
      * useNodeIdCache -> cacheNodeIds
      
      #### Other changes
      
      * Split: Changed categories to set instead of list
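
A set fits a categorical split because the only question asked at prediction time is membership. The sketch below is a hypothetical Python analogue (class and method names are assumptions): it accepts any iterable of categories and stores them as a set internally.

```python
class CategoricalSplit:
    """Sends a row left when its feature value is one of the left categories."""

    def __init__(self, feature_index, left_categories):
        self.feature_index = feature_index
        # Stored as a set: order is irrelevant and duplicates carry no meaning.
        self.left_categories = frozenset(left_categories)

    def goes_left(self, features):
        return features[self.feature_index] in self.left_categories
```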
      
      #### Non-decision tree changes
      * AttributeGroup
        * Added parentheses to toMetadata, toStructField methods (These were removed in a previous PR, but I ran into 1 issue with the Scala compiler not being able to disambiguate between a toMetadata method with no parentheses and a toMetadata method which takes 1 argument.)
      * Attributes
        * Renamed: toMetadata -> toMetadataImpl
        * Added toMetadata methods which return ML metadata (keyed with “ML_ATTR”)
        * NominalAttribute: Added getNumValues method which examines both numValues and values.
      * Params.inheritValues: Checks whether the parent param really belongs to the child (to allow Estimator-Model pairs with different sets of parameters)
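
The getNumValues behaviour in the list above (look at numValues first, then at values) can be sketched as a small function. This is a hypothetical Python rendering in which `None` stands for an unset field; it is not the attribute class itself.

```python
def get_num_values(num_values=None, values=None):
    """Number of categories: the explicit count if set, else the length of values."""
    if num_values is not None:
        return num_values
    if values is not None:
        return len(values)
    # Neither field is set, so the count is unknown.
    return None
```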
      
      ### Questions for reviewers
      
      * Is "DecisionTreeClassificationModel" too long a name?
      * Is this OK in the docs?
      ```
      class DecisionTreeRegressor extends TreeRegressor[DecisionTreeRegressionModel] with DecisionTreeParams[DecisionTreeRegressor] with TreeRegressorParams[DecisionTreeRegressor]
      ```
      
      ### Future
      
      We should open up the abstractions at some point.  E.g., it would be useful to be able to set tree-related parameters in 1 place and then pass those to multiple tree-based algorithms.
      
      Follow-up JIRAs will be (in this order):
      * Tree ensembles
      * Deprecate old tree code
      * Move DecisionTree implementation code to new API.
      * Move tests from the old suites which test the internals.
      * Update programming guide
      * Python API
      * Change RandomForest* to always use bootstrapping, even when numTrees = 1
      * Provide the probability of the predicted label for classification.  After we move code to the new API and update it to maintain probabilities for all labels, then we can add the probabilities to the new API.
      
      CC: mengxr  manishamde  codedeft  chouqin  MechCoder
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5530 from jkbradley/dt-api-dt and squashes the following commits:
      
      6aae255 [Joseph K. Bradley] Changed tree abstractions not to take type parameters, and for setters to return this.type instead
      ec17947 [Joseph K. Bradley] Updates based on code review.  Main changes were: moving public types from ml.impl.tree to ml.tree, modifying CategoricalSplit to take an Array of categories but store a Set internally, making more types sealed or final
      5626c81 [Joseph K. Bradley] style fixes
      f8fbd24 [Joseph K. Bradley] imported reorg of DecisionTreeSuite from old PR.  small cleanups
      7ef63ed [Joseph K. Bradley] Added DecisionTreeRegressor, test suites, and example (for real this time)
      e11673f [Joseph K. Bradley] Added DecisionTreeRegressor, test suites, and example
      119f407 [Joseph K. Bradley] added DecisionTreeClassifier example
      0bdc486 [Joseph K. Bradley] fixed issues after param PR was merged
      f9fbb60 [Joseph K. Bradley] Done with DecisionTreeClassifier, but no save/load yet.  Need to add example as well
      2532c9a [Joseph K. Bradley] partial move to spark.ml API, not done yet
      c72c1a0 [Joseph K. Bradley] Copied changes for common items, plus DecisionTreeClassifier from original PR
      a83571ac
    • Marcelo Vanzin's avatar
      [SPARK-2669] [yarn] Distribute client configuration to AM. · 50ab8a65
      Marcelo Vanzin authored
      Currently, when Spark launches the Yarn AM, the process will use
      the local Hadoop configuration on the node where the AM launches,
      if one is present. A more correct approach is to use the same
      configuration used to launch the Spark job, since the user may
      have made modifications (such as adding app-specific configs).
      
      The approach taken here is to use the distributed cache to make
      all files in the Hadoop configuration directory available to the
      AM. This is a little overkill since only the AM needs them (the
      executors use the broadcast Hadoop configuration from the driver),
      but is the easier approach.
      
      Even though only a few files in that directory may end up being
      used, all of them are uploaded. This allows supporting use cases
      such as when auxiliary configuration files are used for SSL
      configuration, or when uploading a Hive configuration directory.
      Not all of these may be reflected in an o.a.h.conf.Configuration object,
      but may be needed when a driver in cluster mode instantiates, for
      example, a HiveConf object instead.
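
The "upload everything" idea can be sketched as packing every file in a configuration directory into one archive that is then shipped as a unit. This is an illustrative Python sketch, not the Spark YARN client code; the function name `archive_conf_dir` is hypothetical, and it assumes only the regular files directly inside the directory are wanted.

```python
import os
import zipfile
from typing import List


def archive_conf_dir(conf_dir: str, archive_path: str) -> List[str]:
    """Zip every regular file directly inside conf_dir.

    Returns the archived file names. archive_path should live outside
    conf_dir so the archive does not try to include itself.
    """
    names: List[str] = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(os.listdir(conf_dir)):
            path = os.path.join(conf_dir, name)
            if os.path.isfile(path):  # skip sub-directories in this sketch
                zf.write(path, arcname=name)
                names.append(name)
    return names
```

Nothing is filtered by file name, so auxiliary files (for example SSL settings) travel with the main configuration files.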
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #4142 from vanzin/SPARK-2669 and squashes the following commits:
      
      f5434b9 [Marcelo Vanzin] Merge branch 'master' into SPARK-2669
      013f0fb [Marcelo Vanzin] Review feedback.
      f693152 [Marcelo Vanzin] Le sigh.
      ed45b7d [Marcelo Vanzin] Zip all config files and upload them as an archive.
      5927b6b [Marcelo Vanzin] Merge branch 'master' into SPARK-2669
      cbb9fb3 [Marcelo Vanzin] Remove stale test.
      e3e58d0 [Marcelo Vanzin] Merge branch 'master' into SPARK-2669
      e3d0613 [Marcelo Vanzin] Review feedback.
      34bdbd8 [Marcelo Vanzin] Fix test.
      022a688 [Marcelo Vanzin] Merge branch 'master' into SPARK-2669
      a77ddd5 [Marcelo Vanzin] Merge branch 'master' into SPARK-2669
      79221c7 [Marcelo Vanzin] [SPARK-2669] [yarn] Distribute client configuration to AM.
      50ab8a65
    • Davies Liu's avatar
      [SPARK-6957] [SPARK-6958] [SQL] improve API compatibility to pandas · c84d9169
      Davies Liu authored
      ```
      select(['cola', 'colb'])
      
      groupby(['colA', 'colB'])
      groupby([df.colA, df.colB])
      
      df.sort('A', ascending=True)
      df.sort(['A', 'B'], ascending=True)
      df.sort(['A', 'B'], ascending=[1, 0])
      ```
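
One way to read the `ascending` argument in the calls above: a single flag applies to every column, while a list supplies one flag per column. The helper below is a hypothetical plain-Python sketch of that normalization; it is not the PySpark implementation, and the name `sort_directions` is an assumption.

```python
from typing import List, Sequence, Tuple, Union

Flag = Union[bool, int]


def sort_directions(cols: Union[str, Sequence[str]],
                    ascending: Union[Flag, Sequence[Flag]] = True
                    ) -> List[Tuple[str, str]]:
    """Pair each column name with 'asc' or 'desc'."""
    names = [cols] if isinstance(cols, str) else list(cols)
    if isinstance(ascending, (bool, int)):
        # A single flag is broadcast to every column.
        flags = [bool(ascending)] * len(names)
    else:
        flags = [bool(a) for a in ascending]
        if len(flags) != len(names):
            raise ValueError("ascending needs one entry per column")
    return [(n, "asc" if f else "desc") for n, f in zip(names, flags)]
```

Under this sketch, `sort_directions(['A', 'B'], ascending=[1, 0])` pairs `A` with ascending order and `B` with descending order.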
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5544 from davies/compatibility and squashes the following commits:
      
      4944058 [Davies Liu] add docstrings
      adb2816 [Davies Liu] Merge branch 'master' of github.com:apache/spark into compatibility
      bcbbcab [Davies Liu] support ascending as list
      8dabdf0 [Davies Liu] improve API compatibility to pandas
      c84d9169
    • linweizhong's avatar
      [SPARK-6604][PySpark] Specify IP of Python server socket · dc48ba9f
      linweizhong authored
      The driver currently starts a server socket bound to a wildcard IP. Binding it to a loopback address (127.0.0.0) is more reasonable, since only the local Python process connects to it.
      /cc davies
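
A rough sketch of the difference, using Python's standard `socket` module rather than the driver's own code: binding to a loopback address keeps the listener unreachable from other hosts, whereas a wildcard bind accepts connections on every interface. The function name is hypothetical, and `127.0.0.1` is used here as the conventional loopback address, which is an assumption of this sketch.

```python
import socket


def open_local_server(backlog: int = 1) -> socket.socket:
    """Listen on the loopback interface only, on an OS-chosen port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Port 0 asks the OS for a free ephemeral port. Binding to "" or
    # "0.0.0.0" instead would listen on every interface (the wildcard case).
    server.bind(("127.0.0.1", 0))
    server.listen(backlog)
    return server
```

The caller can read the chosen port from `server.getsockname()` and pass it to the local client process.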
      
      Author: linweizhong <linweizhong@huawei.com>
      
      Closes #5256 from Sephiroth-Lin/SPARK-6604 and squashes the following commits:
      
      7b3c633 [linweizhong] rephrase
      dc48ba9f
    • Punya Biswal's avatar
      [SPARK-6952] Handle long args when detecting PID reuse · f6a9a57a
      Punya Biswal authored
      sbin/spark-daemon.sh used
      
          ps -p "$TARGET_PID" -o args=
      
      to figure out whether the process running with the expected PID is actually a Spark
      daemon. When running with a large classpath, the output of ps gets
      truncated and the check fails spuriously.
      
      This weakens the check to see if it's a java command (which is something
      we do in other parts of the script) rather than looking for the specific
      main class name. This means that SPARK-4832 might happen under a
      slightly broader range of circumstances (a java program happened to
      reuse the same PID), but it seems worthwhile compared to failing
      consistently with a large classpath.
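
The trade-off can be illustrated with two checks written in Python for clarity (the real logic lives in a shell script). Both function names are hypothetical, and comparing the command's base name to `java` is an assumption about what "is a java command" means in this sketch.

```python
import os


def args_mention_main_class(ps_args: str, main_class: str) -> bool:
    """Stricter check: the full argument string must contain the main class."""
    return main_class in ps_args


def is_java_command(ps_comm: str) -> bool:
    """Weaker check: only the command name is inspected."""
    return os.path.basename(ps_comm.strip()) == "java"
```

When the argument string is cut short, the main class at its end is lost and the stricter check fails, while the weaker check still passes for any java process, including an unrelated one that happens to reuse the PID.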
      
      Author: Punya Biswal <pbiswal@palantir.com>
      
      Closes #5535 from punya/feature/SPARK-6952 and squashes the following commits:
      
      7ea12d1 [Punya Biswal] Handle long args when detecting PID reuse
      f6a9a57a
    • Marcelo Vanzin's avatar
      [SPARK-6046] [core] Reorganize deprecated config support in SparkConf. · 4527761b
      Marcelo Vanzin authored
      This change tries to follow the chosen way for handling deprecated
      configs in SparkConf: all values (old and new) are kept in the conf
      object, and newer names take precedence over older ones when
      retrieving the value.
      
      Warnings are logged when config options are set, which generally happens
      on the driver node (where the logs are most visible).
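
A minimal sketch of the described behavior, in Python and with made-up key names: every value is stored under the name it was set with, a warning is logged when a deprecated name is set, and lookups by the new name fall back to older names only when the new one is absent. The class `Conf` and the keys in `ALTERNATES` are hypothetical and do not mirror SparkConf's actual tables.

```python
import logging
from typing import Dict, List, Optional

log = logging.getLogger("conf_sketch")

# Hypothetical mapping: new key -> older names, most recent first.
ALTERNATES: Dict[str, List[str]] = {
    "app.cache.size": ["app.cacheSize", "app.cache_mb"],
}
DEPRECATED: Dict[str, str] = {
    old: new for new, olds in ALTERNATES.items() for old in olds
}


class Conf:
    def __init__(self) -> None:
        self._settings: Dict[str, str] = {}

    def set(self, key: str, value: str) -> "Conf":
        if key in DEPRECATED:
            log.warning("%s is deprecated; use %s", key, DEPRECATED[key])
        self._settings[key] = value  # old and new names are both kept
        return self

    def get(self, key: str, default: Optional[str] = None) -> Optional[str]:
        if key in self._settings:  # the newer name wins when present
            return self._settings[key]
        for old in ALTERNATES.get(key, []):
            if old in self._settings:
                return self._settings[old]
        return default
```

Warning at set time rather than at read time means the message appears once, in the process where the option was configured.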
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5514 from vanzin/SPARK-6046 and squashes the following commits:
      
      9371529 [Marcelo Vanzin] Avoid math.
      6cf3f11 [Marcelo Vanzin] Review feedback.
      2445d48 [Marcelo Vanzin] Fix (and cleanup) update interval initialization.
      b6824be [Marcelo Vanzin] Clean up the other deprecated config use also.
      ab20351 [Marcelo Vanzin] Update FsHistoryProvider to only retrieve new config key.
      2c93209 [Marcelo Vanzin] [SPARK-6046] [core] Reorganize deprecated config support in SparkConf.
      4527761b
    • Sean Owen's avatar
      SPARK-6846 [WEBUI] Stage kill URL easy to accidentally trigger and possibility for security issue · f7a25644
      Sean Owen authored
      kill endpoints now only accept a POST (kill stage, master kill app, master kill driver); kill link now POSTs
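
The reasoning is that a GET can be triggered by merely following or prefetching a link, so a state-changing action such as a kill is safer behind POST. Below is a hypothetical Python sketch of such a guard; the function name, status codes, and response strings are assumptions, not the Spark web UI's actual behavior.

```python
from typing import Set, Tuple


def handle_kill(method: str, stage_id: int, running: Set[int]) -> Tuple[int, str]:
    """Kill a stage only when the request method is POST."""
    if method.upper() != "POST":
        # GET and other methods never change state in this sketch.
        return 405, "Method Not Allowed"
    if stage_id in running:
        running.discard(stage_id)
        return 200, "killed"
    return 404, "no such stage"
```

With this guard, a GET for a running stage leaves `running` unchanged, and only an explicit POST removes the stage.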
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #5528 from srowen/SPARK-6846 and squashes the following commits:
      
      137ac9f [Sean Owen] Oops, fix scalastyle line length problem
      7c5f961 [Sean Owen] Add Imran's test of kill link
      59f447d [Sean Owen] kill endpoints now only accept a POST (kill stage, master kill app, master kill driver); kill link now POSTs
      f7a25644