  1. Feb 10, 2015
• [HOTFIX][SPARK-4136] Fix compilation and tests · b640c841
      Andrew Or authored
• SPARK-4136. Under dynamic allocation, cancel outstanding executor requests when no longer needed · 69bc3bb6
      Sandy Ryza authored
      This takes advantage of the changes made in SPARK-4337 to cancel pending requests to YARN when they are no longer needed.
      
Each time the timer in `ExecutorAllocationManager` strikes, we compute `maxNumNeededExecutors`, the maximum number of executors we could fill with the current load.  This is calculated as the total number of running and pending tasks divided by the number of cores per executor.  If `maxNumNeededExecutors` is below the total number of running and pending executors, we call `requestTotalExecutors(maxNumNeededExecutors)` to let the cluster manager know that it should cancel any pending requests above this amount.  If not, `maxNumNeededExecutors` is just used as a bound alongside the configured `maxExecutors` to limit the number of new requests.
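A minimal sketch of that sizing arithmetic (illustrative names, not the exact `ExecutorAllocationManager` internals; assumes one core per task):

```scala
object AllocationMathSketch {
  /** Executors needed to run all running + pending tasks, rounded up. */
  def maxNumNeededExecutors(runningTasks: Int, pendingTasks: Int,
                            coresPerExecutor: Int): Int = {
    val totalTasks = runningTasks + pendingTasks
    (totalTasks + coresPerExecutor - 1) / coresPerExecutor
  }
}
// e.g. 7 running + 6 pending tasks on 4-core executors => ceil(13 / 4) = 4,
// so with 6 executors outstanding we would call requestTotalExecutors(4).
```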
      
      The patch modifies the API exposed by `ExecutorAllocationClient` for requesting additional executors by moving from `requestExecutors` to `requestTotalExecutors`.  This makes the communication between the `ExecutorAllocationManager` and the `YarnAllocator` easier to reason about and removes some state that needed to be kept in the `CoarseGrainedSchedulerBackend`.  I think an argument can be made that this makes for a less attractive user-facing API in `SparkContext`, but I'm having trouble envisioning situations where a user would want to use either of these APIs.
      
      This will likely break some tests, but I wanted to get feedback on the approach before adding tests and polishing.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #4168 from sryza/sandy-spark-4136 and squashes the following commits:
      
      37ce77d [Sandy Ryza] Warn on negative number
      cd3b2ff [Sandy Ryza] SPARK-4136
• [SPARK-5716] [SQL] Support TOK_CHARSETLITERAL in HiveQl · c7ad80ae
      Daoyuan Wang authored
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #4502 from adrian-wang/utf8 and squashes the following commits:
      
      4d7b0ee [Daoyuan Wang] remove useless import
      606f981 [Daoyuan Wang] support TOK_CHARSETLITERAL in HiveQl
• [SPARK-5717] [MLlib] add stop and reorganize import · 6cc96cf0
      JqueryFan authored
Trivial: add sc.stop() and reorganize imports.
      https://issues.apache.org/jira/browse/SPARK-5717
      
      Author: JqueryFan <firing@126.com>
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #4503 from hhbyyh/scstop and squashes the following commits:
      
      7837a2c [JqueryFan] revert import change
      2e85cc1 [Yuhao Yang] add stop and reorganize import
• [SPARK-1805] [EC2] Validate instance types · 50820f15
      Nicholas Chammas authored
Addresses [SPARK-1805](https://issues.apache.org/jira/browse/SPARK-1805), though it doesn't resolve it completely.
      
      Error out quickly if the user asks for the master and slaves to have different AMI virtualization types, since we don't currently support that.
      
In addition to that, we print warnings if the supplied instance types are not recognized, though I would prefer that we errored out. Elsewhere in the script it seems [we allow unrecognized instance types](https://github.com/apache/spark/blob/5de14cc2763a8211f77eeb55940dec025822eb78/ec2/spark_ec2.py#L331), though I think we should remove that.
      
      It's messy, but it should serve us until we enhance spark-ec2 to support clusters with mixed virtualization types.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #4455 from nchammas/ec2-master-slave-different-virtualization and squashes the following commits:
      
      ce28609 [Nicholas Chammas] fix style
      b0adba0 [Nicholas Chammas] validate input instance types
• [SPARK-5700] [SQL] [Build] Bumps jets3t to 0.9.3 for hadoop-2.3 and hadoop-2.4 profiles · ba667935
      Cheng Lian authored
      This is a follow-up PR for #4454 and #4484. JetS3t 0.9.2 contains a log4j.properties file inside the artifact and breaks our tests (see SPARK-5696). This is fixed in 0.9.3.
      
      This PR also reverts hotfix changes introduced in #4484. The reason is that asking users to configure HiveThriftServer2 logging configurations in hive-log4j.properties can be unintuitive.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4499 from liancheng/spark-5700 and squashes the following commits:
      
      4f020c7 [Cheng Lian] Bumps jets3t to 0.9.3 for hadoop-2.3 and hadoop-2.4 profiles
• SPARK-5239 [CORE] JdbcRDD throws "java.lang.AbstractMethodError:... · 2d1e9167
      Sean Owen authored
      SPARK-5239 [CORE] JdbcRDD throws "java.lang.AbstractMethodError: oracle.jdbc.driver.xxxxxx.isClosed()Z"
      
      This is a completion of https://github.com/apache/spark/pull/4033 which was withdrawn for some reason.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4470 from srowen/SPARK-5239.2 and squashes the following commits:
      
      2398bde [Sean Owen] Avoid use of JDBC4-only isClosed()
• [SPARK-4964][Streaming][Kafka] More updates to Exactly-once Kafka stream · c1513463
      Tathagata Das authored
      Changes
      - Added example
      - Added a critical unit test that verifies that offset ranges can be recovered through checkpoints
      
      Might add more changes.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #4384 from tdas/new-kafka-fixes and squashes the following commits:
      
      7c931c3 [Tathagata Das] Small update
      3ed9284 [Tathagata Das] updated scala doc
      83d0402 [Tathagata Das] Added JavaDirectKafkaWordCount example.
      26df23c [Tathagata Das] Updates based on PR comments from Cody
      e4abf69 [Tathagata Das] Scala doc improvements and stuff.
      bb65232 [Tathagata Das] Fixed test bug and refactored KafkaStreamSuite
      50f2b56 [Tathagata Das] Added Java API and added more Scala and Java unit tests. Also updated docs.
      e73589c [Tathagata Das] Minor changes.
      4986784 [Tathagata Das] Added unit test to kafka offset recovery
      6a91cab [Tathagata Das] Added example
• [SPARK-5597][MLLIB] save/load for decision trees and ensembles · ef2f55b9
      Joseph K. Bradley authored
      This is based on #4444 from jkbradley with the following changes:
      
      1. Node schema updated to
         ~~~
      treeId: int
      nodeId: Int
      predict/
             |- predict: Double
             |- prob: Double
      impurity: Double
      isLeaf: Boolean
      split/
           |- feature: Int
           |- threshold: Double
           |- featureType: Int
           |- categories: Array[Double]
      leftNodeId: Integer
      rightNodeId: Integer
      infoGain: Double
      ~~~
      
      2. Some refactor of the implementation.
      
      Closes #4444.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4493 from mengxr/SPARK-5597 and squashes the following commits:
      
      75e3bb6 [Xiangrui Meng] fix style
      2b0033d [Xiangrui Meng] update tree export schema and refactor the implementation
      45873a2 [Joseph K. Bradley] org imports
      1d4c264 [Joseph K. Bradley] Added save/load for tree ensembles
      dcdbf85 [Joseph K. Bradley] added save/load for decision tree but need to generalize it to ensembles
  2. Feb 09, 2015
• [SQL] Remove the duplicated code · bd0b5ea7
      Cheng Hao authored
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #4494 from chenghao-intel/tiny_code_change and squashes the following commits:
      
      450dfe7 [Cheng Hao] remove the duplicated code
• [SPARK-5701] Only set ShuffleReadMetrics when task has shuffle deps · a2d33d0b
      Kay Ousterhout authored
      The updateShuffleReadMetrics method in TaskMetrics (called by the executor heartbeater) will currently always add a ShuffleReadMetrics to TaskMetrics (with values set to 0), even when the task didn't read any shuffle data. ShuffleReadMetrics should only be added if the task reads shuffle data.
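A sketch of the guard this implies; the field names are illustrative rather than the actual `TaskMetrics` internals:

```scala
case class ShuffleReadMetrics(remoteBytesRead: Long, localBlocksFetched: Long)

class TaskMetricsSketch {
  // One entry per shuffle dependency the task actually read from.
  private var depsShuffleRead: Seq[ShuffleReadMetrics] = Nil
  private var aggregated: Option[ShuffleReadMetrics] = None

  def updateShuffleReadMetrics(): Unit = {
    // Only materialize an aggregate when the task read shuffle data,
    // instead of unconditionally attaching an all-zero ShuffleReadMetrics.
    if (depsShuffleRead.nonEmpty) {
      aggregated = Some(ShuffleReadMetrics(
        depsShuffleRead.map(_.remoteBytesRead).sum,
        depsShuffleRead.map(_.localBlocksFetched).sum))
    }
  }
}
```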
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #4488 from kayousterhout/SPARK-5701 and squashes the following commits:
      
      673ed58 [Kay Ousterhout] SPARK-5701: Only set ShuffleReadMetrics when task has shuffle deps
• [SPARK-5703] AllJobsPage throws empty.max exception · a95ed521
      Andrew Or authored
If you have a `SparkListenerJobEnd` event without the corresponding `SparkListenerJobStart` event, then `JobProgressListener` will create an empty `JobUIData` with an empty `stageIds` list. However, later in `AllJobsPage` we call `stageIds.max`, which throws an exception if the list is empty.
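The failing pattern and the fix, in miniature:

```scala
val stageIds: Seq[Int] = Seq.empty       // JobUIData with no stages
// stageIds.max                          // throws UnsupportedOperationException
val lastStageId =
  if (stageIds.nonEmpty) stageIds.max else -1  // guard before calling max
// or, equivalently, keep the absence explicit:
val lastStageIdOpt = stageIds.reduceOption(_ max _)
```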
      
      This crashed my history server.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #4490 from andrewor14/jobs-page-max and squashes the following commits:
      
      21797d3 [Andrew Or] Check nonEmpty before calling max
• [SPARK-2996] Implement userClassPathFirst for driver, yarn. · 20a60131
      Marcelo Vanzin authored
      Yarn's config option `spark.yarn.user.classpath.first` does not work the same way as
      `spark.files.userClassPathFirst`; Yarn's version is a lot more dangerous, in that it
      modifies the system classpath, instead of restricting the changes to the user's class
      loader. So this change implements the behavior of the latter for Yarn, and deprecates
      the more dangerous choice.
      
      To be able to achieve feature-parity, I also implemented the option for drivers (the existing
      option only applies to executors). So now there are two options, each controlling whether
      to apply userClassPathFirst to the driver or executors. The old option was deprecated, and
      aliased to the new one (`spark.executor.userClassPathFirst`).
      
      The existing "child-first" class loader also had to be fixed. It didn't handle resources, and it
      was also doing some things that ended up causing JVM errors depending on how things
      were being called.
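A minimal child-first class loader in the spirit described (a sketch only; the real loader also handles resources and adds locking around `loadClass()`, per the commits below):

```scala
import java.net.{URL, URLClassLoader}

class ChildFirstLoader(urls: Array[URL], realParent: ClassLoader)
  extends URLClassLoader(urls, null) {  // null parent: skip default delegation

  override def loadClass(name: String, resolve: Boolean): Class[_] =
    try {
      // Look in the user's jars first...
      super.loadClass(name, resolve)
    } catch {
      case _: ClassNotFoundException =>
        // ...and only fall back to the real parent if the user jars miss.
        realParent.loadClass(name)
    }
}
```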
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #3233 from vanzin/SPARK-2996 and squashes the following commits:
      
      9cf9cf1 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      a1499e2 [Marcelo Vanzin] Remove SPARK_HOME propagation.
      fa7df88 [Marcelo Vanzin] Remove 'test.resource' file, create it dynamically.
      a8c69f1 [Marcelo Vanzin] Review feedback.
      cabf962 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      a1b8d7e [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      3f768e3 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      2ce3c7a [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      0e6d6be [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      70d4044 [Marcelo Vanzin] Fix pyspark/yarn-cluster test.
      0fe7777 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      0e6ef19 [Marcelo Vanzin] Move class loaders around and make names more meaninful.
      fe970a7 [Marcelo Vanzin] Review feedback.
      25d4fed [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      3cb6498 [Marcelo Vanzin] Call the right loadClass() method on the parent.
      fbb8ab5 [Marcelo Vanzin] Add locking in loadClass() to avoid deadlocks.
      2e6c4b7 [Marcelo Vanzin] Mention new setting in documentation.
      b6497f9 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      a10f379 [Marcelo Vanzin] Some feedback.
      3730151 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      f513871 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      44010b6 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      7b57cba [Marcelo Vanzin] Remove now outdated message.
      5304d64 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      35949c8 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      54e1a98 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      d1273b2 [Marcelo Vanzin] Add test file to rat exclude.
      fa1aafa [Marcelo Vanzin] Remove write check on user jars.
      89d8072 [Marcelo Vanzin] Cleanups.
      a963ea3 [Marcelo Vanzin] Implement spark.driver.userClassPathFirst for standalone cluster mode.
      50afa5f [Marcelo Vanzin] Fix Yarn executor command line.
      7d14397 [Marcelo Vanzin] Register user jars in executor up front.
      7f8603c [Marcelo Vanzin] Fix yarn-cluster mode without userClassPathFirst.
      20373f5 [Marcelo Vanzin] Fix ClientBaseSuite.
      55c88fa [Marcelo Vanzin] Run all Yarn integration tests via spark-submit.
      0b64d92 [Marcelo Vanzin] Add deprecation warning to yarn option.
      4a84d87 [Marcelo Vanzin] Fix the child-first class loader.
      d0394b8 [Marcelo Vanzin] Add "deprecated configs" to SparkConf.
      46d8cf2 [Marcelo Vanzin] Update doc with new option, change name to "userClassPathFirst".
      a314f2d [Marcelo Vanzin] Enable driver class path isolation in SparkSubmit.
      91f7e54 [Marcelo Vanzin] [yarn] Enable executor class path isolation.
      a853e74 [Marcelo Vanzin] Re-work CoarseGrainedExecutorBackend command line arguments.
      89522ef [Marcelo Vanzin] Add class path isolation support for Yarn cluster mode.
• SPARK-4900 [MLLIB] MLlib SingularValueDecomposition ARPACK IllegalStateException · 36c4e1d7
      Sean Owen authored
Fix the ARPACK error code mapping, at least. It's not yet clear whether the error is what we expect from ARPACK. If it isn't, it's not clear whether that should be treated as an MLlib or a Breeze issue.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4485 from srowen/SPARK-4900 and squashes the following commits:
      
      7355aa1 [Sean Owen] Fix ARPACK error code mapping
• Add a config option to print DAG. · 31d435ec
      KaiXinXiaoLei authored
Adds a config option, "spark.rddDebug.enable", that controls whether DAG info is printed. When "spark.rddDebug.enable" is true, information about the DAG is printed to the log.
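A hedged sketch of how the option could be used; `spark.rddDebug.enable` is the key proposed here, and `toDebugString` is the existing RDD method that renders an RDD's lineage:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf()
  .setMaster("local[*]").setAppName("dag-debug")
  .set("spark.rddDebug.enable", "true"))

val rdd = sc.parallelize(1 to 10).map(_ * 2).filter(_ > 5)
if (sc.getConf.getBoolean("spark.rddDebug.enable", defaultValue = false)) {
  println(rdd.toDebugString)  // prints this RDD's lineage (the DAG)
}
```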
      
      Author: KaiXinXiaoLei <huleilei1@huawei.com>
      
      Closes #4257 from KaiXinXiaoLei/DAGprint and squashes the following commits:
      
      d9fe42e [KaiXinXiaoLei] change  log info
      c27ee76 [KaiXinXiaoLei] change log info
      83c2b32 [KaiXinXiaoLei] change config option
      adcb14f [KaiXinXiaoLei] change the file.
      f4e7b9e [KaiXinXiaoLei] add a option to print DAG
• [SPARK-5469] restructure pyspark.sql into multiple files · 08488c17
      Davies Liu authored
      All the DataTypes moved into pyspark.sql.types
      
      The changes can be tracked by `--find-copies-harder -M25`
      ```
      davieslocalhost:~/work/spark/python$ git diff --find-copies-harder -M25 --numstat master..
      2       5       python/docs/pyspark.ml.rst
      0       3       python/docs/pyspark.mllib.rst
      10      2       python/docs/pyspark.sql.rst
      1       1       python/pyspark/mllib/linalg.py
      21      14      python/pyspark/{mllib => sql}/__init__.py
      14      2108    python/pyspark/{sql.py => sql/context.py}
      10      1772    python/pyspark/{sql.py => sql/dataframe.py}
      7       6       python/pyspark/{sql_tests.py => sql/tests.py}
      8       1465    python/pyspark/{sql.py => sql/types.py}
      4       2       python/run-tests
      1       1       sql/core/src/main/scala/org/apache/spark/sql/test/ExamplePointUDT.scala
      ```
      
      Also `git blame -C -C python/pyspark/sql/context.py` to track the history.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4479 from davies/sql and squashes the following commits:
      
      1b5f0a5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into sql
      2b2b983 [Davies Liu] restructure pyspark.sql
• [SPARK-5698] Do not let user request negative # of executors · d302c480
      Andrew Or authored
      Otherwise we might crash the ApplicationMaster. Why? Please see https://issues.apache.org/jira/browse/SPARK-5698.
      
      sryza I believe this is also relevant in your patch #4168.
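The shape of the fix, in a sketch (the squashed commits below confirm the `IllegalArgumentException`): validate before forwarding the request to the cluster manager.

```scala
def requestExecutors(numAdditionalExecutors: Int): Unit = {
  if (numAdditionalExecutors < 0) {
    throw new IllegalArgumentException(
      s"Attempted to request a negative number of executors: $numAdditionalExecutors")
  }
  // ... forward the (now known non-negative) request to the cluster manager ...
}
```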
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #4483 from andrewor14/da-negative and squashes the following commits:
      
      53ed955 [Andrew Or] Throw IllegalArgumentException instead
      0e89fd5 [Andrew Or] Check against negative requests
• [SPARK-5699] [SQL] [Tests] Runs hive-thriftserver tests whenever SQL code is modified · 3ec3ad29
      Cheng Lian authored
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4486 from liancheng/spark-5699 and squashes the following commits:
      
      538001d [Cheng Lian] Runs hive-thriftserver tests whenever SQL code is modified
• [SPARK-5648][SQL] support "alter ... unset tblproperties("key")" · d08e7c2b
      DoingDone9 authored
Makes HiveContext support `alter ... unset tblproperties("key")` statements, for example:
      alter view viewName unset tblproperties("k")
      alter table tableName unset tblproperties("k")
      
      Author: DoingDone9 <799203320@qq.com>
      
      Closes #4424 from DoingDone9/unset and squashes the following commits:
      
      6dd8bee [DoingDone9] support "alter ... unset tblproperties("key")"
• [SPARK-2096][SQL] support dot notation on array of struct · 0ee53ebc
      Wenchen Fan authored
~~The rule is simple: If you want `a.b` to work, then `a` must be some level of nested array of struct (level 0 means just a StructType), and the result of `a.b` is the same level of nested array of b-type.
An optimization is: the resolve chain looks like `Attribute -> GetItem -> GetField -> GetField ...`, so we could transmit the nested-array information between `GetItem` and `GetField` to avoid repeated computation of the `innerDataType` and `containsNullList` of that nested array.~~
      marmbrus Could you take a look?
      
to evaluate `a.b`, if `a` is an array of structs, then `a.b` means getting field `b` on each element of `a`, returning the results as an array.
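The described semantics, modeled with plain Scala collections: `a.b` on an array of structs extracts field `b` from each element.

```scala
case class Struct(b: Int)
val a = Array(Struct(1), Struct(2), Struct(3))
val aDotB = a.map(_.b)  // Array(1, 2, 3): same shape as `a`, field extracted
```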
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #2405 from cloud-fan/nested-array-dot and squashes the following commits:
      
      08a228a [Wenchen Fan] support dot notation on array of struct
• [SPARK-5614][SQL] Predicate pushdown through Generate. · 2a362925
      Lu Yan authored
Now in Catalyst's rules, predicates cannot be pushed through "Generate" nodes. Furthermore, partition pruning in HiveTableScan cannot be applied to queries that involve "Generate". This makes such queries very inefficient. In practice, this patch finds patterns like
      
      ```scala
      Filter(predicate, Generate(generator, _, _, _, grandChild))
      ```
      
and splits the predicate into two parts, based on whether each conjunct references a column generated by the Generate node. A new Filter is created for the conjuncts that can be pushed beneath the Generate node; if nothing is left for the original Filter, it is removed.
For example, the physical plan for the query
      ```sql
      select len, bk
      from s_server lateral view explode(len_arr) len_table as len
      where len > 5 and day = '20150102';
      ```
where 'day' is a partition column in the metastore, looks like this in the current version of Spark SQL:
      
      > Project [len, bk]
      >
      > Filter ((len > "5") && "(day = "20150102")")
      >
      > Generate explode(len_arr), true, false
      >
      > HiveTableScan [bk, len_arr, day], (MetastoreRelation default, s_server, None), None
      
      But theoretically the plan should be like this
      
      > Project [len, bk]
      >
      > Filter (len > "5")
      >
      > Generate explode(len_arr), true, false
      >
      > HiveTableScan [bk, len_arr, day], (MetastoreRelation default, s_server, None), Some(day = "20150102")
      
      Where partition pruning predicates can be pushed to HiveTableScan nodes.
      
      Author: Lu Yan <luyan02@baidu.com>
      
      Closes #4394 from ianluyan/ppd and squashes the following commits:
      
      a67dce9 [Lu Yan] Fix English grammar.
      7cea911 [Lu Yan] Revised based on @marmbrus's opinions
      ffc59fc [Lu Yan] [SPARK-5614][SQL] Predicate pushdown through Generate.
• [SPARK-5696] [SQL] [HOTFIX] Asks HiveThriftServer2 to re-initialize log4j using Hive configurations · b8080aa8
      Cheng Lian authored
In this way, log4j configurations overridden by jets3t-0.9.2.jar can in turn be overridden by Hive's default log4j configurations.
      
This might not be the best solution for this issue since it requires users to use `hive-log4j.properties` rather than `log4j.properties` to initialize `HiveThriftServer2` logging configurations, which can be confusing. The main purpose of this PR is to fix the Jenkins PR build.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4484 from liancheng/spark-5696 and squashes the following commits:
      
      df83956 [Cheng Lian] Hot fix: asks HiveThriftServer2 to re-initialize log4j using Hive configurations
• [SQL] Code cleanup. · 5f0b30e5
      Yin Huai authored
      I added an unnecessary line of code in https://github.com/apache/spark/commit/13531dd97c08563e53dacdaeaf1102bdd13ef825.
      
      My bad. Let's delete it.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #4482 from yhuai/unnecessaryCode and squashes the following commits:
      
      3645af0 [Yin Huai] Code cleanup.
• [SQL] Add some missing DataFrame functions. · 68b25cf6
      Michael Armbrust authored
      - as with a `Symbol`
      - distinct
      - sqlContext.emptyDataFrame
      - move add/remove col out of RDDApi section
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4437 from marmbrus/dfMissingFuncs and squashes the following commits:
      
      2004023 [Michael Armbrust] Add missing functions
• [SPARK-5611] [EC2] Allow spark-ec2 repo and branch to be set on CLI of spark_ec2.py · b884daa5
      Florian Verhein authored
and, by extension, the ami-list.
      
      Useful for using alternate spark-ec2 repos or branches.
      
      Author: Florian Verhein <florian.verhein@gmail.com>
      
      Closes #4385 from florianverhein/master and squashes the following commits:
      
      7e2b4be [Florian Verhein] [SPARK-5611] [EC2] typo
      8b653dc [Florian Verhein] [SPARK-5611] [EC2] Enforce only supporting spark-ec2 forks from github, log improvement
      bc4b0ed [Florian Verhein] [SPARK-5611] allow spark-ec2 repos with different names
      8b5c551 [Florian Verhein] improve option naming, fix logging, fix lint failing, add guard to enforce spark-ec2
      7724308 [Florian Verhein] [SPARK-5611] [EC2] fixes
      b42b68c [Florian Verhein] [SPARK-5611] [EC2] Allow spark-ec2 repo and branch to be set on CLI of spark_ec2.py
• [SPARK-5675][SQL] XyzType companion object should subclass XyzType · f48199eb
      Reynold Xin authored
      Otherwise, the following will always return false in Java.
      
```java
      dataType instanceof StringType
      ```
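The pattern in miniature (a sketch; the real classes extend `DataType`): making the companion object a subclass of the class lets Java's `instanceof` match the singleton.

```scala
class StringType private() { /* ... */ }
case object StringType extends StringType  // companion subclasses the class

val dataType: Any = StringType
assert(dataType.isInstanceOf[StringType])  // true from Scala and Java alike
```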
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4463 from rxin/type-companion-object and squashes the following commits:
      
      04d5d8d [Reynold Xin] Comment.
      976e11e [Reynold Xin] [SPARK-5675][SQL]StringType case object should be subclass of StringType class
• [SPARK-4905][STREAMING] FlumeStreamSuite fix. · 0765af9b
      Hari Shreedharan authored
Using the String constructor instead of a CharsetDecoder, to see if it fixes the issue of empty strings in Flume test output.
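The change in miniature (a sketch; `Charsets.UTF_8` comes from Guava, per the squashed commits below):

```scala
import com.google.common.base.Charsets

val body: Array[Byte] = "hello".getBytes(Charsets.UTF_8)
// Decode with the String constructor and an explicit charset,
// rather than going through a shared CharsetDecoder instance.
val text = new String(body, Charsets.UTF_8)
```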
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #4371 from harishreedharan/Flume-stream-attempted-fix and squashes the following commits:
      
      550d363 [Hari Shreedharan] Fix imports.
      8695950 [Hari Shreedharan] Use Charsets.UTF_8 instead of "UTF-8" in String constructors.
      af3ba14 [Hari Shreedharan] [SPARK-4905][STREAMING] FlumeStreamSuite fix.
• [SPARK-5691] Fixing wrong data structure lookup for dupe app registratio... · 6fe70d84
      mcheah authored
Master's registerApplication method checks whether the application has already been registered by examining the addressToWorker hash map. In reality, it should refer to the addressToApp data structure, as this is what actually tracks which apps have been registered.
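The lookup in miniature, with illustrative types (the real Master keeps richer state):

```scala
import scala.collection.mutable

val addressToWorker = mutable.HashMap.empty[String, String]
val addressToApp    = mutable.HashMap.empty[String, String]

def appAlreadyRegistered(appAddress: String): Boolean =
  addressToApp.contains(appAddress)  // correct map: it tracks registered apps
// (the bug consulted addressToWorker.contains(appAddress) instead)
```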
      
      Author: mcheah <mcheah@palantir.com>
      
      Closes #4477 from mccheah/spark-5691 and squashes the following commits:
      
      efdc573 [mcheah] [SPARK-5691] Fixing wrong data structure lookup for dupe app registration
• [SPARK-5664][BUILD] Restore stty settings when exiting from SBT's spark-shell · dae21614
      Liang-Chi Hsieh authored
      For launching spark-shell from SBT.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #4451 from viirya/restore_stty and squashes the following commits:
      
      fdfc480 [Liang-Chi Hsieh] Restore stty settings when exit (for launching spark-shell from SBT).
• [SPARK-5678] Convert DataFrame to pandas.DataFrame and Series · afb13163
      Davies Liu authored
      ```
      pyspark.sql.DataFrame.to_pandas = to_pandas(self) unbound pyspark.sql.DataFrame method
          Collect all the rows and return a `pandas.DataFrame`.
      
          >>> df.to_pandas()  # doctest: +SKIP
             age   name
          0    2  Alice
          1    5    Bob
      
      pyspark.sql.Column.to_pandas = to_pandas(self) unbound pyspark.sql.Column method
          Return a pandas.Series from the column
      
          >>> df.age.to_pandas()  # doctest: +SKIP
          0    2
          1    5
          dtype: int64
      ```
      
Not tested by Jenkins (the tests depend on pandas).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4476 from davies/to_pandas and squashes the following commits:
      
      6276fb6 [Davies Liu] Convert DataFrame to pandas.DataFrame and Series
• SPARK-4267 [YARN] Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later · de780604
      Sean Owen authored
Before passing them to YARN, escape the arguments in "extraJavaOptions", in order to correctly handle cases like -Dfoo="one two three". Also standardize how these args are handled and ensure that individual args are treated as stand-alone args, not one string.
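A sketch of the quoting idea (the actual patch uses the YARN module's shell-escaping helper together with a quote-aware splitter; this helper is illustrative):

```scala
// Single-quote each stand-alone argument so embedded spaces survive the
// YARN launch script; ' itself becomes the classic '\'' sequence.
def escapeForShellSketch(arg: String): String =
  "'" + arg.replace("'", "'\\''") + "'"

val opts = Seq("-Dfoo=one two three", "-Dbar=baz")
val cmdLine = opts.map(escapeForShellSketch).mkString(" ")
// => '-Dfoo=one two three' '-Dbar=baz'
```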
      
      vanzin andrewor14
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4452 from srowen/SPARK-4267.2 and squashes the following commits:
      
      c8297d2 [Sean Owen] Before passing to YARN, escape arguments in "extraJavaOptions" args, in order to correctly handle cases like -Dfoo="one two three". Also standardize how these args are handled and ensure that individual args are treated as stand-alone args, not one string.
• SPARK-2149. [MLLIB] Univariate kernel density estimation · 0793ee1b
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #1093 from sryza/sandy-spark-2149 and squashes the following commits:
      
      5f06b33 [Sandy Ryza] More review comments
      0f73060 [Sandy Ryza] Respond to Sean's review comments
      0dfa005 [Sandy Ryza] SPARK-2149. Univariate kernel density estimation
• [SPARK-5473] [EC2] Expose SSH failures after status checks pass · 4dfe180f
      Nicholas Chammas authored
      If there is some fatal problem with launching a cluster, `spark-ec2` just hangs without giving the user useful feedback on what the problem is.
      
      This PR exposes the output of the SSH calls to the user if the SSH test fails during cluster launch for any reason but the instance status checks are all green. It also removes the growing trail of dots while waiting in favor of a fixed 3 dots.
      
      For example:
      
      ```
      $ ./ec2/spark-ec2 -k key -i /incorrect/path/identity.pem --instance-type m3.medium --slaves 1 --zone us-east-1c launch "spark-test"
      Setting up security groups...
      Searching for existing cluster spark-test...
      Spark AMI: ami-35b1885c
      Launching instances...
      Launched 1 slaves in us-east-1c, regid = r-7dadd096
      Launched master in us-east-1c, regid = r-fcadd017
      Waiting for cluster to enter 'ssh-ready' state...
      Warning: SSH connection error. (This could be temporary.)
      Host: 127.0.0.1
      SSH return code: 255
      SSH output: Warning: Identity file /incorrect/path/identity.pem not accessible: No such file or directory.
      Warning: Permanently added '127.0.0.1' (RSA) to the list of known hosts.
      Permission denied (publickey).
      ```
      
      This should give users enough information when some unrecoverable error occurs during launch so they can know to abort the launch. This will help avoid situations like the ones reported [here on Stack Overflow](http://stackoverflow.com/q/28002443/) and [here on the user list](http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3C1422323829398-21381.postn3.nabble.com%3E), where the users couldn't tell what the problem was because it was being hidden by `spark-ec2`.
      
      This is a usability improvement that should be backported to 1.2.
      
      Resolves [SPARK-5473](https://issues.apache.org/jira/browse/SPARK-5473).
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #4262 from nchammas/expose-ssh-failure and squashes the following commits:
      
      8bda6ed [Nicholas Chammas] default to print SSH output
      2b92534 [Nicholas Chammas] show SSH output after status check pass
• [SPARK-5539][MLLIB] LDA guide · 855d12ac
      Xiangrui Meng authored
      This is the LDA user guide from jkbradley with Java and Scala code example.
      
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4465 from mengxr/lda-guide and squashes the following commits:
      
      6dcb7d1 [Xiangrui Meng] update java example in the user guide
      76169ff [Xiangrui Meng] update java example
      36c3ae2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into lda-guide
      c2a1efe [Joseph K. Bradley] Added LDA programming guide, plus Java example (which is in the guide and probably should be removed).
• [SPARK-5472][SQL] Fix Scala code style · 4575c564
      Hung Lin authored
      Fix Scala code style.
      
      Author: Hung Lin <hung@zoomdata.com>
      
      Closes #4464 from hunglin/SPARK-5472 and squashes the following commits:
      
      ef7a3b3 [Hung Lin] SPARK-5472: fix scala style
  3. Feb 08, 2015
• SPARK-4405 [MLLIB] Matrices.* construction methods should check for rows x cols overflow · 4396dfb3
      Sean Owen authored
Check that the size of the dense matrix array is not beyond Int.MaxValue in the Matrices.* methods. jkbradley this should be an easy one. Review and/or merge as you see fit.
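The guard in miniature: the product rows x cols can overflow Int, so validate in Long arithmetic before allocating the backing array.

```scala
def checkMatrixSize(numRows: Int, numCols: Int): Unit = {
  require(numRows.toLong * numCols <= Int.MaxValue,
    s"$numRows x $numCols dense matrix is too large to materialize: " +
    s"${numRows.toLong * numCols} entries exceeds max array size ${Int.MaxValue}")
}
```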
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4461 from srowen/SPARK-4405 and squashes the following commits:
      
      c67574e [Sean Owen] Check that size of dense matrix array is not beyond Int.MaxValue in Matrices.* methods
• [SPARK-5660][MLLIB] Make Matrix apply public · c1716118
      Joseph K. Bradley authored
      This is #4447 with `override`.
      
      Closes #4447
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4462 from mengxr/SPARK-5660 and squashes the following commits:
      
      f82c8d6 [Xiangrui Meng] add override to matrix.apply
      91cedde [Joseph K. Bradley] made matrix apply public
• [SPARK-5643][SQL] Add a show method to print the content of a DataFrame in tabular format. · a052ed42
      Reynold Xin authored
      An example:
      ```
      year  month AVG('Adj Close) MAX('Adj Close)
      1980  12    0.503218        0.595103
      1981  01    0.523289        0.570307
      1982  02    0.436504        0.475256
      1983  03    0.410516        0.442194
      1984  04    0.450090        0.483521
      ```
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4416 from rxin/SPARK-5643 and squashes the following commits:
      
      d0e0d6e [Reynold Xin] [SQL] Minor update to data source and statistics documentation.
      269da83 [Reynold Xin] Updated isLocal comment.
      2cf3c27 [Reynold Xin] Moved logic into optimizer.
      1a04d8b [Reynold Xin] [SPARK-5643][SQL] Add a show method to print the content of a DataFrame in columnar format.
• SPARK-5665 [DOCS] Update netlib-java documentation · 56aff4bd
      Sam Halliday authored
      I am the author of netlib-java and I found this documentation to be out of date. Some main points:
      
      1. Breeze has not depended on jBLAS for some time
      2. netlib-java provides a pure JVM implementation as the fallback (the original docs did not appear to be aware of this, claiming that gfortran was necessary)
      3. The licensing issue is not just about LGPL: optimised natives have proprietary licenses. Building with the LGPL flag turned on really doesn't help you get past this.
      4. I really think it's best to direct people to my detailed setup guide instead of trying to compress it into one sentence. It is different for each architecture, each OS, and for each backend.
      
      I hope this helps to clear things up :smile:
      
      Author: Sam Halliday <sam.halliday@Gmail.com>
      Author: Sam Halliday <sam.halliday@gmail.com>
      
      Closes #4448 from fommil/patch-1 and squashes the following commits:
      
      18cda11 [Sam Halliday] remove link to skillsmatters at request of @mengxr
      a35e4a9 [Sam Halliday] reword netlib-java/breeze docs
• [SPARK-5598][MLLIB] model save/load for ALS · 5c299c58
      Xiangrui Meng authored
      following #4233. jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4422 from mengxr/SPARK-5598 and squashes the following commits:
      
      a059394 [Xiangrui Meng] SaveLoad not extending Loader
      14b7ea6 [Xiangrui Meng] address comments
      f487cb2 [Xiangrui Meng] add unit tests
      62fc43c [Xiangrui Meng] implement save/load for MFM