  1. Apr 14, 2015
    • [SPARK-6894] spark.executor.extraLibraryOptions => spark.executor.extraLibraryPath · f63b44a5
      WangTaoTheTonic authored
      https://issues.apache.org/jira/browse/SPARK-6894
      
      cc vanzin
      
      Author: WangTaoTheTonic <wangtao111@huawei.com>
      
      Closes #5506 from WangTaoTheTonic/SPARK-6894 and squashes the following commits:
      
      4b7ced7 [WangTaoTheTonic] spark.executor.extraLibraryOptions => spark.executor.extraLibraryPath
      f63b44a5
    • [SPARK-6081] Support fetching http/https uris in driver runner. · 320bca45
      Timothy Chen authored
      Currently, if passed URIs with schemes such as http/https, the driver runner is unable to fetch them, as it only calls HadoopFs get.
      This fix utilizes the existing util method to fetch remote uris as well.
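
      A minimal sketch of the idea (names are illustrative, not the actual Spark utility): dispatch on the URI scheme and fall back to the Hadoop FileSystem only for non-HTTP(S) schemes.

      ```scala
      import java.net.URI

      // Hypothetical dispatcher: HTTP(S)/FTP uris go through the generic
      // download utility; everything else still goes through Hadoop FileSystem.
      def fetchUri(uri: String): Unit =
        Option(new URI(uri).getScheme).getOrElse("file") match {
          case "http" | "https" | "ftp" =>
            println(s"fetching $uri with the shared download utility")
          case _ =>
            println(s"fetching $uri via Hadoop FileSystem.get")
        }
      ```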
      
      Author: Timothy Chen <tnachen@gmail.com>
      
      Closes #4832 from tnachen/driver_remote and squashes the following commits:
      
      aa52cd6 [Timothy Chen] Support fetching remote uris in driver runner.
      320bca45
    • SPARK-6878 [CORE] Fix for sum on empty RDD fails with exception · 51b306b9
      Erik van Oosten authored
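
      A minimal illustration of the fixed behavior (assumed from the title and the exact-equality test in the commits below): summing an empty RDD should return 0.0 instead of throwing.

      ```scala
      import org.apache.spark.SparkContext

      val sc = new SparkContext("local[1]", "sum-demo")
      val empty = sc.parallelize(Seq.empty[Double])
      assert(empty.sum() == 0.0) // previously failed with an exception
      sc.stop()
      ```
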
      Author: Erik van Oosten <evanoosten@ebay.com>
      
      Closes #5489 from erikvanoosten/master and squashes the following commits:
      
      1c91954 [Erik van Oosten] Rewrote double range matcher to an exact equality assert (SPARK-6878)
      f1708c9 [Erik van Oosten] Fix for sum on empty RDD fails with exception (SPARK-6878)
      51b306b9
    • [SPARK-6731] Bump version of apache commons-math3 · 628a72f7
      Punyashloka Biswal authored
      Version 3.1.1 is two years old and the newer version includes
      approximate percentile statistics (among other things).
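
      For example, newer commons-math3 releases ship a streaming approximate-percentile estimator (a sketch, assuming the `PSquarePercentile` class introduced in 3.2):

      ```scala
      import org.apache.commons.math3.stat.descriptive.rank.PSquarePercentile

      val median = new PSquarePercentile(50.0) // estimate the 50th percentile
      (1 to 1000).foreach(i => median.increment(i.toDouble))
      println(median.getResult) // approximately 500
      ```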
      
      Author: Punyashloka Biswal <punya.biswal@gmail.com>
      
      Closes #5380 from punya/patch-1 and squashes the following commits:
      
      226622b [Punyashloka Biswal] Bump version of apache commons-math3
      628a72f7
    • [WIP][HOTFIX][SPARK-4123]: Fix bug in PR dependency (all deps. removed issue) · 77eeb10f
      Brennon York authored
      We're seeing a sporadic bug in the new PR dependency comparison test whereby it reports that *all* dependencies are removed. This happens when the current PR is built but the final, sorted dependency file is left blank. I believe this is an error either in the way the `git checkout` calls have been made or within the `mvn` build for that PR (again, likely related to the `git checkout`). As such, I've set the checkouts to force (with the `-f` flag), which is more in line with what Jenkins currently does on the initial checkout.
      
      Setting this as a WIP for now to trigger the build process many times and see if the issue still arises.
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #5443 from brennonyork/HOTFIX2-SPARK-4123 and squashes the following commits:
      
      f2186be [Brennon York] added output for the various git commit refs
      3f073d6 [Brennon York] removed the git checkouts piping to dev null
      07765a6 [Brennon York] updated the diff logic to reference the filenames rather than hardlink
      e3f63c7 [Brennon York] added '-f' to the checkout flags for git
      710c8d1 [Brennon York] added 30 minutes to the test benchmark
      77eeb10f
  2. Apr 13, 2015
    • [SPARK-5957][ML] better handling of parameters · 971b95b0
      Xiangrui Meng authored
      The design doc was posted on the JIRA page. Python changes will be in a follow-up PR. jkbradley
      
      1. Use codegen for shared params.
      2. Move shared params to package `ml.param.shared`.
      3. Set default values in `Params` instead of in `Param` (see the sketch below).
      4. Add a few methods to `Params` and `ParamMap`.
      5. Move schema handling to `SchemaUtils` from `Params`.
      
      - [x] check visibility of the methods added
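
      A self-contained sketch of the pattern in item 3 (names simplified and hypothetical; not the exact spark.ml API): defaults live in the `Params` container rather than in each `Param`.

      ```scala
      class Param[T](val name: String, val doc: String)

      trait Params {
        private val defaults = scala.collection.mutable.Map.empty[Param[_], Any]
        protected def setDefault[T](param: Param[T], value: T): Unit =
          defaults(param) = value
        def getDefault[T](param: Param[T]): Option[T] =
          defaults.get(param).map(_.asInstanceOf[T])
      }

      class LogRegParams extends Params {
        val maxIter = new Param[Int]("maxIter", "maximum number of iterations")
        setDefault(maxIter, 100) // the default is set by the owner, not the Param
      }
      ```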
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5431 from mengxr/SPARK-5957 and squashes the following commits:
      
      d19236d [Xiangrui Meng] fix test
      26ae2d7 [Xiangrui Meng] re-gen code and mark clear protected
      38b78c7 [Xiangrui Meng] update Param.toString and remove Params.explain()
      409e2d5 [Xiangrui Meng] address comments
      2d637bd [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5957
      eec2264 [Xiangrui Meng] make get* public in Params
      4090d95 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5957
      4fee9e7 [Xiangrui Meng] re-gen shared params
      2737c2d [Xiangrui Meng] rename SharedParamCodeGen to SharedParamsCodeGen
      e938f81 [Xiangrui Meng] update code to set default parameter values
      28ed322 [Xiangrui Meng] merge master
      55be1f3 [Xiangrui Meng] merge master
      d63b5cc [Xiangrui Meng] fix examples
      29b004c [Xiangrui Meng] update ParamsSuite
      94fd98e [Xiangrui Meng] fix explain params
      48d0e84 [Xiangrui Meng] add remove and update explainParams
      4ac6348 [Xiangrui Meng] move schema utils to SchemaUtils add a few methods to Params
      0d9594e [Xiangrui Meng] add getOrElse to ParamMap
      eeeffe8 [Xiangrui Meng] map ++ paramMap => extractValues
      0d3fc5b [Xiangrui Meng] setDefault after param
      a9dbf59 [Xiangrui Meng] minor updates
      d9302b8 [Xiangrui Meng] generate default values
      1c72579 [Xiangrui Meng] pass test compile
      abb7a3b [Xiangrui Meng] update default values handling
      dcab97a [Xiangrui Meng] add codegen for shared params
      971b95b0
    • [Minor][SparkR] Minor refactor and removes redundancy related to cleanClosure. · 0ba3fdd5
      hlin09 authored
      1. Only use `cleanClosure` in the creation of RRDDs. Normally, users and developers do not need to call `cleanClosure` in their function definitions.
      2. Removes redundant code (e.g. unnecessary wrapper functions) related to `cleanClosure`.
      
      Author: hlin09 <hlin09pu@gmail.com>
      
      Closes #5495 from hlin09/cleanClosureFix and squashes the following commits:
      
      74ec303 [hlin09] Minor refactor and removes redundancy.
      0ba3fdd5
    • [SPARK-5794] [SQL] fix add jar · b45059d0
      Daoyuan Wang authored
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #4586 from adrian-wang/addjar and squashes the following commits:
      
      efdd602 [Daoyuan Wang] move jar to another place
      6c707e8 [Daoyuan Wang] restrict hive version for test
      32c4fb8 [Daoyuan Wang] fix style and add a test
      9957d87 [Daoyuan Wang] use sessionstate classloader in makeRDDforTable
      0810e71 [Daoyuan Wang] remove variable substitution
      1898309 [Daoyuan Wang] fix classnotfound
      95a40da [Daoyuan Wang] support env argus in add jar, and set add jar ret to 0
      b45059d0
    • [SQL] [Minor] Fix for SqlApp.scala · 3782e1f2
      Fei Wang authored
      SqlApp.scala is out of date.
      
      Author: Fei Wang <wangfei1@huawei.com>
      
      Closes #5485 from scwf/patch-1 and squashes the following commits:
      
      6f731c2 [Fei Wang] SqlApp.scala compile error
      3782e1f2
    • [Spark-4848] Allow different Worker configurations in standalone cluster · 435b8779
      Nathan Kronenfeld authored
      This refixes #3699 with the latest code.
      This fixes SPARK-4848
      
      I've changed the stand-alone cluster scripts to allow different workers to have different numbers of instances, with both the port and web-UI port following along appropriately.
      
      I did this by moving the loop over instances from start-slaves and stop-slaves (on the master) to start-slave and stop-slave (on the worker).
      
      While I was at it, I changed SPARK_WORKER_PORT to work the same way as SPARK_WORKER_WEBUI_PORT, since the new methods work fine for both.
      
      Author: Nathan Kronenfeld <nkronenfeld@oculusinfo.com>
      
      Closes #5140 from nkronenfeld/feature/spark-4848 and squashes the following commits:
      
      cf5f47e [Nathan Kronenfeld] Merge remote branch 'upstream/master' into feature/spark-4848
      044ca6f [Nathan Kronenfeld] Documentation and formatting as requested by by andrewor14
      d739640 [Nathan Kronenfeld] Move looping through instances from the master to the workers, so that each worker respects its own number of instances and web-ui port
      435b8779
    • [SPARK-6877][SQL] Add code generation support for Min · 4898dfa4
      Liang-Chi Hsieh authored
      Currently `min` is not supported in code generation. This PR adds support for it.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5487 from viirya/add_min_codegen and squashes the following commits:
      
      0ddec23 [Liang-Chi Hsieh] Add code generation support for Min.
      4898dfa4
    • [SPARK-6303][SQL] Remove unnecessary Average in GeneratedAggregate · 5b8b324f
      Liang-Chi Hsieh authored
      Because `Average` is a `PartialAggregate`, we never get an `Average` node when reaching `HashAggregation` to prepare `GeneratedAggregate`.
      
      That is why SQLQuerySuite already has a passing test for `avg` with codegen.
      
      However, `GeneratedAggregate` still contains a case that deals with `Average`; based on the above, that case is never executed.
      
      So we can remove this case from `GeneratedAggregate`.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #4996 from viirya/add_average_codegened and squashes the following commits:
      
      621c12f [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_average_codegened
      368cfbc [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_average_codegened
      74926d1 [Liang-Chi Hsieh] Add Average in canBeCodeGened lists.
      5b8b324f
    • [SPARK-6881][SparkR] Changes the checkpoint directory name. · d7f2c198
      hlin09 authored
      Author: hlin09 <hlin09pu@gmail.com>
      
      Closes #5493 from hlin09/fixCheckpointDir and squashes the following commits:
      
      e67fc40 [hlin09] Change to temp dir.
      1f7ed9e [hlin09] Change the checkpoint dir name.
      d7f2c198
    • [SPARK-5931][CORE] Use consistent naming for time properties · c4ab255e
      Ilya Ganelin authored
      I've added new utility methods to convert times specified as e.g. 120s, 240ms, 360us into a consistent internal representation, and I've updated usages of these constants throughout the code to be consistent.
      
      I believe I've captured all usages of time-based properties throughout the code. I've also updated variable names in a number of places to reflect their units for clarity and updated documentation where appropriate.
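
      A rough sketch of such a conversion (method name and regex are illustrative; the real helpers were added to Utils/JavaUtils):

      ```scala
      import java.util.concurrent.TimeUnit

      // Parse strings like "120s", "240ms", "360us" into milliseconds.
      def parseTimeAsMs(str: String): Long = {
        val re = """(\d+)\s*(us|ms|s|m|min|h|d)?""".r
        str.trim.toLowerCase match {
          case re(num, suffix) =>
            val unit = Option(suffix).getOrElse("ms") match {
              case "us"        => TimeUnit.MICROSECONDS
              case "ms"        => TimeUnit.MILLISECONDS
              case "s"         => TimeUnit.SECONDS
              case "m" | "min" => TimeUnit.MINUTES
              case "h"         => TimeUnit.HOURS
              case "d"         => TimeUnit.DAYS
            }
            TimeUnit.MILLISECONDS.convert(num.toLong, unit)
          case _ =>
            throw new NumberFormatException(s"Invalid time string: $str")
        }
      }

      // parseTimeAsMs("120s") == 120000L; parseTimeAsMs("240ms") == 240L
      ```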
      
      Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
      Author: Ilya Ganelin <ilganeli@gmail.com>
      
      Closes #5236 from ilganeli/SPARK-5931 and squashes the following commits:
      
      4526c81 [Ilya Ganelin] Update configuration.md
      de3bff9 [Ilya Ganelin] Fixing style errors
      f5fafcd [Ilya Ganelin] Doc updates
      951ca2d [Ilya Ganelin] Made the most recent round of changes
      bc04e05 [Ilya Ganelin] Minor fixes and doc updates
      25d3f52 [Ilya Ganelin] Minor nit fixes
      642a06d [Ilya Ganelin] Fixed logic for invalid suffixes and addid matching test
      8927e66 [Ilya Ganelin] Fixed handling of -1
      69fedcc [Ilya Ganelin] Added test for zero
      dc7bd08 [Ilya Ganelin] Fixed error in exception handling
      7d19cdd [Ilya Ganelin] Added fix for possible NPE
      6f651a8 [Ilya Ganelin] Now using regexes to simplify code in parseTimeString. Introduces getTimeAsSec and getTimeAsMs methods in SparkConf. Updated documentation
      cbd2ca6 [Ilya Ganelin] Formatting error
      1a1122c [Ilya Ganelin] Formatting fixes and added m for use as minute formatter
      4e48679 [Ilya Ganelin] Fixed priority order and mixed up conversions in a couple spots
      d4efd26 [Ilya Ganelin] Added time conversion for yarn.scheduler.heartbeat.interval-ms
      cbf41db [Ilya Ganelin] Got rid of thrown exceptions
      1465390 [Ilya Ganelin] Nit
      28187bf [Ilya Ganelin] Convert straight to seconds
      ff40bfe [Ilya Ganelin] Updated tests to fix small bugs
      19c31af [Ilya Ganelin] Added cleaner computation of time conversions in tests
      6387772 [Ilya Ganelin] Updated suffix handling to handle overlap of units more gracefully
      5193d5f [Ilya Ganelin] Resolved merge conflicts
      76cfa27 [Ilya Ganelin] [SPARK-5931] Minor nit fixes'
      bf779b0 [Ilya Ganelin] Special handling of overlapping usffixes for java
      dd0a680 [Ilya Ganelin] Updated scala code to call into java
      b2fc965 [Ilya Ganelin] replaced get or default since it's not present in this version of java
      39164f9 [Ilya Ganelin] [SPARK-5931] Updated Java conversion to be similar to scala conversion. Updated conversions to clean up code a little using TimeUnit.convert. Added Unit tests
      3b126e1 [Ilya Ganelin] Fixed conversion to US from seconds
      1858197 [Ilya Ganelin] Fixed bug where all time was being converted to us instead of the appropriate units
      bac9edf [Ilya Ganelin] More whitespace
      8613631 [Ilya Ganelin] Whitespace
      1c0c07c [Ilya Ganelin] Updated Java code to add day, minutes, and hours
      647b5ac [Ilya Ganelin] Udpated time conversion to use map iterator instead of if fall through
      70ac213 [Ilya Ganelin] Fixed remaining usages to be consistent. Updated Java-side time conversion
      68f4e93 [Ilya Ganelin] Updated more files to clean up usage of default time strings
      3a12dd8 [Ilya Ganelin] Updated host revceiver
      5232a36 [Ilya Ganelin] [SPARK-5931] Changed default behavior of time string conversion.
      499bdf0 [Ilya Ganelin] Merge branch 'SPARK-5931' of github.com:ilganeli/spark into SPARK-5931
      9e2547c [Ilya Ganelin] Reverting doc changes
      8f741e1 [Ilya Ganelin] Update JavaUtils.java
      34f87c2 [Ilya Ganelin] Update Utils.scala
      9a29d8d [Ilya Ganelin] Fixed misuse of time in streaming context test
      42477aa [Ilya Ganelin] Updated configuration doc with note on specifying time properties
      cde9bff [Ilya Ganelin] Updated spark.streaming.blockInterval
      c6a0095 [Ilya Ganelin] Updated spark.core.connection.auth.wait.timeout
      5181597 [Ilya Ganelin] Updated spark.dynamicAllocation.schedulerBacklogTimeout
      2fcc91c [Ilya Ganelin] Updated spark.dynamicAllocation.executorIdleTimeout
      6d1518e [Ilya Ganelin] Upated spark.speculation.interval
      3f1cfc8 [Ilya Ganelin] Updated spark.scheduler.revive.interval
      3352d34 [Ilya Ganelin] Updated spark.scheduler.maxRegisteredResourcesWaitingTime
      272c215 [Ilya Ganelin] Updated spark.locality.wait
      7320c87 [Ilya Ganelin] updated spark.akka.heartbeat.interval
      064ebd6 [Ilya Ganelin] Updated usage of spark.cleaner.ttl
      21ef3dd [Ilya Ganelin] updated spark.shuffle.sasl.timeout
      c9f5cad [Ilya Ganelin] Updated spark.shuffle.io.retryWait
      4933fda [Ilya Ganelin] Updated usage of spark.storage.blockManagerSlaveTimeout
      7db6d2a [Ilya Ganelin] Updated usage of spark.akka.timeout
      404f8c3 [Ilya Ganelin] Updated usage of spark.core.connection.ack.wait.timeout
      59bf9e1 [Ilya Ganelin] [SPARK-5931] Updated Utils and JavaUtils classes to add helper methods to handle time strings. Updated time strings in a few places to properly parse time
      c4ab255e
    • [SPARK-5941] [SQL] Unit Test loads the table `src` twice for leftsemijoin.q · c5602bdc
      Cheng Hao authored
      In `leftsemijoin.q`, there is a data loading command for table `sales` already, but `TestHive` also creates the table `sales`, which causes duplicated records to be inserted into `sales`.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #4506 from chenghao-intel/df_table and squashes the following commits:
      
      0be05f7 [Cheng Hao] Remove the table `sales` creating from TestHive
      c5602bdc
    • [SPARK-6872] [SQL] add copy in external sort · e63a86ab
      Daoyuan Wang authored
      We need to add a copy before calling the external sort.
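
      The pitfall being fixed, sketched with a toy mutable row (Spark SQL iterators may reuse one mutable row object, so the sorter must buffer copies):

      ```scala
      final class Row(var value: Int) { def copy(): Row = new Row(value) }

      val reused = new Row(0)
      def rows: Iterator[Row] = (1 to 3).iterator.map { i => reused.value = i; reused }

      val buffered = rows.toArray                  // three references, one object
      println(buffered.map(_.value).mkString(",")) // 3,3,3  -- wrong
      println(rows.map(_.copy()).toArray.map(_.value).mkString(",")) // 1,2,3 -- correct
      ```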
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #5481 from adrian-wang/extsort and squashes the following commits:
      
      9611586 [Daoyuan Wang] fix bug in external sort
      e63a86ab
    • [SPARK-5972] [MLlib] Cache residuals and gradient in GBT during training and validation · 2a55cb41
      MechCoder authored
      The previous PR https://github.com/apache/spark/pull/4906 helped to extract the learning curve, giving the error for each iteration. This continues that work, refactoring some code and extending the same logic to training and validation.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #5330 from MechCoder/spark-5972 and squashes the following commits:
      
      0b5d659 [MechCoder] minor
      32d409d [MechCoder] EvaluateeachIteration and training cache should follow different paths
      d542bb0 [MechCoder] Remove unused imports and docs
      58f4932 [MechCoder] Remove unpersist
      70d3b4c [MechCoder] Broadcast for each tree
      5869533 [MechCoder] Access broadcasted values locally and other minor changes
      923dbf6 [MechCoder] [SPARK-5972] Cache residuals and gradient in GBT during training and validation
      2a55cb41
    • [SQL][SPARK-6742]: Don't push down predicates which reference partition column(s) · 3a205bbd
      Yash Datta authored
      cc liancheng
      
      Author: Yash Datta <Yash.Datta@guavus.com>
      
      Closes #5390 from saucam/fpush and squashes the following commits:
      
      3f026d6 [Yash Datta] SPARK-6742: Fix scalastyle
      ce3d702 [Yash Datta] SPARK-6742: Add test case, fix scalastyle
      8592acc [Yash Datta] SPARK-6742: Don't push down predicates which reference partition column(s)
      3a205bbd
    • [SPARK-6130] [SQL] support if not exists for insert overwrite into partition in hiveQl · 85ee0cab
      Daoyuan Wang authored
      Standard syntax:
      INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
      INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
       
      Hive extension (multiple inserts):
      FROM from_statement
      INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
      [INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
      [INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
      FROM from_statement
      INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
      [INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
      [INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;
       
      Hive extension (dynamic partition inserts):
      INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
      INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #4865 from adrian-wang/insertoverwrite and squashes the following commits:
      
      2fce94f [Daoyuan Wang] add assert
      10ea6f3 [Daoyuan Wang] add name for boolean parameter
      0bbe9b9 [Daoyuan Wang] fix failure
      4391154 [Daoyuan Wang] support if not exists for insert overwrite into partition in hiveQl
      85ee0cab
    • [SPARK-5988][MLlib] add save/load for PowerIterationClusteringModel · 1e340c3a
      Xusen Yin authored
      See JIRA issue [SPARK-5988](https://issues.apache.org/jira/browse/SPARK-5988).
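
      A usage sketch of the round trip this adds (the path is a placeholder):

      ```scala
      import org.apache.spark.SparkContext
      import org.apache.spark.mllib.clustering.PowerIterationClusteringModel

      def roundTrip(sc: SparkContext,
                    model: PowerIterationClusteringModel): PowerIterationClusteringModel = {
        model.save(sc, "/tmp/pic-model") // writes model metadata and assignments
        PowerIterationClusteringModel.load(sc, "/tmp/pic-model")
      }
      ```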
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #5450 from yinxusen/SPARK-5988 and squashes the following commits:
      
      cb1ecfa [Xusen Yin] change Assignment into case class
      b1dd24c [Xusen Yin] add test suite
      63c3923 [Xusen Yin] add save load for power iteration clustering
      1e340c3a
    • [SPARK-6662][YARN] Allow variable substitution in spark.yarn.historyServer.address · 6cc5b3ed
      Cheolsoo Park authored
      In Spark on YARN, an explicit hostname and port number need to be set for "spark.yarn.historyServer.address" in SparkConf to make the HISTORY link work. If the history server address is known and static, this is usually not a problem.
      
      But in the cloud, that is usually not true. In EMR in particular, the history server always runs on the same node as the RM. So I could simply set it to ${yarn.resourcemanager.hostname}:18080 if variable substitution were allowed.
      
      In fact, Hadoop's configuration already implements variable substitution, so if this property is read via YarnConf, this is easily achievable.
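
      A small sketch of that Hadoop behavior (reading through a Configuration expands `${...}` references to other properties):

      ```scala
      import org.apache.hadoop.yarn.conf.YarnConfiguration

      val conf = new YarnConfiguration()
      conf.set("yarn.resourcemanager.hostname", "rm.example.com")
      conf.set("spark.yarn.historyServer.address", "${yarn.resourcemanager.hostname}:18080")
      println(conf.get("spark.yarn.historyServer.address")) // rm.example.com:18080
      ```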
      
      Author: Cheolsoo Park <cheolsoop@netflix.com>
      
      Closes #5321 from piaozhexiu/SPARK-6662 and squashes the following commits:
      
      e37de75 [Cheolsoo Park] Preserve the space between the Hadoop and Spark imports
      79757c6 [Cheolsoo Park] Incorporate review comments
      10e2917 [Cheolsoo Park] Add helper function that substitutes hadoop vars to SparkHadoopUtil
      589b52c [Cheolsoo Park] Revert "Allow variable substitution for spark.yarn. properties"
      ff9c35d [Cheolsoo Park] Allow variable substitution for spark.yarn. properties
      6cc5b3ed
    • [SPARK-6765] Enable scalastyle on test code. · c5b0b296
      Reynold Xin authored
      Turn scalastyle on for all test code. Most of the violations have been resolved in my previous pull requests:
      
      Core: https://github.com/apache/spark/pull/5484
      SQL: https://github.com/apache/spark/pull/5412
      MLlib: https://github.com/apache/spark/pull/5411
      GraphX: https://github.com/apache/spark/pull/5410
      Streaming: https://github.com/apache/spark/pull/5409
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5486 from rxin/test-style-enable and squashes the following commits:
      
      01683de [Reynold Xin] Fixed new code.
      a4ab46e [Reynold Xin] Fixed tests.
      20adbc8 [Reynold Xin] Missed one violation.
      5e36521 [Reynold Xin] [SPARK-6765] Enable scalastyle on test code.
      c5b0b296
    • [SPARK-6207] [YARN] [SQL] Adds delegation tokens for metastore to conf. · 77620be7
      Doug Balog authored
      Adds hive2-metastore delegation token to conf when running in secure mode.
      Without this change, running on YARN in cluster mode fails with a
      GSS exception.
      
      This is a rough patch that adds a dependency to spark/yarn on hive-exec.
      I'm looking for suggestions on how to make this patch better.
      
      This contribution is my original work and I license the work to the
      Apache Spark project under the project's open source licenses.
      
      Author: Doug Balog <doug.balogtarget.com>
      
      Author: Doug Balog <doug.balog@target.com>
      
      Closes #5031 from dougb/SPARK-6207 and squashes the following commits:
      
      3e9ac16 [Doug Balog] [SPARK-6207] Fixes minor code spacing issues.
      e260765 [Doug Balog] [SPARK-6207] Second pass at adding Hive delegation token to conf. - Use reflection instead of adding dependency on hive. - Tested on Hive 0.13 and Hadoop 2.4.1
      1ab1729 [Doug Balog] Merge branch 'master' of git://github.com/apache/spark into SPARK-6207
      bf356d2 [Doug Balog] [SPARK-6207] [YARN] [SQL] Adds delegation tokens for metastore to conf. Adds hive2-metastore delagations token to conf when running in securemode. Without this change, runing on YARN in cluster mode fails with a GSS exception.
      77620be7
    • [SPARK-6352] [SQL] Add DirectParquetOutputCommitter · b29663ee
      Pei-Lun Lee authored
      Add a DirectParquetOutputCommitter class that skips the _temporary directory when saving to S3. Add a new config value "spark.sql.parquet.useDirectParquetOutputCommitter" (default false) to choose between it and the default output committer.
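
      A usage sketch (flag name as given above; DataFrame API of the 1.3 era):

      ```scala
      import org.apache.spark.sql.{DataFrame, SQLContext}

      def writeDirect(sqlContext: SQLContext, df: DataFrame, path: String): Unit = {
        sqlContext.setConf("spark.sql.parquet.useDirectParquetOutputCommitter", "true")
        df.saveAsParquetFile(path) // no _temporary directory, which helps on S3
      }
      ```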
      
      Author: Pei-Lun Lee <pllee@appier.com>
      
      Closes #5042 from ypcat/spark-6352 and squashes the following commits:
      
      e17bf47 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352
      9ae7545 [Pei-Lun Lee] [SPARL-6352] [SQL] Change to allow custom parquet output committer.
      0d540b9 [Pei-Lun Lee] [SPARK-6352] [SQL] add license
      c42468c [Pei-Lun Lee] [SPARK-6352] [SQL] add test case
      0fc03ca [Pei-Lun Lee] [SPARK-6532] [SQL] hide class DirectParquetOutputCommitter
      769bd67 [Pei-Lun Lee] DirectParquetOutputCommitter
      f75e261 [Pei-Lun Lee] DirectParquetOutputCommitter
      b29663ee
    • [SPARK-6870][Yarn] Catch InterruptedException when yarn application state monitor thread is interrupted · 202ebf06
      linweizhong authored
      In PR #5305 we interrupt the monitor thread but forgot to catch the resulting InterruptedException; its stack trace is then printed to the log, so we need to catch it.
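
      A sketch of the fix (simplified; the real monitor polls the YARN application report):

      ```scala
      val monitorThread = new Thread("yarn-app-state-monitor") {
        override def run(): Unit =
          try {
            while (!isInterrupted) {
              // poll the yarn application state here
              Thread.sleep(1000)
            }
          } catch {
            case _: InterruptedException => // expected on shutdown; no stack trace logged
          }
      }
      ```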
      
      Author: linweizhong <linweizhong@huawei.com>
      
      Closes #5479 from Sephiroth-Lin/SPARK-6870 and squashes the following commits:
      
      f775f93 [linweizhong] Update, don't need to call Thread.currentThread() on monitor thread
      0e2ef1f [linweizhong] Update
      0d8958a [linweizhong] Update
      3513fdb [linweizhong] Catch InterruptedException
      202ebf06
    • [SPARK-6671] Add status command for spark daemons · 240ea03f
      Pradeep Chanumolu authored
      SPARK-6671
      Currently, using the spark-daemon.sh script we can start and stop the Spark daemons, but we cannot get their status. It would be nice to include a status command in the spark-daemon.sh script, through which we can know whether a Spark daemon is alive or not.
      
      Author: Pradeep Chanumolu <pchanumolu@maprtech.com>
      
      Closes #5327 from pchanumolu/master and squashes the following commits:
      
      d3a1f05 [Pradeep Chanumolu] Make status command check consistent with Stop command
      5062926 [Pradeep Chanumolu] Fix indentation in spark-daemon.sh
      3e66bc8 [Pradeep Chanumolu] SPARK-6671 : Add status command to spark daemons
      1ac3918 [Pradeep Chanumolu] Add status command to spark-daemon
      240ea03f
    • [SPARK-6440][CORE] Handle IPv6 addresses properly when constructing URI · 9d117cee
      nyaapa authored
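
      The underlying issue, sketched: IPv6 literals must be bracketed inside a URI, so naive host:port concatenation is ambiguous.

      ```scala
      import java.net.URI

      val naive = "spark://" + "2401:db00::1" + ":7077" // which colon starts the port?
      val uri = new URI("spark", null, "2401:db00::1", 7077, null, null, null)
      println(uri) // spark://[2401:db00::1]:7077 -- the constructor adds the brackets
      ```
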
      Author: nyaapa <nyaapa@gmail.com>
      
      Closes #5424 from nyaapa/master and squashes the following commits:
      
      6b717aa [nyaapa] [SPARK-6440][CORE] Remove Utils.localIpAddressHostname, Utils.localIpAddressURI and Utils.getAddressHostName; make Utils.localIpAddress private; rename Utils.localHostURI into Utils.localHostNameForURI; use Utils.localHostName in org.apache.spark.streaming.kinesis.KinesisReceiver and org.apache.spark.sql.hive.thriftserver.SparkSQLEnv
      2098081 [nyaapa] [SPARK-6440][CORE] style fixes and use getHostAddress instead of getHostName
      84763d7 [nyaapa] [SPARK-6440][CORE]Handle IPv6 addresses properly when constructing URI
      9d117cee
    • [SPARK-6860][Streaming][WebUI] Fix the possible inconsistency of StreamingPage · 14ce3ea2
      zsxwing authored
      Because `StreamingPage.render` doesn't hold the `listener` lock when generating the content, the different parts of the content may show inconsistent values if `listener` updates its status at the same time, which can confuse people.
      
      This PR added `listener.synchronized` to make sure we have a consistent view of StreamingJobProgressListener when creating the content.
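
      The shape of the fix, sketched:

      ```scala
      // Build the whole page from one consistent snapshot of the listener.
      def render(listener: AnyRef): String = listener.synchronized {
        // read batch counts, delays, etc. here, all under the same lock
        "generated content"
      }
      ```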
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5470 from zsxwing/SPARK-6860 and squashes the following commits:
      
      cec6f92 [zsxwing] Add missing 'synchronized' in StreamingJobProgressListener
      7182498 [zsxwing] Add synchronized to make sure we have a consistent view of StreamingJobProgressListener when creating the content
      14ce3ea2
    • [SPARK-6762] Fix potential resource leaks in Checkpoint, CheckpointWriter and CheckpointReader · cadd7d72
      lisurprise authored
      The close action should be placed within a finally block to avoid potential resource leaks.
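
      A sketch of the pattern the fix adopts (the actual change uses Spark's `tryWithSafeFinally` helper):

      ```scala
      import java.io.FileOutputStream

      def writeCheckpoint(path: String, bytes: Array[Byte]): Unit = {
        val out = new FileOutputStream(path)
        try out.write(bytes)
        finally out.close() // runs even if write throws, avoiding the leak
      }
      ```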
      
      Author: lisurprise <zhichao.li@intel.com>
      
      Closes #5407 from zhichao-li/master and squashes the following commits:
      
      065999f [lisurprise] add guard for null
      ef862d6 [lisurprise] remove fs.close
      a754adc [lisurprise] refactor with tryWithSafeFinally
      824adb3 [lisurprise] close before validation
      c877da7 [lisurprise] Fix potential resource leaks
      cadd7d72
    • [SPARK-6868][YARN] Fix broken container log link on executor page when HTTPS_ONLY. · 950645d5
      Dean Chen authored
      Correct the HTTP scheme in the YARN container log link in the Spark UI when YARN is configured to be HTTPS_ONLY.
      
      Uses the same logic as the YARN jobtracker webapp. Entry point is [JobBlock](https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/JobBlock.java#L108) and logic is in [MRWebAppUtil](https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapreduce/v2/util/MRWebAppUtil.java#L75).
      
      I chose to migrate the logic over instead of importing MRWebAppUtil (but I can update the PR to do so), since the class is designated as private and the logic is straightforward.
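
      The migrated logic, sketched (policy string as used by YARN's `yarn.http.policy`):

      ```scala
      def yarnLogUrlScheme(httpPolicy: String): String =
        if (httpPolicy == "HTTPS_ONLY") "https://" else "http://"

      // yarnLogUrlScheme("HTTPS_ONLY") + host + ":" + port + containerPath
      ```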
      
      Author: Dean Chen <deanchen5@gmail.com>
      
      Closes #5477 from deanchen/container-url and squashes the following commits:
      
      91d3090 [Dean Chen] Correct http schema in YARN container log link in Spark UI when container logs when YARN is configured to be HTTPS_ONLY.
      950645d5
    • [SPARK-6562][SQL] DataFrame.replace · 68d1faa3
      Reynold Xin authored
      Supports replacing values with other values in DataFrames.
      
      Python support should be in a separate pull request.
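
      A usage sketch (signature assumed from the era's `DataFrameNaFunctions`):

      ```scala
      import org.apache.spark.sql.DataFrame

      // Replace the sentinel value -1 with 0 in the "age" column.
      def cleanAges(df: DataFrame): DataFrame =
        df.na.replace("age", Map(-1 -> 0))
      ```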
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5282 from rxin/df-na-replace and squashes the following commits:
      
      4b72434 [Reynold Xin] Removed println.
      c8d9946 [Reynold Xin] col -> cols
      fbb3c21 [Reynold Xin] [SPARK-6562][SQL] DataFrame.replace
      68d1faa3
    • [SPARK-5885][MLLIB] Add VectorAssembler as a feature transformer · 92940449
      Xiangrui Meng authored
      VectorAssembler merges multiple columns into a vector column. This PR contains content from #5195.
      
      ~~carry ML attributes~~ (moved to a follow-up PR)
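
      A usage sketch of the new transformer, assuming an input DataFrame `df` with the named columns:

      ```scala
      import org.apache.spark.ml.feature.VectorAssembler

      val assembler = new VectorAssembler()
        .setInputCols(Array("age", "income", "score"))
        .setOutputCol("features")
      // assembler.transform(df) appends a vector column named "features"
      ```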
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5196 from mengxr/SPARK-5885 and squashes the following commits:
      
      a52b101 [Xiangrui Meng] recognize more types
      35daac2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5885
      bb5e64b [Xiangrui Meng] add TODO for null
      976a3d6 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5885
      0859311 [Xiangrui Meng] Revert "add CreateStruct"
      29fb6ac [Xiangrui Meng] use CreateStruct
      adb71c4 [Xiangrui Meng] Merge branch 'SPARK-6542' into SPARK-5885
      85f3106 [Xiangrui Meng] add CreateStruct
      4ff16ce [Xiangrui Meng] add VectorAssembler
      92940449
    • [SPARK-5886][ML] Add StringIndexer as a feature transformer · 685ddcf5
      Xiangrui Meng authored
      This PR adds string indexer, which takes a column of string labels and outputs a double column with labels indexed by their frequency.
      
      TODOs:
      - [x] store feature to index map in output metadata
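
      A usage sketch, assuming an input DataFrame `df` with a string column "category" (the most frequent label gets index 0.0):

      ```scala
      import org.apache.spark.ml.feature.StringIndexer

      val indexer = new StringIndexer()
        .setInputCol("category")
        .setOutputCol("categoryIndex")
      // indexer.fit(df).transform(df) adds a double-typed "categoryIndex" column
      ```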
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4735 from mengxr/SPARK-5886 and squashes the following commits:
      
      d82575f [Xiangrui Meng] fix test
      700e70f [Xiangrui Meng] rename LabelIndexer to StringIndexer
      16a6f8c [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5886
      457166e [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5886
      f8b30f4 [Xiangrui Meng] update label indexer to output metadata
      e81ec28 [Xiangrui Meng] Merge branch 'openhashmap-contains' into SPARK-5886-2
      d6e6f1f [Xiangrui Meng] add contains to primitivekeyopenhashmap
      748a69b [Xiangrui Meng] add contains to OpenHashMap
      def3c5c [Xiangrui Meng] add LabelIndexer
      685ddcf5
    • [SPARK-4081] [mllib] VectorIndexer · d3792f54
      Joseph K. Bradley authored
      **Ready for review!**
      
      Since the original PR, I moved the code to the spark.ml API and renamed this to VectorIndexer.
      
      This introduces a VectorIndexer class (usage sketched below) which does the following:
      * VectorIndexer.fit(): collect statistics about how many values each feature in a dataset (RDD[Vector]) can take (limited by maxCategories)
        * Features which exceed maxCategories are declared continuous, and the Model will treat them as such.
      * VectorIndexerModel.transform(): Convert categorical feature values to corresponding 0-based indices
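
      A usage sketch, assuming an input DataFrame `df` with a vector column "features":

      ```scala
      import org.apache.spark.ml.feature.VectorIndexer

      val indexer = new VectorIndexer()
        .setInputCol("features")
        .setOutputCol("indexedFeatures")
        .setMaxCategories(10) // features with more distinct values stay continuous
      // val model = indexer.fit(df); model.transform(df)
      ```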
      
      Design notes:
      * This maintains sparsity in vectors by ensuring that categorical feature value 0.0 gets index 0.
      * This does not yet support transforming data with new (unknown) categorical feature values.  That can be added later.
      * This is necessary for DecisionTree and tree ensembles.
      
      Reviewers: Please check my use of metadata and my unit tests for it; I'm not sure if I covered everything in the tests.
      
      Other notes:
      * This also adds a public toMetadata method to AttributeGroup (for simpler construction of metadata).
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #3000 from jkbradley/indexer and squashes the following commits:
      
      5956d91 [Joseph K. Bradley] minor cleanups
      f5c57a8 [Joseph K. Bradley] added Java test suite
      643b444 [Joseph K. Bradley] removed FeatureTests
      02236c3 [Joseph K. Bradley] Updated VectorIndexer, ready for PR
      286d221 [Joseph K. Bradley] Reworked DatasetIndexer for spark.ml API, and renamed it to VectorIndexer
      12e6cf2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into indexer
      6d8f3f1 [Joseph K. Bradley] Added partly done DatasetIndexer to spark.ml
      6a2f553 [Joseph K. Bradley] Updated TODO for allowUnknownCategories
      3f041f8 [Joseph K. Bradley] Final cleanups for DatasetIndexer
      038b9e3 [Joseph K. Bradley] DatasetIndexer now maintains sparsity in SparseVector
      3a4a0bd [Joseph K. Bradley] Added another test for DatasetIndexer
      2006923 [Joseph K. Bradley] DatasetIndexer now passes tests
      f409987 [Joseph K. Bradley] partly done with DatasetIndexerSuite
      5e7c874 [Joseph K. Bradley] working on DatasetIndexer
      d3792f54
    • [SPARK-6643][MLLIB] Implement StandardScalerModel missing methods · fc176614
      lewuathe authored
      This is a sub-task of SPARK-6254.
      Wrap missing methods for `StandardScalerModel`.
      
      Author: lewuathe <lewuathe@me.com>
      
      Closes #5310 from Lewuathe/SPARK-6643 and squashes the following commits:
      
      fafd690 [lewuathe] Fix for lint-python
      bd31a64 [lewuathe] Merge branch 'master' into SPARK-6643
      578f5ee [lewuathe] Remove unnecessary class
      a38f155 [lewuathe] Merge master
      66bb2ab [lewuathe] Fix typos
      82683a0 [lewuathe] [SPARK-6643] Implement StandardScalerModel missing methods
      fc176614
  3. Apr 12, 2015
    • [SPARK-6765] Fix test code style for core. · a1fe59da
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5484 from rxin/test-style-core and squashes the following commits:
      
      e0b0100 [Reynold Xin] [SPARK-6765] Fix test code style for core.
      a1fe59da
    • [MINOR] a typo: coalesce · 04bcd67c
      Daoyuan Wang authored
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #5482 from adrian-wang/typo and squashes the following commits:
      
      e65ef6f [Daoyuan Wang] typo
      04bcd67c
    • [SPARK-6431][Streaming][Kafka] Error message for partition metadata requests · 6ac8eea2
      cody koeninger authored
      
      The originally reported problem was misdiagnosed; the topic just didn't exist yet. The agreed-upon solution was to improve the error handling / message.
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #5454 from koeninger/spark-6431-master and squashes the following commits:
      
      44300f8 [cody koeninger] [SPARK-6431][Streaming][Kafka] Error message for partition metadata requests
      6ac8eea2
    • [SPARK-6843][core] Add volatile for the "state" · ddc17431
      lisurprise authored
      Fix potential visibility problem for the "state" of Executor
      
      The field of "state" is shared and modified by multiple threads. i.e:
      
      ```scala
      // Within ExecutorRunner.scala, "state" is touched by:

      // (1) the worker thread
      workerThread = new Thread("ExecutorRunner for " + fullId) {
        override def run() { fetchAndRunExecutor() }
      }
      workerThread.start()

      // (2) the shutdown hook that kills the process on shutdown
      shutdownHook = new Thread() {
        override def run() {
          killProcess(Some("Worker shutting down"))
        }
      }

      // (3) the actor thread for the worker
      ```
      I think we should at least add volatile to ensure visibility among threads; otherwise the worker might send an out-of-date status to the master.
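
      The fix itself is one annotation (sketched):

      ```scala
      @volatile var state: ExecutorState.Value = ExecutorState.LAUNCHING
      // writes from the worker/shutdown/actor threads are now visible to readers
      ```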
      
      https://issues.apache.org/jira/browse/SPARK-6843
      
      Author: lisurprise <zhichao.li@intel.com>
      
      Closes #5448 from zhichao-li/state and squashes the following commits:
      
      a2386e7 [lisurprise] add volatile for state field
      ddc17431
    • [SPARK-6866][Build] Remove duplicated dependency in launcher/pom.xml · e9445b18
      Guancheng (G.C.) Chen authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-6866
      
      Remove the duplicated scalatest dependency in launcher/pom.xml, since it is already inherited from the parent pom.xml.
      
      Author: Guancheng (G.C.) Chen <chenguancheng@gmail.com>
      
      Closes #5476 from gchen/SPARK-6866 and squashes the following commits:
      
      1ab484b [Guancheng (G.C.) Chen] remove duplicated dependency in launcher/pom.xml
      e9445b18