  1. Aug 06, 2014
    • Cheng Lian's avatar
      [SPARK-2678][Core][SQL] A workaround for SPARK-2678 · a6cd3110
      Cheng Lian authored
      JIRA issues:
      
      - Main: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)
      - Related: [SPARK-2874](https://issues.apache.org/jira/browse/SPARK-2874)
      
      Related PR:
      
      - #1715
      
      This PR is both a fix for SPARK-2874 and a workaround for SPARK-2678. Fixing SPARK-2678 completely requires API-level changes that need further discussion, and we decided not to include it in the Spark 1.1 release. As SPARK-2678 currently only affects Spark SQL scripts, this workaround is enough for Spark 1.1. The command line option handling logic in the bash scripts looks somewhat dirty and duplicated, but it provides a cleaner user interface and retains full backward compatibility for now.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1801 from liancheng/spark-2874 and squashes the following commits:
      
      8045d7a [Cheng Lian] Make sure test suites pass
      8493a9e [Cheng Lian] Using eval to retain quoted arguments
      aed523f [Cheng Lian] Fixed typo in bin/spark-sql
      f12a0b1 [Cheng Lian] Worked around SPARK-2678
      daee105 [Cheng Lian] Fixed usage messages of all Spark SQL related scripts
      a6cd3110
    • Davies Liu's avatar
      [SPARK-2875] [PySpark] [SQL] handle null in schemaRDD() · 48789117
      Davies Liu authored
      Handle null in schemaRDD during converting them into Python.
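      A minimal sketch of the kind of null handling described above (hypothetical helper, not PySpark's actual conversion code): each field is checked for SQL NULL and kept as Python's None instead of being handed to a type converter, which would otherwise raise.

      ```python
      # Hedged sketch: convert one row of a SchemaRDD into Python values,
      # preserving nulls. `converters` is an illustrative tuple of per-field
      # Python converters (e.g. int, str), not a real PySpark structure.
      def convert_row(row, converters):
          return tuple(None if v is None else conv(v)
                       for v, conv in zip(row, converters))

      rows = [(1, "a"), (None, "b"), (2, None)]
      converted = [convert_row(r, (int, str)) for r in rows]
      ```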
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1802 from davies/json and squashes the following commits:
      
      88e6b1f [Davies Liu] handle null in schemaRDD()
      48789117
    • Andrew Or's avatar
      [SPARK-2157] Enable tight firewall rules for Spark · 09f7e458
      Andrew Or authored
      The goal of this PR is to allow users of Spark to write tight firewall rules for their clusters. This is currently not possible because Spark uses random ports in many places, notably the communication between executors and drivers. The changes in this PR are based on top of ash211's changes in #1107.
      
      The list covered here may or may not be the complete set of ports needed for Spark to operate perfectly. However, as of the latest commit there are no known sources of random ports (except in tests). I have not documented a few of the more obscure configs.
      
      My spark-env.sh looks like this:
      ```
      export SPARK_MASTER_PORT=6060
      export SPARK_WORKER_PORT=7070
      export SPARK_MASTER_WEBUI_PORT=9090
      export SPARK_WORKER_WEBUI_PORT=9091
      ```
      and my spark-defaults.conf looks like this:
      ```
      spark.master spark://andrews-mbp:6060
      spark.driver.port 5001
      spark.fileserver.port 5011
      spark.broadcast.port 5021
      spark.replClassServer.port 5031
      spark.blockManager.port 5041
      spark.executor.port 5051
      ```
      
      Author: Andrew Or <andrewor14@gmail.com>
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #1777 from andrewor14/configure-ports and squashes the following commits:
      
      621267b [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
      8a6b820 [Andrew Or] Use a random UI port during tests
      7da0493 [Andrew Or] Fix tests
      523c30e [Andrew Or] Add test for isBindCollision
      b97b02a [Andrew Or] Minor fixes
      c22ad00 [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
      93d359f [Andrew Or] Executors connect to wrong port when collision occurs
      d502e5f [Andrew Or] Handle port collisions when creating Akka systems
      a2dd05c [Andrew Or] Patrick's comment nit
      86461e2 [Andrew Or] Remove spark.executor.env.port and spark.standalone.client.port
      1d2d5c6 [Andrew Or] Fix ports for standalone cluster mode
      cb3be88 [Andrew Or] Various doc fixes (broken link, format etc.)
      e837cde [Andrew Or] Remove outdated TODOs
      bfbab28 [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
      de1b207 [Andrew Or] Update docs to reflect new ports
      b565079 [Andrew Or] Add spark.ports.maxRetries
      2551eb2 [Andrew Or] Remove spark.worker.watcher.port
      151327a [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
      9868358 [Andrew Or] Add a few miscellaneous ports
      6016e77 [Andrew Or] Add spark.executor.port
      8d836e6 [Andrew Or] Also document SPARK_{MASTER/WORKER}_WEBUI_PORT
      4d9e6f3 [Andrew Or] Fix super subtle bug
      3f8e51b [Andrew Or] Correct erroneous docs...
      e111d08 [Andrew Or] Add names for UI services
      470f38c [Andrew Or] Special case non-"Address already in use" exceptions
      1d7e408 [Andrew Or] Treat 0 ports specially + return correct ConnectionManager port
      ba32280 [Andrew Or] Minor fixes
      6b550b0 [Andrew Or] Assorted fixes
      73fbe89 [Andrew Or] Move start service logic to Utils
      ec676f4 [Andrew Or] Merge branch 'SPARK-2157' of github.com:ash211/spark into configure-ports
      038a579 [Andrew Ash] Trust the server start function to report the port the service started on
      7c5bdc4 [Andrew Ash] Fix style issue
      0347aef [Andrew Ash] Unify port fallback logic to a single place
      24a4c32 [Andrew Ash] Remove type on val to match surrounding style
      9e4ad96 [Andrew Ash] Reformat for style checker
      5d84e0e [Andrew Ash] Document new port configuration options
      066dc7a [Andrew Ash] Fix up HttpServer port increments
      cad16da [Andrew Ash] Add fallover increment logic for HttpServer
      c5a0568 [Andrew Ash] Fix ConnectionManager to retry with increment
      b80d2fd [Andrew Ash] Make Spark's block manager port configurable
      17c79bb [Andrew Ash] Add a configuration option for spark-shell's class server
      f34115d [Andrew Ash] SPARK-1176 Add port configuration for HttpBroadcast
      49ee29b [Andrew Ash] SPARK-1174 Add port configuration for HttpFileServer
      1c0981a [Andrew Ash] Make port in HttpServer configurable
      09f7e458
    • Tathagata Das's avatar
      [SPARK-1022][Streaming][HOTFIX] Fixed zookeeper dependency of Kafka · ee7f3085
      Tathagata Das authored
      https://github.com/apache/spark/pull/1751 caused maven builds to fail.
      
      ```
      ~/Apache/spark(branch-1.1|:heavy_check_mark:) ➤ mvn -U -DskipTests clean install
      .
      .
      .
      [error] Apache/spark/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala:36: object NIOServerCnxnFactory is not a member of package org.apache.zookeeper.server
      [error] import org.apache.zookeeper.server.NIOServerCnxnFactory
      [error]        ^
      [error] Apache/spark/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala:199: not found: type NIOServerCnxnFactory
      [error]     val factory = new NIOServerCnxnFactory()
      [error]                       ^
      [error] two errors found
      [error] Compile failed at Aug 5, 2014 1:42:36 PM [0.503s]
      ```
      
      The problem is how SBT and Maven resolve multiple versions of the same library, which in this case is ZooKeeper. Observing and comparing the dependency trees from Maven and SBT showed this. Spark depends on ZooKeeper 3.4.5, whereas Apache Kafka transitively depends on ZooKeeper 3.3.4. SBT evicts 3.3.4 and uses the higher version 3.4.5, but Maven sticks to the closest (in the tree) dependent version, 3.3.4. And 3.3.4 does not have NIOServerCnxnFactory.
      
      The solution in this patch excludes ZooKeeper from the apache-kafka dependency in the streaming-kafka module so that it just inherits ZooKeeper from Spark core.
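      Such an exclusion would look roughly like the following (coordinates shown for illustration; the exact group/artifact IDs in the actual pom.xml may differ):

      ```xml
      <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka_2.10</artifactId>
        <exclusions>
          <!-- Inherit ZooKeeper 3.4.5 from Spark core instead of Kafka's 3.3.4 -->
          <exclusion>
            <groupId>org.apache.zookeeper</groupId>
            <artifactId>zookeeper</artifactId>
          </exclusion>
        </exclusions>
      </dependency>
      ```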
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #1797 from tdas/kafka-zk-fix and squashes the following commits:
      
      94b3931 [Tathagata Das] Fixed zookeeper dependency of Kafka
      ee7f3085
    • DB Tsai's avatar
      [MLlib] Use this.type as return type in k-means' builder pattern · c7b52010
      DB Tsai authored
      to ensure that the returned object is the instance itself.
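      The builder pattern in question can be illustrated with a Python analog (hypothetical class, not MLlib's actual KMeans): each setter returns the object itself so calls can be chained, which is the property Scala's `this.type` return type guarantees statically, even under subclassing.

      ```python
      # Illustrative builder: every setter returns self to enable chaining.
      class KMeansBuilder:
          def __init__(self):
              self.k = 2
              self.max_iterations = 20

          def set_k(self, k):
              self.k = k
              return self  # returning self keeps the chain going

          def set_max_iterations(self, n):
              self.max_iterations = n
              return self

      model = KMeansBuilder().set_k(10).set_max_iterations(50)
      ```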
      
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #1796 from dbtsai/dbtsai-kmeans and squashes the following commits:
      
      658989e [DB Tsai] Alpine Data Labs
      c7b52010
    • CodingCat's avatar
      SPARK-2294: fix locality inversion bug in TaskManager · 63bdb1f4
      CodingCat authored
      copied from original JIRA (https://issues.apache.org/jira/browse/SPARK-2294):
      
      If an executor E is free, a task may be speculatively assigned to E when there are other tasks in the job that have not been launched (at all) yet. Similarly, a task without any locality preferences may be assigned to E when there was another NODE_LOCAL task that could have been scheduled.
      This happens because TaskSchedulerImpl calls TaskSetManager.resourceOffer (which in turn calls TaskSetManager.findTask) with increasing locality levels, beginning with PROCESS_LOCAL, followed by NODE_LOCAL, and so on until the highest currently allowed level. Now, suppose NODE_LOCAL is the highest currently allowed locality level. The first time findTask is called, it will be called with max level PROCESS_LOCAL; if it cannot find any PROCESS_LOCAL tasks, it will try to schedule tasks with no locality preferences or speculative tasks. As a result, speculative tasks or tasks with no preferences may be scheduled instead of NODE_LOCAL tasks.
      
      ----
      
      I added an additional parameter, maxLocality, to resourceOffer and findTask, indicating when we should consider tasks without locality preferences.
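      The intended ordering can be sketched as follows (simplified Python with hypothetical structures, not Spark's actual TaskSetManager): no-preference tasks are only considered once the allowed level reaches them, after NODE_LOCAL, instead of being scheduled ahead of node-local tasks.

      ```python
      # Locality levels in the order they become allowed; NO_PREF sits after
      # NODE_LOCAL so it can no longer jump the queue.
      LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "NO_PREF", "RACK_LOCAL", "ANY"]

      def find_task(pending, max_locality):
          """Return the first pending task at or below max_locality."""
          allowed = LEVELS[: LEVELS.index(max_locality) + 1]
          for level in allowed:
              for task in pending.get(level, []):
                  return task, level
          return None, None

      pending = {"NODE_LOCAL": ["t1"], "NO_PREF": ["t2"]}
      # With max level NODE_LOCAL, the node-local task wins over the no-pref one.
      task, level = find_task(pending, "NODE_LOCAL")
      ```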
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #1313 from CodingCat/SPARK-2294 and squashes the following commits:
      
      bf3f13b [CodingCat] rollback some forgotten changes
      89f9bc0 [CodingCat] address matei's comments
      18cae02 [CodingCat] add test case for node-local tasks
      2ba6195 [CodingCat] fix failed test cases
      87dd09e [CodingCat] fix style
      9b9432f [CodingCat] remove hasNodeLocalOnlyTasks
      fdd1573 [CodingCat] fix failed test cases
      941a4fd [CodingCat] see my shocked face..........
      f600085 [CodingCat] remove hasNodeLocalOnlyTasks checking
      0b8a46b [CodingCat] test whether hasNodeLocalOnlyTasks affect the results
      73ceda8 [CodingCat] style fix
      b3a430b [CodingCat] remove fine granularity tracking for node-local only tasks
      f9a2ad8 [CodingCat] simplify the logic in TaskSchedulerImpl
      c8c1de4 [CodingCat] simplify the patch
      be652ed [CodingCat] avoid unnecessary delay when we only have nopref tasks
      dee9e22 [CodingCat] fix locality inversion bug in TaskManager by moving nopref branch
      63bdb1f4
    • Michael Armbrust's avatar
      [SQL] Fix logging warn -> debug · 5a826c00
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1800 from marmbrus/warning and squashes the following commits:
      
      8ea9cf1 [Michael Armbrust] [SQL] Fix logging warn -> debug.
      5a826c00
    • Reynold Xin's avatar
      [SQL] Tighten the visibility of various SQLConf methods and renamed setter/getters · b70bae40
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1794 from rxin/sql-conf and squashes the following commits:
      
      3ac11ef [Reynold Xin] getAllConfs return an immutable Map instead of an Array.
      4b19d6c [Reynold Xin] Tighten the visibility of various SQLConf methods and renamed setter/getters.
      b70bae40
  2. Aug 05, 2014
    • Anand Avati's avatar
      [SPARK-2806] core - upgrade to json4s-jackson 3.2.10 · 82624e2c
      Anand Avati authored
      Scala 2.11 packages are not available for the current version (3.2.6)
      
      Signed-off-by: Anand Avati <avati@redhat.com>
      
      Author: Anand Avati <avati@redhat.com>
      
      Closes #1702 from avati/SPARK-1812-json4s-jackson-3.2.10 and squashes the following commits:
      
      7be8324 [Anand Avati] SPARK-1812: core - upgrade to json4s 3.2.10
      82624e2c
    • Michael Armbrust's avatar
      [SPARK-2866][SQL] Support attributes in ORDER BY that aren't in SELECT · 1d70c4f6
      Michael Armbrust authored
      Minor refactoring to allow resolution either using a nodes input or output.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1795 from marmbrus/ordering and squashes the following commits:
      
      237f580 [Michael Armbrust] style
      74d833b [Michael Armbrust] newline
      705d963 [Michael Armbrust] Add a rule for resolving ORDER BY expressions that reference attributes not present in the SELECT clause.
      82cabda [Michael Armbrust] Generalize attribute resolution.
      1d70c4f6
    • Yin Huai's avatar
      [SPARK-2854][SQL] Finalize _acceptable_types in pyspark.sql · 69ec678d
      Yin Huai authored
      This PR aims to finalize accepted data value types in Python RDDs provided to Python `applySchema`.
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-2854
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1793 from yhuai/SPARK-2854 and squashes the following commits:
      
      32f0708 [Yin Huai] LongType only accepts long values.
      c2b23dd [Yin Huai] Do data type conversions based on the specified Spark SQL data type.
      69ec678d
    • Cheng Lian's avatar
      [SPARK-2650][SQL] Try to partially fix SPARK-2650 by adjusting initial buffer... · d0ae3f39
      Cheng Lian authored
      [SPARK-2650][SQL] Try to partially fix SPARK-2650 by adjusting initial buffer size and reducing memory allocation
      
      JIRA issue: [SPARK-2650](https://issues.apache.org/jira/browse/SPARK-2650)
      
      Please refer to [comments](https://issues.apache.org/jira/browse/SPARK-2650?focusedCommentId=14084397&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14084397) of SPARK-2650 for some other details.
      
      This PR adjusts the initial in-memory columnar buffer size to 1MB, same as the default value of Shark's `shark.column.partitionSize.mb` property when running in local mode. Will add Shark style partition size estimation in another PR.
      
      Also, before this PR, `NullableColumnBuilder` copies the whole buffer to add the null positions section, and then `CompressibleColumnBuilder` copies and compresses the buffer again, even if compression is disabled (`PassThrough` compression scheme is used to disable compression). In this PR the first buffer copy is eliminated to reduce memory consumption.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1769 from liancheng/spark-2650 and squashes the following commits:
      
      88a042e [Cheng Lian] Fixed method visibility and removed dead code
      001f2e5 [Cheng Lian] Try fixing SPARK-2650 by adjusting initial buffer size and reducing memory allocation
      d0ae3f39
    • wangfei's avatar
      [sql] rename project name in pom.xml of hive-thriftserver module · d94f5990
      wangfei authored
      The modules spark-hive-thriftserver_2.10 and spark-hive_2.10 are both named "Spark Project Hive" in pom.xml, so rename the spark-hive-thriftserver_2.10 project to "Spark Project Hive Thrift Server"
      
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #1789 from scwf/patch-1 and squashes the following commits:
      
      ca1f5e9 [wangfei] [sql] rename module name of hive-thriftserver
      d94f5990
    • Stephen Boesch's avatar
      SPARK-2869 - Fix tiny bug in JdbcRdd for closing jdbc connection · 2643e660
      Stephen Boesch authored
      I inquired on the dev mailing list about the motivation for checking the jdbc statement instead of the connection in the close() logic of JdbcRDD. Ted Yu believes there essentially is none; it is a simple cut-and-paste issue. So here is the tiny fix to patch it.
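      The fix is the kind of thing this sketch shows (hypothetical helpers, not the actual JdbcRDD code): each JDBC resource is closed under a null check on that same resource, where the bug had guarded one resource's close() with a check on the other.

      ```python
      # Each resource is checked, then closed, independently.
      def close_resources(stmt, conn):
          closed = []
          if stmt is not None:   # check the statement before closing the statement
              stmt.close()
              closed.append("stmt")
          if conn is not None:   # check the connection before closing the connection
              conn.close()
              closed.append("conn")
          return closed

      class FakeResource:
          def __init__(self):
              self.closed = False
          def close(self):
              self.closed = True

      conn = FakeResource()
      order = close_resources(None, conn)   # a null statement must not skip conn
      ```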
      
      Author: Stephen Boesch <javadba>
      Author: Stephen Boesch <javadba@gmail.com>
      
      Closes #1792 from javadba/closejdbc and squashes the following commits:
      
      363be4f [Stephen Boesch] SPARK-2869 - Fix tiny bug in JdbcRdd for closing jdbc connection (reformat with braces)
      6518d36 [Stephen Boesch] SPARK-2689 Fix tiny bug in JdbcRdd for closing jdbc connection
      3fb23ed [Stephen Boesch] SPARK-2689 Fix potential leak of connection/PreparedStatement in case of error in JdbcRDD
      095b2c9 [Stephen Boesch] Fix tiny bug (likely copy and paste error) in closing jdbc connection
      2643e660
    • Michael Giannakopoulos's avatar
      [SPARK-2550][MLLIB][APACHE SPARK] Support regularization and intercept in pyspark's linear methods · 1aad9114
      Michael Giannakopoulos authored
      Related to Jira Issue: [SPARK-2550](https://issues.apache.org/jira/browse/SPARK-2550?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20priority%20%3D%20Major%20ORDER%20BY%20key%20DESC)
      
      Author: Michael Giannakopoulos <miccagiann@gmail.com>
      
      Closes #1775 from miccagiann/linearMethodsReg and squashes the following commits:
      
      cb774c3 [Michael Giannakopoulos] MiniBatchFraction added in related PythonMLLibAPI java stubs.
      81fcbc6 [Michael Giannakopoulos] Fixing a typo-error.
      8ad263e [Michael Giannakopoulos] Adding regularizer type and intercept parameters to LogisticRegressionWithSGD and SVMWithSGD.
      1aad9114
    • Reynold Xin's avatar
      [SPARK-2503] Lower shuffle output buffer (spark.shuffle.file.buffer.kb) to 32KB. · acff9a7f
      Reynold Xin authored
      This can substantially reduce memory usage during shuffle.
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1781 from rxin/SPARK-2503-spark.shuffle.file.buffer.kb and squashes the following commits:
      
      104b8d8 [Reynold Xin] [SPARK-2503] Lower shuffle output buffer (spark.shuffle.file.buffer.kb) to 32KB.
      acff9a7f
    • Xiangrui Meng's avatar
      [SPARK-2864][MLLIB] fix random seed in word2vec; move model to local · cc491f69
      Xiangrui Meng authored
      It also moves the model to local in order to map `RDD[String]` to `RDD[Vector]`.
      
      Ishiihara
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1790 from mengxr/word2vec-fix and squashes the following commits:
      
      a87146c [Xiangrui Meng] add setters and make a default constructor
      e5c923b [Xiangrui Meng] fix random seed in word2vec; move model to local
      cc491f69
    • Thomas Graves's avatar
      SPARK-1680: use configs for specifying environment variables on YARN · 41e0a21b
      Thomas Graves authored
      Note that this also documents spark.executorEnv.*, which to me means it's public. If we don't want that, please speak up.
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #1512 from tgravescs/SPARK-1680 and squashes the following commits:
      
      11525df [Thomas Graves] more doc changes
      553bad0 [Thomas Graves] fix documentation
      152bf7c [Thomas Graves] fix docs
      5382326 [Thomas Graves] try fix docs
      32f86a4 [Thomas Graves] use configs for specifying environment variables on YARN
      41e0a21b
    • Patrick Wendell's avatar
      SPARK-2380: Support displaying accumulator values in the web UI · 74f82c71
      Patrick Wendell authored
      This patch adds support for giving accumulators user-visible names and displaying accumulator values in the web UI. This allows users to create custom counters that display in the UI. The current approach displays both the accumulator deltas caused by each task and a "current" value of the accumulator totals for each stage, which gets updated as tasks finish.
      
      Currently in Spark, developers have been extending the `TaskMetrics` functionality to provide custom instrumentation for RDDs. This provides a potentially nicer alternative: going through the existing accumulator framework (actually, `TaskMetrics` and accumulators are on an awkward collision course as we add more features to the former). The current patch demos how we can use the feature to provide instrumentation for RDD input sizes. The nice thing about going through accumulators is that users can read the current value of the data being tracked in their programs. This could be useful, e.g., to decide to short-circuit a Spark stage depending on how things are going.
      
      ![counters](https://cloud.githubusercontent.com/assets/320616/3488815/6ee7bc34-0505-11e4-84ce-e36d9886e2cf.png)
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #1309 from pwendell/metrics and squashes the following commits:
      
      8815308 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into HEAD
      93fbe0f [Patrick Wendell] Other minor fixes
      cc43f68 [Patrick Wendell] Updating unit tests
      c991b1b [Patrick Wendell] Moving some code into the Accumulators class
      9a9ba3c [Patrick Wendell] More merge fixes
      c5ace9e [Patrick Wendell] More merge conflicts
      1da15e3 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into metrics
      9860c55 [Patrick Wendell] Potential solution to posting listener events
      0bb0e33 [Patrick Wendell] Remove "display" variable and assume display = name.isDefined
      0ec4ac7 [Patrick Wendell] Java API's
      e95bf69 [Patrick Wendell] Stash
      be97261 [Patrick Wendell] Style fix
      8407308 [Patrick Wendell] Removing examples in Hadoop and RDD class
      64d405f [Patrick Wendell] Adding missing file
      5d8b156 [Patrick Wendell] Changes based on Kay's review.
      9f18bad [Patrick Wendell] Minor style changes and tests
      7a63abc [Patrick Wendell] Adding Json serialization and responding to Reynold's feedback
      ad85076 [Patrick Wendell] Example of using named accumulators for custom RDD metrics.
      0b72660 [Patrick Wendell] Initial WIP example of supporing globally named accumulators.
      74f82c71
    • Guancheng (G.C.) Chen's avatar
      [SPARK-2859] Update url of Kryo project in related docs · ac3440f4
      Guancheng (G.C.) Chen authored
      JIRA Issue: https://issues.apache.org/jira/browse/SPARK-2859
      
      Kryo project has been migrated from googlecode to github, hence we need to update its URL in related docs such as tuning.md.
      
      Author: Guancheng (G.C.) Chen <chenguancheng@gmail.com>
      
      Closes #1782 from gchen/kryo-docs and squashes the following commits:
      
      b62543c [Guancheng (G.C.) Chen] update url of Kryo project
      ac3440f4
    • Michael Armbrust's avatar
      [SPARK-2860][SQL] Fix coercion of CASE WHEN. · 6e821e3d
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1785 from marmbrus/caseNull and squashes the following commits:
      
      126006d [Michael Armbrust] better error message
      2fe357f [Michael Armbrust] Fix coercion of CASE WHEN.
      6e821e3d
    • Thomas Graves's avatar
      SPARK-1890 and SPARK-1891- add admin and modify acls · 1c5555a2
      Thomas Graves authored
      It was easier to combine these 2 jira since they touch many of the same places.  This pr adds the following:
      
      - adds modify acls
      - adds admin acls (list of admins/users that get added to both view and modify acls)
      - modify Kill button on UI to take modify acls into account
      - changes the config name from spark.ui.acls.enable to spark.acls.enable, since I chose poorly with the original name. We keep backward compatibility so people can still use spark.ui.acls.enable. The acls should apply to any web UI as well as any CLI interfaces.
      - sends view and modify acls information on to YARN so that YARN interfaces can use it (the yarn CLI for killing applications, for example).
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #1196 from tgravescs/SPARK-1890 and squashes the following commits:
      
      8292eb1 [Thomas Graves] review comments
      b92ec89 [Thomas Graves] remove unneeded variable from applistener
      4c765f4 [Thomas Graves] Add in admin acls
      72eb0ac [Thomas Graves] Add modify acls
      1c5555a2
    • Thomas Graves's avatar
      SPARK-1528 - spark on yarn, add support for accessing remote HDFS · 2c0f705e
      Thomas Graves authored
      Add a config (spark.yarn.access.namenodes) to allow applications running on YARN to access other secure HDFS clusters. The user just specifies the namenodes of the other clusters, and we get tokens for those and ship them with the Spark application.
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #1159 from tgravescs/spark-1528 and squashes the following commits:
      
      ddbcd16 [Thomas Graves] review comments
      0ac8501 [Thomas Graves] SPARK-1528 - add support for accessing remote HDFS
      2c0f705e
    • jerryshao's avatar
      [SPARK-1022][Streaming] Add Kafka real unit test · e87075df
      jerryshao authored
      This PR is an updated version of (https://github.com/apache/spark/pull/557) that actually tests sending and receiving data through Kafka, and fixes the previous flaky issues.
      
      @tdas, would you mind reviewing this PR? Thanks a lot.
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #1751 from jerryshao/kafka-unit-test and squashes the following commits:
      
      b6a505f [jerryshao] code refactor according to comments
      5222330 [jerryshao] Change JavaKafkaStreamSuite to better test it
      5525f10 [jerryshao] Fix flaky issue of Kafka real unit test
      4559310 [jerryshao] Minor changes for Kafka unit test
      860f649 [jerryshao] Minor style changes, and tests ignored due to flakiness
      796d4ca [jerryshao] Add real Kafka streaming test
      e87075df
    • Reynold Xin's avatar
      [SPARK-2856] Decrease initial buffer size for Kryo to 64KB. · 184048f8
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1780 from rxin/kryo-init-size and squashes the following commits:
      
      551b935 [Reynold Xin] [SPARK-2856] Decrease initial buffer size for Kryo to 64KB.
      184048f8
    • wangfei's avatar
      [SPARK-1779] Throw an exception if memory fractions are not between 0 and 1 · 9862c614
      wangfei authored
      Author: wangfei <scnbwf@yeah.net>
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #714 from scwf/memoryFraction and squashes the following commits:
      
      6e385b9 [wangfei] Update SparkConf.scala
      da6ee59 [wangfei] add configs
      829a195 [wangfei] add indent
      717c0ca [wangfei] updated to make more concise
      fc45476 [wangfei] validate memoryfraction in sparkconf
      2e79b3d [wangfei] && => ||
      43621bd [wangfei] && => ||
      cf38bcf [wangfei] throw IllegalArgumentException
      14d18ac [wangfei] throw IllegalArgumentException
      dff1f0f [wangfei] Update BlockManager.scala
      764965f [wangfei] Update ExternalAppendOnlyMap.scala
      a59d76b [wangfei] Throw exception when memoryFracton is out of range
      7b899c2 [wangfei] [SPARK-1779]
      9862c614
    • Andrew Or's avatar
      [SPARK-2857] Correct properties to set Master / Worker ports · a646a365
      Andrew Or authored
      `master.ui.port` and `worker.ui.port` were never picked up by SparkConf, simply because they are not prefixed with "spark." Unfortunately, this is also currently the documented way of setting these values.
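      A minimal sketch of the failure mode (illustrative only, not SparkConf itself): a loader that, like SparkConf, keeps only keys prefixed with "spark.", so an un-prefixed name such as `master.ui.port` is silently dropped.

      ```python
      # Only "spark."-prefixed properties survive loading; everything else
      # falls through without an error.
      def load_spark_props(props):
          return {k: v for k, v in props.items() if k.startswith("spark.")}

      loaded = load_spark_props({"master.ui.port": "9090",
                                 "spark.ui.port": "4040"})
      ```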
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1779 from andrewor14/master-worker-port and squashes the following commits:
      
      8475e95 [Andrew Or] Update docs to reflect changes in configs
      4db3d5d [Andrew Or] Stop using configs that don't actually work
      a646a365
    • Matei Zaharia's avatar
      SPARK-2711. Create a ShuffleMemoryManager to track memory for all spilling collections · 4fde28c2
      Matei Zaharia authored
      This tracks memory properly if there are multiple spilling collections in the same task (which was a problem before), and also implements an algorithm that lets each thread grow up to 1 / 2N of the memory pool (where N is the number of threads) before spilling, which avoids an inefficiency with small spills we had before (some threads would spill many times at 0-1 MB because the pool was allocated elsewhere).
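      The fairness policy described above can be sketched as follows (hypothetical code, not the real ShuffleMemoryManager): each of N active threads is capped at 1/N of the pool, and is only expected to spill once it holds its guaranteed 1/(2N) share, so small tasks are not starved into repeated tiny spills.

      ```python
      class ShuffleMemoryManagerSketch:
          def __init__(self, pool_bytes):
              self.pool = pool_bytes
              self.held = {}  # thread id -> bytes currently granted

          def try_to_acquire(self, thread, requested):
              self.held.setdefault(thread, 0)
              n = len(self.held)
              cap = self.pool // n            # no thread may exceed 1/N of the pool
              free = self.pool - sum(self.held.values())
              grant = max(0, min(requested, cap - self.held[thread], free))
              self.held[thread] += grant
              return grant

          def should_spill(self, thread):
              # Spill only once a thread holds its guaranteed 1/(2N) share.
              return self.held.get(thread, 0) >= self.pool // (2 * len(self.held))

      mgr = ShuffleMemoryManagerSketch(1024)
      g1 = mgr.try_to_acquire("t1", 800)   # alone: limited only by the pool
      g2 = mgr.try_to_acquire("t2", 800)   # second thread: capped by 1/N and free space
      ```

      The point of the 1/(2N) floor is that a late-arriving thread is guaranteed a useful minimum before anyone asks it to spill.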
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1707 from mateiz/spark-2711 and squashes the following commits:
      
      debf75b [Matei Zaharia] Review comments
      24f28f3 [Matei Zaharia] Small rename
      c8f3a8b [Matei Zaharia] Update ShuffleMemoryManager to be able to partially grant requests
      315e3a5 [Matei Zaharia] Some review comments
      b810120 [Matei Zaharia] Create central manager to track memory for all spilling collections
      4fde28c2
    • Matei Zaharia's avatar
      SPARK-2685. Update ExternalAppendOnlyMap to avoid buffer.remove() · 066765d6
      Matei Zaharia authored
      Replaces this with an O(1) operation that does not have to shift the whole tail of the array into the gap produced by the removed element.
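      The usual O(1) replacement for a positional remove, sketched in Python: overwrite the removed slot with the last element and pop the tail, so nothing is shifted (element order is not preserved, which is acceptable when the buffer's order does not matter).

      ```python
      # Swap-with-last removal: O(1) instead of the O(n) shift of list.pop(i).
      def swap_remove(buf, i):
          buf[i] = buf[-1]
          buf.pop()
          return buf

      buf = swap_remove([10, 20, 30, 40], 1)
      ```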
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1773 from mateiz/SPARK-2685 and squashes the following commits:
      
      1ea028a [Matei Zaharia] Update comments in StreamBuffer and EAOM, and reuse ArrayBuffers
      eb1abfd [Matei Zaharia] Update ExternalAppendOnlyMap to avoid buffer.remove()
      066765d6
  3. Aug 04, 2014
    • Reynold Xin's avatar
      [SPARK-2323] Exception in accumulator update should not crash DAGScheduler & SparkContext · 05bf4e4a
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1772 from rxin/accumulator-dagscheduler and squashes the following commits:
      
      6a58520 [Reynold Xin] [SPARK-2323] Exception in accumulator update should not crash DAGScheduler & SparkContext.
      05bf4e4a
    • Davies Liu's avatar
      [SPARK-1687] [PySpark] fix unit tests related to pickable namedtuple · 9fd82dbb
      Davies Liu authored
      The serializer is imported multiple times during doctests, so it's better to make _hijack_namedtuple() safe to be called multiple times.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1771 from davies/fix and squashes the following commits:
      
      1a9e336 [Davies Liu] fix unit tests
      9fd82dbb
    • Matei Zaharia's avatar
      SPARK-2792. Fix reading too much or too little data from each stream in ExternalMap / Sorter · 8e7d5ba1
      Matei Zaharia authored
      All these changes are from mridulm's work in #1609, but extracted here to fix this specific issue and make it easier to merge into 1.1. This particular set of changes makes sure that we read exactly the right range of bytes from each spill file in EAOM: some serializers can write bytes after the last object (e.g. the TC_RESET flag in Java serialization), and that would confuse the previous code into reading them as part of the next batch. There are also cleanup improvements to make sure files are closed.
      
      In addition to bringing in the changes to ExternalAppendOnlyMap, I also copied them to the corresponding code in ExternalSorter and updated its test suite to test for the same issues.
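      The batch-offset idea can be sketched like this (hypothetical file format, not Spark's actual spill files): record the byte length of every batch as it is written, and on read hand each deserializer exactly that many bytes, so trailing serializer bookkeeping is never misread as the start of the next batch.

      ```python
      import io

      def write_batches(batches):
          out, lengths = io.BytesIO(), []
          for payload in (b + b"\xff" for b in batches):  # \xff mimics a trailing marker
              out.write(payload)
              lengths.append(len(payload))
          return out.getvalue(), lengths

      def read_batches(data, lengths):
          stream, result = io.BytesIO(data), []
          for n in lengths:
              chunk = stream.read(n)      # read exactly this batch's bytes
              result.append(chunk[:-1])   # and ignore its trailing marker
          return result

      data, lengths = write_batches([b"alpha", b"beta"])
      roundtrip = read_batches(data, lengths)
      ```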
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1722 from mateiz/spark-2792 and squashes the following commits:
      
      5d4bfb5 [Matei Zaharia] Make objectStreamReset counter count the last object written too
      18fe865 [Matei Zaharia] Update docs on objectStreamReset
      576ee83 [Matei Zaharia] Allow objectStreamReset to be 0
      0374217 [Matei Zaharia] Remove super paranoid code to close file handles
      bda37bb [Matei Zaharia] Implement Mridul's ExternalAppendOnlyMap fixes in ExternalSorter too
      0d6dad7 [Matei Zaharia] Added Mridul's test changes for ExternalAppendOnlyMap
      9a78e4b [Matei Zaharia] Add @mridulm's fixes to ExternalAppendOnlyMap for batch sizes
      8e7d5ba1
    • Davies Liu's avatar
      [SPARK-1687] [PySpark] pickable namedtuple · 59f84a95
      Davies Liu authored
Add a hook to replace the original namedtuple with a picklable one, so that namedtuples can be used in RDDs.
      
PS: pyspark should be imported BEFORE "from collections import namedtuple"
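The core of such a hook can be sketched as follows: give each namedtuple class a `__reduce__` that rebuilds the class from its name and fields at unpickle time, so instances survive pickling even when the class is defined in `__main__` or a REPL. Names here are illustrative, not necessarily PySpark's:

```python
import collections
from pickle import dumps, loads

def _restore(name, fields, values):
    """Recreate the namedtuple class at unpickle time, then the instance."""
    cls = collections.namedtuple(name, fields)
    return cls(*values)

def _hack_namedtuple(cls):
    """Attach a __reduce__ that rebuilds the class by name and fields,
    so pickle never needs to find the class by import path."""
    name, fields = cls.__name__, cls._fields

    def __reduce__(self):
        return (_restore, (name, fields, tuple(self)))

    cls.__reduce__ = __reduce__
    return cls

Point = _hack_namedtuple(collections.namedtuple("Point", "x y"))
p = loads(dumps(Point(1, 2)))
print(p.x, p.y)  # 1 2
```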
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1623 from davies/namedtuple and squashes the following commits:
      
      045dad8 [Davies Liu] remove unrelated code changes
      4132f32 [Davies Liu] address comment
      55b1c1a [Davies Liu] fix tests
      61f86eb [Davies Liu] replace all the reference of namedtuple to new hacked one
      98df6c6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into namedtuple
      f7b1bde [Davies Liu] add hack for CloudPickleSerializer
      0c5c849 [Davies Liu] Merge branch 'master' of github.com:apache/spark into namedtuple
      21991e6 [Davies Liu] hack namedtuple in __main__ module, make it picklable.
      93b03b8 [Davies Liu] pickable namedtuple
      59f84a95
    • Liquan Pei's avatar
      [MLlib] [SPARK-2510]Word2Vec: Distributed Representation of Words · e053c558
      Liquan Pei authored
      This is a pull request regarding SPARK-2510 at https://issues.apache.org/jira/browse/SPARK-2510. Word2Vec creates vector representation of words in a text corpus. The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.
      
      To make our implementation more scalable, we train each partition separately and merge the model of each partition after each iteration. To make the model more accurate, multiple iterations may be needed.
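The train-per-partition-then-merge scheme above can be sketched generically; nothing below is the actual Word2Vec gradient, and `local_update` is a stand-in for one pass of skip-gram training:

```python
def local_update(weights, partition, lr=0.1):
    """Stand-in for one epoch of training on one partition: nudge each
    weight toward the partition's data (illustrative only)."""
    return [w + lr * sum(x[i] for x in partition) / len(partition)
            for i, w in enumerate(weights)]

def train(partitions, dim, iterations):
    weights = [0.0] * dim
    for _ in range(iterations):
        # Train a copy of the current model on each partition separately...
        local_models = [local_update(weights, p) for p in partitions]
        # ...then merge by averaging, as in a parameter-averaging scheme.
        weights = [sum(m[i] for m in local_models) / len(local_models)
                   for i in range(dim)]
    return weights

parts = [[(1.0, 0.0)], [(0.0, 1.0)]]
print(train(parts, dim=2, iterations=1))  # [0.05, 0.05]
```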
      
One way to investigate the vector representations is to find the closest words to a query word. For example, the top 20 closest words to "china" with 1 partition and 1 iteration are:
      
      taiwan 0.8077646146334014
      korea 0.740913304563621
      japan 0.7240667798885471
      republic 0.7107151279078352
      thailand 0.6953217332072862
      tibet 0.6916782118129544
      mongolia 0.6800858715972612
      macau 0.6794925677480378
      singapore 0.6594048695593799
      manchuria 0.658989931844148
      laos 0.6512978726001666
      nepal 0.6380792327845325
      mainland 0.6365469459587788
      myanmar 0.6358614338840394
      macedonia 0.6322366180313249
      xinjiang 0.6285291551708028
      russia 0.6279951236068411
      india 0.6272874944023487
      shanghai 0.6234544135576999
      macao 0.6220588462925876
      
      The result with 10 partitions and 5 iterations is:
      taiwan 0.8310495079388313
      india 0.7737171315919039
      japan 0.756777901233668
      korea 0.7429767187102452
      indonesia 0.7407557427278356
      pakistan 0.712883426985585
      mainland 0.7053379963140822
      thailand 0.696298191073948
      mongolia 0.693690656871415
      laos 0.6913069680735292
      macau 0.6903427690029617
      republic 0.6766381604813666
      malaysia 0.676460699141784
      singapore 0.6728790997360923
      malaya 0.672345232966194
      manchuria 0.6703732292753156
      macedonia 0.6637955686322028
      myanmar 0.6589462882439646
      kazakhstan 0.657017801081494
      cambodia 0.6542383836451932
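Rankings like the above come from cosine similarity between word vectors; a minimal sketch of the lookup, using toy vectors rather than a trained model:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def closest_words(vectors, query, k):
    """Return the k words whose vectors have the highest cosine
    similarity to the query word's vector (excluding the query)."""
    q = vectors[query]
    scored = [(w, cosine(v, q)) for w, v in vectors.items() if w != query]
    return sorted(scored, key=lambda t: -t[1])[:k]

toy = {
    "china":  [0.9, 0.1],
    "taiwan": [0.8, 0.2],
    "russia": [0.5, 0.5],
    "banana": [0.0, 1.0],
}
print(closest_words(toy, "china", 2))  # taiwan first, then russia
```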
      
      Author: Liquan Pei <lpei@gopivotal.com>
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Liquan Pei <liquanpei@gmail.com>
      
      Closes #1719 from Ishiihara/master and squashes the following commits:
      
      2ba9483 [Liquan Pei] minor fix for Word2Vec test
      e248441 [Liquan Pei] minor style change
      26a948d [Liquan Pei] Merge pull request #1 from mengxr/Ishiihara-master
      c14da41 [Xiangrui Meng] fix styles
      384c771 [Xiangrui Meng] remove minCount and window from constructor change model to use float instead of double
      e93e726 [Liquan Pei] use treeAggregate instead of aggregate
      1a8fb41 [Liquan Pei] use weighted sum in combOp
      7efbb6f [Liquan Pei] use broadcast version of vocab in aggregate
      6bcc8be [Liquan Pei] add multiple iteration support
      720b5a3 [Liquan Pei] Add test for Word2Vec algorithm, minor fixes
      2e92b59 [Liquan Pei] modify according to feedback
      57dc50d [Liquan Pei] code formatting
      e4a04d3 [Liquan Pei] minor fix
      0aafb1b [Liquan Pei] Add comments, minor fixes
      8d6befe [Liquan Pei] initial commit
      e053c558
  4. Aug 03, 2014
    • DB Tsai's avatar
      SPARK-2272 [MLlib] Feature scaling which standardizes the range of independent... · ae58aea2
      DB Tsai authored
      SPARK-2272 [MLlib] Feature scaling which standardizes the range of independent variables or features of data
      
      Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is generally performed during the data preprocessing step.
      
      In this work, a trait called `VectorTransformer` is defined for generic transformation on a vector. It contains one method to be implemented, `transform` which applies transformation on a vector.
      
There are two implementations of `VectorTransformer` now, and both can easily be extended with PMML transformation support.
      
      1) `StandardScaler` - Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
      
      2) `Normalizer` - Normalizes samples individually to unit L^n norm
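What the two transformers compute can be sketched without MLlib (plain Python, illustrative only; the population standard deviation is used here for simplicity, which may differ from MLlib's choice):

```python
import math

def standard_scale(data):
    """StandardScaler-style: per column, subtract the mean and divide by
    the (population) standard deviation computed from the data itself."""
    n, dim = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(dim)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in data) / n)
            for j in range(dim)]
    return [[(row[j] - means[j]) / stds[j] if stds[j] else 0.0
             for j in range(dim)] for row in data]

def normalize(vector, p=2):
    """Normalizer-style: scale one sample to unit L^p norm."""
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    return [x / norm for x in vector] if norm else vector

print(normalize([3.0, 4.0]))  # [0.6, 0.8]
```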
      
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #1207 from dbtsai/dbtsai-feature-scaling and squashes the following commits:
      
      78c15d3 [DB Tsai] Alpine Data Labs
      ae58aea2
    • Sarah Gerweck's avatar
      Fix some bugs with spaces in directory name. · 5507dd8e
      Sarah Gerweck authored
      Any time you use the directory name (`FWDIR`) it needs to be surrounded
      in quotes. If you're also using wildcards, you can safely put the quotes
      around just `$FWDIR`.
      
      Author: Sarah Gerweck <sarah.a180@gmail.com>
      
      Closes #1756 from sarahgerweck/folderSpaces and squashes the following commits:
      
      732629d [Sarah Gerweck] Fix some bugs with spaces in directory name.
      5507dd8e
    • Anand Avati's avatar
      [SPARK-2810] upgrade to scala-maven-plugin 3.2.0 · 6ba6c3eb
      Anand Avati authored
      Needed for Scala 2.11 compiler-interface
      
Signed-off-by: Anand Avati <avati@redhat.com>
      
      Author: Anand Avati <avati@redhat.com>
      
      Closes #1711 from avati/SPARK-1812-scala-maven-plugin and squashes the following commits:
      
      9a22fc8 [Anand Avati] SPARK-1812: upgrade to scala-maven-plugin 3.2.0
      6ba6c3eb
    • Davies Liu's avatar
      [SPARK-1740] [PySpark] kill the python worker · 55349f9f
      Davies Liu authored
Kill only the Python worker related to cancelled tasks.

The daemon will start a background thread to monitor all the opened sockets for all workers. If a socket is closed by the JVM, this thread will kill the corresponding worker.

When a task is cancelled, the socket to its worker will be closed, and then the worker will be killed by the daemon.
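The monitoring scheme can be sketched: a background thread selects on worker sockets and "kills" the matching worker when the other end (the JVM, in Spark's case) closes its socket. Names and structure here are illustrative, not PySpark's daemon; killing is recorded in a list, where the real daemon would signal the worker process:

```python
import select
import socket
import threading

killed = []

def monitor_workers(sock_to_worker, stop):
    """Daemon-style background loop: when a worker's socket reaches EOF
    (the peer closed it), record that worker as killed and stop watching
    its socket."""
    socks = list(sock_to_worker)
    while socks and not stop.is_set():
        readable, _, _ = select.select(socks, [], [], 0.1)
        for s in readable:
            if s.recv(1) == b"":              # EOF: peer closed the socket
                killed.append(sock_to_worker[s])
                socks.remove(s)

jvm_end, daemon_end = socket.socketpair()
stop = threading.Event()
t = threading.Thread(target=monitor_workers,
                     args=({daemon_end: "worker-1"}, stop), daemon=True)
t.start()
jvm_end.close()                               # task cancelled: JVM closes it
t.join(timeout=5)
print(killed)  # ['worker-1']
```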
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1643 from davies/kill and squashes the following commits:
      
      8ffe9f3 [Davies Liu] kill worker by deamon, because runtime.exec() is too heavy
      46ca150 [Davies Liu] address comment
      acd751c [Davies Liu] kill the worker when task is canceled
      55349f9f
    • Yin Huai's avatar
      [SPARK-2783][SQL] Basic support for analyze in HiveContext · e139e2be
      Yin Huai authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-2783
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1741 from yhuai/analyzeTable and squashes the following commits:
      
      7bb5f02 [Yin Huai] Use sql instead of hql.
      4d09325 [Yin Huai] Merge remote-tracking branch 'upstream/master' into analyzeTable
      e3ebcd4 [Yin Huai] Renaming.
      c170f4e [Yin Huai] Do not use getContentSummary.
      62393b6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into analyzeTable
      db233a6 [Yin Huai] Trying to debug jenkins...
      fee84f0 [Yin Huai] Merge remote-tracking branch 'upstream/master' into analyzeTable
      f0501f3 [Yin Huai] Fix compilation error.
      24ad391 [Yin Huai] Merge remote-tracking branch 'upstream/master' into analyzeTable
      8918140 [Yin Huai] Wording.
      23df227 [Yin Huai] Add a simple analyze method to get the size of a table and update the "totalSize" property of this table in the Hive metastore.
      e139e2be
    • Cheng Lian's avatar
      [SPARK-2814][SQL] HiveThriftServer2 throws NPE when executing native commands · ac33cbbf
      Cheng Lian authored
      JIRA issue: [SPARK-2814](https://issues.apache.org/jira/browse/SPARK-2814)
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1753 from liancheng/spark-2814 and squashes the following commits:
      
      c74a3b2 [Cheng Lian] Fixed SPARK-2814
      ac33cbbf