  1. Mar 15, 2014
    • Sean Owen's avatar
      SPARK-1254. Consolidate, order, and harmonize repository declarations in Maven/SBT builds · 97e4459e
      Sean Owen authored
      This suggestion addresses a few minor suboptimalities with how repositories are handled.
      
      1) Use HTTPS consistently to access repos, instead of HTTP
      
      2) Consolidate repository declarations in the parent POM file, in the case of the Maven build, so that their ordering can be controlled to put the fully optional Cloudera repo at the end, after required repos. (This was prompted by the untimely failure of the Cloudera repo this week, which made the Spark build fail. #2 would have prevented that.)
      
      3) Update SBT build to match Maven build in this regard
      
      4) Update SBT build to not refer to Sonatype snapshot repos. This wasn't in Maven, and a build generally would not refer to external snapshots, but I'm not 100% sure on this one.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #145 from srowen/SPARK-1254 and squashes the following commits:
      
      42f9bfc [Sean Owen] Use HTTPS for repos; consolidate repos in parent in order to put optional Cloudera repo last; harmonize SBT build repos with Maven; remove snapshot repos from SBT build which weren't in Maven
      97e4459e
  2. Mar 14, 2014
  3. Mar 13, 2014
    • Tianshuo Deng's avatar
      [bugfix] wrong client arg, should use executor-cores · 181b130a
      Tianshuo Deng authored
      The client arg is wrong; it should be executor-cores. This causes executors to fail to start when executor-cores is specified.
      
      Author: Tianshuo Deng <tdeng@twitter.com>
      
      Closes #138 from tsdeng/bugfix_wrong_client_args and squashes the following commits:
      
      304826d [Tianshuo Deng] wrong client arg, should use executor-cores
      181b130a
    • Reynold Xin's avatar
      SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225. · ca4bf8c5
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #113 from rxin/jetty9 and squashes the following commits:
      
      867a2ce [Reynold Xin] Updated Jetty version to 9.1.3.v20140225 in Maven build file.
      d7c97ca [Reynold Xin] Return the correctly bound port.
      d14706f [Reynold Xin] Upgrade Jetty to 9.1.3.v20140225.
      ca4bf8c5
    • Sandy Ryza's avatar
      SPARK-1183. Don't use "worker" to mean executor · 69837321
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #120 from sryza/sandy-spark-1183 and squashes the following commits:
      
      5066a4a [Sandy Ryza] Remove "worker" in a couple comments
      0bd1e46 [Sandy Ryza] Remove --am-class from usage
      bfc8fe0 [Sandy Ryza] Remove am-class from doc and fix yarn-alpha
      607539f [Sandy Ryza] Address review comments
      74d087a [Sandy Ryza] SPARK-1183. Don't use "worker" to mean executor
      69837321
    • Xiangrui Meng's avatar
      [SPARK-1237, 1238] Improve the computation of YtY for implicit ALS · e4e8d8f3
      Xiangrui Meng authored
      Computing YtY can be implemented using BLAS's DSPR operations instead of generating y_i y_i^T and then combining them. The latter generates many k-by-k matrices. On the movielens data, this change improves the performance by 10-20%. The algorithm remains the same, verified by computing RMSE on the movielens data.
      
      To compare the results, I also added an option to set a random seed in ALS.
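      The idea behind the DSPR-based change can be illustrated outside Spark. Below is a hypothetical pure-Python sketch (the actual change is in Spark's Scala ALS code): it accumulates YtY = sum_i y_i y_i^T with DSPR-style in-place rank-1 updates on the upper triangle of a single shared k-by-k accumulator, instead of materializing a fresh k-by-k outer product per row, as the naive version does.

```python
def yty_outer(rows):
    # Naive approach: build one k-by-k temporary matrix per row, then sum.
    k = len(rows[0])
    total = [[0.0] * k for _ in range(k)]
    for y in rows:
        outer = [[y[i] * y[j] for j in range(k)] for i in range(k)]  # k-by-k temp
        for i in range(k):
            for j in range(k):
                total[i][j] += outer[i][j]
    return total

def yty_dspr_like(rows):
    # DSPR-style: update only the upper triangle of one accumulator in
    # place, never allocating a per-row temporary matrix.
    k = len(rows[0])
    acc = [[0.0] * k for _ in range(k)]
    for y in rows:
        for i in range(k):
            for j in range(i, k):
                acc[i][j] += y[i] * y[j]
    # Mirror the upper triangle down to return a full symmetric matrix.
    for i in range(k):
        for j in range(i):
            acc[i][j] = acc[j][i]
    return acc
```

      Both functions compute the same matrix; the second avoids the many small k-by-k allocations the commit message describes.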
      
      JIRA:
      1. https://spark-project.atlassian.net/browse/SPARK-1237
      2. https://spark-project.atlassian.net/browse/SPARK-1238
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #131 from mengxr/als and squashes the following commits:
      
      ed00432 [Xiangrui Meng] minor changes
      d984623 [Xiangrui Meng] minor changes
      2fc1641 [Xiangrui Meng] remove commented code
      4c7cde2 [Xiangrui Meng] allow specifying a random seed in ALS
      200bef0 [Xiangrui Meng] optimize computeYtY and updateBlock
      e4e8d8f3
    • Patrick Wendell's avatar
      SPARK-1019: pyspark RDD take() throws an NPE · 4ea23db0
      Patrick Wendell authored
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #112 from pwendell/pyspark-take and squashes the following commits:
      
      daae80e [Patrick Wendell] SPARK-1019: pyspark RDD take() throws an NPE
      4ea23db0
  4. Mar 12, 2014
    • CodingCat's avatar
      hot fix for PR105 - change to Java annotation · 6bd2eaa4
      CodingCat authored
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #133 from CodingCat/SPARK-1160-2 and squashes the following commits:
      
      6607155 [CodingCat] hot fix for PR105 - change to Java annotation
      6bd2eaa4
    • jianghan's avatar
      Fix example bug: compile error · 31a70400
      jianghan authored
      Author: jianghan <jianghan@xiaomi.com>
      
      Closes #132 from pooorman/master and squashes the following commits:
      
      54afbe0 [jianghan] Fix example bug: compile error
      31a70400
    • CodingCat's avatar
      SPARK-1160: Deprecate toArray in RDD · 9032f7c0
      CodingCat authored
      https://spark-project.atlassian.net/browse/SPARK-1160
      
      reported by @mateiz: "It's redundant with collect() and the name doesn't make sense in Java, where we return a List (we can't return an array due to the way Java generics work). It's also missing in Python."
      
      In this patch, I deprecated the method and updated the source files that used it, replacing toArray with collect() directly.
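      The deprecation pattern described here can be sketched in Python. This is a hypothetical stand-in class, not Spark's actual RDD: the deprecated alias emits a warning and delegates to collect().

```python
import warnings

class MiniRDD:
    """Hypothetical stand-in for illustration only, not Spark's RDD."""
    def __init__(self, data):
        self._data = list(data)

    def collect(self):
        return list(self._data)

    def to_array(self):
        # Deprecated alias: warn the caller, then delegate to collect().
        warnings.warn("to_array is deprecated; use collect() instead",
                      DeprecationWarning, stacklevel=2)
        return self.collect()
```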
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #105 from CodingCat/SPARK-1060 and squashes the following commits:
      
      286f163 [CodingCat] deprecate in JavaRDDLike
      ee17b4e [CodingCat] add message and since
      2ff7319 [CodingCat] deprecate toArray in RDD
      9032f7c0
    • Prashant Sharma's avatar
      SPARK-1162 Added top in python. · b8afe305
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #93 from ScrapCodes/SPARK-1162/pyspark-top-takeOrdered and squashes the following commits:
      
      ece1fa4 [Prashant Sharma] Added top in python.
      b8afe305
    • liguoqiang's avatar
      Fix #SPARK-1149 Bad partitioners can cause Spark to hang · 5d1ec64e
      liguoqiang authored
      Author: liguoqiang <liguoqiang@rd.tuan800.com>
      
      Closes #44 from witgo/SPARK-1149 and squashes the following commits:
      
      3dcdcaf [liguoqiang] Merge branch 'master' into SPARK-1149
      8425395 [liguoqiang] Merge remote-tracking branch 'upstream/master' into SPARK-1149
      3dad595 [liguoqiang] review comment
      e3e56aa [liguoqiang] Merge branch 'master' into SPARK-1149
      b0d5c07 [liguoqiang] review comment
      d0a6005 [liguoqiang] review comment
      3395ee7 [liguoqiang] Merge remote-tracking branch 'upstream/master' into SPARK-1149
      ac006a3 [liguoqiang] code Formatting
      3feb3a8 [liguoqiang] Merge branch 'master' into SPARK-1149
      adc443e [liguoqiang] partitions check  bugfix
      928e1e3 [liguoqiang] Added a unit test for PairRDDFunctions.lookup with bad partitioner
      db6ecc5 [liguoqiang] Merge branch 'master' into SPARK-1149
      1e3331e [liguoqiang] Merge branch 'master' into SPARK-1149
      3348619 [liguoqiang] Optimize performance for partitions check
      61e5a87 [liguoqiang] Merge branch 'master' into SPARK-1149
      e68210a [liguoqiang] add partition index check to submitJob
      3a65903 [liguoqiang] make the code more readable
      6bb725e [liguoqiang] fix #SPARK-1149 Bad partitioners can cause Spark to hang
      5d1ec64e
    • Thomas Graves's avatar
      [SPARK-1233] Fix running hadoop 0.23 due to java.lang.NoSuchFieldException: DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH · b5162f44
      Thomas Graves authored
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #129 from tgravescs/SPARK-1233 and squashes the following commits:
      
      85ff5a6 [Thomas Graves] Fix running hadoop 0.23 due to java.lang.NoSuchFieldException: DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH
      b5162f44
    • Thomas Graves's avatar
      [SPARK-1232] Fix the hadoop 0.23 yarn build · c8c59b32
      Thomas Graves authored
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #127 from tgravescs/SPARK-1232 and squashes the following commits:
      
      c05cfd4 [Thomas Graves] Fix the hadoop 0.23 yarn build
      c8c59b32
    • prabinb's avatar
      Spark-1163, Added missing Python RDD functions · af7f2f10
      prabinb authored
      Author: prabinb <prabin.banka@imaginea.com>
      
      Closes #92 from prabinb/python-api-rdd and squashes the following commits:
      
      51129ca [prabinb] Added missing Python RDD functions Added __repr__ function to StorageLevel class. Added doctest for RDD.getStorageLevel().
      af7f2f10
    • Sandy Ryza's avatar
      SPARK-1064 · 2409af9d
      Sandy Ryza authored
      This reopens PR 649 from incubator-spark against the new repo
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #102 from sryza/sandy-spark-1064 and squashes the following commits:
      
      270e490 [Sandy Ryza] Handle different application classpath variables in different versions
      88b04e0 [Sandy Ryza] SPARK-1064. Make it possible to run on YARN without bundling Hadoop jars in Spark assembly
      2409af9d
  5. Mar 11, 2014
    • Patrick Wendell's avatar
      SPARK-1167: Remove metrics-ganglia from default build due to LGPL issues... · 16788a65
      Patrick Wendell authored
      This patch removes Ganglia integration from the default build. It
      allows users willing to link against LGPL code to use Ganglia
      by adding build flags or linking against a new Spark artifact called
      spark-ganglia-lgpl.
      
      This brings Spark in line with the Apache policy on LGPL code
      enumerated here:
      
      https://www.apache.org/legal/3party.html#options-optional
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #108 from pwendell/ganglia and squashes the following commits:
      
      326712a [Patrick Wendell] Responding to review feedback
      5f28ee4 [Patrick Wendell] SPARK-1167: Remove metrics-ganglia from default build due to LGPL issues.
      16788a65
  6. Mar 10, 2014
    • Sandy Ryza's avatar
      SPARK-1211. In ApplicationMaster, set spark.master system property to "yarn-cluster" · 2a2c9645
      Sandy Ryza authored
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #118 from sryza/sandy-spark-1211 and squashes the following commits:
      
      d4001c7 [Sandy Ryza] SPARK-1211. In ApplicationMaster, set spark.master system property to "yarn-cluster"
      2a2c9645
    • Patrick Wendell's avatar
      SPARK-1205: Clean up callSite/origin/generator. · 2a516170
      Patrick Wendell authored
      This patch removes the `generator` field and simplifies + documents
      the tracking of callsites.
      
      There are two places where we care about call sites, when a job is
      run and when an RDD is created. This patch retains both of those
      features but does a slight refactoring and renaming to make things
      less confusing.
      
      There was another feature of an RDD called the `generator`, which was
      by default the user class in which the RDD was created. It is used
      exclusively in the JobLogger and has been subsumed by the ability
      to name a job group. The job logger can later be refactored to
      read the job group directly (this will require some work), but for now
      this just preserves the default logged value of the user class.
      I'm not sure any users ever used the ability to override this.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #106 from pwendell/callsite and squashes the following commits:
      
      fc1d009 [Patrick Wendell] Compile fix
      e17fb76 [Patrick Wendell] Review feedback: callSite -> creationSite
      62e77ef [Patrick Wendell] Review feedback
      576e60b [Patrick Wendell] SPARK-1205: Clean up callSite/origin/generator.
      2a516170
    • Prashant Sharma's avatar
      SPARK-1168, Added foldByKey to pyspark. · a59419c2
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #115 from ScrapCodes/SPARK-1168/pyspark-foldByKey and squashes the following commits:
      
      db6f67e [Prashant Sharma] SPARK-1168, Added foldByKey to pyspark.
      a59419c2
    • jyotiska's avatar
      [SPARK-972] Added detailed callsite info for ValueError in context.py (resubmitted) · f5518989
      jyotiska authored
      Author: jyotiska <jyotiska123@gmail.com>
      
      Closes #34 from jyotiska/pyspark_code and squashes the following commits:
      
      c9439be [jyotiska] replaced dict with namedtuple
      a6bf4cd [jyotiska] added callsite info for context.py
      f5518989
    • Prabin Banka's avatar
      SPARK-977 Added Python RDD.zip function · e1e09e0e
      Prabin Banka authored
      was raised earlier as a part of apache/incubator-spark#486
      
      Author: Prabin Banka <prabin.banka@imaginea.com>
      
      Closes #76 from prabinb/python-api-zip and squashes the following commits:
      
      b1a31a0 [Prabin Banka] Added Python RDD.zip function
      e1e09e0e
    • Chen Chao's avatar
      maintain arbitrary state data for each key · 5d98cfc1
      Chen Chao authored
      RT
      
      Author: Chen Chao <crazyjvm@gmail.com>
      
      Closes #114 from CrazyJvm/patch-1 and squashes the following commits:
      
      dcb0df5 [Chen Chao] maintain arbitrary state data for each key
      5d98cfc1
  7. Mar 09, 2014
    • Patrick Wendell's avatar
      SPARK-782 Clean up for ASM dependency. · b9be1609
      Patrick Wendell authored
      This makes two changes.
      
      1) Spark uses the shaded version of asm that is (conveniently) published
         with Kryo.
      2) Existing exclude rules around asm are updated to reflect the new groupId
         of `org.ow2.asm`. The groupId change had made all of the old rules fail to
         match newer Hadoop versions that pull in new asm versions.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #100 from pwendell/asm and squashes the following commits:
      
      9235f3f [Patrick Wendell] SPARK-782 Clean up for ASM dependency.
      b9be1609
    • Patrick Wendell's avatar
      Fix markup errors introduced in #33 (SPARK-1189) · faf4cad1
      Patrick Wendell authored
      These were causing errors on the configuration page.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #111 from pwendell/master and squashes the following commits:
      
      8467a86 [Patrick Wendell] Fix markup errors introduced in #33 (SPARK-1189)
      faf4cad1
    • Jiacheng Guo's avatar
      Add timeout for fetch file · f6f9d02e
      Jiacheng Guo authored
          Currently, when fetching a file, the connection's connect timeout
          and read timeout are based on the default JVM settings. This change makes
          them use spark.worker.timeout instead. This can be useful when the
          connection between workers is not reliable, and prevents
          task sets from being removed prematurely.
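      The general idea of passing an explicit timeout to a fetch, rather than relying on defaults that may hang indefinitely, can be sketched in Python. This is a hypothetical illustration, not Spark's Scala fetch code; the 60-second default below is illustrative, not Spark's actual value.

```python
from urllib.request import urlopen

def fetch_file(url, timeout_secs=60):
    # An explicit timeout makes a dead peer fail fast instead of
    # blocking the fetch on the default (possibly unbounded) timeout.
    with urlopen(url, timeout=timeout_secs) as resp:
        return resp.read()
```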
      
      Author: Jiacheng Guo <guojc03@gmail.com>
      
      Closes #98 from guojc/master and squashes the following commits:
      
      abfe698 [Jiacheng Guo] add space according request
      2a37c34 [Jiacheng Guo] Add timeout for fetch file     Currently, when fetch a file, the connection's connect timeout     and read timeout is based on the default jvm setting, in this change, I change it to     use spark.worker.timeout. This can be usefull, when the     connection status between worker is not perfect. And prevent     prematurely remove task set.
      f6f9d02e
    • Aaron Davidson's avatar
      SPARK-929: Fully deprecate usage of SPARK_MEM · 52834d76
      Aaron Davidson authored
      (Continued from old repo, prior discussion at https://github.com/apache/incubator-spark/pull/615)
      
      This patch cements our deprecation of the SPARK_MEM environment variable by replacing it with three more specialized variables:
      SPARK_DAEMON_MEMORY, SPARK_EXECUTOR_MEMORY, and SPARK_DRIVER_MEMORY
      
      The creation of the latter two variables means that we can safely set driver/job memory without accidentally setting the executor memory. Neither is public.
      
      SPARK_EXECUTOR_MEMORY is only used by the Mesos scheduler (and set within SparkContext). The proper way of configuring executor memory is through the "spark.executor.memory" property.
      
      SPARK_DRIVER_MEMORY is the new way of specifying the amount of memory used by jobs launched by spark-class, without possibly affecting executor memory.
      
      Other memory considerations:
      - The repl's memory can be set through the "--drivermem" command-line option, which really just sets SPARK_DRIVER_MEMORY.
      - run-example doesn't use spark-class, so the only way to modify examples' memory is actually an unusual use of SPARK_JAVA_OPTS (which is normally overridden in all cases by spark-class).
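      The fallback relationship between the new specialized variable and the deprecated SPARK_MEM can be sketched as a simple precedence rule. This is a hypothetical illustration of the precedence described above; the "512m" default is illustrative, not necessarily Spark's.

```python
def resolve_driver_memory(env):
    # Prefer the specialized variable, fall back to the deprecated
    # SPARK_MEM, then to a default (illustrative value).
    return env.get("SPARK_DRIVER_MEMORY") or env.get("SPARK_MEM") or "512m"
```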
      
      This patch also fixes a lurking bug where spark-shell misused spark-class (the first argument is supposed to be the main class name, not java options), as well as a bug in the Windows spark-class2.cmd. I have not yet tested this patch on either Windows or Mesos, however.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #99 from aarondav/sparkmem and squashes the following commits:
      
      9df4c68 [Aaron Davidson] SPARK-929: Fully deprecate usage of SPARK_MEM
      52834d76
  8. Mar 08, 2014
    • Patrick Wendell's avatar
      SPARK-1190: Do not initialize log4j if slf4j log4j backend is not being used · e59a3b6c
      Patrick Wendell authored
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #107 from pwendell/logging and squashes the following commits:
      
      be21c11 [Patrick Wendell] Logging fix
      e59a3b6c
    • Reynold Xin's avatar
      Update junitxml plugin to the latest version to avoid recompilation in every SBT command. · c2834ec0
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #104 from rxin/junitxml and squashes the following commits:
      
      67ef7bf [Reynold Xin] Update junitxml plugin to the latest version to avoid recompilation in every SBT command.
      c2834ec0
    • Cheng Lian's avatar
      [SPARK-1194] Fix the same-RDD rule for cache replacement · 0b7b7fd4
      Cheng Lian authored
      SPARK-1194: https://spark-project.atlassian.net/browse/SPARK-1194
      
      In the current implementation, when selecting candidate blocks to be swapped out, once we find a block from the same RDD as the block to be stored, cache eviction fails and aborts.
      
      In this PR, we keep selecting blocks *not* from the RDD that the block to be stored belongs to until either enough free space can be ensured (cache eviction succeeds) or all such blocks are checked (cache eviction fails).
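      The fixed selection rule can be sketched as follows. This is a hypothetical Python illustration of the logic described above, not Spark's actual Scala BlockManager code: blocks from the incoming block's RDD are skipped rather than aborting, and scanning continues until enough space is freed or every candidate has been checked.

```python
def select_blocks_to_evict(blocks, incoming_rdd_id, space_needed):
    """blocks: list of (block_id, rdd_id, size) tuples.
    Returns the block ids to drop, or None if eviction fails even after
    checking all candidates."""
    selected, freed = [], 0
    for block_id, rdd_id, size in blocks:
        if freed >= space_needed:
            break
        if rdd_id == incoming_rdd_id:
            continue  # old code aborted here; now we skip and keep scanning
        selected.append(block_id)
        freed += size
    return selected if freed >= space_needed else None
```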
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #96 from liancheng/fix-spark-1194 and squashes the following commits:
      
      2524ab9 [Cheng Lian] Added regression test case for SPARK-1194
      6e40c22 [Cheng Lian] Remove redundant comments
      40cdcb2 [Cheng Lian] Bug fix, and addressed PR comments from @mridulm
      62c92ac [Cheng Lian] Fixed SPARK-1194 https://spark-project.atlassian.net/browse/SPARK-1194
      0b7b7fd4
    • Reynold Xin's avatar
      Allow sbt to use more than 1G of heap. · 8ad486ad
      Reynold Xin authored
      There was a mistake in the sbt build file (introduced by 012bd5fb) in which we set the default to 2048 and then immediately reset it to 1024.
      
      Without this, building Spark can run out of permgen space on my machine.
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #103 from rxin/sbt and squashes the following commits:
      
      8829c34 [Reynold Xin] Allow sbt to use more than 1G of heap.
      8ad486ad
    • Sandy Ryza's avatar
      SPARK-1193. Fix indentation in pom.xmls · a99fb374
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #91 from sryza/sandy-spark-1193 and squashes the following commits:
      
      a878124 [Sandy Ryza] SPARK-1193. Fix indentation in pom.xmls
      a99fb374
  9. Mar 07, 2014
    • Prashant Sharma's avatar
      Spark 1165 rdd.intersection in python and java · 6e730edc
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Prashant Sharma <scrapcodes@gmail.com>
      
      Closes #80 from ScrapCodes/SPARK-1165/RDD.intersection and squashes the following commits:
      
      9b015e9 [Prashant Sharma] Added a note, shuffle is required for intersection.
      1fea813 [Prashant Sharma] correct the lines wrapping
      d0c71f3 [Prashant Sharma] SPARK-1165 RDD.intersection in java
      d6effee [Prashant Sharma] SPARK-1165 Implemented RDD.intersection in python.
      6e730edc
    • Thomas Graves's avatar
      SPARK-1195: set map_input_file environment variable in PipedRDD · b7cd9e99
      Thomas Graves authored
      Hadoop uses the config mapreduce.map.input.file to indicate the input filename to the map when the input split is of type FileSplit. Some of the hadoop input and output formats set or use this config. This config can also be used by user code.
      PipedRDD runs an external process and the configs aren't available to that process. Hadoop Streaming does something very similar and the way they make configs available is exporting them into the environment replacing '.' with '_'. Spark should also export this variable when launching the pipe command so the user code has access to that config.
      Note that the config mapreduce.map.input.file is the new one, the old one which is deprecated but not yet removed is map.input.file. So we should handle both.
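      The Hadoop Streaming convention the commit describes can be sketched with a small hypothetical helper (this is an illustration, not Spark's PipedRDD code): export the config to the piped process's environment with '.' replaced by '_', covering both the new key and the deprecated one.

```python
def pipe_env_vars(input_file):
    # Expose the input-split filename under both config names, converting
    # '.' to '_' per the Hadoop Streaming environment-variable convention.
    conf = {
        "mapreduce.map.input.file": input_file,  # current name
        "map.input.file": input_file,            # deprecated, still honored
    }
    return {key.replace(".", "_"): value for key, value in conf.items()}
```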
      
      Perhaps it would be better to abstract this out somehow so it goes into the HadoopPartition code?
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #94 from tgravescs/map_input_file and squashes the following commits:
      
      cc97a6a [Thomas Graves] Update test to check for existence of command, add a getPipeEnvVars function to HadoopRDD
      e3401dc [Thomas Graves] Merge remote-tracking branch 'upstream/master' into map_input_file
      2ba805e [Thomas Graves] set map_input_file environment variable in PipedRDD
      b7cd9e99
    • Aaron Davidson's avatar
      SPARK-1136: Fix FaultToleranceTest for Docker 0.8.1 · dabeb6f1
      Aaron Davidson authored
      This patch allows the FaultToleranceTest to work in newer versions of Docker.
      See https://spark-project.atlassian.net/browse/SPARK-1136 for more details.
      
      Besides changing the Docker and FaultToleranceTest internals, this patch also changes the behavior of Master to accept new Workers which share an address with a Worker that we are currently trying to recover. This can only happen when the Worker itself was restarted and got the same IP address/port at the same time as a Master recovery occurs.
      
      Finally, this adds a good bit of ASCII art to the test to make failures, successes, and actions more apparent. This is very much needed.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #5 from aarondav/zookeeper and squashes the following commits:
      
      5d7a72a [Aaron Davidson] SPARK-1136: Fix FaultToleranceTest for Docker 0.8.1
      dabeb6f1
  10. Mar 06, 2014
    • Patrick Wendell's avatar
      Small clean-up to flatmap tests · 33baf14b
      Patrick Wendell authored
      33baf14b
    • anitatailor's avatar
      Example for cassandra CQL read/write from spark · 9ae919c0
      anitatailor authored
      Cassandra read/write using CqlPagingInputFormat/CqlOutputFormat
      
      Author: anitatailor <tailor.anita@gmail.com>
      
      Closes #87 from anitatailor/master and squashes the following commits:
      
      3493f81 [anitatailor] Fixed scala style as per review
      19480b7 [anitatailor] Example for cassandra CQL read/write from spark
      9ae919c0
    • Sandy Ryza's avatar
      SPARK-1197. Change yarn-standalone to yarn-cluster and fix up running on YARN docs · 328c73d0
      Sandy Ryza authored
      This patch changes "yarn-standalone" to "yarn-cluster" (but still supports the former).  It also cleans up the Running on YARN docs and adds a section on how to view logs.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #95 from sryza/sandy-spark-1197 and squashes the following commits:
      
      563ef3a [Sandy Ryza] Review feedback
      6ad06d4 [Sandy Ryza] Change yarn-standalone to yarn-cluster and fix up running on YARN docs
      328c73d0
    • Thomas Graves's avatar
      SPARK-1189: Add Security to Spark - Akka, Http, ConnectionManager, UI use servlets · 7edbea41
      Thomas Graves authored
      resubmit pull request.  was https://github.com/apache/incubator-spark/pull/332.
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #33 from tgravescs/security-branch-0.9-with-client-rebase and squashes the following commits:
      
      dfe3918 [Thomas Graves] Fix merge conflict since startUserClass now using runAsUser
      05eebed [Thomas Graves] Fix dependency lost in upmerge
      d1040ec [Thomas Graves] Fix up various imports
      05ff5e0 [Thomas Graves] Fix up imports after upmerging to master
      ac046b3 [Thomas Graves] Merge remote-tracking branch 'upstream/master' into security-branch-0.9-with-client-rebase
      13733e1 [Thomas Graves] Pass securityManager and SparkConf around where we can. Switch to use sparkConf for reading config whereever possible. Added ConnectionManagerSuite unit tests.
      4a57acc [Thomas Graves] Change UI createHandler routines to createServlet since they now return servlets
      2f77147 [Thomas Graves] Rework from comments
      50dd9f2 [Thomas Graves] fix header in SecurityManager
      ecbfb65 [Thomas Graves] Fix spacing and formatting
      b514bec [Thomas Graves] Fix reference to config
      ed3d1c1 [Thomas Graves] Add security.md
      6f7ddf3 [Thomas Graves] Convert SaslClient and SaslServer to scala, change spark.authenticate.ui to spark.ui.acls.enable, and fix up various other things from review comments
      2d9e23e [Thomas Graves] Merge remote-tracking branch 'upstream/master' into security-branch-0.9-with-client-rebase_rework
      5721c5a [Thomas Graves] update AkkaUtilsSuite test for the actorSelection changes, fix typos based on comments, and remove extra lines I missed in rebase from AkkaUtils
      f351763 [Thomas Graves] Add Security to Spark - Akka, Http, ConnectionManager, UI to use servlets
      7edbea41
    • Kyle Ellrott's avatar
      SPARK-942: Do not materialize partitions when DISK_ONLY storage level is used · 40566e10
      Kyle Ellrott authored
      This is a port of a pull request original targeted at incubator-spark: https://github.com/apache/incubator-spark/pull/180
      
      Essentially, if a user returns a generative iterator (from a flatMap operation), when trying to persist the data, Spark would first unroll the iterator into an ArrayBuffer and then try to figure out if it could store the data. In cases where the user provided an iterator that generated more data than available memory, this would cause a crash. With this patch, if the user requests a persist with 'StorageLevel.DISK_ONLY', the iterator will be unrolled as it is fed into the serializer.
      
      To do this, two changes were made:
      1) The type of the 'values' argument in the putValues method of the BlockStore interface was changed from ArrayBuffer to Iterator (and all code interfacing with this method was modified to connect correctly).
      2) The JavaSerializer now calls the ObjectOutputStream 'reset' method every 1000 objects. This was done because the ObjectOutputStream caches objects (thus preventing them from being GC'd) in order to write more compact serialization. If reset is never called, the memory eventually fills up; if it is called too often, the serialization streams become much larger because of redundant class descriptions.
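      The reset-every-N idea has a close Python analogue: pickle's memo table, like ObjectOutputStream's back-reference cache, keeps every written object alive until cleared. Below is a hypothetical sketch of the same trade-off (this is not Spark's Java code); clearing the memo every N objects bounds memory at the cost of some redundancy in the stream.

```python
import pickle

class ResettingWriter:
    """Serialize objects to a stream, clearing the memo table every
    reset_interval objects so the writer's memory stays bounded."""
    def __init__(self, stream, reset_interval=1000):
        self._pickler = pickle.Pickler(stream)
        self._interval = reset_interval
        self._count = 0

    def write(self, obj):
        self._pickler.dump(obj)
        self._count += 1
        if self._count % self._interval == 0:
            # Analogous to ObjectOutputStream.reset() in the Java change.
            self._pickler.clear_memo()
```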
      
      Author: Kyle Ellrott <kellrott@gmail.com>
      
      Closes #50 from kellrott/iterator-to-disk and squashes the following commits:
      
      9ef7cb8 [Kyle Ellrott] Fixing formatting issues.
      60e0c57 [Kyle Ellrott] Fixing issues (formatting, variable names, etc.) from review comments
      8aa31cd [Kyle Ellrott] Merge ../incubator-spark into iterator-to-disk
      33ac390 [Kyle Ellrott] Merge branch 'iterator-to-disk' of github.com:kellrott/incubator-spark into iterator-to-disk
      2f684ea [Kyle Ellrott] Refactoring the BlockManager to replace the Either[Either[A,B]] usage. Now using trait 'Values'. Also modified BlockStore.putBytes call to return PutResult, so that it behaves like putValues.
      f70d069 [Kyle Ellrott] Adding docs for spark.serializer.objectStreamReset configuration
      7ccc74b [Kyle Ellrott] Moving the 'LargeIteratorSuite' to simply test persistance of iterators. It doesn't try to invoke a OOM error any more
      16a4cea [Kyle Ellrott] Streamlined the LargeIteratorSuite unit test. It should now run in ~25 seconds. Confirmed that it still crashes an unpatched copy of Spark.
      c2fb430 [Kyle Ellrott] Removing more un-needed array-buffer to iterator conversions
      627a8b7 [Kyle Ellrott] Wrapping a few long lines
      0f28ec7 [Kyle Ellrott] Adding second putValues to BlockStore interface that accepts an ArrayBuffer (rather then an Iterator). This will allow BlockStores to have slightly different behaviors dependent on whether they get an Iterator or ArrayBuffer. In the case of the MemoryStore, it needs to duplicate and cache an Iterator into an ArrayBuffer, but if handed a ArrayBuffer, it can skip the duplication.
      656c33e [Kyle Ellrott] Fixing the JavaSerializer to read from the SparkConf rather then the System property.
      8644ee8 [Kyle Ellrott] Merge branch 'master' into iterator-to-disk
      00c98e0 [Kyle Ellrott] Making the Java ObjectStreamSerializer reset rate configurable by the system variable 'spark.serializer.objectStreamReset', default is not 10000.
      40fe1d7 [Kyle Ellrott] Removing rouge space
      31fe08e [Kyle Ellrott] Removing un-needed semi-colons
      9df0276 [Kyle Ellrott] Added check to make sure that streamed-to-dist RDD actually returns good data in the LargeIteratorSuite
      a6424ba [Kyle Ellrott] Wrapping long line
      2eeda75 [Kyle Ellrott] Fixing dumb mistake ("||" instead of "&&")
      0e6f808 [Kyle Ellrott] Deleting temp output directory when done
      95c7f67 [Kyle Ellrott] Simplifying StorageLevel checks
      56f71cd [Kyle Ellrott] Merge branch 'master' into iterator-to-disk
      44ec35a [Kyle Ellrott] Adding some comments.
      5eb2b7e [Kyle Ellrott] Changing the JavaSerializer reset to occur every 1000 objects.
      f403826 [Kyle Ellrott] Merge branch 'master' into iterator-to-disk
      81d670c [Kyle Ellrott] Adding unit test for straight to disk iterator methods.
      d32992f [Kyle Ellrott] Merge remote-tracking branch 'origin/master' into iterator-to-disk
      cac1fad [Kyle Ellrott] Fixing MemoryStore, so that it converts incoming iterators to ArrayBuffer objects. This was previously done higher up the stack.
      efe1102 [Kyle Ellrott] Changing CacheManager and BlockManager to pass iterators directly to the serializer when a 'DISK_ONLY' persist is called. This is in response to SPARK-942.
      40566e10