  1. Dec 04, 2015
    • rotems's avatar
      [SPARK-12080][CORE] Kryo - Support multiple user registrators · f30373f5
      rotems authored
      Author: rotems <roter>
      
      Closes #10078 from Botnaim/KryoMultipleCustomRegistrators.
      f30373f5
    • meiyoula's avatar
      [SPARK-12142][CORE]Reply false when container allocator is not ready and reset target · bbfc16ec
      meiyoula authored
      With dynamic allocation enabled, when a new AM is starting and the ExecutorAllocationManager sends a RequestExecutor message to the AM, the whole application will hang if the container allocator is not ready.
      
      Author: meiyoula <1039320815@qq.com>
      
      Closes #10138 from XuTingjun/patch-1.
      bbfc16ec
    • Josh Rosen's avatar
      [SPARK-12112][BUILD] Upgrade to SBT 0.13.9 · b7204e1d
      Josh Rosen authored
      We should upgrade to SBT 0.13.9, since this is a requirement in order to use SBT's new Maven-style resolution features (which will be done in a separate patch, because it's blocked by some binary compatibility issues in the POM reader plugin).
      
      I also upgraded Scalastyle to version 0.8.0, which was necessary in order to fix a Scala 2.10.5 compatibility issue (see https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is slightly stricter about whitespace surrounding tokens, so I fixed the new style violations.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10112 from JoshRosen/upgrade-to-sbt-0.13.9.
      b7204e1d
    • Marcelo Vanzin's avatar
      [SPARK-11314][BUILD][HOTFIX] Add exclusion for moved YARN classes. · d64806b3
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #10147 from vanzin/SPARK-11314.
      d64806b3
    • Burak Yavuz's avatar
      [SPARK-12058][STREAMING][KINESIS][TESTS] fix Kinesis python tests · 302d68de
      Burak Yavuz authored
      Python tests require access to the `KinesisTestUtils` file. When this file lives under src/test, Python can't access it, since it is not available in the assembly jar.
      
      However, moving KinesisTestUtils to src/main would require adding the Kinesis Producer Library (KPL) as a dependency. To avoid that, I moved KinesisTestUtils to src/main and extended it with ExtendedKinesisTestUtils, which lives under src/test and adds KPL support.
      
      cc zsxwing tdas
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #10050 from brkyvz/kinesis-py.
      302d68de
    • Dmitry Erastov's avatar
      [SPARK-6990][BUILD] Add Java linting script; fix minor warnings · d0d82227
      Dmitry Erastov authored
      This replaces https://github.com/apache/spark/pull/9696
      
      Invoke Checkstyle and print any errors to the console, failing the step.
      Use Google's style rules modified according to
      https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
      Some important checks are disabled (see TODOs in `checkstyle.xml`) due to
      multiple violations being present in the codebase.
      
      I suggest fixing those TODOs in separate PRs.
      
      More on Checkstyle can be found on the [official website](http://checkstyle.sourceforge.net/).
      
      Sample output (from [build 46345](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46345/consoleFull)) (duplicated because I run the build twice with different profiles):
      
      > Checkstyle checks failed at following occurrences:
      > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
      > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
      > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
      > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
      > [error] running /home/jenkins/workspace/SparkPullRequestBuilder2/dev/lint-java ; received return code 1
      
      Also fix some of the minor violations that didn't require sweeping changes.
      
      Apologies for the previous botched PRs - I finally figured out the issue.
      
      cr: JoshRosen, pwendell
      
      > I state that the contribution is my original work, and I license the work to the project under the project's open source license.
      
      Author: Dmitry Erastov <derastov@gmail.com>
      
      Closes #9867 from dskrvk/master.
      d0d82227
    • Nong's avatar
      [SPARK-12089] [SQL] Fix memory corrupt due to freeing a page being referenced · 95296d9b
      Nong authored
      When the spillable sort iterator was spilled, it mistakenly kept the
      last page in memory rather than the current page. This caused the
      current record to become corrupted.
      
      Author: Nong <nong@cloudera.com>
      
      Closes #10142 from nongli/spark-12089.
      95296d9b
    • kaklakariada's avatar
      Add links on how to set up IDEs for developing Spark · 17e4e021
      kaklakariada authored
      These links make it easier for new developers to work with Spark in their IDE.
      
      Author: kaklakariada <kaklakariada@users.noreply.github.com>
      
      Closes #10104 from kaklakariada/readme-developing-ide-gettting-started.
      17e4e021
    • Tathagata Das's avatar
      [SPARK-12122][STREAMING] Prevent batches from being submitted twice after... · 4106d80f
      Tathagata Das authored
      [SPARK-12122][STREAMING] Prevent batches from being submitted twice after recovering StreamingContext from checkpoint
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #10127 from tdas/SPARK-12122.
      4106d80f
  2. Dec 03, 2015
    • Sun Rui's avatar
      [SPARK-12104][SPARKR] collect() does not handle multiple columns with same name. · 5011f264
      Sun Rui authored
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #10118 from sun-rui/SPARK-12104.
      5011f264
    • Carson Wang's avatar
      [SPARK-11206] Support SQL UI on the history server (resubmit) · b6e9963e
      Carson Wang authored
      Resubmit #9297 and #9991
      On the live web UI, there is a SQL tab which provides valuable information for the SQL query. But once the workload is finished, we won't see the SQL tab on the history server. It will be helpful if we support SQL UI on the history server so we can analyze it even after its execution.
      
      To support SQL UI on the history server:
      1. I added an onOtherEvent method to the SparkListener trait and post all SQL related events to the same event bus.
      2. Two SQL events SparkListenerSQLExecutionStart and SparkListenerSQLExecutionEnd are defined in the sql module.
      3. The new SQL events are written to event log using Jackson.
      4. A new trait SparkHistoryListenerFactory is added to allow the history server to feed events to the SQL history listener. The SQL implementation is loaded at runtime using java.util.ServiceLoader.
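      The runtime discovery in point 4 can be sketched with plain java.util.ServiceLoader. This is a minimal sketch assuming a made-up factory interface, not Spark's actual trait; with no provider file registered under META-INF/services, the loader simply yields nothing:
      
      ```java
      import java.util.ServiceLoader;
      
      public class ServiceLoaderDemo {
          // Illustrative stand-in for the factory trait; not Spark's real type.
          public interface ListenerFactory {
              String name();
          }
      
          public static void main(String[] args) {
              // Providers are discovered at runtime from META-INF/services
              // entries on the classpath; none are registered in this sketch.
              ServiceLoader<ListenerFactory> loader =
                  ServiceLoader.load(ListenerFactory.class);
              int found = 0;
              for (ListenerFactory f : loader) {
                  found++;
              }
              System.out.println("factories discovered: " + found);
          }
      }
      ```
      
      Shipping a jar with a `META-INF/services/<interface name>` file listing an implementation class is all that's needed for the history server to pick it up.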
      
      Author: Carson Wang <carson.wang@intel.com>
      
      Closes #10061 from carsonwang/SqlHistoryUI.
      b6e9963e
    • Anderson de Andrade's avatar
      [SPARK-12056][CORE] Create a TaskAttemptContext only after calling setConf. · f434f36d
      Anderson de Andrade authored
      TaskAttemptContext's constructor clones the configuration instead of referencing it. Calling setConf after creating the TaskAttemptContext means any changes made to the configuration inside setConf go unseen by RecordReader instances.
      
      As an example, Titan's InputFormat will change conf when calling setConf. They wrap their InputFormat around Cassandra's ColumnFamilyInputFormat, and append Cassandra's configuration. This change fixes the following error when using Titan's CassandraInputFormat with Spark:
      
      *java.lang.RuntimeException: org.apache.thrift.protocol.TProtocolException: Required field 'keyspace' was not present! Struct: set_keyspace_args(keyspace:null)*
      
      There's a discussion of this error here: https://groups.google.com/forum/#!topic/aureliusgraphs/4zpwyrYbGAE
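      The ordering bug can be reproduced in miniature: a context that snapshots its configuration at construction time never sees later setConf-style mutations. TaskContext below is a hypothetical stand-in for Hadoop's TaskAttemptContext, and the key name is illustrative:
      
      ```java
      import java.util.HashMap;
      import java.util.Map;
      
      public class CloneOrdering {
          static class TaskContext {
              private final Map<String, String> conf;
              TaskContext(Map<String, String> conf) {
                  this.conf = new HashMap<>(conf); // clones, does not reference
              }
              String get(String key) { return conf.get(key); }
          }
      
          public static void main(String[] args) {
              Map<String, String> conf = new HashMap<>();
      
              // Wrong order: context created before the setConf-style mutation,
              // so the mutation lands on a map the context no longer sees.
              TaskContext early = new TaskContext(conf);
              conf.put("cassandra.keyspace", "titan"); // simulates setConf()
              System.out.println(early.get("cassandra.keyspace")); // null
      
              // Fixed order (what this patch enforces): mutate first, then
              // construct the context from the fully populated configuration.
              TaskContext late = new TaskContext(conf);
              System.out.println(late.get("cassandra.keyspace"));
          }
      }
      ```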
      
      Author: Anderson de Andrade <adeandrade@verticalscope.com>
      
      Closes #10046 from adeandrade/newhadooprdd-fix.
      f434f36d
    • felixcheung's avatar
      [SPARK-12019][SPARKR] Support character vector for sparkR.init(), check param and fix doc · 2213441e
      felixcheung authored
      This also adds tests.
      Spark submit expects a comma-separated list.
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #10034 from felixcheung/sparkrinitdoc.
      2213441e
    • Tathagata Das's avatar
      [FLAKY-TEST-FIX][STREAMING][TEST] Make sure StreamingContexts are shutdown after test · a02d4727
      Tathagata Das authored
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #10124 from tdas/InputStreamSuite-flaky-test.
      a02d4727
    • Nicholas Chammas's avatar
      [SPARK-12107][EC2] Update spark-ec2 versions · ad7cea6f
      Nicholas Chammas authored
      I haven't created a JIRA. If we absolutely need one I'll do it, but I'm fine with not getting mentioned in the release notes if that's the only purpose it'll serve.
      
      cc marmbrus - We should include this in 1.6-RC2 if there is one. I can open a second PR against branch-1.6 if necessary.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #10109 from nchammas/spark-ec2-versions.
      ad7cea6f
    • Yanbo Liang's avatar
      [MINOR][ML] Use coefficients replace weights · d576e76b
      Yanbo Liang authored
      Use ```coefficients``` to replace ```weights```; I hope these are the last two.
      mengxr
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #10065 from yanboliang/coefficients.
      d576e76b
    • Andrew Or's avatar
      [SPARK-12108] Make event logs smaller · 688e521c
      Andrew Or authored
      **Problem.** Event logs in 1.6 were much bigger than 1.5. I ran page rank and the event log size in 1.6 was almost 5x that in 1.5. I did a bisect to find that the RDD callsite added in #9398 is largely responsible for this.
      
      **Solution.** This patch removes the long form of the callsite (which is not used!) from the event log. This reduces the size of the event log significantly.
      
      *Note on compatibility*: if this patch is to be merged into 1.6.0, then it won't break any compatibility. Otherwise, if it is merged into 1.6.1, then we might need to add more backward compatibility handling logic (currently does not exist yet).
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #10115 from andrewor14/smaller-event-logs.
      688e521c
    • Shixiong Zhu's avatar
      [SPARK-12101][CORE] Fix thread pools that cannot cache tasks in Worker and AppClient · 649be4fa
      Shixiong Zhu authored
      `SynchronousQueue` cannot cache any task. This issue is similar to #9978. It's an easy fix. Just use the fixed `ThreadUtils.newDaemonCachedThreadPool`.
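      Why a bounded pool backed by a `SynchronousQueue` cannot hold ("cache") work can be shown directly with the JDK; this is a generic sketch, not Spark code:
      
      ```java
      import java.util.concurrent.*;
      
      public class SyncQueueDemo {
          public static void main(String[] args) throws Exception {
              // A SynchronousQueue has zero capacity: once every thread is
              // busy, offer() fails and the submission is rejected outright.
              ThreadPoolExecutor bounded = new ThreadPoolExecutor(
                  1, 1, 60, TimeUnit.SECONDS, new SynchronousQueue<Runnable>());
              CountDownLatch block = new CountDownLatch(1);
              bounded.execute(() -> {
                  try { block.await(); } catch (InterruptedException e) { }
              });
              boolean rejected = false;
              try {
                  bounded.execute(() -> { }); // no idle thread, no queue slot
              } catch (RejectedExecutionException e) {
                  rejected = true;
              }
              System.out.println("second task rejected: " + rejected);
      
              // A cached pool (the shape of ThreadUtils.newDaemonCachedThreadPool)
              // spawns a new thread instead of rejecting.
              ExecutorService cached = Executors.newCachedThreadPool();
              cached.execute(() -> { });
              System.out.println("cached pool accepted the task");
      
              block.countDown();
              bounded.shutdown();
              cached.shutdown();
          }
      }
      ```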
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10108 from zsxwing/fix-threadpool.
      649be4fa
    • jerryshao's avatar
      [SPARK-12059][CORE] Avoid assertion error when unexpected state transition met in Master · 7bc9e1db
      jerryshao authored
      Downgrade to warning log for unexpected state transition.
      
      andrewor14 please review, thanks a lot.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #10091 from jerryshao/SPARK-12059.
      7bc9e1db
    • Steve Loughran's avatar
      [SPARK-11314][YARN] add service API and test service for Yarn Cluster schedulers · 8fa3e474
      Steve Loughran authored
      This is purely the yarn/src/main and yarn/src/test bits of the YARN ATS integration: the extension model to load and run implementations of `SchedulerExtensionService` in the yarn cluster scheduler process —and to stop them afterwards.
      
      There's duplication between the two schedulers, yarn-client and yarn-cluster, at least in terms of setting everything up, because the common superclass, `YarnSchedulerBackend` is in spark-core, and the extension services need the YARN app/attempt IDs.
      
      If you look at how the extension services are loaded, the case class `SchedulerExtensionServiceBinding` is used to pass in config info - currently just the Spark context and the YARN IDs, of which one, the attempt ID, will be null when running client-side. I'm passing in a case class so that extra arguments can be added to the binding class in the future while the method signature stays unchanged, keeping existing services loadable.
      
      There's no functional extension service here, just one for testing. The real tests come in the bigger pull requests. At the same time, there's no restriction of this extension service purely to the ATS history publisher. Anything else that wants to listen to the spark context and publish events could use this, and I'd also consider writing one for the YARN-913 registry service, so that the URLs of the web UI would be locatable through that (low priority; would make more sense if integrated with a REST client).
      
      There's no minicluster test. Given the test execution overhead of setting up minicluster tests, it'd probably be better to add an extension service into one of the existing tests.
      
      Author: Steve Loughran <stevel@hortonworks.com>
      
      Closes #9182 from steveloughran/stevel/feature/SPARK-1537-service.
      8fa3e474
    • felixcheung's avatar
      [SPARK-12116][SPARKR][DOCS] document how to workaround function name conflicts with dplyr · 43c575cb
      felixcheung authored
      shivaram
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #10119 from felixcheung/rdocdplyrmasked.
      43c575cb
    • microwishing's avatar
      [DOCUMENTATION][KAFKA] fix typo in kafka/OffsetRange.scala · 95b3cf12
      microwishing authored
      this is to fix some typo in external/kafka/src/main/scala/org/apache/spark/streaming/kafka/OffsetRange.scala
      
      Author: microwishing <wei.zhu@kaiyuandao.com>
      
      Closes #10121 from microwishing/master.
      95b3cf12
    • Jeff Zhang's avatar
      [DOCUMENTATION][MLLIB] typo in mllib doc · 7470d9ed
      Jeff Zhang authored
      cc mengxr
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #10093 from zjffdu/mllib_typo.
      7470d9ed
    • Huaxin Gao's avatar
      [SPARK-12088][SQL] check connection.isClosed before calling connection… · 5349851f
      Huaxin Gao authored
      The Java spec for java.sql.Connection declares:
      
      boolean getAutoCommit() throws SQLException
      Throws: SQLException - if a database access error occurs or this method is called on a closed connection
      
      So if conn.getAutoCommit is called on a closed connection, a SQLException will be thrown. Even though the code catches the SQLException and the program can continue, I think we should check conn.isClosed before calling conn.getAutoCommit to avoid the unnecessary SQLException.
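      The guard pattern the patch describes can be sketched with a stand-in class; ClosableResource below is hypothetical, not a JDBC type, but it throws on a closed handle the same way the spec says getAutoCommit does:
      
      ```java
      public class GuardDemo {
          static class ClosableResource {
              private boolean closed = false;
              void close() { closed = true; }
              boolean isClosed() { return closed; }
              boolean getAutoCommit() {
                  if (closed) {
                      throw new IllegalStateException("connection is closed");
                  }
                  return true;
              }
          }
      
          public static void main(String[] args) {
              ClosableResource conn = new ClosableResource();
              conn.close();
              // An unguarded getAutoCommit() call would throw here; checking
              // isClosed() first skips the call and avoids the exception.
              if (!conn.isClosed() && conn.getAutoCommit()) {
                  System.out.println("would restore autocommit");
              } else {
                  System.out.println("skipped: connection already closed");
              }
          }
      }
      ```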
      
      Author: Huaxin Gao <huaxing@oc0558782468.ibm.com>
      
      Closes #10095 from huaxingao/spark-12088.
      5349851f
  3. Dec 02, 2015
  4. Dec 01, 2015
    • Liang-Chi Hsieh's avatar
      [SPARK-11949][SQL] Check bitmasks to set nullable property · 0f37d1d7
      Liang-Chi Hsieh authored
      Following up #10038.
      
      We can use bitmasks to determine which grouping expressions need to be set as nullable.
      
      cc yhuai
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #10067 from viirya/fix-cube-following.
      0f37d1d7
    • Tathagata Das's avatar
      [SPARK-12087][STREAMING] Create new JobConf for every batch in saveAsHadoopFiles · 8a75a304
      Tathagata Das authored
      The JobConf object created in `DStream.saveAsHadoopFiles` is used concurrently in multiple places:
      * The JobConf is updated by `RDD.saveAsHadoopFile()` before the job is launched
      * The JobConf is serialized as part of the DStream checkpoints.
      These concurrent accesses (one thread updating it while another thread serializes it) can lead to a ConcurrentModificationException in the underlying Java HashMap used by the internal Hadoop Configuration object.
      
      The solution is to create a new JobConf in every batch, that is updated by `RDD.saveAsHadoopFile()`, while the checkpointing serializes the original JobConf.
      
      Tests to be added in #9988 will fail reliably without this patch. Keeping this patch really small to make sure that it can be added to previous branches.
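      The underlying hazard is ordinary HashMap fail-fast behavior. A single-threaded sketch makes it deterministic (the real race involves a serializer thread iterating while another thread mutates):
      
      ```java
      import java.util.ConcurrentModificationException;
      import java.util.HashMap;
      import java.util.Map;
      
      public class CmeDemo {
          public static void main(String[] args) {
              Map<String, String> conf = new HashMap<>();
              conf.put("a", "1");
              conf.put("b", "2");
      
              boolean failed = false;
              try {
                  for (String k : conf.keySet()) {
                      conf.put("c", "3"); // structural change during iteration
                  }
              } catch (ConcurrentModificationException e) {
                  failed = true;
              }
              System.out.println("modification during iteration failed: " + failed);
      
              // The fix mirrors the patch: give the mutating path its own copy,
              // so iteration (serialization) and mutation touch separate maps.
              Map<String, String> copy = new HashMap<>(conf);
              for (String k : conf.keySet()) {
                  copy.put("c", "3");
              }
              System.out.println("copy-per-batch succeeded");
          }
      }
      ```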
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #10088 from tdas/SPARK-12087.
      8a75a304
    • Davies Liu's avatar
      [SPARK-12077][SQL] change the default plan for single distinct · 96691fea
      Davies Liu authored
      This tries to match Spark 1.5's behavior for single distinct aggregation, but that approach is not scalable. We should be robust by default and provide a flag to address the performance regression for low-cardinality aggregation.
      
      cc yhuai nongli
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10075 from davies/agg_15.
      96691fea
    • Andrew Or's avatar
      [SPARK-12081] Make unified memory manager work with small heaps · d96f8c99
      Andrew Or authored
      The existing `spark.memory.fraction` (default 0.75) gives the system 25% of the space to work with. For small heaps, this is not enough: e.g. default 1GB leaves only 250MB system memory. This is especially a problem in local mode, where the driver and executor are crammed in the same JVM. Members of the community have reported driver OOM's in such cases.
      
      **New proposal.** We now reserve 300MB before taking the 75%. For 1GB JVMs, this leaves `(1024 - 300) * 0.75 = 543MB` for execution and storage. This is proposal (1) listed in the [JIRA](https://issues.apache.org/jira/browse/SPARK-12081).
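      The arithmetic behind the proposal, as a sketch (values in MB; the heap sizes are illustrative):
      
      ```java
      public class MemoryFraction {
          // Proposed scheme: reserve a fixed amount first, then apply the
          // fraction to what remains.
          static long usable(long heapMb, long reservedMb, double fraction) {
              return (long) ((heapMb - reservedMb) * fraction);
          }
      
          public static void main(String[] args) {
              // (1024 - 300) * 0.75 = 543 MB for execution and storage.
              System.out.println(usable(1024, 300, 0.75));
              // Old scheme left the system only the remaining 25% of a 1GB
              // heap, i.e. roughly 256 MB.
              System.out.println(1024 - (long) (1024 * 0.75));
          }
      }
      ```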
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #10081 from andrewor14/unified-memory-small-heaps.
      d96f8c99
    • Andrew Or's avatar
      [SPARK-8414] Ensure context cleaner periodic cleanups · 1ce4adf5
      Andrew Or authored
      Garbage collection triggers cleanups. If the driver JVM is huge and there is little memory pressure, we may never clean up shuffle files on executors. This is a problem for long-running applications (e.g. streaming).
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #10070 from andrewor14/periodic-gc.
      1ce4adf5
    • Yin Huai's avatar
      [SPARK-11596][SQL] In TreeNode's argString, if a TreeNode is not a child of... · e96a70d5
      Yin Huai authored
      [SPARK-11596][SQL] In TreeNode's argString, if a TreeNode is not a child of the current TreeNode, we should only return the simpleString.
      
      In TreeNode's argString, if a TreeNode is not a child of the current TreeNode, we will only return the simpleString.
      
      I tested the [following case provided by Cristian](https://issues.apache.org/jira/browse/SPARK-11596?focusedCommentId=15019241&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15019241).
      ```scala
      val c = (1 to 20).foldLeft[Option[DataFrame]] (None) { (curr, idx) =>
          println(s"PROCESSING >>>>>>>>>>> $idx")
          val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
          val union = curr.map(_.unionAll(df)).getOrElse(df)
          union.cache()
          Some(union)
        }
      
      c.get.explain(true)
      ```
      
      Without the change, `c.get.explain(true)` took 100s. With the change, `c.get.explain(true)` took 26ms.
      
      https://issues.apache.org/jira/browse/SPARK-11596
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #10079 from yhuai/SPARK-11596.
      e96a70d5
    • Yin Huai's avatar
      [SPARK-11352][SQL] Escape */ in the generated comments. · 5872a9d8
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-11352
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #10072 from yhuai/SPARK-11352.
      5872a9d8