  1. Sep 15, 2015
    • Josh Rosen's avatar
      [SPARK-10381] Fix mixup of taskAttemptNumber & attemptId in OutputCommitCoordinator · 38700ea4
      Josh Rosen authored
      When speculative execution is enabled, consider a scenario where the authorized committer of a particular output partition fails during the OutputCommitter.commitTask() call. In this case, the OutputCommitCoordinator is supposed to release that committer's exclusive lock on committing once that task fails. However, due to a unit mismatch (we used task attempt number in one place and task attempt id in another) the lock will not be released, causing Spark to go into an infinite retry loop.
      
      This bug was masked by the fact that the OutputCommitCoordinator does not have enough end-to-end tests (the current tests use many mocks). Another contributing factor is that we have many similarly named identifiers with different semantics but the same data types (e.g. attemptNumber and taskAttemptId), and inconsistent variable naming makes them difficult to distinguish.
      
      This patch adds a regression test and fixes this bug by always using task attempt numbers throughout this code.
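
      The failure mode can be illustrated with a small Python sketch (hypothetical names, not Spark's actual code): the coordinator must release the commit lock by comparing the *same* identifier it stored. If the release path compared a different unit (a global attempt id instead of the per-task attempt number), the comparison would never match and the lock would never be freed.

      ```python
      # Hypothetical sketch (not Spark's actual code) of why mixing two
      # identifiers leaves the commit lock held forever.

      class OutputCommitCoordinator:
          def __init__(self):
              # (stage, partition) -> authorized task attempt *number*
              self.authorized = {}

          def can_commit(self, stage, partition, attempt_number):
              key = (stage, partition)
              if key not in self.authorized:
                  self.authorized[key] = attempt_number  # grant the lock
              return self.authorized[key] == attempt_number

          def task_failed(self, stage, partition, attempt_number):
              key = (stage, partition)
              # The fix: compare the same unit (attempt number) that was stored.
              # Comparing a different identifier here would never match, so the
              # lock would never be released and retries would loop forever.
              if self.authorized.get(key) == attempt_number:
                  del self.authorized[key]

      coord = OutputCommitCoordinator()
      assert coord.can_commit(stage=1, partition=0, attempt_number=0)   # granted
      coord.task_failed(stage=1, partition=0, attempt_number=0)         # released
      assert coord.can_commit(stage=1, partition=0, attempt_number=1)   # retry can proceed
      ```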
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #8544 from JoshRosen/SPARK-10381.
      38700ea4
    • vinodkc's avatar
      [SPARK-10575] [SPARK CORE] Wrapped RDD.takeSample with Scope · 99ecfa59
      vinodkc authored
      Remove return statements in RDD.takeSample and wrap it withScope
      
      Author: vinodkc <vinod.kc.in@gmail.com>
      Author: vinodkc <vinodkc@users.noreply.github.com>
      Author: Vinod K C <vinod.kc@huawei.com>
      
      Closes #8730 from vinodkc/fix_takesample_return.
      99ecfa59
    • Reynold Xin's avatar
      [SPARK-10612] [SQL] Add prepare to LocalNode. · a63cdc76
      Reynold Xin authored
      The idea is that we should separate the function call that does memory reservation (i.e. prepare()) from the function call that consumes the input (e.g. open()), so that all operators have a chance to reserve memory before any input is consumed.
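      
      A minimal Python sketch of the pattern (assumed names, not Spark's actual LocalNode API): prepare() recurses over the whole operator tree to reserve memory before open() consumes any input.
      
      ```python
      # Sketch: separate memory reservation (prepare) from input
      # consumption (open) across an operator tree.

      class LocalNode:
          def __init__(self, children=()):
              self.children = list(children)
              self.reserved = False

          def prepare(self):
              # Reserve memory for this operator, then recurse, so every
              # node holds its reservation before any input is consumed.
              self.reserved = True
              for child in self.children:
                  child.prepare()

          def open(self):
              assert self.reserved, "prepare() must run before open()"
              for child in self.children:
                  child.open()

      root = LocalNode(children=[LocalNode(), LocalNode()])
      root.prepare()  # all operators reserve memory first
      root.open()     # only then is input consumed
      ```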
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8761 from rxin/SPARK-10612.
      a63cdc76
    • Andrew Or's avatar
      [SPARK-10548] [SPARK-10563] [SQL] Fix concurrent SQL executions · b6e99863
      Andrew Or authored
      *Note: this is for master branch only.* The fix for branch-1.5 is at #8721.
      
      The query execution ID is currently passed from a thread to its children, which is not the intended behavior. This led to `IllegalArgumentException: spark.sql.execution.id is already set` when running queries in parallel, e.g.:
      ```
      (1 to 100).par.foreach { _ =>
        sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()
      }
      ```
      The cause is that `SparkContext`'s local properties are inherited by default. This patch adds a way to exclude keys we don't want to be inherited, and makes SQL go through that code path.
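      
      The idea can be sketched in a few lines of Python (hypothetical helper, not Spark's actual API): child threads get a copy of the parent's local properties minus any keys marked non-inheritable, such as the SQL execution ID.
      
      ```python
      # Sketch: inherit a copy of the parent's local properties, excluding
      # keys that must stay thread-local (e.g. the SQL execution ID), so
      # parallel queries don't collide on the same execution ID.

      NON_INHERITABLE = {"spark.sql.execution.id"}

      def inherit_properties(parent_props):
          return {k: v for k, v in parent_props.items() if k not in NON_INHERITABLE}

      parent = {"spark.sql.execution.id": "42", "spark.job.description": "my job"}
      child = inherit_properties(parent)
      assert "spark.sql.execution.id" not in child
      assert child["spark.job.description"] == "my job"
      ```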
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #8710 from andrewor14/concurrent-sql-executions.
      b6e99863
    • DB Tsai's avatar
      [SPARK-7685] [ML] Apply weights to different samples in Logistic Regression · be52faa7
      DB Tsai authored
      In a fraud detection dataset, almost all the samples are negative while only a couple of them are positive. This kind of highly imbalanced data biases the model toward the negative class, resulting in poor performance. scikit-learn provides a correction allowing users to over-/undersample the samples of each class according to given weights; in auto mode, it selects weights inversely proportional to class frequencies in the training set. This can be done more efficiently by multiplying the weights into the loss and gradient instead of actually over-/undersampling the training dataset, which is very expensive.
      http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
      On the other hand, some of the training data may be more important, like the training samples from tenured users, while the training samples from new users may be less important. We should be able to provide an additional "weight: Double" field in the LabeledPoint to weight samples differently in the learning algorithm.
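      
      Folding weights into the objective can be sketched in plain Python (not Spark's implementation): each sample's contribution to the log-loss and gradient is simply scaled by its weight, so doubling a weight is equivalent to duplicating the sample.
      
      ```python
      # Sketch: per-sample weights multiplied into logistic loss and gradient.
      import math

      def sigmoid(z):
          return 1.0 / (1.0 + math.exp(-z))

      def weighted_logloss_grad(X, y, w, weights):
          """Weighted loss and gradient for coefficients w over rows X, labels y."""
          n_features = len(w)
          loss, grad = 0.0, [0.0] * n_features
          for xi, yi, wi in zip(X, y, weights):
              p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
              loss -= wi * (yi * math.log(p) + (1 - yi) * math.log(1 - p))
              for j in range(n_features):
                  grad[j] += wi * (p - yi) * xi[j]  # weight scales each contribution
          return loss, grad

      # Doubling a sample's weight has the same effect as duplicating the sample.
      l_dup, _ = weighted_logloss_grad([[1.0], [1.0]], [1, 1], [0.0], [1.0, 1.0])
      l_wt, _ = weighted_logloss_grad([[1.0]], [1], [0.0], [2.0])
      assert abs(l_dup - l_wt) < 1e-12
      ```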
      
      Author: DB Tsai <dbt@netflix.com>
      Author: DB Tsai <dbt@dbs-mac-pro.corp.netflix.com>
      
      Closes #7884 from dbtsai/SPARK-7685.
      be52faa7
    • Wenchen Fan's avatar
      [SPARK-10475] [SQL] improve column pruning for Project on Sort · 31a229aa
      Wenchen Fan authored
      Sometimes we can't push down the whole `Project` through `Sort`, but we still have a chance to push down part of it.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #8644 from cloud-fan/column-prune.
      31a229aa
    • Liang-Chi Hsieh's avatar
      [SPARK-10437] [SQL] Support aggregation expressions in Order By · 841972e2
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-10437
      
      If an expression in `SortOrder` is a resolved one, such as `count(1)`, the corresponding `Analyzer` rule that makes it work in ORDER BY will not be applied.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #8599 from viirya/orderby-agg.
      841972e2
    • Jacek Laskowski's avatar
      [DOCS] Small fixes to Spark on Yarn doc · 416003b2
      Jacek Laskowski authored
      * a follow-up to 16b6d186 as the `--num-executors` flag is not supported.
      * links + formatting
      
      Author: Jacek Laskowski <jacek.laskowski@deepsense.io>
      
      Closes #8762 from jaceklaskowski/docs-spark-on-yarn.
      416003b2
    • Xiangrui Meng's avatar
      Closes #8738 · 0d9ab016
      Xiangrui Meng authored
      Closes #8767
      Closes #2491
      Closes #6795
      Closes #2096
      Closes #7722
      0d9ab016
    • noelsmith's avatar
      [PYSPARK] [MLLIB] [DOCS] Replaced addversion with versionadded in mllib.random · 7ca30b50
      noelsmith authored
      Missed this when reviewing `pyspark.mllib.random` for SPARK-10275.
      
      Author: noelsmith <mail@noelsmith.com>
      
      Closes #8773 from noel-smith/mllib-random-versionadded-fix.
      7ca30b50
    • Marcelo Vanzin's avatar
      [SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py. · 8abef21d
      Marcelo Vanzin authored
      This change does two things:
      
      - tag a few tests and adds the mechanism in the build to be able to disable those tags,
        both in maven and sbt, for both junit and scalatest suites.
      - add some logic to run-tests.py to disable some tags depending on what files have
        changed; that's used to disable expensive tests when a module hasn't explicitly
        been changed, to speed up testing for changes that don't directly affect those
        modules.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #8437 from vanzin/test-tags.
      8abef21d
    • Yuhao Yang's avatar
      [SPARK-10491] [MLLIB] move RowMatrix.dspr to BLAS · c35fdcb7
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-10491
      
      We implemented dspr with sparse vector support in `RowMatrix`. This method is also used in WeightedLeastSquares and other places. It would be useful to move it to `linalg.BLAS`.
      
      Let me know if new UT needed.
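      
      For reference, the semantics of dspr can be sketched in a few lines of Python: it performs the symmetric rank-1 update A := alpha * x * xᵀ + A on a symmetric matrix stored in packed upper-triangular, column-major form (the dense-vector case; the Spark version also supports sparse vectors).
      
      ```python
      # Sketch of BLAS dspr semantics: A := alpha * x * x^T + A, with A
      # stored as the packed upper triangle in column-major order.
      def dspr(n, alpha, x, ap):
          k = 0
          for j in range(n):            # column j
              for i in range(j + 1):    # rows 0..j of the upper triangle
                  ap[k] += alpha * x[i] * x[j]
                  k += 1

      A = [0.0] * 6                     # packed 3x3 upper triangle: 3*(3+1)/2 entries
      dspr(3, 1.0, [1.0, 2.0, 3.0], A)
      assert A == [1.0, 2.0, 4.0, 3.0, 6.0, 9.0]
      ```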
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #8663 from hhbyyh/movedspr.
      c35fdcb7
    • Reynold Xin's avatar
      Update version to 1.6.0-SNAPSHOT. · 09b7e7c1
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8350 from rxin/1.6.
      09b7e7c1
    • Robin East's avatar
      [SPARK-10598] [DOCS] · 6503c4b5
      Robin East authored
      Comments preceding the toMessage method state: "The edge partition is encoded in the lower 30 bytes of the Int, and the position is encoded in the upper 2 bytes of the Int." References to bytes should be changed to bits.
      
      This contribution is my original work and I license the work to the Spark project under its open source license.
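      
      The packing that the comment describes is easy to verify in Python (hypothetical helpers mirroring the documented layout): 2 bits leave room for positions 0-3, and 30 bits for partitions up to 2^30 - 1.
      
      ```python
      # Sketch of the layout: position in the upper 2 bits of a 32-bit Int,
      # edge partition in the lower 30 bits.
      def to_message(position, partition):
          assert 0 <= position < 4 and 0 <= partition < (1 << 30)
          return (position << 30) | partition

      def from_message(msg):
          return (msg >> 30) & 0x3, msg & ((1 << 30) - 1)

      msg = to_message(position=3, partition=12345)
      assert from_message(msg) == (3, 12345)
      ```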
      
      Author: Robin East <robin.east@xense.co.uk>
      
      Closes #8756 from insidedctm/master.
      6503c4b5
    • Jacek Laskowski's avatar
      Small fixes to docs · 833be733
      Jacek Laskowski authored
      Links now work properly + consistent use of *Spark standalone cluster* (Spark uppercase + lowercase for the rest -- seems agreed in the other places in the docs).
      
      Author: Jacek Laskowski <jacek.laskowski@deepsense.io>
      
      Closes #8759 from jaceklaskowski/docs-submitting-apps.
      833be733
  2. Sep 14, 2015
  3. Sep 13, 2015
  4. Sep 12, 2015
    • Josh Rosen's avatar
      [SPARK-10330] Add Scalastyle rule to require use of SparkHadoopUtil JobContext methods · b3a7480a
      Josh Rosen authored
      This is a followup to #8499 which adds a Scalastyle rule to mandate the use of SparkHadoopUtil's JobContext accessor methods and fixes the existing violations.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #8521 from JoshRosen/SPARK-10330-part2.
      b3a7480a
    • JihongMa's avatar
      [SPARK-6548] Adding stddev to DataFrame functions · f4a22808
      JihongMa authored
      Adding STDDEV support for DataFrame using a one-pass online/parallel algorithm to compute variance. Please review the code change.
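      
      The one-pass/parallel approach can be sketched in Python (generic Welford update plus Chan's merge step, not the patch's actual code): each partition accumulates (count, mean, M2) in a single pass, and partition states are merged without revisiting the data.
      
      ```python
      # Sketch: one-pass (Welford) variance update plus the merge step
      # used to combine per-partition states in parallel.
      import math

      def update(state, x):
          count, mean, m2 = state
          count += 1
          delta = x - mean
          mean += delta / count
          m2 += delta * (x - mean)        # running sum of squared deviations
          return count, mean, m2

      def merge(a, b):
          (na, ma, m2a), (nb, mb, m2b) = a, b
          n = na + nb
          delta = mb - ma
          mean = ma + delta * nb / n
          m2 = m2a + m2b + delta * delta * na * nb / n
          return n, mean, m2

      data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
      left = right = (0, 0.0, 0.0)
      for x in data[:4]:
          left = update(left, x)
      for x in data[4:]:
          right = update(right, x)
      n, mean, m2 = merge(left, right)
      assert abs(math.sqrt(m2 / n) - 2.0) < 1e-12  # population stddev of this set is 2
      ```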
      
      Author: JihongMa <linlin200605@gmail.com>
      Author: Jihong MA <linlin200605@gmail.com>
      Author: Jihong MA <jihongma@jihongs-mbp.usca.ibm.com>
      Author: Jihong MA <jihongma@Jihongs-MacBook-Pro.local>
      
      Closes #6297 from JihongMA/SPARK-SQL.
      f4a22808
    • Sean Owen's avatar
      [SPARK-10547] [TEST] Streamline / improve style of Java API tests · 22730ad5
      Sean Owen authored
      Fix a few Java API test style issues: unused generic types, exceptions, wrong assert argument order
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #8706 from srowen/SPARK-10547.
      22730ad5
    • Nithin Asokan's avatar
      [SPARK-10554] [CORE] Fix NPE with ShutdownHook · 8285e3b0
      Nithin Asokan authored
      https://issues.apache.org/jira/browse/SPARK-10554
      
      Fixes NPE when ShutdownHook tries to cleanup temporary folders
      
      Author: Nithin Asokan <Nithin.Asokan@Cerner.com>
      
      Closes #8720 from nasokan/SPARK-10554.
      8285e3b0
    • Daniel Imfeld's avatar
      [SPARK-10566] [CORE] SnappyCompressionCodec init exception handling masks... · 6d836780
      Daniel Imfeld authored
      [SPARK-10566] [CORE] SnappyCompressionCodec init exception handling masks important error information
      
      When throwing an IllegalArgumentException in SnappyCompressionCodec.init, chain the existing exception. This allows potentially important debugging info to be passed to the user.
      
      Manual testing shows the exception chained properly, and the test suite still looks fine as well.
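      
      The same idea in Python, for illustration (the actual fix is in Scala): chain the root cause onto the wrapping exception so its message and stack trace survive instead of being masked.
      
      ```python
      # Python analogue of the fix: chain the original exception so its
      # details survive instead of being masked by the wrapper.
      def init_codec(load):
          try:
              load()
          except Exception as e:
              # Without "from e" the root cause would be lost.
              raise ValueError("snappy does not work on this platform") from e

      def broken_load():
          # Hypothetical failure standing in for a native-library load error.
          raise OSError("native library not found")

      try:
          init_codec(broken_load)
      except ValueError as e:
          assert isinstance(e.__cause__, OSError)  # original error preserved
      ```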
      
      This contribution is my original work and I license the work to the project under the project's open source license.
      
      Author: Daniel Imfeld <daniel@danielimfeld.com>
      
      Closes #8725 from dimfeld/dimfeld-patch-1.
      6d836780