Aug 25, 2015
• [SPARK-10238] [MLLIB] update since versions in mllib.linalg · ab431f8a
      Xiangrui Meng authored
      Same as #8421 but for `mllib.linalg`.
      
      cc dbtsai
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #8440 from mengxr/SPARK-10238 and squashes the following commits:
      
      b38437e [Xiangrui Meng] update since versions in mllib.linalg
• [SPARK-10233] [MLLIB] update since version in mllib.evaluation · 8668ead2
      Xiangrui Meng authored
      Same as #8421 but for `mllib.evaluation`.
      
      cc avulanov
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #8423 from mengxr/SPARK-10233.
• [SPARK-9888] [MLLIB] User guide for new LDA features · 125205cd
      Feynman Liang authored
       * Adds two new sections to LDA's user guide; one for each optimizer/model
 * Documents new features added to LDA (e.g. topXXXperXXX, asymmetric priors, hyperparameter optimization)
       * Cleans up a TODO and sets a default parameter in LDA code
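
A minimal sketch (not taken from the guide itself) of running LDA with each of the two documented optimizers, on a toy corpus:

```
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext("local[2]", "lda-example")

// Toy corpus: (document id, term-count vector)
val corpus = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 2.0, 0.0)),
  (1L, Vectors.dense(0.0, 1.0, 3.0))))

// EM optimizer (the default): produces a DistributedLDAModel
val emModel = new LDA().setK(2).setOptimizer("em").run(corpus)

// Online variational Bayes: produces a LocalLDAModel; supports
// hyperparameter optimization of the document-topic prior
val onlineModel = new LDA()
  .setK(2)
  .setOptimizer(new OnlineLDAOptimizer().setOptimizeDocConcentration(true))
  .run(corpus)
```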
      
      jkbradley hhbyyh
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #8254 from feynmanliang/SPARK-9888.
• [SPARK-10215] [SQL] Fix precision of division (follow the rule in Hive) · 7467b52e
      Davies Liu authored
Follow the rule in Hive for decimal division; see https://github.com/apache/hive/blob/ac755ebe26361a4647d53db2a28500f71697b276/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFOPDivide.java#L113
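
For reference, a simplified sketch of that Hive typing rule (illustrative only, not the Spark implementation), computing the result type of `d1(p1, s1) / d2(p2, s2)`:

```
// Simplified sketch of Hive's decimal-division typing rule; the overflow
// adjustment is the usual "keep at least 6 digits of scale" heuristic.
val MAX_PRECISION = 38

def divisionResultType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
  val scale = math.max(6, s1 + p2 + 1)
  val precision = p1 - s1 + s2 + scale
  if (precision <= MAX_PRECISION) (precision, scale)
  else {
    // Precision overflowed: keep the integer digits and shrink the scale,
    // but never below min(scale, 6).
    val intDigits = precision - scale
    val adjustedScale = math.max(MAX_PRECISION - intDigits, math.min(scale, 6))
    (MAX_PRECISION, adjustedScale)
  }
}

// Example: DECIMAL(10, 2) / DECIMAL(5, 3) => (19, 8)
println(divisionResultType(10, 2, 5, 3))
```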
      
      cc chenghao-intel
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8415 from davies/decimal_div2.
• [SPARK-10245] [SQL] Fix decimal literals with precision < scale · ec89bd84
      Davies Liu authored
In BigDecimal or java.math.BigDecimal, the precision can be smaller than the scale; for example, BigDecimal("0.001") has precision = 1 and scale = 3. But DecimalType requires that the precision be at least the scale, so we should use the maximum of precision and scale when inferring the schema from a decimal literal.
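
A small sketch of the idea (illustrative, not the actual patch):

```
import java.math.{BigDecimal => JBigDecimal}

val d = new JBigDecimal("0.001")
println((d.precision, d.scale))  // (1, 3): precision < scale

// DecimalType needs precision >= scale, so widen the precision when
// inferring the type from a literal.
val inferredPrecision = math.max(d.precision, d.scale)
println(s"DecimalType($inferredPrecision, ${d.scale})")  // DecimalType(3, 3)
```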
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8428 from davies/smaller_decimal.
• [SPARK-10239] [SPARK-10244] [MLLIB] update since versions in mllib.pmml and mllib.util · 00ae4be9
      Xiangrui Meng authored
      Same as #8421 but for `mllib.pmml` and `mllib.util`.
      
      cc dbtsai
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #8430 from mengxr/SPARK-10239 and squashes the following commits:
      
      a189acf [Xiangrui Meng] update since versions in mllib.pmml and mllib.util
• [SPARK-9797] [MLLIB] [DOC] StreamingLinearRegressionWithSGD.setConvergenceTol default value · 92059078
      Feynman Liang authored
      Adds default convergence tolerance (0.001, set in `GradientDescent.convergenceTol`) to `setConvergenceTol`'s scaladoc
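
A minimal usage sketch showing where the documented default applies:

```
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

// Setting the tolerance explicitly to its documented default (0.001);
// omitting the call has the same effect.
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(3))
  .setConvergenceTol(0.001)
```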
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #8424 from feynmanliang/SPARK-9797.
• [SPARK-10237] [MLLIB] update since versions in mllib.fpm · c619c755
      Xiangrui Meng authored
      Same as #8421 but for `mllib.fpm`.
      
      cc feynmanliang
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #8429 from mengxr/SPARK-10237.
• [SPARK-9800] Adds docs for GradientDescent$.runMiniBatchSGD alias · c0e9ff15
      Feynman Liang authored
* Adds doc for the alias of runMiniBatchSGD documenting the default value for convergenceTol
      * Cleans up a note in code
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #8425 from feynmanliang/SPARK-9800.
• [SPARK-10048] [SPARKR] Support arbitrary nested Java array in serde. · 71a138cd
      Sun Rui authored
This PR:
1. supports transferring arbitrary nested arrays from the JVM to the R side in SerDe;
2. building on 1, improves the collect() implementation so that it can collect data of complex types from a DataFrame.
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #8276 from sun-rui/SPARK-10048.
• [SPARK-10231] [MLLIB] update @Since annotation for mllib.classification · 16a2be1a
      Xiangrui Meng authored
      Update `Since` annotation in `mllib.classification`:
      
      1. add version to classes, objects, constructors, and public variables declared in constructors
      2. correct some versions
      3. remove `Since` on `toString`
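
A sketch of the annotation pattern (the class here is hypothetical; `@Since` is the annotation used inside Spark's own source tree):

```
import org.apache.spark.annotation.Since

// Hypothetical model class illustrating the pattern: version the class,
// the primary constructor, constructor-declared public vals, and methods,
// but not toString.
@Since("0.8.0")
class ExampleModel @Since("1.1.0") (
    @Since("1.0.0") val weights: Array[Double],
    @Since("1.0.0") val intercept: Double) {

  @Since("1.3.0")
  def predict(x: Array[Double]): Double =
    weights.zip(x).map { case (w, v) => w * v }.sum + intercept

  // no @Since here, per this change
  override def toString: String = s"ExampleModel(intercept=$intercept)"
}
```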
      
      MechCoder dbtsai
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #8421 from mengxr/SPARK-10231 and squashes the following commits:
      
      b2dce80 [Xiangrui Meng] update @Since annotation for mllib.classification
• [SPARK-10230] [MLLIB] Rename optimizeAlpha to optimizeDocConcentration · 881208a8
      Feynman Liang authored
      See [discussion](https://github.com/apache/spark/pull/8254#discussion_r37837770)
      
      CC jkbradley
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #8422 from feynmanliang/SPARK-10230.
• [SPARK-8531] [ML] Update ML user guide for MinMaxScaler · b37f0cc1
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-8531
      
      Update ML user guide for MinMaxScaler
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      Author: unknown <yuhaoyan@yuhaoyan-MOBL1.ccr.corp.intel.com>
      
      Closes #7211 from hhbyyh/minmaxdoc.
• [SPARK-10198] [SQL] Turn off partition verification by default · 5c08c86b
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #8404 from marmbrus/turnOffPartitionVerification.
• [SPARK-9613] [CORE] Ban use of JavaConversions and migrate all existing uses to JavaConverters · 69c9c177
      Sean Owen authored
      Replace `JavaConversions` implicits with `JavaConverters`
      
      Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet.
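
The migration pattern in a nutshell (a sketch, not code from the PR): explicit `.asScala`/`.asJava` calls replace the invisible implicits:

```
// Before: import scala.collection.JavaConversions._ converted silently.
// After: conversions are explicit and visible at the call site.
import java.util.{HashMap => JHashMap}
import scala.collection.JavaConverters._

val javaMap = new JHashMap[String, Int]()
javaMap.put("a", 1)

val scalaKeys = javaMap.keySet().asScala  // Java -> Scala, explicit
val javaList = Seq(1, 2, 3).asJava        // Scala -> Java, explicit
```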
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #8033 from srowen/SPARK-9613.
• Fixed a typo in DAGScheduler. · 7f1e507b
      ehnalis authored
      Author: ehnalis <zoltan.zvara@gmail.com>
      
      Closes #8308 from ehnalis/master.
• [DOC] add missing parameters in SparkContext.scala for scala doc · 5c148901
      Zhang, Liye authored
      Author: Zhang, Liye <liye.zhang@intel.com>
      
      Closes #8412 from liyezhang556520/minorDoc.
• [SPARK-10197] [SQL] Add null check in wrapperFor (inside HiveInspectors). · 0e6368ff
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-10197
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #8407 from yhuai/ORCSPARK-10197.
• [SPARK-10195] [SQL] Data sources Filter should not expose internal types · 7bc9a8c6
      Josh Rosen authored
      Spark SQL's data sources API exposes Catalyst's internal types through its Filter interfaces. This is a problem because types like UTF8String are not stable developer APIs and should not be exposed to third-parties.
      
      This issue caused incompatibilities when upgrading our `spark-redshift` library to work against Spark 1.5.0.  To avoid these issues in the future we should only expose public types through these Filter objects. This patch accomplishes this by using CatalystTypeConverters to add the appropriate conversions.
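
A sketch of the contract from a data source's point of view (the relation here is hypothetical; `PrunedFilteredScan` and `EqualTo` are the public API):

```
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}

// Hypothetical relation: after this patch, filter values arrive as public
// types (e.g. String), never internal ones like UTF8String.
trait ExampleRelation extends BaseRelation with PrunedFilteredScan {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    filters.foreach {
      case EqualTo("name", value) =>
        assert(value.isInstanceOf[String])  // a public type, not UTF8String
      case _ =>  // other filters ignored in this sketch
    }
    ???  // a real source would return the pruned, filtered rows here
  }
}
```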
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #8403 from JoshRosen/datasources-internal-vs-external-types.
• [SPARK-10177] [SQL] fix reading Timestamp in parquet from Hive · 2f493f7e
      Davies Liu authored
We misunderstood the Julian day and nanoseconds-of-the-day fields in Parquet (as TimestampType) from Hive/Impala: the two fields overlap, so they can't simply be added together.

To avoid confusing rounding when doing the conversion, we use `2440588` as the Julian day of the epoch of the unix timestamp (the exact value is 2440587.5).
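
The conversion, sketched with the constant described above (not the exact Spark code):

```
// Hive/Impala write Parquet INT96 timestamps as a Julian day plus
// nanoseconds within that day. Using 2440588 (not 2440587.5) as the
// Julian day of 1970-01-01 avoids the half-day rounding.
val JULIAN_DAY_OF_EPOCH = 2440588L
val MICROS_PER_DAY = 24L * 60 * 60 * 1000 * 1000

def fromJulianDay(day: Int, nanosOfDay: Long): Long =
  (day - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY + nanosOfDay / 1000

assert(fromJulianDay(2440588, 0L) == 0L)  // the Unix epoch itself
```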
      
      Author: Davies Liu <davies@databricks.com>
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #8400 from davies/timestamp_parquet.
• [SPARK-10210] [STREAMING] Filter out non-existent blocks before creating BlockRDD · 1fc37581
      Tathagata Das authored
When the write ahead log is not enabled, a recovered streaming driver still tries to run jobs using pre-failure block ids, and fails because those blocks no longer exist in memory (and cannot be recovered, as the receiver WAL is not enabled).

This occurs because the driver-side WAL of ReceivedBlockTracker recovers that past block information, and ReceiverInputDStream creates BlockRDDs even if those blocks do not exist.
      
      The solution in this PR is to filter out block ids that do not exist before creating the BlockRDD. In addition, it adds unit tests to verify other logic in ReceiverInputDStream.
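
The essence of the fix, sketched as a standalone helper (hypothetical name; the real change lives in ReceiverInputDStream):

```
import org.apache.spark.storage.{BlockId, BlockManagerMaster}

// Keep only the block ids the BlockManagerMaster still knows about, so a
// recovered driver never builds a BlockRDD over blocks lost in the failure.
def filterExistingBlocks(
    master: BlockManagerMaster,
    blockIds: Seq[BlockId]): Seq[BlockId] =
  blockIds.filter(id => master.contains(id))
```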
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #8405 from tdas/SPARK-10210.
• [SPARK-6196] [BUILD] Remove MapR profiles in favor of hadoop-provided · 57b960bf
      Sean Owen authored
      Follow up to https://github.com/apache/spark/pull/7047
      
pwendell mentioned that MapR should use `hadoop-provided` now, and indeed the new build script no longer produces `mapr3`/`mapr4` artifacts. Hence the action is to remove the profiles, which are now unused.
      
      CC trystanleftwich
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #8338 from srowen/SPARK-6196.
• [SPARK-10214] [SPARKR] [DOCS] Improve SparkR Column, DataFrame API docs · d4549fe5
      Yu ISHIKAWA authored
      cc: shivaram
      
      ## Summary
      
- Add name tags to each method in DataFrame.R and column.R
- Replace `rdname column` with `rdname {each_func}`, e.g. for the alias method: `rdname column` => `rdname alias`
      
      ## Generated PDF File
      https://drive.google.com/file/d/0B9biIZIU47lLNHN2aFpnQXlSeGs/view?usp=sharing
      
      ## JIRA
      [[SPARK-10214] Improve SparkR Column, DataFrame API docs - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10214)
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #8414 from yu-iskw/SPARK-10214.
• [SPARK-9293] [SPARK-9813] Analysis should check that set operations are only performed on tables with equal numbers of columns · 82268f07
      Josh Rosen authored
      This patch adds an analyzer rule to ensure that set operations (union, intersect, and except) are only applied to tables with the same number of columns. Without this rule, there are scenarios where invalid queries can return incorrect results instead of failing with error messages; SPARK-9813 provides one example of this problem. In other cases, the invalid query can crash at runtime with extremely confusing exceptions.
      
      I also performed a bit of cleanup to refactor some of those logical operators' code into a common `SetOperation` base class.
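
The shape of the new check, sketched on a self-contained toy plan type rather than Catalyst's actual classes:

```
// Toy logical plan illustrating the rule: set operations must combine
// children with equal numbers of output columns.
sealed trait Plan { def outputColumns: Seq[String] }
case class Table(outputColumns: Seq[String]) extends Plan
case class SetOp(name: String, left: Plan, right: Plan) extends Plan {
  def outputColumns: Seq[String] = left.outputColumns
}

def checkAnalysis(plan: Plan): Unit = plan match {
  case SetOp(name, left, right)
      if left.outputColumns.length != right.outputColumns.length =>
    sys.error(s"$name can only be performed on tables with the same number " +
      s"of columns: ${left.outputColumns.length} vs ${right.outputColumns.length}")
  case SetOp(_, left, right) =>
    checkAnalysis(left)
    checkAnalysis(right)
  case _ =>  // leaf plans always pass
}

// Fails analysis instead of silently returning wrong results:
// checkAnalysis(SetOp("UNION", Table(Seq("a", "b")), Table(Seq("a", "b", "c"))))
```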
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7631 from JoshRosen/SPARK-9293.
• [SPARK-10136] [SQL] A more robust fix for SPARK-10136 · bf03fe68
      Cheng Lian authored
PR #8341 is a valid fix for SPARK-10136, but it didn't catch the real root cause. The real problem can be rather tricky to explain, and requires the audience to be pretty familiar with the parquet-format spec, especially the details of the `LIST` backwards-compatibility rules. Let me try to give an explanation here.
      
      The structure of the problematic Parquet schema generated by parquet-avro is something like this:
      
      ```
      message m {
        <repetition> group f (LIST) {         // Level 1
          repeated group array (LIST) {       // Level 2
            repeated <primitive-type> array;  // Level 3
          }
        }
      }
      ```
      
      (The schema generated by parquet-thrift is structurally similar, just replace the `array` at level 2 with `f_tuple`, and the other one at level 3 with `f_tuple_tuple`.)
      
      This structure consists of two nested legacy 2-level `LIST`-like structures:
      
      1. The repeated group type at level 2 is the element type of the outer array defined at level 1
      
This group should map to a `CatalystArrayConverter.ElementConverter` when building converters.
      
      2. The repeated primitive type at level 3 is the element type of the inner array defined at level 2
      
This type should also map to a `CatalystArrayConverter.ElementConverter`.
      
The root cause of SPARK-10136 is that the group at level 2 isn't properly recognized as the element type of level 1. Thus, according to the parquet-format spec, the repeated primitive at level 3 is left as a so-called "unannotated repeated primitive type", and is recognized as a required list of required primitive elements, so a `RepeatedPrimitiveConverter` instead of a `CatalystArrayConverter.ElementConverter` is created for it.
      
According to the parquet-format spec, an unannotated repeated type shouldn't appear in a `LIST`- or `MAP`-annotated group. PR #8341 fixed this issue by allowing such unannotated repeated types to appear in `LIST`-annotated groups, which is a non-standard, hacky, but valid fix. (I didn't realize this when authoring #8341 though.)
      
      As for the reason why level 2 isn't recognized as a list element type, it's because of the following `LIST` backwards-compatibility rule defined in the parquet-format spec:
      
      > If the repeated field is a group with one field and is named either `array` or uses the `LIST`-annotated group's name with `_tuple` appended then the repeated type is the element type and elements are required.
      
      (The `array` part is for parquet-avro compatibility, while the `_tuple` part is for parquet-thrift.)
      
      This rule is implemented in [`CatalystSchemaConverter.isElementType`] [1], but neglected in [`CatalystRowConverter.isElementType`] [2].  This PR delivers a more robust fix by adding this rule in the latter method.
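
A simplified sketch of that rule as a predicate (the real method also consults the expected Catalyst type and handles more cases):

```
import org.apache.parquet.schema.{GroupType, PrimitiveType, Type}

// The repeated field itself is the element type when it is a primitive, a
// multi-field group, or a single-field group named "array" (parquet-avro)
// or "<parent>_tuple" (parquet-thrift); otherwise its single field is the
// element type.
def isElementType(repeatedType: Type, parentName: String): Boolean =
  repeatedType match {
    case _: PrimitiveType => true
    case t: GroupType if t.getFieldCount > 1 => true
    case t: GroupType if t.getFieldCount == 1 && t.getName == "array" => true
    case t: GroupType if t.getFieldCount == 1 && t.getName == parentName + "_tuple" => true
    case _ => false
  }
```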
      
      Note that parquet-avro 1.7.0 also suffers from this issue. Details can be found at [PARQUET-364] [3].
      
      [1]: https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L259-L305
      [2]: https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L456-L463
      [3]: https://issues.apache.org/jira/browse/PARQUET-364
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #8361 from liancheng/spark-10136/proper-version.
• [SPARK-10196] [SQL] Correctly saving decimals in internal rows to JSON. · df7041d0
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-10196
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #8408 from yhuai/DecimalJsonSPARK-10196.
• [SPARK-10137] [STREAMING] Avoid to restart receivers if scheduleReceivers returns balanced results · f023aa2f
      zsxwing authored
      This PR fixes the following cases for `ReceiverSchedulingPolicy`.
      
      1) Assume there are 4 executors: host1, host2, host3, host4, and 5 receivers: r1, r2, r3, r4, r5. Then `ReceiverSchedulingPolicy.scheduleReceivers` will return (r1 -> host1, r2 -> host2, r3 -> host3, r4 -> host4, r5 -> host1).
Let's assume r1 starts first on `host1` as `scheduleReceivers` suggested, and tries to register with ReceiverTracker. But the previous `ReceiverSchedulingPolicy.rescheduleReceiver` would return (host2, host3, host4) according to the current executor weights (host1 -> 1.0, host2 -> 0.5, host3 -> 0.5, host4 -> 0.5), so ReceiverTracker would reject `r1`. This is unexpected, since r1 is starting exactly where `scheduleReceivers` suggested.
      
      This case can be fixed by ignoring the information of the receiver that is rescheduling in `receiverTrackingInfoMap`.
      
2) Assume there are 3 executors (host1, host2, host3), each with 3 cores, and 3 receivers: r1, r2, r3. Assume r1 is running on host1. Now r2 is restarting, and the previous `ReceiverSchedulingPolicy.rescheduleReceiver` will always return (host1, host2, host3). So it's possible that r2 will be scheduled to host1 by TaskScheduler; r3 is similar. In the end, it's possible that all 3 receivers are running on host1 while host2 and host3 are idle.
      
This issue can be fixed by returning only the executors that have the minimum weight, rather than always returning at least 3 executors.
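
A sketch of that second fix in isolation (hypothetical helper, not the actual policy code):

```
// Return only the executors carrying the minimum weight, instead of
// padding the candidate list to at least 3 executors.
def leastLoadedExecutors(weights: Map[String, Double]): Seq[String] =
  if (weights.isEmpty) Seq.empty
  else {
    val minWeight = weights.values.min
    weights.collect { case (host, w) if w == minWeight => host }.toSeq
  }

// Case 2 above: r1 already runs on host1, so only the idle executors
// are candidates for r2.
println(leastLoadedExecutors(Map("host1" -> 1.0, "host2" -> 0.0, "host3" -> 0.0)))
// List(host2, host3)
```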
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #8340 from zsxwing/fix-receiver-scheduling.