  1. Sep 02, 2015
    • [SPARK-10389] [SQL] support order by non-attribute grouping expression on Aggregate · fc483077
      Wenchen Fan authored
      For example, we can write `SELECT MAX(value) FROM src GROUP BY key + 1 ORDER BY key + 1` in PostgreSQL, and we should support this in Spark SQL.
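      For illustration, the same query through the SQL API (a hedged sketch assuming a `SQLContext` named `sqlContext` and a registered table `src` with columns `key` and `value`):

      ```python
      # GROUP BY and ORDER BY share the non-attribute grouping expression `key + 1`.
      sqlContext.sql("""
          SELECT MAX(value)
          FROM src
          GROUP BY key + 1
          ORDER BY key + 1
      """).show()
      ```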
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #8548 from cloud-fan/support-order-by-non-attribute.
    • [SPARK-10034] [SQL] add regression test for Sort on Aggregate · 56c4c172
      Wenchen Fan authored
      Before #8371, `Sort` on `Aggregate` had a bug: we couldn't use an aggregate expression named `_aggOrdering`, and we couldn't use more than one ordering expression containing aggregate functions. The reason is that the aggregate expression in `SortOrder` never got resolved; we aliased it with `_aggOrdering` and called `toAttribute`, which gave us an `UnresolvedAttribute`. So we were actually referencing the aggregate expression by name, not by exprId as we thought. If there was already an aggregate expression named `_aggOrdering`, or more than one ordering expression contained aggregate functions, the names conflicted and the lookup by name failed.
      
      However, after #8371 was merged, the `SortOrder`s are guaranteed to be resolved and we always reference aggregate expressions by exprId. The bug no longer exists, and this PR adds regression tests for it.
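      A hedged sketch of the query shape these regression tests cover (assuming a `SQLContext` named `sqlContext` and a table `src` with columns `key` and `value`):

      ```python
      # Sort an aggregate by several ordering expressions that themselves contain
      # aggregate functions; this used to trip over the `_aggOrdering` aliasing.
      sqlContext.sql("""
          SELECT key
          FROM src
          GROUP BY key
          ORDER BY MAX(value), MIN(value)
      """).show()
      ```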
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #8231 from cloud-fan/sort-agg.
    • [SPARK-7336] [HISTORYSERVER] Fix bug where application status is incorrect on the JobHistory UI. · c3b881a7
      Chuan Shao authored
      Author: ArcherShao <shaochuan@huawei.com>
      
      Closes #5886 from ArcherShao/SPARK-7336.
  2. Sep 01, 2015
    • [SPARK-10392] [SQL] Pyspark - Wrong DateType support on JDBC connection · 00d9af5e
      0x0FFF authored
      This PR addresses issue [SPARK-10392](https://issues.apache.org/jira/browse/SPARK-10392)
      The problem is that for the "start of epoch" date (01 Jan 1970), the PySpark `DateType` class returns 0 instead of a `datetime.date`, due to the implementation of its return statement
      
      Issue reproduction on master:
      ```
      >>> from pyspark.sql.types import *
      >>> a = DateType()
      >>> a.fromInternal(0)
      0
      >>> a.fromInternal(1)
      datetime.date(1970, 1, 2)
      ```
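      For reference, a hedged Python reconstruction of that truthiness bug (assuming the `EPOCH_ORDINAL`-based conversion in `pyspark.sql.types`; not necessarily the exact patch):

      ```python
      import datetime

      EPOCH_ORDINAL = datetime.date(1970, 1, 1).toordinal()

      def from_internal_buggy(v):
          # `v and ...` short-circuits when v == 0, so day 0 comes back as 0.
          return v and datetime.date.fromordinal(v + EPOCH_ORDINAL)

      def from_internal_fixed(v):
          # An explicit None check keeps 1970-01-01 (internal value 0) working.
          if v is not None:
              return datetime.date.fromordinal(v + EPOCH_ORDINAL)

      print(from_internal_buggy(0))   # 0
      print(from_internal_fixed(0))   # 1970-01-01
      ```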
      
      Author: 0x0FFF <programmerag@gmail.com>
      
      Closes #8556 from 0x0FFF/SPARK-10392.
    • [SPARK-10162] [SQL] Fix the timezone omitting for PySpark Dataframe filter function · bf550a4b
      0x0FFF authored
      This PR addresses [SPARK-10162](https://issues.apache.org/jira/browse/SPARK-10162)
      The issue is with the DataFrame `filter()` function when a `datetime.datetime` is passed to it:
      * The timezone information of the datetime is ignored
      * The datetime is assumed to be in the local timezone, which depends on the OS timezone setting

      The fix includes both a code change and a regression test. Problem reproduction code on master:
      ```python
      import pytz
      from datetime import datetime
      from pyspark.sql import *
      from pyspark.sql.types import *
      sqc = SQLContext(sc)
      df = sqc.createDataFrame([], StructType([StructField("dt", TimestampType())]))
      
      m1 = pytz.timezone('UTC')
      m2 = pytz.timezone('Etc/GMT+3')
      
      df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
      df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
      ```
      It gives the same timestamp, ignoring the time zone:
      ```
      >>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
      Filter (dt#0 > 946713600000000)
       Scan PhysicalRDD[dt#0]
      
      >>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
      Filter (dt#0 > 946713600000000)
       Scan PhysicalRDD[dt#0]
      ```
      After the fix:
      ```
      >>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
      Filter (dt#0 > 946684800000000)
       Scan PhysicalRDD[dt#0]
      
      >>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
      Filter (dt#0 > 946695600000000)
       Scan PhysicalRDD[dt#0]
      ```
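      The corrected numbers correspond to normalizing a timezone-aware datetime to UTC before computing microseconds since the epoch; a hedged standalone sketch of that conversion (not the patch itself):

      ```python
      import calendar
      import time
      from datetime import datetime

      import pytz

      def to_internal_microseconds(dt):
          # Respect an attached tzinfo; otherwise fall back to the OS-local timezone.
          if dt.tzinfo is not None:
              seconds = calendar.timegm(dt.utctimetuple())
          else:
              seconds = time.mktime(dt.timetuple())
          return int(seconds) * 1000000 + dt.microsecond

      print(to_internal_microseconds(datetime(2000, 1, 1, tzinfo=pytz.timezone('UTC'))))        # 946684800000000
      print(to_internal_microseconds(datetime(2000, 1, 1, tzinfo=pytz.timezone('Etc/GMT+3'))))  # 946695600000000
      ```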
      PR [8536](https://github.com/apache/spark/pull/8536) was accidentally closed by me when dropping the repo
      
      Author: 0x0FFF <programmerag@gmail.com>
      
      Closes #8555 from 0x0FFF/SPARK-10162.
    • [SPARK-4223] [CORE] Support * in acls. · ec012805
      zhuol authored
      SPARK-4223.
      
      Currently we support setting view and modify ACLs, but you have to specify a list of users. It would be nice to support `*`, meaning all users have access.
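      As an illustration, a hedged PySpark configuration sketch using the standard ACL properties (how one might grant everyone access once this is in; not taken from the patch):

      ```python
      from pyspark import SparkConf

      # With this change, "*" should grant every user view and modify access.
      conf = (SparkConf()
              .set("spark.acls.enable", "true")
              .set("spark.ui.view.acls", "*")
              .set("spark.modify.acls", "*"))
      ```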
      
      Manual tests verify that "*" works for any user in:
      a. Spark UI: view and kill stage. Done.
      b. Spark history server. Done.
      c. YARN application killing. Done.
      
      Author: zhuol <zhuol@yahoo-inc.com>
      
      Closes #8398 from zhuoliu/4223.
    • [SPARK-10398] [DOCS] Migrate Spark download page to use new lua mirroring scripts · 3f63bd60
      Sean Owen authored
      Migrate Apache download closer.cgi refs to new closer.lua
      
      This is the bit of the change that affects the project docs; I'm implementing the changes to the Apache site separately.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #8557 from srowen/SPARK-10398.
    • [SPARK-9679] [ML] [PYSPARK] Add Python API for Stop Words Remover · e6e483cc
      Holden Karau authored
      Add a Python API for the Stop Words Remover.
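      A hedged usage sketch of the new Python API (assuming a `SQLContext` named `sqlContext` is available):

      ```python
      from pyspark.ml.feature import StopWordsRemover

      # Remove English stop words from an array-of-strings column.
      df = sqlContext.createDataFrame([(["a", "quick", "brown", "fox"],)], ["raw"])
      remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
      remover.transform(df).show()
      ```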
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #8118 from holdenk/SPARK-9679-python-StopWordsRemover.
    • [SPARK-10301] [SQL] Fixes schema merging for nested structs · 391e6be0
      Cheng Lian authored
      This PR can be quite challenging to review.  I'm trying to give a detailed description of the problem as well as its solution here.
      
      When reading Parquet files, we need to specify a potentially nested Parquet schema (of type `MessageType`) as requested schema for column pruning.  This Parquet schema is translated from a Catalyst schema (of type `StructType`), which is generated by the query planner and represents all requested columns.  However, this translation can be fairly complicated because of several reasons:
      
      1.  Requested schema must conform to the real schema of the physical file to be read.
      
          This means we have to tailor the actual file schema of every individual physical Parquet file to be read according to the given Catalyst schema.  Fortunately we are already doing this in Spark 1.5 by pushing request schema conversion to executor side in PR #7231.
      
      1.  Support for schema merging.
      
          A single Parquet dataset may consist of multiple physical Parquet files with different but compatible schemas.  This means we may request a column path that doesn't exist in a physical Parquet file.  All requested column paths can be nested.  For example, for a Parquet file schema
      
          ```
          message root {
            required group f0 {
              required group f00 {
                required int32 f000;
                required binary f001 (UTF8);
              }
            }
          }
          ```
      
          we may request column paths defined in the following schema:
      
          ```
          message root {
            required group f0 {
              required group f00 {
                required binary f001 (UTF8);
                required float f002;
              }
            }
      
            optional double f1;
          }
          ```
      
          Notice that we pruned column path `f0.f00.f000`, but added `f0.f00.f002` and `f1`.
      
          The good news is that Parquet handles non-existing column paths properly and always returns null for them.
      
      1.  The map from `StructType` to `MessageType` is a one-to-many map.
      
          This is the most unfortunate part.
      
          Due to historical reasons (dark histories!), schemas of Parquet files generated by different libraries have different "flavors".  For example, to handle a schema with a single non-nullable column, whose type is an array of non-nullable integers, parquet-protobuf generates the following Parquet schema:
      
          ```
          message m0 {
            repeated int32 f;
          }
          ```
      
          while parquet-avro generates another version:
      
          ```
          message m1 {
            required group f (LIST) {
              repeated int32 array;
            }
          }
          ```
      
          and parquet-thrift spills this:
      
          ```
          message m2 {
            required group f (LIST) {
              repeated int32 f_tuple;
            }
          }
          ```
      
          All of them can be mapped to the following _unique_ Catalyst schema:
      
          ```
          StructType(
            StructField(
              "f",
              ArrayType(IntegerType, containsNull = false),
              nullable = false))
          ```
      
          This greatly complicates Parquet requested schema construction, since the path of a given column varies in different cases.  To read the array elements from files with the above schemas, we must use `f` for `m0`, `f.array` for `m1`, and `f.f_tuple` for `m2`.
      
      In earlier Spark versions, we didn't try to fix this issue properly.  Spark 1.4 and prior versions simply translate the Catalyst schema in a way more or less compatible with parquet-hive and parquet-avro, but this is broken in many other cases.  Earlier revisions of Spark 1.5 only try to tailor the Parquet file schema at the first level and ignore nested ones.  This caused [SPARK-10301] [spark-10301] as well as [SPARK-10005] [spark-10005].  In PR #8228, I tried to avoid the hard part of the problem and made a minimal change in `CatalystRowConverter` to fix SPARK-10005.  However, when taking SPARK-10301 into consideration, continuing to hack `CatalystRowConverter` doesn't seem to be a good idea.  So this PR is an attempt to fix the problem in a proper way.
      
      For a given physical Parquet file with schema `ps` and a compatible Catalyst requested schema `cs`, we use the following algorithm to tailor `ps` to get the result Parquet requested schema `ps'`:
      
      For a leaf column path `c` in `cs`:
      
      - if `c` exists in `cs` and a corresponding Parquet column path `c'` can be found in `ps`, `c'` should be included in `ps'`;
      - otherwise, we convert `c` to a Parquet column path `c"` using `CatalystSchemaConverter`, and include `c"` in `ps'`;
      - no other column paths should exist in `ps'`.
      
      Then comes the most tedious part:
      
      > Given `cs`, `ps`, and `c`, how to locate `c'` in `ps`?
      
      Unfortunately, there's no quick answer, and we have to enumerate all possible structures defined in the parquet-format spec.  They are:
      
      1.  the standard structure of nested types, and
      1.  cases defined in all backwards-compatibility rules for `LIST` and `MAP`.
      
      The core part of this PR is `CatalystReadSupport.clipParquetType()`, which tailors a given Parquet file schema according to a requested schema in its Catalyst form.  Backwards-compatibility rules of `LIST` and `MAP` are covered in `clipParquetListType()` and `clipParquetMapType()` respectively.  The column path selection algorithm is implemented in `clipParquetGroupFields()`.
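      A hedged, heavily simplified Python sketch of that column path selection (schemas modeled as nested dicts of field name to type, with plain strings for leaves; the real code operates on Parquet `MessageType` / Catalyst `StructType` and also handles the `LIST`/`MAP` backwards-compatibility rules):

      ```python
      def clip_parquet_group(file_schema, requested_schema, convert_catalyst):
          """Tailor a file schema to a requested schema, keeping physical types."""
          clipped = {}
          for name, requested_type in requested_schema.items():
              file_type = file_schema.get(name)
              if isinstance(requested_type, dict) and isinstance(file_type, dict):
                  # Nested struct present in both schemas: clip it recursively.
                  clipped[name] = clip_parquet_group(file_type, requested_type, convert_catalyst)
              elif file_type is not None:
                  # Column path found in the file: keep the file's physical type.
                  clipped[name] = file_type
              else:
                  # Missing column path: synthesize it from the Catalyst type so the
                  # reader can return nulls for it.
                  clipped[name] = convert_catalyst(requested_type)
          return clipped
      ```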
      
      With this PR, we no longer need to do schema tailoring in `CatalystReadSupport` and `CatalystRowConverter`.  Another benefit is that we can now also read Parquet datasets consisting of files with different physical Parquet schemas but the same logical schema, for example, files generated by different Parquet libraries.  This situation is illustrated by [this test case] [test-case].
      
      [spark-10301]: https://issues.apache.org/jira/browse/SPARK-10301
      [spark-10005]: https://issues.apache.org/jira/browse/SPARK-10005
      [test-case]: https://github.com/liancheng/spark/commit/38644d8a45175cbdf20d2ace021c2c2544a50ab3#diff-a9b98e28ce3ae30641829dffd1173be2R26
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #8509 from liancheng/spark-10301/fix-parquet-requested-schema.
  3. Aug 31, 2015
  4. Aug 30, 2015
    • SPARK-9545, SPARK-9547: Use Maven in PRB if title contains "[test-maven]" · 35e896a7
      Patrick Wendell authored
      This is just some small glue code to actually make use of the
      AMPLAB_JENKINS_BUILD_TOOL switch. As far as I can tell, we actually
      don't currently use the Maven support in the tool even though it exists.
      This patch switches to Maven when the PR title contains "test-maven".
      
      There are a few small other pieces of cleanup in the patch as well.
      
      Author: Patrick Wendell <patrick@databricks.com>
      
      Closes #7878 from pwendell/maven-tests.
    • [SPARK-10353] [MLLIB] BLAS gemm not scaling when beta = 0.0 for some subset of... · 8d2ab75d
      Burak Yavuz authored
      [SPARK-10353] [MLLIB] BLAS gemm not scaling when beta = 0.0 for some subset of matrix multiplications
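      For context, a hedged NumPy sketch of the gemm contract involved, C := alpha*A*B + beta*C: when beta == 0.0 the existing contents of C must be overwritten rather than scaled or accumulated into.

      ```python
      import numpy as np

      def gemm(alpha, A, B, beta, C):
          if beta == 0.0:
              C[:] = alpha * (A @ B)               # ignore whatever was in C before
          else:
              C[:] = alpha * (A @ B) + beta * C
          return C

      A = np.ones((2, 2)); B = np.ones((2, 2))
      C = np.full((2, 2), 99.0)                    # stale values that must not leak through
      print(gemm(1.0, A, B, 0.0, C))               # [[2. 2.] [2. 2.]]
      ```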
      
      mengxr jkbradley rxin
      
      It would be great if this fix made it into RC3!
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #8525 from brkyvz/blas-scaling.
    • [SPARK-10184] [CORE] Optimization for bounds determination in RangePartitioner · 1bfd9347
      ihainan authored
      JIRA Issue: https://issues.apache.org/jira/browse/SPARK-10184
      
      Change `cumWeight > target` to `cumWeight >= target` in `RangePartitioner.determineBounds` method to make the output partitions more balanced.
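      A hedged Python sketch of the idea, not the Scala source: bounds are emitted as the cumulative weight reaches each step, and `>=` lets an exact hit count as reaching the step, which keeps partitions more balanced than the strict `>`.

      ```python
      def determine_bounds(candidates, partitions):
          """candidates: sorted list of (key, weight); returns partition upper bounds."""
          total = float(sum(w for _, w in candidates))
          step = total / partitions
          bounds, cum, target = [], 0.0, step
          for key, weight in candidates:
              cum += weight
              if cum >= target and len(bounds) < partitions - 1:  # was: cum > target
                  bounds.append(key)
                  target += step
          return bounds

      # Four equally weighted keys into two partitions: the bound lands after the
      # second key, giving {1, 2} and {3, 4}; with ">" it would land after the third.
      print(determine_bounds([(1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0)], 2))  # [2]
      ```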
      
      Author: ihainan <ihainan72@gmail.com>
      
      Closes #8397 from ihainan/opt_for_rangepartitioner.
    • [SPARK-10331] [MLLIB] Update example code in ml-guide · ca69fc8e
      Xiangrui Meng authored
      * The example code was added in 1.2, before `createDataFrame`. This PR switches to `createDataFrame`. Java code still uses JavaBean.
      * assume `sqlContext` is available
      * fix some minor issues from previous code review
      
      jkbradley srowen feynmanliang
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #8518 from mengxr/SPARK-10331.
    • [SPARK-10348] [MLLIB] updates ml-guide · 905fbe49
      Xiangrui Meng authored
      * replace `ML Dataset` by `DataFrame` to unify the abstraction
      * ML algorithms -> pipeline components to describe the main concept
      * remove Scala API doc links from the main guide
      * `Section Title` -> `Section title` to be consistent with other section titles in the MLlib guide
      * modified lines to break at 100 chars or at periods
      
      jkbradley feynmanliang
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #8517 from mengxr/SPARK-10348.
  5. Aug 29, 2015
  6. Aug 28, 2015
    • [SPARK-9910] [ML] User guide for train validation split · e8ea5baf
      martinzapletal authored
      Author: martinzapletal <zapletal-martin@email.cz>
      
      Closes #8377 from zapletal-martin/SPARK-9910.
    • [SPARK-9803] [SPARKR] Add subset and transform + tests · 2a4e00ca
      felixcheung authored
      Add subset and transform
      Also reorganize `[` & `[[` to subset instead of select
      
      Note: transform is very similar to mutate. Spark doesn't seem to replace an existing column with the same name in mutate (i.e. `mutate(df, age = df$age + 2)` returns a DataFrame with 2 columns named 'age'), so transform doesn't do that for now either.
      Though it is clearly stated that it should replace a column with a matching name (should I open a JIRA for mutate/transform?)
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #8503 from felixcheung/rsubset_transform.
    • [SPARK-10323] [SQL] fix nullability of In/InSet/ArrayContain · bb7f3523
      Davies Liu authored
      After this PR, In/InSet/ArrayContains will return null when the value is null, instead of false. They will also return null when there is a null in the set/array.
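      A hedged sketch of the resulting three-valued-logic behavior, exercised through Spark SQL (assumes a `SQLContext` named `sqlContext`; the expected results reflect the semantics described above):

      ```python
      sqlContext.sql("SELECT CAST(NULL AS INT) IN (1, 2)").show()   # expected: null (was false)
      sqlContext.sql("SELECT 3 IN (1, 2, NULL)").show()             # expected: null (can't prove absence)
      sqlContext.sql("SELECT 1 IN (1, 2, NULL)").show()             # expected: true
      ```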
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8492 from davies/fix_in.
    • [SPARK-9671] [MLLIB] re-org user guide and add migration guide · 88032eca
      Xiangrui Meng authored
      This PR updates the MLlib user guide and adds migration guide for 1.4->1.5.
      
      * merge migration guide for `spark.mllib` and `spark.ml` packages
      * remove dependency section from `spark.ml` guide
      * move the paragraph about `spark.mllib` and `spark.ml` to the top and recommend `spark.ml`
      * move Sam's talk to a footnote to keep the section focused on dependencies
      
      Minor changes to code examples and other wording will be in a separate PR.
      
      jkbradley srowen feynmanliang
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #8498 from mengxr/SPARK-9671.
    • [SPARK-10336][example] fix not being able to set intercept in LR example · 45723214
      Shuo Xiang authored
      `fitIntercept` is a command line option but not set in the main program.
      
      dbtsai
      
      Author: Shuo Xiang <sxiang@pinterest.com>
      
      Closes #8510 from coderxiang/intercept and squashes the following commits:
      
      57c9b7d [Shuo Xiang] fix not being able to set intercept in LR example
    • [SPARK-9284] [TESTS] Allow all tests to run without an assembly. · c53c902f
      Marcelo Vanzin authored
      This change aims at speeding up the dev cycle a little bit, by making
      sure that all tests behave the same w.r.t. where the code to be tested
      is loaded from. Namely, that means that tests don't rely on the assembly
      anymore, rather loading all needed classes from the build directories.
      
      The main change is to make sure all build directories (classes and test-classes)
      are added to the classpath of child processes when running tests.
      
      YarnClusterSuite required some custom code since the executors are run
      differently (i.e. not through the launcher library, like standalone and
      Mesos do).
      
      I also found a couple of tests that could leak a SparkContext on failure,
      and added code to handle those.
      
      With this patch, it's possible to run the following command from a clean
      source directory and have all tests pass:
      
        mvn -Pyarn -Phadoop-2.4 -Phive-thriftserver install
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #7629 from vanzin/SPARK-9284.
    • [SPARK-10325] Override hashCode() for public Row · d3f87dc3
      Josh Rosen authored
      This commit fixes an issue where the public SQL `Row` class did not override `hashCode`, causing it to violate the hashCode() + equals() contract. To fix this, I simply ported the `hashCode` implementation from the 1.4.x version of `Row`.
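      As a hedged, language-neutral illustration of that contract (a Python analogy, not the Scala `Row` code): values that compare equal must hash equal, or hash-based lookups silently fail.

      ```python
      class MyRow:
          def __init__(self, *values):
              self.values = tuple(values)

          def __eq__(self, other):
              return isinstance(other, MyRow) and self.values == other.values

          def __hash__(self):
              # Hash the same data __eq__ compares, keeping the two methods consistent.
              return hash(self.values)

      assert MyRow(1, "a") == MyRow(1, "a")
      assert hash(MyRow(1, "a")) == hash(MyRow(1, "a"))
      assert MyRow(1, "a") in {MyRow(1, "a")}   # only works when the contract holds
      ```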
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #8500 from JoshRosen/SPARK-10325 and squashes the following commits:
      
      51ffea1 [Josh Rosen] Override hashCode() for public Row.
    • [SPARK-8952] [SPARKR] - Wrap normalizePath calls with suppressWarnings · 499e8e15
      Luciano Resende authored
      This is based on davies' comment on SPARK-8952, which suggests only calling normalizePath() when the path starts with '~'
      
      Author: Luciano Resende <lresende@apache.org>
      
      Closes #8343 from lresende/SPARK-8952.