  1. Apr 05, 2016
    • Dongjoon Hyun's avatar
      [SPARK-14402][SQL] initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string · c59abad0
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
Currently, Spark SQL's `initcap` uses the `toTitleCase` function. However, the `UTF8String.toTitleCase` implementation changes only the first letter and copies the remaining letters unchanged, e.g. sParK --> SParK. That is correct behavior for `toTitleCase` itself, but it does not match Hive/Oracle `initcap`, which also lowercases the rest of the string. Hive returns:
      ```
      hive> select initcap('sParK');
      Spark
```
whereas Spark SQL currently returns:
```
      scala> sql("select initcap('sParK')").head
      res0: org.apache.spark.sql.Row = [SParK]
      ```
      
This PR updates the implementation of `initcap` to apply `toLowerCase` followed by `toTitleCase`.
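
For illustration, here is a minimal hedged sketch of the intended semantics on plain JVM strings (the real change operates on `UTF8String`, not `java.lang.String`):

```scala
// Hedged sketch of initcap semantics: lower-case everything, then
// capitalize the first letter of each whitespace-separated word.
def initcap(s: String): String =
  s.toLowerCase.split(" ", -1).map { w =>
    if (w.isEmpty) w else w.substring(0, 1).toUpperCase + w.substring(1)
  }.mkString(" ")

initcap("sParK")   // "Spark", matching Hive/Oracle
```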
      
      ## How was this patch tested?
      
Pass the Jenkins tests (including a new test case).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12175 from dongjoon-hyun/SPARK-14402.
      c59abad0
    • Burak Yavuz's avatar
      [SPARK-14353] Dataset Time Window `window` API for Python, and SQL · 9ee5c257
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      The `window` function was added to Dataset with [this PR](https://github.com/apache/spark/pull/12008).
      This PR adds the Python, and SQL, API for this function.
      
With this PR, SQL, Java, and Scala share the same APIs, so users can use:
       - `window(timeColumn, windowDuration)`
       - `window(timeColumn, windowDuration, slideDuration)`
       - `window(timeColumn, windowDuration, slideDuration, startTime)`
      
In Python, users can access all of the APIs above, and in addition they can call:
 - `window(timeColumn, windowDuration, startTime=...)`
      
that is, they can provide `startTime` without providing `slideDuration`. In this case, tumbling windows are generated.
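
As an illustration of the shared API from the Scala side (a hedged sketch; `events` is a hypothetical DataFrame with a timestamp column "time"):

```scala
import org.apache.spark.sql.functions.window

// Sliding windows: 10-minute windows, sliding every 5 minutes.
val sliding = events
  .groupBy(window(events("time"), "10 minutes", "5 minutes"))
  .count()

// Tumbling windows offset by a startTime: in Scala/Java/SQL the slide must
// be spelled out; the Python shortcut above lets users omit it.
val tumbling = events
  .groupBy(window(events("time"), "10 minutes", "10 minutes", "1 minute"))
  .count()
```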
      
      ## How was this patch tested?
      
      Unit tests + manual tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #12136 from brkyvz/python-windows.
      9ee5c257
    • Yin Huai's avatar
      [SPARK-14123][SPARK-14384][SQL] Handle CreateFunction/DropFunction · 72544d6f
      Yin Huai authored
      ## What changes were proposed in this pull request?
This PR implements the CreateFunction and DropFunction commands. Besides implementing these two commands, it also changes how functions are managed. The main changes are:
* `FunctionRegistry` becomes a container that stores all function builders; it does not actively load any functions. Because of this change, we no longer need to maintain a separate registry for HiveContext, so `HiveFunctionRegistry` is deleted.
* `SessionCatalog` takes care of loading a function when the function is not in the `FunctionRegistry` but its metadata is stored in the external catalog. In that case, SessionCatalog will (1) load the metadata from the external catalog, (2) load all needed resources (i.e. jars and files), (3) create a function builder based on the function definition, and (4) register the builder in the `FunctionRegistry` (see the usage sketch after this list).
* An `UnresolvedGenerator` is introduced, so the parser no longer needs to call `FunctionRegistry` directly during parsing, which is not a good time to create a Hive UDTF. `UnresolvedGenerator` is resolved in the analysis phase.
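
A hedged usage sketch of the lazy-loading flow above (the UDF class name and jar path are placeholders, assuming a spark-shell style `sqlContext`):

```scala
// A permanent function: its metadata is stored in the external catalog,
// and SessionCatalog loads resources and registers the builder lazily.
sqlContext.sql(
  "CREATE FUNCTION db1.my_upper AS 'com.example.MyUpperUDF' USING JAR '/tmp/my-udfs.jar'")

// First use triggers resource loading and builder registration.
sqlContext.sql("SELECT db1.my_upper('spark')").show()

sqlContext.sql("DROP FUNCTION db1.my_upper")
```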
      
      This PR is based on viirya's https://github.com/apache/spark/pull/12036/
      
      ## How was this patch tested?
      Existing tests and new tests.
      
      ## TODOs
      [x] Self-review
      [x] Cleanup
[x] More tests for create/drop functions (we need more tests for permanent functions).
      [ ] File JIRAs for all TODOs
      [x] Standardize the error message when a function does not exist.
      
      Author: Yin Huai <yhuai@databricks.com>
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #12117 from yhuai/function.
      72544d6f
    • Devaraj K's avatar
      [SPARK-13063][YARN] Make the SPARK YARN STAGING DIR as configurable · bc36df12
      Devaraj K authored
      ## What changes were proposed in this pull request?
Makes the Spark YARN staging directory configurable via the configuration key 'spark.yarn.staging-dir'.
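
A hedged sketch of setting the proposed key (the key name is the one proposed in this commit; the HDFS path is a placeholder):

```scala
import org.apache.spark.SparkConf

// Sketch only: use the configured staging directory instead of the
// default (the user's home directory on the cluster file system).
val conf = new SparkConf()
  .setAppName("staging-dir-example")
  .set("spark.yarn.staging-dir", "hdfs:///user/shared/spark-staging")
```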
      
      ## How was this patch tested?
      
I have verified it manually by running applications on YARN. If 'spark.yarn.staging-dir' is configured, its value is used as the staging directory; otherwise the default value, the file system's home directory for the user, is used.
      
      Author: Devaraj K <devaraj@apache.org>
      
      Closes #12082 from devaraj-kavali/SPARK-13063.
      bc36df12
    • Shixiong Zhu's avatar
      [SPARK-14257][SQL] Allow multiple continuous queries to be started from the same DataFrame · 463bac00
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Make StreamingRelation store the closure to create the source in StreamExecution so that we can start multiple continuous queries from the same DataFrame.
      
      ## How was this patch tested?
      
      `test("DataFrame reuse")`
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #12049 from zsxwing/df-reuse.
      463bac00
    • Wenchen Fan's avatar
      [SPARK-14345][SQL] Decouple deserializer expression resolution from ObjectOperator · f77f11c6
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
This PR decouples deserializer expression resolution from `ObjectOperator`, so that we can use deserializer expressions in normal operators. This is needed by #12061 and #12067; I abstracted the logic out and put it in this PR to reduce code changes in the future.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12131 from cloud-fan/separate.
      f77f11c6
    • Kousuke Saruta's avatar
      [SPARK-14397][WEBUI] <html> and <body> tags are nested in LogPage · e4bd5041
      Kousuke Saruta authored
      ## What changes were proposed in this pull request?
      
      In `LogPage`, the content to be rendered is defined as follows.
      
      ```
          val content =
            <html>
              <body>
                {linkToMaster}
                <div>
                  <div style="float:left; margin-right:10px">{backButton}</div>
                  <div style="float:left;">{range}</div>
                  <div style="float:right; margin-left:10px">{nextButton}</div>
                </div>
                <br />
                <div style="height:500px; overflow:auto; padding:5px;">
                  <pre>{logText}</pre>
                </div>
              </body>
            </html>
          UIUtils.basicSparkPage(content, logType + " log page for " + pageName)
      ```
      
      As you can see, <html> and <body> tags will be rendered.
      
On the other hand, `UIUtils.basicSparkPage` also renders those tags, so the tags end up nested.
      
      ```
        def basicSparkPage(
            content: => Seq[Node],
            title: String,
            useDataTables: Boolean = false): Seq[Node] = {
          <html>
            <head>
              {commonHeaderNodes}
              {if (useDataTables) dataTablesHeaderNodes else Seq.empty}
              <title>{title}</title>
            </head>
            <body>
              <div class="container-fluid">
                <div class="row-fluid">
                  <div class="span12">
                    <h3 style="vertical-align: middle; display: inline-block;">
                      <a style="text-decoration: none" href={prependBaseUri("/")}>
                        <img src={prependBaseUri("/static/spark-logo-77x50px-hd.png")} />
                        <span class="version"
                              style="margin-right: 15px;">{org.apache.spark.SPARK_VERSION}</span>
                      </a>
                      {title}
                    </h3>
                  </div>
                </div>
                {content}
              </div>
            </body>
          </html>
        }
      ```
      
      These are the screen shots before this patch is applied.
      
      ![before1](https://cloud.githubusercontent.com/assets/4736016/14273236/03cbed8a-fb44-11e5-8786-bc1bfa4d3f8c.png)
      ![before2](https://cloud.githubusercontent.com/assets/4736016/14273237/03d1741c-fb44-11e5-9dee-ea93022033a6.png)
      
      And these are the ones after this patch is applied.
      
      ![after1](https://cloud.githubusercontent.com/assets/4736016/14273248/1b6a7d8a-fb44-11e5-8a3b-69964f3434f6.png)
      ![after2](https://cloud.githubusercontent.com/assets/4736016/14273249/1b6b9c38-fb44-11e5-9d6f-281d64c842e4.png)
      
The appearance is unchanged, but the HTML source no longer contains the nested tags.
      
      ## How was this patch tested?
      
Manually ran some jobs on my standalone cluster and checked the WebUI.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #12170 from sarutak/SPARK-14397.
      e4bd5041
    • Shally Sangal's avatar
      [SPARK-14284][ML] KMeansSummary deprecating size; adding clusterSizes · d3569015
      Shally Sangal authored
      ## What changes were proposed in this pull request?
      
KMeansSummary class: deprecates `size` and adds `clusterSizes`.
      
      Author: Shally Sangal <shallysangal@gmail.com>
      
      Closes #12084 from shallys/master.
      d3569015
    • gatorsmile's avatar
      [SPARK-14349][SQL] Issue Error Messages for Unsupported Operators/DML/DDL in SQL Context. · 78071736
      gatorsmile authored
      #### What changes were proposed in this pull request?
      
Currently, confusing error messages are issued when we use Hive Context-only operations in SQL Context.
      
      For example,
      - When calling `Drop Table` in SQL Context, we got the following message:
      ```
      Expected exception org.apache.spark.sql.catalyst.parser.ParseException to be thrown, but java.lang.ClassCastException was thrown.
      ```
      
      - When calling `Script Transform` in SQL Context, we got the message:
      ```
      assertion failed: No plan for ScriptTransformation [key#9,value#10], cat, [tKey#155,tValue#156], null
      +- LogicalRDD [key#9,value#10], MapPartitionsRDD[3] at beforeAll at BeforeAndAfterAll.scala:187
      ```
      
      Updates:
Based on the investigation from hvanhovell, the root cause is `visitChildren`, which is the default implementation; it always returns the result of the last defined context child. After merging the code changes from hvanhovell, it works. Thank you hvanhovell!
      
      #### How was this patch tested?
      A few test cases are added.
      
Not sure if the same issue exists for the other operators/DDL/DML. hvanhovell
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #12134 from gatorsmile/hiveParserCommand.
      78071736
    • Dilip Biswal's avatar
      [SPARK-14348][SQL] Support native execution of SHOW TBLPROPERTIES command · 2715bc68
      Dilip Biswal authored
      ## What changes were proposed in this pull request?
      
This PR adds native execution of the SHOW TBLPROPERTIES command.
      
      Command Syntax:
      ``` SQL
      SHOW TBLPROPERTIES table_name[(property_key_literal)]
      ```
      ## How was this patch tested?
      
Tests added in HiveCommandSuite and DDLCommandSuite.
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #12133 from dilipbiswal/dkb_show_tblproperties.
      2715bc68
    • Eric Liang's avatar
      [SPARK-14359] Create built-in functions for typed aggregates in Java · 06462301
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      This adds the corresponding Java static functions for built-in typed aggregates already exposed in Scala.
      
      ## How was this patch tested?
      
      Unit tests.
      
      rxin
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #12168 from ericl/sc-2794.
      06462301
  2. Apr 04, 2016
    • Yong Tang's avatar
      [SPARK-14368][PYSPARK] Support python.spark.worker.memory with upper-case unit. · 7db56244
      Yong Tang authored
      ## What changes were proposed in this pull request?
      
This fix addresses an issue in PySpark where `spark.python.worker.memory`
could only be configured with a lower-case unit (`k`, `m`, `g`, `t`). This fix
allows upper-case units (`K`, `M`, `G`, `T`) to be used as well, conforming to
the JVM memory string format as specified in the documentation.
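
The gist of the change, as a hedged Scala sketch of case-insensitive memory-string parsing (the actual fix lives in PySpark's worker-memory handling; this is illustrative only):

```scala
// Sketch: "512m" and "512M" should resolve to the same number of megabytes.
def memoryStringToMb(str: String): Int = {
  val lower = str.toLowerCase.trim
  if (lower.endsWith("k")) (lower.dropRight(1).toLong / 1024).toInt
  else if (lower.endsWith("m")) lower.dropRight(1).toInt
  else if (lower.endsWith("g")) lower.dropRight(1).toInt * 1024
  else if (lower.endsWith("t")) lower.dropRight(1).toInt * 1024 * 1024
  else (lower.toLong / 1024 / 1024).toInt  // bare byte count
}

memoryStringToMb("512M") == memoryStringToMb("512m")   // true
```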
      
      ## How was this patch tested?
      
      This fix adds additional test to cover the changes.
      
      Author: Yong Tang <yong.tang.github@outlook.com>
      
      Closes #12163 from yongtang/SPARK-14368.
      7db56244
    • Joseph K. Bradley's avatar
      [SPARK-14386][ML] Changed spark.ml ensemble trees methods to return concrete types · 8f50574a
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
In spark.ml, GBT and RandomForest expose the trait DecisionTreeModel through their trees method, but they should not, since it is a private trait (and not ready to be made public). It will also be more useful to users if we return the concrete types.

This PR returns the concrete types instead (see the sketch below).
      
      The MIMA checks appear to be OK with this change.
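
A hedged usage sketch of the proposed API (`trainingData` is a hypothetical DataFrame with "label" and "features" columns):

```scala
import org.apache.spark.ml.classification.{DecisionTreeClassificationModel, RandomForestClassifier}

// With this change, `trees` exposes the concrete per-tree model type
// rather than the private DecisionTreeModel trait.
val model = new RandomForestClassifier().setNumTrees(10).fit(trainingData)
val trees: Array[DecisionTreeClassificationModel] = model.trees
trees.foreach(t => println(s"depth=${t.depth}, numNodes=${t.numNodes}"))
```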
      
      ## How was this patch tested?
      
      Existing unit tests
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12158 from jkbradley/hide-dtm.
      8f50574a
    • Burak Yavuz's avatar
      [SPARK-14287] isStreaming method for Dataset · ba24d1ee
      Burak Yavuz authored
      With the addition of StreamExecution (ContinuousQuery) to Datasets, data will become unbounded. With unbounded data, the execution of some methods and operations will not make sense, e.g. `Dataset.count()`.
      
      A simple API is required to check whether the data in a Dataset is bounded or unbounded. This will allow users to check whether their Dataset is in streaming mode or not. ML algorithms may check if the data is unbounded and throw an exception for example.
      
The implementation of this method is simple; naming it, however, is the challenge. Some possible names for this method are:
       - isStreaming
       - isContinuous
       - isBounded
       - isUnbounded
      
I've gone with `isStreaming` for now. We can change it before Spark 2.0 if we decide to come up with a different name. For that reason I've marked it as `Experimental`.
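
A minimal hedged sketch (assuming a spark-shell style `sqlContext`):

```scala
// A bounded (static) Dataset reports false; a Dataset backed by a
// streaming source (ContinuousQuery) would report true, letting callers
// such as ML algorithms reject unbounded input up front.
val staticDS = sqlContext.range(0, 10)
staticDS.isStreaming   // false
```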
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #12080 from brkyvz/is-streaming.
      ba24d1ee
    • Guillaume Poulin's avatar
      [SPARK-12425][STREAMING] DStream union optimisation · 7201f033
      Guillaume Poulin authored
Use PartitionerAwareUnionRDD when possible to optimize shuffling and
preserve the partitioner.
      
      Author: Guillaume Poulin <poulin.guillaume@gmail.com>
      
      Closes #10382 from gpoulin/dstream_union_optimisation.
      7201f033
    • Luciano Resende's avatar
      [SPARK-14366] Remove sbt-idea plugin · a172e11c
      Luciano Resende authored
      ## What changes were proposed in this pull request?
      
Remove the sbt-idea plugin, as importing the sbt project directly provides much better support.
      
      Author: Luciano Resende <lresende@apache.org>
      
      Closes #12151 from lresende/SPARK-14366.
      a172e11c
    • Marcelo Vanzin's avatar
      [SPARK-13579][BUILD] Stop building the main Spark assembly. · 24d7d2e4
      Marcelo Vanzin authored
      This change modifies the "assembly/" module to just copy needed
      dependencies to its build directory, and modifies the packaging
script to pick those up (and remove duplicate jars packaged in the
      examples module).
      
      I also made some minor adjustments to dependencies to remove some
      test jars from the final packaging, and remove jars that conflict with each
      other when packaged separately (e.g. servlet api).
      
      Also note that this change restores guava in applications' classpaths, even
      though it's still shaded inside Spark. This is now needed for the Hadoop
      libraries that are packaged with Spark, which now are not processed by
      the shade plugin.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #11796 from vanzin/SPARK-13579.
      24d7d2e4
    • Davies Liu's avatar
      [SPARK-14259] [SQL] Merging small files together based on the cost of opening · 400b2f86
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
This PR basically re-does #12068 but with a different model, which should work better in the case of small files with different sizes.
      
      ## How was this patch tested?
      
      Updated existing tests.
      
Ran a query on thousands of small partitioned files locally with all default settings (so the cost to open a file should be over-estimated); the durations of tasks become smaller and smaller, which is good (the last few tasks are the shortest).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12095 from davies/file_cost.
      400b2f86
    • Davies Liu's avatar
      [SPARK-14334] [SQL] add toLocalIterator for Dataset/DataFrame · cc70f174
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
RDD.toLocalIterator() can be used to fetch one partition at a time to reduce memory usage. Right now, for Dataset/DataFrame we have to use df.rdd.toLocalIterator, which is super slow and also requires lots of memory (because of the Java serializer, or even the Kryo serializer).

This PR introduces an optimized toLocalIterator for Dataset/DataFrame, which is much faster and requires much less memory. For a partition with 5 million rows, `df.rdd.toLocalIterator` took about 100 seconds, but `df.toLocalIterator` took less than 7 seconds. For 10 million rows, `rdd.toLocalIterator` will crash (not enough memory) with a 4G heap, but `df.toLocalIterator` could finish in 12 seconds.

The JDBC server has been updated to use `DataFrame.toLocalIterator`.
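
A hedged usage sketch (the table name is a placeholder, assuming a spark-shell style `sqlContext`):

```scala
import scala.collection.JavaConverters._

// Fetch results one partition at a time instead of collect()-ing
// everything to the driver at once.
val df = sqlContext.sql("SELECT * FROM some_large_table")   // placeholder table
val rows = df.toLocalIterator().asScala                     // java.util.Iterator[Row] -> Scala
rows.take(5).foreach(println)
```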
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12114 from davies/local_iterator.
      cc70f174
    • Reynold Xin's avatar
      [SPARK-14358] Change SparkListener from a trait to an abstract class · 71439047
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      Scala traits are difficult to maintain binary compatibility on, and as a result we had to introduce JavaSparkListener. In Spark 2.0 we can change SparkListener from a trait to an abstract class and then remove JavaSparkListener.
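
A hedged sketch of a listener written against the abstract class, overriding only the callbacks it needs:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

// With SparkListener as an abstract class of no-op callbacks, subclasses
// override only the events they care about; binary compatibility is easier
// to keep than with a trait.
class JobLoggingListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    println(s"Job ${jobStart.jobId} started with ${jobStart.stageInfos.size} stages")

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} finished: ${jobEnd.jobResult}")
}

// sc.addSparkListener(new JobLoggingListener())   // `sc` is a hypothetical SparkContext
```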
      
      ## How was this patch tested?
      Updated related unit tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12142 from rxin/SPARK-14358.
      71439047
    • Reynold Xin's avatar
      [SPARK-14364][SPARK] HeartbeatReceiver object should be private · 27dad6f6
      Reynold Xin authored
      ## What changes were proposed in this pull request?
It was a mistake that the HeartbeatReceiver object was made public in Spark 1.x.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12148 from rxin/SPARK-14364.
      27dad6f6
    • Davies Liu's avatar
[SPARK-12981] [SQL] extract Python UDF in physical plan · 5743c647
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
Currently we extract Python UDFs into a special logical plan, EvaluatePython, in the analyzer. But EvaluatePython is not part of Catalyst; many rules have no knowledge of it, which breaks many things (for example, filter push-down or column pruning).

We should treat Python UDFs as normal expressions until we want to evaluate them in the physical plan; we can extract them at the end of the optimizer or in the physical plan.

This PR extracts Python UDFs in the physical plan.
      
      Closes #10935
      
      ## How was this patch tested?
      
      Added regression tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12127 from davies/py_udf.
      5743c647
    • Shixiong Zhu's avatar
      [SPARK-14176][SQL] Add DataFrameWriter.trigger to set the stream batch period · 855ed44e
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
Add a processing-time trigger to control the batch processing speed.
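
A hedged sketch of the intended usage; the streaming writer entry points and the `ProcessingTime` package location were still in flux at this point in the 2.0 cycle, so names here are indicative rather than exact, and `streamingDF` plus the output path are placeholders:

```scala
import org.apache.spark.sql.ProcessingTime   // later releases moved/renamed this

// Fire a new batch at most once every 10 seconds.
streamingDF.write
  .trigger(ProcessingTime("10 seconds"))
  .format("parquet")
  .startStream("/tmp/query-output")
```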
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11976 from zsxwing/trigger.
      855ed44e
    • Joseph K. Bradley's avatar
      [SPARK-13784][ML] Persistence for RandomForestClassifier, RandomForestRegressor · 89f3befa
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
**Main change**: Added save/load for RandomForestClassifier and RandomForestRegressor (implementation details below; a usage sketch follows the list)
      
      Modified numTrees method (*deprecation*)
      * Goal: Use default implementations of unit tests which assume Estimators and Models share the same set of Params.
      * What this PR does: Moves method numTrees outside of trait TreeEnsembleModel.  Adds it to GBT and RF Models.  Deprecates it in RF Models in favor of new method getNumTrees.  In Spark 2.1, we can have RF Models include Param numTrees.
      
      Minor items
      * Fixes bugs in GBTClassificationModel, GBTRegressionModel fromOld methods where they assign the wrong old UID.
      
      **Implementation details**
      * Split DecisionTreeModelReadWrite.loadTreeNodes into 2 methods in order to reuse some code for ensembles.
      * Added EnsembleModelReadWrite object with save/load implementations usable for RFs and GBTs
        * These store all trees' nodes in a single DataFrame, and all trees' metadata in a second DataFrame.
      * Split trait RandomForestParams into parts in order to add more Estimator Params to RF models
      * Split DefaultParamsWriter.saveMetadata into two methods to allow ensembles to store sub-models' metadata in a single DataFrame.  Same for DefaultParamsReader.loadMetadata
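
A hedged usage sketch of the added persistence (the path is a placeholder and `trainingData` is a hypothetical DataFrame with "label"/"features" columns):

```scala
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}

// Save a fitted ensemble (trees' nodes in one DataFrame, per-tree metadata
// in another, per the implementation notes above) and load it back.
val model = new RandomForestClassifier().setNumTrees(20).fit(trainingData)
model.write.overwrite().save("/tmp/rf-model")          // placeholder path
val restored = RandomForestClassificationModel.load("/tmp/rf-model")
```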
      
      ## How was this patch tested?
      
      Adds standard unit tests for RF save/load
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      Author: GayathriMurali <gayathri.m.softie@gmail.com>
      
      Closes #12118 from jkbradley/GayathriMurali-SPARK-13784.
      89f3befa
    • Davies Liu's avatar
      [SPARK-14137] [SQL] Cleanup hash join · 74542533
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
This PR does a few cleanups on HashedRelation and HashJoin:
      
      1) Merge HashedRelation and UniqueHashedRelation together
2) Return an iterator from HashedRelation, so we do not need to create many UnsafeRow objects.
3) Return a copy of HashedRelation for thread-safety in BroadcastJoin, so we can re-use the UnsafeRow objects.
4) Clean up HashJoin, sharing most of the code between BroadcastHashJoin and ShuffleHashJoin.
5) Removed UniqueLongHashedRelation, which will be replaced by LongUnsafeMap (another PR).
6) Update the benchmark; before this patch, the selectivity of the joins was too high.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12102 from davies/cleanup_hash.
      74542533
    • Reynold Xin's avatar
      [SPARK-14360][SQL] QueryExecution.debug.codegen() to dump codegen · 0340b3d2
      Reynold Xin authored
      ## What changes were proposed in this pull request?
We recently added the ability to dump the generated code for a given query. However, the method is only available through an implicit after an import. It'd slightly simplify things if it could be called directly on queryExecution.
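
A hedged sketch of the shortcut (assuming a spark-shell style `sqlContext`):

```scala
// Dump generated code directly from queryExecution, without importing the
// debug implicits first.
val df = sqlContext.range(0, 10).selectExpr("id * 2 AS doubled")
df.queryExecution.debug.codegen()
```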
      
      ## How was this patch tested?
      Manually tested in spark-shell.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12144 from rxin/SPARK-14360.
      0340b3d2
  3. Apr 03, 2016
    • Matei Zaharia's avatar
      [SPARK-14356] Update spark.sql.execution.debug to work on Datasets · 76f3c735
      Matei Zaharia authored
      ## What changes were proposed in this pull request?
      
      Update DebugQuery to work on Datasets of any type, not just DataFrames.
      
      ## How was this patch tested?
      
      Added unit tests, checked in spark-shell.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #12140 from mateiz/debug-dataset.
      76f3c735
    • Dongjoon Hyun's avatar
      [SPARK-14355][BUILD] Fix typos in Exception/Testcase/Comments and static analysis results · 3f749f7e
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
This PR contains the following five types of maintenance fixes over 59 files (+94 lines, -93 lines).
- Fix typos (exception/log strings, test case names, comments) in 44 lines.
      - Fix lint-java errors (MaxLineLength) in 6 lines. (New codes after SPARK-14011)
      - Use diamond operators in 40 lines. (New codes after SPARK-13702)
      - Fix redundant semicolon in 5 lines.
      - Rename class `InferSchemaSuite` to `CSVInferSchemaSuite` in CSVInferSchemaSuite.scala.
      
      ## How was this patch tested?
      
      Manual and pass the Jenkins tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12139 from dongjoon-hyun/SPARK-14355.
      3f749f7e
    • Marcin Tustin's avatar
      [SPARK-14163][CORE] SumEvaluator and countApprox cannot reliably handle RDDs of size 1 · 9023015f
      Marcin Tustin authored
      ## What changes were proposed in this pull request?
      
      This special cases 0 and 1 counts to avoid passing 0 degrees of freedom.
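
A hedged usage sketch of the approximate API whose evaluator this touches (`sc` is a hypothetical SparkContext):

```scala
// sumApprox on a single-element RDD exercises the 0/1-count special case;
// previously the evaluator could be handed 0 degrees of freedom.
val tiny = sc.parallelize(Seq(42.0), numSlices = 1)
val approx = tiny.sumApprox(timeout = 1000L, confidence = 0.95)
println(approx.initialValue)   // a BoundedDouble with low/high bounds
```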
      
      ## How was this patch tested?
      
      Tests run successfully. New test added.
      
      ## Note:
This recreates #11982, which was closed due to a non-updated diff; rxin and srowen commented there.
This also adds tests, reworks the code to perform the special casing (based on srowen's comments), adds equality machinery for BoundedDouble, and changes how it is converted to a string.
      
      Author: Marcin Tustin <mtustin@handybook.com>
      Author: Marcin Tustin <mtustin@handy.com>
      
      Closes #12016 from mtustin-handy/SPARK-14163.
      9023015f
    • bomeng's avatar
      [SPARK-14341][SQL] Throw exception on unsupported create / drop macro ddl · c238cd07
      bomeng authored
      ## What changes were proposed in this pull request?
      
      We throw an AnalysisException that looks like this:
      
      ```
      scala> sqlContext.sql("CREATE TEMPORARY MACRO SIGMOID (x DOUBLE) 1.0 / (1.0 + EXP(-x))")
      org.apache.spark.sql.catalyst.parser.ParseException:
      Unsupported SQL statement
      == SQL ==
      CREATE TEMPORARY MACRO SIGMOID (x DOUBLE) 1.0 / (1.0 + EXP(-x))
        at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.nativeCommand(ParseDriver.scala:66)
        at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:56)
        at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
        at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:86)
        at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
        at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:198)
        at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:749)
        ... 48 elided
      
      ```
      
      ## How was this patch tested?
      
      Add test cases in HiveQuerySuite.scala
      
      Author: bomeng <bmeng@us.ibm.com>
      
      Closes #12125 from bomeng/SPARK-14341.
      c238cd07
    • Dongjoon Hyun's avatar
      [SPARK-14350][SQL] EXPLAIN output should be in a single cell · 1f0c5dce
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      EXPLAIN output should be in a single cell.
      
      **Before**
      ```
      scala> sql("explain select 1").collect()
      res0: Array[org.apache.spark.sql.Row] = Array([== Physical Plan ==], [WholeStageCodegen], [:  +- Project [1 AS 1#1]], [:     +- INPUT], [+- Scan OneRowRelation[]])
      ```
      
      **After**
      ```
      scala> sql("explain select 1").collect()
      res1: Array[org.apache.spark.sql.Row] =
      Array([== Physical Plan ==
      WholeStageCodegen
      :  +- Project [1 AS 1#4]
      :     +- INPUT
      +- Scan OneRowRelation[]])
      ```
      Or,
      ```
      scala> sql("explain select 1").head
      res1: org.apache.spark.sql.Row =
      [== Physical Plan ==
      WholeStageCodegen
      :  +- Project [1 AS 1#5]
      :     +- INPUT
      +- Scan OneRowRelation[]]
      ```
      
Please note that the Spark shell (Scala shell) truncates long string output, so you may need to use `println` to get the full strings.
      ```
      scala> println(sql("explain codegen select 'a' as a group by 1").head)
      [Found 2 WholeStageCodegen subtrees.
      == Subtree 1 / 2 ==
      WholeStageCodegen
      ...
      /* 059 */   }
      /* 060 */ }
      
      ]
      ```
      
      ## How was this patch tested?
      
Pass the Jenkins tests. (Test cases are updated.)
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12137 from dongjoon-hyun/SPARK-14350.
      1f0c5dce
    • hyukjinkwon's avatar
      [SPARK-14231] [SQL] JSON data source infers floating-point values as a double... · 2262a933
      hyukjinkwon authored
      [SPARK-14231] [SQL] JSON data source infers floating-point values as a double when they do not fit in a decimal
      
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-14231
      
Currently, the JSON data source supports inferring `DecimalType` for big numbers, and the `floatAsBigDecimal` option reads floating-point values as `DecimalType`.

But there are a few restrictions on Spark's `DecimalType`:

1. The precision cannot be bigger than 38.
2. The scale cannot be bigger than the precision.

Currently, neither restriction is handled.
      
      This PR handles the cases by inferring them as `DoubleType`. Also, the option name was changed from `floatAsBigDecimal` to `prefersDecimal` as suggested [here](https://issues.apache.org/jira/browse/SPARK-14231?focusedCommentId=15215579&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15215579).
      
So, the code below:
      
      ```scala
      def doubleRecords: RDD[String] =
        sqlContext.sparkContext.parallelize(
          s"""{"a": 1${"0" * 38}, "b": 0.01}""" ::
          s"""{"a": 2${"0" * 38}, "b": 0.02}""" :: Nil)
      
      val jsonDF = sqlContext.read
        .option("prefersDecimal", "true")
        .json(doubleRecords)
      jsonDF.printSchema()
      ```
      
produces the following:
      
      - **Before**
      
      ```scala
      org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater than precision (1).;
      	at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:44)
      	at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144)
      	at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108)
      	at
      ...
      ```
      
      - **After**
      
      ```scala
      root
       |-- a: double (nullable = true)
       |-- b: double (nullable = true)
      ```
      
      ## How was this patch tested?
      
Unit tests, plus `./dev/run_tests` for code style checks.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #12030 from HyukjinKwon/SPARK-14231.
      2262a933
    • Reynold Xin's avatar
      [HOTFIX] Fix Scala 2.10 compilation · 7be46205
      Reynold Xin authored
      7be46205
  4. Apr 02, 2016
    • Liang-Chi Hsieh's avatar
      [SPARK-13996] [SQL] Add more not null attributes for Filter codegen · c2f25b1a
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      JIRA: https://issues.apache.org/jira/browse/SPARK-13996
      
Filter codegen finds attributes that are not null by checking for an IsNotNull(a) expression with the condition child.output.contains(a). However, the current check is not comprehensive; we can improve it.
      
      E.g., for this plan:
      
          val rdd = sqlContext.sparkContext.makeRDD(Seq(Row(1, "1"), Row(null, "1"), Row(2, "2")))
          val schema = new StructType().add("k", IntegerType).add("v", StringType)
          val smallDF = sqlContext.createDataFrame(rdd, schema)
          val df = smallDF.filter("isnotnull(k + 1)")
      
      The code snippet generated without this patch:
      
          /* 031 */   protected void processNext() throws java.io.IOException {
          /* 032 */     /*** PRODUCE: Filter isnotnull((k#0 + 1)) */
          /* 033 */
          /* 034 */     /*** PRODUCE: INPUT */
          /* 035 */
          /* 036 */     while (!shouldStop() && inputadapter_input.hasNext()) {
          /* 037 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
          /* 038 */       /*** CONSUME: Filter isnotnull((k#0 + 1)) */
          /* 039 */       /* input[0, int] */
          /* 040 */       boolean filter_isNull = inputadapter_row.isNullAt(0);
          /* 041 */       int filter_value = filter_isNull ? -1 : (inputadapter_row.getInt(0));
          /* 042 */
          /* 043 */       /* isnotnull((input[0, int] + 1)) */
          /* 044 */       /* (input[0, int] + 1) */
          /* 045 */       boolean filter_isNull3 = true;
          /* 046 */       int filter_value3 = -1;
          /* 047 */
          /* 048 */       if (!filter_isNull) {
          /* 049 */         filter_isNull3 = false; // resultCode could change nullability.
          /* 050 */         filter_value3 = filter_value + 1;
          /* 051 */
          /* 052 */       }
          /* 053 */       if (!(!(filter_isNull3))) continue;
          /* 054 */
          /* 055 */       filter_metricValue.add(1);
      
      With this patch:
      
          /* 031 */   protected void processNext() throws java.io.IOException {
          /* 032 */     /*** PRODUCE: Filter isnotnull((k#0 + 1)) */
          /* 033 */
          /* 034 */     /*** PRODUCE: INPUT */
          /* 035 */
          /* 036 */     while (!shouldStop() && inputadapter_input.hasNext()) {
          /* 037 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
          /* 038 */       /*** CONSUME: Filter isnotnull((k#0 + 1)) */
          /* 039 */       /* input[0, int] */
          /* 040 */       boolean filter_isNull = inputadapter_row.isNullAt(0);
          /* 041 */       int filter_value = filter_isNull ? -1 : (inputadapter_row.getInt(0));
          /* 042 */
          /* 043 */       if (filter_isNull) continue;
          /* 044 */
          /* 045 */       filter_metricValue.add(1);
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #11810 from viirya/add-more-not-null-attrs.
      c2f25b1a
    • Sital Kedia's avatar
      [SPARK-14056] Appends s3 specific configurations and spark.hadoop con… · 1cf70183
      Sital Kedia authored
      ## What changes were proposed in this pull request?
      
Appends S3-specific configurations and spark.hadoop configurations to the Hive configuration.
      
      ## How was this patch tested?
      
      Tested by running a job on cluster.
      
      
      Author: Sital Kedia <skedia@fb.com>
      
      Closes #11876 from sitalkedia/hiveConf.
      1cf70183
    • Liwei Lin's avatar
      [SPARK-14342][CORE][DOCS][TESTS] Remove straggler references to Tachyon · 03d130f9
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      Straggler references to Tachyon were removed:
      - for docs, `tachyon` has been generalized as `off-heap memory`;
- for Mesos test suites, the key-value `tachyon:true`/`tachyon:false` has been changed to `os:centos`/`os:ubuntu`, since `os` is an example constraint used by the [Mesos official docs](http://mesos.apache.org/documentation/attributes-resources/).
      
      ## How was this patch tested?
      
      Existing test suites.
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #12129 from lw-lin/tachyon-cleanup.
      03d130f9
    • Dongjoon Hyun's avatar
      [MINOR][DOCS] Use multi-line JavaDoc comments in Scala code. · 4a6e78ab
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
This PR converts all Scala-style multiline comments into Java-style multiline comments in the Scala code.
(All comment-only changes over 77 files: +786 lines, −747 lines.)
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12130 from dongjoon-hyun/use_multiine_javadoc_comments.
      4a6e78ab
    • Dongjoon Hyun's avatar
      [SPARK-14338][SQL] Improve `SimplifyConditionals` rule to handle `null` in IF/CASEWHEN · f7050376
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, `SimplifyConditionals` handles `true` and `false` to optimize branches. This PR improves `SimplifyConditionals` to take advantage of `null` conditions for `if` and `CaseWhen` expressions, too.
      
      **Before**
      ```
      scala> sql("SELECT IF(null, 1, 0)").explain()
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [if (null) 1 else 0 AS (IF(CAST(NULL AS BOOLEAN), 1, 0))#4]
      :     +- INPUT
      +- Scan OneRowRelation[]
      scala> sql("select case when cast(null as boolean) then 1 else 2 end").explain()
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [CASE WHEN null THEN 1 ELSE 2 END AS CASE WHEN CAST(NULL AS BOOLEAN) THEN 1 ELSE 2 END#14]
      :     +- INPUT
      +- Scan OneRowRelation[]
      ```
      
      **After**
      ```
      scala> sql("SELECT IF(null, 1, 0)").explain()
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [0 AS (IF(CAST(NULL AS BOOLEAN), 1, 0))#4]
      :     +- INPUT
      +- Scan OneRowRelation[]
      scala> sql("select case when cast(null as boolean) then 1 else 2 end").explain()
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [2 AS CASE WHEN CAST(NULL AS BOOLEAN) THEN 1 ELSE 2 END#4]
      :     +- INPUT
      +- Scan OneRowRelation[]
      ```
      
      **Hive**
      ```
      hive> select if(null,1,2);
      OK
      2
      hive> select case when cast(null as boolean) then 1 else 2 end;
      OK
      2
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including new extended test cases).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12122 from dongjoon-hyun/SPARK-14338.
      f7050376
    • Reynold Xin's avatar
      a3e29354
    • Jacek Laskowski's avatar
      [MINOR] Typo fixes · 06694f1c
      Jacek Laskowski authored
      ## What changes were proposed in this pull request?
      
      Typo fixes. No functional changes.
      
      ## How was this patch tested?
      
      Built the sources and ran with samples.
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #11802 from jaceklaskowski/typo-fixes.
      06694f1c