  1. Sep 13, 2017
    • [SPARK-20427][SQL] Read JDBC table use custom schema · 17edfec5
      Yuming Wang authored
      ## What changes were proposed in this pull request?
      
      The auto-generated Oracle schema is sometimes not what we expect:
      
      - `number(1)` is auto-mapped to BooleanType, which is sometimes not what we expect, per [SPARK-20921](https://issues.apache.org/jira/browse/SPARK-20921).
      - `number` is auto-mapped to Decimal(38,10), which cannot hold large values, per [SPARK-20427](https://issues.apache.org/jira/browse/SPARK-20427).
      
      This PR fixes the issue by supporting a custom schema, as follows:
      ```scala
      import java.util.Properties

      // Override the inferred JDBC types for selected columns via the customSchema option.
      val props = new Properties()
      props.put("customSchema", "ID decimal(38, 0), N1 int, N2 boolean")
      val dfRead = spark.read.jdbc(jdbcUrl, "tableWithCustomSchema", props)
      dfRead.show()
      ```
      or
      ```sql
      CREATE TEMPORARY VIEW tableWithCustomSchema
      USING org.apache.spark.sql.jdbc
      OPTIONS (url '$jdbcUrl', dbTable 'tableWithCustomSchema', customSchema 'ID decimal(38, 0), N1 int, N2 boolean')
      ```
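      
      To illustrate why the `Decimal(38,10)` default can fail on large `number` values, here is a minimal Java sketch. It assumes the simplified rule that a `DECIMAL(precision, scale)` type leaves `precision - scale` digits for the integer part; `DecimalFitCheck` and `fits` are hypothetical names for illustration and are not part of Spark.
      
      ```java
      import java.math.BigDecimal;

      class DecimalFitCheck {
          // Simplified check (an assumption, not Spark's code): the integer part
          // of the value must fit in (precision - scale) digits.
          static boolean fits(BigDecimal value, int precision, int scale) {
              int integerDigits = value.precision() - value.scale();
              return integerDigits <= precision - scale;
          }

          public static void main(String[] args) {
              // A 38-digit integer, as an Oracle `number` column may contain.
              BigDecimal big = new BigDecimal("12345678901234567890123456789012345678");
              System.out.println(fits(big, 38, 10)); // false: only 28 integer digits available
              System.out.println(fits(big, 38, 0));  // true: all 38 digits available
          }
      }
      ```
      
      Under this assumption, declaring the column as `decimal(38, 0)` through `customSchema` is what makes the full 38 digits usable.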
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Yuming Wang <wgyumg@gmail.com>
      
      Closes #18266 from wangyum/SPARK-20427.
    • [SPARK-21970][CORE] Fix Redundant Throws Declarations in Java Codebase · b6ef1f57
      Armin authored
      ## What changes were proposed in this pull request?
      
      1. Removing all redundant throws declarations from Java codebase.
      2. Removing dead code made visible by this from `ShuffleExternalSorter#closeAndGetSpills`
      
      ## How was this patch tested?
      
      Build still passes.
      
      Author: Armin <me@obrown.io>
      
      Closes #19182 from original-brownbear/SPARK-21970.
    • [SPARK-21893][BUILD][STREAMING][WIP] Put Kafka 0.8 behind a profile · 4fbf748b
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Put Kafka 0.8 support behind a kafka-0-8 profile.
      
      ## How was this patch tested?
      
      Existing tests. However, until the PR builder and Jenkins configs are updated, the effect here is that Kafka 0.8 support is not built or tested at all.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19134 from srowen/SPARK-21893.
  2. Sep 06, 2017
    • [SPARK-19357][ML] Adding parallel model evaluation in ML tuning · 16c4c03c
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Modified `CrossValidator` and `TrainValidationSplit` to be able to evaluate models in parallel for a given parameter grid.  The level of parallelism is controlled by a parameter `numParallelEval` used to schedule a number of models to be trained/evaluated so that the jobs can be run concurrently.  This is a naive approach that does not check the cluster for needed resources, so care must be taken by the user to tune the parameter appropriately.  The default value is `1` which will train/evaluate in serial.
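      
      The scheduling idea described above can be sketched in plain Java, without Spark. This is a toy illustration under stated assumptions: `ParallelGridEvalSketch`, `evaluate`, and `bestIndex` are hypothetical names, `evaluate` stands in for training and scoring one model, and a fixed thread pool of size `numParallelEval` stands in for the scheduling described in the PR.
      
      ```java
      import java.util.ArrayList;
      import java.util.List;
      import java.util.concurrent.Callable;
      import java.util.concurrent.ExecutionException;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.Future;

      class ParallelGridEvalSketch {
          // Stand-in for "train and evaluate one model"; higher is better.
          static double evaluate(double regParam) {
              return -(regParam - 0.1) * (regParam - 0.1);
          }

          // Evaluates every grid point with at most numParallelEval tasks running
          // at once and returns the index of the best-scoring parameter.
          static int bestIndex(double[] grid, int numParallelEval) {
              ExecutorService pool = Executors.newFixedThreadPool(numParallelEval);
              try {
                  List<Future<Double>> futures = new ArrayList<>();
                  for (double param : grid) {
                      Callable<Double> task = () -> evaluate(param);
                      futures.add(pool.submit(task));
                  }
                  int best = 0;
                  double bestMetric = Double.NEGATIVE_INFINITY;
                  for (int i = 0; i < futures.size(); i++) {
                      double metric;
                      try {
                          metric = futures.get(i).get();
                      } catch (InterruptedException | ExecutionException e) {
                          throw new IllegalStateException(e);
                      }
                      if (metric > bestMetric) {
                          bestMetric = metric;
                          best = i;
                      }
                  }
                  return best;
              } finally {
                  pool.shutdown();
              }
          }

          public static void main(String[] args) {
              double[] grid = {0.01, 0.1, 1.0};
              System.out.println(bestIndex(grid, 1)); // serial
              System.out.println(bestIndex(grid, 3)); // parallel, same selection
          }
      }
      ```
      
      Results are collected in submission order, so the selected model does not depend on which task finishes first. This mirrors the property the unit tests check: serial and parallel runs select the same model.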
      
      ## How was this patch tested?
      Added unit tests for CrossValidator and TrainValidationSplit to verify that model selection is the same when run in serial vs parallel.  Manual testing to verify tasks run in parallel when param is > 1. Added parameter usage to relevant examples.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #16774 from BryanCutler/parallel-model-eval-SPARK-19357.
  3. Aug 30, 2017
    • [SPARK-21469][ML][EXAMPLES] Adding Examples for FeatureHasher · 4133c1b0
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      This PR adds ML examples for the FeatureHasher transform in Scala, Java, Python.
      
      ## How was this patch tested?
      
      Manually ran examples and verified that output is consistent for different APIs
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #19024 from BryanCutler/ml-examples-FeatureHasher-SPARK-21810.
  4. Aug 15, 2017
    • [SPARK-21731][BUILD] Upgrade scalastyle to 0.9. · 3f958a99
      Marcelo Vanzin authored
      This version fixes a few issues in the import order checker; it provides
      better error messages, and detects more improper ordering (thus the need
      to change a lot of files in this patch). The main fix is that it correctly
      complains about the order of packages vs. classes.
      
      As part of the above, I moved some "SparkSession" import in ML examples
      inside the "$example on$" blocks; that didn't seem consistent across
      different source files to start with, and avoids having to add more on/off blocks
      around specific imports.
      
      The new scalastyle also seems to have a better header detector, so a few
      license headers had to be updated to match the expected indentation.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18943 from vanzin/SPARK-21731.
  5. Jul 18, 2017
    • [SPARK-21415] Triage scapegoat warnings, part 1 · e26dac5f
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Address scapegoat warnings for:
      - BigDecimal double constructor
      - Catching NPE
      - Finalizer without super
      - List.size is O(n)
      - Prefer Seq.empty
      - Prefer Set.empty
      - reverse.map instead of reverseMap
      - Type shadowing
      - Unnecessary if condition.
      - Use .log1p
      - Var could be val
      
      In some instances, such as Seq.empty, I avoided making the change in test code even where valid, to keep the scope of the change smaller. Those warnings concern performance, which won't matter for tests.
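      
      Two of the warnings listed above ("BigDecimal double constructor" and "Use .log1p") come down to JVM-level numeric behavior, so they can be shown with a short Java sketch. `ScapegoatWarningsSketch` and its helper names are hypothetical and only illustrate the pitfalls; they are not code from this patch.
      
      ```java
      import java.math.BigDecimal;

      class ScapegoatWarningsSketch {
          // The double constructor captures the exact binary value of the double,
          // which for 0.1 is not exactly one tenth.
          static BigDecimal viaDoubleConstructor(double d) {
              return new BigDecimal(d);
          }

          // valueOf goes through the double's canonical string form ("0.1").
          static BigDecimal viaValueOf(double d) {
              return BigDecimal.valueOf(d);
          }

          // 1 + x rounds to exactly 1.0 for very small x, so the result is 0.0.
          static double naiveLog(double x) {
              return Math.log(1 + x);
          }

          // log1p computes log(1 + x) without forming 1 + x first.
          static double stableLog(double x) {
              return Math.log1p(x);
          }

          public static void main(String[] args) {
              System.out.println(viaDoubleConstructor(0.1)); // long expansion, not 0.1
              System.out.println(viaValueOf(0.1));           // 0.1
              System.out.println(naiveLog(1e-17));           // 0.0
              System.out.println(stableLog(1e-17));          // a small positive number
          }
      }
      ```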
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18635 from srowen/Scapegoat1.
  6. Jul 13, 2017
    • [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10 · 425c4ada
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Remove Scala 2.10 build profiles and support
      - Replace some 2.10 support in scripts with commented placeholders for 2.12 later
      - Remove deprecated API calls from 2.10 support
      - Remove usages of deprecated context bounds where possible
      - Remove Scala 2.10 workarounds like ScalaReflectionLock
      - Other minor Scala warning fixes
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17150 from srowen/SPARK-19810.
  7. Jun 19, 2017
    • [MINOR][BUILD] Fix Java linter errors · ecc56313
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR cleans up a few Java linter errors for Apache Spark 2.2 release.
      
      ## How was this patch tested?
      
      ```bash
      $ dev/lint-java
      Using `mvn` from path: /usr/local/bin/mvn
      Checkstyle checks passed.
      ```
      
      We can check the result at Travis CI, [here](https://travis-ci.org/dongjoon-hyun/spark/builds/244297894).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #18345 from dongjoon-hyun/fix_lint_java_2.
  8. Jun 09, 2017
    • Fix bug in JavaRegressionMetricsExample. · 6491cbf0
      junzhi lu authored
      The original code can't visit the last element of the `parts` array,
      so `v[v.length-1]` always equals 0.
      
      ## What changes were proposed in this pull request?
      Change the loop range from (1 to parts.length-1) to (1 to parts.length).
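      
      The off-by-one can be reproduced with a small Java sketch. It assumes each input line has the form `label index:value index:value ...`; the class `ParseFeaturesSketch` and both method names are hypothetical and only mirror the loop bounds described above.
      
      ```java
      class ParseFeaturesSketch {
          // Buggy bound: stops one element early, so the last slot of v stays 0.0.
          static double[] parseBuggy(String line) {
              String[] parts = line.split(" ");
              double[] v = new double[parts.length - 1];
              for (int i = 1; i < parts.length - 1; i++) {
                  v[i - 1] = Double.parseDouble(parts[i].split(":")[1]);
              }
              return v;
          }

          // Fixed bound: visits every feature token after the label.
          static double[] parseFixed(String line) {
              String[] parts = line.split(" ");
              double[] v = new double[parts.length - 1];
              for (int i = 1; i < parts.length; i++) {
                  v[i - 1] = Double.parseDouble(parts[i].split(":")[1]);
              }
              return v;
          }

          public static void main(String[] args) {
              String line = "1.0 1:0.5 2:2.5";
              System.out.println(java.util.Arrays.toString(parseBuggy(line))); // [0.5, 0.0]
              System.out.println(java.util.Arrays.toString(parseFixed(line))); // [0.5, 2.5]
          }
      }
      ```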
      
      ## How was this patch tested?
      
      debug it in eclipse (´〜`*) zzz.
      
      
      Author: junzhi lu <452756565@qq.com>
      
      Closes #18237 from masterwugui/patch-1.
  9. May 26, 2017
    • [SPARK-20694][DOCS][SQL] Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide · ae33abf7
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Add Scala, Python and Java examples for `partitionBy`, `sortBy` and `bucketBy`.
      - Add _Bucketing, Sorting and Partitioning_ section to SQL Programming Guide
      - Remove bucketing from Unsupported Hive Functionalities.
      
      ## How was this patch tested?
      
      Manual tests, docs build.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17938 from zero323/DOCS-BUCKETING-AND-PARTITIONING.
    • [SPARK-20849][DOC][SPARKR] Document R DecisionTree · a97c4970
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      1. Add an example for SparkR `decisionTree`.
      2. Document it in the user guide.
      
      ## How was this patch tested?
      local submit
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #18067 from zhengruifeng/dt_example.
  10. May 25, 2017
    • [SPARK-20874][EXAMPLES] Add Structured Streaming Kafka Source to examples project · 98c38529
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Add Structured Streaming Kafka Source to the `examples` project so that people can run `bin/run-example StructuredKafkaWordCount ...`.
      
      ## How was this patch tested?
      
      manually tested it.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #18101 from zsxwing/add-missing-example-dep.
  11. May 18, 2017
  12. May 17, 2017
  13. May 16, 2017
  14. May 12, 2017
    • [SPARK-20554][BUILD] Remove usage of scala.language.reflectiveCalls · fc8a2b6e
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Remove uses of scala.language.reflectiveCalls that are either unnecessary or probably resulting in more complex code. This turned out to be less significant than I thought, but, still worth a touch-up.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17949 from srowen/SPARK-20554.
  15. May 09, 2017
    • [SPARK-20373][SQL][SS] Batch queries with 'Dataset/DataFrame.withWatermark()` does not execute · c0189abc
      uncleGen authored
      ## What changes were proposed in this pull request?
      
      Any Dataset/DataFrame batch query with the operation `withWatermark` does not execute because the batch planner does not have any rule to explicitly handle the EventTimeWatermark logical plan.
      The right solution is to simply remove the plan node, as the watermark should not affect any batch query in any way.
      
      Changes:
      - In this PR, we add a new rule `EliminateEventTimeWatermark` that checks whether the event time watermark should be ignored. The watermark is ignored in any batch query.
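      
      The idea behind such a rule can be sketched as a small tree rewrite in Java. This is a toy model, not Spark's Catalyst code: the `Plan`, `Source`, `Watermark`, and `Filter` types and the `eliminateBatchWatermarks` function are hypothetical stand-ins for logical plan nodes and the rule.
      
      ```java
      class WatermarkEliminationSketch {
          interface Plan {
              boolean isStreaming();
          }

          static final class Source implements Plan {
              final String name;
              final boolean streaming;
              Source(String name, boolean streaming) {
                  this.name = name;
                  this.streaming = streaming;
              }
              public boolean isStreaming() { return streaming; }
          }

          static final class Watermark implements Plan {
              final String column;
              final Plan child;
              Watermark(String column, Plan child) {
                  this.column = column;
                  this.child = child;
              }
              public boolean isStreaming() { return child.isStreaming(); }
          }

          static final class Filter implements Plan {
              final String condition;
              final Plan child;
              Filter(String condition, Plan child) {
                  this.condition = condition;
                  this.child = child;
              }
              public boolean isStreaming() { return child.isStreaming(); }
          }

          // Drops Watermark nodes whose child is a batch (non-streaming) plan and
          // keeps them for streaming plans.
          static Plan eliminateBatchWatermarks(Plan plan) {
              if (plan instanceof Watermark) {
                  Watermark w = (Watermark) plan;
                  Plan child = eliminateBatchWatermarks(w.child);
                  return child.isStreaming() ? new Watermark(w.column, child) : child;
              }
              if (plan instanceof Filter) {
                  Filter f = (Filter) plan;
                  return new Filter(f.condition, eliminateBatchWatermarks(f.child));
              }
              return plan;
          }
      }
      ```
      
      In this toy model, a batch plan `Filter(Watermark(Source))` is rewritten to `Filter(Source)`, while the same shape over a streaming source is left unchanged.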
      
      Depends upon:
      - [SPARK-20672](https://issues.apache.org/jira/browse/SPARK-20672). We cannot add this rule to the analyzer directly, because a streaming query is copied to `triggerLogicalPlan` in every trigger, and the rule would then be applied to `triggerLogicalPlan` by mistake.
      
      Others:
      - A typo fix in example.
      
      ## How was this patch tested?
      
      add new unit test.
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #17896 from uncleGen/SPARK-20373.
  16. May 04, 2017
    • [SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example · b8302ccd
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
      Add
      - R vignettes
      - R programming guide
      - SS programming guide
      - R example
      
      Also disable spark.als in vignettes for now since it's failing (SPARK-20402)
      
      ## How was this patch tested?
      
      manually
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #17814 from felixcheung/rdocss.
  17. May 03, 2017
    • [SPARK-20523][BUILD] Clean up build warnings for 2.2.0 release · 16fab6b0
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fix build warnings primarily related to Breeze 0.13 operator changes, Java style problems
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17803 from srowen/SPARK-20523.
    • [SPARK-6227][MLLIB][PYSPARK] Implement PySpark wrappers for SVD and PCA (v2) · db2fb84b
      MechCoder authored
      Add PCA and SVD to PySpark's wrappers for `RowMatrix` and `IndexedRowMatrix` (SVD only).
      
      Based on #7963, updated.
      
      ## How was this patch tested?
      
      New doc tests and unit tests. Ran all examples locally.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #17621 from MLnick/SPARK-6227-pyspark-svd-pca.
  18. Apr 29, 2017
    • [SPARK-19791][ML] Add doc and example for fpgrowth · add9d1bb
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      - Add a new section for fpm.
      - Add examples for FPGrowth in Scala and Java.
      
      Updated: rewrite transform to be more compact.
      
      ## How was this patch tested?
      
      local doc generation.
      
      Author: Yuhao Yang <yuhao.yang@intel.com>
      
      Closes #17130 from hhbyyh/fpmdoc.
  19. Apr 24, 2017
  20. Apr 18, 2017
    • [SPARK-20208][R][DOCS] Document R fpGrowth support · 702d85af
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Document fpGrowth in:
      
      - vignettes
      - programming guide
      - code example
      
      ## How was this patch tested?
      
      Manual tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17557 from zero323/SPARK-20208.
    • [SPARK-20377][SS] Fix JavaStructuredSessionization example · 74aa0df8
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Extra accessors in the Java bean class caused incorrect encoder generation, which corrupted the state when using timeouts.
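      
      A minimal Java sketch of how an extra accessor can look like a bean property. It assumes properties are discovered through the getter naming convention (`getX` with no arguments); the reflection helper below is a simplified stand-in for JavaBeans introspection, and all class and method names are hypothetical.
      
      ```java
      import java.lang.reflect.Method;
      import java.util.Set;
      import java.util.TreeSet;

      class GetterDiscoverySketch {
          // The derived getDurationMs() has no backing field but still matches
          // the getter convention.
          static class SessionWithExtraGetter {
              private long startMs;
              private long endMs;
              public long getStartMs() { return startMs; }
              public void setStartMs(long v) { startMs = v; }
              public long getEndMs() { return endMs; }
              public void setEndMs(long v) { endMs = v; }
              public long getDurationMs() { return endMs - startMs; }
          }

          // Renaming the derived method keeps it out of the property set.
          static class SessionFixed {
              private long startMs;
              private long endMs;
              public long getStartMs() { return startMs; }
              public void setStartMs(long v) { startMs = v; }
              public long getEndMs() { return endMs; }
              public void setEndMs(long v) { endMs = v; }
              public long calculateDuration() { return endMs - startMs; }
          }

          // Collects property names implied by public zero-argument getX methods,
          // skipping those inherited from Object (such as getClass).
          static Set<String> getterProperties(Class<?> type) {
              Set<String> names = new TreeSet<>();
              for (Method m : type.getMethods()) {
                  String n = m.getName();
                  if (n.startsWith("get") && n.length() > 3
                          && m.getParameterCount() == 0
                          && m.getDeclaringClass() != Object.class) {
                      names.add(Character.toLowerCase(n.charAt(3)) + n.substring(4));
                  }
              }
              return names;
          }

          public static void main(String[] args) {
              System.out.println(getterProperties(SessionWithExtraGetter.class)); // [durationMs, endMs, startMs]
              System.out.println(getterProperties(SessionFixed.class));           // [endMs, startMs]
          }
      }
      ```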
      
      ## How was this patch tested?
      manually ran the example
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #17676 from tdas/SPARK-20377.
  21. Apr 12, 2017
    • [MINOR][DOCS] JSON APIs related documentation fixes · bca4259f
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes corrections related to JSON APIs as below:
      
      - Rendering links in Python documentation
      - Replacing `RDD` with `Dataset` in the programming guide
      - Adding missing description about JSON Lines consistently in `DataFrameReader.json` in Python API
      - De-duplicating a little of `DataFrameReader.json` in Scala/Java API
      
      ## How was this patch tested?
      
      Manually built the documentation via `jekyll build`. Corresponding snapshots will be left on the code.
      
      Note that currently there are Javadoc8 breaks in several places. These are proposed to be handled in https://github.com/apache/spark/pull/17477. So, this PR does not fix those.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17602 from HyukjinKwon/minor-json-documentation.
  22. Apr 10, 2017
    • [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String toLowerCase "Turkish locale bug" causes Spark problems · a26e3ed5
      Sean Owen authored
      
      ## What changes were proposed in this pull request?
      
      Add Locale.ROOT to internal calls to String `toLowerCase`, `toUpperCase`, to avoid inadvertent locale-sensitive variation in behavior (aka the "Turkish locale problem").
      
      The change looks large but it is just adding `Locale.ROOT` (the locale with no country or language specified) to every call to these methods.
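      
      The locale sensitivity can be reproduced with a short Java sketch. `LocaleLowerCaseSketch` and its two helpers are hypothetical names used only for illustration; the Turkish locale is passed explicitly here so the example does not depend on the JVM's default locale.
      
      ```java
      import java.util.Locale;

      class LocaleLowerCaseSketch {
          // Under Turkish rules, uppercase 'I' lowercases to dotless 'ı' (U+0131).
          static String lowerTurkish(String s) {
              return s.toLowerCase(Locale.forLanguageTag("tr"));
          }

          // Locale.ROOT gives locale-neutral results regardless of the JVM default.
          static String lowerRoot(String s) {
              return s.toLowerCase(Locale.ROOT);
          }

          public static void main(String[] args) {
              // A keyword comparison such as "title".equals(...) fails in the first case.
              System.out.println("title".equals(lowerTurkish("TITLE"))); // false
              System.out.println("title".equals(lowerRoot("TITLE")));    // true
          }
      }
      ```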
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17527 from srowen/SPARK-20156.
  23. Apr 09, 2017
  24. Apr 07, 2017
    • [SPARK-20258][DOC][SPARKR] Fix SparkR logistic regression example in programming guide (did not converge) · 1ad73f0a
      actuaryzhang authored
      
      ## What changes were proposed in this pull request?
      
      SparkR logistic regression example did not converge in programming guide (for IRWLS). All estimates are essentially zero:
      
      ```
      training2 <- read.df("data/mllib/sample_binary_classification_data.txt", source = "libsvm")
      df_list2 <- randomSplit(training2, c(7,3), 2)
      binomialDF <- df_list2[[1]]
      binomialTestDF <- df_list2[[2]]
      binomialGLM <- spark.glm(binomialDF, label ~ features, family = "binomial")
      
      17/04/07 11:42:03 WARN WeightedLeastSquares: Cholesky solver failed due to singular covariance matrix. Retrying with Quasi-Newton solver.
      
      > summary(binomialGLM)
      
      Coefficients:
                       Estimate
      (Intercept)    9.0255e+00
      features_0     0.0000e+00
      features_1     0.0000e+00
      features_2     0.0000e+00
      features_3     0.0000e+00
      features_4     0.0000e+00
      features_5     0.0000e+00
      features_6     0.0000e+00
      features_7     0.0000e+00
      ```
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #17571 from actuaryzhang/programGuide2.
    • [SPARK-20026][DOC][SPARKR] Add Tweedie example for SparkR in programming guide · 870b9d9a
      actuaryzhang authored
      ## What changes were proposed in this pull request?
      Add Tweedie example for SparkR in programming guide.
      The doc was already updated in #17103.
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #17553 from actuaryzhang/programGuide.
  25. Apr 06, 2017
  26. Apr 05, 2017
  27. Apr 03, 2017
    • [SPARK-19969][ML] Imputer doc and example · 4d28e843
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      Add docs and examples for spark.ml.feature.Imputer. Currently, Scala and Java examples are included. The Python example will be added after https://github.com/apache/spark/pull/17316
      
      ## How was this patch tested?
      
      local doc generation and example execution
      
      Author: Yuhao Yang <yuhao.yang@intel.com>
      
      Closes #17324 from hhbyyh/imputerdoc.
  28. Mar 30, 2017
    • [DOCS] Docs-only improvements · 0197262a
      Jacek Laskowski authored
      …adoc
      
      ## What changes were proposed in this pull request?
      
      Use recommended values for row boundaries in Window's scaladoc, i.e. `Window.unboundedPreceding`, `Window.unboundedFollowing`, and `Window.currentRow` (that were introduced in 2.1.0).
      
      ## How was this patch tested?
      
      Local build
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #17417 from jaceklaskowski/window-expression-scaladoc.
  29. Mar 29, 2017
    • [MINOR][SPARKR] Add run command comment in examples · 471de5db
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      Two examples in the `r` folder are missing the run commands.
      
      This PR adds the missing comments, consistent with the other examples.
      
      ## How was this patch tested?
      
      Manual test.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #17474 from wangmiao1981/stat.
  30. Mar 23, 2017
    • [SPARK-10849][SQL] Adds option to the JDBC data source write for user to specify database column type for the create table · c7911807
      sureshthalamati authored
      
      ## What changes were proposed in this pull request?
      Currently, the JDBC data source creates tables in the target database using the default type mapping and the JDBC dialect mechanism. If users want to specify a different database data type for only some of the columns, there is no option available. In scenarios where the default mapping does not work, users are forced to create tables on the target database before writing. This workaround is probably not acceptable from a usability point of view. This PR provides a user-defined type mapping for specific columns.
      
      The solution is to allow users to specify the database column data types for the created table as a JDBC data source option (`createTableColumnTypes`) on write. Data type information can be specified in the same format as the table schema DDL format (e.g. `name CHAR(64), comments VARCHAR(1024)`).
      
      Not every supported target database type can be specified: the data types must also be valid Spark SQL data types. For example, a user cannot specify the target database's CLOB data type. This will be supported in a follow-up PR.
      
      Example:
      ```scala
      df.write
        .option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
        .jdbc(url, "TEST.DBCOLTYPETEST", properties)
      ```
      ## How was this patch tested?
      Added new test cases to the JDBCWriteSuite
      
      Author: sureshthalamati <suresh.thalamati@gmail.com>
      
      Closes #16209 from sureshthalamati/jdbc_custom_dbtype_option_json-spark-10849.
  31. Mar 02, 2017
    • [SPARK-19345][ML][DOC] Add doc for "coldStartStrategy" usage in ALS · 9cca3dbf
      Nick Pentreath authored
      [SPARK-14489](https://issues.apache.org/jira/browse/SPARK-14489) added the ability to skip `NaN` predictions during `ALSModel.transform`. This PR adds documentation for the `coldStartStrategy` param to the ALS user guide, and adds code to the examples to illustrate usage.
      
      ## How was this patch tested?
      
      Doc and example change only. Built the HTML doc locally and verified that the example code builds and runs in the shell for Scala/Python.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #17102 from MLnick/SPARK-19345-coldstart-doc.
    • [MINOR][ML] Fix comments in LSH Examples and Python API · 3bd8ddf7
      Yun Ni authored
      ## What changes were proposed in this pull request?
      Remove `org.apache.spark.examples.` in
      Add a slash in one of the Python docs.
      
      ## How was this patch tested?
      Run examples using the commands in the comments.
      
      Author: Yun Ni <yunn@uber.com>
      
      Closes #17104 from Yunni/yunn_minor.
  32. Mar 01, 2017
    • [SPARK-19460][SPARKR] Update dataset used in R documentation, examples to reduce warning noise and confusions · 89cd3845
      wm624@hotmail.com authored
      
      ## What changes were proposed in this pull request?
      
      Replace the `iris` dataset with `Titanic` or other datasets in the examples and documentation.
      
      ## How was this patch tested?
      
      Manual and existing test
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #17032 from wangmiao1981/example.
  33. Feb 27, 2017