  1. Jun 04, 2015
    • [SPARK-7969] [SQL] Added a DataFrame.drop function that accepts a Column reference. · df7da07a
      Mike Dusenberry authored
      Added a `DataFrame.drop` function that accepts a `Column` reference rather than a `String`, along with associated unit tests. The implementation iterates through the `DataFrame`'s columns to find one whose expression is equivalent to that of the `Column` argument supplied to the function.
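      A brief usage sketch of the new overload, assuming a 1.4.1-era spark-shell where `sc` and `sqlContext` are predefined (the sample data and column names are made up):

      ```scala
      import sqlContext.implicits._

      // Two DataFrames that both carry an "id" column.
      val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "leftValue")
      val right = sc.parallelize(Seq((1, "x"), (2, "y"))).toDF("id", "rightValue")

      // After the join, the name "id" is ambiguous; a Column reference pins down which one to drop.
      val joined = left.join(right, left("id") === right("id"))
      val deduped = joined.drop(right("id"))
      deduped.show()
      ```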
      
      Author: Mike Dusenberry <dusenberrymw@gmail.com>
      
      Closes #6585 from dusenberrymw/SPARK-7969_Drop_method_on_Dataframes_should_handle_Column and squashes the following commits:
      
      514727a [Mike Dusenberry] Updating the @since tag of the drop(Column) function doc to reflect version 1.4.1 instead of 1.4.0.
      2f1bb4e [Mike Dusenberry] Adding an additional assert statement to the 'drop column after join' unit test in order to make sure the correct column was indeed left over.
      6bf7c0e [Mike Dusenberry] Minor code formatting change.
      e583888 [Mike Dusenberry] Adding more Python doctests for the df.drop with column reference function to test joined datasets that have columns with the same name.
      5f74401 [Mike Dusenberry] Updating DataFrame.drop with column reference function to use logicalPlan.output to prevent ambiguities resulting from columns with the same name. Also added associated unit tests for joined datasets with duplicate column names.
      4b8bbe8 [Mike Dusenberry] Adding Python support for Dataframe.drop with a Column reference.
      986129c [Mike Dusenberry] Added a DataFrame.drop function that accepts a Column reference rather than a String, and added associated unit tests.  Basically iterates through the DataFrame to find a column with an expression that is equivalent to one supplied to the function.
      df7da07a
    • [SPARK-7956] [SQL] Use Janino to compile SQL expressions into bytecode · c8709dcf
      Davies Liu authored
      In order to reduce the overhead of codegen, this PR switches to using Janino to compile SQL expressions into bytecode.
      
      After this change, the time needed to compile a SQL expression drops from about 100ms to 5ms, which is necessary for turning on codegen for general workloads as well as for tests.
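      For readers unfamiliar with Janino, here is a standalone sketch of its expression-compilation API (this only illustrates the library, not Spark's actual codegen path; the expression and parameters are made up):

      ```scala
      import org.codehaus.janino.ExpressionEvaluator

      // Compile the Java expression "a + b" to bytecode at runtime, then invoke it.
      val evaluator = new ExpressionEvaluator()
      evaluator.setParameters(Array("a", "b"), Array[Class[_]](classOf[Int], classOf[Int]))
      evaluator.setExpressionType(classOf[Int])
      evaluator.cook("a + b")
      val result = evaluator.evaluate(Array[AnyRef](Integer.valueOf(19), Integer.valueOf(23)))
      println(result)  // 42
      ```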
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6479 from davies/janino and squashes the following commits:
      
      cc689f5 [Davies Liu] remove globalLock
      262d848 [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
      eec3a33 [Davies Liu] address comments from Josh
      f37c8c3 [Davies Liu] fix DecimalType and cast to String
      202298b [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
      a21e968 [Davies Liu] fix style
      0ed3dc6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
      551a851 [Davies Liu] fix tests
      c3bdffa [Davies Liu] remove print
      6089ce5 [Davies Liu] change logging level
      7e46ac3 [Davies Liu] fix style
      d8f0f6c [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
      da4926a [Davies Liu] fix tests
      03660f3 [Davies Liu] WIP: use Janino to compile Java source
      f2629cd [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
      f7d66cf [Davies Liu] use template based string for codegen
      c8709dcf
    • Fix maxTaskFailures comment · 10ba1880
      Daniel Darabos authored
      If maxTaskFailures is 1, the task set is aborted after 1 task failure. Other documentation and the code support this reading; I think it's just this comment that was off. It's easy to make this mistake; can you please double-check that I'm correct? Thanks!
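      For context, the setting under discussion is `spark.task.maxFailures`; a small configuration sketch of the semantics (the values are only examples):

      ```scala
      import org.apache.spark.SparkConf

      // With spark.task.maxFailures = 1, the first failure of any task aborts the task set;
      // with the default of 4, a task may fail three times and still succeed on its fourth attempt.
      val conf = new SparkConf()
        .setAppName("max-failures-demo")
        .set("spark.task.maxFailures", "4")
      ```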
      
      Author: Daniel Darabos <darabos.daniel@gmail.com>
      
      Closes #6621 from darabos/patch-2 and squashes the following commits:
      
      dfebdec [Daniel Darabos] Fix comment.
      10ba1880
    • MAINTENANCE: Automated closing of pull requests. · 9982d453
      Patrick Wendell authored
      This commit exists to close the following pull requests on Github:
      
      Closes #5976 (close requested by 'JoshRosen')
      Closes #4576 (close requested by 'pwendell')
      Closes #3430 (close requested by 'pwendell')
      Closes #2495 (close requested by 'pwendell')
      9982d453
  2. Jun 03, 2015
    • [BUILD] Fix Maven build for Kinesis · 984ad601
      Andrew Or authored
      A necessary dependency that is transitively referenced is not
      provided, causing compilation failures in builds that enable
      the kinesis-asl profile.
      984ad601
    • [BUILD] Use right branch when checking against Hive · 9cf740f3
      Andrew Or authored
      Right now we always run hive tests in branch-1.4 PRs because we compare whether the diff against master involves hive changes. Really we should be comparing against the target branch itself.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6629 from andrewor14/build-check-hive and squashes the following commits:
      
      450fbbd [Andrew Or] [BUILD] Use right branch when checking against Hive
      9cf740f3
    • [BUILD] Increase Jenkins test timeout · e35cd36e
      Andrew Or authored
      Currently the hive tests alone take 40m. The right thing to do is
      to reduce the test time. However, that is a bigger project, and
      we currently have PRs blocked by tests timing out.
      e35cd36e
    • [SPARK-8084] [SPARKR] Make SparkR scripts fail on error · 0576c3c4
      Shivaram Venkataraman authored
      cc shaneknapp pwendell JoshRosen
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #6623 from shivaram/SPARK-8084 and squashes the following commits:
      
      0ec5b26 [Shivaram Venkataraman] Make SparkR scripts fail on error
      0576c3c4
    • [SPARK-8088] don't attempt to lower number of executors by 0 · 51898b51
      Ryan Williams authored
      Author: Ryan Williams <ryan.blake.williams@gmail.com>
      
      Closes #6624 from ryan-williams/execs and squashes the following commits:
      
      b6f71d4 [Ryan Williams] don't attempt to lower number of executors by 0
      51898b51
    • [HOTFIX] History Server API docs error fix. · 566cb594
      Hari Shreedharan authored
      Minor error in the monitoring docs. Also made indentation changes in `ApiRootResource`.
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #6628 from harishreedharan/eventlog-formatting and squashes the following commits:
      
      a12553d [Hari Shreedharan] Javadoc updates.
      ca399b6 [Hari Shreedharan] [HOTFIX] History Server API docs error fix.
      566cb594
    • [HOTFIX] [TYPO] Fix typo in #6546 · bfbdab12
      Andrew Or authored
      bfbdab12
    • [SPARK-6164] [ML] CrossValidatorModel should keep stats from fitting · d8662cd9
      leahmcguire authored
      Added stats from cross validation as a val in the cross validation model to save them for user access.
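      A hedged usage sketch of reading the saved metrics (assuming a DataFrame `training` with "label" and "features" columns, and assuming the metrics end up on the fitted model as `avgMetrics`):

      ```scala
      import org.apache.spark.ml.classification.LogisticRegression
      import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
      import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

      val lr = new LogisticRegression()
      val paramGrid = new ParamGridBuilder()
        .addGrid(lr.regParam, Array(0.01, 0.1))
        .build()

      val cv = new CrossValidator()
        .setEstimator(lr)
        .setEvaluator(new BinaryClassificationEvaluator())
        .setEstimatorParamMaps(paramGrid)
        .setNumFolds(3)

      val cvModel = cv.fit(training)
      // One averaged metric per entry in the parameter grid.
      cvModel.avgMetrics.zip(paramGrid).foreach { case (metric, params) =>
        println(s"$params -> $metric")
      }
      ```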
      
      Author: leahmcguire <lmcguire@salesforce.com>
      
      Closes #5915 from leahmcguire/saveCVmetrics and squashes the following commits:
      
      49b507b [leahmcguire] fixed style error
      67537b1 [leahmcguire] rebased
      85907f0 [leahmcguire] fixed name
      59987cc [leahmcguire] changed param name and test according to comments
      36e71e3 [leahmcguire] rebasing
      4b8223e [leahmcguire] fixed name
      4ddffc6 [leahmcguire] changed param name and test according to comments
      3a995da [leahmcguire] Added stats from cross validation as a val in the cross validation model to save them for user access
      d8662cd9
    • [SPARK-8051] [MLLIB] make StringIndexerModel silent if input column does not exist · 26c9d7a0
      Xiangrui Meng authored
      This is just a workaround for a bigger problem. Some pipeline stages may not be effective during prediction, and they should not complain about missing required columns, e.g. `StringIndexerModel`. jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6595 from mengxr/SPARK-8051 and squashes the following commits:
      
      b6a36b9 [Xiangrui Meng] add doc
      f143fd4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-8051
      8ee7c7e [Xiangrui Meng] use SparkFunSuite
      e112394 [Xiangrui Meng] make StringIndexerModel silent if input column does not exist
      26c9d7a0
    • [SPARK-3674] [EC2] Clear SPARK_WORKER_INSTANCES when using YARN · d3e026f8
      Shivaram Venkataraman authored
      cc andrewor14
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #6424 from shivaram/spark-worker-instances-yarn-ec2 and squashes the following commits:
      
      db244ae [Shivaram Venkataraman] Make Python Lint happy
      0593d1b [Shivaram Venkataraman] Clear SPARK_WORKER_INSTANCES when using YARN
      d3e026f8
    • [HOTFIX] Fix Hadoop-1 build caused by #5792. · a8f1f154
      Hari Shreedharan authored
      Replaced `fs.listFiles` with Hadoop-1 friendly `fs.listStatus` method.
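      For context, a minimal sketch of the Hadoop-1-compatible call (the path is a made-up example, not the patch itself):

      ```scala
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.{FileSystem, Path}

      val fs = FileSystem.get(new Configuration())
      // listStatus exists in Hadoop 1.x; the recursive listFiles iterator only arrived in Hadoop 2.
      val statuses = fs.listStatus(new Path("/user/spark/eventLogs"))
      statuses.foreach(status => println(status.getPath))
      ```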
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #6619 from harishreedharan/evetlog-hadoop-1-fix and squashes the following commits:
      
      6192078 [Hari Shreedharan] [HOTFIX] Fix Hadoop-1 build caused by #5792.
      a8f1f154
    • [SPARK-7989] [CORE] [TESTS] Fix flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite · f2713478
      zsxwing authored
      
      The flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite will fail if there are not enough executors up before running the jobs.
      
      This PR adds `JobProgressListener.waitUntilExecutorsUp`. The tests for the cluster mode can use it to wait until the expected executors are up.
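      A hedged sketch of what such a wait can look like using only public `SparkContext` information (the helper below is illustrative, not the PR's actual implementation):

      ```scala
      import java.util.concurrent.TimeoutException
      import org.apache.spark.SparkContext

      // Block until `numExecutors` executors have registered, or fail with a TimeoutException.
      def waitUntilExecutorsUp(sc: SparkContext, numExecutors: Int, timeoutMillis: Long): Unit = {
        val deadline = System.currentTimeMillis() + timeoutMillis
        // getExecutorMemoryStatus also counts the driver, hence the + 1.
        while (sc.getExecutorMemoryStatus.size < numExecutors + 1) {
          if (System.currentTimeMillis() > deadline) {
            throw new TimeoutException(
              s"Only ${sc.getExecutorMemoryStatus.size - 1} executors up after $timeoutMillis ms")
          }
          Thread.sleep(100)
        }
      }
      ```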
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6546 from zsxwing/SPARK-7989 and squashes the following commits:
      
      5560e09 [zsxwing] Fix a typo
      3b69840 [zsxwing] Fix flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite
      f2713478
    • [SPARK-8001] [CORE] Make AsynchronousListenerBus.waitUntilEmpty throw TimeoutException if timeout · 1d8669f1
      zsxwing authored
      Some places forget to call `assert` to check the return value of `AsynchronousListenerBus.waitUntilEmpty`. Instead of adding `assert` in these places, I think it's better to make `AsynchronousListenerBus.waitUntilEmpty` throw `TimeoutException`.
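      A minimal sketch of the change in contract (a toy bus, not the actual AsynchronousListenerBus code):

      ```scala
      import java.util.concurrent.{ConcurrentLinkedQueue, TimeoutException}

      class ToyListenerBus {
        private val eventQueue = new ConcurrentLinkedQueue[String]()

        def post(event: String): Unit = eventQueue.add(event)

        // Instead of returning false on timeout (which a forgotten assert can silently ignore),
        // throw a TimeoutException so the failure cannot be missed.
        @throws(classOf[TimeoutException])
        def waitUntilEmpty(timeoutMillis: Long): Unit = {
          val deadline = System.currentTimeMillis() + timeoutMillis
          while (!eventQueue.isEmpty) {
            if (System.currentTimeMillis() > deadline) {
              throw new TimeoutException(s"The event queue is not empty after $timeoutMillis ms")
            }
            Thread.sleep(10)
          }
        }
      }
      ```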
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6550 from zsxwing/SPARK-8001 and squashes the following commits:
      
      607674a [zsxwing] Make AsynchronousListenerBus.waitUntilEmpty throw TimeoutException if timeout
      1d8669f1
    • [SPARK-8059] [YARN] Wake up allocation thread when new requests arrive. · aa40c442
      Marcelo Vanzin authored
      This should help reduce latency for new executor allocations.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #6600 from vanzin/SPARK-8059 and squashes the following commits:
      
      8387a3a [Marcelo Vanzin] [SPARK-8059] [yarn] Wake up allocation thread when new requests arrive.
      aa40c442
    • [SPARK-8083] [MESOS] Use the correct base path in mesos driver page. · bfbf12b3
      Timothy Chen authored
      Author: Timothy Chen <tnachen@gmail.com>
      
      Closes #6615 from tnachen/mesos_driver_path and squashes the following commits:
      
      4f47b7c [Timothy Chen] Use the correct base path in mesos driver page.
      bfbf12b3
    • [MINOR] [UI] Improve confusing message on log page · c6a6dd0d
      Andrew Or authored
      It's good practice to check whether the input path is in the directory
      we expect, in order to avoid potentially confusing error messages.
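      A hedged sketch of that kind of containment check (illustrative only, not the UI code):

      ```scala
      import java.io.File

      // True only if `child` resolves to a location strictly inside `parent`,
      // which guards against "../" tricks in user-supplied paths.
      def isInDirectory(parent: File, child: File): Boolean = {
        val parentPath = parent.getCanonicalPath + File.separator
        child.getCanonicalPath.startsWith(parentPath)
      }
      ```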
      c6a6dd0d
    • [SPARK-8054] [MLLIB] Added several Java-friendly APIs + unit tests · 20a26b59
      Joseph K. Bradley authored
      Java-friendly APIs added:
      * GaussianMixture.run()
      * GaussianMixtureModel.predict()
      * DistributedLDAModel.javaTopicDistributions()
      * StreamingKMeans: trainOn, predictOn, predictOnValues
      * Statistics.corr
      * params
        * added doc to w() since Java docs do not inherit doc
        * removed non-Java-friendly w() from StringArrayParam and DoubleArrayParam
        * made DoubleArrayParam Java-friendly w() actually Java-friendly
      
      I generated the doc and verified all changes.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6562 from jkbradley/java-api-1.4 and squashes the following commits:
      
      c16821b [Joseph K. Bradley] Small fixes based on code review.
      d955581 [Joseph K. Bradley] unit test fixes
      29b6b0d [Joseph K. Bradley] small fixes
      fe6dcfe [Joseph K. Bradley] Added several Java-friendly APIs + unit tests: NaiveBayes, GaussianMixture, LDA, StreamingKMeans, Statistics.corr, params
      20a26b59
    • [SPARK-8074] Parquet should throw AnalysisException during setup for data type/name related failures. · 939e4f3d
      Reynold Xin authored
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6608 from rxin/parquet-analysis and squashes the following commits:
      
      b5dc8e2 [Reynold Xin] Code review feedback.
      5617cf6 [Reynold Xin] [SPARK-8074] Parquet should throw AnalysisException during setup for data type/name related failures.
      939e4f3d
    • [SPARK-8063] [SPARKR] Spark master URL conflict between MASTER env variable and --master command line option. · 708c63bb
      Sun Rui authored
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #6605 from sun-rui/SPARK-8063 and squashes the following commits:
      
      51ca48b [Sun Rui] [SPARK-8063][SPARKR] Spark master URL conflict between MASTER env variable and --master command line option.
      708c63bb
    • [SPARK-7161] [HISTORY SERVER] Provide REST api to download event logs from History Server · d2a86eb8
      Hari Shreedharan authored
      
      This PR adds a new API that allows the user to download event logs for an application as a zip file. APIs have been added to download all logs for a given application or just for a specific attempt.
      
      This also adds a method to the ApplicationHistoryProvider to get the raw files, zipped.
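      For illustration, downloading the zipped logs over HTTP might look like the sketch below (assuming the documented `/api/v1/applications/<app-id>/logs` endpoint; the host, default history-server port 18080, and application id are placeholders):

      ```scala
      import java.io.{BufferedInputStream, BufferedOutputStream, FileOutputStream}
      import java.net.URL

      // Stream the zip returned by the history server's REST endpoint to a local file.
      val url = new URL("http://localhost:18080/api/v1/applications/app-20150604123456-0000/logs")
      val in = new BufferedInputStream(url.openStream())
      val out = new BufferedOutputStream(new FileOutputStream("eventLogs.zip"))
      try {
        Iterator.continually(in.read()).takeWhile(_ != -1).foreach(out.write)
      } finally {
        in.close()
        out.close()
      }
      ```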
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #5792 from harishreedharan/eventlog-download and squashes the following commits:
      
      221cc26 [Hari Shreedharan] Update docs with new API information.
      a131be6 [Hari Shreedharan] Fix style issues.
      5528bd8 [Hari Shreedharan] Merge branch 'master' into eventlog-download
      6e8156e [Hari Shreedharan] Simplify tests, use Guava stream copy methods.
      d8ddede [Hari Shreedharan] Remove unnecessary case in EventLogDownloadResource.
      ffffb53 [Hari Shreedharan] Changed interface to use zip stream. Added more tests.
      1100b40 [Hari Shreedharan] Ensure that `Path` does not appear in interfaces, by refactoring interfaces.
      5a5f3e2 [Hari Shreedharan] Fix test ordering issue.
      0b66948 [Hari Shreedharan] Minor formatting/import fixes.
      4fc518c [Hari Shreedharan] Fix rat failures.
      a48b91f [Hari Shreedharan] Refactor to make attemptId optional in the API. Also added tests.
      0fc1424 [Hari Shreedharan] File download now works for individual attempts and the entire application.
      350d7e8 [Hari Shreedharan] Merge remote-tracking branch 'asf/master' into eventlog-download
      fd6ab00 [Hari Shreedharan] Fix style issues
      32b7662 [Hari Shreedharan] Use UIRoot directly in ApiRootResource. Also, use `Response` class to set headers.
      7b362b2 [Hari Shreedharan] Almost working.
      3d18ebc [Hari Shreedharan] [WIP] Try getting the event log download to work.
      d2a86eb8
    • [SPARK-7980] [SQL] Support SQLContext.range(end) · d053a31b
      animesh authored
      1. range() overloaded in SQLContext.scala
      2. range() modified in python sql context.py
      3. Tests added accordingly in DataFrameSuite.scala and python sql tests.py
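      A usage sketch of the new overload in a spark-shell session (assuming `sqlContext` is predefined):

      ```scala
      // range(end) is shorthand for range(0, end): a single LongType column named "id".
      val df = sqlContext.range(5)
      df.show()    // rows 0, 1, 2, 3, 4
      df.count()   // 5
      ```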
      
      Author: animesh <animesh@apache.spark>
      
      Closes #6609 from animeshbaranawal/SPARK-7980 and squashes the following commits:
      
      935899c [animesh] SPARK-7980:python+scala changes
      d053a31b
    • [SPARK-7801] [BUILD] Updating versions to SPARK 1.5.0 · 2c4d550e
      Patrick Wendell authored
      Author: Patrick Wendell <patrick@databricks.com>
      
      Closes #6328 from pwendell/spark-1.5-update and squashes the following commits:
      
      2f42d02 [Patrick Wendell] A few more excludes
      4bebcf0 [Patrick Wendell] Update to RC4
      61aaf46 [Patrick Wendell] Using new release candidate
      55f1610 [Patrick Wendell] Another exclude
      04b4f04 [Patrick Wendell] More issues with transient 1.4 changes
      36f549b [Patrick Wendell] [SPARK-7801] [BUILD] Updating versions to SPARK 1.5.0
      2c4d550e
    • [SPARK-7973] [SQL] Increase the timeout of two CliSuite tests. · f1646e10
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-7973
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #6525 from yhuai/SPARK-7973 and squashes the following commits:
      
      763b821 [Yin Huai] Also change the timeout of "Single command with -e" to 2 minutes.
      e598a08 [Yin Huai] Increase the timeout to 3 minutes.
      f1646e10
    • [SPARK-7983] [MLLIB] Add require for one-based indices in loadLibSVMFile · 28dbde38
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-7983
      
      Customers frequently use zero-based indices in their LIBSVM files. No warnings or errors will be reported by Spark during the subsequent computation, and this usually leads to weird results for many algorithms (like GBDT).
      
      This PR adds a quick check.
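      A hedged sketch of the kind of `require` this adds (simplified, not the actual MLUtils code): LIBSVM feature indices must be one-based and strictly ascending.

      ```scala
      // Validate that parsed LIBSVM feature indices are one-based and strictly ascending.
      def checkIndices(indices: Array[Int]): Unit = {
        var previous = 0
        indices.foreach { current =>
          require(current > previous,
            s"indices should be one-based and in ascending order; found $current after $previous")
          previous = current
        }
      }

      checkIndices(Array(1, 3, 7))    // ok
      // checkIndices(Array(0, 2, 5)) // would throw: a zero-based index is detected
      ```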
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #6538 from hhbyyh/loadSVM and squashes the following commits:
      
      79d9c11 [Yuhao Yang] optimization as respond to comments
      4310710 [Yuhao Yang] merge conflict
      96460f1 [Yuhao Yang] merge conflict
      20a2811 [Yuhao Yang] use require
      6e4f8ca [Yuhao Yang] add check for ascending order
      9956365 [Yuhao Yang] add ut for 0-based loadlibsvm exception
      5bd1f9a [Yuhao Yang] add require for one-based in loadLIBSVM
      28dbde38
    • [SPARK-7562][SPARK-6444][SQL] Improve error reporting for expression data type mismatch · d38cf217
      Wenchen Fan authored
      It seems hard to find a common pattern for checking types in `Expression`. Sometimes we know exactly what input types we need (like `And`, which needs two booleans); sometimes we just have some rules (like `Add`, which needs two numeric types that are equal). So I defined a general interface `checkInputDataTypes` in `Expression` which returns a `TypeCheckResult`. `TypeCheckResult` can tell whether the expression passes the type checking or what the type mismatch is.
      
      This PR mainly applies input type checking to arithmetic and predicate expressions.
      
      TODO: apply type checking interface to more expressions.
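      A hedged sketch of the interface shape described above (names and rules are simplified, not the actual Catalyst classes):

      ```scala
      sealed trait TypeCheckResult
      case object TypeCheckSuccess extends TypeCheckResult
      case class TypeCheckFailure(message: String) extends TypeCheckResult

      // Each expression knows its own rule and reports either success or a descriptive failure.
      trait Expression {
        def inputTypes: Seq[String]
        def checkInputDataTypes(): TypeCheckResult
      }

      case class And(inputTypes: Seq[String]) extends Expression {
        override def checkInputDataTypes(): TypeCheckResult =
          if (inputTypes == Seq("boolean", "boolean")) TypeCheckSuccess
          else TypeCheckFailure(s"And expects (boolean, boolean), got (${inputTypes.mkString(", ")})")
      }

      case class Add(inputTypes: Seq[String]) extends Expression {
        override def checkInputDataTypes(): TypeCheckResult =
          if (inputTypes.size == 2 && inputTypes.distinct.size == 1 &&
              Set("int", "long", "double").contains(inputTypes.head)) TypeCheckSuccess
          else TypeCheckFailure(s"Add expects two equal numeric types, got (${inputTypes.mkString(", ")})")
      }
      ```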
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #6405 from cloud-fan/6444 and squashes the following commits:
      
      b5ff31b [Wenchen Fan] address comments
      b917275 [Wenchen Fan] rebase
      39929d9 [Wenchen Fan] add todo
      0808fd2 [Wenchen Fan] make constructor of TypeCheckResult private
      3bee157 [Wenchen Fan] and decimal type coercion rule for binary comparison
      8883025 [Wenchen Fan] apply type check interface to CaseWhen
      cffb67c [Wenchen Fan] to have resolved call the data type check function
      6eaadff [Wenchen Fan] add equal type constraint to EqualTo
      3affbd8 [Wenchen Fan] more fixes
      654d46a [Wenchen Fan] improve tests
      e0a3628 [Wenchen Fan] improve error message
      1524ff6 [Wenchen Fan] fix style
      69ca3fe [Wenchen Fan] add error message and tests
      c71d02c [Wenchen Fan] fix hive tests
      6491721 [Wenchen Fan] use value class TypeCheckResult
      7ae76b9 [Wenchen Fan] address comments
      cb77e4f [Wenchen Fan] Improve error reporting for expression data type mismatch
      d38cf217
    • [SPARK-8060] Improve DataFrame Python test coverage and documentation. · ce320cb2
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6601 from rxin/python-read-write-test-and-doc and squashes the following commits:
      
      baa8ad5 [Reynold Xin] Code review feedback.
      f081d47 [Reynold Xin] More documentation updates.
      c9902fa [Reynold Xin] [SPARK-8060] Improve DataFrame Python reader/writer interface doc and testing.
      ce320cb2
    • [SPARK-8032] [PYSPARK] Make version checking for NumPy in MLlib more robust · 452eb82d
      MechCoder authored
      The current check compares version strings, so `1.x` is treated as less than `1.4`. This fails whenever x has more than one digit: even though x > 4 numerically, the string `1.x` still sorts before `1.4`.
      
      It fails on my system since I have version `1.10` :P
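      The pitfall, illustrated here in Scala for consistency with the other examples (the actual fix lives in PySpark's NumPy check):

      ```scala
      // Lexicographic comparison gets multi-digit components wrong:
      assert("1.10" < "1.4")   // string order claims 1.10 is "older" than 1.4

      // Comparing numeric components gives the intended ordering.
      def versionParts(version: String): Seq[Int] = version.split("\\.").map(_.toInt).toSeq

      import scala.math.Ordering.Implicits._
      assert(versionParts("1.10") > versionParts("1.4"))
      ```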
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #6579 from MechCoder/np_ver and squashes the following commits:
      
      15430f8 [MechCoder] fix syntax error
      893fb7e [MechCoder] remove equal to
      e35f0d4 [MechCoder] minor
      e89376c [MechCoder] Better checking
      22703dd [MechCoder] [SPARK-8032] Make version checking for NumPy in MLlib more robust
      452eb82d
    • [SPARK-8043] [MLLIB] [DOC] update NaiveBayes and SVM examples in doc · 43adbd56
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-8043
      
      I found some issues while testing the save/load examples in the markdown documentation, as part of the 1.4 QA plan.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #6584 from hhbyyh/naiveDocExample and squashes the following commits:
      
      a01a206 [Yuhao Yang] fix for Gaussian mixture
      2fb8b96 [Yuhao Yang] update NaiveBayes and SVM examples in doc
      43adbd56
    • [MINOR] make the launcher project name consistent with others · ccaa8232
      WangTaoTheTonic authored
      I found this by chance while building Spark and think it is better to keep its name consistent with the other sub-projects (Spark Project *).
      
      I am not going to file a JIRA as it is a pretty small issue.
      
      Author: WangTaoTheTonic <wangtao111@huawei.com>
      
      Closes #6603 from WangTaoTheTonic/projName and squashes the following commits:
      
      994b3ba [WangTaoTheTonic] make the project name consistent
      ccaa8232
    • [SPARK-8053] [MLLIB] renamed scalingVector to scalingVec · 07c16cb5
      Joseph K. Bradley authored
      I searched the Spark codebase for all occurrences of "scalingVector"
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6596 from jkbradley/scalingVec-rename and squashes the following commits:
      
      d3812f8 [Joseph K. Bradley] renamed scalingVector to scalingVec
      07c16cb5
    • [SPARK-7691] [SQL] Refactor CatalystTypeConverter to use type-specific row accessors · cafd5056
      Josh Rosen authored
      This patch significantly refactors CatalystTypeConverters to both clean up the code and enable these conversions to work with future Project Tungsten features.
      
      At a high level, I've reorganized the code so that all functions dealing with the same type are grouped together into type-specific subclasses of `CatalystTypeConverter`. In addition, I've added new methods that allow the Catalyst Row -> Scala Row conversions to access the Catalyst row's fields through type-specific `getTYPE()` methods rather than the generic `get()` / `Row.apply` methods. This refactoring is a blocker to being able to unit test new operators that I'm developing as part of Project Tungsten, since those operators may output `UnsafeRow` instances which don't support the generic `get()`.
      
      The stricter use of types here has uncovered some bugs in other parts of Spark SQL:
      
      - #6217: DescribeCommand is assigned wrong output attributes in SparkStrategies
      - #6218: DataFrame.describe() should cast all aggregates to String
      - #6400: Use output schema, not relation schema, for data source input conversion
      
      Spark SQL currently has undefined behavior for what happens when you try to create a DataFrame from user-specified rows whose values don't match the declared schema.  According to the `createDataFrame()` Scaladoc:
      
      >  It is important to make sure that the structure of every [[Row]] of the provided RDD matches the provided schema. Otherwise, there will be runtime exception.
      
      Given this, it sounds like it's technically not a breach of our API contract to fail fast when the data types don't match. However, there appear to be many cases where we don't fail even though the types don't match. For example, `JavaHashingTFSuite.hasingTF` passes a column of integer values for a "label" column which is supposed to contain floats. This column isn't actually read or modified as part of query processing, so its concrete type doesn't seem to matter. In other cases, generic numeric aggregates may tolerate being called with different numeric types than the schema specified, which can be okay due to numeric conversions.
      
      In the long run, we will probably want to come up with precise semantics for implicit type conversions / widening when converting Java / Scala rows to Catalyst rows.  Until then, though, I think that failing fast with a ClassCastException is a reasonable behavior; this is the approach taken in this patch.  Note that certain optimizations in the inbound conversion functions for primitive types mean that we'll probably preserve the old undefined behavior in a majority of cases.
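      A hedged sketch of the two ideas above, converters grouped per type and rows read through type-specific accessors instead of a generic `get()` (names simplified, not the actual Catalyst classes):

      ```scala
      // A row that only guarantees typed accessors, like the UnsafeRow case described above.
      trait TypedRow {
        def getInt(ordinal: Int): Int
        def getString(ordinal: Int): String
      }

      // One converter per type, each reading through the matching typed accessor.
      abstract class TypeConverterSketch[ScalaType] {
        def toScala(row: TypedRow, ordinal: Int): ScalaType
      }

      object IntConverterSketch extends TypeConverterSketch[Int] {
        override def toScala(row: TypedRow, ordinal: Int): Int = row.getInt(ordinal)
      }

      object StringConverterSketch extends TypeConverterSketch[String] {
        override def toScala(row: TypedRow, ordinal: Int): String = row.getString(ordinal)
      }
      ```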
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6222 from JoshRosen/catalyst-converters-refactoring and squashes the following commits:
      
      740341b [Josh Rosen] Optimize method dispatch for primitive type conversions
      befc613 [Josh Rosen] Add tests to document Option-handling behavior.
      5989593 [Josh Rosen] Use new SparkFunSuite base in CatalystTypeConvertersSuite
      6edf7f8 [Josh Rosen] Re-add convertToScala(), since a Hive test still needs it
      3f7b2d8 [Josh Rosen] Initialize converters lazily so that the attributes are resolved first
      6ad0ebb [Josh Rosen] Fix JavaHashingTFSuite ClassCastException
      677ff27 [Josh Rosen] Fix null handling bug; add tests.
      8033d4c [Josh Rosen] Fix serialization error in UserDefinedGenerator.
      85bba9d [Josh Rosen] Fix wrong input data in InMemoryColumnarQuerySuite
      9c0e4e1 [Josh Rosen] Remove last use of convertToScala().
      ae3278d [Josh Rosen] Throw ClassCastException errors during inbound conversions.
      7ca7fcb [Josh Rosen] Comments and cleanup
      1e87a45 [Josh Rosen] WIP refactoring of CatalystTypeConverters
      cafd5056
  3. Jun 02, 2015
    • [SPARK-7547] [ML] Scala Example code for ElasticNet · a86b3e9b
      DB Tsai authored
      This is Scala example code for both linear and logistic regression. Python and Java versions are to be added.
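      A minimal usage sketch of the spark.ml API the example covers (assuming a DataFrame `training` with "label" and "features" columns; the parameter values are arbitrary):

      ```scala
      import org.apache.spark.ml.regression.LinearRegression

      // elasticNetParam interpolates between L2 (0.0, ridge) and L1 (1.0, lasso) regularization.
      val lr = new LinearRegression()
        .setMaxIter(100)
        .setRegParam(0.3)
        .setElasticNetParam(0.8)

      val model = lr.fit(training)
      println(s"Weights: ${model.weights}  Intercept: ${model.intercept}")
      ```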
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #6576 from dbtsai/elasticNetExample and squashes the following commits:
      
      e7ca406 [DB Tsai] fix test
      6bb6d77 [DB Tsai] fix suite and remove duplicated setMaxIter
      136e0dd [DB Tsai] address feedback
      1ec29d4 [DB Tsai] fix style
      9462f5f [DB Tsai] add example
      a86b3e9b
    • [SPARK-7387] [ML] [DOC] CrossValidator example code in Python · c3f4c325
      Ram Sriharsha authored
      Author: Ram Sriharsha <rsriharsha@hw11853.local>
      
      Closes #6358 from harsha2010/SPARK-7387 and squashes the following commits:
      
      63efda2 [Ram Sriharsha] more examples for classifier to distinguish mapreduce from spark properly
      aeb6bb6 [Ram Sriharsha] Python Style Fix
      54a500c [Ram Sriharsha] Merge branch 'master' into SPARK-7387
      615e91c [Ram Sriharsha] cleanup
      204c4e3 [Ram Sriharsha] Merge branch 'master' into SPARK-7387
      7246d35 [Ram Sriharsha] [SPARK-7387][ml][doc] CrossValidator example code in Python
      c3f4c325
    • [SQL] [TEST] [MINOR] Follow-up of PR #6493, use Guava API to ensure Java 6 friendliness · 5cd6a63d
      Cheng Lian authored
      This is a follow-up to PR #6493, which has been reverted in branch-1.4 because it uses Java 7-specific APIs and breaks the Java 6 build. This PR replaces those APIs with equivalent Guava ones to ensure Java 6 friendliness.
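      For illustration, the general flavor of substitution (a generic sketch with made-up paths, not the PR's actual diff): Guava's `Files` utilities cover cases that would otherwise pull in Java 7's `java.nio.file`.

      ```scala
      import java.io.File
      import com.google.common.base.Charsets
      import com.google.common.io.Files

      val source = new File("conf/log4j.properties")
      val backup = new File("/tmp/log4j.properties.backup")

      // Java-6-friendly Guava calls in place of java.nio.file.Files.readAllBytes / write (Java 7+).
      val contents = Files.toString(source, Charsets.UTF_8)
      Files.write(contents, backup, Charsets.UTF_8)
      ```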
      
      cc andrewor14 pwendell, this should also be backported to branch-1.4.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6547 from liancheng/override-log4j and squashes the following commits:
      
      c900cfd [Cheng Lian] Addresses Shixiong's comment
      72da795 [Cheng Lian] Uses Guava API to ensure Java 6 friendliness
      5cd6a63d
    • [SPARK-8049] [MLLIB] drop tmp col from OneVsRest output · 89f21f66
      Xiangrui Meng authored
      The temporary column should be dropped after we get the prediction column. harsha2010
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6592 from mengxr/SPARK-8049 and squashes the following commits:
      
      1d89107 [Xiangrui Meng] use SparkFunSuite
      6ee70de [Xiangrui Meng] drop tmp col from OneVsRest output
      89f21f66